microberx.OmicsIntegrator

This module provides functions to analyze the organisms and enzymes involved in the production of metabolites predicted by MicrobeRX.

The module contains the following functions:

  • plot_species_sunburst: This function creates a sunburst plot of the microbial species in the sources list.

  • fetch_batch_sequences: This function fetches a list of sequences from the NCBI Entrez database. It uses the Biopython library to access the Entrez API and parse the FASTA format. It also uses a helper function _fetch_sequence to fetch and return a single sequence.

  • get_interpro: This function retrieves the InterProScan results for a given sequence from the EBI InterProScan 5 web service.

  • plot_interpro_results: This function creates a bar plot of the InterProScan results for a given sequence.

  • run_multi_sequence_aligment: This function performs a multiple sequence alignment (MSA) and a phylogenetic tree construction for a given set of sequences using the ClustalW2 program.

  • plot_similarity_matrix: This function creates a heatmap of the pairwise similarity scores for a given set of sequences using the Dash Bio library. It also accepts optional parameters to choose the color map and the homology percentage for the heatmap.

  • plot_aligment_chart: This function creates a chart of the multiple sequence alignment (MSA) for a given set of sequences using the Dash Bio library. It also accepts optional parameters to choose the color map for the conservation scores and the color scale for the alignment symbols.

Functions

plot_species_sunburst(sources[, path])

The function plot_species_sunburst creates a sunburst plot of the microbial species in the sources list. It uses the global variables MICROBES_DATA and MICROBES_REACTIONS that are loaded by the function check_if_microbes_databases_are_loaded. It also uses the Plotly Express library to create the sunburst plot.

fetch_batch_sequences([entries, sequence_ids, email, ...])

The function fetch_batch_sequences fetches a list of sequences from the NCBI Entrez database. It uses the Biopython library to access the Entrez API and parse the FASTA format. It also uses a helper function _fetch_sequence to fetch and return a single sequence.

get_interpro([sequence_id, sequence, email, ...])

The function get_interpro retrieves the InterProScan results for a given sequence from the EBI InterProScan 5 web service. It uses the requests library to access the REST API and the pandas library to parse the tab-separated values (TSV) format. It also accepts optional parameters to include GO terms and pathway information in the output.

plot_interpro_results([interpro_results, compact])

The function plot_interpro_results creates a bar plot of the InterProScan results for a given sequence. It uses the Plotly Express library to create the bar plot. It also accepts an optional parameter to choose between a compact or a detailed view of the results.

run_multi_sequence_aligment([sequences_file, ...])

The function run_multi_sequence_aligment performs a multiple sequence alignment (MSA) and a phylogenetic tree construction for a given set of sequences using the ClustalW2 program. It uses the Biopython library to parse the input and output files and to run the ClustalW2 command line. It also returns the pairwise alignment scores as a similarity matrix.

plot_similarity_matrix([similarity_matrix, ...])

The function plot_similarity_matrix creates a heatmap of the pairwise similarity scores for a given set of sequences using the Dash Bio library. It also accepts optional parameters to choose the color map and the homology percentage for the heatmap.

plot_aligment_chart([msa_file, cmap, color_scale])

The function plot_aligment_chart creates a chart of the multiple sequence alignment (MSA) for a given set of sequences using the Dash Bio library. It also accepts optional parameters to choose the color map for the conservation scores and the color scale for the alignment symbols.

Module Contents

microberx.OmicsIntegrator.plot_species_sunburst(sources, path='short')[source]

The function plot_species_sunburst creates a sunburst plot of the microbial species in the sources list. It uses the global variables MICROBES_DATA and MICROBES_REACTIONS that are loaded by the function check_if_microbes_databases_are_loaded. It also uses the Plotly Express library to create the sunburst plot.

Parameters:
  • sources (A list of strings that represent the source id of reactions from AGORA2. For example, ['CYSS3r'].) –

  • path (A string that specifies the path of the sunburst plot. It can be either 'full' or 'short'. The default value is 'short'. If 'full', the path is ['Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']. If 'short', the path is ['Kingdom', 'Phylum', 'Order', 'Genus', 'Species'].) –

Returns:

  • F (A Plotly Figure object that contains the sunburst plot. It has one subplot for each source in the sources list. The subplots are arranged in a grid with three columns and variable rows. The sunburst plot shows the hierarchical distribution of the microbial species by their taxonomic ranks. The color of each segment is determined by the phylum of the species.)
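The two path options and the three-column grid can be sketched in plain Python. This is only an illustration of the description above, not MicrobeRX's implementation: the names TAXONOMY_PATHS, taxonomy_path, and grid_shape are hypothetical, and the row calculation is an assumption about how "three columns and variable rows" is derived.

```python
import math

# Rank lists copied from the description of the `path` parameter.
TAXONOMY_PATHS = {
    "full": ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"],
    "short": ["Kingdom", "Phylum", "Order", "Genus", "Species"],
}


def taxonomy_path(path="short"):
    """Return the rank list for a path option (hypothetical helper)."""
    if path not in TAXONOMY_PATHS:
        raise ValueError(f"path must be 'full' or 'short', got {path!r}")
    return TAXONOMY_PATHS[path]


def grid_shape(n_sources, n_cols=3):
    """Rows and columns for one subplot per source (assumed layout rule)."""
    n_rows = max(1, math.ceil(n_sources / n_cols))
    return n_rows, n_cols


print(taxonomy_path("short"))
print(grid_shape(len(["CYSS3r"])))
```

With four sources, this rule would give two rows of three columns, with the last two cells left empty.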

microberx.OmicsIntegrator.fetch_batch_sequences(entries=None, sequence_ids=None, email=None, database='protein')[source]

The function fetch_batch_sequences fetches a list of sequences from the NCBI Entrez database. It uses the Biopython library to access the Entrez API and parse the FASTA format. It also uses a helper function _fetch_sequence to fetch and return a single sequence.

Parameters:
  • entries (A list of strings that represent the accession numbers of the sequences to be fetched. For example, ['WP_015582217.1', 'WP_001277567.1'].) –

  • sequence_ids (A list of strings that represent the custom ids to be assigned to the fetched sequences. For example, ['seq1', 'seq2']. The length of this list should match the length of the entries list.) –

  • email (A string that specifies the email address of the user. This is required by the Entrez API to identify the user and avoid abusing the system. For example, ‘user@example.com’.) –

  • database (A string that specifies the name of the Entrez database to fetch the sequences from. The default value is 'protein'. For example, 'nucleotide'.) –

Returns:

  • sequences (A list of Bio.SeqRecord.SeqRecord objects that contain the fetched sequences. Each sequence has the id attribute set to the corresponding value in the sequence_ids list. If an error occurs while fetching a sequence, it is skipped and not added to the list.)
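The pairing of entries with sequence_ids and the skip-on-error behavior can be sketched without network access. The sketch below is an assumption-laden illustration rather than the library's code: fetch_all and fake_fetch are hypothetical names, records are plain dictionaries instead of SeqRecord objects, and the stand-in fetcher replaces the real Entrez call.

```python
def fetch_all(entries, sequence_ids, fetch_one):
    """Fetch each entry, relabel it with its custom id, and skip failures."""
    if len(entries) != len(sequence_ids):
        raise ValueError("entries and sequence_ids must have the same length")
    records = []
    for accession, custom_id in zip(entries, sequence_ids):
        try:
            record = fetch_one(accession)
        except Exception as error:
            # A failed entry is reported and left out of the result.
            print(f"Skipping {accession}: {error}")
            continue
        record["id"] = custom_id
        records.append(record)
    return records


def fake_fetch(accession):
    """Stand-in for a network call; fails for one made-up accession."""
    if accession == "BAD_ID":
        raise LookupError("not found")
    return {"id": accession, "seq": "PLACEHOLDER"}


records = fetch_all(["WP_015582217.1", "BAD_ID"], ["seq1", "seq2"], fake_fetch)
print([record["id"] for record in records])
```

Because failed entries are dropped, the returned list can be shorter than entries, so callers should match results by id rather than by position.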

microberx.OmicsIntegrator.get_interpro(sequence_id=None, sequence=None, email=None, sequence_type='protein', go_terms=True, pathways=True)[source]

The function get_interpro retrieves the InterProScan results for a given sequence from the EBI InterProScan 5 web service. It uses the requests library to access the REST API and the pandas library to parse the tab-separated values (TSV) format. It also accepts optional parameters to include GO terms and pathway information in the output.

Parameters:
  • sequence_id (A string that represents the id of the sequence to be scanned. For example, 'seq1'.) –

  • sequence (A string of the sequence to be scanned. For example, MKKLLIISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKALVLIAFAQYLQQCP.) –

  • email (A string that specifies the email address of the user. This is required by the EBI InterProScan 5 web service to identify the user and avoid abusing the system. For example, ‘user@example.com’.) –

  • sequence_type (A string that specifies the type of the sequence to be scanned. It can be either 'protein' or 'nucleotide'. The default value is 'protein'.) –

  • go_terms (A boolean that indicates whether to include GO terms in the output. The default value is True.) –

  • pathways (A boolean that indicates whether to include pathway information in the output. The default value is True.) –

Returns:

  • interpro (A pandas DataFrame object that contains the InterProScan results. It has the following columns: ['accesion', 'token', 'sequence_length', 'analysis', 'signature_accession', 'signature_description', 'start_location', 'stop_location', 'score', 'status', 'date', 'interpro_accession', 'interpro_description', 'go_annotations', 'pathways']. The last two columns are optional depending on the values of the go_terms and pathways parameters. If an error occurs while fetching the results, it returns the error message as a string.)
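To make the column layout concrete, the sketch below maps one tab-separated result line onto the documented column names. It uses the standard csv module instead of pandas to stay dependency-free, parse_interpro_tsv is a hypothetical helper, and the sample row is an invented placeholder, not real InterProScan output.

```python
import csv
import io

# Column names copied from the return description (spelling as documented).
COLUMNS = [
    "accesion", "token", "sequence_length", "analysis",
    "signature_accession", "signature_description",
    "start_location", "stop_location", "score", "status", "date",
    "interpro_accession", "interpro_description",
    "go_annotations", "pathways",
]


def parse_interpro_tsv(text):
    """Turn TSV text into a list of dicts keyed by the documented columns."""
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue
        row = dict(zip(COLUMNS, fields))
        # Locations are positions on the sequence, so store them as integers.
        row["start_location"] = int(row["start_location"])
        row["stop_location"] = int(row["stop_location"])
        rows.append(row)
    return rows


# Invented placeholder row with all 15 fields, for illustration only.
sample = "\t".join([
    "seq1", "token0", "59", "Pfam", "PF00000", "Placeholder domain",
    "5", "50", "1.0E-5", "T", "01-01-2024",
    "IPR000000", "Placeholder entry", "-", "-",
]) + "\n"

rows = parse_interpro_tsv(sample)
print(rows[0]["analysis"], rows[0]["start_location"], rows[0]["stop_location"])
```

Since get_interpro is documented to return a string on failure, calling code would likely check isinstance(result, str) before treating the result as a table.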

microberx.OmicsIntegrator.plot_interpro_results(interpro_results=None, compact=True)[source]

The function plot_interpro_results creates a bar plot of the InterProScan results for a given sequence. It uses the Plotly Express library to create the bar plot. It also accepts an optional parameter to choose between a compact or a detailed view of the results.

Parameters:
  • interpro_results (A pandas DataFrame object that contains the InterProScan results. It should have the following columns: ['accesion', 'token', 'sequence_length', 'analysis', 'signature_accession', 'signature_description', 'start_location', 'stop_location', 'score', 'status', 'date', 'interpro_accession', 'interpro_description', 'go_annotations', 'pathways']. The last two columns are optional depending on the values of the go_terms and pathways parameters in the get_interpro function.) –

  • compact (A boolean that indicates whether to use a compact or a detailed view of the results. The default value is True. If True, the bar plot shows the InterPro accession and description for each segment of the sequence. If False, the bar plot shows the analysis and signature description for each segment of the sequence.) –

Returns:

  • fig (A Plotly Figure object that contains the bar plot. It has one subplot for the sequence. The bar plot shows the distribution of the InterProScan results by their start and stop locations on the sequence. The color of each segment is determined by the InterPro accession or the analysis depending on the value of the compact parameter. The text of each segment shows the InterPro description or the signature description depending on the value of the compact parameter.)
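The effect of the compact flag on coloring and labeling can be summarized as a small lookup. This is a sketch of the behavior described above, not the function's source: select_plot_columns is a hypothetical helper, and the row values are invented placeholders.

```python
def select_plot_columns(compact=True):
    """Pick the columns used for segment color and text (per the description)."""
    if compact:
        return {"color": "interpro_accession", "text": "interpro_description"}
    return {"color": "analysis", "text": "signature_description"}


# Placeholder row holding only the columns relevant to this sketch.
row = {
    "analysis": "Pfam",
    "signature_description": "Placeholder domain",
    "interpro_accession": "IPR000000",
    "interpro_description": "Placeholder entry",
}

for compact in (True, False):
    columns = select_plot_columns(compact)
    print(compact, row[columns["color"]], row[columns["text"]])
```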

microberx.OmicsIntegrator.run_multi_sequence_aligment(sequences_file=None, input_format='fasta', output_aligment_format='fasta')[source]

The function run_multi_sequence_aligment performs a multiple sequence alignment (MSA) and a phylogenetic tree construction for a given set of sequences using the ClustalW2 program. It uses the Biopython library to parse the input and output files and to run the ClustalW2 command line. It also returns the pairwise alignment scores as a similarity matrix.

Parameters:
  • sequences_file (A string that represents the name of the file that contains the sequences to be aligned. For example, 'sequences.faa'.) –

  • input_format (A string that specifies the format of the input file. The default value is 'fasta'. For example, 'phylip'.) –

  • output_aligment_format (A string that specifies the format of the output alignment file. The default value is 'fasta'. For example, 'clustal'.) –

Returns:

  • similarity_matrix (A pandas DataFrame object that contains the pairwise alignment scores. It has the sequence ids as the row and column labels. The values are the percentage of identical positions in the pairwise alignment. The diagonal values are 100. For example:)

            seq1    seq2    seq3
    seq1   100.0    85.0    75.0
    seq2    85.0   100.0    80.0
    seq3    75.0    80.0   100.0

Side effects:

The function also creates two output files in the same directory as the input file:

  • A file such as 'sequences.fasta' that contains the MSA in the specified output format.

  • A file such as 'sequences.dnd' that contains the phylogenetic tree in the Newick format, for example: (((seq1:0.02941,seq2:0.02941):0.02941,seq3:0.05882):0.00000,);
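The "percentage of identical positions" stored in the similarity matrix can be illustrated with a small pure-Python calculation. This is a sketch under stated assumptions, not the scoring ClustalW2 or MicrobeRX actually performs: percent_identity and similarity_matrix are hypothetical helpers, columns where both sequences have a gap are ignored by assumption, and the toy sequences are invented.

```python
def percent_identity(first, second):
    """Percentage of identical columns between two aligned sequences."""
    if len(first) != len(second):
        raise ValueError("aligned sequences must have equal length")
    # Assumption: columns where both sequences carry a gap are not counted.
    columns = [(x, y) for x, y in zip(first, second) if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def similarity_matrix(aligned):
    """Nested dict of pairwise identities, keyed by sequence id."""
    ids = list(aligned)
    return {
        row_id: {col_id: percent_identity(aligned[row_id], aligned[col_id]) for col_id in ids}
        for row_id in ids
    }


# Invented toy alignment; '-' marks a gap.
aligned = {"seq1": "MKV-LA", "seq2": "MKVALA", "seq3": "MRV-LG"}
matrix = similarity_matrix(aligned)
for row_id, row in matrix.items():
    print(row_id, [round(value, 1) for value in row.values()])
```

Other conventions for gaps (for example, dividing by the shorter ungapped length) give different percentages, so values from this sketch should not be expected to match the function's output exactly.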

microberx.OmicsIntegrator.plot_similarity_matrix(similarity_matrix=None, homology_percentage=90, cmap='custom')[source]

The function plot_similarity_matrix creates a heatmap of the pairwise similarity scores for a given set of sequences using the Dash Bio library. It also accepts optional parameters to choose the color map and the homology percentage for the heatmap.

Parameters:
  • similarity_matrix (A pandas DataFrame object that contains the pairwise similarity scores for the sequences. It should have the sequence ids as the row and column labels. The values should be the percentage of identical positions in the pairwise alignment. For example:) –

            seq1    seq2    seq3
    seq1   100.0    85.0    75.0
    seq2    85.0   100.0    80.0
    seq3    75.0    80.0   100.0

  • homology_percentage (A float that specifies the threshold for the homology color. The default value is 90. It should be between 0 and 100. The dendrogram uses a different color for values that are greater than or equal to this threshold, which highlights homology clusters.) –

  • cmap (A string or a list that specifies the color map for the heatmap. The default value is 'custom'. If 'custom', the color map is [[0.0, 'rgb(64, 126, 156)'], [0.5,'rgb(242,241,241)'], [1.0, 'rgb(195,85,58)']]. Otherwise, it can be one of the predefined color maps in the Dash Bio library: 'Blackbody', 'Bluered', 'Blues', 'Earth', 'Electric', 'Greens', 'Greys', 'Hot', 'Jet', 'Picnic', 'Portland', 'Rainbow', 'RdBu', 'Reds', 'Viridis', 'YlGnBu', 'YlOrRd'. Alternatively, it can be a custom color map as a list of lists that map a value between 0 and 1 to a color. For example, [[0.0, 'red'], [0.5, 'white'], [1.0, 'blue']].) –

Returns:

  • fig (A Plotly Figure object that contains the heatmap of the pairwise similarity scores.)

The figure has the following features:

  • It shows the sequence ids and the similarity scores for each pair of sequences in the heatmap.

  • It allows the user to zoom in and out, pan, and select a region of the heatmap.

  • It allows the user to change the color map, the homology percentage, and the display options of the heatmap.

  • It shows a color bar that indicates the range of the similarity scores and the homology color.
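The "greater than or equal to the threshold" rule behind homology_percentage can be illustrated by listing the sequence pairs that meet it. This sketch only shows the comparison; how the function maps the threshold onto dendrogram colors is not shown in this reference, and homologous_pairs is a hypothetical helper. The matrix values are taken from the example table above.

```python
def homologous_pairs(matrix, homology_percentage=90):
    """List unordered id pairs whose similarity meets the threshold."""
    if not 0 <= homology_percentage <= 100:
        raise ValueError("homology_percentage must be between 0 and 100")
    ids = list(matrix)
    pairs = []
    for index, first in enumerate(ids):
        # Only look at ids after `first` so each pair is reported once.
        for second in ids[index + 1:]:
            if matrix[first][second] >= homology_percentage:
                pairs.append((first, second))
    return pairs


# Values from the example similarity matrix in the parameter description.
matrix = {
    "seq1": {"seq1": 100.0, "seq2": 85.0, "seq3": 75.0},
    "seq2": {"seq1": 85.0, "seq2": 100.0, "seq3": 80.0},
    "seq3": {"seq1": 75.0, "seq2": 80.0, "seq3": 100.0},
}

print(homologous_pairs(matrix, homology_percentage=80))
print(homologous_pairs(matrix))  # default threshold of 90
```

With the example matrix, lowering the threshold from 90 to 80 moves two pairs above the cutoff, which is the kind of change the homology color is meant to make visible.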

microberx.OmicsIntegrator.plot_aligment_chart(msa_file=None, cmap='custom', color_scale='mae')[source]

The function plot_aligment_chart creates a chart of the multiple sequence alignment (MSA) for a given set of sequences using the Dash Bio library. It also accepts optional parameters to choose the color map for the conservation scores and the color scale for the alignment symbols.

Parameters:
  • msa_file (A string that represents the name of the file that contains the MSA in the FASTA format. For example, 'sequences.fasta'.) –

  • cmap (A string or a list that specifies the color map for the conservation scores. The default value is 'custom'. If 'custom', the color map is [[0.0, 'rgb(64, 126, 156)'], [0.5,'rgb(242,241,241)'], [1.0, 'rgb(195,85,58)']]. Otherwise, it can be one of the predefined color maps in the Plotly library, such as 'viridis' or 'RdBu'.) –

  • color_scale (A string that specifies the color scale for the alignment symbols. The default value is 'mae'. It can be one of the predefined color scales in the Dash Bio library: 'buried', 'cinema', 'clustal', 'clustal2', 'helix', 'hydrophobicity', 'lesk', 'mae', 'nucleotide', 'purine', 'strand', 'taylor', 'turn', or 'zappo'. Alternatively, it can be a custom color map as a dictionary that maps each nucleotide or amino acid to a color. For example, {'A': 'red', 'C': 'blue', 'G': 'green', 'T': 'yellow'}.) –

Returns:

  • None

Side effects:

The function also creates and runs a Dash app that displays the chart of the MSA. The chart has the following features:

  • It shows the sequence ids, the alignment symbols, and the conservation scores for each position in the alignment.

  • It allows the user to zoom in and out, pan, and select a region of the alignment.
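Before passing a custom color_scale dictionary, it can help to check that the alignment file is a valid MSA and that every symbol has a color. The sketch below does this on in-memory FASTA text; read_fasta_alignment and uncolored_symbols are hypothetical helpers, the toy alignment is invented, and the library itself may handle unmapped symbols differently.

```python
def read_fasta_alignment(text):
    """Parse FASTA text into {id: sequence} and check equal lengths."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            records[current] = ""
        elif current is not None:
            records[current] += line
    if len({len(sequence) for sequence in records.values()}) > 1:
        raise ValueError("sequences in an alignment must have equal length")
    return records


def uncolored_symbols(records, color_scale):
    """Return alignment symbols (gaps excluded) missing from the color scale."""
    symbols = set("".join(records.values())) - {"-"}
    return sorted(symbols - set(color_scale))


# Invented toy nucleotide alignment and the custom scale from the description.
msa_text = ">seq1\nACG-T\n>seq2\nACGTT\n>seq3\nACGNT\n"
custom_scale = {"A": "red", "C": "blue", "G": "green", "T": "yellow"}

records = read_fasta_alignment(msa_text)
print(uncolored_symbols(records, custom_scale))
```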