microberx
MicrobeRX is a tool for enzymatic reaction-based metabolite prediction in the gut microbiome.
The tool allows you to:
- Load and process reaction rules and evidences from various sources
- Generate and rank metabolite candidates for a given input compound and a set of microbes
- Visualize and explore the results using interactive plots and tables
Submodules
Attributes
Classes
Reaction – A class to represent a chemical reaction with various attributes and methods.
MetabolitePredictor – A class for predicting metabolites using reaction rules.
RunPredictionRule – A class for predicting reactions based on reaction rules.
Functions
get_versions – Get version information or return default if unable to do so.
decompose_reaction – Decompose a chemical reaction into its individual compounds and stoichiometry.
sanitize_reaction – Sanitizes a chemical reaction by removing stereochemistry, replacing dummy atoms with carbon, and standardizing the molecules.
set_reaction_ids – Sets the IDs of the reactants and products of a target reaction based on their molecular formulas and a reference reaction.
reverse_reaction – Reverses a chemical reaction by swapping the reactants and products.
generate_single_reactant_reactions – Generates a dictionary of single reactant reactions from a mapped reaction.
generate_rules – Generates a dictionary of rules for a single reactant reaction based on the reacting atoms and the rings.
load_reaction_rules – Load the reaction rules from a compressed tab-separated file.
load_evidences – Load the human evidences from a compressed tab-separated file.
load_microbes_reactions – Load the microbes reactions from a compressed tab-separated file.
load_microbes_data – Load the microbes data from a compressed tab-separated file.
compute_molecular_descriptors – Computes some molecular descriptors and filters for a given data frame of SMILES strings.
compute_isotopic_mass – Computes the isotopic mass distribution of a given data frame using the pyOpenMS library.
search_pubchem – Searches the PubChem database for compounds that match a given data frame of identifiers.
classify_molecules – Classify molecules based on their SMILES strings.
plot_molecular_descriptors – Plots the molecular descriptors of a given data frame using polar coordinates.
plot_boiled_egg – Plots the boiled egg diagram of a given data frame using scatter plot.
plot_isotopic_masses – Plots the isotopic mass distribution of a given data frame using plotly.
plot_confidence_scores – Creates a 3D scatter plot of the data frame with the x, y, and z axes representing the similarity of substrates, products, and reacting atoms efficiency respectively.
plot_metabolic_accesibility – Creates a 2D image of a molecule with the atoms colored according to their metabolic accessibility.
display_molecules – Displays a grid of molecules from a data frame, using different colors to indicate the values of a specified column.
Creates a Sankey diagram to visualize the evidences of metabolite annotations in a data frame.
plot_species_sunburst – Creates a sunburst plot of the microbial species in the sources list, using the Plotly Express library. It relies on the global variables MICROBES_DATA and MICROBES_REACTIONS, which are loaded by the function check_if_microbes_databases_are_loaded.
fetch_batch_sequences – Fetches a list of sequences from the NCBI Entrez database. It uses the Biopython library to access the Entrez API and parse the FASTA format, and a helper function _fetch_sequence to fetch and return a single sequence.
get_interpro – Retrieves the InterProScan results for a given sequence from the EBI InterProScan 5 web service. It uses the requests library to access the REST API and the pandas library to parse the tab-separated values (TSV) format. Optional parameters include GO terms and pathway information in the output.
plot_interpro_results – Creates a bar plot of the InterProScan results for a given sequence using the Plotly Express library. An optional parameter chooses between a compact or a detailed view of the results.
run_multi_sequence_aligment – Performs a multiple sequence alignment (MSA) and a phylogenetic tree construction for a given set of sequences using the ClustalW2 program. It uses the Biopython library to parse the input and output files and to run the ClustalW2 command line. It also returns a heatmap of the pairwise alignment scores.
plot_similarity_matrix – Creates a heatmap of the pairwise similarity scores for a given set of sequences using the Dash Bio library. Optional parameters choose the color map and the homology percentage for the heatmap.
plot_aligment_chart – Creates a chart of the multiple sequence alignment (MSA) for a given set of sequences using the Dash Bio library. Optional parameters choose the color scale and the conservation method for the chart.
Package Contents
- microberx.get_versions()[source]
Get version information or return default if unable to do so.
- Return type:
Dict[str, Any]
- microberx.decompose_reaction(reaction=None, compounds_map=None)[source]
Decompose a chemical reaction into its individual compounds and stoichiometry.
- Parameters:
reaction (str, optional) – The chemical reaction in string format. It can be specified using various notations:
- "<=>" for reversible reactions.
- "-->" for irreversible reactions.
- "=" for generic reactions.
compounds_map (dict, optional) – A dictionary mapping compound names to their corresponding SMILES representations.
- Returns:
reaction_dict – A dictionary containing the decomposed information of the chemical reaction, including:
- "Reaction": The original input chemical reaction.
- "Reversible": A boolean indicating whether the reaction is reversible.
- "LEFT": A dictionary of reactants with stoichiometry and SMILES.
- "RIGHT": A dictionary of products with stoichiometry and SMILES.
- "ReactionSmiles": The SMILES representation of the entire reaction.
- "ReactionNames": The names of compounds in the reaction.
- Return type:
dict
Examples
>>> from microberx import decompose_reaction
>>> compounds_map = {"H2O": "O", "CO2": "O=C=O", "CH4": "C"}
>>> reaction_str = "H2O + CO2 <=> CH4 + H2O"
>>> reaction_info = decompose_reaction(reaction_str, compounds_map)
>>> print(reaction_info)
>>> reaction_str = "H2O + CO2 --> CH4 + H2O"
>>> reaction_info = decompose_reaction(reaction_str, compounds_map)
>>> print(reaction_info)
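The three arrow notations and the LEFT/RIGHT stoichiometry dictionaries described above can be illustrated with a small standalone sketch. This is not MicrobeRX's implementation: the helper name split_reaction, the " + " separator, the optional leading numeric coefficient, and treating only "<=>" as reversible are all assumptions made for illustration.

```python
import re


def split_reaction(reaction):
    """Split a reaction string into sides and stoichiometry (illustrative only)."""
    # Check the longest arrows first so "<=>" is not mistaken for "=".
    # Assumption: only "<=>" marks a reversible reaction.
    for arrow, reversible in (("<=>", True), ("-->", False), ("=", False)):
        if arrow in reaction:
            left, right = reaction.split(arrow, 1)
            break
    else:
        raise ValueError(f"no reaction arrow found in {reaction!r}")

    def parse_side(side):
        compounds = {}
        for term in side.split(" + "):
            term = term.strip()
            # Assumption: an optional leading number plus a space is the coefficient.
            match = re.match(r"^(\d+(?:\.\d+)?)\s+(.+)$", term)
            if match:
                coefficient, name = float(match.group(1)), match.group(2)
            else:
                coefficient, name = 1.0, term
            compounds[name] = compounds.get(name, 0.0) + coefficient
        return compounds

    return {
        "Reaction": reaction,
        "Reversible": reversible,
        "LEFT": parse_side(left),
        "RIGHT": parse_side(right),
    }


info = split_reaction("H2O + CO2 <=> CH4 + H2O")
print(info["Reversible"], info["LEFT"], info["RIGHT"])
```

The real function additionally resolves each compound name to SMILES through compounds_map and builds the "ReactionSmiles" and "ReactionNames" entries, which this sketch leaves out.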
- microberx.sanitize_reaction(target_reaction)[source]
Sanitizes a chemical reaction by removing stereochemistry, replacing dummy atoms with carbon, and standardizing the molecules.
- Parameters:
target_reaction (AllChem.ChemicalReaction) – The input chemical reaction to be sanitized.
- Returns:
fixed_reaction –
- The output chemical reaction after sanitization. The fixed reaction has the following features:
The stereochemistry of the reactants and products is removed, as it may not be relevant or accurate for the reaction.
The dummy atoms (*) in the reactants and products are replaced with carbon atoms (#6), as they may represent unspecified groups or atoms.
The reactants and products are sanitized and standardized using the datamol (dm) module, which performs operations such as kekulization, neutralization, and tautomerization.
The 2D coordinates of the reactants and products are computed using the AllChem.Compute2DCoords function, which may improve the visualization of the reaction.
- Return type:
AllChem.ChemicalReaction
Examples
>>> from rdkit.Chem import AllChem, Draw
>>> from microberx import sanitize_reaction
>>> reaction = AllChem.ReactionFromSmarts("[OH:1].[C:2]=O>>[C:2][OH:1]")
>>> fixed_reaction = sanitize_reaction(reaction)
>>> img = Draw.ReactionToImage(fixed_reaction)
>>> img.show()
- microberx.set_reaction_ids(reference_reaction, target_reaction, reaction_ids)[source]
Sets the IDs of the reactants and products of a target reaction based on their molecular formulas and a reference reaction.
- Parameters:
reference_reaction (AllChem.ChemicalReaction) – The reference chemical reaction that has the same reactants and products as the target reaction, but in a different order or orientation.
target_reaction (AllChem.ChemicalReaction) – The target chemical reaction that needs to have its IDs set.
reaction_ids (str) – The IDs of the reactants and products of the reference reaction, in the format of ‘R1.R2.>>P1.P2.’, where R1, R2, P1, P2 are the IDs.
- Returns:
target_reaction –
- The target chemical reaction with its IDs set according to the reference reaction and the reaction_ids. The target reaction has the following features:
The reactants and products are the same as the input target reaction, but with an additional property ‘ID’ added to each molecule.
The value of the ‘ID’ property is determined by matching the molecular formula of each molecule in the target reaction with the corresponding molecule in the reference reaction, and then using the value from the reaction_ids string.
The order and orientation of the reactants and products in the target reaction are preserved in the output target reaction.
- Return type:
AllChem.ChemicalReaction
Examples
>>> from rdkit.Chem import AllChem
>>> from microberx import set_reaction_ids
>>> reference_reaction = AllChem.ReactionFromSmarts("[H][C:1]([H])=[O:2]>>[O:2]=[C:1]([H])")
>>> target_reaction = AllChem.ReactionFromSmarts("[O:2]=[C:1]([H])>>[H][C:1]([H])=[O:2]")
>>> reaction_ids = "R1.R2.>>P1.P2."
>>> target_reaction_with_ids = set_reaction_ids(reference_reaction, target_reaction, reaction_ids)
>>> print(target_reaction_with_ids.GetReactants()[0].GetProp("ID"))
>>> print(target_reaction_with_ids.GetProducts()[0].GetProp("ID"))
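To make the reaction_ids format and the formula-based matching more concrete, here is a minimal standalone sketch. It is not the library's code: parse_reaction_ids and assign_ids_by_formula are hypothetical helpers, the molecular formulas are passed in as plain strings rather than computed from molecules, and two molecules sharing a formula would collide in this simple lookup.

```python
def parse_reaction_ids(reaction_ids):
    """Split an ID string such as 'R1.R2.>>P1.P2.' into reactant and product IDs."""
    reactant_part, product_part = reaction_ids.split(">>")
    # Trailing dots produce empty strings, which are dropped here.
    reactants = [item for item in reactant_part.split(".") if item]
    products = [item for item in product_part.split(".") if item]
    return reactants, products


def assign_ids_by_formula(reference_formulas, ids, target_formulas):
    """Give each target formula the ID of the reference molecule with the same formula."""
    lookup = dict(zip(reference_formulas, ids))
    # Formulas missing from the reference map to None.
    return [lookup.get(formula) for formula in target_formulas]


reactant_ids, product_ids = parse_reaction_ids("R1.R2.>>P1.P2.")
print(reactant_ids, product_ids)

# The target lists the same reactants in a different order than the reference.
print(assign_ids_by_formula(["C2H4O2", "C2H6O"], reactant_ids, ["C2H6O", "C2H4O2"]))
```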
- microberx.reverse_reaction(reaction)[source]
Reverses a chemical reaction by swapping the reactants and products.
- Parameters:
reaction (AllChem.ChemicalReaction) – The input chemical reaction to be reversed.
- Returns:
reversed_reaction –
- The output chemical reaction that is the reverse of the input reaction. The reversed reaction has the following features:
The reactants are the same as the products of the input reaction, but in the opposite order.
The products are the same as the reactants of the input reaction, but in the opposite order.
The atom mapping, bond types, and stereochemistry of the reaction are preserved in the reversed reaction.
- Return type:
AllChem.ChemicalReaction
Examples
>>> from rdkit.Chem import AllChem
>>> from microberx import reverse_reaction
>>> reaction = AllChem.ReactionFromSmarts("[H][C:1]([H])=[O:2]>>[O:2]=[C:1]([H])")
>>> reversed_reaction = reverse_reaction(reaction)
>>> print(AllChem.ReactionToSmarts(reversed_reaction))
- microberx.generate_single_reactant_reactions(mapped_reaction)[source]
Generates a dictionary of single reactant reactions from a mapped reaction.
- Parameters:
mapped_reaction (AllChem.ChemicalReaction) – The input chemical reaction with atom mapping.
- Returns:
all_unique_reactions – A dictionary of single reactant reactions, keyed by the reactant index and containing the reactant ID and the single reactant reaction object.
- Return type:
dict
Examples
>>> from rdkit.Chem import AllChem
>>> from microberx import generate_single_reactant_reactions
>>> mapped_reaction = AllChem.ReactionFromSmarts("[H][C:1]([H])=[O:2]>>[O:2]=[C:1]([H])")
>>> single_reactant_reactions = generate_single_reactant_reactions(mapped_reaction)
>>> print(single_reactant_reactions["reactantIdx_1"]["ID"])
>>> print(AllChem.ReactionToSmarts(single_reactant_reactions["reactantIdx_1"]["SingleReactantReaction"]))
- microberx.generate_rules(single_reactant_reaction)[source]
Generates a dictionary of rules for a single reactant reaction based on the reacting atoms and the rings.
- Parameters:
single_reactant_reaction (AllChem.ChemicalReaction) – The input chemical reaction with one reactant and one or more products.
- Returns:
reaction_rules – A dictionary of rules for the single reactant reaction, keyed by the reactant name and containing the reactant and product SMILES, the product name, and a sub-dictionary of rules keyed by the number of atoms to keep.
- Return type:
dict
Examples
>>> from rdkit.Chem import AllChem
>>> from microberx import generate_rules
>>> reaction_smiles = "[H][C:1]([H])=[O:2]>>[O:2]=[C:1]([H])"
>>> reaction = AllChem.ReactionFromSmarts(reaction_smiles)
>>> rules = generate_rules(reaction)
>>> print(rules["reactant_1"]["ReactantMap"])
>>> print(rules["reactant_1"]["ProductName"])
>>> print(rules["reactant_1"]["ProductMap"])
>>> print(rules["reactant_1"]["SingleReactantRules"][3])  # Rules for reactions with 3 atoms to keep
- class microberx.Reaction(reaction_smiles, reaction_ids, reversible=False, mapper='ReactionDecoder')[source]
Bases: object
A class to represent a chemical reaction with various attributes and methods.
- Parameters:
reaction_smiles (str) –
reaction_ids (str) –
reversible (bool) –
mapper (str) –
- SanitizedReaction
The sanitized version of the input reaction, obtained by calling the SanitizeReaction function.
- Type:
AllChem.ChemicalReaction
- MappedReaction
The mapped version of the sanitized reaction, obtained by calling the MapReaction and SetReactionIds functions.
- Type:
AllChem.ChemicalReaction
- ReversedReaction
The reversed version of the mapped reaction, obtained by calling the ReverseReaction function, or None if the reversible argument is False.
- Type:
AllChem.ChemicalReaction or None
- __init__(reaction_smiles, reaction_ids, reversible, mapper)[source]
Initializes a Reaction object with the given arguments.
- Parameters:
reaction_smiles (str) – The input chemical reaction in SMILES format.
reaction_ids (str) – The IDs of the reactants and products of the input reaction, in the format of ‘R1.R2.>>P1.P2.’, where R1, R2, P1, P2 are the IDs.
reversible (bool) – A flag to indicate whether to generate a reversed reaction or not. Default is False.
mapper (str) – The name of the mapper to use for atom mapping. Either ‘RXNMapper’ or ‘ReactionDecoder’. Default is ‘ReactionDecoder’.
Example
>>> from microberx import Reaction
>>> reaction_smiles = 'CC(=O)O.CCOC(=O)C>>CCOC(=O)CC.O'
>>> reaction_ids = 'R1.R2>>P1.P2'
>>> reaction = Reaction(reaction_smiles, reaction_ids, reversible=True, mapper='RXNMapper')
>>> print(reaction.SanitizedReaction)
[CH3:1][C:2](=[O:3])[OH:4].[CH3:5][CH2:6][O:7][C:8](=[O:9])[CH3:10]>><[CH3:1][CH2:6][O:7][C:8](=[O:9])[CH2:11][CH3:10].[OH:4][C:2](=[O:3])[H]
>>> print(reaction.MappedReaction)
[CH3:1][C:2](=[O:3])[OH:4].[CH3:5][CH2:6][O:7][C:8](=[O:9])[CH3:10]>><[CH3:1][CH2:6][O:7][C:8](=[O:9])[CH2:11][CH3:10].[OH:4][C:2](=[O:3])[H]
>>> print(reaction.ReversedReaction)
[CH3:1][CH2:6][O:7][C:8](=[O:9])[CH2:11][CH3:10].[OH:4][C:2](=[O:3])[H]>><[CH3:1][C:2](=[O:3])[OH:4].[CH3:5][CH2:6][O:7][C:8](=[O:9])[CH3:10]
- SanitizedReaction
- __mapped_reaction_raw
- __mapped_reaction_sanitized
- MappedReaction
- microberx.load_reaction_rules()[source]
Load the reaction rules from a compressed tab-separated file.
- Returns:
- A dataframe containing the reaction rules, with columns:
num_atoms : Number of atoms to match in the query to perform a prediction.
rule : SMARTS string of the single reactant reaction rule (SRRR).
reaction_id : Reaction identifier, given as a unified MetaNetX v4.0 ID or an AGORA2 ID.
substrate : MetaNetX ID of the real substrate of the SRRR.
substrate_map : Atom-mapped SMARTS of the real substrate of the SRRR.
product : MetaNetX ID of the main real product of the SRRR.
product_map : Atom-mapped SMARTS of the main real product of the SRRR.
- Return type:
pandas.DataFrame
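For readers unfamiliar with compressed tab-separated files, the sketch below writes and reads a tiny gzip-compressed TSV using only the standard library. It is an illustration under assumptions, not how MicrobeRX loads its data: the file name, the helper read_compressed_tsv, and the single example row (including its rule and reaction ID) are hypothetical, and only three of the documented columns are used.

```python
import csv
import gzip
import os
import tempfile


def read_compressed_tsv(path):
    """Read a gzip-compressed, tab-separated file into a list of row dictionaries."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Hypothetical row using three of the documented column names.
rows_in = [
    {"num_atoms": "3", "rule": "[C:1]=[O:2]>>[C:1][O:2]", "reaction_id": "hypothetical_rxn_1"}
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rules.tsv.gz")
    # Write the example table so the reader has something to load.
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows_in[0]), delimiter="\t")
        writer.writeheader()
        writer.writerows(rows_in)
    rows = read_compressed_tsv(path)

print(rows[0]["rule"])
```

With pandas, the equivalent read would typically be pandas.read_csv(path, sep="\t", compression="gzip"), which returns a DataFrame like the one documented above.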
- microberx.load_evidences()[source]
Load the human evidences from a compressed tab-separated file.
- Returns:
- A dataframe containing the human evidences, with columns:
source : The unique identifier of the source coming from the metabolic reconstruction.
name : Name of the biotransformation; it can match the enzyme name.
ec : Enzyme Commission number for the biotransformation.
mnx_id : Unified ID from MetaNetX v4.0.
organisms_count : Number of organisms in which this source ID has been found.
xrefs : Cross-references to other reaction databases.
origin : Indicates whether the reaction comes from human or gut microbes.
complexes_count : Number of genes or complexes found in the metabolic network for this biotransformation.
- Return type:
pandas.DataFrame
- microberx.load_microbes_reactions()[source]
Load the microbes reactions from a compressed tab-separated file.
- Returns:
- A dataframe containing the microbes reactions.
index : Strain names of all gut microbes included in MicrobeRX (source: AGORA2).
columns : Source names of the biotransformations from the metabolic reconstructions.
data : Each cell contains information about the genes or complexes that have been annotated for each organism and biotransformation.
- Return type:
pandas.DataFrame
- microberx.load_microbes_data()[source]
Load the microbes data from a compressed tab-separated file.
- Returns:
- A dataframe containing the microbes data, with columns:
microbe_name
Strain
Species
Genus
Family
Order
Class
Phylum
Kingdom
Host
NCBI Taxonomy ID
Cultured
Ecosystem
Ecosystem Category
Ecosystem Subtype
Ecosystem Type
Gram Staining
Oxygen Requirement
Motility
- Return type:
pandas.DataFrame
- class microberx.MetabolitePredictor(query, query_name='metabolite', cut_off=0.6, biosystem='all')[source]
A class for predicting metabolites using reaction rules.
- Parameters:
rules_table (str) – The path to a table containing reaction rules and associated information.
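query (rdkit.Chem.Mol) – The query molecule for metabolite prediction.
query_name (str) – The name associated with the query molecule. Default is 'metabolite'.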
cut_off (float) –
biosystem (str) –
- predicted_metabolites
A DataFrame to store predicted metabolites and associated information.
- Type:
pd.DataFrame
- query
The query molecule for metabolite prediction.
- Type:
Chem.Mol
- query_name
The name associated with the query molecule.
- Type:
str
- query_atoms_num
The number of heavy atoms in the query molecule.
- Type:
int
- reacting_atoms
A list to store reacting atom indices.
- Type:
list
- reacting_atoms_in_unique_metabolites
A list to store reacting atom indices in unique metabolites.
- Type:
list
- run_prediction(query: Chem.Mol, name: str = 'metabolite')[source]
Run the metabolite prediction using the provided query molecule and name.
- biosystem = 'all'
- predicted_metabolites
- query
- query_name = 'metabolite'
- query_atoms_num
- reacting_atoms = []
- reacting_atoms_in_unique_metabolites = []
- cut_off = 0.6
- run_prediction()[source]
This method performs metabolite prediction using reaction rules from the rules table. It calculates confidence scores for predicted metabolites and stores the results in the ‘predicted_metabolites’ attribute.
- Parameters:
query (Chem.Mol) – The query molecule for metabolite prediction.
query_name (str, optional) – The name associated with the query molecule. Default is “metabolite”.
- Return type:
None
Example
>>> from rdkit import Chem
>>> from microberx import MetabolitePredictor
>>> query_molecule = Chem.MolFromSmiles("CC(=O)O")
>>> predictor = MetabolitePredictor(query_molecule, query_name="acetate")
>>> predictor.run_prediction()
>>> predicted_metabolites_df = predictor.predicted_metabolites
- class microberx.RunPredictionRule(query, rule_smarts, real_product, real_substrate)[source]
A class for predicting reactions based on reaction rules.
- Parameters:
query (Chem.Mol) – The query molecule for prediction.
rule_smarts (str) – The reaction rule in SMARTS format.
real_product (str) – The SMILES representation of the real product.
real_substrate (str) – The SMILES representation of the real substrate.
- query
The query molecule for prediction.
- Type:
Chem.Mol
- rule_smarts
The reaction rule in SMARTS format.
- Type:
str
- reaction
The reaction object created from the reaction rule.
- Type:
AllChem.Reaction
- real_product
The real product molecule.
- Type:
Chem.Mol
- real_substrate
The real substrate molecule.
- Type:
Chem.Mol
- unique_products
A dictionary to store information about unique predicted products.
- Type:
dict
- query
- rule_smarts
- reaction
- real_product
- real_substrate
- predict()[source]
This method predicts reaction products based on the provided query molecule, reaction rule, real product, and real substrate. It calculates various properties and stores the results in the ‘unique_products’ dictionary.
- Return type:
None
- __get_mol_substructure(mol_target, atom_indexes_to_keep)[source]
Get the reaction substructure from the input molecule.
- Returns:
The reaction substructure.
- Parameters:
mol_target (rdkit.Chem.Mol) –
- __get_substructure_neighbors(mol_target, atom_indexes_to_keep)[source]
Get the substructure neighbors of the input molecule.
- Returns:
The substructure neighbors.
- Parameters:
mol_target (rdkit.Chem.Mol) –
- microberx.compute_molecular_descriptors(data_frame, smiles_col)[source]
Computes some molecular descriptors and filters for a given data frame of SMILES strings.
- Parameters:
data_frame (pd.DataFrame) – A pandas data frame that contains SMILES strings of molecules.
smiles_col (str) – The name of the column that contains the SMILES strings.
- Returns:
data_frame –
- The same data frame as input, but with additional columns for each molecular descriptor and filter computed. The descriptors and filters are:
MolWt: the molecular weight of the molecule
LogP: the octanol-water partition coefficient of the molecule
NumHAcceptors: the number of hydrogen bond acceptors in the molecule
NumHDonors: the number of hydrogen bond donors in the molecule
NumRotatableBonds: the number of rotatable bonds in the molecule
TPSA: the topological polar surface area of the molecule
MolFormula: the molecular formula of the molecule
Lipinski: a boolean value that indicates whether the molecule satisfies Lipinski's rule of five. The rule of five states that most drug-like molecules have a molecular weight of at most 500 Da, a LogP of at most 5, at most 10 hydrogen bond acceptors, and at most 5 hydrogen bond donors.
Veber: a boolean value that indicates whether the molecule satisfies Veber's rule. The rule states that most orally active drugs have 10 or fewer rotatable bonds and a polar surface area of 140 Å² or less.
Brenk: a string that contains the names of the Brenk filters that the molecule matches, separated by semicolons. The Brenk filters are a set of 68 unwanted substructures that are associated with reactive functional groups.
PAINS: a string that contains the names of the PAINS filters that the molecule matches, separated by semicolons. The PAINS filters are a set of 480 substructures that are associated with pan-assay interference compounds.
- Return type:
pd.DataFrame
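The Lipinski and Veber columns described above are simple threshold checks on the other descriptor columns. A minimal sketch follows, assuming inclusive thresholds (the common statement of both rules); the function names and the example descriptor values are hypothetical, and the library's own comparisons may differ at the boundaries.

```python
def passes_lipinski(mol_wt, log_p, h_acceptors, h_donors):
    """Return True if all four rule-of-five conditions hold (inclusive thresholds assumed)."""
    return mol_wt <= 500 and log_p <= 5 and h_acceptors <= 10 and h_donors <= 5


def passes_veber(rotatable_bonds, tpsa):
    """Return True for at most 10 rotatable bonds and a TPSA of at most 140 square angstroms."""
    return rotatable_bonds <= 10 and tpsa <= 140


# Hypothetical descriptor values for a small and a large molecule.
small = {"MolWt": 180.2, "LogP": 1.3, "NumHAcceptors": 3, "NumHDonors": 1,
         "NumRotatableBonds": 3, "TPSA": 63.6}
large = {"MolWt": 650.0, "LogP": 6.2, "NumHAcceptors": 12, "NumHDonors": 6,
         "NumRotatableBonds": 12, "TPSA": 150.0}

for row in (small, large):
    lipinski = passes_lipinski(row["MolWt"], row["LogP"], row["NumHAcceptors"], row["NumHDonors"])
    veber = passes_veber(row["NumRotatableBonds"], row["TPSA"])
    print(lipinski, veber)
```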
- microberx.compute_isotopic_mass(data_frame, molformula_col)[source]
Computes the isotopic mass distribution of a given data frame using the pyOpenMS library.
- Parameters:
data_frame (pd.DataFrame) – A pandas data frame that contains the molecular formulas as a column.
molformula_col (str) – A string that specifies the name of the column that contains the molecular formulas.
- Returns:
data_frame – A pandas data frame that has two additional columns: ‘probability_sum’ and ‘mass_distribution’. The ‘probability_sum’ column contains the sum of the probabilities of all isotopes for each molecular formula. The ‘mass_distribution’ column contains the mass and probability of each isotope as a string, separated by semicolons.
- Return type:
pd.DataFrame
Example
>>> import pandas as pd
>>> from pyopenms import EmpiricalFormula, CoarseIsotopePatternGenerator
>>> df = pd.DataFrame({'formula': ['C6H12O6', 'C2H4O2', 'C3H8O3']})
>>> df = compute_isotopic_mass(df, 'formula')
>>> print(df)
   formula  probability_sum  mass_distribution
0  C6H12O6  1.0000  180.0634:100.0;181.0668:10.72;182.0701:1.176;183...
1  C2H4O2  1.0000  60.0211:100.0;61.0245:11.08;62.0279:1.216;63.031...
2  C3H8O3  0.9999  92.0473:100.0;93.0507:10.55;94.0541:1.159;95.057...
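The 'mass_distribution' strings shown above can be turned back into numbers with a few lines of plain Python. This is a sketch that assumes the 'mass:probability' pairs are separated by semicolons exactly as in the example output; parse_mass_distribution is a hypothetical helper, not part of MicrobeRX.

```python
def parse_mass_distribution(text):
    """Parse 'mass:probability;mass:probability;...' into a list of (mass, probability) tuples."""
    peaks = []
    for entry in text.split(";"):
        entry = entry.strip()
        if not entry:
            # Skip empty pieces, for example after a trailing semicolon.
            continue
        mass, probability = entry.split(":")
        peaks.append((float(mass), float(probability)))
    return peaks


# First three peaks of the C6H12O6 row from the example output.
peaks = parse_mass_distribution("180.0634:100.0;181.0668:10.72;182.0701:1.176")
most_abundant = max(peaks, key=lambda peak: peak[1])
print(peaks)
print(most_abundant)
```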
- microberx.search_pubchem(data_frame, entry_col, entry_type='smiles')[source]
Searches the PubChem database for compounds that match a given data frame of identifiers.
- Parameters:
data_frame (pd.DataFrame) – A pandas data frame that contains the identifiers of the compounds to search for.
entry_col (str) – A string that specifies the name of the column that contains the identifiers.
entry_type (str, optional) – A string that specifies the type of the identifiers, such as ‘smiles’, ‘inchi’, ‘cid’, etc. The default is ‘smiles’.
- Returns:
data_frame –
- The same data frame as input, but with additional columns for each PubChem property retrieved. The properties are:
PubChem_CID: the PubChem compound identifier, separated by semicolons if there are multiple matches.
PubChem_SID: the PubChem substance identifier, separated by semicolons if there are multiple matches. Only the first three SIDs are shown.
PubChem_Synonyms: the synonyms of the compound, separated by semicolons if there are multiple matches.
- Return type:
pd.DataFrame
- microberx.classify_molecules(data_frame, smiles_col, names_col)[source]
Classify molecules based on their SMILES strings.
This function submits a query to the ClassyFire web service and returns a data frame with the classification results.
- Parameters:
data_frame (pd.DataFrame) – The input data frame with the molecules information.
smiles_col (str) – The name of the column that contains the SMILES strings.
names_col (str) – The name of the column that contains the molecule names.
- Returns:
- The output data frame with the classification results added as new columns. The columns are:
kingdom: the name of the chemical kingdom of the molecule, such as ‘Organic compounds’, ‘Inorganic compounds’, etc.
superclass: the name of the chemical superclass of the molecule, such as ‘Lipids and lipid-like molecules’, ‘Organoheterocyclic compounds’, etc.
class: the name of the chemical class of the molecule, such as ‘Steroids and steroid derivatives’, ‘Benzodiazepines’, etc.
subclass: the name of the chemical subclass of the molecule, such as ‘Cholestane steroids’, ‘1,4-benzodiazepines’, etc.
- Return type:
pd.DataFrame
- Raises:
requests.exceptions.HTTPError – If the query to the ClassyFire web service fails.
- microberx.plot_molecular_descriptors(data_frame, names_col)[source]
Plots the molecular descriptors of a given data frame using polar coordinates.
- Parameters:
data_frame (pd.DataFrame) – A pandas data frame that contains the molecular descriptors as columns and the compound names as rows.
names_col (str) – A string that specifies the name of the column that contains the compound names.
- Returns:
Figure –
- A plotly figure object that shows the polar plot of the molecular descriptors. The plot has the following features:
The radial axis represents the normalized value of each molecular descriptor, ranging from 0 to 1.
The angular axis represents the different molecular descriptors, such as MolWt, LogP, NumHAcceptors, etc.
Each compound is plotted as a radial line with a distinct color and a marker at each descriptor value.
The upper and lower limits of Lipinski's rule of five are plotted as shaded regions in orange and yellow, respectively. The rule of five states that most drug-like molecules have a molecular weight of at most 500 Da, a LogP of at most 5, at most 10 hydrogen bond acceptors, and at most 5 hydrogen bond donors.
A legend is displayed on the right side of the plot, showing the name and color of each compound.
- Return type:
plotly.graph_objects.Figure
- microberx.plot_boiled_egg(data_frame, names_col)[source]
Plots the boiled egg diagram of a given data frame using scatter plot.
- Parameters:
data_frame (pd.DataFrame) – A pandas data frame that contains the TPSA and LogP values as columns and the compound names as rows.
names_col (str) – A string that specifies the name of the column that contains the compound names.
- Returns:
Figure –
- A plotly figure object that shows the scatter plot of the TPSA and LogP values. The plot has the following features:
The x-axis represents the topological polar surface area (TPSA) of each compound, ranging from 0 to 142.
The y-axis represents the octanol-water partition coefficient (LogP) of each compound, ranging from -2.3 to 6.8.
Each compound is plotted as a red dot with its name displayed in the hover.
The human intestinal absorption (HIA) and blood-brain barrier (BBB) regions are plotted as white and orange circles, respectively. The HIA region indicates the compounds that are likely to be absorbed by the human intestine, while the BBB region indicates the compounds that are likely to cross the blood-brain barrier.
- Return type:
plotly.graph_objects.Figure
- microberx.plot_isotopic_masses(data_frame, names_col, mass_distribution_col)[source]
Plots the isotopic mass distribution of a given data frame using plotly.
- Parameters:
data_frame (pd.DataFrame) – A pandas data frame that contains the isotopic mass distribution as a column of strings, where each string has the format ‘mass:probability;mass:probability;…’
names_col (str) – A string that specifies the name of the column that contains the compound names.
mass_distribution_col (str) – A string that specifies the name of the column that contains the isotopic mass distribution.
- Returns:
Figure –
- A plotly figure object that shows the bar plot of the isotopic mass distribution for each compound. The plot has the following features:
The x-axis represents the mass values of the isotopes, rounded to four decimal places.
The y-axis represents the probability values of the isotopes, multiplied by 100 and rounded to four decimal places.
Each compound is plotted as a group of bars with a distinct color and a label at the top of each bar.
A legend is displayed on the right side of the plot, showing the name and color of each compound.
- Return type:
plotly.graph_objects.Figure
- microberx.plot_confidence_scores(data_frame, x='similarity_substrates', y='similarity_products', z='reacting_atoms_efficiency', cmap='RdYlGn')[source]
Creates a 3D scatter plot of the data frame with the x, y, and z axes representing the similarity of substrates, products, and reacting atoms efficiency respectively.
- Parameters:
data_frame (pd.DataFrame) – The data frame containing the columns ‘similarity_substrates’, ‘similarity_products’, ‘reacting_atoms_efficiency’, ‘confidence_score’, and ‘metabolite_id’.
x (str, optional) – The name of the column to use as the x-axis. Defaults to ‘similarity_substrates’.
y (str, optional) – The name of the column to use as the y-axis. Defaults to ‘similarity_products’.
z (str, optional) – The name of the column to use as the z-axis. Defaults to ‘reacting_atoms_efficiency’.
cmap (str, optional) – The name of the color map to use for the color scale. Defaults to ‘RdYlGn’.
- Returns:
- The 3D scatter plot figure. The figure has the following features:
The x-axis represents the similarity of substrates, ranging from 0 to 1.
The y-axis represents the similarity of products, ranging from 0 to 1.
The z-axis represents the reacting atoms efficiency, ranging from 0 to 1.
The color of each point indicates the confidence score of the corresponding metabolite id, ranging from 0 to 1. A color bar is displayed on the right side of the plot.
The hover text of each point shows the metabolite id and the values of x, y, z, and color.
The title of the plot shows the names of the columns used for x, y, z, and color.
- Return type:
plotly.graph_objects.Figure
- microberx.plot_metabolic_accesibility(data_frame, molecule, atom_map_col='reacting_atoms_in_query', mol_name='Query', alpha=0.5, cmap='RdYlGn_r')[source]
Creates a 2D image of a molecule with the atoms colored according to their metabolic accessibility.
- Parameters:
data_frame (pd.DataFrame) – The data frame containing the column with the atom map information.
molecule (Chem.Mol) – The molecule object to be drawn.
atom_map_col (str, optional) – The name of the column with the atom map information. The column should contain lists of integers representing the atom indices. Defaults to ‘reacting_atoms_in_query’.
mol_name (str, optional) – The name of the molecule to be displayed on the image. Defaults to ‘Query’.
alpha (float, optional) – The transparency level of the atom colors, ranging from 0 to 1. Defaults to 0.5.
cmap (str, optional) – The name of the color map to use for the color scale. Defaults to ‘RdYlGn_r’.
- Returns:
- The 2D image figure. The figure has the following features:
The molecule is drawn in a 2D projection with the atom symbols and bond types shown.
The atoms are colored according to their metabolic accessibility, which is calculated as the frequency of the atom in the atom map column of the data frame. Colors follow the cmap color map; with the default ‘RdYlGn_r’ (matplotlib's reversed red-yellow-green map), the lowest values are drawn in green and the highest in red.
A color bar is displayed on the right side of the image, showing the values of the metabolic accessibility.
The name of the molecule is displayed on the top left corner of the image.
- Return type:
matplotlib.figure.Figure
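The accessibility value described above is a per-atom frequency. A minimal sketch of that counting step, assuming each row of the atom map column holds a list of atom indices; the helper name atom_frequencies, the once-per-row counting, and the normalization by row count are illustrative assumptions, not the library's implementation:

```python
from collections import Counter


def atom_frequencies(atom_maps):
    """Return {atom_index: fraction of rows in which that atom reacts}."""
    total = len(atom_maps)
    if total == 0:
        return {}
    counts = Counter()
    for atoms in atom_maps:
        # Count each atom at most once per row (assumption).
        counts.update(set(atoms))
    return {index: count / total for index, count in sorted(counts.items())}


# Hypothetical contents of the 'reacting_atoms_in_query' column.
example_maps = [[0, 1], [1, 2], [1]]
frequencies = atom_frequencies(example_maps)
print(frequencies)
```

Atoms that never appear in the column are absent from the result, so a drawing step would need to treat them as zero.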
- microberx.display_molecules(data_frame, legends_col='metabolite_id', smiles_col='main_product_smiles', scale_from_column='confidence_score', columns_to_display=['reaction_id'], cmap='RdYlGn')[source]
Displays a grid of molecules from a data frame, using different colors to indicate the values of a specified column.
- Parameters:
data_frame (pd.DataFrame) – A pandas data frame containing the molecular data.
legends_col (str, optional) – The name of the column to use as the legend for each molecule. Default is ‘metabolite_id’.
smiles_col (str, optional) – The name of the column containing the SMILES strings for each molecule. Default is ‘main_product_smiles’.
scale_from_column (str, optional) – The name of the column to use for scaling the colors of the molecules. Default is ‘confidence_score’.
columns_to_display (list, optional) – A list of column names to display as tooltips when hovering over the molecules. Default is [‘reaction_id’].
cmap (str, optional) – The name of the matplotlib colormap to use for coloring the molecules. Default is ‘RdYlGn’.
- Returns:
- A mols2grid display object that shows the grid of molecules with legends, colors and tooltips. The display object has the following features:
Each molecule is drawn in a 2D projection with the atom symbols and bond types shown.
The legend of each molecule is displayed below the image, using the value from the legends_col column.
The color of each molecule is determined by the value from the scale_from_column column, using the cmap colormap. A color bar is displayed on the top right corner of the grid, showing the range of values.
The tooltip of each molecule is displayed when hovering over the image, showing the values from the columns_to_display list.
The grid can be filtered, sorted and searched by using the widgets on the top left corner of the grid.
- Return type:
mols2grid.display
- microberx.plot_relationships(data_frame, nodes=['reaction_id', 'metabolite_id'])[source]
Creates a Sankey diagram to visualize the evidences of metabolite annotations in a data frame.
- Parameters:
data_frame (pd.DataFrame) – A pandas data frame that contains the metabolite annotations and their evidences.
nodes (list, optional) – A list of column names that represent the nodes of the Sankey diagram. The default value is [‘reaction_id’, ‘metabolite_id’].
- Returns:
- A plotly figure object that contains the Sankey diagram. The diagram has the following features:
The nodes are arranged horizontally from left to right, corresponding to the order of the columns in the nodes list.
The links are drawn as curved lines connecting the nodes, representing the flow of evidences from one node to another.
The width of each link is proportional to the number of evidences for that pair of nodes.
The color of each link is determined by the color of the source node, using a distinct color for each node.
The label of each node is displayed on top of the node, using the value from the corresponding column in the data frame.
The tooltip of each link shows the source and target node names and the number of evidences for that link.
The title of the diagram shows the names of the columns used for the nodes.
- Return type:
plotly.graph_objects.Figure
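The link widths described above come down to counting how often each pair of node values occurs together. A minimal sketch of that aggregation, using plain dictionaries in place of data frame rows; the helper name sankey_links and the example IDs are hypothetical, and the library may aggregate differently:

```python
from collections import Counter


def sankey_links(rows, nodes=("reaction_id", "metabolite_id")):
    """Count co-occurrences of values in consecutive node columns."""
    links = Counter()
    for row in rows:
        # Link each column in `nodes` to the next one, left to right.
        for source_col, target_col in zip(nodes, nodes[1:]):
            links[(row[source_col], row[target_col])] += 1
    return links


# Made-up rows standing in for a predictions data frame.
rows = [
    {"reaction_id": "R1", "metabolite_id": "M1"},
    {"reaction_id": "R1", "metabolite_id": "M1"},
    {"reaction_id": "R1", "metabolite_id": "M2"},
    {"reaction_id": "R2", "metabolite_id": "M2"},
]
links = sankey_links(rows)
for (source, target), count in links.items():
    print(source, "->", target, count)
```

In Plotly, counts like these would typically populate the value field of a Sankey link, with source and target given as node positions.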
- microberx.plot_species_sunburst(sources, path='short')[source]
The function plot_species_sunburst creates a sunburst plot of the microbial species in the sources list. It uses the global variables MICROBES_DATA and MICROBES_REACTIONS that are loaded by the function check_if_microbes_databases_are_loaded. It also uses the Plotly Express library to create the sunburst plot.
- Parameters:
sources (list of str) – The source IDs of reactions from AGORA2. For example, ['CYSS3r'].
path (str, optional) – The taxonomic path of the sunburst plot, either 'full' or 'short'. If 'full', the path is ['Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']. If 'short', the path is ['Kingdom', 'Phylum', 'Order', 'Genus', 'Species']. Defaults to 'short'.
- Returns:
- A Plotly Figure object that contains the sunburst plot. The figure has the following features:
It has one subplot for each source in the sources list, arranged in a grid with three columns and as many rows as needed.
The sunburst plot shows the hierarchical distribution of the microbial species by their taxonomic ranks.
The color of each segment is determined by the phylum of the species.
- Return type:
plotly.graph_objects.Figure
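The three-column layout described above implies a simple mapping from a source's position in the list to a subplot cell. A minimal sketch of that arithmetic, using 1-based rows and columns as Plotly subplots do; the helper name grid_positions and every source ID except 'CYSS3r' are hypothetical:

```python
import math


def grid_positions(sources, n_cols=3):
    """Return the row count and a {source: (row, col)} mapping, 1-based."""
    n_rows = math.ceil(len(sources) / n_cols)
    positions = {
        source: (index // n_cols + 1, index % n_cols + 1)
        for index, source in enumerate(sources)
    }
    return n_rows, positions


# 'SRC_B', 'SRC_C' and 'SRC_D' are placeholder IDs for illustration.
n_rows, positions = grid_positions(["CYSS3r", "SRC_B", "SRC_C", "SRC_D"])
print(n_rows, positions)
```

Four sources therefore need two rows, with the fourth source starting the second row.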
- microberx.fetch_batch_sequences(entries=None, sequence_ids=None, email=None, database='protein')[source]
The function fetch_batch_sequences fetches a list of sequences from the NCBI Entrez database. It uses the Biopython library to access the Entrez API and parse the FASTA format. It also uses a helper function _fetch_sequence to fetch and return a single sequence.
- Parameters:
entries (list of str) – The accession numbers of the sequences to fetch. For example, ['WP_015582217.1', 'WP_001277567.1'].
sequence_ids (list of str) – The custom IDs to assign to the fetched sequences. For example, ['seq1', 'seq2']. The length of this list should match the length of the entries list.
email (str) – The email address of the user. The Entrez API requires it to identify the user and prevent abuse of the service. For example, ‘user@example.com’.
database (str, optional) – The name of the Entrez database to fetch the sequences from. For example, 'nucleotide'. Defaults to 'protein'.
- Returns:
- A list of Bio.SeqRecord.SeqRecord objects that contain the fetched sequences. Each sequence has its id attribute set to the corresponding value in the sequence_ids list. If an error occurs while fetching a sequence, that sequence is skipped and not added to the list.
- Return type:
list
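Because failed fetches are skipped, the returned list can be shorter than the entries list. A minimal sketch of that skip-on-error pattern, with the network call replaced by an injected stand-in; fetch_all, fake_fetch, and 'BAD_ACCESSION' are hypothetical names, and the real function returns SeqRecord objects rather than dictionaries:

```python
def fetch_all(entries, sequence_ids, fetch_one):
    """Fetch each entry, relabel it, and skip entries that fail."""
    if len(entries) != len(sequence_ids):
        raise ValueError("entries and sequence_ids must have the same length")
    records = []
    for accession, custom_id in zip(entries, sequence_ids):
        try:
            sequence = fetch_one(accession)
        except Exception:
            # Mirror the documented behavior: a failed fetch is skipped.
            continue
        records.append({"id": custom_id, "accession": accession, "sequence": sequence})
    return records


def fake_fetch(accession):
    """Stand-in for a network call; fails for one accession on purpose."""
    if accession == "BAD_ACCESSION":
        raise LookupError(accession)
    return "MKK"


records = fetch_all(
    ["WP_015582217.1", "BAD_ACCESSION", "WP_001277567.1"],
    ["seq1", "seq2", "seq3"],
    fake_fetch,
)
fetched_ids = [record["id"] for record in records]
print(fetched_ids)
```

Matching results back to inputs by id rather than by list position avoids misalignment when an entry is skipped.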
- microberx.get_interpro(sequence_id=None, sequence=None, email=None, sequence_type='protein', go_terms=True, pathways=True)[source]
The function get_interpro retrieves the InterProScan results for a given sequence from the EBI InterProScan 5 web service. It uses the requests library to access the REST API and the pandas library to parse the tab-separated values (TSV) format. It also accepts optional parameters to include GO terms and pathway information in the output.
- Parameters:
sequence_id (str) – The ID of the sequence to scan. For example, 'seq1'.
sequence (str) – The sequence to scan. For example, 'MKKLLIISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKALVLIAFAQYLQQCP'.
email (str) – The email address of the user. The EBI InterProScan 5 web service requires it to identify the user and prevent abuse of the service. For example, ‘user@example.com’.
sequence_type (str, optional) – The type of the sequence to scan, either 'protein' or 'nucleotide'. Defaults to 'protein'.
go_terms (bool, optional) – Whether to include GO terms in the output. Defaults to True.
pathways (bool, optional) – Whether to include pathway information in the output. Defaults to True.
- Returns:
- A pandas DataFrame that contains the InterProScan results. It has the following columns: [‘accesion’, ‘token’, ‘sequence_length’, ‘analysis’, ‘signature_accession’, ‘signature_description’, ‘start_location’, ‘stop_location’, ‘score’, ‘status’, ‘date’, ‘interpro_accession’, ‘interpro_description’, ‘go_annotations’, ‘pathways’]. The last two columns are optional, depending on the values of the go_terms and pathways parameters. If an error occurs while fetching the results, the error message is returned as a string instead.
- Return type:
pd.DataFrame or str
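Since the result can be either a table or an error message, callers may want to check the type before using it. A minimal sketch of such a guard; the helper name unwrap_interpro and the sample error text are hypothetical, and a plain list stands in for the DataFrame:

```python
def unwrap_interpro(result):
    """Return the result table, or raise if an error message came back."""
    if isinstance(result, str):
        raise RuntimeError(f"InterProScan request failed: {result}")
    return result


# A list stands in for the DataFrame of results.
table = unwrap_interpro([{"analysis": "Pfam"}])

try:
    # Made-up error text, used only to exercise the guard.
    unwrap_interpro("HTTP 400: invalid email")
    failed = False
except RuntimeError as error:
    failed = True
    message = str(error)

print(table, failed)
```

Raising early keeps a string from being passed on to code that expects a table, such as plot_interpro_results.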
- microberx.plot_interpro_results(interpro_results=None, compact=True)[source]
The function plot_interpro_results creates a bar plot of the InterProScan results for a given sequence. It uses the Plotly Express library to create the bar plot. It also accepts an optional parameter to choose between a compact or a detailed view of the results.
- Parameters:
interpro_results (pd.DataFrame) – The InterProScan results. It should have the following columns: ['accesion', 'token', 'sequence_length', 'analysis', 'signature_accession', 'signature_description', 'start_location', 'stop_location', 'score', 'status', 'date', 'interpro_accession', 'interpro_description', 'go_annotations', 'pathways']. The last two columns are optional, depending on the values of the go_terms and pathways parameters in the get_interpro function.
compact (bool, optional) – Whether to use a compact or a detailed view of the results. If True, the bar plot shows the InterPro accession and description for each segment of the sequence. If False, it shows the analysis and signature description for each segment. Defaults to True.
- Returns:
- A Plotly Figure object that contains the bar plot. The figure has the following features:
It has one subplot for the sequence.
The bar plot shows the distribution of the InterProScan results by their start and stop locations on the sequence.
The color of each segment is determined by the InterPro accession or the analysis, depending on the value of compact.
The text of each segment shows the InterPro description or the signature description, depending on the value of compact.
- Return type:
plotly.graph_objects.Figure
- microberx.run_multi_sequence_aligment(sequences_file=None, input_format='fasta', output_aligment_format='fasta')[source]
The function run_multi_sequence_aligment performs a multiple sequence alignment (MSA) and a phylogenetic tree construction for a given set of sequences using the ClustalW2 program. It uses the Biopython library to parse the input and output files and to run the ClustalW2 command line. It returns a matrix of the pairwise alignment scores.
- Parameters:
sequences_file (str) – The name of the file that contains the sequences to be aligned. For example, 'sequences.faa'.
input_format (str, optional) – The format of the input file. For example, 'phylip'. Defaults to 'fasta'.
output_aligment_format (str, optional) – The format of the output alignment file. For example, 'clustal'. Defaults to 'fasta'.
- Returns:
- A pandas DataFrame that contains the pairwise alignment scores, with the sequence IDs as the row and column labels. The values are the percentage of identical positions in each pairwise alignment, and the diagonal values are 100. For example:
      seq1   seq2   seq3
seq1  100.0  85.0   75.0
seq2  85.0   100.0  80.0
seq3  75.0   80.0   100.0
- Return type:
pd.DataFrame
- Side effects:
The function also creates two output files in the same directory as the input file:
A file that contains the MSA in the specified output format. For example, ‘sequences.fasta’.
A file that contains the phylogenetic tree in the Newick format. For example, ‘sequences.dnd’ with the content (((seq1:0.02941,seq2:0.02941):0.02941,seq3:0.05882):0.00000,);
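The matrix values are described as the percentage of identical positions in each pairwise alignment. A minimal sketch of one way to compute such a score from already aligned sequences; the helper names, the toy alignment, and the choice to ignore columns where both sequences have a gap are assumptions, not the library's implementation:

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical positions between two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns where both sequences have a gap (assumption).
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b) if not (a == gap and b == gap)
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def identity_matrix(alignment):
    """Build a nested dict of pairwise scores keyed by sequence ID."""
    ids = list(alignment)
    return {
        row: {col: percent_identity(alignment[row], alignment[col]) for col in ids}
        for row in ids
    }


# Toy alignment with made-up sequences.
alignment = {"seq1": "MKT-LLV", "seq2": "MKTALLV", "seq3": "MRT-LIV"}
matrix = identity_matrix(alignment)
for row_id, scores in matrix.items():
    print(row_id, {col: round(value, 1) for col, value in scores.items()})
```

The nested dictionary is symmetric with 100.0 on the diagonal, so it can be turned into a labeled table with pandas.DataFrame(matrix).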
- microberx.plot_similarity_matrix(similarity_matrix=None, homology_percentage=90, cmap='custom')[source]
The function plot_similarity_matrix creates a heatmap of the pairwise similarity scores for a given set of sequences using the Dash Bio library. It also accepts optional parameters to choose the color map and the homology percentage for the heatmap.
- Parameters:
similarity_matrix (pd.DataFrame) – The pairwise similarity scores for the sequences, with the sequence IDs as the row and column labels. The values should be the percentage of identical positions in each pairwise alignment. For example:
      seq1   seq2   seq3
seq1  100.0  85.0   75.0
seq2  85.0   100.0  80.0
seq3  75.0   80.0   100.0
homology_percentage (float, optional) – The threshold for the homology color, between 0 and 100. The dendrogram uses a different color for values at or above this threshold, which highlights homology clusters. Defaults to 90.
cmap (str or list, optional) – The color map for the heatmap. If 'custom', the color map is [[0.0, 'rgb(64, 126, 156)'], [0.5,'rgb(242,241,241)'], [1.0, 'rgb(195,85,58)']]. Otherwise, it can be one of the predefined color maps in the Dash Bio library: 'Blackbody', 'Bluered', 'Blues', 'Earth', 'Electric', 'Greens', 'Greys', 'Hot', 'Jet', 'Picnic', 'Portland', 'Rainbow', 'RdBu', 'Reds', 'Viridis', 'YlGnBu', 'YlOrRd'. Alternatively, it can be a custom color map given as a list of lists that map a value between 0 and 1 to a color. For example, [[0.0, 'red'], [0.5, 'white'], [1.0, 'blue']]. Defaults to 'custom'.
- Returns:
- A Plotly Figure object that contains the heatmap of the pairwise similarity scores. The figure has the following features:
It shows the sequence IDs and the similarity scores for each pair of sequences in the heatmap.
It allows the user to zoom in and out, pan, and select a region of the heatmap.
It allows the user to change the color map, the homology percentage, and the display options of the heatmap.
It shows a color bar that indicates the range of the similarity scores and the homology color.
- Return type:
plotly.graph_objects.Figure
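The homology threshold is an at-or-above comparison on the similarity scores. A minimal sketch that lists the sequence pairs meeting the threshold, using the example matrix from this entry as a nested dictionary; the helper name homologous_pairs is hypothetical, and this does not reproduce the clustering behind the dendrogram:

```python
def homologous_pairs(matrix, homology_percentage=90):
    """List unordered sequence pairs scoring at or above the threshold."""
    ids = list(matrix)
    pairs = []
    for position, row_id in enumerate(ids):
        # Only look at the upper triangle to skip self and duplicate pairs.
        for col_id in ids[position + 1:]:
            if matrix[row_id][col_id] >= homology_percentage:
                pairs.append((row_id, col_id))
    return pairs


matrix = {
    "seq1": {"seq1": 100.0, "seq2": 85.0, "seq3": 75.0},
    "seq2": {"seq1": 85.0, "seq2": 100.0, "seq3": 80.0},
    "seq3": {"seq1": 75.0, "seq2": 80.0, "seq3": 100.0},
}
pairs_at_default = homologous_pairs(matrix)
pairs_at_80 = homologous_pairs(matrix, homology_percentage=80)
print(pairs_at_default, pairs_at_80)
```

With the default of 90 no pair in this example qualifies, while a threshold of 80 admits two pairs, which shows how sensitive the grouping is to the chosen value.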
- microberx.plot_aligment_chart(msa_file=None, cmap='custom', color_scale='mae')[source]
The function plot_aligment_chart creates a chart of the multiple sequence alignment (MSA) for a given set of sequences using the Dash Bio library. It also accepts optional parameters to choose the color scale and the conservation method for the chart.
- Parameters:
msa_file (str) – The name of the file that contains the MSA in the FASTA format. For example, 'sequences.fasta'.
cmap (str or list, optional) – The color map for the conservation scores. If 'custom', the color map is [[0.0, 'rgb(64, 126, 156)'], [0.5,'rgb(242,241,241)'], [1.0, 'rgb(195,85,58)']]. Otherwise, it can be one of the predefined color maps in the Plotly library, such as 'viridis' or 'RdBu'. Defaults to 'custom'.
color_scale (str or dict, optional) – The color scale for the alignment symbols. It can be one of the predefined color scales in the Dash Bio library: 'buried', 'cinema', 'clustal', 'clustal2', 'helix', 'hydrophobicity', 'lesk', 'mae', 'nucleotide', 'purine', 'strand', 'taylor', 'turn', or 'zappo'. Alternatively, it can be a custom color map given as a dictionary that maps each nucleotide or amino acid to a color. For example, {'A': 'red', 'C': 'blue', 'G': 'green', 'T': 'yellow'}. Defaults to 'mae'.
- Returns:
- None
- Side effects:
The function creates and runs a Dash app that displays the chart of the MSA. The chart has the following features:
It shows the sequence IDs, the alignment symbols, and the conservation scores for each position in the alignment.
It allows the user to zoom in and out, pan, and select a region of the alignment.