microberx.MetaboliteAnalyzer
This module contains functions to compute and analyze molecular properties and descriptors of metabolites using various libraries and web services.
The module has the following functions:
compute_molecular_descriptors: Computes molecular descriptors and filters for a data frame of SMILES strings.
compute_isotopic_mass: Computes the isotopic mass distribution for each molecular formula in a data frame using the pyOpenMS library.
search_pubchem: Searches the PubChem database for compounds that match the identifiers in a data frame.
classify_molecules: Classifies molecules based on their SMILES strings using the ClassyFire web service.
Functions
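| compute_molecular_descriptors(data_frame, smiles_col) | Computes molecular descriptors and filters for a data frame of SMILES strings. |
| compute_isotopic_mass(data_frame, molformula_col) | Computes the isotopic mass distribution for each molecular formula in a data frame using the pyOpenMS library. |
| search_pubchem(data_frame, entry_col, entry_type='smiles') | Searches the PubChem database for compounds that match the identifiers in a data frame. |
| classify_molecules(data_frame, smiles_col, names_col) | Classifies molecules based on their SMILES strings. |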
Module Contents
- microberx.MetaboliteAnalyzer.compute_molecular_descriptors(data_frame, smiles_col)
Computes some molecular descriptors and filters for a given data frame of SMILES strings.
- Parameters:
data_frame (pd.DataFrame) – A pandas data frame that contains SMILES strings of molecules.
smiles_col (str) – The name of the column that contains the SMILES strings.
- Returns:
data_frame –
- The same data frame as input, but with additional columns for each molecular descriptor and filter computed. The descriptors and filters are:
MolWt: the molecular weight of the molecule
LogP: the octanol-water partition coefficient of the molecule
NumHAcceptors: the number of hydrogen bond acceptors in the molecule
NumHDonors: the number of hydrogen bond donors in the molecule
NumRotatableBonds: the number of rotatable bonds in the molecule
TPSA: the topological polar surface area of the molecule
MolFormula: the molecular formula of the molecule
Lipinski: a boolean value that indicates whether the molecule satisfies Lipinski’s rule of five. The rule of five states that most drug-like molecules have a molecular weight of no more than 500 Da, a LogP of no more than 5, no more than 10 hydrogen bond acceptors, and no more than 5 hydrogen bond donors.
Veber: a boolean value that indicates whether the molecule satisfies Veber’s rule. The rule states that most orally active drugs have 10 or fewer rotatable bonds and a polar surface area of 140 Å² or less.
Brenk: a string that contains the names of the Brenk filters that the molecule matches, separated by semicolons. The Brenk filters are a set of 68 unwanted substructures that are associated with reactive functional groups.
PAINS: a string that contains the names of the PAINS filters that the molecule matches, separated by semicolons. The PAINS filters are a set of 480 substructures that are associated with pan-assay interference compounds.
- Return type:
pd.DataFrame
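The Lipinski and Veber columns listed above reduce to threshold checks on the descriptor values. The following is a minimal sketch of that logic, not the module's implementation: the function names `passes_lipinski` and `passes_veber` are hypothetical, the bounds are assumed to be inclusive (the module may treat boundary values differently), and the sample values are typed in by hand rather than computed.

```python
def passes_lipinski(mol_wt, logp, h_acceptors, h_donors):
    """Return True if all four rule-of-five thresholds are met.

    Assumption: bounds are inclusive; the module may treat boundaries differently.
    """
    return mol_wt <= 500 and logp <= 5 and h_acceptors <= 10 and h_donors <= 5


def passes_veber(rotatable_bonds, tpsa):
    """Return True if both Veber thresholds are met (inclusive bounds)."""
    return rotatable_bonds <= 10 and tpsa <= 140


# Hand-typed illustrative descriptor values, keyed by the column names above.
example_row = {
    "MolWt": 180.16,
    "LogP": 1.31,
    "NumHAcceptors": 3,
    "NumHDonors": 1,
    "NumRotatableBonds": 2,
    "TPSA": 63.6,
}

lipinski_ok = passes_lipinski(
    example_row["MolWt"],
    example_row["LogP"],
    example_row["NumHAcceptors"],
    example_row["NumHDonors"],
)
veber_ok = passes_veber(example_row["NumRotatableBonds"], example_row["TPSA"])
print(lipinski_ok, veber_ok)  # True True
```

Keeping the checks as plain functions of the descriptor values makes it easy to re-apply them with different thresholds to the columns that compute_molecular_descriptors returns.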
- microberx.MetaboliteAnalyzer.compute_isotopic_mass(data_frame, molformula_col)
Computes the isotopic mass distribution for each molecular formula in a data frame using the pyOpenMS library.
- Parameters:
data_frame (pd.DataFrame) – A pandas data frame that contains the molecular formulas as a column.
molformula_col (str) – A string that specifies the name of the column that contains the molecular formulas.
- Returns:
data_frame – The input data frame with two additional columns: ‘probability_sum’ and ‘mass_distribution’. The ‘probability_sum’ column contains the sum of the probabilities of all isotopes for each molecular formula. The ‘mass_distribution’ column contains the mass and probability of each isotope as a single string of ‘mass:probability’ pairs separated by semicolons.
- Return type:
pd.DataFrame
Example
>>> import pandas as pd
>>> from pyopenms import EmpiricalFormula, CoarseIsotopePatternGenerator
>>> from microberx.MetaboliteAnalyzer import compute_isotopic_mass
>>> df = pd.DataFrame({'formula': ['C6H12O6', 'C2H4O2', 'C3H8O3']})
>>> df = compute_isotopic_mass(df, 'formula')
>>> print(df)
   formula  probability_sum                                  mass_distribution
0  C6H12O6           1.0000  180.0634:100.0;181.0668:10.72;182.0701:1.176;183...
1   C2H4O2           1.0000  60.0211:100.0;61.0245:11.08;62.0279:1.216;63.031...
2   C3H8O3           0.9999  92.0473:100.0;93.0507:10.55;94.0541:1.159;95.057...
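Because ‘mass_distribution’ is stored as a single string, downstream code usually has to split it back into numbers. The sketch below shows one way to do that, assuming the ‘mass:value’ pairs separated by semicolons seen in the example output; `parse_mass_distribution` is a hypothetical helper, not part of the module, and the input is the visible prefix of the first example row.

```python
def parse_mass_distribution(cell):
    """Split a 'mass:value;mass:value' string into a list of (mass, value) float pairs."""
    pairs = []
    for item in cell.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate empty input or a trailing semicolon
        mass, value = item.split(":")
        pairs.append((float(mass), float(value)))
    return pairs


# Visible prefix of the first row in the example output above.
peaks = parse_mass_distribution("180.0634:100.0;181.0668:10.72;182.0701:1.176")

# Pick the peak with the largest second value (the most abundant isotopologue).
most_intense = max(peaks, key=lambda p: p[1])
print(most_intense)  # (180.0634, 100.0)
```

With pandas, the same helper could be applied column-wise, for example via `df['mass_distribution'].apply(parse_mass_distribution)`.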
- microberx.MetaboliteAnalyzer.search_pubchem(data_frame, entry_col, entry_type='smiles')
Searches the PubChem database for compounds that match a given data frame of identifiers.
- Parameters:
data_frame (pd.DataFrame) – A pandas data frame that contains the identifiers of the compounds to search for.
entry_col (str) – A string that specifies the name of the column that contains the identifiers.
entry_type (str, optional) – A string that specifies the type of the identifiers, such as ‘smiles’, ‘inchi’, ‘cid’, etc. The default is ‘smiles’.
- Returns:
data_frame –
- The same data frame as input, but with additional columns for each PubChem property retrieved. The properties are:
PubChem_CID: the PubChem compound identifiers (CIDs), separated by semicolons if there are multiple matches.
PubChem_SID: the PubChem substance identifiers (SIDs), separated by semicolons if there are multiple matches. Only the first three SIDs are shown.
PubChem_Synonyms: the synonyms of the compound, separated by semicolons if there are multiple matches.
- Return type:
pd.DataFrame
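The output columns pack multiple matches into one semicolon-separated string, with the SID list capped at three entries. The sketch below illustrates only that formatting convention. `join_ids` is a hypothetical helper, the identifiers are made-up placeholders, and how the module actually queries PubChem is not shown here.

```python
def join_ids(ids, limit=None):
    """Join identifiers with semicolons, optionally keeping only the first `limit`."""
    kept = ids if limit is None else ids[:limit]
    return ";".join(str(i) for i in kept)


# Made-up placeholder identifiers, not real PubChem records.
cids = [111, 222]
sids = [1001, 1002, 1003, 1004, 1005]

row = {
    "PubChem_CID": join_ids(cids),
    "PubChem_SID": join_ids(sids, limit=3),  # only the first three SIDs are kept
}
print(row)  # {'PubChem_CID': '111;222', 'PubChem_SID': '1001;1002;1003'}
```

To recover the individual identifiers from a returned cell, split on the semicolon, for example `row["PubChem_CID"].split(";")`.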
- microberx.MetaboliteAnalyzer.classify_molecules(data_frame, smiles_col, names_col)
Classifies molecules based on their SMILES strings.
This function submits a query to the ClassyFire web service and returns a data frame with the classification results.
- Parameters:
data_frame (pd.DataFrame) – The input data frame with the molecules information.
smiles_col (str) – The name of the column that contains the SMILES strings.
names_col (str) – The name of the column that contains the molecule names.
- Returns:
- The output data frame with the classification results added as new columns. The columns are:
kingdom: the name of the chemical kingdom of the molecule, such as ‘Organic compounds’, ‘Inorganic compounds’, etc.
superclass: the name of the chemical superclass of the molecule, such as ‘Lipids and lipid-like molecules’, ‘Organoheterocyclic compounds’, etc.
class: the name of the chemical class of the molecule, such as ‘Steroids and steroid derivatives’, ‘Benzodiazepines’, etc.
subclass: the name of the chemical subclass of the molecule, such as ‘Cholestane steroids’, ‘1,4-benzodiazepines’, etc.
- Return type:
pd.DataFrame
- Raises:
requests.exceptions.HTTPError – If the query to the ClassyFire web service fails.
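The four classification columns correspond to levels of the ClassyFire taxonomy. As a rough sketch of how such a result could be flattened into those columns, the code below assumes each level arrives as a nested object with a `name` field and that missing levels are null. That response shape is an assumption for illustration, `extract_levels` is a hypothetical helper, and `mock_entity` is hand-written rather than a real service response.

```python
LEVELS = ("kingdom", "superclass", "class", "subclass")


def extract_levels(entity):
    """Map each taxonomy level to its name, or None when the level is missing."""
    row = {}
    for level in LEVELS:
        node = entity.get(level)
        # Assumed shape: {"name": "..."} per level, or None/absent if unclassified.
        row[level] = node.get("name") if isinstance(node, dict) else None
    return row


# Hand-written mock result using the example names from the list above.
mock_entity = {
    "kingdom": {"name": "Organic compounds"},
    "superclass": {"name": "Lipids and lipid-like molecules"},
    "class": {"name": "Steroids and steroid derivatives"},
    "subclass": None,
}

levels = extract_levels(mock_entity)
print(levels["class"])  # Steroids and steroid derivatives
```

Returning None for a missing level keeps the four columns present for every row, so molecules that are only partially classified do not break the resulting data frame.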