microberx.MetaboliteAnalyzer

This module contains functions to compute and analyze molecular properties and descriptors of metabolites using various libraries and web services.

The module has the following functions:

  • compute_molecular_descriptors: Computes molecular descriptors and filters for a given data frame of SMILES strings.

  • compute_isotopic_mass: Computes the isotopic mass distribution for each molecular formula in a given data frame using the pyOpenMS library.

  • search_pubchem: Searches the PubChem database for compounds that match a given data frame of identifiers.

  • classify_molecules: Classifies molecules based on their SMILES strings using the ClassyFire web service.

Functions

compute_molecular_descriptors(data_frame, smiles_col)

Computes some molecular descriptors and filters for a given data frame of SMILES strings.

compute_isotopic_mass(data_frame, molformula_col)

Computes the isotopic mass distribution for each molecular formula in a given data frame using the pyOpenMS library.

search_pubchem(data_frame, entry_col[, entry_type])

Searches the PubChem database for compounds that match a given data frame of identifiers.

classify_molecules(data_frame, smiles_col, names_col)

Classifies molecules based on their SMILES strings.

Module Contents

microberx.MetaboliteAnalyzer.compute_molecular_descriptors(data_frame, smiles_col)[source]

Computes some molecular descriptors and filters for a given data frame of SMILES strings.

Parameters:
  • data_frame (pd.DataFrame) – A pandas data frame that contains SMILES strings of molecules.

  • smiles_col (str) – The name of the column that contains the SMILES strings.

Returns:

data_frame

The same data frame as input, but with additional columns for each molecular descriptor and filter computed. The descriptors and filters are:
  • MolWt: the molecular weight of the molecule

  • LogP: the octanol-water partition coefficient of the molecule

  • NumHAcceptors: the number of hydrogen bond acceptors in the molecule

  • NumHDonors: the number of hydrogen bond donors in the molecule

  • NumRotatableBonds: the number of rotatable bonds in the molecule

  • TPSA: the topological polar surface area of the molecule

  • MolFormula: the molecular formula of the molecule

  • Lipinski: a boolean value that indicates whether the molecule satisfies Lipinski’s rule of five. The rule states that most orally active, drug-like molecules have a molecular weight of no more than 500, a LogP of no more than 5, no more than 10 hydrogen bond acceptors, and no more than 5 hydrogen bond donors.

  • Veber: a boolean value that indicates whether the molecule satisfies Veber’s rule. The rule states that most orally active drugs have 10 or fewer rotatable bonds and a polar surface area of 140 Å² or less.

  • Brenk: a string that contains the names of the Brenk filters that the molecule matches, separated by semicolons. The Brenk filters are a set of 68 unwanted substructures that are associated with reactive functional groups.

  • PAINS: a string that contains the names of the PAINS filters that the molecule matches, separated by semicolons. The PAINS filters are a set of 480 substructures that are associated with pan-assay interference compounds.

Return type:

pd.DataFrame
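
The Lipinski and Veber flags described above reduce to threshold checks on the other descriptor columns. The following is a minimal sketch of those checks in plain Python, using the inclusive thresholds of the commonly cited form of each rule. It is not the module's actual implementation: the function names `passes_lipinski` and `passes_veber` are hypothetical, and the descriptor values are illustrative rather than computed from real molecules.

```python
def passes_lipinski(mol_wt, logp, h_acceptors, h_donors):
    """Return True if all four rule-of-five thresholds are met (inclusive bounds)."""
    return mol_wt <= 500 and logp <= 5 and h_acceptors <= 10 and h_donors <= 5


def passes_veber(rotatable_bonds, tpsa):
    """Return True if the molecule has <= 10 rotatable bonds and TPSA <= 140 A^2."""
    return rotatable_bonds <= 10 and tpsa <= 140


# Hypothetical descriptor rows, keyed by the column names listed above.
rows = {
    "small_molecule": {"MolWt": 180.2, "LogP": 1.3, "NumHAcceptors": 4,
                       "NumHDonors": 1, "NumRotatableBonds": 3, "TPSA": 63.6},
    "large_molecule": {"MolWt": 812.0, "LogP": 6.1, "NumHAcceptors": 12,
                       "NumHDonors": 6, "NumRotatableBonds": 15, "TPSA": 210.0},
}

for name, d in rows.items():
    lipinski = passes_lipinski(d["MolWt"], d["LogP"], d["NumHAcceptors"], d["NumHDonors"])
    veber = passes_veber(d["NumRotatableBonds"], d["TPSA"])
    print(name, lipinski, veber)
```

Whether the library applies strict or inclusive comparisons at the boundaries is not stated here, so values that sit exactly on a threshold may be flagged differently by the real function.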

microberx.MetaboliteAnalyzer.compute_isotopic_mass(data_frame, molformula_col)[source]

Computes the isotopic mass distribution for each molecular formula in a given data frame using the pyOpenMS library.

Parameters:
  • data_frame (pd.DataFrame) – A pandas data frame that contains the molecular formulas as a column.

  • molformula_col (str) – A string that specifies the name of the column that contains the molecular formulas.

Returns:

data_frame – A pandas data frame that has two additional columns: ‘probability_sum’ and ‘mass_distribution’. The ‘probability_sum’ column contains the sum of the probabilities of all isotopes for each molecular formula. The ‘mass_distribution’ column contains the mass and probability of each isotope as a string, separated by semicolons.

Return type:

pd.DataFrame
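
Because the ‘mass_distribution’ column stores its peaks as a single string, downstream code usually has to split it back into numbers. The sketch below assumes each entry has the form `mass:probability` and that entries are separated by semicolons, as the example output of this function suggests. The helper name `parse_mass_distribution` is hypothetical, and the sample string holds only the first three peaks of a longer distribution.

```python
def parse_mass_distribution(text):
    """Split 'mass:probability;mass:probability' into a list of (float, float) tuples."""
    peaks = []
    for entry in text.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing semicolon
        mass, probability = entry.split(":")
        peaks.append((float(mass), float(probability)))
    return peaks


# First three peaks of a distribution string in the assumed format.
peaks = parse_mass_distribution("180.0634:100.0;181.0668:10.72;182.0701:1.176")
most_abundant = max(peaks, key=lambda peak: peak[1])
print(most_abundant)  # (180.0634, 100.0)
```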

Example

>>> import pandas as pd
>>> from microberx.MetaboliteAnalyzer import compute_isotopic_mass
>>> df = pd.DataFrame({'formula': ['C6H12O6', 'C2H4O2', 'C3H8O3']})
>>> df = compute_isotopic_mass(df, 'formula')
>>> print(df)
    formula  probability_sum                                   mass_distribution
0  C6H12O6            1.0000  180.0634:100.0;181.0668:10.72;182.0701:1.176;183...
1   C2H4O2            1.0000  60.0211:100.0;61.0245:11.08;62.0279:1.216;63.031...
2   C3H8O3            0.9999  92.0473:100.0;93.0507:10.55;94.0541:1.159;95.057...

microberx.MetaboliteAnalyzer.search_pubchem(data_frame, entry_col, entry_type='smiles')[source]

Searches the PubChem database for compounds that match a given data frame of identifiers.

Parameters:
  • data_frame (pd.DataFrame) – A pandas data frame that contains the identifiers of the compounds to search for.

  • entry_col (str) – A string that specifies the name of the column that contains the identifiers.

  • entry_type (str, optional) – A string that specifies the type of the identifiers, such as ‘smiles’, ‘inchi’, ‘cid’, etc. The default is ‘smiles’.

Returns:

data_frame

The same data frame as input, but with additional columns for each PubChem property retrieved. The properties are:
  • PubChem_CID: the PubChem compound identifier, separated by semicolons if there are multiple matches.

  • PubChem_SID: the PubChem substance identifier, separated by semicolons if there are multiple matches. Only the first three SIDs are shown.

  • PubChem_Synonyms: the synonyms of the compound, separated by semicolons if there are multiple matches.

Return type:

pd.DataFrame
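
To make the role of `entry_type` more concrete, the sketch below builds the kind of PubChem PUG REST URL that looks up CIDs for one identifier in a given namespace. This is an assumption about what a comparable lookup looks like, not a description of how `search_pubchem` works internally; the module may well use a client library instead. The helper name `build_cid_lookup_url` is hypothetical, and no request is actually sent.

```python
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_cid_lookup_url(identifier, entry_type="smiles"):
    """Build a PUG REST URL that returns the CIDs matching one identifier."""
    # Percent-encode everything that is not URL-safe, e.g. '(' and '=' in SMILES.
    encoded = quote(str(identifier), safe="")
    return f"{PUG_REST_BASE}/compound/{entry_type}/{encoded}/cids/JSON"


print(build_cid_lookup_url("CCO"))
# https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/CCO/cids/JSON
print(build_cid_lookup_url("CC(=O)O"))
print(build_cid_lookup_url(2244, entry_type="cid"))
```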

microberx.MetaboliteAnalyzer.classify_molecules(data_frame, smiles_col, names_col)[source]

Classifies molecules based on their SMILES strings.

This function submits a query to the ClassyFire web service and returns a data frame with the classification results.

Parameters:
  • data_frame (pd.DataFrame) – The input data frame with the molecule information.

  • smiles_col (str) – The name of the column that contains the SMILES strings.

  • names_col (str) – The name of the column that contains the molecule names.

Returns:

The output data frame with the classification results added as new columns. The columns are:
  • kingdom: the name of the chemical kingdom of the molecule, such as ‘Organic compounds’, ‘Inorganic compounds’, etc.

  • superclass: the name of the chemical superclass of the molecule, such as ‘Lipids and lipid-like molecules’, ‘Organoheterocyclic compounds’, etc.

  • class: the name of the chemical class of the molecule, such as ‘Steroids and steroid derivatives’, ‘Benzodiazepines’, etc.

  • subclass: the name of the chemical subclass of the molecule, such as ‘Cholestane steroids’, ‘1,4-benzodiazepines’, etc.

Return type:

pd.DataFrame

Raises:

requests.exceptions.HTTPError – If the query to the ClassyFire web service fails.
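
The pairing of `names_col` and `smiles_col` suggests that each molecule is sent to ClassyFire together with a label. The sketch below shows one way such a query body could be assembled, assuming the service accepts newline-separated entries of the form `name<TAB>SMILES` in a `query_input` field alongside `label` and `query_type`. These field names reflect a common understanding of the ClassyFire API and should be checked against the service's own documentation. The helper name `build_classyfire_payload` and the label value are hypothetical, and nothing is submitted.

```python
import json


def build_classyfire_payload(names, smiles, label="microberx-example"):
    """Serialize name/SMILES pairs into a JSON body for a structure query."""
    if len(names) != len(smiles):
        raise ValueError("names and smiles must have the same length")
    # One molecule per line, with the name and the SMILES separated by a tab.
    lines = [f"{name}\t{smi}" for name, smi in zip(names, smiles)]
    body = {
        "label": label,
        "query_input": "\n".join(lines),
        "query_type": "STRUCTURE",
    }
    return json.dumps(body)


payload = build_classyfire_payload(["ethanol", "acetic acid"], ["CCO", "CC(=O)O"])
print(payload)
```

A caller that actually posts such a body would typically wrap the request in a `try`/`except requests.exceptions.HTTPError` block, matching the exception documented above.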