chemspider

ChemSpiPy - A Python wrapper for the ChemSpider API

I recently had the task of matching up a large amount of poorly organised molecular properties data with the corresponding structures, where the data was only identified by name. To make matters worse, the names were mostly an inconsistent mix of common names and trade names.

A good solution to this problem is to use the data available in a chemical database like ChemSpider - you can enter any type of chemical identifier into the simple search and it will attempt to resolve a structure for you. It also has a web API, so the whole process can be automated and performed for thousands of structures at a time.

I was just about to write a Python interface to the API from scratch when I came across ChemSpiPy by Cameron Neylon, a bare-bones Python wrapper for the API. I made a few bug fixes and extended the functionality, so now you can easily search ChemSpider and retrieve properties and identifiers for chemical structures from your Python scripts.

You can download it from GitHub here. Cameron Neylon’s original version is also available.

Usage is pretty straightforward - install using pip:

pip install chemspipy

Then simply import it at the top of your Python script, and connect to ChemSpider by creating a ChemSpider instance using your security token:

from chemspipy import ChemSpider
cs = ChemSpider('<YOUR-SECURITY-TOKEN>')

All your interaction with the ChemSpider database should now happen through this ChemSpider object, cs.

If you already know the ChemSpiderID of a compound, you can use that to retrieve the full compound record:

c = cs.get_compound(2157)

Compound objects have the following properties:

  • csid: ChemSpider ID.
  • image_url: URL of a PNG image of the 2D chemical structure.
  • molecular_formula: Molecular formula.
  • smiles: SMILES string.
  • inchi: InChI string.
  • inchikey: InChIKey.
  • average_mass: Average mass.
  • molecular_weight: Molecular weight.
  • monoisotopic_mass: Monoisotopic mass.
  • nominal_mass: Nominal mass.
  • alogp: AlogP.
  • xlogp: XlogP.
  • common_name: Common Name.
  • mol_2d: MOL file containing 2D coordinates.
  • mol_3d: MOL file containing 3D coordinates.
  • mol_raw: Unprocessed MOL file.
  • image: 2D depiction as binary data in PNG format.
  • spectra: List of spectra.

These are all retrieved lazily from ChemSpider only when requested to avoid unnecessary calls to the API. More details about what the API returns are available in the ChemSpider API Documentation.
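
To illustrate what lazy retrieval means in practice, here is a minimal sketch of the pattern. It is not ChemSpiPy's actual implementation: the `LazyRecord` class, the `fetch` callable and the `fake_fetch` stub are all hypothetical names made up for this example, and no real API call is made.

```python
class LazyRecord:
    """Toy record whose fields are fetched on first access, then cached."""

    def __init__(self, csid, fetch):
        self.csid = csid
        self._fetch = fetch   # hypothetical callable: (csid, field) -> value
        self._cache = {}

    def _get(self, field):
        # Only call the (possibly slow) fetcher the first time a field is read.
        if field not in self._cache:
            self._cache[field] = self._fetch(self.csid, field)
        return self._cache[field]

    @property
    def smiles(self):
        return self._get("smiles")

    @property
    def molecular_formula(self):
        return self._get("molecular_formula")


calls = []  # records every simulated API request


def fake_fetch(csid, field):
    """Stand-in for a web request; returns a placeholder, not real data."""
    calls.append((csid, field))
    return f"<{field} for {csid}>"


record = LazyRecord(2157, fake_fetch)  # no request is made at this point
```

Reading `record.smiles` twice triggers a single simulated request, and `record.molecular_formula` is never fetched unless it is actually read.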

It is also possible to search using any kind of chemical identifier:

for result in cs.search('Glucose'):
    print(result)

Read the full documentation for more details of what is possible.
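
For the name-matching task described at the start of this post, a loop along the following lines is one way to go. This is only a sketch: `match_names` and `stub_search` are hypothetical helpers, and the search function is passed in as an argument so the example runs without a security token. In real use, `cs.search` could presumably be passed instead of the stub.

```python
def match_names(names, search):
    """Map each name to (first_hit, hit_count).

    first_hit is None when the search returned nothing. A hit_count above 1
    signals an ambiguous name that probably needs checking by hand.
    """
    matches = {}
    for name in names:
        hits = list(search(name))
        matches[name] = (hits[0] if hits else None, len(hits))
    return matches


def stub_search(name):
    """Offline stand-in for a database search; the hits are placeholders."""
    known = {"glucose": ["hit-a", "hit-b"], "aspirin": ["hit-c"]}
    return known.get(name.lower(), [])


matches = match_names(["Glucose", "Aspirin", "NotARealName"], stub_search)
```

Keeping the hit count alongside the first hit makes it easy to separate clean matches from names that returned nothing or returned several candidates.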

Note: Most operations require a security token that is issued to you automatically when you register for an RSC ID and then sign in to ChemSpider. Once you have done this, you can find your security token on your ChemSpider User Profile.

Antony Williams on ChemSpider at #ACSDenver

Antony Williams presented “ChemSpider: Does Community Engagement Work to Build a Quality Online Resource for Chemists?” at the 242nd American Chemical Society National Meeting today.  This is just one of his five presentations on ChemSpider this week.  He noted at the end of the session that the presentation will be on his SlideShare page soon.

In the presentation, he noted that he has supposedly written two books according to Amazon.  One is Collaborative Computational Technologies for Biomedical Research, and the other is I Hate Sex, but some author disambiguation may be in order here.  Maybe there is another Anthony J. Williams?

Throughout the presentation, he stressed how unreliable data from many supposedly reputable sources can be, and that the staff at ChemSpider work to double and triple check their sources.  They work with about 400 outside suppliers of chemical data, and many data points do not match up.  Many data suppliers get their data from other sources, so a single error is often copied from one database to the next.

Letting “the crowd” fix errors doesn’t really work because the interested crowd in chemistry is pretty small.

He mentioned many other interesting projects such as the Spectral Game, SpectraSchool, Open PHACTS, and the ChemSpider Synthetic Pages.  To date, only a little over 130 people have contributed to ChemSpider Synthetic Pages, a freely available interactive database of synthetic chemistry, and they would like more people to submit their data.

If you want more information on Antony Williams, you can also follow him on his Twitter account or read his personal blog.

CIRpy - A Python interface for the Chemical Identifier Resolver (CIR)

In the past I have used the ChemSpider API (through ChemSpiPy) to resolve chemical names to structures. Unfortunately this doesn’t work that well for IUPAC names and I found myself wondering whether it was worth setting up a system that would try a number of different resolvers. More specifically, I wanted a system that would first try using OPSIN to match IUPAC names, and if that failed, try a ChemSpider lookup. Just as I was about to start doing this myself, I came across the Chemical Identifier Resolver (CIR) that does exactly that (and much more).
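
The "try one resolver, then fall back to the next" idea can be sketched in a few lines. Everything here is hypothetical and offline: `resolve_with_fallback`, `iupac_stub` and `database_stub` are made-up names standing in for an IUPAC name parser and a database lookup, not real OPSIN or ChemSpider calls.

```python
def resolve_with_fallback(name, resolvers):
    """Return (label, structure) from the first resolver that succeeds, else None.

    resolvers is an ordered list of (label, callable) pairs. A resolver that
    raises or returns an empty result is treated as a miss.
    """
    for label, resolver in resolvers:
        try:
            structure = resolver(name)
        except Exception:
            structure = None  # sketch only: any failure counts as "not found"
        if structure:
            return label, structure
    return None


def iupac_stub(name):
    """Stand-in for a systematic-name parser; knows one name only."""
    return {"ethanol": "CCO"}.get(name.lower())


def database_stub(name):
    """Stand-in for a database lookup of common and trade names."""
    return {"aspirin": "CC(=O)Oc1ccccc1C(=O)O"}.get(name.lower())


chain = [("iupac", iupac_stub), ("database", database_stub)]
ethanol = resolve_with_fallback("Ethanol", chain)
aspirin = resolve_with_fallback("Aspirin", chain)
unknown = resolve_with_fallback("NotARealName", chain)
```

Returning the label along with the structure records which resolver produced each answer, which is useful when the resolvers differ in reliability.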

CIR is a web service created by the CADD Group at the NCI that performs various chemical name to structure conversions. In short, it will (attempt to) resolve the structure of any chemical identifier that you throw at it. Under the hood it uses a combination of OPSIN, ChemSpider and CIR’s own database.

To simplify interacting with CIR through Python, I wrote a simple wrapper called CIRpy that handles constructing URL requests and parsing XML responses. It’s available on GitHub here.
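
To give a feel for what constructing a request URL and parsing an XML response involves, here is a rough offline sketch. It is not CIRpy's code. The URL pattern is the public CIR one as I understand it, `build_url` and `parse_items` are hypothetical helper names, and `SAMPLE` is a hand-written guess at the general shape of a response rather than a captured one.

```python
from urllib.parse import quote
import xml.etree.ElementTree as ET

# Assumed base URL of the CIR web service.
CIR_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def build_url(identifier, representation, as_xml=True):
    """Build a request URL of the form <base>/<identifier>/<representation>[/xml]."""
    url = f"{CIR_BASE}/{quote(identifier)}/{representation}"
    return url + "/xml" if as_xml else url


def parse_items(xml_text):
    """Collect the text of every <item> element in a response document."""
    root = ET.fromstring(xml_text)
    return [item.text for item in root.iter("item")]


# Hand-written sample; element and attribute names are assumptions.
SAMPLE = """<request string="Aspirin" representation="smiles">
  <data id="1" resolver="name_by_cir">
    <item id="1">CC(=O)OC1=CC=CC=C1C(O)=O</item>
  </data>
</request>"""

url = build_url("Aspirin", "smiles")
items = parse_items(SAMPLE)
```

Percent-encoding the identifier with `quote` matters for names that contain spaces or other characters that are not safe in a URL path.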

Using it is a simple case of copying cirpy.py into a directory on your Python path. Here’s an example using the resolve function:

import cirpy

smiles_string = cirpy.resolve('Aspirin', 'smiles')

There are full details of all available options in the readme.

Amorolfine

From Wikipedia, the free encyclopedia

Systematic (IUPAC) name: (±)-(2R*,6S*)-2,6-dimethyl-4-{2-methyl-3-[4-(2-methylbutan-2-yl)phenyl]propyl}morpholine

Clinical data

  • AHFS/Drugs.com: International Drug Names
  • Legal status: ?

Identifiers

  • CAS number: 78613-35-1
  • ATC code: D01AE16
  • PubChem: CID 54260
  • ChemSpider: 49010
  • UNII: AB0BHP2FH0
  • KEGG: D02923
  • ChEBI: CHEBI:599440
  • ChEMBL: CHEMBL489411

Chemical data

  • Formula: C21H35NO
  • Molecular mass: 317.509 g/mol

Amorolfine (or amorolfin) is a morpholine antifungal drug that inhibits Δ14 reductase and Δ7-Δ8 isomerase, which depletes ergosterol and causes ignosterol to accumulate in the fungal cytoplasmic cell membranes. Marketed as Curanail, Loceryl, Locetar, and Odenil, amorolfine is commonly available in the form of a nail lacquer, containing 5% amorolfine as the active ingredient. It is used to treat onychomycosis (fungal infection of the toe- and fingernails).

Amorolfine 5% nail lacquer in once-weekly or twice-weekly applications has been shown in two studies to be between 60% and 71% effective in treating toenail onychomycosis; complete cure rates three months after stopping treatment (after six months of treatment) were 38% and 46%. However, full experimental details of these trials were not available and since they were first reported in 1992 there have been no subsequent trials.[1]

It is a topical solution for the treatment of toenail infections. Systemic treatments may be considered more effective.[1]

It is approved for sale over the counter in Australia and the UK (recently re-classified to over the counter status), and is approved for the treatment of toenail fungus by prescription in other countries. It is not approved for the treatment of onychomycosis in the United States or Canada, but can be ordered from there by mail from other countries.[2]