Wednesday, January 25, 2012

Accuracy and Performance of the VABC Descriptor

Last week I added an implementation of the VABC molecular descriptor to the chemkit library.

To test the efficacy of the descriptor I used the molecule data from the MMFF94 validation suite which contains 753 drug-like organic compounds and their 3D coordinates.

I wrote a small application to generate the volume data. The code below will output a pair of comma-separated values containing the analytical and predicted volumes for each molecule in the file.

#include <iostream>

#include <chemkit/foreach.h>
#include <chemkit/molecule.h>
#include <chemkit/moleculefile.h>

using namespace chemkit;

int main()
    MoleculeFile file("MMFF94_hypervalent.mol2");
    bool ok =;
        std::cerr << file.errorString() << std::endl;
        return -1;

    foreach(const Molecule *molecule, file.molecules()){
        std::cout << molecule->descriptor("vdw-volume").toDouble();
        std::cout << ",";
        std::cout << molecule->descriptor("vabc").toDouble();
        std::cout << std::endl;

    return 0;

Below shows a plot of the VABC predicted volumes against the corresponding analytically calculated volumes.

As you can see, there is a very strong correlation (R^2 = 0.968) between the analytically calculated volume and the predicted volume. This indicates that the VABC descriptor does a fairly good job in predicting a compounds van der Waals volume from just its structure.

As far as performance goes, calculating the VABC volumes for this dataset was about five times faster than calculating the volumes analytically. For this data set it took approximately 220 msec for the VABC calculations and 1100 msec for the analytical calculations.

I would also be happy to send the raw data in a csv file to anyone who wants to perform further analysis.

UPDATE (1/28/2012): Analysis of Outliers

A number of people have asked for additional information concerning the outliers in the plot above (especially the data points in the lower left).

Looking at the structures for the molecules with the largest differences between predicted and analytical volumes gave me a few insights as to why the VABC descriptor underestimated in some cases.

First off, the VABC descriptor is only parametrized for 15 different elements (mostly all from the upper-right of the periodic table). For molecules containing other elements (such as potassium) the atom contribution is zero which lead to large underestimations. The analytical method uses the van der Waals radii from the Blue Obelisk Data Repository and thus can handle every known element. To correct for this I removed the molecules from the data set which contained elements that VABC was not parameterized for.

Secondly, VABC is designed for completely bonded molecules, not systems containing multiple separate molecules. For example, a few structures in the data set contain multiple water molecules surrounding another central molecule. To correct for this I removed the structures which contained multiple disconnected fragments.

After applying those two filters 16 structures were removed and the data set was left with 737 molecules.

I reran the analysis and produced the following plot which gave a new R^2 value of 0.979.

The major outliers from the lower left have been completely removed. Now, as you can see, most of the larger discrepancies come from VABC overestimating the volume. The largest difference was ~50 A^3 and came from the following, rather exotic, structure:

The VABC descriptor does not take bond order into account and so the above compound with its numerous double and triple bonds gets assigned a much larger volume.

As you can see from the new plot the VABC descriptor does a very good job assigning volumes to the compounds that it contains parameters for. Extending the parameters for a wider range of elements should definitely be possible.

Hope this has been helpful!

Tuesday, January 24, 2012

CAS support in chemkit

I added a new line format plugin to chemkit which can convert from a CAS number to a chemical structure via PubChem. The implementation was roughly a dozen lines of code and made use of the chemkit::PubChem class in the chemkit-web library.

This new plugin implements the chemkit::LineFormat interface just like the SMILES and InChI line formats and can be used in the same way. For example, to create a guanine molecule from its CAS number (108-95-2) one can simply use:

  chemkit::Molecule guanine("108-95-2", "cas");

This also works with the chemkit-translate command line application which allows users to easily convert between CAS numbers and any other molecular line format (e.g. SMILES, InChI, InChIKey) as follows:

  chemkit-translate -icas -osmiles 108-95-2

Which will output:

Let me know what you think!

Wednesday, January 18, 2012

Code examples posted on the chemkit wiki

I have posted a couple examples of how to use chemkit for various tasks and put them up on the new examples page on the chemkit wiki

Also I have contributed a few examples to the Chemistry Toolkit Rosetta project showing how one could use chemkit for various cheminformatics tasks and how it compares to other toolkits.

Let me know if there are any examples you would like to see!

Wednesday, January 11, 2012

VABC molecular descriptor added to chemkit

I saw this blog post by Egon Willighagen a while back and bookmarked it to take a look at later. He discusses a relatively new article (by Zhao et al) which describes a molecular descriptor (named VABC) that predicts the van der Waals volume of  molecule from just its atoms and bonds (e.g. from just its SMILES string). The title is "Fast Calculation of van der Waals Volume as a Sum of Atomic and Bond Contributions and Its Application to Drug Compounds" and can be found here.

Last week, while sitting on an airplane back from vacation, I hacked up an implementation of the descriptor for chemkit. You can see the commit on GitHub here. The implementation was fairly straightforward and I added a test suite using the data from the spreadsheet in the paper's supplementary information.

The simplest way to use it is via the Molecule::descriptor() method like so:

float value = molecule.descriptor("vabc").toFloat();

Chemkit is in an interesting place by being a toolkit that implements both an analytical van der Waals volume calculator (which uses the molecule's 3D coordinates) and this new predictive volume descriptor. When I find some time, I will make a post comparing the relative speed and accuracy of the two methods. Stay tuned!

Edit: I've posted a follow-up with an analysis of the accuracy and performance of the VABC descriptor here:

Tuesday, January 10, 2012

New removeAtomIf() method added to chemkit

A post by Andrew Dalke on the molcore-devel list gave an interesting idea to implement a remove_atoms() method which took a predicate and removed all atoms in a molecule that satisfy the predicate.

This lead me to add the removeAtomIf() method to the chemkit::Molecule class. The declaration is:

template<typename Predicate> removeAtomIf(Predicate predicate)

The predicate is any function object which takes a const Atom* and returns a bool. Using the boost::bind library makes it very easy to wrap various functions up as function objects that can be passed as predicates. For example, to remove every terminal hydrogen atom from a molecule one could simply:

molecule.removeAtomIf(boost::bind(&Atom::isTerminalHydrogen, _1));

This can even be used with the new C++11 lambdas. Code such as the following will remove all atoms that have an element symbol other than "C" (i.e. any non-carbon atom):

molecule.removeAtomIf([](const Atom *atom) { return atom->symbol() != "C"; });

Lambdas allows for arbitrarily complicated predicates to be defined at the point
removeAtomIf() is called. I, for one, cannot wait until C++11 features are widely supported.

Monday, January 9, 2012

Happy New Year

Happy new year everyone!

Over the last six months my life has changed quite a lot. I landed a great job at Kitware, Inc. working on the scientific visualization team. That meant moving all the way across the country from sunny Santa Barbara, California to Clifton Park, New York. I'm fairly settled in now but still getting used to the frigid weather.

In other news, chemkit has also gone through a great number of changes and improvements. I will try to find some time to blog about them soon.

One new years resolution of mine is to be more active in the open-source and open-science communities. To that end I will try to keep this blog more up-to-date with news of my life and developments in the chemkit and other projects.