Wednesday, February 23, 2011

Optimizing the UFF force field by unrolling the pow() function

Today I added a new performance benchmark to the chemkit library. The new benchmark, called uridine-minimization, measures the time to perform an energy minimization until convergence on a uridine molecule using the UFF force field.

Running the benchmark gives:
2,985 msecs per iteration
After running the benchmark through a profiler (I used callgrind from the valgrind package) I found some pretty interesting results and a potential easy performance optimization.

Total time spent in the program

Time spent in functions called by UffVanDerWaalsCalculation::energy()
We see that 53% of the total time the program is in the UffVanDerWaalsCalculation::energy() method and, of that, 85% of the time is spent executing the pow()function. A potential optimization is to remove the pow()function call and replace it with a manual calculation of the two exponent terms.

The current code is:
chemkit::Float UffVanDerWaalsCalculation::energy() const
{
    const chemkit::ForceFieldAtom *a = atom(0);
    const chemkit::ForceFieldAtom *b = atom(1);

    chemkit::Float d = parameter(0);
    chemkit::Float x = parameter(1);
    chemkit::Float r = distance(a, b);

    return d * (-2 * pow(x/r, 6) + pow(x/r, 12));
}

And the new code becomes:
chemkit::Float UffVanDerWaalsCalculation::energy() const
{
    const chemkit::ForceFieldAtom *a = atom(0);
    const chemkit::ForceFieldAtom *b = atom(1);

    chemkit::Float d = parameter(0);
    chemkit::Float x = parameter(1);
    chemkit::Float r = distance(a, b);

    chemkit::Float xr = x / r;
    chemkit::Float xr2 = xr * xr;
    chemkit::Float xr4 = xr2 * xr2;
    chemkit::Float xr6 = xr4 * xr2;
    chemkit::Float xr12 = xr6 * xr6;

    return d * (-2 * xr6 + xr12);
}
After making this change the benchmark time fell dramatically:
1,803 msecs per iteration
The new code is 60% faster. A good improvement. But, now we had to make sure that the new method is just as accurate as the old method. To test the difference in the normal and unrolled functions I wrote the following code which calculates the relative error introduced and the speed of each version:
#include <cmath>
#include <ctime>
#include <limits>
#include <iostream>

// calculates the van der waals energy for the given distance 'r'
double energy(double r)
{
    // parameters for Helium
    double d = 0.056;
    double x = 2.362;
    
    // [Rappe 1992] equation 20
    return d * (-2 * pow(x/r, 6) + pow(x/r, 12));
}

// calculates the van der waals energy for the given distance 'r'
//
// unrolls the pow() function for increased performance
double energy_unrolled(double r)
{
    // parameters for Helium
    double d = 0.056;
    double x = 2.362;

    double xr = x / r;
    double xr2 = xr * xr;
    double xr4 = xr2 * xr2;
    double xr6 = xr4 * xr2;
    double xr12 = xr6 * xr6;
    
    // [Rappe 1992] equation 20
    return d * (-2 * xr6 + xr12);
}

// returns the time (in msecs) per iteration of 'function' called 
// with r from 0 to 100 in picometer steps
double benchmark(double (*function)(double), int iterations)
{
    time_t start = time(0);

    for(int i = 0; i < iterations; i++){
        for(double r = 1e-2; r < 100; r += 1e-2){
            volatile double e = function(r);
        }
    }

    return difftime(time(0), start) / iterations * 1000;
}

// measures the speed of the energy() and energy_unrolled() functions
void measure_speed()
{
    std::cout << "energy time: " << benchmark(energy, 1e5) << " msec" << std::endl;
    std::cout << "energy_unrolled time: " << benchmark(energy_unrolled, 1e5) << " msec" << std::endl;
}

// measures the difference of the results of the energy() and
// energy_unrolled() functions
void measure_difference()
{
    int iterations = 0;
    double relative_error_sum = 0;
    double relative_error_max = 0;

    for(double r = 1e-2; r < 100; r += 1e-2){
        double e = energy(r);
        double e0 = energy_unrolled(r);

        double difference = std::abs(e - e0);
        double relative_error = std::abs(difference / e);

        relative_error_sum += relative_error;

        if(relative_error > relative_error_max){
            relative_error_max = relative_error;
        }

        iterations++;
    }

    double relative_error_mean = relative_error_sum / iterations;

    std::cout << "max relative error: " << relative_error_max << std::endl;
    std::cout << "mean relative error: " << relative_error_mean << std::endl;
}

int main(void)
{
    measure_speed();
    measure_difference();

    return 0;
}
The results are:
$ g++ energy.cpp -o energy -O2
$ ./energy
energy time: 2.28 msec
energy_unrolled time: 0.16 msec
max relative error: 5.6035e-15
mean relative error: 1.28544e-16
We see that the unrolled version is significantly faster but introduces a mean relative error of ~1.3e-16.

Instead of manually optimizing the code, it would be great if we could get the compiler to optimize it for us. Looking through the g++ documentation turned up the -ffast-math compiler flag. It claims to increase the performance of floating-point math at the expense of breaking IEEE floating-point compliance and potentially reducing accuracy.

Compiling and running the test with the -ffast-math flag gives:
$ g++ energy.cpp -o energy -O2 -ffast-math
$ ./energy
energy time: 0.13 msec
energy_unrolled time: 0.14 msec
max relative error: 1.81079e-14
mean relative error: 9.84612e-17
So we can see that using the -ffast-math flag enables the same optimization that we performed manually.

Going back to the UFF force field I compiled the UFF plugin with the -ffast-math flag and re-ran the benchmark. It gave:
1,590 msecs per iteration
Using the -ffast-math option increased the speed by 88%, an even bigger improvement than just manually unrolling the pow() function.

For now I will leave the unoptimized energy function and let the user choose if they want to trade speed for accuracy by passing the -ffast-math compiler flag when compiling.

Friday, February 18, 2011

Counting heteroatoms in Python: try/except vs. if

I found this answer on the Blue Obelisk eXchange interesting and, from the comments, seemingly controversial. The answer states that in Python using try/except with an assert could be faster than just an if statement. Intuitively I felt that try/except would be much slower than a simple if statement but I wanted to know for sure. So I opened my editor, wrote some code, and found out which actually performed better.

I used the chemkit library's Atom::isHeteroatom() method to perform the counting of the heteroatoms. For the timing I used Python's timeit module.

The code reads the 753 molecules from the MMFF validation suite's test file MMFF94_hypervalent.mol2 and then uses the timeit.Timer class to measure the execution time of counting the total number of heteroatoms.

Here is the code:
import timeit
import chemkit

# Counts the number of heteroatoms in the molecule
# using an if statement
def heteroatom_count_if(molecule):
    count = 0

    for atom in molecule.atoms():
        if atom.isHeteroatom():
            count += 1

    return count

# Counts the number of heteroatoms in the molecule
# using a try/except statement
def heteroatom_count_try(molecule):
    count = 0

    for atom in molecule.atoms():
        try:
            assert(atom.isHeteroatom())
            count += 1
        except:
            continue

    return count

# Counts the total number of heteroatoms in the list
# of molecules using the given heteroatom_function
def count(molecules, heteroatom_function):
    count = 0

    for molecule in file.molecules():
        count += heteroatom_function(molecule)

    return count

if __name__ == '__main__':
    # read molecules from file
    file = chemkit.ChemicalFile("MMFF94_hypervalent.mol2")
    if not file.read():
        print 'Error reading file: ' + file.errorString()
        exit()

    # list of molecules
    molecules = file.molecules()

    # measure heteroatom_count_if
    t = timeit.Timer("count(molecules, heteroatom_count_if)", 
                     "from __main__ import count, molecules, heteroatom_count_if")

    print 'heteroatom_count_if time: ' + str(t.timeit(500))

    # measure heteroatom_count_try
    t = timeit.Timer("count(molecules, heteroatom_count_try)", 
                     "from __main__ import count, molecules, heteroatom_count_try")

    print 'heteroatom_count_try time: ' + str(t.timeit(500))
And here are the results:
$ python heteroatoms.py
heteroatom_count_if time: 8.65832996368

heteroatom_count_try time: 17.5534369946
So we see that the if statement version is roughly twice as fast as the try/except version. That was my intuition, but it's good to be sure.