Monday, September 23, 2013

Introducing parallelRandomForest: faster, leaner, parallelized

Together with other members of Andreas Beyer's research group, I participated in the DREAM 8 toxicogenetics challenge. While the jury is still out on the results, I want to introduce my improvement of the R randomForest package, namely parallelRandomForest.

To cut to the chase, here is a benchmark made with genotype data from the DREAM challenge, using about 1.3 million genomic markers for 620 cell lines in the training set to predict toxicity for one drug (100 trees, mtry=1/3, node size=20):

  • randomForest (1 CPU): 3:50 hours (230 minutes), 24 GB RAM max.
  • parallelRandomForest (1 CPU): 37 minutes, 1.4 GB RAM max.
  • parallelRandomForest (8 CPUs): 5 minutes, 1.5 GB RAM max.

As you can see, parallelRandomForest is 6 times faster even when not running in parallel, and the memory consumption is about 16 times lower. Importantly, the algorithm is unchanged, i.e. parallelRandomForest produces the same output as randomForest.

For our submission, we wanted to try the simultaneous prediction of drug toxicity for all individuals and drugs. Our hope is that the increased training set enables the Random Forest (RF) to identify, for example, drugs with similar structure that are influenced by similar genotype variations.

It quickly became clear that the standard RF package was ill-suited for this task. The RAM needed by this implementation is several times the size of the actual feature matrix, and there is no built-in support for parallelization. I therefore made several changes and optimizations, leading to reduced memory footprint, reduced run-time and efficient parallelization.

In particular, the major changes are:

  • not modifying the feature matrix in any way (by avoiding transformations and extra copies)
  • no unnecessary copying of columns
  • growing trees in parallel using forked process, thus the feature matrix is stored only once in RAM regardless of the number of threads
  • using a single byte (values 0 to 255) per feature, instead of a double floating point number (eight bytes)
  • instead of sorting the items in a column when deciding where to split, the new implementation scans the column multiple times, each time collecting items that equal the tested cut-off value

The latter two optimizations are especially adapted to the use of RFs on genotype matrices, which usually contain only the values 0, 1, and 2. (Homozygous for major allele, heterozygous, homozygous for minor allele.) The 16-fold reduction in memory consumption seen above is mainly caused by switching to bytes (8-fold reduction) and avoiding extra copies (2-fold reduction). 

For our simultaneous mapping strategy, the combined training matrix contains about 120,000 columns and 52,700 rows. Total RAM consumption (including the test set and various accessory objects) was only 18 GB. It took about 2.5 hours to predict 100 trees on 12 CPUs. Both in terms of time and RAM, training with the standard RF package would have been impossible.

The optimizations currently only encompass the functions we needed, namely regression RF. Classification is handled by a separate C/Fortran implementation that I didn't touch. I would estimate that with two weeks of dedicated efforts, all functions of the RF package can be overhauled, and some restrictions (such as forcing the use of bytes) could be loosened (by switching implementations on the fly). However, I do not have the time for this. My question to the community is thus: How to proceed? Leave the package on BitBucket? Ask the maintainers of the standard RF package to back-port my changes? Would someone volunteer for the coding task?