Monday, September 23, 2013

Introducing parallelRandomForest: faster, leaner, parallelized

Together with other members of Andreas Beyer's research group, I participated in the DREAM 8 toxicogenetics challenge. While the jury is still out on the results, I want to introduce my improvement of the R randomForest package, namely parallelRandomForest.

To cut to the chase, here is a benchmark made with genotype data from the DREAM challenge, using about 1.3 million genomic markers for 620 cell lines in the training set to predict toxicity for one drug (100 trees, mtry=1/3, node size=20):

  • randomForest (1 CPU): 3:50 hours (230 minutes), 24 GB RAM max.
  • parallelRandomForest (1 CPU): 37 minutes, 1.4 GB RAM max.
  • parallelRandomForest (8 CPUs): 5 minutes, 1.5 GB RAM max.

As you can see, parallelRandomForest is 6 times faster even when not running in parallel, and the memory consumption is about 16 times lower. Importantly, the algorithm is unchanged, i.e. parallelRandomForest produces the same output as randomForest.

For our submission, we wanted to try the simultaneous prediction of drug toxicity for all individuals and drugs. Our hope is that the increased training set enables the Random Forest (RF) to identify, for example, drugs with similar structure that are influenced by similar genotype variations.

It quickly became clear that the standard RF package was ill-suited for this task. The RAM needed by this implementation is several times the size of the actual feature matrix, and there is no built-in support for parallelization. I therefore made several changes and optimizations, leading to reduced memory footprint, reduced run-time and efficient parallelization.

In particular, the major changes are:

  • not modifying the feature matrix in any way (by avoiding transformations and extra copies)
  • no unnecessary copying of columns
  • growing trees in parallel using forked process, thus the feature matrix is stored only once in RAM regardless of the number of threads
  • using a single byte (values 0 to 255) per feature, instead of a double floating point number (eight bytes)
  • instead of sorting the items in a column when deciding where to split, the new implementation scans the column multiple times, each time collecting items that equal the tested cut-off value

The latter two optimizations are especially adapted to the use of RFs on genotype matrices, which usually contain only the values 0, 1, and 2. (Homozygous for major allele, heterozygous, homozygous for minor allele.) The 16-fold reduction in memory consumption seen above is mainly caused by switching to bytes (8-fold reduction) and avoiding extra copies (2-fold reduction). 

For our simultaneous mapping strategy, the combined training matrix contains about 120,000 columns and 52,700 rows. Total RAM consumption (including the test set and various accessory objects) was only 18 GB. It took about 2.5 hours to predict 100 trees on 12 CPUs. Both in terms of time and RAM, training with the standard RF package would have been impossible.

The optimizations currently only encompass the functions we needed, namely regression RF. Classification is handled by a separate C/Fortran implementation that I didn't touch. I would estimate that with two weeks of dedicated efforts, all functions of the RF package can be overhauled, and some restrictions (such as forcing the use of bytes) could be loosened (by switching implementations on the fly). However, I do not have the time for this. My question to the community is thus: How to proceed? Leave the package on BitBucket? Ask the maintainers of the standard RF package to back-port my changes? Would someone volunteer for the coding task?

34 comments:

andri said...

Hi Michael
Great job! You asked how to proceed. I would like such improvements to be built in the original packages. So I'd vote for "Ask the maintainers of the standard RF package to back-port my changes".

Thanks, bye Andri

Zachary Mayer said...

I vote for this as well!

Nuncupative said...

Thank you for reporting peak memor5y consumption in your benchmarks, I wish everyone would do this :)

jake said...

Very nice write-up, Michael! I really hope to see these changes pushed upstream!

Pressman said...

Hi Michael,
I would like to offer to contribute development payments towards this. Perhaps you know a R coder you could contact and we could get this code base into production outside of the RF team. I know the guys who have the hands on RF at Salford are very protective of there IP and dont want much progress to occur outside there very expensive platform...

Christoph Molnar said...

This sounds like a good project for google summer of code But unfortunately the next time is in april 2014.

Michael Kuhn said...

Thanks for all the input. It would be great if a C/C++ programmer could help in pushing the changes upstream. The R implementation is under GPL, so license-wise it's free to develop, but I hope that there are no IP / patent mines.

I'll contact the maintainer of the R package and report back. :)

Tal Galili said...

Great work Michael.

I hope that the RF team will be willing to incorporate your improvements.


With regards,
Tal

Andy Liaw said...

This is Andy Liaw, the maintainer of the randomForest package. I learned about this because someone emailed me asking for help installing this on Windows. I had to google to find out about it. Anyway, I've been hoping someone would do many of the things Michael and his colleagues did for the past decade. I'm not a skilled C programmer (those who had seen the code can tell), so they remained my hopes. I've very glad someone had done the work. I'd be more than happy to work with Michael to incorporate the changes into randomForest, if he's willing to go that route.

Michael Kuhn said...

Hi Andy,

thanks for your offer, it is very encouraging. I have an idea how to make the standard package and my implementation compatible so that if a user provides a matrix with single bytes, it will use the more optimized algorithm. Otherwise, it uses doubles, but perhaps also with some optimizations. Someone else has to look into the classification forest, though, as I'm not using it at all. Perhaps you can email me (michael.kuhn@gmail.com), and then i can let you know when it's finished.

Alex Zolot said...

Hi Michael,

I'd like to try your app and just installed the package.
Could you please provide an example how to use it?

Thanks,

Alex

Michael Kuhn said...

Hi Alex,

you can use the package just like the normal randomForest package. You can add the nthreads parameter to the call to use the parallelization.

Akhil S Behl said...

Hey Michael,

I think this is great work. Hopefully, it gets integrated into randomForest.

However, there is another aspect of your problem that caught my attention. The dimensions of the problem you cite are similar to a problem that I am working on.

I have found that it takes an inordinate amount of memory to put these dimensions in an R matrix object. My problem allows me to use a sparse matrix but can't use RF on that.

Can you share what steps you took to manage this data size in that much memory?

Will be great help.

Thanks.

Michael Kuhn said...

Hi Akhil,

we saved memory by using "raw" objects (storage.mode(x) <- "raw"). So we directly create the big matrix as a raw object, saving lots of RAM. I don't know if you can convert your sparse matrix directly to a raw matrix.

Aaron said...

Michael,

This looks fantastic! I'm probably missing something obvious, but when I download the zip file you've pushed to GitHub and try to install it I get an MD5 sums error for both Mac and PC. Any thoughts about that? Have installations been going smoothly for other users?

Many thanks,
Aaron

Michael Kuhn said...

Hello Aaron,

where did you find a zip file? I didn't push one to BitBucket. But you're right, the MD5 checksums are probably wrong, I haven't updated them during development. R let me install it anyway.

Aaron said...

Thanks for writing back, Michael. On the bitBucket page (https://bitbucket.org/mkuhn/parallelrandomforest) there is a link on the main tab to a 222.2 kb download of the source file. That's what I was working with and what was throwing up the MD5 sums errors. No rush--just looking forward to experimenting with the package when it's ready for installation!

Michael Kuhn said...

Ah, I see -- this is just a service by BitBucket. If you unzip this and run "R CMD install ." in this directory (assuming you're on Mac/Linux), it should work.

Aaron said...

huh, still having a tough time with it. i unzipped the file, entered that directory via the command line, and ran the code snipper you supplied. MD5 sums still a problem. sorry to pester you with annoying questions! happy to flick you an email with the screenshot of the error that i'm getting, but that's probably more time than you want to spend on this at this point...

i'll look forward to playing with this whenever there's a release out there that i can install to R.

Claudio Cla said...

Hi, same problem here, I cant install it. I have windows 7. Does anyone have a compiled zip file for windows.
I cant wait to try it. Super job Michael!

Claudio

Claudio Cla said...

So, I've managed to install it for windows.
Here it is how I did it:

1: Install Rtools, from:
http://cran.r-project.org/bin/windows/Rtools/

2: download the parallelRandomForest from:
https://bitbucket.org/mkuhn/parallelrandomforest/downloads

3: extract the zip wherever you want:
my files are in :
D:\Mia\R\RandomForest\mkuhn-parallelrandomforest-804d353ec2f7\mkuhn-parallelrandomforest-804d353ec2f7

3: Open RGui then type:
library(devtools)
has_devel() ## check if your Rtools are properly installed

install("D:/Mia/R/RandomForest/mkuhn-parallelrandomforest-804d353ec2f7/mkuhn-parallelrandomforest-804d353ec2f7/",args="-build" )

4: enjoy it ;-)

Michael Kuhn said...

Hi Claudio, that's great! Does the parallelization (with the nthreads parameter) work on Windows?

Claudio Cla said...

So I did some preliminary test.

It works fine with regression, training and prediction no problem, but for classification I can only train the model but as soon as I try to predict even with the same data that I’ve used for training the all RGui crashes.

Just some feedback…to make it perfect ;-)

Thanks again for all….
Claudio

Claudio Cla said...

Yes it does, that’s how I did the parallelization in windows:

nNumberOfCores<-6
nNumberOfTrees<-10000

library("doSNOW")
cl<-makeCluster(nNumberOfCores, type="SOCK")
registerDoSNOW(cl)

library("parallelRandomForest")

M1 <- foreach(ntree=rep(round(nNumberOfTrees/nNumberOfCores,0), nNumberOfCores), .combine=combine, .multicombine=TRUE,
.packages='parallelRandomForest') %dopar% {
randomForest(X, Y, ntree=ntree)
}


stopCluster(cl)

Steven Pollack said...

So, I'm trying to build this and it's not working out.

I've cloned the repo, then I cd into the directory and try {R CMD INSTALL .} ...

The output is
* installing to library '/usr/lib64/R/library'
* installing *source* package 'parallelRandomForest' ...
files 'src/classTree.c', 'src/regTree.c', 'src/regrf.c', 'src/rf.c', 'src/rfutils.c' are missing
files 'DESCRIPTION', 'NAMESPACE', 'R/combine.R', 'R/importance.R', 'R/predict.randomForest.R', 'R/randomForest.default.R', 'R/rf
News.R', 'inst/CITATION', 'man/combine.Rd', 'man/rfcv.Rd', 'src/rf.h' have the wrong MD5 checksums
** libs
g++ -m64 -shared -L/usr/local/lib64 -o parallelRandomForest.so classTree.o regrf.o rf.o rfsub.o rfutils.o -lgfortran -lm -L/usr/
lib64/R/lib -lR
regrf.o: file not recognized: File format not recognized
collect2: ld returned 1 exit status
make: *** [parallelRandomForest.so] Error 1
ERROR: compilation failed for package 'parallelRandomForest'
* removing '/usr/lib64/R/library/parallelRandomForest'

Michael Kuhn said...

Hi Steven, this seems to be a problem with the compilation on your side. Could you please look around for the particular error message you're getting?

Steven Pollack said...

Yeah, I figured it out (I was using a tar'd version of a compilation I did on a Mac).

I was wondering if you could explicitly give an example of how we're to use this?

I tried doing something like

pRF.test <- randomForest(label~.,ntree=100,mtry=56,nthreads=8,do.trace=10,data=training.set,xtest=test.set[,-1],ytest=test.set[,1])

But received the following error:

Warning: memory-saving techniques have not been applied yet to classification RF.
Error in combine(mccollect()) :
Argument must be a list of randomForest objects

What gives?

Michael Kuhn said...

It looks like you're attempting to perform a classification, not a regression. As written towards the end of the blog post, I only improved regression. Still, classification could still work. It would be great if you could fix this error message in your fork, and I'd be happy to incorporate the fix. :) Even better, you could see if the parallelization can also be implemented in the classification part.

Steven Pollack said...

Sorry, I was cavalier when I forked your repo. I don't really have the chops to work on your code (I don't know C at all).

Looks like I'm stuck waiting for someone to integrate your changes into randomForest...

Eric Weinstein said...

Hi, Since it's been a few months since the last comment, just wanted to check in and find out - what is the latest with incorporating this work into the random forest package?

Michael Kuhn said...

Dear Eric, I had been in contact with Andy Liaw in November, but didn't hear anything back since. So I'm not sure if anything is going to happen. :-(

pommedeterresautee said...

Hi Michael, Is there a compiled version of this package for Windows?

Regards

Michael Kuhn said...

@pomme: sorry, there is no compiled Windows version -- but please look at the earlier comments where someone figured out how to compile

G.C. said...

How do I get rid of the following error:

Error in .Call(ifelse(raw_matrix, "callRegRFRaw", "callRegRFDouble"), :
"callRegRFDouble" not available for .Call() for package "parallelRandomForest"