Monday, September 23, 2013

Introducing parallelRandomForest: faster, leaner, parallelized

Together with other members of Andreas Beyer's research group, I participated in the DREAM 8 toxicogenetics challenge. While the jury is still out on the results, I want to introduce my improvement of the R randomForest package, namely parallelRandomForest.

To cut to the chase, here is a benchmark made with genotype data from the DREAM challenge, using about 1.3 million genomic markers for 620 cell lines in the training set to predict toxicity for one drug (100 trees, mtry=1/3, node size=20):

  • randomForest (1 CPU): 3:50 hours (230 minutes), 24 GB RAM max.
  • parallelRandomForest (1 CPU): 37 minutes, 1.4 GB RAM max.
  • parallelRandomForest (8 CPUs): 5 minutes, 1.5 GB RAM max.

As you can see, parallelRandomForest is 6 times faster even when not running in parallel, and the memory consumption is about 16 times lower. Importantly, the algorithm is unchanged, i.e. parallelRandomForest produces the same output as randomForest.

For our submission, we wanted to try the simultaneous prediction of drug toxicity for all individuals and drugs. Our hope was that the larger training set would enable the Random Forest (RF) to identify, for example, drugs with similar structure that are influenced by similar genotype variations.

It quickly became clear that the standard RF package was ill-suited for this task. The RAM needed by this implementation is several times the size of the actual feature matrix, and there is no built-in support for parallelization. I therefore made several changes and optimizations, leading to reduced memory footprint, reduced run-time and efficient parallelization.
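A quick back-of-the-envelope check (my own arithmetic, not from the benchmark itself) shows how much of the memory difference is explained by the feature matrix alone:

```r
# Rough sanity check: size of the DREAM feature matrix from the
# benchmark above, stored as doubles vs. as single bytes.
markers    <- 1.3e6   # genomic markers (columns)
cell_lines <- 620     # training samples (rows)

as_doubles <- markers * cell_lines * 8 / 2^30   # GiB with 8-byte doubles
as_bytes   <- markers * cell_lines * 1 / 2^30   # GiB with 1-byte values

round(c(as_doubles, as_bytes), 1)   # -> 6.0 0.8
```

The 24 GB peak of the standard package is thus roughly four times the raw matrix ("several times the size"), while the 1.4 GB of parallelRandomForest is close to the byte-coded matrix itself.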

In particular, the major changes are:

  • not modifying the feature matrix in any way (avoiding transformations and extra copies)
  • no unnecessary copying of columns
  • growing trees in parallel using forked processes, so that the feature matrix is stored only once in RAM regardless of the number of threads
  • using a single byte (values 0 to 255) per feature value, instead of a double-precision floating point number (eight bytes)
  • instead of sorting the items in a column when deciding where to split, scanning the column multiple times, each pass collecting the items that equal the tested cut-off value

The latter two optimizations are especially adapted to the use of RFs on genotype matrices, which usually contain only the values 0, 1, and 2. (Homozygous for major allele, heterozygous, homozygous for minor allele.) The 16-fold reduction in memory consumption seen above is mainly caused by switching to bytes (8-fold reduction) and avoiding extra copies (2-fold reduction). 
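To illustrate the last optimization, here is a plain-R sketch (my own illustration, not the package's actual C code) of finding a regression split by scanning a byte-coded column once per candidate cut-off instead of sorting it:

```r
# Sketch: evaluate every possible cut-off of a byte-coded column x
# against a numeric response y, using the usual regression-tree
# criterion (reduction in within-group sum of squares).
best_split <- function(x, y) {
  ss <- function(v) sum((v - mean(v))^2)   # within-group sum of squares
  cutoffs <- head(sort(unique(x)), -1)     # every value except the largest
  best <- list(cutoff = NA, gain = 0)
  for (cf in cutoffs) {
    left <- x <= cf                        # one linear scan per cut-off
    gain <- ss(y) - ss(y[left]) - ss(y[!left])
    if (gain > best$gain) best <- list(cutoff = cf, gain = gain)
  }
  best
}

# Genotype column coded 0/1/2: only two cut-offs need to be tested.
x <- c(0, 0, 1, 1, 2, 2)
y <- c(1.0, 1.2, 1.1, 2.9, 3.0, 3.1)
best_split(x, y)$cutoff   # -> 1
```

With genotype data the column holds at most three distinct values, so the scan needs at most two passes, which is why avoiding the sort pays off.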

For our simultaneous mapping strategy, the combined training matrix contains about 120,000 columns and 52,700 rows. Total RAM consumption (including the test set and various accessory objects) was only 18 GB. It took about 2.5 hours to grow 100 trees on 12 CPUs. Both in terms of time and RAM, training with the standard RF package would have been impossible.

The optimizations currently only cover the functions we needed, namely regression RFs. Classification is handled by a separate C/Fortran implementation that I didn't touch. I estimate that with two weeks of dedicated effort, all functions of the RF package could be overhauled, and some restrictions (such as forcing the use of bytes) could be loosened (by switching implementations on the fly). However, I do not have the time for this. My question to the community is thus: how should we proceed? Leave the package on BitBucket? Ask the maintainers of the standard RF package to back-port my changes? Would someone volunteer for the coding task?

Monday, April 22, 2013

2D plot with histograms for each dimension (2013 edition)

In 2009, I wrote about a way to show density plots along both dimensions of a plot. When I ran the code again to adapt it to a new project, it no longer worked, because ggplot2 has improved in the meantime. Below is the updated code. Using the gridExtra package and this hint from the ggplot2 wiki, we get a scatter plot flanked by a histogram for each dimension:

Source code:
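The gist embedded in the original post is not reproduced in this archive; the following is a minimal reconstruction of the approach (scatter plot plus marginal histograms arranged with gridExtra::grid.arrange), using current ggplot2 syntax and made-up data:

```r
library(ggplot2)
library(gridExtra)

# Made-up data for illustration
set.seed(42)
d <- data.frame(x = rnorm(1000), y = rnorm(1000))

scatter <- ggplot(d, aes(x, y)) + geom_point(alpha = 0.3)

# Marginal histograms, stripped of axes and labels
hist_top   <- ggplot(d, aes(x)) + geom_histogram(bins = 30) + theme_void()
hist_right <- ggplot(d, aes(y)) + geom_histogram(bins = 30) +
  coord_flip() + theme_void()

empty <- ggplot() + theme_void()   # placeholder for the top-right corner

# 2x2 layout: histogram on top, scatter plot bottom-left,
# flipped histogram on the right
grid.arrange(hist_top, empty, scatter, hist_right,
             ncol = 2, nrow = 2, widths = c(4, 1), heights = c(1, 4))
```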

Monday, January 23, 2012

A publicly mandated medical terminology with a restrictive license

Update: Good News! In the meantime, I've been contacted by MedDRA (most probably unrelated to this blog post) and after a fruitful discussion it seems to be possible for me to base SIDER on MedDRA.

The world's regulatory agencies are increasingly adopting and mandating a new medical terminology scheme called MedDRA to capture side effects (adverse drug reactions) during the regulatory process. (For example, it is used in CVAR, AERS and other systems at the FDA [pdf], which in turn have been used in recent papers.) Sounds great, right? The only problem: the dataset is under a restrictive license (pdf): MedDRA data can only be shared among MedDRA subscribers (source [pdf]). I've clarified this via email with the help desk: one may only share textual examples of fewer than 100 terms, and no numeric codes.

This means: it is not possible to create a public dataset, or supplementary material for a paper, that contains a useful amount of data based on MedDRA. 

Two years ago, I created the SIDER database of drug–side effect relations (published in MSB). By relying only on publicly available drug labels and dictionaries like COSTART (with UMLS License Category 0), we were able to create a dataset that can be shared with everyone. (Disclaimer: we chose the license CC-BY-NC-SA.) If I were to base SIDER on MedDRA, the license would prevent me from making a machine-readable database available for download and further research. Thus, the next version of SIDER cannot be based on the dictionary of medical terms that regulatory agencies use at the moment.

What is especially sad about this is that the license fees themselves are not especially high: companies with an annual revenue of less than $1 million pay only $190 USD, and I doubt that there are hundreds of subscribers who earn more than $5 billion and thus pay the maximum fee of $62,850 USD. It would therefore take relatively little financial effort to make MedDRA an open-access database.

IANAL, so it may be possible that a database like SIDER, which essentially contains records of the form (side effect identifier, side effect name, drug identifier, drug name), is far enough removed from the source to not fall under the MedDRA license. I remain doubtful, however, especially after reading the restrictions on UMLS License Category 3, under which MedDRA falls, such as: "incorporation of material [...] in any publicly accessible computer-based information system [...] including the Internet; [...] creating derivative works from material from these copyrighted sources".

Information on public health, like drugs and their side effects, should be openly available for research, second only to privacy concerns. I'm not sure how to begin to change this (beyond writing this), but ideas are very welcome.

(Small fnord detail: MSSO, which manages MedDRA, is a subsidiary of the military contractor Northrop Grumman.)

Friday, January 6, 2012

Embarrassingly parallel BLAST search

A quick note: To blast a file with many proteins against a database, you can use a recent version of GNU Parallel to fill up all CPUs (something the -num_threads option of BLAST doesn't do, as it only parallelizes some steps of the search):

cat query.fasta | parallel --block 100k --recstart '>' --pipe \
    blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > result.tsv

This will split the FASTA file into chunks of about 100 kilobytes each, while making sure that records stay intact (i.e. each chunk starts with a ">").

Thursday, August 11, 2011

ggplot2: Determining the order in which lines are drawn

In a time series, I want to plot the values of an interesting cluster versus the background. However, if I'm not careful, ggplot will draw the items in an order determined by their name, so background items will obscure the interesting cluster:

[Two figures: "Correct: interesting lines in front of background" and "Wrong: background lines obscure interesting lines"]

One way to solve this is to combine the label and name columns into one column that is used to group the individual lines. In this toy example, the line belonging to group 1 should overlay the other two lines:
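The original toy data is not preserved in this archive; the following sketch (with made-up column names and values) shows the trick. Pasting the label in front of the line name yields group identifiers in which "interesting ..." sorts after "background ...", and ggplot2 draws groups in the order of their levels, so the interesting line is drawn last and ends up on top:

```r
library(ggplot2)

# Made-up time series: two background lines and one interesting line
set.seed(1)
d <- data.frame(
  time  = rep(1:20, 3),
  value = c(cumsum(rnorm(20)), cumsum(rnorm(20)), cumsum(rnorm(20))),
  name  = rep(c("line1", "line2", "line3"), each = 20),
  label = rep(c("background", "background", "interesting"), each = 20)
)

# "background line1" < "background line2" < "interesting line3",
# so the interesting group is drawn last (on top)
d$group <- paste(d$label, d$name)

ggplot(d, aes(time, value, group = group, colour = label)) +
  geom_line()
```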

Friday, July 15, 2011

SSH to same directory on other server

In my work environment, home directories are shared across servers. I often realize that I need to run a script on another server, e.g. due to RAM requirements or other jobs running on the current server. I finally figured out how to log in to the other server while staying in the same directory:

ssh -t server "cd $(pwd) && $SHELL"

(Of course, you can define aliases for different servers.)

Tuesday, June 21, 2011

Mekentosj Papers and Dropbox: get a bit more space

I use Dropbox to keep an automatic backup of my Papers library. However, I was getting close to my space allowance, and discovered that Papers adds some temporary files to the "Library.papers2" folder that eat up precious cloud space. Here is how to remove them from Dropbox:
  1. Quit Papers 2.0. Make a backup of your Papers library.
  2. Go to the "Library.papers2" folder inside your Papers library.
  3. Copy the folders "Thumbnails" and "Spotlight" somewhere (e.g. your Desktop). The other folders don't need much space in my case.
  4. Click the Dropbox icon in the menu bar, navigate to Preferences, Advanced, Selective Sync and de-select the "Thumbnails" and "Spotlight" folders. (Dropbox will now delete the folders.)
  5. Move the folders you copied before back into the "Library.papers2" folder.
  6. Via Dropbox's web interface, delete the online version of the two folders.
  7. Voila: more space (200 MB, in my case).