Tuesday, March 10, 2015

Creating composite figures with ggplot2 for reproducible research

So far, I have been preparing composite figures by plotting the data with ggplot2 and then assembling the panels in OmniGraffle or Adobe Illustrator. Of course, every time the data was updated, I had to go back to the vector editing program. After moving my manuscript from Word to knitr, I figured I should also try to cut out the image editing step.

ggplot2 does not make it easy to put different panels together in a seamless fashion and without any margins. However, by piecing together different StackOverflow answers, I found a way to extract different parts of the figures, and glue them back together with the gtable package. I can now produce a plot like this without a trip to Illustrator!

The solution is still a bit fragile: the dimensions of the PNG image and of the gtable rows and columns need to be adjusted manually to make everything look right. Here is a minimal working example (with some superfluous steps, I'm sure):
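The approach can be sketched as follows (a minimal sketch with two vertically stacked panels; the panel choice and image dimensions are my own, not necessarily those of the original Gist):

```r
library(ggplot2)
library(gtable)
library(grid)

# Two example panels that should share their x axis
p1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(wt, qsec)) + geom_point()

# Convert each plot into a gtable of graphical objects
g1 <- ggplotGrob(p1)
g2 <- ggplotGrob(p2)

# Stack the two gtables; rbind aligns their columns (axes and panels)
g <- rbind(g1, g2, size = "first")
g$widths <- unit.pmax(g1$widths, g2$widths)

# Draw the composite figure into a PNG; width/height need manual tuning
png("composite.png", width = 600, height = 800)
grid.draw(g)
dev.off()
```

The key step is working on the gtable representation instead of the rendered plots, which is what makes margin-free alignment possible.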

The output of the Gist should produce this image:

Friday, June 6, 2014

A new online viewer for multiple alignments

For my work on the alignment of coiled-coil proteins, I needed an alignment viewer that could highlight coiled-coil domains, since they contain less phylogenetic signal than other parts of the protein. Adding this to JalView seemed very complicated, and not possible at all in other web-based viewers like MView. I therefore created amview (for "annotated multiple alignment viewer"), which looks like this in practice:

Shown is the MSA for spd-5 (and here is the entry in my coiled-coil orthologs database). Amino acids are colored according to the ClustalW rules; coiled-coil residues are shown in a lighter color (and the "a" register of the heptad repeat is underlined). Below, two interaction domains are shown. At the very bottom, a small chart shows the degree of conservation across the whole alignment, which can be used to quickly scroll to the conserved regions. The cog at the top reveals two options: hiding columns with too many gaps, and hiding proteins that appear to be fragments.

I've only tested this thoroughly in Google Chrome, as I found other browsers to be too slow. Still, it's better than loading a Java applet, and it even runs on iPhones, iPads etc.! The implementation relies on Django on the server side and JavaScript/jQuery in the browser.

Monday, September 23, 2013

Introducing parallelRandomForest: faster, leaner, parallelized

Together with other members of Andreas Beyer's research group, I participated in the DREAM 8 toxicogenetics challenge. While the jury is still out on the results, I want to introduce my improvement of the R randomForest package, namely parallelRandomForest.

To cut to the chase, here is a benchmark made with genotype data from the DREAM challenge, using about 1.3 million genomic markers for 620 cell lines in the training set to predict toxicity for one drug (100 trees, mtry=1/3, node size=20):

  • randomForest (1 CPU): 3:50 hours (230 minutes), 24 GB RAM max.
  • parallelRandomForest (1 CPU): 37 minutes, 1.4 GB RAM max.
  • parallelRandomForest (8 CPUs): 5 minutes, 1.5 GB RAM max.

As you can see, parallelRandomForest is 6 times faster even when not running in parallel, and the memory consumption is about 16 times lower. Importantly, the algorithm is unchanged, i.e. parallelRandomForest produces the same output as randomForest.

For our submission, we wanted to try the simultaneous prediction of drug toxicity for all individuals and drugs. Our hope was that the larger training set would enable the Random Forest (RF) to identify, for example, drugs with similar structure that are influenced by similar genotype variations.

It quickly became clear that the standard RF package was ill-suited for this task. The RAM needed by this implementation is several times the size of the actual feature matrix, and there is no built-in support for parallelization. I therefore made several changes and optimizations, leading to reduced memory footprint, reduced run-time and efficient parallelization.

In particular, the major changes are:

  • not modifying the feature matrix in any way (by avoiding transformations and extra copies)
  • no unnecessary copying of columns
  • growing trees in parallel using forked processes, so that the feature matrix is stored only once in RAM regardless of the number of parallel jobs
  • using a single byte (values 0 to 255) per feature, instead of a double floating point number (eight bytes)
  • instead of sorting the items in a column when deciding where to split, the new implementation scans the column multiple times, each time collecting items that equal the tested cut-off value
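The fork-based parallelization (third point above) can be illustrated with R's built-in parallel package. This is a toy sketch of the idea, not the package's actual implementation, and the constant "tree" stands in for the real tree-growing routine:

```r
library(parallel)

# Grow ntree trees in forked worker processes. Each child inherits the
# feature matrix from the parent via copy-on-write, so the matrix is
# held in RAM only once, no matter how many workers run in parallel.
grow_forest <- function(x, y, ntree = 8, ncores = 2) {
  mclapply(seq_len(ntree), function(i) {
    boot <- sample(nrow(x), replace = TRUE)  # bootstrap sample per tree
    # A real implementation would grow a tree on x[boot, ] and y[boot];
    # here a constant model stands in for the tree.
    mean(y[boot])
  }, mc.cores = ncores)
}

x <- matrix(sample(0:2, 200, replace = TRUE), nrow = 20)
y <- rnorm(20)
forest <- grow_forest(x, y)
length(forest)  # 8 "trees"
```

Because mclapply forks rather than spawning fresh workers, no explicit serialization of the feature matrix to the children is needed.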

The latter two optimizations are especially adapted to the use of RFs on genotype matrices, which usually contain only the values 0, 1, and 2. (Homozygous for major allele, heterozygous, homozygous for minor allele.) The 16-fold reduction in memory consumption seen above is mainly caused by switching to bytes (8-fold reduction) and avoiding extra copies (2-fold reduction). 
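The scan-based split search (last point above) can be sketched in plain R. This is a simplified illustration of the idea for a regression split, not the package's actual C code:

```r
# For byte-valued features (e.g. genotypes 0/1/2), instead of sorting
# the column, scan it once per candidate cut-off and accumulate the
# sums needed to score the split by decrease in squared error.
best_split <- function(x, y) {
  stopifnot(length(x) == length(y))
  n <- length(y)
  total <- sum(y)
  u <- sort(unique(x))
  best <- list(cutoff = NA, score = -Inf)
  for (cutoff in u[-length(u)]) {   # last value cannot be a cut-off
    left <- x <= cutoff             # one scan over the column
    nl <- sum(left); nr <- n - nl
    sl <- sum(y[left]); sr <- total - sl
    # Larger value = larger decrease in residual sum of squares
    score <- sl^2 / nl + sr^2 / nr
    if (score > best$score) best <- list(cutoff = cutoff, score = score)
  }
  best
}

x <- c(0, 0, 1, 1, 2, 2)      # genotypes
y <- c(0, 0, 1, 1, 10, 10)    # response
best_split(x, y)$cutoff       # 1: splits {0,1} from {2}
```

With only three distinct values per column, a couple of linear scans is far cheaper than sorting, and it works directly on the single-byte representation.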

For our simultaneous mapping strategy, the combined training matrix contains about 120,000 columns and 52,700 rows. Total RAM consumption (including the test set and various accessory objects) was only 18 GB, and it took about 2.5 hours to grow 100 trees on 12 CPUs. Both in terms of time and RAM, training with the standard RF package would have been impossible.

The optimizations currently only cover the functions we needed, namely regression RFs. Classification is handled by a separate C/Fortran implementation that I didn't touch. I would estimate that with two weeks of dedicated effort, all functions of the RF package could be overhauled, and some restrictions (such as forcing the use of bytes) could be loosened (by switching implementations on the fly). However, I do not have the time for this. My question to the community is thus: How to proceed? Leave the package on Bitbucket? Ask the maintainers of the standard RF package to back-port my changes? Would someone volunteer for the coding task?

Monday, April 22, 2013

2D plot with histograms for each dimension (2013 edition)

In 2009, I wrote about a way to show density plots along both dimensions of a plot. When I ran the code again to adapt it to a new project, it no longer worked, because ggplot2 has become better in the meantime. Below is the updated code. Using the gridExtra package and this hint from the ggplot2 wiki, we get this output:

Source code:
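A minimal sketch of the gridExtra layout (the data, variable names, and histogram settings here are my own choices, not necessarily those of the original code):

```r
library(ggplot2)
library(gridExtra)

d <- data.frame(x = rnorm(500), y = rnorm(500))

# Empty placeholder for the unused top-right cell of the 2x2 grid
empty <- ggplot() + theme_void()

scatter    <- ggplot(d, aes(x, y)) + geom_point()
hist_top   <- ggplot(d, aes(x)) + geom_histogram(bins = 30)
hist_right <- ggplot(d, aes(y)) + geom_histogram(bins = 30) + coord_flip()

# 2x2 grid: wide column for the scatter plot, narrow for the histogram
grid.arrange(hist_top, empty, scatter, hist_right,
             ncol = 2, nrow = 2,
             widths = c(4, 1), heights = c(1, 4))
```

The widths/heights ratios control how much space the marginal histograms take relative to the main panel and usually need a bit of tuning.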

Monday, January 23, 2012

A publicly mandated medical terminology with a restrictive license

Update: Good news! In the meantime, I've been contacted by MedDRA (most probably unrelated to this blog post), and after a fruitful discussion it seems to be possible for me to base SIDER on MedDRA.

The world's regulatory agencies are increasingly adopting and mandating a new medical terminology scheme called MedDRA to capture side effects (adverse drug reactions) during the regulatory process. (For example, it is used in CVAR, AERS and other instances at the FDA [pdf], which in turn have been used in recent papers.) Sounds great, right? The only problem: the dataset is under a restrictive license (pdf): MedDRA data can only be shared among MedDRA subscribers (source [pdf]). I've clarified this via email with the help desk: one can only share text excerpts with fewer than 100 terms as examples, and no numeric codes.

This means: it is not possible to create a public dataset, or supplementary material for a paper, that contains a useful amount of data based on MedDRA. 

Two years ago, I created the SIDER database of drug–side effect relations (published in MSB). By relying only on publicly available drug labels and dictionaries like COSTART (with UMLS License Category 0), we were able to create a dataset that can be shared with everyone. (Disclaimer: we chose the license CC-BY-NC-SA.) If I were to base SIDER on MedDRA, the license would prevent me from making a machine-readable database available for download and further research. Thus, the next version of SIDER cannot be based on the dictionary of medical terms that regulatory agencies use at the moment.

What is especially sad about this is that the license fees themselves are not especially high: companies with an annual revenue of less than $1 million pay only $190 USD, and I doubt that there are hundreds of subscribers who earn more than $5 billion and thus pay the maximum fee of $62,850 USD. So it would take relatively little financial effort to declare MedDRA an open-access database.

IANAL, so it may be that a database like SIDER, which essentially contains only the following records: side effect identifier, side effect name, drug identifier, drug name, is far enough removed from the original terminology not to fall under the MedDRA license. I remain doubtful, however, especially after reading the restrictions on UMLS License Category 3 (under which MedDRA falls), such as: "incorporation of material [...] in any publicly accessible computer-based information system [...] including the Internet; [...] creating derivative works from material from these copyrighted sources".

Information on public health, like drugs and their side effects, should be openly available for research, second only to privacy concerns. I'm not sure how to begin to change this (beyond writing this), but ideas are very welcome.

(Small fnord detail: MSSO, which manages MedDRA, is a subsidiary of the military contractor Northrop Grumman.)

Friday, January 6, 2012

Embarrassingly parallel BLAST search

A quick note: to BLAST a file with many proteins against a database, you can use a recent version of GNU Parallel to fill up all CPUs (which the -num_threads option of BLAST doesn't do, as it only parallelizes some steps of the search):

cat query.fasta | parallel --block 100k --recstart '>' --pipe \
    blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > result.tsv

This will split the FASTA file into chunks of about 100 kilobytes, while making sure that the records stay intact (i.e. each chunk starts with a ">").

Thursday, August 11, 2011

ggplot2: Determining the order in which lines are drawn

In a time series, I want to plot the values of an interesting cluster versus the background. However, if I'm not careful, ggplot will draw the items in an order determined by their name, so background items will obscure the interesting cluster:

Correct: interesting lines drawn in front of the background.
Wrong: background lines obscure the interesting lines.

One way to solve this is to combine the label and name columns into one column that is used to group the individual lines. In this toy example, the line belonging to group 1 should overlay the other two lines:
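A sketch of that approach (the data and column names here are invented for illustration):

```r
library(ggplot2)

# Three lines; line "1" is the interesting one and should end up on top
d <- data.frame(
  time  = rep(1:10, 3),
  value = c(sin(1:10) + 0.5, sin(1:10), cos(1:10)),
  name  = rep(c("1", "2", "3"), each = 10),
  label = rep(c("interesting", "background", "background"), each = 10)
)

# Combine label and name into one grouping column; "interesting.1"
# sorts after the "background.*" groups, so it is drawn last (on top)
d$draworder <- paste(d$label, d$name, sep = ".")

ggplot(d, aes(time, value, group = draworder, colour = label)) +
  geom_line()
```

Since ggplot2 draws groups in the alphabetical order of their (factor) levels, prefixing the group name with the label is enough to push the background lines behind the interesting one.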