Friday, March 14, 2008

Using Makefiles for jobs that run on a cluster

Makefiles are great. While you work on a project, they make it convenient to run the necessary scripts. When you come back to the project half a year later, you don't have to dig through your memory to recall how the scripts fit together—it's all there. (More on make, and related advice.)

However, in bioinformatics computational tasks are often too big for a single CPU, so jobs are submitted to a cluster. Then the Makefile doesn't help you much: it can't detect that jobs are still running on the cluster. There is qmake, but it only works if all your parallel jobs are specified in the Makefile. I usually write my parallel scripts so that they can submit as many instances of themselves as necessary to the cluster via qsub.

Therefore, I went ahead and wrote a small Python wrapper script that runs the job-submission script and sniffs the job ids from the output of qsub. It then monitors these jobs until they are all done, and only then does execution of the Makefile continue.

Here's an example of how to invoke the wrapper script from the Makefile:
pubchem_compound_inchi.tsv.gz:
	~/src/misc/qwrap.py ${SRC_DIR}/inchikeys.py
	cat ../inchikey/* | gzip > pubchem_compound_inchi.tsv.gz
You can download the code (released under a BSD license, written for SGE). I hope it's useful!
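For those curious how it works, here is a minimal sketch of the idea—not the actual qwrap.py, and the qsub/qstat output formats assumed in the comments are the typical SGE ones, so adapt as needed:

```python
import re
import subprocess
import sys
import time

# qsub confirmation lines typically look like:
#   Your job 12345 ("inchikeys.py") has been submitted
#   Your job-array 99.1-10:1 ("run.sh") has been submitted
JOB_ID_RE = re.compile(r'Your job(?:-array)? (\d+)')

def parse_job_ids(qsub_output):
    """Extract the SGE job ids from the text printed by qsub."""
    return [int(m.group(1)) for m in JOB_ID_RE.finditer(qsub_output)]

def wait_for_jobs(job_ids, poll_interval=60):
    """Poll qstat until none of the given job ids are listed anymore."""
    pending = set(job_ids)
    while pending:
        qstat = subprocess.run(["qstat"], capture_output=True,
                               text=True).stdout
        # The first two qstat lines are headers; column 1 is the job id.
        running = {int(line.split()[0])
                   for line in qstat.splitlines()[2:] if line.strip()}
        pending &= running
        if pending:
            time.sleep(poll_interval)

if __name__ == "__main__" and len(sys.argv) > 1:
    # Run the submission script, capture its output (which contains the
    # qsub confirmation lines), and block until all jobs leave qstat.
    output = subprocess.run(sys.argv[1:], capture_output=True,
                            text=True).stdout
    wait_for_jobs(parse_job_ids(output))
```

Because only the job ids are sniffed from the output, the submission script itself stays free to decide how many jobs to launch.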

Addendum: While hunting around in the SGE documentation, I found the "-sync" option, which, together with job arrays, probably provides the same functionality and additionally checks the exit status of the jobs.

Wednesday, March 12, 2008

InChIKeys for PubChem

An InChIKey is a sort of checksum for a chemical structure. It consists of two parts: the first captures the scaffold of the compound; the second is computed from the stereochemistry, proton positions, etc. This makes the InChIKey ideal for STITCH, because we want to merge tautomers and stereoisomers.
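To make the merging idea concrete, here is a small sketch in Python, using the standard InChIKey of water as an illustration (note that the exact key layout depends on the version of the InChI software):

```python
def scaffold_block(inchikey):
    """Return the first hyphen-separated block of an InChIKey,
    which encodes only the scaffold/connectivity of the compound."""
    return inchikey.split("-", 1)[0]

# Standard InChIKey of water. Stereoisomers and tautomers of a compound
# differ only in the later block(s), so grouping records by the first
# block merges them.
water = "XLYOFNOQVPJJNP-UHFFFAOYSA-N"
print(scaffold_block(water))  # XLYOFNOQVPJJNP
```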

PubChem doesn't yet provide InChIKeys in the SDF files that you can download. However, you can quickly generate a tab-delimited file with the help of the InChI toolkit (which you have to download and compile):
zcat SDF/Compound_00000001_00025000.sdf.gz | \
./cInChI-1 -STDIO -key -AuxNone -SDF:PUBCHEM_COMPOUND_CID | \
sed 's/Structure.*=//' | sed ':a; $!N;s/\nInChI/\tInChI/;ta;P;D' > result
(The sed command is from a FAQ.)
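If the sed incantation is hard to follow, the same pairing step can be done in a few lines of Python. This is just a sketch; it assumes the cInChI output alternates between a CID line (ending in "=<CID>", which the first sed above strips) and the corresponding InChI line:

```python
import sys

def pair_cid_inchi(lines):
    """Pair each CID line with the InChI line that follows it,
    yielding tab-separated records. Assumes the input alternates
    between a line ending in '=<CID>' and an 'InChI=...' line."""
    cid = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("InChI="):
            if cid is not None:
                yield cid + "\t" + line
                cid = None
        else:
            # Keep only what follows the last '=' (the CID).
            cid = line.rsplit("=", 1)[-1]

if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        for record in pair_cid_inchi(handle):
            print(record)
```

You would then pipe the cInChI output into this script instead of the two sed commands.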

Tuesday, March 11, 2008

Blogging for search engines

Related to my last post about the failings of Web 2.0 in biology, I want to ask the meta-question: why do we blog? David Crotty proposes four reasons: communication with other science bloggers, with non-specialists, with journalists, and finally with search-engine users. Unless you are a fairly well-known person, your regular audience will consist of your colleagues, collaborators, and a random grad student or two. A journalist might only come by if you managed to get a press release out about a Nature/Science/... paper. But Googlebot won't fail you: it reads all your posts!

Insightful blog posts won't go without an audience. For one, the small circle of followers of your blog will spread the news if you write something worth sharing. Far more important, though, are search engines. How do you survey a research area of interest? Most of us query PubMed, but also run a Google search in the hope that some meaningful analysis sits on a course website, in the text of a paper, or maybe even in a blog.

Biologists use Google to query for their proteins of interest. STRING is a fairly successful database, and lots of people google for it by name. However, almost one quarter of all visitors arriving from Google have actually searched for a protein name (random example) and found STRING. If you follow Lars J. Jensen's lead and publish your research observations and dead ends online, someone might serendipitously find them and use them in their own research. This would be the next step towards open science (with open data and open notebooks, which we might never fully reach): "publishing" small findings, data, and back stories about papers on your blog, enabling others to gain insight.

Web 2.0, CiteULike and Mekentosj Papers

Roland Krause bookmarked a great post: "Why Web 2.0 is failing in Biology" by David Crotty. The fact that I learned about this post simply by subscribing to his links on del.icio.us is itself a success of Web 2.0. I'm just not sure whether the same successes are already within reach in the context of science. I especially agree with David Crotty's observations about entry barriers: unless new tools and communities are very easy to use and provide great benefit, their rate of adoption will be low.

From my personal experience, I can share this: almost two years ago, I took part in giving a series of talks about Web 2.0 and how it might impact biology. Looking back, I'm not sure much has changed. I have been using CiteULike for the past three years or so, but I think I will now switch to Papers. CiteULike allows me to bookmark and tag my papers, but when I search my library, I mostly fall back on a custom Google search for the specific author.

Papers lets you easily build a collection of all the PDFs you have ever read. Thanks to Spotlight, you can run full-text searches over the articles and quickly retrieve the paper you have in mind. This avoids the overhead of applying tags that you never end up using. (GMail is another case in point: a quick search function eliminates the need for an intricate folder structure.)

I can't remember a specific case where the "Web 2.0" features of CiteULike ever worked for me. Peeking into other people's bibliographies can be interesting if you have some bookmarked papers in common, but the signal-to-noise ratio is very low. So, unless you know specific people or groups to follow, you'll mostly use CiteULike in "Web 1.0 mode". And then we come back to the initial observation: if a web tool is more complicated or offers fewer features than the desktop (or even paper) version, it won't be used much.

Update: Mendeley might be the Windows equivalent some of you have been looking for (15/08/08, more in this post).