Friday, January 6, 2012

Embarrassingly parallel BLAST search

A quick note: to BLAST a file with many proteins against a database, you can use a recent version of GNU Parallel to fill up all CPUs (the -num_threads option of BLAST doesn't do this, as it only parallelizes some steps of the search):

cat query.fasta | parallel --block 100k --recstart '>' --pipe \
    blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > result.tsv

This will split the FASTA file into smaller chunks of about 100 kilobytes each, while making sure that every chunk contains only whole records (i.e. each chunk starts with a ">"). Each chunk is piped to its own blastp process.
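Because the number of parallel jobs depends on how many chunks the input yields, it can help to estimate that number before starting a long search. The following is only a rough sketch: estimate_chunks is a hypothetical helper (not part of GNU Parallel or BLAST), and it assumes that "100k" means roughly 100,000 bytes, which may differ between GNU Parallel versions. The real chunk count can also deviate slightly, since chunks are cut at record boundaries.

```shell
# Hypothetical helper: estimate how many chunks a file yields for a given
# block size in bytes. Not part of GNU Parallel or BLAST.
estimate_chunks() {
    local file=$1
    local block_bytes=${2:-100000}   # assumption: "100k" is about 100,000 bytes
    local size
    size=$(wc -c < "$file")
    # Ceiling division; an empty file yields 0 chunks.
    echo $(( (size + block_bytes - 1) / block_bytes ))
}
```

Calling estimate_chunks query.fasta prints the approximate number of blastp jobs to expect. If it prints 1, the file is smaller than one block and only a single process will run, so a smaller --block value would be needed to occupy more CPUs.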

3 comments:

Jermdemo said...

Nice feature, but I can't get this syntax to occupy more than 100% of a CPU (looking at "top"). Normally I see gnu parallel occupy up to 1200% of a CPU (12 cores).

Michael Kuhn said...

Perhaps it has something to do with the size of the FASTA file? The command line I posted takes 100k chunks of the file, so if you have a smaller file, it won't split it. You could try something like "wc" instead of blastp to see how many calls are made.

Sujai said...

This has been super useful. Thanks!

I spent a long time working out how to gnu-parallelise UCSC's blat and most tricks to specify the query file didn't work (e.g. "-" "</dev/stdin" etc), so am posting what did work for me:

https://gist.github.com/sujaikumar/8932968

(pasted gist because code not allowed in comments)