Thursday, February 4, 2010

Human Protein Atlas data for download

As I just learned in our lab's journal club, the data from the Human Protein Atlas is available for download, thanks to their recent paper in MSB. Curiously enough, the HPA help page still states that they do not make data available as a matter of "general policy."


Friday, January 15, 2010

A Newick parser for Python, supporting internal node labels

I just pushed a fork of Thomas Mailund's nice Newick parser for Python to bitbucket. I added support for labeled internal nodes, but probably partially broke support for bootstrap values.
>>> from newick import parse_tree
>>> t = parse_tree("((Human,Chimp)Primate,(Mouse,Rat)Rodent)Supraprimates;")
>>> print t
(('Human', 'Chimp')Primate, ('Mouse', 'Rat')Rodent)Supraprimates
>>> print t.identifier
Supraprimates
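
To illustrate what labeled internal nodes mean for a parser, here is a minimal, self-contained sketch in Python 3. It is not the code from the fork: `Node`, `parse_newick` and `_parse_node` are names made up for this example, and branch lengths, bootstrap values, quoted labels and comments are deliberately left out.

```python
class Node:
    """A tree node with an optional label and a list of child nodes."""

    def __init__(self, label=None, children=None):
        self.label = label
        self.children = children or []

    def __str__(self):
        if not self.children:
            return self.label or ""
        inner = ",".join(str(child) for child in self.children)
        return "(%s)%s" % (inner, self.label or "")


def parse_newick(text):
    """Parse a simplified Newick string (labels only) into a Node tree."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    node, pos = _parse_node(text, 0)
    if pos != len(text) - 1:
        raise ValueError("unexpected %r at position %d" % (text[pos], pos))
    return node


def _parse_node(text, pos):
    """Parse one subtree starting at pos; return (node, next position)."""
    children = []
    if text[pos] == "(":
        while True:
            # Skip the '(' or ',' and parse the child that follows it.
            child, pos = _parse_node(text, pos + 1)
            children.append(child)
            if text[pos] == ")":
                pos += 1
                break
            if text[pos] != ",":
                raise ValueError("expected ',' or ')' at position %d" % pos)
    # Whatever follows, up to the next delimiter, is this node's label.
    # For an internal node, this is the text right after the closing ')'.
    start = pos
    while text[pos] not in "(),;":
        pos += 1
    label = text[start:pos].strip() or None
    return Node(label, children), pos


tree = parse_newick("((Human,Chimp)Primate,(Mouse,Rat)Rodent)Supraprimates;")
print(tree.label)
print([child.label for child in tree.children])
```

The only thing needed for internal labels is to keep reading a label after the closing parenthesis instead of expecting a delimiter right away.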


Monday, July 27, 2009

One step towards writing papers in Google Wave

Google Wave's underlying technology will not only enable collaboration with other people, it will also make it possible for bots to interact with what you've written. I think this is going to change the way we work. E.g., all applications which require a significant amount of typing will benefit from the statistical auto-correction provided by the Wave app Spelly. In effect, Spelly goes over the text as you're typing it and corrects the obvious mistakes, just as you would do a bit later.

In a similar vein, the proof-of-concept bot Igor is watching out for inserted references and automagically converts them to a citation and a reference list. When writing papers, I usually insert reminders: "REF Imming review", "REF PMID 16007907". If I adjust this convention a bit and provide a bit more detail, Igor can figure out by itself which paper is meant and fetch the citation. Google Wave and Igor save me the tiresome back-and-forth between a reference manager and the editor to insert all the citations, and they remove distractions from the process of writing and editing the paper.

Of course, this is a proof of concept, so the style can't yet be customized. I further think it would be helpful to quickly look at "what's inside" a particular citation. I don't know if Google Wave supports this, but it would be nice to click on a citation ("[23]") and be presented with a pop-up window showing not only information about the article, but also links to PubMed / a DOI resolver.


Thursday, May 21, 2009

How good is Wikipedia's coverage of chemical compounds?

Wikipedia has excellent coverage of chemical compounds, featuring more than 20,000 articles whose names match those of PubChem compounds. After finding a few important chemicals not featured in Wikipedia, I wanted to quantify the coverage of Wikipedia and point out the gaps that should be filled.

I wondered how much coverage Wikipedia actually has for "important" chemicals. Here, I define importance as "number of hits in PubMed", since that is a thing that I can easily measure (and, in fact, already determined as part of working on STITCH and Reflect).

Missing chemicals
For each bin of 100 chemicals, the number of PubMed hits for all synonyms of this chemical is plotted against the fraction of the chemicals that have a Wikipedia article for any of the synonyms. (I exclude three-letter names as they are often ambiguous.) So, for compounds that occur more than 1000 times in PubMed, Wikipedia's coverage is above 80%. Here is the list of articles that should be added.
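
The binning described above can be sketched roughly as follows (Python 3). This is an illustration, not the script I used: `coverage_by_bin` and the input format are made up for the example, and summarizing each bin by its median hit count is an assumption.

```python
from statistics import median


def coverage_by_bin(chemicals, bin_size=100):
    """Return one (median PubMed hits, Wikipedia coverage) pair per bin.

    `chemicals` is an iterable of (pubmed_hits, has_wikipedia_article)
    pairs, one pair per chemical (hits summed over all its synonyms).
    """
    # Rank chemicals from most to least frequently mentioned in PubMed.
    ranked = sorted(chemicals, key=lambda item: item[0], reverse=True)
    result = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        hits = [n_hits for n_hits, _ in chunk]
        covered = sum(1 for _, has_article in chunk if has_article)
        result.append((median(hits), covered / len(chunk)))
    return result


# Toy data with a bin size of 3 instead of 100.
toy = [(1000, True), (900, True), (800, False),
       (10, False), (5, True), (1, False)]
for hits, coverage in coverage_by_bin(toy, bin_size=3):
    print(hits, round(coverage, 2))
```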

So, if you know something about one of the missing compounds, go right ahead and create an article! :)

Missing synonyms
The second question is whether Wikipedia is missing important redirects, i.e. whether there are widely used names for chemicals that don't occur in Wikipedia even though an article exists for the chemical itself (just under another name). For very common names, the coverage is slightly lower. However, the abstracts in PubMed often contain chemical notation that people probably won't use when searching Wikipedia, e.g. "Ca(2+)" is the top hit on the list of redirects that could be added.


Wednesday, April 22, 2009

Announcing SIDER: a database of side effects

After using side effects to predict drug targets, we have now created a public database of side effects with a total of 62269 side effects for 888 drugs. The database was created by text-mining labels from various public sources like the FDA. Furthermore, I developed rules to extract frequency information from the labels; this worked for about one third of the drug–side effect pairs.

We think that this database will make quite a bit of interesting research possible.


Monday, November 10, 2008

4th German Conference on Chemoinformatics: ChEMBL

The talks of the first full day of the 4th German Conference on Chemoinformatics are over. Most interesting for me was Christoph Steinbeck's talk about the recently announced data acquired by the EBI. The database will be called "ChEMBL". There will be a monthly update cycle, so the acquisition does not only capture the current state, but the database is going to be extended. There are three parts (although they'll be combined eventually):
  • "DrugStore": interactions for 1500 drugs. Christoph says that he doesn't expect this to go much beyond what's already publicly available in DrugBank et al. today.
  • "CandiStore": 15,000 clinical leads
  • "StARLite": 500,000 medicinal chemistry leads. This is where most of the novelty (in terms of public data) lies. For this part, there are >5500 annotated targets, >3500 of which are proteins (the rest are e.g. tissues), and 2 million experimental bioactivities. The database contains bidirectional links to the literature on synthetic routes and assays for the ligands and descriptions of the targets.
The data will first be made available as database dumps; more user-friendly interfaces will be added later.

Two URLs of interest that I didn't know before: The ChEMBL blog and John Overington's lab homepage.

Other remarks about today will follow when I have a real internet connection (not just 6 kB/s via Bluetooth/GPRS for 9 ct/min) to do some more background research.


Wednesday, August 20, 2008

Web 2.0 killer app: FriendFeed for scientific papers

This post is inspired by Eva's thoughts on getting scientists to adopt Web 2.0 and Cameron's post on making Connotea a killer app for scientists.

Many people have added their CiteULike or Connotea libraries to FriendFeed, so during the day you can see various new papers flow by. Similarly, journals' TOC updates and saved searches on PubMed create a regular stream of possibly interesting papers. Lastly, after a few weeks or months, papers are processed by ISI Web of Science and can be tracked by citation alerts. In the end, you might see the same paper flow by a couple of times.

This situation is far from ideal. You see echoes of the same paper, and papers arrive via multiple channels: RSS, email, web sites. There are far too many potentially interesting papers, so you have to focus your various alerts in order not to be overwhelmed.

My proposal for the killer app is a central place which tracks all of the above items (i.e., friends' libraries, PubMed searches, journal TOC and citation alerts) and integrates with your personal library. Just like in FriendFeed, there should be a way to rate/like a paper ("Faculty of 1,000,000"?), to prioritize the new papers, and to save papers to your library. The most important and difficult feature would be to merge equivalent entries, i.e. a Connotea link to PubMed needs to be merged with the journal TOC alert etc. So when you have already identified something as interesting and filed it, you won't be alerted again if it comes in via another channel.
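
To make the merging step a bit more concrete, here is a rough sketch in Python 3. It assumes that every incoming entry has already been reduced to a set of identifiers such as a PMID or DOI, which is the genuinely hard part and is not shown. The function name, the entry format and the identifiers in the example are all made up.

```python
def merge_entries(entries):
    """Group entries that share at least one identifier.

    Each entry is a dict with a "source" (the channel it arrived on) and
    "ids" (a list of identifier strings). Returns a list of groups, each
    with the union of identifiers and the list of sources that reported it.
    """
    merged = []
    for entry in entries:
        ids = set(entry["ids"])
        sources = [entry["source"]]
        remaining = []
        for group in merged:
            if group["ids"] & ids:
                # Same paper seen before: fold the old group into this one.
                ids |= group["ids"]
                sources = group["sources"] + sources
            else:
                remaining.append(group)
        remaining.append({"ids": ids, "sources": sources})
        merged = remaining
    return merged


incoming = [
    {"source": "connotea", "ids": ["pmid:123"]},
    {"source": "journal TOC", "ids": ["doi:10.1000/example"]},
    {"source": "pubmed search", "ids": ["pmid:123", "doi:10.1000/example"]},
    {"source": "citeulike", "ids": ["pmid:456"]},
]
for group in merge_entries(incoming):
    print(sorted(group["ids"]), group["sources"])
```

Note that the third entry links the first two: before it arrives, nothing says that the PMID and the DOI belong to the same paper.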

Of course, there should be a non-mandatory way to tag papers, to have groups, and to recommend papers to specific users (like the "for:" tag in delicious.com).

Bonus points: keep track of comments and blog posts of the paper, plus all the extended literature analysis that Cameron proposed.


Friday, August 15, 2008

Mendeley = Mekentosj Papers + Web 2.0 ?

Via Ricardo Vidal: Mendeley seems to be a Windows (plus Mac/Linux) equivalent of Mekentosj Papers (which is Mac OS X only, and has been described as "iTunes for your papers"). In addition to handling your PDFs, it has an online component that allows sharing your papers and other Web 2.0 features (billing itself as "Last.fm for papers").

Here, I'm reviewing the Mac beta version (0.5.6). I am focusing mostly on the desktop side and comparing it to Papers, because I have a working solution in place and I would only switch to Mendeley if the experience is as good as with Papers. (I.e., my main problem is offline paper management; Web 2.0 features are icing on the cake.)

By Mac standards, the app is quite ugly. Both Mendeley and Papers allow full-text PDF searches, which is important if you want to avoid tagging/categorizing all your papers. Papers can show PDFs in the main window, copy the reference of the paper and email papers. Mendeley in principle can also copy the reference, but special characters are transformed to gibberish in this beta version. Papers allows you to match papers against PubMed, Web of Science etc., while Mendeley only offers to auto-extract often incomplete meta-data. This matching feature is extremely useful as you get all the authoritative data from the source, and most often Papers can use the DOI in the PDF to immediately give you the correct reference. Update: Mendeley also uses DOIs to retrieve the correct metadata, if available. (Thanks, Victor, for your comment.)

The beta version is quite rough: I just had to kill it because I found no way to close the "about" window. Extraction of meta-data and references doesn't always work, but this might be more of a problem of the information that's stored in the PDFs.

Of course, once there's a critical mass of people using Mendeley, there'll be all the Web 2.0 features that Papers doesn't have. Judging from the talk I think they might be trying to do too much: Connotea/CiteULike plus Dopplr plus LinkedIn. For me, a simple way to export new references from Papers to Connotea/CiteULike would be enough. More modularity is better, because it allows you to choose the best tool in each layer.

More info by the Mendeley folks: Short demo, a little longer talk.


CiteWeb: Following citations made easy

One good way to keep up with the literature in a field is to track which new papers are citing seminal papers of the field. Each Friday, I get lots of citation alerts from ISI Web of Science, but often enough I see the same paper again and again (citing different papers that are on my watch list). So I set out to write an app that would take ISI's RSS feeds, coalesce them, and give them back to you. For example, in the screenshot one review paper is citing five of my tracked papers:
If you're using citation alerts from Web of Science, then give CiteWeb a try at citeweb.embl.de. If you find a bug, you can either comment here, or grab the source code and fix it. :-)

I started working on this to find out whether Google App Engine would be useful. It turned out that downloading many items from a remote host leads to time-outs from App Engine, so I ported the app to Django. The source code is released under the MIT License.
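
The core idea of coalescing the alerts can be sketched in a few lines of Python 3. This is a simplified illustration and not CiteWeb's actual code: it assumes the feeds have already been fetched and parsed into (tracked paper, citing paper) pairs, and `coalesce_alerts` is a name made up for the example.

```python
from collections import defaultdict


def coalesce_alerts(alerts):
    """Map each citing paper to the list of tracked papers it cites.

    `alerts` is an iterable of (tracked_paper, citing_paper) pairs, one
    pair per item in a citation-alert feed.
    """
    cited_by = defaultdict(list)
    for tracked, citing in alerts:
        # The same alert can show up more than once; keep it only once.
        if tracked not in cited_by[citing]:
            cited_by[citing].append(tracked)
    # Papers that cite the most tracked papers come first.
    return sorted(cited_by.items(), key=lambda item: (-len(item[1]), item[0]))


alerts = [
    ("tracked paper A", "new review"),
    ("tracked paper B", "new review"),
    ("tracked paper A", "other paper"),
    ("tracked paper A", "new review"),
]
for citing, tracked in coalesce_alerts(alerts):
    print(citing, "cites", tracked)
```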


Tuesday, August 12, 2008

Google integrates Scholar into main page

I don't know if it's just me (sitting inside a research institution), but when I search for something that returns a paper, I get info from Google Scholar:

(See also the complete screenshot with notes on Flickr.) However, the order of the results is different: Google Scholar seems to weight by citations, Google by PageRank.


Saturday, July 19, 2008

ISMB 2008: "Career Paths in Bioinformatics and Computational Biology"

A panel discussion about "Career Paths in Bioinformatics and Computational Biology" was part of Friday's Student Council Symposium during ISMB. The four panelists were from academia: Philip E. Bourne (group leader at University of California San Diego), Alfonso Valencia (group leader at the Spanish National Research Council), Jong Bhak (director of the Korean BioInformation Center) and Richard Wintle (Assistant Director at The Centre for Applied Genomics). (Only RW had spent a longer time [6 years] in industry, at start-up companies.)

[All quotes are paraphrases based on the notes I took.]

Perhaps unsurprisingly, they couldn't offer really comforting answers to the questions of young researchers: "Isn't there a high chance that at the age of 40 you'll be highly trained and specialized, but without a job?" – "There's no job security in academia anyway; I'm not sure if academia is more competitive than industry" (AV). "After the initial boom in bioinformatics positions, will the fraction of grad students who become PIs approach that of biology, at 5 to 10%?" – "Biology will morph into bioinformatics, so there will be more jobs." (JB)

However, I could take away some positive advice. In short: follow your heart, be passionate about something, don't do what everybody else is doing, start your own sub-field if you have to. From my perspective, this is both reasonable and encouraging. As I enter the last phase of being a PhD student, I begin to wonder how I can combine working in science and caring for my family. I hope that by staying motivated and being effective in what I do, I have a chance to grow in my career and be there for my family. (I think this is a great advantage of computational biology: You can't make gels run faster, so to say, but you can be effective in programming and analyzing data.)

Another good insight was that a lot of basic technological advances will come from industry in the future. Dr. Bhak cited the example of CPU development: The huge increases in processing power we see today are being implemented at Intel and AMD (although I cannot judge how much they rely on basic research by academia). My addition to this might be that part of bioinformatics will become more of an engineering discipline. So, for people interested in this, there will be a big job market in the future.

Similarly, the panelists expected that every biology lab will have embedded computational biologists in the future. I agree, but I think these will be mostly post-doc (i.e. non-permanent) positions.


Some of the questions and answers in more detail:

What will be the big opportunities in the next five years? The current generation of students will lead the bioinformatics industry, like the previous generation is currently leading in academia (JB). There will be many more embedded bioinformaticians (see above, AV). Hybrid skills (wet-lab and computational biology) will become more important (RW). The greatest opportunities are cross-disciplinary approaches that tackle as much complexity as possible (PB).

To stay motivated, and find out what you want to do: Always follow your heart in career decisions; create your own sub-division if you have to (PB). Don't do something just because it's trendy; what you like to do might change over time (at one point, industry might be appealing, later academia) (RW).

To find your spot in academia: Find influential people (RW). Diversify: try to do something that not everyone else is doing (AV).


Friday, July 18, 2008

Micro-blogging ISMB

As Pablo announced, several people including me are micro-blogging about ISMB on FriendFeed and to a lesser extent on Twitter.


Tuesday, March 11, 2008

Blogging for search engines

Related to my last post about the failings of Web 2.0 in biology, I want to ask the meta-question: Why do we blog? David Crotty proposes four reasons: Communication with other science bloggers, with non-specialists, with journalists and finally with search engine users. Unless you are a fairly well-known person, your regular audience will consist of your colleagues, collaborators and a random grad student or two. A journalist might only come by if you managed to get a press release about a Nature/Science/... paper out. But Googlebot won't fail you and will read all your posts!

Insightful blog posts won't stay without an audience. For one, the small circle of followers to your blog will spread the news if you write something worth sharing. Far more important are search engines. How do you survey a research area of interest? Most of us will query PubMed, but also do a Google search in the hope that some meaningful analysis is somewhere on a course website, in the text of a paper or maybe even in a blog.

Biologists use Google to query for their proteins of interest. STRING is a fairly successful database, and lots of people google for it by name. However, almost one quarter of all visitors from Google have actually searched for a protein name (random example) and found STRING. If you follow Lars J. Jensen's lead and publish your research observations and dead ends online, someone might serendipitously find them and use them for their own research. This will be the next step towards open science (with open data and open notebooks, which we might never reach): "Publishing" small findings, data and back stories about papers on your blog, enabling others to gain insight.


Web 2.0, CiteULike and Mekentosj Papers

Roland Krause bookmarked a great post: "Why Web 2.0 is failing in Biology" by David Crotty. That I got to know about this post just by subscribing to his links in del.icio.us is a success of Web 2.0. I'm just not sure if the same successes are already in reach in the context of science. I especially agree with David Crotty's observations about entry barriers: Unless new tools/communities make it very easy to use them and provide great benefit, the rate of adoption will be low.

From my personal experience, I can share this: Almost two years ago, I participated in giving a series of talks about Web 2.0 and how it might impact biology. Looking back, I'm not sure many things have changed. I have been using CiteULike for the past three years or so, but I think I will now switch to Papers. CiteULike allows me to bookmark and tag my papers, but when I search my library I mostly use a custom Google search for the specific author.

Papers lets you easily create a collection of all PDFs you ever read. Thanks to Spotlight, you can perform full text searches on the articles and quickly retrieve the paper you have in mind. This avoids the overhead of applying tags to papers that you actually don't end up using. (GMail is another case in point: a quick search function eliminates the need for an intricate folder structure.)

I can't remember a specific case where the "Web 2.0" functions of CiteULike ever worked for me. Peeking in the bibliographies of other people can be interesting if you have some bookmarked papers in common, but the signal-to-noise ratio is very low. So, unless you know specific people or groups to follow, you'll mostly use CiteULike in "Web 1.0 mode". And then we come back to the initial observation: If a web tool is more complicated or less featured than the desktop (or even paper) version, it won't be used much.

Update: Mendeley might be the Windows equivalent some of you have been looking for (15/08/08, more in this post).


Friday, February 22, 2008

STITCH and STRING blog

As an outlet for additional information about various things going on with STRING and STITCH I've created a blog. In particular, I spent the last week in Japan at the BioHackathon 2008 in Tokyo. Besides enjoying the different culture, I got to work on an API for the servers. I guess we'll see if it actually gets used (one of the first uses could be in the Reflect text-mining / highlighting tool, which already uses STITCH to get the pop-ups).


Tuesday, February 5, 2008

Max Planck Society signs agreement with Springer

In October, I reported that the German Max Planck Society failed to reach a new license agreement with Springer. Now, via heise.de, I learn that they have signed an agreement on January 29, 2008. Here's the press release (there's also a German version).

The details are very sparse; presumably Springer had to come down on the price, but they won't state that. However, the press release devotes a lot of space to Open Access, saying that the license agreement "also includes Open Choice™". Open Choice is Springer's author-pays-for-OA program. Now, what does this mean? It doesn't make sense to assume that the agreement talks about access to Open Choice articles, so I guess it must mean that all MPG articles are now going to be published under the Open Choice model. Querying PubMed a bit, I find that the MPG accounted for 6% of the total German research output, so this is certainly an interesting development.


Thursday, October 18, 2007

Max Planck Society cancels license agreement with Springer

heise.de reports that the Max Planck Society, a major German research organization with more than 80 institutes, has broken off negotiations on renewing its license agreement for online access to SpringerLink, so the agreement ends with the end of the year. This means that almost 20,000 scientists will not have access to anything published in 1200 journals after the end of the year. (The archive will still be available.)

The talks failed because Springer wanted twice the amount the Max Planck Society considered justifiable, although they say even the justifiable amount was higher than what other publishers charge.

The Max Planck Society is one of the main proponents of the Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities. It is the first major research organization that I'm aware of that breaks free of the stranglehold of the publishing industry. The question is, of course: what will happen next? Will they stay firm and begin to cause a shift towards Open Access journals? Or will Springer come down on the price just enough that the contract will be renewed?

Update: heise.de now reports this story in English as well.

Update 2: Well, they have reached an agreement on January 29, 2008.


Thursday, October 11, 2007

Sending Growl notifications from Python scripts

Working in bioinformatics can be seen as an infinite loop of: think, write a script, run script, analyze data. While I work on a Mac, most of the scripts run on a Linux server, and it would be nice to know when a script is done so that I can look at the data. In order to be notified when a script finishes, I now use Growl (see picture).

I downloaded netgrowl.py and wrote a quick and dirty wrapper package around it:

#!/usr/bin/env python
# Saved as growlnotify.py so that other scripts can import it.

from socket import AF_INET, SOCK_DGRAM, socket
import sys

from netgrowl import *

def growlNotify(title = "Script Finished", message = ""):
    # Address of the machine running Growl; adjust to your own setup.
    addr = ("10.1.104.26", GROWL_UDP_PORT)
    s = socket(AF_INET,SOCK_DGRAM)

    # One-time registration; uncomment for the very first run.
    # p = GrowlRegistrationPacket(application="Network Demo", password="?")
    # p.addNotification("Script Finished", enabled=True)
    # s.sendto(p.payload(), addr)

    # Default to the name of the calling script.
    if not message:
        message = sys.argv[0]

    p = GrowlNotificationPacket(application="Network Demo",
                                notification="Script Finished", title=title,
                                description=message, priority=1,
                                sticky=True, password="?")
    s.sendto(p.payload(),addr)
    s.close()

if __name__ == '__main__':
    growlNotify()



The registration is hidden in a comment; you only need to do that the very first time. So, in my scripts, I just insert the following right before the end of the script (using a TextMate snippet to save typing).

import growlnotify
growlnotify.growlNotify()
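
One limitation of calling the notifier on the last line is that it never fires when the script dies with an exception. A possible variation, sketched here in Python 3 as an assumption rather than something I actually use, is a small decorator around the script's main function. `notify_when_done` is a made-up name, and the `notify` argument can be any callable that takes a title and a message, for example `growlnotify.growlNotify`.

```python
import functools
import sys


def notify_when_done(notify):
    """Decorator factory: call notify(title, message) when the function ends."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            title = "Script Finished"
            try:
                return func(*args, **kwargs)
            except Exception:
                # Report the failure, then let the exception propagate.
                title = "Script Failed"
                raise
            finally:
                # Runs on success and on failure alike.
                notify(title, sys.argv[0])
        return wrapper
    return decorator


# Stand-in notifier for demonstration; swap in growlnotify.growlNotify.
def print_notifier(title, message):
    print("%s: %s" % (title, message))


@notify_when_done(print_notifier)
def main():
    return sum(range(10))


if __name__ == "__main__":
    main()
```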


Friday, October 5, 2007

Feed for NAR Database Issue papers

The NAR Database Issue papers are trickling in. I created a Yahoo! Pipe (with an RSS feed) of the advance access papers (filtered out from the full list of articles). Enjoy.


Tuesday, August 28, 2007

Running in circles

You know you're looking for an obscure topic when your own entry in CiteULike comes up on the second page or so of the Google results.

Luckily, I saw my name in the URL and so I didn't get frightened by a Woozle.
