Thursday, May 21, 2009

How good is Wikipedia's coverage of chemical compounds?

Wikipedia has excellent coverage of chemical compounds, with more than 20,000 articles whose names match those of PubChem compounds. After finding a few important chemicals that Wikipedia does not cover, I wanted to quantify its coverage and point out the gaps that should be filled.

I wondered how much coverage Wikipedia actually has for "important" chemicals. Here, I define importance as the number of hits in PubMed, since that is something I can easily measure (and, in fact, had already determined as part of working on STITCH and Reflect).

Missing chemicals
Chemicals are grouped into bins of 100. For each bin, the number of PubMed hits for all synonyms of each chemical is plotted against the fraction of chemicals in the bin that have a Wikipedia article under any of their synonyms. (I exclude three-letter names, as they are often ambiguous.) For compounds that occur more than 1000 times in PubMed, Wikipedia's coverage is above 80%. Here is the list of articles that should be added.

So, if you know something about one of the missing compounds, go right ahead and create an article! :)
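
For readers who want to repeat this kind of analysis, here is a minimal sketch of the binning described above. It is not the script I actually used: the `Chemical` type, the function names, and the lowercased set of Wikipedia titles are all hypothetical, and I am assuming that chemicals are ranked by PubMed hits before binning and that each bin is summarised by its mean hit count.

```python
from typing import NamedTuple


class Chemical(NamedTuple):
    names: list[str]   # all synonyms of the chemical (hypothetical structure)
    pubmed_hits: int   # PubMed hits summed over all synonyms


def has_article(chem: Chemical, wikipedia_titles: set[str]) -> bool:
    """True if any synonym matches a Wikipedia title.

    Three-letter names are skipped because they are often ambiguous.
    Assumption: wikipedia_titles holds lowercased titles.
    """
    return any(
        len(name) != 3 and name.lower() in wikipedia_titles
        for name in chem.names
    )


def coverage_by_bin(chemicals, wikipedia_titles, bin_size=100):
    """Return one (mean PubMed hits, fraction with article) pair per bin."""
    # Assumption: bins are formed after ranking by PubMed hits, most hits first.
    ranked = sorted(chemicals, key=lambda c: c.pubmed_hits, reverse=True)
    points = []
    for start in range(0, len(ranked), bin_size):
        current = ranked[start:start + bin_size]
        covered = sum(has_article(c, wikipedia_titles) for c in current)
        mean_hits = sum(c.pubmed_hits for c in current) / len(current)
        points.append((mean_hits, covered / len(current)))
    return points
```

The chemicals for which `has_article` returns `False` are the candidates for the list of missing articles; plotting the returned pairs with a logarithmic x-axis would give a coverage curve of the kind described above.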

Missing synonyms
The second question is whether Wikipedia is missing important redirects, i.e. whether there are widely used names for chemicals that don't occur in Wikipedia even though an article exists for the chemical itself (just under another name). For very common names, the coverage is slightly lower. However, the abstracts in PubMed often contain chemical notation that people probably won't use when searching Wikipedia: for example, "Ca(2+)" is the top hit on the list of redirects that could be added.
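
The redirect check can be sketched in a similar way. Again, this is only an illustration under assumptions, not the original code: `synonym_hits` is a hypothetical mapping from a chemical identifier to the PubMed hit count of each of its names, and `wikipedia_titles` is assumed to be a set of lowercased article and redirect titles.

```python
def missing_redirects(synonym_hits, wikipedia_titles):
    """List names that could become redirects, most frequent first.

    synonym_hits: {chemical_id: {name: pubmed_hits}} (hypothetical layout)
    wikipedia_titles: set of lowercased article and redirect titles
    Returns (missing_name, existing_title, pubmed_hits) tuples.
    """
    candidates = []
    for names in synonym_hits.values():
        present = [n for n in names if n.lower() in wikipedia_titles]
        if not present:
            # No article under any name: this is a missing chemical,
            # not a missing redirect.
            continue
        target = present[0]  # any existing title can serve as the target
        for name, hits in names.items():
            if len(name) == 3:
                continue  # three-letter names are often ambiguous
            if name.lower() not in wikipedia_titles:
                candidates.append((name, target, hits))
    # Most frequently used names first.
    candidates.sort(key=lambda c: c[2], reverse=True)
    return candidates
```

Notation-style names such as "Ca(2+)" will naturally rise to the top of such a list, so the output still needs a manual pass to decide which redirects are actually worth creating.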