Wednesday, April 22, 2009

Announcing SIDER: a database of side effects

After using side effects to predict drug targets, we have now created a public database of side effects with a total of 62269 side effects for 888 drugs. The database was created by text-mining labels from various public sources like the FDA. Furthermore, I developed rules to extract frequency information from the labels; this worked for about one third of the drug–side effect pairs.

We think that this database will make quite a bit of interesting research possible.

7 comments:

Jon said...

Very cool stuff. I'm working on a similar project extracting label data into MedDRA (rather than COSTART). Your interface is awesome!

Quick ?--

Does the "frequent","infrequent" etc parsing work irrespective of word order? e.g. can it correctly detect phrases like "chest pain, palpitations were frequent" as well as "Frequent: chest pain, palpitations..." ?

Michael Kuhn said...

Hi Jon,

We didn't use MedDRA because the license seemed too restrictive. I'd still be interested to see how the coverage of side effect terms varies between COSTART and MedDRA.

We don't do any Natural Language Processing, so we don't get any data from "X, Y and Z were frequent." We only rely on tables of frequencies and the highly structured lists ("frequent: X, Y, Z").
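To illustrate the distinction, here is a minimal sketch of what a rule for the structured lists might look like. It is not the actual SIDER code: the label set, the regular expression, and the function name `extract_frequency_pairs` are all assumptions made up for this example.

```python
import re

# Hypothetical frequency labels; the real SIDER rules may use a different set.
FREQUENCY_LABELS = ("frequent", "infrequent", "rare")

# Matches a structured list line such as "Frequent: chest pain, palpitations."
LIST_PATTERN = re.compile(
    r"^\s*(?P<label>" + "|".join(FREQUENCY_LABELS) + r")\s*:\s*(?P<terms>.+)$",
    re.IGNORECASE,
)


def extract_frequency_pairs(line):
    """Return (side_effect, frequency) pairs from one structured list line."""
    match = LIST_PATTERN.match(line)
    if match is None:
        # Free-text sentences are deliberately not handled (no NLP).
        return []
    label = match.group("label").lower()
    # Drop trailing periods or ellipses, then split the comma-separated terms.
    terms = match.group("terms").rstrip(". ").split(",")
    return [(term.strip().lower(), label) for term in terms if term.strip()]


print(extract_frequency_pairs("Frequent: chest pain, palpitations..."))
print(extract_frequency_pairs("chest pain, palpitations were frequent"))
```

Because the pattern requires the frequency label at the start of the line, the second call returns an empty list, which mirrors the limitation described above.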

Jon said...

Hey Michael,

Agreed, MedDRA's policies are more restrictive and kind of a pain. But the U.S. FDA is using it these days so we're kind of stuck with it.

I'll be happy to run some comparisons between our MedDRA extractions and SIDER and share the results.

We're also running frequency extractions, so any chance I can get a copy of the raw SIDER table with frequencies included?

Michael Kuhn said...

Sorry, I forgot to put the frequency data on the site. It's there now, together with an updated readme file.

Jon said...

Thanks!

Becky said...

How often is the database updated?

Michael Kuhn said...

Hi Becky,

We have not yet decided how often to update the database. We don't want to update it continuously, but rather provide archived versions so that old datasets remain available. I guess in early 2010 we will look at how many new labels are accessible via the FDA Structured Product Labels and update the db.