bioCS
biology as computational science
Monday, April 22, 2013
2D plot with histograms for each dimension (2013 edition)
In 2009, I wrote about a way to show density plots along both dimensions of a plot. When I ran the code again to adapt it to a new project, it didn't work because ggplot2 has become better in the meantime. Below is the updated code. Using the gridExtra package and this hint from the ggplot2 wiki, we get this output:
Monday, January 23, 2012
A publicly mandated medical terminology with a restrictive license
Update: Good News! In the meantime, I've been contacted by MedDRA (most probably unrelated to this blog post) and after a fruitful discussion it seems to be possible for me to base SIDER on MedDRA.
The world's regulatory agencies are increasingly adopting and mandating a new medical terminology scheme called MedDRA to capture side effects (adverse drug reactions) during the regulatory process. (For example, it is used in CVAR, AERS and other instances at the FDA [pdf], which in turn have been used in recent papers.) Sounds great, right? The only problem: the dataset is under a restrictive license (pdf). MedDRA data can only be shared among MedDRA subscribers (source [pdf]). I've clarified this via email with the help desk: one can only share text examples with fewer than 100 terms, and no numeric codes.
This means: it is not possible to create a public dataset, or supplementary material for a paper, that contains a useful amount of data based on MedDRA.
Two years ago, I created the SIDER database of drug–side effect relations (published in MSB). By relying only on publicly available drug labels and dictionaries like COSTART (with UMLS License Category 0), we were able to create a dataset that can be shared with everyone. (Disclaimer: we chose the license CC-BY-NC-SA.) If I were to base SIDER on MedDRA, the license would prevent me from making a machine-readable database available for download and further research. Thus, the next version of SIDER cannot be based on the dictionary of medical terms that regulatory agencies use at the moment.
What is especially sad about this is that the license fees themselves are not especially high: companies with an annual revenue of less than $1 million have to pay only $190 USD, and I doubt that there are hundreds of subscribers who earn more than $5 billion and thus pay the maximum fee of $62,850 USD. So it would take relatively little financial effort to declare MedDRA an open access database.
IANAL, so it may be possible that a database like SIDER, which essentially contains records of the form (side effect identifier, side effect name, drug identifier, drug name), is derived enough to not fall under the MedDRA license. I remain doubtful, however, especially after reading the restrictions on UMLS License Category 3, under which MedDRA falls, like: "incorporation of material [...] in any publicly accessible computer-based information system [...] including the Internet; [...] creating derivative works from material from these copyrighted sources".
Information on public health, like drugs and their side effects, should be openly available for research, second only to privacy concerns. I'm not sure how to begin to change this (beyond writing this), but ideas are very welcome.
(Small fnord detail: MSSO, which manages MedDRA, is a subsidiary of the military contractor Northrop Grumman.)
Labels:
open access,
science
Friday, January 6, 2012
Embarrassingly parallel BLAST search
A quick note: To blast a file with many proteins against a database, you can use a recent version of GNU Parallel to fill up all CPUs (which the -num_threads option of BLAST doesn't do, as it only parallelizes some steps of the search):
cat query.fasta | parallel --block 100k --recstart '>' --pipe \
blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > result.tsv
This will split the FASTA file into smaller chunks of about 100 kilobytes, while making sure that each chunk contains only complete records (i.e. every chunk starts at a line beginning with a ">").
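To make the "complete records" guarantee concrete, here is a minimal sketch of record-aligned splitting written in plain awk. It is not how GNU Parallel is implemented; the function name `split_fasta_records` and the chunk file naming are made up for illustration. It only cuts at lines that start with ">", so a sequence is never torn in half:

```shell
# Illustrative sketch (hypothetical helper, not part of GNU Parallel or BLAST).
# Splits FASTA read from stdin into files PREFIX0001.fa, PREFIX0002.fa, ...
# A new chunk is started only at a record start (a line beginning with ">")
# and only once the current chunk has reached roughly BLOCK characters.
# Usage: split_fasta_records PREFIX BLOCK < query.fasta
split_fasta_records() {
    awk -v prefix="$1" -v block="$2" '
        BEGIN { n = 0; size = 0; out = "" }
        # Record start: open a new chunk if there is none yet or the current one is full
        /^>/ && (out == "" || size >= block) {
            if (out != "") close(out)
            n++
            out = sprintf("%s%04d.fa", prefix, n)
            size = 0
        }
        # Lines before the first ">" are skipped, since they belong to no record
        out != "" { print > out; size += length($0) + 1 }
    '
}
```

Running `split_fasta_records chunk 100000 < query.fasta` and then looking at the first line of each chunk file should show a ">" header every time, which is the property that `--recstart '>'` is asked to preserve in the one-liner above.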
Labels:
science
Thursday, August 11, 2011
ggplot2: Determining the order in which lines are drawn
In a time series, I want to plot the values of an interesting cluster versus the background. However, if I'm not careful, ggplot will draw the items in an order determined by their name, so background items will obscure the interesting cluster:
[Two figures, side by side. Left: "Correct: Interesting lines in front of background". Right: "Wrong: Background lines obscure interesting lines".]
One way to solve this is to combine the label and name columns into one column that is used to group the individual lines. In this toy example, the line belonging to group 1 should overlay the other two lines:
Friday, July 15, 2011
SSH to same directory on other server
In my work environment, home directories are shared across servers. I often realize that I need to run a script on another server, e.g. due to RAM requirements or other jobs running on the current server. I finally figured out how to log in to the other server while staying in the same directory:
ssh -t server "cd '$(pwd)' && $SHELL"
(Of course, you can define aliases for different servers.)
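One possible way to wrap this up as a reusable shell function is sketched below. The names `quote_for_shell`, `remote_cd_command` and `sshcd` are made up for this example. The sketch assumes a POSIX-like shell on both ends and that the same path exists on the remote server (as with shared home directories); the quoting helper is there so that directories containing spaces or single quotes survive the trip:

```shell
# Hypothetical helpers, named for illustration only.

# Wrap a string in single quotes so a POSIX shell treats it as one word.
# Embedded single quotes are rewritten as '\'' (close, escaped quote, reopen).
quote_for_shell() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Build the command line to run remotely: change into DIR, then start SHELL_PATH.
# Usage: remote_cd_command DIR SHELL_PATH
remote_cd_command() {
    printf 'cd %s && %s' "$(quote_for_shell "$1")" "$2"
}

# Log in to SERVER and end up in the directory you are currently in.
# Like the one-liner, this starts the shell named by the *local* $SHELL.
# Usage: sshcd SERVER
sshcd() {
    ssh -t "$1" "$(remote_cd_command "$PWD" "${SHELL:-/bin/sh}")"
}
```

With this in your shell startup file, `sshcd server` does the same as the one-liner, and per-server aliases become as short as `alias bigmem='sshcd bigmem'` (where `bigmem` stands in for one of your own host names).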
Labels:
hacks
Tuesday, June 21, 2011
Mekentosj Papers and Dropbox: get a bit more space
I use Dropbox to keep an automatic backup of my Papers library. However, I was scraping close to my space allowance, and discovered that Papers adds some temporary files to the "Library.papers2" folder that eat up precious cloud space. Here is how to remove them from Dropbox:
- Quit Papers 2.0. Make a backup of your Papers library.
- Go to the "Library.papers2" folder inside your Papers library.
- Copy the folders "Thumbnails" and "Spotlight" somewhere (e.g. your Desktop). The other folders don't need much space in my case.
- Click the Dropbox icon in the menu bar, navigate to Preferences, Advanced, Selective Sync and de-select the "Thumbnails" and "Spotlight" folders. (Dropbox will now delete the folders.)
- Move the folders you copied before back into the "Library.papers2" folder.
- Via Dropbox's web interface, delete the online version of the two folders.
- Voila: more space (200 MB, in my case).
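Before excluding anything, it may be worth checking which subfolders actually take up space in your own library, since this can differ from my case. A small sketch using standard `du` and `sort`; the function name `folder_sizes` is made up here, and the library path in the usage note is an assumption you will need to adapt:

```shell
# Hypothetical helper: list the size in kilobytes of each subfolder of DIR,
# largest first. Uses only du -sk and sort -rn, which exist on macOS and Linux.
# Usage: folder_sizes DIR
folder_sizes() {
    du -sk "$1"/*/ | sort -rn
}
```

For example, `folder_sizes ~/Dropbox/Papers/Library.papers2` (path assumed, use wherever your library lives) prints one line per subfolder, so you can see whether "Thumbnails" and "Spotlight" are really the large ones.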
Labels:
mac
Thursday, March 10, 2011
Comparing two-dimensional data sets in R; take II
David commented on yesterday's post and suggested putting the continuous fitted distribution in the background and the discrete, empirical distribution in the foreground. This looks quite nice, although there's a slight optical illusion that makes the circles look as if they were filled with a gradient, even though they're uniformly colored:
[Figure: Not-so-good fit]
[Figure: Better fit]