Index/query scripts for HMDB and DrugBank xml datasets
-
index.py Index HMDB protein and metabolite datasets. Tests made with HMDB version 4.0; metabolites Jan 2019 update, proteins Jan 2019 update
-
../tests/test_hmdb_queries.py Includes example queries
-
drugbank.py Index DrugBank xml dataset with MongoDB, or Elasticsearch, or save drug-drug interactions as graph file in GML format. Tests made with DrugBank version 5.1.8, January 2021 update
./hmdb/drugbank.py --help
usage: drugbank.py [-h] -infile INFILE [--index INDEX] [--doctype DOCTYPE]
[--host HOST] [--port PORT] [--db DB]
[--graphfile GRAPHFILE] [--allfields]
Index DrugBank entries in xml format, with MongoDB or Elasticsearch, downloaded from
https://www.drugbank.ca/releases/latest
optional arguments:
-h, --help show this help message and exit
-infile INFILE, --infile INFILE
Input file name
--index INDEX Name of the MongoDB database or Elasticsearch index,
or filename for NetworkX graph
--doctype DOCTYPE MongoDB collection name or Elasticsearch document type
name
--host HOST MongoDB or Elasticsearch server hostname
--port PORT MongoDB or Elasticsearch server port number
--db DB Database: 'MongoDB' or 'Elasticsearch', if not set
drug-drug interaction network is saved to a graph file
specified with the '--graphfile' option
--graphfile GRAPHFILE
Database: 'MongoDB' or 'Elasticsearch',or if
'graphfile' drug-drug interactionnetwork saved as
graph file
--allfields By default sequence fields and the patents field is
not indexed. Select this option to index all fields
- queries.py Query API for DrugBank data indexed with MongoDB, at its early stages
./hmdb/queries.py --help
usage: queries.py [-h] {savegraph,cyview} ...
positional arguments:
{savegraph,cyview}
savegraph Save DrugBank interactions as graph files
cyview See HMDB/DrugBank graphs with Cytoscape runing on your local machine
./hmdb/queries.py savegraph --help
./hmdb/queries.py cyview --help
Index HMDB
# Download metabolites and proteins data
mkdir -p data
wget -P ./data http://www.hmdb.ca/system/downloads/current/hmdb_metabolites.zip
wget -P ./data http://www.hmdb.ca/system/downloads/current/hmdb_proteins.zip
# Index with Elasticsearch, time for proteins is ~15m, for metabolites ~ 30m to 250m
./hmdb/index.py --infile ./data/hmdb_metabolites.zip --db Elasticsearch --index hmdb_metabolite
./hmdb/index.py --infile ./data/hmdb_proteins.zip --db Elasticsearch --index hmdb_protein
# Index with MongoDB, time for proteins is ~ 2m to 8m, for metabolites ~ 20m to 100m
./hmdb/index.py --infile ./data/hmdb_metabolites.zip --db MongoDB --index biosets
./hmdb/index.py --infile ./data/hmdb_proteins.zip --db MongoDB --index biosets
# Index with project's main index script
./scripts/nosqlbiosets index hmdb MongoDB ~/data/hmdb/hmdb_proteins.zip
./scripts/nosqlbiosets index hmdb MongoDB ~/data/hmdb/hmdb_metabolites.zip
./scripts/nosqlbiosets index hmdb Elasticsearch ~/data/hmdb/hmdb_proteins.zip --index hmdb_protein
Index DrugBank
Download DrugBank xml dataset from http://www.drugbank.ca/releases/latest,
requires registration. Save drugbank_all_full_database.xml.zip
file to the
data
folder
# Index with MongoDB, takes ~ 5m to 30m, with MongoDB Atlas ~50m?
./hmdb/drugbank.py --infile ./data/drugbank_all_full_database.xml.zip\
--db MongoDB --index biosets
./scripts/nosqlbiosets index drugbank MongoDB ~/data/drugbank/drugbank-5.1.2.xml.zip
# Index with Elasticsearch, takes ~8m to 50m
./hmdb/drugbank.py --infile ./data/drugbank_all_full_database.xml.zip\
--db Elasticsearch --index drugbank
# Save drug-drug interactions as graph file in GML format
# (not a mature feature: queries.py have better response time
# and is the preferred way for building interaction graphs)
# takes ~ 4m to 15m, #edges ~ 2,712000, #nodes ~ 3950
./hmdb/drugbank.py --infile ./data/drugbank_all_full_database.xml.zip --db NetworkX
DrugBank graph queries
Example command lines to generate and save graphs for subsets of DrugBank data or for the complete set
# Complete drug-targets graph
./hmdb/queries.py savegraph '{}' targets.xml
# Complete drug-enzymes graph
./hmdb/queries.py savegraph '{}' enzymes.xml --connections=enzymes
# Drug-carriers graph for drugs that have referencs to "Serum albumin"
./hmdb/queries.py savegraph '{"carriers.name": "Serum albumin"}'\
carriers-sa.xml --connections carriers
# Drug-targets graph for drugs which have keyword "antitubercular" in text fields
./hmdb/queries.py savegraph '{"$text": {"$search": "antitubercular"}}'\
antitubercular.xml --connections targets
Example command lines to view graph results with Cytoscape
./hmdb/queries.py cyview --help
./hmdb/queries.py cyview --dataset HMDB meningitis
./hmdb/queries.py cyview --dataset drugbank meningitis
Example graphs
Related work
- https://github.com/egonw/create-bridgedb-hmdb, http://www.bridgedb.org/ BridgeDB identity mapping files from HMDB, ChEBI, and Wikidata
Related links
- http://www.hmdb.ca/sources: a brief introduction to HMDB, and a detailed list of data sources for the data fields of HMDB entries