Index/query scripts for HMDB and DrugBank XML datasets

./hmdb/ --help
usage: [-h] -infile INFILE [--index INDEX] [--doctype DOCTYPE]
                   [--host HOST] [--port PORT] [--db DB]
                   [--graphfile GRAPHFILE] [--allfields]

Index DrugBank entries in xml format, with MongoDB or Elasticsearch, downloaded from

optional arguments:
  -h, --help            show this help message and exit
  -infile INFILE, --infile INFILE
                        Input file name
  --index INDEX         Name of the MongoDB database or Elasticsearch index,
                        or filename for NetworkX graph
  --doctype DOCTYPE     MongoDB collection name or Elasticsearch document type
  --host HOST           MongoDB or Elasticsearch server hostname
  --port PORT           MongoDB or Elasticsearch server port number
  --db DB               Database: 'MongoDB' or 'Elasticsearch', if not set
                        drug-drug interaction network is saved to a graph file
                        specified with the '--graphfile' option
  --graphfile GRAPHFILE
                        Database: 'MongoDB' or 'Elasticsearch',or if
                        'graphfile' drug-drug interactionnetwork saved as
                        graph file
  --allfields           By default sequence fields and the patents field is
                        not indexed. Select this option to index all fields
./hmdb/ --help
usage: [-h] {savegraph,cyview} ...

positional arguments:
    savegraph         Save DrugBank interactions as graph files
    cyview            See HMDB/DrugBank graphs with Cytoscape running on your local machine

./hmdb/ savegraph --help
./hmdb/ cyview --help

Index HMDB

# Download metabolites and proteins data
mkdir -p data
wget -P ./data
wget -P ./data

# Index with Elasticsearch; takes ~15m for proteins, ~30m to 250m for metabolites
./hmdb/ --infile ./data/ --db Elasticsearch --index hmdb_metabolite
./hmdb/ --infile ./data/ --db Elasticsearch --index hmdb_protein

# Index with MongoDB; takes ~2m to 8m for proteins, ~20m to 100m for metabolites
./hmdb/ --infile ./data/ --db MongoDB --index biosets
./hmdb/ --infile ./data/ --db MongoDB --index biosets

# Index with the project's main index script
./scripts/nosqlbiosets index hmdb MongoDB ~/data/hmdb/
./scripts/nosqlbiosets index hmdb MongoDB ~/data/hmdb/

./scripts/nosqlbiosets index hmdb Elasticsearch ~/data/hmdb/ --index hmdb_protein

Index DrugBank

Download the DrugBank XML dataset (registration required) and save the file to the data folder.

# Index with MongoDB; takes ~5m to 30m (with MongoDB Atlas possibly ~50m)
./hmdb/ --infile ./data/\
 --db MongoDB --index biosets

./scripts/nosqlbiosets index drugbank MongoDB ~/data/drugbank/

# Index with Elasticsearch; takes ~8m to 50m
./hmdb/ --infile ./data/\
 --db Elasticsearch --index drugbank

# Save drug-drug interactions as a graph file in GML format
# (not a mature feature: the savegraph command described under
#  "DrugBank graph queries" has a better response time
#  and is the preferred way to build interaction graphs)
# takes ~4m to 15m, #edges ~2,712,000, #nodes ~3950
./hmdb/ --infile ./data/ --db NetworkX
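
For readers unfamiliar with GML, the following is a minimal sketch of what a drug-drug interaction graph looks like in that format. It is not the project's implementation: the `to_gml` helper and the drug names are hypothetical placeholders, and the sketch uses only the Python standard library rather than NetworkX.

```python
def to_gml(interactions):
    """Serialize undirected (drug, drug) pairs as a GML string.

    Hypothetical helper for illustration; names containing double quotes
    are not escaped here.
    """
    # Give each drug name an integer id, in order of first appearance
    ids = {}
    for pair in interactions:
        for name in pair:
            ids.setdefault(name, len(ids))

    lines = ["graph [", "  directed 0"]
    for name, node_id in ids.items():
        lines.append(f'  node [ id {node_id} label "{name}" ]')
    for a, b in interactions:
        lines.append(f"  edge [ source {ids[a]} target {ids[b]} ]")
    lines.append("]")
    return "\n".join(lines)


# Placeholder names, not real DrugBank entries
pairs = [("drug A", "drug B"), ("drug A", "drug C")]
gml = to_gml(pairs)
print(gml)
```

Each drug becomes one `node` entry and each interaction one `edge` entry, which is why the edge count above is so much larger than the node count.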

DrugBank graph queries

Example command lines to generate and save graphs for subsets of DrugBank data or for the complete set

# Complete drug-targets graph 
./hmdb/ savegraph '{}' targets.xml

# Complete drug-enzymes graph
./hmdb/ savegraph '{}' enzymes.xml --connections=enzymes

# Drug-carriers graph for drugs that have references to "Serum albumin"
./hmdb/ savegraph '{"": "Serum albumin"}'\
     carriers-sa.xml --connections carriers

# Drug-targets graph for drugs that have the keyword "antitubercular" in text fields
./hmdb/ savegraph '{"$text": {"$search": "antitubercular"}}'\
     antitubercular.xml --connections targets
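
The query argument in the commands above is a MongoDB-style query document written as a JSON string, which is easy to misquote by hand. The sketch below shows one way to build such an argument from Python. The `savegraph_args` helper is hypothetical, the script name is left out because it is not shown here, and the `targets` default is an assumption based on the examples above.

```python
import json
import shlex


def savegraph_args(query, outfile, connections="targets"):
    """Build the argument list for a savegraph call (hypothetical helper).

    The script name itself is deliberately omitted.
    """
    return ["savegraph", json.dumps(query), outfile,
            "--connections", connections]


# Same text-search query as in the antitubercular example
text_query = {"$text": {"$search": "antitubercular"}}
args = savegraph_args(text_query, "antitubercular.xml")

# shlex.quote adds the single quotes the shell needs around the JSON
cmd = " ".join(shlex.quote(a) for a in args)
print(cmd)
```

Passing `args` as a list to `subprocess.run`, after prepending the script path, avoids shell quoting altogether.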

Example command lines to view graph results with Cytoscape

  ./hmdb/ cyview --help
  ./hmdb/ cyview --dataset HMDB meningitis
  ./hmdb/ cyview --dataset drugbank meningitis

Example graphs