
Index/query scripts for UniProtKB datasets

Usage

Example command lines for downloading the uniprot_sprot.xml.gz file and indexing it:

Download UniProt/Swiss-Prot data set

mkdir -p data
# ~760M(compressed), ~173.5 million lines, ~565,000 entries
wget -nc -P ./data ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/\
knowledgebase/complete/uniprot_sprot.xml.gz
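
Before starting a multi-hour indexing run, it can be worth checking that the download is complete. The following is a minimal sketch, not part of the nosqlbiosets scripts, that streams the compressed file and counts top-level `entry` elements. The function name `count_entries` is hypothetical, and the sketch assumes the entries are direct children of the XML root element.

```python
import gzip
import xml.etree.ElementTree as ET


def count_entries(path):
    """Count top-level <entry> elements in a gzip-compressed UniProt XML file.

    Hypothetical helper for a quick sanity check; it parses the file as a
    stream so the whole document is never held in memory.
    """
    count = 0
    depth = 0
    root = None
    with gzip.open(path, "rb") as handle:
        for event, elem in ET.iterparse(handle, events=("start", "end")):
            if event == "start":
                if root is None:
                    root = elem  # the first start event is the root element
                depth += 1
                continue
            depth -= 1
            if depth == 1:
                # A direct child of the root has just been closed.
                # Ignore the XML namespace prefix, e.g. "{...}entry".
                if elem.tag.rsplit("}", 1)[-1] == "entry":
                    count += 1
                root.clear()  # drop finished children to keep memory flat
    return count
```

Calling `count_entries("./data/uniprot_sprot.xml.gz")` should return a number close to the entry count quoted above; a parse error or a much smaller count suggests a truncated download.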

Index with Elasticsearch or MongoDB

If you have not yet installed the nosqlbiosets project, see the Installation section of the readme.md file in the project's main folder.

Default server connection settings are read from ../../conf/dbservers.json.

# Index with Elasticsearch, typically requires about 1 to 8 hours
./nosqlbiosets/uniprot/index.py ./data/uniprot_sprot.xml.gz\
 --host localhost --db Elasticsearch  --esindex uniprot

# Index with MongoDB, typically requires about 1 to 2 hours
./nosqlbiosets/uniprot/index.py ./data/uniprot_sprot.xml.gz\
 --host localhost --db MongoDB --index biosets

Index/query scripts for InterPro dataset

# Index with Elasticsearch, takes about 10 minutes
./nosqlbiosets/uniprot/interpro.py \
   ~/data/interpro/interpro.xml.gz\
   --esindex interpro\
   --dbtype Elasticsearch --recreateindex true\
   --host localhost

# Index with MongoDB, takes about 3 minutes
./nosqlbiosets/uniprot/interpro.py \
   ~/data/interpro/interpro.xml.gz\
   --dbtype MongoDB --recreateindex true\
   --mdbdb=biosets --mdbcollection interpro\
   --host localhost

PSI MI-TAB support

This folder also includes an index script for PSI-MI TAB protein interaction data files.

### Links for the PSI MI-TAB format


wget -P ./data http://cbdm-01.zdv.uni-mainz.de/~mschaefer/hippie/HIPPIE-current.mitab.txt

# Index with Elasticsearch
./nosqlbiosets/uniprot/index_mitab.py --infile ./data/HIPPIE-current.mitab.txt\
 --db Elasticsearch

# Index with MongoDB
./nosqlbiosets/uniprot/index_mitab.py --infile ./data/HIPPIE-current.mitab.txt\
 --db MongoDB

Indexing the HIPPIE data set takes about 8 minutes with MongoDB and about 2 minutes with Elasticsearch.
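
For orientation, a PSI-MI TAB record is one tab-separated line whose first 15 columns follow the MITAB 2.5 layout. Most fields hold `|`-separated items of the form `db:identifier(description)`, and `-` marks an empty field. The sketch below shows one way such a line could be turned into a dictionary. It is an illustration only and not the logic used by index_mitab.py; the column labels and the function names `parse_field` and `parse_line` are assumptions made for this example, and `|` characters inside quoted values are not handled.

```python
import re

# The 15 columns of PSI-MI TAB 2.5, in order (labels chosen for this sketch)
MITAB25_COLUMNS = [
    "idA", "idB", "altIdsA", "altIdsB", "aliasesA", "aliasesB",
    "detectionMethods", "firstAuthors", "publicationIds",
    "taxidA", "taxidB", "interactionTypes", "sourceDbs",
    "interactionIds", "confidence",
]

# One item looks like  db:identifier  or  db:identifier(description)
XREF = re.compile(r'^(?P<db>[^:]+):(?P<id>.*?)(?:\((?P<text>.*)\))?$')


def parse_field(field):
    """Split one MITAB field into a list of {db, id, text} dictionaries."""
    if field in ("", "-"):
        return []
    items = []
    for item in field.split("|"):  # simplification: ignores quoted pipes
        match = XREF.match(item)
        if match:
            items.append({
                "db": match.group("db"),
                "id": match.group("id").strip('"'),
                "text": match.group("text"),
            })
        else:
            # Bare values without a "db:" prefix are kept as they are
            items.append({"db": None, "id": item, "text": None})
    return items


def parse_line(line):
    """Parse the first 15 columns of one MITAB line; extra columns are ignored."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) < len(MITAB25_COLUMNS):
        raise ValueError("expected at least 15 tab-separated columns")
    return {name: parse_field(value)
            for name, value in zip(MITAB25_COLUMNS, values)}
```

Files that start with a header row would need that row skipped before calling `parse_line` on the remaining lines.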