Running mpiBLAST
From BCCD 3.0
mpiBLAST is a tool for searching large databases of nucleotides or proteins. (For more information on mpiBLAST, please check out About mpiBLAST). This page is a walkthrough of using the BCCD to perform an mpiBLAST search. This tutorial assumes that you have booted the BCCD on one or more machines already. It also assumes that you are running OpenMPI. (OpenMPI is the default environment on the BCCD. To double check this, or set it up if you've switched to MPICH, see Running Open MPI.)
Using mpiBLAST
This section assumes you are running mpiBLAST under OpenMPI. Unless you have specifically configured your BCCD to use MPICH instead, OpenMPI is already active; if none of this sounds familiar, you can assume you're OK.
mpiBLAST is used in much the same way as NCBI BLAST and reads the same configuration variables, which means that you will need a .ncbirc file in your home directory. This file tells mpiBLAST where to find its databases (the Shared variable) and its workspace (the Local variable). To set this up, log in as user bccd with the password you specified when booting up.
The .ncbirc file used in this walkthrough looks like this:
[mpiBLAST]
Shared=/home/bccd/blastdb
Local=/home/bccd/blastdb
If you don't have such a file in your home directory (which you don't if you haven't made one yourself), copy the above into the file ~/.ncbirc using nedit, nano, vi or your other favorite text editor not listed here.
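If you would rather not open an editor, a shell here-document can create the file. The following is a minimal sketch, assuming the default paths shown above. The NCBIRC variable is a name introduced here purely for illustration so the target path can be overridden, and the guard avoids overwriting a file you already have.

```shell
#!/bin/sh
# Target file; NCBIRC is a hypothetical override used only in this sketch.
NCBIRC="${NCBIRC:-$HOME/.ncbirc}"

if [ -e "$NCBIRC" ]; then
    echo "$NCBIRC already exists; leaving it untouched."
else
    # Write the settings shown above (here-document body must start at column 0).
    cat > "$NCBIRC" <<'EOF'
[mpiBLAST]
Shared=/home/bccd/blastdb
Local=/home/bccd/blastdb
EOF
    echo "Wrote $NCBIRC"
fi

# Show the result so it can be checked by eye.
cat "$NCBIRC"
```

If you later move your databases, edit the Shared and Local lines to match the new location.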
After setting up your .ncbirc file, there are four steps to running mpiBLAST. To get started, make the blastdb directory and change into it:
mkdir ~/blastdb
cd ~/blastdb
Download a database from NIH (National Institutes of Health)
In order to search a database using mpiBLAST, you first need a database. For this example we'll use the Drosophila melanogaster (fruit fly) nucleotide database. Databases can be downloaded with the wget command (see the bottom of the page for links to additional databases). To get the Drosophila melanogaster database, for example, you would do something like this:
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/drosoph.nt.gz
--17:00:38-- ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/drosoph.nt.gz
=> `drosoph.nt.gz'
Resolving ftp.ncbi.nlm.nih.gov... 165.112.7.10
Connecting to ftp.ncbi.nlm.nih.gov|165.112.7.10|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD /blast/db/FASTA ... done.
==> PASV ... done. ==> RETR drosoph.nt.gz ... done.
Length: 36,924,008 (35M) (unauthoritative)
100%[====================================>] 36,924,008 326.82K/s ETA 00:00
17:02:28 (338.88 KB/s) - `drosoph.nt.gz' saved [36924008]
After downloading, be sure to decompress it, using gunzip <database name>.
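Because an interrupted download leaves a truncated archive behind, it can be worth testing the file before decompressing it. The sketch below assumes the archive is named drosoph.nt.gz and sits in the current directory. unpack_db is a hypothetical helper written for this page, not part of mpiBLAST or the BCCD.

```shell
#!/bin/sh
# unpack_db is a hypothetical helper for this sketch, not part of mpiBLAST.
# It tests the gzip archive first, then decompresses it in place.
unpack_db() {
    archive="$1"
    if [ ! -f "$archive" ]; then
        echo "No such file: $archive" >&2
        return 1
    fi
    # gzip -t reads the whole archive and fails if it is truncated or corrupt.
    if ! gzip -t "$archive"; then
        echo "$archive looks damaged; download it again." >&2
        return 1
    fi
    # gunzip replaces e.g. drosoph.nt.gz with drosoph.nt.
    gunzip "$archive"
}

# Only act if the downloaded archive is present in the current directory.
if [ -f drosoph.nt.gz ]; then
    # On success, print the first line of the decompressed database.
    unpack_db drosoph.nt.gz && head -n 1 drosoph.nt
fi
```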
Format the database using mpiformatdb
Next, we split the database into chunks that can be accessed by different processors. This is done with mpiformatdb; its --nfrags option specifies the number of fragments the database should be divided into. You'll want to split it into the same number of fragments as the number of processors you'll use for running mpiBLAST. In this instance, we're splitting it four ways.
bccd@node000:~$ mpiformatdb --nfrags=4 -i ./drosoph.nt -pF --quiet
Reading input file
Done, read 1534943 lines
Reordering 1170 sequence entries
Breaking drosoph.nt (122 MB) into 4 fragments
Executing: formatdb -p F -i /tmp/reorderoUDWYw -N 4 -n /home/bccd/blastdb/drosoph.nt -o T
Removed /tmp/reorderoUDWYw
Created 4 fragments.
bccd@node000:~$ ls
blastdb drosoph.nt formatdb.log
If you're using a different database, be sure to specify its path rather than ./drosoph.nt. The output, the individual fragments of the database, is written to the shared folder specified in the .ncbirc file (~/blastdb if you used the settings above). Verify this with ls ~/blastdb.
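Beyond eyeballing the ls output, you can count the fragments directly. This sketch rests on an assumption about file naming: that each fragment gets a three-digit suffix (as in drosoph.nt.001, which appears in the results footer later on this page) and a nucleotide sequence file ending in .nsq. count_fragments is a hypothetical helper; adjust the pattern if your files are named differently.

```shell
#!/bin/sh
# count_fragments is a hypothetical helper for this sketch.
# Usage: count_fragments <shared dir> <database name>
# Assumes each fragment has a sequence file named <db>.NNN.nsq.
count_fragments() {
    dir="$1"
    db="$2"
    n=0
    for f in "$dir/$db".[0-9][0-9][0-9].nsq; do
        # If nothing matches, the pattern is left unexpanded; skip it.
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

found=$(count_fragments "$HOME/blastdb" drosoph.nt)
echo "Found $found fragment(s) of drosoph.nt in $HOME/blastdb"
```

The count should equal the value you passed to --nfrags.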
Error again?!
If you see a long list of the phrase [formatdb] FATAL ERROR: File write error, you've run out of RAM. Oops! See Customization Tips and Tricks: Supplementing RAM.
Create a test sequence file
Finally we're ready to run mpiBLAST against a test sequence. You can create your own query file by pasting a sequence in:
bccd@node000:~/blastdb$ cat > blast.in
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT
(Remember to press Ctrl-D to end input from stdin.)
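A quick sanity check on a pasted query is to count its letters, since a stray or missing line is easy to overlook. The sketch below assumes the query file is named blast.in. count_letters is a hypothetical helper that ignores FASTA header lines (those beginning with >) and whitespace.

```shell
#!/bin/sh
# count_letters is a hypothetical helper for this sketch.
# It prints the number of sequence characters in a FASTA-style file,
# ignoring header lines (those starting with '>') and all whitespace.
count_letters() {
    grep -v '^>' "$1" | tr -d ' \t\r\n' | wc -c | tr -d ' '
}

if [ -f blast.in ]; then
    echo "blast.in contains $(count_letters blast.in) letters"
fi
```

For comparison, the sample results on this page report the query as (560 letters).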
Then, run mpiblast as follows:
bccd@node000:~$ mpirun -np 4 -machinefile ~/machines mpiblast -d drosoph.nt -i blast.in -p blastn -o results.txt
bccd@node000:~$ ls
[other stuff..] results.txt
- -np is the number of processors to run on (preferably the same as the number of fragments you divided the database into!)
- -machinefile is the file listing the machines to run on
- -d is the database file to search against
- -i specifies the input file
- -p is the BLAST program to run (blastn here, since this is a nucleotide search)
- -o specifies where to put the output
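Since the fragment count and the process count are meant to match, one option is to derive both from the machine file. This sketch assumes ~/machines lists one host per line and that you want one process per listed host; neither is guaranteed on every setup. count_hosts and the MACHINES variable are names made up for this example. The script only prints the commands so you can review them before running anything.

```shell
#!/bin/sh
# count_hosts is a hypothetical helper for this sketch.
# It counts non-blank, non-comment lines in a machine file,
# assuming one host per line.
count_hosts() {
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" | wc -l | tr -d ' '
}

# MACHINES is a hypothetical override; the default matches the command above.
MACHINES="${MACHINES:-$HOME/machines}"
if [ -f "$MACHINES" ]; then
    NP=$(count_hosts "$MACHINES")
    # Print the commands rather than running them, so they can be reviewed first.
    echo "mpiformatdb --nfrags=$NP -i ./drosoph.nt -pF --quiet"
    echo "mpirun -np $NP -machinefile $MACHINES mpiblast -d drosoph.nt -i blast.in -p blastn -o results.txt"
fi
```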
The results file should look similar to this:
BLASTN 2.2.10 [Oct-19-2004]
Reference: Aaron E. Darling, Lucas Carey, and Wu-chun Feng,
"The design, implementation, and evaluation of mpiBLAST."
In Proceedings of ClusterWorld 2003, June 24-26 2003, San Jose, CA
Query= Test
(560 letters)
Database: /home/bccd/blastdb/drosoph.nt
1170 sequences; 122,655,632 total letters
Score E
Sequences producing significant alignments: (bits) Value
gb|AE003681.2|AE003681 Drosophila melanogaster genomic scaffold ... 36 0.86
gb|AE002936.2|AE002936 Drosophila melanogaster genomic scaffold ... 36 0.86
gb|AE003698.2|AE003698 Drosophila melanogaster genomic scaffold ... 36 0.86
gb|AE003493.2|AE003493 Drosophila melanogaster genomic scaffold ... 36 0.86
gb|AE002615.2|AE002615 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE003441.1|AE003441 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE003525.2|AE003525 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE003587.2|AE003587 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE003673.2|AE003673 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE003648.1|AE003648 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE003628.1|AE003628 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE003431.2|AE003431 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE003484.1|AE003484 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE003495.2|AE003495 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE002665.2|AE002665 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE003740.2|AE003740 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE003723.3|AE003723 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE003447.2|AE003447 Drosophila melanogaster genomic scaffold ... 34 3.4
>gb|AE003681.2|AE003681 Drosophila melanogaster genomic scaffold 142000013386035 section 6 of
105, complete sequence
Length = 329362
Score = 36.2 bits (18), Expect = 0.86
Identities = 18/18 (100%)
Strand = Plus / Minus
Query: 96 taaattaaaattttattg 113
||||||||||||||||||
Sbjct: 111644 taaattaaaattttattg 111627
>gb|AE002936.2|AE002936 Drosophila melanogaster genomic scaffold 142000013385220, complete
sequence
Length = 48123
Score = 36.2 bits (18), Expect = 0.86
Identities = 18/18 (100%)
Strand = Plus / Minus
Query: 97 aaattaaaattttattga 114
||||||||||||||||||
Sbjct: 40704 aaattaaaattttattga 40687
>gb|AE003698.2|AE003698 Drosophila melanogaster genomic scaffold 142000013386035 section 23 of
105, complete sequence
Length = 225827
Score = 36.2 bits (18), Expect = 0.86
Identities = 18/18 (100%)
Strand = Plus / Minus
Query: 107 tttattgacttaggtcac 124
||||||||||||||||||
Sbjct: 151021 tttattgacttaggtcac 151004
>gb|AE003493.2|AE003493 Drosophila melanogaster genomic scaffold 142000013386053 section 10 of
30, complete sequence
Length = 308092
Score = 36.2 bits (18), Expect = 0.86
Identities = 18/18 (100%)
Strand = Plus / Minus
<<snipped>>
Database: /home/bccd/blastdb/drosoph.nt
Posted date: Dec 6, 2006 5:13 PM
Number of letters in database: 30,663,804
Number of sequences in database: 292
Database: /home/bccd/blastdb/drosoph.nt.001
Posted date: Dec 6, 2006 5:13 PM
Number of letters in database: 30,664,011
Number of sequences in database: 293
Database: /home/bccd/blastdb/drosoph.nt.002
Posted date: Dec 6, 2006 5:13 PM
Number of letters in database: 30,664,004
Number of sequences in database: 293
Database: /home/bccd/blastdb/drosoph.nt.003
Posted date: Dec 6, 2006 5:13 PM
Number of letters in database: 30,663,813
Number of sequences in database: 292
Lambda K H
1.37 0.711 1.31
Gapped
Lambda K H
1.37 0.711 1.31
Matrix: blastn matrix:1 -3
Gap Penalties: Existence: 5, Extension: 2
Number of Hits to DB: 35,658
Number of Sequences: 1170
Number of extensions: 35658
Number of successful extensions: 72
Number of sequences better than 10.0: 18
Number of HSP's better than 10.0 without gapping: 18
Number of HSP's successfully gapped in prelim test: 0
Number of HSP's that attempted gapping in prelim test: 53
Number of HSP's gapped (non-prelim): 19
length of query: 1122
length of database: 122,655,632
effective HSP length: 18
effective length of query: 542
effective length of database: 122,634,572
effective search space: 66467938024
effective search space used: 66467938024
T: 0
A: 0
X1: 11 (21.8 bits)
X2: 15 (29.7 bits)
S1: 12 (24.3 bits)
S2: 17 (34.2 bits)
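To pull just the hit table out of a results file like the one above, a short awk filter is usually enough. This is a sketch that assumes the summary lines begin with gb| and end with the score and E value, as they do in the sample output; other databases may use different identifier prefixes. list_hits is a hypothetical helper, not an mpiBLAST tool.

```shell
#!/bin/sh
# list_hits is a hypothetical helper for this sketch.
# It prints "<sequence id> <score in bits> <E value>" for each line of the
# "Sequences producing significant alignments" table, assuming those lines
# start with "gb|" as in the sample output. Alignment headers start with
# ">gb|" and are therefore not matched.
list_hits() {
    awk '/^gb[|]/ { print $1, $(NF-1), $NF }' "$1"
}

if [ -f results.txt ]; then
    list_hits results.txt
    echo "Total hits listed: $(list_hits results.txt | wc -l | tr -d ' ')"
fi
```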
For more information
- If you want to know more details about mpiformatdb or mpiblast, please refer to mpiBLAST Home at http://www.mpiblast.org/
- If you want to know more details about NCBI toolbox, please refer to NCBI home at http://www.ncbi.nlm.nih.gov
- If you want to know more details about MPICH, please refer to MPICH home at http://www-unix.mcs.anl.gov/mpi/mpich/
- If you want to know more details about OpenMPI, please refer to OpenMPI home at http://www.open-mpi.org/
- If you want to download BLAST databases from NCBI, please refer to the NCBI Blast database at ftp://ftp.ncbi.nih.gov/blast/db, or FASTA database at ftp://ftp.ncbi.nih.gov/blast/db/FASTA.
- mpiBLAST comes originally from Los Alamos National Laboratory (http://www.lanl.gov/)