Ortholog Detection by Blast+

The most highly cited tool in Bioinformatics, Blast, has been rewritten by C++ since 2009. Although released with a compatible perl script for blastall users, the parameters of Blast+ are quite different from those of its predecessor. The changes of parameters may influence people who use blast to detect orthology relationships by reciprocal best hits, or orthomcl, etc.

For RBH approach, Moreno-Hagelsieb and Latimer claim in their Bioinformatics paper that -F “m S” in blastall is the best choice balancing accuracy and running time. The corresponding parameters in blast+ are -seg yes -soft_masking true then. They also kindly provide a list of equivalent parameters used in their Bioinformatics paper between blastall and blast+ on their lab blog.

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6

For OrthoMCL, it is better to use the same masking strategy as RBH uses, but OrthoMCL has another issue. Because it can deal with multiple species at the same time, for some very large ortholog groups, the default limit on the number of alignments may be too low in some cases. Thus, the author suggests set -v 100000 -b 100000 in blastp to avoid missing any homologs. Well, actually only -b matters here, as it sets the upper limit on number of database sequence to show alignments for and -v only works when the output format is in -m 0 or -m 6. Neither of them is used in OrthoMCL. The equivalent parameter of -b in blast+ is -num_alignments, so for OrthoMCL:

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6 \
-num_alignments 100000

The story never ends so early, as blast+ has another -max_target_seqs, which can control the number of aligned sequences to keep for any tabular formats (outfmt > 4). It is also incompatible with the ones, i.e. -num_descriptions and -num_alignments, used in output with separate definition line and alignment sections. So I think it is better to just use -max_target_seqs in blastp+ for OrthoMCL:

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6 \
-max_target_seqs 100000

There seem to be some new characters in blast+, like database masking, which for sure can reduce running time, but their effects on orthology detection are unclear so far (as far as I know)

__END__
Advertisements