Ortholog Detection by Blast+

The most highly cited tool in Bioinformatics, Blast, has been rewritten by C++ since 2009. Although released with a compatible perl script for blastall users, the parameters of Blast+ are quite different from those of its predecessor. The changes of parameters may influence people who use blast to detect orthology relationships by reciprocal best hits, or orthomcl, etc.

For RBH approach, Moreno-Hagelsieb and Latimer claim in their Bioinformatics paper that -F “m S” in blastall is the best choice balancing accuracy and running time. The corresponding parameters in blast+ are -seg yes -soft_masking true then. They also kindly provide a list of equivalent parameters used in their Bioinformatics paper between blastall and blast+ on their lab blog.

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6

For OrthoMCL, it is better to use the same masking strategy as RBH uses, but OrthoMCL has another issue. Because it can deal with multiple species at the same time, for some very large ortholog groups, the default limit on the number of alignments may be too low in some cases. Thus, the author suggests set -v 100000 -b 100000 in blastp to avoid missing any homologs. Well, actually only -b matters here, as it sets the upper limit on number of database sequence to show alignments for and -v only works when the output format is in -m 0 or -m 6. Neither of them is used in OrthoMCL. The equivalent parameter of -b in blast+ is -num_alignments, so for OrthoMCL:

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6 \
-num_alignments 100000

The story never ends so early, as blast+ has another -max_target_seqs, which can control the number of aligned sequences to keep for any tabular formats (outfmt > 4). It is also incompatible with the ones, i.e. -num_descriptions and -num_alignments, used in output with separate definition line and alignment sections. So I think it is better to just use -max_target_seqs in blastp+ for OrthoMCL:

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6 \
-max_target_seqs 100000

There seem to be some new characters in blast+, like database masking, which for sure can reduce running time, but their effects on orthology detection are unclear so far (as far as I know)


Heatmap和K-means Clustering



# read data in an easy way to convert data frame into matrix
df <- read.table("file.txt", row.names=1, head=T)
df.m <- as.matrix(df)





在默认图形的基础上,想更换行的聚类方法,但heatmap和heatmap.2里并没有K-means聚类,所以我在这里的做法就是先把数据按k-means聚类然后排序,再用heatmap.2绘制heatmap,并设置Rowv=FALSE, trace=”none”去掉heatmap.2中默认的实线。

# k-means clustering
cl <- kmeans(df.m, centers=3)

# use order to sort a vector named $cluster in list cl
# it returns an index
order.idx <- order(cl$cluster)

# use the index to sort the original matrix by row
# note: row 1: [1,] / column 1: [,1]
df.m.order <- df.m[order.idx,]
heatmap.2(df.m.order, Rowv=FALSE, Colv=TRUE, trace="none")