bootstrap codon alignment

Codon models start being used in the field of phylogenetic reconstruction. They were implemented by both bayesian method, like MrBayes and Beast, as well as ML method, like GARLI, IQPNNI and CodonPhyML. Although there are some approximate way to estimate tree reliability, bootstrap is still the most common used one in communication. However, some tools, for example CodonPhyML, do not provide tools to bootstrap codon independently. Bootstrap can cost a huge amount of time, even though they are sometimes implemented in a parallel way, which may not apply to servers of all users (like me ;-)). The easiest way would be just running bootstraps independently then generating consensus tree to get bootstrap value. Therefore, I wrote a small (dark and dirty) code to bootstrap codon alignment by R.

require("ape")
dnaToCodon <- function(dna.indices){
  #if(!is.matrix(dna.indices) || length(dna.indices) != 3) {
  #  stop("\"dna.indices\" must be a matrix of length 3")
  #}
  
  ncl <- ncol(dna.indices)
  nrw <- nrow(dna.indices)
  ncl <- ncl / 3
  codon <- matrix(nrow=nrw, ncol=ncl)
  
  for (j in seq(1, ncl)) {
    for (i in seq(1, nrw)) {
      codon[i,j] <- paste(dna.indices[i,(3*j-2)], 
                          dna.indices[i,(3*j-1)], 
                          dna.indices[i,(3*j)], 
                          sep="")
    }
  }
  row.names(codon) <- row.names(dna.indices)
  return(codon)
}

bootstrapCodon <- function(codon.indices) {
  c <- ncol(codon.indices)
  r <- nrow(codon.indices)
  rcodon <- matrix(nrow=r, ncol=c)
  sample.col <- sample(seq(1,c), c, replace=T)
  
  for (i in seq(1, c)) {
    rcodon[,i] <- codon.indices[,sample.col[i]]
  }
  
  row.names(rcodon) <- row.names(codon.indices)
  return(rcodon)
    
}

codonToDna <- function(codon.indices) {
  nrw <- nrow(codon.indices)
  ncl <- ncol(codon.indices)
  
  dna <- matrix(nrow=nrw, ncol=ncl*3)
  
  for (i in seq(1, nrw)) {
    for (j in seq(1, ncl)) {
      codonstr <- strsplit(codon.indices[i,j], split="")
      dna[i,(j*3-2)] <- codonstr[[1]][1]
      dna[i,(j*3-1)] <- codonstr[[1]][2]
      dna[i,(j*3)] <- codonstr[[1]][3]
    }
  }
  
  row.names(dna) <- row.names(codon.indices)
  
  return(dna)
}

Then, if I have a alignment codon by codon, I can bootstrap it with following script.

require("ape")
source("codon.R")

for (i in seq(1,100)) {
  my.align <- read.dna("seq.msa.phy", as.character=T)
  my.codon.align <- dnaToCodon(my.align)
  boots.codon <- bootstrapCodon(my.codon.align)
  boots.align <- codonToDna(boots.codon)
  file <- paste("./bootstrap/seq.msa.boot", i, ".fasta", sep="")
  write.dna(boots.align, file, format="fasta", colsep="")
}
__END__
Advertisements

Ortholog Detection by Blast+

The most highly cited tool in Bioinformatics, Blast, has been rewritten by C++ since 2009. Although released with a compatible perl script for blastall users, the parameters of Blast+ are quite different from those of its predecessor. The changes of parameters may influence people who use blast to detect orthology relationships by reciprocal best hits, or orthomcl, etc.

For RBH approach, Moreno-Hagelsieb and Latimer claim in their Bioinformatics paper that -F “m S” in blastall is the best choice balancing accuracy and running time. The corresponding parameters in blast+ are -seg yes -soft_masking true then. They also kindly provide a list of equivalent parameters used in their Bioinformatics paper between blastall and blast+ on their lab blog.

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6

For OrthoMCL, it is better to use the same masking strategy as RBH uses, but OrthoMCL has another issue. Because it can deal with multiple species at the same time, for some very large ortholog groups, the default limit on the number of alignments may be too low in some cases. Thus, the author suggests set -v 100000 -b 100000 in blastp to avoid missing any homologs. Well, actually only -b matters here, as it sets the upper limit on number of database sequence to show alignments for and -v only works when the output format is in -m 0 or -m 6. Neither of them is used in OrthoMCL. The equivalent parameter of -b in blast+ is -num_alignments, so for OrthoMCL:

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6 \
-num_alignments 100000

The story never ends so early, as blast+ has another -max_target_seqs, which can control the number of aligned sequences to keep for any tabular formats (outfmt > 4). It is also incompatible with the ones, i.e. -num_descriptions and -num_alignments, used in output with separate definition line and alignment sections. So I think it is better to just use -max_target_seqs in blastp+ for OrthoMCL:

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6 \
-max_target_seqs 100000

There seem to be some new characters in blast+, like database masking, which for sure can reduce running time, but their effects on orthology detection are unclear so far (as far as I know)

__END__

Heatmap和K-means Clustering

经常会碰到一组矩阵数据,然后想把他们用heatmap的形式呈现出来。今天又碰到了一组,同时希望能把它用K-means按行聚类,用hierarchical按列聚类。

假设从文件读取这样一个数据框,可以在读取时加入row.names=1和head=T,给数据框的行列命名,这样就能轻松地将数据框用as.matrix轻松的转换成矩阵。

# read data in an easy way to convert data frame into matrix
df <- read.table("file.txt", row.names=1, head=T)
df.m <- as.matrix(df)

有了矩阵其实就可以很轻松地画出heatmap了,格式和聚类方法都使用默认参数:

heatmap(df.m)

或者使用gplots包里的heatmap.2

library("gplots")
heatmap.2(df.m)

在默认图形的基础上,想更换行的聚类方法,但heatmap和heatmap.2里并没有K-means聚类,所以我在这里的做法就是先把数据按k-means聚类然后排序,再用heatmap.2绘制heatmap,并设置Rowv=FALSE, trace=”none”去掉heatmap.2中默认的实线。

# k-means clustering
cl <- kmeans(df.m, centers=3)

# use order to sort a vector named $cluster in list cl
# it returns an index
order.idx <- order(cl$cluster)

# use the index to sort the original matrix by row
# note: row 1: [1,] / column 1: [,1]
df.m.order <- df.m[order.idx,]
heatmap.2(df.m.order, Rowv=FALSE, Colv=TRUE, trace="none")

这样一个简单的heatmap就做好了,可以通过调整其他参数,使得图形更优雅些。参数调整的差不多满意后,就可以写个小函数,以后碰到类似数据就可以直接作图先看看了。

__END__