bootstrap codon alignment

Codon models start being used in the field of phylogenetic reconstruction. They were implemented by both bayesian method, like MrBayes and Beast, as well as ML method, like GARLI, IQPNNI and CodonPhyML. Although there are some approximate way to estimate tree reliability, bootstrap is still the most common used one in communication. However, some tools, for example CodonPhyML, do not provide tools to bootstrap codon independently. Bootstrap can cost a huge amount of time, even though they are sometimes implemented in a parallel way, which may not apply to servers of all users (like me ;-)). The easiest way would be just running bootstraps independently then generating consensus tree to get bootstrap value. Therefore, I wrote a small (dark and dirty) code to bootstrap codon alignment by R.

require("ape")
dnaToCodon <- function(dna.indices){
  #if(!is.matrix(dna.indices) || length(dna.indices) != 3) {
  #  stop("\"dna.indices\" must be a matrix of length 3")
  #}
  
  ncl <- ncol(dna.indices)
  nrw <- nrow(dna.indices)
  ncl <- ncl / 3
  codon <- matrix(nrow=nrw, ncol=ncl)
  
  for (j in seq(1, ncl)) {
    for (i in seq(1, nrw)) {
      codon[i,j] <- paste(dna.indices[i,(3*j-2)], 
                          dna.indices[i,(3*j-1)], 
                          dna.indices[i,(3*j)], 
                          sep="")
    }
  }
  row.names(codon) <- row.names(dna.indices)
  return(codon)
}

bootstrapCodon <- function(codon.indices) {
  c <- ncol(codon.indices)
  r <- nrow(codon.indices)
  rcodon <- matrix(nrow=r, ncol=c)
  sample.col <- sample(seq(1,c), c, replace=T)
  
  for (i in seq(1, c)) {
    rcodon[,i] <- codon.indices[,sample.col[i]]
  }
  
  row.names(rcodon) <- row.names(codon.indices)
  return(rcodon)
    
}

codonToDna <- function(codon.indices) {
  nrw <- nrow(codon.indices)
  ncl <- ncol(codon.indices)
  
  dna <- matrix(nrow=nrw, ncol=ncl*3)
  
  for (i in seq(1, nrw)) {
    for (j in seq(1, ncl)) {
      codonstr <- strsplit(codon.indices[i,j], split="")
      dna[i,(j*3-2)] <- codonstr[[1]][1]
      dna[i,(j*3-1)] <- codonstr[[1]][2]
      dna[i,(j*3)] <- codonstr[[1]][3]
    }
  }
  
  row.names(dna) <- row.names(codon.indices)
  
  return(dna)
}

Then, if I have a alignment codon by codon, I can bootstrap it with following script.

require("ape")
source("codon.R")

for (i in seq(1,100)) {
  my.align <- read.dna("seq.msa.phy", as.character=T)
  my.codon.align <- dnaToCodon(my.align)
  boots.codon <- bootstrapCodon(my.codon.align)
  boots.align <- codonToDna(boots.codon)
  file <- paste("./bootstrap/seq.msa.boot", i, ".fasta", sep="")
  write.dna(boots.align, file, format="fasta", colsep="")
}
__END__
Advertisements

Ortholog Detection by Blast+

The most highly cited tool in Bioinformatics, Blast, has been rewritten by C++ since 2009. Although released with a compatible perl script for blastall users, the parameters of Blast+ are quite different from those of its predecessor. The changes of parameters may influence people who use blast to detect orthology relationships by reciprocal best hits, or orthomcl, etc.

For RBH approach, Moreno-Hagelsieb and Latimer claim in their Bioinformatics paper that -F “m S” in blastall is the best choice balancing accuracy and running time. The corresponding parameters in blast+ are -seg yes -soft_masking true then. They also kindly provide a list of equivalent parameters used in their Bioinformatics paper between blastall and blast+ on their lab blog.

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6

For OrthoMCL, it is better to use the same masking strategy as RBH uses, but OrthoMCL has another issue. Because it can deal with multiple species at the same time, for some very large ortholog groups, the default limit on the number of alignments may be too low in some cases. Thus, the author suggests set -v 100000 -b 100000 in blastp to avoid missing any homologs. Well, actually only -b matters here, as it sets the upper limit on number of database sequence to show alignments for and -v only works when the output format is in -m 0 or -m 6. Neither of them is used in OrthoMCL. The equivalent parameter of -b in blast+ is -num_alignments, so for OrthoMCL:

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6 \
-num_alignments 100000

The story never ends so early, as blast+ has another -max_target_seqs, which can control the number of aligned sequences to keep for any tabular formats (outfmt > 4). It is also incompatible with the ones, i.e. -num_descriptions and -num_alignments, used in output with separate definition line and alignment sections. So I think it is better to just use -max_target_seqs in blastp+ for OrthoMCL:

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6 \
-max_target_seqs 100000

There seem to be some new characters in blast+, like database masking, which for sure can reduce running time, but their effects on orthology detection are unclear so far (as far as I know)

__END__

Heatmap和K-means Clustering

经常会碰到一组矩阵数据,然后想把他们用heatmap的形式呈现出来。今天又碰到了一组,同时希望能把它用K-means按行聚类,用hierarchical按列聚类。

假设从文件读取这样一个数据框,可以在读取时加入row.names=1和head=T,给数据框的行列命名,这样就能轻松地将数据框用as.matrix轻松的转换成矩阵。

# read data in an easy way to convert data frame into matrix
df <- read.table("file.txt", row.names=1, head=T)
df.m <- as.matrix(df)

有了矩阵其实就可以很轻松地画出heatmap了,格式和聚类方法都使用默认参数:

heatmap(df.m)

或者使用gplots包里的heatmap.2

library("gplots")
heatmap.2(df.m)

在默认图形的基础上,想更换行的聚类方法,但heatmap和heatmap.2里并没有K-means聚类,所以我在这里的做法就是先把数据按k-means聚类然后排序,再用heatmap.2绘制heatmap,并设置Rowv=FALSE, trace=”none”去掉heatmap.2中默认的实线。

# k-means clustering
cl <- kmeans(df.m, centers=3)

# use order to sort a vector named $cluster in list cl
# it returns an index
order.idx <- order(cl$cluster)

# use the index to sort the original matrix by row
# note: row 1: [1,] / column 1: [,1]
df.m.order <- df.m[order.idx,]
heatmap.2(df.m.order, Rowv=FALSE, Colv=TRUE, trace="none")

这样一个简单的heatmap就做好了,可以通过调整其他参数,使得图形更优雅些。参数调整的差不多满意后,就可以写个小函数,以后碰到类似数据就可以直接作图先看看了。

__END__

读书八卦

有时候看书会挑出跟主题不太相关的内容,但好像又很接地气。比如这两天看《谈谈方法》,于是八卦的摘录了笛卡尔的学术生活的只言片语:

我决心避开一切可能遇到熟人的场合,在一个地方隐居下来。那里在连年烽火之后已经建立了良好的秩序,驻军的作用看来仅仅在于保障人们享受和平成果,居民人口众多,积极肯干,对自己的事情非常关心,对别人的事情并不注意。我住在那些人当中可享受到各种便利,不亚于通都大邑,而又可以独自一人,就像住在荒无人烟的大沙漠里一样。

这段关于荷兰人民的叙述还真是贴切,跟比国(当时大约就是荷兰)人民的生活状态差不多。

我参观皇帝加冕后回到部队的时候,冬天已经到了,只好留在驻地。那里既找不到人聊天解闷,幸好也没有什么牵挂,没有什么情绪使我分心,我成天独自关在一间暖房里,有充分的闲暇跟自己的思想打交道。

我的生活方式表面上跟某些人没有什么两样:不做什么事情,只是愉快地、正派地过着日子,用心把欢乐和邪恶分开;为了不至于闲得无聊,从事着各种正当的娱乐。可是尽管如此,我仍然在执行我的计划,增进我对真理的认识,成绩也许比埋头读书、只跟读书人往来还要大些。

言语恐惧症者说

不知道从什么时候开始,写东西对我变成了一件有些困难的事情。记得高中的时候,晚自习或者躲在被窝里,可以在一张A4的纸上写很长很长的文字,但是现在要花费了我很大的精力和较长的时间才能完成一篇文字。暂且不论文字质量如何,表达能力的退化倒是更加让人觉得恐怖。有时候不知道自己是不是患上了失语症,就连我在写现在这样一篇描述我为什么要写这个Blog的文章的时候,想法依旧像春天飘在身边的杨絮,长长短短的文字侵占着一个个自然段。

口头表达上好像也有一些所谓的言语恐惧症(Glossophobia)。这词是从Nexus 7的广告上学来的,故事大意是一个有些自闭的小男孩靠Nexus 7完成了在班上的一次presentaion(然后我就好想要台Nexus 7 )!虽然我并不完全认为我有医学上所描述的这种疾病,但是说话表达时的紧张或者有时回避交谈的心理,自己还是略知一二的,而由此造成的事务的拖延也一直时有发生。自我诊断的症状有:与陌生人说话会出汗或不自主挠头,说话比较短少,内容断断续续,不善于阐述复杂的观点,而且往往因为事物复杂而懒于解释。

无论是内在的自我性格和生活历程的影响,还是外界信息片段化的传播,造成上面这些问题的原因不仅仅是单方面的。不过如今接触的事物越来越复杂,在学术圈里接触到的也是本领域里越来越复杂的概念和愈发深刻的思维,都需要能够用“最清晰最简明的词汇来描述深刻的思维”,而且要“循序渐进,从最简单的东西一步步来往深处描述”,而“千万不能在心里约定俗成的觉得听众都应该清晰明白,然后跳跃性的描述思想”(引文来自豆瓣的理性批判)。

因此,新建这个Blog的主要想法是想强迫自己敢于并尽可能完整的表达自己的观点和认知,而不是片段化的没有思索过程的抛出一些结论。因为Blog与其他网络表达方式相比,篇幅长了许多而且几乎没有限制,可以强迫自己表达比较长而丰富的内容,从而训练下自己日渐衰老的表达能力。同时书写能够帮助整理思路,尽可能保持思路的连贯性,避免从用一个话题补充另一个话题,打乱表达的节奏。

同时,我还发现自己对很多知识的认知不够精确。一个明显的问题是,一个知识自己看明白后,想再解释给别人时总是感到困难重重,还得回头再去翻阅书中的表述才能转述给他人。所以这个Blog里也会包含我自己觉得比较难理解而又难表述清楚的内容。

最后,我之前的Blog们都是一种类似于日记的存在,同时都在一个比较固定的圈子里,因而要开辟一些新的内容总觉得很困难,因为我自己会受到外界对我的印象的影响。一个新的Blog可以看成是一个新的开始,可以尝试我一直想做的关于专业关于科学关于艺术这些形而上一些为主题的Blog。于是,在内心又纠结了很久才舒服,同时又终于肯花钱(20欧/年)之后,就有了现在这个新的Blog.