Quantile normalization, RLE, and TMM

There are just so many ways to normalise RNA-Seq data, even though one of the advantages of RNA-Seq is “it can capture transcriptome dynamics across different tissues of conditions without sophisticated normalization of data sets.” (/RNA-Seq: A revolutionary tool for transcriptomes/, Nature Review Genetics, 2009, ) BTW, I was wondering how many people actually take careful considerations on these normalisations. Or I’d like to know how robust RNA-Seq data sets are. Here are a few normalizations that came across today

  • Quantile normalization – /Selection between-sample RNA-Seq normalization methods from the perspective for their assumptions/, Briefings in Bioinformatics, 2018. The purpose for this normalization is to obtain identical distributions for read counts among different samples. It first orders read counts for each sample increasingly and leaves placement holders with the order indices. Then it uses the mean (or median) of the first, second, and so on read counts among samples to replace the placement holders. If there is a tie (aka with two values with identical placement holders i), then it uses the mean between value(i) and value (i+1). There are some varieties of the quantile normalization, e.g. upper quantile normalization or median quantile normalization. These are just approaches that replace the the original value by the division between the original value and the 75% quantile (or 50% quantile, i.e. median) of the values among samples. The rest should be the same.
  • RLE stands for relative log expression – /RLE plots: Visualizing unwanted variation in high dimensional data/, PLoS one, 2018. For each gene, it calculates a median of read counts among samples than uses the differences between read counts and the median to draw a box plot. The approach should remove variations among genes and only leave variations among samples, which all have similar distributions with a median of zero.
  • TMM stands for Trimmed Mean of M values – /A scaling normalization method for differential expression analysis of RNA-Seq data/, Genome Biology, 2010. I should look into this approach further tomorrow, but so far it seems to me that it considers the proportions of certain reads in a sample. It will be also interesting to read the first one /Selection between-sample RNA-Seq normalization methods from the perspective for their assumption/.

A little bit of RNA-Seq (2/n)

理想状态下,人们进行基因表达量的比较分析时,希望比较的应该是任意两个细胞之间基因表达量的绝对差异,虽然这是一种resolution最高的状态。然而,实际在比较的时候,比较的是基因的相对表达量。对于qRT-PCR,是一个基因对应于内参(endogenous control)的表达在两个样本中的差异,endogenous control一般选取表达量不太高的基因。对于RNA-Seq来说,是一个基因在两个样本中proportion的差异。因此,这里其实暗含的假设是:两个样本间的转录组大小不能有巨大的变化。这个假设被分解成两个经常被提及的假设,即:

  1. 大多数基因的表达量没有发生变化
  2. 高表达量的基因的表达量没有发生变化

不过,这个假设其实往往是不成立的,无论是在样本本身,或是制备及测序过程中,都可能引入偏差。样本本身的问题是相对最难控制的,比如在比较二倍体和多倍体时,这就是个不能忽略的问题。现在通用的计算表达量的方法,很明显的就是在计算一个proportion。

通常用于计算RNA-Seq表达量时,主要考虑的是测序深度和基因长度。RPKM和FPKM分别对应于single-end和pair-end sequencing。FPKM只是对于pair-end reads无论两个片段都map到还是一个片段map到基因上,都算成一次mapped fragment。这两者的具体做法就是把基因上reads/fragments的数量先处以所有map到的reads的数量,然后再处以基因长度。所以RPKM和FPKM是名字是很误导人的,准确的说应该是Reads (Fragements) per Million reads per Kilobase。

另外一种更流行的做法时TPM,具体的做法其实只是先除以基因的长度,再处以所有mapped reads数量,这样各个样本表达量的总和就会是相同的数值(即100%)。但TPM的全名更没有实际意义,Transcripts Per Million(RNA-Seq领域起名都很随意而没有实际意义)。

A little bit of RNA-SEQ (1/n)

RNA-Seq是一种测序方法,最先发明这种方法的目的是为了测定组织或者细胞内基因的表达量。后来这种技术又兼具了探索gene space和SNP鉴定的任务,尤其在没有可参考的基因组的情况下。传统RNA-Seq的基本过程是通过一定的实验手段将所需测序的RNA进行富集,如mRNA,然后将RNA反转录成cDNA,再对cDNA进行测序。现在比较新的RNA-Seq方法分别在样本提取和cDNA进行了改进,即单细胞测序(样本)和直接测序RNA(dRNA-Seq)。单细胞测序的一类衍生做法是使用组织冷冻切片,从而确定基因表达的空间秩序,也有类似于原位杂交的测序手段。

不过由于RNA本身具有多种形式,又通过各种方式参与生命过程,因此,RNA-Seq的方法延伸到了研究RNA biology的许多方面。通过采取不同的手段对所关注的RNA进行富集,便可以获取不同RNA的序列。比如通过抗体吸附RNA聚合酶II而获得正在转录中的RNA,可以更好的获得5’RNA并比较好的确定转录起始位。通过高速离心的方法区分富有核糖体和少有核糖体的mRNA,以研究RNA和翻译之间的关系(假设是核糖体的数量与翻译量正相关)。也可以通过富集RNA的结构片段或其与其他RNA及蛋白质相互作用的部分,从而获得与RNA结构和相互作用的结果。

RNA-Seq的每一步都有很多陷阱和不确定性。过去因为测序成本太高,可能产生了许多不尽如人意的数据。在常规的DGE分析中,有几个比较常见的偏差,比如RNA的降解使的测得的片段偏向3‘。破碎后片段中GC含量会导致PCR效率不同,高GC含量的片段PCR效率比较低,导致最终reads里这类片段比较少。合成cDNA时,又是会使用random hexmer primers,但其与RNA的结合效率不同,所以起始测序位点的核酸频率会出现偏差。

Ortholog Detection by Blast+

The most highly cited tool in Bioinformatics, Blast, has been rewritten by C++ since 2009. Although released with a compatible perl script for blastall users, the parameters of Blast+ are quite different from those of its predecessor. The changes of parameters may influence people who use blast to detect orthology relationships by reciprocal best hits, or orthomcl, etc.

For RBH approach, Moreno-Hagelsieb and Latimer claim in their Bioinformatics paper that -F “m S” in blastall is the best choice balancing accuracy and running time. The corresponding parameters in blast+ are -seg yes -soft_masking true then. They also kindly provide a list of equivalent parameters used in their Bioinformatics paper between blastall and blast+ on their lab blog.

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6

For OrthoMCL, it is better to use the same masking strategy as RBH uses, but OrthoMCL has another issue. Because it can deal with multiple species at the same time, for some very large ortholog groups, the default limit on the number of alignments may be too low in some cases. Thus, the author suggests set -v 100000 -b 100000 in blastp to avoid missing any homologs. Well, actually only -b matters here, as it sets the upper limit on number of database sequence to show alignments for and -v only works when the output format is in -m 0 or -m 6. Neither of them is used in OrthoMCL. The equivalent parameter of -b in blast+ is -num_alignments, so for OrthoMCL:

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6 \
-num_alignments 100000

The story never ends so early, as blast+ has another -max_target_seqs, which can control the number of aligned sequences to keep for any tabular formats (outfmt > 4). It is also incompatible with the ones, i.e. -num_descriptions and -num_alignments, used in output with separate definition line and alignment sections. So I think it is better to just use -max_target_seqs in blastp+ for OrthoMCL:

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6 \
-max_target_seqs 100000

There seem to be some new characters in blast+, like database masking, which for sure can reduce running time, but their effects on orthology detection are unclear so far (as far as I know)

__END__

读书八卦

有时候看书会挑出跟主题不太相关的内容,但好像又很接地气。比如这两天看《谈谈方法》,于是八卦的摘录了笛卡尔的学术生活的只言片语:

我决心避开一切可能遇到熟人的场合,在一个地方隐居下来。那里在连年烽火之后已经建立了良好的秩序,驻军的作用看来仅仅在于保障人们享受和平成果,居民人口众多,积极肯干,对自己的事情非常关心,对别人的事情并不注意。我住在那些人当中可享受到各种便利,不亚于通都大邑,而又可以独自一人,就像住在荒无人烟的大沙漠里一样。

这段关于荷兰人民的叙述还真是贴切,跟比国(当时大约就是荷兰)人民的生活状态差不多。

我参观皇帝加冕后回到部队的时候,冬天已经到了,只好留在驻地。那里既找不到人聊天解闷,幸好也没有什么牵挂,没有什么情绪使我分心,我成天独自关在一间暖房里,有充分的闲暇跟自己的思想打交道。
我的生活方式表面上跟某些人没有什么两样:不做什么事情,只是愉快地、正派地过着日子,用心把欢乐和邪恶分开;为了不至于闲得无聊,从事着各种正当的娱乐。可是尽管如此,我仍然在执行我的计划,增进我对真理的认识,成绩也许比埋头读书、只跟读书人往来还要大些。

言语恐惧症者说

不知道从什么时候开始,写东西对我变成了一件有些困难的事情。记得高中的时候,晚自习或者躲在被窝里,可以在一张A4的纸上写很长很长的文字,但是现在要花费了我很大的精力和较长的时间才能完成一篇文字。暂且不论文字质量如何,表达能力的退化倒是更加让人觉得恐怖。有时候不知道自己是不是患上了失语症,就连我在写现在这样一篇描述我为什么要写这个Blog的文章的时候,想法依旧像春天飘在身边的杨絮,长长短短的文字侵占着一个个自然段。

口头表达上好像也有一些所谓的言语恐惧症(Glossophobia)。这词是从Nexus 7的广告上学来的,故事大意是一个有些自闭的小男孩靠Nexus 7完成了在班上的一次presentaion(然后我就好想要台Nexus 7 )!虽然我并不完全认为我有医学上所描述的这种疾病,但是说话表达时的紧张或者有时回避交谈的心理,自己还是略知一二的,而由此造成的事务的拖延也一直时有发生。自我诊断的症状有:与陌生人说话会出汗或不自主挠头,说话比较短少,内容断断续续,不善于阐述复杂的观点,而且往往因为事物复杂而懒于解释。

无论是内在的自我性格和生活历程的影响,还是外界信息片段化的传播,造成上面这些问题的原因不仅仅是单方面的。不过如今接触的事物越来越复杂,在学术圈里接触到的也是本领域里越来越复杂的概念和愈发深刻的思维,都需要能够用“最清晰最简明的词汇来描述深刻的思维”,而且要“循序渐进,从最简单的东西一步步来往深处描述”,而“千万不能在心里约定俗成的觉得听众都应该清晰明白,然后跳跃性的描述思想”(引文来自豆瓣的理性批判)。

因此,新建这个Blog的主要想法是想强迫自己敢于并尽可能完整的表达自己的观点和认知,而不是片段化的没有思索过程的抛出一些结论。因为Blog与其他网络表达方式相比,篇幅长了许多而且几乎没有限制,可以强迫自己表达比较长而丰富的内容,从而训练下自己日渐衰老的表达能力。同时书写能够帮助整理思路,尽可能保持思路的连贯性,避免从用一个话题补充另一个话题,打乱表达的节奏。

同时,我还发现自己对很多知识的认知不够精确。一个明显的问题是,一个知识自己看明白后,想再解释给别人时总是感到困难重重,还得回头再去翻阅书中的表述才能转述给他人。所以这个Blog里也会包含我自己觉得比较难理解而又难表述清楚的内容。

最后,我之前的Blog们都是一种类似于日记的存在,同时都在一个比较固定的圈子里,因而要开辟一些新的内容总觉得很困难,因为我自己会受到外界对我的印象的影响。一个新的Blog可以看成是一个新的开始,可以尝试我一直想做的关于专业关于科学关于艺术这些形而上一些为主题的Blog。于是,在内心又纠结了很久才舒服,同时又终于肯花钱(20欧/年)之后,就有了现在这个新的Blog.