A little bit of RNA-Seq (2/n)

理想状态下,人们进行基因表达量的比较分析时,希望比较的应该是任意两个细胞之间基因表达量的绝对差异,虽然这是一种resolution最高的状态。然而,实际在比较的时候,比较的是基因的相对表达量。对于qRT-PCR,是一个基因对应于内参(endogenous control)的表达在两个样本中的差异,endogenous control一般选取表达量不太高的基因。对于RNA-Seq来说,是一个基因在两个样本中proportion的差异。因此,这里其实暗含的假设是:两个样本间的转录组大小不能有巨大的变化。这个假设被分解成两个经常被提及的假设,即:

  1. 大多数基因的表达量没有发生变化
  2. 高表达量的基因的表达量没有发生变化

不过,这个假设其实往往是不成立的,无论是在样本本身,或是制备及测序过程中,都可能引入偏差。样本本身的问题是相对最难控制的,比如在比较二倍体和多倍体时,这就是个不能忽略的问题。现在通用的计算表达量的方法,很明显的就是在计算一个proportion。

通常用于计算RNA-Seq表达量时,主要考虑的是测序深度和基因长度。RPKM和FPKM分别对应于single-end和pair-end sequencing。FPKM只是对于pair-end reads无论两个片段都map到还是一个片段map到基因上,都算成一次mapped fragment。这两者的具体做法就是把基因上reads/fragments的数量先处以所有map到的reads的数量,然后再处以基因长度。所以RPKM和FPKM是名字是很误导人的,准确的说应该是Reads (Fragements) per Million reads per Kilobase。

另外一种更流行的做法时TPM,具体的做法其实只是先除以基因的长度,再处以所有mapped reads数量,这样各个样本表达量的总和就会是相同的数值(即100%)。但TPM的全名更没有实际意义,Transcripts Per Million(RNA-Seq领域起名都很随意而没有实际意义)。

A little bit of RNA-SEQ (1/n)

RNA-Seq是一种测序方法,最先发明这种方法的目的是为了测定组织或者细胞内基因的表达量。后来这种技术又兼具了探索gene space和SNP鉴定的任务,尤其在没有可参考的基因组的情况下。传统RNA-Seq的基本过程是通过一定的实验手段将所需测序的RNA进行富集,如mRNA,然后将RNA反转录成cDNA,再对cDNA进行测序。现在比较新的RNA-Seq方法分别在样本提取和cDNA进行了改进,即单细胞测序(样本)和直接测序RNA(dRNA-Seq)。单细胞测序的一类衍生做法是使用组织冷冻切片,从而确定基因表达的空间秩序,也有类似于原位杂交的测序手段。

不过由于RNA本身具有多种形式,又通过各种方式参与生命过程,因此,RNA-Seq的方法延伸到了研究RNA biology的许多方面。通过采取不同的手段对所关注的RNA进行富集,便可以获取不同RNA的序列。比如通过抗体吸附RNA聚合酶II而获得正在转录中的RNA,可以更好的获得5’RNA并比较好的确定转录起始位。通过高速离心的方法区分富有核糖体和少有核糖体的mRNA,以研究RNA和翻译之间的关系(假设是核糖体的数量与翻译量正相关)。也可以通过富集RNA的结构片段或其与其他RNA及蛋白质相互作用的部分,从而获得与RNA结构和相互作用的结果。

RNA-Seq的每一步都有很多陷阱和不确定性。过去因为测序成本太高,可能产生了许多不尽如人意的数据。在常规的DGE分析中,有几个比较常见的偏差,比如RNA的降解使的测得的片段偏向3‘。破碎后片段中GC含量会导致PCR效率不同,高GC含量的片段PCR效率比较低,导致最终reads里这类片段比较少。合成cDNA时,又是会使用random hexmer primers,但其与RNA的结合效率不同,所以起始测序位点的核酸频率会出现偏差。

Ortholog Detection by Blast+

The most highly cited tool in Bioinformatics, Blast, has been rewritten by C++ since 2009. Although released with a compatible perl script for blastall users, the parameters of Blast+ are quite different from those of its predecessor. The changes of parameters may influence people who use blast to detect orthology relationships by reciprocal best hits, or orthomcl, etc.

For RBH approach, Moreno-Hagelsieb and Latimer claim in their Bioinformatics paper that -F “m S” in blastall is the best choice balancing accuracy and running time. The corresponding parameters in blast+ are -seg yes -soft_masking true then. They also kindly provide a list of equivalent parameters used in their Bioinformatics paper between blastall and blast+ on their lab blog.

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6

For OrthoMCL, it is better to use the same masking strategy as RBH uses, but OrthoMCL has another issue. Because it can deal with multiple species at the same time, for some very large ortholog groups, the default limit on the number of alignments may be too low in some cases. Thus, the author suggests set -v 100000 -b 100000 in blastp to avoid missing any homologs. Well, actually only -b matters here, as it sets the upper limit on number of database sequence to show alignments for and -v only works when the output format is in -m 0 or -m 6. Neither of them is used in OrthoMCL. The equivalent parameter of -b in blast+ is -num_alignments, so for OrthoMCL:

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6 \
-num_alignments 100000

The story never ends so early, as blast+ has another -max_target_seqs, which can control the number of aligned sequences to keep for any tabular formats (outfmt > 4). It is also incompatible with the ones, i.e. -num_descriptions and -num_alignments, used in output with separate definition line and alignment sections. So I think it is better to just use -max_target_seqs in blastp+ for OrthoMCL:

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6 \
-max_target_seqs 100000

There seem to be some new characters in blast+, like database masking, which for sure can reduce running time, but their effects on orthology detection are unclear so far (as far as I know)

__END__