Polyploidization and meiotic recombination, no beer

  • /Genome duplication increases meiotic recombination frequency: a Saccharomyces cerevisiae model/, Molecular Biology and Evolution, 2021. I only browsed this paper quickly, as it seems to use techniques outside my expertise. The main conclusion, though, is the title itself, and the effect is associated with “weakened recombination interference, enhanced double-strand break density and loosened chromatin histone occupation”. Also, as the authors claim, it is the first system to directly study the effects of polyploidization on meiotic recombination frequency. This paper reminds me of another paper, by Levi Yant et al. (2013), on positively selected genes in the tetraploid /Arabidopsis arenosa/, in which they found genes involved in meiosis, suggesting that dealing with chromosome segregation is crucial after polyploidization.

A bit more about TMM

This post is short, but I still want to publish it to keep up the writing mood. It is barely a paragraph though…

  • /A scaling normalisation method for differential expression analysis of RNA-Seq data/, Genome Biology, 2010. I spent some more time on TMM, but I would not say I fully understand the normalisation procedure. The basic assumption, however, is that most genes are not differentially expressed between samples, so the method looks for a scaling factor that minimises the log-fold changes of genes between samples. TMM stands for Trimmed Mean of M-values. The M-values are just the log-fold changes of genes between samples, and the trimmed mean simply means that some M-values (e.g. 30% from each tail) are removed before their mean is calculated.
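
To make the idea above concrete for myself, here is a minimal sketch of a trimmed mean of M-values in plain Python. It is a simplification and not the published procedure: as far as I understand, the real method also trims on absolute expression (A-values) and weights the remaining genes, and both steps are left out here. The function name `tmm_factor` and the toy counts are my own assumptions.

```python
import math

def tmm_factor(sample, reference, trim_m=0.3):
    """Simplified TMM scaling factor of `sample` relative to `reference`.

    Assumption: only the M-value trimming is implemented; the A-value
    trimming and the precision weights of the original method are omitted.
    """
    n_s, n_r = sum(sample), sum(reference)
    # M-value: log2 ratio of library-size-scaled counts,
    # computed only for genes observed in both samples.
    m_values = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0
    )
    # Drop a fraction `trim_m` of the M-values from each tail.
    k = int(len(m_values) * trim_m)
    kept = m_values[k:len(m_values) - k]
    # Back-transform the trimmed mean from log2 scale to a scaling factor.
    return 2 ** (sum(kept) / len(kept))

# Toy data: nine unchanged genes and one strongly induced gene.
reference = [100] * 10
sample = [100] * 9 + [1100]
factor = tmm_factor(sample, reference)
```

In the toy data the induced gene doubles the library size of `sample`, so every unchanged gene looks halved in proportion; the trimmed mean ignores the outlier and returns a factor of 0.5, which rescales the library size of `sample` back to that of `reference`.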

Quantile normalization, RLE, and TMM

There are just so many ways to normalise RNA-Seq data, even though one of the advantages of RNA-Seq is that “it can capture transcriptome dynamics across different tissues or conditions without sophisticated normalization of data sets.” (/RNA-Seq: a revolutionary tool for transcriptomics/, Nature Reviews Genetics, 2009) BTW, I wonder how many people actually consider these normalisations carefully. I would also like to know how robust RNA-Seq data sets are. Here are a few normalizations that I came across today:

  • Quantile normalization – /Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions/, Briefings in Bioinformatics, 2018. The purpose of this normalization is to obtain identical distributions of read counts among different samples. It first sorts the read counts of each sample in increasing order and leaves placeholders carrying the rank indices. Then it replaces the placeholders with the mean (or median) across samples of the first, second, and so on read counts. If there is a tie (i.e. two values sharing the placeholder i), it uses the mean of value(i) and value(i+1). There are some variants of quantile normalization, e.g. upper-quartile normalization or median normalization. These simply replace each original value with its ratio to the 75% quantile (or the 50% quantile, i.e. the median) of the values within each sample. The rest should be the same.
  • RLE stands for relative log expression – /RLE plots: Visualizing unwanted variation in high dimensional data/, PLoS ONE, 2018. For each gene, it calculates the median of the (log-scale) read counts across samples, then uses the differences between the read counts and that median to draw a box plot for each sample. This should remove variation among genes and leave only variation among samples, which should all have similar distributions with a median of zero.
  • TMM stands for Trimmed Mean of M-values – /A scaling normalization method for differential expression analysis of RNA-Seq data/, Genome Biology, 2010. I should look into this approach further tomorrow, but so far it seems to me that it considers the proportions of certain reads in a sample. It will also be interesting to read the first paper, /Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions/.
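
As a note to self, the quantile normalization described in the first item can be sketched in a few lines of plain Python. This is only my reading of the description, not code from the paper: the function name `quantile_normalize` is made up, and ties are handled by averaging the rank means over the whole run of tied values, which for a two-way tie is the mean of value(i) and value(i+1).

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of samples (one list of counts per sample).

    Assumption: all samples cover the same genes in the same order.
    """
    n_genes = len(samples[0])
    # Rank means: average the i-th smallest value across all samples.
    sorted_cols = [sorted(col) for col in samples]
    rank_means = [
        sum(col[i] for col in sorted_cols) / len(samples)
        for i in range(n_genes)
    ]
    normalized = []
    for col in samples:
        # Gene indices of this sample, ordered by increasing count.
        order = sorted(range(n_genes), key=lambda g: col[g])
        out = [0.0] * n_genes
        i = 0
        while i < n_genes:
            # Extend j over a run of tied values starting at rank i.
            j = i
            while j + 1 < n_genes and col[order[j + 1]] == col[order[i]]:
                j += 1
            # Tied genes all receive the mean of the rank means they span.
            tied_mean = sum(rank_means[i:j + 1]) / (j - i + 1)
            for pos in range(i, j + 1):
                out[order[pos]] = tied_mean
            i = j + 1
        normalized.append(out)
    return normalized

# Two toy samples with three genes each.
result = quantile_normalize([[5, 2, 3], [4, 1, 6]])
```

After normalization every sample contains exactly the same set of values (here 1.5, 3.5 and 5.5), only assigned to different genes, which is the "identical distributions" goal mentioned above.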

Duplication? No, I still want beer.

/Expression attenuation as a mechanism of robustness against gene duplication/, Proceedings of the National Academy of Sciences, 2011. Budding yeast is still a fine system for all kinds of experiments you can imagine. To introduce a duplicated gene, yes, use single-copy centromeric plasmids (pCEN). To measure PPIs, yes, “use a protein-fragment complementation assay (PCA) based on the dihydrofolate reductase (DHFR) enzyme (DHFR-PCA)” (…) then the colony sizes can tell you about it. Also, of course, GFP and RNA-Seq are just a piece of cake.
Apart from these magical experiments, here are a few more points:
1. Duplication of haploinsufficient genes seems to have a higher chance of deleterious effects than duplication of haplosufficient genes.
2. The fraction of deleterious duplications is similar for genes inside and outside protein complexes.
3. Duplication may affect PPIs, but it tends to increase the strength or number of PPIs of other subunits; the disturbance of PPIs, however, is not correlated with the selection coefficient.
4. At the protein level, protein abundance is attenuated after duplication (but still slightly higher than the pre-duplication level?). Other subunits disturbed by a duplication may increase their protein abundance, consistent with the increased number of PPIs.
5. Attenuation can occur at the transcriptional level as well as the posttranscriptional level; the latter is more frequent and requires translation of the mRNA.

Gene trees for synteny and phylotranscriptomes

Every day is cloudy, so how should I come up with a title… Putting two independent topics in one title does not make much sense, but it is fine for a blog.

  • /PhylDiag: identifying complex synteny blocks that include tandem duplications using phylogenetic gene trees/, BMC Bioinformatics, 2014: I’ve seen this paper cited by the SCORPiOs paper (Synteny-guided CORrection of Paralogies and Orthologies in gene trees, kind of a nonsense name, sorry…). It is similar to i-ADHoRe in identifying syntenic blocks (only for pairwise comparisons, though), but PhylDiag uses information from gene trees, including family members as well as orthologous and paralogous relationships. Note that the trees are built with TreeBeST, the one used in the Ensembl Compara pipeline. It is an old gene tree inference pipeline guided by a species tree and seems to be able to infer duplication and speciation events. SCORPiOs also reimplemented TreeBeST in their package, which might be helpful.
  • /Is phylotranscriptomics as reliable as phylogenomics?/, Molecular Biology and Evolution, 2020. Many phylogenetic studies nowadays use transcriptomes because RNA-Seq is relatively cheap and can scale up species sampling enormously. Here, the paper shows that orthology identification is the critical issue when using transcriptomes to infer phylogenies. In particular, orthologs identified by a tree-based approach developed by Yang and Smith (2014) produce phylogenetic trees that are more similar to phylogenomic trees than those from tree-free methods. For phylogenetic analysis, the purpose is to curate a robust dataset of single-copy genes, so orthology identification can discard a lot of data and retain only the best genes. In contrast, if all the gene families matter to the analysis (e.g., building gene trees with all unigenes), it is still unclear how comparable transcriptomes or unigenes are to predicted genes in genomes.


The weather is grey, as on most days here, so let me start with some colors by talking about a column in /Nature Methods/ in 2010, written (mainly) by Bang Wong, who is now the creative director of the Broad Institute.

  • /Color coding/ and /Mapping quantitative data to color/, Nature Methods, 2010. Colors have three primary components: hue, saturation, and lightness. For categorical data, selecting a series of different hues together with increasing saturation and lightness seems a great way to distinguish the categories. But it would be better to keep the number of categories no larger than six. For quantitative data, the main principle is to keep the hue and adjust the saturation. If there are two categories in the data, use two hues. BTW, the whole column looks like a great starting point for data visualization, so some of its pieces will appear here again.
  • /Heterozygous, polyploid, giant bacterium, Achromatium, possesses an identical functional inventory worldwide across drastically different ecosystems/, Molecular Biology and Evolution, 2020. Polyploids in bacteria! They are hyperpolyploids, usually with 300 chromosomes! All lineages (or strains) have more or less the same gene inventory, so it is believed that gene expression is regulated according to their habitats, e.g., from fresh to saline water.
  • /Uncovering a novel function of the CCR4-NOT complex in phytochrome A-mediated light signaling in plants/, eLife, 2021. Phytochromes are photoreceptors for red and far-red light, while phytochrome A is mainly for far-red light. Their response to light forms a switch that senses the red:far-red ratio, so that plants can distinguish sunlight from canopy shade. A gene, NOT9B, duplicated through a WGD in Brassicaceae (not sure whether alpha or beta), gained a new function: negatively regulating the CCR4-NOT complex. It can bind both CCR4-NOT (the scaffold protein NOT1) and Phytochrome A, while its paralog NOT9A can interact only with NOT1, not with Phytochrome A. In the dark, NOT9B binds to NOT1 and hence silences the activity of CCR4-NOT; in the light, NOT9B interacts with Phytochrome A, thus releasing CCR4-NOT, which triggers cascade reactions such as far-red-specific gene expression and isoform splicing.
  • /Synteny guided resolution of gene trees clarifies the functional impacts of whole-genome duplications/, Molecular Biology and Evolution, 2020. It is an intuitive idea to adjust gene tree topologies so that they obey the syntenic relationships generated by WGD, and then to compare the adjusted topologies with the ML trees to see whether they are nearly-ML trees. One assumption behind the implementation, as far as I understand, is that orthologous syntenic pairs are more similar than paralogous syntenic pairs. Hence the method maximizes the deltaS score by threading (artificial?) syntenic regions. It then identifies orthologous and paralogous syntenic regions among species and proposes gene trees consistent with the synteny.

Snowy day in April 2021

Thinking about recording the journey of paper reading, I find it might be fun to write down some *stupid thoughts* and *immature opinions* in a series that I call ‘WhatToReadTody’. This may sound like paper reviews/recommendations, but it is not, at least not at present, although I hope it could be in the future (and I will continue writing).

  • /Charting the genomic landscape of seed-free plants/, Nature Plants, 2021. The paper reviews several currently available seed-free plant genomes, but provides few insights with respect to Bioinformatics or Genome Evolution. I just quickly browsed a few sections and figures, though. A few points that stuck in my mind:
    1. Seed-free plants do not have many WGDs. At first, this was thought to be due to the presence of mature sex chromosomes, but the genome of a moss (Ceratodon purpureus) with an ancient sex chromosome system later showed evidence of ancient WGD as well. Many animal genomes already provide evidence against this hypothesis, and polyploidization might not be too rare in animals.
    2. There is still some collinearity between seed-free plants and seed plants.
  • /New prospects in the detection and comparative analysis of hybridisation in the tree of life/, American Journal of Botany, 2014. Basically, I only checked Figure 1: integrating gene order (collinearity) and gene trees may hint at the hybridisation histories of different species. I thought this was first shown in mammals (if not mice?), which I read about in the book /Tree thinking/.
  • /The effects of Arabidopsis genome duplication on chromatin organisation and transcriptional regulation/, Nucleic Acids Research, 2018. This study generated Hi-C and ChIP-Seq data for diploid and newly made tetraploid Arabidopsis (Col-0). They found more inter-chromosomal interactions in tetraploids and differences in histone methylation between diploids and tetraploids. Differentially expressed genes were analysed without spike-ins, although it is still difficult to tell whether a spike-in system is required (theoretically yes, I think).
  • /Altered chromatin architecture and gene expression during polyploidization and domestication of soybean/, The Plant Cell, 2021. I came across this paper by checking the papers citing the previous one. Hi-C and various epigenomic data for soybean, wild soybean and common bean were produced in this study. Genes retained after the latest WGD in the soybean genome have long-range chromosome interactions, higher gene expression and higher chromatin accessibility, but lower DNA methylation. It might be interesting to further classify the WGD-retained genes to figure out whether chromatin architecture is correlated with duplicate fates.

A little bit of RNA-Seq (2/n)

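Ideally, when people compare gene expression levels, what they would like to compare is the absolute difference in the expression of a gene between any two cells, although that is the highest-resolution scenario. In practice, however, what gets compared is relative expression. For qRT-PCR, it is the difference between two samples in the expression of a gene relative to an endogenous control, which is usually a gene whose expression is not too high. For RNA-Seq, it is the difference between two samples in the proportion of reads that a gene takes up. The implicit assumption here is therefore that the transcriptome size must not change drastically between the two samples. This assumption is usually broken down into two frequently mentioned assumptions:

  1. The expression of most genes does not change
  2. The expression of highly expressed genes does not change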


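When calculating expression levels from RNA-Seq, the main things to account for are sequencing depth and gene length. RPKM and FPKM correspond to single-end and paired-end sequencing, respectively. FPKM simply counts a paired-end read as one mapped fragment, no matter whether both mates or only one mate maps to the gene. In both cases, the number of reads/fragments on a gene is first divided by the number of all mapped reads (in millions) and then by the gene length (in kilobases). So the names RPKM and FPKM are quite misleading; more precisely, they should be Reads (Fragments) per Million reads per Kilobase.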
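
A minimal sketch of the RPKM arithmetic in Python, just to pin down the order of the two divisions. The function name `rpkm` and the toy numbers are my own, and I assume here that the total number of mapped reads is simply the sum of the per-gene counts, which real pipelines may define differently.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of gene per million mapped reads.

    Assumption: the library size is the sum of the per-gene counts.
    """
    per_million = sum(counts) / 1e6
    # Depth first (per million reads), then gene length (per kilobase).
    return [c / per_million / (length / 1e3)
            for c, length in zip(counts, lengths_bp)]

# Two genes with equal read counts; the second gene is twice as long.
values = rpkm([500_000, 500_000], [1000, 2000])
```

With equal counts, the gene that is twice as long ends up with half the RPKM, which is the whole point of the length correction.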

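Another, more popular, approach is TPM. It simply divides by the gene length first and then by the total, where the total is the sum of these length-normalised values over all genes rather than the raw number of mapped reads, so that the expression values of every sample sum to the same number (one million, i.e. 100%). The full name of TPM, Transcripts Per Million, is even less informative, though (names in the RNA-Seq field are given rather casually and carry little real meaning).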
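
The TPM calculation can be sketched the same way. Again the function name `tpm` and the toy numbers are assumptions of mine; the only thing the sketch is meant to show is that the length correction comes first and that the values of every sample then sum to one million.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million from per-gene counts and gene lengths."""
    # Step 1: length-normalise each gene (reads per kilobase).
    rates = [c / (length / 1e3) for c, length in zip(counts, lengths_bp)]
    # Step 2: rescale so that the values of the sample sum to one million.
    scale = sum(rates) / 1e6
    return [r / scale for r in rates]

# Two samples with very different sequencing depth but the same composition.
shallow = tpm([500, 500], [1000, 2000])
deep = tpm([500_000, 500_000], [1000, 2000])
```

Both toy samples give the same TPM values although one is sequenced a thousand times deeper, because only the proportions survive the second step.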

A little bit of RNA-SEQ (1/n)

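RNA-Seq is a sequencing method that was first invented to measure gene expression levels in tissues or cells. Later the technique also took on the tasks of exploring the gene space and identifying SNPs, especially when no reference genome is available. The basic procedure of conventional RNA-Seq is to enrich the RNA of interest, e.g. mRNA, by some experimental means, reverse-transcribe the RNA into cDNA, and then sequence the cDNA. Newer RNA-Seq methods improve on the sampling step and the cDNA step, respectively: single-cell sequencing (sampling) and direct RNA sequencing (dRNA-Seq). One class of derivatives of single-cell sequencing uses frozen tissue sections to determine the spatial order of gene expression, and there are also sequencing approaches similar to in situ hybridization.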

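However, because RNA itself takes many forms and takes part in life processes in various ways, RNA-Seq methods have been extended to many aspects of RNA biology. By enriching the RNA of interest in different ways, the sequences of different RNAs can be obtained. For example, pulling down RNA polymerase II with antibodies yields RNAs that are being transcribed, which captures the 5’ ends of RNAs better and determines transcription start sites more reliably. High-speed centrifugation can separate mRNAs carrying many ribosomes from those carrying few, to study the relationship between RNA and translation (assuming that the number of ribosomes is positively correlated with the amount of translation). One can also enrich structured fragments of RNA, or the parts that interact with other RNAs and proteins, to obtain information on RNA structure and interactions.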

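Every step of RNA-Seq has many pitfalls and uncertainties. Because sequencing used to be so expensive, a lot of less-than-satisfactory data may have been produced in the past. In routine DGE analysis there are several common biases. For instance, RNA degradation biases the sequenced fragments towards the 3’ end. After fragmentation, the GC content of the fragments leads to different PCR efficiencies: fragments with high GC content amplify less efficiently, so they end up under-represented among the final reads. When synthesizing cDNA, random hexamer primers are sometimes used, but they bind RNA with different efficiencies, so the nucleotide frequencies at the start positions of reads are biased.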

Ortholog Detection by Blast+

The most highly cited tool in Bioinformatics, Blast, was rewritten in C++ in 2009. Although it is released with a Perl compatibility script for blastall users, the parameters of Blast+ are quite different from those of its predecessor. The changes in parameters may affect people who use blast to detect orthology relationships by reciprocal best hits (RBH), OrthoMCL, etc.

For the RBH approach, Moreno-Hagelsieb and Latimer claim in their Bioinformatics paper that -F “m S” in blastall is the best choice for balancing accuracy and running time. The corresponding parameters in blast+ are then -seg yes -soft_masking true. They also kindly provide, on their lab blog, a list of equivalent parameters between blastall and blast+ for the settings used in their Bioinformatics paper.

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6
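
Once both directions have been blasted, picking the reciprocal best hits from the two tabular outputs takes only a few lines. Below is a minimal Python sketch of my own, not the procedure of the paper: it assumes the default -outfmt 6 column order (query in column 1, subject in column 2, bit score in column 12), defines "best" as the highest bit score and keeps the first hit when scores tie. The function names are hypothetical.

```python
def best_hits(lines):
    """Map each query to its highest-scoring subject in -outfmt 6 lines."""
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Default -outfmt 6: qseqid, sseqid, ..., evalue, bitscore.
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        # Assumption: on equal bit scores the first hit seen is kept.
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)
```

Both arguments can be any iterables of lines, so open file handles of the two blast.out files (species A against B, and B against A) work directly.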

For OrthoMCL, it is better to use the same masking strategy as RBH, but OrthoMCL has another issue. Because it can deal with multiple species at the same time, the default limit on the number of alignments may be too low for some very large ortholog groups. Thus, the authors suggest setting -v 100000 -b 100000 in blastp to avoid missing any homologs. Well, actually only -b matters here, as it sets the upper limit on the number of database sequences to show alignments for, whereas -v only works when the output format is -m 0 or -m 6, neither of which is used in OrthoMCL. The equivalent of -b in blast+ is -num_alignments, so for OrthoMCL:

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6 \
-num_alignments 100000

The story never ends so early, as blast+ has another option, -max_target_seqs, which controls the number of aligned sequences to keep for any tabular format (outfmt > 4). It is also incompatible with -num_descriptions and -num_alignments, which are used for output with separate definition-line and alignment sections. So I think it is better to just use -max_target_seqs in blastp for OrthoMCL:

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6 \
-max_target_seqs 100000

There seem to be some new features in blast+, like database masking, which for sure can reduce running time, but their effects on orthology detection are unclear so far (as far as I know).