Polyploidization and meiotic recombination, no beer

  • /Genome duplication increases meiotic recombination frequency: a Saccharomyces cerevisiae model/, Molecular Biology and Evolution, 2021. I only quickly browsed this paper, which seems to use technologies that are outside my expertise. The main conclusion, though, is the title itself, and the effect is associated with “weakened recombination interference, enhanced double-strand break density and loosened chromatin histone occupation”. Also, as claimed by the authors, it is the first system to directly study the effects of polyploidization on meiotic recombination frequency. This paper reminds me of another paper by Levi Yant et al. (2013) on positively selected genes in the tetraploid /Arabidopsis arenosa/, in which they found genes involved in meiosis, suggesting that dealing with chromosome segregation is crucial after polyploidization.

A bit more about TMM

This blog is short, but I still want to post it to keep the writing mood. It is barely a paragraph though…

  • /A scaling normalization method for differential expression analysis of RNA-Seq data/, Genome Biology, 2010. I spent some more time on TMM, but I would not say I fully get the normalisation procedure. However, the basic assumption is that most genes are not differentially expressed between samples, so the method looks for a scaling factor that minimises the log-fold changes of genes between samples. TMM stands for Trimmed Mean of M-values. The M values are just the log-fold changes of genes between samples, and the Trimmed Mean simply means that some M values (e.g. 30% of the M values on each side) are removed before calculating their mean.
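To make the idea concrete for myself, here is a minimal sketch of the trimmed mean of M values in Python. It is a simplification and not the published procedure: as far as I understand, the real method also trims on overall abundance (the A values) and takes a weighted mean, both of which are left out here. The function name `trimmed_mean_m` and the toy counts are my own.

```python
import math


def trimmed_mean_m(sample, reference, trim=0.3):
    """Simplified TMM-style scaling factor of `sample` relative to `reference`.

    Assumption: unweighted mean, trimming only on M values.
    """
    total_s, total_r = sum(sample), sum(reference)
    m_values = []
    for s, r in zip(sample, reference):
        # Genes with a zero count in either sample have no finite log ratio.
        if s > 0 and r > 0:
            m_values.append(math.log2((s / total_s) / (r / total_r)))
    if not m_values:
        raise ValueError("no gene is expressed in both samples")
    m_values.sort()
    # Drop the `trim` fraction of M values at each end before averaging.
    k = int(len(m_values) * trim)
    kept = m_values[k:len(m_values) - k]
    return 2 ** (sum(kept) / len(kept))


reference = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
# Same as the reference, except the last gene is strongly up-regulated.
sample = [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000]
factor = trimmed_mean_m(sample, reference)
print(round(factor, 4))
```

In this toy case, one highly expressed gene eats up a large share of the reads, so every other gene looks down-regulated by the same ratio. The trimming discards that outlier, and dividing the sample's read proportions by the factor brings the bulk of genes back to a log-fold change of zero.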

Quantile normalization, RLE, and TMM

There are just so many ways to normalise RNA-Seq data, even though one of the advantages of RNA-Seq is that “it can capture transcriptome dynamics across different tissues or conditions without sophisticated normalization of data sets.” (/RNA-Seq: A revolutionary tool for transcriptomics/, Nature Reviews Genetics, 2009) BTW, I was wondering how many people actually consider these normalisations carefully. Or rather, I’d like to know how robust RNA-Seq data sets are. Here are a few normalizations that I came across today:

  • Quantile normalization – /Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions/, Briefings in Bioinformatics, 2018. The purpose of this normalization is to obtain identical distributions of read counts among different samples. It first sorts the read counts of each sample in increasing order and leaves placeholders with the order indices. Then it uses the mean (or median) of the first, second, and so on read counts among samples to replace the placeholders. If there is a tie (i.e. two values with the same placeholder i), then it uses the mean of value(i) and value(i+1). There are some variants of quantile normalization, e.g. upper-quartile normalization or median normalization. These simply replace each original value with the original value divided by the 75% quantile (or 50% quantile, i.e. median) of the read counts in its sample. The rest should be the same.
  • RLE stands for relative log expression – /RLE plots: Visualizing unwanted variation in high dimensional data/, PLoS ONE, 2018. For each gene, it calculates the median of the (log) read counts among samples and then uses the differences between the read counts and that median to draw a box plot per sample. The approach should remove variation among genes and leave only variation among samples, which should all have similar distributions with a median of zero.
  • TMM stands for Trimmed Mean of M values – /A scaling normalization method for differential expression analysis of RNA-Seq data/, Genome Biology, 2010. I should look into this approach further tomorrow, but so far it seems to me that it considers the proportions of certain reads in a sample. It will also be interesting to read the first one, /Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions/.
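The first two bullets can be sketched in a few lines of Python. This is only my reading of the descriptions above, not code from either paper: `quantile_normalize` and `relative_log_expression` are names I made up, ties are resolved by averaging the rank means they span, and the pseudocount of 1 before taking logs is my own assumption to avoid log(0).

```python
import math
from statistics import median


def quantile_normalize(samples):
    """samples: one list of counts per sample, all with the same gene order."""
    n_genes = len(samples[0])
    sorted_cols = [sorted(col) for col in samples]
    # Mean of the i-th smallest value across samples.
    rank_means = [sum(col[i] for col in sorted_cols) / len(samples)
                  for i in range(n_genes)]
    normalized = []
    for col in samples:
        order = sorted(range(n_genes), key=lambda g: col[g])
        out = [0.0] * n_genes
        i = 0
        while i < n_genes:
            # Extend j over a run of tied values.
            j = i
            while j + 1 < n_genes and col[order[j + 1]] == col[order[i]]:
                j += 1
            tied_value = sum(rank_means[i:j + 1]) / (j - i + 1)
            for k in range(i, j + 1):
                out[order[k]] = tied_value
            i = j + 1
        normalized.append(out)
    return normalized


def relative_log_expression(samples, pseudocount=1.0):
    """Per-gene deviation of each sample's log count from the gene's median."""
    logs = [[math.log2(c + pseudocount) for c in col] for col in samples]
    n_genes = len(samples[0])
    rle = [[0.0] * n_genes for _ in samples]
    for g in range(n_genes):
        gene_median = median(col[g] for col in logs)
        for s, col in enumerate(logs):
            rle[s][g] = col[g] - gene_median
    return rle


counts = [[5, 2, 3, 4], [4, 1, 4, 2], [3, 4, 6, 8]]  # three samples, four genes
print(quantile_normalize(counts))
print(relative_log_expression(counts))
```

After quantile normalization, every sample without ties holds exactly the same set of values, only assigned to different genes. The RLE values of each sample are what would go into one box of the box plot.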

Duplication? No, I still want beer.

/Expression attenuation as a mechanism of robustness against gene duplication/, Proceedings of the National Academy of Sciences, 2011. Budding yeast is still a fine system to do all kinds of imaginary experiments. To introduce a duplicated gene, yes, use single-copy centromeric plasmids (pCEN). To measure PPIs, yes, “use a protein-fragment complementation assay (PCA) based on the dihydrofolate reductase (DHFR) enzyme (DHFR-PCA)” (…) then the colony sizes can tell you about it. Also, of course, GFP and RNA-Seq are just a piece of cake.
Apart from these magical experiments, here are a few more things:
1. Duplication of haploinsufficient genes seems to have a higher chance of deleterious effects than duplication of haplosufficient genes.
2. The fraction of deleterious duplications is similar for genes inside and outside protein complexes.
3. Duplication may affect PPIs, and it tends to increase the strength or amount of the PPIs of other subunits. However, the disturbance of PPIs is not correlated with the selection coefficient.
4. At the protein level, the amounts of protein are attenuated after duplication (but still slightly higher than the pre-duplication level?). Other subunits disturbed by a duplication may increase their protein abundance, consistent with an increased amount of PPIs.
5. Attenuation can occur at the transcriptional level as well as the posttranscriptional level, and the latter is more frequent and requires the translation of mRNA.

Gene trees for synteny and phylotranscriptomes

Every day is cloudy, so how should I come up with a title… Putting two independent topics in one title does not make much sense, but it is fine for a blog.

  • /PhylDiag: identifying complex synteny blocks that include tandem duplications using phylogenetic gene trees/, BMC Bioinformatics, 2014: I’ve seen this paper cited by the SCORPiOs paper (Synteny-guided CORrection of Paralogies and Orthologies in gene trees, kind of a nonsense name, sorry…). It is similar to i-ADHoRe in identifying syntenic blocks (only for pairwise comparisons, though), but PhylDiag uses information from gene trees, including family members as well as orthologous and paralogous relationships. Note that the trees are built with TreeBeST, the one used in the EnsemblCompara pipeline. It is an old gene tree inference pipeline guided by a species tree and seems to be able to infer duplication and speciation events. SCORPiOs also reimplemented TreeBeST in their package, which might be helpful.
  • /Is phylotranscriptomics as reliable as phylogenomics?/, Molecular Biology and Evolution, 2020. Many phylogenetic studies nowadays use transcriptomes because RNA-Seq is relatively cheap and can scale up species sampling enormously. Here, the paper shows that orthology identification is the critical issue when using transcriptomes to infer phylogenies. In particular, orthologs identified by a tree-based approach developed by Yang and Smith (2014) produce phylogenetic trees that are more similar to phylogenomic trees than those from tree-free methods. For phylogenetic analysis, the purpose is to curate a robust dataset of single-copy genes, so orthology identification can drop lots of data and retain only the best genes. In contrast, if all the gene families matter to the analysis (e.g., building gene trees with all unigenes), it is still unclear how well transcriptomes or unigenes compare with predicted genes in genomes.


The weather is grey as on most days here, so let me start with some colors by talking about a column in /Nature Methods/ in 2010, written (mainly) by Bang Wong, who is now the creative director of the Broad Institute.

  • /Color coding/ and /Mapping quantitative data to color/, Nature Methods, 2010. Colors have three primary components: hue, saturation, and lightness. For categorical data, selecting a series of different hues together with increasing saturation and lightness seems a good way to distinguish the categories. But it would be better to keep the number of categories to no more than six. For quantitative data, the main principle is to keep the hue but adjust the saturation. If there are two categories in the data, use two hues. BTW, the whole column looks great as a starting point for data visualization, so some of its articles will appear again here.
  • /Heterozygous, polyploid, giant bacterium, Achromatium, possesses an identical functional inventory worldwide across drastically different ecosystems/, Molecular Biology and Evolution, 2020. Polyploidy in bacteria! They are hyperpolyploids, usually with 300 chromosomes! All lineages (or strains) have more or less the same gene inventory, so it is believed that gene expression is regulated according to their habitats, e.g., from fresh to saline water.
  • /Uncovering a novel function of the CCR4-NOT complex in phytochrome A-mediated light signaling in plants/, eLife, 2021. Phytochromes are photoreceptors for red and far-red light, while phytochrome A is mainly for far-red light. Their response to light forms a switch that senses the ratio of red:far-red light so that plants can tell sunlight from canopy shade. A gene, NOT9B, duplicated through the WGD in Brassicaceae (not sure whether alpha or beta), has gained a new function in regulating the CCR4-NOT complex negatively. It can bind both CCR4-NOT (the scaffold protein NOT1) and Phytochrome A, while its paralog NOT9A can only interact with NOT1 and not with Phytochrome A. When it is dark, NOT9B binds to NOT1 and hence silences the activity of CCR4-NOT; when there is light, NOT9B interacts with Phytochrome A, thus releasing CCR4-NOT, which triggers cascade reactions like far-red specific gene expression and isoform splicing.
  • /Synteny guided resolution of gene trees clarifies the functional impacts of whole-genome duplications/, Molecular Biology and Evolution, 2020. It is an intuitive idea to adjust gene tree topologies so that they obey the syntenic relationships generated by WGD, and then to compare the adjusted topologies with the ML trees to see if they are nearly-ML trees. One assumption behind the implementation, as far as I understand, is that orthologous syntenic pairs are more similar than paralogous syntenic pairs. Hence the method maximizes the deltaS score by threading (artificial?) syntenic regions. It then identifies orthologous and paralogous syntenic regions among species and proposes gene trees consistent with the synteny.

Snowing day in April 2021

Thinking about recording the journey of paper reading, I find it might be fun to write down some *stupid thoughts* and *immature opinions* in a series that I call ‘WhatToReadTody’. This may sound like a paper review/recommendation, but it is not, at least not at present, although I hope it could be in the future (and I will continue writing).

  • /Charting the genomic landscape of seed-free plants/, Nature Plants, 2021. The paper reviews several currently available seed-free plant genomes, but provides few insights with respect to Bioinformatics or Genome Evolution. I just quickly browsed a few sections and figures though. A few points that stuck in my mind:
    1. Seed-free plants do not have many WGDs. At first, this was thought to be due to the presence of mature sex chromosomes, but later the genome of a moss (Ceratodon purpureus) with an ancient sex chromosome system also showed evidence of an ancient WGD. Many animal genomes already provide evidence against this hypothesis, and polyploidizations might not be too rare in animals.
    2. There is still some collinearity between seed-free plants and seed plants.
  • /New prospects in the detection and comparative analysis of hybridisation in the tree of life/, American Journal of Botany, 2014. Basically, I only checked Figure 1: integrating gene order (collinearity) and gene trees may hint at the hybridisation histories of different species. I thought this was first shown in mammals (if not mice?), which I read in the book /Tree thinking/.
  • /The effects of Arabidopsis genome duplication on chromatin organisation and transcriptional regulation/, Nucleic Acids Research, 2018. This study generated Hi-C data and ChIP-Seq data for diploid and novel tetraploid Arabidopsis (Col-0). They found more inter-chromosome interactions in tetraploids and differences in histone methylation between diploids and tetraploids. Differentially expressed genes were analysed without spike-ins, although it is still difficult to tell if the spike-in system is required (theoretically yes, I think).
  • /Altered chromatin architecture and gene expression during polyploidization and domestication of soybean/, The Plant Cell, 2021. I came across this paper by checking papers citing the previous one. Hi-C and various epigenomic data for soybean, wild soybean and common bean are produced in this study. Genes retained after the latest WGD in the soybean genome have long-range chromosome interactions, higher gene expression, and higher chromatin accessibility, but lower DNA methylation. It might be interesting to further classify WGD-retained genes to figure out if chromatin architecture is correlated with duplicate fates.

Ortholog Detection by Blast+

The most highly cited tool in Bioinformatics, Blast, has been rewritten in C++ since 2009. Although it is released with a compatibility Perl script for blastall users, the parameters of Blast+ are quite different from those of its predecessor. The changes in parameters may affect people who use blast to detect orthology relationships by reciprocal best hits (RBH), OrthoMCL, etc.

For the RBH approach, Moreno-Hagelsieb and Latimer claim in their Bioinformatics paper that -F “m S” in blastall is the best choice for balancing accuracy and running time. The corresponding parameters in blast+ are then -seg yes -soft_masking true. They also kindly provide, on their lab blog, a list of equivalent blastall and blast+ parameters for the settings used in their Bioinformatics paper.

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6
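Once blastp has been run in both directions (species A against B, and B against A), picking the reciprocal best hits from the two tabular outputs is simple. Below is a minimal Python sketch; it assumes the default -outfmt 6 column order, where column 1 is the query, column 2 the subject, and column 12 the bit score. The function names and the file names in the usage note are my own, and “best” here simply means the highest bit score, with the first hit kept on ties.

```python
def best_hits(lines):
    """Map each query ID to its (subject ID, bit score) with the highest bit score.

    `lines` is any iterable of tab-separated BLAST -outfmt 6 lines,
    e.g. an open file handle.
    """
    best = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        # Keep the first hit seen for a query unless a later one scores higher.
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


def reciprocal_best_hits(a_vs_b_lines, b_vs_a_lines):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b_lines)
    reverse = best_hits(b_vs_a_lines)
    pairs = []
    for a, (b, _score) in forward.items():
        if b in reverse and reverse[b][0] == a:
            pairs.append((a, b))
    return sorted(pairs)
```

With two hypothetical output files, usage would look like `reciprocal_best_hits(open("a_vs_b.out"), open("b_vs_a.out"))`.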

For OrthoMCL, it is better to use the same masking strategy as for RBH, but OrthoMCL has another issue. Because it can deal with multiple species at the same time, the default limit on the number of alignments may be too low for some very large ortholog groups. Thus, the authors suggest setting -v 100000 -b 100000 in blastp to avoid missing any homologs. Well, actually only -b matters here, as it sets the upper limit on the number of database sequences to show alignments for, while -v only works when the output format is -m 0 or -m 6. Neither of them is used in OrthoMCL. The equivalent parameter of -b in blast+ is -num_alignments, so for OrthoMCL:

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6 \
-num_alignments 100000

The story does not end there, as blast+ has another parameter, -max_target_seqs, which controls the number of aligned sequences to keep for any tabular format (outfmt > 4). It is also incompatible with -num_descriptions and -num_alignments, which are used for output with separate definition line and alignment sections. So I think it is better to just use -max_target_seqs in blast+ for OrthoMCL:

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6 \
-max_target_seqs 100000

There seem to be some new features in blast+, like database masking, which for sure can reduce running time, but their effects on orthology detection are unclear so far (as far as I know).