Quantile normalization, RLE, and TMM

There are just so many ways to normalise RNA-Seq data, even though one of the advantages of RNA-Seq is “it can capture transcriptome dynamics across different tissues of conditions without sophisticated normalization of data sets.” (/RNA-Seq: A revolutionary tool for transcriptomes/, Nature Review Genetics, 2009, ) BTW, I was wondering how many people actually take careful considerations on these normalisations. Or I’d like to know how robust RNA-Seq data sets are. Here are a few normalizations that came across today

  • Quantile normalization – /Selection between-sample RNA-Seq normalization methods from the perspective for their assumptions/, Briefings in Bioinformatics, 2018. The purpose for this normalization is to obtain identical distributions for read counts among different samples. It first orders read counts for each sample increasingly and leaves placement holders with the order indices. Then it uses the mean (or median) of the first, second, and so on read counts among samples to replace the placement holders. If there is a tie (aka with two values with identical placement holders i), then it uses the mean between value(i) and value (i+1). There are some varieties of the quantile normalization, e.g. upper quantile normalization or median quantile normalization. These are just approaches that replace the the original value by the division between the original value and the 75% quantile (or 50% quantile, i.e. median) of the values among samples. The rest should be the same.
  • RLE stands for relative log expression – /RLE plots: Visualizing unwanted variation in high dimensional data/, PLoS one, 2018. For each gene, it calculates a median of read counts among samples than uses the differences between read counts and the median to draw a box plot. The approach should remove variations among genes and only leave variations among samples, which all have similar distributions with a median of zero.
  • TMM stands for Trimmed Mean of M values – /A scaling normalization method for differential expression analysis of RNA-Seq data/, Genome Biology, 2010. I should look into this approach further tomorrow, but so far it seems to me that it considers the proportions of certain reads in a sample. It will be also interesting to read the first one /Selection between-sample RNA-Seq normalization methods from the perspective for their assumption/.

Ortholog Detection by Blast+

The most highly cited tool in Bioinformatics, Blast, has been rewritten by C++ since 2009. Although released with a compatible perl script for blastall users, the parameters of Blast+ are quite different from those of its predecessor. The changes of parameters may influence people who use blast to detect orthology relationships by reciprocal best hits, or orthomcl, etc.

For RBH approach, Moreno-Hagelsieb and Latimer claim in their Bioinformatics paper that -F “m S” in blastall is the best choice balancing accuracy and running time. The corresponding parameters in blast+ are -seg yes -soft_masking true then. They also kindly provide a list of equivalent parameters used in their Bioinformatics paper between blastall and blast+ on their lab blog.

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6

For OrthoMCL, it is better to use the same masking strategy as RBH uses, but OrthoMCL has another issue. Because it can deal with multiple species at the same time, for some very large ortholog groups, the default limit on the number of alignments may be too low in some cases. Thus, the author suggests set -v 100000 -b 100000 in blastp to avoid missing any homologs. Well, actually only -b matters here, as it sets the upper limit on number of database sequence to show alignments for and -v only works when the output format is in -m 0 or -m 6. Neither of them is used in OrthoMCL. The equivalent parameter of -b in blast+ is -num_alignments, so for OrthoMCL:

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6 \
-num_alignments 100000

The story never ends so early, as blast+ has another -max_target_seqs, which can control the number of aligned sequences to keep for any tabular formats (outfmt > 4). It is also incompatible with the ones, i.e. -num_descriptions and -num_alignments, used in output with separate definition line and alignment sections. So I think it is better to just use -max_target_seqs in blastp+ for OrthoMCL:

blastp -db database -query query.fasta -evalue 1E-5 \
-seg yes -soft_masking true -out blast.out -outfmt 6 \
-max_target_seqs 100000

There seem to be some new characters in blast+, like database masking, which for sure can reduce running time, but their effects on orthology detection are unclear so far (as far as I know)