There are just so many ways to normalise RNA-Seq data, even though one of the advantages of RNA-Seq is that “it can capture transcriptome dynamics across different tissues or conditions without sophisticated normalization of data sets.” (/RNA-Seq: a revolutionary tool for transcriptomics/, Nature Reviews Genetics, 2009) BTW, I wonder how many people actually consider these normalisations carefully. Put differently, I’d like to know how robust RNA-Seq data sets really are. Here are a few normalizations that I came across today:
- Quantile normalization – /Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions/, Briefings in Bioinformatics, 2018. The purpose of this normalization is to obtain identical read-count distributions across samples. It first sorts the read counts of each sample in increasing order and leaves placeholders holding the rank indices. Then it replaces each placeholder with the mean (or median) across samples of the counts at that rank: the smallest counts, the second smallest, and so on. If there is a tie (i.e. two equal values in one sample that would occupy ranks i and i+1), both get the mean of value(i) and value(i+1). There are some variations on the theme, e.g. upper quartile normalization or median normalization. As far as I can tell, these are just scaling approaches that divide each original value by the 75% quantile (or 50% quantile, i.e. median) of the counts within its own sample, so they align one quantile per sample rather than the whole distribution.
- RLE stands for relative log expression – /RLE plots: Visualizing unwanted variation in high dimensional data/, PLoS ONE, 2018. For each gene, it calculates the median of the (log) read counts across samples, then uses the differences between each sample’s (log) read counts and those medians to draw one box plot per sample. The approach should remove variation among genes and leave only variation among samples, whose box plots should all look similar, with a median around zero, if there is little unwanted variation.
- TMM stands for Trimmed Mean of M-values – /A scaling normalization method for differential expression analysis of RNA-Seq data/, Genome Biology, 2010. I should look into this approach further tomorrow, but so far it seems to me that it considers the proportions of certain reads in a sample. It will also be interesting to read the first paper, /Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions/.
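
To check my own understanding of the quantile, upper quartile/median and RLE descriptions above, here is a minimal pure-Python sketch. It is my reading of those descriptions, not the reference implementation from any of the papers or packages. The function names, the genes x samples list-of-lists layout, the linear-interpolation percentile and the pseudocount of 1 before taking logs are all my own assumptions. TMM is left out on purpose, since I have not read that paper properly yet.

```python
import math
from statistics import fmean, median


def columns(matrix):
    """Transpose a genes x samples matrix into per-sample columns (or back)."""
    return [list(col) for col in zip(*matrix)]


def quantile_normalize(matrix):
    """Give every sample (column) the same reference distribution."""
    cols = columns(matrix)
    # Reference value for rank k: mean of the k-th smallest count across samples.
    reference = [fmean(vals) for vals in zip(*(sorted(c) for c in cols))]
    normalized_cols = []
    for col in cols:
        order = sorted(range(len(col)), key=lambda gene: col[gene])
        # Gather the reference values of all ranks occupied by each count value,
        # so that tied counts share the mean of their reference values.
        by_value = {}
        for rank, gene in enumerate(order):
            by_value.setdefault(col[gene], []).append(reference[rank])
        normalized_cols.append([fmean(by_value[value]) for value in col])
    return columns(normalized_cols)


def percentile(values, q):
    """q-th percentile, interpolating linearly between order statistics (assumed)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def quantile_scale(matrix, q=75):
    """Divide each count by the q-th percentile of its own sample.

    q=75 is upper quartile scaling, q=50 is median scaling. Zero counts are
    not filtered out here, which real pipelines may do.
    """
    factors = [percentile(col, q) for col in columns(matrix)]
    return [[x / f for x, f in zip(row, factors)] for row in matrix]


def rle(matrix, pseudocount=1):
    """Per gene, subtract the across-sample median of the log2 counts."""
    logged = [[math.log2(x + pseudocount) for x in row] for row in matrix]
    return [[x - median(row) for x in row] for row in logged]


# Toy data: 4 genes (rows) x 3 samples (columns); sample 2 has a tie at 4.
toy = [[5, 4, 3], [2, 1, 4], [3, 4, 6], [4, 2, 8]]
normalized = quantile_normalize(toy)
scaled = quantile_scale(toy, q=75)
relative = rle(toy)
```

In the toy matrix the two tied 4s in the second sample both end up with the mean of the third and fourth reference values, which is the tie rule described above. Each column of `relative` is what would go into one box of an RLE plot.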