The power of deep sequencing technology to reliably detect single RNA reads leads to a paradoxical problem of high sensitivity. In hybridization- or PCR-based methods for RNA quantification, the concern is low sensitivity, i.e., the signal from truly expressed genes might not be distinguishable from noise. In contrast, the problem with RNA-seq is that it is not clear whether very low read counts come from genes expressed at low levels or merely reflect transcriptional noise. The frequency distribution of read counts does not show a clear separation into two classes of genes, which makes the decision whether a gene is to be considered expressed or not seemingly arbitrary. Here we address this problem by suggesting a statistical model that treats the number of transcripts detected in an RNA-seq study as a mixture of two distributions: an exponential distribution for transcripts from inactive genes and a negative binomial distribution for transcripts from actively transcribed genes. We apply this model to a number of RNA-seq data sets and find that the model fits the data very well. The calculated criterion for distinguishing between expressed and non-expressed genes is remarkably consistent among data sets, suggesting that genes with more than two transcripts per million transcripts (TPM) are highly likely to be actively transcribed. This criterion is consistent with the criterion of 1 RPKM proposed by Hebenstreit et al. Mol Sys Biol 7:497 (2011), based on chromatin modification and per cell RNA expression data. Hence, the regression model correctly identifies the class of genes that are not actively expressed and thus provides an operational criterion for classifying genes into expressed and non-expressed sets, facilitating the interpretation of RNA-seq data.
- Gene expression
- Mixed models
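
The two-component mixture described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that expression values are integer counts and that the exponential component is discretized by integrating it over unit intervals. All function names and parameter values (`weight_inactive`, `rate`, `r`, `p`) are hypothetical and chosen only for illustration; they are not fitted to any data set. The sketch computes the posterior probability that a gene with a given count belongs to the actively transcribed class and reports the smallest count at which that probability exceeds a chosen cutoff.

```python
import math


def discretized_exponential_pmf(k, rate):
    """P(k <= X < k + 1) for X ~ Exponential(rate); assumed 'inactive' component."""
    return math.exp(-rate * k) - math.exp(-rate * (k + 1))


def negative_binomial_pmf(k, r, p):
    """Negative binomial pmf with size r (may be non-integer) and success probability p."""
    log_coeff = math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
    return math.exp(log_coeff + r * math.log(p) + k * math.log1p(-p))


def mixture_pmf(k, weight_inactive, rate, r, p):
    """Mixture of the inactive (exponential) and active (negative binomial) components."""
    inactive = weight_inactive * discretized_exponential_pmf(k, rate)
    active = (1.0 - weight_inactive) * negative_binomial_pmf(k, r, p)
    return inactive + active


def posterior_active(k, weight_inactive, rate, r, p):
    """Posterior probability that a gene with count k is actively transcribed."""
    active = (1.0 - weight_inactive) * negative_binomial_pmf(k, r, p)
    return active / mixture_pmf(k, weight_inactive, rate, r, p)


def expression_threshold(params, cutoff=0.5, max_k=1000):
    """Smallest count whose posterior probability of being active exceeds the cutoff."""
    for k in range(max_k + 1):
        if posterior_active(k, **params) > cutoff:
            return k
    return None


# Hypothetical parameter values, for illustration only (not estimated from data).
PARAMS = {"weight_inactive": 0.4, "rate": 1.5, "r": 0.5, "p": 0.01}

if __name__ == "__main__":
    for k in range(6):
        print(k, round(posterior_active(k, **PARAMS), 3))
    print("threshold:", expression_threshold(PARAMS))
```

In practice the mixture parameters would be estimated from the observed count distribution, for example by maximum likelihood, and the resulting threshold would depend on those estimates and on the chosen cutoff.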