Statistical models for RNA-seq data derived from a two-condition 48-replicate experiment

Marek Gierlinski, Christian Cole, Pietà Schofield, Nicholas J. Schurch, Alexander Sherstnev, Vijender Singh, Nicola Wrobel, Karim Gharbi, Gordon Simpson, Tom Owen-Hughes, Mark Blaxter, Geoffrey J. Barton

Research output: Contribution to journal, Article

Abstract

Motivation: High-throughput RNA sequencing (RNA-seq) is now the standard method to determine differential gene expression. Identifying differentially expressed genes crucially depends on estimates of read-count variability. These estimates are typically based on statistical models such as the negative binomial distribution, which is employed by the tools edgeR, DESeq and cuffdiff. Until now, the validity of these models has usually been tested on either low-replicate RNA-seq data or simulations.
Results: A 48-replicate RNA-seq experiment in yeast was performed and data tested against theoretical models. The observed gene read counts were consistent with both log-normal and negative binomial distributions, while the mean-variance relation followed the line of constant dispersion parameter of ∼0.01. The high-replicate data also allowed for strict quality control and screening of ‘bad’ replicates, which can drastically affect the gene read-count distribution.
Availability and implementation: RNA-seq data have been submitted to ENA archive with project ID PRJEB5348.
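To illustrate what a constant dispersion parameter of about 0.01 implies, the sketch below evaluates the standard negative binomial mean-variance relation, variance = mean + dispersion × mean². This is the usual parameterization for read counts, assumed here rather than taken from the paper's own code; the function names and the example means are hypothetical.

```python
import math

# Approximate dispersion reported in the abstract (~0.01).
DISPERSION = 0.01


def nb_variance(mean: float, dispersion: float = DISPERSION) -> float:
    """Negative binomial variance: Poisson term plus overdispersion term."""
    return mean + dispersion * mean ** 2


def nb_cv(mean: float, dispersion: float = DISPERSION) -> float:
    """Coefficient of variation (standard deviation divided by mean)."""
    return math.sqrt(nb_variance(mean, dispersion)) / mean


if __name__ == "__main__":
    # Illustrative mean read counts, not values from the study.
    for mean in (10, 100, 1000, 10000):
        print(f"mean={mean:>6}  variance={nb_variance(mean):>12.1f}  CV={nb_cv(mean):.3f}")
```

At low counts the Poisson term dominates, while at high counts the coefficient of variation approaches the square root of the dispersion, roughly 0.1 (10%) for a dispersion of 0.01.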
Original language: English
Pages (from-to): 3625-3630
Number of pages: 6
Journal: Bioinformatics
Volume: 31
Issue number: 22
Early online date: 23 Jul 2015
DOI: 10.1093/bioinformatics/btv425
Publication status: Published - 15 Nov 2015

Gierlinski, Marek; Cole, Christian; Schofield, Pietà; Schurch, Nicholas J.; Sherstnev, Alexander; Singh, Vijender; Wrobel, Nicola; Gharbi, Karim; Simpson, Gordon; Owen-Hughes, Tom; Blaxter, Mark; Barton, Geoffrey J. / Statistical models for RNA-seq data derived from a two-condition 48-replicate experiment. In: Bioinformatics. 2015; Vol. 31, No. 22. pp. 3625-3630.
@article{f7805eb4538d4a62af55d9e844c1df4f,
title = "Statistical models for RNA-seq data derived from a two-condition 48-replicate experiment",
abstract = "Motivation: High-throughput RNA sequencing (RNA-seq) is now the standard method to determine differential gene expression. Identifying differentially expressed genes crucially depends on estimates of read-count variability. These estimates are typically based on statistical models such as the negative binomial distribution, which is employed by the tools edgeR, DESeq and cuffdiff. Until now, the validity of these models has usually been tested on either low-replicate RNA-seq data or simulations. Results: A 48-replicate RNA-seq experiment in yeast was performed and data tested against theoretical models. The observed gene read counts were consistent with both log-normal and negative binomial distributions, while the mean-variance relation followed the line of constant dispersion parameter of ∼0.01. The high-replicate data also allowed for strict quality control and screening of ‘bad’ replicates, which can drastically affect the gene read-count distribution. Availability and implementation: RNA-seq data have been submitted to ENA archive with project ID PRJEB5348.",
author = "Marek Gierlinski and Christian Cole and Piet{\`a} Schofield and Schurch, {Nicholas J.} and Alexander Sherstnev and Vijender Singh and Nicola Wrobel and Karim Gharbi and Gordon Simpson and Tom Owen-Hughes and Mark Blaxter and Barton, {Geoffrey J.}",
note = "This work was supported by: The Wellcome Trust (92530/Z/10/Z and strategic awards WT09230, WT083481 and WT097945), Biotechnology and Biological Sciences Research Council (BB/H002286/1 and BB/J00247X/1).",
year = "2015",
month = "11",
day = "15",
doi = "10.1093/bioinformatics/btv425",
language = "English",
volume = "31",
pages = "3625--3630",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "22",
}

TY  - JOUR
T1  - Statistical models for RNA-seq data derived from a two-condition 48-replicate experiment
AU  - Gierlinski, Marek
AU  - Cole, Christian
AU  - Schofield, Pietà
AU  - Schurch, Nicholas J.
AU  - Sherstnev, Alexander
AU  - Singh, Vijender
AU  - Wrobel, Nicola
AU  - Gharbi, Karim
AU  - Simpson, Gordon
AU  - Owen-Hughes, Tom
AU  - Blaxter, Mark
AU  - Barton, Geoffrey J.
N1  - This work was supported by: The Wellcome Trust (92530/Z/10/Z and strategic awards WT09230, WT083481 and WT097945), Biotechnology and Biological Sciences Research Council (BB/H002286/1 and BB/J00247X/1).
PY  - 2015/11/15
Y1  - 2015/11/15
AB  - Motivation: High-throughput RNA sequencing (RNA-seq) is now the standard method to determine differential gene expression. Identifying differentially expressed genes crucially depends on estimates of read-count variability. These estimates are typically based on statistical models such as the negative binomial distribution, which is employed by the tools edgeR, DESeq and cuffdiff. Until now, the validity of these models has usually been tested on either low-replicate RNA-seq data or simulations. Results: A 48-replicate RNA-seq experiment in yeast was performed and data tested against theoretical models. The observed gene read counts were consistent with both log-normal and negative binomial distributions, while the mean-variance relation followed the line of constant dispersion parameter of ∼0.01. The high-replicate data also allowed for strict quality control and screening of ‘bad’ replicates, which can drastically affect the gene read-count distribution. Availability and implementation: RNA-seq data have been submitted to ENA archive with project ID PRJEB5348.
U2  - 10.1093/bioinformatics/btv425
DO  - 10.1093/bioinformatics/btv425
M3  - Article
VL  - 31
SP  - 3625
EP  - 3630
JO  - Bioinformatics
JF  - Bioinformatics
SN  - 1367-4803
IS  - 22
ER  -