Automated generation of high quality transcriptome annotations and their coding characterisation

  • Juan Carlos Entizne

Student thesis: Doctoral Thesis, Doctor of Philosophy

Abstract

RNA sequencing (RNA-seq) is now the standard methodology to explore gene and transcript expression changes in different conditions. RNA-seq has been extensively used in genomic and transcriptomic studies in multiple organisms for a wide range of analyses, from developmental studies to diseases. Currently, there are state-of-the-art programs for accurate and rapid transcript quantification and downstream differential expression analyses at the gene, transcript and isoform levels. However, the accuracy of these analyses depends on how comprehensive, diverse and accurate the reference transcriptome annotation is. In addition, to dissect gene expression, pre-mRNA processing and regulatory mechanisms, correct translations and other transcript features are important for the correct interpretation of transcriptomic analyses.

To address these issues, the RTDmaker and TranSuite programs were developed. RTDmaker is an automated, modular program that facilitates the creation of high-quality transcriptome annotations known as Reference Transcript Datasets (RTDs). RTDmaker was based on experience and principles used to create Arabidopsis AtRTD and AtRTD2. The transition to species with poorly annotated genomes/transcriptomes introduced a new set of issues that had to be addressed in the development of RTDmaker.

TranSuite is a suite of programs for the accurate translation of transcripts (TransFix), and for the determination of the coding potential of transcripts and the identification of transcript coding-related features such as premature termination codons and nonsense-mediated decay (NMD) triggering signals (TransFeat). TransFix employs a novel method of identifying the authentic translation AUG start sites to generate realistic translations, allowing the accurate determination of protein-coding capacity. In contrast, most translation programs select the longest ORF present in the transcripts of a gene as the representative coding sequence (CDS). A direct comparison with TransDecoder showed anomalous translations for around 25% of the transcripts translated by TransDecoder.
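To make the "longest ORF" convention mentioned above concrete, the following is a minimal sketch of how such a selection could work on a single transcript sequence. It is an illustration only: the function name `longest_orf` and the restriction to the forward strand are assumptions made here, and the code reproduces neither TransDecoder's nor TransFix's actual method.

```python
# Hypothetical illustration of the generic "longest ORF" baseline.
# Not the implementation of TransDecoder, TransFix or any other tool.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG-to-stop ORF, or None.

    Coordinates are 0-based, end-exclusive, on the forward strand only.
    The stop codon is included in the returned interval.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    best = None
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon, in frame, until the first stop codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                break
    return best


if __name__ == "__main__":
    # Toy transcript with a short upstream ORF and a longer downstream ORF.
    transcript = "ATGTAACCATGAAAAAATAG"
    print(longest_orf(transcript))
```

Because this rule is applied to each sequence on its own merits, an isoform with a disrupted reading frame can be assigned a different start codon from the other transcripts of the same gene, which is the kind of outcome the abstract contrasts with translations anchored at the authentic AUG.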

RTDmaker and TranSuite were used to create fully characterized annotations for DM potato (StRTD) and barley (BaRT2.17) that improve on previous annotations. The StRTD annotation was used to explore NMD in potato plants treated with cycloheximide and supported the prediction of NMD-sensitive transcripts by TranSuite. This analysis demonstrated the utility of RTDmaker and TranSuite to explore and interpret results from differential expression analysis. The BaRT2.17 annotation was generated using both short and long read sequencing (RNA-seq and Iso-seq, respectively). BaRT2.17 transcripts have well-defined 5’ and 3’ ends (from Iso-seq) and rich splice-junction diversity (from RNA-seq), and represent the most advanced and accurate barley transcriptome available.

Date of Award: 2021
Original language: English
Sponsors: The James Hutton Institute, EASTBIO Doctoral Training Partnership & Biotechnology and Biological Sciences Research Council
Supervisor: John Brown (Supervisor) & Runxuan Zhang (Supervisor)

Keywords

  • Transcriptomics
  • RNA-seq
  • Transcriptome annotation
  • Coding characterization
  • Potato
  • Barley
