A Quantitative Exploration of Causes of False Positive Single Nucleotide Polymorphisms in Next-Generation Sequencing Data

  • Antonio Claudio Bello Ribeiro

Student thesis: Doctoral Thesis, Doctor of Philosophy

Abstract

Single Nucleotide Polymorphisms (SNPs) are widely used molecular markers, and their use has increased massively since the inception of Next-Generation Sequencing (NGS) technologies, which allow detection of large numbers of SNPs at low cost. However, both NGS data and their analysis are error-prone, which can lead to the generation of false positive (FP) SNPs. The traditional approach to SNP discovery is based on mapping reads to a reference sequence. Apart from sequencing errors, which vary in pattern and rate depending on the sequencing platform, the short read lengths that prevail in NGS, together with the repetitive nature of the genomes of many organisms, can lead to errors in the genome assembly and/or read mapping stages of the mapping-based approach for SNP discovery.

The work described here has investigated and quantified some mechanisms that cause false positive SNPs. These include reference misassembly due to the presence of paralogous sequences and read cross-mapping, along with associated factors such as quality of the reference sequence, read length, choice of mapper and variant caller, mapping stringency, and filtering of SNPs by read mapping quality and read depth. The study shows that both paralogs and the choice of tools and parameters involved in variant calling can have a dramatic effect on the number of FP SNPs produced. A brief exploration of the influence of these factors on the generation of false negative (FN) SNPs is also carried out at the end of the study, paving the way to new insights. This thesis aims to provide a stepping stone towards a better understanding of the factors influencing the mapping-based SNP discovery approach.

Date of Award: 2016
Original language: English
Awarding Institution: University of Dundee
Sponsors: The James Hutton Institute
Supervisors: Andrew Flavell, Micha Bayer & David Marshall

Keywords

  • False positive SNP
  • NGS
  • Read mismapping
  • Misassembly
  • Mapping stringency
  • Read lengths
