Abstract
Single Nucleotide Polymorphisms (SNPs) are widely used molecular markers, and their use has increased massively since the inception of Next-Generation Sequencing (NGS) technologies, which allow detection of large numbers of SNPs at low cost. However, both NGS data and their analysis are error-prone, which can lead to the generation of false positive (FP) SNPs. The traditional approach to SNP discovery is based on mapping reads to a reference sequence. Apart from sequencing errors, which vary in pattern and rate depending on the sequencing platform, the short read lengths that prevail in NGS, together with the repetitive nature of the genomes of many organisms, can lead to errors in the genome assembly and/or read mapping stages of the mapping-based approach for SNP discovery.
The work described here investigates and quantifies some of the mechanisms that cause false positive SNPs. These include reference misassembly due to the presence of paralogous sequences and read cross-mapping, along with associated factors such as the quality of the reference sequence, read length, choice of mapper and variant caller, mapping stringency, and filtering of SNPs by read mapping quality and read depth. The study shows that both paralogs and the choice of tools and parameters involved in variant calling can have a dramatic effect on the number of FP SNPs produced. The influence of these factors on the generation of false negative (FN) SNPs is also briefly explored at the end of the study, paving the way for new insights. This thesis aims to provide a stepping stone towards a better understanding of the factors influencing the mapping-based SNP discovery approach.
|Date of Award|2016|
|Sponsors|The James Hutton Institute|
|Supervisors|Andrew Flavell, Micha Bayer & David Marshall|
Keywords:
- False positive SNP
- Read mismapping
- Mapping stringency
- Read lengths