Abstract
Evolutionarily conserved sites in an alignment of protein homologues provide insight into the structural and functional pressures on those proteins. Likewise, global patterns of genetic variation across the human genome can reflect the functional and evolutionary constraints on protein-coding regions. Databases of genetic variation in the human population are essential resources for the clinical interpretation of variants that cause rare Mendelian diseases. Disease-causing variants cluster at regions of functional or structural importance in proteins, whereas missense variants are distributed in regions tolerant of amino acid substitutions. However, most human sequence residues in known protein domains have zero variants, so a comparative picture of genetic variation between individual protein residues in the human proteome is not possible.
This limitation was addressed by incorporating human population variant data into conventional sequence analysis: pooling variants across the sequences in a multiple sequence alignment yields a larger number of variants per alignment position and so boosts statistical power. The aim was to understand the relationship between residue conservation and population variation at consensus positions of multiple sequence alignments across all protein families. Sites under selective pressure in the human population highlight positions with structural and functional importance that dictate domain evolution, as observed in protein residue conservation.
Unlike globular protein domains with unique folds, other proteins have evolved through periodic sequences of variable arrays duplicated adjacent to one another, termed repeats. To boost the power of this analysis, the work focused on three repeat families, the Tetratricopeptide Repeat, Ankyrin and Armadillo families, because repeats present many more copies within a single species than any other domain. Repeats are widely distributed across kingdoms and are organised in tandem arrays in numerous proteins involved in many fundamental cellular processes. These repeats belong to a class of solenoid domains that fold into elongated structures in which neighbouring repeats depend on one another to maintain the overall fold. They mediate protein-protein interactions and assemble multi-protein complexes.
A computational framework, Repeat Analysis Method (RAM), was developed to implement tools and methods in a modularised approach to study repeats. RAM retrieves annotations and analyses the organisation of repeats with protein repeat architecture (PRA). Sequence analysis was carried out to examine the different structural and functional pressures at each position in repeat alignments, as determined by amino acid conservation. The human population variant data from the Genome Aggregation Database (gnomAD) were analysed by aggregating and counting missense variants along each alignment column to identify positions constrained in the human population. The relationship between residue conservation and the missense variant profile of each repeat alignment was then studied. These data were supported by quantitative structural analyses of all available three-dimensional structures from the Protein Data Bank in Europe (PDBe). Positions important for the structural fold of the repeat domains, and sites enriched for protein–substrate interactions, were identified from the constraint acting on them in the human population.
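The column-wise aggregation described above can be illustrated with a minimal Python sketch. This is not the RAM implementation: the function names, the input formats (a dictionary of aligned repeat sequences and a dictionary of 1-based ungapped residue positions carrying a missense variant) and the toy data are all hypothetical, and Shannon entropy is used only as a stand-in for whichever conservation score the thesis applies.

```python
from collections import Counter
from math import log2


def column_missense_counts(alignment, variants):
    """Sum missense variants per alignment column.

    alignment: {seq_id: aligned sequence, '-' for gaps} (hypothetical format)
    variants:  {seq_id: [1-based ungapped residue positions with a missense
               variant]} (hypothetical format; repeated positions count twice)
    """
    n_cols = len(next(iter(alignment.values())))
    counts = [0] * n_cols
    for seq_id, aligned in alignment.items():
        per_residue = Counter(variants.get(seq_id, []))
        residue_index = 0  # position in the ungapped sequence
        for col, char in enumerate(aligned):
            if char == "-":
                continue  # gaps carry no residue and no variants
            residue_index += 1
            counts[col] += per_residue[residue_index]
    return counts


def column_entropy(alignment):
    """Shannon entropy (bits) per column, ignoring gaps; low = conserved."""
    entropies = []
    for column in zip(*alignment.values()):
        residues = [c for c in column if c != "-"]
        total = len(residues)
        if total == 0:
            entropies.append(0.0)
            continue
        freqs = Counter(residues)
        entropies.append(-sum((n / total) * log2(n / total)
                              for n in freqs.values()))
    return entropies


# Toy data, invented purely for illustration.
alignment = {"rep1": "AG-LK", "rep2": "AGELR", "rep3": "A-ELK"}
variants = {"rep1": [2, 4], "rep2": [2, 5], "rep3": [1, 4]}

missense_profile = column_missense_counts(alignment, variants)
conservation_profile = column_entropy(alignment)
print(missense_profile)
print([round(h, 3) for h in conservation_profile])
```

Comparing the two profiles column by column gives the kind of conservation versus population-constraint view the abstract describes; a real analysis would also need to normalise the counts, for example by column occupancy, which this sketch omits.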
This novel cross-disciplinary approach has validated known positions and identified new positions of importance that conventional sequence and structural analysis alone could not. These studies have also expanded the understanding of the key sites in each repeat family, reflecting the general relationship between homologous conservation and population constraint in most protein domains. The analysis demonstrates how the conservation and variation features are constrained by structure and thus can be used to infer structural features.
Supervisors: Geoffrey Barton & Daan van Aalten
- Tandem repeats
- Human population variation
- Multiple sequence alignments
- Three-dimensional protein structures