Analysis of variation in protein domain families and interfaces

  • Fabio Manuel Marques Madeira

Student thesis: Doctoral Thesis, Doctor of Philosophy


More than 100 million single nucleotide polymorphisms (SNPs), the most frequent and basic type of genetic variation, are currently known. The availability of over 110,000 three-dimensional protein structures allows the structural context of many SNPs to be examined in atomic detail. Interfaces are essential sites for protein function and adaptation, and are key to a majority of biological processes. A computational framework, ProIntVar, was developed to map SNPs onto structures so that the features of variation at domain-domain and domain-ligand interfaces can be studied. ProIntVar allows the systematic analysis of genetic variation at protein interaction surfaces by integrating structural and sequencing data from several biological databases and resources. Protein domains and variants were analysed in the context of structural clusters (SCs) and functional families (FunFams), which are derived from structurally and functionally related protein domains in superfamilies classified in CATH (Class, Architecture, Topology, Homologous Superfamily). Multiple structural alignments were generated with STAMP for each CATH SC and FunFam, using an improved protocol that yields better structural superimpositions and structure-based alignments. The structural alignments were extended with similar protein sequences found by HMM-based sequence search. These sequences are believed to be structurally and/or functionally homologous, and are thus a rich source of novel insight into the structural context and potential consequences of a vast number of genetic variants. Disease-associated and non-disease germline variants, as well as somatic variation, were characterised. The analysis of non-synonymous SNPs (nsSNPs) was stratified by annotation and potential consequence, and focused particularly on domain-domain and domain-ligand interaction interfaces.
Domain interactions were screened and further classified by mode of interaction, with modes clustered by iRMSD (interaction root-mean-square deviation). The results corroborate previous observations that pathogenic mutations are enriched at key sites, such as structurally conserved domain-ligand interfaces and the protein core. Examining genetic variation at such hot-spots in the context of domain families helps to infer which variants are more likely to affect protein activity and function in a broader evolutionary sense. The most drastic features shared by pathogenic variants were identified in order to prioritise the analysis of nsSNPs that are currently thought to be neutral but may be disruptive.
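As an illustration of what "enriched at key sites" can mean in practice, the following is a minimal sketch of an odds-ratio calculation on a 2x2 contingency table. It is not taken from the thesis or from ProIntVar: the function name `odds_ratio`, the table layout, and the counts are all hypothetical, and the confidence interval uses the standard normal approximation on the log odds ratio.

```python
import math


def odds_ratio(path_in, path_out, neutral_in, neutral_out):
    """Odds ratio and approximate 95% CI for a 2x2 table.

    Hypothetical layout (an assumption, not the thesis's protocol):
      path_in     = pathogenic variants at interface residues
      path_out    = pathogenic variants elsewhere
      neutral_in  = neutral variants at interface residues
      neutral_out = neutral variants elsewhere
    """
    counts = (path_in, path_out, neutral_in, neutral_out)
    if min(counts) <= 0:
        raise ValueError("all four cells must be positive")

    estimate = (path_in * neutral_out) / (path_out * neutral_in)

    # Standard error of log(OR), then a normal-approximation interval.
    se = math.sqrt(sum(1.0 / c for c in counts))
    lower = math.exp(math.log(estimate) - 1.96 * se)
    upper = math.exp(math.log(estimate) + 1.96 * se)
    return estimate, lower, upper


# Made-up counts for illustration only; these are not results from the thesis.
estimate, lower, upper = odds_ratio(30, 70, 10, 90)
print(f"OR = {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An odds ratio above 1 whose interval excludes 1 would suggest that pathogenic variants occur at the chosen sites more often than neutral ones do. With small counts, an exact test would be a safer choice than this approximation.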
Date of Award: 2016
Original language: English
Sponsors: Wellcome Trust
Supervisor: Geoffrey Barton


  • Protein
  • Structure
  • Variants
  • Domains
  • Interactions
