Human genome sequencing has generated population variant datasets containing millions of variants from hundreds of thousands of individuals. The datasets show that the genomic distribution of genetic variation is influenced on genic and sub-genic scales by gene essentiality, protein domain architecture and the presence of genomic features such as splice donor/acceptor sites. However, the variant data are still too sparse to provide a comparative picture of genetic variation between individual protein residues in the proteome. Here, we overcome this sparsity for ~25,000 human protein domains in 1,291 domain families by aggregating variants over equivalent positions (columns) in multiple sequence alignments of sequence-similar (paralogous) domains. We then compare the resulting variation profiles from the human population to residue conservation across all species and find that the same tertiary structural and functional pressures that affect amino acid conservation during domain evolution constrain missense variant distributions. Thus, depletion of missense variants at a position implies that it is structurally or functionally important. We find such positions are enriched in known disease-associated variants (OR = 2.83, p ≈ 0), while positions that are both missense depleted and evolutionarily conserved are further enriched in disease-associated variants (OR = 1.85, p = 3.3 × 10⁻¹⁷) compared to those that are only evolutionarily conserved (OR = 1.29, p = 4.5 × 10⁻¹⁹). Unexpectedly, a subset of evolutionarily Unconserved positions are Missense Depleted in human (UMD positions) and these are also enriched in pathogenic variants (OR = 1.74, p = 0.02). UMD positions are further differentiated from other unconserved residues in that they are enriched in ligand, DNA and protein binding interactions (OR = 1.59, p = 0.003), which suggests this stratification can identify functionally important positions.
A different class of positions that are Conserved and Missense Enriched (CME) shows an enrichment of ClinVar risk factor variants (OR = 2.27, p = 0.004). We illustrate these principles with the G-Protein Coupled Receptor (GPCR) family, the Nuclear Receptor Ligand Binding Domain family and In Between Ring-Finger (IBR) domains, and list a total of 343 UMD positions in 211 domain families. This study will have broad applications in: (a) providing focus for functional studies of specific proteins by mutagenesis; (b) refining pathogenicity prediction models; and (c) highlighting which residue interactions to target when refining the specificity of small-molecule drugs.
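The column-wise aggregation described above can be illustrated with a minimal Python sketch. It is not the study's pipeline: the function names, the toy alignment, the variant list, the 2×2 table layout (variants in the column versus elsewhere, against residues in the column versus elsewhere) and the 0.5 continuity correction are all assumptions made for illustration.

```python
def column_variant_counts(alignment, variants):
    """Aggregate missense variants over alignment columns.

    alignment: dict mapping a domain id to its aligned sequence ('-' = gap).
    variants:  iterable of (domain_id, residue_index), where residue_index
               is 0-based and counts ungapped residues only.
    Returns (variants_per_column, residues_per_column).
    """
    n_cols = len(next(iter(alignment.values())))
    occupancy = [0] * n_cols
    residue_to_col = {}
    for dom_id, aligned in alignment.items():
        pos = 0
        for col, ch in enumerate(aligned):
            if ch != "-":
                residue_to_col[(dom_id, pos)] = col
                occupancy[col] += 1
                pos += 1
    counts = [0] * n_cols
    for dom_id, pos in variants:
        col = residue_to_col.get((dom_id, pos))
        if col is not None:  # ignore variants that fall outside the domain
            counts[col] += 1
    return counts, occupancy


def column_odds_ratio(counts, occupancy, col):
    """Odds ratio for variants falling in `col` versus the rest of the
    alignment, relative to the share of residues in that column.
    A 0.5 correction is added to every cell (assumed here) so that
    columns with zero variants still give a finite value."""
    a = counts[col] + 0.5                          # variants in column
    b = sum(counts) - counts[col] + 0.5            # variants elsewhere
    c = occupancy[col] + 0.5                       # residues in column
    d = sum(occupancy) - occupancy[col] + 0.5      # residues elsewhere
    return (a / b) / (c / d)


# Hypothetical toy data: three paralogous domains and six missense variants.
toy_alignment = {"A": "MK-LV", "B": "MKALV", "C": "-KALI"}
toy_variants = [("A", 0), ("B", 0), ("C", 0), ("A", 3), ("B", 4), ("C", 3)]

toy_counts, toy_occupancy = column_variant_counts(toy_alignment, toy_variants)
depleted_or = column_odds_ratio(toy_counts, toy_occupancy, 3)
enriched_or = column_odds_ratio(toy_counts, toy_occupancy, 4)
```

An odds ratio below 1 marks a column carrying fewer variants than its residue count would suggest (missense depleted), and a value above 1 marks an enriched column. A real analysis would also need a significance test and far larger alignments than this toy example.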
1/03/13 → 31/08/18