Human Missense Variation is Constrained by Domain Structure and Highlights Functional and Pathogenic Residues

Stuart MacGowan, Fabio Madeira, Thiago Britto Borges, Melanie S. Schmittner, Christian Cole, Geoffrey Barton

Research output: Contribution to journal › Article

Abstract

Human genome sequencing has generated population variant datasets containing millions of variants from hundreds of thousands of individuals. The datasets show the genomic distribution of genetic variation to be influenced on genic and sub-genic scales by gene essentiality, protein domain architecture and the presence of genomic features such as splice donor/acceptor sites. However, the variant data are still too sparse to provide a comparative picture of genetic variation between individual protein residues in the proteome. Here, we overcome this sparsity for ~25,000 human protein domains in 1,291 domain families by aggregating variants over equivalent positions (columns) in multiple sequence alignments of sequence-similar (paralogous) domains. We then compare the resulting variation profiles from the human population to residue conservation across all species and find that the same tertiary structural and functional pressures that affect amino acid conservation during domain evolution constrain missense variant distributions. Thus, depletion of missense variants at a position implies that it is structurally or functionally important. We find such positions are enriched in known disease-associated variants (OR = 2.83, p ≈ 0) while positions that are both missense depleted and evolutionarily conserved are further enriched in disease-associated variants (OR = 1.85, p = 3.3 × 10⁻¹⁷) compared to those that are only evolutionarily conserved (OR = 1.29, p = 4.5 × 10⁻¹⁹). Unexpectedly, a subset of evolutionarily Unconserved positions are Missense Depleted in humans (UMD positions) and these are also enriched in pathogenic variants (OR = 1.74, p = 0.02). UMD positions are further differentiated from other unconserved residues in that they are enriched in ligand, DNA and protein binding interactions (OR = 1.59, p = 0.003), which suggests this stratification can identify functionally important positions. A different class of positions that are Conserved and Missense Enriched (CME) shows an enrichment of ClinVar risk factor variants (OR = 2.27, p = 0.004). We illustrate these principles with the G-Protein Coupled Receptor (GPCR) family, Nuclear Receptor Ligand Binding Domain family and In Between Ring-Finger (IBR) domains and list a total of 343 UMD positions in 211 domain families. This study will have broad applications to: (a) providing focus for functional studies of specific proteins by mutagenesis; (b) refining pathogenicity prediction models; (c) highlighting which residue interactions to target when refining the specificity of small-molecule drugs.
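The two ideas at the core of the abstract, pooling variants over alignment columns and summarising enrichment as an odds ratio, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy alignment, the variant list, the function names and the 2x2 counts are hypothetical, and the record does not describe the authors' actual pipeline or statistical test.

```python
from collections import Counter

# Hypothetical toy alignment of three paralogous domains ('-' marks a gap).
ALIGNMENT = {
    "domA": "MK-LV",
    "domB": "MKALV",
    "domC": "MR-LI",
}

# Hypothetical missense variants: (domain name, 1-based residue index
# counted along the ungapped domain sequence).
VARIANTS = [("domA", 2), ("domB", 2), ("domB", 3), ("domC", 4), ("domC", 2)]


def residue_to_column(aligned_seq):
    """Map each 1-based ungapped residue index to its 0-based alignment column."""
    mapping = {}
    residue_index = 0
    for column, symbol in enumerate(aligned_seq):
        if symbol != "-":
            residue_index += 1
            mapping[residue_index] = column
    return mapping


def missense_per_column(alignment, variants):
    """Pool variants from all domains onto shared alignment columns."""
    column_maps = {name: residue_to_column(seq) for name, seq in alignment.items()}
    counts = Counter()
    for name, residue_index in variants:
        counts[column_maps[name][residue_index]] += 1
    return counts


def odds_ratio(a, b, c, d):
    """Odds ratio of a 2x2 table [[a, b], [c, d]].

    For example: a/b = positions of interest with/without a pathogenic
    variant, c/d = all other positions with/without a pathogenic variant.
    """
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)


if __name__ == "__main__":
    counts = missense_per_column(ALIGNMENT, VARIANTS)
    for column in range(5):
        print(f"column {column}: {counts[column]} missense variant(s)")
    # Made-up counts, not taken from the paper.
    print("toy odds ratio:", odds_ratio(20, 80, 10, 90))
```

In this sketch, sparse per-residue counts become denser per-column counts because each column collects variants from every domain in the family. A real analysis would also need to normalise by column occupancy and attach a significance test to each odds ratio, neither of which is shown here.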
Original language: English
Number of pages: 35
Journal: BioRxiv
DOI: 10.1101/127050
Publication status: Published - 13 Apr 2017

Cite this

@article{8083596f9b1540d693007f3e647647bc,
title = "Human Missense Variation is Constrained by Domain Structure and Highlights Functional and Pathogenic Residues",
author = "Stuart MacGowan and Fabio Madeira and {Britto Borges}, Thiago and Schmittner, {Melanie S.} and Christian Cole and Geoffrey Barton",
year = "2017",
month = "4",
day = "13",
doi = "10.1101/127050",
language = "English",
journal = "BioRxiv",
publisher = "Cold Spring Harbor Laboratory Press",
}

TY - JOUR

T1 - Human Missense Variation is Constrained by Domain Structure and Highlights Functional and Pathogenic Residues

AU - MacGowan, Stuart

AU - Madeira, Fabio

AU - Britto Borges, Thiago

AU - Schmittner, Melanie S.

AU - Cole, Christian

AU - Barton, Geoffrey

PY - 2017/4/13

Y1 - 2017/4/13

U2 - 10.1101/127050

DO - 10.1101/127050

M3 - Article

JO - BioRxiv

JF - BioRxiv

ER -