A unified approach to evolutionary conservation and population constraint in protein domains highlights structural features and pathogenic sites

Stuart A. MacGowan, Fábio Madeira, Thiago Britto-Borges, Geoffrey J. Barton

Research output: Working paper/Preprint

Abstract

The molecular evolution of a protein is constrained by its structure and function. This is why patterns of evolutionary conservation in a multiple sequence alignment can be exploited by algorithms like AlphaFold to predict structure and other features. Human population sequencing offers the potential to show similar trends within a single species. However, although gene-level and domain-level aggregation of variants is routinely applied, data still preclude simple residue-level analysis per protein. Here, we aggregate population variants within protein domain families to define the missense enrichment score (MES), which captures population constraint at positions in the family. Contrasting MES with evolutionary conservation identifies positions that are key to protein-ligand and protein-protein interaction specificity as well as locations critical to folding, function and pathogenicity. This approach also highlights a new class of pathogenic variant hot spot at positions that are evolutionarily conserved yet enriched in missense variants. We found that 5,086 positions in 766 domain families were missense depleted (covering 365,300 residues in the human proteome) and 13,490 positions in 3,591 families were enriched in missense variants (340,829 residues in the human proteome), and we provide a map of residue-level population constraint in humans. Comparison to 61,214 three-dimensional structures from the Protein Data Bank, covering 40,394 sequences from 3,661 Pfam domains, and over 10,000 variants labelled pathogenic in ClinVar provides validation of these sites in context with structure and pathogenicity. Application to the GPCR family identifies 17 sites important for ligand binding specificity as well as sites key to function in all members of the GPCR family. For nuclear receptor ligand binding domains, of 10 likely specificity sites, six are in direct contact with ligands. In P450s the most conserved and missense depleted sites interact with the electron donor.
This study will have broad applications to: (a) providing focus for functional studies of specific proteins by mutagenesis; (b) refining pathogenicity prediction models; (c) highlighting which residue interactions to target when refining the specificity of small-molecule drugs.
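
The abstract does not state how the MES is calculated. As a rough illustration only, the sketch below assumes an odds-ratio style score: the rate of missense variants at one alignment column is compared with the rate across the rest of the domain family. The function name, its arguments, and the 0.5 pseudocount are hypothetical and are not taken from the preprint.

```python
def missense_enrichment(column_missense, column_residues,
                        total_missense, total_residues, pseudocount=0.5):
    """Odds ratio of missense variation at one column versus the rest.

    ASSUMPTION: this 2x2 odds-ratio form is an illustrative guess at an
    enrichment score; it is not the definition used in the preprint.
    """
    # Column of interest: residues with / without a missense variant.
    a = column_missense
    b = column_residues - column_missense
    # Rest of the alignment: residues with / without a missense variant.
    c = total_missense - column_missense
    d = (total_residues - column_residues) - c
    if min(a, b, c, d) < 0:
        raise ValueError("counts are inconsistent")
    # A pseudocount in every cell avoids division by zero when a cell is
    # empty (an assumed choice, often called the Haldane correction).
    a, b, c, d = (x + pseudocount for x in (a, b, c, d))
    return (a * d) / (b * c)


# Hypothetical counts: 100 human residues map to the column,
# 1,100 residues map to the whole family alignment.
enriched = missense_enrichment(30, 100, 130, 1100)
depleted = missense_enrichment(2, 100, 102, 1100)
print(f"enriched column: {enriched:.2f}")
print(f"depleted column: {depleted:.2f}")
```

Under this assumed definition, a value above 1 marks a column carrying more missense variants than the family background, and a value below 1 marks a depleted (constrained) column. A real analysis would also need a significance test before calling a position enriched or depleted.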
Original language: English
Publisher: Research Square
Number of pages: 53
Publication status: Published - 13 Jul 2023
