Estimation of P-values for global alignments of protein sequences

Caleb Webber, Geoffrey J. Barton

    Research output: Contribution to journal › Article

    30 Citations (Scopus)

    Abstract

    MOTIVATION: The global alignment of protein sequence pairs is often used in the classification and analysis of full-length sequences. The calculation of a Z-score for the comparison gives a length- and composition-corrected measure of the similarity between the sequences. However, the Z-score alone does not indicate the likely biological significance of the similarity. In this paper, all pairs of domains from 250 sequences belonging to different SCOP folds were aligned and Z-scores calculated. The distribution of Z-scores was fitted with a peak distribution from which the probability of obtaining a given Z-score from the global alignment of two protein sequences of unrelated fold was calculated. A similar analysis was applied to subsequence pairs found by the Smith-Waterman algorithm. These analyses allow the probability that two protein sequences share the same fold to be estimated by global sequence alignment.
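
As an illustration of the quantities described above, the following is a minimal Python sketch of a shuffle-based Z-score for a global alignment, followed by a tail probability read from a background set of Z-scores. It is not the authors' software. The scoring scheme (match/mismatch values and a linear gap penalty), the number of shuffles, and the function names `global_score`, `z_score`, and `empirical_p_value` are all assumptions made for the example. The abstract fits a distribution to the background Z-scores; here a simple empirical tail fraction is swapped in because the fitted parameters are not given.

```python
import random
import statistics


def global_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not the substitution
    matrices or gap penalties examined in the paper.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def z_score(a, b, n_shuffles=100, seed=0):
    """Z = (observed - mean of shuffled scores) / sd of shuffled scores.

    Shuffling b preserves its length and composition, which is what makes
    the Z-score a length- and composition-corrected measure.
    """
    rng = random.Random(seed)
    observed = global_score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        null_scores.append(global_score(a, "".join(letters)))
    sd = statistics.pstdev(null_scores)
    if sd == 0:
        raise ValueError("shuffled scores have zero variance")
    return (observed - statistics.mean(null_scores)) / sd


def empirical_p_value(z, background_z):
    """Fraction of background Z-scores >= z, with add-one smoothing.

    background_z stands in for Z-scores from pairs of unrelated fold.
    """
    exceed = sum(1 for b in background_z if b >= z)
    return (exceed + 1) / (len(background_z) + 1)
```

In use, `background_z` would be filled with Z-scores from sequence pairs known to have different folds, and `empirical_p_value(z_score(a, b), background_z)` would then estimate how often an unrelated pair scores at least as well as the pair of interest.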

    RESULTS: The relationship between Z-score and probability varied little over the matrix/gap penalty combinations examined. However, an average shift of +4.7 was observed for Z-scores derived from global alignment of locally-aligned subsequences compared to global alignment of the full-length sequences. This shift was shown to be the result of pre-selection by local alignment, rather than any structural similarity in the subsequences. Benchmarking the search ability of both methods against the SCOP superfamily classification showed that global alignment Z-scores generated from the entire sequence are as effective as SSEARCH at low error rates and more effective at higher error rates. However, global alignment Z-scores generated from the best locally-aligned subsequence were significantly less effective than SSEARCH. The method of estimating statistical significance described here was shown to give similar values to SSEARCH and BLAST, providing confidence in the significance estimation.
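
The comparison of search methods "at low error rates" and "at higher error rates" can be pictured as a coverage-versus-error curve. The sketch below is one common way to compute such a curve and is only an assumption about the general shape of the benchmark, since the abstract does not give the exact protocol. The names `coverage_vs_errors` and `coverage_at`, and the toy hit list, are hypothetical. Each hit is a `(score, is_true)` pair, where `is_true` would come from a reference classification such as SCOP superfamily membership.

```python
def coverage_vs_errors(hits, total_true):
    """Walk down the hit list from best to worst score.

    Returns a list of (errors_so_far, coverage_so_far) points, where
    coverage is the fraction of all true relationships found so far.
    """
    curve = []
    true_found = 0
    errors = 0
    for _score, is_true in sorted(hits, key=lambda h: h[0], reverse=True):
        if is_true:
            true_found += 1
        else:
            errors += 1
        curve.append((errors, true_found / total_true))
    return curve


def coverage_at(curve, max_errors):
    """Best coverage reached while tolerating at most max_errors errors."""
    best = 0.0
    for errors, coverage in curve:
        if errors <= max_errors:
            best = max(best, coverage)
    return best


# Toy data: five ranked hits, four true relationships in the database.
hits = [(9.0, True), (7.5, True), (6.0, False), (5.0, True), (3.0, False)]
curve = coverage_vs_errors(hits, total_true=4)
```

Two methods can then be compared by evaluating `coverage_at` on each method's curve at the same error allowance. A method may match another at a strict allowance yet pull ahead when more errors are tolerated, which is the pattern the results describe for full-length global alignment Z-scores relative to SSEARCH.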

    AVAILABILITY: Software to apply the statistics to global alignments is available from http://barton.ebi.ac.uk.

    CONTACT: geoff@ebi.ac.uk

    Original language: English
    Pages (from-to): 1158-1167
    Number of pages: 10
    Journal: Bioinformatics
    Volume: 17
    Issue number: 12
    DOIs: 10.1093/bioinformatics/17.12.1158
    Publication status: Published - Dec 2001

    Keywords

    • Databases, Protein
    • Mathematical computing
    • Probability
    • Proteins
    • Sequence alignment
    • Sequence analysis, Protein
    • Software

    Cite this

    @article{acc91469f0fc4615ae343958d8f7e840,
    title = "Estimation of P-values for global alignments of protein sequences",
    keywords = "Databases, Protein, Mathematical computing, Probability, Proteins, Sequence alignment, Sequence analysis, Protein, Software",
    author = "Webber, Caleb and Barton, {Geoffrey J.}",
    year = "2001",
    month = "12",
    doi = "10.1093/bioinformatics/17.12.1158",
    language = "English",
    volume = "17",
    pages = "1158--1167",
    journal = "Bioinformatics",
    issn = "1367-4803",
    publisher = "Oxford University Press",
    number = "12",

    }

    Estimation of P-values for global alignments of protein sequences. / Webber, Caleb; Barton, Geoffrey J.

    In: Bioinformatics, Vol. 17, No. 12, 12.2001, p. 1158-1167.

    TY - JOUR

    T1 - Estimation of P-values for global alignments of protein sequences

    AU - Webber, Caleb

    AU - Barton, Geoffrey J.

    PY - 2001/12

    Y1 - 2001/12

    KW - Databases, Protein

    KW - Mathematical computing

    KW - Probability

    KW - Proteins

    KW - Sequence alignment

    KW - Sequence analysis, Protein

    KW - Software

    U2 - 10.1093/bioinformatics/17.12.1158

    DO - 10.1093/bioinformatics/17.12.1158

    M3 - Article

    VL - 17

    SP - 1158

    EP - 1167

    JO - Bioinformatics

    JF - Bioinformatics

    SN - 1367-4803

    IS - 12

    ER -