Computational structure analysis and prediction of Ser/Thr modified by O-GlcNAc in human proteins

  • Thiago Britto Borges

Student thesis: Doctoral Thesis, Doctor of Philosophy


This thesis studies the post-translational modification of proteins by O-GlcNAcylation using a computational biology approach. O-GlcNAc transferase (OGT), the enzyme that catalyses protein O-GlcNAcylation, targets specific serines and threonines (S/T) of intracellular proteins. However, whereas other post-translationally modified residues, including phosphorylated ones, occur within sites distinguished by their amino acid sequences, fewer than 25% of known O-GlcNAc sites match a sequence pattern. This weak sequence signal raises the question of whether the structure of the sites defines the pattern recognised by OGT.

The thesis then focuses on structural features of the modified sites that could help distinguish potential sites from non-modifiable ones. A total of 1,622 O-GlcNAc sites were collected from the scientific literature, and 143 of these were mapped to protein 3D structures in the PDB. Modified S/T were 1.7 times more likely than unmodified S/T in the same protein to be annotated in the REMARK 465 field of the PDB file, which lists residues missing from the protein structure, suggesting that these sites may lie in structurally disordered regions. Clustering the structures of O-GlcNAc sites yielded ten distinct groups, indicating the structural diversity of the sites. The study was extended by analysing features predicted from the sequences of O-GlcNAcylated proteins with Jpred4 and three disorder predictors: DisEMBL, IUPred and JRonn. Overall, disorder scores and the proportion of S/T in coils confirmed that O-GlcNAc sites tend to be disordered.
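The REMARK 465 comparison described above can be illustrated with a minimal Python sketch. It assumes the standard PDB flat-file layout for REMARK 465 rows (optional model number, residue name, chain, sequence number, optional insertion code). The function names, the example records and the counts passed to `relative_likelihood` are hypothetical and are not taken from the thesis, whose exact procedure is not described here.

```python
import re

# Matches one data row of a REMARK 465 block: optional model number,
# residue name, chain identifier, sequence number, optional insertion code.
_ROW = re.compile(
    r"^REMARK 465\s+(?:(\d+)\s+)?([A-Z0-9]{1,3})\s+(\S)\s+(-?\d+)([A-Za-z]?)\s*$"
)


def missing_residues(pdb_text):
    """Map (chain, seq_number, insertion_code) to residue name for REMARK 465 rows."""
    missing = {}
    for line in pdb_text.splitlines():
        match = _ROW.match(line)
        if match:  # header and free-text lines of the block do not match
            _model, res_name, chain, seq, icode = match.groups()
            missing[(chain, int(seq), icode)] = res_name
    return missing


def relative_likelihood(mod_missing, mod_total, unmod_missing, unmod_total):
    """Ratio of the missing fraction among modified S/T to that among unmodified S/T."""
    return (mod_missing / mod_total) / (unmod_missing / unmod_total)


# Hypothetical records for illustration only.
example = """\
REMARK 465 MISSING RESIDUES
REMARK 465   M RES C SSSEQI
REMARK 465     MET A     1
REMARK 465     SER A     2
REMARK 465     THR B    57A
ATOM      1  N   GLY A   3      11.104  13.207   2.100  1.00 20.00           N
"""
gaps = missing_residues(example)
# Keep only the missing serines and threonines.
missing_st = {key for key, name in gaps.items() if name in ("SER", "THR")}
```

In this sketch a ratio above 1 from `relative_likelihood` would mean that modified S/T fall in unresolved regions more often than unmodified S/T. Mapping literature sites onto PDB chain numbering is a separate step that is not shown.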

A new classifier for O-GlcNAc sites (POGSPSF) was developed and trained with sequence, predicted secondary structure and predicted disorder features from 1,283 non-redundant sites. The POGSPSF Random Forest model achieved an area under the ROC curve of 71% in a blind test. Predictions were made for around 2.5 million S/T in the human proteome. Nuclear and cytoplasmic proteins were over-represented among the top-ranking proteins, and top-scoring sites were also more likely to be phosphorylated. Novel potential O-GlcNAc-modified proteins were also identified among the predictions.
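For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen modified site receives a higher score than a randomly chosen unmodified one. The sketch below computes it by direct pairwise comparison. It is a generic illustration with hypothetical toy scores and does not reproduce the POGSPSF evaluation.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical toy scores, not data from the thesis.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic in the number of sites, so a rank-based implementation would be preferable for proteome-scale predictions.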
Date of Award: 2016
Original language: English
Sponsors: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
Supervisor: Geoffrey Barton


  • Computational biology
  • Structural biology
  • Bioinformatics
  • Data Science
  • Machine learning
  • O-GlcNAc
