Sensitive clustering of protein sequences at tree-of-life scale using DIAMOND DeepClust

Benjamin Buchfink, Haim Ashkenazy, Klaus Reuter, John A. Kennedy, Hajk-Georg Drost (Lead / Corresponding author)

Research output: Working paper/Preprint › Preprint

Abstract

The biosphere genomics era is transforming life science research, but existing methods struggle to efficiently reduce the vast dimensionality of the protein universe. We present DIAMOND DeepClust, an ultra-fast cascaded clustering method optimized to cluster the 19 billion protein sequences currently defining the protein biosphere. As a result, we detect 1.7 billion clusters of which 32% hold more than one sequence. This means that 544 million clusters represent 94% of all known proteins, illustrating that clustering across the tree of life can significantly accelerate comparative studies in the Earth BioGenome era.
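The figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the rounded numbers quoted above; the variable names are illustrative. It rests on one reading of the abstract: the 94% of proteins sit in the multi-member clusters, so the remaining 6% are the sequences in singleton clusters, one sequence each.

```python
# Figures quoted in the abstract (rounded as published).
total_sequences = 19e9      # protein sequences clustered
total_clusters = 1.7e9      # clusters detected
multi_member_share = 0.32   # share of clusters holding more than one sequence
covered_share = 0.94        # share of proteins represented by those clusters

# 32% of 1.7 billion clusters hold more than one sequence.
multi_member_clusters = total_clusters * multi_member_share

# The remaining clusters are singletons, each holding exactly one sequence.
singleton_clusters = total_clusters - multi_member_clusters

# Sequences not covered by the multi-member clusters (assumed to be singletons).
sequences_outside = total_sequences * (1 - covered_share)

print(f"multi-member clusters: {multi_member_clusters / 1e6:.0f} million")
print(f"singleton clusters:    {singleton_clusters / 1e9:.3f} billion")
print(f"sequences outside:     {sequences_outside / 1e9:.2f} billion")
```

The first value reproduces the 544 million clusters stated in the abstract. The singleton count and the number of uncovered sequences agree to within the rounding of the published percentages, as expected if each singleton cluster holds one sequence.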
Original language: English
Publisher: BioRxiv
Number of pages: 29
Publication status: Published - 7 Feb 2023
