Large-Scale Automatic K-Means Clustering for Heterogeneous Many-Core Supercomputer

Teng Yu, Wenlai Zhao (Lead / Corresponding author), Pan Liu, Vladimir Janjic, Xiaohan Yan, Shicai Wang, Haohuan Fu, Guangwen Yang, John Thomson

Research output: Contribution to journal › Article › peer-review

Abstract

This article presents an automatic k-means clustering solution targeting the Sunway TaihuLight supercomputer. We first introduce a multilevel parallel partition approach that partitions not only by dataflow and centroid but also by dimension, which unlocks the hierarchical parallelism of the heterogeneous many-core processor and of the supercomputer's system architecture. The parallel design can process large-scale clustering problems with up to 196,608 dimensions and over 160,000 target centroids while maintaining high performance and scalability. Furthermore, we propose an automatic hyper-parameter determination process for k-means clustering: it automatically generates and executes clustering tasks for a set of candidate hyper-parameters, and then determines the optimal hyper-parameter using a proposed evaluation method. The proposed autoclustering solution not only achieves high performance and scalability for problems with massive high-dimensional data, but also supports clustering without sufficient prior knowledge of the number of target clusters, which can potentially extend the k-means algorithm to new application areas.
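
As a rough illustration of the three partition levels named in the abstract (dataflow, centroid, dimension), the following minimal sketch runs the k-means assignment step with the work split along all three axes. It is not the authors' Sunway TaihuLight implementation: the function names `chunk` and `assign_partitioned` and the three `*_parts` parameters are hypothetical, and plain sequential loops stand in for the hardware units that would process each partition in parallel. It relies only on the fact that a squared Euclidean distance is the sum of the partial distances over disjoint dimension slices.

```python
from typing import List, Sequence


def chunk(n: int, parts: int) -> List[range]:
    """Split range(n) into `parts` contiguous, nearly equal slices."""
    base, extra = divmod(n, parts)
    slices, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        slices.append(range(start, start + size))
        start += size
    return slices


def assign_partitioned(
    points: Sequence[Sequence[float]],
    centroids: Sequence[Sequence[float]],
    sample_parts: int,
    centroid_parts: int,
    dim_parts: int,
) -> List[int]:
    """Return the index of the nearest centroid for every point.

    Assumption: each loop over a partition models work that separate
    processing units would perform concurrently.
    """
    n, k, d = len(points), len(centroids), len(points[0])
    labels = [-1] * n
    for sample_slice in chunk(n, sample_parts):        # level 1: dataflow (samples)
        for i in sample_slice:
            best_j, best_dist = -1, float("inf")
            for centroid_slice in chunk(k, centroid_parts):  # level 2: centroids
                for j in centroid_slice:
                    # level 3: each dimension slice yields a partial distance,
                    # and the partial distances are then reduced by summation
                    partials = [
                        sum((points[i][t] - centroids[j][t]) ** 2 for t in dim_slice)
                        for dim_slice in chunk(d, dim_parts)
                    ]
                    dist = sum(partials)
                    if dist < best_dist:
                        best_j, best_dist = j, dist
            labels[i] = best_j
    return labels


points = [[0, 0, 0, 0], [10, 10, 10, 10], [1, 0, 1, 0], [9, 10, 9, 10]]
centroids = [[0, 0, 0, 0], [10, 10, 10, 10], [5, 5, 5, 5]]
print(assign_partitioned(points, centroids, 2, 2, 2))  # [0, 1, 0, 1]
```

Setting all three partition counts to 1 gives the ordinary unpartitioned assignment step, and in this toy example it produces the same labels, so the partitioning changes only how the work is divided, not the clustering result.
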
Original language: English
Pages (from-to): 997-1008
Number of pages: 12
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 31
Issue number: 5
Early online date: 25 Nov 2019
Publication status: Published - 1 May 2020

Keywords

  • AutoML
  • Supercomputer
  • clustering
  • data partitioning
  • heterogeneous many-core processor
  • scheduling

ASJC Scopus subject areas

  • Signal Processing
  • Hardware and Architecture
  • Computational Theory and Mathematics
