Abstract

Proteomics has become an essential component of systems biology in the quest for personalised medicine. Each of us has a unique biology and can respond in different ways to medical treatments. By analysing the complete set of proteins present in humans, the life sciences are moving closer to the goal of recommending specific drugs to specific individuals, greatly enhancing the probability of a cure.
Extensive pre-processing of the complex files created by mass spectrometers during proteomics experiments is required before any insight can be gained from them. A typical mass spectrometer run may produce forty thousand scans of data, each containing around two thousand data points. Laid out in a relational database schema, this would equate to eighty million rows of data per experiment. In a laboratory containing multiple machines running several times a day, this figure quickly reaches into the tens of billions of rows, a very significant volume of data to process. Many tools exist to carry out this processing, but most focus on batch-based workloads: the mass spectrometer finishes its analysis, and the data is then processed on a file-by-file basis. The processing time can vary from hours to days, leading to a substantial lag before the results of the experiment can be examined. In addition, life science laboratories often carry out this work on the local storage of PC hardware, creating a significant data management problem.
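The sizing estimate above can be reproduced with a quick back-of-the-envelope calculation. The per-run figures come from the text; the machine count, runs per day, and days per month below are illustrative assumptions chosen only to show how the total reaches tens of billions of rows:

```python
# Per-run figures taken from the text.
scans_per_run = 40_000      # scans produced by one mass spectrometer run
points_per_scan = 2_000     # data points per scan

rows_per_run = scans_per_run * points_per_scan
print(f"{rows_per_run:,} rows per experiment")        # 80,000,000

# Illustrative laboratory scale: these counts are assumptions.
machines = 3
runs_per_day = 3
days_per_month = 30

rows_per_month = rows_per_run * machines * runs_per_day * days_per_month
print(f"{rows_per_month / 1e9:.1f} billion rows per month")  # 21.6
```

Even with these modest assumed counts, a single month of operation lands in the tens of billions of rows, which motivates the move away from per-file local processing.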
This research investigates the potential for processing proteomics data in near real-time using a parallel system. The focus is on the feature detection part of the mass spectrometer processing pipeline and how this could become part of an architected cloud-based or on-premises solution. The experimentation uses the MapReduce framework to run the feature detection algorithm in parallel on a horizontally scalable cluster of servers. The systems tested include Hadoop, Flink and Spark, in both batch and real-time streaming modes.
The work shows that it is possible to detect features in the mass spectrometer data using “intra-file” parallelism: a data file is split into sections, which are then processed independently on the cluster. This splitting is vital to enabling feature detection in a streaming fashion, and it is a major differentiation between this research and most current processing methods, which process complete files serially. The work also highlighted the relevance of treating the laboratory as an Internet of Things, with data streaming from the mass spectrometers in real-time to a central computing platform where the processing is completed with contemporary open-source technology. Consequently, the research described in this thesis points towards the adoption of a distributed cluster-based architecture that allows mass spectrometer output to be processed in real-time as it is generated. Making the results available as soon as the experiment has completed allows life scientists to iterate over a problem faster, leading to quicker paths to insight.
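The intra-file approach can be sketched as a toy MapReduce-style job: one file's scans are partitioned into sections, each section is mapped to candidate features in parallel, and the per-section results are merged. The local-maximum peak picking, section size, and synthetic data below are illustrative stand-ins, not the thesis's actual feature-detection algorithm:

```python
from concurrent.futures import ProcessPoolExecutor

def detect_features(section):
    """Toy 'map' step: flag local intensity maxima in one section of scans.
    A stand-in for real feature detection, which would also group peaks
    across retention time and m/z."""
    features = []
    for scan_id, intensities in section:
        for i in range(1, len(intensities) - 1):
            if intensities[i] > intensities[i - 1] and intensities[i] > intensities[i + 1]:
                features.append((scan_id, i, intensities[i]))
    return features

def split_into_sections(scans, section_size):
    """Intra-file split: partition scans so sections process independently."""
    return [scans[i:i + section_size] for i in range(0, len(scans), section_size)]

if __name__ == "__main__":
    # Tiny synthetic 'file': (scan_id, intensity trace) pairs.
    scans = [(s, [0, 5, 1, 7, 2, 0]) for s in range(8)]
    sections = split_into_sections(scans, section_size=2)   # split the file
    with ProcessPoolExecutor() as pool:                     # parallel 'map'
        per_section = pool.map(detect_features, sections)
    merged = [f for feats in per_section for f in feats]    # 'reduce': merge
    print(len(merged), "candidate features")
```

Because each section carries everything its worker needs, sections can be dispatched as soon as they arrive from the instrument rather than after the whole file is written, which is the property that makes streaming feature detection possible.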
Date of Award: 2018
Supervisors: Andy Cobley, Karen Petrie & Mark Whitehorn