Automatic transformation of raw clinical data into clean data using decision tree learning combining with string similarity algorithm

Jian Zhang

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Conducting statistical analyses of complex scientific datasets is challenging. Finding the relationships within the data is time-consuming, whether for a scientist or a statistician. The process involves preprocessing the raw data, selecting appropriate statistics, performing the analysis and providing correct interpretations; of these, data preprocessing is tedious and a particular time drain. In the large amounts of data provided for analysis, there is no standard for recording information, and there are errors of spelling, typing or transmission. As a result, the same meaning is expressed in many different ways in the data, and an analysis system cannot deal with these inconsistencies automatically. What is needed is an automatic method for transforming raw clinical data into data that can be processed automatically. In this paper we propose a method combining decision tree learning with a string similarity algorithm, which is fast and accurate for clinical data cleaning. Experimental results show that it outperforms individual string similarity algorithms and the traditional data cleaning process.
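
To illustrate the string similarity half of this idea, the sketch below maps variant spellings of a field value onto a small canonical vocabulary. It is not the paper's method: the abstract does not say which similarity measure is used or how it is combined with decision tree learning, so the decision tree step is left out entirely. The vocabulary `CANONICAL`, the function names, the threshold of 0.7 and the use of Python's `difflib.SequenceMatcher` are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical canonical vocabulary for one field; not taken from the paper.
CANONICAL = ["male", "female", "unknown"]


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] from difflib; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()


def normalise(raw: str, vocabulary=CANONICAL, threshold: float = 0.7):
    """Map a raw entry to its closest canonical term.

    Returns None when no term reaches the (assumed) similarity threshold,
    so that unmatched entries can be flagged for manual review.
    """
    cleaned = raw.strip().lower()
    best = max(vocabulary, key=lambda term: similarity(cleaned, term))
    return best if similarity(cleaned, best) >= threshold else None
```

With this sketch, a misspelling such as `"Femal"` resolves to `"female"`, while an unrelated string such as `"xyz"` returns `None`. The threshold trades false matches against unmatched entries and would need tuning on real data.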

Original language: English
Title of host publication: OpenAccess Series in Informatics
Publisher: Schloss Dagstuhl - Leibniz-Zentrum für Informatik GmbH
Pages: 87-94
Number of pages: 8
Volume: 49
ISBN (Print): 9783959770002
Publication status: Published - 2015
Event: 5th Imperial College Computing Student Workshop, ICCSW 2015 - London, United Kingdom
Duration: 24 Sept 2015 to 25 Sept 2015

Conference

Conference: 5th Imperial College Computing Student Workshop, ICCSW 2015
Country/Territory: United Kingdom
City: London
Period: 24/09/15 to 25/09/15

Keywords

  • Decision tree learning
  • Raw clinical data
  • String similarity algorithm

ASJC Scopus subject areas

  • Geography, Planning and Development
  • Modelling and Simulation
