Abstract
Conducting statistical analyses of complex scientific datasets is challenging. Finding the relationships within data is time-consuming for scientists and statisticians alike. The process involves preprocessing the raw data, selecting appropriate statistics, performing the analysis, and providing correct interpretations; of these steps, data preprocessing is tedious and a particular time drain. Large datasets provided for analysis often follow no standard for recording information and contain spelling, typing, or transmission errors. As a result, the same meaning may be expressed in many different ways, and an analysis system cannot deal with these inconsistencies automatically. What is needed is an automatic method for transforming raw clinical data into data that can be processed automatically. In this paper we propose a method that combines decision tree learning with a string similarity algorithm and that is fast and accurate for clinical data cleaning. Experimental results show that it outperforms individual string similarity algorithms and the traditional data cleaning process.
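To make the problem of "many expressions for the same meaning" concrete, the following is a minimal sketch of the string similarity half of such a pipeline. It is not the authors' method: the abstract gives no implementation details, and the decision tree component is omitted. The vocabulary `CANONICAL`, the helper names `similarity` and `normalize`, and the `0.8` threshold are all assumptions chosen for illustration, and Python's standard `difflib.SequenceMatcher` stands in for whichever similarity measure the paper uses.

```python
from difflib import SequenceMatcher

# Hypothetical canonical vocabulary (an assumption, not taken from the paper).
CANONICAL = ["diabetes", "hypertension", "asthma"]


def similarity(a: str, b: str) -> float:
    """Similarity score in [0, 1], ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def normalize(raw: str, vocabulary=CANONICAL, threshold: float = 0.8):
    """Map a raw entry to its closest canonical term.

    Returns None when no term reaches the (assumed) similarity threshold,
    so that unmatched values can be flagged for manual review.
    """
    best = max(vocabulary, key=lambda term: similarity(raw, term))
    return best if similarity(raw, best) >= threshold else None


# Variant spellings of the same meaning, plus one value that matches nothing.
raw_values = ["Diabetes ", "diabtes", "hypertention", "n/a"]
cleaned = [normalize(v) for v in raw_values]
print(cleaned)
```

A fixed threshold like this is the kind of hand-tuned rule that a learned classifier could plausibly replace, for example by feeding several similarity scores into a decision tree as features. That reading is an interpretation of the abstract rather than a description of the published approach.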
Original language | English |
---|---|
Title of host publication | OpenAccess Series in Informatics |
Publisher | Schloss Dagstuhl - Leibniz-Zentrum für Informatik GmbH |
Pages | 87-94 |
Number of pages | 8 |
Volume | 49 |
ISBN (Print) | 9783959770002 |
Publication status | Published - 2015 |
Event | 5th Imperial College Computing Student Workshop, ICCSW 2015 - London, United Kingdom; duration: 24 Sept 2015 → 25 Sept 2015 |
Conference
Conference | 5th Imperial College Computing Student Workshop, ICCSW 2015 |
---|---|
Country/Territory | United Kingdom |
City | London |
Period | 24/09/15 → 25/09/15 |
Keywords
- Decision tree learning
- Raw clinical data
- String similarity algorithm
ASJC Scopus subject areas
- Geography, Planning and Development
- Modelling and Simulation