Purpose Much potentially useful clinical information for pharmacoepidemiological research is contained in unstructured free-text documents and is not readily available for analysis. Routine health data such as Scottish Morbidity Records (SMR01) frequently use generic 'stroke' codes that do not distinguish stroke type. Free-text Computerised Radiology Information System (CRIS) reports have the potential to provide this missing detail. We aimed to increase the number of stroke-type-specific diagnoses by augmenting SMR01 with data derived from CRIS reports and to assess the accuracy of this methodology.
Methods SMR01 codes describing first-ever-stroke admissions in Tayside, Scotland from 1994 to 2005 were linked to CRIS CT-brain scan reports occurring within 14 days of admission. Software was developed to parse the text and extract details of stroke type using keyword matching. An algorithm was iteratively developed to differentiate intracerebral haemorrhage (ICH) from ischaemic stroke (IS) against a training set of reports with pathophysiologically precise SMR01 codes. This algorithm was then applied to CRIS reports associated with generic SMR01 codes. To establish the accuracy of the algorithm, a sample of 150 ICH and 150 IS reports was independently classified by a stroke physician.
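As an illustration of the keyword-matching step, the following is a minimal sketch and not the study's algorithm. The keyword lists, the negation cues, and the function name `classify_report` are all hypothetical; the abstract does not state which terms or rules were actually used.

```python
import re

# Hypothetical keyword lists; the terms used in the study are not given in the abstract.
ICH_TERMS = ("haemorrhage", "hemorrhage", "haematoma", "hematoma", "bleed")
IS_TERMS = ("infarct", "ischaemic", "ischemic")

# Hypothetical negation cues: a keyword is ignored if one of these precedes it
# in the same sentence (e.g. "no evidence of haemorrhage").
NEGATION = re.compile(r"\b(no|not|without|excluded?)\b")


def _mentions(sentence, terms):
    """True if the first occurrence of any term is not preceded by a negation cue."""
    for term in terms:
        pos = sentence.find(term)
        if pos != -1 and not NEGATION.search(sentence[:pos]):
            return True
    return False


def classify_report(text):
    """Label a free-text CT report as 'ICH', 'IS', or 'unclassified'."""
    sentences = re.split(r"[.;\n]+", text.lower())
    has_ich = any(_mentions(s, ICH_TERMS) for s in sentences)
    has_is = any(_mentions(s, IS_TERMS) for s in sentences)
    if has_ich and not has_is:
        return "ICH"
    if has_is and not has_ich:
        return "IS"
    # Reports mentioning both types, or neither, are left unclassified.
    return "unclassified"
```

In this sketch, a report such as "Acute infarct in the left MCA territory. No evidence of haemorrhage." would be labelled IS, because the haemorrhage keyword is negated within its sentence. A real rule set would need iterative refinement against a training set, as the Methods describe.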
Results There were 8419 SMR01-coded first-ever strokes. The proportion of patients with pathophysiologically clear diagnoses doubled from 2745 (32.6%) to 5614 (66.7%). The positive predictive value was 94.7% (95% CI 89.8-97.3) for IS and 76.7% (95% CI 69.3-82.7) for haemorrhagic stroke.
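The positive predictive values can be checked with a short calculation. The sketch below assumes that 142 of the 150 sampled IS reports and 115 of the 150 sampled ICH reports were confirmed by the physician; these counts are inferred from the reported percentages and are not stated in the abstract. It also assumes a Wilson score interval, which the abstract does not name but which appears consistent with the reported limits.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Assumed counts (inferred from the reported 94.7% and 76.7%, not given in the abstract).
for label, confirmed in (("IS", 142), ("ICH", 115)):
    low, high = wilson_interval(confirmed, 150)
    print(f"{label}: PPV {confirmed / 150:.1%}, 95% CI {low:.1%} to {high:.1%}")
```

Under these assumptions the intervals agree with the reported values to one decimal place.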
Conclusions A free-text processing approach was acceptably accurate at identifying IS, but not ICH. This approach could be adapted to other studies where radiology reports may be informative. Copyright (C) 2010 John Wiley & Sons, Ltd.
- cerebral haemorrhage
- brain infarction
- natural language processing
- radiology information systems
- medical records