Terminology development for Digital Forensics using Natural Language Processing

  • Maël Le Gall

Student thesis: Master's Thesis, Master of Science

Abstract

Digital forensics suffers from a communication and knowledge-sharing challenge. Various actions have been undertaken to solve this problem, such as the creation of glossaries or ontologies. However, those solutions usually have a narrow scope, are not publicly available and/or do not allow the reuse of the knowledge by artificial agents. Those restrictions have led to limited use by the practitioner community and, therefore, have not resolved the communication challenge that the digital forensics community faces. This study aims to determine the efficiency of Natural Language Processing (NLP) methods to provide a terminology covering the scope of the existing OSAC (Organization of Scientific Area Committees for Forensic Science) glossary, and to expand this scope to represent the depth of the terminology needs of the community.

A digital forensics corpus that represents the point of view of both academics and practitioners was gathered. A benchmark data set was created to identify the best unsupervised Automatic Term Extraction (ATE) method to use with the digital forensics corpus. The extraction and analysis of the first 1,000 terms were conducted with input from a digital forensics practitioner. The results of this benchmark revealed that the Weirdness method was the most relevant ATE approach. The extracted terms contained 61% (186/304) of the OSAC unique terms, while 85% (100/118) of the missing terms were justified. Of the top 1,000 terms absent from the OSAC glossary, 83% (783/943) were related to a topic or sub-discipline of digital forensics.
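As an illustration of the Weirdness approach named above, the following is a minimal sketch, not the thesis implementation. It assumes the commonly used definition of weirdness as the ratio of a term's relative frequency in a domain corpus to its relative frequency in a general reference corpus. The function names, the single-token treatment of terms, and the additive smoothing for terms unseen in the reference corpus are assumptions made for this example only.

```python
from collections import Counter


def weirdness_scores(domain_tokens, general_tokens, smoothing=1.0):
    """Score each domain term by relative-frequency ratio (domain / general).

    `smoothing` is an assumption of this sketch: it avoids division by zero
    for terms that never occur in the general reference corpus.
    """
    domain_counts = Counter(domain_tokens)
    general_counts = Counter(general_tokens)
    n_domain = len(domain_tokens)
    n_general = len(general_tokens)

    scores = {}
    for term, freq in domain_counts.items():
        domain_rel = freq / n_domain
        general_rel = (general_counts[term] + smoothing) / (n_general + smoothing)
        scores[term] = domain_rel / general_rel
    return scores


def top_terms(scores, k):
    """Return the k highest-scoring candidate terms."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy corpora (hypothetical data, for illustration only).
domain = ["hash", "image", "the", "hash", "the"]
general = ["the", "the", "the", "image", "a", "of"]

scores = weirdness_scores(domain, general)
best = top_terms(scores, 1)
```

In this toy example, "hash" ranks above "the" because it is frequent in the domain corpus and absent from the reference corpus. A real pipeline would also need tokenisation, multi-word candidate extraction and a large reference corpus, none of which are shown here.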

These results suggest that our corpus and the NLP method combined are within the scope of the OSAC glossary. They also show that most of the top 1,000 terms of our corpus expand the OSAC scope. This underlines that digital forensics has a terminological and semantic richness that can be supported and improved by technologies such as NLP.
Date of Award: 2022
Original language: English
Supervisor: Niamh Nic Daeid (Supervisor), David Haynes (Supervisor) & Christian Cole (Supervisor)
