A digital forensics corpus representing the view of academics and practitioners 1999-2021



A significant challenge in digital forensics is the lack of a framework for common language and knowledge. This creates barriers to communication, collaboration and knowledge sharing amongst stakeholders. Methods for creating a comprehensive set of common terms on a topic include Natural Language Processing (NLP) and Generative Artificial Intelligence (GenAI) algorithms. The efficiency of these algorithms depends on the coverage, quality and quantity of the training corpus. As far as we know, no such corpus is readily available for training these algorithms.

This dataset is a digital forensics practice and research corpus, validated by practitioners working in this domain. The corpus is ready for training new generations of NLP and GenAI algorithms. The associated paper also presents a systematic method of sharing a training corpus, in which the data structure, such as folder and file names, makes it convenient to interact with the data programmatically.
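
As an illustration of what programmatic interaction with such a folder-and-file layout could look like, the following is a minimal sketch in Python. It assumes a hypothetical layout in which plain-text documents sit in folders under a corpus root; the actual folder names, file names and file formats are defined by the dataset and the associated paper, not by this sketch. The function names `index_corpus` and `read_documents` are illustrative only.

```python
from collections import defaultdict
from pathlib import Path


def index_corpus(root):
    """Group .txt files under `root` by their top-level folder name.

    Assumption: documents are stored as plain-text files in folders
    below the corpus root. The real layout may differ.
    """
    root = Path(root)
    index = defaultdict(list)
    for path in sorted(root.rglob("*.txt")):
        rel = path.relative_to(root)
        # Files directly under the root have no folder; group them under "".
        group = rel.parts[0] if len(rel.parts) > 1 else ""
        index[group].append(path)
    return dict(index)


def read_documents(paths, encoding="utf-8"):
    """Yield (file stem, text) pairs for each path."""
    for path in paths:
        # Undecodable bytes are replaced rather than raising an error.
        yield path.stem, path.read_text(encoding=encoding, errors="replace")
```

Grouping by folder name means that any category encoded in the directory structure can be recovered without a separate metadata file, and the file stem can serve as a document identifier when the texts are passed to an NLP pipeline.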
Date made available: 24 May 2024
Publisher: University of Dundee
Temporal coverage: 1999 - 2021
Date of data production: 2022

Data Monitor categories

  • Digital Forensics Corpus
  • Natural Language Processing
  • NLP
  • Generative Artificial Intelligence
  • GenAI
