The Real Data Corpus



The Real Data Corpus (RDC) is a collection of raw data extracted from data-carrying devices that were purchased on the secondary market around the world. Many studies have shown that hard drives, cell phones, USB memory sticks, and other data-carrying devices are frequently discarded by their original users without the data first being cleared or purged. By purchasing these devices and extracting their data, we have created a data set that closely mimics data as it is found in the real world.

Potential Uses

The Real Data Corpus is a one-of-a-kind scientific resource for:

  • Developing and validating forensic and data recovery algorithms and tools.
  • Developing and validating document translation software.
  • Exploring and characterizing real-world computing practices, configuration choices, and option settings.
  • Studying the storage allocation strategies of file systems under real-world conditions.

Current Contents

  • A total of 156 hard drive images ranging in size from 500MB to 80GB.
  • Approximately 600 flash memory images (USB, Sony Memory Stick, SD, and others), ranging in size from 128MB to 4GB.
  • Approximately 100 CDs, all purchased outside the US.
  • Approximately 10 digital camera memory images.
  • Approximately 40 GSM SIM chip memory images.

More details of the corpus content can be found in Garfinkel, Farrell, Roussev, and Dinolt, "Bringing Science to Digital Forensics with Standardized Forensic Corpora," DFRWS 2009, Montreal, Canada.[1]