The Electronic Discovery Reference Model Data Set Project
Establishing guidelines. Setting Standards. Delivering Resources.

EDRM Data Set Project http://edrm.net/activities/projects/data-set

The EDRM Data Set Project’s mission is to provide industry-standard reference data sets of electronically stored information (ESI) and software files that can be used to test various aspects of electronic discovery software and services. EDRM Reference ESI Data Sets enable organizations to more easily publish, replicate, and compare test results across electronic discovery solutions. EDRM currently offers three Reference ESI Data Sets comprising 40 GB of data, over 200 file formats, and 23 different languages. In addition to the Reference ESI Data Sets, the EDRM Data Set Project is investigating two software hash projects, the EDRM Software Reference Data Set and the EDRM Probabilistic Hash Data Set, which aim to lower review costs by improving the culling of software files from ESI collections.

EDRM Reference ESI Data Sets

This project collects, evaluates, and publishes ESI data sets for use in testing electronic discovery software and services. Three data sets are currently offered, and more are under evaluation. The three sets currently offered are:

• EDRM Data Set Enron PST files: 40 GB of Enron e-mail messages and attachments in PST format, organized in 32 zipped files, each less than 700 MB in size, which together contain 168 .pst files.
• EDRM File Formats Data Set: 381 files covering 200 file formats.
• EDRM Internationalization Data Set: A snapshot of selected Ubuntu localization mailing list archives covering 23 languages in 724 MB of email.

Using the Reference ESI Data Sets, organizations can more easily establish the effectiveness of their electronic discovery software, services, and processes.

EDRM Software Reference Data Set

The EDRM Software Reference Data Set project’s mission is to augment the NIST Reference Data Set hashes used in electronic discovery with additional hashes of known software files, so that more such files can be culled before review. The NIST list focuses on a selection of software applications, and only as the software exists on installation media (e.g., DVDs and CDs). This project will provide hashes for the software after it has been extracted from compressed media containers and installed on a system, as well as for software not currently handled by NIST, e.g., software that is downloaded from the Internet rather than received on DVD or CD media. This project will modernize and enhance the list of hashes available for culling software files, reducing electronic discovery costs.
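The culling step described above can be sketched as a simple hash lookup. This is a minimal illustration, not the project's tooling: it assumes SHA-1 digests compared as lowercase hex strings, and the names `sha1_of_file`, `cull_known_software`, and `known_hashes` are hypothetical. How the known-hash list is loaded (for example, from a NIST or EDRM hash file) is left out.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA-1 digest of a file as lowercase hex, reading in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cull_known_software(paths, known_hashes):
    """Split paths into (keep, culled) by membership in a known-software hash set.

    known_hashes is a hypothetical iterable of hex SHA-1 strings; it is
    normalized to lowercase here so the comparison is case-insensitive.
    """
    known = {h.lower() for h in known_hashes}
    keep, culled = [], []
    for path in paths:
        if sha1_of_file(path) in known:
            culled.append(path)   # matches a known software file
        else:
            keep.append(path)     # unknown file, stays in the review set
    return keep, culled
```

A file is culled only on an exact digest match, which is why hashes taken from installation media do not match the same software once it has been extracted and installed.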

This document is licensed under a Creative Commons Attribution 3.0 United States license. To provide attribution, please cite to "EDRM (edrm.net)." If you have questions, contact us at [email protected]. © 2010 Socha Consulting LLC and Gelbmann & Associates.

EDRM Probabilistic Hash Data Set

To further assist in the culling of possibly non-ESI files, the Probabilistic Hash Data Set project seeks to collect as many anonymous hashes as possible of files encountered in real-world electronic discovery. After sufficient sampling, the frequency with which a hash appears can be used to estimate how likely a particular file is to be potentially relevant ESI or a non-ESI software file that should be culled. This project seeks to significantly improve the performance of automated culling of non-ESI files for electronic discovery, resulting in lower costs.
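One way to picture the frequency idea is the sketch below. The source does not specify a scoring method, so everything here is an assumption made for illustration: each collection is modeled as a set of hash strings, a hash is flagged when it appears in at least `min_fraction` of collections, and nothing is flagged until `min_collections` collections have been sampled. All function and parameter names are hypothetical.

```python
from collections import Counter


def collection_frequency(collections):
    """Count, for each hash, how many distinct collections contain it."""
    counts = Counter()
    for hashes in collections:
        counts.update(set(hashes))  # count each hash once per collection
    return counts


def likely_non_esi(collections, min_collections=20, min_fraction=0.5):
    """Return hashes seen in at least min_fraction of the sampled collections.

    Returns an empty set until min_collections collections are available,
    reflecting the "after sufficient sampling" condition. Both thresholds
    are illustrative defaults, not values from the project.
    """
    total = len(collections)
    if total < min_collections:
        return set()
    counts = collection_frequency(collections)
    return {h for h, n in counts.items() if n / total >= min_fraction}
```

The intuition is that a file appearing across many unrelated matters is more likely to be common software than case-specific ESI, while a hash seen in a single collection gives no such signal.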

To learn more about the EDRM Data Set Project and participate in our activities, please go to http://edrm.net/activities/projects/data-set and/or contact George Socha ([email protected]) or Tom Gelbmann ([email protected]).

Select File Types

• Adobe Photoshop
• Ami Draw
• Corel Draw
• Corel Presentations
• dBASE
• First Choice DB, SS, WP
• Freelance
• Harvard Graphics
• Gem File
• Gem Image
• IBM DCA/RFT
• IBM DisplayWrite
• IBM Graphics Data Format
• IBM Picture Interchange
• IBM Writing Assistant
• IGES Drawing
• Kodak Photo CD
• Lotus 1-2-3
• Lotus Manuscript
• Lotus PIC
• Lotus Screen Snapshot
• Mac PowerPoint
• Mac Word
• Mac WordPerfect
• Mac Works
• MacPaint
• MacWrite
• Micrografx Designer
• Microsoft Access
• Microsoft PowerPoint
• Microsoft Project
• Microsoft PST
• Microsoft Visio
• Microsoft Win Metafile
• Multipage
• OfficeWriter
• Paintbrush
• Paint Shop Pro
• Paradox
• PDF
• PerfectWorks for Windows
• PFS: Plan
• PostScript
• Q&A Database
• Q&A Write
• Reflex
• Smart
• SmartWare II
• StarOffice Calc
• StarOffice Impress
• StarOffice Writer
• SuperCalc
• Symphony
• Targa
• Total Word
• vCard
• Volkswriter
• VP Planner
• Wang IWP
• WordPerfect
• WordStar
• XyWrite

Languages

1. Arabic
2. Catalan
3. Chinese
4. Danish
5. Dutch
6. English
7. Finnish
8. French
9. German
10. Greek
11. Hebrew
12. Hungarian
13. Italian
14. Japanese
15. Korean
16. Norwegian
17. Polish
18. Portuguese
19. Romanian
20. Russian
21. Spanish
22. Swedish
23. Tamil
24. Turkish
