EDRM Data Set Project
The Electronic Discovery Reference Model Data Set Project
Establishing guidelines. Setting Standards. Delivering Resources.

EDRM Data Set Project
http://edrm.net/activities/projects/data-set

The EDRM Data Set Project’s mission is to provide industry-standard reference data sets of electronically stored information (ESI) and software files that can be used to test various aspects of electronic discovery software and services. EDRM Reference ESI Data Sets enable organizations to more easily publish, replicate, and compare test results across various electronic discovery solutions. EDRM currently offers three Reference ESI Data Sets comprising 40 GB of data, over 200 file formats, and 23 different languages.

In addition to the Reference ESI Data Sets, the EDRM Data Set Project is investigating two software hash projects, the EDRM Reference Software Data Set and the EDRM Probabilistic Hash Data Set, to lower review costs by improving the culling of software files from ESI collections.

EDRM Reference ESI Data Sets

This project collects, evaluates, and publishes ESI data sets for use in testing electronic discovery software and services. Three data sets are currently offered, and more are under evaluation. The three sets currently offered are:

EDRM Data Set Enron PST files: 40 GB of Enron e-mail messages and attachments in PST format, organized in 32 zipped files, each less than 700 MB in size, containing 168 .pst files.

EDRM File Formats Data Set: 381 files covering 200 file formats.

EDRM Internationalization Data Set: A snapshot of selected Ubuntu localization mailing list archives covering 23 languages in 724 MB of email.

Using the Reference ESI Data Sets, organizations can more easily establish the effectiveness of their electronic discovery software, services, and processes.
EDRM Software Reference Data Set

The EDRM Software Reference Data Set project’s mission is to augment the NIST Reference Data Set hashes used in electronic discovery with additional hashes of known software files that can be further culled for review purposes. While the NIST list focuses on a selection of software applications, and only as the software exists on installation media (e.g., DVDs and CDs), this project will provide hashes for software after it has been extracted from compressed media containers and installed on a system, as well as for software not currently handled by NIST, e.g., software that is downloaded from the Internet rather than received on DVD or CD media. This project will modernize and enhance the list of hashes available for culling software files to reduce electronic discovery costs.

This document is licensed under a Creative Commons Attribution 3.0 United States license. To provide attribution, please cite to "EDRM (edrm.net)." If you have questions, contact us at [email protected]. © 2010 Socha Consulting LLC and Gelbmann & Associates.

EDRM Probabilistic Hash Data Set

To further assist in the culling of possibly non-ESI files, the Probabilistic Hash Data Set project seeks to collect as many anonymous hashes as possible of files encountered in real-world electronic discovery. After sufficient sampling, the frequency with which a hash appears can be used to estimate how likely a particular file is to be potentially relevant ESI or a non-ESI software file that should be culled. This project seeks to significantly improve the performance of automated culling of non-ESI files for electronic discovery, resulting in lower costs.
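The two hash-based culling ideas described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not part of the EDRM projects themselves: the choice of SHA-1 as the hash, the function names (sha1_hex, cull_known, frequent_hashes), the sample file contents, and the min_collections threshold are all hypothetical.

```python
from __future__ import annotations

import hashlib
from collections import Counter


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of data as a hex string (hash choice is an assumption)."""
    return hashlib.sha1(data).hexdigest()


def cull_known(files: dict[str, bytes], known_hashes: set[str]) -> dict[str, bytes]:
    """Keep only the files whose hash does not appear in a known-software hash list."""
    return {name: data for name, data in files.items()
            if sha1_hex(data) not in known_hashes}


def frequent_hashes(counts: Counter, min_collections: int) -> set[str]:
    """Return hashes seen in at least min_collections collections (hypothetical threshold)."""
    return {h for h, n in counts.items() if n >= min_collections}


# Three made-up collections; the same "software" file appears in each of them.
collections = [
    {"report.doc": b"quarterly numbers", "notepad.exe": b"MZ fake binary"},
    {"memo.txt": b"meeting at noon", "notepad.exe": b"MZ fake binary"},
    {"notes.txt": b"call the client", "notepad.exe": b"MZ fake binary"},
]

# Count each hash once per collection, so the count reflects how widespread a file is.
counts = Counter()
for collection in collections:
    counts.update({sha1_hex(data) for data in collection.values()})

# Treat hashes present in all three collections as probable software files.
frequent = frequent_hashes(counts, min_collections=3)

# Cull the first collection against that list; only the document remains.
kept = cull_known(collections[0], frequent)
print(sorted(kept))  # prints ['report.doc']
```

In practice the known-hash set would come from a published hash list, and the threshold would have to be chosen only after enough collections had been sampled; the sketch fixes both arbitrarily.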
To learn more about the EDRM Data Set Project and participate in our activities, please go to http://edrm.net/activities/projects/data-set and/or contact George Socha ([email protected]) or Tom Gelbmann ([email protected]).

Select File Types                                     Languages

Adobe Photoshop           Microsoft Win Metafile      1. Arabic
Ami Draw                  Microsoft Word              2. Catalan
Corel Draw                Microsoft Works             3. Chinese
Corel Presentations       MultiMate                   4. Danish
dBASE                     Multipage                   5. Dutch
First Choice DB, SS, WP   Multiplan                   6. English
Freelance                 OfficeWriter                7. Finnish
Harvard Graphics          Paintbrush                  8. French
Gem File                  Paint Shop Pro              9. German
Gem Image                 Paradox                     10. Greek
IBM DCA/RFT               PDF                         11. Hebrew
IBM DisplayWrite          PerfectWorks for Windows    12. Hungarian
IBM Graphics Data Format  PFS: Plan                   13. Italian
IBM Picture Interchange   PostScript                  14. Japanese
IBM Writing Assistant     Q&A Database                15. Korean
IGES Drawing              Q&A Write                   16. Norwegian
Kodak Photo CD            Quattro Pro                 17. Polish
Lotus 1-2-3               Reflex                      18. Portuguese
Lotus Manuscript          Smart Spreadsheet           19. Romanian
Lotus PIC                 SmartWare II                20. Russian
Lotus Screen Snapshot     StarOffice Calc             21. Spanish
Mac PowerPoint            StarOffice Impress          22. Swedish
Mac Word                  StarOffice Writer           23. Tamil
Mac WordPerfect           SuperCalc                   24. Turkish
Mac Works                 Symphony
MacPaint                  Targa
MacWrite                  Total Word
Micrografx Designer       vCard
Microsoft Access          Volkswriter
Microsoft Excel           VP Planner
Microsoft PowerPoint      Wang IWP
Microsoft Project         WordPerfect
Microsoft PST             WordStar
Microsoft Visio           XyWrite