IDOL Keyview Filter SDK 12.7 .NET Programming Guide
Total Page:16
File Type:pdf, Size:1020Kb
KeyView Software Version 12.7 Filter SDK .NET Programming Guide Document Release Date: October 2020 Software Release Date: October 2020 Filter SDK .NET Programming Guide Legal notices Copyright notice © Copyright 2016-2020 Micro Focus or one of its affiliates. The only warranties for products and services of Micro Focus and its affiliates and licensors (“Micro Focus”) are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. Micro Focus shall not be liable for technical or editorial errors or omissions contained herein. The information contained herein is subject to change without notice. Documentation updates The title page of this document contains the following identifying information: l Software Version number, which indicates the software version. l Document Release Date, which changes each time the document is updated. l Software Release Date, which indicates the release date of this version of the software. To check for updated documentation, visit https://www.microfocus.com/support-and-services/documentation/. Support Visit the MySupport portal to access contact information and details about the products, services, and support that Micro Focus offers. This portal also provides customer self-solve capabilities. It gives you a fast and efficient way to access interactive technical support tools needed to manage your business. As a valued support customer, you can benefit by using the MySupport portal to: l Search for knowledge documents of interest l Access product documentation l View software vulnerability alerts l Enter into discussions with other software customers l Download software patches l Manage software licenses, downloads, and support contracts l Submit and track service requests l Contact customer support l View information about all services that Support offers Many areas of the portal require you to sign in. If you need an account, you can create one when prompted to sign in. To learn about the different access levels the portal uses, see the Access Levels descriptions. KeyView (12.7) Page 2 of 256 Filter SDK .NET Programming Guide Contents Part I: Overview of Filter SDK 9 Chapter 1: Introducing Filter SDK 10 Overview 10 Features 10 Platforms, Compilers, and Dependencies 11 Supported Platforms 11 Supported Compilers 11 Software Dependencies 12 Windows Installation 12 UNIX Installation 13 Package Contents 14 License Information 15 Enable Advanced Document Readers 15 Pass License Information to KeyView 15 Directory Structure 16 Chapter 2: Getting Started 18 Architectural Overview 18 File Caching 19 Filtering 20 Subfile Extraction 20 Use the .NET Implementation of the API 20 Input/Output Operations 21 Filter in File or Stream Mode 21 Multithreaded Filtering 22 The Filter Process Model 23 Persist the Child Process 24 Run Filter In Process 25 Run File Extraction Functions Out of Process 25 Out-of-Process Logging 25 Enable Out-of-Process Logging 26 Set the Verbosity Level 26 Enable Windows Minidump 26 Keep Log Files 27 Run File Detection In or Out of Process 27 Specify the Process Type In the formats.ini File 27 Specify the Process Type In the API 28 Stream Data to Filter 28 Part II: Use Filter SDK 29 Chapter 3: Use the File Extraction API 30 KeyView (12.7) Page 3 of 256 Filter SDK .NET Programming Guide Introduction 30 Extract Subfiles 31 Sanitize Absolute Paths 32 Extract Images 33 Recreate a File's Hierarchy 33 Create a Root Node 33 Example 34 Extract Mail Metadata 35 Default Metadata Set 35 Extract the Default Metadata Set 36 Microsoft Outlook (MSG) Metadata 37 Extract MSG-Specific Metadata 38 Microsoft Outlook Express (EML) and Mailbox (MBX) Metadata 38 Extract EML- or MBX-Specific Metadata 38 Lotus Notes Database (NSF) Metadata 39 Extract NSF-Specific Metadata 39 Microsoft Personal Folders File (PST) Metadata 39 MAPI Properties 39 Extract PST-Specific Metadata 40 Exclude Metadata from the Extracted Text File 41 Extract Subfiles from Outlook Files 41 Extract Subfiles from Outlook Express Files 41 Extract Subfiles from Mailbox Files 41 Extract Subfiles from Outlook Personal Folders Files 42 Choose the Reader to use for PST Files 42 MAPI Attachment Methods 44 Open Secured PST Files 44 Detect PST Files While the Outlook Client is Running 45 Extract Subfiles from Lotus Domino XML Language Files 45 Extract Subfiles from Lotus Notes Database Files 46 System Requirements 46 Installation and Configuration 47 Windows 47 Solaris 47 AIX 5.x 47 Linux 48 Open Secured NSF Files 48 Format Note Subfiles 48 Extract Subfiles from PDF Files 49 Improve Performance for PDFs with Many Small Images 49 Extract Embedded OLE Objects 49 Extract Subfiles from ZIP Files 49 Default File Names for Extracted Subfiles 50 Default File Name for Mail Formats 50 Default File Name for Embedded OLE Objects 51 Chapter 4: Use the Filter API 53 KeyView (12.7) Page 4 of 256 Filter SDK .NET Programming Guide Generate an Error Log 53 Enable or Disable Error Logging 54 Change the Path and File Name of the Log File 54 Report Memory Errors 55 Specify a Memory Guard 55 Report the File Name in Stream Mode 55 Specify the Maximum Size of the Log File 56 Extract Metadata 56 Extract Metadata for File Filtering 57 Extract Metadata for Stream Filtering 57 Example 57 Convert Character Sets 59 Determine the Character Set of the Output Text 59 Guidelines for Character Set Conversion 59 Set the Character Set During Filtering 60 Set the Character Set During Subfile Extraction 60 Prevent the Default Conversion of a Character Set 61 Extract Deleted Text Marked by Tracked Changes 61 Filter PDF Files 62 Filter PDF Files to a Logical Reading Order 62 Enable Logical Reading Order 63 Use the API 63 Use the formats.ini File 64 Rotated Text 65 Extract Custom Metadata from PDF Files 65 Skip Embedded Fonts 66 Use the formats.ini File 66 Use the .NET API 66 Control Hyphenation 66 Filter Portfolio PDF Files 67 Filter Spreadsheet Files 67 Filter Worksheet Names 67 Filter Hidden Text in Microsoft Excel Files 67 Specify Date and Time Format on UNIX Systems 68 Filter Very Large Numbers in Spreadsheet Cells to Precision Numbers 68 Extract Microsoft Excel Formulas 69 Filter HTML Files 70 Filter XML Files 71 Configure Element Extraction for XML Documents 71 Modify Element Extraction Settings 72 Use an Initialization File 72 Modify Element Extraction Settings in the kvxconfig.ini File 72 Specify an Element's Namespace and Attribute 74 Add Configuration Settings for Custom XML Document Types 74 Configure Headers and Footers 75 Tab Delimited Output for Embedded Tables 75 KeyView (12.7) Page 5 of 256 Filter SDK .NET Programming Guide Exclude Japanese Guide Text 76 Source Code Identification 76 Chapter 5: Sample Programs 77 FilterTestDotNet 77 TestExtract 77 TestFilter 79 Appendixes 82 Appendix A: Supported Formats 83 Key to Supported Formats Table 83 Supported Formats 85 Appendix B: Document Readers 159 Key to Document Readers Table 159 Document Readers 161 Appendix C: Character Sets 189 Multibyte and Bidirectional Support 189 Coded Character Sets 197 Appendix D: Extract and Format Lotus Notes Subfiles 203 Overview 203 Customize XML Templates 203 Use Demo Templates 204 Use Old Templates 204 Disable XML Templates 204 Template Elements and Attributes 205 Conditional Elements 205 Control Elements 206 Data Elements 207 Date and Time Formats 210 Lotus Notes Date and Time Formats 210 KeyView Date and Time Formats 211 Appendix E: File Format Detection 216 Introduction 216 Extract Format Information 216 Determine Format Support 216 Example formats.ini file entries 217 Refine Detection of Text Files 217 Allow Consecutive NULL Bytes in a Text File 218 Translate Format Information 219 Distinguish Between Formats 219 Determine a Document Reader 220 Category Values in formats.ini 220 Appendix F: List of Required Files for Redistribution 224 Core Files 224 KeyView (12.7) Page 6 of 256 Filter SDK .NET Programming Guide Support Files 225 Document Readers 226 Appendix G: Develop a Custom Reader 233 Introduction 233 How to Write a Custom Reader 234 Naming Conventions 234 Basic Steps 235 Token Buffer 235 Macros 237 Reader Interface 237 Function Flow 238 Example Development of fffFillBuffer() 238 Implementation 1—fpFillBuffer() Function 238 Structure of Implementation 1 239 Problems with Implementation 1 239 Implementation 2—Processing a Large Token Stream 240 Structure of Implementation 2 241 Problems with Implementation 2 241 Boundary Conditions 241 Implementation 3—Interrupting Structured Access Layer Calls 242 Structure of Implementation 3 244 Development Tips 244 Functions 245 xxxsrAutoDet() 245 xxxAllocateContext() 246 xxxFreeContext() 247 xxxInitDoc() 248 xxxFillBuffer() 248 xxxGetSummaryInfo() 249 xxxOpenStream() 250 xxxCloseStream() 251 xxxCharSet() 251 Appendix H: Password Protected Files 253 Supported Password Protected File Types 253 Open Password Protected Container Files 254 Filter Password Protected Files 254 Send documentation feedback 256 KeyView (12.7) Page 7 of 256 Filter SDK .NET Programming Guide KeyView (12.7) Page 8 of 256 Part I: Overview of Filter SDK This section provides an overview of the Micro Focus KeyView Filter SDK and describes how to use the .NET implementation of the API. KeyView (12.7) Page 9 of 256 Chapter 1: Introducing Filter SDK This section describes the Filter SDK package. • Overview 10 • Features 10 • Platforms, Compilers, and Dependencies 11 • Windows Installation 12 • UNIX Installation 13 • Package Contents 14 • License Information 15 • Directory Structure 16 Overview Micro Focus KeyView Filter SDK enables you to incorporate text extraction functionality into your own applications. It extracts text and metadata from a wide variety of file formats on numerous platforms, and can automatically recognize over 1000 document types. It supports both file-based and stream- based I/O operations, and provides in-process or out-of-process filtering. Filter SDK is part of the KeyView suite of products.