Verity® Collection Reference Guide Version 5.0.1
August 28, 2003 Part Number DM0626
Verity, Incorporated 894 Ross Drive Sunnyvale, California 94089 (408) 541-1500
Verity Benelux BV Coltbaan 31 3439 NG Nieuwegein The Netherlands Copyright Information Copyright 2003 Verity, Inc. All rights reserved. No part of this publication may be reproduced, transmitted, stored in a retrieval system, nor translated into any human or computer language, in any form or by any means, electronic, mechanical, magnetic, optical, chemical, manual or otherwise, without the prior written permission of the copyright owner, Verity, Inc., 894 Ross Drive, Sunnyvale, California 94089. The copyrighted software that accompanies this manual is licensed to the End User for use only in strict accordance with the End User License Agreement, which the Licensee should read carefully before commencing use of the software. Verity®, Ultraseek®, TOPIC®, KeyView®, and Knowledge Organizer® are registered trademarks of Verity, Inc. in the United States and other countries. The Verity logo, Verity Portal One™, and Verity® Profiler™ are trademarks of Verity, Inc. Sun, Sun Microsystems, the Sun logo, Sun Workstation, Sun Operating Environment, and Java are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. Xerces XML Parser Copyright 1999-2000 The Apache Software Foundation. All rights reserved. Microsoft is a registered trademark, and MS-DOS, Windows, Windows 95, Windows NT, and other Microsoft products referenced herein are trademarks of Microsoft Corporation. IBM is a registered trademark of International Business Machines Corporation. The American Heritage® Concise Dictionary, Third Edition Copyright 1994 by Houghton Mifflin Company. Electronic version licensed from Lernout & Hauspie Speech Products N.V. All rights reserved. WordNet 1.7 Copyright © 2001 by Princeton University. All rights reserved Includes Adobe® PDF. Adobe is a trademark of Adobe Systems Incorporated. LinguistX from Inxight Software, Inc., a Xerox New Enterprise Company, 1996-1997. Xerox , Inxight and LinguistX are trademarks of Xerox Corporation and Inxight Software, Inc. LinguistX contains patented technology of Xerox Corporation. All rights reserved. All other trademarks are the property of their respective owners.
Notice to Government End Users If this product is acquired under the terms of a DoD contract: Use, duplication, or disclosure by the Government is subject to restrictions as set forth in subparagraph (c)(1)(ii) of 252.227-7013. Civilian agency contract: Use, reproduction or disclosure is subject to 52.227-19 (a) through (d) and restrictions set forth in the accompanying end user agreement. Unpublished-rights reserved under the copyright laws of the United States. Verity, Inc., 894 Ross Drive Sunnyvale, California 94089.
8/11/03 Contents
Preface ...... 17 Using This Book ...... 17 Version ...... 18 Organization of This Book ...... 18 Stylistic Conventions...... 19 Related Documentation ...... 21 Release Notes and Document Updates ...... 21 Verity Technical Support...... 21
PART IOVERVIEW
1Overview...... 25 Collection Features ...... 25 What Is a Collection?...... 26 Features of the Collection Architecture...... 26 Contents of Collection Indexes...... 27 Dynamic Access to Documents ...... 27 Universal Document Support...... 28 Optional Indexes...... 29 Collection Configuration ...... 29 The Collection Directory...... 29 Collection Style Files ...... 29 Collection Optimizations...... 30 Altering Indexing Behavior...... 30 Inside a product Collection ...... 31 Types of Indexes ...... 31 Documents Table ...... 31
3 Contents
Persistent Fields ...... 31 Transitory Fields ...... 32 What Are Partitions? ...... 32 Index Tuning...... 33 Defining and Populating Fields ...... 34 Document Filters and Formatting ...... 34 Universal Filter and the Helper Filters ...... 35 Using Your Own Filter...... 35 Indexing Modes...... 35 Topic Sets...... 36 Troubleshooting and Maintenance Tools ...... 36
2 Using Gateways...... 37 Gateway Configuration Overview ...... 37 Primary Document Key Format...... 38 Simple Keys ...... 38 Gateway Field Types ...... 39 Security Method ...... 39 Using Separate Gateways for Indexing and Viewing...... 39 Gateway-Related Style Files ...... 40 Gateway-Related Style Files ...... 40 Using Different Gateways for Indexing and Viewing...... 41 Using the HTTP Gateway ...... 42 Features ...... 42 Security...... 42 Style Directory...... 42 Configuration File Syntax...... 43 Configuration File Sample...... 44 Using the File System Gateway...... 45 Features ...... 45 Style Directory...... 45 Configuration File Syntax...... 46
4 Verity® Collection Reference Guide Contents
PART II COLLECTION STYLE
3 Setting Policies...... 51 About Indexing Modes ...... 51 What Indexing Modes Do ...... 51 Dynamically Changing Modes...... 52 Background vs. Administrative Optimizations ...... 52 Using Indexing Modes...... 52 Using the product API ...... 53 Using mkvdk...... 53 Built-in Indexing Modes ...... 54 Generic Mode (generic) ...... 54 News Feed Optimizer Mode (newsfeedopt) ...... 56 Custom Indexing Modes ...... 57 Metaparameter modifiers in the style.plc File...... 57 Defining a Custom Mode ...... 58 Defining a Default Indexing Mode...... 59 Inheriting From a Predefined Indexing Mode ...... 59 Defining Multiple Custom Indexing Modes ...... 60 Returning Document Counts ...... 60 Using a style.plc File ...... 60 Skipping Results Set Filtering...... 61
4 Field Definitions ...... 63 Data Types ...... 63 Data Tables ...... 64 Field Types...... 64 Constant Fields ...... 65 Variable Fields ...... 65 Field Definition Files ...... 66 Internal Fields in style.ddd ...... 66 style.ddd contents ...... 66 Standard Fields in style.sfl ...... 69 Field Aliases in style.sfl ...... 69 style.sfl Contents ...... 70
Verity® Collection Reference Guide 5 Contents
Custom User Fields in style.ufl...... 72 style.ufl Contents ...... 73 style.ufl File Syntax...... 73 Mandatory Statements ...... 74 $control...... 74 descriptor ...... 74 data-table...... 74 Constant Field Types...... 76 constant ...... 76 autoval...... 76 worm...... 78 Variable Field Types...... 78 fixwidth ...... 78 fixwidth Length and Ranges for Integer Data Types ...... 80 varwidth...... 80 dispatch ...... 81
5 Index Tuning...... 83 Tuning Collection Index Contents...... 83 Style Files Affecting Collection Index Contents ...... 84 Setting Up the Collection style Directory...... 84 Using the style.lex File...... 85 style.lex File Syntax Reference ...... 85 General Information ...... 86 define Statements...... 87 token Statements...... 87 Statement Interpretation...... 87 Default Handling of the Dot Character ...... 88 Character Mapping...... 89 Using the style.stp File ...... 89 style.stp Syntax Reference ...... 90 style.stp Features...... 90 Case Sensitivity ...... 90 Regular Expressions ...... 90 Using the style.go File ...... 91 style.go Syntax Reference ...... 91 Using the style.ufl File...... 91
6 Verity® Collection Reference Guide Contents
Default style.ufl File ...... 92 Indexed Field Type...... 92 Minmax Field Type ...... 93 Using the style.prm File...... 94 Feature Vectors ...... 94 Stored Summaries...... 95 Summary Types...... 95 Summary Type Parameters...... 95 Zone Argument ...... 96 Case-Insensitive Word Indexes ...... 96 Instance Vector Encodings (PSW and WCT)...... 96 PSW Encoding ...... 97 WCT Encoding...... 97 Instance Vector Encoding and Searching ...... 98 SENTENCE and PARAGRAPH Operators...... 98 Soundex Data ...... 98 Highlight Location Data ...... 99 Qualify Instance Data...... 99 Enable Indexing on Nouns and Noun Phrases ...... 99 style.prm File Syntax...... 100 Default style.prm File...... 101 Using the style.fxs File ...... 104 Using the style.tkm file ...... 104 Creating Custom Zones...... 104 Tokenizing Custom Fields...... 106 style.tkm File Syntax ...... 106 Alias Definitions ...... 106 Mapping Rules...... 107 Tokenization Definitions...... 108 End of File...... 109 Default style.tkm File ...... 109
6 Document Filters and Formatting...... 111 The Virtual Document...... 112 Document Layout Definition...... 112 Document Filter Specification...... 112
Verity® Collection Reference Guide 7 Contents
Default style.dft File ...... 112 Universal Filter Functionality ...... 113 Using the style.dft File...... 113 style.dft File Syntax...... 114 style.dft File Statements ...... 114 style.dft File Keywords ...... 116 Shorthand Notation for zone-begin and zone-end...... 117 style.dft Keyword Modifiers ...... 117 Date Formats in the style.dft File...... 120 Late Binding for Field Elements ...... 120 The Universal Filter ...... 120 Invoking the Universal Filter ...... 121 How the Universal Filter Works...... 121 Components...... 121 How Filtering Occurs...... 121 Character Set Recognition and Mapping...... 123 Checking File Types...... 124 Using the style.uni File...... 124 Syntax of style.uni File Statements...... 125 Syntax of style.uni File Keywords...... 126 Syntax of style.uni Keyword Modifiers...... 129 Using the Older Word-Finder Algorithm for PDF files ...... 131 Configuring the Language-Identification Filter ...... 132 Conditionally Loading Filters ...... 134 Generating Text-Formatting Zones...... 135 Disabling Filtering ...... 135 Consequences of Changing style.uni ...... 136 Universal Filter Document Types...... 136 Recognized Document Types...... 136 Recognized Categories of Document Types ...... 137 The KeyView Filters...... 138 The PDF Filter...... 139 Custom Lexing Rules Not Supported...... 139 Specifying the PDF Filter ...... 139 Using the -fieldoverride Option ...... 140
8 Verity® Collection Reference Guide Contents
Using the -charmapto Option...... 140 PDF Fields...... 141 Standard PDF Fields ...... 141 Optional PDF Fields...... 142 Defining Optional PDF Fields ...... 144 The XML Filter ...... 145 Requirements for Indexing XML Documents ...... 145 Requirements for Data Files...... 145 Implementation Summary ...... 146 XML Filter...... 146 Style Files...... 146 Style File Configuration...... 147 style.uni File ...... 147 style.xml File ...... 147 style.ufl File ...... 149 style.dft File ...... 149 Indexing XML Documents ...... 150 Troubleshooting...... 150 Checking File Types ...... 150 Disable Document Filters by MIME Type ...... 151
7 Searching Documents by Fields ...... 153 Methods for Populating Fields ...... 153 Using the Bulk Modify Feature ...... 153 Extracting Field Values...... 154 style.tde Syntax ...... 154
PART III SEARCH FEATURES
8 The Zone Filter ...... 167 Zone Filter Overview ...... 168 Introduction to Zones ...... 168 Document Types...... 168 Zones vs. Fields...... 168 Advantages of Using a Field...... 169
Verity® Collection Reference Guide 9 Contents
Advantages of Using a Zone...... 169 Processing Order...... 169 Zones and Zone Occurrences...... 169 Using the Zone Filter ...... 170 Specifying the Zone Filter...... 170 Built-in Zone Mode Options ...... 171 Character Mapping Options ...... 171 Extracting META Tags as Fields...... 172 Extracting Zones as Fields ...... 174 Zones for Markup Language Documents...... 174 How the Zone Filter Parses Markup Language Documents ...... 175 Implicit Zone Endings...... 176 Zones for HTML Documents ...... 177 Zone Filter Specification for HTML ...... 177 Supported HTML Tags ...... 177 Supported HTML Entities ...... 180 Additional HTML Parsing Rules...... 180 Zones for SGML and XML Documents ...... 180 Zone Filter Specification for SGML and XML ...... 181 Using the style.zon file...... 181 Zones for Internet Message Format Documents ...... 181 How the Zone Filter Parses Internet Message Format Documents...... 182 Zone Filter Specification for E-mail...... 183 Using the style.zon file...... 184 Zone Filter Specification for Usenet News...... 185 Using the style.zon file...... 186 Custom Zone Definitions...... 186 Built-in vs. Custom Zone Definitions ...... 186 style.zon File ...... 186 style.zon Default Behavior...... 187 style.zon File Syntax ...... 187 The zonespec Keyword...... 187 The element Keyword...... 188 The attribute Keyword...... 189 The entity Keyword...... 191 Entity Substitution...... 192 Dumping style.zon Information ...... 192
10 Verity® Collection Reference Guide Contents
Debugging the style.zon File ...... 193 Dumping Information for Built-In Modes...... 193 Modifying Built-in Behavior ...... 193 Attribute Extraction...... 194 Defining Zones as Collection Fields...... 195 Extracting HTML Zones as Fields...... 196 Extracting META Tags as Fields ...... 197 Defining Zones for Virtual Documents ...... 198 Hidden Elements in Zones ...... 199 Entries in the style.dft File...... 200 Searching over Hidden Zones ...... 201 Special Noindex and Noextract Zones ...... 201 Noindex Zones...... 201 Noextract Zones...... 203 Hidden Elements in NoExtract Zones...... 203 Searching in Zones...... 203 Using the Query Language IN Operator ...... 204 Using a Custom Query Parser ...... 205 Searching Multiple Zone Occurrences ...... 205
APPENDIXES
A Style Files ...... 209 Summary of Style Files...... 209 Standard Style Files...... 210 Editable Standard Style Files ...... 210 Non-Editable Style Files ...... 211 Gateway-Related Style Files...... 212 Sample Style Sets ...... 212 Using Custom Style Files...... 213
B Supported Document Formats...... 215 Supported Formats for Indexing ...... 215 Word Processing Formats ...... 215
Verity® Collection Reference Guide 11 Contents
Picture Formats ...... 217 Presentation Formats...... 218 Spreadsheet Formats ...... 218 Other Formats...... 219 Multi-byte and Bi-Directional Support in Indexing...... 219 Supported Formats for Viewing ...... 222 Word Processing Formats...... 222 Picture Formats ...... 223 Presentation Formats...... 225 Adobe Portable Document Format (PDF) Support ...... 225 Spreadsheet Formats ...... 226 Other Formats...... 226 Multi-Byte and Bi-Directional Support for Document Viewing...... 226
C Supported Date Formats ...... 231 Date Import Formats ...... 231 Date Import Format Strings...... 232 Table Conventions...... 232 Zulu Date Format ...... 234 Time Formats...... 234 Numeric Date Formats...... 235
D Regular Expressions ...... 237 Operators for Regular Expressions...... 237 Symbols ...... 238 Substrings...... 239
E Collection Troubleshooting and Maintenance Tools ...... 243 Using didump...... 244 didump Syntax ...... 244 Viewing the Word List ...... 245 Viewing the Zone List ...... 246 Viewing the Zone Attribute List...... 247 Using browse ...... 248 Displaying Fields in a Documents Table File ...... 248
12 Verity® Collection Reference Guide Contents
Using rcvdk ...... 251 Starting rcvdk...... 251 Attaching to a Collection...... 252 Attaching to Subsequent Collections...... 252 Basic Searching...... 253 Viewing Results ...... 253 Authenticating with rcvdk...... 255 Displaying More Fields...... 255 Using merge...... 257 Merging Collections ...... 257 Splitting Collections ...... 258 Using mkvdk for Incremental Squeeze ...... 258 Options for 8-bit Characters...... 259
F Collection Limits...... 261
Index...... 263
Verity® Collection Reference Guide 13 Contents
14 Verity® Collection Reference Guide Figures, Tables, and Listings
Table 3-1 Predefined Indexing Mode Names ...... 54 Table B-1 Word-processing formats supported for indexing...... 215 Table B-2 Picture formats supported for indexing...... 217 Table B-3 Presentation formats supported for indexing...... 218 Table C-1 Import Date Formats ...... 233
15 Figures, Tables, and Listings
16 Verity® Collection Reference Guide Preface
The Verity K2 Collection Reference Guide describes the architecture and design of Verity collections. Collections represent groups of documents that can be searched by one or more Verity applications. This book is for Verity administrators. It is intended for readers who have a Verity installation and who are familiar with the basic concepts of search applications This preface contains the following sections: Using This Book Related Documentation Release Notes and Document Updates Verity Technical Support
Using This Book
This section briefly describes the organization of this book and the stylistic conventions it uses.
17 Preface Using This Book
Version
Except for the following corrections and additions made for Verity K2 Enterprise versions 5.0 and 5.0.1, the information in this manual is current as of K2 Enterprise version 4.5. The majority of the content of the manual was last modified April 30, 2002. Corrections or updates to this information may be available through the Verity Customer Support site; see “Release Notes and Document Updates” on page 21. Information added for Version 5.0: Zone-mapping configuration (style.tkm), in Chapter 5. Content-access filter configuration, in Chapter 6. flt_lang configuration, in Chapter 6. Minor corrections and updates at various locations in the book. Information added for Version 5.0.1: Correction of multiple references to “K2 Spider” to “Verity”. Documentation of -wordfinder switch for PDF filter in Chapter 6. Description of attribute operators for style.tkm in Chapter 5. Revision of the Appendix “Universal Filter Document Types” to Appendix B.
Organization of This Book
This book includes the following chapters and appendixes: Part I: Chapter 1, “Overview,” describes overview information about Verity collections and how to build them for your product application. Chapter 2, “Using Gateways,” describes gateway features and the File System gateway and HTTP gateway in detail. Part II: Chapter 3, “Setting Policies,” describes how you can customize the Verity engine’s behavior by assigning an indexing mode or by defining what is included in the document hit count. Chapter 4, “Field Definitions,” describes field definitions and the schema for a collection’s documents table.
18 Verity® Collection Reference Guide Preface Using This Book