Heritrix Documentation
Total Page:16
File Type:pdf, Size:1020Kb
Heritrix Documentation Internet Archive and contributors Sep 23, 2021 Contents: 1 Getting Started with Heritrix3 1.1 System Requirements..........................................3 1.2 Installation................................................3 1.3 Environment Variables..........................................3 1.4 Runnning Heritrix............................................4 1.5 Accessing the User Interface.......................................4 1.6 Your First Crawl.............................................4 1.7 Exiting Heritrix..............................................5 2 Operating Heritrix 7 2.1 Running Heritrix.............................................7 2.1.1 Command-line Options.....................................7 2.1.2 Environment Variables.....................................8 2.2 Security Considerations.........................................8 2.2.1 Understanding the Risks....................................8 2.2.2 Network Access Control....................................8 2.2.3 Login Authentication Access Control.............................9 2.3 Log Files.................................................9 2.3.1 alerts.log............................................9 2.3.2 crawl.log............................................9 2.3.3 progress-statistics.log...................................... 10 2.3.4 runtime-errors.log........................................ 11 2.3.5 uri-errors.log.......................................... 11 2.4 Reports.................................................. 11 2.4.1 Crawl Summary (crawl-report.txt)............................... 11 2.4.2 Seeds (seeds-report.txt)..................................... 12 2.4.3 Hosts (hosts-report.txt)..................................... 12 2.4.4 SourceTags (source-report.txt)................................. 13 2.4.5 Mimetypes (mimetype-report.txt)............................... 13 2.4.6 ResponseCode (responsecode-report.txt)............................ 14 2.4.7 Processors (processors-report.txt)............................... 14 2.4.8 FrontierSummary (frontier-summary-report.txt)........................ 15 2.4.9 ToeThreads (threads-report.txt)................................. 15 2.5 Action Directory............................................. 15 2.6 Crawl Recovery............................................. 16 2.6.1 Full recovery.......................................... 16 i 2.6.2 Split Recovery......................................... 17 3 Configuring Crawl Jobs 19 3.1 Basic Job Settings............................................ 19 3.1.1 Crawl Limits.......................................... 19 3.1.2 maxToeThreads......................................... 20 3.1.3 metadata.operatorContactUrl.................................. 20 3.1.4 Robots.txt Honoring Policy................................... 20 3.2 Crawl Scope............................................... 20 3.2.1 Decide Rules.......................................... 21 3.2.2 DecideRuleSequence Logging................................. 23 3.3 Frontier.................................................. 23 3.3.1 Politeness............................................ 23 3.3.2 Retry Policy........................................... 23 3.3.3 Bandwidth Limits........................................ 24 3.3.4 Extractor Parameters...................................... 24 3.4 Sheets (Site-specific Settings)...................................... 24 3.5 Other Protocols.............................................. 25 3.5.1 FTP............................................... 26 3.5.2 SFTP.............................................. 26 3.5.3 WHOIS............................................. 27 3.6 Modifying a Running Job........................................ 27 3.6.1 Browse Beans.......................................... 28 3.6.2 Scripting Console........................................ 29 4 Bean Reference 31 4.1 Core Beans................................................ 31 4.1.1 ActionDirectory......................................... 31 4.1.2 BdbCookieStore........................................ 32 4.1.3 BdbFrontier........................................... 32 4.1.4 BdbModule........................................... 32 4.1.5 BdbServerCache........................................ 33 4.1.6 BdbUriUniqFilter........................................ 33 4.1.7 CheckpointService....................................... 34 4.1.8 CrawlController......................................... 34 4.1.9 CrawlerLoggerModule..................................... 35 4.1.10 CrawlLimitEnforcer...................................... 36 4.1.11 CrawlMetadata......................................... 37 4.1.12 CredentialStore......................................... 37 4.1.13 DiskSpaceMonitor....................................... 38 4.1.14 RulesCanonicalizationPolicy.................................. 38 4.1.15 SheetOverlaysManager..................................... 38 4.1.16 StatisticsTracker........................................ 39 4.1.17 TextSeedModule........................................ 40 4.2 Decide Rules............................................... 40 4.2.1 AcceptDecideRule....................................... 40 4.2.2 ClassKeyMatchesRegexDecideRule.............................. 40 4.2.3 ContentLengthDecideRule................................... 41 4.2.4 ContentTypeMatchesRegexDecideRule............................ 41 4.2.5 ContentTypeNotMatchesRegexDecideRule.......................... 41 4.2.6 ExpressionDecideRule (contrib)................................ 41 4.2.7 ExternalGeoLocationDecideRule................................ 41 4.2.8 FetchStatusDecideRule..................................... 42 4.2.9 FetchStatusMatchesRegexDecideRule............................. 42 ii 4.2.10 FetchStatusNotMatchesRegexDecideRule........................... 42 4.2.11 HasViaDecideRule....................................... 42 4.2.12 HopCrossesAssignmentLevelDomainDecideRule....................... 42 4.2.13 HopsPathMatchesRegexDecideRule.............................. 43 4.2.14 IdenticalDigestDecideRule................................... 43 4.2.15 IpAddressSetDecideRule.................................... 43 4.2.16 MatchesFilePatternDecideRule................................. 44 4.2.17 MatchesListRegexDecideRule................................. 44 4.2.18 MatchesRegexDecideRule................................... 44 4.2.19 MatchesStatusCodeDecideRule................................ 44 4.2.20 NotMatchesFilePatternDecideRule............................... 45 4.2.21 NotMatchesListRegexDecideRule............................... 45 4.2.22 NotMatchesRegexDecideRule................................. 45 4.2.23 NotMatchesStatusCodeDecideRule.............................. 45 4.2.24 NotOnDomainsDecideRule................................... 45 4.2.25 NotOnHostsDecideRule.................................... 46 4.2.26 NotSurtPrefixedDecideRule.................................. 46 4.2.27 OnDomainsDecideRule..................................... 46 4.2.28 OnHostsDecideRule...................................... 46 4.2.29 PathologicalPathDecideRule.................................. 46 4.2.30 PredicatedDecideRule..................................... 47 4.2.31 PrerequisiteAcceptDecideRule................................. 47 4.2.32 RejectDecideRule........................................ 47 4.2.33 ResourceLongerThanDecideRule................................ 47 4.2.34 ResourceNoLongerThanDecideRule.............................. 48 4.2.35 ResponseContentLengthDecideRule.............................. 48 4.2.36 SchemeNotInSetDecideRule.................................. 48 4.2.37 ScriptedDecideRule....................................... 48 4.2.38 SeedAcceptDecideRule..................................... 49 4.2.39 SourceSeedDecideRule..................................... 49 4.2.40 SurtPrefixedDecideRule.................................... 49 4.2.41 TooManyHopsDecideRule................................... 50 4.2.42 TooManyPathSegmentsDecideRule.............................. 50 4.2.43 TransclusionDecideRule.................................... 50 4.2.44 ViaSurtPrefixedDecideRule................................... 51 4.3 Candidate Processors........................................... 51 4.3.1 CandidateScoper........................................ 51 4.3.2 FrontierPreparer......................................... 51 4.4 Pre-Fetch Processors........................................... 52 4.4.1 PreconditionEnforcer...................................... 52 4.4.2 Preselector........................................... 52 4.5 Fetch Processors............................................. 53 4.5.1 FetchDNS............................................ 53 4.5.2 FetchFTP............................................ 53 4.5.3 FetchHTTP........................................... 54 4.5.4 FetchSFTP........................................... 56 4.5.5 FetchWhois........................................... 57 4.6 Link Extractors.............................................. 58 4.6.1 ExtractorChrome (contrib)................................... 58 4.6.2 ExtractorCSS.......................................... 59 4.6.3 ExtractorDOC.......................................... 59 4.6.4 ExtractorHTML......................................... 59 4.6.5 AggressiveExtractorHTML................................... 60 4.6.6 JerichoExtractorHTML..................................... 60 iii 4.6.7 ExtractorHTMLForms..................................... 61 4.6.8 ExtractorHTTP........................................