Heritrix Documentation

Internet Archive and contributors
Sep 23, 2021

Contents:

1 Getting Started with Heritrix3
  1.1 System Requirements
  1.2 Installation
  1.3 Environment Variables
  1.4 Running Heritrix
  1.5 Accessing the User Interface
  1.6 Your First Crawl
  1.7 Exiting Heritrix

2 Operating Heritrix
  2.1 Running Heritrix
    2.1.1 Command-line Options
    2.1.2 Environment Variables
  2.2 Security Considerations
    2.2.1 Understanding the Risks
    2.2.2 Network Access Control
    2.2.3 Login Authentication Access Control
  2.3 Log Files
    2.3.1 alerts.log
    2.3.2 crawl.log
    2.3.3 progress-statistics.log
    2.3.4 runtime-errors.log
    2.3.5 uri-errors.log
  2.4 Reports
    2.4.1 Crawl Summary (crawl-report.txt)
    2.4.2 Seeds (seeds-report.txt)
    2.4.3 Hosts (hosts-report.txt)
    2.4.4 SourceTags (source-report.txt)
    2.4.5 Mimetypes (mimetype-report.txt)
    2.4.6 ResponseCode (responsecode-report.txt)
    2.4.7 Processors (processors-report.txt)
    2.4.8 FrontierSummary (frontier-summary-report.txt)
    2.4.9 ToeThreads (threads-report.txt)
  2.5 Action Directory
  2.6 Crawl Recovery
    2.6.1 Full recovery
    2.6.2 Split Recovery

3 Configuring Crawl Jobs
  3.1 Basic Job Settings
    3.1.1 Crawl Limits
    3.1.2 maxToeThreads
    3.1.3 metadata.operatorContactUrl
    3.1.4 Robots.txt Honoring Policy
  3.2 Crawl Scope
    3.2.1 Decide Rules
    3.2.2 DecideRuleSequence Logging
  3.3 Frontier
    3.3.1 Politeness
    3.3.2 Retry Policy
    3.3.3 Bandwidth Limits
    3.3.4 Extractor Parameters
  3.4 Sheets (Site-specific Settings)
  3.5 Other Protocols
    3.5.1 FTP
    3.5.2 SFTP
    3.5.3 WHOIS
  3.6 Modifying a Running Job
    3.6.1 Browse Beans
    3.6.2 Scripting Console

4 Bean Reference
  4.1 Core Beans
    4.1.1 ActionDirectory
    4.1.2 BdbCookieStore
    4.1.3 BdbFrontier
    4.1.4 BdbModule
    4.1.5 BdbServerCache
    4.1.6 BdbUriUniqFilter
    4.1.7 CheckpointService
    4.1.8 CrawlController
    4.1.9 CrawlerLoggerModule
    4.1.10 CrawlLimitEnforcer
    4.1.11 CrawlMetadata
    4.1.12 CredentialStore
    4.1.13 DiskSpaceMonitor
    4.1.14 RulesCanonicalizationPolicy
    4.1.15 SheetOverlaysManager
    4.1.16 StatisticsTracker
    4.1.17 TextSeedModule
  4.2 Decide Rules
    4.2.1 AcceptDecideRule
    4.2.2 ClassKeyMatchesRegexDecideRule
    4.2.3 ContentLengthDecideRule
    4.2.4 ContentTypeMatchesRegexDecideRule
    4.2.5 ContentTypeNotMatchesRegexDecideRule
    4.2.6 ExpressionDecideRule (contrib)
    4.2.7 ExternalGeoLocationDecideRule
    4.2.8 FetchStatusDecideRule
    4.2.9 FetchStatusMatchesRegexDecideRule
    4.2.10 FetchStatusNotMatchesRegexDecideRule
    4.2.11 HasViaDecideRule
    4.2.12 HopCrossesAssignmentLevelDomainDecideRule
    4.2.13 HopsPathMatchesRegexDecideRule
    4.2.14 IdenticalDigestDecideRule
    4.2.15 IpAddressSetDecideRule
    4.2.16 MatchesFilePatternDecideRule
    4.2.17 MatchesListRegexDecideRule
    4.2.18 MatchesRegexDecideRule
    4.2.19 MatchesStatusCodeDecideRule
    4.2.20 NotMatchesFilePatternDecideRule
    4.2.21 NotMatchesListRegexDecideRule
    4.2.22 NotMatchesRegexDecideRule
    4.2.23 NotMatchesStatusCodeDecideRule
    4.2.24 NotOnDomainsDecideRule
    4.2.25 NotOnHostsDecideRule
    4.2.26 NotSurtPrefixedDecideRule
    4.2.27 OnDomainsDecideRule
    4.2.28 OnHostsDecideRule
    4.2.29 PathologicalPathDecideRule
    4.2.30 PredicatedDecideRule
    4.2.31 PrerequisiteAcceptDecideRule
    4.2.32 RejectDecideRule
    4.2.33 ResourceLongerThanDecideRule
    4.2.34 ResourceNoLongerThanDecideRule
    4.2.35 ResponseContentLengthDecideRule
    4.2.36 SchemeNotInSetDecideRule
    4.2.37 ScriptedDecideRule
    4.2.38 SeedAcceptDecideRule
    4.2.39 SourceSeedDecideRule
    4.2.40 SurtPrefixedDecideRule
    4.2.41 TooManyHopsDecideRule
    4.2.42 TooManyPathSegmentsDecideRule
    4.2.43 TransclusionDecideRule
    4.2.44 ViaSurtPrefixedDecideRule
  4.3 Candidate Processors
    4.3.1 CandidateScoper
    4.3.2 FrontierPreparer
  4.4 Pre-Fetch Processors
    4.4.1 PreconditionEnforcer
    4.4.2 Preselector
  4.5 Fetch Processors
    4.5.1 FetchDNS
    4.5.2 FetchFTP
    4.5.3 FetchHTTP
    4.5.4 FetchSFTP
    4.5.5 FetchWhois
  4.6 Link Extractors
    4.6.1 ExtractorChrome (contrib)
    4.6.2 ExtractorCSS
    4.6.3 ExtractorDOC
    4.6.4 ExtractorHTML
    4.6.5 AggressiveExtractorHTML
    4.6.6 JerichoExtractorHTML
    4.6.7 ExtractorHTMLForms
    4.6.8 ExtractorHTTP
