Statistics for donauschwaben-usa.org (2010-07)

Statistics for: donauschwaben-usa.org
Last Update: 10 Aug 2010 - 10:16
Reported period: Month Jul 2010

Summary

Reported period: Month Jul 2010
First visit: 01 Jul 2010 - 00:03
Last visit: 31 Jul 2010 - 23:52

                      Unique visitors  Number of visits       Pages              Hits                Bandwidth
Viewed traffic *      3882             4769                   13853              91083               7.26 GB
                                       (1.22 visits/visitor)  (2.9 Pages/Visit)  (19.09 Hits/Visit)  (1595.96 KB/Visit)
Not viewed traffic *                                          29495              35395               3.18 GB

* Not viewed traffic includes traffic generated by robots, worms, or replies with special HTTP status codes.
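The per-visit ratios in the Summary can be recomputed from the viewed-traffic totals. The sketch below is only a consistency check on the figures above. The `truncate2` helper is hypothetical, and the use of truncation rather than rounding is an inference from these numbers (rounding would give 1.23 and 19.10), not something the report states.

```python
import math

# Viewed-traffic totals copied from the Summary table above.
unique_visitors = 3882
visits = 4769
pages = 13853
hits = 91083


def truncate2(value: float) -> float:
    """Truncate (not round) to two decimal places.

    Assumption: the report's ratios look truncated; this helper is
    illustrative and is not taken from the reporting tool itself.
    """
    return math.floor(value * 100) / 100


visits_per_visitor = truncate2(visits / unique_visitors)
pages_per_visit = truncate2(pages / visits)
hits_per_visit = truncate2(hits / visits)

print(visits_per_visitor, pages_per_visit, hits_per_visit)
```

The KB/Visit figure cannot be reproduced exactly this way, because the bandwidth total is only shown rounded to 7.26 GB.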
Monthly history

Month     Unique visitors  Number of visits  Pages  Hits   Bandwidth
Jan 2010  0                0                 0      0      0
Feb 2010  0                0                 0      0      0
Mar 2010  0                0                 0      0      0
Apr 2010  0                0                 0      0      0
May 2010  0                0                 0      0      0
Jun 2010  0                0                 0      0      0
Jul 2010  3882             4769              13853  91083  7.26 GB
Aug 2010  0                0                 0      0      0
Sep 2010  0                0                 0      0      0
Oct 2010  0                0                 0      0      0
Nov 2010  0                0                 0      0      0
Dec 2010  0                0                 0      0      0
Total     3882             4769              13853  91083  7.26 GB

Days of month

Day          Number of visits  Pages   Hits     Bandwidth
01 Jul 2010  130               330     2479     181.14 MB
02 Jul 2010  132               445     2900     265.71 MB
03 Jul 2010  114               299     2183     195.33 MB
04 Jul 2010  137               520     3274     266.35 MB
05 Jul 2010  154               335     2623     173.91 MB
06 Jul 2010  161               597     3601     320.64 MB
07 Jul 2010  207               726     3971     262.08 MB
08 Jul 2010  209               529     3668     210.81 MB
09 Jul 2010  188               560     2747     239.56 MB
10 Jul 2010  122               387     2396     208.72 MB
11 Jul 2010  127               406     2357     171.73 MB
12 Jul 2010  137               396     2755     224.14 MB
13 Jul 2010  150               491     3134     269.78 MB
14 Jul 2010  117               329     2074     110.11 MB
15 Jul 2010  143               362     2099     177.17 MB
16 Jul 2010  132               382     2365     349.31 MB
17 Jul 2010  125               394     2712     284.31 MB
18 Jul 2010  124               435     2940     229.60 MB
19 Jul 2010  180               431     2909     248.87 MB
20 Jul 2010  156               440     2776     200.52 MB
21 Jul 2010  143               522     3328     357.79 MB
22 Jul 2010  176               542     3486     340.79 MB
23 Jul 2010  154               413     2763     222.35 MB
24 Jul 2010  173               404     2254     110.97 MB
25 Jul 2010  147               350     2608     171.89 MB
26 Jul 2010  195               450     3816     214.39 MB
27 Jul 2010  173               479     3066     366.72 MB
28 Jul 2010  197               636     4472     255.22 MB
29 Jul 2010  162               411     3373     293.45 MB
30 Jul 2010  157               468     3045     209.49 MB
31 Jul 2010  147               384     2909     299.96 MB
Average      153.84            446.87  2938.16  239.77 MB
Total        4769              13853   91083    7.26 GB

Days of week

Day  Pages   Hits     Bandwidth
Mon  403     3025.75  215.33 MB
Tue  501.75  3144.25  289.41 MB
Wed  553.25  3461.25  246.30 MB
Thu  434.80  3021     240.67 MB
Fri  453.60  2764     257.28 MB
Sat  373.60  2490.80  219.86 MB
Sun  427.75  2794.75  209.89 MB

Hours

Hours  Pages  Hits  Bandwidth
00     438    3262  225.85 MB
01     600    4392  337.92 MB
02     499    3415  257.12 MB
03     529    3652  259.86 MB
04     386    2081  103.74 MB
05     266    1784  123.99 MB
06     277    1845  174.89 MB
07     265    1624  115.89 MB
08     311    2049  160.32 MB
09     386    2608  219.31 MB
10     337    2295  162.55 MB
11     430    2805  225.81 MB
12     488    3274  207.31 MB
13     723    5069  397.42 MB
14     785    5751  529.37 MB
15     1079   6342  715.70 MB
16     868    5770  681.49 MB
17     945    5359  535.03 MB
18     908    5549  356.84 MB
19     890    5824  457.03 MB
20     750    5065  429.51 MB
21     680    3967  274.82 MB
22     535    4019  241.62 MB
23     478    3282  239.39 MB

Countries (Top 10)

Countries      Code     Pages  Hits   Bandwidth
United States  us       7294   42530  3.23 GB
Unknown        unknown  1828   13359  957.76 MB
Germany        de       1738   12671  1.50 GB
Canada         ca       880    5783   424.01 MB
Great Britain  gb       253    1718   118.44 MB
Australia      au       177    1221   65.09 MB
Austria        at       175    1206   115.17 MB
South Africa   za       105    500    20.29 MB
Hungary        hu       94     628    71.45 MB
Switzerland    ch       85     814    82.30 MB
Others                  1224   10653  734.14 MB

Hosts (Top 10)

Hosts: 0 Known, 6070 Unknown (unresolved ip)
3882 Unique visitors

Host            GeoIP Country  GeoIP Org                Pages  Hits   Bandwidth  Last visit
66.72.209.115   United States  AS7132                   520    1627   205.86 MB  10 Jul 2010 - 14:36
70.226.117.207  United States  AS7132                   312    689    91.06 MB   28 Jul 2010 - 17:43
68.73.214.160   United States  AS7132                   235    559    75.22 MB   16 Jul 2010 - 14:22
80.237.156.112  Germany        AS20773 AS of Hosteu...  224    224    15.91 MB   31 Jul 2010 - 21:32
96.27.77.247    United States  Unknown                  118    455    18.98 MB   31 Jul 2010 - 12:03
199.212.250.97  Canada         AS26677                  78     196    7.28 MB    20 Jul 2010 - 17:54
75.136.217.98   United States  AS19115                  76     379    24.05 MB   05 Jul 2010 - 02:12
67.55.72.164    United States  AS27257                  70     71     1.69 MB    28 Jul 2010 - 21:22
76.10.152.55    Canada         AS5645 TekSavvy Solu...  69     287    34.84 MB   10 Jul 2010 - 02:18
99.250.76.231   Canada         AS812                    67     319    50.87 MB   31 Jul 2010 - 15:21
Others                                                  12084  86277  6.75 GB

Robots/Spiders visitors (Top 10)

43 different robots*                            Hits      Bandwidth  Last visit
Googlebot                                       5791+78   390.66 MB  31 Jul 2010 - 23:58
Yahoo Slurp                                     5309+139  1.45 GB    31 Jul 2010 - 23:57
Unknown robot (identified by 'robot')           3687+229  70.82 MB   30 Jul 2010 - 20:11
MSNBot-media                                    1706+987  211.59 MB  31 Jul 2010 - 23:32
Unknown robot (identified by 'crawl')           2049+299  96.40 MB   31 Jul 2010 - 21:10
MSNBot                                          1645+593  329.19 MB  31 Jul 2010 - 23:57
Yandex bot                                      1655+85   170.62 MB  31 Jul 2010 - 23:44
Java (Often spam bot)                           805       27.63 MB   31 Jul 2010 - 11:34
Unknown robot (identified by 'bot/' or 'bot-')  523+142   26.08 MB   30 Jul 2010 - 14:46
Nutch                                           234+21    4.63 MB    31 Jul 2010 - 05:36
Others                                          1142+433  72.42 MB

* Robots shown here gave hits or traffic "not viewed" by visitors, so they are not included in other charts. Numbers after + are successful hits on "robots.txt" files.
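The "hits+robots.txt hits" notation in the robots table can be split mechanically. The following is a minimal sketch, assuming each Hits cell is either a plain integer or two integers joined by "+"; the function name `parse_robot_hits` and the sample `rows` dictionary are illustrative and not part of the report.

```python
def parse_robot_hits(cell: str) -> tuple[int, int]:
    """Split a Hits cell such as '5791+78' into (hits, robots_txt_hits).

    A cell without '+' (for example '805') is treated as having
    zero robots.txt hits, which is an assumption of this sketch.
    """
    hits, _, robots_txt = cell.partition("+")
    return int(hits), (int(robots_txt) if robots_txt else 0)


# Three cells copied from the robots table above.
rows = {
    "Googlebot": "5791+78",
    "Yahoo Slurp": "5309+139",
    "Java (Often spam bot)": "805",
}

parsed = {name: parse_robot_hits(cell) for name, cell in rows.items()}
for name, (hits, robots_txt_hits) in parsed.items():
    print(f"{name}: {hits} hits, {robots_txt_hits} robots.txt hits")
```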
Visits duration

Number of visits: 4769 - Average: 141 s

Visits duration  Number of visits  Percent
0s-30s           3789              79.4 %
30s-2mn          375               7.8 %
2mn-5mn          201               4.2 %
5mn-15mn         213               4.4 %
15mn-30mn        87                1.8 %
30mn-1h          78                1.6 %
1h+              25                0.5 %
Unknown          1                 0 %

File type

File type  Description              Hits   Percent  Bandwidth  Percent
jpg        Image                    47623  52.2 %   5.98 GB    82.3 %
gif        Image                    29158  32 %     303.28 MB  4 %
htm        HTML or XML static page  6943   7.6 %    384.54 MB  5.1 %
html       HTML or XML static page  5196   5.7 %    5.71 MB    0 %
pdf        Adobe Acrobat file       719    0.7 %    155.89 MB  2 %
dll        Binary library           649    0.7 %    246.90 MB  3.3 %
png        Image                    443    0.4 %    18.05 MB   0.2 %
mid                                 254    0.2 %    12.07 MB   0.1 %
mp3        Audio file               52     0 %      132.58 MB  1.7 %
wma        Audio file               20     0 %      15.23 MB   0.2 %
Unknown                             14     0 %      5.87 KB    0 %
wmv        Video file               6      0 %      34.34 MB   0.4 %
js         JavaScript file          6      0 %      85.35 KB   0 %

Pages-URL (Top 10)

650 different pages-url                                               Viewed  Average size  Entry  Exit
/_vti_bin/fpcount.exe/                                                5182    1.12 KB       249    2571
/_vti_bin/_vti_aut/author.dll                                         649     389.56 KB     4      9
/index.htm                                                            598     15.27 KB      388    214
/history.htm                                                          403     31.45 KB      225    45
/pdf%20forms/2010%20melissa%20venema/melissa%20venema%20biograph..            210.74 KB