Web Crawling, Analysis and Archiving
Vangelis Banos
Aristotle University of Thessaloniki Faculty of Sciences School of Informatics
Doctoral dissertation under the supervision of Professor Yannis Manolopoulos
October 2015
Ανάκτηση, Ανάλυση και Αρχειοθέτηση του Παγκόσμιου Ιστού
Ευάγγελος Μπάνος
Αριστοτέλειο Πανεπιστήμιο Θεσσαλονίκης Σχολή Θετικών Επιστημών Τμήμα Πληροφορικής
Διδακτορική Διατριβή υπό την επίβλεψη του Καθηγητή Ιωάννη Μανωλόπουλου
Οκτώβριος 2015
Web Crawling, Analysis and Archiving PhD Dissertation ©Copyright by Vangelis Banos, 2015. All rights reserved.
The Doctoral Dissertation was submitted to the School of Informatics, Faculty of Sciences, Aristotle University of Thessaloniki. Defence Date: 30/10/2015.
Examination Committee
Yannis Manolopoulos, Professor, Department of Informatics, Aristotle University of Thessaloniki, Greece. Supervisor
Apostolos Papadopoulos, Assistant Professor, Department of Informatics, Aristotle University of Thessaloniki, Greece. Advisory Committee Member
Dimitrios Katsaros, Assistant Professor, Department of Electrical & Computer Engineering, University of Thessaly, Volos, Greece. Advisory Committee Member
Athena Vakali, Professor, Department of Informatics, Aristotle University of Thessaloniki, Greece.
Anastasios Gounaris, Assistant Professor, Department of Informatics, Aristotle University of Thessaloniki, Greece.
Georgios Evangelidis, Professor, Department of Applied Informatics, University of Macedonia, Greece.
Sarantos Kapidakis, Professor, Department of Archives, Library Science and Museology, Ionian University, Greece.
Abstract
The Web is increasingly important for all aspects of our society, culture and economy. Web archiving is the process of gathering digital materials from the Web, ingesting them, ensuring that these materials are preserved in an archive, and making the collected materials available for future use and research. Web archiving is a difficult problem for both organisational and technical reasons. We focus on the technical aspects of Web archiving.

In this dissertation, we focus on improving the data acquisition aspect of the Web archiving process. We establish the notion of Website Archivability (WA) and we introduce the Credible Live Evaluation of Archive Readiness Plus (CLEAR+) method to measure WA for any website. We propose new algorithms to optimise Web crawling using near-duplicate detection and webgraph cycle detection, also resolving the problem of web spider traps. Next, we suggest that different types of websites demand different Web archiving approaches. We focus on social media and more specifically on weblogs. We introduce weblog archiving as a special type of Web archiving and present our findings and developments in this area: a technical survey of the blogosphere, a scalable approach to harvest modern weblogs and an integrated approach to preserve weblogs using a digital repository system.

Keywords: Web Archiving, Web Crawling, Web Analytics, Webgraphs, Weblogs, Digital Repositories.
Extended Summary
Web archiving is the process of collecting and storing webpages in order to preserve them in a digital archive that is accessible to the public and to researchers. Web archiving is a matter of the highest priority: the web is a primary medium of modern communication, yet the average lifespan of a webpage is less than 100 days. Every day, millions of webpages disappear from the web because they stop operating for various reasons, and valuable information is lost as a result. The web archiving problem consists of several sub-processes, such as automated web crawling, content extraction, analysis, and storage in a suitable form so that the content can be retrieved and reused for any purpose. Automated web crawling for the retrieval and processing of information is a particularly widespread process with applications in many scientific and business fields. Another important issue is that different kinds of websites have different characteristics and properties, which require special handling for more efficient data retrieval, processing and archiving. We focus our research on social media and specifically on weblogs (blogs), a distinctive and widely used new medium of communication and information.

This dissertation aims to optimise web archiving through the development of new algorithms for automated web crawling, information retrieval from websites and their safe and efficient storage, so as to facilitate future access and reuse for any purpose. In addition, the dissertation focuses on the research and development of specialised methods for retrieving, processing, archiving and reusing weblog data. The contributions of the dissertation in these areas are summarised as follows:
• The Website Archivability metric, which expresses the ease and accuracy with which websites can be captured by web archiving systems. The Credible Live Evaluation for Archive Readiness Plus (CLEAR+) method, which computes Website Archivability, and the ArchiveReady system, which implements them as a web application at http://archiveready.com. In addition, a study of the archivability of different Web Content Management Systems.
• Algorithms that optimise web crawling through the detection of identical or near-identical webpages and the use of graph modelling, together with a method for detecting the traps encountered by web crawling systems (web spider traps). The WebGraph-it platform, which implements these algorithms as a web application at http://webgraph-it.com.
• An extensive study of the technical characteristics of weblogs, with emphasis on the characteristics that affect their archivability.
• The BlogForever integrated weblog preservation system, which addresses the problems of retrieving, managing, archiving and reusing weblog data.
• A highly efficient method for extracting data from weblogs using machine learning algorithms, and a weblog crawling system that implements it.
In the course of our research we developed dedicated software packages and web applications that are in production use on the web. The performance of all algorithms and the validity of the results were verified through experimental measurements. The results of the dissertation have been published in established international scientific journals, conferences and books; our publications are listed in Section 1.3. Below we present the main points of the dissertation as they are organised in each chapter.
Chapter 1: Introduction
In Chapter 1 we first present some general background on web crawling, data extraction and web archiving, the concepts that form the main context of our research. We then define the objectives of the dissertation and present our contributions chapter by chapter, along with the organisation of the document. We also list our publications in international scientific journals, conferences and books.
Chapter 2: Background and Literature Review
In Chapter 2 we review the research carried out in the fields of web archiving, web crawling and social media archiving. We discuss the importance of web archiving and the work being done to ensure a level of quality and reliability in Section 2.1.1. We examine developments in detecting duplicate content in web archives, as well as techniques for eliminating it, which yields benefits at every stage of a web archive's operation (Section 2.1.2). We study efforts to optimise web crawling systems in Section 2.1.3.
Finally, we pay particular attention to work on weblog archiving and to the systems that have been developed for this purpose, which are analysed in Section 2.2.
Chapter 3: An Innovative Method to Evaluate Website Archivability
In Chapter 3 we present a new method for modelling the principles and processes of web archiving. We introduce the Website Archivability (WA) metric, which expresses the extent to which a website could be archived completely and accurately. We define the Credible Live Evaluation for Archive Readiness Plus (CLEAR+) method, with which the metric can be computed in real time. We describe the architecture of the ArchiveReady system, an implementation of the method in the form of a web application. The method and its applications are significant and are already used by universities, national archives and companies in the field around the world; the users of ArchiveReady are listed in Appendix 7.2.

A key issue in web archiving is the lack of automated checking of the content being archived. Websites are often archived incompletely, and the archived copies have problems and cannot be used. The root of the problem is that, while web crawling is automated, the quality control of the archived websites is at best semi-automatic, or entirely performed by humans. To solve this problem we create the Credible Live Evaluation for Archive Readiness Plus (CLEAR+) method, which computes the Website Archivability (WA) metric and expresses how amenable a website is to being archived by web archives. This depends on specific technical characteristics of the website, the so-called Website Archivability Facets, which are, in brief:

• F_A: Accessibility: the ability to discover and access all the data of a website. The greater this ability, the better the website can be archived correctly.
• F_C: Cohesion: the degree to which the data of a website is dispersed across one or more web services. High dispersion increases the risk of data loss and incomplete archiving.
• F_M: Metadata: the richness and accuracy of the metadata available for a website, which are important for making better use of it.
• F_S: Standards Compliance: compliance with established encoding standards in all the files that make up a website (HTML, CSS, etc.), so that they can be interpreted, both now and in the future, by any standards-compliant software.
To evaluate a website with the CLEAR+ method, we retrieve all the files of one of its webpages and run technical checks on them in order to collect numeric results, which we then combine to compute the website's Website Archivability. The method is described in detail in Section 3.2, and an example evaluation of the Aristotle University homepage is presented in Section 3.2.5. The ArchiveReady system implements the method, as we present in Section 3.3, and is available as a web application at http://archiveready.com. The method is evaluated with three alternative experiments in Section 3.4.

A further study of particular interest is the evaluation of Website Archivability for twelve Web Content Management Systems (WCMS), presented in Section 3.5. Using a substantial sample, we analyse a large number of websites built on such systems and identify their strengths and weaknesses with respect to their archivability.
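To make the facet-combination step described above concrete, the sketch below shows how per-facet scores could be combined into a single WA value as a weighted average. It is only an illustration: the facet weights are placeholders, not the weights actually used by CLEAR+ (those are defined in Table 3.5), and the function names are invented for this example.

```python
# Minimal sketch of combining CLEAR+ facet scores into a single WA value.
# The weights below are placeholders for illustration only; the actual
# CLEAR+ weights are listed in Table 3.5 of the dissertation.

FACET_WEIGHTS = {
    "accessibility": 0.25,   # F_A - placeholder weight
    "cohesion": 0.25,        # F_C - placeholder weight
    "metadata": 0.25,        # F_M - placeholder weight
    "standards": 0.25,       # F_S - placeholder weight
}

def website_archivability(facet_scores: dict) -> float:
    """Combine per-facet scores (0-100) into a weighted WA score."""
    total_weight = sum(FACET_WEIGHTS.values())
    return sum(FACET_WEIGHTS[f] * facet_scores[f] for f in FACET_WEIGHTS) / total_weight

# Hypothetical evaluation results for a website
scores = {"accessibility": 80, "cohesion": 90, "metadata": 60, "standards": 75}
print(round(website_archivability(scores), 1))  # -> 76.2
```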
Chapter 4: Near-duplicate and Cycle Detection in Webgraphs towards Optimised Web Crawling
In Chapter 4 we present a new approach to modelling the principles and processes of web crawling using webgraphs. We create new models, algorithms and software to optimise web crawling performance, detect identical or near-identical content, and avoid the "traps" that cause problems for systems crawling the web (web spider traps). The algorithms we propose are based on the following observations:
• The URI is considered the unique identifier of a webpage, but we could also use alternatives such as the Sort-friendly URI Reordering Transform (SURT).
• The similarity of two webpage identifiers need not be exact; two identifiers may be near-identical, and we can examine what happens at various similarity levels.
• The similarity of the content of two webpages need not be exact; the pages may be near-identical, and we can examine what happens at various similarity levels.
• By modelling a website as a graph, we can use the above observations to merge adjacent nodes and reduce the complexity of the graph. In this way we can also detect cycles, which indicate traps for web crawling software.
Based on these observations, in Section 4.2.2 we present eight web crawling algorithms that combine the above ideas in alternative ways. We then present, in Section 4.3, the WebGraph-it platform, which implements the proposed algorithms as a web application at http://webgraph-it.com. To evaluate our methods and select the best one, we carry out extensive experiments on a large number of websites, crawling each of them with all eight methods and assessing the results, as described in Section 4.4. Furthermore, with an additional experiment we demonstrate that our algorithms can detect web spider traps, whereas traditional crawling methods must be configured manually to avoid the same traps.
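The following sketch illustrates the two ideas underlying these observations, not the dissertation's actual algorithms: treating two pages as near-duplicates when a text-similarity ratio exceeds a threshold, and detecting cycles in the resulting webgraph with a depth-first search. The similarity measure (Python's difflib) and the 0.9 threshold are arbitrary illustrative choices.

```python
# Illustrative sketch: near-duplicate merging and webgraph cycle detection.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two documents' text."""
    return SequenceMatcher(None, a, b).ratio()

def merge_near_duplicates(pages: dict, threshold: float = 0.9) -> dict:
    """Map each URL to a representative URL with near-identical content."""
    representatives = {}          # representative URL -> content
    mapping = {}                  # URL -> representative URL
    for url, content in pages.items():
        for rep_url, rep_content in representatives.items():
            if similarity(content, rep_content) >= threshold:
                mapping[url] = rep_url
                break
        else:
            representatives[url] = content
            mapping[url] = url
    return mapping

def has_cycle(graph: dict) -> bool:
    """DFS cycle detection on a directed webgraph {node: [successors]}."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for succ in graph.get(node, []):
            if colour.get(succ, WHITE) == GREY:
                return True                      # back edge -> cycle found
            if colour.get(succ, WHITE) == WHITE and visit(succ):
                return True
        colour[node] = BLACK
        return False

    return any(visit(n) for n in graph if colour[n] == WHITE)
```

A cycle reported by has_cycle on a graph of merged (near-duplicate) nodes is the kind of structure that, in the dissertation's terms, may indicate a web spider trap.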
Chapter 5: The BlogForever Platform: An Integrated Approach to Preserve Weblogs
In Chapter 5 we present a novel approach to the problem of retrieving, processing, archiving and reusing weblog data. BlogForever is a prototype system for harvesting, processing and archiving weblog data which implements several innovations, such as optimised workflows for harvesting, managing and archiving weblogs. BlogForever is better suited to archiving weblogs than generic web archiving systems, as confirmed by the five case studies used to evaluate it.

We first present our survey of a substantial sample of weblogs in Section 5.2. We record the distinctive technical characteristics of weblogs, compare them with other kinds of websites and, based on our findings, build their profile. This enables us to propose better ways to harvest, process and archive their content. In Section 5.3 we codify the requirements for the BlogForever platform, and we then present the system architecture in Section 5.4. The BlogForever platform consists of two main components: the Blog Spider, which is responsible for extracting data from weblogs, and the Digital Repository, which is responsible for archiving, managing and delivering the data to users. We also present key points of our implementation in Section 5.5. In Section 5.6 we present the evaluation of the system, carried out through five different case studies involving a large number of users. We define specific research questions concerning the effectiveness and functionality of the system, study the users' responses and the technical parameters in order to understand the system's behaviour, and draw useful conclusions about its operation and about weblog archiving in general.
Chapter 6: A Scalable Approach to Harvest Modern Weblogs
In Chapter 6 we present a new algorithm for more efficient information extraction from weblogs, which uses machine learning models and the significant properties of weblogs to extract information from their semi-structured data with high accuracy and speed. Examining the significant properties of weblogs, we focus on the following:
• Weblogs provide web feeds: structured XML data describing their most recent posts.
• All posts of a weblog use the same HTML structure for their presentation.
Based on the above, we train a system that generates data extraction rules for weblogs using machine learning techniques. Using these rules, we then achieve much better data extraction. In addition, we show how our algorithm can be extended to extract further weblog data, such as author names and keywords. In Section 6.3 we present the architecture of our system, the way we handle special cases of JavaScript rendering and content extraction, and the system's scalability. We also place particular emphasis on interoperability and on encoding the extracted data using the MARC21 standard. Finally, we evaluate our method in Section 6.4 by comparing it against widely used web crawling and data extraction systems, and we find that it achieves better extraction accuracy and higher speed than the other solutions.
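A minimal sketch of the feed-guided idea is shown below: the post excerpt taken from the blog's web feed acts as a training signal, the HTML element whose text is most similar to it is selected, and that element's tag/class combination is kept as a rule for the remaining posts of the same blog. The similarity measure, the rule representation and the function names are illustrative assumptions, not the dissertation's implementation.

```python
# Sketch: derive a content extraction rule from a feed excerpt and an HTML page.
from difflib import SequenceMatcher
import lxml.html

def best_matching_element(page_html: str, feed_excerpt: str):
    """Return (element, score) for the node whose text best matches the excerpt."""
    tree = lxml.html.fromstring(page_html)
    best, best_score = None, 0.0
    for element in tree.iter():
        if not isinstance(element.tag, str):
            continue  # skip comments and processing instructions
        text = element.text_content().strip()
        if not text:
            continue
        score = SequenceMatcher(None, text, feed_excerpt).ratio()
        if score > best_score:
            best, best_score = element, score
    return best, best_score

def extraction_rule(element) -> str:
    """Derive a simple tag/class selector to reuse on other posts of the blog."""
    classes = (element.get("class") or "").split()
    return f"{element.tag}.{classes[0]}" if classes else element.tag
```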
Chapter 7: Conclusions and Future Work
Finally, in Chapter 7 we present our conclusions and our future research directions. Summarising our contributions, we reach a number of useful conclusions:
• The Website Archivability metric expresses the extent to which a website can be archived, based on its specific technical characteristics.
• Web Content Management Systems have considerable room for improvement with respect to their archivability.
• Using graphs to analyse the structure of websites (webgraph analysis) is an effective way to find identical or near-identical nodes. In this way we reduce the complexity of webgraphs and make them easier to process.
• By finding identical or near-identical nodes in the graphs used to model websites, we can detect cycles (webgraph cycles) and discover traps for web crawling systems (web spider traps).
• Easy-to-use web applications such as ArchiveReady and WebGraph-it greatly increase the dissemination and adoption of new methods.
• By exploiting the distinctive characteristics of weblogs, we can achieve much better crawling and data extraction methods.
• Separating archived website data into fields that are as fine-grained as possible can increase its value and reusability.
In the future, we intend to continue developing the methods presented here, with the goal of optimising them and using them in industrial applications. We will improve the CLEAR+ method for computing Website Archivability presented in Chapter 3, in order to promote its use by a wider audience, such as software engineers and website administrators, and we will work towards integrating the method into the workflows of web archiving systems. The webgraph-based website analysis methods we implemented in Chapter 4 also have very good prospects; we will continue to develop new variations of the algorithms within the framework we have built, and we intend to offer website analysis services through the WebGraph-it platform at http://webgraph-it.com. The BlogForever platform presented in Chapter 5 can also be extended in several ways, by incorporating semi-automatic content discovery, handling more types of weblogs and scaling up to create large repositories; we will also examine the possibility of archiving microblogs. Finally, the weblog data extraction algorithms presented in Chapter 6 could be extended by combining them with other extraction methods to create hybrid methods with better characteristics.
Acknowledgements
It takes a lot of work and persistence to successfully finish a Ph.D. Many people helped me in this difficult task and they deserve a special mention.

I would like to thank my supervisor Prof. Yannis Manolopoulos for giving me the opportunity to collaborate with him. Our discussions helped me to proceed and improve significantly in many areas. I am looking forward to learning more from him in the future.

My profound gratitude goes out to my colleagues from the BlogForever project. They did great research work, they were great partners and I owe them a lot. I would like to mention especially Stratos Arampatzis, Ilias Trochidis, Nikos Kasioumis, Jaime Garcia Llopis, Raquel Jimenez Encinar, Tibor Simko, Yunhyong Kim, Seamus Ross, Senan Postaci, Karen Stepanyan, George Gkotsis, Alexandra Cristea, Mike Joy, Hendrik Kalb, Silvia Arango Docio, Patricia Sleeman, Ed Pinsent and Richard Davis.

Thanks to the Hellenic Institute of Metrology management and especially my director, Dionisios G. Kiriakidis, for supporting me.

Most importantly, none of this would have been possible without the love and patience of my family. Finally, I cannot stress enough how much I admire Evi for her character and attitude. I could not have done this without her.

Vangelis, October 2015
Table of contents
List of figures xix
List of tables xxi
1 Introduction 1
  1.1 Key Definitions and Problem Description ...... 1
  1.2 Contributions and Document Organisation ...... 2
  1.3 Publications ...... 3
2 Background and Literature Review 5
  2.1 Web Archiving ...... 5
    2.1.1 Web Archiving Quality Assurance ...... 6
    2.1.2 Web Content Deduplication ...... 7
    2.1.3 Web Crawler Automation ...... 9
  2.2 Blog Archiving ...... 10
    2.2.1 Blog Archiving Projects ...... 11
3 An Innovative Method to Evaluate Website Archivability 15
  3.1 Introduction ...... 15
  3.2 Credible Live Evaluation method for Archive Readiness Plus (CLEAR+) ...... 17
    3.2.1 Requirements ...... 18
    3.2.2 Website Archivability Facets ...... 19
    3.2.3 Attributes ...... 26
    3.2.4 Evaluations ...... 27
    3.2.5 Example ...... 29
    3.2.6 The Evolution from CLEAR to CLEAR+ ...... 32
  3.3 ArchiveReady: A Website Archivability Evaluation Tool ...... 32
    3.3.1 System Architecture ...... 33
    3.3.2 Scalability ...... 35
    3.3.3 Workflow ...... 36
    3.3.4 Interoperability and APIs ...... 37
  3.4 Evaluation ...... 38
    3.4.1 Methodology and Limits ...... 38
    3.4.2 Experimentation with Assorted Datasets ...... 40
    3.4.3 Evaluation by Experts ...... 42
    3.4.4 WA Variance in the Same Website ...... 44
  3.5 Web Content Management Systems Archivability ...... 46
    3.5.1 Website Corpus Evaluation Method ...... 46
    3.5.2 Evaluation Results and Observations ...... 47
    3.5.3 Discussion ...... 53
  3.6 Conclusions ...... 55
4 Near-duplicate and Cycle Detection in Webgraphs towards Optimised Web Crawling 57
  4.1 Introduction ...... 57
  4.2 Method ...... 59
    4.2.1 Key Concepts ...... 59
    4.2.2 Algorithms ...... 65
  4.3 The WebGraph-it System Architecture ...... 69
    4.3.1 System ...... 69
    4.3.2 Web Crawling Framework ...... 72
  4.4 Evaluation ...... 73
    4.4.1 Methodology ...... 74
    4.4.2 Example ...... 75
    4.4.3 Results ...... 76
    4.4.4 Optimal DFS Limit for Cycle Detection ...... 78
    4.4.5 Web Spider Trap Experiment ...... 79
  4.5 Conclusions and Future Work ...... 81
5 The BlogForever Platform: An Integrated Approach to Preserve Weblogs 83
  5.1 Introduction ...... 84
  5.2 Blogosphere Technical Survey ...... 84
    5.2.1 Survey Implementation ...... 85
    5.2.2 Results ...... 87
    5.2.3 Comparison Between Blogosphere and the Generic Web ...... 100
  5.3 User Requirements ...... 102
    5.3.1 Preservation Requirements ...... 102
    5.3.2 Interoperability Requirements ...... 103
    5.3.3 Performance Requirements ...... 104
  5.4 System Architecture ...... 105
    5.4.1 The BlogForever Software Platform ...... 105
    5.4.2 Blog Spider Component ...... 106
    5.4.3 Digital Repository Component ...... 111
  5.5 Implementation ...... 116
  5.6 Evaluation ...... 117
    5.6.1 Method ...... 117
    5.6.2 Results ...... 119
    5.6.3 Evaluation Outcomes ...... 121
  5.7 Discussion and Conclusions ...... 122
6 A Scalable Approach to Harvest Modern Weblogs 125
  6.1 Introduction ...... 125
  6.2 Algorithms ...... 126
    6.2.1 Motivation ...... 127
    6.2.2 Content Extraction Overview ...... 127
    6.2.3 Extraction Rules and String Similarity ...... 128
    6.2.4 Time Complexity and Linear Reformulation ...... 130
    6.2.5 Variations for Authors, Dates and Comments ...... 132
  6.3 Architecture ...... 133
    6.3.1 System and Workflow ...... 133
    6.3.2 JavaScript Rendering ...... 134
    6.3.3 Content Extraction ...... 135
    6.3.4 The BlogForever Metadata Schema for Interoperability ...... 136
    6.3.5 Distributed Architecture and Scalability ...... 136
  6.4 Evaluation ...... 139
    6.4.1 Extraction Success Rates ...... 140
    6.4.2 Article Extraction Running Times ...... 140
  6.5 Discussion and Conclusions ...... 141
7 Conclusions and Future Work 143
  7.1 Conclusions ...... 143
  7.2 Future Work ...... 144
Bibliography 147
Appendix Website Archivability Impact 157
Appendix BlogForever Platform Screenshots 159
Appendix VITA 161

List of figures
3.1 WA Facets: An Overview ...... 19
3.2 Website attributes evaluated for WA ...... 26
3.3 Evaluating http://auth.gr/ WA using ArchiveReady ...... 31
3.4 The architecture of the archiveready.com system ...... 34
3.5 The home page of the archiveready.com system ...... 35
3.6 WA statistics for assorted datasets box plot ...... 42
3.7 WA distribution for assorted datasets ...... 43
3.8 WA average rating and standard deviation values, as well as the homepage WA for a set of 783 random websites ...... 45
3.9 WA Facets average values and standard deviation for each WCMS ...... 48
4.1 WebGraph-It system architecture ...... 70
4.2 Viewing a webgraph in the http://webgraph-it.com web application ...... 71
5.1 HTTP Status response codes registered during data-collection ...... 86
5.2 Frequency of weblog software platforms ...... 88
5.3 Variation in versions of Wordpress software ...... 89
5.4 Variation in versions of MovableType software ...... 89
5.5 Variation in versions of vBulletin software ...... 90
5.6 Variation in versions of Discuz! software ...... 90
5.7 Encoding of evaluated resources ...... 91
5.8 Break down of the other 6% of character set attributes ...... 91
5.9 Average number of images identified ...... 92
5.10 Average use of BMP, SVG, TIFF, WBMP and WEBP formats ...... 93
5.11 Distribution of images for pages with less than 20 images only ...... 93
5.12 Summary of metadata usage ...... 94
5.13 Histogram of Open Graph references ...... 94
5.14 Use of XML feeds by type ...... 96
5.15 Number of JavaScript instances identified ...... 97
5.16 Number of identified JavaScript library/framework instances ...... 97
5.17 Frequency of embedded YouTube videos ...... 98
5.18 Flash use on the web (left) and on blogs (right) ...... 100
5.19 JavaScript frameworks use on the web (left) and on blogs (right) ...... 100
5.20 Image formats use on the web (left) and on blogs (right) ...... 101
5.21 HTTP status responses on the web (left) and on blogs (right) ...... 101
5.22 A general overview of the BlogForever platform, featuring the blog spider and the blog repository ...... 105
5.23 Core entities of the BlogForever data model [134] ...... 107
5.24 BlogForever conceptual data model [134] ...... 108
5.25 The outline of the blog spider component design ...... 109
5.26 High level outline of a scalable set up of the blog spider component ...... 111
5.27 The outline of the blog repository component design ...... 113
5.28 BlogForever Evaluation Timeline [7] ...... 118
6.1 Overview of the crawler architecture. (Credit: Pablo Hoffman, Daniel Graña, Scrapy) ...... 134
6.2 Article extraction running time ...... 141
1 BlogForever Platform Home Page ...... 159
2 BlogForever Platform Features ...... 160

List of tables
2.1 Overview of related initiatives and projects ...... 11
3.1 F_A: Accessibility Evaluations ...... 21
3.2 F_S: Standards Compliance Facet Evaluations ...... 23
3.3 F_C: Cohesion Facet Evaluations ...... 24
3.4 F_M: Metadata Facet Evaluations ...... 25
3.5 WA Facet Weights ...... 28
3.6 F_A evaluation of http://auth.gr/ ...... 29
3.7 F_S evaluation of http://auth.gr/ ...... 30
3.8 F_C evaluation of http://auth.gr/ ...... 30
3.9 F_M evaluation of http://auth.gr/ ...... 30
3.10 Description of assorted datasets ...... 41
3.11 Comparison of WA statistics for assorted datasets ...... 42
3.12 Correlation between WA, WA Facets and Experts rating ...... 44
3.13 A_1: The percentage of valid URLs. Higher is better ...... 49
3.14 A_2: The number of inline scripts per WCMS instance. Lower is better ...... 49
3.15 A_3: Sitemap.xml is present. Higher is better ...... 50
3.16 C_1: The percentage of local versus remote images. Higher is better ...... 50
3.17 C_1: The percentage of local versus remote CSS. Higher is better ...... 51
3.18 S_1: HTML errors per instance. Lower is better ...... 51
3.19 S_2: The lack of use of proprietary files (Flash, QuickTime). Higher is better ...... 52
3.20 A_5: Valid Feeds. Higher is better ...... 52
3.21 M_1: HTTP Content-Type header. Higher is better ...... 53
3.22 M_2: HTTP caching headers. Higher is better ...... 53
4.1 Potential webgraph node similarity metrics ...... 64
4.2 Web crawling algorithms summary ...... 69
4.3 Variables used in the evaluation, i=1-8 ...... 74
4.4 Results from all methods for a single website, http://deixto.com ...... 76
4.5 W_i: Captured webpages difference between all webcrawls and base crawl. Lower is better ...... 76
4.6 CO_i: Completeness of each web crawling method. Higher is better ...... 77
4.7 D_i: Duration difference between all webcrawls and base crawl. Lower is better ...... 77
4.8 L_i: Captured links difference between all webcrawls and base crawl. Lower is better ...... 77
4.9 Number of cycles for each distance limit ...... 79
4.10 Web spider trap crawling results ...... 80
5.1 Datasets ...... 85
5.2 File MIME types ordered by descending frequency of occurrence ...... 99
5.3 Overview of user requirements ...... 103
5.4 BlogForever Evaluation Metrics [13] ...... 119
5.5 BlogForever Evaluation Themes [13] ...... 120
5.6 BlogForever Case Studies for User Testing [14] ...... 120
5.7 External and Internal Scores Summary [13] ...... 121
6.1 Examples of string similarities ...... 129
6.2 TechCrunch blog post example ...... 131
6.3 Blog post excerpt and full text similarity using different N values ...... 131
6.4 Blog record attributes - MARC 21 representations mapping ...... 137
6.5 Blog record attributes - MARC 21 representations mapping ...... 138
6.6 Comment record attributes - MARC tags mapping ...... 138
6.7 Extraction success rates for different algorithms ...... 140

Chapter 1
Introduction
Here, we present the main context of our research, including key definitions and the problem description. In the sequel, we outline our key contributions and the overall document organisation. Finally, we list our scientific publications. Since no research happens in isolation, I use the authorial “we” throughout the text.
1.1 Key Definitions and Problem Description
The World Wide Web (WWW) is increasingly important for all aspects of our society, culture and economy. It has become known simply as the web. The number of indexed webpages is estimated to be 4.8 billion in 2015 according to major search engines¹. The importance of the web suggests a need for preservation, at least for selected websites [149]. Web archiving is the process of gathering digital materials from the web, ingesting them, ensuring that these materials are preserved in an archive, and making the collected materials available for future use and research [106]. Web archiving is crucial to ensure that our digital materials remain accessible over time.

Web archiving is a difficult problem for many reasons, organisational and technical. The organisational aspect of web archiving has all the inherent issues of any digital preservation activity. It involves the entity that is responsible for the process, its governance, funding, long term viability and the personnel responsible for the web archiving tasks [116]. The technical aspect of web archiving involves the procedures of web content identification, acquisition, ingest, organization, access and use [44, 132]. One of the main technical challenges of web archiving in comparison to other digital archiving activities is the process of data acquisition. Websites are becoming increasingly complex and versatile, posing challenges for web data extraction systems, also known as web crawlers, to retrieve their content with accuracy and reliability [69]. The process of web crawling is inherently complex and there are no standard methods to access the complete website data. As a result, research has shown that web archives are missing significant portions of archived websites [29].
¹ http://www.worldwidewebsize.com/, accessed August 1, 2015
A key aspect of the web archiving process is optimal data extraction from target websites. This procedure is difficult for reasons such as website complexity, the plethora of underlying technologies and, ultimately, the open-ended nature of the web. Essentially, a web crawler starts from a seed webpage and then uses the hyperlinks within it to visit other webpages. This process repeats with every new webpage until some conditions are met (e.g. a maximum number of webpages is visited or no new hyperlinks are detected). Despite the simplicity of the basic algorithm, web crawling has many challenges [12]. For instance, a lot of duplicate or near-duplicate data is captured during web crawling [95]. Also, web spiders are often disrupted or waste excessive computing resources in "spider traps" [110].

Specific web domains have different characteristics and special properties, which require different web crawling, analysis and archiving approaches. We focus on the Blogosphere, the collective outcome of all weblogs, their content, interconnections and influences, which constitutes an active part of the Social Media, an established channel of online communication with great significance [78]. Weblogs are used from teaching physics in Latin America [151] to facilitating fashion discussions by young people in France [36]. Weblogs are also known as blogs. They are specific types of websites, regularly updated and intended for general public consumption. Their structure is defined as a series of pages in reverse chronological order. Wordpress, a single blog publishing company, reports more than 1 million new posts and 1.5 million new comments each day [147]. These overwhelming numbers illustrate the importance of blogs in most aspects of private and business life [37]. Blogs contain data with historic, political, social and scientific value, which need to be accessible for current and future generations. For instance, blogs proved to be an important resource during the 2011 Egyptian revolution by playing an instrumental role in the organization and implementation of protests [51]. The problem is that blogs disappear every day [76] because there is no standard method or authority to ensure blog archiving and long-term digital preservation.

In this thesis, we focus on improving web crawling, aggregated data analysis, management and archiving. We look into the way web crawlers visit webpages and extract content. We try to improve the way web archives select and ingest websites. We also focus on weblogs and aim to devise a new approach for weblog data extraction, management, preservation and reuse. In the following section, we present our contributions and the overall structure of the thesis.
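A minimal sketch of the basic crawling loop described above is shown below; real crawlers add politeness delays, robots.txt handling, deduplication and error handling on top of this skeleton, and the library choices (requests, lxml) are illustrative.

```python
# Minimal crawling loop: follow hyperlinks from a seed until a page limit is hit.
import requests
import lxml.html
from urllib.parse import urljoin

def crawl(seed: str, max_pages: int = 100):
    frontier, visited = [seed], set()
    while frontier and len(visited) < max_pages:
        url = frontier.pop(0)
        if url in visited:
            continue
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        visited.add(url)
        if not response.text.strip():
            continue
        tree = lxml.html.fromstring(response.text)
        for href in tree.xpath("//a/@href"):
            link = urljoin(url, href)   # resolve relative hyperlinks
            if link not in visited:
                frontier.append(link)
    return visited
```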
1.2 Contributions and Document Organisation
Our work focuses on web crawling, analysis and archiving methods. We introduce new metrics to assess how amenable websites are to archiving. We propose new algorithms to improve web crawling efficiency and performance. We also propose a new approach to weblog archiving, with new algorithms focused on weblog data extraction. For each proposed method we conduct extensive experimental evaluation using real world data. We also implement software systems as reference implementations; some of our systems are publicly available on the web and/or as Open Source Software. We present our main contributions and their organisation in this document:
Chapter 2: We outline related work and literature in the field of web crawling and archiving. We focus on the state of the art of web content deduplication and spider trap detection, as well as methods for website evaluation and web archiving Quality Assurance (QA). We also review the state of weblog archiving as a special case of web archiving.

Chapter 3: We introduce the concept of Website Archivability (WA), a metric to quantify whether a website has the potential to be archived with correctness and accuracy. We define the Credible Live Evaluation of Archive Readiness Plus (CLEAR+) method to calculate WA. We present ArchiveReady, a web platform which implements the proposed methods and is available at archiveready.com. We evaluate the Website Archivability (WA) of the most prevalent WCMS and come up with specific recommendations for their developers.

Chapter 4: We propose a set of methods to detect duplicate and near-duplicate webpages in real time during web crawling, as well as a set of methods to detect web spider traps using webgraphs in real time during web crawling. We present WebGraph-It, a web platform which implements the proposed methods and is available at webgraph-it.com.

Chapter 5: We present the BlogForever platform, a new approach to aggregate, manage, preserve and reuse weblog content. First, we explore patterns in weblog structure and data to outline weblogs' main technical characteristics and differences from the generic web. Then, we describe policies, workflows, methods and systems we develop in the context of BlogForever. We also present extensive evaluation test results.

Chapter 6: We present a new approach to harvest weblogs. Our algorithm is simple yet robust and scalable. It generates weblog content extraction rules with great accuracy and performance. We also present our system, which is evaluated using extensive test procedures.

Chapter 7: We summarise the results of our work and present some conclusions and potential future work directions.
1.3 Publications
The research presented in this Ph.D. dissertation was published in 4 peer-reviewed journals, 7 international conference proceedings and 1 book chapter. Publications in scientific journals:
1. Banos V., Manolopoulos Y.: “Near-duplicate and Cycle Detection in Webgraphs towards Optimised Web Crawling”, ACM Transactions on the Web Journal, submitted, 2015.
2. Banos V., Manolopoulos Y.: “A Quantitative Approach to Evaluate Website Archivability Using the CLEAR+ Method”, International Journal on Digital Libraries, 2015.
3. Banos V., Blanvillain O., Kasioumis N., Manolopoulos Y.: “A Scalable Approach to Harvest Modern Weblogs”, International Journal of AI Tools, Vol.24, N.2, 2015.
4. Kasioumis N., Banos V., Kalb H.: “Towards Building a Blog Preservation Platform”, World Wide Web Journal, Special Issue on Social Media Preservation and Applications, Springer, 2013.
Publications in international conference proceedings:
1. Banos V., Manolopoulos Y.: “Web Content Management Systems Archivability”, Proceedings 19th East-European Conference on Advances in Databases & Information Systems (ADBIS), Springer Verlag, LNCS Vol.9282, Poitiers, France, 2015.
2. Blanvillain O., Banos V., Kasioumis N.: “BlogForever Crawler: Techniques and Algorithms to Harvest Modern Weblogs”, Proceedings 4th International Conference on Web Intelligence, Mining & Semantics (WIMS), ACM Press, Thessaloniki, Greece, 2014.
3. Banos V., Kim Y., Ross S., Manolopoulos Y.: “CLEAR: a Credible Method to Evaluate Website Archivability”, Proceedings 10th International Conference on Preservation of Digital Objects (iPRES), Lisbon, Portugal, 2013.
4. Kalb H., Lazaridou P., Banos V., Kasioumis N., Trier M.: “BlogForever: From Web Archiving to Blog Archiving”, Proceedings ‘Informatik Angepasst an Mensch, Organisation und Umwelt’ (INFORMATIK), Koblenz, Germany, 2013.
5. Stepanyan K., Gkotsis G., Banos V., Cristea A., Joy M.: “A Hybrid Approach for Spotting, Disambiguating and Annotating Places in User-Generated Text”, Proceedings 22nd International Conference on World Wide Web (WWW), Rio de Janeiro, Brazil, 2013.
6. Banos V., Baltas N., Manolopoulos Y.: “Trends in Blog Preservation”, Proceedings 14th International Conference on Enterprise Information Systems (ICEIS), Vol.1, pp.13-22, Wroclaw, Poland, 2012.
7. Banos V., Stepanyan K., Manolopoulos Y., Joy M., Cristea A.: “Technological Foundations of the Current Blogosphere”, Proceedings 2nd International Conference on Web Intelligence, Mining & Semantics (WIMS), ACM Press, Craiova, Romania, 2012.
Book chapters:
1. Banos V., Baltas N., Manolopoulos Y.: “Blog Preservation: Current Challenges and a New Paradigm”, chapter 3 in book Enterprise Information Systems XIII, by Cordeiro J., Maciaszek L. and Filipe J. (eds.), Springer LNBIP Vol.141, pp.29–51, 2013.

Chapter 2
Background and Literature Review
In this chapter, we review the current state of Web crawling, Web archiving and Social Media preservation. More specifically, we look into work in the areas of web archiving Quality Assurance (QA), web content deduplication and web crawler automation, which are highly relevant to our work. We also review related work in weblog archiving.
2.1 Web Archiving
Web archiving is an important aspect of the preservation of cultural heritage [98]. Web preservation is defined as ‘the capturing, management and preservation of websites and Web resources’. Web preservation must be a start-to-finish activity, and it should encompass the entire lifecycle of the Web resource [9]. The most notable web archiving initiative is the Internet Archive¹, which has been operating since 1996. In addition, a large variety of projects from national and international organizations are working on Web preservation related activities. For instance, many national memory institutions such as national libraries understood the value of Web preservation and developed special activities towards this goal [137]. All active national web archiving efforts, as well as some academic Web archives, are members of the International Internet Preservation Consortium (IIPC).

Web archiving is a complex task that requires a lot of resources. Therefore, it is always a selective process and only parts of the existing web are archived [61, 4]. Contrary to traditional media like printed books, webpages can be highly dynamic. As a result, the selection of archived information comprises not only the decision of what to archive (e.g. topic or regional focus) but also the setting of additional parameters such as the archiving frequency per page, and other parameters related to the page request (e.g. browser, user account, language etc.) [98]. In practice, the selection often seems to be driven by human publicity and search engine discoverability [4]. The archiving of large parts of the web is a highly automated process, and the archiving frequency of a webpage is normally determined by a schedule for harvesting the page. Thus, the life of a website is not recorded appropriately if the page is updated more often than it is crawled [71].
¹ http://archive.org/, accessed: August 1, 2015
In the following, we look into the current state of web archiving quality assurance, web content deduplication and web crawler automation, which are key aspects of the web archiving data ingestion workflow.
2.1.1 Web Archiving Quality Assurance
The web archiving workflow includes identification, appraisal and selection, acquisition, ingest, organization and storage, description and access [106]. The present section focuses explicitly on the acquisition of web content and the way it is handled by web archiving projects and initiatives.

Web content acquisition is one of the most delicate aspects of the web archiving workflow because it depends heavily on external systems: the target websites, web servers, application servers, proxies and network infrastructure. The number of independent and dependent elements gives harvesting a substantial risk load. Web content acquisition for web archiving is performed using robots, also known as "spiders", "crawlers", or "bots": self-acting agents that navigate around-the-clock through the hyperlinks of the web, harvesting topical resources without human supervision [112]. The most popular web harvester, Heritrix, is an open source, extensible, scalable, archival quality web crawler [102] developed by the Internet Archive in partnership with a number of libraries and web archives from across the world. Heritrix is currently the main web harvesting application used by the International Internet Preservation Consortium (IIPC) as well as numerous web archiving projects. Heritrix is being continuously developed and extended to improve its capacities for intelligent and adaptive crawling [52] or to capture streaming media [72]. The Heritrix crawler was originally established for crawling general webpages that do not include substantial dynamic or complex content. In response, other crawlers have been developed which aim to address some of Heritrix's shortcomings. For instance, BlogForever [15] utilizes blog specific technologies to preserve blogs. Also, the ArchivePress project is based explicitly on XML feeds produced by blog platforms to detect web content [115].

As websites become more sophisticated and complex, the difficulties that web bots face in harvesting them increase. For instance, some web bots have limited abilities to process dynamic web content or streaming media [106]. To overcome these obstacles, standards have been developed to make websites more amenable to harvesting by web bots. Two examples are the Sitemap.xml and Robots.txt protocols. The Sitemap.xml protocol, ‘Simple Website Footprinting’, is a way to build a detailed picture of the structure and link architecture of a website [96, 127]. Implementations of the Robots.txt protocol provide web bots with information about specific elements of a website and their access permissions [135]. Such protocols are not used universally.

Web content acquisition for archiving is only considered complete once the quality of the harvested material has been established. The entire web archiving workflow is often handled using special software, such as the open source Web Curator Tool (WCT) [108], developed as a collaborative effort by the National Library of New Zealand and the British Library, at the instigation of the IIPC. WCT supports such web archiving processes as permissions, job scheduling, harvesting, quality review, and the collection of descriptive metadata. Focusing on quality review, when a harvest is complete, the harvest result is saved in the digital asset store, and the Target Instance is saved in the Harvested state². The next step is for the Target Instance Owner to Quality Review the harvest. WCT operators perform this task manually.
Moreover, according to the web archiving process followed by the National Library of New Zealand, after performing the harvests, the operators review and endorse or reject the harvested material; accepted material is then deposited in the repository [114]. A report from the Web-At-Risk project provides confirmation of this process: operators must review the content thoroughly to determine if it can be harvested at all [59].

Efforts to deploy crowdsourced techniques to manage QA provide an indication of how significant the QA bottleneck is. The use of these approaches is not new; they have been deployed by digitisation projects. The QA process followed by most web archives is time consuming and potentially complicated, depending on the volume of the site, the type of content hosted, and the technical structure. However, to quote the IIPC, “it is conceivable that crowdsourcing could support targeted elements of the QA process. The comparative aspect of QA lends itself well to ‘quick wins’ for participants”³. IIPC also organized a Crowdsourcing Workshop in its 2012 General Assembly to explore how to involve users in developing and curating web archives. QA was indicated as one of the key tasks to be assigned to users: “The process of examining the characteristics of the websites captured by web crawling software, which is largely manual in practice, before making a decision as to whether a website has been successfully captured to become a valid archival copy”⁴.

The previous literature shows that there is agreement within the web archiving community that web content aggregation is challenging. QA is an essential stage in the web archiving workflow, but currently the process requires human intervention and research into automating QA is in its infancy. The solution used by web archiving initiatives such as Archive-it⁵ is to perform test crawls prior to archiving⁶, but these suffer from, at least, two shortcomings: a) the test crawls require human intervention to evaluate the results, and b) they do not fully address such challenges as deep-level metadata usage and media file format validation.
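As an aside on the harvesting protocols mentioned earlier in this section, a crawler can honour the Robots.txt protocol programmatically; the snippet below uses Python's standard library parser, and the user agent string and URLs are made up for the example.

```python
# Check whether a crawler may fetch a page, according to the site's robots.txt.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("http://example.com/robots.txt")   # illustrative URL
robots.read()

if robots.can_fetch("MyArchivingBot", "http://example.com/private/page.html"):
    print("Allowed to fetch")
else:
    print("Disallowed by robots.txt")
```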
2.1.2 Web Content Deduplication
There are many efforts related to organising crawled web content effectively, including ranking and duplicate detection. Interestingly, most such work focuses on already archived content, whereas little work has been done to improve duplicate or near-duplicate detection during web crawling. The problem of web spider traps is also relevant, as it generates infinite amounts of duplicate web content. To the best of our knowledge, it is not addressed sufficiently in an automated way.
² http://webcurator.sourceforge.net/docs/1.5.2/Web%20Curator%20Tool%20User%20Manual%20(WCT%201.5.2).pdf, accessed: August 1, 2015
³ http://www.netpreserve.org/sites/default/files/.../CompleteCrowdsourcing.pdf, accessed: August 1, 2015
⁴ http://netpreserve.org/sites/default/files/attachments/CrowdsourcingWebArchiving_WorkshopReport.pdf, accessed: August 1, 2015
⁵ http://www.archive-it.org/, accessed: August 1, 2015
⁶ https://webarchive.jira.com/wiki/display/ARIH/Test+Crawls, accessed: August 1, 2015
Duplicate web content is a major issue for systems that perform web data extraction. The nature of the web promotes the creation of duplicate content, either intentionally or unintentionally. Web crawler systems have tried to address the issue of duplicate content based on the URL, the webpage content, or both. This problem has also been defined as "duplicate URL with similar text" (DUST) [18], and various algorithms have been created to mine server and crawler logs, sampling webpages and inferring rules to detect duplicate webpages [3, 43]. It is evident that these methods cannot be applied at web scale, as it is not practical to access the required information for every website. An approach to apply URL deduplication in web-scale crawlers suggests the use of two-level URL duplication checking, both at the website and the webpage level [150]. The Mercator web crawler implements a 'content seen' test, which computes a 64-bit checksum of the contents of each downloaded document and stores the checksums in tables. The checksum of each newly downloaded document is looked up in the tables to detect duplicates [70].

Near-duplicate detection is also very important, as the web is abundant with near-duplicate documents. Differences between these documents may be trivial, such as banner ads or timestamps. Manku et al. have presented an efficient method to detect near-duplicates in large web document archives [95] based on Charikar's similarity estimation techniques [35]. The CiteSeerX digital library search engine employs both duplicate and near-duplicate document detection. During web crawling, it generates SHA-1 hashes of documents and checks them against existing ones in its database; duplicates are discarded immediately. Near-duplicates are detected after ingestion via clustering, where document attributes such as title and author names are normalised and used as keys [148].

Website duplicate detection is also an active topic. One approach is to use the websites' structure and the content of their pages to identify possible replicas [42]. A more thorough approach to detect web mirrors depends mostly on the syntactic analysis of URL strings, and requires retrieval and content analysis only for a small number of pages. It is able to detect both partial and total mirroring, and handles cases where the content is not byte-wise identical [20].

Web archives extract and preserve massive amounts of web content and they are very interested in identifying and removing duplicate content. Important work has been done to detect duplicate web resources referenced by the same URL in web archives and to create relevant systems [62]. Another important contribution is the DeDuplicator plug-in for Heritrix, which detects and avoids the storage of duplicate content [129]. This system was also used in a billion-scale searchable web archive with good results [60]. Web archive content deduplication is so significant that in 2015 the International Internet Preservation Consortium (IIPC) adopted a proposal to extend the Web ARChive (WARC) archive format [75] to standardise the recording of arbitrary duplicates in WARC files⁷. The issue with all the presented web archive content deduplication cases is that the web content is already extracted, processed and stored in the web archive before it is identified as a duplicate and deleted. Thus, a lot of computing resources are wasted. Near-duplicates are not even considered in research on web archive content deduplication.
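A minimal sketch of a Mercator-style "content seen" test is shown below: every downloaded document is fingerprinted and documents whose fingerprint has been seen before are skipped. This only catches exact duplicates; near-duplicate detection requires similarity fingerprints such as simhash. The use of SHA-1 and an in-memory set are simplifications for the example.

```python
# "Content seen" test: skip documents whose body has already been stored.
import hashlib

class ContentSeenTest:
    def __init__(self):
        self._fingerprints = set()

    def is_duplicate(self, document: bytes) -> bool:
        fingerprint = hashlib.sha1(document).hexdigest()
        if fingerprint in self._fingerprints:
            return True
        self._fingerprints.add(fingerprint)
        return False

seen = ContentSeenTest()
print(seen.is_duplicate(b"<html>same page</html>"))  # False
print(seen.is_duplicate(b"<html>same page</html>"))  # True
```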
One of the main principles of the web is that every piece of content has its own URI. Good URIs are the topic of much discussion [60]. The Portuguese web archive provides instructions to
⁷ https://iipc.github.io/warc-specifications/specifications/warc-deduplication/recording-arbitrary-duplicates.html, accessed: August 1, 2015
2.1.3 Web Crawler Automation
Web crawlers are complex software systems, which often combine techniques from various disciplines in computer science. Our work on the BlogForever crawler is related to the fields of web data extraction, distributed computing and natural language processing. In the literature on web data extraction, the word wrapper is commonly used to designate procedures to extract structured data from unstructured documents. We do not use this word in the present work, in favor of the term extraction rule, which better reflects our implementation and is decoupled from the software that concretely performs the extraction.

A common approach in web data extraction is to manually build wrappers for the targeted websites. This approach has been proposed for the crawler discussed in [52], which automatically assigns websites to predefined categories and gets the appropriate wrapper from a static knowledge base. The limiting factor in this type of approach is the substantial amount of manual work needed to write and maintain the wrappers, which is not compatible with the increasing size and diversity of the web. Several projects try to simplify this process and provide various degrees of automation. This is the case of the Stalker algorithm [105], which generates wrappers based on user-labelled training examples. Some commercial solutions such as the Lixto project [64] simplify the task of building wrappers by offering a complete integrated development environment, where the training data set is obtained via a graphical user interface.

As an alternative to dedicated software for the creation and maintenance of wrappers, some query languages have been designed specifically for wrappers. These languages rely on their users to manually identify the structure of the data to be extracted. This structure can then be formalised as a small declarative program, which can then be turned into a concrete wrapper by an execution engine. The OXPath language [57] is an interesting extension to XPath designed to incorporate interaction in the extraction process. It supports simulated user actions such as filling forms or clicking buttons to obtain information that would not be accessible otherwise. Another extension of XPath, called Spatial XPath [111], allows writing special rules in the extraction queries. The execution engine embeds a complete web browser which computes the visual representation of the page.

Fully automated solutions use different techniques to identify and extract information directly from the structure and content of the web page, without the need of any manual intervention. The Boilerpipe project [84], which is also used in our evaluation in Chapter 6, uses text density analysis to extract the main article of a webpage. The approach presented in [119] is based on a tree structure analysis of pages with similar templates, such as news websites or blogs. Automatic solutions have also been designed specifically for blogs. Similarly to our approach, Oita and Senellart [109] describe a procedure to automatically build wrappers by matching web feed articles to HTML pages. This work was further extended by Gkotsis et al.
with a focus on extracting content anterior to the one indexed in web feeds [58]. They also report having successfully extracted blog post titles, publication dates and authors, but their approach is less generic than the one used for the extraction of articles. Finally, neither [109] nor [58] provide a complexity analysis, which we believe to be essential before using an algorithm in production.

One interesting research direction is that of large scale distributed crawlers. Mercator [70], UbiCrawler [23] and the crawler discussed in [128] are examples of successful distributed crawlers. The associated articles provide useful information regarding the challenges encountered when working on a distributed architecture. One of the core issues when scaling out seems to be sharing the list of URLs that have already been visited and those that need to be visited next. While [70] and [128] rely on a central node to hold this information, [23] uses a fully distributed architecture where URLs are divided among nodes using consistent hashing. Both of these approaches require the crawlers to implement complex mechanisms to achieve fault tolerance.

Regarding our research, the BlogForever Crawler does not have to address this issue as it is already handled by the BlogForever back-end system, which is responsible for task and state management (Section 5). In addition, since we process webpages on the fly and directly emit the extracted content to the back-end, there is no need for persistent storage on the crawler's side. This removes one layer of complexity compared to general crawlers, which need to use a distributed file system ([128] uses NFS, [ber] uses HDFS) or implement an aggregation mechanism to further exploit the collected data. In Section 5.4, we present our design, which is similar to the distributed active object pattern presented in [85]. It is also further simplified by the fact that the state of the crawler instances is not kept between crawls.

8http://sobre.arquivo.pt/how-to-participate/recommendations-for-web-authors-to-enable-web/one-link-to-the-address-of-each-content, accessed: August 1, 2015
9http://www.w3.org/Provider/Style/Bookmarkable.html, accessed: August 1, 2015
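To make the URL partitioning scheme of fully distributed crawlers such as [23] concrete, the following minimal sketch assigns URLs to crawler nodes with a simple consistent-hash ring. It is an illustration of the general technique only: the node names, the number of virtual points per node and the hash function are assumptions, not details of UbiCrawler.

import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring for assigning URLs to crawler nodes."""

    def __init__(self, nodes, replicas=100):
        # Each physical node is mapped to several points on the ring so that
        # URLs spread evenly and only a small fraction of them moves when a
        # node joins or leaves.
        self.ring = []  # sorted list of (hash value, node) points
        for node in nodes:
            for i in range(replicas):
                self.ring.append((self._hash(f"{node}:{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, url):
        """Return the crawler node responsible for the given URL."""
        h = self._hash(url)
        idx = bisect.bisect(self.ring, (h,)) % len(self.ring)
        return self.ring[idx][1]

# Example with three hypothetical crawler nodes sharing the URL frontier.
ring = ConsistentHashRing(["crawler-1", "crawler-2", "crawler-3"])
print(ring.node_for("http://example.com/post/42"))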
2.2 Blog Archiving
Blog archiving is a prominent subcategory of web archiving due to the significance of blogs in every aspect of business and private life. However, current web archiving tools face important issues with respect to blog preservation. First, the tools for acquisition and curation use a schedule-based approach to determine the point in time when content should be captured for archiving. Thus, the life of a blog is not recorded appropriately if it is updated more often than it is crawled [71]. On the other hand, unnecessary harvests and archived duplicates occur if the blog is updated less often than the crawling schedule, and if the whole blog is harvested again instead of a selective harvest of the new pages. Therefore, an approach that considers update events (e.g. new post, new comment, etc.) as triggers for crawling activities would be more suitable; RSS feeds and ping servers can be utilized as such triggers (a minimal sketch of this update-driven approach follows the list below). Secondly, the general web archiving approach considers the webpage as the digital object that is preserved and can be accessed. However, a blog consists of several smaller entities, such as posts and comments. Therefore, while archives like the Internet Archive enable a structural blogosphere analysis, a specialised archiving system based on the inherent structure of blogs also facilitates further analyses, such as of issues or events [144]. In summary, we identify several problems of blog archiving with current web archiving tools:
• Aggregation scheduling is performed at fixed time intervals without considering website updates. This causes incomplete content aggregation if the update frequency of the content is higher than the schedule predicts [71, 137].
• Traditional aggregation uses brute-force methods to crawl, without taking into account which content of the target website has actually been updated. Thus, the performance of both the archiving system and the crawled system is affected unnecessarily [137].
• Current web archiving solutions do not exploit the potential of the inherent structure of blogs. Therefore, while blogs provide a rich set of information entities, structured content, APIs, interconnections and semantic information [89], the management and end-user features of existing web archives are limited to primitive features such as URL Search, Keyword Search, Alphabetic Browsing and Full-Text Search [137].
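As noted above, RSS feeds and ping servers can act as update triggers. A minimal sketch of such an update-driven harvester follows; it assumes the feedparser library and a hypothetical enqueue_for_crawling callback, and illustrates the idea rather than reproducing any existing archiving system.

import feedparser  # widely used RSS/Atom parsing library

def poll_feed(feed_url, last_harvest, enqueue_for_crawling):
    """Queue only blog entries published after the previous harvest.

    `last_harvest` is a time.struct_time of the previous run and
    `enqueue_for_crawling` is a hypothetical callback handing a URL
    to the crawler.
    """
    feed = feedparser.parse(feed_url)
    newest = last_harvest
    for entry in feed.entries:
        published = entry.get("published_parsed") or entry.get("updated_parsed")
        if published and published > last_harvest:
            enqueue_for_crawling(entry.link)  # crawl only new or updated content
            newest = max(newest, published)
    return newest  # becomes last_harvest for the next poll

# Illustrative usage, polling hourly:
# import time
# last = time.gmtime(0)
# while True:
#     last = poll_feed("http://example.com/feed", last, print)
#     time.sleep(3600)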
2.2.1 Blog Archiving Projects
Here, we review projects and initiatives related to blog preservation. To this end, we inspect the existing solutions of the IIPC [66] for web archiving and the ArchivePress blog archiving project. Furthermore, we look into EC-funded research projects such as Living Web Archives (LiWA) [91], SCalable Preservation Environments (SCAPE) [126] and Collect-All ARchives to COmmunity MEMories (ARCOMEM) [120], which focus on preserving dynamic content, web-scale preservation activities, and how to identify important web content that should be selected for preservation. Table 2.1 provides an overview of the related initiatives and projects.
Initiative | Description | Started
ArchivePress | Explore practical issues around the archiving of weblog content, focusing on blogs as records of institutional activity and corporate memory | 2009
ARCOMEM | Leverage the Wisdom of the Crowds for content appraisal, selection and preservation, to create and preserve archives that reflect collective memory and social content perception, and are, thus, closer to current and future users | 2011
IIPC projects | Web archiving tools for acquisition, curation, access and search | 1996
LiWA | Develop and demonstrate web archiving tools able to capture content from a wide variety of sources, to improve archive fidelity and authenticity and to ensure long term interpretability of web content | 2009
PageFreezer.com | Enterprise Class On Demand web archiving and replay service | 2006
Preservica.com | Enterprise Class Cloud-based web archiving service | 2012
SCAPE | Developing an infrastructure and tools for scalable preservation actions | 2011
WebPreserver.com | Google Chrome Plugin to preserve Social Media | 2015

Table 2.1: Overview of related initiatives and projects
The International Internet Preservation Consortium (IIPC) is the leading international organization dedicated to improving the tools, standards and best practices of web archiving. The software it provides as open source comprises tools for acquisition (Heritrix [102]), curation (Web Curator Tool [108] and NetarchiveSuite10) and access and finding (Wayback11, NutchWAX12, and WERA13). These tools are widely accepted and used by the majority of internet archiving initiatives [61]. However, the IIPC tools cause several problems for blog preservation. First, the tools for acquisition and curation use a schedule-based approach to determine the point in time when the content should be captured for archiving. Thus, the life of a website is not recorded appropriately if the page is updated more often than it is crawled [71]. Given that many blogs are frequently updated, an approach which considers update events (e.g. new post, new comment, etc.) as triggers for crawling activities would be more suitable. Secondly, the archiving approach of the IIPC considers the webpage as the digital object that is preserved and can be accessed. However, a blog consists of several smaller entities, such as posts and comments. Therefore, while archives based on IIPC tools enable a structural blogosphere analysis, a specialised archiving system based on the inherent structure of blogs also facilitates further analyses, such as of issues or events [144].

The ArchivePress project was an initial effort to attack the problem of blog archiving from a different perspective than traditional web crawlers. To our knowledge, it is the only existing open source blog-specific archiving software. ArchivePress utilises XML feeds produced by blog platforms to achieve better archiving [115]. The scope of the project explicitly excludes the harvesting of the full browser rendering of blog contents (headers, sidebars, advertising and widgets), focusing solely on collecting the marked-up text of blog posts and blog comments (including embedded media). The approach was suggested by the observation that blog content is frequently consumed through automated syndication and aggregation in news reader applications, rather than by navigation of the blog websites themselves. Chris Rusbridge, then Director of the Digital Curation Centre at Edinburgh University, observed, with reason, that “blogs represent an area where the content is primary and design secondary” [123]. Contrary to the solutions of the IIPC, ArchivePress utilises update information of the blogs to launch capturing activities instead of a predefined schedule. For this purpose, it takes advantage of RSS feeds, a ubiquitous feature of blogs. Thus, blogs can be captured according to their activity level, and it is more likely that the whole lifecycle of the blog can be preserved. However, ArchivePress also has a strong limitation because it does not access the actual blog page but only its RSS feed. Thus, ArchivePress does not aggregate the complete blog content but only the portion which is published in RSS feeds, because feeds potentially contain just partial content instead of the full text and do not contain advertisements, formatting markup, or reader comments [50]. Even though blog preservation does not necessarily mean preserving every aspect of a blog [9], and requires instead the identification of significant properties [133], the restriction to RSS feeds would prevent successful blog preservation in various cases. RSS references only recent blog posts and not older ones.
What is more, static blog pages are not listed at all in RSS. Several European Commission funded digital preservation projects are also relevant to blog preservation in various ways. The LiWA project focuses on creating long term web archives,
10https://sbforge.org/display/NAS/Releases+and+downloads, accessed August 1, 2015
11http://archive-access.sourceforge.net/projects/wayback/, accessed August 1, 2015
12http://archive-access.sourceforge.net/projects/nutch/, accessed August 1, 2015
13http://archive-access.sourceforge.net/projects/wera/, accessed August 1, 2015
filtering out irrelevant content and trying to handle a wide variety of content. Its approach is valuable as it focuses on many aspects of web preservation, but its drawback is that it provides specific components which are integrated with the IIPC tools and are not generic. On the other hand, the ARCOMEM project focuses mainly on social-web-driven content appraisal and selection, and intelligent content acquisition. Its aim is to detect important content regarding events and topics, in order to preserve it. Its approach is unique regarding content acquisition and could be valuable for detecting important blog targets for preservation, but it does not advance the state of the art regarding preservation, management and dissemination of archived content. Another relevant project is SCAPE, which aims to create scalable services for planning and execution of preservation strategies. SCAPE does not directly advance the state of the art with new approaches to web preservation but aims at scalability only. Its outcome could assist in the deployment of web-scale blog preservation systems.

Besides the presented initiatives, there is an entire software industry sector focused on commercial web archiving services. Representative examples include Hanzo Archives: Social Media Archiving14, Archive-it: a web archiving service to harvest and preserve digital collections, and PageFreezer: social media and website archiving15. However, due to the commercial nature of these services, it is not possible to find much information on their preservation strategies and technologies. Furthermore, it is impossible to know how long these companies will support these services or even remain in business. Thus, we believe that they cannot be considered in our evaluation.
14http://www.hanzoarchives.com/products/social_media_archiving, accessed: August 1, 2015 15http://pagefreezer.com/, accessed: August 1, 2015
Chapter 3
An Innovative Method to Evaluate Website Archivability
We establish the notion of Website Archivability, a concept which captures the core aspects of a website that are crucial in diagnosing whether it has the potential to be archived with completeness and accuracy. We present the two iterations of the Credible Live Evaluation method for Archive Readiness (CLEAR) and CLEAR+ to evaluate Website Archivability for any website. We outline the architecture and implementation of http://archiveready.com (ArchiveReady), a reference implementation of CLEAR+. We conduct thorough evaluations of significant datasets in order to support the validity, the reliability and the benefits of our method. Finally, we evaluate the Website Archivability of the most prevalent web content management systems and present our observations and improvement suggestions.1

1This chapter is based on the following publications:
• Banos V., Manolopoulos Y.: “Web Content Management Systems Archivability”, Proceedings 19th East-European Conference on Advances in Databases & Information Systems (ADBIS), Springer Verlag, LNCS Vol.9282, Poitiers, France, 2015.
• Banos V., Manolopoulos Y.: “A Quantitative Approach to Evaluate Website Archivability Using the CLEAR+ Method”, International Journal on Digital Libraries, 2015.
• Banos V., Kim Y., Ross S., Manolopoulos Y.: “CLEAR: a Credible Method to Evaluate Website Archivability”, Proceedings 10th International Conference on Preservation of Digital Objects (iPRES), Lisbon, Portugal, 2013.
3.1 Introduction
Web archiving has to examine two key aspects: organizational and technical. The organizational aspect of web archiving involves the entity that is responsible for the process, its governance, funding, long-term viability and the personnel responsible for the web archiving tasks [116]. The technical aspect involves the procedures of web content identification, acquisition, ingest, organization, access and use [44, 132]. We address two of the main challenges associated with the technical aspects of web archiving: (a) the acquisition of web content and (b) the Quality Assurance (QA) evaluation performed before it is ingested into a web archive.

Web content acquisition and ingest is a critical step in the process of web archiving; if the initial Submission Information Package (SIP) lacks completeness and accuracy for any reason (e.g. missing or invalid web content), then the rest of the preservation processes are rendered useless. In particular, QA is a vital stage in ensuring that the acquired content is complete and accurate. The peculiarity of web archiving systems in comparison to other archiving systems is that the SIP is preceded by an automated extraction step. Websites often contain rich information not available on their surface. While the great variety and versatility of website structures, technologies and types of content is one of the strengths of the web, it is also a serious weakness. There is no guarantee that web bots dedicated to performing web crawling can access and retrieve website content successfully [69]. Websites benefit from following established best practices, international standards and web technologies, which make them amenable to being archived. We define the sum of the attributes that make a website amenable to being archived as Website Archivability. This work aims to:
• Provide mechanisms to improve the quality of web archive content (e.g. facilitate access, enhance content integrity, identify core metadata gaps). • Expand and optimize the knowledge and practices of Web archivists, supporting them in their decision making and risk management processes. • Standardize the web aggregation practices of web archives, especially in relation to QA. • Foster good practices in website development and Web content authoring that make websites more amenable to harvesting, ingesting, and preserving. • Raise awareness among web professionals regarding web preservation. • Make observations regarding the archivability of the 12 most prominent Web Content Management Systems and suggest improvements.
We define the Credible Live Evaluation of Archive Readiness Plus (CLEAR+) method, a set of metrics to quantify the level of archivability of any website. This method is designed to consolidate, extend and complement empirical web aggregation practices through the formulation of a standard process to measure whether a website is archivable. The main contributions of this work are:
• an introduction of the notion of Website Archivability,
• a definition of the Credible Live Evaluation of Archive Readiness Plus (CLEAR+) method to measure Website Archivability,
• a detailed architecture and implementation outline of ArchiveReady.com, an online system that functions as a reference implementation of the method,
• an extended evaluation using real world datasets,
• a proof that the CLEAR+ method needs only to evaluate a single webpage to calculate a website’s Archivability,
• an evaluation of the WA of 12 prominent Web Content Management Systems (WCMS) and a presentation of observations and improvement suggestions.
The concept of CLEAR emerged from our research in web preservation in the context of the BlogForever project, which involves weblog harvesting and archiving [80]. Our work revealed the need for a method to assess website archive readiness in order to support web archiving workflows. The remainder of this chapter is organized as follows: Section 3.2 introduces the CLEAR+ method, Section 3.3 presents the ArchiveReady system, and Section 3.4 presents the experimental evaluation and results. Section 3.5 presents the WCMS archivability survey. Finally, we conclude with some discussion and remarks in Section 3.6.
3.2 Credible Live Evaluation method for Archive Readiness Plus (CLEAR+)
We present the Credible Live Evaluation of Archive Readiness Plus method (CLEAR+) as of 08/2014. We focus on its requirements, main components, WA Facets and evaluation methods. We also include an example website evaluation to illustrate the application of CLEAR+ in a detailed manner. The CLEAR+ method proposes an approach to produce an on-the-fly measurement of WA, which is defined as the extent to which a website meets the conditions for the safe transfer of its content to a web archive for preservation purposes [17]. All web archives currently employ some form of crawler technology to collect the content of target websites. They communicate through HTTP requests and responses, processes that are agnostic of the repository system of the archive. Information, such as the unavailability of webpages and other errors, is accessible as part of this communication exchange and could be used by the web archive to support archival decisions (e.g. regarding retention, risk management, and characterisation). Here, we combine this kind of information with an evaluation of the website’s compliance with recognised practices in digital curation (e.g. using adopted standards, validating formats, and assigning metadata) to generate a credible score representing the archivability of target websites. The main components of CLEAR+ are:
1. WA Facets: the factors that come into play and need to be taken into account to calculate total WA.
2. Website Attributes: the website homepage elements analysed to assess the WA Facets (e.g. the HTML markup code).
3. Evaluations: the tests executed on the website attributes (e.g. HTML code validation against W3C HTML standards) and the approach used to combine the test results to calculate the WA metrics.
It is important to highlight that WA is meant to evaluate entire websites and is not intended to evaluate individual webpages. This is due to the fact that many of the attributes used in the evaluation are website attributes and not attributes of a specific webpage. The correct way to use WA is to provide the website home page as input. Furthermore, in Section 3.4.4 we prove that our method needs to evaluate only the home webpage to calculate the WA of the target website, based on the premise that webpages of the same website share the same components, standards and technologies. WA must also not be confused with website dependability, since the former refers to the ability to archive a website, whereas the latter is a system property that integrates several attributes, such as reliability, availability, safety, security, survivability and maintainability [11].

In the rest of this Section we present the CLEAR+ method in detail. First, we look into the requirements of reliable, high quality metrics and how the CLEAR+ method fulfills them (Section 3.2.1). We continue with the way each of the CLEAR+ components is examined with respect to aspects of web crawler technology (e.g. hyperlink validation; performance measurement) and general digital curation practices (e.g. file format validation; use of metadata) to propose four core constituent Facets of WA (Section 3.2.2). We further describe the website attributes (e.g. HTML elements; hyperlinks) which are used to examine each WA Facet (Section 3.2.3), and propose a method for combining tests on these attributes (e.g. validation of image format) to produce a quantitative measure that represents the Website’s Archivability (Section 3.2.4). To illustrate the application of CLEAR+, we present an example in Section 3.2.5. Finally, we outline the development of CLEAR+ in comparison with CLEAR in Section 3.2.6.
3.2.1 Requirements
For a newly introduced method and a novel metric such as WA, it is necessary to evaluate its properties. According to [113], a good metric must be Quantitative, Discriminative, Fair, Scalable and Normative. In the following, we explain how the WA metric satisfies these requirements.
1. Quantitative: WA can be measured as a quantitative score that provides a continuous range of values from perfectly archivable to completely not archivable. WA allows assessment of change over time, as well as comparison between websites or between groups of websites. For more details see the evaluation using assorted datasets in Section 3.4.2.
2. Discriminative: The metric’s range of values has a large discriminating power beyond simply archivable and not archivable. The discriminating power of the metric allows assessment of the rate of change. See the underlying theory and an example implementation of the metric in Sections 3.2.4 and 3.2.5.
3. Fair: The metric is fair, taking into account all the attributes of a web resource and performing a large number of evaluations. Moreover, it also takes into account and adjusts to the size and complexity of the websites. WA is evaluated from multiple different aspects, using several WA Facets as presented in Section 3.2.2.
4. Scalable: The metric is scalable and able to support large-scale WA studies given the relevant resources. WA supports aggregation and second-order statistics, such as
STDDEV. Also, WA is calculated in an efficient way; its cost is proportional to the number of web resources used in a webpage, and it is calculated in real time. The scalability of the archiveready.com platform is presented in Section 3.3.2.
5. Normative: The metric is normative, deriving from international standards and guidelines. WA stems from established metadata standards, preservation standards, W3C guidelines, etc. The proposed metric is based on established digital preservation practices. All WA aspects are presented in Section 3.2.2.
The WA metric has many strengths, such as objectivity, practicality and the ability to support large-scale assessments without many resources. In the following, we focus on each WA Facet.
3.2.2 Website Archivability Facets
WA can be measured from several different perspectives. Here, we have called these perspectives WA Facets (see Figure 3.1). The selection of these facets is motivated by a number of considerations:
Figure 3.1: WA Facets: An Overview.
1. whether there are verifiable guidelines to indicate what and where information is held at the target website, and whether access is available and permitted by a high performance web server (i.e. Accessibility, see Section 3.2.2);
2. whether the included information follows a common set of format and/or language specifications (i.e. Standards Compliance, see Section 3.2.2);
3. the extent to which information is independent from external support (i.e. Cohesion, see Section 3.2.2); and,
4. the level of extra information available about the content (i.e. Metadata Usage, see Section 3.2.2).
Certain classes and specific types of errors create more or fewer obstacles to web archiving. The CLEAR+ algorithm determines the significance of each evaluation based on the following criteria:
1. High Significance: Critical issues which prevent web crawling or may cause highly problematic web archiving results.
2. Medium Significance: Issues which are not critical but may affect the quality of web archiving results.
3. Low Significance: Minor details which do not cause any issues when they are missing but will help web archiving when available.
Each WA Facet is computed as the weighted average of the scores of the questions associated with this Facet. The significance of each question defines its weight. The WA calculation is presented in detail in Section 3.2.4. Finally, it must be noted that a single evaluation may impact more than one WA Facet. For instance, the presence of a Flash menu in a website has a negative impact on the Accessibility Facet, because web archives cannot detect hyperlinks inside Flash, and also on the Standards Compliance Facet, because Flash is not an open standard.
퐹퐴: Accessibility
A website is considered archivable only if web crawlers are able to visit its home page, traverse its content and retrieve it via standard HTTP protocol requests [54]. If a crawler cannot find the location of all web resources, then it will not be possible to retrieve the content. It is not only necessary to put resources on a website; it is also essential to provide proper references to allow crawlers to discover and retrieve them effectively and efficiently.

Performance is also an important aspect of web archiving. The throughput of data acquisition of a web bot directly affects the number and complexity of web resources it is able to process. The faster the performance, the faster the ingestion of web content, which improves a website’s archiving process. It is important to highlight that we evaluate performance using the initial HTTP response time and not the total transfer time, because the former depends on server performance characteristics, whereas the latter depends on file size.

Example 1: a web developer creates a website containing a Flash menu, which requires a proprietary web browser plugin to render properly. Web crawlers cannot access the Flash menu contents, so they are not able to find the web resources referenced in the menu. Thus, the web archive fails to access all available website content.

Example 2: a website is archivable only if it can be fully and correctly retrieved by a third party application using HTTP protocols. If a website employs any other protocol, web crawlers will not be able to copy all its content.

Example 3: if the performance of a website is slow or web crawling is throttled using some artificial mechanism, web crawlers will have difficulties in aggregating content and they may even abort if the performance degrades below a specific threshold.

To support WA, the website should, of course, provide valid links. In addition, a set of maps, guides, and updates for links should be provided to help crawlers find all the content (see Figure 3.1). These can be exposed in feeds, sitemap.xml [127], and robots.txt3 files. Proper HTTP protocol support for ETags, datestamps and other features should also be considered [38, 63].
2https://developers.google.com/speed/docs/insights/Server, accessed August 1, 2015
3http://www.robotstxt.org/, accessed August 1, 2015
Id | Description | Significance
퐴1 | Check the percentage of valid vs. invalid hyperlink and CSS URLs. These URLs are critical for web archives to discover all website content and render it successfully. | High
퐴2 | Check if inline JavaScript code exists in the HTML. Inline JavaScript may be used to dynamically generate content (e.g. via AJAX requests), creating obstacles for web archiving systems. | High
퐴3 | Check if sitemap.xml exists. Sitemap.xml files are meant to include references to all the webpages of the website. This feature is critical to identify all website content with accuracy and efficiency. | High
퐴4 | Calculate the maximum initial response time of all HTTP requests. The rating ranges from 100% for an initial response time less than or equal to 0.2 sec to 0% if the initial response time is more than 2 sec. The limits are imposed based on Google Developers speed info2. The rationale is that high performance websites facilitate faster and more efficient web archiving. | High
퐴5 | Check if proprietary file formats such as Flash and QuickTime are used. Web crawlers cannot access the contents of proprietary files, so they are not able to find the web resources referenced in them. Thus, the web archive fails to access all available website content. | High
퐴6 | Check if the robots.txt file contains any “Disallow:” rules. These rules may block web archives from retrieving parts of a website, but it must be noted that not all web archives respect them. | Medium
퐴7 | Check if the robots.txt file contains any “Sitemap:” rules. These rules may help web archives locate one or more sitemap.xml files with references to all the webpages of the website. Although not critical, this rule may help web archives identify sitemap.xml files located in non-standard locations. | Medium
퐴8 | Check the percentage of downloadable linked media files. Valid media file links are important to enable web archives to retrieve them successfully. | Medium
퐴9 | Check if any HTTP Caching headers (Expires, Last-Modified or ETag) are set. They are important because they can be used by web crawlers to avoid retrieving unmodified content, accelerating web content retrieval. | Medium
퐴10 | Check if RSS or Atom feeds are referenced in the HTML source code using RSS autodiscovery. RSS feeds function similarly to sitemap.xml files, providing references to webpages in the current website. RSS feeds are not always present; thus, they can be considered not absolutely necessary for web archiving and of low significance. | Low

Table 3.1: 퐹퐴: Accessibility Evaluations
The Accessibility Evaluations performed are presented in detail in Table 3.1. For each one of the presented evaluations, a score in the range of 0-100 is calculated depending on the success of the evaluation.
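To make two of these checks concrete, the sketch below computes ratings for evaluations 퐴3 (sitemap.xml existence) and 퐴4 (initial response time) with the python-requests module mentioned in Section 3.3. The 0.2 sec and 2 sec limits come from Table 3.1; the linear interpolation between them is an illustrative assumption, not necessarily the exact scoring curve of ArchiveReady.

import requests
from urllib.parse import urljoin

def response_time_score(url, timeout=5):
    """A4: 100 for an initial response <= 0.2 sec, 0 above 2 sec,
    linear in between (the interpolation is an assumption)."""
    # `elapsed` measures the time until the response headers arrive,
    # i.e. the initial response time rather than the total transfer time.
    elapsed = requests.get(url, timeout=timeout, stream=True).elapsed.total_seconds()
    if elapsed <= 0.2:
        return 100
    if elapsed > 2.0:
        return 0
    return round(100 * (2.0 - elapsed) / 1.8)

def sitemap_score(url, timeout=5):
    """A3: 100 if /sitemap.xml exists at the site root, 0 otherwise."""
    try:
        status = requests.head(urljoin(url, "/sitemap.xml"), timeout=timeout).status_code
        return 100 if status == 200 else 0
    except requests.RequestException:
        return 0

# print(sitemap_score("http://auth.gr/"), response_time_score("http://auth.gr/"))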
퐹푆 : Standards Compliance
Compliance with standards is a sine qua non theme in digital curation practices (e.g. see the Digital Preservation Coalition guidelines [39]). It is recommended that digital resources to be preserved are represented in known and transparent standards. The standards themselves could be proprietary, as long as they are widely adopted and well understood, with supporting tools for validation and access. Above all, the standard should support disclosure, transparency, minimal external dependencies and no legal restrictions with respect to preservation processes that might take place within the archive4.

Disclosure refers to the existence of complete documentation, so that, for example, file format validation processes can take place. Format validation is the process of determining whether a digital object meets the specifications for the format it purports to be. A key question in digital curation is, “I have an object purportedly of format F; is it really F?” [103]. Considerations of transparency and external dependencies refer to the resource’s openness to basic tools (e.g. the W3C HTML standard validation tool; the JHOVE2 format validation tool [46]).

Example: if a webpage has not been created using accepted standards, it is unlikely to be renderable by web browsers using established methods. Instead, it is rendered in “quirks mode”, a custom technique to maintain compatibility with older/broken webpages. The problem is that quirks mode behaviour varies considerably between browsers; as a result, one cannot depend on it to obtain a standard rendering of the website in the future. It is true that, using emulators, one may be able to render these websites in the future, but this is rarely the case for the average user who will be accessing the web archive with his/her latest web browser.

We recommend that validation is performed for three types of content (see Table 3.2): webpage components (e.g. HTML and CSS), referenced media content (e.g. audio, video, images, documents), and HTTP protocol headers used for communication and supporting resources (e.g. robots.txt, sitemap.xml, JavaScript). The website is checked for Standards Compliance on three levels: referenced media formats (e.g. images and audio included in the webpage), the webpage itself (e.g. HTML and CSS markup) and resources (e.g. sitemap, scripts). Each one of these is expressed using a set of specified file formats and/or languages. The languages (e.g. XML) and formats (e.g. JPEG) are validated using tools such as the W3C HTML [141] and CSS validators5, the JHOVE2 and/or Apache Tika6 file format validators, a Python XML validator7 and a robots.txt checker8.
4http://www.digitalpreservation.gov/formats/sustain/sustain.shtml, accessed August 1, 2015
5http://jigsaw.w3.org/css-validator/, accessed August 1, 2015
6http://tika.apache.org/, accessed August 1, 2015
7http://code.google.com/p/pyxmlcheck/, accessed August 1, 2015
8http://tool.motoricerca.info/robots-checker.phtml, accessed August 1, 2015
Id | Description | Significance
푆1 | Check if the HTML source code complies with the W3C standards. This is critical because invalid HTML may lead to invalid content processing and unrenderable archived web content in the future. | High
푆2 | Check the usage of QuickTime and Flash file formats. Digital preservation best practices are in favor of open standards, so it is considered problematic to use these types of files. | High
푆3 | Check the integrity and the standards compliance of images. This is critical to detect potential problems with image formats and corruption. | Medium
푆4 | Check if the RSS feed format complies with W3C standards. This is important because invalid RSS feeds may prevent web crawlers from analysing them and extracting metadata or references to website content. | Medium
푆5 | Check if the HTTP Content-Encoding or Transfer-Encoding headers are set. They are important because they provide information regarding the way the content is transferred. | Medium
푆6 | Check if any HTTP Caching headers (Expires, Last-Modified or ETag) are set. They are important because they may help web archives avoid downloading unmodified content, improving their performance and efficiency. | Medium
푆7 | Check if the CSS referenced in the HTML source code complies with W3C standards. This is important because invalid CSS may lead to unrenderable archived web content in the future. | Medium
푆8 | Check the integrity and the standards compliance of HTML5 Audio elements. This is important to detect a wide array of problems with audio formats and corruption. | Medium
푆9 | Check the integrity and the standards compliance of HTML5 Video elements. This is important to detect potential problems with video formats and corruption. | Medium
푆10 | Check if the HTTP Content-Type header exists. This is significant because it provides information to the web archives about the content and it may potentially help interpret it. | Medium
Table 3.2: 퐹푆 Standards Compliance Facet Evaluations
We also have to note that we are checking the usage of QuickTime and Flash explicitly because they are the major closed standard file formats with the greatest adoption on the web, according to the HTTP Archive9.
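As an illustration of how evaluation 푆1 can be carried out, the sketch below submits HTML markup to the W3C Nu HTML Checker and counts the reported errors. The endpoint URL and JSON response layout are assumptions based on the checker's public interface; ArchiveReady itself relies on the W3C validator APIs listed in Section 3.3.

import requests

def count_html_errors(html_markup):
    """Submit markup to the W3C Nu HTML Checker and count reported errors.

    The endpoint and response format are assumptions; adjust them if the
    public service changes.
    """
    response = requests.post(
        "https://validator.w3.org/nu/?out=json",
        data=html_markup.encode("utf-8"),
        headers={"Content-Type": "text/html; charset=utf-8"},
        timeout=30,
    )
    messages = response.json().get("messages", [])
    return sum(1 for m in messages if m.get("type") == "error")

# In the example of Section 3.2.5, the AUTH homepage reports multiple HTML
# errors, which is why evaluation S1 is rated 0% in Table 3.7.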
9http://httparchive.org/, accessed August 1, 2015
퐹퐶 : Cohesion
Cohesion is relevant both to the efficient operation of web crawlers and to the management of dependencies within digital curation (e.g. see the NDIIPP comment on format dependencies [8]). If the files comprising a single webpage are dispersed across different services (e.g. different servers for images, JavaScript widgets, other resources) in different domains, the acquisition and ingest risk being neither complete nor accurate. If one of the multiple services fails, the website fails as well. Here we characterise the robustness of the website against this kind of failure as Cohesion. It must be noted that we use the top-level domain and not the host name to calculate Cohesion. Thus, both http://www.test.com and http://images.test.com belong to the top-level domain test.com. Example: a Flash widget used in a website but hosted elsewhere may cause problems in web archiving because it may not be captured when the website is archived. More important is the case where the target website depends on third party websites whose future availability is unknown; then new kinds of problems are likely to arise.
Id | Description | Significance
퐶1 | The percentage of local vs. remote images. | Medium
퐶2 | The percentage of local vs. remote CSS. | Medium
퐶3 | The percentage of local vs. remote script tags. | Medium
퐶4 | The percentage of local vs. remote video elements. | Medium
퐶5 | The percentage of local vs. remote audio elements. | Medium
퐶6 | The percentage of local vs. remote proprietary objects (Flash, QuickTime). | Medium
Table 3.3: 퐹퐶 Cohesion Facet Evaluations.
The premise is that keeping information associated with the same website together (e.g. using the same host for a single instantiation of the website content) makes the preserved resources robust against changes that occur outside of the website (cf. encapsulation10). Cohesion is tested at two levels:
1. examining how many domains are employed in relation to the location of referenced media content (images, video, audio, proprietary files),
2. examining how many domains are employed in relation to supporting resources (e.g. robots.txt, sitemap.xml, CSS and JavaScript files).
The level of Cohesion is measured by the extent to which the material associated with the website is kept within one domain. This is measured by the proportion of content, resources, and plugins that are sourced internally. This can be examined through an analysis of links, at the level of referenced media content and at the level of supporting resources (e.g. JavaScript). In addition, the proportion of content relying on predefined proprietary software can be assessed and monitored. The Cohesion Facet evaluations are presented in Table 3.3.
10http://www.paradigm.ac.uk/workbook/preservation-strategies/selecting-other.html, accessed August 1, 2015
One may argue that if we choose to host website files across multiple services, they could still be saved in case the website failed. This is true but our aim is to archive the website as a whole and not each independent file. Distributing the files in multiple locations increases the possibility of losing some of these files.
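A minimal sketch of the Cohesion measurement for one resource class (퐶1, images) follows: it computes the proportion of images whose registered (top-level) domain matches that of the website, so that both http://www.test.com and http://images.test.com count as local. The use of the tldextract library and the truncation to an integer percentage are illustrative assumptions.

import tldextract  # resolves registered domains, e.g. images.test.com -> test.com

def registered_domain(url):
    ext = tldextract.extract(url)
    return f"{ext.domain}.{ext.suffix}"

def image_cohesion_score(site_url, image_urls):
    """C1: percentage of images hosted under the website's own domain."""
    if not image_urls:
        return None  # not applicable, as with the '-' entries in Table 3.8
    site_domain = registered_domain(site_url)
    local = sum(1 for u in image_urls if registered_domain(u) == site_domain)
    return int(100 * local / len(image_urls))

# Example from Table 3.8: 14 local and 2 external images give int(1400/16) = 87.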
퐹푀 : Metadata Usage
The adequate provision of metadata (e.g. see the Digital Curation Centre Curation Reference Manual chapters on metadata [101], preservation metadata [33], archival metadata [48], and learning object metadata [93]) has been a continuing concern within digital curation (e.g. see the seminal article by Lavoie [87] and insightful discussions going beyond preservation11). The lack of metadata impairs the archive’s ability to manage, organise, retrieve and interact with content effectively. It is widely recognised that it makes understanding the context of the material a challenge.
Id | Description | Significance
푀1 | Check if the HTTP Content-Type header exists. This is significant because it provides information to the web archives about the content and may potentially help retrieve more information. | Medium
푀2 | Check if any HTTP Caching headers (Expires, Last-Modified or ETag) are set. They are important because they provide extra information regarding the creation and last modification of web resources. | Medium
푀3 | Check if the HTML meta robots noindex, nofollow, noarchive, nosnippet and noodp tags are used in the markup. If true, they instruct the web archives to avoid archiving the website. This tag is optional and usually omitted. | Low
푀4 | Check if the DC profile12 is used in the HTML markup. This evaluation is optional and with low significance. If the DC profile exists, it will help the web archive obtain more information regarding the archived content. If absent, there will be no negative effect. | Low
푀5 | Check if the FOAF profile [27] is used in the HTML markup. This evaluation is optional and with low significance. If the FOAF profile exists, it will help the web archive obtain more information regarding the archived content. If it does not exist, it will not have any negative effect. | Low
푀6 | Check if the HTML meta description tag exists in the HTML source code. The meta description tag is optional and with low significance. It does not affect web archiving directly but affects the information we have about the archived content. | Low
Table 3.4: 퐹푀 Metadata Facet Evaluations
11http://www.activearchive.com/content/what-about-metadata, accessed August 1, 2015
12http://dublincore.org/documents/2008/08/04/dc-html/, accessed August 1, 2015
We consider metadata on three levels. To avoid the dangers associated with committing to any specific metadata model, we have adopted a general viewpoint shared across many information disciplines (e.g. philosophy, linguistics, computer science) based on syntax (e.g. how it is expressed), semantics (e.g. what it is about) and pragmatics (e.g. what can you do with it). There are extensive discussions on metadata classification depending on their application (e.g. see the National Information Standards Organization classification [117]; discussion in the Digital Curation Centre Curation Reference Manual chapter on Metadata [101]). Here we avoid these fine-grained discussions and focus on the fact that much of the metadata examined in the existing literature can be exposed already at the time that websites are created and disseminated. For example, metadata such as transfer and content encoding can be included by the server in the HTTP headers. The end-user language required to understand the content can be indicated as an attribute of the HTML element. Descriptive information (e.g. author, keywords) that can help in understanding how the content is classified can be included in HTML META element attributes and values. Metadata that support rendering information, such as application and generator names, can also be included in the HTML META element. The use of other well-known metadata and description schemas (e.g. Dublin Core [143]; Friend of a Friend (FOAF) [27]; Resource Description Framework (RDF) [99]) can be included to promote better interoperability. The existence of selected metadata elements can be checked as a way of increasing the probability of implementing automated extraction and refinement of metadata at harvest, ingest, or a subsequent stage of repository management. The Metadata Usage Facet evaluations are presented in Table 3.4.
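A minimal sketch of three of the metadata checks (푀4-푀6) follows, using BeautifulSoup, which ArchiveReady also employs for HTML analysis (Section 3.3). The exact patterns matched for Dublin Core and FOAF are illustrative assumptions; real pages may declare these profiles in other ways.

from bs4 import BeautifulSoup

def metadata_checks(html_markup):
    """Detect Dublin Core, FOAF and meta description usage (M4-M6)."""
    soup = BeautifulSoup(html_markup, "html.parser")
    lowered = html_markup.lower()
    return {
        # DC profile: either a purl.org/dc namespace reference or DC.* meta names.
        "dc_profile": "purl.org/dc/" in lowered or soup.find(
            "meta", attrs={"name": lambda v: v and v.lower().startswith("dc.")}) is not None,
        # FOAF profile: a reference to the FOAF namespace.
        "foaf_profile": "xmlns.com/foaf" in lowered,
        # Plain HTML meta description tag.
        "meta_description": soup.find("meta", attrs={"name": "description"}) is not None,
    }

# Each detected element yields a 100% rating for the corresponding
# low-significance evaluation; absent elements are treated as not
# applicable, as in Table 3.9.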
3.2.3 Attributes
We summarise what website attributes we evaluate to calculate WA. They are also presented in Figure 3.2.
Figure 3.2: Website attributes evaluated for WA
RSS: The existence of an RSS feed allows the publication of webpage content that can be automatically syndicated or exposed. It allows web crawlers to automatically retrieve updated content, whereas the standardised format of the feeds allows access by many different applications. For example, the BBC uses feeds to let readers see when new content has been added13.
Robots.txt: The file robots.txt indicates to a web crawler which URLs it is allowed to crawl. The use of robots.txt helps ensure that the retrieval of website content is aligned with the permissions and special rights associated with the website.
13http://www.bbc.co.uk/news/10628494, accessed August 1, 2015
Sitemap.xml: The Sitemaps protocol, jointly supported by the most widely used search engines to help content creators and search engines, is an increasingly used way to unlock hidden data by making it available to search engines [127]. To implement the Sitemaps protocol, the file sitemap.xml is used to list all the website pages and their locations. The location of this sitemap, if it exists, can be indicated in the robots.txt file. Regardless of its inclusion in the robots.txt file, the sitemap, if it exists, should ideally be called ‘sitemap.xml’ and be put at the root of the web server (e.g. http://www.example.co.uk/sitemap.xml).
HTTP Headers: HTTP is the protocol used to transfer content from the web server to the web archive. HTTP is very important as it carries significant information regarding many web content aspects.
Source code and linked web resources: The source code of the website (HTML, JavaScript, CSS).
Binary files: The binary files included in the webpage (images, pdf, etc.).
Hyperlinks: Hyperlinks comprise the net that links the web together. The hyperlinks of the website can be examined for availability as an indication of website accessibility. The lack of hyperlinks does not impact WA, but the existence of missing and/or broken links should be considered problematic.
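The robots.txt attribute feeds evaluations 퐴6 and 퐴7 directly; a minimal sketch of how the relevant rules can be extracted is shown below. The parsing is deliberately simple and the returned dictionary layout is an illustrative assumption.

import requests
from urllib.parse import urljoin

def robots_txt_rules(site_url, timeout=5):
    """Collect Disallow and Sitemap rules from a website's robots.txt (A6, A7)."""
    empty = {"has_robots_txt": False, "disallow_rules": [], "sitemaps": []}
    try:
        response = requests.get(urljoin(site_url, "/robots.txt"), timeout=timeout)
    except requests.RequestException:
        return empty
    if response.status_code != 200:
        return empty

    disallow, sitemaps = [], []
    for line in response.text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "disallow" and value:      # an empty Disallow: rule allows everything
            disallow.append(value)
        elif key == "sitemap" and value:
            sitemaps.append(value)
    return {"has_robots_txt": True, "disallow_rules": disallow, "sitemaps": sitemaps}

# For http://auth.gr/robots.txt, Table 3.6 reports multiple Disallow rules
# (A6 rated 0%) and no Sitemap declaration (A7 rated 0%).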
3.2.4 Evaluations
Combining the information discussed in Section 3.2.2 to calculate a score for WA goes through the following steps.
1. The WA potential with respect to each facet is represented by an N-tuple (x_1, ..., x_k, ..., x_N), where x_k equals 0 or 1 and represents a negative or positive answer, respectively, to the binary question asked about that facet, whereas N is the total number of questions associated with that facet. An example question in the case of the Standards Compliance Facet would be “I have an object purportedly of format F; is it?” [103]; if there are M files for which format validation is being carried out, then there will be M binary questions of this type.
2. Not all questions are considered of equal value to the facet. Depending on their significance (Low, Medium and High), they have a different weight w_k = 1, 2 or 4, respectively. The weights follow a power law distribution where Medium is twice as important as Low and High is twice as important as Medium. The value of each facet is the weighted average of its coordinates:

F_\lambda = \frac{\sum_{k=1}^{N} w_k \, x_k}{C}    (3.1)

where w_k is the weight assigned to question k and

C = \sum_{i=1}^{N} w_i

Once the rating with respect to each facet is calculated, the total measure of WA can simply be defined as:

WA = \sum_{\lambda \in \{A,S,C,M\}} w_\lambda \, F_\lambda    (3.2)

where F_A, F_S, F_C, F_M are the WA values with respect to Accessibility, Standards Compliance, Cohesion and Metadata Usage, respectively, and

\sum_{\lambda \in \{A,S,C,M\}} w_\lambda = 1, \quad 0 \le w_\lambda \le 1 \;\; \forall \lambda \in \{A,S,C,M\}
Depending on the curation and preservation objectives of the web archive, the significance of each facet is likely to be different, and 푤휆 could be adapted to reflect this. In the simplest model, these 푤휆 values can be equal, i.e. 푤휆=0.25 for any 휆. Thus, the WA is calculated as:
WA = \frac{1}{4} F_A + \frac{1}{4} F_S + \frac{1}{4} F_C + \frac{1}{4} F_M    (3.3)
Facet | Weight
F_A | (5*4) + (4*2) + (1*1) = 29
F_S | (2*4) + (8*2) = 24
F_C | 6*2 = 12
F_M | (2*2) + (4*1) = 8
Total | 73

Table 3.5: WA Facet Weights
We can calculate WA by adopting a normalized model approach, i.e. by multiplying facet evaluations by special weights according to their specific questions (of low, medium or high significance). To this end, in Table 3.5 we calculate the special weights of each facet. Thus, we can evaluate a weighted WA as:
WA_{weighted} = \frac{29}{73} F_A + \frac{24}{73} F_S + \frac{12}{73} F_C + \frac{8}{73} F_M    (3.4)
In fact, Accessibility is the most central consideration in WA since, if the content cannot be found or accessed, then the website’s compliance with other standards and conditions becomes moot. If the user needs to change the significance of each facet, it is easy to do so by assigning different values to the facet weights.
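The calculations of Equations 3.1-3.3 can be written compactly as in the sketch below. Evaluation results are assumed to be percentages in the range 0-100, or None when an evaluation is not applicable; the Low/Medium/High significance levels map to the weights 1, 2 and 4 defined above. The function names are illustrative.

SIGNIFICANCE_WEIGHT = {"low": 1, "medium": 2, "high": 4}

def facet_score(evaluations):
    """Equation 3.1: weighted average of the evaluation ratings of one facet.

    `evaluations` is a list of (rating, significance) pairs; ratings are
    percentages or None for not-applicable checks, which are excluded from
    both the numerator and the denominator.
    """
    scored = [(r, SIGNIFICANCE_WEIGHT[s]) for r, s in evaluations if r is not None]
    total_weight = sum(w for _, w in scored)
    return sum(r * w for r, w in scored) / total_weight

def website_archivability(facet_values, facet_weights=None):
    """Equations 3.2-3.3: combine facet scores; equal weights by default."""
    if facet_weights is None:
        facet_weights = {name: 1 / len(facet_values) for name in facet_values}
    return sum(facet_weights[name] * value for name, value in facet_values.items())

# Accessibility example from Table 3.6, reproducing Equation 3.5:
f_a = facet_score([(99, "high"), (0, "high"), (100, "high"), (100, "high"),
                   (100, "high"), (0, "medium"), (0, "medium"), (100, "medium"),
                   (100, "medium"), (100, "low")])
print(round(f_a))  # approximately 72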
Id | Description | Rating | Significance
퐴1 | 121 valid and 1 invalid links. | 99% | High
퐴2 | 6 inline JavaScript tags. | 0% | High
퐴3 | Sitemap file exists: http://auth.gr/sitemap.xml | 100% | High
퐴4 | Network response time is 100ms. | 100% | High
퐴5 | No use of any proprietary file format such as Flash and QuickTime. | 100% | High
퐴6 | Robots.txt file http://auth.gr/robots.txt contains multiple “Disallow” rules. | 0% | Medium
퐴7 | No sitemap.xml reference in the robots.txt file. | 0% | Medium
퐴8 | 16 in 16 images. | 100% | Medium
퐴9 | HTTP caching headers available. | 100% | Medium
퐴10 | One RSS feed (http://auth.gr/rss.xml) found using RSS autodiscovery. | 100% | Low
Table 3.6: 퐹퐴 evaluation of http://auth.gr/.
3.2.5 Example
To illustrate the application of CLEAR+, we calculate the WA rating of the website of the Aristotle University of Thessaloniki (AUTH)14. For each WA Facet, we conduct the necessary evaluations (Tables 3.6-3.9) and calculate the respective Facet values (see Equations 3.5-3.8) using Equation 3.1.
F_A = \frac{(99 \cdot 4) + (0 \cdot 4) + (100 \cdot 4) + (100 \cdot 4) + (100 \cdot 4) + (0 \cdot 2) + (0 \cdot 2) + (100 \cdot 2) + (100 \cdot 2) + (100 \cdot 1)}{(4 \cdot 5) + (2 \cdot 4) + (1 \cdot 1)} \approx 72\%    (3.5)

F_S = \frac{(0 \cdot 4) + (100 \cdot 4) + (100 \cdot 2) + (100 \cdot 2) + (100 \cdot 2) + (100 \cdot 2) + (54 \cdot 2) + (100 \cdot 2)}{(4 \cdot 2) + (2 \cdot 6)} \approx 75\%    (3.6)

F_C = \frac{(87 \cdot 2) + (90 \cdot 2) + (100 \cdot 2)}{3 \cdot 2} \approx 92\%    (3.7)

F_M = \frac{(100 \cdot 2) + (100 \cdot 2) + (100 \cdot 1) + (100 \cdot 1)}{(2 \cdot 2) + (1 \cdot 2)} = 100\%    (3.8)
14http://www.auth.gr/ as of 10 August 2014.
Id | Description | Rating | Significance
푆1 | HTML validated, multiple errors. | 0% | High
푆2 | No proprietary external objects (Flash, QuickTime). | 100% | High
푆3 | 16 well-formed images checked with JHOVE. | 100% | Medium
푆4 | RSS feed http://auth.gr/rss.xml is valid according to the W3C feed validator. | 100% | Medium
푆5 | Content encoding was clearly defined in HTTP Headers. | 100% | Medium
푆6 | HTTP Caching headers clearly defined. | 100% | Medium
푆7 | 6 valid and 5 invalid CSS. | 54% | Medium
푆8 | No HTML5 audio elements. | - | Medium
푆9 | No HTML5 video elements. | - | Medium
푆10 | Content type clearly defined in HTTP Headers. | 100% | Medium
Table 3.7: 퐹푆 evaluation of http://auth.gr/.
Id | Description | Rating | Significance
퐶1 | 14 local and 2 external images. | 87% | Medium
퐶2 | 10 local and 1 external CSS. | 90% | Medium
퐶3 | 7 local and no external scripts. | 100% | Medium
퐶4 | No HTML5 audio elements. | - | Medium
퐶5 | No HTML5 video elements. | - | Medium
퐶6 | No proprietary objects. | - | Medium
Table 3.8: 퐹퐶 evaluation of http://auth.gr/.
Id | Description | Rating | Significance
푀1 | Content type clearly defined in HTTP Headers. | 100% | Medium
푀2 | HTTP Caching headers are set. | 100% | Medium
푀3 | No meta robots blocking. | - | Low
푀4 | No DC metadata. | - | Low
푀5 | FOAF metadata found. | 100% | Low
푀6 | HTML description meta tag found. | 100% | Low
Table 3.9: 퐹푀 evaluation of http://auth.gr/.
Finally, assuming the flat model approach, we calculate the WA value as:
WA = \frac{F_A + F_S + F_C + F_M}{4} \approx 85\%

whereas, by following the normalized model approach, the weighted WA value is calculated as:

WA_{weighted} = \frac{29}{61} F_A + \frac{20}{61} F_S + \frac{6}{61} F_C + \frac{6}{61} F_M \approx 78\%

(The facet weights differ from those of Table 3.5 because evaluations marked as not applicable in Tables 3.7-3.9 are excluded, leaving a total weight of 61 instead of 73.)
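Substituting the rounded facet values computed above confirms the weighted figure:

WA_{weighted} = \frac{29 \cdot 72 + 20 \cdot 75 + 6 \cdot 92 + 6 \cdot 100}{61} = \frac{4740}{61} \approx 77.7\% \approx 78\%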
A screenshot of the http://archiveready.com/ web application session we use to evaluate http://auth.gr/ is presented in Figure 3.3.
Figure 3.3: Evaluating http://auth.gr/ WA using ArchiveReady.
3.2.6 The Evolution from CLEAR to CLEAR+
We conclude the presentation of CLEAR+ with the developments of the method since the first incarnation of CLEAR (Ver.1 of 04/2013) [17]. We experimented in practice with the CLEAR method for a considerable time, running a live online system which is presented in detail in Section 3.3. We conducted multiple evaluations and received feedback from academics and web archiving industry professionals. This process resulted in the identification of many issues, such as missing evaluations and overestimated or underestimated criteria. The algorithmic and technical improvements of our method can be summarised as follows:
1. Each website attribute evaluation has a different significance, depending on its effect on web archiving, as presented in Section 3.2.2.
2. The Performance Facet has been integrated into the Accessibility Facet and its importance has been downgraded significantly. This is a result of the fact that website performance in our tests has been consistently high, regardless of the other characteristics of the evaluated websites. Thus, the Performance Facet rating was always 100% or near 100%, distorting the overall WA evaluation.
3. A weighted arithmetic mean is used to calculate the WA Facets instead of a simple mean. All evaluations have been assigned a low, medium or high significance indicator, which affects the calculation of all WA Facets. The significance has been defined based on the initial experience with WA evaluations from the first year of archiveready.com operation.
4. Certain evaluations have been removed from the method as they were considered irrelevant. For example, the check of whether archived versions of the target website are present in the Internet Archive should not be part of the assessment.
5. On a technical level, all aspects of the reference implementation of the Website Archivability Evaluation Tool, http://archiveready.com, have been improved. The software also has the new capability of analysing dynamic websites using a headless web browser, as presented in Section 3.3. Thus, its operation has become more accurate and valid than in the previous version.
In the following Section, we present the architecture of ArchiveReady, a web system implementing the CLEAR+ method.
3.3 ArchiveReady: A Website Archivability Evaluation Tool
We present ArchiveReady15, a WA evaluation system that implements CLEAR+ as a web application. We describe the system architecture, design decisions, WA evaluation workflow and Application Programming Interfaces (APIs) available for interoperability purposes.
15http://www.archiveready.com, accessed August 1, 2015
3.3.1 System Architecture
ArchiveReady is a web application based on the following key components:
1. Debian linux operating system [118] for development and production servers,
2. Nginx web server16 to serve static web content,
3. Python programming language17,
4. Gunicorn Python WSGI HTTP Server for unix18 to serve dynamic content,
5. BeautifulSoup19 to analyse HTML markup and locate elements,
6. Flask20, a Python micro-framework to develop web applications,
7. Redis advanced key-value store21 to manage job queues and temporary data,
8. MariaDB MySQL RDBMS22 to store long-term data,
9. PhantomJS23, a headless WebKit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG,
10. JSTOR/Harvard Object Validation Environment (JHOVE) [46] for media file validation,
11. JavaScript and CSS libraries such as jQuery24 and Bootstrap25, utilized to create a compelling user interface,
12. W3C HTML Markup Validation Service [141] and CSS Validation Service APIs for web resources evaluation.
The home page of the ArchiveReady system is presented in Figure 3.5. An overview of the system architecture is presented in Figure 3.4. During the design and implementation of the platform, we made some important decisions, which greatly influenced all aspects of development. We chose Python to implement ArchiveReady since it is ideal for rapid application development and has many modern features. Moreover, it is supported by a large user community
16http://www.nginx.org
17http://www.python.org/, accessed: August 1, 2015
18http://gunicorn.org/, accessed: August 1, 2015
19http://www.crummy.com/software/BeautifulSoup/, accessed: August 1, 2015
20http://flask.pocoo.org/, accessed: August 1, 2015
21http://redis.io, accessed: August 1, 2015
22http://www.mariadb.com, accessed: August 1, 2015
23http://phantomjs.org/, accessed: August 1, 2015
24http://www.jquery.com, accessed: August 1, 2015
25http://twitter.github.com/bootstrap/, accessed: August 1, 2015
Figure 3.4: The architecture of the archiveready.com system.
and has a wide range of modules. Using these assets, we can implement many important features such as RSS feed validation (feedvalidator module), XML parsing, validation and analysis (lxml module), HTTP communication (python-requests module) and asynchronous job queues (python-rq module). We use PhantomJS to access websites which use JavaScript, AJAX and other web technologies that are difficult to handle with plain HTML processing. Using PhantomJS, we can perform JavaScript rendering when processing a website. Therefore, we can extract dynamic content and even support AJAX-generated content in addition to traditional HTML-only websites. We select Redis to store temporary data in memory because of its performance and its ability to support many data structures. Redis is an advanced key-value store, since keys can contain strings, hashes, lists, sets and sorted sets. These features make it ideal for holding volatile information, such as intermediate evaluation results and other temporary data about website evaluations. Redis is also critical for the implementation of asynchronous job queues as described in Section 3.3.2. We use MariaDB to store data permanently for all evaluations. Such data are the final evaluation results and user preferences. We use JHOVE [46], an established digital preservation tool, to evaluate the files included in websites for their correctness. We evaluate HTML markup, CSS and RSS correctness using W3C validator tools. We also use Python exceptions to track problems when analysing webpages and to locate webpages which cause problems to web client software.
Figure 3.5: The home page of the archiveready.com system.
3.3.2 Scalability
One of the greatest challenges in implementing ArchiveReady is performance, scalability and responsiveness. A web service must be able to evaluate multiple websites in parallel, while maintaining a responsive Web UI and API. To achieve this goal, we implement asynchronous job queues in the following manner:
1. ArchiveReady tasks are separated into two groups: real-time and asynchronous. Real-time commands are processed as soon as they are received from the user, as in any common web application.
2. Asynchronous tasks are processed in a different way. When a user or third party application initiates a new evaluation task, the web application server maps the task into multiple individual atomic subtasks, which are inserted in the asynchronous job queue of the system, which is stored in a Redis List.
3. Background workers equal to the number of server CPU cores are constantly monitoring the job queue for new tasks. As soon as they identify them, they begin processing them one by one and store the results in the MariaDB database.
4. When all subtasks of a given task are finished, the web application server process presents the results to the user. While the background processes are working, the application server is free to reply to any requests regarding new website evaluations without any delay.
The presented evaluation processing logic has many important benefits. Tasks are separated into multiple individual atomic evaluations. This makes the system very robust: an exception or any other system error in any individual evaluation does not interfere with the general system operation. More important is the fact that the platform is highly scalable, as the asynchronous job queues can scale not only vertically depending on the number of available server CPU cores, but also horizontally, as multiple servers can be configured to share the same asynchronous job queue and MariaDB database. To ensure a high level of compatibility with W3C standards, we use open source web services provided by the W3C. These include: the Markup Validator, the Feed Validation Service26 and the CSS Validation Service. According to the HTTP Archive Trends, the average number of HTTP requests initiated when accessing a webpage is over 90 and is expected to rise27. In response to this performance context, ArchiveReady has to be capable of performing a very large number of HTTP requests, processing the data and presenting the outcomes to the user in real time. This is not possible with a single process for each user, the typical approach in web applications. To resolve this blocking issue, an asynchronous job queue system based on Redis for queue management and the Python RQ library28 is deployed. This approach enables the parallel execution of multiple evaluation processes, resulting in large performance benefits compared to the traditional web application execution model. Its operation can be summarised as follows (a minimal sketch of the enqueueing step is given after the list):
1. As soon as a user submits a website for evaluation, the master process maps the work into multiple individual jobs, which are inserted in the parallel job queues in the background.
2. Background worker processes are notified and begin processing the individual jobs in parallel. The level of parallelism is configurable; 16 parallel processes are the current setup.
3. As soon as a job is finished, the results are sent to the master process.
4. When all jobs are finished, the master process calculates the WA and presents the final results to the user.
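As a concrete illustration of the enqueueing step, the following is a minimal sketch using Redis and the Python RQ library mentioned above. The task function evaluate_attribute() and the attribute names are assumptions for demonstration; in a real deployment the task function must live in an importable module so that the background workers can load it.

from redis import Redis
from rq import Queue

def evaluate_attribute(url, attribute):
    # Placeholder for one atomic evaluation (e.g. HTML validation of `url`).
    return {"url": url, "attribute": attribute, "score": 100}

queue = Queue(connection=Redis())  # the job queue is backed by a Redis list

def enqueue_evaluation(url):
    # The master process maps one evaluation task into atomic subtasks;
    # background workers consume them and store the results when done.
    attributes = ["html", "css", "http_headers", "media", "feeds"]
    return [queue.enqueue(evaluate_attribute, url, attr) for attr in attributes]

jobs = enqueue_evaluation("http://example.com/")
# Workers are started separately, e.g. one `rq worker` process per CPU core.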
3.3.3 Workflow
ArchiveReady is a web application providing two types of interaction: a web interface and a web service. With the exception of the presentation of outcomes (HTML for the former, JSON for the latter), the two are identical. The evaluation workflow for a target website can be summarised as follows:
1. ArchiveReady receives a target URL and performs an HTTP request to retrieve the webpage hypertext.
26http://validator.w3.org/feed/
27http://httparchive.org/trends.php
28http://python-rq.org/
2. After analysing it, multiple HTTP connections are initiated in parallel to retrieve all web resources referenced in the target webpage, imitating a web crawler. ArchiveReady analyses only the URL submitted by the user; it does not evaluate the whole website recursively, as we have shown that the WA analysis of a single webpage is a good proxy of the whole website's WA rating.
3. In stage 3, Website Attributes (See Section 3.2.3) are evaluated. In more detail:
(a) HTML and CSS analysis and validation.
(b) HTTP response headers analysis and validation.
(c) Media files (images, video, audio, other objects) retrieval, analysis and validation.
(d) Sitemap.xml and Robots.txt retrieval, analysis and validation.
(e) RSS feeds detection, retrieval, analysis and validation.
(f) Website Performance evaluation. The sum of all network transfer activity is recorded by the system and, after the completion of all network transfers, the average transfer time is calculated.
There are fast and slow evaluations; fast evaluations are performed instantly at the application server, whereas slow evaluations are performed asynchronously using a job queue, as presented in Section 3.3.2.
4. The metrics for the WA Facets are calculated according to the CLEAR+ method and the final WA rating is calculated.
Note that in the current implementation, CLEAR+ evaluates only a single webpage, based on the assumption that all webpages of a website share the same components, standards and technologies. This assumption is validated in Section 3.4.4.
3.3.4 Interoperability and APIs
ArchiveReady operates not only as a web application for users visiting the website but also as a web service, which is available for integration into third party applications. Its interface is quite simple: by accessing archiveready.com/api?url=http://auth.gr/ via HTTP, a JSON document is retrieved with the full results of the WA evaluation of the target URL, as presented in Listing 3.1.
1 {"test":{ 2 "website_archivability": 91, 3 "Metadata":100 4 "Standards_Compliance":73, 5 "Accessibility":88, 6 "Cohesion":71, 7 }, 8 "url": "http://auth.gr/", 9 "messages":[ 38 An Innovative Method to Evaluate Website Archivability
10 {"title":"Invalid CSS http://dididownload.com/wp-content/ themes/didinew/style.css. Located 8 errors, 78 warnings .", 11 "attribute":"html", 12 "facets":["Standards_Compliance"], 13 "level":0, 14 "significance":"LOW", 15 "message":"Webpages which do not conform with Web Standards have a lower possibility to be preserved correctly", 16 "ref":"http://jigsaw.w3.org/css-validator/validator?uri= http://dididownload.com/wp-content/themes/didinew/style .css&warning=0&profile=css3"}, 17 .... 18 ] 19 }
Listing 3.1: ArchiveReady API JSON output
The JSON output can easily be used by third party programs. In fact, all evaluations in Section 3.4 are conducted this way. Another significant interoperability feature of the ArchiveReady platform is its ability to output Evaluation and Report Language (EARL) XML [2], the W3C standard for expressing test results. EARL XML enables users to assert WA evaluation results for any website in a flexible way.
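As an illustration, a third-party Python script could consume the JSON API roughly as follows. This is a sketch based on the interface and fields shown in Listing 3.1; the timeout value and error handling are assumptions.

import requests

def evaluate_wa(url):
    # Query the ArchiveReady API and return the parsed JSON document.
    response = requests.get("http://archiveready.com/api",
                            params={"url": url}, timeout=300)
    response.raise_for_status()
    return response.json()

result = evaluate_wa("http://auth.gr/")
print(result["test"]["website_archivability"])  # overall WA rating
for message in result["messages"]:
    print(message["significance"], message["title"])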
3.4 Evaluation
We present our evaluation methodology and its limits, followed by two independent experiments which support the validity of the WA metric. We then use a third experiment to show that the WA variance among webpages of the same website is very small.
3.4.1 Methodology and Limits
Our evaluation has two aims. The first is to prove the validity of the WA metric by experimenting on assorted datasets and by expert evaluation. The second is to validate our claim that it is only necessary to evaluate a single webpage from a website to calculate a good approximation of its WA value.
In our experiments, we use Debian GNU/Linux 7.3, Python 2.7.6 and an Intel Core i7-3820, 3.60 GHz processor. The Git repository for this work29 contains the necessary data, scripts and instructions to reproduce all the evaluation experiments presented here. WA is a new concept and even though our method has solid foundations, there are still open issues regarding the evaluation of all WA Facets and the definition of a dataset of websites to be used as a Gold Standard:
1. The tools we have at our disposal are limited and cannot cope with the latest developments on the web. For instance, web browser vendors are free to implement extensions to the CSS specifications that, in most cases, are proprietary to their browser30. The official W3C CSS Standard31 is evolving to include some of these new extensions but the process has an inherent delay. As a result, the state of the art W3C CSS validator we use in our system to validate the target website's CSS may return false negative results. This problem is also apparent in all W3C standards validators. As a result, Standards Compliance (퐹푆) evaluation is not always accurate. It must be noted, though, that W3C validators are improving at a steady rate and any improvement is utilised automatically by our system, as we use the W3C validators as web services. Another aspect of this issue is that experts evaluating the live as well as the archived version of a website depend mainly on their web browsers to evaluate the website quality using mostly visual information. The problem is that HTML documents which do not follow W3C standards may appear correctly to the viewer even if they contain serious errors, because the web browser is operating in "Quirks Mode" [34] and has particular algorithms to mitigate such problems. Thus, a website may appear correctly in a current browser but may not do so in a future browser, because the error mitigation algorithms are not standard and depend on the web browser vendor. As a result, it is possible that experts evaluating a website may report that it has been archived correctly, while the 퐹푆 evaluation results may not be equally good.
2. The presented situation regarding standards compliance raises issues about the accuracy of the Accessibility Facet (퐹퐴) evaluation. Web crawlers try to mitigate the errors they encounter in web resources with various levels of success, affecting their capability to access all website content. Their success depends on the sophistication of their error mitigation algorithms. On the contrary, the 퐹퐴 rating of websites having such errors will definitely be low. For instance, a web crawler may access a sitemap.xml which contains invalid XML. If it uses a strict XML parser, it will fail to parse it and retrieve its URLs to proceed with web crawling. If it uses a relaxed XML parser, it will be able to retrieve a large number of its URLs and it will access more website content. In either case, the 퐹퐴 rating will suffer.
3. The Cohesion (퐹퐶) of a website does not directly affect its archiving unless one or more servers hosting its resources become unreachable at the time of archiving. The possibility of encountering such a case when running a WA experiment is very low. Thus, it is very difficult to measure it in an automated way.
4. Metadata are a major concern for digital curation, as discussed in Section 3.2.2. Nevertheless, the lack of metadata in a web archive does not have any direct impact on the user; archived websites may appear correctly although some of their resources may
29https://github.com/vbanos/web-archivability-journal-paper-data-2014
30http://reference.sitepoint.com/css/vendorspecific, accessed August 1, 2015
31http://www.w3.org/Style/CSS/, accessed August 1, 2015
lack correct metadata. This deficiency may become significant in the future, when web archivists need to render or process some "legacy" web resources and do not have the correct information to do so. Thus, it is also challenging to evaluate the 퐹푀 Facet automatically.
5. The granularity of specific evaluations could be improved in the future to increase the accuracy of the method. Currently, the evaluations are either binary (100%/0% stands for a successful/failed evaluation) or relative percentage evaluations (for instance, if 9 out of 10 hyperlinks are valid, the relevant evaluation score is 90%). There are some binary evaluations, though, which might be better defined as a relative percentage. For example, we have 퐴2: Check if inline JavaScript code exists in HTML. We are certain that inline JavaScript code causes problems for web archiving, so we assign a 100% score if no inline JavaScript code is present and 0% in the opposite case. Ideally, we should assign a relative percentage score based on multiple parameters, such as the specific number of inline JavaScript blocks, their sizes, the type of inline code, its complexity and other JavaScript-specific details. The same also applies to many evaluations such as 푆1: HTML standards compliance, 푆4: RSS feed standards compliance, 푆7: CSS standards compliance and 퐴6: Robots.txt "Disallow:" rules.
With these concerns in mind, we consider several possible methods to perform the evaluation. First, we could survey domain experts. We could ask web archivists working in IIPC Member Organisations to judge websites. However, this method is impractical because we would need to spend significant time and resources to evaluate a considerable number of websites. A second alternative would be to devise datasets based on thematic or domain classifications, for instance, websites of similar organisations from around the world. A third alternative would be to perform manual checking of the way a set of websites is archived in a web archive and evaluate all their data, attributes and behaviours in comparison with the original website. We choose to implement both the second and the third method.
3.4.2 Experimentation with Assorted Datasets
To study WA with real-world data, we conduct an experiment to see if high quality websites, according to some general standards, have better WA than low quality websites. We devise a number of assorted datasets with websites of varying themes, as presented in Table 3.10. We evaluate their WA using the ArchiveReady.com API (Section 3.3.4) and finally, we analyse the results.
We define three datasets of websites (퐷1, 퐷2, 퐷3) with certain characteristics:
• They belong to important educational, government or scientific organisations from all around the world.
• They are developed and maintained by dedicated personnel and/or special IT companies.
• They are used by a large number of people and are considered very important for the operation of the organisation they belong to.
We also choose to create a dataset (퐷4) of manually selected spam websites which have the following characteristics:
• They are created automatically by website generators in large numbers.
• Their content is generated automatically.
• They are neither maintained nor evaluated for their quality at all.
• They have relatively very few visitors.
It is important to highlight that a number of websites from all these datasets could not be evaluated by our system for various technical reasons. This means that these websites may also pose the same problems to web archiving systems. The reasons for these complications may be one or more of the following:
• The websites do not support web crawlers and deny sending content to them. This may be due to security settings or technical incompetence. In any case, web archives would not be able to archive these websites.
• The websites were not available at the time of the evaluation.
• The websites returned some kind of problematic data which resulted in the abnormal termination of the ArchiveReady API during the evaluation.
It is worth mentioning that 퐷4, the list of manually selected spam websites, had the most problematic entries: 42 out of 120 could not be evaluated at all. In comparison, 8 out of 94 IIPC websites could not be evaluated (퐷1), 13 out of 200 (퐷2) and 16 out of 450 (퐷3). We conduct the WA evaluation using a Python script which calls the ArchiveReady.com API and records the outcomes in a file. We calculate the WA distribution for all four datasets. We also calculate the average, median, min, max and standard deviation for these datasets, present the results in Table 3.11 and depict them in Figure 3.6 using boxplots.
From these results we can observe the following: datasets 퐷1, 퐷2 and 퐷3, which are considered high quality, have high WA value distributions, as illustrated in Figure 3.7. This is also evident from the statistics presented in Table 3.11. The average WA values are 75.87, 80.08
Id   Description                                                                                                           Raw Data   Clean Data
퐷1   A set of websites from a pool of international web standards organisations, national libraries, IIPC members and other high profile organisations in these fields.   94    86
퐷2   The first 200 of the top universities according to the Academic Ranking of World Universities [90], also known as the "Shanghai list".   200   187
퐷3   A list of government organisation websites from around the world.                                                     450        434
퐷4   A list of manually selected spam websites from the top 1 million websites published by Alexa.                         120        78
Table 3.10: Description of assorted datasets
Function       퐷1      퐷2      퐷3      퐷4
Average (WA)   75.87   80.08   80.75   58.37
Median (WA)    77.5    81      81      58.75
Min (WA)       41.75   56      54      33.25
Max (WA)       93.25   96      96      84.25
StDev (WA)     10.16   6.11    7.06    11.63
Table 3.11: Comparison of WA statistics for assorted datasets.
Figure 3.6: WA statistics for assorted datasets box plot.
and 80.75, respectively. The median WA values are also similar. On the contrary, 퐷4 websites, which are characterised as low quality, have remarkably lower WA values, as shown in Table 3.11 and in Figure 3.6. The average WA value is 58.37 and the median value is 58.75. Thus, lower quality websites are prone to issues which make them difficult to archive. Finally, the standard deviation values are quite low in all cases. As the WA range is [0..100], standard deviation values of approximately 10 or less indicate that our results are strongly consistent, for both lower and higher WA values. To conclude, this experiment indicates that higher quality websites have higher WA than lower quality websites. This outcome is confirmed not only by the WA score itself but also by another indicator revealed during the experiment, the percentage of completed WA evaluations for each dataset.
3.4.3 Evaluation by Experts
To evaluate the validity of our metrics, a reference standard has to be employed for the evaluation. It is important to note that this task requires careful and thorough investigation, as has already been elaborated in existing works [81, 140]. With the contribution of 3
Figure 3.7: WA distribution for assorted datasets.
post-doc researchers and PhD candidates in informatics from the Delab laboratory32 of the Department of Informatics at Aristotle University, who assist us as experts, we conduct the following experiment. We use the first 200 websites of the top universities according to the Academic Ranking of World Universities of 2013 as a dataset (퐷2 from Section 3.4.2). We review the way they are archived in the Internet Archive and rate their web archiving on a scale of 0 to 10. We select the Internet Archive because, to the best of our knowledge, it is the most popular web archiving service. More specifically, for each website we conduct the following evaluation:
1. We visit http://archive.org, enter the URL in the Wayback Machine and open the latest snapshot of the website.
2. We visit the original website.
3. We evaluate the two instances of the website and assign a score from 0 to 10 depending on the following criteria:
(a) Compare the views of the homepage and try to find visual differences and things missing in the archived version (3.33 points).
(b) Inspect dynamic menus or other moving elements in the archived version (3.33 points).
(c) Visit random website hyperlinks to evaluate whether they are also captured successfully (3.33 points).
After analysing all websites, we conduct a WA evaluation of the same websites with a Python script which uses the archiveready.com API (Section 3.3.4). We record the outcomes in a file and calculate Pearson's correlation coefficient for WA, the WA Facets and the expert scores. We present the results in Table 3.12. From these results, we observe that the correlation between WA and the experts' rating is 0.516, which is quite significant taking into consideration the discussion about the limits presented in Section 3.4.1. It is also important to highlight the lack of correlation between different
32http://delab.csd.auth.gr/
        퐹퐴      퐹퐶      퐹푀      퐹푆      WA      Exp.
퐹퐴     1.000
퐹퐶     0.060   1.000
퐹푀     0.217   -0.096  1.000
퐹푆     0.069   0.060   0.019   1.000
WA      0.652   0.398   0.582   0.514   1.000
Exp.    0.384   0.263   0.282   0.179   0.516   1.000
Table 3.12: Correlation between WA, WA Facets and Experts rating.
WA Facets. The correlation indicators between 퐹퐴−퐹퐶, 퐹퐴−퐹푆, 퐹퐶−퐹푀, 퐹퐶−퐹푆 and 퐹푀−퐹푆 are very close to zero, ranging from -0.096 to 0.069. There is only a very small correlation in the case of 퐹퐴−퐹푀, 0.217. Practically, there is no correlation between different WA Facets, confirming the validity and strength of the CLEAR+ method. WA Facets are different perspectives of WA; if there were any correlation between the WA Facets, their differences would not be significant. This experiment confirms that the WA Facets are essentially independent. Finally, we conduct a One Way Analysis of Variance (ANOVA) [94], obtaining an 퐹-value of 397.628 and a 푃-value of 2.191e-54. These indicators show that our results are statistically significant.
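For reference, the correlation and ANOVA figures above can be reproduced with standard SciPy routines, as in the following sketch. The toy data and the choice of groups passed to the ANOVA are illustrative assumptions; the actual computation used the full expert and WA score vectors of dataset 퐷2.

from scipy.stats import pearsonr, f_oneway

# Hypothetical score vectors; the real experiment uses the D2 websites.
wa_scores     = [91, 73, 88, 71, 95, 60]
expert_scores = [8.5, 6.0, 7.5, 6.5, 9.0, 5.0]  # expert ratings on a 0-10 scale

r, p = pearsonr(wa_scores, expert_scores)
print(f"Pearson r = {r:.3f}, p-value = {p:.3g}")

# One-way ANOVA over two groups of observations (grouping is an assumption).
f_value, p_value = f_oneway(wa_scores, [10 * s for s in expert_scores])
print(f"F = {f_value:.3f}, p = {p_value:.3g}")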
3.4.4 WA Variance in the Same Website
We argue that the CLEAR+ method needs only to evaluate the WA value of a single webpage, based on the assumption that webpages from the same website share the same components, standards and technologies. We also claim that the website homepage has a representative WA score. This is important because it would be common for users of the CLEAR+ method to evaluate the homepage of a website, and we have to confirm that it has a representative WA value. To this end, we conduct the following experiment:
1. We use the Alexa top 1 million websites dataset33 and we select 1000 random websites.
2. We retrieve 10 random webpages from each website to use as a test sample. To this end, we use their RSS feeds.
3. We perform RSS feed auto-detection and finally identify 783 websites which are suited for our experiment.
4. We evaluate the WA for 10 individual webpages for each website and record the results in a file.
5. We calculate the WA average (WA_average) and its standard deviation (StDev(WA_average)) for each website.
33http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
6. We calculate and store the WA of the homepage of each website (WA_homepage) as an extra variable. (A small sketch of these per-website computations follows.)
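The sketch below illustrates steps 5 and 6. The input layout (a list of ten page scores plus a separately measured homepage score) and the use of the population standard deviation are assumptions for illustration.

import statistics

def summarise_website(page_scores, homepage_score):
    # Steps 5-6: WA average and standard deviation over the sampled pages,
    # plus the homepage WA stored as an extra variable.
    return {
        "wa_average": statistics.mean(page_scores),
        "wa_stdev": statistics.pstdev(page_scores),  # population st. dev. assumed
        "wa_homepage": homepage_score,
    }

sample_pages = [78, 80, 79, 81, 78, 80, 79, 80, 78, 79]  # hypothetical scores
print(summarise_website(sample_pages, homepage_score=82))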
Figure 3.8: WA average rating and standard deviation values, as well as the homepage WA for a set of 783 random websites.
We plot the variables WA_average, StDev(WA_average) and WA_homepage for each website, in descending order of WA_average, in Figure 3.8. The x-axis represents each evaluated website, whereas the y-axis represents WA. The red cross (+) markers, which form a seemingly continuous line starting from the top left and ending at the center right of the diagram, represent the WA_average values for each website. The blue star (*) markers which appear around the red markers represent the WA_homepage values. The green square markers at the bottom of the diagram represent StDev(WA_average). From the outcomes of our evaluation we draw the following conclusions:
1. While the average WA for the webpages of the same website may vary significantly, from 50% to 100%, the WA standard deviation does not behave in the same manner. The WA standard deviation is extremely low. More specifically, its average is 0.964 points on the 0-100 WA scale and its median is 0.5. Its maximum value is 13.69, but this is an outlier; the second biggest value is 6.88. This means that WA values are consistent for webpages of the same website.
2. The WA standard deviation for webpages of the same website does not depend on the average WA of the website. As depicted in Figure 3.8, regardless of the WA_average value, the StDev(WA_average) value remains very low.
3. The WA of the homepage is near the average WA for most websites. Figure 3.8 indicates that the WA_homepage values are always around the WA_average values, with very few outliers. The average absolute difference between WA_average and WA_homepage over all websites is 3.87 and its standard deviation is 3.76. The minimum value is 0 and the maximum is 25.9.
4. Although WA_homepage is near WA_average, we observe that its value is usually higher. Out of the 783 websites, in 510 cases WA_homepage is higher, in 35 it is exactly equal and in 238 it is lower than WA_average. Even though the difference is quite small, it is notable.
Our conclusion is that our initial assumptions are valid: the variance of WA for the webpages of the same website is remarkably small. Moreover, the homepage WA is quite similar to the average, with a small bias towards higher WA values, which is quite interesting. A plausible explanation for this phenomenon is that website owners spend more resources on the homepage than on any other page because it is the most visited part of the website. Overall, we can confirm that it is justified to evaluate WA using only the website homepage.
3.5 Web Content Management Systems Archivability
Web Content Management Systems (WCMS) are widely adopted and account for much of the web's activity. For instance, a single WCMS company, WordPress, reported more than 1 million new posts and 1.5 million new comments each day [147]. WCMS are created in various programming languages, using many new web technologies [53]. We believe that the wide adoption of WCMS has benefits for web archiving and needs to be taken into consideration. WCMS constitute a common technical framework which may facilitate or hinder web archiving for a large number of websites. If a web archive is compatible with a certain WCMS, it is highly probable that it will be able to archive all websites built with this WCMS. We evaluate the WA of 12 prominent WCMS to identify their strengths and weaknesses and propose improvements to web content extraction and archiving. We conduct an experimental evaluation using a nontrivial dataset of websites based on these WCMS and make observations regarding their WA characteristics. We also come up with specific suggestions for each WCMS based on our experimental data. Our aim is to improve web archiving practice by indicating potential issues to the WCMS development community. If our findings result in advances in WCMS source code upstream, all web archiving initiatives will benefit, as the websites based on these WCMS will become more archivable. In the following, we present our method, results and conclusions.
3.5.1 Website Corpus Evaluation Method
We use 5,821 random WCMS samples from the Alexa top 1 million websites34 as our experimental dataset. We use this dataset because it contains high quality websites from multiple domains and disciplines. This dataset is also used in other related research [142, 60]. We select our corpus with the following process:
1. We implement a simple Python script to visit each homepage and look for the generator meta tag (a sketch of this step is given after the list).
34http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
2. For each website having the required meta tag, we evaluate whether it belongs to one of the WCMS listed in Wikipedia35. If yes, we record it in our database.
3. We continue this process until we have a significant number of instances for 12 WCMS (Blogger, DataLife Engine, DotNetNuke, Drupal, Joomla, Mediawiki, MovableType, Plone, PrestaShop, Typo3, vBulletin, Wordpress).
4. We evaluate each website using the ArchiveReady REST API and record the outcomes in our database.
5. We analyse the results using SQL to calculate various metrics.
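The detection step of item 1 could look roughly like the following sketch, which fetches a homepage and inspects its generator meta tag. The WCMS name list is a small illustrative subset and the string matching is simplified; it is not the exact script used for the corpus.

import requests
from bs4 import BeautifulSoup

KNOWN_WCMS = ("WordPress", "Drupal", "Joomla", "MediaWiki", "Plone", "TYPO3")

def detect_wcms(url):
    # Fetch the homepage and read the generator meta tag, if present.
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    meta = soup.find("meta", attrs={"name": "generator"})
    if not meta or not meta.get("content"):
        return None  # the site does not expose a generator meta tag
    content = meta["content"].lower()
    return next((name for name in KNOWN_WCMS if name.lower() in content), None)

print(detect_wcms("http://example.com/"))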
The generator meta tag is not used universally on the web due to a variety of reasons, such as security. Thus, we have skipped a large number of websites, which did not indicate the system they use. Also, we did not take into consideration the version number of each WCMS as it would be impractical. There would be too many different variables in our experiment to conduct useful research. Also, it is highly improbable that the top internet websites would use legacy versions of their WCMS. The Git repository for this work36 contains all the captured data and the necessary scripts to reproduce all the evaluation experiments.
3.5.2 Evaluation Results and Observations
For each WCMS, we present the average and standard deviation for each WA Facet, as well as the cumulative WA (Figure 3.9). First of all, our results are consistent. While the WA Facet range is 0-100%, the standard deviation of the WA Facet values for each WCMS ranges from 4.2% (Blogger, 퐹퐴) to 13.2% (Mediawiki, 퐹푆). There are considerable differences between WCMS regarding their WA. The top WCMS is DataLife Engine with a WA score of 83.52%, with Plone and Drupal also scoring very high (83.06% and 82.08%). The rest of the WCMS score between 80.3% and 77.2%, whereas the lowest score belongs to Blogger (65.91%). In many cases, even though two or more WCMS may have a similar WA score, their WA Facet scores are significantly different and each WCMS has different strengths and weaknesses. Thus, it is beneficial to look into the differences in each WA Facet.
퐹퐴: The top value is 69.85% for Blogger and 69.51% for DataLife Engine, whereas the minimum values are below 60%: 56.29% for Mediawiki and 58.15% for DotNetNuke.
퐹푀: The top value is 99.24% for Mediawiki, whereas the minimum value is 76.17% for DotNetNuke. The difference between the minimum and the maximum value is around 23 points, which is almost twice the corresponding 퐹퐴 range (13 points).
퐹퐶 : Appears to have the greatest differentiation between WCMS. The minimum value is only 7.38% for Blogger and the maximum value is 96.01% for DotNetNuke. At first sight, there seems to be an issue with the way Blogger is using multiple online services to host its web resources. Other WCMSs also vary from 78.5% (MovableType) to 92% (Plone), which is a considerable variation.
퐹푆: The range is between 71.42% for Mediawiki and 88.06% for PrestaShop. Again, these differences should be considered significant.
35http://en.wikipedia.org/wiki/List_of_content_management_systems, accessed August 1, 2015
36https://github.com/vbanos/wcms-archivability-paper-data, accessed August 1, 2015
Figure 3.9: WA Facets average values and standard deviation for each WCMS
퐹퐴 has the smallest differentiation and 퐹퐶 the greatest one among all WA Facets. We continue with more detailed observations regarding specific evaluations. Due to the large number of WA evaluations and the space restrictions imposed, we cannot present everything; we choose to discuss only highly significant rules. Similar research can easily be conducted by anyone interested, using the full dataset and source code available on GitHub. We present our observations grouped by the four WA Facets.
퐹퐴: Accessibility
Accessibility refers to the web archiving systems’ ability to traverse all website content via standard HTTP protocol requests [54].
퐴1: The percentage of valid versus invalid hyperlink and CSS URLs (Table 3.13). These are critical for web archives to retrieve all WCMS published content. Hyperlinks are created not only by users but also by WCMS subsystems. In any case, some WCMS check whether they are valid whereas others do not. In addition, some WCMS may contain invalid hyperlinks due to bugs. The results show that not all WCMS have the same frequency of invalid hyperlinks. Joomla and Typo3 have the lowest percentage of valid URLs (88% and 89%), whereas Blogger, Mediawiki, Drupal and MovableType have the highest percentage of valid URLs (97% and 96%).
WCMS             Valid URLs   Invalid URLs   Correct (%)
Blogger          45425        1148           97%
Mediawiki        39178        1763           96%
Drupal           52501        2185           96%
MovableType      22442        1009           96%
vBulletin        104492       5841           95%
PrestaShop       57238        3287           94%
DataLife Engine  31981        2342           93%
Plone            25719        1856           93%
Wordpress        47717        3515           93%
DotNetNuke       38144        2791           93%
Typo3            30945        3747           89%
Joomla           37956        4886           88%
Table 3.13: 퐴1 The percentage of valid URLs. Higher is better.
퐴2: The number of inline JavaScript scripts per WCMS instance (Table 3.14). The excessive use of inline scripts in modern web development results in web archiving problems. Plone, MovableType and Typo3 have the lowest number of inline scripts per instance (4.82, 6.82 and 6.89). The highest usage by far comes from Blogger (27.11), while Drupal (15.09) and vBulletin (12.38) follow.
WCMS             Inst.   Inline scripts   Scripts/inst.
Plone            431     2076             4.82
MovableType      295     2011             6.82
Typo3            624     4298             6.89
Mediawiki        408     3753             9.20
DataLife Engine  321     3159             9.84
Wordpress        863     8646             10.02
DotNetNuke       598     6028             10.08
Joomla           501     5163             10.31
PrestaShop       466     5130             11.01
vBulletin        462     5721             12.38
Drupal           528     7969             15.09
Blogger          324     8783             27.11
Table 3.14: 퐴2 The number of inline scripts per WCMS instance. Lower is better.
The Sitemap.xml protocol is meant to create files which include references to all the webpages of a website [127]. Sitemap.xml files are generated automatically by WCMS when their content is updated. The results of the 퐴3 evaluation (Table 3.15) indicate that most WCMS lack proper support for this feature. Only DataLife Engine has a very high score (86%). Wordpress and Drupal also score over 60%. All other WCMS perform very poorly, which is surprising.
WCMS             Instances   Issues   Correct
DataLife Engine  321         46       86%
Wordpress        863         272      68%
Drupal           528         189      64%
PrestaShop       466         237      49%
MovableType      295         152      48%
Typo3            624         322      48%
Plone            431         249      42%
vBulletin        462         329      29%
Joomla           501         359      28%
Blogger          324         240      26%
DotNetNuke       598         461      23%
Mediawiki        408         335      18%
Table 3.15: 퐴3 Sitemap.xml is present. Higher is better.
퐹퐶 : Cohesion
Cohesion refers to the level of dispersion of the files comprising a single website across multiple servers in different domains. The lower the dispersion of a website's files, the lower the susceptibility to errors caused by a failed third-party system. We present results for two 퐹퐶-related evaluations.
퐶1: The percentage of local versus remote images (Table 3.16). Blogger suffers from the highest dispersion of images. On the contrary, Plone, DotNetNuke, PrestaShop, Typo3 and Joomla have the highest 퐹퐶 scores, over 90%.
퐶2: The percentage of local versus remote CSS (Table 3.17). Again, Blogger has a very low score (2%), whereas every other WCMS performs very well.
WCMS             Local imgs   Remote imgs   Percent.
Plone            7833         290           96%
DotNetNuke       13136        680           95%
PrestaShop       19910        1187          94%
Typo3            15434        897           94%
Joomla           14684        1251          92%
MovableType      8147         1388          86%
Drupal           16636        3169          84%
vBulletin        11319        2314          83%
Wordpress        20350        4236          83%
Mediawiki        4935         1127          81%
DataLife Engine  9638         2356          80%
Blogger          1498         8121          16%
Table 3.16: 퐶1 The percentage of local versus remote images. Higher is better.
WCMS             Local CSS   Remote CSS   Percent.
DotNetNuke       5243        101          98%
Typo3            3365        154          96%
Plone            1475        72           95%
Joomla           4539        222          95%
DataLife Engine  919         56           94%
PrestaShop       5221        400          93%
MovableType      578         42           93%
vBulletin        1459        104          93%
Mediawiki        1120        84           93%
Drupal           2320        354          87%
Wordpress        5658        1019         85%
Blogger          18          954          2%
Table 3.17: 퐶2 The percentage of local versus remote CSS. Higher is better.
퐹푆 : Standards Compliance
Standards Compliance is a necessary precondition in digital curation practices [39]. We evaluate 푆1: Validate if the HTML source code complies with the W3C standards using the W3C HTML validator and present the results in Table 3.18.
WCMS             Instances   Errors   Errors/Instance
Plone            431         12205    28.32
Mediawiki        408         14032    34.39
Typo3            624         23965    38.41
Wordpress        863         35805    41.49
Joomla           501         26609    53.11
PrestaShop       466         30066    64.52
DotNetNuke       598         43009    71.92
Drupal           528         47131    89.26
vBulletin        462         46466    100.58
MovableType      295         29994    101.67
DataLife Engine  321         34768    108.31
Blogger          324         71283    220.01
Table 3.18: 푆1 HTML errors per instance. Lower is better.
Plone has the lowest number of errors per instance (28.32), followed by Mediawiki (34.39) and Typo3 (38.41). On the contrary, Blogger has the most errors per instance (220.01), followed at a distance by DataLife Engine (108.31) and MovableType (101.67).
푆3: The usage of QuickTime and Flash formats is considered problematic for web archiving because web crawlers cannot process their contents to extract information, including web resource references. The results show that their use is very low in all WCMS (Table 3.19).
WCMS             Instances   No propr. files   Success
PrestaShop       466         460               99%
Mediawiki        408         398               98%
Blogger          324         310               96%
Plone            431         412               96%
Wordpress        863         821               95%
Typo3            624         592               95%
vBulletin        462         434               94%
Drupal           528         494               94%
DotNetNuke       598         548               92%
DataLife Engine  321         294               92%
MovableType      295         263               89%
Joomla           501         439               88%
Table 3.19: 푆2 The lack of use of proprietary files (Flash, QuickTime). Higher is better.
푆4: Check if the RSS feed format complies with W3C standards. The results (Table 3.20) indicate that Blogger has mostly correct feeds (91%), whereas every other WCMS has varying levels of correctness. The lowest scores belong to Mediawiki (2%) and DotNetNuke (13%). In general, the results show that there is a problem with RSS feed standards compliance.
WCMS             Valid feeds   Invalid feeds   Correct
Blogger          872           83              91%
DataLife Engine  240           57              81%
Wordpress        1283          317             80%
Joomla           556           141             80%
vBulletin        299           96              76%
MovableType      271           120             69%
Drupal           133           74              64%
PrestaShop       82            112             42%
Typo3            124           191             39%
Plone            116           184             39%
DotNetNuke       2             14              13%
Mediawiki        10            521             2%
Table 3.20: 푆4: Valid feeds. Higher is better.
퐹푀 : Metadata usage
The lack of metadata impairs the archive's ability to manage content effectively. Web sites include a lot of metadata, which need to be communicated in a correct manner to be utilised by web archives [101].
WCMS             Instances   Exists   Success
Blogger          324         324      100%
Drupal           528         527      100%
MovableType      295         294      100%
vBulletin        462         458      99%
Plone            431         427      99%
Typo3            624         618      99%
Joomla           501         494      99%
DotNetNuke       598         589      98%
Mediawiki        408         401      98%
DataLife Engine  321         315      98%
PrestaShop       466         456      98%
Wordpress        863         841      97%
Table 3.21: 푀1: HTTP Content-Type header. Higher is better.
푀1: Check if the HTTP Content-Type header exists (Table 3.21). There is virtually no issue with the HTTP Content-Type header in any WCMS; their performance is excellent.
푀2: Check if any HTTP Caching headers (Expires, Last-modified or ETag) are set. HTTP Caching is highly relevant to accessibility and performance. Blogger, Mediawiki, Drupal, DataLife Engine and Plone have very good support of HTTP Caching headers (Table 3.22).
WCMS             Instances   Issues   Percentage
Blogger          324         3        99%
Mediawiki        408         12       97%
Drupal           528         23       96%
DataLife Engine  321         16       95%
Plone            431         49       89%
MovableType      295         106      64%
Joomla           501         186      63%
Wordpress        863         466      46%
Typo3            624         364      42%
vBulletin        462         326      29%
PrestaShop       466         388      17%
DotNetNuke       598         569      5%
Table 3.22: 푀2: HTTP caching headers. Higher is better.
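The two metadata checks above amount to simple HTTP header inspections. The following sketch shows one possible way to perform them; the binary 100/0 scoring follows the description in the text, but issuing a HEAD request and other implementation details are assumptions.

import requests

def check_metadata_headers(url):
    # M1: the Content-Type header exists.
    # M2: at least one caching header (Expires, Last-Modified, ETag) is set.
    headers = requests.head(url, allow_redirects=True, timeout=30).headers
    m1 = 100 if "Content-Type" in headers else 0
    m2 = 100 if any(h in headers for h in ("Expires", "Last-Modified", "ETag")) else 0
    return {"M1_content_type": m1, "M2_caching": m2}

print(check_metadata_headers("http://example.com/"))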
3.5.3 Discussion
We evaluated 12 prominent WCMS and presented specific results and statistics regarding their WA Facets. We concluded that not all WCMS are equally archivable. Each one has its own strengths and weaknesses, which we highlight in the following:
1. Blogger has by far the worst overall WA score (65.91%, Figure 3.9), mainly due to its very low 퐹퐶. Blogger files are dispersed across multiple web services, which increases the possibility of errors in case one of them fails. In addition, Blogger scores very low in many metrics, such as the number of inline scripts per instance (Table 3.14) and HTML errors per instance (Table 3.18). On the contrary, Blogger scores very high on 퐹푀 and 퐹푆.
2. DataLife Engine has the highest WA score (83.52%). One aspect that its developers should look into is HTML errors per instance (Table 3.18), where it has the second worst score.
3. DotNetNuke has the second worst WA score in our evaluation (77.2%). 퐹퐶 is its strong point (96.01%) but it has issues in every other area. We suggest that its developers look into its RSS feeds (13% correct, Table 3.20) and its lacking HTTP caching support (5%, Table 3.22).
4. Drupal has the third highest WA score (82.08%). It has good overall performance and its only issue is the existence of too many inline scripts per instance (15.09, Table 3.14).
5. Joomla's WA score is average (80.37%). It has a large percentage of invalid URLs (12%, Table 3.13) and it also has the highest usage of proprietary files (12%, Table 3.19), which is not good for accessibility and preservation.
6. Mediawiki's WA score is low (77.81%). This can be attributed to mostly invalid feeds (only 2% are correct according to standards) and very low sitemap.xml support (18%, Table 3.15).
7. MovableType's WA score is average (80.02%). It does not stand out in any evaluation, either in a positive or a negative way. General improvement in all areas would be welcome.
8. Plone has the second highest WA score (83.06%). It must be commended for having the lowest number of HTML errors per instance (28.32, Table 3.18) and very high 퐹퐶 scores (96% for images, Table 3.16, and 95% for CSS, Table 3.17).
9. PrestaShop's WA score is average (79%). It has average scores in all evaluations but it should be commended for hardly using any proprietary files (top score: 99%, Table 3.19).
10. Typo3's WA score is average (79%). It has one of the highest percentages of invalid URLs (11%, Table 3.13).
11. vBulletin's WA score is consistently low (78.37%). General improvement in all areas would be welcome.
12. Wordpress's WA score is average (78.47%). We cannot highlight a specific area where it should be improved. As this is currently the most popular WCMS, Wordpress developers should look into all WA Facets and try to improve.
We recommend that the WCMS development communities investigate the presented issues and resolve them, as many are easy to fix without causing any problems for existing users and installations. If the situation regarding the highlighted issues improves in the next releases of the investigated WCMS, the impact would be significant. A large number of websites which could not be archived correctly would no longer have these issues once they update their software, and newly created websites based on these WCMS would be more archivable. Web archiving operations around the world would see great improvement, resulting in general advancements in the state of web archiving.
3.6 Conclusions
We presented our extended work towards the foundation of a quantitative method to evaluate WA. The Credible Live Evaluation of Archive Readiness Plus (CLEAR+) method to evaluate Website Archivability has been elaborated in great detail, the key Facets of WA have been defined, and the method for calculating them has been explained in theory and in practice. In addition, we presented the ArchiveReady system, which is the reference implementation of CLEAR+. We overviewed all aspects of the system, including design decisions, technologies, workflows and interoperability APIs. We believe that it is quite important to explain how the reference implementation of CLEAR+ works, because transparency raises confidence in the method. A critical part of this work is also the experimental evaluation. First, we performed experimental WA evaluations of assorted datasets and observed the behaviour of our metrics. Then, we conducted a manual characterisation of websites to create a reference standard and identified correlations with WA. Both evaluations provided very positive results, which support that CLEAR+ can be used to identify whether a website has the potential to be archived with correctness and accuracy. We also showed experimentally that the CLEAR+ method needs to evaluate only a single webpage to calculate the WA of a website, based on the assumption that webpages from the same website share the same components, standards and technologies. Finally, we evaluated the WA of the most prevalent WCMS, one of the common technical denominators of current websites. We investigated the extent to which each WCMS meets the conditions for a safe transfer of its content to a web archive for preservation purposes, and thus identified their strengths and weaknesses. More importantly, we deduced specific recommendations to improve the WA of each WCMS, aiming to advance the general practice of web data extraction and archiving. Introducing a new metric to quantify the previously unquantifiable notion of WA is not an easy task. We believe that, with the CLEAR+ method and the WA metric, we have captured the core aspects of a website that are crucial in diagnosing whether it has the potential to be archived with correctness and accuracy.
Chapter 4
Near-duplicate and Cycle Detection in Webgraphs towards Optimised Web Crawling
The performance and efficiency of web crawling are important for many applications, such as search engines, web archives and online news. We propose methods to optimise web crawling using duplicate and near-duplicate webpage detection. Using webgraphs to model web crawling, we perform webgraph edge contractions and detect web spider traps, improving the performance and efficiency of web crawling as well as the quality of its results. We introduce http://webgraph-it.com (WebGraph-It), a web platform which implements the presented methods, and conduct extensive experiments using real-world web data to evaluate the strengths and weaknesses of our methods1.
4.1 Introduction
Websites have become large and complex systems, which require strong software systems to be managed effectively [22]. Web content extraction, or web crawling, is becoming increasingly important. It is crucial to have web crawlers capable of efficiently traversing websites to harvest their content. The sheer size of the web combined with the unpredictable publishing rate of new information calls for a highly scalable system, while the lack of programmatic access to the complete web content makes the use of automatic extraction techniques necessary.
1This chapter is based on the following publication:
• Banos V., Manolopoulos Y.: "Near-duplicate and Cycle Detection in Webgraphs towards Optimised Web Crawling", ACM Transactions on the Web Journal, submitted, 2015.
Special software systems, the web crawlers, also known as "spiders" or "bots", have been created to conduct web crawling efficiently at large scale. They are self-acting agents that navigate around the clock through the hyperlinks of the web, harvesting topical resources without human supervision [112]. Essentially, a web crawler starts from a seed webpage and then uses the hyperlinks within it to visit other webpages. This process repeats with every new webpage until some conditions are met (e.g. a maximum number of webpages is visited or no new hyperlinks are detected). Despite the simplicity of the basic algorithm, web crawling has many challenges [12]. In this work, we focus on addressing two key issues:
• There are a lot of duplicate or near-duplicate data captured during web crawling. Such data are considered superfluous and, thus, great effort is necessary to detect and remove them after crawling [95]. To the best of our knowledge, there is no method to perform this task during web crawling.
• Web spider traps are sets of webpages that cause web crawlers to make an infinite number of requests. They result in software crashes, web crawling disruption and excessive waste of computing resources [110]. There is no automated way to detect and avoid web spider traps; web crawling engineers use various heuristics with limited success.
These issues greatly impact web crawling systems' performance and users' experience. We explore some fundamental web crawling concepts and present various methods that improve on baseline web crawling to address them:
• Unique webpage identifier selection: The URI is the de facto standard for unique webpage identification, but web archiving systems also use the Sort-friendly URI Reordering Transform (SURT)2, a transformation applied to URIs which makes their left-to-right representation better match the natural hierarchy of domain names3. We suggest using SURT as an alternative unique webpage identifier for web crawling applications.
• Unique webpage identifier similarity: The unique URI is the de facto standard, but we also look into near-duplicates. It is possible that two near-duplicate URIs or SURTs belong to the same webpage.
• Webpage content similarity: Duplicate and near-duplicate webpage content detection can be used in conjunction with unique webpage identifier similarity.
• Webgraph edge contraction: Modelling websites as webgraphs [28] during crawling, we can apply node merging using the previous three concepts as similarity criteria and achieve webgraph edge contraction and cycle detection.
Using these concepts, we establish a theoretical framework as well as novel web crawling methods, which provide us with the following information for any target website: (i) unique and valid webpages, (ii) hyperlinks between them, (iii) duplicate and near-duplicate webpages, (iv) web spider trap locations, and (v) a webgraph model of the website.
2http://crawler.archive.org/apidocs/org/archive/util/SURT.html, accessed August 1, 2015
3http://crawler.archive.org/articles/user_manual/glossary.html, accessed August 1, 2015
We also present WebGraph-It, a system which implements our methods and is available at http://webgraph-it.com. Web crawling engineers could use WebGraph-It to preprocess websites prior to web crawling, obtaining lists of URLs to avoid as duplicates or near-duplicates, URLs to avoid as web spider traps, and webgraphs of the target website. Finally, we conduct an experiment with a non-trivial dataset of websites to evaluate the proposed methods. Our contributions can be summarized as follows:
• we propose a set of methods to detect duplicate and near-duplicate webpages in real time during web crawling.
• we propose a set of methods to detect web spider traps using webgraphs in real time during web crawling.
• we introduce WebGraph-It, a web platform which implements the proposed methods.
The remainder of this chapter is organised as follows: Section 4.2 presents the main concepts of our methods and introduces new web crawling methods that use them to detect duplicate and near-duplicate content, as well as web spider traps. Section 4.3 presents the system architecture of WebGraph-It. Section 4.4 presents our experiments and detailed results. Finally, Section 4.5 discusses results and presents future work.
4.2 Method
In the following subsections, we propose algorithms to detect duplicate web content during web crawling and to avoid web spider traps. First, however, we present some fundamental concepts underlying our methods.
4.2.1 Key Concepts
We model the web crawling process as a directed graph, which we call a webgraph. A webgraph relative to a certain set of URLs is a directed graph having those URLs as nodes, and with an arc from 푋 to 푌 whenever page 푋 contains a hyperlink towards page 푌 [24]. We model web crawling of a single website as the generation and traversal of a webgraph in real time. When new webpages are identified, new nodes and arcs are added to the webgraph. This concept can be extended to crawling large numbers of websites or domains; however, in this work, we focus on the problem of crawling a single website, to be able to present and validate our ideas in a tangible way. The standard "naive" procedures of web crawling result in webgraphs with an excessive number of duplicate or near-duplicate nodes. Despite any heuristics used by web crawlers, their logic is static and they cannot cope with the changing nature of the web, as already presented in Section 2.1.2. Moreover, in many cases existing procedures result in webgraphs of infinite size, as web crawlers are tricked into detecting new webpages indefinitely when there is no new web content available. This issue is also known as web spider traps and is detailed in Section 4.4.5. To devise more optimal web crawling methods, we exploit three concepts which, to the best of our knowledge, have not been fully exploited until this point:
• Unique webpage identifier selection: Which webpage attribute is considered its unique identifier?
• Unique webpage identifier similarity: Which webpage unique identifiers should be considered similar?
• Webpage content similarity: Which webpage content should be considered similar?
Using these concepts, we identify duplicate or near-duplicate web content, which highlights webgraph nodes that contain little or no new information and can thus be removed. These findings result in webgraph edge contractions and restructuring. In addition, this process enables webgraph cycle detection. The result is the reduction of webgraph complexity, improving the efficiency of the web crawling process and the quality of its results. We must note that each of the presented concepts can be used not only independently but also in conjunction with the others. In the sequel, we analyse each concept in detail; a minimal sketch of the webgraph model itself is given below.
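The following sketch illustrates the webgraph model: a single-website crawl that builds a directed graph as it goes, using networkx for the graph and BeautifulSoup for link extraction. Identifier normalisation, near-duplicate detection, politeness and the page limit are simplifications or assumptions; this is not the WebGraph-It implementation.

from collections import deque
from urllib.parse import urljoin, urlparse

import networkx as nx
import requests
from bs4 import BeautifulSoup

def crawl_webgraph(seed, max_pages=50):
    graph, queue, seen = nx.DiGraph(), deque([seed]), {seed}
    site = urlparse(seed).netloc
    while queue and graph.number_of_nodes() < max_pages:
        url = queue.popleft()
        graph.add_node(url)  # each visited URL becomes a webgraph node
        try:
            html = requests.get(url, timeout=30).text
        except requests.RequestException:
            continue
        for anchor in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            target = urljoin(url, anchor["href"])
            if urlparse(target).netloc != site:
                continue  # stay within the target website
            graph.add_edge(url, target)  # arc X -> Y for a hyperlink X contains
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return graph

g = crawl_webgraph("http://example.com/")
print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "arcs")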
Unique webpage identifier selection
Uniform Resource Identifiers (URIs) are the de facto standard for unique web resource identification on the WWW. The architecture of the WWW is based on the Uniform Resource Locator (URL), a subset of the URI that, in addition to identifying a resource, provides a means of locating the resource by describing its primary access mechanism (e.g., its network "location") [19]. Many web related technologies, such as the Semantic Web and Linked Open Data, use URLs [21]. We suggest that the use of URLs for unique webpage identification in web crawling applications should be rethought. There are many special cases where this concept is problematic:
• URLs with excessive parameters usually point at the same webpage. Web applications ignore arbitrary HTTP GET parameters. For instance, the following three URLs point at the same webpage:
http://edition.cnn.com/videos
http://edition.cnn.com/videos?somevar=1
http://edition.cnn.com/videos?somevar=1&other=1
There is no restriction on using any of these URLs. If a web content editor or user mentions one of them for any reason, it would be accepted as valid since it points at a valid webpage. The web server responds with an HTTP 200 status and a correct web document. The problem is that web crawlers would capture three copies of the same webpage.
• Two or more totally different URLs could point at the same webpage. For instance, the following two AUTH university webpages are duplicates:
http://www.auth.gr/invalid-page-1
http://www.auth.gr/invalid-page-2
They both point at the same "Not found" webpage. If URLs such as these are mentioned in any webpage visited by a web crawler, the result would be multiple copies of the same webpage.
• Problematic DNS configurations could lead to multiple duplicate web documents. For example, in many cases the handling of the 'www.' prefix in websites is not consistent. For instance, the following two URLs point at exactly the same content:
http://www.example.com/
http://example.com/
A correct DNS configuration would make the system respond with an HTTP redirect from the one to the other, according to the owner's preference. Currently, web crawlers would consider them as two different websites.
We suggest that URLs need to be preprocessed and normalised before being used as unique webpage identifiers. An appropriate solution for this problem is the use of the Sort-friendly URI Reordering Transform (SURT) to encode URLs. In short, SURT converts URLs from their original format:
scheme://user@domain.tld:port/path?query#fragment
into the following:
scheme://(tld,domain,:port@user)/path?query#fragment
An example conversion is presented below.
URL: http://edition.cnn.com/tech
SURT: com,cnn,edition)/tech
The '(' and ')' characters serve as an unambiguous notice that the so-called 'authority' portion of the URI ([userinfo@]host[:port] in http URIs) has been transformed; the commas prevent confusion with regular hostnames. This remedies the 'problem' with standard URIs that the host portion of a regular URI, with its dotted domains, is actually in reverse order from the natural hierarchy that is usually helpful for grouping and sorting. The value of respecting URI case variance is considered negligible: it is vanishingly rare for case variance to be meaningful, while URI case variance often arises from people's confusion or sloppiness, and they only correct it insofar as necessary to avoid blatant problems. Thus the usual SURT form is considered to be flattened to all lowercase, and not completely reversible4. Web archiving systems use SURT internally. For instance, Murray et al. use SURT and certain limits to conduct link analysis in captured web content [104]. Alsum et al. use SURT to create a unique ID for each URI to achieve incremental and distributed processing of the same URI on different web crawling cycles or different machines [5].
4http://crawler.archive.org/articles/user_manual/glossary.html, accessed: August 1, 2015
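To make the transform concrete, the following minimal Python sketch converts a URL into the simplified, scheme-less SURT form shown above. It is an illustrative assumption, not the official SURT implementation used by Heritrix or WebGraph-It, and it omits the scheme and userinfo components for brevity.

from urllib.parse import urlsplit

def to_surt(url):
    # Simplified SURT: reverse the host labels, then append port, path and query.
    parts = urlsplit(url.lower())
    host = parts.hostname or ""
    reversed_host = ",".join(reversed(host.split(".")))
    port = ":{}".format(parts.port) if parts.port else ""
    path = parts.path or "/"
    query = "?{}".format(parts.query) if parts.query else ""
    return "{}{}){}{}".format(reversed_host, port, path, query)

print(to_surt("http://edition.cnn.com/tech"))  # prints: com,cnn,edition)/tech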
Unique webpage identifiers similarity
One of the basic assumptions of the web is that URLs are unique [19]. When a web crawler visits a webpage and encounters duplicate URLs, it does not visit the same URL twice. We suggest that in some cases, near-duplicate URLs could also lead to the same webpage and should be avoided. For instance, the following two URLs lead to the same webpage:
http://vbanos.gr/
http://vbanos.gr
A slight difference in HTTP GET URL parameters could also trick web crawlers into processing duplicate webpages: lowercase or uppercase characters, unescaped parameters, or any other web-application-specific variable could lead to the same results. For example, the following could be near-duplicate webpages:
http://vbanos.gr/show?show-greater=10
http://vbanos.gr/page?show-greater=11
Parameter ordering may also trick web crawlers. Example:
http://example.com?a=1&b=2
http://example.com?b=2&a=1
Thus, we propose to detect near-duplicate URLs using standard string similarity methods and to consider webpages with near-duplicate URLs as potential duplicates. The content of these webpages is also evaluated to clarify whether they are indeed duplicates. We use the Sorensen-Dice coefficient similarity because it is a string similarity algorithm with the following characteristics: (i) low sensitivity to word ordering, (ii) low sensitivity to length variations, (iii) it runs in linear time [16, 45]. For the sake of experimentation, we consider a 95% similarity threshold appropriate to define near-duplicate URLs. Finally, we must highlight that the proposed method can be used with both URL and SURT as unique identifiers.
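As an illustration, the following Python sketch computes the Sorensen-Dice coefficient over character bigrams and applies the 95% threshold. The bigram tokenisation and the threshold mirror the description above, but this is only a sketch; the exact implementation used in WebGraph-It may differ.

from collections import Counter

def dice_coefficient(a, b):
    # Sorensen-Dice similarity over character bigrams (multiset overlap).
    bigrams_a = Counter(a[i:i + 2] for i in range(len(a) - 1))
    bigrams_b = Counter(b[i:i + 2] for i in range(len(b) - 1))
    total = sum(bigrams_a.values()) + sum(bigrams_b.values())
    if total == 0:
        return 1.0 if a == b else 0.0
    overlap = sum((bigrams_a & bigrams_b).values())
    return 2.0 * overlap / total

def is_near_duplicate_url(url, seen_urls, threshold=0.95):
    # A URL is a potential duplicate if it is >=95% similar to any seen identifier.
    return any(dice_coefficient(url, seen) >= threshold for seen in seen_urls)

print(dice_coefficient("http://example.com?a=1&b=2", "http://example.com?b=2&a=1"))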
Webpage content similarity
Webpage content similarity can be also used to detect duplicate or near-duplicate webpages. The problem can be defined as:
• Detect duplicate webpages: two webpages which contain exactly the same content.
• Detect near-duplicate webpages: webpages with content that is very similar but not exactly the same. This is a common pattern on the web; webpages may even differ slightly between two subsequent visits by the same user because some dynamic parts of the webpage are updated, e.g. a counter or some other widget.
Digital file checksum algorithms create a unique signature for each file based on its contents. They can be used to identify duplicate webpages but not near-duplicates. We need a very efficient and high-performance algorithm. The simhash algorithm by Charikar can be used to calculate hashes from documents in order to perform fast comparisons [35]. It has already been used very effectively to detect near-duplicates in web search engine applications [95]. This work demonstrates that simhash is appropriate and practical for near-duplicate detection in webpages. To use simhash, we need to calculate the simhash signature of every webpage after it is captured and save it in a dictionary along with its URL. Then, when capturing any new page, we compare its simhash signature with the existing ones in the dictionary to find duplicates or near-duplicates. The similarity threshold would be an option according to user needs. For the sake of simplicity and experimentation, we only consider similarity evaluation between two webpages to identify exact similarity or at least 95% similarity. The potential problem of this approach is that, in case a website contains a large number of webpages, it would not be efficient to calculate every new webpage's similarity with all existing webpages, even though we may use simhash, which is very efficient compared with a bag of words or any other attribute selection method [35].
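For illustration, a minimal, self-contained simhash sketch in Python is given below. The actual system relies on an existing simhash library (see Section 4.3), so this only sketches the underlying idea, with MD5 standing in for whatever hash function that library uses internally.

import hashlib
import re

def simhash(text, bits=64):
    # Charikar-style fingerprint: sum per-bit votes of hashed word features.
    v = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def simhash_similarity(hash_a, hash_b, bits=64):
    # Fraction of identical bits; a value >= 0.95 is treated as a near-duplicate.
    return (bits - bin(hash_a ^ hash_b).count("1")) / bits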
Webgraph cycle detection
During web crawling, a webgraph is generated in real time using the newly captured webpages as nodes and their links as edges. New branches are created and expanded as the web crawler captures new webpages. The final outcome is a directed acyclic graph [28]. Our method can be summarised as follows: every time a new node is added to the webgraph, we evaluate whether it is a duplicate or near-duplicate of nearby nodes. If so, the two nodes are merged, their edges are contracted and we detect potential cycles in the modified graph starting from the new node up to a distance of 푁 nodes. If a cycle is detected, we do not proceed with crawling links from this node; otherwise we continue. A generic version of the web crawling with cycle detection algorithm is presented in Listing 4.2. In more detail, to implement our method we need a shared global webgraph object in memory, which can be accessed by all web crawler bots. Each webgraph node has the structure of Listing 4.1.
struct webgraph-node {
    string webpage-url
    string webpage-surt
    bitstream webpage-content-simhash
    list[string] internal-links
}
Listing 4.1: Webgraph node structure
Webpage-url keeps the original webpage URL without any modification. Webpage-surt is generated from the webpage-url, and webpage-content-simhash is generated from the webpage HTML markup. Only internal links are used, because the scope of our algorithms is to detect duplicate webpages within the same website.
global var webgraph

method crawl(URL):
    fetch webpage from URL
    new-node = create-webgraph-node(URL)
    webgraph->add-node(new-node)
    for limit = (1,...,N):
        near-nodes = webgraph->get-nodes(new-node, limit)
        for node in near-nodes:
            if is-similar(node, new-node):
                webgraph->merge(node, new-node)
    for limit = (1,...,N):
        has-cycle = dfs-check-cycle(webgraph, new-node, limit)
        if has-cycle is True:
            return
    parse webpage and extract all URLs
    save webpage
    for all URLs not seen before:
        crawl(URL)
Listing 4.2: Generic web crawling with cycle detection algorithm
The algorithm can have multiple variations regarding: a) the node similarity method and b) the maximum node distance evaluated. Webgraph nodes which would otherwise be considered unique when using only the unique URL can now be identified as duplicate or near-duplicate using the methods presented in the previous subsections of Section 4.2.1. The potential similarity metrics are presented in Table 4.1.
Table 4.1: Potential webgraph node similarity metrics

Id   Identifier   Identifier Similarity   Content Similarity
푆1   URL          No                      No
푆2   SURT         No                      No
푆3   URL          Yes                     No
푆4   SURT         Yes                     No
푆5   URL          Yes                     Yes
푆6   SURT         Yes                     Yes
To search for cycles we use Depth-First Search (DFS) [136] because it is ideal for performing limited searches in potentially infinite graphs. We limit the search distance to 3 nodes because our experiments indicate that deeper searches are not worthwhile for detecting cycles. We present such an experiment in Section 4.4.4. We must note that our method is very efficient because we do not need to save the contents of every webpage but only some specific webpage attributes, as presented in Listing 4.1. Also, our method uses one shared webgraph model in memory regardless of the number of web crawler processes, using locking mechanisms when adding or removing nodes. Due to the fact that web crawling is I/O bound, this architecture does not incur performance penalties.
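A minimal Python sketch of the depth-limited DFS cycle check (dfs-check-cycle in Listing 4.2) is shown below, assuming the webgraph is represented as a plain adjacency dictionary; the actual WebGraph-It implementation builds on NetworkX and may differ in detail.

def dfs_check_cycle(webgraph, start, limit):
    # Return True if some path of at most `limit` hops leads back to `start`.
    # `webgraph` maps each node to the list of its successor nodes.
    def dfs(node, depth):
        if depth > limit:
            return False
        for successor in webgraph.get(node, []):
            if successor == start:
                return True
            if dfs(successor, depth + 1):
                return True
        return False
    return dfs(start, 1)

# Example: a -> b -> c -> a forms a cycle of length 3.
webgraph = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(dfs_check_cycle(webgraph, "a", 3))  # True
print(dfs_check_cycle(webgraph, "a", 2))  # False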
4.2.2 Algorithms
Using the concepts introduced in Section 4.2.1, we design specific web crawling algorithms. These algorithms are later tested experimentally in Section 4.4. We note that in all cases, we evaluate a single domain. When we mention URLs, we mean URLs from the same target domain. We ignore external URLs. Also, we use breadth-first webpage ordering.
Algorithm 1 - the base web crawling algorithm
First, we present the basic web crawling algorithm in Listing 4.3. This algorithm is "naive". It serves as the standard reference, establishing the baseline number of webpages, links and web crawling duration. All other proposed methods are compared against it to indicate any gains or issues due to the application of our concepts (Section 4.2.1).
method crawl(URL):
    fetch webpage from URL
    parse webpage and extract all URLs
    save webpage
    for all URLs not seen before:
        crawl(URL)
Listing 4.3: 퐶1: Basic web crawling algorithm
Algorithm 2 - SURT variation
Instead of using the URL as the unique identifier of a webpage in the web crawler memory, we use SURT. The new algorithm is presented in Listing 4.4. We note that the extra SURT generation requires trivial computing resources.
method crawl(URL):
    fetch webpage from URL
    parse webpage and extract all URLs
    save webpage
    SURTs = calculate-SURT(URLs)
    for all SURTs not seen before:
        crawl(URL using SURT)
Listing 4.4: 퐶2: Basic web crawling algorithm using SURT as unique webpage identifier
Algorithms 3, 4 - near-duplicate unique identifiers
In the previous two algorithms, we use a dictionary data structure to hold the URLs or the SURTs of all visited pages, and we use exact matching to decide whether we have already visited a webpage. In these algorithms we propose to use near-similarity to decide whether to visit a webpage or not. As presented in Section 4.2.1, we use the Sorensen-Dice coefficient similarity to calculate the similarity of the new webpage's unique identifier (URL or SURT) with all existing identifiers. The remaining algorithmic steps are exactly the same.
method crawl(URL):
    fetch webpage from URL
    parse webpage and extract all URLs
    save webpage
    for all URLs not seen before:
        if URL is not similar to existing URLs:
            crawl(URL)
Listing 4.5: 퐶3: Using near-similarity for URLs
method crawl(URL):
    fetch webpage from URL
    parse webpage and extract all URLs
    save webpage
    SURTs = calculate-SURT(URLs)
    for all SURTs not seen before:
        if SURT is not similar to existing SURTs:
            crawl(URL using SURT)
Listing 4.6: 퐶4: Using near-similarity for SURTs
Algorithms 5, 6 - near-duplicate content detection
In the previous four algorithms, we worked with the selection of the webpage unique identifier (URL or SURT) and its similarity metric (exact or near-similar identifier). We propose to also take into consideration webpage content similarity in addition to unique identifier similarity.
method crawl(URL):
    fetch webpage from URL
    if webpage is not similar to existing webpages:
        parse webpage and extract all URLs
        save webpage
        for all URLs not seen before:
            if URL is not similar to existing URLs:
                crawl(URL)
Listing 4.7: 퐶5: Using near-similarity for URLs and content similarity for webpages
method crawl(URL):
    fetch webpage from URL
    if webpage is not similar to existing webpages:
        parse webpage and extract all URLs
        save webpage
        SURTs = calculate-SURT(URLs)
        for all SURTs not seen before:
            if SURT is not similar to existing SURTs:
                crawl(URL using SURT)
Listing 4.8: 퐶6: Using near-similarity for SURTs and content similarity for webpages
Algorithms 7, 8 - cycle detection
We extend the previously defined algorithms 3-6 using webgraph cycle detection based on the concept we presented in Section 4.2.1. Algorithm 7 uses URL as the unique webpage identifier together with the content similarity function (Listing 4.9), whereas algorithm 8 uses SURT as the unique webpage identifier together with the content similarity function (Listing 4.10).
global var webgraph

method crawl(URL):
    fetch webpage from URL
    new-node = create-webgraph-node(URL)
    webgraph->add-node(new-node)
    for limit = (1,...,N):
        near-nodes = webgraph->get-nodes(new-node, limit)
        for node in near-nodes:
            if content-is-similar(node, new-node):
                webgraph->merge(node, new-node)
    for limit = (1,...,N):
        has-cycle = dfs-check-cycle(webgraph, new-node, limit)
        if has-cycle is True:
            return
    parse webpage and extract all URLs
    save webpage
    for all URLs not seen before:
        crawl(URL)
Listing 4.9: 퐶7: Using URL as unique identifier, webpage content similarity and webgraph cycle detection
global var webgraph

method crawl(URL):
    fetch webpage from URL
    new-node = create-webgraph-node(URL)
    webgraph->add-node(new-node)
    for limit = (1,...,N):
        near-nodes = webgraph->get-nodes(new-node, limit)
        for node in near-nodes:
            if content-is-similar(node, new-node):
                webgraph->merge(node, new-node)
    for limit = (1,...,N):
        has-cycle = dfs-check-cycle(webgraph, new-node, limit)
        if has-cycle is True:
            return
    parse webpage and extract all URLs
    save webpage
    SURTs = calculate-SURT(URLs)
    for all SURTs not seen before:
        crawl(URL using SURT)
Listing 4.10: 퐶8: Using SURT as unique identifier, webpage content similarity and webgraph cycle detection
Table 4.2 summarises all presented web crawling algorithms. We note that we do not exhaust all potential method combinations but focus on a substantial set, which is sufficient to explore the value of our methods. In the sequel, we present the WebGraph-It platform, which implements the presented algorithms.
Table 4.2: Web crawling algorithms summary

Id   Identifier Selection   Identifier Similarity   Content Similarity   Cycle Detection
퐶1   URL                    No                      No                   No
퐶2   SURT                   No                      No                   No
퐶3   URL                    Yes                     No                   No
퐶4   SURT                   Yes                     No                   No
퐶5   URL                    Yes                     Yes                  No
퐶6   SURT                   Yes                     Yes                  No
퐶7   URL                    No                      Yes                  Yes
퐶8   SURT                   No                      Yes                  Yes
4.3 The WebGraph-it System Architecture
Here, we present http://webgraph-it.com (WebGraph-It), a web platform that implements our methods as a web application. Using WebGraph-It, users can analyse target websites and gain an understanding of their structure, pages, hyperlinks, duplicate content and web crawler traps.
4.3.1 System
WebGraph-It is a web platform built using the micro-service architecture:
• The back-end subsystem implements all the algorithms and the web crawling logic for downloading and analysing data. It also exposes a private REST API.
• The data storage subsystem is responsible for permanent and temporary data storage. It communicates with the back-end to send or receive data via standard storage APIs.
• The front-end subsystem implements the user interface and the public REST API. It communicates with the back-end to invoke commands and retrieve data.
We use the following standard software components: a) Debian Linux operating system [118] for development and production servers, b) Nginx web server, c) Python programming language, d) Gunicorn Python WSGI HTTP server, e) Flask Python web micro-framework, f) Redis advanced key-value store to manage job queues and temporary data shared among background web crawling processes, g) MariaDB MySQL RDBMS to store permanent data, h) PhantomJS, a headless WebKit scriptable with a JavaScript API, with fast and native support for various web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG, i) JavaScript and CSS libraries such as Bootstrap for the UI. An overview of the system architecture is presented in Figure 4.1.
Figure 4.1: WebGraph-It system architecture
We explain some of our system architecture decisions as they are important for the implementation of our methods. First, we choose the micro-services architecture because we need to separate the web crawling logic, the data storage and the user interface. This way, we can upgrade each subsystem without affecting the others. For instance, we could create a new, more user-friendly web interface or a public REST API for WebGraph-It without modifying the web crawling logic. We use asynchronous job queues in the back-end to define and conduct the web crawling process because this is a flexible scheme: we can define arbitrary numbers of worker processes on one or more servers and, thus, the system is resilient to faults due to unexpected conditions. A single process can crash without affecting the others. Also, the results are kept in the job queues and can be evaluated later.
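As an illustration of this design, the sketch below enqueues crawl jobs on a Redis-backed queue. The RQ library is used here purely as an assumed example; the dissertation specifies Redis as the queueing substrate but not a particular worker library.

from redis import Redis
from rq import Queue

def capture(url):
    # Placeholder for the real capture logic (fetch, analyse, store).
    # In practice this function must live in an importable module so that
    # worker processes (started with `rq worker crawls`) can resolve it.
    print("capturing", url)

redis_conn = Redis(host="localhost", port=6379)
crawl_queue = Queue("crawls", connection=redis_conn)

# Any number of independent workers can consume these jobs; if one crashes,
# the remaining jobs stay in the queue and the other workers keep going.
for seed in ["http://example.com/", "http://example.org/"]:
    crawl_queue.enqueue(capture, seed)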
Figure 4.2: Viewing a webgraph in the http://webgraph-it.com web application
We use Python to implement the front-end and the back-end subsystems because it offers many features, such as robust MVC frameworks and networking libraries like python-requests5. Python also has a large set of libraries which implement algorithms such as simhash6 and Sorensen-Dice, as well as graph analysis (NetworkX7) and numeric calculations (numpy8). We use PhantomJS to improve the ability of our web crawler to process webpages which use JavaScript, AJAX and other web technologies that are difficult to handle with plain HTML processing. Using PhantomJS, we render JavaScript in webpages and extract dynamic content. In the past, this method has been tested successfully in web crawling work [16]. We use Redis9 to store temporary data in memory because of its extremely high performance and its ability to support many data structures and multiple clients in parallel.
5http://python-requests.org, accessed: August 1, 2015
6https://github.com/sangelone/python-hashes, accessed: August 1, 2015
7https://networkx.github.io/, accessed: August 1, 2015
8http://www.numpy.org/, accessed: August 1, 2015
9http://redis.io, accessed: August 1, 2015

Web crawling is performed by multiple software agents / web crawling processes, which can be distributed across one or more servers. The WebGraph-It architecture uses Redis as common temporary data storage to maintain asynchronous job queues, webgraph structures, visited URL lists, SURT lists, webpages' simhash values and other vital web crawl information. We use MariaDB to store permanent data for users, webcrawls and captured webpage information such as hyperlinks. All data are stored in a relational model so that we can query them and generate views, reports and statistics. The http://webgraph-it.com front-end enables users to register and conduct web crawls with various options. Users can see completed web crawls and retrieve the results. An indicative screenshot of the front-end is presented in Figure 4.2. Users are able to create new web crawling tasks or view the results of existing tasks via an intuitive interface. Users are also able to export webgraph data in a variety of formats such as Graph Markup Language (GraphML) [26], Geography Markup Language (GML) [30], Graphviz DOT Language [49], sitemap.xml [127] and CSV. Our aim is to enable the use of the generated webgraphs in a large variety of 3rd party applications and contexts.
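Since the back-end already relies on NetworkX for graph analysis, exporting a webgraph to some of these formats can be done with NetworkX's standard writers, as in the sketch below; the actual export code of WebGraph-It is not shown in this chapter and may differ.

import networkx as nx

# Toy webgraph: nodes are SURTs, edges are internal hyperlinks.
webgraph = nx.DiGraph()
webgraph.add_edge("gr,vbanos)/", "gr,vbanos)/blog")
webgraph.add_edge("gr,vbanos)/blog", "gr,vbanos)/")

nx.write_graphml(webgraph, "webgraph.graphml")  # GraphML export
nx.write_gml(webgraph, "webgraph.gml")          # GML export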
4.3.2 Web Crawling Framework
The development of multiple alternative web crawling methods requires an appropriate code base. We implement a special framework for WebGraph-It which simplifies the web crawler creation process; we use it for the implementation of the alternative web crawling algorithms presented in Section 4.2.2. The basic functionality of user input/output, storage and logging is common to all web crawlers. The developer only needs to create a Python module with three methods:
• check_url: Check if we should continue to follow a URL.
• process: Analyse a captured webpage and extract information.
• capture: Download the webpage from a URL.
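To give a feel for such a module, the following is a deliberately simplified, self-contained sketch of the three methods; the real framework passes richer target objects and uses its own storage and queueing helpers, as Listing 4.11 below shows.

import re
import urllib.request

visited_urls = set()

def check_url(url):
    # Decide whether the crawler should follow this URL.
    return url not in visited_urls

def capture(url):
    # Download the webpage behind the URL.
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

def process(url, html):
    # Analyse a captured webpage: mark it as visited and extract outgoing links.
    visited_urls.add(url)
    links = re.findall(r'href="(http[^"]+)"', html)
    return [link for link in links if check_url(link)]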
We present the Python implementation of the basic web crawling algorithm 퐶1 from Section 4.2.2 in Listing 4.11.
from lib.crawl_list_class import CrawlList
from lib.crawling_utils import enqueue_capture
from app.models.base import db
from app.models.crawl import Crawl
from app.models.page import Page

def permit_url(target):
    crawl_list = CrawlList(target.crawl_id)
    return not crawl_list.is_visited(target.url)

def capture(target):
    current_crawl = Crawl.query.filter(Crawl.id == target.crawl_id).first()
    crawl_list = CrawlList(target.crawl_id)
    if permit_url(target):
        if not current_crawl.has_available_pages():
            return
        if not target.get_html():
            return
        new_page = Page(
            crawl_id=target.crawl_id,
            url=unicode(target.url),
        )
        db.session.add(new_page)
        db.session.commit()
        if target.unique_key and new_page.id:
            crawl_list.add_visited(target.unique_key, new_page.id)
        target.page_id = new_page.id
        if target.links:
            target.save_links()
            enqueue_capture("standard", capture, target, target.links)

def process(target):
    crawl_list = CrawlList(target.crawl_id)
    crawl_list.clear()
    enqueue_capture("standard", capture, target, (target.url,))
Listing 4.11: 퐶1: Basic web crawling algorithm python implementation
4.4 Evaluation
We present the evaluation method and explain in detail the evaluation of a single website with all web crawling algorithms to demonstrate the process. Then, we present and discuss the results of the evaluation of a significant set of websites. Finally, we also present some auxiliary experiments to define the optimal webgraph cycle distance variable and to study the behavior of the system when analysing a web spider trap.
4.4.1 Methodology
Our evaluation aims to explore the behavior and the results of all the web crawling algorithms defined in Section 4.2.2 and to draw conclusions regarding each key web crawling concept introduced in Section 4.2.1. We study the quality and the completeness of the web crawling algorithms' results. We also evaluate their speed. We need to evaluate whether all webpages and hyperlinks are captured and the time needed to perform this task. We compare these findings with the standard baseline web crawling, which does not include any optimisations, to evaluate the effects of our methods. In our experiments, we use Debian GNU/Linux 8, Python 2.7.9 and a virtual machine with 3 CPU cores and 1 GB of RAM. All experiments presented here can be reproduced using the WebGraph-It system online at http://webgraph-it.com/. All web crawling tasks are performed in parallel using 4 processes (Python workers) as presented in Section 4.3. We perform the following steps for our evaluation:
1. We select 100 random websites from the Alexa top 1M websites10 as a dataset.
2. We run 8 subsequent web crawls for each website with the WebGraph-It system using the 8 different web crawling algorithms presented in Section 4.2.2 (퐶1 − 퐶8). We produce 8 different result sets (푅1 − 푅8) for each website.
3. We record the specific metrics for each web crawl. All variables and metrics are presented and explained in Table 4.3.
4. We analyse the results and reach specific conclusions after the completion of all web crawls.
Symbol   Explanation
퐶푖       Web crawl
푅푖       Web crawl total results
퐷푖       Web crawl duration
푊푖       Captured webpages
퐿푖       Captured internal links from webpages
퐶푌푖      Webgraph cycles detected
퐶푂푖      Completeness: the percentage of information contained in a web crawl result set compared with the respective base web crawl result set
Table 4.3: Variables used in the evaluation, 푖=1-8
For evaluation purposes, it is necessary to have a baseline to which the web crawling methods can be compared. We define the base crawl 퐶1 (Listing 4.3) as the fundamental method to crawl websites without any attempt at optimisation. All metrics are calculated as percentages relative to the base crawl measurements. For instance, if the base crawl captures 100 webpages and method 퐶5 captures 90, its 푊5 value is 0.9.
10http://s3.amazonaws.com/alexa-static/top-1m.csv.zip, accessed August 1, 2015
The completeness metric needs further explanation: for each web crawl except the base one (i.e. 퐶2 − 퐶8), we need to evaluate whether our proposed methods succeed in capturing all webpages of the target website. This is necessary because we may accelerate the web crawling process but fail to capture all content. This behavior may not be acceptable for some applications where completeness is a key issue (e.g. web archiving), but it may be acceptable for other cases, such as search engines. To evaluate the completeness of each web crawl, we conduct the following steps:
1. We crawl each website using the standard method (퐶1) and save the results (푅1).
2. We crawl using every other method 퐶푖 and save their results 푅푖 (for 푖=2,...,8).
3. For each webcrawl's results 푅푖, we check whether every webpage 푊푖 captured in the base crawl 푅1 is available in 푅푖. To achieve this, we use the simhash of each 푊푖 in 푅1 and compare it with the simhashes of all webpages in 푅푖.
4. We calculate the completeness 퐶푂푖 of each webcrawl's results as the percentage of found webpages against the number of base webcrawl results; a minimal sketch of this calculation follows the list.
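The following Python sketch illustrates the completeness calculation over lists of simhash fingerprints; it is an assumption-based illustration of steps 3-4 rather than the exact evaluation code used in our experiments.

def simhash_similarity(hash_a, hash_b, bits=64):
    # Fraction of identical bits between two simhash fingerprints.
    return (bits - bin(hash_a ^ hash_b).count("1")) / bits

def completeness(base_hashes, result_hashes, threshold=0.95):
    # Share of base-crawl webpages found, exactly or as >=95% near-duplicates,
    # among the webpages of another crawl's result set.
    if not base_hashes:
        return 1.0
    found = sum(
        1 for base_hash in base_hashes
        if any(simhash_similarity(base_hash, candidate) >= threshold
               for candidate in result_hashes)
    )
    return found / len(base_hashes)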
We note that we do not apply exact similarity to evaluate whether a webpage is present in the webcrawl results; we also look for near-duplicates with a threshold of 95% similarity. It is important to outline the conditions of our experiments. There is a maximum limit on the number of webpages each target website may have in our system. This prevents a website with a nearly infinite number of webpages from crashing our web crawler. This is a prototype application and the implementation of a web-scale crawler is beyond its scope. Thus, we have set an arbitrary limit of 300 webpages per crawl. In some cases, websites have fewer webpages than this number, which is not a problem, whereas in other cases, we stop the web crawling at this limit. What happens in practice is that the naive web crawling process (퐶1) would surely include a number of duplicates in the first 300 webpages of such a website. All methods would reach this limit of 300 webpages, but they would include different webpages, with the most simplistic methods containing duplicates to a significant extent. Another fact is that the experiment is performed using a single server and network connection. Any potential test server network or system issues would affect the results of the experiment. To minimise this effect, we conduct all web crawling operations on a target website together, in sequential order (퐶1 − 퐶8), and without any time gap between each web crawl.
4.4.2 Example
We present the detailed evaluation of a single website as an example to indicate the operation of our system and the results of all web crawling methods in great detail (Table 4.4). For captured webpages (푊푖), links (퐿푖) and duration (퐷푖), we present not only the absolute numbers but also the percentages compared with the base crawl results. We also present the completeness (퐶푂푖) as a percentage.
Table 4.4: Results from all methods for a single website, http://deixto.com
Method                                                   푊푖 (%)        퐿푖 (%)        퐷푖 (%)       퐶푂푖 (%)
퐶1 (URL)                                                 151  1.000    5131  1.000   199  1.000   1.000
퐶2 (SURT)                                                150  0.993    4982  0.971   211  1.060   0.987
퐶3 (URL + Unique key similarity)                         128  0.848    3980  0.776   178  0.894   0.887
퐶4 (SURT + Unique key similarity)                        133  0.881    4208  0.820   175  0.879   0.894
퐶5 (URL + Unique key similarity + content similarity)    152  1.007    5163  1.006   229  1.151   0.993
퐶6 (SURT + Unique key similarity + content similarity)   150  0.993    4980  0.971   345  1.734   0.997
퐶7 (URL + Cycle detection)                               126  0.834    5130  1.000   202  1.015   0.993
퐶8 (SURT + Cycle detection)                              124  0.821    4982  0.971   200  1.005   0.987
4.4.3 Results
We perform a total of 800 web crawls using our dataset of 100 websites. We capture data as presented in the example evaluation of the previous section. We summarise the results and calculate statistics such as the average, median, minimum, maximum and standard deviation to study them and come up with conclusions.
Table 4.5: 푊푖: Captured webpages difference between all webcrawls and the base crawl. Lower is better.

Id    Average   Median   Min     Max     StDev
푊1    1.000     1.000    1.000   1.000   0
푊2    0.885     0.988    0.504   1.000   0.192
푊3    0.605     0.593    0.201   1.000   0.261
푊4    0.589     0.547    0.305   1.000   0.270
푊5    0.907     0.900    0.815   1.000   0.070
푊6    0.861     0.880    0.451   1.000   0.242
푊7    0.669     0.649    0.285   0.868   0.209
푊8    0.606     0.582    0.285   0.868   0.229
The captured webpages 푊푖 are presented in Table 4.5. These results need to be evaluated along with the completeness 퐶푂푖 of each algorithm, which is presented in Table 4.6. The most efficient algorithm is the one capturing the fewest webpages while also having the highest completeness. Another important parameter is the duration of web crawling. A good algorithm has to be efficient. Performance does not only depend on the number of webpages downloaded but also on the computations required by the algorithm to decide the best web crawling process. The performance of each algorithm is presented in Table 4.7. It is also interesting to see the number of captured links for each webcrawl, as presented in Table 4.8. The standard deviation values for all metrics are quite low, increasing the confidence in our results.
Table 4.6: 퐶푂푖: Completeness of each web crawling method. Higher is better.

Id     Average   Median   Min     Max   StDev
퐶푂1    1.000     1.000    1.000   1     0
퐶푂2    0.986     0.991    0.958   1     0.016
퐶푂3    0.698     0.800    0.215   1     0.263
퐶푂4    0.723     0.814    0.312   1     0.230
퐶푂5    0.989     0.995    0.965   1     0.014
퐶푂6    0.982     0.986    0.944   1     0.020
퐶푂7    0.985     0.995    0.951   1     0.018
퐶푂8    0.983     0.982    0.979   1     0.009
Table 4.7: 퐷푖: Duration difference between all webcrawls and the base crawl. Lower is better.

Id    Average   Median   Min     Max     StDev
퐷1    1.000     1.000    1.000   1.000   0
퐷2    0.726     0.905    0.150   0.946   0.333
퐷3    0.492     0.545    0.151   0.728   0.218
퐷4    0.419     0.370    0.185   0.750   0.238
퐷5    1.168     1.106    0.155   1.663   0.538
퐷6    1.152     1.181    0.213   1.707   0.594
퐷7    0.955     1.004    0.182   1.160   0.294
퐷8    0.829     0.805    0.210   1.070   0.350
Table 4.8: 퐿푖: Captured links difference between all webcrawls and the base crawl. Lower is better.

Id    Average   Median   Min     Max     StDev
퐿1    1.000     1.000    1.000   1.000   0
퐿2    0.846     0.961    0.386   0.984   0.231
퐿3    0.516     0.561    0.072   0.920   0.272
퐿4    0.508     0.526    0.163   1.000   0.299
퐿5    0.929     0.941    0.864   0.960   0.035
퐿6    0.859     0.915    0.361   0.956   0.228
퐿7    0.887     0.893    0.820   1.005   0.106
퐿8    0.857     0.873    0.801   1.000   0.238
Next, we look closely into the results of all algorithms and draw some conclusions. 퐶2 is similar to the standard algorithm, with the only difference that it uses SURT instead of URL for unique webpage identification. 퐶2 captures less data than 퐶1 (average(푊2)=0.885, average(퐿2)=0.846), while having very high completeness (average(퐶푂2)=0.986) and significantly lower time spent (average(퐷2)=0.726). This means that the use of SURT is superior to the use of URL as a unique identifier for web crawling. 퐶2 is a small improvement over the base algorithm regarding captured content (∼11%) but scores much better regarding web crawling performance (∼37%).
Algorithms 퐶3 and 퐶4 have in common the use of unique key similarity to identify duplicate web content. They capture very little content (average(푊3)=0.605, average(푊4)=0.589) and their results are quite incomplete (average(퐶푂3)=0.698 and average(퐶푂4)=0.723). Hence, we believe that they are not suitable for accurate web crawling, as they would miss a large subset of the target website. Nevertheless, we mention that their performance is very good, which may be due to the fact that they skip a large subset of the target website (average(퐷3)=0.492, average(퐷4)=0.419). One possible reason to use these algorithms would be to perform web crawling for sampling purposes. The process would be very fast compared with regular web crawling but the results would be a subset of the total website.
Algorithm 퐶5 uses URLs as unique identifiers, URL near-duplicate detection and webpage content near-duplicate detection. Its results show marginal gains over the base algorithm (average(푊5)=0.907, average(퐿5)=0.929). Algorithm 퐶6, which uses SURTs instead of URLs, shows similar results (average(푊6)=0.861, average(퐿6)=0.859). The great drawback of these methods (퐶5, 퐶6) is that they are considerably slower than all other methods (1.168 and 1.152 average values, respectively). They are even slower than the baseline web crawling algorithm. This behavior is attributed to the fact that they compare every new webpage they capture with all already captured pages, as presented in Section 4.2.1. Despite the fact that we use simhash signatures to achieve very fast webpage content near-similarity evaluation, it is still an inefficient choice. Regardless of the test system specifications, using these methods at large scale would require considerably more computing resources than standard web crawling.
Finally, we study the results of the algorithms using cycle detection (퐶7, 퐶8). They succeed in capturing less content than not only the base algorithm but also 퐶2, 퐶5 and 퐶6. We see that average(푊7)=0.669 and average(푊8)=0.606. We also remark that average(퐿7)=0.887 and average(퐿8)=0.857. At the same time, their completeness scores are very good, average(퐶푂7)=0.985 and average(퐶푂8)=0.983. Regarding web crawling duration, they are not much faster than the base crawl and are slower than 퐷2 (i.e. average(퐷7)=0.955 and average(퐷8)=0.829). Comparing 퐶7 with 퐶8, we see that 퐶8 is faster and also has better scores regarding captured web content. The two algorithms are almost equal regarding results completeness.
Our conclusion is that algorithm 퐶8 is the best choice for web crawling when considering captured web content quality and accuracy. 퐶8 managed to capture 40% (1−0.606) less duplicate website content with very high accuracy (average(퐶푂8)=0.983) in approximately the same time as the standard web crawling algorithm. On the other hand, if we consider web crawling duration as our top priority and would like to get a sample of a website in very little time compared with full web crawling, we should use 퐶4, which was ∼58% faster than the standard algorithm (average(퐷4)=0.419).
4.4.4 Optimal DFS Limit for Cycle Detection
The cycle detection algorithms defined in Section 4.2.1 use Depth-First Search (DFS) to perform searches in webgraphs and detect cycles. The distance limit for webgraph searches is important because it has an impact on the performance and accuracy of cycle detection. If the limit is very small, cycle detection will be fast; however, it will not evaluate many nodes and it may miss cycles. If the limit is large, the performance of the algorithm will suffer. To identify the optimal distance limit at which to stop searches, we perform the following experiment:
1. We use the websites from our previous experiment as a dataset (Section 4.4.3).
2. We set a maximum limit equal to 5 and perform only the cycle detection algorithms (퐶7, 퐶8).
3. When we detect a cycle, we record the distance of the respective node.
4. We count the occurrences of cycles for each distance limit in the range [1,4] and present them in Table 4.9.
Table 4.9: Number of cycles for each distance limit

Distance   Cycles   Percentage
1          10.615   85.19%
2          1.056    8.47%
3          600      4.81%
4          189      1.51%
Based on the outcomes of this experiment, we limit the search distance to 3 nodes.
4.4.5 Web Spider Trap Experiment
We conduct a simple experiment to showcase the operation of our web crawling algorithms in the case of a web spider trap. We set up a simple web spider trap with a PHP script (Listing 4.12) on the author's website at http://vbanos.gr/trap/. The web spider trap generates an infinite number of URLs with the format presented in Listing 4.13. Each time a web spider visits the webpage, a new set of random URLs is created. This process is repeated indefinitely.
<html>
<body>
<h1>Lorem Ipsum is simply dummy text</h1>
<ul>
<?php
// Generate a handful of links with random parameters; each visit produces new URLs.
for ($i = 0; $i < 3; $i++) {
    $var = rand(1, 1000);
    $url = "http://vbanos.gr/trap/index.php?var={$var}";
    $label = "Randomly generated label {$var} for testing purposes";
    echo "<li><a href=\"{$url}\">- {$label}</a></li>";
}
?>
</ul>
<p>Lorem Ipsum is simply dummy text of the printing and
typesetting industry. Lorem Ipsum has been the industry's
standard dummy text ever since the 1500s, when an unknown
printer took a galley of type and scrambled it to make a type
specimen book.</p>
</body>
</html>

Listing 4.12: Simple PHP web spider trap
http://vbanos.gr/trap/index.php?var=123
http://vbanos.gr/trap/index.php?var=412
http://vbanos.gr/trap/index.php?var=548
Listing 4.13: Example web spider trap hyperlink outputs
We initiate a new experiment with the spider trap URL as a target and a limit of 100 webpages for all web crawls. The results are presented in Table 4.10. The naive web crawling algorithms 퐶1 and 퐶2 fall into the trap and capture 100 and 103 webpages, respectively. 퐶2 captures more than 100 webpages because our system runs 4 web crawling processes in parallel and, when the maximum limit is reached, processes need to complete their current web crawling task before exiting. The potential maximum number of captured webpages is the limit plus the number of processes.
Table 4.10: Web spider trap crawling results

Method   푊     퐷
퐶1       100   64
퐶2       103   66
퐶3       11    12
퐶4       8     18
퐶5       41    97
퐶6       35    78
퐶7       4     4
퐶8       3     5
The algorithms using URL/SURT similarity (퐶3, 퐶4) stop web crawling at different points, depending on when a generated URL is 95% similar to the URL of an already captured webpage. This point depends on the web spider trap's URL generation algorithm; if it is complex, multiple webpages may need to be captured. The algorithms using URL/SURT and content similarity (퐶5, 퐶6) behave in the same way. They stop after capturing more webpages than 퐶3/퐶4, because the probability of having both a similar webpage URL and similar content is smaller than the probability of having just a similar URL. The cycle detection algorithms have the best performance. They evaluate nearby webpages and use their content similarity to find near-duplicates. This is exactly the case with our experimental web spider trap, so these web spiders stop crawling very quickly.
4.5 Conclusions and Future Work
We presented our work towards the improvement of web crawling performance and efficiency using near-duplicate web content detection and webgraph cycle detection. New concepts were introduced to improve the web crawling process: (i) the selection of URL or SURT as the unique webpage identifier, (ii) the use of unique webpage identifier similarity to detect duplicates and near-duplicates, (iii) the use of webpage content similarity for the same purpose, and, (iv) the application of webgraph cycle detection. Using these concepts, we designed and implemented 8 web crawling algorithms and performed extended experiments to study their behavior. In addition, we presented an implementation of our algorithms via the WebGraph-It platform, a web system available at http://webgraph-it.com which enables users to analyse websites, perform test web crawls and generate webgraphs. The concepts introduced in this work could lead to the implementation of many more web crawling algorithms besides the 8 we have implemented and tested. There could be many variations of the existing parameters and combinations of similarity and near-similarity criteria to produce many more algorithms. We decided to focus on the presented ones because we believe that they are indicative of our work. Our aim was to enable other researchers and web crawling engineers to learn and evaluate these concepts, and to be able to integrate them in other new or existing web crawling systems. The outcomes of our research provide many useful insights. The use of the Sort-friendly URI Reordering Transform (SURT) for webpage URLs during web crawling results in improved web crawling performance and reduced duplicate captured content. We believe that this method should be used universally by web crawlers as it is easy to implement, requires little existing code modification, incurs little performance overhead and yields great results. The application of near-duplicate detection in URLs or SURTs should not be used independently because it results in incomplete web crawling results: a lot of web content is mistaken as duplicate when it is not. Near-duplicate URL/SURT evaluation should be used along with webpage content near-duplicate detection. This method has far more accurate results. The problem is that it does not scale well, as we have demonstrated in the evaluation (Section 4.4.3). On the other hand, the webgraph cycle detection algorithms we have introduced in this work have great results and potential. The best performing algorithm in our experiments is 퐶8, which uses SURT for unique webpage identification and webgraph cycle detection to identify cycles and highlight duplicate and near-duplicate webpages. In addition, our method has the benefit of low computing requirements. It is only necessary to maintain a webgraph in memory with very little data for each node (URL, SURT, simhash signature, internal links) to evaluate potential duplicates. Using the framework we developed in the context of WebGraph-It to enable easy web crawling algorithm implementation (Section 4.3.2), we aim to evolve existing web crawling algorithms and create new ones. Also, we plan to implement our methods in other existing open web crawlers such as the BlogForever platform [16]. Finally, we aim to launch a public web service via http://webgraph-it.com to provide users with web crawling and webgraph generation services.
The applications of web crawling optimisation, webgraph generation and web spider trap detection are numerous. Web crawling engineers would be able to streamline their web crawling operations by identifying web spider traps and other problematic webpages, researchers would be able to obtain quality web crawling data and webgraphs for experimentation, and students would be able to learn more about the web, web crawling and webgraphs.

Chapter 5
The BlogForever Platform: An Integrated Approach to Preserve Weblogs
We present BlogForever, a new system to harvest, preserve, manage and reuse weblog con- tent. We present the issues we resolve and the methods we use to achieve this. We survey the technical aspects of the blogosphere and we outline the BlogForever data model, system architecture, use cases, experiments and results1.
1This chapter is based on the following publications:
• Kalb H., Lazaridou P., Banos V., Kasioumis N., Trier M.: "BlogForever: From Web Archiving to Blog Archiving", Proceedings 'Informatik Angepasst an Mensch, Organisation und Umwelt' (INFORMATIK), Koblenz, Germany, 2013.
• Banos V., Baltas N., Manolopoulos Y.: “Blog Preservation: Current Challenges and a New Paradigm”, chapter 3 in book Enterprise Information Systems XIII, by Cordeiro J., Maciaszek L. and Filipe J. (eds.), Springer LNBIP Vol.141, pp.29–51, 2013.
• Kasioumis N., Banos V., Kalb H.: “Towards Building a Blog Preservation Platform”, World Wide Web Journal, Special Issue on Social Media Preservation and Applications, Springer, 2013.
• Banos V., Baltas N., Manolopoulos Y.: “Trends in Blog Preservation”, Proceedings 14th International Conference on Enterprise Information Systems (ICEIS), Vol.1, pp.13-22, Wroclaw, Poland, 2012.
• Banos V., Stepanyan K., Manolopoulos Y., Joy M., Cristea A.: "Technological Foundations of the Current Blogosphere", Proceedings 2nd International Conference on Web Intelligence, Mining & Semantics (WIMS), ACM Press, Craiova, Romania, 2012.
5.1 Introduction
We present how specialised blog archiving can overcome problems of current web archiving. We introduce a specialised platform that exploits the characteristics of blogs to enable improved archiving. In summary, we identify several problems for blog archiving with current web archiving tools:
• Aggregation scheduling is performed at fixed time intervals without considering website updates. This causes incomplete content aggregation if the update frequency of the contents is higher than the schedule predicts [71, 137].
• Traditional aggregation uses brute-force methods to crawl without taking into account which content of the target website has been updated. Thus, the performance of both the archiving system and the crawled system is affected unnecessarily [137].
• Current web archiving solutions do not exploit the potential of the inherent structure of blogs. While blogs provide a rich set of information entities, structured content, APIs, interconnections and semantic information [89], the management and end-user features of existing web archives are limited to primitive features such as URL Search, Keyword Search, Alphabetic Browsing and Full-Text Search [137].
Our research and development aims to overcome these problems by exploiting blog characteristics. Additionally, while specialisation can solve existing problems, it introduces additional challenges for interoperability with other archiving and preservation facilities. For example, the vision of seamless navigation and access of archived web content requires the support and application of accepted standards [125]. Therefore, our targeted solution aims to:
• improve blog archiving through the exploitation of blog characteristics, and,
• support the integration with existing archiving and preservation facilities.
The rest of this chapter is structured as follows: In Section 5.2, we perform a technical survey of the Blogosphere to identify weblogs’ structures, data and semantics. In Section 5.3, we present the rationale and identified user requirements behind system design decisions. We proceed with the presentation of the resulting system in Section 5.4 and the implementation in Section 5.5. We present the evaluation in Section 5.6, before we discuss some issues of our solution in Section 5.7.
5.2 Blogosphere Technical Survey
It is important to achieve a better understanding of the Blogosphere in order to aggregate, manage and archive weblogs. It is necessary to explore patterns in weblog structure, data and semantics, weblog-specific APIs, social media interconnections and other unique blog characteristics. We conduct a large-scale evaluation of active blogs and we review the adoption of an extensive list of technologies and standards. Finally, we compare the results with existing findings from the web to identify similarities and differences.
There are already important initiatives aiming to identify and record how the web is constructed and what its main ingredients are. The HTTP Archive is a permanent repository of web performance information and technologies utilised. W3Techs also provides information about the usage of various types of technologies on the web [142]. Alexa Internet is a company that has been maintaining a database of information about sites, including technical information, since 1996. There are also many initiatives which gather web and especially blog information via user surveys and online questionnaires. Technorati's State of the Blogosphere is the most high-profile one, but there are also others such as The State of Web Development. However, while some of the above-mentioned initiatives publish descriptive statistics about the technological foundations of the Blogosphere, the scope and depth of these studies remain limited. For instance, while Technorati may publish basic statistics about the most widely adopted platforms and popular devices for accessing blogs [131], the use of libraries, formats and tools remains beyond the focus of the review. To the best of our knowledge, there is no other initiative that conducts technical surveys and evaluates the technological foundations of the Blogosphere. This work addresses this gap. In the following, we present the technical aspects of the survey implementation, detailed results and conclusions. We also highlight interesting differences between the generic web and the Blogosphere.
5.2.1 Survey Implementation
We evaluate the use of third-party libraries, external services, semantic mark-up, metadata, web feeds, and various media formats in the blogosphere. We use a relatively large set of blogs drawn from various data sources, as presented in Table 5.1. All datasets were downloaded on August 12, 2011.
Description                                         Initial resources   Valid resources
Blogs from http://weblogs.com ping server           259.286             209.560
Technorati top 100 resources                        100                 90
Blogpulse top 40 resources                          40                  35
BlogForever project survey user contributed blogs   504                 145
Total                                               259.930             209.830
Table 5.1: Datasets
We choose to use http://weblogs.com/ for this evaluation for two reasons. First, it is a widely accepted and popular hub service in the Blogosphere, which makes it suitable for conducting a broad survey with a large sample of blogs. Secondly, it publishes a list of resources updated within the last hour. Using a list of recently updated resources can eliminate abandoned or inactive blogs, which constitute about half of all blogs. Blog ping servers receive notifications when new content is published on blogs via the XML-RPC-based ping mechanism and, subsequently, notify their subscribers about recent updates. http://weblogs.com/ receives millions of blog update notifications every day. We also use other resources such as the list of top 100 blogs published by Technorati.com, the top 40 blogs published by Blogpulse.com and a collection of blogs acquired from the BlogForever Weblog Survey [130]. The inclusion of additional blogs shared by participants of the survey extends the automatically generated list of blogs with a set of selectively contributed ones.
On the other hand, the use of Technorati and Blogpulse provides a potential for enriching the evaluation. Technorati and Blogpulse are among the earlier and established authorities on indexing, ranking and monitoring blogs. The inclusion of top blogs from Technorati and Blogpulse enables a comparative analysis between the more general Weblogs.com cohort and the list of highly ranked blogs. The overall number of accessed blogs is 259,930. HTTP response codes are recorded. Items without a valid HTTP status code are discarded. 94% of all the received status codes are successful. The total number of valid (i.e. Response Status Code: 200) records surveyed is 209,830. The summary of the registered response codes is shown in Figure 5.1.
Figure 5.1: HTTP Status response codes registered during data-collection
An informed decision was made on the time of collecting the data. The choice of a specified time frame is justified by the anticipated increase in publishing activity of blogs in European and other states within time zone proximity. The XML file is parsed and URL entries are extracted for further processing. The URL entries are filtered to distinguish between updated resources and their hosting websites. Duplicate entries are removed. The number of accessed resources contains all the URLs that have been extracted and followed by the survey script. We generate the datasets by accessing the relevant XML feeds published online by the above-mentioned resources. For each URL in the datasets, we try to access it via standard HTTP, retrieve all output hypertext and related files (e.g. CSS, images) and store them on our server. Then, we scan the data to identify the presence of specific technologies, tools, standards and services via their signatures. For the data collection, we implement custom software using PHP 5.3 for the core application and we utilise the cURL network library to implement communication with the blogs via HTTP. Regular expressions are used to parse the blog source code and evaluate the use of certain technologies. We also use Bash to implement process management and file I/O. The software is a Linux command line application which requires a URL list as input and outputs results in CSV files. For each URL of the input, the application performs an HTTP request and retrieves the respective HTML code. Subsequently, a set of regular expressions is executed, one for each technology or digital object type we try to detect, and the results are stored in a comma delimited CSV file. It must be noted that input URLs can be blog base URLs but also specific blog post URLs. In either case, the software retrieves the specific URL's HTML code and proceeds to parse and analyse it. The complete software for implementing this survey is freely available via github2. A simplified sketch of this signature-based detection follows the list below. To evaluate the use of certain technologies, we parse the source code of the accessed resources and look for evidence of adopted technologies. The technologies we consider as part of this evaluation are summarised in the following (+count indicates that the number of identified occurrences was counted):
• Content Type
• CSS (+count)
• Dublin Core (+count)
• Embedded YouTube video
• Facebook
• FOAF
• Flash (+count)
• Google+
• HTML
• HTTP Response Status Code (200, 404, etc.)
• Image tags (BMP, WEBP, JPG, PNG, GIF) (+count)
• JavaScript and specific libraries (Dojo, ExtCore, JQuery, JQueryUI, MooTools, Prototype, YUI)
• Microdata
• Twitter
• Microformat-hCard
• Microformat-XFN
• Open Graph Protocol (+count)
• Other MIME Types (see Table 2)
• Open Search
• Pingback
• RDF (+count)
• SIOC
• Software/Platform
• XHTML
• XML Feeds (Atom, Atom-comments, RSS, RSS-comments).
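For illustration only, the following Python sketch mimics the signature-based detection approach described above; the actual survey tool is written in PHP 5.3 and uses a far larger set of regular expressions, so the patterns shown here are simplified assumptions.

import re
import urllib.request

# Simplified, illustrative signature patterns for a few of the surveyed technologies.
SIGNATURES = {
    "RSS feed": re.compile(r"application/rss\+xml", re.I),
    "Atom feed": re.compile(r"application/atom\+xml", re.I),
    "Open Graph Protocol": re.compile(r"property=[\"']og:", re.I),
    "jQuery": re.compile(r"jquery[^\"']*\.js", re.I),
    "WordPress": re.compile(r"wp-content|content=[\"']WordPress", re.I),
}

def detect_technologies(url):
    # Fetch the page and report which signatures appear in its HTML source.
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return [name for name, pattern in SIGNATURES.items() if pattern.search(html)]

print(detect_technologies("http://example.com/"))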
5.2.2 Results
Platforms and Software
We obtain information about the blog hosting platform and software from two blog attributes: a) the HTTP response headers of the blog and b) the generator meta tag in the HTML markup. The blog software version is also present. The most frequent blog platforms that appear in the studied cohort are WordPress (36%) and Blogger (19%). Technorati, similarly to our findings, reported WordPress, followed by Blogger, to be the platforms of choice. However, the number of WordPress instances observed within the studied dataset is considerably lower than the 51% reported by Technorati. Similar observations are made in relation to the Blogger platform. These differences may be due to a large number of cases (40%) for which information about the platform remained hidden.
2https://github.com/BlogForever/crawler, accessed: August 1, 2015
A considerable number of instances are registered for Typepad, vBulletin, Discuz! and Joomla. Among the other frequently appearing platforms (2%) are Webnode, PChoc, Posterous, Blogspirit, DataLife Engine and BlueFish. The total number of unique platforms registered is nevertheless considerable, totalling 469; even combined, however, they do not exceed 19% of the entire list of studied blogs. It remains an open question why a large number of blogs do not disclose the platforms they are built on. Further investigation is required to identify whether some blogs prefer not to acknowledge the use of a certain blogging engine or whether they are based on custom systems.
Figure 5.2: Frequency of weblog software platforms
There is considerable variation across the most popular software platforms used, and the consistency in specifying versions of adopted software varies too. However, it is still possible to identify the extent of adoption and noticeable patterns within the studied corpus. First, it becomes apparent that a large number of websites are maintained without a software upgrade, despite the availability of more recent versions. For instance, 20% of all Movable Type blogs continue using version 3, as shown in Figure 5.4, despite the availability of several later versions. There is a similar pattern for WordPress, with around 13% (and some of the generic 4%) of WordPress users running versions released between 2004 and 2009, despite the availability of newer releases. While the number of earlier versions across active blogs remains substantial, the majority of installations (on average around 75%) use more recent versions. These results are limited to the software packages that do specify their versions; among the providers that do not specify version information are Blogger, Typepad and Joomla.
Character Encoding
Documents transmitted via HTTP are expected to specify their character encoding. Often referred to as "charset", it represents a method of converting a sequence of bytes into a sequence of characters. When servers send HTML documents to user agents (e.g. browsers) as a stream of bytes, user agents interpret them as a sequence of characters. Due to the large number of characters throughout written languages, and the variety of ways to represent them, charsets help user agents render and represent documents correctly. It is, therefore, recommended by the W3C [40] to label Web documents explicitly, using a meta element or specific HTTP headers as a way of conveying this information. An example of specifying character encoding is given below:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Figure 5.3: Variation in versions of WordPress software

Figure 5.4: Variation in versions of MovableType software
User agents are expected to work with any character encoding registered with IANA; however, support for a given encoding is bound to the implementation of a specific user agent. This evaluation records the use of content and charset attributes across the studied blogs, which enables commenting on the most widely used charsets and on the absence of the recommended labelling. Information about the types of documents distributed by blogs is also collected. The results suggest that text/html is the most widely (61%) specified content type within the studied corpus. Other types constitute less than 1% and include application/xhtml, application/xml, application/xhtml+xml and application/vnd.wap.xhtml+xml, as well as text/xml, text/javascript, text/phpl, text/shtml and text/html+javascript. A considerable number of accessed resources were not labelled.
Figure 5.5: Variation in versions of vBulletin software
Figure 5.6: Variation in versions of Discuz! software
In addition to content type, we capture and analyse information about encoding. UTF-8 is the most frequently used encoding; other identified charsets do not exceed 6%, and encoding information is not specified or remains unidentified in 39% of the cases (Figure 5.7). The number of blog instances that do not specify charset information is worthy of note. Within the 6% of other charset specifications, 48 distinct records are identified. The most common among them are iso-8859-1 (48%), euc-jp (23%), shift-jis (8%) and windows-1251 (6%); see Figure 5.8 for more details. The results demonstrate that the overwhelming majority of studied resources are distributed in Unicode as text/html documents. A still considerable number of resources (6%) use alternative encodings. It may therefore be necessary to consider solutions for capturing and preserving blogs distributed in character sets other than UTF-8.
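A minimal sketch of how the content type and charset can be extracted follows, assuming the two sources named above: the HTTP Content-Type header and the meta declaration inside the page. The survey itself performed this with PHP regular expressions; the Python code and the 4 KB look-ahead limit are illustrative assumptions.

import re
import urllib.request

META_CHARSET = re.compile(r'charset=["\']?([\w-]+)', re.I)

def detect_charset(url):
    """Return (content_type, charset) as declared by the server or the page.

    The HTTP Content-Type header is checked first; if it carries no charset
    parameter, the first meta charset declaration in the HTML is used instead.
    """
    with urllib.request.urlopen(url, timeout=30) as resp:
        content_type = resp.headers.get_content_type()   # e.g. "text/html"
        charset = resp.headers.get_content_charset()      # e.g. "utf-8" or None
        if charset is None:
            head = resp.read(4096).decode("ascii", "replace")
            match = META_CHARSET.search(head)
            charset = match.group(1).lower() if match else None
    return content_type, charset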
Figure 5.7: Encoding of evaluated resources

Figure 5.8: Break down of the other 6% of character set attributes

CSS, Images, HTML5 and Flash

We discuss the findings of the study regarding CSS, HTML5, Flash and certain image file formats. The dataset includes:

• Number of embedded references to CSS files linked
• Presence of HTML5, based on the HTML5 document type declaration
• Number of Flash objects used, based on references to SWF files
• Number of png, gif, bmp, jpg, webp, wbmp, tiff and svg images used
Cascading Style Sheets (CSS) is a language that enables separation of content from presentation. Used primarily with HTML documents, CSS provides a common mechanism for shared formatting among pages, improved accessibility and greater flexibility and control over the presentation elements of various web documents. We find that most of the accessed resources use CSS elements (without distinguishing between CSS1 and CSS2). The average number of references to CSS is 1.94, suggesting frequent use of this technology; 81% of all the studied resources employ CSS. HTML5 is the fifth and (at the time of writing) most recent revision of the HTML language. HTML5 intends to improve on its predecessors and define a single markup language for HTML and XHTML. It introduces new syntactical features such as the video, audio and canvas elements.
Figure 5.9: Average number of images identified
Flash, also known as Macromedia/Adobe Flash, is a multimedia platform used for adding interactivity or animation to web documents. It is frequently used for advertisements, games, and streaming video or audio. Flash content is written in the object-oriented ActionScript programming language and allows the use of both vector and rasterised graphical content. The detection of Flash content within the studied resources was based on the use of the SWF format: accessed resources were searched for references to SWF files.
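The per-page counts behind Figures 5.9, 5.10 and 5.11 can be sketched as a simple aggregation. The extension-based matching below follows the detection rules listed earlier in this section; the exact expressions used by the PHP survey tool may differ.

import re

IMAGE_EXTENSIONS = ("png", "gif", "bmp", "jpg", "webp", "wbmp", "tiff", "svg")

def count_objects(html):
    """Count per-page occurrences of CSS links, SWF references and image types."""
    counts = {
        "css": len(re.findall(r'rel=["\']stylesheet["\']', html, re.I)),
        "swf": len(re.findall(r'\.swf\b', html, re.I)),
    }
    for ext in IMAGE_EXTENSIONS:
        counts[ext] = len(re.findall(r'\.%s\b' % ext, html, re.I))
    return counts

def averages(pages):
    """Average each counter over a corpus of HTML documents."""
    totals = {}
    for html in pages:
        for key, value in count_objects(html).items():
            totals[key] = totals.get(key, 0) + value
    return {key: value / len(pages) for key, value in totals.items()}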
Figure 5.10: Average use of BMP, SVG, TIFF, WBMP and WEBP formats
Figure 5.11: Distribution of images for pages with less than 20 images only
Semantic Markup
We investigate the use of metadata formats and associated technologies:
• Metadata: Dublin Core, The Friend of a Friend (FOAF), Open Graph Protocol (OG), Semantically-Interlinked Online Communities (SIOC)
• Microdata and Microformats: Microdata, hCard (Microformats), XFN (Microformats)
• Common Semantic Technologies: Resource Description Framework (RDF), Really Simple Discovery (RSD), Open Search
Metadata are commonly defined as data about data. Within the context of the Web, metadata are commonly understood as the descriptive text used alongside web content. Examples of metadata include keywords, associations or various content mappings. It is often necessary to standardise these descriptions to ensure the consistency and interoperability of web content. Referring to Dublin Core, Open Graph, SIOC and FOAF simply as metadata would be inaccurate; however, their use is discussed jointly due to similarities in their application. The summary of identified uses of metadata standards is presented in Figure 5.12. Open Graph (OG) is the most frequently used standard. Each instance of OG and DC mark-up has been counted: the average occurrence of OG is 5.7 per page, compared to 1.37 for DC.
Figure 5.12: Summary of metadata usage
We show the histogram of OG occurrences in Figure 5.13. The use of FOAF has been identified in only 561 cases, which constitutes less than 0.3% of all the studied pages; the overwhelming majority of evaluated resources did not use FOAF. Across the entire corpus of studied resources, no reference to SIOC is identified.
Figure 5.13: Histogram of Open Graph references
Microdata and Microformats are conceptually different approaches to enriching web content with semantic notation. This evaluation counted the number of resources where the presence of microdata or microformats has been identified. More specifically, when referring to microformats, the investigation distinguished between XFN, a way of representing human relationships using hyperlinks, and hCard, a simple, distributed format for representing people, companies, organisations and places. The presence of Microdata within a resource is based on locating itemscope and itemtype="http://schema.org/*" within a studied page. hCard and XFN microformats were identified, respectively, as class attributes with hcard values and rel attributes within anchor tags. To add a property to an item, the itemprop attribute is used on one of the item's descendants. We identify XFN in 74,709 cases, which constitutes 35.6% of the entire corpus. In contrast, the use of microdata and hCards is less frequent: only 27 instances of microdata are identified within the studied resources, and the number of identified hCards is limited to 607 (0.3%). A large portion of the studied corpus contains no evidence of either microdata or microformats. Common Semantic Technologies considered in this evaluation are limited to the use of the RDF language, Open Search and the Really Simple Discovery (RSD) format. We identify RDF using the application/rdf+xml resource content type. We identify OpenSearch using the application/opensearchdescription+xml content type and the relevant namespace declaration:

xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/"
Similarly, we identify RSD using the following namespace declaration:
xmlns="http://archipelago.phrasewise.com/rsd"
The use of RSD is widespread: about 74% of all the accessed resources use RSD. In contrast, only 567 records (0.3%) use RDF, and no references to Open Search are identified.
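Taken together, the microdata, microformat and semantic-technology checks described above reduce to a handful of string and attribute tests per page. The sketch below gathers them in Python; the list of XFN rel values and the exact attribute spellings are illustrative assumptions, not the survey's verbatim expressions.

import re

# A few common XFN relationship values (an assumption; the full vocabulary is larger).
XFN_RELS = ("friend", "acquaintance", "contact", "met", "colleague", "co-worker")

def detect_semantic_markup(html):
    """Flag the semantic-markup features discussed above in one HTML document.

    The survey checked resource content types and namespace declarations for
    RDF, RSD and OpenSearch; simple substring tests stand in for them here.
    """
    lowered = html.lower()
    return {
        "microdata": "itemscope" in lowered and 'itemtype="http://schema.org/' in lowered,
        "hcard": bool(re.search(r'class=["\'][^"\']*hcard', lowered)),
        "xfn": bool(re.search(r'rel=["\'][^"\']*(?:%s)' % "|".join(XFN_RELS), lowered)),
        "rsd": "application/rsd+xml" in lowered,
        "rdf": "application/rdf+xml" in lowered,
        "opensearch": "application/opensearchdescription+xml" in lowered,
    }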
XML Feeds
XML feeds, following the RSS and Atom protocols, are used across weblog platforms and services. Represented in a machine-readable format, web feeds enable data sharing among applications. The most common use of web feeds is to provide content syndication and notification of updates from multiple websites to a single application [67]. Aggregators or news readers are commonly used for syndicating web content by enabling users to subscribe to web feeds. The simple mechanisms for accessing and distributing web content justify the wide adoption of feeds on weblog platforms. We identify the use of web feeds by the presence of link tags with type="application/atom+xml" for Atom feeds and type="application/rss+xml" for standard RSS feeds, with an additional distinction for comment feeds where applicable. The results are outlined in Figure 5.14. RSS feeds are the most widely used (56%), while the use of Atom feeds (29%) is still common. 15% of RSS feeds are used distinctly for distributing the content of comments, yet no Atom feeds are identified for this purpose.
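The link-tag test just described can also be expressed with a small parser instead of regular expressions. The sketch below uses Python's standard HTML parser; treating a "comments" substring in the title attribute as the marker of a comment feed is an assumption, not the survey's formal rule.

from html.parser import HTMLParser

class FeedLinkFinder(HTMLParser):
    """Collect link elements advertising Atom or RSS feeds."""

    FEED_TYPES = {"application/atom+xml": "atom", "application/rss+xml": "rss"}

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attrs = dict(attrs)
        feed_type = self.FEED_TYPES.get((attrs.get("type") or "").lower())
        if feed_type:
            is_comments = "comment" in (attrs.get("title") or "").lower()
            self.feeds.append((feed_type, attrs.get("href"), is_comments))

# Usage:
# finder = FeedLinkFinder()
# finder.feed(html_source)
# print(finder.feeds)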
Figure 5.14: Use of XML feeds by type
JavaScript Libraries
We evaluate the use of the following popular JavaScript frameworks: Dojo3, Ext JS4, JQuery5, JQuery UI6, MooTools7, Prototype8 and the YUI Library9. We also discuss the use of Pingback services throughout the studied cohort. The use of JavaScript by each accessed resource has been quantified based on the number of *.js files linked or segments of JavaScript code embedded within the accessed document. The results suggest a wide adoption of JavaScript, with 82% of the entire studied corpus having at least one reference to JavaScript. The average number of JavaScript instances is large too: 12.5 instances per resource (Figure 5.15). Within the identified instances of JavaScript code, there are references to specific libraries and frameworks. Their use is identified by the reference to their name (e.g. dojo.js, jquery.js, etc.). The most frequently used technologies are JQuery, MooTools and the YUI Library. The cumulative use of Dojo, Ext Core, JQuery UI and Prototype constitutes just over 1% of all the accessed resources (Figure 5.16). Last, but not least, this section summarises the use of Pingback APIs. The identification of Pingback is based on the presence of link tags with the rel="pingback" attribute within the accessed resources. The results suggest that 46.4% of all the accessed resources used pingbacks. The use of other Linkback mechanisms, including Trackbacks and Refbacks, has not been considered in this evaluation; the use of other third-party libraries such as Google Analytics is also omitted.
3https://dojotoolkit.org/, accessed August 1, 2015
4https://www.sencha.com/products/extjs/, accessed August 1, 2015
5http://jquery.com/, accessed August 1, 2015
6https://jqueryui.com/, accessed August 1, 2015
7http://mootools.net/, accessed August 1, 2015
8http://prototypejs.org/, accessed August 1, 2015
9http://yuilibrary.com/, accessed August 1, 2015
Figure 5.15: Number of JavaScript instances identified
Figure 5.16: Number of identified JavaScript library/framework instances
Social Media
The rise of social media such as Facebook, Twitter and YouTube has a profound effect on people's blogging behaviour and on the Blogosphere in general. A large number of blogs already integrate mechanisms for easy distribution of their content on social media websites, and social media are used for promoting new posts and notifying readership about them. We summarise our investigation into the use of social media. To use Twitter, Facebook, Google+ and YouTube, it is necessary to integrate specific JavaScript libraries and XML namespaces with appropriate references to these web services. The results suggest that almost 4% of all the studied resources show evidence of integration with Facebook. The number of references to Twitter is marginal, with only a handful of identified instances. The adoption of Google+, on the other hand, is shown to be considerably higher, totalling 17.2% among the studied resources. This high number of instances is surprising, given that the service had been announced less than two months before the time of writing this report. We study the use of YouTube differently from that of the earlier discussed social media: each of the accessed resources was scanned for occurrences of embedded content from YouTube.
The frequency of embedded YouTube videos is summarised in Figure 5.17.
Figure 5.17: Frequency of embedded YouTube videos
File Formats
This evaluation was extended to consider the use of various file formats described as MIME types by the Internet Assigned Numbers Authority (IANA). It looked into some of the files categorised as audio, video, text and applications. Originally used to describe email content, the MIME standard now extends further and is used along with communication protocols such as HTTP. As with email, HTTP requires that certain transmitted data be described in a way for which the MIME specification is considered suitable. The full list of the studied file formats and the frequency of their use as part of the accessed resources is presented in Table 5.2. The results suggest that the most frequently used file type across the studied corpus is PDF (13,731 occurrences). Slightly less frequent (10,231) occurrences were recorded for MP3 audio files. The use of MS Word documents, AVI and MP4 videos is between 1,097 and 3,265 occurrences. No database or 3D reality files were identified within the studied corpus. Given the large number of resources studied as part of this evaluation, even the most frequently used file types constitute a small proportion: the use of MS Word and PDF documents is between 4.9% and 6.4% of all the studied resources, and the combined use of all audio and video files constitutes 9% of all the studied resources.
Single Posts versus Websites
The dataset published by Weblogs.com contains both URLs that refer to single posts/pages and URLs of general domains. The distinction between the two was introduced during the data collection stage.
File Extension    Application        Instances
pdf               Word Processing       13,731
mp3               Audio                 10,231
mp4               Video/Audio            3,265
avi               Video                  3,265
3gpp              Video                  1,429
doc               Word Processing        1,097
ods               Spreadsheet              722
txt               Word Processing          641
odd               Presentation             618
mpg               Video                    613
odb               Database                 153
docx              Word Processing          147
xls               Spreadsheet              138
mov               Video                     71
ppt               Presentation              67
odf               Math Formulas             63
odt               Word Processing           51
mpeg              Video                     36
xlsx              Spreadsheet               24
pptx              Presentation              20
vCard             Card                      14
wav               Audio                     13
odg               Graphics                   4
mdb               Database                   0
ccbd              Database                   0
vrml              3D                         0

Table 5.2: File MIME types ordered by descending frequency of occurrence.
This enables discussing the differences between the use of technologies at the level of single posts/pages and at the level of larger websites. The results suggest that the average number of technologies used at the website level is approximately twice as large as that at the single page/post level. This does not hold for every element studied here: for instance, the YUI JavaScript library is used 2.5 times more frequently at the post/page level than at the website level, and almost twice as many FOAF references were recorded at the post/page level compared to general websites. The number of GIF images used is also slightly higher at the post/page level compared to their use on the home page. On the contrary, the number of JPG images used at the website level is 3.4 times higher than at the post/page level, and a similar pattern holds for embedded YouTube videos, with five times more videos used at the website level. These results are not surprising, since posts and pages contain more focused content compared to homepages, which may include listings with excerpts from a set of posts.
5.2.3 Comparison Between Blogosphere and the Generic Web
We evaluate the technological differences between blogs and the generic Web using the results from our technical survey presented in the previous section and the data from the HTTP Archive. The HTTP Archive attempts to record the changes that take place on the Internet by collecting and storing Web content. Furthermore, it attempts to preserve the ways in which Web content is being constructed and served, including information about the size of pages, failed requests and technologies utilised. We compare the technologies commonly occurring on the Web with those in the Blogosphere. The rationale for comparing these domains is to identify whether major differences between the Blogosphere and the generic Web exist and whether different strategies may be necessary for blog preservation. Some differences between the Blogosphere and the rest of the Web have already been demonstrated: the spread of information and influence dynamics within blog networks are among the common perspectives that distinguish blogs from other websites. For instance, emerging patterns of information propagation and content sharing within the Blogosphere have already been highlighted [83]. However, more work is needed to capture technological differences between the two domains. The data used from the HTTP Archive corresponds to the timeframe of the data obtained from our survey. Sourcing the data from an identical period ensures the comparability of the datasets and eliminates the possibility of technological changes affecting the results.
Figure 5.18: Flash use on the web (left) and on blogs (right).
Figure 5.19: JavaScript frameworks use on the web (left) and on blogs (right).
The results indicate the presence of considerable differences between the two domains, particularly in relation to the use of Flash, GIF images and JavaScript libraries. There is considerable variance in the use of Flash elements on the general Web (44%), compared to only 15% within the Blogosphere (Figure 5.18). It is also apparent that the adoption of PNG images is higher within the Blogosphere (25%) compared to the general Web (20%), whereas a greater number of GIF images is used within the general Web compared to the blog domain (Figure 5.20). Furthermore, a considerable difference was observed in the number of recorded HTTP response errors, presented in Figure 5.21. While a conclusive evaluation will require expanding the study beyond recently active blogs, the descriptive statistics presented here highlight the problem of no longer accessible web resources.
Figure 5.20: Image formats use on the web (left) and on blogs (right).
Figure 5.21: HTTP status responses on the web (left) and on blogs (right).
The key message emerging from our study is the diversity of the Blogosphere. More specifically, there are large numbers of software platforms, encoding standards, third-party services and libraries in use, and there are considerable differences in the ways the standards are being adopted. In the context of BlogForever and the preservation of blogs in general, this diversity may require additional efforts to avoid data loss or distortion when aggregating, preserving and disseminating blogs. For example, given the empirical evidence that indicates limited use of certain file types and popularity of others, informed decisions can be made to focus on specific file formats and omit others. Informed trade-offs can conserve resources and contribute to the greater sustainability of the archive.
Our evaluation suggests the existence of around 470 platforms in addition to the dominating WordPress and Blogger. Furthermore, there is a wide variety in the versions and subsystems adopted. The wide variety of content types, in addition to the 61% of text/html, published in a wide range of encoding standards, showcases this fact. On the other hand, there is a large number of established and widely used technologies and standards applied consistently throughout the Blogosphere; the use of RSS and Atom feeds, along with CSS and JavaScript, is among them. The frequency of images used and their formats are very similar to the ways they are used within the entire Web. There is wide variation in the adoption of third-party libraries and services. The use of social media APIs is not consistent throughout the studied corpus; however, support for Google+, a service announced two months before the authoring of this work, is considerable. The adoption of metadata such as Dublin Core, Open Graph, FOAF and SIOC is not consistently spread either. This may have direct implications for crawling, data extraction and aggregation. To conclude, this evaluation measures and reports the technological foundations used in the Blogosphere. Its results can, therefore, be used for many purposes by services and solutions geared towards the Blogosphere.
5.3 User Requirements
The requirements presented in this section, although identified in the context of BlogForever, can be applied to any blog preservation platform. Besides identifying serious issues in current web archiving, we have to further determine the requirements necessary to aggregate and preserve blog content correctly. Therefore, we examine domain-specific user requirements which influence the design and development of the BlogForever platform. The user requirements phase of the BlogForever project [77] involved conducting 26 semi-structured interviews with representatives from six different stakeholder groups identified as key to the project: researchers, libraries, blog authors, blog readers, businesses, and blog hosting services. We group the requirements into two main categories, functional and non-functional, and further divide the non-functional requirements into several subcategories, as presented in Table 5.3. Due to the number of identified requirements, it is not possible to explain them all in detail. Therefore, we focus on important requirements regarding preservation, interoperability and performance, because they shape the possible solutions for the aims of the research presented in this work.
5.3.1 Preservation Requirements
BlogForever utilises the Open Archival Information System (OAIS) reference model [107] as conceptual guidance for the construction and management of its weblog digital repository. Therefore, one of the main concerns while gathering requirements was to specify four specific OAIS functions: Ingest, Data Management, Archival Storage and Access [82].
Requirements Category | Description | Requirements
(End-)user interface | Specify the interface between the software and the user [100] | 35
Data | Describe visible data and data the user needs to export or to process [55] | 23
Preservation | Specify digital preservation specific requirements [56] | 17
Security | Specify the factors that protect the software from accidental or malicious access, use, modification, destruction, or disclosure | 9
Performance | Specify speed, performance, capacity, scalability, reliability and availability [100] | 8
Interoperability | Specify the ability of the system to share information and services [65] | 7
Operational | Specify required capabilities of the system to ensure a stable operation of the system [74] | 6
Legal | Cover any constraints through legal regulation, e.g. licenses, laws, patents, etc. [56] | 5
Table 5.3: Overview of user requirements

A blog preservation platform should be able to:
• Receive crawled blog content in the form of Submission Information Packages (SIPs). In BlogForever, the SIP is a Metadata Encoding and Transmission Standard (METS) [32] wrapper which contains the original XML data as crawled by the spider, along with Machine-Readable Cataloging (MARC) and Metadata for Images in XML (MIX) metadata, and links to locally attached files.
• Perform quality assurance checks to ensure that the transmission was successful and that the content is eligible for admission to the repository (a checksum-based sketch of this step follows the list). This validation is necessary because the content source is not controlled by the administrators of the blog preservation platform and may cause issues; virus and splog detection should also be performed.
• Generate blog Archival Information Packages (AIPs) from SIPs to preserve the content.
• Produce blog Dissemination Information Packages (DIPs) upon user request. It is possible to retrieve the original METS file from the ingestion database, enrich it with extracted information, and export it to the designated community.
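The quality assurance step in the list above can be as simple as verifying checksums of the received package before an AIP is generated. The sketch below assumes each SIP ships with a manifest file named manifest.md5 listing one '<md5> <relative path>' pair per line; that file name and layout are illustrative assumptions, not the project's actual packaging convention.

import hashlib
import pathlib

def verify_sip(package_dir):
    """Check a received SIP against its checksum manifest before ingestion."""
    package = pathlib.Path(package_dir)
    failures = []
    for line in (package / "manifest.md5").read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        digest = hashlib.md5((package / name).read_bytes()).hexdigest()
        if digest != expected:
            failures.append(name)
    return failures  # an empty list means the package passed verification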
5.3.2 Interoperability Requirements
Interoperability is one of the crucial aspects of any digital preservation system. BlogForever poses greater interoperability challenges because the archive has to interoperate with multiple third-party systems hosting blogs in order to retrieve content in real time. In more detail, the following requirements are deemed necessary:
• Capturing has to be possible for various platforms. Blogs are available on different platforms, and the archive should not be restricted to only one kind of platform or software, because that would severely limit the number of blogs that can be archived. Therefore, the spider, and respectively the archive, should be able to capture blogs ideally from every platform, or at least from the most common platforms.
• Export blog elements. Expose parts of the archive via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)10 based on specified criteria. The BlogForever platform should enable exposing different parts of the archive to different clients via OAI-PMH, according to specified parameters and policy (e.g. client identity, admin settings); a minimal harvesting sketch is given after this list.
• Support pingback/trackback. The archive should enable pingback/trackback to facilitate connections between the archived contents and external resources. Thereby, the archive can show if external blogs are referencing archived content.
• Export links between blog content. The BlogForever platform should be able to provide users with a graph of links between blogs and blog posts archived in the system. The format of this graph is not restricted.
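Clients of such an OAI-PMH endpoint would issue standard protocol requests. The sketch below shows a ListRecords call with the common oai_dc metadata prefix and prints record identifiers; the endpoint URL in the usage comment is a placeholder, not an actual BlogForever address, and resumption tokens for large result sets are not handled.

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def list_record_identifiers(base_url, metadata_prefix="oai_dc", oai_set=None):
    """Issue an OAI-PMH ListRecords request and yield record identifiers."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if oai_set:
        params["set"] = oai_set
    with urllib.request.urlopen(base_url + "?" + urllib.parse.urlencode(params)) as resp:
        tree = ET.parse(resp)
    for header in tree.iter(OAI + "header"):
        yield header.findtext(OAI + "identifier")

# Usage (placeholder endpoint):
# for identifier in list_record_identifiers("https://repository.example.org/oai2d"):
#     print(identifier)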
5.3.3 Performance Requirements
Performance is crucial for a system that needs to perform real-time archiving. A platform that captures large amounts of blog content every day needs to have certain performance characteristics. These aspects of BlogForever clearly differentiate it from other archiving platforms, where aggregation is performed at controlled, scheduled intervals. Real-time capturing can produce a lot of data at peak times. Therefore, it should be possible to:
• Connect to multiple blogs in parallel,
• Retrieve and process blog content from multiple sources in parallel,
• Store data concurrently in the archive, and
• Scale in terms of the size of data it stores, the number of blogs it monitors, the volume of content it retrieves, and the number of visitors it serves.
The user requirements specified in this step of the development are used as inputs for the architecture of the BlogForever platform, presented in the following section.
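The first two requirements above, connecting to and retrieving from many blogs in parallel, map naturally onto a worker pool. The fragment below is an illustrative Python sketch, not part of the BlogForever codebase; the worker count of 16 is an arbitrary assumption.

from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url):
    """Download one blog resource; network errors are returned, not raised."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return url, resp.read()
    except OSError as exc:
        return url, exc

def fetch_all(urls, workers=16):
    """Connect to and retrieve many blogs in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))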
10http://www.openarchives.org/pmh/, accessed: August 1, 2015
5.4 System Architecture
We introduce the general concepts of our system architecture, consisting of two main components, comparing it with other current solutions and explaining how we address common issues in novel ways. Subsequently, we give a more detailed description of the design of each software component, justifying our design decisions. Screenshots of the BlogForever platform are available in Appendix 7.2.
5.4.1 The BlogForever Software Platform
We provide an overview of the general architecture of the BlogForever platform before we explain in detail its two components: the blog spider and the blog repository. The BlogForever platform provides all the necessary functionality for collecting, storing, managing, preserving, and disseminating large volumes of blog content. Decoupling the two platform components allows the platform to be configured, run and managed in a very flexible and scalable way, as the spider and digital repository can be entirely independent in their operation. A general overview of the BlogForever platform is illustrated in Figure 5.22.
Figure 5.22: A general overview of the BlogForever platform, featuring the blog spider and the blog repository
In a nutshell, given a predefined list of blogs, the spider analyses them, manages the capturing process, harvests the required information in an event-driven manner and hands it over to the repository together with additional metadata in a standardised and widely accepted format. The repository ingests, processes and stores the blog content and metadata, facilitating their long-term preservation. Furthermore, the content can be provided to end users in various ways, including sophisticated blog-specific functionalities. The BlogForever platform is based on the well-established model of harvesting, storing and presenting information from the Web. However, unlike other solutions, the BlogForever software treats blogs as web resources with a unique nature throughout their existence in the platform, from the moment of harvesting, to ingesting and processing, and finally to presentation to the final user, while at the same time treating long-term digital preservation as a key aspect of this process. A selective harvesting process that captures only new content without missing relevant information is the main challenge for blog harvesting. The BlogForever spider uses RSS and Atom feeds, ping servers and blog APIs, as well as traditional HTML scraping, to harvest blog content. As a result, real-time or near real-time archiving can be achieved; due to the overhead incurred by the various network connections and the latency of the tools processing the information, it is accurate to say that archiving is as close to real time as these factors allow. In contrast, we note that traditional methods use a schedule-based approach to decide when new content should be harvested [71]. Furthermore, using our approach, only the parts of a blog that contain new information are crawled, instead of the whole page; thus, the amount of crawled data can be reduced significantly. A detailed data model has been developed based on existing conceptual models of blogs, data models of open source blogging systems, an empirical study of web feeds, and an online survey of blogger and blog reader perceptions [134]. While the full data model comprises twelve categories with over forty single entities, Figure 5.23 shows just a high-level view of the blog core. The primary identified entities of a weblog and the interrelations between them are shown and described by the connected lines; the small triangles indicate the directions of the relationships. Each entity has several properties, e.g. the title, URI and aliases of a blog, which are not described here. Using an exhaustive data model, which is unique for blogs, BlogForever does not limit itself to capturing online blog content as it is given to it. New material is analyzed and the different parts, as defined in the data model, are identified. All metadata are stored in a relational database on the repository component and are instantly available for discovery and exploitation. Furthermore, search and navigation features are available for the key blog entities (posts and comments), enabling sophisticated blog analysis. Additionally, an extended range of personalized services, social and community tools are made available to the end user.
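A minimal illustration of this feed-driven, incremental harvesting follows. It is only a sketch: it relies on the third-party feedparser library, and the in-memory set of already seen entry identifiers stands in for the spider's Source Database.

import feedparser  # third-party: pip install feedparser

def harvest_new_entries(feed_url, seen_ids):
    """Fetch a blog feed and return only the entries not harvested before.

    'seen_ids' is any set-like store of previously harvested entry identifiers;
    persisting it between runs is outside the scope of this sketch.
    """
    parsed = feedparser.parse(feed_url)
    new_entries = []
    for entry in parsed.entries:
        entry_id = entry.get("id") or entry.get("link")
        if entry_id and entry_id not in seen_ids:
            seen_ids.add(entry_id)
            new_entries.append(entry)
    return new_entries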
5.4.2 Blog Spider Component
The blog spider is the component of the BlogForever platform that harvests a defined set of blogs; however, harvesting new blogs that are connected to the already archived blogs is also possible. It comprises several subcomponents, as shown in Figure 5.25. The Inputter is the entry point, where the list of blog URLs is given to the spider. The Host Analyzer parses each blog URL, collects information about the blog host and discovers RSS feeds that the blog may provide.
Figure 5.23: Core entities of the BlogForever data model [134]

The Fetcher harvests the actual content of each blog. In the core of the spider, the Source Database handles all the blog URLs as well as information on the structure of the blogs and the filters used to parse their content. The Scheduler ensures that all the given blogs are checked at regular intervals. The Parser analyzes HTML content, semantically extracts all the useful information and encapsulates it into XML. Finally, the Exporter serves all the parsed content and also provides an API for programmable access to it. In the next paragraphs the internal architecture of the spider is described in more detail.

Inputter: The Inputter's main function is to serve as the entry point to the spider, where the list of blogs to be monitored can be set. There are two ways to achieve that: either through a web interface that allows the manual insertion of new blogs, or automatically by fetching new blogs from a broader list hosted at the digital repository. The idea behind this design decision is that the administrator should have full control over the set of blogs they want to have harvested. The automatic fetching of new blogs from the repository is achieved through the use of OAI-PMH, which allows the spider to pull information from the repository. However, mechanisms for the repository to push information on new blogs to the spider, such as the use of a web service, are also being explored. At the same time, the option to insert new blogs discovered through ping servers is also implemented. The Inputter is also capable of identifying and accepting a set of options that affect the crawling of each blog.
Figure 5.24: BlogForever conceptual data model [134]
Figure 5.25: The outline of the blog spider component design

These options, which can be set either at the repository or directly at the spider, include fetching embedded objects of specific data types, producing and delivering snapshots of the content at harvest time, etc.

Host Analyzer: All blog URLs collected by the Inputter have to pass through the Host Analyzer, which approves or blacklists them as incorrect or inappropriate for harvesting. The Host Analyzer discovers and identifies RSS feeds for the given blog and, consequently, detects all content associated with it, such as blog posts, comments and other related pages. This is achieved through a process of analyzing HTML and detected URLs using machine-learning techniques, such as ID3, and examining the provenance of the URL. As a result of these techniques, the Host Analyzer also detects the blog authoring software platform used and/or the blog host, as well as the link extraction patterns and filters that are used for future harvesting. Finally, this is the first stage where a blog may be flagged as potential spam, given basic information such as its URL.

Source Database: At the core of the spider's system we find the system manager, which comprises the Source Database, containing all the monitored blogs, and the Scheduler, ensuring that all blogs are checked at regular intervals for updates. The Source Database holds the internal catalog of all the blog sources that the spider harvests. For each blog, information on blog post URLs, on how blog comments relate to their parent blog posts, various metadata, as well as filtering rules and extraction patterns are stored. Additionally, information related to how the blogs were inserted and processed by the spider is saved. It should be noted that blogs which had previously been blacklisted as incorrect or inappropriate are also kept in the Source Database.

Scheduler: The Source Database feeds the Scheduler with information on which blogs should be checked for updates by polling. Unless a ping server delivers updates automatically, the Scheduler makes sure that all blogs are checked on a regular basis for new material, including new posts as well as new comments. The update frequency is configurable and depends on the measured frequency of new material on the given blog.

Fetcher: All content fetching and parsing takes place at the Worker unit of the spider, which is divided into the sub-components of the Fetcher and the Parser. The Fetcher's main operation is to capture and analyze RSS feeds and HTML content. This sub-component's services are used in two distinct places in the spider's design: during the Host Analyzer's detection process, as seen before, and most notably during the actual downloading and analyzing of the blog content. For its first use, the Fetcher uses heuristic techniques, such as the ID3 decision tree [124], to create and use rules that extract useful links within a blog. These links usually are RSS feeds, links to posts, comments and other related information. For its second and main use, the Fetcher analyzes the full captured content to identify distinct blog elements. To this end, platform-specific rules as well as heuristic rules are deployed to discover and define a post's full text, attached material, related comments, author and date information, etc. This process includes comparing content extracted from the RSS feed with content extracted from HTML sources, using the Levenshtein distance string metric, to ensure the integrity of the final results.
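A minimal version of this integrity check is sketched below: a textbook dynamic-programming Levenshtein distance and a similarity ratio derived from it. The 0.9 acceptance threshold is an illustrative assumption, not the value used by the spider.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                    # deletion
                               current[j - 1] + 1,                 # insertion
                               previous[j - 1] + (ca != cb)))      # substitution
        previous = current
    return previous[-1]

def contents_match(feed_text, html_text, threshold=0.9):
    """Treat the two extractions as consistent if they are sufficiently similar."""
    longest = max(len(feed_text), len(html_text)) or 1
    similarity = 1 - levenshtein(feed_text, html_text) / longest
    return similarity >= threshold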
Parser: Once the detailed content has been extracted, the Parser is put in use to perform an even more exhaustive discovery of information and metadata, as well as to encapsulate everything into a given XML format and send it to the Exporter. This is achieved through a series of additional checks which include: detecting further links to related content that had not been previously checked; detecting additional embedded material such as pictures, videos, blog rolls, social network widgets, etc.; and discovering microformats and microdata semantic markup. Another function of the Parser is to encapsulate all the collected data and metadata into an exportable XML file of a given structure. We choose to use the METS standard as the default container for this information in our implementation; more information on this design decision is provided in the next section. Finally, in order to complement the first level of spam detection and filtering performed by the Host Analyzer, an additional dedicated Spam Filter is being developed as part of the Parser. Filtering is performed at the content level through the detection of common spam patterns.

Exporter: At the end of the processing chain of the Spider we find the Exporter subcomponent, which makes the parsed content available to the Spider's clients. This is achieved through a web service that supports interoperable machine-to-machine interaction between the Spider and the potential client. The Spider provides a detailed description of all the offered functionality through a machine-readable Web Services Description Language (WSDL)11 description file. Clients can use Simple Object Access Protocol (SOAP)12 messages to search for newly updated material and download it on demand. As a fallback option for technologically simpler clients, the Exporter is able to make the XML files produced by the Parser temporarily available through an FTP server.

Scalability: In the previous paragraphs we described a Blog Spider software design that can support crawling a limited number of blogs. To be able to monitor and harvest large numbers of blogs, a scalable server architecture is needed. At the core of the Spider, the high load of the Source Database can be distributed over a number of dedicated database servers. Another congestion point can appear when fetching and parsing blog content. Thanks to the Worker unit design described previously, an arbitrary number of Workers can be deployed to work together, so that the various Fetcher and Parser subcomponents can work in parallel on harvesting and analyzing resources from different sources. A schematic representation of the above can be seen in Figure 5.26.
11http://www.w3.org/TR/wsdl, accessed: August 1, 2015
12http://www.w3.org/TR/soap12-part1/, accessed: August 1, 2015
Figure 5.26: High level outline of a scalable set up of the blog spider component
5.4.3 Digital Repository Component
The BlogForever digital repository component facilitates the ingestion, management and dissemination of blog content and information. Once the blog spider has finished processing new blog content, the repository programmatically imports this semi-structured information through its ingestion mechanism. New blogs can be submitted to the repository and approved by the administrators. Within the repository, blog data and metadata are stored, checked, analyzed and converted in order to fit the repository's internal management needs in an efficient way. Long-term digital preservation is a key aspect of this process and is therefore reflected in many of the actions performed. Dedicated indexing and ranking services are employed to ensure the most accurate search results for the end users through the web interface. Searching is the main functionality of the web interface of the repository, part of a series of services offered to the administrator and end user, such as community and collaborative tools and value-added services. Finally, interoperability is an important aspect of the repository, which supports many widely used open standards, protocols and export formats. A graphical summary of the above can be seen in Figure 5.27. The blog repository is based on Invenio, a free and open source digital library software suite13. It covers all aspects of digital library management, from document ingestion through classification, indexing and curation to dissemination. Invenio complies with standards such as OAI-PMH and uses MARC [97] as its underlying bibliographic format. The flexibility and performance of Invenio make it a comprehensive solution for the management of document repositories with several million records. Invenio was originally developed at CERN to run the CERN document server, managing over 1,000,000 bibliographic records in high-energy physics since 2002, covering articles, books, journals, photos, videos, and more. Invenio is co-developed by an international collaboration comprising institutes such as CERN14, DESY15, EPFL16, FNAL17 and SLAC18, and is used by hundreds of institutions worldwide.
13http://invenio-software.org, accessed: August 1, 2015
14http://www.cern.ch/, accessed August 1, 2015
15http://www.desy.de/, accessed August 1, 2015
16http://www.epfl.ch/, accessed August 1, 2015
17http://www.fnal.gov/, accessed August 1, 2015
18http://www.slac.stanford.edu/, accessed August 1, 2015

As it is outside the scope of this thesis, basic Invenio concepts will not be discussed here.

Metadata: One of the core design decisions of the BlogForever digital repository is the representation of metadata, both internally and at the interface level. Among the approaches relevant for transferring data across software applications and networks is the Extensible Markup Language (XML)19, a widely adopted machine- and human-readable mark-up language. Its simplicity and suitability for transactions over the network spawned a large number of XML-based languages and standards; among those standards are METS and MARCXML [97]. Both standards have been developed by the Library of Congress20 and are widely adopted for representing or describing data. METS categorises metadata into two high-level sections: descriptive metadata (dmdSec) and administrative metadata (amdSec). The administrative metadata section is divided into further subcategories: a subsection related to provenance metadata (digiProv) and a subsection related to technical specifications of embedded content (techMD); a skeletal sketch of this structure is given after the list below. In BlogForever all metadata are wrapped, encoded and exposed using METS; the policies on how they are wrapped within METS are described thoroughly in [82]. In addition to the METS object associated with each record, it is essential that a METS profile describing the purpose of the repository and the types of objects and formats supported within it is made available at a fixed URI within the repository. The principles governing the management of the information ingested into the BlogForever repository are proposed in the METS profile included in [82], along with an example. In short, the profile stipulates:
• The controlled vocabularies and syntaxes to be used for metadata field values (e.g. ISO standards for describing time, date, geographic locations).
• The set of external metadata schemas that are being incorporated into the METS standard (e.g. MARCXML, MIX, etc.), and how they will be included within the native METS syntax.
• Any tools that might be employed for object characterization.
• The description and structural rules for employment in writing METS objects.
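As a rough illustration of the dmdSec/amdSec split described above, the fragment below assembles a skeletal METS wrapper. It is a sketch under simplifying assumptions: real BlogForever SIPs follow the profile in [82], also carry file and structural sections, and embed the MARCXML and MIX records as XML rather than as plain text.

import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

def build_skeletal_mets(marc_xml, mix_xml, provenance_note):
    """Assemble a minimal METS document with descriptive and administrative sections."""
    ET.register_namespace("mets", METS_NS)
    mets = ET.Element("{%s}mets" % METS_NS)
    # Descriptive metadata section (dmdSec) holding the MARC record.
    dmd = ET.SubElement(mets, "{%s}dmdSec" % METS_NS, {"ID": "DMD1"})
    ET.SubElement(dmd, "{%s}mdWrap" % METS_NS, {"MDTYPE": "MARC"}).text = marc_xml
    # Administrative metadata section (amdSec) with technical and provenance parts.
    amd = ET.SubElement(mets, "{%s}amdSec" % METS_NS)
    ET.SubElement(amd, "{%s}techMD" % METS_NS, {"ID": "TECH1"}).text = mix_xml
    ET.SubElement(amd, "{%s}digiprovMD" % METS_NS, {"ID": "PROV1"}).text = provenance_note
    return ET.tostring(mets, encoding="unicode")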
The BlogForever METS profile should be considered a work in progress, to be adapted if necessary as the implementation of the repository continues. The final profile should be made available at a fixed public URI within the repository and also submitted for inclusion in a public registry such as that of the Library of Congress, so that future projects involved in blog preservation activities may profit from our research. MARCXML is the main descriptive metadata standard for BlogForever. The fact that the MARC format is the chosen internal representation of metadata in Invenio played an important role in our decision. MARC is well-established, flexible enough to guarantee long-term reliability and can be thoroughly extended to adapt to any metadata structure. In the context of blogs we have defined a draft mapping between the conceptual data model from [134] and MARC fields. For this reason we have redefined a series of MARC fields in the 9xx field range21.
19http://www.w3.org/XML/, accessed August 1, 2015
20http://www.loc.gov/, accessed August 1, 2015
Figure 5.27: The outline of the blog repository component design

Other suggested standards for specific object types are MIX for images, TextMD22 for structured text, AES57-201123 for audio and MPEG-724 for moving images. BlogForever's arsenal of standards is completed with the Preservation Metadata: Implementation Strategy (PREMIS) standard. PREMIS consists of a core set of standardised data elements that are recommended for repositories to manage and perform their preservation functions, such as making the digital objects usable over time by keeping them viable, readable, displayable and intact, all for the purpose of future access. Additionally, a PREMIS schema can be wrapped up in the METS profile. The detailed reasons and analysis for choosing these standards are described in [82].

Submission: As mentioned previously, one of our design decisions is that users cannot interfere with the list of target blogs. The system administrators provide the option to groups of registered or other users to suggest new blogs to be archived in the repository. The option of bulk submissions by the administrators is also available, through the use of command line tools. Invenio already features an elaborate submission management system that allows customisable forms to be created. BlogForever expands that system to offer a simple yet complete submission form that allows the user to suggest a new blog using its URL, with the following optional information:
• Topic: a selection of existing topics is offered, based on previous user input, or the user can name their own topic.
• License: in order to respect ownership copyright and potential local rules and laws that may apply, the repository offers a list of known licenses the user can select from. More information on how licenses are enforced is given in the following paragraphs.
Ingestion: The spider's Exporter subcomponent, which makes the parsed content available to the spider's clients, was previously presented, and the technologies used to make data and metadata available were described.
21http://www.loc.gov/marc/bibliographic/bd9xx.html, accessed: August 1, 2015
22http://www.loc.gov/standards/textMD/, accessed: August 1, 2015
23http://www.loc.gov/standards/amdvmd/audiovideoMDschemas.html, accessed: August 1, 2015
24http://mpeg.chiariglione.org/standards/mpeg-7/mpeg-7.htm, accessed: August 1, 2015

On the repository side, a SOAP client is employed to access the web service. In case an FTP server is needed to export blog content, the repository may also use a dedicated FTP client to connect and download all new content. In both scenarios, the repository periodically pulls information from the spider; a two-way communication channel is currently being designed to allow the spider to push new information to the repository and ensure real-time archiving. Once new content is received by the repository in the form of individual packages (sets of resources), they are considered SIPs according to the Open Archival Information System (OAIS) specification. To ensure their validity, quality assurance checks such as MD5 checksum verification may be performed on each SIP. A dedicated ingestion storage engine has been introduced in the repository to store all SIPs. At the end of the ingestion process, an AIP is generated from a single SIP by extracting descriptive information, such as descriptive metadata for search and retrieval, attached files, etc. The AIP is then finally transferred to the internal storage.

Storage: All data and metadata in BlogForever are stored in a database as well as in the filesystem. Multiple copies of each resource are kept through replication as well as frequent incremental backups. To ensure the long-term digital preservation of these resources, versioning is also introduced at various stages of the ingestion and archival process. Any change to any resource creates a new version of that resource, and all old versions are kept; nothing can be deleted unless the system administrator physically does so directly on the server. As mentioned before, all SIPs are permanently stored in a dedicated ingestion storage engine, and any database server can be configured to act as such. In the current implementation, MongoDB25 has been chosen as a highly scalable and agile document database server26. When the AIPs are generated from the SIPs, descriptive metadata (in MARCXML) are extracted and stored separately in a database to provide fast search and retrieval services. In the current implementation the well-established open source relational database management system MySQL is used. During the next stages of development, an object-relational mapping solution will be deployed to provide easy integration of other popular RDBMSs such as PostgreSQL27 and Oracle28. All attached files are managed by a dedicated software module which acts as an abstraction layer between the physical storage on the file system and the logical names of those files as part of the AIP. As for the metadata, versioning and replication are also provided for all physical files. BlogForever interconnects all metadata and data that make up an AIP in a seamless way, using various identifiers. Within its role as an archive, BlogForever ingests and stores all content as it is sent by the spider; upon reception, all data are treated equally.
Content that may later be declared inappropriate at its original source can be flagged as such in the platform, either automatically by following spider updates or manually at the repository by the administrators.

Indexing & Ranking: BlogForever features special indexes to provide high-speed searching. Indexes can be set up on any logical field of the descriptive metadata after associating it with one of the MARC tags. Indexing may also be performed on all text files, such as PDF documents.
25http://www.mongodb.org/, accessed: August 1, 2015
26http://www.mongodb.org/display/DOCS/Use+Cases, accessed: August 1, 2015
27http://www.postgresql.org/, accessed: August 1, 2015
28http://www.oracle.com/products/database/, accessed: August 1, 2015

This leads to a combined metadata and full-text high-speed search. Fuzzy indexing is also available through multi-lingual stemming. Search results may be sorted and ranked according to a series of criteria, such as the classical word-frequency-based vector model that allows retrieving similar documents, the number of views and downloads, etc.

Searching: The most important user feature of the platform is provided through a simple yet intuitive interface. A single search field offers access to all the digital repository's resources. Behind the scenes, a powerful search engine makes sure accurate results are returned as quickly as possible. The user may search across all metadata and data in the repository, or can focus their search on specific logical fields such as author or title, once defined by the administrator, as is the case for indexing. The option to use regular expressions is also available for advanced users. Finally, all documents in the BlogForever repository can be organised in collections, based on any descriptive metadata the administrator chooses. This extra granularity can help the user browse the contents or limit their search to specific collections.

Interoperability: In order to work together with diverse systems and ensure wider acceptance, interoperability is a key aspect of the BlogForever platform. Various standards and common technologies are supported. Most notably:
• the Open Archives Initiative29 (OAI) protocols can be used for the acquisition and dissemination of resources,
• the OpenURL standardized format can be used to locate resources in the repository,
• the Search/Retrieval via URL30 (SRU) protocol is supported,
• Digital Object Identifiers (DOI) can be used to uniquely identify resources in the repository.
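As an illustration of the first point above, an external client could harvest Dublin Core records from the repository over OAI-PMH roughly as follows. This is a minimal sketch: the base URL is hypothetical and the exact endpoint path depends on the deployment.

    import requests
    import xml.etree.ElementTree as ET

    OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
    BASE_URL = "https://archive.example.org/oai2d"      # hypothetical endpoint

    def harvest(metadata_prefix="oai_dc"):
        """Iterate over all records exposed via OAI-PMH, following
        resumption tokens until the list is exhausted."""
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        while True:
            root = ET.fromstring(requests.get(BASE_URL, params=params).content)
            for record in root.iter("{http://www.openarchives.org/OAI/2.0/}record"):
                yield record
            token = root.find(".//oai:resumptionToken", OAI_NS)
            if token is None or not (token.text or "").strip():
                break
            params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

    for record in harvest():
        identifier = record.find("oai:header/oai:identifier", OAI_NS)
        print(identifier.text if identifier is not None else "<no identifier>")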
Exporting: To further enhance interoperability, BlogForever offers the possibility to export metadata in various standard formats not native to the platform, such as Dublin Core31 and MODS32. A powerful conversion module quickly provides the various export formats on demand.

Personalization: BlogForever provides a number of personalization and collaborative features. Like the repository-wide search collections, users may create their personal collections of documents (baskets) to use privately or share with other users. Periodic notifications on newly added documents (alerts) are also available and can be used in conjunction with baskets. An RSS feed on any search query is available to be used with any feed reader. Finally, commenting and rating on all repository content is available.

Digital Rights Management (DRM): Basic management of digital rights in the BlogForever platform is achieved through the configuration of a standard role-based access control (RBAC) system, which controls access to all repository resources and allows complicated access rules to be set. Full DRM support is currently being designed based on the research and analysis of common copyright issues in the blogosphere.

29http://www.openarchives.org/
30http://www.loc.gov/standards/sru/
31http://dublincore.org/, accessed: August 1, 2015
32http://www.loc.gov/standards/mods/

Scalability: The digital repository component of the BlogForever platform is based on Invenio, which in turn is based on widely used open source software projects such as the MySQL relational database management system and the Apache HTTP server. High availability and scalability are key features of MySQL, with several success stories33 to date and a complete guide34 for developers covering replication and clustering techniques. Apache also uses many well-proven internal and external techniques to scale, including load balancing and caching. Invenio can run in a multi-node load-balanced architecture with several front-end worker nodes and with database replication nodes in the back-end35. Furthermore, the indexing and searching capabilities of Invenio can already be augmented by plugging in a scalable search server such as Apache Solr36.
5.5 Implementation
The need for high quality software can only be fulfilled by quality assurance of the development. This challenge is tackled with a well-structured, collaborative development process. The implementation of the BlogForever platform is led by the Digital Library Technology (DLT)37 Section of the Collaboration and Information Services (CIS) Group at the Information Technology (IT) Department at CERN. The software implementation is broken down into two distinct parts, as is the system architecture. The blog spider and the digital repository follow different but almost parallel development cycles, and share some of the development techniques. BlogForever project partner CyberWatcher38 is the main developer of the spider. CyberWatcher has a long history of offering web and social media crawling services to its customers. BlogForever project partner CERN is the main developer of the repository. CERN is the lead developer of the open source digital library software Invenio.

As discussed in Section 5.3, the first step of the workflow was to collect and define user requirements and platform specifications. To enter the actual development cycle it was then necessary to map those to concrete software features and place them on a timeline following a well-defined set of case studies, according to their importance. This way, developers can focus on a specific group of features at each moment, so that testing and documentation can be performed in parallel, resulting in a highly efficient practice.

A requirement is a capability that a product must possess in order to ultimately satisfy a customer need. A feature is a set of related requirements that allows the user to satisfy a business objective or need39. In other words, a capability provided by a user requirement must be described and implemented through a set of concrete software functionalities; this set is called a software feature.

33http://www.mysql.com/why-mysql/scaleout/, accessed: August 1, 2015
34http://dev.mysql.com/doc/mysql-ha-scalability/en/ha-overview.html, accessed: August 1, 2015
35https://indico.cern.ch/getFile.py/access?contribId=17&sessionId=9&resId=0&materialId=slides&confId=183318, accessed: August 1, 2015
36http://lucene.apache.org/solr/, accessed: August 1, 2015
37http://information-technology.web.cern.ch/about/organisation/digital-library-technology, accessed: August 1, 2015
38http://www.cyberwatcher.com/
39http://www.accompa.com/product-management-blog/2009/07/13/features-vs-requirements-requirements-management-basics/, accessed: August 1, 2015

Once the user requirements and platform specifications are gathered, software engineers perform an analysis and determine the scope of the development. Certain functionality may be out of scope of the project as a function of cost or as a result of unclear requirements at the start of development. Some requirements might be thematically grouped together and therefore mapped to a single feature. On the other hand, some requirements might be broken down and mapped to more than one feature in case they should be described by distinct sets of software functionalities. Finally, a generic user requirement is one that cannot be fully described by a specific set of concrete software functionalities and thus cannot be directly mapped to one or more software features. Generic user requirements can, however, be part of a thematic group of requirements. Developers follow these steps:

1. Study the detailed user requirements and the capabilities of the existing software to understand how they map to each other.
2. Analyse each requirement and describe to what extent it is already being satisfied.
3. Convert requirements to features, grouping them together or splitting them when necessary, according to the effort needed, the developer undertaking them and their importance. Detailed specifications are written for each feature.

Taking this information into account, the developers plan the implementation of all the features over the duration of the five case studies. The features, which had already been classified by their degree of importance for the users and the platform, were assigned to different case studies according to the necessity of their evaluation for the given case study, as well as the calculated effort of implementation. The ultimate goal of the developers is to have all the features ready on time, leaving enough time for internal testing and documentation.

All resulting features, whether new developments, modifications or additions, are implemented through a series of iterations. That includes implementation of new modules and components to satisfy blog-specific features, as well as customization and adaptation of existing modules and components to satisfy users' expectations when experiencing the BlogForever platform. At the end of the implementation process for both the spider and the repository, every feature is tested and documented.
5.6 Evaluation
The importance of software testing and evaluation in software engineering cannot be stressed enough. It is the main process used to identify the completeness, correctness, and quality of developed software.
5.6.1 Method
To achieve the best results, we use three processes to gather input and provide a better understanding of the purpose, current status, functionality and issues of the BlogForever platform.
These are:
1. System Logs, recorded by the servers running the case studies.
2. Internal Testing, used to gather feedback from project partners while testing features and recording their status. Specific reports were created for each case study, presenting the outcomes of internal testing. Detailed information on internal testing is part of the BlogForever Deliverable 5.2, Implementation of Case Studies [6].
3. External Testing, used to gather feedback from third-party users involved in testing. External users submitted specially designed User Questionnaires. Detailed information on the User Questionnaires is part of the BlogForever Deliverable 5.3, User Questionnaires [14].
The evaluation workflow is presented in Figure 5.28.
Figure 5.28: BlogForever Evaluation Timeline [7]
We analyse the outcomes of the evaluation using quantitative and qualitative methods, aiming to answer the following Research Questions:

• RQ1: What are the particular problems the implementation is facing? Or are the BlogForever software implementation processes an overall success?
• RQ2: Are complex BlogForever platform search strategies working efficiently when high levels of content are available within the BlogForever platform?
• RQ3: How useful is the BlogForever platform as a whole?
• RQ4: Does the use of the BlogForever repository lead to successful results for the different users?
• RQ5: How user friendly are the BlogForever platform functions for the different designated blog communities?

The first two Research Questions, regarding the implementation and the efficiency of the BlogForever platform, are answered by the results of the System Logs evaluation. Using the data gathered from System Logs, we specify explicit technical Metrics such as page response time, pages per visit and system errors, as presented in Table 5.4.
ID   Metric                               Description
M1   Content records page views           Repository pages presenting records (blogs, posts, pages or comments).
M2   Export page views                    Repository pages used to export content.
M3   Search page views                    Repository pages used for search.
M4   Goals achieved in Google Analytics   Goals are a versatile way to measure how well a website fulfils specific objectives, which can be a set of consecutive actions in the website.
M5   Number of Python code errors         The number and nature of Python errors is important for system integrity.
M6   HTTP status distribution             The distribution of HTTP responses provides an insight into application stability and integrity.
M7   Page loading time distribution       Average web page loading time is a characteristic of website performance.
M8   Pages per visit                      The number of pages per user visit is directly relevant to the quality of system navigation and functionality.
M9   Average visit duration               The length of user visits is directly relevant to the quality of system navigation and functionality.

Table 5.4: BlogForever Evaluation Metrics [13]

To elaborate on the remaining Research Questions, a set of ten Themes is defined to help rationalise the outputs of all evaluations and connect them with the Research Questions. The terms used do not relate to any technical or development terms previously used within the project (e.g. when building the platform), and are intended to be as clear and simple as possible to reflect the point of view of a user. The Themes and their connections with the Research Questions are presented in Table 5.5.
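Metrics M6 and M7 can be derived directly from raw web server logs. The following minimal sketch is illustrative only; it assumes a combined access-log format with a trailing response-time field and is not the evaluation tooling used in the project:

    import re
    from collections import Counter

    # Combined log line with an assumed trailing response time in milliseconds.
    LINE = re.compile(r'"[A-Z]+ \S+ HTTP/[^"]*" (?P<status>\d{3}) \S+ .* (?P<ms>\d+)$')

    def log_metrics(path):
        """HTTP status distribution (M6) and average page loading time (M7)."""
        statuses, total_ms, hits = Counter(), 0, 0
        with open(path) as fh:
            for line in fh:
                match = LINE.search(line)
                if not match:
                    continue
                statuses[match.group("status")] += 1
                total_ms += int(match.group("ms"))
                hits += 1
        return statuses, (total_ms / hits if hits else 0.0)

    statuses, avg_ms = log_metrics("access.log")
    print(dict(statuses), "average load time: %.1f ms" % avg_ms)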
ID    Theme                     RQ    Rationale linking Theme to RQ
T1    Using blog records        RQ5   The user experience, satisfaction and usefulness of the archived blog collections tested within the different versions of the BlogForever repository.
T2    System integrity          RQ3   Cover whether the system is logical and secure. The tests are dependent on the software integrity level or risk level.
T3    Sharing and interaction   RQ3   Evaluate the ability of BF users to share content and metadata with others, including other users of the platform, and any external use via social software.
T4    Searching                 RQ4   The tests focus on how the platform performs searches, and how users can use and interpret the results of searches.
T5    Access                    RQ4   Evaluate how the platform allows access to the blog records, and how it presents dissemination copies of the content.
T6    Data integrity            RQ3   Assess if the blog datasets are properly captured, well-maintained and consistent.
T7    Preservation              RQ3   Assess if it is possible to preserve blogs.
T8    Functionality             RQ3   Assess the functions available to users and administrators.
T9    System navigation         RQ5   Assess the general navigation aspects of the system.
T10   System terminology        RQ5   Evaluate instructions, help pages, and other aspects of terminology in the platform.

Table 5.5: BlogForever Evaluation Themes [13]

5.6.2 Results

We conduct case studies with different parameters to evaluate the platform using the presented Themes. More specifically, the case studies include two small domain-specific blog groups with good use of posting and commenting, two general blog groups that aim to broaden the scope of the testing and test the information retrieval algorithms with more diverse content and topics, and finally a wide group of about 500,000 multilingual blogs with diverse content and various content types, including multimedia. We use the first four case studies to test and evaluate the project's usability and impact for a specific set of content and users. Their information is presented in Table 5.6. We use the fifth case study to examine the performance and operational aspects of our system. During each case study, we gather and analyse user feedback, as well as data on usage and satisfaction, as presented in [14] and [13]. The data collected forms the core data source for system evaluation. Potential problems are addressed immediately by the developers in parallel with the implementation of the case studies.

Case Study                               Blogs                                  Tests
CS1: Led by the University of London     58 academic blogs                      7
CS2: Led by the University of Warwick    70 academic blogs                      6
CS3: Led by CyberWatcher                 333 random blogs in four languages     5
CS4: Led by Phaistos                     1000 personal blogs in two languages   6

Table 5.6: BlogForever Case Studies for User Testing [14]

The outcomes of the external and internal testing are presented in Table 5.7.
Theme                         External Score   Internal Score
T1: Using blog records        3.61             3.4
T2: System integrity          3.61             4.25
T3: Sharing and interaction   3.94             3
T4: Searching                 3.54             3.75
T5: Access                    3.75             3.28
T6: Data integrity            3.53             3.89
T7: Preservation              3.67             2.5
T8: Functionality             3.59             4.11
T9: System navigation         3.62             N/A
T10: System terminology       3.54             N/A
Average                       3.64             3.66
Table 5.7: External and Internal Scores Summary [13]
We summarise some interesting observations:
• The average scores in external and internal testing are almost equal, 3.64 and 3.66 respectively. These scores are rather favourable, as 3 equates to “Most areas worked as expected” and 4 equates to “All work as expected”.
• The external testing scores are very consistent, with a minimum of 3.53 in T6: Data integrity and a maximum of 3.94 in T3: Sharing and interaction (range 1 to 5). On the other hand, internal testing scores vary more, with a minimum of 2.5 in T7: Preservation and a maximum of 4.25 in T2: System integrity.
• There seems to be consensus on the scoring of most themes, except T3: Sharing and interaction and T7: Preservation, where the difference between the external and internal testing is around 1. In all other cases, the differences are much smaller, strengthening the outcomes of the evaluation.
5.6.3 Evaluation Outcomes
We summarise our findings regarding the original Research Questions presented in Section 5.6.1.

RQ1: What are the particular problems the implementation is facing? Or are the BlogForever software implementation processes an overall success?
The answer to RQ1 lies in the general outcomes of the evaluation, as it aims to assess the overall success of the BlogForever software implementation. Nevertheless, we focus on some specific aspects of the evaluation results which we believe apply directly to RQ1:

• Looking at the general information of the case studies as presented through the system logs, we see that the number of visitors and page views is substantial. These statistics demonstrate the rigour of our testing process: multiple tests were made by many users. They also show that the platform is capable of handling a large number of users and requests.
• In addition, evaluating the system log metrics, and especially Metric 5: Number of Python code errors and Metric 6: HTTP status distribution, we see that very few system errors occurred considering the extent of the testing process.
• Theme 8: Functionality and Theme 2: System integrity scores are above the average in internal testing results and near the average in external testing results.

RQ2: Are complex BlogForever platform search strategies working efficiently when high levels of content are available within the BlogForever platform?
The fifth case study is characterised by the large volume and complexity of the blogs in scope, and it was completed without any problem. Therefore, we consider that the BlogForever platform works efficiently.

RQ3: How useful is the BlogForever platform as a whole?
RQ3 is aligned with the following Themes: T2: System integrity, T6: Data integrity, T7: Preservation and T8: Functionality. Their internal and external testing scores are presented in Table 5.7. Also, the System Logs Metrics relevant to T2: System integrity are M5: Number of Python code errors, M6: HTTP status distribution and M7: Page loading time distribution. As we see in the results, all their scores are quite high.

RQ4: Does the use of the BlogForever repository lead to successful results for the different users?
RQ4 is aligned with T4: Searching and T5: Access. Their internal and external testing scores are quite good, as we see in Table 5.7.

RQ5: How user friendly are the BlogForever platform functions for the different designated blog communities?
RQ5 is aligned with the following Themes: T1: Using blog records, T3: Sharing and interaction, T9: System navigation and T10: System terminology. Their internal and external testing scores are also quite good.

To conclude, we consider the rating of all Themes to average out between 3 (“Most areas worked as expected”) and 4 (“All work as expected”). Any deviations between the different themes, evaluation methods and case studies are not significant. From these scores, we can conclude that the majority of users are satisfied with the performance of the BlogForever platform.
5.7 Discussion and Conclusions
While our solution overcomes the identified generic web archiving problems, some limitations are discussed in this section. The first concern is the definition of the target blogs to be preserved. As we described in Section 5.4.1, the BlogForever spider is provided with a predefined list of blogs to analyse and harvest. The problem lies in the fact that the list of target blogs has to be defined explicitly in advance. The administrators need to already have the list of blogs, which may not always be the case. When BlogForever is deployed to preserve the blogs of a specific organization, the list of target blogs is predefined, but when BlogForever is deployed in a different context, e.g. to create a repository of major Mathematics blogs, then the definition of target blogs is a major issue. The solution would be to have a mechanism able to generate and curate such a list in a semi-automatic or fully automatic way, based on configurable topic hierarchies. This way, the administrator would define the topics of interest and let the platform handle the specifics of blog collection management.

The Blog Spider also has limitations regarding the detection and processing of new blog content. First of all, it uses RSS and ping servers to receive notifications of blog content updates. Nevertheless, these methods do not notify about layout changes. Moreover, RSS is used in different ways: some blogs provide separate RSS feeds for posts and comments, while others provide RSS feeds just for posts. Thus, the detection of new comments in real time is problematic in such cases. We also face issues during the processing of unknown blog platforms or ‘exotic implementations’, because the identification/analysis process for the entities of blogs is a knowledge-intensive process and has to be adapted to new developments in blog platforms. Therefore, the amount of necessary adaptation depends on the particular domain and blog platforms that should be archived. In addition, while the identification of structural blog entities like posts, comments, etc. is achieved, validating whether the author uses a real name or an alias cannot be done automatically. To sum up, there are some difficulties in evaluating the validity of certain elements.

Another issue relates to the scalability of the BlogForever platform. Scalability is the ability of a system, network, or process to handle a growing amount of work in a capable manner, or its ability to be enlarged to accommodate that growth [25]. As we described in Section 5.4.3, the BlogForever repository component is based on the Invenio software suite, which has been in production since 2002; many popular repositories hosting millions of records, such as the CERN Document Server40 and INSPIRE41, use it with great success. Nevertheless, due to its reliance on the MySQL RDBMS, the BlogForever repository architecture is inherently not scalable to web-scale datasets. New database technologies such as NoSQL databases are a better fit for this purpose [41]. Thus, it would not be possible to deploy BlogForever on a large cluster of servers in order to create internet-scale blog archives.

In summary, we identified several limitations that should be addressed in further research and development. However, they do not refute the claim that the BlogForever system solves the identified problems of current web archiving.
40http://cds.cern.ch/ 41http://inspirehep.net/
Chapter 6
A Scalable Approach to Harvest Modern Weblogs
We present methods to automatically extract content such as articles, authors, dates and comments from blog posts. To achieve this goal, we introduce a simple yet robust and scalable algorithm to generate extraction rules based on string matching using the blog's web feed in conjunction with blog hypertext. Furthermore, we present a system architecture which is characterised by efficiency, modularity, scalability and interoperability with third-party systems. Finally, we conduct thorough evaluations of the performance and accuracy of our system1.
6.1 Introduction
One of the key challenges in developing blog archiving systems is the design of a web crawler capable of efficiently traversing blogs to harvest their content. The sheer size of the blogosphere combined with an unpredictable publishing rate of new information calls for a highly scalable system, while the lack of programmatic access to the complete blog content makes the use of automatic extraction techniques necessary. The variety of available blog publishing platforms offers a limited common set of properties that a crawler can exploit, further narrowed by the ever-changing structure of blog contents. Finally, an increasing number of blogs heavily rely on dynamically created content to present information, using the latest web technologies, hence invalidating traditional web crawling techniques.

1This chapter is based on the following publications:
• Banos V., Blanvillain O., Kasioumis N., Manolopoulos Y.: “A Scalable Approach to Harvest Modern Weblogs”, International Journal of AI Tools, Vol.24, N.2, 2015.
• Blanvillain O., Banos V., Kasioumis N.: “BlogForever Crawler: Techniques and Algorithms to Harvest Modern Weblogs”, Proceedings 4th International Conference on Web Intelligence, Mining & Semantics (WIMS), ACM Press, Thessaloniki, Greece, 2014.

A key characteristic of blogs which differentiates them from regular websites is their association with web feeds [89]. Their primary use is to provide a uniform subscription mechanism, thereby allowing users to keep track of the latest updates without the need to actually visit blogs. Concretely, a web feed is an XML file containing links to the latest blog posts along with their articles (abstract or full text) and associated metadata [122]. While web feeds essentially solve the question of update monitoring, their limited size makes it necessary to download blog pages to harvest previous content.

We present the open-source BlogForever Crawler, a key component of the BlogForever platform [80] responsible for traversing blogs, extracting their content and monitoring their updates. Our main objectives in this work are to introduce a new approach to blog data extraction and to present the architecture and implementation of a blog crawler capable of extracting articles, authors, publication dates, comments and potentially any other element which appears in weblog web feeds. Our contributions can be summarized as follows:
• A new algorithm to build extraction rules from web feeds and an optimised reformulation based on a particular string similarity algorithm featuring linear time complexity.
• A methodology to use the algorithm for blog article extraction and how it can be augmented to be used with other blog elements such as authors, publication dates and comments.
• The overall BlogForever crawler architecture and implementation with a focus on design decisions, modularity, scalability and interoperability.
• An approach to use a complete web browser to render JavaScript-powered webpages before processing them. This step allows our crawler to effectively harvest blogs built with modern technologies, such as the increasingly popular third-party commenting systems.
• A mapping of the extracted blog content to Archival Information Packages (AIPs) using the METS and MARCXML standards for interoperability purposes.
• An evaluation of the content extraction and execution time of our algorithm against three state-of-the-art web article extraction algorithms.

The concepts emerging from our research are viewed in the context of the BlogForever platform, but the presented algorithms, techniques and system architectures can be used in other applications related to Wrapper Generation and Web Data Extraction. The rest of this chapter is structured as follows: Section 6.2 introduces the new algorithms to extract data from blogs. Section 6.3 presents the blog crawler system architecture and implementation. Section 6.4 presents the evaluation and results. Finally, our conclusions and some discussion on our work are presented in Section 6.5.
6.2 Algorithms
We propose a new set of algorithms to extract blog post articles, as well as variations for extracting authors, dates and comments. We start with our motivation to use blog-specific characteristics, followed by our approach to build extraction rules which are applicable throughout a blog. Our focus is on minimising the algorithmic complexity while keeping our approach simple and generic.
6.2.1 Motivation
Extracting metadata and content from HTML documents is a challenging task because standards and format recommendations suffer from very low usage. W3C has been publishing web standards and format recommendations for quite some time [139]. For instance, according to the W3C HTML guidelines, the h1 tag has to contain the highest-level heading of the page and must not appear more than once per page [138]. More recently, specifications such as microdata [145] define ways to embed semantic information and metadata inside HTML documents, but these still suffer from very low usage: estimated to be used in less than 0.5% of websites [121]. In fact, the majority of websites rely on generic container elements such as div and span with custom id or class attributes to organise the structure of pages, and more than 95% of pages do not pass HTML validation [146]. Under such circumstances, relying on HTML structure to extract content from webpages is not viable and other techniques need to be employed. Having blogs as our target websites, we make the following observations, which play a central role in the extraction process:

1. Blogs provide web feeds: structured and standardized views of the latest blog posts.
2. Posts of the same blog share a similar HTML structure.

Web feeds usually contain about 20 blog posts [109], often fewer than the total number of posts in a blog. Consequently, to effectively archive the entire content of a blog, it is necessary to download and process pages beyond the ones referenced in the web feed.
6.2.2 Content Extraction Overview
To extract content from blog posts, we proceed by building extraction rules from the data given in the blog's web feed. The idea is to use a set of training data, pairs of HTML pages and target content, which are used to build an extraction rule capable of locating the target content on each HTML page.

Observation (1) allows the crawler to obtain input for the extraction rule generation algorithm: each web feed entry contains a link to the corresponding webpage as well as the blog post article (either abstract or full text), its title, authors and publication date. We call these fields targets, as they constitute the data our crawler aims to extract. Observation (2) guarantees the existence of an appropriate extraction rule, as well as its applicability to all posts of the blog. Each page is uniquely identified by its URL. If a page has already been processed, it is not processed again in the future.

Algorithm 1 shows the generic procedure we use to build extraction rules. The idea is quite simple: for each (page, target) input, compute, out of all possible extraction rules, the best one with respect to a certain ScoreFunction. The rule which is most frequently the best rule is then returned.

Algorithm 1: Best Extraction Rule
    input : set pageZipTarget of (page, target) pairs
    output: best extraction rule
    bestRules ⟵ new list
    foreach (page, target) in pageZipTarget do
        score ⟵ new map
        foreach rule in AllRules(page) do
            extracted ⟵ Apply(rule, page)
            score of rule ⟵ ScoreFunction(extracted, target)
        bestRules ⟵ bestRules + rule with highest score
    return rule with highest occurrence in bestRules

One might notice that each best rule computation is independent and operates on a different input pair. This implies that Algorithm 1 is embarrassingly parallel: iterations of the outer loop can trivially be executed on multiple threads. Functions in Algorithm 1 are voluntarily abstract at this point and will be explained in detail in the remainder of this section. Subsection 6.2.3 defines AllRules, Apply and the ScoreFunction we use for article extraction. In Subsection 6.2.4 we analyse the time complexity of Algorithm 1 and give a linear time reformulation using dynamic programming. Finally, Subsection 6.2.5 shows how the ScoreFunction can be adapted to extract authors, dates and comments.
6.2.3 Extraction Rules and String Similarity
In our implementation, rules are queries in the XPath language. Consequently, standard libraries can be used to parse HTML pages and apply extraction rules, providing the Apply function used in Algorithm 1. We experiment with three types of XPath queries: selection over the HTML id attribute, selection over the HTML class attribute, and selection using the relative path in the HTML tree. id attributes are expected to be unique, and class attributes show better consistency than relative paths over the pages of a blog in our experiments. For these reasons we opt to always favour class over path, and id over class, such that the AllRules function returns a single rule per node.

Unsurprisingly, the choice of ScoreFunction greatly influences the running time and precision of the extraction process. When targeting articles, extraction rule scores are computed with a string similarity function comparing the extracted strings with the target strings. We chose the Sorensen–Dice coefficient similarity [45], which is, to the best of our knowledge, the only string similarity algorithm fulfilling the following criteria:
1. Has low sensitivity to word ordering,
2. Has low sensitivity to length variations,
3. Runs in linear time.

Function AllRules(page)
    rules ⟵ new set
    foreach node in page do
        if node has id attribute then
            rules ⟵ rules + {"//*[@id='node.id']"}
        else if node has class attribute then
            rules ⟵ rules + {"//*[@class='node.class']"}
        else
            rules ⟵ rules + {RelativePathTo(node)}
    return rules
Properties 1 and 2 are essential when dealing with cases where the blog's web feed only contains an abstract or a subset of the entire post article. Table 6.1 gives examples illustrating how these two properties hold for the Sorensen–Dice coefficient similarity but not for edit-distance-based similarities such as the Levenshtein [88] similarity. The Sorensen–Dice coefficient similarity algorithm operates by first building sets of pairs of adjacent characters, also known as bigrams, and then applying the quotient of similarity formula:
Function Similarity(string1, string2)
    bigrams1 ⟵ Bigrams(string1)
    bigrams2 ⟵ Bigrams(string2)
    return 2 |bigrams1 ∩ bigrams2| / (|bigrams1| + |bigrams2|)

Function Bigrams(string)
    return set of pairs of adjacent characters in string

string1          string2                               Sorensen-Dice   Levenshtein
"Scheme Scala"   "Scala Scheme"                        90%             50%
"Rachid"         "Richard"                             18%             61%
"Rachid"         "Amy, Rachid and all their friends"   29%             31%

Table 6.1: Examples of string similarities.
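The similarity measure itself is straightforward to implement. The following minimal Python sketch (an illustration, not the BlogForever source code) reproduces the Sorensen–Dice column of Table 6.1 using sets of bigrams:

    def bigrams(text):
        """Set of pairs of adjacent characters (bigrams) of a string."""
        return {text[i:i + 2] for i in range(len(text) - 1)}

    def dice_similarity(s1, s2):
        """Sorensen-Dice coefficient: 2|A ∩ B| / (|A| + |B|) over bigram sets."""
        b1, b2 = bigrams(s1), bigrams(s2)
        if not b1 and not b2:
            return 1.0
        return 2 * len(b1 & b2) / (len(b1) + len(b2))

    print(round(dice_similarity("Scheme Scala", "Scala Scheme"), 2))   # 0.9
    print(round(dice_similarity("Rachid", "Richard"), 2))              # 0.18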
6.2.4 Time Complexity and Linear Reformulation
With the functions AllRules, Apply and Similarity (as ScoreFunction) now defined, the definition of Algorithm 1 for article extraction is complete. We can therefore proceed with a time complexity analysis. First, let us assume that we have at our disposal a linear time HTML parser that constructs an appropriate data structure, indexing HTML nodes on their id and class attributes, effectively making Apply an O(1) operation. As stated before, the outer loop splits the input into independent computations, and each call to AllRules returns (in linear time) at most as many rules as the number of nodes in its page argument. Therefore, the body of the inner loop will be executed O(n) times. Because each extraction rule can return any subtree of the queried page, each call to Similarity takes O(n) time, leading to an overall quadratic running time. We now present Algorithm 2, a linear time reformulation of Algorithm 1 for article extraction using dynamic programming.

Algorithm 2: Linear Time Best Content Extraction Rule
    input : set pageZipTarget of (HTML page, target text) pairs
    output: best extraction rule
    bestRules ⟵ new list
    foreach (page, target) in pageZipTarget do
        score ⟵ new map
        bigrams ⟵ new map
        bigrams of target ⟵ Bigrams(target)
        foreach node in page with post-order traversal do
            bigrams of node ⟵ Bigrams(node.text) ∪ bigrams of all node.children
            score of node ⟵ 2 |bigrams of node ∩ bigrams of target| / (|bigrams of node| + |bigrams of target|)
        bestRules ⟵ bestRules + Rule(node with best score)
    return rule with highest occurrence in bestRules

While very intuitive, the original idea of first generating extraction rules and then picking the best ones prevents us from effectively reusing previously computed N-grams (sets of adjacent characters). For instance, when evaluating the extraction rule for the HTML root node, Algorithm 1 will obtain the complete string of the page and pass it to the Similarity function. At this point, the information on where the string could be split into substrings with already computed N-grams is not accessible, and the N-grams of the page have to be computed by linearly traversing the entire string. To overcome this limitation and implement memoization over the N-gram computations, Algorithm 2 uses a post-order traversal of the HTML tree and computes node N-grams from their children's N-grams. This way, we avoid serializing HTML subtrees for each N-gram computation and have the guarantee that each character of the HTML page will be read at most once during the N-gram computation.
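The memoization idea behind Algorithm 2 can be sketched in a few lines of Python (illustrative only, not the crawler's actual code): bigram sets are built bottom-up over an lxml tree so that every character of the page is read once, and each node is scored against the target with the Sorensen–Dice quotient.

    from lxml import html

    def bigrams(text):
        return {text[i:i + 2] for i in range(len(text) - 1)}

    def best_node(page_html, target):
        """Score every HTML node against the target text, reusing the
        children's bigram sets when computing a parent's set (Algorithm 2)."""
        root = html.fromstring(page_html)
        target_bg = bigrams(target)
        best = (None, -1.0)

        def visit(node):                      # post-order traversal
            nonlocal best
            bg = bigrams(node.text or "")
            for child in node:
                bg |= visit(child)            # reuse the child's set: no re-reading
                bg |= bigrams(child.tail or "")
            score = 2 * len(bg & target_bg) / ((len(bg) + len(target_bg)) or 1)
            if score > best[1]:
                best = (node, score)
            return bg

        visit(root)
        return best

    node, score = best_node("<div><p id='a'>Scala Scheme</p><p>other</p></div>", "Scheme Scala")
    print(node.get("id"), round(score, 2))    # a 0.9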
Blog post text:
Apple has a new version of iOS 8 out, just a short time after the initial launch of the software. The update, 8.0.1, includes a number of fixes, but most notably (and listed first), it addressed the bug that prevented HealthKit apps from being available at launch. It also zaps some bugs with third-party keyboards, which should make them remain the default option until a user switches to another, which has been a sore spot for fans of the new external software keyboard options. Unfortunately, installing the iOS 8.0.1 update revealed that despite Apple’s promised fixes, it actually completely disables cellular service and Touch ID on many devices, though some iPhone 5s and older model owners report no issues. The bottom line is that you should definitely NOT install this update, at least until an updated version appears, at which time we’ll let you know it’s safe to go ahead. As you can see in the image below, Apple is also addressing an issue that blocked some photos from appearing in Photo Library, fixing reliability concerns around Reachability on iPhone 6 and 6 Plus (which brings the top of the screen down when you double touch the Home button), zapping bugs that cause unexpected data use when communicating via SMS or MMS, and improving the “Ask to Buy” feature for Family Sharing, specifically around in-app purchases, in addition to other minor bugs.

Weblog post text abstract from RSS:
Apple has a new version of iOS 8 out, just a short time after the initial launch of the software. The update, 8.0.1, includes a number of fixes, but most notably (and listed first), it addressed the bug that prevented HealthKit apps from being available at launch. It also zaps...

Table 6.2: TechCrunch blog post example.

N            2        3        4
Similarity   0.7817   0.6680   0.6212

Table 6.3: Blog post excerpt and full text similarity using different N values.

An interesting question is what the optimal N for N-gram computation is. To answer this question, we conduct some simple experiments. Using sample blog post excerpts and full texts, we calculate the string similarity between each excerpt and its full text. For instance, using a blog post full text and associated text excerpt from the popular blog TechCrunch, as presented in Table 6.2, the text similarity results using different N values (Table 6.3) support the selection of bigrams (N=2).

With bigrams computed in this dynamic programming manner, the overall time to compute all Bigrams(node.text) is linear. To conclude the argument that Algorithm 2 runs in linear time, we show that all other computations of the inner loop can be done in constant amortized time. As the number of edges in a tree is one less than the number of nodes, the amortized number of bigram unions per inner loop iteration tends to one. Each quotient of similarity computation requires one bigram-set intersection and three bigram-set length computations. Over a finite alphabet (we used printable ASCII), bigram sets have bounded size and each of these operations takes constant time.
6.2.5 Variations for Authors, Dates and Comments
Using string similarity as the only score measurement leads to poor performance on author and date extraction, and is not suitable for comment extraction. This subsection presents variations of the ScoreFunction which address the issues of these other types of content.

The case of authors is problematic because authors' names often appear in multiple places on a page, which results in several rules with maximum Similarity score. The heuristic we use to get around this issue consists of adding a new component to the ScoreFunction for author extraction rules: the tree distance between the evaluated node and the post content node. This new component takes advantage of the positioning of a post's author node, which often is a direct child of, or shares its parent with, the post content node.

Dates are affected by the same duplication issue, as well as by inconsistencies of format between web feeds and webpages. Our solution for date extraction extends the ScoreFunction for authors by comparing the extracted string to multiple targets, each being a different string representation of the original date obtained from the web feed. For instance, if the feed indicates that a post was published at "Thu, 01 Jan 1970 00:00:00", our algorithm will search for a rule that returns one of "Thursday January 1, 1970", "1970-01-01", "43 years ago" and so on. So far we do not support dates in multiple languages, but adding new target formats based on language detection would be a simple extension of our date extraction algorithm.

Comments are usually available in separate web feeds, one per blog post. Similarly to blog feeds, comment feeds have a limited number of entries, and when the number of comments on a blog post exceeds this limit, comments have to be extracted from webpages. To do so, we use the following ScoreFunction (a sketch of the matching-based score follows the list):

• Rules returning fewer HTML nodes than the number of comments on the feed are filtered out with a zero score,
• The scores of the remaining rules are computed as the value of the maximum weighted matching in the complete bipartite graph G = (U, V, E), where U is the set of HTML nodes returned by the rule, V is the set of target comment fields from the web feed (such as comment authors), and E(u, v) has weight equal to Similarity(u, v).

Regarding time complexity, computing the tree distance of each node of a graph to a single reference node and multiplying the number of targets by a constant factor can be done in linear time. However, computing the scores of comment extraction rules requires a more expensive algorithm. This is compensated by the fact that the proportion of candidate HTML nodes left after filtering out rules not returning enough results is very low in practice. Reformulations analogous to the one applied in Algorithm 2 can be straightforwardly applied to each ScoreFunction to minimize the time spent in Similarity calculations. It must be noted that there is no limitation due to comment nesting, as long as comments follow the same format.
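For the comment ScoreFunction, the maximum weighted matching can be computed with an off-the-shelf assignment solver. The sketch below is illustrative (the choice of SciPy is ours, not necessarily the project's); similarity is any string similarity function, such as the Sorensen–Dice coefficient above.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def comment_rule_score(extracted_texts, feed_comment_fields, similarity):
        """Score of a comment extraction rule: the maximum-weight matching in the
        bipartite graph between extracted nodes (U) and feed comment fields (V)."""
        if len(extracted_texts) < len(feed_comment_fields):
            return 0.0                                # too few nodes: rule filtered out
        weights = np.array([[similarity(u, v) for v in feed_comment_fields]
                            for u in extracted_texts])
        rows, cols = linear_sum_assignment(-weights)  # negate to maximize total weight
        return float(weights[rows, cols].sum())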
6.3 Architecture
We present the BlogForever crawler system architecture, which implements the proposed algorithms for weblog data extraction via the generation of extraction rules. We describe the system architecture and discuss the software tools and techniques we used, such as the enrichment of the Scrapy framework for our specific usage and the integration of a headless web browser into the harvesting process to achieve content extraction from webpages which use JavaScript to display content. Following, we focus on the scalability design and distributed architecture of our system. Finally, we present our provisions for interoperability using established open standards, which increases the value and reusability of the proposed system in many contexts.
6.3.1 System and Workflow
The BlogForever crawler is a Python application based on Scrapy, an open-source framework for web crawling. Scrapy provides an elegant and modular architecture, illustrated in Figure 6.1. Several components can be plugged into the Scrapy core infrastructure. Following, we present each part of the architecture and our own contributions:

• Spiders define how a target website is scraped, including how to perform the crawl (i.e. follow links). The BlogForever crawler implementation includes two new types of spiders: NewCrawl and UpdateCrawl, which implement the logic to respectively crawl a new blog and get updates from a previously crawled blog.
• Item Pipeline defines the processing of data extracted by the spiders through several components that are executed sequentially. The BlogForever crawler implementation includes a new item pipeline which orchestrates all aspects of crawling. More specifically, the BlogForever pipeline is defined as follows (a minimal sketch follows this list): 1. JavaScript rendering, 2. Extract content, 3. Extract comments, 4. Download multimedia files, 5. Prepare Archival Information Packages (AIPs) to propagate the results to potential back-ends.
• Downloader Middlewares is a framework of hooks into Scrapy's request/response processing, used to alter Scrapy's requests and responses.
• Spider Middlewares is a framework of hooks into Scrapy's spider processing mechanism.
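For readers unfamiliar with Scrapy, the item pipeline described above corresponds to ordinary pipeline classes wired together in the project settings. A minimal sketch with hypothetical class and field names (not the actual BlogForever modules):

    # pipelines.py -- each stage receives the item produced by a spider,
    # enriches it and passes it on to the next stage.
    import hashlib

    class ExtractContentPipeline:
        def process_item(self, item, spider):
            # apply the blog's extraction rule to the rendered page (assumed fields)
            item["article"] = spider.extraction_rule(item["rendered_html"])
            return item

    class ChecksumPipeline:
        def process_item(self, item, spider):
            item["md5"] = hashlib.md5(item["article"].encode("utf-8")).hexdigest()
            return item

    # settings.py -- lower numbers run first, mirroring the pipeline order above.
    ITEM_PIPELINES = {
        "crawler.pipelines.ExtractContentPipeline": 100,
        "crawler.pipelines.ChecksumPipeline": 200,
    }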
The system architecture provides great modularity. This is clearly illustrated in our work by the following examples:

• Disabling JavaScript rendering or plugging in an alternative back-end can be done by editing a single line of code.
Figure 6.1: Overview of the crawler architecture. (Credit: Pablo Hoffman, Daniel Graña, Scrapy)
• The features to extract comments and download multimedia files were implemented after creating the initial logic to extract content, and were added as extra steps in the pipeline.
• The requirement to implement the interoperability provisions presented later in Section 6.3.4 was easily covered with the implementation of an extra middleware plugin which is invoked from the main crawler architecture. No further modifications were necessary in the code.

In the remaining parts of this section, we elaborate on our work on each specific part of the crawler system.
6.3.2 JavaScript Rendering
JavaScript is a widely used language for client-side scripting. While some applications simply use it for aesthetics, an increasing number of websites use JavaScript to download and display content. In such cases, traditional HTML-based crawlers do not see webpages as they are presented to a human visitor by a web browser, and might therefore be ineffective for data extraction.

In our experiments whilst crawling the blogosphere, we encountered several blogs where the crawled data was incomplete because of the lack of JavaScript interpretation. The most frequent cases were blogs using the Disqus2 and LiveFyre3 comment hosting services. For webmasters, these tools are very handy because the entire comment infrastructure is externalized and their setup essentially comes down to including a JavaScript snippet in each target page. Both of these services heavily rely on JavaScript to download and display the comments, even providing functionalities such as real-time updates for edits and newly written comments. Less commonly, some blogs are fully rendered using JavaScript. When loading such websites, the web browser will not receive the page content as an HTML document, but will instead have to execute JavaScript code to download and display the page content. The Blogger platform provides Dynamic Views as a default template, which uses this mechanism [68].

To support blogs with JavaScript-generated content, we embed a full web browser into the crawler. After considering multiple options, we opted for PhantomJS, a headless web browser with great performance and scripting capabilities. JavaScript rendering is enabled by default and is the very first step of webpage processing. Therefore, extracting blog post articles, comments or multimedia files works equally well on blogs with JavaScript-generated content and on traditional HTML-only blogs. When the number of comments on a page exceeds a certain threshold, both Disqus and LiveFyre will only load the most recent ones and the stream of comments will end with a Show More Comments button. As part of the page loading process, we instruct PhantomJS to repeatedly click on these buttons until all comments are loaded. Paths to the Disqus and LiveFyre Show More buttons are obtained manually. They constitute the only non-generic elements of our extraction stack which require human intervention to maintain and extend to other commenting platforms.

2http://disqus.com/websites
3http://web.livefyre.com
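The "click until all comments are loaded" behaviour can be reproduced with any scriptable headless browser. The sketch below uses Selenium instead of PhantomJS and a hypothetical CSS selector, so it only illustrates the idea rather than the actual crawler code:

    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options

    SHOW_MORE = "a.load-more-comments"        # hypothetical selector, per platform

    def render_with_comments(url):
        """Load a page in a headless browser and keep clicking the
        'show more comments' button until it disappears."""
        options = Options()
        options.add_argument("-headless")
        driver = webdriver.Firefox(options=options)
        try:
            driver.get(url)
            while True:
                buttons = driver.find_elements(By.CSS_SELECTOR, SHOW_MORE)
                if not buttons or not buttons[0].is_displayed():
                    break
                buttons[0].click()
                time.sleep(1)                 # crude wait for the new comments to load
            return driver.page_source
        finally:
            driver.quit()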
6.3.3 Content Extraction
In order to identify webpages as blog posts, our implementation enriches Scrapy with two components that narrow the extraction process down to the subset of pages which are blog posts: blog post identification and a download priority heuristic. Given a URL entry point to a website, the default Scrapy behaviour traverses all the pages of the same domain in a last-in-first-out manner. The blog post identification function is able to identify whether a URL points to a blog post or not. Internally, for each blog, this function uses a regular expression constructed from the blog post URLs found in the web feed. This simple approach requires that blogs use the same URL pattern for all their posts (or false negatives will occur), which has to be distinct from the pattern of pages that are not posts (or false positives will occur). In practice, this assumption holds for all blog platforms we encountered and seems to be a common practice among web developers.

In order to efficiently deal with blogs that have a large number of pages which are not posts, the blog post identification mechanism is not sufficient. Indeed, after all pages identified as blog posts are processed, the crawler needs to download all other pages to search for additional blog posts. To replace naive random walk, depth-first search or breadth-first search website traversals, we use a priority queue where the priorities of new URLs are determined by a machine learning system. This mechanism has proven to be essential for blogs hosted on a single domain alongside a large number of other types of webpages, such as those in forums or wikis. The idea is to give high priority to URLs which are believed to point to pages with links to blog posts. These predictions are made using an active Distance-Weighted k-Nearest-Neighbour classifier [47]. Let L(u) be the number of links to blog posts contained in a page with URL u. Whenever a page is downloaded, its URL u and L(u) are given to the machine learning system as training data. When the crawler encounters a new URL v, it asks the machine learning system for an estimate of L(v), and uses this value as the download priority of v. L(v) is estimated by calculating a weighted average of the values of the k URLs most similar to v.
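One plausible way to build the blog post identification regular expression from the feed URLs (a hypothetical construction; the crawler's actual heuristic may differ) is to keep the path segments shared by all feed entries and generalise the varying ones:

    import re
    from urllib.parse import urlparse

    def build_post_url_pattern(feed_post_urls):
        """Derive a URL regex from the post URLs found in a blog's web feed."""
        paths = [urlparse(u).path.strip("/").split("/") for u in feed_post_urls]
        depth = min(len(p) for p in paths)
        parts = []
        for i in range(depth):
            segments = {p[i] for p in paths}
            parts.append(re.escape(segments.pop()) if len(segments) == 1 else "[^/]+")
        return re.compile("^/" + "/".join(parts) + "/?$")

    pattern = build_post_url_pattern(["http://blog.example.org/2014/05/first-post",
                                      "http://blog.example.org/2014/06/another-post"])
    print(pattern.pattern)    # ^/2014/[^/]+/[^/]+/?$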
6.3.4 The BlogForever Metadata Schema for Interoperability
One of the key BlogForever project goals is interoperability with third-party platforms. The original BlogForever crawler was intended to insert blog data directly into the BlogForever repository component, but the architecture was later reworked to make it possible to use other storage and archiving systems as well. To achieve this goal, we implement a special interoperability middleware for the spider to produce Archival Information Packages (AIPs) from harvested blog content. The AIPs can be used by any software platform which complies with the OAIS reference model [86]. It must also be noted that this is the first time weblog content is encoded in this way.

The AIPs consist of XML files structured using the METS [32] standard for encoding metadata and content. METS is widely adopted and supported by all popular digital library systems. In addition, the blog content attributes which are included in the METS XML packages are encoded using the MARCXML schema [97]. The reasons for the selection of MARCXML are the wide adoption of the standard, its flexibility and extensibility, as well as previous experience with the Invenio digital library system, which is also based on MARCXML. There are three kinds of entities which can be included in an AIP: Blog, Entry and Comment. The content extracted from weblogs is mapped to the relevant entities using the following rule: if an attribute is already defined in MARC for other content types, the same MARC code is used for blogs; if an attribute is totally new, an unused MARC 9xx tag is chosen to represent it, thereby composing the BlogForever metadata schema [92]. Following, we present the BlogForever metadata schema for the Blog, Post, Page and Comment entities in Tables 6.4, 6.5 and 6.6.
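To illustrate the target encoding, the following sketch serialises a post title and author into MARCXML datafields following the mappings of Table 6.5 (the values are hypothetical, and real AIPs additionally wrap such records in METS structural metadata):

    from lxml import etree

    MARCXML_NS = "http://www.loc.gov/MARC21/slim"

    def datafield(record, tag, subfields):
        """Append a MARC datafield with the given tag and (code, value) subfields."""
        df = etree.SubElement(record, "{%s}datafield" % MARCXML_NS,
                              tag=tag, ind1=" ", ind2=" ")
        for code, value in subfields:
            sf = etree.SubElement(df, "{%s}subfield" % MARCXML_NS, code=code)
            sf.text = value
        return df

    record = etree.Element("{%s}record" % MARCXML_NS)
    datafield(record, "245", [("a", "iOS 8.0.1 released")])   # post title  -> 245 $a
    datafield(record, "100", [("a", "Jane Blogger")])         # post author -> 100 $a
    print(etree.tostring(record, pretty_print=True).decode())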
Blog attribute        MARC 21 representation
title                 245 $a
subtitle              245 $b
URI                   520 $u
aliases               100 $g
status_code           952 $a
language              041 $a
encoding              532
sitemap_uri           520
platform              781 $a
platform_version      781 $b
webmaster             955 $a
hosting_ip            956 $a
location_city         270 $d
location_country      270 $b
last_activity_date    954 $a
post_frequency        954 $b
update_frequency      954 $c
copyright             542
ownership_rights      542
distribution_rights   542
access_rights         542
license               542 $f

Table 6.4: Blog record attributes - MARC 21 representations mapping.

Post and page attribute   MARC 21 representation
title                     245 $a
subtitle                  245 $b
full_content              520 $a
full_content_format       520 $b
author                    100 $a
URI                       520 $u
aliases                   100 $g
alt_identifier (UR)       0247 $a
date_created              269 $c
date_modified             260 $m
version                   950 $a
status_code               952 $a
response_code             952 $b
geo_longitude             342 $g
geo_latitude              342 $h
access_restriction        506
has_reply                 788 $a
last_reply_date           788 $c
num_of_replies            788 $b
child_of                  760 $o $4 $w

Table 6.5: Post and page record attributes - MARC 21 representations mapping.

Comment attribute       MARC 21 representation
subject                 245 $a
author                  100 $a
full_content            520 $a
full_content_format     520 $b
URI                     520 $u
status                  952 $a
date_added              269 $c
date_modified           269 $m
addressed_to_URI        789 $u
geo_longitude           342 $g
geo_latitude            342 $h
has_reply               788 $a
num_replies             788 $b
is_child_of_post        773 $o $4 $w
is_child_of_comment     773 $o $4 $w

Table 6.6: Comment record attributes - MARC tags mapping.

6.3.5 Distributed Architecture and Scalability

One of the problems of web crawling is the large amount of input which needs to be processed. To address this issue, it is crucial to build every layer of the system with scalability in mind [10]. The BlogForever Crawler, and in particular the two core procedures NewCrawl and UpdateCrawl, are designed to be usable as part of an event-driven, scalable and fault-resilient distributed system. Heading in this direction, we made the key design choice to have both NewCrawl and UpdateCrawl be stateless components. From a high-level point of view, these two components are purely functional:
    NewCrawl : URL → P(RECORD)
    UpdateCrawl : URL × DATE → P(RECORD)

where URL, DATE and RECORD are respectively the sets of all URLs, dates and records, and P denotes the power set operator. By delegating all shared mutable state to the back-end system, web crawler instances can be added, removed and used interchangeably. To implement a distributed crawler architecture, we choose to use Scrapyd4, an application for deploying and running Scrapy spiders. The process is quite straightforward (a sketch of the control program follows the list):

1. Deploy the BlogForever crawler on any number of servers, according to requirements. Using the Scrapyd component, which runs as a system daemon, each crawler listens for requests to run crawling tasks and spawns a process for each new command.
2. Implement a small control program that reads the list of target weblogs which need to be crawled and issues commands in a round-robin fashion using the Scrapyd JSON API5.
3. All crawlers share a common storage service where they save the crawling results.
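A minimal sketch of such a control program (host, project and spider names are hypothetical; Scrapyd's schedule.json endpoint accepts the project, the spider and any extra spider arguments):

    from itertools import cycle
    import requests

    SCRAPYD_HOSTS = ["http://crawler1:6800", "http://crawler2:6800"]   # hypothetical hosts
    PROJECT, SPIDER = "blogforever", "newcrawl"                        # hypothetical names

    def schedule_crawls(blog_urls):
        """Distribute NewCrawl jobs over the Scrapyd instances in round-robin order."""
        hosts = cycle(SCRAPYD_HOSTS)
        for url in blog_urls:
            host = next(hosts)
            response = requests.post(host + "/schedule.json",
                                     data={"project": PROJECT, "spider": SPIDER, "url": url})
            response.raise_for_status()
            print(host, response.json().get("jobid"))

    schedule_crawls(["http://blog.example.org/", "http://another.example.net/"])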
6.4 Evaluation
Our evaluation is articulated in two parts. First, we compare the article extraction procedure presented in section 6.2 with three open-source projects capable of extracting articles and titles from webpages. The comparison will show that our weblog-targeted solution has better performance both in terms of success rate and running time. Second, a discussion is held regarding the different solutions available to archive data beyond what is available in the HTML source code. Extraction of authors, dates and comments is not part of this evaluation because of the lack of publicly available competing projects and reference data sets. In our experiments we used Debian GNU/Linux 7.2, Python 2.7 and an Intel Core i7-3770 3.4 GHz processor. Timing measurements were made on a single dedicated core with garbage collection disabled. The Git repository for this work6 contains the necessary scripts and in- structions to reproduce all the evaluation experiments presented in this section. The crawler source code is available under the MIT license from the project’s websites7.
4http://scrapyd.readthedocs.org/en/latest/ 5http://scrapyd.readthedocs.org/en/latest/api.html 6https://github.com/OlivierBlanvillain/bfc-paper, accessed August 1, 2015 7https://github.com/BlogForever/crawler, accessed August 1, 2015 140 A Scalable Approach to Harvest Modern Weblogs
Target    Our approach    Readability    Boilerpipe    Goose
Article   93.0%           88.1%          79.3%         79.2%
Title     95.0%           74.0%          N/A           84.9%

Table 6.7: Extraction success rates for different algorithms.
6.4.1 Extraction Success Rates
To evaluate article and title extraction from weblog posts, we compare our approach to three open-source projects: Readability8, Boilerpipe [84] and Goose9, implemented in JavaScript, Java and Scala respectively. These projects are more generic than our blog-specific approach in the sense that they identify and extract data directly from the HTML source code and do not make use of web feeds or of structural similarities between pages of the same weblog (observations (1) and (2)). Table 6.7 shows the extraction success rates for articles and titles on a test sample of 2300 posts from 230 weblogs obtained from the Spinn3r dataset [31]. On our test dataset, algorithm 1 outperformed the competition by 4.9% on article extraction and 10.1% on title extraction. It is important to stress that Readability, Boilerpipe and Goose rely on generic techniques such as word density, paragraph clustering and heuristics on HTML tagging conventions, which are designed to work for any type of webpage. By contrast, our algorithm is only suitable for pages with associated web feeds, as these provide the reference data used to build extraction rules. Therefore, the results shown in Table 6.7 should not be interpreted as a general quality evaluation of the different projects, but simply as evidence that our approach is more suitable when working with weblogs.
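To make observations (1) and (2) concrete, the sketch below shows, under simplifying assumptions, how reference titles taken from a weblog's feed can be used to select an extraction rule that is then reused on every post of the same weblog. The candidate XPath rules and the exact-match scoring are illustrative simplifications, not the rule-generation algorithm presented in section 6.2.

```python
# Illustrative sketch: pick the XPath rule that best reproduces the titles announced
# in the web feed, then reuse it for all posts of the same weblog.
# The candidate rules and exact-match scoring are simplifications.
import lxml.html

CANDIDATE_RULES = [                                    # hypothetical candidate rules
    "//h1//text()",
    "//h2[contains(@class, 'entry-title')]//text()",
    "//title/text()",
]

def extract(rule, page_html):
    """Apply an XPath rule and return the concatenated, stripped text."""
    tree = lxml.html.fromstring(page_html)
    return " ".join(t.strip() for t in tree.xpath(rule) if t.strip())

def best_rule(feed_items):
    """feed_items: list of (page_html, reference_title) pairs taken from the web feed."""
    def hits(rule):
        return sum(extract(rule, html) == title.strip() for html, title in feed_items)
    return max(CANDIDATE_RULES, key=hits)

# Once computed, the selected rule is applied to every remaining post of the weblog,
# which is why per-post extraction cost stays in the millisecond range.
```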
6.4.2 Article Extraction Running Times
In addition to the quality of the extracted data, we also evaluated the running time of the extraction procedure. The main point of interest is the ability of the extraction procedure to scale as the number of posts in the processed weblog increases. This corresponds to the evaluation of a NewCrawl task, which is in charge of harvesting all published content of a weblog. Figure 6.2 shows the cumulative time spent in each article extraction procedure (excluding common tasks such as downloading pages and storing results) as a function of the number of weblog posts processed. We used the Quantum Diaries10 blog for this experiment. The data presented in this graph were obtained by taking the arithmetic mean over 10 measurements. We consider these results significant given that the standard deviations are of the order of 2 milliseconds. As illustrated in Figure 6.2, our approach spends the majority of its total running time between initialisation and the processing of the first weblog post.
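The timing methodology described above (single process, garbage collection disabled, mean over repeated runs) can be reproduced along the following lines; extract_post and posts are placeholders for the extraction procedure and the input pages, and the sketch is not the exact measurement script from the project repository.

```python
# Sketch of the timing methodology: cumulative extraction time per post,
# garbage collection disabled, averaged over repeated runs.
import gc
import timeit

def cumulative_times(extract_post, posts, runs=10):
    """Return, for each post index, the mean cumulative extraction time in seconds."""
    totals = [0.0] * len(posts)
    for _ in range(runs):
        gc.disable()                               # avoid GC pauses polluting the measurements
        elapsed = 0.0
        for i, post in enumerate(posts):
            start = timeit.default_timer()
            extract_post(post)
            elapsed += timeit.default_timer() - start
            totals[i] += elapsed
        gc.enable()
    return [t / runs for t in totals]
```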
8 https://github.com/gfxmonk/python-readability, accessed August 1, 2015
9 https://github.com/GravityLabs/goose, accessed August 1, 2015
10 http://www.quantumdiaries.org, accessed August 1, 2015
Figure 6.2: Article extraction running time
This initial increase of about 0.4 seconds corresponds to the cost of executing algorithm 2 to compute the extraction rule for articles. As already mentioned, this consists of computing the best extraction rule for each page referenced by the web feed and picking the one that functions best across these pages. Once we have this extraction rule, processing each subsequent weblog post only requires parsing the page and applying the rule, which takes about 3 milliseconds and is barely visible on the scale of Figure 6.2. The other evaluated solutions do not function this way: each weblog post is processed as a new and independent input, leading to approximately linear running times. The vertical dashed line at 15 processed weblog posts represents a suitable point of comparison for processing time per post. Indeed, as the web feed of our test weblog contains 15 posts, the extraction rule computation performed by our approach includes the cost of entirely processing these 15 entries. That being said, comparing the raw performance of algorithms implemented in different programming languages is not very informative, given the high variation of running times observed across programming languages [73]. When compared to the other parts of crawling, content extraction is sufficiently quick not to be a bottleneck. Indeed, extracting the content of 100 pages takes about half a second, downloading 100 pages takes at least 1 second, and, if enabled, rendering and taking screenshots takes about half a second per page.
6.5 Discussion and Conclusions
We presented a scalable approach to harvest modern weblogs. Our approach is based on a new algorithm to build extraction rules from web feeds. The key observations which led us to the inception and implementation of this algorithm are the facts that: (1) weblogs provide web feeds which include structured and standardized views of the latest blog posts, and (2) posts of the same weblog share a similar HTML structure. We then presented a simple adaptation of this procedure that allows extracting different types of content, including authors, dates, comments and potentially any other element. The elaboration of this process enables the wider use of the method, as it is feasible to devise variations that extract any kind of weblog content.

A critical part of this work was the presentation of the BlogForever crawler architecture and the discussion of the software tools and novel techniques we used. To support rapidly evolving web technologies such as JavaScript-generated content, the crawler uses a headless web browser to render pages before processing them. Another important aspect of the crawler architecture is its interoperability provisions. We introduced a new metadata schema for interoperability in our design, encoding weblog crawling results in Archival Information Packages (AIPs) using established open standards. This feature enables the use of our system in many contexts and with multiple different back-ends, increasing its relevance and reusability. We also highlighted the design choices made to achieve both modularity and scalability. This is particularly relevant in the domain of web crawling, given that intensive network operations can be a serious bottleneck. Crawlers greatly benefit from the use of multiple web access points, which makes them natural candidates for distributed computing.

Our method achieved strong content extraction accuracy and performance against state-of-the-art open-source web article extraction systems, as presented in the evaluation section. We have to note, though, that our experiments on a considerably large weblog dataset showed some failing tests, which stem either from the violation of one of our two key observations or from an insufficient amount of text in posts. Therefore, potential users are advised to ensure that these observations hold on the target weblogs before using the BlogForever crawler. Future work could attempt to alleviate this problem using hybrid extraction algorithms. Combining our approach with other techniques such as word density or special reasoning could lead to better performance, given that these techniques are insensitive to the above issues.

Chapter 7
Conclusions and Future Work
7.1 Conclusions
This thesis presented tangible ways to improve web crawling, web archiving and blog archiving in particular, with the inception of new algorithms and systems. We defined the Credible Live Evaluation of Archive Readiness Plus (CLEAR+) method to evaluate Website Archivability in Chapter 3. We introduced methods to improve web crawling via detecting duplicates and near-duplicates on the fly in Chapter 4. We also provided a solution to the hot issue of detecting web spider traps in the same chapter. Focusing on weblog archiving, we studied the technical aspects of the blogosphere and we introduced a new approach to harvest, manage, preserve and reuse weblog content with the BlogForever platform in Chapter 5. Finally, we created a scalable approach to harvest modern weblogs in Chapter 6. We conclude with a set of interesting points from this research:
• Website Archivability is a useful metric that can quantify the amenability of a website to being archived with completeness and accuracy.
• Web Content Management Systems have room for significant improvement towards their amenability to being archived with completeness and accuracy.
• Webgraph similarity detection can be applied successfully to merge webgraph nodes and reduce webgraph complexity, improving further processing.
• Webgraph cycles can be identified if we merge near-duplicate webpages, resulting in node merging and edge contractions. Webgraph cycles imply web spider traps (see the sketch after this list).
• Easy-to-use yet thorough web applications such as ArchiveReady and WebGraph-it have a significant impact on many users and communicate the intended notions to larger web-related audiences. See also the Website Archivability Impact appendix.
• If we use the special characteristics of weblogs, we can develop more efficient weblog data extraction mechanisms.
• Higher data granularity in web archives improves their value, reusability and applications.
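As a minimal illustration of the node-merging point above, the following sketch contracts groups of near-duplicate pages into single nodes and then checks the contracted webgraph for cycles with a plain depth-first search. The duplicate groups and edge list are hypothetical inputs, and the sketch stands in for, rather than reproduces, the algorithms described in Chapter 4.

```python
# Illustrative sketch: contract near-duplicate pages into single nodes, then detect
# cycles in the contracted webgraph (a cycle suggests a potential spider trap).

def contract(edges, duplicate_groups):
    """Map every URL to a representative of its near-duplicate group and rebuild the edge set."""
    rep = {url: group[0] for group in duplicate_groups for url in group}
    contracted = set()
    for src, dst in edges:
        s, d = rep.get(src, src), rep.get(dst, dst)
        if s != d:                        # drop self-loops created by the contraction
            contracted.add((s, d))
    return contracted

def has_cycle(edges):
    """Depth-first search cycle check on a directed graph given as (src, dst) edges."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    visited, in_stack = set(), set()

    def dfs(node):
        visited.add(node)
        in_stack.add(node)
        for child in graph.get(node, []):
            if child in in_stack:
                return True               # back edge found: the graph contains a cycle
            if child not in visited and dfs(child):
                return True
        in_stack.discard(node)
        return False

    return any(node not in visited and dfs(node) for node in graph)

# Example (hypothetical data): two near-duplicate pages linking back to a hub
# collapse into one node, exposing the cycle hub -> page -> hub.
edges = {("hub", "page?id=1"), ("page?id=1", "hub"), ("hub", "page?id=2")}
groups = [["page?id=1", "page?id=2"]]
print(has_cycle(contract(edges, groups)))   # True
```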
7.2 Future Work
Future work will focus on extending the methods created in this thesis, but an important aspect will also be to promote their usage to larger audiences, and especially in industry.

Future work on the CLEAR+ method presented in Chapter 3 will move in three directions: (a) further CLEAR+ method development, (b) dissemination to larger web-related audiences, and (c) exploring applications in web archiving and web development. We will also try to extend the evaluations over more website attributes, such as capturing JavaScript execution and perhaps automated interaction with the page (random clicking, scrolling down, etc.). Besides method development, it is also critical to communicate the notion of WA and the method to evaluate it to larger web-related audiences, where we hope it will have an important impact. We also plan to explore applications of the method in web archiving, web development and related education activities. Towards this direction, we are planning to implement plugins for popular CMSs to enable web professionals to integrate WA evaluations in their systems.

We aim to evolve the webgraph-assisted web crawling methods we presented in Chapter 4 and also create new variants. We plan to implement our methods in other existing open web crawlers such as the BlogForever platform [16]. Finally, we aim to launch a public web service via http://webgraph-it.com to provide users with web crawling and webgraph generation services. The applications of web crawling optimisation, webgraph generation and web spider trap detection are numerous: web crawling engineers would be able to streamline their web crawling operations by identifying web spider traps and other problematic webpages, researchers would have quality web crawling data and webgraphs for experimentation, and students would be able to learn more about the web, web crawling and webgraphs.

The BlogForever platform presented in Chapter 5 could be expanded in several ways, which we are planning to explore in future work. First of all, we can address the current limitation in how target blogs are selected by handling them in a more automated way. As we described, the target blogs are a relatively 'static' list defined by the administrators of the system. An automated way of manipulating the target blog list by defining topics and rules would be welcome, especially for larger and more complex blog repositories. Second, the Blog Spider could be developed considerably in order to alleviate new content detection and processing issues. A new way to detect layout changes in blogs would be a valuable addition to the existing update detection mechanisms. What is more, blog entity analysis and detection could be optimised to cope with a wider range of possible instances. Third, the BlogForever platform could become considerably more scalable by switching the database component from a relational database system (MySQL) to a more scalable database architecture such as NoSQL. As we described in the previous section, this is an important limitation for deploying web-scale blog repositories using the BlogForever platform. Fourth, we can expand the targets of the BlogForever platform to include micro-blogs. Microblogs differ from traditional blogs in that their contents are typically smaller in both actual and aggregate file size [79]. Therefore, it will be possible to preserve microblogs as well without significantly altering the requirements, architecture and implementation of the BlogForever platform.
Finally, future work on the weblog spider algorithm presented in Chapter 6 could attempt to alleviate the extraction failures discussed in Section 6.5 by using hybrid extraction algorithms. Combining our approach with other techniques such as word density or special reasoning could lead to better performance, given that these techniques are insensitive to the issues described there.
Bibliography
[1] Berger, P., Hennig, P., Bross, J., and Meinel, C. (2011). Mapping the blogosphere - a universal and scalable blog-crawler. In Proceedings 3rd International Conference on Social Computing (SocialCom), pages 672-677.
[2] Abou-Zahra, S. and Squillace, M. (2006). Evaluation and report language (earl) 1.0 schema. http://www.w3.org/TR/EARL10-Schema/. [Online; accessed 22-December- 2014].
[3] Agarwal, A., Koppula, H. S., Leela, K. P., Chitrapura, K. P., Garg, S., Haty, C., Roy, A., and Sasturkar, A. (2009). URL normalization for de-duplication of web pages. In Proceedings 18th ACM Conference on Information & Knowledge Management (CIKM), pages 1987–1990.
[4] Ainsworth, S. G., Alsum, A., SalahEldeen, H., Weigle, M. C., and Nelson, M. L. (2011). How much of the web is archived? In Proceedings 11th Annual International ACM/IEEE Joint Conference, page 133, New York, NY. ACM Press.
[5] AlSum, A. and Nelson, M. L. (2013). Arclink: optimization techniques to build and retrieve the temporal web graph. In Proceedings 13th ACM/IEEE Joint Conference on Digital libraries (JCDL), pages 377–378.
[6] Arango, S., Pinsent, E., Sleeman, P., Gkotsis, G., Stepanyan, K., Rynning, M., and Kop- idaki, S. (2013). Blogforever: D5.2 implementation of case studies. Technical report.
[7] Arango-Docio, S., Farrell, T., Gkotsis, G., Kopidaki, S., Pinsent, E., Rynning, M., and Sleeman, P. (2012). Blogforever d5.1: Design and specification of case studies.
[8] Arms, C., Fleischhauer, C., and Murray, K. (2013). Sustainability of digital formats planning for Library of Congress collections: external dependencies. http://www.digitalpreservation.gov/formats/sustain/sustain.shtml#external. [Online; accessed 22-December-2014].
[9] Ashley, K., Davis, R., Guy, M., Kelly, B., Pinsent, E., and Farrell, S. (2010). A guide to web preservation. Technical report, JISC PoWR Project.
[10] Various authors (2013). The reactive manifesto. http://reactivemanifesto.org. August 1, 2015.
[11] Avižienis, A., Laprie, J.-C., and Randell, B. (2001). Fundamental concepts of computer system dependability. In Proceedings IARP/IEEE-RAS Workshop on Robot Dependabil- ity: Technological Challenge of Dependable, Robots in Human Environments. [12] Baeza-Yates, R., Castillo, C., Marin, M., and Rodriguez, A. (2005). Crawling a coun- try: better strategies than breadth-first for web page ordering. In Proceedings (companion) 14th International Conference on World Wide Web (WWW), pages 864–872. [13] Banos, V., Arango-Docio, S., Pinsent, E., and Sleeman, P. (2013a). Blogforever: D5.5 case studies comparative analysis and conclusions. Technical report, Technical report. [14] Banos, V., Arango-Docio, S., Sleeman, P., Stepanyan, K., Rynning, M., Kopidaki, S., Manolopoulos, Y., and Trochidis, I. (2013b). Blogforever: D5.3 user questionnaires and reports. Technical report, Technical report. [15] Banos, V., Baltas, N., and Manolopoulos, Y. (2012). Trends in blog preservation. In Proceedings 14th International Conference on Enterprise Information Systems (ICEIS), Wroclaw, Poland. [16] Banos, V., Blanvillain, O., Kasioumis, N., and Manolopoulos, Y. (2015). A scalable approach to harvest modern weblogs. International Journal on Artificial Intelligence Tools, 24(2). [17] Banos, V., Kim, Y., Ross, S., and Manolopoulos, Y. (2013c). Clear: a credible method to evaluate website archivability. In Proceedings 10th International Conference on Preservation of Digital Objects (iPRES). [18] Bar-Yossef, Z., Keidar, I., and Schonfeld, U. (2009). Do not crawl in the dust: different URLs with similar text. ACM Transactions on the Web, 3(1):3. [19] Berners-Lee, T., Fielding, R., and Masinter, L. (2005). Rfc 3986: Uniform resource identifier (URI): Generic syntax. Technical report, The Internet Engineering Task Force. [20] Bharat, K. and Broder, A. (1999). Mirror, mirror on the web: A study of host pairs with replicated content. Computer Networks, 31(11):1579–1590. [21] Bizer, C., Heath, T., and Berners-Lee, T. (2009). Linked data-the story so far. Inter- national Journal Semantic Web & Information Systems, 5(3):1–22. [22] Boiko, B. (2001). Understanding content management. Bulletin of the American Soci- ety for Information Science & Technology, 28(1):8–13. [23] Boldi, P., Codenotti, B., Santini, M., and Vigna, S. (2003). UbiCrawler: a scalable fully distributed web crawler. Software - Practice & Experience, 34(8):711–726. [24] Boldi, P. and Vigna, S. (2004). The webgraph framework i: compression techniques. In Proceedings 13th International Conference on World Wide Web (WWW), pages 595–602. [25] Bondi, A. B. (2000). Characteristics of scalability and their impact on performance. In Proceedings 2nd International Workshop on Software & Performance (WOSP), pages 195–203. ACM Press. [26] Brandes, U., Eiglsperger, M., Lerner, J., and Pich, C. (2010). Graph markup language (GraphML). Bibliothek der Universität Konstanz. Bibliography 149
[27] Brickley, D. and Miller, L. (2010). FOAF vocabulary specification 0.98. Namespace Document, 9. [28] Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. (2000). Graph structure in the web. Computer Networks, 33(1):309– 320. [29] Brunelle, J. F., Kelly, M., SalahEldeen, H., Weigle, M. C., and Nelson, M. L. (2014). Not all mementos are created equal: Measuring the impact of missing resources. In Proceedings 14th ACM/IEEE Joint Conference on Digital libraries (JCDL), page 321 330. IEEE Press. [30] Burggraf, D. S. (2006). Geography markup language. Data Science Journal, 5:178 204. [31] Burton, K., Kasch, N., and Soboroff, I. (2011). The ICWSM 2011 spinn3r dataset. In Proceedings 5th Annual Conference on Weblogs & Social Media (ICWSM). [32] Cantara, L. (2005). Mets: The metadata encoding and transmission standard. Cata- loging & classification quarterly, 40(3-4):237–253. [33] Caplan, P. (2006). Preservation metadata, curation reference manual. http://www.dcc. ac.uk/resources/curation-reference-manual/completed-chapters/preservation-metadata. [Online; accessed 22-December-2014]. [34] Center, M. D. (2008). Mozilla’s quirks mode. 2007. [35] Charikar, M. S. (2002). Similarity estimation techniques from rounding algorithms. In Proceedings 34th Annual ACM Symposium on Theory of Computing (STOC), pages 380–388. [36] Chittenden, T. (2010). Digital dressing up: modelling female teen identity in the dis- cursive spaces of the fashion blogosphere. Journal of Youth Studies, 13(4):505–520. [37] Chung, D. S., Kim, E., Trammell, K. D., and Porter, L. V. (2007). Uses and perceptions of blogs: A report on professional journalists and journalism educators. Journalism & Mass Communication Educator, 62(3):305–322. [38] Clausen, L. (2004). Concerning etags and datestamps. In Proceedings 4th International Web Archiving Workshop (IWAW). Citeseer. [39] Coalition, D. P. (2012). Institutional strategies - standards and best practice guide- lines. http://www.dpconline.org/advice/preservationhandbook/institutional-strategies/ standards-and-best-practice-guidelines. [Online; accessed 22-December-2014]. [40] Consortium, W. W. W. et al. (1999). Html 4.01 specification. [41] Cooper, B. F., Silberstein, A., Tam, E., Ramakrishnan, R., and Sears, R. (2010). Bench- marking cloud serving systems with ycsb. In Proceedings 1st ACM Symposium on Cloud computing (SOCC), pages 143–154. ACM. [42] da Costa Carvalho, A. L., de Moura, E. S., da Silva, A. S., Berlt, K., and Bezerra, A. (2007). A cost-effective method for detecting web site replicas on search engine databases. Data & Knowledge Engineering, 62(3):421–437. 150 Bibliography
[43] Dasgupta, A., Kumar, R., and Sasturkar, A. (2008). De-duping URLs via rewrite rules. In Proceedings 14th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pages 186–194. [44] Denev, D., Mazeika, A., Spaniol, M., and Weikum, G. (2011). The sharc framework for data quality in web archiving. The VLDB Journal, 20(2):183–207. [45] Dice, L. R. (1945). Measures of the amount of ecologic association between species. Ecology, 26(3):297. [46] Donnelly, M. (2006). Jstor/harvard object validation environment (jhove). Digital Curation Centre Case Studies and Interviews. [47] Dudani, S. A. (1976). The distance-weighted 푘-nearest-neighbor rule. IEEE Transac- tions on Systems, Man & Cybernetics, 6(4):325–327. [48] Duff, W. and van Ballegooie, M. (2006). Archival metadata, curation reference man- ual. http://www.dcc.ac.uk/resources/curation-reference-manual/completed-chapters/ archival-metadata. [Online; accessed 22-December-2014]. [49] Ellson, J., Gansner, E. R., Koutsofios, E., North, S. C., and Woodhull, G. (2004). Graphviz and dynagraph—static and dynamic graph drawing tools. In Graph drawing software, pages 127–148. Springer. [50] Elsas, J. L., Arguello, J., Callan, J., and Carbonell, J. G. (2008). Retrieval and feedback models for blog feed search. In Proceedings 31st Annual International ACM Conference on Research & Development in Information Retrieval (SIGIR), pages 347–354, New York, NY. ACM Press. [51] Eltantawy, N. and Wiest, J. B. (2012). Social media in the egyptian revolution: Recon- sidering resource mobilization theory (1-3). [52] Faheem, M. (2012). Intelligent crawling of web applications for web archiving. In Proceedings (companion) 21st International Conference on World Wide Web (WWW), pages 127–132, New York, NY. ACM Press. [53] Fernández-Garcia, N., Sánchez-Fernandez, L., and Villamor-Lugo, J. (2004). Next generation web technologies in content management. In Proceedings (companion) 13th International Conference on World Wide Web (WWW), pages 260–261. [54] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and Berners- Lee, T. (1999). Hypertext transfer protocol http/1.1. http://tools.ietf.org/html/rfc2616. [Online; accessed 10-November-2014]. [55] Firesmith, D. G. and Henderson-Sellers, B. (2002). The OPEN process framework: An introduction. Pearson Education. [56] Fitzner, D. (2010). Requirements specification of the teleios user community. Technical report, Fraunhofer IGD. [57] Furche, T., Gottlob, G., Grasso, G., Schallhart, C., and Sellers, A. J. (2013). Oxpath: A language for scalable data extraction, automation, and crawling on the deep web. The VLDB Journal, 22(1):47–72. Bibliography 151
[58] Gkotsis, G., Stepanyan, K., Cristea, A. I., and Joy, M. (2013). Self-supervised auto- mated wrapper generation for weblog data extraction. In Proceedings 29th British Na- tional Conference on Databases (BNCOD), pages 292–302, Oxford, UK. [59] Glenn, V. D. (2007). Preserving government and political information: The web-at-risk project. First Monday, 12(7). [60] Gomes, D., Costa, M., Cruz, D., Miranda, J., and Fontes, S. (2013). Creating a billion- scale searchable web archive. In Proceedings (companion) 22nd International Confer- ence on World Wide Web (WWW), pages 1059–1066. [61] Gomes, D., Miranda, J., and Costa, M. (2011). A survey on web archiving initiatives. In Proceedings 15th International Conference on Theory & Practice of Digital Libraries, (TPDL), pages 408–420. [62] Gomes, D., Santos, A. L., and Silva, M. J. (2006). Managing duplicates in a web archive. In Proceedings ACM Symposium on Applied Computing (SAC), pages 818–825. [63] Gomes, D. and Silva, M. J. (2006). Modelling information persistence on the web. In Proceedings 6th International Conference on Web Engineering (ICWE), pages 193–200. ACM. [64] Gottlob, G., Koch, C., Baumgartner, R., Herzog, M., and Flesca, S. (2004). The lixto data extraction project: Back and forth between theory and practice. In Proceedings 23rd Symposium on Principles of Database Systems (PODS), pages 1–12. [65] Group, T. O. (2011). Interoperability requirements. [66] Hallgrímsson, T. (2005). The international internet preservation consortium (iipc). In Proceedings Conference of Directors of National Libraries (CDNL), pages 14–18. [67] Hammersley, B. (2005). Developing feeds with RSS and Atom. O’Reilly Media. [68] Harasymiv, A. (2011). Blogger dynamic views. http://buzz.blogger.com/2011/09/ dynamic-views- seven-new-ways-to-share.htm. August 1, 2015. [69] He, Y., Xin, D., Ganti, V., Rajaraman, S., and Shah, N. (2013). Crawling deep web entity pages. In Proceedings 6th ACM International Conference on Web Search & Data Mining (WSDM), pages 355–364, Rome, Italy. [70] Heydon, A. and Najork, M. (1999). Mercator: A scalable, extensible web crawler. World Wide Web, 2(4):219–229. [71] Hockx-Yu, H. (2011). The past issue of the web. In Proceedings ACM Web Science Conference (WebSci), Koblenz, Germany. [72] Hockx-Yu, H., Crawford, L., Coram, R., and Johnson, S. (2010). Capturing and re- playing streaming media in a web archive-a british library case study. [73] Hundt, R. (2011). Loop recognition in C++/Java/Go/Scala. In Proceedings of Scala Days. [74] IEEE (1998). Ieee recommended practice for software requirements specifications. IEEE Std 830-1998, page i. 152 Bibliography
[75] ISO, I. (2009). 28500: 2009 information and documentation-warc file format. Inter- national Organization for Standardization. [76] Johnson, K. (2008). Are blogs here to stay?: An examination of the longevity and currency of a static list of library and information science weblogs. Serials review, 34(3):199–204. [77] Kalb, H., Kasioumis, N., Llopis, J. G., Postaci, S., and Arango-Docio, S. (2011). Blog- forever: D4.1 user requirements and platform specifications. Technical report, Technische Universität Berlin. [78] Kalb, H. and Trier, M. (2012). The blogosphere as oeuvre: Individual and collective influence on bloggers. In Proceedings European Conference on Information Systems (ECIS). AIS Electronic Library. 2012; Paper 110. [79] Kaplan, A. M. and Haenlein, M. (2011). The early bird catches the news: Nine things you should know about micro-blogging. Business Horizons, 54(2):105–113. [80] Kasioumis, N., Banos, V., and Kalb, H. (2014). Towards building a blog preservation platform. World Wide Web, 17(4):799–825. [81] Kelly, D. (2009). Methods for evaluating interactive information retrieval systems with users, volume 3 of Foundations and Trends in Information Retrieval. Now Publishers. [82] Kim, Y., Ross, S., Stepanyan, K., Pinsent, E., Sleeman, P., Arango-Docio, S., Banos, V., Trochidis, I., Llopis, J. G., and Kalb, H. (2012). Blogforever: D3.1 preservation strategy report. Technical report, University of Glasgow. [83] Klamma, R., Cao, Y., and Spaniol, M. (2007). Watching the blogosphere: Knowledge sharing in the web 2.0. In Proceedings 1st Annual Conference on Weblogs & Social Media (ICWSM). [84] Kohlschütter, C., Fankhauser, P., and Nejdl, W. (2010). Boilerplate detection using shallow text features. In Proceedings 3rd ACM International Conference on Web Search and Data Mining (WSDM), pages 441–450, New York, NY. [85] Lavender, R. G. and Schmidt, D. C. (1996). Active Object An Object Behavioral Pattern for Concurrent Programming, pages 483–499. Addison-Wesley, Boston, MA. [86] Lavoie, B. (2000). Meeting the challenges of digital preservation: The oais reference model. OCLC Newsletter, 243:26–30. [87] Lavoie, B. F. (2004). Implementing metadata in digital preservation systems: the premis activity. D-Lib Magazine, 10(4). [88] Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady., 10(8):707–710. [89] Lindahl, C. and Blount, E. (2003). Weblogs: simplifying web publishing. Computer, 36(11):114–116. [90] Liu, N. C. and Cheng, Y. (2005). The academic ranking of world universities. Higher education in Europe, 30(2):127–136. [91] LiWA (2011). Living web archives project. Bibliography 153
[92] Llopis, J. G., Encinar, R. J., Stepanyan, K., Kim, Y., Haberfield, A., Postacı, S., Gkot- sis, G., Lazaridou, A. C., Kalb, H., Banos, V., et al. (2012). D4.4: Digital repository component design. work package, European Organization for Nuclear Research (CERN). [93] Lorna Campbell, U. o. S. (2007). Learning object metadata, curation reference man- ual. http://www.dcc.ac.uk/resources/curation-reference-manual/completed-chapters/ learning-object-metadata. [Online; accessed 18-April-2013]. [94] Lowry, R. (1998). Concepts and applications of inferential statistics. Lowry, Richard. [95] Manku, G. S., Jain, A., and Das Sarma, A. (2007). Detecting near-duplicates for web crawling. In Proceedings 16th International Conference on World Wide Web (WWW), pages 141–150. [96] Mansfield-Devine, S. (2009). Simple website footprinting. Network Security, 2009(4):7–9. [97] MARC, X. (2003). Official web site. [98] Masanès, J. (2006). Web archiving. Springer. [99] McBride, B. et al. (2004). The resource description framework (rdf) and its vocabulary description language rdfs. Handbook on Ontologies, pages 51–66. [100] McEwen, S. (2004). Requirements: An introduction. [101] Michael Day, D. (2005). Metadata, curation reference manual. http://www.dcc. ac.uk/resources/curation-reference-manual/completed-chapters/metadata. [Online; ac- cessed 18-April-2013]. [102] Mohr, G., Stack, M., Rnitovic, I., Avery, D., and Kimpton, M. (2004). Introduction to heritrix. In Proceedings 4th International Web Archiving Workshop (IWAW), Vienna, Austria. [103] Morrissey, S., Meyer, J., Bhattarai, S., Kurdikar, S., Ling, J., Stoeffler, M., and Thanneeru, U. (2010). Portico: A case study in the use of xml for the long-term preser- vation of digital artifacts. In Proceedings International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML, Montréal, Canada. [104] Murray, K., Ko, L., and Phillips, M. (2011). Curation of the end-of-term web archive. Technical report, University of North Texas. [105] Muslea, I., Minton, S., and Knoblock, C. (1999). A hierarchical approach to wrapper induction. In Proceedings 3rd Annual Conference on Autonomous Agents (AGENTS), pages 190–197, New York, NY. ACM Press. [106] Niu, J. (2012). An overview of web archiving. D-Lib Magazine, 18(3/4). [107] NORM, I. (2003). 14721: 2003 2003 iso 14721: 2003. Space data and information transfer systems-Open archival information system-Reference model, pages 1–156. [108] of New Zealand, N. L. (2006). Web curator tool (wct). [109] Oita, M. and Senellart, P. (2010). Archiving data objects using web feeds. In Pro- ceedings International Workshop on Web Archiving (IWAW). 154 Bibliography
[110] Olston, C. and Najork, M. (2010). Web crawling. Foundations & Trends in Informa- tion Retrieval, 4(3):175–246. [111] Oro, E., Ruffolo, M., and Staab, S. (2010). Sxpath - extending xpath towards spatial querying on web documents. Proceedings VLDB, 4(2):129–140. [112] Pant, G., Srinivasan, P., and Menczer, F. (2004). Crawling the web. In Web Dynamics, pages 153–177. Springer. [113] Parmanto, B. and Zeng, X. (2005). Metric for web accessibility evaluation. Journal of the American Society for Information Science and Technology, 56(13):1394–1404. [114] Paynter, G., Joe, S., Lala, V., and Lee, G. (2008). A year of selective web archiving with the web curator tool at the National Library of New Zealand. D-Lib Magazine, 14(5):2. [115] Pennock, M. and Davis, R. (2009). Archivepress: A really simple solution to archiving blog content. In Proceedings 6th International Conference on Preservation of Digital Objects (iPRES). [116] Pennock, M. and Kelly, B. (2006). Archiving web site resources: a records manage- ment view. In Proceedings 15th International Conference on World Wide Web (WWW), pages 987–988, Edinburgh, UK. [117] Press, N. (2004). Understanding metadata. National Information Standards, 20. [118] project contributors, D. (2014). Debian, the universal operating system. [119] Reis, D. C., Golgher, P. B., Silva, A. S., and Laender, A. F. (2004). Automatic web news extraction using tree edit distance. In Proceedings 13th International Conference on World Wide Web (WWW), pages 502–511, New York, NY. [120] Risse, T. and Peters, W. (2012). Arcomem: from collect-all archives to community memories. In Proceedings (companion) 21st International Conference on World Wide Web (WWW), pages 275–278. ACM. [121] Rogers, A. and Brewer, G. (2014). Microdata usage statistics. http://trends.builtwith. com/docinfo/Microdata. August 1, 2015. [122] RSS Advisory Board (2007). Rss 2.0 specification. [123] Rusbridge, C. (2009). Preservation for scholarly blogs. [124] Rynning, M., Banos, V., Stepanyan, K., Joy, M., and Gulliksen, M. (2011). Blog- forever: D2. 4 weblog spider prototype and associated methodology. Technical report, BlogForever Project. [125] Sanderson, R., Shankar, H., Ainsworth, S., McCown, F., and Adams, S. (2011). Im- plementing time travel for the web. The Code4Lib Journal, 29(13). [126] SCAPE (2014). Scalable preservation environments. [127] Schonfeld, U. and Shivakumar, N. (2009). Sitemaps: above and beyond the crawl of duty. In Proceedings 18th International Conference on World Wide Web (WWW), pages 991–1000, Madrid, Spain. Bibliography 155
[128] Shkapenyuk, V. and Suel, T. (2002). Design and implementation of a high- performance distributed web crawler. In Proceedings 18th International Conference on Data Engineering (ICDE), pages 357–368. [129] Sigurðsson, K. (2010). Managing duplicates across sequential crawls. Technical report, National & University Library of Iceland. [130] Silvia Arango-Docio, P. S. and Kalb, H. (2011). Blogforever: D2.1 survey imple- mentation report. Technical report, University of London Computer Centre. [131] Sobel, J. (2010). State of the blogosphere 2010. Retrieved from Technorati: http://technorati. com/blogging/article/state-of-the-blogosphere-2010-introduction. [132] Spaniol, M., Denev, D., Mazeika, A., Weikum, G., and Senellart, P. (2009). Data quality in web archiving. In Proceedings 3rd Workshop on Information Credibility on the Web (WICOW), pages 19–26, Madrid, Spain. [133] Stepanyan, K., Gkotsis, G., Kalb, H., Kim, Y., Cristea, A., Joy, M., Trier, M., and Ross, S. (2012). Blogs as objects of preservation: Advancing the discussion on signifi- cant properties. In Proceedings 9th International Conference on Preservation of Digital Objects (iPRES), Toronto, Canada. [134] Stepanyan, K., Joy, M., Cristea, A., Kim, Y., Pinsent, E., and Kopidaki, S. (2011). Blogforever: Weblog data model. Technical report, University of Warwick. [135] Sun, Y., Zhuang, Z., and Giles, C. L. (2007). A large-scale study of robots.txt. In Proceedings 16th International Conference on World Wide Web (WWW), pages 1123– 1124, Banf, Canada. [136] Tarjan, R. (1972). Depth-first search and linear graph algorithms. SIAM Journal on Computing, 1(2):146–160. [137] Vangelis Banos, N. B. and Manolopoulos, Y. (2012). Trends in blog preservation. In Proceedings 1th International Conference on Enterprise Information Systems (ICEIS). [138] Various authors, W. (2002). Use h1 for top level heading. http://www-mit.w3.org/ QA/Tips/Use_h1_for_Title. August 1, 2015. [139] Various authors, W. (2014). W3C standards. http://w3.org/standards. August 1, 2015. [140] Voorhees, E. and Harman, D. (2005). TREC: experiment and evaluation in informa- tion retrieval. MIT Press. [141] W3C (2001). W3C HTML validation service. [142] W3Techs (2014). Usage of content management systems for websites. http://w3techs. com/technologies/overview/content_management/all. [Online; accessed 10-November- 2014]. [143] Weibel, S., Kunze, J., Lagoze, C., and Wolf, M. (1998). Dublin core metadata for resource discovery. Internet Engineering Task Force RFC, 2413:222. [144] Weltevrede, E. and Helmond, A. (2012). Where do bloggers blog? platform transi- tions within the historical dutch blogosphere. First Monday, 17(2-6). 156 Bibliography
[145] WHATWG (2014). Microdata - HTML5 draft standard. http://whatwg.org/specs/web-apps/current-work/multipage/microdata.html. August 1, 2015.
[146] Wilson, B. (2008). Metadata analysis and mining application: W3C validator research. http://dev.opera.com/articles/view/mama-w3c-validator-research-2/.
[147] WordPress (2014). Posting activity. http://wordpress.com/stats/posting. August 1, 2015.
[148] Wu, J., Williams, K., Chen, H.-H., Khabsa, M., Caragea, C., Ororbia, A., Jordan, D., and Giles, C. L. (2014). CiteSeerx: AI in a digital library search engine. In Proceedings 26th Annual Conference on Innovative Applications of Artificial Intelligence (IAAI).
[149] Yang, S., Chitturi, K., Wilson, G., Magdy, M., and Fox, E. A. (2012). A study of automation from seed url generation to focused web archive development: the ctrnet context. In Proceedings 12th ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 341–342, Washington, DC.
[150] Younus, A., Whang, K.-Y., Kwon, H.-Y., and Yeo, Y.-M. (2015). A full-coverage two-level URL duplication checking method for a high-speed parallel web crawler. Journal of Information Science & Engineering, 31:839–860.
[151] Zúñiga, V. T. (2009). Blogs as an effective tool to teach and popularize physics: a case study. Latin-American Journal of Physics Education, 3(2):4.

Website Archivability Impact
There is a notable international interest in the concept of Website Archivability and the ArchiveReady application available at http://archiveready.com/.
• The author was invited to present Website Archivability at the Library of Congress, National Digital Information Infrastructure and Preservation Program, Web Archiving Workgroup, 2015/06/03.
• The Deutsches Literaturarchiv Marbach has been using the http://archiveready.com API in its web archiving workflow since early 2014.
• Stanford University Libraries' Web Archiving Resources recommend using the CLEAR method and ArchiveReady: https://library.stanford.edu/projects/web-archiving/archivability/resources
• ArchiveReady was used as an example of innovative web archiving applications in the Columbia University Web Archiving Incentive Program: https://library.columbia.edu/bts/web_resources_collection/proposal_examples.html.
• The author was invited to present Website Archivability and http://archiveready.com at the Web Archive meeting of the University of Innsbruck, 2013.
• The University of South Australia is using ArchiveReady in its Digital Preservation Course (INFS 5082): http://programs.unisa.edu.au/public/pcms/course.aspx?pageid=101875.
• Many academics have contacted us regarding Website Archivability, including people from the University of Newcastle, University of Manchester, Columbia University, Stanford University, the University of Michigan Bentley Historical Library and Old Dominion University.
• More than 120 unique visitors from around the world use ArchiveReady every day.
BlogForever Platform Screenshots
Figure 1: BlogForever Platform Home Page
Figure 2: BlogForever Platform Features

VITA
Vangelis Banos, Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece.
EDUCATION
2015   Ph.D. in Computer Science, Aristotle University of Thessaloniki, Greece.
2006   M.S. in Information and Communication Systems, Aristotle University of Thessaloniki, Greece.
2004   B.S. in Information and Communication Systems Engineering, Aegean University, Greece.
PROFESSIONAL EXPERIENCE
2011 -        Information Systems Engineer at the National Quality Infrastructure System of Greece (NQIS), Hellenic Institute of Metrology (EIM).
2008 -        Independent consulting and software development for clients such as Phaistos Networks S.A., Future Library and Veria Public Library.
2008 - 2011   Software Engineer at Dataways S.A., Greece.
2007 - 2008   Research and Information Officer at the Hellenic Army.
2005 - 2007   Information Systems Engineer at the University of Macedonia Library and Information Centre.
2006 - 2007   Computer Science Instructor at the State Institute of Vocational Training of Triandria, Thessaloniki.
2004 - 2005   Software Engineer at Dataways S.A., Greece.
RESEARCH EXPERIENCE
2015 -        Researcher at the ROUTE-TO-PA Horizon 2020 Project, Ortelio Ltd, UK.
2014 - 2016   Researcher at the LoCloud EC-funded Best Practice Network, Future Library.
2011 - 2015   Researcher at the National Information System for Research and Technology (NISRT) Project, National Documentation Centre of Greece.
2011 - 2013   Project Manager of the BlogForever EC-funded project, Aristotle University.
2009 - 2009   Researcher at the EuropeanaLocal EC-funded project, Future Library.
2006 - 2007   Researcher at PAVET Action 4.3.2 "Development of Industrial Research and Technology", Tero S.A.
2006 - 2008   Junior Researcher at the Intelligent Systems and Knowledge Processing (ISKP), IT Department of the Aristotle University of Thessaloniki, Greece.
2002 - 2004   Junior Researcher at the Intelligent Cooperative Systems (InCoSys) of the Information & Communication Systems Department of the University of the Aegean.
INNOVATION
2014   webgraph-it.com - Website Analysis and Graph Tool.
2013   archiveready.com - Website Archivability Evaluation Tool.
2011   yperdiavgeia.gr - The prevalent Greek Open Government Data Search Engine.
2011   oaipmh.com - OAI-PMH Validator and Data Extractor Tool.
2006   openarchives.gr - The prevalent Greek Digital Libraries Search Engine.
AWARDS
2015   Winner of the LoCloud hackathon, EuropeanaTech Conference 2015, Paris, France.
2014   3rd place in the 1st Greek Open Public Data Hackathon, Ministry of Administrative Reform and E-Government, Greece.
2003   Best Student of the Year at the Information & Communication Systems Department of the University of the Aegean, State Scholarships Foundation of Greece (I.K.Y.).
SCIENTIFIC PUBLICATIONS

Full list available at http://vbanos.gr/publications/