Web Crawling, Analysis and Archiving

Vangelis Banos

Aristotle University of Thessaloniki, Faculty of Sciences, School of Informatics

Doctoral dissertation under the supervision of Professor Yannis Manolopoulos

October 2015

Ανάκτηση, Ανάλυση και Αρχειοθέτηση του Παγκόσμιου Ιστού

Ευάγγελος Μπάνος

Αριστοτέλειο Πανεπιστήμιο Θεσσαλονίκης Σχολή Θετικών Επιστημών Τμήμα Πληροφορικής

Διδακτορική Διατριβή υπό την επίβλεψη του Καθηγητή Ιωάννη Μανωλόπουλου

Οκτώβριος 2015


Web Crawling, Analysis and Archiving, PhD Dissertation. © Copyright by Vangelis Banos, 2015. All rights reserved.

The Doctoral Dissertation was submitted to the School of Informatics, Faculty of Sciences, Aristotle University of Thessaloniki. Defence Date: 30/10/2015.

Examination Committee

Yannis Manolopoulos, Professor, Department of Informatics, Aristotle University of Thessaloniki, Greece. Supervisor

Apostolos Papadopoulos, Assistant Professor, Department of Informatics, Aristotle University of Thessaloniki, Greece. Advisory Committee Member

Dimitrios Katsaros, Assistant Professor, Department of Electrical & Computer Engineering, University of Thessaly, Volos, Greece. Advisory Committee Member

Athena Vakali, Professor, Department of Informatics, Aristotle University of Thessaloniki, Greece.

Anastasios Gounaris, Assistant Professor, Department of Informatics, Aristotle University of Thessaloniki, Greece.

Georgios Evangelidis, Professor, Department of Applied Informatics, University of Macedonia, Greece.

Sarantos Kapidakis, Professor, Department of Archives, Library Science and Museology, Ionian University, Greece.

Abstract

The Web is increasingly important for all aspects of our society, culture and economy. Web archiving is the process of gathering digital materials from the Web, ingesting them, ensuring that these materials are preserved in an archive, and making the collected materials available for future use and research. Web archiving is a difficult problem due to organizational and technical reasons. We focus on the technical aspects of Web archiving.

In this dissertation, we focus on improving the data acquisition aspect of the Web archiving process. We establish the notion of Website Archivability (WA) and we introduce the Credible Live Evaluation of Archive Readiness Plus (CLEAR+) method to measure WA for any website. We propose new algorithms to optimise Web crawling using near-duplicate detection and webgraph cycle detection, also resolving the problem of web spider traps. Furthermore, we suggest that different types of websites demand different Web archiving approaches. We focus on social media and more specifically on weblogs. We introduce weblog archiving as a special type of Web archiving and present our findings and developments in this area: a technical survey of the blogosphere, a scalable approach to harvest modern weblogs, and an integrated approach to preserve weblogs using a digital repository system.

Keywords: Web Archiving, Web Crawling, Website Archivability, Webgraphs, Weblogs, Digital Repositories.

Extended Summary

Web archiving is the process of collecting and storing webpages in order to preserve them in a digital archive that is accessible to the public and to researchers. Web archiving is a matter of the highest priority: on the one hand, the web is a primary medium of modern communication, and on the other, the average lifetime of a webpage is less than 100 days. As a result, millions of webpages cease to operate for various reasons and disappear from the web every day, and valuable information is lost with them. The web archiving problem consists of several sub-processes, such as automated web crawling, content extraction, and the analysis and storage of content in a suitable form so that it can be retrieved and reused for any purpose. Automated web crawling for the purpose of retrieving and processing information is a particularly widespread process with applications in many scientific and business fields. Another important issue is that different types of websites have different characteristics and properties, which require special handling for more efficient data retrieval, processing and archiving. We focus our research on social media, and specifically on weblogs (blogs), a distinctive new medium of communication and information that is widely used.

This doctoral dissertation aims to optimise web archiving through the development of new algorithms for automated web crawling, for extracting information from webpages and for storing them safely and efficiently, so as to facilitate their future access and reuse for any purpose. In addition, the dissertation focuses on the research and development of specialised methods for retrieving, processing, archiving and reusing weblog data. The contributions of the dissertation in the above areas are summarised as follows:

• The Website Archivability metric, which expresses the ease and accuracy with which websites can be captured by web archiving systems; the Credible Live Evaluation for Archive Readiness Plus (CLEAR+) method, which computes Website Archivability; and the ArchiveReady system, which implements them as a web application available at http://archiveready.com. In addition, a study of the archivability of different web content management systems.

• Algorithms that optimise web crawling by detecting identical or near-identical webpages and by using graph modelling, together with a method for detecting the traps encountered by web crawling systems (web spider traps); and the WebGraph-it platform, which implements these algorithms as a web application available at http://webgraph-it.com.

• An extensive study of the technical characteristics of weblogs, with emphasis on the characteristics that affect their archivability.

• The integrated weblog preservation system BlogForever, which addresses problems of weblog data retrieval, management, archiving and reuse.

• A highly efficient method for extracting data from weblogs using machine learning algorithms, and a weblog crawling system that implements it.

In the course of our research, dedicated software packages were created and web applications were implemented which are in production use on the web. The performance of all algorithms and the validity of the results were confirmed through experimental measurements. The results of the dissertation were published in reputable international scientific journals, conferences and books; our publications are listed in detail in Section 1.3. Below we present the main points of the dissertation as organised in each chapter.

Chapter 1: Introduction

In Chapter 1 we first present some general information on web crawling, data extraction and web archiving, the concepts that form the basic context of our research. We then define the objectives of the dissertation and present our contributions per chapter, along with the organisation of the dissertation. In addition, we present the publications that resulted from this work in international scientific journals, conferences and books.

Chapter 2: Background and Literature Review

In Chapter 2 we present the research work carried out in the fields of web archiving, web crawling and social media archiving. We discuss the importance of web archiving and the efforts to ensure a level of quality and reliability in Section 2.1.1. We examine developments in detecting duplicate content in web archives, as well as techniques for eliminating it, which bring benefits at every stage of a web archive's operation (Section 2.1.2). We study efforts to optimise web crawling systems in Section 2.1.3. Finally, we give particular emphasis to work on weblog archiving and to the systems that have been developed for this purpose, as analysed in Section 2.2.

Chapter 3: An Innovative Method to Evaluate Website Archivability

In Chapter 3 we present a new method for modelling the principles and processes of web archiving. We introduce the Website Archivability metric, which expresses the extent to which a website can be archived completely and accurately. We define the Credible Live Evaluation for Archive Readiness Plus (CLEAR+) method, with which the metric can be computed in real time. We describe the architecture of the ArchiveReady system, an implementation of the method in the form of a web application. The method and its applications are particularly significant and are already used by universities, national archives and companies in the field around the world. The users of ArchiveReady are listed in detail in Appendix 7.2.

A key issue in web archiving is the lack of automated control over the content being archived. Websites are often archived incompletely, and the archived copies have problems and cannot be used. The problem lies in the fact that, while web crawling is automated, the quality control of the archived webpages is at best semi-automatic or entirely performed by humans. To solve this problem we create the Credible Live Evaluation for Archive Readiness Plus (CLEAR+) method, which computes the Website Archivability (WA) metric and expresses how amenable a website is to being archived by web archives. This amenability depends on specific technical characteristics of the website, the so-called Website Archivability Facets, which are, in brief:

• F_A: Accessibility: the ability to discover and access all the data of a website. The greater this ability, the better the website can be archived correctly.

• F_C: Cohesion: the degree to which the data of a website is dispersed across one or more web services. High dispersion increases the chances of data loss and incomplete archiving.

• F_M: Metadata: the richness and accuracy of the metadata available for a website, which are important for making better use of it.

• F_S: Standards Compliance: compliance with established coding standards in all the files that make up the website (HTML, CSS, etc.), so that they can be understood, both now and in the future, by any standards-compliant software.

To evaluate a website with the CLEAR+ method, we retrieve all the files of one of its webpages and perform technical checks on them in order to collect numerical results, which we then combine to compute its Website Archivability. The method is described in detail in Section 3.2, and an example evaluation of the Aristotle University homepage is presented in Section 3.2.5. The ArchiveReady system implements the method, as we present in Section 3.3, and is available as a web application at http://archiveready.com. The method was evaluated with three alternative experiments in Section 3.4. Another particularly interesting related study is the evaluation of Website Archivability for twelve web content management systems, presented in Section 3.5. Using a substantial sample, we analyse a large number of websites based on such systems and identify their strengths and weaknesses with respect to their archivability.

Chapter 4: Near-duplicate and Cycle Detection in Webgraphs towards Optimised Web Crawling

In Chapter 4 we present a new approach to modelling the principles and processes of web crawling using webgraphs. We create new models, algorithms and software for optimising web crawling performance, detecting identical or near-identical content and avoiding the "traps" that cause problems for systems crawling the web (web spider traps). The algorithms we propose are based on the following observations:

• The URI is considered the unique identifier of a webpage, but we could also use alternatives such as the Sort-friendly URI Reordering Transform (SURT).

• The similarity of two webpage identifiers does not need to be absolute; two identifiers may be nearly identical, and we can examine what happens at various degrees of similarity.

• The similarity of the content of two webpages does not need to be absolute; two webpages may be nearly identical, and we can examine what happens at various degrees of similarity.

• By modelling a website as a graph, we can use the above observations to merge adjacent nodes and reduce the complexity of the graph. In this way we can also discover cycles, which indicate traps for web crawling software.

Based on these observations, in Section 4.2.2 we present eight web crawling algorithms that combine the above ideas in alternative ways. Next, in Section 4.3 we present the WebGraph-it platform, which implements the proposed algorithms as a web application at http://webgraph-it.com. To evaluate our methods and select the best one, we carry out extensive experiments on a large number of websites, which we crawl with all eight methods in order to assess their results, as described in Section 4.4. Furthermore, with an additional experiment we show that our algorithms can detect web spider traps, whereas traditional crawling methods must be configured manually to avoid the same traps.

Chapter 5: The BlogForever Platform: An Integrated Approach to Preserve Weblogs

In Chapter 5 we present a novel approach to the problem of retrieving, processing, archiving and reusing weblog data. BlogForever is a prototype system for retrieving, processing and archiving weblog data which implements several innovations, such as optimised workflows for weblog retrieval, management and archiving. BlogForever is better suited to archiving weblogs than generic web archiving systems, as confirmed by the five case studies used for its evaluation.

We first present the study we conducted on a substantial sample of weblogs in Section 5.2. We record the distinctive technical characteristics of weblogs, compare them with other types of websites and, based on our conclusions, build their profile. We are thus able to propose better ways to retrieve, process and archive their content. In Section 5.3 we codify the requirements for the BlogForever platform, and we then present the system architecture in Section 5.4. The BlogForever platform consists of two main components: the Blog Spider, which is responsible for extracting data from weblogs, and the Digital Repository, which is responsible for archiving, managing and delivering the data to users. We also present key points of our implementation in Section 5.5. In Section 5.6 we present the evaluation of the system, which was carried out through five different case studies involving a large number of users. We define specific research questions concerning the effectiveness and functionality of the system. We study the users' answers and the technical parameters in order to observe the system's behaviour, and we reach useful conclusions about its operation and about weblog archiving in general.

Chapter 6: A Scalable Approach to Harvest Modern Weblogs

In Chapter 6 we present a new algorithm for more efficient information extraction from weblogs, which uses machine learning models and the special characteristics of weblogs (weblogs' significant properties) to extract information from their semi-structured data with high accuracy and speed. Examining the special characteristics of weblogs, we focus on the following:

• Weblogs provide web feeds: structured data in XML format describing their most recent entries.

• All posts of a weblog use the same HTML structure for their presentation.

Based on the above, we train a system which generates data extraction rules for weblogs using machine learning techniques. Using these rules, we then achieve much better data extraction. In addition, we show how our algorithm can be extended to extract additional weblog data, such as author names and keywords. In Section 6.3 we present the architecture of our system, the way we handle special cases of JavaScript rendering and content extraction, and the scalability of the system. We also place particular emphasis on interoperability and on encoding the extracted data using the MARC21 standard. Finally, we evaluate our method in Section 6.4 by comparing it with widely used web crawling and data extraction systems. We find that it achieves better data extraction accuracy and higher speed than the other solutions.

Chapter 7: Conclusions and Future Work

Finally, in Chapter 7 we present our conclusions and future research directions. Summarising our contributions, we reach some useful conclusions:

• The Website Archivability metric expresses the capacity of a website to be archived, based on its particular technical characteristics.

• Web content management systems have considerable room for improvement with respect to their archivability.

• Using graphs to analyse the structure of websites (webgraph analysis) is an effective way to find identical or near-identical nodes. In this way, we reduce the complexity of webgraphs and make them easier to process.

• By finding identical or near-identical nodes in the graphs used to model websites, we can detect webgraph cycles and discover traps for web crawling systems (web spider traps).

• Easy-to-use web applications such as ArchiveReady and WebGraph-it greatly increase the dissemination and use of new methods.

• By exploiting the particular characteristics of weblogs, we can achieve much better crawling and data extraction methods.

• Separating data into the most fine-grained fields possible when archiving webpages can increase their value and reusability.

In the future, we intend to continue developing the methods presented here, with the aim of optimising them and applying them in industrial settings. We will improve the CLEAR+ method for computing Website Archivability presented in Chapter 3, so as to promote its use by a wider audience, such as software engineers and website administrators. In addition, we will work towards integrating the method into the workflows of web archiving systems. The methods we implemented for analysing websites using graphs in Chapter 4 also have very good prospects; we will continue to develop new variations of the algorithms within the framework we have built, and we intend to offer website analysis services through the WebGraph-it platform at http://webgraph-it.com. The BlogForever platform presented in Chapter 5 can also be extended in several ways, by incorporating semi-automatic content discovery, handling more types of weblogs and scaling up to create large repositories. We will also examine the possibility of archiving microblogs. Furthermore, the weblog data extraction algorithms presented in Chapter 6 could be extended with combinations of other extraction methods in order to create hybrid methods with better characteristics.

Acknowledgements

It takes a lot of work and persistence to successfully finish a Ph.D. Many people helped me in this difficult task and they deserve a special mention.

I would like to thank my supervisor Prof. Yannis Manolopoulos for giving me the opportunity to collaborate with him. Our discussions helped me to proceed and improve significantly in many areas. I am looking forward to learning more from him in the future.

My profound gratitude goes out to my colleagues from the BlogForever project. They did great research work, they were great partners and I owe them a lot. I would especially like to mention Stratos Arampatzis, Ilias Trochidis, Nikos Kasioumis, Jaime Garcia Llopis, Raquel Jimenez Encinar, Tibor Simko, Yunhyong Kim, Seamus Ross, Senan Postaci, Karen Stepanyan, George Gkotsis, Alexandra Cristea, Mike Joy, Hendrik Kalb, Silvia Arango Docio, Patricia Sleeman, Ed Pinsent and Richard Davis.

Thanks to the Hellenic Institute of Metrology management, and especially my director, Dionisios G. Kiriakidis, for supporting me.

Most importantly, none of this would have been possible without the love and patience of my family. Finally, I cannot stress enough how much I admire Evi for her character and attitude. I could not have done this without her.

Vangelis, October 2015

Table of contents

List of figures xix

List of tables xxi

1 Introduction 1
1.1 Key Definitions and Problem Description ...... 1
1.2 Contributions and Document Organisation ...... 2
1.3 Publications ...... 3

2 Background and Literature Review 5
2.1 Web Archiving ...... 5
2.1.1 Web Archiving Quality Assurance ...... 6
2.1.2 Web Content Deduplication ...... 7
2.1.3 Automation ...... 9
2.2 Blog Archiving ...... 10
2.2.1 Blog Archiving Projects ...... 11

3 An Innovative Method to Evaluate Website Archivability 15
3.1 Introduction ...... 15
3.2 Credible Live Evaluation method for Archive Readiness Plus (CLEAR+) ...... 17
3.2.1 Requirements ...... 18
3.2.2 Website Archivability Facets ...... 19
3.2.3 Attributes ...... 26
3.2.4 Evaluations ...... 27
3.2.5 Example ...... 29
3.2.6 The Evolution from CLEAR to CLEAR+ ...... 32
3.3 ArchiveReady: A Website Archivability Evaluation Tool ...... 32
3.3.1 System Architecture ...... 33
3.3.2 Scalability ...... 35
3.3.3 Workflow ...... 36
3.3.4 Interoperability and APIs ...... 37
3.4 Evaluation ...... 38
3.4.1 Methodology and Limits ...... 38
3.4.2 Experimentation with Assorted Datasets ...... 40
3.4.3 Evaluation by Experts ...... 42
3.4.4 WA Variance in the Same Website ...... 44
3.5 Web Content Management Systems Archivability ...... 46
3.5.1 Website Corpus Evaluation Method ...... 46
3.5.2 Evaluation Results and Observations ...... 47
3.5.3 Discussion ...... 53
3.6 Conclusions ...... 55

4 Near-duplicate and Cycle Detection in Webgraphs towards Optimised Web Crawling 57
4.1 Introduction ...... 57
4.2 Method ...... 59
4.2.1 Key Concepts ...... 59
4.2.2 Algorithms ...... 65
4.3 The WebGraph-it System Architecture ...... 69
4.3.1 System ...... 69
4.3.2 Web Crawling Framework ...... 72
4.4 Evaluation ...... 73
4.4.1 Methodology ...... 74
4.4.2 Example ...... 75
4.4.3 Results ...... 76
4.4.4 Optimal DFS Limit for Cycle Detection ...... 78
4.4.5 Web Experiment ...... 79
4.5 Conclusions and Future Work ...... 81

5 The BlogForever Platform: An Integrated Approach to Preserve Weblogs 83
5.1 Introduction ...... 84
5.2 Blogosphere Technical Survey ...... 84
5.2.1 Survey Implementation ...... 85
5.2.2 Results ...... 87
5.2.3 Comparison Between Blogosphere and the Generic Web ...... 100
5.3 User Requirements ...... 102
5.3.1 Preservation Requirements ...... 102
5.3.2 Interoperability Requirements ...... 103
5.3.3 Performance Requirements ...... 104
5.4 System Architecture ...... 105
5.4.1 The BlogForever Software Platform ...... 105
5.4.2 Blog Spider Component ...... 106
5.4.3 Digital Repository Component ...... 111
5.5 Implementation ...... 116
5.6 Evaluation ...... 117
5.6.1 Method ...... 117
5.6.2 Results ...... 119
5.6.3 Evaluation Outcomes ...... 121
5.7 Discussion and Conclusions ...... 122

6 A Scalable Approach to Harvest Modern Weblogs 125
6.1 Introduction ...... 125
6.2 Algorithms ...... 126
6.2.1 Motivation ...... 127
6.2.2 Content Extraction Overview ...... 127
6.2.3 Extraction Rules and String Similarity ...... 128
6.2.4 Time Complexity and Linear Reformulation ...... 130
6.2.5 Variations for Authors, Dates and Comments ...... 132
6.3 Architecture ...... 133
6.3.1 System and Workflow ...... 133
6.3.2 JavaScript Rendering ...... 134
6.3.3 Content Extraction ...... 135
6.3.4 The BlogForever Metadata Schema for Interoperability ...... 136
6.3.5 Distributed Architecture and Scalability ...... 136
6.4 Evaluation ...... 139
6.4.1 Extraction Success Rates ...... 140
6.4.2 Article Extraction Running Times ...... 140
6.5 Discussion and Conclusions ...... 141

7 Conclusions and Future Work 143
7.1 Conclusions ...... 143
7.2 Future Work ...... 144

Bibliography 147

Appendix Website Archivability Impact 157

Appendix BlogForever Platform Screenshots 159

Appendix VITA 161

List of figures

3.1 WA Facets: An Overview ...... 19
3.2 Website attributes evaluated for WA ...... 26
3.3 Evaluating http://auth.gr/ WA using ArchiveReady ...... 31
3.4 The architecture of the archiveready.com system ...... 34
3.5 The home page of the archiveready.com system ...... 35
3.6 WA statistics for assorted datasets box plot ...... 42
3.7 WA distribution for assorted datasets ...... 43
3.8 WA average rating and standard deviation values, as well as the homepage WA for a set of 783 random websites ...... 45
3.9 WA Facets average values and standard deviation for each WCMS ...... 48

4.1 WebGraph-It system architecture ...... 70
4.2 Viewing a webgraph in the http://webgraph-it.com web application ...... 71

5.1 HTTP Status response codes registered during data-collection ...... 86
5.2 Frequency of weblog software platforms ...... 88
5.3 Variation in versions of Wordpress software ...... 89
5.4 Variation in versions of MovableType software ...... 89
5.5 Variation in versions of vBulletin software ...... 90
5.6 Variation in versions of Discuz! software ...... 90
5.7 Encoding of evaluated resources ...... 91
5.8 Break down of the other 6% of character set attributes ...... 91
5.9 Average number of images identified ...... 92
5.10 Average use of BMP, SVG, TIFF, WBMP and WEBP formats ...... 93
5.11 Distribution of images for pages with less than 20 images only ...... 93
5.12 Summary of metadata usage ...... 94
5.13 Histogram of Open Graph references ...... 94
5.14 Use of XML feeds by type ...... 96
5.15 Number of JavaScript instances identified ...... 97
5.16 Number of identified JavaScript library/framework instances ...... 97
5.17 Frequency of embedded YouTube videos ...... 98
5.18 Flash use on the web (left) and on blogs (right) ...... 100
5.19 JavaScript frameworks use on the web (left) and on blogs (right) ...... 100
5.20 Image formats use on the web (left) and on blogs (right) ...... 101
5.21 HTTP status responses on the web (left) and on blogs (right) ...... 101
5.22 A general overview of the BlogForever platform, featuring the blog spider and the blog repository ...... 105
5.23 Core entities of the BlogForever data model [134] ...... 107
5.24 BlogForever conceptual data model [134] ...... 108
5.25 The outline of the blog spider component design ...... 109
5.26 High level outline of a scalable set up of the blog spider component ...... 111
5.27 The outline of the blog repository component design ...... 113
5.28 BlogForever Evaluation Timeline [7] ...... 118

6.1 Overview of the crawler architecture. (Credit: Pablo Hoffman, Daniel Graña, Scrapy) ...... 134
6.2 Article extraction running time ...... 141

1 BlogForever Platform Home Page ...... 159
2 BlogForever Platform Features ...... 160

List of tables

2.1 Overview of related initiatives and projects ...... 11

3.1 F_A: Accessibility Evaluations ...... 21
3.2 F_S: Standards Compliance Facet Evaluations ...... 23
3.3 F_C: Cohesion Facet Evaluations ...... 24
3.4 F_M: Metadata Facet Evaluations ...... 25
3.5 WA Facet Weights ...... 28
3.6 F_A evaluation of http://auth.gr/ ...... 29
3.7 F_S evaluation of http://auth.gr/ ...... 30
3.8 F_C evaluation of http://auth.gr/ ...... 30
3.9 F_M evaluation of http://auth.gr/ ...... 30
3.10 Description of assorted datasets ...... 41
3.11 Comparison of WA statistics for assorted datasets ...... 42
3.12 Correlation between WA, WA Facets and Experts rating ...... 44
3.13 A_1: The percentage of valid . Higher is better ...... 49
3.14 A_2: The number of inline scripts per WCMS instance. Lower is better ...... 49
3.15 A_3: Sitemap.xml is present. Higher is better ...... 50
3.16 C_1: The percentage of local versus remote images. Higher is better ...... 50
3.17 C_1: The percentage of local versus remote CSS. Higher is better ...... 51
3.18 S_1: HTML errors per instance. Lower is better ...... 51
3.19 S_2: The lack of use of proprietary files (Flash, QuickTime). Higher is better ...... 52
3.20 A_5: Valid Feeds. Higher is better ...... 52
3.21 M_1: HTTP Content-Type header. Higher is better ...... 53
3.22 M_2: HTTP caching headers. Higher is better ...... 53

4.1 Potential webgraph node similarity metrics ...... 64
4.2 Web crawling algorithms summary ...... 69
4.3 Variables used in the evaluation, i = 1-8 ...... 74
4.4 Results from all methods for a single website, http://deixto.com ...... 76
4.5 W_i: Captured webpages difference between all webcrawls and base crawl. Lower is better ...... 76
4.6 CO_i: Completeness of each web crawling method. Higher is better ...... 77
4.7 D_i: Duration difference between all webcrawls and base crawl. Lower is better ...... 77
4.8 L_i: Captured links difference between all webcrawls and base crawl. Lower is better ...... 77
4.9 Number of cycles for each distance limit ...... 79
4.10 Web spider trap crawling results ...... 80

5.1 Datasets ...... 85
5.2 File MIME types ordered by descending frequency of occurrence ...... 99
5.3 Overview of user requirements ...... 103
5.4 BlogForever Evaluation Metrics [13] ...... 119
5.5 BlogForever Evaluation Themes [13] ...... 120
5.6 BlogForever Case Studies for User Testing [14] ...... 120
5.7 External and Internal Scores Summary [13] ...... 121

6.1 Examples of string similarities ...... 129
6.2 TechCrunch blog example ...... 131
6.3 Blog post excerpt and full text similarity using different N values ...... 131
6.4 Blog record attributes - MARC 21 representations mapping ...... 137
6.5 Blog record attributes - MARC 21 representations mapping ...... 138
6.6 Comment record attributes - MARC tags mapping ...... 138
6.7 Extraction success rates for different algorithms ...... 140

Chapter 1

Introduction

Here, we present the main context of our research, including key definitions and the problem description. In the sequel, we outline our key contributions and the overall document organisation. Finally, we list our scientific publications. Since no research happens in isolation, I use the authorial "we" throughout the text.

1.1 Key Definitions and Problem Description

The World Wide Web (WWW) is increasingly important for all aspects of our society, culture and economy. It has become known simply as the web. The number of indexed webpages is estimated to be 4.8 billion in 2015, according to major search engines1. The importance of the web suggests a need for preservation, at least for selected websites [149]. Web archiving is the process of gathering up digital materials from the web, ingesting them, ensuring that these materials are preserved in an archive, and making the collected materials available for future use and research [106]. Web archiving is crucial to ensure that our digital materials remain accessible over time.

Web archiving is a difficult problem for many reasons, organisational and technical. The organisational aspect of web archiving has all the inherent issues of any digital preservation activity. It involves the entity that is responsible for the process, its governance, funding, long-term viability and the personnel responsible for the web archiving tasks [116]. The technical aspect of web archiving involves the procedures of web content identification, acquisition, ingest, organization, access and use [44, 132]. One of the main technical challenges of web archiving, in comparison to other digital archiving activities, is the process of data acquisition. Websites are becoming increasingly complex and versatile, posing challenges for web data extraction systems, also known as web crawlers, to retrieve their content with accuracy and reliability [69]. The process of web crawling is inherently complex and there are no standard methods to access the complete website data. As a result, research has shown that web archives are missing significant portions of archived websites [29].

1 http://www.worldwidewebsize.com/, accessed August 1, 2015

A key aspect of the web archiving process is optimal data extraction from target websites. This procedure is difficult for reasons such as website complexity, the plethora of underlying technologies and, ultimately, the open-ended nature of the web. Essentially, a web crawler starts from a seed webpage and then uses the hyperlinks within it to visit other webpages. This process repeats with every new webpage until some conditions are met (e.g. a maximum number of webpages is visited or no new hyperlinks are detected). Despite the simplicity of the basic algorithm, web crawling has many challenges [12]. For instance, a lot of duplicate or near-duplicate data is captured during web crawling [95]. Also, web spiders are often disrupted or waste excessive computing resources in "spider traps" [110].

Specific web domains have different characteristics and special properties, which require different web crawling, analysis and archiving approaches. We focus on the Blogosphere, the collective outcome of all weblogs, their content, interconnections and influences, which constitutes an active part of the Social Media, an established channel of online communication with great significance [78]. Weblogs are used from teaching physics in Latin America [151] to facilitating fashion discussions by young people in France [36]. Weblogs are also known as blogs. They are specific types of websites, regularly updated and intended for general public consumption. Their structure is defined as a series of pages in reverse chronological order. Wordpress, a single blog publishing company, reports more than 1 million new posts and 1.5 million new comments each day [147]. These overwhelming numbers illustrate the importance of blogs in most aspects of private and business life [37]. Blogs contain data with historic, political, social and scientific value, which need to be accessible for current and future generations. For instance, blogs proved to be an important resource during the 2011 Egyptian revolution by playing an instrumental role in the organization and implementation of protests [51]. The problem is that blogs disappear every day [76], because there is no standard method or authority to ensure blog archiving and long-term digital preservation.

In this thesis, we focus on improving web crawling, aggregated data analysis, management and archiving. We look into the way web crawlers visit webpages and extract content. We try to improve the way web archives select and ingest websites. We also focus on weblogs and aim to devise a new approach for weblog data extraction, management, preservation and reuse. In the following section, we present our contributions and the overall structure of the thesis.
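To ground the discussion in the chapters that follow, the sketch below, a minimal and deliberately naive Python example using only the standard library, illustrates the seed-and-frontier crawl loop described above: it starts from a seed URL, follows discovered hyperlinks breadth-first and stops when a page budget is exhausted. The function and parameter names are illustrative only; production crawlers such as Heritrix add politeness delays, robots.txt handling, parallelism and persistent state.

    from collections import deque
    from urllib.parse import urljoin, urldefrag
    from urllib.request import urlopen
    import re

    def crawl(seed, max_pages=100):
        """Fetch pages breadth-first from a seed URL until the page budget runs out."""
        frontier, seen, fetched = deque([seed]), {seed}, 0
        while frontier and fetched < max_pages:
            url = frontier.popleft()
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue                      # unreachable pages are simply skipped
            fetched += 1
            # Extremely naive link extraction; real crawlers parse the DOM instead.
            for href in re.findall(r'href=["\'](.*?)["\']', html):
                link, _ = urldefrag(urljoin(url, href))   # absolute URL, fragment removed
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return seen

The stop conditions (page budget, empty frontier) correspond directly to the conditions mentioned in the paragraph above.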

1.2 Contributions and Document Organisation

Our work focuses on web crawling, analysis and archiving methods. We introduce new metrics to assess how readily websites can be archived. We propose new algorithms to improve web crawling efficiency and performance. We also propose new ways to deal with weblog archiving, including new algorithms focused on weblog data extraction and weblog archiving. For each proposed method we conduct extensive experimental evaluation using real-world data. We also implement software systems as reference implementations; some of our systems are publicly available on the web and/or as Open Source Software. We present our main contributions and their organisation in this document:

Chapter 2: We outline related work and literature in the field of web crawling and archiving. We focus on the state of the art in web content deduplication and spider trap detection, as well as on methods for website evaluation and web archiving Quality Assurance (QA). We also review the state of weblog archiving as a special case of web archiving.

Chapter 3: We introduce the concept of Website Archivability (WA), a metric that quantifies whether a website has the potential to be archived with correctness and accuracy. We define the Credible Live Evaluation of Archive Readiness Plus (CLEAR+) method to calculate WA. We present ArchiveReady, a web platform which implements the proposed methods and is available at archiveready.com. We also evaluate the Website Archivability of the most prevalent Web Content Management Systems (WCMS) and come up with specific recommendations for their developers.

Chapter 4: We propose a set of methods to detect duplicate and near-duplicate webpages in real time during web crawling, as well as a set of methods to detect web spider traps using webgraphs, also in real time. We present WebGraph-It, a web platform which implements the proposed methods and is available at webgraph-it.com.

Chapter 5: We present the BlogForever platform, a new approach to aggregate, manage, preserve and reuse weblog content. First, we explore patterns in weblog structure and data to outline weblogs' main technical characteristics and differences from the generic web. Then, we describe the policies, workflows, methods and systems we developed in the context of BlogForever. We also present extensive evaluation test results.

Chapter 6: We present a new approach to harvest weblogs. Our algorithm is simple yet robust and scalable, and generates weblog content extraction rules with high accuracy and performance. We also present our system, which is evaluated using extensive test procedures.

Chapter 7: We summarise the results of our work and present conclusions and potential future work directions.

1.3 Publications

The research presented in this Ph.D. dissertation was published in 4 peer-reviewed journals, 7 international conference proceedings and 1 book chapter. Publications in scientific journals:

1. Banos V., Manolopoulos Y.: "Near-duplicate and Cycle Detection in Webgraphs towards Optimised Web Crawling", ACM Transactions on the Web Journal, submitted, 2015.
2. Banos V., Manolopoulos Y.: "A Quantitative Approach to Evaluate Website Archivability Using the CLEAR+ Method", International Journal on Digital Libraries, 2015.
3. Banos V., Blanvillain O., Kasioumis N., Manolopoulos Y.: "A Scalable Approach to Harvest Modern Weblogs", International Journal of AI Tools, Vol.24, N.2, 2015.
4. Kasioumis N., Banos V., Kalb H.: "Towards Building a Blog Preservation Platform", World Wide Web Journal, Special Issue on Social Media Preservation and Applications, Springer, 2013.

Publications in international conference proceedings:

1. Banos V., Manolopoulos Y.: "Web Content Management Systems Archivability", Proceedings 19th East-European Conference on Advances in Databases & Information Systems (ADBIS), Springer Verlag, LNCS Vol.9282, Poitiers, France, 2015.
2. Blanvillain O., Banos V., Kasioumis N.: "BlogForever Crawler: Techniques and Algorithms to Harvest Modern Weblogs", Proceedings 4th International Conference on Web Intelligence, Mining & Semantics (WIMS), ACM Press, Thessaloniki, Greece, 2014.
3. Banos V., Kim Y., Ross S., Manolopoulos Y.: "CLEAR: a Credible Method to Evaluate Website Archivability", Proceedings 10th International Conference on Preservation of Digital Objects (iPRES), Lisbon, Portugal, 2013.
4. Kalb H., Lazaridou P., Banos V., Kasioumis N., Trier M.: "BlogForever: From Web Archiving to Blog Archiving", Proceedings 'Informatik Angepasst an Mensch, Organisation und Umwelt' (INFORMATIK), Koblenz, Germany, 2013.
5. Stepanyan K., Gkotsis G., Banos V., Cristea A., Joy M.: "A Hybrid Approach for Spotting, Disambiguating and Annotating Places in User-Generated Text", Proceedings 22nd International Conference on World Wide Web (WWW), Rio de Janeiro, Brazil, 2013.
6. Banos V., Baltas N., Manolopoulos Y.: "Trends in Blog Preservation", Proceedings 14th International Conference on Enterprise Information Systems (ICEIS), Vol.1, pp.13-22, Wroclaw, Poland, 2012.
7. Banos V., Stepanyan K., Manolopoulos Y., Joy M., Cristea A.: "Technological Foundations of the Current Blogosphere", Proceedings 2nd International Conference on Web Intelligence, Mining & Semantics (WIMS), ACM Press, Craiova, Romania, 2012.

Book chapters:

1. Banos V., Baltas N., Manolopoulos Y.: "Blog Preservation: Current Challenges and a New Paradigm", chapter 3 in book Enterprise Information Systems XIII, by Cordeiro J., Maciaszek L. and Filipe J. (eds.), Springer LNBIP Vol.141, pp.29–51, 2013.

Chapter 2

Background and Literature Review

This chapter reviews the current state of Web crawling, Web archiving and Social Media preservation. More specifically, we look into specific work in the areas of web archiving Quality Assurance (QA), web content deduplication and web crawler automation, which are highly relevant to our work. We also review related work in weblog archiving.

2.1 Web Archiving

Web archiving is an important aspect of the preservation of cultural heritage [98]. Web preservation is defined as 'the capturing, management and preservation of websites and Web resources'. Web preservation must be a start-to-finish activity, and it should encompass the entire lifecycle of the Web resource [9]. The most notable web archiving initiative is the Internet Archive1, which has been operating since 1996. In addition, a large variety of projects from national and international organizations are working on Web preservation related activities. For instance, many national memory institutions such as national libraries understood the value of Web preservation and developed special activities towards this goal [137]. All active national web archiving efforts, as well as some academic Web archives, are members of the International Internet Preservation Consortium (IIPC).

Web archiving is a complex task that requires a lot of resources. Therefore, it is always a selective process and only parts of the existing web are archived [61, 4]. Contrary to traditional media like printed books, webpages can be highly dynamic. As a result, the selection of archived information comprises not only the decision of what to archive (e.g. topic or regional focus) but also the setting of additional parameters such as the archiving frequency per page, and other parameters related to the page request (e.g. browser, user account, language etc.) [98]. Thereby, the selection seems often to be driven by human publicity and [4]. The archiving of large parts of the web is a highly automated process, and the archiving frequency of a webpage is normally determined by a schedule for harvesting the page. Thus, the life of a website is not recorded appropriately if the page is updated more often than it is crawled [71].

1 http://archive.org/, accessed: August 1, 2015

In the following sections, we look into the current state of web archiving quality assurance, web content deduplication and web crawler automation: key aspects of the web archiving data ingestion workflow.

2.1.1 Web Archiving Quality Assurance

The web archiving workflow includes identification, appraisal and selection, acquisition, ingest, organization and storage, description and access [106]. The present section focuses explicitly on the acquisition of web content and the way it is handled by web archiving projects and initiatives.

Web content acquisition is one of the most delicate aspects of the web archiving workflow because it depends heavily on external systems: the target websites, web servers, application servers, proxies and network infrastructure. The number of independent and dependent elements gives harvesting a substantial risk load. Web content acquisition for web archiving is performed using robots, also known as "spiders", "crawlers" or "bots": self-acting agents that navigate around-the-clock through the hyperlinks of the web, harvesting topical resources without human supervision [112]. The most popular web harvester, Heritrix, is an open source, extensible, scalable, archival-quality web crawler [102] developed by the Internet Archive in partnership with a number of libraries and web archives from across the world. Heritrix is currently the main web harvesting application used by the International Internet Preservation Consortium (IIPC) as well as numerous web archiving projects. Heritrix is being continuously developed and extended to improve its capacities for intelligent and adaptive crawling [52] or capture [72]. The Heritrix crawler was originally established for crawling general webpages that do not include substantial dynamic or complex content. In response, other crawlers have been developed which aim to address some of Heritrix's shortcomings. For instance, BlogForever [15] utilises blog-specific technologies to preserve blogs. Also, the ArchivePress project is based explicitly on XML feeds produced by blog platforms to detect web content [115].

As websites become more sophisticated and complex, the difficulties that web bots face in harvesting them increase. For instance, some web bots have limited abilities to process dynamic web content or streaming media [106]. To overcome these obstacles, standards have been developed to make websites more amenable to harvesting by web bots. Two examples are the Sitemap.xml and Robots.txt protocols. The Sitemap.xml protocol, 'Simple Website Footprinting', is a way to build a detailed picture of the structure and link architecture of a website [96, 127]. Implementation of the Robots.txt protocol provides web bots with information about specific elements of a website and their access permissions [135]. Such protocols are not used universally.

Web content acquisition for archiving is only considered complete once the quality of the harvested material has been established. The entire web archiving workflow is often handled using special software, such as the open source Web Curator Tool (WCT) [108], developed as a collaborative effort by the National Library of New Zealand and the British Library, at the instigation of the IIPC. WCT supports such web archiving processes as permissions, job scheduling, harvesting, quality review, and the collection of descriptive metadata. Focusing on quality review, when a harvest is complete, the harvest result is saved in the digital asset store and the Target Instance is saved in the Harvested state2. The next step is for the Target Instance Owner to Quality Review the harvest. WCT operators perform this task manually.
Moreover, according to the web archiving process followed by the National Library of New Zealand, after performing the harvests, the operators review and endorse or reject the harvested material; accepted material is then deposited in the repository [114]. A report from the Web-At-Risk project provides confirmation of this process: operators must review the content thoroughly to determine if it can be harvested at all [59].

Efforts to deploy crowdsourced techniques to manage QA provide an indication of how significant the QA bottleneck is. The use of these approaches is not new; they were previously deployed by digitisation projects. The QA process followed by most web archives is time-consuming and potentially complicated, depending on the volume of the site, the type of content hosted, and the technical structure. However, to quote the IIPC, "it is conceivable that crowdsourcing could support targeted elements of the QA process. The comparative aspect of QA lends itself well to 'quick wins' for participants"3. The IIPC also organized a Crowdsourcing Workshop at its 2012 General Assembly to explore how to involve users in developing and curating web archives. QA was indicated as one of the key tasks to be assigned to users: "The process of examining the characteristics of the websites captured by web crawling software, which is largely manual in practice, before making a decision as to whether a website has been successfully captured to become a valid archival copy"4.

The previous literature shows that there is agreement within the web archiving community that web content aggregation is challenging. QA is an essential stage in the web archiving workflow, but currently the process requires human intervention, and research into automating QA is in its infancy. The solution used by web archiving initiatives such as Archive-it5 is to perform test crawls prior to archiving6, but these suffer from at least two shortcomings: a) the test crawls require human intervention to evaluate the results, and b) they do not fully address such challenges as deep-level metadata usage and media file format validation.
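As a concrete illustration of the Sitemap.xml and Robots.txt protocols discussed above, the following sketch, assuming Python 3.8+ and using only the standard library, checks whether a site publishes a robots.txt file, whether that file advertises any sitemaps, and whether a given bot may fetch the homepage. The function name crawl_friendliness and the user-agent string ExampleArchiveBot/1.0 are illustrative assumptions, not part of any tool described in this thesis.

    from urllib import robotparser
    from urllib.request import Request, urlopen

    def crawl_friendliness(site):
        """Report whether a site exposes Robots.txt and Sitemap hints to web bots."""
        rp = robotparser.RobotFileParser()
        rp.set_url(site.rstrip("/") + "/robots.txt")
        try:
            rp.read()                                   # fetch and parse robots.txt
        except OSError:
            return {"robots_txt": False, "sitemaps": [], "homepage_allowed": None}
        sitemaps = rp.site_maps() or []                 # "Sitemap:" lines (Python 3.8+)
        if not sitemaps:
            # Fall back to probing the conventional location with a HEAD request.
            try:
                with urlopen(Request(site.rstrip("/") + "/sitemap.xml",
                                     method="HEAD"), timeout=10) as resp:
                    if resp.status == 200:
                        sitemaps = [site.rstrip("/") + "/sitemap.xml"]
            except OSError:
                pass
        return {"robots_txt": True,
                "sitemaps": sitemaps,
                "homepage_allowed": rp.can_fetch("ExampleArchiveBot/1.0", site)}

Checks of this kind resemble, in spirit, the automated accessibility evaluations developed in Chapter 3.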

2.1.2 Web Content Deduplication

There are many efforts to organise crawled web content effectively, including ranking and duplicate detection. Interestingly, most of these works focus on already archived content, whereas little work has been done to improve duplicate or near-duplicate detection during web crawling. The problem of web spider traps is also relevant, as they generate infinite amounts of duplicate web content. To the best of our knowledge, it has not been sufficiently addressed in an automated way.

2 http://webcurator.sourceforge.net/docs/1.5.2/Web%20Curator%20Tool%20User%20Manual%20(WCT%201.5.2)., accessed: August 1, 2015
3 http://www.netpreserve.org/sites/default/files/.../CompleteCrowdsourcing.pdf, accessed: August 1, 2015
4 http://netpreserve.org/sites/default/files/attachments/CrowdsourcingWebArchiving_WorkshopReport.pdf, accessed: August 1, 2015
5 http://www.archive-it.org/, accessed: August 1, 2015
6 https://webarchive.jira.com/wiki/display/ARIH/Test+Crawls, accessed: August 1, 2015

Duplicate web content is a major issue for systems that perform web data extraction. The nature of the web promotes the creation of duplicate content, either intentionally or unintentionally. Web crawler systems have tried to address the issue of duplicate content based on the URL, the webpage content, or both. This problem has also been defined as "duplicate URL with similar text" (DUST) [18], and various algorithms have been created to mine server and crawler logs, sample webpages and infer rules to detect duplicate webpages [3, 43]. It is evident that these methods cannot be applied at web scale, as it is not practical to access the required information for every website. An approach to apply URL deduplication in web-scale crawlers suggests the use of two-level URL duplication checking, both at the website and the webpage level [150]. The Mercator web crawler implements a 'content seen' test, which computes a 64-bit checksum of the contents of each downloaded document and stores it in tables. The checksum of each newly downloaded document is looked up in the tables to detect duplicates [70].

Near-duplicate detection is also very important, as the web is abundant with near-duplicate documents. Differences between these documents may be trivial, such as banner ads or timestamps. Manku et al. have presented an efficient method to detect near-duplicates in large web document archives [95] based on Charikar's similarity estimation techniques [35]. The CiteSeerX search engine employs both duplicate and near-duplicate document detection. During web crawling, it generates SHA-1 hashes of documents and checks them against existing ones in its database. Duplicates are discarded immediately. Near-duplicates are detected after ingestion via clustering. Document attributes such as title and author names are normalised and used as keys [148]. Website duplicate detection is also an active topic. One approach is to use the websites' structure and the content of their pages to identify possible replicas [42]. A more thorough approach to detect web mirrors depends mostly on the syntactic analysis of URL strings, and requires retrieval and content analysis only for a small number of pages. It is able to detect both partial and total mirroring, and handles cases where the content is not byte-wise identical [20].

Web archives extract and preserve massive amounts of web content and are very interested in identifying and removing duplicate content. Important work has been done to detect duplicate web resources referenced by the same URL in web archives and to create relevant systems [62]. Another important contribution is the DeDuplicator plug-in for Heritrix, which detects and avoids the storage of duplicate content [129]. This system was also used in a billion-scale searchable web archive with good results [60]. Web archive content deduplication is so significant that in 2015 the International Internet Preservation Consortium (IIPC) adopted a proposal to extend the Web ARChive (WARC) archive format [75] to standardise the recording of arbitrary duplicates in WARC files7. The issue with all the presented web archive content deduplication cases is that the web content has already been extracted, processed and stored in the web archive before it is identified as a duplicate and deleted; thus, considerable computing resources are wasted. Moreover, near-duplicates are not considered at all in web archive content deduplication research.
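To make the near-duplicate idea concrete, the sketch below computes a Charikar-style simhash fingerprint over word shingles and treats two pages as near-duplicates when their fingerprints differ in only a few bits, which is the essence of the technique of Manku et al. [95]. The shingle size, the use of MD5 as the underlying hash and the threshold k are illustrative choices, not values prescribed by that work.

    import hashlib

    def simhash(text, nbits=64):
        """Charikar-style fingerprint computed over word 3-shingles."""
        words = text.lower().split()
        shingles = [" ".join(words[i:i + 3]) for i in range(max(1, len(words) - 2))]
        v = [0] * nbits
        for s in shingles:
            h = int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")
            for b in range(nbits):
                v[b] += 1 if (h >> b) & 1 else -1
        # Each fingerprint bit records the sign of the accumulated vote.
        return sum(1 << b for b in range(nbits) if v[b] > 0)

    def near_duplicates(fp_a, fp_b, k=3):
        """Pages whose fingerprints differ in at most k bits are near-duplicates."""
        return bin(fp_a ^ fp_b).count("1") <= k

Unlike exact checksums of the Mercator 'content seen' kind, such fingerprints tolerate trivial differences such as banner ads or timestamps.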
One of the main principles of the web is that each piece of content has its own URI. Good URIs are the topic of much discussion [60]. The Portuguese web archive provides instructions to website owners to use one URI for each piece of content8. The same guideline is given by the W3C9. Search Engine Optimisation (SEO) guidelines also advise website administrators to have one URI for each web resource. These guidelines help web crawling systems that depend on URLs to retrieve each document only once.

7 https://iipc.github.io/warc-specifications/specifications/warc-deduplication/recording-arbitrary-duplicates., accessed: August 1, 2015
8 http://sobre.arquivo.pt/how-to-participate/recommendations-for-web-authors-to-enable-web/one-link-to-the-address-of-each-content, accessed: August 1, 2015
9 http://www.w3.org/Provider/Style/Bookmarkable.html, accessed: August 1, 2015
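In practice, the one-URI-per-resource principle is frequently violated, which is exactly the DUST problem mentioned above. A crawler can reduce such duplicates by normalising URLs before adding them to its frontier, as in the minimal sketch below; the set of session parameters to strip is an illustrative assumption and not a complete rule set of the kind mined from logs in [18, 3, 43].

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    SESSION_PARAMS = {"sessionid", "phpsessid", "sid", "utm_source"}   # illustrative list

    def canonical(url):
        """Collapse trivially different URLs of the same resource to one key."""
        p = urlsplit(url)
        host = p.hostname or ""                    # hostname is already lowercased
        if p.port and p.port not in (80, 443):     # keep only non-default ports
            host = "%s:%d" % (host, p.port)
        query = urlencode([(k, v) for k, v in sorted(parse_qsl(p.query))
                           if k.lower() not in SESSION_PARAMS])
        return urlunsplit((p.scheme.lower(), host, p.path or "/", query, ""))

Canonicalisations of this kind complement, rather than replace, the content-based detection discussed above, since two distinct URLs may still serve identical or near-identical pages.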

2.1.3 Web Crawler Automation

Web crawlers are complex software systems, which often combine techniques from various disciplines in computer science. Our work on the BlogForever crawler is related to the fields of web data extraction, distributed computing and natural language processing. In the liter- ature on web data extraction, the word wrapper is commonly used to designate procedures to extract structured data from unstructured documents. We did not use this word in the present work in favor of the term extraction rule, which better reflects our implementation and is decoupled from software that concretely performs the extraction. A common approach in web data extraction is to manually build wrappers for the targeted websites. This approach has been proposed for the crawler discussed in [52], which auto- matically assigns web sites to predefined categories and gets the appropriate wrapper from a static knowledge base. The limiting factor in this type of approach is the substantial amount of manual work needed to write and maintain the wrappers, which is not compatible with the increasing size and diversity of the web. Several projects try to simplify this process and provide various degrees of automation. This is the case of the Stalker algorithm [105] which generates wrappers based on user-labelled training examples. Some commercial solutions such as the Lixto project [64] simplify the task of building wrappers by offering a complete integrated development environment, where the training data set is obtained via a graphical user interface. As an alternative to dedicated software for the creation and maintenance of wrappers, some query languages have been designed specifically for wrappers. These languages rely on their users to manually identify the structure of the data to be extracted. This structure can then be formalised as a small declarative program, which can then be turned into a concrete wrapper by an execution engine. The OXPath language [57] is an interesting extension to XPath designed to incorporate interaction in the extraction process. It supports simulated user actions such as filling forms or clicking buttons to obtain information that would not be accessible otherwise. Another extension of XPath, called Spatial XPath [111], allows to write special rules in the extraction queries. The execution engine embeds a complete web browser which computes the visual representation of the page. Fully automated solutions use different techniques to identify and extract information directly from the structure and content of the web page, without the need of any manual intervention. The Boilerpipe project [84] - which is also used in our evaluation in Chapter 6 - uses text density analysis to extract the main article of a webpage. The approach presented in [119] is based on a tree structure analysis of pages with similar templates, such as news web sites or blogs. Automatic solutions have also been designed specifically for blogs. Similarly to our approach, Oita and Senellart [109] describe a procedure to automatically build wrappers by matching articles to HTML pages. This work was further extended by Gkotsis et al.

8http://sobre.arquivo.pt/how-to-participate/recommendations-for-web-authors-to-enable-web/one-link-to-the-address-of-each-content, accessed: August 1, 2015
9http://www.w3.org/Provider/Style/Bookmarkable.html, accessed: August 1, 2015

with a focus on extracting content anterior to the one indexed in web feeds [58]. They also report having successfully extracted blog post titles, publication dates and authors, but their approach is less generic than the one used for the extraction of articles. Finally, neither [109] nor [58] provide a complexity analysis, which we believe to be essential before using an algorithm in production.

One interesting research direction is that of large-scale distributed crawlers. Mercator [70], UbiCrawler [23] and the crawler discussed in [128] are examples of successful distributed crawlers. The associated articles provide useful information regarding the challenges encountered when working on a distributed architecture. One of the core issues when scaling out seems to be sharing the list of URLs that have already been visited and those that need to be visited next. While [70] and [128] rely on a central node to hold this information, [23] uses a fully distributed architecture where URLs are divided among nodes using consistent hashing. Both of these approaches require the crawlers to implement complex mechanisms to achieve fault tolerance. Regarding our research, the BlogForever Crawler does not have to address this issue as it is already handled by the BlogForever back-end system, which is responsible for task and state management (Section 5). In addition, since we process webpages on the fly and directly emit the extracted content to the back-end, there is no need for persistent storage on the crawler's side. This removes one layer of complexity when compared to general crawlers, which need to use a distributed file system ([128] uses NFS, [ber] uses HDFS) or implement an aggregation mechanism to further exploit the collected data. In Section 5.4, we present our design, which is similar to the distributed active object pattern presented in [85]. It is also further simplified by the fact that the state of the crawler instances is not kept between crawls.
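To make the URL-partitioning idea concrete, the following is a minimal sketch of how a fully distributed crawler could assign URLs to nodes with consistent hashing, in the spirit of the approach attributed to [23]. The node names, the hash function and the number of virtual points per node are illustrative assumptions, not details taken from the cited systems.

```python
# Minimal sketch of partitioning the URL frontier among crawler nodes with
# consistent hashing. Node names and the number of virtual points per node
# are illustrative assumptions.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, points_per_node=64):
        self._ring = []                      # sorted list of (hash, node)
        for node in nodes:
            for i in range(points_per_node):
                self._ring.append((self._hash(f"{node}:{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16)

    def node_for(self, url):
        """Return the crawler node responsible for the given URL."""
        idx = bisect.bisect(self._keys, self._hash(url)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["crawler-1", "crawler-2", "crawler-3"])
print(ring.node_for("http://example.com/blog/post-1"))
```

Adding or removing a node only remaps the URLs that hashed to its segment of the ring, which is what makes the scheme attractive for crawler clusters whose membership changes over time.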

2.2 Blog Archiving

Blog archiving is a prominent subcategory of web archiving due to the significance of blogs in every aspect of business and private life. However, current web archiving tools face important issues with respect to blog preservation. First, the tools for acquisition and curation use a schedule-based approach to determine the point in time when the content should be captured for archiving. Thus, the life of a blog is not recorded appropriately if it is updated more often than it is crawled [71]. On the other hand, unnecessary harvests and archived duplicates occur if the blog is updated less often than the crawling schedule assumes, and if the whole blog is harvested again instead of a selective harvest of the new pages. Therefore, an approach that considers update events (e.g. new post, new comment, etc.) as triggers for crawling activities would be more suitable. Thereby, RSS feeds and update notification servers can be utilized as triggers; a minimal sketch of such feed-based triggering is given after the list below. Secondly, the general web archiving approach considers the webpage as the digital object that is preserved and can be accessed. However, a blog consists of several smaller entities, like posts and comments. Therefore, while archives like the Internet Archive enable a structural blogosphere analysis, a specialised archiving system based on the inherent structure of blogs also facilitates further analysis, such as tracking issues or events [144]. In summary, we identify several problems of blog archiving with current web archiving tools:

• Aggregation scheduling operates on fixed time intervals without considering website updates. This causes incomplete content aggregation if the update frequency of the content is higher than the schedule assumes [71, 137].

• Traditional aggregation uses brute-force methods to crawl without taking into account which content of the target website has actually been updated. Thus, the performance of both the archiving system and the crawled system is affected unnecessarily [137].

• Current web archiving solutions do not exploit the potential of the inherent structure of blogs. Therefore, while blogs provide a rich set of information entities, structured content, APIs, interconnections and semantic information [89], the management and end-user features of existing web archives are limited to primitive features such as URL Search, Keyword Search, Alphabetic Browsing and Full-Text Search [137].
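As referenced above, the following is a minimal sketch of update-driven harvesting: a blog's feed is polled and a crawl is triggered only for entries newer than the last harvest, instead of re-crawling the whole blog on a fixed schedule. The feed URL and the trigger_crawl() callback are illustrative placeholders, and the use of the feedparser library is an assumption rather than a description of any of the systems discussed here.

```python
# Minimal sketch of feed-triggered harvesting: crawl only entries published
# after the last harvest. Feed URL and trigger_crawl() are placeholders.
import calendar
import time
import feedparser

def new_entries_since(feed_url, last_harvest_ts):
    """Return links of feed entries published after the last harvest."""
    feed = feedparser.parse(feed_url)
    fresh = []
    for entry in feed.entries:
        published = entry.get("published_parsed") or entry.get("updated_parsed")
        if published and calendar.timegm(published) > last_harvest_ts:
            fresh.append(entry.get("link"))
    return fresh

def trigger_crawl(url):
    print("would enqueue crawl for", url)    # placeholder for the real crawler

last_harvest = time.time() - 24 * 3600       # e.g. one day ago
for link in new_entries_since("http://example.com/feed.xml", last_harvest):
    trigger_crawl(link)
```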

2.2.1 Blog Archiving Projects

Here, we review projects and initiatives related to blog preservation. To this end, we inspect the existing solutions of the IIPC [66] for web archiving and the ArchivePress blog archiving project. Furthermore, we look into EC-funded research projects such as Living Web Archives (LiWA) [91], SCalable Preservation Environments (SCAPE) [126] and Collect-All ARchives to COmmunity MEMories (ARCOMEM) [120], which focus on preserving dynamic content, web-scale preservation activities and how to identify important web content that should be selected for preservation. Table 2.1 provides an overview of the related initiatives and projects.

Initiative | Description | Started
ArchivePress | Explore practical issues around the archiving of weblog content, focusing on blogs as records of institutional activity and corporate memory | 2009
ARCOMEM | Leverage the Wisdom of the Crowds for content appraisal, selection and preservation, to create and preserve archives that reflect collective memory and social content perception, and are, thus, closer to current and future users | 2011
IIPC projects | Web archiving tools for acquisition, curation, access and search | 1996
LiWA | Develop and demonstrate web archiving tools able to capture content from a wide variety of sources, to improve archive fidelity and authenticity and to ensure long term interpretability of web content | 2009
PageFreezer.com | Enterprise Class On Demand web archiving and replay service | 2006
Preservica.com | Enterprise Class Cloud-based web archiving service | 2012
SCAPE | Developing an infrastructure and tools for scalable preservation actions | 2011
WebPreserver.com | Google Chrome Plugin to preserve Social Media | 2015

Table 2.1: Overview of related initiatives and projects

The International Internet Preservation Consortium (IIPC) is the leading international organization dedicated to improving the tools, standards and best practices of web archiving. The software they provide as open source comprises tools for acquisition (Heritrix [102]), curation (Web Curator Tool [108] and NetarchiveSuite10) and access and finding (Wayback11, NutchWAX12 and WERA13). They are widely accepted and used by the majority of internet archiving initiatives [61]. However, the IIPC tools cause several problems for blog preservation. First, the tools for acquisition and curation use a schedule-based approach to determine the point in time when the content should be captured for archiving. Thus, the life of a website is not recorded appropriately if the page is updated more often than it is crawled [71]. Given that many blogs are frequently updated, an approach which considers update events (e.g. new post, new comment, etc.) as triggers for crawling activities would be more suitable. Secondly, the archiving approach of the IIPC considers the webpage as the digital object that is preserved and can be accessed. However, a blog consists of several smaller entities, like posts and comments. Therefore, while archives based on IIPC tools enable a structural blogosphere analysis, a specialised archiving system based on the inherent structure of blogs also facilitates further analysis, such as tracking issues or events [144].

The ArchivePress project was an initial effort to attack the problem of blog archiving from a different perspective than traditional web crawlers. To our knowledge, it is the only existing open source blog-specific archiving software. ArchivePress utilises XML feeds produced by blog platforms to achieve better archiving [115]. The scope of the project explicitly excludes the harvesting of the full browser rendering of blog contents (headers, sidebars, advertising and widgets), focusing solely on collecting the marked-up text of blog posts and blog comments (including embedded media). The approach was suggested by the observation that blog content is frequently consumed through automated syndication and aggregation in news reader applications, rather than by navigation of the blog websites themselves. Chris Rusbridge, then Director of the Digital Curation Centre at Edinburgh University, observed, with reason, that "blogs represent an area where the content is primary and design secondary" [123]. Contrary to the solutions of the IIPC, ArchivePress utilises update information of the blogs to launch capturing activities instead of a predefined schedule. For this purpose, it takes advantage of RSS feeds, a ubiquitous feature of blogs. Thus, blogs can be captured according to their activity level, and it is more likely that the whole lifecycle of the blog can be preserved. However, ArchivePress also has a strong limitation, because it does not access the actual blog pages but only their RSS feeds. Thus, ArchivePress does not aggregate the complete blog content but only the portion which is published in RSS feeds, because feeds potentially contain just partial content instead of the full text and do not contain advertisements, formatting markup, or reader comments [50]. Even if blog preservation does not necessarily mean preserving every aspect of a blog [9], and requires instead the identification of significant properties [133], the restriction to RSS feeds would prevent successful blog preservation in various cases. RSS references only recent blog posts and not older ones.
What is more, static blog pages are not listed at all in RSS.

Several European Commission funded digital preservation projects are also relevant to blog preservation in various ways. The LiWA project focuses on creating long term web archives,

10https://sbforge.org/display/NAS/Releases+and+downloads, accessed August 1, 2015
11http://archive-access.sourceforge.net/projects/wayback/, accessed August 1, 2015
12http://archive-access.sourceforge.net/projects/nutch/, accessed August 1, 2015
13http://archive-access.sourceforge.net/projects/wera/, accessed August 1, 2015

filtering out irrelevant content and handling a wide variety of content types. Its approach is valuable as it addresses many aspects of web preservation, but its drawback is that it provides specific components which are integrated with the IIPC tools and are not generic. On the other hand, the ARCOMEM project focuses mainly on social-web-driven content appraisal and selection, and intelligent content acquisition. Its aim is to detect important content regarding events and topics in order to preserve it. Its approach is unique regarding content acquisition and could be valuable for detecting important blog targets for preservation, but it does not advance the state-of-the-art regarding the preservation, management and dissemination of archived content. Another relevant project is SCAPE, which aims to create scalable services for the planning and execution of preservation strategies. SCAPE does not directly advance the state-of-the-art with new approaches to web preservation but aims at scalability only. Its outcome could assist in the deployment of web-scale blog preservation systems.

Besides the presented initiatives, there is an entire software industry sector focused on commercial web archiving services. Representative examples include Hanzo Archives: Social Media Archiving14, Archive-it: a web archiving service to harvest and preserve digital collections, and PageFreezer: social media and website archiving15. Due to the commercial nature of these services, though, it is not possible to find much information on their preservation strategies and technologies. Furthermore, it is impossible to know how long these companies will support these services or even be in business. Thus, we believe that they cannot be considered in our evaluation.

14http://www.hanzoarchives.com/products/social_media_archiving, accessed: August 1, 2015 15http://pagefreezer.com/, accessed: August 1, 2015

Chapter 3

An Innovative Method to Evaluate Website Archivability

We establish the notion of Website Archivability, a concept which captures the core aspects of a website that are crucial in diagnosing whether it has the potential to be archived with completeness and accuracy. We present the two iterations of the Credible Live Evaluation method for Archive Readiness, CLEAR and CLEAR+, to evaluate Website Archivability for any website. We outline the architecture and implementation of http://archiveready.com (ArchiveReady), a reference implementation of CLEAR+. We conduct thorough evaluations on significant datasets in order to support the validity, the reliability and the benefits of our method. Finally, we evaluate the Website Archivability of the most prevalent web content management systems and present our observations and improvement suggestions.1

3.1 Introduction

Web archiving has to examine two key aspects: organizational and technical. The organizational aspect of web archiving involves the entity that is responsible for the process, its governance, funding, long term viability and personnel responsible for the web archiving

1This chapter is based on the following publications:
• Banos V., Manolopoulos Y.: "Web Content Management Systems Archivability", Proceedings 19th East-European Conference on Advances in Databases & Information Systems (ADBIS), Springer Verlag, LNCS Vol.9282, Poitiers, France, 2015.
• Banos V., Manolopoulos Y.: "A Quantitative Approach to Evaluate Website Archivability Using the CLEAR+ Method", International Journal on Digital Libraries, 2015.
• Banos V., Kim Y., Ross S., Manolopoulos Y.: "CLEAR: a Credible Method to Evaluate Website Archivability", Proceedings 10th International Conference on Preservation of Digital Objects (iPRES), Lisbon, Portugal, 2013.

tasks [116]. The technical aspect involves the procedures of web content identification, acquisition, ingest, organization, access and use [44, 132]. We are addressing two of the main challenges associated with the technical aspects of web archiving: (a) the acquisition of web content and (b) the Quality Assurance (QA) evaluation performed before it is ingested into a web archive. Web content acquisition and ingest is a critical step in the process of web archiving; if the initial Submission Information Package (SIP) lacks completeness and accuracy for any reason (e.g. missing or invalid web content), then the rest of the preservation processes are rendered useless. In particular, QA is a vital stage in ensuring that the acquired content is complete and accurate. The peculiarity of web archiving systems in comparison to other archiving systems is that the SIP is preceded by an automated extraction step. Websites often contain rich information not available on their surface. While the great variety and versatility of website structures, technologies and types of content is one of the strengths of the web, it is also a serious weakness. There is no guarantee that web bots dedicated to performing web crawling can access and retrieve website content successfully [69]. Websites benefit from following established best practices, international standards and web technologies in order to be amenable to being archived. We define the sum of the attributes that make a website amenable to being archived as Website Archivability. This work aims to:

• Provide mechanisms to improve the quality of web archive content (e.g. facilitate access, enhance content integrity, identify core metadata gaps).
• Expand and optimize the knowledge and practices of Web archivists, supporting them in their decision making and risk management processes.
• Standardize the web aggregation practices of web archives, especially in relation to QA.
• Foster good practices in website development and Web content authoring that make websites more amenable to harvesting, ingesting, and preserving.
• Raise awareness among web professionals regarding web preservation.
• Make observations regarding the archivability of the 12 most prominent Web Content Management Systems and suggest improvements.

We define the Credible Live Evaluation of Archive Readiness Plus (CLEAR+) method, a set of metrics to quantify the level of archivability of any website. This method is designed to consolidate, extend and complement empirical web aggregation practices through the formulation of a standard process to measure if a website is archivable. The main contributions of this work are:

• an introduction of the notion of Website Archivability,
• a definition of the Credible Live Evaluation of Archive Readiness Plus (CLEAR+) method to measure Website Archivability,
• a detailed architecture and implementation outline of ArchiveReady.com, an online system that functions as a reference implementation of the method,

• an extended evaluation using real world datasets,
• a proof that the CLEAR+ method needs only to evaluate a single webpage to calculate a website's Archivability,
• an evaluation of the WA of 12 prominent Web Content Management Systems (WCMS) and a presentation of observations and improvement suggestions.

The concept of CLEAR emerged from our research in web preservation in the context of the BlogForever project, which involves weblog harvesting and archiving [80]. Our work revealed the need for a method to assess website archive readiness to support web archiving workflows. The remainder of this chapter is organized as follows: Section 3.2 introduces the CLEAR+ method, Section 3.3 presents the ArchiveReady system, and Section 3.4 presents the experimental evaluation and results. Section 3.5 presents the WCMS archivability survey. Finally, we conclude with some discussion and remarks in Section 3.6.

3.2 Credible Live Evaluation method for Archive Readiness Plus (CLEAR+)

We present the Credible Live Evaluation of Archive Readiness Plus method (CLEAR+) as of 08/2014. We focus on its requirements, main components, WA facets and evaluation methods. We also include an example website evaluation to illustrate the application of CLEAR+ in detail.

The CLEAR+ method proposes an approach to produce an on-the-fly measurement of WA, which is defined as the extent to which a website meets the conditions for the safe transfer of its content to a web archive for preservation purposes [17]. All web archives currently employ some form of crawler technology to collect the content of target websites. They communicate through HTTP requests and responses, processes that are agnostic of the repository system of the archive. Information, such as the unavailability of webpages and other errors, is accessible as part of this communication exchange and could be used by the web archive to support archival decisions (e.g. regarding retention, risk management, and characterisation). Here, we combine this kind of information with an evaluation of the website's compliance with recognised practices in digital curation (e.g. using adopted standards, validating formats, and assigning metadata) to generate a credible score representing the archivability of target websites. The main components of CLEAR+ are:

1. WA Facets: the factors that come into play and need to be taken into account to calculate total WA.
2. Website Attributes: the website homepage elements analysed to assess the WA Facets (e.g. the HTML markup code).
3. Evaluations: the tests executed on the website attributes (e.g. HTML code validation against W3C HTML standards) and the approach used to combine the test results to calculate the WA metrics.

It is very important to highlight that WA is meant to evaluate websites only and is not intended to evaluate individual webpages. This is due to the fact that many of the attributes used in the evaluation are website attributes and not attributes of a specific webpage. The correct way to use WA is to provide the website home page as input. Furthermore, in Section 3.4.4 we prove that our method needs only to evaluate the home webpage to calculate the WA of the target website, based on the premise that webpages of the same website share the same components, standards and technologies. WA must also not be confused with website dependability, since the former refers to the ability to archive a website, whereas the latter is a system property that integrates several attributes, such as reliability, availability, safety, security, survivability and maintainability [11].

In the rest of this Section we present the CLEAR+ method in detail. First, we look into the requirements of reliable, high quality metrics and how the CLEAR+ method fulfills them (Section 3.2.1). We continue with the way each of the CLEAR+ components is examined with respect to aspects of web crawler technology (e.g. hyperlink validation; performance measurement) and general digital curation practices (e.g. file format validation; use of metadata) to propose four core constituent Facets of WA (Section 3.2.2). We further describe the website attributes (e.g. HTML elements; hyperlinks) which are used to examine each WA Facet (Section 3.2.3), and propose a method for combining tests on these attributes (e.g. validation of image format) to produce a quantitative measure that represents the Website's Archivability (Section 3.2.4). To illustrate the application of CLEAR+, we present an example in Section 3.2.5. Finally, we outline the development of CLEAR+ in comparison with CLEAR in Section 3.2.6.

3.2.1 Requirements

When introducing a new method and a novel metric, such as WA, it is necessary to evaluate its properties. A good metric must be Quantitative, Discriminative, Fair, Scalable and Normative according to [113]. In the following, we explain how the WA metric satisfies these requirements.

1. Quantitative: WA can be measured as a quantitative score that provides a continuous range of values from perfectly archivable to completely not archivable. WA allows assessment of change over time, as well as comparison between websites or between groups of websites. For more details see the evaluation using assorted datasets in Section 3.4.2.

2. Discriminative: The metric's range of values has a large discriminating power beyond simple archivable and not archivable. The discrimination power of the metric allows assessment of the rate of change. See the underlying theory and an example implementation of the metric in Sections 3.2.4 and 3.2.5.

3. Fair: The metric is fair, taking into account all the attributes of a web resource and performing a large number of evaluations. Moreover, it also takes into account and adjusts to the size and complexity of the websites. WA is evaluated from multiple different aspects, using several WA Facets as presented in Section 3.2.2.

4. Scalable: The metric is scalable and able to support large-scale WA studies given the relevant resources. WA supports aggregation and second-order statistics, such as

STDDEV. WA is also calculated in an efficient way: its cost depends on the number of web resources used in a webpage, and it is calculated in real time. The scalability of the archiveready.com platform is presented in Section 3.3.2.

5. Normative: The metric is normative, deriving from international standards and guidelines. WA stems from established metadata standards, preservation standards, W3C guidelines, etc. The proposed metric is based on established digital preservation practices. All WA aspects are presented in Section 3.2.2.

The WA metric has many strengths, such as objectivity, practicality and the ability to conduct a large-scale assessment without many resources. In the following, we focus on each WA Facet.

3.2.2 Website Archivability Facets

WA can be measured from several different perspectives. Here, we have called these perspectives WA Facets (see Figure 3.1). The selection of these facets is motivated by a number of considerations:

Figure 3.1: WA Facets: An Overview.

1. whether there are verifiable guidelines to indicate what and where information is held at the target website, and whether access is available and permitted with high performance (i.e. Accessibility, see Section 3.2.2);
2. whether the included information follows a common set of format and/or language specifications (i.e. Standards Compliance, see Section 3.2.2);
3. the extent to which information is independent from external support (i.e. Cohesion, see Section 3.2.2); and,
4. the level of extra information available about the content (i.e. Metadata Usage).

Certain classes and specific types of errors create more or fewer obstacles to web archiving. The CLEAR+ algorithm calculates the significance of each evaluation based on the following criteria:

1. High Significance: Critical issues which prevent web crawling or may cause highly problematic web archiving results.
2. Medium Significance: Issues which are not critical but may affect the quality of web archiving results.

3. Low Significance: Minor details which do not cause any issues when they are missing but will help web archiving when available.

Each WA Facet is computed as the weighted average of the scores of the questions associated with the Facet. The significance of each question defines its weight. The WA calculation is presented in detail in Section 3.2.4. Finally, it must be noted that a single evaluation may impact more than one WA Facet. For instance, the presence of a Flash menu in a website has a negative impact on the Accessibility Facet, because web archives cannot detect hyperlinks inside Flash, and also on the Standards Compliance Facet, because Flash is not an open standard.

퐹퐴: Accessibility

A website is considered archivable only if web crawlers are able to visit its home page, traverse its content and retrieve it via standard HTTP protocol requests [54]. If a crawler cannot find the location of all web resources, then it will not be possible to retrieve the content. It is not only necessary to put resources on a website, it is also essential to provide proper references to allow crawlers to discover and retrieve them effectively and efficiently. Performance is also an important aspect of web archiving. The data acquisition throughput of a web bot directly affects the number and complexity of web resources it is able to process. The faster the performance and the faster the ingestion of web content, the better the website's archiving process. It is important to highlight that we evaluate performance using the initial HTTP response time and not the total transfer time, because the former depends on server performance characteristics, whereas the latter depends on file size.

Example 1: a web developer creates a website containing a Flash menu, which requires a proprietary web browser plugin to render properly. Web crawlers cannot access the Flash menu contents, so they are not able to find the web resources referenced in the menu. Thus, the web archive fails to access all available website content. Example 2: a website is archivable only if it can be fully and correctly retrieved by a third party application using HTTP protocols. If a website employs any other protocol, web crawlers will not be able to copy all its content. Example 3: if the performance of a website is slow or web crawling is throttled using some artificial mechanism, web crawlers will have difficulties in aggregating content and may even abort if the performance degrades below a specific threshold.

To support WA, the website should, of course, provide valid links. In addition, a set of maps, guides, and updates for links should be provided to help crawlers find all the content (see Figure 3.1). These can be exposed in feeds, sitemap.xml [127] and robots.txt3 files. Proper HTTP protocol support for ETags, datestamps and other features should also be considered [38, 63].

2https://developers.google.com/speed/docs/insights/Server, accessed August 1, 2015
3http://www.robotstxt.org/, accessed August 1, 2015

Id | Description | Significance
퐴1 | Check the percentage of valid vs. invalid hyperlink and CSS URLs. These URLs are critical for web archives to discover all website content and render it successfully. | High
퐴2 | Check if inline JavaScript code exists in the HTML. Inline JavaScript may be used to dynamically generate content (e.g. via requests), creating obstacles for web archiving systems. | High
퐴3 | Check if sitemap.xml exists. Sitemap.xml files are meant to include references to all the webpages of the website. This feature is critical to identify all website content with accuracy and efficiency. | High
퐴4 | Calculate the maximum initial response time of all HTTP requests. The rating ranges from 100% for an initial response time less than or equal to 0.2 sec to 0% if the initial response time is more than 2 sec. The limits are imposed based on Google Developers speed info2. The rationale is that high performance websites facilitate faster and more efficient web archiving. | High
퐴5 | Check if proprietary file formats such as Flash and QuickTime are used. Web crawlers cannot access the contents of proprietary files, so they are not able to find the web resources referenced in them. Thus, the web archive fails to access all available website content. | High
퐴6 | Check if the robots.txt file contains any "Disallow:" rules. These rules may block web archives from retrieving parts of a website, but it must be noted that not all web archives respect them. | Medium
퐴7 | Check if the robots.txt file contains any "Sitemap:" rules. These rules may help web archives locate one or more sitemap.xml files with references to all the webpages of the website. Although not critical, this rule may help web archives identify sitemap.xml files located in non-standard locations. | Medium
퐴8 | Check the percentage of downloadable linked media files. Valid media file links are important to enable web archives to retrieve them successfully. | Medium
퐴9 | Check if any HTTP caching headers (Expires, Last-Modified or ETag) are set. They are important because they can be used by web crawlers to avoid retrieving unmodified content, accelerating web content retrieval. | Medium
퐴10 | Check if RSS or Atom feeds are referenced in the HTML source code using RSS autodiscovery. Feeds function similarly to sitemap.xml files, providing references to webpages of the current website. RSS feeds are not always present; thus, they can be considered as not absolutely necessary for web archiving and with low significance. | Low

Table 3.1: 퐹퐴: Accessibility Evaluations

The Accessibility Evaluations performed are presented in detail in Table 3.1. For each one of the presented evaluations, a score in the range of 0-100 is calculated depending on the success of the evaluation.
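As an illustration only, the sketch below shows how two of the evaluations in Table 3.1 could be scored: A3 (existence of sitemap.xml) and A4 (initial response time with the 0.2 s and 2 s bounds). The linear interpolation between the two bounds is an assumption; CLEAR+ only fixes the endpoints, and this is not the ArchiveReady implementation.

```python
# Minimal sketch of scoring evaluations A3 (sitemap.xml exists) and A4
# (initial response time). The linear interpolation between the 0.2 s and
# 2 s bounds is an assumption.
import requests

def score_sitemap(base_url):
    """A3: 100 if /sitemap.xml is reachable, 0 otherwise."""
    try:
        r = requests.get(base_url.rstrip("/") + "/sitemap.xml", timeout=10)
        return 100 if r.status_code == 200 else 0
    except requests.RequestException:
        return 0

def score_response_time(elapsed_seconds):
    """A4: 100 for <= 0.2 s, 0 for > 2 s, interpolated in between (assumed linear)."""
    if elapsed_seconds <= 0.2:
        return 100
    if elapsed_seconds > 2.0:
        return 0
    return round(100 * (2.0 - elapsed_seconds) / 1.8)

r = requests.get("http://example.com/", timeout=10)
print(score_sitemap("http://example.com/"), score_response_time(r.elapsed.total_seconds()))
```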

퐹푆 : Standards Compliance

Compliance with standards is a sine qua non theme in digital curation practices (e.g. see the Digital Preservation Coalition guidelines [39]). It is recommended that digital resources be represented in known and transparent standards in order to be preserved. The standards themselves may be proprietary, as long as they are widely adopted and well understood, with supporting tools for validation and access. Above all, the standard should support disclosure, transparency, minimal external dependencies and no legal restrictions with respect to preservation processes that might take place within the archive4.

Disclosure refers to the existence of complete documentation, so that, for example, file format validation processes can take place. Format validation is the process of determining whether a digital object meets the specifications for the format it purports to be. A key question in digital curation is, "I have an object purportedly of format F; is it really F?" [103]. Considerations of transparency and external dependencies refer to the resource's openness to basic tools (e.g. the W3C HTML standard validation tool; the JHOVE2 format validation tool [46]).

Example: if a webpage has not been created using accepted standards, it is unlikely to be renderable by web browsers using established methods. Instead, it is rendered in "quirks mode", a custom technique to maintain compatibility with older or broken webpages. The problem is that quirks mode behaviour is highly variable. As a result, one cannot depend on it for a standard rendering of the website in the future. It is true that using emulators one may be able to render these websites in the future, but this is rarely the case for the average user, who will be accessing the web archive with his/her latest web browser.

We recommend that validation is performed for three types of content (see Table 3.2): webpage components (e.g. HTML and CSS), referenced media content (e.g. audio, video, images, documents), and the HTTP protocol headers used for communication and supporting resources (e.g. robots.txt, sitemap.xml, JavaScript). The website is checked for Standards Compliance on three levels: referenced media format (e.g. images and audio included in the webpage), webpage (e.g. HTML and CSS markup) and resource (e.g. sitemap, scripts). Each of these is expressed using a set of specified file formats and/or languages. The languages (e.g. XML) and formats (e.g. JPEG) are validated using tools such as the W3C HTML [141] and CSS validators5, the JHOVE2 and/or Apache Tika6 file format validators, a Python XML validator7 and a robots.txt checker8.

4http://www.digitalpreservation.gov/formats/sustain/sustain.shtml, accessed August 1, 2015
5http://jigsaw.w3.org/css-validator/, accessed August 1, 2015
6http://tika.apache.org/, accessed August 1, 2015
7http://code.google.com/p/pyxmlcheck/, accessed August 1, 2015
8http://tool.motoricerca.info/robots-checker.phtml, accessed August 1, 2015

Id | Description | Significance
푆1 | Check if the HTML source code complies with the W3C standards. This is critical because invalid HTML may lead to invalid content processing and unrenderable archived web content in the future. | High
푆2 | Check the usage of QuickTime and Flash file formats. Digital preservation best practices are in favor of open standards, so it is considered problematic to use these types of files. | High
푆3 | Check the integrity and standards compliance of images. This is critical to detect potential problems with image formats and corruption. | Medium
푆4 | Check if the RSS feed format complies with W3C standards. This is important because invalid RSS feeds may prevent web crawlers from analysing them and extracting metadata or references to website content. | Medium
푆5 | Check if the HTTP Content-Encoding or Transfer-Encoding headers are set. They are important because they provide information regarding the way the content is transferred. | Medium
푆6 | Check if any HTTP caching headers (Expires, Last-Modified or ETag) are set. They are important because they may help web archives avoid downloading unmodified content, improving their performance and efficiency. | Medium
푆7 | Check if the CSS referenced in the HTML source code complies with W3C standards. This is important because invalid CSS may lead to unrenderable archived web content in the future. | Medium
푆8 | Check the integrity and the standards compliance of HTML5 audio elements. This is important to detect a wide array of problems with audio formats and corruption. | Medium
푆9 | Check the integrity and the standards compliance of HTML5 video elements. This is important to detect potential problems with video formats and corruption. | Medium
푆10 | Check if the HTTP Content-Type header exists. This is significant because it provides information to the web archives about the content and it may potentially help interpret it. | Medium

Table 3.2: 퐹푆 Standards Compliance Facet Evaluations

We also have to note that we are checking the usage of QuickTime and Flash explicitly because they are the major closed standard file formats with the greatest adoption on the web, according to the HTTP Archive9.

9http://httparchive.org/, accessed August 1, 2015
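The actual system uses JHOVE for media validation (see Section 3.3). Purely as an illustration of the kind of well-formedness check behind evaluation S3, the sketch below performs a comparable image integrity test with the Pillow library; this is a substitution for illustration, not the method used in this work, and the file name is hypothetical.

```python
# Illustrative image well-formedness check in the spirit of evaluation S3.
# ArchiveReady uses JHOVE; Pillow is used here only as a lightweight stand-in.
from PIL import Image

def image_is_well_formed(path):
    """Return True if the file can be parsed as a valid image."""
    try:
        with Image.open(path) as img:
            img.verify()                     # raises if the file is corrupted
        return True
    except Exception:
        return False

print(image_is_well_formed("logo.png"))      # hypothetical local file
```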

퐹퐶 : Cohesion

Cohesion is relevant both to the efficient operation of web crawlers and to the management of dependencies within digital curation (e.g. see the NDIIPP comment on format dependencies [8]). If the files comprising a single webpage are dispersed across different services (e.g. different servers for images, JavaScript widgets and other resources) in different domains, the acquisition and ingest risks being neither complete nor accurate. If one of the multiple services fails, the website fails as well. Here we characterise the robustness of the website against this kind of failure as Cohesion. It must be noted that we use the top-level domain and not the host name to calculate Cohesion. Thus, both http://www.test.com and http://images.test.com belong to the top-level domain test.com. Example: a Flash widget used in a website but hosted elsewhere may cause problems in web archiving because it may not be captured when the website is archived. More important is the case where the target website depends on third party websites whose future availability is unknown; then new kinds of problems are likely to arise.

Id | Description | Significance
퐶1 | The percentage of local vs. remote images. | Medium
퐶2 | The percentage of local vs. remote CSS. | Medium
퐶3 | The percentage of local vs. remote script tags. | Medium
퐶4 | The percentage of local vs. remote video elements. | Medium
퐶5 | The percentage of local vs. remote audio elements. | Medium
퐶6 | The percentage of local vs. remote proprietary objects (Flash, QuickTime). | Medium

Table 3.3: 퐹퐶 Cohesion Facet Evaluations.

The premise is that keeping information associated with the same website together (e.g. using the same host for a single instantiation of the website content) leads to resources that are robust against changes occurring outside of the website (cf. encapsulation10). Cohesion is tested at two levels:

1. examining how many domains are employed in relation to the location of referenced media content (images, video, audio, proprietary files),
2. examining how many domains are employed in relation to supporting resources (e.g. robots.txt, sitemap.xml, CSS and JavaScript files).

The level of Cohesion is measured by the extent to which material associated with the website is kept within one domain. This is measured by the proportion of content, resources, and plugins that are sourced internally. It can be examined through an analysis of links, on the level of referenced media content and on the level of supporting resources (e.g. JavaScript). In addition, the proportion of content relying on predefined proprietary software can be assessed and monitored. The Cohesion Facet evaluations are presented in Table 3.3.

10http://www.paradigm.ac.uk/workbook/preservation-strategies/selecting-other.html, accessed August 1, 2015

One may argue that if website files are hosted across multiple services, they could still be saved in case the website fails. This is true, but our aim is to archive the website as a whole and not each file independently. Distributing the files across multiple locations increases the possibility of losing some of them.
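A minimal sketch of one Cohesion evaluation (C1, local vs. remote images) follows, comparing registered domains rather than host names as described above. The two-label domain heuristic is a simplification introduced here for illustration; a real implementation would need a public-suffix list (e.g. via tldextract) to handle domains such as .co.uk correctly.

```python
# Minimal sketch of a Cohesion evaluation (e.g. C1: local vs. remote images),
# comparing registered domains rather than host names. The two-label heuristic
# is a simplification; a public-suffix list would be needed in practice.
from urllib.parse import urlparse

def registered_domain(url):
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def cohesion_score(site_url, resource_urls):
    """Percentage of resources hosted on the website's own registered domain."""
    if not resource_urls:
        return None                          # evaluation not applicable
    site = registered_domain(site_url)
    local = sum(1 for u in resource_urls if registered_domain(u) == site)
    return round(100 * local / len(resource_urls))

images = ["http://www.test.com/a.png", "http://images.test.com/b.png",
          "http://cdn.example.net/c.png"]
print(cohesion_score("http://www.test.com/", images))   # -> 67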

퐹푀 : Metadata Usage

The adequate provision of metadata (e.g. see the Digital Curation Centre Curation Reference Manual chapters on metadata [101], preservation metadata [33], archival metadata [48], and learning object metadata [93]) has been a continuing concern within digital curation (e.g. see the seminal article by Lavoie [87] and insightful discussions going beyond preservation11). The lack of metadata impairs the archive's ability to manage, organise, retrieve and interact with content effectively. It is widely recognised that it makes understanding the context of the material a challenge.

Id | Description | Significance
푀1 | Check if the HTTP Content-Type header exists. This is significant because it provides information to the web archives about the content and may potentially help retrieve more information. | Medium
푀2 | Check if any HTTP caching headers (Expires, Last-Modified or ETag) are set. They are important because they provide extra information regarding the creation and last modification of web resources. | Medium
푀3 | Check if the HTML meta robots noindex, nofollow, noarchive, nosnippet and noodp tags are used in the markup. If true, they instruct the web archives to avoid archiving the website. This is optional and usually omitted. | Low
푀4 | Check if the DC profile12 is used in the HTML markup. This evaluation is optional and of low significance. If the DC profile exists, it will help the web archive obtain more information regarding the archived content. If absent, there will be no negative effect. | Low
푀5 | Check if the FOAF profile [27] is used in the HTML markup. This evaluation is optional and of low significance. If the FOAF profile exists, it will help the web archive obtain more information regarding the archived content. If it does not exist, it will not have any negative effect. | Low
푀6 | Check if the HTML meta description tag exists in the HTML source code. The meta description tag is optional and of low significance. It does not affect web archiving directly, but affects the information we have about the archived content. | Low

Table 3.4: 퐹푀 Metadata Facet Evaluations

11http://www.activearchive.com/content/what-about-metadata, accessed August 1, 2015
12http://dublincore.org/documents/2008/08/04/dc-html/, accessed August 1, 2015

We consider metadata on three levels. To avoid the dangers associated with committing to any specific metadata model, we have adopted a general viewpoint shared across many information disciplines (e.g. philosophy, linguistics, computer science) based on syntax (e.g. how is it expressed), semantics (e.g. what is it about) and pragmatics (e.g. what can you do with it). There are extensive discussions on metadata classification depending on the application (e.g. see the National Information Standards Organization classification [117]; the discussion in the Digital Curation Centre Curation Reference Manual chapter on metadata [101]). Here we avoid these fine-grained discussions and focus on the fact that much of the metadata examined in the existing literature can be exposed already at the time that websites are created and disseminated. For example, metadata such as transfer and content encoding can be included by the server in HTTP headers. The end-user language required to understand the content can be indicated as an attribute of the HTML element. Descriptive information (e.g. author, keywords) that can help in understanding how the content is classified can be included in HTML element attributes and values. Metadata that support rendering information, such as application and generator names, can also be included in the HTML META element. The use of other well-known metadata and description schemas (e.g. Dublin Core [143]; Friend of a Friend (FOAF) [27]; Resource Description Framework (RDF) [99]) can be included to promote better interoperability. The existence of selected metadata elements can be checked as a way of increasing the probability of implementing automated extraction and refinement of metadata at harvest, ingest, or a subsequent stage of repository management. The Metadata Usage Facet evaluations are presented in Table 3.4.
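As an illustration of the kind of checks listed in Table 3.4, the sketch below detects two metadata signals (an archiving-blocking meta robots directive and a description meta tag) with BeautifulSoup, which the reference implementation also relies on; the exact logic in ArchiveReady may differ from this example.

```python
# Minimal sketch of two metadata checks in the spirit of evaluations M3 and M6,
# using BeautifulSoup. The exact ArchiveReady logic may differ.
from bs4 import BeautifulSoup

def metadata_checks(html_source):
    soup = BeautifulSoup(html_source, "html.parser")
    robots = soup.find("meta", attrs={"name": "robots"})
    description = soup.find("meta", attrs={"name": "description"})
    return {
        # M3: meta robots directives that would block archiving
        "blocks_archiving": bool(robots and "noarchive" in robots.get("content", "").lower()),
        # M6: presence of a description meta tag with content
        "has_description": description is not None and bool(description.get("content")),
    }

sample = '<html><head><meta name="description" content="Example site"></head></html>'
print(metadata_checks(sample))
```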

3.2.3 Attributes

We summarise the website attributes that we evaluate to calculate WA. They are also presented in Figure 3.2.

Figure 3.2: Website attributes evaluated for WA

RSS: The existence of an RSS feed allows the publication of webpage content that can be automatically syndicated or exposed. It allows web crawlers to automatically retrieve updated content, whereas the standardised format of the feeds allows access by many different applications. For example, the BBC uses feeds to let readers see when new content has been added13.

Robots.txt: The robots.txt file indicates to a web crawler which URLs it is allowed to crawl. The use of robots.txt helps to keep the retrieval of website content aligned with the permissions and special rights associated with the webpage.

13http://www.bbc.co.uk/news/10628494, accessed August 1, 2015

Sitemap.xml: The Sitemaps protocol, jointly supported by the most widely used search engines to help content creators and search engines, is an increasingly used way to unlock hidden data by making it available to search engines [127]. To implement the Sitemaps protocol, the file sitemap.xml is used to list all the website pages and their locations. The location of this sitemap, if it exists, can be indicated in robots.txt. Regardless of its inclusion in the robots.txt file, the sitemap, if it exists, should ideally be called 'sitemap.xml' and placed at the root of the web server (e.g. http://www.example.co.uk/sitemap.xml).

HTTP Headers: HTTP is the protocol used to transfer content from the web server to the web archive. HTTP is very important as it carries significant information regarding many aspects of the web content.

Source code and linked web resources: The source code of the website (HTML, JavaScript, CSS).

Binary files: The binary files included in the webpage (images, PDF, etc.).

Hyperlinks: Hyperlinks comprise the net that links the web together. The hyperlinks of the website can be examined for availability as an indication of website accessibility. The lack of hyperlinks does not impact WA, but the existence of missing and/or broken links should be considered problematic.
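A minimal sketch of inspecting the robots.txt and sitemap attributes described above follows, using only the Python standard library; it illustrates the kind of checks behind evaluations A6 and A7 rather than the ArchiveReady code itself, and the target URL is a placeholder.

```python
# Minimal sketch of inspecting robots.txt for crawl permissions and Sitemap
# directives (attributes used by evaluations A6/A7), standard library only.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

# A6-style check: is a generic crawler allowed to fetch the home page?
print("home page allowed:", rp.can_fetch("*", "http://example.com/"))

# A7-style check: does robots.txt advertise any sitemap.xml files?
print("sitemaps declared:", rp.site_maps() or [])   # site_maps() requires Python 3.8+
```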

3.2.4 Evaluations

Combining the information discussed in Section 3.2.2 to calculate a score for WA goes through the following steps.

1. The WA potential with respect to each facet is represented by an $N$-tuple $(x_1, \dots, x_k, \dots, x_N)$, where $x_k$ equals 0 or 1 and represents a negative or positive answer, respectively, to the binary question asked about that facet, whereas $N$ is the total number of questions associated with that facet. An example question in the case of the Standards Compliance Facet would be "I have an object purportedly of format F; is it?" [103]; if there are $M$ files for which format validation is being carried out, then there will be $M$ binary questions of this type.

2. Not all questions are considered of equal value to the facet. Depending on their significance (Low, Medium and High), they have a different weight $w_k$ (1, 2 or 4, respectively). The weights follow a power law distribution where Medium is twice as important as Low and High is twice as important as Medium. The value of each facet is the weighted average of its coordinates:

\[ F_\lambda = \sum_{k=1}^{N} \frac{w_k x_k}{C} \tag{3.1} \]

where $w_k$ is the weight assigned to question $k$ and

\[ C = \sum_{i=1}^{N} w_i \]

Once the rating with respect to each facet is calculated, the total measure of WA can be simply defined as:

\[ WA = \sum_{\lambda \in \{A,S,C,M\}} w_\lambda F_\lambda \tag{3.2} \]

where $F_A$, $F_S$, $F_C$, $F_M$ are the WA with respect to Accessibility, Standards Compliance, Cohesion and Metadata Usage, respectively, and

\[ \sum_{\lambda \in \{A,S,C,M\}} w_\lambda = 1, \qquad 0 \le w_\lambda \le 1 \;\; \forall\, \lambda \in \{A,S,C,M\} \]

Depending on the curation and preservation objectives of the web archive, the significance of each facet is likely to be different, and $w_\lambda$ could be adapted to reflect this. In the simplest model, these $w_\lambda$ values can be equal, i.e. $w_\lambda = 0.25$ for any $\lambda$. Thus, the WA is calculated as:

\[ WA = \tfrac{1}{4} F_A + \tfrac{1}{4} F_S + \tfrac{1}{4} F_C + \tfrac{1}{4} F_M \tag{3.3} \]

Facet | Weight
퐹퐴 | (5×4) + (4×2) + (1×1) = 29
퐹푆 | (2×4) + (8×2) = 24
퐹퐶 | 6×2 = 12
퐹푀 | (2×2) + (4×1) = 8
Total | 73

Table 3.5: WA Facet Weights

We can calculate WA by adopting a normalized model approach, i.e. by multiplying facet evaluations by special weights according to their specific questions (of low, medium or high significance). To this end, in Table 3.5 we calculate the special weights of each facet. Thus, we can evaluate a weighted WA as:

\[ WA_{weighted} = \tfrac{29}{73} F_A + \tfrac{24}{73} F_S + \tfrac{12}{73} F_C + \tfrac{8}{73} F_M \tag{3.4} \]

Actually, Accessibility is the most central consideration in WA since, if the content cannot be found or accessed, then the website's compliance with other standards and conditions becomes moot. In case the user needs to change the significance of each facet, it is easy to do so by assigning different values to their significance.
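To make the calculation concrete, the sketch below implements the weighted-average and aggregation logic of Equations 3.1-3.4: each facet is a list of (score, significance) pairs with scores in 0-100 and weights 1/2/4 for Low/Medium/High, and non-applicable evaluations are simply omitted, as in the example of Section 3.2.5. The toy input values are illustrative, not a real evaluation.

```python
# Minimal sketch of the WA calculation of Equations 3.1-3.4. Each facet is a
# list of (score, significance) pairs; evaluations that do not apply to a
# website are simply left out.
WEIGHTS = {"low": 1, "medium": 2, "high": 4}

def facet_value(evaluations):
    """Weighted average of a facet's evaluation scores (Equation 3.1)."""
    total_weight = sum(WEIGHTS[sig] for _, sig in evaluations)
    return sum(score * WEIGHTS[sig] for score, sig in evaluations) / total_weight

def wa_flat(facets):
    """Equation 3.3: all facets weighted equally."""
    values = [facet_value(e) for e in facets.values()]
    return sum(values) / len(values)

def wa_weighted(facets):
    """Equation 3.4 style: facets weighted by the sum of their question weights."""
    facet_weights = {f: sum(WEIGHTS[s] for _, s in e) for f, e in facets.items()}
    total = sum(facet_weights.values())
    return sum(facet_value(e) * facet_weights[f] / total for f, e in facets.items())

# Toy example with two facets only (illustrative scores, not a real evaluation):
facets = {
    "A": [(99, "high"), (0, "high"), (100, "medium")],
    "M": [(100, "medium"), (100, "low")],
}
print(round(wa_flat(facets)), round(wa_weighted(facets)))
```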

Id | Description | Rating | Significance
퐴1 | 121 valid and 1 invalid links. | 99% | High
퐴2 | 6 inline JavaScript tags. | 0% | High
퐴3 | Sitemap file exists: http://auth.gr/sitemap.xml | 100% | High
퐴4 | Network response time is 100ms. | 100% | High
퐴5 | No use of any proprietary file format such as Flash and QuickTime. | 100% | High
퐴6 | Robots.txt file contains multiple "Disallow" rules: http://auth.gr/robots.txt | 0% | Medium
퐴7 | No sitemap.xml reference in the robots.txt file. | 0% | Medium
퐴8 | 16 in 16 images. | 100% | Medium
퐴9 | HTTP caching headers available. | 100% | Medium
퐴10 | One RSS feed http://auth.gr/rss.xml found using RSS autodiscovery. | 100% | Low

Table 3.6: 퐹퐴 evaluation of http://auth.gr/.

3.2.5 Example

To illustrate the application of CLEAR+, we calculate the WA rating of the website of the Aristotle University of Thessaloniki (AUTH)14. For each WA Facet, we conduct the necessary evaluations (Tables 3.6-3.9) and calculate the respective Facet values (see Equations 3.5-3.8) using Equation 3.1.

\[ F_A = \frac{(99 \cdot 4) + (0 \cdot 4) + (100 \cdot 4) + (100 \cdot 4) + (100 \cdot 4) + (0 \cdot 2) + (0 \cdot 2) + (100 \cdot 2) + (100 \cdot 2) + (100 \cdot 1)}{(4 \cdot 5) + (2 \cdot 4) + (1 \cdot 1)} \approx 72\% \tag{3.5} \]

\[ F_S = \frac{(0 \cdot 4) + (100 \cdot 4) + (100 \cdot 2) + (100 \cdot 2) + (100 \cdot 2) + (100 \cdot 2) + (54 \cdot 2) + (100 \cdot 2)}{(4 \cdot 2) + (2 \cdot 6)} \approx 75\% \tag{3.6} \]

\[ F_C = \frac{(87 \cdot 2) + (90 \cdot 2) + (100 \cdot 2)}{3 \cdot 2} \approx 92\% \tag{3.7} \]

\[ F_M = \frac{(100 \cdot 2) + (100 \cdot 2) + (100 \cdot 1) + (100 \cdot 1)}{(2 \cdot 2) + (1 \cdot 2)} = 100\% \tag{3.8} \]

14http://www.auth.gr/ as of 10 August 2014.

Id | Description | Rating | Significance
푆1 | HTML validated, multiple errors. | 0% | High
푆2 | No proprietary external objects (Flash, QuickTime). | 100% | High
푆3 | 16 well-formed images checked with JHOVE. | 100% | Medium
푆4 | RSS feed http://auth.gr/rss.xml is valid according to the W3C feed validator. | 100% | Medium
푆5 | Content encoding was clearly defined in HTTP Headers. | 100% | Medium
푆6 | HTTP caching headers clearly defined. | 100% | Medium
푆7 | 6 valid and 5 invalid CSS. | 54% | Medium
푆8 | No HTML5 audio elements. | - | Medium
푆9 | No HTML5 video elements. | - | Medium
푆10 | Content type clearly defined in HTTP Headers. | 100% | Medium

Table 3.7: 퐹푆 evaluation of http://auth.gr/.

Id | Description | Rating | Significance
퐶1 | 14 local and 2 external images. | 87% | Medium
퐶2 | 10 local and 1 external CSS. | 90% | Medium
퐶3 | 7 local and no external scripts. | 100% | Medium
퐶4 | No HTML5 audio elements. | - | Medium
퐶5 | No HTML5 video elements. | - | Medium
퐶6 | No proprietary objects. | - | Medium

Table 3.8: 퐹퐶 evaluation of http://auth.gr/.

Id | Description | Rating | Significance
푀1 | Content type clearly defined in HTTP Headers. | 100% | Medium
푀2 | HTTP caching headers are set. | 100% | Medium
푀3 | No meta robots blocking. | - | Low
푀4 | No DC metadata. | - | Low
푀5 | FOAF metadata found. | 100% | Low
푀6 | HTML description meta tag found. | 100% | Low

Table 3.9: 퐹푀 evaluation of http://auth.gr/.

Finally, assuming the flat model approach we calculate the WA value as:

\[ WA = \frac{F_A + F_S + F_C + F_M}{4} \approx 85\% \]

whereas, by following the normalized model approach and counting only the evaluations that are applicable to this website (which is why the weights differ from Equation 3.4), the weighted WA value is calculated as:

\[ WA_{weighted} = \tfrac{29}{61} F_A + \tfrac{20}{61} F_S + \tfrac{6}{61} F_C + \tfrac{6}{61} F_M \approx 78\% \]

A screenshot of the http://archiveready.com/ web application session we use to evaluate http://auth.gr/ is presented in Figure 3.3.

Figure 3.3: Evaluating http://auth.gr/ WA using ArchiveReady.

3.2.6 The Evolution from CLEAR to CLEAR+

Finally, we conclude the presentation of CLEAR+ with the developments of the method since the first incarnation of CLEAR (Ver.1 of 04/2013) [17]. We experimented in practice with the CLEAR method for a considerable time, running a live online system which is presented in detail in Section 3.3. We conducted multiple evaluations and received feedback from academics and web archiving industry professionals. This process resulted in the identification of many issues, such as missing evaluations and overestimated or underestimated criteria. The algorithmic and technical improvements of our method can be summarised as follows:

1. Each website attribute evaluation has a different significance, depending on its effect on web archiving, as presented in Section 3.2.2.

2. The Performance Facet has been integrated into the Accessibility Facet and its importance has been downgraded significantly. This is a result of the fact that website performance in our tests has been consistently high, regardless of the websites' other characteristics. Thus, the Performance Facet rating was always 100% or near 100%, distorting the overall WA evaluation.

3. A weighted arithmetic mean is used to calculate the WA Facets instead of a simple mean. All evaluations have been assigned a low, medium or high significance indicator, which affects the calculation of all WA Facets. The significance has been defined based on the initial experience with WA evaluations from the first year of archiveready.com operation.

4. Certain evaluations have been removed from the method as they were considered irrelevant. For example, the check of whether archived versions of the target website are present in the Internet Archive should not be part of the assessment.

5. On a technical level, all aspects of the reference implementation of the Website Archivability evaluation tool, http://archiveready.com, have been improved. The software also has the new capability of analysing dynamic websites using a headless web browser, as presented in Section 3.3. Thus, its operation has become more accurate and valid than in the previous version.

In the following Section, we present the architecture of ArchiveReady, a web system implementing the CLEAR+ method.

3.3 ArchiveReady: A Website Archivability Evaluation Tool

We present ArchiveReady15, a WA evaluation system that implements CLEAR+ as a web application. We describe the system architecture, design decisions, WA evaluation workflow and Application Programming Interfaces (APIs) available for interoperability purposes.

15http://www.archiveready.com, accessed August 1, 2015

3.3.1 System Architecture

ArchiveReady is a web application based on the following key components:

1. Debian linux operating system [118] for development and production servers,

2. Nginx web server16 to serve static web content,

3. Python programming language17,

4. Gunicorn Python WSGI HTTP Server for UNIX18 to serve dynamic content,

5. BeautifulSoup19 to analyse HTML markup and locate elements,

6. Flask20, a Python micro-framework to develop web applications,

7. Redis advanced key-value store21 to manage job queues and temporary data,

8. Mariadb Mysql RDBMS22 to store long-term data.

9. PhantomJS23, a headless WebKit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG,
10. JSTOR/Harvard Object Validation Environment (JHOVE) [46] for media file validation,
11. JavaScript and CSS libraries, such as jQuery24 and Bootstrap25, utilized to create a compelling user interface,
12. W3C HTML Markup Validation Service [141] and CSS Validation Service APIs for web resource evaluation.

The home page of the ArchiveReady system is presented in Figure 3.5. An overview of the system architecture is presented in Figure 3.4. During the design and implementation of the platform, we made some important decisions, which greatly influenced all aspects of development.

We chose Python to implement ArchiveReady since it is ideal for rapid application development and has many modern features. Moreover, it is supported by a large user community

16http://www.nginx.org
17http://www.python.org/, accessed: August 1, 2015
18http://gunicorn.org/, accessed: August 1, 2015
19http://www.crummy.com/software/BeautifulSoup/, accessed: August 1, 2015
20http://flask.pocoo.org/, accessed: August 1, 2015
21http://redis.io, accessed: August 1, 2015
22http://www.mariadb.com, accessed: August 1, 2015
23http://phantomjs.org/, accessed: August 1, 2015
24http://www.jquery.com, accessed: August 1, 2015
25http://twitter.github.com/bootstrap/, accessed: August 1, 2015

Figure 3.4: The architecture of the archiveready.com system.

and has a wide range of modules. Using these assets, we can implement many important features such as RSS feed validation (feedvalidator module), XML parsing, validation and analysis (lxml module), HTTP communication (python-requests module) and asynchronous job queues (python-rq module). We use PhantomJS to access websites which use JavaScript, AJAX and other web technologies that are difficult to handle with plain HTML processing. Using PhantomJS, we can perform JavaScript rendering when processing a website. Therefore, we can extract dynamic content and even support AJAX-generated content in addition to traditional HTML-only websites. We select Redis to store temporary data in memory because of its performance and its ability to support many data structures. Redis is an advanced key-value store: keys can contain strings, hashes, lists, sets and sorted sets. These features make it ideal for holding volatile information, such as intermediate evaluation results and other temporary data about website evaluations. Redis is also critical for the implementation of the asynchronous job queues described in Section 3.3.2. We use MariaDB to store data permanently for all evaluations; such data are the final evaluation results and user preferences. We use JHOVE [46], an established digital preservation tool, to evaluate the correctness of the files included in websites. We evaluate HTML markup, CSS and RSS correctness using the W3C validator tools. We also use Python exceptions to track problems when analysing webpages and to locate webpages which cause problems to web client software.
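As an illustration of the Redis usage described above, the following minimal sketch stores an intermediate evaluation result under a volatile key with a one-hour expiry; the key name and payload are hypothetical and only indicate the general pattern, not the exact schema used by ArchiveReady.

    import json
    import redis

    r = redis.Redis(host='localhost', port=6379)

    # Hold a partial evaluation outcome as volatile data for one hour.
    partial_result = {'evaluation': 'css_validation', 'errors': 8, 'warnings': 78}
    r.setex('evaluation:42:css', 3600, json.dumps(partial_result))

    # A worker or the master process can later read it back.
    stored = json.loads(r.get('evaluation:42:css'))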

Figure 3.5: The home page of the archiveready.com system.

3.3.2 Scalability

Among the greatest challenges in implementing ArchiveReady are performance, scalability and responsiveness. A web service must be able to evaluate multiple websites in parallel, while maintaining a responsive Web UI and API. To achieve this goal, we implement asynchronous job queues in the following manner:

1. ArchiveReady tasks are separated into two groups: real-time and asynchronous. Real-time commands are processed as soon as they are received, as in any common web application.

2. Asynchronous tasks are processed differently. When a user or third-party application initiates a new evaluation task, the web application server maps the task into multiple individual atomic subtasks, which are inserted into the asynchronous job queue of the system, stored in a Redis List.

3. Background workers, equal in number to the server CPU cores, constantly monitor the job queue for new tasks. As soon as they identify them, they process them one by one and store the results in the MariaDB database.

4. When all subtasks of a given task are finished, the web application server process presents the results to the user. While the background processes are working, the application server is free to reply to any requests regarding new website evaluations without any delay.

The presented evaluation processing logic has many important benefits. Tasks are separated into multiple individual atomic evaluations, which makes the system very robust: an exception or any other system error in an individual evaluation does not interfere with the general system operation. More importantly, the platform is highly scalable, as the asynchronous job queues can scale not only vertically, depending on the number of available server CPU cores, but also horizontally, since multiple servers can be configured to share the same asynchronous job queue and database. To ensure a high level of compatibility with W3C standards, we use open source web services provided by the W3C. These include the Markup Validator, the Feed Validation Service26 and the CSS Validation Service. According to the HTTP Archive Trends, the average number of HTTP requests initiated when accessing a webpage is over 90 and is expected to rise27. In this performance context, ArchiveReady has to be capable of performing a very large number of HTTP requests, processing the data and presenting the outcomes to the user in real time. This is not possible with a single process per user, the typical approach in web applications. To resolve this blocking issue, an asynchronous job queue system based on Redis for queue management and the Python RQ library28 is deployed. This approach enables the parallel execution of multiple evaluation processes, resulting in large performance benefits compared to the traditional web application execution model. Its operation can be summarised as follows (a minimal sketch of this queueing pattern is given after the list):

1. As soon as a user submits a website for evaluation, the master process maps the work into multiple individual jobs, which are inserted in the parallel job queues in the back- ground.

2. Background worker processes are notified and begin processing the individual jobs in parallel. The level of parallelism is configurable; 16 parallel processes are the current setup.

3. As soon as a job is finished, the results are sent to the master process.

4. When all jobs are finished, the master process calculates the WA and presents the final results to the user.
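The following minimal sketch illustrates this queueing pattern with Redis and the Python RQ library. The queue name, the attribute list and evaluate_attribute() are hypothetical placeholders for the atomic subtasks; they are not the actual ArchiveReady code.

    from redis import Redis
    from rq import Queue

    redis_conn = Redis()
    queue = Queue('wa-evaluations', connection=redis_conn)

    def evaluate_attribute(url, attribute):
        # One atomic subtask, e.g. HTML validation or sitemap.xml retrieval.
        # The real evaluation logic would run here and store its result.
        return {'url': url, 'attribute': attribute, 'score': 100}

    def submit_evaluation(url):
        # The master process maps one user request into many individual jobs.
        attributes = ['html', 'css', 'rss', 'sitemap', 'http_headers']
        return [queue.enqueue(evaluate_attribute, url, a) for a in attributes]

    # Background workers are started separately, e.g.:  rq worker wa-evaluations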

3.3.3 Workflow

ArchiveReady is a web application providing two types of interaction: web interface and web service. With the exception of the presentation of outcomes (HTML for the former, JSON for the latter), both are identical. The evaluation workflow of a target website can be summarised as follows:

1. ArchiveReady receives a target URL and performs an HTTP request to retrieve the webpage hypertext.

26http://validator.w3.org/feed/
27http://httparchive.org/trends.php
28http://python-rq.org/

2. After analysing it, multiple HTTP connections are initiated in parallel to retrieve all web resources referenced in the target webpage, imitating a web crawler. ArchiveReady analyses only the URL submitted by the user; it does not evaluate the whole website recursively, as we have shown that the WA analysis of a single webpage is a good proxy for the WA rating of the whole website.

3. In stage 3, Website Attributes (See Section 3.2.3) are evaluated. In more detail:

(a) HTML and CSS analysis and validation,
(b) HTTP response headers analysis and validation,
(c) Media files (images, video, audio, other objects) retrieval, analysis and validation,
(d) Sitemap.xml and Robots.txt retrieval, analysis and validation,
(e) RSS feeds detection, retrieval, analysis and validation,
(f) Website Performance evaluation: the sum of all network transfer activity is recorded by the system and, after the completion of all network transfers, the average transfer time is calculated.

There are fast and slow evaluations; fast evaluations are performed instantly at the application server, whereas slow evaluations are performed asynchronously using a job queue, as presented in Section 3.3.2.

4. The metrics for the WA Facets are calculated according to the CLEAR+ method and the final WA rating is calculated.

Note that in the current implementation, CLEAR+ evaluates only a single webpage, based on the assumption that all the webpages of a website share the same components, standards and technologies. This assumption is validated in Section 3.4.4.
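A simplified sketch of the first two stages of this workflow is shown below: the submitted page is retrieved, its referenced resources are extracted and then fetched in parallel, imitating a web crawler. The selected tags and the thread pool size are assumptions for illustration only.

    from concurrent.futures import ThreadPoolExecutor
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def fetch(url):
        # Retrieve a single web resource and report its status and size.
        response = requests.get(url, timeout=10)
        return url, response.status_code, len(response.content)

    def crawl_single_page(url):
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, 'html.parser')
        # Collect referenced resources: images, stylesheets and scripts.
        refs = [tag.get('src') or tag.get('href')
                for tag in soup.find_all(['img', 'link', 'script'])]
        refs = [urljoin(url, r) for r in refs if r]
        with ThreadPoolExecutor(max_workers=8) as pool:
            return list(pool.map(fetch, refs))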

3.3.4 Interoperability and APIs

ArchiveReady operates not only as a web application for users visiting the website but also as a web service, available for integration into third-party applications. Its interface is quite simple: by accessing archiveready.com/api?url=http://auth.gr/ via HTTP, a JSON document is retrieved with the full results of the WA evaluation of the target URL, as presented in Listing 3.1.

1 {"test":{ 2 "website_archivability": 91, 3 "Metadata":100 4 "Standards_Compliance":73, 5 "Accessibility":88, 6 "Cohesion":71, 7 }, 8 "": "http://auth.gr/", 9 "messages":[ 38 An Innovative Method to Evaluate Website Archivability

10 {"title":"Invalid CSS http://dididownload.com/wp-content/ themes/didinew/style.css. Located 8 errors, 78 warnings .", 11 "attribute":"html", 12 "facets":["Standards_Compliance"], 13 "level":0, 14 "significance":"LOW", 15 "message":"Webpages which do not conform with Web Standards have a lower possibility to be preserved correctly", 16 "ref":"http://jigsaw.w3.org/css-validator/validator?uri= http://dididownload.com/wp-content/themes/didinew/style .css&warning=0&profile=css3"}, 17 .... 18 ] 19 }

Listing 3.1: ArchiveReady API JSON output

The JSON output can be easily used by third-party programs; in fact, all evaluations in Section 3.4 are conducted this way. Another significant interoperability feature of the ArchiveReady platform is its ability to output Evaluation and Report Language (EARL) XML [2], the W3C standard for expressing test results. EARL XML enables users to assert WA evaluation results for any website in a flexible way.
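The sketch below shows how a third-party Python program might call the API and read the overall score. Only the fields visible in Listing 3.1 are used; error handling and any further fields of the response are omitted.

    import requests

    response = requests.get('http://archiveready.com/api',
                            params={'url': 'http://auth.gr/'})
    result = response.json()

    print('Website Archivability:', result['test']['website_archivability'])
    for message in result['messages']:
        print(message['significance'], message['title'])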

3.4 Evaluation

We present our evaluation methodology and its limits, followed by two independent experiments which support the validity of the WA metric. Then, we use another experiment to prove that the WA variance for the webpages of the same website is very small.

3.4.1 Methodology and Limits

Our evaluation has two aims. The first is to prove the validity of the WA metric by experimenting on assorted datasets and by expert evaluation. The second is to validate our claim that it is only necessary to evaluate a single webpage from a website to calculate a good approximation of its WA value.

In our experiments, we use Debian GNU/Linux 7.3, Python 2.7.6 and an Intel Core i7-3820, 3.60 GHz processor. The Git repository for this work29 contains the necessary data, scripts and instructions to reproduce all the evaluation experiments presented here. WA is a new concept and even though our method has solid foundations, there are still open issues regarding the evaluation of all WA Facets and the definition of a dataset of websites to be used as a Gold Standard:

1. The tools we have at our disposal are limited and cannot cope with the latest developments on the web. For instance, web browser vendors are free to implement extensions to the CSS specifications that, in most cases, are proprietary to their browser30. The official W3C CSS Standard31 is evolving to include some of these new extensions but the process has an inherent delay. As a result, the state-of-the-art W3C CSS validator we use in our system to validate the target website CSS may return false negative results. This problem is apparent in all W3C standards validators. As a result, the Standards Compliance (퐹푆) evaluation is not always accurate. It must be noted, though, that the W3C validators are improving at a steady rate and any improvement is utilised automatically by our system, as we use the W3C validators as web services. Another aspect of this issue is that experts evaluating the live as well as the archived version of a website depend mainly on their web browsers to evaluate the website quality, using mostly visual information. The problem is that HTML documents which do not follow W3C standards may appear correct to the viewer even if they contain serious errors, because the web browser operates in "Quirks Mode" [34] and has particular algorithms to mitigate such problems. Thus, a website may appear correctly in a current browser but may not do so in a future browser, because the error mitigation algorithms are not standardised and depend on the web browser vendor. As a result, it is possible that experts evaluating a website report that it has been archived correctly while the 퐹푆 evaluation results are not equally good.

2. The situation described above regarding standards compliance raises issues about the accuracy of the Accessibility Facet (퐹퐴) evaluation. Web crawlers try to mitigate the errors they encounter in web resources with various levels of success, affecting their capability to access all website content. Their success depends on the sophistication of their error mitigation algorithms. In contrast, the 퐹퐴 rating of websites having such errors will definitely be low. For instance, a web crawler may access a sitemap.xml which contains invalid XML. If it uses a strict XML parser, it will fail to parse it and retrieve its URLs to proceed with web crawling. If it uses a relaxed XML parser, it will be able to retrieve a large number of its URLs and it will access more website content. In either case, the 퐹퐴 rating will suffer.

3. The Cohesion (퐹퐶) of a website does not directly affect its archiving unless one or more servers hosting its resources become unreachable during the time of archiving. The possibility of encountering such a case when running a WA experiment is very low. Thus, it is very difficult to measure it in an automated way.

4. Metadata are a major concern for digital curation, as discussed in Section 3.2.2. Nevertheless, the lack of metadata in a web archive does not have any direct impact on the user; archived websites may appear correctly although some of their resources may

29https://github.com/vbanos/web-archivability-journal-paper-data-2014
30http://reference.sitepoint.com/css/vendorspecific, accessed August 1, 2015
31http://www.w3.org/Style/CSS/, accessed August 1, 2015

lack correct metadata. This deficiency may become significant in the future, when web archivists need to render or process some "legacy" web resources and do not have the correct information to do so. Thus, it is also challenging to evaluate the 퐹푀 Facet automatically.

5. The granularity of specific evaluations could be improved in the future to increase the accuracy of the method. Currently, the evaluations are either binary (100%/0% stands for a successful/failed evaluation) or relative percentage evaluations (for instance, if 9 out of 10 hyperlinks are valid, the relevant evaluation score is 90%). There are some binary evaluations, though, which might be better defined as a relative percentage. For example, we have 퐴2: Check if inline JavaScript code exists in HTML. We are certain that inline JavaScript code causes problems to web archiving, so we assign a 100% score if no inline JavaScript code is present and 0% in the opposite case. Ideally, we should assign a relative percentage score based on multiple parameters, such as the specific number of inline JavaScript scripts, their sizes, the type of inline code, its complexity and other JavaScript-specific details. The same also applies to many evaluations such as 푆1: HTML standards compliance, 푆4: RSS feed standards compliance, 푆7: CSS standards compliance and 퐴6: Robots.txt "Disallow:" rules.

With these concerns in mind, we consider several possible methods to perform the evaluation. First, we could survey domain experts: we could ask web archivists working in IIPC member organisations to judge websites. However, this method is impractical because we would need to spend significant time and resources to evaluate a considerable number of websites. A second alternative would be to devise datasets based on thematic or domain classifications, for instance, websites of similar organisations from around the world. A third alternative would be to manually check the way a set of websites is archived in a web archive and evaluate all their data, attributes and behaviours in comparison with the original website. We choose to implement both the second and the third method.

3.4.2 Experimentation with Assorted Datasets

To study WA with real-world data, we conduct an experiment to see if high quality websites, according to some general standards, have better WA than low quality websites. We devise a number of assorted datasets with websites of varying themes, as presented in Table 3.10. We evaluate their WA using the ArchiveReady.com API (Section 3.3.4) and finally, we analyse the results.

We define three datasets of websites (퐷1, 퐷2, 퐷3) with certain characteristics:

• They belong to important educational, government or scientific organisations from all around the world.

• They are developed and maintained by dedicated personnel and/or specialised IT companies.

• They are used by a large number of people and are considered very important for the operation of the organisation they belong to.

We also choose to create a dataset (퐷4) of manually selected spam websites which have the following characteristics:

• They are created automatically by website generators in large numbers.

• Their content is generated automatically.

• They are neither maintained nor evaluated for their quality at all.

• They have relatively very few visitors.

It is important to highlight that a number of websites from all these datasets could not be evaluated by our system for various technical reasons. This means that these websites may also pose the same problems to web archiving systems. The reasons for these complications may be one or more of the following:

• The websites do not support web crawlers and refuse to send content to them. This may be due to security settings or technical incompetence. In any case, web archives would not be able to archive these websites.

• The websites were not available at the time of the evaluation.

• The websites returned some kind of problematic data which resulted in the abnormal termination of the ArchiveReady API during the evaluation.

It is worth mentioning that 퐷4, the list of manually selected spam websites, had the most problematic entries: 42 out of 120 could not be evaluated at all. In comparison, 8 out of 94 IIPC-related websites (퐷1) could not be evaluated, 13 out of 200 (퐷2) and 16 out of 450 (퐷3). We conducted the WA evaluation using a Python script which calls the ArchiveReady.com API and records the outcomes in a file. We calculate the WA distribution for all four datasets. Also, we calculate the average, median, min, max and standard deviation for these datasets, present the results in Table 3.11 and depict them in Figure 3.6 using boxplots.
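The summary statistics of Table 3.11 can be reproduced from the recorded scores with the Python statistics module, roughly as follows; the input file format (one WA score per line) is an assumption.

    import statistics

    with open('wa-scores-d1.txt') as f:
        scores = [float(line) for line in f if line.strip()]

    print('Average:', statistics.mean(scores))
    print('Median:', statistics.median(scores))
    print('Min:', min(scores), 'Max:', max(scores))
    print('StDev:', statistics.stdev(scores))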

From these results we can observe the following: datasets 퐷1, 퐷2 and 퐷3, which are considered high quality, have a high WA value distribution, as illustrated in Figure 3.7. This is also evident from the statistics presented in Table 3.11. The average WA values are 75.87, 80.08

Id   Description                                                                                                                Raw Data   Clean Data
퐷1   A set of websites from a pool of international web standards organisations, national libraries, IIPC members and other high profile organisations in these fields.   94   86
퐷2   The first 200 of the top universities according to the Academic Ranking of World Universities [90], also known as the "Shanghai list".   200   187
퐷3   A list of government organisation websites from around the world.   450   434
퐷4   A list of manually selected spam websites from the top 1 million websites published by Alexa.   120   78

Table 3.10: Description of assorted datasets

Function       퐷1      퐷2     퐷3     퐷4
Average (WA)   75.87   80.08  80.75  58.37
Median (WA)    77.5    81     81     58.75
Min (WA)       41.75   56     54     33.25
Max (WA)       93.25   96     96     84.25
StDev (WA)     10.16   6.11   7.06   11.63

Table 3.11: Comparison of WA statistics for assorted datasets.

Figure 3.6: WA statistics for assorted datasets box plot.

and 80.75. The median WA values are also similar. On the contrary, 퐷4 websites, which are characterised as low quality, have remarkably lower WA values, as shown in Table 3.11 and in Figure 3.6. The average WA value is 58.37 and the median value is 58.75. Thus, lower quality websites are prone to issues which make them difficult to archive. Finally, the standard deviation values are in all cases quite low. As the WA range is [0..100], standard deviation values of approximately 10 or less indicate that our results are strongly consistent, for both lower and higher WA values. To conclude, this experiment indicates that higher quality websites have higher WA than lower quality websites. This outcome is confirmed not only by the WA scores themselves but also by another indicator revealed during the experiment: the percentage of completed WA evaluations for each dataset.

3.4.3 Evaluation by Experts

To evaluate the validity of our metrics, a reference standard has to be employed. It is important to note that this task requires careful and thorough investigation, as has already been elaborated in existing works [81, 140].

Figure 3.7: WA distribution for assorted datasets.

With the contribution of 3 post-doc researchers and PhD candidates in informatics from the Delab laboratory32 of the Department of Informatics at Aristotle University, who assist us as experts, we conduct the following experiment: we use the first 200 websites of the top universities according to the Academic Ranking of World Universities of 2013 as a dataset (퐷2 from Section 3.4.2). We review the way they are archived in the Internet Archive and rate their web archiving on a scale of 0 to 10. We select the Internet Archive because, to the best of our knowledge, it is the most popular web archiving service. More specifically, for each website we conduct the following evaluation:

1. We visit http://archive.org, enter the URL and open the latest snapshot of the website.

2. We visit the original website.

3. We evaluate the two instances of the website and assign a score from 0 to 10 depending on the following criteria:

(a) Compare the views of the homepage and try to find visual differences and things missing in the archived version (3.33 points).
(b) Inspect dynamic menus or other moving elements in the archived version (3.33 points).
(c) Visit random website hyperlinks to evaluate whether they are also captured successfully (3.33 points).

After analysing all websites, we conduct the WA evaluation for the same websites with a Python script which uses the archiveready.com API (Section 3.3.4). We record the outcomes in a file and calculate Pearson's Correlation Coefficient between WA, the WA Facets and the expert scores. We present the results in Table 3.12. From these results, we observe that the correlation between WA and the experts' rating is 0.516, which is quite significant taking into consideration the discussion of the limits presented in Section 3.4.1. It is also important to highlight the lack of correlation between different

32http://delab.csd.auth.gr/

       퐹퐴      퐹퐶      퐹푀      퐹푆      WA     Exp.
퐹퐴     1.000
퐹퐶     0.060   1.000
퐹푀     0.217   -0.096  1.000
퐹푆     0.069   0.060   0.019   1.000
WA     0.652   0.398   0.582   0.514   1.000
Exp.   0.384   0.263   0.282   0.179   0.516  1.000

Table 3.12: Correlation between WA, WA Facets and Experts rating.

WA Facets. The correlation indicators between 퐹퐴 − 퐹퐶, 퐹퐴 − 퐹푆, 퐹퐶 − 퐹푀, 퐹퐶 − 퐹푆 and 퐹푀 − 퐹푆 are very close to zero, ranging from -0.096 to 0.069. There is only a very small correlation in the case of 퐹퐴 − 퐹푀, 0.217. Practically, there is no correlation between different WA Facets, confirming the validity and strength of the CLEAR+ method. The WA Facets are different perspectives of WA; if there were a correlation between them, their differences would not be so significant. This experiment confirms that the WA Facets are essentially independent. Finally, we conduct a One Way Analysis of Variance (ANOVA) [94] and calculate the 퐹-value = 397.628 and the 푃-value = 2.191e-54. These indicators show that our results are statistically significant.
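The correlation and ANOVA figures reported above can be computed, for example, with SciPy. The sketch below assumes two equal-length lists of WA ratings and expert ratings; the exact grouping used for the ANOVA is not detailed here, so the call only illustrates the computation, and the placeholder values are not real data.

    from scipy.stats import pearsonr, f_oneway

    wa_scores = [91.0, 77.5, 83.25]      # WA ratings (placeholder values)
    expert_scores = [8.5, 6.0, 7.5]      # expert ratings on the 0-10 scale

    r, p = pearsonr(wa_scores, expert_scores)
    print('Pearson correlation:', r)

    # One-way ANOVA across the two groups of ratings.
    f_value, p_value = f_oneway(wa_scores, expert_scores)
    print('F-value:', f_value, 'P-value:', p_value)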

3.4.4 WA Variance in the Same Website

We argue that the CLEAR+ method only needs to evaluate the WA value of a single webpage, based on the assumption that webpages from the same website share the same components, standards and technologies. We also claim that the website homepage has a representative WA score. This is important because it is common for users of the CLEAR+ method to evaluate the homepage of a website, and we have to confirm that it has a representative WA value. Therefore, we conduct the following experiment:

1. We use the Alexa top 1 million websites dataset33 and we select 1000 random websites.

2. We retrieve 10 random webpages from each website to use as a test sample. To this end, we decided to use their RSS feeds.

3. We perform RSS feeds auto-detection and we finally identify 783 websites, which are suited for our experiment.

4. We evaluate the WA for 10 individual webpages for each website and record the results in a file.

5. We calculate the WA average (WA_average) and standard deviation (StDev(WA_average)) for each website.

33http://s3.amazonaws.com/alexa-static/top-1m.csv.zip

6. We calculate and store the WA of the homepage of each website (WA_homepage) as an extra variable (a computation sketch for steps 5 and 6 is given after this list).
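A computation sketch for steps 5 and 6 is given below; the per-website data structure and the placeholder values are assumptions for illustration.

    import statistics

    def website_wa_summary(page_scores, homepage_score):
        # page_scores: WA values of the 10 sampled webpages of one website.
        wa_average = statistics.mean(page_scores)
        wa_stdev = statistics.stdev(page_scores)
        homepage_diff = abs(homepage_score - wa_average)
        return wa_average, wa_stdev, homepage_diff

    # Example for one website (placeholder values):
    avg, sd, diff = website_wa_summary([78, 80, 79, 81, 80, 79, 78, 80, 81, 79], 82)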

Figure 3.8: WA average rating and standard deviation values, as well as the homepage WA for a set of 783 random websites.

We plot the variables WA_average, StDev(WA_average) and WA_homepage for each website, in descending order of WA_average, in Figure 3.8. The x-axis represents each evaluated website, whereas the y-axis represents WA. The red cross (+) markers, which appear as a seemingly continuous line starting from the top left and ending at the center right of the diagram, represent the WA_average values for each website. The blue star (*) markers which appear around the red markers represent the WA_homepage values. The green square markers at the bottom of the diagram represent StDev(WA_average). From the outcomes of our evaluation we draw the following conclusions:

1. While the average WA for the webpages of the same website may vary significantly, from 50% to 100%, the WA standard deviation does not behave in the same manner: it is extremely low. More specifically, its average is 0.964 points on the 0-100 WA scale and its median is 0.5. Its maximum value is 13.69 but this is an outlier; the second biggest value is 6.88. This means that WA values are consistent for webpages of the same website.

2. The WA standard deviation for webpages of the same website does not depend on the average WA of the website. As depicted in Figure 3.8, regardless of the WA_average value, the StDev(WA_average) value remains very low.

3. The WA of the homepage is near the average WA for most websites. Figure 3.8 indicates that the WA_homepage values are always around the WA_average values, with very few outliers. The average absolute difference between WA_average and WA_homepage for all websites is 3.87 and its standard deviation is 3.76. The minimum value is obviously 0 and the maximum is 25.9.

4. Although WA_homepage is near WA_average, we observed that its value is usually higher. Out of the 783 websites, in 510 cases WA_homepage is higher, in 35 it is exactly equal and in 238 it is lower than WA_average. Even though the difference is quite small, it is notable.

Our conclusion is that our initial assumptions are valid: the variance of WA for the webpages of the same website is remarkably small. Moreover, the homepage WA is quite similar to the average, with a small bias towards higher WA values, which is quite interesting. A plausible explanation for this phenomenon is that website owners spend more resources on the homepage than on any other page, because it is the most visited part of the website. Overall, we can confirm that it is justified to evaluate WA using only the website homepage.

3.5 Web Content Management Systems Archivability

Web Content Management Systems (WCMS) are widely adopted and account for much of the web's activity. For instance, just a single WCMS company, WordPress, reported more than 1 million new posts and 1.5 million new comments each day [147]. WCMS are created in various programming languages, using many new web technologies [53]. We believe that the wide adoption of WCMS has implications for web archiving and needs to be taken into consideration. WCMS constitute a common technical framework which may facilitate or hinder web archiving for a large number of websites. If a web archive is compatible with a certain WCMS, it is highly probable that it will be able to archive all websites built with this WCMS. We evaluate the WA of 12 prominent WCMS to identify their strengths and weaknesses and propose improvements to web content extraction and archiving. We conduct an experimental evaluation using a nontrivial dataset of websites based on these WCMS and make observations regarding their WA characteristics. We also come up with specific suggestions for each WCMS based on our experimental data. Our aim is to improve web archiving practice by indicating potential issues to the WCMS development community. If our findings result in advances in the WCMS source code upstream, all web archiving initiatives will benefit, as the websites based on these WCMS will become more archivable. In the following, we present our method, results and conclusions.

3.5.1 Website Corpus Evaluation Method

We use 5,821 random WCMS samples from the Alexa top 1 million websites34 as our experimental dataset. We use this dataset because it contains high quality websites from multiple domains and disciplines. This dataset is also used in other related research [142, 60]. We select our corpus with the following process:

1. We implement a simple Python script to visit each homepage and look for the generator meta tag (a minimal detection sketch is given after this list).

34http://s3.amazonaws.com/alexa-static/top-1m.csv.zip

2. For each website having the required meta tag, we check whether it belongs to one of the WCMS listed in Wikipedia35. If yes, we record it in our database.

3. We continue this process until we have a significant number of instances for 12 WCMS (Blogger, DataLife Engine, DotNetNuke, Drupal, Joomla, Mediawiki, MovableType, Plone, PrestaShop, Typo3, vBulletin, Wordpress).

4. We evaluate each website using the ArchiveReady REST API and record the outcomes in our database.

5. We analyse the results using SQL to calculate various metrics.
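A minimal version of the detection performed in steps 1 and 2 could look as follows; the WCMS name matching is simplified and the list of names is abbreviated for the sketch.

    import requests
    from bs4 import BeautifulSoup

    KNOWN_WCMS = ['WordPress', 'Joomla', 'Drupal', 'Plone', 'MediaWiki']  # abbreviated

    def detect_wcms(url):
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, 'html.parser')
        meta = soup.find('meta', attrs={'name': 'generator'})
        if not meta or not meta.get('content'):
            return None
        generator = meta['content'].lower()
        for wcms in KNOWN_WCMS:
            if wcms.lower() in generator:
                return wcms
        return None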

The generator meta tag is not used universally on the web for a variety of reasons, such as security. Thus, we have skipped a large number of websites which did not indicate the system they use. Also, we did not take into consideration the version number of each WCMS, as it would be impractical: there would be too many different variables in our experiment to conduct useful research. Moreover, it is highly improbable that the top internet websites would use legacy versions of their WCMS. The Git repository for this work36 contains all the captured data and the necessary scripts to reproduce all the evaluation experiments.

3.5.2 Evaluation Results and Observations

For each WCMS, we present the average and standard deviation of each WA Facet, as well as the cumulative WA (Figure 3.9). First of all, our results are consistent: while the WA Facet range is 0-100%, the standard deviation of the WA Facet values for each WCMS ranges from 4.2% (Blogger, 퐹퐴) to 13.2% (Mediawiki, 퐹푆). There are considerable differences between WCMS regarding their WA. The top WCMS is DataLife Engine with a WA score of 83.52%, with Plone and Drupal also scoring very high (83.06% and 82.08%). The rest of the WCMS score between 80.3% and 77.2%, whereas the lowest score belongs to Blogger (65.91%). In many cases, even though two or more WCMS may have similar WA scores, their WA Facet scores differ significantly and each WCMS has different strengths and weaknesses. Thus, it is worthwhile to look into the differences per WA Facet.

퐹퐴: Top value is around 69.85% for Blogger and 69.51% for DataLife Engine, whereas the minimum value is below 60, at 56.29% for Mediawiki and 58.15% for DotNetNuke.

퐹푀: The top value is 99.24% for Mediawiki, whereas the minimum value is 76.17% for DotNetNuke. The difference between the minimum and the maximum value is around 23 points, which is almost twice the corresponding range for 퐹퐴 (13 points).

퐹퐶 : Appears to have the greatest differentiation between WCMS. The minimum value is only 7.38% for Blogger and the maximum value is 96.01% for DotNetNuke. At first sight, there seems to be an issue with the way Blogger is using multiple online services to host its web resources. Other WCMSs also vary from 78.5% (MovableType) to 92% (Plone), which is a considerable variation.

퐹푆 : Range is between 71.42% for Mediawiki and 88.06% for PrestaShop. Again these dif- ferences should be considered significant.

35http://en.wikipedia.org/wiki/List_of_content_management_systems, accessed August 1, 2015
36https://github.com/vbanos/wcms-archivability-paper-data, accessed August 1, 2015

Figure 3.9: WA Facets average values and standard deviation for each WCMS

퐹퐴 has the smallest differentiation and 퐹퐶 has the greatest one among all WA Facets. We continue our research with more detailed observations regarding specific evaluations. Due to the large number of WA evaluations and the space restrictions, we cannot present everything; we choose to discuss only highly significant rules. Similar research can easily be conducted by anyone interested, using the full dataset and source code available on GitHub. We present our observations grouped by the four WA Facets.

퐹퐴: Accessibility

Accessibility refers to the web archiving systems’ ability to traverse all website content via standard HTTP protocol requests [54].

퐴1: The percentage of valid versus invalid hyperlink and CSS URLs (Table 3.13). These are critical for web archives to retrieve all WCMS published content. Hyperlinks are created not only by users but also by WCMS subsystems. In any case, some WCMS check whether they are valid whereas others do not. In addition, some WCMS may produce invalid hyperlinks due to bugs. The results show that not all WCMS have the same frequency of invalid hyperlinks: Joomla and Typo3 have the lowest percentage of valid URLs (88% and 89%), whereas Blogger, Mediawiki, Drupal and MovableType have the highest (97% and 96%).

WCMS             Valid URLs  Invalid URLs  Correct (%)
Blogger          45425       1148          97%
Mediawiki        39178       1763          96%
Drupal           52501       2185          96%
MovableType      22442       1009          96%
vBulletin        104492      5841          95%
PrestaShop       57238       3287          94%
DataLife Engine  31981       2342          93%
Plone            25719       1856          93%
Wordpress        47717       3515          93%
DotNetNuke       38144       2791          93%
Typo3            30945       3747          89%
Joomla           37956       4886          88%

Table 3.13: 퐴1: The percentage of valid URLs. Higher is better.

퐴2: The number of inline JavaScript scripts per WCMS instance (Table 3.14). The excessive use of inline scripts in modern web development results in web archiving problems. Plone, MovableType and Typo3 have the lowest number of inline scripts per instance (4.82, 6.82 and 6.89). The highest usage by far comes from Blogger (27.11), while Drupal (15.09) and vBulletin (12.38) follow.

WCMS             Instances  Inline scripts  Scripts/instance
Plone            431        2076            4.82
MovableType      295        2011            6.82
Typo3            624        4298            6.89
Mediawiki        408        3753            9.20
DataLife Engine  321        3159            9.84
Wordpress        863        8646            10.02
DotNetNuke       598        6028            10.08
Joomla           501        5163            10.31
PrestaShop       466        5130            11.01
vBulletin        462        5721            12.38
Drupal           528        7969            15.09
Blogger          324        8783            27.11

Table 3.14: 퐴2: The number of inline scripts per WCMS instance. Lower is better.

The Sitemap.xml protocol is meant to create files which include references to all the webpages of a website [127]. Sitemap.xml files are generated automatically by WCMS when their content is updated. The results of the 퐴3 evaluation (Table 3.15) indicate that most WCMS lack proper support for this feature. Only DataLife Engine has a very high score (86%). Wordpress and Drupal also score over 60%. All other WCMS perform very poorly, which is surprising.

WCMS             Instances  Issues  Correct
DataLife Engine  321        46      86%
Wordpress        863        272     68%
Drupal           528        189     64%
PrestaShop       466        237     49%
MovableType      295        152     48%
Typo3            624        322     48%
Plone            431        249     42%
vBulletin        462        329     29%
Joomla           501        359     28%
Blogger          324        240     26%
DotNetNuke       598        461     23%
Mediawiki        408        335     18%

Table 3.15: 퐴3 Sitemap.xml is present. Higher is better.

퐹퐶 : Cohesion

Cohesion is relevant to the level of dispersion of files comprising a single website to multiple servers in different domains. The lower the dispersion of a website’s files, the lower the susceptibility to errors because of a failed third-party system. We evaluate the performance for two 퐹퐶 related evaluations.

퐶1: The percentage of local versus remote images (Table 3.16). Blogger suffers from the highest dispersion of images. In contrast, Plone, DotNetNuke, PrestaShop, Typo3 and Joomla have the highest scores, over 90%.

퐶2: The percentage of local versus remote CSS (Table 3.17). Again, Blogger has a very low score (2%), whereas every other WCMS performs very well.

WCMS             Local images  Remote images  Percentage
Plone            7833          290            96%
DotNetNuke       13136         680            95%
PrestaShop       19910         1187           94%
Typo3            15434         897            94%
Joomla           14684         1251           92%
MovableType      8147          1388           86%
Drupal           16636         3169           84%
vBulletin        11319         2314           83%
Wordpress        20350         4236           83%
Mediawiki        4935          1127           81%
DataLife Engine  9638          2356           80%
Blogger          1498          8121           16%

Table 3.16: 퐶1: The percentage of local versus remote images. Higher is better.

WCMS             Local CSS  Remote CSS  Percentage
DotNetNuke       5243       101         98%
Typo3            3365       154         96%
Plone            1475       72          95%
Joomla           4539       222         95%
DataLife Engine  919        56          94%
PrestaShop       5221       400         93%
MovableType      578        42          93%
vBulletin        1459       104         93%
Mediawiki        1120       84          93%
Drupal           2320       354         87%
Wordpress        5658       1019        85%
Blogger          18         954         2%

Table 3.17: 퐶2: The percentage of local versus remote CSS. Higher is better.

퐹푆 : Standards Compliance

Standards compliance is a necessary precondition in digital curation practices [39]. We evaluate 푆1: validate whether the HTML source code complies with the W3C standards, using the W3C HTML validator, and present the results in Table 3.18.

WCMS             Instances  Errors  Errors/Instance
Plone            431        12205   28.32
Mediawiki        408        14032   34.39
Typo3            624        23965   38.41
Wordpress        863        35805   41.49
Joomla           501        26609   53.11
PrestaShop       466        30066   64.52
DotNetNuke       598        43009   71.92
Drupal           528        47131   89.26
vBulletin        462        46466   100.58
MovableType      295        29994   101.67
DataLife Engine  321        34768   108.31
Blogger          324        71283   220.01

Table 3.18: 푆1 HTML errors per instance. Lower is better.

Plone has the lowest number of errors per instance (28.32), followed by Mediawiki (34.39) and Typo3 (38.41). On the contrary, Blogger has the most errors per instance (220.01), followed at some distance by DataLife Engine (108.31) and MovableType (101.67).

푆3: The usage of QuickTime and Flash formats is considered problematic for web archiving because web crawlers cannot process their contents to extract information, including web resource references. The results show that their use is very low in all WCMS (Table 3.19).

WCMS             Instances  No proprietary files  Success
PrestaShop       466        460                   99%
Mediawiki        408        398                   98%
Blogger          324        310                   96%
Plone            431        412                   96%
Wordpress        863        821                   95%
Typo3            624        592                   95%
vBulletin        462        434                   94%
Drupal           528        494                   94%
DotNetNuke       598        548                   92%
DataLife Engine  321        294                   92%
MovableType      295        263                   89%
Joomla           501        439                   88%

Table 3.19: 푆2 The lack of use of proprietary files (Flash, QuickTime). Higher is better.

푆4: Check whether the RSS feed format complies with W3C standards. The results (Table 3.20) indicate that Blogger has mostly correct feeds (91%), whereas every other WCMS has varying levels of correctness. The lowest scores belong to Mediawiki (2%) and DotNetNuke (13%). In general, the results show that there is a problem with RSS feed standards compliance.

WCMS             Valid feeds  Invalid feeds  Correct
Blogger          872          83             91%
DataLife Engine  240          57             81%
Wordpress        1283         317            80%
Joomla           556          141            80%
vBulletin        299          96             76%
MovableType      271          120            69%
Drupal           133          74             64%
PrestaShop       82           112            42%
Typo3            124          191            39%
Plone            116          184            39%
DotNetNuke       2            14             13%
Mediawiki        10           521            2%

Table 3.20: 푆4: Valid feeds. Higher is better.

퐹푀 : Metadata usage

The lack of metadata impairs the archive's ability to manage content effectively. Websites include a lot of metadata, which need to be communicated correctly in order to be utilised by web archives [101].

WCMS             Instances  Exists  Success
Blogger          324        324     100%
Drupal           528        527     100%
MovableType      295        294     100%
vBulletin        462        458     99%
Plone            431        427     99%
Typo3            624        618     99%
Joomla           501        494     99%
DotNetNuke       598        589     98%
Mediawiki        408        401     98%
DataLife Engine  321        315     98%
PrestaShop       466        456     98%
Wordpress        863        841     97%

Table 3.21: 푀1: HTTP Content-Type header. Higher is better.

푀1: Check whether the HTTP Content-Type header exists (Table 3.21). There is virtually no issue with the HTTP Content-Type header in any WCMS; their performance is excellent.

푀2: Check whether any HTTP caching headers (Expires, Last-Modified or ETag) are set. HTTP caching is highly relevant to accessibility and performance. Blogger, Mediawiki, Drupal, DataLife Engine and Plone have very good support for HTTP caching headers (Table 3.22).

WCMS             Instances  Issues  Percentage
Blogger          324        3       99%
Mediawiki        408        12      97%
Drupal           528        23      96%
DataLife Engine  321        16      95%
Plone            431        49      89%
MovableType      295        106     64%
Joomla           501        186     63%
Wordpress        863        466     46%
Typo3            624        364     42%
vBulletin        462        326     29%
PrestaShop       466        388     17%
DotNetNuke       598        569     5%

Table 3.22: 푀2: HTTP caching headers. Higher is better.

3.5.3 Discussion

We evaluated 12 prominent WCMS and presented specific results and statistics regarding their WA Facets. We concluded that not all WCMS can be considered equally archivable. Each one has its own strengths and weaknesses, which we highlight in the following:

1. Blogger has by far the worst overall WA score (65.91%, Figure 3.9), mainly due to its very low 퐹퐶. Blogger files are dispersed across multiple web services, which increases the possibility of errors in case one of them fails. In addition, Blogger scores very low in many metrics, such as the number of inline scripts per instance (Table 3.14) and HTML errors per instance (Table 3.18). On the contrary, Blogger scores very high regarding 퐹푀 and 퐹푆.

2. DataLife Engine has the highest WA score (83.52%). One aspect that its developers should look into is HTML errors per instance (Table 3.18), where it has the second worst score.

3. DotNetNuke has the second worst WA score in our evaluation (77.2%). 퐹퐶 is its strong point (96.01%) but it has issues in every other area. We suggest that its developers look into its RSS feeds (13% correct, Table 3.20) and the lacking HTTP caching support (5%, Table 3.22).

4. Drupal has the third highest WA score (82.08%). It has good overall performance and the only issue is the existence of too many inline scripts per instance (15.09, Table 3.14).

5. Joomla's WA score is average (80.37%). It has a large number of invalid URLs per instance (12%, Table 3.13) and it also has the highest usage of proprietary files (12%, Table 3.19), which is not good for accessibility and preservation.

6. Mediawiki's WA score is low (77.81%). This can be attributed to mostly invalid feeds (only 2% are correct according to standards) and very low sitemap.xml support (18%, Table 3.15).

7. MovableType's WA score is average (80.02%). It does not stand out in any evaluation, either in a positive or a negative way. General improvement in all areas would be welcome.

8. Plone has the second highest WA score (83.06%). It must be commended for having the lowest number of HTML errors per instance (28.32, Table 3.18) and very high 퐹퐶 scores (96% for images, Table 3.16, and 95% for CSS, Table 3.17).

9. PrestaShop's WA score is average (79%). It has average scores in all evaluations but it should be commended for not using proprietary files (top score: 99% in Table 3.19).

10. Typo3's WA score is average (79%). It has one of the largest shares of invalid URLs per instance (11%, Table 3.13).

11. vBulletin's WA score is consistently low (78.37%). General improvement in all areas would be welcome.

12. Wordpress' WA score is average (78.47%). We cannot highlight a specific area where it should be improved. As this is currently the most popular WCMS, the Wordpress developers should look into all WA Facets and try to improve them.

We recommend that the WCMS development communities investigate the presented issues and resolve them, as many are easy to fix without causing any problems for existing users and installations. If the situation regarding the highlighted issues improves in the next releases of the investigated WCMS, the impact will be significant. A large number of websites which could not be archived correctly would no longer have these issues once they update their software, and newly created websites based on these WCMS would be more archivable. Web archiving operations around the world would see great improvement, resulting in general advancements in the state of web archiving.

3.6 Conclusions

We presented our extended work towards the foundation of a quantitative method to evaluate WA. The Credible Live Evaluation of Archive Readiness Plus (CLEAR+) method to evaluate Website Archivability has been elaborated in great detail, the key Facets of WA have been defined and the method of calculating them has been explained in theory and practice. In addition, we presented the ArchiveReady system, which is the reference implementation of CLEAR+. We covered all aspects of the system, including design decisions, technologies, workflows and interoperability APIs. We believe that it is quite important to explain how the reference implementation of CLEAR+ works, because transparency raises confidence in the method. A critical part of this work is also the experimental evaluation. First, we performed experimental WA evaluations of assorted datasets and observed the behaviour of our metrics. Then, we conducted a manual characterisation of websites to create a reference standard and identified correlations with WA. Both evaluations provided very positive results, which support the claim that CLEAR+ can be used to identify whether a website has the potential to be archived with correctness and accuracy. We also showed experimentally that the CLEAR+ method needs to evaluate only a single webpage to calculate the WA of a website, based on the assumption that webpages from the same website share the same components, standards and technologies. Finally, we evaluated the WA of the most prevalent WCMS, one of the common technical denominators of current websites. We investigated the extent to which each WCMS meets the conditions for a safe transfer of its content to a web archive for preservation purposes, and thus identified their strengths and weaknesses. More importantly, we deduced specific recommendations to improve the WA of each WCMS, aiming to advance the general practice of web data extraction and archiving. Introducing a new metric to quantify the previously unquantifiable notion of WA is not an easy task. We believe that, with the CLEAR+ method and the WA metric, we have captured the core aspects of a website that are crucial in diagnosing whether it has the potential to be archived with correctness and accuracy.

Chapter 4

Near-duplicate and Cycle Detection in Webgraphs towards Optimised Web Crawling

The performance and efficiency of web crawling is important for many applications, such as search engines, web archives and online news. We propose methods to optimise web crawling via duplicate and near-duplicate webpage detection. Using webgraphs to model web crawling, we perform webgraph edge contractions and detect web spider traps, improving the performance and efficiency of web crawling as well as the quality of its results. We introduce http://webgraph-it.com (WebGraph-It), a web platform which implements the presented methods, and conduct extensive experiments using real-world web data to evaluate the strengths and weaknesses of our methods1.

4.1 Introduction

Websites have become large and complex systems, which require strong software systems to be managed effectively [22]. Web content extraction, or web crawling, is becoming increasingly important. It is crucial to have web crawlers capable of efficiently traversing websites to harvest their content. The sheer size of the web, combined with an unpredictable publishing rate of new information, calls for a highly scalable system, while the lack of programmatic access to the complete web content makes the use of automatic extraction techniques necessary.

1This chapter is based on the following publication: • Banos V., Manolopoulos Y.: "Near-duplicate and Cycle Detection in Webgraphs towards Optimised Web Crawling", ACM Transactions on the Web Journal, submitted, 2015.

Special software systems, web crawlers, also known as "spiders" or "bots", have been created to conduct web crawling with efficiency and performance at large scale. They are self-acting agents that navigate around-the-clock through the hyperlinks of the web, harvesting topical resources without human supervision [112]. Essentially, a web crawler starts from a seed webpage and then uses the hyperlinks within it to visit other webpages. This process repeats with every new webpage until some conditions are met (e.g. a maximum number of webpages is visited or no new hyperlinks are detected). Despite the simplicity of the basic algorithm, web crawling has many challenges [12]. In this work, we focus on addressing two key issues:

• There are a lot of duplicate or near-duplicate data captured during web crawling. Such data are considered superfluous and, thus, great effort is necessary to detect and remove them after crawling [95]. To the best of our knowledge, there is no method to perform this task during web crawling.

• Web spider traps are sets of webpages that cause web crawlers to make an infinite number of requests. They result in software crashes, web crawling disruption and excessive waste of computing resources [110]. There is no automated way to detect and avoid web spider traps; web crawling engineers use various heuristics with limited success.

These issues greatly impact the performance of web crawling systems and the user experience. We explore some fundamental web crawling concepts and present various methods to improve baseline web crawling in order to address them:

• Unique webpage identifier selection: The URI is the de facto standard for unique webpage identification, but web archiving systems also use the Sort-friendly URI Reordering Transform (SURT)2, a transformation applied to URIs which makes their left-to-right representation better match the natural hierarchy of domain names3. We suggest using SURT as an alternative unique webpage identifier for web crawling applications.

• Unique webpage identifier similarity: The unique URI is the de facto standard, but we also look into near-duplicates. It is possible that two near-duplicate URIs or SURTs belong to the same webpage.

• Webpage content similarity: Duplicate and near-duplicate webpage content detection can be used in conjunction with unique webpage identifier similarity.

• Webgraph edge contraction: Modelling websites as webgraphs [28] during crawling, we can apply node merging using the previous three concepts as similarity criteria and achieve webgraph edge contraction and cycle detection.

Using these concepts, we establish a theoretical framework as well as novel web crawling methods, which provide us with the following information for any target website: (i) unique and valid webpages, (ii) hyperlinks between them, (iii) duplicate and near-duplicate webpages, (iv) web spider trap locations, and (v) a webgraph model of the website.

2http://crawler.archive.org/apidocs/org/archive/util/SURT.html, accessed August 1, 2015
3http://crawler.archive.org/articles/user_manual/glossary.html, accessed August 1, 2015

We also present WebGraph-It, a system which implements our methods and is available at http://webgraph-it.com. Web crawling engineers can use WebGraph-It to preprocess websites prior to web crawling in order to get specific lists of URLs to avoid as duplicates or near-duplicates, get URLs to avoid as web spider traps, and generate webgraphs of the target website. Finally, we conduct an experiment with a non-trivial dataset of websites to evaluate the proposed methods. Our contributions can be summarised as follows:

• we propose a set of methods to detect duplicate and near-duplicate webpages in real time during web crawling.

• we propose a set of methods to detect web spider traps using webgraphs in real time during web crawling.

• we introduce WebGraph-It, a web platform which implements the proposed methods.

The remainder of this chapter is organised as follows: Section 4.2 presents the main concepts of our methods and introduces new web crawling methods that use them to detect duplicate and near-duplicate content, as well as web spider traps. Section 4.3 presents the system architecture of WebGraph-It. Section 4.4 presents our experiments and detailed results. Finally, Section 4.5 discusses the results and presents future work.

4.2 Method

In the following subsections, we propose some algorithms to detect duplicate web content during web crawling and avoid web spider traps. However, first we present some fundamental concepts before defining our methods.

4.2.1 Key Concepts

We model the web crawling process as a directed graph, which we call a webgraph. A webgraph relative to a certain set of URLs is a directed graph having those URLs as nodes, and with an arc from 푋 to 푌 whenever page 푋 contains a hyperlink towards page 푌 [24]. We model the web crawling of a single website as the generation and traversal of a webgraph in real time. When new webpages are identified, new nodes and arcs are added to the webgraph. This concept can be extended to crawling large numbers of websites or domains; however, in this work, we focus on the problem of crawling a single website in order to present and validate our ideas in a tangible way. The standard "naive" procedures of web crawling result in webgraphs with an excessive number of duplicate or near-duplicate nodes. Despite any heuristics used by web crawlers, their logic is static and they cannot cope with the changing nature of the web, as already presented in Section 2.1.2. Moreover, in many cases existing procedures result in webgraphs of infinite size, as web crawlers are tricked into detecting new webpages indefinitely when there is no new web content available. This issue is also known as web spider traps and has already been detailed in Section 4.4.5. To devise more optimal web crawling methods, we exploit three concepts which, to the best of our knowledge, have not been fully exploited until this point:

• Unique webpage identifier selection: Which webpage attribute is considered as its unique identifier?

• Unique webpage identifier similarity: Which webpage unique identifiers should be considered similar?

• Webpage content similarity: Which webpage content should be considered similar?

Using these concepts, we identify duplicate or near-duplicate web content, which highlights webgraph nodes that contain little or no new information and can thus be removed. These findings result in webgraph edge contractions and restructuring. In addition, this process enables webgraph cycle detection. The result is a reduction of webgraph complexity, improving the efficiency of the web crawling process and the quality of its results. We must note that each of the presented concepts can be used not only independently but also in conjunction with the others. In the sequel, we analyse each concept in detail.
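To make the webgraph terminology concrete, the sketch below models a webgraph as a plain adjacency list and detects a cycle with a depth-first search. It only illustrates the data structure and the notion of a cycle; it is not the detection method developed later in this chapter.

    def has_cycle(webgraph):
        # webgraph: dict mapping a URL to the list of URLs it links to.
        WHITE, GREY, BLACK = 0, 1, 2
        colour = {node: WHITE for node in webgraph}

        def visit(node):
            colour[node] = GREY
            for neighbour in webgraph.get(node, []):
                if colour.get(neighbour, WHITE) == GREY:
                    return True   # back edge found: a cycle exists
                if colour.get(neighbour, WHITE) == WHITE and visit(neighbour):
                    return True
            colour[node] = BLACK
            return False

        return any(colour[n] == WHITE and visit(n) for n in webgraph)

    # Example: a tiny webgraph with a cycle between two pages.
    print(has_cycle({'a.html': ['b.html'], 'b.html': ['a.html']}))  # True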

Unique webpage identifier selection

Uniform Resource Identifiers (URIs) are the de facto standard for unique web resource identification on the WWW. The architecture of the WWW is based on the Uniform Resource Locator (URL), a subset of the URI that, in addition to identifying a resource, provides a means of locating it by describing its primary access mechanism (e.g., its network "location") [19]. Many web-related technologies, such as Linked Open Data, use URLs [21]. We suggest that we should rethink the use of URLs for unique webpage identification in web crawling applications. There are many special cases where this concept is problematic:

• URLs with excessive parameters usually point at the same webpage. Web applications ignore arbitrary HTTP GET parameters. For instance, the following three URLs point at the same webpage:
  http://edition.cnn.com/videos
  http://edition.cnn.com/videos?somevar=1
  http://edition.cnn.com/videos?somevar=1&other=1
  There is no restriction on using any of these URLs. If a web content editor or user mentions one of them for any reason, it would be accepted as valid, as it points at a valid webpage. The web server responds with an HTTP 200 status and a correct web document. The problem is that web crawlers would capture three copies of the same webpage.

• Two or more totally different URLs could point at the same webpage. For instance, the following two AUTH university webpages are duplicates:
  http://www.auth.gr/invalid-page-1
  http://www.auth.gr/invalid-page-2
  They both point at the same "Not found" webpage. If URLs such as these are mentioned in any webpage visited by a web crawler, the result would be multiple copies of the same webpage.

• Problematic DNS configurations could lead to multiple duplicate web documents. For example, in many cases the handling of the "www." prefix in websites is not consistent. For instance, the following two URLs serve exactly the same content:
  http://www.example.com/
  http://example.com/
  A correct DNS configuration would make the system respond with an HTTP redirect from one to the other according to the owner's preference. Currently, web crawlers would consider them as two different websites.

We suggest that URLs need to be preprocessed and normalised before being used as unique webpage identifiers. An appropriate solution to this problem is the Sort-friendly URI Reordering Transform (SURT) encoding of URLs. In short, SURT converts URLs from their original format:
  scheme://[email protected]:port/path?query#fragment
into the following:
  scheme://(tld,domain,:port@user)/path?query#fragment
An example conversion is presented below:
  URL:  http://edition.cnn.com/tech
  SURT: com,cnn,edition)/tech
The '(' and ')' characters serve as an unambiguous notice that the so-called 'authority' portion of the URI ([userinfo@]host[:port] in http URIs) has been transformed; the commas prevent confusion with regular hostnames. This remedies the 'problem' with standard URIs that the host portion of a regular URI, with its dotted domains, is actually in reverse order from the natural hierarchy that is usually helpful for grouping and sorting. The value of respecting URI case variance is considered negligible: it is vanishingly rare for case variance to be meaningful, while URI case variance often arises from people's confusion or sloppiness, and they only correct it insofar as necessary to avoid blatant problems. Thus the usual SURT form is considered to be flattened to all lowercase, and not completely reversible4. Web archiving systems use SURT internally. For instance, Murray et al. use SURT and certain limits to conduct link analysis in captured web content [104]. Alsum et al. use SURT to create a unique ID for each URI to achieve incremental and distributed processing for the same URI on different web crawling cycles or different machines [5].

4http://crawler.archive.org/articles/user_manual/glossary.html, accessed: August 1, 2015
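To illustrate the transform, here is a simplified Python sketch of a SURT-style conversion; it is illustrative only (userinfo, port and fragment handling are omitted) and is not the Heritrix implementation referenced above.

from urllib.parse import urlparse

def surt(url):
    # Simplified SURT-style transform: lowercase the URL, reverse the
    # dotted host labels and drop the scheme.
    parts = urlparse(url.lower())
    host = ",".join(reversed(parts.netloc.split(".")))
    query = "?" + parts.query if parts.query else ""
    return "%s)%s%s" % (host, parts.path, query)

print(surt("http://edition.cnn.com/tech"))  # com,cnn,edition)/tech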

Unique webpage identifiers similarity

One of the basic assumptions of the web is that URLs are unique [19]. When a web crawler encounters duplicate URLs, it does not visit the same URL twice. We suggest that, in some cases, near-duplicate URLs also lead to the same webpage and should be avoided. For instance, the following two URLs lead to the same webpage:
  http://vbanos.gr/
  http://vbanos.gr
Slight differences in HTTP GET URL parameters, such as lowercase or uppercase characters, unescaped parameters, or any other web-application-specific variable, could also trick web crawlers into processing duplicate webpages. For example, the following could be near-duplicate webpages:
  http://vbanos.gr/show?show-greater=10
  http://vbanos.gr/page?show-greater=11
Parameter ordering may also trick web crawlers. Example:
  http://example.com?a=1&b=2
  http://example.com?b=2&a=1
Thus, we propose to detect near-duplicate URLs using standard string similarity methods and to consider webpages with near-duplicate URLs as potential duplicates. The content of these webpages is also evaluated to clarify whether they are indeed duplicates. We use the Sorensen-Dice coefficient similarity because it is a string similarity algorithm with the following characteristics: (i) low sensitivity to word ordering, (ii) low sensitivity to length variations, and (iii) linear running time [16, 45]. For the sake of experimentation, we consider a 95% similarity threshold appropriate to define near-duplicate URLs. Finally, we must highlight that the proposed method can be used with both URL and SURT as unique identifiers.
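A minimal sketch of a bigram-based Sorensen-Dice comparison applied to webpage identifiers is shown below; the tokenisation and threshold handling of the actual implementation may differ.

def dice_coefficient(a, b):
    # Character-bigram Sorensen-Dice coefficient between two identifiers.
    bigrams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return 2.0 * len(x & y) / (len(x) + len(y))

# The trailing-slash variants above score over the 95% threshold:
print(dice_coefficient("http://vbanos.gr/", "http://vbanos.gr") >= 0.95)  # True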

Webpage content similarity

Webpage content similarity can be also used to detect duplicate or near-duplicate webpages. The problem can be defined as:

• Detect duplicate webpages: two webpages which contain exactly the same content.
• Detect near-duplicate webpages: webpages with content that is very similar but not exactly the same. This is a common pattern on the web; a webpage may even differ slightly between two subsequent visits by the same user because some dynamic parts of the page, e.g. a counter or some other widget, are updated.

Digital file checksum algorithms create a unique signature for each file based on its contents. They can be used to identify duplicate webpages but not near-duplicates, and we need a very efficient, high-performance algorithm. The simhash algorithm by Charikar can be used to calculate hashes from documents in order to perform fast comparisons [35]. It has already been used very effectively to detect near-duplicates in web search engine applications [95]; this work demonstrates that simhash is appropriate and practical for near-duplicate detection in webpages. To use simhash, we calculate the simhash signature of every webpage after it is captured and save it in a dictionary with its URL. Then, when capturing any new page, we compare its simhash signature with the existing ones in the dictionary to find duplicates or near-duplicates. The similarity threshold would be an option according to user needs. For the sake of simplicity and experimentation, we only consider similarity evaluation between two webpages to identify exact similarity or at least 95% similarity. The potential problem of this approach is that if a website contains a large number of webpages, it is not efficient to calculate every new webpage's similarity with all existing webpages, even though we use simhash, which is very efficient compared with a bag of words or any other attribute selection method [35].
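The following self-contained sketch only illustrates the signature-and-compare workflow described above; the thesis implementation relies on an existing Python simhash library (Section 4.3.1), and mapping the 95% threshold to a normalised Hamming similarity is an assumption of this sketch.

import hashlib
import re

def simhash(text, bits=64):
    # Minimal simhash: every token votes on each bit of its own hash.
    v = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def similarity(a, b, bits=64):
    # One minus the normalised Hamming distance of the two fingerprints.
    return 1.0 - bin(a ^ b).count("1") / float(bits)

seen = {}  # URL -> simhash fingerprint of already captured pages

def is_near_duplicate(url, html, threshold=0.95):
    h = simhash(html)
    duplicate = any(similarity(h, other) >= threshold for other in seen.values())
    seen[url] = h
    return duplicate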

Webgraph cycle detection

During web crawling, a webgraph is generated in real time using the newly captured webpages as nodes and their links as edges. New branches are created and expanded as the web crawler captures new webpages. The final outcome is a directed acyclic graph [28]. Our method can be summarised as follows: every time a new node is added to the webgraph, we evaluate whether it is a duplicate or near-duplicate of nearby nodes. If so, the two nodes are merged, their edges are contracted, and we detect potential cycles in the modified graph starting from the new node up to a distance of 푁 nodes. If a cycle is detected, we do not proceed with crawling links from this node; otherwise we continue. A generic version of the web crawling with cycle detection algorithm is presented in Listing 4.2. In more detail, to implement our method we need a shared global webgraph object in memory, which can be accessed by all web crawler bots. Each webgraph node has the structure of Listing 4.1.

struct webgraph-node {
    string webpage-url
    string webpage-surt
    bitstream webpage-content-simhash
    list[string] internal-links
}

Listing 4.1: Webgraph node structure

Webpage-url keeps the original webpage URL without any modification. Webpage-surt is generated from the webpage-url, and webpage-content-simhash is generated from the webpage HTML markup. Only internal links are used because our algorithms are scoped to detecting duplicate webpages within the same website.

global var webgraph

method crawl(URL):
    fetch webpage from URL
    new-node = create-webgraph-node(URL)
    webgraph->add-node(new-node)
    for limit = (1,...,N):
        near-nodes = webgraph->get-nodes(new-node, limit)
        for node in near-nodes:
            if is-similar(node, new-node):
                webgraph->merge(node, new-node)
    for limit = (1,...,N):
        has-cycle = dfs-check-cycle(webgraph, new-node, limit)
        if has-cycle is True:
            return
    parse webpage and extract all URLs
    save webpage
    for all URLs not seen before:
        crawl(URL)

Listing 4.2: Generic web crawling with cycle detection algorithm

The algorithm can have multiple variations regarding: (a) the node similarity method, and (b) the maximum node distance evaluated. Webgraph nodes which would otherwise be considered unique, if only the exact URL were used, can now be identified as duplicate or near-duplicate using the methods presented in the previous subsections. The potential similarity metrics are presented in Table 4.1.

Table 4.1: Potential webgraph node similarity metrics

Id   Identifier   Identifier Similarity   Content Similarity
푆1   URL          No                      No
푆2   SURT         No                      No
푆3   URL          Yes                     No
푆4   SURT         Yes                     No
푆5   URL          Yes                     Yes
푆6   SURT         Yes                     Yes

To search for cycles we use Depth-First Search (DFS) [136] because it is ideal for performing limited searches in potentially infinite graphs. We limit the search distance to 3 nodes because our experiments indicate that deeper searches are not worthwhile for detecting cycles; we present such an experiment in Section 4.4.4. We must note that our method is very efficient because we do not need to save the contents of every webpage but only the specific webpage attributes presented in Listing 4.1. Also, our method uses one shared webgraph model in memory regardless of the number of web crawler processes, using locking mechanisms when adding or removing nodes. Because web crawling is I/O bound, this architecture does not incur performance penalties.
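A minimal sketch of the depth-limited check is given below; it only illustrates the dfs-check-cycle step of Listing 4.2, with the webgraph represented as a plain adjacency dictionary, and is not the WebGraph-It code.

def has_cycle_within(graph, start, max_depth=3):
    # Depth-limited DFS from `start`: report a cycle if a node already on
    # the current DFS path is reached again within `max_depth` hops.
    # `graph` maps each node to a list of its successors.
    def dfs(node, depth, path):
        if depth > max_depth:
            return False
        for succ in graph.get(node, []):
            if succ in path:
                return True
            if dfs(succ, depth + 1, path | {succ}):
                return True
        return False
    return dfs(start, 1, {start})

# Example: a <-> b forms a cycle reachable from a within two hops.
print(has_cycle_within({"a": ["b"], "b": ["a"]}, "a"))  # True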

4.2.2 Algorithms

Using the concepts introduced in Section 4.2.1, we design specific web crawling algorithms, which are tested experimentally in Section 4.4. We note that in all cases we evaluate a single domain: when we mention URLs, we mean URLs from the target domain, and we ignore external URLs. Also, we use breadth-first webpage ordering.

Algorithm 1 - the base web crawling algorithm

First, we present the basic web crawling algorithm in Listing 4.3. The algorithm is "naive" and is considered the standard reference, indicating the baseline number of webpages, links and web crawling duration. All other proposed methods are compared against it to indicate any gains or issues due to the application of our concepts (Section 4.2.1).

method crawl(URL):
    fetch webpage from URL
    parse webpage and extract all URLs
    save webpage
    for all URLs not seen before:
        crawl(URL)

Listing 4.3: 퐶1: Basic web crawling algorithm

Algorithm 2 - SURT variation

Instead of using the URL as the unique identifier for the webpage in the web crawler memory, we use SURT. The new algorithm is presented in Listing 4.4. We note that the extra SURT generation requires trivial computing resources.

method crawl(URL):
    fetch webpage from URL
    parse webpage and extract all URLs
    save webpage
    SURTs = calculate-SURT(URLs)
    for all SURTs not seen before:
        crawl(URL using SURT)

Listing 4.4: 퐶2: Basic web crawling algorithm using SURT as unique webpage identifier

Algorithms 3, 4 - near-duplicate unique identifiers

In the previous two algorithms, we use a dictionary data structure to hold the URLs or SURTs of all visited pages, and we use exact matching to decide whether we have already visited a webpage. In these algorithms we propose to use near-similarity to decide whether to visit a webpage or not. As presented in Section 4.2.1, we use the Sorensen-Dice coefficient to calculate the similarity of the new webpage's unique identifier (URL or SURT) with all existing identifiers. The remaining algorithmic steps are exactly the same.

method crawl(URL):
    fetch webpage from URL
    parse webpage and extract all URLs
    save webpage
    for all URLs not seen before:
        if URL is not similar to existing URLs:
            crawl(URL)

Listing 4.5: 퐶3: Using near-similarity for URLs

method crawl(URL):
    fetch webpage from URL
    parse webpage and extract all URLs
    save webpage
    SURTs = calculate-SURT(URLs)
    for all SURTs not seen before:
        if SURT is not similar to existing SURTs:
            crawl(URL using SURT)

Listing 4.6: 퐶4: Using near-similarity for SURTs

Algorithms 5, 6 - near-duplicate content detection

In the previous four algorithms, we worked with the selection of the webpage unique identifier (URL or SURT) and its similarity metric (exact or similar identifier). We propose to also take webpage content similarity into consideration, in addition to unique identifier similarity.

method crawl(URL):
    fetch webpage from URL
    if webpage is not similar to existing webpages:
        parse webpage and extract all URLs
        save webpage
        for all URLs not seen before:
            if URL is not similar to existing URLs:
                crawl(URL)

Listing 4.7: 퐶5: Using near-similarity for URLs and content similarity for webpages

method crawl(URL):
    fetch webpage from URL
    if webpage is not similar to existing webpages:
        parse webpage and extract all URLs
        save webpage
        SURTs = calculate-SURT(URLs)
        for all SURTs not seen before:
            if SURT is not similar to existing SURTs:
                crawl(URL using SURT)

Listing 4.8: 퐶6: Using near-similarity for SURTs and content similarity for webpages

Algorithms 7, 8 - cycle detection

We extend the previously defined algorithms 3-6 with webgraph cycle detection, based on the concept presented in Section 4.2.1. Algorithm 7 uses the URL as the unique webpage identifier together with the content similarity function (Listing 4.9), whereas algorithm 8 uses SURT as the unique webpage identifier together with the content similarity function (Listing 4.10).

global var webgraph

method crawl(URL):
    fetch webpage from URL
    new-node = create-webgraph-node(URL)
    webgraph->add-node(new-node)
    for limit = (1,...,N):
        near-nodes = webgraph->get-nodes(new-node, limit)
        for node in near-nodes:
            if content-is-similar(node, new-node):
                webgraph->merge(node, new-node)
    for limit = (1,...,N):
        has-cycle = dfs-check-cycle(webgraph, new-node, limit)
        if has-cycle is True:
            return
    parse webpage and extract all URLs
    save webpage
    for all URLs not seen before:
        crawl(URL)

Listing 4.9: 퐶7: Using URL as unique identifier, webpage content similarity and webgraph cycle detection

global var webgraph

method crawl(URL):
    fetch webpage from URL
    new-node = create-webgraph-node(URL)
    webgraph->add-node(new-node)
    for limit = (1,...,N):
        near-nodes = webgraph->get-nodes(new-node, limit)
        for node in near-nodes:
            if content-is-similar(node, new-node):
                webgraph->merge(node, new-node)
    for limit = (1,...,N):
        has-cycle = dfs-check-cycle(webgraph, new-node, limit)
        if has-cycle is True:
            return
    parse webpage and extract all URLs
    save webpage
    SURTs = calculate-SURT(URLs)
    for all SURTs not seen before:
        crawl(URL using SURT)

Listing 4.10: 퐶8: Using SURT as unique identifier, webpage content similarity and webgraph cycle detection

Table 4.2 is a summary of all presented web crawling algorithms. We notice that we do not exhaust all potential method combinations but focus on a substantial set, which is sufficient to explore the value of our methods. In the sequel, we present the WebGraph-It platform, which implements the presented algorithms.

Table 4.2: Web crawling algorithms summary

Id   Identifier Selection   Identifier Similarity   Content Similarity   Cycle Detection
퐶1   URL                    No                      No                   No
퐶2   SURT                   No                      No                   No
퐶3   URL                    Yes                     No                   No
퐶4   SURT                   Yes                     No                   No
퐶5   URL                    Yes                     Yes                  No
퐶6   SURT                   Yes                     Yes                  No
퐶7   URL                    No                      Yes                  Yes
퐶8   SURT                   No                      Yes                  Yes

4.3 The WebGraph-it System Architecture

Here, we present http://webgraph-it.com (WebGraph-It), a web platform that implements our methods as a web application. Using WebGraph-It, users can analyse target websites and gain an understanding of their structure, pages, hyperlinks, duplicate content and web crawler traps.

4.3.1 System

WebGraph-It is a web platform built using the micro-service architecture:

• The back-end subsystem implements all the algorithms and the web crawling logic for downloading and analysing data. It also exposes a private REST API.
• The data storage subsystem is responsible for permanent and temporary data storage. It communicates with the back-end to send or receive data via standard storage APIs.
• The front-end subsystem implements the user interface and the public REST API. It communicates with the back-end to invoke commands and retrieve data.

We use the following standard software components: (a) the Debian Linux operating system [118] for development and production servers, (b) the Nginx web server, (c) the Python programming language, (d) the Gunicorn Python WSGI HTTP server, (e) the Flask Python web micro-framework, (f) the Redis advanced key-value store to manage job queues and temporary data shared among the background web crawling processes, (g) the MariaDB MySQL RDBMS to store permanent data, (h) PhantomJS, a headless WebKit scriptable with a JavaScript API, with fast native support for various web standards (DOM handling, CSS selectors, JSON, Canvas and SVG), and (i) JavaScript and CSS libraries such as Bootstrap for the UI. An overview of the system architecture is presented in Figure 4.1.

Figure 4.1: WebGraph-It system architecture

We explain some of our system architecture decisions, as they are important for the implementation of our methods. First, we choose the micro-services architecture because we need to separate the web crawling logic, the data storage and the user interface. This way, we can upgrade each subsystem without affecting the others. For instance, we could create a new, more user-friendly web interface or a public REST API for WebGraph-It without modifying the web crawling logic. We use asynchronous job queues in the back-end to define and conduct the web crawling process because it is a flexible approach: we can define arbitrary numbers of worker processes on one or more servers, and, thus, the system is resilient to faults due to unexpected conditions. A single process can crash without affecting the others. Also, the results are kept in the job queues and can be evaluated later. A minimal illustration of this queueing pattern follows.
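The dissertation does not name a specific queueing library; the sketch below merely illustrates the Redis-backed asynchronous worker pattern, using the python-rq library as an example, and is not the actual WebGraph-It code.

from redis import Redis
from rq import Queue

def capture(url):
    # Placeholder for the real capture/processing logic of a crawl worker;
    # in practice the function must live in an importable module so that
    # worker processes can load it.
    print("crawling", url)

queue = Queue("standard", connection=Redis())
queue.enqueue(capture, "http://example.com/")
# Worker processes are started separately, e.g.:  rq worker standard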

Figure 4.2: Viewing a webgraph in the http://webgraph-it.com web application

We use Python to implement the front-end and the back-end subsystems because it offers many features such as robust MVC frameworks and networking libraries, such as python-requests5. Python also has a large set of libraries which implement algorithms such as simhash6 and Sorensen-Dice, as well as graph analysis (NetworkX7) and numeric calculations (numpy8). We use PhantomJS to improve the ability of our web crawler to process webpages which use JavaScript, AJAX and other web technologies that are difficult to handle with plain HTML processing. Using PhantomJS, we render JavaScript in webpages and extract dynamic content. In the past, this method has been tested successfully in web crawling work [16]. We use Redis9 to store temporary data in memory because of its extremely high performance and its ability to support many data structures and multiple clients in parallel. Web crawling is performed by multiple software agents / web crawling processes, which can be distributed over one or more servers. The WebGraph-It architecture uses Redis as a common temporary data store to maintain asynchronous job queues, webgraph structures, visited URL lists, SURT lists, webpages' simhash values and other vital web crawl information. We use MariaDB to store permanent data for users, web crawls and captured webpage information such as hyperlinks. All data are stored in a relational model so that we can query them and generate views, reports and statistics. The http://webgraph-it.com front-end enables users to register and conduct web crawls with various options. Users can see completed web crawls and retrieve the results. An indicative screenshot of the front-end is presented in Figure 4.2. Users are able to create new web crawling tasks or view the results of existing tasks via an intuitive interface. Users are also able to export webgraph data in a variety of formats such as Graph Markup Language (GraphML) [26], Geography Markup Language (GML) [30], Graphviz DOT Language [49], sitemap.xml [127] and CSV. Our aim is to enable the use of the generated webgraphs in a large variety of 3rd party applications and contexts.

5http://python-requests.org, accessed: August 1, 2015
6https://github.com/sangelone/python-hashes, accessed: August 1, 2015
7https://networkx.github.io/, accessed: August 1, 2015
8http://www.numpy.org/, accessed: August 1, 2015
9http://redis.io, accessed: August 1, 2015
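Returning to the PhantomJS rendering step mentioned above: the text does not state how PhantomJS is driven from Python; one common pattern of that period is Selenium's PhantomJS driver, sketched below purely as an illustration rather than as the WebGraph-It implementation.

from selenium import webdriver

def render(url):
    # Render the page with PhantomJS so that JavaScript-generated content
    # is present in the returned HTML. Requires the phantomjs binary on the
    # PATH and a Selenium release that still ships the PhantomJS driver.
    driver = webdriver.PhantomJS()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()

html = render("http://example.com/")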

4.3.2 Web Crawling Framework

The development of multiple alternative web crawling methods requires an appropriate code base. We implement a special framework for WebGraph-It which simplifies the web crawler creation process; we use it for the implementation of the alternative web crawling algorithms presented in Section 4.2.2. The basic functionality of user input/output, storage and logging is common for all web crawlers. The developer only needs to create a Python module with three methods:

• check_url: check whether we should continue to follow a URL.
• process: analyse a captured webpage and extract information.
• capture: download a webpage from a URL.

We present the Python implementation of the basic web crawling algorithm 퐶1 (Section 4.2.2) in Listing 4.11.

from lib.crawl_list_class import CrawlList
from lib.crawling_utils import enqueue_capture
from app.models.base import db
from app.models.crawl import Crawl
from app.models.page import Page

def permit_url(target):
    # Follow a URL only if it has not been visited in this crawl.
    crawl_list = CrawlList(target.crawl_id)
    return not crawl_list.is_visited(target.url)

def capture(target):
    current_crawl = Crawl.query.filter(Crawl.id == target.crawl_id).first()
    crawl_list = CrawlList(target.crawl_id)
    if permit_url(target):
        if not current_crawl.has_available_pages():
            return
        if not target.get_html():
            return
        new_page = Page(
            crawl_id=target.crawl_id,
            url=unicode(target.url),
        )
        db.session.add(new_page)
        db.session.commit()
        if target.unique_key and new_page.id:
            crawl_list.add_visited(target.unique_key, new_page.id)
        target.page_id = new_page.id
        if target.links:
            # Persist extracted links and enqueue them for capture.
            target.save_links()
            enqueue_capture("standard", capture, target, target.links)

def process(target):
    # Entry point: reset the visited list and enqueue the seed URL.
    crawl_list = CrawlList(target.crawl_id)
    crawl_list.clear()
    enqueue_capture("standard", capture, target, (target.url,))

Listing 4.11: 퐶1: Basic web crawling algorithm python implementation

4.4 Evaluation

We present the evaluation method and explain in detail the evaluation of a single website with all web crawling algorithms to demonstrate the process. Then, we present and discuss the results of the evaluation of a significant set of websites. Finally, we also present some auxiliary experiments to define the optimal webgraph cycle distance variable and to study the behavior of the system when analysing a web spider trap.

4.4.1 Methodology

Our evaluation aims to explore the behavior and the results of all the web crawling algorithms defined in Section 4.2.2 and to draw conclusions regarding each key web crawling concept introduced in Section 4.2.1. We study the quality and the completeness of the web crawling algorithms' results, and we also evaluate their speed: we need to evaluate whether all webpages and hyperlinks are captured and the time needed to perform this task. We compare these findings with the standard baseline web crawling, which does not include any optimisations, to evaluate the effects of our methods. In our experiments, we use Debian GNU/Linux 8, Python 2.7.9 and a virtual machine with 3 CPU cores and 1GB of RAM. All experiments presented here can be reproduced using the WebGraph-It system online at http://webgraph-it.com/. All web crawling tasks are performed in parallel using 4 processes (Python workers) as presented in Section 4.3. We perform the following steps for our evaluation:

1. We select 100 random websites from Alexa top 1M websites 10 as a dataset.

2. We run 8 subsequent web crawls for each website with the WebGraph-It system using the 8 different web crawling algorithms presented in Section 4.2.2 (퐶1 − 퐶8). We produce 8 different result sets (푅1 − 푅8) for each website.

3. We record specific metrics for each web crawl. All variables and metrics are presented and explained in Table 4.3.

4. We analyse the results and reach specific conclusions after the completion of all web crawls.

Symbol   Explanation
퐶푖        Web crawl
푅푖        Web crawl total results
퐷푖        Web crawl duration
푊푖        Captured webpages
퐿푖        Captured internal links from webpages
퐶푌푖       Webgraph cycles detected
퐶푂푖       Completeness: the percentage of information contained in a web crawl result set compared with the respective base web crawl result set

Table 4.3: Variables used in the evaluation, 푖=1-8

For evaluation purposes, it is necessary to have a baseline against which the web crawling methods are compared. We define the base crawl 퐶1 (Listing 4.3) as the fundamental method to crawl websites without any attempt at optimisation. All metrics are calculated as percentages of the base crawl measurements. For instance, if the base crawl captures 100 webpages and method 퐶5 captures 90, its 푊5 value is 0.9.

10http://s3.amazonaws.com/alexa-static/top-1m.csv.zip, accessed August 1, 2015

The completeness metric also needs further explanation. For each web crawl except the base one (i.e. 퐶2 − 퐶8), we need to evaluate whether our proposed methods succeed in capturing all webpages of the target website. This is necessary because it is possible that we accelerate the web crawling process but fail to capture all content. This behavior may not be acceptable for some applications where completeness is a key issue (e.g. web archiving), but it may be acceptable for other cases, such as search engines. To evaluate the completeness of each web crawl, we conduct the following steps:

1. We crawl each website using the standard method (퐶1) and save the results (푅1).

2. We crawl using every other method 퐶푖 and save their results 푅푖 (for 푖=2,...,8).

3. We check, for each web crawl result set 푅푖, whether every webpage captured in the base crawl 푅1 is available in 푅푖. To achieve this, we use the simhash of each webpage in 푅1 and compare it with the simhashes of all webpages in 푅푖.

4. We calculate the completeness 퐶푂푖 of each web crawl result set as the percentage of found webpages against the number of base web crawl results (a minimal sketch of this calculation follows).
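A minimal sketch of how this completeness calculation can be read is given below; pages are represented by their simhash fingerprints, and `similarity` is the fingerprint comparison sketched in Section 4.2.1 (an assumption of this sketch, not the exact evaluation code).

def completeness(base_fingerprints, result_fingerprints, threshold=0.95):
    # CO_i: fraction of base-crawl (R1) pages that have an exact or >=95%
    # near-duplicate, by simhash similarity, in the result set Ri.
    if not base_fingerprints:
        return 1.0
    found = sum(
        1 for h in base_fingerprints
        if any(similarity(h, other) >= threshold for other in result_fingerprints)
    )
    return float(found) / len(base_fingerprints)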

We note that we do not apply exact similarity to evaluate whether a webpage is present in the web crawl results; we also look for near-duplicates with a threshold of 95% similarity.

It is important to outline the conditions of our experiments. There is a maximum limit on the number of webpages each target website may have in our system. This is to prevent a website with a nearly infinite number of webpages from crashing our web crawler; this is a prototype application, and the implementation of a web-scale crawler is beyond its scope. Thus, we have set an arbitrary limit of 300 webpages per crawl. In some cases, websites have fewer webpages than this number, which is not a problem, whereas in other cases we stop the web crawling at this limit. What happens in practice is that the naive web crawling process (퐶1) would surely capture a number of duplicates within the first 300 webpages of such a website. All methods would reach this limit of 300 webpages, but they would include different webpages, where the most simplistic methods would contain duplicates to a significant extent.

Another fact is that the experiment is performed using a single server and network connection. Any potential network or system issues of the test server would affect the results of the experiment. To minimise this effect, we conduct all web crawling operations on a target website together, in sequential order (퐶1 − 퐶8) and without any time difference between each web crawl.

4.4.2 Example

We present the detailed evaluation of a single website as an example, to indicate the operation of our system and the results of all web crawling methods in detail (Table 4.4). For captured webpages (푊푖), links (퐿푖) and duration (퐷푖), we present not only the absolute numbers but also the percentages compared with the base crawl results. We also present the completeness (퐶푂푖) as a percentage.

Table 4.4: Results from all methods for a single website, http://deixto.com

Method                                                   푊푖    푊푖 %    퐿푖     퐿푖 %    퐷푖    퐷푖 %    퐶푂푖 %
퐶1 (URL)                                                 151   1.000   5131   1.000   199   1.000   1.000
퐶2 (SURT)                                                150   0.993   4982   0.971   211   1.060   0.987
퐶3 (URL + Unique key similarity)                         128   0.848   3980   0.776   178   0.894   0.887
퐶4 (SURT + Unique key similarity)                        133   0.881   4208   0.820   175   0.879   0.894
퐶5 (URL + Unique key similarity + content similarity)    152   1.007   5163   1.006   229   1.151   0.993
퐶6 (SURT + Unique key similarity + content similarity)   150   0.993   4980   0.971   345   1.734   0.997
퐶7 (URL + Cycle detection)                               126   0.834   5130   1.000   202   1.015   0.993
퐶8 (SURT + Cycle detection)                              124   0.821   4982   0.971   200   1.005   0.987

4.4.3 Results

We perform a total of 800 web crawls using our dataset of 100 websites. We capture data as presented in the example evaluation of the previous section. We summarise the results and calculate statistics such as the average, median, minimum, maximum and standard deviation in order to study them and draw conclusions.

Table 4.5: 푊푖: Captured webpages difference between all web crawls and the base crawl. Lower is better.

Id   Average   Median   Min     Max     StDev
푊1   1.000     1.000    1.000   1.000   0
푊2   0.885     0.988    0.504   1.000   0.192
푊3   0.605     0.593    0.201   1.000   0.261
푊4   0.589     0.547    0.305   1.000   0.270
푊5   0.907     0.900    0.815   1.000   0.070
푊6   0.861     0.880    0.451   1.000   0.242
푊7   0.669     0.649    0.285   0.868   0.209
푊8   0.606     0.582    0.285   0.868   0.229

The captured webpages 푊푖 are presented in Table 4.5. These results need to be evaluated along with the completeness 퐶푂푖 of each algorithm, which is presented in Table 4.6. The most efficient algorithm is the one that captures the fewest webpages while also having the highest completeness. Another important parameter is the duration of web crawling: a good algorithm has to be efficient. Performance does not only depend on the number of webpages downloaded but also on the computations required by the algorithm to decide the best web crawling process. The performance of each algorithm is presented in Table 4.7. It is also interesting to see the number of captured links for each web crawl, as presented in Table 4.8. The standard deviation values for all metrics are quite low, increasing the confidence in our results.

Table 4.6: 퐶푂푖: Completeness of each web crawling method. Higher is better.

Id    Average   Median   Min     Max   StDev
퐶푂1   1.000     1.000    1.000   1     0
퐶푂2   0.986     0.991    0.958   1     0.016
퐶푂3   0.698     0.800    0.215   1     0.263
퐶푂4   0.723     0.814    0.312   1     0.230
퐶푂5   0.989     0.995    0.965   1     0.014
퐶푂6   0.982     0.986    0.944   1     0.020
퐶푂7   0.985     0.995    0.951   1     0.018
퐶푂8   0.983     0.982    0.979   1     0.009

Table 4.7: 퐷푖: Duration difference between all web crawls and the base crawl. Lower is better.

Id   Average   Median   Min     Max     StDev
퐷1   1.000     1.000    1       1.000   0
퐷2   0.726     0.905    0.150   0.946   0.333
퐷3   0.492     0.545    0.151   0.728   0.218
퐷4   0.419     0.370    0.185   0.750   0.238
퐷5   1.168     1.106    0.155   1.663   0.538
퐷6   1.152     1.181    0.213   1.707   0.594
퐷7   0.955     1.004    0.182   1.160   0.294
퐷8   0.829     0.805    0.210   1.070   0.350

Table 4.8: 퐿푖: Captured links difference between all web crawls and the base crawl. Lower is better.

Id   Average   Median   Min     Max     StDev
퐿1   1.000     1.000    1.000   1.000   0
퐿2   0.846     0.961    0.386   0.984   0.231
퐿3   0.516     0.561    0.072   0.920   0.272
퐿4   0.508     0.526    0.163   1.000   0.299
퐿5   0.929     0.941    0.864   0.960   0.035
퐿6   0.859     0.915    0.361   0.956   0.228
퐿7   0.887     0.893    0.820   1.005   0.106
퐿8   0.857     0.873    0.801   1.000   0.238

Next, we look closely into the results of all algorithms and draw some conclusions. 퐶2 is similar to the standard algorithm, with the only difference that it uses SURT instead of URL for unique webpage identification. 퐶2 captures less data than 퐶1 (average(푊2)=0.885, average(퐿2)=0.846), while having very high completeness (average(퐶푂2)=0.986) and significantly lower time spent (average(퐷2)=0.726). This means that the use of SURT is superior to the use of URL as a unique identifier for web crawling. 퐶2 is a small improvement over the base algorithm regarding captured content (∼11%) but scores much better regarding web crawling performance (∼37%).

Algorithms 퐶3 and 퐶4 have in common the use of unique key similarity to identify duplicate web content. They capture very little content (average(푊3)=0.605, average(푊4)=0.589) and their results are quite incomplete (average(퐶푂3)=0.698 and average(퐶푂4)=0.723). Hence, we believe that they are not suitable for accurate web crawling, as they would miss a large subset of the target website. Nevertheless, their performance is very good (average(퐷3)=0.492, average(퐷4)=0.419), which may be due to the fact that they skip a large subset of the target website. One possible reason to use these algorithms would be to perform web crawling for sampling purposes: the process would be very fast compared with regular web crawling, but the results would be a subset of the total website.

Algorithm 퐶5 uses URLs as unique identifiers, URL near-duplicate detection and webpage content near-duplicate detection. Its results show marginal gains over the base algorithm (average(푊5)=0.907, average(퐿5)=0.929). Algorithm 퐶6, which uses SURTs instead of URLs, shows similar results (average(푊6)=0.861, average(퐿6)=0.859). The great drawback of these methods (퐶5, 퐶6) is that they are considerably slower than all other methods (1.168 and 1.152 average values, respectively); they are even slower than the baseline web crawling algorithm. This behavior is attributed to the fact that they compare every new webpage they capture with all already captured pages, as presented in Section 4.2.1. Despite the fact that we use simhash signatures to achieve very fast webpage content near-similarity evaluation, it is still an inefficient choice. Regardless of the test system specifications, using these methods at large scale would require considerably more computing resources than standard web crawling.

Finally, we study the results of the algorithms using cycle detection (퐶7, 퐶8). They succeed in capturing less duplicate content not only than the base algorithm, but also than 퐶2, 퐶5 and 퐶6: average(푊7)=0.669 and average(푊8)=0.606, while average(퐿7)=0.887 and average(퐿8)=0.857. At the same time, their completeness scores are very good, average(퐶푂7)=0.985 and average(퐶푂8)=0.983. Regarding web crawling duration, they are not much faster than the base crawl and are slower than 퐶2 (average(퐷7)=0.955 and average(퐷8)=0.829). Comparing 퐶7 with 퐶8, we see that 퐶8 is faster and also has better scores regarding captured web content. The two algorithms are almost equal regarding result completeness.

Our conclusion is that algorithm 퐶8 is the best choice for web crawling when considering captured web content quality and accuracy. 퐶8 managed to capture 40% (1−0.606) less duplicate website content with very high accuracy (average(퐶푂8)=0.983) in approximately the same time as the standard web crawling algorithm. On the other hand, if web crawling duration is our top priority and we would like to get a sample of a website in very little time compared with a full web crawl, we should use 퐶4, which was ∼58% faster than the standard algorithm (average(퐷4)=0.419).

4.4.4 Optimal DFS Limit for Cycle Detection

The cycle detection algorithms defined in Section 4.2.1 use Depth-First Search (DFS) to perform searches in webgraphs and detect cycles. The distance limit for webgraph searches is important because it has an impact on the performance and accuracy of cycle detection. If the limit is very small, cycle detection will be fast; however, it will not evaluate many nodes and may miss cycles. If the limit is large, the performance of the algorithm will suffer. To identify the optimal distance limit at which to stop searches, we perform the following experiment:

1. We use the websites from our previous experiment as a dataset (Section 4.4.3).

2. We set a maximum limit equal to 5 and run only the cycle detection algorithms (퐶7, 퐶8).

3. When we detect a cycle, we record the distance of the respective node.

4. We count the occurrences of cycles for each distance in the range [1,4] and present them in Table 4.9.

Table 4.9: Number of cycles for each distance limit

Distance   Cycles   Percentage
1          10,615   85.19%
2          1,056    8.47%
3          600      4.81%
4          189      1.51%

Based on the outcomes of this experiment, we limit the search distance to 3 nodes.

4.4.5 Web Spider Trap Experiment

We conduct a simple experiment to showcase the operation of our web crawling algorithms in the case of a web spider trap. We set up a simple web spider trap with a PHP script (Listing 4.12) on the author's website at http://vbanos.gr/trap/. The web spider trap generates an infinite number of URLs with the format presented in Listing 4.13. Each time a web spider visits the webpage, a new set of random URLs is created. This process is repeated indefinitely.

<html>
<head>
<title>web spider trap</title>
</head>
<body>
<p>Lorem Ipsum is simply dummy text</p>
<ul>
<?php
// NOTE: the loop below is reconstructed; parts of the original listing were
// lost during extraction. On every visit it emits a set of random URLs.
for ($i = 0; $i < 10; $i++) {
    $var = rand();
    $url = "http://vbanos.gr/trap/index.php?var={$var}";
    $label = "Randomly generated label {$var} for testing purposes";
    echo "<li><a href=\"{$url}\">{$label}</a></li>";
}
?>
</ul>
<p>Lorem Ipsum is simply dummy text of the printing and
typesetting industry. Lorem Ipsum has been the industry's
standard dummy text ever since the 1500s, when an unknown
printer took a galley of type and scrambled it to make a type
specimen book.</p>
</body>
</html>

Listing 4.12: Simple PHP web spider trap

http://vbanos.gr/trap/index.php?var=123
http://vbanos.gr/trap/index.php?var=412
http://vbanos.gr/trap/index.php?var=548

Listing 4.13: Example web spider trap hyperlink outputs

We initiate a new experiment with the spider trap URL as the target and a limit of 100 webpages for all web crawls. The results are presented in Table 4.10. The naive web crawling algorithms 퐶1 and 퐶2 fall into the trap and capture 100 and 103 webpages, respectively. 퐶2 captures more than 100 webpages because our system runs 4 web crawling processes in parallel and, when the maximum limit is reached, processes need to complete their current web crawling task before exiting. The potential maximum number of captured webpages is therefore the limit plus the number of processes.

Table 4.10: Web spider trap crawling results.

Method   푊     퐷
퐶1       100   64
퐶2       103   66
퐶3       11    12
퐶4       8     18
퐶5       41    97
퐶6       35    78
퐶7       4     4
퐶8       3     5

The algorithms using URL/SURT near-duplicate detection (퐶3, 퐶4) stop web crawling at different points, depending on when a generated URL becomes 95% similar to the URL of an already captured webpage. This point depends on the web spider trap's URL generation algorithm; if it is complex, more webpages may need to be captured before stopping. The algorithms using URL/SURT and content similarity (퐶5, 퐶6) behave in a similar way. They stop after capturing more webpages than 퐶3/퐶4, because the probability of having both a similar webpage URL and similar content is smaller than the probability of having just a similar URL. The cycle detection algorithms have the best performance: they evaluate nearby webpages and use their content similarity to find near-duplicates. This is exactly the case with our experimental web spider trap, so these web crawlers stop very quickly.

4.5 Conclusions and Future Work

We presented our work towards the improvement of web crawling performance and efficiency using near-duplicate web content detection and webgraph cycle detection. New concepts were introduced to improve the web crawling process: (i) the selection of URL or SURT as the unique webpage identifier, (ii) the use of unique webpage identifier similarity to detect duplicates and near-duplicates, (iii) the use of webpage content similarity for the same purpose, and (iv) the application of webgraph cycle detection. Using these concepts, we designed and implemented 8 web crawling algorithms and performed extended experiments to study their behavior. In addition, we presented an implementation of our algorithms via the WebGraph-It platform, a web system available at http://webgraph-it.com which enables users to analyse websites, perform test web crawls and generate webgraphs.

The concepts introduced in this work could lead to the implementation of many more web crawling algorithms besides the 8 we have implemented and tested. There could be many variations of the existing parameters and combinations of similarity and near-similarity criteria to produce many more algorithms. We decided to focus on the presented ones because we believe that they are indicative of our work. Our aim was to enable other researchers and web crawling engineers to learn and evaluate these concepts, so as to be able to integrate them in new or existing web crawling systems.

The outcomes of our research provide many useful insights. The use of the Sort-friendly URI Reordering Transform (SURT) for webpage URLs during web crawling results in improved web crawling performance and reduced duplicate captured content. We believe that this method should be used universally by web crawlers, as it is easy to implement, requires little modification to existing code, incurs little performance overhead and yields great results. The application of near-duplicate detection in URLs or SURTs should not be used independently because it results in incomplete web crawling results: a lot of web content is mistaken as duplicate when it is not. Near-duplicate URL/SURT evaluation should be used along with webpage content near-duplicate detection; this combination has far more accurate results. The problem is that it does not scale well, as we have demonstrated in the evaluation (Section 4.4.3). On the other hand, the webgraph cycle detection algorithms we have introduced in this work have great results and potential. The best performing algorithm in our experiments is 퐶8, which uses SURT for unique webpage identification and webgraph cycle detection to identify cycles and highlight duplicate and near-duplicate webpages. In addition, our method has the benefit of low computing requirements: it is only necessary to maintain a webgraph in memory with very little data for each node (URL, SURT, simhash signature, internal links) to evaluate potential duplicates.

Using the framework we developed in the context of WebGraph-It to enable easy web crawling algorithm implementation (Section 4.3.2), we aim to evolve existing web crawling algorithms and create new ones. Also, we plan to implement our methods in other existing open web crawlers such as the BlogForever platform [16]. Finally, we aim to launch a public web service via http://webgraph-it.com to provide users with web crawling and webgraph generation services.
The applications of web crawling optimisation, webgraph generation and web spider trap detection are numerous. Web crawling engineers would be able to streamline their web crawling operations by identifying web spider traps and other problematic webpages, researchers would be able to obtain quality web crawling data and webgraphs for experimentation, and students would be able to learn more about the web, web crawling and webgraphs.

Chapter 5

The BlogForever Platform: An Integrated Approach to Preserve Weblogs

We present BlogForever, a new system to harvest, preserve, manage and reuse weblog content. We present the issues we resolve and the methods we use to achieve this. We survey the technical aspects of the blogosphere and we outline the BlogForever data model, system architecture, use cases, experiments and results1.

1This chapter is based on the following publications:

• Kalb H., Lazaridou P., Banos V., Kasioumis N., Trier M.: "BlogForever: From Web Archiving to Blog Archiving", Proceedings 'Informatik Angepast an Mensch, Organisation und Umwelt' (INFORMATIK), Koblenz, Germany, 2013.

• Banos V., Baltas N., Manolopoulos Y.: “Blog Preservation: Current Challenges and a New Paradigm”, chapter 3 in book Enterprise Information Systems XIII, by Cordeiro J., Maciaszek L. and Filipe J. (eds.), Springer LNBIP Vol.141, pp.29–51, 2013.

• Kasioumis N., Banos V., Kalb H.: “Towards Building a Blog Preservation Platform”, World Wide Web Journal, Special Issue on Social Media Preservation and Applications, Springer, 2013.

• Banos V., Baltas N., Manolopoulos Y.: “Trends in Blog Preservation”, Proceedings 14th International Conference on Enterprise Information Systems (ICEIS), Vol.1, pp.13-22, Wroclaw, Poland, 2012.

• Banos V., Stepanyan K., Manolopoulos Y., Joy M., Cristea A.: "Technological Foundations of the Current Blogosphere", Proceedings 2nd International Conference on Web Intelligence, Mining & Semantics (WIMS), ACM Press, Craiova, Romania, 2012.

5.1 Introduction

We present how specialised blog archiving can overcome problems of current web archiving. We introduce a specialised platform that exploits the characteristics of blogs to enable improved archiving. In summary, we identify several problems for blog archiving with current web archiving tools:

• Aggregation is scheduled at fixed time intervals without considering website updates. This causes incomplete content aggregation if the update frequency of the content is higher than the schedule predicts [71, 137].
• Traditional aggregation uses brute-force methods to crawl without taking into account which content of the target website has been updated. Thus, the performance of both the archiving system and the crawled system are affected unnecessarily [137].
• Current web archiving solutions do not exploit the potential of the inherent structure of blogs. While blogs provide a rich set of information entities, structured content, APIs, interconnections and semantic information [89], the management and end-user features of existing web archives are limited to primitive features such as URL search, keyword search, alphabetic browsing and full-text search [137].

Our research and development aims to overcome these problems by exploiting blog characteristics. Additionally, while specialisation can solve existing problems, it causes additional challenges for interoperability with other archiving and preservation facilities. For example, the vision of seamless navigation and access of archived web content requires the support and application of accepted standards [125]. Therefore, our targeted solution aims to:

• improve blog archiving through the exploitation of blog characteristics, and
• support integration with existing archiving and preservation facilities.

The rest of this chapter is structured as follows: in Section 5.2, we perform a technical survey of the Blogosphere to identify weblogs' structures, data and semantics. In Section 5.3, we present the rationale and identified user requirements behind the system design decisions. We proceed with the presentation of the resulting system in Section 5.4 and its implementation in Section 5.5. We present the evaluation in Section 5.6, before we discuss some issues of our solution in Section 5.7.

5.2 Blogosphere Technical Survey

To aggregate, manage and archive weblogs, it is important to achieve a better understanding of the Blogosphere. It is necessary to explore patterns in weblog structure, data and semantics, weblog-specific APIs, social media interconnections and other unique blog characteristics. We conduct a large-scale evaluation of active blogs and review the adoption of an extensive list of technologies and standards. Finally, we compare the results with existing findings from the web to identify similarities and differences.

There are already important initiatives aiming to identify and record how the web is constructed and what its main ingredients are. The HTTP Archive is a permanent repository of web performance information and the technologies utilised. W3Techs also provides information about the usage of various types of technologies on the web [142]. Another company has been maintaining a database of information about sites, including technical information, since 1996. There are also many initiatives which gather web, and especially blog, information via user surveys and online questionnaires; Technorati's State of the Blogosphere is the most high-profile one, but there are also others such as The State of Web Development. However, while some of the above-mentioned initiatives publish descriptive statistics about the technological foundations of the Blogosphere, the scope and depth of these studies remain limited. For instance, while Technorati may publish basic statistics about the most widely adopted platforms and popular devices for accessing blogs [131], the use of libraries, formats and tools remains beyond the focus of the review. To the best of our knowledge, there is no other initiative that conducts technical surveys and evaluates the technological foundations of the Blogosphere; this work addresses this gap. In the following, we present the technical aspects of the survey implementation, detailed results and conclusions. We also highlight interesting differences between the generic web and the Blogosphere.

5.2.1 Survey Implementation

We evaluate the use of third-party libraries, external services, semantic mark-up, metadata, web feeds and various media formats in the blogosphere. We use a relatively large set of blogs drawn from various data sources, as presented in Table 5.1. All datasets were downloaded on August 12, 2011.

Description                                         Initial resources   Valid resources
Blogs from http://weblogs.com ping server           259,286             209,560
Technorati top 100 resources                        100                 90
Blogpulse top 40 resources                          40                  35
BlogForever project survey user contributed blogs   504                 145
Total                                               259,930             209,830

Table 5.1: Datasets

We choose to use http://weblogs.com/ for this evaluation for two reasons. First, it is a widely accepted and popular hub service in the Blogosphere, which makes it suitable for conducting a broad survey with a large sample of blogs. Secondly, it publishes a list of resources updated within the last hour. Using a list of recently updated resources eliminates abandoned or inactive blogs, which constitute about half of all blogs. Blog ping servers receive notifications when new content is published on blogs via the XML-RPC-based ping mechanism and, subsequently, notify their subscribers about recent updates. http://weblogs.com/ receives millions of blog update notifications every day. We also use other resources, such as the list of top 100 blogs published by Technorati.com, the top 40 blogs published by Blogpulse.com and a collection of blogs acquired from the BlogForever Weblog Survey [130]. The inclusion of additional blogs shared by participants of the survey extends the automatically generated list of blogs with a set of selectively contributed ones.

On the other hand, the use of Technorati and Blogpulse provides a potential for enriching the evaluation. Technorati and Blogpulse are among the earlier and established authorities on indexing, ranking and monitoring blogs. The inclusion of top blogs from Technorati and Blogpulse enables a comparative analysis between the more general Weblogs.com cohort and the list of highly ranked blogs. The overall number of accessed blogs is 259,930. HTTP response codes are recorded, and items without a valid HTTP status code are discarded. 94% of all the received status codes are successful. The total number of valid (i.e. response status code 200) records surveyed is 209,830. The summary of the registered response codes is shown in Figure 5.1.

Figure 5.1: HTTP Status response codes registered during data-collection

An informed decision is made on the time of collecting the data. The choice of the specified time frame is justified by the anticipated increase in the publishing activity of blogs in European and other states within time zone proximity. The XML file is parsed and URL entries are extracted for further processing. The URL entries are filtered to distinguish between updated resources and their hosting websites, and duplicate entries are removed. The number of accessed resources contains all the URLs that have been extracted and followed by the survey script.

We generate the datasets by accessing the relevant XML feeds published online by the above mentioned resources. For each URL in the datasets, we try to access it via standard HTTP, retrieve all output hypertext and related files (e.g. CSS, images) and store them on our server. Then, we scan the data to identify the presence of specific technologies, tools, standards and services via their signatures. For the data collection, we implement custom software using PHP 5.3 for the core application and we utilise the cURL network library to implement communication with the blogs via HTTP. Regular expressions are used to parse the blog source code and evaluate the use of certain technologies. We also use Bash to implement process management and file I/O. The software is a Linux command line application which requires a URL list as input and outputs results in CSV files. For each URL of the input, the application performs an HTTP request and retrieves the respective HTML code. Subsequently, a set of regular expressions are executed, one for each technology or digital object type we try to detect, and the results are stored in a comma delimited CSV file. It must be noted that input URLs can be blog base URLs but also specific blog post URLs. In either case, the software retrieves the specific URL's HTML code and proceeds to parse and analyse it. The complete software for implementing this survey is freely available via github2.

To evaluate the use of certain technologies, we parse the source code of the accessed resources and look for evidence of adopted technologies; a minimal sketch of this signature-detection step follows the list below. The technologies we consider as part of this evaluation are summarised in the following (+count indicates that the number of identified occurrences was counted):

• Content Type
• CSS (+count)
• Dublin Core (+count)
• Embedded YouTube video
• Facebook
• FOAF
• Flash (+count)
• Google+
• HTML
• HTTP Response Status Code (200, 404, etc.)
• Image tags (BMP, WEBP, JPG, PNG, GIF) (+count)
• JavaScript and specific libraries (Dojo, ExtCore, JQuery, JQueryUI, MooTools, Prototype, YUI)
• Microdata
• Twitter
• Microformat-hCard
• Microformat-XFN
• Open Graph Protocol (+count)
• Other MIME Types (see Table 2)
• Open Search
• RDF (+count)
• SIOC
• Software/Platform
• XHTML
• XML Feeds (Atom, Atom-comments, RSS, RSS-comments)
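As an illustration of this signature-detection step (the survey software itself was written in PHP and is available via the GitHub link above), the following Python sketch checks a page for a few example signatures; the regular expressions are illustrative, not the exact ones used in the survey.

import re
import urllib.request

# Example signatures only; the survey used one regular expression per
# technology or digital object type in the list above.
SIGNATURES = {
    "generator": re.compile(r'<meta[^>]+name=["\']generator["\'][^>]*content=["\']([^"\']+)', re.I),
    "rss_feed": re.compile(r'type=["\']application/rss\+xml["\']', re.I),
    "atom_feed": re.compile(r'type=["\']application/atom\+xml["\']', re.I),
    "jquery": re.compile(r'jquery[^"\'<>]*\.js', re.I),
}

def detect(url):
    html = urllib.request.urlopen(url, timeout=30).read().decode("utf-8", "replace")
    return {name: bool(pattern.search(html)) for name, pattern in SIGNATURES.items()}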

5.2.2 Results

Platforms and Software

We obtain information about the blog hosting platform and software from two blog attributes: a) the HTTP response headers of the blog and b) the generator <meta> HTML tag. The blog software version is often also present in these attributes. The most frequent blog platforms appearing in the studied cohort are WordPress (36%) and Blogger (19%). Similarly to our findings, Technorati reported WordPress, followed by Blogger, to be the platforms of choice. However, the number of WordPress instances observed within the studied dataset is considerably lower than the 51% reported by Technorati. Similar observations are made in relation to the Blogger platform. These differences may be due to a large number of cases (40%) for which information about the platform remained hidden.
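As a rough illustration of reading the platform and version from the generator tag, the hedged sketch below extracts both values from an already-fetched HTML document. The regular expression and the helper name are assumptions for illustration and are more permissive than a production parser would be.

    import re

    GENERATOR_RE = re.compile(
        r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']([^"\']+)["\']', re.I)

    def platform_from_html(html):
        """Return e.g. ('WordPress', '3.1.2'), or (None, None) if no generator tag is found."""
        match = GENERATOR_RE.search(html)
        if not match:
            return None, None
        parts = match.group(1).split()          # e.g. "WordPress 3.1.2"
        name = parts[0]
        version = parts[1] if len(parts) > 1 else None
        return name, version

    print(platform_from_html('<meta name="generator" content="WordPress 3.1.2" />'))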

2https://github.com/BlogForever/crawler, accessed: August 1, 2015

A considerable number of instances are registered for Typepad, vBulletin, Discuz and Joomla. Other (2%) frequently appearing platforms include Webnode, PChoc, Posterous, Blogspirit, DataLife Engine and BlueFish. The total number of unique platforms registered is nevertheless considerably large, totaling 469 unique platforms; even combined, however, they do not exceed 19% of the entire list of studied blogs. It remains an open question why a large number of blogs do not expose the platforms they are built on. Further investigation is required to identify whether some blogs prefer not to acknowledge the use of a certain blogging engine or whether they are based on custom systems.

Figure 5.2: Frequency of weblog software platforms

There is considerable variation across the most popular software platforms used, and the consistency in specifying versions of the adopted software varies as well. Nevertheless, it is still possible to identify the extent of adoption and noticeable patterns within the studied corpus. First, it becomes apparent that a large number of websites are maintained without a software upgrade, despite the availability of more recent versions. For instance, 20% of all the Movable Type blogs continue using version 3, as shown in Figure 5.4, despite the availability of several newer versions. There is a similar pattern, with around 13% (and some of the generic 4%) of the WordPress users choosing earlier versions of the software released between 2004 and 2009, despite the availability of newer versions. While the number of earlier versions across active blogs remains substantial, the majority of installations (around 75% on average) use more recent versions. These results are limited to the providers of software packages that do specify their versions. Among the providers that do not specify information about the software version are Blogger, Typepad and Joomla.

Character Encoding

Documents transmitted via HTTP are expected to specify their character encoding. Often referred to as a “charset”, it represents a method of converting a sequence of bytes into a sequence of characters. When servers send HTML documents to user agents (e.g. browsers) as a stream of bytes, user agents interpret them as a sequence of characters. Due to the large number of characters throughout written languages, and the variety of ways to represent them, charsets are used to help user agents render and represent them correctly.

Figure 5.3: Variation in versions of WordPress software

Figure 5.4: Variation in versions of MovableType software

It is therefore recommended by the W3C [40] to label Web documents explicitly, using the <meta> element or specific HTTP headers to convey this information. An example of specifying character encoding is given below:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

User agents are expected to work with any character encoding registered with IANA; however, the support of an encoding is bound to the implementation of a specific user agent. This evaluation records the use of the content and charset attributes across the studied blogs, which enables commenting on the most widely used charsets or on the absence of the recommended labelling. Information about the types of documents distributed by blogs is also collected. The results suggest that text/html is the most widely (61%) specified content type within the studied corpus. Other types constitute less than 1% and include application/xhtml, application/xml, application/xhtml+xml and application/vnd.wap.xhtml+xml, as well as text/xml, text/javascript, text/phpl, text/shtml and text/html+javascript. A considerable number of accessed resources were not labelled.
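A minimal sketch of how charset information can be read, first from the Content-Type header and then from the <meta> declaration, is given below. The header string, the regular expression and the function name are assumptions used only for illustration.

    import re

    META_CHARSET_RE = re.compile(r'<meta[^>]+charset=["\']?([\w-]+)', re.I)

    def detect_charset(content_type_header, html):
        """Prefer the HTTP header; fall back to the <meta> declaration in the markup."""
        if content_type_header and "charset=" in content_type_header.lower():
            return content_type_header.lower().split("charset=")[-1].strip()
        match = META_CHARSET_RE.search(html)
        return match.group(1).lower() if match else None

    print(detect_charset("text/html; charset=UTF-8", ""))                  # utf-8
    print(detect_charset("text/html", '<meta charset="iso-8859-1">'))      # iso-8859-1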

Figure 5.5: Variation in versions of vBulletin software

Figure 5.6: Variation in versions of Discuz! software

In addition to content type, we capture and analyse information about encoding. UTF-8 is the most frequently used encoding; other identified charsets do not exceed 6%. Encoding information is not specified or remains unidentified in 39% of the cases (Figure 5.7). The number of blog instances that do not specify charset information is worthy of note. Within the 6% of other charset specifications, 48 distinct records are identified. The most common charset specifications include iso-8859-1 (48%), euc-jp (23%), shift-jis (8%) and windows-1251 (6%); see Figure 5.8 for more details. The results demonstrate that the overwhelming majority of studied resources are distributed in Unicode as text/html documents. A still considerable share (6%) of resources use alternative encodings. It may therefore be necessary to consider solutions for capturing and preserving blogs distributed in character sets other than UTF-8.

CSS, Images, HTML5 and Flash

We now discuss the findings of the study regarding CSS, HTML5, Flash and certain image file formats. The dataset includes:

• Number of embedded references to CSS files linked

Figure 5.7: Encoding of evaluated resources

Figure 5.8: Breakdown of the other 6% of character set attributes

• Presence of HTML5 based on the <!DOCTYPE html> declaration
• Number of Flash objects used based on references to SWF files
• Number of png, gif, bmp, jpg, webp, wbmp, tiff and svg images used

Cascading Style Sheets (CSS) is a language that enables the separation of content from presentation. Used primarily with HTML documents, CSS provides a common mechanism for shared formatting among pages, improved accessibility and greater flexibility and control over the presentation elements of web documents. We find that most of the accessed resources use CSS elements (without distinguishing between CSS1 and CSS2). The average number of references to CSS is 1.94, suggesting frequent use of this technology; 81% of all the studied resources employ CSS. HTML5 is the fifth and (at the time of writing) the most recent revision of the HTML language. HTML5 intends to improve on its predecessors and to define a single markup language for HTML and XHTML. It introduces new syntactic features such as the <video>, <audio> and <canvas> elements, along with the integration of SVG content.

We look into the adoption of HTML5 within the studied corpus. The results suggest that only 25% (53,546) of all the considered resources use HTML5. However, it is important to specify here that the identification and counting of elements native to HTML5 was not performed as part of this study.

Image file formats used on the Web vary widely. Graphical elements displayed on websites are primarily divided into raster and vector images, with raster images being far more widely used across the Web. This study identified and quantified the number of images used within each of the accessed resources. The raster formats considered here are png, gif, bmp, jpg/jpeg, webp, tiff and wbmp, while SVG is the only vector format considered. Figure 5.10 outlines the use of file formats in the studied corpus of resources. The most frequently used formats are JPG, GIF and PNG images; the average number of these graphic types per webpage is between 4 and 8. We show the less frequently used image formats in Figure 5.11. The largest number of SVG images identified within a single resource (and the only instance of SVG use) is 5, which explains the low value of the averages. The average number of BMP images is the largest, with 0.02 per accessed resource; the averages of the other file types do not exceed 0.01. Interestingly, the proportion of resources with no images identified at all is considerably high (21.2%). Figure 5.9 illustrates the frequency of images identified on a single resource: 90% of all the pages exhibit fewer than 40 images, and the long tail of the distribution indicates a rapid decline in the number of websites using large numbers of images.
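The per-format image counts can be approximated by matching file extensions in <img> src attributes, as in the hedged sketch below. The extension list follows the formats named above, while the regular expression itself is an assumption rather than the survey's actual pattern.

    import re
    from collections import Counter

    IMAGE_FORMATS = ("png", "gif", "bmp", "jpg", "jpeg", "webp", "tiff", "wbmp", "svg")
    IMG_SRC_RE = re.compile(r'<img[^>]+src=["\']([^"\']+)["\']', re.I)

    def count_images(html):
        """Count <img> references per file extension."""
        counts = Counter()
        for src in IMG_SRC_RE.findall(html):
            extension = src.split("?")[0].rsplit(".", 1)[-1].lower()
            if extension in IMAGE_FORMATS:
                counts[extension] += 1
        return counts

    print(count_images('<img src="a.jpg"><img src="b.PNG"><img src="c.jpg?x=1">'))
    # Counter({'jpg': 2, 'png': 1})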

Figure 5.9: Average number of images identified

Flash, also known as Adobe Flash (formerly Macromedia Flash), is a multimedia platform used for adding interactivity or animation to web documents. It is frequently used for advertisements, games, and streaming video or audio. Flash content is developed using the object-oriented ActionScript programming language and can combine both vector and rasterised graphical content. The detection of Flash content within the studied resources is based on the use of the SWF format: accessed resources are searched for elements with a source attribute that points to an *.swf file, and the instances of Flash content are counted as well. The results indicate that the overwhelming majority (85%) of the accessed resources do not include any Flash content. Around 7% of all the resources are identified as having a single reference to a Flash object, and the number of occurrences exceeds double figures only in exceptional cases.
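Detection of Flash content thus reduces to counting references to *.swf files. A small hedged sketch follows; the pattern is an assumption chosen to match quoted .swf paths in object, embed or param markup.

    import re

    SWF_RE = re.compile(r'["\']([^"\']+\.swf)["\']', re.I)

    def count_flash_references(html):
        """Count quoted references to .swf files in the document."""
        return len(SWF_RE.findall(html))

    print(count_flash_references('<embed src="banner.swf"><param name="movie" value="banner.swf">'))  # 2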

Figure 5.10: Average use of BMP, SVG, TIFF, WBMP and WEBP formats

Figure 5.11: Distribution of images for pages with less than 20 images only

Semantic Markup

We investigate the use of metadata formats and associated technologies:

• Metadata
  – Dublin Core
  – The Friend of a Friend (FOAF)
  – Open Graph Protocol (OG)
  – Semantically-Interlinked Online Communities (SIOC)
• Microdata and Microformats
  – Microdata
  – hCard (Microformats)
  – XFN (Microformats)
• Common Semantic Technologies
  – Resource Description Framework (RDF)
  – Really Simple Discovery (RSD)
  – Open Search

Metadata are commonly defined as data about data. Within the context of the Web, metadata commonly refers to the descriptive text used alongside web content; examples include keywords, associations or various content mappings. It is often necessary to standardise these descriptions to ensure the consistency and interoperability of web content. Referring to Dublin Core, Open Graph, SIOC and FOAF simply as metadata would be inaccurate; however, their use is discussed jointly here due to the similarities in how they are applied. The summary of the identified uses of metadata standards is presented in Figure 5.12. Open Graph (OG) is the most frequently used standard. Each instance of OG and DC mark-up has been counted: the average occurrence of OG is 5.7 per page, compared to 1.37 for DC.
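Occurrences of Open Graph and Dublin Core mark-up can be counted from meta element attributes, as in the hedged sketch below. The two patterns are assumptions chosen to match the common og: and DC./dcterms. prefixes, not the survey's exact expressions.

    import re

    OG_RE = re.compile(r'<meta[^>]+property=["\']og:', re.I)
    DC_RE = re.compile(r'<meta[^>]+name=["\'](?:dc|dcterms)\.', re.I)

    def metadata_counts(html):
        """Return the number of Open Graph and Dublin Core meta elements."""
        return {"open_graph": len(OG_RE.findall(html)),
                "dublin_core": len(DC_RE.findall(html))}

    print(metadata_counts('<meta property="og:title" content="t">'
                          '<meta name="DC.creator" content="a">'))
    # {'open_graph': 1, 'dublin_core': 1}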

Figure 5.12: Summary of metadata usage

We show the histogram of OG occurrences in Figure 5.13. The use of FOAF is identified in only 561 cases, which constitutes less than 0.3% of all the studied pages; the overwhelming majority of evaluated resources do not use FOAF. Across the entire corpus of studied resources, no reference to SIOC is identified.

Figure 5.13: Histogram of Open Graph references

Microdata and Microformats are conceptually different approaches to enriching web content with semantic notation. This evaluation counted the number of resources where the presence of microdata or microformats was identified. More specifically, when referring to microformats, the investigation distinguished between XFN, a way of representing human relationships using hyperlinks, and hCard, a simple, distributed format for representing people, companies, organisations and places. The presence of Microdata within a resource is established by locating itemscope and itemtype="http://schema.org/*" attributes within a studied page; to add a property to an item, the itemprop attribute is used on one of the item's descendants. hCard and XFN microformats are identified, respectively, as class attributes carrying hCard values and as rel attributes within <a> tags. We identify XFN in 74,709 cases, which constitutes 35.6% of the entire corpus. In contrast, the use of microdata and hCards is far less frequent: only 27 instances of microdata are identified within the studied resources, and the number of identified hCards is limited to 607 (0.3%). A large portion of the studied corpus contains no evidence of either microdata or microformats.

The Common Semantic Technologies considered in this evaluation are limited to the use of the RDF language and the Open Search and Really Simple Discovery (RSD) formats. We identify RDF using the application/rdf+xml resource content type. We identify OpenSearch using the application/opensearchdescription+xml content type and the relevant namespace declaration:

    xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/"

Similarly, we identify RSD using the following namespace declaration:

    xmlns="http://archipelago.phrasewise.com/rsd"

The use of RSD is widespread: about 74% of all the accessed resources use RSD. In contrast, only 567 records (0.3%) use RDF, and no references to Open Search are identified.
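The checks for microdata, hCard, XFN and RSD described above can be expressed as simple pattern tests. The hedged sketch below mirrors, but does not reproduce, the survey's regular expressions; in particular, the hCard pattern checks for the conventional vcard root class and the XFN pattern checks only a few common rel values, both of which are assumptions.

    import re

    CHECKS = {
        # itemscope/itemtype within the same tag, pointing to schema.org
        "microdata": re.compile(r'itemscope[^>]*itemtype=["\']http://schema\.org/', re.I),
        # hCard root class is conventionally "vcard"
        "hcard": re.compile(r'class=["\'][^"\']*\bvcard\b', re.I),
        # XFN: rel attributes on anchor tags with relationship values
        "xfn": re.compile(r'<a[^>]+rel=["\'][^"\']*\b(?:friend|met|colleague)\b', re.I),
        # RSD advertised through its content type
        "rsd": re.compile(r'type=["\']application/rsd\+xml["\']', re.I),
    }

    def semantic_markup(html):
        """Return which of the semantic markup checks match the document."""
        return {name: bool(pattern.search(html)) for name, pattern in CHECKS.items()}

    print(semantic_markup('<link rel="EditURI" type="application/rsd+xml" href="#">'))
    # {'microdata': False, 'hcard': False, 'xfn': False, 'rsd': True}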

XML Feeds

XML feeds following the RSS and Atom protocols are used across weblog platforms and services. Represented in a machine readable format, web feeds enable data sharing among applications. The most common use of web feeds is to provide content syndication and notification of updates from multiple websites to a single application [67]. Aggregators or news readers are commonly used to syndicate web content by enabling users to subscribe to web feeds. These simple mechanisms for accessing and distributing web content explain the wide adoption of feeds on weblog platforms. We identify the use of web feeds through the <link> tag, with type="application/atom+xml" for Atom feeds and type="application/rss+xml" for standard RSS feeds, with an additional distinction for comment feeds where applicable. The results are outlined in Figure 5.14. RSS is the most widely used feed type (56%), while the use of Atom feeds (29%) is still common. 15% of the RSS feeds are used specifically for distributing the content of comments, yet no Atom feeds are identified for this purpose.
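Feed detection by <link> type can be sketched as follows. The patterns are assumptions corresponding to the type attribute values named above.

    import re

    FEED_TYPES = {
        "atom": "application/atom+xml",
        "rss": "application/rss+xml",
    }

    def detect_feeds(html):
        """Return the feed types advertised through <link> elements."""
        found = []
        for name, mime in FEED_TYPES.items():
            if re.search(r'<link[^>]+type=["\']' + re.escape(mime) + r'["\']', html, re.I):
                found.append(name)
        return found

    print(detect_feeds('<link rel="alternate" type="application/rss+xml" href="/feed">'))
    # ['rss']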

Figure 5.14: Use of XML feeds by type

JavaScript Libraries

We evaluate the use of the following popular JavaScript frameworks: Dojo3, Ext JS4, JQuery5, JQuery UI6, MooTools7, Prototype8 and YUI Library9. We also discuss the use of Pingback services throughout the studied cohort. The use of JavaScript by each of the accessed resources is quantified based on the number of *.js files linked or segments of JavaScript code embedded within the accessed document. The results suggest a wide adoption of JavaScript, with 82% of the entire studied corpus having at least one reference to JavaScript. The average number of JavaScript instances is large as well, at 12.5 instances per resource (Figure 5.15). Within the identified instances of JavaScript code, there are references to specific libraries and frameworks. Their use is identified by the reference to their name (e.g. dojo.js, jquery.js, etc.). The most frequently used technologies are JQuery, MooTools and the YUI Library. The cumulative use of Dojo, Ext Core, JQuery UI and Prototype constitutes just over 1% of all the accessed resources (Figure 5.16). Last but not least, this section summarises the use of Pingback APIs. The identification of Pingback is based on the presence of <link> tags with a rel="pingback" attribute within the accessed resources. The results suggest that 46.4% of all the accessed resources use Pingbacks. The use of other Linkback mechanisms, including Trackbacks and Refbacks, has not been considered in this evaluation; the use of other third party libraries such as Google Analytics is also omitted.
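Library and Pingback detection can again be reduced to filename and attribute patterns, as in the hedged sketch below. The filename fragments are assumptions based on each library's conventional script name and do not reproduce the survey's actual expressions.

    import re

    LIBRARY_FILES = {
        "jquery": "jquery", "jquery_ui": "jquery-ui", "mootools": "mootools",
        "prototype": "prototype", "dojo": "dojo", "yui": "yui",
    }
    PINGBACK_RE = re.compile(r'<link[^>]+rel=["\']pingback["\']', re.I)

    def scripts_and_pingback(html):
        """Detect referenced JavaScript libraries and the presence of a Pingback endpoint."""
        scripts = re.findall(r'<script[^>]+src=["\']([^"\']+)["\']', html, re.I)
        # Note: the 'jquery' fragment also matches jquery-ui files; a real matcher would be stricter.
        libraries = {name for name, frag in LIBRARY_FILES.items()
                     if any(frag in src.lower() for src in scripts)}
        return libraries, bool(PINGBACK_RE.search(html))

    print(scripts_and_pingback('<script src="/js/jquery.min.js"></script>'
                               '<link rel="pingback" href="/xmlrpc.php">'))
    # ({'jquery'}, True)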

3https://dojotoolkit.org/, accessed August 1, 2015
4https://www.sencha.com/products/extjs/, accessed August 1, 2015
5http://jquery.com/, accessed August 1, 2015
6https://jqueryui.com/, accessed August 1, 2015
7http://mootools.net/, accessed August 1, 2015
8http://prototypejs.org/, accessed August 1, 2015
9http://yuilibrary.com/, accessed August 1, 2015

Figure 5.15: Number of JavaScript instances identified

Figure 5.16: Number of identified JavaScript library/framework instances

Social Media

The rise of social media such as Facebook, Twitter and YouTube has had a profound effect on people's blogging behaviour and on the Blogosphere in general. A large number of blogs already integrate mechanisms for the easy distribution of their content on social media websites, and social media are used for promoting new posts and notifying readership about them. We summarise our investigation into the use of social media. To use Twitter, Facebook, Google+ and YouTube, it is necessary to integrate specific JavaScript libraries and XML namespaces with appropriate references to these web services. The results suggest that almost 4% of all the studied resources show evidence of integration with Facebook. The number of references to Twitter is marginal, with only a handful of identified instances. The adoption of Google+, on the other hand, is considerably higher, totaling 17.2% among the studied resources. This high number of instances is surprising given that the service had been announced less than two months before the time of writing this report. We study the use of YouTube differently from that of the social media discussed above: each of the accessed resources is scanned for occurrences of embedded content from YouTube.
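The social media checks look for references to each service's widget scripts or embed URLs. The hedged sketch below uses assumed host names and script paths rather than the survey's exact signatures.

    import re

    SOCIAL_SIGNATURES = {
        "facebook": r'connect\.facebook\.net|facebook\.com/plugins',
        "twitter": r'platform\.twitter\.com/widgets\.js',
        "google_plus": r'apis\.google\.com/js/plusone\.js',
        "youtube_embed": r'youtube\.com/embed/|youtube\.com/v/',
    }

    def social_media_usage(html):
        """Report which social media integrations appear in the document."""
        return {name: bool(re.search(pattern, html, re.I))
                for name, pattern in SOCIAL_SIGNATURES.items()}

    print(social_media_usage('<iframe src="https://www.youtube.com/embed/abc123"></iframe>'))
    # {'facebook': False, 'twitter': False, 'google_plus': False, 'youtube_embed': True}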

The use of