Databases and Tools for Structured Data

An increasing number of research projects now make use of some form of structured data: that is, data consisting of sets of comparable information, in which multiple items or objects share certain common features. Even if you have no intention of producing a website or presenting your data publicly, there are many situations in which databases or database-like tools may provide the best way of organising this sort of information. This article provides some tips on how to select the tools that are most appropriate for a given research project.

Contents

When a word processor is not enough
Spreadsheets
Relational databases
XML databases
TEI XML
RDF data
Where to go for more information

When a word processor is not enough

Some researchers use a word processor for a considerable portion of their work, particularly when writing books, papers, theses, and notes. Because of this familiarity, it can be tempting to use a word processor for everything. While this approach can save time which would otherwise need to be invested in learning new software and methods, it can lead to people generating extremely long documents, from which it is a struggle to retrieve useful information in any meaningful manner. For structured data, one of the approaches described below may be a better bet.

Spreadsheets

If the information you are working with consists of a number of discrete objects, each of which shares essentially the same limited set of characteristics, a spreadsheet may provide the ideal means for structuring your data. Historical surveys, such as censuses, and bureaucratic records are often most clearly set out in the form of a spreadsheet; a list of the bibliographic details of a set of books might be another example, or a list of financial returns from a publishing house. Spreadsheet software is well adapted to sort and re-sort records alphabetically or numerically, and is ideal if you wish to conduct numerical analysis (to establish means and medians in a particular dataset, for instance) or visualise information in the form of charts and graphs.
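The sorting and numerical analysis described above can also be sketched outside a spreadsheet package. The following is a minimal illustration in Python, assuming a small table exported as CSV; the column names and all the records are invented for the example and do not come from any real dataset.

```python
import csv
import io
import statistics

# Hypothetical bibliographic records, invented purely for illustration.
CSV_TEXT = """title,author,year,pages
A Study of Maps,Smith,1902,310
Coastal Trade,Jones,1898,254
Parish Records,Brown,1911,198
River Towns,Adams,1905,276
"""

# Read each row into a dictionary keyed by column heading.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Re-sort the records alphabetically by author, then numerically by year.
by_author = sorted(rows, key=lambda r: r["author"])
by_year = sorted(rows, key=lambda r: int(r["year"]))

# Basic numerical analysis of a single column.
pages = [int(r["pages"]) for r in rows]
mean_pages = statistics.mean(pages)
median_pages = statistics.median(pages)

print([r["author"] for r in by_author])
print([r["title"] for r in by_year])
print(mean_pages, median_pages)
```

For a handful of records a spreadsheet is likely to be quicker; a script of this kind mainly becomes useful when the same analysis has to be repeated on many files.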

Figure 1: Example of information in a spreadsheet

Useful for:
- Ordering simple records
- Numerical analysis
- Generating charts and graphs

Disadvantages:
- Not as good at handling complex relationships as relational databases

Popular software packages:
- Microsoft Excel
- OpenOffice.org Calc

Relational databases

Spreadsheets are fine for a lot of basic data organisation and analysis, but in some cases a database offers significant advantages. If you’re working with information or sources that have relationships with other objects, which in turn have interesting properties or relationships, using a relational database is probably a good idea.

When you construct a relational database, you can create separate tables (which individually tend to look much like spreadsheets) and link fields within each to fields in other tables. So, for instance, you could create a table of bibliographic details about books, including the names of the authors, and link this to a separate table of authors, containing information about when they were born and died, where they were educated, and so forth. If you wished, you could link the information about where they were educated to another table, providing information about the size and location of the school or university they attended. Relational databases cater for one-to-many relationships (one author may have written many books), or even many-to-many (a book may have several authors, each of whom has written several books).

Relational databases can be designed to enable quite complex cross-searching: for instance, listing all the books published by authors who attended a particular university during a given period. Searches of databases are called queries, and are written in a query language such as SQL (Structured Query Language), though many database software packages include a query-building function which will automatically convert your instructions to SQL for you. Learning to construct basic queries is not difficult, and with experience one can pick up techniques for performing complex but efficient searches.
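As a rough sketch of the books, authors and universities example, the following uses Python's built-in sqlite3 module with an in-memory database. The table layout, column names and every record are assumptions made for illustration; in particular, recording a single university per author is a simplification that a real design might well avoid.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Three linked tables: books point to authors, authors point to universities.
conn.executescript("""
CREATE TABLE universities (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    location TEXT
);
CREATE TABLE authors (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    born INTEGER,
    died INTEGER,
    university_id INTEGER REFERENCES universities(id),
    attended_from INTEGER,
    attended_to INTEGER
);
CREATE TABLE books (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    year INTEGER,
    author_id INTEGER REFERENCES authors(id)
);
""")

# Invented records, for illustration only.
conn.executemany("INSERT INTO universities VALUES (?, ?, ?)", [
    (1, "University A", "Northtown"),
    (2, "University B", "Southport"),
])
conn.executemany("INSERT INTO authors VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (1, "Author One", 1850, 1910, 1, 1868, 1871),
    (2, "Author Two", 1860, 1925, 1, 1879, 1882),
    (3, "Author Three", 1855, 1920, 2, 1873, 1876),
])
conn.executemany("INSERT INTO books VALUES (?, ?, ?, ?)", [
    (1, "First Book", 1880, 1),
    (2, "Second Book", 1885, 1),
    (3, "Third Book", 1890, 2),
    (4, "Fourth Book", 1888, 3),
])

# All books by authors whose time at a given university overlaps a given period.
QUERY = """
SELECT books.title, authors.name
FROM books
JOIN authors ON books.author_id = authors.id
JOIN universities ON authors.university_id = universities.id
WHERE universities.name = ?
  AND authors.attended_from <= ?
  AND authors.attended_to >= ?
ORDER BY books.title
"""

# Period of interest: 1865 to 1875 (end of period first, then start).
results = conn.execute(QUERY, ("University A", 1875, 1865)).fetchall()
for title, author in results:
    print(title, "by", author)
```

The two JOIN clauses follow the links between tables, which is what allows a single query to combine facts about books, authors and universities.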

Figure 2: Example of a relational database structure

Useful for:
- Situations where you are not sure in advance how you (or others) will want to query your data, and wish to keep your options open and flexible
- Spotting unexpected relationships between things
- Hosting information on the Web and allowing others to search it

Disadvantages:
- Efficiently structuring large databases can be a challenge

Popular software packages:
- Microsoft Access
- FileMaker Pro
- MySQL (particularly for databases hosted on the Web)
- PostgreSQL (particularly for databases hosted on the Web)

XML databases

Spreadsheets and relational databases can be very useful if you are working with essentially consistent data, where a limited number of shared characteristics are common to each record in a given table. If the information you wish to analyse is difficult to characterise in such a way, however, you may wish to take a different approach.

XML (eXtensible Markup Language) is a standard for tagging information in order to render it machine-readable. In humanities research it is often used to assist textual analysis, as it can be used to indicate particular characteristics that apply to particular sections in a text. For instance, you may have a number of texts which cover very different subjects, but you want to find all instances where a particular individual or event is mentioned. You could surround each personal name (Christopher Columbus, for example) with tags indicating that the part of the text in question is a personal name. You could then search your texts for all the people named in them, or index each occurrence of a specific name. You could also use XML to create a standardised version of a name that occurs in many different variants, in order to render it searchable but without having to alter the original spellings in the document itself.

Other tags can be used to indicate how a text should be displayed. Enclosing a piece of text between two emphasis tags, for instance, will indicate to a Web browser or some other XML reader that it should be displayed as bold, or italic. The precise interpretation of how an XML file should be displayed can be customised. The important thing is that XML separates content from its representation, ensuring that the document does not become unreadable just because the technology used to display it has changed.

As is the case when working with relational databases, it is possible to create quite complicated queries when working with XML-tagged documents. XQuery is one popular language for searching XML databases. As with SQL and its equivalents, it is fairly straightforward to learn how to return results from simple searches, but complex queries can also be constructed with a little more knowledge and experience.

XML is not only used to indicate textual content, but is also widely used in linguistics to indicate parts of speech or features of spoken language. It is also popular amongst those working with manuscripts or multiple editions, to indicate variations, alternative translations, and so forth.

TEI XML

Text Encoding Initiative (TEI) XML is a schema established to aid consistency and interoperability between digital humanities projects. Essentially, the TEI has defined a number of labels (about 500) for use when tagging texts, so that people do not end up having to create their own definitions every time they want to make a text machine-readable. The TEI guidelines are available from http://www.tei-c.org/Guidelines/.

The University of Oxford is a centre of expertise in TEI XML. OUCS runs an annual summer school, and members of the University can email [email protected] at any time for free advice.

Figure 3 (below) shows part of the play The Raigne of King Edvvard the Third marked up into XML. Some of the tags instruct the Web browser how to display the text, whereas others are ‘invisible’, but can assist searching and analysis of the text. For instance, homographs are indicated in the XML, but are not flagged up in the text displayed in the browser. One can see here that the browser has not been instructed to recognise all of the rendering information in the XML original, as it is not displaying the names of the speakers or the stage directions in italic.
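The idea of tagging personal names, indexing their occurrences, and recording a standardised form alongside variant spellings can be sketched with Python's standard xml.etree.ElementTree module. The sample text is invented, and the use of a persName element with a key attribute is an assumption modelled loosely on TEI practice rather than a complete or validated TEI document.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented sample text. Each personal name is wrapped in a persName tag,
# and the key attribute (hypothetical values) holds a standardised form.
TEXT = """<text>
  <p>In 1492 <persName key="columbus">Christopher Columbus</persName> sailed west.</p>
  <p>Later writers refer to <persName key="columbus">Cristoforo Colombo</persName>
     and to <persName key="vespucci">Amerigo Vespucci</persName>.</p>
</text>"""

root = ET.fromstring(TEXT)

# List every tagged name, in document order, exactly as spelled in the text.
all_names = [name.text for name in root.iter("persName")]

# Index each occurrence by standardised key, recording the paragraph number
# and the original spelling, which is left untouched.
index = defaultdict(list)
for p_number, paragraph in enumerate(root.iter("p"), start=1):
    for name in paragraph.iter("persName"):
        index[name.get("key")].append((p_number, name.text))

print(all_names)
print(index["columbus"])
```

Searching on the key finds both spellings of the same name, while the wording of the text itself is unchanged. A dedicated XML database and XQuery would serve the same purpose for larger collections.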


Figure 3: Example of a text with TEI XML mark-up, rendered into simple HTML
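The separation of content from presentation can be illustrated with a small sketch in Python. Here the same tagged sentence is rendered twice, using two different sets of display rules; tags with no rule are treated as ‘invisible’ and only their text is passed through. The element names, the sample sentence, the render function and both style tables are hypothetical, and this is not how any particular browser or TEI stylesheet works.

```python
import xml.etree.ElementTree as ET
from html import escape

# Invented sample: a speech with a speaker label and an emphasised word.
SOURCE = "<sp><speaker>Messenger</speaker> brings <emph>urgent</emph> news.</sp>"

def render(element, style):
    """Return HTML for element, mapping XML tags to HTML tags via style.

    Tags that have no entry in style are 'invisible': their text is kept
    but no mark-up is produced for them.
    """
    inner = escape(element.text or "")
    for child in element:
        inner += render(child, style)
        inner += escape(child.tail or "")
    html_tag = style.get(element.tag)
    if html_tag is None:
        return inner
    return f"<{html_tag}>{inner}</{html_tag}>"

root = ET.fromstring(SOURCE)

# Two alternative sets of display rules for the same content.
STYLE_ITALIC = {"sp": "p", "speaker": "i", "emph": "i"}
STYLE_BOLD = {"sp": "p", "emph": "b"}  # no rule for speaker names

italic_html = render(root, STYLE_ITALIC)
bold_html = render(root, STYLE_BOLD)

print(italic_html)
print(bold_html)
```

The second style table has no rule for speaker names, so they appear as plain text, much as the browser in Figure 3 does not show speakers in italic.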

Useful for:
- Working with texts
- Providing access to textual databases via the Web
- Textual and linguistic analysis

Disadvantages:
- Tagging documents is time-consuming
- You need to ensure that you tag elements consistently

Popular software packages:
- Oxygen XML Editor is useful for editing and tagging XML documents and checking that they meet TEI standards
- eXist is a free, open source, native XML database management system

RDF data

Although not yet as widespread as other means of structuring data, the use of RDF (Resource Description Framework) metadata (data that describes other data) is gaining prevalence as a means of linking together data from disparate sources. RDF represents relationships between things in the form of subject-predicate-object expressions. Any given subject (a particular book, for instance) may have a particular relationship (such as being published by) with a particular object (a given publisher). The book will have other relationships and properties as well, such as being published in (a relationship/property) a particular year (object); or being published as (relationship/property) a paperback (object). In RDF terms, such subject-predicate-object expressions are called ‘triples’, and a database containing them is called a ‘triplestore’.

RDF data is used especially to describe the relationships between resources on the Web in a machine-readable manner, and as such is a key component in what is known as the ‘Semantic Web’. The idea behind the Semantic Web is essentially to evolve the Web from a linked document store to a database of interlinked information. This may not be the easiest concept to envisage, but it basically means enriching data by enabling data from different sources to be searched together.

RDF data is often written using XML tags to describe the relationship being expressed. Some sort of standard ontology will need to be chosen to ensure a degree of consistency between descriptions. The predominant query language for RDF data is SPARQL, which as its name suggests has certain similarities to the SQL-type languages used to query relational databases.
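The subject-predicate-object model can be sketched in plain Python, with each triple held as a tuple. This is an analogy only: it uses neither real RDF syntax nor SPARQL, and the identifiers (book:1, publisher:A and so on), the predicate names, the place names and the match function are all invented for illustration. A real project would normally use a triplestore.

```python
# Each triple is (subject, predicate, object). All values are invented.
catalogue = {
    ("book:1", "publishedBy", "publisher:A"),
    ("book:1", "publishedIn", "1999"),
    ("book:1", "publishedAs", "paperback"),
    ("book:2", "publishedBy", "publisher:A"),
    ("book:2", "publishedIn", "2003"),
    ("book:2", "publishedAs", "hardback"),
    ("book:3", "publishedBy", "publisher:B"),
}

# A second, separately produced source describing the publishers.
publishers = {
    ("publisher:A", "locatedIn", "Northtown"),
    ("publisher:B", "locatedIn", "Southport"),
}

def match(store, subject=None, predicate=None, obj=None):
    """Return a sorted list of triples fitting the pattern; None is a wildcard."""
    return sorted(
        t for t in store
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )

# Everything known about one book.
about_book_1 = match(catalogue, subject="book:1")

# Because both sources share identifiers, they can simply be merged
# and then searched together.
merged = catalogue | publishers
local_publishers = {s for s, _, _ in match(merged, predicate="locatedIn", obj="Northtown")}
local_books = sorted(
    s for s, _, o in match(merged, predicate="publishedBy")
    if o in local_publishers
)

print(about_book_1)
print(local_books)
```

The final query draws on both sources at once, which is the kind of integration that shared identifiers and a common ontology are intended to make possible.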

Useful for:
- Integrating existing data from disparate sources
- Network analysis

Disadvantages:
- Can be tricky to conceptualise at first
- Coding RDF relationships by hand would be time-consuming. It is often therefore generated automatically from SQL or various XML formats.
- Most triplestore software is at present aimed more at developers than ‘ordinary’ users, so you will almost certainly need technical help

Popular software packages:
- Jena
- Sesame

Where to go for more information

If you wish to produce anything more than a very simple relational database, it would be wise first to learn a little about the principles of structuring data and the capabilities of the software you are considering using. Most good bookshops will have a selection of introductory guides to database design and XML. The IT Learning Programme at OUCS offers a number of courses that may be of interest: see http://www.oucs.ox.ac.uk/itlp/courses/ for more information.

Alternatively, if you have an idea for a research database and want to talk it through with a technical expert, speak to a member of the Infodev team at OUCS: see http://www.oucs.ox.ac.uk/infodev/ for more information. Infodev can also help with other aspects of research support, including data manipulation and website building.