Databases and Tools for Structured Data

An increasing number of research projects are now making use of some form of structured data – that is, data which consists of sets of comparable information, in which multiple items or objects share certain common features. Even if you have no intention of producing a website or presenting your data publicly, there are many situations in which databases or database-like tools may provide the best way of organising this sort of information. This article provides some tips on how to select the tools that are most appropriate for a given research project.

Contents

- When a word processor is not enough
- Spreadsheets
- Relational databases
- XML databases
- TEI XML
- RDF data
- Where to go for more information

When a word processor is not enough

Some researchers use a word processor for a considerable portion of their work, particularly when writing books, papers, theses, and notes. Because of this familiarity, it can be tempting to use a word processor for everything. While this approach can save time which would otherwise need to be invested in learning new software and methods, it can lead to people generating extremely long documents, from which it is a struggle to retrieve useful information in any meaningful manner.
For structured data, one of the approaches described below may be a better bet.

Spreadsheets

If the information you are working with consists of a number of discrete objects, each of which shares essentially the same limited set of characteristics, a spreadsheet may provide the ideal means for structuring your data. Historical surveys, such as censuses, and bureaucratic records are often most clearly set out in the form of a spreadsheet; a list of the bibliographic details of a set of books might be another example, or a list of financial returns from a publishing house. Spreadsheet software is well adapted to sort and re-sort records alphabetically or numerically, and is ideal if you wish to conduct numerical analysis – to establish means and medians in a particular dataset, for instance, or visualise information in the form of charts and graphs.

Figure 1: Example of information in a spreadsheet

Useful for:
- Ordering simple records
- Numerical analysis
- Generating charts and graphs

Disadvantages:
- Not as good at handling complex relationships as relational databases

Popular software packages:
- Microsoft Excel
- OpenOffice.org Calc

Relational databases

Spreadsheets are fine for a lot of basic data organisation and analysis, but in some cases a relational database offers significant advantages. If you’re working with information or sources that have relationships with other objects, which in turn have interesting properties or relationships, using a relational database is probably a good idea.

When you construct a relational database, you can create separate tables (which individually tend to look much like spreadsheets) and link fields within each table to fields in other tables. So, for instance, you could create a table of bibliographic details about books, including the names of the authors, and link this to a separate table of authors, containing information about when they were born and died, where they were educated, and so forth.
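The books-and-authors example above can be sketched in a few lines. This is only an illustration, assuming Python's built-in sqlite3 module rather than any of the packages named in this article; the table names, column names, and sample rows are all invented placeholders.

```python
import sqlite3

# An in-memory database, so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# Two tables: each book row points at an author row via author_id.
# All names here are hypothetical, chosen for illustration only.
conn.executescript("""
CREATE TABLE authors (
    author_id   INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    born        INTEGER,
    died        INTEGER,
    educated_at TEXT
);
CREATE TABLE books (
    book_id   INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    published INTEGER,
    author_id INTEGER REFERENCES authors(author_id)
);
""")

# Placeholder records, not real bibliographic data.
conn.executemany(
    "INSERT INTO authors VALUES (?, ?, ?, ?, ?)",
    [(1, "Author A", 1800, 1870, "University X"),
     (2, "Author B", 1820, 1890, "University Y")],
)
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?, ?)",
    [(1, "Book One", 1835, 1),
     (2, "Book Two", 1850, 1),
     (3, "Book Three", 1860, 2)],
)

# Follow the link from each book to its author.
rows = conn.execute("""
    SELECT books.title, authors.name, authors.born
    FROM books
    JOIN authors ON books.author_id = authors.author_id
    ORDER BY books.title
""").fetchall()

for title, name, born in rows:
    print(title, "by", name, "born", born)
```

Each book row stores only an author_id, so the biographical details are recorded once in the authors table however many books the authors wrote.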
If you wished, you could link the information about where the authors were educated to another table, providing information about the size and location of the school or university they attended. Relational databases cater for one-to-many relationships, or even many-to-many.

Relational databases can be designed to enable quite complex cross-searching, for instance, listing all the books published by authors who attended a particular university during a given period. Searches of databases are called queries, and are written in a query language such as SQL (Structured Query Language) – though many database software packages include a query-building function which will automatically convert your instructions to SQL for you. Learning to construct queries is not difficult, but there are various clever tricks that one can pick up to perform complex but efficient searches.

Figure 2: Example of a relational database structure

Useful for:
- Situations where you are not sure in advance how you (or others) will want to query your data, and wish to keep your options open and flexible
- Spotting unexpected relationships between things
- Hosting information on the Web and allowing others to search it

Disadvantages:
- Efficiently structuring large databases can be a challenge

Popular software packages:
- Microsoft Access
- FileMaker Pro
- MySQL (particularly for databases hosted on the Web)
- PostgreSQL (particularly for databases hosted on the Web)

XML databases

Spreadsheets and relational databases can be very useful if you are working with essentially consistent data – where there are a limited number of shared characteristics common to each record in a given table. If the information you wish to analyse is difficult to characterise in such a way, however, you may wish to take a different approach. XML (eXtensible Markup Language) is a standard for tagging information in order to render it machine-readable.
It is primarily used to assist textual analysis, as it can be used to indicate particular characteristics that apply to particular sections in a text. For instance, you may have a number of texts which cover very different subjects, but you want to find all instances where a particular individual or event is mentioned. You could surround each personal name with tags indicating that the part of the text in question is a personal name – <name>Christopher Columbus</name>, for example. You could then search your texts for all the people named in them, or index each occurrence of a specific name. You could also use XML to create a standardised version of a name that occurs in many different variants, in order to render it searchable but without having to alter the original spellings in the document itself.

Other tags can be used to indicate how a text should be displayed. Enclosing a piece of text between two emphasis tags, for instance, will indicate to a Web browser or some other XML reader that it should be displayed as bold, or italic. The precise interpretation of how an XML file should be displayed can be customised – the important thing is that XML separates content from its representation, ensuring that the document does not become unreadable just because the technology used to display it has changed.

As is the case when working with relational databases, it is possible to create quite complicated queries when working with XML-tagged documents. XQuery is one popular language for searching XML databases. As with SQL and its equivalents, it is fairly straightforward to learn how to return results from simple searches, but complex queries can also be constructed with a little more knowledge and experience.

XML is not only used to indicate textual content, but is also widely used in linguistics to indicate parts of speech or features of spoken language.
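The name-tagging idea described above can be sketched as follows. This is a rough illustration, assuming Python's standard xml.etree.ElementTree module rather than XQuery or an XML database; the sample passage is invented, and the key attribute used to hold a standardised form of each name is an assumption made for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample text. The "key" attribute (an assumption for this
# sketch) records a standardised form; the spelling inside the tags
# is left exactly as it appears in the source.
sample = """<text>
  <p><name key="Christopher Columbus">Christopher Columbus</name> sailed in 1492.</p>
  <p>A letter signed <name key="Christopher Columbus">Cristoforo Colombo</name> is quoted here.</p>
  <p><name key="Isabella I of Castile">Queen Isabella</name> is mentioned later.</p>
</text>"""

root = ET.fromstring(sample)

# Count how often each person is mentioned, using the standardised form.
counts = Counter(el.get("key") for el in root.iter("name"))

# Build an index from each standardised name to the spellings found.
index = {}
for el in root.iter("name"):
    index.setdefault(el.get("key"), []).append(el.text)

for person, spellings in index.items():
    print(person, counts[person], spellings)
```

Searching on the standardised form finds both spellings of the first name, while the tagged XML keeps the original spellings untouched.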
XML is also popular amongst those working with manuscripts or multiple editions, to indicate variations, alternative translations, and so forth.

TEI XML

Text Encoding Initiative (TEI) XML is a schema established to aid consistency and interoperability between digital humanities projects. Essentially, the TEI has defined a number of labels (about 500) for use when tagging texts, so that people do not end up having to create their own definitions every time they want to make a text machine-readable. The TEI guidelines are available from http://www.tei-c.org/Guidelines/.

The University of Oxford is a centre of expertise in TEI XML. OUCS runs an annual summer school, and members of the University can email [email protected] at any time for free advice.

Figure 3 (below) shows part of the play The Raigne of King Edvvard the Third marked up into XML. Some of the tags instruct the Web browser how to display the text, whereas others are ‘invisible’, but can assist searching and analysis of the text. For instance, homographs are indicated in the XML, but are not flagged up in the text displayed in the browser. One can see here that the browser has not been instructed to recognise all of the rendering information in the XML original, as it is not displaying the names of the speakers or the stage directions in italic.

Figure 3: Example of a text with TEI XML mark-up, rendered into simple HTML

Useful for:
- Working with texts
- Providing access to textual databases via the Web
- Textual and linguistic analysis

Disadvantages:
- Tagging documents is time-consuming
- You need to ensure that you tag elements consistently

Popular software packages:
- Oxygen XML Editor is useful for editing and tagging XML documents and checking that they meet TEI standards
- eXist is a free, open source, native XML database management
