
Counting for Humanists

Andrew Goldstone (http://andrewgoldstone.com)

Wednesday, April 30, 2014

shall we count?

Academic disciplines (and even interdisciplines or hybrids) are relational entities; they must define themselves by what they are not. And what literary studies is not is a “counting” discipline. This negative relation to numbers is traditional—foundational, even—and it has not been seriously challenged by the rise of interdisciplinarity….Literary studies has shouldered much of the burden of…defending qualitative models and strategies against the naïve or cynical quantitative paradigm that has become the doxa of higher-educational management. Under these institutional circumstances, antagonism toward counting has begun to feel like an urgent struggle for survival.

James English, “Everywhere and Nowhere: The Sociology of Literature After ‘the Sociology of Literature,’” NLH 41, no. 2 (Spring 2010): xii–xiii.

what shall we count?

favorite author      female (%)   male (%)
Stephen King               17.5       35.9
Wilbur Smith                3.0       23.5
Agatha Christie            11.0        7.2
Danielle Steel             13.0        0.3
Jeffrey Archer              8.1        9.1
Virginia Andrews           11.9        0.8
Catherine Cookson          11.0        0.9
Sidney Sheldon              3.7        3.1
Bryce Courtenay             3.2        2.7
Tom Clancy                  1.5       11.6

Table: Australian readers’ favorite authors, by gender, from Tony Bennett et al., Accounting for Tastes (Cambridge UP, 1999), 151.

Figure 6: Book imports into India

[Line graph. x axis: years, 1850 to 1900. y axis: thousands of pounds sterling, 0 to 300.]

Source: Priya Joshi, In Another Country: Colonialism, Culture, and the English Novel in India, New York 2002.

Figure reprinted in Franco Moretti, “Graphs, Maps, Trees,” NLR 24 (Nov.-Dec. 2003): 75.

An antipathy between politics and the novel. Still, it would be odd if all crises in novelistic production had a political origin: the French downturn of the 1790s was sharp, true, but there had been others in the 1750s and 1770s—as there had been in Britain, for that matter, notwithstanding its greater institutional stability. The American and the Napoleonic wars may well be behind the slumps of 1775–83 and 1810–17 (which are clearly visible in figure 2), write Raven and Garside in their splendid bibliographic studies; but then they add to the political factor ‘a decade of poorly produced novels’, ‘reprints’, the possible ‘greater relative popularity . . . of other fictional forms’, ‘a backlash against low fiction’, the high cost of paper . . .6 And as possible causes multiply, one

6 James Raven, ‘Historical Introduction: the Novel Comes of Age’, and Peter Garside, ‘The English Novel in the Romantic Era: Consolidation and Dispersal’, in Garside, Raven and Schöwerling, eds, The English Novel 1770–1829, 2 vols, Oxford 2000; vol. i, p. 27, and vol. ii, p. 44.

comma-separated values

"firstname","surname","bornCountry"
"Alice","Munro",""
"Mo","Yan","China"
"Tomas","Tranströmer","Sweden"

the norms of CSV

▶ plain-text file for tabular data
▶ delimiter separates columns (usually , or a tab)
▶ newline separates rows
▶ names of columns in first row (optional)
▶ tricky bits:
    ▶ what if a data point contains a comma?
    ▶ what if a data point contains a quotation mark?
    ▶ what text-encoding should be used?
    ▶ how do you know what rules have been followed? (There is RFC 4180, but no promises.)

people

id,firstname,surname,born,died,bornCountry,bornCountryCode,bornCity,diedCountry,diedCountryCode,diedCity,gender,year
892,Alice,Munro,1931-07-10,0000-00-00,Canada,CA,Wingham,,,,female,2013
880,Mo,Yan,0000-00-00,0000-00-00,China,CN,Gaomi,,,,male,2012
868,Tomas,Tranströmer,1931-04-15,0000-00-00,Sweden,SE,Stockholm,,,,male,2011
854,Mario,"Vargas Llosa",1936-03-28,0000-00-00,Peru,PE,Arequipa,,,,male,2010
844,Herta,Müller,1953-08-17,0000-00-00,Romania,RO,"Nitzkydorf, Banat",,,,female,2009
832,"Jean-Marie Gustave","Le Clézio",1940-04-13,0000-00-00,France,FR,Nice,,,,male,2008
817,Doris,Lessing,1919-10-22,2013-11-17,"Persia (now Iran)",IR,Kermanshah,"United Kingdom",UK,London,female,2007
808,Orhan,Pamuk,1952-06-07,0000-00-00,Turkey,TR,Istanbul,,,,male,2006
801,Harold,Pinter,1930-10-10,2008-12-24,"United Kingdom",UK,London,"United Kingdom",UK,London,male,2005

Source: requests to api.nobelprize.org. See http://www.nobelprize.org/nobel_organizations/nobelmedia/nobelprize_org/developer

words

WORDCOUNTS,WEIGHT
the,766
of,482
and,305
in,259
to,224
a,195
new,101
as,101
that,86
it,75

Source: a wordcounts CSV file from a http://dfr.jstor.org request.

affordances

What kinds of data can be accommodated in this format?

And what can’t be?

data types: simple: numerical

▶ Whole numbers (integer scale). How many (books, people, words, genres…)?
▶ Real numbers (interval scale). How much (distance, time, money…)?

Special cases:

▶ percentages or proportions (ratio scale). How much of the total (population, corpus of texts…)?
▶ dates. When? (And does the day, month, year, decade, century… matter?)

data types: simple: categorical

▶ Unordered. Which of… (languages, nations, genders(?))?
    Special cases:
    ▶ binary or Boolean category: true or false, yes or no.
    ▶ many categories (headwords in the dictionary, authors in the catalogue).
▶ Ordinal. Which (letter of the alphabet, sales rank, “like, dislike, or neutral”)?

Categories to numbers

▶ true: 1, false: 0
▶ like: 1, neutral: 0, dislike: -1
▶ like: 2, neutral: 1, dislike: 0
▶ a: 1, b: 2, c: 3… (character encoding)

data types: compound

The list / the series

17.5, 3.0, 11.0, 13.0, 8.1, 11.9, 11.0, 3.7, 3.2, 1.5

The list of lists / the table

firstname: Alice, Mo, Tomas
surname: Munro, Yan, Tranströmer
bornCountry: Canada, China, Sweden

firstname   surname       bornCountry
Alice       Munro         Canada
Mo          Yan           China
Tomas       Tranströmer   Sweden

(more elaborate possibilities exist…)

and text?

a (looooong) list of characters (a “string”):

O, n, c, e, *space*, u, p, o, n, *space*, a, *space*, t, i, m, e

other representations

▶ the bag of words (to: 2, be: 2, or: 1, not: 1)
▶ content analysis (automated, human, or semi-automated)
▶ marked-up text
    Duke.
    Haplesse Egeon whom the fates haue markt...
▶ parsed trees
▶ page images

programming in a nutshell

1. A program is a formal description of a process for transforming data.
2. A computer performs calculations on numbers and stores the results of those calculations.
3. If the inputs, outputs, and the formal description can be encoded as numbers, a program can be executed on a computer.

the R experience

The console

You type an expression, R figures out its value (and sometimes: stores a value, draws a figure, reads a file from the disk, saves a file on the disk).

The script

You prepare a list of expressions in a file, and R figures out their values one by one.

first steps in the console

R is a parrot

2
"Shiver me timbers"

R gets crabby easily

Shiver
Shiver me timbers
help (
"Shiver

Press ESC.

some hidden features

▶ history navigation with up and down arrows (or RStudio History pane)
▶ tab completion
▶ help: help("sqrt") or ?sqrt

R data kinds (“modes”)

Numbers

Whole, integer, real, ratio… (complex too)

Strings

"Avast"
"\"Avast,\" he said"
"Beware the \\"

Represent a newline with \n and a tab with \t.

Booleans

TRUE and FALSE, or T and F for short

Factors

(For categorical data: more later)

Rithmetic

Try:

2 * 2
5/7
TRUE | FALSE
TRUE & FALSE
T & T
!FALSE
!TRUE
4 == 3
!(4 == 3)
4 != 3
1 < 5
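For checking guesses at the console, here is a sketch of the same expressions with the value R reports written as a comment. Nothing new is introduced; the comments only annotate the expressions above.

```r
2 * 2          # 4
5 / 7          # 0.7142857 (a real number, shown to 7 significant digits)
TRUE | FALSE   # TRUE: "or" is true if either side is true
TRUE & FALSE   # FALSE: "and" needs both sides to be true
T & T          # TRUE
!FALSE         # TRUE: ! negates
!TRUE          # FALSE
4 == 3         # FALSE: == tests equality
!(4 == 3)      # TRUE
4 != 3         # TRUE: "not equal", the same test written compactly
1 < 5          # TRUE
```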

R functions

Functions map inputs to outputs. Describe these:

sqrt(4)
nchar("Munro")
paste("Alice", "Munro")

Functions in R can have named parameters as well. Experiment with:

paste("Munro", "Alice", sep = ", ")
paste("Munro", "Alice", sep = "")
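As a way to compare notes after experimenting, this sketch repeats the calls above with their values in comments. It shows that sep controls what paste puts between its inputs, and that a single space is used when sep is left out.

```r
sqrt(4)                                # 2
nchar("Munro")                         # 5: the number of characters
paste("Alice", "Munro")                # "Alice Munro": joined with a space
paste("Munro", "Alice", sep = ", ")    # "Munro, Alice"
paste("Munro", "Alice", sep = "")      # "MunroAlice": nothing in between
```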

assignment

<- stores a value under a name which you can refer to (or change) later.

x <- 108
x
x + 2
storage <- 10
storage <- storage - 10
My_Perfectly_Good_Name2012 <- "Mo Yan"

R compound data types

vectors (for a series of values)

Construct a vector with the special function c (concatenate):

xs <- c(2, 4, 8)
xs
bs <- c(T, F, T)
bs
people <- c("Munro", "Mo", "Transtromer")
people
c(people, "Vargas Llosa")

subscripting

Choose an element or elements from a vector with []:

xs[2]
people[1]

sequences

1:3
c(1:3, 6:8)

What is the value of these expressions?

people[1:2]
people[c(1, 3)]
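A short sketch of subscripting by position, with the values in comments. It redefines xs and people exactly as on the vectors slide so that it runs on its own; note that R counts positions from 1.

```r
xs <- c(2, 4, 8)
people <- c("Munro", "Mo", "Transtromer")

xs[2]              # 4: the second element
people[1]          # "Munro"
1:3                # 1 2 3: a sequence is itself a vector
c(1:3, 6:8)        # 1 2 3 6 7 8
people[1:2]        # "Munro" "Mo"
people[c(1, 3)]    # "Munro" "Transtromer"
```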

logical subscripting

people[bs]

vector operations

c(1, 3, 5) + c(2, 4, 6)
paste(c("a", "b"), c("c", "d"))
c(T, F, F) | c(F, T, F)
c("a", "b", "c") %in% c("b", "c", "d", "e")
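The pattern in these expressions is that most operations work element by element: the first element pairs with the first, the second with the second, and so on. A sketch with the resulting values in comments:

```r
c(1, 3, 5) + c(2, 4, 6)                        # 3 7 11
paste(c("a", "b"), c("c", "d"))                # "a c" "b d"
c(T, F, F) | c(F, T, F)                        # TRUE TRUE FALSE
# %in% asks, for each element on the left, whether it occurs on the right
c("a", "b", "c") %in% c("b", "c", "d", "e")    # FALSE TRUE TRUE
```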

recycling

c(1, 3, 5) + 1
paste("The", c("beginning", "end"))
c(1, 3, 5) == 3
xs
choice <- xs > 3
choice
xs[choice]

What does Boolean-vector subscripting express?
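One way to read the question: a Boolean subscript acts as a filter, keeping the elements where the condition is TRUE. The sketch below redefines xs as on the vectors slide and shows recycling (the shorter input is reused for every element) followed by the filter.

```r
xs <- c(2, 4, 8)

c(1, 3, 5) + 1                         # 2 4 6: the 1 is recycled
paste("The", c("beginning", "end"))    # "The beginning" "The end"
c(1, 3, 5) == 3                        # FALSE TRUE FALSE

choice <- xs > 3                       # FALSE TRUE TRUE
xs[choice]                             # 4 8: only the elements greater than 3
xs[xs > 3]                             # the same filter in one step
```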

factors

A special type for categorical data, normally made out of strings:

nationalities <- c("American", "Canadian", "French", "French", "Chinese", "American")
nat_fact <- factor(nationalities)
nat_fact
nat_fact[1]
nat_fact[3:4]
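A factor is one concrete case of turning categories into numbers: R stores each distinct string once, as a level, and keeps an integer code for every element. A minimal sketch using the same vector, with levels() and as.integer() to look inside:

```r
nationalities <- c("American", "Canadian", "French", "French",
                   "Chinese", "American")
nat_fact <- factor(nationalities)

levels(nat_fact)        # "American" "Canadian" "Chinese" "French"
as.integer(nat_fact)    # 1 2 4 4 3 1: each element's position among the levels
table(nat_fact)         # how many elements fall in each category
```

The levels are sorted alphabetically by default, so the codes say nothing about order or magnitude; they are labels only.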

the data frame

A list of vectors not necessarily of the same type, but all of the same length:

laureates <- data.frame(
    firstname=c("Alice","Mo","Tomas"),
    surname=c("Munro","Yan","Tranströmer"),
    bornCountry=c("Canada","China","Sweden"),
    age_now=c(82,59,83))
laureates
laureates$surname # Levels?? (before R 4.0.0, strings became factors by default)

laureates <- data.frame(
    firstname=c("Alice","Mo","Tomas"),
    surname=c("Munro","Yan","Tranströmer"),
    bornCountry=c("Canada","China","Sweden"),
    age_now=c(82,59,83),
    stringsAsFactors=F)

indexing by row and column

laureates[1, 1]
laureates[1, 2]

laureates[1, "firstname"]
laureates[2, "surname"]
laureates[3, c("firstname", "surname")]

Exercise

Write a single expression in terms of laureates to produce the full name of Canada’s laureate.
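One possible answer, offered as a sketch and not the only solution. It rebuilds the three-row laureates frame from the previous slide so that it runs on its own, then combines row-and-column indexing with paste. The second version, which finds the row by testing bornCountry, is an assumption about where the deck is heading and is not needed for the exercise.

```r
laureates <- data.frame(
    firstname = c("Alice", "Mo", "Tomas"),
    surname = c("Munro", "Yan", "Tranströmer"),
    bornCountry = c("Canada", "China", "Sweden"),
    age_now = c(82, 59, 83),
    stringsAsFactors = FALSE)

# Canada's laureate is in row 1
paste(laureates[1, "firstname"], laureates[1, "surname"])    # "Alice Munro"

# The same result without knowing the row number in advance
canadian <- laureates$bornCountry == "Canada"
paste(laureates$firstname[canadian], laureates$surname[canadian])
```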

omitted indices

laureates[3,]
laureates[, 2]
laureates[, c("surname", "bornCountry")]
laureates[c(T, F, T), ]

A shorthand

laureates[, "surname"]
laureates$surname
laureates$surname[2]

getting a real dataframe

laureates <- read.csv("laureates.csv", stringsAsFactors=F)
laureates # Scroll up!
laureates$surname

properties of the frame

names(laureates)
nrow(laureates)
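For trying read.csv without the file at hand, the sketch below reads CSV from a string through the text argument. The three rows are cut down from the "people" sample earlier in the deck, with only four of its columns kept; the names csv_text and sample_rows are made up for this illustration. It also shows how quoting handles a value that contains a comma.

```r
csv_text <- 'id,bornCountry,bornCity,year
892,Canada,Wingham,2013
844,Romania,"Nitzkydorf, Banat",2009
817,"Persia (now Iran)",Kermanshah,2007'

sample_rows <- read.csv(text = csv_text, stringsAsFactors = FALSE)

names(sample_rows)         # "id" "bornCountry" "bornCity" "year"
nrow(sample_rows)          # 3: the header line is not counted as a row
sample_rows$bornCity[2]    # "Nitzkydorf, Banat": the quoted comma stays in the value
```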

the logic of the query

laureates$bornCountry == "Sweden"
swedes <- laureates$bornCountry == "Sweden"
laureates$surname[swedes]
laureates[swedes, ]
women <- laureates$gender == "female"
laureates[women, ]
laureates[women & swedes, ]
laureates[women | swedes, ]
laureates$surname[women & !swedes]
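To see the Boolean vectors behind these queries, here is a sketch on a four-row frame typed in by hand. The frame toy and its rows are assumptions for illustration, not the contents of laureates.csv. Each test yields one TRUE or FALSE per row, and &, | and ! combine the tests before the result is used as a row subscript.

```r
toy <- data.frame(
    surname = c("Munro", "Lagerlöf", "Tranströmer", "Pinter"),
    bornCountry = c("Canada", "Sweden", "Sweden", "United Kingdom"),
    gender = c("female", "female", "male", "male"),
    stringsAsFactors = FALSE)

swedes <- toy$bornCountry == "Sweden"    # FALSE TRUE TRUE FALSE
women <- toy$gender == "female"          # TRUE TRUE FALSE FALSE

toy[women & swedes, ]                    # row 2 only: both conditions hold
toy[women | swedes, ]                    # rows 1 to 3: at least one holds
toy$surname[women & !swedes]             # "Munro": women not born in Sweden
```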

exercise

Write an expression whose value is a dataframe containing the names and prize-years of all the laureates who died in a country other than the country of their birth.

exiles and émigrés

# living laureates have an empty diedCountryCode, so screen them out first
laureates[laureates$diedCountryCode != "" &
          laureates$bornCountryCode != laureates$diedCountryCode,
          c("surname","year")]

counting

table(c("a", "b", "a", "c", "b"))

…and division

table(laureates$bornCountryCode)
table(laureates$bornCountryCode)/nrow(laureates) * 100

Exercise

Write an expression for a tabulation of the number of men and women to win the Nobel in literature.
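A sketch of counting and division on a small vector: the gender values of the nine laureates listed on the "people" slide, typed in by hand. The full file covers every laureate, so its counts will differ; with the real data the exercise comes down to calling table on the gender column.

```r
# gender values for the nine rows shown on the "people" slide (2013 back to 2005)
genders <- c("female", "male", "male", "male", "female",
             "male", "female", "male", "male")

counts <- table(genders)
counts                                      # female 3, male 6
round(counts / length(genders) * 100, 1)    # female 33.3, male 66.7
```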

cross-tabulation

table(laureates$bornCountryCode, laureates$gender)

Sorting

laureate_countries <- table(laureates$bornCountryCode)
sort(laureate_countries)
sort(laureate_countries, decreasing = T)

Exercise

Write an expression for the top three countries-of-death of the Nobel laureates.

we’ll always have…

# note: living laureates have an empty diedCountry, which table() counts as a category of its own
sort(table(laureates$diedCountry), decreasing = T)[1:4]

messier data

Metadata for every item in the TEI-encoded Poetry and Crisis from the Modernist Journals Project, available from http://sourceforge.net/projects/mjplab/files/.

readLines("Poetry_2.everytitle.txt", n = 4)

!?!!! After consulting help(read.csv)…

poetry_titles <- read.table("Poetry_2.everytitle.txt", sep = "|",
                            strip.white = T, stringsAsFactors = F,
                            quote = "", header = T)

crisis_titles <- read.table("Crisis_2.everytitle.txt", sep = "|",
                            strip.white = T, stringsAsFactors = F,
                            quote = "", header = T)
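To see what each of those options does, the sketch below runs the same read.table call on two invented lines of pipe-delimited text. The column layout and the rows are assumptions made up for this illustration; the real everytitle files may be arranged differently.

```r
# Invented sample, NOT copied from the MJP files
sample_text <- "journal title | date | genre | creator
Poetry | 1912-10-01 | poetry | Author, A.
Poetry | 1912-10-01 | articles | Author, B."

sample_titles <- read.table(text = sample_text,
                            sep = "|",            # columns are split on |
                            strip.white = T,      # trim spaces around each value
                            stringsAsFactors = F, # keep strings as strings
                            quote = "",           # treat " and ' as ordinary characters
                            header = T)           # first line holds the column names

names(sample_titles)       # "journal.title" "date" "genre" "creator"
sample_titles$creator[1]   # "Author, A.": commas are harmless when | is the delimiter
```

R replaces the space in "journal title" with a dot to make a legal column name, which would explain a name like journal.title in later code.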

a comparison

overall proportions

table(poetry_titles$genre)/nrow(poetry_titles)
table(crisis_titles$genre)/nrow(crisis_titles)

combine and recount

mags <- rbind(poetry_titles, crisis_titles)
table(mags$genre, mags$journal.title)

who’s in both?

poetry_in_crisis <- poetry_titles$creator %in% crisis_titles$creator
shared_auths <- poetry_titles$creator[poetry_in_crisis]
unique(shared_auths)

Whoops!

shared_auths <- shared_auths[shared_auths != "" & shared_auths != "Anonymous"]
mags_shared <- mags[mags$creator %in% shared_auths, ]
table(mags_shared$journal.title, mags_shared$genre, mags_shared$creator)

from tables back to data frames

laur_country_tab <- table(laureates$bornCountryCode)
laureate_countries <- as.data.frame(laur_country_tab)
names(laureate_countries)

names(laureate_countries) <- c("country", "count")
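A sketch of the same conversion on invented country codes, to show why the renaming step is needed: as.data.frame gives the converted table the generic column names Var1 and Freq. The frame toy and its values are assumptions for illustration.

```r
# Invented codes, not the laureates file
toy <- data.frame(code = c("FR", "UK", "FR", "SE", "UK", "FR"),
                  stringsAsFactors = FALSE)

toy_counts <- as.data.frame(table(toy$code))
default_names <- names(toy_counts)             # "Var1" "Freq"

names(toy_counts) <- c("country", "count")
toy_counts                                     # FR 3, SE 1, UK 2
class(toy_counts$country)                      # "factor": the categories come back as a factor
```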

visualization, grammatically

A visualization transforms data inputs into graphical outputs (sound familiar?). A grammatical visualization consistently transforms dimensions of the data into aesthetic dimensions of the output.

library("ggplot2")

making a point (plot)

data: translations published in US, year by year

Source: UNESCO Index Translationum

us_tx <- read.csv("us-trans.csv")

1. years on x axis
2. counts on y axis
3. what to draw?
    ▶ point for each yearly entry
    ▶ line connecting
    ▶ shaded-in area

the code

qplot(x=year,             # aesthetics (mapping)
      y=translations,
      geom="point",       # geometry (shape)
      data=us_tx)         # data source
qplot(x=year, y=translations,
      group=1,            # special aesthetic: "1 line"
      geom="line",        # geometry (shape)
      data=us_tx)
qplot(x=year, y=translations, geom="area",
      data=us_tx)
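Recent ggplot2 releases mark qplot as deprecated, so it may help to see the same three plots written in the longer ggplot() form, where the data, the aesthetic mapping (aes) and the geometry (a geom_ layer) are spelled out separately. This is a sketch on invented numbers: toy_tx stands in for us-trans.csv, and its year and translations columns are assumed from the calls above.

```r
library(ggplot2)

# Invented counts, NOT the UNESCO data
toy_tx <- data.frame(year = 2000:2003,
                     translations = c(10, 12, 9, 15))

# data + aesthetics (mapping) + geometry (shape)
p_point <- ggplot(toy_tx, aes(x = year, y = translations)) + geom_point()
p_line <- ggplot(toy_tx, aes(x = year, y = translations)) + geom_line()
p_area <- ggplot(toy_tx, aes(x = year, y = translations)) + geom_area()

p_point   # typing the name at the console draws the plot
```

Only the last layer changes from plot to plot, which is the "grammatical" point: the mapping from data to axes stays fixed while the shape drawn varies.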

arbitrary mappings

1. countries on x axis, in alphabetical order
2. laureate count on y axis
3. point for each country

qplot(x=country, y=count, geom="point",
      data=laureate_countries)
qplot(x=country, y=count,
      geom="bar",         # bars but:
      stat="identity",    # don't tally y var.
      data=laureate_countries)
qplot(x=bornCountryCode, geom="bar",
      data=laureates)     # bars, and do tally
sorted_countries <- laureate_countries[order(laureate_countries$count),]
qplot(x=count, geom="bar", binwidth=1,
      data=sorted_countries)

dates

Consider mags. What type is mags$date?

poetry_articles <- poetry_titles[poetry_titles$genre == "articles",]
art_series <- as.data.frame(table(poetry_articles$date))
names(art_series) <- c("date", "count")
art_series$date <- as.Date(art_series$date)

qplot(x = date, y = count, group = 1, geom = "line", data = art_series)
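The as.Date step matters because a date read from a text file is only a string, and after table() and as.data.frame() it becomes a factor. A sketch on invented dates follows; it assumes they are written as year-month-day, which is the form as.Date reads without further instructions, and the real files may use another form.

```r
# Invented issue dates, one per item
toy_items <- data.frame(date = c("1912-10-01", "1912-10-01", "1912-11-01"),
                        stringsAsFactors = FALSE)

toy_series <- as.data.frame(table(toy_items$date))
names(toy_series) <- c("date", "count")
class(toy_series$date)                    # "factor": categories, not points in time

toy_series$date <- as.Date(toy_series$date)
class(toy_series$date)                    # "Date"
toy_series$date[2] - toy_series$date[1]   # a difference of 31 days
```

Once the column is a Date, arithmetic works and a plot can space the x axis by elapsed time instead of treating each date as an unrelated label.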

construct the data you want to plot

genre_series <- as.data.frame(table(mags$date, mags$genre, mags$journal.title))
names(genre_series) <- c("date", "genre", "journal", "count")
genre_series$date <- as.Date(genre_series$date)

qplot(x=date, y=count, color=genre, geom="point",
      data=genre_series)
qplot(x=date, y=count, color=genre, group=genre,
      geom="line", data=genre_series)
qplot(x=date, y=count, fill=genre, geom="bar",
      stat="identity", position="stack", data=genre_series)

small multiples

qplot(x=date, y=count, group=genre,
      facets=genre ~ journal, geom="bar",
      stat="identity", data=genre_series)
qplot(x=date, y=count, group=genre,
      facets= ~ journal, geom="bar",
      stat="identity", data=genre_series)

Exercise

Generate either overlaid or small-multiples plots of the time series of genres in Poetry.

counting on

Navarro, Daniel. Learning Statistics with R. http://health.adelaide.edu.au/psychology/ccs/teaching/lsr/. Pts. 2–3.
Wickham, Hadley. ggplot2: Elegant Graphics for Data Analysis. Springer, 2009. http://dx.doi.org/10.1007/978-0-387-98141-3.
Wilkinson, Leland. The Grammar of Graphics. 2nd ed. Springer, 2005. http://link.springer.com/book/10.1007/0-387-28695-0.
Online documentation for ggplot2. http://docs.ggplot2.org/.
