Data Mining and Stylometric Analysis of Barack Obama Speeches
The Authorship of Audacity: Data Mining and Stylometric Analysis of Barack Obama Speeches

Jonathan Herz
School of Engineering and Applied Science
George Washington University
Washington, DC 20052
Email: [email protected]

Abdelghani Bellaachia
School of Engineering and Applied Science
George Washington University
Washington, DC 20052
Email: [email protected]

Abstract: We explore the feasibility of identifying authorship among President Obama’s principal speechwriters with the interesting result that, yes, we can. This task is difficult because there are few training examples, multiple authors (four), and because these authors consciously attempt to emulate a single style - the President’s. Four corpora are created to compare different text pre-processing techniques. On each, function word frequencies are analyzed with ANOVA to select discriminating feature vectors. Using leave-one-out cross-validation, K-nearest neighbors achieves the best classification accuracy, 78%. One interesting result is that the new White House lead speechwriter, Cody Keenan, is not distinguishable from the other principal speechwriters beyond pure chance. Classification accuracy is improved to 90% after removing his work from our corpus.

Key words: authorship attribution, text mining, classification, statistical analysis, stylometry.

I. INTRODUCTION

A. Stylometry

Stylometry is the study of identifying authorship of texts based on information contained in the text itself. Many methods have been proposed, but they all share the assumption that every author leaves behind quantifiable and distinctive markers that allow them to be identified from a pool of possible authors[1]. Some of the more popular features that have been used in the problem domain include various types of word frequencies[2, 3], n-gram frequencies, punctuation[4], and aggregate measurements such as average sentence length[5]. Unlike many text classification problems, authorship identification uses function word frequencies as feature vectors rather than ignoring them as stop words.

B. The Data

We assembled a corpus of 37 speeches and addresses delivered by President Obama for which we can assign primary authorship to one of four principal speechwriters. These speeches span both his time as candidate and sitting president. One of the challenges in this experiment was finding attributions of speeches, since presidential speechwriters are, in the words of FDR, expected to “have a passion for anonymity.”

The four principal speechwriters are Jon Favreau, Cody Keenan, Adam Frankel, and Ben Rhodes. Jon Favreau was Obama’s lead speechwriter from 2005 - 2013, and has the most speeches in the corpus attributed to him as primary author: 13 in total. Cody Keenan, the current lead speechwriter, joined the team in 2009. Five of the speeches in our corpus are attributed to him. Adam Frankel also worked on the 2008 campaign, and has 9 speeches attributed in our training set. Ben Rhodes, who wrote for the campaign and now specializes in foreign policy speeches, has 10 attributions.

C. Obama’s Speechwriters

In this paper, we examine President Barack Obama’s national speeches and remarks to attempt to determine which speeches were written by which of his principal speechwriters. It is too early to say with certainty what Obama’s place in history will be, but it is safe to say that he is a politician whose career was built on the strength of his oratory. Obama catapulted to the national stage with his address to the 2004 Democratic National Convention, before he had even won national office. His viability as a presidential candidate only four years later can only be explained by the strength of that speech, and the widespread national attention that it garnered. Although a gifted orator in his own right, even Obama has had plenty of help from others. Beginning with the start of his Senate career in 2005, Obama began to assemble a speechwriting team that would follow him to the White House and beyond.

Obama has four principal speechwriters with identifiable speeches: Jon Favreau, Cody Keenan, Ben Rhodes, and Adam Frankel. They are, as a group, remarkably young for presidential speechwriters. After five years of the Obama administration, most are now in their late 20s or early 30s. They joined the team at different times, and some have recently left the administration. Unfortunately, information about the specific speeches these writers worked on is very sparse. Out of hundreds of speeches delivered by Barack Obama over the course of his national political career, we were only able to attribute 37 to individual speechwriters through interviews available in the public domain.

Jon Favreau was Obama’s lead speechwriter until early 2013. In 2005, Favreau began working for then-Senator Obama as his speechwriter. Some of President Obama’s most defining speeches were written in conjunction with Jon Favreau. After the Jeremiah Wright controversy engulfed the 2008 Obama presidential campaign, Jon Favreau worked on the key speech “A More Perfect Union,” which began by contextualizing the remarks of that controversial preacher against the backdrop of race relations in the United States, and ended with a call to work together to address social problems[6]. The speech became popular very quickly on YouTube, with 1.2 million views in the first 24 hours after release. Favreau also worked on President Obama’s “Nobel Peace Prize acceptance speech”[6]. Other important speeches that were identified as Favreau’s work include both inaugural speeches[6,7], the “2008 Jefferson-Jackson Day Dinner address”[8], and, more recently, the President’s “remarks at the prayer vigil for victims of the Sandy Hook Elementary school shooting”[6].

Cody Keenan joined the presidential speechwriting team in 2009, and succeeded Jon Favreau as the new Director of Speechwriting in 2013. Keenan wrote the President’s “eulogy for Senator Ted Kennedy”[9], and the President’s “remarks upon the signing of the ‘Edward M. Kennedy Serve America Act’”[9]. Keenan also worked on the “2009 White House Correspondents’ Association dinner remarks”[10], the President’s “speech at the Tucson shooting memorial service”[10], and “the 2013 State of the Union Address”[10].

Ben Rhodes’ official job title is deputy national security adviser for strategic communications. His speeches focus on foreign policy. Speeches that have been attributed in the press to Rhodes include the New Beginning speech made by Obama in June 2009 in Egypt as an attempt to improve America’s image in the Middle East[11], a speech given to a large public audience in Israel in March 2013[12], a February 2009 speech on ending the Iraq War[11], a “speech to the Ghanaian Parliament in July 2009”[11], and a “speech delivered to the New Economic School in Moscow”[11].

Adam Frankel, like Jon Favreau, also started in the Kerry campaign. He was Favreau’s first hire in 2007 after Obama announced his candidacy for the presidency, but left in 2011. Speeches that are known to have been written by him include the President’s “eulogy for Senator Robert C. Byrd”[13], the President’s “address to the Upper Big Branch coal mine disaster memorial service”[13], the President’s “address at the National Peace Officers’ memorial”[13], the President’s “eulogy for civil rights activist Dr. Dorothy Height”[13], and the President’s 2009 speech to the American Medical Association[13]. He also prepared remarks for the President at a National Prayer Breakfast after his first inauguration[8].

... analysis[4], and support vector machines[3,5]. Beginning in the 1980s, J. F. Burrows began to explore the use of stylometric techniques to analyze Jane Austen’s novels and other English authors of the Victorian era. Like Mosteller and Wallace, Burrows’ analyses used frequencies of function words such as by, the, from, etc. as discriminant features to classify authorial style. Burrows even used stylometric techniques to study character dialogue in some of Jane Austen’s books to identify the stylometric fingerprint of different fictional characters. Burrows’ primary contribution to the field was the pioneering use of Principal Component Analysis to reduce the high dimensionality inherent in using the frequencies of dozens of function words to categorize text[14]. Principal Component Analysis takes a correlation or covariance matrix of the dimensions being used in the analysis and finds its largest eigenvalues, whose eigenvectors (the principal components) capture the most variance in the data. Usually, the marginal amount of variance explained by the components after the first few drops off dramatically.

The combination of feature selection of function words introduced by Mosteller and Wallace, and the generalization of that approach to multivariate methods using Principal Component Analysis for dimensional reduction by Burrows, set a standard that has been followed by many stylometric studies. For example, in the paper Who wrote the 15th Book of Oz, Jose Binongo uses exactly the approach described above to determine whether popular sequels to The Wizard of Oz by Frank Baum had been written by Baum himself or by Ruth Plumly Thompson, an associate of Baum’s who is known to have written many of the later Oz books[15]. First, like Mosteller and Wallace, Binongo tallied the frequency of function words. Next, as in Burrows’ work, the 50-dimensional space of the most frequently occurring words was reduced to two dimensions using Principal Component Analysis.

Similar methods have also been used to attribute presidential addresses. In the 2006 paper Who Wrote Ronald Reagan’s Radio Addresses?, Airoldi, Anderson, Fienberg, and Skinner use function word frequency counts, Principal Component Analysis, and other methods to attempt to determine authorship of several hundred of Ronald Reagan’s radio addresses that were delivered in the late 1970s, as Reagan prepared to run for the White House[3]. The Reagan study also tried many different methods of stylistic fingerprinting in addition to function word frequencies, including delta-squared measures, SMART list words, semantic features, information