SLS 640: English Syntax

Web-based corpus linguistics and simple searches of the Brown and BNC corpora

Introduction

Internet search engines are a useful rough-and-ready tool for studying English grammar and lexicon. In this course, we'll make use of the Web to collect and study examples of linguistic phenomena.

The major advantages to using the Web as a corpus are:

- It's very big. If something can occur at all, it's likely to occur somewhere.
- It's easy to search using existing search engines.

However, there are disadvantages:

- Because we have no control over the composition of the Web, we cannot know if it is representative of the kind of language we're interested in.
- It contains numerous errors.
- To an unknown extent (probably large) it contains non-native English.
- Search engines are designed to find content, not linguistic patterns; many hits are "false positives"; many patterns are hard to find.
- Many of the top hits are titles, page headings, and headlines, which have their own peculiar syntax.

It is always necessary to do some hand work after collecting a preliminary sample from the Web.

Mini-project #1.

1. What is meant by "stop words" with respect to a search engine such as Google? Prepare a one-sentence definition of stop words. Can you find a list of Google's stop words? (I have never been able to find it.)

2. What is Google's "automatic stemming"?

3. How does putting quotation marks in a search string affect stop words and automatic stemming?

Mini-project #2.

The textbook for SLS 640, Parrot (2000), gives useful statements about the position of adverbs in sentences. Consider only the adverb already (Parrot, pp. 36-37). We'll investigate the position of already using Google.

1. What are the sentence positions of already as given by Parrot? (There are three.) Write them down.

2. Using Google, search for already. You'll get hundreds of millions of hits. Look at the first 50 examples given. Very many are not directly applicable, because the examples are not sentences (for example, they may be titles of English lessons, or headlines like "Al-Qaeda nukes already in U.S."). Extract those examples that actually show already in full sentences. There may not be very many. Do all these examples fit with Parrot's rules? Are all of Parrot's main positions exemplified?

3. Parrot fails to mention the position of already with respect to auxiliary verbs (or sequences of auxiliaries). To investigate this, do new Google searches for already on either side of the modal verb should. That is, hunt for should already and then for already should. You'll need to decide whether to surround the search string with quotation marks.

a. How many hits are yielded by the search string already should? How many for should already? Which order seems to be more common, just based on raw count of page hits? How much more common? (Twice as common?)

b. You'll notice that many of the hits are "false positives" that need to be ignored. For this reason, the answers you gave for (a) above are not very reliable. You will want to estimate the percentage of true cases for each of the two search strings. Do this by looking at a sample of the hits (I usually examine the first one hundred) and counting the ones that are real examples of what we're looking for. You can examine more or fewer, depending on your time and interest. Report the results like this:

already should
Total number of hits returned by Google = ______
Number of hits I examined individually = ______
Of those:
Number of true examples = ______
Number of false positives = ______
Proportion of true examples in the ones I examined = ______
(divide the number of true examples by the number of hits you examined)

should already
Total number of hits returned by Google = ______
Number of hits I examined individually = ______
Of those:
Number of true examples = ______
Number of false positives = ______
Proportion of true examples in the ones I examined = ______
(divide the number of true examples by the number of hits you examined)

c. Now you can correct the numbers you found in (a) using the proportion of true examples you found in (b): multiply the total number of hits by the proportion of true examples.

Estimated total number of true hits for already should = ______
Estimated total number of true hits for should already = ______

d. Notice that you're getting a count of the number of pages on which something occurs, not the number of occurrences. How does this affect your interpretation of relative frequencies? For example, if one form gets twice as many Google page hits, does that mean that it occurs twice as often?

[A great guide to the inner workings of Google is at: http://www.googleguide.com ]
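The correction in step (c) is simple arithmetic, and it can be sketched in a few lines of Python. This is only an illustration: the function name estimate_true_hits and all of the numbers below are made up for the example, so substitute the counts you actually recorded in (a) and (b).

```python
def estimate_true_hits(total_hits: int, examined: int, true_examples: int) -> float:
    """Scale a raw hit count by the proportion of true examples in a hand-checked sample."""
    if examined <= 0:
        raise ValueError("examined must be a positive number of hits")
    if not 0 <= true_examples <= examined:
        raise ValueError("true_examples must be between 0 and examined")
    # total hits x (true examples / hits examined)
    return total_hits * true_examples / examined


# HYPOTHETICAL figures, for illustration only (not real Google counts).
already_should = estimate_true_hits(total_hits=1_000_000, examined=100, true_examples=60)
should_already = estimate_true_hits(total_hits=4_000_000, examined=100, true_examples=90)

print("Estimated true hits, already should:", already_should)
print("Estimated true hits, should already:", should_already)
print("Ratio (should already / already should):", should_already / already_should)
```

With these invented figures the raw counts differ by a factor of four, but the corrected estimates differ by a factor of six, which is why the hand check in (b) matters. The estimate still counts pages, not occurrences, so the caution in (d) applies to it as well.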

Mini-project #3.

Go to http://webascorpus.org/searchwc.html

This is a corpus compiled from the Web, with hundreds of millions of tokens.

Search for "should already" in this corpus. How many are found? Search for "already should" in this corpus. How many are found?

(These are two-word sequences, that is, "2-grams", and you'll need to check the 2-gram box for the search to work.)

Mini-project #4.

Using a search function, such as one might find in a word processor, specialized text editor, or search program, search for all occurrences of the word already in the million-word Brown corpus (which you should have downloaded: see syllabus). There should be a couple of hundred occurrences.

List three examples that reflect Parrot's ordering principles and three examples that his principles do not cover.
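If you are comfortable with a little programming, the same search can be done with a short script instead of a search program. The following is a minimal sketch in Python, not a required method: the function name concordance is invented here, and the sample text stands in for the corpus. It assumes the corpus is available as plain text; the file name in the comment is hypothetical.

```python
import re


def concordance(text: str, word: str, width: int = 40) -> list[str]:
    """Return one keyword-in-context line for each whole-word, case-insensitive match."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    flat = " ".join(text.split())  # collapse line breaks so the context reads cleanly
    lines = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-align the left context so the keyword lines up in a column.
        lines.append(f"{left:>{width}}[{m.group()}]{right}")
    return lines


# Small made-up sample. For the real exercise, read your own copy of the corpus,
# e.g. text = open("brown.txt", encoding="utf-8").read()  (hypothetical file name).
sample = (
    "She has already left. Already the lights were out.\n"
    "They should already know, but he is not ready."
)

hits = concordance(sample, "already")
for line in hits:
    print(line)
print(len(hits), "occurrences found")
```

The \b word boundaries keep the search from matching already inside a longer word, and they also keep ready from counting as a hit. Each output line shows the keyword in brackets with its neighbours on either side, which makes it easier to sort examples by position.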

------

NOTES ON PC SEARCH PROGRAMS: If you wish, you can use Microsoft Word for this. Just use the Find utility (control-F; command-F on a Mac), enter already as the search word, and choose "Find Next" repeatedly to see each example in turn. You can also ask for all examples in the file to be highlighted and then choose "Find All."

Actually, Word is a cumbersome way to do corpus linguistics. It's better to use a program that has a better search function, ideally one that can handle regular expressions.
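To see what regular expressions add, here is a small sketch in Python's regular expression syntax (most grep-style programs accept something similar). One pattern finds already next to any modal verb, in either order, which a plain Find box cannot do in a single search. The list of modals and the sample sentences are chosen for this illustration only.

```python
import re
from collections import Counter

# Modal verbs to look for (an illustrative list, not an exhaustive one).
MODALS = r"(?:can|could|may|might|must|shall|should|will|would)"

# Either "<modal> already" or "already <modal>", as whole words, ignoring case.
PATTERN = re.compile(
    rf"\b(?:({MODALS})\s+already|already\s+({MODALS}))\b",
    re.IGNORECASE,
)


def count_orders(text: str) -> Counter:
    """Count how often each of the two orders occurs in the text."""
    counts = Counter()
    for m in PATTERN.finditer(text):
        if m.group(1):  # first alternative matched: modal comes first
            counts["modal + already"] += 1
        else:           # second alternative matched: already comes first
            counts["already + modal"] += 1
    return counts


# Made-up sample sentences.
sample = (
    "You should already know this. He already should have left. "
    "They may already be here. We would have already gone."
)

counts = count_orders(sample)
print(counts["modal + already"], "modal + already")
print(counts["already + modal"], "already + modal")
```

Note that "would have already gone" is not counted, because already does not stand directly next to the modal there. As with Web searches, the raw matches still need checking by hand for false positives.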

Macintosh users should use the great (free) program TextWrangler from http://www.barebones.com/ . (BBEdit works great, too, but recent versions cost money.)

TextCrawler is a free Windows search program that many people love. http://www.digitalvolcano.co.uk/

A superb Windows program is UltraEdit from http://www.ultraedit.com/ . You can download a free 30-day trial version. (Otherwise, it costs about $50.)

There are many other fine Windows text editors and specialized search programs, some of them free, that can do corpus searches very well. I've heard great things about EditPad Pro and PowerGREP. If you run across a particularly good program, let us all know.

Many Windows search programs are based on "grep". Try a Google search on Windows grep and see the many possibilities.

Mini-project #5.

1. For this, we use the Web search utility for the British National Corpus at the "Phrases in English" Web site developed by William Fletcher at the U.S. Naval Academy.

Look for should already and already should using the Simple Search form at:

http://pie.usna.edu/simplesearch.html

How many of each do you find? (Notice the minimum frequency filter; you may need to set that.)

2. The Simple Search form permits searches on parts of speech. {vm0} is the tag for "modal verb" (that's a zero, not the letter O, in {vm0}). In the corpus, a tag is placed at the end of each word; so, the modal may is may{vm0}. The plus symbol + stands for any one word; so +{vm0} means "any modal." Try this search:

+{vm0} already

This will find all examples of a modal verb {vm0} followed by "already." How many do you find?

How about the other order?

already +{vm0}

3. {av0} is the tag for adverbs.

Try this:

+{vm0} +{av0}

or the reverse:

+{av0} +{vm0}

How many did it find of each order? The lists also give us some idea of the specific adverbs, and the kind of adverbs, that can appear in these pre- and post-modal positions.
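The logic of these tag-based searches can be imitated on a small scale. The sketch below is in Python and is not how the Phrases in English site itself works. It assumes text in which each word carries its tag at the end, as in may{vm0}, and it treats a query as a sequence of slots, each of which may fix the word, the tag, both, or neither. The function names, the query format, and the tagged sample are all invented for this illustration.

```python
import re
from collections import Counter

# One token looks like word{tag}, e.g. should{vm0}.
TOKEN = re.compile(r"(\S+?)\{(\w+)\}")


def parse(tagged: str) -> list[tuple[str, str]]:
    """Split tagged text into lowercase (word, tag) pairs."""
    return [(w.lower(), t.lower()) for w, t in TOKEN.findall(tagged)]


def match_sequence(tokens, query):
    """Find runs of adjacent tokens that fit the query.

    Each query slot is (word, tag); None means "anything", so
    (None, "vm0") plays the role of +{vm0} and ("already", None) of already.
    """
    n = len(query)
    hits = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all((qw is None or qw == w) and (qt is None or qt == t)
               for (w, t), (qw, qt) in zip(window, query)):
            hits.append(tuple(w for w, _ in window))
    return hits


# Made-up tagged sample, for illustration only.
sample = ("you{pnp} should{vm0} already{av0} know{vvi} it{pnp} "
          "he{pnp} probably{av0} will{vm0} come{vvi} "
          "we{pnp} may{vm0} never{av0} know{vvi}")
tokens = parse(sample)

modal_already = match_sequence(tokens, [(None, "vm0"), ("already", None)])
modal_adverb = match_sequence(tokens, [(None, "vm0"), (None, "av0")])
adverb_modal = match_sequence(tokens, [(None, "av0"), (None, "vm0")])

print("modal + already:", modal_already)
print("modal + adverb: ", modal_adverb)
print("adverb + modal: ", adverb_modal)
# Which adverbs turn up after a modal?
print(Counter(adverb for _, adverb in modal_adverb))
```

Counting the adverbs in each list, as in the last line, gives the same kind of overview as the result lists on the site: which adverbs occur before a modal, which after, and how often.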