
Latent Semantic Analysis: Is it a solution to Plato's problem?
[And 10 other questions & answers.]

10 questions

• How did this paper change our lives?
• What is Plato's problem?
• Oh no! Not more philosophy?
• How can Plato's problem be solved?
• What kind of solution do we need?
• What is latent semantic analysis?
• How is an LSA model constructed?
• How is the LSA model used?
  – What's a cosine between vectors?
• What are some cool empirical findings?
• Is LSA psychologically plausible?

How did this paper change our lives?

• Because I saw a talk by Landauer on this work, I became interested in latent semantic analysis [LSA]
• Because I was interested in LSA, I became interested in Curt Burgess's HAL model.
• Because I was interested in HAL, I decided to come to Edmonton, where Lori Buchanan was working on it
• Because I came to Edmonton, here I am teaching Psych 357.
• If Landauer hadn't written this paper, we probably wouldn't have the mutual pleasure of knowing each other as we do.

What is Plato's problem?

• Meno (in the Platonic dialog named after him) asks: How can one ever investigate what one does not know?
• He saw two problems:
  – i.) How can you propose what you do not know as the object of your search?
  – ii.) How will you recognize what you do not know as the thing you did not know if you do (by chance) find it?
• More generally, the problem is that there is a gap between what we experience and what we know, with the latter seeming to be larger than the former is able to support.

Oh no! Not more philosophy?

• Not at all (indeed, the opposite)
• Plato's problem is exactly the poverty of the stimulus/failure of induction problem
• It is thus central to syntactic knowledge as well as to many other dimensions of linguistic knowledge (wherever we make fine-grained untaught distinctions: e.g. in prosody, phonology, and semantics).

How can Plato's problem be solved?

• i.) Plato's solution was recollection of knowledge gained in a previous life, famously demonstrated in the Meno by showing that a slave boy 'knows' the Pythagorean Theorem
• ii.) Some favour the idea of innate knowledge, the modern equivalent of recollection of a previous life
• The basic common principle is one we already know and love in Psych 357: we need some source of strong additional constraints on the problems (= information) to narrow down the size of the search space.

What kind of solution do we need?

• That is: What properties are desirable in a scientifically-acceptable explanation of how constraints on a search space operate?
• i.) They must be sufficient.
• ii.) They must be well-defined.
• iii.) They must be psychologically plausible.

What is latent semantic analysis?

• LSA is an algorithmically well-defined way of measuring lexical co-occurrence in some set of text
• The assumption is that co-occurrence says something about semantics: words about the same things are likely to occur in the same contexts.
• If we have many words and contexts, small differences in co-occurrence probabilities can be compiled together to give information about semantics.
  – Think of 20 questions: No single question might be sufficient to identify an unknown object, but 20 questions usually are sufficient

How is an LSA model constructed?

• i.) Build a matrix with rows representing words and columns representing context (a document or word string)
• ii.) Enter in each cell (= a word X document intersection) a count of how many times that word occurred in that document
• iii.) Transform the matrix
  (a code sketch of steps i and ii follows the example matrix below)

Example word-by-document count matrix:

              Sonnets   Learn C   A day at the zoo   …
  dog             6         1            7
  zebra           0         2           46
  computer        0       123            0
  …
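As an illustration of steps i and ii, here is a minimal sketch of building such a count matrix. This is not code from the paper; the document names, texts, and vocabulary are made-up toy examples.

```python
# Minimal sketch (not from the paper): build a word-by-document count matrix.
# The document names and texts below are hypothetical toy examples.
from collections import Counter

documents = {
    "Sonnets": "shall i compare thee to a summer day thou art more lovely",
    "Learn C": "a computer program is compiled and then run on the computer",
    "A day at the zoo": "the zebra and the dog watched the lions all day at the zoo",
}

# Step i: the rows are every word type that appears in any document.
vocab = sorted({w for text in documents.values() for w in text.split()})

# Step ii: matrix[i][j] = how many times word i occurred in document j.
doc_counts = [Counter(text.split()) for text in documents.values()]
matrix = [[counts[word] for counts in doc_counts] for word in vocab]

for word, row in zip(vocab, matrix):
    print(f"{word:10s}", row)
```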

• iii.) Transform the matrix
  – a.) Control for word frequency
    • The log transform compresses the effects of frequency
  – b.) Control for the number of contexts each word appeared in
    • Words that occur in few contexts are more informative about those contexts (= reduce uncertainty about their context more) than words that appear in many different contexts
    • E.g. knowing the word 'computer' was common places more constraints on what the document is about than knowing the word 'the' was common
  (a sketch of these two weighting steps follows below)
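As an illustration of steps a and b, here is a rough sketch using a log transform plus an entropy-style weight that downweights words spread over many contexts. The slide does not give the exact formula, so the weighting below is an assumption that captures the idea, not the paper's recipe.

```python
# Sketch of the "transform the matrix" step: a log transform (a) plus a weight
# that downweights words appearing in many contexts (b), in the style of a
# log-entropy weighting. Assumed formula for illustration only; the paper's
# exact transform may differ. Counts are the toy matrix from the earlier example.
import numpy as np

counts = np.array([[6,   1,  7],
                   [0,   2, 46],
                   [0, 123,  0]], dtype=float)   # rows = words, cols = documents

# a.) Compress the effect of raw frequency.
logged = np.log(counts + 1.0)

# b.) Weight each word by how concentrated it is across contexts: a word spread
# evenly over every document (like 'the') gets a weight near 0; a word confined
# to one or two documents gets a weight near 1.
row_totals = np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)
p = counts / row_totals
plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
entropy_weight = 1.0 + plogp.sum(axis=1) / np.log(counts.shape[1])

transformed = logged * entropy_weight[:, np.newaxis]
print(transformed.round(2))
```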

• iii.) Transform the matrix
  – c.) Singular value decomposition
• This reduces dimensionality by 'projecting' the tens of thousands of context dimensions onto a smaller number (roughly 300).
• A mathematical projection is roughly the same as a real projection: Think of shining a light through a three-dimensional pattern and tracing the shadow it casts to get a two-dimensional projection
• The 'discarded' dimensions are those that are least informative = have low variance = are redundant (e.g. a word like 'the' occurred in every context, or a word like 'anti-disestablishmentarianism' occurred in hardly any contexts).

How is the LSA model used?

• To get a measure of how related a word is to another word, measure the distance between the rows (word vectors) for the two words.
  – This gives you a measure of how different the contexts of the two words were: that is, how often a word occurred a different number of times in each context
• You can also take the distance between two document vectors (columns) to get a measure of how related they are.
• You can measure distance by taking the cosine between two vectors
  (a sketch of the SVD step and of comparing word vectors by cosine follows below)
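As an illustration of step c and of the cosine comparisons just described, here is a minimal sketch on the toy matrix: a truncated SVD keeps only the top dimensions (two here, roughly 300 in a real model), and word vectors are then compared by cosine. The matrix and the choice of two dimensions are toy assumptions for the example.

```python
# Sketch of c.) singular value decomposition and of using the model: reduce the
# word-by-document matrix with a truncated SVD, then compare word rows by cosine.
# Toy-sized example; a real LSA space keeps roughly 300 dimensions.
import numpy as np

X = np.array([[6,   1,  7],
              [0,   2, 46],
              [0, 123,  0]], dtype=float)   # rows = dog, zebra, computer

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                        # dimensions kept (~300 in practice)
word_vectors = U[:, :k] * s[:k]              # each row = one word in the reduced space

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

words = ["dog", "zebra", "computer"]
for i in range(len(words)):
    for j in range(i + 1, len(words)):
        print(words[i], words[j], round(cosine(word_vectors[i], word_vectors[j]), 3))
```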

Huh? What's a cosine between vectors?

• They probably forgot to mention in your Grade 9 trigonometry class (as they did in mine) that cosine is extensible to dimensions above 2
  – Typical teaching: always the special case, never the general.
• The dot product of two vectors is the sum of the products of corresponding entries in the two vectors: i.e. (x1*x2) + (y1*y2) + (z1*z2), for two vectors of length 3.
• The dot product of two vectors is the cosine of the angle between those two vectors, multiplied by the lengths of those vectors.
• Therefore, cosine is the dot product divided by the product of the two vector lengths
  (a tiny numeric check appears after the list of findings below)

What are some cool empirical findings?

• i.) LSA models can pass the TOEFL
• ii.) LSA can learn the meanings of words it has never encountered
• iii.) LSA can explain some priming effects
• iv.) LSA replicates human number judgments
• v.) LSA can mark essays
• vi.) LSA-like measures predict LD RTs
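To make the cosine definition above concrete, here is a tiny numeric check on two made-up 3-dimensional vectors (not data from the paper):

```python
# Tiny numeric check of the definition above: cosine = dot product / (|x| * |y|).
# The two 3-d vectors are made-up examples.
import math

x = (1.0, 2.0, 3.0)
y = (4.0, 5.0, 6.0)

dot = sum(a * b for a, b in zip(x, y))          # (x1*y1) + (x2*y2) + (x3*y3) = 32

def length(v):
    return math.sqrt(sum(a * a for a in v))

cos_angle = dot / (length(x) * length(y))
print(round(cos_angle, 4))                      # 0.9746: nearly the same direction
```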

i.) LSA models can pass the TOEFL

• On a 4-possibility multiple choice TOEFL, the model got 51.5% correct (corrected for guessing)
• Chance score is 25%
• Real foreigners hoping to attend American universities averaged 52.7%
  (a sketch of scoring one such item with an LSA space follows below)

ii.) LSA can learn the meanings of words it had never encountered

• So can children!
• By substituting words with nonsense words and controlling access, they showed that the model could learn the meanings of words it had never encountered
• This replicated (and explained) an odd result which had been found in human children, and estimated that most word knowledge was inductive rather than direct.
• The result is not odd when you consider that the meaning of a word is distributed across all vectors with which it shares contexts.
  – You can learn a lot about lions, even if you have never heard of them before, by knowing they are something like tigers.
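As an illustration of how a 4-alternative synonym item could be scored with an LSA space: pick the alternative whose vector has the highest cosine with the stem. The words and vectors below are made-up stand-ins, not the actual TOEFL items or model vectors.

```python
# Sketch of scoring a multiple-choice synonym item with an LSA space: choose the
# alternative whose vector has the highest cosine with the stem word.
# The word vectors are hypothetical stand-ins (e.g. rows of a reduced matrix).
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vectors = {
    "physician": np.array([0.9, 0.1, 0.3]),
    "doctor":    np.array([0.8, 0.2, 0.4]),
    "lawyer":    np.array([0.2, 0.9, 0.1]),
    "teacher":   np.array([0.1, 0.3, 0.9]),
    "farmer":    np.array([0.4, 0.4, 0.4]),
}

stem = "physician"
alternatives = ["doctor", "lawyer", "teacher", "farmer"]
answer = max(alternatives, key=lambda w: cosine(vectors[stem], vectors[w]))
print(answer)   # "doctor" for these made-up vectors
```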

iii.) LSA can explain some priming effects

• The model can explain some priming work using homographs: i.e. testing for 'mole' (the animal) versus 'mole' (the beauty mark).
• If context is marked by word form (either phonological or orthographic), then these words will indeed get overlapping contexts even though they are semantically different

iv.) LSA replicates human number judgments

• Previous work has shown that judgments about number size are best represented on the assumption that numbers are represented as the log of their values.
  – That is, people 'scale down' large numbers
• LSA got the same representation using their contextual occurrences.

v.) LSA can mark essays

• LSA judgments of the quality of sentences correlate at r = 0.81 with expert ratings
• LSA can judge how good an essay (on a well-defined set topic) is by computing the average distance between the essay to be marked and a set of model essays
  – The correlations are equal to between-human correlations
• "If you wrote a good essay and scrambled the words you would get a good grade," Landauer said. "But try to get the good words without writing a good essay!"

vi.) LSA-like measures predict LD RTs

• An LSA-like measure for single words can predict human RTs in lexical decision
• We used 10 words each side of the target word as a 'document' and got distances between all words
• Words close to their nearest neighbours are recognized more quickly than words far away from them, after controlling for other known variables
  (a sketch of the nearest-neighbour measure follows below)
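As an illustration of the nearest-neighbour measure just described: for each word, take the highest cosine it has with any other word in the space. The vectors below are made-up; a real analysis would derive them from the ±10-word windows used as 'documents'.

```python
# Sketch of the nearest-neighbour measure: for each word, the highest cosine it
# has with any other word in the space. Vectors are made-up stand-ins; a real
# analysis would build them from the +/-10-word window 'documents' above.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vectors = {
    "dog":    np.array([0.9, 0.2, 0.1]),
    "cat":    np.array([0.8, 0.3, 0.2]),
    "wrench": np.array([0.1, 0.1, 0.9]),
}

for word, v in vectors.items():
    nearest = max(cosine(v, u) for other, u in vectors.items() if other != word)
    print(word, round(nearest, 3))   # higher = closer neighbour = faster predicted LD RT
```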

Is LSA psychologically plausible?

• Well, the above evidence suggests it might be, and is nicely consistent with much of our talk about mapping between schemas
• Neuro-philosopher Paul Churchland has written: "Explanatory understanding consists of the activation of a specific prototype vector in a well-trained network. It consists in the apprehension of the problematic case as an instance of a general type, a type for which the creature has a detailed and well-informed representation. Such a representation allows the creature to anticipate aspects of the case so far unperceived, and to deploy practical techniques appropriate to the case at hand."
  – Paul Churchland, A Neurocomputational Perspective: The Nature of Mind and the Structure of Science
