
Publisher: Routledge. Informa Ltd, registered in England and Wales, registered number 1072954. Registered office: 5 Howick Place, London SW1P 1WG, UK.

Handbook of Latent Semantic Analysis

Thomas K. Landauer, Danielle S. McNamara, Simon Dennis, Walter Kintsch

How to Use the LSA Web Site

Publication details: https://www.routledgehandbooks.com/doi/10.4324/9780203936399.ch3
Simon Dennis
Published online on: 15 Feb 2007

How to cite: Simon Dennis. 15 Feb 2007, How to Use the LSA Web Site from: Handbook of Latent Semantic Analysis. Routledge. Accessed on: 30 Sep 2021. https://www.routledgehandbooks.com/doi/10.4324/9780203936399.ch3



3

How to Use the LSA Web Site

Simon Dennis
University of Adelaide

Many researchers and students interested in applying latent semantic analysis (LSA) become daunted by the complexities of building spaces and of applying those spaces to new problems. In an effort to provide a simpler interface, the University of Colorado built the LSA Web site (http://lsa.colorado.edu).¹ The Web site contains several precomputed semantic spaces and tools to manipulate those spaces in a number of ways. This chapter provides an overview of the site and describes some of the issues that you will need to be aware of in order to use the site correctly. As with any model, LSA can give poor results if its parameters are not set appropriately. We will highlight some of the most common errors and give advice on how to use the site to generate the best results.

¹ Special thanks are due to Darrell Laham, who was responsible for the initial creation of the site, and to Jose Quesada and Dian Martin, who have been responsible for the maintenance of the site.

In the discussions in this chapter, we will assume that you have a basic understanding of the mechanics of LSA and are familiar with terms such as cosine, pseudodoc, and so on. If this is not the case, then before proceeding you should read chapter 2 in this volume on the mathematical foundations of LSA.

Figure 3.1 shows the home page of the LSA Web site. On the left-hand side is a main menu, which provides fast links to the main areas of the Web site. On the right-hand side are the applications, a series of demonstrations of LSA in action in some applied settings, and a list of links to auxiliary functions such as the site guestbook and a list of LSA publications. This chapter focuses on the applications. First, however, we need to consider two issues that will be relevant regardless of which application you employ, namely, how to choose a semantic space and how many factors to select.

Figure 3.1. The home page of the LSA Web site at http://www.lsa.colorado.edu

SPACES AND FACTORS

When using LSA, a key decision is the nature of the corpus that should form the basis of the semantic space. Some aspects of this decision are obvious. Clearly, one must choose a space that is likely to contain the terms that will be used in the application (including choosing the correct language). LSA simply ignores any terms that did not appear in the corpus from which the space was constructed, so any semantic distinctions that rely on these terms will be lost.

Some other issues are more subtle. As a rule of thumb, the larger the space, the more likely it is that it will contain sufficient knowledge to triangulate the meaning of relevant terms. However, sometimes the general meanings of terms are not the ones that are appropriate in a specific setting. For instance, if one is interested in assessing medical student essays on the circulatory system, it is most likely the case that the term “heart” should refer to the organ that pumps blood. In general usage, however, it is more common for the term “heart” to refer to courage and compassion. For these purposes, it is more appropriate to use a corpus constructed from texts relating to the area of interest. The appendix outlines the semantic spaces that are available on the LSA Web site. These include spaces taken from a number of genres, subject areas, and grade levels, in both English and French. You should look through this list to familiarize yourself with the different options. If none of these is sufficiently similar to the material that you will be dealing with, then it may be necessary to construct your own space. The next chapter outlines this process.

A second issue that arises in all applications of LSA is the number of factors that should be employed. We have found that, in general, about 300 factors seem to work well, and it is rare that fewer than 50 factors give good results. For example, Figure 3.2 shows performance (corrected for guessing) on 80 retired items from the synonym component of the Test of English as a Foreign Language (TOEFL; Landauer & Dumais, 1997). In this task, applicants to U.S. universities from non-English-speaking countries choose from four alternatives the one that is closest in meaning to a stem word. To simulate this task with LSA, the cosine between the stem word and each of the alternatives is calculated and the alternative with the highest cosine is chosen. Note that the performance as a function of the number of dimensions employed peaks at about 300 dimensions. However, the number of dimensions that produces the best performance interacts with the task and the size of the space, so experimentation is often called for.²

² This problem is also discussed in the next chapter. Note that reducing the number of dimensions in an existing space is equivalent to creating a new space with the reduced number of dimensions.

Figure 3.2. Performance on the TOEFL vocabulary task as a function of the number of factors employed.

THE APPLICATIONS

There are five main applications supplied by the LSA Web site:

1. Nearest neighbor: Returns a list of the terms in the LSA space that are closest to a target text.
2. Matrix comparison: Compare a set of texts against each other.
3. Sentence comparison: Submit a sequence of texts and receive the cosines between each adjacent pair (used for calculating textual coherence).
4. One to many comparison: Compare a target text against a set of texts (used for vocabulary testing and essay grading).
5. Pairwise comparison: Submit a sequence of texts and receive the cosines between each pair.

The following sections consider each in turn, pointing out the tasks that can be achieved using the application and discussing the associated parameters.

Nearest Neighbors

The nearest neighbor application returns a list of the terms in the given LSA space that are most similar to a given term, together with the corresponding cosines to the target term. Figure 3.3 shows the nearest neighbor application and Table 3.1 shows the results for the term “dog.”

Figure 3.3. The nearest neighbor tool.

TABLE 3.1
Results From the Nearest Neighbor Application for the Term “Dog” With the Frequency Cutoff Set to 10

LSA Similarity    Term
.99               dog
.86               barked
.86               dogs
.81               wagging
.80               collie
.79               leash
.74               barking
.74               lassie
.72               kennel
.71               wag

As with all of the applications, you must select a semantic space and the number of factors to employ (the maximum number of factors available will be used if you leave this field blank). In addition, the nearest neighbor application allows you to select how many terms will be returned and to set a cutoff for the frequency of these terms. When deciding on the placement of term vectors, the LSA algorithm attempts to place each vector as well as it can given the occurrence information it has from the corpus. If a term appears very infrequently (i.e., it is a very low frequency word or perhaps a typographical mistake), then its position will be very close to the terms that happen to appear in the one document in which it appears, despite the fact that the evidence that it should appear in this location is not strong. Consequently, these terms will often appear quite high in the nearest neighbor lists, despite the fact that they are unlikely to be the terms of interest. The frequency cutoff mechanism allows you to filter these terms from the output. Typically, you would set this to about five, but again it depends on your objectives in using LSA, and experimentation may be called for. Finally, the nearest neighbor results will be influenced by the weighting that is employed. By default, the application is set to pseudodoc, which means that log entropy weighting and inverse singular values are employed (see chap. 2). This is usually most appropriate for nearest neighbor comparisons, but it is also possible to set the weighting to term, which does not employ weighting.

A final warning that applies across the tools is that long word lists may receive a time out error due to the limited capacity of the server. If possible, reduce the size of these before submitting them to the tool. If this is not possible, then you may have to consider setting up your own space as outlined in the next chapter.

Matrix Comparison

The matrix application is designed to allow you to quickly obtain all comparisons of a set of terms or texts. Figure 3.4 shows the matrix application and Table 3.2 shows the results that are returned if the terms “dog,” “cat,” “puppy,” and “kitten” are submitted. When entering the items to be compared, you must leave a blank line between each term or text. The only parameter that needs to be set other than the space and number of factors is the kind of comparison: “term to term” or “document to document.” As in the nearest neighbor application, this parameter controls whether log entropy weighting is used. When “term to term” is set, no weighting is used. When “document to document” is set, all items are log entropy weighted.

Figure 3.4. The matrix tool.

TABLE 3.2
Results From the Matrix Application for the Terms “Dog,” “Cat,” “Puppy,” and “Kitten”

Document    dog     cat     puppy   kitten
Dog         1       0.36    0.76    0.28
Cat         0.36    1       0.38    0.61
Puppy       0.76    0.38    1       0.43
Kitten      0.28    0.61    0.43    1

Sentence Comparison

To comprehend a text, readers must form a connected representation of the content. The ability to do this is heavily influenced by how well the concepts in the text are related to each other, that is, by the coherence of the text. Many factors contribute to coherence, but Foltz, Kintsch, and Landauer (1998) found that by taking the mean of the cosines of the LSA vectors representing successive sentences in a text, one could approximate empirical data on coherence. The sentence comparison application achieves this task. Figure 3.5 shows the application and Table 3.3 shows the results when the nursery rhyme “Incy Wincy Spider” is submitted. The application indicates the cosine between each successive pair of sentences and provides the mean and standard deviation of these values. Note that unlike the matrix application, the sentence comparison application does not require you to separate sentences with blank lines. Simple punctuation is used to segment the text.

Figure 3.5. The sentence comparison tool.

TABLE 3.3
Results From the Sentence Comparison Application for the Nursery Rhyme “Incy Wincy Spider”

COS     SENTENCES
        1: Incy wincy spider climbed up the water spout.
.24
        2: Down came the rain and washed the spider out.
.91
        3: Out came the sunshine and dried up all the rain.
.20
        4: So incy wincy spider climbed up the spout again.

Note. Mean of the Sentence to Sentence Coherence is .45. Standard deviation of the Sentence to Sentence is .32.

One to Many Comparison

The one to many application takes a single text and compares it to a number of other texts. Figure 3.6 shows the application. The primary text is input into the first text box and the comparison texts are entered into the second text box separated by blank lines. Table 3.4 shows a typical output.

Figure 3.6. The one to many tool.

TABLE 3.4
Results From the One to Many Application

Texts     Vector Length
dog       3.36
cat       1.88
puppy     .70
kiten     .00

In function, this application is similar to the matrix comparison application. However, there are a couple of additional facilities that have been included in this application that may prove useful. First, in addition to making “term to term” and “document to document” comparisons, you can also make “term to document” and “document to term” comparisons. In the former case, the primary text will not be weighted, but the comparison texts will be. In the latter case, the reverse is true.

Second, the tool allows you to generate vector lengths (see Table 3.4). Vector lengths give an indication of the amount of information that is encoded by a text, and they can be useful in determining the importance of individual terms or subtexts to the meaning vector associated with a larger text.

Finally, you will note that in the example (Table 3.4), the term “kiten” was misspelled. As outlined previously, LSA ignores such terms and in this tool gives the warning: “WARNING: The word kiten does not exist in the corpus you selected. Results can be seriously flawed. Consult the documentation before proceeding.” You should be especially careful when interpreting results when important terms may not have played a role in the construction of the meaning of the text (Landauer, 2002).

The one to many tool is the one that is used for many demonstrations of LSA. For instance, the one to many tool can be used to answer multiple-choice synonym tests like the Test of English as a Foreign Language (TOEFL; see Landauer & Dumais, 1997) by putting the target into the first textbox and the choice options into the second textbox. The option that has the highest cosine would be the one chosen.

Of all of the demonstrations of the capabilities of LSA, perhaps the most startling and most controversial is its ability to mark essay questions (Landauer & Dumais, 1997). Using the one to many tool, one can simulate essay assessment by putting the summary to be graded into the first textbox and prescored essays into the second textbox. A score can be assigned by taking the mean of the scores that were assigned to the essays that are most similar to the target essay. Such an exercise would be useful in understanding the LSA component of essay scoring mechanisms. One should be aware, however, that current essay grading software incorporates a great deal of additional information beyond the LSA cosines, and so performance in current commercial systems that employ LSA will be significantly better than this exercise might suggest.

Pairwise Comparison

The final application is the pairwise tool (Fig. 3.7), which allows you to submit a list of texts separated by blank lines and returns the comparisons of each of the pairs. Table 3.5 shows the output from the application. Note that unlike the sentence comparison application, in which sentence one is compared to sentence two, sentence two to sentence three, and so on, in the pairwise tool sentences one and two are compared, then three and four, and so on. As with the one to many tool, the pairwise tool allows you to specify “term to term,” “document to document,” “term to document,” and “document to term” comparisons.

Figure 3.7. The pairwise tool.

TABLE 3.5
Results From the Pairwise Application

Texts     dog
Cat       .36

Texts     puppy
Kitten    .43

CONCLUSIONS

The LSA Web site was created by the staff at the University of Colorado to allow researchers and students to investigate the properties of LSA. Many of the common tasks that one would like to accomplish with LSA are possible through the Web site. However, if one wishes to create spaces on the basis of corpora that are not included on the site, or one needs to be able to generate comparisons more rapidly than is possible with a shared resource like the Web site, then it will be necessary to create your own spaces. The process for achieving this is outlined in the next chapter.

REFERENCES

Foltz, P. W., Kintsch, W., & Landauer, T. K. (1998). The measurement of textual coherence with latent semantic analysis. Discourse Processes, 25, 285–308.

Landauer, T. K. (2002). On the computational basis of learning and cognition: Arguments from LSA. In N. Ross (Ed.), The psychology of learning and motivation (Vol. 41, pp. 43–84). New York: Academic Press.

Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240.

APPENDIX
AN OUTLINE OF AVAILABLE SEMANTIC SPACES

Literature

The literature space is composed of English and American literature from the 18th and 19th century (English = 294 works; American = 444 works). This space is a collection of literary text taken from the Project Gutenberg page. The space is composed of 104,852 terms and 942,425 documents, with 338 dimensions. The total number of words is 57,092,140.

Literature With Idioms

Literature with idioms is the same space, but with a different parsing. The words in each of the idioms have been combined into a single token, so that the meaning of the idiom can be separated from the meanings of the individual words. The corpus has been created with 500 factors.

Encyclopedia

This space contains the text from 30,473 encyclopedia articles. There are 60,768 unique terms. There are 371 saved dimensions. Studies show that the optimum dimensionality for this collection is usually 275–350.

Psychology

This space contains the text from three college-level psychology textbooks, with each paragraph used as a document. There are 13,902 documents and 30,119 unique terms. There are 398 saved dimensions. Optimum dimensionality appears to be approximately 300–400.

Small Heart

This small space contains the text from a number of articles about the heart. Each document is a sentence of an article.

General Reading Space

These spaces use a variety of texts, novels, newspaper articles, and other information from the Touchstone Applied Science Associates, Inc. (TASA) corpus used to develop The Educator’s Word Frequency Guide. We are thankful to the kind folks at TASA for providing us with these samples.

The TASA-based spaces break out by grade level: there are spaces for 3rd, 6th, 9th, and 12th grades, plus one for “college” level. These are cumulative spaces, that is, the 6th-grade space includes all the 3rd-grade documents, the 9th-grade space includes all 6th and 3rd, and so forth.

The judgment for inclusion in a grade-level space comes from a readability score on the DRP (degrees of reading power) scale assigned by TASA to each sample. DRP scores in the TASA corpus range from about 30 to about 73. TASA studies determined what ranges of difficulty are being used in different grade levels; for example, the texts used in 3rd-grade classes range from 45–51 DRP units. For the LSA spaces, all documents less than or equal to the maximum DRP score for a grade level are included; for example, the 3rd-grade corpus includes all text samples that score <= 51 DRP units.

Following are the specifics for each space:

name       grade      maxDRP    #docs     #terms    #dims
tasa03     3          51        6,974     29,315    432
tasa06     6          59        17,949    55,105    412
tasa09     9          62        22,211    63,582    407
tasa12     12         67        28,882    76,132    412
tasaALL    college    73        37,651    92,409    419

The breakdown for samples by academic area (in tasaALL):

                   samples    paragraphs
Language Arts      16,044     57,106
Health             1,359      3,396
Home Economics     283        403
Industrial Arts    142        462
Science            5,356      15,569
Social Studies     10,501     29,280
Business           1,079      4,834
Miscellaneous      675        2,272
Unmarked           2,212      6,305
Total              37,651     119,627

French Spaces

There are 8 French semantic spaces:

• Francais-Monde (300) contains 6 months (January to June) of “Le Monde” newspapers, 1993. It contains 20,208 documents, 150,756 different unique words, and 8,675,391 total words.
• Francais-Monde-Extended (300) contains 6 other months (July to December) of “Le Monde” newspapers, 1993.
• Francais-Total (300) is the concatenation of Francais-Monde (300) + Francais-Livres (300).
• Francais-Livres (300) contains books published before 1920: 14,622 documents, 111,094 different unique words, and 5,748,581 total words.
• Francais-Livres1and2 (300) contains books published before 1920 + recent books. Livres1and2 is 119,000 documents.
• Francais-Livres3 (100) is smaller and contains only recent literature with idioms. Livres3 is 26,000 documents.
• Francais-Contes-Total (300) contains traditional tales as well as some recent tales. This semantic space is used to study recall or summary of stories by children and adolescents.
• Francais-Production-Total (300) contains texts written by children from 7 to 12 years in primary school in Belgium and France. There are 830 documents and 3,034 unique terms. This space was created using a stoplist of 439 common words. There are 94 saved dimensions.