

2007:167 CIV MASTER'S THESIS

Increasing Usability Using Semantic Analysis

Md. Farhad Shahid

Luleå University of Technology
MSc Programmes in Engineering, Computer Science and Engineering
Department of Computer Science and Electrical Engineering
Division of Computer Science

2007:167 CIV - ISSN: 1402-1617 - ISRN: LTU-EX--07/167--SE

Increasing Usability Using Semantic Analysis

Md. Farhad Shahid
Luleå University of Technology
Department of Computer Science and Electrical Engineering
Division of Software Engineering
May 2007
Supervisor: Kåre Synnes, Ph.D., Luleå University of Technology

Abstract

Semantic analysis is a part of Artificial Intelligence and a major interest in today's research in Computer Science. Semantic analysis is used to study the meaning of words and fixed word combinations, and how these combine to form the meanings of sentences. When a user enters an input text to search for a particular thing, semantic analysis techniques are used to find the matching content in web text. Semantic analysis thus increases usability both for the users and for the application, and to achieve this usability and extend the facilities offered to users, semantic analysis is developing day by day.

Like humans, machines can nowadays also understand human language to a degree, so machines are used to find certain input texts or documents in online newspapers and other web content. A machine is faster than a human at collecting thousands or millions of news items in a given amount of time.

A few techniques have been developed to understand particular kinds of text or document, such as crime report analysis and résumé recognition. Crime report analysis can find the date, place and time of a crime, which is easier to analyze by machine. But such a system will not work for other types of text, so if we want to find the object name, location or person name in an arbitrary input text, there is no suitable technique available.

This thesis proposes and describes a solution in the area of semantic analysis which identifies the object name, text type and priority in a user's input text. The outcome of this thesis work is satisfactory and can be extended in the future for general-purpose use with more options.


Contents

Acknowledgments ...... IV
Chapter 1 ...... 1
1.1 Introduction ...... 1
1.2 What is semantic analysis? ...... 1
1.3 Background ...... 2
1.4 Why I study? ...... 4
1.5 Brief description ...... 5
Chapter 2 ...... 10
2.1 Literature review ...... 10
2.2 Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis ...... 10
2.3 Probabilistic Latent Semantic Indexing ...... 12
2.4 Indexing by Latent Semantic Analysis ...... 13
2.5 Latent Semantic Analysis for User Modeling ...... 19
2.5.1 Automatic detection of misunderstandings ...... 20
2.5.2 Automatic selection of stimuli ...... 21
2.5.2.1 Selecting the closest sequence ...... 22
2.5.2.2 Selecting the farthest sequence ...... 24
2.5.2.3 Selecting the closest sequence among those that are far enough ...... 24
2.6 A Structure Representation of Word Senses for Semantic Analysis ...... 25
2.7 Natural Language Processing Complexity and Parallelism ...... 30
2.8 Summary ...... 32
Chapter 3 ...... 32
3.1 What I am doing? ...... 32
3.2 Proposed algorithm ...... 32
3.2.1 Object Identify Algorithm ...... 34
3.2.2 Text Type Identify Algorithm ...... 35
3.2.3 Text Priority Algorithm ...... 36
3.3 How the Algorithm works ...... 37
3.4 Implemented Application ...... 39
Chapter 4 ...... 42
4.1 Testing ...... 42


4.2 Result ...... 42
4.3 Evaluation ...... 44
Chapter 5 ...... 45
5.1 Future work ...... 45
5.2 Conclusion ...... 45
References ...... 46


Acknowledgments

The research described in this Master's Thesis was carried out during spring 2007 at Luleå University of Technology, in the Division of Software Engineering.

First I would like to extend my perpetual gratitude to my thesis supervisor Kåre Synnes, PhD, and my senior project manager Johan E. Bengtsson. They gave me full technical support and ideas, encouraged me in this thesis work, and helped me finish all the scheduled work on time.

I want to give special thanks to Dr. David A. Carr for designing the course curriculum for my two-year Master's, and for selecting a responsible course advisor, Kåre Synnes, for me. At the end of my Master's, Kåre Synnes gave me the opportunity to work in the wonderful SMART project.

I would like to express my admiration for my loving parents and sister. Without their continuous support it would have been impossible for me to complete my Master's in Sweden.

My special thanks go to Anna Carin Larsson, who has always provided valuable feedback on educational and administrative matters.

Finally, I would like to express my enthusiastic appreciation to Sweden and the Swedish people for their warm embrace.

Md. Farhad Shahid
May 2007


Dedicated to my loving parents and sister


Chapter 1

1.1 Introduction

The term Artificial Intelligence (AI) was first used by John McCarthy, who considers it to mean "the science and engineering of making intelligent machines" [38]. It can also refer to intelligence as a trait exhibited by an artificial (non-natural, manufactured) entity.

The terms strong and weak AI can be used to narrow the definition when classifying such systems. AI is studied in the overlapping fields of computer science, psychology and engineering, dealing with intelligent behavior, learning and adaptation in machines, generally assumed to be computers.

Research in AI is concerned with producing machines that automate tasks requiring intelligent behavior. Examples include control, planning and scheduling, the ability to answer diagnostic and consumer questions, and handwriting, natural language, speech and facial recognition. As such, the study of AI has also become an engineering discipline, focused on providing solutions to real-life problems: knowledge mining, software applications, strategy games like computer chess, and other video games.

Natural language processing (NLP) is a subfield of artificial intelligence and linguistics. It studies the problems of automated generation and understanding of natural human language. Natural language generation systems convert information from computer databases into normal-sounding human language, and natural language understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate [40].

1.2 What is semantic analysis?

In linguistics, semantic analysis is the process of relating syntactic structures, from the levels of phrases, clauses, sentences and paragraphs to the level of the writing as a whole, to their language-independent meanings, removing features specific to particular linguistic and cultural contexts, to the extent that such a project is possible.


The elements of idiom and figurative speech, being cultural, must also be converted into relatively invariant meanings [41].

1.3 Background

Natural Language Processing (NLP), the use of computers to extract information from input in everyday language, has begun to come of age. Parsing is a very common task in natural language processing. By parsing we mean the process of analyzing a sentence to determine its syntactic structure according to a formal grammar. An example of a parse is given below:

Input: Boeing is located in Seattle.
Output:

(S (NP (N Boeing))
   (VP (V is)
       (VP (V located)
           (PP (P in)
               (NP (N Seattle))))))

Hopcroft and Ullman introduced context-free grammars in 1979. A context-free grammar G = (N, Σ, R, S), where:

• N is a set of non-terminals
• Σ is a set of terminal symbols
• R is a set of rules of the form X → Y1 Y2 … Yn, for n ≥ 0, X ∈ N, Yi ∈ (N ∪ Σ)
• S ∈ N is a distinguished start symbol
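To make the definition concrete, here is a minimal sketch (my own illustration, not part of the thesis) of a brute-force recognizer for a context-free grammar of this form, hard-coding the example grammar listed just below in this section; the GRAMMAR dictionary and function name are illustrative choices.

```python
# Minimal sketch of a CFG recognizer (illustrative, not the thesis code).
# Non-terminals map to alternative right-hand sides; a right-hand-side
# symbol that is not a grammar key is treated as a terminal word.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "VP": [["Vi"], ["Vt", "NP"], ["VP", "PP"]],
    "NP": [["DT", "NN"], ["NP", "PP"]],
    "PP": [["IN", "NP"]],
    "Vi": [["sleeps"]], "Vt": [["saw"]],
    "NN": [["man"], ["woman"], ["telescope"]],
    "DT": [["the"]], "IN": [["with"], ["in"]],
}

def derives(symbol, words):
    """Return True if `symbol` can derive exactly the word sequence `words`."""
    if symbol not in GRAMMAR:                       # terminal symbol
        return len(words) == 1 and words[0] == symbol
    for rhs in GRAMMAR[symbol]:
        if len(rhs) == 1:                           # unary rule X -> Y
            if derives(rhs[0], words):
                return True
        else:                                       # binary rule X -> Y Z
            for i in range(1, len(words)):
                if derives(rhs[0], words[:i]) and derives(rhs[1], words[i:]):
                    return True
    return False

print(derives("S", "the man saw the woman with the telescope".split()))  # True
```

A real parser would also build the tree rather than just recognize the string, but the recursion over rules is the same idea as the top-down parsing mentioned below.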


N = {S, NP, VP, PP, DT, Vi, Vt, NN, IN}
S = S
Σ = {sleeps, saw, man, woman, telescope, the, with, in}
R =
S → NP VP       Vi → sleeps
VP → Vi         Vt → saw
VP → Vt NP      NN → man
VP → VP PP      NN → woman
NP → DT NN      NN → telescope
NP → NP PP      DT → the
PP → IN NP      IN → with
                IN → in

Here S = sentence, VP = verb phrase, NP = noun phrase, PP = prepositional phrase, DT = determiner, Vi = intransitive verb, Vt = transitive verb, NN = noun, IN = preposition.

The above describes the parsing technique, and there are two different kinds of parsing: top-down parsing and bottom-up parsing [39].

Parsing an utterance into its constituents is only one step in processing language. The ultimate goal, for humans as well as natural language processing (NLP) systems, is to understand the utterance, which, depending on the circumstances, may mean incorporating the information provided by the utterance into one's own knowledge base or, more generally, performing some action in response to it. 'Understanding' an utterance is a complex process that depends on the results of parsing, as well as on lexical information, context, and commonsense reasoning, and results in what we will call the SEMANTIC INTERPRETATION IN CONTEXT of the utterance, which is the basis for further action by the language processing agent.

Research in Natural Language Processing (NLP) has identified two aspects of 'understanding' as particularly important for NLP systems. Understanding an utterance means, first of all, knowing what an appropriate response to that utterance is. For example, when we hear the instruction in {1} we know what action is

requested, and when we hear {2} we know that it is a request for a verbal response giving information about the time:

Mix the flour with the water. {1}
What time is it? {2}

Understanding an utterance also involves being able to draw conclusions from what we hear or read, and being able to relate this new information to what we already know. For example, when we semantically interpret {3} we acquire information that allows us to draw some conclusions about the speaker; if we know that Torino is an Italian town, we may also make some guesses about the native language of the speaker, and his preference for a certain type of coffee. When we hear {4}, we may conclude that John bought a ticket and that he is no longer in the city he started his journey from, among other things.

I was born in Torino. {3}
John went to Birmingham by train. {4}

The effect of semantic interpretation depends on the reader's intention and action. A great number of theories of knowledge representation for semantic interpretation have been proposed; these range from the extremely ad hoc to the very general [28].

1.4 Why I study?

As I mentioned in the first chapter, semantic analysis is a part of Artificial Intelligence (AI). Our SMART project's goal is to identify the subject, object, person name, priority and message type in an input message, which can be taken from a mobile phone or from web content. Because it is very difficult to find the subject of a message, I studied the research papers to find suitable algorithms for generating the object name and message type.


1.5 Brief description

The objective of the SMART project is to explore the idea of "reaction media", allowing individuals to engage themselves and take an active part in many situations. Today users often do not provide suggestions, opinions and alarms, due to uncertainty about whom to contact and the effort needed to take an active part.

Figure 1: Prototype of a SMART Project Web Content

The idea is to make "SMART objects" in the physical environment available to be reacted upon by individuals, and to mark those objects with an easily recognizable symbol. In time, that symbol would be universally recognized and embedded in the minds of individuals, so that providing information becomes a natural everyday activity, rather than something that breaks the normal flow of the day.

For developers of products, services and places, SMART enables activating user groups in the development process. For individuals in their roles as citizens, employees etc., SMART makes it possible to easily and directly take an active part in many more situations.


For a region where a SMART system is in use, dynamic growth effects will be achieved by adding precision to innovations in products and processes, and by deeply involving users [42].

Figure 2: Message input layout with object name and ID

The SMART system is an idea built for industrial purposes, so anyone can use his or her mobile phone to give a smart response with an opinion about a particular object. Figure 3 is the web content image where we can see the real database storing the text submitted by the users. Tables such as responz_media, responz_object, responz_posts and users are used to store information about media files, objects, posts and users. The responz_media table stores information about image, audio and video files; Figure 4 is a view of the media table, where we can see that the image files are in jpg or gif format, the audio files in amr format, and the video files in 3gp format. The responz_object table stores information about the object name, ID, text type etc.; Figure 5 shows the object table. Another table is responz_posts, which stores information about text priority, text owner etc.


Figure 3: Database running on web server

Figure 4: Media information database running on web server


Figure 5: Object information database running on web server

This is the web client, which serves internet users and stores the information submitted by mobile. Figure 6 shows the mobile interface used for submitting the users' opinions. The package for the mobile client is downloadable from the internet.

Figure 6: Mobile Client Interface for the SMART System


My part is to understand the input message using semantic analysis and find the subject, object, person name, priority and message type. In Figure 2 we can see that we can input the object name, ID and a message containing an opinion, suggestion, report or proposal.

But for this we must first know whether any existing algorithm can perform such an operation or not. That is why I selected some research papers to study, in order to find solutions for the semantic analysis.


Chapter 2

2.1 Literature review

To gain more knowledge about semantic analysis I have selected some research papers. From these papers I learned some new techniques, and the advantages and disadvantages of methods which are used nowadays for natural language processing. Some of the research papers and their methods and techniques are described below.

2.2 Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis

Y. Gong and X. Liu proposed two generic text summarization methods that create text summaries by ranking and extracting sentences from the original documents. This is an attempt to create a summary with a wider coverage of the document's main content and less redundancy. A query is used for text summarization, but it does not make any sense for the overall document. As mentioned before, two methods are proposed, and both methods need to first decompose the document into individual sentences and create a weighted term-frequency vector for each of the sentences. Let Ti = [t1i t2i … tni]^T be the term-frequency vector of passage i, where element tji denotes the frequency with which term j occurs in passage i. Here passage i could be a phrase, a sentence, a paragraph of the document, or the whole document itself. The weighted term-frequency vector Ai = [a1i a2i … ani]^T of passage i is defined as:

aji = L(tji) · G(tji)

where L(tji) is the local weighting for term j in passage i, and G(tji) is the global weighting for term j in the whole document. When the weighted term-frequency vector Ai is created, we further have the choice of using Ai in its original form, or normalizing it by its length |Ai|.

I would like to explain summarization by Latent Semantic Analysis because I want to know how semantic analysis works for text summarization.
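As a concrete illustration, the sketch below (my own, not from the paper) builds one weighted term-frequency vector Ai with a common choice of weights, L(tji) = log(1 + tf) locally and an inverse-document-frequency-style global weight; the paper leaves the exact choice of L and G open.

```python
import math

def weighted_tf_vector(passage, vocab, doc_freq, n_passages):
    """Build A_i = [a_1i ... a_ni] with a_ji = L(t_ji) * G(t_ji).
    vocab maps term -> row index; doc_freq maps term -> number of
    passages containing it. The weighting choices are illustrative."""
    counts = {}
    for word in passage.lower().split():
        if word in vocab:
            counts[word] = counts.get(word, 0) + 1
    a = [0.0] * len(vocab)
    for word, tf in counts.items():
        local = math.log(1 + tf)                          # L(t_ji)
        global_w = math.log(n_passages / doc_freq[word])  # G(t_ji), idf-style
        a[vocab[word]] = local * global_w
    return a
```

A term that occurs in every passage gets a global weight of zero, which matches the intent of global weighting: terms common to the whole document carry little discriminating information.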


LSA (Latent Semantic Analysis), inspired by LSI (Latent Semantic Indexing), applies the singular value decomposition (SVD) to generic text summarization. The process starts with the creation of a term-by-sentence matrix A = [A1 A2 … An], with each column vector Ai representing the weighted term-frequency vector of sentence i in the document under consideration. If there are a total of m terms and n sentences in the document, then we have an m × n matrix A for the document. Since every word does not normally appear in each sentence, the matrix A is usually sparse.

Given an m × n matrix A, where without loss of generality m ≥ n, the SVD of A is defined as [35]:

A = U Σ V^T

where U = [uij] is an m × n column-orthonormal matrix whose columns are called left singular vectors; Σ = diag(σ1, σ2, …, σn) is an n × n diagonal matrix whose diagonal elements are non-negative singular values sorted in descending order; and V = [vij] is an n × n orthonormal matrix whose columns are called right singular vectors. If rank(A) = r, then Σ satisfies

σ1 ≥ σ2 ≥ … ≥ σr > σr+1 = … = σn = 0.

The interpretation of applying the SVD to the term-by-sentence matrix A can be made from two different viewpoints. From the transformation point of view, the SVD derives a mapping between the m-dimensional space spanned by the weighted term-frequency vectors and the r-dimensional singular vector space with all of its axes linearly independent. This mapping projects each column vector i of matrix A, which represents the weighted term-frequency vector of sentence i, onto column vector ψi = [vi1 vi2 … vir]^T of matrix V^T, and maps each row vector j of matrix A, which tells the occurrence count of term j in each of the documents, onto row vector φj = [uj1 uj2 … ujr] of matrix U. Here each element vix of ψi and ujy of φj is called the index with the x'th and y'th singular vectors, respectively.

Based on the above discussion, they proposed the following SVD-based document summarization method:

1. Decompose the document D into individual sentences, use these sentences to form the candidate sentence set S, and set k = 1.
2. Construct the term-by-sentence matrix A for the document D.


3. Perform the SVD on A to obtain the singular value matrix Σ and the right singular vector space, in which each sentence i is represented by the column vector ψi = [vi1 vi2 … vir]^T of V^T.
4. Select the k'th right singular vector from matrix V^T.
5. Select the sentence which has the largest index value with the k'th right singular vector, and include it in the summary.
6. If k reaches the predefined number, terminate the operation; otherwise, increment k by one, and go to Step 4.

This method is used to create a summary with a wider coverage of the document's content and less redundancy. For experimental evaluation, a database consisting of two months of the CNN Worldview news programs was constructed, and the performance of the summarization methods was evaluated by comparing the machine-generated summaries with the manual summaries created by three independent human evaluators. The method produced quite compatible performance scores [12].
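The six steps above can be sketched with numpy's SVD as follows; this is an illustrative reimplementation using plain term counts rather than Gong and Liu's weighted vectors, and the function name is my own.

```python
import numpy as np

def svd_summary_indices(sentences, k):
    """Pick k sentence indices following the SVD-based method above:
    the i-th summary sentence is the one with the largest index value
    along the i-th right singular vector of the term-by-sentence matrix."""
    vocab = {w: idx for idx, w in
             enumerate(sorted({w for s in sentences for w in s.lower().split()}))}
    A = np.zeros((len(vocab), len(sentences)))   # m terms x n sentences
    for j, sent in enumerate(sentences):
        for w in sent.lower().split():
            A[vocab[w], j] += 1.0
    _, _, Vt = np.linalg.svd(A, full_matrices=False)  # rows of Vt = right singular vectors
    return [int(np.argmax(np.abs(Vt[i]))) for i in range(min(k, Vt.shape[0]))]
```

Because successive right singular vectors are orthogonal, each step tends to pick a sentence about a different dominant "topic" of the document, which is the source of the wider coverage and reduced redundancy claimed above.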

2.3 Probabilistic Latent Semantic Indexing

One of the most popular families of information retrieval techniques is based on the Vector Space Model (VSM) for documents. A VSM variant is characterized by three ingredients: (i) a transformation function (also called the local term weight), (ii) a term weighting scheme (also called the global term weight), and (iii) a similarity measure. In their experiments they have utilized (i) a representation based on the (untransformed) term frequencies (tf) n(d, w), which has been combined with (ii) the popular inverse document frequency (idf) term weights, and (iii) the standard cosine matching function. The same representation applies to queries q, such that the matching function for the baseline methods can be written as:

s(d, q) = Σw n̂(d, w) n̂(q, w) / ( sqrt(Σw n̂(d, w)²) · sqrt(Σw n̂(q, w)²) )


where n̂(d, w) = idf(w) · n(d, w) are the weighted word frequencies.

In latent semantic indexing, the original vector space representation of documents is replaced by a representation in the low-dimensional latent space, and the similarity is computed based on that representation. They actually consider linear combinations of the original similarity score (weight λ) and the one derived from the latent space representation (weight 1 − λ).

The performance of PLSI has been systematically compared with the standard term matching method based on the raw term frequencies (tf) and their combination with the inverse document frequencies (tf-idf), as well as with LSI [14].
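The baseline matching function s(d, q) can be written out directly; the small sketch below (my naming, not from the paper) computes the cosine of the idf-weighted term-frequency vectors n̂(d, w) = idf(w)·n(d, w).

```python
import math
from collections import Counter

def match_score(doc_words, query_words, idf):
    """Cosine of idf-weighted term-frequency vectors, as in s(d, q) above.
    idf maps term -> inverse document frequency weight (assumed given)."""
    nd = {w: idf.get(w, 0.0) * c for w, c in Counter(doc_words).items()}
    nq = {w: idf.get(w, 0.0) * c for w, c in Counter(query_words).items()}
    num = sum(nd[w] * nq.get(w, 0.0) for w in nd)
    den = (math.sqrt(sum(v * v for v in nd.values()))
           * math.sqrt(sum(v * v for v in nq.values())))
    return num / den if den else 0.0
```

A document identical to the query scores 1.0, and a document sharing no weighted terms scores 0.0, which is the usual range of the cosine measure for non-negative weights.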

2.4 Indexing by Latent Semantic Analysis

In this research paper Deerwester S. used the singular value decomposition technique, in which a large term-by-document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100-item vectors of factor weights. The approach is designed to overcome a fundamental problem that plagues existing retrieval techniques that try to match the words of queries with the words of documents. The proposed approach tries to overcome the deficiencies of term-matching retrieval by treating the unreliability of observed term-document association data as a statistical problem. There is some underlying latent semantic structure in the data that is partially obscured by the randomness of word choice with respect to retrieval. They use statistical techniques to estimate this latent structure, and get rid of the obscuring "noise". A description of terms and documents based on the latent semantic structure is used for indexing and retrieval.

In this paper they have used singular value decomposition. They take a large matrix of term-document association data and construct a "semantic" space wherein terms and documents that are closely associated are placed near one another. Singular value decomposition allows the arrangement of the space to reflect the major associative patterns in the data, and ignore the smaller, less important influences. As a result, terms that did not actually appear in a document may still end up close to the

13 Literature Review document, if that is consistent with the major patterns of association in the data. Positioninthespacethenservesasthenewkindofsemanticindexing,andretrieval byusingtheterminaquerytoidentifyapointinthespace,anddocumentsinits neighborhoodarereturntotheuser[8]. Afundamentaldeficiencyofcurrentinformationretrievalmethodsisthatthewords searchersuseoftenarenotthesameasthosebywhichtheinformationtheyseekhas been indexed. There are actually two sides to the issue; they called them broadly synonymy and polysemy .Theyusedsynonymy inaverygeneralsensetodescribethe factthattherearemanywaystorefertothesameobject.Usersindifferentcontextsor with different needs, knowledge, or linguistic habits will describe the same informationusingdifferentterms.Indeed,wehavefoundthatthedegreeofvariability indescriptivetermusageismuchgreaterthaniscommonlysuspected.Forexample, twopeoplechoosethesamemainkeywordforasinglewellknownobjectlessthan 20% of the time [9]. Comparably poor agreement has been reported in studies of interindexerconsistency[31]andintheGenerationofsearchtermsbyeitherexpert intermediaries [10] or less experienced searchers [19] [3]. The prevalence of synonymstendstodecreasethe"recall"performanceofretrievalsystems. Bypolysemywerefertothegeneralfactthatmostwordshavemorethanonedistinct meaning (homograph). In different contexts or when used by different people the same term (e.g. "chip") takes on varying referentialsignificance.Thustheuseofa term in a search query does not necessarily mean that a document containing or labeled by the same term is of interest. Polysemy is one factor underlying poor "precision". Thefailureofcurrentautomaticindexingtoovercometheseproblemscanbelargely traced to three factors. 
The first factor is that thewayindextermsareidentifiedis incomplete.Thetermsusedtodescribeorindexadocumenttypicallycontainonlya fractionofthetermsthatusersasagroupwilltrytolookitupunder.Thisispartly becausethedocumentsthemselvesdonotcontainallthetermsuserswillapply,and sometimesbecausetermselectionproceduresintentionallyomitmanyofthetermsin adocument.


Attempts to deal with the synonymy problem have relied on intellectual or automatic term expansion, or the construction of a thesaurus. These are presumably advantageous for conscientious and knowledgeable searchers who can use such tools to suggest additional search terms. The drawback of fully automatic methods is that some added terms may have a different meaning from that intended (the polysemy effect), leading to rapid degradation of precision [30].

It is worth noting in passing that experiments with small interactive databases have shown monotonic improvements in recall rate, without overall loss of precision, as more indexing terms, either taken from the documents or from large samples of actual users' words, are added [11][13].

Whether this "unlimited aliasing" method, which we have described elsewhere, will be effective in very large databases remains to be determined. Not only is there a potential lack of precision, but the problem of identifying index terms that are not in the text of documents grows cumbersome. This was one of the motives for the approach described here.

The second factor is the lack of an adequate automatic method for dealing with polysemy. One common approach is the use of controlled vocabularies and human intermediaries to act as translators. Not only is this solution extremely expensive, but it is not necessarily effective. Another approach is to allow Boolean intersection or coordination with other terms to disambiguate meaning. Success is severely hampered by users' inability to think of appropriate limiting terms if they do exist, and by the fact that such terms may not occur in the documents or may not have been included in the indexing.

The third factor is somewhat more technical, having to do with the way in which current automatic indexing and retrieval systems actually work. In such systems each word type is treated as independent of any other (see, for example, van Rijsbergen

[32]). Thus matching (or not) on both of two terms that almost always occur together is counted as heavily as matching on two that are rarely found in the same document. Thus the scoring of success, in either straight Boolean or coordination-level searches, fails to take redundancy into account, and as a result may distort results to an unknown

degree. This problem exacerbates a user's difficulty in using compound-term queries effectively to expand or limit a search.

In this paper they explain the SVD, where the latent semantic structure analysis starts with a matrix of terms by documents. This matrix is then analyzed by singular value decomposition (SVD) to derive the particular latent semantic structure model. Singular value decomposition is closely related to a number of mathematical and statistical techniques in a wide variety of other fields, including eigenvector decomposition, spectral analysis, and factor analysis. They use the terminology of factor analysis, since that approach has some precedence in the information retrieval literature.

The traditional, one-mode factor analysis begins with a matrix of associations between all pairs of one type of object, e.g., documents [4]. This might be a matrix of human judgments of document-to-document similarity, or a measure of term overlap computed for each pair of documents from an original term-by-document matrix. This square symmetric matrix is decomposed, by a process called "eigen-analysis", into the product of two matrices of a very special form (containing "eigenvectors" and "eigenvalues"). These special matrices show a breakdown of the original data into linearly independent components or "factors". In general many of these components are very small, and may be ignored, leading to an approximate model that contains many fewer factors. Each of the original documents' similarity behavior is now approximated by its values on this smaller number of factors. The result can be represented geometrically by a spatial configuration in which the dot product or cosine between vectors representing two documents corresponds to their estimated similarity.

In two-mode factor analysis one begins not with a square symmetric matrix relating pairs of only one type of entity, but with an arbitrary rectangular matrix with different entities on the rows and columns, e.g., a matrix of terms and documents. This rectangular matrix is again decomposed into three other matrices of a very special form, this time by a process called "singular value decomposition" (SVD). (The resulting matrices contain "singular vectors" and "singular values".) As in the one-mode case these special matrices show a breakdown of the original relationships into

linearly independent components or factors. Again, many of these components are very small, and may be ignored, leading to an approximate model that contains many fewer dimensions. In this reduced model all the term-term, document-document and term-document similarities are now approximated by values on this smaller number of dimensions. The result can still be represented geometrically by a spatial configuration in which the dot product or cosine between vectors representing two objects corresponds to their estimated similarity.

Thus, for information retrieval purposes, SVD can be viewed as a technique for deriving a set of uncorrelated indexing variables or factors; each term and document is represented by its vector of factor values. Note that by virtue of the dimension reduction, it is possible for documents with somewhat different profiles of term usage to be mapped into the same vector of factor values. This is just the property we need to accomplish the improvement over unreliable data proposed earlier. Indeed, the SVD representation, by replacing individual terms with derived orthogonal factor values, can help to solve all three of the fundamental problems described above.

In various problems, they have approximated the original term-document matrix using 50-100 orthogonal factors or derived dimensions. Roughly speaking, these factors may be thought of as artificial concepts; they represent extracted common meaning components of many different words and documents. Each term or document is then characterized by a vector of weights indicating its strength of association with each of these underlying concepts. That is, the "meaning" of a particular term, query, or document can be expressed by k factor values, or equivalently, by the location of a vector in the k-space defined by the factors. The meaning representation is economical, in the sense that N original index terms have been replaced by the k factor values.

We do not want the derived k-dimensional factor space to reconstruct the original term space perfectly, because we believe the original term space to be unreliable. Rather we want a derived structure that expresses what is reliable and important in the underlying use of terms as document referents.

Unlike many typical uses of factor analysis, we are not necessarily interested in reducing the representation to a very low dimensionality, say two or three factors, because we are not interested in being able to visualize the space or understand it. But we do wish both to achieve sufficient power and to minimize the degree to which the space is distorted. We believe that the representation of the conceptual space for any large document collection will require more than a handful of underlying independent "concepts", and thus that the number of orthogonal factors that will be needed is likely to be fairly large. Moreover, we believe that the model of a Euclidean space is at best a useful approximation. In reality, conceptual relations among terms and documents certainly involve more complex structures, including, for example, local hierarchies and non-linear interactions between meanings. More complex relations can often be made to approximately fit a dimensional representation by increasing the number of dimensions. In effect, different parts of the space will be used for different parts of the language or object domain. Thus we have reason to avoid both very low and extremely high numbers of dimensions. In between we are guided only by what appears to work best. What we mean by "works best" is not (as is customary in some other fields) what reproduces the greatest amount of variance in the original matrix, but what will give the best retrieval effectiveness.

How do we process a query in this representation? Recall that each term and document is represented as a vector in k-dimensional factor space. A query, just as a document, initially appears as a set of words. We can represent a query (or "pseudo-document") as the weighted sum of its component term vectors.
(Note that the location of each document can be similarly described; it is a weighted sum of its constituent term vectors.) To return a set of potential candidate documents, the pseudo-document formed from a query is compared against all documents, and those with the highest cosines, that is, the nearest vectors, are returned. Generally either a threshold is set for closeness of documents and all those above it are returned, or the n closest are returned. (We are concerned with the issue of whether the cosine measure

is the best indication of similarity for predicting human relevance judgments, but we have not yet systematically explored any alternatives; cf. Jones & Furnas [16].)

A concrete example may make the procedure and its putative advantages clearer; such an example, along with the technical details, is given in Deerwester S.'s research paper [8].
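This retrieval scheme can be sketched compactly (an illustrative reconstruction, not Deerwester's code): reduce the term-by-document matrix to k factors with the SVD, fold the query in as a pseudo-document, and rank documents by cosine.

```python
import numpy as np

def lsa_query_scores(A, q, k):
    """A: m-terms x n-docs matrix; q: length-m query term vector.
    In the k-factor space, document j sits at row j of V_k, and the
    query is folded in as q_hat = Sigma_k^{-1} U_k^T q; documents are
    then scored by the cosine against q_hat (illustrative sketch)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, docs = U[:, :k], s[:k], Vt[:k].T     # docs: row j = doc j in k-space
    q_hat = (q @ Uk) / sk                        # pseudo-document coordinates
    def cosine(a, b):
        na, nb = np.linalg.norm(a), np.linalg.norm(b)
        return float(a @ b / (na * nb)) if na > 0 and nb > 0 else 0.0
    return [cosine(q_hat, d) for d in docs]
```

With k smaller than the rank of A, two documents that never share a literal term can still score similarly if they use terms with similar co-occurrence patterns, which is exactly the synonymy effect the paper is after.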

2.5 Latent Semantic Analysis for User Modeling

Latent semantic analysis (LSA) is a tool for extracting semantic information from texts, as well as a model of language learning based on exposure to texts. The authors also designed tutoring strategies to automatically detect lexeme misunderstandings and to select, among the various examples of a domain, the one which is best to expose the student to.

LSA is both a tool for representing the meaning of words [8] and a cognitive model of learning [20]. LSA analyses large amounts of text by means of a statistical method and represents the meaning of each word as a vector in a high-dimensional space. Pieces of text are also represented in this semantic space. Semantic comparisons between words or pieces of text are made by computing the cosine between them. They rely on this tool to represent both domain and student knowledge in a tutoring system. In their field (language learning), domain knowledge is composed of the usual meaning of words as well as textual materials. Student knowledge is composed of the student's meaning of words.

They designed two tutoring strategies based on this dual knowledge representation. The first one automatically detects student misunderstandings from the analysis of what he/she has written. The second one selects, among the various textual stimuli a student can be exposed to, the one which is supposed to be the best for improving learning.

19 Literature Review

2.5.1 Automatic detection of misunderstandings

The meaning of a lexeme is given by all the lexemes close to it. This is akin to the Saussurian point of view that the meaning (of a word) is determined by what surrounds it [29]. An example results from an analysis of a small database of animal features: the closest lexemes to the lexeme eats meat are fawn-colored (.51), tiger (.36), has black spots (.20), etc.

That representation allows the design of a method to automatically detect lexeme misunderstandings. The idea is to take, for each lexeme written by the student, the neighboring lexemes in the student semantic space. Then, these semantic proximities are compared between the student semantic space and the domain semantic space. If there is too big a difference between the semantic proximities in the two spaces, it means that the student misunderstands that lexeme. For instance, if in the student space the word pillow is close to drugs, codeine, aspirin and dosage, the system says that the student does not have a correct understanding of the word pillow.

To be more formal, for each lexeme X of the learner space, we consider its α closest lexemes X1, ..., Xα. Then, for each Xi, we compute the difference:

    |proximity_domain space(X, Xi) − proximity_student space(X, Xi)|

We therefore obtain α differences. The smaller these differences are, the better the understanding of X by the learner. These differences are classified into two intervals: [0, λ[ and [λ, 2]. The values α and λ need to be defined experimentally; from the authors' experience, values such as 20 for α and 0.2 for λ would be good starting points.

The understanding of X by the learner is determined from the distribution of the differences over the two intervals: if most of the Xi belong to [0, λ[, the meaning of X is considered well understood; else, the meaning of X is considered not well understood.


Figure 7 presents the algorithm. The "most of" above was implemented as the requirement that two thirds of the Xi belong to the corresponding category. The list of lexemes that are likely to be misunderstood can then be used directly by a teacher, or by the pedagogical module of a tutoring system, in order to select the appropriate learning materials.

Figure 7: Algorithm for the automatic detection of the misunderstanding of lexeme X.
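A hedged sketch of this detection procedure, with "most of" implemented as two thirds as in Figure 7. The proximity values below are illustrative; a real system would obtain them as LSA cosines in the two spaces.

```java
// Sketch of the misunderstanding detector: compare the proximities of a
// lexeme X to its closest neighbours, as measured in the domain space and
// in the student space, and count how many differences fall in [0, lambda[.
public class MisunderstandingDetector {

    // domainProx[i] and studentProx[i] are the proximities of X to its i-th
    // closest neighbour in the domain space and student space respectively.
    static boolean isUnderstood(double[] domainProx, double[] studentProx,
                                double lambda) {
        int small = 0; // differences falling in [0, lambda[
        for (int i = 0; i < domainProx.length; i++) {
            double diff = Math.abs(domainProx[i] - studentProx[i]);
            if (diff < lambda) small++;
        }
        // "most of" is implemented, as in Figure 7, as two thirds.
        return small * 3 >= 2 * domainProx.length;
    }

    public static void main(String[] args) {
        double[] domain      = {0.51, 0.36, 0.20};
        double[] wellMatched = {0.45, 0.30, 0.25}; // close to the domain values
        double[] divergent   = {0.05, 0.90, 0.70}; // far from the domain values
        System.out.println(isUnderstood(domain, wellMatched, 0.2)); // true
        System.out.println(isUnderstood(domain, divergent, 0.2));   // false
    }
}
```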

2.5.2 Automatic selection of stimuli

In the LSA model, learning results from exposure to sequences of lexemes, as mentioned before. The idea that learning a second language is essentially based on exposure to the language, and not only on explanation of the rules of that language, is nowadays recognized by researchers in second language acquisition [17]. By being exposed to sequences of lexemes in a random fashion, a student would certainly learn some lexemes in the same way a child learns new words by reading various books.

However, the process of learning could be sped up by selecting the right sequence of lexemes given the current state of the student entities. Therefore, the problem is to know which text (for language learning) or which move (for game learning) has the highest chance of enlarging the part of the semantic space covered by the student entities.


2.5.2.1 Selecting the closest sequence

Suppose we decide to select the sequence which is the closest to the student sequences. If {s1, s2, ..., sn} are the student sequences and {d1, d2, ..., dp} the domain sequences, we select the dj for which

    Σ (i = 1 to n) proximity(si, dj)

is minimal. Figure 8 shows that selection in a two-dimensional representation (keep in mind that LSA works because it uses many dimensions). Domain entities are represented by black squares and student entities by white squares.

Figure 8: Selecting the closest sequence.

Let us illustrate this by means of an example. Suppose the domain is composed of 82 sequences of lexemes, each corresponding to an Aesop's fable. Then suppose that a beginner student was asked to provide an English text in order for the process to be initiated. The user model is composed of only this sequence of lexemes:

My English is very basic. I know only a few verbs and a few nouns. I live in a small village in the mountains. I have a beautiful brown cat whose name is Felix. Last week, my cat caught a small bird and I was very sorry for the bird. He was injured. I tried to save it but I could not. The cat did not understand why I was unhappy. I like walking in the forest and in the mountains. I also like skiing in the winter. I would like to improve my English to be able to work abroad. I have a brother and a sister. My brother is young.

Running LSA, the closest domain sequence is the following:

Long ago, the mice had a general council to consider what measures they could take to outwit their common enemy, the Cat. Some said this, and some said that; but at last a young mouse got up and said he had a proposal to make, which he thought would meet the case. "You will all agree," said he, "that our chief danger consists in the sly and treacherous manner in which the enemy approaches us. Now, if we could receive some signal of her approach, we could easily escape from her. I venture, therefore, to propose that a small bell be procured, and attached by a ribbon round the neck of the Cat. By this means we should always know when she was about, and could easily retire while she was in the neighborhood." This proposal met with general applause, until an old mouse got up and said: "That is all very well, but who is to bell the Cat?" The mice looked at one another and nobody spoke. Then the old mouse said: It is easy to propose impossible remedies.

It is hard to tell why this text ought to be the easiest for the student. A first answer would be to observe that several words of the fable already occurred in the student's text (like cat, young, small, know, etc.). However, LSA is not limited to occurrence recognition: the mapping between domain and student knowledge is more complex. A second answer is that the writer of the first text actually found that fable the easiest of a set of 10 randomly selected ones. A third answer is that LSA has been validated several times as a model of knowledge representation; however, experiments with many subjects need to be performed to validate this particular use of LSA.

Although the closest sequence could be considered the easiest by the student, it is probably not suited for learning, because it is in fact too close to the student's knowledge.


2.5.2.2 Selecting the farthest sequence

Another solution would then be to choose the farthest sequence (Figure 9). In our example, this would return:

A Horse and an Ass were traveling together, the Horse prancing along in its fine trappings, the Ass carrying with difficulty the heavy weight in its panniers. "I wish I were you," sighed the Ass; "nothing to do and well fed, and all that fine harness upon you." Next day, however, there was a great battle, and the Horse was wounded to death in the final charge of the day. His friend, the Ass, happened to pass by shortly afterwards and found him on the point of death. "I was wrong," said the Ass: Better humble security than gilded danger.

Figure 9: Selecting the farthest sequence.

That sequence was found quite hard to understand by our writer. Choosing the farthest sequence is therefore probably not appropriate for learning either, because it is too far from the student's knowledge.

2.5.2.3 Selecting the closest sequence among those that are far enough

Neither of the previous solutions being satisfactory, a solution would then be to ignore domain sequences that are too close to any of the student sequences. A zone is therefore defined around each student sequence, and domain sequences inside these zones are not considered (a way of implementing that procedure is presented in the next section). Then, using the same process described in the previous section, we select the closest sequence from the remaining ones. Figure 10 illustrates this selection.

Figure 10: Selecting the next stimulus: the closest among those that are far enough.

The idea that learning is optimal when the stimulus is neither too close to nor too far from the student's knowledge was theorized by Vygotsky [34] with the notion of the zone of proximal development. He influenced Krashen [17], who defined the input hypothesis as an explanation of how a second language is acquired: the learner improves his or her linguistic competence when receiving second language 'input' which is one step beyond his or her current stage of linguistic competence [37]. An experiment based on a similar idea was performed by Wolfe et al. [36]. They showed that learning was greatest for texts that were neither too easy nor too difficult.
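The "closest among those that are far enough" strategy can be sketched as follows. This assumes a precomputed student-to-domain proximity matrix and, following the minimization formula in 2.5.2.1, treats proximity as a distance (smaller means closer); all numbers and names are illustrative, not real LSA values.

```java
// Sketch of selecting the next stimulus: exclude domain sequences inside
// any student zone, then pick the closest of the remaining ones.
public class StimulusSelector {

    // Sum of distances from domain sequence j to all student sequences.
    static double totalDistance(double[][] dist, int j) {
        double sum = 0;
        for (double[] row : dist) sum += row[j];
        return sum;
    }

    // dist[i][j] is the distance between student sequence i and domain
    // sequence j; zoneRadius defines the zone around each student sequence.
    // Returns the index of the chosen domain sequence, or -1 if none qualifies.
    static int closestFarEnough(double[][] dist, double zoneRadius) {
        int best = -1;
        double bestSum = Double.POSITIVE_INFINITY;
        for (int j = 0; j < dist[0].length; j++) {
            boolean insideZone = false;
            for (double[] row : dist)
                if (row[j] < zoneRadius) insideZone = true; // too close
            double sum = totalDistance(dist, j);
            if (!insideZone && sum < bestSum) { bestSum = sum; best = j; }
        }
        return best;
    }

    public static void main(String[] args) {
        // One student sequence; three domain sequences at distances 0.1, 0.5, 2.0.
        double[][] dist = { {0.1, 0.5, 2.0} };
        // Sequence 0 is inside the zone, sequence 2 is needlessly far:
        System.out.println(closestFarEnough(dist, 0.3)); // prints: 1
    }
}
```

Dropping the zone test (zoneRadius = 0) yields the plain "closest sequence" strategy of 2.5.2.1, and maximizing instead of minimizing the summed distance yields the "farthest sequence" strategy of 2.5.2.2.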

2.6 A Structured Representation of Word Senses for Semantic Analysis

The paper by Pazienza [26] focuses on semantic knowledge representation issues; however, many other issues related to natural language processing are dealt with as well. The purpose of this section is to give a brief overview of the text understanding system and its current status of implementation. Figure 11 shows the three modules of the text analyzer.

All the modules are implemented in VM/PROLOG and run on an IBM 3812 mainframe. The morphology module associates at least one lemma with each word; in Italian this task is particularly complex due to the presence of recursive generation mechanisms, such as alterations, nominalization of verbs, etc. For example, from the lemma casa (home) it is possible to derive the words casetta (little home), casettina (nice little home), casettinaccia (ugly nice little home) and so on. At present, the morphology is complete, and uses for its analysis a lexicon of 7000 lemmata [1].

The syntactic analysis determines syntactic attachments between words by verifying grammar rules and forms agreement; the system is based on a context-free grammar [1]. Italian syntax is also more complex than English: in fact, sentences are usually composed of nested hypotactic phrases, rather than paratactically linked ones. For example, a sentence like "John goes with his girlfriend Mary to the house by the river to meet a friend for a pizza party" might sound odd in English but is a common sentence structure in Italian.

Syntactic relations only reveal the surface structure of a sentence. A main problem is to determine the correct prepositional attachments between words: it is the task of semantics to make the meaning of prepositions explicit and to detect the relations between words. The task of disambiguating word senses and relating them to each other is automatic for a human being but is among the hardest for a computer-based natural language system. The semantic knowledge representation model presented in the paper does not claim to solve the natural language processing problem, but seems to give promising results in combination with the other system components.


[Figure 11 shows (a) the text analyzer, whose three modules are applied in sequence: morphology (using a lexicon and desinence models), syntactics (using grammar rules) and semantics (using a conceptual dictionary); and (b) a sample output for the phrase "The Prime Minister ... decides a meeting with parties ...": its morphological tags (decides = verb, 3rd sing. pres.; meeting = noun, sing. masc.; parties = noun, plur. masc.), its parse tree (VP -> decides, NP "a meeting", PP "with parties"), and the resulting conceptual structure MEETING-(PARTIC)-POL_PARTY.]
Figure 11: Scheme of the Text Understanding System.

The semantic processor consists of a semantic knowledge base and a parsing algorithm. The semantic database presently consists of 850 word-sense definitions; each definition includes on average 20 elementary graphs. Each graph is represented by a pragmatic rule (PR), with the form:


(1) CONC_REL(W, *x) <- COND(Y, *x).

The above has the reading: "*x modifies the word sense W by the relation CONC_REL if *x is a Y". For example, the PR:

    AGNT(think, *x) <- COND(HUMAN_ENTITY, *x).

corresponds to the elementary graph:

    [think] -> (AGNT) -> [HUMAN_ENTITY]

The rule COND(Y, *x) requires in general a more complex computation than a simple supertype test, as detailed in [22]. The short-term objective is to enlarge the dictionary to 1000 words. A concept editor has been developed to facilitate this task. The editor also allows visualizing, for each word sense, a list of all the occurrences of the corresponding words within the press agency releases database (about 10,000 news items).

The algorithm takes as input one or more parse trees, as produced by the syntactic analyzer. The syntactic surface structures are used to derive, for each pair of possibly related words or phrases, an initial set of hypotheses for the corresponding semantic structure. For example, a noun phrase (NP) followed by a verb phrase (VP) could be represented by a subset of the LINK relations listed in the Appendix. The specific relation is selected by verifying type constraints, expressed in the definitions of the corresponding concepts. For example, the phrase "John opens (the door)" gives the parse:

    NP = NOUN(John)
    VP = VERB(opens)

A subject-verb relation such as the above could be interpreted by one of the following conceptual relations: AGNT, PARTICIPANT, INSTRUMENT, etc. Each relation is tested for semantic plausibility by the rule:


(2) REL_CONC(x, y) <- (x: REL_CONC(x, *y = y)) & (y: REL_CONC(*x = x, y)).

Rule (2) is proved by rewriting the conditions expressed on its right-hand side in terms of COND(Y, *x) predicates, as in (1), and then attempting to verify these conditions. In the above example, (2) is proved true for the relation AGNT, because:

    AGNT(open, person:John) <- (open: AGNT(open, *x = person:John)) & (person: AGNT(*y = open, person:John)).
    (open: AGNT(open, *x) <- COND(HUMAN_ENTITY, *x).)
    (person: AGNT(*y, person) <- COND(MOVE_ACT, *y).)

The conceptual graph will be:

    [PERSON: John] <- (AGNT) <- [OPEN]

For a detailed description of the algorithm, refer to [34]. At the end of the semantic analysis, the system produces two possible outputs. The first is a set of short paraphrases of the input sentence: for example, the sentence "The ACE signs an agreement with the government" gives:

    The Society ACE is the agent of the act SIGN.
    The AGREEMENT is the result of the act SIGN.
    The GOVERNMENT participates in the AGREEMENT.

The second output is a conceptual graph of the sentence, generated using a graphic facility. An example is shown in Figure 12. A PROLOG list representing the graph is also stored in a database for future analysis (query answering, deductions, etc.).
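The type-constraint test behind rules (1) and (2) can be rendered as a toy lookup. The flattened type hierarchy, the word senses, and the method below are invented for illustration and are not the paper's actual PROLOG knowledge base; the point is only that a relation is plausible when each side's type condition holds.

```java
import java.util.Map;
import java.util.Set;

// Toy rendering of the pragmatic-rule check: AGNT(verb, noun) is plausible
// only if the noun is a HUMAN_ENTITY and the verb is a MOVE_ACT.
public class RelationChecker {
    // Supertypes of each word sense (flattened hierarchy, illustrative only).
    static Map<String, Set<String>> types = Map.of(
        "John", Set.of("PERSON", "HUMAN_ENTITY"),
        "open", Set.of("MOVE_ACT", "ACT"));

    // AGNT(verb, *x) <- COND(HUMAN_ENTITY, *x), plus the symmetric
    // condition COND(MOVE_ACT, *y) on the verb side, as in rule (2).
    static boolean agnt(String verb, String noun) {
        return types.getOrDefault(noun, Set.of()).contains("HUMAN_ENTITY")
            && types.getOrDefault(verb, Set.of()).contains("MOVE_ACT");
    }

    public static void main(String[] args) {
        System.out.println(agnt("open", "John")); // true: types fit
        System.out.println(agnt("John", "open")); // false: types do not fit
    }
}
```

A real implementation would, as the paper notes, need more than this simple supertype test, but the shape of the check is the same: each candidate relation (AGNT, PARTICIPANT, INSTRUMENT, ...) is kept only if its type conditions can be proved.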


[Figure 12 shows the conceptual graph: COMPANY:Ace -(AGNT)- SIGN -(OBJ)- CONTRACT, with SIGN -(PART)- GOVERNMENT.]

Figure 12: Conceptual graph for the sentence "The ACE signs a contract with the government".

As far as the semantic analysis is concerned, current efforts are directed towards the development of a query answering system and a language generator. Future studies will concentrate on discourse analysis [26].

2.7 Natural Language Processing Complexity and Parallelism

The paper by Moghrabi et al. [21] focuses on rule-based and constraint-based systems, and not on statistical approaches such as those exemplified by the work at IBM [5]. The following word morphology and text analysis sections show where processing choices need to be made and where parallelism could be implemented [2], [33]. The ideas presented here are more specifically applicable to French, and less to other languages; whenever needed, such differences are indicated.

The morphology component allows the system to recognize the various forms of a word in an effort to reduce the size of dictionaries. The Petit Robert dictionary for French contains around 60,000 entries. Imagine if, instead of the single entry danser, the number of entries had to be multiplied by the number of all possible conjugations for the 6 persons and for all tenses. Morphology analysis reduces storage requirements and hence converts the problem of recognizing word forms from an information retrieval problem into a classification problem, where the system has to strip the word of its suffix in order to recognize its stem.


Essentially, the system removes the letters at the end of the word (its suffix or termination) one at a time, starting with the last letter. Each time it strips the word of a letter, it looks the remaining string up in a stem dictionary to see if it is a viable stem, and then iterates over the other letters. If the dictionary is already in memory and is accessed by a hashing algorithm, the time complexity of such a process would be w * Θ(1), where w is the length of the word, which is quite insignificant in comparison to the size of the hash table. This analysis does not include the file accessing overhead.

In general, such a methodology is acceptable for languages where the effect of suffixes is local, but not for languages where the influence extends over a longer distance. For illustration, in Arabic both prefixes and suffixes are modified during inflection [23] [24], while in Turkish suffixes are influenced by vowel harmony processes [7] [25]. Accordingly, the problem of mapping between the surface form and the lexical form (two-level morphology) is quite hard in such languages, making it NP-complete if null characters are excluded [6].

For French, the stem selection could be parallelized by sending the same word of length w to as many processors as there are letters in the word, i.e. w processors. Each processor Pi, where i is the number of letters to be stripped, would have to verify whether the resulting word of length w − i is a stem. Supposing the parallelization is based on a directed acyclic graph (DAG) model where synchronization and communication costs are ignored, the complexity of such an algorithm depends on the access method used for the stem dictionary. If it were hashed, as would be expected, the time complexity would be Θ(1), which is the same as that of the serial algorithm. This result should not, however, be used to underestimate the usefulness of the parallel approach: it only indicates that neither algorithm deteriorates if the size of the data increases rapidly, which is the case neither for the size of words in French nor for hashing [21].

As for the expected actual execution time of the parallel algorithm, it will be reduced to 1/w of that of its sequential counterpart. This approach is analogous to that of the Kimmo system, which uses parallel finite state automata to find the correspondence between the surface and the lexical form of a word [18].
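The serial stripping loop just described can be sketched as follows. The stem set below is a tiny stand-in for a real hashed French lexicon, so the stems and words are illustrative only; the point is the w constant-time lookups per word.

```java
import java.util.Set;

// Sketch of serial suffix stripping: remove one trailing letter at a time
// and test the remainder against a hashed stem dictionary (O(1) per lookup,
// hence w * O(1) per word of length w).
public class Stemmer {
    // Stand-in stem dictionary; a real system would load a French lexicon.
    static Set<String> stems = Set.of("dans", "chant", "parl");

    static String findStem(String word) {
        for (int cut = 0; cut < word.length(); cut++) {
            String candidate = word.substring(0, word.length() - cut);
            if (stems.contains(candidate)) return candidate; // hashed lookup
        }
        return null; // no viable stem found
    }

    public static void main(String[] args) {
        System.out.println(findStem("danser"));   // prints: dans
        System.out.println(findStem("chantons")); // prints: chant
    }
}
```

The parallel variant described above would instead hand each cut length i to its own processor Pi, each performing a single Θ(1) lookup.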


2.8 Summary

From this study I have gained an idea of how semantic analysis works for information retrieval and document matching. Singular value decomposition (SVD) is one of the best techniques and is frequently used for information retrieval. But each technique has some limitations when used in a system; for example, SVD approximates the original term-document matrix using 50-100 orthogonal factors or derived dimensions. Memory limitations sometimes impose a boundary on processing a document using semantic analysis, but we can still use some techniques to avoid these kinds of problems.

32 Analysis

Chapter 3

3.1 What am I doing?

In our SMART project the user will submit some text from his or her mobile phone to a server, and the text will be saved in a database. It can also be done from the internet. The main idea is to find out the subject, object, message type, priority and name of the sender. I am trying to implement and develop a new algorithm for this project. By using my algorithm we can find out the object, the message type and also the priority.

3.2 Proposed algorithm

The proposed algorithm is divided into three parts. The algorithm is implemented in the Java language. For the database I use Microsoft Access, but we could also use Oracle or SQL Server depending on the project requirements. There are three data tables: Objects, Processeddata and Rowdata.

Figure 13: Objects Data Table

The Objects table is used to store the object information: the object ID (identification number), the object name, and the keywords which the system uses to identify the object name. The Objects table is shown in Figure 13.

Figure 14 shows the Processeddata table, in which every input message processed by the semantic analysis is stored once the object name, text type and priority have been identified. The columns of this table are ID, Message, ObjectName, Type, Priority, Subject, Comments and Reviewed. ID is the unique row identification number, and Message stores the input text. After semantic analysis of the input text, the identified object name is stored in the ObjectName column and the text type in the Type column. Priority stores the text priority, and the Subject field stores a short subject for the whole input message. Comments is available to the analyzer if he or she wants to add a comment, and the Reviewed field counts how many times the message has been reviewed by an analyzer or a user.

Figure 14: Processed Data Table

Figure 15 shows the Rowdata table. This table holds the text submitted by a user from a mobile phone or the internet. There are three fields in this table: Message, ID and status. The Message field stores the input text, which is later used by the semantic analysis to identify the object name, text type and priority. The ID field uniquely identifies the row. The status field is checked by the system to see whether the message has been processed or not: if the text has been processed by the semantic analysis, the status field will be read; otherwise it remains unread.

Figure 15: Row Data Table


3.2.1 Object Identification Algorithm

String Detect_Object(String message[] /* all unread messages */,
                     Object objects[] /* all objects */) {
    int row = 1, col = 0;
    // Two-dimensional array storing the object names (first row) and their keywords.
    String object_array[][] = new String[max_keywords][max_objects];
    // Two-dimensional array storing the weight values.
    int weight_array[][] = new int[max_keywords][max_objects];
    // One-dimensional array storing the column sums of the weight matrix.
    int sum_array[] = new int[max_objects];

    // Initialize both arrays before entering the values.
    for (int i = 0; i < max_keywords; i++)
        for (int j = 0; j < max_objects; j++) {
            object_array[i][j] = "";
            weight_array[i][j] = 0;
        }

    // Store all the object names in the first row of the object array.
    for (int j = 0; j < max_objects; j++)
        object_array[0][j] = objects[j].getName();

    // Store the keywords of each object down the same column as its name.
    for (int i = 0; i < max_objects; i++) {
        while (objects[i].hasMoreKeywords())
            object_array[row++][col] = objects[i].nextKeyword();
        row = 1;
        col++;
    }

    // For each text, count the keyword matches.
    for (int i = 0; i < message.length; i++) {
        String token[] = TokenizeMessage(message[i]);
        for (int j = 0; j < token.length; j++)
            for (int y = 0; y < max_objects; y++)
                for (int x = 1; x < max_keywords; x++)
                    if (token[j].equals(object_array[x][y]))
                        weight_array[x][y] += 1;
    }

    // Sum each column: the object with the most matching keywords wins.
    int sum = 0;
    for (int y = 0; y < max_objects; y++) {
        for (int x = 1; x < max_keywords; x++)
            sum += weight_array[x][y];
        sum_array[y] = sum;
        sum = 0;
    }

    // Find the index of the maximum value in the sum array.
    int index = indexOfMaximum(sum_array);

    // Return the object name only if more than one keyword matched.
    if (sum_array[index] > 1)
        return object_array[0][index];
    else
        return "";
}

3.2.2 Text Type Identification Algorithm

String Detect_type(String message[] /* all unread messages */) {
    // Weight array storing one match count per message type.
    int weight_array[] = new int[max_msg_type];
    // The recognized message types.
    String message_type[] = { "proposal", "report", "suggestion",
                              "opinion", "request", "information", "complain" };

    // Check each token of each text against the type list.
    for (int i = 0; i < message.length; i++) {
        String token[] = TokenizeMessage(message[i]);
        for (int t = 0; t < token.length; t++)
            for (int j = 0; j < message_type.length; j++)
                if (token[t].equals(message_type[j]))
                    weight_array[j] += 1;
    }

    // Find the index of the maximum weight in the weight array.
    int index = indexOfMaximum(weight_array);

    // Return the text type only if at least one keyword matched.
    if (weight_array[index] != 0)
        return message_type[index];
    else
        return "";
}

3.2.3 Text Priority Algorithm

int get_priority(String message_type) {
    if (message_type.equals("report"))
        return Random_Number(50, 100);
    else if (message_type.equals("proposal"))
        return Random_Number(10, 50);
    else if (message_type.equals("suggestion"))
        return Random_Number(30, 70);
    else if (message_type.equals("opinion"))
        return Random_Number(10, 50);
    else if (message_type.equals("information"))
        return Random_Number(20, 50);
    else if (message_type.equals("complain"))
        return Random_Number(70, 100);
    else // unknown or empty type
        return 0;
}


3.3 How does the algorithm work?

To identify the object I use two arrays to store the object names and keywords, and another one for counting the weights. At first, all the keywords and objects are stored in the object_array. For example, if we take a sample text:

    Lulea University Of Technology. All the type of the view point and proposal collection.

the object_array will look like Figure 16.

    Lulea University of Technology | Isrutchsbana | BnearIT
    Lulea                          | Play         | company
    University                     | ground       | system
    Technology                     | child        | integration
    Education                      | Isrutchsbana | architecture
    Research                       | Ice          | industry
    Collaboration                  | Kids         | peace
    Student                        | playing      | Bnear
    Doctor                         |              | IT
    Examination                    |              |
    Lecturer                       |              |
    Teacher                        |              |
    Lecture                        |              |
    Lesson                         |              |
    Laboratory                     |              |
    Thesis                         |              |
    Schedule                       |              |
    Lesson                         |              |

Figure 16: object_array

In the object_array the first row stores the object names; the other rows store the keywords, column by column under the corresponding object name. The number of rows depends on the maximum number of keywords for any object. After the semantic analysis, the weight_array will look like Figure 17.


    0 0 0
    1 0 0
    1 0 0
    1 0 0
    0 0 0
    0 0 0
    0 0 0
    0 0 0
    0 0 0
    0 0 0
    0 0 0
    0 0 0
    0 0 0
    0 0 0
    0 0 0
    0 0 0
    0 0 0
    0 0 0

Figure 17: weight_array.

After that I sum each column of the weight_array and store the results in the sum_array. The sum_array will now look like:

    3 0 0

Because the first column holds the maximum value, the first object name from the object_array will be returned.

The text type is identified in the same way. First the system tokenizes the whole message, and each word of the message is checked against the types: proposal, report, suggestion, etc. There is also a weight_array which stores the match counts; from that array the system finds the maximum value and its index, and then returns the text type from the message_type array using that index.

The text priority depends on the text type. We can adjust the priority ranges at any time inside our system.
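A compact, runnable rendering of this worked example, with the keyword lists shortened from Figure 16. The class and method names here are mine for illustration, not those of the SMART implementation; the logic is the same keyword counting, including the "more than one match" rule of the object identification algorithm.

```java
import java.util.List;
import java.util.Map;

// Runnable version of the worked example: count keyword hits per object
// and return the best-matching object name ("" when at most one matches).
public class ObjectMatcher {
    static Map<String, List<String>> objects = Map.of(
        "Lulea University of Technology",
            List.of("lulea", "university", "technology", "education", "student"),
        "Isrutchsbana", List.of("isrutchsbana", "ice", "kids", "playing"),
        "BnearIT", List.of("bnear", "it", "company"));

    static String detectObject(String message) {
        String best = "";
        int bestScore = 0;
        for (Map.Entry<String, List<String>> e : objects.entrySet()) {
            int score = 0;
            for (String token : message.toLowerCase().split("\\W+"))
                if (e.getValue().contains(token)) score++;
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return bestScore > 1 ? best : ""; // require more than one keyword hit
    }

    public static void main(String[] args) {
        String text = "Lulea University Of Technology. "
            + "All the type of the view point and proposal collection.";
        // Three keyword hits (lulea, university, technology), matching the
        // column sum of 3 in the worked example above.
        System.out.println(detectObject(text));
        // prints: Lulea University of Technology
    }
}
```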


3.4 Implemented Application

The semantic analysis is implemented in the Java language, and there is also a graphical interface for the SMART client, created in Visual Basic 6. Figure 18 shows the client interface for the text analyzer. It contains a subject field, text field, object field, user field, type field, priority field and comment field. The task of most fields is described above; the Subject field, however, is used to generate a short subject from the input text.

Figure 18: SMART Client Interface 1.

Another field is the user field, used for the users who log in to analyze the processed data, and the comment field holds the user's comments about the input text. The status is an option button: clicking Input shows the raw input data, while clicking Processed shows all the data that has been analyzed by the semantic analysis. The Reviewed option is checked if the processed data has been reviewed by any user. The Save button updates information in the database, and the Delete button deletes data from the database. The two arrow buttons move the cursor to the next and previous data. Another button, Edit Object and Keywords, calls another form for editing, which can be seen in Figure 19.

Figure 19: SMART Client Interface 2.

This interface is used to add and edit the object list and the keywords related to each object. Some of the command buttons have the same functions as in the previous interface, but the Add button adds a new object and its keywords for the semantic analysis, and the Clear button clears the text fields for object id, object name and keywords. Another interface is for login, which can be seen in Figure 20.


Figure 20: Login Form for SMART Client

This is the first form loaded, and it exists for security reasons: it ensures that no one can change or update the raw input data in the database without logging in.

41 Implementation

Chapter 4

4.1 Testing

Testing is a significant part of a system; without testing we cannot appraise any system. For our SMART system I used the following data as input to check the output and the system's consistency. Figure 21 shows our input data table.

    Type     | Name                           | Description
    Proposal | Lulea University of Technology | Lulea University of Technology. All the type of the view point and proposal collection
    Proposal | Unimbo AB                      | Here we are doing object
    Report   | Isrutschbana                   | Isrutchsbana with ice sculpture where kids playing with snow
    Proposal | Proposal box                   | Here is the proposal box
    Proposal | BnearIT                        | Company Bnear IT AB. All the type of the viewpoint and proposal receives.
    Report   | Trash can                      | A trash can where one throws trash. Report about the problem or if it is full.
    Proposal | Expire Test                    | Here is the temporary proposal box. After the deadline, no new proposal can be taken but the received proposal will remain

Figure 21: Input Data Table.

In the above table, the text type, object name and sample input texts are given to check the semantic analysis.

4.2 Result

All the data from the table shown in Figure 21 is saved in our database. The system reads the data from the database and analyzes it.


After the semantic analysis, all the processed data is saved in another data table. Running the semantic analysis, I found the following results:

Text: Lulea University Of Technology. All the type of the view point and proposal collection
Object: Lulea University of Technology
Text Type: proposal
Text Priority: 38

Text: Here we are doing object
Object: Object not found by system
Text Type: Type not found by system
Text Priority: 0

Text: Isrutchsbana with ice sculpture where kids playing with snow
Object: Isrutchsbana
Text Type: Type not found by system
Text Priority: 0

Text: Here is the proposal box
Object: Proposal Box
Text Type: proposal
Text Priority: 40

Text: Company Bnear IT AB. All type of the viewpoint and proposal receives.
Object: Bnear IT
Text Type: proposal
Text Priority: 46

Text: A trash can where one throws trash. Report about the problem if it is full.
Object: Trash can
Text Type: report
Text Priority: 93

Text: Here is the temporary proposal box. After the deadline, no new proposal can be taken but the recieved proposal will remain
Object: Expire Test
Text Type: proposal
Text Priority: 42

All the above output comes directly from the system. Depending on the text or user input data, the semantic analysis identifies the object and text type, together with a priority. The key concept of this semantic analysis depends strongly on the keywords.


4.3 Evaluation

After testing, we come to the system's accuracy: how many correct results the system can produce. I used eight text examples for testing the semantic analysis of the SMART system, and it gave five accurate results. The accuracy depends entirely on how many keywords are stored in our database. After evaluation, it appears that the system gives 62.5% accurate results even though it is a prototype. The stability of the system will thus in future depend on the input texts and on the keyword information used to identify the object and text type.

44 Discussion

Chapter5

5.1 Future work

With my proposed algorithm we can easily find the object, type and priority of an input text. Nowadays, semantic analysis is used for resume type recognition, crime report investigation, and pattern matching in website texts. Generating a short subject from the input text would be good future work for semantic analysis. We could also find the user name, or a person's name if it is given in the input text. Another interesting task would be finding the user's location. Depending on the object name, we could also match the surrounding scenario of the original object and find the real object from the input text.

In future, semantic analysis could also be used for exam paper checking, or for searching for a certain topic in web-published newspapers. It could also be used to understand human language, for example whether a text is in English or Swedish.

5.2 Conclusion

This research has produced summaries of user input texts and found the object name, text type and text priority. Hopefully, this work can be successfully added to the SMART project to increase usability for users, and in future this idea of semantic analysis can be used on a larger scale. With this research project, I hope to have made a useful contribution to the analysis and evaluation of user input texts.

45 References

References

[1] Antonacci F., Russo M. Three steps towards natural language understanding: morphology, morphosyntax, syntax. 1987.
[2] Adriaens G. & Hahn U. (Eds.). Parallel Natural Language Processing. Ablex Publishing, USA, 480 pages, 1994.
[3] Bates M.J. Subject access in online catalogs: A design model. JASIS, 1986, 37(6), 357-376.
[4] Borko H. and Bernick M.D. Automatic document classification. Journal of the ACM, April 1963, 10(3), 151-162.
[5] Brown P. et al. A statistical approach to language. Computational Linguistics, vol. 16, 1990, pp. 79-85.
[6] Barton et al. Computational Complexity and Natural Language Processing. MIT Press, USA, 1987.
[7] Clements G. & Sezer E. Vowel and consonant disharmony in Turkish. In The Structure of Phonological Representations, Van der Hulst & Smith (Eds.), Holland: Foris Pubs, 1982, pp. 213-256.
[8] Deerwester S., Dumais S.T., Furnas G.W., Landauer T.K., Harshman R. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 1990.
[9] Furnas G.W., Landauer T.K., Gomez L.M., and Dumais S.T. Statistical semantics: Analysis of the potential performance of key-word information systems. Bell System Technical Journal, 1983, 62(6), 1753-1806.
[10] Fidel R. Individual variability in online searching behavior. In C.A. Parkhurst (Ed.), ASIS '85: Proceedings of the ASIS 48th Annual Meeting, Vol. 22, October 20-24, 1985, 69-72.
[11] Furnas G.W. Experience with an adaptive indexing scheme. In Human Factors in Computer Systems, CHI '85 Proceedings, San Francisco, CA, April 15-18, 1985.
[12] Gong Y., Liu X. Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis. September 2001.
[13] Gomez L.M. and Lochbaum C.C. People can retrieve more objects with enriched keyword vocabularies. But is there a human performance cost? In Proceedings of Interact '84, London, England, September 1984.
[14] Hofmann T. Probabilistic Latent Semantic Indexing. 1999.


[15] Hofmann, T. "Unsupervised Learning by Probabilistic Latent Semantic Analysis", Kluwer Academic Publishers, 2001.
[16] Jones, W. P. and Furnas, G. W. Pictures of relevance. JASIS, 1987, 38(6), 420–442.
[17] Krashen, S. D. Second Language Acquisition and Second Language Learning. Prentice Hall International, 1988.
[18] Karttunen, L. Constructing lexical transducers. Proceedings of Coling '94, 1994, pp. 406–411.
[19] Liley, O. Evaluation of the subject catalog. American Documentation, 1954, 5(2), 160.
[20] Landauer, T. K. and Dumais, S. T. A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge. Psychological Review, 1997, 104(2), 211–240.
[21] Moghrabi, C., Moussa, S. E., Eid, M. S. "Natural Language Processing Complexity and Parallelism", IEEE, 2002.
[22] Pazienza, M. T., Velardi, P. Pragmatic Knowledge on Word Uses for Semantic Analysis of Texts. In Knowledge Representation with Conceptual Graphs, edited by John Sowa, Addison-Wesley.
[23] McCarthy, J. Prosodic templates, morphemic templates, and morphemic tiers. In The Structure of Phonological Representations, Van der Hulst & Smith (Eds.), Holland: Foris Pubs, 1982, pp. 191–223.
[24] Moghrabi, C., Girard, L. & Eid, M. S. "Chemistry: a New Domain for a Portable Text Generation System", Proceedings, International Conference Recent Advances in Natural Language Processing, Bulgaria, pp. 337–343, 1995.
[25] Oflazer, K. Two-level description of Turkish words. Literary and Linguistic Computing, vol. 9, no. 2, 1994, pp. 137–148.
[26] Pazienza, M. T. "A Structured Representation of Word-Senses for Semantic Analysis".
[27] Pustejovsky, J., Bergler, S., Anick, P. "Lexical Semantic Techniques for Corpus Analysis", 1993.
[28] Dale, R., Moisl, H. & Somers, H. (Eds.). Handbook of Natural Language Processing. Marcel Dekker, 2000.
[29] Saussure, F. Saussure's Third Course of Lectures in General Linguistics. Pergamon Press, 1993.


[30] Sparck Jones, K. A statistical interpretation of term specificity and its applications in retrieval. Journal of Documentation, 1972, 28(1), 11–21.
[31] Tarr, D. and Borko, H. Factors influencing inter-indexer consistency. In Proceedings of the ASIS 37th Annual Meeting, Vol. 11, 1974, 50–55.
[32] van Rijsbergen, C. J. A theoretical basis for the use of co-occurrence data in information retrieval. Journal of Documentation, 1977, 33(2), 106–119.
[33] van Lohuizen, M. P. Parallel Processing of Natural Language Parsers. ParCo '99, The Netherlands, 1999.
[34] Vygotsky, L. S. Thought and Language. Cambridge: M.I.T. Press, 1962.
[35] Press, W. et al. Numerical Recipes in C: The Art of Scientific Computing. Cambridge, England: Cambridge University Press, 2nd ed., 1992.
[36] Wolfe, M. B., Schreiner, M. E., Rehder, B., Laham, D., Foltz, P. W., Kintsch, W., and Landauer, T. K. Learning from Text: Matching Readers and Texts by Latent Semantic Analysis. Discourse Processes, 1998, 25(2/3), 309–336.
[37] Zampa, V., Lemaire, B. "Latent Semantic Analysis for User Modeling", Kluwer Academic Publishers, 2002.
[38] John McCarthy, http://www-formal.stanford.edu/jmc/whatisai/whatisai.html

[39]MIT, http://www.ai.mit.edu/courses/6.891-nlp/

[40]NLP, http://en.wikipedia.org/wiki/Natural_language_processing

[41]SA, http://en.wikipedia.org/wiki/Semantic_analysis_%28linguistics%29

[42]SMART, http://www.cdt.ltu.se/projectweb/4407056182976/English.html
