Board Search: An Internet Forum Index

Overview
• Forums provide a wealth of information.
• This semi-structured data is not taken advantage of by popular search software.
• Despite being crawled, many information-rich posts are lost due to low PageRank.

Forum Examples
• Forum software: vBulletin, phpBB, UBB, Invision, YaBB, Phorum, WWWBoard
• Example boards: gentoo, evolutionM, bayareaprelude, warcraft, Paw Talk

Current Solutions
• Search engines (e.g. Google, Lycos)
• A forum's internal search

Evaluation Metric
• Metrics: Recall = C/N, Precision = C/E
• Rival system:
 – The rival system is the search engine / forum internal search combination.
 – The rival system lacks precision.
• Evaluations:
 – How good our system is at finding forums
 – How good our system is at finding relevant posts/threads
• Problems:
 – Relevance is in the eye of the beholder.
 – How many correct extractions exist?

Implementation
• Lucene
• MySQL
• Ted Grenager's crawler
• Jakarta HttpClient

Improving Software Package Search Quality
Dan Fingal and Jamie Nicolson

The Problem
• Search engines for software packages (e.g. Sourceforge.org, Gentoo.org, Freshmeat.net) typically perform poorly.
• They tend to search the project name and blurb only.

How can we improve this?
• Better keyword matching
• Better ranking of the results
• A better source of information about each package
• Pulling in nearest neighbors of top matches

Better Sources of Information
• Every package is associated with a website that contains much more detailed information about it.
• Spidering these sites should give us a richer representation of the package.
• Freshmeat.net has data regarding popularity, vitality, and user ratings.

Building the System
• Spider freshmeat.net and the project webpages, and put the results into a MySQL database.
• Also convert the Gentoo package database to MySQL.
• Text indexing is done with Lucene.
• A results generator combines the text score with the other available metrics (a sketch follows at the end of this section).

How do we measure success?
• Create a gold corpus mapping queries to relevant packages.
• Measure precision within the first N results.
• Compare results with search on packages.gentoo.org, freshmeat.net, and google.com.
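To make the "Building the System" step concrete, here is a minimal Lucene sketch of indexing a spidered package page and blending the text score with a popularity metric at query time. It uses a current Lucene API rather than the 2005-era one, and the field names, sample data, and the 0.8/0.2 weighting are illustrative assumptions, not details from the slides.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class PackageSearchSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();  // in-memory index for the demo
        try (IndexWriter w = new IndexWriter(dir,
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document d = new Document();
            d.add(new TextField("name", "lucene", Field.Store.YES));
            d.add(new TextField("blurb", "text search library", Field.Store.NO));
            // The key idea: index the spidered homepage text, not just name + blurb.
            d.add(new TextField("site",
                    "Apache Lucene is a full-featured text search engine library in Java",
                    Field.Store.NO));
            d.add(new StoredField("popularity", 0.9f)); // hypothetical freshmeat-style score
            w.addDocument(d);
        }

        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        Query q = new QueryParser("site", new StandardAnalyzer())
                .parse("text search engine");
        for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
            Document d = searcher.doc(hit.doc);
            float pop = d.getField("popularity").numericValue().floatValue();
            // Hypothetical results generator: blend text relevance with popularity.
            float combined = 0.8f * hit.score + 0.2f * pop;
            System.out.printf("%s combined=%.3f%n", d.get("name"), combined);
        }
    }
}
```

Queries hit the spidered site text rather than only the name and blurb, which is exactly the improvement the slides propose.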
Incorporating Social Clusters in Email Classification
By Mahesh Kumar Chhaparia

Email Classification
• Emails are usually small documents.
• Keyword sharing across related emails may be small or indistinctive; hence, on-the-fly training may be slow.
• Classifications change over time, and differ for different users!

Previous Work
• Previous work on email classification focuses mostly on:
 – Binary classification (spam vs. non-spam)
 – Supervised learning techniques for grouping into multiple existing folders
 – Rule-based learning, naïve Bayes classifiers, support vector machines
• Sender and recipient information is usually discarded.
• Some existing classification tools:
 – POPFile: naïve Bayes classifier
 – RIPPER: rule-based learning
 – MailCat: TF-IDF weighting

Motivation
• The sender-receiver link mostly has a unique role (social/professional) for a particular user.
• Hence, it may be used as one of the distinctive characteristics for classification.

Incorporating Social Clusters
• Identify initial social clusters (unsupervised).
• Weights to distinguish:
 – From and Cc fields
 – Number of occurrences in distinct emails
• Study the effects of incorporating sender and recipient information (see the feature-extraction sketch at the end of this section):
 – Can it substitute for part of the training required?
 – Can it compensate for documental evidence of similarity?
 – What is the quality-of-results vs. training-time tradeoff?
 – How does it affect regular classification if sender/recipient are used as terms too?

Evaluation
• The Enron Email Dataset was recently made public:
 – The only substantial collection of "real" email that is public
 – Fast becoming a benchmark for experiments in social network analysis, email classification, textual analysis, …
• Study/comparison of the aforementioned metrics against the folder classification already available in the Enron Dataset.

Extensions
• Role discovery using the Author-Topic-Recipient model to facilitate classification
• Lexicon expansion to capture similarity in small amounts of data
• Using past history of conversation to relate messages

References
• Provost, J. "Naïve-Bayes vs. Rule-Learning in Classification of Email". The University of Texas at Austin, Artificial Intelligence Lab, Technical Report AI-TR-99-284, 1999.
• Crawford, E., Kay, J., and McCreath, E. "Automatic Induction of Rules for E-mail Classification". In Proc. Australasian Document Computing Symposium, 2001.
• Kiritchenko, S. and Matwin, S. "Email Classification with Co-Training". CASCON '02 (IBM Center for Advanced Studies Conference), Toronto, 2002.
• Turenne, N. "Learning Semantic Classes for Improving Email Classification". In Proc. IJCAI 2003 Text-Mining and Link-Analysis Workshop, 2003.
• Aery, M. and Chakravarthy, S. "eMailSift: Adapting Graph Mining Techniques for Email Classification". SIGIR 2004.
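As a sketch of how the sender-receiver link could feed an otherwise standard classifier, the following turns the From/Cc participants into pseudo-terms alongside the body tokens. The feature naming and the example addresses are illustrative, not from the slides; a real implementation would additionally weight these features by the number of distinct emails in which each link occurs, as proposed above, before handing them to a naïve Bayes or SVM learner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class EmailFeatures {
    // Represent an email as a flat feature list: social pseudo-terms + body tokens.
    static List<String> extract(String from, List<String> cc, String body) {
        List<String> features = new ArrayList<>();
        features.add("FROM=" + from.toLowerCase(Locale.ROOT));    // sender link
        for (String addr : cc) {
            features.add("CC=" + addr.toLowerCase(Locale.ROOT));  // recipient links
        }
        for (String tok : body.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!tok.isEmpty()) features.add(tok);                 // textual evidence
        }
        return features;
    }

    public static void main(String[] args) {
        // Hypothetical Enron-style example.
        System.out.println(extract("jeff.skilling@enron.com",
                List.of("board@enron.com"), "Q3 projections attached"));
    }
}
```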
A Research Literature Search Engine with Abbreviation Recognition
Group members: Cheng-Tao Chu, Pei-Chin Wang

Outline
• Motivation
• Approach
 – Architecture
• Technology
• Evaluation

Motivation
• Existing research literature search engines don't perform well on author, conference, and proceedings abbreviations.
• Example: search "C. Manning, IJCAI" in CiteSeer or Google Scholar.

Goal
• Instead of searching only by index, identify the semantics in the query.
• Recognize abbreviations of author and proceedings names.

Approach
• Crawl DBLP as the data source.
• Index the data with fields for authors, proceedings, etc.
• Train a tagger to recognize authors and proceedings.
• Use a probabilistic model to calculate the probability of each possible name.
• Use a tailored edit distance function to calculate the weight of each possible proceeding (a sketch follows at the end of this section).
• Combine these weights into the score of each selected result.

Architecture
• Indexing pipeline: DBLP → Crawler → Database → Search Engine
• Query pipeline: Browser → Tagger → Probabilistic Model / Tailored Edit Distance → Search Engine → Retrieved Documents

Technology
• Crawler: UbiCrawler
• Tagger: LingPipe or YamCha
• Search Engine: Lucene
• Bayesian Network: BNJ
• Web Server: Tomcat
• Database: MySQL
• Programming Language: J2SE 1.4.2

Evaluation
1. We will ask friends to participate in the evaluation (estimated: 2000 queries / 200 friends).
2. Randomly sample 1000 entries from DBLP, extract the author and proceedings info, query with abbreviated info, and check how well the retrieved documents match the results from Google Scholar.
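The slides do not define the tailored edit distance, so the following is a plausible minimal sketch under one stated assumption: skipping characters of the full proceedings name is cheap (cost 0.1), while insertions and substitutions keep the usual cost of 1, so an abbreviation like "ijcai" stays close to the expanded name. In practice one would normalize by length and combine this score with the probabilistic name model.

```java
public class TailoredEditDistance {
    // DP over (abbreviation prefix, full-name prefix); deletions from the
    // full name are cheap, so abbreviations align well with their expansions.
    static double distance(String abbrev, String full) {
        int n = abbrev.length(), m = full.length();
        double[][] d = new double[n + 1][m + 1];
        for (int j = 1; j <= m; j++) d[0][j] = d[0][j - 1] + 0.1; // cheap deletions
        for (int i = 1; i <= n; i++) d[i][0] = i;                 // full-cost insertions
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double sub = d[i - 1][j - 1]
                        + (abbrev.charAt(i - 1) == full.charAt(j - 1) ? 0 : 1);
                double del = d[i][j - 1] + 0.1;  // skip a char of the full name
                double ins = d[i - 1][j] + 1;    // extra char in the abbreviation
                d[i][j] = Math.min(sub, Math.min(del, ins));
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        String full = "international joint conference on artificial intelligence";
        System.out.println(distance("ijcai", full));  // small: good match
        System.out.println(distance("sigir", full));  // larger: poor match
    }
}
```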
A Web-based Question Answering System
Yu-shan & Wenxiu, 01.25.2005

Outline
• QA background
• Introduction to our system
• System architecture
 – Query classification
 – Query rewriting
 – Pattern learning
• Evaluation

QA Background
• Traditional search engines (Google, Yahoo, MSN, …):
 – Users construct keyword queries.
 – Users go through the hit pages to find the answer.
• Question answering search engines (AskJeeves, AskMSR, …):
 – Users ask in natural language.
 – Short answers are returned, possibly supported by references.

Our QA System
• Open domain
• Based on massive web documents: redundancy guarantees effectiveness.
• Question classification: patterns focus on numeric, definition, human, …
• Exact answer patterns

Question Classifier
• Given a question, map it to one of the predefined classes.
• 6 coarse classes (Abbreviation, Entity, Description, Human, Location, and Numeric Value) and 50 fine classes.
• Also shows syntactic analysis results such as POS tagging, named entity tagging, and chunking.
• http://l2r.cs.uiuc.edu/~cogcomp/demo.php?dkey=QC

Query Rewriting
• Use the syntactic analysis result to decide which part of the question to expand with synonyms.
• Use WordNet for synonyms.

Answer Pattern Learning
• Supervised machine learning approach
• Select correct answers/patterns manually.
• Statistical answer pattern rules

Evaluation
• Use the TREC 2003 QA set. Answers are retrieved from the Web, not from the TREC corpus.
• Metrics:
 – MRR (Mean Reciprocal Rank) of the first correct answer
 – NAns (number of questions correctly answered)
 – %Ans (proportion of questions correctly answered)

Streaming XPath Engine
Oleg Slezberg, Amruta Joshi

Traditional XML Processing
• Parse the whole document into a DOM tree structure.
• The query engine searches the in-memory tree to get the result.
• Cons:
 – Extensive memory overhead
 – Unnecessary multiple traversals of the document fragment (e.g. /descendant::x/ancestor::y/child::z)
 – Cannot return results as early as possible (e.g. non-blocking queries)

Streaming XML Processing
• The XML parser is event-based, such as SAX.
• The XPath processor performs online, event-based matching.
• Pros:
 – Less memory overhead
 – Only the necessary part of the input document is processed.
 – Results are returned on the fly, efficiently supporting non-blocking queries.

What is XPath?
• A syntax for selecting parts of an XML document
• Describes paths to elements, similar to an OS describing paths to files
• Almost a small programming language; it has functions, tests, and expressions
• A W3C standard
• Not itself written as XML, but used heavily in XSLT

A Simple Example
• XML document:
  <doc>
    <para1>Hello world!</para1>
    <para2> ... </para2>
  </doc>
• SAX events: start element doc; start element para1; characters "Hello world!"; end element para1; …; end element doc.
• XPath query: Q = /doc/para1/data()
• Traditional processing:
 – Build an in-memory DOM structure.
 – Return "Hello world!" after the end of the document.
• Streaming processing (see the SAX sketch after this section):
 – Match /doc in Q at start element doc.
 – Match /doc/para1 in Q at start element para1.
 – Return "Hello world!" at end element para1.

Objective
• Build a streaming XPath engine using the TurboXPath algorithm.
• Contributions:
 – Comparison of FA-based (XSQ) and tree-based (TurboXPath) algorithms
 – Performance comparison between TurboXPath and XSQ

XPath Challenges
• Predicates
• Backward axes
• Common subexpressions
• // combined with nested tags (e.g. <a> ... <a> ... </a> ... </a>)
• *
• Children in predicates that are not yet seen (e.g. a[b]/c where c is streamed before b)
• Simultaneous processing of multiple XPath queries

Algorithms
• Finite-automata based: XFilter, YFilter, XSQ
• Tree-based: XAOS, TurboXPath

Evaluation
• Implementations will be evaluated for:
 – Feature completeness
 – Performance (queries per second)
• XMark: XML benchmarking software
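As a concrete version of the streaming half of the example above, here is a minimal plain-JDK SAX sketch that hard-codes the query /doc/para1/data() and emits the text as soon as para1 closes. A real engine such as TurboXPath or XSQ generalizes this to arbitrary XPath, so this illustrates the event-matching idea, not either algorithm.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class StreamingXPathDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<doc><para1>Hello world!</para1><para2>...</para2></doc>";
        List<String> target = List.of("doc", "para1");  // the fixed query path

        DefaultHandler handler = new DefaultHandler() {
            private final List<String> path = new ArrayList<>(); // open elements
            private final StringBuilder buf = new StringBuilder();

            @Override public void startElement(String uri, String local,
                                               String qName, Attributes atts) {
                path.add(qName);                          // descend one level
                if (path.equals(target)) buf.setLength(0);
            }

            @Override public void characters(char[] ch, int start, int len) {
                if (path.equals(target)) buf.append(ch, start, len);
            }

            @Override public void endElement(String uri, String local, String qName) {
                if (path.equals(target)) {
                    // Emitted at end element para1, before the document finishes.
                    System.out.println("Match: " + buf);
                }
                path.remove(path.size() - 1);             // ascend one level
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
    }
}
```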
