Board Search
An Internet Forum Index

Overview
• Forums provide a wealth of information
• Semi-structured data not taken advantage of by popular search software
• Despite being crawled, many information-rich posts are lost in low PageRank
Forum Examples
• vBulletin • phpBB • UBB • Invision • YaBB • Phorum • WWWBoard
[screenshots: vBulletin, phpBB, and UBB forums (gentoo, evolutionM, bayareaprelude, warcraft, Paw Talk)]

Current Solutions
• Search engines
• Forum’s internal search
[screenshots: Google, Lycos, a forum’s internal search, Board Search]
Evaluation Metric
Metrics: Recall = C/N, Precision = C/E
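In code, the two metrics are just ratios; a minimal sketch, where the names `correct`, `relevant`, and `extracted` stand for the C, N, and E the slides leave informal:

```python
def recall(correct: int, relevant: int) -> float:
    """Recall = C/N: fraction of the relevant items that were found."""
    return correct / relevant if relevant else 0.0

def precision(correct: int, extracted: int) -> float:
    """Precision = C/E: fraction of the extracted items that are correct."""
    return correct / extracted if extracted else 0.0

# e.g. 8 correct results out of 10 relevant threads, 16 results returned
print(recall(8, 10), precision(8, 16))  # → 0.8 0.5
```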
Rival system:
• The search engine / forum internal search combination
• The rival system lacks precision

Evaluations:
• How good our system is at finding forums
• How good our system is at finding relevant posts/threads

Problems:
• Relevance is in the eye of the beholder
• How many correct extractions exist?
Implementation
• Lucene
• MySQL
• Ted Grenager’s Crawler
• Jakarta HTTPClient

Improving Software Package Search Quality
Dan Fingal and Jamie Nicolson
The Problem
• Search engines for software packages typically perform poorly
• Tend to search project name and blurb only
• For example…
[screenshots: Sourceforge.org, Gentoo.org, Freshmeat.net]
How can we improve this?
• Better keyword matching
• Better ranking of the results
• Better source of information about the package
• Pulling in nearest neighbors of top matches

Better Sources of Information
• Every package is associated with a website that contains much more detailed information about it
• Spidering these sites should give us a richer representation of the package
• Freshmeat.net has data regarding popularity, vitality, and user ratings
Building the System
• Will spider freshmeat.net and the project webpages, put into MySQL database
• Also convert gentoo package database to MySQL
• Text indexing done with Lucene
• Results generator will combine this with other available metrics

How do we measure success?
• Create a gold corpus of queries to relevant packages
• Measure precision within the first N results
• Compare results with search on packages.gentoo.org, freshmeat.net, and google.com
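Precision within the first N results, against a gold corpus, can be sketched as follows; the package names here are hypothetical:

```python
def precision_at_n(results, gold, n):
    """Precision within the first n results: fraction of the top-n
    ranked packages that the gold corpus marks as relevant."""
    top = results[:n]
    return sum(1 for pkg in top if pkg in gold) / len(top) if top else 0.0

# hypothetical gold packages for one query, plus one system's ranking
gold = {"vim", "nano", "emacs"}
ranking = ["vim", "gedit", "nano", "kate"]
print(precision_at_n(ranking, gold, n=4))  # → 0.5
```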
Any questions?
Incorporating Social Clusters in Email Classification
By Mahesh Kumar Chhaparia
Previous Work
• Previous work on email classification focuses mostly on:
  – Binary classification (spam vs. non-spam)
  – Supervised learning techniques for grouping into multiple existing folders
    • Rule-based learning, naïve-Bayes classifier, support vector machines
• Sender and recipient information usually discarded

Email Classification
• Emails:
  – Usually small documents
  – Keyword sharing across related emails may be small or indistinctive
  – Hence, on-the-fly training may be slow
  – Classifications change over time, and
  – Different for different users!
• Some existing classification tools:
  – POPFile: naïve-Bayes classifier
  – RIPPER: rule-based learning
  – MailCat: TF-IDF weighting
• Motivation:
  – The sender-receiver link mostly has a unique role (social/professional) for a particular user
  – Hence, it may be used as one of the distinctive characteristics of classification
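MailCat-style TF-IDF routing could look roughly like the sketch below; it illustrates the weighting idea only, not MailCat's actual implementation, and the folder names and emails are made up:

```python
import math
from collections import Counter

def folder_centroids(folders):
    """folders: {folder: [email_text, ...]} -> TF-IDF weight vector
    per folder. IDF is computed over folders, so words shared by
    every folder get zero weight."""
    counts = {f: Counter(w for mail in mails for w in mail.lower().split())
              for f, mails in folders.items()}
    df = Counter(w for c in counts.values() for w in set(c))
    n = len(counts)
    return {f: {w: tf * math.log(n / df[w]) for w, tf in c.items()}
            for f, c in counts.items()}

def route(text, centroids):
    """Pick the folder whose centroid gives the email the highest total weight."""
    words = text.lower().split()
    return max(centroids, key=lambda f: sum(centroids[f].get(w, 0.0) for w in words))

folders = {"work": ["quarterly report deadline", "meeting agenda and report"],
           "hobby": ["guitar strings order", "band practice schedule"]}
centroids = folder_centroids(folders)
print(route("report due before the meeting", centroids))  # → work
```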
Incorporating Social Clusters
• Identify initial social clusters (unsupervised)
• Weights to distinguish:
  – From and cc fields
  – Number of occurrences in distinct emails
• Study effects of incorporating sender and recipient information:
  – Can it substitute part of the training required?
  – Can it compensate for documental evidence of similarity?
  – Quality of results vs. training time tradeoff?
  – How does it affect regular classification if used as terms too?

Evaluation
• Recently the Enron Email Dataset was made public
  – The only substantial collection of “real” email that is public
  – Fast becoming a benchmark for most experiments in Social Network Analysis, Email Classification, Textual Analysis, …
• Study/comparison of the aforementioned metrics with the already available folder classification on the Enron Dataset
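The weighting idea above (count how often a sender-recipient pair co-occurs, over distinct emails) can be sketched as follows; the dict field names are my assumptions, not a real mail-header parser:

```python
from collections import Counter
from itertools import combinations

def link_weights(emails):
    """emails: list of dicts with a 'from' address and a 'cc' list.
    Weight each pair of people by the number of distinct emails in
    which they co-occur ('number of occurrences in distinct emails').
    These weighted links are the raw material for unsupervised
    social clustering."""
    weights = Counter()
    for mail in emails:
        people = {mail["from"], *mail.get("cc", [])}
        for pair in combinations(sorted(people), 2):
            weights[pair] += 1   # each email counts once per pair
    return weights

mails = [{"from": "alice", "cc": ["bob", "carol"]},
         {"from": "alice", "cc": ["bob"]}]
print(link_weights(mails)[("alice", "bob")])  # → 2
```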
Extensions
• Role discovery using the Author-Topic-Recipient Model to facilitate classification
• Lexicon expansion to capture similarity in small amounts of data
• Using past history of conversation to relate messages

References
• Provost, J. “Naïve-Bayes vs. Rule-Learning in Classification of Email”, The University of Texas at Austin, Artificial Intelligence Lab, Technical Report AI-TR-99-284, 1999.
• Crawford, E., Kay, J., and McCreath, E. “Automatic Induction of Rules for E-mail Classification”, Proc. Australasian Document Computing Symposium, 2001.
• Kiritchenko, S. and Matwin, S. “Email Classification with Co-Training”, CASCON’02 (IBM Center for Advanced Studies Conference), Toronto, 2002.
• Turenne, N. “Learning Semantic Classes for Improving Email Classification”, Proc. IJCAI 2003 Text-Mining and Link-Analysis Workshop, 2003.
• Arey, M. and Chakravarthy, S. “eMailSift: Adapting Graph Mining Techniques for Email Classification”, SIGIR 2004.
A Research Literature Search Engine with Abbreviation Recognition
Group members: Cheng-Tao Chu, Pei-Chin Wang

Outline
• Motivation
• Approach
  – Architecture
• Technology
• Evaluation
Motivation
• Existing research literature search engines don’t perform well on author, conference, and proceedings abbreviations
• Ex: search “C. Manning, IJCAI” in Citeseer or Google Scholar
[screenshot: search result in Google Scholar]

Goal
• Instead of searching by the index alone, identify the semantics in the query
• Recognize abbreviations for author and proceedings names
Approach
• Crawl DBLP as the data source
• Index the data with fields of authors, proceedings, etc.
• Train the tagger to recognize authors and proceedings
• Use the probabilistic model to calculate the probability of each possible name
• Use the tailored edit distance function to calculate the weight of each possible proceeding
• Combine these weights into the score of each selected result

Architecture
[diagram: DBLP, Crawler, Tagger, Database, Query, Search Engine, Browser, Retrieved Documents]
Probabilistic Model

Tailored Edit Distance
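The slides name a tailored edit distance but do not give its cost function. One plausible sketch, under the assumption that skipping characters of the full string should be cheap so an abbreviation stays close to its own expansion (the 0.1 deletion cost is an illustrative choice, not the authors'):

```python
def tailored_edit_distance(abbrev, full, del_cost=0.1):
    """Levenshtein variant where skipping (deleting) a character of
    the full string is cheap, so an abbreviation scores close to the
    full form it was derived from. Costs are illustrative assumptions."""
    a, f = abbrev.lower(), full.lower()
    # prev[j] = cost of matching the empty abbreviation against f[:j]
    prev = [j * del_cost for j in range(len(f) + 1)]
    for i in range(1, len(a) + 1):
        cur = [prev[0] + 1.0]                         # drop a char of the abbreviation
        for j in range(1, len(f) + 1):
            cur.append(min(
                prev[j] + 1.0,                        # drop a char of the abbreviation
                cur[j - 1] + del_cost,                # skip a char of the full string
                prev[j - 1] + (a[i - 1] != f[j - 1])  # match (0) or substitute (1)
            ))
        prev = cur
    return prev[-1]

full = "special interest group on information retrieval"
print(tailored_edit_distance("sigir", full) <
      tailored_edit_distance("sigir", "neural information processing systems"))  # → True
```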
Technology
• Crawler: UbiCrawler
• Tagger: LingPipe or YamCha
• Search Engine: Lucene
• Bayesian Network: BNJ
• Web Server: Tomcat
• Database: MySQL
• Programming Language: J2SE 1.4.2

Evaluation
1. We will ask friends to participate in the evaluation (estimated: 2000 queries / 200 friends).
2. Randomly sample 1000 entries from DBLP, extract the author and proceedings info, query with abbreviated info, and check how well the retrieved documents match the results from Google Scholar.
A Web-based Question Answering System
Yu-shan & Wenxiu
01.25.2005

Outline
• QA Background
• Introduction to our system
• System architecture
  – Query classification
  – Query rewriting
  – Pattern learning
• Evaluation
QA Background
• Traditional Search Engine
  – Google, Yahoo, MSN, …
  – Users construct keyword queries
  – Users go through the hit pages to find the answer
• Question Answering SE
  – AskJeeves, AskMSR, …
  – Users ask in natural language
  – Return short answers
  – Maybe supported by references

Our QA System
• Open domain
• Based on massive web documents
  – Redundancy guarantees effectiveness
• Question classification
  – Focus on numeric, definition, human, … patterns
• Exact answer patterns
System Architecture

Question Classifier
• Given a question, map it to one of the predefined classes.
• 6 coarse classes (Abbreviation, Entity, Description, Human, Location, and Numeric Value) and 50 fine classes.
• Also shows syntactic analysis results such as POS tagging, Named Entity tagging, and chunking.
• http://l2r.cs.uiuc.edu/~cogcomp/demo.php?dkey=QC
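A toy rule-based version of the coarse mapping, only to illustrate the six-class label set; the linked demo uses a learned classifier, not surface rules like these:

```python
def classify_question(q):
    """Map a question to one of the 6 coarse classes via simple
    surface cues. Purely illustrative; real classifiers are trained."""
    q = q.lower()
    if q.startswith(("who", "whom")):
        return "HUMAN"
    if q.startswith("where"):
        return "LOCATION"
    if q.startswith(("how many", "how much", "when")):
        return "NUMERIC"
    if "stand for" in q or "abbreviation" in q:
        return "ABBREVIATION"
    if q.startswith(("what is", "what are")):
        return "DESCRIPTION"
    return "ENTITY"

print(classify_question("Who wrote Hamlet?"))         # → HUMAN
print(classify_question("How many moons has Mars?"))  # → NUMERIC
```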
Query Rewrite
• Use the syntactic analysis result to decide which parts of the question should be expanded with synonyms.
• Use WordNet for synonyms.

Answer Pattern Learning
• Supervised machine learning approach
• Select correct answers/patterns manually
• Statistical answer pattern rules
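A minimal sketch of the supervised pattern-learning step: from manually selected (entity, answer, snippet) triples, collect the surface string between entity and answer and rank candidate patterns by frequency. The triples below are invented examples, not the system's training data:

```python
import re
from collections import Counter

def learn_patterns(examples):
    """examples: (entity, answer, snippet) triples whose answers were
    selected manually. The text between entity and answer in a snippet
    becomes a candidate pattern; frequent patterns rank higher."""
    patterns = Counter()
    for entity, answer, snippet in examples:
        m = re.search(re.escape(entity) + r"(.{1,40}?)" + re.escape(answer), snippet)
        if m:
            patterns[m.group(1)] += 1
    return patterns.most_common()

examples = [
    ("Mozart", "1756", "Mozart was born in 1756 in Salzburg."),
    ("Gauss", "1777", "Gauss was born in 1777 and grew up in Brunswick."),
]
print(learn_patterns(examples))  # → [(' was born in ', 2)]
```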
Evaluation
• Use the TREC 2003 QA set. Answers are retrieved from the Web, not from the TREC corpus.
• Metrics
  – MRR (Mean Reciprocal Rank) of the first correct answer
  – NAns (number of questions correctly answered)
  – %Ans (the proportion of questions correctly answered)

Streaming XPath Engine
Oleg Slezberg
Amruta Joshi
Traditional XML Processing
• Parse the whole document into a DOM tree structure
• Query engine searches the in-memory tree to get the result
• Cons:
  – Extensive memory overhead
  – Unnecessary multiple traversals of the document fragment
    • E.g. /descendant::x/ancestor::y/child::z
  – Cannot return results as early as possible
    • E.g. non-blocking queries

Streaming XML Processing
• XML parser is event-based, such as SAX
• XPath processor performs online event-based matching
• Pros:
  – Less memory overhead
  – Only process the necessary part of the input document
  – Results returned on-the-fly; efficient support for non-blocking queries
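A toy stack-based SAX matcher shows the streaming style: no DOM is built, and each result is emitted as soon as its end tag arrives. This is an illustration only, not the TurboXPath or XSQ algorithm, and it handles only a plain child-axis path:

```python
import xml.sax
from io import StringIO

class PathMatcher(xml.sax.ContentHandler):
    """Stream-match a simple child-axis path (e.g. /doc/item/name)
    against SAX events, using the open-element stack as the current path."""
    def __init__(self, path):
        super().__init__()
        self.path = path.strip("/").split("/")
        self.stack = []    # currently open elements
        self.buf = None    # text buffer while inside a matching element
        self.hits = []     # matched text, available on-the-fly

    def startElement(self, name, attrs):
        self.stack.append(name)
        if self.stack == self.path:
            self.buf = []

    def characters(self, content):
        if self.buf is not None:
            self.buf.append(content)

    def endElement(self, name):
        if self.buf is not None and self.stack == self.path:
            self.hits.append("".join(self.buf))  # result emitted immediately
            self.buf = None
        self.stack.pop()

doc = "<doc><item><name>foo</name></item><item><name>bar</name></item></doc>"
matcher = PathMatcher("/doc/item/name")
xml.sax.parse(StringIO(doc), matcher)
print(matcher.hits)  # → ['foo', 'bar']
```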
What is XPath?
• A syntax used for selecting parts of an XML document

A Simple Example
[figure: an XML document and the corresponding SAX API events]
Objective
• Build a Streaming XPath Engine using the TurboXPath algorithm
• Contributions:
  – Comparison of FA-based (XSQ) and tree-based (TurboXPath) algorithms
  – Performance comparison between TurboXPath & XSQ

XPath Challenges
• Predicates
• Backward axes
• Common subexpressions
• // + nested tags (e.g. ... ... ...)
• *
• Children in predicates that are not yet seen (e.g. a[b]/c, and c is streamed before b)
• Simultaneous multiple XPath query processing
Algorithms
• Finite-Automata Based
  – XFilter
  – YFilter
  – XSQ
• Tree-Based
  – XAOS
  – TurboXPath

Evaluation
• Implementations will be evaluated for
  – Feature completeness
  – Performance (QPS rate)
• XMark
  – XML benchmarking software