Board Search: An Index

Overview

• Forums provide a wealth of information
• Semi-structured data is not taken advantage of by popular search engines
• Despite being crawled, many information-rich posts are lost due to low PageRank

Forum Examples

• vBulletin
• phpBB
• UBB
• Invision
• YaBB
• Phorum
• WWWBoard

[Screenshots of example boards: phpBB, UBB, gentoo, evolutionM, bayareaprelude, warcraft, Paw Talk]

Current Solutions

• Search engines
• Forum's internal search

[Screenshots: results from Google and Lycos vs. a forum's internal search and Board Search]

Evaluation Metric

Metrics: Recall = C/N, Precision = C/E
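The slide does not define C, N, and E; assuming C = correct results returned, N = all relevant items in the gold set, and E = all results extracted by the system, a minimal sketch of the two metrics could look like this (names and data are illustrative only):

```java
import java.util.HashSet;
import java.util.Set;

// Toy precision/recall calculator, assuming:
//   C = relevant results the system actually returned
//   N = all relevant results that exist (gold standard)
//   E = all results the system returned (extractions)
public class BoardSearchMetrics {
    static double recall(Set<String> returned, Set<String> relevant) {
        Set<String> correct = new HashSet<String>(returned);
        correct.retainAll(relevant);                          // C = returned ∩ relevant
        return (double) correct.size() / relevant.size();     // C / N
    }

    static double precision(Set<String> returned, Set<String> relevant) {
        Set<String> correct = new HashSet<String>(returned);
        correct.retainAll(relevant);                          // C
        return (double) correct.size() / returned.size();     // C / E
    }

    public static void main(String[] args) {
        Set<String> relevant = new HashSet<String>();         // hypothetical gold thread IDs
        relevant.add("thread-1"); relevant.add("thread-2"); relevant.add("thread-3");
        Set<String> returned = new HashSet<String>();         // hypothetical system output
        returned.add("thread-1"); returned.add("thread-4");
        System.out.println("recall    = " + recall(returned, relevant));    // 1/3
        System.out.println("precision = " + precision(returned, relevant)); // 1/2
    }
}
```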

Rival system:
• The rival system is the search engine / forum internal search combination
• The rival system lacks precision

Evaluations:
• How good our system is at finding forums
• How good our system is at finding relevant posts/threads

Problems:
• Relevance is in the eye of the beholder
• How many correct extractions exist?

Implementation

• Lucene
• MySQL
• Ted Grenager's Crawler
• Jakarta HTTPClient
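As a rough illustration of how these pieces could fit together, here is a minimal index-and-search sketch against the Lucene 1.x API of that era; the index path, field names, and post text are made up for the example, and the real system would feed posts from the crawler and MySQL rather than a hard-coded string:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// Sketch only: index one forum post and run a query (Lucene 1.4-era API assumed).
public class ForumIndexSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index a single hypothetical post.
        IndexWriter writer = new IndexWriter("/tmp/board-index", analyzer, true);
        Document doc = new Document();
        doc.add(Field.Keyword("url", "http://forums.example.com/thread/42"));
        doc.add(Field.Text("body", "How do I configure kernel modules on gentoo?"));
        writer.addDocument(doc);
        writer.optimize();
        writer.close();

        // Search the index and print scored hits.
        IndexSearcher searcher = new IndexSearcher("/tmp/board-index");
        Query q = QueryParser.parse("kernel modules", "body", analyzer);
        Hits hits = searcher.search(q);
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.score(i) + "  " + hits.doc(i).get("url"));
        }
        searcher.close();
    }
}
```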

Improving Software Package Source Search Quality
Dan Fingal and Jamie Nicolson

The Problem

• Search engines for software packages typically perform poorly
• They tend to search the project name and blurb only
• For example…

[Screenshots: example package searches on Sourceforge.org, Gentoo.org, and Freshmeat.net]

How can we improve this?

• Better keyword matching
• Better ranking of the results
• Better source of information about the package
• Pulling in nearest neighbors of top matches

Better Sources of Information

• Every package is associated with a website that contains much more detailed information about it
• Spidering these sites should give us a richer representation of the package
• Freshmeat.net has data regarding popularity, vitality, and user ratings

Building the System

• Will spider freshmeat.net and the project webpages, put into MySQL
• Also convert the Gentoo package database to MySQL
• Text indexing done with Lucene
• Results generator will combine this with other available metrics (see the sketch below)

How do we measure success?

• Create a gold corpus of queries to relevant packages
• Measure precision within the first N results
• Compare results with search on packages.gentoo.org, freshmeat.net, and google.com
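The slides do not say how the text score and the Freshmeat metadata would be combined; one simple possibility is a weighted linear combination like the sketch below, where the weights and the normalization to [0,1] are purely illustrative assumptions, not the project's actual values:

```java
// Illustrative re-ranking: blend a text-relevance score with Freshmeat-style metadata.
public class PackageRanker {
    static double combinedScore(double textScore,    // e.g. Lucene score, assumed normalized to [0,1]
                                double popularity,   // Freshmeat popularity, normalized to [0,1]
                                double vitality,     // Freshmeat vitality, normalized to [0,1]
                                double userRating) { // user rating, normalized to [0,1]
        final double wText = 0.6, wPop = 0.2, wVit = 0.1, wRate = 0.1;  // made-up weights
        return wText * textScore + wPop * popularity + wVit * vitality + wRate * userRating;
    }

    public static void main(String[] args) {
        // Two hypothetical packages matching the same query.
        double a = combinedScore(0.80, 0.30, 0.20, 0.50);
        double b = combinedScore(0.70, 0.90, 0.80, 0.90);
        System.out.println("package A: " + a);
        System.out.println("package B: " + b);  // metadata can promote B above A
    }
}
```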

Any questions?

Incorporating Social Clusters in Email Classification

By Mahesh Kumar Chhaparia

Previous Work

• Previous work on email classification focuses mostly on:
  – Binary classification (spam vs. non-spam)
  – Supervised learning techniques for grouping into multiple existing folders
    • Rule-based learning, naïve-Bayes classifiers, support vector machines
• Sender and recipient information is usually discarded
• Some existing classification tools:
  – POPFile: naïve-Bayes classifier
  – RIPPER: rule-based learning
  – MailCat: TF-IDF weighting

Email Classification

• Emails:
  – Are usually small documents
  – Keyword sharing across related emails may be small or indistinctive
  – Hence, on-the-fly training may be slow
  – Classifications change over time, and are different for different users!
• Motivation:
  – The sender-receiver link mostly has a unique role (social/professional) for a particular user
  – Hence, it may be used as one of the distinctive characteristics for classification

Incorporating Social Clusters

• Identify initial social clusters (unsupervised)
• Weights to distinguish (a toy sketch follows below):
  – From and cc fields
  – Number of occurrences in distinct emails
• Study effects of incorporating sender and recipient information:
  – Can it substitute part of the training required?
  – Can it compensate for documental evidence of similarity?
  – Quality of results vs. training time tradeoff?
  – How does it affect regular classification if used as terms too?

Evaluation

• The Enron Email Dataset was recently made public
  – The only substantial collection of "real" email that is public
  – Fast becoming a benchmark for most experiments in
    • Social Network Analysis
    • Email Classification
    • Textual Analysis …
• Study/comparison of the aforementioned metrics with the folder classification already available in the Enron Dataset
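The slides leave the exact weighting open; purely to illustrate the idea of turning From/cc co-occurrence into a cluster signal, one could count, per folder, how often each address appears in distinct emails and score a new message by address overlap. Everything below (class names, the scoring rule) is a hypothetical sketch, not the author's model:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch: per-folder counts of From/cc addresses used as a social-cluster score.
public class SocialClusterSketch {
    // folder -> (address -> number of distinct emails it appeared in)
    private final Map<String, Map<String, Integer>> folderAddressCounts =
            new HashMap<String, Map<String, Integer>>();

    // Record the From/cc addresses of one training email filed in `folder`.
    void observe(String folder, List<String> addresses) {
        Map<String, Integer> counts = folderAddressCounts.get(folder);
        if (counts == null) {
            counts = new HashMap<String, Integer>();
            folderAddressCounts.put(folder, counts);
        }
        for (String addr : addresses) {
            Integer c = counts.get(addr);
            counts.put(addr, c == null ? 1 : c + 1);
        }
    }

    // Score = sum of counts for the new message's addresses; higher means the
    // sender/recipient set looks more like this folder's social cluster.
    double score(String folder, List<String> addresses) {
        Map<String, Integer> counts = folderAddressCounts.get(folder);
        if (counts == null) return 0.0;
        double s = 0.0;
        for (String addr : addresses) {
            Integer c = counts.get(addr);
            if (c != null) s += c;
        }
        return s;
    }

    public static void main(String[] args) {
        SocialClusterSketch sketch = new SocialClusterSketch();
        sketch.observe("conferences", Arrays.asList("chair@sigir.org", "pc@sigir.org"));
        sketch.observe("family", Arrays.asList("mom@example.com"));
        System.out.println(sketch.score("conferences",
                Arrays.asList("pc@sigir.org", "unknown@x.org")));  // 1.0
    }
}
```

Such a score could then be combined with the usual text-based evidence, which is exactly the tradeoff the evaluation questions above probe.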

Extensions

• Role discovery using an Author-Topic-Recipient Model to facilitate classification
• Lexicon expansion to capture similarity in small amounts of data
• Using past history of conversation to relate messages

References

• Provost, J. "Naïve-Bayes vs. Rule-Learning in Classification of Email", The University of Texas at Austin, Artificial Intelligence Lab, Technical Report AI-TR-99-284, 1999.
• Crawford, E., Kay, J., and McCreath, E. "Automatic Induction of Rules for E-mail Classification", in Proc. Australasian Document Computing Symposium, 2001.
• Kiritchenko, S. and Matwin, S. "Email Classification with Co-Training", CASCON'02 (IBM Center for Advanced Studies Conference), Toronto, 2002.
• Turenne, N. "Learning Semantic Classes for Improving Email Classification", Proc. IJCAI 2003 Text-Mining and Link-Analysis Workshop, 2003.
• Arey, M. and Chakravarthy, S. "eMailSift: Adapting Graph Mining Techniques for Email Classification", SIGIR 2004.

A research literature search engine with abbreviation recognition
Group members: Cheng-Tao Chu, Pei-Chin Wang

Outline

• Motivation
• Approach
  – Architecture
• Technology
• Evaluation

Motivation

• Existing research literature search engines don't perform well on author, conference, and proceedings abbreviations
• Example: search "C. Manning, IJCAI" in Citeseer or Google Scholar

[Screenshot: search result in Google Scholar]

Goal

• Instead of searching by the index only, identify the semantics in the query
• Recognize abbreviations for author and proceedings names

Approach

• Crawl DBLP as the data source
• Index the data with fields for authors, proceedings, etc.
• Train the tagger to recognize authors and proceedings
• Use the probabilistic model to calculate the probability of each possible name
• Use the tailored edit distance function to calculate the weight of each possible proceeding
• Combine these weights into the score of each selected result

Architecture

[Architecture diagram components: DBLP, Crawler, Tagger, Database, Search Engine, Browser, Query, Retrieved Documents]

Probabilistic Model

Tailored Edit Distance
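The slides do not show the tailored function itself; as a baseline, a plain Levenshtein edit distance over lowercased strings could look like the sketch below, with the proceedings-specific tailoring (for example, cheaper costs for abbreviation-style truncations such as "Proc." vs. "Proceedings") left as an assumption to be layered on top:

```java
// Plain Levenshtein distance as a starting point for the "tailored" version.
// The tailoring (discounted costs for abbreviation truncation, token-level matching
// of proceedings names, etc.) is not shown in the slides and would replace the
// unit costs used here.
public class EditDistanceSketch {
    static int distance(String a, String b) {
        a = a.toLowerCase();
        b = b.toLowerCase();
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int subst = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // deletion
                                            d[i][j - 1] + 1),     // insertion
                                   d[i - 1][j - 1] + subst);      // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("IJCAI", "ijcai"));                      // 0
        System.out.println(distance("Proc. IJCAI", "Proceedings of IJCAI")); // large; motivates tailoring
    }
}
```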

Technology

• Crawler: UbiCrawler
• Tagger: LingPipe or YamCha
• Search Engine: Lucene
• Bayesian Network: BNJ
• Web Server: Tomcat
• Database: MySQL
• Programming Language: J2SE 1.4.2

Evaluation

1. We will ask friends to participate in the evaluation (estimated: 2000 queries / 200 friends).
2. Randomly sample 1000 entries from DBLP, extract the author and proceedings info, query with abbreviated info, and check how well the retrieved documents match the results from Google Scholar.

A Web-based Question Answering System
Yu-shan & Wenxiu
01.25.2005

Outline

• QA Background
• Introduction to our system
• System architecture
  – Query classification
  – Query rewriting
  – Pattern learning
• Evaluation

QA Background

• Traditional Search Engines
  – Google, Yahoo, MSN, …
  – Users construct keyword queries
  – Users go through the hit pages to find the answer
• Question Answering Search Engines
  – AskJeeves, AskMSR, …
  – Users ask in natural language
  – Return short answers
  – Maybe supported by references

Our QA System

• Open domain
• Based on massive web documents
  – Redundancy guarantees effectiveness
• Question classification
  – Patterns focus on numeric, definition, human, …
• Exact answer patterns

System Architecture

Question Classifier

• Given a question, map it to one of the predefined classes (a toy illustration follows below).
• 6 coarse classes (Abbreviation, Entity, Description, Human, Location, and Numeric Value) and 50 fine classes.
• Also show syntactic analysis results such as POS Tagging, Named Entity Tagging, and Chunking.

• http://l2r.cs.uiuc.edu/~cogcomp/demo.php?dkey=QC
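The linked demo uses a learned classifier over the 6 coarse / 50 fine classes; purely to illustrate the input and output of this step, a crude keyword-based stand-in might look like the following (the rules are invented for the example and are not the real classifier):

```java
// Crude stand-in for the question classifier: maps a question to one of the six
// coarse classes using keyword heuristics. Invented rules, for illustration only.
public class CoarseQuestionClassifier {
    static String classify(String question) {
        String q = question.toLowerCase();
        if (q.startsWith("who"))                                   return "HUMAN";
        if (q.startsWith("where"))                                 return "LOCATION";
        if (q.startsWith("when") || q.startsWith("how many")
                || q.startsWith("how much"))                       return "NUMERIC";
        if (q.startsWith("what does") && q.contains("stand for"))  return "ABBREVIATION";
        if (q.startsWith("what is") || q.startsWith("why"))        return "DESCRIPTION";
        return "ENTITY";
    }

    public static void main(String[] args) {
        System.out.println(classify("Who wrote Hamlet?"));          // HUMAN
        System.out.println(classify("When was Mozart born?"));      // NUMERIC
        System.out.println(classify("What does NASA stand for?"));  // ABBREVIATION
    }
}
```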

Query Rewrite

• Use the syntactic analysis results to decide which parts of the question to expand with synonyms.
• Use WordNet for synonyms.

Answer Pattern Learning

• Supervised machine learning approach
• Select correct answers/patterns manually
• Statistical answer pattern rules (example below)
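The slides do not give a concrete pattern; as an illustration of what a learned answer pattern could look like at answer-extraction time, here is a sketch that applies a "X was born in <year>" style regular expression to retrieved snippets. The pattern and the snippets are made up; the real system would learn such rules from question/answer pairs:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration of applying one answer pattern to web snippets.
public class AnswerPatternSketch {
    public static void main(String[] args) {
        String subject = "Mozart";
        // Hypothetical learned pattern: "<NAME> was born in/on ... <YEAR>"
        Pattern p = Pattern.compile(subject + " was born (?:in|on .*?,) (\\d{4})");
        String[] snippets = {
            "Wolfgang Amadeus Mozart was born in 1756 in Salzburg.",
            "Mozart was born on January 27, 1756.",
            "Mozart wrote his first symphony at age eight."
        };
        for (String s : snippets) {
            Matcher m = p.matcher(s);
            if (m.find()) {
                System.out.println("candidate answer: " + m.group(1));  // 1756, twice
            }
        }
    }
}
```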

Evaluation

• Use the TREC 2003 QA set. Answers are retrieved from the Web, not from the TREC corpus.
• Metrics:
  – MRR (Mean Reciprocal Rank) of the first correct answer
  – NAns (Number of Questions Correctly Answered)
  – %Ans (the proportion of Questions Correctly Answered)
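For concreteness, MRR over a question set can be computed as below; the ranks in main are invented sample data, with rank 0 meaning no correct answer was returned:

```java
// Mean Reciprocal Rank of the first correct answer.
// firstCorrectRank[i] = 1-based rank of the first correct answer for question i,
// or 0 if no correct answer was returned.
public class MrrSketch {
    static double meanReciprocalRank(int[] firstCorrectRank) {
        double sum = 0.0;
        for (int r : firstCorrectRank) {
            if (r > 0) sum += 1.0 / r;
        }
        return sum / firstCorrectRank.length;
    }

    public static void main(String[] args) {
        int[] ranks = {1, 3, 0, 2};   // invented example: 4 questions
        // (1 + 1/3 + 0 + 1/2) / 4 ≈ 0.458
        System.out.println("MRR = " + meanReciprocalRank(ranks));
        // NAns = number of questions with rank > 0; %Ans = NAns / total questions
    }
}
```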

Streaming XPath Engine
Oleg Slezberg
Amruta Joshi

Traditional XML Processing

• Parse the whole document into a DOM tree structure
• The query engine searches the in-memory tree to get the result
• Cons:
  – Extensive memory overhead
  – Unnecessary multiple traversals of the document fragment
    • E.g. /descendant::x/ancestor::y/child::z
  – Cannot return results as early as possible
    • E.g. non-blocking queries

Streaming XML Processing

• The XML parser is event-based, such as SAX
• The XPath processor performs online event-based matching
• Pros:
  – Less memory overhead
  – Only the necessary part of the input document is processed
  – Results returned on-the-fly; efficient support for non-blocking queries

What is XPath?

• A syntax used for selecting parts of an XML document
• Describes paths to elements, similar to an OS describing paths to files
• Almost a small programming language; it has functions, tests, and expressions
• W3C standard
• Not itself written as XML, but is used heavily in XSLT

A Simple Example

[Figure: an XML document <doc><para1>Hello world!</para1><para2>…</para2></doc> and the corresponding SAX events: start element doc; start element para1; data "Hello world!"; end element para1; …; end element doc]

• XPath query Q = /doc/para1/data()
• Traditional processing:
  – Build an in-memory DOM structure
  – Return "Hello world" after the end of the document
• Streaming processing (see the SAX sketch below):
  – Match /doc in Q on start element doc
  – Match /doc/para1 in Q on start element para1
  – Return "Hello world" on end element para1
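To make the streaming side of the example concrete, here is a minimal SAX handler that answers Q = /doc/para1/data() on the fly; it hard-codes that one path rather than implementing TurboXPath or XSQ, so it only illustrates the event-driven matching idea from the slide:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Minimal streaming evaluation of Q = /doc/para1/data() using SAX events.
// Hard-codes the single path; a real engine (TurboXPath, XSQ) generalizes this.
public class StreamingXPathSketch extends DefaultHandler {
    private final StringBuilder path = new StringBuilder();   // current element path
    private final StringBuilder text = new StringBuilder();   // buffered text of the match
    private boolean inMatch = false;

    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.append("/").append(qName);
        inMatch = path.toString().equals("/doc/para1");        // match on start element
    }

    public void characters(char[] ch, int start, int length) {
        if (inMatch) text.append(ch, start, length);           // collect data()
    }

    public void endElement(String uri, String localName, String qName) {
        if (inMatch) {
            System.out.println("result: " + text);             // emitted on-the-fly,
            inMatch = false;                                    // before end of document
        }
        path.setLength(path.lastIndexOf("/"));                  // pop the closed element
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><para1>Hello world!</para1><para2>...</para2></doc>";
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), new StreamingXPathSketch());
    }
}
```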

Objective

• Build a Streaming XPath Engine using the TurboXPath algorithm
• Contributions:
  – Comparison of FA-based (XSQ) and tree-based (TurboXPath) algorithms
  – Performance comparison between TurboXPath and XSQ

XPath Challenges

• Predicates
• Backward axes
• Common subexpressions
• // combined with nested tags (e.g. … … …)
• *
• Children in predicates that are not yet seen (e.g. a[b]/c where c is streamed before b)
• Simultaneous multiple XPath query processing

Algorithms

• Finite-Automata Based
  – XFilter
  – YFilter
  – XSQ
• Tree-Based
  – XAOS
  – TurboXPath

Evaluation

• Implementations will be evaluated for:
  – Feature completeness
  – Performance (queries-per-second rate)
• XMark
  – XML benchmarking software
