Opportunities and Challenges

C. Prasenjit Mitra, Pradeep Teregowda, Jian Wu, Juan Pablo Fernandez Ramirez

Pennsylvania State University, USA

Community
• Focus on computer and information science and related areas
• Papers online on scholars’ webpages
• Funded by NSF CISE as CRI, not NSDL
• Users include researchers, decision makers, students and industry all over the world, spanning multiple areas and not restricted to computer or information sciences
• The community depends on CiteSeerX for support at multiple stages in the research and publication of their work

Some of the graduate students who worked on CiteSeer or other Seers

• Isaac Councill, Levent Bolelli – Google
• Yang Song, Sandip Debnath, Q. Tan – Microsoft
• Ziming Zhuang – TenCent
• Shuyi Zheng – Facebook
• Bingjun Sun – eBay
• Ying Liu – KAIST
• Hui Yan – Yahoo

SeerSuite

• Open source tool kit used to build academic search engines and digital libraries for papers and data
– CiteSeerX, ChemXSeer, ArchSeer, YouSeer, CollabSeer, etc.
• Built on commercial-grade software (reuse, not reinvent)
• Different model: research plus DL, crawler based
• Supports research in
– Indexing and search
– Data structures
– Information and social networks / informetrics
– Systems engineering
– User design
– Software engineering and management
• Facilitates digital collection of academic documents
• Trains students in search and software systems
• Educational tool for search engine creation

CiteSeer (aka ResearchIndex)
• Project at NEC Research Institute
• Hosted at NEC Princeton, from 1997 – 2004

• Publicly available in 1998
• Moved to Penn State after collaborators left NEC
• Crawled the web for academic documents on scholars' (mostly CS) web pages
– No content evaluation beyond checking that a document is an academic paper
– Computer scientists publish mostly in conferences
• Provided a broad range of services, including
– Autonomous citation indexing, reference linking, full text indexing, similar documents listing, and several other pioneering features
[Photos: C. Lee Giles, Kurt Bollacker]
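Autonomous citation indexing, as listed among the services above, comes down to pulling reference strings out of each paper and grouping the variants that point to the same work. The following is a minimal sketch of that idea in Python, not CiteSeer's actual implementation. The helpers `normalize_title` and `build_citation_index`, the "title is the quoted part" heuristic, and the sample references are all hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict


def normalize_title(reference: str) -> str:
    """Return a crude matching key for a reference string.

    Assumption for this sketch: the title is the text inside double quotes.
    Real reference parsing is far more involved.
    """
    match = re.search(r'"([^"]+)"', reference)
    title = match.group(1) if match else reference
    # Lowercase, then drop anything that is not a letter, digit, or space.
    cleaned = re.sub(r"[^a-z0-9 ]", "", title.lower())
    # Collapse runs of whitespace into single spaces.
    return " ".join(cleaned.split())


def build_citation_index(papers: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each cited work (by normalized title) to the papers that cite it."""
    index = defaultdict(set)
    for paper_id, references in papers.items():
        for reference in references:
            index[normalize_title(reference)].add(paper_id)
    return dict(index)


# Hypothetical input: two papers citing the same (made-up) work in different styles.
papers = {
    "paper_a": ['A. Author and B. Writer. "An Example Study of Web Crawling." 1999.'],
    "paper_b": ['Author, A. et al. (1999) "An example study of web crawling", Some Venue.'],
}
index = build_citation_index(papers)
```

Both reference strings reduce to the same key, so `index` records that the made-up work is cited by `paper_a` and `paper_b`. Exact key matching is the simplest possible grouping rule; a production system would need fuzzy matching to tolerate typos and truncated titles.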

[Photo: Steve Lawrence]
http://citeseer.ist.psu.edu
[Screenshots: old look, in between]

Next Generation CiteSeer, CiteSeerX

• 1.5 M documents
• 30 M citations
• 2 to 4 M authors
• 400K unique
• 2 to 4 M hits per day
• 800K individual users

But is CiteSeerX a repository?

http://citeseerx.ist.psu.edu

Use: Hits and Clicks
• In a single month
• Avg per day
– 3M hits; 8% queries
– 2.4M w/o bots
– 800K unique visitors
• Users from all over the world
– Majority from the US
• Usage distribution varies across the day (10 AM/11 AM peak)
• Feedback: one a day
• OAI – 1500

• Data (CC SA NC) shared weekly; documents from crawled URLs

Support
• Varies from year to year
– Primarily NSF CISE, 6 years (not NSDL)
– Early support by Microsoft
– Never enough for what we want to do
– Couple to several graduate students
– Some faculty support (couple)
– Total # 4 to 6 (mostly grad students)
– New hardware every couple of years
• Bandwidth support provided by IST/PSU
– 10 Gig switch, a few wide-pipe hops from backbone
• Great value for the dollar!

Goals in practice
• Crawl the web for (all) publicly available relevant papers and index them
• Provide to our community (primarily CISE)

– Continued services
– Improved services
– New services
– Research support and data
• Create similar services to ours through
– Data sharing
– OS software
• Sustainability
– Create similar services
– CiteSeerX in the cloud

Opportunities
• Adaptable interfaces and access to CiteSeerX (SeerSuite)
• Better and more extraction
• Better crawling
• Integration: allow multiple types of media to be linked together
• Allow adding content from multiple sources, including data and metadata
• Extending features to support collaboration among users – exchange material (MyCiteSeerX)
– Experimental data
– References (Bibliography)
– Metadata
• Reliable and fresh data
– Faster updates
• Robust architecture and infrastructure
• Cloudize
• Automated semantic labeling
• New Seers – code more user friendly and flexible

Challenges
• Combining research and system admin
• Hardware, hardware, hardware (code is stable)
• Satisfy user demand for accuracy and ease of finding the most relevant information in the shortest time
• Adaptable interfaces and access to CiteSeerX (SeerSuite)
• Allow multiple types of media to be linked together
• Allow adding content from multiple sources, including data and metadata from publishers to increase coverage
• Extending features to support collaboration among users – exchange material (MyCiteSeerX)
– Experimental data
– References (Bibliography)
– Metadata
• Reliable and fresh data
– Faster updates
• Robust architecture and infrastructure