Searching JACo PDF files on the web

Pascal Le Roux
JACo Team Meeting, Thoiry, France, 18-19 February 2002

1 Status of the CERN JACoW Site

• The CERN Joint Accelerator Conference Web site is hosted on the CERN central web servers (a pool of ≈ 10 machines running Windows 2000 + SP2 + (x thousand) patches, with the Internet Information Services 5.0 web server).

• 10 conferences are published on this site:
– 4 PAC (1995, 1997, 1999, 2001)
– 3 EPAC (1996, 1998, 2000)
– 1 APAC (1998)
– 1 ICALEPCS (1999)
– 1 LINAC (1998)
⇒ About 8000 PDF files.

• We recently received the CDs from Cyclotrons 2001 and Linac’96 but the PDF files are not yet “JACoW compliant” (files not cropped, no keywords…)

2 A tool is required to search papers!

• The CERN JACoW web site provides a search form which serves as a custom interface to the CERN global search engine: http://accelconf.web.cern.ch/AccelConf/top-page.html

3

• Once you click on the Go! button, the form is sent to an ASP script that parses the fields and builds a query string, which is then redirected to the CERN global search engine. The query looks like this:

http://search.cern.ch/query.html?col=cern&qp=&qt=%2Burl%3Aaccelconf+-url%3Aabstract+-site%3Aaps.anl.gov+%2Bdoctype%3Apdf+%2Btitle%3Amagnet&qs=&qc=cern&pw=600&ws=0&qm=0&st=1&nh=10&lk=1&rf=0&rq=0

• This customized query string restricts the search to PDF files published on the JACoW site and specifies where (in which hidden field) to search for the words entered by the user.
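The way such a query string can be assembled is sketched below in Python. This is only an illustration, not the ASP script used on the site: the function name `build_query_url` is hypothetical, and all parameter names and values are copied from the example query above without knowing what each one means to the engine. The sketch assumes each word entered by the user becomes a required `+field:word` term.

```python
from urllib.parse import urlencode

# Endpoint taken from the example query above.
SEARCH_ENDPOINT = "http://search.cern.ch/query.html"


def build_query_url(field, words):
    """Build a search URL restricted to JACoW PDF files.

    field: name of the hidden PDF field to search (e.g. "title").
    words: words typed by the user, separated by spaces.
    """
    # Fixed restrictions copied from the example query:
    # only accelconf URLs, no abstracts, not the aps.anl.gov site, PDF only.
    terms = [
        "+url:accelconf",
        "-url:abstract",
        "-site:aps.anl.gov",
        "+doctype:pdf",
    ]
    # Assumption: every user word is a required term on the chosen field.
    terms += [f"+{field}:{word}" for word in words.split()]

    # Remaining parameters are copied verbatim from the example query.
    params = [
        ("col", "cern"), ("qp", ""), ("qt", " ".join(terms)),
        ("qs", ""), ("qc", "cern"), ("pw", "600"), ("ws", "0"),
        ("qm", "0"), ("st", "1"), ("nh", "10"), ("lk", "1"),
        ("rf", "0"), ("rq", "0"),
    ]
    # urlencode escapes "+" as %2B, ":" as %3A and spaces as "+".
    return SEARCH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_query_url("title", "magnet"))
```

Calling `build_query_url("title", "magnet")` yields a URL with the same parameters as the example query above.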

4

• Once the engine has found the matches, it sorts them by relevance ranking or by date before sending back the customized result page.

5 CERN Global Search Engine

• Since 1997, CERN has used the Ultraseek search engine, running on a Sun Ultra 1 with SunOS 5.6.

• In 2000, Inktomi acquired Infoseek Corporation. Inktomi is a leader in the web-wide search market, providing results for major sites such as:
– MSN Search, Yahoo, Oracle, IBM… and… Fermi National Accelerator Laboratory

• In November 2001, CERN upgraded its search engine from Ultraseek 4.08 to Inktomi Enterprise Search 4.2.

6

• Product changes
The main product changes are bug and security fixes, cosmetic changes for the users, support for direct indexing of Oracle and other ODBC-compliant databases, indexing of NTFS file sources, and improved international support.

• Platform and performance
The search engine now runs on a PC with dual 500 MHz CPUs, 1 GB of RAM, a 70 GB SCSI drive and Windows 2000 Server + SP2 (it is also available for Sun and Linux). This platform:
– indexes the CERN Intranet (approximately 1 million documents)
– re-indexes every 3 to 7 days
– answers about 1000 queries per day
– handles peaks of up to 200 queries / hour
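To put the query figures above in perspective, the short calculation below compares the peak rate with the daily average. It uses only the two numbers quoted on this slide; the variable names are mine.

```python
# Figures quoted on the slide.
QUERIES_PER_DAY = 1000
PEAK_QUERIES_PER_HOUR = 200

# Average hourly load if the daily queries were spread evenly.
average_per_hour = QUERIES_PER_DAY / 24

# How much busier a peak hour is than an average hour.
peak_to_average = PEAK_QUERIES_PER_HOUR / average_per_hour

# Mean time between two queries during a peak hour.
seconds_between_peak_queries = 3600 / PEAK_QUERIES_PER_HOUR

print(f"average load: {average_per_hour:.1f} queries/hour")
print(f"peak / average: {peak_to_average:.1f}")
print(f"one query every {seconds_between_peak_queries:.0f} s at peak")
```

Even at peak, this is roughly one query every 18 seconds, about five times the average rate.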

7

• Specifications
– Inktomi Enterprise Search supports:
  • HTML, XML, Text, RTF, MS Office, PDF (search in hidden fields and full-text search), PostScript, FrameMaker, Lotus, WordPerfect
  • English, French, German, Spanish, Portuguese, Italian, Dutch, Swedish, Norwegian, Danish, Finnish, Chinese and Japanese
– In addition to full-text indexing of PDF files, the engine can also index PDF metadata (our hidden fields: title, subject, author, keywords). Searches on these fields are therefore more accurate than a simple full-text search.
– The search result page provides:
  • Result titles linked to the PDF document
  • Smart Summaries
  • Path and size of the PDF file
– The results can be sorted by date or by relevance ranking.

• Comments from the staff who installed the search engine:
“CERN has not done any evaluation since 1997, except for Microsoft SharePoint (2001) which was not adapted for CERN needs, but we can recommend Inktomi as it requires little work and gives reasonable results.”
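The difference between a full-text search and a search on the hidden fields, mentioned in the specifications above, can be illustrated with a toy example. This is a minimal sketch and not how Inktomi works internally: the `PdfRecord` class, the two functions and the two sample records are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class PdfRecord:
    """Hypothetical record for one indexed PDF file."""
    path: str
    title: str
    author: str
    keywords: list
    full_text: str


def search_full_text(records, word):
    """Return the paths of all records whose body text contains the word."""
    word = word.lower()
    return [r.path for r in records if word in r.full_text.lower()]


def search_field(records, field_name, word):
    """Return the paths of all records whose given hidden field contains the word."""
    word = word.lower()
    hits = []
    for r in records:
        value = getattr(r, field_name)
        text = " ".join(value) if isinstance(value, list) else value
        if word in text.lower():
            hits.append(r.path)
    return hits


# Two invented papers: only the first one is really about magnets.
records = [
    PdfRecord("paper_a.pdf", "Design of a superconducting magnet", "A. Author",
              ["magnet", "superconducting"],
              "We describe the design of a superconducting magnet."),
    PdfRecord("paper_b.pdf", "Beam diagnostics for a linac", "B. Author",
              ["diagnostics"],
              "The beam passes a steering magnet before reaching the monitor."),
]

print(search_full_text(records, "magnet"))       # both papers mention the word
print(search_field(records, "title", "magnet"))  # only the paper about magnets
```

The full-text search returns both papers because the word appears somewhere in each body, while the title search returns only the paper whose subject is actually a magnet.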

8

• Price
$2,995 for 1-3,000 pages, $7,495 for up to 10,000 pages.
But CERN IT told me: “We had a nice price from Inktomi. I cannot tell you how much… This was our main reason to purchase this product as the IT budget is small…”

9 Is there an alternative to the Inktomi Enterprise Search locally?

• Hundreds of other search services/products are available on the market.
• But they do not always suit PDF searches. Some tools cannot index the text contained in the PDF hidden fields.

10 Local search tool or remote search service?

• Local search tool
This is the solution described previously. You need:
– The search engine software.
– A powerful machine dedicated to this indexing and search service.
– An administrator who takes care of the system 24 hours a day.

CERN selected Inktomi mainly because it was offered a very attractive price for such a product.

But of course, many products are available on the market.

Since I have not evaluated any of these products, I can’t rate them without serious testing. I can only give you a list of leading products according to articles found on the web…

11

• AltaVista Enterprise Search
– Price: $15,000 for smaller companies, up to millions for large corporations!
– Platform supported: Windows NT, Windows 2000, Tru64 UNIX, HP/UX, Solaris, Linux

• […] Appliance
– Price: $20,000 for 1x rack-mountable box (150,000 documents)
– Platform supported: […]-specific Linux on supplied hardware

• Inktomi Enterprise Search
– Price: $2,995 for 1-3,000 pages, $7,495 for up to 10,000 pages
– Platform supported: Windows NT/2000; Unix: Solaris 2.5 and above, Linux, HP-UX 11.0

Specifications (given once for the products above):
– Handle over 200 file formats, including XML, PDF, PostScript, MS Office
– Support about 30 languages
– Can index ≈ 10000 files / hour

• Microsoft Index Server + Adobe PDF IFilter 5.0
– Price: ≈ Free: integrated with Microsoft Internet Information Server and Windows NT® Server 4.0; Adobe PDF IFilter 5.0 is free
– Platform supported: Windows NT (Server only, not Workstation), Windows 2000
– Specifications: Adobe PDF IFilter 5.0 extends the search capabilities of MIS by indexing all the hidden fields

• PDF WebSearch (based on dtSearch)
– Price: $7,500
– Platform supported: Windows 95/98/NT4/2000
– Specifications: a search engine specially designed for PDF

• Elan Web Search
– Price: ?
– Platform supported: Windows NT 4 / 2000 + Microsoft Internet Information Server
– Specifications: optimized to support the searching of PDF hidden fields + 16 more custom fields

More exhaustive list at : http://www.searchtools.com/info/pdf.html

12

• Remote search services
In this case, you just have to sign up for one of the various search services available online. Some of them are free, completely supported by advertising.

Advantages
– You don’t have to worry about the work involved in setting up a search engine.
– No expensive software to buy.
– No machine to maintain.
– No technician to pay for taking care of the service.
– Remote search engines work just as well as local ones.

Drawbacks
You don’t have as much control over:
– The indexing process: you do not know how often your site is indexed (it can sometimes take many weeks with free services…).
– The accessibility and response time of the search engine.
– The design of the search result pages (advertising…).
Also, if you pay for the service and have a lot of pages to index, a local search solution can be considerably cheaper.

13

• Atomz Enterprise
– Price: $10,000 per year and up, depending on the number of domains and pages
– Indexing frequency: weekly and on demand
– Comments: no advertising, just an Atomz logo; 15 languages supported; indexes and searches hidden fields in PDF

• Google
– Price: free with Google logo and limited customization; paid version offers many more options...
– Indexing frequency: Google controls scheduling (≈ 1 month for free version)

• FreeFind Enterprise
– Price: free with advertising; $79 per month for 5,000 pages
– Indexing frequency: daily for paid version
– Comments: PDF indexing available only for paid version

For a more exhaustive list, have a look at: http://www.searchtools.com/info/pdf.html

14 Example of a remote search service using the Google web-wide search engine

• Since our CERN web servers are indexed by the Google web-wide search engine, I’ve duplicated the JACo search form to test Google.

• In the free version of Google, you can’t build precise queries using the title and keywords fields. You can only perform full-text searches or author field searches.
• But you can restrict the search to a given domain (http://accelconf.web.cern.ch/AccelConf/) and a given file type (PDF), to search only the PDF files located on our JACoW site.
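The domain and file-type restriction described above can be expressed with Google's `site:` and `filetype:` query operators. The Python sketch below shows one way to build such a search URL; it is not the duplicated search form itself, the function name `build_google_url` is hypothetical, and the `https://www.google.com/search?q=` address is assumed to be the entry point.

```python
from urllib.parse import urlencode


def build_google_url(words, site="accelconf.web.cern.ch/AccelConf", filetype="pdf"):
    """Build a Google full-text search URL limited to one site and one file type."""
    # "site:" restricts results to a host (and optional path),
    # "filetype:" restricts them to a document type.
    query = f"{words} site:{site} filetype:{filetype}"
    # urlencode escapes ":" as %3A, "/" as %2F and spaces as "+".
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    print(build_google_url("magnet"))
```

Only the user's words change between queries; the site and file-type restrictions stay fixed, in the same spirit as the fixed terms of the Inktomi query.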

15

• The result page is quite similar to the Inktomi one, with an interesting feature: the possibility to get an HTML version of the PDF.
• The PAC 2001 papers which were added to the site in mid-January are not yet indexed! (Like a few EPAC 2000 papers…) (It took Inktomi 3 days to index them.)

16 My Conclusions

• We (the JACo team at CERN) don’t have to worry about the search engine tool. An administrator has installed and upgraded the system for us, and keeps the machine and the software up 24 hours a day…
• The indexing is done quite often (at most every 7 days).
• The only things we had to do were to create the HTML form and the ASP script and, of course, upload all the files to a web server.
• Since November 2001 (when the search engine was upgraded), we have received about 3600 hits on the JACo search form.
• We have never received any complaints from the users of the CERN instance (yes, this doesn’t mean that the service is fine…).
• I don’t think that the CERN JACoW site needs another search engine. This service is sufficient.
• It could be used at FNAL since they already have the same search engine… ;-)
