Apache Tika: 1 point Oh!

Chris A. Mattmann
NASA JPL/Univ. Southern California/ASF
ma[email protected]
November 9, 2011

And you are?
• Senior Computer Scientist at NASA JPL in Pasadena, CA USA
• Software Architecture/Engineering Prof at Univ. of Southern California

• Apache Member involved in
  – OODT (VP, PMC), Tika (VP, PMC), Nutch (PMC), Incubator (PMC), SIS (Mentor), Lucy (Mentor), Gora (Champion), MRUnit (Mentor), and Airavata (Mentor)

Roadmap
• 1st part of the talk
  – Why Tika?
  – What is Tika?
  – What are the current versions of Tika?
  – What can it do?
• 2nd part of the talk
  – NASA Earth Science Data Systems
  – Data System Needs and Requirements
  – How does Tika help?

The Information Landscape

Proliferation of content types available

• By some accounts, 16K to 51K content types*
• What to do with content types?
  – Parse them
    • How?
    • Extract their text and structure
  – Index their metadata
    • In an indexing technology like Lucene, Solr, or in Google Appliance
  – Identify what language they belong to
    • Ngrams

*http://filext.com/

Importance of content types

Importance of content type detection

Search Engine Architecture

Goals

• Identify and classify file types
  – MIME detection
    • Glob pattern
      – *.txt
      – *.pdf
    • URL
      – http://…pdf
      – ftp://myfile.txt
    • Magic bytes
    • Combination of the above means
• Classification means reaction can be targeted
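The detection signals listed under Goals (glob pattern, magic bytes, and a combination of the two) can be sketched in a few lines of plain Java. This is a hand-rolled illustration and not Tika's implementation: the class name MimeSniffer, the two tiny lookup tables, and the rule "magic bytes win over the file name" are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only; MimeSniffer is a hypothetical class, not part of Tika.
class MimeSniffer {
    // Assumed glob table (file-name suffix -> type); a real registry is far larger.
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();
    // Assumed magic table (leading bytes -> type).
    private static final Map<byte[], String> MAGIC = new LinkedHashMap<>();

    static {
        GLOBS.put(".txt", "text/plain");
        GLOBS.put(".pdf", "application/pdf");
        MAGIC.put("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf");
        MAGIC.put(new byte[] {(byte) 0x89, 'P', 'N', 'G'}, "image/png");
    }

    /** Combine both signals: content (magic bytes) first, then the name or URL. */
    static String detect(String name, byte[] head) {
        // 1. Magic bytes: compare the start of the content with each known prefix.
        if (head != null) {
            for (Map.Entry<byte[], String> e : MAGIC.entrySet()) {
                byte[] magic = e.getKey();
                if (head.length >= magic.length
                        && Arrays.equals(Arrays.copyOf(head, magic.length), magic)) {
                    return e.getValue();
                }
            }
        }
        // 2. Glob pattern: fall back to the suffix of the file name or URL.
        if (name != null) {
            String lower = name.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> e : GLOBS.entrySet()) {
                if (lower.endsWith(e.getKey())) {
                    return e.getValue();
                }
            }
        }
        // 3. No signal matched: report generic binary data.
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect("report.txt", pdfHead));    // content decides
        System.out.println(detect("ftp://myfile.txt", null)); // name decides
    }
}
```

Checking the content first means a PDF saved under a .txt name is still classified as a PDF, which is what makes a targeted reaction possible.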

Tika is…
• A content analysis and detection toolkit
• A set of Java APIs providing MIME type detection, language identification, integration of various parsing libraries
• A rich Metadata API for representing different Metadata models
• A command line interface to the underlying Java code
• A GUI interface to the Java code

Tika's (Brief) History
• Original idea for Tika came from Chris Mattmann and Jerome Charron in 2006
• Proposed as Lucene sub-project
  – Others interested, didn't gain much traction
• Went the Incubator route in 2007 when Jukka Zitting found that there was a need for Tika capabilities in
  – A Content Management System
• Graduated from the Incubator to Lucene sub-project in 2008
• Graduated to Apache TLP in April 2010
• 40, 88 and 29 issues resolved in versions 1.0, 0.10, and 0.9

Community
• Mailing lists
  – User: 125 peeps, ~70 msg/mo.
  – Dev: 210 peeps, ~250 msg/mo.
• Committers/PMC
  – 13 peeps
  – Large majority of them active
• Releases
  – 11 releases so far
  – Just pushed out 1 point OH
• http://s.apache.org/N0I
Credit: svnsearch.org

Use in the classroom
• Have used for the past 2 years in both my Search Engines/Information Retrieval class and my Software Architecture class
  – Several student final projects have turned into contributions for the project and merit for the students
• Define data management projects that involve the use of OODT, and other technologies like Solr, Tika, Nutch, Hadoop, etc.

The Apache Software Foundation Announces Apache Tika (tm) v1.0
http://www.globenewswire.com/newsroom/news.html?d=237692

Some recent 1 point oh press

Apache Announces Toolkit for Content Detection and Analysis
http://www.cmswire.com/cms/information-management/apache-a...
By Rikki Endsley (@rikkiends), Nov 9, 2011

The Apache Software Foundation announced Tika v.1, an embeddable toolkit for content detection and analysis five years in the making.

What is Tika? The announcement describes it as a one-stop shop for identifying, retrieving and parsing text and metadata from more than 1,200 file formats, such as HTML, PDF, images, OpenOffice, Microsoft Office, email and more:

The Apache Software Foundation Announces Apache Tika (tm) v1.0
Source: The Apache Software Foundation
Date: November 09, 2011 08:00 ET

Standards-based, Content and Metadata Detection and Analysis Toolkit Powers Large-scale, Multi-lingual, Multi-format Repositories at Adobe, the Internet Archive, NASA Jet Propulsion Laboratory, and more.

Forest Hill, MD, Nov. 9, 2011 (GLOBE NEWSWIRE) -- The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of nearly 150 Open Source projects and initiatives, today announced Apache Tika v1.0, an embeddable, lightweight toolkit for content detection and analysis.

"The Apache Tika v1.0 release is five years in the making, providing numerous improvements and new parsing formats," said Chris Mattmann, Apache Tika Vice President, Senior Computer Scientist at NASA Jet Propulsion Laboratory, and University of Southern California Adjunct Assistant Professor of Computer Science. "From a toolkit perspective, it's easy to integrate, and provides maximum functionality with little configuration."

With the increasing amount of information available on the Internet today, automatic information processing and retrieval is urgently needed to understand content across cultures, languages, and continents.

Apache Tika is a one-stop shop for identifying, retrieving, and parsing text and metadata from over 1,200 file formats including HTML, XML, Microsoft Office, OpenOffice/OpenDocument, PDF, images, ebooks/EPUB, Rich Text, compression and packaging formats, text/audio/image/video, Java class files and archives, email/mbox, and more.

Tika entered the Apache Incubator in 2007, became a sub-project of Apache Lucene in 2008, and graduated as an ASF Top-level Project (TLP) in April 2010. Apache Tika has been tested extensively in repositories exceeding 500 million documents across a variety of applications in industry, academia and government labs.

"At NASA, we leverage Apache Tika on several of our Earth science data system projects," explained Dan Crichton, Program Manager and Principal Computer Scientist, NASA Jet Propulsion Laboratory. "Tika helps us process hundreds of terabytes of scientific data in myriad formats and their associated metadata models. Using Tika with other Apache technologies such as OODT, Lucene, and Solr, we are able to automate, virtualize and increase the efficiency of NASA's science data processing pipeline."

Users and software applications use Apache Tika to explore the information landscape through flexible interfaces in Java, from the command line, REST-ful Web services, and also by consuming its functionality from a multitude of programming languages directly, including Python, .NET and C++. Tika defines a standard application programming interface (API) and makes use of existing libraries such as Apache POI and PDFBox to detect and extract metadata and structured text content from various documents using existing parser libraries.

"We've used Apache Tika extensively for a wide range of content extraction tasks, including parsing almost 600 million pages and documents from a large web crawl," said Ken Krugler, Founder and President of Scale Unlimited. "It's proven invaluable as a simple yet robust solution to the challenges of extracting text and metadata from the jungle of formats you find on the web."

"Hippo CMS 7 uses Apache Jackrabbit to index content repositories containing as many as 500,000 documents," explained Arjé Cahn, CTO of Hippo. "We are exploring ways that Apache Tika can enhance access to metadata in our faceted navigation feature, which may result in a possible future patch."
Getting started rapidly…like now!
• Download Tika from:
  – http://tika.apache.org/download.html
• Grab tika-app-1.0.jar
• alias tika "java -jar tika-app-1.0.jar" (csh syntax; in bash: alias tika='java -jar tika-app-1.0.jar')
• tika < somefile.doc > extracted-text.xhtml
• tika -m < somefile.doc > extracted.met
• Works on Windows too (alias only on UNIX)

A quick NASA dataset
• Atmospheric Infrared Sounder Mission (AIRS)
  – Level 2 Cloud Clear Radiance Product
  – Grab it from here:
    • ftp://airspar1u.ecs.nasa.gov/ftp/data/s4pa/Aqua_AIRS_Level2/AIRI2CCF.003/2007/005/
  – Just grab the first file
    • java -jar tika-app-1.0.jar -m < AIRS.2007.01.05.001.L2.CC.v4.0.9.0.G07006021239.hdf
  – Hopefully this worked for you, if not, blame..
    • Windows
      – And Bill Gates

Detecting MIME types from Java
• String type = new Tika().detect(…), where the argument is one of
  – java.io.InputStream
  – java.io.File
  – java.net.URL
  – java.lang.String (a file name)

Adding new MIME types

• Got XML?

• Based on freedesktop.org spec (loosely)

Many custom applications and tools

• You need this: to read this:

Third-party parsing libraries

• Most of the custom applications come with software libraries and tools to read/write these files
  – Rather than re-invent the wheel, figure out a way to take advantage of them
• Parsing text and structure is a difficult problem
  – Not all libraries parse text in equivalent manners
  – Some are faster than others
  – Some are more reliable than others

Parsing

• String content = new Tika().parseToString(…), where the argument is one of
  – InputStream
  – File
  – URL

Streaming Parsing

• Reader reader = new Tika().parse(…), where the argument is one of
  – InputStream
  – File
  – URL

Extraction of Metadata

• Important to follow common Metadata models
  – Dublin Core: any electronic resource
  – XMP: also general, like Dublin Core
  – Word Metadata: specific to .doc, .ppt, etc.
  – EXIF: image related
• Lots of standards and models out there
  – The use and extraction of common models allows for content intercomparison
  – Common models also standardize mechanisms for searching
  – You always know for X file type that field Y is there and of type String or Int or Date

Cancer Research Example

[Figure: cancer research metadata example, with panels labeled "Attributes" and "Relationships". Credit: A. Hart]

Tika Sponsoring the Any23 Project

• Tika PMC is sponsoring the Any23 project in the Incubator (entered: 10/1/2011)
• Any23 = "Anything to Triples"
• Semantic Toolkit for parsing, identification of all major semantic web content types (RDF, etc.)
• Related to
• Looking for synergies between 2 efforts

Metadata

• Metadata met = new Metadata();
  // Dublin Core
  met.set(Metadata.FORMAT, "text/html");
  // multi-valued: add() appends a second value, whereas a second set() would replace the first
  met.add(Metadata.FORMAT, "text/plain");
  System.out.println(java.util.Arrays.toString(met.getValues(Metadata.FORMAT)));
• Other met models supported (HTTP Headers, Word, Creative Commons, Climate Forecast, etc.)
  – Run: tika --list-met-models

Methods for language identification

• N-grams
  – Method of detecting next character or set of characters in a sequence
  – Useful in determining whether small snippets of text come from a particular language, or character set
• Non-computational approaches
  – Tagging
  – Looking for common words or characters

Language Detection
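To make the N-gram idea concrete, here is a small sketch that builds character-trigram profiles from sample text and picks the closest profile by cosine similarity. It is an illustration only: the class NgramLanguageGuesser, the tiny training sentences, and the similarity measure are assumptions for this example, not a description of how Tika's LanguageIdentifier is implemented.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only; NgramLanguageGuesser is a hypothetical class.
class NgramLanguageGuesser {
    // One trigram-count profile per trained language.
    private final Map<String, Map<String, Integer>> profiles = new LinkedHashMap<>();

    /** Count overlapping character trigrams, padding with a space at both ends. */
    static Map<String, Integer> trigrams(String text) {
        String s = " " + text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim() + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    void train(String language, String sample) {
        profiles.put(language, trigrams(sample));
    }

    /** Cosine similarity between two trigram-count vectors. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int va = e.getValue();
            normA += (double) va * va;
            Integer vb = b.get(e.getKey());
            if (vb != null) {
                dot += (double) va * vb;
            }
        }
        for (int vb : b.values()) {
            normB += (double) vb * vb;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Return the best-matching trained language, or "unknown" if nothing overlaps. */
    String guess(String text) {
        Map<String, Integer> profile = trigrams(text);
        String best = "unknown";
        double bestScore = 0;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            double score = cosine(profile, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NgramLanguageGuesser guesser = new NgramLanguageGuesser();
        guesser.train("en", "the quick brown fox jumps over the lazy dog and the cat sat on the mat");
        guesser.train("de", "der schnelle braune fuchs springt hinter den faulen hund und die katze sitzt auf der matte");
        System.out.println(guesser.guess("the dog and the cat"));
        System.out.println(guesser.guess("der hund und die katze"));
    }
}
```

Real profiles are trained on far more text per language; with samples this small the sketch only separates languages whose common trigrams barely overlap.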

• LanguageIdentifier lang = new LanguageIdentifier(
      new LanguageProfile(FileUtils.readFileToString(new File(filename))));
• System.out.println(lang.getLanguage());
• Uses Ngram analysis included with Tika
  – Originating from Nutch
  – Can be improved

Running Tika in GUI form

• tika --gui

Integrating Tika into your App
• Maven
• Ant
• Eclipse
• It's just a set of jars
  – tika-core
  – tika-parsers
  – tika-app
  – tika-bundle
  – tika-server
[Diagram labels: tika-server, tika-app, tika-bundle, tika-parsers, tika-core]

Some really great stuff in 1.0
[Slide overlay: "NICK ALREADY TALKED ABOUT THIS!!!" and "Thunder stolen"]
• Super improved OSGi support
  – New tika-bundle module
• Improved RTF parsing support, OO support, and parsing of Outlook email attachments
• Language Detection for Belarusian, Catalan, Esperanto, Galician, Lithuanian, Romanian, Slovak, Slovenian, and Ukrainian
• Improved PDF parsing (extract annotation)

Things to watch out for

• Deprecated APIs are gone
  – Recompile code
• No more JDK 1.4 version of Tika
  – Upgrade

Improvements to Tika
• Adding more parsers for content types
• Improve the JAX-RS server support
• Expanding ability to handle random access file parsing
  – Scientific data file formats, some work on this
  – Leverage improvements in file representation: TIKA-701, TIKA-654, TIKA-645, TIKA-153
• Geospatial parsing support through GDAL
• Improving language and charset detection

Part 2

Science Data Systems at NASA

Credit: http://www.jpl.nasa.gov/news/news.cfm?release=2011-295

NASA Ground Data Systems

Credit: D. Woollard

Context

• NASA develops science data processing systems for multiple earth science missions
• These systems convert the instrument telemetry delivered to earth from space into useful data for scientific research
• Typical characteristics
  – Remote sensing instruments that orbit the Earth multiple times daily
  – Data are acquired constantly
  – Complex algorithms convert instrument measurements to geophysical quantities

The Square Kilometer Array

• 1 sq. km of antennas
• Never-before seen resolution looking into the sky
• 700 TB
  – Per second!

NASA DESDynI Mission

• 16 TB/day
• Geographically distributed
• 10s of 1000s of jobs per day
• Tier 1 Earth Science Decadal Mission

Some Considerations
• Scale
  – Data throughput rates
  – # of data types
  – # of metadata types
  – # of users to send the data to
• Federation
  – Must leave the data where it is
  – Socio/Economic/Political
• Heterogeneity
  – Technology, data formats, skills!

Apache OODT

• We've got some components to deal with these issues

How are we building these systems now?
• Allow for push/pull of data over arbitrary protocols
• Ingestion builds standard catalog and archive
• Deliver product metadata to search, portal or GIS
• Plug in arbitrary met extractors

How are we building these systems now?
• Separation of file management from workflow management
• Allow for heterogeneous computing resources
• Easily integrate PGEs
• Leverages same ingestion crawler

What does this have to do with Tika?

[Figure callouts: "Metadata Ext: TIKA!" and "MIME identification: TIKA!"]

What does this have to do with Tika?
[Figure callouts: "Metadata Ext: TIKA!" and "MIME identification: TIKA!"]

Science Data File Formats
• Hierarchical Data Format (HDF)
  – http://www.hdfgroup.org
  – Versions 4 and 5
  – Lots of NASA data is in 4, newer NASA data in 5
  – Encapsulates
    • Observation (Scalars, Vectors, Matrices, NxMxZ…)
    • Metadata (Summary info, date/time ranges, spatial ranges)
  – Custom readers/writers/APIs in many languages
    • C/C++, Python, Java

Science Data File Formats
• network Common Data Form (netCDF)
  – www.unidata.ucar.edu/software/netcdf/
  – Versions 3 and 4
  – Heavily used in DOE, NOAA, etc.
  – Encapsulates
    • Observation (Scalars, Vectors, Matrices, NxMxZ…)
    • Metadata (Summary info, date/time ranges, spatial ranges)
  – Custom readers/writers/APIs in many languages
    • C/C++, Python, Java
  – Not a hierarchical representation: all flat

So how does it work?
• Ingestion
  – Science data files, ancillary information from other missions, etc., arrive in NetCDF or HDF format
  – Need to extract their met, catalog and archive them, etc.
    • Can now use Tika to do this! TIKA-399 and TIKA-400 added this capability
• Processing
  – Processors (PGEs) generate NetCDF and HDF, must extract met, catalog and archive

Tool support
• Entire stacks of tools written around these formats
  – OPeNDAP, LAS, readers, writers, custom NASA mission toolkits
  – OGC
    • WMS, WCS, etc.
  – Unique, one of a kind software built around these data file formats
• Apache can contribute strongly in this area!
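The ingestion step described under "So how does it work?" (identify the file type, run a matching met extractor, then catalog the result) might be sketched as follows. All names here (IngestSketch, MetExtractor, the metadata keys) are hypothetical and chosen for illustration; they are not the OODT or Tika API, and the MIME type strings are assumed values.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; every name in this class is hypothetical.
class IngestSketch {
    /** A pluggable metadata ("met") extractor for one family of files. */
    interface MetExtractor {
        Map<String, String> extract(String fileName);
    }

    // Extractors are selected by the MIME type found during identification.
    private final Map<String, MetExtractor> extractorsByType = new HashMap<>();
    // Stand-in for the catalog: file name -> extracted metadata.
    private final Map<String, Map<String, String>> catalog = new LinkedHashMap<>();

    void register(String mimeType, MetExtractor extractor) {
        extractorsByType.put(mimeType, extractor);
    }

    /** Ingest one file: choose the extractor for its type, extract met, catalog it. */
    boolean ingest(String fileName, String mimeType) {
        MetExtractor extractor = extractorsByType.get(mimeType);
        if (extractor == null) {
            return false; // no extractor plugged in for this type
        }
        Map<String, String> met = new LinkedHashMap<>(extractor.extract(fileName));
        met.put("MimeType", mimeType);
        catalog.put(fileName, met);
        return true;
    }

    Map<String, String> lookup(String fileName) {
        return catalog.get(fileName);
    }

    public static void main(String[] args) {
        IngestSketch pipeline = new IngestSketch();
        // Assumed type strings; the extractors here just return a fixed field.
        pipeline.register("application/x-hdf",
                name -> Collections.singletonMap("Format", "HDF"));
        pipeline.register("application/x-netcdf",
                name -> Collections.singletonMap("Format", "netCDF"));
        System.out.println(pipeline.ingest("granule.hdf", "application/x-hdf"));
        System.out.println(pipeline.lookup("granule.hdf"));
    }
}
```

Keying extractors by detected type is what lets one crawler serve both the ingestion and the processing paths: a new format needs only a new registered extractor.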
Besides processing science files
• …Tika also helps with
• MIME identification
  – Useful in remote file acquisition
  – Useful in classification (catalog/archive) of existing content
  – Useful in crawling (see my Nutch talk last year: http://s.apache.org/UvU)
• Language identification
  – Can be useful when data is coming from around the world, but need to quickly identify whether or not we can process it

Big Goal
• More closely link OODT and Tika
  – Add new parser to Tika
  – Easily get OODT met extractor based on it
• Contribute back some features still baking in OODT
  – Configuration aspects of parsing
  – File types and extensions for science data files
• Spatial
  – Some work done in my CS572 class on spatial parser for Tika
  – Would be great to integrate with Tika, OODT, SIS, and Solr

NASA Geo Challenges
• Sometimes the data isn't annotated with lat and lon
  – How to discover this?
• Even when the data is annotated with spatial information, computation of e.g., bounding box around the poles is difficult
• Efficiency and speed are difficult since data is at scale

Acknowledgements

• Some Tika material inspired by Jukka Zitting's talks
  – http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika
  – http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika-4427630
• NASA Jet Propulsion Laboratory
  – OODT Team

Book
• Jukka and I have finished the first definitive guide on Apache Tika
• Official release date: 11/17
• Early Access available through MEAP program

• http://manning.com/mattmann/

Alright, I'll shut up now

• Any questions?

• THANK YOU!
  – ma[email protected]
  – chris.a.ma[email protected]
  – @chrismattmann on Twitter