SPARQling Pig –Processing Linked Data with Pig Latin
Ste an Hagedorn1 Kat a Hose2 Kai-Uwe Sattler1
1 Te nis e Universitat¨ Ilmenau Ilmenau Germany {first.last}@tu-ilmenau.de 2 Aal orgUniversity Aal org Denmark [email protected]
Abstract: In re ent years data owlanguages su as Pig Latin ave emerged as exi le and power ul tools or andling omplexanalysis tasks on ig data. T ese languages support s ema exi ility as well as ommon programming patterns su as iteration. T eyo er extensi ility t roug user-de ned un tions w ile running on top o s ala le distri uted plat orms. In doing so t ese languages ena le analyti al tasks w ile avoiding t e limitations o lassi al query languages su as SQL and SPARQL. However t e tuple-oriented viewo general-purpose languages likePig does not mat very well t e spe i s o modern datasets availa le on t e We w i o ten use t e RDF data model. Grap patterns or instan e are one o t e ore on- epts o SPARQL ut aveto e ormulated as expli it oins w i urdens t e user wit t e details o e ient query pro essing strategies. In t is paper weaddress t is pro lem y proposing extensions to Pig t at deal wit linked data in RDF to ridge t e gap etween Pig and SPARQL or analyti s. T ese extensions are realized y aset o user-de ned un tions and rewriting rules still allowing to ompile t e en an ed Pig s ripts to plain MapRedu e programs. Forall proposed extensions we dis uss possi le rewriting strategies and present results rom an experimental evaluation.
1Introduction Pro essing and analyzing RDF [W C0 ] data and parti ularly Linked (Open) Data [BL06] o ten requires operations going eyond t e apa ilities o lassi query languages su as SPARQL [W C08]. Alt oug t e most re ent W C re ommendations or SPARQL support ederated queries and aggregate un tions typi al data preparation and leaning steps as well as more advan ed analyti al tasks annot easily e implemented dire tly in SPARQL or anysimilar query language. Consider or example auser w o wants to o tain in ormation a out events in all availa le ategories ( estival azz or estra ro k et .) t at takepla e in t e ve iggest ities in t e world. Even i t e user as ull a ess to t e event dataset determining w i 5 ities are t e iggest in t e world requires a ess to anot er remote dataset. Answering su a query urt er requires expensive grouping and an t ere ore easily e ome expensive to evaluate in atriple store. Furt ermore integrating non-RDF sour es e.g. CSV les or omplextasks su as lustering t e events on t eir geograp i lo ation or data leaning tasks are not possi le in SPARQL engines. On t e ot er and wit t e wide a eptan e o t e MapRedu e plat orm [DG08] de lara- tive data owlanguages su as Pig [ORS+08] or Jaql [BEG+ ] ave gained mu atten-