Processing Linked Data with Pig Latin
SPARQling Pig – Processing Linked Data with Pig Latin

Stefan Hagedorn1, Katja Hose2, Kai-Uwe Sattler1
1 Technische Universität Ilmenau, Ilmenau, Germany, {first.last}@tu-ilmenau.de
2 Aalborg University, Aalborg, Denmark, [email protected]

Abstract: In recent years, dataflow languages such as Pig Latin have emerged as flexible and powerful tools for handling complex analysis tasks on big data. These languages support schema flexibility as well as common programming patterns such as iteration. They offer extensibility through user-defined functions while running on top of scalable distributed platforms. In doing so, these languages enable analytical tasks while avoiding the limitations of classical query languages such as SQL and SPARQL. However, the tuple-oriented view of general-purpose languages like Pig does not match very well the specifics of modern datasets available on the Web, which often use the RDF data model. Graph patterns, for instance, are one of the core concepts of SPARQL but have to be formulated as explicit joins, which burdens the user with the details of efficient query processing strategies. In this paper, we address this problem by proposing extensions to Pig that deal with linked data in RDF to bridge the gap between Pig and SPARQL for analytics. These extensions are realized by a set of user-defined functions and rewriting rules, so that the enhanced Pig scripts can still be compiled to plain MapReduce programs. For all proposed extensions, we discuss possible rewriting strategies and present results from an experimental evaluation.

1 Introduction

Processing and analyzing RDF [WC0] data, and particularly Linked (Open) Data [BL06], often requires operations going beyond the capabilities of classic query languages such as SPARQL [WC08]. Although the most recent W3C recommendations for SPARQL support federated queries and aggregate functions, typical data preparation and cleaning steps as well as more advanced analytical tasks cannot easily be implemented directly in SPARQL or any similar query language. Consider, for example, a user who wants to obtain information about events in all available categories (festival, jazz, orchestra, rock, etc.)
that take place in the five biggest cities in the world. Even if the user has full access to the event dataset, determining which five cities are the biggest in the world requires access to another, remote dataset. Answering such a query further requires grouping and can therefore easily become expensive to evaluate in a triple store. Furthermore, integrating non-RDF sources (e.g., CSV files) or performing complex tasks such as clustering the events by their geographic location or cleaning the data is not possible in SPARQL engines.

On the other hand, with the wide acceptance of the MapReduce platform [DG08], declarative dataflow languages such as Pig [ORS+08] or Jaql [BEG+] have gained much attention. They provide a rich set of data processing operations and transparent parallelization to exploit data parallelism, can deal with flexible schemas, and are easy to learn even for non-specialists. Furthermore, these languages can easily be extended by user-defined functions, enabling custom processing. Finally, by targeting MapReduce as the execution environment, dataflow scripts are deployable to large-scale clusters for analyzing big datasets.

Although Pig supports a nested data model, which is based on bags of tuples that in turn can contain bags, a tuple-oriented view results in several limitations and shortcomings:

• While SPARQL provides basic graph patterns (BGPs) as a core concept of query formulation, such query patterns have to be expressed as self joins in a tuple-oriented language. Unfortunately, this is quite an expensive operation. The first reason is that it is implemented in Pig by a so-called COGROUP operator, a generalized form of join and grouping that is mapped to a combination of map and reduce tasks. The second reason is that self joins in Pig require loading the data twice, i.e., once for each branch of a join. Furthermore, the user is responsible for formulating efficient scripts, e.g., in terms of join order or join implementation.
• Pig is targeting the MapReduce platform and therefore works best if the data can be loaded from HDFS [Fou09].
However, processing Linked Data often requires accessing and retrieving data from remote (SPARQL) endpoints, which is usually addressed by federated SPARQL engines. Depending on the size of the remote dataset, the capabilities of the endpoint, and the frequency of script execution, it might be useful to download and materialize the dataset before running the Pig script, or to fetch (and cache) the remote data on demand. In both cases, this has to be explicitly implemented in the scripts.
• Pig provides a flexible nested data model that offers several alternatives for representing groups of triples or RDF statements. This includes not only the representation while processing the data but also the storage format in HDFS files. However, the choice of the most efficient format also depends on the workload and is hence again the responsibility of the user.

To overcome these limitations, this paper presents Pig language extensions adding SPARQL-like features that are particularly useful for implementing analytical processing tasks over Linked Data. The main goal of our work is to provide operators on a level of abstraction that lifts the burden of choosing the most efficient strategy and operator implementations from the user. These extensions include efficient support of BGPs and FILTERs as well as transparent access to remote Linked Open Data. Our contribution is twofold:

• We present a set of language extensions for making Pig a language suitable for analytical Linked Data processing. We discuss possible strategies for implementing these extensions in Pig.
• We report results from an experimental evaluation of the different strategies. In the absence of a cost model and cost-based Pig compilers, these results can be used as a foundation to implement rewriting heuristics in Pig.

The work described in this paper is embedded in a larger vision of a platform for (Linked) Open Data analytics (LODHub [HS]) that allows users to publish, share, and analyze Linked Datasets. In this context, Pig is used, under the hood of a visual dataflow designer, as the data processing language.

The remainder of the paper is structured as follows. After a discussion of related work in Sect.
2, we briefly describe the context of our work in Sect. 3 and derive requirements for suitable Pig extensions for Linked Data processing. The proposed extensions are then presented in Sect. 4, followed by a discussion of the necessary dataflow planning and compiling steps in Sect. 5. Results of the experimental evaluation of the implementation and rewriting strategies are reported in Sect. 6. Finally, Sect. 7 concludes this paper and outlines future work.

2 Related Work

With the continuous growth of RDF datasets, we experience scalability issues when using a single server to process SPARQL queries. Despite efficient centralized SPARQL engines for a single-server installation, such as RDF-3X [NW08], the literature has proposed several systems [KM] that use parallel processing in a cluster of machines. The architectures used for this purpose differ between orchestrating multiple centralized RDF stores [GHS], using key-value stores [W+], or building upon distributed file systems [SPLCTW].

Especially systems using a distributed file system make use of MapReduce [DG08] to process joins [HMM+11, PKT+]. One of the key considerations is to minimize the number of MapReduce cycles while still computing the complete answer to the query. Instead of directly mapping SPARQL queries to MapReduce programs, Pig Latin [ORS+08] can be used as an intermediary step, i.e., SPARQL queries are mapped to Pig Latin, which in turn is compiled into MapReduce.

One of the first approaches of this class was presented in [SPL], where the translation of SPARQL queries into Pig Latin scripts is discussed. In this work, RDF data is represented as plain tuples of three fields, and the operators of the SPARQL algebra are mapped to sequences of Pig Latin commands. In contrast to this approach, our work aims at integrating SPARQL elements such as BGPs into Pig, keeping the dataflow language as the primary language for specifying data analytics pipelines. Furthermore, we support multiple RDF datasets and SPARQL endpoints within a single script.
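
To illustrate why a basic graph pattern turns into a self join once RDF data is held as plain three-field tuples, the following is a minimal sketch in Python (not Pig Latin, and not the implementation of any system cited here). The toy dataset, the `ev:` and `ex:` names, and the helper functions `scan` and `self_join_on_subject` are hypothetical and serve only as an illustration. The pattern { ?s ex:category ?a . ?s ex:city ?b } is evaluated as a hash join of the triple collection with itself on the shared subject.

```python
# Hypothetical toy dataset: each RDF statement is a plain
# (subject, predicate, object) tuple, as in the three-field representation.
TRIPLES = [
    ("ev:1", "ex:category", "jazz"),
    ("ev:1", "ex:city", "Berlin"),
    ("ev:2", "ex:category", "rock"),
    ("ev:2", "ex:city", "Tokyo"),
    ("ev:3", "ex:category", "jazz"),
]


def scan(triples, predicate):
    """One branch of the join: keep (subject, object) of triples with this predicate."""
    return [(s, o) for (s, p, o) in triples if p == predicate]


def self_join_on_subject(triples, pred_a, pred_b):
    """Evaluate { ?s pred_a ?a . ?s pred_b ?b } as a hash self join on ?s."""
    # Build side: index the first branch by subject.
    index = {}
    for s, a in scan(triples, pred_a):
        index.setdefault(s, []).append(a)
    # Probe side: a second, independent pass over the same triples.
    results = []
    for s, b in scan(triples, pred_b):
        for a in index.get(s, []):
            results.append((s, a, b))
    return results


matches = self_join_on_subject(TRIPLES, "ex:category", "ex:city")
print(matches)
```

Note that the sketch scans the triple collection once per branch of the join, which mirrors, on a very small scale, the cost of reading the same data for each triple pattern; every additional pattern in a BGP would add another scan and another join step.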
In addition, we discuss and exploit different (intermediate) data representation formats for RDF data together with possible optimization strategies depending on these data representations.

An alternative line of optimization techniques makes use of algebraic optimization [RKA] to minimize the number of MapReduce cycles on a system based on Pig, in combination with an algebra [KRA] that considers triple groups, e.g., all triples with the same subject are considered a group. In our work, we follow a similar idea of representing RDF triples but generalize the approach and consider it as only one alternative strategy for deriving efficient MapReduce programs.

In comparison to the processing of standard SPARQL queries, analytical queries pose additional challenges due to their special characteristics, such as complexity, evaluation on typically very large datasets, and long runtime [ADE+13, CGMR]. Data cubes with dimensions and measures over RDF data can be defined using RDF-compliant standards [EV2 WC]. On this basis, analytical queries can be formulated, evaluated, and optimized. Kotoulas et al. [KUBM2] propose an approach for query optimization that interleaves query execution and optimization; here, too, SPARQL queries are translated to Pig Latin.

In summary, existing systems have focused on RDF data organized in a single dataset that the user has complete control over. In contrast, this paper proposes a system that targets a broader perspective by considering analytical queries and Linked Data organized in multiple datasets and even remote sources that the user who issues the query does not have any control over.

3 Use Case: The LODHub Platform

The goal of LODHub [HS] is to combine the functionalities of frameworks for Open Data management with infrastructure for storing and processing large sets of Linked (Open) Data. An elastic server infrastructure, e.g., hosted by IaaS (Infrastructure as a Service) providers such as Amazon EC2, will allow machines to be started automatically as workload peaks occur. Queries and datasets can then be distributed among these machines transparently to the user.
LODHub is designed to encourage users to share their data with other users and thus contribute to the idea of Open Data.