SPARQling Pig – Processing Linked Data with Pig Latin

Stefan Hagedorn1, Katja Hose2, Kai-Uwe Sattler1

1 Technische Universität Ilmenau, Ilmenau, Germany, {first.last}@tu-ilmenau.de
2 Aalborg University, Aalborg, Denmark, [email protected]

Abstract: In recent years, dataflow languages such as Pig Latin have emerged as flexible and powerful tools for handling complex analysis tasks on big data. These languages support schema flexibility as well as common programming patterns such as iteration. They offer extensibility through user-defined functions while running on top of scalable distributed platforms. In doing so, these languages enable analytical tasks while avoiding the limitations of classical query languages such as SQL and SPARQL. However, the tuple-oriented view of general-purpose languages like Pig does not match very well the specifics of modern datasets available on the Web, which often use the RDF data model. Graph patterns, for instance, are one of the core concepts of SPARQL but have to be formulated as explicit joins, which burdens the user with the details of efficient query processing strategies. In this paper, we address this problem by proposing extensions to Pig that deal with linked data in RDF to bridge the gap between Pig and SPARQL for analytics. These extensions are realized by a set of user-defined functions and rewriting rules, which still allow the enhanced Pig scripts to be compiled to plain MapReduce programs. For all proposed extensions, we discuss possible rewriting strategies and present results from an experimental evaluation.

1 Introduction

Processing and analyzing RDF [W3C0] data and particularly Linked (Open) Data [BL06] often requires operations going beyond the capabilities of classic query languages such as SPARQL [W3C08]. Although the most recent W3C recommendations for SPARQL support federated queries and aggregate functions, typical data preparation and cleaning steps as well as more advanced analytical tasks cannot easily be implemented directly in SPARQL or any similar language. Consider, for example, a user who wants to obtain information about events in all available categories (festival, jazz, orchestra, rock, etc.) that take place in the five biggest cities in the world. Even if the user has full access to the event dataset, determining which 5 cities are the biggest in the world requires access to another remote dataset. Answering such a query further requires expensive grouping and can therefore easily become expensive to evaluate in a triple store. Furthermore, integrating non-RDF sources, e.g., CSV files, or complex tasks such as clustering the events on their geographic location or data cleaning tasks are not possible in SPARQL engines.

On the other hand, with the wide acceptance of the MapReduce platform [DG08], declarative dataflow languages such as Pig [ORS+08] or Jaql [BEG+] have gained much attention. They provide a rich set of data processing operations and transparent parallelization to exploit data parallelism, can deal with flexible schemas, and are easy to learn even for non-specialists. Furthermore, these languages can easily be extended by user-defined functions, enabling custom processing. Finally, by targeting MapReduce as execution environment, dataflow scripts are deployable to large-scale clusters for analyzing big datasets.

Although Pig supports a nested data model, which is based on bags of tuples that in turn can contain bags, a tuple-oriented view results in several limitations and shortcomings:

• While SPARQL provides basic graph patterns (BGP) as a core concept of query formulation, such query patterns have to be expressed as self joins in a tuple-oriented language. Unfortunately, this is a quite expensive operation: first, because it is implemented in Pig by a so-called COGROUP operator, representing a generalized form of join and grouping, that is mapped to a combination of map and reduce tasks. The second reason is that self joins in Pig require loading the data twice, i.e., once for each branch of a join. Furthermore, the user is responsible for formulating efficient scripts, e.g., in terms of join order or join implementation.

• Pig is targeting the MapReduce platform and therefore works best if the data can be loaded from HDFS [Fou09]. However, processing Linked Data often requires accessing and retrieving data from remote (SPARQL) endpoints, which is usually addressed by federated SPARQL engines. Depending on the size of the remote dataset, the capabilities of the endpoint, and the frequency of script execution, it might be useful to download and materialize the dataset before running the Pig script or to fetch (and cache) the remote data on demand. In both cases, this has to be explicitly implemented in the scripts.

• Pig provides a flexible nested data model that offers several alternatives for representing groups of triples or RDF statements. This includes not only the representation while processing the data but also the storage format in HDFS files. However, the choice of the most efficient format also depends on the workload and is, hence, again the responsibility of the user.

To overcome these limitations, this paper presents Pig language extensions adding SPARQL-like features that are particularly useful for implementing analytical processing tasks over Linked Data. The main goal of our work is to provide operators on a level of abstraction that lifts the burden of choosing the most efficient strategy and operator implementations from the user. These extensions include efficient support of BGPs and FILTERs as well as transparent access to remote Linked Open Data. Our contribution is twofold:

• We present a set of language extensions for making Pig a language suitable for analytical Linked Data processing. We discuss possible strategies for implementing these extensions in Pig.

• We report results from an experimental evaluation of the different strategies. In lack of a cost model and cost-based Pig compilers, these results can be used as a foundation to implement rewriting heuristics in Pig.

The work described in this paper is embedded in a larger vision of a platform for (Linked) Open Data analytics (LODHub [HS]) that allows users to publish, share, and analyze Linked Datasets. In this context, Pig is used, under the hood of a visual dataflow designer, as the data processing language.

The remainder of the paper is structured as follows. After a discussion of related work in Sect. 2, we briefly describe the context of our work in Sect. 3 and derive requirements on suitable Pig extensions for Linked Data processing. The proposed extensions are then presented in Sect. 4, followed by a discussion of the necessary dataflow planning and compiling steps in Sect. 5. Results of the experimental evaluation of the implementation and rewriting strategies are reported in Sect. 6. Finally, Sect. 7 concludes this paper and outlines future work.

2 Related Work

With the continuous growth of RDF datasets, we experience scalability issues when using a single server to process SPARQL queries. Despite efficient centralized SPARQL engines for a single server installation, such as RDF-3X [NW08], literature has proposed several systems [KM] that use parallel processing in a cluster of machines. The architectures used for this purpose differ between orchestrating multiple centralized RDF stores [GHS], using key-value stores [W+], or building upon distributed file systems [SPLCTW].

Especially systems using a distributed file system make use of MapReduce [DG08] to process joins [HMM+PKT+]. One of the key considerations is to minimize the number of MapReduce cycles while still computing the complete answer to the query. Instead of directly mapping SPARQL queries to MapReduce programs, Pig Latin [ORS+08] can be used as an intermediary step, i.e., SPARQL queries are mapped to Pig Latin, which in turn is compiled into MapReduce.

One of the first approaches of this class was presented in [SPL], where the translation of SPARQL queries into Pig Latin scripts is discussed. In this work, RDF data is represented as plain tuples of three fields, and the operators of the SPARQL algebra are mapped to sequences of Pig Latin commands. In contrast to this approach, our work aims at integrating SPARQL elements such as BGPs into Pig, keeping the dataflow language as the primary language for specifying data analytics pipelines. Furthermore, we support multiple RDF datasets and SPARQL endpoints within a single script. In addition, we discuss and exploit different (intermediate) data representation formats for RDF data together with possible optimization strategies depending on these data representations.

An alternative line of optimization techniques makes use of algebraic optimization [RKA] to minimize the number of MapReduce cycles on a system based on Pig in combination with an algebra [KRA] that considers triple groups, e.g., all triples with the same subject are considered a group. In our work, we follow a similar idea of representing RDF triples but generalize the approach and consider it as only one alternative strategy for deriving efficient MapReduce programs.

In comparison to processing of standard SPARQL queries, analytical queries pose additional challenges due to their special characteristics such as complexity, evaluation on typically very large datasets, and long runtime [ADE+ CGMR]. Data cubes with dimensions and measures over RDF data can be defined using RDF compliant standards [EV2 W3C]. On this basis, analytical queries can be formulated, evaluated, and optimized. Kotoulas et al. [KUBM2] propose an approach for query optimization that interleaves query execution and optimization, and SPARQL queries are translated to Pig Latin.

In summary, existing systems have focused on RDF data organized in a single dataset that the user has complete control over. In contrast, this paper proposes a system that targets a broader perspective by considering analytical queries and Linked Data organized in multiple datasets and even remote sources that the user who issues the query does not have any control over.

3 Use Case: The LODHub Platform

The goal of LODHub [HS] is to combine the functionalities of frameworks for Open Data management with infrastructure for storing and processing large sets of Linked (Open) Data. An elastic server infrastructure, e.g., hosted by IaaS (Infrastructure as a Service) providers such as Amazon EC2, will allow machines to be started automatically as workload peaks occur. Queries and datasets can then be distributed among these machines transparently to the user.

LODHub is designed to encourage users to share their data with other users and thus contribute to the idea of Open Data. Still, users can also choose to keep their datasets private. Additionally, as users often tend to work in teams, the platform will allow user accounts to be associated with organizations and users to be organized in teams. When sharing datasets with other users or creating a new version of an existing dataset, provenance information will be kept to track the original author. Data exploration features will help users to discover new knowledge by automatically extending their query results with information found not only in other available datasets within LODHub but also from remote sources on the Web.

The integration of Pig [ORS+08] enables users to perform analytical tasks that are not possible with a pure SPARQL approach. Such tasks include data transformation, combining different sources with different formats, and other complex tasks for which no equivalent SPARQL operations exist.

To let users work with their data, LODHub provides an easy-to-use and intuitive graphical editor to create Pig scripts. With this editor, a user can drag and drop operators and edit their parameters. Operators have ports for input and output, and by connecting the output port of one operator to the input port of another operator, a dataflow is created. Connecting all needed operators results in a dataflow graph, which must contain one or more LOAD operators and at least one DUMP or STORE operator to show the results on screen or to save them as a new dataset, respectively. Between those two types of operators, users can add as many operators as needed. The screenshot in Fig. 1 shows the graphical Pig editor as well as the Pig script compiled from the operator graph.

Figure 1: LODHub's visual dataflow designer for Pig scripts

In its current implementation, LODHub features almost all operators that Pig provides. Some of these operators, however, were modified to meet the challenges that arise when working with RDF data. Note that LODHub is able to work not only with RDF data but with any format for which a load function, e.g., as a UDF, exists.

The following example scenario illustrates an analytical task that is supported by LODHub. Assume there exists an RDF dataset with information about events. These events may be concerts, sport events, etc., or even some sensor readings that have been logged into a file. Among other information, each event is described by a title, a description of the event or sensor reading, a category (e.g., "sport", "music", or "temperature"), and a position of the event in the form of lat/long coordinates. Consider the following excerpt from a dataset (in total ca. 6.5 million events and more than 5 million statements) that was extracted from the website http://eventful.com and transformed into RDF (for simplicity, we abbreviate namespaces and event IDs):

"Metallica" . "2012-08-17T18:00:00+02" . "Rexall Place" . "-1.135E2" . "5.353E1" . "music" .

Assume a user wants to find the number of events in each category for each of the five biggest cities in the world. To answer this question, we first need to find the five biggest cities.

This information could, for instance, be extracted from the remote DBpedia [ABK+07] dataset, which contains information about cities including name, location, and population. The next step is to find out which event took place in which city. As the event dataset contains the coordinates in two separate RDF statements (longitude and latitude), a self-join is required to obtain the full coordinate. Now, using these coordinates, we can use an external web service such as GeoNames [Wi] to obtain the name of the city that an event is located in. Once we have identified the corresponding city for each event, we can join the results obtained from the DBpedia dataset with our events dataset on the cities. This produces the events located in the 5 biggest cities. This intermediate result now has to be joined again with our original event dataset to obtain the event category. We can group this join result on city and event category and then count the frequencies to produce the answer to the user's query.

Based on this example scenario, we identify several challenges to Pig operators:

Loading datasets. First, the dataset loader must be able to efficiently load datasets from the file system (HDFS [Fou09] in our framework). It should hide URIs and file names from the user, as users should not be bothered with details such as the internal file locations. Furthermore, the dataset loader must handle the special data format that is optimized for Pig on RDF data (see Sect. 4 for details).

Basic Graph Pattern. SPARQL uses BGPs that are matched against the underlying dataset. Pig, however, does not support these BGPs natively. To allow users to easily work with RDF data, the Pig operators must be able to efficiently interpret BGPs, which may range from single statements to complex graph patterns involving multiple joins.

Self joins. RDF data is based on the concept of triples defining a relationship (property) between a subject and an object. Therefore, processing SPARQL queries requires evaluating many self joins to process the BGPs defined in SPARQL queries. In contrast to triple stores, Pig (and Hadoop) was not designed for RDF data and assumes only very few joins within a query. This makes plain Pig and Hadoop very inefficient on RDF data, because performing self joins requires loading the same dataset twice. As an example, consider the previous scenario: to construct the coordinate pair for the events, the dataset has to be loaded twice in order to perform the self join. And later, to add the category information, we again have to perform a join with the original event dataset.

Remote joins. One of the benefits of RDF is its distributed nature. Information can be distributed over many servers, and through the links within the datasets, these datasets are inter-connected. This means that processing a query might include performing a join with a dataset on a remote server. As datasets can be large, a lot of information might have to be downloaded. To assure reasonable response times, query strategies for such remote joins have to be implemented. In our scenario, we need to compute a join with the DBpedia dataset. Such a join can be performed in different ways depending on the remote dataset's size and the data needed to answer the query: the dataset can either be downloaded completely (if possible) or, if a SPARQL endpoint is available, a SPARQL query can be sent to retrieve only relevant parts of the dataset, which probably results in only a small amount of data to download.

4 Pig Extensions for Linked Data Processing

In order to address the requirements discussed in Sect. 3, we have developed several extensions to the Pig Latin language, which we introduce in the following. These extensions particularly deal with features for processing RDF data as well as defining analytical pipelines and can be classified into the following groups:

• conversion between RDF triples and the Pig data model for efficient processing,
• accessing linked (RDF) datasets,
• supporting basic graph patterns for users more familiar with the SPARQL style of query formulation.

TUPLIFY – Tuple construction. Although the basic triple structure of RDF data in the form of subject s, predicate p, and object o is very flexible and allows arbitrary structured data including graphs to be represented, it is not the optimal choice for efficient analytics. The main reason is that reconstructing complex objects from multiple triples requires (self) joins, which are very expensive, particularly in MapReduce-based execution models such as the model underlying Pig. On the other hand, a rigid relational structure with a predefined schema, where a set of triples with the same subject is represented by (s, o1, o2, ..., on) for the schema (s, p1, p2, ..., pn), is often not flexible enough for RDF data. Fortunately, Pig provides a flexible nested data model with tuples, bags, and maps. Thus, tuple construction is an important operation necessary to process RDF data in Pig efficiently.

A first possible representation is to use maps for the predicate-object pairs, i.e., (s, {p1 → o1, p2 → o2, ...}). However, this prohibits the use of the same predicate multiple times for a given tuple. Therefore, an alternative approach uses bags instead of maps for the predicate-object pairs, i.e., a schema {subject:bytearray, stmts:{(predicate:bytearray, object:bytearray)}}. This can further be generalized by allowing triples to be grouped on any of the three components, e.g., on the predicate, so that we obtain {predicate:bytearray, stmts:{(subject:bytearray, object:bytearray)}}. Which of these representations is the best choice depends on the operations that shall be applied on the data. The details of tuple reconstruction are hidden by a TUPLIFY operator, which is mapped to different implementations. If the input is a plain bag of triples, it simply groups them as follows:

    triple_groups = FOREACH (GROUP triples BY subject)
        GENERATE group AS subject, triples.(predicate, object) AS stmts;
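To make the effect of this grouping concrete, the following Python sketch mimics it on an in-memory list of triples. It is only an illustration of the semantics, not the TUPLIFY implementation; the function name `tuplify`, the `key` parameter, and the sample predicates are assumptions made for this example.

```python
from collections import defaultdict

def tuplify(triples, key=0):
    """Group (s, p, o) triples on one component.

    key = 0 groups on the subject, 1 on the predicate, 2 on the object.
    The two remaining components form the entries of the nested bag.
    """
    groups = defaultdict(list)
    for triple in triples:
        rest = tuple(v for i, v in enumerate(triple) if i != key)
        groups[triple[key]].append(rest)
    # Sort by grouping key only to obtain a deterministic order.
    return sorted(groups.items())

# Hypothetical sample data; identifiers are abbreviated placeholders.
triples = [
    ("ev1", "title", "Metallica"),
    ("ev1", "category", "music"),
    ("ev2", "category", "sport"),
]
triple_groups = tuplify(triples)
```

Grouping on the predicate instead (`tuplify(triples, key=1)`) yields the generalized layout mentioned above, with (subject, object) pairs in the nested bag.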

RDFLoad – Accessing datasets. For accessing linked datasets, we have to distinguish between local and remote access. The most straightforward way of local access is to load textual files in the Notation 3 (N3) format from HDFS, which is natively supported by Pig's LOAD operator. A more efficient storage scheme, both in terms of space consumption and read costs, is to leverage triple groups by using adjacency lists, i.e., a set of triples {(s1,p1,o1), (s1,p2,o2), ..., (s1,pn,on)} is represented by (s1, {(p1,o1), ..., (pn,on)}). Because the Pig data model supports bags, maps, and nested tuples, textual files representing this structure can be loaded directly into Pig scripts. In addition, by using a user-defined function (UDF) for LOAD, this principle can also be extended to block-oriented binary files using dictionary and/or run-length encoding of the s, p, and o values. Currently, we support the following strategies for accessing RDF data from HDFS:

• using Pig's native LOAD operator for N3 files. Because RDF literals can contain whitespace, a UDF (RDFize) for tokenizing text lines containing triples into RDF components is used as part of a macro:

    REGISTER sparklingpig.jar;
    DEFINE RDFFileLoad(file) RETURNS T {
        lines = LOAD '$file' AS (txt: chararray);
        $T = FOREACH lines GENERATE FLATTEN(pig.RDFize(txt))
            AS (subject, predicate, object);
    };
    triples = RDFFileLoad('rdf-data.nt');

• using Pig's LOAD operator with the BinStorage function for triple groups:

    rdf_tuples = LOAD 'rdf-data.dat' USING BinStorage()
        AS (subject: bytearray, stmts: {(predicate: bytearray, object: bytearray)});

In addition, RDF datasets may also be retrieved from remote SPARQL endpoints. For this purpose, we provide a UDF called SPARQLLoader that sends a given SPARQL query to the endpoint and transforms the result data (usually encoded in XML or JSON) into Pig tuples:

    REGISTER sparklingpig.jar;
    raw = LOAD 'http://endpoint.org:8080/sparql'
        USING SPARQLLoader('SELECT * WHERE { ?s ?p ?o }')
        AS (subject, predicate, object);

Note that the latter opens up new options to deal with the retrieved data: depending on the size of the dataset, it could be useful to download and materialize the data in HDFS for reuse before starting the actual processing. To keep the load at public SPARQL endpoints low, it is of course advisable to download database dumps when available and manually store them in HDFS. Furthermore, depending on the user query, we can choose not to download all triples of the remote endpoint but only those triples that are necessary to answer the user's query. For instance, given a FILTER expression on the population in a user query, the following script retrieves only triples encoding the population of cities:

    REGISTER sparklingpig.jar;
    raw = LOAD 'http://endpoint.org:8080/sparql'
        USING SPARQLLoader(
            'CONSTRUCT { ?s dbpedia:populationTotal ?pop }
             WHERE { ?s rdf:type dbpedia:City . ?s dbpedia:populationTotal ?pop }')
        AS (subject, predicate, object);
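The paper does not describe how SPARQLLoader converts endpoint responses internally. As a rough illustration of the "result data into tuples" step, the sketch below assumes the endpoint answers a SELECT query in the standard SPARQL 1.1 Query Results JSON Format and turns each binding into a flat tuple. The function name `bindings_to_tuples` and the example.org URIs are made up for this example.

```python
import json

def bindings_to_tuples(result_json, variables=("s", "p", "o")):
    """Convert a SPARQL JSON result document into a list of flat tuples.

    Each binding maps a variable name to an object with a "value" field;
    unbound variables are represented as None.
    """
    doc = json.loads(result_json)
    rows = []
    for binding in doc["results"]["bindings"]:
        rows.append(tuple(
            binding[v]["value"] if v in binding else None
            for v in variables
        ))
    return rows

# Hand-written sample response (assumption, not real endpoint output).
sample = json.dumps({
    "head": {"vars": ["s", "p", "o"]},
    "results": {"bindings": [
        {"s": {"type": "uri", "value": "http://example.org/city1"},
         "p": {"type": "uri", "value": "http://example.org/populationTotal"},
         "o": {"type": "literal", "value": "1000"}},
    ]},
})
rows = bindings_to_tuples(sample)
```

A CONSTRUCT query, as in the second script above, returns an RDF graph rather than a bindings table, so a loader would need a separate parsing path for that case.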

If metadata is available that describes the storage format of the data and indicates whether data from remote endpoints have already been downloaded, then the details of these different strategies can be encapsulated by a generic RDFLoad operator.

FILTER – Basic Graph Pattern. BGPs are sets of triple patterns that are used not only for simple filtering but also for joins. Translating BGPs in the style of relational query processing results in a number of expensive joins. Furthermore, the exact formulation depends on the triple/tuple layout of the data; for example, if triples are used, then we need a JOIN, whereas for bag-based tuple structures nested FOREACH/FILTER operations can be used. Consider the following simple BGP as an example:

    { ?s ?o1 . ?s ?o2 }

Based on a triple schema in Pig, i.e., {subject:bytearray, predicate:bytearray, object:bytearray}, the BGP could be implemented by:

    triples1 = RDFLoader('rdf-data.nt');
    triples2 = RDFLoader('rdf-data.nt');
    result = JOIN triples1 BY (subject), triples2 BY (subject);

Note that for self joins Pig requires the data to be loaded twice, as in the example above. Of course, this would not be necessary if the second pattern stemmed from another dataset. In contrast, if we have constructed RDF tuples using the schema {subject:bytearray, stmts:{(predicate:bytearray, object:bytearray)}} before, i.e., creating a bag of pairs (predicate, object) belonging to a subject, then we can implement the BGP by:

    tmp = FOREACH rdf_tuples {
        r1 = FILTER stmts BY (predicate == '');
        r2 = FILTER stmts BY (predicate == '');
        GENERATE *, COUNT(r1) AS cnt1, COUNT(r2) AS cnt2;
    };
    result = FILTER tmp BY cnt1 > 0 AND cnt2 > 0;

For convenience, we hide the details of implementing BGP processing behind an extended FILTER operator, which accepts a BGP as part of the BY clause; this expression then has to be translated accordingly:

    result = FILTER triples BY { ?s ?o1 . ?s ?o2 };
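The nested FOREACH/FILTER formulation can be read as "count the matching statements per pattern and keep the group only if every count is positive". A minimal Python sketch of that reading follows; the predicate names `lat` and `long` are hypothetical stand-ins, since the predicates of the example BGP are not given here, and `match_star` is a name chosen for this illustration.

```python
def match_star(triple_groups, predicates):
    """Keep the triple groups whose statement bag contains every predicate.

    triple_groups: list of (subject, [(predicate, object), ...]) pairs.
    predicates:    the constant predicates of the star-shaped BGP.
    """
    result = []
    for subject, stmts in triple_groups:
        # One count per triple pattern, mirroring COUNT(r1), COUNT(r2), ...
        counts = [sum(1 for p, _ in stmts if p == wanted) for wanted in predicates]
        if all(c > 0 for c in counts):
            result.append((subject, stmts))
    return result

groups = [
    ("ev1", [("lat", "5.353E1"), ("long", "-1.135E2")]),
    ("ev2", [("lat", "4.0E1")]),
]
matched = match_star(groups, ["lat", "long"])
```

Because all statements of a subject already sit in one tuple, no second pass over the dataset is needed, which is the advantage over the JOIN-based variant.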

5 Pig Planning and Compiling

In the previous section, we described several extensions to Pig Latin to process Linked Data. To execute scripts involving these extensions, we need to transform them into plain Pig scripts. This transformation comprises three aspects: planning, rewriting, and UDFs. The remainder of this section discusses these aspects in more detail.

Planning. Given a user query, we can often use metadata to rewrite the query into a more efficient version. Since our extended Pig, for instance, allows access to local as well as remote datasets, a query might contain LOAD operations on a remote SPARQL endpoint. Given the information that the dataset was already downloaded and materialized while executing another query, we can rewrite the original query into one that uses a local dataset instead. To identify these cases, we need to collect metadata and statistics, e.g., a mapping (URL, query) → HDFS path, that we can use to decide whether a remote dataset is already available locally. As an example, consider the LOAD operation from Sect. 4:

    raw = LOAD 'http://endpoint.org:8080/sparql'
        USING SPARQLLoader('SELECT * WHERE { ?s ?p ?o }')
        AS (subject, predicate, object);

If a mapping (http://endpoint.org:8080/sparql, SELECT * WHERE { ?s ?p ?o }) → hdfs:///rdf-data.nt exists, then we can rewrite the expression as

    raw = RDFFileLoad('hdfs:///rdf-data.nt');

which can obviously be executed more efficiently by avoiding expensive data exchange over the Web. Note that this cannot be achieved in a pure Pig environment because of the missing option to generate and store meta information. A framework like LODHub, however, could easily store such information and reuse data across user queries. In addition to the above mentioned mappings, LODHub could also store information about the file format, i.e., whether the content of a file is structured as plain triples or as triple groups, as well as additional statistics (e.g., VoID [W3C0] statistics) about the downloaded datasets, which will allow for further optimization.

Rewriting. Though UDFs provide a flexible mechanism to extend Pig with additional functionality, we decided not to implement all extensions as UDFs, because transforming the extended Pig expressions to plain Pig allows us to use all optimizations that Pig already comes with. Instead, we decided to implement BGP support via rewriting rules, i.e., before execution, the Pig script is passed to a rewriter that analyzes the code and uses rewriting rules to transform extended Pig statements into plain Pig statements. These rules cover various input formats, BGPs, and operations. An example of such a rewriting was already given in Sect. 4 for the extended filter operation. The input format defines the structure of the data stream, i.e., whether the tuples are plain triples from an N3 file or an external SPARQL endpoint, or whether the stream is already in a triple group representation as introduced in Sect. 4. In the following, we present and discuss these rewriting rules in detail.

We use the notation alias_format^(schema) to describe the characteristics of the input and output data of an operator. The alias is the Pig alias of the input or output data; format describes the format of the data, i.e., whether it is plain triples or triple groups. In the former case we write plain, and in the latter case we write group. The schema denotes the schema of the data, e.g., (subject, predicate, object) for plain triples or (subject, {(predicate, object)}) for triple groups. For ease of presentation, the following rules use the subject (s) as a default to denote grouping; nevertheless, the rules hold for other groupings as well.
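Before turning to the individual rules, the planning decision described above (use a materialized copy if the metadata knows one, otherwise go to the endpoint) can be sketched in a few lines of Python. This is an assumption-laden illustration: the catalog is modeled as a plain dictionary keyed by (URL, query), and `plan_load` is a hypothetical helper, not part of LODHub.

```python
def plan_load(endpoint, query, materialized):
    """Emit a plain Pig statement for alias `raw`.

    materialized: dict mapping (endpoint URL, query string) -> HDFS path,
    standing in for the metadata store described in the text.
    """
    path = materialized.get((endpoint, query))
    if path is not None:
        # A local copy exists, so the remote access is avoided entirely.
        return "raw = RDFFileLoad('%s');" % path
    return ("raw = LOAD '%s' USING SPARQLLoader('%s') "
            "AS (subject, predicate, object);" % (endpoint, query))

catalog = {
    ("http://endpoint.org:8080/sparql", "SELECT * WHERE { ?s ?p ?o }"):
        "hdfs:///rdf-data.nt",
}
local = plan_load("http://endpoint.org:8080/sparql",
                  "SELECT * WHERE { ?s ?p ?o }", catalog)
remote = plan_load("http://other.example/sparql",
                   "SELECT * WHERE { ?s ?p ?o }", catalog)
```

The lookup here requires an exact match of the query string; recognizing that a stored result subsumes a more restrictive query would need additional reasoning that this sketch does not attempt.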

(T) If the data is available as plain triples, it can be tuplified using the TUPLIFY operator. It is the task of the optimizer to decide whether or not to apply this rule in order to create triple groups for a given input.

    alias_plain^(s,p,o)
  ⇒
    out_group^(s,{(p,o)}) = TUPLIFY(alias)
                          = FOREACH (GROUP alias BY s) GENERATE group AS s, alias.(p, o) AS stmts;

(R1) If the RDFLoad operator is used with a reference to a remote endpoint (uri.protocol = 'http'), replace the operator with the UDF SPARQLLoader as introduced in Sect. 4.

    out = RDFLoad('uri') AS (schema);
  ⇒
    out_plain^(schema) = LOAD 'uri' USING pig.SPARQLLoader(
        'SELECT * WHERE { ?s ?p ?o }') AS (schema);

(R2) As already outlined in Sect. 4, we can use a restrictive query to reduce the number of triples fetched from the SPARQL endpoints. The necessary knowledge can be drawn from FILTER operations that use BGPs.

    out = RDFLoad('uri') AS (schema);
    out2 = FILTER out BY { BGP1 . BGP2 };
  ⇒
    out_plain^(schema) = LOAD 'uri' USING pig.SPARQLLoader(
        'CONSTRUCT * WHERE { BGP1 }') AS (schema);
    out2_plain^(schema) = FILTER out_plain^(schema) BY { BGP2 };

BGP1 and BGP2 represent basic graph patterns, e.g., star joins. The (sub)set of the BGP that restricts the remote data can be used to define the query that is sent to the SPARQL endpoint. Often, the FILTER's BGP cannot completely be used in the query. In this case, the remaining part of the BGP still has to be applied in a FILTER. Note that the LOAD does not have to be directly followed by the FILTER to match this rule. Rather, the output of the LOAD has to be passed as input to the FILTER transitively.

(L1) To load a local dataset, the URI uses the hdfs protocol (uri.protocol = 'hdfs'). In this case, the RDFLoad operation can be replaced by the RDFFileLoad macro. If the dataset has been saved as plain triples, the output of the loader is plain triples, too.

    out = RDFLoad('uri');
  ⇒
    out_plain^(s,p,o) = RDFFileLoad('uri');

(L2) Loading local datasets that have been saved in the triple group format produces an output with the triple group schema. Assuming the output was grouped by subject (s), the RDFLoad operation will be rewritten as:

    out = RDFLoad('uri');
  ⇒
    out_group^(s,{(p,o)}) = LOAD 'uri' USING BinStorage()
        AS (s:chararray, stmts:bag{t:(p:chararray, o:chararray)});

The resulting schema depends on which component (subject, predicate, or object) was used for grouping. This information can be stored in the metadata for each dataset.

(F1) A single triple pattern may use only unbound variables. If such a triple pattern is used in a FILTER, it can be omitted since it has no effect:

    out^(schema1) = FILTER in^(schema1) BY { ?s ?p ?o };
    out2^(schema2) = any operator out^(schema1);
  ⇒
    out2^(schema2) = any operator in^(schema1);

Note that in this case the format of the aliases does not matter.

(F2) Filter operators that use a single triple pattern with only one bound variable can be rewritten as a simple FILTER with one predicate.

    out_plain^(s,p,o) = FILTER in_plain^(s,p,o) BY { ?s ?p 'value' };
  ⇒
    out_plain^(s,p,o) = FILTER in_plain^(s,p,o) BY o == 'value';

Rewriting rules for one bound variable in the other two components (subject or predicate) are defined analogously. Hence, we omit them here.

(F3) If the triple pattern of a filter contains two (or more) bound variables, it is rewritten as a filter operation that connects both predicates by AND.

    out_plain^(s,p,o) = FILTER in_plain^(s,p,o) BY { ?s 'value1' 'value2' };
  ⇒
    out_plain^(s,p,o) = FILTER in_plain^(s,p,o) BY p == 'value1' AND o == 'value2';
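For plain triples, the three filter rules above share one idea: every constant in the triple pattern becomes an equality test on the corresponding column, and the tests are joined by AND. The Python sketch below illustrates this under simplifying assumptions: patterns are given as 3-tuples of strings, variables start with `?`, and constants contain no quote characters. `pattern_to_condition` is a hypothetical helper, not the rewriter used in the paper.

```python
def pattern_to_condition(pattern):
    """Translate one triple pattern into a plain Pig FILTER condition.

    Returns None when all three components are variables, in which case
    the FILTER can be dropped altogether.
    """
    columns = ("s", "p", "o")
    parts = []
    for column, term in zip(columns, pattern):
        if not term.startswith("?"):
            # A constant ("bound" component) restricts this column.
            parts.append("%s == '%s'" % (column, term))
    return " AND ".join(parts) if parts else None

no_filter = pattern_to_condition(("?s", "?p", "?o"))
one_bound = pattern_to_condition(("?s", "?p", "value"))
two_bound = pattern_to_condition(("?s", "value1", "value2"))
```

A rewriter could then emit `FILTER in BY <condition>` when a condition is returned and simply forward the input alias otherwise.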

As in rule (F2) above, there are analogous rules for the bound variables in other (or all three) components.

(F4) If the input alias of the FILTER operator is in the triple group format and if the one bound variable is the grouping column, e.g., s, the following rule applies:

    out_group^(s,{(p,o)}) = FILTER in_group^(s,{(p,o)}) BY { 'value' ?p ?o };
  ⇒
    out_group^(s,{(p,o)}) = FILTER in_group^(s,{(p,o)}) BY s == 'value';

(F5) If the bound variable of the filter triple pattern is not the grouping column, e.g., o, then the filter operation becomes more complex:

    out_group^(s,{(p,o)}) = FILTER in_group^(s,{(p,o)}) BY { ?s ?p 'value' };
  ⇒
    tmp_group^(s,{(p,o)},cnt) = FOREACH in_group^(s,{(p,o)}) {
        t = FILTER stmts BY o == 'value';
        GENERATE *, COUNT(t) AS cnt;
    };
    out_group^(s,{(p,o)},cnt) = FILTER tmp_group^(s,{(p,o)},cnt) BY cnt > 0;

(F6) If the triple pattern contains two bound variables and neither of them is the grouping column, then both predicates can be connected by AND.

    out_group^(s,{(p,o)}) = FILTER in_group^(s,{(p,o)}) BY { ?s 'v1' 'v2' };
  ⇒
    tmp_group^(s,{(p,o)},cnt) = FOREACH in_group^(s,{(p,o)}) {
        t = FILTER stmts BY p == 'v1' AND o == 'v2';
        GENERATE *, COUNT(t) AS cnt;
    };
    out_group^(s,{(p,o)},cnt) = FILTER tmp_group^(s,{(p,o)},cnt) BY cnt > 0;

(F7) If the triple pattern contains two bound variables and one is the grouping column, it will be split into two filter operations. These can then further be processed by rules (F4) and (F5).

    out_group^(s,{(p,o)}) = FILTER in_group^(s,{(p,o)}) BY { 'v1' ?p 'v2' };
  ⇒
    tmp_group^(s,{(p,o)}) = FILTER in_group^(s,{(p,o)}) BY { 'v1' ?p ?o };
    out_group^(s,{(p,o)},cnt) = FILTER tmp_group^(s,{(p,o)}) BY { ?s ?p 'v2' };

(F8) Analogously to rule (F7), if the input is in the triple group format and a filter triple pattern contains only bound variables, it will be replaced by two filter operations, which in turn can further be processed by rules (F4) and (F6).

    out_group^(s,{(p,o)}) = FILTER in_group^(s,{(p,o)}) BY { 'v1' 'v2' 'v3' };
  ⇒
    tmp_group^(s,{(p,o)}) = FILTER in_group^(s,{(p,o)}) BY { 'v1' ?p ?o };
    out_group^(s,{(p,o)},cnt) = FILTER tmp_group^(s,{(p,o)}) BY { ?s 'v2' 'v3' };

(F9) If a filter contains more than one triple pattern (TP1, ..., TPN), i.e., a BGP, and they do not form a join, they can be evaluated one after the other as an implicit AND. We leave it to the Pig engine to optimize the sequence of filters. In this case, the format of the respective aliases again does not matter.

    out^(schema) = FILTER in^(schema) BY { TP1 ... TPN };
  ⇒
    f1^(schema) = FILTER in^(schema) BY { TP1 };
    ...
    fN^(schema) = FILTER f(N-1)^(schema) BY { TPN };
    out^(schema) = fN;

(J1) If the data is available in plain triple format and if the BGP of a filter expression forms a star join (i.e., the triple patterns TP1, ..., TPN share the same variable in the same position, e.g., on subject), the dataset has to be loaded once for each triple pattern participating in the join. Such joins may contain bound variables. Hence, we first filter the input for these bound variables.

    in_plain^(s,p,o) = RDFLoad('uri');
    out_plain^(s,p,o) = FILTER in_plain^(s,p,o) BY { TP1 ... TPN };
  ⇒
    in1_plain^(s,p,o) = RDFLoad('uri');
    f1_plain^(s,p,o) = FILTER in1_plain^(s,p,o) BY { TP1 };
    ...
    inN_plain^(s,p,o) = RDFLoad('uri');
    fN_plain^(s,p,o) = FILTER inN_plain^(s,p,o) BY { TPN };
    out_plain^(s,p,o) = JOIN f1_plain^(s,p,o) BY (s), ..., fN_plain^(s,p,o) BY (s);

Each filter operation can then be further processed by rules (F1)-(F3).

(J2) As already stated in previous sections, the star join can easily be realized using FILTER operators if the input is already in the triple group format. If the triple patterns contain bound variables, e.g., in the predicate position with values p1, ..., pN, a filter has to be applied.

    out_group^(s,{(p,o)}) = FILTER in_group^(s,{(p,o)}) BY { TP1 ... TPN };
  ⇒
    tmp_group^(s,{(p,o)},cnt1,...,cntN) = FOREACH in_group^(s,{(p,o)}) {
        t1 = FILTER stmts BY p == 'p1';
        ...
        tN = FILTER stmts BY p == 'pN';
        GENERATE *, COUNT(t1) AS cnt1, ..., COUNT(tN) AS cntN;
    };
    out_group^(s,{(p,o)},cnt1,...,cntN) = FILTER tmp_group^(s,{(p,o)},cnt1,...,cntN)
        BY cnt1 > 0 AND ... AND cntN > 0;

(J3) A path join is characterized by the fact that the join variables are not in the same positions in all triple patterns, e.g., the object of TPi is joined with the subject of TPi+1. If a triple pattern contains a constant that can be used as a filter, those filters can be pushed down.

    in_plain^(s,p,o) = RDFLoad('uri');
    out_plain^(s,p,o) = FILTER in_plain^(s,p,o) BY { TP1 ... TPN };
⇒
    in1_plain^(s,p,o) = RDFLoad('uri');
    filter1_plain^(s,p,o) = FILTER in1_plain^(s,p,o) BY { TP1 };
    ...
    inN_plain^(s,p,o) = RDFLoad('uri');
    filterN_plain^(s,p,o) = FILTER inN_plain^(s,p,o) BY { TPN };
    out1_plain^(schema_filter1,schema_filter2) = JOIN filter1_plain^(s,p,o) BY (o), filter2_plain^(s,p,o) BY (s);
    ...
    outi_plain^(schema_out(i-1),schema_filter(i+1)) = JOIN out(i-1)_plain^(schema_out(i-1)) BY (o), filter(i+1)_plain^(s,p,o) BY (s);
    ...
    out_plain^(schema_out(N-1),schema_filterN) = JOIN out(N-1)_plain^(schema_out(N-1)) BY (o), filterN_plain^(s,p,o) BY (s);

The filter operations on each triple pattern can be processed by rules (F)-(F).

(J4) In case the input is loaded as triple groups and a path join needs to be evaluated, we have to flatten the structure and pass the flat triples to the join as in rule (J3) above.

    in_group^(s,{(p,o)}) = RDFLoad('uri');
    out_group^(s,{(p,o)}) = FILTER in_group^(s,{(p,o)}) BY { TP1 ... TPN };
⇒
    in1_group^(s,{(p,o)}) = RDFLoad('uri');
    filter1_group^(s,{(p,o)}) = FILTER in1_group^(s,{(p,o)}) BY { TP1 };
    flat1_plain^(s,p,o) = FOREACH filter1_group^(s,{(p,o)}) GENERATE s, FLATTEN(stmts);
    ...
    inN_group^(s,{(p,o)}) = RDFLoad('uri');
    filterN_group^(s,{(p,o)}) = FILTER inN_group^(s,{(p,o)}) BY { TPN };
    flatN_plain^(s,p,o) = FOREACH filterN_group^(s,{(p,o)}) GENERATE s, FLATTEN(stmts);
    out1_plain^(schema_flat1,schema_flat2) = JOIN flat1_plain^(s,p,o) BY (o), flat2_plain^(s,p,o) BY (s);
    ...
    outi_plain^(schema_out(i-1),schema_flat(i+1)) = JOIN out(i-1)_plain^(schema_out(i-1)) BY (o), flat(i+1)_plain^(s,p,o) BY (s);
    ...
    out_plain^(schema_out(N-1),schema_flatN) = JOIN out(N-1)_plain^(schema_out(N-1)) BY (o), flatN_plain^(s,p,o) BY (s);
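The flattening step and the chained object-to-subject joins of a path join can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the triple patterns are reduced to bound predicates, the groups are flattened once before filtering (the rule above filters the groups first), and all names and sample data are hypothetical.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def flatten(groups) -> List[Triple]:
    """Like FOREACH ... GENERATE s, FLATTEN(stmts): one plain triple per statement."""
    return [(s, p, o) for s, stmts in groups for p, o in stmts]


def path_join(triples: List[Triple], predicates: List[str]) -> list:
    """Chain joins so that the object of pattern i matches the subject of pattern i+1.

    Assumes a non-empty list of predicates, one bound predicate per pattern.
    """
    # One filter per triple pattern (filters pushed below the joins).
    filtered = [[t for t in triples if t[1] == p] for p in predicates]
    # Each row is a tuple of triples that forms a partial path.
    rows = [(t,) for t in filtered[0]]
    for nxt in filtered[1:]:
        # Hash the next filter result on its subject.
        by_subject: Dict[str, List[Triple]] = {}
        for t in nxt:
            by_subject.setdefault(t[0], []).append(t)
        # Join on: object of the last triple in the row == subject of the next.
        rows = [row + (t,) for row in rows for t in by_subject.get(row[-1][2], [])]
    return rows


# Hypothetical sample data: event e1 -> venue v1 -> city Berlin.
groups = [
    ("e1", [("venue", "v1"), ("title", "Concert")]),
    ("v1", [("city", "Berlin")]),
    ("e2", [("venue", "v9")]),
]
paths = path_join(flatten(groups), ["venue", "city"])
```

The join key changes from step to step, which is why grouping on the subject alone does not help here and the data has to be handed to the join as plain triples.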

UDFs and Compiling. We implemented a few UDFs to load and process linked datasets. We use the SPARQLLoader, which accepts the URL of a SPARQL endpoint and a query to retrieve data from that endpoint. Furthermore, there is an RDFize UDF that is used to read and parse statements in N3 files. For compiling and optimizing the resulting script, we rely completely on the Pig compiler to generate the MapReduce programs.

6 Evaluation

In this section, we show that the performance of the scripts depends on which rules are applied to transform the extended Pig script into plain Pig. To do so, we compare the execution times of scripts that were generated by different rules. These experiments show that there is a need for an optimizer that has access to various statistics in order to decide which rule to choose.

In our experiments, we used a Hadoop cluster of eight nodes, consisting of one name node and seven data nodes. Each node has an Intel Core i5-70S CPU with four cores running at 2.9 GHz and 6 GB RAM. The cluster is managed by Cloudera CDH 5.2, running Hadoop 2.5 and Pig 0.2. We imported an 8 GB dataset, which we introduced in Sect. , that contains ∼5 million statements.

In the experiments, we manually transformed the extended Pig statements into plain Pig for typical operations on RDF data, such as filtering on BGPs and self joins. There are two scenarios for input data: (i) the input data are plain N3 triples (based on which triple groups might be generated on-the-fly), and (ii) the input data is in triple group format. The third case, where data is fetched from a remote SPARQL endpoint, is not considered in this evaluation, as the delays caused by remote communication will not lead to additional insights.

For these scenarios, we first compared the execution times of a self join. The task is to compute the latitude and longitude for each event. We started with a Pig script that uses extended Pig features for loading and evaluating a BGP:

    raw = RDFLoad('hdfs:///eventful.nt');
    res = FILTER raw BY { ?s ?lat . ?s ?long };

The script includes one LOAD and one FILTER operator. For LOAD, we have the two scenarios mentioned above. Depending on the chosen rewriting for LOAD, we have two scenarios for the BGP processing: on plain triples or on triple groups. For the latter case, the triple groups are either read directly from the input file or generated on-the-fly. This results in (theoretically) four execution scenarios: (i) data is loaded as plain triples and the filter is evaluated on plain triples, (ii) data is loaded as plain triples and the filter is evaluated on triple groups (this requires generating the groups on-the-fly), (iii) data is loaded as triple groups and evaluation is on plain triples, and (iv) input is loaded as triple groups and the filter is evaluated on these groups. We do not consider case (iii) in this evaluation because converting the compact form to the verbose N3 format will obviously not be beneficial. For scenario (iv), the triple groups were generated beforehand using the code shown in Sect. . The result was saved to HDFS using Pig's BinStorage.

For scenario (i), the above extended Pig script can be rewritten using rule (L) for the load operation and (J1) for the BGPs:

    raw1 = RDFFileLoad('hdfs:///eventful.nt');
    f1 = FILTER raw1 BY predicate == '';
    raw2 = RDFFileLoad('hdfs:///eventful.nt');
    f2 = FILTER raw2 BY predicate == '';
    res = JOIN f1 BY (subject), f2 BY (subject);
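The effect of the scenario (i) rewriting can be sketched on in-memory data in Python. The predicate names `ex:lat` and `ex:long` are hypothetical placeholders, since the predicate URIs are not legible in the scripts above, and the function name is likewise only illustrative.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def self_join_plain(triples: List[Triple], pred_a: str, pred_b: str) -> list:
    """Filter the plain triples twice and join both results on the subject."""
    # The input is scanned once per triple pattern, which mirrors loading
    # the dataset twice in the rewritten script.
    f1 = [t for t in triples if t[1] == pred_a]
    f2 = [t for t in triples if t[1] == pred_b]
    # Hash join on the subject.
    by_subject: Dict[str, List[Triple]] = {}
    for t in f2:
        by_subject.setdefault(t[0], []).append(t)
    return [(a, b) for a in f1 for b in by_subject.get(a[0], [])]


# Hypothetical sample data: only e1 has both a latitude and a longitude.
triples = [
    ("e1", "ex:lat", "52.5"),
    ("e1", "ex:long", "13.4"),
    ("e2", "ex:lat", "48.9"),
]
res = self_join_plain(triples, "ex:lat", "ex:long")
```

The two scans of the same input are what doubles the number of mappers for this scenario in a MapReduce setting.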

To rewrite the input scenario (ii), we used rules (L) for the LOAD and (T) to create the triple groups. To rewrite the BGP in the filter expression, rule (J2) was used. For scenario (iv), the rules (L2) and (J2) were applied.

The execution times of the different rewriting results are shown in Fig. 2. For each script, we used one to eight partitions per input dataset. One partition is then processed by one mapper. For scenario (i), this means that, because we need to load the dataset twice to perform the join, we actually have twice as many mappers as partitions. For both scenarios (ii) and (iv), we only need to load the dataset once.

Figure 2: Execution times for the self join operation with different numbers of partitions (x-axis: partitions, 1 to 8; y-axis: time [seconds]; series: on-the-fly, plain, triple groups). Left: scenario (ii), middle: scenario (i), and right: scenario (iv).

As we can see in the figure, working with triple groups is clearly faster than working with plain triples. Although these gaps decrease when increasing the number of partitions, note that for plain triples there are actually twice as many mappers as for the other two scenarios, i.e., this scenario consumes much more resources. However, if the triple groups have to be generated on-the-fly first, there is a significant performance overhead, and the processing takes significantly longer than for the scenario where the groups were loaded from a file. As expected, the results suggest that it may be beneficial to materialize such intermediate results, since this would allow other operators in complex scripts to reuse these results. If the materialized results are stored permanently, the optimizer can make use of them in other scripts, too.

Furthermore, we can see the impact of the number of mappers used for execution. When increasing the number of partitions, Pig can make use of more mappers to process the data in parallel, which reduces the overall processing time.

Figure 3: Execution times for a BGP filter on a non-grouping component (object) with different numbers of partitions (x-axis: partitions, 1 to 8; y-axis: time [seconds]; series: on-the-fly, plain, triple groups). Left: scenario (ii), middle: scenario (i), and right: scenario (iv).

Next, we measured the performance for a filter on the object component of a triple. The extended Pig script is similar to the one above, except that the filter contains only one triple pattern:

    raw = RDFLoad('hdfs:///eventful.nt');
    res = FILTER raw BY { ?s ?p 'Metallica' . };

Similar to the self join script above, we evaluated three rewriting scenarios. For scenario (i), the load operation is also rewritten using rule (L) and the filter using rule (F2). In scenario (ii), we applied the rules (L), (T), and (F5). In scenario (iv), the rules (L2) and (F5) were used. As an example, we show the rewriting result for scenario (ii):

    raw = RDFFileLoad('hdfs:///eventful.nt');
    out = TUPLIFY(raw);
    tmp = FOREACH out {
        t = FILTER stmts BY (object == 'Metallica');
        GENERATE *, COUNT(t) AS cnt;
    };
    res = FILTER tmp BY cnt > 0;

The execution times for the three input scenarios are shown in Fig. 3.

Here, the differences between the scenarios are even greater. Since filtering is not as expensive as a join operation, the overhead of creating the triple groups on-the-fly causes a greater gap in comparison to the other two scenarios. However, we can clearly see that the triple groups (when loaded directly from file) are still faster than filtering plain triples. Since the filter was on the object, the filter operation had to process each tuple in the bag for each subject, which leads to inspecting just as many triples as when evaluating plain triples. In this case, the advantage comes from the smaller amount of data that has to be read from disk, since the subject is stored only once. Hence, when the dataset becomes larger, the differences will increase to the advantage of the triple groups.

Also, if the filter has to be evaluated on the component on which the triple groups are grouped, the difference between filtering triple groups and plain triples becomes even greater, because the filter predicate has to be evaluated only once per grouped entity (e.g., subject), see Fig. 4. In contrast, for plain triples the filter is evaluated for each statement that belongs to an entity, leading to significantly more filter operations.
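The scenario (ii) rewriting of the object filter can be traced on in-memory data with the following Python sketch. It groups plain triples by subject (the role of TUPLIFY), applies the counting filter, and also counts how many statements are inspected. The function names and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]   # (subject, predicate, object)
Statement = Tuple[str, str]     # (predicate, object)


def tuplify(triples: List[Triple]) -> List[Tuple[str, List[Statement]]]:
    """Group plain triples by subject into triple groups (insertion order kept)."""
    groups: Dict[str, List[Statement]] = {}
    for s, p, o in triples:
        groups.setdefault(s, []).append((p, o))
    return list(groups.items())


def filter_groups_by_object(groups, obj: str):
    """Keep groups with at least one statement whose object equals obj.

    Also returns the number of statements that had to be inspected.
    """
    kept, inspected = [], 0
    for s, stmts in groups:
        inspected += len(stmts)                      # every statement in the bag is checked
        cnt = sum(1 for _, o in stmts if o == obj)   # analogous to COUNT(t)
        if cnt > 0:                                  # analogous to FILTER tmp BY cnt > 0
            kept.append((s, stmts))
    return kept, inspected


# Hypothetical sample data.
triples = [
    ("e1", "performer", "Metallica"),
    ("e1", "city", "Berlin"),
    ("e2", "performer", "Muse"),
]
kept, inspected = filter_groups_by_object(tuplify(triples), "Metallica")
```

In this sketch the number of inspected statements equals the number of plain triples, which matches the observation that a filter on a non-grouping component saves no comparisons and gains only from the more compact storage.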

Figure 4: Execution times for a BGP filter on a grouping component (subject) with different numbers of partitions (x-axis: partitions, 1 to 8; y-axis: time [seconds]; series: on-the-fly, plain, triple groups). Left: scenario (ii), middle: scenario (i), and right: scenario (iv).

One can see that the number of mappers also has an impact on the execution time. In our small setup, these differences do not seem to be big enough to be significant. However, if there is a difference in such a small setup, the difference will be even greater when the number of nodes in the cluster that can participate in the computation, or the size of the dataset to process, increases.

In this evaluation, we forgo experiments with a path join because our rule set transforms triple groups to plain triples and then relies on Pig's join. Hence, an evaluation of this operation will not reveal new insights.

7 Conclusion

In this paper, we extended the Pig language with support for SPARQL-like features, such as filtering and joins on BGPs, which are particularly useful for performing analytical tasks on linked datasets. The extensions were introduced at three levels of implementation: as strategies for query planning and data access, as rewriting rules to transform the Pig extensions to plain Pig, and as UDFs that implement new functionality. In doing so, we are able to use the Pig compiler to compile and optimize the scripts for execution. The current implementation of our system contains the UDFs and macros to load RDF data either from HDFS or from remote sources via SPARQL endpoints. The implementation is also capable of generating, loading, processing, and storing data in the triple group format.

The results of our experiments show that performance depends on which rules are used in the rewriting step. To show this effect, the scripts were generated manually using a set of rewriting rules. To support arbitrary queries, we plan to implement a rewriter in our future work that combines the planning and rewriting steps into one optimization step. The optimizer will use heuristics to automatically decide which rules to use for rewriting extended Pig and to decide whether intermediate results should be materialized. To facilitate these decisions, statistics and metadata about the datasets will be obtained from the LODHub platform, which provides the context for our work.

Acknowledgements

This work was partially funded by the German Research Foundation (DFG) under grant no. SA782/22

References

[ABK+07] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. G. Ives. DBpedia: A Nucleus for a Web of Open Data. In ISWC, 2007.
[ADE+] A. Abelló, J. Darmont, L. Etcheverry, M. Golfarelli, J. Mazón, F. Naumann, T. B. Pedersen, S. Rizzi, J. Trujillo, P. Vassiliadis, and G. Vossen. Fusion Cubes: Towards Self-Service Business Intelligence. IJDWM, 9(2):66–88, 20.
[BEG+] K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Eltabakh, C.-C. Kanne, F. Özcan, and E. J. Shekita. Jaql: A scripting language for large scale semistructured data analysis. In PVLDB'11, 20.
[BL06] T. Berners-Lee. Linked Data. W3C Design Issues. http://www.w3.org/DesignIssues/LinkedData.html, 2006.
[CGMR] D. Colazzo, F. Goasdoué, I. Manolescu, and A. Roatis. RDF analytics: lenses over semantic graphs. In WWW'14, pages 67–78, 20.
[DG08] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 5():07–, 2008.
[EV2] L. Etcheverry and A. A. Vaisman. QB4OLAP: A Vocabulary for OLAP Cubes on the Semantic Web. In COLD'12, 202.
[Fou09] The Apache Software Foundation. Hadoop. http://hadoop.apache.org/core/, 2009.
[GHS] L. Galárraga, K. Hose, and R. Schenkel. Partout: A Distributed Engine for Efficient RDF Processing. In WWW'14, pages 267–268, 20.
[HMM+] M. Husain, J. McGlothlin, M. M. Masud, L. Khan, and B. Thuraisingham. Heuristics-based query processing for large RDF graphs using cloud computing. IEEE Transactions on Knowledge and Data Engineering, 2(9):2–27, 20.
[HS] S. Hagedorn and K.-U. Sattler. LODHub - A Platform for Sharing and Analyzing large-scale Linked Open Data. In Proc. of ISWC 2014 demo track, volume 272, pages 8–8, October 20.
[KM] Z. Kaoudi and I. Manolescu. RDF in the clouds: a survey. The VLDB Journal, pages –25, 20.
[KRA] H. Kim, P. Ravindra, and K. Anyanwu. From SPARQL to MapReduce: The Journey Using a Nested TripleGroup Algebra. PVLDB, (2):26–29, 20.
[KUBM2] S. Kotoulas, J. Urbani, P. Boncz, and P. Mika. Robust Runtime Optimization and Skew-Resistant Execution of Analytical SPARQL Queries on Pig. In ISWC'12, pages 27–262, 202.
[NW08] T. Neumann and G. Weikum. RDF-3X: a RISC-style engine for RDF. PVLDB, ():67–659, 2008.
[ORS+08] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD'08, pages 099–0, 2008.
[PKT+] N. Papailiou, I. Konstantinou, D. Tsoumakos, P. Karras, and N. Koziris. H2RDF+: High-performance distributed joins over large-scale RDF graphs. In IEEE International Conference on Big Data, pages 255–26, 20.
[RKA] P. Ravindra, H. Kim, and K. Anyanwu. An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce. In ESWC'11, pages 6–6, 20.
[SPL] A. Schätzle, M. Przyjaciel-Zablocki, and G. Lausen. PigSPARQL: Mapping SPARQL to Pig Latin. In SWIM'11, pages :–:8, 20.
[WC0] W3C. RDF Primer (W3C Recommendation 200-02-0). http://www.w3.org/TR/rdf-primer/, 200.
[WC08] W3C. SPARQL Query Language for RDF (W3C Recommendation 2008-0-5). http://www.w3.org/TR/rdf-sparql-query/, 2008.
[WC0] W3C. Describing Linked Datasets with the VoID Vocabulary. http://www.w3.org/TR/void/, 200.
[WC] W3C. The RDF Data Cube Vocabulary. http://www.w3.org/TR/2013/CR-vocab-data-cube-20130625/, 20.
[Wi] Marc Wick. GeoNames. http://www.geonames.org/.
[CTW] X. Zhang, L. Chen, Y. Tong, and M. Wang. EAGRE: Towards scalable I/O efficient SPARQL query evaluation on the cloud. In ICDE'13, pages 565–576, 20.
[W+] K. Zeng, J. Yang, H. Wang, B. Shao, and Z. Wang. A distributed graph engine for web scale RDF data. In PVLDB'13, pages 265–276, 20.
