arXiv:1809.07473v1 [cs.DB] 20 Sep 2018 elwrdDL eod,called records, DBLP real-world r seta n retynee osc loihsi the in algorithms such to effectiveness. needed and efficiency urgently of ne pri and evaluation public-private as of essential vertices datasets selecting are real-world randomly datasets, Therefore, by real-world ones. concern on datasets privacy tested synthetic and stud not on sources are been data approaches limited has proposed the tree) to reachability due However, and shortest PageRank, literature. (e.g., problems centrality, the computing th graph in important including of interest deal network research the significant leads of attracted eac an network This perspective where public-private Recently, networks, friends. own connections. private public-private even her/his called or has model rela users user graph private public new such to her/ a makes visible hide which not to relationships, ships prefer all users or Instagram), partial and Twitter, Google+, stehde nomto,wihaeol nw yteauthor the by of known datasets only released are collabor which Our information, ongoing hidden regard the and as information public as articles nwihec etxi loascae ihapiaegraph. private a with associated gr also public is called a vertex model, contains each It graph which [4]. new in [3], a [2], list to graphs friend’s leads public-private their 1.4 protection hid of privacy users 52.6% social Facebook Such [1], structure, among City study York recent shared network New a million in information a reported way of As users. the views network controls the also network affects a social but only become of analysis issues not a algorithmic privacy Privacy the become However, social in has concern research. investigating major theories, of of graph point and process focal network a using prediction analysis, behavior structures user applications huge network and by marketing a social Driven media among social users. as influence connected such socially and of ideas, the number for information, platforms of important been spread have Instagram, and Weibo, inyo ulcpiaemdl n tt-fteatalgo state-of-the-art an and effectiveness models public-private the of verify ciency datasets attrib results real-world on experimental problems generated public- Preliminary open attributed graphs. discuss public-private also of bu edges We private model attributes. only not private advanced have vertices an where attr graphs existing private propose widely by we motivated addition, graphs, In al way. analysis fair public-private a efficient in of kinds various fying nti ae,w eeaefu ulcpiaentok fro networks public-private four generate we paper, this In Abstract niesca ewrs uha aeok wte,Google+ Twitter, Facebook, as such networks, social Online PDL:Mdln n eeaigAttributed Generating and Modeling PP-DBLP: ∗ { ihag xin co,xj,cszwzhang xujl, bchoi, jxjian, xinhuang, I ayoln oilntok eg,Facebook, (e.g., networks social online many —In i Huang Xin ulcPiaeNtok ihDBLP with Networks Public-Private .I I. PP NTRODUCTION ∗ - ixnJiang Jiaxin , DBLP ∗ eateto optrSine ogKn ats Universi Baptist Kong Hong Science, Computer of Department PP # fe h rset o veri- for prospects the offer eateto oraim ogKn ats University Baptist Kong Hong Journalism, of Department - DBLP etk published take We . ∗ yo Choi Byron , lsshas alysis gorithms great A rithms. nour on tworks effi- d ibuted paths, ations also t tion- aph, uted vate ∗ ied. but his inin Xu Jianliang , m to s. s, s. h e . , , } @comp.hkbu.edu.hk, ewrsfo ellf BPrcrs eoe by denoted public-priv records, of evaluations. DBLP datasets real-life experimental public-priva real-world from networks the of fair generates datasets for work real-world This benchmarks have as to networks Theref desirable edges. private is as it edges incident ver their real-worl selecting regarding and randomly graphs of by public to graphs, use datasets know, public-private [3] released simulate and we [2] publicly as Both no networks. far public-private exist as attributed there importantly, of date priv More and formulation attributes. public a the vertex considering give by and networks public-private attributes netw i public-private vertex the are model with th been properties we paper, vertex this both not In the [9]. applications, portant and has real-world structure attributes topological many graph vertex In Howeve of aspect yet. graphs. essential investigated issue public-private one important on of focuses another structure [2] topological study sk name, Recent of including on. information so has and user a e.g., attributes, on. so and of [8] size pairwi clustering [6], the correlation paths [7], as shortest similarities such all-pair address [5], processing, to tree graph reachability proposed of been problems have [2] sketchi essential Several approaches graph. each t sampling private of From own private and union her/his user. the each and exactly corresponding graph is and public network the social everyone, to the to viewpoint, only users visible visible is is graph graph public The ehv lopbil released publicly also have We osmaie hspprmkstefloigcontributions following the makes paper this summarize, To 1 ntesca ewrs etcsaeawy soitdwith associated always are vertices networks, social the In • • https://github.com/samjjx/pp-data fcec fsaeo-h-r loihs(eto II). (Section algorithms state-of-the-art of efficiency aaesofrtepopcst eiyvrosknsof kinds the conduct various We on way. verify fair experiments a released to in algorithms The prospects and graph public-private network. the public DBLP offer the a public- datasets to on real-world information according of private datasets, series a network release private and generate We fpprtte nDL eod ScinIII). (Section records DBLP on titles paper of ulcpiaentok n eeaetecorresponding the attributed generate PP the and on networks directions public-private attributes. promising private two public- have highlight attributed We vertices of where model networks, new private a propose formally We - DBLP ∗ hwiZhang Zhiwei , aaeswt trbtsfo h ihkeywords rich the from attributes with datasets # [email protected] PP ∗ - uy Song Yunya , ty DBLP PP - DBLP aaest aiaethe validate to datasets otecommunity. the to # PP - enode se DBLP tices orks ore, ills, ate ate m- ng he to te r, : d e 1 . v 11 v11 v11 v 2 v2 v2 v 10 v10 v10 v public v v v3 6 v 6 v 6 v v v 3 v 3 1 9 1 v9 1 v9 v private 8 v8 v8 v v v v v v v 4 5 7 4 v5 7 4 v5 7

(a) Public-private graph (b) Public graph G (c) In the view of v9: G ∪ Gv9

Figure 1. An example of undirected and simple public-private graph. In Figure 1(a), public edges are depicted in solid black lines and private edges are depicted in dashed edges. The red edges incident to vertex v9 are private to v9. The blue edges incident to vertex v3 and the edge (v1, v2) are private to v3. Figure 1(b) shows the public graph G consisting of all the solid black edges. Figure 1(c) shows the structure of public-private graph in the view of v9. Table I NETWORK STATISTICS. II. PUBLIC-PRIVATE NETWORKS Network |V | |E| |Vprivate| |Eprivate| δ(G) In this section, we first introduce a public-private graph PP-DBLP-2013 1,791,688 5,187,025 825,170 2,636,570 0.086 model for online social networks. Then, we generate real- PP-DBLP-2014 1,791,688 5,893,083 686,292 1,930,512 0.087 PP DBLP PP-DBLP-2015 1,791,688 6,605,428 515,549 1,218,167 0.087 world - datasets, and compare state-of-the-art of PP-DBLP-2016 1,791,688 7,378,090 263,937 445,505 0.083 public-private graph algorithms on PP-DBLP.

A. Public-Private Graph Model new approach to generate real-world datasets of public-private DBLP collaboration networks (PP-DBLP), according to the We present the model of a public-private graph G [2] as public and private information on DBLP records [10]. follows. Given a public graph G = (V, E), the vertex set V The intuition is that the information of one accepted pa- represents users, and the edge set E represents connections per available in public is usually later than the co-author between users. For each vertex u in the public graph G, u has collaboration really happened in private. In addition, such an associated private graph G = (V , E ), where V ⊆ V u u u u collaborations are always only known for authors themselves are the users from public graph and the edge set E satisfies u in person, and are not known to others. Thus, we take E ∩ E = ∅. The public graph G is visible to everyone, and u collaboration relationships in the published papers as public the private graph G is only visible to user u. Thus, in the u edges, and regard collaboration relationships in the ongoing view of user u, she/he can see and access the structure of works as private edges, which are only known by their authors. graph that is the union of public graph G and its own private Note that if two authors have had a pubic collaboration graph G as G ∪ G = (V, E ∪ E ). Let the private vertex set u u u relationship already, then their private ongoing collaboration of G as V = {u ∈ V : E 6= ∅} and the private edge set private u is not accounted as private. of as E v, w E u V . Note that G private = {( ) ∈ u : ∈ private} The public-private DBLP network is constructed as follows. E E , and each edge presented in different private private ∩ = ∅ We first obtain the DBLP raw data published in 2017 [10]. graphs only counts once in the private edge set E . private Next, we select one timestamp Y to distinguish the published Example 1: Consider a pubic-private graph with 11 vertices yet papers and on-going papers. For example, taking the cut- in Figure 1(a). The pubic-private graph consists of two kinds off timestamp Y as 2013/01/01, all collaborations happened of edges: public edges and private edges. The public edges before timestamp Y are regarded as public edges and the col- are depicted in solid black lines, e.g., v , v , indicating that ( 3 6) laborations happened on and after timestamp Y are regarded as v , v ( 3 6) is visible to everyone. The private edges are depicted private edges. We first construct the public graph. We sort all using dash lines in red and blue. Edges in red are private and papers in the increasing order of published dates in DBLP. For visible to v , e.g., v , v . Edges in blue are private and visible 9 ( 6 9) each paper p published before timestamp Y , we consider each v v , v G to 3, e.g., ( 1 2). Figure 1(b) shows the public graph author of this paper as a vertex, and add public edges between consisting of all public edges. Figure 1(c) shows the structure any pair of authors in this paper. Similarly, we construct all v of public-private graph in the view of 9, which is richer than private graphs as follows. For each paper p published on and G v the public graph in Figure 1(b). Because vertex 9 can after timestamp Y , if the two authors u, v do not have a public access all private relationships in private graph G . v9 edge, we add a private edge between u and v. This private edge B. Constructing Public-Private DBLP Networks can be accessed by all authors of this paper p. We generate four PP-DBLP datasets using four different timestamps Y In light of privacy concerns, there exist no publicly re- in {2013-01-01, 2014-01-01 2015-01-01, 2016-01-01}. The leased datasets of real-world public-private networks by now. network statistics of PP-DBLP are shown in Table I. Previous algorithms for public-private social networks, used public graphs to simulate public-private graphs, by randomly C. Evaluations on Real-world PP-DBLP datasets selecting some vertices and hiding their incident edges (in a We use PP-DBLP to evaluate two pubic-private graph star or a clique) as private edges from the public graph [2]. algorithms proposed by Chierichetti et al. [2]: shortest path ap- To verify competitive algorithms in a fair way, we propose a proximation and personalized PageRank. Sampling algorithms Table II 2 pp--2013 pp-dblp-2013 EFFICIENCYANDACCURACYOFPERSONALIZED PAGERANK USING 10000 pp-dblp-2014 1.8 pp-dblp-2014 HEURISTIC ON PP DBLP NETWORKS pp-dblp-2015 pp-dblp-2015 [2] - . pp-dblp-2016 pp-dblp-2016 1.6 Network Speed-up Ratio RMSE Cosine τ@50 1000 1.4 PP-DBLP-2013 6646 0.0036 0.9907 0.6007 PP DBLP Speed-up Ratio 1.2 - -2014 5746 0.0034 0.9908 0.6067 Approximation Ratio PP-DBLP-2015 5462 0.0041 0.9906 0.5715 100 1 4 8 16 32 64 4 8 16 32 64 PP-DBLP-2016 6319 0.0041 0.9890 0.5370 Multiplicative Factor Multiplicative Factor (a) Efficiency Improvement (b) Quality Approximation of the rankings, the score of τ@50 is still quite high falling PP DBLP Figure 2. Evaluation of shortest path approximation on - networks in [0.537, 0.6067]. Similar results on other datasets are also reported in [2]. are developed to precompute the public graph G offline, and III. ATTRIBUTED PUBLIC-PRIVATE NETWORKS then run the online update algorithm on private graph Gu with samples of G. We use the publicly available implementation In this section, we first define an attributed graph model of of sampling algorithms2 and set all the same parameters with public-private networks. We extend the approach of PP-DBLP [2] by default. We randomly select private graphs and report generation to produce real-world datasets of attributed public- the running time and accuracy of each task averaged over 50 private networks using title keywords in the papers. Finally, independent tests. we discuss promising research directions on attributed public- private graphs. Shortest Path Approximation. Given a vertex u in public- private graph G ∪ Gu, this task is to compute the shortest A. Attributed Public-Private Graph Model path from u to another arbitrary vertex in G ∪ Gu. We An attributed public-private graph of G is modeled as evaluate the performance of one sampling algorithm of shortest follows. Given an attributed public graph G = (V,E,A), path [2] with one baseline method of classical Dijkstra’s where the vertex set V represents users, the edge set E algorithm [11]. Figures 2(a) and 2(b) respectively show the represents connections between users, and the public attribute results of efficiency improvement and quality approximation set A(u) describes the public attributes of each user u ∈ V . varied by the multiplicative factor, which decides the number For each user u in the public graph, u has an attributed private of samples. The results verify that [2] achieves the great graph Gu = (Vu, Eu, Au), where Vu ⊆ V is a set of users improvement of efficiency (more than 300 times faster than from the public graph, private edge set Eu ∩ E = ∅, and the baseline method) and obtains a good balance of shortest Au(v) represents the private attributes of vertex v ∈ Vu that path approximations (no greater than 1.6 times of the optimal are visible to u. The public attributed graph G is visible to answer) on all PP-DBLP datasets. Figure 2(b) shows that the everyone, and the private attributed graph Gu is only visible to sampling algorithm [2] improves with smaller approximation user u. In terms of network structure, attributed public-private ratios as the multiplicative factor decreases, due to the more graphs have no difference with the public-private graphs in samples used in the algorithm. Section II. In terms of attributes, consider an attributed public- Personalized PageRank. Given a vertex u in public-private private graph G ∪ Gu, the vertex u can access both public and graph G ∪ Gu, the problem of personalized PageRank (PPR) private attributes of vertex v, i.e., Av ∪ Au(v). is to find node similarities to u for all vertices in G ∪ Gu. We Example 2: Figure 3(a) shows an example of attributed compare two methods. The first one is personalized PageRank public-private graph, which has the same graph structure as using heuristic [2]. The second one is a baseline method that Figure 1(a). Public attributes are in black, and private attributes directly applies the algorithm of Andersen et al. [12] on graph are in blue and red. Consider the vertex v3. The public attribute ′ G ∪ Gu. The results obtained by the baseline are used as of v3 is A(v3) = {‘SQL }. Attributes in blue (e.g., the the ground-truth PPR ranking of u. Table II reports the ef- attribute of ‘XML’ associated with vertices v1, v2 and v3) are

ficiency performance and accuracy of personalized PageRank private and visible to v3. Thus, Av3 (v1)= Av3 (v2)= Av3 (v3) ′ ′ using heuristic on PP-DBLP networks. In terms of efficiency = { XML }. Attributes in red (e.g., vertex v6’s attribute of comparison, the heuristic algorithm [2] is faster by three orders ‘Skyline’) are private and visible to v9. Figure 3(b) shows the of magnitude than the baseline [12], which are shown in the public attributed graph G consisting of all public edges and column of speed-up ratio in Table II. Table II also shows the public attributes that are visible to everyone. Figure 3(c) shows accuracy of PPR ranking computed by the heuristic algorithm the attributed graph G∪Gv3 in the view of v3. The attributes of ′ ′ w.r.t. the ground-truth PPR ranking, in terms of three metrics: v1 are {‘Skyline , ‘XML } as a result of the union of public the Root Mean Square Error (RMSE), the Cosine Similarity, attributes and private attributes, i.e., A(v1)∪Av3 (v1), showing and the Kendall-τ index. In terms of RMSE, the heuristic that v1 extends her/his research interests. algorithm produces ranking achieving the RMSE close to 0 B. Constructing Attributed Public-Private DBLP Networks on all datasets; In terms of cosine similarity, it obtains nearly 1; In terms of the Kendall-τ correlation of the first 50 positions To construct the attributed public-private DBLP networks, we add attributes into vertices on PP-DBLP in Section II-B as 2https://github.com/aepasto/public-private follows. For each author, we collect keywords in the title of all XML v11 XML XML SQL v11 v11 v 2 v2 v2 SQL,XML SQL, XML SQL,XML XML XML v XML XML XML 10 SQL v10 v10 Skyline XML XML v public v 6 XML v v v v Skyline 3 Skyline 3 6 XML Skyline 3 6 XML XML v v SQL, Skyline v v 1 SQL 9 1 SQL v9 XML 1 SQL, XML v9 XML v private 8 v8 v8 XML XML XML SQL v v v7 v v v v 4 5 XML 4 v5 7 XML 4 v5 7 XML SQL XML SQL SQL XML SQL XML

(a) Attributed public-private graph (b) Attributed public graph G (c) In the view of v3: G ∪ Gv3

Figure 3. An example of attributed public-private graph. In Figure 3(a), public attributes are in black. Private attributes are in blue and red, which respectively are visible to v3 and v9. Figure 3(b) shows the attributed public graph G consisting of all public edges and public attributes. Figure 3(c) shows the attributed graph G ∪ Gv3 in the view of v3. published articles and extract the most frequent keywords as Besides PP-DBLP, one future plan is to construct a real-world the public attributes. For the private attributes, let’s consider pubic-private Facebook social network by conducting a survey one author u and its attributed private graphs Gu. For each of Facebook users, who will be asked to manually identify all author v in Gu, the private attributes of v as Au(v) are the of private relationships that they hid. The survey can make use most frequent keywords from the title of all ongoing papers of Facebook built-in app3 to conduct investigations. involving authors v and u. To select representative keywords, ACKNOWLEDGMENT we set each number of public attributes and private attributes is no greater than a maximum threshold of 5, i.e., |A(v)|≤ 5 This work is supported by the Hong Kong General Re- search Fund (GRF) Project Nos. HKBU 12200917, 12232716, and |Au(v)| ≤ 5. The difference of public attributes Av and 12258116, 12632816, and National Natural Science Founda- private attributes Au(v) shows the evolving research interests tion of China (NSFC) Project Nos. 61702435, 61602395. of author v. Note that, Av ∩ Au(v) 6= ∅ may hold. We use θu(v) to quantify the overlapping ratio of public attributes REFERENCES and private attributes of vertex v in graph Gu, denoted by [1] R. Dey, Z. Jelveh, and K. Ross, “Facebook users have become much θ v |Av∩Au(v)| Pv∈Vu u( ) more private: A large-scale study,” in Pervasive Computing and Commu- θu(v) = . Let δ(u) = represent the |Av∪Au(v)| |Vu| nications Workshops (PERCOM Workshops), 2012 IEEE International average ratio of overlapping attributes over all authors in Gu. Conference on, 2012, pp. 346–352. For an attributed public-private graph G, we propose δ(G) to [2] F. Chierichetti, A. Epasto, R. Kumar, S. Lattanzi, and V. Mirrokni, measure the ratio of overlapping public-private attributes for “Efficient algorithms for public-private social networks,” in KDD, 2015, pp. 139–148. P ∈ δ(u) all private graphs, denoted by δ(G)= u Vprivate . Table I [3] A. Archer, S. Lattanzi, P. Likarish, and S. Vassilvitskii, “Indexing public- |Vprivate | PP DBLP private graphs,” in WWW, 2017, pp. 1461–1470. reports the statistic δ(G) for all - datasets. [4] B. Mirzasoleiman, M. Zadimoghaddam, and A. Karbasi, “Fast dis- tributed submodular cover: Public-private data summarization,” in Ad- C. Open Problems vances in Neural Information Processing Systems, 2016, pp. 3594–3602. We highlight two open problems of keyword search and [5] E. Cohen and H. Kaplan, “Summarizing data using bottom-k sketches,” in PODS, 2007, pp. 225–234. community search in attributed public-private networks as [6] A. Das Sarma, S. Gollapudi, M. Najork, and R. Panigrahy, “A sketch- follows. based distance oracle for web-scale graphs,” in WSDM, 2010, pp. 401– 410. • Keyword search in attributed public-private networks. [7] T. H. Haveliwala, “Topic-sensitive pagerank,” in WWW, 2002, pp. 517– Keyword search finds users in the vicinity of a given user 526. with similar keywords [13], [14]. Keyword search queries [8] N. Bansal, A. Blum, and S. Chawla, “Correlation clustering,” Machine learning, vol. 56, no. 1-3, pp. 89–113, 2004. in an attributed public-private network are generated [9] H. Cheng, Y. Zhou, X. Huang, and J. X. Yu, “Clustering large attributed from a vertex that looks for nearest vertices with certain information networks: an efficient incremental computing approach,” keywords, w.r.t. private structure and private attributes. DMKD, vol. 25, no. 3, pp. 450–477, 2012. [10] http://dblp.uni-trier.de/xml. • Community search in attributed public-private net- [11] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction works. Attributed community search aims at finding to algorithms, 2009. the densely-connected subgraphs containing given query [12] R. Andersen, F. Chung, and K. Lang, “Local partitioning for directed graphs using pagerank,” in International Workshop on Algorithms and nodes with similar attributes [15], [16]. Given a commu- Models for the Web-Graph, 2007, pp. 166–178. nity search query asked by a user u, the task needs to be [13] Q. Zhu, H. Cheng, and X. Huang, “I/o-efficient algorithms for top-k considered in the attributed public-private graph G ∪ G , nearest keyword search in massive graphs,” The VLDB Journal, vol. 26, u no. 4, pp. 563–583, 2017. w.r.t. private structures and private attributes. [14] J. Jiang, B. Choi, J. Xu, and S. S. Bhowmick, “A generic ontology framework for indexing keyword search on massive graphs,” in ICDE, IV. CONCLUSIONS AND DISCUSSIONS 2019. [15] X. Huang and L. V. Lakshmanan, “Attribute-driven community search,” In this paper, we develop a new model of attributed public- Proceedings of the VLDB Endowment, vol. 10, no. 9, pp. 949–960, 2017. private networks by considering the information of vertices in [16] X. Huang, L. V. Lakshmanan, and J. Xu, “Community search over many real-world networks. In addition, we provide real-world big graphs: Models, algorithms, and opportunities,” in ICDE, 2017, pp. 1451–1454. PP-DBLP datasets for attributed public-private networks, which are useful to further research of public-private graphs. 3https://apps.facebook.com/my-surveys