Hidden Community Detection in Online Forums
Daniel Salz, Nicholas Benavides, Jonathan Li
Stanford University
[email protected], [email protected], [email protected]

Abstract

This project investigated both dominant and hidden community structure on a network of interactions between subreddits on Reddit.com. We identified moderate community structure using the Louvain algorithm, and we also detected some hidden community structure with the HICODE algorithm. We also explored a method for detecting future communities that involved predicting new edges and their weights, adding the predicted edges to the existing network, and re-running the Louvain and HICODE algorithms to measure changes in the community structure. While edge weight prediction proved a difficult task, we found that a linear regression model using 256-dimension node embeddings yielded the best results. When we added predicted edges to the network, we found slightly weaker community structure, no evidence of hidden community structure as determined by HICODE, and fewer communities overall, suggesting that the lines between communities may become more blurred as the network matures and experiences more interactions between forums.

1. Introduction

Community detection is an important task in the field of network analysis, as it can reveal information about the structure of a network and how the flow of information can occur. In social networks, communities take on a very relatable meaning, as communities of users are densely connected and may share common interests, backgrounds, or experiences. Online forums, such as Reddit, are generally divided into communities based on the topic of the forum (e.g., r/nba for NBA fans or r/politics for those interested in political news and discussions). The users in a specific forum can be thought of as members of a community, but we can also think about groups of these forums as communities in their own right.

Within the field of community detection, there has been some interesting research conducted on the topic of hidden community detection, which attempts to identify weaker community structure that often goes undetected by popular community detection algorithms. Although these hidden communities are more difficult to identify, they can reveal unique insights into the functioning of a network. HICODE, one such algorithm for hidden community detection, removes edges from the graph to try to uncover more subtle communities that are dominated by the density of edges in the original graph.

This paper aims to analyze both the dominant and hidden community structure of subreddits on Reddit, leveraging a combination of established and new techniques to understand how the algorithms perform under different conditions. We also introduce a new way to think about hidden community detection that adds edges to the graph rather than removing them. By predicting new edges and running existing community detection algorithms on the resulting graph, we can detect hidden communities that are not present in the current graph but may appear in the network in the future. With this information, moderators of these online forums will have greater insight into the interactions between forums as well as how they can expect those interactions to change in the future.

2. Related Work

In their paper Hidden Community Detection in Social Networks, He et al. [1] highlight the utility and importance of detecting hidden community structures in graphs: sparse structures that can be tangled with the dominant community structure and are thus harder to detect. He et al. note that the existing literature and community detection algorithms are often able to find dominant community structures but have difficulty revealing hidden ones. They introduce HICODE, an algorithm for detecting hidden communities that builds upon existing community detection algorithms. HICODE works by revealing hidden communities in layers: it first isolates dominant communities, then weakens the connections within those communities and reanalyzes the resulting graph for new community structure. This process is repeated to reveal multiple layers of hidden communities, and it has been found to produce state-of-the-art results in revealing both strong and weak community structure.

Kumar et al.'s Community Interaction and Conflict on the Web [2] examines the nature of conflict between communities on Reddit.com. Without ground-truth labels indicating when an intercommunity conflict occurs, the authors rely on hyperlinks between communities and a null model of user activity that detects cases where a hyperlink mobilizes a substantial number of users in one community to comment in another community, as well as crowd-sourced sentiment of the posts, to identify negative mobilizations. The analysis relies on two variants of the PageRank algorithm, which are used to measure the flow of information from the perspectives of attackers and defenders in the conflict, respectively. Finally, the paper presents an LSTM model that leverages data from both the post text and the user-community interaction network to predict the likelihood of a post leading to a conflict between communities, achieving a significantly higher AUC than the baseline model.

In A Unified Framework for Community Detection and Network Representation Learning, Tu et al. [3] propose a unified framework for community detection and network representation learning (NRL) called community-enhanced network representation learning (CNRL). The method considers each vertex as grouped into multiple overlapping communities. Unlike typical NRL, where vertex embeddings are learned from local context vertices, CNRL uses both local and global community information. The authors implemented two strategies with CNRL, a statistics-based one and an embedding-based one, both of which performed similarly. The initial step of the method consists of performing a community-enhanced DeepWalk over the network to generate "word" sequences as well as community assignments and updated vertex representations. Through this representation-learning process, they create representations for vertices, communities, and the community distributions of the vertices. In their experiments, both the embedding-based and statistics-based community-enhanced DeepWalk/node2vec implementations consistently and significantly outperformed existing methods such as DeepWalk, LINE, node2vec, SDNE, MNMF, and ComE.

3. Dataset

3.1. Community Detection Dataset

We utilized the SNAP Reddit Hyperlink Network Dataset, which is a directed, signed, temporal, and attributed graph. Nodes in the graph are subreddits (online forums within Reddit), and edges between subreddits signify a post from one subreddit hyperlinking to the other subreddit, with the edge weight corresponding to either -1 (negative sentiment) or 1 (neutral or positive sentiment). Each edge (post) is annotated with the timestamp, the sentiment of the source community post towards the target community post, and the text property vector of the source post, containing 86 features related to post content. The graph contains 55,863 nodes (subreddits) and 858,490 edges (posts). This dataset was generated as part of the Kumar et al. paper [2].

The dataset distinguishes between posts that contain the hyperlink in the title of the post and posts that contain the hyperlink in the body of the post. From this, we generated a single undirected, weighted graph for the community detection task. To do so, we first took the union of nodes and edges where the hyperlink was in the title and where the hyperlink was in the body to generate a larger graph. From there, we removed the directionality of the edges and collapsed multiple edges between nodes into a single edge with weight equal to the number of hyperlinked posts between the subreddits. We chose to convert the graph to be undirected because most community detection algorithms require an undirected graph, and we chose to include edge weights because they enabled us to capture richer information about the interactions between subreddits. Our resulting graph contains 67,180 nodes and 309,667 edges.

Conducting a basic exploration of the dataset, we found that the graph has 712 connected components, with a giant component containing 65,648 nodes and the other 711 connected components containing 9 or fewer nodes. We also examined the degree distribution of nodes in the graph, which revealed that most nodes have degree ≤ 10, while a small fraction of nodes have degree ≥ 100.

Figure 1. Degree Distribution Plot

We also ran PageRank on our graph to understand which subreddits were central in the network of hyperlinks. From these results, it appears that the most influential subreddits tend to be larger subreddits about more general topics, such as 'askreddit', 'bestof', 'funny', and 'pics'.

3.2. Edge Weight Prediction Dataset

From our graph, we generated a dataset that could be used for edge weight prediction. Given a set of node embeddings, we created a positive example for each edge in the graph by concatenating the node embeddings associated with the edge, resulting in 309,667 positive examples. The assigned label is the weight of the edge. Without ground truth for negative examples, we leveraged the heuristic that pairs of nodes 5 or more hops apart are unlikely to be related or to gain hyperlinks between them in the future. To generate negative examples, we randomly selected 300,000 node pairs whose nodes had distance ≥ 5 between them, concatenated the node embeddings for each pair of nodes, and assigned each example a label of 0, since no edge exists between the nodes. We also generated an inference dataset of 300,000 examples by randomly selecting pairs of nodes that are not connected in the graph and concatenating their embeddings to form examples, which we use to predict future edges.
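The layered procedure HICODE follows, detecting dominant communities with a base algorithm, weakening the edges inside them, then re-running detection on the residual graph, can be sketched roughly as below. This is an illustrative simplification rather than the authors' implementation: it uses networkx's built-in Louvain method as the base detector and removes intra-community edges outright instead of re-weighting them.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

def layered_communities(G, n_layers=2, seed=0):
    """Crude HICODE-style layering: detect communities with Louvain,
    strip intra-community edges, and re-detect on the residual graph."""
    H = G.copy()
    layers = []
    for _ in range(n_layers):
        if H.number_of_edges() == 0:
            break
        comms = louvain_communities(H, weight="weight", seed=seed)
        layers.append(comms)
        # Build a membership lookup, then drop every edge whose two
        # endpoints fall in the same dominant community.
        member = {n: i for i, c in enumerate(comms) for n in c}
        H.remove_edges_from([(u, v) for u, v in H.edges()
                             if member[u] == member[v]])
    return layers
```

Each entry of the returned list is one "layer" of community structure; later layers are computed on a graph from which the dominant structure has been stripped.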
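The collapsed, weighted, undirected graph used for community detection can be built as follows. This is a minimal sketch, assuming the SNAP TSV layout with SOURCE_SUBREDDIT and TARGET_SUBREDDIT columns; the file names in the comment are illustrative.

```python
import csv
import networkx as nx

def build_undirected_weighted(paths):
    """Collapse directed, repeated hyperlink edges from one or more
    TSV edge lists into a single undirected graph whose edge weight
    counts the hyperlinked posts between two subreddits."""
    G = nx.Graph()
    for path in paths:
        with open(path, newline="") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                u = row["SOURCE_SUBREDDIT"]
                v = row["TARGET_SUBREDDIT"]
                if G.has_edge(u, v):
                    G[u][v]["weight"] += 1
                else:
                    G.add_edge(u, v, weight=1)
    return G

# Hypothetical file names; the SNAP dataset ships title and body edge lists:
# G = build_undirected_weighted(["soc-redditHyperlinks-title.tsv",
#                                "soc-redditHyperlinks-body.tsv"])
# nx.pagerank(G, weight="weight") can then rank subreddits by centrality.
```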
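The positive-example construction can be sketched as below; `embeddings` is assumed to map each node to a fixed-length vector (e.g., from node2vec), and the function name is illustrative.

```python
import numpy as np

def positive_examples(G, embeddings):
    """One training example per edge: the two endpoint embeddings
    concatenated, labeled with the collapsed edge weight."""
    X, y = [], []
    for u, v, data in G.edges(data=True):
        X.append(np.concatenate([embeddings[u], embeddings[v]]))
        y.append(data.get("weight", 1))
    return np.array(X), np.array(y)
```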