Self-disclosure topic model for conversations

JinYeong Bak, Department of Computer Science, KAIST, Daejeon, South Korea ([email protected])
Chin-Yew Lin, Microsoft Research Asia, Beijing 100080, P.R. China ([email protected])
Alice Oh, Department of Computer Science, KAIST, Daejeon, South Korea ([email protected])

Abstract

Self-disclosure, the act of revealing oneself to others, is an important social behavior that contributes positively to intimacy and social support from others. It is a natural behavior, and social scientists have carried out numerous quantitative analyses of it through manual tagging and survey questionnaires. Recently, the flood of data from online social networks (OSN) offers a practical way to observe and analyze self-disclosure behavior at an unprecedented scale. The challenge with such analysis is that OSN data come with no annotations, and it would be impossible to manually annotate the data for a quantitative analysis of self-disclosure. As a solution, we propose a semi-supervised machine learning approach, using a variant of latent Dirichlet allocation for automatically classifying self-disclosure in a massive dataset of Twitter conversations. For measuring the accuracy of our model, we manually annotate a small subset of our dataset, and we show that our model shows significantly higher accuracy and F-measure than various other methods. With the results of our model, we uncover a positive and significant relationship between self-disclosure and online conversation frequency over time.

1 Introduction

Self-disclosure is an important and pervasive social behavior. People disclose personal information about themselves to improve and maintain relationships (Jourard, 1971; Joinson and Paine, 2007). For example, when two people meet for the first time, they disclose their names and interests. One positive outcome of self-disclosure is social support from others (Wills, 1985; Derlega et al., 1993), shown also in online social networks (OSN) such as Twitter (Kim et al., 2012). Receiving social support would then lead the user to be more active on OSN (Steinfield et al., 2008; Trepte and Reinecke, 2013). In this paper, we seek to understand this important social behavior using large-scale Twitter conversation data, automatically classifying the level of self-disclosure using machine learning and correlating the patterns with subsequent OSN usage.

Twitter conversation data, explained in more detail in Section 4.1, enable a significantly larger-scale study of naturally-occurring self-disclosure behavior than traditional social science studies. One challenge of such a large-scale study, though, remains the lack of labeled ground-truth data of self-disclosure level: naturally-occurring Twitter conversations do not come with the level of self-disclosure annotated in each conversation. To overcome that challenge, we propose a semi-supervised machine learning approach using probabilistic topic modeling. Our self-disclosure topic model (SDTM) assumes that self-disclosure behavior can be modeled using a combination of simple linguistic features (e.g., pronouns) with automatically discovered semantic themes (i.e., topics). For instance, the utterance "I am finally through with this disastrous relationship" uses a first-person pronoun and contains a topic about personal relationships.

In comparison with various other models, SDTM shows the highest accuracy, and the resulting self-disclosure patterns of the users are correlated significantly with their future OSN usage. Our contributions to the research community include the following:

• We present a topic model that explicitly includes the level of self-disclosure in a conversation using linguistic features and the latent semantic topics (Sec. 3).

Proceedings of the Joint Workshop on Social Dynamics and Personal Attributes in Social Media, pages 42–49, Baltimore, Maryland, USA, 27 June 2014. © 2014 Association for Computational Linguistics

• We collect a large dataset of Twitter conversations over three years and annotate a small subset with self-disclosure level (Sec. 4).

• We compare the classification accuracy of SDTM with other models and show that it performs the best (Sec. 5).

• We correlate the self-disclosure patterns of users and their subsequent OSN usage to show that there is a positive and significant relationship (Sec. 6).

Figure 1: Graphical model of SDTM

2 Background

In this section, we review literature on the relevant aspects of self-disclosure.

Self-disclosure (SD) level: To quantitatively analyze self-disclosure, researchers categorize self-disclosure language into three levels: G (general) for no disclosure, M for medium disclosure, and H for high disclosure (Vondracek and Vondracek, 1971; Barak and Gluck-Ofri, 2007). Utterances that contain general (non-sensitive) information about the self or someone close (e.g., a family member) are categorized as M. Examples are personal events, past history, or future plans. Utterances about age, occupation and hobbies are also included. Utterances that contain sensitive information about the self or someone close are categorized as H. Sensitive information includes personal characteristics, problematic behaviors, physical appearance and wishful ideas; generally, these are thoughts and information that one would keep as secrets to oneself. All other utterances, those that do not contain information about the self or someone close, are categorized as G. Examples include gossip about celebrities or factual discourse about current events.

Classifying self-disclosure level: Prior work on quantitatively analyzing self-disclosure has relied on user surveys (Trepte and Reinecke, 2013; Ledbetter et al., 2011) or human annotation (Barak and Gluck-Ofri, 2007). These methods consume much time and effort, so they are not suitable for large-scale studies. In the prior work closest to ours, Bak et al. (2012) showed that a topic model can be used to identify self-disclosure, but that work applies a two-step process in which a basic topic model is first applied to find the topics, and then the topics are post-processed for a binary classification of self-disclosure. We improve upon this work by applying a single unified model of topics and self-disclosure for high accuracy in classifying the three levels of self-disclosure.

Self-disclosure and online social networks: According to social psychology, when someone discloses about himself, he will receive social support from those around him (Wills, 1985; Derlega et al., 1993), and this pattern of self-disclosure and social support was verified for Twitter conversation data (Kim et al., 2012). Social support is a major motivation for active usage of social network services (SNS), and there are findings that show self-disclosure on SNS has a positive longitudinal effect on future SNS use (Trepte and Reinecke, 2013; Ledbetter et al., 2011). While these previous studies were small and qualitative, we conduct a large-scale, machine-learning-driven study of self-disclosure behavior and SNS use.

3 Self-Disclosure Topic Model

This section describes our model, the self-disclosure topic model (SDTM), for classifying self-disclosure level and discovering topics for each self-disclosure level.

3.1 Model

We make two important assumptions based on our observations of the data. First, first-person pronouns (I, my, me) are good indicators of the medium level of self-disclosure. For example, phrases such as 'I live' or 'My age is' occur in utterances that reveal personal information. Second, there are topics that occur much more frequently at a particular SD level. For instance, topics such as physical appearance and mental health occur frequently at level H, whereas topics such as birthday and hobbies occur frequently at level M.

Figure 1 illustrates the graphical model of SDTM and how these assumptions are embodied

in it. The first assumption about the first-person pronouns is implemented by the observed variable x_ct and the parameters λ from a maximum entropy classifier for G vs. M/H level. The second assumption is implemented by the three separate word-topic probability vectors for the three levels of SD: φ^l, which has a Bayesian informative prior β^l, where l ∈ {G, M, H}. Table 1 lists the notations used in the model and the generative process; Figure 2 describes the generative process.

Notation             Description
G; M; H              general; medium; high SD level
C; T; N              Number of conversations; tweets; words
K^G; K^M; K^H        Number of topics for G; M; H
c; ct                Conversation; tweet t in conversation c
y_ct                 SD level of tweet ct, G or M/H
r_ct                 SD level of tweet ct, M or H
z_ct                 Topic of tweet ct
w_ctn                n-th word in tweet ct
λ                    Learned maximum entropy parameters
x_ct                 First-person pronoun features of tweet ct
ω_ct                 Distribution over the SD level of tweet ct
π_c                  SD level proportion of conversation c
θ_c^G; θ_c^M; θ_c^H  Topic proportions of G; M; H in conversation c
φ^G; φ^M; φ^H        Word distributions of G; M; H
α; γ                 Dirichlet priors for θ; π
β^G; β^M; β^H        Dirichlet priors for φ^G; φ^M; φ^H
n_cl                 Number of tweets assigned SD level l in conversation c
n_ck^l               Number of tweets assigned SD level l and topic k in conversation c
n_kv^l               Number of instances of word v assigned SD level l and topic k
m_ctkv               Number of instances of word v assigned topic k in tweet ct

Table 1: Summary of notations used in SDTM.

1. For each SD level l ∈ {G, M, H}:
   For each topic k ∈ {1, ..., K^l}: draw φ_k^l ~ Dir(β^l)
2. For each conversation c ∈ {1, ..., C}:
   (a) Draw θ_c^G ~ Dir(α)
   (b) Draw θ_c^M ~ Dir(α)
   (c) Draw θ_c^H ~ Dir(α)
   (d) Draw π_c ~ Dir(γ)
   (e) For each message t ∈ {1, ..., T}:
       i.   Observe the first-person pronoun features x_ct
       ii.  Draw ω_ct ~ MaxEnt(x_ct, λ)
       iii. Draw y_ct ~ Bernoulli(ω_ct)
       iv.  If y_ct = 0, which is the G level:
            A. Draw z_ct ~ Mult(θ_c^G)
            B. For each word n ∈ {1, ..., N}: draw w_ctn ~ Mult(φ_{z_ct}^G)
            Else, which can be the M or H level:
            A. Draw r_ct ~ Mult(π_c)
            B. Draw z_ct ~ Mult(θ_c^{r_ct})
            C. For each word n ∈ {1, ..., N}: draw w_ctn ~ Mult(φ_{z_ct}^{r_ct})

Figure 2: Generative process of SDTM.

3.2 Classifying G vs M/H levels

Classifying the SD level of each tweet is done in two parts, and the first part classifies G vs. M/H levels with first-person pronouns (I, my, me). In the graphical model, y is the latent variable that represents this classification, and ω is the distribution over y. x is the observation of the first-person pronouns in the tweets, and λ are the parameters learned from the maximum entropy classifier. With the annotated Twitter conversation dataset (described in Section 4.2), we experimented with several classifiers (decision tree, naive Bayes) and chose the maximum entropy classifier because it performed the best, similar to other joint topic models (Zhao et al., 2010; Mukherjee et al., 2013).

3.3 Classifying M vs H levels

The second part of the classification, between the M and the H levels, is driven by informative priors with seed words and seed trigrams.

Utterances at the M level include two types: 1) information related to past events and future plans, and 2) general information about the self (Barak and Gluck-Ofri, 2007). For the former, we add as seed trigrams 'I have been' and 'I will'. For the latter, we use seven types of information generally accepted to be personally identifiable information (McCallister, 2010), as listed in the left column of Table 2. To find the appropriate trigrams for those, we take Twitter conversation data (described in Section 4.1) and look for trigrams that begin with 'I' and 'my' and occur more than 200 times. We then check each one to see whether it is related to any of the seven types listed in the table. As a result, we find 57 seed trigrams for the M level. Table 2 shows several examples.

Type        Trigram
Name        My name is, My last name
Birthday    My birthday is, My birthday party
Location    I live in, I lived in, I live on
Contact     My email address, My phone number
Occupation  My job is, My new job
Education   My high school, My college is
Family      My dad is, My mom is, My family is

Table 2: Example seed trigrams for identifying the M level of SD. There are 57 of these used in SDTM.

Utterances at the H level express secretive wishes or sensitive information that exposes the self or someone close (Barak and Gluck-Ofri, 2007). These are

generally kept as secrets. With this intuition, we crawled 26,523 secret posts from the Six Billion Secrets site,¹ where users post secrets anonymously. To extract seed words that might express secretive personal information, we compute mutual information (Manning et al., 2008) between the secret posts and 24,610 randomly selected tweets. We select 1,000 words with high mutual information and filter out stop words. Table 3 shows some of these words. To extract seed trigrams of secretive wishes, we again look for trigrams that start with 'I' or 'my' and occur more than 200 times, and select trigrams of wishful thinking, such as 'I want to' and 'I wish I'. In total, there are 88 seed words and 8 seed trigrams for H.

Category                   Keywords
physical appearance        acne, hair, overweight, stomach, chest, hand, scar, thighs, chubby, head, skinny
mental/physical condition  addicted, bulimia, doctor, illness, alcoholic, disease, drugs, pills, anorexic

Table 3: Example words for identifying the H level of SD. Categories are hand-labeled.

3.4 Inference

For posterior inference of SDTM, we use collapsed Gibbs sampling, which integrates out the latent random variables ω, π, θ, and φ. Then we only need to compute y, r and z for each tweet. We compute the full conditional distribution p(y_ct = j', r_ct = l', z_ct = k' | y_¬ct, r_¬ct, z_¬ct, w, x) for tweet ct as follows:

p(y_ct = 0, z_ct = k' | y_¬ct, r_¬ct, z_¬ct, w, x)
  ∝ [exp(λ_0 · x_ct) / Σ_{j=0}^{1} exp(λ_j · x_ct)] · g(c, t, G, k')

p(y_ct = 1, r_ct = l', z_ct = k' | y_¬ct, r_¬ct, z_¬ct, w, x)
  ∝ [exp(λ_1 · x_ct) / Σ_{j=0}^{1} exp(λ_j · x_ct)] · (γ_{l'} + n_{cl'}^{(¬ct)}) · g(c, t, l', k')

where y_¬ct, r_¬ct, z_¬ct are y, r, z without tweet ct, m_{ctk'(·)} is the marginalized sum over words v of m_{ctk'v}, and the function g(c, t, l', k') is

g(c, t, l', k') = [(α_{k'} + n_{ck'}^{l'(¬ct)}) / Σ_{k=1}^{K^{l'}} (α_k + n_{ck}^{l'})]
  · [Γ(Σ_{v=1}^{V} β_v^{l'} + n_{k'v}^{l'(¬ct)}) / Γ(Σ_{v=1}^{V} β_v^{l'} + n_{k'v}^{l'(¬ct)} + m_{ctk'(·)})]
  · Π_{v=1}^{V} [Γ(β_v^{l'} + n_{k'v}^{l'(¬ct)} + m_{ctk'v}) / Γ(β_v^{l'} + n_{k'v}^{l'(¬ct)})]

4 Data Collection and Annotation

To answer our research questions, we need a large longitudinal dataset of conversations such that we can analyze the relationship between self-disclosure behavior and conversation frequency over time. We chose to crawl Twitter because it offers a practical and large source of conversations (Ritter et al., 2010). Others have also analyzed Twitter conversations for natural language and social media research (Boyd et al., 2010; Danescu-Niculescu-Mizil et al., 2011), but we collect conversations from the same set of dyads over several months for a unique longitudinal dataset.

4.1 Collecting Twitter conversations

We define a Twitter conversation as a chain of tweets in which two users are consecutively replying to each other's tweets using the Twitter reply button. We identify dyads of English-tweeting users with at least twenty conversations and collect their tweets. We use an open-source tool for detecting English tweets,² and to protect users' privacy, we replace Twitter user ids, usernames and URLs in tweets with random strings. This dataset consists of 101,686 users, 61,451 dyads, 1,956,993 conversations and 17,178,638 tweets, posted between August 2007 and July 2013.

4.2 Annotating self-disclosure level

To measure the accuracy of our model, we randomly sample 101 conversations, each with ten or fewer tweets, and ask three judges, fluent in English, to annotate each tweet with its level of self-disclosure. The judges first read and discussed the definitions and examples of self-disclosure levels given in (Barak and Gluck-Ofri, 2007); then they worked separately on a Web-based platform. Inter-rater agreement using Fleiss' kappa (Fleiss, 1971) is 0.67.

5 Classification of Self-Disclosure Level

This section describes experiments and results of SDTM as well as several other methods for classification of self-disclosure level. We first start with the annotated dataset of Section 4.2, in which each tweet is annotated with its SD level. We then aggregate all of the tweets of a conversation and compute the proportions of tweets at each SD level.

¹ http://www.sixbillionsecrets.com
² https://github.com/shuyo/ldig
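This conversation-level aggregation, together with the 0.2 cutoff used in this section, can be sketched as a small Python function. The function name is our own, and tie-breaking toward M when the M and H proportions are equal is our assumption; the paper does not specify a tie rule.

```python
from collections import Counter

def conversation_sd_level(tweet_levels, cutoff=0.2):
    """Aggregate per-tweet SD labels ('G', 'M', 'H') into one
    conversation-level label: if the proportion of M or H tweets
    reaches the cutoff, assign whichever of the two is more
    frequent; otherwise assign G."""
    counts = Counter(tweet_levels)
    total = len(tweet_levels)
    prop_m = counts["M"] / total
    prop_h = counts["H"] / total
    if prop_m >= cutoff or prop_h >= cutoff:
        # Tie-breaking toward M is an assumption, not from the paper.
        return "M" if prop_m >= prop_h else "H"
    return "G"
```

For example, a ten-tweet conversation with two M tweets (proportion 0.2) is labeled M, while a single H tweet among ten (proportion 0.1) leaves the conversation at G.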

When the proportion of tweets at the M or H level is equal to or greater than 0.2, we take the level with the larger proportion and assign that level to the conversation. When the proportions of tweets at the M and H levels are both less than 0.2, we assign G as the SD level.

Method  Acc   G F1   M F1   H F1   Avg F1
LDA     49.2  0.000  0.650  0.050  0.233
MedLDA  43.3  0.406  0.516  0.093  0.338
LIWC    49.2  0.341  0.607  0.180  0.376
SEED    52.0  0.412  0.600  0.178  0.397
ASUM    56.6  0.320  0.704  0.375  0.466
FirstP  63.2  0.630  0.689  0.095  0.472
SDTM    64.5  0.611  0.706  0.431  0.583

Table 4: SD level classification accuracies and F-measures on the annotated data. Acc is accuracy, and G F1 is the F-measure for classifying the G level. Avg F1 is the average of G F1, M F1 and H F1. SDTM outperforms all other methods compared. The difference between SDTM and FirstP is statistically significant (p < 0.05 for accuracy, p < 0.0001 for Avg F1).

We compare SDTM with the following methods for classifying tweets by SD level:

• LDA (Blei et al., 2003): A Bayesian topic model. Each conversation is treated as a document. Used in previous work (Bak et al., 2012).

• MedLDA (Zhu et al., 2012): A supervised topic model for document classification. Each conversation is treated as a document, and the response variable can be mapped to a SD level.

• LIWC (Tausczik and Pennebaker, 2010): Word counts for particular categories. Used in previous work (Houghton and Joinson, 2012).

• Seed words and trigrams (SEED): Occurrence of the seed words and trigrams described in Section 3.3.

• ASUM (Jo and Oh, 2011): A joint model of sentiment and topic using seed words. Each sentiment can be mapped to a SD level. Used in previous work (Bak et al., 2012).

• First-person pronouns (FirstP): Occurrence of the first-person pronouns described in Section 3.2. To identify first-person pronouns, we tagged parts of speech in each tweet with the Twitter POS tagger (Owoputi et al., 2013).

SEED, LIWC, LDA and FirstP cannot be used directly for classification, so we use a maximum entropy model with the outputs of each of those models as features. We run MedLDA, ASUM and SDTM 20 times each and compute the average accuracies and F-measures for each level. We set 40 topics for LDA, MedLDA and ASUM, and 60, 40 and 40 topics for SDTM's K^G, K^M and K^H respectively, and set α = γ = 0.1. To incorporate the seed words and trigrams into ASUM and SDTM, we initialize β^G, β^M and β^H differently: we assign a high value of 2.0 to each seed word and trigram for its level, a low value of 10^-6 to each word that is a seed word for another level, and a default value of 0.01 to all other words. This approach is the same as in other topic model work (Jo and Oh, 2011; Kim et al., 2013).

As Table 4 shows, SDTM performs better than the other methods in both accuracy and F-measure. LDA and MedLDA generally show the lowest performance, which is not surprising given that these models are quite general and not tuned specifically for this type of semi-supervised classification task. LIWC and SEED perform better than LDA, but they have quite low F-measures for the G and H levels. ASUM shows better performance for classifying the H level than the others, but not for classifying the G level. FirstP shows a good F-measure for the G level, but its H level F-measure is quite low, even lower than SEED's. Finally, SDTM performs similarly to FirstP at the G and M levels, but it performs better than the others at the H level. Classifying the H level well is important because, as we will discuss later, the H level has the strongest relationship with longitudinal OSN usage (see Section 6.2), so SDTM is overall the best model for classifying self-disclosure levels.

6 Self-Disclosure and Conversation Frequency

In this section, we investigate whether there is a relationship between self-disclosure and conversation frequency over time. Trepte and Reinecke (2013) showed that frequent or high-level self-disclosure in online social networks (OSN) contributes positively to OSN usage, and vice versa. They showed this through an online survey with

StudiVZ users. With SDTM, we can automatically classify the self-disclosure level of a large number of conversations, so we investigate whether there is a similar relationship between self-disclosure in conversations and the subsequent frequency of conversations with the same partner on Twitter. More specifically, we ask the following two questions:

1. If a dyad displays a high SD level in their conversations at a particular time period, would they have more frequent conversations subsequently?

2. If a dyad shows a high conversation frequency at a particular time period, would they display higher SD in their subsequent conversations?

Figure 3: Relationship between initial SD level and conversation frequency changes over time. The solid line is the linear regression line, and the coefficient is 0.118 with p < 0.001, which shows a significant positive relationship.
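The regression summarized in the caption of Figure 3 is a one-variable ordinary least squares fit. A minimal self-contained sketch, with made-up numbers rather than the paper's data:

```python
def ols_fit(xs, ys):
    """Closed-form simple linear regression; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Synthetic illustration: dyads' initial SD level scores against the
# later change in their conversation frequency (values are invented).
sd_scores = [1.0, 1.5, 2.0, 2.5, 3.0]
freq_changes = [-0.2, 0.0, 0.1, 0.2, 0.4]
slope, intercept = ols_fit(sd_scores, freq_changes)
# A positive slope corresponds to the positive relationship in Figure 3.
```

In practice, a routine such as scipy.stats.linregress would give the same slope along with the p-values quoted in this section.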

6.1 Experiment Setup

We first run SDTM on all of our Twitter conversation data with 150, 120 and 120 topics for K^G, K^M and K^H respectively. The hyperparameters are the same as in Section 5. To handle the large dataset, we employ a distributed algorithm (Newman et al., 2009).

Table 5 shows some of the topics that were prominent in each SD level by KL-divergence. As expected, the G level includes general topics such as food, celebrity, soccer and IT devices, the M level includes personal communication and birthdays, and the H level includes sickness and profanity.

For comparing conversation frequencies over time, we divided the conversations into two sets for each dyad. For the initial period, we include conversations from the dyad's first conversation to 60 days later; for the subsequent period, we include conversations during the following 30 days.

We compute the proportions of conversations at each SD level for each dyad in the initial and subsequent periods. We also define a new measurement, the SD level score of a dyad in a period, which is a weighted sum over the conversations, with the SD levels G, M, and H mapped to 1, 2, and 3, respectively.

6.2 Does self-disclosure lead to more frequent conversations?

We investigate the effect of the level of self-disclosure on long-term use of OSN. We run a linear regression with the initial SD level score as the independent variable and the rate of change in conversation frequency between the initial and subsequent periods as the dependent variable. The result is that the independent variable's coefficient is 0.118 with a low p-value (p < 0.001). Figure 3 shows the scatter plot with the regression line, and we can see that the slope of the regression line is positive.

We also investigate the importance of each SD level for changes in conversation frequency. We run a linear regression with the initial proportions of each SD level as the independent variables and the same dependent variable as above.

           G level   M level    H level
Coeff (β)  0.094     0.419      0.464
p-value    0.1042    < 0.0001   < 0.0001

Table 6: Relationship between initial SD level proportions and changes in conversation frequency. For the M and H levels, there is a significant positive relationship (p < 0.0001), but for the G level, there is not (p > 0.1).

As Table 6 shows, there is no significant relationship between the initial proportion of the G level and the changes in conversation frequency (p > 0.1). But for the M and H levels, the initial proportions show positive and significant relationships with the subsequent changes in conversation frequency (p < 0.0001). These results show that the M and H levels are correlated with changes in the frequency of conversation.
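The SD level score used above can be sketched as follows. The 1/2/3 mapping for G/M/H comes from Section 6.1; averaging over a dyad's conversations, rather than taking a raw weighted sum, is our assumption, chosen so the score stays within the 1 to 3 range of the x-axis in Figure 3.

```python
LEVEL_WEIGHTS = {"G": 1, "M": 2, "H": 3}

def sd_level_score(conversation_levels):
    """Score a dyad's period: map each conversation's SD level to
    1 (G), 2 (M) or 3 (H) and average over the conversations.
    Averaging is our normalization choice, not stated in the text."""
    weights = [LEVEL_WEIGHTS[level] for level in conversation_levels]
    return sum(weights) / len(weights)
```

A dyad whose initial-period conversations are labeled G, M, H, M would thus score 2.0, in the middle of the range.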

G level
  Topic 101: chocolate, butter, good, cake, peanut, milk, sugar, cream
  Topic 36:  obama, he's, romney, vote, right, president, people, good
  Topic 113: league, win, game, season, team, cup, city, arsenal

M level
  Topic 184: send, email, i'll, sent, dm, address, know, check
  Topic 104: twitter, follow, tweet, following, account, fb, followers
  Topic 33:  going, party, weekend, day, night, dinner, birthday, tomorrow

H level
  Topic 176: ass, bitch, fuck, shit, fucking, lmao, shut
  Topic 82:  better, sick, feel, throat, cold, hope, pain, good
  Topic 19:  lips, kisses, love, smiles, softly, hand, eyes, neck

Table 5: High-ranked topics in each level, by comparing KL-divergence with the other levels' topics
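Table 5 selects topics whose word distributions diverge most from the other levels' topics. One plausible reading of that criterion, sketched below, scores a topic by its smallest KL-divergence to any topic of another level; the exact ranking rule is not spelled out in the text, so treat this as an illustration only.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete word distributions over the
    same vocabulary; eps guards against zero probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distinctiveness(topic, other_level_topics):
    """Score a topic by its smallest divergence to any topic of
    another SD level; a large score means no other level has a
    similar topic, so the topic is characteristic of its level."""
    return min(kl_divergence(topic, other) for other in other_level_topics)
```

Under this reading, a profanity topic at the H level would score high because no G- or M-level topic places comparable probability mass on those words.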

6.3 Does high frequency of conversation lead to more self-disclosure?

Now we investigate whether the initial conversation frequency is correlated with the SD level in the subsequent period. We run a linear regression with the initial conversation frequency as the independent variable and the SD level in the subsequent period as the dependent variable. The regression coefficient is 0.0016 with a low p-value (p < 0.0001). Figure 4 shows the scatter plot; we can see that the slope of the regression line is positive. This result supports previous findings in social psychology (Leung, 2002) that the frequency and session time of the instant chat program ICQ were correlated with the depth of SD in messages.

Figure 4: Relationship between initial conversation frequency and subsequent SD level. The solid line is the linear regression line, and the coefficient is 0.0016 with p < 0.0001, which shows a significant positive relationship.

7 Conclusion and Future Work

In this paper, we have presented the self-disclosure topic model (SDTM) for discovering topics and classifying SD levels from Twitter conversation data. We devised a set of effective seed words and trigrams, mined from a dataset of secrets. We also annotated Twitter conversations to make a ground-truth dataset for SD level. With the annotated data, we showed that SDTM outperforms previous methods in classification accuracy and F-measure.

We also analyzed the relationship between SD level and conversation frequency over time. We found that there is a positive correlation between the initial SD level and the subsequent conversation frequency. Also, dyads show a higher level of SD if they initially display a high conversation frequency. These results support previous findings in social psychology research with more robust results from a large-scale dataset, and show the importance of looking at SD behavior in OSN.

There are several future directions for this research. First, we can improve our modeling for higher accuracy and better interpretability. For instance, SDTM only considers first-person pronouns and topics; naturally, there are patterns that can be identified by humans but not captured by pronouns and topics. Second, the number of topics for each level is manually set, and so we can explore nonparametric topic models (Teh et al., 2006), which infer the number of topics from the data. Third, we can look at the relationship between self-disclosure behavior and general online social network usage beyond conversations.

Acknowledgments

We thank the anonymous reviewers for helpful comments. Alice Oh was supported by the IT R&D Program of MSIP/KEIT [10041313, UX-oriented Mobile SW Platform].

References

JinYeong Bak, Suin Kim, and Alice Oh. 2012. Self-disclosure and relationship strength in Twitter conversations. In Proceedings of ACL.

Azy Barak and Orit Gluck-Ofri. 2007. Degree and reciprocity of self-disclosure in online forums. CyberPsychology & Behavior, 10(3):407–417.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Danah Boyd, Scott Golder, and Gilad Lotan. 2010. Tweet, tweet, retweet: Conversational aspects of retweeting on Twitter. In Proceedings of HICSS.

Cristian Danescu-Niculescu-Mizil, Michael Gamon, and Susan Dumais. 2011. Mark my words!: Linguistic style accommodation in social media. In Proceedings of WWW.

Valerian J. Derlega, Sandra Metts, Sandra Petronio, and Stephen T. Margulis. 1993. Self-Disclosure, volume 5 of SAGE Series on Close Relationships. SAGE Publications, Inc.

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.

David J. Houghton and Adam N. Joinson. 2012. Linguistic markers of secrets and sensitive self-disclosure in Twitter. In Proceedings of HICSS.

Yohan Jo and Alice H. Oh. 2011. Aspect and sentiment unification model for online review analysis. In Proceedings of WSDM.

Adam N. Joinson and Carina B. Paine. 2007. Self-disclosure, privacy and the internet. The Oxford Handbook of Internet Psychology, pages 237–252.

Sidney M. Jourard. 1971. Self-Disclosure: An Experimental Analysis of the Transparent Self.

Suin Kim, JinYeong Bak, and Alice Haeyun Oh. 2012. Do you feel what I feel? Social aspects of emotions in Twitter conversations. In Proceedings of ICWSM.

Suin Kim, Jianwen Zhang, Zheng Chen, Alice Oh, and Shixia Liu. 2013. A hierarchical aspect-sentiment model for online reviews. In Proceedings of AAAI.

Andrew M. Ledbetter, Joseph P. Mazer, Jocelyn M. DeGroot, Kevin R. Meyer, Yuping Mao, and Brian Swafford. 2011. Attitudes toward online social connection and self-disclosure as predictors of Facebook communication and relational closeness. Communication Research, 38(1):27–53.

Louis Leung. 2002. Loneliness, self-disclosure, and ICQ ("I seek you") use. CyberPsychology & Behavior, 5(3):241–251.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge.

Erika McCallister. 2010. Guide to Protecting the Confidentiality of Personally Identifiable Information. DIANE Publishing.

Arjun Mukherjee, Vivek Venkataraman, Bing Liu, and Sharon Meraz. 2013. Public dialogue: Analysis of tolerance in online discussions. In Proceedings of ACL.

David Newman, Arthur Asuncion, Padhraic Smyth, and Max Welling. 2009. Distributed algorithms for topic models. Journal of Machine Learning Research, 10:1801–1828.

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of HLT-NAACL.

Alan Ritter, Colin Cherry, and Bill Dolan. 2010. Unsupervised modeling of Twitter conversations. In Proceedings of HLT-NAACL.

Charles Steinfield, Nicole B. Ellison, and Cliff Lampe. 2008. Social capital, self-esteem, and use of online social network sites: A longitudinal analysis. Journal of Applied Developmental Psychology, 29(6):434–445.

Yla R. Tausczik and James W. Pennebaker. 2010. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology.

Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476).

Sabine Trepte and Leonard Reinecke. 2013. The reciprocal effects of social network site use and the disposition for self-disclosure: A longitudinal study. Computers in Human Behavior, 29(3):1102–1112.

Sarah I. Vondracek and Fred W. Vondracek. 1971. The manipulation and measurement of self-disclosure in preadolescents. Merrill-Palmer Quarterly of Behavior and Development, 17(1):51–58.

Thomas Ashby Wills. 1985. Supportive functions of interpersonal relationships. Social Support and Health, xvii:61–82.

Wayne Xin Zhao, Jing Jiang, Hongfei Yan, and Xiaoming Li. 2010. Jointly modeling aspects and opinions with a MaxEnt-LDA hybrid. In Proceedings of EMNLP.

Jun Zhu, Amr Ahmed, and Eric P. Xing. 2012. MedLDA: Maximum margin supervised topic models. Journal of Machine Learning Research, 13:2237–2278.
