Identifying Trendsetters in Online Social Networks by Advanced Analytics

Rechts- und Wirtschaftswissenschaftliche Fakultät Fachbereich Wirtschafts- und Sozialwissenschaften

Friedrich-Alexander-Universität Erlangen-Nürnberg

Dissertation submitted to obtain the doctoral degree Dr. rer. pol.

submitted by Martina Wenzel, M.Sc., from Roth

Approved as a dissertation by the Rechts- und Wirtschaftswissenschaftliche Fakultät / the Fachbereich Wirtschafts- und Sozialwissenschaften of the Friedrich-Alexander-Universität Erlangen-Nürnberg

Date of the oral examination: 22.06.2021

Chair of the doctoral committee: Prof. Dr. Klaus Henselmann

Reviewers: Prof. Dr. Freimut Bodendorf

Prof. Dr. Andreas Fürst

Abstract

The fashion industry operates in a highly competitive market in which consumers gain increasing power over the creation and diffusion of fashion trends, driven by the wide usage of online social networking platforms. This forces fashion companies to adapt their methods of trend prediction to meet consumer needs and preferences and to stay competitive. Social networking platforms provide an instrument to share ideas and opinions, which allows users to influence the behavior of others and, therefore, the development of trends. The content published on these platforms is thus a rich data source for the fashion industry, containing information about changing consumer needs and upcoming trends. Fashion companies, however, struggle to benefit from this data for trend prediction because they lack knowledge about the trend-relevant users who publish content that includes information about future trends. This research addresses the challenge of profiting from this valuable data source for trend prediction, especially in the highly competitive fashion industry. It argues that trends are created and diffused by trendsetters and that the content shared by these trendsetters in online social networks includes information that enables early trend detection. The study therefore seeks to identify trendsetters based on the digital trace they leave on online social networking platforms and addresses the question of how fashion trendsetters in online social networks can be identified automatically based on social media data. To achieve this goal, a feature framework is created based on a literature review and expert interviews; it enables the measurement of characteristics of trend-relevant roles based on social media data. Next, a two-step approach is developed that first extracts a topic-relevant sample of users (a community) from a large online social network and then identifies the online trendsetters within this sample using supervised machine learning. For its development, a prototypical data analysis is carried out based on publicly accessible data from the online social networking platform Instagram. The resulting methodology for the identification of online trendsetters related to a specific topic area consists of a topic-focused community detection approach and a classification model. The analysis of the features relevant for the model’s class decision further reveals insights into the characteristics of online trendsetters in online social networks. The evaluation of the developed methodology shows its transferability to other use cases and validates the trend prediction potential of the identified online trendsetters.

The results of this thesis contribute to research and practice. The insights gained about online trendsetters’ behavioral patterns, their characteristics, and the features relevant for their detection in online social networks expand the knowledge about online trendsetters in the fashion industry and thus contribute to the area of trend research and the recently emerging field of fashion informatics. In addition, companies can use the insights to identify appropriate marketing partners to influence trends. Furthermore, the developed methodology supports fashion companies by providing a new data source to increase trend prediction quality and facilitates the identification of changing consumer needs and preferences. [1]

[1] For a German version of this abstract, please refer to Appendix A.1.

Table of contents

List of figures
List of tables
List of abbreviations

1 Introduction
1.1 Motivation and research gap
1.2 Objectives and research questions
1.3 Research design
1.4 Structure of the thesis

2 Theoretical background and conceptual foundations
2.1 Trend diffusion
2.1.1 Terms and definition
2.1.2 Trend diffusion theories
2.1.3 Trendsetters
2.1.4 Trend diffusion in the fashion industry
2.2 Online social networks
2.2.1 Online social networking platforms
2.2.2 Users and communities
2.3 Advanced social media analytics
2.3.1 Process
2.3.2 Text analytics
2.3.3 Social network analytics
2.3.4 Supervised machine learning
2.4 Interim conclusion

3 Related work
3.1 Overview
3.2 Trend detection on social media platforms
3.3 Community detection
3.3.1 Process initialization
3.3.2 Detection methods and considered data
3.3.3 Detection process
3.4 Identification of influential users on social media platforms
3.4.1 Characteristics and measurement
3.4.2 Approaches to detect influential spreaders
3.4.3 Classification approaches
3.5 Validation of insights by experts from the fashion industry
3.5.1 Method
3.5.2 Data collection and analysis
3.5.3 Findings
3.6 Interim conclusion

4 Conceptual framework to identify online trendsetters
4.1 Two-step approach
4.2 Topic-focused community detection
4.2.1 Detection metrics
4.2.2 Detection mechanisms
4.2.3 Data layers
4.2.4 Iterative process
4.3 Trendsetter classification model
4.3.1 Labeling concept
4.3.2 Feature framework
4.4 Interim conclusion

5 Identification of online trendsetters by advanced analytics
5.1 Use case description
5.1.1 Sneaker trends
5.1.2 Instagram
5.2 Topic-focused community detection
5.2.1 Methods
5.2.2 Process initialization and iterations
5.2.3 Results and evaluation
5.3 Trendsetter classification model
5.3.1 Methods
5.3.2 Exploratory data analysis and data pre-processing
5.3.3 Development and selection of models
5.3.4 Comparison of models
5.3.5 Relevant features for online trendsetter identification
5.4 Summary

6 Model application
6.1 Use case description
6.2 Application of trendsetter identification methodology
6.3 Trend prediction capability of online trendsetters
6.4 Application areas and transferability

7 Conclusion
7.1 Summary
7.2 Contribution to theory and practice
7.3 Limitations and implications for future research

References

Appendix
A.1 Abstract (German version)
A.2 Features for the measurement of opinion leaders’ characteristics
A.3 Interview guide
A.4 List of features for the identification of online trendsetters
A.5 List of hyperparameters’ search space and final model settings
A.6 SHAP dependence plots

List of figures

Figure 1-1 Design Science Research process model
Figure 1-2 Research design
Figure 2-1 Fashion lifecycle
Figure 2-2 Fashion lifecycle related to Rogers’ adopter categories
Figure 2-3 Fashion trendsetters’ characteristics
Figure 2-4 Fashion trend creation and diffusion process
Figure 2-5 Main components of online social networking platforms and related data
Figure 2-6 SMA process
Figure 2-7 Data from online social networking platforms and applied SMA methods
Figure 2-8 Application of text analytics within the thesis
Figure 2-9 Application of social network analytics within the thesis
Figure 2-10 Steps of building a predictive model
Figure 3-1 Summary of results – literature review on community detection
Figure 3-2 Summary of results – literature review on online opinion leader detection
Figure 3-3 Social media influencer identification process
Figure 4-1 Two-step OTS identification approach
Figure 4-2 Overview community detection process
Figure 4-3 Pseudocode text score
Figure 4-4 Pseudocode of consensus algorithm
Figure 4-5 Community detection process chart
Figure 4-6 Applied supervised machine learning process
Figure 4-7 Labeling concept based on Rogers’ diffusion model
Figure 4-8 Labeling process
Figure 4-9 Identification of trends – price evolution
Figure 4-10 Feature framework
Figure 4-11 TSIM concept overview
Figure 5-1 Community detection – required data and applied SMA methods
Figure 5-2 Matrix decomposition and notations of NMF algorithm
Figure 5-3 Score value calculation, example: CPP
Figure 5-4 Evolution of user composition
Figure 5-5 Community statistics – sneaker community
Figure 5-6 Comparison of centrality measures in the initial and final iteration
Figure 5-7 Topic models of all community members’ posting texts (sneaker)
Figure 5-8 Feature extraction – required data and applied SMA methods
Figure 5-9 Model development – applied methods and implementation
Figure 5-10 Data splitting – training, validation and test split
Figure 5-11 5-fold cross-validation
Figure 5-12 Applied feature selection methods
Figure 5-13 Confusion matrix
Figure 5-14 Community evolution – OTS and non-OTS (sneaker)
Figure 5-15 Visualization of value ranges of the features avg. time between posts, followed by, distinct emoji in bio
Figure 5-16 Visualization of PCA analysis
Figure 5-17 Process of model training, validation and evaluation
Figure 5-18 Overview experimental setup
Figure 5-19 Feature importance – SHAP summary plot
Figure 5-20 SHAP dependence plot – no. of tags
Figure 5-21 Value distribution – no. of (distinct) tags
Figure 5-22 SHAP dependence plot – ratio follows-followers
Figure 5-23 Value distribution – ratio follows-followers
Figure 5-24 Value distribution – no. of and posts
Figure 5-25 SHAP dependence plot – no. of distinct hashtags
Figure 5-26 Value distribution – no. of distinct hashtags and avg. no. of videos
Figure 5-27 SHAP dependence plot – avg. no. of videos
Figure 5-28 SHAP dependence plot – ratio hashtags bio
Figure 6-1 Topic models of all community members’ posting texts (sustainability)
Figure 6-2 Community evolution – OTS and non-OTS (sustainability)
Figure 6-3 Steps to assess the trend prediction potential of identified OTSs
Figure 6-4 Topic evolution in OTSs’ postings (2016-2019)
Figure 6-5 Example – evolution trend popularity
Figure 6-6 Example – popularity evolution of reuse
Figure 6-7 Example – popularity evolution of reuse after differencing

List of tables

Table 1-1 Research questions
Table 2-1 Social networking platform – components, functions and information
Table 3-1 Feature examples for the measurement of opinion leaders’ characteristics
Table 3-2 Overview of interviewees
Table 3-3 Most mentioned criteria of social media influencer selection
Table 4-1 Filter criteria
Table 4-2 Scoring criteria
Table 4-3 Overview different data layers
Table 4-4 Examples of user-related features
Table 4-5 Examples of content-related features
Table 4-6 Examples of context-related features
Table 4-7 Examples of network-related features
Table 5-1 Instagram functions, related data, and data access
Table 5-2 Relevant concepts of applied methods and use case related examples
Table 5-3 Use case-specific settings – variables and values (sneaker)
Table 5-4 Relevant machine learning terminology
Table 5-5 Performance measures
Table 5-6 Comparison OTS and non-OTS (sneaker community)
Table 5-7 Value ranges of the features avg. time between posts, followed by, distinct emoji in bio
Table 5-8 Performance results of the best model of each classifier
Table 5-9 Important features for OTS detection according to the SHAP value
Table 5-10 Summary of results – feature importance analysis
Table 6-1 Use case-specific settings – variables and values (sustainability)
Table 6-2 Comparison OTS and non-OTS (sustainability community)
Table 6-3 Consumer trends related to the sustainability megatrend in 2020
Table 6-4 Granger causality test results
Table 6-5 Comparison of OTSs and users with the highest reach (Granger causality)
Table 6-6 Transferability of approach to other social media platforms

List of abbreviations

API Application Programming Interface
Avg. Average
BOW Bag-Of-Words
BS Biography Score
CPP Comments per Post
FFR Followers-follows ratio
FN False negative
FP False positive
HTML Hypertext Markup Language
IE Information Extraction
IS Information Systems
LDA Latent Dirichlet Allocation
LPP Likes per Post
MDA Mean Decrease Accuracy
NLP Natural Language Processing
NMF Non-negative Matrix Factorization
No. Number
OSN Online Social Network
OTS Online Trendsetter
PCA Principal Component Analysis
POS Part-of-Speech
PPD Number of Posts per Day
RQ Research Question
RSS Rich Site Summary
SD Standard Deviation
SHAP SHapley Additive exPlanations
SMA Social Media Analytics
SMOTE Synthetic Minority Oversampling Technique
SNA Social Network Analytics
SVM Support Vector Machine
TF-IDF Term Frequency-Inverse Document Frequency
TN True negative
TP True positive
TS Text Score
TSIM Trendsetter Identification Methodology
UGC User-Generated Content
URL Uniform Resource Locator
VSM Vector Space Models

1 Introduction

1.1 Motivation and research gap

Digitalization and increasing interconnectedness, driven by the emergence of the internet and especially Web 2.0, have influenced and changed many areas of society in recent years. The fashion industry in particular is changing, as consumers now steadily ask for new styles and products (McNeill and Moore, 2015). To meet this new demand, the fashion business is moving from the traditional bi-seasonal fashion collections towards so-called fast fashion with constantly changing, new intermediate collections (Bhardwaj and Fairhurst, 2010; Kim et al., 2011). This transition leads to shorter product life cycles with a decreasing timespan between the release of a new design and its consumption (Kim et al., 2011), and forces companies to generate new ideas and innovative products faster than ever (Bhardwaj and Fairhurst, 2010). Besides this new pace, the growing usage of online social networking platforms increases the power of consumers regarding trend creation and diffusion (Dillon, 2012; Chang et al., 2015), as such platforms support the generation and exchange of content across borders (Kaplan and Haenlein, 2010). Internet users transform from mere consumers into active creators of this so-called user-generated content (Beheshti-Kashi et al., 2015; Schiele et al., 2008), and at the same time, these online social networks (OSNs) take a key role in information diffusion (Guille et al., 2013). Some of this content becomes popular and even contributes to new trends (Guille et al., 2013). Through OSNs, ordinary people have a platform to become visible to anyone online, share their ideas, opinions and preferences, reach a mass audience, and influence others in their decisions. This shows that “trendsetting” is no longer restricted to traditional actors like designers, fashion companies, forecasting agencies and fashion magazines; trends are increasingly influenced by active users of social media platforms (Jackson, 2007; Manikonda et al., 2016).
Some of these social media users are highly interconnected and perceived by their community as experts in a specific field. They influence the attitudes and behavior of their audience through the content they spread via social media (Freberg et al., 2011). Therefore, this published content is a valuable data source for trend detection (Abdullah and Wu, 2011) and can serve as a source of inspiration for new product development and marketing campaigns (Tsur and Rappoport, 2012). As the ability to meet consumers’ preferences and the speed of responsiveness strongly influence the profitability and the competitiveness of fashion retailers nowadays (Bhardwaj and Fairhurst, 2010), the content published by these new trend-relevant users in OSNs, in the following called online trendsetters (OTSs), can help fashion companies to stay competitive. In this thesis, OTSs are defined as users of online social networking platforms who adopt and diffuse new ideas before these ideas become popular (Rogers, 2003; Saez-Trumper et al., 2012). This definition is based on Rogers’ innovation diffusion theory and is adapted to the online space. OTSs can be traditional trend actors who are active in OSNs or the above-mentioned ordinary people who become important for trends due to their activity and position in OSNs.
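The temporal core of this definition, adoption before an idea becomes popular, can be illustrated with a minimal Python sketch. It only illustrates the definition and is not the identification methodology developed in this thesis; the `Adoption` record, the function name, the toy data, and the day from which the idea counts as popular are all hypothetical assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Adoption:
    user: str
    day: int  # day on which the user first posted about the idea


def early_adopters(adoptions, popular_from_day):
    """Return users whose first adoption precedes the day the idea became popular."""
    return sorted({a.user for a in adoptions if a.day < popular_from_day})


# Made-up toy data: four users adopting one idea on different days.
adoptions = [
    Adoption("ana", 2),
    Adoption("ben", 5),
    Adoption("cem", 30),
    Adoption("dua", 41),
]

# Assumption: the idea counts as popular from day 30 onward.
print(early_adopters(adoptions, popular_from_day=30))
```

Such a check requires that the trend and its popularity date are already known, which is exactly the prior knowledge that a pattern-based identification of OTSs aims to avoid.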

Fashion companies have already recognized the relevance of OSNs in the context of trend creation and diffusion. This is underlined by the increasing spending on social media initiatives and related functions within the companies (Roberts and Piller, 2016). Fashion companies nowadays use OSNs in different ways to support product creation and to influence trends. In this context, social media influencers and the related marketing branch, influencer marketing, have gained increasing attention during the last decade (Audrezet et al., 2020). Social media influencers are often referred to as opinion leaders on social media platforms who engage in electronic word-of-mouth to spread information (Lou and Yuan, 2019). Especially in the OSN Instagram, they are successful message spreaders who have an impact on creating and diffusing new trends and who push sales (Jin et al., 2019). Therefore, marketing departments try to identify these influential users and collaborate with them to shape their image, push products in the market, and influence fashion trends in a specific domain or area (Jin et al., 2019). More recently, fashion companies have also started to co-create new products with social media influencers to benefit from their knowledge about consumer preferences and their closeness to the target group (Ahmad et al., 2015).

The challenge for practitioners who want to profit from the value of OSNs for their business is to identify the users and data that are relevant for new trends within this huge and noisy data source. There are 3.5 billion active social media users worldwide, which corresponds to 42% of the total global population (Hootsuite & We are social, 2019). Instagram, which is one of the most relevant OSNs related to fashion today (Casaló et al., 2020; Phua et al., 2017), has one billion monthly active users, 95 million daily postings and 4.2 billion likes per day (Hootsuite & We are social, 2019). Given this data volume, it is challenging to manually find the right fraction of users and the respective data with relevant information about upcoming trends. There is also no clear understanding of which users have the potential to create and influence the development of new trends. The selection of online opinion leaders for marketing cooperation, for instance, is mainly based on a limited number of quantitative measures such as the number of followers, and mostly relies on the assumption that the larger the audience of a user, the larger her or his impact on the network (De Veirman et al., 2017; Rakoczy et al., 2018). Although it is known from offline studies that trend-relevant opinion leaders are perceived as trustworthy, likable, and as having expertise in a specific field (Rogers, 2003; Lazarsfeld et al., 1944; Chan and Misra, 1990), such qualitative criteria are neglected by existing automated selection approaches (Freberg et al., 2010; De Veirman et al., 2017). One reason for this is the missing knowledge about how to measure such characteristics based on the data provided by online social networking platforms. Therefore, companies mostly consider users with a huge network as trend-relevant users, although studies have shown that users with a small community often have more impact on their followers (Kay et al., 2020).
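The effect of relying on follower counts alone can be made concrete with a small Python sketch. The three accounts and their numbers are invented for illustration, and the engagement rate (likes per post divided by followers) is only an assumed stand-in for audience impact, not a measure defined at this point of the thesis.

```python
# Made-up accounts; none of these numbers come from real data.
users = [
    {"name": "big_account", "followers": 500_000, "likes_per_post": 5_000},
    {"name": "mid_account", "followers": 40_000, "likes_per_post": 2_000},
    {"name": "small_account", "followers": 3_000, "likes_per_post": 450},
]


def engagement_rate(user):
    """Assumed proxy for impact: likes per post relative to audience size."""
    return user["likes_per_post"] / user["followers"]


# Ranking by audience size, as in follower-based selection.
by_followers = [
    u["name"] for u in sorted(users, key=lambda u: u["followers"], reverse=True)
]

# Ranking by the assumed engagement proxy.
by_engagement = [u["name"] for u in sorted(users, key=engagement_rate, reverse=True)]

print(by_followers)
print(by_engagement)
```

In this toy example the two rankings are exactly reversed, which mirrors the observation that a large audience does not necessarily imply a large impact on followers.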

From the academic perspective, a large body of research already deals with trend detection and the identification of influential spreaders (Guille et al., 2013). Studies that focus on trend-relevant users regarding early adoption in a fashion context, however, are rare (e.g., Bakshy et al., 2009; Saez-Trumper et al., 2012; Cervellini et al., 2016). Cervellini et al. (2016), for instance, test an algorithm that combines a network topology approach with a temporal analysis to identify trendsetters in the OSN Yelp. Similar to most of the research in this field, this algorithm identifies trendsetters related to a specific trend and by considering the time of adoption (e.g., Cervellini et al., 2016; Saez-Trumper et al., 2012). These studies do not investigate specific behavioral patterns or characteristics of the trendsetter group based on an extensive analysis of their digital trace in OSNs. Detecting trendsetters by recognizing such “trendsetter-patterns” in social media data from OSNs, however, allows their identification without knowledge of an already existing trend and therefore enables the detection of early signals before they become a trend. Although there is a growing body of studies that analyze different types of opinion leaders and influential spreaders in OSNs, little research focuses on fashion-related actors. Only a small number of studies apply data-intensive computational approaches within the field of fashion (e.g., Chen and Luo, 2017; Park et al., 2016; Lin et al., 2015; Lin et al., 2014). In recent years, the notion of fashion informatics has emerged, which describes the usage of computational methods in the context of the fashion industry to profit from rich data sources such as social media data. At the same time, there is a call for more research in this field (Copeland et al., 2019; Zhao and Min, 2019). Zhao and Min (2019) underline the importance of social media data in the field of fashion research, as this industry has become a social-media-driven one. They highlight the value of social media data for fashion research, especially in terms of timeliness, accessibility, volume and low cost. Lee et al. (2017) furthermore emphasize the importance of identifying “key personnels that generate significant changes (e.g. trend-setting)” in a fashion-related network as a baseline to improve fashion trend prediction (Lee et al., 2017, p. 4). As recent studies that investigate the detection of emerging trends and fads do not consider the role of trendsetters in OSNs, they call for future research focusing on fashion trendsetters, especially their online actions and motivations (Lee et al., 2017).

To sum up, the identified gaps in research and practice are the following:

(1) Lack of knowledge about OTSs and their behavioral patterns and personality traits based on data from OSNs related to the fashion industry

(2) Missing approach to detect potential OTSs based on extensive features derived from social media data which allow their automated detection and which consider quantitative and qualitative criteria

The dissertation addresses these issues and aims to contribute to research and practice by analyzing fashion trend diffusion and trend-relevant actors in the online space in order to develop a methodology to identify OTSs in online social networks. Companies that know who will create and shape trends in the online world can improve trend prediction and the selection of appropriate partners for marketing cooperations and co-creation (Lee et al., 2017). In addition, this dissertation aims to narrow the research gap regarding trend-relevant user groups in a fashion context, especially regarding their traits and behavioral patterns based on social media data as well as their embeddedness in the fashion-based network. It contributes to the new research field of fashion informatics by providing an approach for trendsetter identification that uses data-intensive computational methods such as social network analysis, text mining, and machine learning algorithms.

1.2 Objectives and research questions

The overall goal of this thesis is to develop an approach for trendsetter identification in online social networks related to a specific fashion area by answering the following questions:

(1) Who are the trend-relevant users in OSNs,

(2) what characterizes them, and

(3) how to identify them based on data from OSNs to support trend detection and trend influencing?

Therefore, the main research question (RQ) of this thesis is the following:

How can fashion trendsetters in OSNs be identified automatically based on social media data?

Based on this question, several sub-questions arise which are related to the construction of a conceptual framework (RQ 1.1 - 1.3), the development of the previously conceptualized solution (RQ 2.1 - 2.3) and its evaluation (RQ 3.1 - 3.2). Table 1-1 gives an overview of these sub-questions related to the three areas.

1 - Concept

RQ 1.1: Which trend-relevant roles and actors do exist, related to the fashion industry, and how can they be characterized?

RQ 1.2: Which features based on social media data from OSNs can be used to measure these characteristics and which data is required for the calculation of these features?

RQ 1.3: Which methods enable the identification of specific user roles in online social networks according to existing literature?

2 - Analysis

RQ 2.1: How can a community related to a specific interest field be identified in OSNs and based on which data?

RQ 2.2: Which classifiers are most suitable for fashion trendsetter identification with regard to state-of-the-art performance measures using social media data?

RQ 2.3: Which features are relevant for the identification of fashion trendsetters according to the analysis?

3 - Evaluation

RQ 3.1: How does the developed methodology perform on a specific use case compared to existing methods?

RQ 3.2: How can it be used in practice?

Table 1-1 Research questions

RQ 1 addresses the theoretical foundation of the research project and aims to build a conceptual framework for the identification of fashion trendsetters in online social networks. For this purpose, existing knowledge about trend diffusion, trend-relevant roles and actors, and trend- and fashion-relevant user groups of social media platforms is gathered and analyzed with regard to important elements and insights for the construction of the framework. Based on the insights gained within RQ 1, RQ 2 deals with the realization of the concept and the development of the methodology. To this end, a process for data collection is derived which results in a topic-focused community. Subsequently, a classification model is developed which aims to classify the members of the identified community into OTS and non-OTS accounts. Additionally, the features relevant for the respective class decision are investigated. RQ 3 then addresses the evaluation of the developed solution and shows its application and usage in practice.
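The two steps described for RQ 2 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the methodology developed in this thesis: community membership is reduced to a keyword match on the biography, and the trained classifier is replaced by a hand-written placeholder rule. The names `User`, `detect_community`, `classify`, `toy_model`, the feature `distinct_hashtags`, and the threshold of 50 are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    bio: str
    features: dict = field(default_factory=dict)


def detect_community(users, topic_terms):
    """Step 1: keep users whose biography mentions at least one topic term."""
    terms = {t.lower() for t in topic_terms}
    return [u for u in users if terms & set(u.bio.lower().split())]


def classify(community, model):
    """Step 2: label each community member as 'OTS' or 'non-OTS'."""
    return {u.name: ("OTS" if model(u.features) else "non-OTS") for u in community}


def toy_model(features):
    """Placeholder for a trained classifier: a hand-written rule, not a learned model."""
    return features.get("distinct_hashtags", 0) >= 50  # hypothetical threshold


# Made-up users for illustration only.
users = [
    User("ana", "sneaker collector", {"distinct_hashtags": 80}),
    User("ben", "sneaker fan", {"distinct_hashtags": 10}),
    User("cem", "food blogger", {"distinct_hashtags": 120}),
]

labels = classify(detect_community(users, ["sneaker"]), toy_model)
print(labels)
```

Note that the user outside the topic-focused community is never classified, regardless of feature values; only members of the extracted community reach the second step.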

1.3 Research design

The research project follows the Design Science Research paradigm of Hevner et al. (2004), which is often applied in the field of Information Systems (IS) (Peffers et al., 2012). They developed a conceptual framework that combines behavioral science and design science to understand, execute and evaluate IS research. They state that the problem space is defined by the environment of the area of interest, whereas the solution to this problem and the creation of an artifact draw on a specific knowledge base. After a new artifact is created, it is tested and evaluated under realistic conditions within this environment, and the insights gained are added to the initial knowledge base. In this research project, the newly emerging OTSs and their increasing power regarding trend creation and diffusion create the business need to identify them online and to gain insights about their behaviors and characteristics based on social media data. The knowledge base for solving this problem consists of insights from trend research, research on social media platforms and their user groups, and techniques from social media analytics and machine learning. This thesis adds a methodology to detect OTSs in OSNs based on social media data and reveals knowledge about their online behavioral patterns and characteristics.

Hevner et al. (2004) define seven guidelines to achieve effective design-science research. The following section gives an overview of the dissertation project and aims to emphasize how these guidelines are implemented in the thesis:

1. Creation of a viable artifact

The core artifact of this thesis is a methodology to identify OTSs in a fashion context based on social media data, which consists of two components:

(1) An approach to detect a topic-focused community by means of social media analytics based on social media data

(2) An explainable classification model which classifies the members of a community into OTS and non-OTS

The process and algorithms that allow extracting a community from an OSN and classifying its members into OTS and non-OTS, as well as the knowledge about the relevant features, are the artifacts of this research. As an example, the classification model as an artifact consists of the classification algorithm, the selection of features that are integrated into the model, and the hyperparameter settings. In the following, this two-step methodology is referred to as the Trendsetter Identification Methodology (TSIM).

2. Development of a solution for a business problem

The importance of the problem is emphasized by the increasing power of users in OSNs to create and diffuse trends, combined with the high number of active users and the missing capabilities of the fashion industry to profit from this huge new data source, which potentially includes signals about new trends (Lee et al., 2017). As a consequence, fashion companies struggle to keep pace and lose their competitiveness. The relevance of the problem is underlined by the increasing number of studies dealing with trend detection on social media platforms and the related research stream on identifying and analyzing influential users of information cascades. The newly emerging research field of fashion informatics also shows the relevance of research projects which apply computational methods to solve problems related to the fashion industry (Copeland et al., 2019).

3. Evaluation of the solution by well-executed methods The evaluation activities take place in several phases of the research project by applying evaluation methods such as technical experiments, logical argument, the illustrative scenario method and expert evaluation, which, according to Peffers et al. (2012), are appropriate methods to evaluate the respective artifact.

4. Contribution of research The dissertation contributes to both theory and practice. It expands the knowledge about fashion OTSs by providing insights into their online behavioral patterns and their detection based on social media data. Based on an empirical analysis, it reveals insights about features derived from social media data which increase the accuracy of the classification model, and therefore, can be considered as relevant criteria for the identification of OTSs. It also contributes knowledge about the potentially important features for such a classification problem by transferring characteristics of trendsetters based on traditional trend theories to measurable metrics. From a practitioner's perspective, this thesis helps to identify potential OTSs as it provides a methodology for their identification in OSNs, and subsequently supports the prediction of new trends.

5. Application of rigorous methods in construction and evaluation of the artifact Research on trends and trend-relevant groups often refers to Rogers’ innovation diffusion model and social theories such as social network theory and social capital theory. The research design of this thesis is based on previous research related to those theories. The conceptual framework builds on Rogers’ diffusion model, which provides relevant elements for the identification of OTSs. The theory delivers the fundamental definitions of trend-relevant groups and related concepts. For the identification of potentially relevant features and the creation of the feature framework, studies that are based on social capital theory are considered as well as insights from studies referring to social network theory. The latter also forms the baseline for the community detection approach. Thus, this dissertation is based on a clearly defined and tested foundation of literature and knowledge of the relevant research areas. Additionally, Figure 1-2 shows that rigorous methods of Design Science Research are applied in all phases of research, for the design as well as for the development and evaluation of artifacts.

6. Utilization of available means for searching effective artifacts The design science process aims to identify OTSs without relating to a specific trend, and design methods have to be identified which satisfy this purpose. Therefore, characteristics of trendsetters are identified by conducting extensive literature reviews in the fields of trend diffusion, trend actors, and influential spreaders on social media platforms. Additionally, expert interviews are carried out to further enrich the insights with the practitioner's perspective. Besides, research in the field of social media analytics and machine learning is conducted to find appropriate methods for the design of the solution. During the design phase, several evaluation and improvement loops are conducted and different variants are compared. For the development of the classification model, for instance, four different machine learning algorithms and three ensemble learning methods are compared according to their classification accuracy, and the best model is selected for the TSIM.

7. Communication of research The communication of results and insights is realized by their documentation in this thesis. The targeted audience is the research community that is familiar with trend detection and influential user groups as well as the community around the newly emerging research area of fashion informatics. The thesis also includes useful information for practitioners in the fields of product development, marketing and strategy. The TSIM assists companies with the identification of potential OTSs, who can then be monitored to detect new innovative ideas and emerging trends early on. This knowledge can support companies in their strategic decisions regarding product development and marketing. The method can also be used by marketing departments to find potential partners to influence future trends or to provide designers with a data source for new product inspirations.

To realize the above-described research project, the dissertation follows a process model for design science in IS research provided by Peffers et al. (2008) which is illustrated in Figure 1-1.

[Figure: process phases (Problem identification and motivation; Definition of objectives of a solution; Design & Development; Demonstration; Evaluation; Communication) with possible research entry points (problem-centered initiation; objective-centered solution; design & development-centered initiation; client/context-initiated)]

Figure 1-1 Design Science Research process model adapted from Peffers et al. (2008, p. 54)

According to this process model, a research project has various potential starting points and follows a build-evaluate pattern. Therefore, each artifact is evaluated after its construction and further improved. Based on the description of Peffers et al. (2008) this thesis follows a problem-centered initiation and includes several build and improve iterations. As suggested by Sonneberg and Brocke (2012), besides evaluating the final solution, there are several evaluation steps included during the research process which aim to improve each component of the final solution. Figure 1-2 gives an overview of the research process related to the chapters of the thesis and the RQs. It also includes the applied methods and deliverables.


The structure follows the process model of Peffers et al. (2008) and distinguishes its six phases, with the focus on the design and development phase (cf. Figure 1-2).

[Figure: research process along the phases problem definition, define objectives, design & development, demonstration, evaluation and communication, related to the chapters of the thesis (Chapter 1; Chapters 2-4; Chapter 5; Chapters 6, 7) and to RQ1, RQ2 and RQ3. Methods: literature review, expert interviews, social media analytics, machine learning. Outputs: relevance of problem, objectives for solution, two-step approach, feature framework, community detection, classification model, evaluation, implications for practice and research. Evaluation methods: logical argument, technical experiment, illustrative scenario, expert evaluation]

Figure 1-2 Research design

1.4 Structure of the thesis

The structure of the thesis follows the previously introduced research design (cf. Figure 1-2). First, in Chapter 1, the motivation for the research project, the existing problems as well as the research gap are laid out. Based on the identified problems, the objectives of the dissertation and the resulting RQs are presented. The subsequent section describes the applied research design and the research process, and aims to give a clear understanding of how the defined RQs are answered and thus, how the goal of the thesis is reached.

Within Chapters 2 and 3, the theoretical foundation for the research project is built by conducting an extensive literature review. This includes the analysis of fundamental theories related to trends and trend actors as well as online social networking platforms and the users of such platforms, which is described in Chapter 2. Chapter 3 subsequently deals with a review of related work. This chapter is divided into three parts which investigate studies related to the online space and which focus on the following research streams: (1) trend detection, (2) community detection, and (3) influential users. The analysis gives an overview of applied methods and considered features in the above-mentioned research fields. Combined with Chapter 2, it provides the necessary knowledge base about relevant definitions, important elements and methods for the construction of the conceptual framework of TSIM, which is described in Chapter 4 and answers RQ 1. The framework consists of a two-step approach, which resembles a funnel. It presents a process that can be applied to identify OTS accounts out of the huge data source of OSNs, and involves a concept for the detection of a topic-focused community as well as for the development of a classification model. The latter includes a process for data labeling and a feature framework. The feature framework combines user-related, content-related, context-related as well as network-related metrics, and aims to describe users’ characteristics and behaviors based on social media data from OSNs. The baseline for this two-step approach consists of Rogers’ innovation diffusion model and insights from Chapters 2 and 3, especially regarding methods and features.

Based on the conceptual framework, Chapter 5 addresses RQ 2 and deals with the application of the framework. For this, two empirical data analyses are conducted using data from the social networking platform Instagram due to its high relevance for the fashion industry. Section 5.2 describes the collection of data and the resulting detection of a topic-focused community. This experimental phase is realized with data related to sneakers. The outcome of this section is a dataset containing all data from the identified sneaker community which are necessary to calculate the features defined in the feature framework. Based on this dataset, Section 5.3 subsequently focuses on the development of the classification model. Following the process steps of supervised machine learning, a training dataset is created by transferring the labeling concept to real data and applying the feature framework using different methods of social media analytics, e.g., statistical and network analysis as well as methods of text mining. The resulting training dataset constitutes the basis for the development of the classification model, which includes the training of four algorithms and three ensemble learning methods, and the comparison of classification results based on specific performance measures.
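To illustrate how such a comparison of classification results can work in principle, the following minimal sketch computes common performance measures for a binary OTS versus non-OTS classification and selects the best candidate. It is not the implementation used in this thesis. The choice of measures (accuracy, precision, recall, F1), the model names and the example labels are assumptions made purely for illustration.

```python
def classification_metrics(y_true, y_pred, positive="OTS"):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def select_best_model(predictions_by_model, y_true, measure="f1"):
    """Return the name of the model with the highest value of the measure."""
    scores = {
        name: classification_metrics(y_true, y_pred)[measure]
        for name, y_pred in predictions_by_model.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Hypothetical labels and predictions, invented for this example only.
    labels = ["OTS", "OTS", "non-OTS", "non-OTS", "non-OTS", "OTS"]
    candidates = {
        "model_a": ["OTS", "non-OTS", "non-OTS", "non-OTS", "non-OTS", "OTS"],
        "model_b": ["OTS", "OTS", "OTS", "OTS", "non-OTS", "OTS"],
    }
    print(select_best_model(candidates, labels, measure="f1"))
```

Which measure drives the selection is a design decision: if OTSs are rare in a community, accuracy alone can be misleading, which is why the sketch defaults to F1.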

RQ 3 focuses on the evaluation and transferability of TSIM and is covered by Chapter 6 of this thesis. The developed TSIM is applied to a use case related to sustainability using data from the platform Instagram. For this, a community on Instagram with a discussion focus on sustainability topics is extracted by using the developed community detection process, and the OTS accounts are identified by applying the developed classification model. The evaluation is realized by investigating past and current trends related to the sustainability movement. To this end, the communication of OTSs who are identified by the TSIM is analyzed regarding the occurrence of trend-related topics and their evolution over time. To further assess their trend detection potential, the communication about these topics over time is compared to the communication of the non-OTS group as well as to the evolution of the respective Google Trends index, which reflects the mass market. Additionally, to validate that the developed approach is better than existing methods, the communication of OTSs about these topics is also compared to that of the group of influential users identified by the number of followers. OTSs identified by the TSIM are thereby assumed to communicate about an upcoming trend earlier than the rest of the community and earlier than users considered influential based on their number of followers. The analysis is realized using topic modeling and conducting several hashtag analyses. The topic modeling approach enables the identification of appropriate keywords. Based on these keywords, a hashtag analysis allows the investigation of the popularity of trend-related topics in the respective group over time and gives indications about the potential influence of OTSs on the other groups.
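The idea behind such a hashtag analysis can be sketched as follows: for each user group and month, the share of posts containing a trend-related hashtag is computed, and the groups are compared by the first month in which this share reaches a threshold. This is only an illustrative sketch, not the analysis carried out in Chapter 6. The record layout, the group names, the hashtag and the threshold are hypothetical.

```python
from collections import defaultdict


def monthly_share(posts, hashtag):
    """Share of posts per (group, month) that contain the given hashtag."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for post in posts:
        key = (post["group"], post["month"])
        totals[key] += 1
        if hashtag in post["hashtags"]:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


def first_month_above(shares, group, threshold):
    """Earliest month in which the group's share reaches the threshold."""
    months = sorted(
        month for (g, month), share in shares.items()
        if g == group and share >= threshold
    )
    return months[0] if months else None


if __name__ == "__main__":
    # Hypothetical posts; months are ISO strings so that they sort by time.
    posts = [
        {"group": "ots", "month": "2019-01", "hashtags": {"zerowaste"}},
        {"group": "ots", "month": "2019-01", "hashtags": {"ootd"}},
        {"group": "non_ots", "month": "2019-01", "hashtags": {"ootd"}},
        {"group": "non_ots", "month": "2019-01", "hashtags": set()},
        {"group": "non_ots", "month": "2019-03", "hashtags": {"zerowaste"}},
        {"group": "non_ots", "month": "2019-03", "hashtags": {"ootd"}},
    ]
    shares = monthly_share(posts, "zerowaste")
    for group in ("ots", "non_ots"):
        print(group, first_month_above(shares, group, threshold=0.3))
```

Using shares instead of absolute counts keeps groups of different size comparable, which matters because the OTS group is expected to be much smaller than the rest of the community.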

Chapter 7 presents a summary of the achieved results. Besides, the limitations of the study are pointed out with proposals for further empirical studies to improve the developed solution. To close the thesis, the implications for theory and practice are discussed.

2 Theoretical background and conceptual foundations

2.1 Trend diffusion

In the following section, the foundations of trend creation and diffusion are outlined. The objective is to get a clear understanding of what constitutes a (fashion) trend, which roles are trend-relevant according to fashion theories, and how trend creation and diffusion currently take place in the fashion industry. This serves to identify important elements and processes which need to be considered when developing a framework for the identification of OTSs.

2.1.1 Terms and definition

The original meaning of the term trend is “to turn” (Vejlgaard, 2008). In general, trends are changes that move in one specific direction and which differ in their cycles depending on the environment. Nowadays, the term trend is widely used with different meanings which relate to various scientific disciplines. To sociologists, for instance, a trend is a prediction of something that is going to happen in the (near) future, in a specific manner and which will be accepted by an average person. In the fashion industry, in the following also referred to as clothing or apparel industry, the term is mostly used to describe short-term movements which refer to new products related to, e.g., new colors, textiles and styles, and long-term changes, which focus on the directions in general, e.g., the emergence of new materials or new means of production (Kim et al., 2011). Both are influenced by broad trends in society, e.g., a changing economy, demographical movements, political events or changes, cultural movements, and technology development (Sproles and Burns, 1994). Style can be described as a “characteristic mode of presentation that typifies several similar objects of the same category or class” (Sproles and Burns, 1994, p. 7), e.g., the “” or “punk” style. A specific style is not necessarily a trend as a style does not imply that it is accepted by a specific number of people (Kim et al., 2011). Moreover, a vast number of different products can refer to a style. Various sneaker models, for instance, can be assigned to the fashion of streetstyle. Similar to the term trend, fashion is broadly used and has various meanings. It can refer to the clothing and apparel industry as the notion fashion industry is often used interchangeably with these two terms (Vejlgaard, 2008), but it also refers to a trend as it describes “a style of consumer product or way of behaving that is temporarily adopted by a discernible proportion of members of a social group because that chosen style or behavior is perceived to be socially appropriate for the time and situation” (Sproles and Burns, 1994, p. 4). According to this definition, the term fashion is not restricted to the clothing industry but can also be related to behavior (e.g., the usage of ), other product categories, or ideas (Kim et al., 2011). In the context of the clothing industry, however, fashion refers to the lifecycle curve of a specific product trend or a broader style trend. The concept of product or style lifecycle, which comprises the phases introduction, growth, maturity and decline as illustrated in Figure 2-1 (Sproles, 1981), is based on the proposition that all products and styles have a finite lifecycle, which differs in its rate and duration of use. Upon reaching a specific adoption rate, a product or style is called a fashion (trend), and depending on the target group, a fashion can be adopted only by a specific consumer group, e.g., a specific social group, or by a broad mass market (Easey, 2009; Kim et al., 2011). The concept supports the clothing industry in better predicting the performance of specific styles and products regarding sales or profitability (Easey, 2009; Kim et al., 2011).

[Figure: fashion lifecycle curve with the phases Introduction, Growth, Maturity and Decline; x-axis: time; y-axis: acceptance, measured as frequency of (new) adopters]

Figure 2-1 Fashion lifecycle (based on Easey, 2009, p. 171-172 and Kim et al., 2011, p. 10)

In the following the notion fashion is used to outline the relation to the clothing industry and a fashion trend in this thesis is summarized as follows:

Fashion Trend: A fashion trend describes a way of behaving, a style or an individually designed object (product) related to the clothing industry which is temporarily adopted by a discernible proportion* of members of a social system.

*) encompasses the first three adopter groups of Rogers’ diffusion model (innovators, early adopters, early majority)


The diffusion of fashion trends can be influenced by a variety of factors, e.g., world events, economic conditions, subcultural influences, social changes, entertainment, technological innovation, or fashion leaders. The latter includes designers and celebrities who are influencing the creation and diffusion of new styles. To further investigate how fashion trends spread across different trend actors and which roles are relevant for trend creation and diffusion, in the next section, trend diffusion theories that are widely used in the fashion industry are presented.

2.1.2 Trend diffusion theories

Several theories in the literature deal with trend diffusion and focus on different components, e.g., the source of a trend or the process itself. Regarding fashion trends, different fashion theories explain how trends spread across society. There are three well-known theories, the trickle-down, trickle-up and trickle-across theories, which all assume a social hierarchy and differ from each other in the source of new trends (Veblen, 1899; Simmel, 1904; Field, 1970; King, 1963). The first two theories presume a vertical diffusion of trends. The trickle-down theory describes the upper or elite class of people as the source of trend creation and the spread of new trends from the upper class downward (Veblen, 1899; Simmel, 1904). The second theory states that trends come from the lower class and subcultures and diffuse upwards (Field, 1970). King (1963), in contrast, assumes that new styles trickle across a social class horizontally. He suggests that the acceptance of new styles is influenced by fashion innovators and opinion leaders within a social class (Sproles and Burns, 1994; King, 1963). Another theory, which is related to the latter, is the theory of Collective Selection by Blumer (1969). It emphasizes that almost any creative or innovative person can become a leader of fashion trends (Blumer, 1969). Due to social media and other mass media, nowadays, consumers of different social classes have access to diverse sources of fashion information. Especially, the emergence of OSNs and the resulting high interconnectivity of people potentially enable influence across social groups. In contrast to King’s theory, consumers of different social classes can, therefore, be influenced by fashion leaders of their own social class as well as by members of an upper (e.g., celebrities) or lower class (e.g., any person on a social platform) (Kim et al., 2011). 
As social media platforms such as online social networking platforms provide the means for any creative or innovative person to spread their ideas, the theory of Blumer is further supported in that Web 2.0 allows any creative user to become a fashion leader. Social media platforms consequently enable the diffusion of fashion trends across social classes and provide any user the platform to become a fashion leader. This shows that fashion trend diffusion is more complex nowadays, and theories developed before the emergence of social media platforms need to be reconsidered in the context of Web 2.0. Besides these theories, another widespread model that originally focused on the diffusion of innovations is Rogers’ innovation diffusion model. Rogers’ theory investigates the process of trend diffusion by considering diffusion time and adoption rate (Rogers, 2003). It is one of the most influential theories in marketing and still sparks various follow-up research to date (Chang et al., 2015; Guille et al., 2013; Kobayashi and Lambiotte, 2016). Many researchers applied Rogers’ diffusion model to the area of the clothing industry and derived fashion-related insights (Behling, 1992). The model is also extensively used in studies related to social media (data) (Guille et al., 2013). Due to its relevance in both areas, fashion industry and social media studies, Rogers’ model serves as a profound baseline for this research and is described in more detail in the following. Rogers defines diffusion as “[…] the process in which an innovation is communicated through certain channels over time among the members of a social system.” (Rogers, 2003, p. 5). He distinguishes four main components of diffusion in his model, namely (1) innovation, (2) time, (3) social system, and (4) communication channel. The first element, the innovation, is defined as “[…] an idea, practice, or object that is perceived as new by an individual or other unit of adoption” (Rogers, 2003, p. 12). Hence, an innovation does not have to be a revolutionary invention. The important aspect is that the object is subjectively perceived as new by the user. 
This is important for the transition to the fashion domain as a new product or style can be considered an innovation and Rogers’ model, therefore, can be applied in the context of fashion trend diffusion. Another main characteristic of diffusion according to Rogers is the aspect of time. It considers how fast a product is adopted by the consumer, the so-called Innovation-Decision Process (Rogers, 2003). The element of time also plays a role in the rate of adoption in a system, which is measured as “[…] the number of members of the system who adopt the innovation in a given time period” (Rogers, 2003, p. 20). Rogers visualizes this adoption process by the Innovation Diffusion Curve, which shows the adoption rate of an innovation over time (cf. Figure 2-2). He assumes and shows in his work that the cumulative adoption curve is S-shaped, while the number of new adopters over time closely approaches a normal distribution (Rogers, 2003). He relates the adoption process to different adopter types, which represent the third element of diffusion, the social system. The categories classify the members of the social system according to their innovativeness into five classes: innovators, early adopters, early majority, late majority and laggards. Innovativeness describes the degree to which a user adopts innovations relatively earlier than other members of a social system (Rogers, 2003). As Figure 2-2 shows, innovators are the first 2.5% of consumers who adopt a new idea or innovation and represent the most innovative consumer segment. They have a high impact on the introduction of an innovation to a larger audience and, therefore, play a gatekeeping role in the flow of new ideas or innovations into a system (Rogers, 2015). Early adopters are the next 13.5% of consumers who adopt new ideas and innovations and are well integrated into a digital or real-world network (Kim et al., 2011). They are good communicators, have the greatest degree of opinion leadership compared to the other adopter segments, and serve as role models for later adopter groups (Rogers, 2003). The adoption of a new style or product by the next category, the early majority, decides if the new style becomes an established fashion trend. Combined with the late majority, this group comprises 68% of all adopters. Sproles and Burns (1994) refer to these groups as mass-market consumers or followers. The last 16% who adopt the style are the laggards.

[Figure: fashion lifecycle curve (Introduction, Growth, Maturity, Decline) divided into adopter categories: Innovators 2.5%, Early Adopters 13.5%, Early Majority 34%, Late Majority 34%, Late Adopters 16%; x-axis: time; y-axis: acceptance, measured as frequency of (new) adopters]

Figure 2-2 Fashion lifecycle related to Rogers’ adopter categories (based on Easey, 1995 and Rogers, 2003)

Thus, the Innovation Diffusion Curve is closely related to the model of fashion lifecycle (Easey, 2009), which is described in section 2.1.1, as it also explains the adoption process of a new product or style. In contrast to the lifecycle model, which distinguishes different lifecycle phases according to time and diffusion rate, Rogers’ model uses these two dimensions to divide the curve into five adopter categories. Figure 2-2 shows the fashion lifecycle related to Rogers’ adopter categories. Hence, both concepts model the diffusion of a new product or style across a social system and only differ in their perspective of the product on the one hand and adopter groups on the other hand.
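The percentages of the adopter categories follow from partitioning a normal distribution of adoption times at the mean and at one and two standard deviations below and one standard deviation above it. The following sketch reproduces these shares and assigns a single adopter to a category. It is meant as an illustration of the arithmetic only; the function names and the example values for the mean and standard deviation of adoption time are hypothetical.

```python
from statistics import NormalDist

# Upper category boundaries in standard deviations from the mean adoption
# time; None marks the open upper end of the last category.
UPPER_BOUNDS = [
    ("innovators", -2.0),
    ("early adopters", -1.0),
    ("early majority", 0.0),
    ("late majority", 1.0),
    ("laggards", None),
]


def category_shares():
    """Share of adopters per category under a normal adoption-time curve."""
    standard_normal = NormalDist(mu=0.0, sigma=1.0)
    shares = {}
    previous_cdf = 0.0
    for name, upper in UPPER_BOUNDS:
        upper_cdf = 1.0 if upper is None else standard_normal.cdf(upper)
        shares[name] = upper_cdf - previous_cdf
        previous_cdf = upper_cdf
    return shares


def adopter_category(adoption_time, mean_time, std_time):
    """Assign one adopter to a category based on the time of adoption."""
    z = (adoption_time - mean_time) / std_time
    for name, upper in UPPER_BOUNDS:
        if upper is None or z < upper:
            return name


if __name__ == "__main__":
    for name, share in category_shares().items():
        print(f"{name}: {share:.1%}")
    # Hypothetical adopter: adoption after 10 weeks, given a mean adoption
    # time of 20 weeks and a standard deviation of 4 weeks.
    print(adopter_category(10, 20, 4))
```

The exact shares under the normal curve are roughly 2.3%, 13.6%, 34.1%, 34.1% and 15.9%, which correspond to the rounded values of 2.5%, 13.5%, 34%, 34% and 16% used in the model. The first three categories together cover exactly half of all adopters.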


The fourth and last element of diffusion introduced by Rogers is the communication channel. It describes how information gets from one member of the social system to another (Rogers, 2003). He distinguishes mass-media channels and interpersonal channels. While the mass media as a channel refers to one-way communication from e.g., television to the receiver, interpersonal communication relates to two-way communication between two or more individuals. He states that mass media channels are efficient in informing a huge audience about the existence of an innovation whereas interpersonal channels are more effective regarding the persuasion of accepting and adopting a new idea. Social media platforms such as online social networking platforms can be categorized as both, mass-media channels and interpersonal channels as they provide a platform for one-way and two-way communication (Ellison and Boyd, 2013).

2.1.3 Trendsetters

“Trends are always created by people, so trend spotting is about watching people who create or are preoccupied with new and innovative styles” (Vejlgaard, 2008, p. 27). This underlines the importance of knowing the trend-relevant roles to recognize early signals of upcoming trends. Therefore, the following section aims to identify these roles and related insights according to traditional trend theories. In the remainder of this thesis, the identified trend-relevant roles are summarized under the term trendsetters. Depending on the theoretical perspective, there exist various definitions and understandings of social roles. According to the functional role theory, which is often cited in studies related to online social networks (e.g., Herrmann et al., 2004), a role is described as a characteristic behavior of individuals who have a specific position in a stable social system (Linton, 1936; Parsons and Shils, 1951). There is a huge body of research that investigates fashion trend-relevant roles, namely fashion leaders, and their characteristics. These studies aim for a better understanding of this group due to its major importance regarding fashion trend creation and diffusion (Beaudoin et al., 2000). Early fashion theories such as the previously mentioned trickle-down theory introduced the concept of fashion leadership and the related behavioral dimensions. In the following years, the group of fashion leaders was extensively analyzed, and the concept was further developed within studies such as those of King (1963), Summers (1970), and Baumgarten (1975), who investigate the role of fashion innovators and fashion opinion leaders as part of the group of fashion leaders. King and Ring (1980) refer to them as relevant fashion change agent roles, who can influence other people and trends. Baumgarten (1975) and Hirschmann and Adcock (1978), however, find that there is an overlap between innovativeness and opinion leadership, especially in studies related to the apparel industry (e.g., Summers 1970) and, therefore, introduce a third role, namely innovative communicator. It refers to someone who combines the characteristics of an innovator and an opinion leader (Hirschmann and Adcock, 1978). Most of these studies relate to the previously explained Rogers’ diffusion model and its adopter categories (Behling, 1992) as fashion leaders are assumed to be part of the first two adopter groups, namely innovators and early adopters (Beaudoin et al., 2000; Baumgarten, 1975). Rogers does not include opinion leaders as a group in his model but rather uses this concept to describe the group of early adopters (Behling, 1992). In the following, the three identified fashion leadership roles and their characteristics according to existing literature are outlined, although studies on fashion adoption have shown that fashion innovators differ little from fashion opinion leaders and innovative communicators. Behling (1992) therefore suggests summarizing these different roles under one term as they are difficult to separate from each other.

(Fashion) Innovator They interact commonly with other innovators and have a good understanding and knowledge in their specific interest fields and often are referred to as experts in this area (Rogers, 2003). Therefore, fashion innovators have a strong interest in fashion-related topics (Schrank and Gilmore, 1973; Darden and Reynolds, 1972). They are venturesome and at the same time more inner-directed than non-innovators (Rogers 2003). Studies also have shown that they have a high media usage and introduce new ideas from outside their social system (Darden and Reynolds, 1974; Rogers 2003).

Innovative Communicator This role refers to an intersection of characteristics of innovators and opinion leaders. They are more innovative than the average opinion leader and more socially active and communicative than innovators. Similar to opinion leaders, they can influence others (Hirschmann and Adcock, 1978). This already indicates that it is difficult to differentiate the three fashion leader roles clearly.

(Fashion) Opinion Leader The concept of opinion leadership has its origin in a study by Lazarsfeld et al. (1944) on the 1940 presidential election campaign in Erie County (Ohio). The authors disprove the assumption that media messages have a direct impact on the recipients and introduce a new model for information diffusion in mass media, the so-called two-step flow of communication (Katz and Lazarsfeld, 1955). They show the importance of opinion leaders in terms of information diffusion and their ability to influence others and spread information to them. They point out that information is not directly communicated to the recipients but passes via opinion leaders to a wider population of opinion followers, often with the opinion leaders adding their own interpretations. Therefore, opinion leaders have a personal influence on their immediate environment, but they are also likely to influence a broader audience (Katz et al., 2005). Polegato and Wall (1980) specifically focus on fashion opinion leaders and underline their decisive role in the mass acceptance of a new clothing style. Rogers also states that opinion leaders can influence others’ behaviors and attitudes (Rogers, 2015), and therefore, play an important role in trendsetting. Similar to Baumgarten (1975), who underlines their good embeddedness in their peer social group, he emphasizes their good network. Another frequently listed characteristic is their good communication skills. Summers (1970), for instance, finds that fashion opinion leaders communicate more with other people about fashion-specific topics and provide more trend-related information compared to the average consumer. They also have a high interest and involvement in fashion-related topics and use more fashion information sources than the average consumer (King and Sproles, 1973; Kim and Schrank 1982; Polegato and Wall, 1980). Lazarsfeld et al. (1944) also emphasize their credible knowledge in a specific field. Therefore, they serve as an information source for later adopter groups and are perceived as trustworthy experts in a specific field (Rogers, 2003; Summers 1970). 
With the emergence of Web 2.0 and the increasing usage of online social networking platforms, opinion leaders gained a new platform to exert their influence, and the number of studies investigating these so-called online opinion leaders has grown rapidly (Bamakan et al., 2019). The reason for the rising interest in this group of users is their strong influence, which stems from their high interconnectedness via social media platforms, and the resulting impact on other users’ attitudes, behaviors and decisions, such as their buying decisions. Some of these online opinion leaders use this influence commercially and collaborate with companies for marketing purposes. As a result, a new type of online opinion leader has arisen who exerts opinion leadership professionally: the social media influencer. Due to their high relevance within the fashion industry, this specific type of opinion leader is added to the identified traditional trend-relevant roles (Quelhas-Brito et al., 2020; Audrezet et al., 2020; Casaló et al., 2020) and described in the following.

Social Media Influencers

As companies recognized the value of online opinion leaders for business purposes, a new marketing channel, namely influencer marketing, emerged. It is based on the concept of word-of-mouth and profits from the ability of online opinion leaders to influence other users. Companies therefore try to identify opinion leaders on social media platforms who fit their brand in order to spread messages, create positive word-of-mouth, and influence the adoption of new products (Nirschl and Steinberg, 2018). The term social media influencer subsequently refers to online opinion leaders who cooperate with companies for different purposes, e.g., recommendation marketing (Kim et al., 2017a; Jin et al., 2019; Freberg et al., 2011). Although there is no clear understanding of who is a social media influencer (Rakoczy et al., 2018), several studies refer to them as social media users who have a large network of followers (De Veirman et al., 2017; Varsamis, 2018; Jin et al., 2019) across one or more social media platforms, e.g., YouTube, Instagram, Snapchat, or personal blogs (Freberg et al., 2011; Varsamis, 2018), and who shape the attitudes and behaviors of their followers (Freberg et al., 2011; Varsamis, 2018). One major reason for their success is their credibility (Audrezet et al., 2020). As more recent studies have shown that it is not the quantity but the quality of a user’s network that matters for his or her influence, network size loses importance and characteristics such as the interaction with one’s network gain relevance (Rakoczy et al., 2018).

The term social media fashion influencer especially refers to a user that creates mainly fashion content on social media platforms and has the power to influence the opinion and purchase behavior of others with their recommendations (Chetioui et al., 2020).

Social Media Influencer: is a specific type of online opinion leader who cooperates with companies to monetize her or his social media activities and influences her or his followers’ attitudes towards a product or brand.

More recent studies introduce the term trendsetters to describe the group of trend-relevant roles. Batinic et al. (2006), for instance, use this notion and define trendsetters, based on Rogers’ model, as “individuals who become aware of innovations earlier than their respective social group. They can isolate important elements and features of these innovations and tend to pass them on to their social environment” (Batinic et al., 2006, p. 60). From their perspective, trendsetting is related to the concepts of both opinion leaders and innovators and combines the two roles. This is in line with the suggestion of Behling (1992) to summarize all trend-relevant roles, as they are difficult to separate from each other. According to Batinic et al. (2006), an innovator or opinion leader is only a trendsetter if the innovation is communicated effectively and becomes a trend. This reflects the fact that not all innovators (for instance designers) create trends. Often, new ideas or products do not result in a major change in style or taste that affects large numbers of people. Only if an innovation spreads successfully does the style become a trend (Vejgaard, 2008) and the innovator a trendsetter. Vejgaard (2008) further states that trendsetters are typically not trendsetters in all categories. Therefore, the fashion leaders introduced above can be summarized as trendsetters if the newly introduced idea or product becomes a trend. The notion of OTS within this thesis, however, refers to the group of trendsetters on social media platforms who are trendsetters due to their social media activities and their characteristics on these platforms. They can be described as follows.

Online Trendsetters: comprise (fashion) innovators, innovative communicators, and (fashion) opinion leaders who create and spread new ideas effectively by their activities and characteristics on social media platforms, and therefore set trends.

Figure 2-3 summarizes the main insights of this section. It relates the two trend-relevant adopter categories of Rogers’ diffusion model (innovator, early adopter) to the identified trend-relevant roles and their main characteristics according to existing studies in the field of fashion leadership. These trend-relevant roles are summarized as (fashion) trendsetters, as they are crucial for an innovation, such as a new style, to become a trend.

[Figure 2-3: The figure relates the adopter categories of Rogers’ theory (innovator, early adopter) to the roles found in fashion research studies (innovator, innovative communicator, opinion leader) and in online studies (social media influencers, other online opinion leaders). Characteristics listed: innovativeness, expertise, influence, communicativeness, interconnectedness, credibility. If an innovation (e.g., a new style) becomes a trend, the roles are summarized as (fashion) trendsetters.]

Figure 2-3 Fashion trendsetters’ characteristics


This overview serves as a baseline for the identification of OTSs and is the starting point for the related work analysis (cf. section 3), which aims to identify metrics to describe users’ characteristics and behaviors based on social media data.

2.1.4 Trend diffusion in the fashion industry

After introducing the trend-relevant roles and their related characteristics, this section examines which actors take these roles and how trend creation and diffusion in the fashion industry have changed due to the emergence of social media. The aim is to better understand the role of social media in trendsetting.

The fashion system is an interaction between professionals of the fashion industry, such as designers, producers or retailers, who create and propose innovations, and consumers, who decide whether or not to adopt an innovation. According to this description, fashion professionals have an important impact on fashion trends as they create new styles and products and decide which of them are introduced to the market. They thus provide a selection of alternatives, and consumers decide by adopting them which of these become a trend. All actors who contribute to the creation of new products are summarized as developers in Figure 2-4 (Kim et al., 2011). The figure illustrates how the process of fashion creation and diffusion has changed due to social media. Traditionally, the process was mainly guided by the fashion industry. Designers start the product creation process by collecting inspiration at yarn and fabric trade shows, scanning information from forecasting agencies such as WGSN² (Worth Global Style Network), or doing trend scouting by traveling to different places around the world and observing people in the streets. The new styles and designs are presented at garment fashion trade shows to editors of major fashion magazines, (retail) buyers, and celebrities. These actors take a gatekeeper role: they add their opinion on the newly presented styles and pass it on to consumers via mass media channels, or they decide which styles and products to offer in retail stores. Afterwards, marketing campaigns are launched to push the new styles and products into the market, to influence the adoption rate, and to increase the probability that a new style or product becomes a trend. Due to the increasing usage of social media, however, the dictatorial power of the fashion industry over new styles and products has diminished, as those platforms provide consumers with more information and change the way products are created and diffused (Kim et al., 2011).

2 WGSN is a forecast service which covers a broad range of areas and topics related to the fashion industry.

Nowadays, fashion companies must respond to consumer preferences to stay competitive and profitable (Bhardwaj and Fairhurst, 2010). They must therefore recognize changes in consumers’ needs and attitudes early and create style concepts and products accordingly. They conduct consumer research to identify changing consumer lifestyles, preferences, and problems in order to improve decisions on product development.

[Figure 2-4: The figure shows three levels: Developer, (Gatekeeper) and Consumer. Actors shown: research and trend agencies; manufacturers and designers; trade shows; fashion shows; retailers/buyers; fashion journalists and magazine editors; celebrities; fashion bloggers and social media influencers; traditional consumers; social media users. Legend: traditional trend actors vs. new trend actors due to the emergence of social media; traditional communication medium vs. new emerging communication medium (social media); traditional way vs. new way (via social media) of diffusing styles and influencing trends.]

Figure 2-4 Fashion trend creation and diffusion process (based on Kim et al., 2011)

In recent years, however, forecasters and designers have started to use OSNs, indicated as social media in Figure 2-4, as a key source of information about consumer preferences (Kim et al., 2011). They therefore try to identify relevant content creators, the aforementioned opinion leaders, who use blogs and online social networking platforms to spread their ideas and opinions digitally, and cooperate with them. At the same time, some of these cooperating online opinion leaders, namely social media influencers, are invited to shows as the fashion industry recognizes their power to influence consumers’ decisions on adopting new styles and products. As these shows are among the most relevant events for trend diffusion in the fashion industry (Bhardwaj and Fairhurst, 2010), the power of social media influencers in trendsetting increases further. In addition to fashion journalists, celebrities and retailers, social media influencers shape how the new ideas of fashion designers are perceived by their followers (Jackson, 2007; Ahmad et al., 2015). Besides, active participants in OSNs serve as a source of inspiration and are integrated into the design and development process of new products. Fashion companies adopt concepts such as co-innovation or crowdsourcing to co-create with these actors. As indicated in Figure 2-4, the gatekeeping role of industry professionals described above steadily decreases as social media platforms and their active users take a major role in the process of product creation and diffusion. The relevant trend actors become more diverse, as potentially every active user of an online social networking platform can take the role of trend creator and spreader. At the same time, fashion professionals such as designers are also active users of social media platforms and use their accounts to spread their new design and style ideas and to receive feedback on them (Kim et al., 2011).
This shows that the process of trendsetting has become more complex in terms of relevant actors and also underlines the high relevance of social media platforms and their users regarding trendsetting. Active social media users have gained influence on new creations as well as on what will be accepted as a trend. Social media users can become designers, but designers also can become trendsetters due to the emergence of social media platforms and their increasing usage (Ahmad et al., 2015).

2.2 Online social networks

As outlined in the previous section, trend creation and diffusion have changed in terms of trend-relevant actors. Today, online social networking platforms and their users have a strong impact on what becomes a trend. Especially in the fashion industry, some of these users take a key role in trendsetting. Because their functions enable people to connect and interact, these platforms build huge online social networks and accelerate information diffusion. In this thesis, the term online social network refers to the social system or social network which is built among users on these platforms, whereas the term online social networking platform describes the technical components of these platforms, the provided functions, and the stored data which enables the analysis of users’ social behavior and characteristics.

Online Social Network: describes the social system or social network which is built among users on online social networking platforms, the content which is shared among these users and the interactions which take place within this social system.

In the following section, social media platforms, especially online social networking platforms, and their users are investigated. This serves to better understand how those platforms are used and which data their users generate. Moreover, the field of social media analytics (SMA) is introduced. This aims at building fundamental knowledge on how to reveal insights from the data provided on social networking platforms and, therefore, how to investigate behavioral patterns and characteristics of OTSs in OSNs.

2.2.1 Online social networking platforms

Online social networking platforms, also called social networking sites, are one specific type of social media application (Kaplan and Haenlein, 2010). According to Kietzmann et al. (2011), social media encompasses all types of interactive platforms where users (individuals and communities) share, co-create, discuss and modify user-generated content (UGC). In general, all types of social media platforms share some similar attributes (Stieglitz et al., 2018). One of those is the UGC which is a key element of social media and is defined as content, e.g., pictures, videos, comments, ratings, stories, and audio files, that is created by internet users (Obar and Wildman, 2015; Kreutzer, 2012). Another fundamental component of social media is the ability of social interaction between users. For this, social media offers several applications such as online social networking platforms that support the creation and the exchange of the above-mentioned UGC (Kreutzer, 2012; Kaplan and Haenlein, 2010). These social media applications often provide the option to create a user profile (Obar and Wildman, 2015), where the user chooses a unique username and provides specific personal information (Boyd and Ellison, 2008; Obar and Wildman, 2015). By interacting with each other and sharing personal information in their profiles, users of these platforms leave a digital trace, which enables the analysis of their social behavior and interests (Rakoczy et al., 2018; Batrinca and Treleaven, 2015; Boyd and Ellison, 2008) by applying diverse methods of SMA (Stieglitz and Dang-Xuan, 2013). Depending on the type of application, different information is available which can be investigated. There exists no clear categorization of the different types of social media applications as the boundaries are fluid (Ellison and Boyd, 2013), but they mainly differ in the motivation of usage. 
Some of them focus on the support of communication and interaction, such as online social networking platforms (e.g., Facebook, Instagram); others provide features that enable the cooperation of users (e.g., wikis) or facilitate content sharing, such as YouTube. Besides online social networking platforms, common types of such applications are blogs, microblogs, content communities (e.g., Pinterest, YouTube) and collaborative projects (e.g., wikis) (Kreutzer, 2012; Kaplan and Haenlein, 2010). Online social networking platforms focus on the support of interaction between users and therefore enable the fast diffusion of information. Consequently, they are highly relevant in terms of trend creation and diffusion (Bakshy et al., 2012; Guille et al., 2013).


A clear definition of online social networking platforms is challenging as their functions develop and change rapidly. However, similar to Kaplan and Haenlein (2010), Ellison and Boyd (2013) emphasize several main components of such platforms. They state that participants first have a unique user profile that consists of content provided by the user, the UGC. Additionally, the users of such platforms have connections that are visible to others in the form of a contact list that can be traversed by others. Lastly, they can consume, produce, and/or interact with the UGC of their connections. Both definitions highlight the function of online social networking platforms to provide a communication channel and support the exchange of content within a bounded group of users. Figure 2-5 shows the main components of online social networking platforms and the related social media data.

[Figure 2-5: Social media platforms comprise online social networking platforms, content communities, blogs and further types. The components of online social networking platforms (user profile, UGC, connections, interactions) are grouped into content data and network data, which together form the social media data. Annotation in the figure: online social networking platforms provide social media data that enables the analysis of users’ characteristics and behaviors in OSNs.]

Figure 2-5 Main components of online social networking platforms and related data

In the following, the main components and the respective functions are described in more detail. This provides an overview of the social media data which can be extracted from online social networking platforms and then investigated to identify OTSs. The presentation focuses on the most common functions and those which are relevant for the thesis. Table 2-1 summarizes the results but is not an exhaustive list of available functions.

A precondition of participating in online social networking platforms is the creation of a public or semi-public unique user account. For this, such platforms often provide the feature of uploading a profile photo as well as adding a self-descriptive text. Additional functions like the option to add a current status and update the profile content emphasize the dynamics of OSNs (Ellison and Boyd, 2013). This user-related information stored as social media data provides valuable insights about users’ characteristics and interests (Pennacchiotti and Popescu, 2011). Besides, these platforms offer features to produce and share content. This UGC can be textual or media-based, such as images and videos, and is often tagged with so-called hashtags, which assign the content to a specific topic area (Yang et al., 2012; Ellison and Boyd, 2013). Depending on the service, the published content often contains some contextual data such as a timestamp or information about the posting location (Batrinca and Treleaven, 2015). Therefore, this data encompasses content- and context-related information.

Closely related to the profile is its list of contacts, which is another characterizing feature of online social networking platforms. Users can create a contact list, which represents their social network and describes their social connections and relationships with other users. These connections can be bi-directional (undirected), representing a symmetrical relationship between users, or uni-directional (directed). A reciprocal connection is only established if both parties agree. On Facebook, for instance, so-called friendship relations only exist if two users follow each other. Twitter was one of the first platforms to allow a uni-directional connection between users (Ellison and Boyd, 2013), followed by others such as Instagram. One user can follow another, and therefore consume the shared content of this user, without being followed back. If user A follows user B, user B does not necessarily follow user A. User A, however, is connected with user B in the sense that the published content of user B as well as the news feed is visible to her/him. Therefore, user B potentially influences user A with her/his published content. These one-directional relations are often labeled as “fans” or “followers” (Boyd and Ellison, 2008). Besides, online social networking platforms provide different modes of communication, the one-to-many and the one-to-one mode. When users post content on their personal stream, it is visible to everyone in the OSN or only to their list of contacts, depending on the profile settings (public/private).
Several online social networking platforms have an announcement function to inform users about new content from members of their contact list, either in the form of a stream of updates (e.g., Instagram) or via automated messages. These media streams often serve as a source of novel information and can therefore support the diffusion of new ideas and trends. Many platforms also support private one-to-one communication via a private messaging or chat function (Ellison and Boyd, 2013). This data about one’s connections provides network-related information as it contains insights about a user’s position in her/his social network. It also provides information about a user’s activity, e.g., the frequency of publishing content on the platform. Connections between users can also be established through their interactions. Users can refer to others in their postings as well as react to the published content of others. Therefore, many platforms provide features like mentioning or tagging others in a posting, or sharing, commenting on and liking the postings of others. These interactions are visible to others and represent the social behavior of users (Ellison and Boyd, 2013). They contain information about users’ interconnectedness and social relations with other users (Gandomi and Haider, 2015).
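
The difference between uni-directional and reciprocal connections can be illustrated with a minimal sketch. It assumes a hypothetical list of follow relations; the user names and the helper functions `build_following`, `followers_of` and `reciprocal_pairs` are illustrative and do not correspond to any platform’s API. The relations are stored as a directed graph, in which an edge from A to B means that A can see, and may be influenced by, the content of B.

```python
from collections import defaultdict

# Hypothetical follow relations (follower, followed); names are illustrative only.
follows = [
    ("A", "B"),  # A follows B, B does not follow back (uni-directional)
    ("B", "C"),
    ("C", "B"),  # B and C follow each other (reciprocal)
    ("D", "B"),
]

def build_following(edges):
    """Map each user to the set of users she/he follows (directed edges)."""
    following = defaultdict(set)
    for follower, followed in edges:
        following[follower].add(followed)
    return following

def followers_of(user, following):
    """Users who follow `user` and therefore see her/his published content."""
    return {u for u, targets in following.items() if user in targets}

def reciprocal_pairs(following):
    """Pairs of users who follow each other, i.e. symmetric relations."""
    pairs = set()
    for u, targets in following.items():
        for v in targets:
            if u in following.get(v, set()):
                pairs.add(tuple(sorted((u, v))))
    return pairs

following = build_following(follows)
print(sorted(followers_of("B", following)))
print(sorted(reciprocal_pairs(following)))
```

In this toy network, user B has three followers but only one reciprocal relation, which shows why follower lists and symmetric friendships carry different network-related information.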

Online social networking platforms can be summarized as follows:

Online Social Networking Platform: is a type of social media platform. It provides its users with specific functions which enable the creation of a user profile and the publication of UGC, and which support the building of social networks by connecting and interacting with each other. It stores structured and unstructured social media data which include information about users’ social behaviors and characteristics.

The variety of functions for sharing different types of content, reacting to it, and interacting with each other yields a vast amount of information about human interactions (Boyd and Ellison, 2008). Table 2-1 presents the main components of online social networking platforms, common functions, and the potentially included information assigned to different categories, namely user-, network-, content- and context-related information. The table serves as orientation and structure for the identification of metrics which enable the measurement of behavioral dimensions and characteristics, which is part of sections 3 and 4.3.2.


Component | Function | Included information
User profile | Profile photo; self-descriptive text; current status update | User-related: personal characteristics; user’s interests
UGC | Textual (e.g., caption); media-based: images, videos, audio | Content- and context-related: type of content; information within content; time of publication; location of publication
Connections | Contact list; one-to-one: private messaging; one-to-many: personal stream | Network-related: position within the network; level of publishing activity
Interaction | Mentioning; tagging; sharing; commenting; liking | Network-related: type of social relations; level of one’s social activity; level of interconnectedness

Table 2-1 Social networking platform – components, functions and information

Many online social networking platforms provide access to these publicly published data via an application programming interface (API) (Batrinca and Treleaven, 2015), and therefore enable their analysis for different purposes.
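
To indicate how the information categories of Table 2-1 could be represented once data has been extracted, the following sketch defines simple record types. All class names, field names and sample values are hypothetical and chosen for illustration only; actual platform APIs return differently structured data.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class UserProfile:
    """User-related information (hypothetical fields)."""
    username: str
    bio: str = ""

@dataclass
class Post:
    """Content- and context-related information (hypothetical fields)."""
    author: str
    caption: str
    hashtags: List[str] = field(default_factory=list)
    published_at: Optional[datetime] = None  # context: time of publication
    location: Optional[str] = None           # context: location of publication

@dataclass
class Interaction:
    """Network-related information derived from interactions."""
    actor: str
    target: str
    kind: str  # e.g. "like", "comment", "mention"

def publishing_activity(posts, username):
    """Number of posts published by `username` (level of publishing activity)."""
    return sum(1 for p in posts if p.author == username)

def received_interactions(interactions, username):
    """Count the interactions received by `username`, grouped by kind."""
    counts = {}
    for i in interactions:
        if i.target == username:
            counts[i.kind] = counts.get(i.kind, 0) + 1
    return counts

# Illustrative sample data, not taken from any real platform.
posts = [
    Post("anna", "New denim look", ["denim"], datetime(2020, 5, 1), "Berlin"),
    Post("anna", "Spring colors", ["spring"]),
    Post("ben", "Sneaker drop", ["sneakers"]),
]
interactions = [
    Interaction("ben", "anna", "like"),
    Interaction("cara", "anna", "like"),
    Interaction("cara", "anna", "comment"),
]
print(publishing_activity(posts, "anna"))
print(received_interactions(interactions, "anna"))
```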

2.2.2 Users and communities

As outlined in section 2.1.3, a social role relates to a position in a social system, and depending on the social system, e.g., a specific group of friends, a person can take different roles. Therefore, a trendsetter can only be identified in the context of a specific group. As users of online social networking platforms can be members of several communities (Papadopoulos et al., 2012), a user can be a trendsetter in a group with an interest focus on , for instance, but a “mainstreamer” in a vegan-oriented group, as individuals tend to be influential only in one specific domain (Guille et al., 2013). Similar to the offline world, OSN users form social groups, which are often referred to as communities. As OSNs often consist of millions of nodes (e.g., users) and edges (e.g., follow-relations between users), which represent an unmanageable data volume, the detection of such communities is gaining increasing attention. The investigation of communities helps to reveal insights about behavioral patterns and characteristics which are transferable to the entire OSN, as communities summarize huge networks and size them down (Gandomi and Haider, 2015). Therefore, they provide the basis for the analysis of social systems and social roles, such as trendsetters, and are highly relevant within this thesis. The literature offers various definitions of a community. In the context of social media and OSNs, it is often defined as a collection of nodes, e.g., users on an online social networking platform, that are better connected to each other than to nodes outside this community (Girvan and Newman, 2002). More precisely, communities are sub-networks of users who interact more frequently with each other than with other users of the platform (Gandomi and Haider, 2015). Besides the structure of the network, more recent studies in the field of social media consider additional information, such as users’ similar interests, to define a community and its interactions more realistically. According to Bedi and Sharma (2016), a community is a group of similar and well-connected users. Abdelbary and El-Korany (2013) follow a similar understanding and describe it as “a collection of users who share the same interest(s) and interact with each other most likely than other users in the network” (Abdelbary and El-Korany, 2013, p. 50). Gupta et al. (2018) also emphasize the similarity of interest and state that the members of a community interact more with each other than with users outside the group. In addition to the traditional definition of a community, these authors add the attribute of the nodes’ similarity in terms of interest or behavior. Depending on the purpose of the study, the literature often differentiates two types of communities: topological-based communities (structure of the community) and topical-based communities (topic focus of the community) (Ding, 2011).
As the role of being a trendsetter often relates to a specific topic, and the precondition of trend diffusion is the interaction (communication) of members of a social system, both community types are relevant within this research.
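
The structural definition of a community, i.e. nodes that are better connected to each other than to the rest of the network, can be made tangible with a small sketch. It uses modularity, a widely used quality measure for network partitions that the text above does not itself name; the choice of measure, the toy graph and the function name `modularity` are assumptions made for illustration. For an undirected graph with m edges, the value is computed as the sum over all communities of (internal edges / m) minus (sum of member degrees / 2m) squared.

```python
from collections import defaultdict

# Toy undirected network: two triangles joined by a single edge (c-d).
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]

def modularity(edges, communities):
    """Modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    community_of = {
        node: idx for idx, members in enumerate(communities) for node in members
    }
    internal = defaultdict(int)    # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # sum of the degrees of the community members
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )

good_split = [{"a", "b", "c"}, {"d", "e", "f"}]  # follows the two triangles
bad_split = [{"a", "d"}, {"b", "c", "e", "f"}]   # cuts through both triangles
one_group = [{"a", "b", "c", "d", "e", "f"}]     # no community structure assumed

for name, partition in [("good", good_split), ("bad", bad_split), ("one", one_group)]:
    print(name, round(modularity(edges, partition), 3))
```

The partition that follows the two triangles obtains the highest value, which matches the intuition that its members are better connected to each other than to outsiders. Topical similarity, as discussed above, is not captured by this purely structural measure.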

A lot of research also investigates the members of those communities, their user behavior, and their position in the respective community. This aims at differentiating various groups of users and analyzing their roles within the respective network. A major part of research focuses on the classification of users according to their level of influence on their community (e.g., Rehman et al., 2020; Lin et al., 2018; Chen et al., 2017), which is analyzed in more detail in section 3.4 due to its relevance for this thesis.

2.3 Advanced social media analytics

As previously emphasized, social media platforms such as online social networking platforms encompass a huge volume of data ranging from structured data like activity-related data (e.g., number of comments, mentions, likes) to unstructured data like the UGC (e.g., textual content, images). Thus, extracting meaningful insights from this massive amount of social media data is challenging and requires sophisticated data collection and analysis methods (Stieglitz et al., 2014). Due to this, the new field of SMA arose with the emergence of Web 2.0 (Fan and Gordon, 2014). It can be described as follows.

Social Media Analytics (SMA): is the analysis of structured and unstructured social media data to reveal knowledge by combining, extending, and adapting existing analysis methods such as text analytics, social network analytics or supervised machine learning to social media data (Gandomi and Haider, 2015; Stieglitz et al., 2014).

In the following, major SMA steps as well as methods are introduced, which are relevant for the empirical study of this thesis.

2.3.1 Process

According to the number of citations, the most accepted SMA framework in IS is the one by Stieglitz et al. (2014). The process was further extended by Stieglitz et al. (2018) and encompasses four steps: discovery, tracking, preparation, and analysis (Stieglitz et al., 2018). The authors underline that the topic area and the objectives of the research first have to be clear, as this is the precondition for extracting the relevant data and deciding on the appropriate data source. This is realized in the first process step, discovery. The second phase, tracking, comprises decisions on the data source (e.g., a specific online social networking platform such as Facebook), the tracking approach, the tracking method, and the specification of the required data output. The latter also covers the definition of a specific timeframe for data collection, e.g., tracking data on a daily basis, at regular intervals of choice, or only once. There are different tracking approaches. Most commonly, data collection is based on keyword-, actor- or uniform resource locator (URL)-related approaches. An actor-related approach, for instance, refers to a process of data collection in which all information related to one specific user is extracted or monitored. As an example, the postings of one user on an online social networking platform, with all related comments, likes and further metadata, can be collected for a specific timeframe. The selection of tracking methods to realize this data extraction varies depending on the platform. Popular platforms such as Twitter or Instagram provide access to the data by API, Rich Site Summary (RSS) or Hypertext Markup Language (HTML) parsing. APIs are programming interfaces that can be used to request elements of a webpage, such as posts, images or likes, directly from the provider’s server. They are therefore a time-saving and efficient method for data gathering (Brügger, 2018). Data extraction from lesser-known platforms, however, often requires the design of an individual tracking method. The third process step is the preparation phase, which encompasses the data pre-processing. Besides the identification and elimination of outliers, missing values and noise, it can also include the removal of stopwords, low-frequency words, punctuation and abbreviations in the case of text data (Holsapple et al., 2018). The last process step is the analysis of the extracted data, which aims to reveal valuable insights from the gathered social media data. Depending on the pre-defined goal of the investigation, several analytical methods such as content-based analytics (e.g., text analytics), social network analytics (SNA), or machine learning can be applied (Batrinca and Treleaven, 2015).
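
The preparation step for text data can be sketched as follows. The example is a minimal illustration under stated assumptions: the stopword list is deliberately tiny, the threshold `min_count`, the helper names `tokenize` and `preprocess` and the sample postings are hypothetical, and the handling of abbreviations, outliers and missing values mentioned above is not covered.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real study would use a fuller,
# language-specific list.
STOPWORDS = {"the", "a", "an", "and", "is", "my", "this", "so", "in", "of", "to"}

def tokenize(text):
    """Lowercase the text and keep words and hashtags; punctuation is dropped."""
    return re.findall(r"#?\w+", text.lower())

def preprocess(posts, min_count=2):
    """Remove stopwords and words occurring fewer than `min_count` times overall."""
    tokenized = [[t for t in tokenize(p) if t not in STOPWORDS] for p in posts]
    frequency = Counter(t for tokens in tokenized for t in tokens)
    return [[t for t in tokens if frequency[t] >= min_count] for tokens in tokenized]

# Illustrative postings, not real data.
posts = [
    "Love this new denim jacket! #denim",
    "The denim jacket is so back in style.",
    "My jacket, my rules #denim",
]

for tokens in preprocess(posts):
    print(tokens)
```

On such a small sample, the low-frequency filter removes most words; on real data the threshold would have to be chosen with respect to corpus size and the goal of the analysis.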

Figure 2-6 provides an overview of the SMA process steps according to Stieglitz et al. (2018) and the respective tasks.

[Figure 2-6: SMA process with the steps Discovery, Tracking, Preparing and Analysis, leading to Insights. Discovery: Topic area? Objectives? Tracking: Data source? Tracking approach? Tracking method? Required data? Preparing: Data pre-processing (cleaning, transformation, reduction). Analysis: Analysis methods?]

Figure 2-6 SMA process based on Stieglitz and Dang-Xuan (2013) and Stieglitz et al. (2018)

As outlined in section 2.2.1, online social networking platforms provide various data related to the user, the user’s network, and the published content. To profit from this rich but noisy and mostly unstructured social media data in terms of extracting valuable insights, SMA makes use of various analytical methods (Stieglitz et al., 2014). Figure 2-7 shows the methods which are especially relevant within this thesis to transform the data provided by online social networking platforms into valuable information. These methods, which are text analytics and SNA, are introduced in the following.

[Figure 2-7: Data provided by online social networking platforms. Labels: unstructured data, structured data; content data (user profiles, UGC) and network data (connections, interactions); social media analytics with content-based analytics and social network analytics. Note in the figure: Methods of social media analytics enable the generation of knowledge about users’ characteristics and behaviors in OSNs based on data which is provided by online social networking platforms.]

Figure 2-7 Data from online social networking platforms and applied SMA methods

2.3.2 Text analytics

The main information sources from online social networking platforms are the UGC including the profile information as well as the relationship and interaction data of users (cf. section 2.2.1). Thus, two important pillars of SMA are content-based analytics and SNA. Content-based analytics investigates the posted content (e.g., posting or profile description) such as text, images and videos. Depending on the type of postings, methods of text, image or video analytics can be applied to extract information (Gandomi and Haider, 2015). As text data takes an important role in social media (Hu and Liu, 2012), text analytics methods are especially relevant for the investigation of the data extracted from online social networking platforms, and therefore, for this thesis. Text analytics, also referred to as text mining, encompasses methods to seek and extract useful information from a large volume of textual data such as profile information, postings or comments. It aims to reveal structured patterns from unstructured textual data to discover knowledge (Salloum et al., 2017). For this purpose, it includes statistical analysis, computational linguistics and machine learning (Gandomi and Haider, 2015), and uses a variety of techniques such as information extraction (IE), summarization, information retrieval, natural language processing (NLP) and clustering (Talib et al., 2016). Hu and Liu (2012) divide the process of text analytics into three consecutive phases: text pre-processing, text representation and knowledge discovery. Figure 2-8 shows the process steps with their common methods. In the first step, different pre-processing and cleansing operations are applied such as the removal of stop words (elimination of general and meaningless words), lemmatization (reducing inflected word forms to their base form) and tokenization (splitting the document into smaller entities such as sentences or single words) (Hu and Liu, 2012). These operations are important, especially regarding social media data, as social media users tend to use informal language including slang, jargon and misspellings. This noise causes inaccurate results, and therefore, it should be eliminated by text pre-processing (Baldwin et al., 2013). This procedure transforms text data into a standardized form and enables its aggregation. The second phase, text representation, comprises the transformation of text data into numeric vectors in order to apply mathematical operations and models to them. Common techniques are, for instance, Bag-Of-Words (BOW) or Vector Space Models (VSM) (Hu and Liu, 2012). Other techniques such as IE or NLP are also applied for this task and provide a more sophisticated way of representation (Aggarwal and Zhai, 2012). In the last phase, machine learning and data mining methods are used to extract latent and useful information from the documents. Classification to determine sentiments, clustering to detect topics, or cosine similarity calculation to detect associations are some of these techniques which enable the discovery of knowledge (Hu and Liu, 2012). Figure 2-8 showcases the usage of text analytical methods within the thesis.

[Figure 2-8: Data: user profiles and UGC. Content-based analytics comprises video analytics, text analytics and image analytics. Text analytics framework (method): pre-processing (e.g., stop word removal, lemmatization, tokenization), representation (e.g., BOW, VSM, IE, NLP) and knowledge discovery (e.g., classification, clustering, sentiment analysis). Application focus: feature extraction (identification of features which enable the measurement of trendsetters’ characteristics) and community detection (identification of communities with a specific topic).]

Figure 2-8 Application of text analytics within the thesis
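The three phases of text analytics described above can be illustrated with a minimal Python sketch. It is not the implementation used in this thesis: the stop word list, the example postings and all function names are illustrative assumptions, and lemmatization is omitted for brevity. The sketch tokenizes postings and removes stop words (pre-processing), builds BOW vectors (representation) and calculates the cosine similarity between postings (knowledge discovery).

```python
import math
import re
from collections import Counter

# Hypothetical, very small stop word list (assumption for illustration only).
STOP_WORDS = {"a", "an", "the", "is", "are", "my", "and", "of", "in", "to", "i"}


def preprocess(text):
    """Pre-processing: lowercase, tokenize into words, remove stop words."""
    tokens = re.findall(r"[a-z0-9#]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def bag_of_words(tokens):
    """Representation: map each term to its frequency in the document."""
    return Counter(tokens)


def cosine_similarity(a, b):
    """Knowledge discovery: cosine similarity of two BOW vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented example postings.
posts = [
    "I love the new sneakers of the season",
    "New sneakers are my love",
    "The weather is rainy in town",
]
vectors = [bag_of_words(preprocess(p)) for p in posts]
sim_fashion = cosine_similarity(vectors[0], vectors[1])
sim_other = cosine_similarity(vectors[0], vectors[2])
```

In this toy example, the two postings about sneakers share most of their terms and therefore obtain a high similarity, whereas the unrelated posting shares no term with the first one and obtains a similarity of zero.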

As text data such as profile data (e.g., the biography text of a user) and UGC (e.g., the content of text postings) include valuable data about users’ characteristics, interests and communication behavior, text analytics is used to create appropriate features which enable the measurement of trendsetters’ characteristics by data from online social networking platforms. Besides, it is applied to support the detection of topic-focused communities (cf. Figure 2-8).

2.3.3 Social network analytics

Besides text analytics, SNA is applied within this thesis to derive insights about the social behavior of users. SNA deals with the structural attributes of a social network and aims to reveal insights based on the connections and interactions of participants of an OSN (Gandomi and Haider, 2015). The structure of a network is composed of nodes and connections between pairs of nodes and can be visualized as a graph. Both nodes and edges can hold further information on the actors and the connections between them (node and edge attributes). Nodes can represent individuals such as the users of an online social networking platform, but also keyword tags, locations, events, or web pages (Wasserman and Faust, 1994). The connections are often referred to as edges, which can be directed or undirected. In directed graphs, an edge points in one specific direction, and the relations between the nodes are not necessarily reciprocal (Malliaros and Vazirgiannis, 2013). This is, for instance, the case if one user follows another user on a social networking platform (Krishna et al., 2018). Directed graphs are also used to visualize the diffusion of information and resources through a network (Scott, 2017). Graphs whose edges have no associated direction, in contrast, are called undirected graphs; they represent, e.g., a reciprocal relation between two users of an OSN, such as two users following each other (friendship) (Krishna et al., 2018). Edges can represent multiple types of relations between nodes such as a friendship (e.g., follow-follower relation) or activities (e.g., commenting or mentioning relations). Depending on the type of relation, social graphs are differentiated from activity graphs. Social graphs visualize the existence of a connection between two entities of a network, e.g., the friendship between two users. These connections between users are often used to identify communities (topological-based communities).
Activity graphs, however, represent current interactions between users such as mentioning each other in postings or commenting on postings, and therefore, contain rich information about social interactions and behavior (Gandomi and Haider, 2015). Specifications of these interactions such as the frequency of interaction or the type of relation can be considered in such a model by adding a weight. In an unweighted network, an edge is either present or absent, whereas in weighted networks some edges are thicker, representing the strength of a connection between two nodes (Oliveira and Gama, 2012). An edge can be weighted, for instance, with the number of comments between two users. SNA is commonly applied to detect communities or influential users in OSNs, and therefore, is highly relevant for the thesis. SNA mostly relies on structural metrics such as centrality measures (Bamakan et al., 2019). These metrics can be divided into actor-level (node-level) measures, e.g., centrality measures, and network-level statistical measures, e.g., density or average degree. Centrality measures such as degree, betweenness, closeness and eigenvector centrality are assigned to the group of actor-level measures and describe the positioning of one node within a network (Trappmann et al., 2011). These measures include information about the potential influence of an actor within the network (Ortiz-Arroyo, 2010). They are applied in the field of influence analysis and opinion leader detection in OSNs as they quantify the importance of a node within the network (Gandomi and Haider, 2015; Bamakan et al., 2019). Network-level measures are more relevant in terms of communities as they describe a network as a whole, and therefore, enable the comparison to other networks with respect to their characteristics. As Figure 2-9 shows, within the thesis, SNA is applied to connection and interaction data from the respective online social networking platform to extract appropriate features for the identification of trendsetters as well as for the detection of a topic-focused community.

[Figure 2-9: Data: connections and interactions. Social network analytics (method): actor-level measures (e.g., centrality) and network-level measures (e.g., density). Application: feature extraction (calculation of features which enable the measurement of trendsetters’ characteristics) and community detection (identification of communities with a specific topic focus).]

Figure 2-9 Application of social network analytics within the thesis
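The distinction between actor-level and network-level measures can be made concrete with a small Python sketch of a directed, weighted activity graph. The users, the comment counts and the function names are invented for illustration and are not taken from the thesis. The sketch computes the normalized in-degree centrality and the weighted in-degree of each node (actor-level) as well as the density of the directed graph (network-level), and assumes a graph with at least two nodes.

```python
from collections import defaultdict

# Hypothetical activity graph: (source, target) -> number of comments
# that the source user wrote on postings of the target user.
comment_edges = {
    ("anna", "ben"): 3,
    ("cara", "ben"): 1,
    ("ben", "anna"): 2,
    ("dave", "ben"): 4,
}


def nodes_of(edges):
    """All users that appear as source or target of an edge."""
    return sorted({node for edge in edges for node in edge})


def in_degree_centrality(edges):
    """Actor-level: share of other nodes that point to a node."""
    nodes = nodes_of(edges)
    counts = defaultdict(int)
    for _, target in edges:
        counts[target] += 1
    return {n: counts[n] / (len(nodes) - 1) for n in nodes}


def in_strength(edges):
    """Actor-level: sum of the weights of all incoming edges."""
    totals = defaultdict(int)
    for (_, target), weight in edges.items():
        totals[target] += weight
    return {n: totals[n] for n in nodes_of(edges)}


def density(edges):
    """Network-level: existing edges relative to all possible directed edges."""
    n = len(nodes_of(edges))
    return len(edges) / (n * (n - 1))


centrality = in_degree_centrality(comment_edges)
strength = in_strength(comment_edges)
graph_density = density(comment_edges)
```

In this example, the user "ben" receives comments from all other users and therefore obtains the highest in-degree centrality, while the density describes the activity graph as a whole.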


2.3.4 Supervised machine learning

To detect OTSs based on their behavioral patterns in OSNs and to reveal insights about their characteristics, supervised machine learning is used within this thesis. Supervised machine learning enables the identification of hidden patterns in data using various algorithms such as Decision Trees or Support Vector Machines (SVM). For this purpose, a model is trained on labeled data within the learning process. A labeled dataset contains the input (independent variables), e.g., features that describe a user’s characteristics, and the corresponding correct output (label) of samples, e.g., being an OTS. If the objective is to predict the label for future unknown data samples based on the independent variables of samples, then this is called predictive modeling (Brownlee, 2017). Classification is one form of predictive task and a subcategory of supervised machine learning which focuses on the prediction of categorical class labels such as the prediction of OTS accounts (Kotsiantis, 2007).

Shmueli and Koppius (2011) provide an overview of relevant steps for building a predictive model in IS such as a classification model. It encompasses eight steps and starts with the goal definition. This includes the determination of the objective to be predicted, e.g., OTS account, and the exploration and understanding of the problem space. The latter enables the decision on which data to consider in the data analysis and ensures the collection of all relevant data in the step of data collection and study design. Besides, this step also includes the determination of the required sample size. The data preparation step focuses on the handling of missing values as well as on the splitting of data into training, validation and holdout (test) sets. The subsequent exploratory data analysis aims at a better understanding of the collected data and the extraction of features. The choice of variables deals with the selection of appropriate features based on, e.g., theory, domain knowledge or empirical evidence. Within the step of choice of potential methods, appropriate algorithms, e.g., Decision Trees or ensemble techniques like Random Forest, are chosen. The next step consists of the evaluation, validation, and model selection, which also involves the training of models. In this step, the predictive performance of a model is measured by pre-defined performance metrics, e.g., accuracy. Besides, model validation aims to verify the ability of a model to predict new data accurately. For this, the overfitting tendency of a model is assessed by comparing the performance results on the training set and the validation set. The objective of the model selection is to find a model with high predictive performance. Therefore, various techniques such as feature selection and hyperparameter tuning can be applied to identify the best model. The final step, model use and reporting, focuses on the documentation of results. For this purpose, statistical reports are created which include, for instance, the predictive power of the model measured by the performance metrics, the relevant features and the used algorithms or methods. Furthermore, the gained insights, e.g., new knowledge about the behavior of OTSs, are outlined. Figure 2-10 summarizes the relevant steps of building a predictive model. These steps provide the basic structure for the creation of a classification model.

[Figure 2-10: Steps of building a predictive model and the respective tasks.
Goal definition: exploration of problem space; definition of objective to be predicted.
Data collection & study design: identification of relevant data and data source; determination of sample size.
Data preparation: handling missing values; splitting data.
Exploratory data analysis: calculation of features; handling outliers; recognition of feature correlations.
Choice of variables: choice of features based on theory, domain knowledge, association with the response (label).
Choice of potential methods: selection of method, e.g., choice of algorithms or ensembles.
Evaluation, validation & model selection: training of model; evaluation of prediction performance; assessment of overfitting.
Model use & reporting: statistical reporting, e.g., performance measures like accuracy; documentation of model benefit.]

Figure 2-10 Steps of building a predictive model (adapted from Shmueli and Koppius, 2011, p. 563)
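The steps of data preparation, evaluation and validation can be sketched in a few lines of Python. The example is a simplified illustration under stated assumptions, not the model developed in this thesis: the data (one invented feature per user and a binary OTS label), the fixed split sizes and the one-feature threshold classifier (a decision stump) are hypothetical, and the samples are split in their given order, whereas a real study would typically shuffle them first. The sketch trains the classifier on the training set and compares the accuracy on the training and validation sets to assess overfitting.

```python
# Hypothetical labeled samples: (feature value, label), label 1 = OTS.
samples = [
    (0.9, 1), (0.2, 0), (0.8, 1), (0.1, 0), (0.7, 1),
    (0.3, 0), (0.6, 1), (0.4, 0), (0.5, 0), (0.75, 1),
]


def split(data, n_train, n_validation):
    """Data preparation: split into training, validation and holdout sets."""
    train = data[:n_train]
    validation = data[n_train:n_train + n_validation]
    holdout = data[n_train + n_validation:]
    return train, validation, holdout


def predict(threshold, x):
    """Decision stump: predict the positive class at or above the threshold."""
    return 1 if x >= threshold else 0


def accuracy(threshold, data):
    """Evaluation: share of correctly classified samples."""
    hits = sum(predict(threshold, x) == y for x, y in data)
    return hits / len(data)


def fit_stump(train):
    """Training: choose the observed value with the best training accuracy."""
    candidates = sorted({x for x, _ in train})
    return max(candidates, key=lambda t: accuracy(t, train))


train_set, validation_set, holdout_set = split(samples, 6, 2)
threshold = fit_stump(train_set)
train_acc = accuracy(threshold, train_set)
validation_acc = accuracy(threshold, validation_set)
holdout_acc = accuracy(threshold, holdout_set)
overfitting_gap = train_acc - validation_acc
```

With these invented data, the stump fits the training set perfectly but misclassifies one of the two validation samples. Such a gap between training and validation performance is the kind of overfitting signal the validation step is meant to reveal; with sets this small, it mainly illustrates the procedure.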

Supervised machine learning and especially classification approaches are commonly applied to social media data for different purposes such as sentiment analysis (e.g., Agarwal et al., 2011), bot detection (e.g., Morstatter, 2016) or user classification (e.g., Kim et al., 2017b). As a classification approach allows the detection of OTSs by recognizing specific patterns in the provided data, such an approach is chosen within the thesis. The development of a classification model based on features that describe trendsetters’ characteristics and behaviors furthermore enables the exposure of insights about OTSs’ characteristics and behaviors in OSNs.

2.4 Interim conclusion

In the following, the key insights from the previous section are outlined.

(1) The detection of OTSs based on past trends can be realized by analyzing the communication about a specific product trend in OSNs over time. A product that is adopted by a specific number of consumers within a social system can be presumed to be a fashion trend. According to Rogers’ diffusion model, the time at which the members of a social system communicate about such a trend can be assumed to be the time of adoption, and the volume of conversation can be considered the adoption rate. One potential communication channel for the diffusion of a trend is online social networking platforms. The first individuals who talk about something new, which subsequently becomes a trend, are considered trendsetters.


(2) The role of OTS relates to the concept of online opinion leaders and social media influencers. Studies dealing with those users can provide valuable information for the identification of OTSs. Innovators and early adopters who create and spread a trend can be summarized as trendsetters. These adopter categories are closely related to fashion trend-relevant roles such as opinion leaders. Opinion leaders who are active on social media have attracted a lot of attention due to their influential power regarding product and brand perception based on their high interconnectedness on social media platforms such as online social networking platforms.

(3) Users in OSNs take an important role in trend creation and diffusion in the fashion industry. Besides the newly emerging online opinion leaders who influence other users in their buying behavior and tastes, traditional actors of the fashion industry such as designers also use online social networking platforms to spread their ideas, and therefore, become trendsetters.

(4) The role of OTS relates to a specific topic area and a closely connected topical community. Their detection, therefore, has to be realized in relation to such a topic-focused community. The role of a trendsetter depends on a specific interest field. As an individual can be a member of several social groups, she/he can take different roles depending on the group. Online communities represent a social group of users who interact regularly and/or have similar interests. Besides, trends spread across members of a social system who interact regularly with each other.

(5) Online social networking platforms provide relevant social media data to measure the characteristics and behaviors of users online, and therefore, provide the precondition of OTS detection according to these characteristics using SMA methods. As emphasized in section 2.1.3, trend-relevant roles are assigned specific characteristics. At the same time, users of online social networking platforms leave a digital trace in the form of profile data, connection and interaction data as well as content data which include information about their characteristics and social behaviors. Methods of SMA such as text analytics and SNA provide the necessary means to extract this information from social media data. Based on this information, supervised machine learning enables the detection of OTSs, and can also reveal insights about OTSs’ characteristics in OSNs.

Point (3) highlights the importance of users in OSNs regarding fashion trends, and therefore, emphasizes the relevance of this thesis. Furthermore, point (5) underlines that data from online social networking platforms provide the necessary prerequisite for the detection of OTSs as they include relevant information about users’ characteristics and behaviors online. For the realization of OTS detection based on data from these platforms, which is the focus of section 4 and section 5, findings of prior work provide the fundamental knowledge about which data to consider and how to reveal the necessary information. As points (1), (2), and (4) outline that studies investigating trend detection on social media platforms, community detection, and influential users can provide valuable knowledge for this research, the following section presents an overview of these related research areas and aims to identify:

relevant components of trend diffusion online, methods for community detection in OSNs, characteristics and behavioral dimensions of online opinion leaders, features that enable the measurement of identified dimensions, and methods that are applied for the detection of trend-relevant users.

3 Related work

3.1 Overview

In recent years, studies on trend detection on social media platforms have become a growing research field. These studies deal with the analysis of information diffusion, the properties of influential messages or the participating user groups (Guille et al., 2013). To profit from prior work, studies of this research field are analyzed in the following three sections. This aims to reveal knowledge about how to capture a trend and measure its diffusion in OSNs (section 3.2 and 3.3), which data and features to consider for the detection of influential messages (section 3.2) or users (section 3.4.1), and which approaches are used for this task (section 3.4.2). As there is a huge overlap of functions of the different types of social media platforms (cf. section 2.2.1), e.g., content sharing platforms and online social networking platforms, the focus is not only on studies analyzing data from online social networking platforms. Publications that deal with data from other social media platforms, e.g., blogs, are also considered. Another reason for the extension of the literature review is the availability of studies. There exist more studies dealing with other types of social media platforms such as blogs, as the interest in OSNs and related research has arisen only more recently.

The literature review is split into three parts, which focus on different components of Rogers’ diffusion model previously introduced (cf. section 2.1.2):

1) The diffusion process and attributes of an influential message (time, innovation) (section 3.2).
2) The detection of communities (social system) (section 3.3).
3) The identification of influential users (adopter categories) (section 3.4).

3.2 Trend detection on social media platforms

This first part summarizes important aspects of detecting new ideas and trends on social media platforms and the process of spreading across these platforms. A major part of studies in this research field investigates data of the OSN Twitter (e.g., Stai et al., 2018; Wang and Zheng, 2014; Lehmann et al., 2012; Ma et al., 2012; Tsur and Rappoport, 2012; Romero et al., 2011; Chang, 2010) due to its precondition for fast information diffusion and the good accessibility of data. Only a few studies (e.g., Figueiredo et al., 2014; Susarla et al., 2012) focus on content-sharing platforms such as YouTube. Several of the reviewed studies investigate hashtags (e.g., Stai et al., 2018; Wang and Zheng, 2014; Lehmann et al., 2012; Ma et al., 2012; Romero et al., 2011; Chang, 2010), their attributes and their popularity over time as they provide an easily trackable system of the information cascade (Stai et al., 2018). The diffusion of hashtags is measured by the number of postings that contain the specific hashtag over time (Wang and Zheng, 2014). Studies show that there exist different classes of hashtags that differ in their diffusion patterns (Stai et al., 2018; Wang and Zheng, 2014; Lehmann et al., 2012; Romero et al., 2011), similar to different trend lifecycles. Wang and Zheng (2014) distinguish three classes according to the temporal pattern and find that hashtags with a single spike pattern mostly relate to specific events or topics, and tend to be longer (more letters) than others. Hashtags are consequently an appropriate means to capture trends such as specific product trends (single spike pattern, specific hashtag) as well as broader movements with a longer lifetime (fluctuation pattern, generic hashtag) (Wang and Zheng, 2014). Several authors investigate the prediction of the popularity of hashtags, events or other UGC (e.g., videos) based on a variety of features using machine learning approaches (e.g., Figueiredo et al., 2014; Ma et al., 2012; Gupta et al., 2012; Tsur and Rappoport, 2012). Ma et al. (2012) inspect the properties of influential messages and try to predict hashtag popularity by using a classification approach. The popularity is measured by the number of users who adopt a specific hashtag or mention a specific event. They develop a hashtag profile that consists of six content features and eight context features based on data of the related tweet of the hashtag. They show that contextual features are the most effective regarding prediction accuracy and find that the Maximum Entropy classifier performs best (Ma et al., 2012). Similar to Ma et al.
(2012), Gupta et al. (2012) also measure popularity based on the number of users who mention a specific hashtag in their postings and include features of similar categories. They show in their analysis that ratio features, e.g., the number of retweets relative to the number of tweets related to a specific event, perform significantly better than absolute values such as the number of followers of a user who has mentioned the event. Tsur and Rappoport (2012) also examine the features related to a hashtag in terms of its probability to spread across the network. They apply a hybrid approach considering features related to the content and the topology of the social graph. They distinguish four different categories of features, which refer to the hashtag content such as character length, the overall sentiment of the respective tweet, its structural attributes, and features that relate to the time, e.g., the specific time of day at which the hashtag is used. They emphasize that content-related features play a decisive role in the acceptance of a hashtag by the community. Susarla et al. (2012) focus on the diffusion of videos on the social media platform YouTube, and outline that the social interactions between users are highly relevant in terms of which video becomes successful (spreads widely) and to which extent. They analyze the diffusion process within a specific community of similar interests as this enables the investigation of the network structure and the interaction patterns of users. This community approach aims to set the boundary for the analysis of network structure (Susarla et al., 2012). The reviewed studies show that hashtags are an appropriate means to capture a specific topic or trend. Their diffusion can be examined by the volume of postings containing a specific hashtag over time. Besides, researchers derive several features from social media data to reveal knowledge about relevant criteria of successful messages, which span from content- over context- to network-related features. To investigate the diffusion across a social system and to analyze the structure of the network, Susarla et al. (2012) suggest focusing on a specific community to enable these analyses and highlight the importance of users and their characteristics in successful information diffusion. As research on trend detection emphasizes that trends spread across a social group, such as a community (Salehi et al., 2012), and often focuses on a specific topic-related group, in the following, existing studies in the field of community detection are analyzed regarding insights supporting the extraction of an active topical community from OSNs.
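The measures used in the reviewed studies can be illustrated with a short Python sketch. The postings, the hashtags and the function names are invented for illustration and do not reproduce the implementation of any cited study. The sketch counts the postings containing a hashtag per day (diffusion over time), determines the users who have used the hashtag (popularity in terms of adopters) and calculates a simple ratio feature (number of retweets relative to the number of related postings).

```python
from collections import Counter

# Hypothetical postings: (day, user, hashtags, is_retweet).
postings = [
    (1, "anna", {"#dadsneakers"}, False),
    (1, "ben", {"#ootd"}, False),
    (2, "ben", {"#dadsneakers", "#ootd"}, True),
    (2, "cara", {"#dadsneakers"}, False),
    (2, "anna", {"#dadsneakers"}, True),
    (3, "dave", {"#dadsneakers"}, False),
]


def diffusion_curve(data, hashtag):
    """Number of postings that contain the hashtag, per day."""
    counts = Counter(day for day, _, tags, _ in data if hashtag in tags)
    return dict(sorted(counts.items()))


def adopters(data, hashtag):
    """Set of users who have used the hashtag at least once."""
    return {user for _, user, tags, _ in data if hashtag in tags}


def retweet_ratio(data, hashtag):
    """Ratio feature: retweets relative to all postings with the hashtag."""
    related = [is_rt for _, _, tags, is_rt in data if hashtag in tags]
    return sum(related) / len(related) if related else 0.0


curve = diffusion_curve(postings, "#dadsneakers")
n_adopters = len(adopters(postings, "#dadsneakers"))
ratio = retweet_ratio(postings, "#dadsneakers")
```

In this toy example, the posting volume rises from one posting on the first day to three on the second day and drops again on the third day, which resembles a single spike pattern on a very small scale.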

3.3 Community detection

There exists a huge body of research in this field that either focuses on the characteristics of the community members such as their interests or on specific group properties of a community such as their interaction density (Zafarani et al., 2014). The member-focused approaches rely on the assumption that individuals who share similar characteristics, like interests or behaviors (e.g., Gupta et al., 2018), often build social groups. Algorithms, therefore, aim to assign users with similar characteristics to the same community and measure users’ similarity. Group-based approaches, on the other hand, focus on the type of community and aim to detect e.g., dense communities with a high level of interactions within the group (Zafarani et al., 2014). For this research, both approaches are relevant as trend development requires active interaction and communication among the members of a social system, and often relates to a group with specific interests. Therefore, studies of both research areas are considered. The investigation of studies is conducted according to three aspects: the process initialization, the applied methods and considered data as well as the detection process.

3.3.1 Process initialization

Depending on the purpose of the studies, different tracking approaches are used. Most commonly actor- and keyword-based approaches are applied (cf. section 2.3.1). Keyword-related approaches are often based on hashtags or specific keywords posted either in a user's biography or in the posting text. The use of such keywords increases the probability that a user has an interest in a specific field; these approaches are therefore often applied in studies that aim to detect a community within a specific topic area (e.g., Morgan et al., 2019; Wang et al., 2018; Ferrara et al., 2014). Ferrara et al. (2014), for instance, aim to analyze the characteristics of an Instagram community. For its detection, they use several hashtags related to a specific competition to identify a set of initial seed users. The authors crawl 72 popular contest hashtags and randomly select 2,100 users that have posted at least one of these hashtags. In a second step, they gather all postings and the related media of these users and use this as a database for the subsequent analysis. The study of Wang et al. (2018) focuses on detecting a community of users on Twitter that have an eating disorder. They utilize the users’ profile descriptions for this purpose as they state that a user’s profile description is often regarded as the biography of a user, while statements within tweets are less trustworthy (Wang et al., 2018). Therefore, they track users that have mentioned specific keywords related to eating disorder diagnosis and personal information such as body weight indicated by keywords in their profile description. From there, the authors use a snowball sampling method to expand the group. For this, the follow-edges of initial users are utilized to identify more relevant user profiles. The profile descriptions of new users are subsequently checked for the defined keywords (Wang et al., 2018). Morgan et al. (2019), in contrast, follow an actor-related approach as they start with some initial seed users. For this purpose, experts of the relevant field (UK retrofit sector) define a list of 56 “core users”. Similar to Wang et al. (2018), they afterward proceed with a snowball sampling method using the follow-relations of the initial seed users. Both approaches are subjectively biased by the selection of seed users or seed hashtags. The first approach, however, reduces this bias by considering a higher number of seed users (3,380) who have posted specific seed words compared to 56 users in the other approach.
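The keyword-based initialization with subsequent snowball sampling can be sketched as follows. The sketch is only a simplified illustration in the spirit of the procedures described above and not the implementation of any cited study: the profile texts, the follow-relations, the keyword and the parameter `max_waves` are hypothetical. Starting from seed users whose profile description contains a keyword, the follow-edges are traversed, and new users are only added and expanded further if their profile description matches as well.

```python
from collections import deque

# Hypothetical profile descriptions and follow-relations (user -> followed users).
profiles = {
    "u1": "Sneaker collector and streetwear fan",
    "u2": "Daily sneaker news",
    "u3": "Football and barbecue",
    "u4": "Vintage SNEAKER restorer",
    "u5": "Sneaker blogger",
}
follows = {
    "u1": ["u2", "u3"],
    "u2": ["u4"],
    "u3": ["u5"],
    "u4": [],
    "u5": [],
}
KEYWORDS = {"sneaker"}  # assumed seed keyword


def matches(profile_text):
    """Check whether a profile description contains one of the keywords."""
    text = profile_text.lower()
    return any(keyword in text for keyword in KEYWORDS)


def snowball(seeds, max_waves=2):
    """Expand matching seed users along follow-edges for a number of waves."""
    community = [u for u in seeds if matches(profiles.get(u, ""))]
    seen = set(seeds)
    frontier = deque((user, 0) for user in community)
    while frontier:
        user, wave = frontier.popleft()
        if wave == max_waves:
            continue  # do not expand beyond the last wave
        for neighbor in follows.get(user, []):
            if neighbor in seen:
                continue
            seen.add(neighbor)
            if matches(profiles.get(neighbor, "")):
                community.append(neighbor)
                frontier.append((neighbor, wave + 1))
    return community


community = snowball(["u1"])
```

In this example, user "u5" matches the keyword but is not reached, because the only path to this user leads via "u3", whose profile does not match and who is therefore not expanded. This illustrates how the choice of seeds and keywords can bias the resulting community.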


3.3.2 Detection methods and considered data

In general, the topological-based approach, which investigates the connections of users, and the topic-based approach, which focuses on similar characteristics of users, can be distinguished. More recently, these approaches have also been combined, and network measures are weighted with metrics representing users' similarity. In the following, these methodologies are briefly introduced and discussed.

Topological-based approach

Most previous studies only focus on the connections between users to detect communities. They therefore rely on the links between people and do not consider other users’ characteristics and interactions within the network (Abdelbary and El-Korany, 2013). Jarukasemratana et al. (2013) and Kloumann and Kleinberg (2014), for instance, follow a local node expansion approach and use centrality measures to detect local communities. Jarukasemratana et al. (2013) especially focus on node closeness, whereas Kloumann and Kleinberg (2014) combine several centrality measures such as PageRank in a vector to then train an SVM algorithm.

Topic-based approaches

More recent studies base their detection approaches on the similar interests of users (e.g., Xiao et al., 2014; Abdelbary and El-Korany, 2013; Sachan et al., 2012). Community detection, therefore, relies on the content of the users’ interactions and measures users’ topical similarity by using techniques such as Latent Dirichlet Allocation (LDA) (Gupta et al., 2018). Users with a high similarity measure are more likely to be members of the same community. Xiao et al. (2014), for instance, use hashtags to identify a community that focuses on a specific topic.
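As a simple illustration of topical similarity, the following sketch compares two users by the overlap of their hashtags. Note that it deliberately swaps in the Jaccard coefficient as a lightweight stand-in for topic models such as LDA; the function name and the toy data are assumptions made for this example only.

```python
def hashtag_similarity(tags_a, tags_b):
    """Jaccard similarity of two hashtag collections (case-insensitive)."""
    a = {t.lower() for t in tags_a}
    b = {t.lower() for t in tags_b}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Toy data, made up for illustration.
users = {
    "u1": ["#streetwear", "#sneakers", "#ootd"],
    "u2": ["#sneakers", "#ootd", "#vintage"],
    "u3": ["#cycling", "#politics"],
}
sim_12 = hashtag_similarity(users["u1"], users["u2"])  # 2 shared of 4 distinct tags
sim_13 = hashtag_similarity(users["u1"], users["u3"])  # no shared tags
```

Users whose similarity exceeds a chosen threshold would then be treated as candidates for the same topical community.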

Mixed approaches

Darmon et al. (2015) emphasize the multifaceted nature of OSNs and therefore suggest considering all of this data within community detection to profit from the value of the available data. They underline that people can belong to several social (e.g., college friends or family) and topical (e.g., various interests such as cycling and politics) communities. They state that it is worthwhile to consider not only the topic similarity or structural interconnectedness of users for community detection purposes but also the type and level of interactions between the users, such as mentioning, tagging or retweeting activity. They show that especially the same topics (e.g., hashtags) and frequent conversations, such as mentions, are strong indicators that users belong to the same community (Darmon et al., 2015). Other researchers take a similar view and combine several approaches to consider various attributes for community detection. Sachan et al. (2012), for instance, combine connection, interaction as well as topical information to model communities. This underlines that the combination of various approaches is especially relevant for this research as it supports the detection of communities whose members are not only connected by, e.g., follow-follower-relations but also interact with each other and share similar interests.
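One conceivable way to combine structural, interaction and topical information is a weighted tie strength between two users. The sketch below is only an assumption for illustration: the weights, the interaction cap and the function name are hypothetical and are not taken from Darmon et al. (2015) or Sachan et al. (2012).

```python
def mixed_tie_strength(follows_each_other, n_interactions, topic_similarity,
                       weights=(0.3, 0.4, 0.3), interaction_cap=10):
    """Combine three signals into one score between 0 and 1.

    follows_each_other: structural link (bool)
    n_interactions:     e.g., count of mentions, tags or retweets
    topic_similarity:   value in [0, 1], e.g., hashtag overlap
    """
    w_struct, w_inter, w_topic = weights
    structure = 1.0 if follows_each_other else 0.0
    # Cap the interaction count so that very active pairs do not dominate.
    interaction = min(n_interactions, interaction_cap) / interaction_cap
    return w_struct * structure + w_inter * interaction + w_topic * topic_similarity


strong = mixed_tie_strength(True, 10, 1.0)  # connected, active, same topics
weak = mixed_tie_strength(True, 0, 0.0)     # connected only
```

A pair that is merely connected receives a low score, whereas a pair that also interacts frequently and shares topics receives a high one, which mirrors the argument for mixed approaches.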

3.3.3 Detection process

The study of Papadopoulos et al. (2012) focuses specifically on the application of approaches to real data from OSNs. The authors outline that iterative approximation schemes, such as the approach of Salehi et al. (2012), are more effective than optimization algorithms and clustering due to their computational efficiency and conceptual simplicity. Due to this, studies that follow an iterative process are investigated in more detail. Those studies mostly start with an initial seed word or user but differ in the metrics which are considered within the process. Khorasgani et al. (2010), for instance, identify a community starting with some top-leader nodes and their associated nodes which represent the leader’s respective community. New leaders and associated nodes (communities) are identified via an iterative process which ends when no new leader is identified. Leaders in their perspective are the most central members of a community. Their “Top Leaders Algorithm”, therefore, ranks the values of centrality measures and then only considers the top values as leaders within the next iteration. Salehi et al. (2012) also propose a sampling method that starts with an initial seed node and expands it by including a set of nodes which is close to the initial seed. For this, they calculate the PageRank value to determine the importance of every newly considered node to the seed. This iterative process is repeated until the community reaches a specific target size. Morgan et al. (2019) aim to identify the core of a network and, therefore, also follow an iterative process. Starting from a group of seed users, they extend the network by the follow-relation and then rank the users according to the strength of their connections to the other users in the considered group.
Therefore, to construct an efficient and applicable approach, it is valuable to follow an iterative process using different evaluation mechanisms such as ranking according to specific values. Figure 3-1 presents an aggregation of the relevant insights gained from the review of prior work in the field of community detection.

[Figure 3-1: process diagram with three stages. (1) Identification of seed users (process initialization): actor-based via expert recommendation, or keyword-based via hashtag search. (2) Community detection: evaluation of (new) potential community members using filtering, scoring, ranking (topic-based, topological-based or mixed approach), followed by the network-based identification of new potential community members via the follows or followers lists of the best-ranked potential community members; repeated until the target size is reached or no new potential community member is identified. (3) Community.]

Figure 3-1 Summary of results – literature review on community detection
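The iterative process summarized in Figure 3-1 can be sketched as follows. This is a minimal illustration under stated assumptions: the scoring function is a placeholder for any topic-based, topological or mixed measure, and the names `detect_community`, `top_k` and `min_score` are hypothetical.

```python
def detect_community(seeds, neighbours, score, target_size, top_k=2, min_score=0.5):
    """Grow a community from seed users.

    Each round collects candidates from the network of the best-ranked
    members of the previous round, filters them by score, ranks them and
    adds the top candidates. The loop stops when the target size is
    reached or no new member is identified.
    """
    community = list(seeds)
    frontier = list(seeds)  # best-ranked members of the previous round
    while len(community) < target_size and frontier:
        members = set(community)
        candidates = {n for u in frontier for n in neighbours.get(u, ())
                      if n not in members}
        ranked = sorted((c for c in candidates if score(c) >= min_score),
                        key=lambda c: (-score(c), c))
        frontier = ranked[:min(top_k, target_size - len(community))]
        community.extend(frontier)
    return community


# Toy network and scores, made up for illustration.
neighbours = {"s1": ["a", "b", "c"], "a": ["d"], "b": ["e"]}
scores = {"a": 0.9, "b": 0.7, "c": 0.2, "d": 0.8, "e": 0.6}
result = detect_community(["s1"], neighbours, lambda u: scores.get(u, 0.0),
                          target_size=4)
```

In the toy run, "c" is filtered out by its low score, and the second round admits only "d" because the target size of four is then reached.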

3.4 Identification of influential users on social media platforms

A large part of research in the field of trend detection focuses on user groups that are relevant for information diffusion across social media platforms and that especially influence the adoption rate of, e.g., new information or products within the network. These influential spreaders are often referred to as online opinion leaders (e.g., Rehman et al., 2020; Chen et al., 2017; Khan et al., 2015; Song et al., 2007), who were introduced in section 2.1.3. Others use the terms influencer (e.g., Rodriguez-Vidal et al., 2019; Rosenthal and McKeown, 2017), influential users (e.g., Segev et al., 2018; Weng et al., 2010), (community) leaders (e.g., Tsai et al., 2014; Shafiq et al., 2013), micro-influencers (e.g., Rakoczy et al., 2018) or trendsetters (e.g., Saez-Trumper et al., 2012), but also relate their studies to the concept of opinion leadership. As the role of OTS is closely connected to the concept of opinion leadership and persons who exert influence on others, in the following, 25 studies dealing with the detection of influential users or online opinion leaders on social media platforms are analyzed. The objective is to identify features that enable the measurement of behaviors and personality traits of users based on data from OSNs, and to gain insights into appropriate approaches that support the detection of influential users. The studies are selected either due to their relation of features to the characteristics of opinion leaders (e.g., Li et al., 2013) or their specific focus on OSNs (e.g., Rehman et al., 2020). The majority (17) of the considered studies investigate data from online social networking platforms (e.g., Twitter, Facebook, Sina Weibo, Instagram). Others focus on blog data (3) (e.g., Kayes et al., 2012), forum data (3) (e.g., Chen et al., 2017) or other social media platforms (2) such as a bookmarking site (Lü et al., 2011) and a social platform for exchange about adolescent health (Wang et al., 2018). 13 of the 25 studies use already existing data sets for their analyses, e.g., data sets provided specifically for research projects. Nine studies collect the data by applying various data collection approaches, such as a keyword-based, actor-based, community-based or time-related approach, or use a randomly selected data set.

3.4.1 Characteristics and measurement

These online-related studies show that the outlined main characteristics of online opinion leaders correspond to those identified in section 2.1.3, namely innovativeness, expertise, influence, interconnectedness, communicativeness and credibility. To translate these dimensions into quantitative measures, several researchers draw on insights from different social theories such as the concept of social capital (Rosenthal and McKeown, 2017). However, in the reviewed studies a clear assignment of metrics to one dimension is difficult as the dimensions expertise, credibility and interconnectedness, for instance, are assumed to make someone influential, and therefore, these dimensions are closely linked with each other. Nevertheless, Table 3-1 presents an excerpt of features that describe the characteristics of online opinion leaders based on social media data according to the reviewed studies. In the following, the key insights regarding the six dimensions and their measurement in the social media environment are summarized.

1) Innovativeness

A user’s innovativeness is underlined by the novelty of her/his shared content (Rosenthal and McKeown, 2017; Chen et al., 2017; Li et al., 2013; Saez-Trumper et al., 2012; Song et al., 2007). Chen et al. (2017) use the notion of “content generators” to express an online opinion leader’s innovativeness and Song et al. (2007) describe them as “novel content contributors” (Song et al., 2007, p. 972). The novelty of published content is an indicator of the innovativeness of a user and is mostly measured based on metadata such as the posting time compared to other users (e.g., Saez-Trumper et al., 2012). Some authors use negatively correlated measures, for instance, Li et al. (2013), who consider the number of times a user forwards the postings of other users as an indicator of not being innovative.
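A time-based novelty measure of this kind could, for example, rank users by when they first post an item (e.g., a hashtag). The sketch below is an assumption for illustration: the normalization to a score between 0 and 1 and both function names are hypothetical and do not reproduce the measure of Saez-Trumper et al. (2012).

```python
def adoption_ranks(first_use):
    """Map each user to a rank (1 = earliest) by first posting time of an item."""
    ordered = sorted(first_use.items(), key=lambda kv: (kv[1], kv[0]))
    return {user: rank for rank, (user, _) in enumerate(ordered, start=1)}


def innovativeness(user, adoptions_by_item):
    """Average normalized earliness over all items the user adopted.

    1.0 means the user was always first, 0.0 means always last.
    Items adopted by a single user carry no ranking information and are skipped.
    """
    values = []
    for first_use in adoptions_by_item.values():
        if user in first_use and len(first_use) > 1:
            rank = adoption_ranks(first_use)[user]
            values.append(1 - (rank - 1) / (len(first_use) - 1))
    return sum(values) / len(values) if values else 0.0


# Toy data: first posting time (in hours) per hashtag and user, made up.
adoptions = {
    "#tinybags": {"u1": 1, "u2": 5, "u3": 9},
    "#neon": {"u1": 2, "u3": 3},
}
score_u1 = innovativeness("u1", adoptions)
score_u2 = innovativeness("u2", adoptions)
score_u3 = innovativeness("u3", adoptions)
```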

2) Expertise

Similar to the fashion-related literature dealing with opinion leaders (cf. section 2.1.3), an often mentioned characteristic of online opinion leaders is their expertise in a specific field (Chen et al., 2017; Li et al., 2013; Saez-Trumper et al., 2012). Saez-Trumper et al. (2012) highlight that a user’s level of expertise varies across topics. Li et al. (2013) also find that a person who is an online opinion leader in one specific area can be a follower in another area. As Table 3-1 shows, a major part of the features describing a user’s expertise relies on content-based approaches that require the application of text analytics. Cha et al. (2010) underline the importance of considering the content of postings for measuring one’s expertise in a specific field. Rodriguez-Vidal et al. (2019) introduce the term domain signals, which correspond to the usage of topic-specific keywords. They involve complex text features which are based on topic modeling approaches to identify influential users. Li et al. (2013) also apply text analytics to measure the level of expertise. For this, they use LDA to investigate the topic focus within the postings and comments of a user. Pal et al. (2016) base their analysis on the profile description and identify topical authorities on Instagram based on their self-descriptive biography information. They show that the biography content reflects the users’ interests (Pal et al., 2016). Rakoczy et al. (2018) use the audience engagement to describe one’s expertise. They distinguish this influence quality from influence quantity, which refers to the audience size. The engagement is measured by the sum of reactions to the user’s published content, such as comments and shares.
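A very simple content-based expertise feature is the share of a user’s postings that contain at least one topic-specific word. The following sketch assumes a hand-picked keyword list instead of a topic model such as LDA; the function name, tokenization and example data are hypothetical.

```python
import re


def topic_focus(postings, topic_words):
    """Share of postings that contain at least one topic-specific word."""
    topic_words = {w.lower() for w in topic_words}
    if not postings:
        return 0.0
    on_topic = 0
    for text in postings:
        # Crude tokenization: lower-case letters, digits and hashtag signs.
        tokens = set(re.findall(r"[a-z0-9#]+", text.lower()))
        if tokens & topic_words:
            on_topic += 1
    return on_topic / len(postings)


# Toy postings, made up for illustration.
postings = [
    "New sneakers drop today",
    "Lunch with friends",
    "Vintage denim haul #fashion",
    "Rainy Monday",
]
focus = topic_focus(postings, {"sneakers", "denim", "#fashion"})
```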

3) Influence

The most commonly mentioned characteristic of online opinion leaders is their influence within a specific network (Rodriguez-Vidal et al., 2019; Chen et al., 2017; Rosenthal and McKeown, 2017; Khan et al., 2015; Li et al., 2013; Saez-Trumper et al., 2012). Lin et al. (2018) emphasize that the influence of online opinion leaders is restricted to their immediate environment, which indicates that they influence only a specific sub-network or community of a social media platform. As several researchers, such as Segev et al. (2018), state that the audience size is not sufficient to measure influence, in recent years, many academic studies have applied network topological measures to analyze influence. Therefore, they investigate a user’s position within the network and the existence of connections, the so-called social graph (Song et al., 2007; Khrabrov and Cybenko, 2010; Cha et al., 2010; Weng et al., 2010; Lü et al., 2011; Kayes et al., 2012; Liu et al., 2013; Chen et al., 2014; Khan et al., 2015; Rehman et al., 2020). Rosenthal and McKeown (2017) consider the average (avg.) time to respond to a user’s posting, such as the first comment, as a relevant indicator for influence.

Nevertheless, the majority of studies rely on structure-related data to investigate a user’s influence and calculate different statistical network measures from the field of SNA, such as PageRank or closeness.
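To make such structure-related measures concrete, the following sketch computes the in-degree and a basic PageRank on a small follow graph. It is a plain power-iteration sketch under common default assumptions (damping factor 0.85, even redistribution for users who follow nobody) and is not the configuration of any reviewed study; the toy graph is made up.

```python
def in_degree(follows):
    """Number of followers per user; an edge u -> v means u follows v."""
    counts = {}
    for user, targets in follows.items():
        counts.setdefault(user, 0)
        for v in targets:
            counts[v] = counts.get(v, 0) + 1
    return counts


def pagerank(follows, damping=0.85, iterations=50):
    """Basic PageRank by power iteration on a follow graph."""
    nodes = set(follows) | {v for targets in follows.values() for v in targets}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = follows.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new[v] += share
            else:
                # User follows nobody: spread the rank evenly over all users.
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank


# Toy follow graph: "a", "b" and "d" follow "c"; "c" follows "a".
follows = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
degrees = in_degree(follows)
ranks = pagerank(follows)
most_influential = max(ranks, key=ranks.get)
```

In this toy graph, "a" has only one follower but still receives a high PageRank because that follower is the highly ranked user "c". This illustrates why audience size alone is regarded as insufficient.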

4) Interconnectedness

The dimension of interconnectedness often relates to the number of followers of a user (Chen et al., 2017) or a central position in the network (Li et al., 2013). Chen et al. (2017) outline that influential spreaders have a large number of people who follow their comments or ideas. Therefore, the measurement of this dimension is often based on the activity graph of users and relies on topological measures, e.g., betweenness centrality. The calculations involve activity data such as mentioning, commenting/replying, liking and sharing/retweeting to consider the type and level of users’ interactions (cf. Table 3-1).

5) Communicativeness

A major portion of the studies further highlights the communication skills of online opinion leaders (Rosenthal and McKeown, 2017; Khan et al., 2015; Chen et al., 2017; Li et al., 2013). Li et al. (2013), for instance, consider the posting activity of a user as well as her/his reactions to other postings, such as the frequency of posting content or commenting on other postings, to measure the activity. Agarwal et al. (2008) use the term eloquence to describe one’s communication ability. The authors state that such skills are closely related to the engagement which a posting receives, as the engagement is an indicator of the effectiveness of the message. Engagement is also often cited as a measure of one’s expertise.

6) Credibility

Some authors especially emphasize the dimension of credibility (Rodriguez-Vidal et al., 2019; Rosenthal and McKeown, 2017). Rosenthal and McKeown (2017), for instance, define an influencer as “someone who has credibility in the group, persists in attempting to convince others, and introduces topics/ideas that others pick up on or support.” (Rosenthal and McKeown, 2017, p. 5). Similar to the dimension of expertise, metrics that aim to quantify credibility are often based on textual data (cf. Table 3-1). Rosenthal and McKeown (2017) assume that postings which indicate the content source or contain numbers that underline the facts are perceived as more credible. Therefore, they examine the content of postings with regard to the usage of URLs, numbers and quotes. Rodriguez-Vidal et al. (2019) additionally use metadata related to the profile. They consider the number (no.) of follows as a negative indicator for credibility as they state that a high number indicates a polite reciprocal relation without true interest in the posting content.
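Text-based credibility indicators of this kind can be extracted with a few regular expressions. The sketch below is an assumption for illustration: the patterns are deliberately crude, the feature names are hypothetical, and it counts URLs, numbers and the share of question sentences rather than reproducing the exact feature set of Rosenthal and McKeown (2017).

```python
import re

URL_RE = re.compile(r"https?://\S+")
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)?\b")


def credibility_features(text):
    """Count URLs and numbers and compute the share of question sentences."""
    urls = URL_RE.findall(text)
    # Remove URLs first so that digits inside links are not counted as numbers.
    without_urls = URL_RE.sub(" ", text)
    numbers = NUMBER_RE.findall(without_urls)
    # Naive sentence split on whitespace that follows ".", "!" or "?".
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", without_urls.strip()) if s]
    questions = sum(1 for s in sentences if s.endswith("?"))
    return {
        "n_urls": len(urls),
        "n_numbers": len(numbers),
        "question_ratio": questions / len(sentences) if sentences else 0.0,
    }


features = credibility_features(
    "Sales rose 12.5 percent in 2019. "
    "Source: https://example.org/report Is this a trend?"
)
```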

Table 3-1 presents an overview of the six dimensions with examples of features that are used for their measurement, the corresponding data from OSN, and examples of related studies. The extensive list is included in Appendix A.2 (cf. p. XXX).

| Dimensions | Data | Feature examples | References |
| --- | --- | --- | --- |
| Innovativeness | Interaction | Network-related (activity): number (no.) of sharing other users’ content | Li et al. (2013), Cha et al. (2010) |
| | UGC | Context-related: adopting time rank | Li et al. (2013), Saez-Trumper et al. (2012) |
| Expertise | Profile | User-related: biography-based interest focus | Pal et al. (2016) |
| | Interaction | Network-related (activity): no. of comments on the user’s postings; no. of likes a user receives | Rodriguez-Vidal et al. (2019), Rakoczy et al. (2018), Pal and Counts (2011), Cha et al. (2010) |
| | UGC | Content-related: no. of postings related to a specific topic/all postings; no. of topic-specific words used in postings | Rodriguez-Vidal et al. (2019), Chen et al. (2017), Li et al. (2013) |
| Influence | Connections | Network-related (social graph): in-degree, out-degree; centrality measures (e.g., eigenvector) | Rehman et al. (2020), Rakoczy et al. (2018), Chen et al. (2017), Khan et al. (2015) |
| | Interaction | Network-related (activity): no. comments/no. postings; no. of distinct commenters/no. postings; no. of posts with likes/no. of likes. Context-related: response time | Segev et al. (2018), Rosenthal and McKeown (2017), Li et al. (2013) |
| Interconnectedness | Interaction | Network-related (activity), calculated based on retweet, mention, reply network: in-degree, out-degree; betweenness centrality; no. of being mentioned or tagged by others | Rehman et al. (2020), Chen et al. (2017) |
| Communicativeness | Interaction | Network-related (activity): no. of comments a user adds on other postings; no. of replies on comments of own posting | Chen et al. (2017), Li et al. (2013), Agarwal et al. (2008) |
| | UGC | Content-related: no. of postings; length of a posting | Agarwal et al. (2008) |
| Credibility | Connections | Network-related (social graph): no. of follows; no. of followers/no. of follows | Rodriguez-Vidal et al. (2019) |
| | Interaction | Network-related (activity): no. of mentions in posting; no. of being mentioned in postings | Pal and Counts (2011) |
| | UGC | Content-related: no. of URLs in posting; no. of numbers in posting; no. of questions/no. of sentences in posting; usage of specific domain-relevant words | Rosenthal and McKeown (2017) |

Table 3-1 Feature examples for the measurement of opinion leaders’ characteristics

The structure of Table 3-1 relates to the main components of online social networking platforms and the different categories of information, e.g., user-related, which are presented in section 2.2.1 (cf. Table 2-1) and which form the basis for the categorization of features. A major part of the metrics used, for instance, is based on connection and interaction data and can be summarized as network-related features according to the information they include. Several of the studies also consider features which are based on UGC data to measure specific characteristics, like a user’s credibility. These features are referred to as content-related features (cf. Table 3-1). Besides, characteristics such as a user’s innovativeness are measured by context-related features, e.g., the posting time. User-related features, e.g., a user’s interest focus, which is used as an indicator for a user’s expertise in a specific field, are based on profile data. Segev et al. (2018) additionally integrate some features into their process which aim to identify artificial behavior, like bought engagement, to ensure the quality of results and improve the process effectiveness.

3.4.2 Approaches to detect influential spreaders

Research focusing on influential spreaders uses various methods for their identification and often combines several of them. Bamakan et al. (2019) distinguish six different approaches for opinion leader detection according to the analysis method. They differentiate descriptive, statistical and stochastic, diffusion process-based, topological-based, data mining and machine learning, and hybrid content mining approaches. The authors emphasize that with regard to high-dimensional data, hybrid content mining approaches perform better than the other methods as they combine techniques of text mining and network-based analytics. This advantage is simultaneously the weakness of these methods as they require massive amounts of data to be effective (Bamakan et al., 2019). The largest group (10) of the reviewed studies applies network topological approaches (e.g., Rehman et al., 2020; Liu et al., 2013), followed by data mining and machine learning approaches (9) (e.g., Rodriguez-Vidal et al., 2019; Segev et al., 2018), and hybrid content mining approaches (6) (e.g., Chen et al., 2014; Li et al., 2013). As being a trendsetter, similar to being an opinion leader, is considered topic dependent, content-related data is highly relevant for their detection. Although topological approaches are widespread, they do not consider dimensions like the user’s expertise in a specific field or contextual data, e.g., the time of posting, which is relevant in terms of innovativeness. The sole consideration of network measures ignores important aspects of opinion leadership and is therefore not sufficient to identify OTSs in large networks (Bamakan et al., 2019). Studies in the category of data mining and machine learning approaches often use techniques such as clustering as a filtering mechanism, and then proceed with a second method, e.g., dimension reduction, to identify the influential users (e.g., Chen et al., 2017). Another technique in this field is learning models, which aim to reveal knowledge by recognizing patterns within a large volume of data. A weakness of these learning approaches, which range from unsupervised over semi-supervised to supervised learning, however, is the lack of labeled data (Bamakan et al., 2019).
Evaluating the results of unsupervised approaches or training the algorithm within a supervised approach requires labeled data; the labeling is often performed manually by annotators and is therefore subjectively biased (e.g., Kim et al., 2017b; Lee et al., 2017). A major strength, however, is that such approaches allow the consideration of various user attributes like network, content and statistical measures and aim to recognize hidden patterns in the provided data (Bamakan et al., 2019). As the objective of this thesis is to uncover characteristics of OTSs, such as behavioral patterns in OSNs, this type of approach is deemed appropriate for this task. Several researchers suggest a multiple-step approach to identify influential users on social media platforms (e.g., Pal and Counts, 2011; Xiao et al., 2014; Chen et al., 2017; Rehman et al., 2020). Rehman et al. (2020), for instance, develop a two-step approach. They first identify a community of the most influential users in a network and then identify key users based on centrality measures. Chen et al. (2017) also suggest a two-step opinion leader detection approach, consisting of a community detection step and an opinion leader identification step using k-means clustering. Similar to Rehman et al. (2020), Pal and Counts (2011) rely on a two-step approach combining clustering with a list-based approach applying a Gaussian ranking algorithm. Xiao et al. (2014) identify a topic-focused community in a first step and then proceed with the calculation and ranking of two activity-based network metrics, namely RetweetRank and MentionRank, to identify influential users within this community. These authors state that a major step towards the influential core is the extraction of a sub-network of users from the whole network. Several studies also apply a ranking to the calculated values to identify the top opinion leaders (e.g., Segev et al., 2018; Xiao et al., 2014; Lü et al., 2011; Weng et al., 2010). Often, specific score values serve as a basis for the ranking (e.g., Segev et al., 2018; Kayes et al., 2012). Figure 3-2 summarizes the analysis of the related studies.

[Figure 3-2: process diagram with four stages. (1) Data collection: keyword-based, actor-based, community detection, time-based, random. (2) Feature categories: user-related, network-related, content-related, context-related. (3) Detection approaches: network topological approach, hybrid content mining approach, data mining and machine learning approach. (4) Influential users.]

Figure 3-2 Summary of results – literature review on online opinion leader detection

As supervised machine learning approaches allow the consideration of various user attributes and aim to reveal hidden patterns in the provided data, in the following, especially studies that focus on the classification of users by using supervised machine learning are examined. Besides the detection of additional relevant features for OTS identification, the objective of the analysis is also to reveal insights about appropriate algorithms for the classification of users in OSNs.

3.4.3 Classification approaches

For the classification of user accounts, the considered studies utilize user profile data (e.g., Lee et al., 2017; Pennacchiotti and Popescu, 2011), UGC data (e.g., Kim et al., 2017b; Lee et al., 2017; Lima and Castro, 2014), connection data (e.g., Pennacchiotti and Popescu, 2011; Reis et al., 2019) as well as interaction data (e.g., Morstatter et al., 2016; Reis et al., 2019). Lee et al. (2017), for instance, describe the development of a classification model for identifying fashion-related accounts on Twitter. For this purpose, they consider user- and content-related features for classifier training. They calculate a fashion measure that quantifies the fashion focus of accounts by counting the fashion-related words of a pre-defined list in the users’ postings as well as the presence of the word “fashion” within the biography. Pennacchiotti and Popescu (2011) classify Twitter users according to their political affiliation, their ethnicity and their Starbucks affinity, and apply a supervised machine learning approach for this task. They also calculate features related to the profile information, the posting behavior, the posted content (UGC) and the users’ social connections. Besides user-related features, e.g., the length of the username or the number of followers, they consider content-related features which refer to the user’s way of communicating on the platform and include measures like the average number of hashtags and URLs per tweet or the average time between tweets. They therefore include metadata, such as time information. They also introduce a category which they call linguistic content features. These features describe the interests of a user and their lexical usage, and are based on the usage of specific words and hashtags which are classified into different categories. Network-related features finally describe the social connections of a user and are based on the friendship network (Pennacchiotti and Popescu, 2011). Kim et al. (2017b) classify Twitter users with an interest in e-cigarettes into five classes and find that the inclusion of content-related features which describe a user’s posting behavior improves the classification result significantly compared to only considering metadata features such as whether a profile is verified. Content-related features include the number of URLs in tweets or the number of hashtags in tweets. For each aspect, they calculate the mean and standard deviation to also consider the value distribution of these characteristics. In total, they involve 58 behavioral and 15 context-related features for the multi-class classification (Kim et al., 2017b). Lima and Castro (2014) investigate the prediction of a user’s personality and distinguish two categories of features, the grammatical category (e.g., avg. length of text, avg. no. of question marks) and the social behavior category (e.g., avg. no. of mentions, no. of followers). In recent years, the detection of bot or fraud accounts in particular has gained attention. Studies dealing with the identification of fake accounts also use features that relate to the posted content, the user and her/his activity and the user’s environment, such as the social network structure (e.g., Reis et al., 2019). Morstatter et al. (2016), for instance, consider the average time between tweets as bots tend to publish many postings within a short period of time. In contrast to the reviewed studies in section 3.4, where only one of the studies considers profile data (Pal et al., 2016), the studies which focus on user classification on online social networking platforms emphasize that the profile description contains valuable information about the users’ interests and characteristics (e.g., Wang et al., 2018; Pal et al., 2016). Due to this, such data should be considered within the data analysis of this thesis.
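Behavioral features of the kind described above, i.e., the mean and standard deviation of per-tweet counts and the average time between tweets, can be derived with the Python standard library. The sketch below is an assumption for illustration: the feature names, the simple substring counting and the toy tweets are hypothetical and do not correspond to the feature sets of the cited studies.

```python
from statistics import mean, pstdev


def behavioural_features(tweets):
    """Compute simple posting-behavior features.

    tweets: non-empty list of (timestamp_in_seconds, text) tuples.
    """
    if not tweets:
        raise ValueError("at least one tweet is required")
    tweets = sorted(tweets)  # chronological order
    hashtags = [text.count("#") for _, text in tweets]
    urls = [text.count("http") for _, text in tweets]  # crude URL count
    gaps = [later[0] - earlier[0] for earlier, later in zip(tweets, tweets[1:])]
    return {
        "hashtags_mean": mean(hashtags),
        "hashtags_std": pstdev(hashtags),  # population standard deviation
        "urls_mean": mean(urls),
        "gap_mean": mean(gaps) if gaps else 0.0,
    }


# Toy tweets, made up for illustration.
tweets = [
    (0, "New drop #sneakers #ootd"),
    (60, "Details: https://example.org"),
    (180, "#vintage find"),
]
behaviour = behavioural_features(tweets)
```

A very small average gap between postings would, for instance, be one of the signals that studies on bot detection associate with automated accounts.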

Commonly applied algorithms and ensemble learning methods for the classification of user accounts in OSNs are Naïve Bayes, SVMs, Random Forest, Neural Networks as well as various boosting methods. Researchers mostly use Recall, Precision and F-Scores to measure the performance of algorithms. Lee et al. (2017), for instance, apply Naïve Bayes and SVM for the classification of user accounts into fashion-related and not fashion-related accounts. Both classifiers deliver similar results with an F1-Score reaching 67%. Lima and Castro (2014) also apply Naïve Bayes and SVM, and additionally test a Multilayer Perceptron Neural Network to predict one’s personality based on social media data, and find that all three classifiers have a similar performance. Pennacchiotti and Popescu (2011) use Gradient Boosted Decision Trees due to their faster decoding time and smaller resulting models compared to algorithms such as SVM. Similar to Lee et al. (2017), they use Precision, Recall and F1-Score to evaluate the classification results and apply 10-fold cross-validation. Morstatter et al. (2016) also focus on boosting methods as they state that bot behavior can be heterogeneous depending on the initializing party and the purpose of bots. They underline that boosting methods can better handle this issue as they combine several classifiers that can focus on the different types of bots. They use AdaBoost and find that the algorithm outperforms heuristics such as the posting length. Kim et al. (2017b) compare eight different algorithms and ensemble learning methods for the classification task. The Gradient Boosting Regression Trees perform best (F1-Score = 83.3%), followed by SVM, Logistic Regression and Random Forest. This is in line with the results of Reis et al. (2019), who apply five algorithms and ensemble methods and find that Random Forest and XGBoost perform better than k-Nearest Neighbor, Naïve Bayes and SVM.
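For reference, the three evaluation measures mentioned above can be computed from the counts of true positives, false positives and false negatives. The following sketch uses the standard definitions for a binary task; the function name and the example labels are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1-Score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy labels: 1 = fashion-related account, 0 = other account.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

With three true positives, one false positive and one false negative, all three measures equal 0.75 in this example.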

3.5 Validation of insights by experts from the fashion industry

To further enrich and validate the insights from prior studies and to include the practitioner perspective, interviews are conducted with experts in the related field of influencer marketing as they especially provide knowledge about OSNs and their influential users. As outlined in section 2.1.3, social media influencers are online opinion leaders who cooperate with companies and use their role of opinion leadership commercially. Due to their marketing value, the interest of companies in cooperating with online opinion leaders has recently increased, and a new business model has been established which focuses on the detection of online opinion leaders for marketing purposes, e.g., influencer marketing agencies. These agencies deal with the detection of appropriate online opinion leaders for business cooperations who fit the of their clients. As previously outlined, insights from the field of online opinion leader detection can be valuable for the identification of OTSs. This qualitative study also aims to avoid missing important aspects which are relevant for the detection of OTSs and which might not have been identified by the literature review.

3.5.1 Method

Semi-structured interviews are conducted with experts in the field of influencer marketing. For this purpose, an interview guideline is created which provides the structure and serves as orientation. This further ensures a specific level of consistency across the interviews and enables their comparison. This type of interview also provides the option to include open questions, adapt the order of questions, focus on specific aspects which are mentioned within the interview and ask spontaneous questions (Hopf, 2012). Therefore, it is appropriate for the objective of this study as it can reveal new relevant aspects for online opinion leader detection. The interviews specifically aim to reveal insights about the identification and selection process of online opinion leaders in a business context, which subsequently supports the concept development for OTS identification. For this purpose, the guide is developed based on the findings of section 3 and consists of several topical sections including questions about the identification process in general and, specifically, the selection and evaluation criteria. The latter comprises questions about network-related and content-related criteria. The experts are also asked to add further criteria which do not fit the provided categories. The focus is on the OSN Instagram due to its importance for the fashion industry and its good performance regarding influencer marketing. The interview guide is attached in Appendix A.3 (cf. p. XXXIII).

3.5.2 Data collection and analysis

The precondition for participating in the interviews is a specific level of knowledge and experience in the targeted field (Merkens, 2012). Therefore, the recruiting of experts encompasses agencies which are specialized in influencer marketing as well as agencies that are active in this field but also provide other online marketing services (full-service agencies). In addition, employees responsible for influencer marketing in fashion companies are contacted. Participation is requested by e-mail, and agencies as well as companies from Germany, Austria and Switzerland are considered in the selection. In this qualitative study, experts are defined based on their professional knowledge, which is measured by their working experience in the specific field (Bogner et al., 2014). The selected participants have worked for at least five years in the field of influencer marketing or have the authority to select social media influencers. An overview of the interviewed experts is shown in Table 3-2. The interviewees range from agency founders and managing directors (I2, I4, I5) to employees who are specialized in influencer marketing (I3, I6, I7). Expert I1 is a fashion company employee specializing in . The seven interviews are conducted between October 2019 and January 2020 and last between 20 and 36 minutes. All interviews are done by phone and recorded using a digital audio recorder.

Expert  Position                           Company
I1      Deputy Head of E-Commerce          Fashion company
I2      Founder and Managing Director      Influencer marketing agency
I3      Manager                            Influencer marketing agency
I4      Founder and Managing Director      Online marketing agency
I5      Founder and Managing Director      Influencer marketing agency
I6      Head of Social Media               Online and influencer marketing agency
I7      Content and Social Media Manager   Full-service agency with a focus on fashion

Table 3-2 Overview of interviewees

The transcription of the interviews is based on the recordings and is supported by the software ELAN (www.archive.mpi.nl/tla/elan). The rules for the transcription, which define how spoken language is transformed into written form, follow Kuckartz (2014). When defining these rules, the objective of the analysis has to be considered, as it determines how much information may acceptably be lost in the transfer from the spoken situation to the written form. Such rules can specify, for instance, whether dialects, intonations or details of the interview situation are transcribed (Kuckartz, 2014). Within this qualitative study, the interviews are transcribed literally without considering dialects. Long pauses, accentuations or loud speaking are highlighted by capital letters, for instance. The data is analyzed with a text-analytical approach using the qualitative content analysis according to Laudel and Gläser (2004). Their analysis concept is based on five steps and aims to classify the content of interviews into a category system. The advantage of this procedure is its intersubjective traceability (intercoder reliability). The categories within the qualitative study refer to the different components of the interview guideline (process and criteria) and are extended based on the data. In addition, some sub-categories are created. The software MAXQDA (www.maxqda.com) is used to support the coding, i.e., the assignment of text elements to the categories.

3.5.3 Findings

In the following, the key insights of the interviews are summarized according to the main categories of process and selection criteria.

Process

For the identification of online opinion leaders who are potential new social media influencers, the process starts with the search for appropriate user accounts. “New” refers to the fact that the user does not yet cooperate with the respective company for business purposes. In the marketing-related field, online opinion leader identification is separated into two strands: users who influence a community within a specific geographical area and users who influence a community within a specific topic area. For the latter, most of the experts use hashtags to first identify topic-related postings and the respective users, which is in line with the insights gained from section 3.3. Two of the experts (I3, I6) also use the follow and follower relations to identify further potential users, as also applied, for instance, in the study of Wang et al. (2018) (cf. section 3.3). After this first step, the collected accounts are evaluated according to different criteria that relate to the published content (UGC), their network and their profile data, and the pre-selected group of users is ranked accordingly. One expert emphasizes that they base their ranking on score values which are calculated from criteria like the quality of content and the interaction of the user’s community with the posted content. The process described by the experts, including the assessment of users, is mostly carried out manually. I1 and I3 even report that 100% of the process is conducted manually, and only one interviewee states that 70% of the selection process is automated (I2). This indicates that practitioners lack appropriate methods to automate the process. The application of such methods, however, would enable the consideration of more data and more potentially relevant users.

Selection criteria

The most frequently mentioned criteria category is the content-related one (34), followed by network-related (26) and user-related (14) criteria. For the final selection of social media influencers, the experts especially emphasize the topic focus, the content quality and the interaction of a user with her/his community. In the following, the key criteria within each category are outlined. Table 3-3 provides a list of the most relevant criteria according to the interviews.

1) Content-related criteria

Besides textual content, experts also examine visual content such as images and videos, as this is especially relevant in the fashion domain. They state that videos and dynamic content are gaining importance and therefore consider different aspects of posted visual content such as the esthetic of images (e.g., I5, I6, I7). As the definition of criteria to assess images is difficult, six of the seven experts underline that the evaluation of visual content is carried out manually and subjectively. This is supported by the fact that the experts, for instance, could not name any criteria of an esthetic image. As this applies to all the criteria mentioned for the evaluation of visual content, this category of media content is not considered further. Regarding the textual content, three of the seven experts name the quality of comments as a relevant criterion. They especially analyze whether the commenters show a real interest in the posting and whether they are real people or bots (e.g., I7). The content quality includes, for instance, a user’s expression or spelling errors. In addition, two of the experts emphasize the focus on a specific topic (I5, I6). Other mentioned criteria are personally addressing the community (I5), a user’s creativity (I6) and the usage and variety of hashtags (I2). The latter indicates whether a user’s interests are focused or varied.

2) Network-related criteria

As an indicator of a user’s influence, six of the seven experts emphasize the interaction of a user with her/his community. Besides analyzing the reactions of the community in the form of likes and comments, the experts underline that mutual interaction is especially important, e.g., whether the user responds to a comment (e.g., I5, I6).


To evaluate users with regard to their social media influencer potential, one of the most frequently mentioned aspects is the demographics of followers, which is especially important in the context of social media influencer detection as social media influencers are used for marketing purposes. The community of a user should therefore reflect the target group of the company, a requirement that is not relevant for OTS identification. In addition, two of the experts check whether the community of a user mainly consists of real users or of bots and commercial accounts such as company accounts or other social media influencer accounts. The interaction quality is measured by the number of questions in comments, as this reflects the community’s interest in the published content of a user. Furthermore, it is examined whether and how often the users respond to these questions and thereby engage with their social network (e.g., I3). Besides these qualitative aspects, six of the seven experts name the interaction frequency as a relevant criterion, e.g., the number of likes and comments on a posting. Another frequently mentioned criterion is the number of followers (I1, I3, I5, I6). I6 additionally names the ratio of the number of followers to the number of follows as a relevant criterion.

3) User-related criteria

User-related criteria can be divided into demographic-related and activity-related criteria. Demographic-related criteria refer to the characteristics of the planned marketing campaign for the cooperation with the user (e.g., home country for a country-specific campaign, I1) and are therefore not of relevance for this thesis. One relevant activity-related measure according to the experts, however, is the frequency of posting. Two of the experts emphasize that they examine the presence of commercial postings, as they state that commercially motivated postings for several different brands decrease a user’s credibility (e.g., I6). Two other experts underline the importance of considering the presence on several social media platforms, as this indicates a major influence.

Category  Criteria                                                 Mentions  Experts
Content   Comment quality                                          3         I5, I6, I7
Content   Comment sentiment                                        2         I4, I5
Content   Posting quality                                          3         I3, I4, I6
Content   Topic focus                                              2         I5, I6
Network   Demographics                                             3         I2, I3, I4
Network   Interaction quality                                      3         I3, I5, I6
Network   Fake accounts                                            2         I2, I4
Network   Commercial accounts (social media influencer accounts)   2         I3, I5
Network   Topic focus                                              2         I3, I6
Network   Interaction frequency                                    6         I1, I3, I4, I5, I6, I7
Network   Reach (follower)                                         4         I1, I3, I5, I6
User      Location                                                 4         I2, I4, I6, I7
User      Posting frequency                                        2         I5, I7
User      Posting frequency of commercially motivated postings     2         I2, I6

Table 3-3 Most mentioned criteria of social media influencer selection

Figure 3-3 summarizes the insights gained from the interviews and presents the selection process including the selection criteria and mechanisms named by the experts. Based on the interviews, the process starts, depending on the objective, with location-based, topic-based or network-based search criteria to identify an initial group of potential social media influencers. Within the selection process, the potential social media influencers are then evaluated on various aspects including user-related, network-related and content-related criteria. The assessment of users along these criteria is then used to rank the users and to select the users with the best values for cooperation.

[Figure 3-3: process diagram with four stages. (1) Identification of seed users (potential social media influencers) by search criteria: location-based via geo-location search, topic-based via hashtag search, network-based via follows or followers lists. (2) Evaluation of potential social media influencers by category and criteria: user-related (demography-based, activity-based), network-related (community-based, interaction-based), content-related (media-based, text-based). (3) Ranking: based on single criteria or based on specific scoring values. (4) Selection: final selection of social media influencers.]

Figure 3-3 Social media influencer identification process

To sum up, the interviews further support the insights gained from the related work analysis regarding a multi-step approach to identify online opinion leaders as well as the use of hashtags to guarantee a specific interest field of the considered users. The qualitative study also underlines the relevance of the quantity and quality of a user’s interaction with other users to evaluate a user’s influence within a network. The experts also regard the content quality and the topic focus of the posted content as relevant criteria for the identification of online opinion leaders, and thus, potential social media influencers. Furthermore, the interviews reveal some additional criteria such as direct addressing, which can be measured by the number of mentions used by a user in her/his postings. These criteria can further feed the feature framework. As the last point, some experts emphasize the need to filter out bots and users with artificial behavior before assessing the potential social media influencers. This is also an important aspect to consider within the development of TSIM.

3.6 Interim conclusion

(1) The diffusion of a product trend in OSNs can be analyzed by measuring the volume of postings containing a specific trend-hashtag within a community over time. Specific trends can be captured by hashtags that relate to the trend (“trend-hashtags”). The number of postings containing these hashtags indicates the adoption rate of a trend over time. For the investigation of trend diffusion, a specific community can be analyzed as a community presents a social system. Depending on the first time of using the trend-hashtag, the members of this community take the role of the different adopter categories such as the role of a trendsetter.

(2) Hashtags and specific keywords provide an appropriate starting point for community detection processes. Compared to the approach of selecting the seed users manually, the keyword approach contains a lower level of subjective bias and requires less manual effort to discover an appropriate number of seed users. This method starts with one or several hashtags or keywords describing the field of interest and collects a huge number of hashtag-related postings to then identify the seed users. This approach is also used in practice to select social media influencers.

(3) The involvement of connection, interaction, and content data in the community detection process allows identifying a topic-focused community with good interconnectedness and high interaction.


Whereas topological-based approaches ensure the interconnectedness of members, topic-based ones identify communities that contain members with similar interests or behaviors but who are not necessarily connected or in exchange with each other. By combining the two approaches and extending them by considering the level and type of interactions, the detection of active topic-focused communities can be realized.

(4) An iterative community detection process including different types of selection mechanisms such as filtering, scoring, and ranking is more effective than other existing approaches. Due to its computational efficiency and conceptual simplicity, several studies follow an iterative process starting with an initial seed (e.g., users) which is subsequently expanded (e.g., by the follow-relations of users). The members of the resulting group are then evaluated by various measures (e.g., PageRank), ranked according to the value, and finally, only the top x users proceed in the next iteration. As only a limited volume of data is considered in each iteration this method is very efficient.

(5) Network-, content-, user-, and context-related features allow the measurement of OTS characteristics and behaviors. Network-related features which are based on interaction data and content-related features have more relevance in terms of measuring influence, expertise, and credibility. The most frequently mentioned characteristics of online opinion leaders according to the reviewed studies are their influence, expertise, and credibility. For their measurement, features that relate to profile, connection, interaction, and UGC data are considered. According to the related category of data, network-, content-, user-, and context-related features can be distinguished. Network-based and content-based features, however, are the most widely applied for opinion leader detection as well as for social media influencer identification in practice.

(6) Multi-step approaches can improve the efficiency of opinion leader detection and the combination of multiple methods allows the consideration of various valuable social media data from OSNs. More recent studies apply a mixture of approaches combining text-oriented and topological techniques and use content mining and learning approaches.


Additionally, researchers highlight the value of multi-step approaches in terms of efficient opinion leader detection. Several researchers start with the extraction of a sub-network such as a community, and also include filter and ranking mechanisms in their detection processes.

(7) Boosting methods are especially promising regarding the performance of user classification based on social media data. Naïve Bayes, SVM, Logistic Regression, and Random Forest algorithms also perform well in previous studies. According to previous studies, besides boosting methods, Naïve Bayes, SVM, Logistic Regression, and Random Forest algorithms provide satisfying results in user account classification. Promising features are calculated based on profile, UGC, network connection, and interaction data.

These findings serve as a foundation for the conceptualization of the TSIM in the following section. As (6) emphasizes the efficiency of multi-step approaches, the TSIM is developed as such. Furthermore, prior work has outlined the importance of first identifying a sub-network, namely a community, before identifying key players within this group. Therefore, the TSIM includes two steps: the topic-focused community detection phase and the classification of the community members into OTS and non-OTS. The insights presented in (2), (3), and (4) support the development of a topic-focused community detection approach. The concept of classification profits from the aspects outlined in (5), (6), and (7). The development of an appropriate labeling concept is based on the findings of (1), and the creation of a feature framework utilizes the insights summarized in (5).

Based on this, the following section aims to:

- develop an overall concept for the TSIM which follows a multi-step approach,
- design a topic-focused community detection approach using the insights from the previous section,
- provide a labeling concept for the creation of a training dataset as a precondition for the training of machine learning algorithms and the development of a classification model, and
- create a feature framework that enables the measurement of users’ behaviors and characteristics based on data from OSNs.

4 Conceptual framework to identify online trendsetters

4.1 Two-step approach

Based on the findings of section 3, the TSIM is designed as a two-step approach (cf. Figure 4-1). The first step consists of a topic-focused community detection approach. Similar to methods for opinion leader detection, which often consist of multiple steps, this serves to first ensure a specific topic area and to improve effectiveness (cf. section 3.4), as accessing the data of a whole OSN incurs huge computational costs (Hamann et al., 2017).

The reasons for the focus on a specific topic community are the following:

1) Trendsetting is often restricted to a specific topic area (cf. section 2.4).
2) Trends spread across a social system and a community can be deemed a social system (cf. section 2.4 and 3.6).
3) Users can take different roles depending on the community, and therefore, the role of OTS depends on the membership to a specific community (cf. section 2.4 and 3.4).
4) The consideration of data of the whole OSN is not feasible due to the huge volume and limited data access (cf. section 3.3).

As Figure 4-1 shows, the process starts with the definition of the topic area, followed by the detection of a topic-focused community, within which the OTSs are then identified. In the following sections, these two main components are explained in more detail. The developed community detection approach, which is described in the following section (cf. section 4.2), provides a data collection procedure that enables the analysis of a relevant excerpt of the whole network. This follows the suggestion of Tsugawa and Kimura (2018), who show in their study that it is worth identifying influential users within a sample of the entire network. From there, the second step comprises the identification of OTSs within this topic-focused community based on their digital trace in OSNs by applying a classification model. The model is developed using supervised machine learning, as this makes it possible to exploit the value of social media data from OSNs by considering features that describe different aspects of user behaviors and characteristics as outlined in section 3.4. Section 4.3 then provides the concept for the development of a classification model using supervised machine learning.


[Figure 4-1: process diagram with the elements Topic definition, Topic-focused community detection, Development of classification model, and OTS characteristics.]

Figure 4-1 Two-step OTS identification approach

4.2 Topic-focused community detection

As outlined in section 2.2.2, users in OSNs form social groups, so-called communities. To detect communities, researchers apply various methods (cf. section 3.3) depending on the purpose of research. The community detection approach which is developed within this thesis aims to identify a topic-focused community with good interconnectedness and a high level of interaction. The reasons for this are the findings of trend diffusion derived earlier in this thesis, which are the following:

1) Trends diffuse across members of a social system who interact with each other as interactions are a precondition for the process of diffusion.
2) Trends and the role of trendsetter relate to a specific topic area.
3) Being active and generating content is a precondition to actively participate in trend creation and diffusion.

Therefore, topological, topical, and activity-based characteristics are included in the detection process by considering connection, interaction, and UGC data as suggested by, e.g., Darmon et al. (2015) (cf. section 3.3). The process is initialized by one or several hashtags which describe the topic area of interest, as hashtags provide an appropriate starting point to identify a large initial seed group of users with a low level of subjective bias. This is outlined in section 3.3 and further supported by the qualitative study (cf. section 3.5). Following the insights gained from section 3.3, the process is constructed as an iterative one which involves scoring, filtering, and ranking mechanisms to make it as efficient as possible. Figure 4-2 presents an overview of the conceptualized community detection approach. The process starts with initial seed users who are identified by a pre-defined keyword (hashtag). Subsequently, specific filter criteria are applied (cf. section 4.2.1 and 4.2.2) to eliminate bots or commercial accounts. This results in a group of potential community members. To ensure a certain level of activity and to guarantee a community member’s interest focus on the pre-defined field, each potential community member is evaluated by scores in three categories. The best users according to this scoring, the top-scored users, form the basis for the identification of new potential community members. For this, the follow-relations of top-scored users are considered in the next process iteration until no new user appears in the process. This means that all new potential community members have already been evaluated in one of the previous process iterations. The iterative process is explained in more detail in section 4.2.4.

[Figure 4-2: flowchart with the elements #seed, Initial seed users, Filtering (removes bots and commercial accounts), Scoring (removes low-scored users), Potential community members, Top-scored users, New potential community members (add top-scored users’ follows until no new user appears), and Target community.]

Figure 4-2 Overview community detection process
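The iterative loop described above can be illustrated with a minimal sketch. It is not the implementation used in this thesis: the function `detect_community`, its parameters, the fixed cut-off `top_k`, and the choice to collect the top-scored users of all iterations as the target community are assumptions made for illustration, and the filter and scoring functions are placeholders for the criteria introduced in sections 4.2.1 and 4.2.2.

```python
from typing import Callable, Dict, Iterable, List, Set


def detect_community(
    seed_users: Iterable[str],
    follows: Dict[str, List[str]],
    is_excluded: Callable[[str], bool],
    score: Callable[[str], float],
    top_k: int,
) -> Set[str]:
    """Expand a seed group via the follows of top-scored users (illustrative)."""
    evaluated: Set[str] = set()   # users already filtered and scored once
    community: Set[str] = set()   # assumption: top-scored users of all iterations
    candidates = set(seed_users)

    while candidates:             # stops when no new user appears
        evaluated |= candidates
        # Filtering: drop bots and commercial accounts (placeholder criterion).
        kept = [u for u in candidates if not is_excluded(u)]
        # Scoring and ranking: keep the top_k users (name breaks ties).
        top = sorted(kept, key=lambda u: (-score(u), u))[:top_k]
        community.update(top)
        # Expansion: follows of top-scored users that were never evaluated.
        candidates = {f for u in top for f in follows.get(u, [])} - evaluated
    return community


# Toy data: "shop" is filtered out, "c" is dropped as the low-scored user.
follows = {"a": ["b", "c", "shop"], "b": ["a"], "c": ["a"], "shop": ["a"]}
scores = {"a": 0.9, "b": 0.8, "c": 0.1, "shop": 0.7}
community = detect_community(
    seed_users=["a"],
    follows=follows,
    is_excluded=lambda u: u == "shop",
    score=lambda u: scores[u],
    top_k=1,
)
```

Because every candidate is evaluated at most once, the loop terminates as soon as the follows of the top-scored users contain only users seen in earlier iterations.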

Based on the findings of the literature research, the detection approach is composed of four main components which are the detection metrics (network, activity, content), the evaluation and selection mechanisms (scoring, filtering, ranking), the data layers (layer 0, I, II, III, IV), and the iterative process. Thereby, the layers describe the data which is required within the different steps of the community detection process (cf. section 4.2.3). The components of this detection process are introduced in the following sections.

4.2.1 Detection metrics

According to the requirements of the research, the detection of the community is based on three pillars of detection criteria. These criteria are measured by analyzing textual UGC, profile, connection, and interaction data which can be extracted from online social networking platforms (cf. section 2.2.1). The three pillars with the respective evaluation metrics and their calculation are presented in the following.

Network pillar

This dimension aims to detect a community whose members are connected and regularly interact with each other. Therefore, the included criteria consider users’ connections to each other and measure the level of connectedness with four metrics: no. of followers, no. of follows, followers-follows ratio (FFR) and PageRank value. The FFR describes the relation between the no. of followers (users who follow a user) and the no. of follows (users who are followed by the user) as presented in formula (1). PageRank measures the importance of nodes within a network by also assessing the influence of direct and indirect neighbor nodes. Consequently, a user with only a few influential friends on a social media platform might have a higher PageRank value than a user with many low-influential friends. Thus, PageRank considers both the number and the quality of connections, which is why it is included in the community detection approach. The PageRank algorithm was developed to evaluate the importance of web pages by their link structure and is used by the Google search engine (Oliveira and Gama, 2012). In recent years, however, the algorithm has been widely utilized for the detection of communities within directed social networks such as those formed by users in OSNs (cf. section 3.3). For this, the PageRank algorithm ranks users according to their PageRank value, and only high-ranked users are considered as members of a community. The PageRank measure is a variant of the eigenvector centrality measure which is used in directed networks. The algorithm assigns each node of the graph, e.g., a user within a community, a weighting which is determined by its incoming edges, e.g., a user’s followers. The weighting procedure consists of two phases: the initialization step and the iteration steps. During initialization, all nodes receive the same weighting. In the subsequent iteration phase, these weights are recalculated for each node based on its incoming edges. Equation (2) shows a simplified version of the PageRank of a user u. In the case of the community detection approach, B(u) is the set of users who point to a user u, i.e., the set of the user’s followers. PR(u) and PR(f) are the rank scores of a user u and a follower f. F_f describes the number of outgoing links of f, which in the community approach is the number of follows of a follower f, and c is a factor utilized for normalization (Xing and Ghorbani, 2004).

(1)  $\mathrm{FFR} = \frac{FB}{F}$

FB = no. of followers; F = no. of follows

(2)  $PR(u) = c \sum_{f \in B(u)} \frac{PR(f)}{F_f}$

u = user; B(u) = set of followers f of u who are members of the community C (u ϵ C); F_f = no. of follows of f
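As an illustration of formulas (1) and (2), the following sketch computes the FFR and iterates the simplified PageRank on a small follow graph. The function names, the toy graph, the fixed number of iterations, and the handling of users without follows are assumptions for illustration. The sketch further assumes that the graph contains only follow-relations between community members, so that F_f counts follows inside the community.

```python
from typing import Dict, List


def ffr(followers: int, follows: int) -> float:
    """Followers-follows ratio, formula (1): FFR = FB / F."""
    # Assumption: a user without follows gets an infinite ratio.
    return followers / follows if follows else float("inf")


def simplified_pagerank(
    follows: Dict[str, List[str]], c: float = 1.0, iterations: int = 100
) -> Dict[str, float]:
    """Iterate PR(u) = c * sum(PR(f) / F_f for f in B(u)), formula (2)."""
    users = set(follows) | {v for targets in follows.values() for v in targets}
    # Initialization: all nodes receive the same weighting.
    pr = {u: 1.0 / len(users) for u in users}
    for _ in range(iterations):
        new_pr = {u: 0.0 for u in users}
        # Each follower f passes PR(f) / F_f to every user u it follows.
        for f, targets in follows.items():
            for u in targets:
                new_pr[u] += c * pr[f] / len(targets)
        pr = new_pr
    return pr


# Toy graph: a follows b; b follows a and c; c follows a.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
ranks = simplified_pagerank(graph)
# For this graph the values settle at about a = 0.4, b = 0.4, c = 0.2.
```

In this simplified form, users who follow nobody pass on no weight, so the total weight shrinks over the iterations. The full PageRank algorithm handles this case with a damping factor, which is outside the scope of this sketch.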

Activity pillar

To guarantee a specific level of user activity, this pillar consists of metrics that measure the level of activity of potential community members, such as the number of posts per day (PPD), but also the volume of reactions to the posted content in the form of comments and likes. This pillar encompasses three metrics: posting, liking, and commenting frequency. As some of the experts of the qualitative study (cf. section 3.5) have emphasized the importance of excluding bots from the process, the metric PPD also serves to detect bots and abnormal behavior as suggested by Morstatter et al. (2016). A high outlier value shows extraordinary activity and serves as an indicator of bot behavior. For reasons of computational cost, not all postings of a user are gathered, but only the most recent x postings of each user. The number x is defined individually for each use case and mostly depends on the accessibility of data of the targeted OSN. This restriction to a specific number is deemed sufficient as the most recent postings reflect a user’s activity as well as her/his current interest focus. Therefore, the PPD measure is calculated by considering the period between the first and the last collected posting as indicated in (3). The measures comments per post (CPP) and likes per post (LPP), which are considered within the scoring process, are calculated accordingly.

(3)  $\mathrm{PPD}_u = \frac{x}{\text{posting date of most recent post} - \text{posting date of oldest post}}$

u = user; x = no. of considered most recent posts; the period in the denominator is measured in days
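A short sketch of how formula (3) and the related measures could be computed from the x most recent postings of a user. The `Post` record, the function name, and the lower bound of one day for the observation period are assumptions for illustration. CPP and LPP are interpreted here as the average number of comments and likes per collected posting, which is one reading of "calculated accordingly".

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass
class Post:
    posted_at: datetime
    likes: int
    comments: int


def activity_metrics(posts: List[Post]) -> Dict[str, float]:
    """Compute PPD, CPP and LPP from the collected postings of one user."""
    if not posts:
        raise ValueError("at least one posting is required")
    x = len(posts)
    dates = [p.posted_at for p in posts]
    days = (max(dates) - min(dates)).total_seconds() / 86400
    # Assumption: periods shorter than one day count as one day.
    days = max(days, 1.0)
    return {
        "PPD": x / days,
        "CPP": sum(p.comments for p in posts) / x,
        "LPP": sum(p.likes for p in posts) / x,
    }


# Four postings within eight days: PPD = 4 / 8 = 0.5.
posts = [
    Post(datetime(2020, 1, 1), likes=10, comments=1),
    Post(datetime(2020, 1, 3), likes=20, comments=2),
    Post(datetime(2020, 1, 5), likes=30, comments=3),
    Post(datetime(2020, 1, 9), likes=40, comments=2),
]
metrics = activity_metrics(posts)
```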

Content pillar

This pillar aims to ensure the quality of user accounts as well as a specific level of interest within the defined topic area and measures a user’s authenticity and topic focus. The pillar relies on three assumptions:

1) Commercial accounts use a specific wording which relates to a buying purpose.
2) The more postings a user has posted on a specific topic, the more interest she/he has in this topic.
3) The more topic-related words she/he includes in her/his postings and biography, the more knowledge in this specific field she/he has.

The two latter aspects follow the assumptions of Li et al. (2013). As described earlier in section 3.4.3, the biography or profile includes information regarding users’ social environment and interests, such as being a parent or having an interest in sneakers. Users who are highly interested in fashion, for instance, tend to include the word “fashion” in their biography (Lee et al., 2017). Similar to Morgan et al. (2019), who use specific keywords to identify relevant user accounts within their community detection process, specific filter words and keywords are defined which are used to filter out commercial accounts and to value topic-relevant accounts. If any of the filter words appears in the text, the user is excluded from the process. The biography text of commercial accounts such as shops often includes words like “shop online” and “worldwide shipping”. For this reason, biography information is examined to detect commercial entities. As the qualitative study has indicated that commercial accounts are less credible, these accounts are excluded from the process (cf. section 3.5). On the other hand, if some of the keywords are mentioned in the biography text or the posting text, the account is scored with a higher value. This follows assumptions 2) and 3) above. The more keywords are included in the text, the higher the content value of the user (biography score and text score). The calculation of the biography score (BS) and the text score (TS) is based on a pre-defined list of topic-relevant keywords and requires some text pre-processing steps. Similar to the definition of hashtags, the selection of relevant keywords is carried out by experts of the respective field or by the stakeholders who set the objectives of the study and are therefore deemed to be familiar with the topic area. Figure 4-3 presents the pseudocode for the calculation of the TS of one user. The BS of a user is calculated in the same way but refers to the biography as the text component.

72 Conceptual framework to identify online trendsetters

Input:
    Tu = set of text components (postings) of a user u, Tu = {c1, c2, c3, …}
    ci = text component with index i, ci ϵ Tu
    K = set of defined keywords, K = {k1, k2, k3, …, kn}

Output:
    TSu = text score of user u

set TSu to 0
set i to 1
while i ≤ length(Tu) do
    if ci contains a keyword k ϵ K then
        set TSu to TSu + 1
    set i to i + 1
return (TSu)

Figure 4-3 Pseudocode text score
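The pseudocode of Figure 4-3 can be sketched in Python as follows. This is a minimal illustration and not the implementation used in the thesis: it assumes that a posting counts towards the score when it contains at least one keyword, and the function name `text_score`, the regular-expression tokenizer and the lower-casing are assumptions introduced for the example.

```python
import re


def text_score(postings, keywords):
    """Count the postings of one user that contain at least one keyword.

    Assumption: a posting matches when any of its lower-cased word or
    hashtag tokens is in the keyword set; the tokenizer is illustrative.
    """
    keyword_set = {k.lower() for k in keywords}
    score = 0
    for posting in postings:
        tokens = set(re.findall(r"[\w#]+", posting.lower()))
        if tokens & keyword_set:  # at least one keyword occurs in the posting
            score += 1
    return score


example_postings = ["New sneaker drop today", "Lunch with friends", "My #fashion haul"]
print(text_score(example_postings, {"sneaker", "#fashion"}))
```

The biography score could be obtained with the same function by passing the biography as the only text component.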

4.2.2 Detection mechanisms

As previously mentioned, the detection process is based on three mechanisms: filtering, scoring and ranking. Calculating the scores only for a pre-filtered group of users keeps the computational costs low. This in turn reduces the data collection effort and leads to an efficient detection process. The detection mechanisms are introduced in the following.

Filtering

To avoid the consideration of bots or commercial accounts within the scoring step, as suggested by some of the experts (cf. section 3.5), the metrics listed in Table 4-1 serve as filter criteria. User accounts with outlier values of specific metrics are excluded from the process. Users with a PPD value of more than twelve posts, for instance, are excluded, as this value is assumed to indicate commercial or artificial behavior. This specific value is derived from several experiments with data from an OSN analyzing the average user behavior and from discussions with experts in the field of influencer marketing. This application of filter criteria follows the idea of Segev et al. (2018), who also use certain features to identify artificial behavior (cf. section 3.4). Table 4-1 presents an overview of filter criteria with the respective category, filter values, and the reason for their application.


Category | Criteria | Filter values (remove if…) | Reason
Network | No. of followers (not considered in iteration 0) | > 1 billion | A high number indicates commercial and professional motivation of account and no real social connections.
Network | No. of follows (not considered in iteration 0) | > 5 000 | A high number indicates a commercial or professional interest. Other users are followed to achieve reach by being followed back.
Activity | No. of posts per day (PPD) | > 12 | A higher value is considered as artificial or commercial behavior.
Content | Filter words in biography (not considered in iteration 0) | List of filter words | The usage of filter words indicates commercial motivation.
Content | Filter words in postings text | List of filter words | The usage of filter words indicates commercial motivation.

Table 4-1 Filter criteria
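The filter step can be illustrated with a short Python sketch. The thresholds are taken from Table 4-1 and the two filter phrases from the examples in section 4.2.1. The class `Account`, the function `passes_filter` and the simple substring matching are assumptions made for illustration only; the complete list of filter words is defined by experts.

```python
from dataclasses import dataclass

# Example filter phrases from section 4.2.1; the full list is use-case specific.
FILTER_WORDS = ("shop online", "worldwide shipping")


@dataclass
class Account:
    followers: int
    follows: int
    posts_per_day: float
    biography: str
    posting_text: str


def passes_filter(acc, iteration, max_followers=1_000_000_000,
                  max_follows=5_000, max_ppd=12):
    """Return True if the account survives the filter criteria of Table 4-1."""
    if acc.posts_per_day > max_ppd:
        return False
    if any(w in acc.posting_text.lower() for w in FILTER_WORDS):
        return False
    if iteration > 0:
        # Network and biography criteria are not considered in iteration 0.
        if acc.followers > max_followers or acc.follows > max_follows:
            return False
        if any(w in acc.biography.lower() for w in FILTER_WORDS):
            return False
    return True


shop = Account(1000, 100, 3, "Shop online now", "new arrivals")
print(passes_filter(shop, iteration=1))
```

Passing the thresholds as parameters keeps the sketch transferable to other platforms or topics where different outlier values apply.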

Scoring

As suggested by some of the experts of the qualitative study (cf. section 3.5), three scores are calculated for each user (network, activity, content). As indicated in Table 4-2, each score is composed of several criteria. The network score, for instance, combines the criteria no. of follows and FFR. The scores therefore reflect a user’s quality in each of the three categories. With regard to transferability and automation, the calculation of scores is based on the normal distribution of all considered user values for each criterion. This means that the score values are determined relative to the metric values of all potential community members. For each criterion, a percentile is set as the reference point at which users receive the maximum score value (= 1.0). Values in a percentile above or below the reference point are assigned a decreasing score. The advantage of this procedure is the continuous score assignment instead of a discrete allocation of points. The reference point for the maximum score for CPP, for instance, is defined as the 100% percentile, as a higher value of CPP indicates high interaction and engagement among the users. Content scores, however, are based on the absolute value, as the metrics consist of the mean of all keyword values. This results in a score between zero and one, which corresponds to the range calculated by the percentiles for the other metric values. Table 4-2 presents the list of all scoring criteria related to the specific category, the basis for the calculation of scoring values (column “scoring”), and the reasons for their application.

Category | Criteria | Scoring | Reason
Network | No. of follows (not considered in iteration 0) | 50% = maximum (score value = 1); standard deviation (SD) = 0.25 | To value authentic users higher.
Network | Follower ratio (FFR) (not considered in iteration 0) | 75% = maximum (score value = 1); SD = 0.2 | To value authentic users higher.
Activity | No. of posts per day (PPD) | 50% = maximum (score value = 1); SD = 0.25 | To ensure high activity within the community but avoid artificial or commercial accounts.
Activity | Comments per post (CPP) | 100% = maximum (score value = 1); SD = 0.4 | To ensure high activity within the community.
Activity | Likes per post (LPP) | 100% = maximum (score value = 1); SD = 0.4 | To ensure high activity within the community.
Content | Biography score 1 (BSα) (not considered in iteration 0) | Value between 0 and 1 | To ensure topic focus.
Content | Biography score 2 (BSβ) (not considered in iteration 0) | Value between 0 and 1 | To ensure topic focus.
Content | Text score 1 (TSα) | Value between 0 and 1 | To ensure topic focus.
Content | Text score 2 (TSβ) | Value between 0 and 1 | To ensure topic focus.

Table 4-2 Scoring criteria
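One possible reading of the percentile-based scoring is sketched below. It assumes that a user's percentile rank among all potential community members is mapped to a score by a bell-shaped (Gaussian) curve which peaks at 1.0 at the reference percentile and whose width is given by the SD value of Table 4-2. The exact functional form is not specified in the text, so the formula and the function names `percentile_rank` and `percentile_score` are assumptions.

```python
import math


def percentile_rank(value, all_values):
    """Share of users whose metric value is less than or equal to `value`."""
    return sum(v <= value for v in all_values) / len(all_values)


def percentile_score(value, all_values, reference, sd):
    """Score in (0, 1] that equals 1.0 when the percentile rank hits `reference`.

    Assumption: the decrease around the reference point follows a Gaussian
    curve over the percentile rank with standard deviation `sd`.
    """
    p = percentile_rank(value, all_values)
    return math.exp(-((p - reference) ** 2) / (2 * sd ** 2))


# Hypothetical PPD values of ten potential community members.
ppd_values = [0.1, 0.5, 1, 2, 3, 4, 6, 8, 10, 12]
print(percentile_score(3, ppd_values, reference=0.5, sd=0.25))
print(percentile_score(12, ppd_values, reference=0.5, sd=0.25))
```

With a reference of 0.5, users with a median posting frequency receive the maximum score, while very inactive and very active users are scored lower. With a reference of 1.0, as for CPP and LPP, the score increases monotonically with the metric value.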

The individual scoring criteria of each category can be assigned different weights. The activity score, for instance, is a function of three sub-scores, where LPP is given a lower weight because, according to previous studies, comments are a stronger indicator of engagement than likes. The score values per user of the PPD, CPP and LPP metrics, which range from zero to one, are then weighted and summed up to give the overall activity score (cf. equation (5)).


As described in section 4.2.1, the calculation of biography and text scores is based on pre-defined keywords, which can be classified into different categories, e.g., according to their specificity. The categories can be weighted differently, e.g., the more specific a keyword, the higher its information value and therefore its weight. In formula (6) this is considered by multiplying the scores with a specific weight “α, β, …ω”. Depending on the use case, the objective and the resulting number of keyword categories, a varying number of weights can be involved.

(4) $\text{Network score} = F + FFR$

F = no. of follows, FFR = followers-follows ratio

(5) $\text{Activity score} = 2 \times PPD + 2 \times CPP + LPP$

PPD = no. of posts per day, CPP = comments per post, LPP = likes per post

(6) $\text{Content score} = \alpha \times BS_{\alpha} + \beta \times BS_{\beta} + \dots + \omega \times BS_{\omega} + \alpha \times TS_{\alpha} + \beta \times TS_{\beta} + \dots + \omega \times TS_{\omega}$

α, β, …ω: weightings to express importance of respective keyword category; BS α, β, …ω = biography score of keywords of the respective keyword category; TS α, β, …ω = text score of keywords of the respective keyword category
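Equations (4) to (6) can be written as three small Python functions. The sketch assumes that all inputs are already the per-criterion score values between zero and one. The function names, the dictionary layout and the example weights (2 for category α, 1 for category β) are hypothetical and only serve the illustration.

```python
def network_score(f, ffr):
    """Equation (4): sum of the follows score and the follower-ratio score."""
    return f + ffr


def activity_score(ppd, cpp, lpp):
    """Equation (5): PPD and CPP are weighted twice as high as LPP."""
    return 2 * ppd + 2 * cpp + lpp


def content_score(bs, ts, weights):
    """Equation (6): weighted sum of biography and text scores per keyword category.

    `bs`, `ts` and `weights` are dictionaries keyed by keyword category.
    """
    return sum(w * (bs[c] + ts[c]) for c, w in weights.items())


# Hypothetical score values of one user.
print(activity_score(ppd=0.5, cpp=1.0, lpp=0.2))
print(content_score(bs={"alpha": 0.5, "beta": 0.25},
                    ts={"alpha": 1.0, "beta": 0.0},
                    weights={"alpha": 2, "beta": 1}))
```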

Ranking

To ensure that the community and its members meet the pre-defined criteria, users are ranked according to their score values. Only the top n potential members proceed in the process. The value of n depends on the target number m of final community members. If, for instance, the targeted community size m is 1,000 users due to a specific business objective, the number of potential community members is set to 10,000. The top 10% of potential members are thus considered top-scored users. This procedure follows the approach of Morgan et al. (2019) and Khorasgani et al. (2010), who also proceed with the top users according to pre-defined criteria within the iterative process to detect a specific group of users. For the identification of top-scored users across all three scoring pillars, a consensus algorithm is applied which aims to identify the users with the best score combination across the three metric categories. This algorithm builds on a sorting mechanism in which the scores of the three metric pillars are used to prioritize users within a given score class. It follows an iterative process until the target number m of top-ranked users according to their score values is reached. Figure 4-4 presents the pseudocode of the consensus algorithm and its interpretation. The algorithm works as follows: in iteration zero it compares, for instance, the 1,000 best-ranked users in each of the three categories (network, activity, content) and extracts only those who are part of all three groups, i.e., the intersection of the three top-1,000 groups. As a user can reach a high-ranked score value in the network category but a low-ranked value in the content category, this first iteration can result in a group of only 41 users, whereas the pre-defined target number m was set to 1,000 users. To achieve this target number, the procedure is repeated until the intersection size reaches 1,000 users. In each iteration, the group of top-ranked users per category is enlarged by one additional user. In the mentioned example, the algorithm has to compare the top 3,102 users of all three groups to reach the target of 1,000 users.

Algorithm 1 Pseudocode for consensus algorithm

Input:
    S = set of potential community members, S = {s1, s2, s3, …}
    si = user with index i
    m = target number of users in R
    it = number of iterations
    AS = activity score, CS = content score, NS = network score

Output:
    R = set of users with the best network, activity and content scores

set it to 0
set R empty
while length(R) < m do
    set R, subsetAS, subsetCS, subsetNS empty
    sort S descending according to AS
    for i = 1 to m + it do
        add si to subsetAS
    sort S descending according to CS
    for i = 1 to m + it do
        add si to subsetCS
    sort S descending according to NS
    for i = 1 to m + it do
        add si to subsetNS
    for i = 1 to length(S) do
        if si in subsetAS and si in subsetCS and si in subsetNS then
            add si to R
    set it to it + 1
return (R)

Figure 4-4 Pseudocode of consensus algorithm

As a result, the algorithm indicates a group of users who serve as starting point for the next iteration of the community detection process.
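The consensus algorithm of Figure 4-4 can be sketched in Python with set intersections. The sketch sorts each ranking only once instead of in every iteration, and it adds a stop condition for the case that the rankings are exhausted before m users are found. Both are simplifications introduced here, as are the function name `consensus` and the tuple layout of the scores.

```python
def consensus(users, m):
    """Return users that are in the top k of all three score rankings.

    `users` maps a user id to a tuple (network, activity, content) of scores.
    k starts at m and is enlarged by one per iteration until at least m users
    are in the intersection (or no further users are left to add).
    """
    rankings = [
        sorted(users, key=lambda u: users[u][dim], reverse=True)
        for dim in range(3)
    ]
    k = m
    while True:
        top_sets = [set(r[:k]) for r in rankings]
        result = set.intersection(*top_sets)
        if len(result) >= m or k >= len(users):
            return result, k
        k += 1


# Hypothetical scores: (network, activity, content).
scores = {
    "a": (0.9, 0.1, 0.5),
    "b": (0.8, 0.8, 0.8),
    "c": (0.1, 0.9, 0.9),
    "d": (0.5, 0.5, 0.1),
}
print(consensus(scores, m=1))
```

As one enlargement step can complete several users at once, the returned set can contain slightly more than m users.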

4.2.3 Data layers

In each iteration, only relevant data of users are gathered. Five different layers describe the required data per user in the different process steps. Table 4-3 presents an overview of the different layers and the respective data. The layers relate to the different stages within a process iteration, e.g., the available data of a user in the group of potential community members is reflected by layer II and encompasses specific network, profile and UGC data.

Available data per layer

Layer | Network (connection, interaction) | User profile | UGC
Layer 0 (Initial seed users) | No. of comments on a seed-hashtag posting; no. of likes on a seed-hashtag posting | No data | Text of postings which include the initial seed-hashtag; date and time of postings
Layer I (New potential community members) | List of contacts (follows and followers) | No data | No data
Layer II (Potential community members) | No. of followers; no. of follows; no. of likes of postings (in px); no. of comments of postings (in px) | Biography description | No. of postings; posting text of x most recent postings (px) (time of crawling); date and time of postings
Layer III (Top scored users) | Layer II data; follow-relation | Layer II data | Layer II data
Layer IV (Target community) | Layer III data; further interaction data (mentions, tags, commentors) | Layer II data | Layer II data; posting history; content of comments

Table 4-3 Overview different data layers

The data which is collected for the target community users (layer IV) encompasses all data which is required for the calculation of features presented in section 4.3.2. These features are considered within the classification task.

4.2.4 Iterative process

The community detection approach follows an iterative process. Starting with initial seed users, the follow-relation is used to expand the group of potential community members, similar to prior studies (cf. section 3.3). The underlying assumption is that one user follows another who serves as a source of information. Moreover, users are likely to follow others who have the same interests (Pal et al., 2016). Figure 4-5 presents a process chart of the topic-focused community detection approach. The first iteration starts with the definition of the field of interest and the respective hashtag. The hashtag aims to capture the specific topic. As outlined in section 3.3, previous studies consult experts or define commonly used words in the respective field as keywords. Following this, the definition of keywords (hashtags) is realized by a hybrid approach. First, an initial list of keywords is created by experts in the respective field. The keywords which are the most used hashtags on the respective online social networking platform are then set as initial seed hashtags. As suggested by Ferrara et al. (2014) (cf. section 3.3), the users who have mentioned these hashtags in one of their postings are considered in the selection of initial seed users. Therefore, the posting text including the respective hashtag and related metadata (cf. Table 4-3) is extracted and evaluated according to the filter criteria defined in Table 4-1 and the scoring criteria listed in Table 4-2. This serves to select an appropriate number of potential community members. As in the subsequent iterations, the consensus algorithm is applied to identify the top-scored users, but it is based only on the activity score and the content score, as no network score is calculated in the initial iteration due to the different data basis (cf. Table 4-3). The result of this initial iteration is a group of m top-scored users.
The accounts in the follow-relation of these users, i.e., the new potential community members, are used to expand the group of potential community members. Similar to the study of Salehi et al. (2012), the decision on which users to consider in the subsequent iteration is based on the PageRank value of accounts. For this, the network metric is calculated for the top-scored users and the accounts in their follow-relation. The top 10% of users ranked according to this value proceed to the next iteration. For newly emerging users, the necessary data is then extracted. The new group of potential community members (for iteration > 0) passes through the filtering and scoring steps, followed by the application of the consensus algorithm (cf. Figure 4-4). After each iteration, two metrics are calculated: new count and never count. New count refers to the number of users who are in the group of layer III users of the current iteration but who were not part of this group in the previous iteration. Never count represents the number of users who are in the group of layer III users of the current iteration but who were never part of this group in any previous iteration. The iterations are repeated until the never count measure reaches a value of zero, as this suggests that the community has converged.
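The two convergence measures can be computed with simple set operations. The following sketch assumes that the layer III user groups of all previous iterations are kept in a list of sets; the function name `convergence_counts` and the example user ids are hypothetical.

```python
def convergence_counts(history, current):
    """Return (new_count, never_count) for the current layer III user group.

    `history` is a list of sets with the layer III users of all previous
    iterations (oldest first); `current` is the set of the current iteration.
    """
    previous = history[-1] if history else set()
    seen_before = set().union(*history)
    new_count = len(current - previous)       # not in the previous iteration
    never_count = len(current - seen_before)  # not in any previous iteration
    return new_count, never_count


history = [{"a", "b"}, {"b", "c"}]
print(convergence_counts(history, {"a", "c", "d"}))
```

In this reading, the iterative process would stop as soon as the second value returned is zero.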

[Figure 4-5: flowchart of the community detection process across layers 0 to IV. Steps shown: data crawling of initial seed users; check whether the filtering criteria are fulfilled (otherwise users are excluded); calculation of scores and application of the consensus algorithm; selection of the top m users according to ranking; calculation of PR values and descending ranking; selection of the top 10% of users according to PR values; download of biography data and x most recent posts; download of all follow-relations*; check for new users in the lists of follows; upon convergence, download of complete profile data of the stable target community members.]

* the list of users which a person follows

Figure 4-5 Community detection process chart

4.3 Trendsetter classification model

In the next step, the OTS profiles within this topic-focused community are identified based on a classification model. For the development of this model, a supervised machine learning approach is chosen, as it allows the consideration of a large number of different user attributes and aims to detect patterns in data such as specific behaviors and characteristics. Figure 4-6 shows a supervised machine learning process and the related sections of this thesis. The process considers the major steps of building a predictive model, which are outlined in section 2.3.4, and is adapted to the specific classification goal of this thesis. Similar to the community detection process, the supervised machine learning process is iterative, with continuous optimization of results. The process starts with the goal definition, which is the detection of OTS profiles within a community. As outlined in section 2.3.4, exploring and understanding the problem space is one important aspect of this step, as it enables the decision on relevant data and ensures the consideration of all relevant and informative data in the data collection phase. In this thesis, section 2 provides the fundamental knowledge about (fashion) trend-relevant roles and outlines, for instance, that the role of opinion leaders and their characteristics are closely related to those of OTSs. Therefore, data that is considered for the detection of online opinion leaders can be assumed to be relevant for the detection of OTSs. Section 3 subsequently provides the necessary knowledge about which data to collect. As an example, online opinion leaders are often described as credible, which can be measured by the number of times they are mentioned by others within the community. Therefore, all postings of the community including the mention information need to be collected. In addition, performance measures are selected which form the basis for evaluating and validating the model.
The selection of appropriate performance measures is based on the defined goal and the influence of specific classification errors, and is further outlined in section 5.3.1. The following two sections are especially relevant for the data pre-processing step. Section 4.3.1 describes a labeling concept which results in a baseline where the attribute of being an OTS is known. This concept is applied to data in section 5.3.2. Section 4.3.2 then provides a framework for the extraction of features that need to be considered within the classifier training. The labeled dataset is the precondition for the training and testing of models. The choice of appropriate data splitting and data resampling techniques as well as the selection of algorithms and ensemble learning methods are outlined in section 5.3.1. Finally, in section 5.3.3, the training and evaluation phase including feature selection and hyperparameter tuning is explained.

Process step | Sub-step | Related section
Goal definition | Exploration of problem space | Section 2: (fashion) trend diffusion, trend-relevant roles and their characteristics
Goal definition | Selection of performance measures | Section 5.3.1: selection of appropriate performance metrics
Data collection | Definition of relevant data | Section 3: data which enables the measurement of potential OTS characteristics
Data collection | Definition of sample size | Section 4.2: topic-focused community detection approach
Data pre-processing | Data labeling | Section 4.3.1: labeling concept based on past trends and Rogers’ diffusion model (application in section 5.3.2)
Data pre-processing | Feature extraction | Section 4.3.2: feature framework which aims to measure OTS characteristics (application in section 5.3.2)
Algorithm selection | Data splitting; data resampling | Section 5.3.1: selection of techniques and methods for the development of the trendsetter classification model
Training | Feature selection; hyperparameter tuning | Section 5.3.3: development of models based on various algorithms and ensembles
Evaluation with test set | | Section 5.3.3: evaluation of models
OK? | no: the process is iterated; yes: export model |

Figure 4-6 Applied supervised machine learning process (based on Kotsiantis, 2007)

4.3.1 Labeling concept

The precondition for the development of a classification model by applying supervised machine learning is a labeled dataset. As outlined in section 3.4, a major weakness of such learning approaches is the lack of labeled data and the high manual effort required to create such a dataset. Furthermore, the labeling is often done by annotators, which involves a certain level of subjective bias. This is especially the case for the labeling of OTSs due to their definition (cf. section 2.1.3). They are credible, creative, have good communication skills, and have the ability to influence others' behavior. All of these characteristics are difficult to quantify or to translate into single criteria which are observable by human annotators. Therefore, a rating into OTSs and non-OTSs on this basis will be strongly biased by the subjective assessment of the experts. By looking at the profile information and shared content of an account on an OSN, experts like trend scouts or product designers may have a feeling whether an account has trendsetting potential. This classification, however, is based on their experiences and feelings, and could miss accounts which, at first sight, do not share creative content but which are influential due to important nodes in their network. To reduce the subjective bias as well as the manual effort, a labeling concept is developed which is based on Rogers’ innovation diffusion model (cf. section 2.1.2). For this, Rogers’ model is transferred to the context of fashion trends spreading across OSNs.

Concept components and process

Based on the findings summarized in sections 2.4 and 3.6, the four components which determine the diffusion process are translated to the presented context as follows:

1. Innovation = new product (e.g., shoe model)
2. Time of adoption = time of mentioning the new product for the first time (e.g., shoe model-related hashtag in a user’s posting)
3. Social system = community (e.g., specific topic-focused community)
4. Channel = OSN (e.g., Instagram)

Subsequently, for the measurement of the adoption rate, the volume of communication about the new product within a specific timeframe can be analyzed. Figure 4-7 shows Rogers’ innovation diffusion model adapted to the context of fashion trend diffusion in OSNs.

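[Figure: Rogers’ adoption curve applied to the social system (topic-focused community in OSN). Horizontal axis: time; vertical axis: adoption rate*. The first 16% of adopters are OTSs, the remaining 84% are non-OTSs. Trend: popular product.]

* number of postings with trend-related hashtag (volume of (new) conversations)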

Figure 4-7 Labeling concept based on Rogers’ diffusion model

The concept uses past product trends (= successful innovations) to classify users according to Rogers into OTS and non-OTS. A successfully released sneaker model, for instance, can be considered a new product trend, as it is subjectively perceived as new by the consumers at the time of its launch. The focus lies on online social networking platforms as the communication channel, as the aim of this research is to identify fashion OTSs on these platforms. The evolution of the adoption rate over time is analyzed by investigating the number of users who mention the defined product for the first time in their postings. Taking the example of a new sneaker model, the adoption rate per defined time period is described by the number of users who have mentioned this specific sneaker model for the first time within this period. This analysis is realized within a specific, previously identified, topic-focused community (cf. section 4.2), which represents the social system of Rogers’ model. According to Rogers, the first 16% of adopters in time, i.e., of the users who have mentioned the new product in their postings, are crucial for a new product to become a trend and are therefore labeled as OTS. As Rogers (2003) states that non-adopters are not included in this adopter classification, only users of the community who have mentioned the product are considered. Figure 4-8 shows the labeling process, which starts with the identification of appropriate trends and the definition of related keywords which enable the identification of trend-related conversations in OSNs. The second step encompasses the detection of a topic-focused community (cf. section 4.2). Subsequently, the trend-related conversations are extracted from the topic-focused community, which is the precondition for the categorization of the community into OTS and non-OTS based on Rogers’ model. This classification is conducted in the fourth process step and results in a classified community, i.e., a labeled dataset.

Step | Process step | Component of Rogers’ model
1 | Identification of appropriate product trends | Successful innovation
2 | Detection of topic-focused community | Social system
3 | Identification of trend-related conversations in OSN | Adoption rate over time, communication channel
4 | Labeling according to Rogers’ model | Adopter categories
Result | Labeled dataset |

Figure 4-8 Labeling process

As the concept of topic-focused community detection is presented in the previous section, only the remaining steps are explained in the following.

Identification of trends

The precondition for the labeling of the community into OTS and non-OTS is the knowledge about specific product trends, e.g., which sneaker model was a trend model, and the identification of the relevant conversations in OSNs related to this product trend. This raises the following two main issues:

1) When is a new product a trend, i.e., a popular product?
2) How can conversations or postings related to this product trend be identified in OSNs?

As defined in section 2.1.1, a fashion trend is a product that is accepted by a specific number of people (Kim et al., 2011). A new product is not automatically a new trend, which shows the need to distinguish between a new product and a popular new product (= new trend), where “popular” reflects a specific acceptance or desire by the consumers. An indicator of a product’s popularity is its price. One of the most important concepts in microeconomics, the law of supply and demand, states that the price for a specific product varies until the demand is equal to the offered volume (= economic equilibrium) (Gale, 1955). Further, it assumes that if, at a given price level (e.g., official retail price), the demand exceeds the supply on the market, the price for this product will increase (Gale, 1955). According to this model, the willingness of consumers to pay more than the offered price (= retail price) indicates that the demand for the product exceeds its supply. Thus, one can assume a high level of adoption of this product in the respective target market. Applied to the identification of trends (= popular products), this means that the bidding price (real market price) for a product surpasses the official retail price for a defined period of time.

Nowadays, there are various online marketplaces, e.g., eBay or Amazon, where consumers and companies can offer and purchase products or services (Luca, 2017). Some of these platforms specialize in a specific niche (Luca, 2017), e.g., StockX⁵ in streetwear. One of the price models on these platforms is the online auction, where (re)sellers and buyers negotiate the price for a specific product (Stahl et al., 2016). The price evolution for a specific product on such an online marketplace can be used for the identification of trends, as it reflects the desire of the consumers and shows the product’s acceptance on the market. The difference between the retail price and the real market price on such a platform over the product’s lifecycle, as illustrated in Figure 4-9, can be used as an indicator of its popularity on the market. The platform StockX, for instance, provides information about the evolution of the bidding price and the retail price for sneaker models and therefore enables their evaluation according to the above-described scheme.

5 https://stockx.com/

[Figure: price (vertical axis) over time (horizontal axis) from the date of product launch (= a) to the end of the product lifecycle (= b), showing the price on the online marketplace, the retail price, and the delta between price and retail price at a point of time x.]

Figure 4-9 Identification of trends – price evolution

If the delta between the retail price and the real market price from the launch date of a new product over its lifecycle is positive, one can assume a specific level of acceptance of this product and thus speak of a product trend. If the delta is negative, however, the product was not a trend. Formula (7) presents the calculation of the developed success index, which supports the categorization of products into trends and non-trends.

(7) $\text{Success index} = \left[ \sum_{x=a}^{b} (\text{date}_x - \text{date}_{x-1}) \times (\text{price}_x - \text{price}_{x-1}) \right] - \left[ (\text{date}_b - \text{date}_a) \times \text{retail price} \right]$

a: date of product launch, b: date of end of product lifecycle, x: specific point of time within the product lifecycle

Identification of trend-related conversations

After the identification of trends, the next step is the detection of related conversations on the respective OSN. Based on the findings of section 3.2, this is realized by identifying postings that include a trend-related hashtag (e.g., Zafarani et al., 2014; Wang and Zheng, 2014). To capture the defined product trend, the latter has to relate to a specific and individual name or hashtag, e.g., a specific sneaker model like adidas 4D Maroon Aero Green, which allows a clear assignment of postings and comments to this product by defined search strings. Furthermore, to ensure that the analysis is conducted at the end of the product’s lifecycle, products with short lifecycles of, e.g., one to two years, such as sneakers, are selected for the investigation. The identification of the trend-related postings by searching for the corresponding hashtags can be realized by simple search strings or by more sophisticated methods of text mining such as the Jaccard similarity coefficient. The latter measures the similarity between two samples, e.g., a specific search string or keyword and the content of a posting (e.g., Niwattanakul et al., 2013).
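How the Jaccard similarity coefficient could support the matching of hashtags and search strings is sketched below. The sketch compares character bigrams, which is one of several possible set representations and is an assumption made here, as are the function names and the threshold of 0.6.

```python
def bigrams(text):
    """Set of lower-cased character bigrams of a string."""
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_trend_related(hashtags, search_string, threshold=0.6):
    """True if any hashtag is sufficiently similar to the search string.

    The threshold is a hypothetical value and would have to be tuned.
    """
    target = bigrams(search_string)
    return any(jaccard(bigrams(h), target) >= threshold for h in hashtags)


print(is_trend_related(["adidas4dd", "ootd"], "adidas4d"))
```

Compared with an exact string match, such a similarity measure also captures hashtags with minor spelling variations.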

Labeling based on Rogers’ model

The classification of community members is realized by sorting the identified trend-related postings according to their time of publishing. Only the first posting of each user that includes the respective hashtags is considered in the ranking, and only users who have mentioned the product trend at least once are included. The first 16% of users who have mentioned the product trend in their postings are labeled as OTS; the users who have mentioned it later are classified as non-OTS. The process can be repeated with a list of several trends. Users who appear in the group of OTS several times can, for instance, be marked as strong OTSs and can be considered in the training phase of the classification model to increase the performance of models.
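The labeling step can be illustrated with the following Python sketch. It assumes that the trend-related postings are available as pairs of user id and timestamp, and it rounds the 16% cut-off up to the next whole user; the rounding rule, the function name `label_ots` and the example data are assumptions.

```python
import math


def label_ots(trend_postings, share=0.16):
    """Label adopters of one product trend as 'OTS' or 'non-OTS'.

    `trend_postings` is an iterable of (user, timestamp) pairs of postings
    that mention the trend; only each user's first mention is considered.
    """
    first_mention = {}
    for user, ts in trend_postings:
        if user not in first_mention or ts < first_mention[user]:
            first_mention[user] = ts
    adopters = sorted(first_mention, key=first_mention.get)
    cutoff = math.ceil(share * len(adopters))  # assumption: round up
    return {u: ("OTS" if i < cutoff else "non-OTS")
            for i, u in enumerate(adopters)}


# Ten hypothetical users who first mention the trend on days 1 to 10.
postings = [(f"u{i}", i) for i in range(1, 11)] + [("u1", 20)]
labels = label_ots(postings)
print([u for u, label in labels.items() if label == "OTS"])
```

Repeating the function for several trends and counting how often a user is labeled as OTS would yield the group of strong OTSs mentioned above.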

4.3.2 Feature framework

The structure of the framework is based on the components of online social networking platforms (user profile, UGC, connections and interactions) and the related information (user-, content-, context- and network-related) presented in Table 2-1 (cf. section 2.2.1). Section 3 provides the input of appropriate features to measure specific characteristics based on data from OSNs. The developed framework combines measures that are used for the investigation of influential messages, influential users and the classification of social media user accounts for various purposes as well as for the identification of potential social media influencers in practice. The consideration of this broad range of features aims to cover as many components of behavior and personal characteristics as possible, and subsequently enables the detection of OTSs by patterns related to their behavior and characteristics in OSNs. As this thesis aims at the development of an OTS detection approach that is applicable in practice, the focus is on features with low complexity which are based on well-accessible data and which are transferable to different topical use cases. The calculation of features relies on nominal, binary, count, time and text data, and features include absolute and relative metrics. As Gupta et al. (2012) show in their analysis that ratio features, such as the number of retweets divided by the number of tweets, perform significantly better than other features, several of these relative measures are included in the framework. Geolocation data is not considered, as its accessibility strongly depends on the online social networking platform and often only a fraction of account data includes location information. In line with the information presented in section 2.2.1, measures that refer to connection and interaction data are summarized as network-related features. Metrics which are based on profile data are labeled as user-related features, and UGC data is described by content-related features. As prior studies have shown that context-related features, which refer to the metadata of UGC, are relevant in the context of opinion leader detection (e.g., Rosenthal and McKeown, 2017) and trend detection (e.g., Ma et al., 2012), this category is also considered in the framework. Figure 4-10 presents the feature framework and shows the four feature categories, the respective data basis, the methods applied for their calculation, and two feature examples for each category. It also underlines the process of extracting the relevant data from the respective OSN and choosing an appropriate method of SMA for calculating the features to measure different behaviors and characteristics of users.

[Figure 4-10: process from data extraction (data from OSN) via selection of methods (SMA methods) and calculation of features (feature category, feature example) to measurement of characteristics (example characteristics). Data from OSN: user profile, UGC, connections, interactions. SMA methods: content-based analytics, statistical analytics, social network analytics. Feature examples with example characteristics: user-related: biography length (extroversion), external URL (credibility); content-related: no. of distinct hashtags (topic focus), no. of distinct emojis (creativity); context-related: avg. time between posts (trend diffusion potential), avg. time to comment (engagement); network-related: follow PageRank (influence), avg. no. of comments (expertise).]

Figure 4-10 Feature framework

The four feature categories are introduced in the following:

1) User-related features

This feature category is based on data that relates to the profile of a user such as the self-descriptive text of the biography or the metadata provided by the online social networking platform, e.g., the profile type. The features aim to extract and quantify the included information about one’s interests and characteristics. For this, methods of text analytics, especially text pre-processing, as well as common statistical methods are applied. Measures like the no. of distinct emojis and hashtags, for instance, are calculated to get insights into the creativity of a user. The usage of several different emojis indicates a higher level of creativity. The length of the biography and a user’s presence on several social media platforms (presence of URLs of other social media platforms in the biography) are used as indicators for one’s level of extroversion as well as for one’s influence (cf. section 3.5). Sharing more text within the biography (longer self-descriptive text) shows the willingness of a user to reveal information about herself/himself, which is a characteristic of extroverted and communicative people. The indication of an external URL in general, however, is interpreted as a sign of one’s motivation to provide others with new or external content. More sophisticated text-mining approaches such as those applied by Rodriguez-Vidal et al. (2019) are not used to ensure simplicity and applicability of the framework. In total, this category includes 13 features (cf. Appendix A.4). Selected examples are presented in Table 4-4.

| Feature | Description | Data source (type) | Method |
|---|---|---|---|
| Biography length | No. of characters used in the biography without emojis | Profile data (text) | Text pre-processing |
| Biography mention count | No. of mentions used in the biography | Profile data (text) | Text pre-processing |
| Ratio emoji | No. of emojis relative to the no. of characters used in the biography | Profile data (text) | Text pre-processing, statistical analytics |
| Account type | Indicates if the account is used commercially | Profile data (binary) | Statistical analytics |
| External URL | Presence of an external URL in the biography, e.g., an external blog, website | Profile data (binary) | Text pre-processing |

Table 4-4 Examples of user-related features
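How such user-related features could be derived from a biography text is sketched below. The function names, the regular expressions and the emoji heuristic (Unicode category "Symbol, other") are illustrative assumptions; the exact calculation used in this thesis is documented in Appendix A.4.

```python
import re
import unicodedata


def is_emoji(ch):
    # Rough heuristic (assumption): treat Unicode "Symbol, other" characters as emojis.
    return unicodedata.category(ch) == "So"


def user_features(biography):
    emojis = [ch for ch in biography if is_emoji(ch)]
    n_chars = len(biography)
    return {
        # No. of characters without emojis
        "biography_length": n_chars - len(emojis),
        # No. of @-mentions in the biography
        "biography_mention_count": len(re.findall(r"@\w+", biography)),
        # Assumption: emojis relative to all characters, emojis included
        "ratio_emoji": len(emojis) / n_chars if n_chars else 0.0,
        # Presence of any URL-like string (hypothetical pattern)
        "external_url": bool(re.search(r"https?://\S+|www\.\S+", biography)),
    }


# Hypothetical biography text
bio = "Sneakerhead \U0001F45F @kicks_daily www.example.com"
features = user_features(bio)
```

The heuristic misses some emoji components (e.g., skin-tone modifiers), so a dedicated emoji list would be preferable in a real analysis.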

2) Content-related features

This category aims to translate the posted content into measurable values and to draw insights about a user’s communication skills, creativity, credibility, and topic focus as well as one’s level of providing other users with information. The data for the calculation of features is mainly text data as it consists of the UGC published by the user. Visual content such as images and videos is not considered to ensure that the developed approach can be applied to various OSNs, including those which mainly maintain textual data such as Twitter. A second reason for this focus on text is the difficulty of translating characteristics of online opinion leader-specific images, such as being esthetic, into measurable values as revealed by the expert interviews (cf. section 3.5). Besides, some basic metrics which are often provided by online social networking platforms, e.g., no. of postings per user, are considered. For the calculation of features, methods of text analytics and statistical analytics are applied. The usage of only a small number of different hashtags, for instance, serves as a measure for one’s topic focus, which was emphasized as a relevant criterion by some of the experts (cf. section 3.5). This measure is used instead of calculating complex topic models, which would increase the computational costs and also require manual intervention as topics depend on the use case. Furthermore, based on the findings of section 3.4, measures that quantify the inclusion of URLs within postings and the usage of questions are considered as indicators of one’s credibility. Table 4-5 provides five examples of content-related features out of 24 features from this category (cf. Appendix A.4).

| Feature | Description | Data source (type) | Method |
|---|---|---|---|
| Avg. no. of nouns | No. of nouns relative to the no. of words used in all postings of the user | UGC (text) | Text analytics (NLP) |
| Avg. no. of questions | No. of questions relative to the no. of postings of the user | UGC (text) | Text pre-processing |
| Proportion media type | No. of postings of a specific media type (e.g., images) relative to the no. of postings of another media type (e.g., videos) | UGC (count) | Statistical analysis |
| Avg. comment length | Avg. no. of characters which are used in received comments | UGC (text) | Text pre-processing |
| Ratio distinct hashtags/hashtags | No. of distinct hashtags relative to the no. of hashtags which the user has posted | UGC (text) | Text analytics (NLP) |

Table 4-5 Examples of content-related features
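Two of these content-related features can be sketched in a few lines of Python. The function name, the hashtag pattern and the rule that one question mark counts as one question are assumptions for illustration only.

```python
import re


def content_features(captions):
    """captions: list of posting texts of one user."""
    # Collect all hashtags, case-insensitively (assumption).
    hashtags = [h.lower() for c in captions for h in re.findall(r"#\w+", c)]
    # Assumption: every question mark marks one question.
    n_questions = sum(c.count("?") for c in captions)
    return {
        "ratio_distinct_hashtags": len(set(hashtags)) / len(hashtags) if hashtags else 0.0,
        "avg_questions": n_questions / len(captions) if captions else 0.0,
    }


# Hypothetical captions of one user
captions = ["New drop #sneakers #Jordan", "Cop or drop? #sneakers", "#kicks"]
features = content_features(captions)
```

A low ratio of distinct hashtags means that the user keeps reusing the same hashtags, which is read as a strong topic focus in the framework.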

Although some of the investigated studies in section 3 include sentiment-related features, they are excluded from this framework as the classification of content into negative, positive, and neutral sentiments is still challenging, especially when applied to data from OSNs. The reason for this is the style of communication used in OSNs as it often includes sarcasm and irony. Current methods still do not achieve a satisfactory sentiment classification performance on texts that include such linguistic patterns (Eke et al., 2019).


3) Context-related features

Features of this category aim to capture the time dimension, which has previously been identified as relevant regarding trendsetting (cf. section 3.4). This category encompasses features measuring speed and frequency, and describes a user’s activity as well as the activity of the user’s social contacts. For the calculation of features, e.g., the avg. postings per day, time-related metadata is used and statistical analyses are applied. The avg. time between posts, for instance, shows a user’s tendency to share information. Therefore, it indicates a user’s potential to spread new trends as well as her/his potential to influence other users within the community. The avg. time to comment describes the reaction of the community to a user’s published content. Fast reactions can be an indicator of fast diffusion of the posted information or the new trend across the members of a social network. Table 4-6 shows five of the seven context-related features (cf. Appendix A.4).

| Feature | Description | Data source (type) | Method |
|---|---|---|---|
| Avg. postings per day | No. of postings divided by the no. of days between the first and the last posting of the user | UGC (time) | Statistical analysis |
| SD time between postings | SD of the time between two sequential postings | UGC (time) | Statistical analysis |
| Avg. time to comment | Avg. time between a posting and each of the respective comments | UGC (time) | Statistical analysis |
| Minimum time to comment | Time between a posting and a first comment | UGC (time) | Statistical analysis |
| Evolution no. of comments | Development of the no. of received comments per posting over time | UGC (time) | Statistical analysis |

Table 4-6 Examples of context-related features
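The time-based features can be sketched with the Python standard library. The function name, the input format, the use of hours as unit and the choice of the sample standard deviation are assumptions; Appendix A.4 remains the reference for the actual calculation.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev


def context_features(post_times, comment_times):
    """post_times: posting datetimes; comment_times: posting datetime -> comment datetimes."""
    posts = sorted(post_times)
    # Days between the first and the last posting of the user
    active_days = (posts[-1] - posts[0]).total_seconds() / 86400
    # Hours between two sequential postings
    gaps_h = [(b - a).total_seconds() / 3600 for a, b in zip(posts, posts[1:])]
    # Hours between a posting and each of its comments
    delays_h = [(c - p).total_seconds() / 3600
                for p, comments in comment_times.items() for c in comments]
    return {
        "avg_postings_per_day": len(posts) / active_days if active_days else float(len(posts)),
        # Assumption: sample standard deviation, 0.0 if fewer than two gaps exist
        "sd_time_between_postings_h": stdev(gaps_h) if len(gaps_h) > 1 else 0.0,
        "avg_time_to_comment_h": mean(delays_h) if delays_h else None,
    }


# Hypothetical posting and comment history
t0 = datetime(2020, 1, 1)
posts = [t0, t0 + timedelta(days=1), t0 + timedelta(days=4)]
comments = {posts[0]: [t0 + timedelta(hours=1), t0 + timedelta(hours=3)],
            posts[1]: [posts[1] + timedelta(hours=2)]}
features = context_features(posts, comments)
```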

4) Network-related features

As previous studies emphasize the importance of a user’s position and interactions within the network as a precondition of her/his influence on the community (cf. section 3.4), the category of network-related features is included in the feature framework. The involved features mainly describe one’s expertise and one’s social environment such as the type and level of social connections and interactions. The calculation of these features is based on the connection and interaction data of OSNs, and particularly requires the application of SNA.


Centrality measures such as in-degree, out-degree, betweenness, closeness, eigenvector or PageRank based on the follow- and follower-relation, for instance, aim to quantify one’s position in the community and the ability to control information flows. Therefore, they serve as an indicator of one’s potential influence. The same centrality measures are additionally calculated on interaction data such as mentioning or commenting and added to the framework to quantify one’s interconnectedness and to measure the quality of social connections. As outlined by one of the experts, mentioning can be interpreted as personal addressing, and therefore, can be regarded as an indicator of the quality of a connection (cf. section 3.5). Additionally, network-interaction features, e.g., no. of comments per post, are considered as a measure for one’s expertise as suggested in prior work (cf. Table 3-1). Selected examples of the 69 network-related features are presented in Table 4-7.

| Feature | Description | Data source (type) | Method |
|---|---|---|---|
| Follow betweenness | Centrality measure: describes the degree to which other users are connected via the follow-relation of the user | Network – connection (count) | SNA (actor-level measure) |
| Media tag in-degree | Centrality measure: no. of times the user is tagged in images of other users within the community | Network – interaction (count) | SNA (actor-level measure) |
| Avg. comment-owner ratio | No. of distinct comment owners relative to the no. of comments per posting of the user | Network – interaction (count) | Statistical analysis |
| Ratio follows - followers | No. of follows relative to the no. of followers | Network – connection (count) | Statistical analysis |
| Ratio postings with mentions - postings | No. of postings with a mention relative to the no. of postings of the user | Network – interaction (count) | Statistical analysis |

Table 4-7 Examples of network-related features
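The three ratio features of Table 4-7 can be sketched without any network library. The function name, the argument layout and the averaging of the comment-owner ratio over postings are assumptions made for illustration.

```python
import re
from statistics import mean


def network_ratio_features(n_follows, n_followers, captions, comment_owners_per_post):
    """comment_owners_per_post: one list of comment author names per posting."""
    with_mention = sum(1 for c in captions if re.search(r"@\w+", c))
    # Distinct comment owners relative to the no. of comments, per posting
    owner_ratios = [len(set(owners)) / len(owners)
                    for owners in comment_owners_per_post if owners]
    return {
        "ratio_follows_followers": n_follows / n_followers if n_followers else 0.0,
        "ratio_postings_with_mentions": with_mention / len(captions) if captions else 0.0,
        # Assumption: per-posting ratios are averaged over all commented postings
        "avg_comment_owner_ratio": mean(owner_ratios) if owner_ratios else 0.0,
    }


# Hypothetical account data
features = network_ratio_features(
    n_follows=200,
    n_followers=1000,
    captions=["with @a", "solo", "@b @c", "none"],
    comment_owners_per_post=[["x", "y", "x", "z"], ["x", "x"]],
)
```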

The whole feature list with the respective calculation and description is included in the appendix of this thesis (cf. A.4). The developed framework provides a list of features which aims to describe humans’ behaviors and characteristics based on social media data. As outlined in section 2.2.1, the functions of online social networking platforms vary as well as the accessibility of data. Therefore, the list of features serves as orientation, but needs to be adapted to the conditions of the respective OSN.

4.4 Interim conclusion

Figure 4-11 summarizes the TSIM concept and presents the different steps of the OTS identification approach with the respective outcomes. Regarding the considered data volume, the process resembles a funnel as it starts with the extraction of a sample (topic-focused community) from the enormous data pool of the targeted OSN to then identify specific members of the community (OTS). The process, therefore, reduces the huge available data source to a relevant group of users with specific characteristics. The identification of those characteristics is also part of the investigation and is realized within the last process step (cf. Figure 4-11).

[Figure 4-11: process diagram of the TSIM concept with four steps and their outputs. Topic definition (exploration of problem space, list of potential keywords, most used keywords in the specific field); output: hashtags which capture the topic area. Topic-focused community detection approach (initial seed users, iterative filtering of bots and commercial accounts, scoring, adding the follows of top-scored users until no new user appears, check whether the target community is OK); output: topic-focused community and related data from the targeted OSN which is required for the calculation of features. Development of classification model (goal definition, selection of performance metrics, definition of relevant data and sample size, data collection, data labeling, data pre-processing, feature calculation, data splitting, algorithm selection, data resampling, feature selection, training, hyperparameter tuning, evaluation with test set, export model); output: model to classify community members as OTSs and non-OTSs, and knowledge about the relevant features for the classification decision. OTS characteristics (investigation of relevant features); output: insights about behaviors and characteristics of OTSs based on the relevant features.]

Figure 4-11 TSIM concept overview

In the following section, the developed concept is applied to real data. The analysis consists of two parts. The first part aims to validate the conceptualized topic-focused community detection approach. The second part focuses on the development of an appropriate classification model by first realizing the labeling concept and then, based on the resulting dataset, training different algorithms to identify the best classifier regarding state-of-the-art performance measures. In a last step, those features which are relevant for the model to decide on the class are investigated to reveal insights about OTSs based on their digital trace.

5 Identification of online trendsetters by advanced analytics

5.1 Use case description

The application of the previously developed concept to real data aims to evaluate the performance of the conceptualized approach as well as its further development. Therefore, the community detection approach is used to extract a community with a topic focus on sneakers (cf. section 5.2). The OSN Instagram is selected as the data source for different reasons which are outlined in section 5.1.2. Based on the resulting dataset, in section 5.3 a classification model is developed. For this, the members of the identified sneaker-focused community are labeled as OTS and non-OTS according to the developed labeling concept (cf. section 4.3.1). The resulting dataset is, subsequently, used for the training and testing of selected algorithms and ensemble learning methods. Afterwards, their performances regarding the classification task are compared to each other based on different state-of-the- art performance measures. The best performing model finally is added to the TSIM.

5.1.1 Sneaker trends

The specific fashion domain of sneakers is chosen as use case for two reasons. The first is its relevance for fashion companies. The sneaker culture, and therefore, the sneaker industry has recently gained popularity within the fashion industry. As a consequence, the worldwide sneaker market experiences considerable growth, and the interest of fashion-related companies in sneaker trends steadily increases (Guan, 2020; Hyun and Koh, 2020). Simultaneously, a secondary market (resell market) develops as some sports companies, e.g., adidas or Nike, release limited numbers of specific shoe models such as the adidas Yeezy line or the Nike Jordan line. Sneaker models of these series often sell out within minutes on the website of the respective brand. Online marketplaces such as GOAT or StockX profit from this development and provide a platform to sell and buy popular sneaker models (Watts, 2019). The second reason for choosing this domain as the use case is the limited product lifecycles and the specific product names of sneakers. The market is characterized by regular launches of new sneaker models with individual recognizable model names, e.g., Nike Air Jordan 11 Concord (Denny, 2020). This allows capturing a popular sneaker model (= product trend) within a social media posting by pre-defined search strings. A new popular sneaker model is considered as a trend and enables the analysis of its diffusion across a social network by investigating posting data of a time range of one to two years (length of lifecycle). In general, data of such a limited time range, which consists of the posting history of around two years, is accessible on online social networking platforms.

5.1.2 Instagram

Due to Instagram’s high relevance for the fashion industry (cf. section 1.1) and its potential to spread new fashion trends fast (Casaló et al., 2020), this OSN is selected for the data analysis. The platform was launched in 2010 and has one billion monthly active users as well as 95 million daily postings in 2018 (Clement, 2020) which emphasizes the huge volume of available data. To participate on the platform, users have to create a profile that consists of a unique username, a profile photo and a self-descriptive text, namely biography. Furthermore, the user can choose between different account types and profile settings. If a person or a company creates an account for business purposes, they can select the business account option which provides additional statistics, for instance, the number of views of a post. A private person can choose between private and public account settings. The public mode allows all users of the platform to access the published content of the respective user. A private account setting, however, restricts the accessibility of content to the group of a user’s followers. Therefore, participants of the platform can create a contact list by following others (follows) or permit others to follow them, and provide them access to their published content (in case of private account setting). The connections on Instagram are non- reciprocal, thus, users can decide to follow a user’s updates without this user subscribing back (Hu et al., 2014). Within the data analysis, only public accounts are considered due to legal restrictions, the accessibility of data, and their higher relevance regarding trend creation and diffusion. In contrast to other OSNs such as Facebook, Instagram users more commonly keep public profiles, and thus, allow all other users on the platform to interact with the published content, e.g., view, like and comment on postings (Hu et al., 2014). This further fosters the fast diffusion of trends. 
Instagram offers the option of one-to-one communication via a chat function and one-to-many communication via a user’s personal stream. In this personal stream, users can share content, especially images and videos, which are always combined with a textual description, a so-called caption. Similar to other OSNs, e.g., Twitter or YouTube, these descriptions often include hashtags to highlight the topic of the posting and to get attention. Instagram also offers functions that enable its users to interact with each other by commenting, sharing, tagging, mentioning others in their posted images, videos and captions, or by liking others’ content (Hu et al., 2014). Based on the components of an online social networking platform, Table 5-1 provides an overview of the above described Instagram functions, the related relevant data and the respective methods to extract them for data analysis purposes. Within the thesis, only publicly accessible data is collected and data protection principles are considered.

| Component | Function and data | Usage in data analysis | Data access |
|---|---|---|---|
| User profile | Self-descriptive text: biography (text) | Community detection: filtering (filter words), scoring (keywords); feature extraction: user-related features | HTML-site |
| User profile | Account setting: public/private (metadata) | Community detection: filtering | HTML-site |
| User profile | Account type: business account/person (metadata) | Feature extraction: user-related features | HTML-site |
| UGC | Posting (text, image, video): quantity, time (metadata); content (text) | Community detection: filtering, scoring; feature extraction: content-related features, context-related features | Instaloader |
| Connections | Contact list: follows and followers list | Community detection: filtering, ranking, scoring; feature extraction: network-related features | Instagram API |
| Interaction | Mentioning, tagging (@) | Feature extraction: network-related features | Instaloader |
| Interaction | Commenting: quantity, time (metadata); content (text) | Community detection: scoring; feature extraction: content-related features, context-related features, network-related features | Instaloader |
| Interaction | Liking (metadata) | Community detection: scoring; feature extraction: context-related features, network-related features | Instaloader |

Table 5-1 Instagram functions, related data, and data access

As Table 5-1 indicates, there are different ways of accessing Instagram data. First, the platform provides an API which allows the extraction of information such as the follower-relations of users. For the data analysis, this Instagram API is used to extract the connection data. Furthermore, Instaloader⁶ is utilized to download the postings and the related metadata. For this purpose, publicly available scripts from GitHub are used. The profile information is extracted from the HTML site of the respective accounts.

5.2 Topic-focused community detection

As previously outlined, the OSN Instagram is selected for this experiment. The data analysis is realized in Python, using the respective Python packages executed in Jupyter notebooks. The resulting community is evaluated along two criteria, the connectivity of community members and their topic focus, as the objective of the approach is a community that fulfills these two aspects (cf. section 4.2). The realization of the data analysis and the evaluation of results are based on different SMA methods. The following section indicates the key techniques used for the data analysis. Afterwards, section 5.2.3 presents the results of the analysis and the evaluation.

5.2.1 Methods

As outlined in section 4.2.1, statistical, network, and content metrics are considered in the iterative process of the community detection approach. The calculation of content metrics especially requires the application of several text pre-processing steps which are indicated in section 2.3.2 of this thesis. The evaluation is additionally based on techniques of topic modeling to validate the topic focus of the community. Figure 5-1 provides an overview of applied methods, the respective data basis and the targeted output. The steps and respective methods of the data analysis and the evaluation are emphasized in the following.

6 https://instaloader.github.io

[Figure 5-1: overview diagram. Unstructured data from Instagram (user profile: biography text; UGC: posting text data) is processed with text analytics along the text analytics framework: pre-processing (stop word removal, tokenization, lemmatization), representation (BOW, TF-IDF) and knowledge discovery (topic modeling: NMF). Structured data from Instagram (connections: no. of followers and follows, follower-follow relations) is processed with social network analytics: actor-level centrality measures (PageRank, in-degree) and network-level centrality measures (in-degree centralization, average degree). Application in the community detection process: filter words in biography and posting text (filtering), biography score and text score (scoring), identification of potential community members (ranking). Application in the evaluation: discovery of the topics of community postings, validation of connectivity, validation of network centrality and connectivity.]

BOW = Bag-of-Words; TF-IDF = term frequency-inverse document frequency approach; NMF = Non-negative Matrix Factorization

Figure 5-1 Community detection – required data and applied SMA methods

As Figure 5-1 shows, SNA is applied to the connection data extracted from the social networking platform Instagram to calculate two centrality measures on actor-level. Besides PageRank, which was previously introduced in section 4.2.1, in-degree based on the follow-relation is considered to prove the presence of connections between community members within the evaluation phase. Formula (8) indicates the calculation of the in-degree $k_u^{+}$ of a user $u$. Thereby $a_{uf} = 1$ represents the presence of a link between a user $u$ and a follower $f$, whereas $a_{uf} = 0$ represents the absence of such a link (Oliveira and Gama, 2012).

$$k_u^{+} = \sum_{f \in C,\; f \neq u} a_{uf} \qquad (8)$$

$k_u^{+}$ = in-degree of a user $u$; $u$ = user, $u \in C$; $f$ = follower; $C$ = set of community members; $a_{uf}$ = link between $u$ and $f$, where the presence of a link is represented by $a_{uf} = 1$
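Formula (8) and the two community-level metrics derived from the in-degree values can be sketched in plain Python. The function names and the edge-list input are assumptions for illustration, and the centralization follows Freeman's (1978) general definition for directed networks, where the maximal possible degree variation is $(n-1)^2$; the thesis itself relies on NetworkX for these calculations.

```python
def in_degrees(members, follows):
    """Formula (8): count the followers of each user inside the community.

    follows: iterable of (follower, followed) pairs.
    """
    members = set(members)
    k = {u: 0 for u in members}
    for f, u in set(follows):  # a_uf = 1 for every distinct follower link
        if f in members and u in members and f != u:
            k[u] += 1
    return k


def in_degree_centralization(k):
    # Sum of differences to the maximal in-degree, divided by the maximum
    # possible sum (n - 1)^2, which is reached by a directed star network.
    n = len(k)
    if n < 2:
        return 0.0
    k_max = max(k.values())
    return sum(k_max - v for v in k.values()) / (n - 1) ** 2


def average_in_degree(k):
    return sum(k.values()) / len(k)


# Hypothetical communities: a star (everyone follows "a") and a ring.
members = ["a", "b", "c", "d"]
star = in_degrees(members, [("b", "a"), ("c", "a"), ("d", "a")])
ring = in_degrees(members, [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")])
```

In the star, all attention is directed at one actor and the centralization is one; in the ring, every member has the same in-degree and the centralization is zero.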

For the calculation of the in-degree measure, the Python package NetworkX⁷ is used. Based on the degree centrality values of the users, the degree centralization metric and the average degree of the whole community are calculated. These metrics enable the comparison of the information flow and connectedness of the potential community members in the initial iteration with the final target community. Degree centralization is used in SNA as an indicator for the level of centralization of a network towards only a few or one single actor such as opinion leaders. It is calculated as the variation of the users’ degrees of the network (e.g., a community) divided by the maximal degree variation which is possible in a network of the same size. Its value ranges between zero and one, where a value of one refers to a network that is centralized towards one single actor, and a value of zero refers to a network where each actor has the same level of degree centrality (Freeman, 1978; Wasserman and Faust, 1994). A network with more sparsely connected central actors has a low centralization value (Himelboim, 2017). Degree centralization, therefore, is also an indicator of the connectedness of a community and its information diffusion potential. The density value which is commonly used in SNA as an indicator for a network’s connectedness, however, is not applied as this measure is inappropriate for the comparison of networks of different sizes, which is the case in the detection process (Stokman, 2001; Ghali et al., 2012). Besides, the realization of filtering and scoring steps within the community detection process as well as the evaluation of the approach are based on text analytics methods (cf. Figure 5-1). For a better understanding of these methods in the context of this thesis, Table 5-2 presents relevant concepts with use case-related examples.

7 https://networkx.org/

| Concept | Use case-related examples |
|---|---|
| (Text) document | Biography text, posting text, comment text |
| (Text) corpus | Collection of all biography/posting/comment texts published by a community |
| Token | A single word of a posting text, e.g., “I”, “love”, “sneaker” |
| Lemma | The base form of a word, e.g., “be” is the base form of “was”, “are”, “is”, … |

Table 5-2 Relevant concepts of applied methods and use case related examples

Text pre-processing

The calculation of biography and text scores within the community detection process as well as the application of the topic modeling algorithm require several steps of text pre-processing. The aim of text pre-processing is to transfer the input documents, e.g., the biography text, into consistent data which enables the application of text representation methods (cf. Figure 5-1). The applied pre-processing steps consist of stop word removal, tokenization and lemmatization. These steps are realized by using spaCy⁸, an open-source natural language processing library for Python.

Stop word removal eliminates words that are considered general and carry little meaning. For instance, words like “a”, “as”, “at”, “by” or “but” that only have grammatical significance are eliminated (Stieglitz and Dang-Xuan, 2013). For the stop word elimination, the pre-loaded spaCy list of stop words is used. Within the evaluation phase, this list of stop words is manually extended by further words such as the word “sneakers”, which is used as the seed keyword, and therefore, has no additional value. Afterwards, the text is tokenized. Tokenization fragments the sentence “I love sneaker”, for instance, into its three tokens “I”, “love” and “sneaker”. In a next step, the single tokens of each document (e.g., posting or biography text) are lemmatized and separately appended to the dataset for later analysis, e.g., a Term Frequency-Inverse Document Frequency (TF-IDF) analysis. Lemmatization refers to the process in which a word is reduced to its base form, the lemma. Some words like nouns, adjectives, and verbs are inflected but have the same underlying essential meaning, and therefore, refer to the same common base form (Manning et al., 2008). The words “loves”, “loved”, “loving”, for instance, are attributed to their base form “love”. Lemmatization also recognizes more difficult inflections and can attribute the base form “be” to the words “is”, “was”, “am”, “are”. Through lemmatization, the original texts of the dataset are each transformed into a list of single items, which are the lemmas of the original words. Based on the pre-processed data, potential community members are filtered out and the biography and text scores are calculated according to a pre-defined word list as described in section 4.2.1. These pre-processing steps are also applied for calculating certain content features within the feature extraction phase during the development of the classification model, which is described in section 5.3.2.
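The three pre-processing steps can be illustrated with a toy stand-in for the spaCy pipeline. The stop word set and the lemma lookup table below are hypothetical miniature versions of the resources that spaCy provides, and the sketch tokenizes before removing stop words because the stop word check operates on tokens.

```python
import re

# Hypothetical miniature stop word list (spaCy ships a much larger one).
STOP_WORDS = {"a", "as", "at", "by", "but", "i", "the"}
# Hypothetical lemma lookup table standing in for a real lemmatizer.
LEMMAS = {"loves": "love", "loved": "love", "loving": "love",
          "is": "be", "was": "be", "am": "be", "are": "be"}


def preprocess(text, extra_stop_words=()):
    stop = STOP_WORDS | set(extra_stop_words)
    tokens = re.findall(r"[a-z]+", text.lower())   # tokenization
    kept = [t for t in tokens if t not in stop]    # stop word removal
    return [LEMMAS.get(t, t) for t in kept]        # lemmatization


simple = preprocess("I love sneaker")
# The seed keyword is added to the stop words, as done in the evaluation phase.
evaluation = preprocess("She loved the sneakers", extra_stop_words={"sneakers"})
inflected = preprocess("They are loving it")
```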

8 https://spacy.io

Text representation

Machine learning algorithms such as the ones used for topic modeling as well as statistical approaches usually operate on a numeric feature base. Within the text representation phase, the text documents are transformed to enable the application of such methods on text data. Within the data analysis, the BOW approach is used for the transformation of documents into numerical vectors. Each document is treated as a collection of words, ignoring grammar and word order (Bengfort et al., 2018). To consider also the importance of a word for the individual text document with respect to the context of the text corpus, the TF-IDF approach is applied as weighting scheme. The central idea of TF-IDF is that a significant meaning is more likely encoded in rare terms than in the most frequent ones. A term, which is frequent in only a single text but not in the others, is important as it contributes to the local meaning of this particular text. However, if a term occurs many times in one text, but also in all other texts, it is not individual or thematically differentiating. Therefore, such terms are penalized by the weighting scheme (Bengfort et al., 2018). In a text about sneakers, for instance, less frequent words like “vegan” or “leather” are more meaningful to distinguish text documents, than the frequently used words “kicks”, “shoe” and “footwear”. TF-IDF is a two-step approach that starts with counting the term frequency. In the following, the terms are ranked according to their frequency as a high frequency is assumed to indicate high importance. Within the second step, the inverse document frequency is calculated. Words that occur in many documents of the text corpus are penalized as they provide less value regarding a document's relevance within the text corpus. TF-IDF, consequently, assigns a term the highest weight when it occurs often within only a small number of documents and the lowest weight when the term occurs often in all of the documents (Manning et al., 2008). For the realization of TF-IDF, the machine learning library scikit-learn is used. The scikit-learn TfidfVectorizer is applied to transform the biography and posting texts into a matrix of TF-IDF features. Terms with a low frequency that only enlarge the data or terms that occur too many times, and therefore, do not provide additional document-individual meaning, are excluded.
As adding n-grams as a contextual feature extraction to the BOW approach can significantly improve the performance of a model (Bengfort et al., 2018), the “ngram_range” of the TfidfVectorizer is set to bigrams. N-grams is a feature extraction method that recognizes the context in which words appear. Bigrams, for instance, are n-grams where n = 2, and represent two words that are next to each other in the text document (Bengfort et al., 2018). To increase the time performance, the lemmatized words are used for processing the text corpus into TF-IDF-weighted document-term matrices as this decreases the size of the document-term matrix, and therefore, the number of operations (Truica et al., 2016). The result of this step is the input for the topic modeling algorithm which enables the discovery of latent topics included in the posting texts of community members.
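The text representation step described above can be sketched as follows. This is a minimal illustration on a toy corpus, not the configuration used in the thesis: the documents are invented, the thresholds `min_df` and `max_df` are hypothetical, and `ngram_range=(1, 2)` (unigrams plus bigrams) is an assumption, since the text only states that the range is "set to bigrams".

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy posting texts, assumed to be lemmatized already (illustrative only).
docs = [
    "vegan leather sneaker release",
    "sneaker release date kick",
    "classic leather sneaker kick",
    "sneaker kick collection",
]

# ngram_range=(1, 2): unigrams and bigrams (assumption; (2, 2) would keep bigrams only).
# min_df / max_df: drop very rare and very frequent terms (hypothetical thresholds).
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1, max_df=0.9)
X = vectorizer.fit_transform(docs)  # sparse TF-IDF-weighted document-term matrix

vocab = vectorizer.vocabulary_      # maps each term or bigram to its column index

# "sneaker" occurs in every document, exceeds max_df and is therefore excluded.
print("sneaker" in vocab)
# In the first document, the rarer term "vegan" outweighs the more common "leather".
print(X[0, vocab["vegan"]], X[0, vocab["leather"]])
```

The comparison in the last line mirrors the sneaker example from the text: both terms occur once in the document, so the difference in weight stems solely from the inverse document frequency.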


Knowledge discovery using topic modeling For the validation of the topic focus of the resulting community, a topic modeling approach is chosen as it recognizes latent topics in text data by detecting hidden semantic structures and meanings of documents, e.g., a posting text. In contrast to clustering, which segments a text corpus into topical clusters where each text document is associated with one single topic, topic modeling follows a soft clustering approach and determines the membership probability for each text document to a topical cluster (Aggarwal and Zhai, 2012). In the context of this thesis, this means that one single posting out of the collection of all community postings can address one or more of all discussed topics within the community. Thus, a posting text is distributed over different topics and a topic is distributed over different keywords (Choo et al., 2013). Topic modeling, therefore, can also be understood as an unsupervised machine learning approach for soft clustering where a document is represented as a combination of weighted clusters or topics, and where the cluster weights represent how closely a document is related to the respective topic. Especially, the keyword-wise topic representation is the strength of topic modeling as it does not solely calculate the proximity of a document to the different topics, but rather shows the semantic meanings of the topics based on the keywords. It describes topics as weighted combinations of keywords, where the weight indicates how closely a keyword is related to a topic. Semantically coherent documents form groups or clusters, the semantic meaning of such a cluster or topic is then represented by frequent keywords of the topic (Choo et al., 2013). This enables the creation of a semantic summary of the text corpus (e.g., all community postings) (Ramamonjisoa et al., 2015). 
LDA and Non-negative Matrix Factorization (NMF) are two prominent and widely used topic modeling schemes in research and practice (Chen et al., 2019). As experiments demonstrate that NMF models produce topics of better quality than LDA and generate more meaningful topics when applied to short texts (Chawla, 2017; Bakharia, 2016; Ramamonjisoa et al., 2015), the NMF algorithm is selected for the realization of the evaluation task. As the text length of postings on online social networking platforms is often limited to a specific size, this approach is deemed appropriate for the targeted goal. Moreover, this algorithm overcomes certain issues of LDA that have been criticized in recent years, such as the inconsistency of multiple runs and the lack of empirical convergence (Choo et al., 2013). NMF is a linear-algebraic model and follows a different mathematical basis than LDA. It factors a high-dimensional document-term matrix into two low-dimensional matrices (Chawla, 2017; Bakharia, 2016). As presented in Figure 5-2, the model’s input is a BOW matrix X, which is TF-IDF normalized, and a user-specified number of topics k (Kalyanam et al., 2015). Thus, a topic represents a weighted combination of m terms, where a high weight value indicates that the topic is closely related to that term (Choo et al., 2013). NMF represents an optimization problem in which the error of the reconstruction of X through W × H is minimized. W contains the membership weights of the topics in each document (document-topic matrix) and H contains the membership weights of the terms in each topic (topic-term matrix) (cf. Figure 5-2). W and H are adjusted iteratively until the Euclidean distance between X and W × H is minimized, which means that either the product approximates X sufficiently well or a specified number of iterations is reached (Bakharia, 2016; Ramamonjisoa et al., 2015).
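Written out, the optimization problem can be stated as follows. This is the common formulation of NMF with a Euclidean (Frobenius) reconstruction error and non-negativity constraints; the thesis does not print the objective itself, so the exact form, in particular the absence of regularization terms, is an assumption.

```latex
\min_{W \ge 0,\; H \ge 0} \; \lVert X - W H \rVert_F^2
  \;=\; \sum_{i=1}^{n} \sum_{j=1}^{m} \bigl( X_{ij} - (W H)_{ij} \bigr)^2 ,
\qquad X \in \mathbb{R}^{n \times m},\;
       W \in \mathbb{R}^{n \times k},\;
       H \in \mathbb{R}^{k \times m}
```

Here n is the number of documents, m the number of terms and k the number of topics.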

[Figure: matrix decomposition X ≈ W × H, with the input matrix X (documents × terms), the factor W (documents × topics) and the factor H (topics × terms)]

Notation                               Description
n                                      No. of documents
m                                      No. of terms
k                                      No. of topics
$X \in \mathbb{R}^{n \times m}$        Document-term matrix
$W \in \mathbb{R}^{n \times k}$        Document-topic matrix
$H \in \mathbb{R}^{k \times m}$        Topic-term matrix
$w_i \in \mathbb{R}^{1 \times k}$      Topic-wise representation of document i
$h_l \in \mathbb{R}^{1 \times m}$      Term-wise representation of the l-th topic

Figure 5-2 Matrix decomposition and notations of NMF algorithm (based on Choo et al., 2013)

The output of NMF is a topic-wise document representation and a term-wise topic representation (Choo et al., 2013; Ramamonjisoa et al., 2015). The number of topics k is one of the most important parameters in topic modeling (Liu and Jansson, 2017). At the same time, it poses a challenge, since the optimal number of topics is not calculated automatically by the NMF model. A k-value has to be determined that results in meaningful, interpretable topics and simultaneously enables a fair content representation. Therefore, for the evaluation of the community's topic focus, the quality of the topics produced in different runs with a varying k is assessed manually. k-values between four and ten are tested, as this range is deemed sufficient to prove the focus on sneaker-related topics or to uncover a diversity of topics which are not related to the sneaker domain. The output of topic modeling is a set of topics that describe the fields of interest of the community members based on their text postings.


5.2.2 Process initialization and iterations

The application of the conceptualized detection process on real data using the previously introduced methods is summarized in the following.

Initialization

In the first step, the objectives of the community detection process are clarified. These are the topic focus on the fashion domain of sneakers and a target community size of 750 members. As the resulting community serves as training and testing data for the development of the classification model, it is important to obtain an appropriate number of samples for this task. According to the literature, a sample size between 500 and 1,000 users is deemed sufficient to achieve satisfying results, and several experiments within this research project have shown that the detection process converges best within a range of 600 to 1,000 users. The target size is therefore set to 750, which lies within both ranges. For the calculation of the PPD (except for the initial seed users), the 12 most recent posts of each user are considered to keep the iterative process efficient. The reason for this is the crawling limit, which is set to 12 postings. The definition of potential initial seed hashtags as well as the creation of the list of keywords is based on the recommendations of experts, who are either active members of the worldwide sneaker community or have a professional background as product designers or managers in the fashion industry, especially in the sneaker environment. The list of filter words is the result of a manual screening of around 5,000 user accounts and their respective postings on Instagram as well as the experiences of the experts. Table 5-3 presents the variables that need to be defined individually for each detection process with regard to the objectives, together with the values selected for the use case.

Variables                              Values

Target size m                          750
Considered no. of potential members    7,500
Top users percentage                   10%
Considered posting volume x            12 most recent postings per user
Filter words                           Worldwide ship, shipping world, ship world, reseller, retailer, store, sell, sale, trade
Keywords                               Category I (weight = 1): fashion, lifestyle, style, design
                                       Category II (weight = 2): streetwear, kicks, deadstock, sneakerholic
                                       Category III (weight = 3): sneakerhead, sneaker, sneakers

Table 5-3 Use case-specific settings – variables and values (sneaker)
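How the settings of Table 5-3 might enter the filtering and text scoring can be sketched as follows. The filter words, keyword weights and the PPD limit are taken from the table and the text. The matching logic (substring matching for filter words, summing the weights of matched keywords) and all function names are hypothetical simplifications; the actual scoring rules are defined in section 4.2.2.

```python
# Settings copied from Table 5-3.
FILTER_WORDS = [
    "worldwide ship", "shipping world", "ship world",
    "reseller", "retailer", "store", "sell", "sale", "trade",
]
KEYWORD_WEIGHTS = {
    "fashion": 1, "lifestyle": 1, "style": 1, "design": 1,           # Category I
    "streetwear": 2, "kicks": 2, "deadstock": 2, "sneakerholic": 2,  # Category II
    "sneakerhead": 3, "sneaker": 3, "sneakers": 3,                   # Category III
}
MAX_PPD = 12  # users with a PPD above this value are filtered out


def contains_filter_word(text):
    """True if any filter word occurs in the text (simple substring match, an assumption)."""
    lowered = text.lower()
    return any(word in lowered for word in FILTER_WORDS)


def text_score(text):
    """Hypothetical text score: sum of the weights of all keywords found as tokens."""
    tokens = (tok.strip("#.,!?") for tok in text.lower().split())
    return sum(KEYWORD_WEIGHTS.get(tok, 0) for tok in tokens)


def keep_user(ppd):
    """True if the user's posts per day do not exceed the limit."""
    return ppd <= MAX_PPD


print(contains_filter_word("Worldwide shipping available, trusted reseller"))
print(text_score("#sneakerhead style kicks"))
```

Substring matching is deliberately crude (for example, "store" also matches "restore"); a token- or lemma-based match would be a natural refinement.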

Iteration zero

Iteration zero differs from the subsequent iterations, as it starts with the definition of an appropriate keyword that captures the targeted topic area. The initial keyword is set to “sneakers”, as this is the most used hashtag on Instagram in the group of words pre-defined by the experts (36.2 billion postings including the hashtag at the time of data collection). The initial data collection, which was conducted on May 4th, 2019, encompasses the most recent and some of the most relevant postings with the defined hashtag, 1,160,152 postings in total. These postings were published by 202,879 unique users, the initial seed users. The postings span from 2012 to 2019, with 99.5% of the postings published in 2019. As Instagram does not explain its display algorithm, it is not transparent why not only the most recent postings but also some postings from previous years appear within this posting collection. The collected data contains the posting texts, the respective number of comments and likes, the posting time, and the number of postings per user within the collection of postings. As presented previously in Figure 4-5, filter and scoring criteria are calculated based on this data. Postings that include one of the defined filter words (cf. Table 5-3), as well as users with a PPD > 12, are filtered out. The remaining group of 162,572 users is evaluated according to the activity scoring criteria as well as according to the text score values. In this respect, iteration zero also differs from the following iterations, which involve additional scoring criteria as explained in section 4.2.2. As the targeted size of the community is m = 750 users, the threshold value n of potential community members who are considered within the iterative process is set to 7,500. The application of the consensus algorithm in iteration zero, therefore, results in a group of 7,500 top-scored users, the potential community members. For this group, additional data is extracted as described in section 4.2.3.

Based on this new database, the filtering and scoring procedure is repeated with additional metrics, presented in Table 4-2, and results in the group of potential community members. The calculation of network and activity scores is based on a normal distribution of percentiles of the respective metric values (cf. section 4.2.2). Figure 5-3 presents the calculation steps that result in the final score values, illustrated using the example of the metric CPP.


[Figure: three-step procedure for the calculation of score values]

Step 1, calculation of detection metrics: the detection metrics are calculated for all community members (network: no. of follows, FFR; activity: PPD, CPP, LPP). Output: values of each detection metric for all community members, e.g., the frequency and distribution of CPP-values within the example community.
Step 2, calculation of percentiles: the percentile values are calculated based on the metric values of all community members for the respective detection metric. Output: percentiles with the respective CPP-values.
Step 3, calculation of score values: the percentile which is scored the highest is defined (CPP: maximal score = 100% quantile), and a normal distribution over the CPP-values of the respective percentiles is applied. Output: CPP-scores, e.g., a CPP-value of 4.08 receives a score of 0.22.

Percentile    CPP-value    CPP-score
min           0.0          0.0
10%           1.33         0.08
20%           2.58         0.14
30%           4.08         0.22
40%           6.00         0.33
50%           8.58         0.46
60%           12.33        0.61
70%           18.42        0.75
...           ...          ...

Figure 5-3 Score value calculation, example: CPP
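Once the percentile values and the corresponding scores are known, assigning a score to an individual user reduces to a table lookup. The sketch below uses only the percentile rows that are listed in Figure 5-3. The step-wise lookup (a value receives the score of the highest percentile threshold it reaches) is an assumption, and values above the last listed threshold simply fall back to the last listed score because the remaining rows are elided in the figure.

```python
from bisect import bisect_right

# Percentile thresholds and scores for CPP as listed in Figure 5-3 (min to 70%).
CPP_THRESHOLDS = [0.0, 1.33, 2.58, 4.08, 6.00, 8.58, 12.33, 18.42]
CPP_SCORES = [0.0, 0.08, 0.14, 0.22, 0.33, 0.46, 0.61, 0.75]


def cpp_score(value):
    """Return the score of the highest listed percentile threshold that value reaches."""
    idx = bisect_right(CPP_THRESHOLDS, value) - 1
    idx = max(idx, 0)  # values below the minimum are mapped to the first row
    return CPP_SCORES[idx]


print(cpp_score(4.08))  # example from Figure 5-3
```

The example value of 4.08 lies exactly on the 30% threshold and therefore receives the score 0.22, as stated in the figure.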

The follows-lists of the 750 top-scored users serve as the basis for new potential community members who are considered in the next iteration. This process is repeated until it converges.

Convergence As Figure 5-4 shows, the community converges after 14 iterations. Throughout the iterations, the composition of the top-scored users changes (cf. Figure 5-4). Three types of users can be distinguished:

- U1: users that have appeared in the previous iteration (blue bars)

- U2: users that have already appeared within the process but not in the previous iteration (green bars)

- U3: users that are completely new to the process in the current iteration and have never appeared in any of the previous iterations (orange line)

The iterative process aims to decrease the number of U2 and U3 users continuously to then reach a U3-value of zero as this indicates the stability of the community. Figure 5-4 showcases that the number of unknown users decreases with each iteration.
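The three user types and the convergence criterion can be expressed with simple set operations. The following sketch uses invented user identifiers and hypothetical function names; it assumes that U1 refers to users who were also among the top-scored users of the directly preceding iteration, as indicated by the legend of Figure 5-4.

```python
def classify_users(current, previous, seen_before):
    """Split the top-scored users of one iteration into the groups U1, U2 and U3."""
    u1 = current & previous        # also present in the previous iteration
    u3 = current - seen_before     # never seen in any earlier iteration
    u2 = current - previous - u3   # seen earlier, but not in the previous iteration
    return u1, u2, u3


def composition(iterations):
    """Return (|U1|, |U2|, |U3|) for every iteration of the detection process."""
    seen_before, previous, rows = set(), set(), []
    for users in iterations:
        u1, u2, u3 = classify_users(users, previous, seen_before)
        rows.append((len(u1), len(u2), len(u3)))
        seen_before |= users
        previous = users
    return rows


# Toy example with four iterations (user IDs are invented).
iterations = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "e"}, {"a", "c", "e"}]
rows = composition(iterations)
print(rows)

# The community is regarded as stable once no completely new users (U3) appear.
converged_at = next(i for i, (_, _, n_u3) in enumerate(rows) if i > 0 and n_u3 == 0)
print(converged_at)
```

In the toy example, the last iteration reproduces the user set of the preceding one, so the U3 count drops to zero and the process stops.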


[Figure: number of users per iteration (iterations 0 to 14), divided into users in previous iteration (U1), users new to previous iteration (U2) and users never in any of the previous iterations (U3)]

Figure 5-4 Evolution of user composition

For the final community members, additional data, which are necessary for the calculation of the features, are extracted from Instagram.

5.2.3 Results and evaluation

As 26 users have either switched their profile settings to private mode or deleted their accounts, the collected dataset contains data of 724 user accounts. The postings were published within the timespan from May 2011 to May 2019. The data includes all postings since the respective account’s creation, which encompasses 459,243 postings and around 1.3 million comments with the related context data required for the calculation of features. Figure 5-5 shows the evolution of the number of community members, the number of monthly active community members as well as the number of postings over time. The average posting length over the whole posting history is 62.78 words per post, which underlines the usefulness of applying NMF models for topic modeling in the evaluation phase, as they can handle short texts better than other methods. The graph indicates that most of the community members are active users who post at least once per month. This is in line with the fact that the number of postings increases with the number of community members. The graphic also shows that only a few accounts created their profiles in the year of the launch of the platform. The major part of the community set up their profiles over the last five years. The community, therefore, is a mixture of users who have been active on the platform for several years and users who are rather new on Instagram.


[Figure: time series from May 2011 to May 2019 showing the no. of users in community, the no. of monthly active users 1) and the no. of posts; annotation: mean length of posts in no. of words: 62.78]

1) The user has posted at least once in the current month

Figure 5-5 Community statistics – sneaker community

As the goal of the detection approach is to identify a community with members who are connected and who have a specific interest in a certain topic, the evaluation of the resulting community considers these two criteria. Therefore, the social graph of the community is analyzed to validate the connectivity of its members, and the posting history of the community is investigated regarding the targeted topic focus. The results of the evaluation are summarized in the following.

Validation of connectivity

As described in section 5.2.1, the in-degree measure based on the follow-relations of community members is used to validate the connection criterion. In addition, the average degree of the whole community and its degree centralization are used to compare the group of users considered in the initial iteration with the final target community. Figure 5-6 presents a visualization of the social graph of both groups, the respective average degree as well as the degree centralization value. This comparison shows that, at the beginning of the process, several small sub-groups which are not connected are part of the group of potential community members. This is supported by the low values of average degree (2.52) and degree centralization (0.018). The graph on the right-hand side, in contrast, shows that the members of the final community are noticeably more closely connected via the follow-relation and more centralized, which is indicated by the strong increase of the average degree (68.6) as well as of the centralization value (0.591). This validates that the detection process results in a group of users who are highly connected and, therefore, potentially influence each other.


[Figure: social graphs of both groups]

Potential community members, it = 0: No. of users = 162,572; Avg. degree = 2.52; Degree centralization = 0.018
Target community, it = 14: No. of users = 724; Avg. degree = 68.6; Degree centralization = 0.591

Figure 5-6 Comparison of centrality measures in the initial and final iteration
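The two graph-level measures used in this comparison can be computed directly from an edge list. The sketch below is an illustration under assumptions: it treats the follow-relations as undirected edges and uses Freeman's degree centralization for simple undirected graphs, since the thesis does not state which variant of the formula was applied. The function name and the toy graphs are hypothetical.

```python
def degree_stats(edges):
    """Return (average degree, Freeman degree centralization) for an undirected edge list.

    Only nodes that occur in at least one edge are counted; at least three nodes are required.
    """
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    degrees = [len(adjacent) for adjacent in neighbours.values()]
    n = len(degrees)
    avg_degree = sum(degrees) / n
    max_degree = max(degrees)
    # Sum of differences to the most central node, normalized by the value of a star graph.
    centralization = sum(max_degree - d for d in degrees) / ((n - 1) * (n - 2))
    return avg_degree, centralization


star = [(0, 1), (0, 2), (0, 3), (0, 4)]   # one hub followed by everyone
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]   # all nodes equally connected

print(degree_stats(star))
print(degree_stats(ring))
```

A star graph is maximally centralized and a ring is not centralized at all, which gives two reference points for reading the values reported in Figure 5-6.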

Validation of topic focus

The second evaluation criterion is the topic focus of the community. As the objective of the detection process is to identify a group of users who have a strong interest in sneakers, it needs to be validated that the conversations within the community mainly focus on topics in this field. To prove this, topic modeling is applied to the collected postings. This is realized as described in section 5.2.1 and results in six main topics (k = 6, n = 459,243, m = 246,001). The term “sneakers” is excluded from the analysis, as this was the initial seed keyword of the detection process. The topics are visualized in Figure 5-7. All of the six clusters deal with sneaker-related topics, which is validated by the experts who contributed to the creation of the keyword list. The visualization presents the most important terms which are associated with the respective topic. The most important topic according to the weights assigned by the algorithm is the topic “sneakerhead”, which contains more general sub-topics like sneaker brands (e.g., Nike) or sneaker lines (e.g., Jordan). Terms like “kicks”, “soleonfire” or “kicksonfire” are commonly used by sneaker enthusiasts. The second topic deals with websites that focus on the publication of news around streetwear fashion and sneakers and that also provide options to buy sneakers or streetstyle fashion. The Sole Firm9, for instance, is a popular sneaker community that shares the “top news” related to sneakers. Another important term related to the second topic is “goldengrails”10, which is a brand with a focus on hip hop style jewelry. As the hip hop style and sneakers both belong to fashion, they are closely connected with each other. The third topic relates to blogs, communities and magazines which publish new trends and news about streetwear and sneakers. Highsnobiety11, for instance, is a streetwear blog which covers trends and news related to fashion, specifically streetwear fashion. Hypebeast12 is a lifestyle magazine which also publishes content related to current men's fashion with a focus on streetwear and sneakers, and modernnotoriety13 is a website which publishes sneaker news and release dates of new sneaker models. Topics four and five refer to popular sneaker lines. Airmax is a sneaker line of Nike, and Yeezy is a sneaker line released by adidas (Watts, 2019). The last topic contains several sub-topics and addresses the secondary sneaker market (e.g., trustedkicks14) and sneaker accessories (e.g., angelusdirect15).

9 https://www.thesolefirm.com/

[Figure: most important terms of the six topics; topic labels and weights: sneakerhead (22.79%), jordandepot (19.14%), highsnobiety (17.67%), airmaxkicks (15.53%), yeezy (13.62%), certifiedshots (11.26%)]

Figure 5-7 Topic models of all community members’ posting texts (sneaker)

As all detected topics of the community conversation refer to sneaker-related topics, the evaluation criterion of topic focus is deemed to be validated.

10 https://goldengrails.com/
11 https://www.highsnobiety.com/
12 https://hypebeast.com/
13 https://www.modern-notoriety.com/
14 https://trustedkicks.com/
15 https://angelusdirect.com/

5.3 Trendsetter classification model

Based on the resulting dataset of the community detection process, a classification model is developed. For this purpose, in the phase of pre-processing, the data is annotated by applying the developed labeling concept (cf. section 4.3.1) to the data of the sneaker community. The second major step of data pre-processing encompasses feature extraction, which is based on the feature framework derived from literature analysis and expert interviews (cf. section 4.3.2). The pre-processed data is then used for the model development. To this end, selected algorithms are trained and their classification performances are evaluated and compared using specific metrics. The best performing model is considered as part of the TSIM. The process of model building, consisting of data pre-processing and exploratory data analysis, model training and validation, and model evaluation, is based on different methods such as SNA and techniques of supervised machine learning. In the following, the relevant techniques and concepts are introduced and briefly explained to provide a better understanding of the analysis and results. The subsequent sections then focus on the development of the model, its evaluation, and its interpretation. Finally, the characteristics of OTSs are investigated by analyzing the features which are relevant according to the best performing model.

5.3.1 Methods

Various methods are utilized for data pre-processing and classifier training. Especially for the calculation of features, means of text analytics and SNA are used. Within the supervised machine learning process different methods, e.g., several techniques of data resampling, are applied to achieve reliable and satisfying results. This section gives an overview of the methods that are relevant for the data analysis.

Data pre-processing

As described in section 4.3.2, the feature extraction is based on statistical, network and content metrics. Figure 5-8 summarizes the SMA methods which are used for the feature extraction within the data pre-processing phase, and the social media data required from the respective online social networking platform.


[Figure: required Instagram data, applied SMA methods and their application]

Unstructured data (Instagram): user profile (biography text) and UGC (posting and comment text)
- SMA method: text analytics, following the text analytics framework of pre-processing (stop word removal, tokenization, POS 1)-tagging, lemmatization), representation and knowledge discovery
- Application (feature extraction): content-related and user-related feature calculation

Structured data (Instagram): connections (follower-follow relations) and interactions (mentions, comments, likes, tags)
- SMA method: social network analytics with actor-level measures
  - Social graph: centrality measures based on follow-follower relations (PageRank, out-degree, in-degree, closeness, betweenness, eigenvector centrality)
  - Activity graph: centrality measures based on interaction relations (PageRank, out-degree, in-degree, closeness, betweenness, eigenvector centrality)
- Application (feature extraction): network-related feature calculation

1) Part-of-speech-tagging

Figure 5-8 Feature extraction – required data and applied SMA methods

In addition to the text pre-processing methods used within the community detection process, part-of-speech-tagging (POS-tagging) based on tokens is used for feature extraction. POS-tagging determines the grammatical category of each word and assigns a respective tag to it, such as NOUN (noun), VERB (verb) or ADJ (adjective). While nouns tend to provide information on a particular issue, verbs identify actions and events, and adjectives convey emotional responses (Jabreel et al., 2018). The use of nouns, verbs and adjectives in a user’s postings reflects her/his linguistic style and is therefore captured by specific features. Similar to the tokenization and lemmatization described in section 5.2.1, the POS-tagging is realized with the NLP Python library spaCy. Besides content-related features, a major portion of the considered features is based on SNA and encompasses metrics that refer to the connections between the community members as well as their interactions, as outlined in section 4.3.2. For the calculation of these centrality measures, the Python modules graph-tool16 and NetworkX17 are used. The entire list of considered network metrics is attached in appendix A.4 of the thesis.
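The step from POS tags to a linguistic-style feature can be sketched as follows. The tagged tokens are written by hand here so that the example runs without a language model; in the described pipeline they would come from spaCy, where each token exposes its tag as `token.pos_`. The function name and the use of relative shares rather than absolute counts are assumptions, as the exact feature definitions are part of the feature framework in section 4.3.2.

```python
from collections import Counter


def pos_shares(tagged_tokens, tags=("NOUN", "VERB", "ADJ")):
    """Return the share of each POS tag among all tokens of a posting."""
    counts = Counter(pos for _, pos in tagged_tokens)
    total = len(tagged_tokens)
    if total == 0:
        return {tag: 0.0 for tag in tags}
    return {tag: counts[tag] / total for tag in tags}


# Hand-tagged toy posting (illustrative; real tags would come from a POS tagger).
posting = [("new", "ADJ"), ("sneaker", "NOUN"), ("drop", "VERB"), ("today", "NOUN")]
shares = pos_shares(posting)
print(shares)
```

Aggregating such shares over all postings of a user would yield one value per grammatical category and user.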

16 https://graph-tool.skewed.de/
17 https://networkx.org/

Model development and evaluation As the objective of this thesis is the identification of a known label (OTS) based on specific hidden patterns and not the discovery of new unknown user groups, a supervised learning approach is chosen. For a better understanding of classification, Table 5-4 provides an overview of relevant terms with a short explanation and a use case-related example.

Concept: explanation and use case related example

Input variable, independent variable, input attribute, feature, X: Input for the machine learning algorithm to learn a function (f) that can predict the output. Case example: avg. time to comment

Input vector: If there is more than one variable as input. Case example: list of all features which describe specific user characteristics (feature framework)

Output variable, dependent variable, output attribute, label, class, Y: The output of prediction. Case example: OTS

Instance, sample: Single data example observed or generated by the problem domain. Case example: all features related to one specific user of the community

Function (f): Maps the input to a specific output. Case example: logistic function used by the Logistic Regression algorithm

Algorithm: Means for learning a specific model. Case example: Logistic Regression algorithm

Model: A certain representation which is learned from data: model = algorithm (data). Case example: LogisticRegression(C=1,class_weight=None,dual=False, fit_intercept=True,intercept_scaling=1,max_iter=10, multi_class='auto',n_jobs=None,penalty='l1',random_state=None, solver='liblinear',tol=0.0001,verbose=0,warm_start=False).fit(X,Y)

Table 5-4 Relevant machine learning terminology (Brownlee, 2017; Herrera et al., 2016)
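The relation between the terms of Table 5-4 can be made concrete with a short sketch. It fits an L1-regularized Logistic Regression with the `liblinear` solver, as in the model example of the table, but on synthetic data: the dataset generated with `make_classification`, its size, the class ratio and the increased `max_iter` are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the community data: X holds the input vectors (one row per
# instance), Y the binary label (1 = OTS, 0 = non-OTS). All values are artificial.
X, Y = make_classification(
    n_samples=200, n_features=10, n_informative=4, weights=[0.8, 0.2], random_state=0
)

# Algorithm + data = model (cf. Table 5-4).
model = LogisticRegression(C=1, penalty="l1", solver="liblinear", max_iter=100).fit(X, Y)

predictions = model.predict(X)               # predicted label per instance
probabilities = model.predict_proba(X)[:, 1]  # estimated probability of the positive class

print(model.coef_.shape)   # one learned coefficient per input variable
print(predictions[:10])
```

Because of the L1 penalty, some coefficients in `model.coef_` can be exactly zero, which is what makes this algorithm usable for feature selection later in the process.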

In supervised machine learning, algorithms aim to learn a target function (f) that best maps the independent variables (X) to a dependent variable (Y). In a classification dataset, there are two subsets of attributes. One contains the input variables, e.g., the features which describe a user’s characteristics. These variables act as predictors. The other subset consists of the output variable, e.g., OTS, which is also called label. This label is assigned to each instance, e.g., to each community member. The algorithms then analyze the correlation

between the input variable or vector and the output variable. The obtained model can subsequently be utilized to predict the label of new data samples. As there is only one output variable that can take only two different values (1 or 0), a binary classification problem is solved within this data analysis. The values are known as positive or negative, but are often referred to as 1 or 0, or true and false (Herrera et al., 2016). In the case of OTS identification, the label OTS, for instance, would return either positive, 1 or true. The classification algorithm aims to find a boundary that separates the group with positive output variables (OTS) from those with negative output variables (non-OTS) by analyzing the given features. This results in two classes of community members, namely OTS and non-OTS. The development of a classification model that separates the two groups encompasses several steps (cf. sections 2.3.4 and 4.3). In each of these steps, the characteristics of the dataset and the objective of the classification task have to be considered to obtain a well-performing model. Therefore, the following characteristics of the community dataset and the objective of the analysis have to be respected:

- The dataset consists of social media data from an OSN, which commonly contains noise (cf. section 2.3.2).

- The training, validation and test dataset is an imbalanced one, containing fewer examples of OTS than of non-OTS.

- The features which are extracted based on the feature framework can contain irrelevant and correlated features.

- The focus of the classification task is the identification of the OTS class.

- Additionally, the objective of the analysis is to reveal knowledge about OTS characteristics in OSNs.

Considering these aspects, specific methods and algorithms are chosen for the development of the classification model. Figure 5-9 presents these methods and algorithms used in the different steps of model development and evaluation, the reason why they are selected as well as the respective python packages which are used for their implementation. These applied methods are explained in the following.


[Figure: five steps of model development with applied methods, reasons and implementation]

1) Data splitting
- Applied methods: training, validation and test split, using stratified sampling; k-fold cross-validation, k = 10, using stratified sampling
- Reasons: avoid overfitting; validate model performance
- Implementation: Python library sklearn 1): train_test_split, StratifiedKFold, cross_val_score

2) Data resampling
- Applied methods: random over- and under-sampling; synthetic minority oversampling (SMOTE)
- Reasons: solve imbalance problem; increase prediction quality of model
- Implementation: Python library imbalanced-learn 2): RandomOverSampler, RandomUnderSampler, SMOTE

3) Algorithm selection
- Applied methods: parametric: Logistic Regression, Naïve Bayes; non-parametric: SVM, Decision Trees; ensemble learning methods: Random Forest, Adaptive Boosting, Gradient Boosting
- Reasons: good results in related studies; test different algorithms to reveal which performs best regarding the underlying classification problem
- Implementation: Python library sklearn: MultinomialNB, LogisticRegression, NuSVC, DecisionTreeClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

4) Feature selection & hyperparameter tuning
- Applied methods: feature selection: filter approach (Pearson correlation), embedded approach (depending on the selected algorithm), feature importance (L1-regularized Logistic Regression), combination of approaches; hyperparameter tuning: grid search
- Reasons: improve model performance (prediction quality, speed, effectiveness)
- Implementation: Python library sklearn: SelectFromModel, LogisticRegression, GridSearchCV; Python library pandas: corr

5) Evaluation
- Applied methods: performance measures: Precision, Recall, F1-Score, Accuracy
- Reasons: measure model performance; enable the comparison of different models; basis for further optimization and selection of model
- Implementation: Python library sklearn: precision_score, recall_score, f1_score, accuracy_score

1) https://scikit-learn.org/stable/
2) https://imbalanced-learn.org/stable/

Figure 5-9 Model development – applied methods and implementation

1) Data splitting

The first step before training the algorithms is to separate a final test set from the labeled dataset. This test set is used for the unbiased evaluation of the final model. A commonly chosen proportion between the training and validation set on the one hand and the test set on the other is an 80:20 split, which ensures enough samples for the training phase (Allibhai, 2018). As the dataset in this thesis is limited to around 700 samples, this split proportion is chosen using stratified sampling. Stratified sampling ensures that the class distribution within the sub-samples stays the same. The remaining data serves to train and validate the model as outlined in Figure 5-10.

[Figure: the labeled dataset is split into a test set (20%) and a training and validation set (80%).]

Figure 5-10 Data splitting – training, validation and test split
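The stratified 80:20 split described above can be sketched with train_test_split from scikit-learn. The data below is a randomly generated stand-in with roughly the size and class ratio of the community dataset, and the variable names and random seeds are illustrative assumptions rather than the setup used in the thesis.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the labeled dataset:
# 700 samples, 10 features, about 39.5% positives (OTS = 1).
rng = np.random.default_rng(42)
X = rng.normal(size=(700, 10))
y = np.zeros(700, dtype=int)
y[:277] = 1

# Hold out 20% as the final test set; stratify=y keeps the class
# distribution (almost) identical in both sub-samples.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(len(y_trainval), len(y_test))
print(round(y_trainval.mean(), 3), round(y_test.mean(), 3))
```

The test set is not touched again until the final model is evaluated; all further steps operate on the training and validation part only.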

For the training and tuning of classifiers, k-fold cross-validation is applied, as this approach is a gold standard in machine learning which aims to avoid overfitting and which can improve the prediction quality especially in the case of rather small datasets (Brownlee, 2017; Marcot and Hanea, 2020). In k-fold cross-validation, the training set is divided into k equal-sized subsets. For each subset, the classifier is trained on the union of all other subsets and validated on the held-out subset. Each fold, therefore, serves once as a validation fold during the training phase. Figure 5-11 showcases the process of k-fold cross-validation for k = 5.

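[Figure: k iterations (folds); in the i-th iteration, one fold serves as the validation fold and the remaining folds as training folds, which yields Performance_i (1st to 5th iteration: Performance_1 to Performance_5). The overall performance is $\text{Performance} = \frac{1}{k} \sum_{i=1}^{k} \text{Performance}_i$, with i = current iteration.]

Figure 5-11 5-fold cross-validation (based on Raschka, 2016)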

The error rate or the performance measure is calculated as the average of the error rates or performance values across all k iterations (Kotsiantis, 2007). The value for k is often set to 5 or 10, as studies have shown empirically that these values yield test error estimates that suffer neither from a high variance error nor from a high bias error (Hastie et al., 2009; Kohavi, 1995). Variance error is caused by the inability of the algorithm to generalize beyond the training data. The machine learning algorithm learns the specifics and noise of the training data and adapts its parameter settings accordingly. This is also referred to as overfitting. In contrast, the bias error, also called over-generalization or underfitting, is induced by the assumptions of a model that aim to simplify the learning. Cross-validation supports the efficient training of models by finding the right level of model complexity as well as the appropriate extent of hyperparameter tuning (Brownlee, 2017; Ghojogh and Crowley, 2019). Within the data analysis, k is set to 10. This follows the insights gained from the study of Marcot and Hanea (2020), who compared the validation outcomes of Bayesian network models for different values of k and three different dataset sizes (50, 500 and 5000). According to their experiments, for a dataset size of 500 instances, a k of 10 results in the most stable classification error rate (Marcot and Hanea, 2020). As the community dataset has around 700 samples, which is closer to 500 than to 5000 samples, this k-value is chosen.
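A minimal sketch of stratified 10-fold cross-validation with scikit-learn is given below. The synthetic data, the choice of Logistic Regression as example classifier and the random seeds are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical training and validation set (560 samples, about 40% positives).
X, y = make_classification(
    n_samples=560, n_features=10, weights=[0.6, 0.4], random_state=0
)

# k = 10 folds; stratification keeps the class ratio in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# One F1 value per iteration; each fold serves once as validation fold.
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1"
)

# The reported performance is the average across all k iterations.
mean_f1 = scores.mean()
print(scores.round(3))
print(round(mean_f1, 3))
```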

2) Data resampling

Several studies, such as the one of Weiss and Provost (2003), examine the influence of class distribution on the classification performance and find that a balanced distribution tends to yield better results. Especially in the case of a small training dataset, the algorithm fails to distinguish the rare samples from the majority ones, as it has only a few instances of the minority class to learn from (Japkowicz and Stephen, 2002). The available community data contains fewer samples of the OTS class than of the non-OTS class and encompasses only around 700 samples. However, the class imbalance is rather low, with 39.5% of the dataset being OTS samples and 60.5% being non-OTS samples. Nevertheless, as the imbalance problem can decrease the prediction accuracy, it is considered within the analysis. To address this problem, different techniques can be applied which aim to increase the quality of classification results in the case of imbalanced training data (He and Garcia, 2009; Sun et al., 2009). Oversampling the minority class and undersampling the dominant class are frequently used methods. For this, various techniques are applied, such as random over- and undersampling as well as synthetic oversampling. The latter refers to a technique where the small class is oversampled by creating new synthetic data. Random undersampling, in contrast, refers to a technique where randomly selected instances of the majority class are removed to obtain a balanced dataset (He and Garcia, 2009). Within the data analysis, random under- and oversampling as well as the synthetic minority oversampling technique (SMOTE) are applied due to their wide acceptance. Besides, there exist various implementations which allow their combination with different classifiers. For the data analysis, RandomUnderSampler, RandomOverSampler and SMOTE from the Python package imbalanced-learn18 are used.
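To illustrate the mechanics of the resampling techniques, the following NumPy-only sketch implements random undersampling and a simplified, SMOTE-like interpolation. It is not the imbalanced-learn implementation that is used in the data analysis; the function names, the neighbourhood size k = 5 and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def random_undersample(X, y, rng):
    """Randomly drop majority samples until all classes have equal size."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

def smote_like_oversample(X, y, rng, k=5):
    """Create synthetic minority samples by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified SMOTE)."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[counts.argmin()]
    X_min = X[y == minority]
    n_new = counts.max() - counts.min()
    # Pairwise Euclidean distances within the minority class.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    partner = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # random position on the connecting line
    X_new = X_min[base] + gap * (X_min[partner] - X_min[base])
    y_new = np.full(n_new, minority)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])

# Hypothetical training data with about 39.5% OTS samples (label 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(560, 10))
y = np.array([1] * 221 + [0] * 339)

X_u, y_u = random_undersample(X, y, rng)
X_s, y_s = smote_like_oversample(X, y, rng)
print(np.bincount(y_u), np.bincount(y_s))
```

Undersampling shrinks the majority class to the size of the minority class, whereas the synthetic oversampling enlarges the minority class to the size of the majority class.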

3) Algorithm selection

Four machine learning algorithms and three ensemble learning techniques are tested which are deemed suitable for the underlying binary classification task. Based on these algorithms, different models are created in order to compare their performances and decide on the best model. In general, two categories of learning algorithms can be distinguished: parametric and non-parametric ones. Parametric machine learning algorithms such as Logistic Regression simplify the learning process by making assumptions and realizing the mapping of input and output variables by a known functional form. The advantage of such models is their interpretability and their speed in learning. Their downside, however, is the constraint to specific functional forms and, as a consequence, their limitation to rather simple problems. In general, parametric algorithms have a higher bias as they rely on more assumptions than non-parametric algorithms (Brownlee, 2017). However, due to their good interpretability, two parametric algorithms which are commonly applied to solve binary classification tasks are considered within the data analysis, namely Logistic Regression and Naïve Bayes. Non-parametric algorithms, in contrast, do not rely on strong assumptions regarding the form of the mapping function. Prominent examples are SVM and Decision Trees. Their strength is their flexibility regarding the mapping function, which often results in higher prediction performance, especially when applied to high-dimensional data. However, such algorithms tend to overfit (Brownlee, 2017). To counter the problem of overfitting, many non-parametric algorithms contain techniques or hyperparameters which restrict the learning of the model to a specific level of detail. Within the analysis, the two non-parametric algorithms SVM and Decision Trees are tested due to their wide usage, especially in studies dealing with social media data (cf. section 3.4.3).

18 https://imbalanced-learn.org/stable/index.html

Besides, two different techniques of ensemble learning, namely bagging and boosting, are applied within the data analysis. These techniques create strong classification models based on several weak learners. A weak learner refers to a learning algorithm that is capable of producing a classifier with an accuracy slightly above random guessing (Ferreira and Figueiredo, 2012). Bagging refers to a technique where several weak learners are trained independently of each other, each on a bootstrap sample that is drawn at random with replacement from the training set. This means that each instance is chosen with an equal probability. The final prediction consists of the average of the prediction results of all classifiers. Within the data analysis, Random Forest is applied, which is a variant of bagging. In contrast to bagging, in boosting, the creation of a strong classification model out of several weak ones is realized by an iterative process, which starts with building a model from the training data and continues with the creation of a second model that attempts to correct the errors of the prior one. Models are added until the training error is below a specific threshold or a defined number of models is reached (defined via a hyperparameter). In each iteration, the training set remains the same. The prediction is realized by taking a weighted average of the predictions of each classifier across all iterations. The weights, thereby, are proportional to each classifier’s accuracy on its training set (Ferreira and Figueiredo, 2012). Two boosting approaches are considered within the data analysis, namely AdaBoost and Gradient Boosting. For the practical application of these ensembles, it is crucial to limit the number of sub-classifiers. This is important to avoid overfitting as well as to reduce the computation time (Kotsiantis, 2007).
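The comparison of the seven candidate algorithms can be sketched as a loop over scikit-learn estimators that are evaluated with the same cross-validation scheme. The synthetic data, the default hyperparameters and the value nu = 0.3 are assumptions for illustration; the thesis tunes the hyperparameters separately.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import NuSVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training and validation data.
X, y = make_classification(
    n_samples=560, n_features=10, weights=[0.6, 0.4], random_state=0
)
# MultinomialNB requires non-negative features. For brevity the data is
# scaled once here instead of inside each cross-validation fold.
X = MinMaxScaler().fit_transform(X)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "SVM": NuSVC(nu=0.3),  # nu is an illustrative value
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
results = {
    name: cross_val_score(model, X, y, cv=cv, scoring="f1").mean()
    for name, model in models.items()
}
for name, score in sorted(results.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```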

In the following, the selected algorithms are introduced briefly.

Logistic Regression

Logistic Regression is a probability-based algorithm that is also known as maximum entropy classification. Such algorithms use statistical inference to find the best class for a given instance. As output, they provide for each instance the probability of being a member of the respective classes, e.g., OTS or non-OTS. The instance is then assigned to the class with the higher probability. The features are not assumed to be independent of each other (Deng et al., 2014). Further information about the algorithm and its underlying functions is provided by Deng et al. (2014). A major advantage of using a Logistic Regression algorithm is the interpretability of results. The algorithm also provides information about the extent and the direction of the impact of each variable on the dependent variable. The classifier, however, tends to over-generalize. Nevertheless, the algorithm is considered within the data analysis due to its good interpretability and good performance in related studies (cf. section 3.4.3). The data analysis is realized using LogisticRegression19 of the Python library scikit-learn.
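The two properties mentioned above, class membership probabilities and interpretable coefficients, can be illustrated with a small sketch. The two features and their assumed effects on the class label are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
# Hypothetical ground truth: feature 0 raises and feature 1 lowers
# the chance of belonging to class 1 (e.g., OTS).
noise = rng.normal(scale=0.5, size=400)
y = (2.0 * X[:, 0] - 1.0 * X[:, 1] + noise > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# One probability per class; each row sums to 1.
proba = clf.predict_proba(X[:3])
print(proba.round(3))

# The predicted class is the one with the higher probability.
print(clf.predict(X[:3]))

# Sign and magnitude of the coefficients indicate the direction and
# extent of each feature's impact on the log-odds of class 1.
print(clf.coef_.round(2))
```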

19 https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Naïve Bayes

Similar to the Logistic Regression algorithm, Naïve Bayes is a probability-based algorithm that predicts a class membership probability. It relies on a Bayesian network with only one parent and several children. The child nodes are assumed to be independent of each other, which is called class conditional independence (Deng et al., 2014). Therefore, the algorithm assumes that the presence or absence of a feature is not related to the presence or absence of any other feature. A detailed explanation of this algorithm is provided by Deng et al. (2014). A disadvantage of Naïve Bayes is that it is not suitable for datasets with a high-dimensional feature space, as it often results in poor classification performance due to its independence assumption. The advantage, however, is its simplicity, the resulting speed and the interpretability of results. Additionally, the algorithm requires only small datasets for its training (Sen et al., 2020). Due to this, Naïve Bayes is selected for the data analysis. The implementation is realized using MultinomialNB20 from the Python library scikit-learn.

Support Vector Machine

The SVM algorithm aims to find a decision boundary that maximizes the distance between the closest members of the classes, which is called the optimal separating “hyperplane”. Those points of both classes which are closest to the hyperplane are called support vectors. The benefit of SVM in practice mainly results from its ability to separate non-linearly separable data by using kernels to model non-linear decision boundaries. Kernels are a group of functions that transform the low-dimensional input space into a higher-dimensional space to solve the classification for complex data structures. By mapping the training data to a sufficiently high dimension, the data from the two classes can be separated by a hyperplane (Cristianini and Shawe-Taylor, 2000). Cristianini and Shawe-Taylor (2000) provide further information about the algorithm and its implementation techniques. The strength of SVM is its good generalization ability and its robustness to high-dimensional data (Kotsiantis, 2007). The major disadvantages are its decreasing performance when applied to noisy and class-overlapping data (Sen et al., 2020) and the difficulty of interpreting the results (Kotsiantis, 2007). However, SVM is considered due to its good generalization ability and its wide application, especially in studies that focus on the classification of social media data (cf. section 3.4.3). NuSVC21 from scikit-learn is used to realize the data analysis.
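The effect of kernels on non-linearly separable data can be sketched with NuSVC on two concentric circles. The dataset, the value nu = 0.1 and the comparison of a linear with an RBF kernel are illustrative assumptions and not the configuration of the thesis.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import NuSVC

# Two concentric circles: no straight line separates the classes.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Linear kernel: the decision boundary is a hyperplane in the input space.
acc_linear = cross_val_score(NuSVC(nu=0.1, kernel="linear"), X, y, cv=5).mean()

# RBF kernel: implicit mapping to a higher-dimensional space in which
# a separating hyperplane exists.
acc_rbf = cross_val_score(NuSVC(nu=0.1, kernel="rbf"), X, y, cv=5).mean()

print(round(acc_linear, 3), round(acc_rbf, 3))
```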

20 https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
21 https://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html#sklearn.svm.NuSVC

Decision Trees

Decision trees are hierarchical models where the features are represented by the tree nodes, the edges indicate the possible values for a particular feature and the leaves are related to class labels. A decision tree represents the relationship between a dependent variable and one or more predictor variables. During tree construction, attribute selection measures, such as the Information Gain or the Gini Index, are used to select the attribute that best partitions the samples into distinct classes (Kotsiantis, 2007). The advantage of decision trees is their good interpretability. Their downside, however, is their tendency to overfit due to their flexibility. One possibility to solve this issue is to limit the maximum depth of trees by choosing the respective hyperparameter of the algorithm accordingly (Kotsiantis, 2007). For imbalanced class distributions, decision tree algorithms also tend to create biased trees. Resampling the data before fitting the decision tree algorithm can solve this issue. Further information about decision trees is provided by Murthy (1998). This algorithm is selected as its results are easy to understand and interpret. Furthermore, the outlined weaknesses of this algorithm are easy to handle by hyperparameter tuning and resampling. The algorithm is implemented using DecisionTreeClassifier22 from the Python library scikit-learn.
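The overfitting tendency of unrestricted trees and the effect of limiting the maximum depth can be sketched as follows. The noisy synthetic data and the depth limit of 3 are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data with 10% label noise (flip_y).
X, y = make_classification(
    n_samples=560, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Unrestricted tree: grows until all training samples are separated.
deep = DecisionTreeClassifier(criterion="gini", random_state=0)
deep.fit(X_train, y_train)

# Restricted tree: the max_depth hyperparameter limits the level of detail.
shallow = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
shallow.fit(X_train, y_train)

for name, tree in [("unrestricted", deep), ("max_depth=3", shallow)]:
    print(name, tree.get_depth(),
          round(tree.score(X_train, y_train), 3),
          round(tree.score(X_val, y_val), 3))
```

The unrestricted tree reproduces the training labels perfectly, including the noisy ones, which is the overfitting behaviour that the depth limit is meant to restrict.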

Random Forest

Random Forest, introduced by Breiman (2001), is a variant of the bagging ensemble technique. It uses decision trees as base classifiers. By combining several decision trees, it profits from the strength of decision trees and simultaneously avoids overfitting. The construction of each tree is based on a random sample of the training data. Additionally, only a random subset of features is considered when splitting each node in each decision tree. As the samples are drawn with replacement, they can be used several times during the training phase. This process increases the diversity of base classifiers and the model’s overall stability. Besides, it fosters the model’s robustness to missing data. The final predictions consist of the average of the predictions of each decision tree (Breiman, 2001). Breiman (2001) provides detailed information about Random Forest. The major advantages of Random Forest are a lower overfitting tendency compared to decision trees and the possibility to reveal information about a feature’s importance for the model. Due to this, Random Forest is considered within the data analysis. Its implementation is realized by using the RandomForestClassifier23 of the ensemble package provided by the Python library scikit-learn.
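A short sketch of a Random Forest with bootstrap sampling, random feature subsets and the resulting feature importance values is given below. The synthetic data, in which only the first three features carry class information, and the number of 200 trees are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: with shuffle=False the three informative features
# are the first three columns, the remaining five are pure noise.
X, y = make_classification(
    n_samples=560, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=200,     # number of trees (sub-classifiers)
    max_features="sqrt",  # random feature subset at each split
    bootstrap=True,       # each tree sees a sample drawn with replacement
    random_state=0,
).fit(X, y)

# Impurity-based importance per feature; the values sum to 1.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]
print(importances.round(3))
print(ranking)
```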

22 https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
23 https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

AdaBoost and Gradient Boosting

One commonly applied boosting method is Adaptive Boosting, also known as AdaBoost, which was developed by Freund and Schapire (1996) to solve binary classification problems (Ghojogh and Crowley, 2019). The idea of AdaBoost is to learn a sequence of x models in which every model gives more attention (higher weight) to instances that are misclassified by the previous model (Ferreira and Figueiredo, 2012). The algorithm is often applied using decision trees (Quinlan, 1996). A detailed explanation of AdaBoost is presented in Freund and Schapire (1996). Another widely used boosting approach is Gradient Boosting, introduced by Friedman (2001). Similar to AdaBoost, Gradient Boosting builds a strong classification model based on simple base learners in an iterative process. Gradient Boosting, however, focuses on minimizing a loss function. The contribution of each weak learner to the final prediction is based on a gradient optimization process to minimize the overall error of the strong learner (Bentéjac et al., 2020). The major strength of these ensemble learners is their high accuracy compared to single learning algorithms (Kotsiantis, 2007). A weakness is their low robustness to noise, as noisy instances or outliers tend to be misclassified, and therefore, the weight of the respective instance will increase. This results in overfitting as the algorithms learn the specificities of these instances. However, this issue can be mitigated by setting the right hyperparameters, for instance, by limiting the number of sub-classifiers and iterations (Kotsiantis, 2007). AdaBoost is selected for the analysis as it is one of the most frequently used and studied boosting algorithms and has already been applied to various fields (Ferreira and Figueiredo, 2012). Besides, Morstatter et al. (2016) (cf. section 3.4.3) also used AdaBoost in their analysis for bot detection, as they emphasize that boosting approaches can better handle heterogeneous behavior within a class, as can be the case for the group of OTS. For this reason, Gradient Boosting is chosen as well. Moreover, Gradient Boosting is less sensitive to outliers compared to AdaBoost. Similar to the other algorithms, both ensembles are implemented using the Python library scikit-learn. For this, AdaBoostClassifier24 and GradientBoostingClassifier25 are applied. For both ensemble learners, Decision Trees serve as base learners, as this algorithm is the most commonly used base learner for these boosting ensembles with good performance in various application areas.
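Both boosting ensembles and the limitation of the number of sub-classifiers can be sketched as follows. The synthetic data and all hyperparameter values (50 estimators, tree depth 2, learning rate 0.1) are assumptions for illustration; by default, both scikit-learn classes use decision trees as base learners.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical training and validation data.
X, y = make_classification(
    n_samples=560, n_features=10, weights=[0.6, 0.4], random_state=0
)

# n_estimators caps the number of sub-classifiers (boosting iterations),
# which limits overfitting as well as the computation time.
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
gb = GradientBoostingClassifier(
    n_estimators=50, max_depth=2, learning_rate=0.1, random_state=0
)

cv_scores = {}
for name, model in [("AdaBoost", ada), ("Gradient Boosting", gb)]:
    cv_scores[name] = cross_val_score(model, X, y, cv=10, scoring="f1").mean()
    print(name, round(cv_scores[name], 3))

# After fitting, the ensembles expose their sub-classifiers.
ada.fit(X, y)
gb.fit(X, y)
print(len(ada.estimators_), gb.estimators_.shape[0])
```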

4) Feature selection

Feature selection deals with the elimination of redundant and irrelevant features to improve the speed and effectiveness of algorithms. As it reduces the complexity of a model, it also increases its interpretability. Besides, the elimination of irrelevant features can increase a model’s accuracy. Depending on the algorithm, the involvement of irrelevant and highly correlated features within the training phase can result in overfitting and poor models, and therefore, such features should not be considered. The idea of feature selection is to remove features that do not increase performance. The objective is to select a subset of features that efficiently describes the input data and still provides good prediction results (Tang et al., 2014). Although the feature framework, which is based on an extensive literature review and expert interviews, already aims to involve only potentially relevant features for OTS classification, the issue of redundant and irrelevant features has to be considered in the data analysis. Therefore, different feature selection methods, as well as combinations of several of them, are tested.

24 https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
25 https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html

There are three widespread techniques, namely filter, wrapper and embedded approaches. Filter methods evaluate and rank the features according to a specific statistical measure such as their variance or correlation coefficients. Based on these metrics, the features remain in the dataset or are removed from it. These methods are very efficient and are independent of the selected classification algorithm. Within the data analysis, a correlation filter method based on the Pearson correlation is applied to eliminate redundant features (approach 1, Figure 5-12). This is realized by calculating the Pearson correlation between each pair of features, which is a commonly used technique to identify redundant features (Garcia et al., 2015). In the case of highly correlated feature pairs, the feature with the lower correlation to the target label (OTS) is excluded from the dataset. A feature pair is regarded as highly correlated if its absolute Pearson correlation value is higher than 0.7, as values between ±0.7 and ±1 indicate a strong negative or positive correlation (Ratner, 2009). For the realization of this analysis, corr26 from the Python library pandas is utilized. Wrapper models, in contrast, evaluate feature subsets with the learning algorithm itself and thus consider the specificities of algorithms, but they lack efficiency, especially when applied to a dataset with a large number of features. Therefore, no wrapper approach is applied within the analysis.

Embedded models combine the accuracy of the wrapper approach and the efficiency of filter approaches. Therefore, an embedded approach is applied to realize a fine-grained feature selection that also considers the specificities of the different algorithms (cf. approach 2, Figure 5-12). The feature selection using an embedded approach is realized within the construction phase of the model, i.e., at learning time (Tang et al., 2014). For this, the Python library scikit-learn provides the meta-transformer SelectFromModel27, which enables the implementation of embedded methods for feature selection in the respective classifier. It can be utilized along with any algorithm that exposes the attribute “coef_” or “feature_importances_”. Thereby, features with a feature importance measure below a specific threshold value are dropped from the feature list. Within the data analysis, the threshold value is set to the mean of the values of the feature importance measure of the respective classifier. Except for non-linear SVM, all the selected algorithms have such a built-in feature selection method. For these algorithms, this embedded approach is tested.

Additionally, one approach is involved which is based on a built-in feature selection method of a learning algorithm but which, as is the case for filter approaches, is independent of the specific classifier used for model building (cf. approach 3, Figure 5-12). More recently, studies have tested such approaches as well as combinations of several such feature selection methods (e.g., Sahu et al., 2017; Haq et al., 2019) and outline their good performance and efficiency compared to the commonly used methods. Haq et al. (2019), for instance, use an L1-regularized Logistic Regression to identify relevant features and combine this approach with a correlation-based clustering method to remove redundant features. They show that their approach results in a higher accuracy of the prediction model compared to commonly used filter and embedded methods. Due to this, and as several studies (e.g., Ng, 2004; Zakharov and Dupont, 2011) have shown the effectiveness of L1-regularized Logistic Regression as a feature selection method, this algorithm is tested within the analysis. Thereby, the feature importance measure is based on the regression coefficients. The L1 penalty term is used to shrink the coefficients of less important or irrelevant features to zero (Haq et al., 2019). Therefore, only the features with a non-zero coefficient value are selected. For its implementation, LogisticRegression from the scikit-learn module linear_model is used, setting the hyperparameter penalty to l1.

Besides, combinations of the correlation-based filter approach, which eliminates redundant features, and the other two approaches, which aim to remove irrelevant features, are applied as showcased in Figure 5-12 (approaches 4 and 5). To identify the best-performing setup, the training of classifiers is also realized based on all features (entire dataset). The entire dataset is tested as correlated features can contain additional and valuable information, which can be important for the class decision of some of the algorithms.

26 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html
27 https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html
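The correlation-based filter (approach 1) can be sketched with pandas as shown below. The helper function correlation_filter, the feature names and the synthetic data are hypothetical; only the threshold of 0.7 and the rule of dropping the feature with the lower correlation to the target are taken from the description above.

```python
import numpy as np
import pandas as pd

def correlation_filter(features, target, threshold=0.7):
    """Return the feature names that remain after dropping, from every pair
    with an absolute Pearson correlation above the threshold, the feature
    that is less correlated with the target label."""
    corr = features.corr(method="pearson").abs()
    target_corr = features.corrwith(target).abs()
    cols = list(features.columns)
    dropped = set()
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if corr.loc[a, b] > threshold:
                dropped.add(a if target_corr[a] < target_corr[b] else b)
    return [c for c in cols if c not in dropped]

# Hypothetical features: "likes" is largely redundant with "followers".
rng = np.random.default_rng(0)
n = 500
followers = rng.normal(size=n)
likes = followers + rng.normal(scale=0.6, size=n)
posts = rng.normal(size=n)
latent = followers + 0.5 * posts + rng.normal(scale=0.5, size=n)
ots = pd.Series((latent > 0).astype(int))

features = pd.DataFrame(
    {"followers": followers, "likes": likes, "posts": posts}
)
kept = correlation_filter(features, ots)
print(kept)
```

One feature of the redundant pair is removed, while the uncorrelated feature always remains in the dataset.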

[Figure: the labeled dataset passes through one of five feature selection approaches, which yields the features that are considered for model training.]
1 Filter approach: Pearson correlation
2 Embedded approach: dependent on the selected classification algorithm*
3 Mix of filter and embedded approach: L1-regularized Logistic Regression
4 Combined approach I: Pearson correlation and L1-regularized Logistic Regression
5 Combined approach II: Pearson correlation and classifier-dependent embedded approach
*) coefficient-based feature importance: Logistic Regression, Naïve Bayes; impurity-based feature importance: Decision Trees, Random Forest, AdaBoost based on Decision Trees, Gradient Boosting based on Decision Trees

Figure 5-12 Applied feature selection methods
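Approaches 2 and 3 of Figure 5-12 can be sketched with scikit-learn as follows. The synthetic data, the choice of Random Forest as example classifier, the solver liblinear and the regularization strength C = 0.1 are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 4 informative features, 11 irrelevant ones.
X, y = make_classification(
    n_samples=560, n_features=15, n_informative=4, n_redundant=0,
    random_state=0
)
X = StandardScaler().fit_transform(X)

# Approach 2: classifier-dependent embedded selection. Features whose
# importance lies below the mean importance are dropped.
embedded = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=0),
    threshold="mean",
).fit(X, y)
mask_embedded = embedded.get_support()

# Approach 3: L1-regularized Logistic Regression. The L1 penalty shrinks
# coefficients of irrelevant features to zero; non-zero ones are kept.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
l1_model.fit(X, y)
mask_l1 = l1_model.coef_.ravel() != 0

print(mask_embedded.sum(), mask_l1.sum())
X_selected = X[:, mask_l1]  # reduced feature set for model training
```

For the combined approaches 4 and 5, the correlation filter would be applied together with one of these two selections.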

5) Hyperparameter Tuning

Typically, machine learning algorithms have several hyperparameters which enable their configuration in a way that best fits a specific task and dataset. These hyperparameters vary from algorithm to algorithm and can take a range of different values. The optimal value of each hyperparameter depends on the respective task and dataset, and choosing the optimal set of hyperparameters can increase the prediction performance (Probst et al., 2019). Hyperparameters, for instance, control the complexity of a model, such as the number of trees of the Random Forest classifier, which can help to avoid overfitting (Kuhn and Johnson, 2013). There are different options to select the hyperparameter setting (Probst et al., 2019). The identification of the hyperparameter setting which results in the best classification performance, such as maximum accuracy or minimum error, is often realized by a search process, the so-called hyperparameter optimization or hyperparameter tuning. Two commonly used techniques support this search process, namely random search and grid search (Bergstra and Bengio, 2012). Both methods go through various combinations of hyperparameters to identify the best combination of values. In contrast to random search, which randomly selects combinations of hyperparameter values and evaluates the resulting model, grid search analyses all possible combinations within a pre-defined search space. In the case of a large search space, random search is often preferred over grid search as it can cover a wider hyperparameter space. Grid search, in contrast, can be more accurate as it looks through all value combinations of the preset list. Within the data analysis, grid search is chosen as the tuning technique due to its higher accuracy. Besides choosing an efficient tuning strategy, the hyperparameters which should be tuned and the respective value ranges (search space) need to be defined (Probst et al., 2019). To determine the value ranges for the extensive grid search, several different value ranges are tested to narrow the final search space. A list of all hyperparameters which are optimized for each algorithm and their pre-defined value ranges is provided in appendix A.5. The class GridSearchCV28 from the Python library scikit-learn is used for the implementation of hyperparameter tuning. The F1-Score is set as the objective for the optimization task, as experiments related to this thesis have shown that the sole optimization towards Precision results in an overall poor performance of classifiers. These performance measures are introduced in the following.
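A minimal sketch of a grid search that optimizes the F1-Score within a stratified 10-fold cross-validation is given below. The example classifier and the search space are hypothetical; the value ranges actually used are those listed in appendix A.5.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training and validation data.
X, y = make_classification(
    n_samples=560, n_features=10, weights=[0.6, 0.4], random_state=0
)

# Illustrative search space: 4 x 3 = 12 combinations are all evaluated.
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="f1",  # optimization objective
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```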

6) Evaluation

The evaluation of the classifiers’ performance is realized by using different performance measures. The selection of these metrics is done considering the respective learning objective. These measures are crucial for the assessment of the classification performance as well as for guiding the classifier modeling. The evaluation of the classification performance is based on the so-called confusion matrix, which encodes the number of correct and incorrect predictions for each class. From this matrix, various widely used indices can be derived, such as Accuracy, Recall and Precision (Sun et al., 2009). In the binary scenario of OTS classification, in which the OTS class has fewer samples but high identification importance, the OTS class is referred to as the positive class and the non-OTS class as the negative class. As Figure 5-13 shows, samples can be assigned to four groups according to the confusion matrix, which are true positive (TP), true negative (TN), false positive (FP) and false negative (FN).

28 https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

 | Predicted as positive (e.g., user is predicted as OTS) | Predicted as negative (e.g., user is predicted as non-OTS)
Actually positive (e.g., user is OTS) | True positive (TP): e.g., user is correctly predicted as OTS | False negative (FN): e.g., user is OTS but is wrongly predicted as non-OTS
Actually negative (e.g., user is non-OTS) | False positive (FP): e.g., user is non-OTS but is wrongly predicted as OTS | True negative (TN): e.g., user is correctly predicted as non-OTS

Figure 5-13 Confusion matrix (based on Sun et al., 2009)

For each model, the performance measures Accuracy, Precision, Recall and F1-Score are calculated, as these measures are commonly accepted (Sokolova et al., 2006) and cover different aspects of an algorithm’s performance. The optimization of models focuses on the increase of the F1-Score value to ensure a good overall prediction performance for the OTS class. The comparison of models, however, also considers their Precision values. The reason for this is the major importance of the OTS class within this thesis and the objective to reveal insights about this specific class. The Accuracy is included to ensure a specific level of overall performance. Table 5-5 shows the calculation of these performance measures. Accuracy describes a classifier’s overall effectiveness as it shows the ratio of correctly predicted samples to all samples. Precision estimates the probability that a positive prediction, e.g., OTS class, is correct, which is expressed by the ratio of correctly classified OTS to the number of predicted OTS (TP and FP) (Sokolova et al., 2006). Recall, correspondingly, is the ratio of correctly classified OTS to the number of actual OTS (TP and FN). The F1-Score combines both measures as the harmonic mean of Precision and Recall. All measures’ values range between 0 and 1 (Ballabio et al., 2018).

Performance measure | Formula
Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$
Precision | $\frac{TP}{TP + FP}$
Recall | $\frac{TP}{TP + FN}$
F1-Score | $\frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$

Table 5-5 Performance measures

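The performance measures of Table 5-5 can be derived from the confusion matrix as sketched below and compared with the corresponding scikit-learn functions. The ten labels and predictions are made up for illustration.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: 1 = OTS (positive), 0 = non-OTS (negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * recall * precision / (recall + precision)

print(tp, tn, fp, fn)
print(accuracy, precision, recall, f1)

# The library functions yield the same values.
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```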

5.3.2 Exploratory data analysis and data pre-processing

Two major steps of the data pre-processing phase consist of the data labeling and feature extraction. The following section provides a brief explanation of how these steps are realized within the data analysis as well as a summary of the respective results.

Data labeling

As described in section 4.3.1, the developed concept uses past product trends to label users as OTS and non-OTS based on Rogers’ innovation diffusion theory. To identify appropriate trends, specific sneaker models are analyzed regarding their trend character. For this, 48 sneaker models of six different popular sneaker brands such as adidas, Nike or New Balance are selected, all of which were launched between 01.11.2017 and 29.12.2018. The selection of these models is supported by several sneaker experts, who are either part of the worldwide sneaker community or have a professional background related to streetwear. The time range of releases is chosen to enable access to a one-year history of the price evolution of the respective sneaker model, which is the basis for the estimation of its popularity (success index). Additionally, the chosen time range ensures the availability of social media postings along the sneaker model’s lifecycle. For each of these pre-defined models, the success index, which is explained in section 4.3.1, is calculated. For this purpose, data from the platform StockX is extracted. StockX is a trading platform for sneakers where people can buy and sell sneakers, especially limited editions such as the adidas Yeezy line, the Nike Jordan line, and the adidas NMD Pharrell line. On the site, a buyer places a bid and a seller can place an ask (Watts, 2019). The platform provides data about the retail price of specific sneaker models as well as the bidding price evolution over time. The data collection encompasses the bidding and retail price evolution for each of the pre-defined sneaker models for one year after its launch. Those sneakers with a positive success index value are assumed to be product trends. 40 of the pre-selected models reach a positive value and are, therefore, considered product trends for the labeling task.
Based on pre-defined search strings, postings of the community which are related to these sneaker models are identified. A manual quality check is carried out on a randomly selected sample of postings for each sneaker model to ensure that there are no major errors in the automated process. This results in 67,495 postings which refer to 32 different models. Sneaker models which are mentioned in fewer than 100 postings are excluded from the data analysis. The community postings which relate to one of those sneakers are subsequently extracted from the data together with their time stamps. If one community member has posted several times about one of the sneakers, only the first posting in time is considered within the labeling process. In total, 665 users of the community have posted about at least one of the sneakers. For the assignment of the class labels (OTS, non-OTS) to these users, Rogers’ innovation diffusion model is transferred to the data in a simplified way. For this purpose, the postings per sneaker model are ranked according to their posting time. The users who account for one of the first 16% of postings are labeled as OTS (263 users). Users who have mentioned the product later in time are labeled as non-OTS (402 users). The labeling process results in an imbalanced dataset that includes more examples for non-OTS than for OTS. This needs to be considered in the development of classification models, as ignoring this fact can reduce the quality of the classification result (Herrera et al., 2016). Table 5-6 provides information about the size and the posting activity of both classes. It already indicates that the group of OTS is more active in terms of posting volume than the group of non-OTS. Although the statistics encompass a posting history of eight years, the posting frequency in both groups is high, which is emphasized by an average of 1,110.8 (OTS) and 391.7 (non-OTS) posts per user. This high activity of sneaker-interested users is confirmed by several sneaker experts who are part of the worldwide sneaker community. They point out that sneaker enthusiasts often post several times a day.
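The labeling rule described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name and the input format are hypothetical, the 16% cut-off is rounded up per sneaker model, and a user counts as OTS as soon as one of their first postings falls into the early share of any model. The thesis does not specify these details here.

```python
import math
from collections import defaultdict


def label_users(postings, early_share=0.16):
    """postings: iterable of (user, sneaker_model, timestamp) tuples."""
    # Keep only the first posting of each user per sneaker model
    first = {}
    for user, model, ts in postings:
        key = (user, model)
        if key not in first or ts < first[key]:
            first[key] = ts

    per_model = defaultdict(list)
    for (user, model), ts in first.items():
        per_model[model].append((ts, user))

    # Everyone starts as non-OTS; early posters are upgraded to OTS
    labels = {user: "non-OTS" for user, _ in first}
    for entries in per_model.values():
        entries.sort()  # rank postings by posting time
        cutoff = math.ceil(early_share * len(entries))  # assumption: round up
        for _, user in entries[:cutoff]:
            labels[user] = "OTS"
    return labels


# Hypothetical data: ten users post about one model; u0 posts a second time later
postings = [(f"u{i}", "model-a", i) for i in range(10)] + [("u0", "model-a", 20)]
labels = label_users(postings)
print(labels)
```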

Measure | OTS | non-OTS
No. of users | 263 | 402
No. of posts | 292,140 | 157,463
Avg. posts per user | 1,110.8 | 391.7

Table 5-6 Comparison OTS and non-OTS (sneaker community)

Furthermore, Figure 5-14 shows the evolution of the number of OTS and non-OTS accounts in the community over time as well as the evolution of the number of postings per month of both groups. While the numbers of OTS and non-OTS within the community develop similarly until 2018, the number of OTS postings is higher than that of the non-OTS group, which further emphasizes the higher activity level of OTS accounts. The number of postings of OTSs, however, decreases from the end of 2018 onwards, which can be an indicator that the sneaker trend has reached the masses and that the OTSs shift their focus to other emerging topics within other communities.


[Figure: two line charts over time (May-11 to Apr-19). One chart shows the no. of OTS, the no. of non-OTS and the total no. of users; the other shows the no. of OTS posts, the no. of non-OTS posts and the total no. of posts per month.]

Figure 5-14 Community evolution – OTS and non-OTS (sneaker)

This dataset, consisting of 665 instances, is used as input for the training of the model, and therefore, is further prepared for this task in the following.

Feature extraction and normalization

The extraction of features is based on the feature framework and aims to transform the social media raw data into features that better describe the underlying problem to the predictive models. All features, listed in the appendix (cf. A.4), are calculated using methods of text analytics and SNA as described in section 5.3.1. As user accounts that do not provide all the required data for the calculation of features are already excluded within the data collection process, no further handling of missing values, which can distort the machine learning process and result in incorrect conclusions, is necessary (Garcia et al., 2015). The feature extraction phase results in 113 features represented by numbers that have either numeric values, such as the feature no. of posts, or boolean values (0 or 1), e.g., is verified. Therefore, no further adjustment of data types is required. As Table 5-7 and Figure 5-15 show, the value ranges of features in the dataset vary widely. Time-related features such as the avg. time between posts are measured in seconds and can, therefore, have a huge range: the value of avg. time between posts ranges from 9,574.6 to 3,619,916 seconds. Features like distinct emojis in bio, in contrast, only encompass a value range from 0 to 20 (cf. Table 5-7).


Statistic | Avg. time between posts (in seconds) | Followed by | Distinct emoji in bio
Mean | 331,646.4 | 16,705.63 | 3.17
SD | 323,873.9 | 43,883.29 | 2.8
Minimum | 9,574.6 | 353 | 0
Maximum | 3,619,916 | 551,717 | 20

Table 5-7 Value ranges of the features avg. time between posts, followed by, distinct emoji in bio

The largely varying value ranges of the features also become obvious by their visualization in a boxplot. An example is presented in Figure 5-15.

[Figure: boxplots of the features avg. time between posts, no. of followers and no. of distinct emojis in bio]

Figure 5-15 Visualization of value ranges of the features avg. time between posts, followed by, distinct emoji in bio

As some algorithms, e.g., SVM, can hardly handle such diverging value ranges, a scaling technique is applied to align the value ranges of all features. Distance-based classifiers, e.g., SVM, are strongly affected by the value range of features and assign more importance to features with a broad value range. Besides, rescaling the features helps algorithms such as Logistic Regression, which uses gradient descent as an optimization technique, to converge faster. The aim of feature rescaling, therefore, is to increase the speed of the learning process and to prevent features with a larger range from incorrectly being assigned more importance. For this purpose, normalization is applied based on the quantile information of features. This results in a value range between 0 and 1 (Garcia et al., 2015; Kotsiantis et al., 2006). Quantile normalization transforms the features to follow a normal distribution and decreases the influence of outliers. Therefore, it is a robust pre-processing technique (Roy, 2020). Based on this dataset, a principal component analysis (PCA) is conducted as it can reveal insights about the classification power of the data. The PCA derives relevant features, so-called principal components, through linear or non-linear combinations of the original features (Tharwat, 2016). The number of principal components in this analysis is set to two, as two principal components can often explain most of the overall feature variation and enable a simple visualization (O’Sullivan, 2020). Figure 5-16 visualizes the results of the PCA. The figure shows that there is no clear separation of OTS and non-OTS instances; rather, both groups merge seamlessly. This can already be an indicator that the classification task is challenging for the algorithms and that a clear distinction based on the input data is difficult.
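A rank-based rescaling illustrates why quantile information is robust against outliers. The following is a minimal sketch, not the transformation used in the thesis: it maps each value to its empirical quantile in [0, 1], and the function name and example values are hypothetical. In a cross-validation setting, the quantiles would have to be learned on the training folds only.

```python
def quantile_scale(values):
    """Map each value to its empirical quantile in [0, 1]; ties share the average rank."""
    n = len(values)
    if n == 0:
        return []
    if n == 1:
        return [0.0]
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return [r / (n - 1) for r in ranks]


# Hypothetical feature column with one extreme outlier
scaled = quantile_scale([10, 1_000_000, 20, 30, 40])
print(scaled)
```

Because only the rank of a value matters, the outlier ends up at 1.0 without compressing the remaining values towards 0, which a plain min-max scaling would do.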

[Figure: scatter plot of OTS and non-OTS instances; x-axis: principal component I, y-axis: principal component II]

Figure 5-16 Visualization of PCA analysis

The missing separation of both groups can result from the applied data collection process, which focuses on profiles with a specific topic focus. Besides, within the community detection process, several filter and evaluation criteria are applied which aim to ensure a specific level of quality of the community members. Therefore, this process may already favor profiles with a tendency to trendsetting and related characteristics.

5.3.3 Development and selection of models

Figure 5-17 presents the process of model training, validation, and evaluation using the four algorithms and the three ensemble approaches. Before training, the data is split into a training and validation set (80%) and a test set (20%) using stratified sampling as explained in section 5.3.1. Subsequently, each algorithm is trained with the default values of its hyperparameters using 10-fold cross-validation, resulting in an algorithm’s “default model”.
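The stratified 80/20 split can be illustrated with the class sizes reported in Table 5-6. The sketch below is an assumption-laden illustration, not the thesis code: the function name, the random seed and the rounding of the per-class test size are hypothetical choices.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_share=0.2, seed=42):
    """Return (train_indices, test_indices) that preserve the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(test_share * len(indices))  # assumption: round per class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Class sizes taken from Table 5-6: 263 OTS and 402 non-OTS users
labels = ["OTS"] * 263 + ["non-OTS"] * 402
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting per class keeps the share of OTS users roughly equal in both subsets, which matters for an imbalanced dataset.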


After splitting the data into 10 folds, the data subsets are normalized. In the case of training on resampled data, each sub-sample is then resampled by applying the respective resampling approach. This is iterated until each of the 10 folds has been used once as the validation set. Depending on the approach, feature selection is integrated into this pipeline before the training (e.g., Pearson Correlation and L1-based Logistic Regression) or within the training of the respective algorithm (e.g., classifier-dependent embedded approach). As described in section 5.3.1, a correlation analysis is conducted to eliminate redundant features, which results in a reduced feature list of 90 features. These features are highlighted in the feature list in appendix A.4. The L1-based Logistic Regression described in section 5.3.1, which is also independent of the selected classification algorithm, is applied after data resampling. The number of relevant features selected by this approach, therefore, varies slightly depending on the resampling approach. Applied to the imbalanced dataset, for instance, it results in a reduced feature list of 43 relevant features. Combining the correlation-based approach and the L1-regularized approach (cf. Figure 5-12, approach 4) results in a feature subset of 38 features when applied to the imbalanced dataset. The process is repeated using the same dataset but including hyperparameter tuning to optimize the algorithms towards the performance measure F1-Score. The optimization towards F1-Score ensures a specific overall performance level of the models. Grid search is used to support this process as described in section 5.3.1. The model with the hyperparameter settings which result in the best F1-Score is selected and is called the “tuned model” in the following. Appendix A.5 provides the hyperparameters which are tuned and the respective value ranges which are tested within the grid search. For each algorithm or ensemble learning algorithm, both models, the default model and the tuned model, are applied to the holdout test set to evaluate them on unseen data, to compare their performances, and to measure the impact of hyperparameter optimization on the performance of classifiers.
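The tuning loop described above, a grid search that selects the hyperparameter setting with the best mean F1-Score over stratified folds, can be sketched as follows. Everything here is illustrative: the toy threshold "classifier", the data and all function names are hypothetical, and the normalization, resampling and feature selection steps, which belong inside each fold, are omitted for brevity.

```python
import itertools
import random


def f1_score(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def stratified_folds(y, k, seed=0):
    """Distribute the indices of each class evenly over k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(y)):
        idx = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def grid_search(X, y, fit, param_grid, k=10):
    """Return the parameter combination with the best mean validation F1-Score."""
    folds = stratified_folds(y, k)
    names = sorted(param_grid)
    best_params, best_f1 = None, -1.0
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        scores = []
        for val in folds:  # each fold serves once as the validation set
            val_set = set(val)
            train = [i for i in range(len(y)) if i not in val_set]
            predict = fit([X[i] for i in train], [y[i] for i in train], **params)
            scores.append(f1_score([y[i] for i in val], [predict(X[i]) for i in val]))
        mean_f1 = sum(scores) / k
        if mean_f1 > best_f1:
            best_params, best_f1 = params, mean_f1
    return best_params, best_f1


def fit_threshold(X_train, y_train, threshold):
    # Toy stand-in for a real classifier: ignores the training data
    return lambda x: 1 if x[0] >= threshold else 0


X = [[i / 99] for i in range(100)]
y = [1 if i >= 60 else 0 for i in range(100)]
best_params, best_f1 = grid_search(X, y, fit_threshold, {"threshold": [0.2, 0.4, 0.6, 0.8]})
print(best_params, best_f1)
```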


[Figure: process flow. The labeled dataset is split by stratified sampling into a training and validation set (80%) and a test set (20%). Two parallel branches follow: 10-fold cross-validation using hyperparameter default values, and 10-fold cross-validation applying hyperparameter tuning (grid search towards F1-Score, repeated for all value combinations). In each branch the data is split into 10 folds by stratified sampling, followed by normalization, resampling, feature selection, and training and testing, repeated with each fold as validation set. The resulting default model and tuned model are applied to the unseen test data, followed by model evaluation and comparison of results.]

Figure 5-17 Process of model training, validation and evaluation

In total, 160 experiments are conducted to determine the best model. Within the experiments, the algorithms are trained with and without the integration of different resampling and feature selection approaches. Figure 5-18 shows the different training setups. Except for the SVM algorithm, which is only tested with three different feature selection approaches as this algorithm has no built-in feature selection method, all algorithms are trained in combination with five different feature selection approaches and the entire feature list.
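The number of 160 experiments is consistent with one reading of the setup: four resampling options (including no resampling), six feature options (five approaches plus the entire feature list) for six algorithms, and four feature options (three approaches plus the entire feature list) for SVM. The sketch below enumerates this reading; the approach names are placeholders, and which three approaches are used for SVM is not specified here.

```python
from itertools import product

RESAMPLING = ["no resampling", "random undersampling", "random oversampling", "SMOTE"]
# Placeholder names: the five approaches are defined in section 5.3.1
FS_APPROACHES = [f"feature selection approach {i}" for i in range(1, 6)]
ALGORITHMS = [
    "Logistic Regression", "Naive Bayes", "SVM", "Decision Trees",
    "Random Forest", "AdaBoost", "Gradient Boosting",
]


def feature_options(algorithm):
    # Assumption: SVM is tested with three approaches, all others with five
    n_approaches = 3 if algorithm == "SVM" else 5
    return ["no feature selection"] + FS_APPROACHES[:n_approaches]


setups = [
    (algorithm, resampling, features)
    for algorithm in ALGORITHMS
    for resampling, features in product(RESAMPLING, feature_options(algorithm))
]
print(len(setups))
```

Under these assumptions, six algorithms contribute 4 x 6 = 24 setups each and SVM contributes 4 x 4 = 16.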

[Figure: matrix of training setups. Columns: feature selection (five different approaches) and no feature selection, each combined with random undersampling, random oversampling, SMOTE and no resampling. Rows: Logistic Regression, Naïve Bayes, SVM, Decision Trees, Random Forest, AdaBoost (using Decision Trees), Gradient Boosting (using Decision Trees).]

Figure 5-18 Overview experimental setup


5.3.4 Comparison of models

In the following, the results of the analysis, especially the best model of each of the seven applied classifiers, e.g., Logistic Regression, Naïve Bayes, Random Forest, are presented. Table 5-8 showcases the best model of each algorithm according to its Precision value. The models are compared based on their Precision and F1-Score values, but their Accuracy values are also considered to evaluate their overall performance. The results show that the best model for each classifier is achieved by training on the imbalanced dataset. Besides, except for Decision Trees and AdaBoost, the best model is reached by including the correlation-based feature selection approach, which removes redundant features. The hyperparameter settings of the best model of each algorithm are indicated in appendix A.5. Random Forest performs best with a Precision value of 72.32% and an F1-Score value of 64.12% based on the imbalanced dataset and the feature subset created by combining the correlation-based approach and the L1-regularized Logistic Regression approach. The model, therefore, is based on 38 of the initially 113 input features defined in the developed feature framework. The same model setup based on the entire feature list (113), in comparison, results in a Precision of only 66.72%. The model including feature selection achieves an Accuracy value of 74.22%, which indicates the probability of the model to correctly predict both classes. Random guessing, in comparison, would only result in an Accuracy value of 52% (cf. footnote 29). Furthermore, the model performance achieves values that are close to results reported in the related work studies which deal with binary classification problems based on social media data (cf. section 3.4.3). Lee et al. (2017), for instance, aim to classify Twitter accounts into fashion-related and non-fashion-related accounts and reach a model performance with an F1-Score of 67% and a Precision value of 74%. Morstatter et al. (2016) also develop a binary classification model, which targets the detection of bots on social media platforms. Their best model reaches a Precision value of 71.42%. These studies underline that the classification of social media users, in general, is challenging. The developed OTS classification model, however, achieves a performance level similar to that of comparable models with other classification targets. As the PCA has already indicated, a clear distinction of both groups is difficult, but the model still provides an approach to correctly identify a major portion of OTS accounts within a community. Furthermore, in contrast to the detection of cancer, for instance, where the non-recognition of the positive class has serious consequences, the early detection of trends does not require the monitoring of all existing OTSs of a specific topic area.

29 https://stackoverflow.com/questions/53182709/how-to-calculate-accuracy-score-of-a-random-classifier
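The random-guessing baseline of 52% can be reproduced under one common definition: a classifier that guesses each class with a probability equal to its share in the data is correct with probability p² + (1 - p)², where p is the share of the positive class. The sketch below applies this to the class sizes of the full labeled dataset; whether the thesis used exactly these counts is an assumption.

```python
def random_guess_accuracy(n_positive: int, n_negative: int) -> float:
    """Expected accuracy of a classifier guessing in proportion to class frequencies."""
    p = n_positive / (n_positive + n_negative)
    # Correct if it guesses positive for a positive or negative for a negative
    return p * p + (1 - p) * (1 - p)


baseline = random_guess_accuracy(263, 402)  # 263 OTS and 402 non-OTS users
print(f"{baseline:.2%}")
```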

Classifier | Resampling | Feature selection | Precision | F1-Score | Accuracy
Logistic Regression | No | Pearson correlation | 65.95% | 62.68% | 71.98%
Naïve Bayes | No | Pearson correlation + L1-regularized Logistic Regression | 70.65% | 57.09% | 71.23%
SVM | No | Pearson correlation | 71.86% | 61.68% | 71.98%
Decision Trees | No | L1-regularized Logistic Regression | 59.69% | 60.45% | 68.24%
Random Forest | No | Pearson correlation + L1-regularized Logistic Regression | 72.32% | 64.12% | 74.22%
AdaBoost (using Decision Trees) | No | No | 68.40% | 60.01% | 71.61%
Gradient Boosting (using Decision Trees) | No | Pearson correlation + L1-regularized Logistic Regression | 70.13% | 63.07% | 73.85%

Table 5-8 Performance results of the best model of each classifier (columns Resampling and Feature selection: training setup; columns Precision, F1-Score and Accuracy: performance measurement)

5.3.5 Relevant features for online trendsetter identification

To reveal knowledge about the characteristics of OTSs according to the analysis, the most relevant features of the best-performing model are investigated. For this, the considered features are analyzed regarding their overall importance for the classification model’s class decision. A hybrid approach that combines global and local interpretation is applied. The aim of interpreting a model globally is to get general insights about the features’ importance for the specific model. Local interpretation, in contrast, refers to an investigation of single predictions and the features which determine the class decision for each instance. This helps to understand the class decision of a model for each single sample (Scholbeck et al., 2020). The feature importance describes the degree to which a model relies on a specific feature (Molnar et al., 2020). In the following, feature importance relates to a feature’s associated predictive power. Depending on the algorithm, different techniques can be used to identify the most important features. Parametric models such as Logistic Regression or Naïve Bayes, for instance, are interpretable algorithms where model-specific techniques can be applied. Importance scores or feature effects can be deduced from the learned parameters and the structure of the model. For the interpretation of non-parametric and ensemble-based algorithms, however, model-agnostic techniques are often used (Scholbeck et al., 2020). As the best model is based on the Random Forest ensemble learner, a model-agnostic method is applied in the following. Breiman (2001) provides one of the most prominent model-agnostic approaches, namely permutation feature importance, which was introduced for Random Forests. It measures the importance of a feature by the increase of a model’s prediction error after the permutation of the particular feature, and is also referred to as the Mean Decrease Accuracy (MDA) (Casalicchio et al., 2018). Although this approach is widely used, its downside is that it does not consider feature interactions. However, the volume of information that a feature contributes to a model varies strongly depending on whether it is considered alone or together with a larger set of features (Covert et al., 2020). Additionally, this method only focuses on the question of which features are important but does not investigate how a particular feature influences the predictive power of a model. To also get insights about the “how”, the method SHapley Additive exPlanations (SHAP) is used, which was initially introduced for local model interpretation by Lundberg and Lee (2017) and was further developed for global model interpretation (Lundberg et al., 2020). The method is based on the concept of Shapley values derived from game theory and measures the features’ global contribution to the predictive power of a model.
It enables the interpretation of the global model structure based on several local explanations and aims to explain the outcome of any machine learning model. In contrast to permutation feature importance, it also accounts for feature interactions. As the concept is based on coalitional game theory, SHAP calculates the average marginal contribution of each feature value across all possible coalitions to the prediction made by the model. The combination of local explanations by the TreeExplainer of SHAP across the whole dataset provides a global representation of feature importance and additionally enables the visualization and interpretation of the contribution of each feature to the model performance in terms of magnitude, frequency and direction. Lundberg et al. (2020) provide detailed information about SHAP and its implementation. Within the data analysis, SHAP is implemented using the Python package SHAP (cf. footnote 30) provided by Lundberg. The SHAP summary plot in Figure 5-19 shows the 20 most important of the 38 features used by the model by plotting the SHAP values of each feature for every instance and sorting the features according to the sum of SHAP value magnitudes across all instances. Each dot represents one specific instance, i.e., one user, and the color indicates the feature value. A blue color relates to a low feature value, whereas a red one relates to a high value. A dot that is positioned further to the right increases the prediction towards the OTS class. This representation also visualizes the distribution of the impacts of each feature on the model output. The feature importance analysis shows that network-related (ten features) and content-related features (six features) are most important for the model. These categories are also the two most relevant ones according to the expert interviews (cf. section 3.5) and the related work analysis (cf. section 3.4.1).

30 https://github.com/slundberg/shap
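The idea of averaging a feature's marginal contribution over all coalitions can be made concrete with an exact, brute-force computation for a tiny model. This sketch follows the textbook Shapley definition and is not how the SHAP package works internally: its TreeExplainer uses an algorithm tailored to tree ensembles. The toy model, the instance and the all-zero baseline are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values; value_fn maps a frozenset of feature indices to a payoff."""
    n = n_features
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def model(x):
    # Toy model with an interaction between feature 0 and feature 2
    return 2 * x[0] + x[1] + x[0] * x[2]


instance = (1, 3, 2)
baseline_point = (0, 0, 0)  # assumption: absent features take the baseline value


def value_fn(coalition):
    x = [instance[j] if j in coalition else baseline_point[j] for j in range(3)]
    return model(x)


phi = shapley_values(value_fn, 3)
print(phi)
```

The contributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is shared equally between the two features involved.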

[Figure: SHAP summary plot of the 20 most important features in descending order of importance, from no. of hashtags to ratio emojis/words in posts (cf. Table 5-9); x-axis: SHAP value (impact on model output); color: feature value from low to high]

Figure 5-19 Feature importance – SHAP summary plot

Subsequently, the most important features are investigated in detail. The analysis especially focuses on how the values of a feature influence the class decision. Additionally, it is examined whether the contribution of a feature depends on the combination with another feature. Besides, the groups of OTS and non-OTS are compared by analyzing the value distribution of the important features in both classes. The objective is to reveal insights about how OTSs differ from non-OTSs and to characterize OTSs and their behaviors in OSNs. The main findings are summarized in Table 5-10 and are described in the following. Table 5-9 provides a list of the 20 most important features according to the SHAP method with their mean and standard deviation values in the groups of OTS and non-OTS. The importance rank of a feature is based on the mean of the SHAP value magnitudes across all instances. Some of the features, e.g., no. of hashtags, no. of comments, no. of followers, have high standard deviation values. This indicates a large variation of feature values in the respective group. In general, a high standard deviation suggests that the respective group is not homogeneous regarding the particular feature, and that the assignment to one or the other class is rather induced by a specific combination of particular values of different features. Due to this, besides comparing the value distribution of features of both classes, the SHAP dependence plots are also examined, which can uncover hidden interactions between features. These partial dependence plots, such as the one presented in Figure 5-20, visualize the marginal effect (e.g., SHAP value) which a specific feature (e.g., no. of tags) has on the predicted output of a model (Friedman, 2001). Besides, each plot shows the feature with the highest interaction with the specific feature (e.g., ratio follows-followers). The plot also indicates the type of relationship between the class and a feature, e.g., linear, monotonic or more complex, and is based on the normalized feature values.
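The ranking criterion, the mean of the SHAP value magnitudes across all instances, can be sketched as follows. The SHAP values in the example are made up for illustration and are unrelated to the values reported in Table 5-9; the function name is hypothetical.

```python
def rank_by_mean_abs_shap(shap_matrix, feature_names):
    """shap_matrix: one row per instance, one SHAP value per feature."""
    n = len(shap_matrix)
    means = [
        sum(abs(row[j]) for row in shap_matrix) / n
        for j in range(len(feature_names))
    ]
    # Highest mean |SHAP| first
    return sorted(zip(feature_names, means), key=lambda pair: pair[1], reverse=True)


# Hypothetical SHAP values for two instances and three features
names = ["no. of followers", "no. of hashtags", "avg. post length"]
matrix = [
    [0.10, -0.20, 0.00],
    [-0.10, 0.30, 0.05],
]
ranking = rank_by_mean_abs_shap(matrix, names)
print(ranking)
```

Taking absolute values before averaging ensures that positive and negative contributions of the same feature do not cancel each other out.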

Feature | Category | Importance rank (mean |SHAP values|) | Mean (SD) OTS | Mean (SD) Non-OTS
No. of hashtags | content | 1 (0.171) | 18,461 (29,827) | 6,175 (6,462)
No. of distinct used tags in posts | network | 2 (0.146) | 334.2 (302.8) | 170.5 (156.5)
No. of comments | network | 3 (0.135) | 26,842 (76,112) | 8,112 (14,010)
No. of followers | network | 4 (0.103) | 28,489 (64,021) | 8,997 (18,979)
No. of tags in posts | network | 5 (0.10) | 617.5 (1,677.6) | 97.8 (274.0)
No. of distinct hashtags | content | 6 (0.098) | 885.5 (1,014.2) | 466.8 (454.5)
Evolution no. of comments | context | 7 (0.092) | 0.08 (0.23) | 0.23 (0.58)
Ratio follows-followers | network | 8 (0.088) | 0.192 (0.287) | 0.332 (0.318)
Comment-mention closeness | network | 9 (0.067) | 0.358 (0.129) | 0.287 (0.171)
Post-mention PageRank | network | 10 (0.058) | 0.002 (0.0045) | 0.001 (0.0019)
Ratio emojis bio | user | 11 (0.055) | 0.03 (0.032) | 0.05 (0.065)
Ratio hashtag bio | user | 12 (0.049) | 0.008 (0.013) | 0.006 (0.013)
Follow out-degree | network | 13 (0.045) | 49.63 (47.04) | 53.67 (41.57)
Comment-mention PageRank | network | 14 (0.039) | 0.0016 (0.0014) | 0.0013 (0.001)
Avg. no. of videos | content | 15 (0.036) | 0.023 (0.041) | 0.027 (0.073)
Tag closeness | network | 16 (0.030) | 0.31 (0.11) | 0.28 (0.12)
Avg. time to comment | context | 17 (0.030) | 293,823.0 (441,455.7) | 326,810.3 (531,349.1)
Avg. post length | content | 18 (0.028) | 341.2 (150.2) | 334.9 (168.3)
Avg. no. of questions | content | 19 (0.027) | 0.16 (0.19) | 0.12 (0.19)
Ratio emojis/words in posts | content | 20 (0.025) | 0.13 (0.16) | 0.16 (0.25)

Table 5-9 Important features for OTS detection according to the SHAP value

Network-related features

A major part of the important features relates to the network of the users. The features no. of distinct used tags (rank 2) and no. of tags (rank 5) describe the extent to which a user directly addresses other users of the network by tagging them in their posted images and videos. A higher value of both features increases the prediction towards the OTS class. This is emphasized by the red color of the dots with positive SHAP values (cf. Figure 5-19). The SHAP dependence plot of no. of tags in Figure 5-20, however, shows that the contribution of this feature to the prediction also depends on the value of the feature ratio follows-followers (y-axis, right). Samples that combine a high value of no. of tags and a low value of ratio follows-followers, for instance, result in a higher SHAP value than samples with the same number of tags but a higher follows-followers ratio. The value of the second feature is visualized by the color. Similar to the SHAP summary plot, a red dot refers to a high value, whereas a blue dot relates to a lower value. A user with a medium no. of tags value (0.5-0.8), however, increases the prediction towards the OTS class when combined with a high follows-followers ratio. As the dots are scattered rather than close to each other, but still show a stepwise non-linear trend in which higher values of the feature tend to increase the SHAP value, the plot shows a weak monotonic positive trend. This indicates that the OTS class cannot be explained by a high value of the feature no. of tags alone. Rather, the right combination of several features increases the prediction.

[Figure: SHAP dependence plot; x-axis: no. of tags (normalized); y-axis (left): SHAP value for feature no. of tags; y-axis (right, color): ratio follows-followers (normalized). Annotation: depending on the value range of the feature no. of tags, a low or a high value of ratio follows-followers increases the prediction towards the OTS class.]

Figure 5-20 SHAP dependence plot – no. of tags

Nevertheless, the comparison of the feature values of both groups shows that OTSs tend to have, on average, a higher tagging activity. This is underlined by the violin plots in Figure 5-21, which visualize the distribution of values and highlight some basic statistics such as the median (highlighted by the white dot), the interquartile range (represented by the black bar in the center of the violin), and the upper and lower adjacent values (visualized by the black lines which extend from the bar). Values that lie below or above the adjacent values can be considered outliers. The plots further showcase the larger upward variation of values in the group of OTS as well as the presence of high outlier values. As the no. of distinct tags refers to the number of different tags and the no. of tags relates to the total number of used tags, this group tends to tag more other users (no. of distinct tags) and to tag them more frequently (no. of tags).
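The summary statistics highlighted in a violin plot can be computed directly. The sketch below assumes the common 1.5 x IQR convention for the adjacent values, which the thesis does not state explicitly; the function name and the example data are hypothetical.

```python
import statistics


def violin_summary(values):
    """Median, interquartile range, adjacent values and outliers of a sample."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr  # assumption: Tukey's 1.5 x IQR rule
    upper_fence = q3 + 1.5 * iqr
    # Adjacent values: most extreme observations still inside the fences
    lower_adjacent = min(v for v in data if v >= lower_fence)
    upper_adjacent = max(v for v in data if v <= upper_fence)
    outliers = [v for v in data if v < lower_adjacent or v > upper_adjacent]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "lower_adjacent": lower_adjacent,
        "upper_adjacent": upper_adjacent,
        "outliers": outliers,
    }


# Hypothetical tag counts with one very active user
summary = violin_summary([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(summary)
```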

[Figure: violin plots of the features no. of distinct used tags and no. of tags for the groups OTS and non-OTS]

Figure 5-21 Value distribution – no. of (distinct) tags

The feature no. of comments is another important network-related feature (rank 3). According to the SHAP summary plot, higher values point to the OTS class (cf. Figure 5-19). This is in line with the insights gained from the related work analysis and the expert interviews, which outline that comments are a highly relevant metric to detect online opinion leaders as they reflect the interest of other users in the posted content of the respective user. The comparison of the value distribution in both groups emphasizes that OTSs tend to receive more comments. The first 25% of OTSs according to the number of comments, for instance, receive up to 4,363 comments per user, whereas the values of non-OTSs in the same percentile range between zero and 1,799.8 comments. The avg. no. of comments per post, which is 24.2 for the OTS group and 20.7 for the non-OTS group, furthermore shows that the higher volume of comments is not solely induced by the higher posting activity of the OTS group. Two other important features are the no. of followers and the ratio follows-followers. Higher values of no. of followers and lower values of ratio follows-followers positively impact the class decision towards the OTSs (cf. Figure 5-19). The SHAP dependence plot of the feature ratio follows-followers, which is presented in Figure 5-22, reveals that for high and low values of this feature the combination with a high value of post mention PageRank results in a higher SHAP value. Samples with a combination of specific values of both features, therefore, are more likely to be assigned to the OTS class. The plot follows a linear negative trend as the dots are rather close to each other and lower values tend to relate to a higher SHAP value.
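The reported comments-per-post figures are consistent with dividing the mean no. of comments per user (Table 5-9) by the average number of posts per user (Table 5-6). The following check illustrates this; whether the thesis derived the values in exactly this way is an assumption.

```python
# Group means taken from Table 5-9 (no. of comments) and Table 5-6 (avg. posts per user)
mean_comments = {"OTS": 26_842, "non-OTS": 8_112}
avg_posts = {"OTS": 1_110.8, "non-OTS": 391.7}

comments_per_post = {
    group: mean_comments[group] / avg_posts[group] for group in mean_comments
}
for group, value in comments_per_post.items():
    print(f"{group}: {value:.1f} comments per post")
```

OTSs receive about three times as many comments in total, but only slightly more comments per post, so most of the difference in volume stems from their higher posting activity.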

Figure 5-22 SHAP dependence plot – ratio follows-followers
(x-axis: ratio follows-followers, normalized; y-axis: SHAP value for the feature ratio follows-followers; color: post mention PageRank, normalized. Plot annotation: combined with a high value of post mention PageRank, low and high values of ratio follows-followers increase the prediction towards the OTS class.)

Investigating the feature values in both groups shows that OTSs tend to have, on average, more followers and a lower follows-followers ratio than non-OTSs. The violin plot in Figure 5-23 presents the value distribution of the feature ratio follows-followers in the groups of OTS and non-OTS. It further underlines that the OTS class cannot be explained solely by the feature ratio follows-followers, as the value distributions in both groups do not diverge strongly, although the values in the OTS group are more concentrated towards lower values.

Figure 5-23 Value distribution – ratio follows-followers
(Violin plot of ratio follows-followers. Groups: OTS, non-OTS.)

The insights gained from the analysis of the remaining network-related features are summarized in Table 5-10. The table provides a list of all important features containing the direction of the feature effect on the model's performance, the feature with the highest interaction, and the resulting interpretation of OTSs’ characteristics. All dependence plots are provided in appendix A.6.

Content-related features

According to the analysis, two highly ranked features that relate to the published content of a user are the no. of hashtags and the no. of distinct hashtags. As Figure 5-19 shows, using more (distinct) hashtags favors the assignment of an instance to the OTS class, which is emphasized by the red color of the dots with positive SHAP values. This is further supported by the fact that the mean value of the feature no. of hashtags in the group of OTS is three times higher than in the group of non-OTS (cf. Table 5-9), although the standard deviation in the group of OTS is high. The high standard deviation indicates more variation across the OTS instances than across the non-OTS samples, which is showcased by the violin plot in Figure 5-24. It compares the value distribution of the no. of hashtags in the OTS class with the distribution in the non-OTS class and shows that the group of OTS has a wider value range. 75% of OTSs use between zero and 19,425 hashtags, whereas 75% of non-OTSs use only between zero and 8,441 (75th percentile). The comparison of the overall posting behavior of both groups, however, reveals that the higher number of hashtags is mainly induced by the higher posting activity of OTSs, which is also almost three times higher for OTSs (Ø 1,110.8 posts per user) than for non-OTSs (Ø 391.2 posts per user).

Figure 5-24 Value distribution – no. of hashtags and posts
(Violin plots. Panels: no. of hashtags; no. of posts. Groups per panel: OTS, non-OTS.)

Similar to the no. of hashtags, the no. of distinct hashtags can be traced back to the higher posting volume of OTSs. Although a higher value points to the OTS class, which is indicated by the violin plot in Figure 5-26, the avg. no. of distinct hashtags per post is lower in the group of OTS (Ø 0.8) than in the group of non-OTS (Ø 1.2). The SHAP dependence plot of the feature no. of distinct hashtags presented in Figure 5-25 follows a monotonic positive trend and shows that the SHAP value is zero for a wide value range. For these values, the feature does not contribute to the model’s decision. Only values below or above a specific threshold decrease or increase, respectively, the prediction towards OTSs. It also shows that, for instance, the combination with a low value of evolution no. of comments tends towards the assignment of a sample to the OTS class.

Figure 5-25 SHAP dependence plot – no. of distinct hashtags
(x-axis: no. of distinct hashtags, normalized; color: evolution no. of comments, normalized. Plot annotation: different values of the feature no. of distinct hashtags do not influence the prediction.)

Another relevant feature is the avg. no. of videos, for which a lower value relates to the OTS class (cf. Figure 5-19). The mean values of both groups, however, are close to each other (OTS: 0.023 vs. non-OTS: 0.027), as are their value distributions, which are presented in the right violin plot in Figure 5-26. The high outlier value in the group of non-OTS results from a single user who, on average, has posted more videos than the others, but still at a low level.

Figure 5-26 Value distribution – no. of distinct hashtags and avg. no. of videos
(Violin plots. Panels: no. of distinct hashtags; avg. no. of videos. Groups per panel: OTS, non-OTS.)

Both groups use videos very rarely. The SHAP dependence plot in Figure 5-27 further indicates that this feature only contributes to the model in combination with other features. The plot shows no specific value direction that increases the prediction towards OTSs; rather, it emphasizes that the relationship between the feature and the label is more complex.

The figure showcases, for instance, that samples reach a higher SHAP value if they also have a high value of no. of hashtags.

Figure 5-27 SHAP dependence plot – avg. no. of videos
(x-axis: avg. no. of videos, normalized; y-axis: SHAP value for the feature avg. no. of videos; color: no. of hashtags, normalized. Plot annotation: same feature value range but different SHAP values.)

User-related features

Two of the ranked features which refer to the published content, but which relate to the user profile, are the ratio emojis bio and the ratio hashtags bio. The features describe the number of emojis or hashtags, respectively, relative to the total number of characters used in a user’s biography. A lower value of ratio emojis bio and a higher value of ratio hashtags bio increase the likelihood that a user is assigned to the OTS class. Both groups rarely use hashtags in their biography; OTS on average use slightly more hashtags (0.9) than non-OTS (0.6). Similar to several of the other features, ratio hashtags bio interacts strongly with other features. The SHAP dependence plot for ratio hashtags bio (cf. Figure 5-28) shows a monotonic positive trend and reveals that, for instance, the prediction increases when an instance combines a high value of ratio hashtags bio with a low value of no. of distinct used tags. Therefore, combining specific values of the two features increases the prediction towards OTSs.
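The two biography features follow directly from the definition above (count relative to the total number of characters in the biography). The following is a minimal sketch of how they could be computed. The function names, the hashtag pattern, and in particular the emoji detection via the Unicode category `So` are assumptions for illustration; the latter is only a rough approximation and not necessarily the extraction logic used in this study.

```python
import re
import unicodedata

# Assumption: a hashtag is '#' followed by word characters
HASHTAG = re.compile(r"#\w+")


def ratio_hashtags_bio(bio):
    """Number of hashtags relative to the number of characters in the bio."""
    if not bio:
        return 0.0
    return len(HASHTAG.findall(bio)) / len(bio)


def ratio_emojis_bio(bio):
    """Number of emojis relative to the number of characters in the bio.

    Approximation (assumption): characters of Unicode category 'So'
    (symbol, other) are counted as emojis.
    """
    if not bio:
        return 0.0
    emojis = sum(1 for ch in bio if unicodedata.category(ch) == "So")
    return emojis / len(bio)


hashtag_bio = "Sneakerhead #kicks #streetwear"
emoji_bio = "Plant lover \U0001F331"
print(ratio_hashtags_bio(hashtag_bio), ratio_emojis_bio(emoji_bio))
```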

Figure 5-28 SHAP dependence plot – ratio hashtags bio
(x-axis: ratio hashtags bio, normalized; y-axis: SHAP value for the feature ratio hashtags bio; color: no. of distinct used tags, normalized. Plot annotation: a low value of no. of distinct used tags increases the prediction towards the OTS class for high values of ratio hashtags bio.)

Context-related features

Evolution no. of comments is one of two context-related features ranked among the important features. This feature describes the development of the number of comments over time. The low mean value in the group of OTS indicates that the comment activity is stable for this group, in contrast to the group of non-OTS. As Figure 5-19 shows, instances with lower values of this feature tend to be assigned to the OTS class. Besides, the feature has its highest interaction with the attribute avg. no. of nouns (cf. Table 5-10).

The investigation of the features’ value distributions in the group of OTS shows high outlier values and a wider value range for several features compared to the group of non-OTS, e.g., no. of tags (cf. Figure 5-21). This indicates that the group of OTS behaves rather heterogeneously. Moreover, the analysis above indicates that the behavioral patterns of OTSs are rather complex. Therefore, their identification is based on different value combinations of features. The contribution of the feature avg. no. of videos or ratio hashtags bio, for instance, highly depends on the interaction with other feature values such as no. of hashtags or no. of distinct used tags, respectively. Nevertheless, for several features, a specific value direction increases the probability of the assignment to the OTS class, e.g., a higher value of no. of followers or a lower value of evolution no. of comments. Based on these observations, the following assumptions about the characteristics of OTS in OSN are derived.

High activity and more content

The tendency of OTSs to post more (distinct) hashtags, which relates to their higher posting activity, points to a higher level of communicativeness (cf. section 3.4.1). Consequently, according to the model, OTSs are more communicative than non-OTSs, which also matches the insights gained from traditional fashion trend research about trendsetters (cf. section 2.1.3). The, on average, lower usage of distinct hashtags per post in the group of OTS, however, is an indicator of their expertise (cf. section 4.3.2). This view is further supported by features that are less important for the model but still ranked as important, such as avg. post length or no. of questions. The comparison of the values of avg. post length in both classes shows that OTSs tend to publish longer posts than non-OTSs. Furthermore, the group of OTS uses on average more words than non-OTS. Therefore, they provide more content which potentially influences their network. According to the related work analysis, publishing longer postings, similar to publishing more postings, also relates to a user’s communication skills and a higher level of communicativeness. Besides, as Table 5-10 shows, higher values of the feature no. of questions increase the prediction of the OTS class. This is supported by the comparison of both groups, which shows that OTS on average ask more questions than non-OTS (cf. Table 5-9). According to the related work study, asking more questions is an indicator of a user’s credibility as it shows a user’s interest in exchanging with the network. Therefore, the analysis indicates that OTSs are more credible, which is also emphasized as a characteristic of fashion trendsetters as outlined in section 2.1.3. Additionally, the analysis shows that the prediction towards OTSs increases with a lower value of ratio emojis bio. This reflects their usage of fewer emojis in their bio (Ø 3.1 emojis per bio) compared to non-OTSs (Ø 4.2 emojis per bio) while simultaneously providing longer biographies (OTS: Ø 103.3 characters vs. non-OTS: Ø 98 characters). This indicates that they provide more information, and therefore, tend to be more extroverted as outlined in section 4.3.2.

Many and close interactions

As revealed above, OTSs have more interactions with their network (no. of tags), which means that they more often directly address other members of the community in their posted images and videos. The experts in section 3.5.3 emphasize this as an important indicator of the quality of a user’s interactions. The feature no. of distinct tags furthermore shows that OTSs interact with a wider range of different users. Overall, they tend to have a bigger and, simultaneously, closer network, as the comparison of both groups shows that OTSs mention many different other users in their media posts (no. of distinct tags) and address them more often (no. of tags). Additionally, a higher value of tag closeness more probably points to the OTS class. The tag closeness describes the extent of being connected to other users of the community by tagging each other. A higher value indicates a higher level of interconnection via the tagging activity.
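The distinction between the two tag features discussed above can be illustrated with a short sketch. It assumes that each media post is available as a list of tagged usernames; the function name, the dictionary keys, and the example usernames are hypothetical.

```python
def tag_features(posts):
    """posts: list of lists with the usernames tagged in each media post."""
    all_tags = [user for tagged in posts for user in tagged]
    return {
        "no_of_tags": len(all_tags),                # how often other users are tagged
        "no_of_distinct_tags": len(set(all_tags)),  # how many different users are tagged
    }


# Hypothetical tagging history of one account
example_posts = [["anna", "ben"], ["anna"], [], ["cem", "anna"]]
features = tag_features(example_posts)
print(features)
```

A high no. of tags combined with a low no. of distinct tags would indicate that a user repeatedly addresses the same few accounts, whereas high values of both features point to frequent interactions with a wide range of users.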

High engagement of the community

According to the analysis, higher values of no. of comments point to the OTS class. As this reflects the high interest of the community in the user’s posted content, a higher value indicates higher engagement of the community, which is an indicator of a user’s expertise (cf. section 3.4.1). The number of comments over time is stable for the group of OTS in contrast to the group of non-OTS (evolution no. of comments). This underlines that the engagement with the published content of OTSs is constant. It can further be an indicator that they are more integrated into the community and that they have no intention to artificially boost their reach, which would be reflected in a high value of evolution no. of comments. The rapid growth of an account can indicate that the user has bought the engagement to become more attractive to companies as a partner for commercial cooperations. These accounts are not credible. Therefore, accounts with low values of this feature tend to be more credible, especially combined with other features.

Good interconnectedness

The analysis shows that a higher number of followers more likely points to the OTS class. According to the related work analysis, a high number of followers is an indicator of a user’s influence in the community and her/his reach. Furthermore, combined with the low value of ratio follows-followers, it indicates that OTSs tend to follow fewer other community members. This is an indicator of a user’s credibility as it shows that the no. of followers is not induced by polite reciprocal relations. Moreover, as presented in Table 5-10, a higher value of comment mention closeness points to the OTS group. OTSs, therefore, are more often mentioned in comments by others within the community and have a central position in the network. They are very well connected via their commenting activity. Closeness centrality in general is a measure that describes a user’s ability to spread information very efficiently through a graph. Higher values correspond to a low distance to all other users within the community. Furthermore, higher values of post mention PageRank as well as comment mention PageRank also relate to an OTS account. This further emphasizes the OTSs’ good interconnectedness.
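The notion of closeness centrality used above (a low distance to all other users yields a high value) can be sketched with a breadth-first search. The sketch assumes an undirected, unweighted graph given as an adjacency dictionary and the common definition of closeness as the number of reachable nodes divided by the sum of shortest-path distances; the concrete graph construction and the library used in this study may differ.

```python
from collections import deque


def closeness_centrality(graph, source):
    """Closeness of `source`: reachable nodes divided by the sum of distances."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    total = sum(distances.values())
    if total == 0:
        return 0.0  # isolated node
    return (len(distances) - 1) / total


# Hypothetical mention network: one central account and three peripheral ones
mentions = {
    "hub": ["a", "b", "c"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
}
hub_closeness = closeness_centrality(mentions, "hub")
leaf_closeness = closeness_centrality(mentions, "a")
print(hub_closeness, leaf_closeness)
```

In this toy network, the central account reaches every other account in one step and therefore obtains the highest possible closeness, while a peripheral account needs two steps to reach the other peripheral accounts.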

| Feature group | Feature | Relationship to OTS class decision | Interacting feature | Characteristics |
|---|---|---|---|---|
| Network-related | No. of distinct used tags | Weak monotonic trend, positive* | Avg. no. of likes | Credibility, Interconnectedness |
| Network-related | No. of tags | Weak monotonic trend, positive | Ratio follows-followers | Credibility, Interconnectedness |
| Network-related | No. of comments | Linear trend, positive | Avg. no. of verbs | Expertise, Influence |
| Network-related | Comment mention closeness | Weak monotonic trend, positive | No. of distinct emojis in bio | Credibility, Interconnectedness |
| Network-related | No. of followers | Monotonic trend, positive | Follow out-degree | Influence, Interconnectedness |
| Network-related | Ratio follows-followers | Linear trend, negative | Post mention PageRank | Credibility |
| Network-related | Post mention PageRank | Weak linear trend, positive | Ratio follows-followers | Credibility, Interconnectedness |
| Network-related | Follow out-degree | Weak monotonic trend, negative | No. of followers | Credibility |
| Network-related | Comment mention PageRank | Monotonic trend, positive | No. of distinct used tags | Credibility, Interconnectedness |
| Network-related | Tag closeness | Weak linear trend, positive | Ratio follows-followers | Credibility, Interconnectedness |
| Content-related | No. of hashtags | Weak monotonic trend, positive | Ratio hashtags bio | Communicativeness (high) |
| Content-related | No. of distinct hashtags | Monotonic trend, positive | Evolution no. of comments | Expertise |
| Content-related | Avg. no. of videos | No trend – complex relationship | No. of hashtags | No implication due to complex feature interaction pattern |
| Content-related | Avg. post length | No trend – complex relationship | Avg. no. of verbs | No implication due to complex feature interaction pattern |
| Content-related | Avg. no. of questions | Weak monotonic trend, positive | No. of distinct used tags | Credibility |
| Content-related | Ratio emojis/words in posts | Weak linear trend, negative | Post mention PageRank | Communicativeness |
| User-related | Ratio emojis bio | Weak monotonic trend, negative | Ratio follows-followers | Communicativeness |
| User-related | Ratio hashtags bio | Monotonic trend, positive | No. of distinct used tags | Communicativeness |
| Context-related | Evolution no. of comments | Linear trend, negative | Avg. no. of nouns | Credibility |
| Context-related | Avg. time to comment | No trend – complex relationship | No. of comments | Influence |

*) higher values increase prediction towards OTS class

Table 5-10 Summary of results – feature importance analysis

The investigation of the most important features shows that some of them strongly interact with other features. For most of them, the contribution to the model’s class decision increases in combination with specific values of other features. This is further emphasized by the analysis of the value distributions and value ranges of several of the important features in both classes which, for instance, do not show class-separating value ranges. Nevertheless, for most of the features, higher values (e.g., no. of followers) or lower values (e.g., ratio follows-followers) tend to increase the prediction towards the OTS class. This is underlined by a positive or negative monotonic or linear trend indicated in Table 5-10. The remark “complex relationship” in the table, however, refers to features where no such trend is discovered. Their contribution to the model, therefore, strongly depends on the interaction with other features. Moreover, according to the analysis, the feature ratio follows-followers seems to have a strong impact on the OTS class decision as it is the feature with the highest interaction for four of the important features. Furthermore, for several features, such as no. of distinct hashtags, the contribution to the model for samples with the maximum value of 1 highly depends on other interacting features. This can be induced by the normalization of features, which decreases the impact of outlier values. For the class decision, however, such outlier values may be relevant. Finally, this analysis reveals that the group of OTS tends towards specific characteristics. However, in this analysis, there is no clear separation from the non-OTS class. Rather, there is a smooth transition from one group to the other. Besides, the characteristics of OTSs revealed by this analysis are in line with the insights gained from traditional (fashion) trend research about trendsetters’ characteristics (cf. section 2.1.3). This indicates that the developed TSIM provides an automated solution to detect OTS in OSNs.

However, some limitations of the data analysis need to be considered. First, the model development is based on a dataset that is limited to 665 samples and highly depends on the labeling concept. Besides, although the purpose of the study does not require the identification of all OTS accounts, it has to be taken into account that the resulting model does not recognize all OTSs within a community. Finally, the findings about OTSs’ characteristics and behavioral patterns are derived from the features which are relevant for the model’s class decision, and therefore, depend on the specific classification model.
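The remark on normalization and outliers can be illustrated with a minimal sketch. It assumes min-max scaling to the range [0, 1], which is consistent with a maximum feature value of 1 but is an assumption about the preprocessing; the hashtag counts are synthetic.

```python
def min_max_normalize(values):
    """Scale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Synthetic feature values: same typical users, outliers of different size
moderate_outlier = min_max_normalize([10, 20, 30, 40, 1_000])
extreme_outlier = min_max_normalize([10, 20, 30, 40, 100_000])

print(moderate_outlier)
print(extreme_outlier)
```

In both cases the outlier is mapped to exactly 1, so the information about how extreme it is gets lost, while the remaining values are compressed into a narrow range close to zero. Under this assumption, samples at the maximum value can only be differentiated through their other features.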

5.4 Summary

(1) The topic-focused community detection process enables the identification of an active and connected group of users with a specific pre-defined interest field. According to the experiments conducted within the research project, the community detection process converges best with a target community size of between 600 and 1,000 users. The process allows the detection of an active community, which is emphasized by the fact that the major part of the identified community posts at least once a month. The increase of in-degree and degree centralization from iteration zero to the final iteration furthermore underlines the connectivity of community members. Their focus on a specific field of interest is indicated by the topic models which all relate to the pre-defined topic of streetwear and sneakers.

(2) The developed OTS classification model enables the identification of most of the OTSs within a topic-focused community. The developed classification model is based on a Random Forest classifier trained on the imbalanced dataset and includes only a subset of features that results from a combination of classifier-independent feature selection methods. The model achieves a Precision value of 72.3% and an F1-Score of 64.12%. Therefore, it provides an approach to identify a major portion of OTSs within a topic-focused community. As supporting early trend detection or influencing trends does not require the detection of all OTSs within such a community, the performance of the classification model is deemed sufficient for this purpose.

(3) Network- and content-related features are most important for the model’s class decision, whereas the decision highly depends on the interaction of several features. Similar to the related work analysis and the expert interviews, the investigation of relevant features reveals the major importance of network- and content-related features. The contribution of features to the model, however, often depends strongly on the interaction with other features. For most of the attributes, a specific value direction (higher/lower values) refers to the OTS class, and therefore, provides indications on the characteristics of OTSs.

(4) According to the analysis, OTSs tend to be more active and provide more content than non-OTSs. They are well connected to other users and have close and many interactions. Additionally, their network is highly engaged with their published content. The direction and extent of the contribution of each feature to the model performance, combined with the investigation of the SHAP dependence plots and the comparison of the value distributions of features in both classes, reveal insights about the behavioral patterns of OTSs in OSNs. This investigation shows that overall, OTSs have a higher posting activity. They exchange more with other users and their published content engages others in the network more than the content provided by non-OTSs.

(5) The analysis indicates that OTSs are credible, interconnected, and communicative. Furthermore, they tend to be more influential and have a higher level of expertise compared to non-OTSs. The findings about specific behavioral patterns of OTSs further allow their interpretation towards particular characteristics. Using the insights gained from the related work study about the measurement of specific characteristics based on several social media metrics, the behavioral patterns are examined. The derived characteristics are in line with those which are summarized for (fashion) trendsetters in section 2.1.3, and which are based on insights from traditional fashion trend theories.

(6) The dependency of the OTS class on feature interactions, which makes their manual selection challenging, further underlines the utility of TSIM for OTS detection in OSNs and its advantage compared to current selection methods. As the decision on the OTS class is rather influenced by value combinations of different features, the manual selection of OTS according to specific features and respective values is difficult. A high value of no. of tags, for instance, does not necessarily indicate the OTS class. Rather, the combination of a high value of no. of tags with a low ratio follows-followers is crucial for being an OTS. At the same time, using fewer tags combined with a high value of ratio follows-followers also points to the OTS class. This underlines that a manual identification of OTS according to specific metrics is challenging as the decision is based on multiple feature value combinations. Therefore, the classification model provides an automated solution that can consider these complex patterns.
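The statement in point (2) that the model identifies a major portion of the OTSs can be related to the reported metrics. Since the F1-Score is the harmonic mean of Precision and Recall, the Recall implied by the two reported values can be derived. The sketch below assumes that both values refer to the OTS class on the same evaluation data and treats the rounded values as exact, so the result is only an approximation and not a figure reported in this study.

```python
def recall_from_precision_and_f1(precision, f1):
    """Solve F1 = 2 * P * R / (P + R) for R."""
    return f1 * precision / (2 * precision - f1)


precision = 0.723  # reported Precision
f1_score = 0.6412  # reported F1-Score
implied_recall = recall_from_precision_and_f1(precision, f1_score)
print(f"implied recall: {implied_recall:.3f}")
```

Under these assumptions, the implied Recall is roughly 58%, i.e., somewhat more than half of the OTSs in the evaluation data would be detected.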

After the development and evaluation of the two main components of the TSIM on real data, namely the community detection process (cf. section 5.2) and the classification model (cf. section 5.3), the following section aims to validate its utility and its transferability to other use cases. For this purpose, the TSIM is applied to a use case related to the sustainability movement using data from the online social networking platform Instagram. The following section encompasses the following steps:

- the extraction of a community with a major interest in sustainability-related topics by applying the topic-focused community detection approach,
- the application of the developed classification model to identify OTS accounts within this community, and
- the investigation of the communication of OTSs regarding past trends to assess their potential for early trend detection.

6 Model application

6.1 Use case description

In recent years, sustainability has evolved into a megatrend³¹ that tends to shape all aspects of life and has therefore become an important and business-relevant topic for companies. In the following, this topic area serves as the use case. Fashion demand in particular is affected by this movement, as various aspects of sustainability become major drivers of consumers’ purchasing decisions. Consumers’ desire to avoid plastic pollution, for instance, has recently forced fashion companies to reduce the use of plastic in their products and packaging. These changing consumer attitudes, which emerge from the megatrend sustainability, not only affect the products but also relate to the production processes of fashion companies (Gazzola et al., 2020). Due to the increasing interest of the fashion industry in trends related to sustainability, and the growing influence of this movement within society, the TSIM is applied to identify OTS in the area of sustainability. For this, Instagram is chosen as the data source due to its high relevance for the fashion industry and its high portion of fashion-conscious users. The community detection process is applied to identify a community with a discussion focus on sustainability topics. The target size of the community is set to 950 to obtain a sufficient number of OTS accounts for the evaluation. Table 6-1 summarizes the use case-specific settings. The definition of keywords is based on the recommendations of experts who have a professional background in either a strategic department or a product development department related to the sustainability movement, especially in the fashion industry. The two hashtags “sustainability” and “sustainable” serve as initial seed words.

| Variables | Values |
|---|---|
| Target size m | 950 |
| Considered no. of potential members | 9,500 |
| Top users percentage | 10% |
| Considered posting volume | 12 most recent postings per user |
| Filter words | Worldwide ship, shipping world, ship world, reseller, retailer, store, shop, sell, purchase, sale, trade |
| Keywords | Category I (weight = 1): green, environment; Category II (weight = 2): plastic, organic; Category III (weight = 3): sustain, sustainability, sustainable |

Table 6-1 Use case-specific settings – variables and values (sustainability)

³¹ A megatrend is characterized as a significant movement within society, the economy, politics, and technology (Mittelstaedt, Shultz, Kilbourne and Peterson, 2014).
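The settings in Table 6-1 can be expressed as a configuration object. The sketch below additionally shows one conceivable way to apply the filter words and keyword weights to a user's most recent postings. The function `topic_score`, the summation of weights per matching token, the exclusion of users whose texts contain a filter word, and the ordering of postings (newest first) are assumptions for illustration only; the actual scoring logic of the community detection process is described in the preceding chapters and may differ.

```python
import re

SETTINGS = {
    "target_size": 950,
    "potential_members": 9500,
    "top_users_percentage": 0.10,
    "postings_per_user": 12,
    "filter_words": [
        "worldwide ship", "shipping world", "ship world", "reseller", "retailer",
        "store", "shop", "sell", "purchase", "sale", "trade",
    ],
    "keyword_weights": {
        "green": 1, "environment": 1,                          # Category I
        "plastic": 2, "organic": 2,                            # Category II
        "sustain": 3, "sustainability": 3, "sustainable": 3,   # Category III
    },
}


def topic_score(posting_texts, settings=SETTINGS):
    """Hypothetical relevance score of a user (assumption, see lead-in)."""
    # Assumption: posting_texts is ordered from newest to oldest
    recent = posting_texts[: settings["postings_per_user"]]
    text = " ".join(recent).lower()
    # Assumption: any filter word marks a commercial account, which scores zero
    for phrase in settings["filter_words"]:
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            return 0
    tokens = re.findall(r"[a-z]+", text)
    weights = settings["keyword_weights"]
    return sum(weights.get(token, 0) for token in tokens)


private_user = ["Plastic free and sustainable living", "Organic green garden"]
commercial_user = ["Sustainable sneakers, visit our shop"]
print(topic_score(private_user), topic_score(commercial_user))
```

Note that the target size of 950 corresponds to 10% of the 9,500 considered potential members, which matches the top users percentage listed in the table.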

Based on the extracted data of the resulting community, the classification model is applied to identify the OTS accounts within the community. The following section provides further information about the resulting community.

6.2 Application of trendsetter identification methodology

The final community contains data of 920 user profiles, as 30 accounts either switched their profile to private mode or deleted their account between the detection process and the extraction of the data for the subsequent classification of profiles into OTSs and non-OTSs. The postings were published between August 2011 and August 2019. The data contains 116,812 postings and includes all postings since the respective account’s creation. To validate the connectivity and the topic focus of the resulting community, similar to section 5.2.3, the change of degree centralization from the first to the final iteration and the topics based on the community’s posting texts are investigated. The increase of degree centralization from 0.09 in the first iteration to 0.55 in the final iteration shows that the members of the final community are noticeably more closely connected via the follow-relation and more centralized than the users who are identified by the initial hashtag search. Figure 6-1 furthermore shows that all identified topics are related to environmental or sustainability topics.
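To make the degree centralization values reported above easier to interpret, the following sketch computes Freeman's degree centralization for two small example graphs. It assumes an undirected simple graph given as an adjacency dictionary; the follow-relation in this study is directed, so the exact variant used for the reported values may differ.

```python
def degree_centralization(graph):
    """Freeman degree centralization of an undirected simple graph."""
    n = len(graph)
    if n < 3:
        return 0.0
    degrees = [len(neighbors) for neighbors in graph.values()]
    max_degree = max(degrees)
    # Normalized by the value of a star graph, the most centralized structure
    return sum(max_degree - d for d in degrees) / ((n - 1) * (n - 2))


# Hypothetical example graphs with five users each
star = {"hub": ["a", "b", "c", "d"], "a": ["hub"], "b": ["hub"], "c": ["hub"], "d": ["hub"]}
ring = {"a": ["b", "e"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d", "a"]}

star_centralization = degree_centralization(star)
ring_centralization = degree_centralization(ring)
print(star_centralization, ring_centralization)
```

A value of 1 corresponds to a star-like network in which all connections run through one account, whereas a value of 0 corresponds to a network in which all accounts have the same number of connections.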

Figure 6-1 Topic models of all community members’ posting texts (sustainability)
(Topics and shares: Zero waste (33.03%), Grow your own food (28.01%), Ocean pollution (13.75%), Reduce trash (12.79%), Plasticfree (12.42%).)

The classification model assigns 300 of the 920 accounts to the OTS class. As indicated in section 5.3.2, this high number can be induced by the community detection process, which already applies several quality criteria that tend to favor user profiles with OTS characteristics. Table 6-2 indicates a lower level of activity of the sustainability community compared to the sneaker community. An OTS account within the sustainability community, for instance, publishes on average 322.2 posts, whereas an OTS account within the sneaker community publishes 1,110.8 posts. Similar to the sneaker community, however, there is a higher posting activity among the OTS accounts. The lower activity is also induced by the age of accounts. The sustainability community is younger than the sneaker community, and therefore, provides less posting history.

| | OTS | non-OTS |
|---|---|---|
| No. of users | 300 | 620 |
| No. of posts | 96,648 | 20,164 |
| Avg. posts per user | 322.2 | 32.6 |

Table 6-2 Comparison OTS and non-OTS (sustainability community)

The age of an account is indicated by the time of account creation. Figure 6-2 shows that the number of users within the community increases significantly as of 2018. This is an indicator that the interest in sustainability topics has developed only recently and that communities focusing on this issue have emerged only in the past three to five years. Comparing the number of OTS account creations with those of non-OTS, it is obvious that until 2017 there are almost only OTS accounts. This is in line with the definition of a trendsetter who tends to be the first to talk about new things.

Figure 6-2 Community evolution – OTS and non-OTS (sustainability)
(Two panels over time, Aug-11 to Aug-19. Left panel: no. of OTS, no. of non-OTS, total no. of users. Right panel: no. of OTS posts, no. of non-OTS posts, total no. of posts.)

In the following section, the postings of the identified OTSs are investigated to assess their trend prediction potential and to show the utility of TSIM in practice.

6.3 Trend prediction capability of online trendsetters

As outlined in section 1, the major motivation for developing a framework to identify trendsetters in OSNs related to a specific fashion area is to support trend detection. Therefore, the utility of TSIM is validated by assessing the trend prediction potential of the resulting OTSs. This is realized by evaluating the prediction potential of their published content based on consumer trends of 2020 related to the sustainability megatrend. As indicated in section 2.1.1, in the fashion industry short-term trends such as a specific product, e.g., a specific sneaker model, and long-term trends, e.g., new materials or changes in consumer behaviors, can be distinguished. Both are influenced by broader trends such as the sustainability movement. In the following, the focus is on how the sustainability megatrend affects the fashion industry regarding changing consumer behaviors and attitudes (long-term trends). This also serves to show that the developed approach is applicable to a variety of topics, spanning from very specific topics such as sneakers to broader movements like sustainability. Consumer research studies provide insights into current consumer problems, preferences, and behaviors. Thus, they provide the necessary information about consumer trends to investigate the trend prediction potential of OTSs. The analysis of consumer research studies of 2020 reveals that consumers’ behavior is strongly influenced by their desire to support the environment and avoid its further pollution (Gazzola et al., 2020). Due to this, consumers adopt new behaviors related to the concept of circular economy, such as recycling and reusing things or reducing waste and plastic. Based on the studies, eight consumer attitudes and behaviors are identified which strongly impact the consumer product industry, and especially the fashion industry, in 2020 as they have been adopted by a major portion of consumers. In the following, they are referred to as consumer trends 2020. These trends serve as the basis for the assessment of OTSs’ trend prediction potential. Figure 6-3 presents an overview of how to assess the trend prediction potential of the identified OTSs. In the following, each step and the resulting outcome are outlined.

| Step | What? | How? |
|---|---|---|
| 1 | Identification of consumer trends 2020 | Investigation of consumer research studies |
| 2 | Exploration of occurrence and popularity of consumer trends in OTSs’ postings over time | Topic modeling; hashtag analysis (volume of trend hashtags per month) |
| 3 | Comparison of trend popularity over time to non-OTS and the overall consumer market | Google trend index reflects consumer behavior*; hashtag analysis (volume of trend hashtags per month) |
| 4 | Assessment of the trend prediction potential of OTSs | Granger causality test based on the trend-related time series |
| 5 | Comparison of the trend prediction potential of OTSs with the trend prediction potential of users with the largest reach | Comparison of Granger causality test results of OTSs and users with the largest reach |

*) Silva et al., 2019

Figure 6-3 Steps to assess the trend prediction potential of identified OTSs

(1) Consumer trends 2020

The awareness of the increasing ocean pollution, the masses of plastic waste, and climate change leads to changing consumer behavior (Gazzola et al., 2020), which is measured within the considered consumer research studies of 2020 [32]. The studies investigate the shopping behavior and buying drivers of consumers worldwide and analyze how the megatrend sustainability changes consumer demand in 2020. According to the studies, the new consumer behavior aims to reduce environmental pollution, and consumers, therefore, increasingly follow concepts like reusing and recycling things or reducing plastic and waste, which support the circular economy approach (Gazzola et al., 2020). For this, consumers in 2020 ask for zero-waste products, recyclable or reusable packaging, or products made of biodegradable materials (Capgemini research institute, 2020; IBM Institut for value, 2020). Besides, consumers in 2020 prefer to buy locally grown food to reduce the environmental impact of consumption (Capgemini research institute, 2020). The major consumer trends revealed by the consumer research studies are showcased in Table 6-3. Based on these trends, the OTSs’ postings are investigated regarding trend occurrence and trend popularity over time. Additionally, one important consumer driver (climate change), which induces the changing consumer behavior according to the studies, is considered within the analysis.

[32] Capgemini research institute (2020) and IBM Institut for value (2020)

| Category | Consumer trend |
|---|---|
| Circular economy | Reduce waste, reuse, recycling, secondhand |
| Ecofriendly | Biodegradable materials, plastic-free products, local products |

Table 6-3 Consumer trends related to the sustainability megatrend in 2020

(2) Occurrence and popularity of consumer trends in OTSs’ postings

In a first step, it is investigated whether and how many of those consumer trends are discussed in the group of OTSs. This also serves to identify the most relevant keywords which describe the respective trend and enables the analysis of trend popularity over time. Therefore, topic modeling is applied to the postings of the identified OTSs. Due to the low number of community members before 2016, and the resulting low number of postings before this year, the investigation is based on data from 2016 to August 2019. This encompasses 69,810 OTSs’ postings. The analysis of trend occurrence in these postings is realized on a yearly basis. Thus, topic modeling is applied to the text corpus of postings that are published in the respective year as described in section 5.2.1. Due to data availability, the topics of 2019 are based only on postings from January to August of this year. The terms “sustainability” and “sustainable” are excluded from the analysis as these are the initial seed keywords of the detection process (cf. section 6.1). The k-value (number of topics) which results in meaningful, interpretable topics is determined by using different values and assessing the quality of the resulting topics manually. k-values between two and ten are tested, and a k-value of four results in distinct and meaningful topics. Figure 6-4 shows the topics discussed by the group of OTSs between 2016 and 2019 and indicates, in brackets, the importance of each topic relative to all identified topics per year. The topic analysis shows that in 2017 two of the investigated consumer trends of 2020 are already recognized as major topics by the topic modeling algorithm. Besides, the other two topics, ocean pollution and homegrown, relate to the consumer attitudes and behaviors of 2020.


Ocean pollution, for instance, is mentioned as one of the drivers for changing consumer behavior. The topic urban gardening and homegrown goes along with the trend to buy local products as it has similar intentions, although it results in different behavior. In 2018, 80.1% of postings already refer to consumer trend-related topics, e.g., reduce (plastic, waste) or ecofriendly, compared to 37.4% in 2017. In 2019, the importance of consumer trend-related topics of 2020 declines slightly to 74.1%. The peak, therefore, is reached in 2018. To gain deeper insights into the evolution of trend popularity over time, the most important keywords per topic are identified based on their frequency within the respective topic.

OTS topics per year (relative importance in %):

| 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|
| Gardening (42.5) | Urban gardening + homegrown (44.4) | Reduce (plastic, waste) (53.9) | Reduce (plastic, waste) (52.5) |
| Vegan food (24.1) | Recycle (23.3) | Homegrown (19.9) | Ecofriendly (21.6) |
| Ocean pollution (19.3) | Ocean pollution (18.2) | Circular economy (15.5) | Urban gardening (16.4) |
| A mother’s life (14.1) | Circular economy (14.1) | Ecofriendly (10.7) | Homegrown (9.5) |

Figure 6-4 Topic evolution in OTSs’ postings (2016-2019)

As outlined in section 3.2, analyzing the number of trend-related hashtags over time offers insights into their popularity within the different groups. Therefore, the number of trend-related hashtags per month is analyzed from 2016 to 2019. Figure 6-5 presents, as an example, the resulting time series of four of the trend-related keywords. The graph shows, for instance, that for three of the trends, the popularity peak is reached at the end of 2018.

[Line chart; y-axis: number of hashtags per month from 0 to 350; series: #ecofriendly, #recycling, #reuse, #secondhand]

Figure 6-5 Example – evolution of trend hashtag popularity
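The monthly hashtag volume described above can be illustrated with a minimal sketch. It assumes that each posting is available as a pair of posting date and hashtag list; the function name `monthly_hashtag_counts` and the toy postings are hypothetical and not taken from the study's data or implementation.

```python
from collections import Counter
from datetime import date


def monthly_hashtag_counts(postings, hashtag):
    """Count postings per (year, month) that contain the given hashtag."""
    tag = hashtag.lower().lstrip("#")
    counts = Counter()
    for posted_on, hashtags in postings:
        # Compare case-insensitively and ignore a leading '#'.
        normalized = {h.lower().lstrip("#") for h in hashtags}
        if tag in normalized:
            counts[(posted_on.year, posted_on.month)] += 1
    return dict(sorted(counts.items()))


# Hypothetical toy postings: (posting date, hashtags used in the posting).
postings = [
    (date(2018, 11, 3), ["#reuse", "#zerowaste"]),
    (date(2018, 11, 20), ["#Reuse"]),
    (date(2018, 12, 1), ["#recycling"]),
    (date(2018, 12, 9), ["#reuse"]),
]

print(monthly_hashtag_counts(postings, "#reuse"))
# -> {(2018, 11): 2, (2018, 12): 1}
```

Counting postings rather than hashtag occurrences is a design choice of this sketch, so a posting that repeats a hashtag is counted once.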

(3) Comparison of trend popularity to the group of non-OTSs and to the consumer market

To reveal insights about the trend prediction potential of OTSs, the evolution of trend popularity in the group of OTSs is compared to its evolution in the group of non-OTSs and the overall consumer market. The Google trend index reflects the consumer market. Several studies (e.g., Silva et al., 2019; Boone et al., 2018) show that the Google trend index can be used as an indicator of consumer behavior as it measures the consumer search interest, which links to the consumers’ buying decisions (Silva et al., 2019). Therefore, it reflects the popularity of a topic within the consumer market. The Google trend index represents Google search data which is normalized to enable the comparison between time series of several trend terms. The index ranges between 0 and 100, where 100 represents the maximum search volume for the respective topic, i.e., the maximum number of search queries in a time unit within the observation period (Rogers, 2016). The analysis of the search interest in reuse since the beginning of 2016, for instance, shows the highest search interest, with a value of 100, in March 2020 (cf. Figure 6-6). Google provides free access to its Google trend index data [33].

To enable the comparison of the time series related to the eight consumer trends which are obtained from Google and Instagram data, the time series based on Instagram data are normalized in the same way as the Google search data to result in a hashtag index between 0 and 100. Similar to the Google trend index, a value of 100 represents the maximum interest in the topic, which is measured by the number of postings including the trend-related hashtag in a time unit (e.g., one month). The analysis of the interest in reuse since the beginning of 2016, for instance, shows the peak in December 2018 with a value of 100 (cf. Figure 6-6). Figure 6-6 showcases, as an example, the evolution of this popularity index for the group of OTSs and the overall consumer market (Google trends). The index evolution, with the peak in December 2018 in the group of OTSs and in March 2020 in the consumer market (Google trend index), already indicates the prediction potential of OTSs for the overall consumer market. As data for the group of OTSs are only available until August 2019, the curve of OTSs drops to zero after this month.

[33] Accessible at: https://trends.google.com/trends

[Line chart; y-axis: index from 0 to 100; series: OTS index, Google trend index]

Figure 6-6 Example – popularity evolution of reuse
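The normalization of monthly hashtag counts to an index between 0 and 100 can be sketched as follows. This is a minimal illustration, assuming that the peak month of the observation period is mapped to 100 and that values are rounded to integers; the rounding, the function name `to_index`, and the example counts are assumptions, not details given in the study.

```python
def to_index(monthly_counts):
    """Scale counts so that the maximum of the observation period equals 100."""
    peak = max(monthly_counts)
    if peak == 0:
        # No postings at all: the index stays at zero throughout.
        return [0 for _ in monthly_counts]
    return [round(100 * count / peak) for count in monthly_counts]


# Hypothetical monthly counts of postings containing one trend hashtag.
counts = [40, 120, 300, 150]
print(to_index(counts))  # -> [13, 40, 100, 50]
```

Because each series is scaled by its own peak, the index supports comparing the timing of popularity changes between groups, but not their absolute volumes.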

The Google trend index data for a time range between 2016 and 2020 are extracted for all considered consumer trends for further investigation.

(4) OTS trend prediction potential

The assessment of the trend detection potential of the identified OTSs is realized based on a Granger causality test. The Granger causality test determines if the time series of a variable x, e.g., OTSs’ trend popularity, supports the prediction of another variable y, e.g., trend popularity within the consumer market. For this, it analyzes patterns of correlations. As the Granger causality test requires stationary data to ensure reliable results (Granger, 1969), the time series are tested regarding their stationarity using the augmented Dickey-Fuller test (Fuller, 1976). The non-stationary time series are transformed by utilizing the commonly applied technique of differencing (Pal and Prakash, 2017). Figure 6-7 shows, as an example, the transformed time series of the OTS index and the Google trend index for reuse. Compared to Figure 6-6, the graph shows no upward trend, but only the variations in the trend popularity, which indicates the data’s stationarity.

[Line chart; y-axis: differenced index from -40 to 60; series: OTS index after differencing, Google trend index after differencing]

Figure 6-7 Example – popularity evolution of reuse after differencing
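The differencing step can be illustrated with a short sketch. It assumes first-order differencing with a lag of one time unit, which the text does not specify; the function name `difference` and the example values are hypothetical. The stationarity check itself is not reproduced here; in Python it is commonly done with `adfuller` from `statsmodels.tsa.stattools`.

```python
def difference(series, lag=1):
    """Return the differenced series: value at t minus value at t - lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical index values with an upward trend.
index = [10, 14, 19, 25, 32]

first = difference(index)   # removes the level, keeps the month-to-month change
second = difference(first)  # a second pass removes a remaining linear trend

print(first)   # -> [4, 5, 6, 7]
print(second)  # -> [1, 1, 1]
```

Each pass shortens the series by `lag` observations, which reduces the number of observations available for the subsequent causality test.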

Based on these stationary time series, the Granger causality test is realized. The null hypothesis states that lagged values of x do not explain the variation in y (Granger, 1969). The time series x, which represents the trend popularity in the group of OTSs, Granger-causes the time series y, which reflects the consumer market, if this null hypothesis is rejected. In the underlying case, this means that if the hypothesis is rejected, the evolution of trend popularity in the group of OTSs supports the prediction of the trend evolution in the overall consumer market. The hypothesis is tested using the statistical F-test. The significance level, denoted by alpha, is set to 0.05 (Lehmann and Romano, 2005). Therefore, if the hypothesis test results in a p-value ≤ 0.05, the hypothesis is rejected. The number of lags is set to the maximum possible number, which depends on the number of observations considered in the test. Besides testing the prediction potential of OTSs for the consumer market, it is also examined if the evolution of trend popularity in the group of OTSs can support the prediction of trend popularity in the group of non-OTSs. Table 6-4 summarizes the results of the analysis. The Granger causality tests, which examine the prediction potential of OTSs for the consumer market, show that for seven of the eight trends the hypothesis is rejected. This underlines the prediction potential of the group of OTSs for the consumer market. The test of the prediction potential of OTSs for the group of non-OTSs confirms the OTSs’ prediction potential for all eight trends.

Prediction potential based on Granger causality test (A → B: A supports the prediction of B)

| Consumer trend | Related hashtag | OTS → consumer market | OTS → non-OTS |
|---|---|---|---|
| Reduce waste | zerowaste | Yes, p = 0.0005 | Yes, p = 0.0004 |
| Reuse | reuse | Yes, p = 0.0009 | Yes, p = 0.0002 |
| Recycling | recycling | Yes, p = 0.0492 | Yes, p = 0.0000 |
| Secondhand | secondhand | Yes, p = 0.0014 | Yes, p = 0.0011 |
| Ecofriendly | ecofriendly | Yes, p = 0.0196 | Yes, p = 0.0000 |
| Biodegradable materials | biodegradable | Yes, p = 0.0118 | Yes, p = 0.0000 |
| Plastic-free products | plasticfree | No, p = 0.1291 | Yes, p = 0.0009 |
| Climate change | climatechange | Yes, p = 0.0054 | Yes, p = 0.0000 |

Table 6-4 Granger causality test results
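The logic of the test behind Table 6-4 can be sketched in pure Python. The sketch compares a restricted regression (y on its own lags) with an unrestricted one (y on its own lags and the lags of x) and returns the F statistic; it is an illustration under stated assumptions, not the implementation used in the study. The helper names `ols_rss` and `granger_f` and the synthetic series are hypothetical, and the p-value, which would follow from the F distribution with (lags, n - 2·lags - 1) degrees of freedom, is not computed here.

```python
import random


def ols_rss(rows, targets):
    """Residual sum of squares of a least-squares fit via the normal equations."""
    k = len(rows[0])
    # Augmented system (X'X | X'y).
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, targets))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, targets))


def granger_f(x, y, lags):
    """F statistic for H0: lagged values of x do not help to predict y."""
    times = range(lags, len(y))
    targets = [y[t] for t in times]
    # Restricted model: intercept and own lags of y.
    restricted = [[1.0] + [y[t - lag] for lag in range(1, lags + 1)]
                  for t in times]
    # Unrestricted model: additionally the lags of x.
    unrestricted = [row + [x[t - lag] for lag in range(1, lags + 1)]
                    for row, t in zip(restricted, times)]
    rss_r = ols_rss(restricted, targets)
    rss_u = ols_rss(unrestricted, targets)
    dof = len(targets) - 2 * lags - 1
    return ((rss_r - rss_u) / lags) / (rss_u / dof)


# Synthetic example: y follows x with a delay of one time unit.
rng = random.Random(0)
x = [rng.uniform(-1, 1) for _ in range(60)]
y = [0.0] + [0.8 * x[t - 1] + 0.1 * rng.uniform(-1, 1) for t in range(1, 60)]

f_xy = granger_f(x, y, lags=1)  # large: x helps to predict y
f_yx = granger_f(y, x, lags=1)  # small: y does not help to predict x
print(f_xy > f_yx)
```

A large F statistic corresponds to a small p-value and thus to a rejection of the null hypothesis. Statistical libraries such as statsmodels offer this test, including p-values, as `grangercausalitytests`.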

(5) Comparison to the trend prediction potential of the users with the highest reach

To further validate the utility of the developed approach, its advantage compared to existing approaches that identify trend-influencing users is demonstrated. One of the most commonly used metrics to identify influential users, e.g., for marketing purposes, is the number of a user’s followers, as it is a simple measure that describes a user’s reach within an online social network (Segev et al., 2018). For this, the postings of the 300 top-ranked users according to the number of followers within the topic-focused community are investigated. Table 6-5 showcases the comparison of Granger causality test results analyzing the prediction potential of the group of OTSs and the group of users with the highest reach for the evolution of trend popularity in the overall consumer market. The time lag which results in the rejection of the hypothesis is also considered, as it contains information about the time lead of OTSs or users with the highest reach compared to the consumer market. The results show a lower trend prediction potential for the group of users with the highest reach than for the group of OTSs, as the null hypothesis is only rejected for six of the eight trends. Besides, among the six trends where the trend prediction potential of both groups is confirmed by the test, the group of OTSs can predict the consumer market earlier for five trends, which is underlined by the larger time lag.

Prediction potential based on Granger causality test (A → B: A supports the prediction of B)

| Consumer trend | Related hashtag | OTS → consumer market | Users with highest reach → consumer market |
|---|---|---|---|
| Reduce waste | zerowaste | Yes, p = 0.0005, timelag = 13 | Yes, p = 0.0486, timelag = 13 |
| Reuse | reuse | Yes, p = 0.0009, timelag = 15 | Yes, p = 0.0142, timelag = 14 |
| Recycling | recycling | Yes, p = 0.0492, timelag = 19 | Yes, p = 0.0366, timelag = 15 |
| Secondhand | secondhand | Yes, p = 0.0014, timelag = 19 | No, p = 0.0855 |
| Ecofriendly | ecofriendly | Yes, p = 0.0196, timelag = 17 | Yes, p = 0.0013, timelag = 11 |
| Biodegradable materials | biodegradable | Yes, p = 0.0118, timelag = 12 | Yes, p = 0.0005, timelag = 6 |
| Plastic-free products | plasticfree | No, p = 0.1291 | No, p = 0.1278 |
| Climate change | climatechange | Yes, p = 0.0054, timelag = 15 | Yes, p = 0.0193, timelag = 11 |

Table 6-5 Comparison of OTSs and users with the highest reach (Granger causality)
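The counts stated in the comparison can be retraced from the values in Table 6-5 with a small sketch. The p-values and time lags are copied from the table; the variable names and the tuple layout are hypothetical, and `None` marks cases where the null hypothesis is not rejected and no time lag is reported.

```python
ALPHA = 0.05  # significance level used for all tests

# (hashtag, OTS p-value, OTS time lag, reach p-value, reach time lag)
results = [
    ("zerowaste", 0.0005, 13, 0.0486, 13),
    ("reuse", 0.0009, 15, 0.0142, 14),
    ("recycling", 0.0492, 19, 0.0366, 15),
    ("secondhand", 0.0014, 19, 0.0855, None),
    ("ecofriendly", 0.0196, 17, 0.0013, 11),
    ("biodegradable", 0.0118, 12, 0.0005, 6),
    ("plasticfree", 0.1291, None, 0.1278, None),
    ("climatechange", 0.0054, 15, 0.0193, 11),
]

# Number of trends for which the null hypothesis is rejected per group.
ots_rejected = sum(ots_p <= ALPHA for _, ots_p, _, _, _ in results)
reach_rejected = sum(reach_p <= ALPHA for _, _, _, reach_p, _ in results)

# Trends confirmed for both groups, and those where OTSs have the larger lag.
both = [r for r in results if r[1] <= ALPHA and r[3] <= ALPHA]
ots_earlier = [r[0] for r in both if r[2] > r[4]]

print(ots_rejected, reach_rejected, len(both), len(ots_earlier))
# -> 7 6 6 5
```

For zerowaste, both groups show the same time lag of 13, which is why five rather than six trends are counted as predicted earlier by the OTSs.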

The investigation validates the prediction potential of OTSs regarding the popularity evolution of trend-related keywords for the group of non-OTSs and the overall consumer market. Although the analysis is limited to the eight selected consumer trends and drivers of 2020 and the respective selected keywords, it still emphasizes the huge potential of the identified OTSs to support the early detection of upcoming trends. The analysis also indicates their potential to increase the performance of trend and sales forecasting. The comparison with the group of trend-influencing users identified based on the number of followers additionally highlights the value of the TSIM for companies as it shows the higher trend prediction potential of the OTSs compared to the users with highest reach. Besides, the investigation indicates several application areas for the TSIM which are addressed in the following section.

6.4 Application areas and transferability

The major motivation for the development of the TSIM is the decreasing control of fashion companies over fashion trends and its shift towards the consumer, which makes reliable fashion trend analysis and forecasting based on consumer data crucial for the future competitiveness of a fashion company (Park et al., 2016). The application of TSIM to the use case of sustainability outlines that the approach provides a solution to support trend forecasting and indicates some additional application opportunities. The following section summarizes the application areas and examines the transferability of the TSIM to other topic areas as well as to other data sources.

Application areas

As the investigation of OTSs’ published content in the previous section shows, the identification of OTSs can support early trend detection. Besides, the analysis also indicates their potential to increase the performance of trend and sales forecasts. It also demonstrates that OTSs discuss newly emerging consumer problems and preferences earlier than other user groups. Therefore, the content which is published by the OTSs within a specific topic area is a valuable source of information and inspiration for product development as well as for the marketing department of a fashion company. By identifying changing consumer needs and preferences, it supports the development of products that meet consumer needs and enables appropriate consumer targeting. Besides, the identified OTSs are potential marketing partners for companies to influence the development of trends and to push new products into the market, as the analysis indicates that OTSs influence non-OTSs within a specific topic-focused community. Therefore, OTSs can be used as partners for influencer marketing campaigns to reach a very targeted group of consumers who have a high interest in a specific field (e.g., sustainability). This can be a new approach to efficient consumer targeting where OTSs influence other users in a micro space, namely a specific topic area.

Transferability to other topic areas

The two data analyses show that the approach is applicable to different topic areas, ranging from very specific topics such as sneakers to broader movements like sustainability. This indicates that the approach can be used in various fashion-related fields of interest as well as in other areas apart from fashion-related ones.


Transferability to other social networking platforms

Although both data analyses are realized based on data from Instagram, the approach can also be applied to other online social networking platforms such as Facebook or Twitter as they share common functions, and therefore, provide similar data about the users’ behavior and characteristics. Furthermore, as outlined in section 2.2.1, other types of social media platforms also provide functions similar to those of online social networking platforms. To obtain insights about the transferability of the approach to other social media platforms, the most popular platforms according to the number of monthly active users worldwide are investigated. Besides Instagram and the two messenger services WhatsApp and Facebook Messenger, the most popular social media platforms in January 2021 are Facebook (2.7 billion), YouTube (2.3 billion), and TikTok (689 million) (Tankovska, 2021). As the two messenger services focus on private communication between people, only the transferability of TSIM to Facebook [34], YouTube [35], and TikTok [36], which also provide their users with a public mode of communication, is analyzed. For this, it is investigated if the data which is required for the realization of the developed approach is available on these platforms. The accessibility of data from the respective platform is not considered. Table 6-6 summarizes the results. The table shows that the three mentioned social media platforms provide most of the data which are required for the application of the developed approach, and that only small adaptations are needed. This indicates that the TSIM can also be applied to other types of social media platforms such as the content-sharing platform YouTube or the recently emerging platform TikTok, where the focus lies on sharing videos. As both platforms, similar to the online social networking platform Instagram, allow the interaction between users, reactions to a user’s posted content, and a descriptive text below the shared video, the TSIM can be used on these platforms. Table 6-6 showcases, for instance, that the application of TSIM to the content-sharing platform YouTube requires only small adaptations regarding the input features for the classification model.

[34] https://www.facebook.com/
[35] https://www.youtube.com/
[36] https://www.tiktok.com/

| TSIM step | Required data | Data availability: Facebook | Data availability: YouTube | Data availability: TikTok |
|---|---|---|---|---|
| Community detection | User-related: biography text | Yes: user profile description | Yes: channel information | Yes: user profile description |
| Community detection | Content-related: posting text, no. of postings | Yes: text postings or descriptive text of posted images and videos | Yes: descriptive text of posted videos | Yes: descriptive text of posted videos |
| Community detection | Context-related: time of posting | Yes | Yes | Yes |
| Community detection | Network-related: no. of followers, no. of follows, no. of comments, no. of likes, list of follows | Yes with adaptation: depending on the profile settings, only reciprocal relationships (friends) or one-directional relationships (followers) | Yes: followers = subscribers, follows = subscriptions | Yes |
| Classification model (feature extraction) | User-related: biography text | Yes: user profile description | Yes: channel information | Yes: user profile description |
| Classification model (feature extraction) | Content-related: posting text, posting type | Yes: text postings or descriptive text of posted images and videos | Yes with adaptation: descriptive text of posted videos; if the posting type is always video, adaptation of the feature framework: deletion of image-related features, e.g., no. of images | Yes with adaptation: descriptive text of posted videos; if the posting type is always video, adaptation of the feature framework: deletion of image-related features, e.g., no. of images |
| Classification model (feature extraction) | Context-related: time of posting, time of comment | Yes | Yes | Yes |
| Classification model (feature extraction) | Network-related: no. of likes, no. of comments, comment interactions, no. of tags, tagging interactions, no. of mentions, mention interactions | Yes with adaptation: no tagging; adaptation of the feature framework: deletion of tagging-related features, e.g., no. of tags in posts | Yes | Yes |

Table 6-6 Transferability of approach to other social media platforms

7 Conclusion

7.1 Summary

With a global business of 1.3 trillion dollars in 2020 (Shahbandeh, 2021), the fashion industry has huge economic power and represents an important driver for the global gross domestic product. The industry operates in a highly competitive market (Gazzola et al., 2020) with the increasing power of consumers regarding fashion trend creation and diffusion. Due to this, reliable fashion trend analysis and forecasting, which consider consumer data, are essential for the competitiveness of fashion companies (An and Park, 2020). The new control of the consumer over fashion trends is enabled by the increasing usage of social networking platforms, which provide their users with the tools to share their ideas and opinions, and therefore, to influence other users in their behaviors. As a result, the content published on these platforms is a rich data source for the fashion industry containing information about changing consumer needs and upcoming trends (cf. section 1.1). Fashion companies, however, cannot exploit the potential of this new data source as they lack the knowledge about the trend-relevant data which contains valuable information within this huge data pool.

This research addresses the challenge of profiting from the huge and valuable data source of online social networking platforms for trend prediction, especially in the highly competitive fashion industry. It argues that trends are created and diffused by trendsetters, and that the content which is shared by these trendsetters in OSNs includes information that enables early trend detection. These trendsetters are active in OSNs, share and spread their ideas and opinions, and thereby, influence others in their decisions. Due to this, the study seeks to identify trendsetters based on the digital trace which they leave on online social networking platforms. To achieve this, the characteristics of trendsetters are identified by investigating trend theories (cf. section 2.1), and a feature framework is developed based on a literature review (cf. section 3.4) and expert interviews (cf. section 3.5) which enables the measurement of these characteristics based on social media data. The framework, therefore, contains features that are potentially relevant for the identification of OTSs in OSNs, and which are used as input for the development of a classification model (cf. section 4.3.2). Additionally, it is the basis for the investigation of OTSs’ characteristics in OSNs within the data analysis as it allows the translation of the relevant features for the model’s class decision into characteristics.

Besides extending the knowledge about OTSs’ behaviors and characteristics in OSNs, the objective of this research is the development of a methodology that enables the detection of OTSs in OSNs based on social media data in an automated way (cf. research question, section 1.2). For this, a two-step approach first extracts a relevant sample of users, namely a topic-focused community, and then identifies the community members who take a specific role within the community, namely OTSs. Social roles, such as the trendsetter role, depend on the social system (e.g., a community): someone can take the role of a trendsetter within a group of friends who often talk about fashion, but not in a group of collaborators who mainly discuss mobile phone related topics. Therefore, a process is created which allows extracting an active community with a specific pre-defined topic focus. The identification of OTSs within such a community is realized based on a classification model. The model building relies on supervised machine learning and uses a labeled dataset consisting of a community with a fashion-related topic focus. The online social networking platform Instagram serves as the data source for the analysis due to its high relevance for the fashion industry. The community detection process, which considers almost one million of the most recent postings containing a predefined hashtag to identify a group of seed users, results in a group of 665 users. The data published by these users, comprising 459,243 postings, is the basis for the development of the classification model. The analysis of the relevant features for the class decision using a combination of local and global model interpretation methods reveals insights about feature interactions and the influence of specific feature values on the class decision, and therefore, provides indications about OTSs’ behavioral patterns in OSNs.

The evaluation of the developed methodology encompasses the validation of its transferability to another use case as well as the assessment of the trend prediction potential of the identified OTSs based on past consumer trends related to the megatrend sustainability. The major results of the study are summarized in the following.

(1) Feature framework

Based on an extensive literature review and interviews with experts in the field of influencer marketing, a feature framework is developed including 113 features that describe the identified six major characteristics of trend-relevant roles based on social media data. The features encompass network-, content-, context-, and user-related features which are calculated based on user profile, UGC, connection, and interaction data from online social networking platforms. These features are used as input data for the development of the classification model.


(2) Community detection approach

The community detection approach is developed in two phases, the concept phase and the data analysis phase. First, a concept is created based on insights gained from related work. In a second step, this concept is applied to data from the online social networking platform Instagram and further optimized. The process starts with the identification of topic-relevant seed users based on around one million of the most recent postings including a pre-defined hashtag and considers 10,000 users in each iteration. As this community detection approach only selects users with a focus on the pre-defined topic by measuring the occurrence of specific keywords in the users’ postings, it enables the extraction of a topic-relevant part of the respective OSN. The application of specific filter and scoring criteria further ensures the quality of the final community members and enables the detection of an active and connected group of users with a specific pre-defined field of interest.

(3) Classification model

The classification model to recognize OTSs within the topic-focused community is developed using supervised machine learning. The model building is based on a labeled community of 665 samples including the data required for the feature extraction. In 160 experiments that combine different machine learning algorithms with various feature selection methods and resampling techniques, the best model according to specific performance measures is selected. This model is based on a Random Forest algorithm and considers 38 input features.

(4) Relevant features and OTSs’ characteristics

To reveal insights into OTSs’ characteristics based on the conducted data analysis, a model interpretation using the SHAP method is realized. This method identifies the relevant features for the model’s class decision and investigates the influence of specific feature values and feature interactions on the class decision. It shows that network- and content-related features are most important and reveals that the class decision highly depends on feature interactions, which underlines the challenge of a manual OTS selection. Furthermore, the analysis reveals that OTSs tend to have a higher activity and closer interactions compared to non-OTSs, and that the community has a higher interest in the content published by OTSs than in that published by non-OTSs. Transferring the quantitative results of the feature analysis back to characteristics, especially credibility and interconnectedness can be assigned to the investigated OTSs.


(5) Transferability and utility of the developed methodology

Finally, the application of the developed methodology to a use case emphasizes its transferability and shows its utility in practice by validating the trend detection potential of identified OTSs based on consumer trends from 2020. For this, the posting history of OTSs encompassing 69,810 postings is investigated and the trend prediction potential of OTSs for the overall consumer market is tested based on Granger causality tests. The analysis validates the trend prediction potential of OTSs and indicates their higher trend prediction potential compared to community members with the highest reach.

7.2 Contribution to theory and practice

The conducted research contributes to theory and practice and its results support different research and business areas. These contributions and the related areas are outlined in the following.

Contribution to theory The community detection approach provides an efficient data collection method that enables the extraction of a topic-relevant sample from online social networking platforms. Extracting data from an entire online social networking platform such as Instagram is not feasible because of the huge volume of data and the restricted data access; the developed detection approach offers an appropriate solution to this problem. Moreover, because the detection process requires only a limited volume of data in each iteration, it is a very efficient collection method. The community detection approach therefore supports social media research in the data collection phase. In addition, the insights gained about OTSs’ behavioral patterns, their characteristics, and the features relevant for detecting them in OSNs expand the knowledge about OTSs in the fashion industry and thus contribute to trend research and the recently emerging field of fashion informatics. The developed concept of first extracting a relevant sample and then classifying the resulting community members into different roles can also serve as a guideline for future research in the field of user classification.

Contribution to practice The developed methodology supports fashion companies in detecting trends early and in identifying changing consumer needs and preferences, as it enables the detection of OTSs in OSNs. Because these OTSs talk about future trends early, they offer the potential to improve the performance of trend prediction. The insights into the relevant features for OTS detection and into OTSs’ characteristics in OSNs also assist marketing departments in finding appropriate partners to push new products or to influence future trends. As the impact of OTSs is tied to a specific topic area, the research results can especially support influencer marketing in finding new concepts that allow more targeted and authentic marketing. Cooperating with topic-focused OTSs rather than with topic-universal influencers, who have a high reach and post about various topics depending on their cooperations, can increase the efficiency of marketing. In addition, investigating the published content of identified OTSs facilitates data-driven strategic decisions in product development and marketing communication. Monitoring the publishing activity of OTSs offers a new opportunity for market research and is faster than traditional market research studies such as panel or lead user surveys. A further major advantage over traditional approaches is the avoidance of the response bias that occurs in surveys. Moreover, investigating the published content can deliver valuable information about the language of the target group and therefore enables efficient consumer targeting and appropriate marketing slogans. Thus, the research results support companies in the marketing areas of market research, consumer strategy, influencer marketing, and product creation and design.

7.3 Limitations and implications for future research

As the insights about OTSs’ behavioral patterns and characteristics in OSNs are derived from a single data analysis of one specific OSN, future studies can extend and validate these findings through further analyses. Moreover, this study focuses on OTSs and their detection in the context of the fashion industry without considering specific locations. Future research can expand the gained insights by analyzing OTSs in other industries or in specific geographical locations, e.g., countries. It would be interesting to see whether OTSs in different industries and locations behave similarly in OSNs or whether there are major differences. In addition, the research validates the trend prediction potential of the OTSs identified by the developed methodology but does not investigate how the published content can be integrated into existing forecasting models to best capitalize on its predictive power. This can be the focus of future studies in the field of trend prediction and forecasting. A related research opportunity is the analysis of how to extract valuable trend information from the various media types, e.g., images and videos, that are published by OTSs. The published images, for instance, can contain rich data that supports the design innovation of products and can increase the speed of overall product creation.

Moreover, the utility of TSIM is shown by the trend prediction potential of OTSs but is not tested in a company environment. For the rollout of the developed approach in companies, future research needs to address the legal aspects of analyzing publicly available social media data and of using information extracted from the published UGC for business purposes. In particular, recommendations are needed on how to collect and store data and on how to carry out the data analysis while respecting data protection and privacy requirements. It has to be clarified which social media data a company may store and how, e.g., by applying data encoding methods. Although the trend prediction potential of the identified OTSs is already demonstrated by the conducted Granger causality tests, using the developed methodology in a company environment can further confirm its value for trend prediction by measuring the resulting improvement in trend forecasting performance.

Furthermore, the study focuses on OTSs’ value for trend prediction purposes. As outlined in section 7.2, however, OTSs and their posted content can also be valuable for other marketing areas, such as online marketing. This opens another future research direction: investigating how online marketing, e.g., influencer marketing, can profit from OTSs and their published content. As OTSs influence others, the language and the specific wording they use can, for instance, help marketing choose the right wording for marketing slogans. In this context, it can be analyzed how sophisticated text mining techniques can support marketing automation, e.g., by automatically generating marketing slogans based on the postings.

This study applies methods of advanced analytics to social media data to reveal knowledge about the newly emerging trend-relevant role of OTSs, who are especially relevant for the competitiveness of fashion companies. To achieve the research objective, the study combines various disciplines, e.g., social media analytics, (fashion) trend research, and social science, and thereby reveals insights that contribute to several research areas, e.g., fashion informatics and trend research. The research at hand shows the benefit of cross-disciplinary research and is intended to inspire future research that uses methods from the field of advanced analytics to reveal knowledge in the field of marketing research.


References

Abdelbary, H. A. and El-Korany, A. (2013), “Semantic topics modeling approach for community detection”, International Journal of Computer Applications, Vol. 81 No. 6, pp. 50-58.

Abdullah, S. and Wu, X. (2011), “An Epidemic Model for News Spreading on Twitter”, 23rd International Conference on Tools with Artificial Intelligence, IEEE Computer Society, Boca Raton, pp. 163-169.

Aggarwal, C. C. and Zhai, C. (2012), “An introduction to text mining”, in: Aggarwal C. and Zhai C. (eds.), Mining text data, pp. 1-10, Springer, Boston, MA.

Agarwal, A., Xie, B., Vovsha, I., Rambow, O. and Passonneau, R. J. (2011), “Sentiment analysis of twitter data”, Proceedings of the workshop on language in social media, pp. 30-38, Association for Computational Linguistics, Stroudsburg, PA.

Agarwal, N., Liu, H., Tang, L. and Yu, P. S. (2008), “Identifying the influential bloggers in a community”, Proceedings of the 2008 international conference on web search and data mining, pp. 207-218, ACM, New York.

Ahmad, N., Salman, A. and Ashiq, R. (2015), “The impact of social media on fashion industry: Empirical investigation from Karachiites”, Journal of resources development and management, Vol. 7, pp. 1-7.

Allibhai (2018), “Hold-out vs. Cross-validation in Machine Learning”, available at: https://medium.com/@eijaz/holdout-vs-cross-validation-in-machine-learning-7637112d3f8f (accessed 2021, January 6)

An, H. and Park, M. (2020), “Approaching trend applications using text mining and semantic network analysis”, Fashion and Textiles, Vol. 7 No. 1, pp. 1-15.

Audrezet, A., De Kerviler, G. and Moulard, J. G. (2020), “Authenticity under threat: When social media influencers need to go beyond self-presentation”, Journal of Business Research, Vol. 117, pp. 557-569.

Ballabio, D., Grisoni, F. and Todeschini, R. (2018), “Multivariate comparison of classification performance measures”, Chemometrics and Intelligent Laboratory Systems, Vol. 174, pp. 33-44.

Bakharia, A. (2016), “Topic Modeling with Scikit Learn”, available at: https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730 (accessed 2019, September 25)

Bakshy, E., Karrer, B. and Adamic, L. (2009), “Social influence and the diffusion of user-created content”, Proceedings of Conference on Electronic commerce, pp. 325–334, ACM, New York.


Bakshy, E., Rosenn, I., Marlow, C. and Adamic, L. (2012), “The role of social networks in information diffusion”, Proceedings of the 21st international conference on World Wide Web, pp. 519-528, ACM, New York.

Baldwin, T., Cook, P., Lui, M., MacKinlay, A. and Wang, L. (2013), “How noisy social media text, how diffrnt social media sources?”, Proceedings of the Sixth International Joint Conference on Natural Language Processing, pp. 356-364, Asian Federation of Natural Language Processing, Nagoya.

Bamakan, S. M. H., Nurgaliev, I. and Qu, Q. (2019), “Opinion leader detection: A methodological review”, Expert Systems with Applications, Vol. 115, pp. 200-222.

Batinic, B., Haupt, C. M. and Wieselhuber, J. (2006), “Validierung und Normierung des Fragebogens zur Erfassung von Trendsetting (TDS)”, Diagnostica, Vol. 52 No. 2, pp. 60-72.

Batrinca, B. and Treleaven, P. C. (2015), “Social media analytics: a survey of techniques, tools and platforms”, AI & Society, Vol. 30 No. 1, pp. 89-116.

Baumgarten, S. A. (1975), “The innovative communicator in the diffusion process”, Journal of Marketing Research, Vol. 12 No. 1, pp. 12-18.

Beaudoin, P., Moore, M. A. and Goldsmith, R. E. (2000), “Fashion leaders' and followers' attitudes toward buying domestic and imported apparel”, Clothing and Textiles Research Journal, Vol. 18 No. 1, pp. 56-64.

Bedi, P. and Sharma, C. (2016), “Community detection in social networks”, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Vol. 6 No. 3, pp. 115-135.

Beheshti-Kashi, S., Karimi, H. R., Thoben, K.-D., Lütjen, M. and Teucke, M. (2015), “A survey on retail sales forecasting and prediction in fashion markets”, Systems Science & Control Engineering, Vol. 3 No. 1, pp. 154-161.

Behling, D. U. (1992), “Three and a half decades of fashion adoption research: what have we learned?”, Clothing and Textiles Research Journal, Vol. 10 No. 2, pp. 34-41.

Bergstra, J. and Bengio, Y. (2012), “Random search for hyper-parameter optimization”, The Journal of Machine Learning Research, Vol. 13 No. 1, pp. 281-305.

Bhardwaj, V. and Fairhurst, A. (2010), “Fast fashion: response to changes in the fashion industry”, The International Review of Retail, Distribution and Consumer Research, Vol. 20 No. 1, pp. 165–173.

Bengfort, B., Bilbro, R. and Ojeda, T. (2018), Applied Text Analysis with Python, O’Reilly Media.

Bentéjac, C., Csörgő, A. and Martinez-Munoz, G. (2020), “A comparative analysis of gradient boosting algorithms”, Artificial Intelligence Review, Vol. 54, pp. 1937-1967.


Blumer, H. (1969), “Fashion: from class differentiation to collective selection”, Sociological Quarterly, Vol. 10, pp. 275-291.

Bogner, A., Littig, B. and Menz, W. (2014), Interviews mit Experten. Eine praxisorientierte Einführung, Springer Verlag, Wiesbaden.

Boone, T., Ganeshan, R., Hicks, R. L. and Sanders, N. R. (2018), “Can Google trends improve your sales forecast?”, Production and Operations Management, Vol. 27 No. 10, pp. 1770-1774.

Boyd, D. M. and Ellison, N. B. (2008), “Social network sites: Definition, history, and scholarship”, Journal of computer-mediated communication, Vol. 13 No. 1, pp. 210-230.

Breiman, L. (2001), “Random forests”, Machine learning, Vol. 45 No. 1, pp. 5-32.

Brownlee, J. (2017), Master Machine Learning Algorithms: discover how they work and implement them from scratch, Machine Learning Mastery.

Brügger, N. (2018), “Web history of social media”, in: Burgess, J.; Marwick, A. E. and Poell, T. (eds.), The SAGE handbook of social media, pp. 196–212, SAGE Publications, London.

Capgemini research institute (2020), “How sustainability is fundamentally changing consumer preferences”, available at: https://www.capgemini.com/wp-content/uploads/2020/07/20-06_9880_Sustainability-in-CPR_Final_Web-1.pdf, (accessed 2021, January 14).

Casalicchio, G., Molnar, C. and Bischl, B. (2018), “Visualizing the feature importance for black box models”, Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 655-670, Springer, Cham.

Casaló, L. V., Flavián, C., and Ibáñez-Sánchez, S. (2020), “Influencers on Instagram: Antecedents and Consequences of Opinion Leadership”, Journal of Business Research, Vol. 117, pp. 510-519.

Cervellini, P., Menezes, A. G. and Mago, V. K. (2016), “Finding trendsetters on yelp dataset”, 2016 IEEE Symposium series on computational intelligence (SSCI), pp. 1-7, IEEE, Athens.

Cha, M., Haddadi, H., Benevenuto, F. and Gummadi, K. P. (2010), “Measuring user influence in twitter: The million follower fallacy”, Fourth international AAAI conference on weblogs and social media, pp. 10-17, AAAI, Menlo Park, CA.


Chan, K. K. and Misra, S. (1990), “Characteristics of the opinion leader: A new dimension”, Journal of advertising, Vol. 19 No. 3, pp. 53-60.

Chang, H. C. (2010), “A new perspective on Twitter hashtag use: Diffusion of innovation theory”, Proceedings of the American Society for Information Science and Technology, Vol. 47 No. 1, pp. 1-4.

Chang, Y. T., Yu, H. and Lu, H. P. (2015), “Persuasive Messages, Popularity Cohesion, and Message Diffusion in Social Media Marketing”, Journal of Business Research, Vol. 68 No. 4, pp. 777-782.

Chawla, R. (2017), “Topic Modeling with LDA and NMF on the ABC News Headlines dataset”, available at: https://medium.com/ml2vec/topic-modeling-is-an-unsupervised-learning-approach-to-clustering-documents-to-discover-topics-fdfbf30e27df, (accessed 2019, July 30).

Chen, K. T. and Luo, J. (2017), “When fashion meets big data: Discriminative mining of best selling clothing features”, Proceedings of the 26th International Conference on World Wide Web Companion, pp. 15-2, International World Wide Web Conferences Steering Committee, Perth.

Chen, Y. C., Chen, Y. H., Hsu, C. H., You, H. J., Liu, J. and Huang, X. (2017), “Mining opinion leaders in big social network”, 31st International Conference on Advanced Information Networking and Applications (AINA), pp. 1012-1018, IEEE, Taipei.

Chen, Y., Wang, X., Tang, B., Xu, R., Yuan, B., Xiang, X. and Bu, J. (2014), “Identifying opinion leaders from online comments”, Chinese national conference on social media processing, pp. 231-239, Springer, Berlin, Heidelberg.

Chen, Y., Zhang, H., Liu, R., Ye, Z. and Lin, J. (2019), “Experimental explorations on short text topic mining between LDA and NMF based Schemes”, Knowledge-Based Systems, Vol. 163, pp. 1–13.

Chetioui, Y., Benlafqih, H. and Lebdaoui, H. (2020), “How fashion influencers contribute to consumers' purchase intention”, Journal of Fashion Marketing and Management, Vol. 24 No. 3, pp. 361-380.

Choo, J., Lee, C., Reddy, C. K. and Park, H. (2013), “UTOPIAN: User-Driven Topic Modeling Based on Interactive Nonnegative Matrix Factorization”, IEEE Transactions on Visualization and Computer Graphics, Vol. 19 No. 12, pp. 1992–2001.

Clement, J. (2020), “Global social networks ranked by number of users 2020”, available at: https://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/, (accessed 2020, July 05)

Copeland, L., Ciampaglia, G. L. and Zhao, L. (2019), “Fashion informatics and the network of fashion knockoffs”, First Monday, Vol. 24 No. 12.


Covert, I., Lundberg, S. and Lee, S. I. (2020), “Understanding global feature contributions with additive importance measures”, Advances in Neural Information Processing Systems 33.

Cristianini, N. and Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, Cambridge.

Darden, W. R. and Reynolds, F. D. (1972), “Predicting opinion leadership for men's apparel”, Journal of Marketing Research, Vol. 9 No. 3, pp. 324-328.

Darden, W. R. and Reynolds, F. D. (1974), “Backward profiling of male innovators”, Journal of Marketing Research, Vol. 11 No. 1, pp. 79-85.

Darmon, D., Omodei, E. and Garland, J. (2015), “Followers are not enough: A multifaceted approach to community detection in online social networks”, PloS one, Vol. 10 No. 8, p. e0134860.

Deng, H., Sun, Y., Chang, Y. and Han, J. (2014), “Probabilistic models for classification”, in: Aggarwal, C. C. (ed.), Data Classification: Algorithms and Applications, pp. 65-86, CRC Press, Boca Raton.

Denny, I. (2020), “The sneaker–marketplace icon”, Consumption Markets & Culture, pp. 1-12.

De Veirman, M., Cauberghe, V. and Hudders, L. (2017), “Marketing through Instagram influencers: the impact of number of followers and product divergence on brand attitude”, International Journal of Advertising, Vol. 36 No. 5, pp. 798-828.

Dillon, S. (2012), The fundamentals of fashion management, AVA Academia, Lausanne.

Ding, Y. (2011), “Community detection: Topological vs. topical”, Journal of Informetrics, Vol. 5 No. 4, pp. 498-514.

Easey, M. (2009), Fashion marketing, John Wiley & Sons, Oxford.

Ellison, N. B. and Boyd, D. (2013), “Sociality through social network sites”, The Oxford handbook of internet studies, pp. 151-172.

Eke, C. I., Norman, A. A., Shuib, L. and Nweke, H. F. (2019), “Sarcasm identification in textual data: systematic review, research challenges and open directions”, Artificial Intelligence Review, Vol. 53 No. 6, pp. 4215-4258.

Fan, W. and Gordon, M. D. (2014), “The power of social media analytics”, Communications of the ACM, Vol. 57 No. 6, pp. 74-81.

Ferreira, A. J. and Figueiredo, M. A. (2012), “Boosting algorithms: A review of methods, theory, and applications”, in: Zhang, C. and Ma, Y. (eds.), Ensemble machine learning, pp. 35-85, Springer, Boston.


Ferrara, E., Interdonato, R. and Tagarelli, A. (2014), “Online popularity and topical interests through the lens of Instagram”, Proceedings of the 25th ACM conference on Hypertext and social media, pp. 24-34, ACM, New York.

Figueiredo, F., Almeida, J. M., Gonçalves, M. A. and Benevenuto, F. (2014), “On the dynamics of social media popularity: A YouTube case study”, ACM Transactions on Internet Technology (TOIT), Vol. 14 No. 4, pp. 1-23.

Field, G. A. (1970), “The Status Float Phenomenon: The Upward Diffusion of Innovation”, Business Horizons, Vol. 13, pp. 45-52.

Freberg, K., Graham, K., McGaughey, K. and Freberg, L. A. (2011), “Who are the social media influencers? A study of public perceptions of personality”, Public Relations Review, Vol. 37 No. 1, pp. 90-92.

Freeman, L. C. (1978), “Centrality in social networks conceptual clarification”, Social networks, Vol. 1 No. 3, pp. 215-239.

Freund, Y. and Schapire, R. E. (1996), “Experiments with a new boosting algorithm”, Proceedings of the 13th International Conference on International Conference on Machine Learning, Vol. 96, pp. 148-156, ACM, New York.

Friedman, J. H. (2001), “Greedy function approximation: a gradient boosting machine”, Annals of statistics, Vol. 29, pp. 1189-1232.

Fuller, W. A. (1976), Introduction to statistical time series, Wiley, New York.

Gale, D. (1955), “The law of supply and demand”, Mathematica Scandinavica, Vol. 3, pp. 155-169.

Gandomi, A. and Haider, M. (2015), “Beyond the hype: Big data concepts, methods, and analytics”, International journal of information management, Vol. 35 No. 2, pp. 137-144.

García, S., Luengo, J. and Herrera, F. (2015), Data Preprocessing in Data Mining, Springer.

Gazzola, P., Pavione, E., Pezzetti, R. and Grechi, D. (2020), “Trends in the fashion industry. The perception of sustainability and circular economy: A gender/generation quantitative approach”, Sustainability, Vol. 12 No. 7, pp. 1-19.

Girvan, M. and Newman, M. E. (2002), “Community structure in social and biological networks”, Proceedings of the national academy of sciences, Vol. 99 No. 12, pp. 7821-7826.

Ghali, N., Panda, M., Hassanien, A.E., Abraham, A. and Snasel, V. (2012), “Social networks analysis: Tools, measures and visualization”, in: Abraham, A. (ed.), Computational Social Networks, pp. 3-23, Springer, London.

Ghojogh, B. and Crowley, M. (2019), “The theory behind overfitting, cross validation, regularization, bagging, and boosting: tutorial”, arXiv:1905.12787.

Guan, Z. (2020), “Irrational Consumer Behaviors in the Sneaker Market”, Management Science and Engineering, Vol. 14 No. 1, pp. 49-52.


Granger, C. W. (1969), “Investigating causal relations by econometric models and cross-spectral methods”, Econometrica: journal of the Econometric Society, Vol. 37, pp. 424-438.

Guille, A., Hacid, H., Favre, C. and Zighed, D. A. (2013), “Information Diffusion in Online Social Networks: A Survey”, ACM Sigmod Record, Vol. 42 No. 2, pp. 17-28.

Gupta, M., Gao, J., Zhai, C. and Han, J. (2012), “Predicting future popularity trend of events in microblogging platforms”, Proceedings of the American Society for Information Science and Technology, Vol. 49 No. 1, pp. 1-10.

Gupta, P., Jindal, R. and Sharma, A. (2018), “Community trolling: an active learning approach for topic based community detection in big data”, Journal of Grid Computing, Vol. 16 No. 4, pp. 553-567.

Hamann, M., Röhrs, E. and Wagner, D. (2017), “Local community detection based on small cliques”, Algorithms, Vol. 10 No. 3, pp. 1-22.

Haq, A. U., Zhang, D., Peng, H. and Rahman, S. U. (2019), “Combining multiple feature-ranking techniques and clustering of variables for feature selection”, IEEE Access, Vol. 7, pp. 151482-151492.

Hastie, T., Tibshirani, R. and Friedman, J. (2009), The Elements of Statistical Learning, Springer, New York.

He, H. and Garcia, E. A. (2009), “Learning from imbalanced data”, IEEE Transactions on knowledge and data engineering, Vol. 21 No. 9, pp. 1263-1284.

Herrera, F., Charte, F., Rivera, A. J. and del Jesus, M. J. (2016), Multilabel Classification, Problem Analysis, Metrics and Techniques, Springer, Cham, Switzerland.

Herrmann, T., Jahnke, I. and Loser, K. U. (2004), “The role concept as a basis for designing community systems”, in: Darses, F., Dieng, R., Simone, C., Zackland, M. (eds.), Cooperative Systems Design, Scenario-Based Design of Collaborative Systems, pp. 163-178, IOS Press, Amsterdam.

Hevner, A. R., March, S. T., Park, J. and Ram, S. (2004), “Design Science in Information Systems Research”, MIS Quarterly, Vol. 28 No. 1, pp. 75-105.

Himelboim, I., Smith, M.A., Rainie, L., Shneiderman, B. and Espina, C. (2017), “Classifying Twitter topic-networks using social network analysis”, Social media+ society, Vol. 3 No. 1.

Hirschman, E. C. and Adcock, W. O. (1978), “An examination of innovative communicators, opinion leaders and innovators for men's fashion apparel”, Advances in Consumer Research, Vol. 5 No. 1, pp. 308-314.

Holsapple, C. W., Hsiao, S.-H. and Pakath, R. (2018), “Business social media analytics: Characterization and conceptual framework”, Decision Support Systems, Vol. 110, pp. 32–45.


Hootsuite & We Are Social (2019), “Digital 2019 Global Digital Overview”, available at: https://datareportal.com/reports/digital-2019-global-digital-overview, (accessed 2019, July 15)

Hopf, C. (2012), “Qualitative Interviews - ein Überblick”, in: Flick, U., von Kardoff, E. and Steinke, I. (eds.), Qualitative Forschung: Ein Handbuch, pp. 349–360, Rowohlt Taschenbuch Verlag, Reinbek bei Hamburg.

Hu, X. and Liu, H. (2012), “Text analytics in social media”, in: Aggarwal C., Zhai C. (eds.), Mining Text Data, pp. 385-414, Springer, Boston, MA.

Hu, Y., Manikonda, L. and Kambhampati, S. (2014), “What we instagram: A first analysis of instagram photo content and user types”, 8th International Conference on Weblogs and Social Media, AAAI, Ann Arbor, Michigan.

Hyun, S. J. and Koh, B. (2020), “Benefiting from Resellers: The Impact of the Online Secondary Sneaker Market on the Sneaker Brand's Profit and Consumer Surplus”, Proceedings of the Pacific Asia Conference on Information Systems (PACIS), Association for Information Systems, Dubai, UAE.

IBM Institut for value (2020), “Meet the 2020 consumers driving change”, available at: https://www.ibm.com/downloads/cas/EXK4XKX8 (accessed 2020, January 14)

Jabreel, M., Huertas, A. and Moreno, A. (2018), “Semantic analysis and the evolution towards participative branding: Do locals communicate the same destination brand values as DMOs?”, PLoS one, Vol. 13 No. 11, pp. 1–29.

Jackson, T. (2007), “The Process of Trend Development Leading to a Fashion Season”, in: Hines, T. and Bruce, M. (eds.), Fashion marketing. Contemporary issues, pp. 168-187, Butterworth-Heinemann, Oxford.

Japkowicz, N. and Stephen, S. (2002), “The class imbalance problem: a systematic study”, Intelligence Data Analysis Journal, Vol. 6 No. 5, pp. 429–450.

Jarukasemratana, S., Murata, T. and Liu, X. (2013), “Community detection algorithm based on centrality and node distance in scale-free networks”, Proceedings of the 24th ACM Conference on Hypertext and Social Media, pp. 258-262, ACM, New York.

Jin, S. V., Muqaddam, A. and Ryu, E. (2019), “Instafamous and social media influencer marketing”, Marketing Intelligence & Planning, Vol. 37 No. 5, pp. 567-579.

Kalyanam, J., Mantrach, A., Saez-Trumper, D., Vahabi, H. and Lanckriet, G. (2015), “Leveraging social context for modeling topic evolution”, Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 517-526, ACM, New York.

Kaplan, A. M. and Haenlein, M. (2010), “Users of the world, unite! The challenges and opportunities of Social Media”, Business horizons, Vol. 53 No. 1, pp. 59-68.

Katz, E. and Lazarsfeld, P. F. (1955), Personal Influence: The part played by people in the Flow of Mass communications, The Free Press, Glencoe, Illinois.


Katz, E., Lazarsfeld, P. F. and Roper, E. (2005), Personal influence: The part played by people in the flow of mass communications, Routledge, London.

Kay, S., Mulcahy, R. and Parkinson, J. (2020), “When less is more: the impact of macro and micro social media influencers’ disclosure”, Journal of Marketing Management, Vol. 36 No. 3-4, pp. 248-278.

Kayes I., Qian X., Skvoretz J. and Iamnitchi A. (2012), “How Influential Are You: Detecting Influential Bloggers in a Blogging Community”, in: Aberer, K., Flache, A., Jager, W., Liu, L., Tang, J., Guéret, C. (eds.) Social Informatics, pp. 29-42, Springer, Berlin, Heidelberg.

Khan, N. S., Ata, M. and Rajput, Q. (2015), “Identification of opinion leaders in social network”, 2015 International Conference on Information and Communication Technologies (ICICT), pp. 31-36, IEEE, Karachi, Pakistan.

Khorasgani, R. R., Chen, J. and Zaiane, O. R. (2010), “Top leaders community detection approach in information networks”, 4th SNA-KDD workshop on social network mining and analysis, ACM, New York.

Khrabrov, A. and Cybenko, G. (2010), “Discovering influence in communication networks using dynamic graph analysis”, 2010 IEEE Second International Conference on Social Computing, pp. 288-294, IEEE, Washington, DC.

Kietzmann, J. H., Hermkens, K., McCarthy, I. P., Silvestre, B. S. (2011), “Social media? Get serious! Understanding the functional building blocks of social media”, Business Horizons, Vol. 54 No. 3, pp. 241–251.

Kim, E., Fiore, A. M. and Kim, H. (2011), Fashion Trends: Analysis and Forecasting, Bloomsbury Academic, London, New York.

Kim, S., Han, J., Yoo, S. and Gerla, M. (2017a), “How are social influencers connected in instagram?”, International Conference on Social Informatics, pp. 257-264, Springer, Cham.

Kim, A., Miano, T., Chew, R., Eggers, M. and Nonnemaker, J. (2017b), “Classification of Twitter users who tweet about e-cigarettes”, JMIR public health and surveillance, Vol. 3 No. 3.

Kim, M. and Schrank, H. L. (1982), “Part 1: Fashion Leadership among Korean College Women”, Home Economics Research Journal, Vol. 10 No. 3, pp. 227-234.

King, C. W. (1963), “Fashion adoption - a rebuttal to the 'trickle down' theory”, in: Greyser, S. A. (ed.), Toward scientific marketing, Vol. 112, American Marketing Association, Chicago.

King, C. W. and Ring, L. J. (1980), “The Dynamics of Style and Taste Adoption and Diffusion: Contributions From Fashion Theory”, NA - Advances in Consumer Research, Vol. 7, pp. 13-16.

Kloumann, I. M. and Kleinberg, J. M. (2014), “Community membership identification from small seed sets”, Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1366-1375.


Kobayashi, R. and Lambiotte, R. (2016), “Tideh: Time-Dependent Hawkes Process for Predicting Retweet Dynamics”, Tenth International AAAI Conference on Web and Social Media, AAAI, Cologne.

Kohavi, R. (1995), “A study of cross-validation and bootstrap for accuracy estimation and model selection”, International Joint Conference on Artificial Intelligence, Vol. 14 No. 2, pp. 1137-1145.

Kotsiantis, S. B. (2007), “Supervised machine learning: A review of classification techniques”, Informatica, Vol. 31, pp. 249-268.

Kotsiantis, S. B., Kanellopoulos, D. and Pintelas, P. E. (2006), “Data preprocessing for supervised learning”, International Journal of Computer Science, Vol. 1 No. 2, pp. 111-117.

Kreutzer, R. T. (2012), “Corporate reputation management in den sozialen Medien”, in: Wüst, C. and Kreutzer, R. T. (eds.), Corporate Reputation Management, pp. 251-281, Gabler, Wiesbaden.

Krishna, P. M., Mohan, A. and Srinivasa, K. G. (2018), Practical Social Network Analysis with Python, Springer, Cham.

Kuckartz, U. (2014), Qualitative Inhaltsanalyse. Methoden, Praxis, Computerunterstützung, Beltz Juventa, Weinheim, Basel.

Kuhn, M. and Johnson, K. (2013), Applied Predictive Modeling, Springer, New York.

Laudel, G. and Gläser, J. (2004), Experteninterviews und qualitative Inhaltsanalyse als Instrumente rekonstruierender Untersuchungen, Verlag für Sozialwissenschaften, Wiesbaden.

Lazarsfeld, P. F., Berelson, B. and Gaudet, H. (1944), The people’s choice: How the voter makes up his mind in a presidential campaign, Columbia University Press, New York.

Lee, D. J., Han, J., Chambourova, D. and Kumar, R. (2017), “Identifying Fashion Accounts in Social Networks”, ML4Fashion’17, Halifax, Nova Scotia.

Lehmann, J., Gonçalves, B., Ramasco, J. J. and Cattuto, C. (2012), “Dynamical classes of collective attention in twitter”, Proceedings of the 21st international conference on World Wide Web, pp. 251-260.

Lehmann, E. L. and Romano, J. P. (2005), Testing Statistical Hypotheses, Springer, New York.

Li, Y., Ma, S., Zhang, Y. and Huang, R. (2013), “An improved mix framework for opinion leader identification in online learning communities”, Knowledge-Based Systems, Vol. 43, pp. 43-51.

Lima, A. C. E. and De Castro, L. N. (2014), “A multi-label, semi-supervised classification approach applied to personality prediction in social media”, Neural Networks, Vol. 58, pp. 122-130.

XVIII References

Lin, H. C., Bruning, P. F. and Swarna, H. (2018), “Using online opinion leaders to promote the hedonic and utilitarian value of products and services”, Business Horizons, Vol. 61 No. 3, pp. 431-442.

Lin, Y., Zhou, Y. and Xu, H. (2015), “Text-generated fashion influence model: An empirical study on style.com”, Proceedings of the 48th Hawaii International Conference on System Sciences, pp. 3642–3650, IEEE, Hawaii.

Lin, Y., Zhou, Y. and Xu, H. (2014), “The hidden influence network in the fashion industry”, The 24th Annual Workshop on Information Technologies and Systems.

Liu, S. and Jansson, P. (2017), “City Event Identification from Instagram Data using Word Embedding and Topic Model Visualization”, Arcada Working Papers.

Liu, H., Yu, X. and Lu, J. (2013), “Identifying TOP-N opinion leaders on local social network”, IET International Conference on Smart and Sustainable City (ICSSC), IET, Shanghai.

Linton, R. (1936), The study of man: an introduction, D. Appleton-Century Company, New York.

Lou, C. and Yuan, S. (2019), “Influencer marketing: how message value and credibility affect consumer trust of branded content on social media”, Journal of Interactive Advertising, Vol. 19 No. 1, pp. 58-73.

Luca, M. (2017), “Designing online marketplaces: Trust and reputation mechanisms”, Innovation Policy and the Economy, Vol. 17, pp. 77-93.

Lundberg, S. and Lee, S. I. (2017), “A unified approach to interpreting model predictions”, Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4768–4777, Curran Associates Inc., New York.

Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N. and Lee, S. I. (2020), “From local explanations to global understanding with explainable AI for trees”, Nature machine intelligence, Vol. 2 No. 1, pp. 56-67.

Lü, L., Zhang, Y. C., Yeung, C. H. and Zhou, T. (2011), “Leaders in social networks, the delicious case”, PLoS ONE, Vol. 6 No. 6.

Manikonda, L., Venkatesan, R., Kambhampati, S. and Li, B. (2016), “Trending Chic: Analyzing the Influence of Social Media on Fashion Brands”, Department of Computer Science, Arizona State University.

Manning, C. D., Raghavan, P. and Schütze, H. (2008), Introduction to Information Retrieval, Cambridge University Press, Cambridge.


Ma, Z., Sun, A. and Cong, G. (2012), “Will this #Hashtag be Popular Tomorrow?”, Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1173-1174, ACM, Portland.

Malliaros, F. D. and Vazirgiannis, M. (2013), “Clustering and community detection in directed networks: A survey”, Physics Reports, Vol. 533 No. 4, pp. 95-142.

Marcot, B. G. and Hanea, A. M. (2020), “What is an optimal value of k in k-fold cross-validation in discrete Bayesian network analysis?”, Computational Statistics, Vol. 52 No. 5, pp. 667-692.

McNeill, L. and Moore, R. (2015), “Sustainable fashion consumption and the fast fashion conundrum: fashionable consumers and attitudes to sustainability in clothing choice”, International Journal of Consumer Studies, Vol. 39 No. 3, pp. 212-222.

Merkens, H. (2012), “Auswahlverfahren, Sampling, Fallkonstruktion”, in: Flick, U., von Kardoff, E. and Steinke, I. (eds.), Qualitative Forschung: Ein Handbuch, pp. 286–299, Rowohlt Taschenbuch Verlag, Reinbek bei Hamburg.

Mittelstaedt, J. D., Shultz, C. J., Kilbourne, W. E. and Peterson, M. (2014), “Sustainability as megatrend: Two schools of macromarketing thought”, Journal of Macromarketing, Vol. 34 No. 3, pp. 253-264.

Molnar, C., König, G., Herbinger, J., Freiesleben, T., Dandl, S., Scholbeck, C. A., Casalicchio, G., Grosse-Wentrup, M. and Bischl, B. (2020), “Pitfalls to avoid when interpreting machine learning models”, arXiv preprint arXiv:2007.04131.

Morgan, M., Killip, G. and Diakonova, M. (2019), “Using Twitter data to identify networks of interest in minority policy topics”, SRI Working Papers.

Morstatter, F., Wu, L., Nazer, T.H., Carley, K.M. and Liu, H. (2016), “A new approach to bot detection: striking the balance between precision and recall”, 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 533-540, IEEE, San Francisco.

Murthy, S. K. (1998), “Automatic construction of decision trees from data: A multi-disciplinary survey”, Data Mining and Knowledge Discovery, Vol. 2 No. 4, pp. 345-389.

Ng, A. Y. (2004), “Feature selection, L1 vs. L2 regularization, and rotational invariance”, Proceedings of the twenty-first international conference on Machine learning, pp. 78-85, ACM, New York.

Nirschl, M. and Steinberg, L. (2018), Einstieg in das Influencer Marketing, Springer Fachmedien, Wiesbaden.

Niwattanakul, S., Singthongchai, J., Naenudorn, E. and Wanapu, S. (2013), “Using of Jaccard coefficient for keywords similarity”, Proceedings of the international multiconference of engineers and computer scientists, Vol. 1 No. 6, pp. 380-384.


Obar, J. and Wildman, S. (2015), “Social Media Definition and the Governance Challenge: An Introduction to the Special Issue”, Telecommunications Policy, Vol. 39 No. 9, pp. 745-750.

Oliveira, M. and Gama, J. (2012), “An overview of social network analysis”, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Vol. 2 No. 2, pp. 99–115.

Ortiz-Arroyo, D. (2010), “Discovering sets of key players in social networks”, Computational social network analysis, pp. 27-47, Springer, London.

O’Sullivan, C. (2020), “Visualising the Classification Power of Data using PCA”, available at: https://towardsdatascience.com/visualising-the-classification-power-of-data-54f5273f640 (accessed 2021, January 3).

Pal, A. and Counts, S. (2011), “Identifying topical authorities in microblogs”, Proceedings of the fourth ACM international conference on Web search and data mining, pp. 45-54, ACM, New York.

Pal, A., Herdagdelen, A., Chatterji, S., Taank, S. and Chakrabarti, D. (2016), “Discovery of topical authorities in Instagram”, Proceedings of the 25th international conference on world wide web, pp. 1203-1213, ACM, New York.

Pal, A. and Prakash, P. (2017), Practical Time Series Analysis: Master Time Series Data Processing, Visualization, and Modeling using Python, Packt Publishing Ltd, Birmingham - Mumbai.

Papadopoulos, S., Kompatsiaris, Y., Vakali, A. and Spyridonos, P. (2012), “Community detection in social media”, Data Mining and Knowledge Discovery, Vol. 24 No. 3, pp. 515-554.

Park, J., Ciampaglia, G. L. and Ferrara, E. (2016), “Style in the Age of Instagram: Predicting Success Within the Fashion Industry Using Social Media”, Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, pp. 64-73, ACM, San Francisco.

Parsons, T. and Shils, E. A. (1951), Toward a General Theory of Action, Harvard University Press, Cambridge, MA.

Peffers, K., Tuunanen, T., Rothenberger, M. A. and Chatterjee, S. (2008), “A design science research methodology for information systems research”, Journal of Management Information Systems, Vol. 24 No. 3, pp. 45-77.

Peffers, K., Rothenberger, M., Tuunanen, T. and Vaezi, R. (2012), “Design science research evaluation”, International Conference on Design Science Research in Information Systems, pp. 398-410, Springer, Berlin, Heidelberg.

Pennacchiotti, M. and Popescu, A. M. (2011), “A machine learning approach to twitter user classification”, Fifth international AAAI conference on weblogs and social media, pp. 281-288, AAAI, Menlo Park.


Phua, J., Jin, S. V. and Kim, J. (2017), “Gratifications of using Facebook, Twitter, Instagram, or Snapchat to follow brands: The moderating effect of social comparison, trust, tie strength, and network homophily on brand identification, brand engagement, brand commitment, and membership intention”, Telematics and Informatics, Vol. 34 No. 1, pp. 412–424.

Polegato, R. and Wall, M. (1980), “Information seeking by fashion opinion leaders and followers”, Home Economics Research Journal, Vol. 8 No. 5, pp. 327-338.

Probst, P., Boulesteix, A. L. and Bischl, B. (2019), “Tunability: Importance of hyperparameters of machine learning algorithms”, Journal of Machine Learning Research, Vol. 20 No. 53, pp. 1-32.

Quelhas-Brito, P., Brandão, A., Gadekar, M. and Castelo-Branco, S. (2020), “Diffusing fashion information by social media fashion influencers: understanding antecedents and consequences”, Journal of Fashion Marketing and Management, Vol. 24 No. 2, pp. 137-152.

Quinlan, J. R. (1996), “Bagging, boosting, and C4.5”, Proceedings of the Thirteenth National Conference on Artificial Intelligence, pp. 725–730, AAAI Press, Menlo Park, California.

Rakoczy, M. E., Bouzeghoub, A., Gancarski, A. L. and Wegrzyn-Wolska, K. (2018), “In the search of quality influence on a small scale–micro-influencers discovery”, OTM Confederated International Conferences "On the Move to Meaningful Internet Systems", pp. 138-153, Springer, Cham.

Ramamonjisoa, D., Murakami, R. and Chakraborty, B. (2015), “Comments Analysis and Visualization Based on Topic Modeling and Topic Phrase Mining”, Proceedings of the Third International Conference on E-Technologies and Business on the Web, Society of Digital Information and Wireless Communication, Paris.

Raschka, S. (2016), “Model evaluation, model selection, and algorithm selection in machine learning”, arXiv preprint arXiv:1811.12808.

Ratner, B. (2009), “The correlation coefficient: Its values range between +1/−1, or do they?”, Journal of Targeting, Measurement and Analysis for Marketing, Vol. 17 No. 2, pp. 139-142.

Rehman, A. U., Jiang, A., Rehman, A., Paul, A. and Sadiq, M. T. (2020), “Identification and role of opinion leaders in information diffusion for online discussion network”, Journal of Ambient Intelligence and Humanized Computing, https://doi.org/10.1007/s12652-019-01623-5.

Reis, J. C., Correia, A., Murai, F., Veloso, A. and Benevenuto, F. (2019), “Supervised learning for fake news detection”, IEEE Intelligent Systems, Vol. 34 No. 2, pp. 76-81.


Roberts, D. L. and Piller, F. T. (2016), “Finding the right role for social media in innovation”, MIT Sloan Management Review, Vol. 57 No. 3, pp. 41-47.

Rodríguez-Vidal, J., Gonzalo, J., Plaza, L. and Sánchez, H. A. (2019), “Automatic detection of influencers in social networks: Authority versus domain signals”, Journal of the Association for Information Science and Technology, Vol. 70 No. 7, pp. 675-684.

Rogers, E. M. (2003), Diffusion of Innovations (5th ed.), Free Press, New York.

Rogers, E. M. (2015), “Evolution: Diffusion of Innovations”, International Encyclopedia of the Social & Behavioral Sciences, Vol. 8, pp. 378-381.

Rogers, S. (2016), “What is Google Trends data and what does it mean?”, available at: https://medium.com/google-news-lab/what-is-google-trends-data-and-what-does-it-mean-b48f07342ee8 (accessed 2021, February 20).

Romero, D. M., Meeder, B. and Kleinberg, J. (2011), “Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex contagion on twitter”, Proceedings of the 20th international conference on World wide web, pp. 695-704, ACM, New York.

Rosenthal, S. and McKeown, K. (2017), “Detecting influencers in multiple online genres”, ACM Transactions on Internet Technology (TOIT), Vol. 17 No. 2, pp. 1-22.

Roy, B. (2020), “All about Feature Scaling”, available at: https://towardsdatascience.com/all-about-feature-scaling-bcc0ad75cb35 (accessed 2021, January 3)

Sachan, M., Contractor, D., Faruquie, T. A. and Subramaniam, L. V. (2012), “Using content and interactions for discovering communities in social networks”, Proceedings of the 21st international conference on World Wide Web, pp. 331-340, ACM, New York.

Saez-Trumper, D., Comarela, G., Almeida, V., Baeza-Yates, R. and Benevenuto, F. (2012), “Finding trendsetters in information networks”, Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1014–1022, ACM, New York.

Sahu, B., Dehuri, S. and Jagadev, A. K. (2017), “Feature selection model based on clustering and ranking in pipeline for microarray data”, Informatics in Medicine Unlocked, Vol. 9, pp. 107-122.

Salehi, M., Rabiee, H. R. and Rajabi, A. (2012), “Sampling from complex networks with high community structures”, Chaos: An Interdisciplinary Journal of Nonlinear Science, Vol. 22 No. 2.


Salloum, S. A., Al-Emran, M., Monem, A. A. and Shaalan, K. (2017), “A survey of text mining in social media: facebook and twitter perspectives”, Advances in Science, Technology and Engineering Systems Journal, Vol. 2 No. 1, pp. 127-133.

Scholbeck, C. A., Molnar, C., Heumann, C., Bischl, B. and Casalicchio, G. (2020), “Sampling, intervention, prediction, aggregation: A generalized framework for model-agnostic interpretations”, Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 205-216, Springer, Cham.

Schiele, G., Hähner, J. and Becker, C. (2008), Web 2.0-Technologien und Trends. Interactive Marketing im Web 2.0+, Franz Vahlen, München.

Schrank, H. L. and Gilmore, L. D. (1973), “Correlates of fashion leadership: Implications for fashion process theory”, Sociological Quarterly, Vol. 14 No. 4, pp. 534-543.

Scott, J. (2017), Social Network Analysis, SAGE publications.

Segev, N., Avigdor, N. and Avigdor, E. (2018), “Measuring influence on Instagram: a network-oblivious approach”, The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 1009-1012, ACM, New York.

Sen, P. C., Hajra, M. and Ghosh, M. (2020), “Supervised classification algorithms in machine learning: A survey and review”, in: Mandal, J. K. and Bhattacharya, D. (eds.), Emerging Technology in Modelling and Graphics, pp. 99-111, Springer, Singapore.

Shafiq, M. Z., Ilyas, M. U., Liu, A. X. and Radha, H. (2013), “Identifying leaders and followers in online social networks”, IEEE Journal on Selected Areas in Communications, Vol. 31 No. 9, pp. 618-628.

Shahbandeh, M. (2021), “Global Apparel Market - Statistics & Facts”, available at: https://www.statista.com/topics/5091/apparel-market-worldwide/ (accessed 2021, March 15)

Shmueli, G. and Koppius, O. R. (2011), “Predictive analytics in information systems research”, MIS Quarterly, Vol. 35 No. 3, pp. 553-572.

Silva, E. S., Hassani, H., Madsen, D. Ø. and Gee, L. (2019), “Googling fashion: forecasting fashion consumer behaviour using google trends”, Social Sciences, Vol. 8 No. 4, pp. 1-23.

Simmel, G. (1904), “Fashion”, International Quarterly, Vol. 10, pp. 130-155.

Sokolova, M., Japkowicz, N. and Szpakowicz, S. (2006), “Beyond accuracy, F-score and ROC: a family of discriminant measures for performance evaluation”, Australasian joint conference on artificial intelligence, pp. 1015-1021, Springer, Berlin, Heidelberg.

Song, X., Chi, Y., Hino, K. and Tseng, B. (2007), “Identifying opinion leaders in the blogosphere”, Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, pp. 971-974, ACM, New York.


Sonnenberg, C. and Vom Brocke, J. (2012), “Evaluations in the science of the artificial – reconsidering the build-evaluate pattern in design science research”, International Conference on Design Science Research in Information Systems, pp. 381-397, Springer, Berlin, Heidelberg.

Sproles, G. B. (1981), “Analyzing fashion life cycles – Principles and perspectives”, Journal of Marketing, Vol. 45 No. 4, pp. 116-124.

Sproles, G. B. and Burns, L. D. (1994), Changing Appearances: Understanding Dress in Contemporary Society, Fairchild Publications, New York.

Stahl, F., Schomm, F., Vossen, G. and Vomfell, L. (2016), “A classification framework for data marketplaces”, Vietnam Journal of Computer Science, Vol. 3, pp. 137–143.

Stai, E., Milaiou, E., Karyotis, V. and Papavassiliou, S. (2018), “Temporal dynamics of information diffusion in twitter: Modeling and experimentation”, IEEE Transactions on Computational Social Systems, Vol. 5 No. 1, pp. 256-264.

Stieglitz, S. and Dang-Xuan, L. (2013), “Social media and political communication: a social media analytics framework”, Social Network Analysis and Mining, Vol. 3 No. 4, pp. 1277-1291.

Stieglitz, S., Dang-Xuan, L., Bruns, A. and Neuberger, C. (2014), “Social media analytics – an interdisciplinary approach and its implications for information systems”, Business & Information Systems Engineering, Vol. 6 No. 2, pp. 89-96.

Stieglitz, S., Mirbabaie, M., Ross, B. and Neuberger, C. (2018), “Social media analytics – Challenges in topic discovery, data collection, and data preparation”, International Journal of Information Management, Vol. 39, pp. 156-168.

Stokman, F. N. (2001), “Networks: Social”, in: Smelser, N. J. and Baltes, P. B. (eds.), International Encyclopedia of the Social and Behavioral Sciences, pp. 10509-10514, Elsevier, Amsterdam.


Summers, J. O. (1970), “The identity of women's clothing fashion opinion leaders”, Journal of Marketing Research, Vol. 7 No. 2, pp. 178-185.

Sun, Y., Wong, A. K. and Kamel, M. S. (2009), “Classification of imbalanced data: A review”, International Journal of Pattern Recognition and Artificial Intelligence, Vol. 23 No. 4, pp. 687-719.

Susarla, A., Oh, J. H. and Tan, Y. (2012), “Social networks and the diffusion of user-generated content: Evidence from YouTube”, Information Systems Research, Vol. 23 No. 1, pp. 23-41.


Talib, R., Hanif, M. K., Ayesha, S. and Fatima, F. (2016), “Text Mining: Techniques, Applications and Issues”, International Journal of Advanced Computer Science and Applications, Vol. 7 No. 11, pp. 414-418.

Tang, J., Alelyani, S. and Liu, H. (2014), “Feature selection for classification: A review”, in: Aggarwal, C. (ed.), Data classification: Algorithms and applications, pp. 37-64.

Tankovska, H. (2021), “Global social networks ranked by number of users 2021”, available at: https://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/ (accessed 2021, March 3).

Tharwat, A. (2016), “Principal component analysis – a tutorial”, International Journal of Applied Pattern Recognition, Vol. 3 No. 3, pp. 197-240.

Trappmann, M., Hummell, H. J. and Sodeur, W. (2011), Strukturanalyse sozialer Netzwerke: Konzepte, Modelle, Methoden, VS Verlag, Wiesbaden.

Truica, C.-O., Radulescu, F. and Boicea, A. (2016), “Comparing different term weighting schemas for Topic Modeling”, 18th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), pp. 307-310, IEEE, Timisoara, Romania.

Tsai, M. F., Tzeng, C. W., Lin, Z. L. and Chen, A. L. (2014), “Discovering leaders from social network by action cascade”, Social Network Analysis and Mining, Vol. 4, available at: https://doi.org/10.1007/s13278-014-0165-9.

Tsugawa, S. and Kimura, K. (2018), “Identifying influencers from sampled social networks”, Physica A: Statistical Mechanics and its Applications, Vol. 507, pp. 294-303.

Tsur, O. and Rappoport, A. (2012), “What's in a Hashtag?: Content Based Prediction of the Spread of Ideas in Microblogging Communities”, Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, pp. 643-652, ACM, New York.

Varsamis, E. (2018), “Are Social Media Influencers The Next-Generation Brand Ambassadors?”, available at: https://www.forbes.com/sites/theyec/2018/06/13/are-social-media-influencers-the-next-generation-brand-ambassadors/?sh=7f0ce663473d (accessed 2020, March 13).

Veblen, T. (1899), The theory of the leisure class: An economic study in the evolution of institutions, Macmillan, New York.

Vejlgaard, H. (2008), Anatomy of a Trend, McGraw-Hill Professional, New York.

Wang, Y. and Zheng, B. (2014), “On macro and micro exploration of hashtag diffusion in Twitter”, 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014), pp. 285-288, IEEE, Beijing, China.

Wang, T., Brede, M., Ianni, A. and Mentzakis, E. (2018), “Detecting and Characterizing Eating-Disorder Communities on Social Media”, Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 91-100, ACM, New York.


Wasserman, S. and Faust, K. (1994), Social Network Analysis: Methods and Applications, Cambridge University Press, Cambridge.

Watts, A. (2019), “Auctions with different rates of patience: Evidence from the resale shoe market”, Managerial and Decision Economics, Vol. 40 No. 8, pp. 882-890.

Weiss, G. and Provost, F. (2003), “Learning when training data are costly: the effect of class distribution on tree induction”, Journal of Artificial Intelligence Research, Vol. 19, pp. 315–354.

Weng, J., Lim, E. P., Jiang, J. and He, Q. (2010), “Twitterrank: finding topic-sensitive influential twitterers”, Proceedings of the third ACM international conference on Web search and data mining, pp. 261-270.

Xiao, F., Noro, T. and Tokuda, T. (2014), “Finding News-Topic Oriented Influential Twitter Users Based on Topic Related Hashtag Community Detection”, Journal of Web Engineering, Vol. 13 No. 5 and 6, pp. 405-429.

Xing, W. and Ghorbani, A. (2004), “Weighted pagerank algorithm”, Proceedings of the Second Annual Conference on Communication Networks and Services Research, pp. 305-314, IEEE, Fredericton, NB, Canada.

Yang, L., Sun, T., Zhang, M. and Mei, Q. (2012), “We know what @you #tag: does the dual role affect hashtag adoption?”, Proceedings of the 21st international conference on World Wide Web, pp. 261-270, ACM, New York.

Zafarani, R., Abbasi, M.A. and Liu, H. (2014), Social media mining: an introduction, Cambridge University Press, Cambridge.

Zakharov, R. and Dupont, P. (2011), “Ensemble logistic regression for feature selection”, Proceedings of the seventh IAPR international conference Pattern Recognition in Bioinformatics, pp. 133-144, Springer, Berlin, Heidelberg.

Zhao, L. and Min, C. (2019), “The Rise of Fashion Informatics: A Case of Data-Mining-Based Social Network Analysis in Fashion”, Clothing and Textiles Research Journal, Vol. 37 No. 2, pp. 87-102.

Appendix

A.1 Abstract (translation of the German version)

Identification of Fashion Trendsetters in Online Social Networks

Companies in the fashion industry operate in a fiercely contested market and are therefore under high competitive pressure, which is intensified by the growing power of consumers over the emergence and diffusion of trends. Consumers' influence on the development of fashion trends arises from the increasing use of social media platforms, in particular online social networks. These platforms allow their users to share their own ideas and opinions with many other users and thereby influence the development of trends. Consequently, the content shared on these platforms is a valuable data source for fashion companies that supports the early detection of changing consumer needs and future trends. To benefit from this new data source, however, companies lack knowledge about which platform users provide trend-relevant content that contains information about future trends.

This thesis addresses this knowledge gap and investigates how trendsetters in online social networks can be identified automatically based on the data they leave on these platforms. It argues that trends are created and diffused by trendsetters, and that the content these trendsetters share in online social networks supports early trend detection.

To achieve this goal, the thesis develops a feature framework based on expert interviews and prior research in the fields of trend research and social media research. The framework enables the measurement of characteristics of trend-relevant roles using social media data. Subsequently, a two-stage solution approach is developed that first extracts a topic-relevant sample of users (community) from the large data pool of an online social network and, in a second step, identifies the online trendsetters in this sample by means of a supervised machine learning approach. To develop this solution, a data analysis is carried out on publicly available data from the online social network Instagram. The resulting methodology for identifying online trendsetters within a specific topic area thus consists of a topic-focused community detection approach and a classification model. Analyzing the features relevant to the model's class decision additionally yields insights into the characteristics of online trendsetters in online social networks. The evaluation of the developed approach demonstrates its transferability to other use cases and validates the trend prediction potential of the identified online trendsetters.

The research results contribute to both science and business practice. The insights gained into the behavioral patterns and characteristics of online trendsetters, as well as the features relevant for detecting them in online social networks, extend the knowledge about online trendsetters with respect to the fashion industry and thus contribute to trend research and to the new research field of fashion informatics. In addition, companies can use the findings to identify suitable marketing partners for influencing trends. Furthermore, the developed approach supports fashion companies by providing and analyzing a new data source for trend prediction and facilitates the early identification of changing consumer needs and preferences.

A.2 Features for the measurement of opinion leaders’ characteristics

Dimension | Data | Feature examples | References
Innovativeness | Interaction | Network-related (activity): no. of sharing other users’ content | Li et al. (2013), Cha et al. (2010)
Innovativeness | UGC | Context-related: adopting time rank | Li et al. (2013), Saez-Trumper et al. (2012)
Expertise | Profile | User-related: biography-based interest focus | Pal et al. (2016)
Expertise | Interaction | Network-related (activity): no. of retweets (shares) of the user’s content by others; no. of comments on the user’s postings; no. of likes a user receives | Rodríguez-Vidal et al. (2019), Rakoczy et al. (2018), Pal and Counts (2011), Cha et al. (2010), Agarwal et al. (2008)
Expertise | UGC | Content-related: no. of postings related to a specific topic/all postings; no. of comments related to a specific topic/all comments; no. of topic-specific words used in postings (domain vocabulary) | Rodríguez-Vidal et al. (2019), Chen et al. (2017), Li et al. (2013)
Influence | Connections | Network-related (social graph): no. of followers; in-degree; out-degree; centrality measures (e.g., betweenness, closeness, eigenvector); combined measures, e.g., weighted-degree-betweenness-closeness-centrality; PageRank and related measures (e.g., starrank, trustrank, influencerrank, twitterrank) | Rehman et al. (2020), Rakoczy et al. (2018), Chen et al. (2017), Rosenthal and McKeown (2017), Khan et al. (2015), Kayes et al. (2012), Lü et al. (2011), Cha et al. (2010), Weng et al. (2010), Song et al. (2007)
Influence | Interaction | Network-related (activity): no. of commenters/no. of followers; no. of likes/no. of followers; no. of comments/no. of likes; no. of comments/no. of postings; no. of distinct commenters/no. of postings; no. of posts with comments/no. of comments; no. of posts with likes/no. of likes. Context-related: response time | Segev et al. (2018), Rosenthal and McKeown (2017), Li et al. (2013)
Interconnectedness | Interaction | Network-related (activity), calculated on the retweet, mention, and reply networks: in-degree; out-degree; betweenness centrality; no. of being mentioned or tagged by others | Rehman et al. (2020), Chen et al. (2017)
Communicativeness | Interaction | Network-related (activity): no. of comments a user adds on other postings; no. of replies on comments of own posting | Chen et al. (2017), Li et al. (2013), Agarwal et al. (2008)
Communicativeness | UGC | Content-related: no. of postings; length of a posting | Agarwal et al. (2008)
Credibility | Connections | Network-related (social graph): no. of follows; no. of followers/no. of follows | Rodríguez-Vidal et al. (2019)
Credibility | Interaction | Network-related (activity): no. of mentions in posting; no. of being mentioned in postings; no. of unique users mentioned by the author; no. of unique users mentioning the author | Pal and Counts (2011)
Credibility | UGC | Content-related: no. of URLs in posting; no. of numbers in posting; no. of quotes in posting; no. of questions/total no. of sentences in posting; usage of specific domain-relevant words | Rosenthal and McKeown (2017)
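
Several of the interaction and connection features in this table are simple ratios of per-user counts (for example, no. of likes/no. of followers). The following Python sketch illustrates how such ratios could be computed. It is a minimal illustration under stated assumptions, not the implementation used in this thesis: the `UserStats` container, its field names, the output keys, and the convention of returning 0.0 for a zero denominator are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class UserStats:
    """Hypothetical per-user counts; the field names are illustrative only."""
    followers: int
    follows: int
    postings: int
    likes: int
    comments: int
    distinct_commenters: int


def safe_ratio(numerator: float, denominator: float) -> float:
    """Return numerator/denominator, or 0.0 if the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def interaction_ratios(u: UserStats) -> Dict[str, float]:
    """Compute a few ratio features of the kind listed in the table."""
    return {
        "likes_per_follower": safe_ratio(u.likes, u.followers),
        "comments_per_like": safe_ratio(u.comments, u.likes),
        "comments_per_posting": safe_ratio(u.comments, u.postings),
        "distinct_commenters_per_posting": safe_ratio(u.distinct_commenters, u.postings),
        "followers_per_follow": safe_ratio(u.followers, u.follows),
    }


# Example call with made-up counts.
example = UserStats(followers=1000, follows=250, postings=20,
                    likes=500, comments=100, distinct_commenters=40)
ratios = interaction_ratios(example)
```

Normalizing by followers or postings makes accounts of different sizes comparable, which is presumably why the cited studies report ratios rather than raw counts.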

A.3 Interview guide

Guide for the expert interview

Topic: Identification of influencers in the fashion industry on Instagram: A qualitative study of relevant characteristics

Opening questions

• How do you define influencers?
• Why is the identification of suitable influencers so important?
• How do you recognize the influence that an influencer brings?

Identification process

• How does the process of identifying influencers work?
• What role do you play in this process?
• How do you ensure that the influencer fits the company, the brand, or the campaign?

Social network analysis

• What role does the profile's network play in the identification?
• Which characteristics are considered?
• Which methods or tools are used to examine the influencer's network?

Content analysis

• What role do the influencer's images/videos play in the identification?
• How do you find out whether the images or videos fit the company's image?
• What role does the textual material play, i.e., the caption and the hashtags?
• Is attention paid to the wording? If so, in what way? Which software is used for this, or is it examined manually?
• Are the hashtags used examined? If so, how?
• Are the time and location of the postings relevant?

Metrics

• Do you use metrics to assess an influencer's potential?
• If so, which metrics are used?
• Is there some form of ranking of potential influencers?
• If so, on what basis are they ranked?

Data on the influencer

• What role do demographic characteristics play, such as the influencer's age, gender, and place of residence?
• Are there other characteristics of the influencer that are of interest to you?

Automation

• Do you use programs/algorithms to identify influencers?
• To what extent would you say the identification is manual/automated?

Closing questions

• In your view, are there any topics regarding the identification of influencers that remain open and that you consider relevant?

A.4 List of features for the identification of online trendsetters

Feature | Description | Data source (type) | Feature selection: Pearson correlation

User-related
Biography length | No. of characters used in the biography without emojis | Profile data (text) | x
Biography mention count | No. of all mentions used in the biography | Profile data (text) | x
Ratio emojis bio | No. of emojis relative to the no. of characters used in the biography | Profile data (text) | x
Ratio distinct emojis bio | No. of distinct emojis relative to the no. of characters used in the biography | Profile data (text) |
No. of emojis bio | No. of emojis used in the biography | Profile data (text) |
No. of distinct emojis bio | No. of distinct emojis used in the biography | Profile data (text) | x
Account type | Type of user account, e.g., professional account | Profile data (binary) | x
External URL | Presence of an external URL in the biography, e.g., to an external blog, website | Profile data (binary) | x
Verified account | Indicates that the account is validated by the respective social media platform | Profile data (binary) | x
Biography hashtag count | No. of hashtags used in the biography | Profile data (text) |
Ratio hashtag bio | No. of hashtags relative to the no. of characters used in the biography | Profile data (text) | x
Username length | No. of characters of the username | Profile data (text) | x
Biography score | Indicates if the account refers to one of the following social media platforms: YouTube, Facebook, Twitter, Pinterest | Profile data (text) | x

XXXV Appendix

Feature Description Data source Feature selection: (type) Pearson correlation No. of nouns relative to the no. of Avg. no. of nouns UGC (text) x words used in all posts No. of adjectives relative to the no. Avg. no. of adjectives UGC (text) of words used in all posts No. of verbs relative to the no. of Avg. no. of verbs UGC (text) x words used in all posts No. of emojis relative to the no. of Avg. no. of emojis UGC (text) x posts

Avg. no. of distinct No. of distinct emojis relative to the UGC (text) emojis no. of posts No. of emojis relative to the no. of Ratio emojis/words words used in all posts without UGC (text) x hashtags No. of posts No. of posts UGC (text) x No. of images No. of posted images UGC (text)

No. of posted images relative to Avg. no. of images the no. of posts UGC (text) x

related No. of videos No. of posted videos UGC (text) x - No. of posted videos relative to the Avg. no. of videos no. of posts UGC (text) x

Contemt Contemt No. of questions relative to the Avg. no. of questions UGC (text) x no. of posts

Ratio no. of images/ No. of images relative to the UGC (count) x no. of videos no. of posts Avg. no. of characters used in Avg. comment length UGC (text) x received comments No. of distinct hashtags relative to Ratio distinct the no. of hashtags which the user UGC (text) x hashtags/hashtags has posted No. of hashtags relative to the no. of UGC (text) x Avg. no. of hashtags posts No. of distinct hashtags relative to Avg. no. of distinct UGC (text) x hashtags the no. of posts No. of distinct hashtags used in all UGC (text) x No. of distinct hashtags posts of a user No. of hashtags used in all posts of a UGC (text) x No. of hashtags user Avg. no. of characters used per UGC (text) x Avg. hashtag length hashtag

XXXVI Appendix

Feature Description Data source Feature selection: (type) Pearson correlation Avg. post length Avg. no. of characters used per post UGC (text) x

Avg. no. of words used per post Avg. no. of words with UGC (text) hashtags including hashtags Avg. no. of words used per post Avg. no. of words UGC (text) x without hashtags without hashtags No. of hashtags relative to the no. of UGC (text) x Ratio hashtags/words words which the user has posted

No. of posts relative to the no. of Avg. posts per day days between first and last post of UGC (time) x the user SD time between SD of all values of time between two UGC (time) posts posts (resp. and the previous post)

Avg. time between Avg. time between two posts UGC (time) x posts (resp. and previous post) related - Avg. time to Avg. time between a post and each of UGC (time) x comment the respective comments

Context Minimum time to Time between post and first comment UGC (time) x comment Evolution no. of Development of the no. of received UGC (time) x comments comments over time Evolution no. of likes Development of the no. of received UGC (time) x likes over time Follow – Centrality measures: describes the Connection x out-degree/in-degree/ degree in which other users are (count) PageRank/ connected via the follow-relation of betweenness/ the user closeness/

eigenvector centrality No. of follows No. of users which the user follows Connection x related

- (count) No. of followers No. of users which the user is Connection x followed by (count) Network Ratio No. of follows in relation to no. of Connection x follows-followers follower (count) No. of comments No. of comments a user has Interaction commented on other posts of the (count) x community

XXXVII Appendix

Feature Description Data source Feature selection: (type) Pearson correlation No. of comments No. of comments a user has received Interaction x received (count) Avg. no. of comments No. of comments a user has received Interaction x received relative to no. of posts (count) Ratio distinct comment No. of distinct comment owners Interaction x owner relative to no. of comments per post (count) SD ratio distinct SD no. of distinct comment owners Interaction x comment owner relative to no. of comments per post (count) Ratio distinct comment No. of distinct comment owners Interaction owners without post without post owner relative to no. of (count) x owner comments per post Avg. no. of comments No. of comments received without Interaction without post owner comments of post owner relative to (count) x no. of posts Avg. no. of distinct No. of distinct comment owners Interaction comment owners per without post owner relative to no. of (count) x post posts SD distinct comment SD no. of distinct comment owners Interaction owner without post without post owner per post relative (count) x owner to no. of comments per post SD no. of comments SD no. of comments received Interaction x per post (count) SD no. of distinct SD no. of distinct comment owners perInteraction comment owners post (count) SD no. of distinct SD no. of distinct comment owners Interaction comment owners (without post owner) per post (count) without post owner SD no. of comments SD no. of comments received Interaction received without post without comments of post owner (count) owner relative to no. of posts Avg. no. of replies on No. of replies on comments received Interaction x comments relative to no. of comments (count) SD no. of replies on SD no. of replies on comments Interaction comments received (count) No. of mention posts No. of posts containing a Interaction x mention (@) (count)

XXXVIII Appendix

Feature Description Data source Feature selection: (type) Pearson correlation No. of mentions in No. of mentions (@) a user used in Interaction x comments her/his comments (count) No. of likes No. of likes received Interaction x (count) Avg. no. of likes No. of likes received relative to Interaction x no. of posts (count) SD no. of likes SD no. of likes received Interaction x (count) Avg. no. of interactions Sum of no. of likes and no. of Interaction per follower comments relative to the no. of (count) x followers No. of mentions No. of times the user is mentioned by Interaction in posts others within the community (count) Avg. no. of users No. of users mentioned by the user Interaction x mentioned relative to no. of posts (count) Avg. no. of distinct No. of distinct users mentioned by Interaction x users mentioned the user per post (count) No. of tags in posts No. of times the user is tagged by Interaction x others within the community (count) No. of distinct used No. of used distinct tags relative to Interaction x tags in posts no. of posts (count) Ratio tags/posts No. of used tags relative to no. of postsInteraction x (count) Avg. no. of No. of tagged users relative to Interaction x users tagged no. of posts (count) No. of tagged users No. of used tags by the user in Interaction x all posts (count) No. of distinct tagged No. of distinct tagged users by Interaction x users the user per post (count) Comments/mentions in Centrality measures: describes the Interaction 1), 2), 3), 4) posts/mentions in degree to which the user is connected (count) comments/media tag – with other users of the community via out-degree/in-degree/ the interaction activities PageRank/ betweenness/ closeness/eigenvector centrality

XXXIX Appendix

Feature Description Data source Feature selection: (type) Pearson correlation No. of distinct comment owners per Avg. comment-owner Interaction post relative to the no. of comments ratio (count) per post SD ratio comments SD of ratios no. of comments without Interaction without post post owner/no. (count) owner/posts of posts Ratio mentions in No. of posts with a mention Interaction x posts/posts relative to the no. of posts (count) Ratio commented No. of posts with a comment relative Interaction x posts/comments to the no. of comments received (count) No. of commented Interaction No. of posts with comments posts (count) No. of comments received relative to Interaction Ratio comments/likes x no. of likes received (count)

1) comments - betweenness, PageRank, closeness, eigenvector 2) mentions in posts – out-degree, PageRank, closeness, eigenvector 3) mentions in comments – out-degree, PageRank, closeness, eigenvector 4) tags – out-degree, betweenness, PageRank, closeness, eigenvector
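To make the feature definitions in A.4 more concrete, the following minimal sketch computes a handful of the content-related and network-related features for a single user. It is an illustration only, not the extraction pipeline used in this study: the sample captions, the follow counts, the function names, the hashtag pattern, and the choice to count question marks as questions and to exclude the leading "#" from the hashtag length are all assumptions.

```python
import re
from statistics import mean

# Hypothetical input: captions of one user's posts (not data from the study).
posts = [
    "Loving this new look! #ootd #fashion",
    "Which colour do you prefer? #ootd",
    "Sunday stroll",
]

# Assumed hashtag pattern: "#" followed by word characters.
HASHTAG = re.compile(r"#\w+")


def hashtag_features(captions):
    """Compute a few content-related features from A.4 for one user."""
    tags = [t.lower() for c in captions for t in HASHTAG.findall(c)]
    # Words "without hashtags": whitespace tokens that do not start with "#".
    words = [w for c in captions for w in c.split() if not w.startswith("#")]
    return {
        "no_of_hashtags": len(tags),
        "no_of_distinct_hashtags": len(set(tags)),
        # Avg. no. of hashtags: hashtags relative to the no. of posts.
        "avg_no_of_hashtags": len(tags) / len(captions),
        # Ratio distinct hashtags/hashtags.
        "ratio_distinct_hashtags": len(set(tags)) / len(tags) if tags else 0.0,
        # Avg. hashtag length, counted without the leading "#" (assumption).
        "avg_hashtag_length": mean(len(t) - 1 for t in tags) if tags else 0.0,
        # Ratio hashtags/words.
        "ratio_hashtags_words": len(tags) / len(words) if words else 0.0,
        # Avg. no. of questions: question marks per post (assumption).
        "avg_no_of_questions": sum(c.count("?") for c in captions) / len(captions),
    }


def ratio_follows_followers(n_follows, n_followers):
    """Ratio follows-followers: no. of follows relative to no. of followers."""
    return n_follows / n_followers if n_followers else 0.0


features = hashtag_features(posts)
features["ratio_follows_followers"] = ratio_follows_followers(150, 600)
```

Ratio features are guarded against empty denominators here so that users without hashtags or followers still receive a value; how such cases were handled in the study is not stated in this appendix.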

A.5 List of hyperparameters’ search space and final model settings

| Algorithm | Hyperparameters | Search space | Setting with best F1-Score |
|---|---|---|---|
| Logistic Regression | penalty | ["l1", "l2", "elasticnet", None] | l1 |
| Logistic Regression | C | [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 150, 200] | 1 |
| Logistic Regression | solver | ["liblinear", "newton-cg", "sag", "saga", "lbfgs"] | liblinear |
| Logistic Regression | max_iter | [1, 5, 10, 50, 100, 200, 500] | 10 |
| Naïve Bayes | alpha | [0, 0.1, 0.3, 0.5, 0.7, 1.0, 1.5, 2.0] | 0.3 |
| SVM | Nu (corresponds to C) | [0.001, 0.01, 0.1, 1, 10, 100, 1000] | 0.01 |
| SVM | gamma | [0.01, 0.1, 1, 10, 100] | 0.1 |
| Decision Trees | criterion | ["gini", "entropy"] | entropy |
| Decision Trees | max_depth | [None, 2, 5, 8, 15, 25, 30, 50, 75, 100] | 8 |
| Decision Trees | max_features | [None, "auto", "log2"] | None |
| Decision Trees | max_leaf_nodes | [5, 10, 30, 100, 200, 500, None] | 200 |
| Decision Trees | min_samples_leaf | [1, 2, 5, 10] | 1 |
| Decision Trees | min_samples_split | [2, 5, 10, 15, 100] | 15 |
| Decision Trees | splitter | ["best", "random"] | random |
| Random Forest | n_estimators | [10, 50, 100, 500, 1000] | 100 |
| Random Forest | criterion | ["gini", "entropy"] | gini |
| Random Forest | max_depth | [None, 2, 5, 8, 15, 25, 30, 50, 75, 100] | 30 |
| Random Forest | min_samples_split | [2, 5, 10, 15, 100] | 5 |
| Random Forest | min_samples_leaf | [1, 2, 5, 10] | 1 |
| Random Forest | max_leaf_nodes | [5, 10, 30, 100, 200, 500, None] | None |
| AdaBoost | algorithm | ["SAMME", "SAMME.R"] | SAMME.R |
| AdaBoost | base_estimator | [Random_Forest, Max_Ent, None] | Random Forest |
| AdaBoost | learning_rate | [0.001, 0.01, 0.1, 0.5, 1, 1.5] | 0.5 |
| AdaBoost | n_estimators | [10, 50, 100, 500, 1000] | 1000 |
| Gradient Boosting | loss | ["deviance", "exponential"] | exponential |
| Gradient Boosting | learning_rate | [0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.5] | 0.2 |
| Gradient Boosting | n_estimators | [1, 4, 10, 20, 35, 50, 75, 100, 200, 500] | 100 |
| Gradient Boosting | min_samples_split | [0.1, 0.5, 1, 1.5, 2, 5, 10, 15, 20, 25, 30, 35, 50] | 25 |
| Gradient Boosting | min_samples_leaf | [1, 2, 5, 8, 10] | 2 |
| Gradient Boosting | max_depth | [3, 5, 6, 8, 10] | 8 |
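The size of such a search space can be made tangible with a short sketch. The example below writes the decision tree search space from A.5 as a dictionary, enumerates the full Cartesian grid with the standard library, and checks that the reported final setting lies inside the grid. The helper names (`grid_size`, `iter_grid`, `in_search_space`) are hypothetical, and whether the study used an exhaustive grid search or another strategy is not stated in this appendix; a real run would score each candidate with a cross-validated F1-score instead of merely enumerating it.

```python
from itertools import product

# Decision tree search space as listed in A.5.
decision_tree_space = {
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 2, 5, 8, 15, 25, 30, 50, 75, 100],
    "max_features": [None, "auto", "log2"],
    "max_leaf_nodes": [5, 10, 30, 100, 200, 500, None],
    "min_samples_leaf": [1, 2, 5, 10],
    "min_samples_split": [2, 5, 10, 15, 100],
    "splitter": ["best", "random"],
}

# Final setting reported in A.5 for the decision tree.
reported_best = {
    "criterion": "entropy",
    "max_depth": 8,
    "max_features": None,
    "max_leaf_nodes": 200,
    "min_samples_leaf": 1,
    "min_samples_split": 15,
    "splitter": "random",
}


def grid_size(space):
    """Number of candidate settings in an exhaustive grid."""
    size = 1
    for values in space.values():
        size *= len(values)
    return size


def iter_grid(space):
    """Yield every combination of hyperparameter values as a dict."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def in_search_space(setting, space):
    """Check that each value of a setting is one of the searched values."""
    return all(setting[k] in space[k] for k in space)


n_candidates = grid_size(decision_tree_space)  # 16800 combinations
assert in_search_space(reported_best, decision_tree_space)
```

With 16,800 candidates for the decision tree alone, every additional value in one list multiplies the number of model fits, which is why the coarser grids of the ensemble methods keep the search tractable.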

A.6 SHAP dependence plots

[Figure: SHAP dependence plots, one panel per feature. In each panel, the x-axis shows the normalized feature value and the y-axis shows the SHAP value for that feature.]

Network-related features: No. of distinct used tags; No. of tags; No. of comments; Comment mention closeness; No. of followers; Ratio follows-followers; Post mention PageRank; Follow out-degree; Comment mention PageRank; Tag closeness

Content-related features: No. of hashtags; No. of distinct hashtags; Avg. no. of videos; Avg. post length; No. of questions; Ratio emojis/words in posts

User-related features: Ratio emojis bio; Ratio hashtags bio

Context-related features: Evolution no. of comments; Avg. time to comment
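What a single panel of such a dependence plot contains can be sketched on a toy example. The code below does not use the SHAP library and does not reproduce the models of this study; it computes exact Shapley values by enumerating all feature subsets for a hypothetical two-feature model, replacing features outside a subset by values from a background sample. The function `model`, the sample `users`, and all other names are assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def model(z):
    """Hypothetical stand-in for a classifier output with an interaction term."""
    return 2.0 * z[0] + z[0] * z[1]


def shapley_values(f, x, background):
    """Exact Shapley values of f at x, averaging over a background sample."""
    n = len(x)

    def value(fixed):
        # Expected output when features in `fixed` take their values from x
        # and all remaining features are drawn from the background sample.
        total = 0.0
        for b in background:
            z = [x[j] if j in fixed else b[j] for j in range(n)]
            total += f(z)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(len(others) + 1):
            # Shapley weight for subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                fixed = set(subset)
                contribution += weight * (value(fixed | {i}) - value(fixed))
        phi.append(contribution)
    return phi


# Hypothetical normalized feature values of four users (two features each).
users = [[0.0, 0.0], [0.5, 1.0], [1.0, 0.0], [1.0, 1.0]]

# One dependence-plot panel for feature 0: (feature value, Shapley value) pairs.
dependence_points = [(u[0], shapley_values(model, u, users)[0]) for u in users]
```

The last two users share the same value of the first feature but receive different Shapley values for it because they differ in the second feature. This vertical spread at a fixed x-value is how interaction effects become visible in a dependence plot.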
