
CS320: Value of Data & AI Winter 2020 Session 5 Data and Algorithms for Personalization Lecturer(s): James Zou Scribes: Qiwen Wang

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

5.1 Introduction

Personalization algorithms influence what people see and choose on the Internet. Personalization tailors a product to an individual's needs and interests. However, the data and algorithms used for personalization raise concerns about privacy, ethics, and policy.

5.1.1 Netflix’s Artwork Personalization

Netflix is known for its high-quality media content and personalized search results and viewing recommendations. Personalization allows each Netflix member to see a different view of the catalog that adapts to their interests and can help expand those interests over time. Beyond personalizing which media to recommend, Netflix personalizes the whole viewing experience for each title. Netflix differs from traditional media offerings in that they "don't have one product but over 100 million different products with one for each of our members with personalized recommendations and personalized visuals." [1] Each movie and TV series can be personalized in terms of its artwork and visual elements. The core algorithmic challenge for the Netflix team is to use viewer data to customize not only the viewing options but also the visual presentation.

There has been significant controversy over Netflix showing different artwork for the same movie or TV series to targeted viewers. Some viewers believe Netflix is targeting them by race through this artwork. For the 2018 Netflix comedy film "Like Father," people noticed that African American viewers in particular were shown posters featuring black characters, even though the film predominantly starred a white cast. Such a poster is misleading: the content is not what viewers expect based on the poster, and the poster is not an accurate representation of the film. Is Netflix targeting users by race? Netflix defended itself by claiming it did not use race, gender, or ethnicity information to personalize a user's individual Netflix experience; the only information used was viewing history. However, the controversy remains. People argue that it is Netflix's responsibility to ensure film posters do not misrepresent the content.

Moreover, even though Netflix doesn't explicitly use sensitive attributes in its model, proxy features can lead to bias against certain groups: these features can implicitly encode an individual's race, gender, and ethnicity. More fundamentally, is Netflix providing "too much" personalization?


5.1.2 Challenges of personalization

• Compared with systems of the past, today's systems have a huge number of options to personalize. They can customize every single element, e.g., artwork, font size, image position, etc. So many options pose a challenge from the algorithmic perspective, and from the privacy, ethics, and security perspective.

• Companies need to run a large number of experiments. The challenge lies in the methodologies companies use to perform these tests efficiently. In particular, we will discuss the contextual bandit algorithm in Section 5.5.

• Although every element can be personalized nowadays, too much personalization can result in recommendation bias and a narrower perspective. Companies need to evaluate whether the personalization is beneficial, or whether it is too much for the users.

5.2 How do companies experiment?

The Washington Post runs live tests on its website to identify content that users respond to positively, in order to give users a better reading experience. To identify the best headline for a news story, The Washington Post generates different versions of the headline for the same story and tests which headline yields the highest click-through rate (CTR) [2]. For example, The Washington Post tested the following headlines for a story on why organization expert Marie Kondo's tips don't work for parents:

• Why Marie Kondo's life-changing magic doesn't work for parents (CTR: 3.3%)

• The real reasons Marie Kondo's life-changing magic doesn't work for parents (CTR: 3.9%)

• The real reasons Marie Kondo's life-changing magic doesn't work for parents (with the thumbnail of the author) (CTR: 4.8%)

The third headline seems more promising because the phrase “The real reasons” evokes more curiosity and is more declarative, and the thumbnail is more related to the content.

5.2.1 How to compute CTR?

To compute the click-through rate, The Washington Post

1. Picks a subset of the users, e.g., 1% of the subscribers, as the sample,

2. Randomly partitions the sample into three sets, each corresponding to one of the options,

3. Computes the CTR for each set.

The headline that gets the highest CTR is selected as the headline for the news story.
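The three steps above can be sketched in code. This is a minimal illustration, not The Washington Post's actual system (their tool is Bandito [2]); the function and parameter names are invented for this sketch, and `click_fn` stands in for observing real user clicks.

```python
import random

def assign_and_measure(user_ids, options, click_fn, sample_frac=0.01, seed=0):
    """Pick a small sample of users, split them randomly across the
    headline options, and compute the click-through rate of each."""
    rng = random.Random(seed)
    sample = rng.sample(user_ids, int(len(user_ids) * sample_frac))
    clicks = {o: 0 for o in options}   # observed clicks per option
    shown = {o: 0 for o in options}    # impressions per option
    for u in sample:
        o = rng.choice(options)        # random assignment to one option
        shown[o] += 1
        clicks[o] += click_fn(u, o)    # 1 if this user clicked, else 0
    return {o: clicks[o] / shown[o] for o in options if shown[o] > 0}
```

The winning headline is then simply the key with the maximum CTR in the returned dictionary.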

5.2.2 Bing Search

Many companies perform similar experiments. Take Bing ads as another example. The test compares two advertisements for the same product shown on the Bing search engine: one shows an additional set of links at the bottom, and the other doesn't. The result shows that the advertisement with the additional links performs better. This tiny change in the ads generated over 100 million dollars in revenue for the company. Such a methodology is often referred to as A/B testing, in which version A and version B, identical except for one variation, are compared in a randomized experiment. In the above example, the variation is whether the ad contains the additional links.
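Before acting on an A/B result like the ones above, practitioners typically check that the CTR difference is not just noise. The lecture does not cover the statistics, but a standard tool is the two-proportion z-test; a minimal sketch using only the standard library:

```python
from math import sqrt, erf

def ab_z_test(clicks_a, n_a, clicks_b, n_b):
    """Two-proportion z-test: is the CTR difference between version A
    and version B larger than chance alone would explain?"""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)  # pooled CTR under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value
```

For example, 330 clicks out of 10,000 impressions (3.3%) versus 480 out of 10,000 (4.8%), roughly the Marie Kondo headline CTRs, gives a z-score above 5, so the difference would be judged real rather than noise.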

5.2.3 Run Experiments in Parallel

The number of experiments companies run per day grows quadratically. With a limited number of users, the overhead of testing components sequentially is enormous. How do companies handle this massive scale of experiments per day? In fact, hundreds of trials run in parallel across the different interfaces each individual sees. At companies like Microsoft (Bing), product teams run experiments simultaneously on the same interface to test various components, e.g., background color, font, new features, etc., in a decentralized way. Product teams can specify the slice of the population on which to run an experiment and the length of the experiment.

5.2.3.1 Challenges in Large-scale Experimentation

• Combining the best individual components from each test doesn't imply a better overall user experience. The marginal changes may not be additive.

• The output metrics are hard to attribute to individual changes. LinkedIn built its own platform (XLNT) to address this and help make data-driven A/B testing decisions; it runs over 400 experiments per day in parallel. With multiple experiments running simultaneously on a page, locating the source of an impact is difficult. To solve this problem, an experiment runner can subscribe to metrics on the platform and get a list of the experiments that are impacting those metrics. The list of experiments is selected based on the experiment population, the effect size, and the metric's intrinsic volatility.

5.2.4 Personalized Pricing

Personalization is now so widely used that even for the same product, companies charge users different prices based on what they can afford. Companies can track which websites you've visited, which items you've purchased, your location, and which device you're using, in order to offer different prices for the same product.

Take, for example, a search for a hotel room in Panama City on Travelocity. The Westin Playa Bonita hotel was listed at $200 using a regular browser, but only $181 in an "incognito" browser that hides the user's cookies. Both searches were done for the same date. What consumers call "price discrimination" has become a common practice in the travel industry.

Uber uses machine learning to find the price trade-off curve for each individual so that it can offer different prices for different rides. Daniel Graf, Uber's head of product, said the company applies machine-learning techniques to estimate how much groups of customers are willing to shell out for a ride. Uber calculates riders' propensity for paying a higher price for a particular route at a certain time of day. For instance, someone traveling from a wealthy neighborhood to another tony spot might be asked to pay more than a person heading to a poorer part of town, even if demand, traffic, and distance are the same.

People are concerned about Uber's personalized pricing. Although personalized pricing is a legitimate business model from the marketing perspective, it is an unfair advantage that some people are charged more. There are no clear societal standards on price discrimination, and the only way for new services like ride-sharing to find the standard is to experiment. However, the effects such experiments capture are narrow: the results can only show whether a user is willing to keep using the service after a price change, not the long-term impact on neighborhoods, commute patterns, and demographics. These experiments also often lack transparency. Uber obtains permission to experiment when users register an account and accept the policy terms. Nevertheless, Uber doesn't inform users what exactly they are being tested on, nor is that realistic in the case of pricing. From the driver's perspective, if Uber drivers see higher prices in wealthier communities, it becomes harder for people in more impoverished communities to access rides. With every product being personalized nowadays, it is hard to disentangle personalization from discrimination.

5.3 Emotion Experiment

5.3.1 Background

Emotional contagion happens when people transfer positive or negative moods and emotions to others they interact with. Further, the social comparison effect is that exposure to the happiness of others may actually be depressing to us. Both effects are well established in social network theory. In an emotion experiment [3], Facebook tested whether emotional contagion occurs on social media by adjusting the amount of emotional content in users' "News Feed." News Feed is the Facebook product in which people express their emotions in posts and see their friends' posts. Because a person's friends may produce far more content than that person can view, News Feed filters posts with a randomized algorithm. Facebook tested two hypotheses:

• The “social comparison” hypothesis: exposure to happiness of others on Facebook depresses users by making them feel bad about their own lives. These users tend to post more negative words.

• The “emotional contagion” hypothesis: interacting with happy people makes users feel happy. Users would post more positive words.

5.3.2 Experiment

1. Facebook and researchers experimentally modified News Feed for one week in 2012. Participants were randomly selected; around 155,000 participants were in each experimental group, with a total of around 700,000 participants in the study.

2. For around 155k participants, each post containing positive words (e.g., happy, excited) had between a 10% and 90% chance of being omitted from their feed.

3. For around 155k participants, each post containing negative words (e.g., mad, sad, depressed, angry) had between a 10% and 90% chance of being omitted from their feed.

4. Measure the number of positive/negative words subsequently used by each participant.

Under the emotional contagion hypothesis, people in the positivity-reduced condition should be less positive than their control group, and people in the negativity-reduced condition should be less negative; under the social comparison hypothesis, the opposite emotion should move instead.
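The outcome metric in step 4 above can be made concrete with a small sketch. The actual study classified words with the LIWC lexicon; the tiny word lists and function below are stand-ins for illustration only.

```python
# Toy word lists standing in for the LIWC lexicon used in the actual study.
POSITIVE = {"happy", "excited", "great", "love"}
NEGATIVE = {"sad", "mad", "angry", "depressed"}

def emotion_word_rates(posts):
    """Fraction of all words across a user's posts that are positive
    and negative -- the outcome metric of the experiment."""
    words = [w.strip(".,!?").lower() for post in posts for w in post.split()]
    if not words:
        return 0.0, 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos / len(words), neg / len(words)
```

Comparing these rates between the experimental and control arms is what produces the percentage changes discussed in the results below.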

5.3.3 Result

Figure 1 shows the result of the experiment. The control arms are participants whose feeds were not reduced and who viewed their feeds as usual; the experimental arms are participants who had negative or positive posts removed. The y-axis measures the percentage of positive/negative words in participants' posts during the experiment. The result supports emotional contagion: for participants whose negative posts were filtered out, the change in the amount of negative words is larger than the change in positive words, and vice versa. Furthermore, the researchers also observed that

• Omitting emotional content reduced the number of words the person subsequently produced. This held whether positive or negative posts were removed.

• The effect was stronger when positive words were omitted.

• People who were exposed to fewer positive/negative posts were less expressive overall.

5.3.4 Controversy

The study faced huge controversy for exploring whether Facebook could manipulate people's moods by tweaking their news feeds to favor negative or positive content, by running direct experiments on users.

Facebook showed in this study that it has the power to filter posts; does that imply that in the future Facebook may manipulate content for every user? In addition, using the percentage of emotional words as the metric may not reflect participants' actual emotions. People may post positive words in response to another positive post, but that may not reflect the user's actual mood.

5.4 Ethics of Experiments

There are three ethics rules to consider when thinking about the ethics of such applications. Stanford's human subjects research review (IRB) also examines these three components. We take the Facebook emotion experiment as an example and examine it against the following three rules:

• Subjects give voluntary and informed consent, when feasible. It's not feasible for Facebook to collect voluntary consent every time it runs an experiment; consent happens only once, during registration, when the user agrees to their data being used.

• Risks to subjects are reasonable compared to the benefits to subjects and society. Running the experiment is beneficial: given its large user base, Facebook arguably has a moral responsibility to understand the effects of emotional posts. The risk of filtering posts is statistically small; the change in the number of positive and negative words is less than 0.1%.

• The selection of subjects is equitable: do not pick on a vulnerable population. Facebook ran the experiment on 700k random users, which is a good representation of the general population.

Therefore, judged by the above three ethics criteria, the Facebook emotion experiment sounds legitimate.

5.5 Personalization Algorithm

5.5.1 A/B Testing

A/B testing is the process of comparing two or more versions of a web page, email, or other marketing asset with just one varying element. For instance, an A/B test for a headline would compare two versions of the same page with only the headline changed.

Let k be the number of options, such as the different headline options; k is potentially large. Let N be the number of individuals in the experiment. Each individual is randomly assigned to one of the options. For each option, we define an estimate to decide which option to use. One popular estimate is the click-through rate (CTR), giving CTR_1, CTR_2, ..., CTR_k for the k options.

One disadvantage of this method is that if the number of options k is large, the number of individuals N needs to be large to get roughly the same number of individuals for each option. If k is 100 million and each option needs 1000 individuals, then the whole experiment needs 100 billion individuals, which is unrealistic even for big companies like Facebook.

5.5.2 Multi-armed bandit

A multi-armed bandit adaptively allocates the limited resource N across the k options to maximize the expected gain. One motivation for the multi-armed bandit is that even with a small number of trials, we can observe some properties of an option: whether it doesn't work at all, or works as expected. The multi-armed bandit sends only a small number of individuals to each option in the first round; based on the performance in that round, the algorithm adaptively tests the promising options in subsequent rounds. In general, the multi-armed bandit simultaneously attempts to acquire new knowledge and optimize its decisions based on existing knowledge.

Mathematically, a multi-armed bandit chooses the option k that maximizes the CTR plus an additional factor involving the total number of rounds played so far, T, and the number of rounds played with option k, n_k:

Algorithm 1: UCB1 (Multi-armed bandit) [4]
Result: the optimal option k
Initialization: run one round with each option;
while the choice of option k has not converged do
    k = arg max_k ( CTR_k + sqrt(2 log T / n_k) );
    run a round of the experiment with option k;
end

The maximized expression gives an upper estimate of the performance of each option. The sqrt(2 log T / n_k) term balances exploration, i.e., gathering more information, against exploitation, i.e., making the best decision with the information already gathered.
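Algorithm 1 can be written out directly. The sketch below is a straightforward implementation of UCB1 with a fixed budget of rounds in place of a convergence check; `pull(k)` is a stand-in for running one round of the experiment with option k and observing a click (1) or no click (0).

```python
import math
import random

def ucb1(options, pull, total_rounds, seed=0):
    """UCB1 multi-armed bandit: play each option once, then repeatedly
    play the option with the highest upper confidence bound
    CTR_k + sqrt(2 * log(T) / n_k)."""
    random.seed(seed)
    n = {k: 0 for k in options}       # rounds played with option k
    clicks = {k: 0 for k in options}  # clicks observed for option k
    for k in options:                 # initialization: one round per option
        clicks[k] += pull(k)
        n[k] += 1
    for t in range(len(options), total_rounds):
        ucb = {k: clicks[k] / n[k] + math.sqrt(2 * math.log(t) / n[k])
               for k in options}
        k = max(ucb, key=ucb.get)     # option with the best upper bound
        clicks[k] += pull(k)
        n[k] += 1
    return max(options, key=lambda k: clicks[k] / n[k])
```

Because the bonus term shrinks as n_k grows, poorly performing options are quickly starved of rounds, which is exactly why the bandit needs far fewer individuals than a uniform A/B split.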

5.5.3 Personal Contextual Bandit

The personal contextual bandit is the algorithm used by The Washington Post and Facebook. In this algorithm [5], both the user and the component are featurized. For example, every user has a set of features such as age, income, and location, and every headline option of The Washington Post has a set of features such as color, font, and content. Let the featurized user vector be x_u and the featurized component option vector be x_o. Then the model of the CTR is

CTR = [x_u; x_o] · θ,

where [x_u; x_o] is the concatenation of the two feature vectors and θ is an unknown coefficient vector over the features. This is now a regression problem for θ: we extract both vectors from the user data and the component, and observe the CTR from the experiment. A multi-armed bandit (LinUCB) is run on top of this regression. The advantage of the contextual bandit is that it can find the best option k with a much smaller number of individuals N compared to A/B testing.
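The regression step of this model can be sketched as follows. This is only the estimation of θ by ridge regression from logged data, with invented function names; full LinUCB [5] additionally maintains a per-option confidence bound on the predicted CTR, which this sketch omits.

```python
import numpy as np

def fit_theta(X_users, X_options, ctrs, lam=1.0):
    """Ridge-regression estimate of theta in CTR = [x_u; x_o] . theta,
    from logged rows of (user features, option features, observed CTR)."""
    X = np.hstack([X_users, X_options])  # concatenate user/option features
    d = X.shape[1]
    # closed-form ridge solution: (X^T X + lam I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ ctrs)

def predict_ctr(theta, x_u, x_o):
    """Predicted CTR for showing option x_o to user x_u."""
    return float(np.concatenate([x_u, x_o]) @ theta)
```

Because θ is shared across all options, data from every (user, option) pair improves the estimate, which is what lets the contextual bandit get away with far fewer individuals than one-CTR-per-option A/B testing.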

References

[1] Netflix Technology Blog. "Artwork Personalization at Netflix." 2017. https://netflixtechblog.com/artwork-personalization-c589f074ad76

[2] Nikhil Muralidhar. "Bandito, a Multi-Armed Bandit Tool for Content Testing." 2018. https://developer.washingtonpost.com/pb/blog/post/2016/02/08/bandito-a-multi-armed-bandit-tool-for-content-testing/

[3] Kramer, Adam D. I., Jamie E. Guillory, and Jeffrey T. Hancock. "Experimental evidence of massive-scale emotional contagion through social networks." Proceedings of the National Academy of Sciences 111.24 (2014): 8788-8790.

[4] Auer, Peter, Nicolo Cesa-Bianchi, and Paul Fischer. "Finite-time analysis of the multiarmed bandit problem." Machine Learning 47.2-3 (2002): 235-256.

[5] Li, Lihong, et al. "A contextual-bandit approach to personalized news article recommendation." Proceedings of the 19th International Conference on World Wide Web. 2010.