
The Test of Time

Two Decades of Recommender Systems at Amazon.com

Amazon is well-known for personalization and recommendations, which help customers discover items they might otherwise not have found. In this update to our original article, we discuss some of the changes as Amazon has grown.

Brent Smith, Amazon.com
Greg Linden, Microsoft

For two decades now [1], Amazon.com has been building a store for every customer. Each person who comes to Amazon.com sees it differently, because it’s individually personalized based on their interests. It’s as if you walked into a store and the shelves started rearranging themselves, with what you might want moving to the front, and what you’re unlikely to be interested in shuffling further away.

From a catalog of hundreds of millions of items, Amazon.com’s recommendations pick a small number of items you might enjoy based on your current context and your past behavior. The algorithms aren’t magic; they simply share with you what other people have already discovered. The algorithm does all the work. It’s computers helping people help other people, implicitly and anonymously.

Amazon.com launched item-based collaborative filtering in 1998, enabling recommendations at a previously unseen scale for millions of customers and a catalog of millions of items. Since we wrote about the algorithm in IEEE Internet Computing in 2003 [2], it has seen widespread use across the Web, including YouTube, Netflix, and many others. The algorithm’s success has come from its simplicity, scalability, and often surprising and useful recommendations, as well as desirable properties such as updating immediately based on new information about a customer and being able to explain why it recommended something in a way that’s easily understandable.

What was described in our 2003 IEEE Internet Computing article has faced many challenges and seen much development over the years. Here, we describe some of the updates, improvements, and adaptations for item-based collaborative filtering, and offer our view on what the future holds for collaborative filtering, recommender systems, and personalization.

The Algorithm

As we described it in 2003, the item-based collaborative filtering algorithm is straightforward. In the mid-1990s, collaborative filtering was generally user-based, meaning the first step of the algorithm was to search across other users to find people with similar interests (such as similar purchase patterns), then look at what items those similar users found that you haven’t found yet. Instead, our algorithm begins by finding related items for each item in the catalog. The term “related” could have several meanings here, but at this point,

IEEE Internet Computing, May/June 2017. Published by the IEEE Computer Society. 1089-7801/17/$33.00 © 2017 IEEE

Standing the Test of Time

As part of recognizing IEEE Internet Computing for its 20 years in publication, I recommended to the editorial board that we pick one of our magazine articles that, over the past 20 years, has withstood the “test of time.” In selecting an article, we evaluated the ideas in more than 20 candidate articles that reported on “evergreen” research areas over the past two decades and then assessed these articles based on downloads from IEEE Xplore, citations, and mentions of the work in popular press. This information was presented to a committee consisting of previous Editors in Chief for the magazine. I would like to thank the selection committee from the editorial board, led by Arun Iyengar and including Fred Douglis, Robert Filman, Michael Huhns, Charles Petrie, Michael Rabinovich, and Munindar Singh. This committee deliberated on the top three articles by evaluating each work’s previous importance within the context of its sustained importance in the future.

It’s my pleasure to recognize the committee’s official “Test of Time” winner: an industry article titled “Amazon.com Recommendations: Item-to-Item Collaborative Filtering” by Greg Linden, Brent Smith, and Jeremy York, from the January/February 2003 issue of IC (see doi:10.1109/MIC.2003.1167344). Fourteen years after the publication of this article, it shows 125 downloads from IEEE Xplore in one month, with more than 12,754 downloads since January 2011. The article currently shows 4,258 citations in Google Scholar. I’m delighted that the selection committee recommended an industry article, as it aligns with the magazine’s focus of accessibility in academic, research, and industrial populations.

In addition to recognizing the article, we asked the authors to create this retrospective piece discussing research and insights that have transpired since publishing their winning “Test of Time” article, while projecting into the future.

Going forward, the magazine hopes to celebrate a “Test of Time” article every 2–3 years. I hope that you enjoy this retrospective article, and please take a moment to congratulate Greg Linden, Brent Smith, and Jeremy York.

M. Brian Blake
Editor-in-Chief, IEEE Internet Computing
Provost and Distinguished Professor, Drexel University

let’s loosely define it as “people who buy one item are unusually likely to buy the other.” So, for every item $i_1$, we want every item $i_2$ that was purchased with unusually high frequency by people who bought $i_1$.

Once this related items table is built, we can generate recommendations quickly as a series of lookups. For each item that’s part of this customer’s current context and previous interests, we look up the related items, combine them to yield the most likely items of interest, filter out items already seen or purchased, and then we are left with the items to recommend.

This algorithm has many advantages over the older user-based collaborative filtering. Most importantly, the majority of the computation (a batch build of the related items) is done offline, and the computation of the recommendations can be done in real time as a series of lookups. The recommendations are high quality and useful, especially given enough data, and remain competitive in perceived quality even with the newer algorithms created over the last two decades. The algorithm scales to hundreds of millions of users and tens of millions of items without sampling or other techniques that can reduce the quality of the recommendations. The algorithm updates immediately on new information about a person’s interests. Finally, the recommendations can be explained in an intuitive way as arising from a list of items the customer remembers purchasing.

In 2003: Amazon.com, Netflix, YouTube, and More

By the time we published in IEEE Internet Computing in 2003, item-based collaborative filtering was widely deployed across Amazon.com. The homepage prominently featured recommendations based on your past purchases and items browsed in the store. Search result pages recommended items related to your search. The shopping cart recommended other items to add to your cart, perhaps impulse buys to bundle in at the last minute, or perhaps complements to what you were already considering. At the end of your order, more recommendations appeared, suggesting items to order later. Using e-mails, browse pages, product detail pages, and more, many pages on Amazon.com had at least some recommended content, starting to approach a store for every customer.

Others have reported using the algorithm, too. In 2010, YouTube reported using it for recommending videos [3]. Many open source and third-party vendors included the algorithm, and it showed up widely in online retail, travel, news, advertising, and more. In the years following, the recommendations were used so extensively by Amazon.com that a Microsoft Research report estimated 30 percent of Amazon.com’s page

A Present-Day Perspective on Recommendation and Collaborative Filtering

As a PhD student who uses collaborative filtering in my work to introduce customized recommendation techniques (and collaborative filtering) that select “workers” for crowdsourcing [1, 2], the Test of Time article is particularly meaningful to me. Collaborative filtering is a technique used to personalize the experience of users through recommendations tailored to the users’ interests, leveraging the experiences of other users with similar profiles. Traditionally, the technique is used in e-commerce platforms to drive sales by converting targeted suggestions to purchases [3]. The technique has rendered more favorable results than blanket advertisement, and is more purposeful toward customizing the experience of individual users. Despite this success, two primary challenges have surfaced: real-time scalability and recommendation quality. These concerns directly impact the users’ individual experiences and, by extension, the success of the platforms using the technique.

The first concern of scalability is directly affected by today’s inexpensive and evolving storage and computing capabilities; these have led to overwhelming data generation and collection. Unfortunately, algorithms, including traditional collaborative filtering, haven’t evolved in capacity to handle this new volume of data in real time or in an online modality. To address the issue of scalability, a variety of techniques are employed to reduce the dataset in a structured manner. Some of these approaches include sampling users, data partitioning driven by the classification of items, and omitting high- or low-frequency items to bubble others to the top of the recommended list. These approaches, while seeking to remedy the issue in scalability, affect the quality of recommendations; this directly impacts the second concern.

Given these concerns, it was incumbent on the research community and practitioners to devise an approach that gains the benefits of scalability without sacrificing recommendation quality. The most successfully employed approach has come in the form of item-based collaborative filtering. Its continued success is evident in applications such as the major large-scale e-commerce platform, Amazon.com. It scopes recommendations via the user’s purchased or rated items, pairing them to similar items against established metrics, and finally composing a list of similar items as recommendations. As opposed to dataset-reduction techniques employed through user-centric means, this approach is item-centric, which drastically reduces the data space for evaluation. As outlined in IEEE Internet Computing’s Test of Time article [4] and other closely related work [5], this data-space reduction is potentially up to three orders of magnitude of its original size. Being item-centric, it overcomes the issue with sparsity in user data in traditional approaches (such approaches contribute largely to unnecessary evaluation). It also overcomes the issue of the density in frequent users who have large portions of data associated with their profiles.

Item-based collaborative filtering still requires offline processing to pair similar items. By preprocessing this information offline, recommendations in the list produced from item-based collaborative filtering can occur in real time in an online modality. This allows for easy, quick, more personalized recommendation for the user. The similar items list is a sleek subset of items targeted to the user’s purchasing or rating history, as opposed to that of others in the entire dataset. It also overcomes challenges with newer and less-frequent users with sparse history, because the similar items list focuses on the user’s history as opposed to the history of other users. This technique is more efficient, yet it hasn’t had any adverse effects on the quality of recommendations; as such, it continues to be the technique of choice for real-time, online collaborative filtering and recommendations.

Julian Jarrett
PhD Student, Computer Science, Drexel University

References
1. J. Jarrett and M.B. Blake, “Using Collaborative Filtering to Automate Worker-Job Recommendations for Crowdsourcing Services,” Proc. 2016 IEEE Int’l Conf. Web Services, 2016, pp. 641–645.
2. J. Jarrett et al., “Self-Generating a Labor Force for Crowdsourcing: Is Worker Confidence a Predictor of Quality?” Proc. 2015 3rd IEEE Workshop on Hot Topics in Web Systems and Technologies, 2015, pp. 85–90.
3. J.L. Herlocker et al., “Evaluating Collaborative Filtering Recommender Systems,” ACM Trans. Information Systems, vol. 22, no. 1, 2004, pp. 5–53.
4. G. Linden, B. Smith, and J. York, “Amazon.com Recommendations: Item-to-Item Collaborative Filtering,” IEEE Internet Computing, vol. 7, no. 1, 2003, pp. 76–80.
5. B. Sarwar et al., “Item-Based Collaborative Filtering Recommendation Algorithms,” Proc. 10th Int’l Conf. World Wide Web, 2001, pp. 285–295.
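The offline pairing and real-time lookup that the sidebar describes can be sketched roughly as follows. This is a minimal illustration, not Amazon.com's implementation: it assumes raw co-purchase counts as the relatedness score (a simplification of "unusually likely"), and the function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_related_items(purchases, top_n=10):
    """Offline batch step: for each item, keep the top_n co-purchased items.

    purchases maps a customer id to the items that customer bought.
    Assumption: relatedness is the raw co-purchase count.
    """
    co_counts = defaultdict(Counter)
    for items in purchases.values():
        for a, b in permutations(set(items), 2):
            co_counts[a][b] += 1
    return {item: counts.most_common(top_n) for item, counts in co_counts.items()}


def recommend(related, history, k=5):
    """Online step: look up related items, combine scores, filter seen items."""
    scores = Counter()
    for item in history:
        for other, score in related.get(item, []):
            if other not in history:  # drop items already purchased
                scores[other] += score
    return [item for item, _ in scores.most_common(k)]


# Hypothetical toy data for illustration only.
purchases = {
    "u1": ["camera", "card"],
    "u2": ["camera", "card", "bag"],
    "u3": ["camera", "tripod"],
}
related = build_related_items(purchases)
print(recommend(related, {"camera", "card"}))
```

The table is built once in batch; each recommendation request then costs only a few dictionary lookups per item in the customer's history, which is the property that makes the approach usable in real time.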

views were from recommendations [4]. Similarly, Netflix used recommender systems so extensively that their Chief Product Officer, Neil Hunt, indicated that more than 80 percent of movies watched on Netflix came through recommendations [5], and placed the value of Netflix recommendations at more than US$1 billion per year.

When we originally developed item-based collaborative filtering, Amazon.com was primarily a bookstore. Since then, Amazon.com’s sales have grown more than a hundred-fold and have expanded beyond books to be dominated by non-media items, from laptop computers to women’s dresses. This growth challenged many


assumptions in our original algorithms, requiring adaptation to a new and changing landscape. Through experience, we also found ways to refine the algorithm to produce more relevant recommendations for the many new applications of it.

Defining “Related” Items

The quality of recommendations depends heavily on what we mean by “related.” For example, what do we mean by “unusually likely” to buy item Y given that you bought X? When we observe that customers have bought both X and Y, we might wonder how many X-buyers would have randomly bought Y if the two items were unrelated. A recommender system is ultimately an application of statistics. Human behavior is noisy, and the challenge is to discover useful patterns among the randomness.

A natural way to estimate the number of customers $N_{XY}$ who have bought both X and Y would be to assume X-buyers had the same probability, P(Y) = |Y buyers| / |all buyers|, of buying Y as the general population and use |X buyers| × P(Y) as our estimate, $E_{XY}$, of the expected number of customers who bought both X and Y. Our 2003 article, and much of our work before 2003, used a calculation similar to this.

However, it’s a curious fact that, for almost any two items X and Y, customers who bought X will be much more likely to buy Y than the general population. How can that be? Imagine a heavy buyer: someone who has bought every item in the catalog. When we look for all the customers who have bought X, this customer is guaranteed to be selected. Similarly, a customer who has made 1,000 purchases will be about 50 times as likely to be selected as someone with 20 purchases; sampling a random purchase doesn’t give a uniform probability of selecting customers. So, we have a biased sample. For any item X, customers who bought X will be likely to have bought more than the general population.

This non-uniform distribution of customer purchase histories means we can’t ignore who bought X when we’re trying to estimate how many X-buyers we would expect to randomly buy Y. We found it useful to model customers as having many chances to buy Y [6]. For example, for a customer with 20 purchases, we take each of these 20 purchases as an independent opportunity to have purchased Y.

More formally, for a given customer c who purchased X (denoted by $c \in X$), we can estimate c’s probability of buying Y as $1 - (1 - P_Y)^{|c|}$, where |c| represents the number of non-X purchases made by c and $P_Y$ = |Y purchases| / |all purchases|, or the probability that any randomly selected purchase is Y. Then, we can calculate the expected number of Y-buyers among the X-buyers by summing over all X-buyers and using a binomial expansion (see Figure 1).

$$
\begin{aligned}
E_{XY} &= \sum_{c \in X} \left[ 1 - (1 - P_Y)^{|c|} \right]
        = \sum_{c \in X} \left[ 1 - \sum_{k=0}^{|c|} \binom{|c|}{k} (-P_Y)^k \right] \\
&= \sum_{c \in X} \left[ 1 - \left[ 1 + \sum_{k=1}^{|c|} \binom{|c|}{k} (-P_Y)^k \right] \right]
        = \sum_{c \in X} \sum_{k=1}^{|c|} (-1)^{k+1} \binom{|c|}{k} P_Y^k \\
&= \sum_{c \in X} \sum_{k=1}^{\infty} (-1)^{k+1} \binom{|c|}{k} P_Y^k
        \quad \left( \text{since } \binom{|c|}{k} = 0 \text{ for } k > |c| \right) \\
&= \sum_{k=1}^{\infty} \sum_{c \in X} (-1)^{k+1} \binom{|c|}{k} P_Y^k
        \quad (\text{Fubini's theorem}) \\
&= \sum_{k=1}^{\infty} \alpha_k(X) \, P_Y^k,
        \quad \text{where } \alpha_k(X) = \sum_{c \in X} (-1)^{k+1} \binom{|c|}{k}.
\end{aligned}
$$

Figure 1. The derivation of the expected number of customers who bought both X and Y, accounting for multiple opportunities for each X-buyer to buy Y.
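As a rough illustration of this calculation, the Python sketch below computes the expected co-occurrence count two ways: directly, as the sum over X-buyers of $1 - (1 - P_Y)^{|c|}$, and through the polynomial in $P_Y$ with coefficients $\alpha_k(X)$ truncated at a bounded k. The function names and toy numbers are hypothetical and chosen only for illustration; they are not taken from the article or from Amazon.com's systems.

```python
from math import comb


def expected_direct(non_x_counts, p_y):
    """E_XY as the sum over X-buyers c of 1 - (1 - P_Y)^|c|.

    non_x_counts: |c| for each X-buyer (number of non-X purchases).
    p_y: probability that a randomly selected purchase is Y.
    """
    return sum(1.0 - (1.0 - p_y) ** n for n in non_x_counts)


def alpha_coefficients(non_x_counts, max_k):
    """alpha_k(X) for k = 1..max_k; these depend only on the X-buyers."""
    return [
        (-1) ** (k + 1) * sum(comb(n, k) for n in non_x_counts)
        for k in range(1, max_k + 1)
    ]


def expected_from_alphas(alphas, p_y):
    """Polynomial form: sum of alpha_k(X) * P_Y^k, truncated at len(alphas)."""
    return sum(a * p_y ** k for k, a in enumerate(alphas, start=1))


# Toy data (hypothetical): three X-buyers with 3, 5, and 20 other purchases.
counts = [3, 5, 20]
p_y = 0.01
alphas = alpha_coefficients(counts, max_k=3)  # could be precomputed per item X
print(expected_direct(counts, p_y))
print(expected_from_alphas(alphas, p_y))
```

Because the coefficients depend only on X and $P_Y$ depends only on Y, both can be computed once per item and combined cheaply for any pair.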

We can write $E_{XY}$ as a polynomial in $P_Y$ with coefficients that depend purely on X. In practice, $P_Y$’s are small, so close approximations can be made with bounded k. In addition, $P_Y$ and $\alpha_k(X)$ can be precomputed for all items, which then allows $E_{XY}$ to be approximated for any pair of items with a simple combination of precomputed values.

With a robust method of computing $E_{XY}$, we can use it to evaluate whether $N_{XY}$, the observed number of customers who bought both X and Y, is higher or lower than would be expected by chance. For example, $N_{XY} - E_{XY}$ gives an estimate of the number of non-random co-occurrences, and $[N_{XY} - E_{XY}] / E_{XY}$ gives the percent difference from the expected random co-occurrence. These are two examples of creating a similarity score S(X, Y) as a function of the observed and expected number of customers who purchased both X and Y. The first, $N_{XY} - E_{XY}$, will be biased toward popular Y’s such as the first Harry Potter book, so the recommendations might be perceived as too obvious or irrelevant. The second, $[N_{XY} - E_{XY}] / E_{XY}$, makes it too easy for low-selling items to have high scores, so the recommendations might be perceived as obscure and random, especially because of the large number of unpopular items. Relatedness scores need to strike a balance between popularity on one end and the power law distribution of unpopular items on the other. The chi-square score, $[N_{XY} - E_{XY}] / \sqrt{E_{XY}}$, is an example that strikes such a balance.

There are several other choices and parameters that could be considered in a relatedness score and in creating recommendations from related items. Our experience is that there is no one score that works best in all settings. Ultimately, perceived quality is what recommendations are judged on; recommendations are useful when people find them useful.

Machine learning and controlled online experimentation can learn what customers actually prefer, picking the best parameters for the specific use of the recommendations. Not only can we measure which recommendations are effective, but we can also feed information about which recommendations people liked, clicked on, and bought back into our algorithms, learning what helps customers the most [7].

For example, compatibility is an important relationship. We might observe that customers who buy a particular digital camera are unusually likely to buy a certain memory card, but this doesn’t guarantee that the memory card works with the camera. Customers buy memory cards for many reasons and the observed correlation might be a random occurrence. Indeed, there are hundreds of thousands of memory cards in Amazon.com’s catalog, so many of them are randomly correlated with the camera. Many e-commerce sites use a hand-curated knowledge base of compatibility, which is expensive and error-prone to maintain, especially at Amazon.com’s scale. We found that, given enough data and a robust metric for the relatedness of items, compatibility can emerge from people’s behavior, with the false signals falling away and the truly appropriate items surfacing.

Curiously, we found that the meaning of related items also can be emergent, arising from the data, and discovered by customers themselves. Consider the items people look at versus the items they purchase. For books, music, and other low-cost items, people tend to look at and purchase the same thing. For many expensive items, and especially for non-media items, what people view and what they purchase can be radically different. For example, people tend to look at many televisions, but only purchase one. What they look at around the time of looking at that television will tend to be other televisions. What they purchase around the time they bought a television tends to be complements that enhance the experience after buying that particular television, such as a Blu-ray player and a wall mount.

The Importance of Time

Understanding the role of time is important for improving the quality of recommendations. For example, when computing the related items table, how related a purchase is to another purchase depends heavily on their proximity in time. If a customer buys a book five months after buying another book, this is weaker evidence for the books being related than if the customer had purchased them on the same day. Time directionality also can be helpful. For example, the fact that customers tend to buy a memory card after buying a camera, rather than the other way around, might be a good hint that we shouldn’t recommend the camera when someone buys the memory card. Sometimes, items are bought sequentially, such as a book, movie, or TV series,


and recommendations should be for what you want to do next.

Amazon.com’s catalog is continually changing through time. Every day, thousands of new items arrive and many others fade into obscurity and obsolescence. This cycle is especially pronounced in some categories. For example, apparel has seasonal fashions, and consumer electronics has rapid technological innovation. New items can be at a disadvantage, because they don’t have enough data yet to have a strong correlation with other items. This is referred to as the cold-start problem, and often requires an explore/exploit process to give items that have not yet had much opportunity to be purchased an opportunity to be shown. Perishable items such as news or social media posts represent a particularly challenging form of cold start, often requiring blending data from content-based algorithms (using subject, topic, and text) with behavior-based algorithms (using purchases, views, or ratings).

Customers also have a lifecycle and experience their own cold-start problem. Knowing what to recommend when we have very limited information about a new customer’s interests has long been an issue. When to make use of limited information and when to play it safe with generally popular items is a subtle transition that’s difficult to get correct.

Even for established customers, modeling time correctly has a large impact on the quality of recommendations. As they age, previous purchases become less relevant to the customer’s current interests. This is complicated by the fact that this can attenuate at different rates for different types of items. For example, some purchases, such as a manual on sailing heavy seas, likely indicate a durable long-term interest. Others, such as a dishwasher repair kit, might not be relevant after this weekend’s project. There are even some purchases, such as baby rattles, where the recommendations have to change over a long period of time; four years later, we should recommend balance bikes and board books rather than baby bottles and teethers. And some items, such as books, are usually only bought once; others, such as toothpaste, are bought again and again with a fairly predictable lapse of time between the purchases.

The quality of recommendations we can make depends not only on the timing of past purchases, but what was purchased. We found that a single book purchase can say a lot about a customer’s interests, letting us recommend dozens of highly relevant items. But many purchases in non-media categories tell us little about the customer. What insights can be gleaned from the purchase of a stapler? What surprising and insightful recommendations can be made from buying a pair of socks? Recommending tape dispensers or more underwear might be helpful in the moment, but leads to uninspiring recommendations in the longer term. Thus, we had to develop techniques for learning which purchases lead to useful recommendations and when some should be ignored.

Finally, the importance of diversity in recommendations is well known; sometimes it’s better to give a variety of related items rather than a narrowly targeted list. The breadth of Amazon.com’s massive catalog with its many types of products offers a unique challenge not seen in single-product category stores such as bookstores. For example, recommending more books to a heavy reader might lead to a sale, but people might benefit most long term by discovering items they have never even considered before in another product line. Immediate intent is a factor in diversity as well. When someone is clearly seeking something specific, recommendations should be narrow to help them quickly find what they need. But when intent is unclear or uncertain, discovery and serendipity should be the goal. Finding the right balance in the diversity of recommendations requires experimentation along with a willingness to optimize for the long term.

The Future: Recommendations Everywhere

What does the future hold for recommendations? We believe there’s more opportunity ahead of us than behind us. We imagine intelligent interactive services where shopping is as easy as a conversation.

This moves beyond the current paradigm of typing search keywords in a box and navigating a website. Instead, discovery should be like talking with a friend who knows you, knows what you like, works with you at every step, and anticipates your needs.

This is a vision where intelligence is everywhere. Every interaction should reflect who you are and what you like, and help you find what

other people like you have already discovered. It should feel hollow and pathetic when you see something that’s obviously not you; do you not know me by now?

Getting to this point requires a new way of thinking about recommendations. There shouldn’t be recommendation features and recommendation engines. Instead, understanding you, others, and what’s available should be part of every interaction.

Recommendations and personalization live in the sea of data we all create as we move through the world, including what we find, what we discover, and what we love. We’re convinced the future of recommendations will further build on intelligent computer algorithms leveraging collective human intelligence. The future will continue to be computers helping people help other people.

Nearly two decades ago, Amazon.com launched recommendations to millions of customers over millions of items, helping people discover what they might not have found on their own. Since then, the original algorithm has spread over most of the Web, been tweaked to help people find videos to watch or news to read, been challenged by other algorithms and other techniques, and been adapted to improve diversity and discovery, recency, time-sensitive or sequential items, and many other problems. Because of its simplicity, scalability, explainability, adaptability, and relatively high-quality recommendations, item-based collaborative filtering remains one of the most popular recommendation algorithms today.

Yet the field remains wide open. An experience for every customer is a vision none have fully realized. Much opportunity remains to add intelligence and personalization to every part of every system, creating experiences that seem like a friend that knows you, what you like, and what others like, and understands what options are out there for you. Recommendations are discovery, offering surprise and delight with what they help uncover for you. Every interaction should be a recommendation.

References
1. G.D. Linden, J.A. Jacobi, and E.A. Benson, Collaborative Recommendations Using Item-to-Item Similarity Mappings, US Patent 6,266,649, to Amazon.com, Patent and Trademark Office, 2001 (filed 1998).
2. G. Linden, B. Smith, and J. York, “Amazon.com Recommendations: Item-to-Item Collaborative Filtering,” IEEE Internet Computing, vol. 7, no. 1, 2003, pp. 76–80.
3. J. Davidson et al., “The YouTube Video Recommendation System,” Proc. 4th ACM Conf. Recommender Systems, 2010, pp. 293–296.
4. A. Sharma, J.M. Hofman, and D.J. Watts, “Estimating the Causal Impact of Recommendation Systems from Observational Data,” Proc. 16th ACM Conf. Economics and Computation, 2015, pp. 453–470.
5. C.A. Gomez-Uribe and N. Hunt, “The Netflix Recommender System: Algorithms, Business Value, and Innovation,” ACM Trans. Management Information Systems, vol. 6, no. 4, 2016, pp. 1–19.
6. B. Smith, R. Whitman, and G. Chanda, System for Detecting Probabilistic Associations between Items, US Patent 8,239,287, to Amazon.com, Patent and Trademark Office, 2012.
7. K. Chakrabarti and B. Smith, Method and System for Associating Feedback with Recommendation Rules, US Patent 8,090,621, to Amazon.com, Patent and Trademark Office, 2012.

Brent Smith has worked on personalization and recommendations at Amazon.com for 17 years, leading teams that work on fast-paced customer-facing innovation. Smith has a BS in mathematics from the University of California, San Diego, and an MS in mathematics from the University of Washington. Contact him at [email protected].

Greg Linden is a data scientist at Microsoft (previously at Amazon.com, Google, and several startups). Much of his previous work was in recommendations, personalization, artificial intelligence, search, and advertising. Linden has an MS in computer science from the University of Washington and an MBA from Stanford University. Contact him at [email protected].
