
Recommend a Movie, Win a Million Bucks

Can a computer learn to choose movies you’re sure to like? Artificial intelligence took on this real-world problem in the Netflix Prize contest. A Caltech alum recounts his adventures in the quest for the million-dollar grand prize.

By Joseph Sill

Many a Caltech PhD goes into research, but for the better part of a year I did so alone and unemployed—and, yes, often in my pajamas. Yet, without ever leaving my apartment, I ended up on an 11-country team chasing the million-dollar Netflix Prize in what Caltech professor Yaser Abu-Mostafa [PhD ’83] has aptly described as the Super Bowl of machine learning. There were 5,169 teams competing, and with one day to go we were in first place.

Modeled after the 2003 X Prize challenge to build a privately funded spaceship, the Netflix Prize was intended to harness the creativity of computer scientists and statisticians worldwide. Humans (most of us, anyway) instinctively learn from experience to recognize patterns and make predictions, but designing machines that “learn” in an even remotely similar manner will provide fodder for PhD theses for decades to come. The Netflix challenge, to build a better method for predicting a customer’s future movie preferences based on past movie ratings, ultimately led to many discoveries, including profound advances in statistical modeling—as well as the fact that fans of the TV series Friends tend to dislike Stanley Kubrick movies.

A visitor to the Netflix website looking for something to rent faces a sea of more than 100,000 choices. Cinematch, the website’s recommendation system, helps you sort through them. Cinematch bases its suggestions on a staggeringly huge database of movie ratings collected whenever one of Netflix’s 10-million-plus subscribers clicks on one of the five “star” buttons below the thumbnail picture of each movie’s cover art. The average customer has rated some 200 movies, so Cinematch has plenty of preference data to work with.

Contrary to popular belief, self-employed computer gurus do occasionally spend time in the sun. Sill enjoys a balmy spring day on Chicago’s lakefront.


Still, as of 2005, Netflix was not entirely happy with it. Some of its suggestions seemed pretty random. The company had been tweaking the system for years, and finally Netflix founder and CEO Reed Hastings tried to fix things himself. He neglected his wife and family over the 2005 Christmas vacation while manipulating spreadsheets and trying to make sense of mountains of customer data. He soon concluded that he was in over his head, and Netflix eventually decided to look outside the company for help. On October 2, 2006, Netflix formally offered one million dollars to whoever submitted the best algorithm, as long as it beat Cinematch’s performance by at least 10 percent.

Entrants were allowed to download a data set of more than 100,000,000 ratings spanning seven years; 480,189 customers, represented anonymously by ID numbers; and 17,770 movies and TV series. Each data point consisted of four pieces of information: the customer ID, the movie or TV show’s title, the rating, and the date the rating was made. This was the “training set”—the data you let your computer puzzle over to tease out the patterns.

Some simple calculations showed that the highest-rated movies were the Lord of the Rings trilogy, with the final installment, The Return of the King, in first place with an average rating of 4.72 stars. Interestingly, every single one of the bottom 10 movies was a horror flick. In one case, this may have been intentional, since the sixth-lowest movie (1.40 stars) was titled The Worst Horror Movie Ever Made. Although this film came close to living up to—or down to—its name, dead last went to Avia Vampire Hunter at 1.29 stars. (A zero-star rating was not permitted.)

But winning the contest would require much more complex mathematics. The competition itself involved another 2.8 million data points where only the customer IDs, titles, and dates were revealed. This data had been randomly split into two subsets, the quiz set and the test set, and contestants were not told which data points belonged to which set. Teams were allowed to upload their predictions for the entire set to the Netflix Prize server once a day. The server then compared the predictions to the true ratings and calculated the root mean squared error, or RMSE, for each set. Loosely speaking, RMSE measures the average error. If an algorithm predicted three stars when the true rating was four, for example, the RMSE would be one. The teams’ RMSEs on the quiz set were posted in rank order on a public web page known as the leaderboard. The RMSEs on the test set, which would determine who—if anyone—would win the million dollars, were known only to Netflix.
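In code, that scoring rule is only a few lines (a minimal Python sketch, not the program Netflix actually ran):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error: the square root of the average squared miss."""
    squared_misses = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_misses) / len(squared_misses))

# The example from the text: predicting three stars when the truth is four.
print(rmse([3.0], [4.0]))                      # 1.0
# Because misses are squared, one big error hurts more than several small ones.
print(rmse([3.5, 4.0, 2.0], [4.0, 4.0, 4.0]))  # about 1.19
```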
Things started out swiftly, with an improvement of more than 8.5 percent over Cinematch in the first 18 months. I had a highly demanding job as a quantitative analyst for a hedge fund at the time, so I was not following the action very closely. But my old mentor, Professor of Electrical Engineering and Computer Science Abu-Mostafa, was—he was using the Netflix Prize as the basis for a project-based machine-learning course. I visited him in early 2008 to help interview prospective grad students, and he filled me in. We’d been friends for more than 15 years, so his interest excited mine; but still, my job left me no free time.

My own career in machine learning had begun in Caltech’s computation and neural systems program, which occupies the intersection of computer science and engineering with neurobiology. In it, neurobiologists use computers to help understand the brain, and engineers look to neurobiology for inspiration in designing machines. With a background in applied math, I was on the theoretical periphery, and only survived the two demanding lab courses through the patience and generosity of my unlucky lab partners. One final required us to determine the inner workings of an analog VLSI chip (a particularly esoteric device) by running various diagnostic tests. I passed only by divine intervention—the fabrication plant accidentally burned the chips, and the exam was cancelled.

I’m much more comfortable working with data, which in the mid-’90s meant a few thousand data points if we were lucky. The sheer enormity of the Netflix dataset was alluring, and in the summer of 2008 a career decision provided the time to dig into it. The full fury of that fall’s market meltdown had not yet hit, and my employer was doing fine, but I was ready to move on from finance. I needed some time off, and I needed a project to keep my skills sharp while I mulled things over. The Netflix Prize was perfect.

Left: A CS 156b discussion section. Abu-Mostafa (seated, at left) and grad student Panna Felson (BS ’09) look on as senior Constantine (Costis) Sideris holds forth. Grad student and teaching assistant Chess Stetson is sitting on the right.

Right: At the end of 2007, BellKor ruled the Netflix leaderboard.

Getting Up to Speed

The contest had now been going on for almost two years, so I had a lot of catching up to do. But I had ample time, now that I had quit my job, and there was a road map to the top of the leaderboard. To keep things moving, Netflix was offering an annual $50,000 “Progress Prize” to the team with the lowest RMSE, as long as it represented at least a 1 percent improvement on the previous year’s best score. To claim the prize, however, you had to publish a paper describing your techniques at a reproducible level of detail. The first Progress Prize had gone to BellKor, a team from AT&T Research made up of Yehuda Koren (now with Yahoo! Research), Bob Bell, and Chris Volinsky. All I had to do was follow their instructions, and in theory I should get similar results.

As I read BellKor’s paper, I quickly realized that ironic quotation marks belonged around the phrase “all I had to do.” Their solution was a blend of over 100 mathematical models, some fairly simple and others complex and subtle. Fortunately, the paper suggested that a smaller set of models—perhaps as few as a dozen—might suffice. Even so, nailing down the details would be no easy task.
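For concreteness, the improvement percentages that the prizes hinged on are relative reductions in RMSE, measured against Cinematch's score. A quick sketch of the arithmetic (0.9513 is the Cinematch figure cited later in this article; the baseline Netflix applied to its hidden test set differed slightly):

```python
def improvement_over_cinematch(rmse, baseline=0.9513):
    """Percent reduction in RMSE relative to Cinematch's score."""
    return 100.0 * (baseline - rmse) / baseline

print(round(improvement_over_cinematch(0.8616), 2))  # 9.43: the score that won the second Progress Prize
print(round(improvement_over_cinematch(0.8563), 2))  # 9.99: in effect, the 10 percent needed for the million
```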
The Netflix Prize as a Teaching Tool

The most important feature of the Netflix competition was not the size of the prize, one million dollars (although that didn’t hurt!), but rather the size of the data: 100 million points. Machine learning researchers are used to much smaller data sets, and Netflix provided several orders of magnitude more data than what we usually have to be content with. It was a machine learning bonanza.

Machine learning lives or dies by the data that the algorithm learns from. The whole idea of this technology is to infer the rules governing some underlying process—in this case, how people decide whether they like or dislike a movie—from a sample of data generated by that process. The data are our only window into the process, and if the data are not enough, nothing can be done but make guesses. That doesn’t work very well.

I saw the Netflix data as an opportunity to get my CS 156 Learning Systems students to try out the algorithms that they had learned in class. With 100 million data points, the students could experiment with all kinds of ideas to their heart’s content.

The first time I gave the Netflix problem as a class project, the competition was still going on. For the project, each team had to come up with its own algorithm and maximize its performance. Then all the algorithms would be blended to give a solution for the class as a whole. Like Joe, we hoped to rapidly climb the leaderboard with each submission.

Perhaps my biggest challenge as an instructor was to work out a way of ensuring that the various teams tried different techniques. As you will see, blending radically different solutions is key to getting good performance, so if everybody tried the same approach because it seemed to be the most promising technique known, blending the results of such duplicative efforts would not give a lot of improvement.

Therefore, I announced that a team’s grade would not depend on their algorithm’s individual performance, but on the incremental improvement in the class-wide solution when the algorithm was incorporated into the blend. This gave the students an incentive to explore the less-traveled roads that might offer a better chance of shining in the blended solution. There was great educational value in pushing people to venture “outside the box.”

How did we do? Well, I registered two team names with Netflix. I told the students that if we did really, really well, we would submit under “Caltech.” The other name was that of a rival school back east, and we would use it if we did really, really badly. It turned out that we fared neither too badly nor too well. Our effort gave about a 6 percent improvement over the original Cinematch system, so we did not officially submit it.

The second time the course was offered, the Netflix competition had already ended. There was no blending this time around, and the grading process was more of a judgment call. With the benefit of hindsight, the class’ performance was much better. One team was getting a weekly improvement that it took the teams in the actual competition months to achieve. In fact, at one point a guy in the class wanted to set an appointment to meet with me. He had the top individual score of 5.3 percent, so I jokingly told him that he needed to get to 6 percent before I would see him. Within a few days he had gotten to 6.1 percent. —YA-M


Several other leading teams had also published papers, even though they were not compelled to do so, and some of their methods also looked promising. Three general approaches seemed to be the most successful: nearest neighbors, matrix factorization, and restricted Boltzmann machines.

Nearest-neighbor models are among the oldest, tried-and-true approaches in machine learning, so it wasn’t surprising to see them pop up here. There are two basic variations—the user-based approach and the product-based approach.

Suppose the task is to predict how many stars customer 317459 would give Titanic. The user-based approach would look for like-minded people and see what they thought. This similarity is measured by calculating correlation coefficients, which track the tendency of two variables to rise and fall in tandem. However, this approach has some serious drawbacks. The correlations between every possible pairing of customers need to be determined, which for Netflix amounts to 125 billion calculations—a very heavy computational burden. Furthermore, e-commerce companies have found that people tend to trust recommendations more if the reasoning behind them can be explained. “Selling” a user-based nearest-neighbor recommendation is not so easy: “There’s this woman in Idaho who usually agrees with you. She liked Titanic, so we think you will, too—trust us on this one.”

For these reasons, product-based nearest-neighbor techniques are now much more common, and the Netflix contest was no exception. With 17,770 movies, as opposed to half a million customers, computing the correlations between all possible pairings in the training set was much more manageable, and the resulting table fit easily into a computer’s RAM. These precomputed correlations led to odd insights—it’s how I learned that fans of Friends tend to dislike Stanley Kubrick. (If you correlate Netflix’s catalog with the ratings given to a DVD of a season of Friends and rank the results, Dr. Strangelove and 2001: A Space Odyssey end up at the bottom.) Typically, only positive correlations are used—for instance, Pulp Fiction is highly correlated with other Quentin Tarantino movies like Reservoir Dogs and Kill Bill, as well as movies with a similar dark and twisted sensibility like Fight Club and American Beauty. Thus a product-based approach allows Netflix to say, “since you enjoyed Fight Club, we thought you might like Pulp Fiction.”

Surprisingly, I found that negative product-based correlations were nearly as useful as positive ones. Although the connections were not as obvious, there was often a certain logic to them. For instance, Pulp Fiction and Jennifer Lopez romantic comedies like The Wedding Planner and Maid in Manhattan were strongly anticorrelated, and fans of the TV series Home Improvement generally disliked quirky movies such as The Royal Tenenbaums, Eternal Sunshine of the Spotless Mind, and Being John Malkovich. I ended up devising a negative-nearest-neighbors algorithm—a “furthest opposites” technique, if you will—that scored an RMSE of 0.9570, which was nearly as good as the 0.9513 of the original Cinematch software. I got perverse enjoyment out of imagining a customer being told that since he hated Rambo III, he might like Annie Hall.
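The arithmetic behind those product-based correlations is ordinary statistics. A toy sketch (the ratings here are made up, and this is a plain Pearson correlation over customers who rated both titles; contest entries added refinements such as shrinking correlations that rest on only a few common raters):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical ratings from five customers who rated both movies.
pulp_fiction    = [5, 4, 5, 2, 4]
reservoir_dogs  = [5, 5, 4, 2, 3]  # tends to rise and fall with Pulp Fiction
wedding_planner = [1, 2, 1, 5, 2]  # tends to move the opposite way

print(round(pearson(pulp_fiction, reservoir_dogs), 2))   # about 0.78, strongly positive
print(round(pearson(pulp_fiction, wedding_planner), 2))  # about -0.99, strongly negative
```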
Nearest-neighbor systems improved greatly over the contest’s first year, as teams delved into the math behind the models. These new twists represented significant progress in the design of recommendation systems, but a bigger breakthrough came in the form of another technique called matrix factorization.

Matrix factorization catapulted into prominence when “Simon Funk” suddenly appeared out of nowhere, vaulting into fourth place on the leaderboard. Funk, whose real name is Brandyn Webb, had graduated from UC San Diego at 18 with a computer science degree, and has since had a spectacular career designing algorithms. He was quite open about his method, describing it in detail on his website, and matrix factorization eventually became the leading technique within the Netflix Prize community.

Matrix factorization represents each customer and each movie as a vector, that is, a set of numbers called factors. The customer’s predicted rating of that movie is the dot product of the two vectors—a simple mathematical operation involving multiplying each customer factor by each movie factor and summing up the result. Essentially, the movie vector encodes an assortment of traits, with each factor’s numerical value indicating how much of that trait the movie has. The customer vector represents the viewer’s preferences for those traits.

Dot Products

Let’s say that movie vectors have four factors: sex (S), violence (V), humor (H), and music (M). Maggie, a customer who has just watched Brigadoon and is looking for more of bonnie Scotland, has preferences S = 2, V = 1, H = 3, M = 5. If Braveheart scores S = 2, V = 5, H = 1, M = 1 on this scale, then the dot product predicting Maggie’s likely reaction to Mel Gibson’s gorefest is (2 × 2) + (1 × 5) + (3 × 1) + (5 × 1) = 17. Since the maximum possible score in this example is 100, she should probably give Braveheart a miss.
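In code, the sidebar's calculation is a one-liner (a sketch; the factor names and the 1-to-5 scale are the sidebar's illustration, not something a real model defines up front):

```python
def dot(customer, movie):
    """Dot product of a customer's preference vector and a movie's factor vector."""
    return sum(c * m for c, m in zip(customer, movie))

# Factors in the sidebar's order: sex, violence, humor, music.
maggie     = [2, 1, 3, 5]
braveheart = [2, 5, 1, 1]
print(dot(maggie, braveheart))  # 17, out of a maximum of 100 on this made-up scale
```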

It’s tempting to define the factors ahead of time—how much money the film grossed, how many Oscar winners are in the cast, how much nudity or violence it has—but in fact they are not predetermined in any way. They are generated by the model itself, inside the “black box,” and only the model knows exactly what they mean. All we know is that as the model learns, it continually adjusts the factors’ values until they reliably give the right output, or rating, for any set of inputs.
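How does the model make those adjustments? The approach Funk popularized boils down to nudging the factors a little after each rating, in the direction that shrinks the error. A bare-bones sketch of that idea (not Funk's actual program; the triple format and parameter values are illustrative):

```python
import random

def train_factors(ratings, n_customers, n_movies, n_factors=10,
                  learning_rate=0.01, regularization=0.02, epochs=20):
    """Learn customer and movie factor vectors by stochastic gradient descent.

    `ratings` is a list of (customer_id, movie_id, stars) triples, ids from 0.
    """
    cust = [[random.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_customers)]
    movi = [[random.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_movies)]
    for _ in range(epochs):
        for c, m, stars in ratings:
            predicted = sum(cu * mo for cu, mo in zip(cust[c], movi[m]))
            error = stars - predicted
            # Nudge each factor to shrink the error, with a small penalty
            # (regularization) that keeps the factors from growing wild.
            for f in range(n_factors):
                cu, mo = cust[c][f], movi[m][f]
                cust[c][f] += learning_rate * (error * mo - regularization * cu)
                movi[m][f] += learning_rate * (error * cu - regularization * mo)
    return cust, movi

def predict(cust, movi, c, m):
    """Predicted rating: the dot product of the learned vectors."""
    return sum(cu * mo for cu, mo in zip(cust[c], movi[m]))
```

Real contest models layered customer and movie rating biases, date effects, and many other refinements on top of this core loop.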
But if a model had just a handful of factors, sometimes their meanings leapt out. One such factor tagged teen movies: the American Pie series and Dude, Where’s My Car? came out on top, and old-school classics like Citizen Kane and The Bridge on the River Kwai were at the bottom. Another loved romantic comedies such as Sleepless in Seattle and hated the Star Trek franchise. A third chose left-leaning films, placing Michael Moore movies such as Fahrenheit 9/11 highest and movies such as Celsius 41.11 (subtitled “the temperature at which the brain begins to die” and intended as a Republican response) lowest. As the number of factors grew—beyond 100, in some cases!—they got much harder to interpret. The trends being discerned got more subtle, but no less real.

The third class of approaches, the restricted Boltzmann machine, is hard to describe concisely. Basically, it’s a type of neural network, which is a mathematical construct loosely inspired by the interconnections of the neurons in the brain. As a computation and neural systems alumnus, I was glad to see neural networks well represented.

BellKor’s $50,000 winning blend incorporated numerous twists and tweaks of all three major techniques as well as several less-prominent methods. This blending approach became standard practice in the Netflix Prize community. The process of blending predictions was itself a meta-problem in machine learning, in that the blending algorithm had to be trained how to combine the outputs of the component models.

Teams working with blends soon discovered that it didn’t make sense to fixate on improving any particular technique. A better strategy was to come up with new models that captured subtle effects that had eluded your other models. My favorite example was a model based solely on what day of the week the rating had been made. Even after controlling for the types of movies that get rated on different days of the week, movies were more likely to get a bad rating on Mondays and more likely to get a good rating on weekends. On its own, this model performed horribly, but if none of the other models took the day of the week into account, this tiny but statistically significant signal could lead to a noticeable boost in the ensemble’s accuracy.
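A day-of-week model along those lines is almost embarrassingly small. Something like this sketch (a hypothetical helper computing each weekday's average offset from the overall mean; the real versions also controlled for which movies get rated on which days):

```python
from collections import defaultdict

def day_of_week_offsets(ratings):
    """Average rating offset for each weekday, relative to the overall mean.

    `ratings` is a list of (stars, rating_date) pairs, where rating_date is a
    datetime.date. Returns {weekday: offset}, with Monday = 0.
    """
    overall = sum(stars for stars, _ in ratings) / len(ratings)
    totals, counts = defaultdict(float), defaultdict(int)
    for stars, d in ratings:
        totals[d.weekday()] += stars - overall
        counts[d.weekday()] += 1
    return {day: totals[day] / counts[day] for day in totals}

def predict(overall_mean, offsets, rating_date):
    """Prediction: the global mean plus the offset for the day the rating was made."""
    return overall_mean + offsets.get(rating_date.weekday(), 0.0)
```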

Abu-Mostafa uses telepathy to pick movies for you. Machine learn- ing systems can’t do that yet.

Blended Models, Blended Teams

By the fall of 2008, having familiarized myself with all the major techniques and the art of blending large numbers of models, I set out to climb the leaderboard. I started with what I thought was a modest goal—the default number of teams displayed on the web page was 40, so if I could crack the Top 40 I could at least point myself out to people easily. Although my girlfriend was being very supportive, I wanted a tangible achievement to show for my solitary efforts hunched over the desk in the living room of my Chicago apartment, pounding away on my laptop. But with more than 5,000 teams competing, landing in the Top 40 meant being in the 99th percentile. I should have realized it wouldn’t be so easy.

For one thing, there were subtleties unmentioned in the papers that made the difference between a working model and a failure. Deducing these tricks on my own sometimes took weeks at a time. Another challenge was weighing how much time to spend on known methods versus inventing original approaches. I probably spent too much time trying for a home run—a stupendous, brand-new technique.

Still, I scored one triumph. When blending their models, most of my competitors relied on linear regression, a time-honored technique whose many virtues include a straightforward, analytically exact procedure for obtaining the best fit to the data. My intuition suggested that the blend should be adaptive, depending on such side information as the number of ratings associated with each customer or movie. By making every model’s coefficients a linear function of these side variables, I got a significant boost in accuracy while still retaining linear regression’s key merits.
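The trick is that the blend stays a linear regression. If each model's weight is itself a linear function of a few side features, the whole blend collapses into one regression on feature-times-prediction products. A sketch of the idea with numpy (the array names are made up, and this is only the gist, not the actual contest code):

```python
import numpy as np

def blend_fit(model_preds, side_feats, true_ratings):
    """Fit an adaptive blend where each model's weight is linear in the side features.

    model_preds:  (n_ratings, n_models)   predictions from the component models
    side_feats:   (n_ratings, n_features) side information, e.g. the log of how many
                  ratings a customer has; include a column of ones for a constant term
    true_ratings: (n_ratings,)            the known answers on a held-out set
    """
    n, n_models = model_preds.shape
    n_feats = side_feats.shape[1]
    # The blended prediction sum_i w_i(x) * m_i(x), with w_i(x) linear in the side
    # features, is linear in the products (feature value) * (model prediction).
    design = (model_preds[:, :, None] * side_feats[:, None, :]).reshape(n, n_models * n_feats)
    coeffs, *_ = np.linalg.lstsq(design, true_ratings, rcond=None)
    return coeffs.reshape(n_models, n_feats)

def blend_predict(coeffs, model_preds, side_feats):
    weights = side_feats @ coeffs.T          # per-rating weight for each model
    return np.sum(weights * model_preds, axis=1)
```

If the only side feature is a constant column of ones, this reduces to the plain linear-regression blend everyone else was using.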
Despite this breakthrough, come January 2009 I was still hovering just below where I wanted to be, stuck in the mid-40s. I would often peruse the top 10 with a mixture of respect and jealousy. One day, a new team suddenly appeared there—the Grand Prize Team, a name that struck me as presumptuous and cocky. When I read about them, though, I was intrigued. GPT’s founders—a team of Hungarian researchers called Gravity, and a team of Princeton undergrads named Dinosaur Planet—had merged in 2007. That team, dubbed When Gravity and Dinosaurs Unite, had very nearly won the first Progress Prize, but was overtaken by BellKor in the final hours.

GPT issued a standing invitation to all comers to join them, but in order to be admitted the applicant had to demonstrably improve GPT’s score—not an easy task. As of GPT’s founding, their RMSE stood at 0.8655. The million-dollar goal was 0.8563, so only 0.0092—or 92 basis points, as they were called—remained to go.

Shaving off just one basis point was excruciatingly hard, and GPT’s offer reflected this. The original members would only claim one-third of the prize, or $333,333, and the remainder would be split among the new members in proportion to their basis-point contribution. Thus one basis point, the smallest measurable improvement, was worth almost $7,000. This may seem overly generous, but it reflected how difficult progress had become. Weeks or even months would go by with nothing happening at the top of the leaderboard, and a boost of just a few basis points was cause for celebration. The old 80/20 rule—that 80 percent of the payoff comes in the first 20 percent of the work—was in full force.

My advancement as a lone wolf stymied, I crossed my fingers and sent my adaptive-blending code to GPT captain Gábor Takács. Reading the email I got in response was the most satisfying moment I had yet experienced—I’d boosted their score by more than 10 basis points! GPT’s next official submission to Netflix a few days later jumped from 0.8626 to 0.8613 on the strength of my contribution, leapfrogging a few other entrants in the process. As of February 2009, I was suddenly a significant shareholder on a leading team.

However, first place belonged to a merger of BellKor and two Austrian graduate students calling themselves Big Chaos. The amalgam, BellKor in Big Chaos, had won the second Progress Prize in October 2008, with a score of 0.8616. They had since climbed to 0.8598, and there they had ground to a halt. It seemed like that figure was chiseled in stone—so close to the magic number, and yet so very far away.

My stake as one of a dozen members of GPT gave me newfound motivation, as did the contrast between our leap and our competitors’ glacial progress. We were still underdogs, but underdogs with momentum on our side. Over the next few months, I found some other side variables I could exploit, and we crept up the leaderboard. A new recruit from Israel, Dan Nabutovsky, boosted our score by another six basis points.

Our collaborative style varied. Some GPT members (or prospective members) simply sent Gábor their predictions. He’d add them to the mix and see whether the overall blend improved—something he could do without even knowing how the new models worked. Gábor and I, however, were collaborating at a much deeper level, exchanging code and debating which approaches were most likely to give us a boost.

Meanwhile, another underdog was also making a run. Calling themselves Pragmatic Theory, Martin Chabbert and Martin Piotte of Quebec had been steadily rising up the leaderboard. Neither of them had any formal training in machine learning, and both were holding down full-time jobs. Nonetheless, by March 2009 they had surpassed BellKor in Big Chaos and taken the lead with a score of 0.8597. After a bit of jockeying, the leaderboard settled into a new equilibrium, with the two teams tied at 0.8596. As the weeks dragged on, GPT narrowed the gap, but the front-runners continued to inch ahead. I’d check the standings every day, and often several times a day, wincing on the rare days when a new score was posted. Their progress was so slow that I felt we had plenty of time. I was wrong.

Time was running out in the spring of 2009, but nobody knew it. Then suddenly, on June 26, a newly formed team reached the goal. The competition promptly shifted into overdrive, as the contest included a one-month “last call” for all comers to take (or defend) the lead. The team with the best score at the end of that month would be declared the winner.

Endgame

Late afternoon on Friday, June 26, I checked the leaderboard, and what I saw felt like a punch in the gut. A new team—BellKor’s Pragmatic Chaos—had broken the million-dollar barrier, with a score of 0.8558. They hadn’t won yet, though. The contest rules provided for a one-month “last call” during which everyone had a final shot at the prize, and the team with the best score at the end of this period would be declared the winner. So we weren’t toast, but things didn’t look good. We were 36 basis points behind, a gap that felt as wide as the Grand Canyon. We’d been running an ultramarathon—months-long for me; years-long for some of my teammates—and only the day before, we’d been on the heels of the leaders. Now, just short of the finish line, we looked up and found they were so far ahead that we had to squint to see them.


With so little time left, GPT became much more collaborative, with most members exchanging detailed information about their algorithms via email and chat sessions. Everyone else was scrambling to catch up as well. A few leaders were on a quixotic quest to win on their own, but many could see that there was strength in numbers. During the first two weeks of the final month, another coalition formed. Vandelay Industries, the name of a fake company from an episode of Seinfeld, and Opera Solutions, a real consulting firm, united and began aggressively recruiting anyone in the top 100. Soon their score was nearly the equal of ours, but even so, we were both still far behind. It was clear to all what we had to do. We negotiated a merger and a 50/50 split of the prize money, halving the value of my basis points. We were now a group of 30 people, and since the technique of blending models is known in the trade as ensemble learning, we named ourselves The Ensemble.

We worked hard but in stealth, keeping our new name off the leaderboard. BellKor’s Pragmatic Chaos might have tried to recruit the uncommitted if they felt threatened, so we hoped to lull them into complacency by keeping our existence a secret. We had developed accurate ways to estimate our score in-house, and before long we determined that we could surpass the 0.8563 million-dollar barrier. Meanwhile, the opposition also inched ahead, to 0.8555. Little did they know that their lead was shrinking.

Our efforts intensified as the contest entered its final week. With team members in Europe, India, China, Australia, and the United States, we were working the problem 24/7. We now had thousands of models, and there were countless ways to blend them. We could even blend a collection of blends. In fact, our best solution was many-layered—a blend of blends of blends of blends. If we won, we’d have to document our methods before being awarded the prize, and I began to worry whether we’d be able to reconstruct what we had done. With so little time left, though, we decided to deal with that later.

The competition was scheduled to end on Sunday, July 26, 2009, at 1:42 p.m. Chicago time. That Thursday, we obtained a thrilling result: 0.8554. We could take the lead! Ah, but should we do so publicly? As we phrased it in our email discussions, should we decloak? Our members’ agendas varied—some of us were all about winning, while others were more interested in the publicity. If we decloaked too soon, we might spur BellKor’s Pragmatic Chaos to work extra hard or go on a recruiting binge. On the other hand, our window of opportunity might not last. If our next submission was merely the runner-up, we’d get less media attention than if we had suddenly seized first place. Ultimately, we decloaked on Saturday afternoon. With less than 24 hours to go, we were on top.

However, we were Number One with an asterisk. Remember that the leaderboard posted the quiz set’s results, while the true winner would be the highest performer on the test set—a ranking known only to the Netflix engineers running the contest. Since both sets were drawn from the same data, the two scores should be close, but precisely how close? Our learning and tweaking process had been influenced in subtle ways by feedback from the quiz set—an example of what’s known in the education biz as “teaching to the test.” What would happen now that the test itself had changed?

Our first-place debut brought a rush of adrenaline, but the game was far from over. We planned to make one last submission a few minutes before the contest ended. We worked frantically through Saturday night and into Sunday morning. With 20 minutes to go, BellKor’s Pragmatic Chaos matched our quiz score of 0.8554. A cacophony of emails and chat messages ensued, and as the final minutes ticked away, Peng Zhou announced from Shanghai that he thought he could achieve 0.8553. We sent in his blend, retook the lead, and the deadline passed. We had finished first, but had we won?

Shortly after the final buzzer sounded, Netflix announced on the Netflix Prize discussion board that two teams had qualified, and that their submissions were being evaluated. We still didn’t know who’d won, but at least we knew we’d achieved a 10 percent (or better!) improvement on the test set. We were relieved that most of our accuracy on the quiz set had carried over.

In previous years, the Progress Prize winners had been notified via email shortly after the deadline but well before any public announcement. We had a gentleman’s agreement with our rivals that if either team received such an email, we’d notify the other. An agonizing 90 minutes passed, and finally, around 3:00 p.m., we got the email we dreaded. BellKor’s Pragmatic Chaos had won.

I went for a long jog to clear my head. I was a little depressed, but I wasn’t feeling so bad. It was safe to say that more of their original work was in our solution than vice versa, since they had been required to publish two papers in order to win the Progress Prizes. I was also in awe of Pragmatic Theory, who had pulled off a stunning victory in a field where they were newcomers working in their spare time. I couldn’t begrudge them the victory.

Netflix’s official stance for the next two months was that the contest had not yet been decided, as the winning software was still being validated. This led to a fair amount of confusion. We still held first place on the leaderboard, and it was entirely reasonable for a casual observer to assume that we had won. Netflix asked us to say only that we were happy to have qualified, which led to some awkward situations.

Netflix finally announced the winner at a press conference in New York on September 21. It was there that we learned that we had, in fact, tied, with both teams scoring an RMSE of 0.8567 on the test set. However, BellKor’s Pragmatic Chaos had sent in their submission 20 minutes before we did. The pain of losing by such a minuscule margin was somewhat allayed by being quoted in the New York Times, and I bought several copies as souvenirs.

The press conference was my first opportunity to meet both my teammates and my competitors. I asked the Pragmatic Theory duo whether they were ready to quit their jobs and become machine-learning researchers. I was only half joking, but Martin Piotte just shrugged and said he might take up another hobby instead. After less than 18 months in the field, and having published perhaps the year’s most important paper, I guess he felt it was time to move on.

At the awards ceremony that followed, some members of BellKor’s Pragmatic Chaos gave a talk outlining the advances made in their final push. It turned out that Chabbert and Piotte had used the dates on which the ratings had been made in a very creative way. A movie could be rated in one of two contexts: either within a few days of watching it (Netflix asks customers to rate the movies they’ve just rented), or when browsing the website and rating previously seen movies. If a customer rated dozens or hundreds of movies on a single day, it was a safe bet that most of those films had been seen a long time ago. However, certain movies got better ratings immediately after viewing, while others fared better in retrospect. I had done a little work along these lines myself, and had squeezed out an additional 1-basis-point share in GPT as a result, but evidently the effect was more significant than I realized. Pragmatic Theory modeled it in detail and got a significant accuracy boost as a consequence.

Another interesting advance came from Big Chaos’s Austrians. Most contestants optimized each individual model’s accuracy on its own, and then blended it into the collection and crossed their fingers, hoping that the overall prediction improved. Big Chaos had found a way to train the individual models to optimize their contributions to the blend, rather than optimizing their own accuracy.

The power of blending, which the contest demonstrated over and over again, was the Netflix Prize’s take-away lesson. You could even reverse-engineer a blend of models, using the results to design an “integrated” single model that incorporated the effects captured by the various simpler models. Indeed, Pragmatic Theory created a single model that scored 0.8713, equal to the RMSE of the 100-model blend that had won BellKor the first Progress Prize.

Although not all of the techniques involved in the winning solution have been incorporated into the working version of Cinematch, many of them have, and Netflix has seen increased customer loyalty since implementing these advances. Hastings, the Netflix CEO, said in the New York Times that the contest had been “a big winner” for the company. As for myself, I didn’t quite finish as a winner in the formal sense, but I’m still thrilled with how the experience turned out. The contest has led to actual paid consulting work, and a group of my teammates and I are writing a paper on some of our techniques. Many, many other papers have and will come out of the contest, to the great benefit of the broader machine-learning research community.
Only one team won the million dollars, but the Netflix competition ended up producing prizes of many kinds.

Joseph Sill is an analytics consultant. He earned his Caltech PhD in 1998, and a BS in applied math from Yale in 1993, where he was elected to Phi Beta Kappa. Before competing for the Netflix Prize, he had worked for Citadel Investment Group and NASA’s Ames Research Center. This article was edited by Douglas L. Smith.

It ain’t over till it’s over, as Yogi Berra used to say. Well, it’s finally over, and 20 minutes made all the difference.
