The Million Dollar Programming Prize
Netflix's bounty for improving its movie-recommendation software is almost in the bag. Here is one team's account.
BY ROBERT M. BELL, JIM BENNETT, YEHUDA KOREN & CHRIS VOLINSKY
IT'S 7:50 P.M. on 1 October 2007 at AT&T Labs, in Florham Park, N.J., and three of us are frantically hitting the refresh buttons on our browsers. We have just submitted our latest entry in the year-old Netflix Prize competition, which offers a grand prize of US $1 million for an algorithm that's 10 percent more accurate than the one Netflix uses to predict customers' movie preferences. Although we have not reached that milestone, we are hoping at least to do better than anyone else has done so far; if we can make it to 8 p.m. with the best score, we'll win a $50 000 Progress Prize. For most of the summer we'd been ahead of our nearest rivals by a comfortable margin, and as recently as 36 hours before this moment, our victory still seemed to be a slam dunk. The previous day, though, the lead had started to slip away from us. First, the teams then in fifth and sixth places merged, combining their talents to vault into second place, making us nervous enough to submit our best effort, which we had been saving for a rainy day. But before our improved score appeared, we were hit by another combination when our two remaining serious rivals joined forces to tie us. Worse, their entry had come
72 seconds before ours, meaning that in the case of a tie, they'd win. Seventy-two seconds! Could we lose this thing for being a minute too late? Then we realized that there were still 25 hours left, and we still had one more chance. We had to make it count. We began to receive offers from other competitors to combine our scores with theirs. We politely declined them and planned strategies for our last submission. Sure enough, these bumped up our score by a few hundredths of a percent, at which point we could only wait to see the final score from our newly allied foes. Refresh...refresh...refresh...
SINCE 1997, WHEN NETFLIX, of Los Gatos, Calif., came up with the idea of sending movie DVDs through the mail to subscribers, its customer base has grown to 10 million. That success stems, in part, from the company's giving quick and easy access to movies. But just as important is the Netflix
recommender system, Cinematch, which helps customers get the most out of their memberships. The better the system predicts people's likes and dislikes, the more they'll enjoy the movies they watch and the longer they'll keep up their subscriptions. As a new subscriber to Netflix, you have several ways to choose movies on the company's Web site. You can browse by genre or search by keyword for a title, actor, or director. After receiving your selections by mail or over the Internet and viewing them, you return to the site to order more movies. At any time, Cinematch lets you rate any movie you've seen by clicking on one to five stars. As is the case with other recommender systems, such as those of Amazon, the Internet Movie Database, and Pandora, it's in the customer's interest to vote thoughtfully, because doing so helps Netflix figure out his or her tastes. Yet even if the customer declines to offer this feedback, Netflix still notes which movies the subscriber actually orders. After the customer has rated a handful of movies, the algorithm
will start recommending titles based on the ratings it predicts the customer will give.
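In other words, the recommender's core loop is to score every unseen title with a predicted rating and surface the highest-scoring ones. Here is a minimal sketch of that loop; the predict_rating callable is a stand-in for whatever model actually supplies the predictions.

```python
from typing import Callable, Dict, List, Tuple

def recommend_top_n(
    user_id: str,
    all_movies: List[str],
    already_rated: Dict[str, float],             # titles this customer has rated
    predict_rating: Callable[[str, str], float], # stand-in for the real model
    n: int = 10,
) -> List[Tuple[str, float]]:
    """Rank the titles a customer has not yet rated by their predicted rating."""
    candidates = [m for m in all_movies if m not in already_rated]
    scored = [(m, predict_rating(user_id, m)) for m in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]
```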
THE NEIGHBORHOOD MODEL: The nearest-neighbor method works on the principle that a person tends to give similar ratings to similar movies. In the illustration, Joe likes three movies, so to make a prediction for him, find users who also liked those movies and see what other movies they liked. Here the three other viewers all liked Saving Private Ryan, so that is the top recommendation. Two of them liked Dune, so that's ranked second, and so on.
RECOMMENDING MOVIES that customers will enjoy is a complex business. The easy part is gathering the data, which Netflix now accumulates at the rate of some 2 million ratings a day. Much tougher is to find patterns in the data that tell the company which movie a customer would enjoy, if only the person would watch it. Netflix developed Cinematch to tackle this job using a well-known prediction technique (described below), which it designed to handle the billion or so ratings it had already logged. However, while incorporating so many ratings added to accuracy, it also took a lot of work just to keep up with the increasing scale, let alone to test alternative prediction schemes. The Netflix researchers were nevertheless curious about the many other techniques for making similar predictions that had been published in the scholarly literature. The problem was that those studies had relied on public data sets containing on the order of a few million ratings, and it would take the small Netflix team a long time to explore how well these alternative techniques worked at a scale a thousand times as large. That is, if they did all the work themselves. Reed Hastings, the chief executive of Netflix, suggested running a contest. He observed that Netflix had the means, the motive, and the opportunity. And the company had already staged a contest internally between the standard Cinematch system and an alternative algorithm, which did slightly, tantalizingly better. The Netflix team came up with the basic structure of the contest. They would provide 100 million ratings that 480 000 anonymous customers had given to 17 000 movies. The data set would just fit in the main memory of a typical laptop, allowing almost anyone to compete. Netflix would withhold 3 million of the most recent ratings and ask the contestants to predict them.
A Netflix computer would assess each contestant's 3 million predictions by comparing them with the actual ratings. The system would use the traditional metric of predictive accuracy, the root-mean-squared error (RMSE): the more accurate a set of predictions, the smaller the RMSE. The score would then be reported back immediately to the contestant and reflected on an online leaderboard for all to see. Each such scoring provides valuable information about the hidden ratings, so valuable, in fact, that under certain circumstances it could be used to game the system. The teams were therefore scored once a day, at most. But to help teams estimate how well they might do, Netflix also provided each of them with a small representative data set and the score Cinematch had been able to attain on it. Contestants could use that set to test their systems as often as they wanted. On 2 October 2006, Netflix launched the competition, and within days thousands of teams from around the world signed up. Within weeks the Netflix Prize Web site was getting hundreds of submissions per day. An online forum was created so that participants could share ideas and techniques, even code. Even more gratifying to Netflix, within months a handful of teams did several percent better than Cinematch. The question then was how much the accuracy would improve in the first year.
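For the record, the scoring metric is easy to state and to compute. A sketch, assuming the predictions and the withheld ratings arrive as two parallel lists:

```python
import math
from typing import Sequence

def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root-mean-squared error, the metric used to score each submission."""
    assert len(predicted) == len(actual) and len(actual) > 0
    total = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(total / len(actual))

# Example: predicting 3.5 stars for a movie the customer actually rated 4
# contributes (3.5 - 4)**2 = 0.25 to the mean inside the square root.
```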
THE LATENT-FACTOR APPROACH: A second, complementary method scores both a given movie and viewer according to latent factors, themselves inferred from the ratings given to all the movies by all the viewers. The factors define a space that at once measures the characteristics of movies and the viewers' interest in those characteristics. [The illustration plots movies such as Amadeus, Braveheart, Dreamgirls, Mission: Impossible III, Terms of Endearment, 48 Hours, Norbit, and Crossroads in a two-dimensional space whose axes run from serious to escapist and from chick flicks to guy movies.] Here we would expect the fellow in the southeast corner of the graph to love Norbit, to hate Dreamgirls, and, perhaps, to rate Braveheart about average.
LIKE MOST OF THE other top competitors, the three of us at AT&T Labs consulted the rich body of research on ways of solving problems in this domain, known as collaborative filtering. One of the main areas of collaborative filtering we exploited is the nearest-neighbor approach. A movie's neighbors in this context are other movies that tend to be scored most similarly when rated by the same viewer [see illustration, The Neighborhood Model]. For example, consider Saving Private Ryan (1998), a war movie directed by Steven Spielberg and
starring Tom Hanks. Its neighbors may include other war movies, movies directed by Spielberg, or movies starring Tom Hanks. To predict a particular viewer's rating, we would look for the nearest neighbors to Saving Private Ryan that the viewer had already seen and rated. For some viewers, it may be easy to find a full allotment of close neighbors; for many others, we may discover only a handful of neighboring movies. Our version of the nearest-neighbor approach predicted ratings using a weighted average of the viewer's previous ratings on up to 50 neighboring movies. (We have since developed a way to use all past ratings, allowing an unlimited number of neighbors.)
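In code, the prediction step boils down to a similarity-weighted average. The sketch below is a generic item-based nearest-neighbor predictor, not our exact implementation; it assumes the movie-to-movie similarities have already been computed, for example from correlations between the ratings of viewers who rated both movies.

```python
from typing import Dict, Tuple

def predict_by_neighbors(
    viewer_ratings: Dict[str, float],          # movies this viewer has rated
    target_movie: str,
    similarity: Dict[Tuple[str, str], float],  # precomputed movie-to-movie similarity
    k: int = 50,                               # cap on neighbors, as described above
) -> float:
    """Similarity-weighted average of the viewer's ratings on the target's neighbors."""
    if not viewer_ratings:
        return 3.0  # no information at all: fall back to a mid-scale guess
    # Keep the k rated movies most similar to the target.
    neighbors = sorted(
        viewer_ratings,
        key=lambda m: similarity.get((target_movie, m), 0.0),
        reverse=True,
    )[:k]
    num = sum(similarity.get((target_movie, m), 0.0) * viewer_ratings[m] for m in neighbors)
    den = sum(abs(similarity.get((target_movie, m), 0.0)) for m in neighbors)
    if den == 0.0:
        # None of the rated movies resembles the target: use the viewer's own average.
        return sum(viewer_ratings.values()) / len(viewer_ratings)
    return num / den
```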
A second area of collaborative-filtering research we pursued involves what are known as latent-factor models. These score both a given movie and a given viewer according to a set of factors, themselves inferred from patterns in the ratings given to all the movies by all the viewers [see illustration, The Latent-Factor Approach]. Factors for movies may measure comedy versus drama, action versus romance, and orientation to children versus orientation to adults. Because the factors are determined automatically by algorithms, they may correspond to hard-to-describe concepts such as quirkiness, or they may not be interpretable by humans at all. Factors for viewers measure how much the viewer likes movies
that score highly on the corresponding movie factor. Thus, a movie may be classified as a comedy, and a viewer may be classified as a comedy lover. The model may use 20 to 40 such factors to locate each movie and viewer in a multidimensional space. It then predicts a viewer's rating of a movie according to the movie's score on the dimensions that person cares about most. We can put these judgments in quantitative terms by taking the dot (or scalar) product of the locations of the viewer and the movie.
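The prediction itself is just that dot product; the harder part is inferring the factor vectors from the known ratings. The sketch below uses plain stochastic gradient descent on the squared error, one common way to fit such a model rather than our exact method; the factor count, learning rate, and regularization constant are illustrative, and per-viewer and per-movie rating biases are omitted for brevity.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

def train_latent_factors(
    ratings: List[Tuple[str, str, float]],  # (viewer, movie, stars) triples
    n_factors: int = 20,
    n_epochs: int = 30,
    lr: float = 0.01,    # learning rate (illustrative)
    reg: float = 0.05,   # regularization strength (illustrative)
):
    """Fit one factor vector per viewer and per movie by stochastic gradient descent."""
    rng = random.Random(0)
    viewer_f: Dict[str, List[float]] = defaultdict(
        lambda: [rng.gauss(0.0, 0.1) for _ in range(n_factors)])
    movie_f: Dict[str, List[float]] = defaultdict(
        lambda: [rng.gauss(0.0, 0.1) for _ in range(n_factors)])

    for _ in range(n_epochs):
        for viewer, movie, stars in ratings:
            p, q = viewer_f[viewer], movie_f[movie]
            err = stars - sum(pi * qi for pi, qi in zip(p, q))  # residual of the dot product
            for i in range(n_factors):
                # Move both vectors to shrink the error, with a pull toward zero (shrinkage).
                p[i], q[i] = (p[i] + lr * (err * q[i] - reg * p[i]),
                              q[i] + lr * (err * p[i] - reg * q[i]))
    return viewer_f, movie_f

def predict(viewer_f, movie_f, viewer: str, movie: str) -> float:
    """The predicted rating is the dot product of the two factor vectors."""
    return sum(pi * qi for pi, qi in zip(viewer_f[viewer], movie_f[movie]))
```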
Both collaborative-filtering techniques work even if you don't know a thing about what's in the movies themselves. All you need to care about is how the viewers rate the movies. However, neither approach is a panacea. We found that most nearest-neighbor techniques work best on 50 or fewer neighbors, which means these methods can't exploit all the information a viewer's ratings may contain. Latent-factor models have the opposite weakness: They are bad at detecting strong associations among a few closely related films, such as The Lord of the Rings trilogy (2001 to 2003). Because these two methods are complementary, we combined them, using many versions of each in what machine-learning experts call an ensemble approach. This allowed us to build systems that were simple and therefore easy to code and fast to run. What's more, our ensemble approach was robust enough to protect against some of the problems that arise within the system's individual components. Indeed, the solution we had just submitted on 1 October 2007 was a linear combination of 107 separate sets of predictions, using many variations on the above themes and different tuning parameters. Even so, the biggest improvements in accuracy came from relatively few methods. The lesson here is that having lots of ways to skin this particular cat can be useful for gaining the incremental improvements needed to win competitions, but practically speaking, excellent systems can be built using just a few well-selected strategies.
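One straightforward way to choose the blending weights, sketched here rather than reproducing our actual procedure, is a least-squares fit: pick the weights that minimize the RMSE of the combined prediction on a held-out set of ratings. The component predictors named in the usage comments are hypothetical.

```python
import numpy as np

def fit_blend_weights(component_preds: np.ndarray, probe_ratings: np.ndarray) -> np.ndarray:
    """Least-squares weights for combining component predictors.

    component_preds: shape (n_ratings, n_components), one column per method.
    probe_ratings:   shape (n_ratings,), the held-out true ratings.
    """
    weights, *_ = np.linalg.lstsq(component_preds, probe_ratings, rcond=None)
    return weights

def blend(component_preds: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """The submitted prediction is a weighted sum of the individual predictions."""
    return component_preds @ weights

# Hypothetical usage with three component predictors:
# probe_matrix = np.column_stack([nn_preds, factor_preds, baseline_preds])
# w = fit_blend_weights(probe_matrix, probe_ratings)
# final_preds = blend(test_matrix, w)
```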
We don't want to give the impression that heaping together a lot of methods was enough to reach the leaderboard. The Netflix Prize data set poses some huge challenges beyond its immense size. For one, there was enormous variation among viewers and among movies. A rating system must be sensitive enough to tease out subtle patterns associated with those few viewers who rated 1,500 movies, without overfitting, that is, without expecting prolific raters' patterns to apply to the many more users who rated 15 or fewer movies. It can indeed be hard to make predictions for a viewer who has provided just a handful of ratings. We improved on existing ad hoc methods, aiming to address this concern more rigorously.
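One standard remedy for this imbalance, shown here in simplified form rather than as the specific technique we used, is to shrink each per-viewer estimate toward the global average, trusting the viewer's own data more as the number of ratings grows.

```python
from typing import List

def shrunk_mean(viewer_ratings: List[float], global_mean: float, strength: float = 25.0) -> float:
    """Estimate a viewer's average rating, pulled toward the overall average.

    A viewer with 1,500 ratings keeps an estimate close to his or her raw average,
    while a viewer with 5 ratings is pulled most of the way back toward the global
    mean, which guards against reading too much into a handful of data points.
    The strength constant is illustrative and would be tuned on held-out data.
    """
    n = len(viewer_ratings)
    raw_mean = sum(viewer_ratings) / n if n else global_mean
    return (n * raw_mean + strength * global_mean) / (n + strength)
```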
Another critical innovation involved focusing on which movies a viewer rated, regardless of the scores. The idea is that someone who has rated a lot of fantasy movies will probably like The Lord of the Rings, even if that person has rated the other movies in the category somewhat low. By replacing numerical scores with a binary who-watched-what indicator, the data set is transformed from one with mostly missing pieces (the cases in which users don't rate movies) to one that is completely full (using the value 1 when there is a rating and a 0 when there isn't). This approach nicely complemented our other methods.
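Concretely, the transformation just replaces each known rating with a 1 and each missing entry with a 0, giving a dense matrix that models can use alongside the sparse numerical one. A minimal sketch:

```python
from typing import Dict, List

def who_rated_what(
    ratings: Dict[str, Dict[str, float]],  # viewer -> {movie: stars}
    movies: List[str],
) -> Dict[str, List[int]]:
    """Turn sparse star ratings into a dense binary rated/not-rated matrix."""
    return {
        viewer: [1 if movie in rated else 0 for movie in movies]
        for viewer, rated in ratings.items()
    }

# A viewer who rated only two fantasy titles still yields a complete row of 0s
# and 1s, and the pattern of 1s alone says something about that viewer's tastes.
```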
FINALLY, AT 7:58 on that fateful October evening, all of the scores for the top teams were posted for the last time on the leaderboard, and ours came out highest, with an 8.43 percent improvement on Netflix's algorithm. Our nearest rival scored an 8.38 percent improvement. We didn't do well enough to land a million dollars, but still, we won. While we've since come very close to the goal of 10 percent, experience has shown that each step forward is harder than the one that came before it, presumably because we've already exploited most of the clues in the data. Nonetheless, we
continue to make progress. During 2008 we mined the data for information on how users' behavior changed over time. Later, we joined forces with another team to win the 2008 Progress Prize. Currently we stand at a 9.63 percent improvement and are still working hard on the problem. Now that the confetti has settled, we have a chance to look back on our work and to ask what this experience tells us. First, Netflix has incorporated our discoveries into an improved version of its algorithm, which is now being tested. Second, researchers are benefiting from the data set that the competition made available, and not just because it is orders of magnitude larger than previous data sets. It is also qualitatively better than other data sets, because Netflix gathered the information from paying customers, in a realistic setting. Third, the competition itself recruited many smart people into this line of research. In any case, the new blood promises to quickly improve the state of the art. Such competitions
have invigorated other fields. The various X Prizes that have been offered for advances in genomics, automotive design, and alternative energy have shown an excellent return: By some accounts the recent $10 million Ansari X Prize, awarded for suborbital spaceflight, generated $100 million of private investment in space travel. The competition also validates the concept of collective intelligence popularized in James Surowiecki's 2005 book The Wisdom of Crowds (Anchor Books). He argues that the sum of many independent votes is often superior to any one vote, even if made by the greatest expert. For Netflix, the vast number of independent ratings allows for surprisingly good predictions. The power of this collective intelligence is also being harnessed in, for instance, Amazon.com's product recommendations and the collaborative editing of the online encyclopedia, Wikipedia. With the rise of social networks on the Web, we can expect to see and contribute to even more powerful examples in the coming years.
IF YOU want recommendations and you'd rather not rely exclusively on your customers, you can always do the daring thing and consult actual experts. That's the idea behind Pandora, a free Internet radio service that employs musicians to rate songs according to a checklist of criteria, such as pace, rhythm, even the voice of the performer. The Oakland, Calif.-based company must be doing something right. In the three years since it opened shop, Pandora Media has registered 22 million listeners. So far, all of them are in the United States, although the company is negotiating its way back into Europe, which it left after having problems with music licenses there. About 2 million people listen to the service on a given day, typically while sitting in front of their computers at work or, increasingly, while clutching their iPhones on the commute home, making for an average session of 6 hours. No wonder Pandora streams more data than any other site except YouTube. Here's how it works. The listener creates a virtual channel by selecting a song, artist, or composer. If a song is chosen, the site compares it to its database of 600 000 songs, each rated by one of its musical experts. The site then selects another song it deems to be a close relative and keeps on playing such relatives. (Pandora can't give you your first choice because its licensing contracts ban it from playing songs to order.) When I selected A Hard Day's Night, by the Beatles, the first song I heard was She Loves You, by the same band. I listened for a long time before getting my first choice. You rate a song by clicking on either a thumbs-up or a thumbs-down icon, and the algorithm adjusts its weighting of the musical checklist it uses to select subsequent songs. What's more, a thumbs-down will keep the channel from ever playing the same song again. You have to be careful, because the more thumbs-down you give, the narrower the channel becomes, and "in the extreme case you may thumb yourself into a corner," says Tim Westergren, founder of Pandora. Even then, he notes, you'd only hobble that one channel. Nothing you do on one channel affects the others, and you may create as many channels as you want.
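Pandora has not published its algorithm, but the behavior Westergren describes can be illustrated with a toy model: each song is a vector of expert-assigned attribute scores, each channel keeps a weight per attribute, and a thumbs-up or thumbs-down nudges the weights on the attributes of the song that was playing. Every name and number below is hypothetical.

```python
from typing import Dict

def update_channel(
    weights: Dict[str, float],      # one weight per musical attribute on the channel
    song_genome: Dict[str, float],  # expert-assigned scores for the song just rated
    thumbs_up: bool,
    step: float = 0.1,              # illustrative step size
) -> None:
    """Nudge the channel's attribute weights toward (or away from) the rated song."""
    direction = 1.0 if thumbs_up else -1.0
    for attribute, score in song_genome.items():
        weights[attribute] = weights.get(attribute, 1.0) + direction * step * score
    # A real system would also put a thumbed-down song on a do-not-play list.

def channel_match(weights: Dict[str, float], song_genome: Dict[str, float]) -> float:
    """Candidate songs are ranked by how well their genome matches the channel."""
    return sum(weights.get(a, 1.0) * s for a, s in song_genome.items())
```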
Westergren says he got the idea for Pandora when he was a young musician working on scores with moviemakers who had very different likes and dislikes. He wanted to find a way to encode those differences in a database he dubbed the Music Genome, paying musicians to do the enormous amount of work. It may seem strange to use so much manpower as a supplement to computer power, but it makes sense when humans alone can handle the job, a peculiar field sometimes called artificial artificial intelligence. One example of AAI is setting puzzles, or captchas, for visitors to a Web site to solve, both to prove that they're human beings and not bots and to perform some useful chore, such as deciphering the blurred letters from a scan of an old book. Other AAI programs lure people to do such work by providing entertainment or, as Amazon's Mechanical Turk does, money. Westergren got seed money for the Music Genome in 2000, at the very end of the dot-com bubble. When the bubble burst, he and his colleagues labored almost without income for five years before another injection of capital came through. Even now, Westergren says, Pandora is focused solely on growth and so does not turn a profit. It gets most of its revenue from the banner advertisements its site displays to listeners every time they click on something in the site, something they must do from time to time to prevent Pandora from going silent. It also gets a small royalty whenever a listener buys a song by clicking through to a vendor, such as iTunes or Amazon. One advantage of using experts is that they can categorize songs that are new, by bands that are unknown. They can also provide a way to get at music that fell out of fashion before Internet rating became possible. Such too-new and too-old songs constitute a big part of the long tail, the huge inventory of items that each sell in very small numbers yet collectively amount to a big part of the online marketplace. Mining that tail is one of the main jobs of any recommender system. "In book publishing, genres are the equivalent of what we're doing. A brand-new author can say, 'Mine's a historical mystery novel,' and thus put data into the product without having any customer reviews," Westergren says. "But our theory is that it's not good data, not granular enough and not objective." Tom Conrad, the chief technical officer of Pandora, says that musical genomes sometimes turn up connections you'd
probably never get with other methods. He cites the '80s pop star Cyndi Lauper, who recently recorded a new record that didn't sell in great numbers. "We analyzed it for its genome and found that the record sounds an awful lot like Norah Jones. So we are able to play Lauper's songs when you start a Norah Jones song. There's a Metallica ballad that's musically a nice fit for Indigo Girls. So start an Indigo Girls station and you might get this ballad." Conrad says that Pandora isn't so proud of its expert-rated system that it can't learn from the collaborative-filtering techniques pioneered at Amazon, Apple, Netflix, and other firms. He contends that the two approaches are complementary. "We have benefited by peering inside some of the thinking that went into the Netflix Prize competition, and we've incorporated some of the ideas into our own system," Conrad says. "I'm friendly with the Netflix personalization team; we've talked over the past two years or so. We wanted to have more qualitative information; they wanted more quantitative. Now we both use both." Netflix has human editors who try to capture technical aspects of the movies. When the two approaches meet, experts will use computers as much as computers use experts. We will have achieved the perfect chimera: a man-machine mind meld.
Philip E. Ross