
Homework-1 Solutions

2. Spreadsheet exercise

1. Mean Rating

Best rated movies (mean)

Movie IDs and names                               Mean Ratings
318: Shawshank Redemption, The (1994)             3.6
260: Star Wars: Episode IV - A New Hope (1977)    3.267
541: Blade Runner (1982)                          3.222

2. Rating Count

Most popular movies (# of ratings)

Movie IDs and names                               Rating Count (Popularity)
1: Toy Story (1995)                               17
593: Silence of the Lambs, The (1991)             16
260: Star Wars: Episode IV - A New Hope (1977)    15

3. % of ratings 4+ (liking)

Most liked movies (% of 4+ ratings)

Movie IDs and names                               Percentage (Rating > 3)
318: Shawshank Redemption, The (1994)             70
260: Star Wars: Episode IV - A New Hope (1977)    53.3
3578: Gladiator (2000)                            50

4. Top movies by Co-occurrence with Silence of the Lambs

Top movies for 'Silence of the Lambs' viewers (co-occurrence)

Movie IDs and names                               Co-occurrence with #593
1: Toy Story (1995)                               0.813
260: Star Wars: Episode IV - A New Hope (1977)    0.75
527: Schindler's List (1993)                      0.688

5. Correlation with Silence of the Lambs

Correlation with 'Silence of the Lambs'

Movie IDs and names                               Correlation with #593
2762: Sixth Sense, The (1999)                     0.577
527: Schindler's List (1993)                      0.437
1259: Stand by Me (1986)                          0.396

6. Mean rating difference by gender

Gender differences in mean rating

Movie IDs and names                                       Men avg. rating   Women avg. rating

Men's average most above women's average:
1198: Raiders of the Lost Ark (1981)                      3.667             2

Women's average most above men's average:
2396: Shakespeare in Love (1998)                          2.143             4.25

Smallest gap between men's and women's ratings:
1210: Star Wars: Episode VI - Return of the Jedi (1983)   3                 3

Overall men avg. rating   Overall women avg. rating
2.906                     2.947

Overall difference = 0.041. The overall mean rating of women is slightly higher than that of men.

7. Gender difference in % of 4+ ratings

Gender differences in liking (% of 4+ ratings)

Movie IDs and names                      % of men (rating > 3)   % of women (rating > 3)

Men's liking most above women's:
1198: Raiders of the Lost Ark (1981)     50                      0

Women's liking most above men's:
2396: Shakespeare in Love (1998)         0                       75

Men overall % of 4+ ratings   Women overall % of 4+ ratings
33.9                          42.1

The overall % of 4+ ratings among women is considerably higher than among men.
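The gender comparisons above were done in a spreadsheet; the same per-movie gap can be sketched in a few lines of Python. This is only an illustration: the rows, the movie IDs 10 and 20, and the function names are all made up, and the spreadsheet's actual formulas are not shown in the source.

```python
from collections import defaultdict

# Toy (gender, movieId, rating) rows; all values are made up for illustration.
rows = [
    ("M", 10, 5.0), ("M", 10, 4.0), ("F", 10, 2.0),
    ("M", 20, 2.0), ("F", 20, 4.0), ("F", 20, 5.0),
]

def mean_by_movie_and_gender(rows):
    """Map (movieId, gender) -> mean rating."""
    totals = defaultdict(lambda: [0.0, 0])  # running [sum, count]
    for gender, movie_id, rating in rows:
        totals[(movie_id, gender)][0] += rating
        totals[(movie_id, gender)][1] += 1
    return {key: s / n for key, (s, n) in totals.items()}

def gender_gap(per_group):
    """Map movieId -> men's value minus women's value, for movies both groups rated."""
    movies = {movie_id for movie_id, _ in per_group}
    return {
        m: per_group[(m, "M")] - per_group[(m, "F")]
        for m in movies
        if (m, "M") in per_group and (m, "F") in per_group
    }

gap = gender_gap(mean_by_movie_and_gender(rows))
print(max(gap, key=gap.get))  # movie where men's mean is furthest above women's
print(min(gap, key=gap.get))  # movie where women's mean is furthest above men's
```

The same `gender_gap` helper would work for the liking comparison if it were fed per-group percentages of 4+ ratings instead of means.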

3. Non-Personalized Data Exploration in Python

1. Have Python display how many movies are represented in the file, and also how many users.

No of unique users: 610
No of unique movies: 9724

2. What is the average rating for all users?

Average of all ratings: 3.502

3. **Advanced** Graph the frequency distribution of ratings.
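The first three questions (unique counts, overall mean, frequency distribution) can be sketched with the standard library alone. The solution's own code is not shown in the source, so this is a stand-in under stated assumptions: the rows are toy data, the function names are hypothetical, and a text histogram takes the place of a real plot.

```python
from collections import Counter

# Toy (userId, movieId, rating) rows; all values are made up for illustration.
ratings = [
    (1, 1, 4.0), (1, 318, 5.0),
    (2, 1, 3.5), (2, 593, 4.0),
    (3, 318, 4.5), (3, 593, 2.0),
]

def count_unique(rows):
    """Return (number of distinct users, number of distinct movies)."""
    users = {user_id for user_id, _, _ in rows}
    movies = {movie_id for _, movie_id, _ in rows}
    return len(users), len(movies)

def mean_rating(rows):
    """Mean over every individual rating (not a mean of per-user means)."""
    return sum(rating for _, _, rating in rows) / len(rows)

def rating_distribution(rows):
    """Map rating value -> number of times it occurs, sorted by rating value."""
    counts = Counter(rating for _, _, rating in rows)
    return dict(sorted(counts.items()))

n_users, n_movies = count_unique(ratings)
print(f"No of unique users: {n_users}")
print(f"No of unique movies: {n_movies}")
print(f"Average of all ratings: {mean_rating(ratings):.3f}")
for value, count in rating_distribution(ratings).items():
    print(f"{value:>3} | {'#' * count}")  # one '#' per rating with this value
```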

4. What are the 5 movies with the highest average ratings? (Don't worry about tie-breakers here.) **Advanced** Display the names and not just the movie IDs.

Movie Title                                          Movie ID   Avg. rating   No of ratings
Lamerica (1994)                                      53         5.0           2
Heidi Fleiss: Hollywood Madam (1995)                 99         5.0           2
Lesson Faust (1994)                                  1151       5.0           2
Jonah Who Will Be 25 in the Year 2000 (Jonas q...    3473       5.0           2
Belle époque (1992)                                  6442       5.0           2

5. What are the 5 most popular movies (with the most ratings)?

Title                               Movie ID   rating_count   avg_rating
Forrest Gump (1994)                 356        329            4.164134
Shawshank Redemption, The (1994)    318        317            4.429022
Pulp Fiction (1994)                 296        307            4.197068
Silence of the Lambs, The (1991)    593        279            4.161290
Matrix, The (1999)                  2571       278            4.192446
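The two rankings above come from the same per-movie aggregation, sorted once by mean and once by count. A minimal sketch follows, with toy rows and hypothetical function names. The `min_ratings` parameter is an addition that the exercise does not ask for; it shows one way to keep movies with only a couple of ratings from dominating the top-by-mean list, as the five 5.0 averages from two ratings each do above.

```python
from collections import defaultdict

# Toy (userId, movieId, rating) rows; all values are made up for illustration.
ratings = [
    (1, 10, 5.0), (2, 10, 5.0),                              # two perfect ratings
    (1, 20, 4.5), (2, 20, 4.0), (3, 20, 4.5), (4, 20, 5.0),  # four ratings
    (1, 30, 3.0), (3, 30, 2.0), (4, 30, 3.5),                # three ratings
]

def movie_stats(rows):
    """Return a list of (movieId, mean rating, number of ratings)."""
    by_movie = defaultdict(list)
    for _, movie_id, rating in rows:
        by_movie[movie_id].append(rating)
    return [(m, sum(v) / len(v), len(v)) for m, v in by_movie.items()]

def top_by_mean(rows, n=5, min_ratings=1):
    """Top n movies by mean rating among those with at least min_ratings ratings."""
    eligible = [s for s in movie_stats(rows) if s[2] >= min_ratings]
    return sorted(eligible, key=lambda s: s[1], reverse=True)[:n]

def top_by_count(rows, n=5):
    """Top n movies by number of ratings (popularity)."""
    return sorted(movie_stats(rows), key=lambda s: s[2], reverse=True)[:n]

print(top_by_mean(ratings, n=2))
print(top_by_mean(ratings, n=2, min_ratings=3))
print(top_by_count(ratings, n=2))
```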

6. What are the 5 movies with the highest % of >=4 ratings?

Movie Title                               Movie ID   Liked Percentage
Andrew Dice Clay: Dice Rules (1991)       193609     100.0
Nine Lives of Tomas Katz, The (2000)      27320      100.0
Summer's Tale, A (Conte d'été) (1996)     26928      100.0
Nirvana (1997)                            26985      100.0
Thursday (1998)                           27022      100.0
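The liked percentage is the share of a movie's ratings at or above a threshold. The sketch below uses toy rows and a hypothetical function name. The threshold matters if the data contains half-star ratings: a 3.5 counts as liked under "rating > 3" but not under "rating >= 4".

```python
from collections import defaultdict

# Toy (userId, movieId, rating) rows; all values are made up for illustration.
ratings = [
    (1, 10, 5.0), (2, 10, 4.0), (3, 10, 3.5),
    (1, 20, 4.5), (2, 20, 4.0),
    (3, 30, 3.0),
]

def liked_percentage(rows, threshold=4.0):
    """Map movieId -> percentage of its ratings that are >= threshold."""
    total = defaultdict(int)
    liked = defaultdict(int)
    for _, movie_id, rating in rows:
        total[movie_id] += 1
        if rating >= threshold:
            liked[movie_id] += 1
    return {m: 100.0 * liked[m] / total[m] for m in total}

for movie_id, pct in sorted(liked_percentage(ratings).items()):
    print(f"{movie_id}: {pct:.1f}")
```

As with the mean-rating ranking, a movie with a single high rating scores 100.0, so a minimum rating count may be worth adding before ranking.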

7. What are the five movies that are most commonly co-rated with Toy Story?

Movie Title                                   Movie ID   co_rating
Toy Story (1995)                              1          1.000000
Forrest Gump (1994)                           356        0.716279
Pulp Fiction (1994)                           296        0.655814
Shawshank Redemption, The (1994)              318        0.637209
Star Wars: Episode IV - A New Hope (1977)     260        0.623256
Jurassic Park (1993)                          480        0.613953
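One plausible definition of the co_rating column is the fraction of Toy Story raters who also rated the other movie, which would explain the 1.000000 for Toy Story itself. The source does not show its formula, so this is an assumption; the rows and function names are illustrative.

```python
from collections import defaultdict

# Toy (userId, movieId, rating) rows; all values are made up for illustration.
ratings = [
    (1, 1, 4.0), (2, 1, 3.0), (3, 1, 5.0), (4, 1, 2.0),
    (1, 356, 5.0), (2, 356, 4.0), (3, 356, 4.5),
    (1, 296, 4.0), (4, 296, 3.5), (5, 296, 5.0),
]

def raters_by_movie(rows):
    """Map movieId -> set of users who rated it."""
    raters = defaultdict(set)
    for user_id, movie_id, _ in rows:
        raters[movie_id].add(user_id)
    return dict(raters)

def co_rating(rows, anchor):
    """Map movieId -> share of the anchor movie's raters who also rated movieId."""
    raters = raters_by_movie(rows)
    anchor_users = raters.get(anchor, set())
    if not anchor_users:
        raise ValueError(f"no ratings found for movie {anchor}")
    return {m: len(users & anchor_users) / len(anchor_users)
            for m, users in raters.items()}

scores = co_rating(ratings, anchor=1)
for movie_id, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(movie_id, round(score, 6))
```

Because the denominator depends only on the anchor movie, this measure favors movies that are popular overall, which matches the list above.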

8. Top five movies people are likely to rate if they've rated Toy Story (Bayes' rule)

Movie Title                                        Movie ID   co_rating_bayes
Toy Story (1995)                                   1          2.837209
Toy, The (1982)                                    4929       2.837209
Better Than Sex (2000)                             4877       2.837209
Focus (2001)                                       4871       2.837209
Last Castle, The (2001)                            4866       2.837209
Machine Girl, The (Kataude mashin gâru) (2008)     60522      2.837209
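A common way to correct co-rating for popularity is the lift ratio P(X and Y) / (P(X) P(Y)). Whether the solution used exactly this formula is an assumption, but the numbers are consistent with it. The reported 2.837209 equals 610/215 to six decimals, and 0.716279 × 215 ≈ 154, which suggests 215 of the 610 users rated Toy Story. Under this formula, every movie whose raters all rated Toy Story reaches the same ceiling, N / n_anchor, no matter how few raters it has. The sketch below uses toy rows and hypothetical names.

```python
from collections import defaultdict

# Toy (userId, movieId, rating) rows; all values are made up for illustration.
ratings = [
    (1, 1, 4.0), (2, 1, 3.0), (3, 1, 5.0), (4, 1, 2.0),
    (1, 356, 5.0), (2, 356, 4.0), (3, 356, 4.5),
    (1, 296, 4.0), (4, 296, 3.5), (5, 296, 5.0),
    (2, 999, 4.0),  # a single rater, who also rated the anchor movie
]

def lift(rows, anchor):
    """Map movieId -> P(anchor and movie) / (P(anchor) * P(movie))."""
    raters = defaultdict(set)
    for user_id, movie_id, _ in rows:
        raters[movie_id].add(user_id)
    n_users = len({user_id for user_id, _, _ in rows})
    anchor_users = raters.get(anchor, set())
    if not anchor_users:
        raise ValueError(f"no ratings found for movie {anchor}")
    return {
        m: len(users & anchor_users) * n_users / (len(anchor_users) * len(users))
        for m, users in raters.items()
    }

scores = lift(ratings, anchor=1)
for movie_id, score in sorted(scores.items()):
    print(movie_id, round(score, 6))
```

In the toy data the ceiling is 5/4 = 1.25, and movie 999 reaches it with a single rater. Requiring a minimum number of raters per movie is one way to keep such ties from filling the top of the list.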

9. Five movies with the highest ratings correlation with Toy Story

Movie Title                                   Movie ID   correlation
Toy Story (1995)                              1          1.0
Package, The (1989)                           4632       1.0
Gas, Food, Lodging (1992)                     6851       1.0
Room in Rome (Habitación en Roma) (2010)      85885      1.0
White Man's Burden (1995)                     209        1.0
Senna (2010)                                  85774      1.0
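Correlations of exactly 1.0 typically arise when two movies share very few raters: with only two co-raters, any pair of non-constant rating vectors correlates at +1 or -1. The sketch below computes Pearson correlation over co-rated users and exposes a `min_overlap` parameter. That parameter, the toy rows, and the function names are assumptions for illustration and are not taken from the solution.

```python
from math import sqrt

# Toy (userId, movieId, rating) rows; all values are made up for illustration.
ratings = [
    (1, 1, 4.0), (2, 1, 3.0), (3, 1, 5.0), (4, 1, 2.0),
    (1, 50, 5.0), (2, 50, 4.0),                              # 2 co-raters
    (1, 60, 4.0), (2, 60, 2.0), (3, 60, 5.0), (4, 60, 3.0),  # 4 co-raters
]

def pearson(xs, ys):
    """Pearson correlation, or None if it is undefined for these inputs."""
    n = len(xs)
    if n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant vector has no defined correlation
    return cov / sqrt(var_x * var_y)

def movie_correlation(rows, movie_a, movie_b, min_overlap=2):
    """Correlate two movies over users who rated both; None if overlap is too small."""
    by_movie = {movie_a: {}, movie_b: {}}
    for user_id, movie_id, rating in rows:
        if movie_id in by_movie:
            by_movie[movie_id][user_id] = rating
    common = sorted(by_movie[movie_a].keys() & by_movie[movie_b].keys())
    if len(common) < min_overlap:
        return None
    return pearson([by_movie[movie_a][u] for u in common],
                   [by_movie[movie_b][u] for u in common])

print(movie_correlation(ratings, 1, 50))                 # two co-raters only
print(movie_correlation(ratings, 1, 50, min_overlap=3))  # filtered out
print(movie_correlation(ratings, 1, 60))
```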

4. Content-Based Filtering Spreadsheet Exercises

1. How many user/movie predictions (of the 241 for which we have ratings) match the ratings on like/dislike (i.e., if the user rated below 3, the model gives a negative score, and vice versa)?

Correctly predicted ratings: 191 / 178
Total available ratings: 241

2. Which two movies does the model predict that user 5347 will dislike the most?

Movie Titles
1: Toy Story (1995)
2028: Saving Private Ryan (1998)
527: Schindler's List (1993)

3. Which two people does the model predict will like Movie 34 (Babe) the most?

User IDs
4117
5450
1940

After normalization on number of genres in a movie

1. How many user/movie predictions (of the 241 for which we have ratings) match the ratings on like/dislike (i.e., if the user rated below 3, the model gives a negative score, and vice versa)?

Correctly predicted ratings: 192 / 189
Total available ratings: 241

2. Which two movies does the model predict that user 5347 will dislike the most?

Movie Titles
2028: Saving Private Ryan (1998)
527: Schindler's List (1993)
1: Toy Story (1995)

3. Which two people does the model predict will like Movie 34 (Babe) the most?

User IDs
4117
1940
5450
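The spreadsheet model itself is not reproduced in the source, so the following is only a sketch of one common content-based setup, under labeled assumptions. A user profile is the sum of each rated movie's genre vector weighted by (rating - 3), and a prediction is the dot product of the profile with a movie's genre vector. Normalization is read here as dividing each genre vector by the movie's genre count, which is a guess at what "normalization on number of genres" means. The genres, movies, and ratings are toy values.

```python
GENRES = ["Action", "Comedy", "Drama"]

# Toy genre flags per movie and one toy user's ratings; all made up.
movie_genres = {
    "A": [1, 0, 0],
    "B": [1, 1, 1],
    "C": [0, 1, 0],
}
user_ratings = {"A": 5.0, "B": 2.0}

def genre_vector(movie, normalize=False):
    """Genre flags, optionally divided by the number of genres (assumed scheme)."""
    flags = movie_genres[movie]
    n_genres = sum(flags)
    if normalize and n_genres > 0:
        return [flag / n_genres for flag in flags]
    return [float(flag) for flag in flags]

def user_profile(rated, normalize=False, neutral=3.0):
    """Sum of genre vectors weighted by how far each rating is from neutral."""
    profile = [0.0] * len(GENRES)
    for movie, rating in rated.items():
        for i, weight in enumerate(genre_vector(movie, normalize)):
            profile[i] += (rating - neutral) * weight
    return profile

def score(profile, movie, normalize=False):
    """Dot product of the user profile with the movie's genre vector."""
    return sum(p * g for p, g in zip(profile, genre_vector(movie, normalize)))

def agreement_count(rated, normalize=False, neutral=3.0):
    """Count rated movies where the score's sign matches like (>3) or dislike (<3)."""
    profile = user_profile(rated, normalize, neutral)
    return sum(
        1 for movie, rating in rated.items()
        if (rating - neutral) * score(profile, movie, normalize) > 0
    )

print(agreement_count(user_ratings))                  # without normalization
print(agreement_count(user_ratings, normalize=True))  # with normalization
```

In this toy case normalization changes the sign of the prediction for movie B, which shows why the match counts before and after normalization can differ. A rating of exactly 3 is counted as neither like nor dislike here, and that is also an assumption.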

6. Essay Question

Domain 1: Clothing for an online high-fashion retailer

Yes, content-based filtering can be applied to make recommendations in this domain. Most of us can describe our clothing preferences in terms of item attributes such as brand, color, price, material, type of item, or any special features. The items themselves can also be described completely in terms of these attributes. These two necessary conditions allow us to build a content-based model.

To understand the attributes related to high fashion better, we looked at some existing online stores in this domain (e.g., Link 1, Link 2). From such examples, we can see that some of the most important features are 1) the designer/brand, 2) price, 3) a text description of the elements used in a given piece (like “Black satin shawl lapel” and “Self covered buttons” in the first link), 4) material, and 5) size. These features are crucial because in haute couture these elements are often what set the trend for a season. For example, news coverage of a recent fashion week focuses almost entirely on the particular elements that caught people’s eye (a color, a detail, and so on). Such elements then quickly appear everywhere, and people look for them in the pieces they want to buy.

An approach like collaborative filtering is unlikely to work well in this case, because customers here are generally risk-taking and seek new trends that set them apart from others.

Coming to the actual design of this content-based model, the goal of recommendation is for a user to explore some fashions and eventually make a purchase. To support exploration, the home page can show a ranked list of fashion trends derived from the recent transaction data of other users on the site. However, for the main search case, where the user knows what they want, we can use a case-based system with a knowledge base similar to the ‘Ask Ida’ system. This will help users who want to