Python Data Analysis: Picking the Top 100 Popular Movies out of 1 Million Records
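Preface
After several days of studying Numpy and Pandas, I feel I have gotten balder, and also stronger.
When it comes to learning, we all know that input alone achieves nothing: once the basics are in place, you also need output.
This time I went to the GroupLens website and found a movie dataset with about a million records; the original post links to the download.
With this data we can do some simple data analysis and pick out the top 100 popular movies.
Enough talk, let's get straight to it.
You are welcome to visit my personal blog so we can learn and improve together: http://syjun.vip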
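Importing the library and the data files
User data
    import pandas as pd

    unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
    # '::' is a multi-character separator, so ask for the Python parser explicitly
    users = pd.read_table('file/users.dat', sep='::', header=None,
                          names=unames, engine='python')
    users.head()
Output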
| | user_id | gender | age | occupation | zip |
|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 2 | M | 56 | 16 | 70072 |
| 2 | 3 | M | 25 | 15 | 55117 |
| 3 | 4 | M | 45 | 7 | 02460 |
| 4 | 5 | M | 25 | 20 | 55455 |
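The `::` separator in the `read_table` call is longer than one character, so pandas cannot use its default C parser for it; without `engine='python'` it falls back to the Python parser and emits a ParserWarning. A minimal sketch on two inline rows, assuming the same column layout (the `sample` text and the name `demo_users` are made up for this illustration and stand in for the real file):

```python
import io
import pandas as pd

# Two rows in the same 'a::b::c' layout as users.dat (inline stand-in, not the real file).
sample = "1::F::1::10::48067\n2::M::56::16::70072\n"

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
demo_users = pd.read_table(
    io.StringIO(sample),
    sep='::',            # multi-character separator
    engine='python',     # stated explicitly so pandas does not have to fall back
    header=None,
    names=unames,
    dtype={'zip': str},  # keep zip codes as text so leading zeros (e.g. 02460) survive
)
print(demo_users)
```

Passing `dtype={'zip': str}` is optional here; it only matters when every zip code in the file looks like a plain number.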
Rating data
    rating_names = ['user_id', 'movie_id', 'rating', 'timestamp']
    ratings = pd.read_table('file/ratings.dat', sep='::', header=None,
                            names=rating_names, engine='python')
    ratings.head()
Output
| | user_id | movie_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |
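Movie data
    movie_names = ['movie_id', 'title', 'genres']
    # encoding='latin-1' assumes the non-UTF-8 movies.dat from the GroupLens download;
    # adjust it if your copy of the file differs
    movies = pd.read_table('file/movies.dat', sep='::', header=None,
                           names=movie_names, engine='python', encoding='latin-1')
    movies.head()
Output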
| | movie_id | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
With the three files loaded, use merge() to combine the three tables into one
    data = pd.merge(pd.merge(users, ratings), movies)
    data.head()
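When no `on` argument is given, `pd.merge` joins on the column names the two frames have in common, so the nested call first joins `users` and `ratings` on `user_id` and then joins that result with `movies` on `movie_id`. A small sketch with made-up rows (the `demo_*` frames are illustrative, not the real data) that also spells the keys out explicitly:

```python
import pandas as pd

demo_users = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
demo_ratings = pd.DataFrame({
    'user_id': [1, 1, 2],
    'movie_id': [10, 20, 10],
    'rating': [5, 3, 4],
})
demo_movies = pd.DataFrame({'movie_id': [10, 20], 'title': ['Movie A', 'Movie B']})

# Implicit keys: merge uses the column names shared by both frames.
implicit = pd.merge(pd.merge(demo_users, demo_ratings), demo_movies)

# Explicit keys: the join columns are visible in the code.
explicit = (
    demo_users
    .merge(demo_ratings, on='user_id')
    .merge(demo_movies, on='movie_id')
)

print(explicit)
```

Naming the keys is a little more typing, but it protects against an accidental join on some other column that happens to share a name.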
Warm-Up
Before we start for real, let's do a few small exercises
Average rating by gender for one movie
Here we use "One Flew Over the Cuckoo's Nest (1975)" as the example
    # Select every row for this movie
    one_movie = data[data.title == "One Flew Over the Cuckoo's Nest (1975)"]
    # Group the rows by the gender column
    one_movie_group = one_movie.groupby('gender')
    # Pick the rating column, then take the mean of each group
    one_movie_group['rating'].mean()
Output
    gender
    F    4.310811
    M    4.418423
    Name: rating, dtype: float64
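Selecting the `rating` column before calling `mean()` limits the aggregation to the one numeric column of interest; pandas 2.x raises a TypeError when a grouped `mean()` is asked to average text columns such as `title`. A minimal sketch on made-up ratings (only the column names match the article's data):

```python
import pandas as pd

# Made-up ratings for one film; only the column names match the article's data.
demo = pd.DataFrame({
    'title': ['Movie A'] * 4,
    'gender': ['F', 'F', 'M', 'M'],
    'rating': [5, 4, 3, 4],
})

# Pick the column first, then aggregate per group.
by_gender = demo.groupby('gender')['rating'].mean()
print(by_gender)
```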
Average rating by gender for all movies
This is where pivot_table() comes to mind; it gives the result with very little effort
    rating_group = data.pivot_table(values='rating', index='title',
                                    columns='gender', aggfunc='mean')
    rating_group.head()
Output
| title | F | M |
|---|---|---|
| $1,000,000 Duck (1971) | 3.375000 | 2.761905 |
| 'Night Mother (1986) | 3.388889 | 3.352941 |
| 'Til There Was You (1997) | 2.675676 | 2.733333 |
| 'burbs, The (1989) | 2.793478 | 2.962085 |
| ...And Justice for All (1979) | 3.828571 | 3.689024 |
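The pivot table above is a grouped mean laid out as a grid: one row per title, one column per gender. The sketch below, on made-up rows, builds the same grid with `groupby` plus `unstack`, and shows that a title with no ratings from one gender gets NaN in that cell:

```python
import pandas as pd

demo = pd.DataFrame({
    'title': ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'gender': ['F', 'M', 'M', 'F', 'F'],
    'rating': [5, 3, 4, 2, 4],
})

pivoted = demo.pivot_table(values='rating', index='title',
                           columns='gender', aggfunc='mean')

# Same numbers via groupby: group on both keys, then move gender into the columns.
unstacked = demo.groupby(['title', 'gender'])['rating'].mean().unstack('gender')

print(pivoted)
```

Any arithmetic on such a grid, for example subtracting the M column from the F column, carries the NaN through for those titles.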
Difference between female and male ratings
Now we can add a new column that shows the rating difference
    rating_group['diff'] = rating_group['F'] - rating_group['M']
    rating_group.head()
Output
| title | F | M | diff |
|---|---|---|---|
| $1,000,000 Duck (1971) | 3.375000 | 2.761905 | 0.613095 |
| 'Night Mother (1986) | 3.388889 | 3.352941 | 0.035948 |
| 'Til There Was You (1997) | 2.675676 | 2.733333 | -0.057658 |
| 'burbs, The (1989) | 2.793478 | 2.962085 | -0.168607 |
| ...And Justice for All (1979) | 3.828571 | 3.689024 | 0.139547 |
The ten movies with the most ratings
    # Number of rows (ratings) per title
    ratings_by_title = data.groupby('title').size()
    ratings_by_title.sort_values(ascending=False).head(10)
Output
    title
    American Beauty (1999)    3428
    Star Wars: Episode IV - A New Hope (1977)    2991
    Star Wars: Episode V - The Empire Strikes Back (1980)    2990
    Star Wars: Episode VI - Return of the Jedi (1983)    2883
    Jurassic Park (1993)    2672
    Saving Private Ryan (1998)    2653
    Terminator 2: Judgment Day (1991)    2649
    Matrix, The (1999)    2590
    Back to the Future (1985)    2583
    Silence of the Lambs, The (1991)    2578
    dtype: int64
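Counting rows per title and sorting the counts can also be written with `Series.value_counts()`, which counts and sorts in descending order in one call. A short sketch on made-up titles (the names `by_size` and `by_value_counts` are only for this illustration):

```python
import pandas as pd

demo = pd.DataFrame({'title': ['A', 'B', 'A', 'C', 'A', 'B']})

# groupby + size counts rows per title; then sort and keep the largest.
by_size = demo.groupby('title').size().sort_values(ascending=False).head(2)

# value_counts does the counting and the descending sort in one call.
by_value_counts = demo['title'].value_counts().head(2)

print(by_size)
```

The `groupby('title').size()` form is still handy when the per-title counts are reused later, as `ratings_by_title` is in this article.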
The twenty movies with the highest average rating
    mean_ratings = data.pivot_table(values='rating', index='title', aggfunc='mean')
    mean_ratings.sort_values(by='rating', ascending=False).head(20)
Output
| title | rating |
|---|---|
| Ulysses (Ulisse) (1954) | 5.000000 |
| Lured (1947) | 5.000000 |
| Follow the Bitch (1998) | 5.000000 |
| Bittersweet Motel (2000) | 5.000000 |
| Song of Freedom (1936) | 5.000000 |
| One Little Indian (1973) | 5.000000 |
| Smashing Time (1967) | 5.000000 |
| Schlafes Bruder (Brother of Sleep) (1995) | 5.000000 |
| Gate of Heavenly Peace, The (1995) | 5.000000 |
| Baby, The (1973) | 5.000000 |
| I Am Cuba (Soy Cuba/Ya Kuba) (1964) | 4.800000 |
| Lamerica (1994) | 4.750000 |
| Apple, The (Sib) (1998) | 4.666667 |
| Sanjuro (1962) | 4.608696 |
| Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) | 4.560510 |
| Shawshank Redemption, The (1994) | 4.554558 |
| Godfather, The (1972) | 4.524966 |
| Close Shave, A (1995) | 4.520548 |
| Usual Suspects, The (1995) | 4.517106 |
| Schindler's List (1993) | 4.510417 |
The top twenty by average rating may well include movies that score very high but were seen by very few people. Don't believe it? Let's check.
We index ratings_by_title with the top twenty titles to see how many times each of those movies was rated
    # Keep the top-20 result under a name so its titles can be reused as an index
    top_20_score = mean_ratings.sort_values(by='rating', ascending=False).head(20)
    ratings_by_title.loc[top_20_score.index]
Output
    title
    Ulysses (Ulisse) (1954)    1
    Lured (1947)    1
    Follow the Bitch (1998)    1
    Bittersweet Motel (2000)    1
    Song of Freedom (1936)    1
    One Little Indian (1973)    1
    Smashing Time (1967)    2
    Schlafes Bruder (Brother of Sleep) (1995)    1
    Gate of Heavenly Peace, The (1995)    3
    Baby, The (1973)    1
    I Am Cuba (Soy Cuba/Ya Kuba) (1964)    5
    Lamerica (1994)    8
    Apple, The (Sib) (1998)    9
    Sanjuro (1962)    69
    Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)    628
    Shawshank Redemption, The (1994)    2227
    Godfather, The (1972)    2223
    Close Shave, A (1995)    657
    Usual Suspects, The (1995)    1783
    Schindler's List (1993)    2304
    dtype: int64
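The counts above show why a mean on its own is misleading: a single 5-star rating is enough to put a title at the top. One way to keep both numbers side by side is to compute them in a single `groupby` pass with `agg`. This is a sketch on made-up rows, not the article's own approach, and the film titles are invented:

```python
import pandas as pd

demo = pd.DataFrame({
    'title': ['Obscure Film', 'Popular Film', 'Popular Film', 'Popular Film'],
    'rating': [5, 5, 4, 4],
})

# One groupby pass yields both the number of ratings and their mean.
stats = demo.groupby('title')['rating'].agg(['count', 'mean'])
stats = stats.sort_values('mean', ascending=False)
print(stats)
```

Sorted by mean alone, the film with one rating comes first, which is exactly the effect seen in the top twenty.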
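The Main Feature
To really find the 100 best movies, two things have to be taken into account:
first, many people have watched the movie; second, its average rating is high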
    # Popular movies: titles with more than 1000 ratings
    hot_movies = ratings_by_title[ratings_by_title > 1000]
    # Look up the average rating of each popular movie in mean_ratings
    hot_movies_rating = mean_ratings.loc[hot_movies.index]
    # Sort by rating (not by title) so that head(100) keeps the 100 best-rated popular movies
    top_100_good_movies = hot_movies_rating.sort_values(by='rating', ascending=False).head(100)
    top_100_good_movies
Output
| title | rating |
|---|---|
| Shawshank Redemption, The (1994) | 4.554558 |
| Godfather, The (1972) | 4.524966 |
| Usual Suspects, The (1995) | 4.517106 |
| Schindler's List (1993) | 4.510417 |
| ... | ... |

100 rows × 1 columns
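The two criteria, enough ratings and a high average, can be folded into one small helper. The function below is a sketch under stated assumptions: `top_movies`, `min_ratings` and `n` are names invented for this illustration, and the rows are made up.

```python
import pandas as pd

def top_movies(data, min_ratings, n):
    """Return the n best-rated titles among those with more than min_ratings ratings."""
    stats = data.groupby('title')['rating'].agg(['count', 'mean'])
    popular = stats[stats['count'] > min_ratings]   # criterion 1: watched by many
    return popular.sort_values('mean', ascending=False).head(n)  # criterion 2: rated highly

demo = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'rating': [5, 4, 4, 3, 3, 4, 5],
})
result = top_movies(demo, min_ratings=2, n=2)
print(result)
```

Title C has a perfect average but only one rating, so it is filtered out before the sort. On the merged `data` frame from this article the equivalent call would be `top_movies(data, 1000, 100)`.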
The world changes because of code. Peace Out