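Analyzing the Strength of Tennis Players with Bayesian Modeling
Introduction
I finished reading 『StanとRでベイズ統計モデリング』 (Bayesian Statistical Modeling with Stan and R). Chapter 10 of the book, which estimates the strength and performance inconsistency of shogi players, was interesting, so I tried the same thing for tennis, a sport I am personally interested in.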
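When I searched, I found an article that had already tackled the same topic, but as a former member of a university tennis club I could not pass up tennis, so I ran the analysis under slightly different conditions, such as the input data.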
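Related article: 「ベイズモデリングで男子プロテニスの強さを分析してみた」 (I tried analyzing the strength of men's professional tennis players with Bayesian modeling), from the blog 「戦略コンサルで働くデータサイエンティストのブログ」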
Getting and checking the data
I wanted to see the latest trends, so I picked up the 2018 ATP match data from the depths of the internet.
https://github.com/JeffSackmann/tennis_atp
Kaggle also has a dataset, but at the moment it seems to go only up to 2017.
https://www.kaggle.com/gmadevs/atp-matches-dataset
First, I aggregate the data and check the top 10 players by win rate among those who played at least a certain number of matches.

import pandas as pd

# Load the 2018 ATP match data
data = pd.read_csv("../input/atp_matches_2018.csv", encoding="latin-1")

# Count wins per player
win_count = data.groupby("winner_name")["winner_id"].count().reset_index()
win_count.columns = ["name", "win_count"]

# Count losses per player
lose_count = data.groupby("loser_name")["loser_id"].count().reset_index()
lose_count.columns = ["name", "lose_count"]

# Combine wins and losses (players with no wins or no losses get 0)
counts = pd.merge(win_count, lose_count, on="name", how="outer")
counts = counts.fillna(0)

# Compute the win rate and sort by it
counts["win_rate"] = counts["win_count"] / (counts["win_count"] + counts["lose_count"])
counts = counts.sort_values(by="win_rate", ascending=False)

# Top 10 by win rate among players with at least 20 matches
counts[(counts["win_count"] + counts["lose_count"]) >= 20].head(10)

Nadal's win rate of over 90% is astonishing... He does not appear in this list, but Nishikori went 30-15, a win rate of 0.667. Also, because the points awarded differ from match to match, the order here differs from the rankings. (In the rankings, Djokovic, who won two of the four Grand Slams, is No. 1.)
Including players with few matches would likely keep the results from converging, so this time I use only the data of the 99 players who played 20 or more matches.
Model
The model used here is based on the following equations, which are the same as those given on p. 189 of 『StanとRでベイズ統計モデリング』.
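As a sketch, the model can be written out in LaTeX as below. This is reconstructed from the Stan code that appears later in this post, so the symbol names are assumptions chosen to match that code: \mu, \sigma_\mu and \sigma_{pf} correspond to `mu`, `s_mu` and `s_pf`, and Loser_g / Winner_g are the two columns of `LW` for match g.

```latex
\begin{align*}
\text{performance}[g,1] &\sim \mathrm{Normal}\bigl(\mu[\text{Loser}_g],\ \sigma_{pf}[\text{Loser}_g]\bigr) & g &= 1,\dots,G \\
\text{performance}[g,2] &\sim \mathrm{Normal}\bigl(\mu[\text{Winner}_g],\ \sigma_{pf}[\text{Winner}_g]\bigr) & g &= 1,\dots,G \\
\text{performance}[g,1] &< \text{performance}[g,2] & g &= 1,\dots,G \\
\mu[n] &\sim \mathrm{Normal}(0,\ \sigma_\mu) & n &= 1,\dots,N \\
\sigma_{pf}[n] &\sim \mathrm{Gamma}(10,\ 10) & n &= 1,\dots,N
\end{align*}
```

The third line is the only place the match result enters: in each match, the winner's performance is constrained to be larger than the loser's.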
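Here, N is the number of players and n is the player ID. Each player's strength μ[n] (mu in the Stan code) is assumed to follow a normal distribution with mean 0 and standard deviation σ_μ (s_mu in the Stan code).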
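As a weakly informative prior for the standard deviation of performance (s_pf in the Stan code), I use a gamma distribution. The program still runs without the gamma-distribution line, in which case a sufficiently wide uniform prior is apparently assumed, but Stan then failed to converge. In this model the only condition is that the winner's value is larger than the loser's, so there is no reference point for the absolute scale of the values. Specifying the shape of the prior to some extent limits the range the values can take, which seems to make convergence easier.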
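For reference, Gamma(10, 10) (shape 10, rate 10) is a unimodal distribution concentrated around 1: its mean is 1 and its standard deviation is about 0.32.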
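To get a feel for this prior without a plot, here is a minimal sketch that evaluates the Gamma(10, 10) density at a few points using only the standard library. It assumes Stan's parameterization of `gamma(alpha, beta)` as shape and rate; the helper name `gamma_pdf` is my own and not part of any library.

```python
from math import exp, lgamma, log, sqrt

# Stan's gamma(alpha, beta) uses alpha = shape and beta = rate (inverse scale)
shape, rate = 10.0, 10.0


def gamma_pdf(x, shape, rate):
    """Density of Gamma(shape, rate) at x > 0, computed on the log scale."""
    log_pdf = shape * log(rate) - lgamma(shape) + (shape - 1) * log(x) - rate * x
    return exp(log_pdf)


mean = shape / rate          # shape / rate
sd = sqrt(shape) / rate      # sqrt(shape) / rate
mode = (shape - 1) / rate    # (shape - 1) / rate, valid for shape >= 1

print(f"mean={mean:.3f}, sd={sd:.3f}, mode={mode:.3f}")
for x in (0.25, 0.5, 0.9, 1.0, 1.5, 2.0):
    print(f"pdf({x}) = {gamma_pdf(x, shape, rate):.4f}")
```

Most of the mass sits between roughly 0.5 and 1.7, so the prior keeps each player's s_pf on the order of 1 while still letting it vary between players.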
To feed the data into the model, I preprocess it as follows.

# Extract players who appeared in at least 20 matches
# (filter the Series directly so this does not depend on the column name pandas assigns)
player_attend = pd.concat([data["winner_name"], data["loser_name"]], axis=0).value_counts()
players_use = list(player_attend[player_attend >= 20].index)

# Assign sequential IDs (1..N) to the players
id_player = pd.DataFrame()
ids = [i + 1 for i in range(len(players_use))]
id_player["winner_id2"] = ids  # suffix 2 because the name clashes with a column in the original file
id_player["winner_name"] = players_use
data = pd.merge(data, id_player, on="winner_name", how="left")
id_player.columns = ["loser_id2", "loser_name"]
data = pd.merge(data, id_player, on="loser_name", how="left")

# Keep only matches in which both players are in the selected set
data = data[(data["winner_name"].isin(players_use)) & (data["loser_name"].isin(players_use))]

# Input for Stan: one row per match, columns are (loser, winner)
LW = data[["loser_id2", "winner_id2"]].astype(int)

The Stan code is as follows.

import pystan  # PyStan 2 API (StanModel / sampling)

model = """
data {
  int N;  // num of players
  int G;  // num of games
  int<lower=1, upper=N> LW[G,2];  // loser and winner of each game
}
parameters {
  ordered[2] performance[G];
  vector[N] mu;
  real<lower=0> s_mu;
  vector<lower=0>[N] s_pf;
}
model {
  for (g in 1:G)
    for (i in 1:2)
      performance[g,i] ~ normal(mu[LW[g,i]], s_pf[LW[g,i]]);
  mu ~ normal(0, s_mu);
  s_pf ~ gamma(10, 10);
}
"""

stan_data = {'N': len(players_use), 'G': len(LW), 'LW': LW}
sm = pystan.StanModel(model_code=model)
fit = sm.sampling(data=stan_data, iter=1000, chains=3)

Results
Let's check the parameters estimated by Stan.

import numpy as np

result = fit.extract()

# Build a DataFrame for browsing the results
winners = data[["winner_id2", "winner_name"]].copy()
winners.columns = ["player_id", "player_name"]
losers = data[["loser_id2", "loser_name"]].copy()
losers.columns = ["player_id", "player_name"]
players = pd.concat([winners, losers], axis=0)
players = players.drop_duplicates()
players = players.sort_values(by="player_id")

# Posterior medians of each player's strength (mu) and inconsistency (s_pf).
# Rows are sorted by player_id, so they line up with the Stan parameter vectors,
# assuming every selected player appears in at least one retained match.
players["mu"] = np.median(result["mu"], axis=0)
players["s_pf"] = np.median(result["s_pf"], axis=0)

The top 10 players by mean strength are as follows. In this table, mu is the mean of a player's strength and s_pf is the spread of that strength (its standard deviation in the model).
players.sort_values(by="mu",ascending=False).head(10)
The order is almost the same as in the win-rate ranking. Among these players, Thiem seems to have slightly larger inconsistency.
Next, the top 10 players by spread of strength (inconsistency, s_pf) are as follows.
players.sort_values(by="s_pf",ascending=False).head(10)
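Few top players show up here. The cause is probably that, among the 99 players extracted, the weight carried by rank order differs between the top tier and everyone else. (It is rare for the No. 1 player to lose to the No. 20 player, but the No. 61 player losing to the No. 80 player happens often.)
Out of curiosity, I plotted the strength distributions of Nadal and Nishikori.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

X = np.arange(-3, 6, 0.1)

# Nadal: normal density with (mu, s_pf) = (1.580996, 0.721014)
Y = norm.pdf(X, 1.580996, 0.721014)
plt.plot(X, Y, color="r", label="Nadal")

# Nishikori: normal density with (mu, s_pf) = (0.586693, 0.869935)
Y = norm.pdf(X, 0.586693, 0.869935)
plt.plot(X, Y, color="g", label="Nishikori")

plt.legend()
plt.show()

Judging from this, it looks difficult for Nishikori to beat Nadal when Nadal is in average or better form. Nadal and Nishikori met only once in 2018, and Nadal won, but Nishikori's career record against Nadal is 2 wins and 10 losses, so intuitively the result does not seem far off. (Head-to-head matchup effects are not considered at all, so this is an evaluation from just one angle.)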
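The comparison can also be expressed as a single number. Under the model, the two performances are independent normals, so their difference is also normal and the win probability has a closed form. The sketch below plugs in the (mu, s_pf) values used in the plot above; treating those point estimates as exact and ignoring posterior uncertainty is a simplifying assumption, and `win_probability` is a helper name introduced here for illustration.

```python
from math import sqrt
from statistics import NormalDist


def win_probability(mu_a, s_a, mu_b, s_b):
    """P(performance of A > performance of B) for independent normal performances."""
    # A - B is normal with mean mu_a - mu_b and variance s_a^2 + s_b^2
    diff = NormalDist(mu_a - mu_b, sqrt(s_a ** 2 + s_b ** 2))
    return 1.0 - diff.cdf(0.0)


# (mu, s_pf) values used in the plot above
nadal = (1.580996, 0.721014)
nishikori = (0.586693, 0.869935)

p_nishikori = win_probability(*nishikori, *nadal)
print(f"P(Nishikori beats Nadal) = {p_nishikori:.3f}")
```

With these inputs the probability comes out at roughly 0.19, which is in the same ballpark as the 2-10 career record (about 0.17).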
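Summary
Using the 2018 ATP data, I analyzed the strength and performance inconsistency of tennis players. I thought about what this kind of analysis could be used for, and one possibility is deciding the lineup order in team competitions. Suppose player A has slightly higher overall ability and little variation in performance, while player B has somewhat lower overall ability but large variation. One could compute the predicted win rate against each opponent and build the best lineup numerically rather than by intuition.
The code used is available here: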
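As a rough illustration of the lineup idea, the sketch below enumerates every assignment of two players to two opponents and picks the one with the highest expected number of wins. All (mu, s_pf) values and the names A, B, X, Y are made up for this example, not estimates from the data, and scoring a lineup by expected wins is my own simplifying assumption.

```python
from itertools import permutations
from math import sqrt
from statistics import NormalDist


def win_probability(mu_a, s_a, mu_b, s_b):
    """P(performance of A > performance of B) for independent normal performances."""
    diff = NormalDist(mu_a - mu_b, sqrt(s_a ** 2 + s_b ** 2))
    return 1.0 - diff.cdf(0.0)


# Hypothetical (mu, s_pf) pairs: A is steady, B is weaker on average but erratic
team = {"A": (1.0, 0.5), "B": (0.8, 1.2)}
opponents = {"X": (1.5, 0.7), "Y": (0.5, 0.7)}


def expected_wins(lineup):
    """Sum of win probabilities over the (player, opponent) pairs in a lineup."""
    return sum(win_probability(*team[p], *opponents[o]) for p, o in lineup)


# Every way to pair our players with the opponents, in the opponents' fixed order
lineups = [tuple(zip(order, opponents)) for order in permutations(team)]
best = max(lineups, key=expected_wins)

for lineup in lineups:
    print(lineup, f"expected wins = {expected_wins(lineup):.3f}")
print("best:", best)
```

With these made-up numbers the enumeration prefers sending the erratic player B against the stronger opponent X and the steady player A against Y, which is the kind of trade-off that is hard to judge by feel alone.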
https://github.com/rmizuta3/tennis_analysys