A Practical pandas Introduction for People Who Want to Compete on Kaggle
Introduction
- I was originally weak at pandas: when entering Kaggle competitions I built features mostly with SQL on BigQuery and did only the bare minimum of data wrangling in pandas.
- However, I ended up joining a code competition and needed to handle data nimbly in Python, so I studied up.
- Based on my study notes from that time, this article collects the main pandas features that I believe are enough to put up a decent fight on Kaggle.
Notes
- This was meant to be a practical introduction, but it ended up as almost a dictionary... orz
- It does not cover "what is pandas" style content (import pandas, what a DataFrame is, and so on).
- I tried to write everything so it also runs on the pandas 1.0 series. Apologies if anything is wrong.
Table of contents
- Introduction
- Table of contents
- Options
- Reading and writing DataFrames
- Data cleaning
- DataFrame operations
- Various computations
- Categorical variable encoding
- String operations
- Date handling
- Visualization
- Parallel processing
- Bonus: reading and writing Excel
- How to learn pandas?
- Closing
Options
The incantation that keeps Jupyter Notebook from truncating DataFrame output. Somehow I always forget how to write it.
```python
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
```
Reading and writing DataFrames
CSV files
Reading
read_csv has a surprising number of options, and I can never remember them all.
```python
# Basics
df = pd.read_csv('train.csv')

# When there is no header (columns get sequential numbers as names)
df = pd.read_csv('train.csv', header=None)
# When there is no header and you want to name the columns yourself
df = pd.read_csv('train.csv', names=('col_1', 'col_2'))

# When you want to load only certain columns
df = pd.read_csv('train.csv', usecols=['col_1', 'col_3'])
# A lambda works as well
df = pd.read_csv('train.csv', usecols=lambda x: x != 'col_2')

# Column names: renaming after loading
df = df.rename(columns={'c': 'col_1'})

# Loading with explicit dtypes (unspecified columns are inferred)
## Unless memory is tight, it is common to read_csv without dtypes
## and then apply `reduce_mem_usage` (described below)
df = pd.read_csv('train.csv', dtype={'col_1': str, 'col_3': str})
## Dtypes: converting after loading
df['col_1'] = df['col_1'].astype(int)  # float / str / np.int8 ...

# Parse datetime columns
df = pd.read_csv('train.csv', parse_dates=['created_at', 'updated_at'])
```
- Reading csv/tsv files with pandas
- Loading a column as datetime when using pandas read_csv
Writing
```python
# Basics
df.to_csv('file_name.csv')

# When the index is not needed
# (Kaggle submission files don't need it, which is easy to forget)
submission.to_csv('submission.csv', index=False)
```
Pickle files
```python
# Basics
df = pd.read_pickle('df.pickle')
df.to_pickle('df.pickle')

# Heavy data can be zipped (though it feels too slow for practical use)
## Writing: just change the extension to zip or gzip
df.to_pickle('df.pickle.zip')
## Reading: read_pickle looks at the extension and decompresses automatically
df = pd.read_pickle('df.pickle.zip')
```
- Saving and loading pandas.DataFrame and Series with pickle
- An investigation of pandas persistence formats
- An article comparing pickle / feather / parquet
  - Apart from file size on disk, pickle is reportedly superior in most respects
  - parquet reportedly produces the smallest files
Tricks for reducing memory usage
If you make a habit of cutting memory usage right after loading a file, everything downstream goes more smoothly.
Changing dtypes
```python
# Reduce memory usage with `reduce_mem_usage`, widely used on Kaggle
## Internally it changes each column's dtype to fit its value range
## See the references for a `reduce_mem_usage` implementation
df = reduce_mem_usage(df)

# In practice, memory reduction is often done right after read_csv
df = pd.read_csv('train.csv') \
       .pipe(reduce_mem_usage)

# As an aside, pipe often improves readability
# f(g(h(df), arg1=1), arg2=2, arg3=3)
df.pipe(h) \
  .pipe(g, arg1=1) \
  .pipe(f, arg2=2, arg3=3)
```
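Since the article only points at the references for the implementation, here is a minimal sketch of what a `reduce_mem_usage` helper typically does; the versions circulating on Kaggle also downcast floats per value range and print the savings, so treat this condensed variant as my own illustration:

```python
import numpy as np
import pandas as pd

def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns to the smallest dtype that fits their range."""
    for col in df.columns:
        col_type = df[col].dtype
        if np.issubdtype(col_type, np.integer):
            c_min, c_max = df[col].min(), df[col].max()
            for int_type in (np.int8, np.int16, np.int32, np.int64):
                if np.iinfo(int_type).min <= c_min and c_max <= np.iinfo(int_type).max:
                    df[col] = df[col].astype(int_type)
                    break
        elif np.issubdtype(col_type, np.floating):
            # float16 can lose precision, so float32 is a safer default here
            df[col] = df[col].astype(np.float32)
    return df
```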
Dropping unneeded columns
```python
import gc

# drop also works: df.drop('col_1', axis=1, inplace=True)
del df['col_1']
gc.collect()
```
Data cleaning
Handling missing data
```python
# Drop rows that have any missing value
df1.dropna(how='any')

# Ignore rows where a specific column is missing
df = df[~df['col_1'].isnull()]

# Fill missing values
df1.fillna(value=0)
```
Removing duplicates
```python
# Basics
df2.drop_duplicates()

# Specify the columns used to detect duplicates
df2.drop_duplicates(['col_1'])

# Specify which duplicate to keep
df2.drop_duplicates(['col_1'], keep='last')  # keep='first' / False (drop all)
```
Interpolation (interpolate)
- Probably not used much on Kaggle, but it seems handy for real-world work, as in the sketch below.
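As a quick illustration (my own example, not from the original article), `Series.interpolate` fills NaN gaps from the neighboring values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
s.interpolate()  # linear by default -> [1.0, 2.0, 3.0, 4.0]

# With a DatetimeIndex, method='time' weights by the time gaps
ts = pd.Series([1.0, np.nan, 3.0],
               index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-04']))
ts.interpolate(method='time')  # -> 1.0, 1.666..., 3.0
```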
DataFrame operations
Displaying DataFrame information
```python
# Show row count, column count, memory usage, dtypes, non-null counts
df.info()

# Get (row count, column count)
df.shape
# Get the row count
len(df)

# Show the first / last N rows
df.head(5)
df.tail(5)

# Get the list of column names
df.columns

# Get summary statistics per column
## min/max/mean/std etc. for numeric columns
df.describe()
## count/unique/top/freq for categorical columns
df.describe(exclude='number')
## Specify which percentiles to show
df.describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.99])
```
Slicing (iloc / loc / (ix))
```python
# Basics
df.iloc[3:5, 0:2]
df.loc[:, ['col_1', 'col_2']]

# Select rows by position and columns by name
# (some older versions allowed ix for this, but it was deprecated)
df.loc[df.index[[3, 4, 8]], ['col_3', 'col_5']]
```
Selecting columns by dtype
```python
# Exclusion is possible too
df.select_dtypes(
    include=['number', 'bool'],
    exclude=['object'])
```
Selecting rows by condition
```python
# Basics
df[df.age >= 25]

# OR condition
df[(df.age <= 19) | (df.age >= 30)]

# AND condition
df[(df.age >= 25) & (df.age <= 34)]
## between works too (though rarely seen)
df[df['age'].between(25, 34)]

# IN
df[df.user_id.isin(target_user_list)]

# query syntax: opinions differ, but I personally like it
df.query('age >= 25') \
  .query('gender == "male"')
```
Resetting the index
```python
# Basics
df = df.reset_index()

# In-place (destructive) version
df.reset_index(inplace=True)

# With drop=False the old index is added back as a column
df.reset_index(drop=False, inplace=True)
```
Dropping columns
```python
# Basics
df = df.drop(['col_1'], axis=1)

# In-place (destructive) version: note it returns None,
# so do not assign the result back to df
df.drop(['col_1'], axis=1, inplace=True)
```
Converting to a NumPy array
```python
# df['col_1'] on its own carries an index, which can silently drop
# rows or cause bugs when attaching it to another df, so converting
# to a numpy array before downstream processing is common (I think)
df['col_1'].values
# (pandas >= 0.24 also offers df['col_1'].to_numpy())
```
Concatenating and joining
Concatenation
```python
# concat
## Basics (stack vertically: columns become the union across DataFrames)
df = pd.concat([df_1, df_2, df_3])
## Concatenate horizontally
df = pd.concat([df_1, df_2], axis=1)
## Stack using only the columns common to all DataFrames
df = pd.concat([df_1, df_2, df_3], join='inner')
```
- pandas cheatsheet
  - The Reshaping Data section has clear, color-coded diagrams
- Concatenating pandas.DataFrame and Series with concat
  - I'm so bad at concat that I've probably looked at this article 100+ times lol
Joining
merge: join on specified key columns
```python
# Basics (inner join)
df = pd.merge(df, df_sub, on='key')

# Use multiple columns as the key
df = pd.merge(df, df_sub, on=['key_1', 'key_2'])

# Left join
df = pd.merge(df, df_sub, on='key', how='left')

# When the key has different names on the left and right
df = pd.merge(df, df_sub,
              left_on='key_left', right_on='key_right') \
       .drop('key_left', axis=1)  # both keys remain, so drop one
```
join: join on the index
```python
# Basics (left join: beware, this differs from merge's default)
df_1.join(df_2)

# Inner join
df_1.join(df_2, how='inner')
```
Random sampling
```python
# Sample 100 rows
df.sample(n=100)

# Sample 25%
df.sample(frac=0.25)

# Fix the seed
df.sample(frac=0.25, random_state=42)

# Sample with replacement: the default is replace=False
df.sample(frac=0.25, replace=True)

# Sample columns
df.sample(frac=0.25, axis=1)
```
Sorting
```python
# Basics
df.sort_values(by='col_1')

# Sort by index
df.sort_index(axis=1, ascending=False)

# Multiple keys & mixed descending/ascending order
df.sort_values(by=['col_1', 'col_2'], ascending=[False, True])
```
argmax / top-N style operations
```python
# Find the row label with the largest value
df['col_1'].idxmax()

# Find the column with the smallest sum
df.sum().idxmin()

# Top-N: take the top 5 rows by col_1, breaking ties with col_2
df.nlargest(5, ['col_1', 'col_2'])  # nsmallest: bottom N rows
```
Various computations
Frequently used basics
```python
# Aggregation
df['col_1'].sum()  # mean / max / min / count / ...

# Unique values
df['col_1'].unique()
# Number of unique values (count distinct)
df['col_1'].nunique()

# Percentiles
df['col_1'].quantile([0.25, 0.75])

# Clipping
df['col_1'].clip(-4, 6)
# Clipping at the 99th percentile
df['col_1'].clip(0, df['col_1'].quantile(0.99))
```
Counting value frequencies (value_counts)
```python
# Frequency count (excluding NaN)
df['col_1'].value_counts()

# Frequency count (including NaN)
df['col_1'].value_counts(dropna=False)

# Frequency count (normalized to sum to 1)
df['col_1'].value_counts(normalize=True)
```
Rewriting values (apply / map)
Rewriting each element of a Series: map
```python
# Apply a function to each element
f_brackets = lambda x: '[{}]'.format(x)
df['col_1'].map(f_brackets)
# 0    [11]
# 1    [21]
# 2    [31]
# Name: col_1, dtype: object

# Pass a dict to replace values
df['priority'] = df['priority'].map({'yes': True, 'no': False})
```
Rewriting each row/column of a DataFrame: apply
```python
# Basics: the function receives a whole column (or row with axis=1)
df.apply(lambda x: x.max())          # column-wise
df.apply(lambda x: x.max(), axis=1)  # row-wise

# A user-defined function works too, of course
df['col_1'].apply(lambda x: custom_func(x))

# To show a progress bar:
# from tqdm._tqdm_notebook import tqdm_notebook
# tqdm_notebook.pandas()  # registers progress_apply
df['col_1'].progress_apply(lambda x: custom_func(x))
```
- Applying functions to elements, rows, and columns in pandas: map, applymap, apply
- Making progress display for pandas map/apply methods look nice in Jupyter notebooks
Other rewrites (replace / np.where)
```python
# replace
df['animal'] = df['animal'].replace('snake', 'python')

# np.where
df['logic'] = np.where(df['AAA'] > 5, 'high', 'low')

# np.where: a more complex version
condition_1 = (
    (df.title == 'Bird Measurer (Assessment)') &
    (df.event_code == 4110)
)
condition_2 = (
    (df.title != 'Bird Measurer (Assessment)') &
    (df.type == 'Assessment') &
    (df.event_code == 4100)
)
df['win_code'] = np.where(condition_1 | condition_2, 1, 0)
```
Aggregation (agg)
```python
# Basics
df.groupby(['key_id']) \
  .agg({
      'col_1': ['max', 'mean', 'sum', 'std', 'nunique'],
      'col_2': [np.ptp, np.median],  # np.ptp: max - min
  })

# To aggregate every column the same way, a dict comprehension is fine
df.groupby(['key_id_1', 'key_id_2']) \
  .agg({
      col: ['max', 'mean', 'sum', 'std'] for col in cols
  })
```
Putting aggregation results to use
This is pretty much an idiom, but it trips people up at first, so here is an example.
```python
# Aggregate
agg_df = df.groupby(['key_id']) \
           .agg({'col_1': ['max', 'min']})

# The column names become just max / min, with no hint of which key
# they belong to; they are a MultiIndex, so flatten and rename them
agg_df.columns = [
    '_'.join(col) for col in agg_df.columns.values]

# key_id ends up in the index, so bring it back out with reset_index
agg_df.reset_index(inplace=True)

# Join back to the original DataFrame on key_id
df = pd.merge(df, agg_df, on='key_id', how='left')
```
Aggregation with pivot tables
```python
pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
               aggfunc={'D': np.mean, 'E': [min, max, np.mean]})
#                   D    E
#                mean  max      mean  min
# A   C
# bar large  5.500000  9.0  7.500000  6.0
#     small  5.500000  9.0  8.500000  8.0
# foo large  2.000000  5.0  4.500000  4.0
#     small  2.333333  6.0  4.333333  2.0
```
Array-wise arithmetic without loops
Handy when computing, for example, each column's difference from its column mean.
```python
# Does `df['{col}_diff_to_col_mean'] = df['{col}'] - df['{col}'].mean()`
# for every column in one shot
df.sub(df.mean(axis=0), axis=1)

# Besides sub there are add / div / mul (multiplication)
# The following is the batched version of
# `df['{col}_div_by_col_max'] = df['{col}'] / df['{col}'].max()`
df.div(df.max(axis=0), axis=1)
```
Binning (cut / qcut)
```python
# Split the range between min and max of df['col_1'] into 4 equal-width
# intervals and bin by those boundaries
# -> the number of elements per bin varies
pd.cut(df['col_1'], 4)

# Split the elements of df['col_1'] into 4 equal-sized bins, deriving
# the boundaries afterwards
# -> the bin widths vary
pd.qcut(df['col_1'], 4)
```
- Histogram visualization is covered in detail later.
- Binning with the pandas cut and qcut functions
Operations common with time-series data
shift: shift values along rows or columns
```python
# Shift down by 2 rows
df.shift(periods=2)

# Shift up by 1 row
df.shift(periods=-1)

# Shift by 2 columns (rarely used)
df.shift(periods=2, axis='columns')
```
rolling: moving averages and the like
```python
# Sum over a sliding window of width 3
df['col_1'].rolling(3).sum()

# Multiple aggregations at once
df['col_1'].rolling(3) \
    .agg([sum, min, max, 'mean'])
```
cumsum: cumulative sum. Similar functions include cummax and cummin.
```python
# df
#      A    B
# 0  2.0  1.0
# 1  3.0  NaN
# 2  1.0  0.0

# Cumulative sum of the df above
df.cumsum()
#      A    B
# 0  2.0  1.0
# 1  5.0  NaN
# 2  6.0  1.0
```
diff, pct_change: differences and rates of change across rows or columns
```python
# DataFrame used in the examples
#    col_1  col_2
# 0      1      2
# 1      2      4
# 2      3      8
# 3      4     16

# Basics: difference from the previous row
df.diff()
#    col_1  col_2
# 0    NaN    NaN
# 1    1.0    2.0
# 2    1.0    4.0
# 3    1.0    8.0

# Difference from 2 rows back
df.diff(2)
#    col_1  col_2
# 0    NaN    NaN
# 1    NaN    NaN
# 2    2.0    6.0
# 3    2.0   12.0

# Negative offsets are allowed
df.diff(-1)
#    col_1  col_2
# 0   -1.0   -2.0
# 1   -1.0   -4.0
# 2   -1.0   -8.0
# 3    NaN    NaN

# Use `pct_change` for the rate of change
df.pct_change()
#       col_1  col_2
# 0       NaN    NaN
# 1  1.000000    1.0
# 2  0.500000    1.0
# 3  0.333333    1.0

# With a datetime index, a frequency code can be given
# The example below computes the rate of change vs. 2 days earlier
df.pct_change(freq='2D')
```
Aggregating by time unit
```python
# Aggregate mean and max in 5-minute buckets
# For frequency codes like `min` and `H`, ref. 2 below is very thorough
funcs = {'Mean': np.mean, 'Max': np.max}
df['col_1'].resample("5min").apply(funcs)
```
- Resampling time-series data with pandas: resample, asfreq
- How to specify frequencies (the freq argument) for pandas time series
Categorical variable encoding
For the varieties of categorical encoding, this material covers them in detail.
One-Hot Encoding
```python
# The DataFrame to process
#    name  gender
# 0  hoge    male
# 1  fuga     NaN
# 2  hage  female

# A prefix makes it obvious which column the one-hot columns came from
tmp = pd.get_dummies(df['gender'], prefix='gender')
#    gender_female  gender_male
# 0              0            1
# 1              0            0
# 2              1            0

# Join and drop the original column
df = df.join(tmp).drop('gender', axis=1)
#    name  gender_female  gender_male
# 0  hoge              0            1
# 1  fuga              0            0
# 2  hage              1            0
```
Label Encoding
```python
from sklearn.preprocessing import LabelEncoder

# Label-encode data that is split into train and test in one pass
cat_cols = ['category_col_1', 'category_col_2']
for col in cat_cols:
    # Conventionally abbreviated `le`, I believe
    le = LabelEncoder().fit(list(
        # Take the union of the labels in train & test
        set(train[col].unique()).union(
            set(test[col].unique()))
    ))
    train[f'{col}'] = le.transform(train[col])
    test[f'{col}'] = le.transform(test[col])

# Label encoding usually lets you shrink memory, so don't forget
train = reduce_mem_usage(train)
test = reduce_mem_usage(test)
```
- Notes
  - With the approach above, labels that appear only in test are encoded as well
  - If that feels wrong, rewrite everything absent from train to something like -1 in one go (personally I don't worry about it much, so I'm not sure whether that is the proper way...; see the sketch after this list)
- The Kaggle book implementation
  - The Kaggle book label-encodes using only the labels that appear in train
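A minimal sketch of that train-only variant (my own illustration, not the Kaggle book's code): fit on train only, then map labels unseen in train to -1.

```python
from sklearn.preprocessing import LabelEncoder

for col in cat_cols:
    le = LabelEncoder().fit(train[col])
    mapping = {label: i for i, label in enumerate(le.classes_)}
    train[col] = train[col].map(mapping)
    # Labels unseen in train become NaN via map, so fill them with -1
    test[col] = test[col].map(mapping).fillna(-1).astype(int)
```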
Frequency Encoding
```python
for col in cat_cols:
    freq_encoding = train[col].value_counts()
    # Replace each label with its number of occurrences
    train[col] = train[col].map(freq_encoding)
    test[col] = test[col].map(freq_encoding)
```
Target Encoding
```python
# The extremely crude way (not recommended)
## For each label of col_1, compute the mean and count of target (correct)
## Labels below a certain count (say 1000) are dropped from the aggregation
target_encoding = (
    df.groupby('col_1')
      .agg({'correct': ['mean', 'count']})
      .reset_index()
)
# Flatten the MultiIndex columns so they are easy to work with
target_encoding.columns = ['col_1', 'target_encoded_col_1', 'count']
# Rare labels can cause leakage, so drop them
target_encoding = target_encoding.query('count >= 1000')
# The count was only needed for the cutoff, so drop it
target_encoding = target_encoding.drop('count', axis=1)

train = pd.merge(
    train, target_encoding, on='col_1', how='left')
test = pd.merge(
    test, target_encoding, on='col_1', how='left')
```
- The example above is a very sloppy implementation. When doing this properly, read the Kaggle book's implementation and compute the encoding fold by fold, as sketched below.
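A rough sketch of the fold-wise idea (my own condensation under assumed names `train` / `test` / target column `correct`, and a unique index; not the Kaggle book's exact code): for each fold, compute the label means only on the out-of-fold training data.

```python
import numpy as np
from sklearn.model_selection import KFold

train['target_encoded_col_1'] = np.nan
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for trn_idx, val_idx in kf.split(train):
    # Means computed without the validation fold -> no self-leakage
    fold_means = train.iloc[trn_idx].groupby('col_1')['correct'].mean()
    train.loc[train.index[val_idx], 'target_encoded_col_1'] = \
        train['col_1'].iloc[val_idx].map(fold_means)

# Test can use means computed on the whole training data
test['target_encoded_col_1'] = test['col_1'].map(
    train.groupby('col_1')['correct'].mean())
```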
String operations
Most of this is covered in the official pandas method list, so it may be worth a skim once.
Basics
```python
# String length
series.str.len()

# Replacement
series.str.replace(' ', '_')

# Whether a string starts (ends) with 'm'
series.str.startswith('m')  # endswith

# Whether a string contains a pattern
pattern = r'[0-9][a-z]'
series.str.contains(pattern)
```
Cleaning
```python
# Lower/upper case
series.str.lower()  # .upper()

# Capitalize (male -> Male)
series.str.capitalize()

# Extract alphabetic runs: only the first match, note
## With multiple groups a DataFrame comes back
## extractall: all matches come back with a MultiIndex
series.str.extract(r'([a-zA-Z\s]+)', expand=False)

# Strip leading/trailing whitespace
series.str.strip()

# Character conversion
## Before: Qiitaは、プログラミングに関する知識を記録・共有するためのサービスです。
## After:  Qiitaは,プログラミングに関する知識を記録共有するためのサービスです.
table = str.maketrans({
    '、': ',',
    '。': '.',
    '・': '',
})
result = text.translate(table)
```
- str.translate() is handy for character conversion
Date handling
Basics
```python
# Basics: e.g. when you forgot to convert at load time
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Create a list of dates
dates = pd.date_range('20130101', periods=6)
# Create a list of dates: 100 points at 1-second intervals
pd.date_range('20120101', periods=100, freq='S')

# Filter by date (on a DatetimeIndex)
df['20130102':'20130104']

# Convert to unixtime (note: int64 gives nanoseconds)
df['timestamp'].astype('int64')
```
Advanced date extraction
- pandas implements a remarkably rich date-extraction machinery. Selections like "the 4th Saturday of every month" or "the first business day of each month" take a single line (Japanese public holidays are not supported, so some adjustment with jpholiday, described below, is needed). The sketch after the code block covers the first-business-day case.
- "How to specify frequencies (the freq argument) for pandas time series" covers this in depth, so I highly recommend reading it whenever you need date-related work.
```python
# Extract the last day of each month
pd.date_range('2020-01-01', '2020-12-31', freq='M')
# DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
#                '2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31',
#                '2020-09-30', '2020-10-31', '2020-11-30', '2020-12-31'],
#               dtype='datetime64[ns]', freq='M')

# Extract the 4th Saturday of each month in 2020
pd.date_range('2020-01-01', '2020-12-31', freq='WOM-4SAT')
# DatetimeIndex(['2020-01-25', '2020-02-22', '2020-03-28', '2020-04-25',
#                '2020-05-23', '2020-06-27', '2020-07-25', '2020-08-22',
#                '2020-09-26', '2020-10-24', '2020-11-28', '2020-12-26'],
#               dtype='datetime64[ns]', freq='WOM-4SAT')
```
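The first-business-day case mentioned above would look like this (my own addition, using the 'BMS' business-month-start frequency code):

```python
# First weekday (Mon-Fri) of each month; holidays are not considered,
# hence the jpholiday adjustment mentioned above for Japanese calendars
pd.date_range('2020-01-01', '2020-12-31', freq='BMS')
# DatetimeIndex(['2020-01-01', '2020-02-03', '2020-03-02', ...],
#               dtype='datetime64[ns]', freq='BMS')
```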
Holiday detection
- This isn't pandas, and I've (probably) never used it on Kaggle, but it's handy in practice, so I'm including it.
- jpholiday official
```python
import jpholiday
import datetime

# Check whether a given date is a holiday
jpholiday.is_holiday(datetime.date(2017, 1, 1))  # True
jpholiday.is_holiday(datetime.date(2017, 1, 3))  # False

# Get the holidays of a given month
jpholiday.month_holidays(2017, 5)
# [(datetime.date(2017, 5, 3), '憲法記念日'),
#  (datetime.date(2017, 5, 4), 'みどりの日'),
#  (datetime.date(2017, 5, 5), 'こどもの日')]
```
Visualization
An incantation for prettier styling
Adding this incantation, which I found in a Qiita article, makes plots look much nicer; highly recommended.
```python
import matplotlib
import matplotlib.pyplot as plt

plt.style.use('ggplot')
font = {'family': 'meiryo'}
matplotlib.rc('font', **font)
```
Simple plots
```python
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

# Basics
df['col_1'].plot()

# Plot several columns in a 2x2 grid of tiles
# (complains if the number of columns exceeds the number of tiles)
df.plot(subplots=True, layout=(2, 2))

# Same, with shared X and Y axes
df.plot(subplots=True, layout=(2, 2), sharex=True, sharey=True)
```
Histograms
```python
# Histogram
df['col_1'].plot.hist()

# Increase bins to 20 / narrow the bars to leave gaps
df['col_1'].plot.hist(bins=20, rwidth=.8)

# Specify the X-axis range
## e.g. show ages 0-100 in 5-year steps
df['col_1'].plot.hist(bins=range(0, 101, 5), rwidth=.8)

# Make overlapping histograms translucent
df['col_1'].plot.hist(alpha=0.5)

# Fix the Y-axis minimum and maximum
df['col_1'].plot.hist(ylim=(0, 0.25))
```
Box plot
```python
df['col_1'].plot.box()
```
Scatter plot
```python
df.plot.scatter(x='col_1', y='col_2')
```
Parallel processing
- Processing in pandas is, unfortunately, not fast; compared with BigQuery and friends it is at a disappointing level. (Though comparing raw processing speed itself is unfair to begin with...)
- When normalizing a huge number of features or mapping over a huge number of elements, leaning on parallel processing helps a lot, as in the helper below.
```python
from multiprocessing import Pool, cpu_count

def parallelize_dataframe(df, func, columnwise=False):
    num_partitions = cpu_count()
    num_cores = cpu_count()
    pool = Pool(num_cores)
    if columnwise:
        # Split column-wise and process in parallel
        df_split = [df[col_name] for col_name in df.columns]
        df = pd.concat(pool.map(func, df_split), axis=1)
    else:
        # Split row-wise and process in parallel
        df_split = np.array_split(df, num_partitions)
        df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

# Feed a DataFrame to some function and process it column-wise in parallel
df = parallelize_dataframe(df, custom_func, columnwise=True)
```
- Make your Pandas apply functions faster using Parallel Processing (I think this first appeared in a different article, but I couldn't find it...)

Addendum (2020/07/28)
- Libraries that handle the parallelism for you, such as pandarallel and swifter, seem to be maturing.
- Going forward it may be better to use those instead; a short usage sketch follows the reference below.
- Reference article: Two libraries that speed up pandas in just a few lines (pandarallel/swifter)
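For reference, a minimal pandarallel / swifter usage sketch, based on their documented APIs (`custom_func` is a placeholder function):

```python
# pandarallel: initialize once, then use the parallel_* variants
from pandarallel import pandarallel
pandarallel.initialize()
df['col_1'] = df['col_1'].parallel_map(custom_func)
df['new_col'] = df.parallel_apply(custom_func, axis=1)

# swifter: chain .swifter before apply; it picks a fast strategy itself
import swifter
df['col_1'] = df['col_1'].swifter.apply(custom_func)
```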
Bonus: reading and writing Excel
Not used on Kaggle, but surely a fair number of people use it at work? (I never have.)
```python
# write
df.to_excel('foo.xlsx', sheet_name='Sheet1')

# read
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
```
How to learn pandas?
First and foremost, I think the quickest route is to work through the kind of material found in the official Tutorial once, roughly in the order of this article (visualization aside).
If you want hands-on problems, the book 前処理大全 (a Japanese data-preprocessing compendium) looks like a good choice, but if you are entering Kaggle competitions, practicing while reading public Notebooks is probably enough.
Closing
I've written various other Kaggle-related articles, so please have a look if you're interested.