It has been three months since I moved to a data analysis company.
For the first month I really struggled with Pandas,
so I am writing these notes down on the blog as a reminder (o ･ω･)ﾉ
[Addendum] 2017/07/31 0:36: Some of the data was wrong, so I fixed it.
- What is Pandas?
- Types you use often in pandas
- About the test data
- Introduction to data manipulation with Pandas
- Loading pandas
- Loading data (csv)
- Size of the data
- Columns of the data
- Selecting the columns you need
- Selecting data that matches a condition
- Selecting rows by row number
- Grouping and aggregation
- Adding a new column
- Overwriting only the cells that match a condition
- Selecting only rows whose value is in a set or list
- Closing
- Bonus
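What is Pandas?
A library that makes tabular (row-and-column) data easy to handle and to aggregate.
For example, if your data is in a csv file, just calling pandas.read_csv('hoge.csv')
turns it into an easy-to-handle table (called a DataFrame).
It also comes with simple visualization features, and it is often the first library you reach for when analyzing data in Python.
It is very convenient, but its API has a lot of quirks, so expect to be confused quite often until you get used to it.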
Types you use often in pandas
The three types below are the main ones you work with in Pandas. At first you will not know which type can do what, so it helps to keep the references below open in a browser tab while you work.
Type | Description | Reference |
---|---|---|
DataFrame | A table (rows and columns) | pandas.DataFrame, pandas 0.20.2 documentation |
Series | A single column of a DataFrame | pandas.Series, pandas 0.20.2 documentation |
GroupBy | A DataFrame or Series after grouping | API Reference#groupby, pandas 0.20.2 documentation |
About the test data
This time I use the following excerpt from the dataset for predicting which Titanic passengers survived.
Survived,Pclass,Sex,Age,Fare,Embarked
1,1,male,80.0,30.0,S
1,2,female,4.0,39.0,S
0,2,female,24.0,13.0,S
0,2,male,37.0,26.0,S
0,3,female,11.0,31.275,S
1,3,female,13.0,7.2292,C
0,3,male,22.0,7.25,S
Save this as sample.csv.
For reference, the columns mean the following.
- Survived: whether the passenger survived
- Pclass: cabin grade
- Sex: sex
- Age: age
- Fare: fare paid
- Embarked: port where the passenger boarded
Aside
This is the best-known tutorial problem on Kaggle, the site known for machine learning competitions,
so if you are interested in manipulating or analyzing data, please give the problem a try.
The data can be downloaded here.
Introduction to data manipulation with Pandas
Loading pandas
By convention, pandas is imported like this.
import pandas as pd
Loading data (csv)
First, how to load a csv file.
pd.read_csv reads the file into a DataFrame.
titanic_df = pd.read_csv("sample.csv")
titanic_df
| | Survived | Pclass | Sex | Age | Fare | Embarked |
---|---|---|---|---|---|---|
0 | 1 | 1 | male | 80.0 | 30.0000 | S |
1 | 1 | 2 | female | 4.0 | 39.0000 | S |
2 | 0 | 2 | female | 24.0 | 13.0000 | S |
3 | 0 | 2 | male | 37.0 | 26.0000 | S |
4 | 0 | 3 | female | 11.0 | 31.2750 | S |
5 | 1 | 3 | female | 13.0 | 7.2292 | C |
6 | 0 | 3 | male | 22.0 | 7.2500 | S |
pandas.DataFrame, pandas 0.20.3 documentation
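If you do not have sample.csv on disk, a minimal sketch like the following should reproduce the same DataFrame. It assumes the seven rows listed earlier; `io.StringIO` and the name `SAMPLE_CSV` are not from the original post and are only there to stand in for the file.

```python
import io

import pandas as pd

# The same seven rows as sample.csv, held in a string instead of a file.
SAMPLE_CSV = """Survived,Pclass,Sex,Age,Fare,Embarked
1,1,male,80.0,30.0,S
1,2,female,4.0,39.0,S
0,2,female,24.0,13.0,S
0,2,male,37.0,26.0,S
0,3,female,11.0,31.275,S
1,3,female,13.0,7.2292,C
0,3,male,22.0,7.25,S
"""

# pd.read_csv accepts a file-like object as well as a file path.
titanic_df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(titanic_df.head(3))  # first three rows
print(titanic_df.dtypes)   # column types inferred by read_csv
```

Checking `dtypes` right after loading is a cheap way to notice a numeric column that was accidentally read as text.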
Size of the data
titanic_df.shape
(7, 6)
DataFrame.shape gives the size of the table as (number of rows, number of columns).
You often need just the number of rows, and the usual way to get it is the following.
titanic_df.shape[0]
7
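As a small sketch of the same idea, the shape tuple can be unpacked directly, and `len()` is another common way to get the row count. The DataFrame is rebuilt inline here so the snippet runs on its own.

```python
import pandas as pd

titanic_df = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 0, 1, 0],
    "Pclass": [1, 2, 2, 2, 3, 3, 3],
    "Sex": ["male", "female", "female", "male", "female", "female", "male"],
    "Age": [80.0, 4.0, 24.0, 37.0, 11.0, 13.0, 22.0],
    "Fare": [30.0, 39.0, 13.0, 26.0, 31.275, 7.2292, 7.25],
    "Embarked": ["S", "S", "S", "S", "S", "C", "S"],
})

# shape is a plain tuple, so it can be unpacked into two variables.
n_rows, n_cols = titanic_df.shape
print(n_rows, n_cols)

# len() on a DataFrame returns the number of rows.
row_count = len(titanic_df)
print(row_count)
```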
Columns of the data
The columns of the data can be seen with DataFrame.columns.
titanic_df.columns
Index(['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'Embarked'], dtype='object')
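The returned Index can be turned into a plain list or used for membership checks. A short sketch, with the column name "Name" used only as an example of a column that is not present:

```python
import pandas as pd

titanic_df = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 0, 1, 0],
    "Pclass": [1, 2, 2, 2, 3, 3, 3],
    "Sex": ["male", "female", "female", "male", "female", "female", "male"],
    "Age": [80.0, 4.0, 24.0, 37.0, 11.0, 13.0, 22.0],
})

# Convert the Index object into an ordinary Python list.
column_names = titanic_df.columns.tolist()
print(column_names)

# Check whether a column exists before using it.
has_age = "Age" in titanic_df.columns
has_name = "Name" in titanic_df.columns  # "Name" is not in this excerpt
print(has_age, has_name)
```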
Selecting the columns you need
# Take out one column: pass the key in []. A Series is returned.
titanic_df['Age']
0 80.0
1 4.0
2 24.0
3 37.0
4 11.0
5 13.0
6 22.0
Name: Age, dtype: float64
# Take out two or more columns: pass a list of keys in []. A DataFrame is returned.
titanic_df[['Age', 'Sex']]
| | Age | Sex |
---|---|---|
0 | 80.0 | male |
1 | 4.0 | female |
2 | 24.0 | female |
3 | 37.0 | male |
4 | 11.0 | female |
5 | 13.0 | female |
6 | 22.0 | male |
I often declare just the keys I need like this and narrow the table down to them.
variables = ['Survived', 'Pclass', 'Sex', 'Age']
titanic_df = titanic_df[variables]
titanic_df
| | Survived | Pclass | Sex | Age |
---|---|---|---|---|
0 | 1 | 1 | male | 80.0 |
1 | 1 | 2 | female | 4.0 |
2 | 0 | 2 | female | 24.0 |
3 | 0 | 2 | male | 37.0 |
4 | 0 | 3 | female | 11.0 |
5 | 1 | 3 | female | 13.0 |
6 | 0 | 3 | male | 22.0 |
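The difference between single and double brackets is easy to miss, so here is a minimal sketch that checks the returned types. The variable names `age_series` and `age_frame` are made up for illustration.

```python
import pandas as pd

titanic_df = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 0, 1, 0],
    "Pclass": [1, 2, 2, 2, 3, 3, 3],
    "Sex": ["male", "female", "female", "male", "female", "female", "male"],
    "Age": [80.0, 4.0, 24.0, 37.0, 11.0, 13.0, 22.0],
})

age_series = titanic_df["Age"]   # a key        -> Series
age_frame = titanic_df[["Age"]]  # a list of keys -> DataFrame, even with one key

print(type(age_series).__name__, age_series.shape)
print(type(age_frame).__name__, age_frame.shape)
```

The one-column DataFrame is handy when a later step expects a table rather than a single column.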
Selecting data that matches a condition
1. Selecting with DataFrame.query
Here I introduce how to select data with DataFrame.query.
Pandas offers various ways to select the rows that match a condition,
but when you read the code later, .query makes it easiest to see which condition was used,
so I think it is a good method to learn first.
# Select the rows that match a single condition
titanic_df.query('Age > 20')
| | Survived | Pclass | Sex | Age |
---|---|---|---|---|
0 | 1 | 1 | male | 80.0 |
2 | 0 | 2 | female | 24.0 |
3 | 0 | 2 | male | 37.0 |
6 | 0 | 3 | male | 22.0 |
# Select the rows that match two or more conditions
titanic_df.query('(Age > 20) & (Sex == "female")')
| | Survived | Pclass | Sex | Age |
---|---|---|---|---|
2 | 0 | 2 | female | 24.0 |
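When the threshold lives in a Python variable, query can refer to it with the `@` prefix. A minimal sketch; `min_age` and `target_sex` are names chosen here for illustration.

```python
import pandas as pd

titanic_df = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 0, 1, 0],
    "Pclass": [1, 2, 2, 2, 3, 3, 3],
    "Sex": ["male", "female", "female", "male", "female", "female", "male"],
    "Age": [80.0, 4.0, 24.0, 37.0, 11.0, 13.0, 22.0],
})

min_age = 20
target_sex = "female"

# "@name" inside the query string refers to a local Python variable.
result = titanic_df.query("Age > @min_age and Sex == @target_sex")
print(result)
```

This avoids building the query string by hand with string formatting.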
2. Passing a True/False Series and selecting only the True rows
Next is the style where you pass a Series (a column) that holds True/False for each row, so that only the True rows are selected.
Writing a condition on a Series returns a True/False Series like the one below. (numpy arrays behave the same way for this kind of operation.)
titanic_df['Age'] > 20
0 True
1 False
2 True
3 True
4 False
5 False
6 True
Name: Age, dtype: bool
Passing this True/False Series back to the DataFrame selects only the True rows.
titanic_df[titanic_df['Age'] > 20]
| | Survived | Pclass | Sex | Age |
---|---|---|---|---|
0 | 1 | 1 | male | 80.0 |
2 | 0 | 2 | female | 24.0 |
3 | 0 | 2 | male | 37.0 |
6 | 0 | 3 | male | 22.0 |
When combining several conditions, each condition has to be wrapped in ().
titanic_df[(titanic_df['Age'] > 20) & (titanic_df['Sex'] == 'female')]
| | Survived | Pclass | Sex | Age |
---|---|---|---|---|
2 | 0 | 2 | female | 24.0 |
I mostly select data with these two methods: query, and passing a condition array like this.
Addendum (2017/12/14)
query is somewhat slower, but it is readable and easy to write, so which one to use is a matter of taste.
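The parentheses are needed because Python's `&` binds more tightly than `>` and `==`. One way to sidestep that, sketched below, is to give each condition its own name first; `|` (or) and `~` (not) combine masks in the same way. The mask names are made up for illustration.

```python
import pandas as pd

titanic_df = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 0, 1, 0],
    "Pclass": [1, 2, 2, 2, 3, 3, 3],
    "Sex": ["male", "female", "female", "male", "female", "female", "male"],
    "Age": [80.0, 4.0, 24.0, 37.0, 11.0, 13.0, 22.0],
})

# Each mask is a True/False Series with one entry per row.
is_adult = titanic_df["Age"] > 20
is_female = titanic_df["Sex"] == "female"
survived = titanic_df["Survived"] == 1

# & = and, | = or, ~ = not
adult_women = titanic_df[is_adult & is_female]
young_or_survived = titanic_df[~is_adult | survived]

print(adult_women)
print(young_or_survived)
```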
Selecting rows by row number
To select rows, use the DataFrame.loc indexer.
loc selects rows by the "index", the labels that are assigned automatically at the far left of the df when it is loaded.
Note that DataFrame.loc[start:end] returns both start and end, inclusive.
titanic_df.loc[0:2]
| | Survived | Pclass | Sex | Age |
---|---|---|---|---|
0 | 1 | 1 | male | 80.0 |
1 | 1 | 2 | female | 4.0 |
2 | 0 | 2 | female | 24.0 |
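There are other row accessors such as iloc and ix, but loc is what you use most of the time, so remembering just this one is enough to start. (ix was deprecated in pandas 0.20 and removed in 1.0, so it is best avoided.)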
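The difference between loc (labels) and iloc (positions) only shows up once the index is no longer 0, 1, 2, ..., for example after filtering. A minimal sketch, with `adults` as an illustrative name:

```python
import pandas as pd

titanic_df = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 0, 1, 0],
    "Pclass": [1, 2, 2, 2, 3, 3, 3],
    "Sex": ["male", "female", "female", "male", "female", "female", "male"],
    "Age": [80.0, 4.0, 24.0, 37.0, 11.0, 13.0, 22.0],
})

# After filtering, the remaining index labels are 0, 2, 3, 6.
adults = titanic_df[titanic_df["Age"] > 20]

# loc slices by label and includes both ends: labels 2 and 3.
by_label = adults.loc[2:3]

# iloc slices by position and excludes the end: only position 2 (label 3).
by_position = adults.iloc[2:3]

print(by_label)
print(by_position)
```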
Grouping and aggregation
The groupby function groups the rows by the values of the column you specify. Calling an aggregation function on the returned groupby object gives you the mean, maximum, median and so on for each group.
Details: API Reference, pandas 0.20.3 documentation #groupby
# The Survived column holds 0/1 to show whether the passenger survived.
# Take the mean of each column for those who survived and those who did not.
titanic_df.groupby(['Survived']).mean()
Survived | Pclass | Age |
---|---|---|
0 | 2.5 | 23.500000 |
1 | 2.0 | 32.333333 |
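※ Aggregation functions only aggregate the columns they can. (Here the string column Sex is not aggregated.) This is the behavior of the pandas 0.20 used in this post; in pandas 2.0 and later, mean() raises a TypeError on string columns unless you pass numeric_only=True.
Calling reset_index() turns the result back into an ordinary table, which makes it easier to keep working with.
titanic_df.groupby(['Survived']).mean().reset_index()
| | Survived | Pclass | Age |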
---|---|---|---|
0 | 0 | 2.5 | 23.500000 |
1 | 1 | 2.0 | 32.333333 |
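A sketch of the same aggregation written so that it also runs on newer pandas, plus several statistics at once with `agg`. The names `numeric_means` and `age_summary` are made up for illustration.

```python
import pandas as pd

titanic_df = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 0, 1, 0],
    "Pclass": [1, 2, 2, 2, 3, 3, 3],
    "Sex": ["male", "female", "female", "male", "female", "female", "male"],
    "Age": [80.0, 4.0, 24.0, 37.0, 11.0, 13.0, 22.0],
})

# numeric_only=True skips the string column Sex explicitly.
numeric_means = titanic_df.groupby("Survived").mean(numeric_only=True)
print(numeric_means)

# Pick one column and compute several statistics per group.
age_summary = (
    titanic_df.groupby("Survived")["Age"]
    .agg(["mean", "max", "median"])
    .reset_index()
)
print(age_summary)
```

Selecting the column before aggregating keeps the result small and makes the intent clear.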
Adding a new column
If you want more detail, the page linked here is a helpful reference.
Adding a constant value
How to add a column named "One" that simply holds 1. I mostly use assign.
titanic_df.assign(
    One = 1
)
| | Survived | Pclass | Sex | Age | One |
---|---|---|---|---|---|
0 | 1 | 1 | male | 80.0 | 1 |
1 | 1 | 2 | female | 4.0 | 1 |
2 | 0 | 2 | female | 24.0 | 1 |
3 | 0 | 2 | male | 37.0 | 1 |
4 | 0 | 3 | female | 11.0 | 1 |
5 | 1 | 3 | female | 13.0 | 1 |
6 | 0 | 3 | male | 22.0 | 1 |
Creating a new column from another column
Derive from Age a column "IsChild" that shows whether the passenger is under 20.
titanic_df.assign(
    IsChild = titanic_df['Age'] < 20
)
| | Survived | Pclass | Sex | Age | IsChild |
---|---|---|---|---|---|
0 | 1 | 1 | male | 80.0 | False |
1 | 1 | 2 | female | 4.0 | True |
2 | 0 | 2 | female | 24.0 | False |
3 | 0 | 2 | male | 37.0 | False |
4 | 0 | 3 | female | 11.0 | True |
5 | 1 | 3 | female | 13.0 | True |
6 | 0 | 3 | male | 22.0 | False |
True / False is often awkward to work with, so to store it as True=1, False=0 do the following.
titanic_df.assign(
    IsChild = (titanic_df['Age'] < 20).astype(int)
)
| | Survived | Pclass | Sex | Age | IsChild |
---|---|---|---|---|---|
0 | 1 | 1 | male | 80.0 | 0 |
1 | 1 | 2 | female | 4.0 | 1 |
2 | 0 | 2 | female | 24.0 | 0 |
3 | 0 | 2 | male | 37.0 | 0 |
4 | 0 | 3 | female | 11.0 | 1 |
5 | 1 | 3 | female | 13.0 | 1 |
6 | 0 | 3 | male | 22.0 | 0 |
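If you do not mind overwriting the DataFrame, you can also write it like this.
Note, however, that this style raises SettingWithCopyWarning here, most likely because titanic_df was itself created by selecting columns from another DataFrame.
titanic_df['IsChild'] = (titanic_df['Age'] < 20).astype(int)
titanic_df
| | Survived | Pclass | Sex | Age | IsChild |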
---|---|---|---|---|---|
0 | 1 | 1 | male | 80.0 | 0 |
1 | 1 | 2 | female | 4.0 | 1 |
2 | 0 | 2 | female | 24.0 | 0 |
3 | 0 | 2 | male | 37.0 | 0 |
4 | 0 | 3 | female | 11.0 | 1 |
5 | 1 | 3 | female | 13.0 | 1 |
6 | 0 | 3 | male | 22.0 | 0 |
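One commonly suggested way to avoid the warning is to take an explicit copy at the point where the columns are narrowed down. The sketch below assumes the warning comes from that earlier selection; `full_df` and `selected_df` are names made up for illustration.

```python
import pandas as pd

full_df = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 0, 1, 0],
    "Pclass": [1, 2, 2, 2, 3, 3, 3],
    "Sex": ["male", "female", "female", "male", "female", "female", "male"],
    "Age": [80.0, 4.0, 24.0, 37.0, 11.0, 13.0, 22.0],
    "Fare": [30.0, 39.0, 13.0, 26.0, 31.275, 7.2292, 7.25],
})

# .copy() makes selected_df an independent DataFrame,
# so assigning a new column to it is unambiguous.
selected_df = full_df[["Survived", "Pclass", "Sex", "Age"]].copy()
selected_df["IsChild"] = (selected_df["Age"] < 20).astype(int)

print(selected_df)
print("IsChild" in full_df.columns)  # the original frame is untouched
```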
Creating a new column from several other columns
Let's make a column that adds the Pclass value and the Survived value. (Such a column has no real meaning here, though...)
titanic_df.apply(lambda x: x['Pclass'] + x['Survived'], axis=1)
0 2
1 3
2 2
3 2
4 3
5 4
6 3
dtype: int64
This produces the corresponding Series, so all that is left is to assign it.
titanic_df.assign(
    X=titanic_df.apply(lambda x: x['Pclass'] + x['Survived'], axis=1)
)
| | Survived | Pclass | Sex | Age | IsChild | X |
---|---|---|---|---|---|---|
0 | 1 | 1 | male | 80.0 | 0 | 2 |
1 | 1 | 2 | female | 4.0 | 1 | 3 |
2 | 0 | 2 | female | 24.0 | 0 | 2 |
3 | 0 | 2 | male | 37.0 | 0 | 2 |
4 | 0 | 3 | female | 11.0 | 1 | 3 |
5 | 1 | 3 | female | 13.0 | 1 | 4 |
6 | 0 | 3 | male | 22.0 | 0 | 3 |
Addendum
The version below is faster because it works column by column. (If you want to overwrite in place, assign to titanic_df.loc[:, 'X'].)
The apply version above processes one row at a time, so it takes far longer.
titanic_df.assign(
    X=titanic_df['Pclass'] + titanic_df['Survived']
)
| | Survived | Pclass | Sex | Age | IsChild | X |
---|---|---|---|---|---|---|
0 | 1 | 1 | male | 80.0 | 0 | 2 |
1 | 1 | 2 | female | 4.0 | 1 | 3 |
2 | 0 | 2 | female | 24.0 | 0 | 2 |
3 | 0 | 2 | male | 37.0 | 0 | 2 |
4 | 0 | 3 | female | 11.0 | 1 | 3 |
5 | 1 | 3 | female | 13.0 | 1 | 4 |
6 | 0 | 3 | male | 22.0 | 0 | 3 |
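To see the difference yourself, a rough sketch like the following checks that both versions give the same result and times them. The frame size (the seven rows repeated 500 times) and the helper names `row_wise` and `column_wise` are arbitrary choices for illustration; actual timings depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

small_df = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 0, 1, 0],
    "Pclass": [1, 2, 2, 2, 3, 3, 3],
})
# Repeat the seven rows to get a frame large enough to time.
big_df = pd.concat([small_df] * 500, ignore_index=True)


def row_wise(df):
    # Calls the lambda once per row.
    return df.apply(lambda row: row["Pclass"] + row["Survived"], axis=1)


def column_wise(df):
    # Adds the two columns in a single vectorized operation.
    return df["Pclass"] + df["Survived"]


same_result = bool((row_wise(big_df) == column_wise(big_df)).all())
print("same result:", same_result)
print("row-wise   :", timeit.timeit(lambda: row_wise(big_df), number=3))
print("column-wise:", timeit.timeit(lambda: column_wise(big_df), number=3))
```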
Overwriting only the cells that match a condition
For example, change the cells where IsChild is 1 to 5.
This style sometimes raises SettingWithCopyWarning, but I do not know a better way to write it,
so if you know one, please tell me. (The values themselves do change correctly.)
titanic_df.loc[titanic_df['IsChild'] == 1, ['IsChild']] = 5
titanic_df
| | Survived | Pclass | Sex | Age | IsChild |
---|---|---|---|---|---|
0 | 1 | 1 | male | 80.0 | 0 |
1 | 1 | 2 | female | 4.0 | 5 |
2 | 0 | 2 | female | 24.0 | 0 |
3 | 0 | 2 | male | 37.0 | 0 |
4 | 0 | 3 | female | 11.0 | 5 |
5 | 1 | 3 | female | 13.0 | 5 |
6 | 0 | 3 | male | 22.0 | 0 |
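Two possible alternatives, offered as a sketch rather than a definitive answer: build a new frame with `Series.mask`, or run the same `.loc` assignment on a DataFrame you explicitly copied. The names `base_df`, `replaced_df` and `own_df` are made up for illustration.

```python
import pandas as pd

base_df = pd.DataFrame({
    "Age": [80.0, 4.0, 24.0, 37.0, 11.0, 13.0, 22.0],
    "IsChild": [0, 1, 0, 0, 1, 1, 0],
})

# Option 1: mask() replaces values where the condition is True
# and returns a new Series, so base_df itself is not modified.
replaced_df = base_df.assign(
    IsChild=base_df["IsChild"].mask(base_df["IsChild"] == 1, 5)
)

# Option 2: overwrite in place, but on an explicit copy.
own_df = base_df.copy()
own_df.loc[own_df["IsChild"] == 1, "IsChild"] = 5

print(replaced_df)
print(own_df)
```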
Selecting only rows whose value is in a set or list
Suppose you want the rows where Pclass is 1 or 3.
In that case, use Series.isin.
target_set = set([1, 3])
condition = titanic_df['Pclass'].isin(target_set)
titanic_df[condition]
| | Survived | Pclass | Sex | Age | IsChild |
---|---|---|---|---|---|
0 | 1 | 1 | male | 80.0 | 0 |
4 | 0 | 3 | female | 11.0 | 5 |
5 | 1 | 3 | female | 13.0 | 5 |
6 | 0 | 3 | male | 22.0 | 0 |
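isin also accepts a plain list, and the resulting mask can be negated with `~` to get the rows that are not in the list. A minimal sketch with illustrative variable names:

```python
import pandas as pd

titanic_df = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 0, 1, 0],
    "Pclass": [1, 2, 2, 2, 3, 3, 3],
    "Sex": ["male", "female", "female", "male", "female", "female", "male"],
    "Age": [80.0, 4.0, 24.0, 37.0, 11.0, 13.0, 22.0],
})

in_first_or_third = titanic_df["Pclass"].isin([1, 3])

first_or_third = titanic_df[in_first_or_third]   # Pclass is 1 or 3
everything_else = titanic_df[~in_first_or_third]  # Pclass is neither 1 nor 3

print(first_or_third)
print(everything_else)
```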
Closing
For now I have written down, in one go, the operations I use all the time.
Pandas has many different ways of writing the same thing (´・ω・｀)
There are plenty of other ways to write what I showed here, but I deliberately left out the ones I think are better not used. Some of the ways I write things may not be great either.
I suppose everyone finds their own best practices as they keep using it.
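Bonus
With Pandas, one wrong step in how you write something can make the computation dramatically slower.
For table computations, it is important to think in terms of transforming whole columns rather than doing something to each row.
Passing axis=1 as an argument means you are operating row by row, so
if you see it in your own code, consider whether there is another way to write it.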