Speeding up a pandas.DataFrame for loop 300x with laid-back tweaks
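Claim: save optimization as a treat for the very end.
Fiddling around to make something faster, feeling pleased with myself, and then hardly ever using it again... that is my everyday life.
Plenty of people have said it already, but optimization really can wait until the very end... How many dozens of times have I regretted doing it early? (And yes, I will do it again.)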
In Python, borrowing the power of compile-it tools such as numba, cython, or swig can easily make even the exact same algorithm faster by a factor on the order of 100.
However, that approach throws away (for the code in question) what is good about Python as a laid-back interpreted language. In the end you are simply selling your soul to C/C++.
I happen to enjoy selling my soul, so that is fine by me, but doing so sacrifices Python's other kind of speed, namely its high productivity.
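Even if the code runs faster, once productivity drops the total of development time + execution time often ends up worse. I manage to make it worse almost every time.
Given that downside, optimization pays off most when it costs almost no productivity, which means doing it as late as possible.
In that respect, numpy and similar libraries sell their soul at a really well-chosen granularity, so users get the speed with almost no loss of productivity. That balance is excellent, and anyone selling their soul would do well to aim for it.
(That said, once you get a speedup of 1000x or more and a job of several hours shrinks to a few seconds, you start enjoying optimization for its own sake and get dragged into the optimization swamp ^^)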
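Main topic: speeding up a pandas.DataFrame for loop 300x with laid-back tweaks
On to the main topic. In line with the above, I will try a laid-back optimization that sells as little soul as possible (== does not give up Python's flexibility).
People often say Python's for statement is slow and should be avoided, but personally I cannot agree.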
The for statement is not the villain; writing slow code inside it is.
The timing of a bare for loop appears further down, and I do not think it is slow compared with other things Python does.
Now, suppose we have the following DataFrame.
    import pandas as pd

    df = pd.DataFrame({"a": list(range(500000))})
    df.shape  # (500000, 1)
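The variable df is a pandas.DataFrame with 500,000 rows.
The scenario assumed here is one where the conditional logic is so convoluted that you have chosen to use a for loop.
Suppose also that, to compute something, you want to read column a of df one row at a time, in order. (Let's make that the test.)
As for timing: writing %%time everywhere is a chore, so the numbers are recorded automatically by ExecuteTime from Jupyter's nbextensions. I suspect it adds a little overhead, so they may be a few ms on the slow side.
Caching is ignored. Some cases may have come out faster because a cache kicked in, but for now I am counting that as part of their real ability (??).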
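For readers without the ExecuteTime extension, a comparable wall-clock measurement can be sketched with the standard library. This is only an illustration: the helper names `time_loop` and `walk_column` are hypothetical, and the row count is cut down from 500,000 so that it finishes quickly. Absolute numbers will differ from machine to machine.

```python
import time

import pandas as pd


def time_loop(func, *args):
    """Run func once and return (result, elapsed seconds) from a wall-clock timer."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


def walk_column(frame):
    """Read column 'a' one element at a time and count the reads."""
    count = 0
    for idx in range(frame.shape[0]):
        frame["a"].iloc[idx]  # value is discarded; only the access cost matters
        count += 1
    return count


# Assumption: 2,000 rows instead of 500,000, purely to keep the sketch fast.
small_df = pd.DataFrame({"a": list(range(2000))})
rows_read, seconds = time_loop(walk_column, small_df)
print(f"read {rows_read} rows in {seconds * 1000:.1f} ms")
```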
The versions used were:
python 3.6.3
pandas 0.22.0
The measurements were taken on a not particularly fast laptop, so the times are on the slow side.
(a) The naive, or perhaps official (??), implementation -> 27 s
    for idx, row in df.iterrows():
        row.a  # 27 s
(a) row is a pandas.Series, a fairly heavyweight object. df.iterrows() creates 500,000 of them in total, so a lot of slow memory allocation is probably going on.
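To see what is being built on each iteration, here is a small check on a five-row frame (the name `df_small` is made up for this sketch). It only confirms the type of `row`; it does not measure allocation cost.

```python
import pandas as pd

df_small = pd.DataFrame({"a": list(range(5))})

row_types = set()
values_seen = []
for idx, row in df_small.iterrows():
    # iterrows() hands back a (label, Series) pair for every row.
    row_types.add(type(row))
    values_seen.append(int(row.a))

print(row_types)
print(values_seen)
```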
(b) Putting in a little more thought -> 10 s
    for idx in range(df.shape[0]):
        df.a.iloc[idx]  # 10 s
(b) This avoids creating a Series per row, but the index lookup and exception handling inside iloc seem to be heavy (a guess; I have not read the source).
Also, over 500,000 iterations, the __getattr__ quietly at work behind df.a may add a tiny bit of load.
(b') Based on (b), but with __getattr__ taken almost entirely out of play -> 6.4 s
    df_a = df.a
    for idx in range(df.shape[0]):
        df_a.iloc[idx]  # 6.4 s
(b') Huh!? __getattr__ mattered more than I expected. I surprised myself.
...or rather, DataFrame's overridden __getattr__ must be heavyweight. Attribute access is not normally that slow...
So, a short digression to measure attribute access on a plain class -> 75 ms
    class A:
        def __init__(self):
            self.a = 0

    InsA = A()
    for _ in range(500000):
        InsA.a  # 75 ms
As expected! So it is only the work done inside pandas DataFrame's __getattr__ that is heavy. You would never notice that...
(b'') Additionally hoisting .iloc (an ordinary function call) into a single lookup -> 6.2 s
    df_a_iloc = df.a.iloc
    for idx in range(df.shape[0]):
        df_a_iloc[idx]  # 6.2 s
(b'') Almost no change from (b'). Compared with the __getattr__ load, the cost of the function call is negligible.
(b''') The version of (b) that uses df["a"] instead of df.a -> 8.9 s
Going back to the comparison with (b) for a moment: which is faster, accessing df.a or accessing df["a"]?
    for idx in range(df.shape[0]):
        df["a"].iloc[idx]  # 8.9 s
(b''') So access through __getitem__ (df["a"]) is slightly faster than through __getattr__ (df.a, which took 10 s in (b)). Huh.
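To isolate just the column lookup from the iloc call, the two spellings can be timed on their own with `timeit`. This is a rough sketch: `df_small` and the repeat count are arbitrary choices, and the relative ordering may differ on other pandas versions, so no expected numbers are given.

```python
import timeit

import pandas as pd

df_small = pd.DataFrame({"a": list(range(1000))})

# Time only the column lookup, repeated 2,000 times each.
attr_seconds = timeit.timeit(lambda: df_small.a, number=2000)
item_seconds = timeit.timeit(lambda: df_small["a"], number=2000)

print(f'df.a    : {attr_seconds * 1000:.2f} ms')
print(f'df["a"] : {item_seconds * 1000:.2f} ms')
```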
(c) Ditching pandas and escaping to numpy -> 2.5 s
    for idx in range(df.shape[0]):
        df.a.values[idx]  # 2.5 s
(c) A fine example of messing with pandas and then running off to numpy after all.
Once values has turned the column into a numpy.array, indexing is close to raw access, so it is fast.
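A quick check of what `.values` hands back, using a made-up five-row frame. Newer pandas releases (0.24 and later) also offer `Series.to_numpy()`, but the article's 0.22.0 does not, so the sketch sticks to `.values`.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({"a": list(range(5))})

arr = df_small.a.values  # a plain numpy.ndarray, no pandas index attached
element = arr[3]         # positional indexing straight into the array

print(type(arr))
print(element)
```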
(c') Based on (c), but avoiding the __getattr__ hell described above -> 0.45 s (448 ms)
    df_a = df.a
    for idx in range(df.shape[0]):
        df_a.values[idx]  # 0.45 s (448 ms)
(c'') From (c'), also cutting the __getattr__ for .values -> 0.09 s (90 ms) <- this is the one, success
    a = df.a.values
    for idx in range(df.shape[0]):
        a[idx]  # 0.09 s (90 ms)
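In practice the loop body would hold the convoluted branching that motivated a for loop in the first place. A minimal sketch of the (c'') pattern with a toy branch follows; the conditions and labels are invented for illustration and stand in for whatever real logic is needed.

```python
import pandas as pd

df_small = pd.DataFrame({"a": list(range(10))})

a = df_small.a.values  # hoisted once, outside the loop
labels = []
for idx in range(a.shape[0]):
    value = a[idx]
    # Hypothetical stand-in for the "convoluted" conditional logic.
    if value % 3 == 0:
        labels.append("fizz")
    elif value > 6:
        labels.append("big")
    else:
        labels.append("other")

print(labels)
```

The loop stays ordinary Python, so the branching can be edited freely; only the column lookup was moved out.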
(Reference) An empty loop of 500,000 iterations -> 0.03 s (35 ms)
For reference, the empty loop below takes 35 ms, so that is probably the floor for laid-back optimization.
(I also tried __getitem__ and take in place of values[ ]; the former took 120 ms and the latter 350 ms, both slower. I wonder why. How are they implemented?)
    for _ in range(500000):
        pass  # 0.03 s (35 ms)
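The exact code behind the __getitem__ and take measurements is not shown, so the following is only a guess at the three spellings being compared. It checks that they return the same element and makes no claim about their relative speed.

```python
import numpy as np

a = np.arange(10)
idx = 3

by_index = a[idx]               # subscript syntax
by_dunder = a.__getitem__(idx)  # explicit method call (assumed form)
by_take = a.take(idx)           # ndarray.take with a scalar index (assumed form)

print(by_index, by_dunder, by_take)
```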
Summary
- Without bringing in numba or cython, the loop went from 27000 ms (27 s) to 90 ms, a 300x speedup.
- The for statement is not bad. Writing slow operations inside the for statement is what's bad.
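To reproduce the comparison outside a notebook, the three main variants can be wrapped in functions and timed with `timeit`. This is a sketch under assumptions: the function names are made up, each loop sums the column so that the results can be compared for equality, and the frame has 2,000 rows rather than 500,000, so the ratios will not match the article's figures.

```python
import timeit

import pandas as pd

df_small = pd.DataFrame({"a": list(range(2000))})


def case_a(frame):
    """(a) iterrows: builds a Series per row."""
    total = 0
    for idx, row in frame.iterrows():
        total += row.a
    return total


def case_b(frame):
    """(b) column lookup plus iloc on every iteration."""
    total = 0
    for idx in range(frame.shape[0]):
        total += frame.a.iloc[idx]
    return total


def case_c2(frame):
    """(c'') numpy array hoisted out of the loop."""
    a = frame.a.values
    total = 0
    for idx in range(a.shape[0]):
        total += a[idx]
    return total


timings = {
    fn.__name__: timeit.timeit(lambda fn=fn: fn(df_small), number=1)
    for fn in (case_a, case_b, case_c2)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.1f} ms")
```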
A final note
With pandas, it is best to get by with code that avoids for loops wherever possible. (Basically, use apply or groupby.)
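As a small illustration of that advice, here are one `apply` call and one `groupby` call on a toy frame. The transformation and the grouping key are invented for the example.

```python
import pandas as pd

df_small = pd.DataFrame({"a": list(range(6))})

# apply: double the even values, leave the odd ones alone.
doubled_if_even = df_small["a"].apply(lambda v: v * 2 if v % 2 == 0 else v)

# groupby: sum column a separately for even (key 0) and odd (key 1) values.
group_sums = df_small.groupby(df_small["a"] % 2)["a"].sum()

print(doubled_if_even.tolist())
print(group_sums.to_dict())
```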