Another tool that deserves mention: Dask. The idea: recreate the API from Pandas (and NumPy and Scikit-Learn) as much as possible, but do lazy evaluation, and allow distributed computation (like Spark). As a demo, let's recreate the Most-Viewed Wikipedia Pages solution in Dask⦠[Complete code: dask_wikipedia.py] Dask Basic setup: import the Dask DataFrame functionality, and create a client to work
ã¯ããã« ãã®è¨äºã¯ï¼Kaggle Advent Calendar 2022第6æ¥ç®ã®è¨äºã«ãªãã¾ãã æ¬è¨äºã§ã¯ã 32GBè¶ ã®CSVãã¼ã¿ã®åºæ¬çµ±è¨éããå°è¦æ¨¡ãã·ã³ã§ãçã¡ã¢ãªãã¤é«éã«è¨ç®ãããã¯ãã㯠ã«ã¤ãã¦è§£èª¬ãã¾ãã Kaggleã³ã³ãã«éããã ãã·ã³ã¹ããã¯ãä½ãããã大ããªãã¼ã¿ã»ãããæºè¶³ã«å¦çã§ããå°ã£ã¦ãã æ¯åè¡ããã¡ã¤ã«èªã¿è¾¼ã¿ãé ãã®ã§ããã£ã¨é«éåããã â¡ ã¨ãã£ãæ©ã¿ã課é¡ãæ±ãã¦ããæ¹ã®åèã«ãªãã°å¹¸ãã§ãã ã¢ããã¼ã·ã§ã³ ãã¼ã¿åææ¥åãKaggleçã®ã³ã³ããã£ã·ã§ã³ã§åãã¦ã®ãã¼ã¿ã»ãããæ±ãå ´åããããªãæ©æ¢°å¦ç¿ã¢ã«ã´ãªãºã ãè¡ããã¨ã¯ã¾ãç¡ããæåã«ãã¼ã¿è¦³å¯ãè¡ãã®ãä¸è¬çã§ãã ãã¼ãã«ãã¼ã¿ã§ããã°ãåã«ã©ã ã®åºæ¬çµ±è¨éï¼æå°å¤ãæ大å¤ãå¹³åãåæ£ãååä½æ°ï¼ãªã©ãè¨ç®ã»å¯è¦åãããã¼ã¿ã¯ã¬ã³ã¸ã³ã°ã®è¦å¦ãç¹å¾´éè¨è¨ã®æ¹éãªã©ãæ¤
ãªãªã¼ã¹ãé害æ å ±ãªã©ã®ãµã¼ãã¹ã®ãç¥ãã
ææ°ã®äººæ°ã¨ã³ããªã¼ã®é ä¿¡
å¦çãå®è¡ä¸ã§ã
j次ã®ããã¯ãã¼ã¯
kåã®ããã¯ãã¼ã¯
lãã¨ã§èªã
eã³ã¡ã³ãä¸è¦§ãéã
oãã¼ã¸ãéã
{{#tags}}- {{label}}
{{/tags}}