The other day, @hurutoriya posted the following article.
Afterwards, while we were going back and forth on Twitter, this exchange came up.
Thanks for pointing that out!
I think you're exactly right, so I've changed it ☺️
I really appreciate the feedback https://t.co/S1MDDovpUj
I recently learned about https://t.co/2ByvPpj2dQ from one of shiumachi's tweets, so next I'll try a pattern that uses it to test whether the schemas are identical before concatenating : )
Shunya Ueta (@hurutoriya), September 11, 2020
At first I thought this would be easy, but while merge is one thing, concat turns out not to be so simple.
Code like this is mostly used during exploratory data analysis, when the goal is to load data quickly and with little effort. The checks therefore have to stay minimal so that this convenience is not lost.
The two big pitfalls when reading a series of CSV files are mismatched columns and mismatched data types. So I wrote a validate function that checks only these two things.
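To see why these two mismatches are worth checking, here is a small sketch of what pd.concat does when nothing validates its inputs. The frames a, b, and c are made up for illustration and stand in for CSV files that were read separately. In both cases concat succeeds without raising, which is what makes the problems easy to miss.

```python
import pandas as pd

# Hypothetical frames standing in for separately loaded CSV files.
a = pd.DataFrame({"c1": [100, 101], "c2": ["a100", "a101"]})
b = pd.DataFrame({"c1": [300, 301], "c3": ["a300", "a301"]})    # has c3 instead of c2
c = pd.DataFrame({"c1": ["400", 401], "c2": ["a400", "a401"]})  # c1 mixes str and int

# Mismatched columns: concat takes the union of the columns and fills the gaps with NaN.
mismatched_columns = pd.concat([a, b], ignore_index=True)

# Mismatched dtypes: the integer column is silently upcast to object.
mismatched_dtypes = pd.concat([a, c], ignore_index=True)

print(mismatched_columns)
print(mismatched_dtypes.dtypes)
```

Neither result is an error from the point of view of pandas, so the mismatch only shows up later, for example when an aggregation on c1 fails or a column is unexpectedly half empty.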
The code is fairly long, so I have put it at the end of this article.
Usage is simple. First, instead of a list of DataFrames, build a list of (pathlib.Path, df) tuples.
data = [(path, pd.read_csv(str(path))) for path in pathlib.Path(f_path).glob('*.csv')]
Then just run it through validate(data) and pass the result to pd.concat.
pd.concat(validate(data))
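The two steps above can be tried end to end with the following sketch. It assumes a temporary directory as f_path and two made-up CSV files, and it inlines a condensed copy of the validate function from the full listing at the end of this article so that the snippet runs on its own. The sorted() call is an addition for reproducibility, since glob() does not guarantee any particular order.

```python
import pathlib
import tempfile

import pandas as pd


def validate(data):
    # Condensed from the full listing at the end of the article:
    # compare each (path, df) pair with the pair that follows it.
    for (xp, xdf), (yp, ydf) in zip(data, data[1:]):
        if xdf.columns.tolist() != ydf.columns.tolist():
            raise ValueError(f"ambiguous columns: {xp}, {yp}")
        for xd, yd in zip(xdf.dtypes, ydf.dtypes):
            if xd != yd:
                raise ValueError(f"inconsistent dtypes: {xd}, {yd} in {xp}, {yp}")
    return [df for _, df in data]


with tempfile.TemporaryDirectory() as f_path:
    # Two hypothetical CSV files that share the same schema.
    pathlib.Path(f_path, "file1.csv").write_text("c1,c2\n100,a100\n101,a101\n")
    pathlib.Path(f_path, "file2.csv").write_text("c1,c2\n200,a200\n202,a202\n")

    # Build the list of (path, df) tuples in a deterministic order.
    data = [
        (path, pd.read_csv(path))
        for path in sorted(pathlib.Path(f_path).glob("*.csv"))
    ]
    combined = pd.concat(validate(data), ignore_index=True)

print(combined)
```

Because both files have the same columns and dtypes, validate returns the DataFrames unchanged and the concatenation proceeds as usual.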
If the columns do not match, you get an error like the following.
ValueError: ambiguous columns: file2.csv, file3.csv
If the columns match but the dtypes do not, you get an error like the following.
ValueError: inconsistent dtypes: int64, object in file2.csv, file4.csv
I put this code together quickly, so it may contain mistakes.
If you run into any problems, please feel free to let me know.
Full code (with demo code)
import pathlib
import typing

import pandas as pd

# data definition

## valid data
df1 = pd.DataFrame(
    [
        {"c1": 100, "c2": "a100"},
        {"c1": 101, "c2": "a101"},
    ]
)

## valid data
df2 = pd.DataFrame(
    [
        {"c1": 200, "c2": "a200"},
        {"c1": 202, "c2": "a202"},
    ]
)

## invalid data: ambiguous column names
df3 = pd.DataFrame(
    [
        {"c1": 300, "c3": "a300"},
        {"c1": 301, "c3": "a301"},
    ]
)

## invalid data: inconsistent dtypes
df4 = pd.DataFrame(
    [
        {"c1": "400", "c2": "a400"},
        {"c1": 401, "c2": "a401"},
    ]
)

## dataset test case 1: ambiguous column names
data1 = [
    (pathlib.Path("file1.csv"), df1),
    (pathlib.Path("file2.csv"), df2),
    (pathlib.Path("file3.csv"), df3),
]

## dataset test case 2: inconsistent dtypes
data2 = [
    (pathlib.Path("file1.csv"), df1),
    (pathlib.Path("file2.csv"), df2),
    (pathlib.Path("file4.csv"), df4),
]


def validate(
    data: typing.Sequence[typing.Tuple[pathlib.Path, pd.DataFrame]]
) -> typing.Sequence[pd.DataFrame]:
    """simple data validation

    :param data: [(path, df)]
    :return: [df]
    """
    # compare each (path, df) pair with the pair that follows it
    for x, y in zip(data, data[1:]):
        # column names must match exactly, including their order
        if x[1].columns.tolist() != y[1].columns.tolist():
            raise ValueError(f"ambiguous columns: {x[0]}, {y[0]}")
        # columns match, so compare the dtypes position by position
        for xd, yd in zip(x[1].dtypes, y[1].dtypes):
            if xd != yd:
                raise ValueError(f"inconsistent dtypes: {xd}, {yd} in {x[0]}, {y[0]}")
    return [x[1] for x in data]


print("### validate ambiguous columns demo ###")
try:
    pd.concat(validate(data1))
except ValueError as ve:
    print(ve)

print("### validate inconsistent dtypes demo ###")
try:
    pd.concat(validate(data2))
except ValueError as ve:
    print(ve)
Acknowledgments
@aodag (thank you for showing me the elegant way to write this with zip)