In machine learning with neural networks, preparing and loading data looks easy but is actually hard.
- Downloading and extracting the corpus
- Preprocessing the data and saving it
- Setting up the dataset (a subset of the processed corpus)
- Loading each datum
- Access order and batching
- Distributed datasets
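There is a lot to think about.
Loading too much data from the file system into memory exhausts memory, and leisurely redoing the preprocessing every time takes all day.
When training in parallel, the data also has to be distributed appropriately.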
Definitions: from corpus to batch
A corpus is preprocessed into a dataset, and the dataset is transformed into batch tensors.
        Preprocess           Transform
Corpus -----------> Dataset -----------> Batch
Corpus :: [any] = {i0, i1, ...}
Preprocess :: (i ∈ C' ⊂ Corpus) -> (x: any)
Dataset :: [any] = {Preprocess(i) | i ∈ C' ⊂ Corpus}
Transform :: (x ∈ Dataset) -> (x' :: T)
Batch :: [T] = {Transform(x) | x ∈ D' ⊂ Dataset}
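The definitions above can be sketched in plain Python. This is only an illustration of the data flow: the names `corpus`, `preprocess`, `transform`, and `make_batch`, and the toy data, are hypothetical and do not come from any library.

```python
# Corpus: raw items as distributed (toy strings here).
corpus = ["1 2 3", "4 5", "6 7 8 9 10"]


def preprocess(item: str) -> list[int]:
    """Ahead-of-time conversion: parse once; the result could be saved to disk."""
    return [int(tok) for tok in item.split()]


def transform(x: list[int], length: int = 4) -> list[int]:
    """On-the-fly conversion: truncate or zero-pad so every datum has the same length."""
    return (x + [0] * length)[:length]


# Dataset = {Preprocess(i) | i in C'}, with C' = the whole corpus here.
dataset = [preprocess(item) for item in corpus]


def make_batch(indices: list[int]) -> list[list[int]]:
    """Batch = {Transform(x) | x in D'}, with D' chosen by `indices`."""
    return [transform(dataset[i]) for i in indices]


print(make_batch([0, 2]))  # [[1, 2, 3, 0], [6, 7, 8, 9]]
```

The preprocessed items have lengths 3, 2, and 5; only after `transform` do they share one shape, which is what lets them form a rectangular batch.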
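Why split it this way
Corpus: the data source (as distributed).
Dataset: the data after ahead-of-time conversion (preprocessing) into various features.
Batch: what you get after on-the-fly conversion by Transform and unification of the data type (tensorization).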
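Conversion: ahead of time (preprocess) / on the fly (transform)
On-the-fly conversion can produce different data on every access (e.g., clipping).
Even if the data produced by preprocessing does not share one type (for example, different lengths), on-the-fly conversion can still turn it into tensors (e.g., data with T=100 and T=120 are each clipped on the fly to a random segment of length=40).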
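The T=100 / T=120 example can be sketched as follows. `random_clip` is a hypothetical helper written for this note, and the `torch.arange` tensors stand in for real preprocessed data.

```python
import torch


def random_clip(x: torch.Tensor, length: int = 40) -> torch.Tensor:
    """Return a random contiguous segment of `length` along dim 0."""
    # The upper bound of torch.randint is exclusive, hence the +1.
    start = int(torch.randint(0, x.size(0) - length + 1, (1,)))
    return x[start:start + length]


# Preprocessed data with different lengths (T=100 and T=120).
dataset = [torch.arange(100.0), torch.arange(120.0)]

# Clipping each datum on the fly makes the shapes match, so they can be stacked.
batch = torch.stack([random_clip(x) for x in dataset])
print(batch.shape)  # torch.Size([2, 40])
```

Because the segment is drawn again on every call, the same datum yields a different clip each epoch, which is the "different data each time" property mentioned above.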
PyTorch
- Dataset: an object from which individual datums can be retrieved
- DataLoader: an iterable that can pull data out of a Dataset under various settings
Dataset
A thin data wrapper built on the Dataset base class. A collection of data is represented in one of two ways:
- Map-style datasets (Dataset): a Python sequence/mapping (__getitem__)
- Iterable-style datasets (IterableDataset): a Python iterable (__iter__)
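A minimal sketch of the two styles, assuming toy classes `SquaresMap` and `SquaresIter` (both hypothetical names). A map-style dataset supports random access by index, while an iterable-style dataset only supports sequential access and implements `__iter__`.

```python
from torch.utils.data import Dataset, IterableDataset


class SquaresMap(Dataset):
    """Map-style: random access through __getitem__ and __len__."""

    def __init__(self, n: int):
        self.n = n

    def __len__(self) -> int:
        return self.n

    def __getitem__(self, idx: int) -> int:
        if not 0 <= idx < self.n:
            raise IndexError(idx)
        return idx * idx


class SquaresIter(IterableDataset):
    """Iterable-style: sequential access only, through __iter__."""

    def __init__(self, n: int):
        self.n = n

    def __iter__(self):
        return (i * i for i in range(self.n))


map_ds = SquaresMap(4)
iter_ds = SquaresIter(4)
print(map_ds[2])      # 4
print(list(iter_ds))  # [0, 1, 4, 9]
```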
Dataset is an abstract class; whoever implements a dataset writes the loading logic and so on for that dataset.
As utilities, PyTorch provides dataset concatenation (ConcatDataset for map-style datasets / ChainDataset for iterable-style datasets) and splitting (Subset / random_split).
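A short sketch of these utilities on toy data. `TensorDataset` is used only to get small map-style datasets quickly, and `Count` is a hypothetical iterable-style dataset defined for this example.

```python
import torch
from torch.utils.data import (
    ChainDataset,
    ConcatDataset,
    IterableDataset,
    Subset,
    TensorDataset,
    random_split,
)

a = TensorDataset(torch.arange(6))
b = TensorDataset(torch.arange(6, 10))

# Concatenate map-style datasets.
both = ConcatDataset([a, b])
print(len(both))  # 10

# Take a fixed subset by index.
head = Subset(both, [0, 1, 2])
print(len(head))  # 3

# Split randomly into train/validation; the seeded generator makes it repeatable.
train, val = random_split(both, [8, 2], generator=torch.Generator().manual_seed(0))
print(len(train), len(val))  # 8 2


class Count(IterableDataset):
    """Hypothetical iterable-style dataset yielding start..stop-1."""

    def __init__(self, start: int, stop: int):
        self.start, self.stop = start, stop

    def __iter__(self):
        return iter(range(self.start, self.stop))


# Chain iterable-style datasets one after another.
chained = ChainDataset([Count(0, 2), Count(2, 4)])
print(list(chained))  # [0, 1, 2, 3]
```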
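The best practice is to put the corpus download and the preprocessing (which I call "datasetization") in the dataset's __init__.
Simply instantiating the dataset then gives you the data, which also makes it highly reusable.
Functions that should run on the fly at feed time, augmentation being the prime example, are implemented inside the dataset as a transform.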
When the data is actually loaded depends on the implementation.
You can load everything into memory in __init__ for fast access, or read each datum on demand in __getitem__ to save memory.
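The following sketch shows both loading strategies in one class, together with an on-the-fly transform. `ToyDataset`, its `preload` flag, and `_read` are hypothetical; `_read` stands in for whatever download, preprocessing, or file reading a real dataset would do.

```python
from typing import Callable, Optional

import torch
from torch.utils.data import Dataset


class ToyDataset(Dataset):
    """Hypothetical dataset; `_read` is a placeholder for reading one item from disk."""

    def __init__(self, n: int, transform: Optional[Callable] = None, preload: bool = True):
        self.n = n
        self.transform = transform
        # Eager: pay the loading cost once and keep everything in memory.
        self.cache = [self._read(i) for i in range(n)] if preload else None

    def _read(self, idx: int) -> torch.Tensor:
        # Placeholder for download / preprocessing / file reading.
        return torch.full((3,), float(idx))

    def __len__(self) -> int:
        return self.n

    def __getitem__(self, idx: int) -> torch.Tensor:
        # Lazy: when nothing is cached, read the datum only when it is requested.
        x = self.cache[idx] if self.cache is not None else self._read(idx)
        # The on-the-fly step (e.g., augmentation) runs on every access.
        return self.transform(x) if self.transform else x


eager = ToyDataset(4, transform=lambda x: x * 2)
lazy = ToyDataset(4, transform=lambda x: x * 2, preload=False)
print(eager[1])  # tensor([2., 2., 2.])
print(lazy[1])   # tensor([2., 2., 2.])
```

Both variants return the same values; they differ only in when the cost of `_read` is paid and how much memory is held.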
Implementation: torch.utils.data.Dataset
Dataset is essentially a mapping-like type with __add__ mixed in.
So to use it, subclass Dataset and implement __getitem__, which retrieves a datum, and __len__, which returns the total length.
__add__ defines how Datasets are combined: a + b returns ConcatDataset([a, b]).
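A minimal sketch of a subclass and of the `+` operator. `RangeDataset` is a hypothetical class written for this note.

```python
from torch.utils.data import ConcatDataset, Dataset


class RangeDataset(Dataset):
    """Hypothetical map-style dataset over the integers start..stop-1."""

    def __init__(self, start: int, stop: int):
        self.items = list(range(start, stop))

    def __getitem__(self, idx: int) -> int:
        return self.items[idx]

    def __len__(self) -> int:
        return len(self.items)


a = RangeDataset(0, 2)    # 0, 1
b = RangeDataset(10, 13)  # 10, 11, 12

merged = a + b  # Dataset.__add__ builds a ConcatDataset
print(isinstance(merged, ConcatDataset))  # True
print(len(merged))                        # 5
print(merged[3])                          # 11
```

Index 3 of the merged dataset falls past the two items of `a`, so it resolves to `b[1]`.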
DataLoader
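A DataLoader combines two pieces: supplying an index set (sampler), and assembling the datums selected by that index set into one set (collate_fn).
By customizing the sampler, you can emit the same index more than once or bias the order in which indices are drawn (as in weighted random sampling).
By customizing collate_fn, you can sort within a batch, reshape the batch however you like, or modify the data itself.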
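Both hooks can be sketched together. `VarLen` and `sort_and_pad` are hypothetical names for this example, and the weights are arbitrary. `WeightedRandomSampler` biases which indices are drawn, and the custom `collate_fn` sorts each batch by length and zero-pads it.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset, WeightedRandomSampler


class VarLen(Dataset):
    """Hypothetical dataset of 1-D tensors with lengths 1 to 4."""

    def __init__(self):
        self.items = [torch.ones(n) for n in (1, 2, 3, 4)]

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, idx: int) -> torch.Tensor:
        return self.items[idx]


def sort_and_pad(samples):
    """Sort the batch by length (longest first), then zero-pad to a rectangle."""
    samples = sorted(samples, key=len, reverse=True)
    lengths = torch.tensor([len(s) for s in samples])
    return pad_sequence(samples, batch_first=True), lengths


# Index 3 is drawn far more often than the others; repeats are allowed.
sampler = WeightedRandomSampler(
    weights=[1.0, 1.0, 1.0, 10.0], num_samples=8, replacement=True
)

loader = DataLoader(VarLen(), batch_size=4, sampler=sampler, collate_fn=sort_and_pad)
for padded, lengths in loader:
    print(padded.shape, lengths.tolist())
```

Note that `sampler` and `shuffle=True` cannot be combined, so `shuffle` is left at its default here.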
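By default, sampling follows the shuffle flag (random when it is set), and collate_fn converts the data to Tensors and stacks them into a batch.
The default collate_fn is convenient because it converts to Tensors automatically, but that can also look like mysterious behavior, so keep it in the back of your mind.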
It automatically converts NumPy arrays and Python numerical values into PyTorch Tensors.
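A small sketch of that automatic conversion, assuming a hypothetical `Pairs` dataset that returns a NumPy array and a Python int. No `collate_fn` is passed, so the default one is used.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class Pairs(Dataset):
    """Hypothetical dataset returning (NumPy array, Python int)."""

    def __len__(self) -> int:
        return 4

    def __getitem__(self, idx: int):
        return np.full(3, idx, dtype=np.float32), idx


# shuffle defaults to False, so the first batch holds indices 0 and 1.
features, labels = next(iter(DataLoader(Pairs(), batch_size=2)))
print(type(features).__name__, features.shape)  # Tensor torch.Size([2, 3])
print(labels)                                   # tensor([0, 1])
```

Neither the array nor the int was converted by hand; both arrive as Tensors with a leading batch dimension.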