The MNIST handwritten digit data is a dataset of 70,000 handwritten digits from 0 to 9. It is useful for checking machine learning and pattern recognition methods. The data can be downloaded from the following website.
MNIST handwritten digit database, Yann LeCun, Corinna Cortes and Chris Burges
The data is split into 60,000 training samples (train) and 10,000 test samples (t10k). Each set consists of pixel data representing the handwritten digits (images) and label data indicating which digit from 0 to 9 each image shows (labels). All of it is provided in binary format.
In this post, we convert the data from binary format to text format so that its contents can be easily inspected and processed with scripting languages. We also write individual samples out to image files to check what digits are actually written.
Checking the data
First, decompress the compressed files downloaded from the website. The explanation below uses the training data from among the downloadable files, but the same steps apply to the test data.
$ gzip -dc train-images-idx3-ubyte.gz >train-images-idx3-ubyte
$ gzip -dc train-labels-idx1-ubyte.gz >train-labels-idx1-ubyte
According to the description on the MNIST web page, the image file format is as follows. The header area is 16 bytes, followed by 28 * 28 bytes of pixel data for each of the 60,000 images*1. To begin, let's quickly check that the image file really follows this format.
[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000803(2051) magic number
0004     32 bit integer  60000            number of images
0008     32 bit integer  28               number of rows
0012     32 bit integer  28               number of columns
0016     unsigned byte   ??               pixel
0017     unsigned byte   ??               pixel
........
xxxx     unsigned byte   ??               pixel
MNIST handwritten digit database, Yann LeCun, Corinna Cortes and Chris Burges
First, check the file size. From the description above, the file should be 16 + 60,000 * 28 * 28 = 47,040,016 bytes. Checking with the wc command confirms that this is exactly the case.
$ wc -c train-images-idx3-ubyte
47040016 train-images-idx3-ubyte
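The size calculation above can be sketched in shell arithmetic. This is only an illustration: the variable names are made up for this example, and the comparison with the real file runs only if train-images-idx3-ubyte happens to exist in the current directory.

```shell
# Expected size of the image file: header + count * rows * cols bytes.
header_bytes=16
num_images=60000
rows=28
cols=28
expected_size=$((header_bytes + num_images * rows * cols))
echo "$expected_size"    # prints 47040016

# Compare with the actual file only when it is present.
file=train-images-idx3-ubyte
if [ -f "$file" ]; then
    actual_size=$(($(wc -c <"$file")))
    if [ "$actual_size" -eq "$expected_size" ]; then
        echo "size OK"
    else
        echo "size mismatch: $actual_size"
    fi
fi
```

For the test data, setting num_images=10000 gives the corresponding expected size.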
Next, check the values in the header area. Giving the -c option to the head command extracts a given number of bytes rather than lines. od is a command that dumps a file. -t is the option that specifies the output format, and x1 means that the data is split into 1-byte units and displayed in hexadecimal.
$ head -c 16 train-images-idx3-ubyte | od -tx1
0000000 00 00 08 03 00 00 ea 60 00 00 00 1c 00 00 00 1c
0000020
The first 4 bytes are "00 00 08 03", which matches the magic number. The next 4 bytes are "00 00 ea 60", which is indeed 60,000 when converted to decimal. The "00 00 00 1c" values that follow likewise convert to 28 in decimal.
$ printf "%d\n" 0xea60
60000
$ printf "%d\n" 0x1c
28
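Note that although every value in the header area is a 32-bit integer, displaying them with od -td4 does not give the correct values on Intel processors. This is because the data is stored in big-endian order, while Intel processors are little-endian. The web page states this as follows.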
All the integers in the files are stored in the MSB first (high endian) format used by most non-Intel processors. Users of Intel processors and other low-endian machines must flip the bytes of the header.
MNIST handwritten digit database, Yann LeCun, Corinna Cortes and Chris Burges
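One way to read the big-endian header fields regardless of the machine's byte order is to dump the bytes individually and combine them by hand. The following is a minimal sketch under stated assumptions: be32 is a hypothetical helper name, and the 16-byte header is recreated in a temporary file so that the example runs without the MNIST files.

```shell
# be32 (hypothetical helper): print the 32-bit big-endian unsigned integer
# stored at a byte offset.  $1 = file, $2 = byte offset
be32() {
    # Dump 4 bytes as unsigned decimals and load them into $1..$4.
    set -- $(od -An -v -tu1 -j"$2" -N4 "$1")
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Recreate the header shown earlier: 00 00 08 03 00 00 ea 60 00 00 00 1c 00 00 00 1c
tmp=$(mktemp -d)
printf '\000\000\010\003\000\000\352\140\000\000\000\034\000\000\000\034' >"$tmp/header.bin"

magic=$(be32 "$tmp/header.bin" 0)
count=$(be32 "$tmp/header.bin" 4)
nrows=$(be32 "$tmp/header.bin" 8)
ncols=$(be32 "$tmp/header.bin" 12)
echo "$magic $count $nrows $ncols"    # prints 2051 60000 28 28
```

With the real file, be32 train-images-idx3-ubyte 4 should print the number of images in the same way.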
Converting to text format
Now let's convert the pixel data portion of this binary file to text format. With the od command, the conversion is as simple as the following.
$ od -An -v -tu1 -j16 -w784 train-images-idx3-ubyte | sed 's/^ *//' | tr -s ' ' >train-images.txt
The example passes several options to the od command. Each has the following meaning.
- -An: Do not display the offset from the start of the file.
- -v: Do not abbreviate the output even when identical values repeat.
- -tu1: Output the data as unsigned integers in 1-byte units.
- -j16: Skip the first 16 bytes.
- -w784: Break the line every 784 bytes (28 * 28 = 784; this puts the data for one image on one line).
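The effect of these options is easier to see on a tiny file. The sketch below builds a made-up file with a 16-byte header followed by two 2x2 "images" and converts it with the same pipeline, using -w4 instead of -w784. The file names are arbitrary, and -w is assumed to be available as in the command above (it is an extension found in GNU od).

```shell
tmp=$(mktemp -d)

# 16 header bytes followed by two 2x2 "images" of 4 bytes each.
{
    printf '\000\000\010\003\000\000\000\002\000\000\000\002\000\000\000\002'
    printf '\000\377\200\001'    # pixels 0 255 128 1
    printf '\012\024\036\050'    # pixels 10 20 30 40
} >"$tmp/tiny-idx3-ubyte"

# Same pipeline as for MNIST, with one 4-byte image per line.
od -An -v -tu1 -j16 -w4 "$tmp/tiny-idx3-ubyte" |
    sed 's/^ *//' | tr -s ' ' >"$tmp/tiny.txt"

cat "$tmp/tiny.txt"
# prints:
# 0 255 128 1
# 10 20 30 40
```

The sed and tr steps only tidy the padding that od adds around each value, so every line ends up as single-space-separated numbers.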
For the label file, we likewise check the format described on the web page and convert the file accordingly. Here the header area is 8 bytes and each image has 1 byte of data (a value from 0 to 9), so we pass -j8 and -w1 to od. Since there is only one value per image, the whitespace is simply deleted.
$ od -An -v -tu1 -j8 -w1 train-labels-idx1-ubyte | tr -d ' ' >train-labels.txt
To verify the conversion, we compare the contents of the label file with the contents of the pixel data. The first 5 lines of the label file are as follows. This indicates that the first 5 handwritten digits are, in order, 5, 0, 4, 1, and 9.
$ head -n 5 train-labels.txt
5
0
4
1
9
So let's look at the pixel data to see whether those digits are really what was written. Each image is one line of data, so the first image can be checked as follows. The first line is displayed with a line break after every 28 values. The pixel data is grayscale from 0 to 255, but to make the display easier to read, each value is classified as either 0 or at least 1 and shown as a full-width space or a "■" character, respectively.
$ head -n 1 train-images.txt |\
  awk '{ for (i = 1; i <= NF; i++) printf("%s%s", $i > 0 ? "■" : "　", i % 28 ? "" : "\n") }'
　　　　　　　　　　　　　　　　　　　　　　　　　　　　
　　　　　　　　　　　　　　　　　　　　　　　　　　　　
　　　　　　　　　　　　　　　　　　　　　　　　　　　　
　　　　　　　　　　　　　　　　　　　　　　　　　　　　
　　　　　　　　　　　　　　　　　　　　　　　　　　　　
　　　　　　　　　　　　■■■■■■■■■■■■　　　　
　　　　　　　　■■■■■■■■■■■■■■■■　　　　
　　　　　　　■■■■■■■■■■■■■■■■　　　　　
　　　　　　　■■■■■■■■■■■　　　　　　　　　　
　　　　　　　　■■■■■■■　■■　　　　　　　　　　
　　　　　　　　　■■■■■　　　　　　　　　　　　　　
　　　　　　　　　　　■■■■　　　　　　　　　　　　　
　　　　　　　　　　　■■■■　　　　　　　　　　　　　
　　　　　　　　　　　　■■■■■■　　　　　　　　　　
　　　　　　　　　　　　　■■■■■■　　　　　　　　　
　　　　　　　　　　　　　　■■■■■■　　　　　　　　
　　　　　　　　　　　　　　　■■■■■　　　　　　　　
　　　　　　　　　　　　　　　　　■■■■　　　　　　　
　　　　　　　　　　　　　　■■■■■■■　　　　　　　
　　　　　　　　　　　　■■■■■■■■　　　　　　　　
　　　　　　　　　　■■■■■■■■■　　　　　　　　　
　　　　　　　　■■■■■■■■■■　　　　　　　　　　
　　　　　　■■■■■■■■■■　　　　　　　　　　　　
　　　　■■■■■■■■■■　　　　　　　　　　　　　　
　　　　■■■■■■■■　　　　　　　　　　　　　　　　
　　　　　　　　　　　　　　　　　　　　　　　　　　　　
　　　　　　　　　　　　　　　　　　　　　　　　　　　　
　　　　　　　　　　　　　　　　　　　　　　　　　　　　
This is supposed to be a 5. It can be read that way, but this particular digit is somewhat ambiguous. To be sure, let's look at another image. According to the label file, the third image is the digit 4.
$ head -n 3 train-images.txt | tail -n 1 |\
  awk '{ for (i = 1; i <= NF; i++) printf("%s%s", $i > 0 ? "■" : "　", i % 28 ? "" : "\n") }'
　　　　　　　　　　　　　　　　　　　　　　　　　　　　
　　　　　　　　　　　　　　　　　　　　　　　　　　　　
　　　　　　　　　　　　　　　　　　　　　　　　　　　　
　　　　　　　　　　　　　　　　　　　　　　　　　　　　
　　　　　　　　　　　　　　　　　　　　　　　　　　　　
　　　　　　　　　　　　　　　　　　　　■■■　　　　　
　　　　■■　　　　　　　　　　　　　　■■■　　　　　
　　　　■■　　　　　　　　　　　　　■■■■　　　　　
　　　　■■　　　　　　　　　　　　　■■■　　　　　　
　　　　■■　　　　　　　　　　　　　■■■　　　　　　
　　　■■■　　　　　　　　　　　　　■■■　　　　　　
　　　■■■　　　　　　　　　　　　■■■■　　　　　　
　　　■■■　　　　　　　　　　　　■■■■　　　　　　
　　　■■■　　　　　　　　　■■■■■■　　　　　　　
　　　■■■　　　■■■■■■■■■■■■　　　　　　　
　　　■■■■■■■■■■■■■■■■■　　　　　　　　
　　　　■■■■■■■■　　　　　■■■　　　　　　　　
　　　　　　　　　　　　　　　　　■■■　　　　　　　　
　　　　　　　　　　　　　　　　　■■■　　　　　　　　
　　　　　　　　　　　　　　　　　■■■　　　　　　　　
　　　　　　　　　　　　　　　　　■■■　　　　　　　　
　　　　　　　　　　　　　　　　　■■■　　　　　　　　
　　　　　　　　　　　　　　　　　■■■　　　　　　　　
　　　　　　　　　　　　　　　　　■■■　　　　　　　　
　　　　　　　　　　　　　　　　　■■■　　　　　　　　
　　　　　　　　　　　　　　　　　　　　　　　　　　　　
　　　　　　　　　　　　　　　　　　　　　　　　　　　　
　　　　　　　　　　　　　　　　　　　　　　　　　　　　
This one clearly looks like a 4. The conversion looks good.
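The two checks above can be combined into a small helper that prints the label and the picture for any image number. This is a sketch rather than part of the original procedure: show_digit is a hypothetical name, ASCII "#" and "." stand in for "■" and the full-width space, and the sample data is a made-up pair of 2x2 images so that the example runs on its own.

```shell
# show_digit (hypothetical helper): print the label and a picture of image N.
# $1 = image number (1-based), $2 = pixel text file, $3 = label text file,
# $4 = image width in pixels
show_digit() {
    echo "label: $(sed -n "${1}p" "$3")"
    sed -n "${1}p" "$2" |
        awk -v w="$4" '{
            # "#" for any non-zero pixel, "." for zero; newline every w pixels.
            for (i = 1; i <= NF; i++)
                printf("%s%s", ($i > 0) ? "#" : ".", (i % w) ? "" : "\n")
        }'
}

# Made-up sample data: two 2x2 images with labels 1 and 8.
tmp=$(mktemp -d)
printf '0 9 9 0\n7 0 0 7\n' >"$tmp/images.txt"
printf '1\n8\n' >"$tmp/labels.txt"

show_digit 2 "$tmp/images.txt" "$tmp/labels.txt" 2
# prints:
# label: 8
# #.
# .#
```

With the converted MNIST files, show_digit 3 train-images.txt train-labels.txt 28 would display the third image together with its label.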
Converting to image files
Next, let's generate image files from the text-format file so that output like the examples above can be viewed as actual images.
For creating image files from data like this, the Netpbm formats are convenient. A Netpbm file represents an image with a simple header followed by pixel data. The file formats and related details are described on the following web page.
User manual for Netpbm
Netpbm has three formats, PBM, PGM, and PPM. Since we are creating images from grayscale pixel data, we use the PGM format. The PGM format is defined in the following document.
PGM Format Specification
A PNG image file can be generated from the pixel data as follows.
$ head -n 3 train-images.txt | tail -n 1 |\
  awk '
    BEGIN { print "P2 28 28 255" }
    { for (i = 1; i <= NF; i++) printf("%d%s", $i, i % 14 ? " " : "\n") }' |\
  pnmtopng - >image.png
The BEGIN block of the awk script outputs "P2 28 28 255" as the header. Each field has the following meaning.
- P2: Magic number indicating that the file is in Plain PGM format.
- 28: Width of the image.
- 28: Height of the image.
- 255: Maximum gray value.
The main block outputs the pixel data. This part is the same as the processing used earlier to check the text data, except that the Plain PGM format says each line should be no longer than 70 characters, so a line break is inserted after every 14 values*2.
The output of this awk script is itself an image file in PGM format. However, PGM is not a very common format, so the final pnmtopng command converts it to the more common PNG format*3.
Viewing the image.png produced by this process confirms that the digit "4" seen above has been written out as an image.
In this way, the data for a single sample has been converted to an image file. Repeating this process for each line of the text file converts each handwritten digit to an image file.
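The repetition over all lines could be sketched as follows. This is one possible approach, not the original author's script: images_to_pgm and the image-NNNNN.pgm naming are hypothetical, the sample input is a made-up pair of 2x2 images, and the sketch stops at Plain PGM files so that it does not depend on pnmtopng being installed.

```shell
# images_to_pgm (hypothetical helper): write one Plain PGM file per input line.
# $1 = pixel text file, $2 = output directory, $3 = width, $4 = height
images_to_pgm() {
    mkdir -p "$2"
    awk -v dir="$2" -v w="$3" -v h="$4" '{
        out = sprintf("%s/image-%05d.pgm", dir, NR)
        printf("P2 %d %d 255\n", w, h) >out
        # Break after every 14 values (and after the last one) so that
        # no line exceeds 70 characters.
        for (i = 1; i <= NF; i++)
            printf("%d%s", $i, (i % 14 && i < NF) ? " " : "\n") >out
        close(out)
    }' "$1"
}

# Made-up sample data: two 2x2 images.
tmp=$(mktemp -d)
printf '0 255 128 1\n10 20 30 40\n' >"$tmp/tiny.txt"
images_to_pgm "$tmp/tiny.txt" "$tmp/out" 2 2

cat "$tmp/out/image-00002.pgm"
# prints:
# P2 2 2 255
# 10 20 30 40

# Each PGM file could then be converted as in the article, for example:
#   for f in "$tmp"/out/*.pgm; do pnmtopng "$f" >"${f%.pgm}.png"; done
```

For MNIST, images_to_pgm train-images.txt out 28 28 would produce one file per line of train-images.txt. Calling close() after each image keeps awk from holding all the output files open at once.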
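*1: For the test data, there are 10,000 images.
*2: A line break after every 17 values would also fit within 70 characters (each value takes at most 3 digits plus a separator, so 17 * 4 = 68), but since the image size is 28x28, 14 values per line was chosen as a clean divisor.
*3: On CentOS 7, pnmtopng is included in the netpbm-progs package. As an alternative, the convert command included in ImageMagick can also perform the conversion.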