Google WSDM'09 Talk Translation: Challenges in Building Large-Scale Information Retrieval Systems (3)
This is part 3 of my translation of the slides from "Challenges in Building Large-Scale Information Retrieval Systems", the WSDM'09 talk by Google Fellow Jeffrey Dean. The talk traces ten years of evolution of Google's search system. This installment covers the search system from 2004 to around 2007, the index encoding schemes, and the experimental environment used to improve search quality. Personally, I found Google's latest encoding scheme, which thoroughly eliminates branching, the most interesting part. Some commentary and impressions of my own are included in italics. I am an amateur translator, so please refer to the original material for details.
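Part 1: Google WSDM'09 Talk Translation: Challenges in Building Large-Scale Information Retrieval Systems (1) - llameradaの日記
Part 2: Google WSDM'09 Talk Translation: Challenges in Building Large-Scale Information Retrieval Systems (2) - llameradaの日記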
Server Design, 2004 Edition
The server layout has changed considerably from the 2001 design. The search servers now form a tree of parent servers and leaf servers. This is presumably because the number of leaf servers grew to the point where the hierarchy had to go from two levels to multiple levels.
The index is stored in GFS, the distributed file system, and a File Loader copies it from GFS to the repository shards on the leaf servers. Presumably the master copy of the index lives in GFS and the leaf servers act as slaves. To control the mapping between master and slaves, a repository manager that gives instructions to the File Loader appears to have been introduced.
New Index Format
- The block index format used a two-level scheme.
  - Each hit was encoded as a (docid, word position in the document) pair.
  - Docid deltas were encoded with Rice coding.
  - Compression was very good (the format was originally designed for disk-based indexes), but decoding was CPU-intensive and therefore slow.
- New format: a single flat position space.
  - Document boundaries are recorded in a separate data structure.
  - A posting list is simply a list of delta-encoded word positions.
  - It must be compact (32 bits per word position is unaffordable).
  - Yet decoding must be very fast.
The old index kept a docid with every word position, so those docids were largely wasted space. For example, if a word W appears at the 3rd and 5th positions of document D, then (D, 3) and (D, 5) are stored, so D is stored twice. That said, the description on this slide does not seem consistent with the earlier description of the block index, so I may be misunderstanding something.
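The flat position space can be illustrated with a small sketch. This is not Google's implementation; the toy corpus, the `doc_starts` array, and the helper names are assumptions made up for illustration. It shows a posting list holding only delta-encoded global positions, with a separate sorted array of document start positions used to recover (docid, position in document).

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical toy corpus: three documents of 4, 6 and 5 words.
doc_lengths = [4, 6, 5]

# Side data structure: global position of the first word of each document.
doc_starts = [0] + list(accumulate(doc_lengths))[:-1]

def delta_encode(positions):
    """Turn sorted global positions into gaps (the first gap is from 0)."""
    prev, gaps = 0, []
    for p in positions:
        gaps.append(p - prev)
        prev = p
    return gaps

def delta_decode(gaps):
    """Inverse of delta_encode: a running sum of the gaps."""
    return list(accumulate(gaps))

def locate(position):
    """Map a global position to (docid, position within that document)."""
    docid = bisect_right(doc_starts, position) - 1
    return docid, position - doc_starts[docid]

# Word W occurs at global positions 2, 5, 7 and 12; no docid is stored.
postings = delta_encode([2, 5, 7, 12])
hits = [locate(p) for p in delta_decode(postings)]
print(postings, hits)
```

The gaps are small integers, which is what makes the byte-oriented integer codes on the following slides attractive.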
Byte-Aligned Variable-Length Encodings
- Variable-length (varint) encoding
  - Each byte consists of 7 data bits and a continuation bit.
  - Drawback: decoding requires many branches, shifts, and masks.
00000001 00001111 11111111 00000011 11111111 11111111 00000111
======== ======== ================= ==========================
1        15       511               131071
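A minimal sketch of the 7-bits-per-byte code, assuming the common layout in which the low 7 bits come first and the top bit of each byte is the continuation bit. The function names are made up for illustration. For the values 1, 15, 511, and 131071 it produces the seven bytes shown in the figure.

```python
def varint_encode(value):
    """Encode a non-negative integer, 7 data bits per byte, low bits first."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(0x80 | low7)   # continuation bit set: more bytes follow
        else:
            out.append(low7)          # last byte: continuation bit clear
            return bytes(out)

def varint_decode(buf, pos=0):
    """Decode one integer starting at buf[pos]; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift   # one mask and one shift per byte
        if not byte & 0x80:               # one branch per byte
            return value, pos
        shift += 7

encoded = b"".join(varint_encode(v) for v in (1, 15, 511, 131071))
print(encoded.hex(" "))
```

The decode loop makes the drawback concrete: every input byte costs a mask, a shift, and a data-dependent branch.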
- Idea: encode the byte length in the low 2 bits.
  - Advantage: fewer branches, shifts, and masks.
  - Drawback: values are limited to 30 bits, and decoding still requires shifting.
00000001 00001111 01111111 00000011 10111111 11111111 00000111
======== ======== ================= ==========================
1        15       511               131071
Variable-length encoding is byte-oriented, so it decodes faster than bit-oriented codes. Even so, decoding it still takes a large number of branches, shifts, and masks (fewer than for bit-oriented codes, but still many). These operations benefit little from faster CPUs, so decoding speed remains a problem. Branches in particular are a weak point for CPUs where pipeline hazards are expensive (see: パイプライン処理 - Wikipedia). For that reason, a code with as few branches as possible, that is, with fewer if statements, decodes faster. Comparing the first code with the improved one, the number of blue regions (the control bits highlighted in the slide's figure) drops from 7 to 4. Each blue region costs a branch, a shift, and a mask, which is why the improved code decodes faster.
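The 2-bit length idea can be sketched as follows. The exact bit layout is an assumption: here the length (minus one) sits in the low 2 bits of the first byte and the value occupies the remaining bits, little-endian, which may differ from the layout drawn in the slide. The function names are hypothetical.

```python
def encode_len2(value):
    """Encode a value below 2**30 in 1 to 4 bytes, length stored in the low 2 bits."""
    if not 0 <= value < (1 << 30):
        raise ValueError("value must fit in 30 bits")
    tagged = value << 2                              # make room for the length field
    nbytes = max(1, (tagged.bit_length() + 7) // 8)  # bytes needed, at least one
    return (tagged | (nbytes - 1)).to_bytes(nbytes, "little")

def decode_len2(buf, pos=0):
    """Decode one value starting at buf[pos]; return (value, next_pos)."""
    nbytes = (buf[pos] & 0x03) + 1                   # one length lookup per value
    word = int.from_bytes(buf[pos:pos + nbytes], "little")
    return word >> 2, pos + nbytes                   # the shift that still remains

lengths = [len(encode_len2(v)) for v in (1, 15, 511, 131071)]
print(lengths)
```

The length is read once per value rather than once per byte, so the four example values need four length checks instead of seven continuation-bit checks. The final shift is the cost named in the slide.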
Group Varint Encoding
- Idea: instead of encoding integers one at a time, encode four integers together in 5 to 17 bytes.
  - The four 2-bit byte lengths are pulled out and packed into a single prefix byte.
00000001 00001111 01111111 00000011 10111111 11111111 00000111
→
00000110 00000001 00001111 11111111 00000011 11111111 11111111 00000001
======== ======== ======== ================= ==========================
tag      1        15       511               131071
- Decoding: read the prefix byte and use it to look up a 256-entry table.
...
00|00|01|01 => Offsets: +1,+2,+3,+5; Masks: ff000000,ff000000,ffff0000,ffff0000
00|00|01|10 => Offsets: +1,+2,+3,+5; Masks: ff000000,ff000000,ffff0000,ffffff00
00|00|01|11 => Offsets: +1,+2,+3,+5; Masks: ff000000,ff000000,ffff0000,ffffffff
00|00|10|00 => Offsets: +1,+2,+3,+6; Masks: ff000000,ff000000,ffffff00,ff000000
...
- Decoding is much faster than with the other methods.
  - 7 bits per byte: 180M records per second
  - 30-bit limit + 2-bit byte length: 240M records per second
  - Group: 400M records per second
This is a scheme you would not easily think of. Packing the four 2-bit byte lengths into the prefix byte eliminates branching entirely. The earlier schemes branched on the leading bits of each value. Here the decoder takes the prefix byte, fetches the four Offset and Mask pairs from the table, reads 4 bytes at each Offset, and applies the Mask. The shifts are gone as well. Removing the branches and shifts, and cutting the number of mask operations, yields a 1.7x speedup. This code also lifts the 30-bit limit on values. On the other hand, as the example shows, the encoded size grows, by up to 25%. It is a very interesting method, and if there are no patent issues I would like to try it in an OSS full-text search engine.
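A minimal sketch of group varint under stated assumptions: the first value's length is in the top 2 bits of the tag byte (as in the table rows above), values are stored little-endian, and the masks are written as plain integers (`0xff`, `0xffff`, ...) rather than in the memory-order notation of the slide. `TABLE`, `group_encode`, and `group_decode` are names made up for illustration.

```python
def _byte_len(value):
    """Number of bytes (1 to 4) needed for a 32-bit value."""
    return max(1, (value.bit_length() + 7) // 8)

# 256-entry table: tag byte -> (offsets of the 4 values, masks, total group size).
TABLE = []
for tag in range(256):
    lengths = [((tag >> shift) & 0x03) + 1 for shift in (6, 4, 2, 0)]
    offsets, offset = [], 1                  # values start right after the tag byte
    for n in lengths:
        offsets.append(offset)
        offset += n
    masks = [(1 << (8 * n)) - 1 for n in lengths]
    TABLE.append((offsets, masks, offset))

def group_encode(values):
    """Encode exactly four 32-bit values as one tag byte plus 4 to 16 data bytes."""
    if len(values) != 4 or any(not 0 <= v < (1 << 32) for v in values):
        raise ValueError("need four values in [0, 2**32)")
    tag, body = 0, bytearray()
    for v in values:
        n = _byte_len(v)
        tag = (tag << 2) | (n - 1)           # first value ends up in the top 2 bits
        body += v.to_bytes(n, "little")
    return bytes([tag]) + bytes(body)

def group_decode(buf, pos=0):
    """Decode one group at buf[pos]; return (four values, next_pos)."""
    offsets, masks, size = TABLE[buf[pos]]
    # Read up to 4 bytes at each offset and mask off the bytes of the next value.
    values = [int.from_bytes(buf[pos + o:pos + o + 4], "little") & m
              for o, m in zip(offsets, masks)]
    return values, pos + size

encoded = group_encode([1, 15, 511, 131071])
print(encoded.hex(" "), group_decode(encoded))
```

For 1, 15, 511, and 131071 the sketch produces the tag `00000110` and 8 bytes in total, one more than the 7 bytes of the earlier codes. Decoding never inspects the data bytes to decide what to do next; all control flow is driven by the single table lookup.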
2007: Universal Search
Beneath the front-end Web server sits a super root. Beneath that are groups of specialized search servers such as Web, Image, and Blogs, and at the bottom layer is the indexing service that builds the indexes.
Oh, you want that indexed? Give me a minute!
- Crawling and indexing with low latency is hard.
  - Crawl heuristics: which pages should be crawled?
  - Crawling system: pages must be crawled quickly.
  - Indexing system: depends on global data.
    - PageRank, the anchor text of pages that link to the page, and so on.
    - These global properties must be approximated online.
  - Serving system: must be able to accept updates while serving search requests.
    - The data structures for online updates are completely different from those for batch updates.
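As one of the early slides showed, one of the metrics that improved the most over the ten years is the time it takes for a page to be indexed. It may seem natural that a crawler shows up as soon as a page is updated, but making that happen is very hard.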
Flexibility and Experimentation in Information Retrieval Systems
- Ease of experimentation is extremely important.
  - Shorter turnaround time => more experiments => more improvement
- Some experiments are easy.
  - For example, changing the weights of data that already exists.
- Other experiments are harder to run: they need data that is not in the current index.
  - It must be easy to generate new data, incorporate it, and run experiments with it.
The next several slides introduce the experimental infrastructure and methodology used to improve the ranking algorithms.
Infrastructure for Search Systems
- Key pieces of infrastructure
  - GFS: a large-scale distributed file system
  - MapReduce: makes it easy to write and run large-scale jobs
    - Builds the production index in a short time
    - Runs ad hoc experiments quickly
  - BigTable: a semi-structured storage system
    - Fast online access to per-document information at any time.
    - Multiple processes can update per-document information asynchronously.
    - Critical when documents are updated in minutes rather than hours.
Experiment Cycle, Part 1
- Start with a new ranking idea.
- Experiments must be easy to run, and easy to run quickly.
  - Generate the data with tools such as MapReduce and BigTable.
  - Run the experiment offline first and check the effect of the idea.
    - Effect on human-rated query sets of various kinds.
    - Effect on a random query set, to see the impact on the existing ranking.
  - Latency and throughput do not matter for this prototype.
- Based on the results, refine the idea and repeat the experiment.
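As a rough illustration of the offline step, the sketch below measures how often a candidate ranking changes the top results on a query set compared with a baseline. The tiny document set, the two signals, and the weights are all made up, and the metric is only one plausible way to "see the impact on the existing ranking"; the slides do not say how Google measures it.

```python
# Hypothetical documents: docid -> (text, static quality score).
DOCS = {
    "d1": ("cheap flights tokyo", 0.2),
    "d2": ("tokyo weather weather", 0.9),
    "d3": ("flights flights delay", 0.5),
}

def make_ranker(w_text, w_static):
    """Build a ranker that linearly combines term count and static score."""
    def rank(query):
        def score(docid):
            text, static = DOCS[docid]
            return w_text * text.split().count(query) + w_static * static
        # Highest score first; ties broken by docid so results are deterministic.
        return sorted(DOCS, key=lambda d: (-score(d), d))
    return rank

def top_k_changed(queries, baseline, candidate, k=10):
    """Fraction of queries whose top-k result list differs between two rankers."""
    if not queries:
        raise ValueError("need at least one query")
    changed = sum(1 for q in queries if baseline(q)[:k] != candidate(q)[:k])
    return changed / len(queries)

baseline = make_ranker(1.0, 0.0)    # current weights
candidate = make_ranker(1.0, 2.0)   # experiment: weight the static score as well
queries = ["tokyo", "flights", "weather"]
print(top_k_changed(queries, baseline, candidate, k=1))
```

This is the "easy" kind of experiment from the earlier slide: both rankers use data that already exists and differ only in the weights.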
Experiment Cycle, Part 2
- Once the offline results look promising, a live experiment is called for.
  - Experiment on a tiny fraction of user traffic.
  - Usually a random sample.
    - But often only queries of a specific class are sampled.
      - For example, English queries, queries containing place names, and so on.
- At this stage throughput is not important, but latency matters!
  - The experimental framework must run at latencies close to production.
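One common way to route a small, stable slice of traffic into an experiment is to hash an identifier into a bucket. The sketch below is a generic illustration, not a description of Google's framework; the salt, the `looks_english` stand-in for a query-class filter, and all function names are assumptions.

```python
import hashlib

def bucket(query_id, salt):
    """Map an id to a stable pseudo-random number in [0, 1)."""
    digest = hashlib.sha256(f"{salt}:{query_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32

def in_experiment(query_id, query_text, fraction,
                  salt="ranking-exp-1", query_class=None):
    """Decide whether a query belongs to the experiment's traffic sample."""
    if query_class is not None and not query_class(query_text):
        return False                      # restrict the sample to one query class
    return bucket(query_id, salt) < fraction

def looks_english(text):
    """Crude, hypothetical stand-in for an 'English query' classifier."""
    return text.isascii()

# Roughly 1% of 20,000 query ids should land in the experiment.
sampled = sum(in_experiment(i, "weather tokyo", 0.01) for i in range(20000))
print(sampled)
```

Hashing with a per-experiment salt keeps the assignment deterministic for a given id while keeping different experiments independent of each other. The check itself is cheap, which matters given the latency requirement above.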
The Experiment Looks Good. What Next?
- Launch!
- Tune performance and reimplement so that it can handle full traffic.
  - Example 1: precompute data instead of computing it at run time.
  - Example 2: replace with an approximation that is "good enough" but cheaper.
- The rollout process is important.
  - A constant trade-off between "quality" and "cost".
  - Rapid rollouts are at odds with site stability and low latency.
    - Good working relationships are needed between the search quality group and the groups that keep the system fast and stable.