Hadoop DistCp (distributed copy, pronounced "dist-C-P" or "dist-copy") is a tool that uses MapReduce to copy data between Hadoop clusters. Setting aside systems that are merely being kept in maintenance, it is probably the last MapReduce tool that is still a real operational choice in 2020. This article introduces DistCp and explains the basics of using it in practice. The contents are as follows.
- Overview and principles of DistCp
- DistCp in practice
- DistCp has no dry run
- Know how copy and update differ
- Take a snapshot
- Run DistCp on the source or the destination cluster?
- Use webhdfs to transfer data between different major versions
- Behavior of the -p option
- Two copy strategies: uniformsize and dynamic
- Tuning the number of maps
- Transfer bandwidth
Why DistCp now?
I wrote this because I could not find a document that properly explains how to use DistCp. Even the elephant book, the bible of Hadoop, covers only the very basics of DistCp, and nothing brought the practical usage together in one place. With a vendor such as Cloudera, DistCp is wrapped inside the data replication feature of the excellent Cloudera Manager, and users can transfer data between clusters with a single button click, so they never need to know the details of DistCp. So I decided to write a DistCp article for people who use plain Hadoop.
For the full feature list and other details of DistCp, see the official documentation.
- Author: Tom White
- Release date: 2013/07/26
- Format: large-format book
DistCp overview
DistCp is a tool that uses MapReduce to copy data between Hadoop clusters at high speed, and it ships with the standard Apache Hadoop release. Apache Hadoop is a distributed processing framework made up of HDFS (Hadoop Distributed File System), its distributed storage, and YARN, its distributed computing framework. MapReduce is one of the representative applications that run on YARN. I wrote "between Hadoop clusters", but "between distributed storage systems" would be more accurate: besides HDFS, DistCp also supports object storage such as Amazon S3 and Azure Storage.
DistCp is a command-line tool and is run in the following form.
$ hadoop distcp hdfs://cluster1/foo/bar hdfs://cluster2/foo
This command copies the path /foo/bar on the HDFS cluster named cluster1 into the directory /foo on the HDFS cluster named cluster2.
How DistCp works
DistCp runs on the MapReduce framework. First, a quick review of MapReduce. MapReduce is a framework that performs distributed processing in three phases: Map, in which multiple nodes compute independently; Shuffle, in which data is transferred and grouped by key; and Reduce, in which each node again processes the grouped data independently, just as in Map.
The following figure shows the flow of MapReduce processing.
DistCp uses MapReduce in a particular way: it uses only the Map phase, performs no computation (an identity function), and reads its input from one cluster while writing its output to another.
The following figure shows the flow of DistCp processing.
The DistCp source (where data is read from) and destination (where data is written to) are expressed as URIs. In the earlier example the destination was hdfs://cluster2/foo, but a destination of s3a://bucket1/foo works just as well. That would mean copying the data into the namespace foo under the bucket named bucket1 on S3.
DistCp in practice: there is no dry run
Be aware that although DistCp is a tool that makes very large and irreversible changes, it has no feature equivalent to a dry run. Having no dry run means that, after testing thoroughly on a verification cluster, all you can do is pray to God (or whatever you believe in) that the production run succeeds. And in most cases that prayer goes unanswered. Good luck.
There is a JIRA for dry run that has been open for six years, so if you feel you are the one to do it, an implementation would be welcome.
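Since there is no dry run, one partial substitute is to print the exact command and require an explicit confirmation before anything runs. The following is only a minimal sketch: `distcp_preview` and `distcp_confirmed` are hypothetical helper names, not part of Hadoop, and the preview merely echoes the arguments. It does not predict what DistCp will copy or delete.

```shell
# Hypothetical helper: print the DistCp command that would be run,
# one argument per line, without executing anything.
distcp_preview() {
  echo "hadoop distcp"
  for arg in "$@"; do
    echo "  $arg"
  done
}

# Hypothetical helper: show the preview, then run DistCp only if the
# operator types the destination URI (the last argument) back exactly.
distcp_confirmed() {
  distcp_preview "$@"
  for last in "$@"; do :; done   # after the loop, $last holds the final argument
  printf 'Type the destination URI to proceed: '
  read -r answer
  if [ "$answer" = "$last" ]; then
    hadoop distcp "$@"
  else
    echo "Aborted: destination did not match." >&2
    return 1
  fi
}
```

Retyping the destination is a deliberately crude check, but it forces one more look at the path that is about to be modified.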
DistCp in practice: know how copy and update differ
With no options, the hadoop distcp command behaves as a copy. It performs the following operations.
- If a path exists on the source but not on the destination, copy it
- If the same path exists on both the source and the destination, do nothing
- If a path does not exist on the source but exists on the destination, do nothing
With hadoop distcp -update, the behavior changes as follows.
- If a path exists on the source but not on the destination, copy it
- If the same path exists on both the source and the destination, compare the contents (checksum and so on); copy if the contents differ, do nothing if they are identical
- If a path does not exist on the source but exists on the destination, do nothing
With hadoop distcp -update -delete, the behavior changes as follows.
- If a path exists on the source but not on the destination, copy it
- If the same path exists on both the source and the destination, compare the contents (checksum and so on); copy if the contents differ, do nothing if they are identical
- If a path does not exist on the source but exists on the destination, delete that path
Summarized in a figure, these behaviors look like this.
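The three lists above can also be written down as a small decision function. This is a sketch of the rules as described in this article, not DistCp's actual implementation; `distcp_action` and its argument conventions are made up for illustration.

```shell
# Hypothetical helper that mirrors the three behavior lists.
# $1 = mode: copy | update | update-delete
# $2 = path exists on the source: yes | no
# $3 = path exists on the destination: yes | no
# $4 = contents identical (only meaningful when both exist): yes | no
distcp_action() {
  mode=$1 src=$2 dst=$3 same=${4:-no}
  if [ "$src" = yes ] && [ "$dst" = no ]; then
    echo copy
  elif [ "$src" = yes ] && [ "$dst" = yes ]; then
    # Plain copy never touches an existing path; -update compares contents.
    if [ "$mode" = copy ] || [ "$same" = yes ]; then echo skip; else echo copy; fi
  elif [ "$src" = no ] && [ "$dst" = yes ]; then
    # Only -delete removes paths that exist solely on the destination.
    if [ "$mode" = update-delete ]; then echo delete; else echo skip; fi
  else
    echo skip
  fi
}
```

For example, `distcp_action update-delete no yes` prints `delete`, which is the one case in which data on the destination is destroyed.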
When -update is added to hadoop distcp, comparing the contents incurs overhead, so note that throughput is lower than without -update.
The difference between DistCp's copy and update behavior is easy to get wrong, and getting it wrong can cause a serious incident, so be absolutely sure to remember it.
Consider the following two examples.
# Example 1
$ hadoop distcp hdfs://cluster1/foo/bar hdfs://cluster2/foo
# Example 2: the wrong way
$ hadoop distcp -update hdfs://cluster1/foo/bar hdfs://cluster2/foo
Example 1 copies cluster1/foo/bar directly under cluster2/foo, so the result is that cluster2/foo/bar is created.
Example 2 updates cluster2/foo with the contents of cluster1/foo/bar, so cluster2/foo/bar is not created; instead, the contents of cluster2/foo become the same as those of cluster1/foo/bar.
As a figure, it looks like this.
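The difference between Examples 1 and 2 comes down to where the source directory lands. The sketch below captures that rule for a single source directory. `distcp_landing_path` is a hypothetical helper, and its treatment of a destination that does not exist yet is an assumption of mine about DistCp's behavior rather than something stated in this article.

```shell
# Hypothetical helper: where does a single source *directory* end up?
# $1 = mode: copy | update
# $2 = source URI, $3 = destination URI
# $4 = destination already exists: yes | no (default: yes)
distcp_landing_path() {
  mode=$1 src=${2%/} dst=${3%/} dst_exists=${4:-yes}
  if [ "$mode" = copy ] && [ "$dst_exists" = yes ]; then
    # Plain copy into an existing directory nests the source beneath it.
    echo "$dst/${src##*/}"
  else
    # -update (and, as an assumption here, a destination that does not
    # exist yet) makes the destination itself mirror the source.
    echo "$dst"
  fi
}
```

With the arguments of Example 1 this prints hdfs://cluster2/foo/bar, and with those of Example 2 it prints hdfs://cluster2/foo.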
This one example may not make the danger obvious, so let's look at commands you are more likely to run in real work.
# Example 3
$ hadoop distcp hdfs://cluster1/user/sato hdfs://cluster2/user
# Example 4: the wrong way
$ hadoop distcp -update -delete hdfs://cluster1/user/sato hdfs://cluster2/user
The -delete option of hadoop distcp can only be used together with -update (or -overwrite). It deletes every path that exists on the destination cluster but not on the source cluster. In other words, adding -delete makes the contents of the source and the destination exactly identical.
Example 3 copies cluster1/user/sato into cluster2/user/, so cluster2/user/sato is created.
Example 4 makes the contents of cluster2/user exactly the same as those of cluster1/user/sato. In other words, all user data under the /user directory is completely deleted, and only the contents of the user sato are put in its place.
As a figure, it looks like this.
You might think, "There is a trash feature, so surely nothing is deleted immediately?" However, because of a DistCp bug, the trash does not work here. This problem is still unresolved as of 2020/07/15. See the following JIRA for details.
An operator who runs this command by mistake will have to raise an emergency incident alert on the spot.
Example 4 should have been written as follows.
# Example 4: the wrong way
$ hadoop distcp -update -delete hdfs://cluster1/user/sato hdfs://cluster2/user
# Example 5: the correct form of Example 4
$ hadoop distcp -update -delete hdfs://cluster1/user/sato hdfs://cluster2/user/sato
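One way to catch the mistake in Example 4 before it happens is a guard that refuses to mirror when the source and destination do not name the same directory. This is a heuristic sketch with hypothetical helper names (`same_leaf`, `mirror_distcp`). It would also reject legitimate commands whose last path components differ on purpose, such as a snapshot path used as the source.

```shell
# Hypothetical guard: with -update -delete the destination becomes a mirror
# of the source, so the two paths should normally name the same directory.
# Returns 0 if the last path components match, non-zero otherwise.
same_leaf() {
  src=${1%/} dst=${2%/}
  [ "${src##*/}" = "${dst##*/}" ]
}

# Hypothetical wrapper: refuse to run a mirroring DistCp on mismatched paths.
mirror_distcp() {
  if same_leaf "$1" "$2"; then
    hadoop distcp -update -delete "$1" "$2"
  else
    echo "Refusing: '$2' would be overwritten to mirror '$1'." >&2
    return 1
  fi
}
```

Called with the arguments of Example 4, `mirror_distcp` stops with an error; with those of Example 5 it proceeds to run DistCp.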
Here is one more example. What happens if you run the following command when /user already exists on cluster2?
# Example 6: the wrong way
$ hadoop distcp hdfs://cluster1/user hdfs://cluster2/user
Had -update (-delete) been specified, there might have been no problem. This time, however, -update is not specified. As a result, cluster1/user is copied under cluster2/user; that is, cluster2/user/user is created. For most operators this is not the intended behavior.
At this point you cannot simply delete cluster2/user/user. A directory named cluster2/user/user may have existed before the copy, and it may already have held content. Once the two are mixed, separating the content that came from cluster1 from the content that was originally on cluster2 is difficult. Never let your guard down, even when you are not using -update.
To copy /user on cluster1 to /user on cluster2, the command should have been written as follows.
# Example 6: the wrong way
$ hadoop distcp hdfs://cluster1/user hdfs://cluster2/user
# Example 7: the correct form of Example 6
$ hadoop distcp hdfs://cluster1/user hdfs://cluster2/
As a figure, it looks like this.
Whether the command is run by hand or by automation, always check the paths carefully, right up to the very last moment.
DistCp in practice: take a snapshot
DistCp usually takes an enormous amount of time. For a transfer of an entire cluster, one or two days is normal, and transfers that run for a week or a month are common. DistCp builds its list of target paths before the MapReduce job starts, so it cannot take into account any change to the source files during the transfer. In most cases some job will delete files while the transfer is running, and DistCp will fail. Even if the transfer happens to succeed, inconsistent contents will make other applications such as Hive produce unintended results, and nothing good comes of it. For this reason, the iron rule is to specify a snapshot as the source.
To take a snapshot, run the following two commands in order.
$ hdfs dfsadmin -allowSnapshot hdfs://cluster1/foo/bar
$ hdfs dfs -createSnapshot hdfs://cluster1/foo/bar snapshot1
hdfs dfsadmin -allowSnapshot can only be run as the hdfs user, but hdfs dfs -createSnapshot can also be run by an ordinary user who has permissions on the target directory. Running the commands above creates a directory named hdfs://cluster1/foo/bar/.snapshot/snapshot1, and beneath it are what amount to hard links to exactly the same contents as hdfs://cluster1/foo/bar.
snapshot1 is just the snapshot name, so change it freely when you run the commands.
A DistCp that uses the snapshot is written as follows.
$ hadoop distcp hdfs://cluster1/foo/bar/.snapshot/snapshot1 hdfs://cluster2/foo
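The snapshot steps can be bundled so that the snapshot name and the DistCp source never drift apart. The sketch below only prints the commands rather than running them; `snapshot_uri` and `print_snapshot_distcp` are hypothetical helper names.

```shell
# Hypothetical helper: build the snapshot path for a snapshottable directory.
# $1 = directory URI, $2 = snapshot name
snapshot_uri() {
  echo "${1%/}/.snapshot/$2"
}

# Hypothetical helper: print (not run) the commands for a snapshot-based copy.
# $1 = source directory URI, $2 = destination URI, $3 = snapshot name
print_snapshot_distcp() {
  echo "hdfs dfsadmin -allowSnapshot $1"
  echo "hdfs dfs -createSnapshot $1 $3"
  echo "hadoop distcp $(snapshot_uri "$1" "$3") $2"
}
```

A dated snapshot name, for example one built with `date +%Y%m%d`, makes it easier to tell later which snapshot a given transfer used.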
DistCp in practice: run DistCp on the source or the destination cluster?
As a rule, I recommend running DistCp on the destination cluster. Cases where DistCp must be run on the destination cluster include the following.
- Copying data from an insecure cluster to a secure cluster
- Copying data from a cluster with a lower major version to one with a higher major version
In addition, when migrating data to a new cluster, the source cluster is usually running regular business applications while the destination cluster is, in most cases, not yet in production. Running on the destination therefore makes good use of its resources without adding load to the source cluster.
There are also cases where DistCp must be run on the source cluster, for example when transferring data from a secure cluster to an insecure cluster.
The following is quoted from the Cloudera documentation.
You can use DistCp and WebHDFS to copy data between a secure cluster and an insecure cluster. Note that when doing this, the distcp commands should be run from the secure cluster.
This article does not cover how to run DistCp on secure clusters, but keep this point in the back of your mind when deciding which cluster to run DistCp on.
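The security-related rules of thumb in this section fit in a few lines. This sketch covers only the secure and insecure cases described here; `distcp_run_side` is a hypothetical helper, and version differences or other constraints are deliberately left out.

```shell
# Hypothetical helper encoding the security rules of thumb from this section.
# $1 = source cluster is secure: yes | no
# $2 = destination cluster is secure: yes | no
distcp_run_side() {
  if [ "$1" = yes ] && [ "$2" = no ]; then
    echo source        # secure -> insecure: run from the secure (source) cluster
  else
    echo destination   # insecure -> secure, or no constraint: prefer the destination
  fi
}
```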
DistCp in practice: use webhdfs to transfer data between different major versions
Using the webhdfs protocol, you can transfer data from a cluster with a lower major version to one with a higher major version.
$ hadoop distcp webhdfs://cluster1/foo/bar hdfs://cluster2/foo
The following are reference links.
DistCp in practice: behavior of the -p option
By default, DistCp does not copy file attributes. To copy them you use the -p option, but its behavior comes with various restrictions. For example, -update does not copy a path whose contents are identical, and in that case it does not update the attributes either, even if only the attributes differ.
In the following example, assume that both clusters have a file named /foo/bar/file1.
$ hadoop distcp -update hdfs://cluster1/foo/bar hdfs://cluster2/foo/bar
Suppose the permission of cluster1/foo/bar/file1 is 644, the permission of cluster2/foo/bar/file1 is 600, and the file contents are exactly identical. In that case the permission of cluster2/foo/bar/file1 stays at 600 and is not changed.
Here is another example. The -pt option preserves timestamps such as the modification time, but it cannot be used when dfs.namenode.accesstime.precision, one of the NameNode settings (default: 1 hour), is 0 (disabled). If you run the following command while dfs.namenode.accesstime.precision is 0, it fails.
$ hadoop distcp -pt hdfs://cluster1/foo/bar hdfs://cluster2/foo
The following error is then printed.
Error: org.apache.hadoop.ipc.RemoteException(java.io.IOException): Access time for hdfs is not configured. Please set dfs.namenode.accesstime.precision configuration parameter.
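Setting the access time precision to 0 is recommended as a performance optimization, and Ambari / HDP default to 0. The community version and Cloudera both default to 1 hour, however, so I suspect many people do not even know the setting exists. When you use the -p option, do not design from the documentation alone; always verify the behavior yourself.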
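A pre-flight check for -pt can be sketched as follows. `can_preserve_times` is a hypothetical helper that takes the configured value in milliseconds. The usage in the comments relies on `hdfs getconf -confKey`, which reads the configuration visible to the client where it runs, so treat its answer as a hint and confirm it against the NameNode that would actually reject the request.

```shell
# Hypothetical helper: succeed only if access time tracking is enabled,
# given the value of dfs.namenode.accesstime.precision in milliseconds.
# An empty or non-numeric value is treated as "cannot preserve".
can_preserve_times() {
  [ "$1" -gt 0 ] 2>/dev/null
}

# Usage sketch (needs a Hadoop client on the machine where it runs):
#   precision=$(hdfs getconf -confKey dfs.namenode.accesstime.precision)
#   if can_preserve_times "$precision"; then
#     hadoop distcp -pt hdfs://cluster1/foo/bar hdfs://cluster2/foo
#   else
#     echo "access time is disabled; -pt would fail" >&2
#   fi
```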
DistCp in practice: two copy strategies, uniformsize and dynamic
DistCp has two strategies for distributing the target paths among the map tasks. The default, uniformsize, splits the work by data size. For example, with 100TB of data to transfer and 1000 map tasks, file paths are distributed so that each map task transfers 100GB. As you can see by reading the source code, it takes the listed files in order from the top, adds up their sizes, and moves on to the next map task once the running total exceeds (total size of the data to transfer) / (number of maps).
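The size-based split described above can be simulated in a few lines of awk. This is a simplified model of the description in this article, not the actual Hadoop implementation. `uniformsize_split` is a hypothetical name, the input format of one "size path" pair per line is made up for the sketch, and paths containing spaces are not handled.

```shell
# Hypothetical simulation of a size-based split.
# $1 = number of maps; stdin = lines of "<size> <path>" in listing order.
# Prints "<map index> <path>" for every input line.
uniformsize_split() {
  awk -v maps="$1" '
    { size[NR] = $1; path[NR] = $2; total += $1 }
    END {
      target = total / maps          # size budget per map task
      task = 0; acc = 0
      for (i = 1; i <= NR; i++) {
        acc += size[i]
        print task, path[i]
        # Budget reached: later files go to the next map task.
        if (acc >= target && task < maps - 1) { task++; acc = 0 }
      }
    }'
}
```

Feeding it one large file followed by several small files from the same directory, with two maps, assigns the large file to map 0 and every small file to map 1, because the small files never add up to the size budget.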
In an ideal HDFS environment this works fine, but uniformsize does not work well in an environment with a very large number of small files.
With uniformsize, no matter how many files there are, they are all assigned to a single map task as long as they do not exceed the size budget. Files are assigned in order from the top of the file list. The file list is simply a recursive listing of the files and directories under the target directory, so files in the same directory sit together in one place. As a result, the files of a given directory end up concentrated in one task.
HDFS access per file is very slow. It depends on the environment, but you should expect on the order of a few milliseconds per file. Data transfer is therefore very slow on storage with many small files. On top of that, small files are usually localized, meaning that they are concentrated in particular directories.
To sum up, the small files concentrated in a particular directory are assigned together to a single map task, which skews the map tasks, and that one map task becomes extremely slow.
In such an environment, use the other copy strategy, dynamic. dynamic splits the per-task allocation by number of files. For example, if a system with 100 million files is split across 1000 map tasks, each task handles about 100,000 files.
When you use dynamic, watch out for the opposite case from uniformsize: data in which extremely large files are concentrated. Because the data is split without regard to file size, there is a risk that one particular task has to process an extremely large amount of data. Always investigate the characteristics of the data to be transferred beforehand.
To use the dynamic strategy, pass the option as follows.
$ hadoop distcp -strategy dynamic hdfs://cluster1/foo/bar hdfs://cluster2/foo
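A quick way to survey the data beforehand is `hdfs dfs -count`, which prints the directory count, file count, content size, and path name for a directory. The sketch below turns one such line into an average file size and a tentative strategy. Both helper names and the threshold are hypothetical, and an average hides exactly the kind of localized small files discussed above, so per-directory counts remain worth checking.

```shell
# Hypothetical helper: read one line of `hdfs dfs -count` output
# (DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME) on stdin and print the
# average file size in bytes. Prints nothing if there are no files.
avg_file_size() {
  awk '$2 > 0 { printf "%.0f\n", int($3 / $2) }'
}

# Hypothetical rule of thumb: mostly small files -> dynamic, else uniformsize.
# $1 = average file size in bytes, $2 = threshold in bytes (an arbitrary choice)
suggest_strategy() {
  if [ "$1" -lt "$2" ]; then echo dynamic; else echo uniformsize; fi
}

# Usage sketch (needs a Hadoop client):
#   avg=$(hdfs dfs -count hdfs://cluster1/foo/bar | avg_file_size)
#   suggest_strategy "$avg" 134217728
```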
DistCp in practice: tuning the number of maps
By default DistCp uses only 20 map tasks. It is better to adjust the number of maps to the data volume and the available resources. The following example sets the number of maps to 100.
$ hadoop distcp -m 100 hdfs://cluster1/foo/bar hdfs://cluster2/foo
As with any basic Hadoop application, the number of maps has to be tuned to the storage I/O and the available resources. If you can use the resources fully, about 1 to 2 times the total number of disks is a good choice. On a cluster dominated by small files, however, the job should be bound by CPU rather than I/O, so it may be better to derive the number of tasks from the number of CPU cores. If the cluster's resources are already under pressure, it may be better to reduce the number of maps and process the data slowly. If you are not confident in these calculations, it is enough to start with a trial transfer at the default setting, calculate the transfer rate, and then tune only if necessary.
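The sizing advice above is simple arithmetic, sketched here with two hypothetical helpers. The estimate assumes the per-map throughput measured in a trial run holds for every map and that the work is spread evenly, which the skew discussed earlier can easily break.

```shell
# Hypothetical helper: candidate range for -m from the total number of disks
# (1x to 2x, following the rule of thumb above). Prints "<low> <high>".
map_range_from_disks() {
  echo "$1 $(( $1 * 2 ))"
}

# Hypothetical helper: estimated transfer time in hours.
# $1 = data size in GB, $2 = number of maps,
# $3 = throughput per map in MB/s as measured in a trial run
estimate_hours() {
  awk -v gb="$1" -v maps="$2" -v mbps="$3" \
    'BEGIN { printf "%.1f\n", (gb * 1024) / (maps * mbps) / 3600 }'
}
```

For example, 100TB (102400 GB) across 100 maps at 10 MB/s per map works out to roughly 29 hours under these assumptions.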
DistCp in practice: transfer bandwidth
If network bandwidth is under pressure, it is better to limit the bandwidth used for the transfer. With the following setting, the transfer bandwidth per map is held to 10MB/s.
$ hadoop distcp -bandwidth 10 hdfs://cluster1/foo/bar hdfs://cluster2/foo
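Because the limit applies per map, the total traffic depends on the number of maps as well. The sketch below shows the arithmetic with two hypothetical helpers. It gives an upper bound that is reached only when all maps run concurrently, and it uses integer division, so a small budget spread over many maps rounds down to 0.

```shell
# Hypothetical helper: upper bound on total transfer bandwidth in MB/s.
# $1 = number of maps (-m), $2 = per-map limit in MB/s (-bandwidth)
aggregate_bandwidth() {
  echo $(( $1 * $2 ))
}

# Hypothetical helper: per-map limit that keeps the total within a budget.
# $1 = total budget in MB/s, $2 = number of maps
per_map_bandwidth() {
  echo $(( $1 / $2 ))
}
```

With the default of 20 maps and -bandwidth 10, the upper bound is 200 MB/s; to stay within 500 MB/s with 100 maps, the per-map limit would be 5.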
What this article does not cover
Even for what looks like a plain copy, the detailed requirements differ from project to project, and you need to make use of DistCp's various features accordingly.
This article does not cover the following topics.
- Regular incremental backups using snapshot diff
- DistCp on HA clusters
- DistCp on secure clusters
- DistCp with object storage as the target
Also, when it comes to a cluster migration, DistCp is not the only work required. There are other issues to think through, such as migrating the data in the Hive metastore DB and migrating the data of management tools. No book covers these systematically based on up-to-date information, so if you are not confident, I recommend consulting a vendor such as Cloudera.
Reference links
Here are the reference links collected in one place, including those already mentioned.