CCMatrix: A billion-scale bitext dataset for training translation models
What it is: CCMatrix is the largest dataset of high-quality, web-based bitexts for training translation models. With more than 4.5 billion parallel sentences in 576 language pairs pulled from snapshots of the CommonCrawl public dataset, CCMatrix is more than 50 times larger than the WikiMatrix corpus that we shared last year.
License: The Kuzushiji Dataset (『日本古典籍くずし字データセット』; held by the National Institute of Japanese Literature and others; processed by the Center for Open Data in the Humanities, Joint Support-Center for Data Science Research, Research Organization of Information and Systems) is provided under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA). When using the dataset as a whole, please include an attribution such as the one below. If you use only individual classical books, please see the page for each book. 『日本古典籍くずし字データセット』 (held by NIJL and others; processed by CODH) doi:10.20676/00000340 Where possible, please also link to the data provider, the ROIS-DS Center for Open Data in the Humanities. Provided by: ROIS-DS Center for Open Data in the Humanities. Data distribution method and notes: ZIP files that bundle the character shapes for each book, and an all-in-one ZI