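About Hidden Markov Models
Hidden Markov Models
For sequential data, we would like a model that is not restricted to a Markov assumption of any fixed order, yet still has a limited number of free parameters. This can be achieved by introducing latent variables. If we assume, as in the figure, that it is the latent variables that form the Markov chain, we obtain the graph structure known as a state space model.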
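The joint distribution of this model is given by
$$ p(x_1, \ldots, x_N, z_1, \ldots, z_N) = p(z_1) \left[ \prod_{n=2}^{N} p(z_n | z_{n-1}) \right] \prod_{n=1}^{N} p(x_n | z_n) $$
When the latent variables are discrete, this model is called a hidden Markov model (HMM).
The latent variable "summarizes" the past, and the next state transition is predicted from that summary, so a prediction depends on all past observations. Taking weather as an example, let the atmospheric state (high pressure, low pressure, and so on) be the latent variable. From the directly observed data (for example, a run of sunny days) we indirectly estimate the transition pattern of the "hidden state" (high pressure) and use it to predict tomorrow's weather.
Here the latent variable $z$ is represented as a K-dimensional binary variable using 1-of-K coding (with two states, high pressure is written [1,0] and low pressure [0,1]). The state of the latent variable $z_n$ at time $n$ depends on the state $z_{n-1}$ at the previous time step. The conditional distribution describing this transition is represented by the transition probability matrix $A$.
The probability of moving from state $j$ at time $n-1$ to state $k$ at time $n$ is defined as $A_{jk}\equiv p(z_{n,k}=1|z_{n-1,j}=1)$. $A$ is a $K \times K$ matrix, but because $\sum_k A_{jk}=1$ for every row, it has $K(K-1)$ free parameters.
Using the transition probability matrix, the conditional distribution can be written as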
$$ p(z_n | z_{n-1}, A) = \prod_{k=1}^{K} \prod_{j=1}^{K} A_{jk}^{z_{n-1,j} z_{n,k}} $$
The first latent node $z_1$ has no preceding time step, so its distribution is given by the initial state distribution $\pi$.
$$ p(z_1 | \pi) = \prod_{k=1}^{K} \pi_k^{z_{1k}} $$
The elements of $\pi$ sum to 1.
The state transition diagram for K=3 is shown below.
To complete the probabilistic model, we define the conditional distribution of the observed variables, $p(x_n|z_n, \phi)$, where $\phi$ is the set of parameters governing the distribution. These are called emission probabilities, and they take the following form.
$$ p(x_n | z_n, \phi) = \sum_{k=1}^{K} p(x_n | \phi_k) z_{nk} $$
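Consider a homogeneous model in which all conditional distributions governing the latent variables share the same transition matrix $A$ and all emission distributions share the same parameters $\phi$. The joint distribution of the latent and observed variables is then
$$ p(X, Z | \theta) = p(z_1 | \pi) \left[ \prod_{n=2}^{N} p(z_n | z_{n-1}, A) \right] \prod_{m=1}^{N} p(x_m | z_m, \phi) $$
The goal with an HMM is to optimize the unknown parameters $\theta = \{\pi, A, \phi\}$ from the observations $X = \{x_1, \ldots, x_N\}$. The likelihood function is obtained by marginalizing the joint distribution over the latent variables.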
$$
p\left(X\middle|\theta\right)=\sum_{Z}{p\left(X,Z\middle|\theta\right)}
$$
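The marginalization above can be checked numerically on a toy model. The sketch below assumes a hypothetical two-state HMM with discrete observations (every number in `pi`, `A`, and `B` is made up for illustration) and compares the literal sum over all hidden paths with the standard forward recursion, which is not derived in this article but computes the same quantity in $O(NK^2)$ time.

```python
from itertools import product

# Hypothetical two-state HMM; all probabilities are illustrative assumptions.
pi = [0.6, 0.4]                   # initial distribution p(z_1)
A = [[0.7, 0.3], [0.4, 0.6]]      # A[j][k] = p(z_n = k | z_{n-1} = j)
B = [[0.9, 0.1], [0.2, 0.8]]      # B[k][x] = p(x_n = x | z_n = k)


def likelihood_brute_force(xs):
    """Sum p(X, Z | theta) over every hidden path Z (cost grows as K^N)."""
    K = len(pi)
    total = 0.0
    for zs in product(range(K), repeat=len(xs)):
        p = pi[zs[0]] * B[zs[0]][xs[0]]
        for n in range(1, len(xs)):
            p *= A[zs[n - 1]][zs[n]] * B[zs[n]][xs[n]]
        total += p
    return total


def likelihood_forward(xs):
    """Same likelihood via the forward recursion over alpha_n(k)."""
    K = len(pi)
    alpha = [pi[k] * B[k][xs[0]] for k in range(K)]
    for x in xs[1:]:
        alpha = [B[k][x] * sum(alpha[j] * A[j][k] for j in range(K))
                 for k in range(K)]
    return sum(alpha)


xs = [0, 0, 1, 0]  # an observed sequence of symbol indices
print(likelihood_brute_force(xs), likelihood_forward(xs))
```

Both functions should agree up to floating-point error; the brute-force version is only feasible for very short sequences.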
Maximizing this likelihood function is done with the EM algorithm, which will be covered in a future article.
Previous article: About Markov Models
Figures are quoted from: https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf
About Markov Models
Markov Models
The simplest way to handle sequential data such as a time series $\{x_1, \ldots, x_N\}$ is to ignore its sequential nature and treat the observations as independent and identically distributed (i.i.d.) (figure below). This approach, however, cannot capture patterns that depend on the order of the data.
For example, suppose we want to know whether it will rain tomorrow and we have 1000 days of observations, 100 of which were rainy. If the observations are treated as i.i.d., we would predict tomorrow's probability of rain as the frequency 100/1000, that is, 1/10. In reality, rain often falls on consecutive days, so knowing whether it rained today helps predict whether it will rain tomorrow.
A Markov model is one way to express this in a probabilistic model. The joint distribution of N observations can always be written as
$$ \begin{split} p(x_1, \ldots, x_N) = & p(x_1) p(x_2|x_1)p(x_3|x_1,x_2) \cdots p(x_N | x_1, \ldots, x_{N-1})\\=&p(x_1)\prod_{n=2}^{N} p(x_n | x_1, \ldots, x_{n-1}) \end{split} $$
Here $p(x_n | x_1, \ldots, x_{n-1})$ means that the observation $x_n$ is conditioned on $x_1, \ldots, x_{n-1}$.
Now suppose that only today's weather matters when predicting tomorrow's, that is, given the most recent observation, all earlier observations are independent of the prediction. The joint distribution of the N observations then becomes
$$ \begin{split} p(x_1, \ldots, x_N) =& p(x_1) p(x_2|x_1)p(x_3|x_2) \cdots p(x_N | x_{N-1})\\=& p(x_1)\prod_{n=2}^{N} p(x_n |x_{n-1}) \end{split} $$
In this case each observation $x_n$ is conditioned only on $x_{n-1}$, which is depicted by the graphical model below.
Most applications of Markov models impose the constraint that the conditionals $p(x_n |x_{n-1})$ are all identical. For example, if rain today raises the probability of rain tomorrow by 10%, we assume this holds all year round. Such a model is called a homogeneous Markov chain. In reality, today's weather may influence tomorrow's more strongly during the rainy season, so the conditionals may not be identical, but many models assume a homogeneous chain as a starting point.
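The difference between the i.i.d. estimate and a homogeneous first-order chain can be illustrated by simple counting. The sketch below uses a short made-up rain record (`days` is an assumption, not data from this article) and estimates $p(x_n | x_{n-1})$ from the frequencies of consecutive pairs.

```python
from collections import Counter


def iid_rain_probability(days):
    """Fraction of rainy days, ignoring their order (the i.i.d. view)."""
    return sum(days) / len(days)


def transition_probabilities(days):
    """Estimate p(x_n | x_{n-1}) of a homogeneous chain from pair counts."""
    pair_counts = Counter(zip(days[:-1], days[1:]))
    probs = {}
    for prev in (0, 1):
        total = pair_counts[(prev, 0)] + pair_counts[(prev, 1)]
        if total:
            probs[prev] = {nxt: pair_counts[(prev, nxt)] / total
                           for nxt in (0, 1)}
    return probs


# Hypothetical record: 1 = rain, 0 = no rain.
days = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0]
probs = transition_probabilities(days)
print(iid_rain_probability(days))  # overall frequency of rain
print(probs[1][1], probs[0][1])    # p(rain | rain) vs p(rain | no rain)
```

On this toy record the probability of rain after a rainy day is higher than after a dry day, which is exactly the ordering information the i.i.d. model throws away.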
As an example that uses more of the past, suppose two days of information, yesterday and today, are used to predict tomorrow's weather. The joint distribution of the N observations then becomes
$$ \begin{split} p(x_1, \ldots, x_N) =& p(x_1) p(x_2|x_1)p(x_3|x_2, x_1) \cdots p(x_N | x_{N-1}, x_{N-2})\\=& p(x_1)p(x_2|x_1)\prod_{n=3}^{N} p(x_n |x_{n-1}, x_{n-2}) \end{split} $$
This model is called a second-order Markov chain, and its graphical model is shown below.
In the same way, M days of weather information can be used for prediction, giving an M-th order Markov chain. Taking in more of the past may improve predictive accuracy, but the number of model parameters grows exponentially with M and the model can become too complex.
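The exponential growth can be made concrete with a small helper. This is a sketch under the assumption of discrete observations with K states and general conditional probability tables; the function name is hypothetical. Each of the $K^M$ possible histories needs its own distribution over the next state, which has $K-1$ free entries.

```python
def markov_free_parameters(K, M):
    """Free parameters of a homogeneous M-th order chain over K states.

    Counts only the conditional p(x_n | x_{n-1}, ..., x_{n-M});
    the distributions of the first M observations are ignored.
    """
    return K ** M * (K - 1)


for M in range(1, 6):
    print(M, markov_free_parameters(K=3, M=M))
```

For M = 1 this reduces to $K(K-1)$, the same count as for a $K \times K$ transition matrix whose rows sum to 1.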
Figures are quoted from: https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf
Next article: About Hidden Markov Models
Paper summary: BitNet: Scaling 1-bit Transformers for Large Language Models
BitNet: Scaling 1-bit Transformers for Large Language Models
Hongyu Wang†‡, Shuming Ma†, Li Dong†, Shaohan Huang†, Huaijie Wang§, Lingxiao Ma†, Fan Yang†, Ruiping Wang‡, Yi Wu§, Furu Wei†⋄
† Microsoft Research, ‡ University of Chinese Academy of Sciences, § Tsinghua University
Abstract
- Purpose: To introduce BitNet, a scalable and stable 1-bit Transformer architecture, in order to address the deployment challenges of large language models and the environmental impact of their high energy consumption.
- Method: BitNet, designed for large language models, replaces nn.Linear layers with BitLinear and trains 1-bit weights from scratch.
- Results: In language modeling experiments, BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption compared with state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. BitNet also follows a scaling law similar to that of full-precision Transformers, which suggests it can scale effectively to even larger language models while keeping its efficiency and performance benefits.
Introduction
- The rapid growth of large language models (LLMs) has brought remarkable improvements across many tasks, but hosting these models is expensive because of high inference cost and energy consumption.
- As models grow, the memory bandwidth needed to access and process model parameters becomes the main bottleneck and limits overall inference performance.
- When these models are deployed on distributed systems or multi-device platforms, inter-device communication overhead significantly affects inference latency and energy consumption.
- Model quantization is a promising solution: it can greatly reduce the memory footprint and compute cost of large models while maintaining competitive performance.
- Most existing quantization approaches are post-training, so they are easy to apply. However, the model is not optimized for the quantized representation during training, so accuracy tends to drop. Quantization-aware training, by contrast, trains the model with quantization in mind from the start and therefore achieves better accuracy.
- This work is the first to investigate quantization-aware training for 1-bit large language models. It proposes BitNet, a 1-bit Transformer architecture that aims to scale efficiently in both memory and computation.
- BitNet uses low-precision binary weights and quantized activations, while keeping optimizer states and gradients in high precision during training.
- BitNet is simple to implement: it only requires replacing the linear projections in the Transformer (such as nn.Linear in PyTorch). It also complements other LLM acceleration methods such as PagedAttention, FlashAttention, and speculative decoding.
- BitNet is evaluated on language modeling benchmarks and compared with state-of-the-art quantization methods and FP16 Transformers. The experiments show that BitNet achieves competitive perplexity and downstream task accuracy while substantially reducing memory footprint and energy consumption relative to the baselines.
BitNet
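- BitNet uses the same layout as a Transformer, stacking blocks of self-attention and feed-forward networks. In place of conventional matrix multiplication it uses BitLinear, which employs 1-bit model weights.
- The other components are left unquantized in high precision (8-bit in the paper's experiments) for the following reasons.
- Residual connections and layer normalization account for a negligible share of the computation in large language models.
- The cost of the QKV (query, key, value) transformation becomes small relative to the parametric projections (the matrix multiplications) as the model grows.
- The input and output embeddings must keep their precision because language models need high-precision probabilities for sampling.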
BitLinear
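- The weights are binarized to +1 or −1 with the sign function. Before binarization the weights are centered to zero mean, which increases capacity within the limited numerical range. After binarization, a scaling factor $\beta$ is used to reduce the $l_2$ error between the real-valued and binarized weights. The binarization of a weight matrix $W \in \mathbb{R}^{n \times m}$ is formulated as
$$ \tilde{W} = \text{Sign}(W - \alpha), \quad \text{Sign}(W_{ij}) = \begin{cases} +1, & W_{ij} > 0 \\ -1, & W_{ij} \le 0 \end{cases} $$
where
$$ \alpha = \frac{1}{nm} \sum_{ij} W_{ij}, \quad \beta = \frac{1}{nm} \|W\|_1 $$
- The activations are further quantized to b-bit precision. Absmax quantization is used: multiplying by $Q_b$ and dividing by the absolute maximum of the input matrix scales the activations to the range $[-Q_b, Q_b]$ $(Q_b = 2^{b-1})$.
$$ \tilde{x} = \text{Quant}(x) = \text{Clip}\left(\frac{x \times Q_b}{\gamma}, -Q_b + \epsilon, Q_b - \epsilon\right), \quad \gamma = \|x\|_\infty $$
Here $\epsilon$ is a small floating-point number that prevents overflow when clipping.
- For activations before a nonlinear function (e.g., ReLU), the minimum of the inputs is subtracted so that all values are non-negative, and they are scaled to the range $[0, Q_b]$.
In this work the activations are quantized to 8 bits; lower precision is left to future work. Quantization is performed per tensor during training and per token during inference, for stability and efficiency.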
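With the quantization equations above, the matrix multiplication can be written as
$$ y=\tilde{W}\tilde{x} $$
(The tilde (~) marks a quantized quantity that approximates the original value.)
Assume that the elements of $W$ and $x$ are mutually independent and share the same distribution. The variance of the output $y$ is then estimated as
$$ \text{Var}(y) = n\text{Var}(\tilde{w}\tilde{x}) $$
$$ = n\mathbb{E}[\tilde{w}^2]\mathbb{E}[\tilde{x}^2] $$
$$ = n\beta^2\mathbb{E}[\tilde{x}^2] \approx \mathbb{E}[\tilde{x}^2] $$
- In full-precision computation, standard initialization methods such as Kaiming or Xavier initialization keep the output variance $\text{Var}(y)$ at a scale of 1, which stabilizes training.
- To preserve the output variance after quantization, a LayerNorm (layer normalization) function is introduced before the activation quantization. The variance of the output $y$ is then estimated as
$$ \text{Var}(y) \approx \mathbb{E}[\text{LN}(\tilde{x})^2] = 1 $$
which has the same magnitude as the full-precision output variance $\text{Var}(y)$.
- In the Transformer context this is implemented as SubLN (sub-layer normalization). Together with the quantization functions above, it yields BitLinear, which performs matrix multiplication between 1-bit weights and quantized activations.
- With SubLN and the quantization methods, BitLinear is formulated as
$$ y = \tilde{W}\tilde{x} = \tilde{W}\,\text{Quant}(\text{LN}(x)) \times \frac{\beta\gamma}{Q_b} $$
Here $\beta$ is the weight scaling factor, $\gamma$ is the absolute maximum of the activations used in absmax quantization, and $Q_b = 2^{b-1}$ is the quantization range for b bits. After the SubLN operation, the activations are quantized with the absmax function, and the matrix multiplication is carried out between the 1-bit weights and the quantized activations. The output activations are rescaled with $\{\beta, \gamma\}$ to dequantize them to the original precision.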
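The BitLinear formula can be traced step by step in plain Python. This is a minimal sketch, not the paper's implementation: it works on lists instead of tensors, uses a LayerNorm without learnable gain or bias, treats each row of `W` as one output unit, and follows the Clip-only form of Quant written above (practical kernels would also round to integers). All function names are assumptions.

```python
import math


def layer_norm(x, eps=1e-5):
    """LN(x) without learnable parameters (an assumption of this sketch)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def binarize(W):
    """Return Sign(W - alpha) and beta = mean(|W|)."""
    flat = [w for row in W for w in row]
    alpha = sum(flat) / len(flat)
    beta = sum(abs(w) for w in flat) / len(flat)
    W_bin = [[1 if w - alpha > 0 else -1 for w in row] for row in W]
    return W_bin, beta


def absmax_quantize(x, bits=8, eps=1e-5):
    """Scale x into [-Qb, Qb] by its absolute maximum gamma, then clip."""
    Qb = 2 ** (bits - 1)
    gamma = max(max(abs(v) for v in x), eps)  # guard against all-zero input
    x_q = [min(max(v * Qb / gamma, -Qb + eps), Qb - eps) for v in x]
    return x_q, gamma, Qb


def bit_linear(W, x, bits=8):
    """y = W_bin . Quant(LN(x)) * beta * gamma / Qb."""
    W_bin, beta = binarize(W)
    x_q, gamma, Qb = absmax_quantize(layer_norm(x), bits)
    y = [sum(w * v for w, v in zip(row, x_q)) for row in W_bin]
    return [v * beta * gamma / Qb for v in y]


W = [[0.5, -0.2, 0.1], [-0.3, 0.4, -0.6]]
print(bit_linear(W, [1.0, 4.0, 2.0]))
```

Because the binarized weights are only ±1, the inner sum needs additions and subtractions only; the multiplications are confined to the final rescaling by $\beta\gamma/Q_b$.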
Model parallelism with Group Quantization and Normalization
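Model parallelism, which partitions the matrix multiplication across multiple devices, is important for scaling up large language models. It requires the tensors to be independent along the partition dimension. However, the parameters $\alpha$, $\beta$, $\gamma$, and $\eta$ are all computed from whole tensors, which breaks this independence.
One solution is to introduce an all-reduce operation for each parameter, but the amount of communication grows as the model gets deeper, which slows down processing.
To solve this, the paper proposes dividing the weights and activations into groups and estimating each group's parameters independently, so that the parameters can be computed locally without extra communication. This technique is called group quantization.
Concretely, the weight matrix $W \in \mathbb{R}^{n \times m}$ is divided into G groups along the partition dimension, each group having size $\frac{n}{G} \times m$. The parameters of each group are then estimated independently:
$$ \alpha_g = \frac{G}{nm} \sum_{ij} W^{(g)}_{ij}, \quad \beta_g = \frac{G}{nm} \|W^{(g)}\|_1 $$
Here $W^{(g)}$ denotes the g-th group of the weight matrix. Likewise, the input matrix $x \in \mathbb{R}^{n \times m}$ is divided into G groups, and the parameters of each group are computed:
$$ \gamma_g = \|x^{(g)}\|_\infty, \quad \eta_g = \min_{ij} x^{(g)}_{ij} $$
- For LN (Layer Normalization), the group normalization technique can be applied so that the mean and variance of each group are computed independently.
$$ \text{LN}(x^{(g)}) = \frac{x^{(g)} - \mathbb{E}(x^{(g)})}{\sqrt{\text{Var}(x^{(g)}) + \epsilon}} $$
- In this way, efficient model parallelism is achieved without any additional communication.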
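The per-group estimation can be sketched in a few lines. The example below assumes the weight matrix is a list of rows split into G equal blocks of rows (the choice of rows as the partition dimension and the function name are assumptions of this sketch) and computes each group's mean and mean absolute value locally.

```python
def group_parameters(W, G):
    """Split the rows of W into G equal groups; return (alpha_g, beta_g) per group."""
    n = len(W)
    assert n % G == 0, "rows must divide evenly into groups"
    size = n // G
    params = []
    for g in range(G):
        flat = [w for row in W[g * size:(g + 1) * size] for w in row]
        alpha_g = sum(flat) / len(flat)               # group mean
        beta_g = sum(abs(w) for w in flat) / len(flat)  # group mean |w|
        params.append((alpha_g, beta_g))
    return params


W = [[1, -1], [3, 1], [0, 0], [-2, 2]]
print(group_parameters(W, G=2))
```

Each group only touches its own slice of the matrix, so a device holding that slice could compute its parameters without communicating with the others.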
Model Training
Straight-through estimator
- To train the 1-bit model, the straight-through estimator (STE) is used to approximate gradients during backpropagation. It bypasses non-differentiable functions (e.g., the Sign and Clip functions) in the backward pass, which makes it possible to train the quantized model.
Mixed precision training
- Weights and activations are quantized to low precision, but the gradients and optimizer states are stored in high precision to ensure training stability and accuracy. For the learnable parameters, latent weights are kept in a high-precision format to accumulate parameter updates. The latent weights are binarized on the fly during the forward pass and are never used in inference.
Large learning rate
- One optimization challenge is that a small update to the latent weights often makes almost no difference to the 1-bit weights. This leads to biased gradients and updates, and it is especially problematic early in training. The authors found that raising the learning rate is the simplest and best way to accelerate optimization. BitNet benefits from a large learning rate in terms of convergence, whereas an FP16 Transformer diverges at the start of training with the same learning rate.
Computational Efficiency
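- BitNet's computational efficiency is evaluated in terms of both arithmetic operation energy and memory footprint.
- According to the energy models in [Hor14, ZZL22], the energy consumption of different arithmetic operations is estimated as follows.
[Table: energy per addition and multiplication at each precision, from the paper]
Energy consumption in a vanilla Transformer
For a matrix multiplication with dimensions $m \times n$ and $n \times p$, the energy consumption of additions and multiplications is calculated as
$$ E_{\text{add}} = m \times (n - 1) \times p \times \hat{E}_{\text{add}}, \quad E_{\text{mul}} = m \times n \times p \times \hat{E}_{\text{mul}} $$
Energy consumption in BitNet
Because BitNet uses 1-bit weights, the energy consumption of the matrix multiplication is dominated by additions. Multiplications are applied only to scale the output by the scalars $\beta$ and $\gamma/Q_b$, so the multiplication energy can be computed as $E_{\text{mul}} = (m \times p + m \times n) \times \hat{E}_{\text{mul}}$, which is significantly smaller than in a Transformer.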
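To see how large the gap in multiplications is, the operation counts can be compared directly. The sketch below counts operations only; the per-operation energies $\hat{E}_{\text{add}}$ and $\hat{E}_{\text{mul}}$ are left out, and the addition count $m(n-1)p$ for the vanilla case is taken as an assumption from the usual dot-product accounting. The function names are hypothetical.

```python
def vanilla_matmul_ops(m, n, p):
    """Operation counts for an (m x n) by (n x p) full-precision matmul."""
    return {"add": m * (n - 1) * p, "mul": m * n * p}


def bitnet_matmul_ops(m, n, p):
    """With 1-bit weights only the rescaling steps need multiplications."""
    return {"add": m * (n - 1) * p, "mul": m * p + m * n}


m = n = p = 1024
ratio = vanilla_matmul_ops(m, n, p)["mul"] / bitnet_matmul_ops(m, n, p)["mul"]
print(ratio)  # how many times fewer multiplications BitNet needs
```

For square 1024-sized matrices the multiplication count drops by a factor of n/2, so the saving grows with the hidden dimension.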
Comparison with FP16 Transformers
Setup
- Autoregressive language models with BitNet are trained at various scales, from 125M to 30B parameters. The models are trained on an English corpus consisting of the Pile dataset, Common Crawl snapshots, RealNews, and the CC-Stories dataset. The data is preprocessed with the SentencePiece tokenizer, with a vocabulary size of 16K. In addition to BitNet, Transformer baselines are trained on the same datasets and settings for a fair comparison.
Inference-Optimal Scaling Law
- Neural language models have been shown to scale predictably with the vanilla Transformer architecture: the loss scales as a power law in the amount of training compute. This makes it possible to determine the optimal allocation of a compute budget and to predict the performance of large language models from smaller ones.
- To study the scaling law of binarized Transformers, the scaling curves of BitNet and the FP16 Transformer baseline are plotted against parameter count. BitNet's loss scaling resembles that of the FP16 Transformer and follows a power law, $L(N) = aN^b + c$. (In the figure, the power law is fitted on the results for the 125M to 6.7B models and used to predict the losses of the 13B and 30B models.) The gap between BitNet and the FP16 Transformer narrows as model size increases.
- The paper introduces the Inference-Optimal Scaling Law, which predicts the loss as a function of energy consumption. It focuses on the inference energy cost, which scales with how much the model is used, whereas the training cost is incurred only once. For a fixed computation budget BitNet achieves markedly better loss, and the inference cost needed to match the performance of an FP16 model is much smaller.
Results on Downstream Tasks
- Beyond the loss, the capabilities that come with scaling BitNet are also of interest. Zero-shot and 4-shot results are tested on four downstream tasks (Hellaswag, Winogrande, Winograd, and Storycloze) to evaluate capability under scaling with interpretable metrics (figure below). The average results of BitNet and the FP16 Transformer are reported, and they show that downstream task performance scales as the computation budget grows.
- Because the main challenge in training low-bit Transformers is optimization stability, stability tests for BitNet and the FP16 baseline are carried out by training a series of models with different peak learning rates. BitNet can converge with a large learning rate while the FP16 Transformer cannot (figure below), which indicates that BitNet has better training stability.
- BitNet benefits from a higher learning rate and achieves better convergence in terms of PPL (perplexity).
Comparison with Post-training Quantization
- BitNet is compared with state-of-the-art quantization methods: Absmax, SmoothQuant, GPTQ, and QuIP. These are post-training quantization methods applied to an FP16 Transformer model that follows the same training setting and data as BitNet. Absmax and SmoothQuant quantize both weights and activations, while GPTQ and QuIP reduce only the precision of the weights.
- For weight-only quantization (GPTQ and QuIP), experiments are run with W4A16 and W2A16. For weight-and-activation quantization (Absmax and SmoothQuant), the FP16 Transformer is quantized to W8A8, W4A4, and W1A8. BitNet itself uses binary weights with 8-bit activations (W1A8), which is a lower or equal bit width compared with the baselines.
- A detailed comparison of the zero-shot performance of BitNet against the various baseline approaches is presented on four benchmark datasets: Winogrande, Winograd, Storycloze, and Hellaswag.
- For a fair comparison, all models have a size of 6.7B and are evaluated at several weight bit levels from 16 down to 1. In addition to zero-shot accuracy on the downstream tasks, the evaluation metrics include language model perplexity on the validation set, to give a comprehensive picture of each method's performance.
- BitNet achieves performance competitive with the baseline approaches, particularly at low bit levels. BitNet's zero-shot scores are comparable to those of the 8-bit models, while its inference cost is much lower.
- Among 4-bit models, weight-only quantization methods outperform weight-and-activation quantizers, because activations are harder to quantize. BitNet, a 1-bit model, achieves significantly better results than both the weight-and-activation and the weight-only quantization methods.
- Among lower-bit models, BitNet scores consistently better than all baselines, which demonstrates the advantage of quantization-aware training over post-training quantization. Figure 6, which summarizes the zero-shot and few-shot accuracy of BitNet and the baselines as the model size scales from 1.3B to 6.7B, shows that this advantage is consistent across scales.
Ablation Studies
- 以ä¸ã«BitNetã®ã¢ãã¬ã¼ã·ã§ã³ã¹ã¿ãã£ãããã¤ãã®ä»£æ¿ã¢ããã¼ãã¨ã®æ¯è¼çµæã示ããæ´»æ§åéååã¢ããã¼ãã®é¸æã¨ã¢ãã«ãã¬ã¼ãã³ã°ã®å®å®åæè¡ã®å¹æãæ¤è¨¼ããã
- BitNetã¯æ´»æ§åã®éååã«absmaxã使ç¨ãããã¬ã¼ãã³ã°ã®å®å®æ§ã®ããã«SubLNã使ç¨ãããéååã®ä»£æ¿æ¡ã¨ãã¦ãå¦ç¿å¯è½ãªãã©ã¡ã¼ã¿ã§ã¹ã±ã¼ã«ãåçã«èª¿æ´ããelastic颿°ããããå®é¨ã§ã¯ãabsmaxãelastic颿°ãããåªããæ§è½ã示ããã¨ããããã
- ããã«ãabsmax颿°ã¯ããå®å®ãããã¬ã¼ãã³ã°ããããããBitNetã«å¯¾ãã¦ãã大ããªå¦ç¿çãå¯è½ã«ãããSubLNãPre-LNããã³BMTã¢ã¼ããã¯ãã£ã¨æ¯è¼ãããPre-LNã¯GPTã®ããã©ã«ãã¢ã¼ããã¯ãã£ã§ãããBMTã¯ãã¤ãã©ã¤ãºãã¢ãã«ã®å®å®æ§ãæ¹åãããã¨ã証æããã¦ãããå®é¨ã§ã¯ãSubLNãPre-LNã¨BMTã®ä¸¡æ¹ãä¸åããã¨ã示ãããããã£ã¦ãBitNetã®å®è£ ã«ã¯absmaxã¨SubLNã鏿ããã
Conclusion and Future Work
- The paper presents BitNet, a new 1-bit Transformer architecture for large language models. The approach aims for a scalable and stable design that can handle large language models efficiently.
- The experiments show that BitNet achieves competitive performance in both perplexity and downstream tasks, while substantially reducing memory footprint and energy consumption compared with the baselines. BitNet also follows a scaling law similar to that of full-precision Transformers, indicating that it can be scaled up effectively to even larger language models with potential benefits in performance and efficiency.
- In the future, the authors aim to scale up BitNet in model size and training steps. They are also interested in applying BitNet to other architectures (e.g., RetNet) for training large language models.
A summary of the follow-up to this paper, the 1.58-bit version, is here: reseachpaper-matome.hatenablog.com
Paper summary: GPT Takes the Bar Exam
GPT Takes the Bar Exam
Michael Bommarito II, Daniel Martin Katz 2022
License
CC BY 4.0 Deed | Attribution 4.0 International | Creative Commons
- GPT Takes the Bar Exam
- Abstract
- Introduction
- DATA
- Methods
- Results
- Conclusion and Future Work
- Bonus: GPT-4's explanation of and answer to the sample question (Japanese)
Abstract
- Purpose
- To experimentally evaluate the performance of OpenAI's text-davinci-003 model (also called GPT-3.5) on the multiple-choice section (MBE) of the United States bar exam.
- Method
- Hyperparameter optimization and prompt engineering are applied to GPT-3.5's zero-shot performance and their effect is evaluated. The correct answer rate on complete MBE practice exams is measured, along with whether the passing rate is reached in the Evidence and Torts categories.
- Results
- With the best prompt and parameters, GPT-3.5 reaches a correct answer rate of 50.3% on the MBE practice exam, far above the 25% baseline guessing rate, and achieves a passing rate for both Evidence and Torts. GPT-3.5's ranking of the choices also correlates strongly with correctness: its top two and top three choices contain the correct answer 71% and 88% of the time, respectively.
- Conclusion
- GPT-3.5's performance on the MBE section strongly suggests that an LLM is likely to pass the MBE portion of the bar exam in the near future. However, interpretation of these results is limited by the nascent scientific understanding of LLMs and the proprietary nature of GPT.
Introduction
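On the complexity of the legal system
- The legal system is growing more complex, and technological support is needed to improve the quantity, quality, and accessibility of the legal services society demands.
- Artificial intelligence and process engineering have supported both non-experts and experts in the legal system for decades.
- However, the complexity of legal language and the breadth of legal knowledge make it difficult to develop systems that understand the nuances of legal problems.
- Law depends heavily on the use of language, and legal documents are produced in enormous volumes. Legal language is complex, and legal professionals undergo nearly ten years of education and professional training to understand and produce it.
- The complexity of legal language stems in particular from highly formalized conventions and rigorously precise phrasing, and it differs greatly from ordinary language.
Progress of language models through machine learning
- In recent years, advances in natural language processing and computing have greatly improved the performance of machine learning techniques.
- The introduction of the Transformer architecture has been revolutionary and successful, especially in the text and image modalities.
- OpenAI's GPT models are particularly well-known and accessible large language models (LLMs). GPT-3 is an autoregressive language model with 175 billion parameters.
- For commercial and ethical reasons, access to OpenAI's models is provided only through the OpenAI API, which offers endpoints for text completion, code completion, image generation, and embedding generation.
- GPT-3.5 and ChatGPT show unprecedented performance on zero-shot and few-shot tasks, but they are not domain-specific models, and it is an open question whether state-of-the-art LLMs can succeed on legal exams such as the Multistate Bar Examination (MBE).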
DATA
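- MBE questions are designed to test both legal knowledge and reading comprehension, and they require advanced semantic and syntactic command of English.
- Rather than asking a legal question directly, MBE questions present the test-taker with a fictional situation and a description of the facts embellished with detail. Some of these details are important, while others are added only to distract the reader.
- Below is a publicly available sample question. It concerns an accident in which a car was struck by a train and asks about the admissibility of testimony from a resident who has lived near the crossing for 15 years.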
Question: A man sued a railroad for personal injuries suffered when his
car was struck by a train at an unguarded crossing. A major issue is
whether the train sounded its whistle before arriving at the crossing.
The railroad has offered the testimony of a resident who has lived near
the crossing for 15 years. Although she was not present on the occasion
in question, she will testify that, whenever she is home, the train always
sounds its whistle before arriving at the crossing.
Is the resident's testimony admissible?
(A) No, due to the resident's lack of personal knowledge regarding the
incident in question.
(B) No, because habit evidence is limited to the conduct of persons,
not businesses.
(C) Yes, as evidence of a routine practice.
(D) Yes, as a summary of her present sense impressions.
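- The MBE portion of the bar exam consists of about 200 questions like the sample above. On the actual exam, 25 questions are drawn from each of eight categories, seven of which correspond to specific areas of law and one of which is used for experimental test design.
- Some questions may be excluded from the final score by state bars or the NCBE. Individual state bars and the NCBE assess the performance of test-takers inside and outside the state, remove some questions, and adjust raw scores to maintain consistency across jurisdictions.
- As part of exam design and preparation, the NCBE maintains statistical information on exam performance. It shows how difficult the exam is: the average student answers more than one in four questions incorrectly.
- For this study, standard test preparation material for the MBE portion, including practice questions and simulated exams, was purchased from the NCBE. This material cannot be redistributed, but researchers who want to reproduce the paper's results can purchase the data from the NCBE online store for about 300 USD.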
Methods
- The experimental evaluation uses zero-shot prompts against the text-davinci-003 text completion API. This section details the design and iteration of the prompts, the relevant API hyperparameters, and an attempt to fine-tune the model.
Prompt Engineering and Responses
- Prompt engineering refers to the craft of writing prompts, which matters because LLMs are highly sensitive to the prompt they are given. This study put considerable effort into prompt engineering.
- The prompt types tested were:
- 1. Single choice only
- 2. Single choice and an explanation
- 3. Top two choices only
- 4. Top two choices and an explanation
- 5. Top two choices and a re-prompt
- 6. Rank order of all choices
- 7. Rank order of the top three choices
- Results were broadly similar across these prompts, except that the last strategy, rank-ordering the top three choices, substantially improved the model's accuracy.
- Because there is no direct insight into the internals of GPT-3.5, the authors cannot comment further on why this change of prompt affects the model's behavior differently from the other prompts.
- They hypothesize that this prompt best combines non-entailment performance, which eliminates the most clearly wrong answer, with stochastic entailment and memorization.
- For every simulated exam, the prompt and the complete JSON response (including the OpenAI API request ID) were logged. Each line of the text completion response was parsed and stored for scoring or qualitative analysis.
- In a very small number of cases (under 1%), the response contained natural-language variations in format such as "My first choice is (D)". These variations were handled through exception cases in the parser. No response was manually altered or evaluated by a human.
- From a technical standpoint, all of these prompts relate to the traditional textual entailment task, in which a model must judge a statement as "true" or "not true". In the zero-shot exam simulation, unlike in existing research on entailment problems, there is little control over the hypotheses, the claims, or the body of knowledge used in training.
- There is also no insight into any explicit or implicit knowledge graph or state model that may exist inside GPT. Moreover, in some cases several choices may be correct from an entailment standpoint, and the test-taker must rank the choices based on knowledge of the exam design. The test therefore contains elements closer to retrieval and relevance scoring than to a simple binary entailment/non-entailment problem.
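The parsing step described above can be illustrated with a small regular-expression helper. This is a hypothetical sketch: the paper does not publish its parser, and the response strings below are made up apart from the quoted "My first choice is (D)". It extracts answer letters in the order they appear, so the same code handles both a plain ranked list and a natural-language variant.

```python
import re

# Matches a parenthesized answer letter such as "(C)".
CHOICE = re.compile(r"\(([A-D])\)")


def parse_ranked_choices(text):
    """Return the distinct answer letters in order of first appearance."""
    seen = []
    for letter in CHOICE.findall(text):
        if letter not in seen:
            seen.append(letter)
    return seen


print(parse_ranked_choices("(C), (A), (D)"))
print(parse_ranked_choices("My first choice is (D)"))
```

A real parser would also need to decide what to do with responses that contain no recognizable letter; this sketch simply returns an empty list in that case.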
(Hyper)parameters for GPT-3
- Results in machine learning and computational research are generally very sensitive to model parameters and hyperparameters. In addition to varying the prompt as described above, this study evaluated how hyperparameters such as the model "temperature" affect performance.
- The parameters evaluated were: 1. temperature (sampling temperature; 0.0 is deterministic, higher is more random); 2. top p (nucleus sampling probability); 3. best of (generate [N] completions server-side and return the one with the highest log probability per token as the "best"); 4. max tokens (the maximum number of tokens to generate).
- Temperature was tested at {0.0, 0.25, 0.5, 0.75, 1.0}, top p at {0.75, 1.0}, best of at {1, 2, 4}, and max tokens at {16, 32} for prompts without an explanation and {128, 256, 1024} for prompts with an explanation.
Fine-tuning
- One reason LLMs such as GPT-3.5 attract so much interest is that their zero-shot or few-shot performance is very strong. Even so, in some situations performance may improve by retraining some or all of the layers of the LLM.
- OpenAI offers retraining ("fine-tuning") through its API, with some control over the training process such as learning rate and batch size. The authors attempted to fine-tune text-davinci-003 on 200 unseen simulated MBE bar exam questions, but in every case the fine-tuned model performed far worse than text-davinci-003 itself.
- Given the scarcity of high-quality data for training and evaluation, fine-tuning of GPT models was not pursued further. These results are consistent with the risks of fine-tuning LLMs that others have observed.
Results
- In total, 107 sample exams were run. Rank-ordering the top three choices (prompt style #7) performed best, and 41 sample runs were collected for this prompt across parameter combinations.
- GPT does not yet pass the multiple-choice exam overall, but it far exceeds the 25% baseline random chance rate and reaches the average passing rate in at least two categories (Evidence and Torts).
- Averaged over all categories, GPT trails human test-takers by about 17%. For Evidence, Torts, and Civil Procedure, however, the gap is negligible or in single digits, and on Evidence questions GPT is already on par with humans.
- In the remaining categories (Constitutional Law, Real Property, Contracts, and Criminal Law), the gap is more pronounced, rising to 36% for Criminal Law. This gap may come from knowledge areas that are missing from GPT's training data or that were removed during model compression or fine-tuning.
- If the correlation between the rank of GPT's answers and correctness were low, that would suggest a genuine lack of knowledge of that area of law. If instead the second or third choice is often correct, the design of the questions may be responsible for the weaker performance. GPT's second and third best answers correlate strongly with correctness: the top-2 answers exceed the 50% baseline random chance rate in every category and exceed the NCBE-reported average in five of the seven categories.
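Top-k figures such as those above come from a simple count over ranked answers. The sketch below uses four made-up rankings and answer keys (assumptions for illustration only, not data from the paper) and counts how often the correct letter appears among the first k ranked choices.

```python
def top_k_accuracy(rankings, answers, k):
    """Fraction of questions whose correct answer is in the first k ranked choices."""
    hits = sum(1 for ranked, correct in zip(rankings, answers)
               if correct in ranked[:k])
    return hits / len(answers)


# Hypothetical model rankings and answer key for four questions.
rankings = [["C", "A", "D"], ["B", "C", "A"], ["A", "D", "B"], ["D", "B", "C"]]
answers = ["C", "C", "B", "A"]

for k in (1, 2, 3):
    print(k, top_k_accuracy(rankings, answers, k))
```

With four answer choices, random ranking would give 25%, 50%, and 75% for k = 1, 2, 3, which is why the text compares top-2 results against a 50% baseline.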
Conclusion and Future Work
- This study documents an experimental evaluation of GPT-3.5 on the MBE portion of the NCBE model bar exam. Across all prompts and hyperparameter values, GPT-3.5 far exceeded the baseline rate of random guessing.
- Without fine-tuning, GPT-3.5 achieved a passing rate in two categories of the bar and was on par with human test-takers in one. Its rank-ordering of the possible choices correlates with correctness well beyond random chance, which confirms a general understanding of the legal domain.
- GPT-3.5 performed far better on this task than the authors expected. The ability to interpret how GPT understands the questions or chooses between candidate answers is limited, but the history of similar problems strongly suggests that an LLM is likely to pass the bar exam soon.
- Based on anecdotal evidence related to GPT-4 and LAION's Bloom family of models, this is very likely to happen within the next 0 to 18 months. The authors plan to continue replicating the experimental design and fine-tuning with GPT-J, GPT-Neo, and the Bloom family of models. In addition, the MBE is only one part of the bar exam, and future work will evaluate GPT-3.5 and other models on the essay (MEE) and situational performance (MPT) sections as well.
Bonus: GPT-4's explanation of and answer to the sample question (Japanese)
About the question
(ChatGPT's answer follows)
This question concerns a case in which a man sued a railroad for personal injuries suffered when his car was struck by a train at an unguarded crossing. The main issue is whether the train sounded its whistle before arriving at the crossing.
In this case the railroad offers the testimony of a resident who has lived near the crossing for 15 years. The resident was not present when the accident in question occurred, but she will testify that whenever she is at home, the train always sounds its whistle before arriving at the crossing.
The question is whether this resident's testimony is admissible in court.
To decide whether the testimony is admissible, one needs to understand the rules of evidence, especially the rules on testimony about habit and routine practice. The point here is whether the resident, even without direct knowledge of the specific accident, can testify on the basis of the train's consistent pattern of behavior observed over many years (that is, the routine of sounding the whistle before arriving at the crossing).
Put simply, the question asks whether testimony that infers from consistent past behavior (here, the train's habit of sounding its whistle) that the same behavior occurred in a specific incident can be used as evidence at trial.
Answer
The correct answer to this question is (C): the resident's testimony is admissible "Yes, as evidence of a routine practice". To understand this, let us look at some basic principles of the law of evidence.
- Need for personal knowledge: Normally a witness must have direct knowledge of the event in order to testify about it. In this case, however, the resident has no direct knowledge of the specific accident. She was not present when it occurred.
- Evidence of habit or routine practice: The law of evidence may admit evidence of a person's habit or an organization's routine practice in order to prove conduct in accordance with that habit or routine. The reasoning is that when a habit or routine is carried out consistently, the same conduct is likely to have occurred on a particular occasion.
In this case, the resident's testimony that over the past 15 years, whenever she was at home, the train consistently sounded its whistle before arriving at the crossing shows that the railroad had a routine practice of sounding the whistle as trains approach this crossing.
For this reason her testimony can serve as evidence of a "routine practice". Even without direct knowledge of the specific accident, she can testify about the railroad's consistent pattern of sounding the whistle before the train arrives at the crossing, and this may be important information in the context of the lawsuit.
On this theory, the resident's testimony does not directly prove whether the railroad sounded the whistle on the particular day, but it is admissible as evidence that there was a consistent routine of sounding the whistle as trains approach the crossing. That is why choice (C) is correct.
Paper summary: The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang,
Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei
All rights to the figures and tables on this page belong to the authors of the paper.
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
- This paper in one line
- Abstract
- The Era of 1-bit LLMs
- BitNet b1.58
- Result
- Energy
- Discussion and Future Work
- Glossary
This paper in one line
BitNet b1.58: cutting memory and energy while keeping performance, toward a new era of language models
Abstract
- Purpose: To introduce BitNet b1.58 (*1), a new generation of 1-bit large language model (LLM), and to define a new scaling law and training recipe for developing LLMs that are both high-performing and cost-effective.
- Method: BitNet b1.58 is a 1-bit LLM in which every parameter (or weight) of the LLM is represented using only {-1, 0, 1}.
- Results: Compared with a full-precision Transformer LLM of the same model size and number of training tokens, BitNet b1.58 matches it in perplexity and end-task performance while being markedly more cost-effective in latency, memory, throughput, and energy consumption.
- Conclusion: The 1.58-bit LLM provides a new scaling law and recipe for training a new generation of LLMs that are both high-performing and cost-effective, and it opens the way to designing dedicated hardware optimized for 1-bit LLMs.
The Era of 1-bit LLMs
- In recent years, large language models (LLMs) have grown rapidly in size and capability and show remarkable performance on a wide range of natural language processing tasks. Their growing size, however, creates deployment challenges and raises concerns about the environmental and economic impact of high energy consumption.
- One approach to these challenges is to use post-training quantization to create low-bit models for inference. This lowers the precision of weights and activations and greatly reduces the memory and compute requirements of LLMs.
- Recent work on 1-bit model architectures, starting with BitNet, shows a promising direction for reducing the cost of LLMs while maintaining performance. BitNet's matrix multiplication involves only integer addition, which saves a large amount of the energy cost of LLMs.
- This work introduces BitNet b1.58, a variant of the 1-bit LLM in which every parameter takes a ternary value in {-1, 0, 1}. It is far more efficient than the FP16 LLM baseline in memory consumption, throughput (*2), and latency (*3), and it offers additional benefits such as a large improvement in 1-bit LLM performance from the introduction of 0, which enables feature filtering.
BitNet b1.58
- BitNet b1.58 is based on the BitNet architecture, a Transformer in which nn.Linear is replaced by BitLinear. It is trained from scratch with 1.58-bit weights and 8-bit activations.
- To constrain the weights to -1, 0, or +1, an absmean quantization function is adopted. It scales the weight matrix by its mean absolute value $\gamma$ and then rounds each value to the nearest integer in {-1, 0, +1} (RoundClip).
- The activation quantization function is implemented as in BitNet, except that activations before nonlinear functions are not scaled to the range $[0, Q_b]$. Instead, activations are scaled to $[-Q_b, Q_b]$ per token, which removes zero-point quantization.
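The absmean quantizer described above fits in a few lines of plain Python. This is a sketch on nested lists rather than tensors; the small `eps` added to the scale is an assumption to avoid division by zero, and Python's built-in `round` resolves exact ties to the nearest even integer, which may differ from the rounding used in a real kernel.

```python
def absmean_quantize(W, eps=1e-8):
    """Scale W by its mean absolute value, then round and clip to {-1, 0, +1}."""
    flat = [w for row in W for w in row]
    gamma = sum(abs(w) for w in flat) / len(flat)

    def round_clip(v):
        # RoundClip(v, -1, 1) = max(-1, min(1, round(v)))
        return max(-1, min(1, round(v)))

    W_q = [[round_clip(w / (gamma + eps)) for w in row] for row in W]
    return W_q, gamma


W = [[0.9, -0.1], [-2.0, 0.4]]
W_q, gamma = absmean_quantize(W)
print(W_q, gamma)
```

Weights that are small relative to the mean magnitude become 0, which is the feature-filtering effect that the ternary scheme adds over pure ±1 binarization.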
LLaMA-alike Components.
- The BitNet b1.58 architecture adopts the components of LLaMA, the de facto standard among open-source LLMs. It uses RMSNorm, SwiGLU, and rotary embedding, and removes all biases. As a result, BitNet b1.58 can be integrated into popular open-source software such as Huggingface, vLLM, and llama.cpp with minimal effort.
Result
- BitNet b1.58 is compared with a reproduced FP16 LLaMA LLM at various sizes. For a fair comparison, the models are pre-trained on the RedPajama dataset for 100 billion tokens.
- Zero-shot performance is evaluated on a range of language tasks, and validation perplexity is reported on the WikiText2 and C4 datasets.
- At the 3B model size, BitNet b1.58 matches the full-precision LLaMA LLM in perplexity while being 2.71 times faster and using 3.55 times less GPU memory.
- BitNet b1.58 3.9B is 2.4 times faster and uses 3.32 times less memory than LLaMA LLM 3B, while performing significantly better: it matches or exceeds it on end-task accuracy.
- These results indicate that BitNet b1.58 is a Pareto improvement over current state-of-the-art LLMs (an improvement in which nothing gets worse).
Memory and Latency
- When the model size is scaled to 7B, 13B, and 70B and the cost is evaluated, the speed-up increases with model size. In particular, BitNet b1.58 70B is 4.1 times faster than the LLaMA LLM baseline.
- Memory consumption follows a similar trend: larger models are more memory-efficient relative to the baseline. The embedding layer stays in full precision, and its share of the whole model shrinks as the model grows. Both latency and memory were measured with a 2-bit kernel, so there is still room for optimization to reduce the cost further.
Energy
- BitNet b1.58 reduces the arithmetic operation energy of matrix multiplication by a factor of 71.4. As the model size scales, its energy efficiency relative to the FP16 LLaMA LLM baseline improves further.
Throughput
- BitNet b1.58 70B can support up to 11 times the batch size of LLaMA LLM and achieves 8.9 times higher throughput.
- BitNet b1.58 enables a new scaling law for model performance and inference cost. It offers the following equivalences between different model sizes.
- 13B BitNet b1.58 is more efficient than a 3B FP16 LLM in latency, memory usage, and energy consumption.
- 30B BitNet b1.58 is more efficient than a 7B FP16 LLM in latency, memory usage, and energy consumption.
- 70B BitNet b1.58 is more efficient than a 13B FP16 LLM in latency, memory usage, and energy consumption.
Training with 2T Tokens
- BitNet b1.58 is trained on 2T tokens following the data recipe of StableLM-3B and evaluated on a benchmark consisting of Winogrande, PIQA, SciQ, LAMBADA, and ARC-easy.
- BitNet b1.58 achieves superior performance on all end tasks, indicating that 1.58-bit LLMs have strong generalization ability.
Discussion and Future Work
1-bit Mixture-of-Experts (MoE) LLMs
- Mixture-of-Experts (MoE) LLMs greatly reduce computation FLOPs, but their high memory consumption and inter-chip communication overhead limit deployment and application. The 1.58-bit LLM can address these challenges: it reduces the number of devices needed to deploy an MoE model and greatly reduces the overhead of transferring activations across the network.
Native Support of Long Sequence in LLMs
- For long sequences, the memory consumed by the KV cache (*4) is the main challenge in inference. By reducing activations from 16 bits to 8 bits, BitNet b1.58 doubles the context length that fits in the same resources, which is an important step toward native support of long sequences.
LLMs on Edge and Mobile
- 1.58-bit LLMs could greatly improve the performance of language models on edge and mobile devices, which are limited in memory and compute. This would enable applications that were previously impossible and greatly enhance the capabilities of edge and mobile devices.
New Hardware for 1-bit LLMs
- Regarding new hardware for 1-bit LLMs, recent work such as Groq has shown promising results and great potential in building dedicated hardware for LLMs (e.g., LPUs). The paper calls for action to design new hardware and systems optimized specifically for the new computation paradigm that BitNet enables.
Summary of BitNet, the paper this work builds on:
reseachpaper-matome.hatenablog.com
Glossary
*1 Why 1.58? ... When the values {1, 0, -1} each occur with probability 1/3, the average information content is $\log_2 3 \approx 1.58$ bits.
*2 Throughput ... The amount of data that can be processed or transferred per unit of time.
*3 Latency ... The time delay for a system or network to process or transfer data.
*4 KV cache ... Short for key-value cache. In Transformer inference it stores the attention key and value tensors of tokens that have already been processed, so they do not have to be recomputed for each newly generated token.
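The 1.58 figure in note *1 can be reproduced with the Shannon entropy formula. The sketch below is a minimal check using only the standard library; the function name is an assumption.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


ternary_bits = entropy_bits([1 / 3, 1 / 3, 1 / 3])
print(round(ternary_bits, 2))  # bits per ternary weight
```

A uniform choice among three values carries $\log_2 3$ bits, compared with exactly 1 bit for a binary ±1 weight.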