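llama.cpp gained an option for splitting work between CPU and GPU in a way that suits MoE models, and LM Studio now supports it as well. As a result, GPT-oss, which is an MoE model, runs reasonably well even with little GPU memory. If you zoom in on the screenshot, the readout at the bottom right of LM Studio shows it using about 12GB of main memory.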

It gets 14 tok/sec.

Running on the CPU alone gave 10 tok/sec, so this is roughly 40 to 50% faster.

Version 0.3.23.0 adds a switch called "Force Model Expert weight onto CPU". Turning it on puts all of the expert weights on the CPU, while attention stays on the GPU.

The details are in the release notes, which say it uses llama.cpp's --n-cpu-moe mechanism.
https://lmstudio.ai/blog/lmstudio-v0.3.23#force-moe-expert-weights-onto-cpu-or-gpu
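
To make the idea of "experts on the CPU, everything else on the GPU" concrete, here is a minimal sketch of that kind of placement rule. It is not llama.cpp's or LM Studio's actual code: the function `place_tensors`, its parameter `n_cpu_moe`, and the GGUF-style tensor names are assumptions chosen for illustration. The assumed behaviour is that expert FFN tensors of the first `n_cpu_moe` layers stay on the CPU, so setting it to the number of layers corresponds to sending every expert to the CPU.

```python
import re

def place_tensors(tensor_names, n_cpu_moe):
    """Map each tensor name to "CPU" or "GPU".

    Expert FFN tensors in the first n_cpu_moe layers go to the CPU;
    everything else (attention, remaining experts) goes to the GPU.
    """
    # Assumed naming convention: blk.<layer>.ffn_{up,down,gate}_exps.weight
    expert_pattern = re.compile(r"blk\.(\d+)\.ffn_(up|down|gate)_exps")
    placement = {}
    for name in tensor_names:
        match = expert_pattern.search(name)
        if match and int(match.group(1)) < n_cpu_moe:
            placement[name] = "CPU"
        else:
            placement[name] = "GPU"
    return placement

# Hypothetical two-layer model, for illustration only.
names = [
    "blk.0.attn_q.weight",
    "blk.0.ffn_gate_exps.weight",
    "blk.1.attn_q.weight",
    "blk.1.ffn_gate_exps.weight",
]

for name, device in place_tensors(names, n_cpu_moe=1).items():
    print(f"{name:30s} -> {device}")
```

With `n_cpu_moe=2` in this toy model, both expert tensors would land on the CPU and only the attention tensors would remain on the GPU.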
By the way, with everything loaded onto the GPU it runs at 65 tok/sec.

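This is basically an application of what I wrote about here:
CPUが得意なことをCPUにまかせて少ないVRAMでも大きめのLLMを速く動かす (Let the CPU do what it is good at to run a larger LLM fast even with little VRAM) - きしだのHatena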
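Attention involves a triple loop but has relatively few parameters, so it goes on the GPU. The Feed Forward Network (FFN) has many parameters but only a double loop, so even a CPU can process it fairly quickly. That is the property being exploited here. The figure there shows about 30% of the FFN being handed to the CPU, but with the current setting all of it goes to the CPU.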
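
One way to read the loop-depth remark is sketched below. This is an illustrative assumption rather than how llama.cpp computes anything: the function names are hypothetical, the attention side shows only the naive score computation, and the parameter count assumes a plain transformer layer with four d_model × d_model attention projections and an ungated FFN with two projections.

```python
def attention_scores(queries, keys):
    """Naive Q.K^T: three nested loops (query position, key position, feature)."""
    scores = [[0.0] * len(keys) for _ in queries]
    for i, q in enumerate(queries):
        for j, k in enumerate(keys):
            for d in range(len(q)):
                scores[i][j] += q[d] * k[d]
    return scores

def ffn_matvec(weight, x):
    """Naive W.x for one token: two nested loops (output row, input feature)."""
    out = [0.0] * len(weight)
    for r, row in enumerate(weight):
        for c in range(len(x)):
            out[r] += row[c] * x[c]
    return out

def layer_param_counts(d_model, d_ff):
    """Weight counts per layer under the assumptions stated above."""
    attn = 4 * d_model * d_model  # Q, K, V and output projections
    ffn = 2 * d_model * d_ff      # up and down projections, no gate
    return attn, ffn

attn, ffn = layer_param_counts(d_model=1024, d_ff=4096)
print(f"attention weights: {attn:,}")
print(f"FFN weights:       {ffn:,}")
```

Under these assumptions the FFN holds twice as many weights as attention in a dense layer, and an MoE layer multiplies the FFN side again by the number of experts.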

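The experts in an MoE are the FFN split into many separate pieces. The total parameter count is large, but most of it goes unused at any given step of inference, so the memory it occupies is largely wasted.
So if you keep the experts on the CPU rather than on the GPU, where memory is precious, the model can still run reasonably efficiently.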
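
The trade-off can be put into rough numbers. The sketch below is illustrative arithmetic only: the function `expert_usage` and every figure passed to it (32 experts, 4 active per token, 10 million parameters per expert, 4-bit weights) are hypothetical values, not measurements of GPT-oss.

```python
def expert_usage(n_experts, experts_per_token, params_per_expert, bytes_per_param):
    """Return (bytes stored, bytes touched per token, touched fraction) for one MoE layer."""
    total_bytes = n_experts * params_per_expert * bytes_per_param
    active_bytes = experts_per_token * params_per_expert * bytes_per_param
    return total_bytes, active_bytes, active_bytes / total_bytes

# Hypothetical layer: all numbers are assumptions for illustration.
total, active, fraction = expert_usage(
    n_experts=32,
    experts_per_token=4,
    params_per_expert=10_000_000,
    bytes_per_param=0.5,  # 4-bit quantized weights
)

print(f"stored per layer:  {total / 1e6:.0f} MB")
print(f"touched per token: {active / 1e6:.0f} MB")
print(f"fraction in use:   {fraction:.1%}")
```

In this made-up configuration only one eighth of the expert weights are read for any single token, which is why parking them in main memory costs less speed than their size suggests.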
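Incidentally, I asked it for a Hello World because I wanted something that would finish in about 15 seconds for the video, but it went on to explain what to do when using an IDE, how the class and file name relate, and so on, and ended up taking 30 seconds.
It seems unsure of its answer and keeps tacking on extra material.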
