A Look at FMA (Fused Multiply-Add) from Several Angles
This is Koshimizu (@curekoshimizu).
Today I'd like to talk about FMA.
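What is FMA?
FMA stands for Fused Multiply-Add, which is the operation
$\circ(x \times y + z)$
where $\circ(\cdot)$ denotes rounding to a floating-point number.
That really is all there is to it, but in this article I'd like to write about FMA with some enthusiasm.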
My motivation for writing about FMA is this: this blog covers many topics related to numerical precision, and a great many of them involve FMA, or will involve it in upcoming posts.
Each time, there is a lot to add about FMA, so I decided to collect it all here.
For example, FMA instructions and rounding error have already come up in this article:
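After reading this article, you will:
1. Understand why FMA matters
It is tied to both precision and speed.
2. Know whether your own environment supports FMA instructions
I have summarized some of the history of hardware with FMA instructions.
3. Know how to perform an FMA operation even without hardware support
There is a way to compute an FMA without an FMA instruction, and I will introduce it.
4. Know how to get the compiler to emit FMA instructions
Having an FMA instruction is pointless unless it is actually used!
5. See an application of FMA
In connection with this, I also cite the paper in which Horner's method originated, so take a look if that interests you.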
What exactly is good about it?
There are two benefits:
1. Fewer roundings, which helps precision
2. Higher speed when hardware support is available
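Benefit 1 – Fewer roundings, which helps precision
Computing $x \times y + z$ without an FMA instruction gives
$\circ(\circ(x \times y) + z)$
that is, one multiply instruction followed by one add instruction.
In other words, rounding happens twice.
With an FMA instruction the result is
$\circ(x \times y + z)$
so only one rounding occurs, which is an advantage in terms of precision.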
Benefit 2 – Higher speed when hardware support is available
Many recent CPUs support FMA as a hardware instruction!
Beyond CPUs, other hardware such as GPUs often supports it as well.
As a result, the operation
$x \times y + z$
can be computed with a single instruction. Without FMA, it has to be executed as two instructions, a multiply and an add (setting aside the difference in rounding error).
What does this mean in practice?
To see how much FMA matters, let's work out the theoretical peak performance of a GPU, the GeForce GTX 1080.
The GeForce GTX 1080 has
- 2560 CUDA cores
- a 1.733 GHz clock
and each CUDA core can issue one single-precision FMA instruction per clock.
An FMA counts as two operations, a multiply and an add, so by definition it performs two floating-point operations.
The theoretical single-precision peak performance is therefore
2560 (CUDA cores) × 1.733 (GHz) × 2 (FLOPs / (clock · CUDA core)) ≈ 8873 GFLOPS
Without an FMA instruction, this factor of 2 disappears and the peak performance figure is halved.
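The arithmetic above fits in a one-line helper. This is just a sketch for checking the numbers; the function name `peak_gflops` and its parameters are made up for illustration.

```c
/* Theoretical peak in GFLOPS:
   cores * clock in GHz * floating-point operations per clock per core.
   flops_per_clock is 2 when every core can issue one FMA per clock,
   and 1 when only a separate multiply or add can be issued. */
double peak_gflops(int cores, double clock_ghz, int flops_per_clock) {
    return cores * clock_ghz * flops_per_clock;
}
```

Calling `peak_gflops(2560, 1.733, 2)` gives about 8873 GFLOPS, and passing 1 instead of 2 gives about 4436 GFLOPS, which is the halving described above.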
As an aside, this means the GTX 1080 delivers about 8.87 TFLOPS in single precision, which makes you appreciate just how capable GPUs are as of 2017.
So the FMA instruction also plays a role in performance figures! In other words, without this instruction you can reach at most half of the theoretical peak performance.
That is why making proper use of FMA instructions is closely tied to whether you are getting the full power out of the hardware.
Does your hardware support FMA instructions?
I wrote that many CPUs support FMA, but if you are using a relatively old CPU, it may not be supported.
My desktop and laptop PCs are as follows:
(Desktop PC)
- CPU : Core i7-3820 CPU @ 3.60GHz
- Memory : 64GB (DDR3-1600)
(Laptop PC)
- CPU : Core i5-4288U CPU @ 2.60GHz
- Memory : 8GB (DDR3-1600)
Judging by memory alone, the desktop looks like the better machine!
When it comes to FMA support, however,
the laptop supports FMA instructions and the desktop does not.
So my desktop is a very poor environment for trying out FMA.
The boundary between CPUs with and without FMA lies around the 2011-2013 generations, which I will come back to below.
Also, even without hardware FMA, you can still compute an FMA, that is,
$\circ(x \times y + z)$
Because there is no hardware support this is quite slow, but it can still be used to get the precision benefit.
I will come back to this as well.
FMA and its history
I mentioned above that a given CPU may or may not have FMA, so when did it first appear?
The first machine with an FMA instruction was the IBM RS/6000 in 1990.
The paper on the IBM RS/6000, "Design of the IBM RISC System/6000 floating-point execution unit", says the following:
The IBM RISC System/6000® (RS/6000) floating-point unit (FPU) exemplifies a second-generation RISC CPU architecture and an implementation which greatly increases floating-point performance and accuracy. The key feature of the FPU is a unified floating-point multiply-add-fused unit (MAF) which performs the accumulate operation (A times B) + C as an indivisible operation.
As this shows, FMA was introduced to improve both performance and accuracy. At the time, it was apparently known as MAF (multiply-add-fused).
It was adopted into the floating-point standard later, in 2008: IEEE 754-2008 defines a strict specification for FMA.
Before that specification was settled, FMA had already been implemented in hardware.
So you might expect that by around 2008 most CPUs had an FMA instruction.
Historically, though, the ones to watch out for are x86_64, that is, the Intel and AMD CPUs used in our desktops and laptops.
They only gained support fairly recently. (Apologies if my sense of "recently" is off.)
FMA and Intel CPUs
On the Intel side, FMA support finally arrived with Haswell.
Concretely, the Intel Core generations up to the present are:
- 1st generation: Nehalem
- 2nd generation: Sandy Bridge
- 3rd generation: Ivy Bridge
- 4th generation: Haswell (around 2013)
- 5th generation: Broadwell
- 6th generation: Skylake
- 7th generation: Kaby Lake (current as of 2017)
- 8th generation: Coffee Lake
and hardware support for FMA instructions began with the 4th generation, Haswell.
(Desktop PC)
- CPU : Core i7-3820 CPU @ 3.60GHz
- Memory : 64GB (DDR3-1600)
(Laptop PC)
- CPU : Core i5-4288U CPU @ 2.60GHz
- Memory : 8GB (DDR3-1600)
Of the machines listed earlier, the Core i7-3820 is 2nd generation Sandy Bridge (Sandy Bridge-E, to be precise) and has no FMA instruction, whereas the Core i5-4288U in the laptop is 4th generation Haswell.
FMA and AMD CPUs
On the AMD side, FMA instructions have been supported since Bulldozer (AMD FX), and Ryzen, from the much-discussed AMD Zen line, supports them too.
The details are a bit more complicated: there are two variants, FMA3 and FMA4, which differ in how many registers they use, and Intel and AMD have a history over which of the two would become mainstream.
FMA3 is a three-register instruction such as
$a \leftarrow a \times b + c$
while FMA4 is a four-register instruction such as
$a \leftarrow b \times c + d$
Intel has supported only FMA3 from the start, whereas Bulldozer (around 2011), AMD's first CPU with FMA, supported only FMA4.
The recently released Ryzen drops FMA4 and supports only FMA3,
so FMA3 looks set to become the mainstream.
Setting aside these finer classifications of FMA instructions,
the upshot is that FMA instructions became widely usable on x86_64 around 2011-2013 (specifically, Haswell for Intel and Bulldozer for AMD).
Don't misread this as saying that Haswell was Intel's first CPU with FMA. For example, Itanium (IA-64), from around 2001, already had FMA instructions.
Also, a CPU newer than these dates is not guaranteed to have FMA, so do check your own CPU! On Linux, for example, you can tell whether it is supported by looking in /proc/cpuinfo.
(FMA is supported! Result on the Core i5-4288U)
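As a rough sketch of that check in C, one could scan /proc/cpuinfo for `fma` as a whole word, so that `fma4` or `avx512ifma` are not mistaken for it. The helper names `line_has_flag` and `cpu_reports_fma` are made up for illustration, and the code assumes a Linux-style /proc/cpuinfo whose feature list is on a line starting with `flags`.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if `flag` occurs in `line` as a whole whitespace-delimited word. */
int line_has_flag(const char *line, const char *flag) {
    size_t n = strlen(flag);
    const char *p = line;
    if (n == 0) return 0;
    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        int ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\t' || p[n] == '\n';
        if (starts && ends) return 1;
        p += 1;  /* keep searching after a partial match such as "fma4" */
    }
    return 0;
}

/* Returns 1 if /proc/cpuinfo lists fma, 0 if it does not,
   and -1 if the file cannot be opened (for example on non-Linux systems). */
int cpu_reports_fma(void) {
    char buf[8192];
    int found = 0;
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (f == NULL) return -1;
    while (fgets(buf, sizeof buf, f) != NULL) {
        if (strncmp(buf, "flags", 5) == 0 && line_has_flag(buf, "fma")) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```

On a machine like the Core i5-4288U above, `cpu_reports_fma()` would be expected to return 1, and on the Core i7-3820 it would be expected to return 0.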
I'm starting to digress, so let's move on.
Having FMA instructions is pointless unless they are actually used! So what is the current state of programming languages and compilers?
FMA and programming languages
Restricting ourselves to C and C++: C has provided functions for FMA computation since C99, and C++ since C++11.
Specifically, they are provided as fmaf, fma, and fmal, for float, double, and long double respectively.
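C99 is, as the name says, a 1999 standard, but as described above, hardware support for the FMA operation
$\circ(x \times y + z)$
is a recent development.
In fact, an FMA computation is possible even without hardware support!
It takes quite a lot of instructions, though, so it is slow when there is no hardware support!!
Because it takes so many instructions, I deliberately wrote "FMA computation" here rather than "FMA instruction".
Please keep that distinction in mind.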
FMA and compiler optimization
When the hardware supports FMA, does the compiler automatically turn code into FMA instructions?
This, too, turns out to be a little complicated.
The story differs somewhat between gcc and clang.
Compilers and optimization options
Both gcc and clang have optimization options, and by default no optimization is performed.
The main optimization options for gcc and clang are:
- -O0 (default: no optimization)
- -O1
- -O2
- -O3
- -Ofast
The further down the list, the more aggressively the compiler optimizes.
For reference, there are other optimization options as well, such as:
- -Og : Optimize debugging experience
- -Os : Optimize for size
To find out what each optimization level does, the following may be helpful:
(gcc)
- https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
- http://syohex.hatenablog.com/entry/20110616/1308237655
(clang)
The -march=native option
The optimization options above do not, on their own, take differences between CPUs into account. Specifically,
the differences among the Intel CPUs listed earlier,
- 1st generation: Nehalem
- 2nd generation: Sandy Bridge
- 3rd generation: Ivy Bridge
- 4th generation: Haswell (around 2013)
- 5th generation: Broadwell
- 6th generation: Skylake
- 7th generation: Kaby Lake (current as of 2017)
- 8th generation: Coffee Lake
are not accounted for by the -O options alone.
With only those optimization options, the compiler has to produce code that runs both on a CPU with FMA instructions, such as Haswell, and on a CPU without them, such as Sandy Bridge,
so it ends up generating code that does not use FMA instructions.
Even if your CPU has FMA, a more precise and faster instruction, the -O options alone will not make the compiler use it.
What a waste!
To help the optimizer, there are options that tell the compiler "this code will only ever run on this CPU" or "this code will only ever run on CPUs that can do this operation".
One such option is -march=native.
For example, with gcc 5.4.0 on a machine with a Core i7-3820, specifying
$ gcc -march=native
enables the following options:
-march=sandybridge -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mno-movbe -maes -mno-sha -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mavx -mno-avx2 -msse4.2 -msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd -mno-f16c -mno-fsgsbase -mno-rdseed -mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er -mno-avx512cd -mno-avx512pf -mno-prefetchwt1 -mno-clflushopt -mno-xsavec -mno-xsaves -mno-avx512dq -mno-avx512bw -mno-avx512vl -mno-avx512ifma -mno-avx512vbmi -mno-clwb -mno-pcommit -mno-mwaitx --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=10240 -mtune=sandybridge
You can see that it tells the compiler things such as the cache sizes and which SSE (Streaming SIMD Extensions) instructions are supported.
What we learn here is that the options
-mno-fma -mno-fma4 -mno-avx512ifma
are set, so it is easy to guess that this CPU has no FMA instructions.
For details on which options -march=native enables, the following article is helpful:
As for how much faster code gets with -O1/-O2/-O3/-Ofast and -march=native, the following article may be a useful reference:
http://www.phoronix.com/scan.php?page=news_item&px=GCC-Optimizations-E3V5-Levels
Just studying these optimization options is interesting in its own right.
Which optimization options make the compiler emit FMA?
Let's compile the following code on a Haswell machine, which has FMA instructions.
float fma_test(float x, float y, float z) { return x*y + z; }
$ gcc -S hoge.c
Compiling with this command gives:
.....
mulsd -16(%rbp), %xmm0
addsd -24(%rbp), %xmm0
.....
So the code runs as separate multiply and add instructions, without an FMA instruction.
$ gcc -O3 -march=native
Compiling with these options instead gives:
.....
vfmadd132sd %xmm1, %xmm2, %xmm0
.....
An FMA instruction (more precisely, an FMA3 instruction as described above) is emitted.
On the other hand, compiling with
$ clang -O3 -march=native
gives
.....
vmulsd %xmm1, %xmm0, %xmm0
vaddsd %xmm2, %xmm0, %xmm0
.....
and no FMA instruction is used.
So the point to watch out for is that gcc and clang can differ in whether they emit FMA even when given the same options.
The tables below summarize the differences:
float fma_test(float x, float y, float z) { return x*y + z; }
Whether the code above is compiled to an FMA instruction, by option
(Core i5-4288U CPU (Haswell / has FMA))
Option | gcc 5.4.0 | clang 3.8.0 |
---|---|---|
-O0 | no | no |
-O1 | no | no |
-O2 | no | no |
-O3 | no | no |
-Ofast | no | no |
-march=native -O0 | no | no |
-march=native -O1 | no | no |
-march=native -O2 | yes | no |
-march=native -O3 | yes | no |
-march=native -Ofast | yes | yes |
(Core i7-3820 (Sandy Bridge-E / no FMA))
Option | gcc 5.4.0 | clang 3.8.0 |
---|---|---|
-O0 | no | no |
-O1 | no | no |
-O2 | no | no |
-O3 | no | no |
-Ofast | no | no |
-march=native -O0 | no | no |
-march=native -O1 | no | no |
-march=native -O2 | no | no |
-march=native -O3 | no | no |
-march=native -Ofast | no | no |
First, this shows how important -march=native is. Without it the compiler cannot know whether the target has FMA instructions, so none are emitted.
Second, gcc and clang enable FMA at different optimization levels: gcc from -O2, and clang apparently only at -Ofast.
There is also an option called -ffp-contract. With -ffp-contract=fast, clang too emits fma instructions from -O1 upward where possible.
Option | gcc 5.4.0 | clang 3.8.0 |
---|---|---|
-ffp-contract=fast -march=native -O0 | no | no |
-ffp-contract=fast -march=native -O1 | no | yes (changed) |
-ffp-contract=fast -march=native -O2 | yes | yes (changed) |
-ffp-contract=fast -march=native -O3 | yes | yes (changed) |
-ffp-contract=fast -march=native -Ofast | yes | yes |
The gcc reference describes -ffp-contract as follows:
-ffp-contract=off disables floating-point expression contraction. -ffp-contract=fast enables floating-point expression contraction such as forming of fused multiply-add operations if the target has native support for them. -ffp-contract=on enables floating-point expression contraction if allowed by the language standard. This is currently not implemented and treated equal to -ffp-contract=off.
The default is -ffp-contract=fast.
From this we see that gcc already has -ffp-contract=fast in effect by default. clang, on the other hand, does not appear to default to fast, given that its behavior changed when the option was added.
This seems to reflect a difference in philosophy between the two compilers.
With gcc, adding -fexpensive-optimizations is enough to get the FMA at -O1 as well.
Finally, -march=native is a shorthand that turns on a number of options, but the one that really matters here is -mfma.
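One way to confirm from inside a program whether such options took effect is to test the `__FMA__` macro, which gcc and clang predefine on x86 when FMA code generation is enabled (by -mfma, or by a -march value that implies it). C99 also allows <math.h> to define `FP_FAST_FMA` when fma() is about as fast as a separate multiply and add. The sketch below uses hypothetical function names and only reports which macros are visible at compile time; it says nothing about the CPU the binary later runs on.

```c
#include <math.h>

/* Returns 1 if this file was compiled with x86 FMA code generation
   enabled (gcc/clang define __FMA__ in that case), otherwise 0. */
int compiled_with_fma(void) {
#ifdef __FMA__
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if the C library advertises a fast double-precision fma()
   through the optional C99 macro FP_FAST_FMA, otherwise 0. */
int fast_fma_advertised(void) {
#ifdef FP_FAST_FMA
    return 1;
#else
    return 0;
#endif
}
```

Compiling the same file once with and once without -mfma and comparing the return value of `compiled_with_fma()` is a quick way to see the effect of the option.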
To summarize, the code
float fma_test(float x, float y, float z) { return x*y + z; }
is compiled to an FMA instruction with any of:
$ gcc -march=native -O2
$ gcc -march=native -O1 -fexpensive-optimizations
$ gcc -mfma -O2
$ gcc -mfma -O1 -fexpensive-optimizations
$ clang -march=native -Ofast
$ clang -march=native -O1 -ffp-contract=fast
$ clang -mfma -Ofast
$ clang -mfma -O1 -ffp-contract=fast
Now let's write the code this way instead:
#include <math.h>
float fma_test(float x, float y, float z) { return fma(x, y, z); }
In this case we get:
(Core i5-4288U CPU (Haswell / has FMA))
Option | gcc 5.4.0 | clang 3.8.0 |
---|---|---|
-O0 | no | no |
-O1 | no | no |
-O2 | no | no |
-O3 | no | no |
-Ofast | no | no |
-march=native -O0 | yes | yes |
-march=native -O1 | yes | yes |
-march=native -O2 | yes | yes |
-march=native -O3 | yes | yes |
-march=native -Ofast | yes | yes |
(Core i7-3820 (Sandy Bridge-E / no FMA))
Option | gcc 5.4.0 | clang 3.8.0 |
---|---|---|
-O0 | no | no |
…. | …. | …. |
-march=native -Ofast | no | no |
So as long as the hardware instruction is available, the call is compiled to an FMA instruction.
This makes it an effective approach when you absolutely want an FMA to be emitted.
To summarize, the code
#include <math.h>
float fma_test(float x, float y, float z) { return fma(x, y, z); }
is compiled to an FMA instruction with any of:
$ gcc -march=native
$ gcc -mfma
$ clang -march=native
$ clang -mfma
The big difference is that this works even at -O0.
Also, -ffp-contract=fast is not needed even with clang, because the user has already specified where the FMA operation goes.
What happens when you call the C99 FMA function without hardware FMA support?
The computation is still carried out properly!
The problem is that it is slow, because it is executed as multiple instructions.
To see this, let's look at the following.
Without hardware support, running nm on an executable that uses fmaf, the C99 FMA function for single precision, gives
$ nm a.out | grep fmaf
U fmaf@@GLIBC_2.2.5
which shows that it calls glibc's fmaf function.
Specifically, what gets called is
sysdeps/ieee754/dbl-64/s_fmaf.c
in the glibc source, which is a software implementation.
That implementation is based on the following paper:
In other words, it uses several operations not for speed but to keep the effect of rounding under control.
It relies on some interesting rounding-error techniques, which I would like to cover in detail at some point.
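To give a flavor of this family of techniques, here is a sketch of Dekker's error-free product, which recovers the rounding error of a double multiplication without any FMA instruction. This is not the glibc code and not a complete FMA; it is only an illustrative building block. It assumes IEEE 754 double arithmetic evaluated in double precision (no extended-precision intermediates) and no overflow or underflow, and the names `split` and `two_product` are chosen here for illustration.

```c
/* Veltkamp splitting: writes a == *hi + *lo exactly, where each half
   has few enough significant bits that products of halves are exact. */
static void split(double a, double *hi, double *lo) {
    double c = 134217729.0 * a;  /* 2^27 + 1 */
    *hi = c - (c - a);
    *lo = a - *hi;
}

/* Dekker's product: *p is the rounded product a*b and *e is its
   rounding error, so that a*b == *p + *e holds exactly. */
void two_product(double a, double b, double *p, double *e) {
    double ah, al, bh, bl;
    *p = a * b;
    split(a, &ah, &al);
    split(b, &bh, &bl);
    *e = ((ah * bh - *p) + ah * bl + al * bh) + al * bl;
}
```

For a = b = 1 + 2^-30, the rounded product is 1 + 2^-29 and the recovered error is 2^-60. Once the product is available exactly as a pair like this, a software FMA can add z to it and round only once at the end, which takes further care that is not shown here.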
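An FMA application – the benefit of FMA in polynomial evaluation
As we have seen, you need to know quite a lot before you can benefit from FMA instructions.
You may also be wondering whether FMA really comes up that often.
It comes up all the time!
There are many cases I would like to show,
but here let's look at polynomial evaluation.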
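Consider the following polynomial:
$f(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_7 x^7$
You have probably evaluated a polynomial like this term by term before. If each term $a_k x^k$ is computed with $k$ multiplications, the whole evaluation takes 0+1+2+3+4+5+6+7 = 28 multiplications and 7 additions. That is 35 roundings.
Now consider bracketing it as follows:
$f(x) = a_0 + (a_1) x + (a_2 x) x + (a_3 x^2) x + \cdots + (a_7 x^6) x$
If each factor $a_k x^{k-1}$ is computed with $k-1$ multiplications, the evaluation takes 0+0+1+2+3+4+5+6 = 21 multiplications and 7 FMAs. That is 28 roundings.
So an expression with 35 roundings has become one with 28.
Going further, the degree-7 polynomial above can be rewritten in the form known as Horner's method:
$f(x) = ((((((a_7 x + a_6) x + a_5) x + a_4) x + a_3) x + a_2) x + a_1) x + a_0$
This can be computed with just 7 FMAs.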
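Incidentally, Horner's method is very well known, and I was curious enough to look up the paper in which it originated. It is Horner's paper from 1819:
Horner's method is the famous one, of course, but from the standpoint of speed, a rearrangement like the following also deserves attention:
$f(x) = \big((a_7 x + a_6) x^2 + (a_5 x + a_4)\big) x^4 + \big((a_3 x + a_2) x^2 + (a_1 x + a_0)\big)$
With this method the polynomial can again be computed with 7 FMA operations (once $x^2$ and $x^4$ are available).
Its distinguishing feature is that four of those FMA operations,
$a_7 x + a_6, \quad a_5 x + a_4, \quad a_3 x + a_2, \quad a_1 x + a_0$
can be computed in parallel, which gives it an edge over Horner's method.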
Many polynomial evaluation schemes of this kind are known, and I hope to summarize them somewhere.
In closing
That was a tour of various aspects of FMA.
FMA is a very useful instruction and a real help for performance, so please put it to use!
And if the household budget for a new PC is not getting approved,
you might try: "My current PC doesn't support FMA, but the recent ones do, so they are about twice as fast!" I have a feeling that could persuade your wife.
Feel free to use it for that, too.
FMA is closely tied to rounding error, so it will appear frequently on this blog from now on!
Thanks for reading.
This may well be the first article of mine with so few formulas in it.