An interesting case study on microarchitecture that I happened to come across while browsing LinkedIn. It describes which tools were used, and how, to solve a problem that required understanding the CPU all the way down to its microarchitecture. netflixtechblog.com To optimize a workload inside Netflix, they moved to a larger AWS instance size (from 16 vCPUs to 48 vCPUs), aiming to improve the performance of a CPU-bottlenecked workload. With this instance migration they expected performance to scale almost linearly, with throughput roughly tripling. In the end, however, the migration did not achieve the expected performance. https://netflixtechblog.com/seeing-through-hardware-counters-a-journey-to-threefold-pe
```yaml
modules:
  jmeter:
    version: 5.4.1  # the version written here is downloaded automatically
    properties:
      log_level.JMeter: WARN
      log_level.JMeter.threads: WARN
    system-properties:
      org.apache.commons.logging.simplelog.log.org.apache.http: WARN
```

It works as a wrapper around existing tools. By default JMeter is executed internally, but you can also reuse scripts created with tools such as the following:

- JMeter
- Gatling
- Locust
- Selenium
- Vegeta

In other words, while I said earlier that scenarios can be written in YAML, you can of course also reuse your existing scripts.
Optimize TensorFlow performance using the Profiler. This guide explains how to use the tools available with the TensorFlow Profiler to track the performance of your TensorFlow models. You will see how a model performs on the host (CPU), on the device (GPU), or on a combination of host and device. Profiling helps you understand the hardware resource consumption (time and memory) of the various TensorFlow operations (ops) in your model, resolve performance bottlenecks, and ultimately make the model execute faster. The guide covers how to install the Profiler, the various tools available, and the different modes the Profiler offers for collecting performance data.
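As a rough sketch of how such a profile is typically captured (the toy model, log directory, and step range below are illustrative assumptions, not taken from the guide excerpt), the Keras TensorBoard callback can record Profiler traces for a range of training steps:

```python
import tensorflow as tf

# Toy model purely for illustration; any Keras model can be profiled the same way.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

# profile_batch=(10, 20) asks the Profiler to trace training steps 10 through 20.
tensorboard_cb = tf.keras.callbacks.TensorBoard(
    log_dir="logs/profile",   # placeholder log directory
    profile_batch=(10, 20),
)
model.fit(x_train, y_train, epochs=1, batch_size=64, callbacks=[tensorboard_cb])
```

The resulting trace appears in TensorBoard's Profile tab; alternatively, tf.profiler.experimental.start() and stop() can bracket an arbitrary region of code.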
We want to use the full power of our GPU during LLM inference. To do that, we need to know if our inference is compute bound or memory bound so that we can make optimizations in the right area. Calculating the operations per byte possible on a given GPU and comparing it to the arithmetic intensity of our model's attention layers reveals where the bottleneck is: compute or memory.
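As a back-of-the-envelope sketch of that comparison (the hardware figures below are approximate published A100 specs and the per-parameter costs are standard assumptions, none of them taken from the excerpt itself):

```python
# Compare a GPU's ops:byte ratio against the arithmetic intensity of
# batch-size-1 autoregressive decoding. Numbers are approximate A100 40GB
# SXM specs; substitute your own hardware's values.
peak_fp16_flops = 312e12      # ~312 TFLOPS dense FP16/BF16 tensor-core compute
memory_bandwidth = 1.555e12   # ~1.5 TB/s HBM2e bandwidth

ops_per_byte = peak_fp16_flops / memory_bandwidth
print(f"GPU ops:byte ratio ~ {ops_per_byte:.0f}")  # roughly 200

# With FP16 weights and batch size 1, each generated token reads every
# parameter once (2 bytes) and does ~2 FLOPs per parameter (multiply + add),
# so the arithmetic intensity of the matrix-vector multiplies is ~1 op/byte.
arithmetic_intensity = 2 / 2

if arithmetic_intensity < ops_per_byte:
    print("memory bound: limited by reading weights and KV cache")
else:
    print("compute bound: limited by available FLOPs")
```

Since one op per byte is far below roughly two hundred ops per byte, single-stream decoding is firmly memory bound, which is why batching and quantization (fewer bytes moved per token) help so much.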
This is the day-11 article of the "ほぼ横浜の民" Advent Calendar. This year I am writing about techniques for speeding up LLM inference. I am not an LLM specialist, but I had been curious about the topic for a while, so I studied it a little.

What this article covers:

- How an LLM generates text
- How torch.compile speeds up an LLM
- What Speculative Decoding is

Background: a little while ago I came across an excellent blog post titled Accelerating Generative AI with PyTorch II: GPT, Fast. It was published by the PyTorch team and shows that LLM inference can be made 10x faster using plain PyTorch alone. I wanted to understand how exactly that 10x speedup is achieved, so I am writing this article partly as a personal learning exercise.
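For orientation, the torch.compile step the article examines boils down to something like the sketch below (the toy module and settings are my own placeholders, not code from the article or from gpt-fast):

```python
import torch
import torch.nn as nn

# Tiny stand-in for a transformer block; gpt-fast compiles a full GPT model.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 1024),
).eval()

# torch.compile traces the model and fuses ops into optimized kernels.
# The PyTorch post additionally uses mode="reduce-overhead" (CUDA graphs)
# for the tiny per-token workloads of autoregressive decoding.
compiled_model = torch.compile(model)

x = torch.randn(8, 1024)
with torch.no_grad():
    compiled_model(x)        # first call triggers compilation (slow)
    out = compiled_model(x)  # later calls reuse the compiled kernels
print(out.shape)
```

Speculative decoding then layers on top of this: a small draft model proposes several tokens, and the large model verifies them in a single batched forward pass.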
by Team PyTorch This post is the second part of a multi-series blog focused on how to accelerate generative AI models with pure, native PyTorch. We are excited to share a breadth of newly released PyTorch performance features alongside practical examples to see how far we can push PyTorch native performance. In part one, we showed how to accelerate Segment Anything over 8x using only pure, native PyTorch.
by Sayak Paul and Patrick von Platen (Hugging Face 🤗) This post is the third part of a multi-series blog focused on how to accelerate generative AI models with pure, native PyTorch. We are excited to share a breadth of newly released PyTorch performance features alongside practical examples to see how far we can push PyTorch native performance. In part one, we showed how to accelerate Segment Anything over 8x using only pure, native PyTorch.
by Team PyTorch This post is the first part of a multi-series blog focused on how to accelerate generative AI models with pure, native PyTorch. We are excited to share a breadth of newly released PyTorch performance features alongside practical examples of how these features can be combined to see how far we can push PyTorch native performance. As announced during the PyTorch Developer Conference