Analysis of the effects of LLM inference acceleration methods (W&B Fully Connected 2024). Recruit Co., Ltd., Megagon Labs, Hiroshi Matsuda. Made by Hiroshi Matsuda using W&B
by Team PyTorch. This post is the second part of a multi-series blog focused on how to accelerate generative AI models with pure, native PyTorch. We are excited to share a breadth of newly released PyTorch performance features alongside practical examples to see how far we can push PyTorch native performance. In part one, we showed how to accelerate Segment Anything over 8x using only pure, native PyTorch.
Introduction

The transformers library is widely used for text generation with language models. While it supports a broad range of models, it is not fully optimized for text-generation speed or memory efficiency. This article therefore introduces tools for improving text-generation efficiency.

Here we compare DeepSpeed, vLLM, and CTranslate2, all of which can be installed easily from PyPI. The model is rinna/japanese-gpt-neox-3.6b-instruction-ppo; see the model card for the prompt format and tokenizer usage.

Generation speed in this article is measured on Colab's T4 GPU type. Notebooks for trying each tool, along with links to open them in Colab, are included for reference.
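The comparison above boils down to timing how many tokens each backend emits per second. Here is a minimal sketch of such a timing harness; the function name and the dummy backend are illustrative, not code from the linked notebooks, and any of DeepSpeed, vLLM, or CTranslate2 can be wrapped to match the `generate_fn(prompt)` signature:

```python
import time

def measure_tokens_per_second(generate_fn, prompt, n_runs=3):
    """Call generate_fn(prompt) n_runs times and return mean throughput.

    generate_fn is assumed to return the list of generated token ids.
    """
    total_tokens = 0
    total_time = 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        tokens = generate_fn(prompt)
        total_time += time.perf_counter() - start
        total_tokens += len(tokens)
    return total_tokens / total_time

# A dummy generator standing in for a real backend:
dummy = lambda prompt: list(range(32))  # pretends to emit 32 tokens
throughput = measure_tokens_per_second(dummy, "こんにちは")
```

Averaging over several runs matters on Colab, where the first call often pays one-time warm-up costs (CUDA context creation, kernel compilation) that would otherwise skew a single measurement.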
by Sayak Paul and Patrick von Platen (Hugging Face 🤗). This post is the third part of a multi-series blog focused on how to accelerate generative AI models with pure, native PyTorch. We are excited to share a breadth of newly released PyTorch performance features alongside practical examples to see how far we can push PyTorch native performance. In part one, we showed how to accelerate Segment Anything over 8x using only pure, native PyTorch.
```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

model = tf.keras.Sequential([...])

# Ramp sparsity from 0% to 50% between training steps 2000 and 4000.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0,
    final_sparsity=0.5,
    begin_step=2000,
    end_step=4000)

# Wrap the model so low-magnitude weights are pruned during training.
model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule)
...
model_for_pruning.fit(...)
```

TensorFlow Model Optimization
[English ver.] [Tensorflow Lite] Various Neural Network Model quantization methods for Tensorflow Lite (Weight Quantization, Integer Quantization, Full Integer Quantization, Float16 Quantization, EdgeTPU). As of May 05, 2020. Python / DeepLearning / TensorFlow / PyTorch / OpenVINO. 1. Introduction: In this article, I'd like to share with you the quantization workflow I've been working on.
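The integer schemes listed above (Weight, Integer, and Full Integer Quantization) all rest on the same affine mapping between floats and int8 values. A minimal sketch of that arithmetic follows; the scale and zero-point values are illustrative, not ones a real TFLite converter would compute from calibration data:

```python
def quantize_int8(x, scale, zero_point):
    """Affine quantization: q = round(x / scale) + zero_point,
    clamped to the int8 range [-128, 127]."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))

def dequantize_int8(q, scale, zero_point):
    """Recover the approximate float: x ≈ (q - zero_point) * scale."""
    return (q - zero_point) * scale

# Round-tripping a value exposes the quantization error:
x = 0.7
q = quantize_int8(x, scale=0.25, zero_point=0)      # coarse scale for illustration
x_hat = dequantize_int8(q, 0.25, 0)                 # close to, but not exactly, 0.7
```

The schemes differ mainly in which tensors get this treatment (weights only, or weights plus activations) and in how scale/zero-point are chosen, not in the mapping itself.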
A summary of my attempts at running ONNX's optimizations end to end.

Getting the list of supported optimizations: the supported optimization passes can be retrieved with get_available_passes.

from onnx import optimizer
all_passes = optimizer.get_available_passes()

Broadly, they can be classified as follows:
- removal of meaningless ops (eliminate_deadend, etc.)
- fusion of two ops (fuse_matmul_add_bias_into_gemm, etc.)
- fusion into Conv (fuse_add_bias_into_conv, etc.)
- others

The fusions into Conv did not work at all for me; I'm waiting for a version upgrade. Optimization results: I wrote up the details on Qiita ("eliminate_deadend optimization in ONNX", "ONNX eliminate_i…").
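The grouping above can be sketched as a simple prefix/suffix test over the pass names. The list below is a small hand-picked subset of real pass names, not the full output of get_available_passes, and the classify helper is illustrative rather than anything the onnx package provides:

```python
from collections import defaultdict

# A few pass names of the kind get_available_passes() returns;
# this is a hand-picked subset, not the full list.
passes = [
    "eliminate_deadend",
    "eliminate_identity",
    "fuse_matmul_add_bias_into_gemm",
    "fuse_add_bias_into_conv",
    "fuse_bn_into_conv",
]

def classify(pass_names):
    """Group passes into the categories described above: dead-op
    elimination, fusion into Conv, other fusions, and the rest."""
    groups = defaultdict(list)
    for name in pass_names:
        if name.startswith("eliminate_"):
            groups["eliminate"].append(name)
        elif name.startswith("fuse_") and name.endswith("_into_conv"):
            groups["fuse_into_conv"].append(name)
        elif name.startswith("fuse_"):
            groups["fuse"].append(name)
        else:
            groups["other"].append(name)
    return dict(groups)
```

Note that in newer releases the optimizer was split out of the onnx package into the standalone onnxoptimizer project, so the `from onnx import optimizer` import above only works on older versions.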
A repository for storing models that have been inter-converted between various frameworks. Supported frameworks are TensorFlow, PyTorch, ONNX, OpenVINO, TFJS, TFTRT, TensorFlowLite (Float32/16/INT8), EdgeTPU, CoreML. TensorFlow Lite, OpenVINO, CoreML, TensorFlow.js, TF-TRT, MediaPipe, ONNX [.tflite, .h5, .pb, saved_model, tfjs, tftrt, mlmodel, .xml/.bin, .onnx]