Notes from Finbarr's blog
Global params
Const:         2     (store one set of weights for the K values, and one for the Vs)
Bytes / param: 0.5   (number of bytes per param, set by quantization)
Batch size:    1
The latency is the maximum of the compute latency and the memory latency.
Memory bandwidth (bytes/s) is often << compute throughput (FLOPS), so memory latency is the constraint.
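A minimal sketch of that max(compute, memory) calculation, using the global params above. The function name is mine, and the ~2.6 TFLOPS figure for the M1 GPU is an assumption not taken from these notes:

```python
def inference_latency(n_params, n_bytes, flops, mem_bw, batch_size=1, const=2):
    """Per-token latency lower bound: the slower of compute and memory."""
    # Compute: const * n_params FLOPs per token, times the batch size.
    latency_compute = const * n_params * batch_size / flops
    # Memory: const * n_params weights, each n_bytes wide, read once per token.
    latency_memory = const * n_params * n_bytes / mem_bw
    return max(latency_compute, latency_memory)

# 7B model, 4-bit (0.5 bytes/param), on an M1: 68.25 GB/s memory bandwidth,
# ~2.6 TFLOPS (assumed). Memory latency dominates, matching the table below.
print(round(inference_latency(7e9, 0.5, 2.6e12, 68.25e9), 4))  # -> 0.1026
```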

Latency (s)

            Llama Model Params (B)
            7        13       32       65
M1          0.1026   0.1905   0.4689   0.9524
M2          0.0350   0.0650   0.1600   0.3250
M2 Max      0.0175   0.0325   0.0800   0.1625
A100        0.0036   0.0067   0.0165   0.0336

Tok / sec (ceiling for 4-bit quantized Llama model)

          GPU memory            Llama Model Params (B)
          bandwidth (GB/s)      7        13       32       65
M1        68.25                 9.75     5.25     2.13     1.05
M2        200                   28.57    15.38    6.25     3.08
M2 Max    400                   57.14    30.77    12.50    6.15
A100      1935                  276.43   148.85   60.47    29.77
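The tok/sec ceilings in the table are just memory bandwidth divided by the bytes of weights moved per token; a quick sketch (the helper name is mine):

```python
def tok_per_sec_ceiling(mem_bw_gbs, params_b, n_bytes=0.5, const=2):
    # Memory-bound ceiling: GB/s of bandwidth over GB of weights read per token.
    return mem_bw_gbs / (const * params_b * n_bytes)

# M1 row: 68.25 GB/s against 7B, 13B, 32B, 65B params.
print([round(tok_per_sec_ceiling(68.25, p), 2) for p in (7, 13, 32, 65)])
# -> [9.75, 5.25, 2.13, 1.05]
```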