Notes from Finbarr's blog
Global params
Const:         2     (store one set of weights for the K values, and one for the Vs)
Bytes / param: 0.5   (number of bytes per param, set by quantization)
Batch size:    1
The latency is the maximum of the compute latency and the memory latency.
Memory bandwidth (bytes/s) is often << compute throughput (FLOPS), so memory latency is the constraint.
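A minimal sketch of that max(compute, memory) calculation, using the global params above. The function name is mine, and the ~2.6 TFLOPS figure for the M1 GPU is an assumption not taken from these notes:

```python
def inference_latency(n_params, n_bytes, flops, mem_bw, batch_size=1, const=2):
    """Per-token latency lower bound: the slower of compute and memory."""
    # Compute: const * n_params FLOPs per token, times the batch size.
    latency_compute = const * n_params * batch_size / flops
    # Memory: const * n_params weights, each n_bytes wide, read once per token.
    latency_memory = const * n_params * n_bytes / mem_bw
    return max(latency_compute, latency_memory)

# 7B model, 4-bit (0.5 bytes/param), on an M1: 68.25 GB/s memory bandwidth,
# ~2.6 TFLOPS (assumed). Memory latency dominates, matching the table below.
print(round(inference_latency(7e9, 0.5, 2.6e12, 68.25e9), 4))  # -> 0.1026
```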

Latency (s)

            Llama Model Params (B)
            7        13       32       65
M1          0.1026   0.1905   0.4689   0.9524
M2          0.0350   0.0650   0.1600   0.3250
M2 Max      0.0175   0.0325   0.0800   0.1625
A100        0.0036   0.0067   0.0165   0.0336

Tok / sec (ceiling for 4-bit quantized Llama model)

          GPU memory            Llama Model Params (B)
          bandwidth (GB/s)      7        13       32       65
M1        68.25                 9.75     5.25     2.13     1.05
M2        200                   28.57    15.38    6.25     3.08
M2 Max    400                   57.14    30.77    12.50    6.15
A100      1935                  276.43   148.85   60.47    29.77
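The tok/sec ceilings in the table are just memory bandwidth divided by the bytes of weights moved per token; a quick sketch (the helper name is mine):

```python
def tok_per_sec_ceiling(mem_bw_gbs, params_b, n_bytes=0.5, const=2):
    # Memory-bound ceiling: GB/s of bandwidth over GB of weights read per token.
    return mem_bw_gbs / (const * params_b * n_bytes)

# M1 row: 68.25 GB/s against 7B, 13B, 32B, 65B params.
print([round(tok_per_sec_ceiling(68.25, p), 2) for p in (7, 13, 32, 65)])
# -> [9.75, 5.25, 2.13, 1.05]
```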