Skip to content

scru64/spec

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 

Repository files navigation

SCRU64: Sortable, Clock-based, Realm-specifically Unique identifier

SCRU64 ID offers compact, time-ordered unique identifiers generated by distributed nodes. SCRU64 has the following features:

  • 63-bit non-negative integer storable as signed/unsigned 64-bit integer
  • Sortable by generation time (as integer and as text)
  • 12-digit case-insensitive textual representation (Base36)
  • ~38-bit Unix epoch-based timestamp that ensures useful life until year 4261
  • Variable-length node/machine ID and counter fields that share 24 bits

Examples in the 12-digit canonical textual representation:

0u375nxqh5cq
0u375nxqh5cr
0u375nxqh5cs
0u375nxqh5ct
0u375ny0glr0
0u375ny0glr1
0u375ny0glr2
0u375ny0glr3

SCRU64's uniqueness is realm-specific, i.e., dependent on the centralized assignment of node ID to each generator. If you need decentralized, globally unique time-ordered identifiers, consider SCRU128. See the following comparison table for the properties of the two schemes.

SCRU64 SCRU128
Long name Sortable, Clock-based, Realm-specifically Unique identifier Sortable, Clock and Random number-based Unique identifier
Binary size 63 bits 128 bits
Textual size 12 digits 25 digits
Textual encoding Base36 with 0-9a-z (case-insensitive) Base36 with 0-9a-z (case-insensitive)
Timestamp range January 1, 1970 - February 27, 4261 (UTC) January 1, 1970 - August 2, 10889 (UTC)
Timestamp resolution 256 milliseconds 1 millisecond
Number of IDs per timestamp Up to 8.4 million per 256 milliseconds per node (configurable) 140.7 trillion per millisecond per node (on average)
Number of distributed nodes Up to 8.4 million (configurable) No specific limitation
Source of uniqueness Centrally pre-assigned node ID Independently generated random numbers
Choose it when you ... Prefer 64-bit integer for storage, indexing, and other reasons Want unique IDs without hassle to coordinate generators

Implementations

Specification v1.0.0

A SCRU64 ID is a non-negative integer less than 36^12 (approx. 2^62) consisting of three terms:

timestamp * 2^24 + node_id * 2^(24 - node_id_size) + counter

Where:

  • timestamp is a non-negative integer less than 36^12 / 2^24 (from zero to 282429536480) representing a 256-millisecond-precision Unix timestamp (i.e., the number of 256 milliseconds elapsed since 1970-01-01 00:00:00+00:00, ignoring leap seconds; or a Unix timestamp in milliseconds divided by 256).
  • node_id is a node_id_size-bit unsigned integer uniquely assigned to each SCRU64 generator in a relevant scope, where node_id_size is an integer between 1 and 23, inclusive.
    • The method to assign unique node_ids is implementation-dependent and is out of the scope of this specification.
    • node_id_size may be chosen arbitrarily and may vary from node to node as long as the leading bits of a longer node_id do not collide with shorter node_ids.
  • counter is a 24 - node_id_size-bit unsigned integer incremented by one whenever a generator produces a new ID. counter is reset to a random number when timestamp moves forward.
    • counter should be reset to a random number of the full counter size but may be reset to a smaller-bit (e.g., 15-bit for a 16-bit counter) random number to reserve the leading bits as an overflow guard to guarantee the room for a certain number of IDs within a timestamp tick.

This definition is equivalent to the following bitwise operations using 64-bit integers:

timestamp << 24 | node_id << (24 - node_id_size) | counter

For the following value ranges:

timestamp = unix_timestamp_in_milliseconds >> 8
0 <= timestamp    <= 282429536480
1 <= node_id_size <= 23
0 <= node_id      <  1 << node_id_size
0 <= counter      <  1 << (24 - node_id_size)

Binary representation

A SCRU64 ID is usually represented as a 64-bit unsigned integer but may be expressed as a signed integer, a byte array, or any other form as long as it encodes integers from zero to the maximum value (36^12 - 1).

Textual representation

A SCRU64 ID is encoded in a string using the Base36 encoding. The Base36 denotes a SCRU64 ID as an unsigned integer in the radix of 36 using the digits of 0-9a-z (0123456789abcdefghijklmnopqrstuvwxyz), with leading zeros added to form a 12-digit canonical representation. The following pseudo equation illustrates the encoding algorithm:

109959589539758421
    =  0  * 36^11 + 30  * 36^10 +  2  * 36^9 + ... +  4  * 36^2 + 11  * 36^1 +  9
    = '0' * 36^11 + 'u' * 36^10 + '2' * 36^9 + ... + '4' * 36^2 + 'b' * 36^1 + '9'
    = "0u2pf62ji4b9"

The maximum value (36^12 - 1) of SCRU64 ID is defined as the maximum value denotable in a 12-digit Base36 string (i.e., zzzzzzzzzzzz).

For the sake of uniformity, an encoder should use lowercase letters in encoding IDs. A decoder, on the other hand, must always ignore cases when interpreting or lexicographically sorting encoded IDs.

The Base36 encoding shown above is available by default in several languages (e.g., strconv.FormatUint() and strconv.ParseUint() in Go). Another simple way to implement it is by using 64-bit integer division and modulo operations. The following C code illustrates the straightforward algorithm:

uint64_t num = UINT64_C(109959589539758421);

static const char digits[] = "0123456789abcdefghijklmnopqrstuvwxyz";
char text[13];
for (int i = 11; i >= 0; i--) {
  text[i] = digits[num % 36];
  num /= 36;
}
text[12] = '\0';
puts(text); // 0u2pf62ji4b9

Special-purpose IDs

The IDs with timestamp set at zero or 36^12 / 2^24 - 1 are reserved for special purposes (e.g., use as dummy, error, or example values) and must not be used or assigned as an identifier of anything.

Considerations

Quality of random numbers

The random numbers used need not be of cryptographic quality because small random numbers are insecure anyway. This specification introduces randomness not as a source of uniqueness or unguessability but primarily as a thin protection against unintended duplication of node_ids by accidents and mistakes.

Identifying node by node_id

Implementations should not extract the node_id from a SCRU64 ID to identify the generator because:

  • node_id is not necessarily a persistent ID of a single node and may change over time.
  • node_id_size may also change over time to adjust the trade-off between the number of distributed nodes and the number of IDs per node.

node_id is embedded in an ID solely for the purpose of uniqueness guarantee across distributed nodes, and thus the use of node_id for any other reason may harm the extensibility of the variable node_id_size design.

Counter overflow handling

Counter overflow occurs when the counter field does not provide sufficient space for the IDs generated within a timestamp tick. The counter of SCRU64 IDs tends to overflow when the workload is high because the scheme is only able to spare a limited number of bits for counter. Therefore, generators must implement reasonable logic to handle such overflows. The recommended approach is to increment timestamp and continue in the following way:

  1. Increment timestamp by one.
  2. Reset counter to a random number.

This approach is recommended over other options such as the "sleep till next tick" approach because this technique allows the generation of monotonically ordered IDs in a non-blocking manner. Raising an error on a counter overflow is generally not recommended because a counter overflow is not a fault of users of SCRU64.

This approach results in a greater timestamp value than the real-time clock. Such a gap between timestamp and the wall clock should be handled as a small clock rollback discussed below.

Clock rollback handling

A SCRU64 generator relies on a real-time clock to ensure the monotonic order of generated IDs; therefore, it cannot guarantee monotonicity when the clock moves back. When a generator detects a clock rollback by comparing the up-to-date timestamp from the system clock and the one embedded in the last generated ID, the recommended treatment is:

  1. If the rollback is small enough (e.g., a few seconds), treat the timestamp of the last generated ID as the up-to-date one, betting that the wall clock will catch up soon.
  2. Otherwise, stall the generator and wait for the next timestamp tick, reset the generator with another unique node_id, or abort and raise an error.

This approach ensures a prompt response when a clock rollback is small, while the generator behavior will be implementation-dependent if the clock rollback is significant, or if the demand for IDs is so high that the counter overflow handling discussed above results in a timestamp significantly advanced from the current timestamp.

Resetting timestamp without refreshing node_id is highly discouraged because it results in a very high risk of duplicate IDs.

Informative usage notes

Node spec API

The reference implementations include a default global generator that reads the node_id and node_id_size configuration from an environment variable. Pass a unique node configuration encoded in a node specifier string through the SCRU64_NODE_SPEC environment variable to invoke a process, and the invoked application will have access to the global SCRU64 generator configured with the given node information. For example:

SCRU64_NODE_SPEC=42/8 npx scru64 -n 4

A node spec string starts with a decimal node_id, a hexadecimal node_id prefixed by "0x", or a 12-digit node_prev SCRU64 ID value, followed by a slash and a decimal node_id_size value ranging from 1 to 23 (e.g., "42/8", "0xb00/12", "0u2r85hm2pt3/16"). The first and second forms create a fresh new generator with the given node_id, while the third form constructs one that generates subsequent SCRU64 IDs to the node_prev.

The global generator raises an error if a proper node spec is not passed, because a SCRU64 generator is not able to generate a unique ID without knowledge of a unique node configuration.

Variable node_id_size use cases

SCRU64 employs the variable node_id_size design to pursue the right balance between the number of distributed nodes and the number of IDs generated by each node within a 256-millisecond timestamp tick. If an application consists of many independent nodes that generate a few IDs per timestamp, a large node_id_size will fit. If an application deploys a few nodes that generate many IDs, then a small node_id_size should be used.

With the variable-size design, an application can operate short node_ids and long node_ids together if the leading bits of longer node_ids are carefully assigned not to collide with short node_ids. This approach is useful to mix hot nodes that generate many and cold nodes that do not. For example:

# Hot nodes take 0b0000, 0b0001, ..., 0b0111
SCRU64_NODE_SPEC=0x0/4 launch_hot_node
SCRU64_NODE_SPEC=0x1/4 launch_hot_node

# Cold nodes use 0b1000_0000, 0b1000_0001, ..., 0b1111_1111
SCRU64_NODE_SPEC=0x80/8 launch_cold_node
SCRU64_NODE_SPEC=0x81/8 launch_cold_node
SCRU64_NODE_SPEC=0x82/8 launch_cold_node
SCRU64_NODE_SPEC=0x83/8 launch_cold_node

Node configurations may vary from time to time. Theoretically, anode_id must be unique in a single timestamp tick only, so two generators sharing one node_id will not collide with each other as long as they operate in totally different timestamp spaces. A typical use case of this property is to expire all the node_ids at the end of a day and lease new ones that are adaptively customized based on the stats of yesterday.

License

This work is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.

Packages

No packages published