Description
The upcoming Intel AVX10.2 instruction set extension adds support for conversions from F16 to the E5M2 (BF8) and E4M3 (HF8) 8-bit floating-point types, along with conversions from E4M3 (HF8) to F16.
The E5M2 and E4M3 floating point formats are described in the Open Compute Project 8-bit Floating Point Specification, which can be found at https://www.opencompute.org/documents/ocp-8-bit-floating-point-specification-ofp8-revision-1-0-2023-12-01-pdf-1.
The E5M2 (BF8) floating-point format has 1 sign bit, 5 exponent bits, and 2 mantissa bits. Its bit representation is equivalent to the upper 8 bits of a hwy::float16_t (16-bit IEEE 754 half-precision floating-point) value, similar to how the bit representation of hwy::bfloat16_t is equivalent to the upper 16 bits of a 32-bit IEEE 754 single-precision floating-point value.
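As a minimal, non-SIMD illustration of that bit-level relationship (not a proposal for the actual Highway API; the helper names are hypothetical), the sketch below works directly on raw bit patterns. Widening E5M2 to F16 is exact, while the truncating direction shown here would still need round-to-nearest-even handling of the discarded mantissa bits in a real demote:

```cpp
#include <cstdint>
#include <cstdio>

// Truncating the low 8 bits of an IEEE binary16 bit pattern yields the E5M2
// (BF8) bit pattern of the same value whenever that value is exactly
// representable in E5M2; a real DemoteTo would also round the discarded bits.
uint8_t E5M2BitsFromF16Bits(uint16_t f16_bits) {
  return static_cast<uint8_t>(f16_bits >> 8);
}

// The widening direction is exact: every E5M2 value is representable in F16,
// so promoting is just placing the 8 bits into the upper byte.
uint16_t F16BitsFromE5M2Bits(uint8_t bf8_bits) {
  return static_cast<uint16_t>(bf8_bits << 8);
}

int main() {
  // 1.5 is 0x3E00 in binary16; its E5M2 encoding is the upper byte, 0x3E.
  std::printf("%02X\n", E5M2BitsFromF16Bits(0x3E00));  // prints 3E
  std::printf("%04X\n", F16BitsFromE5M2Bits(0x3E));    // prints 3E00
  return 0;
}
```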
The E4M3 (HF8) floating-point format has 1 sign bit, 4 exponent bits, and 3 mantissa bits. E4M3 has no infinities and only two NaN bit representations (0x7F and 0xFF). Unlike most floating-point formats, which treat encodings with the largest exponent as infinities or NaNs, E4M3 treats the non-NaN encodings with the largest exponent as normal values whose absolute value is between 256 and 448.
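To make those properties concrete, here is an illustrative scalar decoder for E4M3 bit patterns (the function name and structure are hypothetical, not taken from any existing library). It shows that only 0x7F/0xFF decode to NaN and that the remaining largest-exponent encodings are the finite values 256 through 448:

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

// Illustrative scalar decoder for the OCP E4M3 (HF8) format: 1 sign bit,
// 4 exponent bits (bias 7), 3 mantissa bits. Only 0x7F and 0xFF are NaN;
// there are no infinities, and the other exponent==0b1111 encodings are the
// finite values 256..448.
double DecodeE4M3(uint8_t bits) {
  const int sign = (bits >> 7) & 1;
  const int exponent = (bits >> 3) & 0xF;
  const int mantissa = bits & 0x7;
  double magnitude;
  if (exponent == 0xF && mantissa == 0x7) {
    return std::nan("");  // 0x7F and 0xFF are the only NaN encodings.
  } else if (exponent == 0) {
    magnitude = std::ldexp(mantissa / 8.0, -6);  // Subnormals (and zero).
  } else {
    magnitude = std::ldexp(1.0 + mantissa / 8.0, exponent - 7);
  }
  return sign ? -magnitude : magnitude;
}

int main() {
  std::printf("%g\n", DecodeE4M3(0x78));  // 256: smallest largest-exponent value
  std::printf("%g\n", DecodeE4M3(0x7E));  // 448: largest finite value
  std::printf("%g\n", DecodeE4M3(0x7F));  // nan
  return 0;
}
```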
The AVX10.2 VCVTNE2PH2BF8 instruction converts an F16 vector to an E5M2 (BF8) vector, and the AVX10.2 VCVTNEPH2HF8 instruction converts an F16 vector to an E4M3 (HF8) vector.
The AVX10.2 VCVTHF82PH instruction converts an E4M3 (HF8) vector to an F16 vector.
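For reference, below is a scalar sketch of the narrowing F16 -> E5M2 conversion with round-to-nearest-even, the rounding that the "NE" in these mnemonics refers to. This is an assumption-laden illustration of the per-lane behavior, not the documented instruction semantics, and NaN payload and overflow corner cases may differ from the hardware:

```cpp
#include <cstdint>
#include <cstdio>

// Scalar sketch (assumption: not the exact instruction semantics in every
// corner case) of an F16 -> E5M2 demote with round-to-nearest-even on the
// 8 mantissa bits being discarded.
uint8_t DemoteF16BitsToE5M2Bits(uint16_t f16_bits) {
  const uint16_t exponent = (f16_bits >> 10) & 0x1F;
  const uint16_t mantissa = f16_bits & 0x3FF;
  if (exponent == 0x1F && mantissa != 0) {
    // NaN: keep the sign and force a nonzero (quiet) E5M2 mantissa.
    return static_cast<uint8_t>((f16_bits >> 8) | 0x02);
  }
  // Round to nearest, ties to even: add 0x7F plus the LSB of the kept bits.
  const uint16_t rounding_bias = 0x7F + ((f16_bits >> 8) & 1);
  const uint32_t rounded = static_cast<uint32_t>(f16_bits) + rounding_bias;
  // Rounding can carry into the exponent and produce an E5M2 infinity;
  // saturation behavior is intentionally not modeled here.
  return static_cast<uint8_t>(rounded >> 8);
}

int main() {
  // 1.5 (0x3E00) is exactly representable and stays 0x3E.
  std::printf("%02X\n", DemoteF16BitsToE5M2Bits(0x3E00));
  // 1.625 (0x3E80) is a tie between 1.5 and 1.75; ties-to-even picks 1.5 (0x3E).
  std::printf("%02X\n", DemoteF16BitsToE5M2Bits(0x3E80));
  return 0;
}
```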
Arm has already added the FP8 AArch64 extension, which adds support for conversions from the F16/F32 floating-point types to E5M2/E4M3, along with conversions from E5M2/E4M3 to F16/BF16.