Telegram supports styled text using message entities.
A client that wants to send styled messages would simply have to integrate a Markdown/HTML parser, and generate an array of message entities by iterating through the parsed tags.
Nested entities are supported.
When generating message entities, lengths and offsets must be computed as the number of UTF-16 code units, even though the message itself must be encoded using UTF-8.
Example implementations: tdlib, MadelineProto.
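As a rough illustration of the approach described above, the following Python sketch walks a small subset of HTML with the standard library `html.parser` module and emits entities with UTF-16 offsets and lengths. The names `TAG_TO_ENTITY`, `EntityParser`, `parse_html` and the dictionary shape of each entity are assumptions made for this example, not part of any Telegram library; a real client would build the corresponding API constructors and would also trim trailing whitespace from entities.

```python
from html.parser import HTMLParser

# Hypothetical mapping from a few HTML tags to entity type names.
TAG_TO_ENTITY = {
    "b": "messageEntityBold", "strong": "messageEntityBold",
    "i": "messageEntityItalic", "em": "messageEntityItalic",
    "u": "messageEntityUnderline",
    "code": "messageEntityCode",
}


def utf16_len(text: str) -> int:
    """Number of UTF-16 code units needed to encode text."""
    return len(text.encode("utf-16-le")) // 2


class EntityParser(HTMLParser):
    """Collects the plain text plus a list of entity dicts (illustrative shape)."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.text = ""
        self.entities = []
        self._open = []  # stack of (tag, utf16_offset) for currently open tags

    def handle_starttag(self, tag, attrs):
        if tag in TAG_TO_ENTITY:
            self._open.append((tag, utf16_len(self.text)))

    def handle_data(self, data):
        self.text += data

    def handle_endtag(self, tag):
        # Close the most recently opened matching tag, which allows nesting.
        for i in range(len(self._open) - 1, -1, -1):
            open_tag, offset = self._open[i]
            if open_tag == tag:
                del self._open[i]
                length = utf16_len(self.text) - offset
                if length > 0:
                    self.entities.append(
                        {"type": TAG_TO_ENTITY[tag], "offset": offset, "length": length}
                    )
                break


def parse_html(markup: str):
    """Return (plain_text, entities) for a small HTML subset."""
    parser = EntityParser()
    parser.feed(markup)
    parser.close()
    return parser.text, parser.entities
```

Nested tags produce overlapping entities: the inner entity is emitted first because its closing tag is seen first, and a character outside the BMP contributes 2 to the offsets and lengths.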
A Unicode code point is a number ranging from 0x0 to 0x10FFFF, usually represented using U+0000 to U+10FFFF syntax.
Unicode defines a codespace of 1,112,064 assignable code points within the U+0000 to U+10FFFF range.
Each of the assignable codepoints, once assigned by the Unicode consortium, maps to a specific character, emoji or control symbol.
The Unicode codespace is further subdivided into 17 planes:
U+0000 to U+FFFF: Basic Multilingual Plane (BMP)
U+10000 to U+10FFFF: Multiple supplementary planes as specified by the Unicode standard
Since storing a 21-bit number for each letter would result in a waste of space, the Unicode consortium defines multiple encodings that allow storing a code point as a sequence of smaller code units:
UTF-8 » is a Unicode encoding that allows storing a 21-bit Unicode code point into one to four 8-bit code units.
UTF-8 is used by the MTProto and Bot API when transmitting and receiving fields of type string.
UTF-16 » is a Unicode encoding that allows storing a 21-bit Unicode code point into one or two 16-bit code units.
UTF-16 is used when computing the length and offsets of entities in the MTProto and bot APIs, by counting the number of UTF-16 code units (not code points):
Code points in the BMP (U+0000 to U+FFFF) count as 1, because they are encoded into a single UTF-16 code unit
Code points in all other planes count as 2, because they are encoded into two UTF-16 code units (a surrogate pair)
A simple, but not very efficient way of computing the entity length is converting the text to UTF-16, and then taking the byte length divided by 2 (=number of UTF-16 code units).
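However, since UTF-8 encodes code points in non-BMP planes as a four-byte sequence whose first byte starts with 0b11110, a more efficient way to compute the entity length without converting the message to UTF-16 is the following:
If the byte marks the beginning of a four-byte UTF-8 sequence (all bytes starting with 0b11110) increment the count by 2, otherwise
If the byte marks the beginning of any other UTF-8 sequence (all bytes not starting with 0b10) increment the count by 1.
Example: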
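length := 0
for byte in text {
    // Skip continuation bytes (0b10xxxxxx)
    if (byte & 0xc0) != 0x80 {
        // Lead bytes >= 0xf0 start a four-byte sequence (two UTF-16 code units)
        length += (byte >= 0xf0 ? 2 : 1)
    }
}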
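The two approaches described above can be compared in runnable form. The Python sketch below is illustrative only; the function names `utf16_len_via_encode` and `utf16_len_from_utf8` are invented for this example. It assumes the input to the second function is valid UTF-8.

```python
def utf16_len_via_encode(text: str) -> int:
    """Simple approach: convert to UTF-16 and divide the byte length by 2.

    "utf-16-le" is used so that no byte order mark is prepended.
    """
    return len(text.encode("utf-16-le")) // 2


def utf16_len_from_utf8(data: bytes) -> int:
    """Byte-scanning approach: count UTF-16 code units directly from UTF-8."""
    length = 0
    for byte in data:
        # Continuation bytes (0b10xxxxxx) never start a code point: skip them.
        if (byte & 0xC0) != 0x80:
            # A lead byte >= 0xF0 starts a four-byte sequence, i.e. a code
            # point outside the BMP, which needs two UTF-16 code units.
            length += 2 if byte >= 0xF0 else 1
    return length


# One-, two-, three- and four-byte UTF-8 sequences in a single sample.
sample = "a\u00e9\u20ac\U0001F600"
print(utf16_len_via_encode(sample), utf16_len_from_utf8(sample.encode("utf-8")))
```

Both functions should agree on any valid input; the second one avoids allocating a UTF-16 copy of the message.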
Note: the length of an entity must not include trailing newlines or whitespace, so rtrim entities before computing their length; however, the offset of the next entity must include the length of any newlines or whitespace that precede it.
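The trimming rule in the note can be sketched as follows. The helper `entity_span` and its return shape are hypothetical, chosen only to show how the entity length and the position of the following text differ when an entity ends in whitespace.

```python
from typing import Tuple


def utf16_len(text: str) -> int:
    """Number of UTF-16 code units needed to encode text."""
    return len(text.encode("utf-16-le")) // 2


def entity_span(prefix: str, entity_text: str) -> Tuple[int, int, int]:
    """Return (offset, length, next_offset) for entity_text placed after prefix.

    - offset: UTF-16 position where the entity starts
    - length: UTF-16 length of the entity with trailing whitespace removed
    - next_offset: UTF-16 position of the text that follows, which still
      accounts for the whitespace that was trimmed from the entity
    """
    offset = utf16_len(prefix)
    length = utf16_len(entity_text.rstrip())
    next_offset = offset + utf16_len(entity_text)
    return offset, length, next_offset


# "bold \n" is styled text that ends with a space and a newline.
print(entity_span("Hi ", "bold \n"))
```

The trailing space and newline stay in the message text; only the entity that covers them is shortened.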
Example implementations: tdlib, MadelineProto.
For example, the following HTML/Markdown aliases for message entities can be used:
messageEntityBold » => <b>bold</b>, <strong>bold</strong>, **bold**
messageEntityItalic » => <i>italic</i>, <em>italic</em>, *italic*
messageEntityCode » => <code>code</code>, `code`
messageEntityStrike » => <s>strike</s>, <strike>strike</strike>, <del>strike</del>, ~~strike~~
messageEntityUnderline » => <u>underline</u>
messageEntityPre » => <pre language="c++">code</pre>, ```c++ code ```
The following entities can also be used to mention users:
Also, messageEntityCustomEmoji entities are used for custom emojis ».
A number of other entities are also available, see the type page for the full list ».