Text files may seem like the most fantastic, ubiquitous and simple format. Almost all programmes seem to be able to cope with reading and writing them interchangeably, which is great. They aren’t quite as simple as you might think. Back in the dark ages of computing, when space was at a premium and even text files seemed large, most people in computing spoke English, so they thought that a single byte would be more than enough to store all the letters. This led to a nice, efficient, simple encoding called ASCII, which uses only 7 of the byte’s 8 bits and so defines just 128 characters.
What happened to just ASCII
It turned out that one byte doesn’t store enough characters (it only gives you 256 values, and those have to cover lowercase and uppercase letters, digits and all punctuation). To fill the gap, hundreds of different encodings sprang up, all slightly different. Many of them agree with ASCII for the first 128 values, so plain English text reads fine whichever one you guess, but accented letters and other symbols come out wrong, and some encodings aren’t compatible with ASCII at all.
This was no good: you had to try to guess the encoding of any text file before you could read or understand it. A solution was needed that could match ASCII for small file size most of the time, but could also encode the characters of all the world’s other writing systems (and ancient ones).
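To make the guessing problem concrete, here is a small Python sketch that decodes the same four bytes under several encodings. The sample bytes and the `decode_as` helper are my own illustration; the codec names are ones that ship with Python.

```python
# "caf" in ASCII followed by one byte (0xE9) that is outside the ASCII range.
raw = bytes([0x63, 0x61, 0x66, 0xE9])

def decode_as(data: bytes, encoding: str) -> str:
    """Decode data with the named codec; undecodable bytes become U+FFFD."""
    return data.decode(encoding, errors="replace")

# The first three bytes decode identically everywhere; the last one does not.
for name in ("latin-1", "cp1252", "cp437", "koi8-r", "ascii"):
    text = decode_as(raw, name)
    print(f"{name:8} last character is U+{ord(text[-1]):04X}")
```

The ASCII part survives every guess, which is why the problem often goes unnoticed until the first accented letter turns up.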
Enter UTF8
A format was conceived that used one byte for the most commonly used symbols, but used multiple bytes when more were needed. This format was called UTF8 (officially written UTF-8). The 8 stands for the 8-bit units it is built from. The top bit of the first byte indicates whether or not the character needs more than one byte. If it is set, the leading bits of that byte say how many more bytes (one, two or three) will need to be read to determine what that character is.
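Here is a rough Python sketch of how the first byte of a character announces its length. In full UTF8 a character takes between one and four bytes, and the pattern of leading bits in the first byte tells you which. The function name `utf8_length_from_lead_byte` is made up for this example, and the sketch skips the validity checks a real decoder would need.

```python
def utf8_length_from_lead_byte(lead: int) -> int:
    """Return the length in bytes of a UTF-8 sequence, judged by its first byte."""
    if lead & 0b1000_0000 == 0b0000_0000:
        return 1  # 0xxxxxxx: plain ASCII, a single byte
    if lead & 0b1110_0000 == 0b1100_0000:
        return 2  # 110xxxxx: one more byte follows
    if lead & 0b1111_0000 == 0b1110_0000:
        return 3  # 1110xxxx: two more bytes follow
    if lead & 0b1111_1000 == 0b1111_0000:
        return 4  # 11110xxx: three more bytes follow
    raise ValueError("not a valid first byte (continuation or invalid byte)")

# A, e-acute, the euro sign and an emoji, written as escapes.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    encoded = ch.encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in encoded)
    print(f"U+{ord(ch):04X}: {bits} ({utf8_length_from_lead_byte(encoded[0])} bytes)")
```

Every byte after the first starts with the bits 10, so a reader that lands in the middle of a character can tell and skip ahead to the next one.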
UTF8 is simple, high performance, and can encode practically any character just fine. It’s by far the most popular encoding, and the web is using it almost exclusively.
Almost
Yes, I said almost. Web browsers don’t assume UTF8; unless told otherwise, they try to guess the encoding. They sometimes guess wrong, so you have to tell them. Always add the following to the head of any HTML file you ever write, and you’ll be fine:
<meta charset="utf-8" />
If you forget to do that, or think you don’t need to use UTF8, sooner or later some of your characters will show up as gibberish. If you (or someone you know) are responsible for writing software which either reads or writes text files, make sure it does so in UTF8 by default, and only uses other encodings when explicitly told to.
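In Python, for example, this mostly comes down to always passing an explicit encoding rather than relying on whatever the platform defaults to. The following is a minimal sketch; the helper names and the sample file name are hypothetical.

```python
from pathlib import Path
import tempfile

def write_text_utf8(path: Path, text: str) -> None:
    """Write text to path, always encoded as UTF-8."""
    path.write_text(text, encoding="utf-8")

def read_text_utf8(path: Path) -> str:
    """Read path as UTF-8; raises UnicodeDecodeError if the bytes are not valid."""
    return path.read_text(encoding="utf-8")

# Round-trip a string containing non-ASCII letters through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    write_text_utf8(sample, "na\u00efve caf\u00e9")
    print(sample.read_bytes())  # the raw bytes on disk
    print(read_text_utf8(sample) == "na\u00efve caf\u00e9")
```

Failing loudly on invalid bytes is a deliberate choice here: a decoding error is easier to track down than silently corrupted text.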
The BOM
There is one little problem with using UTF8 (usually on Windows). It’s called the BOM (Byte Order Mark). In order to deal with the difficulty of guessing the file type, some Windows users of UTF8 decided to add a special invisible character, stored as three bytes (EF BB BF), to the start of each document in order to indicate that the file was UTF8. Most text editors will understand these bytes and not display them. Many other computer programmes don’t expect them, and will either fail or show them as junk. If you get a few weird-looking characters (typically ï»¿) at the start of your document, that’s why. If you’re building a text editor, please get rid of those characters.
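If you have to read files that may or may not start with a BOM, here is one way to cope in Python. The `strip_utf8_bom` helper is my own illustration; `codecs.BOM_UTF8` and the `utf-8-sig` codec come with the standard library.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")

print(with_bom[:3].hex())                        # the three BOM bytes
print(strip_utf8_bom(with_bom).decode("utf-8"))  # BOM removed by hand
print(with_bom.decode("utf-8-sig"))              # codec that skips a BOM if present
print(with_bom.decode("cp1252"))                 # the junk you see when it is misread
```

Decoding with `utf-8-sig` is usually the least effort, since it accepts files both with and without the BOM.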