Has it ever happened to you that you sent an international email with an attached plain-text file and the person on the other end of the line couldn’t read it because some of the characters were mangled? The reason for this is that different people use different character sets to read and write. In simple terms, not everyone in the world uses the US keyboard.
With formatted documents, whether created in commercial software like Microsoft Office or Adobe Acrobat or in free software such as OpenOffice or LibreOffice, this is a non-issue.
With plain-text files, however, such as those created in Notepad or Notepad++, the encoding of the file becomes an issue.
But let’s start at the beginning…
The year was 1960 and the American Standards Association’s X3.2 subcommittee started to adapt telegraph code to teleprinters. These were the output devices of the time, before video displays appeared.
The resulting adaptation of the telegraph code was called ASCII, an abbreviation of American Standard Code for Information Interchange.
It is a scheme in which the characters of the English alphabet are encoded using seven bits. The odd number of bits is a legacy of the teleprinters of the time, which used that number of bits.
The subcommittee discussed using an 8-bit encoding but voted against it. As is usual with committees, they chose a short-term cost saving and in doing so created decades of chaos in the computing world.
With seven bits of code it was possible to create an ASCII table consisting of only 128 characters (95 printable and 33 control characters). While this was sufficient for the English language, internationally it posed big problems later on.
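The arithmetic behind those numbers can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are made up for the example, and the split assumes the usual convention that codes 0 to 31 and code 127 are control characters.

```python
# Seven bits give 2**7 distinct codes.
total = 2 ** 7

# Assumed convention: codes 0-31 and 127 (DEL) are control characters,
# everything from 32 (space) to 126 (~) is printable.
control = [c for c in range(total) if c < 32 or c == 127]
printable = [c for c in range(total) if 32 <= c < 127]

print(total, len(control), len(printable))

# The letter "A" written out as its seven-bit pattern.
a_bits = format(ord("A"), "07b")
print(a_bits)
```

Every ASCII character fits in the pattern shown for "A", which is why anything outside those 128 codes needs an extra bit.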
Chaos and Market Solution
Non-English languages have extra characters which cannot be encoded with only 7 bits. Since both perforated tape and computers were able to use 8 bits, a multitude of non-standard character encoding schemes sprang up all over the world. All of them used ASCII plus an eighth bit, which allowed an extended ASCII table of up to 256 characters.
As computing technology became ever more popular, the International Organization for Standardization (ISO) came up with the ISO-8859 encoding system based on the extended 8-bit ASCII table. ISO-8859 was divided into parts numbered 1 to 16 according to regions of the world (part 12 was abandoned, leaving 15 tables of up to 256 characters each), so that as many languages as possible would be included.
Later, Microsoft created its own code pages — OEM (original equipment manufacturer) code pages for IBM PCs using DOS, and Windows code pages for more current PCs. Windows code pages were then renamed ANSI code pages (after the American National Standards Institute, known as American Standards Association before 1969).
However, none of these code pages were formally standardized either by ISO or ANSI.
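To see why this variety mangles text, it helps to decode one and the same byte with several of these tables. The sketch below uses codec names from Python's standard library. The byte 0xE9 and the selection of code pages are simply picked for illustration.

```python
import unicodedata

# One byte with the eighth bit set; its meaning depends on the table used.
raw = bytes([0xE9])

# A few 8-bit tables: three ISO-8859 parts, one OEM (DOS) code page
# and one Windows code page.
codecs_to_try = ("iso8859_1", "iso8859_5", "iso8859_7", "cp437", "cp1252")

decoded = {name: raw.decode(name) for name in codecs_to_try}

for name, char in decoded.items():
    # Print the character together with its Unicode name.
    print(f"{name:10} 0x{raw.hex().upper()} -> {char}  ({unicodedata.name(char)})")
```

The sender and the recipient both see the same byte, but unless they agree on the table, they do not see the same character.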
Since the 1990s a new approach called Unicode has been on the rise. The most widely adopted Unicode encoding is UTF-8, which uses 8 bits (one byte) for ASCII characters and between 16 and 32 bits (two to four bytes) for all other characters. This lets UTF-8 encode up to 1,112,064 characters. That is more than enough to include all characters in all languages and then some.
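The variable length is easy to observe in Python. The following is a small sketch with arbitrarily chosen sample characters. The last lines reproduce the 1,112,064 figure, assuming the usual definition: all code points from 0 to 0x10FFFF minus the 2,048 surrogate code points that UTF-8 may not encode.

```python
# Sample characters: an ASCII letter, an accented letter,
# the euro sign and an emoji.
samples = ["A", "é", "€", "😀"]

lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    print(f"U+{ord(ch):04X} {ch}: {len(encoded)} byte(s), hex {encoded.hex()}")

# 0x110000 code points in total, minus 0x800 surrogates.
valid_code_points = 0x110000 - 0x800
print(valid_code_points)
```

The ASCII letter still takes a single byte, so a pure ASCII file is already a valid UTF-8 file.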
As of 2014, UTF-8 encoding is used in more than 80% of cases on the internet. Also, virtually all email programs use UTF-8 for encoding messages.
Notepad and UTF-8
So, when you wish to send non-standard or non-English characters, your best bet is to encode your text using the UTF-8 format.
To do this, when you are about to save your text file in Notepad click
In other text editors, such as Notepad++, it is also possible to set the character encoding you wish to use. In Notepad++, simply click
Now anyone anywhere will receive your file and see exactly the same characters as you typed up.
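If you create your text files from a script rather than an editor, the same choice can be made in code. Here is a minimal Python sketch, assuming a throwaway file called note.txt in a temporary folder (both are just placeholders for the example). It saves a text as UTF-8, reads it back correctly, and then reads it with the wrong table to show what the mangled result looks like.

```python
import os
import tempfile

text = "café"

# Hypothetical file location, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "note.txt")

# Save the text explicitly as UTF-8.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading with the same encoding returns the original text.
with open(path, encoding="utf-8") as f:
    round_trip = f.read()

# Reading with a different 8-bit table (here ISO-8859-1) turns the
# two UTF-8 bytes of the accented letter into two separate characters.
with open(path, encoding="latin-1") as f:
    mangled = f.read()

print(round_trip == text)
print(repr(mangled))
```

Naming the encoding explicitly when writing and when reading is what keeps the text intact, whichever system the file ends up on.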