Author:
This article was written to try and simplify a topic that I found confusing when I first encountered discussions about character encoding. I tried to be as accurate as I could (and hope I have not made any glaring mistakes) and to stick to the subject of "character encoding and the Web designer".
Web designers need to understand character sets and encodings because this information must be included in documents that will be displayed on the Web. The characters that make up the content and markup of a Web page must be converted by the recipient software, such as a browser or another application, from the stored digital format back into the actual characters according to the character set and its encoding.
Web servers send Web pages, whether they are HTML or XHTML or XML, to the recipient software as a stream of bytes; the recipient software interprets and converts the sequence of bytes into a sequence of characters which we read as text on our monitor screens. The conversion method can range from a simple one-to-one association to complex switching schemes or algorithms.
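As a small illustration of that byte-to-character conversion, here is a sketch in Python (the language and the variable names are chosen here only for illustration). The same two bytes produce different text depending on which charset the recipient assumes.

```python
# The two bytes a server might send for the single character "å" saved as UTF-8.
raw = b"\xc3\xa5"

# A recipient that assumes the correct charset recovers the character.
as_utf8 = raw.decode("utf-8")

# A recipient that wrongly assumes ISO 8859-1 turns each byte into its own character.
as_latin1 = raw.decode("iso-8859-1")

print(repr(as_utf8))    # 'å'
print(repr(as_latin1))  # 'Ã¥'
```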
I have found that the best information on character sets and encoding has been written by Ian S. Graham. The information in this article was compiled from his book XHTML 1.0 Language and Design, along with section 5 (HTML Document Representation) of the W3C's HTML 4 specification, Unicode.org's FAQ, a Microsoft page on how character sets affect technology, and Alan Wood's Unicode Resources.
The character set says which characters are being used and what position each character holds in the set (the definition). An example is the capital letter Q, which holds code position 81 (decimal) in the Latin-1 character set (the ISO 8859-1 specification). The character is then digitally encoded (the character encoding); our capital letter Q is encoded as the 8-bit binary string 01010001. ISO 8859-1 encodes every character in a single byte (8 bits), and single-byte encoding means there are only 256 possible characters. Other character sets also encode characters in a single byte, such as ISO 8859-5 for Cyrillic characters.
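The two steps, position in the set and then encoding into bits, can be checked with a few lines of Python. This is only a sketch; the variable names are made up for the example.

```python
# Step 1: the character set assigns "Q" a code position.
position = ord("Q")

# Step 2: the encoding turns that position into stored bytes.
encoded = "Q".encode("iso-8859-1")

# Show the single stored byte as an 8-bit binary string.
bits = format(encoded[0], "08b")

print(position)      # 81
print(len(encoded))  # 1
print(bits)          # 01010001
```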
Languages such as Chinese, Korean, or Japanese need more complex encodings, which might be multibyte and of which there may be several for a particular character set. Traditional character sets such as ISO 8859-1 (used for most Western languages) can each define and encode only a small number of characters, yet it takes tens of thousands of characters to support all the languages of our world. A problem arises because different character sets cannot be mixed in one document, so the question must be asked: "How can we actually have a universal Web when there is no universal format for Web documents?"
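A minimal Python sketch of the mixing problem, using one Cyrillic letter as a hypothetical example: the letter has no position in ISO 8859-1, and its ISO 8859-5 byte means something else entirely when read as ISO 8859-1.

```python
text = "\u0416"  # CYRILLIC CAPITAL LETTER ZHE

# ISO 8859-1 has no position for this character, so encoding fails.
try:
    text.encode("iso-8859-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

# ISO 8859-5 can store it in one byte...
cyrillic_bytes = text.encode("iso-8859-5")

# ...but a reader assuming ISO 8859-1 sees a different character for that byte.
misread = cyrillic_bytes.decode("iso-8859-1")

print(fits_latin1)     # False
print(cyrillic_bytes)  # b'\xb6'
print(misread)         # ¶
```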
The International Organization for Standardization (ISO) and the Unicode Consortium merged the formats that they were independently working on (ISO 10646 and Unicode) to create a single universal character set called the Universal Character Set (UCS), which some people refer to as Unicode.
UCS/Unicode can represent over 1 million characters, punctuation marks, and symbols. There is even room in this character set for more as-of-yet undeveloped characters. It is language-independent so no single character is assumed to identify a language in itself.
The UCS character set (ISO 10646/Unicode) supports several different encodings: UTF-32, UTF-16, UTF-8 and UTF-7. UTF stands for Universal Character Set Transformation Format. UTF-32 stores every character in 4 bytes. The main one is UTF-16, which uses 2-byte (16-bit) units: most characters in common use take one unit, and the rest take two. UTF-8 uses 1-byte (8-bit) units: ASCII characters take a single byte and all other characters take two to four bytes. UTF-7 restricts itself to 7-bit values so that text can pass through channels that are not 8-bit clean. UTF-7 is not mentioned as appropriate for Web documents in any of the readings I have come across. In this article I am going to refer to the Universal Character Set as UCS, so I hope I don't offend the Unicode folks.
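The storage cost of each encoding can be measured rather than assumed. The following Python sketch counts the bytes needed for four sample characters (the samples and labels are chosen here for illustration). The "-be" codec names are used so that no byte order mark is included in the count. Note that UTF-8 needs one byte only for ASCII characters and more for everything else.

```python
samples = {
    "Q": "\u0051",          # LATIN CAPITAL LETTER Q
    "a-ring": "\u00e5",     # LATIN SMALL LETTER A WITH RING ABOVE
    "CJK": "\u4e2d",        # a CJK ideograph
    "emoji": "\U0001F600",  # a character outside the 16-bit range
}
encodings = ("utf-8", "utf-16-be", "utf-32-be")

# Map each sample to the number of bytes it occupies in each encoding.
byte_counts = {
    label: {name: len(char.encode(name)) for name in encodings}
    for label, char in samples.items()
}

for label, counts in byte_counts.items():
    print(label, counts)
```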
The W3C adopted UCS as the document character set for HTML 4, XML 1.0 (thus XHTML) and CSS2. XML is defined entirely in terms of UCS characters and requires every XML processor to support the UTF-8 and UTF-16 encodings. Web designers need to create documents that use only characters from UCS, or from character sets whose characters are defined the same way as in UCS (so the easiest thing to do is just use UCS or ASCII). If character references are used, they must refer to a position in the UCS character set. (Character references are discussed later in this article.)
Historically, most Web pages have been created using the ISO 8859-1 (Latin-1) character set, and fortunately ISO 8859-1 defines its characters at the same positions as UCS does. This is not true for other character sets.
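That claim about positions can be verified with a short Python sketch. It compares every ISO 8859-1 byte value with the UCS character at the same position, then shows one ISO 8859-5 byte (an arbitrary example) that lands somewhere else in UCS.

```python
# Every byte value 0..255 in ISO 8859-1 decodes to the UCS character
# that sits at the same numeric position.
latin1_matches = all(
    bytes([i]).decode("iso-8859-1") == chr(i) for i in range(256)
)

# The same is not true of ISO 8859-5: byte 0xB0 is a Cyrillic letter
# whose UCS position is far from 0xB0.
cyrillic_a = bytes([0xB0]).decode("iso-8859-5")

print(latin1_matches)        # True
print(hex(ord(cyrillic_a)))  # 0x410
```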
Ian Graham states that the most portable documents will be encoded using UTF-8 or UTF-16. He states that ISO-8859 or US-ASCII are useful options that are compatible with most current Web software. Any non-Latin characters in a document using the last two encodings can be represented using character references.
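As a sketch of that last point, Python's built-in "xmlcharrefreplace" error handler replaces every character that US-ASCII cannot hold with a numeric character reference. The sample word is arbitrary.

```python
import html

text = "Bl\u00e5b\u00e6r"  # "Blåbær", with two non-ASCII letters

# Encode as US-ASCII, replacing anything outside ASCII with &#N; references.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")

# A recipient that resolves the references gets the original text back.
round_trip = html.unescape(ascii_bytes.decode("ascii"))

print(ascii_bytes)  # b'Bl&#229;b&#230;r'
print(round_trip == text)  # True
```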
Character references are an encoding-independent mechanism for entering any character from the UCS/Unicode character set. Character references in HTML/XHTML can appear in two forms: numeric character references and character entity references.
Numeric character references specify the code position of a character in the document character set. Numeric character references can take two forms:
Decimal Numeric Character Reference
The syntax "&#D;", where D is a decimal number, refers to the ISO 10646 decimal character number D.
Hexadecimal Numeric Character Reference
The syntax "&#xH;" or "&#XH;", where H is a hexadecimal number, refers to the ISO 10646 hexadecimal character number H. Hexadecimal numbers in numeric character references are case-insensitive.
Example: &#229; (in decimal) represents the letter "a" with a small circle above it, å (used, for example, in Norwegian).
&#xE5; (in hexadecimal) represents the same character.
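A quick way to confirm that the decimal and hexadecimal forms name the same character is to resolve them with Python's standard html module. This is a sketch; the variable names are made up.

```python
import html

decimal_form = html.unescape("&#229;")  # decimal reference
hex_lower = html.unescape("&#xe5;")     # hexadecimal, lower case
hex_upper = html.unescape("&#XE5;")     # hexadecimal, upper case

# All three resolve to UCS position 229 (0xE5), the letter "å".
print(decimal_form, hex_lower, hex_upper)
print(ord(decimal_form))  # 229
```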
Character entity references use symbolic names which are more easily remembered by Web designers than the numeric character references.
For example, the character entity reference &copy; refers to the copyright symbol © and is easier to remember than &#169;.
When Web designers want to display HTML code so that it is visible to the reader, they have traditionally used character entity references in place of the markup characters so that the browser will not interpret them as markup. The following entity references are commonly used to escape special characters used as HTML markup (of course the numeric character reference could be used instead of the character entity):
"&lt;" represents the < sign.
"&gt;" represents the > sign.
"&amp;" represents the & sign.
"&quot;" represents the " mark.
I have used the escape characters in this article when I wanted to display code such as the numeric or entity examples above and charset examples below.
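The same escaping can be done programmatically. The sketch below uses Python's standard html module on a made-up snippet of markup, and also checks that a named entity and its numeric equivalent resolve to the same character.

```python
import html

snippet = '<a href="index.html">Home & News</a>'

# Replace the markup characters with entity references so a browser
# would display the code instead of interpreting it.
escaped = html.escape(snippet)
print(escaped)
# &lt;a href=&quot;index.html&quot;&gt;Home &amp; News&lt;/a&gt;

# Resolving the references restores the original markup.
restored = html.unescape(escaped)

# A named entity and its numeric form give the same character.
same_symbol = html.unescape("&copy;") == html.unescape("&#169;")
print(same_symbol)  # True
```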
The character set and encoding can be identified to the recipient software as a pair using a naming scheme called charsets. So the charset ISO-8859-1 is ISO Latin 1 using a 1-byte encoding, the charset UTF-8 is UCS using a variable-length encoding built from 1-byte units, and the charset UTF-16 is UCS using an encoding built from 2-byte units.
How does a server decide which charset is correct for a document it will serve? Some servers examine the first few bytes of the document, or check against a database of known files and encodings. The server then sends the charset information as a parameter of the HTTP Content-Type header, which precedes the actual document data.
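One way a server could examine the first few bytes is to look for a Unicode byte order mark. The Python sketch below is an assumption about how such a check might work, not a description of any particular server; the function names and the ISO-8859-1 default are hypothetical.

```python
import codecs
from typing import Optional

# Checked in this order because the UTF-32 little-endian mark
# begins with the same two bytes as the UTF-16 little-endian mark.
_MARKS = [
    (codecs.BOM_UTF32_LE, "UTF-32"),
    (codecs.BOM_UTF32_BE, "UTF-32"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16"),
    (codecs.BOM_UTF16_BE, "UTF-16"),
]

def sniff_charset(first_bytes: bytes) -> Optional[str]:
    """Return a charset name if the data starts with a known byte order mark."""
    for mark, name in _MARKS:
        if first_bytes.startswith(mark):
            return name
    return None

def content_type_header(first_bytes: bytes, default: str = "ISO-8859-1") -> str:
    """Build the header line, falling back to a (hypothetical) default charset."""
    charset = sniff_charset(first_bytes) or default
    return f"Content-Type: text/html; charset={charset}"

print(content_type_header(codecs.BOM_UTF8 + b"<html>"))
print(content_type_header(b"<html>"))
```

A document without a byte order mark gives this check nothing to go on, which is why the sketch needs a default.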
Documents written with a server-side language such as PHP can tell the recipient software what the charset is by sending an HTTP header that indicates the character encoding as a single string argument. For example, in PHP the function is header() and the syntax is as follows:
<?php
header("Content-Type: text/html; charset=UTF-8");
?>
How does the recipient software know which charset has been used? There are a couple of ways. As mentioned above in Charsets at the Server, the server can provide the charset information as a parameter of the HTTP Content-Type header, as long as the delivery method supports it and the recipient software can read that header. Some methods of file transfer, such as retrieving a file using FTP, do not send the header information, and older browsers may not be able to handle the Content-Type header information.
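On the receiving side, pulling the charset parameter out of a Content-Type value might look like the following Python sketch. It leans on the standard library's MIME header parser; the helper name is hypothetical.

```python
from email.message import Message
from typing import Optional

def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset parameter (lower-cased), or None if there is none."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()

print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```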
As a fail-safe, a designer can include the charset information as a string in the document. This can take three forms:
XML Declaration
XML (thus XHTML) documents served as XML must use the XML declaration to indicate the charset as such:
<?xml version="1.0" encoding="UTF-8" ?>
Meta Tag
HTML/XHTML documents served as HTML can use a Meta tag to indicate the document charset as such:
<meta http-equiv="Content-type" content="text/html; charset=UTF-8" />
Obviously, if both of the methods listed above are used they should not contradict each other (and make sure any server-side HTTP header statements agree with them). In any case, the indicated charset should be the character set and encoding used to create the document. If you run your pages through the W3C Validator it will tell you if the character encoding you declare in your Meta tag or XML declaration is correct for the characters that make up your document.
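A consistency check of this kind can be sketched in Python. The code below reads the charset from an XML declaration and from a Meta tag and reports whether they agree. The regular expression, class name, and helper names are all made up for this sketch, and it is far simpler than what a real validator does.

```python
import re
from email.message import Message
from html.parser import HTMLParser
from typing import Optional

XML_DECL = re.compile(
    r'^<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def xml_declared_charset(text: str) -> Optional[str]:
    """Charset named in a leading XML declaration, lower-cased, or None."""
    match = XML_DECL.match(text)
    return match.group(1).lower() if match else None

class MetaCharsetFinder(HTMLParser):
    """Remembers the charset from the first http-equiv Content-Type Meta tag."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        content = attributes.get("content")
        is_content_type = (attributes.get("http-equiv") or "").lower() == "content-type"
        if tag == "meta" and is_content_type and content and self.charset is None:
            msg = Message()
            msg["Content-Type"] = content
            self.charset = msg.get_content_charset()

def meta_declared_charset(text: str) -> Optional[str]:
    finder = MetaCharsetFinder()
    finder.feed(text)
    return finder.charset

page = (
    '<?xml version="1.0" encoding="UTF-8" ?>\n'
    "<html><head>\n"
    '<meta http-equiv="Content-type" content="text/html; charset=UTF-8" />\n'
    "</head><body></body></html>"
)

declared = {xml_declared_charset(page), meta_declared_charset(page)} - {None}
consistent = len(declared) <= 1
print(declared, consistent)
```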
Charset attribute on the HTML anchor
And one final method of indicating charset is to tell recipient software what the character encoding of a destination (target) page of a link is, using the charset attribute on the link, as such:
<p>For more information about UniNetNews, please consult the <a href="http://www.uninetnews.com/" charset="UTF-8">UniNetNews Web site</a></p>
This is the information that most discussions on UCS/Unicode leave out. How does a Web designer actually save their document as UTF-8 or UTF-16 or for that matter ISO-8859 or US-ASCII?
Upon researching this question, I found that a few factors play into the answer. I am going to try to lay this out simply, which means I am not going to go into detail; I am just going to state the important facts as I have read them.
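Recipient software by rule will determine a document's character encoding in the following order (from highest priority to lowest):
An HTTP "charset" parameter in a "Content-Type" field.
A Meta declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
The charset attribute set on an element that designates an external resource.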
Currently, not all software can create or read UCS, but if you use only ASCII characters in your document and use character entities for any others you are safe to declare your documents as UTF-8.
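The reason this is safe can be shown in a short Python sketch: a document made only of ASCII characters plus character references produces identical bytes whether it is saved as US-ASCII, ISO-8859-1, or UTF-8. The sample markup is made up.

```python
markup = "<p>Bl&#229;b&#230;r &copy;</p>"

as_ascii = markup.encode("ascii")
as_latin1 = markup.encode("iso-8859-1")
as_utf8 = markup.encode("utf-8")

# The stored bytes are the same under all three charsets.
identical = as_ascii == as_latin1 == as_utf8
print(identical)  # True
```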
If the HTTP server does not send charset information, or the recipient browser/application cannot handle the Content-Type header (usually old browsers or applications), then all that is left is for the recipient browser/application to try to infer the charset from the content of the document. That is why you should always use the Meta tag or XML declaration methods to declare your character encoding: the inference could be wrong.
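The fallback described in this paragraph can be sketched as a small Python function: use the HTTP header if there is one, then the declaration inside the document, and only then a guess. The function and parameter names are hypothetical, and real browsers apply more rules than this.

```python
from typing import Optional

def choose_charset(
    http_charset: Optional[str],
    declared_charset: Optional[str],
    guessed_charset: str,
) -> str:
    """Pick the first available charset, from most to least trustworthy."""
    for candidate in (http_charset, declared_charset):
        if candidate:
            return candidate.lower()
    # Nothing was stated, so the recipient is left with its own inference.
    return guessed_charset.lower()

print(choose_charset("UTF-8", "ISO-8859-1", "US-ASCII"))  # utf-8
print(choose_charset(None, "ISO-8859-1", "US-ASCII"))     # iso-8859-1
print(choose_charset(None, None, "US-ASCII"))             # us-ascii
```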
If you want to be sure that your Web documents can be read, exchanged and searched by users around the world, you should save your Web documents as UTF-8 (if you can) and explicitly declare this character encoding in your document. As the Web moves toward being a universal, global information source, interoperability will require you to use the Universal Character Set (UCS) and provide the character encoding of your documents to the software that will be reading them.
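Most text editors offer the encoding as an option when saving. For pages generated by a script, the following Python sketch shows one way to write a file as UTF-8 with a matching declaration inside it. The file name and page content are made up, and a temporary directory is used so the example leaves your own files alone.

```python
import os
import tempfile

page = (
    '<meta http-equiv="Content-type" content="text/html; charset=UTF-8" />\n'
    "<p>Bl\u00e5b\u00e6r</p>\n"
)

path = os.path.join(tempfile.mkdtemp(), "index.html")

# Name the encoding explicitly instead of relying on the platform default.
with open(path, "w", encoding="utf-8", newline="\n") as handle:
    handle.write(page)

# Read the raw bytes back to confirm what was stored on disk.
with open(path, "rb") as handle:
    stored = handle.read()

print(stored == page.encode("utf-8"))  # True
print(b"\xc3\xa5" in stored)           # True: "å" was stored as two UTF-8 bytes
```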