Specifying Character Encoding
This month kicks off our new “WaSP Asks the W3C” Question and Answer project.
In this project, frequently asked questions posed to WaSP by Web authors and designers
regarding standards are submitted by WaSP members to the
W3C’s Quality Assurance Group (http://www.w3.org/QA/) for
information. The answers are published and archived both here and on the W3C Web Standards
Education list, where follow-up discussion also takes place. Signup details can be found at
the end of this article.
WaSP asks
There are several ways of specifying the character encoding for a particular document. Which of
the following methods (or combination thereof) does the W3C recommend, and why?
- Have the server administrator set the proper encoding via the HTTP headers returned by the
Web server
- Have the author add the encoding with a meta element
- XHTML authors can add the character encoding using the XML declaration
The W3C responds
These three ways of providing the character encoding of a document are not equivalent. When
trying to figure out the character encoding of a resource, user agents will try, in this
order:
- The HTTP Content-Type header sent by the server
- The XML declaration (only for XHTML documents)
- The HTML/XHTML meta element
- Other ways, for example algorithms that guess the character encoding
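The lookup order above can be sketched in a few lines of Python. This is only an illustration, not how any particular user agent is implemented: the function names are hypothetical, the regular expressions are deliberately simplistic, and the `fallback` parameter merely stands in for the guessing algorithms mentioned in the last item.

```python
import re
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type header, or None."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # lower-cased, or None if absent


def charset_from_xml_declaration(body):
    """Return the encoding named in a leading XML declaration, or None."""
    match = re.match(rb'<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']', body)
    return match.group(1).decode("ascii").lower() if match else None


def charset_from_meta(body):
    """Return the charset named in a meta element, or None (naive scan)."""
    match = re.search(rb'<meta[^>]+charset=["\']?([A-Za-z0-9._-]+)',
                      body, re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else None


def detect_encoding(content_type, body, is_xhtml=False, fallback="utf-8"):
    """Try each source in the order listed above; first hit wins."""
    candidates = [charset_from_content_type(content_type)]
    if is_xhtml:
        # The XML declaration is only consulted for XHTML documents.
        candidates.append(charset_from_xml_declaration(body))
    candidates.append(charset_from_meta(body))
    for charset in candidates:
        if charset:
            return charset
    # Assumption: a real user agent would run a guessing algorithm here.
    return fallback
```

For instance, a response carrying `Content-Type: text/html; charset=ISO-8859-1` yields `iso-8859-1` from this sketch even when the body contains a meta element that says UTF-8, because the header is consulted first.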
Since the HTTP Content-Type header has precedence, and is also the easiest
information to retrieve (user agents do not have to parse the resource to get it), it is almost
always the preferred way to provide the character encoding for an (X)HTML document.
However, in at least two cases, this is simply not possible:
- The document author does not have any way to configure the server to send the
proper HTTP Content-Type header
- The document is not served via HTTP
In these cases, an HTML document should provide the character encoding via a meta
element, and an XML document can provide it via the XML declaration. If the XML document
uses one of the default encodings (UTF-8 or UTF-16), no declaration is needed to manage the
character encoding.
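The point about XML default encodings can be checked with Python's standard XML parser. This is a minimal sketch: the sample markup and the helper names `heading_text` and `parses` are made up for illustration, and only the UTF-8 default is exercised here.

```python
import xml.etree.ElementTree as ET

# UTF-8 bytes with no XML declaration: UTF-8 is a default encoding,
# so the parser needs no hint.
utf8_doc = "<h1>Portrait Intérieur</h1>".encode("utf-8")

# ISO-8859-1 bytes: not a default encoding, so it is declared.
latin1_doc = ('<?xml version="1.0" encoding="ISO-8859-1"?>'
              "<h1>Portrait Intérieur</h1>").encode("iso-8859-1")

# ISO-8859-1 bytes with the declaration left out (the mistake to avoid).
undeclared_latin1_doc = "<h1>Portrait Intérieur</h1>".encode("iso-8859-1")


def heading_text(raw_bytes):
    """Parse the bytes as XML and return the text of the root element."""
    return ET.fromstring(raw_bytes).text


def parses(raw_bytes):
    """Report whether the bytes are accepted by the XML parser at all."""
    try:
        ET.fromstring(raw_bytes)
        return True
    except ET.ParseError:
        return False
```

In this sketch the first two documents yield the same heading text, while the third is rejected: its bytes are read as UTF-8 by default, and the byte that encodes "é" in ISO-8859-1 is not valid UTF-8 in that position.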
To sum it up
- Wrong. The webmaster sets a default character encoding to be sent by the
server but does not let the author override it, or the information is not provided anywhere
whatsoever
- Good. The character encoding is not set at the server level but properly
declared through the HTML meta element (and/or the XML declaration for XHTML
documents)
- Best. The character encoding is properly set at the server level, either
with a default that authors can override or on a per-document basis, and is also available at
the document level (both in the XML declaration if applicable and the meta element) for
standalone use
Examples
Example of an XHTML 1.0 document written in French with an ISO-8859-1 encoding:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr" lang="fr">
<head>
  <title>Exemple de document XHTML 1.0</title>
</head>
<body>
  <h1>Portrait Intérieur</h1>
  <h2>Rainer-Maria Rilke</h2>
  <p>Ce ne sont pas des souvenirs<br />
  qui, en moi, t'entretiennent ;<br />
  tu n'es pas non plus mienne<br />
  par la force d'un beau désir.</p>
</body>
</html>
Example of an HTML 4.01 document written in French with a UTF-8 encoding:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd">
<html lang="fr">
<head>
  <meta http-equiv="content-type" content="text/html; charset=UTF-8">
  <title>Exemple de document HTML 4.01</title>
</head>
<body>
  <h1>Portrait Intérieur</h1>
  <h2>Rainer-Maria Rilke</h2>
  <p>Ce ne sont pas des souvenirs<br>
  qui, en moi, t'entretiennent ;<br>
  tu n'es pas non plus mienne<br>
  par la force d'un beau désir.</p>
</body>
</html>
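The declared encoding only helps if it matches the bytes actually saved to disk. The following sketch, which is not part of the W3C answer, shows what happens to the heading from the two examples when the declaration and the bytes disagree. The helper `decode_as` is a hypothetical stand-in for a user agent that trusts whatever charset it was told.

```python
text = "Portrait Intérieur"

# The same heading, saved in the two encodings used by the examples.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")


def decode_as(raw_bytes, declared_charset):
    """Decode bytes by trusting the declared charset, as a user agent would.

    Undecodable bytes become U+FFFD (the replacement character).
    """
    return raw_bytes.decode(declared_charset, errors="replace")


# Matching declaration: the text round-trips.
correct = decode_as(utf8_bytes, "utf-8")

# UTF-8 bytes declared as ISO-8859-1: "é" turns into two wrong characters.
garbled = decode_as(utf8_bytes, "iso-8859-1")

# ISO-8859-1 bytes declared as UTF-8: "é" turns into a replacement character.
replaced = decode_as(latin1_bytes, "utf-8")
```

Here `garbled` comes out as "Portrait IntÃ©rieur", the familiar symptom of a document saved as UTF-8 but announced as ISO-8859-1.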
On the popular Apache Web server, the HTTP Content-Type header for a resource
can be set up in the .htaccess file, as follows:
<Files example.html>
  ForceType text/html;charset=ISO-8859-1
</Files>
This would force the file example.html to be served as ISO-8859-1 even if the server had a
different global configuration.
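For readers without an Apache server at hand, the same idea (a server-wide default charset with a per-file override) can be sketched with Python's built-in HTTP server. Everything named here is an assumption made for illustration: the `CHARSET_OVERRIDES` table plays the role of the Files block above, `DEFAULT_CHARSET` plays the role of the global configuration, and the response body is a placeholder rather than a real file.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical per-file overrides, analogous to the <Files> block.
CHARSET_OVERRIDES = {"/example.html": "ISO-8859-1"}

# Hypothetical server-wide default, analogous to the global configuration.
DEFAULT_CHARSET = "UTF-8"


def charset_for(path):
    """Return the override for this path, or the server-wide default."""
    return CHARSET_OVERRIDES.get(path, DEFAULT_CHARSET)


def content_type_for(path):
    """Build the Content-Type header value sent for this path."""
    return "text/html;charset=" + charset_for(path)


class CharsetHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Placeholder body, encoded to match the charset we announce.
        body = "<p>Portrait Intérieur</p>".encode(charset_for(self.path))
        self.send_response(200)
        self.send_header("Content-Type", content_type_for(self.path))
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the sketch server until interrupted (port is an arbitrary choice)."""
    HTTPServer(("127.0.0.1", port), CharsetHandler).serve_forever()
```

Calling `serve()` and requesting `/example.html` should return a Content-Type header naming ISO-8859-1, while any other path gets the default.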
WaSP comments
WaSP and W3C member Tim Bray commented on this answer and said:
“If you know that the document you’re sending is going to get read by
an XML processor, the server should get the charset right. If the server makes any mistake the
rules say that the processor is supposed to do the wrong thing! On the other hand, if the
document is going to any kind of HTML reader, the server can usefully try to help and do
what is suggested here. So it turns out that it matters whether you serve it as html or
xhtml+xml.”
How to serve HTML and XHTML will be discussed in the next issue of WaSP Asks the W3C.
References
- About Charset Parameters: http://www.w3.org/International/O-HTTP-charset
- About Character Encodings: http://www.w3.org/International/O-charset.html
- HTML 4.0 specification on character encodings: http://www.w3.org/TR/html4/charset.html#h-5.2.2
- XHTML 1.0 specification on character encodings: http://www.w3.org/TR/xhtml1/#C_9
- XML 1.0 specification on character encodings: http://www.w3.org/TR/REC-xml#charencoding
Discussion
For clarification and discussion on this topic, please address your comments and questions to the W3C Web Standards Education list.
To subscribe to the list, send an email to [email protected] with
“Subject: subscribe”. You can read archived posts at
http://lists.w3.org/Archives/Public/public-evangelist/.
The Web Standards Project is a grassroots coalition fighting for standards which ensure simple, affordable access to web technologies for all.