XML - Managing Data Exchange/XHTML

In previous chapters, we have learned how to generate HTML documents from XML documents and XSL stylesheets. In this chapter, we will learn how to convert those HTML documents into valid XHTML. We will discuss why XHTML has evolved as a standard and when it should be used.

The Evolution of XHTML
Originally, Web pages were written in HTML. Unfortunately, most implementations of this markup language tolerate all sorts of mistakes and bad formatting. Major browsers were designed to be forgiving, and poor code would display with few problems in most cases. This poor code was often not portable between browsers: a page would render in Netscape but not in Internet Explorer, or vice versa. Compensating for human error and bad formatting takes processing power that small handheld devices might not have, so when data is displayed on a handheld, a tiny mistake can crash the device.

XHTML partially mitigates these problems. The processing burden is reduced by requiring XHTML documents to conform to the much stricter rules defined in XML. Aside from the stricter rules, HTML 4.01 and XHTML 1.0 are functionally equivalent. If a document breaks XML's well-formedness rules, an XHTML-compliant browser must not render the page. If a document is well-formed but invalid, an XHTML-compliant browser may render the page, so a significant number of mistakes still slip through.

In this chapter, we will examine in detail how to create an XHTML document.

The biggest problem with HTML from a design standpoint is that it was never meant to be a graphical design language. The original version of HTML was intended to structure human readable content (e.g. marking a section of text as a paragraph), not to format it (e.g. this paragraph should be displayed in 14pt Arial). HTML has evolved far past its original purpose and is being stretched and manipulated to cover cases that the original HTML designers never imagined.

The recommended solution is to use a separate language to describe the presentation of a group of documents. Cascading Style Sheets (CSS) is a language used for describing presentation. From XHTML 1.1 onwards, Web pages must be formatted using CSS or a language with equivalent capabilities, such as XSLT (XSL Transformations). The use of CSS or XSLT is optional in XHTML 1.0 unless the Strict variant is used. HTML 4.01 supports CSS but not XSLT.

So What is XHTML?
As you might have guessed, XHTML stands for eXtensible HyperText Markup Language. It is a cross between HTML and XML. It fulfills two major purposes that were ignored by HTML:
 * 1) XHTML is a stricter standard than HTML. XHTML documents must be well-formed, just like regular XML. This reduces vagaries and inconsistency between browsers, because browsers do not have to decide how to display a badly formed page. Malformed XHTML is not allowed. Note 1: Browsers only enforce well-formedness if the MIME type is set to application/xhtml+xml. If the MIME type is set to text/html, the browser will allow badly formed documents. There are a large number of 'XHTML' documents on the Web that are badly formed and get away with it because their MIME type is text/html. Note 2: Browsers are not required to check for validity. See Invalid XHTML below for an example.
 * 2) XHTML allows for modularization (m12n). For different environments different element and attribute subsets can be defined.

The best thing about XHTML is that it is almost the same as HTML! If you know how to write an HTML document, creating an XHTML document will be straightforward. The main thing to keep in mind is that, unlike HTML, where browsers ignore simple errors such as a missing closing tag, XHTML code must be written according to an exact specification. We will see later that adhering to these strict specifications actually allows XHTML to be more flexible than HTML.

XHTML Document Structure
At a minimum, an XHTML document must contain a DOCTYPE declaration and four elements: html, head, title, and body. The opening html tag of an XHTML document must include a namespace declaration for the XHTML namespace, http://www.w3.org/1999/xhtml.
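
The structure just described can be sketched as a complete document. This is a minimal illustration, not the only valid layout: it assumes the Strict DOCTYPE (one of the three choices discussed next), and the title and paragraph text are placeholders.

```xml
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- The xmlns attribute declares the XHTML namespace on the root element -->
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <!-- title is the one element that head must contain -->
    <title>A minimal XHTML document</title>
  </head>
  <body>
    <!-- Placeholder content, wrapped in a block-level element -->
    <p>Hello, XHTML.</p>
  </body>
</html>
```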

The DOCTYPE declaration should appear immediately before the html tag in an XHTML document. It can follow one of three formats.

XHTML 1.0 Strict
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> The Strict declaration is the least forgiving. This is the preferred DOCTYPE for new documents. Strict documents tend to be streamlined and clean. All formatting will appear in Cascading Style Sheets rather than in the document itself. Presentational markup that belongs in the style sheet and not in the document includes, but is not limited to, underline (u), bold (b), and italics (i).

There are also certain instances where your content needs to be nested within block-level elements. For example, the sentence "I hope that you enjoy your stay." is incorrect when it appears directly inside the body element, and correct once it is wrapped in a block-level element such as p.
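
The block-nesting rule can be sketched as a before-and-after pair. This is an illustration that assumes the sentence sits directly inside body; p is one suitable block-level element among several.

```xml
<!-- Invalid in XHTML 1.0 Strict: text is a direct child of body -->
<body>
  I hope that you enjoy your stay.
</body>

<!-- Valid: the text is nested within a block-level element -->
<body>
  <p>I hope that you enjoy your stay.</p>
</body>
```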

XHTML 1.0 Transitional
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> This declaration is intended as a halfway house for migrating legacy HTML documents to XHTML 1.0 Strict. The W3C encourages authors to use the Strict DOCTYPE for new documents. (The XHTML 1.0 Transitional DTD refers readers to the relevant note in the HTML4.01 Transitional DTD.)

This DOCTYPE does not require CSS for formatting, although CSS is recommended. It generally tolerates inline elements where block-level elements are expected.

There are a couple of reasons why you might choose this DOCTYPE for new documents.
 * You require backwards compatibility with browsers that support the formatting elements of XHTML but do not support CSS. This is a very small fraction of general users (less than 1%). Many browsers that don't support CSS don't support HTML 4.0 or XHTML either. However, it may be useful on a corporate intranet that has a larger than normal fraction of very old (pre-2000) browsers.
 * You need to link to frames. Using frames is discouraged as they work badly in many browsers.

XHTML 1.0 Frameset
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd"> If you are creating a page with frames, this declaration is appropriate. However, since frames are generally discouraged when designing Web pages, this declaration should be used rarely.

XML Prolog
Additionally, XHTML authors are encouraged by the W3C to include an XML declaration (the XML prolog) as the first line of each document. Although it is recommended by the standard, the declaration may cause errors in older Web browsers, including Internet Explorer 6. It is up to the individual author to decide whether to include the prolog.
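
A typical prolog looks like the sketch below. The UTF-8 encoding is an assumption made for illustration; use whichever encoding your files are actually saved in.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The declaration above must be the very first thing in the file:
     no blank line, whitespace, or comment may come before it -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
```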

Language
It is good practice to include the optional xml:lang attribute on the html element to describe the document's primary language. For compatibility with HTML, the lang attribute should also be specified with the same value. For an English-language document, set both attributes to en.

The xml:lang and lang attributes can also be specified on other elements to indicate changes of language within the document, e.g. a French quotation in an English document.
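
Both uses of the language attributes can be sketched in one document. The French phrase and the choice of a span element are illustrative assumptions.

```xml
<!-- xml:lang and lang carry the same value on the root element -->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Language attributes</title>
  </head>
  <body>
    <!-- The span overrides the document language for the French phrase -->
    <p>The waiter said <span xml:lang="fr" lang="fr">bon appétit</span> and left.</p>
  </body>
</html>
```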

Converting HTML to XHTML
In this section, we will discover how to transform an HTML document into an XHTML document. We will examine each of the following rules:
 * Documents must be well-formed
 * Tags must be properly nested
 * Elements must be closed
 * Tags must be lowercase
 * Attribute names must be lowercase
 * Attribute values must be quoted
 * Attributes cannot be minimized
 * The name attribute is replaced with the id attribute (in XHTML 1.0 both name and id should be used with the same value to maintain backwards-compatibility).
 * Plain ampersands are not allowed
 * Scripts and CSS must be escaped (enclose them within the markers <![CDATA[ and ]]>) or, preferably, moved into external files.

Documents must be well-formed
Because XHTML conforms to all XML standards, an XHTML document must be well-formed according to the W3C's recommendations for an XML document. Several of the rules here reemphasize this point. We will consider both incorrect and correct examples.

Tags must be properly nested
Browsers widely tolerate badly nested tags in HTML documents. For example, text wrapped in bold and underline elements would display as bold and underlined even if the end tags were not in the proper order. An XHTML page will not display if the tags are improperly nested, because it would not be a well-formed XML document. The problem is easily fixed by closing the elements in the reverse of the order in which they were opened.
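
A sketch of the nesting rule follows. The text is a placeholder, and note that the u element is only available in the Transitional DTD; the point here is well-formedness, not validity.

```xml
<!-- Incorrect: b is opened first but is also closed first -->
<p><b><u>Important notice</b></u></p>

<!-- Correct: elements are closed in the reverse of the order they were opened -->
<p><b><u>Important notice</u></b></p>
```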

Elements must be closed
Again, XHTML documents must be well-formed XML documents. For this reason, all tags must be closed. HTML specifications listed some tags as having "optional" end tags, such as the p and li tags. In XHTML, the end tags must be included. What should we do about HTML tags that do not have a closing tag? Some special tags, the empty elements, do not require or imply a closing tag. In XHTML, the XML rule of including a closing slash within the tag must be followed. Note that some of today's browsers will incorrectly render a page if the closing slash does not have a space before it (<br/>). Although the space is not required by the official recommendation, you should always include it (<br />) for compatibility purposes.

Here are the common empty tags in HTML:
 * area
 * base
 * basefont
 * br
 * hr
 * img
 * input
 * link
 * meta
 * param
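
The closing rules can be sketched as follows. The file name logo.png and all of the text are hypothetical.

```xml
<!-- HTML allowed these end tags to be omitted; XHTML does not -->
<ul>
  <li>First item</li>
  <li>Second item</li>
</ul>
<p>A paragraph with an explicit end tag.</p>

<!-- Empty elements close themselves with a trailing slash;
     the space before the slash helps older browsers -->
<p>Line one<br />Line two</p>
<hr />
<img src="logo.png" alt="Company logo" />
```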

Tags must be lowercase
In HTML, tags could be written in either lowercase or uppercase. In fact, some Web authors preferred to write tags in uppercase to make them easier to read. XHTML requires that all tags be lowercase. This difference is necessary because XML is case-sensitive: it would read <P> and <p> as different tags, so an uppercase start tag would not match a lowercase end tag. The problem can be easily fixed by changing all tags to lowercase.
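
A sketch of the case rule, using placeholder text:

```xml
<!-- Incorrect: uppercase tags, and a start tag whose case differs from its end tag -->
<BODY>
  <P>Welcome to my page.</p>
</BODY>

<!-- Correct: every tag is lowercase -->
<body>
  <p>Welcome to my page.</p>
</body>
```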

Attribute names must be lowercase
Following the pattern of writing all tags in lowercase, all attribute names must also be in lowercase. The fix is simply to rewrite each attribute name in lowercase.
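
For instance, with a hypothetical image file:

```xml
<!-- Incorrect: attribute names contain uppercase letters -->
<img SRC="photo.jpg" Alt="A photo" />

<!-- Correct: attribute names are lowercase -->
<img src="photo.jpg" alt="A photo" />
```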

Attribute values must be quoted
In HTML, some attribute values do not require quotation marks around them; browsers understand them anyway. XHTML requires all attribute values to be quoted. Even numeric, percentage, and hexadecimal values must appear in quotation marks for them to be considered part of a proper XHTML document.
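
A sketch covering the three kinds of value mentioned above. The table cell is illustrative, and width and bgcolor are presentational attributes that are only available in the Transitional DTD.

```xml
<!-- Incorrect: unquoted numeric, percentage, and hexadecimal values -->
<td colspan=2 width=50% bgcolor=#ffcc00>Total</td>

<!-- Correct: every attribute value is quoted -->
<td colspan="2" width="50%" bgcolor="#ffcc00">Total</td>
```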

Attributes cannot be minimized
HTML allowed some attributes to be written in shorthand, such as selected or noresize. When using XHTML, attribute minimization is forbidden. Instead, use the syntax x="x", where x is the attribute that was formerly minimized. A complete list of minimized attributes follows:
 * checked
 * compact
 * declare
 * defer
 * disabled
 * ismap
 * nohref
 * noresize
 * noshade
 * nowrap
 * readonly
 * selected
 * multiple
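
Two of the attributes from the list, expanded as described. The form controls themselves are hypothetical.

```xml
<!-- Incorrect: minimized attributes -->
<input type="checkbox" name="subscribe" checked />
<option selected>Monthly</option>

<!-- Correct: each attribute takes its own name as its value -->
<input type="checkbox" name="subscribe" checked="checked" />
<option selected="selected">Monthly</option>
```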

The name attribute is replaced with the id attribute
HTML 4.01 standards define a name attribute for the tags a, applet, frame, iframe, img, and map. XHTML has deprecated the name attribute. Instead, the id attribute is used. However, to ensure backwards compatibility with today's browsers, it is best to use both the name and id attributes, with the same value. As technology advances, it will eventually be unnecessary to use both attributes, and XHTML 1.1 removed name altogether.
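
The three stages can be sketched with a hypothetical anchor. These are alternatives, not lines from one document, since an id value must be unique within a page.

```xml
<!-- HTML 4.01 style: name only -->
<a name="chapter1">Chapter 1</a>

<!-- XHTML 1.0, backwards compatible: id and name carry the same value -->
<a id="chapter1" name="chapter1">Chapter 1</a>

<!-- XHTML 1.1: id only -->
<a id="chapter1">Chapter 1</a>
```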

Plain ampersands are not allowed
Plain (unescaped) ampersands are illegal in XHTML, both in text and in attribute values. They must instead be replaced with the equivalent entity reference &amp;.
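
Attribute values are the place where this rule is most often missed, as in query strings. A sketch with a hypothetical URL:

```xml
<!-- Incorrect: plain ampersands in text and in an attribute value -->
<p>Fish & chips</p>
<a href="search?q=xhtml&page=2">Next page</a>

<!-- Correct: each ampersand is written as an entity reference -->
<p>Fish &amp; chips</p>
<a href="search?q=xhtml&amp;page=2">Next page</a>
```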

Image alt attributes are mandatory
Because XHTML is designed to be viewed on different types of devices, some of which are not image-capable, alt attributes must be included for all images. Remember that the img tag must include a closing slash in XHTML!
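
A sketch with hypothetical image files. An empty alt value still satisfies the requirement and is a common choice for purely decorative images.

```xml
<!-- Incorrect: no alt attribute and no closing slash -->
<img src="chart.png">

<!-- Correct: alt describes the image for devices that cannot show it -->
<img src="chart.png" alt="Bar chart of monthly sales" />

<!-- Decorative image: alt is present but empty -->
<img src="divider.png" alt="" />
```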

Scripts and CSS must be escaped
Internal scripts and CSS often include characters like the ampersand and less-than characters. If you are using internal scripts or CSS, enclose them within the markers <![CDATA[ and ]]>. This will mark them as character data that should not be parsed. If you do not use these markers, characters like & and < will be treated as the start of character entity references (like &nbsp;) and tags (like <b>) respectively. This will cause your page to behave unpredictably, and it may invalidate your code.

Additionally, the type attribute is mandatory for scripts. The comment tags <!-- and --> that have traditionally been used to hide JavaScript from noncompliant browsers should not be included. The XML standard states that text enclosed in comment tags may be completely excluded from rendered documents, which would lose all script enclosed in the tags. Also, document.write is not permitted in XHTML documents. You must use node creation methods such as document.createElementNS instead. Confusingly, document.write will appear to work as expected if the document is incorrectly served with a MIME type of text/html (the type for HTML documents) instead of application/xhtml+xml (the type for XHTML documents). If the MIME type is text/html, the document will be parsed as HTML, which allows document.write. Parsing the document as HTML defeats the purpose of writing it in XHTML.
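
An internal script that follows these rules might look like the sketch below. The script body is hypothetical, and prefixing the CDATA markers with // is a common convention (an assumption here, not something the rules above require) that keeps the script usable if the page is ever parsed as HTML.

```xml
<script type="text/javascript">
//<![CDATA[
  // The characters < and & below are safe because they sit inside CDATA.
  // The paragraph is built with DOM methods rather than document.write.
  window.onload = function () {
    var p = document.createElementNS("http://www.w3.org/1999/xhtml", "p");
    p.appendChild(document.createTextNode("1 < 2 && 2 < 3"));
    document.body.appendChild(p);
  };
//]]>
</script>
```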

Similar changes must be made for internal stylesheets. The type attribute must be included, and the CDATA tags should be used. Because scripts and CSS may complicate an XHTML document, it is strongly recommended that they be placed in external .js and .css files, respectively. They can then be linked to from your XHTML document.
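
Linking to external files can be sketched as follows, assuming hypothetical files named styles.css and scripts.js:

```xml
<head>
  <title>External resources</title>
  <!-- link is an empty element, so it takes the closing slash -->
  <link rel="stylesheet" type="text/css" href="styles.css" />
  <!-- An explicit end tag is used here instead of a closing slash,
       for compatibility with HTML parsers -->
  <script type="text/javascript" src="scripts.js"></script>
</head>
```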

Some elements may not be nested
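The W3C recommendations state that certain elements may not be contained within others in an XHTML document, even when no XML rules are violated by the inclusion. The element prohibitions given in Appendix B of the XHTML 1.0 specification are:
 * a must not contain other a elements
 * pre must not contain the img, object, big, small, sub, or sup elements
 * button must not contain the input, select, textarea, label, button, form, fieldset, iframe, or isindex elements
 * label must not contain other label elements
 * form must not contain other form elements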

When to convert
By now, it probably sounds as though converting an HTML document into XHTML is easy, but tedious. When would you want to convert your existing pages into XHTML? Before deciding to change your entire Web site, consider these questions.


 * Do you want your pages to be easily viewed over a nontraditional Internet-capable device, such as a PDA or Web-enabled telephone? Will this be a goal of your site in the future? XHTML is the language of choice for Web-enabled portable devices. Now may be a good time for you to commit to creating an all-XHTML site.
 * Do you plan to work with XML in the future? If so, XHTML may be a logical place to begin. If you head up a team of designers who are accustomed to using HTML, XHTML is a small step away. It may be less intimidating for beginners to learn XHTML than it is to try teaching them all about XML from scratch.
 * Is it important that your site be current with the most recent W3C standards? Staying on top of current standards will make your site more stable and help you stay updated in the future, as you will only have to make small changes to upgrade your site to the newest versions of XHTML as they are approved by the W3C.
 * Will you need to convert your documents to another format? As a valid XML document, XHTML can utilize XSL to be converted into text, plain HTML, another XHTML document, or another XML document. HTML cannot be used for this purpose.

If you answered yes to any of the above questions, then you should probably convert your Web site to XHTML.

MIME Types
XHTML 1.0 documents should be served with a MIME type of application/xhtml+xml to Web browsers that can accept this type. XHTML 1.0 may be served with the MIME type text/html to clients that cannot accept application/xhtml+xml, provided that the XHTML complies with the additional constraints in Appendix C of the XHTML 1.0 specification. If you cannot configure your Web server to serve documents as different MIME types, you probably should not convert your Web site to XHTML.

You should check that your XHTML documents are served correctly to browsers that support application/xhtml+xml, e.g. Mozilla Firefox. Use 'Page Info' to verify that the type is correct.

XHTML 1.1 documents are often not backwards compatible with HTML and should not be served with a MIME type of text/html.

Help Converting

 * Strict Alternatives to Deprecated Attributes for Elements in XHTML 1.0 Strict

HTML Tidy
When creating HTML, it's very easy to make a mistake by leaving out an end tag or not properly nesting tags. HTML Tidy is an application that can correct a number of errors in poorly formed HTML documents and convert them into XHTML. Tidy can also format ugly code to be more readable, including code generated by WYSIWYG editors. HTML Tidy can't generate clean code when it encounters problems it isn't sure how to fix. In these cases, it will generate an error to let you know where the mistake is located in your document.

A few examples of problems that HTML Tidy can remedy:
 * Missing or mismatched end tags.
 * Improperly nested elements.
 * Mixed-up tags.
 * A missing "/" needed to properly close tags.
 * Missing tags in lists.
 * Missing quotes around attribute values.
 * A missing DOCTYPE: Tidy can insert the correct DOCTYPE value based on your code (and can also recognize and report proprietary elements).

HTML Tidy can also be customized at runtime using a wide array of command line arguments. It is capable of indenting code to make it more readable as well as replacing FONT, NOBR, and CENTER tags with style tags and rules using CSS. Tidy can also be taught new tags by declaring them in the configuration file.

You can read more about HTML Tidy at the W3C's HTML Tidy site, as well as download the application as a binary or get the source code. There are several sites that offer HTML Tidy as an online service including the W3C and Site Valet.

You can also validate your page using the validator available at http://validator.w3.org/.

When not to convert
You shouldn't convert your Web pages if they will always be served with a MIME type of text/html. Make sure you know how to configure your server or server-side script to perform HTTP content negotiation so that XHTML-capable browsers receive XHTML marked as application/xhtml+xml. If you can't set up content negotiation, stick to HTML 4.01. People viewing your Web pages with mainstream browsers will be unable to tell the difference between a valid HTML 4.01 Web page and a valid XHTML 1.0 Web page.

Make sure the automated tests you run on your site simulate connections from both XHTML-compatible browsers, e.g. Mozilla Firefox, and non-XHTML-compatible browsers, e.g. Internet Explorer 6.0. This is particularly important if you use JavaScript on your Web site. If maintaining two copies of your test suite is too time-consuming, don't convert.

Bear in mind that valid HTML 4.01 Strict documents generally require less effort to convert to XHTML 1.1 than valid XHTML 1.0 Transitional documents. A valid HTML 4.01 Strict document can only contain elements that are valid in XHTML 1.1, although a few attributes may need changing. XHTML 1.0 Transitional documents, on the other hand, can contain ten element types and more than a dozen attributes that are not valid in XHTML 1.1. The XHTML 1.0 Transitional body element alone has six attributes that are not supported in XHTML 1.1.

Don't be pressured into using XHTML by people talking vaguely about bad practice. Pin them down to what they mean by bad practice. If they start talking about separation of content and presentation, they have confused the differences between HTML and XHTML with the differences between the Transitional and Strict doctypes. Both XHTML 1.0 Transitional and HTML 4.01 Transitional allow you to mix presentation and content in the same document, i.e. they allow this type of bad practice. Both HTML 4.01 Strict and XHTML 1.0 Strict force you to move the bulk of the presentation (but not all of it) into CSS or an equivalent language. All four doctypes allow you to use embedded stylesheets, whereas true separation requires that all CSS and JavaScript be moved to external files.

XHTML 1.1
XHTML 1.0 is a suitable markup language for most purposes. It provides the option to separate content and presentation, which fits the needs of most Web authors. XHTML 1.1 enforces the separation of content and presentation. All deprecated elements and attributes have been removed. It also removes two attributes that were retained in XHTML 1.0 purely for backwards compatibility: the lang attribute is replaced by xml:lang, and name is replaced by id. Finally, it adds support for ruby text found in East Asian documents.

DOCTYPE
The DOCTYPE for XHTML 1.1 is:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

Modularization
The modularization of XHTML, or XHTML m12n, provides suggestions for customizing XHTML, either by integrating subsets of XHTML into other XML applications or by extending the XHTML element set. The framework defines two processes:


 * How to group elements and attributes into "modules"
 * How to combine modules to create new markup languages

The resulting languages, which the W3C calls "XHTML Host Languages", are based on the familiar XHTML structure but specialized for specific purposes. XHTML 1.1 is an example of a host language. It was created by grouping the different elements available to XHTML.

XHTML variations, while possible in theory, have not been widely adopted. There is continuing work being done to develop host languages, but their details are beyond the scope of this discussion.

Invalid XHTML
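XHTML-compliant browsers are allowed to render invalid XHTML documents provided that the documents are well-formed. A simple example is a document in which a list is placed inside a paragraph: every tag is properly closed and nested, so the document is well-formed, but the DTD does not allow a list inside a p element, so the document is invalid.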
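
One such document is sketched below. It is an illustrative assumption about what an invalid but well-formed page can look like; many other validity errors would behave the same way.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Invalid XHTML</title>
  </head>
  <body>
    <!-- Well-formed, because every tag is closed and properly nested,
         but invalid, because the DTD does not allow ul inside p -->
    <p>Things to pack:
      <ul>
        <li>Passport</li>
        <li>Tickets</li>
      </ul>
    </p>
  </body>
</html>
```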

Save such a document with the .xhtml extension (the extension is important) and open the page with Mozilla Firefox. The page will render even though it is invalid.
