UniNetNews Logo

Let's Unify the Net!

Standards News and Solutions for Web Designers

Why XML? For the Web Designer

Author: Jan Hunt

First you should know that SGML (Standard Generalized Markup Language) is the basis for both HTML and XML. SGML is an international standard (ISO 8879) that was published in 1986.

Second, you need to know that XHTML is XML. "XHTML 1.0 is a reformulation of HTML 4.01 in XML, and combines the strength of HTML 4 with the power of XML."

Thirdly, XML is NOT a language, it is rules to create an XML based language. Thus, XHTML 1.0 uses the tags of HTML 4.01 but follows the rules of XML.

The Document

A typical document is made up of three layers:

  1. structure,
  2. content,
  3. and style.

Structure

Structure would be the documents title, author, paragraphs, topics, chapters, head, body etc.

Content

Content is the actual information that composes a title, author, paragraphs etc.

Style

Style is how the content within the structural elements are displayed such as font color, type and size, text alignment etc.

Markup

HTML, SGML, and XML all markup content using tags. The difference is that SGML and XML mainly deal with the relationship between content and structure, the structural tags that markup the content are not predefined (you can make up your own language), and style is kept TOTALLY separate; HTML on the other hand, is a mix of content marked up with both structural and stylistic tags. HTML tags are predefined by the HTML language.

By mixing structure, content and style you limit yourself to one form of presentation and in HTML's case that would be in a limited group of browsers for the World Wide Web.

By separating structure and content from style, you can take one file and present it in multiple forms. XML can be transformed to HTML/XHTML and displayed on the Web, or the information can be transformed and published to paper, and the data can be read by any XML aware browser or application.

SGML (Standard Generalized Markup Language)

Historically, Electronic publishing applications such as Microsoft Word, Adobe PageMaker or QuarkXpress, "marked up" documents in a proprietary format that was only recognized by that particular application. The document markup for both structure and style was mixed in with the content and was published to only one media, the printed page.

These programs and their proprietary markup had no capability to define the appearance of the information for any other media besides paper, and really did not describe very well the actual content of the document beyond paragraphs, headings and titles. The file format could not be read or exchanged with other programs, it was useful only within the application that created it.

Because SGML is a nonproprietary international standard it allows you to create documents that are independent of any specific hardware or software. The document structure (what elements are used and their relationship to each other) is described in a file called the DTD (Document Type Definition). The DTD defines the relationships between a document's elements creating a consistent, logical structure for each document.

SGML is good for handling large-scale, long-term information management needs and has been around for more than a decade as the language of defense contractors and the electronic publishing industry. Because SGML is very large, powerful, and complex it is hard to learn and understand and is not well suited for the Web environment.

XML (Extensible Markup Language)

XML is a "restricted form of SGML" which removes some of the complexity of SGML. XML like SGML, retains the flexibility of describing customized markup languages with a user-defined document structure (DTD) in a non-proprietary file format for both storage and exchange of text and data both on and off the Web.

As mentioned before, XML separates structure and content from style and the structural markup tags can actually describe the content because they can be customized for each XML based markup language. A good example of this is the Math Markup Language (MathML) which is an XML application for describing mathematical notation and capturing both its structure and content. MathML 2.0

Until MathML, the ability to communicate mathematical expressions on the Web was limited to mainly displaying images (JPG or GIF) of the scientific notation or posting the document as a PDF file. MathML allows the information to be displayed on the Web, and makes it available for searching, indexing, or reuse in other applications.

The Goals of XML

  1. XML shall be straightforwardly usable over the Internet.
  2. XML shall support a wide variety of applications.
  3. XML shall be compatible with SGML.
  4. It shall be easy to write programs which process XML documents.
  5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
  6. XML documents should be human-legible and reasonably clear.
  7. The XML design should be prepared quickly.
  8. The design of XML shall be formal and concise.
  9. XML documents shall be easy to create.
  10. Terseness in XML markup is of minimal importance.

These goals come from the W3C's Extensible Markup Language (XML) 1.0 (Second Edition)

HTML (Hypertext markup Language)

HTML is a single, predefined markup language that forces Web designers to use it's limiting and lax syntax and structure. The HTML standard was not designed with other platforms in mind, such as Web TV’s, mobile phones or PDAs. The structural markup does little to describe the content beyond paragraph, list, title and heading.

XML breaks the restricting chains of HTML by allowing people to create their own markup languages for exchanging information. The tags can be descriptive of the content and authors decide how the document will be displayed using style sheets (CSS and XSL). Because of XML's consistent syntax and structure, documents can be transformed and published to multiple forms of media and content can be exchanged between other XML applications.

HTML was useful in the part it has played in the success of the Web but has been outgrown as the Web requires more robust, flexible languages to support it's expanding forms of communication and data exchange.

Universal Access

"The W3C defines the Web as the universe of network-accessible information (available through your computer, phone, television, or networked refrigerator...). Today this universe benefits society by enabling new forms of human communication and opportunities to share knowledge. One of W3C's primary goals is to make these benefits available to all people, whatever their hardware, software, network infrastructure, native language, culture, geographical location, or physical or mental ability."

-- From  The W3C in 7 Points

Summary

XML will never completely replace SGML because SGML is still considered better for long-time storage of complex documents. However, XML has already replaced HTML as the recommended markup language for the Web with the creation of XHTML 1.0.

Even though XHTML has not made the HTML that currently exists on the Web obsolete, HTML 4.01 is the last version of HTML. XHTML (an XML application) is the foundation for a universally accessible, device independent Web.

Read the UniNetNews article "The Standards They Are A Chang'en" and find out what XML, HTML and XHTML have in common. The move from HTML to XHTML 1.0 is not a difficult one.



Valid XHTML 1.0! Valid CSS! Bobby Approved Triple A