freepooma-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

docbook overview


From: Allan Stokes
Subject: docbook overview
Date: Thu, 24 May 2001 04:57:23 -0700

Hello everyone,

Back at the Proximation meeting it was suggested that the Pooma
documentation be prepared in DocBook format.  Not everyone there was
familiar with DocBook so I'd like to take a few minutes to describe DocBook
and the authoring process so everyone understands the document format.

About ten years ago SGML (Generalized Markup Language) became an official
standard for document markup.  This is an extremely rich structure which
does not lend itself to simple applications.  For example, in SGML you can
redeclare the characters which function as markup delimiters.  There isn't
much you take for granted without a full parse.

HTML is a simplified language which includes a subset of all possible SGML
documents.  The HTML subset was designed to exclude SGML mechanisms which
complicate parsing so that HTML documents would be simple to read and
process.

Unfortunately, the markup elements included in HTML conflate structure with
layout.  Tags such as <p> and <ul> express paragraph and list document
elements.  But you also find stuff like this (as on the Pooma.com home
page):

<table width="100%" cellpadding="0" cellspacing="0">
<tr bgcolor="#E5D5C4">
</tr>
</table>

The table structure and the visual presentation are hopelessly mangled
together.  This makes HTML a poor choice for creating portable documents.

TeX also suffers from unnatural commingling, which is one of several reasons
people find TeX unpleasant to use.  In the case of TeX, the problem was
partly addressed by creating LaTeX as a document language.  LaTeX allows the
author to describe the structure of the document directly and (mostly) keeps
the visual mechanics behind the curtains.

DocBook was invented to solve the same problem with HTML: separating
structure from presentation.  The structure DocBook describes is everything
you might want to put in a book.

Here is a small scrap which I created while testing my DocBook tools to give
the flavour of the notation.

<!DOCTYPE Book PUBLIC "-//OASIS//DTD DocBook V4.1//EN">

<book>
  <bookinfo>
    <title>Allan's First DocBook Book</title>
    <author>
      <firstname>Allan</firstname>
      <surname>Stokes</surname>
    </author>
    <copyright>
      <year>2001</year>
      <holder>Allan Stokes</holder>
    </copyright>
  </bookinfo>
  <preface>
    <title>Foreward</title>
    <para>I survived PSGML.</para>
  </preface>
  <chapter>
    <title>chapters MUST have a title</title>
    <para>The location of the PSGML tutorial is www.lysator.liu.se
    </para>
    <para>And so it goes.
    </para>
  </chapter>
</book>

This is a valid SGML document.  The first line declares the document data
type (DTD) which governs the structure of the rest of the document.  The DTD
formally defines what markup elements are legal, required, optional, etc.

If you run this scrap through a formal SGML validator, the validator will be
able to tell you if your document obeys the rules imposed by the DTD.  This
fragment references the DocBook DTD for DocBook version 4.1.

With this background I can now define what DocBook is.  DocBook is an SGML
DTD.  Presently the DocBook DTD is about 300,000 characters of SGML in SGML
DTD syntax which defines the vast majority of document elements which any
technical book might wish to include.  O'Reilly has published high quality
technical books directly from DocBook sources.

DocBook is also used for electronic documentation.  The Linux Documentation
Project (LDP) has standardized on DocBook, and FreeBSD is in the process of
converting much of their own online documentation into DocBook format.  The
tools required to work with DocBook are now standard components on many of
these platforms, so DocBook has become exceptionally portable.

A common stumbling block in understanding DocBook is that DocBook itself
provides no tools.  DocBook is an SGML DTD.  The DocBook DTD is essentially
an SGML document in its own right.  (In the SGML world, everything is a
document.  When you have a hammer ...)

The core tool for working with DocBook is the SGML validator.  A formal SGML
validator parses the DTD to determine the rules which govern the document
instance which follows.  The validator will tell you if your DocBook
document is a legal document.  In the purest sense, DocBook defines a subset
of all possible SGML documents as being legal DocBook documents.  The
semantics of the markup are defined by the tools used in the postproduction.

All DocBook gives us is a structure for composing well-formed documents.  We
need to address the issue of publishing the document separately.

To do this you need an SGML processor which is capable of converting the
DocBook source document into a backend format.  There are many tools out
there which can manipulate SGML so there are many ways to publish a DocBook
document.  In the non-commercial world, DocBook publishing is usually done
with Jade (a free tool written by James Clark).

Jade comes with a set of standard stylesheets for a variety of backend
formats.  The quality of output is quite acceptable using the default
stylesheets.  It is unlikely that we will wish to make any changes to the
standard stylesheets for publishing the Pooma documentation.  You can
typically use Jade without having to deal with much of Jade's complexity.
But I'll describe Jade anyway.

There are a variety of standards for specifying stylesheets.  Jade supports
the DSSSL standard.  DSSSL stylesheets are written in a Scheme-like
language.  DSSSL processing reminds me of Pooma.  The DocBook document is an
SGML tree structure (much like a Pooma template expression), and DSSSL
behaves like an expression template which walks the tree structure and
transforms it (decorates, trims, etc.) into a new tree structure, then
finally you flatten the whole thing for output (e.g. evaluate).

DSSSL is a functional language which iterates via tail recursion.  A DSSSL
stylesheet, surprise, is itself a valid SGML document.  (It is even possible
to use Jade to transform DSSSL stylesheets into other DSSSL stylesheets.
It's this kind of thing which gives DSSSL a cult reputation.  Let go of your
mouse and back away quietly.)

The primary backend formats supported by Jade are HTML, RTF, TeX.  The TeX
produced by Jade is suitable for conversion into high quality Postscript or
PDF.

Jade is easy enough to use, but configuring Jade involves a couple of
frustrating steps.  SGML contains several methods of indirection which are
used extensively to modularize the implementation of the DocBook DTD and the
DSSSL stylesheets.  Most SGML entities are reference via public identifiers
(as opposed to host filenames).  Most SGML tools resolve the public
identifiers by searching site-local files called catalogs.  The catalog
files supply local mappings to names which can be resolved on the local file
system.  Most SGML toolsets expect the catalog configuration to be done
manually before all the SGML magic will work.

Invoking Jade with the right combinations of stylesheets and backend formats
is moderately tricky as well.

The "DocBook Definitive Guide" presents a solution to this problem which
demonstrates typical DSSSL trickery.  A DSSSL stylesheet is an SGML document
whose structure is defined by a DSSSL DTD.  What they do is modify the
standard DSSSL DTD to allow annotations to be placed inside the DSSSL
stylesheet which define what combinations of stylesheets and backends are
appropriate.  Then they supply a Perl script which uses Jade to look inside
the DSSSL stylesheet to find the annotations which govern how the backend
format specified on the command line should be produced.

The whole thing is slick when it works, but initially you get a severe dose
of abstraction shock trying to figure out which bit of syntax functions at
what level.

I have most of this stuff working in my own environment.  Once I figure out
how I accomplished this feat I think it would be a good idea to document my
Jade configuration and publishing process.  Perhaps these notes would be
suitable for a web page somewhere on pooma.com.  (Is that stuff under source
control?)

The final piece of the puzzle is the authoring process.

Generally, everything I've found on the web about authoring DocBook applies
to emacs.  emacs has a psgml package which defines a major mode for editing
DocBook documents.  psgml defines several menus and many keystrokes for
applying syntax directed markup.  When you load your DocBook document, psgml
parses the DocBook DTD to determine the required markup structure for your
document.  It will prompt you on legal completions when you get lost in
tricky annotations.

It also has hooks to run external SGML validators to find errors in your
markup.  You can also automate the process of publishing your documents
using Jade by commands invoked from within emacs.  There are also a number
of commercial applications which provide the same capabilities (syntax
directed editing of SGML documents).

psgml is a generic setup.  I discovered that I needed to apply a number of
tweaks to emacs to make the configuration usable.  For example, psgml
doesn't necessarily know how you have your catalog files set up when
invoking external tools.

Maintaining an existing DocBook document manually (e.g. without setting up
psgml) is not much worse than maintaining HTML manually.  However, I doubt
anyone would want to crank new material without the support of something
like psgml.

There's another aspect of DocBook which I should mention briefly because
many of the web materials talk about it.  So far I've described the SGML
version of DocBook and the tools appropriate for publishing SGML DocBook
(Jade/DSSSL).

There is also a parallel version of DocBook in XML format.  XML is also an
SGML language, simplified even more than HTML.  The SGML version of DocBook
is slightly more human friendly so most people write in the SGML dialect.
The interconversion is mechanical.

Jade and DSSSL are mature tools (with many limitations) which are no longer
being aggressively developed, in part because DSSSL/Scheme is regarded as
overkill.

The ongoing work is mostly in the XML camp now.  XML defines several
different stylesheets languages (e.g. CSS) and different transformation
tools (e.g. XSL/XSLT).  When browsers fully support HTML/CSS it will be
possible to publish XML DocBook documents directly to the web.  The web
browser will do the presentation directly from the style sheet.

The XML version of DocBook uses an XML syntax for declaring the DocBook DTD.
If I understand the situation correctly, the XML syntax is slightly less
expressive than the SGML syntax, so the two DocBook DTDs are not presently
formally equivalent.  The next major version of DocBook is supposed to
address this issue.

The XML tools at this point are mostly experimental.  Within the next few
years it is anticipated that the new XML model will largely supplant the
Jade/DSSSL process.  It will be a different way of publishing the same
source documents.

I hope that gives people a good overview of what DocBook is, where it is
going, and how the process fits together.  Once I finish collecting my
installation notes I'll set up my document framework for the Pooma concepts.

Allan

reply via email to

[Prev in Thread] Current Thread [Next in Thread]