
The Docutils Publisher

Author: David Goodger
Contact: docutils-develop@lists.sourceforge.net
Date: $Date$
Revision: $Revision$

The docutils.core.Publisher class is the core of Docutils, managing all the processing and relationships between components. See PEP 258 for an overview of Docutils components. Configuration is done via runtime settings assembled from several sources.

The Publisher convenience functions are the normal entry points for using Docutils as a library.

Publisher Convenience Functions

There are several convenience functions in the docutils.core module. Each of these functions sets up a docutils.core.Publisher object, then calls its publish() method. docutils.core.Publisher.publish() handles everything else.

See the module docstring, help(docutils.core), and the function docstrings, e.g., help(docutils.core.publish_string), for details and a description of the function arguments.

publish_cmdline()

Function for command-line front-end tools, like rst2html.py, with file I/O. Also returns the output as a bytes instance.

There are several examples in the tools/ directory of the Docutils repository. A detailed analysis of one such tool is in Inside A Docutils Command-Line Front-End Tool.

publish_file()

For programmatic use with file-like I/O. In addition to writing the output document to a file, also returns it as a bytes instance.
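A minimal sketch of file-based use, assuming the default writer and a temporary directory; the file names input.rst and output.txt are hypothetical and chosen only for illustration:

```python
import tempfile
from pathlib import Path

from docutils.core import publish_file

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names, used only for this illustration.
    source_path = Path(tmp) / "input.rst"
    destination_path = Path(tmp) / "output.txt"
    source_path.write_text("Hello, *file* I/O!\n", encoding="utf-8")

    # Read the source file, process it, and write the result to the destination.
    publish_file(source_path=str(source_path),
                 destination_path=str(destination_path))

    # Read the result back before the temporary directory is removed.
    written = destination_path.read_text(encoding="utf-8")

print(written)
```

The return value of publish_file() is ignored here because the output is read back from the destination file.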

publish_string()

For programmatic use with string I/O:

Input

can be a str or bytes instance. bytes are decoded with input_encoding.

Output

is a memory object:

  • a str instance [1], if the "auto_encode" function argument is False or output_encoding is set to the special value "unicode".

  • a bytes instance, if the "auto_encode" argument is True and output_encoding is set to an encoding registered with Python's "codecs" module (default: "utf-8").

Calling output = bytes(publish_string(…)) ensures that output is a bytes instance encoded with the configured output_encoding (matching the encoding indicated inside HTML, XML, and LaTeX documents).
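As a rough illustration of the two output types, the sketch below requests a str result through the output_encoding setting and encodes it in client code. It assumes the default writer; the sample source text is made up.

```python
from docutils.core import publish_string

source = "Hello, *world*!"

# The special value "unicode" for output_encoding requests a str result.
text = publish_string(source,
                      settings_overrides={"output_encoding": "unicode"})

# Client code can encode the text itself when bytes are needed.
data = text.encode("utf-8")

print(text)
```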

publish_doctree()

Parse string input (cf. string I/O) into a Docutils document tree data structure (doctree). The doctree can be modified, pickled & unpickled, etc., and then reprocessed with publish_from_doctree().

publish_from_doctree()

Render from an existing document tree data structure (doctree). Returns the output document as a memory object (cf. string I/O).
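The round trip through a doctree might look like the following sketch. It assumes the default writer for publish_from_doctree(); the appended paragraph is only an example of a modification.

```python
from docutils import nodes
from docutils.core import publish_doctree, publish_from_doctree

# Parse reStructuredText into a document tree.
doctree = publish_doctree("Title\n=====\n\nA paragraph.\n")

# The tree can be changed before rendering; here a paragraph is appended.
doctree.append(nodes.paragraph("", "Added after parsing."))

# Render the modified tree and request the result as str.
output = publish_from_doctree(
    doctree, settings_overrides={"output_encoding": "unicode"})

print(output)
```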

publish_programmatically()

Auxiliary function used by publish_file(), publish_string(), publish_doctree(), and publish_parts(). It returns a 2-tuple: the output document as a memory object (cf. string I/O) and the Publisher object.

publish_parts()

For programmatic use with string input (cf. string I/O). Returns a dictionary of document parts. Dictionary keys are the names of parts, and values are str instances; encoding is up to the client. Useful when only portions of the processed document are desired.

There are usage examples in the docutils/examples.py module.

Each Writer component may publish a different set of document parts, described below. Not all writers implement all parts.

Parts Provided By All Writers

encoding

The output encoding setting.

version

The version of Docutils used.

whole

parts['whole'] contains the entire formatted document.
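A small sketch of the parts common to all writers, assuming no writer is named so that the default writer is used:

```python
import docutils
from docutils.core import publish_parts

# No writer is named, so the default writer is used.
parts = publish_parts("Just a paragraph.\n")

print(sorted(parts))       # names of the available parts
print(parts["encoding"])   # the output encoding setting
print(parts["version"])    # the version of Docutils used
print(parts["whole"])      # the entire formatted document
```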

Parts Provided By the HTML Writers

HTML4 Writer

body

parts['body'] is equivalent to parts['fragment']. It is not equivalent to parts['html_body'].

body_prefix

parts['body_prefix'] contains:

</head>
<body>
<div class="document" ...>

and, if applicable:

<div class="header">
...
</div>

body_pre_docinfo

parts['body_pre_docinfo'] contains (as applicable):

<h1 class="title">...</h1>
<h2 class="subtitle" id="...">...</h2>

body_suffix

parts['body_suffix'] contains:

</div>

(the end-tag for <div class="document">), the footer division if applicable:

<div class="footer">
...
</div>

and:

</body>
</html>

docinfo

parts['docinfo'] contains the document bibliographic data, the docinfo field list rendered as a table.

footer

parts['footer'] contains the document footer content, meant to appear at the bottom of a web page, or repeated at the bottom of every printed page.

fragment

parts['fragment'] contains the document body (not the HTML <body>). In other words, it contains the entire document, less the document title, subtitle, docinfo, header, and footer.

head

parts['head'] contains <meta ... /> tags and the document <title>...</title>.

head_prefix

parts['head_prefix'] contains the XML declaration, the DOCTYPE declaration, the <html ...> start tag and the <head> start tag.

header

parts['header'] contains the document header content, meant to appear at the top of a web page, or repeated at the top of every printed page.

html_body

parts['html_body'] contains the HTML <body> content, less the <body> and </body> tags themselves.

html_head

parts['html_head'] contains the HTML <head> content, less the stylesheet link and the <head> and </head> tags themselves. Since publish_parts returns str instances and does not know about the output encoding, the "Content-Type" meta tag's "charset" value is left unresolved, as "%s":

<meta http-equiv="Content-Type" content="text/html; charset=%s" />

The interpolation should be done by client code.

html_prolog

parts['html_prolog'] contains the XML declaration and the doctype declaration. The XML declaration's "encoding" attribute's value is left unresolved, as "%s":

<?xml version="1.0" encoding="%s" ?>

The interpolation should be done by client code.

html_subtitle

parts['html_subtitle'] contains the document subtitle, including the enclosing <h2 class="subtitle"> and </h2> tags.

html_title

parts['html_title'] contains the document title, including the enclosing <h1 class="title"> and </h1> tags.

meta

parts['meta'] contains all <meta ... /> tags.

stylesheet

parts['stylesheet'] contains the embedded stylesheet or stylesheet link.

subtitle

parts['subtitle'] contains the document subtitle text and any inline markup. It does not include the enclosing <h2> and </h2> tags.

title

parts['title'] contains the document title text and any inline markup. It does not include the enclosing <h1> and </h1> tags.
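The following sketch picks a few of the HTML4 writer parts described above and fills in the charset placeholder of html_head in client code. It assumes that the writer can be selected with the writer_name keyword and the name "html4css1" in the installed Docutils release; the sample source is made up.

```python
from docutils.core import publish_parts

source = "Demo\n====\n\nBody text.\n"

# Assumption: "html4css1" selects the HTML4 writer by name.
parts = publish_parts(source, writer_name="html4css1")

title = parts["title"]            # title text without <h1> tags
html_title = parts["html_title"]  # title including the <h1 class="title"> tags
fragment = parts["fragment"]      # body without title, docinfo, header, footer

# The charset placeholder in html_head is interpolated by client code.
html_head = parts["html_head"] % "utf-8"

print(title)
print(fragment)
```

Using parts['fragment'] rather than parts['whole'] is convenient when the rendered content is embedded in a page that supplies its own head and title.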

PEP/HTML Writer

The PEP/HTML writer provides the same parts as the HTML4 writer, plus the following:

pepnum

parts['pepnum'] contains the PEP number (extracted from the header preamble).

S5/HTML Writer

The S5/HTML writer provides the same parts as the HTML4 writer.

HTML5 Writer

The HTML5 writer provides the same parts as the HTML4 writer. However, it uses semantic HTML5 elements for the document, header and footer.

Parts Provided by the "LaTeX2e" and "XeTeX" Writers

See the template files default.tex, titlepage.tex, titlingpage.tex, and xelatex.tex for examples of how these parts can be combined into a valid LaTeX document.

abstract

parts['abstract'] contains the formatted content of the 'abstract' docinfo field.

body

parts['body'] contains the document's content. In other words, it contains the entire document, except the document title, subtitle, and docinfo.

This part can be included into another LaTeX document body using the \input{} command.

body_pre_docinfo

parts['body_pre_docinfo'] contains the \maketitle command.

dedication

parts['dedication'] contains the formatted content of the 'dedication' docinfo field.

docinfo

parts['docinfo'] contains the document bibliographic data, the docinfo field list rendered as a table.

With --use-latex-docinfo, the 'author', 'organization', 'contact', 'address', and 'date' info is moved to titledata.

'dedication' and 'abstract' are always moved to separate parts.

fallbacks

parts['fallbacks'] contains fallback definitions for Docutils-specific commands and environments.

head_prefix

parts['head_prefix'] contains the declaration of documentclass and document options.

latex_preamble

parts['latex_preamble'] contains the argument of the --latex-preamble option.

pdfsetup

parts['pdfsetup'] contains the PDF properties ("hyperref" package setup).

requirements

parts['requirements'] contains required packages and setup before the stylesheet inclusion.

stylesheet

parts['stylesheet'] contains the embedded stylesheet(s) or stylesheet loading command(s).

subtitle

parts['subtitle'] contains the document subtitle text and any inline markup.

title

parts['title'] contains the document title text and any inline markup.

titledata

parts['titledata'] contains the combined title data in \title, \author, and \date macros.

With --use-latex-docinfo, this includes the 'author', 'organization', 'contact', 'address' and 'date' docinfo items.
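A brief sketch of retrieving LaTeX parts, assuming the writer can be selected with the writer_name keyword and the name "latex"; the sample source is made up:

```python
from docutils.core import publish_parts

source = "Demo\n====\n\nBody text.\n"

# Assumption: "latex" selects the LaTeX2e writer by name.
parts = publish_parts(source, writer_name="latex")

print(parts["head_prefix"])  # documentclass declaration and document options
print(parts["titledata"])    # combined title data
print(parts["body"])         # content without title, subtitle, and docinfo
```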

Configuration

Docutils is configured by runtime settings assembled from several sources, among them application defaults, configuration files, and command-line options.

The individual settings are described in Docutils Configuration.

Docutils overlays default and explicitly specified values from these sources such that settings behave the way we want and expect them to behave. For details, see Docutils Runtime Settings.

To pass application-specific setting defaults to the Publisher convenience functions, use the settings_overrides parameter. Pass a dictionary of setting names & values, like this:

overrides = {'input_encoding': 'ascii',
             'output_encoding': 'latin-1'}
output = publish_string(..., settings_overrides=overrides)

Settings from command-line options override configuration file settings, which in turn override application defaults.
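The precedence can be pictured as a layered lookup. The sketch below uses Python's collections.ChainMap with made-up setting values; it only illustrates the ordering and is not how Docutils assembles its settings internally.

```python
from collections import ChainMap

# Hypothetical values for illustration; earlier mappings take precedence.
command_line = {"output_encoding": "utf-8"}
config_file = {"output_encoding": "latin-1", "input_encoding": "utf-8"}
application_defaults = {"input_encoding": "ascii", "tab_width": 4}

settings = ChainMap(command_line, config_file, application_defaults)

print(settings["output_encoding"])  # utf-8: the command line wins
print(settings["input_encoding"])   # utf-8: the config file beats the default
print(settings["tab_width"])        # 4: only the application default sets it
```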

Further customization is possible by creating custom component objects and passing them to publish_*() or the Publisher.

Encodings

The default input encoding is UTF-8. A different encoding can be specified with the input_encoding setting.

The encoding of a reStructuredText source can also be given by a Unicode byte order mark (BOM) or a "magic comment" [2] similar to PEP 263. This makes the input encoding both visible and changeable on a per-source basis.

If the encoding is unspecified and decoding with UTF-8 fails, the locale's preferred encoding is used as a fallback (if it maps to a valid codec and differs from UTF-8).

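The default behaviour differs from Python's open(), which uses the locale's preferred encoding unless an encoding is given: Docutils tries UTF-8 first and uses the locale's preferred encoding only as a fallback.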

The default output encoding is UTF-8. A different encoding can be specified with the output_encoding setting.
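A short sketch of overriding the input encoding for bytes input, assuming the default writer; the sample text and the choice of Latin-1 are made up for illustration:

```python
from docutils.core import publish_string

# Source bytes in a non-default encoding (Latin-1 instead of UTF-8).
source = "Grüße aus Köln".encode("latin-1")

overrides = {"input_encoding": "latin-1",    # how to decode the input bytes
             "output_encoding": "unicode"}   # return the result as str
output = publish_string(source, settings_overrides=overrides)

print(output)
```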