
Working Draft:
The Printer Working Group Standard for Character Repertoire Interoperability

March 17, 2003

This version:
ftp://ftp.pwg.org/pub/pwg/Character-Repertoires/wd-pcr10-20030317.html
Previous version:
ftp://ftp.pwg.org/pub/pwg/Character-Repertoires/wd-pcr10-20030228.html
Editor:
Elliott Bradshaw, Oak Technology Imaging Group

Abstract

In traditional printing environments, clients rely on font downloads when they are not sure a given character is embedded in the printer. As printing moves to small clients, downloading may not be an option, and clients need to know which characters are available in a given device.

There are many published named character repertoires, and a small client will not know about them all.  

To improve interoperability, this document defines:

The primary target of this document is printing using languages based on XML or HTML (for example, XHTML-Print).  It will be less applicable to traditional PDLs (PCL, PostScript, etc.) because they tend to have very language-specific mechanisms for managing character repertoires.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this document series is maintained at the PWG.

All sections of this document are normative unless noted as informative.

This document is a working draft and only a working draft. It is currently being reviewed by PWG Members but has not been approved. It is not a stable document and may not be used as reference material nor cited as a normative reference from another document.

Public discussion of PWG Character Repertoires takes place on the mailing list: cr@pwg.org. To subscribe send an email to majordomo@pwg.org with the words subscribe cr in the body. You must be subscribed to the mailing list to post there. Please report errors in this document to the editor listed above or on the mailing list.

A list of current PWG Standards and other technical documents can be found at http://www.pwg.org/standards.html

Open Issues for discussion are in yellow background with a border, like this.

Revision History

17-Mar-03

1. Introduction

This document defines a data element called "repertoires-supported".  This element is intended to be incorporated into higher-level description schemes, such as the PWG Semantic Model [PWG-SM], as well as protocols based on those schemes.

Issue: should we consider use of this document by other schemes besides the Semantic Model?

Inside the scope of this document are:

  1. Syntactic conventions for advertisement of character repertoires defined elsewhere.
  2. Definition of a minimal printing client and the Basic Repertoires that it understands.
  3. Rules for a conforming printer to use when advertising supported repertoires.

Some areas outside the scope of this document are:

  1. Character encoding.  It is assumed that the client and printer have some other way of agreeing on encoding.
  2. Mapping into and out of Unicode.  It is assumed that for any repertoire defined in a different encoding (e.g. ISO-Latin-xxx, Shift-JIS), the implementer can provide a suitable mapping into Unicode.
  3. Font downloading.
  4. Adaptation to mature PDLs such as PostScript and PCL.  These provide rich, alternate schemes for managing repertoires (including download), and it is not apparent how they would use the mechanisms in this document.
  5. Ability to advertise individual characters.  Our view is that this will add a great deal of data with little real-world benefit.
  6. Query mechanisms.  Separately, a protocol could define a query mechanism using this data format.

2. Terminology

In Unicode and W3C documents, the term character set usually refers to a method of encoding a (possibly very large) set of characters, e.g. UTF-8. This tells how to encode a given character if it is present, but doesn't define which characters in that space are actually in use.

The term character repertoire is used here to indicate a subset of characters that is actually present. It is convenient to specify a character repertoire using Unicode characters; however, in principle a character repertoire could be expressed in a different encoding.

The keywords "MUST", "SHALL", "MUST NOT", "SHALL NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" when used in this document are to be interpreted as described in RFC 2119 [RFC2119]. However, for readability, these words do not appear in all uppercase letters in this specification.

3. The Semantic Element "repertoires-supported"

[PWG-SM] defines semantic elements for a printer to use in advertising its capabilities (among other things).  We use the Model to let a printer advertise its supported repertoires; the union of all characters in all advertised repertoires tells the client what characters it may safely use.  (Note that a printer is free to implement additional characters beyond those listed in the supported repertoires.)  A printer might also use "repertoires-ready," in the usual manner described by the Semantic Model, to indicate repertoires that are available without any operator intervention (such as inserting a DIMM).

A client references characters in whatever encoding is present, without reference to a particular repertoire.  In other words, repertoires are (possibly overlapping) sets of characters, but a repertoire is not needed to reference a character.  Therefore, there are no semantic elements for default, current, or actual repertoire values.
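
As an illustration of the union rule above, the following Python sketch checks the text of a job against the union of a printer's advertised repertoires.  The table REPERTOIRE_RANGES and the function unsupported_characters are hypothetical names, not part of this document.  The table lists only three Unicode code chart blocks and is keyed by exact value strings; a real client would cover whatever repertoires it knows and would apply the matching rules of section 3.2 first.

```python
# Hypothetical lookup table: repertoire value -> inclusive code point ranges.
# Only three Unicode code chart blocks are listed, for illustration.
REPERTOIRE_RANGES = {
    "Unicode: Basic Latin": [(0x0000, 0x007F)],
    "Unicode: Latin-1 Supplement": [(0x0080, 0x00FF)],
    "Unicode: Cyrillic": [(0x0400, 0x04FF)],
}


def unsupported_characters(text, advertised):
    """Return the characters of text outside the union of advertised repertoires.

    Repertoire values this client does not recognize are skipped, so the
    result is conservative: the printer may support more than is reported.
    """
    ranges = []
    for value in advertised:
        ranges.extend(REPERTOIRE_RANGES.get(value, []))
    missing = []
    for ch in text:
        covered = any(low <= ord(ch) <= high for low, high in ranges)
        if not covered and ch not in missing:
            missing.append(ch)
    return missing
```

With only "Unicode: Basic Latin" advertised, the text "café" yields the single character "é"; once "Unicode: Latin-1 Supplement" is also advertised, nothing is reported.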

3.1. Values for "repertoires-supported"

This document specifies how to reference repertoires defined elsewhere.  "repertoires-supported" contains one or more values, with each value constructed as follows:

Source                                               Form of each value     Example
IANA charset registry as defined in [IANA-Charsets]  IANA: name             IANA: iso-8859-1
Unicode code chart as defined in [Unicode-Charts]    Unicode: name          Unicode: Basic Latin 1
Unicode Unihan database as defined in [Unihan]       Unihan: name           Unihan: JIS X 0208
Vendor specific                                      Vendor: vendor: name   Vendor: Oak: Floral

Note that these sources are in a variety of encodings, not necessarily Unicode.  If a non-Unicode repertoire is used in a Unicode context, the implication is that the corresponding Unicode codepoints are used.  Such mappings are outside the scope of this document (but are commonly available in most cases).
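
The value forms listed in this section can be split mechanically.  The Python sketch below is one possible way to do so, under the assumption that the colon is the only separator and that surrounding spaces are not significant.  The names RepertoireValue, KNOWN_SOURCES, and parse_repertoire_value are hypothetical and are not defined by this document.

```python
from typing import NamedTuple, Optional

# The four source prefixes listed in this section, folded to lower case.
KNOWN_SOURCES = {"iana", "unicode", "unihan", "vendor"}


class RepertoireValue(NamedTuple):
    source: str            # "iana", "unicode", "unihan", or "vendor"
    vendor: Optional[str]  # set only for vendor-specific values
    name: str              # repertoire name as defined by the source


def parse_repertoire_value(value: str) -> RepertoireValue:
    """Split one "repertoires-supported" value into its parts."""
    source, sep, rest = value.partition(":")
    source = source.strip().lower()
    if not sep or source not in KNOWN_SOURCES:
        raise ValueError(f"unknown or missing source in {value!r}")
    vendor = None
    if source == "vendor":
        # Vendor values carry an extra field: "Vendor: vendor: name".
        vendor, sep, rest = rest.partition(":")
        vendor = vendor.strip()
        if not sep or not vendor:
            raise ValueError(f"missing vendor in {value!r}")
    name = rest.strip()
    if not name:
        raise ValueError(f"missing repertoire name in {value!r}")
    return RepertoireValue(source, vendor, name)
```

For example, "Vendor: Oak: Floral" parses to the source "vendor", the vendor "Oak", and the name "Floral", while "IANA: iso-8859-1" parses to the source "iana" with no vendor.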

3.2. Matching rules

In matching names, the client should consider these rules:

  1. Names are case-insensitive, so a letter matches its upper- or lower-case equivalent.
  2. Space, hyphen, and underscore characters are removed prior to matching.

As a result, all of the following are equivalent:

Unicode: Latin-1 Supplement
unicode:Latin1Supplement
unicode:    latin_1    supplement
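
A minimal Python sketch of these two matching rules follows.  The function names fold_name and names_match are hypothetical.  The sketch assumes that "space" means only U+0020 and that simple lower-casing is an adequate case fold for repertoire names; neither assumption is stated by this document.

```python
def fold_name(value):
    """Fold a repertoire value for comparison.

    Rule 2: remove space, hyphen, and underscore characters.
    Rule 1: ignore case (approximated here with str.lower()).
    """
    stripped = "".join(ch for ch in value if ch not in " -_")
    return stripped.lower()


def names_match(a, b):
    """Return True if two repertoire values are equivalent under the rules."""
    return fold_name(a) == fold_name(b)
```

All three spellings shown above fold to the same string, "unicode:latin1supplement", so names_match treats any pair of them as equivalent.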

Issue: align these rules with a similar discussion going on in the Semantic Model group.  Compare these rules with rules for other elements, e.g. charsets-supported, as well as common practice in other areas such as web browsers.

Issue: should we define more rules for how these names are formed?  Canonical form (e.g. all lower case)?  Alternatively, this is moot if the client implements the matching rules.  What about IANA's use of multiple aliases for the same charset...should we mandate use of the preferred form?

Individual transport protocols may place further restrictions on the use of upper/lower case, and the use of space, hyphen, and underscore characters.

3.3. Font Sensitivity

The semantic element "repertoires-supported" does not correlate with particular fonts.  If a character is present in an advertised repertoire, then the printer must be able to render that character regardless of the currently selected font.  However, renderings in different fonts need not be distinct.  A common approach is for the printer to implement a system default font with all advertised characters, and to implement a fall-through mechanism that will render a character from the default font if it is not available in the currently selected font.
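
The fall-through approach described above might look like the following Python sketch.  The function font_for_character and the coverage table (font name mapped to the set of characters that font can render) are hypothetical; a real printer would consult its own font data rather than a dictionary.

```python
def font_for_character(ch, selected_font, default_font, coverage):
    """Pick the font used to render ch.

    coverage maps a font name to the set of characters that font contains.
    The selected font is tried first; the system default font is the
    fall-through.  None means the character cannot be rendered at all,
    which must not happen for a character in an advertised repertoire.
    """
    if ch in coverage.get(selected_font, set()):
        return selected_font
    if ch in coverage.get(default_font, set()):
        return default_font
    return None
```

Because the default font holds every advertised character, a character missing from the selected font is still rendered, although not in a distinct style.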

3.4. Basic Repertoires

In order to promote interoperability, this document designates a small number of repertoires as "basic".  In this way a print client that only knows the names of the basic repertoires can get useful results.

The repertoires designated as basic are:

Latin Extended-A is used primarily by Latin-based languages in Eastern Europe. The last four support PRC, Japan, Korea, and Taiwan respectively.

Issue: should we postpone inclusion of Hebrew, Arabic, and Thai?

A conforming printer must advertise a basic repertoire whenever it advertises similar repertoires.  For example, any printer advertising any Cyrillic repertoire must also advertise "Unicode: Cyrillic".  In this way a client that does not recognize a large number of repertoires can still recognize that basic Cyrillic printing is possible on this device.

Printers will often support larger repertoires.  If a printer supports a repertoire that is a superset of a basic repertoire, then it must advertise the basic repertoire in addition to the superset.
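
The advertising rule in this section can be sketched in Python as follows.  The table IMPLIED_BASIC is purely illustrative: it assumes that the IANA charsets KOI8-R and ISO-8859-5 count as Cyrillic repertoires that imply "Unicode: Cyrillic", which this document does not itself enumerate.  The function name with_required_basics is likewise hypothetical, and values are compared as exact strings for brevity.

```python
# Illustrative only: repertoires assumed to imply a basic repertoire.
IMPLIED_BASIC = {
    "IANA: KOI8-R": ["Unicode: Cyrillic"],
    "IANA: ISO-8859-5": ["Unicode: Cyrillic"],
}


def with_required_basics(advertised):
    """Return the advertised list plus any basic repertoires it implies.

    The original order is preserved and each implied basic repertoire is
    appended once, after the values the printer already advertises.
    """
    result = list(advertised)
    for value in advertised:
        for basic in IMPLIED_BASIC.get(value, []):
            if basic not in result:
                result.append(basic)
    return result
```

A printer that starts from the single value "IANA: KOI8-R" would therefore advertise both "IANA: KOI8-R" and "Unicode: Cyrillic".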

3.5. Extensions

A printer may implement several types of extensions without losing conformance with this document.  Examples include:

  1. Advertise other repertoires in addition to the basic repertoires; for example:
  2. Implement additional characters without advertising them.
  3. Implement characters that are available only in certain fonts.  However these characters must not be advertised using this mechanism.

4. Conformance

4.1. Printer Conformance

A conforming printer must follow these rules:

  1. All printers must support and advertise:
  2. All printers must support the euro (U+20AC) character, even though it is not advertised.
  3. A basic repertoire must be supported whenever similar ones are, as described above.  
  4. If a printer supports a repertoire, it must be able to render all characters from the repertoire, regardless of selected font, as described above.

There is no requirement that every supported character be represented in some repertoire; a printer may support specific characters without advertising them.  In some languages (e.g. those based on XHTML) certain characters are implicitly supported (e.g. as built-in character entities), without being advertised in any repertoire.

4.2. Print Client Operation (informative)

Printing protocols (outside of this document) specify how a print client learns about the supported repertoires in a printer.  Once it knows, a client may choose to use this knowledge in any of these ways:

  1. If multiple printers are available, look for one that can print all the characters in a job.
  2. If printing to a printer that can't print all the characters in a job, warn the user.
  3. Make a substitution for a character that won't print.
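
The three options above can be sketched in Python as follows.  The sketch assumes the client has already resolved each printer's advertised repertoires into a set of characters; the function names choose_printer and substitute_unprintable, and the "?" replacement character, are hypothetical choices.

```python
def choose_printer(job_text, printers):
    """Option 1: return the first printer able to print every character.

    printers maps a printer name to the set of characters the client
    knows that printer supports.  Returns None if no printer qualifies.
    """
    needed = set(job_text)
    for name, supported in printers.items():
        if needed <= supported:
            return name
    return None


def substitute_unprintable(job_text, supported, replacement="?"):
    """Options 2 and 3: substitute unprintable characters and report them.

    Returns the rewritten text and a sorted list of the characters that
    were replaced, which the client can use to warn the user.
    """
    missing = sorted(set(job_text) - supported)
    text = "".join(ch if ch in supported else replacement for ch in job_text)
    return text, missing
```

A client might call choose_printer first and fall back to substitute_unprintable, warning the user whenever the returned list of replaced characters is not empty.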

5. Acknowledgements (informative)

This document was prepared with input and assistance from:

A. References

Issue: how do we lock down specific versions of these references?
 
[IANA-Charsets]
IANA Character Sets registry.  Available online at http://www.iana.org/assignments/character-sets.
[PWG-SM]
PWG Semantic Model.  Available online at http://www.pwg.org/sm/index.html.
[RFC2119]
"RFC2119 - Key words for use in RFCs to Indicate Requirement Levels", S. Bradner. Available online at http://www.ietf.org/rfc/rfc2119.
[Unicode-Charts]
Unicode code charts.  Available online at http://www.unicode.org/charts.
[Unihan]
Unicode Unihan database, which includes mappings to major CJK character set standards.  Available online at http://www.unicode.org/charts/unihan.html

B. Bindings to IPP

(To be written.)

C. Bindings to Semantic Model (XML)

(To be written.)