Working Draft:
The Printer Working Group Standard for Character Repertoire Interoperability

March 17, 2003

This version: ftp://ftp.pwg.org/pub/pwg/Character-Repertoires/wd-pcr10-20030317.html
Previous version: ftp://ftp.pwg.org/pub/pwg/Character-Repertoires/wd-pcr10-20030228.html
Editor: Elliott Bradshaw, Oak Technology Imaging Group

Abstract

In traditional printing environments, clients rely on font downloads when they are not sure that a given character is embedded in the printer. As printing moves to small clients, downloading may not be an option, and clients need to know which characters are available in a given device.

There are many published named character repertoires, and a small client will not know about them all.

To improve interoperability, this document defines:

Semantics and naming conventions, to allow a printer to advertise what repertoires it supports
Definition of which repertoires are considered "basic", and therefore safe to use with all clients

The primary target of this document is printing using languages based on XML or HTML (for example, XHTML-Print). It will be less applicable to traditional PDLs (PCL, PostScript, etc.) because they tend to have very language-specific mechanisms for managing character repertoires.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this document series is maintained at the PWG.

All sections of this document are normative unless noted as informative.

This document is a working draft and only a working draft. It is currently being reviewed by PWG Members but has not been approved. It is not a stable document and may not be used as reference material nor cited as a normative reference from another document.

Public discussion of PWG Character Repertoires takes place on the mailing list cr@pwg.org. To subscribe, send an email to majordomo@pwg.org with the words "subscribe cr" in the body. You must be subscribed to the mailing list to post there. Please report errors in this document to the editor listed above or on the mailing list.

A list of current PWG Standards and other technical documents can be found at http://www.pwg.org/standards.html

Open Issues for discussion are in yellow background with a border, like this.

Revision History

17-Mar-03

Changed the title to remove "Preferred"
Marked some sections as Informative
Formatting cleanup; addition of copyright notice, acknowledgements, etc.
Clarification in the Abstract and Introduction of goals and non-goals
More information about how this document relates to the Semantic Model
Changed the details of syntax for repertoire names
More information about rules for matching repertoire names
Clarified the wording regarding font sensitivity
Confirmed use of Unicode code charts for basic non-Asian repertoires
Changed from the notion of "Preferred Repertoire" to "Basic Repertoire"; this emphasizes that the printer is free to advertise additional repertoires
Included Latin-1 Supplement and Latin Extended-A as Basic Repertoires
Added requirement to support the euro character
References

1. Introduction

This document defines a data element called "repertoires-supported". This element is intended to be incorporated into higher-level description schemes, such as the PWG Semantic Model [PWG-SM], as well as protocols based on those schemes.

Issue: should we consider use of this document by other schemes besides the Semantic Model?

Inside the scope of this document are:

Syntactic conventions for advertisement of character repertoires defined elsewhere.
Definition of a minimal printing client and the Basic Repertoires that it understands.
Rules for a conforming printer to use when advertising supported repertoires.

Some areas outside the scope of this document are:

Character encoding. It is assumed that the client and printer have some other way of agreeing on encoding.
Mapping into and out of Unicode. It is assumed that for any repertoire defined in a different encoding (e.g. ISO-Latin-xxx, Shift-JIS), the implementer can provide a suitable mapping into Unicode.
Font downloading.
Adaptation to mature PDLs such as PostScript and PCL. These provide rich, alternate schemes for managing repertoires (including download), and it is not apparent how they would use the mechanisms in this document.
Ability to advertise individual characters. Our view is that this will add a great deal of data with little real-world benefit.
Query mechanisms. Separately, a protocol could define a query mechanism using this data format.

2. Terminology

In Unicode and W3C documents, the term character set usually refers to a method of encoding a (possibly very large) set of characters, e.g. UTF-8. This tells how to encode a given character if it is present, but doesn't define which characters in that space are actually in use.

The term character repertoire is used here to indicate a subset of characters that is actually present. It is convenient to specify a character repertoire using Unicode characters; however, in principle a character repertoire could be expressed in a different encoding.

The keywords "MUST", "SHALL", "MUST NOT", "SHALL NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" when used in this document are to be interpreted as described in RFC 2119 [RFC2119]. However, for readability, these words do not appear in all uppercase letters in this specification.

3. The Semantic Element "repertoires-supported"

[PWG-SM] defines semantic elements for a printer to use in advertising its capabilities (among other things). We use the Model to let a printer advertise its supported repertoires; the union of all characters in all advertised repertoires tells the client what characters it may safely use. (Note that a printer is free to implement additional characters beyond those listed in the supported repertoires.) A printer might also use "repertoires-ready", in the usual manner described by the Semantic Model, to indicate repertoires that are available without any operator intervention (such as inserting a DIMM).

A client references characters in whatever encoding is present, without reference to a particular repertoire. In other words, repertoires are (possibly overlapping) sets of characters, but a repertoire is not needed to reference a character. Therefore, there are no semantic elements for default, current, or actual repertoire values.

3.1. Values for "repertoires-supported"

This document specifies how to reference repertoires defined elsewhere. "repertoires-supported" contains one or more values, with each value constructed as follows:

Source	Form of each value	Example
IANA charset registry as defined in [IANA-Charsets]	IANA: name	IANA: iso-8859-1
Unicode code chart as defined in [Unicode-Charts]	Unicode: name	Unicode: Basic Latin
Unicode Unihan database as defined in [Unihan]	Unihan: name	Unihan: JIS X 0208
Vendor specific	Vendor: vendor: name	Vendor: Oak: Floral

Note that these sources are in a variety of encodings, not necessarily Unicode. If a non-Unicode repertoire is used in a Unicode context, the implication is that the corresponding Unicode codepoints are used. Such mappings are outside the scope of this document (but are commonly available in most cases).
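
Example (informative): the value forms in the table above can be taken apart mechanically. The following Python sketch splits one value into its source, optional vendor, and name. The names `parse_repertoire_value` and `RepertoireRef` are hypothetical and are not defined by this document. The sketch assumes that the first colon ends the source prefix, so that a name which itself contains a colon (such as "KS X 1001:1992") is kept intact.

```python
from typing import NamedTuple, Optional

# Source prefixes listed in the table in section 3.1.
KNOWN_SOURCES = ("iana", "unicode", "unihan", "vendor")


class RepertoireRef(NamedTuple):
    """Hypothetical parsed form of one "repertoires-supported" value."""
    source: str            # lower-cased source prefix
    vendor: Optional[str]  # set only for vendor-specific values
    name: str              # repertoire name as written by the printer


def parse_repertoire_value(value: str) -> RepertoireRef:
    # Assumption: the first colon ends the source prefix.
    source, sep, rest = value.partition(":")
    source = source.strip().lower()
    if not sep or source not in KNOWN_SOURCES:
        raise ValueError(f"unrecognized repertoire value: {value!r}")
    if source == "vendor":
        # Vendor values carry a second prefix: "Vendor: vendor: name".
        vendor, sep, name = rest.partition(":")
        if not sep or not vendor.strip() or not name.strip():
            raise ValueError(f"malformed vendor value: {value!r}")
        return RepertoireRef(source, vendor.strip(), name.strip())
    if not rest.strip():
        raise ValueError(f"missing repertoire name: {value!r}")
    return RepertoireRef(source, None, rest.strip())
```

For instance, `parse_repertoire_value("Vendor: Oak: Floral")` yields the source "vendor", the vendor "Oak", and the name "Floral".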

3.2. Matching rules

In matching names, the client should consider these rules:

Names are case-insensitive, so a letter should match its upper/lower case equivalent
Space, hyphen, and underscore characters are removed prior to matching

As a result, all of the following are equivalent:

Unicode: Latin-1 Supplement
unicode:Latin1Supplement
unicode: latin_1 supplement
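
Example (informative): a client could apply the two rules above by reducing each value to a comparison key. The Python sketch below is one possible reading of the rules, with hypothetical function names; it applies the rules to the whole value, including the source prefix, which is what the equivalent forms listed above suggest.

```python
def normalize_repertoire_value(value: str) -> str:
    """Fold case and drop space, hyphen, and underscore characters."""
    return "".join(ch for ch in value.casefold() if ch not in " -_")


def repertoire_values_match(a: str, b: str) -> bool:
    """Return True when two values are equivalent under the rules above."""
    return normalize_repertoire_value(a) == normalize_repertoire_value(b)
```

With this sketch, the three forms listed above all reduce to the same key, so any pair of them matches.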

Issue: align these rules with a similar discussion going on in the Semantic Model group. Compare these rules with rules for other elements, e.g. charsets-supported, as well as common practice in other areas such as web browsers.

Issue: should we define more rules for how these names are formed? Canonical form (e.g. all lower case)? Alternatively, this is moot if the client implements the matching rules. What about IANA's use of multiple aliases for the same charset...should we mandate use of the preferred form?

Individual transport protocols may place further restrictions on the use of upper/lower case, and the use of space, hyphen, and underscore characters.

3.3. Font Sensitivity

The semantic element "repertoires-supported" does not correlate with particular fonts. If a character is present in an advertised repertoire, then the printer must be able to render that character regardless of the currently selected font. However, renderings in different fonts need not be distinct. A common approach is for the printer to implement a system default font with all advertised characters, and to implement a fall-through mechanism that will render a character from the default font if it is not available in the currently selected font.

3.4. Basic Repertoires

In order to promote interoperability, this document designates a small number of repertoires as "basic". In this way a print client that only knows the names of the basic repertoires can get useful results.

The repertoires designated as basic are:

Unicode: Basic Latin
Unicode: Latin-1 Supplement
Unicode: Latin Extended-A
Unicode: Greek
Unicode: Cyrillic
Unicode: Hebrew
Unicode: Arabic
Unicode: Thai
Unihan: GB 2312
Unihan: JIS X 0208
Unihan: KS X 1001:1992
Unihan: Big5

Latin Extended-A is used primarily by Latin-based languages in Eastern Europe. The last four repertoires support the PRC, Japan, Korea, and Taiwan, respectively.

Issue: should we postpone inclusion of Hebrew, Arabic, and Thai?

A conforming printer must advertise a basic repertoire whenever it advertises similar repertoires. For example, any printer advertising any Cyrillic repertoire must also advertise "Unicode: Cyrillic". In this way a client that does not recognize a large number of repertoires can still recognize that basic Cyrillic printing is possible on this device.

Printers will often support larger repertoires. If a printer supports a repertoire that is a superset of a basic repertoire, then it must advertise the basic repertoire in addition to the superset.
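
Example (informative): the rule above can be checked mechanically once it is known which repertoires imply a basic repertoire. This document does not define such a list, so the `IMPLIED_BASIC` table in the Python sketch below is purely hypothetical, as are the function names. Values are compared using the matching rules of section 3.2.

```python
def _norm(value: str) -> str:
    # Matching rules from section 3.2: ignore case, space, hyphen, underscore.
    return "".join(ch for ch in value.casefold() if ch not in " -_")


# Hypothetical table: which advertised repertoires imply which basic one.
# The document leaves "similar" and "superset" relationships to the implementer.
IMPLIED_BASIC = {
    "IANA: koi8-r": "Unicode: Cyrillic",
    "IANA: windows-1251": "Unicode: Cyrillic",
    "Unihan: JIS X 0213": "Unihan: JIS X 0208",
}


def missing_basic_repertoires(advertised):
    """List basic repertoires that are implied by the list but not advertised."""
    have = {_norm(v) for v in advertised}
    implied = {_norm(k): basic for k, basic in IMPLIED_BASIC.items()}
    missing = []
    for value in advertised:
        basic = implied.get(_norm(value))
        if basic is not None and _norm(basic) not in have and basic not in missing:
            missing.append(basic)
    return missing
```

Under these assumptions, a printer advertising only "Unicode: Basic Latin" and "IANA: koi8-r" would be reported as missing "Unicode: Cyrillic".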

3.5. Extensions

A printer may implement several types of extensions without losing conformance with this document. Examples include:

Advertise other repertoires in addition to the basic repertoires; for example:
- Unicode: Basic Latin, Unicode: Latin-1 Supplement, IANA: windows-1252 [adds Windows characters]
- Unicode: Basic Latin, Unicode: Latin-1 Supplement, Unihan: JIS X 0208, Unihan: JIS X 0213 [adds characters from 0213]
Implement additional characters without advertising them.
Implement characters that are available only in certain fonts. However, these characters must not be advertised using this mechanism.

4. Conformance

4.1. Printer Conformance

A conforming printer must follow these rules:

All printers must support and advertise:
- Unicode: Basic Latin, and
- Unicode: Latin-1 Supplement
All printers must support the euro (U+20AC) character, even though it is not advertised.
A basic repertoire must be supported whenever similar ones are, as described above.
If a printer supports a repertoire, it must be able to render all characters from the repertoire, regardless of selected font, as described above.

There is no requirement that every supported character is represented in some repertoire; a printer may support specific characters without advertising them. In some languages (e.g. those based on XHTML) certain characters are implicitly supported (e.g. as built-in character entities), without being advertised in any repertoire.
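
Example (informative): the first rule in this section, that every printer advertises "Unicode: Basic Latin" and "Unicode: Latin-1 Supplement", lends itself to a simple check. The Python sketch below uses hypothetical names and the matching rules of section 3.2. It looks only at the advertised list; whether the printer can actually render the characters, including the euro, cannot be verified this way.

```python
# Repertoires that section 4.1 requires every printer to advertise.
REQUIRED_REPERTOIRES = ("Unicode: Basic Latin", "Unicode: Latin-1 Supplement")


def _norm(value: str) -> str:
    # Matching rules from section 3.2: ignore case, space, hyphen, underscore.
    return "".join(ch for ch in value.casefold() if ch not in " -_")


def missing_required_repertoires(advertised):
    """Return the required repertoires absent from an advertised list."""
    have = {_norm(v) for v in advertised}
    return [r for r in REQUIRED_REPERTOIRES if _norm(r) not in have]
```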

4.2. Print Client Operation (informative)

Printing protocols (outside of this document) specify how a print client learns about the supported repertoires in a printer. Once it knows, a client may choose to use this knowledge in any of these ways:

If multiple printers are available, look for one that can print all the characters in a job.
If printing to a printer that can't print all the characters in a job, warn the user.
Make a substitution for a character that won't print.
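
Example (informative): the choices above all depend on knowing which characters in a job fall outside the advertised repertoires. The Python sketch below shows one way a client might compute this for a few of the Unicode basic repertoires, using the code point ranges of the corresponding Unicode blocks. The function and table names are hypothetical. As simplifications, every code point in a block is treated as printable, repertoires the client does not recognize are treated as covering nothing, and the euro is always accepted because section 4.1 requires it.

```python
# Inclusive code point ranges of some Unicode blocks, keyed by the
# normalized form of the repertoire value (see section 3.2).
BLOCK_RANGES = {
    "unicode:basiclatin": (0x0000, 0x007F),
    "unicode:latin1supplement": (0x0080, 0x00FF),
    "unicode:latinextendeda": (0x0100, 0x017F),
    "unicode:cyrillic": (0x0400, 0x04FF),
}

EURO = 0x20AC  # required of every conforming printer, though not advertised


def _norm(value: str) -> str:
    # Matching rules from section 3.2: ignore case, space, hyphen, underscore.
    return "".join(ch for ch in value.casefold() if ch not in " -_")


def unprintable_characters(text, advertised):
    """Return the sorted characters of text not covered by the advertised list."""
    # Unrecognized repertoires are skipped, which errs on the side of warning.
    ranges = [BLOCK_RANGES[key] for key in map(_norm, advertised)
              if key in BLOCK_RANGES]
    missing = set()
    for ch in text:
        cp = ord(ch)
        if cp == EURO:
            continue
        if not any(lo <= cp <= hi for lo, hi in ranges):
            missing.add(ch)
    return sorted(missing)
```

An empty result means the job can be sent as is; otherwise the client can warn the user, substitute the listed characters, or look for another printer.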

5. Acknowledgements (informative)

This document was prepared with input and assistance from:

Rick Seeler, Adobe
Rod Acosta, Agfa Monotype
Lee Farrell, Canon
Jun Fujisawa, Canon
Michael Sweet, Easy Software
Michael Wu, Heidelberg
Ira McDonald, High North
Jim Bigelow, HP
Harry Lewis, IBM
Paul Tykodi, Intermate US
Mark Robb, Lexmark
Don Wright, Lexmark
Don Levinstone, Motorola
Peter Zehler, Xerox

A. References


[IANA-Charsets]: Available online at http://www.iana.org/assignments/character-sets.
[PWG-SM]: PWG Semantic Model. Available online at http://www.pwg.org/sm/index.html.
[RFC2119]: "RFC2119 - Key words for use in RFCs to Indicate Requirement Levels", S. Bradner. Available online at http://www.ietf.org/rfc/rfc2119.
[Unicode-Charts]: Unicode code charts. Available online at http://www.unicode.org/charts.
[Unihan]: Unicode Unihan database, which includes mappings to major CJK character set standards. Available online at http://www.unicode.org/charts/unihan.html

B. Bindings to IPP

(To be written.)

C. Bindings to Semantic Model (XML)