Character Repertoires Implementor's Guide

Printer Working Group Draft, December, 2002

Editors: Elliott Bradshaw, Oak Technology Imaging Group

Abstract

When sending a job to a printer, a print client (PC or other device) needs to make sure the printer has the ability to print the characters in the job. On PCs and similar devices, clients traditionally use font downloading to supply characters which the printer may not have. On smaller devices, including PDAs, set-top boxes, etc., this will often not be an option.

This document provides guidance for implementors of printers and printing clients, including summaries and references to existing standards, recommended practices, and recommendations for future standards.

Status of this Document

This document is informative only. It has not been reviewed by PWG Members nor approved. It is not a stable document and may not be cited as a normative reference from another document.

Public discussion of Character Repertoires takes place on the mailing list: cr@pwg.org (archive). To subscribe, send an email to majordomo@pwg.org with the words "subscribe cr" in the body. You must be subscribed to the mailing list to post there. Please report errors in this document to one of the editors listed above or on the mailing list.

A list of current PWG Standards and other technical documents can be found at http://www.pwg.org/standards.html.

General Approach

There is very little new material in this document. Rather, it is an attempt to summarize a complex subject, provide a conceptual framework, and bring together references so that a non-specialist can quickly find what is needed for managing printable characters. [The present author often feels, while surfing the web, that he is rediscovering what was well-known in a different time and place.]

A second goal is to clarify areas where more standards work is needed.

We assume the reader has some familiarity with Internet technologies such as Unicode, MIME, and XML. Older technologies are used only as needed for specific applications, and can usually be mapped into or associated with corresponding Internet technologies. This approach has two principal advantages:

"Forward-looking" applications such as XHTML-Print can be built without knowledge of legacy technologies
The Internet technologies are well documented and widely understood, thus providing a reliable basis for common understanding

Terminology

The term "character set" is confusing. It is most often used to mean a specific set of (abstract) characters, each represented in a specific way (almost always as a series of octets). Knowing the character set, you know what bits to expect for each character. This is also the meaning of the term "charset" as used in MIME and XML.

Unicode adds a layer of abstraction, defining each character as an integer, and allowing for multiple coding schemes to represent each integer as a sequence of octets. In this sense Unicode is not a "character set", but might perhaps be called a "set of abstract characters."

In this document we preserve this distinction. We are primarily concerned with the set of abstract characters supported in a printer, relying on the (perhaps erroneous) assumption that multiple encodings would nevertheless access the same character data.

We adopt the term "repertoire" to mean a specified subset of Unicode characters, without regard to how they are encoded. When we use the term "character set", we mean a specific encoding, not necessarily Unicode.

Examples of "character sets":

UTF-8 (all the characters of Unicode, encoded in a specific way)
ISO 8859-1, which gives a specific coded value for each character
Shift-JIS

Examples of "repertoires":

Unicode itself
"The Unicode characters that map to ISO-8859-1"
"The Unicode characters that map to Shift-JIS"

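As a minimal sketch of this distinction, the repertoire that corresponds to a single-byte character set can be derived by decoding every byte value and collecting the resulting Unicode code points. The sketch assumes Python's bundled codecs as stand-ins for the published mapping tables, and the function name repertoire_of is hypothetical. It does not cover multi-byte character sets such as Shift-JIS.

```python
def repertoire_of(charset):
    """Return the set of Unicode code points reachable from a single-byte charset."""
    code_points = set()
    for value in range(256):
        try:
            text = bytes([value]).decode(charset)
        except UnicodeDecodeError:
            continue  # this byte value is unassigned in the charset
        code_points.update(ord(ch) for ch in text)
    return frozenset(code_points)


latin1 = repertoire_of("iso-8859-1")
print(len(latin1))        # size of the repertoire
print(0x00E9 in latin1)   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
print(0x0436 in latin1)   # U+0436 CYRILLIC SMALL LETTER ZHE
```

The result is a plain set of integers: it says which characters exist, and nothing about how they are encoded.
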
Unicode

In this document we rely heavily on the Unicode scheme for organizing characters. Much of the following material is excerpted from [UC-Principles].

Unicode is a widely-adopted, worldwide character encoding standard. For each character it defines:

A number, known as a code point, usually written in hexadecimal preceded by "U+"
A name

Some examples include:

U+0041 "LATIN CAPITAL LETTER A"
U+0180 "LATIN SMALL LETTER B WITH STROKE"
U+0436 "CYRILLIC SMALL LETTER ZHE"
U+0624 "ARABIC LETTER WAW WITH HAMZA ABOVE"
U+0A1B "GURMUKHI LETTER CHA"
U+2733 "EIGHT SPOKED ASTERISK"
U+30A4 "KATAKANA LETTER I"
U+3204 "PARENTHESIZED HANGUL MIEUM"
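
For readers who want to look up such pairs themselves, the following sketch prints the code point and name of a character using Python's standard unicodedata module. The helper name describe is hypothetical, and the names returned depend on the Unicode version bundled with the Python installation.

```python
import unicodedata


def describe(code_point):
    """Format a code point in the 'U+XXXX NAME' style used above."""
    return f"U+{code_point:04X} {unicodedata.name(chr(code_point))}"


for code_point in (0x0041, 0x0436, 0x2733, 0x30A4):
    print(describe(code_point))
```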

The actual appearance of the character on paper or screen is called a "glyph", and varies based on device, font, etc. Unicode does not define glyphs, although it does give examples.

Character encodings define how these numeric values are represented in bits. Unicode defines three encodings:

UTF-8. Each character is represented in 1-4 bytes. The Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII.
UTF-16. Each character is represented in 1 or 2 16-bit words.
UTF-32. Each character is represented in 1 32-bit word.
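
The difference between the three encodings can be seen by encoding a few characters and counting the resulting octets. This is only an illustrative sketch; the helper name encoded_sizes is hypothetical, and the big-endian codec variants are chosen so that no byte order mark is added to the output.

```python
def encoded_sizes(ch):
    """Return the number of octets one character occupies in each encoding."""
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "UTF-16": len(ch.encode("utf-16-be")),
        "UTF-32": len(ch.encode("utf-32-be")),
    }


# U+1D11E lies outside the 16-bit range, so UTF-16 needs two words for it.
for code_point in (0x0041, 0x0436, 0x30A4, 0x1D11E):
    print(f"U+{code_point:04X}", encoded_sizes(chr(code_point)))
```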

In order to print successfully, a client needs to know both which characters (code points) are available and which encodings can be used.

There are many character sets that are not based on Unicode, and several of these are important for printing existing documents. Fortunately, nearly all have published mappings into Unicode. Therefore, knowing what Unicode characters are available, a client can deduce which characters are available from an alternate character set. In addition, the client needs to find out whether the printer can accept characters encoded in the alternate character set, or whether the client must map them to Unicode.
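
A client-side check along these lines might look like the sketch below. It assumes the client already holds the printer's repertoire as a set of code points and that Python's codec for the alternate character set matches the published mapping. The function name missing_characters and the sample printer are hypothetical.

```python
def missing_characters(data, charset, printer_repertoire):
    """Decode a job from its charset and return the characters the printer lacks."""
    text = data.decode(charset)
    return {ch for ch in text if ord(ch) not in printer_repertoire}


# Hypothetical printer holding only the graphic characters of ISO-8859-1.
printer_repertoire = set(range(0x20, 0x7F)) | set(range(0xA0, 0x100))

# A job in Windows code page 1252; byte 0x80 is the euro sign there.
job = b"Preis: 10 \x80"
print(missing_characters(job, "cp1252", printer_repertoire))
```

An empty result means every character in the job maps to a code point the printer holds; whether the bytes can be sent as-is still depends on the encodings the printer accepts.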

Of course a printer may be configured to accept other character sets, but not those based on Unicode. However, such a printer is outside the scope of this Guide.

In summary, before printing a job a client needs to determine this information about the printer:

Which character encoding schemes are available (e.g. Unicode UTF-8, Shift-JIS)
Within each scheme, which characters are available

It may also be that some characters are conditionally available, e.g. only when certain fonts are selected. This topic is reserved for future work, and is not considered in this Guide. In fact, one recommendation is that a printer implement a system default font that can be used to render its full character set, and that this font be used as a fall-through to handle missing characters in other fonts.

W3C, Character Sets, and the Internet

The IANA registry (long) of character sets is available at http://www.iana.org/assignments/character-sets. Every registered charset contains at least:

a primary name
reference to an RFC or other publicly available specification
if feasible, a mapping to Unicode

In some cases, an alternate "preferred MIME name" is given. In those cases that is the name we use.

Part of the PWG's mission should be to identify a short list of character sets preferred for use in printing applications. Wherever possible we use the IANA name for a character set.

Discussion of Referenced Character Sets

Latin/European

[ISO-8859] defines various Latin-based alphabets (each up to 256 characters in size), while [Unicode-8859] is a set of mappings from ISO codes to Unicode code points.

Microsoft

Microsoft publishes a number of single- and multi-byte code pages, at http://www.microsoft.com/globaldev/reference/cphome.asp. These are all defined in terms of Unicode.

As part of their OpenType specification, Microsoft defines the WGL4.0 character set, which is expressed in terms of Unicode. It has 652 characters, containing many of the characters from the ISO Latin sets, as well as quite a few symbols.

World Wide Web Consortium

[XHTML-Chars] defines a number of pre-defined character entities, in these groups:

Latin 1 (96 entries)
Special Characters (33 entries)
Mathematical, Greek, and Symbolic (124 entries)

For a total of 253 entries.

Summary of Non-Asian Characters

You can compare the ISO-8859, Microsoft, and XHTML repertoires side by side here.

Asian

These are the relevant fields in the [Unihan] database:

kGB0: Chinese PRC: GB 2312-80
kJIS0213/kJis0/kJis1 ?: Japanese:
kKSC0: Korean: KS C 5601
kBigFive: Chinese Taiwan:

For Thai, use 8859-11, which is equivalent to TIS 620-2533 (1990) with the addition of 0xA0 NO-BREAK SPACE.
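
One way to inspect the relationship between the two Thai character sets is to compare them byte by byte. The sketch below uses Python's iso8859_11 and tis_620 codecs as stand-ins for the standards themselves; their tables may not match the published documents exactly, so the output should be treated as a cross-check rather than a reference. The helper name decode_byte is hypothetical.

```python
def decode_byte(value, charset):
    """Decode one byte value, returning None if it is unassigned in the charset."""
    try:
        return bytes([value]).decode(charset)
    except UnicodeDecodeError:
        return None


# Byte values on which the two codecs disagree.
differences = [
    value for value in range(256)
    if decode_byte(value, "iso8859_11") != decode_byte(value, "tis_620")
]

for value in differences:
    print(f"0x{value:02X}",
          repr(decode_byte(value, "iso8859_11")),
          repr(decode_byte(value, "tis_620")))
```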

Named Character Repertoires

The PWG will define a standard set of repertoire names to be used for printing capabilities. The draft version of this list is:

PWG Character Repertoire	Based on IANA Charset	Description	Reference Location
ISO-8859-1	ISO-8859-1	Latin alphabet No. 1	RFC1345
ISO-8859-2	ISO-8859-2	Latin alphabet No. 2	RFC1345
ISO-8859-3	ISO-8859-3	Latin alphabet No. 3	RFC1345
ISO-8859-4	ISO-8859-4	Latin alphabet No. 4	RFC1345
ISO-8859-5	ISO-8859-5	Latin/Cyrillic alphabet	RFC1345
ISO-8859-6	ISO-8859-6	Latin/Arabic alphabet	RFC1345
ISO-8859-7	ISO-8859-7	Latin/Greek alphabet	RFC1345
ISO-8859-8	ISO-8859-8	Latin/Hebrew alphabet	RFC1345
ISO-8859-9	ISO-8859-9	Latin alphabet No. 5	RFC1345
ISO-8859-10	ISO-8859-10	Latin alphabet No. 6	RFC1345
ISO-8859-13	ISO-8859-13	Latin alphabet No. 7	http://www.iana.org/assignments/charset-reg/iso-8859-13
ISO-8859-14	ISO-8859-14	Latin alphabet No. 8	http://www.iana.org/assignments/charset-reg/iso-8859-14
ISO-8859-15	ISO-8859-15	Latin alphabet No. 9	http://www.iana.org/assignments/charset-reg/ISO-8859-15
ISO-8859-16	ISO-8859-16	Latin alphabet No. 10	??? Could use http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-16.TXT
GB_2312-80	GB_2312-80	Chinese (People’s Republic of China)	RFC1345
Shift_JIS	Shift_JIS	Japanese	"Appendix 1 of JIS X0208:1997," but where is this? Unicode Unihan database has entries ("Jis1") for JIS X 0212-1990
KS_C_5601-1987	KS_C_5601-1987	Korean	RFC1345
Big5	Big5	Chinese (Taiwan)	"Chinese for Taiwan Multi-byte set. PCL Symbol Set Id: 18T", but where is this?
TIS-620	TIS-620	Thai	???. maybe http://www.nectec.or.th/it-standards/std620/std620.htm (in Thai)
XHTML			http://www.w3.org/TR/xhtml-modularization/dtd_module_defs.html#a_xhtml_character_entities
<to be specified>		Microsoft symbols

Protocol Bindings

Printer Capabilities

The PWG will develop recommendations for built-in repertoires, based on the advertised service level of the printer. (These service levels are intended to align with those for XHTML-Print, Bluetooth, and UPnP printing.) If a client knows that a printer implements one of these service levels, it may assume the presence of the given repertoires.

A draft version of this list is:

Print Service Level	Built-in Repertoires
Basic	ISO-8859-1
Enhanced	TBD

Capability Queries

Various protocols provide a way for a client to find out information about a printer's capabilities. These protocols should be extended to define how the client can learn what repertoires are available in a printer. Note that this query, if implemented, should always include the built-in repertoires for the service level offered by the printer.

The fundamental semantic unit for getting this capability is an attribute named "repertoires-supported" on the Printer object. The value is a comma-separated string containing the PWG names of the supported repertoires. Various protocols may map these names to other forms of representation. For example, the Bluetooth Basic Printing Profile uses bits in a bitmap, while the Printer MIB uses string names with no punctuation.
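
A client might split such a value into repertoire names as in the sketch below. The draft does not yet say whether whitespace around the commas is allowed, so trimming it here is an assumption, as are the function name and the sample value.

```python
def parse_repertoires_supported(value):
    """Split a comma-separated repertoires-supported value into PWG names."""
    # Assumption: surrounding whitespace and empty items are ignored.
    return [name.strip() for name in value.split(",") if name.strip()]


print(parse_repertoires_supported("ISO-8859-1, ISO-8859-15,Shift_JIS"))
```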

In addition, a protocol may provide a mechanism for discovering particular character sets that may be sent directly. The repertoires-supported attribute does not necessarily reflect characters available in non-Unicode character sets.

Queries associating available repertoires with fonts, charsets, PDLs, etc. are reserved for future study.

Recommendations for the Printer Implementor

Always implement Unicode UTF-8, in addition to any other character encoding schemes.
Define whether the printer offers Basic or Enhanced services, and include at least the repertoires defined by those services.
Make supported characters available in all fonts, using a system font fall-through if needed.
Print a recognizable "missing character" symbol for any character not supported.
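
The last two recommendations can be sketched as a simple lookup order: the selected font first, then the system font, then the "missing character" symbol. Fonts are modeled here as plain dictionaries from characters to glyph identifiers. That model, the names, and the placeholder value are all hypothetical and stand in for whatever font machinery a real printer uses.

```python
MISSING_GLYPH = "missing-character-symbol"  # placeholder for the printer's own symbol


def select_glyph(ch, selected_font, system_font):
    """Return the glyph for ch, falling through to the system font if needed."""
    for font in (selected_font, system_font):
        glyph = font.get(ch)
        if glyph is not None:
            return glyph
    return MISSING_GLYPH


# Hypothetical fonts: the selected font lacks the euro sign, the system font has it.
selected_font = {"A": "selected:A"}
system_font = {"A": "system:A", "\u20ac": "system:euro"}

print(select_glyph("A", selected_font, system_font))
print(select_glyph("\u20ac", selected_font, system_font))
print(select_glyph("\u0436", selected_font, system_font))
```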

Recommendations for the Client Implementor

If the printer provides a query mechanism to obtain supported encoding schemes and repertoires, use it to find out what the printer can handle.
Otherwise, if the printer advertises (either by specification, or through a protocol) that it supports a particular service level, assume the repertoires for that service level are available.
Otherwise, assume the printer can handle the Basic Printing service level, but not necessarily any others.
If the source document is not in Unicode, decide whether or not to map it to Unicode. Usually, if the printer can handle the original encoding it is best to send it unmapped.
If the document contains characters that won't print, decide whether to alert the user, map them to some other characters, let the printer handle them, etc.
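
The first four recommendations amount to a short decision procedure, sketched below. The table holds only the Basic level because the draft leaves Enhanced as TBD; treating an unknown or undefined level as Basic is an assumption, as are the function names and the use of "UTF-8" as the fallback encoding name.

```python
# Draft service-level table; Enhanced is TBD in this Guide and therefore omitted.
SERVICE_LEVEL_REPERTOIRES = {"Basic": ["ISO-8859-1"]}


def assumed_repertoires(queried=None, service_level=None):
    """Pick the repertoires a client may rely on, in the recommended order."""
    if queried is not None:                       # 1. the printer answered a query
        return list(queried)
    if service_level in SERVICE_LEVEL_REPERTOIRES:  # 2. advertised service level
        return list(SERVICE_LEVEL_REPERTOIRES[service_level])
    return list(SERVICE_LEVEL_REPERTOIRES["Basic"])  # 3. assume Basic only


def choose_encoding(source_charset, printer_charsets):
    """Send the document unmapped if the printer accepts its charset, else map to UTF-8."""
    return source_charset if source_charset in printer_charsets else "UTF-8"


print(assumed_repertoires(queried=["ISO-8859-1", "Shift_JIS"]))
print(assumed_repertoires(service_level="Basic"))
print(assumed_repertoires())
print(choose_encoding("Shift_JIS", {"UTF-8", "Shift_JIS"}))
print(choose_encoding("Big5", {"UTF-8"}))
```

The fifth recommendation, handling characters that will not print, is a policy decision for the client and is left out of the sketch.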

Recommendations for Standards Work

This section is directed at the Printer Working Group, with suggestions for standards that need to be developed.

Adopt a standard set of character repertoire names.
Define the standard set of characters to be available in all printers.
Define the semantics of a query mechanism to determine which coding schemes and character repertoires are available in a printer.
Agree on and publish normative references for mapping between other schemes and Unicode.

Issues

How do we reference ISO-8859? Is there a version online, or does every reader need to buy it from ISO? If so, we should list exactly what they need to buy.
What about other ISO-8859 components?
- 8859-11: Latin/Thai
- 8859-12: does not exist
- 8859-16: Latin alphabet No. 10
Does the presence of a codepoint with right-to-left property imply that bidi processing is required in the printer?

References

[BPP]: "Bluetooth Basic Printing Profile", Bluetooth SIG, October 5, 2001. Available at: http://www.bluetooth.com/pdf/Basic_Printing_Profile_0_95a.pdf.
[ISO-8859]: ...purchase each alphabet online at http://www.iso.org.
[Unicode-8859]: Mapping tables from 8859 alphabets to Unicode. http://www.unicode.org/Public/MAPPINGS/ISO8859/
[Gloss]: Unicode Glossary. http://www.unicode.org/glossary/
[Unihan]: Asian property database for Unicode; includes mappings from other alphabets. A very large file; zip form available at http://www.unicode.org/Public/UNIDATA/Unihan.zip.
[WGL4.0-desc]: Description of Microsoft's character set standard which "includes characters required by Western, Central, and Eastern European writing systems, as well as characters required by Greek and Turkish." http://www.microsoft.com/typography/unicode/cscp.htm
[WGL4.0-data]: Unicode values for WGL4.0. http://www.microsoft.com/typography/OTSPEC/WGL4.htm.
[XHTML-Chars]: Predefined character entities in XHTML. http://www.w3.org/TR/xhtml-modularization/dtd_module_defs.html#a_xhtml_character_entities
[UC-Principles]: The Unicode® Standard: A Technical Introduction; http://www.unicode.org/standard/principles.html