Proposal: Specify UTF encoding of Unicode strings

Sairus Patel sppatel at
Thu Nov 24 01:05:03 CET 2011

[I'm sending this to both the OT and OFF lists, per guidelines from the specifications' editors.]

=== Background

I don't believe OT/OFF specifies the encoding of 'name' table Unicode strings.

Microsoft and others, please verify in particular that <platform=3, encoding=10> strings must be in UTF-16. There are no fonts with such strings in my Windows 7 fonts folder so I can't easily comment on current practice.

Also, I don't believe the specification states anywhere that <3,0> strings have Unicode semantics, though that's how current "Windows symbol" fonts are made. I chose UCS-2 in the proposal below because <3,0> predates <3,10>, so existing parsers may choke on surrogate pairs in <3,0>. But I'm fine with UCS-4 if that's what is preferred.
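
To make the surrogate-pair concern concrete, here is a minimal Python sketch of the check a font tool might run on a <3,0> string. The function name is_ucs2 is hypothetical, and the sketch assumes the string is stored as big-endian 16-bit code units; it is an illustration, not anything taken from the specification.

```python
import struct

def is_ucs2(data: bytes) -> bool:
    """Return True if big-endian 16-bit data contains no surrogate code units.

    Hypothetical helper: a string that passes this check reads the same
    whether it is interpreted as UCS-2 or as UTF-16.
    """
    if len(data) % 2 != 0:
        # An odd byte count cannot be a sequence of 16-bit code units.
        return False
    units = struct.unpack(">%dH" % (len(data) // 2), data)
    # Surrogate code units occupy the range 0xD800-0xDFFF.
    return all(not (0xD800 <= u <= 0xDFFF) for u in units)
```

A string containing only BMP characters passes; any character outside the BMP is encoded in UTF-16 as a surrogate pair and so fails.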

=== Proposal { my comments are in curly brackets }

{ In [OFF sec. 5.2.6]: }

1. { Insert the following sentence at the end of the paragraph "Unicode platform encoding ID 5 can be used for encodings in the 'cmap' table but not for strings in the 'name' table.": }

Strings for all Unicode platform encoding IDs other than 5 must be encoded in UTF-16 (big endian).

2. { Insert the following paragraphs at the end of the "Windows platform-specific encoding IDs (platform ID= 3)" section: }

Strings for Windows platform encoding ID 0 are considered to have Unicode semantics (UCS-2).

Strings for Windows platform encoding IDs 0, 1, and 10 must be encoded in UTF-16 (big endian).
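
{ Illustration only, not proposed specification text. The Python sketch below shows how a parser might apply the two rules above when decoding a 'name' record. The function name decode_name_string and the choice to reject non-BMP characters in <3,0> are my assumptions; platform/encoding combinations not covered by the proposal are simply refused. }

```python
def decode_name_string(platform_id: int, encoding_id: int, data: bytes) -> str:
    """Decode a 'name' table string under the rules proposed above.

    Hypothetical helper; only the Unicode (0) and Windows (3) platform
    cases discussed in the proposal are handled.
    """
    if platform_id == 0:
        if encoding_id == 5:
            # Encoding ID 5 is for 'cmap' only, never for 'name' strings.
            raise ValueError("Unicode encoding ID 5 is not valid in 'name'")
        return data.decode("utf-16-be")

    if platform_id == 3 and encoding_id in (0, 1, 10):
        text = data.decode("utf-16-be")
        # Assumption: treat <3,0> as UCS-2, so reject anything that
        # required a surrogate pair (a code point above U+FFFF).
        if encoding_id == 0 and any(ord(ch) > 0xFFFF for ch in text):
            raise ValueError("<3,0> strings must not contain surrogate pairs")
        return text

    raise ValueError("platform/encoding pair not covered by this sketch")
```

{ Malformed input, such as an unpaired surrogate or an odd byte count, surfaces as UnicodeDecodeError from the decoder, which is itself a ValueError. }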

