[MPEG-OTSPEC] comments wrt wide glyph ID proposal

Tue Dec 12 03:54:14 CET 2023

I've discussed the Nov 20 draft proposal for wide glyph IDs with others at Microsoft and we have some comments to bring up in tomorrow's AHG meeting. I'll summarize them in mail to give a (brief) chance to consider in advance. (I'll send separate mail with comments regarding cubic beziers.)

The proposal: WG03-beyond-64k-glyphs-2023-11-20.pdf, in iso_docs on the main branch of harfbuzz/boring-expansion-spec: <https://github.com/harfbuzz/boring-expansion-spec/blob/main/iso_docs/WG03-beyond-64k-glyphs-2023-11-20.pdf>

24-bit or 32-bit:
The doc mentions the entire Unicode codespace as being ~21 bits. That implies that, even if every Unicode code point were assigned a unique character (which is unlikely to happen during this century), a 24-bit glyph ID space could still allow for about 15 glyphs (on average; 2^24 / 1,114,112) for every Unicode character.

It occurs to us, in addition, that the overall file size of the current format is constrained by 32-bit offsets, and that imposes a practical limitation on total glyph count. If a font had 2^24 glyphs, then on average the total data per glyph (outlines, metrics, GSUB, etc.) would have to be 256 bytes or less.
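As a quick check of the two back-of-envelope figures above, here is a minimal sketch. The constant names are mine, and it assumes the codespace is U+0000..U+10FFFF and that the whole file must fit within what a 32-bit offset can address.

```python
# Back-of-envelope check; names are illustrative, not from the proposal.
UNICODE_CODE_POINTS = 0x110000   # U+0000..U+10FFFF, i.e. 1,114,112 code points
GLYPH_IDS_24BIT = 1 << 24        # 16,777,216 possible 24-bit glyph IDs
MAX_FILE_BYTES = 1 << 32         # upper bound implied by 32-bit offsets

# Average number of glyph IDs available per code point.
glyphs_per_code_point = GLYPH_IDS_24BIT / UNICODE_CODE_POINTS

# Average byte budget per glyph if all 2^24 glyph IDs were used.
bytes_per_glyph = MAX_FILE_BYTES // GLYPH_IDS_24BIT

print(f"{glyphs_per_code_point:.2f} glyph IDs per code point")  # 15.06
print(f"{bytes_per_glyph} bytes per glyph")                     # 256
```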

A potential way for glyph IDs to be consumed independent of Unicode assigned characters or total file size would be virtual glyphs used in transient derivational steps in the GSUB table. Even so, we think that 2^24 is plenty adequate for now.

Hybrid narrow/wide fonts:
Hybrid fonts are going to be more challenging to build and maintain, much more so than hybrid COLRv0/v1. Attempting to engineer mechanisms specifically to accommodate hybrid fonts is likely to add to complexity.

From our perspective, the most interesting potential use case for hybrid narrow/wide fonts would be for pan-CJK: there are more CJK characters in Unicode than can currently fit into a single font, and so some current designs may be split across multiple fonts; e.g., SimSun, SimSun-ExtB and (in development) SimSun-ExtG in Windows. It would be great to be able to merge these into a single font. However, it may be necessary to support older software with the same font file.

One take-away for us from this is that we shouldn't over-engineer hybrid narrow/wide support into tables where it may not be beneficial. For example, we think it should be possible to have colour fonts with wide glyph IDs, but we wouldn't make it a priority to create hybrid narrow/wide colour fonts.

TTCs:
A second take-away for us from thinking about hybrid fonts is that we think TTCs can provide another approach to creating hybrid fonts, one that could be easier for font developers to create and maintain. To that end, we think it would make sense to define a v2.1 TTC header that adds numFonts2 and tableDirectoryOffsets2 members, and provide guidance that software that supports wide glyph IDs should use only these new members, ignoring numFonts and tableDirectoryOffsets. In this way, older software could see only fonts with narrow glyph IDs, while newer software could see a distinct set of fonts without duplication.
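To make the TTC idea concrete, here is a rough sketch of how a reader might pick its font list. The byte layout is an assumption on my part: it takes the existing v2.0 header (ttcTag, majorVersion, minorVersion, numFonts, tableDirectoryOffsets, then the three DSIG fields) and simply appends numFonts2 and tableDirectoryOffsets2. Nothing about that placement has been agreed, and the function names are hypothetical.

```python
import struct

def build_ttc_header_v2_1(narrow_offsets, wide_offsets):
    """Pack a hypothetical v2.1 TTC header (field placement is an assumption)."""
    data = struct.pack(">4sHHI", b"ttcf", 2, 1, len(narrow_offsets))
    data += struct.pack(f">{len(narrow_offsets)}I", *narrow_offsets)
    data += struct.pack(">III", 0, 0, 0)  # dsigTag, dsigLength, dsigOffset: no DSIG
    data += struct.pack(">I", len(wide_offsets))  # numFonts2
    data += struct.pack(f">{len(wide_offsets)}I", *wide_offsets)  # tableDirectoryOffsets2
    return data

def font_offsets(data, supports_wide_gids):
    """Return the table directory offsets this reader should use."""
    tag, major, minor, num_fonts = struct.unpack_from(">4sHHI", data, 0)
    if tag != b"ttcf":
        raise ValueError("not a TTC header")
    pos = 12  # size of ttcTag + majorVersion + minorVersion + numFonts
    narrow = list(struct.unpack_from(f">{num_fonts}I", data, pos))
    pos += 4 * num_fonts
    # Older software, or an older header, sees only the narrow-GID fonts.
    if not supports_wide_gids or (major, minor) < (2, 1):
        return narrow
    pos += 12  # skip the three DSIG fields
    (num_fonts2,) = struct.unpack_from(">I", data, pos)
    pos += 4
    # Wide-aware software uses only the new members, ignoring the old ones.
    return list(struct.unpack_from(f">{num_fonts2}I", data, pos))

header = build_ttc_header_v2_1(narrow_offsets=[100, 200], wide_offsets=[300])
print(font_offsets(header, supports_wide_gids=False))  # [100, 200]
print(font_offsets(header, supports_wide_gids=True))   # [300]
```

The offsets 100, 200 and 300 are placeholders; the point is only that the two kinds of reader end up with disjoint font lists from the same file.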

This brought to my mind that, six to ten years ago (I forget the exact timeframe), there was discussion between Adobe, Apple and MS about defining a _dmap_ (delta cmap) table for use in TTCs. It's very common in TTCs that there are cmap differences, with the result that each font in the TTC must have its own cmap without any sharing of data. In CJK fonts, the cmap table is one of the largest tables (probably second only to glyf or CFF / CFF2). Moreover, in a CJK font, the majority of mappings in a cmap table could be the same, with only a small portion of mappings being different. (E.g., in MS Gothic vs MS PGothic, all the ideograph glyphs are the same; it's just Latins and punctuation that differ.) A dmap table would allow fonts in a TTC to share a common base cmap table with small, font-specific dmap tables handling differences. In our discussions, we came up with formats that would work, except we hadn't figured out how to handle format 14 cmap subtables.
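Conceptually, the dmap idea amounts to something like the sketch below. This is only an illustration of the sharing model, not any of the binary formats we discussed: the function name, the use of None to mean "unmapped in this font", and the glyph IDs are all made up, and format 14 (variation sequences) is not covered.

```python
def effective_cmap(base_cmap, dmap):
    """Overlay a font-specific delta on the shared base mapping.

    Both arguments map code point -> glyph ID. A delta value of None
    (an assumption of this sketch) removes the mapping for this font.
    """
    merged = dict(base_cmap)  # copy, so the shared base stays untouched
    for code_point, glyph_id in dmap.items():
        if glyph_id is None:
            merged.pop(code_point, None)
        else:
            merged[code_point] = glyph_id
    return merged

# Shared by every font in the collection (glyph IDs are placeholders).
base = {0x0041: 34, 0x3042: 1200, 0x4E00: 5000}

# A proportional variant overrides only the Latin letter.
proportional_dmap = {0x0041: 9034}

cmap = effective_cmap(base, proportional_dmap)
print(cmap[0x0041], cmap[0x4E00])  # 9034 5000
```

Each font in the TTC would then carry only its small delta, while the large ideograph range is stored once.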

COLR, MATH:
We noted that the proposal doesn't include any integration for COLR or MATH tables. There might be several things to consider in relation to the MATH table, and we have no concern with leaving that for future consideration.

But COLR might not be too difficult. So, we think it's worth discussing options:

  1.  Postpone for future consideration.
  2.  Create a new major version - i.e., a new table tag - to design a table with wide glyph IDs (it wouldn't need to support narrow IDs).
  3.  Create a minor version enhancement (COLR v2) that maintains backward compatibility while adding wide support.

The third option would need to add new offsets in the header for wide variants of base glyph and clip lists, with new BaseGlyphPaintRecord2 and ClipRecord2 formats. (There'd also need to be a new PaintGlyph format, but that will be true regardless.)

We haven't yet decided which option we prefer; we just want to get it into discussion.

Max profile:
The current proposal doesn't make any change wrt 'maxp', other than to say numGlyphs isn't used for wide-GID support. In a hybrid font, it's unclear what font developers should do with all the other maxp members: if they're set as appropriate for narrow GIDs, then the values may not work for wide GIDs and the app could run out of resources. On the other hand, if the values are set for wide GIDs, those can work for both narrow and wide, but for older software could lead to over-allocation of unused resources.

Since we're already considering glyf/loca and GLYF/LOCA that can exist side by side, it seems simple and clean to define a MAXP table that gets used only in conjunction with GLYF/LOCA. These tables are small, so the file size impact is negligible.

GPOS/GSUB:
It appears the proposal doesn't yet include wide versions for common table formats that will be required (e.g., coverage). These will, of course, be needed.

This may be an opportunity to deprecate certain formats from use in wide-GID fonts. E.g., GSUB type 5 and GPOS type 7 (contextual) were effectively obsoleted when the chaining contextual formats were added. If we agreed, then Contextual positioning / substitution subtable formats 4 - 6 wouldn't need to be added.

Various formats are proposed using uint24 for subtable counts and Offset24 for subtable offsets. This could turn into a real limitation. For example, consider single substitution format 4: if glyphCount were 5,592,406, then the substituteGlyphIDs[] array (3 bytes per uint24 glyph ID, so 16,777,218 bytes) would exceed 0xFFFFFF in size and Offset24 for coverageOffset would not work. We're inclined to make offsets, and any counts not limited by 24-bit GIDs, 32-bit.
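The arithmetic behind that example can be sketched as follows. It assumes substituteGlyphIDs[] holds 3-byte (uint24) glyph IDs and that the coverage table is stored after the array, so the array size alone is a lower bound on coverageOffset; the subtable header bytes are ignored, and the names are mine.

```python
UINT24_BYTES = 3          # size of one 24-bit glyph ID
OFFSET24_MAX = 0xFFFFFF   # largest value an Offset24 can hold (16,777,215)

def substitute_array_bytes(glyph_count):
    """Byte size of an array of glyph_count uint24 glyph IDs."""
    return glyph_count * UINT24_BYTES

# Smallest glyphCount whose array alone is larger than Offset24 can address.
min_overflow_count = OFFSET24_MAX // UINT24_BYTES + 1

print(min_overflow_count)                          # 5592406
print(substitute_array_bytes(min_overflow_count))  # 16777218
```

One glyph fewer (5,592,405) gives exactly 16,777,215 bytes, which is why the boundary falls where it does.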

Peter Constable
