[MPEG-OTSPEC] [EXTERNAL] Re: dmap proposal
Peter Constable
pconstable at microsoft.com
Fri Dec 22 17:16:20 CET 2023
This discussion is getting very focused on CJK. That was the original context for which Apple started the discussion back in 2016, and pan-CJK fonts are certainly a significant use case for a ‘dmap’ table.
But the potential benefit of a dmap table is not just for pan-CJK fonts. There are many reasons why TTCs get created. Windows has 28 .ttc files, none of which are for pan-CJK fonts. (Most are single-region CJK fonts; three are Thai and one is Latin.)
A key reason for creating a TTC is size savings by de-duplication of data, and the cmap table is one of the larger tables in fonts for which there’s potential to de-duplicate—especially in CJK fonts but not only CJK fonts.
Peter
From: mpeg-otspec <mpeg-otspec-bounces at lists.aau.at> on behalf of Hin-Tak Leung <htl10 at users.sourceforge.net>
Date: Friday, December 22, 2023 at 8:54 AM
To: Ken Lunde <lunde at unicode.org>, Skef Iterum <skef at skef.org>
Cc: mpeg-otspec at lists.aau.at <mpeg-otspec at lists.aau.at>
Subject: [EXTERNAL] Re: [MPEG-OTSPEC] dmap proposal
From a user's perspective (I am native to one of the 5/6), it is easier to have a system-wide configuration, as in I want *all* my applications to prefer one of the 5/6, unless specifically overridden per usage instance. Hence a named subfont is more attractive, compared to locl/lang features to be applied per paragraph/sentence. Yes, I guess you can argue for applying specific locl/lang tags unless overridden too; but as Ken said, applications supporting tags are rare...
For those who care about the subtle differences between Taiwan & HK variant of traditional Chinese, the Taiwan MOE (ministry of education) have/had a document somewhere on their web site about common mis-writings and the "correct" variants among historical forms of various characters (numbers in the hundreds, so it is sizeable but small compared to the 10k/20k daily usage size). The difference is legally important for those who notice the difference.
I am surprised there isn't a 6th - Singapore variant of simplified Chinese.
dmap (and ttc subfonts with different dmaps) has the advantage that it is seen as a more system-wide feature, at least in existing usage of ttc sub fonts.
I like the idea of ttc subfonts having some locl/lang features on by default.
On Friday, 22 December 2023 at 07:58:17 GMT, Skef Iterum <skef at skef.org> wrote:
I stand dystopianed.
However, to not yet give up entirely on this line of thinking ...
What is on the table in these messages is a further extension of an existing table, in this case cmap. Which at least suggests that the problem here isn't "system-level" support -- we think we can get those changes. What you describe is, loosely speaking, "application level" support -- allowing the context that the user interacts with to specify the needed parameters, and then educating the user to do so.
I agree that's hopeless for the foreseeable future.
These dmap ideas do have the benefit of being somewhat general (although one might worry about unusual cases). Maybe other compelling use cases, or just the value of generality itself, justify such an extension. Still, if the fundamental problems are what you describe, we might also consider addressing them directly and specifically. Instead of extending cmap, and building region- or language-specific fonts via a separate mechanism, we should at least consider extending TTC to associate a named subfont with the missing parameters. Basically: "render this set of tables using this script and this language by default". Done a bit subtly, one could just ship every cross-language font file with a "base" font with just the name, and some entries for other scripts and language, suitably named, and otherwise sharing TTC data-structures.
From the perspective of the font engineer that seems more productive than building a cross-language font with one set of mechanisms and then building multiple data-sharing individual language fonts using a different mechanism (assuming we still want engineers to do the former).
Skef
On 12/21/23 18:15, Ken Lunde wrote:
Skef,
I might be the only one in this discussion who clearly remembers that Version 1.000 of Source Han Sans and Noto Sans CJK, which were released on 2014-07-15, *was* utopian in that the fonts with the full set of 64K glyphs, meaning genuine Pan-CJK, expected that language tagging would be used to access the desired non-default region-specific glyphs, with the default glyphs being for Japan. Reality quickly taught us that expecting language tagging alone to solve this was completely unrealistic for the following three reasons:
1) The app must support language tagging
2) The app must support language tagging for the appropriate East Asian languages, which is now up to five for these Pan-CJK fonts
3) Assuming #1 and #2 work, the user must then language-tag the text
Going on 10 years later, not much has changed for #1 and #2.
Modern browsers supported the 'locl' GSUB feature way back in 2014, but support in authoring apps is still severely lacking today.
I use Adobe InDesign to get full language-tagging support for these fonts, which is still about the only game in town. Adobe Illustrator silently added East Asian language-tagging in the 2018 release (in 2017), but it was a "close but no cigar" outcome in that they added only "Chinese" (that turned out to be Traditional Chinese for Taiwan) and Japanese, and despite filing bugs over five years ago, Adobe Illustrator 2024 (in 2023) is still unchanged in this regard. What makes the current support even less useful for mainstream users, ignoring that three of the five East Asian regions are not supported at all, is that the two supported East Asian regions are visible only when creating Character or Paragraph styles. They are not shown in the list of languages in the Character or Properties panels. Adobe Photoshop 2024 (in 2023) still does not support language tagging for East Asian languages.
Getting back to Source Han Sans and Noto Sans CJK, Version 1.001 was released on 2014-09-12, which added separate 64K-glyph fonts for each of the four (at the time) supported East Asian regions. The 'locl' feature is still included for the benefit of those environments that support language tagging. All five regions were not supported until Version 2.000, which was released on 2018-11-19, which meant five separate sets of 64K-glyph fonts. The fifth region, of course, was Hong Kong SAR.
In other words, we are quite far from Utopia, and we are unlikely to arrive there anytime soon.
Regards...
-- Ken
On Dec 21, 2023, at 17:04, Skef Iterum <skef at skef.org> wrote:
More stuff after hitting send too fast:
I can see a set of arguments against trying to deal with these regional problems within a single mega-font grounded one way or another in GIDs being a limited resource. But we've already decided to overcome that problem. So, for example, if we need to spend a GID to, in effect, abstractly represent a given codepoint to bridge from cmap into the shaping tables, we have GIDs to spend now. (And, as implied in my other messages today, wouldn't necessarily have to pay the typical file overhead for them.)
As I understand it that's how regional variations in, e.g., Cyrillic are handled now. So I guess, other than the large number of glyphs in CJK fonts I'm not understanding what requirements are pushing the solution in such a different (and seemingly ad hoc) direction.
Skef
On 12/21/23 16:49, Skef Iterum wrote:
Maybe I'm being utopian but I can't help thinking that either there's some token ("dialect"?) that Unicode should be tracking and formalizing but isn't, or Unicode is doing that and we haven't tilted the font specifications enough in its direction to use it. There's already all of that script and language infrastructure there that is meant for this flavor of need, and it seems like a much better place to be solving these problems than wrapping stuff up in a TTC and having the client side pick out the sub-font by name or whatever.
Skef
On 12/21/23 15:00, Peter Constable wrote:
During the recent AHG meeting, I mentioned that Apple, Adobe and Microsoft, some years ago, had started discussing a ‘dmap’ (delta character map) table proposal. This was in late fall of 2016; the focus was on pan-CJK fonts, and in that timeframe Ken Lunde had submitted a proposal to UTC (L2/16-063 Proposal to accept the submission to register the “PanCJKV” IVD collection) to define variation sequences for ideographs that designated a range of variation selector characters to correspond to several regions for which regional glyph variants of CJK ideographs might need to be supported. I managed to find an archive of some emails from discussions at the time, so can summarize:
The aim was to be able to support distinct fonts for regional CJK variants without duplication of data. A TTC could allow de-duplication of glyph data, but there would be other duplication. We agreed the biggest concern was with ‘cmap’ data: If any one of the regional variant fonts in the collection were taken as a point of reference, then any of the other regional variants would have many of the same mappings (perhaps most), though not all the same mappings. But there wasn’t any existing means to share common mappings across fonts while there were also some different mappings. Dwane Robinson suggested that we define a new ‘dmap’ table that uses ‘cmap’ formats but is just used to describe the differences in mappings from a common ‘cmap’. We agreed that a ‘dmap’ table doesn’t need the duplication of different platforms/encodings, and that we can converge on only one platform/encoding (hence, no encoding records are necessary). We discussed format 4 versus 12, and agreed to allow either, but that both are never required. Now, we had teleconfs between Apple and MS, but the emails I found indicate that Behdad was also kept informed: one of the emails records that Behdad requested that format 13 also be allowed.
We hadn’t settled, however, on what to do about format 14 subtables. It wasn’t a priority for Apple at the time, but it seemed like it would be incomplete if we ignored it. Knowing that Ken Lunde was dealing a lot with VSes and also working on pan CJK Source Han Sans CJK, we brought Adobe into our discussion at that point.
The issue with format 14 is that it divides variation sequences into two groups: (i) VSes that map to the same glyph already mapped in a format 4 or 12 subtable (DefaultUVS), and (ii) VSes that map to a different glyph. Certainly the default mappings would be different in the various regional variant fonts, and some of the non-default mappings could also be different. (Even if a given VS never mapped to different glyphs in the different fonts, the fonts could still differ in what VSes they need to support.) So it’s necessary to resolve how a dmap/14 subtable should interact with a cmap/4 (or cmap/12) subtable, with a cmap/14 subtable, with a dmap/4 (or dmap/12) subtable, and with a dmap/14 subtable. One possible approach would be that the dmap/14 subtable completely supersedes the cmap/14 subtable (i.e., the latter is not used at all, and there is no de-duplication of that data). Another approach could be that a dmap/14 subtable complements the cmap/14 subtable by providing select replacement mappings (a delta—though there are still further details about how that would work exactly).
There were some useful points brought up along the way:
• Ned Holbrook pointed out that the format 14 DefaultUVS subtable is just a space-saving variant of the NonDefaultUVS subtable. A font doesn’t need to have any DefaultUVS table: the same sequences could be handled in NonDefaultUVS subtables — less efficiently… _in a single font_.
• For CJK, Ken Lunde pointed out that there are two kinds of UVSes to consider:
• “Standardized” VSs: these are defined in the Unicode Standard (see unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt) for CJK Compatibility Ideographs. They are defined in Unicode in a region-independent manner, but most represent region-specific glyphs.
• “Ideographic” VSes: these are VSes registered in the Ideographic Variation Database (unicode.org) in region-specific collections.
Because of the nature of each type, Ken thought there might be limited sharing across fonts. (E.g., at least some font developers would want to support a given IVS collection only in the one regional font for the corresponding region.) He did identify cases, however, in which the same SVS would need to map to different glyphs in different fonts.
• Again, for CJK, there would be cases in which different fonts would need to support the same VSes, but they would differ wrt DefaultUVS vs. NonDefaultUVS mappings.
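Ned's point about DefaultUVS being a space-saving form of NonDefaultUVS can be illustrated with a minimal sketch. It assumes the cmap is modelled as a plain dict from code point to glyph ID, and the Default UVS table as a list of (startUnicodeValue, additionalCount) ranges as in cmap format 14; the function name `expand_default_uvs` is hypothetical and not part of any proposal.

```python
def expand_default_uvs(default_ranges, cmap):
    """Rewrite Default UVS ranges as explicit Non-Default UVS mappings.

    default_ranges: list of (start_unicode_value, additional_count) pairs
                    for one variation selector (format 14 style ranges).
    cmap:           dict of code point -> glyph ID for ONE font
                    (a simplified stand-in for a format 4/12 subtable).
    Returns a dict of base code point -> glyph ID.
    """
    expanded = {}
    for start, additional_count in default_ranges:
        # A range covers start .. start + additional_count inclusive.
        for cp in range(start, start + additional_count + 1):
            gid = cmap.get(cp, 0)
            if gid != 0:  # skip code points the font does not map
                expanded[cp] = gid
    return expanded


# The same Default UVS ranges expand differently for fonts whose
# nominal mappings differ, so the expansion is per-font data.
font_a = {0x4E00: 5, 0x4E01: 6}
font_b = {0x4E00: 5, 0x4E01: 9}
ranges = [(0x4E00, 1)]
a_explicit = expand_default_uvs(ranges, font_a)
b_explicit = expand_default_uvs(ranges, font_b)
```

The expansion depends on each font's own nominal mappings, which is one way to read the "in a single font" qualification: once several fonts are involved, the explicit form is no longer a single shared table.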
Ken also called out some other uses in email exchanges. It all suggested that an ideal solution would make it possible to construct a collection file in which
- two or more fonts can share some UVS mapping data while also having some font-specific mapping data; and
- it's also possible to have other fonts that do not share any UVS mapping data with other fonts.
That would allow the fonts to support only UVSs that are relevant for their respective markets, while also having an efficiency benefit from data-sharing between certain of the fonts.
That was in December 2016. We ran into end-of-year holidays and never resumed to close on an approach that optimizes size of VS mapping data. The following is the last draft proposal that we exchanged.
—
dmap - Character to Glyph Index Differences Table
This table is an optional adjunct to the ‘cmap’ table defining differences from the nominal mappings in order to increase sharing of the ‘cmap’ itself across fonts in a TTC.
If a font production tool determines that the ‘cmap’ tables across the fonts in a TTC are largely but not entirely identical, it can choose one font to be used as the basis for the others in terms of character to glyph index mapping, expressing the mappings of the other fonts using only the mappings that are different from those of the former font. An example would be a CJK font family with region-specific fonts, where most characters would map to the same glyph index.
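As a rough illustration of what such a production tool might compute, here is a minimal sketch that derives the differences for one font relative to the chosen base font. It assumes each font's Unicode cmap has already been decoded into a dict of code point to non-zero glyph ID; the name `build_dmap_delta` and the "unrepresentable" list are hypothetical. The latter reflects one reading of the lookup steps in this draft, under which only a non-zero 'dmap' entry overrides the base mapping, so a character that the base maps but this font does not could not be expressed as a delta.

```python
def build_dmap_delta(base_cmap, font_cmap):
    """Compute the mappings a font needs on top of a shared base cmap.

    base_cmap, font_cmap: dict of code point -> non-zero glyph ID.
    Returns (delta, unrepresentable):
      delta           - code points whose glyph differs from the base,
                        or that the base does not map at all
      unrepresentable - code points mapped by the base but absent from
                        this font (assumption: no way to "unmap" them)
    """
    delta = {}
    for cp, gid in font_cmap.items():
        if base_cmap.get(cp) != gid:
            delta[cp] = gid
    unrepresentable = sorted(cp for cp in base_cmap if cp not in font_cmap)
    return delta, unrepresentable


# Hypothetical glyph IDs: two regional fonts that agree on most mappings.
base = {0x4E00: 10, 0x4E01: 11, 0x4E02: 12}
regional = {0x4E00: 10, 0x4E01: 21, 0x4E03: 30}
delta, unrepresentable = build_dmap_delta(base, regional)
```

Only the differing entries would be stored per font; every other lookup falls through to the shared 'cmap'.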
The ‘dmap’ table
Type     Name                Description
UInt16   version             Set to 0.
UInt16   numTables           Number of offset fields to follow.
UInt32   offset[numTables]   Array of byte offsets from beginning of table to cmap subtables. All subtables are assumed to use Unicode. There can be at most one subtable of either format 4, 12, or 13.
As in the ‘cmap’ table, each ‘dmap’ subtable shall have the same structure as in ‘cmap’, starting with a format field that determines the remainder. The language field for a format 4, 12, or 13 subtable must be set to zero.
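A minimal sketch of reading this header, assuming the usual OpenType big-endian byte order and that each subtable begins with a uint16 format field as in 'cmap'. The function name `parse_dmap_header` is hypothetical, and the check for a single format 4/12/13 subtable is one interpretation of the constraint stated in the table.

```python
import struct


def parse_dmap_header(data: bytes):
    """Parse the draft 'dmap' header; return a list of (offset, format)."""
    version, num_tables = struct.unpack_from(">HH", data, 0)
    if version != 0:
        raise ValueError(f"unsupported dmap version {version}")
    # offset[numTables]: UInt32 byte offsets from the start of the table.
    offsets = struct.unpack_from(f">{num_tables}I", data, 4)
    subtables = []
    for off in offsets:
        # Every cmap-style subtable starts with a uint16 format field.
        (fmt,) = struct.unpack_from(">H", data, off)
        subtables.append((off, fmt))
    # Assumed reading: at most one subtable among formats 4, 12 and 13.
    if sum(1 for _, fmt in subtables if fmt in (4, 12, 13)) > 1:
        raise ValueError("more than one format 4/12/13 subtable")
    return subtables


# Hand-built example: one subtable at offset 8 declaring format 12.
# The 14 padding bytes stand in for the rest of the subtable.
example = struct.pack(">HHI", 0, 1, 8) + struct.pack(">H", 12) + b"\x00" * 14
parsed = parse_dmap_header(example)
```

Note that there are no encoding records here, unlike 'cmap': the header goes straight from the count to the offsets.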
The steps for determining the glyph index for a given UVS consisting of a base character and optional variation selector are as follows:
1. Apply the Unicode ‘cmap’ subtable to the base character to get the nominal glyph index.
2. If the font has a ‘dmap’ format 4 or 12 subtable that maps the base character to a non-zero glyph index, it will replace the nominal glyph index.
3. If the ‘cmap’ has a format 14 subtable, apply it in this way:
3.1. If the Default UVS Table contains the base character, the final glyph index will be the one determined by the ‘cmap’.
3.2. Else if the Non-Default UVS Table contains the base character, it will determine the final glyph index.
3.3. Else the final glyph index will remain as it was after step 2.
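The steps above can be sketched as follows. This is only an illustration under stated assumptions: the subtables are modelled as plain dicts rather than binary data, the format 14 data is keyed by variation selector with a (Default UVS set, Non-Default UVS dict) pair for each, and "determined by the 'cmap'" in the Default UVS case is read as the glyph from the first step, before any 'dmap' replacement. The draft itself notes uncertainty about Default UVS behavior, so that last reading in particular is an assumption. The name `resolve_glyph` is hypothetical.

```python
def resolve_glyph(base, vs, cmap, dmap=None, uvs=None):
    """Resolve a base character plus optional variation selector.

    cmap: dict code point -> glyph ID (shared Unicode cmap subtable)
    dmap: dict code point -> glyph ID (this font's differences), or None
    uvs:  dict selector -> (default_set, non_default_dict), or None
    """
    # First step: nominal glyph from the shared cmap.
    nominal = cmap.get(base, 0)
    gid = nominal
    # Second step: a non-zero dmap entry replaces the nominal glyph.
    if dmap:
        override = dmap.get(base, 0)
        if override != 0:
            gid = override
    # Third step: consult the cmap's format 14 data, if a selector is present.
    if vs is not None and uvs and vs in uvs:
        default_set, non_default = uvs[vs]
        if base in default_set:
            return nominal            # assumed: the cmap glyph, not the dmap one
        if base in non_default:
            return non_default[base]  # explicit variation-sequence glyph
    return gid                        # unchanged from the second step


# Hypothetical glyph IDs and selectors for illustration.
cmap = {0x9089: 100}
dmap = {0x9089: 200}
uvs = {
    0xE0100: ({0x9089}, {}),          # default sequence
    0xE0101: (set(), {0x9089: 300}),  # non-default sequence
}
plain = resolve_glyph(0x9089, None, cmap, dmap, uvs)
with_default_vs = resolve_glyph(0x9089, 0xE0100, cmap, dmap, uvs)
with_non_default_vs = resolve_glyph(0x9089, 0xE0101, cmap, dmap, uvs)
```

Under this reading, a default variation sequence bypasses the regional override from the 'dmap', which may or may not be what a regional font wants.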
Note: An earlier draft of this document allowed for a second subtable of format 14, which would allow redefinition of variation sequences. Owing to uncertainty about usefulness and the exact behavior of the Default UVS Table, however, it has been removed pending further discussion.
—
In the previous draft, a different set of steps for handling UVSes were considered:
—
The steps for determining the glyph index for a given UVS consisting of a base character and optional variation selector are as follows:
1. Apply the ‘cmap’ to the base character to get the nominal glyph index.
2. If the font has a ‘dmap’ format 4 or 12 subtable that maps the base character to a non-zero glyph index, it will replace the nominal glyph index.
3. If the ‘dmap’ has a format 14 subtable, it will be used in place of the one in the ‘cmap’.
4. If there is a format 14 subtable, apply it in this way:
4.1. If the Default UVS Table contains the base character, the final glyph index will be the one determined by the ‘cmap’.
4.2. Else if the Non-Default UVS Table contains the base character, it will determine the final glyph index.
4.3. Else the final glyph index will remain as it was after step 2.
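What distinguishes this earlier variant is step 3, where a format 14 subtable in the 'dmap' wholly replaces the one in the 'cmap'. A minimal sketch, again with dict-based stand-ins for the binary subtables and hypothetical names throughout; as before, reading "determined by the 'cmap'" as the step 1 glyph is an assumption.

```python
def resolve_glyph_previous_draft(base, vs, cmap, dmap,
                                 cmap_uvs=None, dmap_uvs=None):
    """Earlier-draft lookup: a dmap format 14 subtable supersedes cmap's.

    cmap, dmap:         dict code point -> glyph ID
    cmap_uvs, dmap_uvs: dict selector -> (default_set, non_default_dict),
                        or None when the table has no format 14 subtable
    """
    nominal = cmap.get(base, 0)          # step 1
    gid = dmap.get(base, 0) or nominal   # step 2: non-zero entry wins
    # Step 3: if the dmap carries format 14 data, the cmap's is ignored.
    uvs = dmap_uvs if dmap_uvs is not None else cmap_uvs
    # Step 4: apply whichever format 14 data was selected.
    if vs is not None and uvs and vs in uvs:
        default_set, non_default = uvs[vs]
        if base in default_set:
            return nominal               # 4.1 (assumed: the cmap glyph)
        if base in non_default:
            return non_default[base]     # 4.2
    return gid                           # 4.3


# Hypothetical data: the regional font redefines one variation sequence.
shared_uvs = {0xE0101: (set(), {0x9089: 300})}
regional_uvs = {0xE0101: (set(), {0x9089: 301})}
shared_result = resolve_glyph_previous_draft(
    0x9089, 0xE0101, {0x9089: 100}, {}, cmap_uvs=shared_uvs)
regional_result = resolve_glyph_previous_draft(
    0x9089, 0xE0101, {0x9089: 100}, {},
    cmap_uvs=shared_uvs, dmap_uvs=regional_uvs)
```

Because the replacement is all-or-nothing, a font that changes even one variation sequence would carry its own complete format 14 data, with no sharing of that data.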
—
Peter
_______________________________________________
mpeg-otspec mailing list
mpeg-otspec at lists.aau.at<mailto:mpeg-otspec at lists.aau.at>
https://lists.aau.at/mailman/listinfo/mpeg-otspec