[MPEG-OTSPEC] Does a rendering system know if a variation selector requested glyph is not available in a font?

Ken Lunde lunde at unicode.org
Wed Jul 17 05:47:35 CEST 2024


Hin-Tak,

This is simply due to history.

The Adobe-Japan1 IVD collection was the first one to be registered, on 2007-12-14. You referenced the use of VS1 (aka U+FE00) and VS18 (aka U+E0101). The difference is that VS1 is among the 16 variation selectors that are used for SVSes (Standardized Variation Sequences) and EVSes (Emoji Variation Sequences), and VS18 is among the 240 variation selectors that are (currently) dedicated to the IVD (Ideographic Variation Database).
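
As a rough illustration of the two selector ranges described above, the following Python sketch maps a variation selector number (VS1 through VS256) to its code point. The function names are made up for this example; the ranges are U+FE00..U+FE0F for VS1 through VS16 and U+E0100..U+E01EF for VS17 through VS256.

```python
def variation_selector(n: int) -> int:
    """Return the code point of VSn, for 1 <= n <= 256."""
    if 1 <= n <= 16:
        # VS1..VS16: the 16 selectors used for SVSes and EVSes
        return 0xFE00 + (n - 1)
    if 17 <= n <= 256:
        # VS17..VS256: the 240 selectors currently dedicated to the IVD
        return 0xE0100 + (n - 17)
    raise ValueError("variation selector number must be in 1..256")


def is_ivd_selector(code_point: int) -> bool:
    """True if the code point is one of the 240 IVD selectors."""
    return 0xE0100 <= code_point <= 0xE01EF


print(f"VS1  = U+{variation_selector(1):04X}")
print(f"VS18 = U+{variation_selector(18):04X}")
```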

Unicode Version 6.3 (2013) added SVSes for all 1,002 CJK Compatibility Ideographs. Adobe-Japan1–related resources had mappings to CJK Compatibility Ideographs, and while those characters could not serve as base characters per UTS #37, Adobe-Japan1 IVSes were registered for them using their canonical equivalents as base characters:

51DE E0101; Adobe-Japan1; CID+20307
97FF E0101; Adobe-Japan1; CID+13337

I am referencing the "Adobe-Japan1_sequences.txt" data file in the Adobe-Japan1-7 Character Collection project, which is the "source of truth" for that glyph set:

https://github.com/adobe-type-tools/Adobe-Japan1/

The following corresponding SVSes were added in Unicode Version 6.3:

51DE FE00; CJK COMPATIBILITY IDEOGRAPH-FA15;
97FF FE00; CJK COMPATIBILITY IDEOGRAPH-FA69;

These became the following additional entries in the "Adobe-Japan1_sequences.txt" data file:

51DE FE00; Standardized_Variants; CID+20307
97FF FE00; Standardized_Variants; CID+13337

In other words, for CJK Compatibility Ideographs U+FA15 and U+FA69, there is both an SVS and a registered IVS. IVSes cannot be unregistered, and the SVSes need to be supported.
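
A minimal sketch of how such doubly mapped glyphs could be found, assuming the data file consists of lines shaped like the four entries quoted above (base and selector in hex, then a collection name, then a CID, separated by semicolons). The function names are hypothetical, and the sample data is only those four lines, not the full file.

```python
from collections import defaultdict

# The four entries quoted in this message, used as sample input.
SAMPLE = """\
51DE E0101; Adobe-Japan1; CID+20307
97FF E0101; Adobe-Japan1; CID+13337
51DE FE00; Standardized_Variants; CID+20307
97FF FE00; Standardized_Variants; CID+13337
"""


def parse_sequences(text):
    """Parse 'BASE VS; collection; CID+n' lines into tuples."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        base, selector = (int(cp, 16) for cp in fields[0].split())
        entries.append((base, selector, fields[1], fields[2]))
    return entries


def doubly_mapped(entries):
    """Return (base, CID) pairs reached from more than one collection."""
    collections = defaultdict(set)
    for base, _selector, collection, cid in entries:
        collections[(base, cid)].add(collection)
    return sorted(key for key, names in collections.items() if len(names) > 1)


for base, cid in doubly_mapped(parse_sequences(SAMPLE)):
    print(f"U+{base:04X} -> {cid}")
```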

Luckily, none of the subsequent registered IVD collections needed to deal with this, so there was no duplication of SVSes and registered IVSes among their sequences. BTW, there are 89 cases like this. Search for "# 89 Standardized Variants (Unicode 6.3)" in the "Adobe-Japan1_sequences.txt" data file.

About SVSes for East Asian punctuation, the first batch was added in Unicode Version 12.0 (2019) based on a proposal that I submitted:

3001 FE00; corner-justified form; # IDEOGRAPHIC COMMA
3001 FE01; centered form; # IDEOGRAPHIC COMMA
3002 FE00; corner-justified form; # IDEOGRAPHIC FULL STOP
3002 FE01; centered form; # IDEOGRAPHIC FULL STOP
FF01 FE00; corner-justified form; # FULLWIDTH EXCLAMATION MARK
FF01 FE01; centered form; # FULLWIDTH EXCLAMATION MARK
FF0C FE00; corner-justified form; # FULLWIDTH COMMA
FF0C FE01; centered form; # FULLWIDTH COMMA
FF0E FE00; corner-justified form; # FULLWIDTH FULL STOP
FF0E FE01; centered form; # FULLWIDTH FULL STOP
FF1A FE00; corner-justified form; # FULLWIDTH COLON
FF1A FE01; centered form; # FULLWIDTH COLON
FF1B FE00; corner-justified form; # FULLWIDTH SEMICOLON
FF1B FE01; centered form; # FULLWIDTH SEMICOLON
FF1F FE00; corner-justified form; # FULLWIDTH QUESTION MARK
FF1F FE01; centered form; # FULLWIDTH QUESTION MARK

The second batch is targeted for Unicode Version 16.0, which is scheduled for release in September.
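
Whether a particular font supports sequences like the ones listed above comes down to what its Format 14 'cmap' subtable contains. The Python sketch below classifies a sequence as "default", "non-default", or "unsupported". It assumes the subtable has already been read into a plain dictionary that maps each variation selector to a list of (base code point, glyph name or None) pairs, with None marking a default UVS; that shape is modeled on what the fontTools library appears to expose for Format 14 subtables, which should be treated as an assumption. The glyph names in the sample data are invented.

```python
from typing import Dict, List, Optional, Tuple

# selector -> [(base code point, glyph name, or None for a default UVS)]
UVSDict = Dict[int, List[Tuple[int, Optional[str]]]]


def classify_uvs(uvs: UVSDict, base: int, selector: int) -> str:
    """Classify a <base, selector> sequence against Format 14 style data."""
    for code_point, glyph_name in uvs.get(selector, []):
        if code_point == base:
            # A default UVS stores no glyph; the regular cmap glyph is used.
            return "default" if glyph_name is None else "non-default"
    return "unsupported"


# Hypothetical data: glyph names are made up for illustration.
SAMPLE_UVS: UVSDict = {
    0xFE00: [(0x3001, "uni3001.corner"), (0x3002, None)],
    0xFE01: [(0x3001, "uni3001.center")],
}

print(classify_uvs(SAMPLE_UVS, 0x3001, 0xFE00))
print(classify_uvs(SAMPLE_UVS, 0x3002, 0xFE00))
print(classify_uvs(SAMPLE_UVS, 0x3002, 0xFE01))
```

An "unsupported" result only says that the font has no entry for the sequence; what a rendering system does in that case is outside the scope of this sketch.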

Regards...

-- Ken

> On Jul 16, 2024, at 19:40, Hin-Tak Leung <htl10 at users.sourceforge.net> wrote:
> 
> Hi Ken,
> 
> While discussing implementations of uvs tables falls quite firmly within the topics of the list, I am somewhat conscious that others might not be too interested in specific details about Adobe Source fonts... Anyway.
> 
> On the non-optimal observation, I am referring to that sometimes there is a coded VS17 or even VS22 (in the case of U+53a9) being the default. And in the two cases with 6 variants, U+51de and U+97ff having VS1 and VS18 mapped to the same non-default gid. I looked up what they are - apparently there are semantic variants, specialized semantic variants, traditional variants and simplified variants - thus the semantic variant / specialized semantic variants might be logically different from the simplified/traditional variant, but "happens" to be of the same shape. I know I am "preaching to the priest", as you wrote UAX #38, which details the unihan variant properties too :-).
> 
> The https://www.unicode.org/alloc/Pipeline.html URL is also interesting reading - I didn't know "left justified form" and "right justified form" are a thing. This points to an oversight of mine: I was just looking at duplicate gid entries / shapes in the uvs table, and had forgotten that positioning (of the same glyph) could be a variant form. Those could be East Asian punctuation.
> 
> Dropping the default UVS entries would be an instant size saving of a few 10k's (?), with no functional impact.
> 
> Hindsight is a wonderful thing. Thanks for an interesting discussion.
> 
> Hin-Tak
> 
> 
> On Tuesday 16 July 2024 at 03:02:58 BST, Ken Lunde <lunde at unicode.org> wrote: 
> 
> 
> Hin-Tak,
> 
> Not tedious at all. In fact, I might be the only person on the planet who can meaningfully respond to your questions. 🤣
> 
> In any case, I now understand your sub-optimal claim, and it has merit. I can explain the background.
> 
> Adobe-Japan1 was the first IVD collection to be registered (in 2007), and the philosophy that was used was to register sequences for *every* ideograph in Adobe-Japan1-6 regardless of whether a particular ideograph had any unencoded variants. The number of registered Adobe-Japan1 sequences is 14,684, but the number of unencoded glyphs is only 1,372. This means that the number of base characters with no variants is on the order of 12,000. For example, the common ideograph U+4E00 一, which includes no variant forms, has a registered Adobe-Japan1 IVS. These are referred to as "default" UVSes, meaning that the Format 14 'cmap' subtable stores only the sequence, not a GID. This means that the default glyph for the base character as specified in the Format 4 or 12 'cmap' subtable should be used to render the sequence. I was not the person who insisted on this, and given that registered IVSes cannot be unregistered, we need to live with it.
> 
> Luckily, subsequent IVD collections that were registered did not follow this philosophy, and instead registered an IVS for a base character only when there are one or more unencoded variants. In retrospect, that should have been done for the Adobe-Japan1 IVD collection.
> 
> About the Source Han and Noto CJK fonts, their Japanese versions support the Adobe-Japan1 IVD collection, along with SVSes (Standardized Variation Sequences) that correspond to supported CJK Compatibility Ideographs, slashed zero glyphs, and East Asian punctuation. Anything outside that scope does not use any variation sequences. Note there will be new SVSes for the smart quotes in Unicode Version 16.0, and whether they are supported in the Source Han and Noto CJK fonts is up to our friends at Adobe and Google. See the end of the following page:
> 
> https://www.unicode.org/alloc/Pipeline.html
> 
> Regards...
> 
> -- Ken
> 
> > On Jul 15, 2024, at 17:27, Hin-Tak Leung <htl10 at users.sourceforge.net> wrote:
> > 
> > Hi Ken,
> > 
> > Apologies, I double-checked - I think you are right about them being more or less in sync. For some reasons I seem to have the mistaken impression that that one is v2... the other v4 and with widely different release dates. Some of the other Adobe Source non-Hans seems to be indeed at v4...
> > 
> > As for the sub-optimal claim, I just meant minimal in terms of numbers of references to distinct glyph shapes (and minimal table size). Reading UTS #37 properly, I see that this is very much not the case, and as you said, as the number of UVSes increase, the number of registered collections increases. Or rather, the other way round: as the number of UVSes increase AS A CONSEQUENCE OF more registered collections, there will be partially overlapping collections, and redundancies / duplicated references to exact same shape across different collections, and they will take up more variant selector slots with the same glyph shapes.
> > 
> > In fact, if I read UTS #37 correctly (sorry this sounds like asking the author to explain the subtlety/intention/clarification - I see you wrote UTS #37) , as a hypothetical scenario, it is entirely possible for a later version of a font having no new glyphs compared to an earlier version, but just a much larger uvs cmap. And it seems to imply that a (specific versioned instance of) uvs cmap should have a corresponding (specific versioned instance of) IVD_Collections + IVD Sequences?
> > 
> > Put it in simpler terms, the "suboptimal" claim about the current construction of Adobe Hans - to get around it, a vendor - say, Google - could register a "web font usage uvs collection" with exactly one IVS per distinct glyph, and ship a font that does not support any other collections?
> > 
> > Hidden in there, is the idea that the current Adobe Hans must have a (versioned) list of (versioned) IVS collections it claims to support - and it should be possible to check the implementation of a uvs cmap against that text-based list?
> > 
> > I hope this is not too tedious a discussion...
> > 
> > Regards 
> > Hin-Tak
> > 
> > 
> > 
> > On Monday 15 July 2024 at 04:51:59 BST, Ken Lunde <lunde at unicode.org> wrote: 
> > 
> > 
> > Hin-Tak,
> > 
> > Apologies for the delay in replying. I spent the last week working on Unicode- and IRG-related matters.
> > 
> > I just compared the Format 14 'cmap' subtables of the latest Source Han Sans (Version 2.004) and Source Han Serif (Version 2.002) fonts, and they are "pretty much" in sync. (I developed tools for doing such a comparison.) I found only four differences, which were attributed to whether the sequences are "default" or "non-default," which may actually be a bug. That is no longer my problem, but I can still confidently state that their UVSes are "pretty much" in sync.
> > 
> > I am, however, curious about your "suboptimal" claims. I suspect that the Format 14 'cmap' subtable itself may be suboptimal, as the "IVS Test" project sort of demonstrates, given the sheer size of its 'cmap' table. As the number of UVSes increases, the optimization decreases, or rather, the fact that it is suboptimal becomes more apparent.
> > 
> > Regards...
> > 
> > -- Ken
> > 
> > > On Jun 29, 2024, at 19:03, Hin-Tak Leung <htl10 at users.sourceforge.net> wrote:
> > > 
> > > 
> > > 
> > > On Friday 28 June 2024 at 06:12:44 BST, Ken Lunde <lunde at unicode.org> wrote: 
> > > 
> > > 
> > > > Hin-Tak,
> > > 
> > > > For better or worse, I am effectively the caretaker of the history of much of the CJK-related type activities that took place at Adobe over the last 30+ years, to include the development of the Source Han and Noto CJK Pan-CJK typefaces, which are clones of one another.
> > > 
> > > > About the observations that you made, particularly about the lookup of UVSes in Source Han being suboptimal, that was intentional. While I have been the IVD Registrar since May of 2011, the registration of virtually all Adobe-Japan1 IVSes was performed by my former Adobe colleague, Eric Muller. I suspect that your observation is about the Variation Selector that is associated with what is deemed the default UVS, meaning that the Format 14 'cmap' subtable defers to the Format 12 (or 4) 'cmap' subtable for the GID. When the first -- and by far, largest -- batch of Adobe-Japan1 IVS were registered in the IVD, it was intentional that the lowest -- by code point order -- Variation Selector was not associated with the UVS that is considered the default (aka encoded) one. This was purposefully done so that implementations would not make such an assumption.
> > > 
> > > > BTW, you may be interested in the "IVS Test" project that I started while at Adobe:
> > > 
> > > > https://github.com/adobe-fonts/ivs-test/
> > > 
> > > Thanks Ken, for the anecdotes about the development history. I am aware that technical decisions are often made not entirely based on technical considerations. It may not even be optimal at the time, and certainly not on hindsight. It is always interesting to learn how "oddities" come to be.
> > > 
> > > It makes a lot of sense to intentionally NOT to associate the lowest variation selector with the default. Technologically it is redundant (one can save one code point by just "spec it out" and remove it and gain the use of one empty slot). A lot of parties are going to argue that they want their favourite as default so "default" in this case is a political minefield too.
> > > 
> > > I was curious about the non-optimalness of the format 14 cmap on Adobe Sources Hans Sans, and wonder if they are sync with the Serif font. I.e. two glyph shapes can be non-degenerate and different in the serif font (e.g. a brush stroke tapering from top right to bottom left, vs the reverse tapering from bottom left to top right - they become identical in the Sans font). But I found that the serif font has an entirely different versioning and release schedule. While its UVS table feels more optimal, no conclusion could be drawn from its relationship with the Sans font. There is probably another interesting story there.
> > > 
> > > Thanks for the URL for the ivs-test - looks to be an interesting "stress test" benchmarking sample for performance in related software/ code path!
> > > 
> > > Regards,
> > > Hin-Tak
> > > 
> > 
> 


