Composite Font syntax

Thu Jun 25 07:06:56 CEST 2009

All,

Mikhail Leonov, daan Strebe, and I have spent some time since the last
meeting in private discussions about the syntax of the Composite Font
format, and I am now prepared to share with the AHG what we have
agreed to. Any delay in bringing this discussion to the AHG is due to
me, and for that I apologize.

There are three basic elements for the syntax that describes the  
functional portion of the Composite Font recipe:

   Language
   Component Fonts
   Unicode Ranges

(As a side note, we also discussed the notion of "Script" as an  
alternative to specifying Unicode ranges, and agreed to defer that  
portion of the discussion in order to bring the agreed portions to the  
rest of the group.)

Given the "elective detail" principle, along with the acknowledgment
that the intentions and needs of creators and consumers are diverse,
these three Composite Font elements must be specified in a hierarchy,
and that hierarchy depends on the intent of the creator.
Furthermore, in adhering to the "elective detail" principle, not all
of these elements need to be specified; the only requirement is a
minimum of one Component Font.

We have agreed that a minimal Composite Font specifies a single  
Component Font, such as the following:

   <ComponentFont Target="LatinFont-1"/>

This is functionally equivalent to the form that uses start- and end- 
tags:

   <ComponentFont Target="LatinFont-1"></ComponentFont>

Another minimal form of a Composite Font is simply an ordered list of
Component Fonts. The order in which they appear in the list is their
order of preference, so the Composite Font functions as a fallback font:

   <ComponentFont Target="LatinFont-1, LatinFont-2"/>

or:

   <ComponentFont Target="LatinFont-1"/>
   <ComponentFont Target="LatinFont-2"/>

The latter form is necessary if one of the Component Fonts requires an  
attribute not shared by another:

   <ComponentFont Target="LatinFont-1"/>
   <ComponentFont ScaleFactor="110" Target="LatinFont-2"/>
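
To illustrate that the comma-separated form and the one-tag-per-font
form describe the same ordered list, here is a minimal Python sketch of
how a consumer might read either one. The <CompositeFont> wrapper
element and the fallback_order helper are hypothetical names used only
for this illustration; they are not part of what has been agreed.

```python
import xml.etree.ElementTree as ET

def fallback_order(xml_fragment):
    """Return Component Font names in order of preference."""
    # Hypothetical wrapper element: the fragments in this message have no root.
    root = ET.fromstring("<CompositeFont>" + xml_fragment + "</CompositeFont>")
    names = []
    # iter() walks the elements in document order, which is the preference order.
    for element in root.iter("ComponentFont"):
        # A Target may hold one name or a comma-separated list of names.
        for name in element.get("Target", "").split(","):
            if name.strip():
                names.append(name.strip())
    return names

single = '<ComponentFont Target="LatinFont-1, LatinFont-2"/>'
separate = ('<ComponentFont Target="LatinFont-1"/>'
            '<ComponentFont Target="LatinFont-2"/>')
print(fallback_order(single) == fallback_order(separate))
```

Both spellings yield the same list, with "LatinFont-1" preferred over
"LatinFont-2".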

Note that tag attributes are used to specify the content of the tags,
as opposed to string data between the start- and end-tags. When an
element has no further hierarchical information, the empty-element tag
form can be used; when there is further hierarchy to be specified,
the start- and end-tag form must be used.

The <ComponentFont> tag example above provided one attribute,
specifically "Target", which specifies the name of the Component Font.
Other attributes for this tag can include "BaselineShift" and
"ScaleFactor" values. Given that design spaces of Component Fonts can
be diverse, ranging from 256 to 2048 units per em, with 1000 units per
em being typical for PostScript-based fonts, these values are best
specified as percentages. The ScaleFactor attribute obviously benefits
from this. The following is an example that uses all of the attributes:

   <ComponentFont ScaleFactor="110" BaselineShift="-2" Target="LatinFont-1"/>

In other words, the Component Font named "LatinFont-1" is scaled to  
110% of its size, and a -2% baseline shift is performed.
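
The arithmetic can be sketched in Python as follows. This is only an
illustration under one stated assumption: the message does not say what
the BaselineShift percentage is relative to, so the sketch takes it as a
percentage of the unscaled point size. The function name is hypothetical.

```python
def apply_component_metrics(point_size, scale_factor=100.0, baseline_shift=0.0):
    """Return (scaled_size, shift) for a Component Font at a given point size."""
    # ScaleFactor is a percentage of the requested size.
    scaled_size = point_size * scale_factor / 100.0
    # Assumption: BaselineShift is a percentage of the unscaled point size.
    # The sign is passed through unchanged.
    shift = point_size * baseline_shift / 100.0
    return scaled_size, shift

# The "LatinFont-1" example above, set at 10 points.
print(apply_component_metrics(10.0, scale_factor=110, baseline_shift=-2))
```

Because both values are percentages, the same recipe applies to a
Component Font regardless of its units-per-em value.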

The other elements are specified as the following tags:

   <Language>
   <Encoding>

The <Language> tag uses "Target" as its attribute, which specifies one  
or more three-letter ISO 639-2/T language codes. The <Encoding> tag  
also uses "Target" as its primary attribute, which specifies Unicode  
ranges and Unicode code points, separated by commas. Some examples:

   <Language Target="jpn"/>
   <Encoding Target="4E00-9FCB"/>
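
As a rough sketch of how the "Target" value of an <Encoding> tag might
be read, the following Python function splits it into inclusive ranges,
treating a lone code point as a range of length one. The function name
and the decision to reject descending ranges are assumptions made for
the illustration.

```python
def parse_encoding_target(target):
    """Parse e.g. "0028-0029, 002C" into [(0x28, 0x29), (0x2C, 0x2C)]."""
    ranges = []
    for item in target.split(","):
        item = item.strip()
        if not item:
            continue
        # "4E00-9FCB" is a range; "002C" alone is a single code point.
        start_text, _, end_text = item.partition("-")
        start = int(start_text, 16)
        end = int(end_text, 16) if end_text else start
        if end < start:
            # Assumption: a descending range is treated as an error.
            raise ValueError(f"descending range: {item}")
        ranges.append((start, end))
    return ranges

print(parse_encoding_target("4E00-9FCB"))
```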

Let us consider John Hudson's example of mixed English and Devanagari
text. The issue in that example was language tagging: punctuation that
is common across languages, and thus shares the same Unicode code
points, needs appropriate treatment. Let us consider only the
following code points for this example:

   Punctuation: 0028-0029, 002C, 002E, 2018-2019, 201C-201D
   Latin:       0041-005A, 0061-007A
   Devanagari:  0900-097F

When the language is English (or another language written in Latin
script), the punctuation should come from the font intended for
English; likewise, when the language is Hindi, the punctuation should
come from the Devanagari font.

   <!-- English -->
   <Language Target="eng">
       <Encoding Target="0028-0029, 002C, 002E, 0041-005A, 0061-007A, 2018-2019, 201C-201D">
           <ComponentFont BaselineShift="25" ScaleFactor="108" Target="LatinFont-1"/>
           <ComponentFont Target="LatinFont-2"/>
       </Encoding>
   </Language>

   <!-- Hindi -->
   <Language Target="hin">
       <Encoding Target="0028-0029, 002C, 002E, 0900-097F, 2018-2019, 201C-201D">
           <ComponentFont Target="DevanagariFont-1"/>
       </Encoding>
   </Language>
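
The effect of the two <Language> blocks above can be sketched in Python
as a lookup keyed first by language and then by code point. The RECIPE
table is a hand-written, hypothetical in-memory form of the XML, and
fonts_for is an illustrative name. The sketch only returns the candidate
fonts in order of preference; it does not check whether a font actually
contains a glyph for the code point.

```python
# Punctuation shared by both languages in the example.
PUNCTUATION = [(0x0028, 0x0029), (0x002C, 0x002C), (0x002E, 0x002E),
               (0x2018, 0x2019), (0x201C, 0x201D)]

# Hypothetical in-memory form: language -> list of (ranges, ordered fonts).
RECIPE = {
    "eng": [(PUNCTUATION + [(0x0041, 0x005A), (0x0061, 0x007A)],
             ["LatinFont-1", "LatinFont-2"])],
    "hin": [(PUNCTUATION + [(0x0900, 0x097F)],
             ["DevanagariFont-1"])],
}

def fonts_for(language, code_point):
    """Return the ordered candidate fonts for a code point in a language."""
    for ranges, fonts in RECIPE.get(language, []):
        if any(start <= code_point <= end for start, end in ranges):
            return fonts
    return []

# U+002C COMMA resolves differently depending on the language tag.
print(fonts_for("eng", 0x002C))
print(fonts_for("hin", 0x002C))
```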

If we were to implement Composite Font support equivalent to what  
Adobe applications provide, I think that the following represents a  
good example, though it is highly simplified (and incomplete) for the  
purpose of this explanation:

   <!-- Latin -->
   <Encoding Target="0000-007F">
       <ComponentFont BaselineShift="12" ScaleFactor="95" Target="LatinFont-1"/>
   </Encoding>

   <!-- Japanese Punctuation -->
   <Encoding Target="3000-303F">
       <ComponentFont Target="JapaneseFont-1"/>
   </Encoding>

   <!-- Japanese Kana -->
   <Encoding Target="3041-30FF">
       <ComponentFont Target="KanaFont-1"/>
   </Encoding>

   <!-- Everything Else -->
   <ComponentFont Target="JapaneseFont-2"/>

Note how "JapaneseFont-2" serves as the Base Font, because it does not  
declare any other elements or attributes. The Component Fonts that are  
declared prior to that line take precedence, and are used for the  
specific Unicode ranges that are declared. I could have declared a  
Unicode range for the last font, but for the purpose of this Composite  
Font, it is not necessary, because any and all glyphs that it contains  
can be used, other than those masked by previous declarations in the  
Composite Font.
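
The precedence described above (earlier declarations mask later ones,
and an unrestricted Component Font acts as the Base Font) can be
sketched in Python. The <CompositeFont> root element and the helper
names are hypothetical, and the sketch assumes that a font listed for a
range covers every code point in that range, which a real implementation
would need to verify against the font itself.

```python
import xml.etree.ElementTree as ET

# The example above, wrapped in a hypothetical root element.
RECIPE = """<CompositeFont>
  <Encoding Target="0000-007F">
    <ComponentFont BaselineShift="12" ScaleFactor="95" Target="LatinFont-1"/>
  </Encoding>
  <Encoding Target="3000-303F"><ComponentFont Target="JapaneseFont-1"/></Encoding>
  <Encoding Target="3041-30FF"><ComponentFont Target="KanaFont-1"/></Encoding>
  <ComponentFont Target="JapaneseFont-2"/>
</CompositeFont>"""

def in_target(code_point, target):
    """True if code_point falls in any range of an Encoding Target value."""
    for item in target.split(","):
        start_text, _, end_text = item.strip().partition("-")
        start = int(start_text, 16)
        end = int(end_text, 16) if end_text else start
        if start <= code_point <= end:
            return True
    return False

def resolve(code_point, recipe_xml=RECIPE):
    """Return the first Component Font, in document order, for a code point."""
    for child in ET.fromstring(recipe_xml):
        if child.tag == "ComponentFont":
            # No enclosing Encoding: unrestricted, so it acts as the Base Font.
            return child.get("Target")
        if child.tag == "Encoding" and in_target(code_point, child.get("Target")):
            return child.find("ComponentFont").get("Target")
    return None

print(resolve(0x3042), resolve(0x4E00))
```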

Also note that the language was not declared, because it is not  
important for this specific Composite Font. This adheres to the  
"elective detail" principle.

I cannot think of any attributes for the <Language> tag other than
"Target" to specify a three-letter ISO 639-2/T language code. For
the <Encoding> tag, it should be possible to re-encode a Component
Font, and an "Original" attribute can be used for that purpose.
Consider the following two scenarios, both of which are very real:

1) To be able to add glyphs from a Component Font that are encoded in
one way (such as according to a single-byte encoding), but to encode
them according to a different encoding in the Composite Font, such as
in the PUA region. Legacy Composite Font mechanisms referred to this
as the ability to add "gaiji" to fonts. The re-encoding was a
necessary step: basically, a single-byte Component Font is re-encoded
so that its glyphs are accessed via character codes in the Composite
Font.

2) To be able to change the encoding of a select number of glyphs in a
Component Font. A good example is the set of GB 18030 glyphs that are encoded
using PUA code points, but can be encoded according to non-PUA code  
points. It would be reasonable for a Composite Font definition to  
perform this function. I would claim that both code points (PUA in the  
original Component Font and non-PUA in the Composite Font) could  
result in the same glyph when accessed via the Composite Font.

The way that these were handled in legacy Composite Font formats was  
to specify encoding ranges for the Composite Font, which could be  
length=1 (meaning that the start and end character code are the same)  
at a minimum, and specify only the start character code for the  
Component Font. For example:

   <Encoding Target="4E00-4EFF" Original="00"/>

In other words, U+4E00 through U+4EFF in the Composite Font are mapped  
to 0x00 through 0xFF in the Component Font.
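
The mapping implied by a "Target" range plus an "Original" start code
can be sketched in Python. The function name is hypothetical, and the
sketch assumes that "Target" holds a single range or code point whenever
"Original" is present, which the syntax above does not yet spell out.

```python
def reencoding_map(target, original):
    """Map Composite Font code points to Component Font codes.

    target   -- one hexadecimal range or code point, e.g. "4E00-4EFF"
    original -- hexadecimal start code in the Component Font, e.g. "00"
    """
    start_text, _, end_text = target.strip().partition("-")
    start = int(start_text, 16)
    end = int(end_text, 16) if end_text else start
    first = int(original, 16)
    # Each code point in the range maps to the same offset from "Original".
    return {start + offset: first + offset for offset in range(end - start + 1)}

table = reencoding_map("4E00-4EFF", "00")
print(len(table), hex(table[0x4E00]), hex(table[0x4EFF]))
```

Only the start code is needed for the Component Font because the length
of the range is already fixed by "Target".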

Here is a good example of GB 18030 characters that are often PUA- 
encoded in fonts, but could be re-encoded in Composite Fonts via this  
mechanism:

   <Encoding Target="9FB4" Original="FE59"/>
   <Encoding Target="9FB5" Original="FE61"/>
   <Encoding Target="9FB6-9FB7" Original="FE66"/>
   <Encoding Target="9FB8" Original="FE6D"/>
   <Encoding Target="9FB9" Original="FE7E"/>
   <Encoding Target="9FBA" Original="FE90"/>
   <Encoding Target="9FBB" Original="FEA0"/>
   <Encoding Target="20087" Original="FE51"/>
   <Encoding Target="20089" Original="FE52"/>
   <Encoding Target="200CC" Original="FE53"/>
   <Encoding Target="215D7" Original="FE6C"/>
   <Encoding Target="2298F" Original="FE76"/>
   <Encoding Target="241FE" Original="FE91"/>

I understand that the above is a lot to digest (especially considering  
that I polished off half a bottle of red wine, followed by two shots  
of brandy), but if anyone has any comments or feedback, please post it  
to the mailing list.

Regards...

-- Ken