Composite Font syntax
Ken Lunde
lunde at adobe.com
Thu Jun 25 07:06:56 CEST 2009
All,
Mikhail Leonov, daan Strebe, and I have spent some time since the last
meeting in private discussions about the syntax of the
Composite Font format, and I am now prepared to share with the AHG
what we have agreed to. Any delay in bringing this discussion to the
AHG is due to me, and for that I apologize.
There are three basic elements for the syntax that describes the
functional portion of the Composite Font recipe:
Language
Component Fonts
Unicode Ranges
(As a side note, we also discussed the notion of "Script" as an
alternative to specifying Unicode ranges, and agreed to defer that
portion of the discussion in order to bring the agreed portions to the
rest of the group.)
Given the "elective detail" principle, along with the acknowledgment
that the intentions and needs of creators and consumers are diverse,
these three Composite Font elements must be specified in a hierarchy,
and that hierarchy depends on the intent of the creator.
Furthermore, in adhering to the "elective detail" principle, not all of
these elements are required to be specified; the only requirement is a
minimum of one Component Font.
We have agreed that a minimal Composite Font specifies a single
Component Font, such as the following:
<ComponentFont Target="LatinFont-1"/>
This is functionally equivalent to the form that uses start- and end-
tags:
<ComponentFont Target="LatinFont-1"></ComponentFont>
Another minimal form of a Composite Font is simply an ordered list of
Component Fonts. The order in which they appear in the list is
their order of preference, so the later fonts function as fallbacks:
<ComponentFont Target="LatinFont-1, LatinFont-2"/>
or:
<ComponentFont Target="LatinFont-1"/>
<ComponentFont Target="LatinFont-2"/>
The latter form is necessary if one of the Component Fonts requires an
attribute not shared by another:
<ComponentFont Target="LatinFont-1"/>
<ComponentFont ScaleFactor="110" Target="LatinFont-2"/>
Note that tag attributes are used to specify the content of the tags,
as opposed to string data between the start- and end-tags. When an
element has no further hierarchical information, the empty-element tag
form can be used, and when there is further hierarchy to be specified,
the start- and end-tag form must be used.
The <ComponentFont> tag example above provided one attribute,
specifically "Target", which specifies the name of the Component Font.
Other attributes for this tag can include "BaselineShift" and
"ScaleFactor" values. Given that design spaces of Component Fonts can
be diverse, ranging from 256 to 2048 units per em, with 1000 units per
em being typical for PostScript-based fonts, these values are best
specified as percentages. The ScaleFactor attribute obviously benefits
from this.
The following is an example that uses all of the attributes:
<ComponentFont ScaleFactor="110" BaselineShift="-2"
Target="LatinFont-1"/>
In other words, the Component Font named "LatinFont-1" is scaled to
110% of its size, and a -2% baseline shift is performed.
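To illustrate why percentages travel well across design spaces, here is a minimal Python sketch that resolves the two attributes for fonts with different em sizes. The function name is hypothetical, and treating BaselineShift as a percentage of the Component Font's em is an assumption, not something the agreed syntax states.

```python
def resolve_metrics(units_per_em, scale_factor=100, baseline_shift=0):
    """Return (scale multiplier, baseline shift in font units).

    Assumption: both attributes are percentages, and BaselineShift is
    measured against the em of the Component Font.
    """
    scale = scale_factor / 100
    shift_units = units_per_em * baseline_shift / 100
    return scale, shift_units


# The same attribute values applied to three different design spaces.
for upem in (256, 1000, 2048):
    print(upem, resolve_metrics(upem, scale_factor=110, baseline_shift=-2))
```

The same pair of attribute values yields a different shift in font units for each em size, which is the point of not specifying the shift in units directly.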
The other elements are specified as the following tags:
<Language>
<Encoding>
The <Language> tag uses "Target" as its attribute, which specifies one
or more three-letter ISO 639-2/T language codes. The <Encoding> tag
also uses "Target" as its primary attribute, which specifies Unicode
ranges and Unicode code points, separated by commas. Some examples:
<Language Target="jpn"/>
<Encoding Target="4E00-9FCB"/>
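As a rough illustration of how a consumer might read the "Target" value of an <Encoding> tag, the following Python sketch turns the comma-separated list into inclusive code point ranges. The function name is hypothetical, and the handling of whitespace and of descending ranges is an assumption rather than part of the agreed syntax.

```python
def parse_encoding_target(target):
    """Parse a value such as '0028-0029, 002C' into inclusive (start, end) pairs.

    A single code point becomes a range whose start and end are equal.
    """
    ranges = []
    for item in target.split(","):
        item = item.strip()  # also drops line breaks inside the attribute
        if not item:
            continue
        start, _, end = item.partition("-")
        lo = int(start, 16)
        hi = int(end, 16) if end else lo
        if hi < lo:
            raise ValueError(f"descending range: {item}")
        ranges.append((lo, hi))
    return ranges


print(parse_encoding_target("4E00-9FCB"))
```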
Let us consider John Hudson's example of mixed English and Devanagari
text. The issue in that example concerned language tagging: punctuation
that is common across languages, and thus shares the same
Unicode code points, needs appropriate treatment. Let us consider only
the following code points for this example:
Punctuation: 0028-0029, 002C, 002E, 2018-2019, 201C-201D
Latin: 0041-005A, 0061-007A
Devanagari: 0900-097F
When the language is English (or another Latin-based language), the
punctuation should come from the font intended for English, and
likewise, when the language is Hindi, the punctuation should come from
the Devanagari font.
<!-- English -->
<Language Target="eng">
<Encoding Target="0028-0029, 002C, 002E, 0041-005A, 0061-007A,
2018-2019, 201C-201D">
<ComponentFont BaselineShift="25" ScaleFactor="108"
Target="LatinFont-1"/>
<ComponentFont Target="LatinFont-2"/>
</Encoding>
</Language>
<!-- Hindi -->
<Language Target="hin">
<Encoding Target="0028-0029, 002C, 002E, 0900-097F, 2018-2019,
201C-201D">
<ComponentFont Target="DevanagariFont-1"/>
</Encoding>
</Language>
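The effect of the two <Language> blocks above can be sketched in Python as a lookup keyed first by language and then by code point. The data structure and the function name are hypothetical, and how a consumer obtains the language tag for a run of text is outside the scope of this sketch.

```python
# Punctuation shared by both languages in the example.
PUNCT = [(0x0028, 0x0029), (0x002C, 0x002C), (0x002E, 0x002E),
         (0x2018, 0x2019), (0x201C, 0x201D)]

# language code -> (inclusive code point ranges, fonts in preference order)
LANGUAGE_RECIPE = {
    "eng": (PUNCT + [(0x0041, 0x005A), (0x0061, 0x007A)],
            ["LatinFont-1", "LatinFont-2"]),
    "hin": (PUNCT + [(0x0900, 0x097F)], ["DevanagariFont-1"]),
}


def fonts_for(language, code_point):
    """Return the preference-ordered fonts for a code point in a language."""
    entry = LANGUAGE_RECIPE.get(language)
    if entry is None:
        return []
    ranges, fonts = entry
    if any(lo <= code_point <= hi for lo, hi in ranges):
        return fonts
    return []


# The same comma (U+002C) resolves differently depending on the language.
print(fonts_for("eng", 0x002C))
print(fonts_for("hin", 0x002C))
```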
If we were to implement Composite Font support equivalent to what
Adobe applications provide, I think that the following represents a
good example, though it is highly simplified (and incomplete) for the
purpose of this explanation:
<!-- Latin -->
<Encoding Target="0000-007F">
<ComponentFont BaselineShift="12" ScaleFactor="95"
Target="LatinFont-1"/>
</Encoding>
<!-- Japanese Punctuation -->
<Encoding Target="3000-303F">
<ComponentFont Target="JapaneseFont-1"/>
</Encoding>
<!-- Japanese Kana -->
<Encoding Target="3041-30FF">
<ComponentFont Target="KanaFont-1"/>
</Encoding>
<!-- Everything Else -->
<ComponentFont Target="JapaneseFont-2"/>
Note how "JapaneseFont-2" serves as the Base Font, because it does not
declare any other elements or attributes. The Component Fonts that are
declared prior to that line take precedence, and are used for the
specific Unicode ranges that are declared. I could have declared a
Unicode range for the last font, but for the purpose of this Composite
Font, it is not necessary, because any and all glyphs that it contains
can be used, other than those masked by previous declarations in the
Composite Font.
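The precedence described above can be sketched in Python as a walk through the declarations in document order, with the undeclared final entry acting as the Base Font. The names are hypothetical, and the sketch assumes that selection depends only on the declared ranges; it does not check whether a font actually contains a glyph for the code point.

```python
# (inclusive code point ranges, or None when no <Encoding> is declared,
#  Component Font name), listed in document order.
RECIPE = [
    ([(0x0000, 0x007F)], "LatinFont-1"),
    ([(0x3000, 0x303F)], "JapaneseFont-1"),
    ([(0x3041, 0x30FF)], "KanaFont-1"),
    (None, "JapaneseFont-2"),
]


def select_font(code_point, recipe=RECIPE):
    """Return the first font whose declaration covers the code point."""
    for ranges, font in recipe:
        if ranges is None:
            return font  # base font: accepts anything not masked earlier
        if any(lo <= code_point <= hi for lo, hi in ranges):
            return font
    return None


for ch in "A、あ漢":
    print(f"U+{ord(ch):04X}", select_font(ord(ch)))
```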
Also note that the language was not declared, because it is not
important for this specific Composite Font. This adheres to the
"elective detail" principle.
I cannot think of any attributes for the <Language> tag other
than "Target", which specifies a three-letter ISO 639-2/T language code.
For the <Encoding> tag, it should be possible to re-encode a Component
Font, and an "Original" attribute can be used for that purpose. Consider
the following two scenarios, both of which are very real:
1) To be able to add glyphs from a Component Font that are encoded in
one way (such as according to a single-byte encoding), but to encode
them according to a different encoding in the Composite Font, such as
in the PUA region. Legacy Composite Font mechanisms referred to this
as the ability to add "gaiji" to fonts, and the re-encoding was a
necessary step. Basically, this means re-encoding a single-byte
Component Font so that its glyphs are accessed via character codes in
the Composite Font.
2) To be able to change the encoding of a select number of glyphs in a
Component Font. A good example is the set of GB 18030 glyphs that are
encoded using PUA code points, but can be encoded according to non-PUA
code points. It would be reasonable for a Composite Font definition to
perform this function. I would claim that both code points (PUA in the
original Component Font and non-PUA in the Composite Font) could
result in the same glyph when accessed via the Composite Font.
The way that these were handled in legacy Composite Font formats was
to specify encoding ranges for the Composite Font, which could be
length=1 (meaning that the start and end character code are the same)
at a minimum, and specify only the start character code for the
Component Font. For example:
<Encoding Target="4E00-4EFF" Original="00"/>
In other words, U+4E00 through U+4EFF in the Composite Font are mapped
to 0x00 through 0xFF in the Component Font.
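The offset arithmetic behind this mapping can be sketched in Python as follows. The function name is hypothetical, and returning None for a code point outside the declared range is an assumption about how a consumer would treat it.

```python
def map_to_component(code_point, target_start, target_end, original_start):
    """Map a Composite Font code point to the Component Font's own code.

    Only the start code is given for the Component Font, so the offset
    within the target range is added to it.
    """
    if not target_start <= code_point <= target_end:
        return None
    return original_start + (code_point - target_start)


# Target="4E00-4EFF" Original="00"
print(hex(map_to_component(0x4E00, 0x4E00, 0x4EFF, 0x00)))
print(hex(map_to_component(0x4EFF, 0x4E00, 0x4EFF, 0x00)))
```

A range of length 1 is simply the case where the start and end code points are equal, so the offset is always zero.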
Here is a good example of GB 18030 characters that are often PUA-
encoded in fonts, but could be re-encoded in Composite Fonts via this
mechanism:
<Encoding Target="9FB4" Original="FE59"/>
<Encoding Target="9FB5" Original="FE61"/>
<Encoding Target="9FB6-9FB7" Original="FE66"/>
<Encoding Target="9FB8" Original="FE6D"/>
<Encoding Target="9FB9" Original="FE7E"/>
<Encoding Target="9FBA" Original="FE90"/>
<Encoding Target="9FBB" Original="FEA0"/>
<Encoding Target="20087" Original="FE51"/>
<Encoding Target="20089" Original="FE52"/>
<Encoding Target="200CC" Original="FE53"/>
<Encoding Target="215D7" Original="FE6C"/>
<Encoding Target="2298F" Original="FE76"/>
<Encoding Target="241FE" Original="FE91"/>
I understand that the above is a lot to digest (especially considering
that I polished off half a bottle of red wine, followed by two shots
of brandy), but if anyone has any comments or feedback, please post it
to the mailing list.
Regards...
-- Ken