[mpeg-OTspec] Conflicts between language system tags

Andrew Glass (WINDOWS) Andrew.Glass at microsoft.com
Wed Feb 27 01:32:25 CET 2013


If we go with all four-character codes being ISO-based, which seems appropriate, I'd prefer to have the variant marker be the last character so that variants will sort together. In that case the ISO flag number should also be at the back of the tag, e.g., KAB6. Having 6 at the back of the tag makes it seem as though 0-5 are missing. Therefore, the proposal should rather be {Uppercase(ISO639-3)}0, so KAB0 = 'Kabyle', whereas the existing tag KAB = 'Kabardian'.

Orthographic variants would be marked with a fourth letter, e.g.,
	KABR = 'Kabyle (reformed orthography)'
Note that Kabardian (reformed orthography) would be KBDR, based on ISO 639-3 kbd = 'Kabardian'.
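
A minimal sketch of the scheme above, in Python. The function names are hypothetical and exist only for illustration. It assumes that every four-character tag is ISO-based, that '0' in the last position marks the standard orthography, and that an uppercase letter in the last position marks a variant.

```python
import re

def make_langsys_tag(iso639_3: str, variant: str = "0") -> str:
    """Build {Uppercase(ISO639-3)} plus a final marker (hypothetical helper).

    variant "0" means standard orthography; an uppercase letter such as "R"
    marks an orthographic variant. Both conventions are assumptions taken
    from the proposal in this mail.
    """
    if not re.fullmatch(r"[A-Za-z]{3}", iso639_3):
        raise ValueError("expected a three-letter ISO 639-3 code")
    if not re.fullmatch(r"[0A-Z]", variant):
        raise ValueError("variant marker must be '0' or an uppercase letter")
    return iso639_3.upper() + variant

def parse_langsys_tag(tag: str):
    """Return (iso639_3, variant) for an ISO-based tag, else None.

    Three-letter tags padded with a space (e.g. 'KAB ') do not match and are
    treated as existing, non-ISO tags.
    """
    m = re.fullmatch(r"([A-Z]{3})([0A-Z])", tag)
    if m is None:
        return None
    code, marker = m.group(1).lower(), m.group(2)
    return (code, None if marker == "0" else marker)
```

With the marker in the last position, a plain string sort keeps KAB0 and KABR next to each other, which is the sorting behaviour argued for above.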

-----Original Message-----
From: Martin Hosken [mailto:martin_hosken at sil.org] 
Sent: 26 February 2013 16:18
To: Andrew Glass (WINDOWS)
Cc: Peter Constable; OTspec (mpeg-OTspec at yahoogroups.com)
Subject: Re: [mpeg-OTspec] Conflicts between language system tags

Oops, this discussion didn't go to the list. Let's try that again.

> Dear Martin,
> 
> Thanks for the feedback, this is a good point to consider.
> 
> In the case of Malayalam this distinction is currently indicated at the level of the three-letter tag (MAL, MLR), where the ISO code is MAL. Note that while the reformed orthography is signified subtly with R, the tag itself is not distinguishable from that of a completely unrelated language. If this were a language that was not already tagged in OT, the new code would be something like 6MAL. Based on the proposal, variants could be signified using a different number, e.g., 7MAL. If so, it might be better to start at 1 for standard orthography for the ISO code itself; 2 through 9 would then be used to indicate orthographic variants. That would be quite a limited scope for variants, and I'm not sure how much room is really needed. If much more room is needed, perhaps a more robust solution along the lines of BCP47 is really what we want.
> 
> >How about using case?
> 
> I tend to think of case as a pretty fragile means of making distinctions as the differences would be prone to programming error (font vendor uses the wrong case, or instruction applies casing normalization).
> 

I agree. Another thought that came to my mind is that we say "if the lang tag is 4 chars then it uses iso639-3 codes" and we introduce a single pad character like '6' as per your suggestion. But we also say that if another character is used, then the tag is still iso639-3 based but with an extension. That's very similar to your approach, only slightly looser. I don't imagine that there will be many variants per language-script combination, so perhaps using digits alone is sufficient.
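
As a rough sketch of that rule in Python (the name classify_tag is hypothetical, and the sketch assumes the marker sits in the first position, as in the 6MAL and 7MAL examples quoted above, and that only digits are used):

```python
import re

PAD = "6"  # assumed pad character meaning "plain ISO 639-3 code, no extension"

def classify_tag(tag: str):
    """Return (iso639_3, extension) for a digit-prefixed four-character tag.

    extension is None when the marker is the pad character, otherwise the
    marker digit itself. Tags that do not fit the pattern return None and
    would be handled as existing OT tags.
    """
    m = re.fullmatch(r"([0-9])([A-Z]{3})", tag)
    if m is None:
        return None
    marker, code = m.group(1), m.group(2).lower()
    return (code, None if marker == PAD else marker)
```

Under these assumptions 6MAL reads as plain 'mal', 7MAL reads as 'mal' with extension 7, and a space-padded tag such as 'MAL ' falls through to the existing registry.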

Yours,
Martin

> Cheers,
> 
> Andrew
>  
> -----Original Message-----
> From: Martin Hosken [mailto:martin_hosken at sil.org] 
> Sent: 26 February 2013 13:02
> To: Andrew Glass (WINDOWS)
> Subject: Re: [mpeg-OTspec] Conflicts between language system tags
> 
> Dear Andrew,
> 
> > Following a discussion with Peter, we’d like to propose using a marker (perhaps “6” for ISO 639-3) in the language system tag to identify new codes that are based on ISO 639-3. The rest of the tag would be the upper-cased ISO tag, per Bob’s mail.
> 
> There is a problem with this approach: it leaves no way to extend a language tag to account for writing system variations like old Malayalam versus modern Malayalam. I would suggest that there will be cases where we will want to use a 4-letter code based on a 3-letter base tag.
> 
> How about using case? Karen -> KRN or sgw (when krn is the language tag for Sapo and sgw is the code for Sgaw Karen).
> 
> There may be other approaches. I'm flagging the issue rather than pushing a solution.
> 
> Yours,
> Martin

