Unicode Strings

Mac strings are stored internally as a sequence of 16-bit values (UTF-16). The length of a string is the number of UTF-16 items stored, which may or may not equal the number of characters in the string. Certain Unicode characters, such as emojis, occupy more than a single UTF-16 item. These characters are called composed characters.
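The difference between the two counts can be checked outside of Yate. The following is a minimal Python sketch, not Yate code; the function name utf16_length is hypothetical and chosen for illustration only.

```python
def utf16_length(text):
    """Return the number of 16-bit UTF-16 items needed to store text."""
    # Each UTF-16 item is two bytes; "utf-16-le" adds no byte-order mark.
    return len(text.encode("utf-16-le")) // 2


sample = "\U0001F600abc"      # a grinning face emoji followed by "abc"
print(utf16_length(sample))   # 5 UTF-16 items
print(len(sample))            # 4 Unicode code points
```

The emoji accounts for two of the five UTF-16 items, which is why the stored length exceeds the visible character count by one.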

For almost all tagging situations this information can be ignored. Comparing and searching is not affected. Only when using an absolute reference to a position in a string is there a possibility for error. Four action statements are affected by this internal representation and you should be aware of the ramifications. The statements are:

Index Of

Format Numeric

Length

Substring

The Index Of Statement

The Index Of statement returns an index into the list of UTF-16 items comprising a string. This index is always the correct location for the start of the matched pattern, but it may not be a character index. For example:

Set named variable 'test' to "😀abc"
Set named variable 'location' to the index of "abc" in named variable 'test'

The above sequence will result in named variable location being set to 2. This is because the 😀 character requires two UTF-16 items (indexes 0 and 1), so "abc" starts at index 2.
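The same result can be reproduced in a short Python sketch. This is an illustration under assumptions, not Yate's implementation: the function name utf16_index_of is hypothetical, and returning -1 when the pattern is absent is an assumed convention.

```python
def utf16_index_of(pattern, text):
    """Return the 0-based UTF-16 index of pattern in text, or -1 if absent."""
    position = text.find(pattern)  # index counted in code points
    if position < 0:
        return -1  # assumed not-found marker
    # Count the UTF-16 items occupied by everything before the match.
    return len(text[:position].encode("utf-16-le")) // 2


test = "\U0001F600abc"
print(utf16_index_of("abc", test))  # 2, counted in UTF-16 items
print(test.find("abc"))             # 1, counted in code points
```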

The Format Numeric Statement

The statement's As Integer Value of Unicode Character and As Hexadecimal Value of Unicode Character functions always return the value of the Unicode character at the start of the string, regardless of the number of UTF-16 character positions it occupies. The functions set the Character Length named variable to the number of UTF-16 items comprising the character.

Unicode characters are represented by values called code points, and every attempt is made to return the code point which describes the character whose value is being extracted. There is a well defined algorithm to form a code point from a character which occupies more than a single UTF-16 item. To complicate matters, however, some Unicode characters are followed by a Variant code which changes the graphical rendition of the character. So that the value returned by these two functions can be recreated by the As Unicode Character from Value function, the returned value must be encoded in a non-standard manner when there is a Variant code. Yate uses the following method to represent a code point followed by a variant:

(variant code << 32) | code point

example: ☺️ has the following UTF-16 items: 0x263A, 0xFE0F
the code point is: 0x263A
the variant value is 0xFE0F
the returned value is: 0xFE0F0000263A

example: 😐︎ has the following UTF-16 items: 0xD83D, 0xDE10, 0xFE0E
the computed code point is: 0x1F610
the variant value is: 0xFE0E
the returned value is: 0xFE0E0001F610
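The computed code point in the second example comes from the standard UTF-16 surrogate pair calculation. A minimal Python sketch of that calculation follows; the function name combine_surrogates is hypothetical.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high and low surrogate into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # The high surrogate carries the upper 10 bits, the low one the lower 10.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(hex(combine_surrogates(0xD83D, 0xDE10)))  # 0x1f610
```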

The As Unicode Character from Value function will correctly recreate a character from the value returned by the As Integer Value of Unicode Character and As Hexadecimal Value of Unicode Character functions.
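The round trip can be sketched in Python as follows. This only illustrates the (variant code << 32) | code point layout described above and is not Yate's own code; the names pack, unpack and to_string are hypothetical.

```python
VARIANT_SHIFT = 32  # the variant code sits above the low 32 bits


def pack(code_point, variant=0):
    """Combine a code point and an optional variant code into one value."""
    return (variant << VARIANT_SHIFT) | code_point


def unpack(value):
    """Split a packed value back into (code point, variant code)."""
    return value & 0xFFFFFFFF, value >> VARIANT_SHIFT


def to_string(value):
    """Recreate the character, appending the variant code when present."""
    code_point, variant = unpack(value)
    text = chr(code_point)
    if variant:
        text += chr(variant)
    return text


print(hex(pack(0x263A, 0xFE0F)))    # 0xfe0f0000263a
print(hex(pack(0x1F610, 0xFE0E)))   # 0xfe0e0001f610
print(to_string(0xFE0F0000263A) == "\u263A\uFE0F")  # True
```

A value with no variant code is simply the code point itself, so ordinary characters are unaffected by this layout.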

The Length Statement

The Length statement returns the number of UTF-16 items in a string. This number may be greater than the actual number of characters. The following snippet will set named variable count to the number of characters in the string contained in named variable source.

1: Set named variable 'loc' to "0"
2: Save the length of named variable 'source' to named variable 'count'
3: Repeat Forever loop
4: Start loop
5: Index of Composed Unicode character starting at index '\<loc>' of named variable 'source' -> named variable 'next'
6: Test if the integer value of named variable 'next' == "-1" (Set test state)
7: Exit Repeat if true
8: Evaluate Expression "\<next>+\<Character Length>" save integer result to named variable 'loc'
9: Evaluate Expression "\<count>-\<Character Length>+1" save integer result to named variable 'count'

If named variable source contains "😀abc😃def", named variable count will be set to 8. Note that the length would be 10.
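For readers who want to trace the arithmetic, here is a rough Python equivalent of the snippet. It is a sketch under a stated assumption: only surrogate pairs are treated as composed characters, which may be narrower than what Yate detects. All function names are hypothetical.

```python
def utf16_units(text):
    """Return the UTF-16 items of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def index_of_composed(units, start):
    """Find the next surrogate pair at or after start.

    Returns (index, character length), or (-1, 0) when none is found.
    Assumption: only surrogate pairs count as composed characters.
    """
    for i in range(start, len(units) - 1):
        if 0xD800 <= units[i] <= 0xDBFF and 0xDC00 <= units[i + 1] <= 0xDFFF:
            return i, 2
    return -1, 0


def count_characters(source):
    units = utf16_units(source)
    loc = 0                                       # statement 1
    count = len(units)                            # statement 2
    while True:                                   # statements 3 and 4
        nxt, character_length = index_of_composed(units, loc)  # statement 5
        if nxt == -1:                             # statements 6 and 7
            break
        loc = nxt + character_length              # statement 8
        count = count - character_length + 1      # statement 9
    return count


source = "\U0001F600abc\U0001F603def"
print(len(utf16_units(source)))   # 10
print(count_characters(source))   # 8
```

Each located composed character reduces the count by one less than its Character Length, so two emojis of two items each bring 10 down to 8.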

The Substring Statement

The With Index & Length, To Index, From Index and With Range functions base their implicit or explicit index and length values on UTF-16 items.
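To illustrate what an index and length counted in UTF-16 items mean in practice, consider the following Python sketch. It is not Yate code; the function name utf16_substring is hypothetical, and how Yate itself treats a range that splits a composed character is not covered here.

```python
def utf16_substring(text, index, length):
    """Return length UTF-16 items of text, starting at UTF-16 item index."""
    data = text.encode("utf-16-le")  # two bytes per UTF-16 item
    piece = data[2 * index:2 * (index + length)]
    # In this sketch, decoding raises UnicodeDecodeError if the range
    # cuts a two-item character in half.
    return piece.decode("utf-16-le")


test = "\U0001F600abc"
print(utf16_substring(test, 2, 3))                  # abc
print(utf16_substring(test, 0, 2) == "\U0001F600")  # True
```

Extracting the emoji requires a length of 2, and "abc" begins at index 2 rather than index 1.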

The Unicode Character function always extracts a single Unicode character at the specified UTF-16 index, regardless of the number of UTF-16 character positions it occupies. The function sets the Character Length named variable to the number of UTF-16 items comprising the extracted character.

The Integer Value of Unicode Character function extracts the value of the Unicode character at the specified UTF-16 index, regardless of the number of UTF-16 character positions it occupies. The function sets the Character Length named variable to the number of UTF-16 items comprising the extracted character.
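The behavior of these two functions can be approximated by the following Python sketch. It rests on assumptions that should be read as such: the function name unicode_character_at is hypothetical, a surrogate pair is taken as one character, and a directly following variant code (0xFE0E or 0xFE0F) is assumed to belong to the character.

```python
def utf16_units(text):
    """Return the UTF-16 items of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def unicode_character_at(text, index):
    """Return (character, character length) at a UTF-16 index."""
    units = utf16_units(text)
    length = 1
    first = units[index]
    # A high surrogate followed by a low surrogate forms one character.
    if (0xD800 <= first <= 0xDBFF and index + 1 < len(units)
            and 0xDC00 <= units[index + 1] <= 0xDFFF):
        length = 2
    # Assumption: a trailing variant code is part of the character.
    if index + length < len(units) and units[index + length] in (0xFE0E, 0xFE0F):
        length += 1
    data = b"".join(u.to_bytes(2, "little") for u in units[index:index + length])
    return data.decode("utf-16-le"), length


character, character_length = unicode_character_at("\U0001F600abc", 0)
print(character_length)                      # 2
print(unicode_character_at("\U0001F600abc", 2))  # ('a', 1)
```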

The Index of Composed Unicode Character function returns the index of the located composed character. This index is a number of UTF-16 items and can correctly be passed to the Unicode Character function to extract the character. The function sets the Character Length named variable to the number of UTF-16 items comprising the located character. If a composed character is not located, the Character Length will be 0 and the returned index will be -1.

Statement 5 in the preceding code snippet is a Substring statement with the Index of Composed Unicode Character function.

Warning

Not all emojis display on all versions of macOS, and some may display differently. There may even be hardware dependencies: we have different Macs running the same version of macOS which handle emojis differently.

Summary

As noted earlier, this information can be ignored for the majority of tagging purposes. It is only a potential issue if you are extracting and manipulating individual characters as opposed to strings. Even then, it is not an issue for typical characters.

Filename Lengths