Yate supports the re-encoding of text in three places: the Multi Field Editor, the Yate Transformations submenu found on text field context menus, and the Re-Encode action statement.
The following functions are available from all three sources. The function names may differ slightly depending on the source.
- Re-encode to Cyrillic
- Re-encode to Greek
- Re-encode to ISO Latin-2
- Re-encode to Turkish
- Re-encode to WinLatin-1
- Re-encode to WinLatin-2
The ID3 specification uses ISO Latin-1 as its 8-bit text encoding. In the past, before UTF-8 was supported, many people wrote their mp3 fields in languages whose characters are not supported in ISO Latin-1. When Yate reads these files, fields which declare an encoding of ISO Latin-1 may not display the correct characters if the stored data was not in fact ISO Latin-1.
The re-encode functions listed above let you specify the encoding the text was actually written in and attempt to re-encode the field accordingly. Modifications will be made wherever possible. Note that if a field currently contains characters which cannot be represented in ISO Latin-1, no modifications will occur.
The algorithm essentially re-encodes the Mac's internal representation of a string back to ISO Latin-1 and then encodes the raw data using your specified encoding.
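The algorithm described above can be sketched in Python. This is only an illustration, not Yate's implementation: the function name reencode is hypothetical, and the sketch assumes that the Cyrillic option corresponds to the Windows-1251 codec (cp1251).

```python
def reencode(text: str, actual_encoding: str) -> str:
    """Reinterpret text that was wrongly decoded as ISO Latin-1."""
    try:
        raw = text.encode("latin-1")    # recover the original raw bytes
    except UnicodeEncodeError:
        return text                     # not pure Latin-1: leave unmodified
    try:
        return raw.decode(actual_encoding)
    except UnicodeDecodeError:
        return text                     # bytes are invalid in the chosen encoding


# Simulate a Cyrillic field whose bytes were labelled as ISO Latin-1.
garbled = "Привет".encode("cp1251").decode("latin-1")
print(garbled)                        # the mis-decoded text
print(reencode(garbled, "cp1251"))    # the repaired text
print(reencode("Привет", "cp1251"))   # already correct, so returned unchanged
```

The first early return mirrors the rule that a field containing characters outside ISO Latin-1 is not modified.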
- Force Unicode UNFC
Unicode supports the encoding of most accented characters either as precomposed single characters or as decomposed sequences. For example, É has a string length of 1 when precomposed and a string length of 2 when decomposed. The string displays correctly regardless of the encoding. When Unicode UNFC is selected, the associated fields are converted to their precomposed encoding. UNFC stands for Unicode Normalization Form C. Note that this transformation should rarely be required.
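The difference between the two forms can be seen with Python's standard unicodedata module. This is a small illustration of the Unicode normalization forms themselves, not of Yate's code.

```python
import unicodedata

decomposed = "E\u0301"    # E followed by COMBINING ACUTE ACCENT
precomposed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(precomposed))    # 2 1
print(precomposed == "\u00c9")              # True: the single character É
print(unicodedata.normalize("NFD", precomposed) == decomposed)    # True
```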
- Force ISO Latin-1
This function attempts to ensure that every character in the result can be represented as an ISO Latin-1 character. It does so by changing various characters to their similar ISO Latin-1 equivalents, by removing accents if necessary and, as a last resort, by changing characters which cannot be represented as ISO Latin-1 to underscore characters. Unicode UNFC and Fold Characters are applied.
- Force ASCII
This function attempts to ensure that every character in the result can be represented as an ASCII character. It does so by changing various characters to their similar ASCII equivalents, by removing accents if necessary and, as a last resort, by changing characters which cannot be represented as ASCII to underscore characters. Unicode UNFC and Fold Characters are applied.
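The fallback order used by the two Force functions (fold, then remove accents, then underscore) can be sketched as follows. The names FOLD, strip_accents and force are hypothetical, and the fold table is a tiny stand-in for Yate's actual substitution list.

```python
import unicodedata

# Hypothetical fold table: a few quote and dash look-alikes only.
FOLD = {
    "\u2018": "'", "\u2019": "'",    # curly single quotes
    "\u201c": '"', "\u201d": '"',    # curly double quotes
    "\u2013": "-", "\u2014": "-",    # en dash and em dash
}


def strip_accents(ch: str) -> str:
    """Drop combining marks from the decomposed form of one character."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def force(text: str, encoding: str = "ascii") -> str:
    out = []
    for ch in unicodedata.normalize("NFC", text):
        ch = FOLD.get(ch, ch)
        # Try the character itself, then its unaccented form.
        for candidate in (ch, strip_accents(ch)):
            try:
                candidate.encode(encoding)
            except UnicodeEncodeError:
                continue
            out.append(candidate)
            break
        else:
            out.append("_")    # last resort: not representable at all
    return "".join(out)


print(force("\u201cCaf\u00e9\u201d \u2013 \u6771\u4eac"))    # "Cafe" - __
print(force("Erd\u0151s Caf\u00e9", "latin-1"))              # Erdos Café
```

Passing "latin-1" keeps é, which ISO Latin-1 can represent, while ő still loses its accent.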
- Remove Accents
This function re-encodes all accented characters to their baseline unaccented characters, wherever possible.
- Fold Characters
This function changes various characters to their similar Latin-1 equivalents. Currently this includes single and double quote equivalents as well as dash/hyphen equivalents. Unicode UNFC is applied. A complete list of the current substitutions can be found here.
The following functions are only available via the Re-Encode action statement.
- Force Unicode UNFD
Unicode supports the encoding of most accented characters either as precomposed single characters or as decomposed sequences. For example, É has a string length of 1 when precomposed and a string length of 2 when decomposed. The string displays correctly regardless of the encoding. When Unicode UNFD is selected, the associated fields are converted to their decomposed encoding. UNFD stands for Unicode Normalization Form D. Note that this transformation should rarely be required.
- Re-encode to ISO Latin-1 (Lossy)
This function re-encodes as ISO Latin-1, discarding characters which cannot be represented. This is the system implementation, which is included for historical reasons. Force ISO Latin-1 is a better choice.
- Re-encode to ASCII (Lossy)
This function re-encodes as ASCII, discarding characters which cannot be represented. This is the system implementation, which is included for historical reasons. Force ASCII is a better choice.
- Re-encode for JSON
Re-encodes the text for inclusion in a JSON string. Note the word string! Do not use this function to re-encode entire JSON sequences.
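The distinction between a JSON string and an entire JSON sequence can be illustrated in Python. The helper name encode_for_json_string is hypothetical, and the exact set of characters Yate escapes may differ from what json.dumps produces.

```python
import json


def encode_for_json_string(text: str) -> str:
    # json.dumps returns a complete string literal; drop the outer quotes
    # so that only the escaped content remains.
    return json.dumps(text, ensure_ascii=False)[1:-1]


title = 'She said "hi"\tand left\\'
fragment = encode_for_json_string(title)
document = '{"title": "' + fragment + '"}'

print(fragment)
print(json.loads(document)["title"] == title)    # True
```

Applying the same escaping to the whole of document would also escape its structural quotes and break the JSON, which is why the function is meant for string content only.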
- Remove RTF Formatting
If the data is properly structured RTF, the formatting will be removed leaving only the text.
- Remove Prompt Markup Sequences
Removes prompt markup sequences. If the source text does not start with <m>, it will be returned without modification.
- Remove HTML & Sequences
All HTML & sequences will be replaced with the characters they describe. The full source need not be valid HTML. Invalid & sequences are not modified.
- Add HTML & Sequences
The following transformations will be applied:
- & becomes &amp;
- < becomes &lt;
- > becomes &gt;
- " becomes &quot;
- ' becomes &apos;
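The two HTML functions can be approximated with Python's standard html module. This is only an illustration of the idea: Python writes the apostrophe as &#x27;, which may differ from the sequence Yate produces.

```python
import html

# Removing sequences: valid ones are decoded, unknown ones are left alone.
decoded = html.unescape("Tom &amp; Jerry &lt;live&gt; &bogus;")
print(decoded)    # Tom & Jerry <live> &bogus;

# Adding sequences: the five special characters are replaced.
source = "a < b & \"c\" > 'd'"
encoded = html.escape(source, quote=True)
print(encoded)    # a &lt; b &amp; &quot;c&quot; &gt; &#x27;d&#x27;

# The two operations are inverses for this input.
print(html.unescape(encoded) == source)    # True
```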
- Re-encode to Base64
The source is encoded as Base64.
- Decode Base64
The source is assumed to be Base64 and is decoded. If any errors occur, the returned value will be empty.
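A minimal sketch of the two Base64 functions, including the empty result on error, is shown below. The function names are hypothetical, and the sketch assumes the text is treated as UTF-8 bytes, which the description above does not state.

```python
import base64
import binascii


def to_base64(text: str) -> str:
    # Assumption: the text is converted to bytes as UTF-8.
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def from_base64(encoded: str) -> str:
    try:
        return base64.b64decode(encoded, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError, ValueError):
        return ""    # any error yields an empty result


print(to_base64("Yate"))             # WWF0ZQ==
print(from_base64("WWF0ZQ=="))       # Yate
print(repr(from_base64("not base64!")))    # ''
```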
- Escape for Regular Expression (Pattern)
Backslash characters are added to escape characters used in a regular expression pattern.
- Escape for Regular Expression (Template)
Backslash characters are added to escape characters used in a regular expression replace template.
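The purpose of the two escape functions can be shown with Python's re module. The helper names are hypothetical, and the set of special characters depends on the regular expression engine: in Python only the backslash is special in a replace template, while other engines also treat $ as special, so Yate's output may differ.

```python
import re


def escape_pattern(text: str) -> str:
    # Escape everything that has a special meaning in a pattern.
    return re.escape(text)


def escape_template(text: str) -> str:
    # In Python's re module only the backslash is special in a template.
    return text.replace("\\", "\\\\")


title = "Track (1.5) [live]"
# Escaped, the title matches itself literally despite ( ) . [ ].
print(re.fullmatch(escape_pattern(title), title) is not None)    # True

replacement = r"C:\new\1"
# Escaped, \n and \1 are inserted literally rather than interpreted.
print(re.sub("x", escape_template(replacement), "x") == replacement)    # True
```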
- Remove URL % Encoding
URL percent encoded sequences are converted back to their textual representation.
- Re-encode as URL Custom
The named variable URL Custom Encode Set is read to determine a list of characters to be percent escaped.
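Percent encoding against a custom character list, and its removal, can be sketched as follows. The function encode_custom and its escape_set parameter are hypothetical stand-ins for the behaviour driven by the URL Custom Encode Set variable, and the sketch assumes characters are escaped as UTF-8 bytes.

```python
from urllib.parse import unquote


def encode_custom(text: str, escape_set: str) -> str:
    out = []
    for ch in text:
        if ch in escape_set:
            # Escape each UTF-8 byte of the character as %XX.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


encoded = encode_custom("rock & roll/live", " &/")
print(encoded)             # rock%20%26%20roll%2Flive
print(unquote(encoded))    # rock & roll/live
```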
There are seven more functions dedicated to percent encoding content based on the characters which must be escaped for particular components of a URL. Each function is listed below together with the part of the following example URL to which its escaping applies:
http://username:password@www.site.com/index.html?name=value#pagelink
- Re-encode as URL Host Component
www.site.com
- Re-encode as URL Path Component
/index.html
- Re-encode as URL Fragment Component
pagelink
- Re-encode as URL Password Component
password
- Re-encode as URL Query Component
name=value
- Re-encode as URL Query Value Component
The value portion of name=value
- Re-encode as URL User Component
username
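Why the component matters can be illustrated with urllib.parse.quote, whose safe argument names the characters left unescaped. The helper names and the safe sets chosen here are assumptions for illustration only; they are not the sets Yate uses.

```python
from urllib.parse import quote


def encode_path(text: str) -> str:
    # A slash separates path segments, so it is left as-is.
    return quote(text, safe="/")


def encode_query_value(text: str) -> str:
    # Inside a value, & and = would be read as separators, so escape them.
    return quote(text, safe="")


def encode_password(text: str) -> str:
    # @ and : delimit the user information, so escape them.
    return quote(text, safe="")


print(encode_path("/my music/index.html"))    # /my%20music/index.html
print(encode_query_value("rock & roll"))      # rock%20%26%20roll
print(encode_password("p@ss:word"))           # p%40ss%3Aword
```

The same character can therefore be legal in one component and require escaping in another, which is the reason for a separate function per component.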