Re-Encoding Text

Yate supports re-encoding text via the Multi Field Editor, the Yate Transformations submenu found on text field context menus, and the Re-Encode action statement.

The following functions are available from each of the three sources above. The function names may differ slightly depending on the source.

Re-encode to Cyrillic
Re-encode to Greek
Re-encode to ISO Latin-2
Re-encode to Turkish
Re-encode to WinLatin-1
Re-encode to WinLatin-2

The ID3 specification uses ISO Latin-1 as its 8-bit text encoding. In the past, before UTF-8 was supported, many people wrote their mp3 fields in languages containing characters that ISO Latin-1 does not support. When Yate reads these files, fields which declare an encoding of ISO Latin-1 may not display the correct characters if the data was not in fact ISO Latin-1.

These functions let you name the encoding the text was actually written in and attempt to re-encode the field to it. Modifications will be made wherever possible. Note that if a field currently contains characters which cannot be represented in ISO Latin-1, no modifications will occur.

The algorithm essentially re-encodes the Mac's internal representation of a string back to ISO Latin-1 and then interprets the resulting raw bytes using your specified encoding.
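The round trip described above can be sketched in a few lines of Python. This is only an illustration, not Yate's implementation: the function name `reinterpret_latin1` is hypothetical, and `cp1251` is assumed as the code page behind "Cyrillic" purely for the example.

```python
def reinterpret_latin1(text: str, actual_encoding: str) -> str:
    """Re-read text that was wrongly decoded as ISO Latin-1 (illustrative sketch)."""
    try:
        # Recover the raw bytes by encoding the string back to ISO Latin-1.
        raw = text.encode("latin-1")
    except UnicodeEncodeError:
        # The field holds characters outside ISO Latin-1: make no modification.
        return text
    try:
        # Interpret the same bytes using the encoding the user specified.
        return raw.decode(actual_encoding)
    except UnicodeDecodeError:
        # The bytes are not valid in the chosen encoding: make no modification.
        return text


# "cp1251" is an assumption for this example only.
print(reinterpret_latin1("Ïðèâåò", "cp1251"))  # prints "Привет"
```

A string that already contains Cyrillic letters cannot be encoded to ISO Latin-1, so the sketch returns it unchanged, which mirrors the "no modifications will occur" rule.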

Force Unicode UNFC

Unicode can encode most accented characters either as a single precomposed character or as a decomposed sequence. For example, É has a string length of 1 when precomposed and 2 when decomposed. The string displays correctly in either form. When Unicode UNFC is selected, the associated fields are converted to their precomposed encoding. UNFC stands for Unicode Normalization Form C. Note that this transformation should rarely be required.
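The length difference between the two forms is easy to observe with Python's standard `unicodedata` module. This is a small sketch of Normalization Form C in general, not of Yate's own code.

```python
import unicodedata

decomposed = "E\u0301"  # E followed by COMBINING ACUTE ACCENT
precomposed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(precomposed))  # prints "2 1"
print(precomposed == "\u00c9")            # prints "True" (the single character É)
```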

Force ISO Latin-1

This function attempts to ensure that every character in the result can be represented as an ISO Latin-1 character. It does so by changing various characters to their similar ISO Latin-1 equivalents, removing accents if necessary and, as a last resort, changing characters which cannot be represented as ISO Latin-1 to underscore characters. Unicode UNFC and Fold Characters are applied.

Force ASCII

This function attempts to ensure that every character in the result can be represented as an ASCII character. It does so by changing various characters to their similar ASCII equivalents, removing accents if necessary and, as a last resort, changing characters which cannot be represented as ASCII to underscore characters. Unicode UNFC and Fold Characters are applied.
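A rough sketch of the three-stage fallback (substitute, strip accents, underscore) might look like the following. The `FOLD` table is a small hypothetical subset invented for illustration; Yate's actual substitution list is not reproduced here, and `force_ascii` is not a Yate function name.

```python
import unicodedata

# Hypothetical subset of quote and dash substitutions, for illustration only.
FOLD = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
}


def force_ascii(text: str) -> str:
    """Map every character to ASCII, falling back to an underscore."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        ch = FOLD.get(ch, ch)  # stage 1: similar-character substitution
        if ord(ch) > 127:
            # Stage 2: decompose and drop combining marks (removes accents).
            ch = "".join(
                c for c in unicodedata.normalize("NFD", ch)
                if not unicodedata.combining(c)
            )
        # Stage 3: anything still outside ASCII becomes an underscore.
        out.append(ch if ch and all(ord(c) < 128 for c in ch) else "_")
    return "".join(out)


print(force_ascii("\u201cCaf\u00e9\u201d \u2013 \u65e5\u672c"))  # prints "Cafe" - __
```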

Remove Accents

This function re-encodes all accented characters to their baseline unaccented characters, wherever possible.

Fold Characters

This function changes various characters to their similar Latin-1 equivalents. Currently this includes single and double quote equivalents as well as dash/hyphen equivalents. Unicode UNFC is applied. A complete list of the current substitutions can be found here.

The following functions are only available via the Re-Encode action statement.

Force Unicode UNFD

Unicode can encode most accented characters either as a single precomposed character or as a decomposed sequence. For example, É has a string length of 1 when precomposed and 2 when decomposed. The string displays correctly in either form. When Unicode UNFD is selected, the associated fields are converted to their decomposed encoding. UNFD stands for Unicode Normalization Form D. Note that this transformation should rarely be required.

Re-encode to ISO Latin-1 (Lossy)

This function re-encodes as ISO Latin-1, discarding characters which cannot be represented. This is the system implementation, which is included for historical reasons. Force ISO Latin-1 is a better choice.

Re-encode to ASCII (Lossy)

This function re-encodes as ASCII, discarding characters which cannot be represented. This is the system implementation, which is included for historical reasons. Force ASCII is a better choice.
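To see why a lossy conversion is the weaker choice, consider this minimal Python sketch of "discard what cannot be represented". It uses Python's own codec error handling as a stand-in and is not the macOS system routine; `lossy_ascii` is a hypothetical name.

```python
def lossy_ascii(text: str) -> str:
    # errors="ignore" silently drops every character that ASCII cannot hold.
    return text.encode("ascii", errors="ignore").decode("ascii")


print(lossy_ascii("Caf\u00e9"))  # prints "Caf" (the é is lost, not replaced)
```

The accented letter simply disappears, whereas an accent-removing approach would keep the base letter.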

Re-encode for JSON

Re-encodes the text for inclusion in a JSON string. Note the word string! Do not use this function to re-encode entire JSON sequences.
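The distinction between escaping a string and escaping a whole JSON document can be illustrated in Python. The helper name `escape_for_json_string` is hypothetical, and the exact set of characters Yate escapes may differ from what `json.dumps` produces.

```python
import json


def escape_for_json_string(text: str) -> str:
    # json.dumps returns a complete quoted JSON string; strip the outer quotes
    # so the result can be placed between quotes the caller supplies.
    return json.dumps(text, ensure_ascii=False)[1:-1]


title = 'He said "hi"\n'
document = '{"title": "' + escape_for_json_string(title) + '"}'
print(document)
```

Applying the same escaping to `document` as a whole would escape its structural quotes and break it, which is why the function is meant only for the text placed inside a string.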

Remove RTF Formatting

If the data is properly structured RTF, the formatting will be removed leaving only the text.

Remove Prompt Markup Sequences

Remove prompt markup sequences. If the source text does not start with <m>, it will be returned without modification.

Remove HTML & Sequences

All HTML & sequences will be replaced with the characters they describe. The full source need not be valid HTML. Invalid & sequences are not modified.
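Python's standard `html.unescape` behaves in a comparable way and can serve as a sketch of the idea. It is not Yate's implementation, and edge cases (such as sequences missing their semicolon) may be handled differently.

```python
import html

# The input is a plain fragment, not a full HTML document.
source = "Tom &amp; Jerry &lt;3 &#233; &zzz;"
print(html.unescape(source))  # prints "Tom & Jerry <3 é &zzz;"
```

The unknown sequence `&zzz;` passes through untouched.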

Add HTML & Sequences

The following transformations will be applied:

Character    Replacement
&            &amp;
<            &lt;
>            &gt;
"            &quot;
'            &#39;

Re-encode to Base64

The source is encoded as Base64.

Decode Base64

The source is assumed to be Base64 and is decoded. If any errors occur, the returned value will be empty.
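The encode and decode pair, including the "empty on error" behaviour, can be sketched as follows. The function names are hypothetical, and UTF-8 is assumed as the byte representation of the text, which the description above does not state.

```python
import base64


def to_base64(text: str) -> str:
    # Assumption: the text is converted to bytes as UTF-8 before encoding.
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def from_base64(text: str) -> str:
    try:
        # validate=True rejects characters outside the Base64 alphabet.
        return base64.b64decode(text, validate=True).decode("utf-8")
    except ValueError:
        # Covers malformed Base64 and bytes that are not valid UTF-8.
        return ""


print(to_base64("Yate"))           # prints "WWF0ZQ=="
print(from_base64("WWF0ZQ=="))     # prints "Yate"
print(repr(from_base64("oops!")))  # prints "''"
```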

Escape for Regular Expression (Pattern)

Backslash characters are added to escape characters used in a regular expression pattern.

Escape for Regular Expression (Template)

Backslash characters are added to escape characters used in a regular expression replace template.
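Pattern escaping and template escaping solve different problems, as this Python sketch shows. It uses Python's `re` module, whose special characters are not identical to those of the regular expression engine Yate uses, so treat it as an analogy only; both function names are hypothetical.

```python
import re


def escape_pattern(text: str) -> str:
    # Escapes characters that would otherwise act as pattern metacharacters.
    return re.escape(text)


def escape_template(text: str) -> str:
    # In Python's re.sub, the backslash is the only special template character.
    return text.replace("\\", "\\\\")


# "a.b" is matched literally; the replacement is inserted verbatim.
result = re.sub(escape_pattern("a.b"), escape_template(r"\1$"), "a.b axb")
print(result)  # prints "\1$ axb"
```

Without the escapes, `a.b` would also match `axb`, and `\1` in the template would be treated as a reference to a capture group that does not exist.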

Remove URL % Encoding

URL percent encoded sequences are converted back to their textual representation.
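For comparison, the same conversion in Python uses `urllib.parse.unquote`, which decodes percent sequences as UTF-8 by default. This is a sketch of percent decoding in general rather than of Yate's code.

```python
from urllib.parse import unquote

print(unquote("Caf%C3%A9%20del%20Mar"))  # prints "Café del Mar"
```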

Re-encode as URL Custom

The named variable URL Custom Encode Set is read to determine a list of characters to be percent escaped.
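A minimal sketch of escaping driven by a caller-supplied character list is shown below. The function `percent_encode` and its `escape_set` parameter are hypothetical stand-ins for the named variable, and UTF-8 is assumed for the escaped bytes.

```python
def percent_encode(text: str, escape_set: str) -> str:
    """Percent-escape only the characters listed in escape_set."""
    out = []
    for ch in text:
        if ch in escape_set:
            # Emit one %XX triplet per UTF-8 byte of the character.
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


print(percent_encode("a b&c", " &"))  # prints "a%20b%26c"
```

In this sketch a literal `%` is escaped only if it appears in the set, so include it when the result needs to decode back to the original unambiguously.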

Seven more functions percent-encode content based on the characters which must be escaped for a particular component of a URL. Each function is listed below together with the portion of the following URL to which its escaping rules apply:

http://username:password@www.site.com/index.html?name=value#pagelink

Re-encode as URL Host Component

www.site.com

Re-encode as URL Path Component

/index.html

Re-encode as URL Fragment Component

pagelink

Re-encode as URL Password Component

password

Re-encode as URL Query Component

name=value

Re-encode as URL Query Value Component

The value portion of name=value

Re-encode as URL User Component

username
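The component breakdown above, and the reason each component needs its own escaping rules, can be sketched with Python's `urllib.parse`. The `safe` sets chosen here are illustrative assumptions and are not claimed to match the exact character sets Yate applies.

```python
from urllib.parse import quote, urlsplit

parts = urlsplit("http://username:password@www.site.com/index.html?name=value#pagelink")
print(parts.username, parts.password, parts.hostname)
print(parts.path, parts.query, parts.fragment)

# A path keeps "/" as a separator, so it is left unescaped here.
path = quote("/my music/index.html", safe="/")
# A query value must escape "&" and "=", which delimit name=value pairs.
value = quote("rock & roll", safe="")

print(path)   # prints "/my%20music/index.html"
print(value)  # prints "rock%20%26%20roll"
```

Escaping a full path with the query value rules would turn its `/` separators into `%2F`, which is why the choice of function depends on where the text will be placed in the URL.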