UTF-8 string routines

These functions are declared in the main Allegro header file:

 #include <allegro5/allegro.h>

About UTF-8 string routines

Some parts of the Allegro API, such as the font routines, expect Unicode strings encoded in UTF-8. The following basic routines are provided to help you work with UTF-8 strings, but you are not required to use them. Consider another library (e.g. ICU) if you need more functionality.

Briefly, Unicode is a standard consisting of a large character set of over 100,000 characters, and rules, such as how to sort strings. A code point is the integer value of a character, but not all code points are characters, as some code points have other uses. Unlike legacy character sets, the set of code points is open ended and more are assigned with time.

Clearly it is impossible to represent each code point with an 8-bit byte (limited to 256 code points) or even a 16-bit integer (limited to 65536 code points). It is possible to store code points in 32-bit integers, but this is space inefficient and not actually that useful (at least, when handling the full complexity of Unicode; Allegro only does the very basics). There exist different Unicode Transformation Formats for encoding code points into smaller code units. The most important transformation formats are UTF-8 and UTF-16.

UTF-8 is a variable-length encoding which encodes each code point to between one and four 8-bit bytes. UTF-8 has many nice properties, but the main advantages are that it is backwards compatible with C strings, and that ASCII characters (code points in the range 0-127) are encoded in UTF-8 exactly as they would be in ASCII.
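To make the variable-length scheme concrete, here is a minimal sketch of encoding a single code point to UTF-8 in plain C. The function name utf8_encode is hypothetical and is not part of the Allegro API; it only illustrates the bit layout of one- to four-byte sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an Allegro function.
 * Encode one code point as UTF-8 into out[0..3].
 * Returns the number of bytes written (1 to 4), or 0 if cp cannot be
 * encoded (a UTF-16 surrogate or a value above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                      /* ASCII: stored unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx followed by three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}
```

For example, passing U+00E5 yields the two bytes 0xC3, 0xA5, while U+006C yields the single byte 0x6C.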

UTF-16 is another variable-length encoding, but it encodes each code point to one or two 16-bit words. It is, of course, not compatible with traditional C strings. Allegro does not generally use UTF-16 strings.
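For comparison, the following sketch shows how one code point maps to one or two 16-bit code units. The function name utf16_encode is hypothetical and is not part of Allegro; the two-unit case uses the standard surrogate pair arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an Allegro function.
 * Encode one code point as UTF-16 into out[0..1].
 * Returns the number of 16-bit code units written (1 or 2), or 0 if cp
 * is a surrogate value or lies above U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                         /* reserved for surrogate pairs */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;            /* one code unit, value unchanged */
        return 1;
    }
    if (cp <= 0x10FFFF) {
        cp -= 0x10000;                    /* 20 bits remain */
        out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
        return 2;
    }
    return 0;
}
```

The byte order in which each 16-bit unit is stored is a separate decision, which this sketch leaves to the caller.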

Here is a diagram of the representation of the word “ål”, with a NUL terminator, in both UTF-8 and UTF-16.

                   ---------------- ---------------- --------------
           String         å                l              NUL
                   ---------------- ---------------- --------------
      Code points    U+00E5 (229)     U+006C (108)     U+0000 (0)
                   ---------------- ---------------- --------------
      UTF-8 bytes     0xC3, 0xA5          0x6C            0x00
                   ---------------- ---------------- --------------
   UTF-16LE bytes     0xE5, 0x00       0x6C, 0x00      0x00, 0x00
                   ---------------- ---------------- --------------

You can see the aforementioned properties of UTF-8. The first code point U+00E5 (“å”) is outside of the ASCII range (0-127) so is encoded to multiple code units – it requires two bytes. U+006C (“l”) and U+0000 (NUL) both exist in the ASCII range so take exactly one byte each, as in a pure ASCII string. A zero byte never appears except to represent the NUL character, so many functions which expect C-style strings will work with UTF-8 strings without modification.
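The following sketch illustrates the point about C-style functions using the “ål” string from the diagram. Standard functions such as strlen and strchr operate on the UTF-8 bytes unmodified, but they count and search bytes, not code points. The helper utf8_count_code_points is hypothetical, not an Allegro function, and it assumes the input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not an Allegro function.
 * Count the code points in a NUL-terminated UTF-8 string by counting
 * every byte that is not a continuation byte (10xxxxxx).
 * Assumes the input is valid UTF-8. */
size_t utf8_count_code_points(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

With the string "\xC3\xA5" "l", strlen reports 3 bytes while utf8_count_code_points reports 2 code points, and strchr still finds the ASCII character 'l' at byte offset 2.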

On the other hand, UTF-16 represents each code point by either one or two 16-bit code units (two or four bytes). The representation of each 16-bit code unit depends on the byte order; here we have demonstrated little endian.

Both UTF-8 and UTF-16 are self-synchronising. Starting from any offset within a string, it is efficient to find the beginning of the previous or next code point.
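As a rough illustration of self-synchronisation in UTF-8, the sketch below steps to the previous or next code point boundary from an arbitrary byte offset, relying only on the fact that continuation bytes have the form 10xxxxxx. The names utf8_prev and utf8_next are hypothetical and the code is not Allegro's implementation; it assumes well-formed input.

```c
#include <assert.h>
#include <stddef.h>

/* True if b is a UTF-8 continuation byte (10xxxxxx). */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: return the offset of the first byte of the code
 * point that starts before pos. Returns 0 if pos is already 0. */
size_t utf8_prev(const unsigned char *s, size_t pos)
{
    assert(s != NULL);
    if (pos == 0)
        return 0;
    pos--;
    while (pos > 0 && is_continuation(s[pos]))
        pos--;
    return pos;
}

/* Hypothetical helper: return the offset of the first byte of the code
 * point that starts after pos, or size if there is none. */
size_t utf8_next(const unsigned char *s, size_t size, size_t pos)
{
    assert(s != NULL);
    if (pos >= size)
        return size;
    pos++;
    while (pos < size && is_continuation(s[pos]))
        pos++;
    return pos;
}
```

Because a code point occupies at most four bytes, each call inspects only a handful of bytes regardless of the string length.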

Not all sequences of bytes or 16-bit words are valid UTF-8 and UTF-16 strings respectively. UTF-8 also has an additional problem of overlong forms, where a code point value is encoded using more bytes than is strictly necessary. This is invalid and needs to be guarded against.
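One way to guard against overlong forms is to check, after decoding, that the resulting value could not have fit in a shorter sequence. The decoder below is a minimal sketch under that approach; utf8_decode is a hypothetical name and this is not how Allegro itself is known to implement the check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an Allegro function.
 * Decode one code point from s[0..size-1].
 * On success stores the value in *cp and returns the number of bytes
 * consumed (1 to 4). Returns 0 if the sequence is truncated, malformed,
 * overlong, a surrogate, or above U+10FFFF. */
size_t utf8_decode(const unsigned char *s, size_t size, uint32_t *cp)
{
    /* Smallest code point that legitimately needs a sequence of
     * the given length (index = length in bytes). */
    static const uint32_t min_value[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t c;
    size_t len, i;

    assert(s != NULL && cp != NULL);
    if (size == 0)
        return 0;

    if (s[0] < 0x80)                { len = 1; c = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else
        return 0;                   /* stray continuation or invalid lead byte */

    if (len > size)
        return 0;                   /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* expected a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }

    if (c < min_value[len])
        return 0;                   /* overlong form */
    if (c >= 0xD800 && c <= 0xDFFF)
        return 0;                   /* surrogate value */
    if (c > 0x10FFFF)
        return 0;                   /* outside the Unicode range */

    *cp = c;
    return len;
}
```

For instance, the two-byte sequence 0xC0, 0x80 decodes to the value 0, which fits in one byte, so it is rejected as an overlong encoding of NUL.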

In the following “ustr” functions, check carefully whether a function takes code unit (byte) indices or code point indices. In general, all position parameters are code unit offsets. This may be surprising, but if you think about it, it is required for good performance. (It also means some functions will work even if the strings do not contain UTF-8, since they only care about storing bytes, so you may actually store arbitrary data in ALLEGRO_USTRs.)

For actual text processing, where you want to specify positions with code point indices, you should use al_ustr_offset to find the corresponding code unit offset. However, most of the time you would probably just work with byte offsets.
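To show what converting a code point index to a code unit offset involves, here is a plain C sketch that walks a UTF-8 buffer. It is only an illustration of the idea behind al_ustr_offset, not Allegro's implementation; the name utf8_offset and its exact behaviour at the end of the buffer are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not an Allegro function.
 * Return the byte offset at which the code point with the given 0-based
 * index starts in s[0..size-1]. If the buffer holds fewer code points,
 * return size. Assumes the buffer is valid UTF-8. */
size_t utf8_offset(const unsigned char *s, size_t size, size_t index)
{
    size_t pos = 0;

    assert(s != NULL);
    while (pos < size && index > 0) {
        pos++;                                    /* skip the lead byte */
        while (pos < size && (s[pos] & 0xC0) == 0x80)
            pos++;                                /* skip continuation bytes */
        index--;
    }
    return pos;
}
```

The walk is linear in the number of bytes before the target, which is one reason position parameters expressed as byte offsets are cheaper to work with than code point indices.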

UTF-8 string types

ALLEGRO_USTR

typedef struct _al_tagbstring ALLEGRO_USTR;

About UTF-8 string routines

UTF-8 string types

ALLEGRO_USTR

ALLEGRO_USTR_INFO

Creating and destroying strings

al_ustr_new

al_ustr_new_from_buffer

al_ustr_newf

al_ustr_free

al_cstr

al_ustr_to_buffer

al_cstr_dup

al_ustr_dup

al_ustr_dup_substr

Predefined strings

al_ustr_empty_string

Creating strings by referencing other data

al_ref_cstr

al_ref_buffer

al_ref_ustr

al_ref_info

Sizes and offsets

al_ustr_size

al_ustr_length

al_ustr_offset

al_ustr_next

al_ustr_prev

Getting code points

al_ustr_get

al_ustr_get_next

al_ustr_prev_get

Inserting into strings

al_ustr_insert

al_ustr_insert_cstr

al_ustr_insert_chr

Appending to strings

al_ustr_append

al_ustr_append_cstr

al_ustr_append_chr

al_ustr_appendf

al_ustr_vappendf

Removing parts of strings

al_ustr_remove_chr

al_ustr_remove_range

al_ustr_truncate

al_ustr_ltrim_ws

al_ustr_rtrim_ws

al_ustr_trim_ws

Assigning one string to another

al_ustr_assign

al_ustr_assign_substr

al_ustr_assign_cstr

Replacing parts of string

al_ustr_set_chr

al_ustr_replace_range

Searching

al_ustr_find_chr

al_ustr_rfind_chr

al_ustr_find_set

al_ustr_find_set_cstr

al_ustr_find_cset

al_ustr_find_cset_cstr

al_ustr_find_str

al_ustr_find_cstr

al_ustr_rfind_str

al_ustr_rfind_cstr

al_ustr_find_replace

al_ustr_find_replace_cstr

Comparing

al_ustr_equal

al_ustr_compare

al_ustr_ncompare

al_ustr_has_prefix

al_ustr_has_prefix_cstr

al_ustr_has_suffix

al_ustr_has_suffix_cstr

UTF-16 conversion

al_ustr_new_from_utf16

al_ustr_size_utf16

al_ustr_encode_utf16