UTF-8 string routines

Introduction

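Unicode (UCS) assigns each character a number called a code point, in the range 0 to 0x10FFFF. UTF-8 encodes each code point as a sequence of one to four bytes. Code points 0 to 127 (ASCII) are encoded as a single byte with the same value, so every ASCII string is also a valid UTF-8 string.

A code point is not always what a user would call a "character". A single displayed character may be made up of several code points, for example a base letter followed by a combining accent. The functions here work on bytes and code points only.

Ranges are given as half-open intervals: [start_pos, end_pos) includes the byte at start_pos but excludes the byte at end_pos, so the interval covers end_pos - start_pos bytes.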

Always check whether a function takes byte offsets or code point indices. In general, all position parameters are byte offsets, not code point indices. This may be surprising, but the functions are designed to be fast and to work with arbitrary byte buffers, so UTF-8 decoding is not done by default.

For actual text processing, where you want to specify positions with code point indices, you should use al_ustr_offset to find the byte position of a code point.
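To make the difference between the two kinds of position concrete, here is a minimal sketch in plain C of how a code point index can be mapped to a byte offset. It does not use Allegro; utf8_offset_of_index is a hypothetical helper written for this illustration, it assumes well-formed UTF-8, and it handles only non-negative indices.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Allegro: return the byte offset of the
 * code point with the given index in a well-formed UTF-8 buffer.
 * If index is past the end, return size (the end of the buffer). */
size_t utf8_offset_of_index(const char *buf, size_t size, size_t index)
{
   size_t pos = 0;

   assert(buf != NULL);
   while (pos < size && index > 0) {
      /* Step over the lead byte, then over any continuation bytes,
       * which have the bit pattern 10xxxxxx. */
      pos++;
      while (pos < size && ((unsigned char)buf[pos] & 0xC0) == 0x80)
         pos++;
      index--;
   }
   return pos;
}
```

For the 7-byte string made of 'a', U+00E9, U+20AC and 'b', code point index 2 maps to byte offset 3, because U+00E9 occupies two bytes.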

UTF-8 string types

ALLEGRO_USTR

typedef struct _al_tagbstring ALLEGRO_USTR;

ALLEGRO_USTR_INFO

typedef struct ALLEGRO_USTR_INFO ALLEGRO_USTR_INFO;

Creating and destroying strings

al_ustr_new

ALLEGRO_USTR *al_ustr_new(const char *s)

Create a new string containing a copy of the C-style string s. The string must eventually be freed with al_ustr_free.

al_ustr_new_from_buffer

ALLEGRO_USTR *al_ustr_new_from_buffer(const char *s, size_t size)

Create a new string containing a copy of the buffer pointed to by s of the given size. The string must eventually be freed with al_ustr_free.

al_ustr_newf

ALLEGRO_USTR *al_ustr_newf(const char *fmt, ...)

Create a new string using a printf-style format string.

Notes:

The "%s" specifier takes C string arguments, not ALLEGRO_USTRs. Therefore to pass an ALLEGRO_USTR as a parameter you must use al_cstr, and it must be NUL terminated. If the string contains an embedded NUL byte everything from that byte onwards will be ignored.

The "%c" specifier outputs a single byte, not the UTF-8 encoding of a code point. Therefore it's only usable for ASCII characters (value <= 127) or if you really mean to output byte values from 128 to 255. To insert the UTF-8 encoding of a code point, encode it into a memory buffer using al_utf8_encode then use the "%s" specifier. Remember to NUL terminate the buffer.
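The "%c" pitfall can be reproduced with the standard snprintf, which treats "%c" and "%s" the same way. The sketch below does not use Allegro: both function names are hypothetical, and the two bytes for U+00E9 are hard-coded where real code would call al_utf8_encode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: "%c" writes the single byte 0xE9, which on its own
 * is not a valid UTF-8 sequence. */
int format_with_percent_c(char *out, size_t out_size)
{
   assert(out != NULL && out_size > 0);
   return snprintf(out, out_size, "caf%c", 0xE9);
}

/* Hypothetical helper: put the UTF-8 encoding of U+00E9 into a buffer,
 * NUL terminate it, then insert it with "%s". */
int format_with_percent_s(char *out, size_t out_size)
{
   char encoded[5];            /* up to 4 bytes plus the terminator */

   assert(out != NULL && out_size > 0);
   encoded[0] = (char)0xC3;    /* U+00E9 encodes as 0xC3 0xA9 */
   encoded[1] = (char)0xA9;
   encoded[2] = '\0';          /* remember to NUL terminate */
   return snprintf(out, out_size, "caf%s", encoded);
}
```

The first function produces four bytes, the second produces five; only the second result is valid UTF-8.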

al_ustr_free

void al_ustr_free(ALLEGRO_USTR *us)

Free a previously allocated string.

al_cstr

const char *al_cstr(const ALLEGRO_USTR *us)

Get a char * pointer to the data in a string. This pointer will only be valid while the underlying string is not modified and not destroyed. The pointer may be passed to functions expecting C-style strings, with the following caveats:

  • ALLEGRO_USTRs are allowed to contain embedded NUL ('\0') bytes. That means al_ustr_size(u) and strlen(al_cstr(u)) may not agree.

  • An ALLEGRO_USTR may be created in such a way that it is not NUL terminated. A string which is dynamically allocated will always be NUL terminated, but a string which references the middle of another string or region of memory will not be NUL terminated.

  • If the ALLEGRO_USTR references another string, the returned C string will point into the referenced string; the length of the referencing string will be ignored.
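The first two caveats come down to the same thing: a C string function stops at the first NUL byte, which may fall before or after the real end of the data. As a rough illustration in plain C (c_visible_length is a hypothetical helper, not an Allegro function), the following computes how many bytes such a function would see, without ever reading past the stated size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: how many of the `size` bytes in `data` precede the
 * first NUL byte. memchr is bounded by `size`, so this is safe even when
 * the data is not NUL terminated, unlike strlen. */
size_t c_visible_length(const char *data, size_t size)
{
   const char *nul;

   assert(data != NULL);
   nul = memchr(data, '\0', size);
   return nul ? (size_t)(nul - data) : size;
}
```

For the 5-byte buffer "ab\0cd" this yields 2, which is what strlen would report, while the string size is 5.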

al_ustr_to_buffer

void al_ustr_to_buffer(const ALLEGRO_USTR *us, char *buffer, int size)

Write the contents of the string into a pre-allocated buffer of the given size in bytes. The result will always be NUL terminated.

al_cstr_dup

char *al_cstr_dup(const ALLEGRO_USTR *us)

Create a NUL ('\0') terminated copy of the string. Any embedded NUL bytes will still be present in the returned string. The new string must eventually be freed with free(). If an error occurs NULL is returned.

[after we introduce al_free it should be freed with al_free]

al_ustr_dup

ALLEGRO_USTR *al_ustr_dup(const ALLEGRO_USTR *us)

Return a duplicate copy of a string. The new string will need to be freed with al_ustr_free.

al_ustr_dup_substr

ALLEGRO_USTR *al_ustr_dup_substr(const ALLEGRO_USTR *us, int start_pos,
   int end_pos)

Return a new copy of a string, containing its contents in the byte interval [start_pos, end_pos). The new string will be NUL terminated and will need to be freed with al_ustr_free.

If you need a range of code-points instead of bytes, use al_ustr_offset to find the byte offsets.

Predefined strings

al_ustr_empty_string

ALLEGRO_USTR *al_ustr_empty_string(void)

Return a pointer to a static empty string. The string is read only.

Creating strings by referencing other data

al_ref_cstr

ALLEGRO_USTR *al_ref_cstr(ALLEGRO_USTR_INFO *info, const char *s)

Create a string that references the storage of a C-style string. The information about the string (e.g. its size) is stored in the structure pointed to by the info parameter. The string will not have any other storage allocated of its own, so if you allocate the info structure on the stack then no explicit "free" operation is required.

The string is valid until the underlying C string disappears.

Example:

ALLEGRO_USTR_INFO info;
ALLEGRO_USTR *us = al_ref_cstr(&info, "my string");

al_ref_buffer

ALLEGRO_USTR *al_ref_buffer(ALLEGRO_USTR_INFO *info, const char *s, size_t size)

Like al_ref_cstr but the size of the string data is passed in as a parameter. Hence you can use it to reference only part of a string or an arbitrary region of memory.

The string is valid while the underlying memory buffer is valid.

al_ref_ustr

ALLEGRO_USTR *al_ref_ustr(ALLEGRO_USTR_INFO *info, const ALLEGRO_USTR *us,
   int start_pos, int end_pos)

Create a read-only string that references the storage of another string. The information about the string (e.g. its size) is stored in the structure pointed to by the info parameter. The string will not have any other storage allocated of its own, so if you allocate the info structure on the stack then no explicit "free" operation is required.

The referenced interval is [start_pos, end_pos).

The string is valid until the underlying string is modified or destroyed.

If you need a range of code-points instead of bytes, use al_ustr_offset to find the byte offsets.

Sizes and offsets

al_ustr_size

size_t al_ustr_size(const ALLEGRO_USTR *us)

Return the size of the string in bytes. This is equal to the number of code points in the string if the string is empty or contains only 7-bit ASCII characters.

al_ustr_length

size_t al_ustr_length(const ALLEGRO_USTR *us)

Return the number of code points in the string.

al_ustr_offset

int al_ustr_offset(const ALLEGRO_USTR *us, int index)

Return the offset (in bytes from the start of the string) of the code point at the specified index in the string. An index of zero returns the offset of the first code point in the string. If index is negative, it counts backward from the end of the string, so an index of -1 will return an offset to the last code point.

If the index is past the end of the string, returns the offset of the end of the string.

al_ustr_next

bool al_ustr_next(const ALLEGRO_USTR *us, int *pos)

Find the byte offset of the next code point in the string, beginning at *pos. *pos does not have to be at the beginning of a code point. Returns true on success, and the value pointed to by pos is updated to the found offset. Returns false if *pos was already at the end of the string, in which case *pos is unmodified.

This function just looks for an appropriate byte; it doesn't check if the found offset is the beginning of a valid code point. If you are working with possibly invalid UTF-8 strings then it could skip over some invalid bytes.

al_ustr_prev

bool al_ustr_prev(const ALLEGRO_USTR *us, int *pos)

Find the byte offset of the previous code point in the string, before *pos. *pos does not have to be at the beginning of a code point. Returns true on success, and the value pointed to by pos is updated to the found offset. Returns false if *pos was already at the beginning of the string, in which case *pos is unmodified.

This function just looks for an appropriate byte; it doesn't check if the found offset is the beginning of a valid code point. If you are working with possibly invalid UTF-8 strings then it could skip over some invalid bytes.
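The "looks for an appropriate byte" behaviour of al_ustr_next and al_ustr_prev can be sketched in plain C as follows. This is an assumption about the general technique, not Allegro's actual implementation: utf8_next and utf8_prev are hypothetical names, and the sketch simply treats any byte that is not a continuation byte (bit pattern 10xxxxxx) as the start of a code point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
bool utf8_is_continuation(unsigned char b)
{
   return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: advance *pos to the next byte that could start a
 * code point. Returns false, leaving *pos alone, at the end of the data. */
bool utf8_next(const char *buf, int size, int *pos)
{
   int p;

   assert(buf != NULL && pos != NULL);
   if (*pos < 0 || *pos >= size)
      return false;
   p = *pos + 1;
   while (p < size && utf8_is_continuation((unsigned char)buf[p]))
      p++;
   *pos = p;
   return true;
}

/* Hypothetical helper: move *pos back to the previous byte that could start
 * a code point. Returns false, leaving *pos alone, at the beginning. */
bool utf8_prev(const char *buf, int size, int *pos)
{
   int p;

   assert(buf != NULL && pos != NULL);
   if (*pos <= 0 || *pos > size)
      return false;
   p = *pos - 1;
   while (p > 0 && utf8_is_continuation((unsigned char)buf[p]))
      p--;
   *pos = p;
   return true;
}
```

Neither function decodes anything, which is why a position in the middle of a code point is acceptable as a starting point and why invalid bytes can be stepped over silently.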

Getting code points

al_ustr_get

int32_t al_ustr_get(const ALLEGRO_USTR *us, int pos)

Return the code point in us beginning at pos.

On success returns the code point value. If pos was out of bounds (e.g. past the end of the string), return -1. On an error, such as an invalid byte sequence, return -2.
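The return convention (code point, -1 for out of bounds, -2 for an invalid sequence) can be illustrated with a small stand-alone decoder. This is only a sketch under stated assumptions: utf8_get is a hypothetical function in plain C, and its notion of "invalid" (truncated, overlong, surrogate or out-of-range sequences) is a choice made here, which may differ in detail from what al_ustr_get checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder: return the code point starting at byte offset pos,
 * -1 if pos is out of bounds, or -2 if the bytes there are not a valid
 * UTF-8 sequence. */
int32_t utf8_get(const char *buf, int size, int pos)
{
   const unsigned char *p = (const unsigned char *)buf;
   int32_t cp, min;
   int extra, i;

   assert(buf != NULL);
   if (pos < 0 || pos >= size)
      return -1;

   if (p[pos] < 0x80)
      return p[pos];                     /* single-byte (ASCII) */
   if ((p[pos] & 0xE0) == 0xC0) {
      extra = 1; cp = p[pos] & 0x1F; min = 0x80;
   }
   else if ((p[pos] & 0xF0) == 0xE0) {
      extra = 2; cp = p[pos] & 0x0F; min = 0x800;
   }
   else if ((p[pos] & 0xF8) == 0xF0) {
      extra = 3; cp = p[pos] & 0x07; min = 0x10000;
   }
   else
      return -2;                         /* continuation or invalid lead byte */

   if (pos + extra >= size)
      return -2;                         /* sequence is cut off */
   for (i = 1; i <= extra; i++) {
      if ((p[pos + i] & 0xC0) != 0x80)
         return -2;                      /* expected a continuation byte */
      cp = (cp << 6) | (p[pos + i] & 0x3F);
   }
   /* Reject overlong forms, values above 0x10FFFF and surrogates. */
   if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      return -2;
   return cp;
}
```

Because -1 and -2 are both negative, a caller can test for any failure with a single comparison against zero and still tell the two cases apart when needed.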

al_ustr_get_next

int32_t al_ustr_get_next(const ALLEGRO_USTR *us, int *pos)

Find the code point in us beginning at *pos, then advance to the next code point.

On success returns the code point value. If *pos was out of bounds (e.g. past the end of the string), returns -1. On an error, such as an invalid byte sequence, returns -2. As with al_ustr_next, invalid byte sequences may be skipped while advancing.

al_ustr_prev_get

int32_t al_ustr_prev_get(const ALLEGRO_USTR *us, int *pos)

Find the beginning of a code point before *pos, then return it. Note this performs a pre-decrement: *pos is moved back first, then the code point at the new position is read.

On success returns the code point value. If *pos was out of bounds (e.g. past the end of the string), returns -1. On an error, such as an invalid byte sequence, returns -2. As with al_ustr_prev, invalid byte sequences may be skipped while moving backwards.

Inserting into strings

al_ustr_insert

bool al_ustr_insert(ALLEGRO_USTR *us1, int pos, const ALLEGRO_USTR *us2)

Insert us2 into us1 beginning at pos. pos cannot be less than 0. If pos is past the end of us1 then the space between the end of the string and pos will be padded with NUL ('\0') bytes. pos is specified in bytes.

Use al_ustr_offset to find the byte offset for a code point index.

Returns true on success, false on error.

al_ustr_insert_cstr

bool al_ustr_insert_cstr(ALLEGRO_USTR *us, int pos, const char *s)

Like al_ustr_insert but inserts a C-style string.

al_ustr_insert_chr

size_t al_ustr_insert_chr(ALLEGRO_USTR *us, int pos, int32_t c)

Insert a code point into us beginning at byte offset pos. pos cannot be less than 0. If pos is past the end of us then the space between the end of the string and pos will be padded with NUL ('\0') bytes.

Returns the number of bytes inserted, or 0 on error.

Appending to strings

al_ustr_append

bool al_ustr_append(ALLEGRO_USTR *us1, const ALLEGRO_USTR *us2)

Append us2 to the end of us1.

Returns true on success, false on error.

al_ustr_append_cstr

bool al_ustr_append_cstr(ALLEGRO_USTR *us, const char *s)

Append C-style string s to the end of us.

Returns true on success, false on error.

al_ustr_append_chr

size_t al_ustr_append_chr(ALLEGRO_USTR *us, int32_t c)

Append a code point to the end of us.

Returns the number of bytes added, or 0 on error.

al_ustr_appendf

bool al_ustr_appendf(ALLEGRO_USTR *us, const char *fmt, ...)

This function appends formatted output to the string us. fmt is a printf-style format string. See al_ustr_newf about the "%s" and "%c" specifiers.

Returns true on success, false on error.

al_ustr_vappendf

bool al_ustr_vappendf(ALLEGRO_USTR *us, const char *fmt, va_list ap)

Like al_ustr_appendf but you pass the variable argument list directly, instead of the arguments themselves. See al_ustr_newf about the "%s" and "%c" specifiers.

Returns true on success, false on error.

Removing parts of strings

al_ustr_remove_chr

bool al_ustr_remove_chr(ALLEGRO_USTR *us, int pos)

Remove the code point beginning at byte offset pos. Returns true on success. If pos is out of range or pos is not the beginning of a valid code point, returns false leaving the string unmodified.

Use al_ustr_offset to find the byte offset for a code point index.

al_ustr_remove_range

bool al_ustr_remove_range(ALLEGRO_USTR *us, int start_pos, int end_pos)

Remove the interval [start_pos, end_pos) (in bytes) from a string. start_pos and end_pos may both be past the end of the string but cannot be less than 0 (the start of the string).

Returns true on success, false on error.

al_ustr_truncate

bool al_ustr_truncate(ALLEGRO_USTR *us, int start_pos)

Truncate the string at byte offset start_pos, removing everything from that offset onwards. start_pos can be past the end of the string (has no effect) but cannot be less than 0.

Returns true on success, false on error.

al_ustr_ltrim_ws

bool al_ustr_ltrim_ws(ALLEGRO_USTR *us)

Remove leading whitespace characters from a string, as defined by the C function isspace().

Returns true on success, or false if the function was passed an empty string.

al_ustr_rtrim_ws

bool al_ustr_rtrim_ws(ALLEGRO_USTR *us)

Remove trailing ("right") whitespace characters from a string, as defined by the C function isspace().

Returns true on success, or false if the function was passed an empty string.

al_ustr_trim_ws

bool al_ustr_trim_ws(ALLEGRO_USTR *us)

Remove both leading and trailing whitespace characters from a string.

Returns true on success, or false if the function was passed an empty string.

Assigning one string to another

al_ustr_assign

bool al_ustr_assign(ALLEGRO_USTR *us1, const ALLEGRO_USTR *us2)

Overwrite the string us1 with another string us2. Returns true on success, false on error.

al_ustr_assign_substr

bool al_ustr_assign_substr(ALLEGRO_USTR *us1, const ALLEGRO_USTR *us2,
   int start_pos, int end_pos)

Overwrite the string us1 with the contents of us2 in the byte interval [start_pos, end_pos). The end points will be clamped to the bounds of us2.

Usually you will first have to use al_ustr_offset to find the byte offsets.

Returns true on success, false on error.
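As an illustration of a half-open byte interval with clamped end points, the sketch below copies the bytes in [start_pos, end_pos) out of a plain buffer. It is not Allegro code: copy_byte_range is a hypothetical helper, and the exact clamping rules shown are an assumption about what "clamped to the bounds" means.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a newly allocated, NUL terminated copy of the
 * bytes of src in the half-open interval [start_pos, end_pos), after
 * clamping both end points to [0, size]. Returns NULL if allocation fails.
 * The caller frees the result with free(). */
char *copy_byte_range(const char *src, int size, int start_pos, int end_pos)
{
   size_t n;
   char *out;

   assert(src != NULL && size >= 0);
   if (start_pos < 0) start_pos = 0;
   if (start_pos > size) start_pos = size;
   if (end_pos > size) end_pos = size;
   if (end_pos < start_pos) end_pos = start_pos;   /* empty interval */

   n = (size_t)(end_pos - start_pos);   /* end_pos itself is excluded */
   out = malloc(n + 1);
   if (!out)
      return NULL;
   memcpy(out, src + start_pos, n);
   out[n] = '\0';
   return out;
}
```

With the 11-byte string "hello world", the interval [6, 100) is clamped to [6, 11) and yields "world", while [2, 2) is empty.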

al_ustr_assign_cstr

bool al_ustr_assign_cstr(ALLEGRO_USTR *us1, const char *s)

Overwrite the string us1 with the contents of the C-style string s. Returns true on success, false on error.

Replacing parts of strings

al_ustr_set_chr

size_t al_ustr_set_chr(ALLEGRO_USTR *us, int start_pos, int32_t c)

Replace the code point beginning at byte offset start_pos with c. start_pos cannot be less than 0. If start_pos is past the end of us then the space between the end of the string and start_pos will be padded with NUL ('\0') bytes. If start_pos is not the start of a valid code point, that is an error and the string will be unmodified.

On success, returns the number of bytes written, i.e. the offset to the following code point. On error, returns 0.

al_ustr_replace_range

bool al_ustr_replace_range(ALLEGRO_USTR *us1, int start_pos1, int end_pos1,
   const ALLEGRO_USTR *us2)

Replace the part of us1 in the byte interval [start_pos1, end_pos1) with the contents of us2. start_pos1 cannot be less than 0. If start_pos1 is past the end of us1 then the space between the end of the string and start_pos1 will be padded with NUL ('\0') bytes.

Use al_ustr_offset to find the byte offsets.

Returns true on success, false on error.

Searching

al_ustr_find_chr

int al_ustr_find_chr(const ALLEGRO_USTR *us, int start_pos, int32_t c)

Search for the encoding of code point c in us from byte offset start_pos (inclusive).

Returns the position where it is found or -1 if it is not found.

al_ustr_rfind_chr

int al_ustr_rfind_chr(const ALLEGRO_USTR *us, int end_pos, int32_t c)

Search for the encoding of code point c in us backwards from byte offset end_pos (exclusive). Returns the position where it is found or -1 if it is not found.

al_ustr_find_set

int al_ustr_find_set(const ALLEGRO_USTR *us, int start_pos,
   const ALLEGRO_USTR *accept)

This function finds the first code point in us, beginning from byte offset start_pos, that matches any code point in accept. Returns the position if a code point was found. Otherwise returns -1.

al_ustr_find_set_cstr

int al_ustr_find_set_cstr(const ALLEGRO_USTR *us, int start_pos,
   const char *accept)

Like al_ustr_find_set but takes a C-style string for accept.

al_ustr_find_cset

int al_ustr_find_cset(const ALLEGRO_USTR *us, int start_pos,
   const ALLEGRO_USTR *reject)

This function finds the first code point in us, beginning from byte offset start_pos, that does not match any code point in reject. In other words it finds a code point in the complementary set of reject. Returns the byte position of that code point, if any. Otherwise returns -1.

al_ustr_find_cset_cstr

int al_ustr_find_cset_cstr(const ALLEGRO_USTR *us, int start_pos,
   const char *reject)

Like al_ustr_find_cset but takes a C-style string for reject.

al_ustr_find_str

int al_ustr_find_str(const ALLEGRO_USTR *haystack, int start_pos,
   const ALLEGRO_USTR *needle)

Find the first occurrence of string needle in haystack, beginning from byte offset start_pos (inclusive). Return the byte offset of the occurrence if it is found, otherwise return -1.

al_ustr_find_cstr

int al_ustr_find_cstr(const ALLEGRO_USTR *haystack, int start_pos,
   const char *needle)

Like al_ustr_find_str but takes a C-style string for needle.

al_ustr_rfind_str

int al_ustr_rfind_str(const ALLEGRO_USTR *haystack, int end_pos,
   const ALLEGRO_USTR *needle)

Find the last occurrence of string needle in haystack before byte offset end_pos (exclusive). Return the byte offset of the occurrence if it is found, otherwise return -1.

al_ustr_rfind_cstr

int al_ustr_rfind_cstr(const ALLEGRO_USTR *haystack, int end_pos,
   const char *needle)

Like al_ustr_rfind_str but takes a C-style string for needle.

al_ustr_find_replace

bool al_ustr_find_replace(ALLEGRO_USTR *us, int start_pos,
   const ALLEGRO_USTR *find, const ALLEGRO_USTR *replace)

Replace all occurrences of find in us with replace, beginning at byte offset start_pos. The find string must be non-empty. Returns true on success, false on error.

al_ustr_find_replace_cstr

bool al_ustr_find_replace_cstr(ALLEGRO_USTR *us, int start_pos,
   const char *find, const char *replace)

Like al_ustr_find_replace but takes C-style strings for find and replace.

Comparing

al_ustr_equal

bool al_ustr_equal(const ALLEGRO_USTR *us1, const ALLEGRO_USTR *us2)

Return true iff the two strings are equal. This function is more efficient than al_ustr_compare so is preferable if ordering is not important.

al_ustr_compare

int al_ustr_compare(const ALLEGRO_USTR *us1, const ALLEGRO_USTR *us2)

This function compares us1 and us2 by code point values. Returns zero if the strings are equal, a positive number if us1 comes after us2, else a negative number.

This does not take into account locale-specific sorting rules. For that you will need to use another library.

al_ustr_ncompare

int al_ustr_ncompare(const ALLEGRO_USTR *us1, const ALLEGRO_USTR *us2, int n)

Like al_ustr_compare but only compares up to the first n code points of both strings.

Returns zero if the strings are equal, a positive number if us1 comes after us2, else a negative number.

al_ustr_has_prefix

bool al_ustr_has_prefix(const ALLEGRO_USTR *us1, const ALLEGRO_USTR *us2)

Returns true iff us1 begins with us2.

al_ustr_has_prefix_cstr

bool al_ustr_has_prefix_cstr(const ALLEGRO_USTR *us1, const char *s2)

Returns true iff us1 begins with s2.

al_ustr_has_suffix

bool al_ustr_has_suffix(const ALLEGRO_USTR *us1, const ALLEGRO_USTR *us2)

Returns true iff us1 ends with us2.

al_ustr_has_suffix_cstr

bool al_ustr_has_suffix_cstr(const ALLEGRO_USTR *us1, const char *s2)

Returns true iff us1 ends with s2.

UTF-16 conversion

al_ustr_new_from_utf16

ALLEGRO_USTR *al_ustr_new_from_utf16(uint16_t const *s)

Create a new string containing a copy of the 0-terminated string s which must be encoded as UTF-16. The string must eventually be freed with al_ustr_free.

al_ustr_size_utf16

size_t al_ustr_size_utf16(const ALLEGRO_USTR *us)

Returns the number of bytes required to encode the string in UTF-16 (including the terminating 0). Usually called before al_ustr_encode_utf16 to determine the size of the buffer to allocate.

al_ustr_encode_utf16

size_t al_ustr_encode_utf16(const ALLEGRO_USTR *us, uint16_t *s,
   size_t n)

Encode the string into the given buffer, in UTF-16. Returns the number of bytes written. There are never more than n bytes written. The minimum size to encode the complete string can be queried with al_ustr_size_utf16. If the n parameter is smaller than that, the string will be truncated but still always 0 terminated.

Low-level UTF-8 routines

al_utf8_width

size_t al_utf8_width(int c)

Returns the number of bytes that would be occupied by the specified code point when encoded in UTF-8. This is between 1 and 4 bytes for legal code point values. Otherwise returns 0.

al_utf8_encode

size_t al_utf8_encode(char s[], int32_t c)

Encode the specified code point to UTF-8 into the buffer s. The buffer must have enough space to hold the encoding, which takes between 1 and 4 bytes. This routine will refuse to encode code points above 0x10FFFF.

Returns the number of bytes written, which is the same as that returned by al_utf8_width.
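The width and encoding rules follow directly from the UTF-8 bit layout. The sketch below shows one way to write them in plain C; utf8_width and utf8_encode are hypothetical stand-ins for illustration, not the Allegro functions, and this version does not reject surrogate code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes needed to encode c in UTF-8,
 * or 0 if c is negative or above 0x10FFFF. */
size_t utf8_width(int32_t c)
{
   if (c < 0) return 0;
   if (c < 0x80) return 1;
   if (c < 0x800) return 2;
   if (c < 0x10000) return 3;
   if (c <= 0x10FFFF) return 4;
   return 0;
}

/* Hypothetical helper: write the UTF-8 encoding of c into s, which must
 * have room for 4 bytes. Returns the number of bytes written, 0 on error.
 * No NUL terminator is added. */
size_t utf8_encode(char s[], int32_t c)
{
   size_t n = utf8_width(c);

   assert(s != NULL);
   switch (n) {
   case 1:
      s[0] = (char)c;
      break;
   case 2:
      s[0] = (char)(0xC0 | (c >> 6));
      s[1] = (char)(0x80 | (c & 0x3F));
      break;
   case 3:
      s[0] = (char)(0xE0 | (c >> 12));
      s[1] = (char)(0x80 | ((c >> 6) & 0x3F));
      s[2] = (char)(0x80 | (c & 0x3F));
      break;
   case 4:
      s[0] = (char)(0xF0 | (c >> 18));
      s[1] = (char)(0x80 | ((c >> 12) & 0x3F));
      s[2] = (char)(0x80 | ((c >> 6) & 0x3F));
      s[3] = (char)(0x80 | (c & 0x3F));
      break;
   default:
      break;                    /* illegal code point: write nothing */
   }
   return n;
}
```

For example, U+20AC encodes as the three bytes 0xE2 0x82 0xAC and U+1F600 as the four bytes 0xF0 0x9F 0x98 0x80.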

Low-level UTF-16 routines

al_utf16_width

size_t al_utf16_width(int c)

Returns the number of bytes that would be occupied by the specified code point when encoded in UTF-16. This is either 2 or 4 bytes for legal code point values. Otherwise returns 0.

al_utf16_encode

size_t al_utf16_encode(uint16_t s[], int32_t c)

Encode the specified code point to UTF-16 into the buffer s. The buffer must have enough space to hold the encoding, which takes either 2 or 4 bytes. This routine will refuse to encode code points above 0x10FFFF.

Returns the number of bytes written, which is the same as that returned by al_utf16_width.
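The "2 or 4 bytes" rule reflects how UTF-16 works: code points below 0x10000 take one 16-bit unit, and higher ones are split into a surrogate pair. A minimal sketch in plain C follows; utf16_width and utf16_encode are hypothetical stand-ins for illustration, not the Allegro functions, and this version does not reject code points in the surrogate range 0xD800 to 0xDFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes needed to encode c in UTF-16,
 * or 0 if c is negative or above 0x10FFFF. */
size_t utf16_width(int32_t c)
{
   if (c < 0 || c > 0x10FFFF)
      return 0;
   return (c < 0x10000) ? 2 : 4;
}

/* Hypothetical helper: write the UTF-16 encoding of c into s, which must
 * have room for two 16-bit units. Returns the number of bytes written,
 * 0 on error. */
size_t utf16_encode(uint16_t s[], int32_t c)
{
   size_t n = utf16_width(c);

   assert(s != NULL);
   if (n == 2) {
      s[0] = (uint16_t)c;
   }
   else if (n == 4) {
      int32_t v = c - 0x10000;              /* 20 bits remain */
      s[0] = (uint16_t)(0xD800 | (v >> 10));    /* high surrogate */
      s[1] = (uint16_t)(0xDC00 | (v & 0x3FF));  /* low surrogate */
   }
   return n;
}
```

For example, U+20AC is stored as the single unit 0x20AC, while U+1F600 becomes the surrogate pair 0xD83D 0xDE00.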

Last updated: 2009-08-09 08:22:46 UTC