UTF-8 string routines
Introduction

Here we should give a short overview of Unicode/UCS and in particular UTF-8 encoding. Explain about code points and their relationship to "characters". Explain about half-open intervals.

Always be careful whether a function takes byte offsets or code-point indices. In general, all position parameters are byte offsets, not code-point indices. This may be surprising, but the functions are designed to be highly performant and to work with arbitrary byte buffers as well, so UTF-8 decoding is not done by default.

For actual text processing, where you want to specify positions as code-point indices, use al_ustr_offset to find the byte position of a code point.

UTF-8 string types

ALLEGRO_USTR

ALLEGRO_USTR_INFO
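
To make the distinction between byte offsets and code-point indices concrete, here is a minimal sketch in plain C. It does not use Allegro: the helper names count_code_points and offset_of_code_point are hypothetical, and the code assumes the input is valid UTF-8. It only illustrates the kind of scan needed to turn a code-point index into a byte offset.

```c
#include <assert.h>
#include <stddef.h>

/* True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Count the code points in the byte range [0, size) of s.
 * Assumes valid UTF-8: every non-continuation byte starts a code point. */
static size_t count_code_points(const char *s, size_t size)
{
    size_t n = 0;
    assert(s != NULL);
    for (size_t i = 0; i < size; i++) {
        if (!is_continuation((unsigned char)s[i]))
            n++;
    }
    return n;
}

/* Byte offset of the code point with the given index, or size if the index
 * is past the end. Hypothetical helper, not Allegro's al_ustr_offset. */
static size_t offset_of_code_point(const char *s, size_t size, size_t index)
{
    size_t pos = 0;
    assert(s != NULL);
    while (pos < size && index > 0) {
        pos++;  /* step past the lead byte */
        while (pos < size && is_continuation((unsigned char)s[pos]))
            pos++;  /* and past its continuation bytes */
        index--;
    }
    return pos;
}
```

For the 7-byte string "a\xC3\xA9\xE2\x82\xACz" (a, é, €, z), count_code_points gives 4. The code points with indices 1 and 2 occupy the half-open byte interval [1, 6): the start offset is included and the end offset is not.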
Creating and destroying strings

al_ustr_new
Create a new string containing a copy of a C-style string.

al_ustr_new_from_buffer
Create a new string containing a copy of the given buffer.

al_ustr_newf
Create a new string using a printf-style format string.

Notes:

The "%s" specifier takes C string arguments, not ALLEGRO_USTRs. Therefore, to pass an ALLEGRO_USTR as a parameter you must use al_cstr, and the string must be NUL terminated. If the string contains an embedded NUL byte, everything from that byte onwards will be ignored.

The "%c" specifier outputs a single byte, not the UTF-8 encoding of a code point. Therefore it is only usable for ASCII characters (value <= 127), or if you really mean to output byte values from 128 to 255. To insert the UTF-8 encoding of a code point, encode it into a memory buffer using al_utf8_encode, then use the "%s" specifier. Remember to NUL terminate the buffer.

al_ustr_free
Free a previously allocated string.

al_cstr
Get a
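
The notes on the "%s" and "%c" specifiers under al_ustr_newf follow from ordinary printf behaviour, which can be observed with snprintf alone. The following is a minimal sketch in plain C with no Allegro calls. The helper names are hypothetical, and encode_two_byte is a stand-in that covers only the two-byte UTF-8 range, not a substitute for al_utf8_encode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Copy buf through "%s". "%s" stops at the first NUL byte, so the return
 * value is the number of bytes before any embedded NUL. */
static int bytes_seen_by_percent_s(char *out, size_t out_size, const char *buf)
{
    assert(out != NULL && buf != NULL);
    return snprintf(out, out_size, "%s", buf);
}

/* Encode a code point in U+0080..U+07FF as two UTF-8 bytes plus a NUL
 * terminator. Returns 2 on success, 0 if cp is outside that range. */
static int encode_two_byte(unsigned cp, char out[3])
{
    if (cp < 0x80 || cp > 0x7FF)
        return 0;
    out[0] = (char)(0xC0 | (cp >> 6));
    out[1] = (char)(0x80 | (cp & 0x3F));
    out[2] = '\0';
    return 2;
}

/* "%c" converts its argument to unsigned char and writes that one byte. */
static int format_with_percent_c(char *out, size_t out_size, unsigned cp)
{
    return snprintf(out, out_size, "caf%c", (int)cp);
}

/* Encode first, NUL terminate, then pass the bytes through "%s". */
static int format_with_percent_s(char *out, size_t out_size, unsigned cp)
{
    char enc[3];
    if (!encode_two_byte(cp, enc))
        return -1;
    return snprintf(out, out_size, "caf%s", enc);
}
```

With cp set to 0xE9 (é), the "%c" version writes the single byte 0xE9, which is not valid UTF-8 on its own, while the "%s" version writes the two bytes 0xC3 0xA9.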
al_ustr_to_buffer
Write the contents of the string into a pre-allocated buffer of the given size in bytes. The result will always be 0-terminated.

al_cstr_dup
Create a NUL ('\0') terminated copy of the string. Any embedded NUL bytes will still be present in the returned string. The new string must eventually be freed with free(). If an error occurs, NULL is returned.

[After we introduce al_free, it should be freed with al_free.]

al_ustr_dup
Return a duplicate copy of a string. The new string will need to be freed with al_ustr_free.

al_ustr_dup_substr
Return a new copy of a string, containing its contents in the byte interval [start_pos, end_pos). The new string will be NUL terminated and will need to be freed with al_ustr_free.

If you need a range of code points instead of bytes, use al_ustr_offset to find the byte offsets.

Predefined strings

al_ustr_empty_string
Return a pointer to a static empty string. The string is read only.

Creating strings by referencing other data

al_ref_cstr
Create a string that references the storage of a C-style string. The information about the string (e.g. its size) is stored in a structure supplied by the caller. The string is valid until the underlying C string disappears.

Example:
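
A minimal sketch of the referencing idea in plain C, using a hypothetical str_ref structure and ref_cstr function rather than Allegro's own types and calls. It is only meant to show what "references the storage" implies: no bytes are copied, and the view is only as valid as the C string behind it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the "information about the string": it records
 * where the bytes live and how many there are, but owns no storage. */
typedef struct {
    const char *data;
    size_t size;  /* size in bytes, excluding the NUL terminator */
} str_ref;

/* Fill in the caller-supplied structure so that it refers to s, and return
 * a pointer to that structure. Nothing is copied or allocated. */
static const str_ref *ref_cstr(str_ref *info, const char *s)
{
    assert(info != NULL && s != NULL);
    info->data = s;
    info->size = strlen(s);
    return info;
}
```

Given char text[] = "my string", a call such as ref_cstr(&info, text) returns &info with data pointing at text itself, so a later change to text is visible through the reference, and the reference must not be used once text goes out of scope.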
al_ref_buffer
Like al_ref_cstr but the size of the string data is passed in as a parameter. Hence you can use it to reference only part of a string or an arbitrary region of memory. The string is valid while the underlying C string is valid.

al_ref_ustr
Create a read-only string that references the storage of another string. The information about the string (e.g. its size) is stored in a structure supplied by the caller. The referenced interval is [start_pos, end_pos). The string is valid until the underlying string is modified or destroyed.

If you need a range of code points instead of bytes, use al_ustr_offset to find the byte offsets.

Sizes and offsets

al_ustr_size
Return the size of the string in bytes. This is equal to the number of code points in the string if the string is empty or contains only 7-bit ASCII characters.

al_ustr_length
Return the number of code points in the string.

al_ustr_offset
Return the offset (in bytes from the start of the string) of the code point at the specified index in the string. An index of zero returns the offset of the first code point in the string. If the index is negative, it counts backward from the end of the string, so an index of -1 returns the offset of the last code point. If the index is past the end of the string, the offset of the end of the string is returned.

al_ustr_next
Find the byte offset of the next code point in the string, beginning at a given byte offset. This function just looks for an appropriate byte; it doesn't check whether the offset found is the beginning of a valid code point. If you are working with possibly invalid UTF-8 strings then it could skip over some invalid bytes.

al_ustr_prev
Find the byte offset of the previous code point in the string, before a given byte offset. This function just looks for an appropriate byte; it doesn't check whether the offset found is the beginning of a valid code point. If you are working with possibly invalid UTF-8 strings then it could skip over some invalid bytes.

Getting code points

al_ustr_get
Return the code point in ... On success, returns the code point value. If ...

al_ustr_get_next
Find the code point in ... On success, returns the code point value. If ...

al_ustr_prev_get
Find the beginning of a code point before ... On success, returns the code point value. If ...

Inserting into strings

al_ustr_insert
Insert ... Use al_ustr_offset to find the byte offset for a code-point index. Returns true on success, false on error.

al_ustr_insert_cstr
Like al_ustr_insert but inserts a C-style string.

al_ustr_insert_chr
Insert a code point into ... Returns the number of bytes inserted, or 0 on error.

Appending to strings

al_ustr_append
Append ... Returns true on success, false on error.

al_ustr_append_cstr
Append C-style string ... Returns true on success, false on error.

al_ustr_append_chr
Append a code point to the end of the string. Returns the number of bytes added, or 0 on error.

al_ustr_appendf
This function appends formatted output to the string. Returns true on success, false on error.

al_ustr_vappendf
Like al_ustr_appendf but you pass the variable argument list directly, instead of the arguments themselves. See al_ustr_newf about the "%s" and "%c" specifiers. Returns true on success, false on error.

Removing parts of strings

al_ustr_remove_chr
Remove the code point beginning at a given byte offset. Use al_ustr_offset to find the byte offset for a code-point index.

al_ustr_remove_range
Remove the interval [start_pos, end_pos) (in bytes) from a string. Returns true on success, false on error.

al_ustr_truncate
Truncate a string at a given byte offset. Returns true on success, false on error.

al_ustr_ltrim_ws
Remove leading whitespace characters from a string, as defined by the C function isspace(). Returns true on success, or false if the function was passed an empty string.

al_ustr_rtrim_ws
Remove trailing ("right") whitespace characters from a string, as defined by the C function isspace(). Returns true on success, or false if the function was passed an empty string.

al_ustr_trim_ws
Remove both leading and trailing whitespace characters from a string. Returns true on success, or false if the function was passed an empty string.

Assigning one string to another

al_ustr_assign
Overwrite the string ...

al_ustr_assign_substr
Overwrite the string ... Usually you will first have to use al_ustr_offset to find the byte offsets. Returns true on success, false on error.

al_ustr_assign_cstr
Overwrite the string ...

Replacing parts of strings

al_ustr_set_chr
Replace the code point beginning at byte offset ... On success, returns the number of bytes written, i.e. the offset to the following code point. On error, returns 0.

al_ustr_replace_range
Replace the part of ... Use al_ustr_offset to find the byte offsets. Returns true on success, false on error.

Searching

al_ustr_find_chr
Search for the encoding of code point ... Returns the position where it is found or -1 if it is not found.

al_ustr_rfind_chr
Search for the encoding of code point ...

al_ustr_find_set
This function finds the first code point in ...

al_ustr_find_set_cstr
Like al_ustr_find_set but takes a C-style string for ...

al_ustr_find_cset
This function finds the first code point in ...

al_ustr_find_cset_cstr
Like al_ustr_find_cset but takes a C-style string for ...

al_ustr_find_str
Find the first occurrence of string ...

al_ustr_find_cstr
Like al_ustr_find_str but takes a C-style string for ...

al_ustr_rfind_str
Find the last occurrence of string ...

al_ustr_rfind_cstr
Like al_ustr_rfind_str but takes a C-style string for ...

al_ustr_find_replace
Replace all occurrences of ...

al_ustr_find_replace_cstr
Like al_ustr_find_replace but takes C-style strings for ...

Comparing

al_ustr_equal
Return true iff the two strings are equal. This function is more efficient than al_ustr_compare, so it is preferable if ordering is not important.

al_ustr_compare
This function compares ... This does not take into account locale-specific sorting rules. For that you will need to use another library.

al_ustr_ncompare
Like al_ustr_compare but only compares up to the first ... Returns zero if the strings are equal, a positive number if ...

al_ustr_has_prefix
Returns true iff ...

al_ustr_has_prefix_cstr
Returns true iff ...

al_ustr_has_suffix
Returns true iff ...

al_ustr_has_suffix_cstr
Returns true iff ...

UTF-16 conversion

al_ustr_new_from_utf16
Create a new string containing a copy of the 0-terminated string ...

al_ustr_size_utf16
Returns the number of bytes required to encode the string in UTF-16 (including the terminating 0). Usually called before al_ustr_encode_utf16 to determine the size of the buffer to allocate.

al_ustr_encode_utf16
Encode the string into the given buffer, in UTF-16. Returns the number of bytes written. There are never more than ...

Low-level UTF-8 routines

al_utf8_width
Returns the number of bytes that would be occupied by the specified code point when encoded in UTF-8. This is between 1 and 4 bytes for legal code point values. Otherwise returns 0.

al_utf8_encode
Encode the specified code point to UTF-8 into the given buffer. Returns the number of bytes written, which is the same as that returned by al_utf8_width.

Low-level UTF-16 routines

al_utf16_width
Returns the number of bytes that would be occupied by the specified code point when encoded in UTF-16. This is either 2 or 4 bytes for legal code point values. Otherwise returns 0.

al_utf16_encode
Encode the specified code point to UTF-16 into the given buffer. Returns the number of bytes written, which is the same as that returned by al_utf16_width.
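
The width and encode routines above can be illustrated with a small sketch in plain C. The functions below are hypothetical look-alikes written for this page, not Allegro's implementation. As an assumption of the sketch, only negative values and values above U+10FFFF are treated as illegal; whether surrogate code points should also be rejected is left open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode c in UTF-8, or 0 if c is out of range. */
static size_t utf8_width(int32_t c)
{
    if (c < 0) return 0;
    if (c <= 0x7F) return 1;
    if (c <= 0x7FF) return 2;
    if (c <= 0xFFFF) return 3;
    if (c <= 0x10FFFF) return 4;
    return 0;
}

/* Write the UTF-8 encoding of c to out, which must hold at least 4 bytes.
 * Returns the number of bytes written, equal to utf8_width(c). */
static size_t utf8_encode(unsigned char out[4], int32_t c)
{
    size_t n = utf8_width(c);
    assert(out != NULL);
    switch (n) {
    case 1:
        out[0] = (unsigned char)c;
        break;
    case 2:
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        break;
    case 3:
        out[0] = (unsigned char)(0xE0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (c & 0x3F));
        break;
    case 4:
        out[0] = (unsigned char)(0xF0 | (c >> 18));
        out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (c & 0x3F));
        break;
    default:
        break;  /* illegal code point: nothing is written */
    }
    return n;
}

/* Bytes needed to encode c in UTF-16: 2 for one code unit, 4 for a
 * surrogate pair, or 0 if c is out of range. */
static size_t utf16_width(int32_t c)
{
    if (c < 0) return 0;
    if (c <= 0xFFFF) return 2;
    if (c <= 0x10FFFF) return 4;
    return 0;
}

/* Write the UTF-16 code units for c to out, which must hold at least
 * 2 units. Returns the number of bytes written, equal to utf16_width(c). */
static size_t utf16_encode(uint16_t out[2], int32_t c)
{
    size_t n = utf16_width(c);
    assert(out != NULL);
    if (n == 2) {
        out[0] = (uint16_t)c;
    } else if (n == 4) {
        int32_t v = c - 0x10000;
        out[0] = (uint16_t)(0xD800 | (v >> 10));    /* high surrogate */
        out[1] = (uint16_t)(0xDC00 | (v & 0x3FF));  /* low surrogate */
    }
    return n;
}
```

Under these assumptions U+20AC takes 3 bytes in UTF-8 (E2 82 AC) and 2 bytes in UTF-16, while U+1F600 takes 4 bytes in both: F0 9F 98 80 in UTF-8, and the surrogate pair D83D DE00 in UTF-16.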
Last updated: 2009-08-09 08:22:46 UTC