The utf8 library provides basic support for UTF-8 encoding. This library does not provide any support for Unicode other than the handling of the encoding. Any operation that needs the meaning of a character, such as character classification, is outside its scope.
Unless stated otherwise, all functions that expect a byte position as a parameter assume that the given position is either the start of a byte sequence or one plus the length of the subject string. As in the string library, negative indices count from the end of the string.
You can find a large catalog of usable UTF-8 characters from this web page.
Description: Receives zero or more codepoints as integers, converts each one to its corresponding UTF-8 byte sequence and returns a string with the concatenation of all these sequences.
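As a minimal sketch, assuming this entry describes `utf8.char` as found in standard Lua 5.3 and later (the function name does not appear in the text above), the conversion from codepoints to a UTF-8 string might look like this:

```lua
-- Codepoints: U+0048 (H), U+00E4 (ä), U+20AC (€), U+1F600 (😀).
-- They encode to 1, 2, 3 and 4 bytes respectively.
local s = utf8.char(72, 228, 8364, 128512)
print(s)
print(#s)  -- 10 (byte length, not character count)

-- With zero arguments the result is the empty string.
local empty = utf8.char()
print(#empty)  -- 0
```

The byte length grows with the codepoint value, which is why `#s` is 10 even though the string holds only four characters.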
Description: Returns an iterator function so that the construction
for position, codepoint in utf8.codes(str) do
  -- body
end
will iterate over all codepoints in string str. It raises an error if it meets any invalid byte sequence.
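The following sketch shows what the iterator yields for a string that mixes single-byte and multi-byte characters. The sample string and the use of `pcall` to observe the error are illustrative choices, not part of the library description:

```lua
local str = "h\u{E9}llo"  -- "héllo"; é (U+00E9) occupies bytes 2 and 3

local positions, codepoints = {}, {}
for position, codepoint in utf8.codes(str) do
  positions[#positions + 1] = position
  codepoints[#codepoints + 1] = codepoint
end
print(table.concat(positions, " "))   -- 1 2 4 5 6
print(table.concat(codepoints, " "))  -- 104 233 108 108 111

-- An invalid byte sequence raises an error, so untrusted input
-- can be wrapped in pcall.
local ok = pcall(function()
  for _ in utf8.codes("\xFF") do end
end)
print(ok)  -- false
```

Note that the positions are byte positions, so they skip from 2 to 4 across the two-byte character.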
Description: Returns the codepoints (as integers) of all characters in the provided string (str) that start between byte positions i and j (both included). The default for i is 1 and for j is i. It raises an error if it meets any invalid byte sequence.
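A short sketch of the defaults, assuming this entry describes `utf8.codepoint` as in standard Lua 5.3 and later (the name is not given in the text above):

```lua
local str = "a\u{20AC}b"  -- "a€b"; € (U+20AC) occupies bytes 2 to 4

-- With no i and j, only the character starting at byte 1 is returned.
local first = utf8.codepoint(str)
print(first)  -- 97

-- i = 1, j = -1 covers the whole string.
local all = { utf8.codepoint(str, 1, -1) }
print(table.concat(all, " "))  -- 97 8364 98

-- Byte 3 lies inside the encoding of €, so it is not a valid start
-- position and the call raises an error.
local ok = pcall(utf8.codepoint, str, 3)
print(ok)  -- false
```

Because the function returns multiple values, collecting them in a table constructor is a convenient way to handle a range.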
Description: Returns the number of UTF-8 codepoints in the string str that start between positions i and j (both included). The default for i is 1 and for j is -1. If it finds any invalid byte sequence, it returns a false value plus the position of the first invalid byte.
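The difference between byte length and codepoint count, and the two-value result on invalid input, can be sketched as follows. This assumes the entry describes `utf8.len` as in standard Lua 5.3 and later; the variable names are illustrative:

```lua
local str = "na\u{EF}ve"  -- "naïve"; ï (U+00EF) occupies bytes 3 and 4

local byteLength = #str
local charCount = utf8.len(str)
print(byteLength, charCount)  -- 6    5

-- Count only the characters that start at byte 3 or later: ï, v, e.
local tailCount = utf8.len(str, 3)
print(tailCount)  -- 3

-- On invalid input no error is raised; the first result is a false
-- value and the second is the position of the first invalid byte.
local bad, badPos = utf8.len("ab\xFFcd")
print(badPos)  -- 3
```

Checking the first result before using it is a cheap way to validate a string as UTF-8.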
Description: Returns the position (in bytes) where the encoding of the n-th codepoint of s (counting from byte position i) starts. A negative n gets characters before position i. The default for i is 1 when n is non-negative and #s + 1 otherwise, so that utf8.offset(s, -n) gets the offset of the n-th character from the end of the string. If the specified character is neither in the subject nor right after its end, the function returns nil.
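As an illustration of the rules above, the sketch below uses a four-character string whose characters take 1, 2, 3 and 1 bytes. The sample string and variable names are illustrative:

```lua
local s = "a\u{E9}\u{20AC}z"  -- "aé€z": a = byte 1, é = 2-3, € = 4-6, z = 7

local third = utf8.offset(s, 3)         -- start of the 3rd character (€)
local last = utf8.offset(s, -1)         -- start of the last character (z)
local secondLast = utf8.offset(s, -2)   -- start of the 2nd from the end (€)
local pastEnd = utf8.offset(s, 5)       -- right after the end: #s + 1
local missing = utf8.offset(s, 6)       -- beyond that: nil
print(third, last, secondLast, pastEnd, missing)  -- 4  7  4  8  nil

-- Extract the n-th character by slicing between two offsets.
local n = 3
local ch = s:sub(utf8.offset(s, n), utf8.offset(s, n + 1) - 1)
print(ch == "\u{20AC}")  -- true
```

The "right after the end" case is what makes the slicing idiom work for the last character as well, since `utf8.offset(s, n + 1)` then returns `#s + 1` rather than nil.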
Description: Returns an iterator function so that
for first, last in utf8.graphemes(str) do
  local grapheme = str:sub(first, last)
  -- body
end
will iterate over the grapheme clusters of the string.
Description: Converts the input string to Normalization Form C (NFC), which tries to convert decomposed characters into composed characters.
Description: Converts the input string to Normalization Form D (NFD), which tries to break up composed characters into decomposed characters.
Description: The pattern "[%z\x01-\x7F\xC2-\xF4][\x80-\xBF]*", which matches exactly one UTF-8 byte sequence, assuming that the subject is a valid UTF-8 string.