
Jul 02 2018, 3:58 PM PST 2 min

The utf8 library provides basic support for UTF-8 encoding. This library does not provide any support for Unicode other than the handling of the encoding. Any operation that needs the meaning of a character, such as character classification, is outside its scope.
Unless stated otherwise, all functions that expect a byte position as a parameter assume that the given position is either the start of a byte sequence or one plus the length of the subject string. As in the string library, negative indices count from the end of the string.

You can find a large catalog of usable UTF-8 characters from this web page.


Functions


utf8.char

string utf8.char(Tuple<int> codepoints)

Description: Receives zero or more codepoints as integers, converts each one to its corresponding UTF-8 byte sequence and returns a string with the concatenation of all these sequences.
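
As a minimal sketch of the behavior described above (assuming a Lua 5.3+ or Luau runtime; the variable names are illustrative only), each argument is one codepoint and the results are concatenated:

```lua
-- Each argument is a single codepoint; the encoded sequences are concatenated.
local hello = utf8.char(72, 101, 108, 108, 111) -- ASCII codepoints for "Hello"
local euro = utf8.char(0x20AC)                  -- U+20AC EURO SIGN
local empty = utf8.char()                       -- zero codepoints gives an empty string

print(hello)  -- Hello
print(#euro)  -- 3 (the euro sign occupies three bytes in UTF-8)
print(#empty) -- 0
```

Note that # reports the length in bytes, not in codepoints.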


utf8.codes

function, string, int utf8.codes(string str)

Description: Returns an iterator function so that the construction:

for position, codepoint in utf8.codes(str) do 
	-- body
end
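will iterate over all codepoints in string str, with position being the byte position at which each codepoint starts and codepoint being its integer value. It raises an error if it meets any invalid byte sequence.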
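
A short sketch of this loop in practice, assuming a runtime that supports the \u{...} string escape (Lua 5.3+ or Luau). The sample string and table names are hypothetical. It collects byte positions and codepoints, then uses pcall to observe the error raised on invalid input:

```lua
-- "h", U+00E9 (two bytes in UTF-8), "l", "l", "o": 6 bytes, 5 codepoints
local str = "h\u{E9}llo"

local positions, codepoints = {}, {}
for position, codepoint in utf8.codes(str) do
	table.insert(positions, position)
	table.insert(codepoints, codepoint)
end

print(table.concat(positions, " "))  -- 1 2 4 5 6
print(table.concat(codepoints, " ")) -- 104 233 108 108 111

-- An invalid byte sequence raises an error, which pcall turns into a false status.
local ok = pcall(function()
	for _ in utf8.codes("\xFF") do end
end)
print(ok) -- false
```

The positions skip from 2 to 4 because the second codepoint occupies bytes 2 and 3.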


utf8.codepoint

Tuple<int> utf8.codepoint(string str, int i = 1, int j = i)

Description: Returns the codepoints (as integers) of all characters in the provided string (str) that start between byte positions i and j (both included). The default for i is 1 and for j is i. It raises an error if it meets any invalid byte sequence.
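
The defaults can be illustrated with a small sketch (assuming Lua 5.3+ or Luau; the sample string is hypothetical). With no indices only the first codepoint is returned, and passing 1 and -1 returns every codepoint in the string:

```lua
local str = "h\u{E9}llo" -- U+00E9 occupies bytes 2 and 3

local first = utf8.codepoint(str)          -- i = 1, j = i
local second = utf8.codepoint(str, 2)      -- codepoint starting at byte 2
local all = { utf8.codepoint(str, 1, -1) } -- every codepoint in the string

print(first)  -- 104
print(second) -- 233
print(#all)   -- 5

-- Byte 3 is a continuation byte, not the start of a sequence, so this call errors.
local ok = pcall(utf8.codepoint, str, 3)
print(ok) -- false
```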


utf8.len

int utf8.len(string s, int i = 1, int j = -1)

Description: Returns the number of UTF-8 codepoints in the string s that start between byte positions i and j (both inclusive). The default for i is 1 and for j is -1. If it finds any invalid byte sequence, returns a false value plus the position of the first invalid byte.
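
The following sketch (assuming Lua 5.3+ or Luau; names are illustrative) contrasts the byte length with the codepoint count and shows the two return values for invalid input:

```lua
local str = "h\u{E9}llo"

print(#str)             -- 6 (bytes)
print(utf8.len(str))    -- 5 (codepoints)
print(utf8.len(str, 4)) -- 3 (codepoints starting at bytes 4, 5 and 6)

-- Invalid input does not raise an error: the first result is a false value
-- and the second is the byte position of the offending byte.
local count, badPosition = utf8.len("ab\xFFcd")
print(badPosition) -- 3
```

Because the failure is reported through return values rather than an error, utf8.len can also serve as a validity check before calling functions that do raise errors.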


utf8.offset

int utf8.offset(string s, int n, int i = 1)

Description: Returns the position (in bytes) where the encoding of the n-th codepoint of s (counting from byte position i) starts. A negative n gets characters before position i. The default for i is 1 when n is non-negative and #s + 1 otherwise, so that utf8.offset(s, -n) gets the offset of the n-th character from the end of the string. If the specified character is neither in the subject nor right after its end, the function returns nil.
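
A brief sketch of these rules, assuming Lua 5.3+ or Luau and a hypothetical sample string. It also shows one way to combine utf8.offset with string.sub to slice by codepoint instead of by byte:

```lua
local str = "h\u{E9}llo" -- codepoints start at bytes 1, 2, 4, 5, 6

print(utf8.offset(str, 3))  -- 4 (third codepoint starts at byte 4)
print(utf8.offset(str, -1)) -- 6 (last codepoint)
print(utf8.offset(str, -4)) -- 2 (fourth codepoint from the end)
print(utf8.offset(str, 6))  -- 7 (right after the end of the string)
print(utf8.offset(str, 7))  -- nil (out of range)

-- Extract the second codepoint as a substring.
local startByte = utf8.offset(str, 2)
local endByte = utf8.offset(str, 3) - 1
local second = str:sub(startByte, endByte)
print(#second) -- 2
```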


utf8.graphemes

function utf8.graphemes(string str, number i, number j)

Description: Returns an iterator function so that

for first, last in utf8.graphemes(str) do 
	local grapheme = str:sub(first, last)
	-- body
end
will iterate the grapheme clusters of the string.


utf8.nfcnormalize

string utf8.nfcnormalize(string str)

Description: Converts the input string to Normal Form C, which tries to convert decomposed characters into composed characters.


utf8.nfdnormalize

string utf8.nfdnormalize(string str)

Description: Converts the input string to Normal Form D, which tries to break up composed characters into decomposed characters.


Constants


utf8.charpattern

string utf8.charpattern

Description: The pattern "[%z-\x7F\xC2-\xF4][\x80-\xBF]*", which matches exactly one UTF-8 byte sequence, assuming that the subject is a valid UTF-8 string.
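
One possible use of this constant, sketched under the assumption of a Lua 5.3+ or Luau runtime and a valid UTF-8 input string, is splitting a string into per-character substrings with string.gmatch:

```lua
local str = "h\u{E9}llo"

-- Each match is the full byte sequence of one codepoint.
local chars = {}
for ch in string.gmatch(str, utf8.charpattern) do
	table.insert(chars, ch)
end

print(#chars)     -- 5
print(#chars[2])  -- 2 (the second character is a two-byte sequence)
```

Unlike utf8.codes, this approach does not validate the input, so it is best applied to strings already known to be valid.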