
utf8

The utf8 library provides basic support for UTF-8 encoding. This library does not provide any support for Unicode other than the handling of the encoding. Any operation that needs the meaning of a character, such as character classification, is outside its scope.

Unless stated otherwise, all functions that expect a byte position as a parameter assume that the given position is either the start of a byte sequence or one plus the length of the subject string. As in the string library, negative indices count from the end of the string.

You can find a large catalog of usable UTF-8 characters from this web page.

Functions

string utf8.char ( Tuple codepoints )

Receives zero or more codepoints as integers, converts each one to its corresponding UTF-8 byte sequence and returns a string with the concatenation of all these sequences.
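As a minimal sketch of the behaviour described above (the chosen codepoints are only illustrative), the following builds a string from three codepoints whose encodings take one, two, and three bytes:

```lua
-- U+0048 'H' (1 byte), U+00E9 'é' (2 bytes), U+20AC '€' (3 bytes)
local s = utf8.char(72, 233, 8364)

-- The # operator counts bytes, not codepoints: 1 + 2 + 3 = 6
local byteLength = #s

-- With no arguments the result is the empty string
local empty = utf8.char()
```

Because the result is an ordinary Lua string, it can be concatenated or passed to the other functions in this library like any other string.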

function, string, int utf8.codes ( string str )

Returns an iterator function so that the construction:

for position, codepoint in utf8.codes(str) do 
	-- body
end

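will iterate over all codepoints in string str, with position being the byte position at which each codepoint starts and codepoint its integer value. It raises an error if it meets any invalid byte sequence.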
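The loop can be sketched as follows, assuming a small sample string; the table names are hypothetical and exist only to collect the results. The last lines use pcall to show that an invalid byte raises an error rather than being skipped.

```lua
-- "a" is 1 byte, "é" (U+00E9) is 2 bytes, "€" (U+20AC) is 3 bytes
local str = "a\u{E9}\u{20AC}"

local positions, codepoints = {}, {}
for position, codepoint in utf8.codes(str) do
	positions[#positions + 1] = position    -- byte index where the codepoint starts
	codepoints[#codepoints + 1] = codepoint -- integer codepoint value
end
-- positions  is now {1, 2, 4}
-- codepoints is now {97, 233, 8364}

-- 0xFF can never appear in valid UTF-8, so iterating raises an error
local ok = pcall(function()
	for _ in utf8.codes("\xFF") do end
end)
-- ok is false
```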

Tuple utf8.codepoint ( string str, int i = 1, int j = i )

Returns the codepoints (as integers) of all characters in str that start between byte positions i and j (both inclusive). The default for i is 1 and for j is i, so calling the function with only a string returns the first codepoint. It raises an error if it meets any invalid byte sequence.
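A short sketch of the defaults and of the byte-position requirement, using an illustrative sample string:

```lua
-- Bytes: "a" at 1, "é" at 2-3, "€" at 4-6
local str = "a\u{E9}\u{20AC}"

-- Defaults (i = 1, j = i): only the first codepoint is returned
local first = utf8.codepoint(str)

-- A range covering the whole string returns one value per codepoint
local a, b, c = utf8.codepoint(str, 1, -1)

-- Byte 2 is the start of "é", so this is valid
local e = utf8.codepoint(str, 2)

-- Byte 3 is a continuation byte, not the start of a sequence, so this raises an error
local ok = pcall(utf8.codepoint, str, 3)
```

Here first is 97, the triple a, b, c is 97, 233, 8364, e is 233, and ok is false.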

int utf8.len ( string s, int i = 1, int j = -1 )

Returns the number of UTF-8 codepoints in the string s that start between byte positions i and j (both inclusive). The default for i is 1 and for j is -1. If it finds any invalid byte sequence, it returns a false value plus the position of the first invalid byte.
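The difference between byte length and codepoint count, and the two-value result for invalid input, can be sketched like this (the sample strings are illustrative):

```lua
-- Bytes: "a" at 1, "é" at 2-3, "€" at 4-6
local str = "a\u{E9}\u{20AC}"

local count = utf8.len(str)        -- 3 codepoints
local bytes = #str                 -- 6 bytes
local fromSecond = utf8.len(str, 2) -- 2 codepoints start at or after byte 2

-- Invalid input: the first result is a false value,
-- the second is the byte position of the first invalid byte
local n, badPosition = utf8.len("ab\xFFcd")
-- n is a false value and badPosition is 3
```

Since the failure case does not raise an error, checking the first result before using it is a cheap way to validate untrusted input.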

int utf8.offset ( string s, int n, int i = 1 )

Returns the position (in bytes) where the encoding of the n-th codepoint of s (counting from byte position i) starts. A negative n gets characters before position i. The default for i is 1 when n is non-negative and #s + 1 otherwise, so that utf8.offset(s, -n) gets the offset of the n-th character from the end of the string. If the specified character is neither in the subject nor right after its end, the function returns nil.
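The rules above can be sketched with a sample string. The helper charAt is hypothetical and not part of the library; it only illustrates one way to combine utf8.offset with string.sub to extract the n-th character.

```lua
-- Bytes: "a" at 1, "é" at 2-3, "€" at 4-6, "b" at 7
local str = "a\u{E9}\u{20AC}b"

local third = utf8.offset(str, 3)      -- 4: the third character starts at byte 4
local last = utf8.offset(str, -1)      -- 7: the last character starts at byte 7
local secondLast = utf8.offset(str, -2) -- 4: the second character from the end
local pastEnd = utf8.offset(str, 5)    -- 8: right after the end (#str + 1)
local missing = utf8.offset(str, 6)    -- nil: neither in the string nor right after it

-- Hypothetical helper: returns the n-th character of s, or nil if there is none
local function charAt(s, n)
	local start = utf8.offset(s, n)
	local nextStart = utf8.offset(s, n + 1)
	if not start or not nextStart then
		return nil
	end
	return s:sub(start, nextStart - 1)
end

local euro = charAt(str, 3) -- the three bytes of "€"
```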

function utf8.graphemes ( string str, number i, number j )

Returns an iterator function so that

for first, last in utf8.graphemes(str) do 
	local grapheme = str:sub(first, last)
	-- body
end

will iterate the grapheme clusters of the string.

string utf8.nfcnormalize ( string str )

Converts the input string to Normal Form C, which tries to convert decomposed characters into composed characters.

string utf8.nfdnormalize ( string str )

Converts the input string to Normal Form D, which tries to break up composed characters into decomposed characters.

Constants

string utf8.charpattern

The pattern "[%z\x01-\x7F\xC2-\xF4][\x80-\xBF]*", which matches exactly one UTF-8 byte sequence, assuming that the subject is a valid UTF-8 string. The first class matches a lead byte and the second matches zero or more continuation bytes.
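As a sketch of typical use, the pattern can be passed to string.gmatch to split a string into one substring per encoded character. The sample string and the table name are illustrative, and the input is assumed to be valid UTF-8.

```lua
-- "a" is 1 byte, "é" (U+00E9) is 2 bytes, "€" (U+20AC) is 3 bytes
local str = "a\u{E9}\u{20AC}"

local characters = {}
for character in str:gmatch(utf8.charpattern) do
	characters[#characters + 1] = character
end
-- characters now holds three strings of 1, 2 and 3 bytes respectively
```

Unlike utf8.codes, this approach yields the encoded substrings rather than integer codepoints, and it does not validate the input.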