This page describes the Unicode support in Object Icon. The language provides an additional string type, which behaves like a conventional string but can contain any unicode character. Csets have also been enhanced in Object Icon.
ucs type
ucs (standing for Unicode character string) is a new builtin type, whose behaviour closely mirrors that of the conventional Icon string. It operates by providing a wrapper around a conventional Icon string, which must be in utf-8 format. This has several advantages, and only one serious disadvantage, namely that a utf-8 string is not randomly accessible, in the sense that one cannot say directly where the representation for unicode character i begins. To alleviate this disadvantage, the ucs type maintains an index of offsets into the utf-8 string to make random access faster. The size of the index is only a few percent of the total allocation for the ucs object.
Another potential disadvantage of utf-8 as an internal format, namely that it is awkward to edit a string (since a new character can have a different length to the one it’s replacing), happily doesn’t apply in Object Icon, since strings are immutable.
Two new escape sequences are provided to represent the utf-8 character sequences of unicode characters. These are \u, which is followed by up to four hex digits, and \U, which is followed by up to six hex digits. Each expands to between one and four characters, depending on the unicode character concerned. So, for example, the line
write(image("\u0001*\u00ff*\u1234*\U10ffff"))
writes
"\x01*\xc3\xbf*\xe1\x88\xb4*\xf4\x8f\xbf\xbf"
Note that this is still just an ordinary string, rather than a ucs string.
ucs string
A ucs string can be created at compile time, as a literal, or at runtime, via the builtin ucs function. To create a literal, prefix a u to an ordinary string literal, which must be valid utf-8; for example:
s := u"\u0001*\u00ff*\u1234*\U10ffff"
To use the ucs function, just call it like any other function:
s := ucs(x)
The parameter to ucs must be something which can be converted to a string, and that string must be valid utf-8 (otherwise ucs fails). Note that all plain ascii strings (i.e., those with only characters less than 128) are in utf-8 format.
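Because ucs fails rather than raising an error on invalid input, it can be used directly as a test. The following is a minimal sketch of that idea; the procedure and variable names are illustrative only, and it assumes write comes from the io package as in the other programs on this page.

```icon
import io

procedure main()
   local x, s
   # "abc" is plain ascii and hence valid utf-8; "\xff" is not valid utf-8.
   every x := "abc" | "\xff" do {
      if s := ucs(x) then
         write(image(x), " converts to ", image(s))
      else
         write(image(x), " is not valid utf-8")
   }
end
```

The first value should take the success branch and the second the failure branch.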
ucs strings
The ucs type supports all of the familiar string operations, with the same semantics as the conventional string type.
String operations which take two parameters can usually mix string and ucs types, although some care is needed. The general rule is: if either parameter is a ucs, then the other parameter must be convertible to a ucs. For example, consider string catenation. The expression
"abc" || u"\u1234"
is valid, and has the result u"abc\u1234", because "abc" is valid utf-8 and hence convertible to a ucs. However, the expression
"\xff" || u"\u1234"
is invalid and will cause a runtime error, because "\xff" is not valid utf-8 and hence cannot be converted to a ucs.
Converting a ucs back to a normal string produces the utf-8 representation. This is the internal representation, so this operation is very fast.
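The difference between the two views of the same data shows up in the size operator. The sketch below assumes the standard string function is used for the conversion; the variable names are illustrative.

```icon
import io

procedure main()
   local u, s
   u := u"\u1234"
   # Converting to a conventional string exposes the utf-8 bytes.
   s := string(u)
   write(*u)          # size counted in unicode characters
   write(*s)          # size counted in bytes of the utf-8 representation
   write(image(s))
end
```

Since character 16r1234 occupies three bytes in utf-8 (as the earlier image output shows), the two sizes should be 1 and 3 respectively.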
In Icon, csets can only represent characters in the range 0 to 255. Object Icon extends this range to cover all possible unicode characters (0 up to 0x10FFFF).
The \u and \U escape sequences can be used to specify characters greater than 255. For example
'\x01\xff\u1234\U10ffff'
specifies a cset with four characters. You can also specify a range of characters by using a hyphen. Thus 'a-zA-Z' has the lower and upper case letters, '0-9' has the digits, and so on. A new keyword, &uset, contains all of the possible characters and is equivalent to '\x00-\U10ffff'.
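As a quick check of the cset literal syntax described above, the size operator can be applied to each cset. This is a small sketch only; the variable name is illustrative.

```icon
import io

procedure main()
   local c
   c := '\x01\xff\u1234\U10ffff'
   write(*c)           # four characters, as described above
   write(*'a-zA-Z')    # the hyphens denote ranges, so this counts the 52 letters
   write(*&uset)       # every possible character, from 0 up to 16r10FFFF
end
```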
Generating the elements of a cset with ! will produce one-character strings for those elements less than 256, and one-character ucs strings for those elements greater than or equal to 256. So for example the expression
!'\x01\xff\u1234\U10ffff'
produces the following four results
"\x01"
"\xff"
u"\u1234"
u"\U10ffff"
Indexing a cset will produce either a normal string or a ucs string, depending on whether any of the elements in the range are greater than or equal to 256. For example the expression
'\x01\xff\u1234\U10ffff'[1:1 to 5]
produces the following results
""
"\x01"
"\x01\xff"
u"\x01\u00ff\u1234"
u"\x01\u00ff\u1234\U10ffff"
The builtin ord function can be used to access the numerical values of some or all of the characters in a cset (see below for a full explanation of ord).
Any cset can be converted to a ucs string, but only a cset whose characters are all less than 256 can be converted to a normal string.
The ord function expands on its Icon predecessor. The first parameter, x, can be a string, a ucs string, or a cset. The optional parameters i and j specify a range within x, and default to 1 and 0 respectively. The result sequence is the integer character values of the specified range. For example
ord(u"\x01\u00ff\u1234\U10ffff")
generates
1
255
4660
1114111
whilst
ord(&ucase, 5, 10)
generates
69
70
71
72
73
The uchar function is the ucs equivalent of the char function. It produces a one-character ucs string containing character number x.
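Since ord generates character numbers and uchar turns a character number back into a one-character ucs, the two can be combined into a round trip. The following sketch illustrates this; the variable names are illustrative.

```icon
import io

procedure main()
   local s, t
   s := u"\x01\u00ff\u1234\U10ffff"
   t := u""
   # ord generates each character number in turn; uchar rebuilds the character.
   every t ||:= uchar(ord(s))
   if s == t then
      write("round trip preserved all ", *s, " characters")
end
```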
The text function will try to convert x to either a ucs or a conventional string, as appropriate. If x is a string or ucs, it is just returned. If x is a cset, then it is converted to a string if its highest character is less than 256; otherwise it is converted to a ucs. For any other type, normal string conversion is attempted.
The Text class, in package lang, has some static methods which may prove useful.
The method has_ord(c, x) tests whether character number x is in cset c.
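A minimal sketch of has_ord in use follows. It assumes that the method signals its result by succeeding or failing, in the usual Icon manner, rather than by returning a flag; that is an assumption, not something stated on this page.

```icon
import io
import lang(Text)

procedure main()
   # 4660 is 16r1234, which lies inside the range given in the cset literal.
   if Text.has_ord('\u1000-\u2000', 4660) then
      write("character 4660 is in the cset")
   # 65 is the letter A, which is not a member of &digit.
   if not Text.has_ord(&digit, 65) then
      write("character 65 is not in &digit")
end
```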
The method utf8_seq(i) produces the utf-8 string representation of character i. This is useful for building up a utf-8 string which can then be passed to ucs. For example, consider the problem of converting an iso-8859-1 format string to a ucs. One way to do this would be:
procedure iso8859_to_ucs(s)
   local t
   t := u""
   # Append each character of s in turn, as a one-character ucs.
   every t ||:= uchar(ord(s))
   return t
end
The drawback with this method is that it creates lots of temporary ucs values in the every loop (uchar produces one, and the old value of t is thrown away).
A quicker way is to create a utf-8 string first, and then create the ucs result at the end:
import lang(Text)

procedure iso8859_to_ucs(s)
   local t
   t := ""
   # Build an ordinary utf-8 string first; only one ucs is created, at the end.
   every t ||:= Text.utf8_seq(ord(s))
   return ucs(t)
end
Source code files can be edited in non-ASCII format.
To specify a file’s encoding, a preprocessor directive, $encoding, is used. The directive is followed by the encoding name, which at present can take one of three possible values:
ASCII (the default)
ISO-8859-1
UTF-8
Each source file is processed as a sequence of codepoints, which are converted from the input bytes, based on the encoding. For ASCII encoding and ISO-8859-1 encoding, each codepoint is the same as each input byte. The only difference is that ASCII restricts the range of codepoints to 0-127, as opposed to 0-255 for ISO-8859-1. For UTF-8 encoding each codepoint may correspond to several input bytes, and may be any valid Unicode codepoint.
Other than escape sequences, each codepoint within a string, ucs or cset literal will correspond to exactly one character in that literal. For a string, the codepoint must be in the range 0-255; otherwise a compile-time error is signalled.
import io

$encoding UTF-8

procedure main()
   local s
   s := u"Министры иностранных дел Европейского союза утвердили"
   s ? every write(upto('ив'))
end
This program produces the output
2
4
10
27
47
51
53