International Character Support in Allegro CL

The index for the Allegro CL Documentation is in index.htm. The documentation is described in introduction.htm.

This document contains the following sections:

1.0 Introduction
2.0 Internal Representation
   2.1 History
   2.2 Unicode
   2.3 Memory Usage
   2.4 Character names
3.0 External formats
   3.1 External-Format Overview
      3.1.1 Basic External-Format Types
      3.1.2 The unicode and fat External-Format Types
      3.1.3 Composed External-Formats
      3.1.4 Defining External-Formats
      3.1.5 Retrieving Existing External-Formats
      3.1.6 External-Format Runtime Mode
   3.2 External-Format Usage
      3.2.1 Streams
      3.2.2 String <-> External-Format Lisp Arrays
   3.3 Older Allegro CL External-Format Compatibility
4.0 Foreign-Functions
5.0 Locales
   5.1 The initial locale when Allegro CL starts up
   5.2 Locales in applications
6.0 Earlier International Allegro CL Compatibility
   6.1 EUC Module
   6.2 :mode Option Removal
Appendix A. Functions, Symbols, Variables Documentation
   Appendix A.1. External-Format API
Appendix B. #\newline Discussion
Appendix C. 8-bit images

1.0 Introduction

Starting with Allegro CL Release 6.0, the International Version of Allegro CL, which has been available to Unix users since Release 3.1, and to Windows users since Release 5.0.1, takes center stage to become the new standard Allegro CL. The older, non-international version remains available to Allegro CL users, as described in Appendix C 8-bit images.

For most users, this version change will be undetectable. Some users may notice new warnings about how to improve code compatibility between non-international and International Allegro CL, especially regarding string passing to foreign functions (described below in this document). The main benefits, however, of the new version are for users developing internationalized applications requiring non-English character sets. The internal changes to Allegro CL allow for universal character representation and character Input/Output that is more flexible than that available with previous Allegro CL releases. This document describes these changes and how Lisp programmers can exploit the new features.

Note on examples with non-ASCII characters: This document contains some examples with non-ASCII characters to illustrate Allegro CL's ability to represent them. These examples are displayed using JPEG pictures, and therefore you cannot cut and paste them, as you can with examples containing only ASCII characters.

2.0 Internal Representation

2.1 History

The previously standard, non-international, 8-bit version of Allegro CL represents characters internally using 8 bits per numeric character code. In Allegro CL, English letters and punctuation characters are represented using the ASCII character set. Several 8-bit extended character sets, including the ISO-8859 series, define numeric codes for non-English/non-ASCII characters. Although non-international Allegro CL does not provide specific support for 8-bit extended character sets, all 8-bit character codes are representable in non-international Allegro CL.

The International version of Allegro CL was originally developed to support Japanese characters of which there are too many to represent using the standard 8-bit per character code model. With Release 6.0, Allegro CL is extended to support all international characters (ie, Asian, European, etc.) by using 16-bit Unicode as the internal character representation model. The Unicode standard is used as the internal representation model for the Windows NT/2000 Operating System as well as the Java programming language.

The International version of Allegro CL has the feature :ics on the *features* list. The non-International version does not have that feature.

2.2 Unicode

Each character, be it a letter, Chinese ideograph, Korean Hangul, punctuation mark, or other glyph, has a unique numeric Unicode representation value. Please visit the Unicode Web site (www.unicode.org) for more information on the Unicode standard. We provide a basic description of Unicode in this document.

The characters from the Latin-1 (aka ISO 8859-1) character set, a 256 character (ie, 8-bit) superset of the ASCII character set, have the same values in Unicode as they do in Latin-1. This provides convenient compatibility for programs which depend on numeric character codes strictly within the Latin-1 range.

Characters from other sets, however, may have different values in Unicode. For example, the Latin-2 character "Latin Capital Letter L With Stroke" has value #xa3 in Latin-2 (ISO 8859-2), but has value u+0141 in Unicode [we use the 'u+xxxx' convention here for describing Unicode values; 'xxxx' is in hexadecimal format]. Thus, programs which depend on character code values of non-Latin-1 characters may need to be examined and possibly updated to operate with Allegro CL 6.0. Users with existing Allegro CL programs who do not wish to update their programs immediately can use the non-international, 8-bit character based Allegro CL, which does not use Unicode to represent characters.

Note that, as described later in this document, conversions from external format encodings, such as Latin-2, happen automatically during Lisp Input/Output. Thus, the only areas where user code is likely to be affected by differences between internal character representation among non-international and International Allegro CL are places where the Lisp functions char-code and code-char are called directly on non-Latin-1 characters. For example, using the Latin-2 character "Latin Capital Letter L With Stroke", the following sessions show the differences:
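The sessions themselves are displayed as pictures (see the note in the Introduction). As a rough stand-in, here is a minimal sketch that uses only char-code and code-char. It assumes a Lisp whose character codes are Unicode values, as in International Allegro CL; the two variable names are illustrative and are not part of Allegro CL.

```lisp
;; Sketch only: assumes character codes are Unicode values, as in
;; International Allegro CL. The variable names are illustrative.
(defparameter *l-with-stroke-unicode* #x0141) ; u+0141 in Unicode
(defparameter *l-with-stroke-latin2*  #xa3)   ; octet value in Latin-2

;; Building the character from its Unicode value round-trips.
(char-code (code-char *l-with-stroke-unicode*))   ; => 321, ie #x141

;; The Latin-2 octet value, used directly as a character code, names a
;; different character, so this comparison is false.
(char= (code-char *l-with-stroke-latin2*)
       (code-char *l-with-stroke-unicode*))
```

Code that passes the Latin-2 value #xa3 to code-char and expects "Latin Capital Letter L With Stroke" is the kind of code that may need updating.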

2.3 Memory Usage

Internally, all Lisp strings are represented as arrays of Unicode character codes. Each array element is exactly 16-bits wide, even if the string contains only 7-bit ASCII characters. This widening of strings causes a memory usage increase. However, since almost all initial Allegro CL strings are stored in memory-mapped files, the initial runtime memory usage difference between International Allegro CL and non-international Allegro CL is less than 5%. Users wishing to deliver applications with their (read-only) strings in similarly memory mapped files can use the :purify option to generate-application. Please see delivery.htm for more information.

2.4 Character names

Lisp characters can be represented using the `#\[name]' syntax, where [name] is the character's Unicode name. Since the Unicode naming convention uses spaces in character names, and since the Lisp character reader treats space as a token delimiter, #\_ (underscore) characters are used to act as spaces in the Unicode name. For example, the following shows the Unicode name for u+0141:

> (code-char #x0141)
 #\latin_capital_letter_l_with_stroke

 > (format t "u+~4,'0x" (char-code #\latin_capital_letter_l_with_stroke))
 u+0141

Not all Unicode characters have names. In particular, most CJK (Chinese/Japanese/Korean) characters are unnamed. If you are using Mule or the Allegro CL IDE to enter Japanese characters, though, you can name the characters directly:

Character names specified in ANSI Common Lisp are also recognized. Thus, some characters have more than one name in Allegro CL:

> (format t "u+~4,'0x" (char-code #\latin_capital_letter_a))
 u+0041

 > (format t "u+~4,'0x" (char-code #\A))
 u+0041

3.0 External formats

As described above, International Allegro CL characters and character strings are represented internally using the Unicode standard with each character occupying exactly 16-bits per character code. Externally, however, and outside of Allegro CL's control, most non-ASCII characters are stored in variable-width multi-byte models using any one of several different representations.

For example, there are several common ways to represent Japanese characters, and most of these encodings specify that ASCII characters (which are non-Japanese) occupy a single 8-bit byte each, whereas Japanese characters may occupy two or three 8-bit bytes each depending on the character and the encoding.

Allegro CL 6.0 provides stream-level and foreign-function call-level automatic translation between Unicode and several of these external formats. We describe how external-formats are used in this section as well as how users can define their own Unicode to External Format translations for their own external formats.

3.1 External-Format Overview

3.1.1 Basic External-Format Types

The simplest external-format is for the Latin-1 character set. This is the external-format used when the default locale (sometimes known as the "C" or "POSIX" locale) is being used. For input, the Latin-1 external-format translation simply takes the next input octet and forms a Lisp character from the single octet's numeric value. For output, the external-format takes the Lisp character's code, and uses, as its octet output, the character code value. If the Lisp character code value is greater than 255 (ie, beyond what can be represented as a Latin-1 octet), then the ASCII value for question-mark (== 63) is used as the output octet. Thus question-marks appearing in Latin-1 output can indicate places where non-Latin-1 characters are used.
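The output rule just described can be sketched in a few lines of portable Common Lisp. This is only an illustration of the rule, not Allegro CL's implementation, and the function name latin1-output-octets is hypothetical.

```lisp
;; Sketch of the Latin-1 output rule described above. This is not
;; Allegro CL's implementation; the function name is hypothetical.
(defun latin1-output-octets (string)
  "Return a vector of octets for STRING, using 63 (question-mark)
for any character whose code does not fit in one octet."
  (map '(vector (unsigned-byte 8))
       (lambda (ch)
         (let ((code (char-code ch)))
           (if (< code 256) code 63)))
       string))

;; u+0141 is outside Latin-1, so it comes out as 63.
(latin1-output-octets (coerce (list #\a (code-char #x0141) #\b) 'string))
;; => #(97 63 98)
```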

The next simplest class of external-formats is the 8-bit character sets, such as any of the ISO-8859 sets. (The Latin-1 case described above is actually ISO-8859-1.) For these external-formats, each Lisp character corresponds to one octet. (A special case exception is the #\newline case described below.) For the non-Latin-1 external-formats, the translation is generally done by fast table lookup. The following are the names and nicknames for the 8-bit external-formats supplied with Allegro CL:

 Name            Nicknames                      Comments
 ----            ---------                      --------
 :latin1         :ascii, :8-bit, :iso8859-1, t
 :1250                                          For MS Windows
 :1251                                          For MS Windows
 :1252                                          For MS Windows
 :1253                                          For MS Windows
 :1254                                          For MS Windows
 :1255                                          For MS Windows
 :1256                                          For MS Windows
 :1257                                          For MS Windows
 :1258                                          For MS Windows
 :iso8859-2      :latin-2, :latin2
 :iso8859-3      :latin-3, :latin3
 :iso8859-4      :latin-4, :latin4
 :iso8859-5      :latin-5, :latin5
 :iso8859-6      :latin-6, :latin6
 :iso8859-7      :latin-7, :latin7
 :iso8859-8      :latin-8, :latin8
 :iso8859-9      :latin-9, :latin9
 :iso8859-14     :latin-14, :latin14
 :iso8859-15     :latin-15, :latin15

The most general class of external-formats is for the variable-width multi-byte character sets often used for Asian languages. As described above, a single Lisp character may be represented externally using several external-format octets. The external-format conversion routines consume octets on input or generate octets on output, and may use table lookup for translation to/from Lisp characters. The following are the names and nicknames for the multi-byte external-formats supplied with Allegro CL:

 Name            Nicknames                      Comments
 ----            ---------                      --------
 :utf8           :utf-8
 :euc
 :874                                           For MS Windows
 :932                                           For MS Windows
 :949                                           For MS Windows
 :950                                           For MS Windows
 :jis
 :shiftjis

See 3.1.2 The unicode and fat External-Format Types for external formats that are exactly two bytes wide.

3.1.2 The unicode and fat External-Format Types

Two external formats that use precisely two bytes (16 bits) per character are :fat and :unicode. These external formats are similar, but :unicode follows Unicode byte-ordering conventions. In particular, when a stream is first opened with the :unicode external-format or when a stream's external-format is changed (via (setf stream-external-format)) to the :unicode external-format, the Unicode byte-order marker is used in the following way:

The first time a character is requested from the stream (eg, via read-char) a check for the Unicode byte-order marker is made. If one is found, then the stream's internal state for subsequent character byte-ordering is set accordingly so that any necessary byte-swapping is done automatically by the external-format converter. If a byte-order marker is not found, then little-endian ordering is assumed.

The first time a character is output to the stream (eg, via write-char) a Unicode byte-order marker is written before that first character. See sniff-for-unicode.
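The input-side check can be sketched as follows. The sketch relies on the Unicode byte-order marker being the character u+feff, which appears as the octets #xfe #xff in big-endian order and #xff #xfe in little-endian order. The function name sniff-byte-order and its return convention are hypothetical; this is not the sniff-for-unicode function, and it works on an octet vector rather than on a stream.

```lisp
;; Sketch only: classifies the first two octets of 16-bit Unicode data.
;; The function name and return convention are hypothetical.
(defun sniff-byte-order (octets)
  "Return two values: :big-endian or :little-endian, and the number
of leading octets taken up by a byte-order marker (0 or 2)."
  (if (>= (length octets) 2)
      (let ((b0 (aref octets 0))
            (b1 (aref octets 1)))
        (cond ((and (= b0 #xfe) (= b1 #xff)) (values :big-endian 2))
              ((and (= b0 #xff) (= b1 #xfe)) (values :little-endian 2))
              ;; No marker: little-endian is assumed, as described above.
              (t (values :little-endian 0))))
      (values :little-endian 0)))

(sniff-byte-order #(#xfe #xff 0 65))   ; => :big-endian, 2
(sniff-byte-order #(65 0))             ; => :little-endian, 0
```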

3.1.3 Composed External-Formats

The most abstract of external-formats in Allegro CL 6.0 are the "composed" or "composing" or "wrapper" external-formats. Unlike the basic external-formats described above, which translate between Lisp characters and octets, a composing external-format provides translations between Lisp characters and (other) Lisp characters. The most widely used composed external-format, named :crlf, is used to convert the Common Lisp #\newline character from/to the combination #\return #\linefeed. The :crlf external-format is used by default on the Windows platform where the textual convention is to end each line (regardless of character set encoding) with the ASCII octets 13 and 10 which represent 'carriage return' and 'linefeed' respectively.
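To make the character-to-character idea concrete, here is a minimal sketch of the output direction of such a translation, written as an ordinary string function. It only illustrates the mapping; the real :crlf external-format works inside the stream machinery, and the function name newline-to-crlf is hypothetical.

```lisp
;; Sketch only: a character-to-character translation in the spirit of
;; :crlf, written as a plain string function. Hypothetical name.
(defun newline-to-crlf (string)
  "Return a copy of STRING with each newline character replaced by
a return character followed by a linefeed character."
  (with-output-to-string (out)
    (loop for ch across string
          do (cond ((char= ch #\newline)
                    (write-char #\return out)
                    (write-char #\linefeed out))
                   (t (write-char ch out))))))

;; The codes 13 and 10 appear where the newline was.
(map 'list #'char-code (newline-to-crlf (format nil "a~%b")))
;; => (97 13 10 98)
```

The result is still a sequence of Lisp characters; a base external-format would then turn those characters into octets.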

On the Windows platform, the default external-format used (that which is specified by the current locale, see 5.0 Locales) is a composition of the :crlf external-format and the locale-specific base external-format. For example, on US English Windows, the default base external-format is called :1252-base (the 1252 names the "code page" name used by Windows for US English -- this character set is effectively the same as Latin-1). In this Windows environment, Allegro CL creates and uses as default the composed external-format :crlf-1252-base. In other words, all Input/Output is filtered through a #\newline processor as well as base-level external-format. For Japanese Windows, where the default code page is 932 (corresponding to Japanese Shift-JIS), the base external-format is :932-base, and Allegro CL composes the :crlf external-format with the :932-base external-format to create :crlf-932-base as the Lisp's default external-format.

The locale that carries this default external-format is made the global value of *locale* at Allegro CL startup time.

The changes to Allegro CL 6.0 regarding #\newline handling can, in some cases, cause compatibility problems for code that was explicitly handling multi-character newline terminations. The special composing external-format :crcrlf is designed to work around these problems. See Appendix B #\newline Discussion below.

The function crlf-base-ef extracts the external format composed with the :crlf external-format or the :crcrlf external-format when passed such a composed external format.

3.1.4 Defining External-Formats

An external-format object is defined in Lisp. Many external-formats are pre-defined for and distributed with Allegro CL 6.0. New external-formats may be made available after the Allegro CL 6.0 release date either as patches or supplemental lisp files.

Users can define their own external-formats using def-external-format. A complete external-format object includes translation macros specified by def-char-to-octets-macro and def-octets-to-char-macro. These macros are used internally by Allegro CL to fill code templates that use external-formats. Pre-filled versions of these templates can be built and stored as auto-loaded fasl files using the function generate-filled-ef-templates.

3.1.5 Retrieving Existing External-Formats

The find-external-format function takes as its required argument a name and returns the external-format whose name, or one of whose nicknames, matches the argument. When the argument is :default, find-external-format returns the external-format associated with *locale* (the current locale, see 5.0 Locales). If the external-format cannot immediately be found as defined in the Lisp, then an attempt is made to autoload the external-format definition. The string "ef-" is concatenated with the string name of the argument and passed to the Common Lisp 'require' function. This effectively means that a module named ef-[name].fasl, where [name] is the argument to find-external-format, is sought and, if found, loaded. Using autoloading in this way allows Allegro CL to have in memory only those external-formats and translation tables that are needed.

3.1.6 External-Format Runtime Mode

An external-format lacking translation macro definitions is said to be in runtime-mode. This means that the external-format exists (ie, can be retrieved with find-external-format), and other aspects of the external-format such as its nicknames can be retrieved, but unfilled code templates cannot be filled for that external-format. The reason one may want an external-format to be in runtime-mode is that if the code templates for an external-format have already been filled by, say, having previously used generate-filled-ef-templates, then the macro definitions and other structures needed by the macros at their expansion time can be deleted to save space. The def-ef-switch-to-runtime macro is used to name a function (or function object) that when funcalled clears structures used by the macros that are not needed when the external-format is in runtime mode. The Allegro CL switch-ef-to-runtime function switches an external-format to runtime mode.

When an external-format is autoloaded (see 3.1.5 Retrieving Existing External-Formats above), an attempt is also made to autoload the pre-filled external-format code templates. These pre-filled templates are stored in separate fasl files, usually with names that begin with 'efft-'. If the pre-filled templates are successfully loaded, then the just-loaded external-format is automatically switched to runtime mode.

3.2 External-Format Usage

3.2.1 Streams

The Lisp 'open' function (and, analogously, 'load' and 'compile-file') takes an ':external-format' argument. If this argument is not specified, a default external-format, based on the current locale (see 5.0 Locales below) is used.

When an operation requests to read an input character stream's next character, the stream's external-format template(s) will request one or more octets from the buffered stream device which it then translates into a Lisp character. Similarly, for writing Lisp characters via streams, the external-format is used to translate the Lisp character code in Unicode to the octet(s) specified by the external-format translation.

A stream's external-format can be changed at arbitrary times, using (setf (stream-external-format ...) ...). If it is changed to be an external-format for which readers/writers are not already built, the Lisp compiler is invoked to build a new associated reader/writer in the stream for that external-format. Since the external-format translation routines are defined using macros, the Lisp compiler is used to build new readers/writers, thus keeping runtime stream overhead from external-format processing to the bare minimum.
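As a small illustration of the :external-format argument, the following sketch writes one non-ASCII character through a :utf8 stream and then reads the file back as raw octets. The function name write-then-read-octets and the file name are hypothetical; :utf8 is taken from the table of multi-byte external-formats in section 3.1.1.

```lisp
;; Sketch only: the function name and file name are hypothetical, and
;; :utf8 is the external-format name listed in section 3.1.1.
(defun write-then-read-octets (path)
  "Write one non-ASCII character to PATH as UTF-8, then read the
file back as raw octets."
  (with-open-file (out path :direction :output
                            :if-exists :supersede
                            :external-format :utf8)
    (write-char (code-char #x0141) out))
  (with-open-file (in path :direction :input
                           :element-type '(unsigned-byte 8))
    (let ((octets (make-array (file-length in)
                              :element-type '(unsigned-byte 8))))
      (read-sequence octets in)
      octets)))

(write-then-read-octets "ef-sketch.tmp")
```

Since u+0141 is encoded in UTF-8 as the two octets #xc5 #x81, the returned array should hold 197 and 129, although the Lisp character written was a single 16-bit code.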

3.2.2 String <-> External-Format Lisp Arrays

When defining a new external-format, the string-to-octets and octets-to-string functions are a convenient way to test the conversion macros. These functions (formerly known in Allegro CL 5.0.1 as string-to-mb and mb-to-string which are still supported aliases) and the functions string-to-native and native-to-string take an external-format argument to specify how to convert between a Lisp string and a Lisp octet array.

For example, the following translates to the shift-jis external-format:

> (setq mb (string-to-octets (coerce '(#\hiragana_letter_a
                                       #\hiragana_letter_i
                                       #\hiragana_letter_u)
                                     'string)
                             :external-format :shiftjis))

#(130 160 130 162 130 164 0)
7

The following takes the above Shift-JIS result and converts it to EUC:

> (string-to-octets
    (octets-to-string mb :external-format :shiftjis)
    :external-format :euc)

 #(164 162 164 164 164 166 0)
 7

The first value returned by string-to-octets is the octet array in EUC format. The second value is the number of octets generated including the null-terminating 0 which is added by string-to-octets (and not by the external-format).

3.3 Older Allegro CL External-Format Compatibility

Use of the existing special variable *default-external-format* is discouraged in Allegro CL 6.0. Users are encouraged either to bind a locale to *locale* (see 5.0 Locales) or to directly specify the desired external-formats when calling functions that take the :external-format argument (eg, with-native-string, string-to-mb, mb-to-string, etc.). In Allegro CL 6.0, the default value of *default-external-format* is :default. When find-external-format is invoked with :default, the returned external-format will be that stored in *locale*.

4.0 Foreign-Functions

In C, the 'char' type is equivalent to an 8-bit byte. A C string is represented as an array of 8-bit bytes terminated by the null (or 0) byte. Therefore, a C routine expecting a 'char *' argument may expect a null-terminated 8-bit character array (ie, string) in this format.

The Allegro CL Foreign-Function Interface allows users/programmers to pass Lisp strings to C routines expecting 'char *' arguments. Since Allegro CL internally null-terminates lisp string objects, passing a lisp string to a foreign function simply means internally passing the address to the first character of the lisp string's array.

The previous Allegro CL string-passing mechanism described above breaks down in the International version since the internal character codes of the lisp string's array are in Unicode, and non-ASCII characters may not match the codes in the locale's native (or external) format. Furthermore, even for ASCII-only strings, a 'char *' argument expects its value to be in a format where ASCII characters are 1-byte per character. International Allegro CL represents all characters, including ASCII characters, as 2-bytes per character. The upper byte of each ASCII character is always zero. Therefore, even if a user wishes to pass an ASCII-only string from Allegro to a foreign function, the foreign function will most likely treat the string argument as truncated since the first upper (all-zero) byte will be regarded as the string terminator.

For International Allegro CL 5.0.1, the solution offered to this problem was to provide a macro, called with-native-string, to be used around all foreign-function calls that pass strings. This macro is used to convert string arguments to native format using a dynamic-extent array of 8-bit bytes.

Even with the with-native-string solution, users porting foreign function code from earlier releases of Allegro CL to International Allegro CL (now the standard version with Release 6.0) would have to manually hunt down every string-passing foreign function call in order to wrap those calls with with-native-string.

In order to save users from this burden, Allegro CL 6.0 has a new keyword argument :strings-convert to def-foreign-call. The default value of this argument is true.

When :strings-convert is true, then when any of the specified arguments at def-foreign-call time are declared directly or indirectly as (* :char), def-foreign-call augments the function wrapping the low-level foreign function call so that for each (* :char) declared argument, a check is made at runtime to see if that declaration's corresponding value is a string. If it is, then that value is converted at runtime to native-string format using a dynamic-extent array, and this new array is passed in place of the original string argument to the low-level foreign function call.

Since this runtime search for string arguments only happens for those arguments declared directly or indirectly as (* :char), no new code is introduced for foreign functions not expecting strings. Consequently, no checking is introduced if the arguments are specified as a &rest list.

Suppose we have a C function which takes two input string arguments and one output string argument defined as follows:

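 #include <string.h>

 /*
  * concatenates first two arguments,
  * returns result in third output argument.
  * The caller must supply a retval buffer large enough
  * for both strings plus the terminating null byte.
  */
 char *
 myconcat(char *st1, char *st2, char *retval)
 {
     int st1len = strlen(st1);
     int st2len = strlen(st2);
     int i;

     /* copy the first string */
     for (i = 0; i < st1len; i++) {
         retval[i] = st1[i];
     }
     /* append the second string */
     for (i = 0; i < st2len; i++) {
         retval[st1len + i] = st2[i];
     }
     retval[st1len + st2len] = '\0';
     return retval;
 }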

To call this function from Allegro CL 6.0 using the foreign function interface, one can define the Lisp foreign function as follows:

(ff:def-foreign-call myconcat ((st1 (* :char))
                               (st2 (* :char))
                               (result (* :char)))
      :returning :int)

Evaluating the above form causes the following warnings (of condition type ff:strings-convert-def-warning) to be signaled:

 Warning: A runtime with-native-string call is being generated for argument
	  `st1' to the foreign-function `myconcat'.  The with-native-string
	  macro can be used for explicit string conversions around the foreign
	  calls.  This warning is suppressed when :strings-convert is specified
	  as nil in the def-foreign-call.
 Warning: A runtime with-native-string call is being generated for argument
	  `st2' to the foreign-function `myconcat'.  The with-native-string
	  macro can be used for explicit string conversions around the foreign
	  calls.  This warning is suppressed when :strings-convert is specified
	  as nil in the def-foreign-call.

Disregarding the warnings for the moment, and continuing on to use the just-defined foreign function, note that we can call it with lisp string arguments:

user(22): (let ((x (make-array 500 :element-type '(unsigned-byte 8))))
             (myconcat "abc" "def" x)
             (octets-to-string x))
"abcdef"
6
6

The returned value, "abcdef", which is the concatenation of the first two arguments (performed by the C foreign function) is correctly returned.

To turn the example above into one which (a) doesn't generate the warnings, and (b) generates faster runtime code since string conversion checking will be suppressed, one can set the strings-convert keyword argument to false as follows:

(ff:def-foreign-call myconcat ((st1 (* :char))
                               (st2 (* :char))
                               (result (* :void)))
     :strings-convert nil
     :returning :int)

By specifying :strings-convert as nil, the foreign-function interface will not automatically convert string arguments. Thus, to call the foreign-function defined this way, one needs to pass converted string arguments as follows:

user(23): (let ((x (make-array 500 :element-type '(unsigned-byte 8))))
             (with-native-string (st1 "abc")
               (with-native-string (st2 "xyz")
                 (myconcat st1 st2 x)))
             (octets-to-string x))
"abcxyz"
6
6

It is instructive to note what happens when :strings-convert is nil, yet the string arguments are not converted:

user(24): (let ((x (make-array 500 :element-type '(unsigned-byte 8))))
             (myconcat "abc" "def" x)
             (octets-to-string x))
"ad"
2
2

The result is the first character of the first string concatenated with the first character of the second string. The reason this happens is that the C foreign function sees the unconverted arguments as Unicode strings with each element being two octets wide. To the C function, each argument appears as the first octet being an ASCII character, and the second octet being a string NULL terminator.

Note that on big-endian platforms, the result of the above form is "" (ie, the empty string). That's because the Unicode ASCII values have zero in their upper bytes, which come first on such platforms, so any array of Unicode ASCII values appears to C routines as a zero-length null-terminated string.
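Both results can be reproduced without any foreign code by modelling what the C routine sees. The sketch below lays a string out as two octets per character code in either byte order and then counts octets up to the first zero, as C's strlen would. Both function names are hypothetical, and the layout is only a model of the string's in-memory array.

```lisp
;; Sketch only: models how a C routine sees a string of 16-bit
;; character codes. Both function names are hypothetical.
(defun wide-string-octets (string byte-order)
  "Lay out STRING as two octets per character code, in BYTE-ORDER
(:little-endian or :big-endian), followed by a 16-bit zero."
  (let ((octets '()))
    (flet ((push-code (code)
             (let ((low (ldb (byte 8 0) code))
                   (high (ldb (byte 8 8) code)))
               (if (eq byte-order :little-endian)
                   (progn (push low octets) (push high octets))
                   (progn (push high octets) (push low octets))))))
      (loop for ch across string do (push-code (char-code ch)))
      (push-code 0))
    (coerce (nreverse octets) 'vector)))

(defun c-strlen (octets)
  "Count octets before the first zero octet, as C's strlen would."
  (or (position 0 octets) (length octets)))

(c-strlen (wide-string-octets "abc" :little-endian)) ; => 1
(c-strlen (wide-string-octets "abc" :big-endian))    ; => 0
```

A length of 1 for each argument matches the "ad" result above, and a length of 0 matches the empty string seen on big-endian platforms.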

Note that neither setting of the strings-convert keyword argument affects foreign function return results or "output" variables. Users with foreign function code that expects to "fill in" Lisp strings directly will need to modify those calls to pass octet arrays and do conversions, eg, with octets-to-string as shown above for the example's third argument.

5.0 Locales

When Allegro CL starts up, the global variable *locale* is bound to a locale object. The value of *locale* is treated as the current locale.

The Windows and UNIX Operating Systems define a locale environment for each running program. The OS definition of locale describes date/time formats, currency printing formats, and sort ordering information in addition to character type information.

In Allegro CL 6.0, *locale* is only used to determine the default external-format (see example just below) because Allegro CL 6.0 queries only the character encoding from the OS locale. The external-format used in the default Lisp locale object is derived from the encoding. In Release 6.0, no consideration is given in Lisp for other aspects of OS locales such as date/time or currency format. Therefore, nothing within Allegro CL 6.0 besides the external-format uses or depends on *locale*, although future releases may expand on how *locale* is used.

Here is an example showing how changing the locale changes the external-format:

cl-user(47): (dolist (x (list (find-locale "C")
                              (find-locale "japan.EUC")))
               (let ((*locale* x))
                 (format t "~&*locale*=~s;~%default external-format=~s~2%"
                         *locale*
                         (find-external-format :default))))
*locale*=#<locale "C" (English/default) [:latin1-base] @ #x[...]>;
default external-format=#<external-format :latin1-base @ #x[...]>

*locale*=#<locale "japan" [:euc-base] @ #x[...]>;
default external-format=#<external-format :euc-base @ #x[...]>

See 5.1 The initial locale when Allegro CL starts up for information on how the initial locale (that is, the initial value of *locale*) is set.

The *locale* variable is analogous to the Common Lisp *package* variable in that rebinding the variable can affect basic Lisp functionality such as Input/Output.

The standardized convention for locale names is Name[_Territory][.Charset]. Because Allegro CL 6.0 only uses locales for external-formats, the most important part of the locale name is the Charset.
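The naming convention can be illustrated with a small sketch that splits a locale name into its three parts. The function name split-locale-name is hypothetical, and this is not how Allegro CL itself parses locale names; it only shows which part of a name is the Charset.

```lisp
;; Sketch only: splits a locale name of the form
;; Name[_Territory][.Charset]. The function name is hypothetical and
;; this is not how Allegro CL itself parses locale names.
(defun split-locale-name (locale-name)
  "Return three values: name, territory or NIL, charset or NIL."
  (let* ((dot (position #\. locale-name))
         (base (subseq locale-name 0 dot))
         (charset (and dot (subseq locale-name (1+ dot))))
         (underscore (position #\_ base))
         (name (subseq base 0 underscore))
         (territory (and underscore (subseq base (1+ underscore)))))
    (values name territory charset)))

(split-locale-name "japan.EUC")   ; => "japan", NIL, "EUC"
(split-locale-name "de_CH")       ; => "de", "CH", NIL
```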

Suppose the following Lisp session were started in a Japanese locale using the EUC encoding. One can override the default external-format by dynamically changing the locale as follows:

The following is a list of locale names recognized by Allegro CL 6.0:

Name    Language                Territory       External-Format
----    --------                ---------       ---------------
C       English                 default         :latin1
ca      Catalan                 Spain           :latin1
cz      Czech                   Czech Republic  :iso8859-2
da      Danish                  Denmark         :latin1
de      German                  Germany         :latin1
de_AT   German                  Austria         :latin1
de_CH   German                  Switzerland     :latin1
el      Greek                   Greece          :iso8859-7
en_AU   English                 Australia       :latin1
en_CA   English                 Canada          :latin1
en_UK   English                 Great Britain   :latin1
en_US   English                 United States   :latin1
es      Spanish                 Spain           :latin1
es_AR   Spanish                 Argentina       :latin1
es_BO   Spanish                 Bolivia         :latin1
es_CL   Spanish                 Chile           :latin1
es_CO   Spanish                 Colombia        :latin1
es_CR   Spanish                 Costa Rica      :latin1
es_EC   Spanish                 Ecuador         :latin1
es_GT   Spanish                 Guatemala       :latin1
es_MX   Spanish                 Mexico          :latin1
es_NI   Spanish                 Nicaragua       :latin1
es_PA   Spanish                 Panama          :latin1
es_PE   Spanish                 Peru            :latin1
es_PY   Spanish                 Paraguay        :latin1
es_SV   Spanish                 El Salvador     :latin1
es_UY   Spanish                 Uruguay         :latin1
es_VE   Spanish                 Venezuela       :latin1
et      Estonian                Estonia         :iso8859-10
fr      French                  France          :latin1
fr_BE   French                  Belgium         :latin1
fr_CA   French                  Canada          :latin1
fr_CH   French                  Switzerland     :latin1
hu      Hungarian               Hungary         :iso8859-2
is      Icelandic               Iceland         :latin1
it      Italian                 Italy           :latin1
ja      Japanese                Japan           :euc
ko      Korean                  Korea           :korean
lt      Lithuanian              Lithuania       :iso8859-10
lv      Latvian                 Latvia          :iso8859-10
nl      Dutch                   Netherlands     :latin1
nl_BE   Dutch                   Belgium         :latin1
no      Norwegian               Norway          :latin1
pl      Polish                  Poland          :iso8859-2
POSIX   English                 default         :latin1
pt      Portuguese              Portugal        :latin1
pt_BR   Portuguese              Brazil          :latin1
ru      Russian                 Russian Federation :iso8859-5
su      Finnish                 Finland         :latin1
sv      Swedish                 Sweden          :latin1
tr      Turkish                 Turkey          :iso8859-9
zh      Simplified Chinese      China(PRC)      :simplified-chinese-euc
zh_TW   Traditional Chinese     Taiwan(ROC)     :traditional-chinese-euc

5.1 The initial locale when Allegro CL starts up

The initial locale (that is, the initial value of *locale*) can be determined in various ways. Note that the variable is first set to a locale that (presumably) always exists. It is reset only if a valid locale is found in one of the later steps.

1. Initially, *locale* is set to (find-locale "C") (see find-locale). This remains the value of *locale* if the following tests fail.
2. If the environment variable ACL_LOCALE is set, then Allegro CL attempts to look up, using find-locale, the Lisp locale object named by ACL_LOCALE. If a corresponding Lisp locale object is found, then *locale* is set to this object.
3. If the environment variable ACL_LOCALE is not set, then Allegro CL attempts to look up, using find-locale, the Lisp locale object named by a call to setlocale(LC_CTYPE). (This is the portable operating-system-level way to look up a locale on both Windows and Unix.) If a corresponding Lisp locale object is found, then *locale* is set to this object. Note: the operating system is polled, as described in this step, only if ACL_LOCALE is not set. If ACL_LOCALE is set but its value is bogus (i.e. find-locale returns nil on the value), the value of *locale* remains its initial value, (find-locale "C"), and will only be changed, if at all, by the next step.
4. If the -locale command-line argument is specified (see Command line arguments in startup.htm), then Allegro CL attempts to look up, using find-locale, the Lisp locale object named by the argument value. If a corresponding Lisp locale object is found, then *locale* is set to this object. Thus, using -locale effectively overrules any environmental setting of LC_CTYPE or ACL_LOCALE. Note that this step is performed when command-line argument processing is done. All the steps above are done earlier in the startup procedure. See What Lisp does when it starts up in startup.htm for details.
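
The precedence described above can be sketched in portable Common Lisp. This is only a model of the decision order, not Allegro CL's startup code: *known-locales*, lookup-locale (a stand-in for find-locale) and resolve-initial-locale are hypothetical names, and the three keyword arguments stand in for the value of ACL_LOCALE, the name reported by the operating system, and the -locale argument.

```lisp
;; Hypothetical table standing in for the locales find-locale knows.
(defparameter *known-locales* '("C" "japan.EUC" "cz"))

;; Stand-in for find-locale: return the locale name if known, else nil.
(defun lookup-locale (name)
  (and name (find name *known-locales* :test #'string=)))

(defun resolve-initial-locale (&key acl-locale os-locale command-line-locale)
  ;; Start from the "C" locale, which is assumed always to exist.
  (let ((locale (lookup-locale "C")))
    (if acl-locale
        ;; ACL_LOCALE is set: use it if valid; a bogus value leaves "C"
        ;; in place and the operating system is NOT consulted.
        (setf locale (or (lookup-locale acl-locale) locale))
        ;; ACL_LOCALE is not set: fall back to the operating system.
        (setf locale (or (lookup-locale os-locale) locale)))
    ;; The command-line argument is processed last and so wins if valid.
    (when command-line-locale
      (setf locale (or (lookup-locale command-line-locale) locale)))
    locale))
```

Under these assumptions, a bogus ACL_LOCALE combined with a valid operating-system locale still resolves to "C", while a valid -locale value overrides either environmental source.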

Some examples from UNIX

% env ACL_LOCALE=japan.EUC mlisp
cl-user(1): *locale*
#<locale "japan" [:euc-base] @ #x404a02da>

% mlisp -locale cz
cl-user(1): *locale*
#<locale "cz" (Czech/Czech Republic) [:iso8859-2-base] @ #x400f7a02>

% env ACL_LOCALE=japan.EUC ./lispi -I mlisp.dxl -locale cz
cl-user(1): *locale*
#<locale "cz" (Czech/Czech Republic) [:iso8859-2-base] @ #x400f7a02>
cl-user(2):

On UNIX machines, you can determine the available locales with the locale -a shell command. A process's locale is often specified by setting the LANG environment variable, which supplies the default for several locale categories, including LC_CTYPE.

5.2 Locales in applications

If you are preparing an application for delivery to another computer (using generate-application or perhaps build-lisp-image), and the locale on the target computer is different from the locale on the computer generating the application, you must be sure the application can successfully change locales. The easiest way to ensure this is to specify a true value for the runtime-bundle keyword argument to generate-application or build-lisp-image. This produces a runtime bundle file, to be shipped with the application, from which the locale-changing code can be loaded if needed.

If you know the specific locale on the target computer, you can call find-locale with that locale name (a string) as the argument during the image build. To do so, put such a form in a file and include that file in the list of files given as the lisp-files argument to build-lisp-image or as the required input-files argument to generate-application. But if you want to be ready for any locale, specify a true value for runtime-bundle.

6.0 Earlier International Allegro CL Compatibility

6.1 EUC Module

In previous releases of International Allegro CL for UNIX, EUC was the only supported external-format, and a special internal format known as 'process-code' was used. Some user-visible EUC-specific functionality was added to Lisp when International Allegro CL was first created, and it exists in Release 5.0.1 on UNIX. This functionality, which mostly consists of character type definitions, has been moved into a separate module that is no longer built into Allegro CL 6.0. Backward compatibility can be achieved by loading this module with '(require :euc)'.

6.2 :mode Option Removal

Because of incompatibilities between the UNIX and Windows Operating Systems with respect to textual line termination, a special keyword argument, :mode, was added to the cl:open function. This flag determined whether a stream would read one or two characters when a text line was terminated. With #\newline handling integrated into external-formats, this flag is no longer needed. A warning is signaled if this flag is used. See also Appendix B #\newline Discussion.

Appendix A: Functions, Symbols, Variables Documentation

Individual operators, variables, etc. are documented on their own pages, as is standard in Allegro CL documentation. In this section, we provide a list of the operators, variables, etc. with brief descriptions and links to the documentation pages.

Appendix A.1 External-Format API

def-external-format, a macro which defines an external-format object. External-formats are structs (defined using defstruct).
def-ef-switch-to-runtime, a macro which associates a function object with an external-format.
switch-ef-to-runtime, a function which invokes the function object that def-ef-switch-to-runtime associated with an external format.
find-external-format, a function which returns the external-format object named by the argument symbol.
def-char-to-octets-macro, a defining macro which defines a macro associated with an external format which will be used for converting a character object to a sequence of octets.
char-to-octets, a macro which expands to the macro stored in the char-to-octets-macro slot of the external-format passed in.
def-octets-to-char-macro, a defining macro which defines a macro associated with an external format which will be used for converting a sequence of octets to a character object.
octets-to-char, a macro which expands to the macro stored in the octets-to-char-macro slot of the external-format passed as an argument.
compose-external-formats, a function which creates a new external-format composed of argument external-formats.
composed-external-format-p, a function which returns true or false as its argument is or is not a composed external format.
ef-composer-ef, a function which returns the value in the composer slot of an external format.
ef-composee-ef, a function which returns the value in the composee slot of an external format.

The following two functions are named by unexported symbols. We document them because the sources for Allegro CL 6.0 external-formats will be made available to users and these functions and trie data structures are referenced in the sources. The symbols naming these functions are kept internal in the Allegro CL packages to indicate that their associated functions are subject to change.

build-trie

Function

Package: excl

Arguments: &key name list index-key value-key optimize

name should be a symbol. list should be a list of index/value pairs. index-key and value-key should be designators of functions of one argument (symbols, function specs, or function objects). optimize should be a boolean.

The symbol naming this function is not exported. excl::build-trie builds a trie data structure consisting of the data from list. A trie data structure holds index/value pairs where most of the values are the same default value. The trie structure does not store these default values for each index, thus saving space and making lookup more efficient. It is not necessary for users to know details of trie data structures.

The list argument names a list of index/value pairs for the trie. The index-key is a function which when applied to a pair returns the index of the pair. The value-key is a function which when applied to a pair returns the value of the pair.

If the optimize argument is true, then any rows in the resulting trie that would be equalp to rows in any of the tries returned by excl::all-tries are shared. The result is that rows which are equalp across all existing tries become eq.

Examples:

[*package* is the excl package for these examples]

(let ((jis-to-unicode-list '((#x2121 . #x3000)
			      [...]
			     (#x2124 . #xff0c)
			      [...])))
  (build-trie :name :jis-to-unicode
	      :list jis-to-unicode-list
	      :index-key #'car
	      :value-key #'cdr)
  (build-trie :name :unicode-to-jis
	      :list jis-to-unicode-list
	      :index-key #'cdr
	      :value-key #'car))

(let ((unicode-to-jis-trie
       (cadr (member :unicode-to-jis (excl::all-tries)))))
  (write (aref (aref unicode-to-jis-trie
		     (ldb (byte 8 8) #x3000))
	       (ldb (byte 8 0) #x3000))
	 :base 16))
  ==> [prints 2121]
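
The two-level lookup used in the example above (high byte selects a row, low byte selects an entry) can be mimicked with a small portable Common Lisp sketch. This is not the implementation of excl::build-trie; make-sketch-trie and sketch-trie-ref are hypothetical names, nil is assumed to be the default value, and indexes are assumed to fit in 16 bits. The sketch only illustrates why sharing one default row saves space.

```lisp
;; Hypothetical sketch, not excl::build-trie: a 256-element vector of
;; rows, indexed by the high byte of a 16-bit index.  All rows that
;; hold only the default value (nil here) share a single row object.
(defun make-sketch-trie (pairs &key (index-key #'car) (value-key #'cdr))
  (let* ((default-row (make-array 256 :initial-element nil))
         (trie (make-array 256 :initial-element default-row)))
    (dolist (pair pairs trie)
      (let* ((index (funcall index-key pair))
             (high (ldb (byte 8 8) index))
             (low (ldb (byte 8 0) index)))
        ;; Allocate a private row the first time a high byte is used.
        (when (eq (aref trie high) default-row)
          (setf (aref trie high) (make-array 256 :initial-element nil)))
        (setf (aref (aref trie high) low) (funcall value-key pair))))))

;; Look up INDEX with the same two-step aref as in the example above.
(defun sketch-trie-ref (trie index)
  (aref (aref trie (ldb (byte 8 8) index))
        (ldb (byte 8 0) index)))
```

Built from the two pairs shown in the example with :index-key #'cdr and :value-key #'car, the sketch maps #x3000 back to #x2121, and every high byte that was never used still points at the one shared default row.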

all-tries

Function

Package: excl

Arguments:

The symbol naming this function is not exported. excl::all-tries returns a list of all tries built by excl::build-trie. The returned list is in plist format: (trie-name1 trie1 trie-name2 trie2 ...).

Appendix B: #\newline Discussion

ANSI Common Lisp specifies that the single character #\newline denotes the end of a character text line. This requirement is complicated by the lack of a uniform convention among Operating Systems regarding textual line endings. To end a line, UNIX-based applications generally use Ascii 10; Macintosh-based applications use Ascii 13; and Windows-based applications use two Ascii characters: 13 followed by 10.

To confuse things further, Common Lisp implementations also differ on which Ascii character code is used for #\newline. Some use 13, others use 10. Some Common Lisp implementations may have disregarded the ANSI requirement that a single newline character be returned at the ends of lines, and for Windows, where two character bytes denote line endings, return more than one character at the end of a line.

Allegro CL 5.0.1 uses Ascii 10 for #\newline, following the UNIX convention. By using UNIX compatibility modes provided by MS Windows, Allegro CL 5.0.1 is able (via the ':mode :text' option to the Allegro CL 'open' function) to read/write files following the Windows line ending conventions. One problematic side-effect with this approach, though, is that the Allegro CL file-position routine is inaccurate.

Allegro CL 6.0 has a new external-format processing mechanism for handling character I/O. Using this mechanism, developed initially for multi-byte international characters, several adjacent external bytes may translate to a single Common Lisp character. A natural use of this mechanism is to map external end-of-line markers to/from the single Common Lisp #\newline character. This mechanism eliminates the need for the ':mode :text' option to the 'open' function and also fixes the file-position problem.

Specifically, for Windows the newline processing is handled using a composing external-format. The exact description is as follows:

Standard Translation:
---------------------

Characters -> External Octets:

 Lisp Character                         External Octet Sequence
 --------------                         -----------------------

  #\return  ----------------------------->  13
  #\newline [eq #\linefeed]  ------------>  13 10

External Octets -> Characters:

 External Octet Sequence               Lisp Character
 -----------------------               --------------
 
  13 10  ------------------------------->  #\newline [eq #\linefeed]
  13  ---------------------------------->  #\return
  10  ---------------------------------->  #\linefeed [eq #\newline]
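
The standard translation tables above can be modeled in portable Common Lisp. The sketch below is not the Allegro CL external-format code; chars-to-octets and octets-to-chars are hypothetical names, octets are represented as a list of integers, and all characters are assumed to have codes below 256. Characters are identified by code (10 for linefeed, 13 for return) so the sketch does not depend on how a given Lisp names them.

```lisp
;; Characters -> octets: linefeed (code 10) becomes 13 10; every other
;; character is written as its own code.
(defun chars-to-octets (string)
  (loop for ch across string
        for code = (char-code ch)
        if (= code 10)
          append (list 13 10)
        else
          append (list code)))

;; Octets -> characters: 13 followed by 10 becomes a single linefeed;
;; a lone 13 or a lone 10 maps to the character with that code.
(defun octets-to-chars (octets)
  (let ((chars '()))
    (loop while octets
          do (let ((octet (pop octets)))
               (cond ((and (= octet 13) octets (= (first octets) 10))
                      (pop octets)
                      (push (code-char 10) chars))
                     (t (push (code-char octet) chars)))))
    (coerce (nreverse chars) 'string)))
```

With these definitions, a string consisting of a return followed by a linefeed encodes to 13 13 10 rather than 13 10.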

One effect of this translation is that since #\linefeed is the same as #\newline, '#\return #\linefeed' translates to '13 13 10' instead of '13 10' as would happen with Allegro CL 5.0.1 on Windows. This scenario is only likely to be noticed if a program deliberately inserts the '#\return #\linefeed' sequence into a string that is to be converted to external format.

A fix for this situation is for the user/programmer simply to use '#\newline' instead of the '#\return #\linefeed' sequence. For users/programmers not immediately able to make this change, a compatibility mode exists in the form of an alternate composing external-format which translates as follows:

Compatibility Translation:
--------------------------

Characters -> External Octets:

 Lisp Character Sequence                           External Octet Sequence
 -----------------------                           -----------------------

  [ADD] #\return #\linefeed [eq #\newline]  ----->  13 10

  #\return  ------------------------------------->  13           
  #\newline [eq #\linefeed]  -------------------->  13 10

External Octets -> Characters:

 External Octet Sequence               Lisp Character Sequence
 -----------------------               -----------------------
 
  [ADD] 13 13 10  ---------------------->  #\return #\return #\linefeed

  13 10  ------------------------------->  #\newline [eq #\linefeed]
  13  ---------------------------------->  #\return
  10  ---------------------------------->  #\linefeed [eq #\newline]

Note that the new rules allow the external octet sequence '13 13 10' to be preserved after undergoing a round-trip conversion via Lisp characters.
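
The round-trip claim can be checked with a portable Common Lisp model of the compatibility tables. As before, this is only a sketch and not the Allegro CL implementation: compat-chars-to-octets and compat-octets-to-chars are hypothetical names, octets are a list of integers, and the longest matching rule is assumed to be tried first.

```lisp
;; Characters -> octets, compatibility rules.  The added rule maps a
;; return (13) immediately followed by a linefeed (10) to 13 10.
(defun compat-chars-to-octets (string)
  (let ((octets '())
        (i 0)
        (n (length string)))
    (loop while (< i n)
          do (let ((code (char-code (char string i))))
               (cond ((and (= code 13)
                           (< (1+ i) n)
                           (= (char-code (char string (1+ i))) 10))
                      ;; [ADD] rule: return + linefeed -> 13 10
                      (push 13 octets) (push 10 octets)
                      (incf i 2))
                     ((= code 10)
                      ;; lone linefeed/newline -> 13 10
                      (push 13 octets) (push 10 octets)
                      (incf i))
                     (t (push code octets)
                        (incf i)))))
    (nreverse octets)))

;; Octets -> characters, compatibility rules.  The added rule maps
;; 13 13 10 to return, return, linefeed.
(defun compat-octets-to-chars (octets)
  (let ((codes '()))
    (loop while octets
          do (cond ((and (eql (first octets) 13)
                         (eql (second octets) 13)
                         (eql (third octets) 10))
                    ;; [ADD] rule
                    (setf octets (nthcdr 3 octets))
                    (push 13 codes) (push 13 codes) (push 10 codes))
                   ((and (eql (first octets) 13) (eql (second octets) 10))
                    (setf octets (nthcdr 2 octets))
                    (push 10 codes))
                   (t (push (pop octets) codes))))
    (map 'string #'code-char (nreverse codes))))
```

In this model, decoding 13 13 10 and encoding the result again gives back 13 13 10, and a return followed by a linefeed encodes to 13 10 as it did in Allegro CL 5.0.1 on Windows.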

Use of the compatibility mode is only supported at lisp startup time via a new -compat-crlf command-line argument (see Command line arguments in startup.htm).

Appendix C: 8-bit images

Standard Allegro CL 6.0 supports international characters and uses two bytes (16 bits total) for each character. 8-bit characters are not supported in standard Allegro CL. However, 8-bit versions of Allegro CL 6.0 are supplied. The 8-bit executables have `8' in their names (mlisp8, etc.). While most users will likely use the standard version, some (particularly those who manipulate very large ASCII strings) may wish to use the 8-bit version. Note that the 8-bit version does not support 16-bit strings or characters, and fasl files are incompatible between the two versions.