
<html>
<body>

<p>
@author YAMATODANI Kiyoshi
<br>
@version $Id: Overview.html,v 1.3 2007/04/19 04:10:32 kiyoshiy Exp $
</p>

<p>
LMLML (Library of MultiLingualization for ML) aims to support writing multilingualized programs in ML. The current version supports multi-byte string processing only.
</p>
<p>
The string manipulation modules of existing ML compilers and of the ML Basis Library assume, in effect, a codec that encodes each character in a single byte. They do not anticipate codecs that encode a character in multiple bytes.
As a result, it is hard for ML programmers to write applications that must handle text encoded in various codecs.
The SML# project developed LMLML to support the development of such multi-byte string applications.
</p>
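<p>
The one-byte-per-character assumption can be seen with the Basis Library alone. The following is a minimal sketch, not part of LMLML: the names <code>ken</code>, <code>byteCount</code> and <code>pieceCount</code> are chosen here for illustration, and the string literal is the UTF-8 encoding of the single character '&#x5263;' (U+5263).
</p>
<pre>
  (* UTF-8 bytes of the single character U+5263: 0xE5 0x89 0xA3 *)
  val ken = "\229\137\163";

  (* String.size counts bytes, not characters. *)
  val byteCount = String.size ken;

  (* String.explode yields one Char.char per byte. *)
  val pieceCount = List.length (String.explode ken);
</pre>
<p>
Both values are 3, although the text contains only one character.
</p>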

<h2>Features</h2>

<p>
<ul>
<li>isolation of codec-specific code,</li>
<li>near compatibility with the SML Basis Library,</li>
<li>support for new encoding methods,</li>
<li>Standard ML compliance.</li>
</ul>
</p>

<hr>
<h2>1. Isolating codec-specific code</h2>

<p>
With LMLML, you can select the codec dynamically for each string.
You can also manipulate strings encoded in different codecs as values of the same type.
This lets you isolate code that depends on specific codecs from code that is codec-independent.
</p>

<p>
You can select the encoding method dynamically with the <code>MultiByteString</code> structure.
</p>

<p>
The <code>decode</code> functions of <code>MultiByteString</code> take a string that specifies the encoding method to use.
</p>

<pre>
  signature MULTI_BYTE_STRING =
  sig
    structure Char :
    sig
      type char
      val decodeBytesSlice
          : String.string -> Word8VectorSlice.slice -> char option
      val decodeBytes : String.string -> Word8Vector.vector -> char option
      val decodeString : String.string -> String.string -> char option
       :
    end
    structure String :
    sig
      type string
      val decodeBytesSlice : String.string -> Word8VectorSlice.slice -> string
      val decodeBytes : String.string -> Word8Vector.vector -> string
      val decodeString : String.string -> String.string -> string
       :
    end
  end
  structure MultiByteString : MULTI_BYTE_STRING
</pre>

<p>
For example, you can decode a <code>byteVector : Word8Vector.vector</code> in UTF-16 encoding as follows.
</p>
<pre>
  # val s1 = MultiByteString.String.decodeBytes "UTF-16" byteVector;
  val s1 = - : MultiByteString.String.string
</pre>

<p>
Available encoding methods are listed by <code>MultiByteString.getCodecNames</code>.
</p>
<pre>
  # MultiByteString.getCodecNames();
  val it =
      [
        "UTF-16",
        "UTF-16LE",
        "UTF-16BE",
        "UTF-8",
        "SHIFT_JIS",
        "MS_KANJI",
        "CSSHIFTJIS",
        "ISO-2022-JP",
        "CSISO2022JP",
        "GB2312",
        "CSGB2312",
        "GBK",
        "CP936",
        "MS936",
        "WINDOWS-936",
        "EXTENDED_UNIX_CODE_PACKED_FORMAT_FOR_JAPANESE",
        "CSEUCPKDFMTJAPANESE",
        "EUC-JP",
        "ANSI_X3.4-1968",
        "ISO-IR-6",
        "ANSI_X3.4-1986",
        "ISO_646.IRV:1991",
        "ASCII",
        "ISO646-US",
        "US-ASCII",
        "US",
        "IBM367",
        "CP367",
        "CSASCII"
      ]
      : string list
</pre>
<p>
As described below, new encoding methods can be added.
</p>

<p>
If the encoding method is fixed statically, it is more efficient to use the structure that implements that encoding method directly.
</p>
<pre>
  # val s2 = UTF16Codec.String.fromBytes byteVector;
  val s2 = - : UTF16Codec.String.string
</pre>

<p>
However, strings obtained in these two ways are not compatible with each other.
</p>
<pre>
  # MultiByteString.String.size s2;
  stdIn:5.1-5.30 Error:
    operator and operand don't agree
    operator domain: MultiByteString.String.string
    operand: UTF16Codec.String.string
</pre>

<hr>
<h2>2. Near compatibility with the SML Basis Library</h2>

<p>
<code>MultiByteString</code> and encoding-specific modules, such as <code>UTF16Codec</code>, provide interfaces that are almost compatible with <code>Char</code> and <code>String</code> of the Basis Library.
An existing ML program can therefore be upgraded to support multi-byte string codecs with only minor changes.
</p>
<pre>
  signature MB_CHAR =
  sig
    type char

    val compare : char * char -> order
    val isAscii : char -> bool
       :
  end

  signature MB_STRING =
  sig
    type string
    type char

    val sub : string * int -> char
    val explode : string -> char list
         :
  end

  signature MULTI_BYTE_STRING =
  sig
    structure Char : MB_CHAR
    structure String : MB_STRING
    sharing type Char.string = String.string
    sharing type Char.char = String.char
  end
  structure MultiByteString : MULTI_BYTE_STRING

  signature CODEC =
  sig
    structure Char : MB_CHAR
    structure String : MB_STRING
    sharing type Char.string = String.string
    sharing type Char.char = String.char
  end
  structure UTF16Codec : CODEC
</pre>
<p>
LMLML also provides functors that generate multibyte-string versions of <code>Substring</code>, <code>StringCvt</code> and <code>ParserComb</code>.
</p>
<pre>
  functor SubstringBase
  functor StringConverterBase
  functor ParserCombinatorBase
</pre>
<p>
Note: In the current version, these functors are not loaded in the prelude. To use them, load "LMLML/extension.sml" as follows.
<pre>
# use "LMLML/extension.sml";
</pre>
</p>
<p>
For example, you can obtain a multibyte-string version of <code>Substring</code> for UTF-16 encoding as follows.
</p>
<pre>
  local
    structure MBS = UTF16Codec.String
    structure MBC = UTF16Codec.Char
    structure P =
    struct
      type char = MBS.char
      type string = MBS.string
      val sub = MBS.sub
      val substring = MBS.substring
      val size = MBS.size
      val concat = MBS.concat
      val compare = MBS.compare
      val compareChar = MBC.compare
    end
  in
  structure UTF16Substring : MB_SUBSTRING = SubstringBase(P)
  end
</pre>

<hr>
<h2>3. Support for new encoding methods</h2>

<p>
LMLML already supports major codecs such as Shift_JIS and UTF-16.
You can extend LMLML with a new module that supports the encoding method you need, without changing LMLML itself.
</p>

<p>
First, define a structure that implements the <code>PRIM_CODEC</code> signature.
For example, to support an encoding "foo", define a structure as follows.
</p>
<pre>
  structure FooCodecPrim : PRIM_CODEC =
  struct
    val names = ["foo"]
          :
  end
</pre>
<p>
Then, apply the <code>Codec</code> functor to it.
</p>
<pre>
  structure FooCodec = Codec(FooCodecPrim);
</pre>
<p>
This code registers the foo codec with <code>MultiByteString</code>.
</p>

<p>
You can then decode bytes with the foo codec as follows.
</p>
<pre>
  val mbs1 = MultiByteString.String.decodeBytes "foo" byteVector;
</pre>
<p>
Of course, you can use <code>FooCodec</code> directly.
</p>
<pre>
  val mbs2 = FooCodec.String.fromBytes byteVector;
</pre>

<hr>
<h2>4. Standard ML compliance</h2>

<p>
The major features of LMLML are implemented without depending on SML#-specific features.
LMLML can be used with any compiler that conforms to the Definition of Standard ML, including of course SML#, but probably also SML/NJ, MLton, and many others.
</p>

<h4>SML#</h4>
<p>
LMLML is installed with the SML# system, and its core modules are loaded in the prelude.
</p>
<p>
In the current version of SML#, the <code>Codec</code> functor is not loaded in the prelude for an implementation reason. To use the <code>Codec</code> functor to extend LMLML with a new codec, you have to load "LMLML/extension.sml" as follows.
<pre>
# use "LMLML/extension.sml";
</pre>
</p>

<h4>SML/NJ</h4>
<p>
Use sources.cm with the SML/NJ Compilation Manager (CM).
</p>

<h4>MLton</h4>
<p>
Use sources.mlb with the ML Basis system of MLton.
</p>

<hr>
<h2>Example</h2>
<p>
As an example of programming with LMLML, we search for the character '&#x5263;' in the string "&#x767D;&#x8840;&#x75C5;abc&#x5263;&#x9053;".
</p>
<p>
In Shift_JIS, "&#x767D;&#x8840;&#x75C5;abc&#x5263;&#x9053;" is encoded into the following byte vector:
<pre>
 0wx94, 0wx92, 0wx8C, 0wx8C, 0wx95, 0wx61, (* &#x767D;&#x8840;&#x75C5; *)
 0wx61, 0wx62, 0wx63, (* abc *)
 0wx8C, 0wx95, 0wx93, 0wxB9 (* &#x5263;&#x9053; *)
</pre>
</p>
<p>
The pair formed by the second byte of '&#x8840;' and the first byte of '&#x75C5;' is ( 0wx8C, 0wx95 ), which equals the encoding of '&#x5263;'.
Therefore, if we search this byte vector for '&#x5263;' (0wx8C, 0wx95), we incorrectly find the byte sequence that spans the second and third characters of "&#x767D;&#x8840;&#x75C5;".
</p>
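<p>
The false match can be reproduced with the Basis Library alone. The following is a minimal sketch under stated assumptions: <code>findBytes</code> is a hypothetical helper written here for illustration (it is not part of LMLML), and it performs a plain byte-by-byte search that knows nothing about character boundaries.
</p>
<pre>
  (* Shift_JIS bytes of the example string. *)
  val bytes =
      Word8Vector.fromList
      [
        0wx94, 0wx92, 0wx8C, 0wx8C, 0wx95, 0wx61,
        0wx61, 0wx62, 0wx63,
        0wx8C, 0wx95, 0wx93, 0wxB9
      ];
  (* Shift_JIS bytes of the character being searched for. *)
  val kenBytes = Word8Vector.fromList [0wx8C, 0wx95];

  (* Hypothetical helper: first byte offset at which pattern occurs
     in target, ignoring character boundaries. *)
  fun findBytes pattern target =
      let
        val m = Word8Vector.length pattern
        val n = Word8Vector.length target
        fun matchesAt i =
            let
              fun loop j =
                  j = m
                  orelse (Word8Vector.sub (target, i + j)
                          = Word8Vector.sub (pattern, j)
                          andalso loop (j + 1))
            in
              loop 0
            end
        fun scan i =
            if i + m > n then NONE
            else if matchesAt i then SOME i
            else scan (i + 1)
      in
        scan 0
      end;

  val naiveHit = findBytes kenBytes bytes;
</pre>
<p>
Here <code>naiveHit</code> is <code>SOME 3</code>, the offset of the second byte of '&#x8840;', whereas the real '&#x5263;' starts at byte offset 9.
</p>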
<p>
With LMLML, we can obtain the correct result.
</p>
<p>
First, we decode a Shift_JIS string from the byte vector.
</p>
<pre>
 (* "&#x767D;&#x8840;&#x75C5;abc&#x5263;&#x9053;" *)
 val bytes =
     Word8Vector.fromList
     [
       0wx94, 0wx92, 0wx8C, 0wx8C, 0wx95, 0wx61, (* &#x767D;&#x8840;&#x75C5; *)
       0wx61, 0wx62, 0wx63, (* abc *)
       0wx8C, 0wx95, 0wx93, 0wxB9 (* &#x5263;&#x9053; *)
     ];
 val string = MultiByteString.String.decodeBytes "Shift_JIS" bytes;

 (* "&#x5263;" *)
 val KenBytes = Word8Vector.fromList [0wx8C, 0wx95]; (* &#x5263; *)
 val KenString = MultiByteString.String.decodeBytes "Shift_JIS" KenBytes;
</pre>
<p>
Then, we search for "&#x5263;" in "&#x767D;&#x8840;&#x75C5;abc&#x5263;&#x9053;".
</p>
<pre>
 val (leftSS, rightSS) =
     MBSubstring.position KenString (MBSubstring.full string);
</pre>
<p>
The resulting leftSS consists of the first 6 characters, "&#x767D;&#x8840;&#x75C5;abc", which shows that the 7th character, "&#x5263;", was found correctly.
</p>
<pre>
 # MBSubstring.size leftSS;
 val it = 6 : int
</pre>
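<p>
For intuition about why a character-level search avoids the false match, the following sketch splits the bytes into characters by hand, using only the Basis Library. It is not how LMLML is implemented: <code>isLead</code>, <code>splitSJIS</code> and <code>indexOf</code> are hypothetical helpers, and the lead-byte ranges 0x81 to 0x9F and 0xE0 to 0xFC are an assumption made for this illustration.
</p>
<pre>
  val sjisBytes : Word8.word list =
      [
        0wx94, 0wx92, 0wx8C, 0wx8C, 0wx95, 0wx61,
        0wx61, 0wx62, 0wx63,
        0wx8C, 0wx95, 0wx93, 0wxB9
      ];

  (* Assumed lead bytes of two-byte characters. *)
  fun isLead (b : Word8.word) =
      (Word8.>= (b, 0wx81) andalso Word8.>= (0wx9F, b))
      orelse (Word8.>= (b, 0wxE0) andalso Word8.>= (0wxFC, b));

  (* Group the bytes into characters: a lead byte takes the next
     byte with it, any other byte stands alone. A lone trailing
     byte is kept as a single-byte character. *)
  fun splitSJIS (b1 :: b2 :: rest) =
      if isLead b1
      then [b1, b2] :: splitSJIS rest
      else [b1] :: splitSJIS (b2 :: rest)
    | splitSJIS bs = List.map (fn b => [b]) bs;

  (* Index of the first character equal to c, counted in characters. *)
  fun indexOf (c : Word8.word list) cs =
      let
        fun go (_, []) = NONE
          | go (i, x :: xs) = if x = c then SOME i else go (i + 1, xs)
      in
        go (0, cs)
      end;

  val sjisChars = splitSJIS sjisBytes;
  val kenIndex = indexOf [0wx8C, 0wx95] sjisChars;
</pre>
<p>
Here <code>kenIndex</code> is <code>SOME 6</code>, which agrees with the size of leftSS above, because the bytes ( 0wx8C, 0wx95 ) that straddle two characters are never compared as a unit.
</p>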
<p>
Because the codec is fixed to Shift_JIS in this example, you can also write it using the Shift_JIS-specific module as follows.
</p>
<pre>
 val ShiftJISString = ShiftJISCodec.String.fromBytes bytes;
 val ShiftJISKenString = ShiftJISCodec.String.fromBytes KenBytes;
 structure ShiftJISSubstring =
           SubstringBase
               (struct
                  open ShiftJISCodec.String
                  val compareChar = ShiftJISCodec.Char.compare 
                end);
 val (leftSS, rightSS) =
     ShiftJISSubstring.position
         ShiftJISKenString (ShiftJISSubstring.full ShiftJISString);
 val len = ShiftJISSubstring.size leftSS;
</pre>

</body>
</html>