lightning-sunbird  0.9+nobinonly
nsUTF8ToUnicode Class Reference

A character set converter from UTF8 to Unicode.

#include <nsUTF8ToUnicode.h>

Public Types

enum  { kOnError_Recover, kOnError_Signal }

Public Member Functions

 nsUTF8ToUnicode ()
 Class constructor.

Protected Member Functions

NS_IMETHOD GetMaxLength (const char *aSrc, PRInt32 aSrcLength, PRInt32 *aDestLength)
 Normally the maximum length of the output of the UTF8 decoder in UTF16 code units is the same as the length of the input in UTF8 code units, since 1-byte, 2-byte and 3-byte UTF-8 sequences decode to a single UTF-16 character, and 4-byte UTF-8 sequences decode to a surrogate pair.
NS_IMETHOD Convert (const char *aSrc, PRInt32 *aSrcLength, PRUnichar *aDest, PRInt32 *aDestLength)
 Converts the data from one Charset to Unicode.
NS_IMETHOD Reset ()
 Resets the charset converter so it may be recycled for a completely different and unrelated buffer of data.

Protected Attributes

PRUint32 mUcs4
PRUint8 mState
PRUint8 mBytes
PRPackedBool mFirst

Detailed Description

A character set converter from UTF8 to Unicode.

18/Mar/1998 04/Feb/2000

Author:
Catalin Rotaru [CATA]

Definition at line 63 of file nsUTF8ToUnicode.h.


Member Enumeration Documentation

anonymous enum [inherited]
Enumerator:
kOnError_Recover 
kOnError_Signal 

Definition at line 98 of file nsIUnicodeDecoder.h.

       {
    kOnError_Recover,       // on an error, recover and continue
    kOnError_Signal         // on an error, stop and signal
  };

Constructor & Destructor Documentation

nsUTF8ToUnicode::nsUTF8ToUnicode ( )

Class constructor.

Definition at line 70 of file nsUTF8ToUnicode.cpp.


Member Function Documentation

NS_IMETHODIMP nsUTF8ToUnicode::Convert ( const char *  aSrc,
PRInt32 *  aSrcLength,
PRUnichar *  aDest,
PRInt32 *  aDestLength 
) [protected, virtual]

Converts the data from one Charset to Unicode.

About the byte ordering:

  • For input, if the converter cares (this depends on the charset; a single-byte charset, for example, ignores byte ordering), it should assume network order. If necessary and requested, a SetInputByteOrder() method could be added so that the reverse order can be used as well; its default would be the assumed network order.
  • The output stream is Unicode, in the native byte order of the machine on which the converter is running.

Unless there is not enough output space, this method must consume all of the available input data. Any incomplete character data at the end of the input is stored internally in the converter and used when the method is called again to continue the conversion. This way, the caller does not have to manage incomplete input data by merging it with the next buffer.

Error conditions: If the value read does not belong to this character set, it should be replaced with the Unicode replacement character 0xFFFD. When an actual input error such as a format error is encountered, the converter stops and returns an error. However, we should keep in mind that we need to be lax in decoding.

Converter required behavior, in this order: when the output space is full, return right away. When the input data is wrong, return with the input pointer right after the wrong byte. When the input is partial, it is consumed and cached. At all times, the in/out length parameters show how much input was actually consumed and how much output was actually written.

Parameters:
aSrc         [IN] the source data buffer
aSrcLength   [IN/OUT] the length of the source data buffer; after conversion, contains the number of bytes read
aDest        [OUT] the destination data buffer
aDestLength  [IN/OUT] the length of the destination data buffer; after conversion, contains the number of Unicode characters written
Returns:
NS_PARTIAL_MORE_INPUT if only a partial conversion was done; more input is needed to continue
NS_PARTIAL_MORE_OUTPUT if only a partial conversion was done; more output space is needed to continue
NS_ERROR_ILLEGAL_INPUT if an illegal input sequence was encountered and the behavior was set to "signal"

Implements nsIUnicodeDecoder.
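
The calling contract described above (in/out length parameters, partial input cached between calls) can be sketched with a small stand-alone decoder. This is an illustration only, not the Mozilla class: MiniUtf8Decoder, Status and decodeInChunks are hypothetical names, and the sketch deliberately leaves out the shortest-form, surrogate, range and BOM handling of the real converter. It also reserves two output slots per step so a surrogate pair always fits, which is a choice made for this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in, not the Mozilla class: it follows the same in/out
// length contract but omits shortest-form, surrogate, range and BOM checks.
enum class Status { Ok, MoreInput, MoreOutput, Error };

struct MiniUtf8Decoder {
  std::uint32_t ucs4 = 0;  // partially assembled code point
  int pending = 0;         // continuation bytes still expected

  Status convert(const char* src, int* srcLen, char16_t* dst, int* dstLen) {
    const char* in = src;
    const char* inEnd = src + *srcLen;
    char16_t* out = dst;
    char16_t* outEnd = dst + *dstLen;
    Status res = Status::Ok;
    // Continue only while two output slots are free, so a surrogate pair fits.
    for (; in < inEnd && outEnd - out >= 2; ++in) {
      const unsigned char c = static_cast<unsigned char>(*in);
      if (pending == 0) {
        if (c < 0x80) { *out++ = c; }
        else if ((c & 0xE0) == 0xC0) { ucs4 = (c & 0x1Fu) << 6;  pending = 1; }
        else if ((c & 0xF0) == 0xE0) { ucs4 = (c & 0x0Fu) << 12; pending = 2; }
        else if ((c & 0xF8) == 0xF0) { ucs4 = (c & 0x07u) << 18; pending = 3; }
        else { res = Status::Error; break; }
      } else if ((c & 0xC0) == 0x80) {
        ucs4 |= (c & 0x3Fu) << ((pending - 1) * 6);
        if (--pending == 0) {
          if (ucs4 > 0xFFFF) {
            const std::uint32_t v = ucs4 - 0x10000;
            *out++ = static_cast<char16_t>(0xD800 | ((v >> 10) & 0x3FF));
            *out++ = static_cast<char16_t>(0xDC00 | (v & 0x3FF));
          } else {
            *out++ = static_cast<char16_t>(ucs4);
          }
        }
      } else { res = Status::Error; break; }
    }
    if (res == Status::Ok && in < inEnd) res = Status::MoreOutput;
    else if (res == Status::Ok && pending != 0) res = Status::MoreInput;
    *srcLen = static_cast<int>(in - src);   // bytes consumed
    *dstLen = static_cast<int>(out - dst);  // code units written
    return res;
  }
};

// Feeds `bytes` in chunks of `chunkSize`; sequences split across chunk
// boundaries are completed from the decoder's cached state.
std::u16string decodeInChunks(const std::string& bytes, std::size_t chunkSize) {
  if (chunkSize == 0) chunkSize = 1;
  MiniUtf8Decoder dec;
  std::u16string result;
  char16_t buf[8];
  std::size_t pos = 0;
  while (pos < bytes.size()) {
    int srcLen = static_cast<int>(std::min(chunkSize, bytes.size() - pos));
    int dstLen = 8;
    const Status s = dec.convert(bytes.data() + pos, &srcLen, buf, &dstLen);
    result.append(buf, static_cast<std::size_t>(dstLen));
    pos += static_cast<std::size_t>(srcLen);
    if (s == Status::Error) break;  // this sketch simply stops on bad input
  }
  return result;
}
```

The caller never merges buffers itself: it advances by the consumed count written back through srcLen and appends the dstLen code units that were produced.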

Definition at line 123 of file nsUTF8ToUnicode.cpp.

{
  PRUint32 aSrcLen   = (PRUint32) (*aSrcLength);
  PRUint32 aDestLen = (PRUint32) (*aDestLength);

  const char *in, *inend;
  inend = aSrc + aSrcLen;

  PRUnichar *out, *outend;
  outend = aDest + aDestLen;

  nsresult res = NS_OK; // conversion result

  // Set mFirst to PR_FALSE now so we don't have to every time through the ASCII
  // branch within the loop.
  if (mFirst && aSrcLen && (0 == (0x80 & (*aSrc))))
    mFirst = PR_FALSE;

  for (in = aSrc, out = aDest; ((in < inend) && (out < outend)); ++in) {
    if (0 == mState) {
      // When mState is zero we expect either a US-ASCII character or a
      // multi-octet sequence.
      if (0 == (0x80 & (*in))) {
        // US-ASCII, pass straight through.
        *out++ = (PRUnichar)*in;
        mBytes = 1;
      } else if (0xC0 == (0xE0 & (*in))) {
        // First octet of 2 octet sequence
        mUcs4 = (PRUint32)(*in);
        mUcs4 = (mUcs4 & 0x1F) << 6;
        mState = 1;
        mBytes = 2;
      } else if (0xE0 == (0xF0 & (*in))) {
        // First octet of 3 octet sequence
        mUcs4 = (PRUint32)(*in);
        mUcs4 = (mUcs4 & 0x0F) << 12;
        mState = 2;
        mBytes = 3;
      } else if (0xF0 == (0xF8 & (*in))) {
        // First octet of 4 octet sequence
        mUcs4 = (PRUint32)(*in);
        mUcs4 = (mUcs4 & 0x07) << 18;
        mState = 3;
        mBytes = 4;
      } else if (0xF8 == (0xFC & (*in))) {
        /* First octet of 5 octet sequence.
         *
         * This is illegal because the encoded codepoint must be either
         * (a) not the shortest form or
         * (b) outside the Unicode range of 0-0x10FFFF.
         * Rather than trying to resynchronize, we will carry on until the end
         * of the sequence and let the later error handling code catch it.
         */
        mUcs4 = (PRUint32)(*in);
        mUcs4 = (mUcs4 & 0x03) << 24;
        mState = 4;
        mBytes = 5;
      } else if (0xFC == (0xFE & (*in))) {
        // First octet of 6 octet sequence, see comments for 5 octet sequence.
        mUcs4 = (PRUint32)(*in);
        mUcs4 = (mUcs4 & 1) << 30;
        mState = 5;
        mBytes = 6;
      } else {
        /* Current octet is neither in the US-ASCII range nor a legal first
         * octet of a multi-octet sequence.
         *
         * Return an error condition. Caller is responsible for flushing and
         * refilling the buffer and resetting state.
         */
        res = NS_ERROR_UNEXPECTED;
        break;
      }
    } else {
      // When mState is non-zero, we expect a continuation of the multi-octet
      // sequence
      if (0x80 == (0xC0 & (*in))) {
        // Legal continuation.
        PRUint32 shift = (mState - 1) * 6;
        PRUint32 tmp = *in;
        tmp = (tmp & 0x0000003FL) << shift;
        mUcs4 |= tmp;

        if (0 == --mState) {
          /* End of the multi-octet sequence. mUcs4 now contains the final
           * Unicode codepoint to be output
           *
           * Check for illegal sequences and codepoints.
           */

          // From Unicode 3.1, non-shortest form is illegal
          if (((2 == mBytes) && (mUcs4 < 0x0080)) ||
              ((3 == mBytes) && (mUcs4 < 0x0800)) ||
              ((4 == mBytes) && (mUcs4 < 0x10000)) ||
              (4 < mBytes) ||
              // From Unicode 3.2, surrogate characters are illegal
              ((mUcs4 & 0xFFFFF800) == 0xD800) ||
              // Codepoints outside the Unicode range are illegal
              (mUcs4 > 0x10FFFF)) {
            res = NS_ERROR_UNEXPECTED;
            break;
          }
          if (mUcs4 > 0xFFFF) {
            // mUcs4 is in the range 0x10000 - 0x10FFFF. Output a UTF-16 pair
            mUcs4 -= 0x00010000;
            *out++ = 0xD800 | (0x000003FF & (mUcs4 >> 10));
            *out++ = 0xDC00 | (0x000003FF & mUcs4);
          } else if (UNICODE_BYTE_ORDER_MARK != mUcs4 || !mFirst) {
            // Skip the BOM only if it is the first character; otherwise output it
            *out++ = mUcs4;
          }
          //initialize UTF8 cache
          mUcs4  = 0;
          mState = 0;
          mBytes = 1;
          mFirst = PR_FALSE;
        }
      } else {
        /* ((0xC0 & (*in) != 0x80) && (mState != 0))
         * 
         * Incomplete multi-octet sequence. Unconsume this
         * octet and return an error condition. Caller is responsible
         * for flushing and refilling the buffer and resetting state.
         */
        in--;
        res = NS_ERROR_UNEXPECTED;
        break;
      }
    }
  }

  // output not finished, output buffer too short
  if ((NS_OK == res) && (in < inend) && (out >= outend))
    res = NS_OK_UDEC_MOREOUTPUT;

  // last UCS4 is incomplete, make sure the caller
  // returns with properly aligned continuation of the buffer
  if ((NS_OK == res) && (mState != 0))
    res = NS_OK_UDEC_MOREINPUT;

  *aSrcLength = in - aSrc;
  *aDestLength = out - aDest;

  return(res);
}
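
The legality checks applied at the end of a multi-octet sequence in the listing above can be isolated into a small predicate. The following is a sketch that restates those conditions; isLegalDecodedValue is a hypothetical name introduced here, not a function of the converter.

```cpp
#include <cstdint>

// Mirrors the checks made once the last continuation byte has arrived.
// `seqLen` is the number of bytes in the sequence, `cp` the assembled value.
bool isLegalDecodedValue(int seqLen, std::uint32_t cp) {
  if (seqLen == 2 && cp < 0x80) return false;       // non-shortest 2-byte form
  if (seqLen == 3 && cp < 0x800) return false;      // non-shortest 3-byte form
  if (seqLen == 4 && cp < 0x10000) return false;    // non-shortest 4-byte form
  if (seqLen > 4) return false;                     // 5- and 6-byte forms
  if ((cp & 0xFFFFF800u) == 0xD800u) return false;  // surrogates U+D800..U+DFFF
  if (cp > 0x10FFFFu) return false;                 // outside the Unicode range
  return true;
}
```

For example, the bytes C0 80 assemble to 0 from a 2-byte sequence and are rejected as non-shortest form, and ED A0 80 assembles to 0xD800 and is rejected as a surrogate.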

NS_IMETHODIMP nsUTF8ToUnicode::GetMaxLength ( const char *  aSrc,
PRInt32  aSrcLength,
PRInt32 *  aDestLength 
) [protected, virtual]

Normally the maximum length of the output of the UTF8 decoder in UTF16 code units is the same as the length of the input in UTF8 code units, since 1-byte, 2-byte and 3-byte UTF-8 sequences decode to a single UTF-16 character, and 4-byte UTF-8 sequences decode to a surrogate pair.

However, there is an edge case where the output can be longer than the input: if the previous buffer ended with an incomplete multi-byte sequence and this buffer does not begin with a valid continuation byte, we will return NS_ERROR_UNEXPECTED and the caller may insert a replacement character in the output buffer which corresponds to no character in the input buffer. So in the worst case the destination will need to be one code unit longer than the source. See bug 301797.

Implements nsIUnicodeDecoder.

Definition at line 94 of file nsUTF8ToUnicode.cpp.

{
  *aDestLength = aSrcLength + 1;
  return NS_OK;
}
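
The edge case behind the extra code unit can be modelled from the caller's side. The sketch below is a simplified illustration under stated assumptions: maxOutputUnits and decodeAsciiBuffer are hypothetical helpers, the buffer is treated as ASCII only, and the replacement character is inserted by the caller as the description above suggests it may be.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper mirroring the bound: input length plus one code unit.
std::size_t maxOutputUnits(std::size_t srcLength) { return srcLength + 1; }

// Hypothetical caller-side model of the edge case, restricted to ASCII input
// for brevity (non-ASCII bytes are ignored by this sketch).
// `previousIncomplete` means the previous buffer ended inside a multi-byte
// sequence; if this buffer does not start with a continuation byte
// (10xxxxxx), the caller emits U+FFFD for the abandoned sequence.
std::u16string decodeAsciiBuffer(const std::string& src, bool previousIncomplete) {
  std::u16string out;
  out.reserve(maxOutputUnits(src.size()));
  const bool startsWithContinuation =
      !src.empty() && (static_cast<unsigned char>(src[0]) & 0xC0) == 0x80;
  if (previousIncomplete && !src.empty() && !startsWithContinuation) {
    out.push_back(u'\uFFFD');
  }
  for (unsigned char c : src) {
    if (c < 0x80) out.push_back(static_cast<char16_t>(c));
  }
  return out;
}
```

With a one-byte buffer "A" following an incomplete sequence, the output holds two code units (U+FFFD and 'A'), which is exactly the input length plus one.
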
NS_IMETHODIMP nsUTF8ToUnicode::Reset ( ) [protected, virtual]

Resets the charset converter so it may be recycled for a completely different and unrelated buffer of data.

Implements nsIUnicodeDecoder.

Definition at line 106 of file nsUTF8ToUnicode.cpp.

{

  mUcs4  = 0;     // cached Unicode character
  mState = 0;     // cached expected number of octets after the current octet
                  // until the beginning of the next UTF8 character sequence
  mBytes = 1;     // cached expected number of octets in the current sequence
  mFirst = PR_TRUE;

  return NS_OK;

}
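
Resetting mFirst to PR_TRUE matters for byte order mark handling in Convert: a BOM is dropped only when it is the first character decoded. A minimal model of that flag follows; BomFilter is a hypothetical type, and the sketch assumes UNICODE_BYTE_ORDER_MARK is U+FEFF.

```cpp
// Hypothetical model of the mFirst flag: a BOM is dropped only when it is the
// very first character decoded after construction or reset().
struct BomFilter {
  bool first = true;

  // Returns true if `c` should be written to the output.
  bool keep(char16_t c) {
    const bool drop = first && c == 0xFEFF;  // assumes the BOM is U+FEFF
    first = false;
    return !drop;
  }

  void reset() { first = true; }
};
```

After reset(), the next leading BOM is dropped again, which is why a recycled converter behaves like a fresh one on an unrelated buffer.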


Member Data Documentation

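PRUint8 nsUTF8ToUnicode::mBytes [protected]

Definition at line 76 of file nsUTF8ToUnicode.h.

PRPackedBool nsUTF8ToUnicode::mFirst [protected]

Definition at line 77 of file nsUTF8ToUnicode.h.

PRUint8 nsUTF8ToUnicode::mState [protected]

Definition at line 75 of file nsUTF8ToUnicode.h.

PRUint32 nsUTF8ToUnicode::mUcs4 [protected]

Definition at line 74 of file nsUTF8ToUnicode.h.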


The documentation for this class was generated from the following files: