
nux  3.0.0
nux::NUTF8 Class Reference

Convert UTF-16 to UTF-8.

#include <NUTF.h>


Public Member Functions

 NUTF8 (const UNICHAR *Source)
 NUTF8 (const std::wstring &Source)
 ~NUTF8 ()
 operator const char * ()

Private Member Functions

void Convert (const UNICHAR *)

Private Attributes

char * utf8

Detailed Description

Convert UTF-16 to UTF-8.

Definition at line 60 of file NUTF.h.


Constructor & Destructor Documentation

nux::NUTF8::NUTF8 ( const UNICHAR *  Source) [explicit]

Definition at line 29 of file NUTF.cpp.

  {
    Convert (Source);
  }


nux::NUTF8::NUTF8 ( const std::wstring &  Source) [explicit]

Definition at line 34 of file NUTF.cpp.

  {
    Convert (NUX_REINTERPRET_CAST (UNICHAR *, NUX_CONST_CAST (wchar_t *, Source.c_str() ) ) );
  }

nux::NUTF8::~NUTF8 ( )

Definition at line 209 of file NUTF.cpp.

  {
    delete [] utf8;
  }

Member Function Documentation

void nux::NUTF8::Convert ( const UNICHAR *  Source) [private]

Definition at line 39 of file NUTF.cpp.

  {
    int NumBytes = 0;
    // *6 each UTF16 char can translate to up to 6 bytes in UTF8
    // +1 for NULL char
    size_t Size = wcslen ( (wchar_t *) Source) * 6 + 1;
    utf8 = new char[Size];
    memset (utf8, 0, Size);

    unsigned char TwoBytes[2];
    TwoBytes[0] = '\0';
    TwoBytes[1] = '\0';

    utf8[0] = '\0';

    //     U-00000000 - U-0000007F:       0xxxxxxx
    //     U-00000080 - U-000007FF:       110xxxxx 10xxxxxx
    //     U-00000800 - U-0000FFFF:       1110xxxx 10xxxxxx 10xxxxxx
    //     U-00010000 - U-001FFFFF:       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    //     U-00200000 - U-03FFFFFF:       111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    //     U-04000000 - U-7FFFFFFF:       1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    // The original specification of UTF-8 allowed for sequences of up to six bytes covering numbers up to 31 bits
    // (the original limit of the universal character set). However, UTF-8 was restricted by RFC 3629 to use only
    // the area covered by the formal Unicode definition, U+0000 to U+10FFFF, in November 2003. So UTF-8 code point is at most 4 bytes.

    for (size_t n = 0; Source[n] != 0; n++)
    {
      if (Source[n] <= 0x7F)
      {
        TwoBytes[0] = (char) Source[n];
        STRCAT_S (utf8, Size, (const char *) &TwoBytes[0]);
      }
      else
      {
        // 11 valid bits 2 bytes
        if (Source[n] <= 0x7FF)
        {
          // Extract the 5 highest bits
          TwoBytes[0] = (char) (0xC0 + (Source[n] >> 6) );
          NumBytes = 2;
        }
        // 16 valid bits 3 bytes
        else if (Source[n] <= 0xFFFF)
        {
          // Extract the highest 4 bits
          TwoBytes[0] = (char) (0xE0 + (Source[n] >> 12) );
          NumBytes = 3;
        }
        // Unichar is only 16 bits. Do not continue because (Source[n] >> 18) does not make sense.
        // 21 valid bits 4 bytes
        else if (Source[n] <= 0x1FFFFF)
        {
          // Extract the highest 3 bits
          TwoBytes[0] = (char) (0xF0 + (Source[n] >> 18) );
          NumBytes = 4;
        }
        // Split a 26 bit character into 5 bytes
        else if (Source[n] <= 0x3FFFFFF)
        {
          // Extract the highest 2 bits
          TwoBytes[0] = (char) (0xF8 + (Source[n] >> 24) );
          NumBytes = 5;
        }
        // Split a 31 bit character into 6 bytes
        else if (Source[n] <= 0x7FFFFFFF)
        {
          // Extract the highest bit
          TwoBytes[0] = (char) (0xFC + (Source[n] >> 30) );
          NumBytes = 6;
        }

        STRCAT_S (utf8, Size, (const char *) &TwoBytes[0]);

        // Extract the remaining bits - 6 bits at a time
        for (int i = 1, shift = (NumBytes - 2) * 6; shift >= 0; i++, shift -= 6)
        {
          TwoBytes[0] = (char) (0x80 + ( (Source[n] >> shift) & 0x3F) );
          STRCAT_S (utf8, Size, (const char *) &TwoBytes[0]);
        }
      }
    }
  }
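
The byte layouts in the comment table above can be illustrated with a small standalone sketch. It is not the nux implementation: `EncodeUtf8` and `Utf16ToUtf8` are hypothetical helpers that take a `std::u16string` rather than `UNICHAR *`, emit only the 1 to 4 byte forms permitted by RFC 3629, and combine surrogate pairs before encoding. The listing above appears to encode each 16-bit unit on its own, so the surrogate handling and the U+FFFD substitution are assumptions of this sketch, not documented behavior.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of nux): encode one Unicode code point as
// UTF-8, limited to the 1-4 byte forms allowed by RFC 3629.
std::string EncodeUtf8(char32_t cp)
{
  // Surrogates and values above U+10FFFF are not valid scalar values;
  // this sketch substitutes U+FFFD (REPLACEMENT CHARACTER) for them.
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    cp = 0xFFFD;

  std::string out;
  if (cp <= 0x7F)
  {
    out += static_cast<char>(cp);                        // 0xxxxxxx
  }
  else if (cp <= 0x7FF)
  {
    out += static_cast<char>(0xC0 | (cp >> 6));          // 110xxxxx
    out += static_cast<char>(0x80 | (cp & 0x3F));        // 10xxxxxx
  }
  else if (cp <= 0xFFFF)
  {
    out += static_cast<char>(0xE0 | (cp >> 12));         // 1110xxxx
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F)); // 10xxxxxx
    out += static_cast<char>(0x80 | (cp & 0x3F));        // 10xxxxxx
  }
  else
  {
    out += static_cast<char>(0xF0 | (cp >> 18));          // 11110xxx
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F)); // 10xxxxxx
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));  // 10xxxxxx
    out += static_cast<char>(0x80 | (cp & 0x3F));         // 10xxxxxx
  }
  return out;
}

// Hypothetical helper: convert a UTF-16 string to UTF-8, combining each
// high/low surrogate pair into a single code point before encoding.
std::string Utf16ToUtf8(const std::u16string &src)
{
  std::string out;
  for (std::size_t i = 0; i < src.size(); ++i)
  {
    char32_t cp = src[i];
    const bool isHigh = (cp >= 0xD800 && cp <= 0xDBFF);
    if (isHigh && i + 1 < src.size() &&
        src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF)
    {
      cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i + 1] - 0xDC00);
      ++i; // the low surrogate has been consumed
    }
    out += EncodeUtf8(cp); // an unpaired surrogate becomes U+FFFD
  }
  return out;
}
```

Returning a `std::string` lets the output grow as needed, so this sketch does not need the up-front `* 6 + 1` size estimate or the repeated `STRCAT_S` calls used in the listing.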


nux::NUTF8::operator const char * ( )

Definition at line 214 of file NUTF.cpp.

  {
    return utf8;
  }
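
Because the destructor releases `utf8` with `delete []`, the pointer returned by the conversion operator should be treated as valid only while the converter object is alive. The sketch below illustrates that lifetime rule with a hypothetical stand-in class, `Utf8Buffer`, which is not the nux class; `KeepText` is likewise an invented helper. Disabling copies is a choice made for this sketch: the member list on this page shows no user-declared copy operations, so whether `NUTF8` guards against copying is not confirmed here.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for illustration only; it is not the nux class.
class Utf8Buffer
{
public:
  explicit Utf8Buffer(const char *text)
  {
    const std::size_t size = std::strlen(text) + 1; // +1 for the terminator
    utf8 = new char[size];
    std::memcpy(utf8, text, size);
  }

  ~Utf8Buffer() { delete[] utf8; }

  // The destructor frees the buffer, so copying the raw pointer would lead
  // to a double delete; copying is disabled in this sketch.
  Utf8Buffer(const Utf8Buffer &) = delete;
  Utf8Buffer &operator=(const Utf8Buffer &) = delete;

  // The returned pointer is valid only while this object is alive.
  operator const char *() const { return utf8; }

private:
  char *utf8;
};

// Copy the bytes into a std::string when the text must outlive the buffer.
std::string KeepText(const char *text)
{
  Utf8Buffer buffer(text);
  return std::string(static_cast<const char *>(buffer));
}
```

Storing the raw `const char *` obtained from a temporary converter and using it after the full expression ends would read freed memory, which is why the sketch copies into a `std::string` first.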

Member Data Documentation

char* nux::NUTF8::utf8 [private]

Definition at line 73 of file NUTF.h.


The documentation for this class was generated from the following files: