python3.2  3.2.2
email.charset.Charset Class Reference

Public Member Functions

def __init__
def __str__
def __eq__
def __ne__
def get_body_encoding
def get_output_charset
def header_encode
def header_encode_lines
def body_encode

Public Attributes

 input_charset
 header_encoding
 body_encoding
 output_charset
 input_codec
 output_codec

Private Member Functions

def _get_encoder

Static Private Attributes

 __repr__ = __str__

Detailed Description

Map character sets to their email properties.

This class provides information about the requirements imposed on email
for a specific character set.  It also provides convenience routines for
converting between character sets, given the availability of the
applicable codecs.  Given a character set, it will do its best to provide
information on how to use that character set in an email in an
RFC-compliant way.

Certain character sets must be encoded with quoted-printable or base64
when used in email headers or bodies.  Certain character sets must be
converted outright, and are not allowed in email.  Instances of this
class expose the following information about a character set:

input_charset: The initial character set specified.  Common aliases
               are converted to their `official' email names (e.g. latin_1
               is converted to iso-8859-1).  Defaults to 7-bit us-ascii.

header_encoding: If the character set must be encoded before it can be
                 used in an email header, this attribute will be set to
                 Charset.QP (for quoted-printable), Charset.BASE64 (for
                 base64 encoding), or Charset.SHORTEST for the shortest of
                 QP or BASE64 encoding.  Otherwise, it will be None.

body_encoding: Same as header_encoding, but describes the encoding for the
               mail message's body, which indeed may be different than the
               header encoding.  Charset.SHORTEST is not allowed for
               body_encoding.

output_charset: Some character sets must be converted before they can be
                used in email headers or bodies.  If the input_charset is
                one of them, this attribute will contain the name of the
                charset output will be converted to.  Otherwise, it will
                be the same as input_charset.

input_codec: The name of the Python codec used to convert the
             input_charset to Unicode.  If no conversion codec is
             necessary, this attribute will be None.

output_codec: The name of the Python codec used to convert Unicode
              to the output_charset.  If no conversion codec is necessary,
              this attribute will have the same value as the input_codec.
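
As a rough illustration of the attributes listed above, the sketch below builds a few Charset instances and collects what they report. It assumes the standard-library email.charset module, where QP, BASE64 and SHORTEST are module-level constants; the describe() helper is hypothetical and exists only for this example.

```python
from email.charset import Charset, QP, BASE64, SHORTEST

# Hypothetical helper for this example only: gather the public attributes.
def describe(name):
    cs = Charset(name)
    return {
        'input_charset': cs.input_charset,
        'header_encoding': cs.header_encoding,
        'body_encoding': cs.body_encoding,
        'output_charset': cs.output_charset,
        'input_codec': cs.input_codec,
        'output_codec': cs.output_codec,
    }

latin = describe('latin_1')     # alias, normalized to iso-8859-1
eucjp = describe('euc-jp')      # converted to another charset on output
plain = describe('us-ascii')    # the default charset, no encoding needed

for info in (latin, eucjp, plain):
    print(info)
```

Comparing the encoding attributes against the imported constants (for example `latin['header_encoding'] == QP`) is more readable than comparing against their integer values.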

Definition at line 167 of file charset.py.


Constructor & Destructor Documentation

def email.charset.Charset.__init__ (   self,
  input_charset = DEFAULT_CHARSET 
)

Definition at line 211 of file charset.py.

    def __init__(self, input_charset=DEFAULT_CHARSET):
        # RFC 2046, $4.1.2 says charsets are not case sensitive.  We coerce to
        # unicode because its .lower() is locale insensitive.  If the argument
        # is already a unicode, we leave it at that, but ensure that the
        # charset is ASCII, as the standard (RFC XXX) requires.
        try:
            if isinstance(input_charset, str):
                input_charset.encode('ascii')
            else:
                input_charset = str(input_charset, 'ascii')
        except UnicodeError:
            raise errors.CharsetError(input_charset)
        input_charset = input_charset.lower()
        # Set the input charset after filtering through the aliases
        self.input_charset = ALIASES.get(input_charset, input_charset)
        # We can try to guess which encoding and conversion to use by the
        # charset_map dictionary.  Try that first, but let the user override
        # it.
        henc, benc, conv = CHARSETS.get(self.input_charset,
                                        (SHORTEST, BASE64, None))
        if not conv:
            conv = self.input_charset
        # Set the attributes, allowing the arguments to override the default.
        self.header_encoding = henc
        self.body_encoding = benc
        self.output_charset = ALIASES.get(conv, conv)
        # Now set the codecs.  If one isn't defined for input_charset,
        # guess and try a Unicode codec with the same name as input_codec.
        self.input_codec = CODEC_MAP.get(self.input_charset,
                                         self.input_charset)
        self.output_codec = CODEC_MAP.get(self.output_charset,
                                          self.output_charset)
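
The constructor's normalization steps can be observed from the outside. The following is a small sketch, assuming the standard-library email.charset and email.errors modules; the charset name 'x-made-up' is invented for the example and is not a registered charset.

```python
from email.charset import Charset, SHORTEST, BASE64
from email.errors import CharsetError

# Names are lower-cased and then looked up in the alias table.
mixed_case = Charset('Latin-1')

# Bytes are decoded as ASCII before the same normalization is applied.
from_bytes = Charset(b'UTF-8')

# A name with no entry in the charset table gets the fallback encodings.
unknown = Charset('x-made-up')

# A name containing non-ASCII characters is rejected.
try:
    Charset('utf-8\u00e9')
    rejected = False
except CharsetError:
    rejected = True

print(mixed_case.input_charset, from_bytes.input_charset, rejected)
```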


Member Function Documentation

def email.charset.Charset.__eq__ (   self,
  other 
)

Definition at line 249 of file charset.py.

    def __eq__(self, other):
        return str(self) == str(other).lower()

def email.charset.Charset.__ne__ (   self,
  other 
)

Definition at line 252 of file charset.py.

    def __ne__(self, other):
        return not self.__eq__(other)

def email.charset.Charset.__str__ (   self )

Definition at line 244 of file charset.py.

    def __str__(self):
        return self.input_charset.lower()
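
The comparison and string methods above can be exercised with a short sketch like the one below (assuming the standard-library email.charset module). Note that the other operand is only lower-cased, so a plain string that is merely an alias does not compare equal.

```python
from email.charset import Charset

cs = Charset('Latin-1')

canonical_name = str(cs)                       # lower-case canonical name
equals_canonical = (cs == 'ISO-8859-1')        # other side is lower-cased
equals_alias_string = (cs == 'latin-1')        # plain strings are not alias-resolved
equals_alias_charset = (cs == Charset('latin_1'))  # both sides normalized
differs = (cs != 'utf-8')

print(canonical_name, equals_canonical, equals_alias_string,
      equals_alias_charset, differs)
```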

def email.charset.Charset._get_encoder (   self,
  header_bytes 
) [private]

Definition at line 365 of file charset.py.

    def _get_encoder(self, header_bytes):
        if self.header_encoding == BASE64:
            return email.base64mime
        elif self.header_encoding == QP:
            return email.quoprimime
        elif self.header_encoding == SHORTEST:
            len64 = email.base64mime.header_length(header_bytes)
            lenqp = email.quoprimime.header_length(header_bytes)
            if len64 < lenqp:
                return email.base64mime
            else:
                return email.quoprimime
        else:
            return None
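
The SHORTEST branch can be reproduced outside the class with the public header_length() functions of email.base64mime and email.quoprimime. This is only a sketch of that one branch; pick_shortest() is a hypothetical name used for illustration and is not part of the library.

```python
import email.base64mime
import email.quoprimime

def pick_shortest(header_bytes):
    """Return the encoder module whose header encoding would be shorter.

    Mirrors the SHORTEST branch: base64 wins only when strictly shorter.
    """
    len64 = email.base64mime.header_length(header_bytes)
    lenqp = email.quoprimime.header_length(header_bytes)
    return email.base64mime if len64 < lenqp else email.quoprimime

# Mostly ASCII text stays compact in quoted-printable.
mostly_ascii = 'hello world'.encode('utf-8')
# Text with many non-ASCII bytes tends to be shorter in base64.
mostly_non_ascii = 'Grüße'.encode('utf-8')

print(pick_shortest(mostly_ascii).__name__)
print(pick_shortest(mostly_non_ascii).__name__)
```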

def email.charset.Charset.body_encode (   self,
  string 
)
Body-encode a string by converting it first to bytes.

The type of encoding (base64 or quoted-printable) will be based on
self.body_encoding.  If body_encoding is None, we assume the
output charset is a 7bit encoding, so re-encoding the decoded
string using the ascii codec produces the correct string version
of the content.

Definition at line 380 of file charset.py.

    def body_encode(self, string):
        """Body-encode a string by converting it first to bytes.

        The type of encoding (base64 or quoted-printable) will be based on
        self.body_encoding.  If body_encoding is None, we assume the
        output charset is a 7bit encoding, so re-encoding the decoded
        string using the ascii codec produces the correct string version
        of the content.
        """
        # 7bit/8bit encodings return the string unchanged (modulo conversions)
        if self.body_encoding is BASE64:
            if isinstance(string, str):
                string = string.encode(self.output_charset)
            return email.base64mime.body_encode(string)
        elif self.body_encoding is QP:
            return email.quoprimime.body_encode(string)
        else:
            if isinstance(string, str):
                string = string.encode(self.output_charset).decode('ascii')
            return string
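
A minimal usage sketch for body_encode(), assuming the standard-library email.charset module, where utf-8 has a base64 body encoding and us-ascii has none. The base64 result is decoded again to check that the round trip preserves the text.

```python
import base64
from email.charset import Charset

text = 'Grüße'

# utf-8 bodies are base64-encoded after conversion to bytes.
encoded = Charset('utf-8').body_encode(text)
round_trip = base64.b64decode(encoded).decode('utf-8')

# us-ascii has no body encoding, so the string comes back unchanged.
plain = Charset('us-ascii').body_encode('plain text')

print(repr(encoded), round_trip, plain)
```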

def email.charset.Charset.get_body_encoding (   self )
Return the content-transfer-encoding used for body encoding.

This is either the string `quoted-printable' or `base64' depending on
the encoding used, or it is a function in which case you should call
the function with a single argument, the Message object being
encoded.  The function should then set the Content-Transfer-Encoding
header itself to whatever is appropriate.

Returns "quoted-printable" if self.body_encoding is QP.
Returns "base64" if self.body_encoding is BASE64.
Returns conversion function otherwise.

Definition at line 255 of file charset.py.

    def get_body_encoding(self):
        """Return the content-transfer-encoding used for body encoding.

        This is either the string `quoted-printable' or `base64' depending on
        the encoding used, or it is a function in which case you should call
        the function with a single argument, the Message object being
        encoded.  The function should then set the Content-Transfer-Encoding
        header itself to whatever is appropriate.

        Returns "quoted-printable" if self.body_encoding is QP.
        Returns "base64" if self.body_encoding is BASE64.
        Returns conversion function otherwise.
        """
        assert self.body_encoding != SHORTEST
        if self.body_encoding == QP:
            return 'quoted-printable'
        elif self.body_encoding == BASE64:
            return 'base64'
        else:
            return encode_7or8bit
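
Because the return value is either a string or a function, callers have to handle both cases. The sketch below shows one way to do that, assuming the standard-library email.charset and email.message modules; transfer_encoding_for() is a hypothetical helper, and it only sets the header without encoding the payload.

```python
from email.charset import Charset
from email.message import Message

def transfer_encoding_for(charset_name, payload):
    """Hypothetical helper: set Content-Transfer-Encoding as the docstring describes."""
    msg = Message()
    msg.set_payload(payload)
    cte = Charset(charset_name).get_body_encoding()
    if callable(cte):
        # The returned function sets the header on the message itself.
        cte(msg)
    else:
        # A plain string is used as the header value directly.
        msg['Content-Transfer-Encoding'] = cte
    return msg['Content-Transfer-Encoding']

print(transfer_encoding_for('utf-8', 'text'))
print(transfer_encoding_for('iso-8859-1', 'text'))
print(transfer_encoding_for('us-ascii', 'plain text'))
```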

def email.charset.Charset.get_output_charset (   self )
Return the output character set.

This is self.output_charset if that is not None, otherwise it is
self.input_charset.

Definition at line 276 of file charset.py.

    def get_output_charset(self):
        """Return the output character set.

        This is self.output_charset if that is not None, otherwise it is
        self.input_charset.
        """
        return self.output_charset or self.input_charset
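
A brief sketch of both outcomes, assuming the standard-library email.charset module. Setting output_charset to None by hand is done here only to exercise the fallback to input_charset.

```python
from email.charset import Charset

# euc-jp is converted on output, so the output charset differs from the input.
converted = Charset('euc-jp').get_output_charset()

# Clearing output_charset manually (for illustration) triggers the fallback.
cs = Charset('iso-8859-1')
cs.output_charset = None
fallback = cs.get_output_charset()

print(converted, fallback)
```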

def email.charset.Charset.header_encode (   self,
  string 
)
Header-encode a string by converting it first to bytes.

The type of encoding (base64 or quoted-printable) will be based on
this charset's `header_encoding`.

:param string: A unicode string for the header.  It must be possible
    to encode this string to bytes using the character set's
    output codec.
:return: The encoded string, with RFC 2047 chrome.

Definition at line 284 of file charset.py.

    def header_encode(self, string):
        """Header-encode a string by converting it first to bytes.

        The type of encoding (base64 or quoted-printable) will be based on
        this charset's `header_encoding`.

        :param string: A unicode string for the header.  It must be possible
            to encode this string to bytes using the character set's
            output codec.
        :return: The encoded string, with RFC 2047 chrome.
        """
        codec = self.output_codec or 'us-ascii'
        header_bytes = _encode(string, codec)
        # 7bit/8bit encodings return the string unchanged (modulo conversions)
        encoder_module = self._get_encoder(header_bytes)
        if encoder_module is None:
            return string
        return encoder_module.header_encode(header_bytes, codec)
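
A short usage sketch for header_encode(), assuming the standard-library email.charset and email.header modules. The encoded word is passed back through decode_header() to confirm that the original text can be recovered.

```python
from email.charset import Charset
from email.header import decode_header

# iso-8859-1 headers use quoted-printable inside the RFC 2047 chrome.
encoded = Charset('iso-8859-1').header_encode('café')

# decode_header() returns (bytes, charset) pairs for encoded words.
decoded_bytes, decoded_charset = decode_header(encoded)[0]
recovered = decoded_bytes.decode(decoded_charset)

# us-ascii has no header encoding, so the string is returned as-is.
unchanged = Charset('us-ascii').header_encode('hello')

print(encoded, recovered, unchanged)
```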

def email.charset.Charset.header_encode_lines (   self,
  string,
  maxlengths 
)
Header-encode a string by converting it first to bytes.

This is similar to `header_encode()` except that the string is fit
into maximum line lengths as given by the argument.

:param string: A unicode string for the header.  It must be possible
    to encode this string to bytes using the character set's
    output codec.
:param maxlengths: Maximum line length iterator.  Each element
    returned from this iterator will provide the next maximum line
    length.  This parameter is used as an argument to built-in next()
    and should never be exhausted.  The maximum line lengths should
    not count the RFC 2047 chrome.  These line lengths are only a
    hint; the splitter does the best it can.
:return: Lines of encoded strings, each with RFC 2047 chrome.

Definition at line 303 of file charset.py.

    def header_encode_lines(self, string, maxlengths):
        """Header-encode a string by converting it first to bytes.

        This is similar to `header_encode()` except that the string is fit
        into maximum line lengths as given by the argument.

        :param string: A unicode string for the header.  It must be possible
            to encode this string to bytes using the character set's
            output codec.
        :param maxlengths: Maximum line length iterator.  Each element
            returned from this iterator will provide the next maximum line
            length.  This parameter is used as an argument to built-in next()
            and should never be exhausted.  The maximum line lengths should
            not count the RFC 2047 chrome.  These line lengths are only a
            hint; the splitter does the best it can.
        :return: Lines of encoded strings, each with RFC 2047 chrome.
        """
        # See which encoding we should use.
        codec = self.output_codec or 'us-ascii'
        header_bytes = _encode(string, codec)
        encoder_module = self._get_encoder(header_bytes)
        encoder = partial(encoder_module.header_encode, charset=codec)
        # Calculate the number of characters that the RFC 2047 chrome will
        # contribute to each line.
        charset = self.get_output_charset()
        extra = len(charset) + RFC2047_CHROME_LEN
        # Now comes the hard part.  We must encode bytes but we can't split on
        # bytes because some character sets are variable length and each
        # encoded word must stand on its own.  So the problem is you have to
        # encode to bytes to figure out this word's length, but you must split
        # on characters.  This causes two problems: first, we don't know how
        # many octets a specific substring of unicode characters will get
        # encoded to, and second, we don't know how many ASCII characters
        # those octets will get encoded to.  Unless we try it.  Which seems
        # inefficient.  In the interest of being correct rather than fast (and
        # in the hope that there will be few encoded headers in any such
        # message), brute force it. :(
        lines = []
        current_line = []
        maxlen = next(maxlengths) - extra
        for character in string:
            current_line.append(character)
            this_line = EMPTYSTRING.join(current_line)
            length = encoder_module.header_length(_encode(this_line, charset))
            if length > maxlen:
                # This last character doesn't fit so pop it off.
                current_line.pop()
                # Does nothing fit on the first line?
                if not lines and not current_line:
                    lines.append(None)
                else:
                    separator = (' ' if lines else '')
                    joined_line = EMPTYSTRING.join(current_line)
                    header_bytes = _encode(joined_line, codec)
                    lines.append(encoder(header_bytes))
                current_line = [character]
                maxlen = next(maxlengths) - extra
        joined_line = EMPTYSTRING.join(current_line)
        header_bytes = _encode(joined_line, codec)
        lines.append(encoder(header_bytes))
        return lines
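
The splitting behavior can be tried with an endless iterator of line lengths, as the docstring requires. This is a sketch assuming the standard-library email.charset and email.header modules; the sample text and the limit of 40 are arbitrary choices for the example.

```python
from itertools import repeat
from email.charset import Charset
from email.header import decode_header

text = 'Grüße aus Köln, ' * 4
cs = Charset('utf-8')

# repeat(40) never runs out, so next() can always supply another limit.
lines = cs.header_encode_lines(text, repeat(40))

# Each line is a self-contained encoded word, so decoding the lines
# one by one and joining the pieces should give back the original text.
rebuilt = ''.join(
    chunk.decode(charset)
    for line in lines
    for chunk, charset in decode_header(line)
)

for line in lines:
    print(len(line), line)
```

Since the split happens on characters rather than bytes, no multi-byte character is cut in half between two encoded words.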


Member Data Documentation

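email.charset.Charset.__repr__ = __str__ [static, private]

Definition at line 247 of file charset.py.

email.charset.Charset.body_encoding

Definition at line 235 of file charset.py.

email.charset.Charset.header_encoding

Definition at line 234 of file charset.py.

email.charset.Charset.input_charset

Definition at line 225 of file charset.py.

email.charset.Charset.input_codec

Definition at line 239 of file charset.py.

email.charset.Charset.output_charset

Definition at line 236 of file charset.py.

email.charset.Charset.output_codec

Definition at line 241 of file charset.py.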


The documentation for this class was generated from the following file: