plone3  3.1.7
kss.core.BeautifulSoup.UnicodeDammit Class Reference

Public Member Functions

def __init__
def subMSChar
def find_codec

Public Attributes

 smartQuotesTo
 triedEncodings
 unicode
 originalEncoding
 markup

Static Public Attributes

dictionary CHARSET_ALIASES
tuple subMSChar = staticmethod(subMSChar)
 EBCDIC_TO_ASCII_MAP = None
dictionary MS_CHARS

Private Member Functions

def _convertFrom
def _toUnicode
def _detectEncoding
def _codec
def _ebcdic_to_ascii

Detailed Description

A class for detecting the encoding of a *ML document and
converting it to a Unicode string. If the source encoding is
windows-1252, can replace MS smart quotes with their HTML or XML
equivalents.

Definition at line 1556 of file BeautifulSoup.py.


Constructor & Destructor Documentation

def kss.core.BeautifulSoup.UnicodeDammit.__init__ (   self,
  markup,
  overrideEncodings = [],
  smartQuotesTo = 'xml' 
)

Definition at line 1570 of file BeautifulSoup.py.

01570     def __init__(self, markup, overrideEncodings=[],
01571                  smartQuotesTo='xml'):
01572         self.markup, documentEncoding, sniffedEncoding = \
01573                      self._detectEncoding(markup)
01574         self.smartQuotesTo = smartQuotesTo
01575         self.triedEncodings = []
01576         if isinstance(markup, unicode):
01577             return markup
01578 
01579         u = None
01580         for proposedEncoding in overrideEncodings:
01581             u = self._convertFrom(proposedEncoding)
01582             if u: break
01583         if not u:
01584             for proposedEncoding in (documentEncoding, sniffedEncoding):
01585                 u = self._convertFrom(proposedEncoding)
01586                 if u: break
01587                 
01588         # If no luck and we have auto-detection library, try that:
01589         if not u and chardet and not isinstance(self.markup, unicode):
01590             u = self._convertFrom(chardet.detect(self.markup)['encoding'])
01591 
01592         # As a last resort, try utf-8 and windows-1252:
01593         if not u:
01594             for proposed_encoding in ("utf-8", "windows-1252"):
01595                 u = self._convertFrom(proposed_encoding)
01596                 if u: break
01597         self.unicode = u
01598         if not u: self.originalEncoding = None
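
The fallback order in the constructor can be sketched in Python 3 as follows. This is a simplified illustration, not the library's code: `decode_with_fallbacks` is a hypothetical helper, it leaves out the chardet step, BOM sniffing and smart-quote handling, and it treats a strict `bytes.decode` that does not raise as success.

```python
def decode_with_fallbacks(data, override_encodings=(), declared=None, sniffed=None):
    """Try candidate encodings in priority order; return (text, encoding).

    Order assumed here: caller overrides, then the declared and sniffed
    encodings, then utf-8 and windows-1252 as a last resort.
    """
    tried = []
    candidates = list(override_encodings) + [declared, sniffed, "utf-8", "windows-1252"]
    for encoding in candidates:
        # Skip empty candidates and encodings that already failed once.
        if not encoding or encoding in tried:
            continue
        tried.append(encoding)
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue
    return None, None
```

When every candidate fails the sketch returns `(None, None)`, which loosely mirrors the listing setting `originalEncoding` to `None`.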


Member Function Documentation

def kss.core.BeautifulSoup.UnicodeDammit._codec (   self,
  charset 
) [private]

Definition at line 1736 of file BeautifulSoup.py.

01736 
01737     def _codec(self, charset):
01738         if not charset: return charset 
01739         codec = None
01740         try:
01741             codecs.lookup(charset)
01742             codec = charset
01743         except LookupError:
01744             pass
01745         return codec
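
A Python 3 sketch of the same check, assuming only the standard `codecs` module. The name `known_codec` is hypothetical and chosen for illustration.

```python
import codecs

def known_codec(charset):
    """Return charset if Python has a codec registered under that name, else None."""
    if not charset:
        # Empty or None input is passed straight back, as in the listing.
        return charset
    try:
        codecs.lookup(charset)
    except LookupError:
        return None
    return charset
```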

def kss.core.BeautifulSoup.UnicodeDammit._convertFrom (   self,
  proposed 
) [private]

Definition at line 1613 of file BeautifulSoup.py.

01613 
01614     def _convertFrom(self, proposed):        
01615         proposed = self.find_codec(proposed)
01616         if not proposed or proposed in self.triedEncodings:
01617             return None
01618         self.triedEncodings.append(proposed)
01619         markup = self.markup
01620 
01621         # Convert smart quotes to HTML if coming from an encoding
01622         # that might have them.
01623         if self.smartQuotesTo and proposed in("windows-1252",
01624                                               "ISO-8859-1",
01625                                               "ISO-8859-2"):
01626             markup = re.compile("([\x80-\x9f])").sub \
01627                      (lambda(x): self.subMSChar(x.group(1),self.smartQuotesTo),
01628                       markup)
01629 
01630         try:
01631             # print "Trying to convert document to %s" % proposed
01632             u = self._toUnicode(markup, proposed)
01633             self.markup = u       
01634             self.originalEncoding = proposed
01635         except Exception, e:
01636             # print "That didn't work!"
01637             # print e
01638             return None        
01639         #print "Correct encoding: %s" % proposed
01640         return self.markup

def kss.core.BeautifulSoup.UnicodeDammit._detectEncoding (   self,
  xml_data 
) [private]
Given a document, tries to detect its XML encoding.

Definition at line 1666 of file BeautifulSoup.py.

01666 
01667     def _detectEncoding(self, xml_data):
01668         """Given a document, tries to detect its XML encoding."""
01669         xml_encoding = sniffed_xml_encoding = None
01670         try:
01671             if xml_data[:4] == '\x4c\x6f\xa7\x94':
01672                 # EBCDIC
01673                 xml_data = self._ebcdic_to_ascii(xml_data)
01674             elif xml_data[:4] == '\x00\x3c\x00\x3f':
01675                 # UTF-16BE
01676                 sniffed_xml_encoding = 'utf-16be'
01677                 xml_data = unicode(xml_data, 'utf-16be').encode('utf-8')
01678             elif (len(xml_data) >= 4) and (xml_data[:2] == '\xfe\xff') \
01679                      and (xml_data[2:4] != '\x00\x00'):
01680                 # UTF-16BE with BOM
01681                 sniffed_xml_encoding = 'utf-16be'
01682                 xml_data = unicode(xml_data[2:], 'utf-16be').encode('utf-8')
01683             elif xml_data[:4] == '\x3c\x00\x3f\x00':
01684                 # UTF-16LE
01685                 sniffed_xml_encoding = 'utf-16le'
01686                 xml_data = unicode(xml_data, 'utf-16le').encode('utf-8')
01687             elif (len(xml_data) >= 4) and (xml_data[:2] == '\xff\xfe') and \
01688                      (xml_data[2:4] != '\x00\x00'):
01689                 # UTF-16LE with BOM
01690                 sniffed_xml_encoding = 'utf-16le'
01691                 xml_data = unicode(xml_data[2:], 'utf-16le').encode('utf-8')
01692             elif xml_data[:4] == '\x00\x00\x00\x3c':
01693                 # UTF-32BE
01694                 sniffed_xml_encoding = 'utf-32be'
01695                 xml_data = unicode(xml_data, 'utf-32be').encode('utf-8')
01696             elif xml_data[:4] == '\x3c\x00\x00\x00':
01697                 # UTF-32LE
01698                 sniffed_xml_encoding = 'utf-32le'
01699                 xml_data = unicode(xml_data, 'utf-32le').encode('utf-8')
01700             elif xml_data[:4] == '\x00\x00\xfe\xff':
01701                 # UTF-32BE with BOM
01702                 sniffed_xml_encoding = 'utf-32be'
01703                 xml_data = unicode(xml_data[4:], 'utf-32be').encode('utf-8')
01704             elif xml_data[:4] == '\xff\xfe\x00\x00':
01705                 # UTF-32LE with BOM
01706                 sniffed_xml_encoding = 'utf-32le'
01707                 xml_data = unicode(xml_data[4:], 'utf-32le').encode('utf-8')
01708             elif xml_data[:3] == '\xef\xbb\xbf':
01709                 # UTF-8 with BOM
01710                 sniffed_xml_encoding = 'utf-8'
01711                 xml_data = unicode(xml_data[3:], 'utf-8').encode('utf-8')
01712             else:
01713                 sniffed_xml_encoding = 'ascii'
01714                 pass
01715             xml_encoding_match = re.compile \
01716                                  ('^<\?.*encoding=[\'"](.*?)[\'"].*\?>')\
01717                                  .match(xml_data)
01718         except:
01719             xml_encoding_match = None
01720         if xml_encoding_match:
01721             xml_encoding = xml_encoding_match.groups()[0].lower()
01722             if sniffed_xml_encoding and \
01723                (xml_encoding in ('iso-10646-ucs-2', 'ucs-2', 'csunicode',
01724                                  'iso-10646-ucs-4', 'ucs-4', 'csucs4',
01725                                  'utf-16', 'utf-32', 'utf_16', 'utf_32',
01726                                  'utf16', 'u16')):
01727                 xml_encoding = sniffed_xml_encoding
01728         return xml_data, xml_encoding, sniffed_xml_encoding
01729 
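
The two detection steps above (matching the first bytes against known signatures, then reading the `encoding` pseudo-attribute of the XML declaration) can be sketched in Python 3 on `bytes` input. The names `SIGNATURES`, `sniff_encoding` and `declared_encoding` are hypothetical. Unlike the listing, this sketch orders the table so that longer BOMs are tested first, instead of guarding the UTF-16 checks with a test on bytes 2 to 4, and it does not re-encode the data or handle EBCDIC.

```python
import re

# First-bytes signatures, checked in order. Longer BOMs come first so that
# the UTF-32LE BOM (ff fe 00 00) is not mistaken for the UTF-16LE BOM (ff fe).
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "utf-32be"),   # UTF-32BE with BOM
    (b"\xff\xfe\x00\x00", "utf-32le"),   # UTF-32LE with BOM
    (b"\xef\xbb\xbf", "utf-8"),          # UTF-8 with BOM
    (b"\xfe\xff", "utf-16be"),           # UTF-16BE with BOM
    (b"\xff\xfe", "utf-16le"),           # UTF-16LE with BOM
    (b"\x00\x00\x00\x3c", "utf-32be"),   # "<" in UTF-32BE, no BOM
    (b"\x3c\x00\x00\x00", "utf-32le"),   # "<" in UTF-32LE, no BOM
    (b"\x00\x3c\x00\x3f", "utf-16be"),   # "<?" in UTF-16BE, no BOM
    (b"\x3c\x00\x3f\x00", "utf-16le"),   # "<?" in UTF-16LE, no BOM
]

# Same pattern as the listing, compiled for bytes input.
XML_DECL = re.compile(rb"^<\?.*encoding=['\"](.*?)['\"].*\?>")

def sniff_encoding(data):
    """Guess an encoding from the leading bytes; default to 'ascii'."""
    for prefix, encoding in SIGNATURES:
        if data.startswith(prefix):
            return encoding
    return "ascii"

def declared_encoding(data):
    """Return the lower-cased encoding named in an XML declaration, or None."""
    match = XML_DECL.match(data)
    return match.group(1).decode("ascii").lower() if match else None
```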

def kss.core.BeautifulSoup.UnicodeDammit._ebcdic_to_ascii (   self,
  s 
) [private]

Definition at line 1747 of file BeautifulSoup.py.

01747 
01748     def _ebcdic_to_ascii(self, s):
01749         c = self.__class__
01750         if not c.EBCDIC_TO_ASCII_MAP:
01751             emap = (0,1,2,3,156,9,134,127,151,141,142,11,12,13,14,15,
01752                     16,17,18,19,157,133,8,135,24,25,146,143,28,29,30,31,
01753                     128,129,130,131,132,10,23,27,136,137,138,139,140,5,6,7,
01754                     144,145,22,147,148,149,150,4,152,153,154,155,20,21,158,26,
01755                     32,160,161,162,163,164,165,166,167,168,91,46,60,40,43,33,
01756                     38,169,170,171,172,173,174,175,176,177,93,36,42,41,59,94,
01757                     45,47,178,179,180,181,182,183,184,185,124,44,37,95,62,63,
01758                     186,187,188,189,190,191,192,193,194,96,58,35,64,39,61,34,
01759                     195,97,98,99,100,101,102,103,104,105,196,197,198,199,200,
01760                     201,202,106,107,108,109,110,111,112,113,114,203,204,205,
01761                     206,207,208,209,126,115,116,117,118,119,120,121,122,210,
01762                     211,212,213,214,215,216,217,218,219,220,221,222,223,224,
01763                     225,226,227,228,229,230,231,123,65,66,67,68,69,70,71,72,
01764                     73,232,233,234,235,236,237,125,74,75,76,77,78,79,80,81,
01765                     82,238,239,240,241,242,243,92,159,83,84,85,86,87,88,89,
01766                     90,244,245,246,247,248,249,48,49,50,51,52,53,54,55,56,57,
01767                     250,251,252,253,254,255)
01768             import string
01769             c.EBCDIC_TO_ASCII_MAP = string.maketrans( \
01770             ''.join(map(chr, range(256))), ''.join(map(chr, emap)))
01771         return s.translate(c.EBCDIC_TO_ASCII_MAP)
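
In Python 3 the same table-driven translation would use `bytes.maketrans` rather than `string.maketrans`. The sketch below is illustrative only: `build_table`, `SIGNATURE_TABLE` and `translate_signature` are hypothetical names, and the partial table remaps just the four bytes of the EBCDIC signature tested in `_detectEncoding`, using the values found at those positions in the `emap` tuple above.

```python
def build_table(mapping):
    """Build a 256-entry translation table from a sequence of 256 byte values."""
    return bytes.maketrans(bytes(range(256)), bytes(mapping))

# Partial table: positions 0x4c, 0x6f, 0xa7 and 0x94 of emap hold
# 60, 63, 120 and 109, i.e. the ASCII bytes "<", "?", "x" and "m".
SIGNATURE_TABLE = bytes.maketrans(b"\x4c\x6f\xa7\x94", b"<?xm")

def translate_signature(data):
    """Translate the EBCDIC signature bytes to ASCII; other bytes pass through."""
    return data.translate(SIGNATURE_TABLE)
```

Passing the full `emap` tuple to `build_table` would give a table equivalent in intent to `EBCDIC_TO_ASCII_MAP`.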

def kss.core.BeautifulSoup.UnicodeDammit._toUnicode (   self,
  data,
  encoding 
) [private]
Given a string and its encoding, decodes the string into Unicode.
%encoding is a string recognized by encodings.aliases

Definition at line 1641 of file BeautifulSoup.py.

01641 
01642     def _toUnicode(self, data, encoding):
01643         '''Given a string and its encoding, decodes the string into Unicode.
01644         %encoding is a string recognized by encodings.aliases'''
01645 
01646         # strip Byte Order Mark (if present)
01647         if (len(data) >= 4) and (data[:2] == '\xfe\xff') \
01648                and (data[2:4] != '\x00\x00'):
01649             encoding = 'utf-16be'
01650             data = data[2:]
01651         elif (len(data) >= 4) and (data[:2] == '\xff\xfe') \
01652                  and (data[2:4] != '\x00\x00'):
01653             encoding = 'utf-16le'
01654             data = data[2:]
01655         elif data[:3] == '\xef\xbb\xbf':
01656             encoding = 'utf-8'
01657             data = data[3:]
01658         elif data[:4] == '\x00\x00\xfe\xff':
01659             encoding = 'utf-32be'
01660             data = data[4:]
01661         elif data[:4] == '\xff\xfe\x00\x00':
01662             encoding = 'utf-32le'
01663             data = data[4:]
01664         newdata = unicode(data, encoding)
01665         return newdata
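
A Python 3 sketch of the BOM handling, using the BOM constants from the standard `codecs` module. `BOMS` and `decode_stripping_bom` are hypothetical names. As an assumption that differs from the listing, the table tests the 4-byte BOMs before the 2-byte ones instead of checking that bytes 2 to 4 are non-zero.

```python
import codecs

# Longer BOMs first: the UTF-32LE BOM begins with the UTF-16LE BOM bytes.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def decode_stripping_bom(data, encoding):
    """Decode bytes; a leading BOM overrides the proposed encoding and is removed."""
    for bom, bom_encoding in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(bom_encoding)
    return data.decode(encoding)
```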
    

def kss.core.BeautifulSoup.UnicodeDammit.find_codec (   self,
  charset 
)

Definition at line 1730 of file BeautifulSoup.py.

01730 
01731     def find_codec(self, charset):
01732         return self._codec(self.CHARSET_ALIASES.get(charset, charset)) \
01733                or (charset and self._codec(charset.replace("-", ""))) \
01734                or (charset and self._codec(charset.replace("-", "_"))) \
01735                or charset
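
The lookup chain (alias table, then the name with hyphens removed, then with hyphens replaced by underscores, then the name unchanged) can be sketched in Python 3 as below. `ALIASES`, `_known` and `resolve_codec` are hypothetical names. The alias table mirrors the two-entry dictionary shown on this page, and the early return for empty input is an addition of the sketch.

```python
import codecs

# Assumed alias table, copied from the two entries shown on this page.
ALIASES = {"macintosh": "mac-roman", "x-sjis": "shift-jis"}

def _known(charset):
    """Return charset if Python can look it up as a codec, else None."""
    if not charset:
        return None
    try:
        codecs.lookup(charset)
        return charset
    except LookupError:
        return None

def resolve_codec(charset):
    """Return the first spelling of charset that Python recognizes."""
    if not charset:
        return charset
    return (_known(ALIASES.get(charset, charset))
            or _known(charset.replace("-", ""))
            or _known(charset.replace("-", "_"))
            or charset)  # unknown names are returned unchanged
```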

def kss.core.BeautifulSoup.UnicodeDammit.subMSChar (   orig,
  smartQuotesTo 
)
Changes a MS smart quote character to an XML or HTML
entity.

Definition at line 1599 of file BeautifulSoup.py.

01599 
01600     def subMSChar(orig, smartQuotesTo):
01601         """Changes a MS smart quote character to an XML or HTML
01602         entity."""
01603         sub = UnicodeDammit.MS_CHARS.get(orig)
01604         if type(sub) == types.TupleType:
01605             if smartQuotesTo == 'xml':
01606                 sub = '&#x%s;' % sub[1]
01607             elif smartQuotesTo == 'html':
01608                 sub = '&%s;' % sub[0]
01609             else:
01610                 sub = unichr(int(sub[1],16))
01611         return sub
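
A Python 3 sketch of the substitution on `bytes` input. `MS_CHARS_SUBSET` is an illustrative four-entry stand-in for `MS_CHARS`, assumed to hold (entity name, hex code point) pairs as the listing's indexing suggests. `sub_ms_char` and `replace_smart_quotes` are hypothetical names. Two choices are assumptions of the sketch, not behavior taken from the listing: bytes missing from the subset are left unchanged, and the fallback branch emits the character as UTF-8.

```python
import re

# Illustrative subset; each value is (HTML entity name, Unicode code point in hex).
MS_CHARS_SUBSET = {
    b"\x91": ("lsquo", "2018"),
    b"\x92": ("rsquo", "2019"),
    b"\x93": ("ldquo", "201C"),
    b"\x94": ("rdquo", "201D"),
}

def sub_ms_char(orig, smart_quotes_to):
    """Map one windows-1252 byte to an XML or HTML entity, or to UTF-8 bytes."""
    entry = MS_CHARS_SUBSET.get(orig)
    if entry is None:
        return orig  # not in the subset: leave the byte as it is
    name, hex_code = entry
    if smart_quotes_to == "xml":
        return b"&#x" + hex_code.encode("ascii") + b";"
    if smart_quotes_to == "html":
        return b"&" + name.encode("ascii") + b";"
    return chr(int(hex_code, 16)).encode("utf-8")

def replace_smart_quotes(data, smart_quotes_to="xml"):
    """Apply sub_ms_char to every byte in the range 0x80 to 0x9f."""
    return re.sub(rb"[\x80-\x9f]",
                  lambda m: sub_ms_char(m.group(0), smart_quotes_to),
                  data)
```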

Member Data Documentation

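dictionary kss.core.BeautifulSoup.UnicodeDammit.CHARSET_ALIASES [static]

Initial value:
{ "macintosh" : "mac-roman",
                        "x-sjis" : "shift-jis" }

Definition at line 1566 of file BeautifulSoup.py.

kss.core.BeautifulSoup.UnicodeDammit.EBCDIC_TO_ASCII_MAP = None [static]

Definition at line 1746 of file BeautifulSoup.py.

kss.core.BeautifulSoup.UnicodeDammit.markup

Definition at line 1632 of file BeautifulSoup.py.

dictionary kss.core.BeautifulSoup.UnicodeDammit.MS_CHARS [static]

Definition at line 1772 of file BeautifulSoup.py.

kss.core.BeautifulSoup.UnicodeDammit.originalEncoding

Definition at line 1597 of file BeautifulSoup.py.

kss.core.BeautifulSoup.UnicodeDammit.smartQuotesTo

Definition at line 1573 of file BeautifulSoup.py.

tuple kss.core.BeautifulSoup.UnicodeDammit.subMSChar = staticmethod(subMSChar) [static]

Definition at line 1611 of file BeautifulSoup.py.

kss.core.BeautifulSoup.UnicodeDammit.triedEncodings

Definition at line 1574 of file BeautifulSoup.py.

kss.core.BeautifulSoup.UnicodeDammit.unicode

Definition at line 1596 of file BeautifulSoup.py.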


The documentation for this class was generated from the following file: