plone3  3.1.7
kss.core.BeautifulSoup.BeautifulSoup Class Reference

Public Member Functions

def __init__
def start_meta
def __getattr__
def isSelfClosingTag
def reset
def popTag
def pushTag
def endData
def unknown_starttag
def unknown_endtag
def handle_data
def handle_pi
def handle_comment
def handle_charref
def handle_entityref
def handle_decl
def parse_declaration
def setup
def replaceWith
def extract
def insert
def findNext
def findAllNext
def findNextSibling
def findNextSiblings
def findPrevious
def findAllPrevious
def findPreviousSibling
def findPreviousSiblings
def findParent
def findParents
def nextGenerator
def nextSiblingGenerator
def previousGenerator
def previousSiblingGenerator
def parentGenerator
def substituteEncoding
def toEncoding

Public Attributes

 originalEncoding
 declaredHTMLEncoding
 HTML_ENTITIES
 XML_ENTITIES
 parseOnlyThese
 fromEncoding
 smartQuotesTo
 convertHTMLEntities
 convertXMLEntities
 instanceSelfClosingTags
 markup
 markupMassage
 hidden
 currentData
 currentTag
 tagStack
 quoteStack
 previous
 literal
 parent
 next
 previousSibling
 nextSibling

Static Public Attributes

tuple SELF_CLOSING_TAGS
dictionary QUOTE_TAGS = {'script': None}
list NESTABLE_INLINE_TAGS
list NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']
dictionary NESTABLE_LIST_TAGS
dictionary NESTABLE_TABLE_TAGS
list NON_NESTABLE_BLOCK_TAGS = ['address', 'form', 'p', 'pre']
tuple RESET_NESTING_TAGS
tuple NESTABLE_TAGS
tuple CHARSET_RE = re.compile("((^|;)\s*charset=)([^;]*)")
list MARKUP_MASSAGE
string ROOT_TAG_NAME = u'[document]'
string HTML_ENTITIES = "html"
string XML_ENTITIES = "xml"
list ALL_ENTITIES = [HTML_ENTITIES, XML_ENTITIES]
 fetchNextSiblings = findNextSiblings
 fetchPrevious = findAllPrevious
 fetchPreviousSiblings = findPreviousSiblings
 fetchParents = findParents

Detailed Description

This parser knows the following facts about HTML:

* Some tags have no closing tag and should be interpreted as being
  closed as soon as they are encountered.

* The text inside some tags (e.g. 'script') may contain tags which
  are not really part of the document and which should be parsed
  as text, not tags. If you want to parse the text as tags, you can
  always fetch it and parse it explicitly.

* Tag nesting rules:

  Most tags can't be nested at all. For instance, the occurrence of
  a <p> tag should implicitly close the previous <p> tag.

   <p>Para1<p>Para2
    should be transformed into:
   <p>Para1</p><p>Para2

  Some tags can be nested arbitrarily. For instance, the occurrence
  of a <blockquote> tag should _not_ implicitly close the previous
  <blockquote> tag.

   Alice said: <blockquote>Bob said: <blockquote>Blah
    should NOT be transformed into:
   Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah

  Some tags can be nested, but the nesting is reset by the
  interposition of other tags. For instance, a <tr> tag should
  implicitly close the previous <tr> tag within the same <table>,
  but not close a <tr> tag in another table.

   <table><tr>Blah<tr>Blah
    should be transformed into:
   <table><tr>Blah</tr><tr>Blah
    but,
   <tr>Blah<table><tr>Blah
    should NOT be transformed into
   <tr>Blah<table></tr><tr>Blah

Differing assumptions about tag nesting rules are a major source
of problems with the BeautifulSoup class. If BeautifulSoup is not
treating as nestable a tag your page author treats as nestable,
try ICantBelieveItsBeautifulSoup, MinimalSoup, or
BeautifulStoneSoup before writing your own subclass.
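
The nesting rules above can be illustrated with a small stack-based sketch. This is not the class's own implementation: the rule tables and the function name `apply_nesting_rules` are hypothetical, heavily simplified stand-ins, and the code is standalone Python 3 rather than the Python 2 of the listings on this page.

```python
# Hypothetical, simplified rule tables (assumptions for illustration only).
ARBITRARILY_NESTABLE = {"blockquote", "div", "table"}
# A tag listed here may nest again once one of the given tags intervenes.
NESTING_RESET_BY = {"tr": {"table"}}


def apply_nesting_rules(tokens):
    """Insert implicit close tags into a stream of ("start", name) and
    ("text", value) tokens and return the resulting markup string."""
    out = []
    stack = []  # names of the currently open tags, outermost first
    for kind, value in tokens:
        if kind == "text":
            out.append(value)
            continue
        if value not in ARBITRARILY_NESTABLE:
            resetters = NESTING_RESET_BY.get(value, set())
            # Walk from the innermost open tag outwards.
            for depth in range(len(stack) - 1, -1, -1):
                if stack[depth] in resetters:
                    break  # nesting was reset; close nothing
                if stack[depth] == value:
                    # Close everything down to and including the earlier tag.
                    while len(stack) > depth:
                        out.append("</%s>" % stack.pop())
                    break
        stack.append(value)
        out.append("<%s>" % value)
    return "".join(out)
```

Fed the three examples from the description, the sketch closes the first `<p>` and the first `<tr>` inside the same table, and leaves nested `<blockquote>` tags and a `<tr>` separated by a `<table>` untouched.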

Definition at line 1275 of file BeautifulSoup.py.


Constructor & Destructor Documentation

def kss.core.BeautifulSoup.BeautifulSoup.__init__ (   self,
  args,
  kwargs 
)

Definition at line 1323 of file BeautifulSoup.py.

01323 
01324     def __init__(self, *args, **kwargs):
01325         if not kwargs.has_key('smartQuotesTo'):
01326             kwargs['smartQuotesTo'] = self.HTML_ENTITIES
01327         BeautifulStoneSoup.__init__(self, *args, **kwargs)


Member Function Documentation

def kss.core.BeautifulSoup.BeautifulStoneSoup.__getattr__ (   self,
  methodName 
) [inherited]
This method routes method call requests to either the SGMLParser
superclass or the Tag superclass, depending on the method name.

Definition at line 1005 of file BeautifulSoup.py.

01005 
01006     def __getattr__(self, methodName):
01007         """This method routes method call requests to either the SGMLParser
01008         superclass or the Tag superclass, depending on the method name."""
01009         #print "__getattr__ called on %s.%s" % (self.__class__, methodName)
01010 
01011         if methodName.find('start_') == 0 or methodName.find('end_') == 0 \
01012                or methodName.find('do_') == 0:
01013             return SGMLParser.__getattr__(self, methodName)
01014         elif methodName.find('__') != 0:
01015             return Tag.__getattr__(self, methodName)
01016         else:
01017             raise AttributeError
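
The prefix-based routing in this listing can be sketched in isolation. The function name `route_attribute` and its string return values are hypothetical; the sketch is standalone Python 3 and only reports which superclass the lookup would be delegated to.

```python
def route_attribute(method_name):
    """Return the name of the superclass that would serve the lookup."""
    # Parser callbacks such as start_meta or end_p belong to the SGML parser.
    if method_name.startswith(("start_", "end_", "do_")):
        return "SGMLParser"
    # Anything that is not a double-underscore name is treated as tag access.
    if not method_name.startswith("__"):
        return "Tag"
    raise AttributeError(method_name)
```

Under these assumptions `route_attribute("start_meta")` selects the parser, `route_attribute("findAll")` selects the tag, and a name such as `__iter__` raises `AttributeError`.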

def kss.core.BeautifulSoup.BeautifulStoneSoup.endData (   self,
  containerClass = NavigableString 
) [inherited]

Definition at line 1055 of file BeautifulSoup.py.

01055 
01056     def endData(self, containerClass=NavigableString):
01057         if self.currentData:
01058             currentData = ''.join(self.currentData)
01059             if currentData.endswith('<') and self.convertHTMLEntities:
01060                 currentData = currentData[:-1] + '&lt;'
01061             if not currentData.strip():
01062                 if '\n' in currentData:
01063                     currentData = '\n'
01064                 else:
01065                     currentData = ' '
01066             self.currentData = []
01067             if self.parseOnlyThese and len(self.tagStack) <= 1 and \
01068                    (not self.parseOnlyThese.text or \
01069                     not self.parseOnlyThese.search(currentData)):
01070                 return
01071             o = containerClass(currentData)
01072             o.setup(self.currentTag, self.previous)
01073             if self.previous:
01074                 self.previous.next = o
01075             self.previous = o
01076             self.currentTag.contents.append(o)
01077 
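
The text normalisation at the start of `endData` can be sketched as a pure function. The name `collapse_pending_text` and its parameters are hypothetical, the `parseOnlyThese` filtering and tree linking are left out, and the code is standalone Python 3.

```python
def collapse_pending_text(chunks, convert_html_entities=False):
    """Join buffered text chunks and collapse whitespace-only runs."""
    if not chunks:
        return None  # nothing buffered, nothing to emit
    text = "".join(chunks)
    # A dangling "<" at the end of the buffer is escaped when converting.
    if text.endswith("<") and convert_html_entities:
        text = text[:-1] + "&lt;"
    # Whitespace-only text shrinks to one newline or one space.
    if not text.strip():
        text = "\n" if "\n" in text else " "
    return text
```

Only whitespace-only runs are collapsed; text that contains any other character is passed through with its internal spacing intact.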

def kss.core.BeautifulSoup.PageElement.extract (   self) [inherited]
Destructively rips this element out of the tree.

Definition at line 102 of file BeautifulSoup.py.

00102 
00103     def extract(self):
00104         """Destructively rips this element out of the tree."""        
00105         if self.parent:
00106             try:
00107                 self.parent.contents.remove(self)
00108             except ValueError:
00109                 pass
00110 
00111         #Find the two elements that would be next to each other if
00112         #this element (and any children) hadn't been parsed. Connect
00113         #the two.        
00114         lastChild = self._lastRecursiveChild()
00115         nextElement = lastChild.next
00116 
00117         if self.previous:
00118             self.previous.next = nextElement
00119         if nextElement:
00120             nextElement.previous = self.previous
00121         self.previous = None
00122         lastChild.next = None
00123 
00124         self.parent = None        
00125         if self.previousSibling:
00126             self.previousSibling.nextSibling = self.nextSibling
00127         if self.nextSibling:
00128             self.nextSibling.previousSibling = self.previousSibling
00129         self.previousSibling = self.nextSibling = None       
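
The sibling and parent bookkeeping in `extract` can be sketched with a stripped-down node type. The class `Node` is hypothetical and covers only `parent`, `contents`, `previousSibling` and `nextSibling`; the document-order `next`/`previous` chain handled in the listing is deliberately left out. Standalone Python 3.

```python
class Node:
    """Minimal tree node, an assumption made for illustration."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.contents = []
        self.previousSibling = None
        self.nextSibling = None

    def append(self, child):
        child.parent = self
        if self.contents:
            child.previousSibling = self.contents[-1]
            self.contents[-1].nextSibling = child
        self.contents.append(child)

    def extract(self):
        """Detach this node and join its former neighbours."""
        if self.parent is not None:
            self.parent.contents.remove(self)
        if self.previousSibling is not None:
            self.previousSibling.nextSibling = self.nextSibling
        if self.nextSibling is not None:
            self.nextSibling.previousSibling = self.previousSibling
        self.parent = None
        self.previousSibling = self.nextSibling = None
        return self
```

After extracting the middle child of three, the two remaining children point at each other and the extracted node has no parent or siblings.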

def kss.core.BeautifulSoup.PageElement.findAllNext (   self,
  name = None,
  attrs = {},
  text = None,
  limit = None,
  kwargs 
) [inherited]
Returns all items that match the given criteria and appear
after this Tag in the document.

Definition at line 203 of file BeautifulSoup.py.

00203     def findAllNext(self, name=None, attrs={}, text=None, limit=None,
00204                     **kwargs):
00205         """Returns all items that match the given criteria and appear
00206         after this Tag in the document."""
00207         return self._findAll(name, attrs, text, limit, self.nextGenerator, **kwargs)

def kss.core.BeautifulSoup.PageElement.findAllPrevious (   self,
  name = None,
  attrs = {},
  text = None,
  limit = None,
  kwargs 
) [inherited]
Returns all items that match the given criteria and appear
before this Tag in the document.

Definition at line 228 of file BeautifulSoup.py.

00228     def findAllPrevious(self, name=None, attrs={}, text=None, limit=None,
00229                         **kwargs):
00230         """Returns all items that match the given criteria and appear
00231         before this Tag in the document."""
00232         return self._findAll(name, attrs, text, limit, self.previousGenerator,
00233                              **kwargs)

def kss.core.BeautifulSoup.PageElement.findNext (   self,
  name = None,
  attrs = {},
  text = None,
  kwargs 
) [inherited]
Returns the first item that matches the given criteria and
appears after this Tag in the document.

Definition at line 197 of file BeautifulSoup.py.

00197 
00198     def findNext(self, name=None, attrs={}, text=None, **kwargs):
00199         """Returns the first item that matches the given criteria and
00200         appears after this Tag in the document."""
00201         return self._findOne(self.findAllNext, name, attrs, text, **kwargs)
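
The single-result finders on this page delegate to their multi-result counterparts. The helper `_findOne` is not listed here, so the following is only a guess at the pattern: ask the "find all" function for at most one match and unwrap it. The names `find_one` and `find_all_even` are hypothetical; standalone Python 3.

```python
def find_one(find_all, *args, **kwargs):
    """Return the first match from a 'find all' callable, or None."""
    results = find_all(*args, limit=1, **kwargs)
    return results[0] if results else None


def find_all_even(numbers, limit=None):
    """Toy 'find all' that stops once `limit` matches were collected."""
    matches = []
    for n in numbers:
        if n % 2 == 0:
            matches.append(n)
            if limit is not None and len(matches) >= limit:
                break
    return matches
```

Passing the limit down lets the search stop at the first hit instead of collecting every match and discarding all but one.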

def kss.core.BeautifulSoup.PageElement.findNextSibling (   self,
  name = None,
  attrs = {},
  text = None,
  kwargs 
) [inherited]
Returns the closest sibling to this Tag that matches the
given criteria and appears after this Tag in the document.

Definition at line 208 of file BeautifulSoup.py.

00208 
00209     def findNextSibling(self, name=None, attrs={}, text=None, **kwargs):
00210         """Returns the closest sibling to this Tag that matches the
00211         given criteria and appears after this Tag in the document."""
00212         return self._findOne(self.findNextSiblings, name, attrs, text,
00213                              **kwargs)

def kss.core.BeautifulSoup.PageElement.findNextSiblings (   self,
  name = None,
  attrs = {},
  text = None,
  limit = None,
  kwargs 
) [inherited]
Returns the siblings of this Tag that match the given
criteria and appear after this Tag in the document.

Definition at line 215 of file BeautifulSoup.py.

00215     def findNextSiblings(self, name=None, attrs={}, text=None, limit=None,
00216                          **kwargs):
00217         """Returns the siblings of this Tag that match the given
00218         criteria and appear after this Tag in the document."""
00219         return self._findAll(name, attrs, text, limit,
00220                              self.nextSiblingGenerator, **kwargs)

def kss.core.BeautifulSoup.PageElement.findParent (   self,
  name = None,
  attrs = {},
  kwargs 
) [inherited]
Returns the closest parent of this Tag that matches the given
criteria.

Definition at line 249 of file BeautifulSoup.py.

00249 
00250     def findParent(self, name=None, attrs={}, **kwargs):
00251         """Returns the closest parent of this Tag that matches the given
00252         criteria."""
00253         # NOTE: We can't use _findOne because findParents takes a different
00254         # set of arguments.
00255         r = None
00256         l = self.findParents(name, attrs, 1)
00257         if l:
00258             r = l[0]
00259         return r

def kss.core.BeautifulSoup.PageElement.findParents (   self,
  name = None,
  attrs = {},
  limit = None,
  kwargs 
) [inherited]
Returns the parents of this Tag that match the given
criteria.

Definition at line 260 of file BeautifulSoup.py.

00260 
00261     def findParents(self, name=None, attrs={}, limit=None, **kwargs):
00262         """Returns the parents of this Tag that match the given
00263         criteria."""
00264 
00265         return self._findAll(name, attrs, None, limit, self.parentGenerator,
00266                              **kwargs)

def kss.core.BeautifulSoup.PageElement.findPrevious (   self,
  name = None,
  attrs = {},
  text = None,
  kwargs 
) [inherited]
Returns the first item that matches the given criteria and
appears before this Tag in the document.

Definition at line 222 of file BeautifulSoup.py.

00222 
00223     def findPrevious(self, name=None, attrs={}, text=None, **kwargs):
00224         """Returns the first item that matches the given criteria and
00225         appears before this Tag in the document."""
00226         return self._findOne(self.findAllPrevious, name, attrs, text, **kwargs)

def kss.core.BeautifulSoup.PageElement.findPreviousSibling (   self,
  name = None,
  attrs = {},
  text = None,
  kwargs 
) [inherited]
Returns the closest sibling to this Tag that matches the
given criteria and appears before this Tag in the document.

Definition at line 235 of file BeautifulSoup.py.

00235 
00236     def findPreviousSibling(self, name=None, attrs={}, text=None, **kwargs):
00237         """Returns the closest sibling to this Tag that matches the
00238         given criteria and appears before this Tag in the document."""
00239         return self._findOne(self.findPreviousSiblings, name, attrs, text,
00240                              **kwargs)

def kss.core.BeautifulSoup.PageElement.findPreviousSiblings (   self,
  name = None,
  attrs = {},
  text = None,
  limit = None,
  kwargs 
) [inherited]
Returns the siblings of this Tag that match the given
criteria and appear before this Tag in the document.

Definition at line 242 of file BeautifulSoup.py.

00242     def findPreviousSiblings(self, name=None, attrs={}, text=None,
00243                              limit=None, **kwargs):
00244         """Returns the siblings of this Tag that match the given
00245         criteria and appear before this Tag in the document."""
00246         return self._findAll(name, attrs, text, limit,
00247                              self.previousSiblingGenerator, **kwargs)

def kss.core.BeautifulSoup.BeautifulStoneSoup.handle_charref (   self,
  ref 
) [inherited]

Definition at line 1219 of file BeautifulSoup.py.

01219 
01220     def handle_charref(self, ref):
01221         "Handle character references as data."
01222         if ref[0] == 'x':
01223             data = unichr(int(ref[1:],16))
01224         else:
01225             data = unichr(int(ref))
01226         
01227         if u'\x80' <= data <= u'\x9F':
01228             data = UnicodeDammit.subMSChar(chr(ord(data)), self.smartQuotesTo)
01229         elif not self.convertHTMLEntities and not self.convertXMLEntities:
01230             data = '&#%s;' % ref
01231 
01232         self.handle_data(data)
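
The numeric decoding step of `handle_charref` can be sketched on its own. The function names below are hypothetical, the code is Python 3 (so `chr` stands in for the listing's `unichr`), and the smart-quote substitution done by `UnicodeDammit.subMSChar` is only detected here, not performed.

```python
def decode_charref(ref):
    """Decode the body of a numeric character reference such as
    "65" (from &#65;) or "x41" (from &#x41;)."""
    if ref[0] == "x":
        return chr(int(ref[1:], 16))
    return chr(int(ref))


def is_windows_1252_leftover(char):
    """True for code points 0x80-0x9F, which the listing hands to the
    smart-quote substitution instead of keeping them as-is."""
    return "\x80" <= char <= "\x9f"
```

As in the listing, only a lowercase `x` prefix is treated as hexadecimal.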

def kss.core.BeautifulSoup.BeautifulStoneSoup.handle_comment (   self,
  text 
) [inherited]

Definition at line 1215 of file BeautifulSoup.py.

01215 
01216     def handle_comment(self, text):
01217         "Handle comments as Comment objects."
01218         self._toStringSubclass(text, Comment)

def kss.core.BeautifulSoup.BeautifulStoneSoup.handle_data (   self,
  data 
) [inherited]

Definition at line 1190 of file BeautifulSoup.py.

01190 
01191     def handle_data(self, data):
01192         if self.convertHTMLEntities:
01193             if data[0] == '&':
01194                 data = self.BARE_AMPERSAND.sub("&amp;",data)
01195             else:
01196                 data = data.replace('&','&amp;') \
01197                            .replace('<','&lt;') \
01198                            .replace('>','&gt;')
01199         self.currentData.append(data)

def kss.core.BeautifulSoup.BeautifulStoneSoup.handle_decl (   self,
  data 
) [inherited]

Definition at line 1251 of file BeautifulSoup.py.

01251 
01252     def handle_decl(self, data):
01253         "Handle DOCTYPEs and the like as Declaration objects."
01254         self._toStringSubclass(data, Declaration)

def kss.core.BeautifulSoup.BeautifulStoneSoup.handle_entityref (   self,
  ref 
) [inherited]
Handle entity references as data, possibly converting known
HTML entity references to the corresponding Unicode
characters.

Definition at line 1233 of file BeautifulSoup.py.

01233 
01234     def handle_entityref(self, ref):
01235         """Handle entity references as data, possibly converting known
01236         HTML entity references to the corresponding Unicode
01237         characters."""
01238         replaceWithXMLEntity = self.convertXMLEntities and \
01239                                self.XML_ENTITIES_TO_CHARS.has_key(ref)
01240         if self.convertHTMLEntities or replaceWithXMLEntity:
01241             try:
01242                 data = unichr(name2codepoint[ref])
01243             except KeyError:
01244                 if replaceWithXMLEntity:
01245                     data = self.XML_ENTITIES_TO_CHARS.get(ref)
01246                 else:
01247                     data="&amp;%s" % ref
01248         else:
01249             data = '&%s;' % ref
01250         self.handle_data(data)
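
The decision table in `handle_entityref` can be sketched as a pure function. The name `resolve_entity`, its keyword parameters and the contents of `XML_ENTITIES_TO_CHARS` below are assumptions (the listing does not show that mapping); the code is Python 3 and uses the standard library's `html.entities.name2codepoint`.

```python
from html.entities import name2codepoint

# Assumed contents: the five predefined XML entities.
XML_ENTITIES_TO_CHARS = {"apos": "'", "quot": '"', "amp": "&",
                         "lt": "<", "gt": ">"}


def resolve_entity(ref, convert_html=False, convert_xml=False):
    """Return the text that an entity reference named `ref` becomes."""
    replace_with_xml = convert_xml and ref in XML_ENTITIES_TO_CHARS
    if convert_html or replace_with_xml:
        try:
            return chr(name2codepoint[ref])
        except KeyError:
            if replace_with_xml:
                return XML_ENTITIES_TO_CHARS[ref]
            # Unknown entity: escape the ampersand, as the listing does.
            return "&amp;%s" % ref
    # No conversion requested: reproduce the reference unchanged.
    return "&%s;" % ref
```

With no conversion flags set the reference is passed through verbatim; with HTML conversion enabled an unknown name comes back with its ampersand escaped and without a trailing semicolon, mirroring the listing.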
        

def kss.core.BeautifulSoup.BeautifulStoneSoup.handle_pi (   self,
  text 
) [inherited]
Handle a processing instruction as a ProcessingInstruction
object, possibly one with a %SOUP-ENCODING% slot into which an
encoding will be plugged later.

Definition at line 1207 of file BeautifulSoup.py.

01207 
01208     def handle_pi(self, text):
01209         """Handle a processing instruction as a ProcessingInstruction
01210         object, possibly one with a %SOUP-ENCODING% slot into which an
01211         encoding will be plugged later."""
01212         if text[:3] == "xml":
01213             text = "xml version='1.0' encoding='%SOUP-ENCODING%'"
01214         self._toStringSubclass(text, ProcessingInstruction)

def kss.core.BeautifulSoup.PageElement.insert (   self,
  position,
  newChild 
) [inherited]

Definition at line 137 of file BeautifulSoup.py.

00137 
00138     def insert(self, position, newChild):
00139         if (isinstance(newChild, basestring)
00140             or isinstance(newChild, unicode)) \
00141             and not isinstance(newChild, NavigableString):
00142             newChild = NavigableString(newChild)        
00143 
00144         position =  min(position, len(self.contents))
00145         if hasattr(newChild, 'parent') and newChild.parent != None:
00146             # We're 'inserting' an element that's already one
00147             # of this object's children. 
00148             if newChild.parent == self:
00149                 index = self.find(newChild)
00150                 if index and index < position:
00151                     # Furthermore we're moving it further down the
00152                     # list of this object's children. That means that
00153                     # when we extract this element, our target index
00154                     # will jump down one.
00155                     position = position - 1
00156             newChild.extract()
00157             
00158         newChild.parent = self
00159         previousChild = None
00160         if position == 0:
00161             newChild.previousSibling = None
00162             newChild.previous = self
00163         else:
00164             previousChild = self.contents[position-1]
00165             newChild.previousSibling = previousChild
00166             newChild.previousSibling.nextSibling = newChild
00167             newChild.previous = previousChild._lastRecursiveChild()
00168         if newChild.previous:
00169             newChild.previous.next = newChild        
00170 
00171         newChildsLastElement = newChild._lastRecursiveChild()
00172 
00173         if position >= len(self.contents):
00174             newChild.nextSibling = None
00175             
00176             parent = self
00177             parentsNextSibling = None
00178             while not parentsNextSibling:
00179                 parentsNextSibling = parent.nextSibling
00180                 parent = parent.parent
00181                 if not parent: # This is the last element in the document.
00182                     break
00183             if parentsNextSibling:
00184                 newChildsLastElement.next = parentsNextSibling
00185             else:
00186                 newChildsLastElement.next = None
00187         else:
00188             nextChild = self.contents[position]            
00189             newChild.nextSibling = nextChild            
00190             if newChild.nextSibling:
00191                 newChild.nextSibling.previousSibling = newChild
00192             newChildsLastElement.next = nextChild
00193 
00194         if newChildsLastElement.next:
00195             newChildsLastElement.next.previous = newChildsLastElement
00196         self.contents.insert(position, newChild)
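
The index adjustment near the top of `insert` (moving a child that already belongs to the same parent) can be sketched with a plain list. The function `move_within` is hypothetical and standalone Python 3. One deliberate difference from the listing: the sketch compares the index directly instead of testing `if index and ...`, so a child currently at index 0 is adjusted as well.

```python
def move_within(items, item, position):
    """Move `item`, already present in `items`, so that it ends up just
    before the element that currently sits at `position`."""
    position = min(position, len(items))
    index = items.index(item)
    if index < position:
        # Removing the item shifts every later element down by one.
        position -= 1
    items.remove(item)
    items.insert(position, item)
    return items
```

For example, moving `"a"` in `["a", "b", "c", "d"]` to position 3 yields `["b", "c", "a", "d"]`: the item lands immediately before `"d"`, which occupied index 3 before the move.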

def kss.core.BeautifulSoup.BeautifulStoneSoup.isSelfClosingTag (   self,
  name 
) [inherited]
Returns true iff the given string is the name of a
self-closing tag according to this parser.

Definition at line 1018 of file BeautifulSoup.py.

01018 
01019     def isSelfClosingTag(self, name):
01020         """Returns true iff the given string is the name of a
01021         self-closing tag according to this parser."""
01022         return self.SELF_CLOSING_TAGS.has_key(name) \
01023                or self.instanceSelfClosingTags.has_key(name)
            

def kss.core.BeautifulSoup.PageElement.nextGenerator (   self) [inherited]

Definition at line 302 of file BeautifulSoup.py.

00302 
00303     def nextGenerator(self):
00304         i = self
00305         while i:
00306             i = i.next
00307             yield i
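
The generator pattern used by this and the other `*Generator` methods can be tried on a tiny linked list. The class `Item` and the function `next_generator` are hypothetical stand-ins; standalone Python 3.

```python
class Item:
    """Minimal linked-list element, an assumption for illustration."""

    def __init__(self, value):
        self.value = value
        self.next = None


def next_generator(start):
    """Same loop shape as the listing: advance first, then yield."""
    i = start
    while i:
        i = i.next
        yield i
```

Because the loop advances before yielding, the starting element itself is never produced, and the final value yielded is `None` once the end of the chain is reached. Code consuming such a generator should be prepared for that trailing `None`.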

def kss.core.BeautifulSoup.PageElement.nextSiblingGenerator (   self) [inherited]

Definition at line 308 of file BeautifulSoup.py.

00308 
00309     def nextSiblingGenerator(self):
00310         i = self
00311         while i:
00312             i = i.nextSibling
00313             yield i

def kss.core.BeautifulSoup.PageElement.parentGenerator (   self) [inherited]

Definition at line 326 of file BeautifulSoup.py.

00326 
00327     def parentGenerator(self):
00328         i = self
00329         while i:
00330             i = i.parent
00331             yield i

def kss.core.BeautifulSoup.BeautifulStoneSoup.parse_declaration (   self,
  i 
) [inherited]
Treat a bogus SGML declaration as raw data. Treat a CDATA
declaration as a CData object.

Definition at line 1255 of file BeautifulSoup.py.

01255 
01256     def parse_declaration(self, i):
01257         """Treat a bogus SGML declaration as raw data. Treat a CDATA
01258         declaration as a CData object."""
01259         j = None
01260         if self.rawdata[i:i+9] == '<![CDATA[':
01261              k = self.rawdata.find(']]>', i)
01262              if k == -1:
01263                  k = len(self.rawdata)
01264              data = self.rawdata[i+9:k]
01265              j = k+3
01266              self._toStringSubclass(data, CData)
01267         else:
01268             try:
01269                 j = SGMLParser.parse_declaration(self, i)
01270             except SGMLParseError:
01271                 toHandle = self.rawdata[i:]
01272                 self.handle_data(toHandle)
01273                 j = i + len(toHandle)
01274         return j
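
The CDATA branch of `parse_declaration` is plain string slicing and can be sketched without the parser. The function `extract_cdata` is hypothetical; standalone Python 3.

```python
def extract_cdata(rawdata, i):
    """If a CDATA section starts at index `i`, return a tuple of its
    content and the index just past the section; otherwise None."""
    if rawdata[i:i + 9] != "<![CDATA[":
        return None
    k = rawdata.find("]]>", i)
    if k == -1:
        # Unterminated section: treat the rest of the input as content.
        k = len(rawdata)
    return rawdata[i + 9:k], k + 3
```

As in the listing, an unterminated section still advances the position by three past the end of the data, so the returned index can exceed the input length.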

def kss.core.BeautifulSoup.BeautifulStoneSoup.popTag (   self) [inherited]

Reimplemented in kss.core.BeautifulSoup.BeautifulSOAP.

Definition at line 1034 of file BeautifulSoup.py.

01034 
01035     def popTag(self):
01036         tag = self.tagStack.pop()
01037         # Tags with just one string-owning child get the child as a
01038         # 'string' property, so that soup.tag.string is shorthand for
01039         # soup.tag.contents[0]
01040         if len(self.currentTag.contents) == 1 and \
01041            isinstance(self.currentTag.contents[0], NavigableString):
01042             self.currentTag.string = self.currentTag.contents[0]
01043 
01044         #print "Pop", tag.name
01045         if self.tagStack:
01046             self.currentTag = self.tagStack[-1]
01047         return self.currentTag

def kss.core.BeautifulSoup.PageElement.previousGenerator (   self) [inherited]

Definition at line 314 of file BeautifulSoup.py.

00314 
00315     def previousGenerator(self):
00316         i = self
00317         while i:
00318             i = i.previous
00319             yield i

def kss.core.BeautifulSoup.PageElement.previousSiblingGenerator (   self) [inherited]

Definition at line 320 of file BeautifulSoup.py.

00320 
00321     def previousSiblingGenerator(self):
00322         i = self
00323         while i:
00324             i = i.previousSibling
00325             yield i

def kss.core.BeautifulSoup.BeautifulStoneSoup.pushTag (   self,
  tag 
) [inherited]

Definition at line 1048 of file BeautifulSoup.py.

01048 
01049     def pushTag(self, tag):
01050         #print "Push", tag.name
01051         if self.currentTag:
01052             self.currentTag.append(tag)
01053         self.tagStack.append(tag)
01054         self.currentTag = self.tagStack[-1]

def kss.core.BeautifulSoup.PageElement.replaceWith (   self,
  replaceWith 
) [inherited]

Definition at line 88 of file BeautifulSoup.py.

00088 
00089     def replaceWith(self, replaceWith):        
00090         oldParent = self.parent
00091         myIndex = self.parent.contents.index(self)
00092         if hasattr(replaceWith, 'parent') and replaceWith.parent == self.parent:
00093             # We're replacing this element with one of its siblings.
00094             index = self.parent.contents.index(replaceWith)
00095             if index and index < myIndex:
00096                 # Furthermore, it comes before this element. That
00097                 # means that when we extract it, the index of this
00098                 # element will change.
00099                 myIndex = myIndex - 1
00100         self.extract()        
00101         oldParent.insert(myIndex, replaceWith)
        

def kss.core.BeautifulSoup.BeautifulStoneSoup.reset (   self) [inherited]

Definition at line 1024 of file BeautifulSoup.py.

01024 
01025     def reset(self):
01026         Tag.__init__(self, self, self.ROOT_TAG_NAME)
01027         self.hidden = 1
01028         SGMLParser.reset(self)
01029         self.currentData = []
01030         self.currentTag = None
01031         self.tagStack = []
01032         self.quoteStack = []
01033         self.pushTag(self)
    

def kss.core.BeautifulSoup.PageElement.setup (   self,
  parent = None,
  previous = None 
) [inherited]
Sets up the initial relations between this element and
other elements.

Definition at line 76 of file BeautifulSoup.py.

00076 
00077     def setup(self, parent=None, previous=None):
00078         """Sets up the initial relations between this element and
00079         other elements."""        
00080         self.parent = parent
00081         self.previous = previous
00082         self.next = None
00083         self.previousSibling = None
00084         self.nextSibling = None
00085         if self.parent and self.parent.contents:
00086             self.previousSibling = self.parent.contents[-1]
00087             self.previousSibling.nextSibling = self

def kss.core.BeautifulSoup.BeautifulSoup.start_meta (   self,
  attrs 
)
Beautiful Soup can detect a charset included in a META tag,
try to convert the document to that charset, and re-parse the
document from the beginning.

Definition at line 1378 of file BeautifulSoup.py.

01378 
01379     def start_meta(self, attrs):
01380         """Beautiful Soup can detect a charset included in a META tag,
01381         try to convert the document to that charset, and re-parse the
01382         document from the beginning."""
01383         httpEquiv = None
01384         contentType = None
01385         contentTypeIndex = None
01386         tagNeedsEncodingSubstitution = False
01387 
01388         for i in range(0, len(attrs)):
01389             key, value = attrs[i]
01390             key = key.lower()
01391             if key == 'http-equiv':
01392                 httpEquiv = value
01393             elif key == 'content':
01394                 contentType = value
01395                 contentTypeIndex = i
01396 
01397         if httpEquiv and contentType: # It's an interesting meta tag.
01398             match = self.CHARSET_RE.search(contentType)
01399             if match:
01400                 if getattr(self, 'declaredHTMLEncoding') or \
01401                        (self.originalEncoding == self.fromEncoding):
01402                     # This is our second pass through the document, or
01403                     # else an encoding was specified explicitly and it
01404                     # worked. Rewrite the meta tag.
01405                     newAttr = self.CHARSET_RE.sub\
01406                               (lambda(match):match.group(1) +
01407                                "%SOUP-ENCODING%", value)
01408                     attrs[contentTypeIndex] = (attrs[contentTypeIndex][0],
01409                                                newAttr)
01410                     tagNeedsEncodingSubstitution = True
01411                 else:
01412                     # This is our first pass through the document.
01413                     # Go through it again with the new information.
01414                     newCharset = match.group(3)
01415                     if newCharset and newCharset != self.originalEncoding:
01416                         self.declaredHTMLEncoding = newCharset
01417                         self._feed(self.declaredHTMLEncoding)
01418                         raise StopParsing
01419         tag = self.unknown_starttag("meta", attrs)
01420         if tag and tagNeedsEncodingSubstitution:
01421             tag.containsSubstitutions = True
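
The two uses of `CHARSET_RE` in `start_meta` (reading the declared charset on the first pass, and replacing it with a placeholder on the second) can be tried in isolation. The pattern is copied from the attribute list on this page; the helper names `declared_charset` and `mark_charset_slot` are hypothetical, and the code is standalone Python 3.

```python
import re

CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)")


def declared_charset(content):
    """Return the charset named in a Content-Type value, or None."""
    match = CHARSET_RE.search(content)
    return match.group(3) if match else None


def mark_charset_slot(content):
    """Replace the charset value with the %SOUP-ENCODING% placeholder."""
    return CHARSET_RE.sub(
        lambda match: match.group(1) + "%SOUP-ENCODING%", content)
```

The placeholder is what `substituteEncoding` later fills in with a plain string replace, so `mark_charset_slot(value).replace("%SOUP-ENCODING%", "utf-8")` produces a Content-Type value that names the output encoding.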

def kss.core.BeautifulSoup.PageElement.substituteEncoding (   self,
  str,
  encoding = None 
) [inherited]

Definition at line 333 of file BeautifulSoup.py.

00333 
00334     def substituteEncoding(self, str, encoding=None):
00335         encoding = encoding or "utf-8"
00336         return str.replace("%SOUP-ENCODING%", encoding)    

def kss.core.BeautifulSoup.PageElement.toEncoding (   self,
  s,
  encoding = None 
) [inherited]
Encodes an object to a string in some encoding, or to Unicode.

Definition at line 337 of file BeautifulSoup.py.

00337 
00338     def toEncoding(self, s, encoding=None):
00339         """Encodes an object to a string in some encoding, or to
00340         Unicode."""
00341         if isinstance(s, unicode):
00342             if encoding:
00343                 s = s.encode(encoding)
00344         elif isinstance(s, str):
00345             if encoding:
00346                 s = s.encode(encoding)
00347             else:
00348                 s = unicode(s)
00349         else:
00350             if encoding:
00351                 s  = self.toEncoding(str(s), encoding)
00352             else:
00353                 s = unicode(s)
00354         return s

def kss.core.BeautifulSoup.BeautifulStoneSoup.unknown_endtag (   self,
  name 
) [inherited]

Definition at line 1177 of file BeautifulSoup.py.

01177 
01178     def unknown_endtag(self, name):
01179         #print "End tag %s" % name
01180         if self.quoteStack and self.quoteStack[-1] != name:
01181             #This is not a real end tag.
01182             #print "</%s> is not real!" % name
01183             self.currentData.append('</%s>' % name)
01184             return
01185         self.endData()
01186         self._popToTag(name)
01187         if self.quoteStack and self.quoteStack[-1] == name:
01188             self.quoteStack.pop()
01189             self.literal = (len(self.quoteStack) > 0)

def kss.core.BeautifulSoup.BeautifulStoneSoup.unknown_starttag (   self,
  name,
  attrs,
  selfClosing = 0 
) [inherited]

Definition at line 1147 of file BeautifulSoup.py.

01147 
01148     def unknown_starttag(self, name, attrs, selfClosing=0):
01149         #print "Start tag %s: %s" % (name, attrs)
01150         if self.quoteStack:
01151             #This is not a real tag.
01152             #print "<%s> is not real!" % name
01153             attrs = ''.join(map(lambda(x, y): ' %s="%s"' % (x, y), attrs))
01154             self.currentData.append('<%s%s>' % (name, attrs))
01155             return        
01156         self.endData()
01157 
01158         if not self.isSelfClosingTag(name) and not selfClosing:
01159             self._smartPop(name)
01160 
01161         if self.parseOnlyThese and len(self.tagStack) <= 1 \
01162                and (self.parseOnlyThese.text or not self.parseOnlyThese.searchTag(name, attrs)):
01163             return
01164 
01165         tag = Tag(self, name, attrs, self.currentTag, self.previous)
01166         if self.previous:
01167             self.previous.next = tag
01168         self.previous = tag
01169         self.pushTag(tag)
01170         if selfClosing or self.isSelfClosingTag(name):
01171             self.popTag()                
01172         if name in self.QUOTE_TAGS:
01173             #print "Beginning quote (%s)" % name
01174             self.quoteStack.append(name)
01175             self.literal = 1
01176         return tag
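The two handlers above cooperate through `quoteStack`: while a quote tag such as `script` is open, nested tags are appended to the current text instead of becoming real tags. The sketch below isolates that bookkeeping. `QuoteTracker` and its attribute names are hypothetical, and tree construction (`Tag`, `_popToTag`, `_smartPop`) is deliberately left out.

```python
QUOTE_TAGS = {"script": None}


class QuoteTracker:
    """Hypothetical stand-in that keeps only the quote-stack logic."""

    def __init__(self):
        self.quote_stack = []
        self.literal = False
        self.current_data = []   # text collected for the open quote tag
        self.real_tags = []      # names that were treated as real tags

    def start_tag(self, name, attrs=()):
        if self.quote_stack:
            # Inside a quote tag: keep the markup as literal text.
            attr_text = "".join(' %s="%s"' % (k, v) for k, v in attrs)
            self.current_data.append("<%s%s>" % (name, attr_text))
            return
        self.real_tags.append(name)
        if name in QUOTE_TAGS:
            self.quote_stack.append(name)
            self.literal = True

    def end_tag(self, name):
        if self.quote_stack and self.quote_stack[-1] != name:
            # Not the end of the open quote tag: keep it as text.
            self.current_data.append("</%s>" % name)
            return
        if self.quote_stack and self.quote_stack[-1] == name:
            self.quote_stack.pop()
            self.literal = len(self.quote_stack) > 0


tracker = QuoteTracker()
tracker.start_tag("script")
tracker.start_tag("b", [("class", "x")])
tracker.end_tag("b")
tracker.end_tag("script")
```

The original listing builds the attribute string with `lambda(x, y): ...`, a tuple-parameter form that exists only in Python 2; the sketch uses a generator expression instead.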


Member Data Documentation

Definition at line 918 of file BeautifulSoup.py.

tuple kss.core.BeautifulSoup.BeautifulSoup.CHARSET_RE = re.compile("((^|;)\s*charset=)([^;]*)") [static]

Definition at line 1376 of file BeautifulSoup.py.
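As an illustration of what `CHARSET_RE` captures, the sketch below applies the same pattern to a `content` attribute value. The helper names `declared_charset` and `with_placeholder` are hypothetical, and pairing the pattern with the `%SOUP-ENCODING%` placeholder is an assumption based on `substituteEncoding`. The pattern is written as a raw string here to avoid escape warnings on current Python.

```python
import re

# Same pattern as CHARSET_RE, written as a raw string.
CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)")


def declared_charset(content):
    """Return the charset named in a content attribute, or None."""
    match = CHARSET_RE.search(content)
    return match.group(3) if match else None


def with_placeholder(content):
    """Swap the declared charset for the %SOUP-ENCODING% placeholder."""
    return CHARSET_RE.sub(lambda m: m.group(1) + "%SOUP-ENCODING%", content)


value = "text/html; charset=ISO-8859-1"
charset = declared_charset(value)
rewritten = with_placeholder(value)
```

Group 1 holds the `charset=` prefix including any leading semicolon and whitespace, and group 3 holds the encoding name that `start_meta` compares against `originalEncoding`.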

Definition at line 959 of file BeautifulSoup.py.

Definition at line 960 of file BeautifulSoup.py.

Definition at line 1028 of file BeautifulSoup.py.

Definition at line 1029 of file BeautifulSoup.py.

Definition at line 1415 of file BeautifulSoup.py.

Definition at line 220 of file BeautifulSoup.py.

Definition at line 266 of file BeautifulSoup.py.

Definition at line 233 of file BeautifulSoup.py.

Definition at line 247 of file BeautifulSoup.py.

Definition at line 949 of file BeautifulSoup.py.

Definition at line 1026 of file BeautifulSoup.py.

string kss.core.BeautifulSoup.BeautifulStoneSoup.HTML_ENTITIES = "html" [static, inherited]

Definition at line 916 of file BeautifulSoup.py.

Definition at line 962 of file BeautifulSoup.py.

Definition at line 965 of file BeautifulSoup.py.

Definition at line 1174 of file BeautifulSoup.py.

Definition at line 970 of file BeautifulSoup.py.

list kss.core.BeautifulSoup.BeautifulStoneSoup.MARKUP_MASSAGE [static, inherited]

Initial value:
[(re.compile('(<[^<>]*)/>'),
                       lambda x: x.group(1) + ' />'),
                      (re.compile('<!\s+([^<>]*)>'),
                       lambda x: '<!' + x.group(1) + '>')
                      ]

Definition at line 908 of file BeautifulSoup.py.
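Each entry in the initial value above pairs a compiled pattern with a replacement function. A minimal sketch of applying such pairs in order is shown below; the `massage` helper is hypothetical, and the assumption is that the parser runs the pairs over the raw markup before parsing.

```python
import re

# The two (pattern, replacement) pairs from the initial value above.
MARKUP_MASSAGE = [
    (re.compile(r"(<[^<>]*)/>"), lambda m: m.group(1) + " />"),
    (re.compile(r"<!\s+([^<>]*)>"), lambda m: "<!" + m.group(1) + ">"),
]


def massage(markup):
    """Apply every (pattern, replacement) pair in order."""
    for pattern, fix in MARKUP_MASSAGE:
        markup = pattern.sub(fix, markup)
    return markup


self_closed = massage("<br/>")
declaration = massage("<! DOCTYPE html>")
```

The first pair inserts a space before `/>` in self-closing tags, and the second removes whitespace between `<!` and the declaration name.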

Definition at line 971 of file BeautifulSoup.py.

list kss.core.BeautifulSoup.BeautifulSoup.NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del'] [static]

Definition at line 1343 of file BeautifulSoup.py.

list kss.core.BeautifulSoup.BeautifulSoup.NESTABLE_INLINE_TAGS [static]

Initial value:
['span', 'font', 'q', 'object', 'bdo', 'sub', 'sup',
                            'center']

Definition at line 1337 of file BeautifulSoup.py.

dictionary kss.core.BeautifulSoup.BeautifulSoup.NESTABLE_LIST_TAGS [static]

Initial value:
{ 'ol' : [],
                           'ul' : [],
                           'li' : ['ul', 'ol'],
                           'dl' : [],
                           'dd' : ['dl'],
                           'dt' : ['dl'] }

Definition at line 1346 of file BeautifulSoup.py.

dictionary kss.core.BeautifulSoup.BeautifulSoup.NESTABLE_TABLE_TAGS [static]

Initial value:
{'table' : [], 
                           'tr' : ['table', 'tbody', 'tfoot', 'thead'],
                           'td' : ['tr'],
                           'th' : ['tr'],
                           'thead' : ['table'],
                           'tbody' : ['table'],
                           'tfoot' : ['table'],
                           }

Definition at line 1354 of file BeautifulSoup.py.

tuple kss.core.BeautifulSoup.BeautifulSoup.NESTABLE_TAGS [static]

Initial value:
buildTagMap([], NESTABLE_INLINE_TAGS, NESTABLE_BLOCK_TAGS,
                                NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS)

Reimplemented from kss.core.BeautifulSoup.BeautifulStoneSoup.

Reimplemented in kss.core.BeautifulSoup.MinimalSoup, and kss.core.BeautifulSoup.ICantBelieveItsBeautifulSoup.

Definition at line 1372 of file BeautifulSoup.py.
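`buildTagMap` itself is not documented in this section, so the helper below is only a guess at its behaviour, inferred from how it is called: dictionaries are merged as they are, and names from lists or single strings are mapped to the default value. The name `build_tag_map` is hypothetical; the tag lists are copied from the initial values above.

```python
def build_tag_map(default, *portions):
    """Merge dicts, lists and single names into one tag -> value map."""
    built = {}
    for portion in portions:
        if hasattr(portion, "items"):
            # Dictionary: keep its own per-tag values.
            built.update(portion)
        elif isinstance(portion, (list, tuple)):
            # Sequence of tag names: each gets the default value.
            for name in portion:
                built[name] = default
        else:
            # Single tag name.
            built[portion] = default
    return built


NESTABLE_INLINE_TAGS = ['span', 'font', 'q', 'object', 'bdo', 'sub', 'sup',
                        'center']
NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']
NESTABLE_LIST_TAGS = {'ol': [], 'ul': [], 'li': ['ul', 'ol'],
                      'dl': [], 'dd': ['dl'], 'dt': ['dl']}
NESTABLE_TABLE_TAGS = {'table': [],
                       'tr': ['table', 'tbody', 'tfoot', 'thead'],
                       'td': ['tr'], 'th': ['tr'],
                       'thead': ['table'], 'tbody': ['table'],
                       'tfoot': ['table']}

NESTABLE_TAGS = build_tag_map([], NESTABLE_INLINE_TAGS, NESTABLE_BLOCK_TAGS,
                              NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS)
```

Under this reading, a tag such as `td` maps to the list of tags that reset its nesting (`['tr']`), while plain inline and block tags map to an empty list.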

Definition at line 81 of file BeautifulSoup.py.

Definition at line 83 of file BeautifulSoup.py.

list kss.core.BeautifulSoup.BeautifulSoup.NON_NESTABLE_BLOCK_TAGS = ['address', 'form', 'p', 'pre'] [static]

Definition at line 1363 of file BeautifulSoup.py.

Reimplemented from kss.core.BeautifulSoup.BeautifulStoneSoup.

Definition at line 1400 of file BeautifulSoup.py.

Definition at line 79 of file BeautifulSoup.py.

Definition at line 948 of file BeautifulSoup.py.

Reimplemented from kss.core.BeautifulSoup.PageElement.

Definition at line 1074 of file BeautifulSoup.py.

Definition at line 82 of file BeautifulSoup.py.

dictionary kss.core.BeautifulSoup.BeautifulSoup.QUOTE_TAGS = {'script': None} [static]

Reimplemented from kss.core.BeautifulSoup.BeautifulStoneSoup.

Definition at line 1332 of file BeautifulSoup.py.

Definition at line 1031 of file BeautifulSoup.py.

tuple kss.core.BeautifulSoup.BeautifulSoup.RESET_NESTING_TAGS [static]

Initial value:
buildTagMap(None, NESTABLE_BLOCK_TAGS, 'noscript',
                                     NON_NESTABLE_BLOCK_TAGS,
                                     NESTABLE_LIST_TAGS,
                                     NESTABLE_TABLE_TAGS)

Reimplemented from kss.core.BeautifulSoup.BeautifulStoneSoup.

Reimplemented in kss.core.BeautifulSoup.MinimalSoup.

Definition at line 1367 of file BeautifulSoup.py.

string kss.core.BeautifulSoup.BeautifulStoneSoup.ROOT_TAG_NAME = u'[document]' [static, inherited]

Definition at line 914 of file BeautifulSoup.py.

tuple kss.core.BeautifulSoup.BeautifulSoup.SELF_CLOSING_TAGS [static]

Initial value:
buildTagMap(None,
                                    ['br' , 'hr', 'input', 'img', 'meta',
                                    'spacer', 'link', 'frame', 'base'])

Reimplemented from kss.core.BeautifulSoup.BeautifulStoneSoup.

Definition at line 1328 of file BeautifulSoup.py.

Definition at line 950 of file BeautifulSoup.py.

Definition at line 1030 of file BeautifulSoup.py.

string kss.core.BeautifulSoup.BeautifulStoneSoup.XML_ENTITIES = "xml" [static, inherited]

Definition at line 917 of file BeautifulSoup.py.

Definition at line 963 of file BeautifulSoup.py.


The documentation for this class was generated from the following file: