
python3.2  3.2.2
email.feedparser.FeedParser Class Reference

Public Member Functions

def __init__
def feed
def close

Private Member Functions

def _set_headersonly
def _call_parse
def _new_message
def _pop_message
def _parsegen
def _parse_headers

Private Attributes

 _factory
 _input
 _msgstack
 _parse
 _cur
 _last
 _headersonly

Detailed Description

A feed-style parser of email.

Definition at line 137 of file feedparser.py.
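
As a rough illustration of the feed-style interface, the sketch below pushes text with feed() and collects the root message from close(). The sample message and the variable names are invented for the example and do not come from feedparser.py.

```python
from email.feedparser import FeedParser

# Invented sample message; any RFC 2822 style text would do.
raw = (
    "From: alice@example.com\n"
    "Subject: greetings\n"
    "\n"
    "Hello, world.\n"
)

parser = FeedParser()
parser.feed(raw)        # push data; feed() may be called repeatedly
msg = parser.close()    # finish parsing and obtain the root message

print(msg["Subject"])
print(msg.get_payload())
```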


Constructor & Destructor Documentation

def email.feedparser.FeedParser.__init__(self, _factory=message.Message)
_factory is called with no arguments to create a new message object.

Definition at line 140 of file feedparser.py.

00140 
00141     def __init__(self, _factory=message.Message):
00142         """_factory is called with no arguments to create a new message obj"""
00143         self._factory = _factory
00144         self._input = BufferedSubFile()
00145         self._msgstack = []
00146         self._parse = self._parsegen().__next__
00147         self._cur = None
00148         self._last = None
00149         self._headersonly = False
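
A minimal sketch of supplying a custom factory, assuming a hypothetical Message subclass named TaggedMessage (the class and its origin attribute are invented for illustration):

```python
from email.feedparser import FeedParser
from email.message import Message


class TaggedMessage(Message):
    """Hypothetical Message subclass used only for this illustration."""
    origin = "feedparser-demo"


# Every message object the parser creates comes from the factory.
parser = FeedParser(_factory=TaggedMessage)
parser.feed("Subject: factory demo\n\nbody\n")
msg = parser.close()

print(type(msg).__name__, msg.origin)
```

Any callable that returns a Message-compatible object should work the same way, since the parser only needs something it can call to obtain a fresh message.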


Member Function Documentation

def email.feedparser.FeedParser._call_parse(self) [private]

Definition at line 159 of file feedparser.py.

00159 
00160     def _call_parse(self):
00161         try:
00162             self._parse()
00163         except StopIteration:
00164             pass
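
The pattern used here (a generator's bound __next__ stored in _parse, with StopIteration swallowed by _call_parse) can be sketched in isolation. The LineCounter class and the NEED_MORE sentinel below are hypothetical stand-ins, not part of the email package:

```python
NEED_MORE = object()  # stand-in for feedparser's NeedMoreData sentinel


class LineCounter:
    """Toy push parser (hypothetical) mirroring the _parse/_call_parse pattern."""

    def __init__(self):
        self._pending = []
        self._closed = False
        self.count = 0
        # Bind the generator's __next__ once, as FeedParser.__init__ does.
        self._parse = self._parsegen().__next__

    def _parsegen(self):
        while True:
            if self._pending:
                self._pending.pop(0)
                self.count += 1
            elif self._closed:
                return            # surfaces as StopIteration in the caller
            else:
                yield NEED_MORE   # suspend until more input arrives

    def _call_parse(self):
        try:
            self._parse()
        except StopIteration:
            pass                  # generator finished; nothing left to do

    def feed(self, lines):
        self._pending.extend(lines)
        self._call_parse()

    def close(self):
        self._closed = True
        self._call_parse()
        return self.count


counter = LineCounter()
counter.feed(["a\n", "b\n"])
counter.feed(["c\n"])
total = counter.close()
print(total)
```

Swallowing StopIteration means that a call made after the generator has finished is harmless rather than an error.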

def email.feedparser.FeedParser._new_message(self) [private]

Definition at line 177 of file feedparser.py.

00177 
00178     def _new_message(self):
00179         msg = self._factory()
00180         if self._cur and self._cur.get_content_type() == 'multipart/digest':
00181             msg.set_default_type('message/rfc822')
00182         if self._msgstack:
00183             self._msgstack[-1].attach(msg)
00184         self._msgstack.append(msg)
00185         self._cur = msg
00186         self._last = msg
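
The multipart/digest special case can be observed through the public API. In this sketch (the sample digest is invented), the subpart carries no Content-Type header, so its type falls back to the default that _new_message set:

```python
from email.feedparser import FeedParser

# Invented digest with one entry; the subpart has no headers of its own.
raw = (
    'Content-Type: multipart/digest; boundary="XX"\n'
    "\n"
    "--XX\n"
    "\n"
    "Subject: first digest entry\n"
    "\n"
    "entry body\n"
    "--XX--\n"
)

parser = FeedParser()
parser.feed(raw)
digest = parser.close()

part = digest.get_payload()[0]    # the subpart created inside the digest
inner = part.get_payload()[0]     # the message nested in that subpart

print(part.get_content_type())
print(inner["Subject"])
```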

def email.feedparser.FeedParser._parse_headers(self, lines) [private]

Definition at line 431 of file feedparser.py.

00431 
00432     def _parse_headers(self, lines):
00433         # Passed a list of lines that make up the headers for the current msg
00434         lastheader = ''
00435         lastvalue = []
00436         for lineno, line in enumerate(lines):
00437             # Check for continuation
00438             if line[0] in ' \t':
00439                 if not lastheader:
00440                     # The first line of the headers was a continuation.  This
00441                     # is illegal, so let's note the defect, store the illegal
00442                     # line, and ignore it for purposes of headers.
00443                     defect = errors.FirstHeaderLineIsContinuationDefect(line)
00444                     self._cur.defects.append(defect)
00445                     continue
00446                 lastvalue.append(line)
00447                 continue
00448             if lastheader:
00449                 # XXX reconsider the joining of folded lines
00450                 lhdr = EMPTYSTRING.join(lastvalue)[:-1].rstrip('\r\n')
00451                 self._cur[lastheader] = lhdr
00452                 lastheader, lastvalue = '', []
00453             # Check for envelope header, i.e. unix-from
00454             if line.startswith('From '):
00455                 if lineno == 0:
00456                     # Strip off the trailing newline
00457                     mo = NLCRE_eol.search(line)
00458                     if mo:
00459                         line = line[:-len(mo.group(0))]
00460                     self._cur.set_unixfrom(line)
00461                     continue
00462                 elif lineno == len(lines) - 1:
00463                     # Something looking like a unix-from at the end - it's
00464                     # probably the first line of the body, so push back the
00465                     # line and stop.
00466                     self._input.unreadline(line)
00467                     return
00468                 else:
00469                     # Weirdly placed unix-from line.  Note this as a defect
00470                     # and ignore it.
00471                     defect = errors.MisplacedEnvelopeHeaderDefect(line)
00472                     self._cur.defects.append(defect)
00473                     continue
00474             # Split the line on the colon separating field name from value.
00475             i = line.find(':')
00476             if i < 0:
00477                 defect = errors.MalformedHeaderDefect(line)
00478                 self._cur.defects.append(defect)
00479                 continue
00480             lastheader = line[:i]
00481             lastvalue = [line[i+1:].lstrip()]
00482         # Done with all the lines, so handle the last header.
00483         if lastheader:
00484             # XXX reconsider the joining of folded lines
00485             self._cur[lastheader] = EMPTYSTRING.join(lastvalue).rstrip('\r\n')
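
Three of the cases handled above (folded headers, a leading Unix-From line, and a continuation line with no header before it) can be exercised through the public API. This is a sketch with invented inputs; the parse helper is hypothetical:

```python
from email import errors
from email.feedparser import FeedParser


def parse(text):
    """Hypothetical helper: run one string through a fresh FeedParser."""
    parser = FeedParser()
    parser.feed(text)
    return parser.close()


# A folded header: the continuation line is kept as part of the value.
folded = parse("Subject: hello\n world\n\nbody\n")

# A leading Unix-From envelope line is stored separately from the headers.
enveloped = parse(
    "From alice@example.com Sat Jan  1 00:00:00 2000\n"
    "Subject: x\n"
    "\n"
    "body\n"
)

# A continuation line with no preceding header is recorded as a defect.
broken = parse(" stray continuation\nSubject: y\n\nbody\n")

print(repr(folded["Subject"]))
print(enveloped.get_unixfrom())
print([type(d).__name__ for d in broken.defects])
```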

def email.feedparser.FeedParser._parsegen(self) [private]

Definition at line 195 of file feedparser.py.

00195 
00196     def _parsegen(self):
00197         # Create a new message and start by parsing headers.
00198         self._new_message()
00199         headers = []
00200         # Collect the headers, searching for a line that doesn't match the RFC
00201         # 2822 header or continuation pattern (including an empty line).
00202         for line in self._input:
00203             if line is NeedMoreData:
00204                 yield NeedMoreData
00205                 continue
00206             if not headerRE.match(line):
00207                 # If we saw the RFC defined header/body separator
00208                 # (i.e. newline), just throw it away. Otherwise the line is
00209                 # part of the body so push it back.
00210                 if not NLCRE.match(line):
00211                     self._input.unreadline(line)
00212                 break
00213             headers.append(line)
00214         # Done with the headers, so parse them and figure out what we're
00215         # supposed to see in the body of the message.
00216         self._parse_headers(headers)
00217         # Headers-only parsing is a backwards compatibility hack, which was
00218         # necessary in the older parser, which could throw errors.  All
00219         # remaining lines in the input are thrown into the message body.
00220         if self._headersonly:
00221             lines = []
00222             while True:
00223                 line = self._input.readline()
00224                 if line is NeedMoreData:
00225                     yield NeedMoreData
00226                     continue
00227                 if line == '':
00228                     break
00229                 lines.append(line)
00230             self._cur.set_payload(EMPTYSTRING.join(lines))
00231             return
00232         if self._cur.get_content_type() == 'message/delivery-status':
00233             # message/delivery-status contains blocks of headers separated by
00234             # a blank line.  We'll represent each header block as a separate
00235             # nested message object, but the processing is a bit different
00236             # than standard message/* types because there is no body for the
00237             # nested messages.  A blank line separates the subparts.
00238             while True:
00239                 self._input.push_eof_matcher(NLCRE.match)
00240                 for retval in self._parsegen():
00241                     if retval is NeedMoreData:
00242                         yield NeedMoreData
00243                         continue
00244                     break
00245                 msg = self._pop_message()
00246                 # We need to pop the EOF matcher in order to tell if we're at
00247                 # the end of the current file, not the end of the last block
00248                 # of message headers.
00249                 self._input.pop_eof_matcher()
00250                 # The input stream must be sitting at the newline or at the
00251                 # EOF.  We want to see if we're at the end of this subpart, so
00252                 # first consume the blank line, then test the next line to see
00253                 # if we're at this subpart's EOF.
00254                 while True:
00255                     line = self._input.readline()
00256                     if line is NeedMoreData:
00257                         yield NeedMoreData
00258                         continue
00259                     break
00260                 while True:
00261                     line = self._input.readline()
00262                     if line is NeedMoreData:
00263                         yield NeedMoreData
00264                         continue
00265                     break
00266                 if line == '':
00267                     break
00268                 # Not at EOF so this is a line we're going to need.
00269                 self._input.unreadline(line)
00270             return
00271         if self._cur.get_content_maintype() == 'message':
00272             # The message claims to be a message/* type, then what follows is
00273             # another RFC 2822 message.
00274             for retval in self._parsegen():
00275                 if retval is NeedMoreData:
00276                     yield NeedMoreData
00277                     continue
00278                 break
00279             self._pop_message()
00280             return
00281         if self._cur.get_content_maintype() == 'multipart':
00282             boundary = self._cur.get_boundary()
00283             if boundary is None:
00284                 # The message /claims/ to be a multipart but it has not
00285                 # defined a boundary.  That's a problem which we'll handle by
00286                 # reading everything until the EOF and marking the message as
00287                 # defective.
00288                 self._cur.defects.append(errors.NoBoundaryInMultipartDefect())
00289                 lines = []
00290                 for line in self._input:
00291                     if line is NeedMoreData:
00292                         yield NeedMoreData
00293                         continue
00294                     lines.append(line)
00295                 self._cur.set_payload(EMPTYSTRING.join(lines))
00296                 return
00297             # Create a line match predicate which matches the inter-part
00298             # boundary as well as the end-of-multipart boundary.  Don't push
00299             # this onto the input stream until we've scanned past the
00300             # preamble.
00301             separator = '--' + boundary
00302             boundaryre = re.compile(
00303                 '(?P<sep>' + re.escape(separator) +
00304                 r')(?P<end>--)?(?P<ws>[ \t]*)(?P<linesep>\r\n|\r|\n)?$')
00305             capturing_preamble = True
00306             preamble = []
00307             linesep = False
00308             while True:
00309                 line = self._input.readline()
00310                 if line is NeedMoreData:
00311                     yield NeedMoreData
00312                     continue
00313                 if line == '':
00314                     break
00315                 mo = boundaryre.match(line)
00316                 if mo:
00317                     # If we're looking at the end boundary, we're done with
00318                     # this multipart.  If there was a newline at the end of
00319                     # the closing boundary, then we need to initialize the
00320                     # epilogue with the empty string (see below).
00321                     if mo.group('end'):
00322                         linesep = mo.group('linesep')
00323                         break
00324                     # We saw an inter-part boundary.  Were we in the preamble?
00325                     if capturing_preamble:
00326                         if preamble:
00327                             # According to RFC 2046, the last newline belongs
00328                             # to the boundary.
00329                             lastline = preamble[-1]
00330                             eolmo = NLCRE_eol.search(lastline)
00331                             if eolmo:
00332                                 preamble[-1] = lastline[:-len(eolmo.group(0))]
00333                             self._cur.preamble = EMPTYSTRING.join(preamble)
00334                         capturing_preamble = False
00335                         self._input.unreadline(line)
00336                         continue
00337                     # We saw a boundary separating two parts.  Consume any
00338                     # multiple boundary lines that may be following.  Our
00339                     # interpretation of RFC 2046 BNF grammar does not produce
00340                     # body parts within such double boundaries.
00341                     while True:
00342                         line = self._input.readline()
00343                         if line is NeedMoreData:
00344                             yield NeedMoreData
00345                             continue
00346                         mo = boundaryre.match(line)
00347                         if not mo:
00348                             self._input.unreadline(line)
00349                             break
00350                     # Recurse to parse this subpart; the input stream points
00351                     # at the subpart's first line.
00352                     self._input.push_eof_matcher(boundaryre.match)
00353                     for retval in self._parsegen():
00354                         if retval is NeedMoreData:
00355                             yield NeedMoreData
00356                             continue
00357                         break
00358                     # Because of RFC 2046, the newline preceding the boundary
00359                     # separator actually belongs to the boundary, not the
00360                     # previous subpart's payload (or epilogue if the previous
00361                     # part is a multipart).
00362                     if self._last.get_content_maintype() == 'multipart':
00363                         epilogue = self._last.epilogue
00364                         if epilogue == '':
00365                             self._last.epilogue = None
00366                         elif epilogue is not None:
00367                             mo = NLCRE_eol.search(epilogue)
00368                             if mo:
00369                                 end = len(mo.group(0))
00370                                 self._last.epilogue = epilogue[:-end]
00371                     else:
00372                         payload = self._last._payload
00373                         if isinstance(payload, str):
00374                             mo = NLCRE_eol.search(payload)
00375                             if mo:
00376                                 payload = payload[:-len(mo.group(0))]
00377                                 self._last._payload = payload
00378                     self._input.pop_eof_matcher()
00379                     self._pop_message()
00380                     # Set the multipart up for newline cleansing, which will
00381                     # happen if we're in a nested multipart.
00382                     self._last = self._cur
00383                 else:
00384                     # I think we must be in the preamble
00385                     assert capturing_preamble
00386                     preamble.append(line)
00387             # We've seen either the EOF or the end boundary.  If we're still
00388             # capturing the preamble, we never saw the start boundary.  Note
00389             # that as a defect and store the captured text as the payload.
00390             # Everything from here to the EOF is epilogue.
00391             if capturing_preamble:
00392                 self._cur.defects.append(errors.StartBoundaryNotFoundDefect())
00393                 self._cur.set_payload(EMPTYSTRING.join(preamble))
00394                 epilogue = []
00395                 for line in self._input:
00396                     if line is NeedMoreData:
00397                         yield NeedMoreData
00398                         continue
00399                 self._cur.epilogue = EMPTYSTRING.join(epilogue)
00400                 return
00401             # If the end boundary ended in a newline, we'll need to make sure
00402             # the epilogue isn't None
00403             if linesep:
00404                 epilogue = ['']
00405             else:
00406                 epilogue = []
00407             for line in self._input:
00408                 if line is NeedMoreData:
00409                     yield NeedMoreData
00410                     continue
00411                 epilogue.append(line)
00412             # Any CRLF at the front of the epilogue is not technically part of
00413             # the epilogue.  Also, watch out for an empty string epilogue,
00414             # which means a single newline.
00415             if epilogue:
00416                 firstline = epilogue[0]
00417                 bolmo = NLCRE_bol.match(firstline)
00418                 if bolmo:
00419                     epilogue[0] = firstline[len(bolmo.group(0)):]
00420             self._cur.epilogue = EMPTYSTRING.join(epilogue)
00421             return
00422         # Otherwise, it's some non-multipart type, so the entire rest of the
00423         # file contents becomes the payload.
00424         lines = []
00425         for line in self._input:
00426             if line is NeedMoreData:
00427                 yield NeedMoreData
00428                 continue
00429             lines.append(line)
00430         self._cur.set_payload(EMPTYSTRING.join(lines))
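
The multipart branch above separates preamble, subparts, and epilogue, and treats the newline before each boundary as part of the boundary. A sketch with an invented two-part message shows the effect:

```python
from email.feedparser import FeedParser

# Invented multipart message with a preamble and an epilogue.
raw = (
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "\n"
    "This is the preamble.\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "first part\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "second part\n"
    "--XX--\n"
    "This is the epilogue.\n"
)

parser = FeedParser()
parser.feed(raw)
msg = parser.close()
parts = msg.get_payload()

print(repr(msg.preamble))
print([p.get_payload() for p in parts])
print(repr(msg.epilogue))
```

Note that the preamble and both payloads lose their final newline, while the epilogue keeps its own.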

def email.feedparser.FeedParser._pop_message(self) [private]

Definition at line 187 of file feedparser.py.

00187 
00188     def _pop_message(self):
00189         retval = self._msgstack.pop()
00190         if self._msgstack:
00191             self._cur = self._msgstack[-1]
00192         else:
00193             self._cur = None
00194         return retval
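
The stack bookkeeping shared by _new_message and _pop_message can be sketched on its own. The MessageStack class below is a hypothetical simplification (it omits _last and the digest handling), not code from feedparser.py:

```python
from email.message import Message


class MessageStack:
    """Hypothetical stand-alone mirror of the _msgstack/_cur bookkeeping."""

    def __init__(self):
        self._msgstack = []
        self._cur = None

    def push(self):
        msg = Message()
        if self._msgstack:
            # A new message becomes a subpart of the innermost open message.
            self._msgstack[-1].attach(msg)
        self._msgstack.append(msg)
        self._cur = msg
        return msg

    def pop(self):
        retval = self._msgstack.pop()
        # After popping, the enclosing message (if any) is current again.
        self._cur = self._msgstack[-1] if self._msgstack else None
        return retval


stack = MessageStack()
root = stack.push()
child = stack.push()
popped_child = stack.pop()
current_after_child = stack._cur
popped_root = stack.pop()
```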

def email.feedparser.FeedParser._set_headersonly(self) [private]

Definition at line 151 of file feedparser.py.

00151 
00152     def _set_headersonly(self):
00153         self._headersonly = True
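
This flag is normally reached through email.parser.Parser, which is understood to call _set_headersonly() when headersonly=True is passed. A sketch with an invented multipart message shows the difference in the result:

```python
from email.parser import Parser

# Invented multipart message used for both parses.
raw = (
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "Subject: headers only\n"
    "\n"
    "--XX\n"
    "\n"
    "part body\n"
    "--XX--\n"
)

full = Parser().parsestr(raw)
headers_only = Parser().parsestr(raw, headersonly=True)

print(full.is_multipart())          # body was split into subparts
print(headers_only.is_multipart())  # body was kept as one string
```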

def email.feedparser.FeedParser.close(self)
Parse all remaining data and return the root message object.

Definition at line 165 of file feedparser.py.

00165 
00166     def close(self):
00167         """Parse all remaining data and return the root message object."""
00168         self._input.close()
00169         self._call_parse()
00170         root = self._pop_message()
00171         assert not self._msgstack
00172         # Look for final set of defects
00173         if root.get_content_maintype() == 'multipart' \
00174                and not root.is_multipart():
00175             root.defects.append(errors.MultipartInvariantViolationDefect())
00176         return root
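
The final defect check can be triggered with a message that claims to be multipart but supplies no boundary. A minimal sketch with invented input:

```python
from email import errors
from email.feedparser import FeedParser

parser = FeedParser()
# Declares multipart/mixed but gives no boundary parameter (invented input).
parser.feed("Content-Type: multipart/mixed\n\nnot really multipart\n")
root = parser.close()

defect_types = [type(d) for d in root.defects]
print([t.__name__ for t in defect_types])
```

The parser records the problems on the message's defects list instead of raising, so callers decide how strict to be.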

def email.feedparser.FeedParser.feed(self, data)
Push more data into the parser.

Reimplemented in email.feedparser.BytesFeedParser.

Definition at line 154 of file feedparser.py.

00154 
00155     def feed(self, data):
00156         """Push more data into the parser."""
00157         self._input.push(data)
00158         self._call_parse()
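
Because feed() only buffers data and advances the parse as far as it can, input may arrive in arbitrary pieces, including pieces that end mid-line. The sketch below (the sample text and the parse_in_chunks helper are invented) compares one large feed with many small ones:

```python
from email.feedparser import FeedParser

raw = (
    "From: alice@example.com\n"
    "Subject: chunked feed\n"
    "\n"
    "line one\n"
    "line two\n"
)


def parse_in_chunks(text, size):
    """Hypothetical helper: feed `text` to a FeedParser `size` characters at a time."""
    parser = FeedParser()
    for start in range(0, len(text), size):
        parser.feed(text[start:start + size])
    return parser.close()


whole = parse_in_chunks(raw, len(raw))   # a single feed() call
pieces = parse_in_chunks(raw, 7)         # many calls that split lines

print(whole.as_string() == pieces.as_string())
```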


Member Data Documentation

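email.feedparser.FeedParser._cur [private]

Definition at line 146 of file feedparser.py.

email.feedparser.FeedParser._factory [private]

Definition at line 142 of file feedparser.py.

email.feedparser.FeedParser._headersonly [private]

Definition at line 148 of file feedparser.py.

email.feedparser.FeedParser._input [private]

Definition at line 143 of file feedparser.py.

email.feedparser.FeedParser._last [private]

Definition at line 147 of file feedparser.py.

email.feedparser.FeedParser._msgstack [private]

Definition at line 144 of file feedparser.py.

email.feedparser.FeedParser._parse [private]

Definition at line 145 of file feedparser.py.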


The documentation for this class was generated from the following file: