python-biopython  1.60
Bio.Entrez.Parser.DataHandler Class Reference

Public Member Functions

def __init__
def read
def parse
def xmlDeclHandler
def startNamespaceDeclHandler
def startElementHandler
def endElementHandler
def characterDataHandler
def elementDecl
def open_dtd_file
def externalEntityRefHandler

Public Attributes

 stack
 errors
 integers
 strings
 lists
 dictionaries
 structures
 items
 dtd_urls
 validating
 parser
 content
 attributes
 object

Static Public Attributes

tuple home = os.path.expanduser('~')
tuple local_dtd_dir = os.path.join(home, '.biopython', 'Bio', 'Entrez', 'DTDs')
tuple global_dtd_dir = os.path.join(str(Entrez.__path__[0]), "DTDs")

Detailed Description

Definition at line 137 of file Parser.py.


Constructor & Destructor Documentation

def Bio.Entrez.Parser.DataHandler.__init__ (   self,
  validate 
)

Definition at line 147 of file Parser.py.

00147 
00148     def __init__(self, validate):
00149         self.stack = []
00150         self.errors = []
00151         self.integers = []
00152         self.strings = []
00153         self.lists = []
00154         self.dictionaries = []
00155         self.structures = {}
00156         self.items = []
00157         self.dtd_urls = []
00158         self.validating = validate
00159         self.parser = expat.ParserCreate(namespace_separator=" ")
00160         self.parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
00161         self.parser.XmlDeclHandler = self.xmlDeclHandler


Member Function Documentation

def Bio.Entrez.Parser.DataHandler.characterDataHandler (   self,
  content 
)

Definition at line 343 of file Parser.py.

00343 
00344     def characterDataHandler(self, content):
00345         self.content += content

def Bio.Entrez.Parser.DataHandler.elementDecl (   self,
  name,
  model 
)
This callback function is called for each element declaration:
<!ELEMENT       name          (...)>
encountered in a DTD. The purpose of this function is to determine
whether this element should be regarded as a string, integer, list,
dictionary, structure, or error.

Definition at line 346 of file Parser.py.

00346 
00347     def elementDecl(self, name, model):
00348         """This callback function is called for each element declaration:
00349         <!ELEMENT       name          (...)>
00350         encountered in a DTD. The purpose of this function is to determine
00351         whether this element should be regarded as a string, integer, list,
00352         dictionary, structure, or error."""
00353         if name.upper()=="ERROR":
00354             self.errors.append(name)
00355             return
00356         if name=='Item' and model==(expat.model.XML_CTYPE_MIXED,
00357                                     expat.model.XML_CQUANT_REP,
00358                                     None, ((expat.model.XML_CTYPE_NAME,
00359                                             expat.model.XML_CQUANT_NONE,
00360                                             'Item',
00361                                             ()
00362                                            ),
00363                                           )
00364                                    ):
00365             # Special case. As far as I can tell, this only occurs in the
00366             # eSummary DTD.
00367             self.items.append(name)
00368             return
00369         # First, remove ignorable parentheses around declarations
00370         while (model[0] in (expat.model.XML_CTYPE_SEQ,
00371                             expat.model.XML_CTYPE_CHOICE)
00372           and model[1] in (expat.model.XML_CQUANT_NONE,
00373                            expat.model.XML_CQUANT_OPT)
00374           and len(model[3])==1):
00375             model = model[3][0]
00376         # PCDATA declarations correspond to strings
00377         if model[0] in (expat.model.XML_CTYPE_MIXED,
00378                         expat.model.XML_CTYPE_EMPTY):
00379             self.strings.append(name)
00380             return
00381         # List-type elements
00382         if (model[0] in (expat.model.XML_CTYPE_CHOICE,
00383                          expat.model.XML_CTYPE_SEQ) and
00384             model[1] in (expat.model.XML_CQUANT_PLUS,
00385                          expat.model.XML_CQUANT_REP)):
00386             self.lists.append(name)
00387             return
00388         # This is the tricky case. Check which keys can occur multiple
00389         # times. If only one key is possible, and it can occur multiple
00390         # times, then this is a list. If more than one key is possible,
00391         # but none of them can occur multiple times, then this is a
00392         # dictionary. Otherwise, this is a structure.
00393         # In 'single' and 'multiple', we keep track which keys can occur
00394         # only once, and which can occur multiple times.
00395         single = []
00396         multiple = []
00397         # The 'count' function is called recursively to make sure all the
00398         # children in this model are counted. Error keys are ignored;
00399         # they raise an exception in Python.
00400         def count(model):
00401             quantifier, name, children = model[1:]
00402             if name is None:
00403                 if quantifier in (expat.model.XML_CQUANT_PLUS,
00404                                   expat.model.XML_CQUANT_REP):
00405                     for child in children:
00406                         multiple.append(child[2])
00407                 else:
00408                     for child in children:
00409                         count(child)
00410             elif name.upper()!="ERROR":
00411                 if quantifier in (expat.model.XML_CQUANT_NONE,
00412                                   expat.model.XML_CQUANT_OPT):
00413                     single.append(name)
00414                 elif quantifier in (expat.model.XML_CQUANT_PLUS,
00415                                     expat.model.XML_CQUANT_REP):
00416                     multiple.append(name)
00417         count(model)
00418         if len(single)==0 and len(multiple)==1:
00419             self.lists.append(name)
00420         elif len(multiple)==0:
00421             self.dictionaries.append(name)
00422         else:
00423             self.structures.update({name: multiple})
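
The classification rules in elementDecl can be tried out in isolation. The following is a minimal sketch in Python 3, not the Biopython implementation: the function name classify and the sample DTD are invented for illustration, and the special cases for ERROR elements and the eSummary Item element are left out.

```python
from xml.parsers import expat

M = expat.model
REPEATING = (M.XML_CQUANT_PLUS, M.XML_CQUANT_REP)
GROUPS = (M.XML_CTYPE_SEQ, M.XML_CTYPE_CHOICE)


def classify(model):
    """Classify an expat content model (type, quantifier, name, children).

    Hypothetical helper; returns 'string', 'list', 'dictionary' or 'structure'.
    """
    # Strip ignorable parentheses such as ((A)) or (A)?
    while (model[0] in GROUPS
           and model[1] in (M.XML_CQUANT_NONE, M.XML_CQUANT_OPT)
           and len(model[3]) == 1):
        model = model[3][0]
    # PCDATA and EMPTY declarations are treated as strings
    if model[0] in (M.XML_CTYPE_MIXED, M.XML_CTYPE_EMPTY):
        return "string"
    # A repeated group is a list
    if model[0] in GROUPS and model[1] in REPEATING:
        return "list"
    single, multiple = [], []

    def count(node):
        # Sort child names into those occurring once and those that repeat
        quantifier, name, children = node[1:]
        if name is None:
            if quantifier in REPEATING:
                multiple.extend(child[2] for child in children)
            else:
                for child in children:
                    count(child)
        elif quantifier in REPEATING:
            multiple.append(name)
        else:
            single.append(name)

    count(model)
    if not single and len(multiple) == 1:
        return "list"
    if not multiple:
        return "dictionary"
    return "structure"


# Illustrative DTD, declared in the internal subset so no files are needed
SAMPLE = b"""<?xml version="1.0"?>
<!DOCTYPE Result [
<!ELEMENT Result (Count, IdList)>
<!ELEMENT Count (#PCDATA)>
<!ELEMENT IdList (Id*)>
<!ELEMENT Id (#PCDATA)>
<!ELEMENT Entry (Title, Author+)>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Author (#PCDATA)>
]>
<Result><Count>1</Count><IdList><Id>7</Id></IdList></Result>
"""

kinds = {}
parser = expat.ParserCreate()
parser.ElementDeclHandler = lambda name, model: kinds.update({name: classify(model)})
parser.Parse(SAMPLE, True)
print(kinds)
```

In this sketch an element with only single-occurrence children becomes a dictionary, an element with one repeating child becomes a list, and a mix of both becomes a structure, which follows the comments in the listing above.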

def Bio.Entrez.Parser.DataHandler.endElementHandler (   self,
  name 
)

Definition at line 301 of file Parser.py.

00301 
00302     def endElementHandler(self, name):
00303         value = self.content
00304         if name in self.errors:
00305             if value=="":
00306                 return
00307             else:
00308                 raise RuntimeError(value)
00309         elif name in self.integers:
00310             value = IntegerElement(value)
00311         elif name in self.strings:
00312             # Convert Unicode strings to plain strings if possible
00313             try:
00314                 value = StringElement(value)
00315             except UnicodeEncodeError:
00316                 value = UnicodeElement(value)
00317         elif name in self.items:
00318             self.object = self.stack.pop()
00319             if self.object.itemtype in ("List", "Structure"):
00320                 return
00321             elif self.object.itemtype=="Integer" and value:
00322                 value = IntegerElement(value)
00323             else:
00324                 # Convert Unicode strings to plain strings if possible
00325                 try:
00326                     value = StringElement(value)
00327                 except UnicodeEncodeError:
00328                     value = UnicodeElement(value)
00329             name = self.object.itemname
00330         else:
00331             self.object = self.stack.pop()
00332             return
00333         value.tag = name
00334         if self.attributes:
00335             value.attributes = dict(self.attributes)
00336             del self.attributes
00337         current = self.stack[-1]
00338         if current!="":
00339             try:
00340                 current.append(value)
00341             except AttributeError:
00342                 current[name] = value
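
The last lines of endElementHandler attach the finished value to its parent with an append-or-assign idiom. A minimal sketch of that idiom with plain Python containers follows; the helper name attach is hypothetical, and the built-in list and dict stand in for the Biopython element classes.

```python
def attach(container, name, value):
    """Append value to a list-like parent, or store it under name in a dict-like parent."""
    try:
        container.append(value)
    except AttributeError:
        # dict-like parents have no append method, so fall back to key assignment
        container[name] = value


id_list = []
attach(id_list, "Id", "12345")

record = {}
attach(record, "Count", "1")
attach(record, "IdList", id_list)
print(record)
```

Catching AttributeError lets the handler treat every parent the same way without checking its type first.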

def Bio.Entrez.Parser.DataHandler.externalEntityRefHandler (   self,
  context,
  base,
  systemId,
  publicId 
)
The purpose of this function is to load the DTD locally, instead
of downloading it from the URL specified in the XML. Using the local
DTD results in much faster parsing. If the DTD is not found locally,
we try to download it. If new DTDs become available from NCBI,
putting them in Bio/Entrez/DTDs will allow the parser to see them.

Definition at line 441 of file Parser.py.

00441 
00442     def externalEntityRefHandler(self, context, base, systemId, publicId):
00443         """The purpose of this function is to load the DTD locally, instead
00444         of downloading it from the URL specified in the XML. Using the local
00445         DTD results in much faster parsing. If the DTD is not found locally,
00446         we try to download it. If new DTDs become available from NCBI,
00447         putting them in Bio/Entrez/DTDs will allow the parser to see them."""
00448         urlinfo = urlparse.urlparse(systemId)
00449         #Following attribute requires Python 2.5+
00450         #if urlinfo.scheme=='http':
00451         if urlinfo[0]=='http':
00452             # Then this is an absolute path to the DTD.
00453             url = systemId
00454         elif urlinfo[0]=='':
00455             # Then this is a relative path to the DTD.
00456             # Look at the parent URL to find the full path.
00457             try:
00458                 url = self.dtd_urls[-1]
00459             except IndexError:
00460                 # Assume the default URL for DTDs if the top parent
00461                 # does not contain an absolute path
00462                 source = "http://www.ncbi.nlm.nih.gov/dtd/"
00463             else:
00464                 source = os.path.dirname(url)
00465             # urls always have a forward slash, don't use os.path.join
00466             url = source.rstrip("/") + "/" + systemId
00467         self.dtd_urls.append(url)
00468         # First, try to load the local version of the DTD file
00469         location, filename = os.path.split(systemId)
00470         handle = self.open_dtd_file(filename)
00471         if not handle:
00472             # DTD is not available as a local file. Try accessing it through
00473             # the internet instead.
00474             message = """\
00475 Unable to load DTD file %s.
00476 
00477 Bio.Entrez uses NCBI's DTD files to parse XML files returned by NCBI Entrez.
00478 Though most of NCBI's DTD files are included in the Biopython distribution,
00479 sometimes you may find that a particular DTD file is missing. While we can
00480 access the DTD file through the internet, the parser is much faster if the
00481 required DTD files are available locally.
00482 
00483 For this purpose, please download %s from
00484 
00485 %s
00486 
00487 and save it either in directory
00488 
00489 %s
00490 
00491 or in directory
00492 
00493 %s
00494 
00495 in order for Bio.Entrez to find it.
00496 
00497 Alternatively, you can save %s in the directory
00498 Bio/Entrez/DTDs in the Biopython distribution, and reinstall Biopython.
00499 
00500 Please also inform the Biopython developers about this missing DTD, by
00501 reporting a bug on http://bugzilla.open-bio.org/ or sign up to our mailing
00502 list and emailing us, so that we can include it with the next release of
00503 Biopython.
00504 
00505 Proceeding to access the DTD file through the internet...
00506 """ % (filename, filename, url, self.global_dtd_dir, self.local_dtd_dir, filename)
00507             warnings.warn(message)
00508             try:
00509                 handle = urllib.urlopen(url)
00510             except IOError:
00511                 raise RuntimeError("Failed to access %s at %s" % (filename, url))
00512 
00513         parser = self.parser.ExternalEntityParserCreate(context)
00514         parser.ElementDeclHandler = self.elementDecl
00515         parser.ParseFile(handle)
00516         handle.close()
00517         self.dtd_urls.pop()
00518         return 1
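
The URL resolution step at the start of externalEntityRefHandler can be sketched on its own. This is an illustrative Python 3 version, not the Biopython code: the function name resolve_dtd_url and the example.org URLs are assumptions, and raising ValueError for other URL schemes is a choice made here (the listing above only handles 'http' and relative system identifiers).

```python
import posixpath
from urllib.parse import urlparse

# Default location used by the listing above when no parent DTD URL is known
DEFAULT_DTD_SOURCE = "http://www.ncbi.nlm.nih.gov/dtd/"


def resolve_dtd_url(system_id, dtd_urls):
    """Return the absolute URL for a DTD system identifier.

    dtd_urls is the stack of URLs of the DTDs currently being parsed.
    """
    scheme = urlparse(system_id).scheme
    if scheme == "http":
        # Already an absolute URL
        return system_id
    if scheme == "":
        # Relative path: resolve against the enclosing DTD, if there is one
        if dtd_urls:
            source = posixpath.dirname(dtd_urls[-1])
        else:
            source = DEFAULT_DTD_SOURCE
        # URLs always use forward slashes, so os.path.join is not used
        return source.rstrip("/") + "/" + system_id
    raise ValueError("unsupported URL scheme: %r" % scheme)


print(resolve_dtd_url("eSummary.dtd", []))
print(resolve_dtd_url("nested.dtd", ["http://example.org/dtds/main.dtd"]))
```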

def Bio.Entrez.Parser.DataHandler.open_dtd_file (   self,
  filename 
)

Definition at line 424 of file Parser.py.

00424 
00425     def open_dtd_file(self, filename):
00426         path = os.path.join(DataHandler.local_dtd_dir, filename)
00427         try:
00428             handle = open(path, "rb")
00429         except IOError:
00430             pass
00431         else:
00432             return handle
00433         path = os.path.join(DataHandler.global_dtd_dir, filename)
00434         try:
00435             handle = open(path, "rb")
00436         except IOError:
00437             pass
00438         else:
00439             return handle
00440         return None
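
open_dtd_file tries the user's local DTD directory first and the packaged directory second. The same lookup order can be written as a loop over candidate directories; the sketch below is a Python 3 illustration with a hypothetical helper name (open_first) and temporary directories standing in for local_dtd_dir and global_dtd_dir.

```python
import os
import tempfile


def open_first(filename, directories):
    """Open filename from the first directory that contains it, else return None."""
    for directory in directories:
        try:
            return open(os.path.join(directory, filename), "rb")
        except OSError:
            # Not readable here; try the next directory
            continue
    return None


# Temporary directories stand in for the local and global DTD directories
with tempfile.TemporaryDirectory() as user_dir, tempfile.TemporaryDirectory() as package_dir:
    with open(os.path.join(package_dir, "example.dtd"), "wb") as out:
        out.write(b"<!ELEMENT Id (#PCDATA)>")
    handle = open_first("example.dtd", [user_dir, package_dir])
    found = handle.read()
    handle.close()
    missing = open_first("absent.dtd", [user_dir, package_dir])

print(found, missing)
```

Returning None rather than raising mirrors the listing above, where the caller falls back to downloading the DTD.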

def Bio.Entrez.Parser.DataHandler.parse (   self,
  handle 
)

Definition at line 193 of file Parser.py.

00193 
00194     def parse(self, handle):
00195         BLOCK = 1024
00196         while True:
00197             #Read in another block of the file...
00198             text = handle.read(BLOCK)
00199             if not text:
00200                 # We have reached the end of the XML file
00201                 if self.stack:
00202                     # No more XML data, but there is still some unfinished
00203                     # business
00204                     raise CorruptedXMLError
00205                 try:
00206                     for record in self.object:
00207                         yield record
00208                 except AttributeError:
00209                     if self.parser.StartElementHandler:
00210                         # We saw the initial <?xml declaration, and expat
00211                         # didn't notice any errors, so self.object should be
00212                         # defined. If not, this is a bug.
00213                         raise RuntimeError("Failed to parse the XML file correctly, possibly due to a bug in Bio.Entrez. Please contact the Biopython developers at biopython-dev@biopython.org for assistance.")
00214                     else:
00215                         # We did not see the initial <?xml declaration, so
00216                         # probably the input data is not in XML format.
00217                         raise NotXMLError("XML declaration not found")
00218                 self.parser.Parse("", True)
00219                 self.parser = None
00220                 return
00221 
00222             try:
00223                 self.parser.Parse(text, False)        
00224             except expat.ExpatError, e:
00225                 if self.parser.StartElementHandler:
00226                     # We saw the initial <!xml declaration, so we can be sure
00227                     # that we are parsing XML data. Most likely, the XML file
00228                     # is corrupted.
00229                     raise CorruptedXMLError(e)
00230                 else:
00231                     # We have not seen the initial <?xml declaration, so
00232                     # probably the input data is not in XML format.
00233                     raise NotXMLError(e)
00234 
00235             if not self.stack:
00236                 # Haven't read enough from the XML file yet
00237                 continue
00238 
00239             records = self.stack[0]
00240             if not isinstance(records, list):
00241                 raise ValueError("The XML file does not represent a list. Please use Entrez.read instead of Entrez.parse")
00242             while len(records) > 1: # Then the top record is finished
00243                 record = records.pop(0)
00244                 yield record
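
parse feeds the input to expat block by block and yields records as they are completed. The following Python 3 sketch shows the same block-wise pattern in a much simplified form. It is not the Biopython implementation: the function name iter_record_texts is invented, records are assumed to be flat text-only children of the root element, and completion is tracked by element depth rather than by inspecting the stack as the listing does.

```python
import io
from xml.parsers import expat


def iter_record_texts(handle, block_size=16):
    """Yield the text of each direct child of the root as soon as it is complete."""
    parser = expat.ParserCreate()
    finished = []                      # completed records waiting to be yielded
    state = {"depth": 0, "text": ""}

    def start(name, attrs):
        state["depth"] += 1
        state["text"] = ""

    def end(name):
        if state["depth"] == 2:        # a direct child of the root just closed
            finished.append(state["text"])
        state["depth"] -= 1

    def chars(data):
        # expat may deliver character data in several pieces
        state["text"] += data

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars

    while True:
        text = handle.read(block_size)
        if not text:
            parser.Parse(b"", True)    # tell expat that the input is complete
            break
        parser.Parse(text, False)
        while finished:
            yield finished.pop(0)
    while finished:
        yield finished.pop(0)


data = b'<?xml version="1.0"?><IdList><Id>1</Id><Id>22</Id><Id>333</Id></IdList>'
ids = list(iter_record_texts(io.BytesIO(data)))
print(ids)
```

Because records are yielded between calls to Parse, memory use stays bounded by the block size and the largest single record rather than by the whole file.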

def Bio.Entrez.Parser.DataHandler.read (   self,
  handle 
)
Set up the parser and let it parse the XML results

Definition at line 162 of file Parser.py.

00162 
00163     def read(self, handle):
00164         """Set up the parser and let it parse the XML results"""
00165         if hasattr(handle, "closed") and handle.closed:
00166             #Should avoid a possible Segmentation Fault, see:
00167             #http://bugs.python.org/issue4877
00168             raise IOError("Can't parse a closed handle")
00169         try:
00170             self.parser.ParseFile(handle)
00171         except expat.ExpatError, e:
00172             if self.parser.StartElementHandler:
00173                 # We saw the initial <?xml declaration, so we can be sure that
00174                 # we are parsing XML data. Most likely, the XML file is
00175                 # corrupted.
00176                 raise CorruptedXMLError(e)
00177             else:
00178                 # We have not seen the initial <?xml declaration, so probably
00179                 # the input data is not in XML format.
00180                 raise NotXMLError(e)
00181         try:
00182             return self.object
00183         except AttributeError:
00184             if self.parser.StartElementHandler:
00185                 # We saw the initial <?xml declaration, and expat didn't notice
00186                 # any errors, so self.object should be defined. If not, this is
00187                 # a bug.
00188                 raise RuntimeError("Failed to parse the XML file correctly, possibly due to a bug in Bio.Entrez. Please contact the Biopython developers at biopython-dev@biopython.org for assistance.")
00189             else:
00190                 # We did not see the initial <?xml declaration, so probably
00191                 # the input data is not in XML format.
00192                 raise NotXMLError("XML declaration not found")
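
read distinguishes corrupted XML from input that is not XML at all by checking whether the XML declaration was seen before the failure. A reduced Python 3 sketch of that decision follows; the function name diagnose and its return strings are illustrative only, and it records the declaration in a list instead of testing StartElementHandler as the listing does.

```python
from xml.parsers import expat


def diagnose(data):
    """Return 'ok', 'corrupted XML' or 'not XML' for a bytes document."""
    parser = expat.ParserCreate()
    seen = []
    # Called once when expat encounters the <?xml ...?> declaration
    parser.XmlDeclHandler = lambda version, encoding, standalone: seen.append(version)
    try:
        parser.Parse(data, True)
    except expat.ExpatError:
        # A declaration followed by an error suggests damaged XML;
        # an error with no declaration suggests the input is not XML
        return "corrupted XML" if seen else "not XML"
    return "ok" if seen else "not XML"


print(diagnose(b"<?xml version='1.0'?><a><b/></a>"))
print(diagnose(b"<?xml version='1.0'?><a><b></a>"))
print(diagnose(b"plain text, not markup"))
```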

def Bio.Entrez.Parser.DataHandler.startElementHandler (   self,
  name,
  attrs 
)

Definition at line 256 of file Parser.py.

00256 
00257     def startElementHandler(self, name, attrs):
00258         self.content = ""
00259         if name in self.lists:
00260             object = ListElement()
00261         elif name in self.dictionaries:
00262             object = DictionaryElement()
00263         elif name in self.structures:
00264             object = StructureElement(self.structures[name])
00265         elif name in self.items: # Only appears in ESummary
00266             name = str(attrs["Name"]) # convert from Unicode
00267             del attrs["Name"]
00268             itemtype = str(attrs["Type"]) # convert from Unicode
00269             del attrs["Type"]
00270             if itemtype=="Structure":
00271                 object = DictionaryElement()
00272             elif name in ("ArticleIds", "History"):
00273                 object = StructureElement(["pubmed", "medline"])
00274             elif itemtype=="List":
00275                 object = ListElement()
00276             else:
00277                 object = StringElement()
00278             object.itemname = name
00279             object.itemtype = itemtype
00280         elif name in self.strings + self.errors + self.integers:
00281             self.attributes = attrs
00282             return
00283         else:
00284             # Element not found in DTD
00285             if self.validating:
00286                 raise ValidationError(name)
00287             else:
00288                 # this will not be stored in the record
00289                 object = ""
00290         if object!="":
00291             object.tag = name
00292             if attrs:
00293                 object.attributes = dict(attrs)
00294             if len(self.stack)!=0:
00295                 current = self.stack[-1]
00296                 try:
00297                     current.append(object)
00298                 except AttributeError:
00299                     current[name] = object
00300         self.stack.append(object)

def Bio.Entrez.Parser.DataHandler.startNamespaceDeclHandler (   self,
  prefix,
  un 
)

Definition at line 253 of file Parser.py.

00253 
00254     def startNamespaceDeclHandler(self, prefix, un):
00255         raise NotImplementedError("The Bio.Entrez parser cannot handle XML data that make use of XML namespaces")
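
Namespace declarations are only reported when the expat parser is created with a namespace separator, as in the constructor above. The Python 3 sketch below shows the rejection pattern in isolation; the function name uses_namespaces and the example.org namespace URI are hypothetical.

```python
from xml.parsers import expat


def uses_namespaces(data):
    """Return True if parsing data triggers a namespace declaration."""
    # Namespace processing is enabled by passing a separator
    parser = expat.ParserCreate(namespace_separator=" ")

    def reject(prefix, uri):
        raise NotImplementedError("XML namespaces are not supported")

    parser.StartNamespaceDeclHandler = reject
    try:
        parser.Parse(data, True)
    except NotImplementedError:
        # The exception raised in the handler propagates out of Parse
        return True
    return False


print(uses_namespaces(b'<a xmlns="http://example.org/ns"/>'))
print(uses_namespaces(b"<a/>"))
```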

def Bio.Entrez.Parser.DataHandler.xmlDeclHandler (   self,
  version,
  encoding,
  standalone 
)

Definition at line 245 of file Parser.py.

00245 
00246     def xmlDeclHandler(self, version, encoding, standalone):
00247         # XML declaration found; set the handlers
00248         self.parser.StartElementHandler = self.startElementHandler
00249         self.parser.EndElementHandler = self.endElementHandler
00250         self.parser.CharacterDataHandler = self.characterDataHandler
00251         self.parser.ExternalEntityRefHandler = self.externalEntityRefHandler
00252         self.parser.StartNamespaceDeclHandler = self.startNamespaceDeclHandler


Member Data Documentation

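Bio.Entrez.Parser.DataHandler.attributes

Definition at line 280 of file Parser.py.

Bio.Entrez.Parser.DataHandler.content

Definition at line 257 of file Parser.py.

Bio.Entrez.Parser.DataHandler.dictionaries

Definition at line 153 of file Parser.py.

Bio.Entrez.Parser.DataHandler.dtd_urls

Definition at line 156 of file Parser.py.

Bio.Entrez.Parser.DataHandler.errors

Definition at line 149 of file Parser.py.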

tuple Bio.Entrez.Parser.DataHandler.global_dtd_dir = os.path.join(str(Entrez.__path__[0]), "DTDs") [static]

Definition at line 144 of file Parser.py.

tuple Bio.Entrez.Parser.DataHandler.home = os.path.expanduser('~') [static]

Definition at line 139 of file Parser.py.

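Bio.Entrez.Parser.DataHandler.integers

Definition at line 150 of file Parser.py.

Bio.Entrez.Parser.DataHandler.items

Definition at line 155 of file Parser.py.

Bio.Entrez.Parser.DataHandler.lists

Definition at line 152 of file Parser.py.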

tuple Bio.Entrez.Parser.DataHandler.local_dtd_dir = os.path.join(home, '.biopython', 'Bio', 'Entrez', 'DTDs') [static]

Definition at line 140 of file Parser.py.

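Bio.Entrez.Parser.DataHandler.object

Definition at line 317 of file Parser.py.

Bio.Entrez.Parser.DataHandler.parser

Definition at line 158 of file Parser.py.

Bio.Entrez.Parser.DataHandler.stack

Definition at line 148 of file Parser.py.

Bio.Entrez.Parser.DataHandler.strings

Definition at line 151 of file Parser.py.

Bio.Entrez.Parser.DataHandler.structures

Definition at line 154 of file Parser.py.

Bio.Entrez.Parser.DataHandler.validating

Definition at line 157 of file Parser.py.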


The documentation for this class was generated from the following file:

Parser.py