python3.2  3.2.2
urllib.robotparser.RobotFileParser Class Reference


Public Member Functions

def __init__
def mtime
def modified
def set_url
def read
def parse
def can_fetch
def __str__

Public Attributes

 entries
 default_entry
 disallow_all
 allow_all
 last_checked
 url
 path

Private Member Functions

def _add_entry

Detailed Description

This class provides a set of methods to read, parse and answer
questions about a single robots.txt file.

Definition at line 17 of file robotparser.py.
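
The typical call sequence is set_url(), read(), then can_fetch(). A minimal
usage sketch, assuming a hypothetical crawler name and an example host:

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")   # illustrative URL
    rp.read()                                         # fetch and parse the file
    rp.can_fetch("MyCrawler", "http://www.example.com/private/page.html")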


Constructor & Destructor Documentation

def urllib.robotparser.RobotFileParser.__init__ ( self, url = '' )

Definition at line 23 of file robotparser.py.

    def __init__(self, url=''):
        self.entries = []
        self.default_entry = None
        self.disallow_all = False
        self.allow_all = False
        self.set_url(url)
        self.last_checked = 0


Member Function Documentation

def urllib.robotparser.RobotFileParser.__str__ ( self )

Definition at line 149 of file robotparser.py.

    def __str__(self):
        return ''.join([str(entry) + "\n" for entry in self.entries])

def urllib.robotparser.RobotFileParser._add_entry ( self, entry ) [private]

Definition at line 66 of file robotparser.py.

    def _add_entry(self, entry):
        if "*" in entry.useragents:
            # the default entry is considered last
            if self.default_entry is None:
                # the first default entry wins
                self.default_entry = entry
        else:
            self.entries.append(entry)

def urllib.robotparser.RobotFileParser.can_fetch ( self, useragent, url )
using the parsed robots.txt decide if useragent can fetch url

Definition at line 126 of file robotparser.py.

    def can_fetch(self, useragent, url):
        """using the parsed robots.txt decide if useragent can fetch url"""
        if self.disallow_all:
            return False
        if self.allow_all:
            return True
        # search for given user agent matches
        # the first match counts
        parsed_url = urllib.parse.urlparse(urllib.parse.unquote(url))
        url = urllib.parse.urlunparse(('','',parsed_url.path,
            parsed_url.params,parsed_url.query, parsed_url.fragment))
        url = urllib.parse.quote(url)
        if not url:
            url = "/"
        for entry in self.entries:
            if entry.applies_to(useragent):
                return entry.allowance(url)
        # try the default entry last
        if self.default_entry:
            return self.default_entry.allowance(url)
        # agent not found ==> access granted
        return True
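
User-agent matching is first-match-wins over the named entries, with the "*"
record kept aside as default_entry and consulted only afterwards. A short
sketch with made-up rules and agent names:

    rp = urllib.robotparser.RobotFileParser()
    rp.parse([
        "User-agent: FigTree",
        "Disallow: /tmp/",
        "",
        "User-agent: *",
        "Disallow: /private/",
    ])
    rp.can_fetch("FigTree", "http://example.com/tmp/x.html")       # False: the named entry matches
    rp.can_fetch("OtherBot", "http://example.com/private/x.html")  # False: falls back to the * entry
    rp.can_fetch("OtherBot", "http://example.com/index.html")      # True: no rule disallows it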

def urllib.robotparser.RobotFileParser.modified ( self )

Sets the time the robots.txt file was last fetched to the
current time.

Definition at line 40 of file robotparser.py.

    def modified(self):
        """Sets the time the robots.txt file was last fetched to the
        current time.

        """
        import time
        self.last_checked = time.time()

def urllib.robotparser.RobotFileParser.mtime ( self )

Returns the time the robots.txt file was last fetched.

This is useful for long-running web spiders that need to
check for new robots.txt files periodically.

Definition at line 31 of file robotparser.py.

    def mtime(self):
        """Returns the time the robots.txt file was last fetched.

        This is useful for long-running web spiders that need to
        check for new robots.txt files periodically.

        """
        return self.last_checked
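
mtime() and modified() work as a pair: the spider reads one and writes the
other. A sketch of a periodic re-fetch, reusing the rp parser from the sketch
above; the one-day threshold is an arbitrary choice for illustration:

    import time

    if time.time() - rp.mtime() > 24 * 60 * 60:
        rp.read()        # re-fetch and re-parse the robots.txt file
        rp.modified()    # record the fetch time; read() itself does not update it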

def urllib.robotparser.RobotFileParser.parse ( self, lines )
Parse the input lines from a robots.txt file.

We allow that a user-agent: line is not preceded by
one or more blank lines.

Definition at line 75 of file robotparser.py.

    def parse(self, lines):
        """Parse the input lines from a robots.txt file.

        We allow that a user-agent: line is not preceded by
        one or more blank lines.
        """
        # states:
        #   0: start state
        #   1: saw user-agent line
        #   2: saw an allow or disallow line
        state = 0
        entry = Entry()

        for line in lines:
            if not line:
                if state == 1:
                    entry = Entry()
                    state = 0
                elif state == 2:
                    self._add_entry(entry)
                    entry = Entry()
                    state = 0
            # remove optional comment and strip line
            i = line.find('#')
            if i >= 0:
                line = line[:i]
            line = line.strip()
            if not line:
                continue
            line = line.split(':', 1)
            if len(line) == 2:
                line[0] = line[0].strip().lower()
                line[1] = urllib.parse.unquote(line[1].strip())
                if line[0] == "user-agent":
                    if state == 2:
                        self._add_entry(entry)
                        entry = Entry()
                    entry.useragents.append(line[1])
                    state = 1
                elif line[0] == "disallow":
                    if state != 0:
                        entry.rulelines.append(RuleLine(line[1], False))
                        state = 2
                elif line[0] == "allow":
                    if state != 0:
                        entry.rulelines.append(RuleLine(line[1], True))
                        state = 2
        if state == 2:
            self._add_entry(entry)
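
parse() accepts any iterable of already-decoded lines, so a cached or
in-memory copy of robots.txt can be fed in directly. A small sketch with
illustrative rules; within a record the first matching Allow/Disallow line
decides:

    robots_txt = "User-agent: *\nAllow: /cgi-bin/public\nDisallow: /cgi-bin/\n"
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    rp.can_fetch("AnyBot", "/cgi-bin/public")   # True: the Allow line matches first
    rp.can_fetch("AnyBot", "/cgi-bin/secret")   # False: caught by the Disallow line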

def urllib.robotparser.RobotFileParser.read ( self )

Reads the robots.txt URL and feeds it to the parser.

Definition at line 53 of file robotparser.py.

    def read(self):
        """Reads the robots.txt URL and feeds it to the parser."""
        try:
            f = urllib.request.urlopen(self.url)
        except urllib.error.HTTPError as err:
            if err.code in (401, 403):
                self.disallow_all = True
            elif err.code >= 400:
                self.allow_all = True
        else:
            raw = f.read()
            self.parse(raw.decode("utf-8").splitlines())
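
read()'s error handling is deliberately coarse: a 401 or 403 response marks
the whole site as disallowed, any other HTTP error marks it as allowed, and
can_fetch() then short-circuits on the disallow_all / allow_all flags. An
illustrative sketch with an example URL:

    rp = urllib.robotparser.RobotFileParser("http://www.example.com/robots.txt")
    rp.read()
    if rp.disallow_all:
        print("robots.txt was protected (401/403); treat everything as disallowed")
    elif rp.allow_all:
        print("robots.txt could not be fetched; no restrictions apply")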

def urllib.robotparser.RobotFileParser.set_url ( self, url )
Sets the URL referring to a robots.txt file.

Definition at line 48 of file robotparser.py.

    def set_url(self, url):
        """Sets the URL referring to a robots.txt file."""
        self.url = url
        self.host, self.path = urllib.parse.urlparse(url)[1:3]
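
set_url() keeps the full URL and also splits out the network location and
path, which end up in the host and path attributes. An illustrative check:

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.url            # 'http://www.example.com/robots.txt'
    rp.host, rp.path  # ('www.example.com', '/robots.txt')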


Member Data Documentation

urllib.robotparser.RobotFileParser.allow_all

Definition at line 27 of file robotparser.py.

urllib.robotparser.RobotFileParser.default_entry

Definition at line 25 of file robotparser.py.

urllib.robotparser.RobotFileParser.disallow_all

Definition at line 26 of file robotparser.py.

urllib.robotparser.RobotFileParser.entries

Definition at line 24 of file robotparser.py.

urllib.robotparser.RobotFileParser.last_checked

Definition at line 29 of file robotparser.py.

urllib.robotparser.RobotFileParser.path

Definition at line 51 of file robotparser.py.

urllib.robotparser.RobotFileParser.url

Definition at line 50 of file robotparser.py.


The documentation for this class was generated from the following file:

robotparser.py