
python3.2  3.2.2
csv.Sniffer Class Reference


Public Member Functions

def __init__
def sniff
def has_header

Public Attributes

preferred

Private Member Functions

def _guess_quote_and_delimiter
def _guess_delimiter

Detailed Description

"Sniffs" the format of a CSV file (i.e. delimiter, quotechar)
Returns a Dialect object.

Definition at line 167 of file csv.py.
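
A short usage sketch based on the members documented below (the file name is hypothetical; any delimited text file will do):

    import csv

    with open('example.csv', newline='') as f:
        sample = f.read(2048)                  # a small sample is enough to sniff
        sniffer = csv.Sniffer()
        dialect = sniffer.sniff(sample)        # raises csv.Error if no delimiter is found
        skip_first = sniffer.has_header(sample)
        f.seek(0)
        rows = csv.reader(f, dialect)
        if skip_first:
            next(rows)                         # drop the header row
        data = list(rows)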


Constructor & Destructor Documentation

def csv.Sniffer.__init__(self)

Definition at line 172 of file csv.py.

00172 
00173     def __init__(self):
00174         # in case there is more than one possible delimiter
00175         self.preferred = [',', '\t', ';', ' ', ':']
00176 
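
Because preferred is an ordinary instance attribute, it can be adjusted before sniffing to change which delimiter wins when several candidates look equally consistent. A minimal sketch (the sample text and the reordered list are illustrative, not stdlib defaults):

    import csv

    sample = 'a b;c d\ne f;g h\n'                  # both ' ' and ';' look consistent
    print(csv.Sniffer().sniff(sample).delimiter)   # ';' -- the default preferred order ranks ';' above ' '

    sniffer = csv.Sniffer()
    sniffer.preferred = [' ', ';']                 # rank the space character first instead
    print(sniffer.sniff(sample).delimiter)         # ' '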


Member Function Documentation

def csv.Sniffer._guess_delimiter(self, data, delimiters)   [private]
The delimiter /should/ occur the same number of times on
each row. However, due to malformed data, it may not. We don't want
an all or nothing approach, so we allow for small variations in this
number.
  1) build a table of the frequency of each character on every line.
  2) build a table of frequencies of this frequency (meta-frequency?),
     e.g.  'x occurred 5 times in 10 rows, 6 times in 1000 rows,
     7 times in 2 rows'
  3) use the mode of the meta-frequency to determine the /expected/
     frequency for that character
  4) find out how often the character actually meets that goal
  5) the character that best meets its goal is the delimiter
For performance reasons, the data is evaluated in chunks, so it can
try and evaluate the smallest portion of the data possible, evaluating
additional chunks as necessary.

Definition at line 280 of file csv.py.

00280 
00281     def _guess_delimiter(self, data, delimiters):
00282         """
00283         The delimiter /should/ occur the same number of times on
00284         each row. However, due to malformed data, it may not. We don't want
00285         an all or nothing approach, so we allow for small variations in this
00286         number.
00287           1) build a table of the frequency of each character on every line.
00288           2) build a table of frequencies of this frequency (meta-frequency?),
00289              e.g.  'x occurred 5 times in 10 rows, 6 times in 1000 rows,
00290              7 times in 2 rows'
00291           3) use the mode of the meta-frequency to determine the /expected/
00292              frequency for that character
00293           4) find out how often the character actually meets that goal
00294           5) the character that best meets its goal is the delimiter
00295         For performance reasons, the data is evaluated in chunks, so it can
00296         try and evaluate the smallest portion of the data possible, evaluating
00297         additional chunks as necessary.
00298         """
00299 
00300         data = list(filter(None, data.split('\n')))
00301 
00302         ascii = [chr(c) for c in range(127)] # 7-bit ASCII
00303 
00304         # build frequency tables
00305         chunkLength = min(10, len(data))
00306         iteration = 0
00307         charFrequency = {}
00308         modes = {}
00309         delims = {}
00310         start, end = 0, min(chunkLength, len(data))
00311         while start < len(data):
00312             iteration += 1
00313             for line in data[start:end]:
00314                 for char in ascii:
00315                     metaFrequency = charFrequency.get(char, {})
00316                     # must count even if frequency is 0
00317                     freq = line.count(char)
00318                     # value is the mode
00319                     metaFrequency[freq] = metaFrequency.get(freq, 0) + 1
00320                     charFrequency[char] = metaFrequency
00321 
00322             for char in charFrequency.keys():
00323                 items = list(charFrequency[char].items())
00324                 if len(items) == 1 and items[0][0] == 0:
00325                     continue
00326                 # get the mode of the frequencies
00327                 if len(items) > 1:
00328                     modes[char] = max(items, key=lambda x: x[1])
00329                     # adjust the mode - subtract the sum of all
00330                     # other frequencies
00331                     items.remove(modes[char])
00332                     modes[char] = (modes[char][0], modes[char][1]
00333                                    - sum(item[1] for item in items))
00334                 else:
00335                     modes[char] = items[0]
00336 
00337             # build a list of possible delimiters
00338             modeList = modes.items()
00339             total = float(chunkLength * iteration)
00340             # (rows of consistent data) / (number of rows) = 100%
00341             consistency = 1.0
00342             # minimum consistency threshold
00343             threshold = 0.9
00344             while len(delims) == 0 and consistency >= threshold:
00345                 for k, v in modeList:
00346                     if v[0] > 0 and v[1] > 0:
00347                         if ((v[1]/total) >= consistency and
00348                             (delimiters is None or k in delimiters)):
00349                             delims[k] = v
00350                 consistency -= 0.01
00351 
00352             if len(delims) == 1:
00353                 delim = list(delims.keys())[0]
00354                 skipinitialspace = (data[0].count(delim) ==
00355                                     data[0].count("%c " % delim))
00356                 return (delim, skipinitialspace)
00357 
00358             # analyze another chunkLength lines
00359             start = end
00360             end += chunkLength
00361 
00362         if not delims:
00363             return ('', 0)
00364 
00365         # if there's more than one, fall back to a 'preferred' list
00366         if len(delims) > 1:
00367             for d in self.preferred:
00368                 if d in delims.keys():
00369                     skipinitialspace = (data[0].count(d) ==
00370                                         data[0].count("%c " % d))
00371                     return (d, skipinitialspace)
00372 
00373         # nothing else indicates a preference, pick the character that
00374         # dominates(?)
00375         items = [(v,k) for (k,v) in delims.items()]
00376         items.sort()
00377         delim = items[-1][1]
00378 
00379         skipinitialspace = (data[0].count(delim) ==
00380                             data[0].count("%c " % delim))
00381         return (delim, skipinitialspace)
00382 
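
A standalone sketch (not part of csv.py) of steps 1-3 above, showing how the per-line frequency of ',' and its meta-frequency single out the expected count per row even with one malformed line:

    # 1) frequency of ',' on every line of a small sample
    lines = ['a,b,c', '1,2,3', '4,5,6,7']          # the last row is malformed
    freqs = [line.count(',') for line in lines]    # [2, 2, 3]

    # 2) meta-frequency: how often each per-line count occurs
    meta = {}
    for f in freqs:
        meta[f] = meta.get(f, 0) + 1               # {2: 2, 3: 1}

    # 3) the mode of the meta-frequency is the expected count per row
    expected, votes = max(meta.items(), key=lambda kv: kv[1])
    print(expected, votes)                         # 2 2 -- two commas per row, seen on 2 of 3 rows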

def csv.Sniffer._guess_quote_and_delimiter(self, data, delimiters)   [private]
Looks for text enclosed between two identical quotes
(the probable quotechar) which are preceded and followed
by the same character (the probable delimiter).
For example:
         ,'some text',
The quote with the most wins, same with the delimiter.
If there is no quotechar the delimiter can't be determined
this way.

Definition at line 206 of file csv.py.

00206 
00207     def _guess_quote_and_delimiter(self, data, delimiters):
00208         """
00209         Looks for text enclosed between two identical quotes
00210         (the probable quotechar) which are preceded and followed
00211         by the same character (the probable delimiter).
00212         For example:
00213                          ,'some text',
00214         The quote with the most wins, same with the delimiter.
00215         If there is no quotechar the delimiter can't be determined
00216         this way.
00217         """
00218 
00219         matches = []
00220         for restr in ('(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)', # ,".*?",
00221                       '(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?P<delim>[^\w\n"\'])(?P<space> ?)',   #  ".*?",
00222                       '(?P<delim>>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?:$|\n)',  # ,".*?"
00223                       '(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?:$|\n)'):                            #  ".*?" (no delim, no space)
00224             regexp = re.compile(restr, re.DOTALL | re.MULTILINE)
00225             matches = regexp.findall(data)
00226             if matches:
00227                 break
00228 
00229         if not matches:
00230             # (quotechar, doublequote, delimiter, skipinitialspace)
00231             return ('', False, None, 0)
00232         quotes = {}
00233         delims = {}
00234         spaces = 0
00235         for m in matches:
00236             n = regexp.groupindex['quote'] - 1
00237             key = m[n]
00238             if key:
00239                 quotes[key] = quotes.get(key, 0) + 1
00240             try:
00241                 n = regexp.groupindex['delim'] - 1
00242                 key = m[n]
00243             except KeyError:
00244                 continue
00245             if key and (delimiters is None or key in delimiters):
00246                 delims[key] = delims.get(key, 0) + 1
00247             try:
00248                 n = regexp.groupindex['space'] - 1
00249             except KeyError:
00250                 continue
00251             if m[n]:
00252                 spaces += 1
00253 
00254         quotechar = max(quotes, key=quotes.get)
00255 
00256         if delims:
00257             delim = max(delims, key=delims.get)
00258             skipinitialspace = delims[delim] == spaces
00259             if delim == '\n': # most likely a file with a single column
00260                 delim = ''
00261         else:
00262             # there is *no* delimiter, it's a single column of quoted data
00263             delim = ''
00264             skipinitialspace = 0
00265 
00266         # if we see an extra quote between delimiters, we've got a
00267         # double quoted format
00268         dq_regexp = re.compile(r"((%(delim)s)|^)\W*%(quote)s[^%(delim)s\n]*%(quote)s[^%(delim)s\n]*%(quote)s\W*((%(delim)s)|$)" % \
00269                                {'delim':delim, 'quote':quotechar}, re.MULTILINE)
00270 
00271 
00272 
00273         if dq_regexp.search(data):
00274             doublequote = True
00275         else:
00276             doublequote = False
00277 
00278         return (quotechar, doublequote, delim, skipinitialspace)
00279 
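
A standalone sketch that calls the private helper directly, just to show the shape of the result for a small quoted sample (the sample text is illustrative):

    import csv

    sniffer = csv.Sniffer()
    sample = '"one","two","three"\n"1","2","3"\n'
    # returns (quotechar, doublequote, delimiter, skipinitialspace)
    print(sniffer._guess_quote_and_delimiter(sample, None))
    # ('"', False, ',', False)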

def csv.Sniffer.has_header(self, sample)

Definition at line 383 of file csv.py.

00383 
00384     def has_header(self, sample):
00385         # Creates a dictionary of types of data in each column. If any
00386         # column is of a single type (say, integers), *except* for the first
00387         # row, then the first row is presumed to be labels. If the type
00388         # can't be determined, it is assumed to be a string in which case
00389         # the length of the string is the determining factor: if all of the
00390         # rows except for the first are the same length, it's a header.
00391         # Finally, a 'vote' is taken at the end for each column, adding or
00392         # subtracting from the likelihood of the first row being a header.
00393 
00394         rdr = reader(StringIO(sample), self.sniff(sample))
00395 
00396         header = next(rdr) # assume first row is header
00397 
00398         columns = len(header)
00399         columnTypes = {}
00400         for i in range(columns): columnTypes[i] = None
00401 
00402         checked = 0
00403         for row in rdr:
00404             # arbitrary number of rows to check, to keep it sane
00405             if checked > 20:
00406                 break
00407             checked += 1
00408 
00409             if len(row) != columns:
00410                 continue # skip rows that have irregular number of columns
00411 
00412             for col in list(columnTypes.keys()):
00413 
00414                 for thisType in [int, float, complex]:
00415                     try:
00416                         thisType(row[col])
00417                         break
00418                     except (ValueError, OverflowError):
00419                         pass
00420                 else:
00421                     # fallback to length of string
00422                     thisType = len(row[col])
00423 
00424                 if thisType != columnTypes[col]:
00425                     if columnTypes[col] is None: # add new column type
00426                         columnTypes[col] = thisType
00427                     else:
00428                         # type is inconsistent, remove column from
00429                         # consideration
00430                         del columnTypes[col]
00431 
00432         # finally, compare results against first row and "vote"
00433         # on whether it's a header
00434         hasHeader = 0
00435         for col, colType in columnTypes.items():
00436             if type(colType) == type(0): # it's a length
00437                 if len(header[col]) != colType:
00438                     hasHeader += 1
00439                 else:
00440                     hasHeader -= 1
00441             else: # attempt typecast
00442                 try:
00443                     colType(header[col])
00444                 except (ValueError, TypeError):
00445                     hasHeader += 1
00446                 else:
00447                     hasHeader -= 1
00448 
00449         return hasHeader > 0
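
A short sketch of the voting heuristic in use (the sample data is illustrative):

    import csv

    sample = 'name,age\nAlice,30\nBob,25\n'
    print(csv.Sniffer().has_header(sample))    # True: the 'age' column is numeric below row 1

    sample = '1,2\n3,4\n5,6\n'
    print(csv.Sniffer().has_header(sample))    # False: the first row looks like all the others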

def csv.Sniffer.sniff(self, sample, delimiters=None)
Returns a dialect (or None) corresponding to the sample

Definition at line 177 of file csv.py.

00177 
00178     def sniff(self, sample, delimiters=None):
00179         """
00180         Returns a dialect (or None) corresponding to the sample
00181         """
00182 
00183         quotechar, doublequote, delimiter, skipinitialspace = \
00184                    self._guess_quote_and_delimiter(sample, delimiters)
00185         if not delimiter:
00186             delimiter, skipinitialspace = self._guess_delimiter(sample,
00187                                                                 delimiters)
00188 
00189         if not delimiter:
00190             raise Error("Could not determine delimiter")
00191 
00192         class dialect(Dialect):
00193             _name = "sniffed"
00194             lineterminator = '\r\n'
00195             quoting = QUOTE_MINIMAL
00196             # escapechar = ''
00197 
00198         dialect.doublequote = doublequote
00199         dialect.delimiter = delimiter
00200         # _csv.reader won't accept a quotechar of ''
00201         dialect.quotechar = quotechar or '"'
00202         dialect.skipinitialspace = skipinitialspace
00203 
00204         return dialect
00205 
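
A sketch showing the optional delimiters argument, which restricts the candidate characters the guessers may consider (the sample text is illustrative):

    import csv

    sample = '3-14|2-71|1-61\n1-41|1-73|2-23\n'
    # only '|' and ';' may be chosen; '-' is ruled out even though it also
    # appears a consistent number of times on every row
    dialect = csv.Sniffer().sniff(sample, delimiters='|;')
    print(dialect.delimiter)                   # '|'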


Member Data Documentation

csv.Sniffer.preferred

Definition at line 174 of file csv.py.


The documentation for this class was generated from the following file: csv.py