Back to index

moin  1.9.0~rc2
Public Member Functions | Public Attributes | Private Member Functions
MoinMoin.support.difflib.SequenceMatcher Class Reference

List of all members.

Public Member Functions

def __init__
def set_seqs
def set_seq1
def set_seq2
def find_longest_match
def get_matching_blocks
def get_opcodes
def get_grouped_opcodes
def ratio
def quick_ratio
def real_quick_ratio

Public Attributes

 isjunk
 a
 b
 matching_blocks
 opcodes
 fullbcount
 b2j
 isbjunk
 isbpopular

Private Member Functions

def __chain_b

Detailed Description

SequenceMatcher is a flexible class for comparing pairs of sequences of
any type, so long as the sequence elements are hashable.  The basic
algorithm predates, and is a little fancier than, an algorithm
published in the late 1980's by Ratcliff and Obershelp under the
hyperbolic name "gestalt pattern matching".  The basic idea is to find
the longest contiguous matching subsequence that contains no "junk"
elements (R-O doesn't address junk).  The same idea is then applied
recursively to the pieces of the sequences to the left and to the right
of the matching subsequence.  This does not yield minimal edit
sequences, but does tend to yield matches that "look right" to people.

SequenceMatcher tries to compute a "human-friendly diff" between two
sequences.  Unlike e.g. UNIX(tm) diff, the fundamental notion is the
longest *contiguous* & junk-free matching subsequence.  That's what
catches peoples' eyes.  The Windows(tm) windiff has another interesting
notion, pairing up elements that appear uniquely in each sequence.
That, and the method here, appear to yield more intuitive difference
reports than does diff.  This method appears to be the least vulnerable
to synching up on blocks of "junk lines", though (like blank lines in
ordinary text files, or maybe "<P>" lines in HTML files).  That may be
because this is the only method of the 3 that has a *concept* of
"junk" <wink>.

Example, comparing two strings, and considering blanks to be "junk":

>>> s = SequenceMatcher(lambda x: x == " ",
...                     "private Thread currentThread;",
...                     "private volatile Thread currentThread;")
>>>

.ratio() returns a float in [0, 1], measuring the "similarity" of the
sequences.  As a rule of thumb, a .ratio() value over 0.6 means the
sequences are close matches:

>>> print round(s.ratio(), 3)
0.866
>>>

If you're only interested in where the sequences match,
.get_matching_blocks() is handy:

>>> for block in s.get_matching_blocks():
...     print "a[%d] and b[%d] match for %d elements" % block
a[0] and b[0] match for 8 elements
a[8] and b[17] match for 21 elements
a[29] and b[38] match for 0 elements

Note that the last tuple returned by .get_matching_blocks() is always a
dummy, (len(a), len(b), 0), and this is the only case in which the last
tuple element (number of elements matched) is 0.

If you want to know how to change the first sequence into the second,
use .get_opcodes():

>>> for opcode in s.get_opcodes():
...     print "%6s a[%d:%d] b[%d:%d]" % opcode
 equal a[0:8] b[0:8]
insert a[8:8] b[8:17]
 equal a[8:29] b[17:38]

See the Differ class for a fancy human-friendly file differencer, which
uses SequenceMatcher both to compare sequences of lines, and to compare
sequences of characters within similar (near-matching) lines.

See also function get_close_matches() in this module, which shows how
simple code building on SequenceMatcher can be used to do useful work.

Timing:  Basic R-O is cubic time worst case and quadratic time expected
case.  SequenceMatcher is quadratic time for the worst case and has
expected-case behavior dependent in a complicated way on how many
elements the sequences have in common; best case time is linear.

Methods:

__init__(isjunk=None, a='', b='')
    Construct a SequenceMatcher.

set_seqs(a, b)
    Set the two sequences to be compared.

set_seq1(a)
    Set the first sequence to be compared.

set_seq2(b)
    Set the second sequence to be compared.

find_longest_match(alo, ahi, blo, bhi)
    Find longest matching block in a[alo:ahi] and b[blo:bhi].

get_matching_blocks()
    Return list of triples describing matching subsequences.

get_opcodes()
    Return list of 5-tuples describing how to turn a into b.

ratio()
    Return a measure of the sequences' similarity (float in [0,1]).

quick_ratio()
    Return an upper bound on .ratio() relatively quickly.

real_quick_ratio()
    Return an upper bound on ratio() very quickly.

Definition at line 45 of file difflib.py.


Constructor & Destructor Documentation

def MoinMoin.support.difflib.SequenceMatcher.__init__ (   self,
  isjunk = None,
  a = '',
  b = '' 
)
Construct a SequenceMatcher.

Optional arg isjunk is None (the default), or a one-argument
function that takes a sequence element and returns true iff the
element is junk.  None is equivalent to passing "lambda x: 0", i.e.
no elements are considered to be junk.  For example, pass
    lambda x: x in " \\t"
if you're comparing lines as sequences of characters, and don't
want to synch up on blanks or hard tabs.

Optional arg a is the first of two sequences to be compared.  By
default, an empty string.  The elements of a must be hashable.  See
also .set_seqs() and .set_seq1().

Optional arg b is the second of two sequences to be compared.  By
default, an empty string.  The elements of b must be hashable. See
also .set_seqs() and .set_seq2().

Definition at line 153 of file difflib.py.

00153 
00154     def __init__(self, isjunk=None, a='', b=''):
00155         """Construct a SequenceMatcher.
00156 
00157         Optional arg isjunk is None (the default), or a one-argument
00158         function that takes a sequence element and returns true iff the
00159         element is junk.  None is equivalent to passing "lambda x: 0", i.e.
00160         no elements are considered to be junk.  For example, pass
00161             lambda x: x in " \\t"
00162         if you're comparing lines as sequences of characters, and don't
00163         want to synch up on blanks or hard tabs.
00164 
00165         Optional arg a is the first of two sequences to be compared.  By
00166         default, an empty string.  The elements of a must be hashable.  See
00167         also .set_seqs() and .set_seq1().
00168 
00169         Optional arg b is the second of two sequences to be compared.  By
00170         default, an empty string.  The elements of b must be hashable. See
00171         also .set_seqs() and .set_seq2().
00172         """
00173 
00174         # Members:
00175         # a
00176         #      first sequence
00177         # b
00178         #      second sequence; differences are computed as "what do
00179         #      we need to do to 'a' to change it into 'b'?"
00180         # b2j
00181         #      for x in b, b2j[x] is a list of the indices (into b)
00182         #      at which x appears; junk elements do not appear
00183         # fullbcount
00184         #      for x in b, fullbcount[x] == the number of times x
00185         #      appears in b; only materialized if really needed (used
00186         #      only for computing quick_ratio())
00187         # matching_blocks
00188         #      a list of (i, j, k) triples, where a[i:i+k] == b[j:j+k];
00189         #      ascending & non-overlapping in i and in j; terminated by
00190         #      a dummy (len(a), len(b), 0) sentinel
00191         # opcodes
00192         #      a list of (tag, i1, i2, j1, j2) tuples, where tag is
00193         #      one of
00194         #          'replace'   a[i1:i2] should be replaced by b[j1:j2]
00195         #          'delete'    a[i1:i2] should be deleted
00196         #          'insert'    b[j1:j2] should be inserted
00197         #          'equal'     a[i1:i2] == b[j1:j2]
00198         # isjunk
00199         #      a user-supplied function taking a sequence element and
00200         #      returning true iff the element is "junk" -- this has
00201         #      subtle but helpful effects on the algorithm, which I'll
00202         #      get around to writing up someday <0.9 wink>.
00203         #      DON'T USE!  Only __chain_b uses this.  Use isbjunk.
00204         # isbjunk
00205         #      for x in b, isbjunk(x) == isjunk(x) but much faster;
00206         #      it's really the has_key method of a hidden dict.
00207         #      DOES NOT WORK for x in a!
00208         # isbpopular
00209         #      for x in b, isbpopular(x) is true iff b is reasonably long
00210         #      (at least 200 elements) and x accounts for more than 1% of
00211         #      its elements.  DOES NOT WORK for x in a!
00212 
00213         self.isjunk = isjunk
00214         self.a = self.b = None
00215         self.set_seqs(a, b)


Member Function Documentation

Definition at line 299 of file difflib.py.

00299 
00300     def __chain_b(self):
00301         # Because isjunk is a user-defined (not C) function, and we test
00302         # for junk a LOT, it's important to minimize the number of calls.
00303         # Before the tricks described here, __chain_b was by far the most
00304         # time-consuming routine in the whole module!  If anyone sees
00305         # Jim Roskind, thank him again for profile.py -- I never would
00306         # have guessed that.
00307         # The first trick is to build b2j ignoring the possibility
00308         # of junk.  I.e., we don't call isjunk at all yet.  Throwing
00309         # out the junk later is much cheaper than building b2j "right"
00310         # from the start.
00311         b = self.b
00312         n = len(b)
00313         self.b2j = b2j = {}
00314         populardict = {}
00315         for i, elt in enumerate(b):
00316             if elt in b2j:
00317                 indices = b2j[elt]
00318                 if n >= 200 and len(indices) * 100 > n:
00319                     populardict[elt] = 1
00320                     del indices[:]
00321                 else:
00322                     indices.append(i)
00323             else:
00324                 b2j[elt] = [i]
00325 
00326         # Purge leftover indices for popular elements.
00327         for elt in populardict:
00328             del b2j[elt]
00329 
00330         # Now b2j.keys() contains elements uniquely, and especially when
00331         # the sequence is a string, that's usually a good deal smaller
00332         # than len(string).  The difference is the number of isjunk calls
00333         # saved.
00334         isjunk = self.isjunk
00335         junkdict = {}
00336         if isjunk:
00337             for d in populardict, b2j:
00338                 for elt in d.keys():
00339                     if isjunk(elt):
00340                         junkdict[elt] = 1
00341                         del d[elt]
00342 
00343         # Now for x in b, isjunk(x) == x in junkdict, but the
00344         # latter is much faster.  Note too that while there may be a
00345         # lot of junk in the sequence, the number of *unique* junk
00346         # elements is probably small.  So the memory burden of keeping
00347         # this dict alive is likely trivial compared to the size of b2j.
00348         self.isbjunk = junkdict.has_key
00349         self.isbpopular = populardict.has_key

def MoinMoin.support.difflib.SequenceMatcher.find_longest_match (   self,
  alo,
  ahi,
  blo,
  bhi 
)
Find longest matching block in a[alo:ahi] and b[blo:bhi].

If isjunk is not defined:

Return (i,j,k) such that a[i:i+k] is equal to b[j:j+k], where
    alo <= i <= i+k <= ahi
    blo <= j <= j+k <= bhi
and for all (i',j',k') meeting those conditions,
    k >= k'
    i <= i'
    and if i == i', j <= j'

In other words, of all maximal matching blocks, return one that
starts earliest in a, and of all those maximal matching blocks that
start earliest in a, return the one that starts earliest in b.

>>> s = SequenceMatcher(None, " abcd", "abcd abcd")
>>> s.find_longest_match(0, 5, 0, 9)
(0, 4, 5)

If isjunk is defined, first the longest matching block is
determined as above, but with the additional restriction that no
junk element appears in the block.  Then that block is extended as
far as possible by matching (only) junk elements on both sides.  So
the resulting block never matches on junk except as identical junk
happens to be adjacent to an "interesting" match.

Here's the same example as before, but considering blanks to be
junk.  That prevents " abcd" from matching the " abcd" at the tail
end of the second sequence directly.  Instead only the "abcd" can
match, and matches the leftmost "abcd" in the second sequence:

>>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd")
>>> s.find_longest_match(0, 5, 0, 9)
(1, 0, 4)

If no blocks match, return (alo, blo, 0).

>>> s = SequenceMatcher(None, "ab", "c")
>>> s.find_longest_match(0, 2, 0, 1)
(0, 0, 0)

Definition at line 350 of file difflib.py.

00350 
00351     def find_longest_match(self, alo, ahi, blo, bhi):
00352         """Find longest matching block in a[alo:ahi] and b[blo:bhi].
00353 
00354         If isjunk is not defined:
00355 
00356         Return (i,j,k) such that a[i:i+k] is equal to b[j:j+k], where
00357             alo <= i <= i+k <= ahi
00358             blo <= j <= j+k <= bhi
00359         and for all (i',j',k') meeting those conditions,
00360             k >= k'
00361             i <= i'
00362             and if i == i', j <= j'
00363 
00364         In other words, of all maximal matching blocks, return one that
00365         starts earliest in a, and of all those maximal matching blocks that
00366         start earliest in a, return the one that starts earliest in b.
00367 
00368         >>> s = SequenceMatcher(None, " abcd", "abcd abcd")
00369         >>> s.find_longest_match(0, 5, 0, 9)
00370         (0, 4, 5)
00371 
00372         If isjunk is defined, first the longest matching block is
00373         determined as above, but with the additional restriction that no
00374         junk element appears in the block.  Then that block is extended as
00375         far as possible by matching (only) junk elements on both sides.  So
00376         the resulting block never matches on junk except as identical junk
00377         happens to be adjacent to an "interesting" match.
00378 
00379         Here's the same example as before, but considering blanks to be
00380         junk.  That prevents " abcd" from matching the " abcd" at the tail
00381         end of the second sequence directly.  Instead only the "abcd" can
00382         match, and matches the leftmost "abcd" in the second sequence:
00383 
00384         >>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd")
00385         >>> s.find_longest_match(0, 5, 0, 9)
00386         (1, 0, 4)
00387 
00388         If no blocks match, return (alo, blo, 0).
00389 
00390         >>> s = SequenceMatcher(None, "ab", "c")
00391         >>> s.find_longest_match(0, 2, 0, 1)
00392         (0, 0, 0)
00393         """
00394 
00395         # CAUTION:  stripping common prefix or suffix would be incorrect.
00396         # E.g.,
00397         #    ab
00398         #    acab
00399         # Longest matching block is "ab", but if common prefix is
00400         # stripped, it's "a" (tied with "b").  UNIX(tm) diff does so
00401         # strip, so ends up claiming that ab is changed to acab by
00402         # inserting "ca" in the middle.  That's minimal but unintuitive:
00403         # "it's obvious" that someone inserted "ac" at the front.
00404         # Windiff ends up at the same place as diff, but by pairing up
00405         # the unique 'b's and then matching the first two 'a's.
00406 
00407         a, b, b2j, isbjunk = self.a, self.b, self.b2j, self.isbjunk
00408         besti, bestj, bestsize = alo, blo, 0
00409         # find longest junk-free match
00410         # during an iteration of the loop, j2len[j] = length of longest
00411         # junk-free match ending with a[i-1] and b[j]
00412         j2len = {}
00413         nothing = []
00414         for i in xrange(alo, ahi):
00415             # look at all instances of a[i] in b; note that because
00416             # b2j has no junk keys, the loop is skipped if a[i] is junk
00417             j2lenget = j2len.get
00418             newj2len = {}
00419             for j in b2j.get(a[i], nothing):
00420                 # a[i] matches b[j]
00421                 if j < blo:
00422                     continue
00423                 if j >= bhi:
00424                     break
00425                 k = newj2len[j] = j2lenget(j-1, 0) + 1
00426                 if k > bestsize:
00427                     besti, bestj, bestsize = i-k+1, j-k+1, k
00428             j2len = newj2len
00429 
00430         # Extend the best by non-junk elements on each end.  In particular,
00431         # "popular" non-junk elements aren't in b2j, which greatly speeds
00432         # the inner loop above, but also means "the best" match so far
00433         # doesn't contain any junk *or* popular non-junk elements.
00434         while besti > alo and bestj > blo and \
00435               not isbjunk(b[bestj-1]) and \
00436               a[besti-1] == b[bestj-1]:
00437             besti, bestj, bestsize = besti-1, bestj-1, bestsize+1
00438         while besti+bestsize < ahi and bestj+bestsize < bhi and \
00439               not isbjunk(b[bestj+bestsize]) and \
00440               a[besti+bestsize] == b[bestj+bestsize]:
00441             bestsize += 1
00442 
00443         # Now that we have a wholly interesting match (albeit possibly
00444         # empty!), we may as well suck up the matching junk on each
00445         # side of it too.  Can't think of a good reason not to, and it
00446         # saves post-processing the (possibly considerable) expense of
00447         # figuring out what to do with it.  In the case of an empty
00448         # interesting match, this is clearly the right thing to do,
00449         # because no other kind of match is possible in the regions.
00450         while besti > alo and bestj > blo and \
00451               isbjunk(b[bestj-1]) and \
00452               a[besti-1] == b[bestj-1]:
00453             besti, bestj, bestsize = besti-1, bestj-1, bestsize+1
00454         while besti+bestsize < ahi and bestj+bestsize < bhi and \
00455               isbjunk(b[bestj+bestsize]) and \
00456               a[besti+bestsize] == b[bestj+bestsize]:
00457             bestsize = bestsize + 1
00458 
00459         return besti, bestj, bestsize

Here is the caller graph for this function:

Isolate change clusters by eliminating ranges with no changes.

Return a generator of groups with upto n lines of context.
Each group is in the same format as returned by get_opcodes().

>>> from pprint import pprint
>>> a = map(str, range(1,40))
>>> b = a[:]
>>> b[8:8] = ['i']     # Make an insertion
>>> b[20] += 'x'       # Make a replacement
>>> b[23:28] = []      # Make a deletion
>>> b[30] += 'y'       # Make another replacement
>>> pprint(list(SequenceMatcher(None,a,b).get_grouped_opcodes()))
[[('equal', 5, 8, 5, 8), ('insert', 8, 8, 8, 9), ('equal', 8, 11, 9, 12)],
 [('equal', 16, 19, 17, 20),
  ('replace', 19, 20, 20, 21),
  ('equal', 20, 22, 21, 23),
  ('delete', 22, 27, 23, 23),
  ('equal', 27, 30, 23, 26)],
 [('equal', 31, 34, 27, 30),
  ('replace', 34, 35, 30, 31),
  ('equal', 35, 38, 31, 34)]]

Definition at line 586 of file difflib.py.

00586 
00587     def get_grouped_opcodes(self, n=3):
00588         """ Isolate change clusters by eliminating ranges with no changes.
00589 
00590         Return a generator of groups with upto n lines of context.
00591         Each group is in the same format as returned by get_opcodes().
00592 
00593         >>> from pprint import pprint
00594         >>> a = map(str, range(1,40))
00595         >>> b = a[:]
00596         >>> b[8:8] = ['i']     # Make an insertion
00597         >>> b[20] += 'x'       # Make a replacement
00598         >>> b[23:28] = []      # Make a deletion
00599         >>> b[30] += 'y'       # Make another replacement
00600         >>> pprint(list(SequenceMatcher(None,a,b).get_grouped_opcodes()))
00601         [[('equal', 5, 8, 5, 8), ('insert', 8, 8, 8, 9), ('equal', 8, 11, 9, 12)],
00602          [('equal', 16, 19, 17, 20),
00603           ('replace', 19, 20, 20, 21),
00604           ('equal', 20, 22, 21, 23),
00605           ('delete', 22, 27, 23, 23),
00606           ('equal', 27, 30, 23, 26)],
00607          [('equal', 31, 34, 27, 30),
00608           ('replace', 34, 35, 30, 31),
00609           ('equal', 35, 38, 31, 34)]]
00610         """
00611 
00612         codes = self.get_opcodes()
00613         if not codes:
00614             codes = [("equal", 0, 1, 0, 1)]
00615         # Fixup leading and trailing groups if they show no changes.
00616         if codes[0][0] == 'equal':
00617             tag, i1, i2, j1, j2 = codes[0]
00618             codes[0] = tag, max(i1, i2-n), i2, max(j1, j2-n), j2
00619         if codes[-1][0] == 'equal':
00620             tag, i1, i2, j1, j2 = codes[-1]
00621             codes[-1] = tag, i1, min(i2, i1+n), j1, min(j2, j1+n)
00622 
00623         nn = n + n
00624         group = []
00625         for tag, i1, i2, j1, j2 in codes:
00626             # End the current group and start a new one whenever
00627             # there is a large range with no changes.
00628             if tag == 'equal' and i2-i1 > nn:
00629                 group.append((tag, i1, min(i2, i1+n), j1, min(j2, j1+n)))
00630                 yield group
00631                 group = []
00632                 i1, j1 = max(i1, i2-n), max(j1, j2-n)
00633             group.append((tag, i1, i2, j1 ,j2))
00634         if group and not (len(group)==1 and group[0][0] == 'equal'):
00635             yield group

Here is the call graph for this function:

Return list of triples describing matching subsequences.

Each triple is of the form (i, j, n), and means that
a[i:i+n] == b[j:j+n].  The triples are monotonically increasing in
i and in j.  New in Python 2.5, it's also guaranteed that if
(i, j, n) and (i', j', n') are adjacent triples in the list, and
the second is not the last triple in the list, then i+n != i' or
j+n != j'.  IOW, adjacent triples never describe adjacent equal
blocks.

The last triple is a dummy, (len(a), len(b), 0), and is the only
triple with n==0.

>>> s = SequenceMatcher(None, "abxcd", "abcd")
>>> s.get_matching_blocks()
[(0, 0, 2), (3, 2, 2), (5, 4, 0)]

Definition at line 460 of file difflib.py.

00460 
00461     def get_matching_blocks(self):
00462         """Return list of triples describing matching subsequences.
00463 
00464         Each triple is of the form (i, j, n), and means that
00465         a[i:i+n] == b[j:j+n].  The triples are monotonically increasing in
00466         i and in j.  New in Python 2.5, it's also guaranteed that if
00467         (i, j, n) and (i', j', n') are adjacent triples in the list, and
00468         the second is not the last triple in the list, then i+n != i' or
00469         j+n != j'.  IOW, adjacent triples never describe adjacent equal
00470         blocks.
00471 
00472         The last triple is a dummy, (len(a), len(b), 0), and is the only
00473         triple with n==0.
00474 
00475         >>> s = SequenceMatcher(None, "abxcd", "abcd")
00476         >>> s.get_matching_blocks()
00477         [(0, 0, 2), (3, 2, 2), (5, 4, 0)]
00478         """
00479 
00480         if self.matching_blocks is not None:
00481             return self.matching_blocks
00482         la, lb = len(self.a), len(self.b)
00483 
00484         # This is most naturally expressed as a recursive algorithm, but
00485         # at least one user bumped into extreme use cases that exceeded
00486         # the recursion limit on their box.  So, now we maintain a list
00487         # ('queue`) of blocks we still need to look at, and append partial
00488         # results to `matching_blocks` in a loop; the matches are sorted
00489         # at the end.
00490         queue = [(0, la, 0, lb)]
00491         matching_blocks = []
00492         while queue:
00493             alo, ahi, blo, bhi = queue.pop()
00494             i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
00495             # a[alo:i] vs b[blo:j] unknown
00496             # a[i:i+k] same as b[j:j+k]
00497             # a[i+k:ahi] vs b[j+k:bhi] unknown
00498             if k:   # if k is 0, there was no matching block
00499                 matching_blocks.append(x)
00500                 if alo < i and blo < j:
00501                     queue.append((alo, i, blo, j))
00502                 if i+k < ahi and j+k < bhi:
00503                     queue.append((i+k, ahi, j+k, bhi))
00504         matching_blocks.sort()
00505 
00506         # It's possible that we have adjacent equal blocks in the
00507         # matching_blocks list now.  Starting with 2.5, this code was added
00508         # to collapse them.
00509         i1 = j1 = k1 = 0
00510         non_adjacent = []
00511         for i2, j2, k2 in matching_blocks:
00512             # Is this block adjacent to i1, j1, k1?
00513             if i1 + k1 == i2 and j1 + k1 == j2:
00514                 # Yes, so collapse them -- this just increases the length of
00515                 # the first block by the length of the second, and the first
00516                 # block so lengthened remains the block to compare against.
00517                 k1 += k2
00518             else:
00519                 # Not adjacent.  Remember the first block (k1==0 means it's
00520                 # the dummy we started with), and make the second block the
00521                 # new block to compare against.
00522                 if k1:
00523                     non_adjacent.append((i1, j1, k1))
00524                 i1, j1, k1 = i2, j2, k2
00525         if k1:
00526             non_adjacent.append((i1, j1, k1))
00527 
00528         non_adjacent.append( (la, lb, 0) )
00529         self.matching_blocks = non_adjacent
00530         return self.matching_blocks

Here is the call graph for this function:

Here is the caller graph for this function:

Return list of 5-tuples describing how to turn a into b.

Each tuple is of the form (tag, i1, i2, j1, j2).  The first tuple
has i1 == j1 == 0, and remaining tuples have i1 == the i2 from the
tuple preceding it, and likewise for j1 == the previous j2.

The tags are strings, with these meanings:

'replace':  a[i1:i2] should be replaced by b[j1:j2]
'delete':   a[i1:i2] should be deleted.
    Note that j1==j2 in this case.
'insert':   b[j1:j2] should be inserted at a[i1:i1].
    Note that i1==i2 in this case.
'equal':    a[i1:i2] == b[j1:j2]

>>> a = "qabxcd"
>>> b = "abycdf"
>>> s = SequenceMatcher(None, a, b)
>>> for tag, i1, i2, j1, j2 in s.get_opcodes():
...    print ("%7s a[%d:%d] (%s) b[%d:%d] (%s)" %
...           (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2]))
 delete a[0:1] (q) b[0:0] ()
  equal a[1:3] (ab) b[0:2] (ab)
replace a[3:4] (x) b[2:3] (y)
  equal a[4:6] (cd) b[3:5] (cd)
 insert a[6:6] () b[5:6] (f)

Definition at line 531 of file difflib.py.

00531 
00532     def get_opcodes(self):
00533         """Return list of 5-tuples describing how to turn a into b.
00534 
00535         Each tuple is of the form (tag, i1, i2, j1, j2).  The first tuple
00536         has i1 == j1 == 0, and remaining tuples have i1 == the i2 from the
00537         tuple preceding it, and likewise for j1 == the previous j2.
00538 
00539         The tags are strings, with these meanings:
00540 
00541         'replace':  a[i1:i2] should be replaced by b[j1:j2]
00542         'delete':   a[i1:i2] should be deleted.
00543                     Note that j1==j2 in this case.
00544         'insert':   b[j1:j2] should be inserted at a[i1:i1].
00545                     Note that i1==i2 in this case.
00546         'equal':    a[i1:i2] == b[j1:j2]
00547 
00548         >>> a = "qabxcd"
00549         >>> b = "abycdf"
00550         >>> s = SequenceMatcher(None, a, b)
00551         >>> for tag, i1, i2, j1, j2 in s.get_opcodes():
00552         ...    print ("%7s a[%d:%d] (%s) b[%d:%d] (%s)" %
00553         ...           (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2]))
00554          delete a[0:1] (q) b[0:0] ()
00555           equal a[1:3] (ab) b[0:2] (ab)
00556         replace a[3:4] (x) b[2:3] (y)
00557           equal a[4:6] (cd) b[3:5] (cd)
00558          insert a[6:6] () b[5:6] (f)
00559         """
00560 
00561         if self.opcodes is not None:
00562             return self.opcodes
00563         i = j = 0
00564         self.opcodes = answer = []
00565         for ai, bj, size in self.get_matching_blocks():
00566             # invariant:  we've pumped out correct diffs to change
00567             # a[:i] into b[:j], and the next matching block is
00568             # a[ai:ai+size] == b[bj:bj+size].  So we need to pump
00569             # out a diff to change a[i:ai] into b[j:bj], pump out
00570             # the matching block, and move (i,j) beyond the match
00571             tag = ''
00572             if i < ai and j < bj:
00573                 tag = 'replace'
00574             elif i < ai:
00575                 tag = 'delete'
00576             elif j < bj:
00577                 tag = 'insert'
00578             if tag:
00579                 answer.append( (tag, i, ai, j, bj) )
00580             i, j = ai+size, bj+size
00581             # the list of matching blocks is terminated by a
00582             # sentinel with size 0
00583             if size:
00584                 answer.append( ('equal', ai, i, bj, j) )
00585         return answer

Here is the call graph for this function:

Here is the caller graph for this function:

Return an upper bound on ratio() relatively quickly.

This isn't defined beyond that it is an upper bound on .ratio(), and
is faster to compute.

Definition at line 662 of file difflib.py.

00662 
00663     def quick_ratio(self):
00664         """Return an upper bound on ratio() relatively quickly.
00665 
00666         This isn't defined beyond that it is an upper bound on .ratio(), and
00667         is faster to compute.
00668         """
00669 
00670         # viewing a and b as multisets, set matches to the cardinality
00671         # of their intersection; this counts the number of matches
00672         # without regard to order, so is clearly an upper bound
00673         if self.fullbcount is None:
00674             self.fullbcount = fullbcount = {}
00675             for elt in self.b:
00676                 fullbcount[elt] = fullbcount.get(elt, 0) + 1
00677         fullbcount = self.fullbcount
00678         # avail[x] is the number of times x appears in 'b' less the
00679         # number of times we've seen it in 'a' so far ... kinda
00680         avail = {}
00681         availhas, matches = avail.has_key, 0
00682         for elt in self.a:
00683             if availhas(elt):
00684                 numb = avail[elt]
00685             else:
00686                 numb = fullbcount.get(elt, 0)
00687             avail[elt] = numb - 1
00688             if numb > 0:
00689                 matches = matches + 1
00690         return _calculate_ratio(matches, len(self.a) + len(self.b))

Here is the call graph for this function:

Return a measure of the sequences' similarity (float in [0,1]).

Where T is the total number of elements in both sequences, and
M is the number of matches, this is 2.0*M / T.
Note that this is 1 if the sequences are identical, and 0 if
they have nothing in common.

.ratio() is expensive to compute if you haven't already computed
.get_matching_blocks() or .get_opcodes(), in which case you may
want to try .quick_ratio() or .real_quick_ratio() first to get an
upper bound.

>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75
>>> s.quick_ratio()
0.75
>>> s.real_quick_ratio()
1.0

Definition at line 636 of file difflib.py.

00636 
00637     def ratio(self):
00638         """Return a measure of the sequences' similarity (float in [0,1]).
00639 
00640         Where T is the total number of elements in both sequences, and
00641         M is the number of matches, this is 2.0*M / T.
00642         Note that this is 1 if the sequences are identical, and 0 if
00643         they have nothing in common.
00644 
00645         .ratio() is expensive to compute if you haven't already computed
00646         .get_matching_blocks() or .get_opcodes(), in which case you may
00647         want to try .quick_ratio() or .real_quick_ratio() first to get an
00648         upper bound.
00649 
00650         >>> s = SequenceMatcher(None, "abcd", "bcde")
00651         >>> s.ratio()
00652         0.75
00653         >>> s.quick_ratio()
00654         0.75
00655         >>> s.real_quick_ratio()
00656         1.0
00657         """
00658 
00659         matches = reduce(lambda sum, triple: sum + triple[-1],
00660                          self.get_matching_blocks(), 0)
00661         return _calculate_ratio(matches, len(self.a) + len(self.b))

Here is the call graph for this function:

Return an upper bound on ratio() very quickly.

This isn't defined beyond that it is an upper bound on .ratio(), and
is faster to compute than either .ratio() or .quick_ratio().

Definition at line 691 of file difflib.py.

00691 
00692     def real_quick_ratio(self):
00693         """Return an upper bound on ratio() very quickly.
00694 
00695         This isn't defined beyond that it is an upper bound on .ratio(), and
00696         is faster to compute than either .ratio() or .quick_ratio().
00697         """
00698 
00699         la, lb = len(self.a), len(self.b)
00700         # can't have more matches than the number of elements in the
00701         # shorter sequence
00702         return _calculate_ratio(min(la, lb), la + lb)

Here is the call graph for this function:

Set the first sequence to be compared.

The second sequence to be compared is not changed.

>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75
>>> s.set_seq1("bcde")
>>> s.ratio()
1.0
>>>

SequenceMatcher computes and caches detailed information about the
second sequence, so if you want to compare one sequence S against
many sequences, use .set_seq2(S) once and call .set_seq1(x)
repeatedly for each of the other sequences.

See also set_seqs() and set_seq2().

Definition at line 228 of file difflib.py.

00228 
00229     def set_seq1(self, a):
00230         """Set the first sequence to be compared.
00231 
00232         The second sequence to be compared is not changed.
00233 
00234         >>> s = SequenceMatcher(None, "abcd", "bcde")
00235         >>> s.ratio()
00236         0.75
00237         >>> s.set_seq1("bcde")
00238         >>> s.ratio()
00239         1.0
00240         >>>
00241 
00242         SequenceMatcher computes and caches detailed information about the
00243         second sequence, so if you want to compare one sequence S against
00244         many sequences, use .set_seq2(S) once and call .set_seq1(x)
00245         repeatedly for each of the other sequences.
00246 
00247         See also set_seqs() and set_seq2().
00248         """
00249 
00250         if a is self.a:
00251             return
00252         self.a = a
00253         self.matching_blocks = self.opcodes = None

Here is the caller graph for this function:

Set the second sequence to be compared.

The first sequence to be compared is not changed.

>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75
>>> s.set_seq2("abcd")
>>> s.ratio()
1.0
>>>

SequenceMatcher computes and caches detailed information about the
second sequence, so if you want to compare one sequence S against
many sequences, use .set_seq2(S) once and call .set_seq1(x)
repeatedly for each of the other sequences.

See also set_seqs() and set_seq1().

Definition at line 254 of file difflib.py.

00254 
00255     def set_seq2(self, b):
00256         """Set the second sequence to be compared.
00257 
00258         The first sequence to be compared is not changed.
00259 
00260         >>> s = SequenceMatcher(None, "abcd", "bcde")
00261         >>> s.ratio()
00262         0.75
00263         >>> s.set_seq2("abcd")
00264         >>> s.ratio()
00265         1.0
00266         >>>
00267 
00268         SequenceMatcher computes and caches detailed information about the
00269         second sequence, so if you want to compare one sequence S against
00270         many sequences, use .set_seq2(S) once and call .set_seq1(x)
00271         repeatedly for each of the other sequences.
00272 
00273         See also set_seqs() and set_seq1().
00274         """
00275 
00276         if b is self.b:
00277             return
00278         self.b = b
00279         self.matching_blocks = self.opcodes = None
00280         self.fullbcount = None
00281         self.__chain_b()

Here is the caller graph for this function:

Set the two sequences to be compared.

>>> s = SequenceMatcher()
>>> s.set_seqs("abcd", "bcde")
>>> s.ratio()
0.75

Definition at line 216 of file difflib.py.

00216 
00217     def set_seqs(self, a, b):
00218         """Set the two sequences to be compared.
00219 
00220         >>> s = SequenceMatcher()
00221         >>> s.set_seqs("abcd", "bcde")
00222         >>> s.ratio()
00223         0.75
00224         """
00225 
00226         self.set_seq1(a)
00227         self.set_seq2(b)

Here is the call graph for this function:


Member Data Documentation

Definition at line 213 of file difflib.py.

Definition at line 213 of file difflib.py.

Definition at line 312 of file difflib.py.

Definition at line 279 of file difflib.py.

Definition at line 347 of file difflib.py.

Definition at line 348 of file difflib.py.

Definition at line 212 of file difflib.py.

Definition at line 252 of file difflib.py.

Definition at line 252 of file difflib.py.


The documentation for this class was generated from the following file: