python3.2  3.2.2
difflib.SequenceMatcher Class Reference

Public Member Functions

def __init__
def set_seqs
def set_seq1
def set_seq2
def isbjunk
def isbpopular
def find_longest_match
def get_matching_blocks
def get_opcodes
def get_grouped_opcodes
def ratio
def quick_ratio
def real_quick_ratio

Public Attributes

 isjunk
 a
 b
 autojunk
 matching_blocks
 opcodes
 fullbcount
 b2j
 bjunk
 bpopular

Private Member Functions

def __chain_b

Detailed Description

SequenceMatcher is a flexible class for comparing pairs of sequences of
any type, so long as the sequence elements are hashable.  The basic
algorithm predates, and is a little fancier than, an algorithm
published in the late 1980's by Ratcliff and Obershelp under the
hyperbolic name "gestalt pattern matching".  The basic idea is to find
the longest contiguous matching subsequence that contains no "junk"
elements (R-O doesn't address junk).  The same idea is then applied
recursively to the pieces of the sequences to the left and to the right
of the matching subsequence.  This does not yield minimal edit
sequences, but does tend to yield matches that "look right" to people.

SequenceMatcher tries to compute a "human-friendly diff" between two
sequences.  Unlike e.g. UNIX(tm) diff, the fundamental notion is the
longest *contiguous* & junk-free matching subsequence.  That's what
catches people's eyes.  The Windows(tm) windiff has another interesting
notion, pairing up elements that appear uniquely in each sequence.
That, and the method here, appear to yield more intuitive difference
reports than does diff.  This method appears to be the least vulnerable
to synching up on blocks of "junk lines", though (like blank lines in
ordinary text files, or maybe "<P>" lines in HTML files).  That may be
because this is the only method of the 3 that has a *concept* of
"junk" <wink>.

Example, comparing two strings, and considering blanks to be "junk":

>>> s = SequenceMatcher(lambda x: x == " ",
...                     "private Thread currentThread;",
...                     "private volatile Thread currentThread;")
>>>

.ratio() returns a float in [0, 1], measuring the "similarity" of the
sequences.  As a rule of thumb, a .ratio() value over 0.6 means the
sequences are close matches:

>>> print(round(s.ratio(), 3))
0.866
>>>

If you're only interested in where the sequences match,
.get_matching_blocks() is handy:

>>> for block in s.get_matching_blocks():
...     print("a[%d] and b[%d] match for %d elements" % block)
a[0] and b[0] match for 8 elements
a[8] and b[17] match for 21 elements
a[29] and b[38] match for 0 elements

Note that the last tuple returned by .get_matching_blocks() is always a
dummy, (len(a), len(b), 0), and this is the only case in which the last
tuple element (number of elements matched) is 0.

If you want to know how to change the first sequence into the second,
use .get_opcodes():

>>> for opcode in s.get_opcodes():
...     print("%6s a[%d:%d] b[%d:%d]" % opcode)
 equal a[0:8] b[0:8]
insert a[8:8] b[8:17]
 equal a[8:29] b[17:38]

See the Differ class for a fancy human-friendly file differencer, which
uses SequenceMatcher both to compare sequences of lines, and to compare
sequences of characters within similar (near-matching) lines.

See also function get_close_matches() in this module, which shows how
simple code building on SequenceMatcher can be used to do useful work.
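
The sketch below is in the spirit of get_close_matches(), not its actual
implementation (the word list is made-up sample data): score each candidate
with .ratio() and keep the ones that clear a cutoff, best first.

from difflib import SequenceMatcher

def closest(word, candidates, cutoff=0.6):
    # a hypothetical helper, shown only to illustrate the idea
    scored = [(SequenceMatcher(None, word, c).ratio(), c) for c in candidates]
    return [c for score, c in sorted(scored, reverse=True) if score >= cutoff]

print(closest("appel", ["ape", "apple", "peach", "puppy"]))  # ['apple', 'ape']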

Timing:  Basic R-O is cubic time worst case and quadratic time expected
case.  SequenceMatcher is quadratic time for the worst case and has
expected-case behavior dependent in a complicated way on how many
elements the sequences have in common; best case time is linear.

Methods:

__init__(isjunk=None, a='', b='')
    Construct a SequenceMatcher.

set_seqs(a, b)
    Set the two sequences to be compared.

set_seq1(a)
    Set the first sequence to be compared.

set_seq2(b)
    Set the second sequence to be compared.

find_longest_match(alo, ahi, blo, bhi)
    Find longest matching block in a[alo:ahi] and b[blo:bhi].

get_matching_blocks()
    Return list of triples describing matching subsequences.

get_opcodes()
    Return list of 5-tuples describing how to turn a into b.

ratio()
    Return a measure of the sequences' similarity (float in [0,1]).

quick_ratio()
    Return an upper bound on .ratio() relatively quickly.

real_quick_ratio()
    Return an upper bound on ratio() very quickly.

Definition at line 46 of file difflib.py.


Constructor & Destructor Documentation

def difflib.SequenceMatcher.__init__ (   self,
  isjunk = None,
  a = '',
  b = '',
  autojunk = True 
)
Construct a SequenceMatcher.

Optional arg isjunk is None (the default), or a one-argument
function that takes a sequence element and returns true iff the
element is junk.  None is equivalent to passing "lambda x: 0", i.e.
no elements are considered to be junk.  For example, pass
    lambda x: x in " \t"
if you're comparing lines as sequences of characters, and don't
want to synch up on blanks or hard tabs.

Optional arg a is the first of two sequences to be compared.  By
default, an empty string.  The elements of a must be hashable.  See
also .set_seqs() and .set_seq1().

Optional arg b is the second of two sequences to be compared.  By
default, an empty string.  The elements of b must be hashable. See
also .set_seqs() and .set_seq2().

Optional arg autojunk should be set to False to disable the
"automatic junk heuristic" that treats popular elements as junk
(see module documentation for more information).
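
A contrived illustration of that heuristic (the sequences are made up for
the example): in b, "x" fills far more than 1% of 200+ positions, so by
default it is treated like junk and the long common run is not counted as
a match; passing autojunk=False restores it.

from difflib import SequenceMatcher

a = ["header"] + ["x"] * 199
b = ["x"] * 199 + ["footer"]

print(round(SequenceMatcher(None, a, b).ratio(), 3))                  # 0.0
print(round(SequenceMatcher(None, a, b, autojunk=False).ratio(), 3))  # 0.995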

Definition at line 154 of file difflib.py.

00154 
00155     def __init__(self, isjunk=None, a='', b='', autojunk=True):
00156         """Construct a SequenceMatcher.
00157 
00158         Optional arg isjunk is None (the default), or a one-argument
00159         function that takes a sequence element and returns true iff the
00160         element is junk.  None is equivalent to passing "lambda x: 0", i.e.
00161         no elements are considered to be junk.  For example, pass
00162             lambda x: x in " \\t"
00163         if you're comparing lines as sequences of characters, and don't
00164         want to synch up on blanks or hard tabs.
00165 
00166         Optional arg a is the first of two sequences to be compared.  By
00167         default, an empty string.  The elements of a must be hashable.  See
00168         also .set_seqs() and .set_seq1().
00169 
00170         Optional arg b is the second of two sequences to be compared.  By
00171         default, an empty string.  The elements of b must be hashable. See
00172         also .set_seqs() and .set_seq2().
00173 
00174         Optional arg autojunk should be set to False to disable the
00175         "automatic junk heuristic" that treats popular elements as junk
00176         (see module documentation for more information).
00177         """
00178 
00179         # Members:
00180         # a
00181         #      first sequence
00182         # b
00183         #      second sequence; differences are computed as "what do
00184         #      we need to do to 'a' to change it into 'b'?"
00185         # b2j
00186         #      for x in b, b2j[x] is a list of the indices (into b)
00187         #      at which x appears; junk and popular elements do not appear
00188         # fullbcount
00189         #      for x in b, fullbcount[x] == the number of times x
00190         #      appears in b; only materialized if really needed (used
00191         #      only for computing quick_ratio())
00192         # matching_blocks
00193         #      a list of (i, j, k) triples, where a[i:i+k] == b[j:j+k];
00194         #      ascending & non-overlapping in i and in j; terminated by
00195         #      a dummy (len(a), len(b), 0) sentinel
00196         # opcodes
00197         #      a list of (tag, i1, i2, j1, j2) tuples, where tag is
00198         #      one of
00199         #          'replace'   a[i1:i2] should be replaced by b[j1:j2]
00200         #          'delete'    a[i1:i2] should be deleted
00201         #          'insert'    b[j1:j2] should be inserted
00202         #          'equal'     a[i1:i2] == b[j1:j2]
00203         # isjunk
00204         #      a user-supplied function taking a sequence element and
00205         #      returning true iff the element is "junk" -- this has
00206         #      subtle but helpful effects on the algorithm, which I'll
00207         #      get around to writing up someday <0.9 wink>.
00208         #      DON'T USE!  Only __chain_b uses this.  Use isbjunk.
00209         # bjunk
00210         #      the items in b for which isjunk is True.
00211         # bpopular
00212         #      nonjunk items in b treated as junk by the heuristic (if used).
00213 
00214         self.isjunk = isjunk
00215         self.a = self.b = None
00216         self.autojunk = autojunk
00217         self.set_seqs(a, b)


Member Function Documentation

def difflib.SequenceMatcher.__chain_b (   self) [private]

Definition at line 301 of file difflib.py.

00301 
00302     def __chain_b(self):
00303         # Because isjunk is a user-defined (not C) function, and we test
00304         # for junk a LOT, it's important to minimize the number of calls.
00305         # Before the tricks described here, __chain_b was by far the most
00306         # time-consuming routine in the whole module!  If anyone sees
00307         # Jim Roskind, thank him again for profile.py -- I never would
00308         # have guessed that.
00309         # The first trick is to build b2j ignoring the possibility
00310         # of junk.  I.e., we don't call isjunk at all yet.  Throwing
00311         # out the junk later is much cheaper than building b2j "right"
00312         # from the start.
00313         b = self.b
00314         self.b2j = b2j = {}
00315 
00316         for i, elt in enumerate(b):
00317             indices = b2j.setdefault(elt, [])
00318             indices.append(i)
00319 
00320         # Purge junk elements
00321         self.bjunk = junk = set()
00322         isjunk = self.isjunk
00323         if isjunk:
00324             for elt in b2j.keys():
00325                 if isjunk(elt):
00326                     junk.add(elt)
00327             for elt in junk: # separate loop avoids separate list of keys
00328                 del b2j[elt]
00329 
00330         # Purge popular elements that are not junk
00331         self.bpopular = popular = set()
00332         n = len(b)
00333         if self.autojunk and n >= 200:
00334             ntest = n // 100 + 1
00335             for elt, idxs in b2j.items():
00336                 if len(idxs) > ntest:
00337                     popular.add(elt)
00338             for elt in popular: # ditto; as fast for 1% deletion
00339                 del b2j[elt]
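
A small sketch of what the code above leaves behind on the instance (the
element counts are chosen so the autojunk branch fires: len(b) >= 200 and
"x" occupies more than 1% of the positions):

from difflib import SequenceMatcher

b = ["x"] * 50 + [str(i) for i in range(150)]    # 200 elements; "x" fills a quarter of them
s = SequenceMatcher(lambda line: line == "", [], b)

print(sorted(s.bjunk))      # [] -- no element of b is an empty string
print(sorted(s.bpopular))   # ['x'] -- autojunk treats "x" as if it were junk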

def difflib.SequenceMatcher.find_longest_match (   self,
  alo,
  ahi,
  blo,
  bhi 
)
Find longest matching block in a[alo:ahi] and b[blo:bhi].

If isjunk is not defined:

Return (i,j,k) such that a[i:i+k] is equal to b[j:j+k], where
    alo <= i <= i+k <= ahi
    blo <= j <= j+k <= bhi
and for all (i',j',k') meeting those conditions,
    k >= k'
    i <= i'
    and if i == i', j <= j'

In other words, of all maximal matching blocks, return one that
starts earliest in a, and of all those maximal matching blocks that
start earliest in a, return the one that starts earliest in b.

>>> s = SequenceMatcher(None, " abcd", "abcd abcd")
>>> s.find_longest_match(0, 5, 0, 9)
Match(a=0, b=4, size=5)

If isjunk is defined, first the longest matching block is
determined as above, but with the additional restriction that no
junk element appears in the block.  Then that block is extended as
far as possible by matching (only) junk elements on both sides.  So
the resulting block never matches on junk except as identical junk
happens to be adjacent to an "interesting" match.

Here's the same example as before, but considering blanks to be
junk.  That prevents " abcd" from matching the " abcd" at the tail
end of the second sequence directly.  Instead only the "abcd" can
match, and matches the leftmost "abcd" in the second sequence:

>>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd")
>>> s.find_longest_match(0, 5, 0, 9)
Match(a=1, b=0, size=4)

If no blocks match, return (alo, blo, 0).

>>> s = SequenceMatcher(None, "ab", "c")
>>> s.find_longest_match(0, 2, 0, 1)
Match(a=0, b=0, size=0)

Definition at line 354 of file difflib.py.

00354 
00355     def find_longest_match(self, alo, ahi, blo, bhi):
00356         """Find longest matching block in a[alo:ahi] and b[blo:bhi].
00357 
00358         If isjunk is not defined:
00359 
00360         Return (i,j,k) such that a[i:i+k] is equal to b[j:j+k], where
00361             alo <= i <= i+k <= ahi
00362             blo <= j <= j+k <= bhi
00363         and for all (i',j',k') meeting those conditions,
00364             k >= k'
00365             i <= i'
00366             and if i == i', j <= j'
00367 
00368         In other words, of all maximal matching blocks, return one that
00369         starts earliest in a, and of all those maximal matching blocks that
00370         start earliest in a, return the one that starts earliest in b.
00371 
00372         >>> s = SequenceMatcher(None, " abcd", "abcd abcd")
00373         >>> s.find_longest_match(0, 5, 0, 9)
00374         Match(a=0, b=4, size=5)
00375 
00376         If isjunk is defined, first the longest matching block is
00377         determined as above, but with the additional restriction that no
00378         junk element appears in the block.  Then that block is extended as
00379         far as possible by matching (only) junk elements on both sides.  So
00380         the resulting block never matches on junk except as identical junk
00381         happens to be adjacent to an "interesting" match.
00382 
00383         Here's the same example as before, but considering blanks to be
00384         junk.  That prevents " abcd" from matching the " abcd" at the tail
00385         end of the second sequence directly.  Instead only the "abcd" can
00386         match, and matches the leftmost "abcd" in the second sequence:
00387 
00388         >>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd")
00389         >>> s.find_longest_match(0, 5, 0, 9)
00390         Match(a=1, b=0, size=4)
00391 
00392         If no blocks match, return (alo, blo, 0).
00393 
00394         >>> s = SequenceMatcher(None, "ab", "c")
00395         >>> s.find_longest_match(0, 2, 0, 1)
00396         Match(a=0, b=0, size=0)
00397         """
00398 
00399         # CAUTION:  stripping common prefix or suffix would be incorrect.
00400         # E.g.,
00401         #    ab
00402         #    acab
00403         # Longest matching block is "ab", but if common prefix is
00404         # stripped, it's "a" (tied with "b").  UNIX(tm) diff does so
00405         # strip, so ends up claiming that ab is changed to acab by
00406         # inserting "ca" in the middle.  That's minimal but unintuitive:
00407         # "it's obvious" that someone inserted "ac" at the front.
00408         # Windiff ends up at the same place as diff, but by pairing up
00409         # the unique 'b's and then matching the first two 'a's.
00410 
00411         a, b, b2j, isbjunk = self.a, self.b, self.b2j, self.bjunk.__contains__
00412         besti, bestj, bestsize = alo, blo, 0
00413         # find longest junk-free match
00414         # during an iteration of the loop, j2len[j] = length of longest
00415         # junk-free match ending with a[i-1] and b[j]
00416         j2len = {}
00417         nothing = []
00418         for i in range(alo, ahi):
00419             # look at all instances of a[i] in b; note that because
00420             # b2j has no junk keys, the loop is skipped if a[i] is junk
00421             j2lenget = j2len.get
00422             newj2len = {}
00423             for j in b2j.get(a[i], nothing):
00424                 # a[i] matches b[j]
00425                 if j < blo:
00426                     continue
00427                 if j >= bhi:
00428                     break
00429                 k = newj2len[j] = j2lenget(j-1, 0) + 1
00430                 if k > bestsize:
00431                     besti, bestj, bestsize = i-k+1, j-k+1, k
00432             j2len = newj2len
00433 
00434         # Extend the best by non-junk elements on each end.  In particular,
00435         # "popular" non-junk elements aren't in b2j, which greatly speeds
00436         # the inner loop above, but also means "the best" match so far
00437         # doesn't contain any junk *or* popular non-junk elements.
00438         while besti > alo and bestj > blo and \
00439               not isbjunk(b[bestj-1]) and \
00440               a[besti-1] == b[bestj-1]:
00441             besti, bestj, bestsize = besti-1, bestj-1, bestsize+1
00442         while besti+bestsize < ahi and bestj+bestsize < bhi and \
00443               not isbjunk(b[bestj+bestsize]) and \
00444               a[besti+bestsize] == b[bestj+bestsize]:
00445             bestsize += 1
00446 
00447         # Now that we have a wholly interesting match (albeit possibly
00448         # empty!), we may as well suck up the matching junk on each
00449         # side of it too.  Can't think of a good reason not to, and it
00450         # saves post-processing the (possibly considerable) expense of
00451         # figuring out what to do with it.  In the case of an empty
00452         # interesting match, this is clearly the right thing to do,
00453         # because no other kind of match is possible in the regions.
00454         while besti > alo and bestj > blo and \
00455               isbjunk(b[bestj-1]) and \
00456               a[besti-1] == b[bestj-1]:
00457             besti, bestj, bestsize = besti-1, bestj-1, bestsize+1
00458         while besti+bestsize < ahi and bestj+bestsize < bhi and \
00459               isbjunk(b[bestj+bestsize]) and \
00460               a[besti+bestsize] == b[bestj+bestsize]:
00461             bestsize = bestsize + 1
00462 
00463         return Match(besti, bestj, bestsize)
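
The following is only an illustrative sketch (matching_blocks below is a
hypothetical helper, not library code): it expresses the "apply the same
idea recursively to the left and right pieces" strategy from the class
description directly in terms of find_longest_match().  difflib itself
does the equivalent iteratively in get_matching_blocks().

from difflib import SequenceMatcher

def matching_blocks(s, alo, ahi, blo, bhi):
    i, j, k = s.find_longest_match(alo, ahi, blo, bhi)
    if k == 0:
        return []
    left = matching_blocks(s, alo, i, blo, j)            # pieces to the left
    right = matching_blocks(s, i + k, ahi, j + k, bhi)   # pieces to the right
    return left + [(i, j, k)] + right

s = SequenceMatcher(None, "abxcd", "abcd")
print(matching_blocks(s, 0, 5, 0, 4))   # [(0, 0, 2), (3, 2, 2)]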

def difflib.SequenceMatcher.get_grouped_opcodes (   self,
  n = 3 
)
Isolate change clusters by eliminating ranges with no changes.

Return a generator of groups with up to n lines of context.
Each group is in the same format as returned by get_opcodes().

>>> from pprint import pprint
>>> a = list(map(str, range(1,40)))
>>> b = a[:]
>>> b[8:8] = ['i']     # Make an insertion
>>> b[20] += 'x'       # Make a replacement
>>> b[23:28] = []      # Make a deletion
>>> b[30] += 'y'       # Make another replacement
>>> pprint(list(SequenceMatcher(None,a,b).get_grouped_opcodes()))
[[('equal', 5, 8, 5, 8), ('insert', 8, 8, 8, 9), ('equal', 8, 11, 9, 12)],
 [('equal', 16, 19, 17, 20),
  ('replace', 19, 20, 20, 21),
  ('equal', 20, 22, 21, 23),
  ('delete', 22, 27, 23, 23),
  ('equal', 27, 30, 23, 26)],
 [('equal', 31, 34, 27, 30),
  ('replace', 34, 35, 30, 31),
  ('equal', 35, 38, 31, 34)]]

Definition at line 590 of file difflib.py.

00590 
00591     def get_grouped_opcodes(self, n=3):
00592         """ Isolate change clusters by eliminating ranges with no changes.
00593 
00594         Return a generator of groups with upto n lines of context.
00595         Each group is in the same format as returned by get_opcodes().
00596 
00597         >>> from pprint import pprint
00598         >>> a = list(map(str, range(1,40)))
00599         >>> b = a[:]
00600         >>> b[8:8] = ['i']     # Make an insertion
00601         >>> b[20] += 'x'       # Make a replacement
00602         >>> b[23:28] = []      # Make a deletion
00603         >>> b[30] += 'y'       # Make another replacement
00604         >>> pprint(list(SequenceMatcher(None,a,b).get_grouped_opcodes()))
00605         [[('equal', 5, 8, 5, 8), ('insert', 8, 8, 8, 9), ('equal', 8, 11, 9, 12)],
00606          [('equal', 16, 19, 17, 20),
00607           ('replace', 19, 20, 20, 21),
00608           ('equal', 20, 22, 21, 23),
00609           ('delete', 22, 27, 23, 23),
00610           ('equal', 27, 30, 23, 26)],
00611          [('equal', 31, 34, 27, 30),
00612           ('replace', 34, 35, 30, 31),
00613           ('equal', 35, 38, 31, 34)]]
00614         """
00615 
00616         codes = self.get_opcodes()
00617         if not codes:
00618             codes = [("equal", 0, 1, 0, 1)]
00619         # Fixup leading and trailing groups if they show no changes.
00620         if codes[0][0] == 'equal':
00621             tag, i1, i2, j1, j2 = codes[0]
00622             codes[0] = tag, max(i1, i2-n), i2, max(j1, j2-n), j2
00623         if codes[-1][0] == 'equal':
00624             tag, i1, i2, j1, j2 = codes[-1]
00625             codes[-1] = tag, i1, min(i2, i1+n), j1, min(j2, j1+n)
00626 
00627         nn = n + n
00628         group = []
00629         for tag, i1, i2, j1, j2 in codes:
00630             # End the current group and start a new one whenever
00631             # there is a large range with no changes.
00632             if tag == 'equal' and i2-i1 > nn:
00633                 group.append((tag, i1, min(i2, i1+n), j1, min(j2, j1+n)))
00634                 yield group
00635                 group = []
00636                 i1, j1 = max(i1, i2-n), max(j1, j2-n)
00637             group.append((tag, i1, i2, j1 ,j2))
00638         if group and not (len(group)==1 and group[0][0] == 'equal'):
00639             yield group
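
A hedged sketch of how the grouped opcodes can be rendered as context
"hunks" (the header format below is only illustrative; difflib.unified_diff()
does the real formatting on top of this method):

from difflib import SequenceMatcher

a = "one two three four five six seven".split()
b = "one two 3 four five six 7".split()
for group in SequenceMatcher(None, a, b).get_grouped_opcodes(n=1):
    i1, i2 = group[0][1], group[-1][2]      # a-range covered by this group
    j1, j2 = group[0][3], group[-1][4]      # b-range covered by this group
    print("@@ -%d,%d +%d,%d @@" % (i1 + 1, i2 - i1, j1 + 1, j2 - j1))
    for tag, ai1, ai2, bj1, bj2 in group:
        print("  %7s %s -> %s" % (tag, a[ai1:ai2], b[bj1:bj2]))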

def difflib.SequenceMatcher.get_matching_blocks (   self)
Return list of triples describing matching subsequences.

Each triple is of the form (i, j, n), and means that
a[i:i+n] == b[j:j+n].  The triples are monotonically increasing in
i and in j.  New in Python 2.5, it's also guaranteed that if
(i, j, n) and (i', j', n') are adjacent triples in the list, and
the second is not the last triple in the list, then i+n != i' or
j+n != j'.  IOW, adjacent triples never describe adjacent equal
blocks.

The last triple is a dummy, (len(a), len(b), 0), and is the only
triple with n==0.

>>> s = SequenceMatcher(None, "abxcd", "abcd")
>>> list(s.get_matching_blocks())
[Match(a=0, b=0, size=2), Match(a=3, b=2, size=2), Match(a=5, b=4, size=0)]

Definition at line 464 of file difflib.py.

00464 
00465     def get_matching_blocks(self):
00466         """Return list of triples describing matching subsequences.
00467 
00468         Each triple is of the form (i, j, n), and means that
00469         a[i:i+n] == b[j:j+n].  The triples are monotonically increasing in
00470         i and in j.  New in Python 2.5, it's also guaranteed that if
00471         (i, j, n) and (i', j', n') are adjacent triples in the list, and
00472         the second is not the last triple in the list, then i+n != i' or
00473         j+n != j'.  IOW, adjacent triples never describe adjacent equal
00474         blocks.
00475 
00476         The last triple is a dummy, (len(a), len(b), 0), and is the only
00477         triple with n==0.
00478 
00479         >>> s = SequenceMatcher(None, "abxcd", "abcd")
00480         >>> list(s.get_matching_blocks())
00481         [Match(a=0, b=0, size=2), Match(a=3, b=2, size=2), Match(a=5, b=4, size=0)]
00482         """
00483 
00484         if self.matching_blocks is not None:
00485             return self.matching_blocks
00486         la, lb = len(self.a), len(self.b)
00487 
00488         # This is most naturally expressed as a recursive algorithm, but
00489         # at least one user bumped into extreme use cases that exceeded
00490         # the recursion limit on their box.  So, now we maintain a list
00491         # ('queue`) of blocks we still need to look at, and append partial
00492         # results to `matching_blocks` in a loop; the matches are sorted
00493         # at the end.
00494         queue = [(0, la, 0, lb)]
00495         matching_blocks = []
00496         while queue:
00497             alo, ahi, blo, bhi = queue.pop()
00498             i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
00499             # a[alo:i] vs b[blo:j] unknown
00500             # a[i:i+k] same as b[j:j+k]
00501             # a[i+k:ahi] vs b[j+k:bhi] unknown
00502             if k:   # if k is 0, there was no matching block
00503                 matching_blocks.append(x)
00504                 if alo < i and blo < j:
00505                     queue.append((alo, i, blo, j))
00506                 if i+k < ahi and j+k < bhi:
00507                     queue.append((i+k, ahi, j+k, bhi))
00508         matching_blocks.sort()
00509 
00510         # It's possible that we have adjacent equal blocks in the
00511         # matching_blocks list now.  Starting with 2.5, this code was added
00512         # to collapse them.
00513         i1 = j1 = k1 = 0
00514         non_adjacent = []
00515         for i2, j2, k2 in matching_blocks:
00516             # Is this block adjacent to i1, j1, k1?
00517             if i1 + k1 == i2 and j1 + k1 == j2:
00518                 # Yes, so collapse them -- this just increases the length of
00519                 # the first block by the length of the second, and the first
00520                 # block so lengthened remains the block to compare against.
00521                 k1 += k2
00522             else:
00523                 # Not adjacent.  Remember the first block (k1==0 means it's
00524                 # the dummy we started with), and make the second block the
00525                 # new block to compare against.
00526                 if k1:
00527                     non_adjacent.append((i1, j1, k1))
00528                 i1, j1, k1 = i2, j2, k2
00529         if k1:
00530             non_adjacent.append((i1, j1, k1))
00531 
00532         non_adjacent.append( (la, lb, 0) )
00533         self.matching_blocks = non_adjacent
00534         return map(Match._make, self.matching_blocks)
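
A short sketch connecting this method to ratio(): summing the size fields
of the returned blocks (the trailing dummy contributes 0) gives the M in
the 2.0*M / T formula.

from difflib import SequenceMatcher

s = SequenceMatcher(None, "abxcd", "abcd")
blocks = list(s.get_matching_blocks())
m = sum(size for _, _, size in blocks)          # 2 + 2 + 0 == 4
print(2.0 * m / (len("abxcd") + len("abcd")))   # 0.888..., the same value s.ratio() returns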

def difflib.SequenceMatcher.get_opcodes (   self)
Return list of 5-tuples describing how to turn a into b.

Each tuple is of the form (tag, i1, i2, j1, j2).  The first tuple
has i1 == j1 == 0, and remaining tuples have i1 == the i2 from the
tuple preceding it, and likewise for j1 == the previous j2.

The tags are strings, with these meanings:

'replace':  a[i1:i2] should be replaced by b[j1:j2]
'delete':   a[i1:i2] should be deleted.
    Note that j1==j2 in this case.
'insert':   b[j1:j2] should be inserted at a[i1:i1].
    Note that i1==i2 in this case.
'equal':    a[i1:i2] == b[j1:j2]

>>> a = "qabxcd"
>>> b = "abycdf"
>>> s = SequenceMatcher(None, a, b)
>>> for tag, i1, i2, j1, j2 in s.get_opcodes():
...    print(("%7s a[%d:%d] (%s) b[%d:%d] (%s)" %
...           (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2])))
 delete a[0:1] (q) b[0:0] ()
  equal a[1:3] (ab) b[0:2] (ab)
replace a[3:4] (x) b[2:3] (y)
  equal a[4:6] (cd) b[3:5] (cd)
 insert a[6:6] () b[5:6] (f)

Definition at line 535 of file difflib.py.

00535 
00536     def get_opcodes(self):
00537         """Return list of 5-tuples describing how to turn a into b.
00538 
00539         Each tuple is of the form (tag, i1, i2, j1, j2).  The first tuple
00540         has i1 == j1 == 0, and remaining tuples have i1 == the i2 from the
00541         tuple preceding it, and likewise for j1 == the previous j2.
00542 
00543         The tags are strings, with these meanings:
00544 
00545         'replace':  a[i1:i2] should be replaced by b[j1:j2]
00546         'delete':   a[i1:i2] should be deleted.
00547                     Note that j1==j2 in this case.
00548         'insert':   b[j1:j2] should be inserted at a[i1:i1].
00549                     Note that i1==i2 in this case.
00550         'equal':    a[i1:i2] == b[j1:j2]
00551 
00552         >>> a = "qabxcd"
00553         >>> b = "abycdf"
00554         >>> s = SequenceMatcher(None, a, b)
00555         >>> for tag, i1, i2, j1, j2 in s.get_opcodes():
00556         ...    print(("%7s a[%d:%d] (%s) b[%d:%d] (%s)" %
00557         ...           (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2])))
00558          delete a[0:1] (q) b[0:0] ()
00559           equal a[1:3] (ab) b[0:2] (ab)
00560         replace a[3:4] (x) b[2:3] (y)
00561           equal a[4:6] (cd) b[3:5] (cd)
00562          insert a[6:6] () b[5:6] (f)
00563         """
00564 
00565         if self.opcodes is not None:
00566             return self.opcodes
00567         i = j = 0
00568         self.opcodes = answer = []
00569         for ai, bj, size in self.get_matching_blocks():
00570             # invariant:  we've pumped out correct diffs to change
00571             # a[:i] into b[:j], and the next matching block is
00572             # a[ai:ai+size] == b[bj:bj+size].  So we need to pump
00573             # out a diff to change a[i:ai] into b[j:bj], pump out
00574             # the matching block, and move (i,j) beyond the match
00575             tag = ''
00576             if i < ai and j < bj:
00577                 tag = 'replace'
00578             elif i < ai:
00579                 tag = 'delete'
00580             elif j < bj:
00581                 tag = 'insert'
00582             if tag:
00583                 answer.append( (tag, i, ai, j, bj) )
00584             i, j = ai+size, bj+size
00585             # the list of matching blocks is terminated by a
00586             # sentinel with size 0
00587             if size:
00588                 answer.append( ('equal', ai, i, bj, j) )
00589         return answer
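
A brief illustrative check that the opcodes really are a recipe for turning
a into b: replaying them reproduces b exactly.

from difflib import SequenceMatcher

a, b = "qabxcd", "abycdf"
rebuilt = []
for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
    if tag in ("replace", "insert"):
        rebuilt.append(b[j1:j2])     # take the new material from b
    elif tag == "equal":
        rebuilt.append(a[i1:i2])     # unchanged, so either side works
    # 'delete' contributes nothing
assert "".join(rebuilt) == b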

def difflib.SequenceMatcher.isbjunk (   self,
  item 
)
Deprecated; use 'item in SequenceMatcher().bjunk'.

Definition at line 340 of file difflib.py.

00340 
00341     def isbjunk(self, item):
00342         "Deprecated; use 'item in SequenceMatcher().bjunk'."
00343         warnings.warn("'SequenceMatcher().isbjunk(item)' is deprecated;\n"
00344                       "use 'item in SMinstance.bjunk' instead.",
00345                       DeprecationWarning, 2)
00346         return item in self.bjunk

def difflib.SequenceMatcher.isbpopular (   self,
  item 
)
Deprecated; use 'item in SequenceMatcher().bpopular'.

Definition at line 347 of file difflib.py.

00347 
00348     def isbpopular(self, item):
00349         "Deprecated; use 'item in SequenceMatcher().bpopular'."
00350         warnings.warn("'SequenceMatcher().isbpopular(item)' is deprecated;\n"
00351                       "use 'item in SMinstance.bpopular' instead.",
00352                       DeprecationWarning, 2)
00353         return item in self.bpopular

def difflib.SequenceMatcher.quick_ratio (   self)
Return an upper bound on ratio() relatively quickly.

This isn't defined beyond that it is an upper bound on .ratio(), and
is faster to compute.

Definition at line 665 of file difflib.py.

00665 
00666     def quick_ratio(self):
00667         """Return an upper bound on ratio() relatively quickly.
00668 
00669         This isn't defined beyond that it is an upper bound on .ratio(), and
00670         is faster to compute.
00671         """
00672 
00673         # viewing a and b as multisets, set matches to the cardinality
00674         # of their intersection; this counts the number of matches
00675         # without regard to order, so is clearly an upper bound
00676         if self.fullbcount is None:
00677             self.fullbcount = fullbcount = {}
00678             for elt in self.b:
00679                 fullbcount[elt] = fullbcount.get(elt, 0) + 1
00680         fullbcount = self.fullbcount
00681         # avail[x] is the number of times x appears in 'b' less the
00682         # number of times we've seen it in 'a' so far ... kinda
00683         avail = {}
00684         availhas, matches = avail.__contains__, 0
00685         for elt in self.a:
00686             if availhas(elt):
00687                 numb = avail[elt]
00688             else:
00689                 numb = fullbcount.get(elt, 0)
00690             avail[elt] = numb - 1
00691             if numb > 0:
00692                 matches = matches + 1
00693         return _calculate_ratio(matches, len(self.a) + len(self.b))

def difflib.SequenceMatcher.ratio (   self)
Return a measure of the sequences' similarity (float in [0,1]).

Where T is the total number of elements in both sequences, and
M is the number of matches, this is 2.0*M / T.
Note that this is 1 if the sequences are identical, and 0 if
they have nothing in common.

.ratio() is expensive to compute if you haven't already computed
.get_matching_blocks() or .get_opcodes(), in which case you may
want to try .quick_ratio() or .real_quick_ratio() first to get an
upper bound.

>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75
>>> s.quick_ratio()
0.75
>>> s.real_quick_ratio()
1.0

Definition at line 640 of file difflib.py.

00640 
00641     def ratio(self):
00642         """Return a measure of the sequences' similarity (float in [0,1]).
00643 
00644         Where T is the total number of elements in both sequences, and
00645         M is the number of matches, this is 2.0*M / T.
00646         Note that this is 1 if the sequences are identical, and 0 if
00647         they have nothing in common.
00648 
00649         .ratio() is expensive to compute if you haven't already computed
00650         .get_matching_blocks() or .get_opcodes(), in which case you may
00651         want to try .quick_ratio() or .real_quick_ratio() first to get an
00652         upper bound.
00653 
00654         >>> s = SequenceMatcher(None, "abcd", "bcde")
00655         >>> s.ratio()
00656         0.75
00657         >>> s.quick_ratio()
00658         0.75
00659         >>> s.real_quick_ratio()
00660         1.0
00661         """
00662 
00663         matches = sum(triple[-1] for triple in self.get_matching_blocks())
00664         return _calculate_ratio(matches, len(self.a) + len(self.b))

def difflib.SequenceMatcher.real_quick_ratio (   self)
Return an upper bound on ratio() very quickly.

This isn't defined beyond that it is an upper bound on .ratio(), and
is faster to compute than either .ratio() or .quick_ratio().

Definition at line 694 of file difflib.py.

00694 
00695     def real_quick_ratio(self):
00696         """Return an upper bound on ratio() very quickly.
00697 
00698         This isn't defined beyond that it is an upper bound on .ratio(), and
00699         is faster to compute than either .ratio() or .quick_ratio().
00700         """
00701 
00702         la, lb = len(self.a), len(self.b)
00703         # can't have more matches than the number of elements in the
00704         # shorter sequence
00705         return _calculate_ratio(min(la, lb), la + lb)
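
A hedged sketch of the filtering idiom the ratio() docstring suggests
(close_enough is a hypothetical helper): since
real_quick_ratio() >= quick_ratio() >= ratio(), the cheap upper bounds can
rule a pair out before the expensive ratio() is computed.

from difflib import SequenceMatcher

def close_enough(s, cutoff=0.6):
    return (s.real_quick_ratio() >= cutoff and
            s.quick_ratio() >= cutoff and
            s.ratio() >= cutoff)

s = SequenceMatcher(None, "private Thread currentThread;",
                          "private volatile Thread currentThread;")
print(close_enough(s))   # True -- ratio() here is about 0.866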

def difflib.SequenceMatcher.set_seq1 (   self,
  a 
)
Set the first sequence to be compared.

The second sequence to be compared is not changed.

>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75
>>> s.set_seq1("bcde")
>>> s.ratio()
1.0
>>>

SequenceMatcher computes and caches detailed information about the
second sequence, so if you want to compare one sequence S against
many sequences, use .set_seq2(S) once and call .set_seq1(x)
repeatedly for each of the other sequences.

See also set_seqs() and set_seq2().

Definition at line 230 of file difflib.py.

00230 
00231     def set_seq1(self, a):
00232         """Set the first sequence to be compared.
00233 
00234         The second sequence to be compared is not changed.
00235 
00236         >>> s = SequenceMatcher(None, "abcd", "bcde")
00237         >>> s.ratio()
00238         0.75
00239         >>> s.set_seq1("bcde")
00240         >>> s.ratio()
00241         1.0
00242         >>>
00243 
00244         SequenceMatcher computes and caches detailed information about the
00245         second sequence, so if you want to compare one sequence S against
00246         many sequences, use .set_seq2(S) once and call .set_seq1(x)
00247         repeatedly for each of the other sequences.
00248 
00249         See also set_seqs() and set_seq2().
00250         """
00251 
00252         if a is self.a:
00253             return
00254         self.a = a
00255         self.matching_blocks = self.opcodes = None

def difflib.SequenceMatcher.set_seq2 (   self,
  b 
)
Set the second sequence to be compared.

The first sequence to be compared is not changed.

>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75
>>> s.set_seq2("abcd")
>>> s.ratio()
1.0
>>>

SequenceMatcher computes and caches detailed information about the
second sequence, so if you want to compare one sequence S against
many sequences, use .set_seq2(S) once and call .set_seq1(x)
repeatedly for each of the other sequences.

See also set_seqs() and set_seq1().

Definition at line 256 of file difflib.py.

00256 
00257     def set_seq2(self, b):
00258         """Set the second sequence to be compared.
00259 
00260         The first sequence to be compared is not changed.
00261 
00262         >>> s = SequenceMatcher(None, "abcd", "bcde")
00263         >>> s.ratio()
00264         0.75
00265         >>> s.set_seq2("abcd")
00266         >>> s.ratio()
00267         1.0
00268         >>>
00269 
00270         SequenceMatcher computes and caches detailed information about the
00271         second sequence, so if you want to compare one sequence S against
00272         many sequences, use .set_seq2(S) once and call .set_seq1(x)
00273         repeatedly for each of the other sequences.
00274 
00275         See also set_seqs() and set_seq1().
00276         """
00277 
00278         if b is self.b:
00279             return
00280         self.b = b
00281         self.matching_blocks = self.opcodes = None
00282         self.fullbcount = None
00283         self.__chain_b()

def difflib.SequenceMatcher.set_seqs (   self,
  a,
  b 
)
Set the two sequences to be compared.

>>> s = SequenceMatcher()
>>> s.set_seqs("abcd", "bcde")
>>> s.ratio()
0.75

Definition at line 218 of file difflib.py.

00218 
00219     def set_seqs(self, a, b):
00220         """Set the two sequences to be compared.
00221 
00222         >>> s = SequenceMatcher()
00223         >>> s.set_seqs("abcd", "bcde")
00224         >>> s.ratio()
00225         0.75
00226         """
00227 
00228         self.set_seq1(a)
00229         self.set_seq2(b)


Member Data Documentation

difflib.SequenceMatcher.a

Definition at line 214 of file difflib.py.

difflib.SequenceMatcher.autojunk

Definition at line 215 of file difflib.py.

difflib.SequenceMatcher.b

Definition at line 214 of file difflib.py.

difflib.SequenceMatcher.b2j

Definition at line 313 of file difflib.py.

difflib.SequenceMatcher.bjunk

Definition at line 320 of file difflib.py.

difflib.SequenceMatcher.bpopular

Definition at line 330 of file difflib.py.

difflib.SequenceMatcher.fullbcount

Definition at line 281 of file difflib.py.

difflib.SequenceMatcher.isjunk

Definition at line 213 of file difflib.py.

difflib.SequenceMatcher.matching_blocks

Definition at line 254 of file difflib.py.

difflib.SequenceMatcher.opcodes

Definition at line 254 of file difflib.py.


The documentation for this class was generated from the following file:
difflib.py