moin  1.9.0~rc2
MoinMoin.support.xappy.highlight.Highlighter Class Reference

Public Member Functions

def __init__
def makeSample
def highlight

Public Attributes

 stem

Private Member Functions

def _split_text
def _strip_prefix
def _query_to_stemmed_words
def _hl

Static Private Attributes

tuple _split_re = re.compile(r'<\w+[^>]*>|</\w+>|[\w\']+|\s+|[^\w\'\s<>/]+')

Detailed Description

Class for highlighting text and creating contextual summaries.

>>> hl = Highlighter("en")
>>> hl.makeSample('Hello world.', ['world'])
'Hello world.'
>>> hl.highlight('Hello world', ['world'], ('<', '>'))
'Hello <world>'

Definition at line 26 of file highlight.py.


Constructor & Destructor Documentation

def MoinMoin.support.xappy.highlight.Highlighter.__init__ (   self,
  language_code = 'en',
  stemmer = None 
)
Create a new highlighter for the specified language.

Definition at line 40 of file highlight.py.

00040 
00041     def __init__(self, language_code='en', stemmer=None):
00042         """Create a new highlighter for the specified language.
00043 
00044         """
00045         if stemmer is not None:
00046             self.stem = stemmer
00047         else:
00048             self.stem = xapian.Stem(language_code)


Member Function Documentation

def MoinMoin.support.xappy.highlight.Highlighter._hl (   self,
  words,
  terms,
  hl 
) [private]
Add highlights to a list of words.

`words` is the list of words and non-words to be highlighted.
`terms` is the list of stemmed words to look for.

Definition at line 211 of file highlight.py.

00211 
00212     def _hl(self, words, terms, hl):
00213         """Add highlights to a list of words.
00214         
00215         `words` is the list of words and non-words to be highlighted.
00216         `terms` is the list of stemmed words to look for.
00217 
00218         """
00219         for i, w in enumerate(words):
00220             # HACK - more forgiving about stemmed terms 
00221             wl = w.lower()
00222             if wl in terms or self.stem (wl) in terms:
00223                 words[i] = ''.join((hl[0], w, hl[1]))
00224 
00225         return ''.join(words)
00226 

def MoinMoin.support.xappy.highlight.Highlighter._query_to_stemmed_words (   self,
  query 
) [private]

Convert a query to a list of stemmed words.

- `query` is the query to parse: it may be a xapian.Query object, or a
  sequence of terms.

Definition at line 97 of file highlight.py.

00097 
00098     def _query_to_stemmed_words(self, query):
00099         """Convert a query to a list of stemmed words.
00100 
00101         - `query` is the query to parse: it may be a xapian.Query object, or a
00102           sequence of terms.
00103 
00104         """
00105         if isinstance(query, xapian.Query):
00106             return [self._strip_prefix(t) for t in query]
00107         else:
00108             return [self.stem(q.lower()) for q in query]
00109 

def MoinMoin.support.xappy.highlight.Highlighter._split_text (   self,
  text,
  strip_tags = False 
) [private]
Split some text into words and non-words.

- `text` is the text to process.  It may be a unicode object or a utf-8
  encoded simple string.
- `strip_tags` is a flag - False to keep tags, True to strip all tags
  from the output.

Returns a list of utf-8 encoded simple strings.

Definition at line 49 of file highlight.py.

00049 
00050     def _split_text(self, text, strip_tags=False):
00051         """Split some text into words and non-words.
00052 
00053         - `text` is the text to process.  It may be a unicode object or a utf-8
00054           encoded simple string.
00055         - `strip_tags` is a flag - False to keep tags, True to strip all tags
00056           from the output.
00057 
00058         Returns a list of utf-8 encoded simple strings.
00059 
00060         """
00061         if isinstance(text, unicode):
00062             text = text.encode('utf-8')
00063 
00064         words = self._split_re.findall(text)
00065         if strip_tags:
00066             return [w for w in words if w[0] != '<']
00067         else:
00068             return words
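
The splitting step can be tried in isolation with nothing but the `re` module. The following is a minimal Python 3 sketch, not the library's code: `SPLIT_RE` and `split_text` are hypothetical stand-alone names, the pattern is copied from `_split_re`, and the sketch works on `str` values rather than the utf-8 encoded byte strings the original returns.

```python
import re

# Same pattern as Highlighter._split_re: opening tags, closing tags,
# words (letters, digits, underscore, apostrophe), whitespace runs,
# and runs of any other characters except '<', '>' and '/'.
SPLIT_RE = re.compile(r"<\w+[^>]*>|</\w+>|[\w']+|\s+|[^\w'\s<>/]+")


def split_text(text, strip_tags=False):
    """Split text into words and non-words (hypothetical stand-alone helper)."""
    words = SPLIT_RE.findall(text)
    if strip_tags:
        # Tag tokens are the only ones that can start with '<'.
        return [w for w in words if not w.startswith('<')]
    return words


print(split_text('Hello <b>world</b>.'))
# ['Hello', ' ', '<b>', 'world', '</b>', '.']
print(split_text('Hello <b>world</b>.', strip_tags=True))
# ['Hello', ' ', 'world', '.']
```

Because whitespace and punctuation are kept as tokens of their own, joining the list with `''.join(...)` reproduces the input text (minus any stripped tags), which is what the highlighting step relies on.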

def MoinMoin.support.xappy.highlight.Highlighter._strip_prefix (   self,
  term 
) [private]
Strip the prefix off a term.

Prefixes are any initial capital letters, with the exception that R always
ends a prefix, even if followed by capital letters.

>>> hl = Highlighter("en")
>>> print hl._strip_prefix('hello')
hello
>>> print hl._strip_prefix('Rhello')
hello
>>> print hl._strip_prefix('XARHello')
Hello
>>> print hl._strip_prefix('XAhello')
hello
>>> print hl._strip_prefix('XAh')
h
>>> print hl._strip_prefix('XA')
<BLANKLINE>

Definition at line 69 of file highlight.py.

00069 
00070     def _strip_prefix(self, term):
00071         """Strip the prefix off a term.
00072 
00073         Prefixes are any initial capital letters, with the exception that R always
00074         ends a prefix, even if followed by capital letters.
00075 
00076         >>> hl = Highlighter("en")
00077         >>> print hl._strip_prefix('hello')
00078         hello
00079         >>> print hl._strip_prefix('Rhello')
00080         hello
00081         >>> print hl._strip_prefix('XARHello')
00082         Hello
00083         >>> print hl._strip_prefix('XAhello')
00084         hello
00085         >>> print hl._strip_prefix('XAh')
00086         h
00087         >>> print hl._strip_prefix('XA')
00088         <BLANKLINE>
00089 
00090         """
00091         for p in xrange(len(term)):
00092             if term[p].islower():
00093                 return term[p:]
00094             elif term[p] == 'R':
00095                 return term[p+1:]
00096         return ''
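
For readers on Python 3, the prefix rule can be restated as a free function. This is a sketch under the assumption that the behaviour shown in the doctests above is all that is required; `strip_prefix` is a hypothetical name and is not part of the module.

```python
def strip_prefix(term):
    """Return `term` without its leading capital-letter prefix.

    Scans from the left: the first lowercase character starts the result,
    while an 'R' ends the prefix and is itself dropped.
    """
    for pos, char in enumerate(term):
        if char.islower():
            return term[pos:]
        if char == 'R':
            return term[pos + 1:]
    # No lowercase letter and no 'R' found: nothing is left after the prefix.
    return ''


for term in ('hello', 'Rhello', 'XARHello', 'XAhello', 'XAh', 'XA'):
    print(repr(term), '->', repr(strip_prefix(term)))
```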

def MoinMoin.support.xappy.highlight.Highlighter.highlight (   self,
  text,
  query,
  hl,
  strip_tags = False 
)
Add highlights (string prefix/postfix) to a string.

`text` is the source to highlight.
`query` is either a Xapian query object or a list of (unstemmed) term strings.
`hl` is a pair of highlight strings, e.g. ('<i>', '</i>')
`strip_tags` strips HTML markup if True

>>> hl = Highlighter()
>>> qp = xapian.QueryParser()
>>> q = qp.parse_query('cat dog')
>>> tags = ('[[', ']]')
>>> hl.highlight('The cat went Dogging; but was <i>dog tired</i>.', q, tags)
'The [[cat]] went [[Dogging]]; but was <i>[[dog]] tired</i>.'

Definition at line 191 of file highlight.py.

00191 
00192     def highlight(self, text, query, hl, strip_tags=False):
00193         """Add highlights (string prefix/postfix) to a string.
00194 
00195         `text` is the source to highlight.
00196         `query` is either a Xapian query object or a list of (unstemmed) term strings.
00197         `hl` is a pair of highlight strings, e.g. ('<i>', '</i>')
00198         `strip_tags` strips HTML markup if True
00199 
00200         >>> hl = Highlighter()
00201         >>> qp = xapian.QueryParser()
00202         >>> q = qp.parse_query('cat dog')
00203         >>> tags = ('[[', ']]')
00204         >>> hl.highlight('The cat went Dogging; but was <i>dog tired</i>.', q, tags)
00205         'The [[cat]] went [[Dogging]]; but was <i>[[dog]] tired</i>.'
00206 
00207         """
00208         words = self._split_text(text, strip_tags)
00209         terms = self._query_to_stemmed_words(query)
00210         return self._hl(words, terms, hl)
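
The three calls above (split, stem the query, mark matches) can be illustrated end to end without Xapian. The sketch below is Python 3 and makes several assumptions: `toy_stem` is a deliberately crude stand-in for `xapian.Stem` that only drops a trailing "s", `highlight_text` is a hypothetical name, and only the list-of-terms form of `query` is handled.

```python
import re

# Pattern copied from Highlighter._split_re.
SPLIT_RE = re.compile(r"<\w+[^>]*>|</\w+>|[\w']+|\s+|[^\w'\s<>/]+")


def toy_stem(word):
    """Crude stand-in for xapian.Stem (assumption): drop one trailing 's'."""
    return word[:-1] if word.endswith('s') else word


def highlight_text(text, query_terms, hl, stem=toy_stem, strip_tags=False):
    """Wrap every token whose lowercased or stemmed form matches a query term."""
    words = SPLIT_RE.findall(text)
    if strip_tags:
        words = [w for w in words if not w.startswith('<')]
    # Query terms are lowercased and stemmed once, up front.
    terms = {stem(t.lower()) for t in query_terms}
    out = []
    for word in words:
        lowered = word.lower()
        if lowered in terms or stem(lowered) in terms:
            out.append(hl[0] + word + hl[1])
        else:
            out.append(word)
    return ''.join(out)


print(highlight_text('The cats saw <i>a dog</i>.', ['cat', 'Dogs'], ('[[', ']]')))
# The [[cats]] saw <i>a [[dog]]</i>.
```

Matching is done on the lowercased token, but the original spelling is what gets wrapped, so capitalisation in the text is preserved.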

def MoinMoin.support.xappy.highlight.Highlighter.makeSample (   self,
  text,
  query,
  maxlen = 600,
  hl = None 
)
Make a contextual summary from the supplied text.

This basically works by splitting the text into phrases, counting the query
terms in each, and keeping those with the most.

Any markup tags in the text will be stripped.

`text` is the source text to summarise.
`query` is either a Xapian query object or a list of (unstemmed) term strings.
`maxlen` is the maximum length of the generated summary.
`hl` is a pair of strings to insert around highlighted terms, e.g. ('<b>', '</b>')

Definition at line 110 of file highlight.py.

00110 
00111     def makeSample(self, text, query, maxlen=600, hl=None):
00112         """Make a contextual summary from the supplied text.
00113 
00114         This basically works by splitting the text into phrases, counting the query
00115         terms in each, and keeping those with the most.
00116 
00117         Any markup tags in the text will be stripped.
00118 
00119         `text` is the source text to summarise.
00120         `query` is either a Xapian query object or a list of (unstemmed) term strings.
00121         `maxlen` is the maximum length of the generated summary.
00122         `hl` is a pair of strings to insert around highlighted terms, e.g. ('<b>', '</b>')
00123 
00124         """
00125 
00126         # coerce maxlen into an int, otherwise truncation doesn't happen
00127         maxlen = int(maxlen)
00128 
00129         words = self._split_text(text, True)
00130         terms = self._query_to_stemmed_words(query)
00131         
00132         # build blocks delimited by punctuation, and count matching words in each block
00133         # blocks[n] is a block [firstword, endword, charcount, termcount, selected]
00134         blocks = []
00135         start = end = count = blockchars = 0
00136 
00137         while end < len(words):
00138             blockchars += len(words[end])
00139             if words[end].isalnum():
00140                 if self.stem(words[end].lower()) in terms:
00141                     count += 1
00142                 end += 1
00143             elif words[end] in ',.;:?!\n':
00144                 end += 1
00145                 blocks.append([start, end, blockchars, count, False])
00146                 start = end
00147                 blockchars = 0
00148                 count = 0
00149             else:
00150                 end += 1
00151         if start != end:
00152             blocks.append([start, end, blockchars, count, False])
00153         if len(blocks) == 0:
00154             return ''
00155 
00156         # select high-scoring blocks first, down to zero-scoring
00157         chars = 0
00158         for count in xrange(3, -1, -1):
00159             for b in blocks:
00160                 if b[3] >= count:
00161                     b[4] = True
00162                     chars += b[2]
00163                     if chars >= maxlen: break
00164             if chars >= maxlen: break
00165 
00166         # assemble summary
00167         words2 = []
00168         lastblock = -1
00169         for i, b in enumerate(blocks):
00170             if b[4]:
00171                 if i != lastblock + 1:
00172                     words2.append('..')
00173                 words2.extend(words[b[0]:b[1]])
00174                 lastblock = i
00175 
00176         if not blocks[-1][4]:
00177             words2.append('..')
00178 
00179         # trim down to maxlen
00180         l = 0
00181         for i in xrange (len (words2)):
00182             l += len (words2[i])
00183             if l >= maxlen:
00184                 words2[i:] = ['..']
00185                 break
00186 
00187         if hl is None:
00188             return ''.join(words2)
00189         else:
00190             return self._hl(words2, terms, hl)
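
The first stage of the listing above, cutting the token list into punctuation-delimited blocks and counting query terms in each, may be easier to follow in isolation. This is a Python 3 sketch under stated assumptions: `build_blocks` and `toy_stem` are hypothetical names, `toy_stem` is a crude stand-in for `xapian.Stem`, and the selection and trimming stages are left out.

```python
def toy_stem(word):
    """Crude stand-in for xapian.Stem (assumption): drop one trailing 's'."""
    return word[:-1] if word.endswith('s') else word


def build_blocks(words, terms, stem=toy_stem):
    """Return (first_word, end_word, char_count, term_count) for each block.

    A block ends after any token that is a substring of ',.;:?!\\n',
    mirroring the `in` test used in makeSample.
    """
    blocks = []
    start = chars = count = 0
    for end, word in enumerate(words, start=1):
        chars += len(word)
        if word.isalnum():
            if stem(word.lower()) in terms:
                count += 1
        elif word in ',.;:?!\n':
            blocks.append((start, end, chars, count))
            start, chars, count = end, 0, 0
    if start != len(words):
        # Trailing tokens with no closing punctuation form a final block.
        blocks.append((start, len(words), chars, count))
    return blocks


tokens = ['The', ' ', 'cat', ' ', 'sat', '.', ' ', 'Dogs', ' ', 'bark', '.']
print(build_blocks(tokens, {'cat'}))
# [(0, 6, 12, 1), (6, 11, 11, 0)]
```

In `makeSample` these per-block counts then drive selection: blocks with the most matching terms are picked first, and the character counts decide when `maxlen` has been reached.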


Member Data Documentation

tuple MoinMoin.support.xappy.highlight.Highlighter._split_re = re.compile(r'<\w+[^>]*>|</\w+>|[\w\']+|\s+|[^\w\'\s<>/]+') [static, private]

Definition at line 38 of file highlight.py.

MoinMoin.support.xappy.highlight.Highlighter.stem

Definition at line 45 of file highlight.py.


The documentation for this class was generated from the following file: