
moin 1.9.0~rc2
MoinMoin.search.Xapian.tokenizer.WikiAnalyzer Class Reference


Public Member Functions

def __init__
def raw_tokenize_word
def raw_tokenize
def tokenize

Public Attributes

 stemmer

Static Public Attributes

string singleword = r"[%(u)s][%(l)s]+"
tuple singleword_re = re.compile(singleword, re.U)
tuple wikiword_re = re.compile(WikiParser.word_rule, re.UNICODE|re.VERBOSE)
tuple token_re
tuple dot_re = re.compile(r"[-_/,.]")
tuple mail_re = re.compile(r"[-_/,.]|(@)")
tuple alpha_num_re = re.compile(r"\d+|\D+")

Detailed Description

A text analyzer for wiki syntax

The purpose of this class is to analyze texts/pages in wiki syntax
and yield single terms to feed into the Xapian database.

Definition at line 17 of file tokenizer.py.
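
A minimal usage sketch (the sample text is illustrative; without a request and language, stemming stays disabled and every stem is the empty string):

    from MoinMoin.search.Xapian.tokenizer import WikiAnalyzer

    analyzer = WikiAnalyzer()  # no request/language given, so no stemming
    for word, stemmed in analyzer.tokenize(u"FrontPage content"):
        # e.g. (u'frontpage', u''), (u'front', u''), (u'page', u''), ...
        print "%s %r" % (word, stemmed)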


Constructor & Destructor Documentation

def MoinMoin.search.Xapian.tokenizer.WikiAnalyzer.__init__(self, request=None, language=None)
@param request: current request
@param language: if given, the language in which to stem words

Definition at line 43 of file tokenizer.py.

    def __init__(self, request=None, language=None):
        """
        @param request: current request
        @param language: if given, the language in which to stem words
        """
        self.stemmer = None
        if request and request.cfg.xapian_stemming and language:
            try:
                stemmer = xapian.Stem(language)
                # we need this wrapper because the stemmer returns a utf-8
                # encoded string even when it gets fed with unicode objects:
                self.stemmer = lambda word: stemmer(word).decode('utf-8')
            except xapian.InvalidArgumentError:
                # lang is not stemmable or not available
                pass
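
A short sketch of the binding behaviour the wrapper compensates for, assuming the xapian Python bindings are installed:

    import xapian

    try:
        stem = xapian.Stem('english')
        # the Snowball stemmer hands back a utf-8 encoded str even for
        # unicode input, which is why __init__ wraps it with .decode('utf-8')
        print repr(stem(u'running'))  # 'run' (a byte string, not unicode)
    except xapian.InvalidArgumentError:
        # same fallback as the constructor: unknown language, no stemming
        print 'language not available for stemming'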


Member Function Documentation

def MoinMoin.search.Xapian.tokenizer.WikiAnalyzer.raw_tokenize(self, value)

Yield a stream of words from a string.

@param value: string to split, must be a unicode object or a list of
      unicode objects

Definition at line 76 of file tokenizer.py.

    def raw_tokenize(self, value):
        """ Yield a stream of words from a string.

        @param value: string to split, must be a unicode object or a list of
                      unicode objects
        """
        if isinstance(value, list): # used for page links
            for v in value:
                yield (v, 0)
        else:
            tokenstream = re.finditer(self.token_re, value)
            for m in tokenstream:
                if m.group("acronym"):
                    yield (m.group("acronym").replace('.', ''), m.start())
                elif m.group("company"):
                    yield (m.group("company"), m.start())
                elif m.group("email"):
                    displ = 0
                    for word in self.mail_re.split(m.group("email")):
                        if word:
                            yield (word, m.start() + displ)
                            displ += len(word) + 1
                elif m.group("word"):
                    for word, pos in self.raw_tokenize_word(m.group("word"), m.start()):
                        yield word, pos

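An illustrative call (the exact positions and the WikiWord splitting depend on token_re and WikiParser.word_rule, so the output shown is approximate):

    analyzer = WikiAnalyzer()
    print list(analyzer.raw_tokenize(u"I.B.M. ships MoinMoin"))
    # roughly: [(u'IBM', 0), (u'ships', 7), (u'MoinMoin', 13),
    #           (u'Moin', 13), (u'Moin', 17)]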

def MoinMoin.search.Xapian.tokenizer.WikiAnalyzer.raw_tokenize_word(self, word, pos)

Try to further tokenize some word starting at pos.

Definition at line 59 of file tokenizer.py.

    def raw_tokenize_word(self, word, pos):
        """ try to further tokenize some word starting at pos """
        yield (word, pos)
        if self.wikiword_re.match(word):
            # if it is a CamelCaseWord, we additionally try to tokenize Camel, Case and Word
            for m in re.finditer(self.singleword_re, word):
                mw, mp = m.group(), pos + m.start()
                for w, p in self.raw_tokenize_word(mw, mp):
                    yield (w, p)
        else:
            # if we have Foo42, yield Foo and 42
            for m in re.finditer(self.alpha_num_re, word):
                mw, mp = m.group(), pos + m.start()
                if mw != word:
                    for w, p in self.raw_tokenize_word(mw, mp):
                        yield (w, p)

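For example, the mixed alphanumeric case from the comment above behaves like this (positions are offsets into the analyzed text):

    analyzer = WikiAnalyzer()
    print list(analyzer.raw_tokenize_word(u"Foo42", 0))
    # [(u'Foo42', 0), (u'Foo', 0), (u'42', 3)]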

def MoinMoin.search.Xapian.tokenizer.WikiAnalyzer.tokenize(self, value)

Yield a stream of raw lowercased and stemmed words from a string.

@param value: string to split, must be a unicode object or a list of
      unicode objects

Definition at line 102 of file tokenizer.py.

    def tokenize(self, value):
        """
        Yield a stream of raw lowercased and stemmed words from a string.

        @param value: string to split, must be a unicode object or a list of
                      unicode objects
        """
        if self.stemmer:

            def stemmer(value):
                stemmed = self.stemmer(value)
                if stemmed != value:
                    return stemmed
                else:
                    return ''
        else:
            stemmer = lambda v: ''

        for word, pos in self.raw_tokenize(value):
            # Xapian stemmer expects lowercase input
            word = word.lower()
            yield word, stemmer(word)
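
A sketch of the two stemmer paths; the stemmed form shown for the stemming case assumes an English Snowball stemmer:

    plain = WikiAnalyzer()
    print list(plain.tokenize(u"Tests"))
    # [(u'tests', u'')]   the dummy stemmer always returns ''

    # with stemming configured (hypothetical request object), a word the
    # stemmer actually changes yields its stem as the second element:
    # list(stemming.tokenize(u"Tests")) -> [(u'tests', u'test')]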



Member Data Documentation

tuple MoinMoin.search.Xapian.tokenizer.WikiAnalyzer.alpha_num_re = re.compile(r"\d+|\D+") [static]

Definition at line 41 of file tokenizer.py.

tuple MoinMoin.search.Xapian.tokenizer.WikiAnalyzer.dot_re = re.compile(r"[-_/,.]") [static]

Definition at line 39 of file tokenizer.py.

tuple MoinMoin.search.Xapian.tokenizer.WikiAnalyzer.mail_re = re.compile(r"[-_/,.]|(@)") [static]

Definition at line 40 of file tokenizer.py.
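
A small sketch of how this pattern splits an address; because the '@' branch is a capturing group, re.split keeps the '@' itself in the result (and raw_tokenize then yields it as a token), while the non-capturing separators leave None placeholders:

    import re

    mail_re = re.compile(r"[-_/,.]|(@)")
    print mail_re.split(u"user.name@example.com")
    # [u'user', None, u'name', u'@', u'example', None, u'com']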

string MoinMoin.search.Xapian.tokenizer.WikiAnalyzer.singleword = r"[%(u)s][%(l)s]+" [static]

Definition at line 24 of file tokenizer.py.

tuple MoinMoin.search.Xapian.tokenizer.WikiAnalyzer.singleword_re = re.compile(singleword, re.U) [static]

Definition at line 29 of file tokenizer.py.

MoinMoin.search.Xapian.tokenizer.WikiAnalyzer.stemmer

Definition at line 48 of file tokenizer.py.

tuple MoinMoin.search.Xapian.tokenizer.WikiAnalyzer.token_re [static]

Initial value:
re.compile(
        r"(?P<company>\w+[&@]\w+)|" + # company names like AT&T and Excite@Home.
        r"(?P<email>\w+([.-]\w+)*@\w+([.-]\w+)*)|" +    # email addresses
        r"(?P<acronym>(\w\.)+)|" +          # acronyms: U.S.A., I.B.M., etc.
        r"(?P<word>\w+)",                   # words (including WikiWords)
        re.U)

Definition at line 32 of file tokenizer.py.
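
A sketch of which named alternative claims typical inputs. Note that because the company branch is tried first, an address without a dot before the '@' (such as user@example) is claimed by company rather than email:

    import re

    token_re = re.compile(
        r"(?P<company>\w+[&@]\w+)|"
        r"(?P<email>\w+([.-]\w+)*@\w+([.-]\w+)*)|"
        r"(?P<acronym>(\w\.)+)|"
        r"(?P<word>\w+)",
        re.U)

    for sample in [u"AT&T", u"user.name@example.com", u"U.S.A.", u"MoinMoin"]:
        m = token_re.match(sample)
        # exactly one of the four named alternatives matches
        name = [k for k, v in m.groupdict().items() if v][0]
        print "%s -> %s" % (sample, name)
    # AT&T -> company
    # user.name@example.com -> email
    # U.S.A. -> acronym
    # MoinMoin -> word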

tuple MoinMoin.search.Xapian.tokenizer.WikiAnalyzer.wikiword_re = re.compile(WikiParser.word_rule, re.UNICODE|re.VERBOSE) [static]

Definition at line 30 of file tokenizer.py.


The documentation for this class was generated from the following file:

MoinMoin/search/Xapian/tokenizer.py