lightning-sunbird  0.9+nobinonly
nsBayesianFilter Class Reference

#include <nsBayesianFilter.h>

Public Member Functions

NS_DECL_ISUPPORTS
NS_DECL_NSIMSGFILTERPLUGIN
NS_DECL_NSIJUNKMAILPLUGIN 
nsBayesianFilter ()
virtual ~nsBayesianFilter ()
nsresult tokenizeMessage (const char *messageURI, nsIMsgWindow *aMsgWindow, TokenAnalyzer *analyzer)
void classifyMessage (Tokenizer &tokens, const char *messageURI, nsIJunkMailClassificationListener *listener)
void observeMessage (Tokenizer &tokens, const char *messageURI, nsMsgJunkStatus oldClassification, nsMsgJunkStatus newClassification, nsIJunkMailClassificationListener *listener)
void writeTrainingData ()
void readTrainingData ()
nsresult getTrainingFile (nsILocalFile **aFile)
void classifyMessage (in string aMsgURI, in nsIMsgWindow aMsgWindow, in nsIJunkMailClassificationListener aListener)
 Given a message URI, determine what its current classification is according to the current training set.
void classifyMessages (in unsigned long aCount,[array, size_is(aCount)] in string aMsgURIs, in nsIMsgWindow aMsgWindow, in nsIJunkMailClassificationListener aListener)
void setMessageClassification (in string aMsgURI, in nsMsgJunkStatus aOldUserClassification, in nsMsgJunkStatus aNewClassification, in nsIMsgWindow aMsgWindow, in nsIJunkMailClassificationListener aListener)
 Called when a user forces the classification of a message.
void resetTrainingData ()
 Removes the training file and clears out any in-memory training tokens.
void shutdown ()
 Do any necessary cleanup: flush and close any open files, etc.

Public Attributes

const nsMsgJunkStatus UNCLASSIFIED = 0
 Message classifications.
const nsMsgJunkStatus GOOD = 1
const nsMsgJunkStatus JUNK = 2
readonly attribute boolean userHasClassified
readonly attribute boolean shouldDownloadAllHeaders
 Some protocols (e.g. IMAP) can, as an optimization, avoid downloading all message header lines.

Static Protected Member Functions

static void TimerCallback (nsITimer *aTimer, void *aClosure)

Protected Attributes

Tokenizer mGoodTokens
Tokenizer mBadTokens
double mJunkProbabilityThreshold
PRUint32 mGoodCount
PRUint32 mBadCount
PRPackedBool mTrainingDataDirty
PRInt32 mMinFlushInterval
nsCOMPtr< nsITimer > mTimer
nsCOMPtr< nsILocalFile > mTrainingFile

Detailed Description

Definition at line 139 of file nsBayesianFilter.h.


Constructor & Destructor Documentation

nsBayesianFilter::nsBayesianFilter ( )

Definition at line 912 of file nsBayesianFilter.cpp.

    :   mGoodCount(0), mBadCount(0), mTrainingDataDirty(PR_FALSE)
{
    if (!BayesianFilterLogModule)
      BayesianFilterLogModule = PR_NewLogModule("BayesianFilter");
    
    PRInt32 junkThreshold = 0;
    nsresult rv;
    nsCOMPtr<nsIPrefBranch> pPrefBranch(do_GetService(NS_PREFSERVICE_CONTRACTID, &rv));
    if (pPrefBranch)
      pPrefBranch->GetIntPref("mail.adaptivefilters.junk_threshold", &junkThreshold);

    mJunkProbabilityThreshold = ((double) junkThreshold) / 100;
    if (mJunkProbabilityThreshold == 0 || mJunkProbabilityThreshold >= 1)
      mJunkProbabilityThreshold = kDefaultJunkThreshold;

    PR_LOG(BayesianFilterLogModule, PR_LOG_ALWAYS, ("junk probabilty threshold: %f", mJunkProbabilityThreshold));

    getTrainingFile(getter_AddRefs(mTrainingFile));

    PRBool ok = (mGoodTokens && mBadTokens);
    NS_ASSERTION(ok, "error allocating tokenizers");
    if (ok)
        readTrainingData();
    else {
      PR_LOG(BayesianFilterLogModule, PR_LOG_ALWAYS, ("error allocating tokenizers"));
    }
    
    // get parameters for training data flushing, from the prefs

    nsCOMPtr<nsIPrefBranch> prefBranch;
    
    nsCOMPtr<nsIPrefService> prefs = do_GetService(NS_PREFSERVICE_CONTRACTID, &rv);
    NS_ASSERTION(NS_SUCCEEDED(rv),"failed accessing preferences service");
    rv = prefs->GetBranch(nsnull, getter_AddRefs(prefBranch));
    NS_ASSERTION(NS_SUCCEEDED(rv),"failed getting preferences branch");

    rv = prefBranch->GetIntPref("mailnews.bayesian_spam_filter.flush.minimum_interval",&mMinFlushInterval);
    // it is not a good idea to allow a minimum interval of under 1 second
    if (NS_FAILED(rv) || (mMinFlushInterval <= 1000) )
        mMinFlushInterval = DEFAULT_MIN_INTERVAL_BETWEEN_WRITES;

    mTimer = do_CreateInstance(NS_TIMER_CONTRACTID, &rv);
    NS_ASSERTION(NS_SUCCEEDED(rv), "unable to create a timer; training data will only be written on exit");
    
    // the timer is not used on object construction, since for
    // the time being there are no dirtying messages
    
}
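
The two preference checks in the constructor above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the listed implementation: `junkThresholdFromPref` and `flushIntervalFromPref` are hypothetical helper names, and the fallback values are passed in as parameters because the values of `kDefaultJunkThreshold` and `DEFAULT_MIN_INTERVAL_BETWEEN_WRITES` are not shown on this page.

```cpp
#include <cassert>

// Hypothetical helper: turn an integer percentage preference into a
// probability threshold. A result of exactly 0, or of 1 and above, is
// replaced by the caller-supplied fallback, mirroring the constructor.
double junkThresholdFromPref(int percent, double fallback) {
    assert(fallback > 0.0 && fallback < 1.0);
    double threshold = static_cast<double>(percent) / 100;
    if (threshold == 0 || threshold >= 1)
        return fallback;
    return threshold;
}

// Hypothetical helper: a failed preference lookup, or an interval at or
// below 1000 ms, is replaced by the caller-supplied fallback.
int flushIntervalFromPref(bool lookupSucceeded, int intervalMs, int fallbackMs) {
    if (!lookupSucceeded || intervalMs <= 1000)
        return fallbackMs;
    return intervalMs;
}
```

As in the listing, a negative percentage is not caught by the range check and passes through unchanged, and an interval of exactly 1000 ms is treated as too short.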

nsBayesianFilter::~nsBayesianFilter ( ) [virtual]

Definition at line 972 of file nsBayesianFilter.cpp.

{
    if (mTimer)
    {
        mTimer->Cancel();
        mTimer = nsnull;
    }
    // call shutdown when we are going away in case we need
    // to flush the training set to disk
    Shutdown();
}


Member Function Documentation

void nsIJunkMailPlugin::classifyMessage ( in string  aMsgURI,
in nsIMsgWindow  aMsgWindow,
in nsIJunkMailClassificationListener  aListener 
) [inherited]

Given a message URI, determine what its current classification is according to the current training set.

void nsBayesianFilter::classifyMessage ( Tokenizer &  tokens,
const char *  messageURI,
nsIJunkMailClassificationListener *  listener 
)

Definition at line 1087 of file nsBayesianFilter.cpp.

{
    Token* tokens = tokenizer.copyTokens();
    if (!tokens) return;
  
    // the algorithm in "A Plan For Spam" assumes that you have a large good
    // corpus and a large junk corpus.
    // that won't be the case with users who first use the junk mail feature
    // so, we do certain things to encourage them to train.
    //
    // if there are no good tokens, assume the message is junk
    // this will "encourage" the user to train
    // and if there are no bad tokens, assume the message is not junk
    // this will also "encourage" the user to train
    // see bug #194238
    if (listener && !mGoodCount && !mGoodTokens.countTokens()) {
      PR_LOG(BayesianFilterLogModule, PR_LOG_ALWAYS, ("no good tokens, assume junk"));
      listener->OnMessageClassified(messageURI, nsMsgJunkStatus(nsIJunkMailPlugin::JUNK));
      return;
    }
    if (listener && !mBadCount && !mBadTokens.countTokens()) {
      PR_LOG(BayesianFilterLogModule, PR_LOG_ALWAYS, ("no bad tokens, assume good"));
      listener->OnMessageClassified(messageURI, nsMsgJunkStatus(nsIJunkMailPlugin::GOOD));
      return;
    }

    /* this part is similar to the Graham algorithm with some adjustments. */
    PRUint32 i, goodclues=0, count = tokenizer.countTokens();
    double ngood = mGoodCount, nbad = mBadCount, prob;

    for (i = 0; i < count; ++i) 
    {
        Token& token = tokens[i];
        const char* word = token.mWord;
        Token* t = mGoodTokens.get(word);
        double hamcount = ((t != NULL) ? t->mCount : 0);
        t = mBadTokens.get(word);
        double spamcount = ((t != NULL) ? t->mCount : 0);

        // if hamcount and spam count are both 0, we could end up with a divide by 0 error, 
        // tread carefully here. (Bug #240819)
        double probDenom = (hamcount * nbad + spamcount * ngood);
        if (probDenom == 0.0) // nGood and nbad are known to be non zero or we wouldn't be here
            probDenom = nbad + ngood; // error case use a value of 1 for hamcount and spamcount if they are both zero.

        prob = (spamcount * ngood) / probDenom;
        double n = hamcount + spamcount;
        prob = (0.225 + n * prob) / (.45 + n);
        double distance = PR_ABS(prob - 0.5);
        if (distance >= .1) 
        {
            goodclues++;
            token.mDistance = distance;
            token.mProbability = prob;
            PR_LOG(BayesianFilterLogModule, PR_LOG_ALWAYS, ("token.mProbability (%s) is %f", word, token.mProbability));
        }
        else 
            token.mDistance = -1; //ignore clue
    }
    
    // sort the array by the token distances
    NS_QuickSort(tokens, count, sizeof(Token), compareTokens, NULL);
    PRUint32 first, last = count;
    first = (goodclues > 150) ? count - 150 : 0;

    double H = 1.0, S = 1.0;
    PRInt32 Hexp = 0, Sexp = 0;
    goodclues=0;
    int e;

    for (i = first; i < last; ++i) 
    {
        if (tokens[i].mDistance != -1) 
        {
            goodclues++;
            double value = tokens[i].mProbability;
            S *= (1.0 - value);
            H *= value;
            if ( S < 1e-200 ) 
            {
                S = frexp(S, &e);
                Sexp += e;
            }
            if ( H < 1e-200 ) 
            {
                H = frexp(H, &e);
                Hexp += e;
            }
        }
    }

    S = log(S) + Sexp * M_LN2;
    H = log(H) + Hexp * M_LN2;

    if (goodclues > 0) 
    {
        PRInt32 chi_error;
        S = chi2P(-2.0 * S, 2.0 * goodclues, &chi_error);
        if (!chi_error)
            H = chi2P(-2.0 * H, 2.0 * goodclues, &chi_error);
        // if any error toss the entire calculation
        if (!chi_error)
            prob = (S-H +1.0) / 2.0;
        else
            prob = 0.5;
    } 
    else 
        prob = 0.5;

    PRBool isJunk = (prob >= mJunkProbabilityThreshold);
    PR_LOG(BayesianFilterLogModule, PR_LOG_ALWAYS, ("%s is junk probability = (%f)  HAM SCORE:%f SPAM SCORE:%f", messageURI, prob,H,S));

    delete[] tokens;

    if (listener)
        listener->OnMessageClassified(messageURI, isJunk ? nsMsgJunkStatus(nsIJunkMailPlugin::JUNK) : nsMsgJunkStatus(nsIJunkMailPlugin::GOOD));
}
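
The scoring steps in the listing above can be sketched in standalone form. This is a minimal illustration under stated assumptions, not the listed implementation. The names `tokenProbability`, `chiSquareCdfEven`, and `combineProbabilities` are hypothetical. The body of `chi2P` is not shown on this page, so the sketch assumes it returns the lower-tail chi-square probability (which is what makes a message full of junk-leaning tokens score near 1) and substitutes the closed form that exists for an even number of degrees of freedom. It also sums logarithms directly rather than rescaling with `frexp`, and it leaves out the distance filter, the sort, and the 150-clue cap.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Per-token junk probability, following the arithmetic in the loop above.
// hamCount/spamCount: occurrences of the token in each corpus.
// nGood/nBad: number of trained good and junk messages (both non-zero).
double tokenProbability(double hamCount, double spamCount, double nGood, double nBad) {
    assert(nGood > 0 && nBad > 0);
    double denom = hamCount * nBad + spamCount * nGood;
    if (denom == 0.0)               // token unseen in both corpora
        denom = nBad + nGood;
    double raw = (spamCount * nGood) / denom;
    double n = hamCount + spamCount;
    // Pull the estimate toward 0.5 when the token has rarely been seen.
    return (0.225 + n * raw) / (0.45 + n);
}

// Assumed stand-in for chi2P: lower-tail chi-square probability, valid
// only for an even number of degrees of freedom.
double chiSquareCdfEven(double x2, int degreesOfFreedom) {
    assert(degreesOfFreedom > 0 && degreesOfFreedom % 2 == 0);
    double m = x2 / 2.0;
    double term = std::exp(-m);     // i = 0 term of exp(-m) * m^i / i!
    double upperTail = 0.0;
    for (int i = 0; i < degreesOfFreedom / 2; ++i) {
        upperTail += term;
        term *= m / (i + 1);
    }
    if (upperTail > 1.0)
        upperTail = 1.0;            // guard against rounding overshoot
    return 1.0 - upperTail;
}

// Combine per-token probabilities (each strictly between 0 and 1) into
// one junk score using the (S - H + 1) / 2 formula from the listing.
double combineProbabilities(const std::vector<double>& probs) {
    if (probs.empty())
        return 0.5;                 // no usable clues
    double logS = 0.0, logH = 0.0;
    for (double p : probs) {
        logS += std::log(1.0 - p);
        logH += std::log(p);
    }
    int df = 2 * static_cast<int>(probs.size());
    double S = chiSquareCdfEven(-2.0 * logS, df);
    double H = chiSquareCdfEven(-2.0 * logH, df);
    return (S - H + 1.0) / 2.0;
}
```

A token that has never been seen scores exactly 0.5 and would be dropped by the `distance >= .1` test in the listing; a caller of this sketch is expected to apply that filter before combining.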


void nsIJunkMailPlugin::classifyMessages ( in unsigned long  aCount,
[array, size_is(aCount)] in string  aMsgURIs,
in nsIMsgWindow  aMsgWindow,
in nsIJunkMailClassificationListener  aListener 
) [inherited]

nsresult nsBayesianFilter::getTrainingFile ( nsILocalFile **  aFile )

Definition at line 1456 of file nsBayesianFilter.cpp.

{
  // should we cache the profile manager's directory?
  nsCOMPtr<nsIFile> profileDir;

  nsresult rv = NS_GetSpecialDirectory(NS_APP_USER_PROFILE_50_DIR, getter_AddRefs(profileDir));
  NS_ENSURE_SUCCESS(rv, rv);
  rv = profileDir->Append(NS_LITERAL_STRING("training.dat"));
  NS_ENSURE_SUCCESS(rv, rv);
  
  return profileDir->QueryInterface(NS_GET_IID(nsILocalFile), (void **) aTrainingFile);
}

void nsBayesianFilter::observeMessage ( Tokenizer &  tokens,
const char *  messageURI,
nsMsgJunkStatus  oldClassification,
nsMsgJunkStatus  newClassification,
nsIJunkMailClassificationListener *  listener 
)

Definition at line 1291 of file nsBayesianFilter.cpp.

{
    PR_LOG(BayesianFilterLogModule, PR_LOG_ALWAYS, ("observeMessage(%s) old=%d new=%d", messageURL, oldClassification, newClassification));

    PRBool trainingDataWasDirty = mTrainingDataDirty;
    TokenEnumeration tokens = tokenizer.getTokens();

    // Uhoh...if the user is re-training then the message may already be classified and we are classifying it again with the same classification.
    // the old code would have removed the tokens for this message then added them back. But this really hurts the message occurrence
    // count for tokens if you just removed training.dat and are re-training. See Bug #237095 for more details.
    // What can we do here? Well we can skip the token removal step if the classifications are the same and assume the user is
    // just re-training. But this then allows users to re-classify the same message on the same training set over and over again
    // leading to data skew. But that's all I can think to do right now to address this.....
    if (oldClassification != newClassification) 
    {
      // remove the tokens from the token set it is currently in
    switch (oldClassification) {
    case nsIJunkMailPlugin::JUNK:
        // remove tokens from junk corpus.
        if (mBadCount > 0) {
            --mBadCount;
            forgetTokens(mBadTokens, tokens);
            mTrainingDataDirty = PR_TRUE;
        }
        break;
    case nsIJunkMailPlugin::GOOD:
        // remove tokens from good corpus.
        if (mGoodCount > 0) {
            --mGoodCount;
            forgetTokens(mGoodTokens, tokens);
            mTrainingDataDirty = PR_TRUE;
        }
        break;
    }
    }

    
    switch (newClassification) {
    case nsIJunkMailPlugin::JUNK:
        // put tokens into junk corpus.
        ++mBadCount;
        rememberTokens(mBadTokens, tokens);
        mTrainingDataDirty = PR_TRUE;
        break;
    case nsIJunkMailPlugin::GOOD:
        // put tokens into good corpus.
        ++mGoodCount;
        rememberTokens(mGoodTokens, tokens);
        mTrainingDataDirty = PR_TRUE;
        break;
    }
    
    if (listener)
        listener->OnMessageClassified(messageURL, newClassification);
    
    if (mTrainingDataDirty && !trainingDataWasDirty && ( mTimer != nsnull ))
    {
        // if training data became dirty just now, schedule flush
        // mMinFlushInterval msec from now
        PR_LOG(
            BayesianFilterLogModule, PR_LOG_ALWAYS,
            ("starting training data flush timer %i msec", mMinFlushInterval));
        mTimer->InitWithFuncCallback(nsBayesianFilter::TimerCallback, this, mMinFlushInterval, nsITimer::TYPE_ONE_SHOT);
    }
}
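
The retraining rule in the listing above (remove from the old corpus only when the classification actually changes, then always add to the new one) can be sketched with standard containers. This is a minimal illustration under stated assumptions: `Corpus`, `Trainer`, `forget`, and `remember` are hypothetical names, and the per-token bookkeeping is a guess, since the bodies of `forgetTokens` and `rememberTokens` are not shown on this page.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Values match the nsMsgJunkStatus constants listed under Public Attributes.
enum JunkStatus { UNCLASSIFIED = 0, GOOD = 1, JUNK = 2 };

struct Corpus {
    unsigned messageCount = 0;
    std::unordered_map<std::string, unsigned> tokenCounts;
};

// Assumed behaviour: drop one occurrence of each token; returns false when
// the corpus holds no messages, in which case nothing is changed.
bool forget(Corpus& corpus, const std::vector<std::string>& tokens) {
    if (corpus.messageCount == 0)
        return false;
    --corpus.messageCount;
    for (const std::string& t : tokens) {
        auto it = corpus.tokenCounts.find(t);
        if (it != corpus.tokenCounts.end() && --it->second == 0)
            corpus.tokenCounts.erase(it);
    }
    return true;
}

// Assumed behaviour: add one occurrence of each token.
void remember(Corpus& corpus, const std::vector<std::string>& tokens) {
    ++corpus.messageCount;
    for (const std::string& t : tokens)
        ++corpus.tokenCounts[t];
}

struct Trainer {
    Corpus good, junk;
    bool dirty = false;

    void observe(const std::vector<std::string>& tokens,
                 JunkStatus oldStatus, JunkStatus newStatus) {
        assert(newStatus == GOOD || newStatus == JUNK || newStatus == UNCLASSIFIED);
        // Skip removal when the classification is unchanged (re-training).
        if (oldStatus != newStatus) {
            if (oldStatus == JUNK && forget(junk, tokens)) dirty = true;
            if (oldStatus == GOOD && forget(good, tokens)) dirty = true;
        }
        if (newStatus == JUNK) { remember(junk, tokens); dirty = true; }
        if (newStatus == GOOD) { remember(good, tokens); dirty = true; }
    }
};
```

Observing the same message twice with an unchanged classification counts it twice, which is the data skew the source comment warns about.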


void nsBayesianFilter::readTrainingData ( )

Definition at line 1501 of file nsBayesianFilter.cpp.

{
  if (!mTrainingFile) 
    return;
  
  PRBool exists;
  nsresult rv = mTrainingFile->Exists(&exists);
  if (NS_FAILED(rv) || !exists) 
    return;

  FILE* stream;
  rv = mTrainingFile->OpenANSIFileDesc("rb", &stream);
  if (NS_FAILED(rv)) 
    return;

  PRInt64 fileSize;
  rv = mTrainingFile->GetFileSize(&fileSize);
  if (NS_FAILED(rv)) 
    return;

  // FIXME:  should make sure that the tokenizers are empty.
  char cookie[4];
  if (!((fread(cookie, sizeof(cookie), 1, stream) == 1) &&
        (memcmp(cookie, kMagicCookie, sizeof(cookie)) == 0) &&
        (readUInt32(stream, &mGoodCount) == 1) &&
        (readUInt32(stream, &mBadCount) == 1) &&
         readTokens(stream, mGoodTokens, fileSize) &&
         readTokens(stream, mBadTokens, fileSize))) {
      NS_WARNING("failed to read training data.");
      PR_LOG(BayesianFilterLogModule, PR_LOG_ALWAYS, ("failed to read training data."));
  }
  
  fclose(stream);
}

void resetTrainingData ( ) [inherited]

Removes the training file and clears out any in-memory training tokens.

User must retrain after doing this.

void nsIJunkMailPlugin::setMessageClassification ( in string  aMsgURI,
in nsMsgJunkStatus  aOldUserClassification,
in nsMsgJunkStatus  aNewClassification,
in nsIMsgWindow  aMsgWindow,
in nsIJunkMailClassificationListener  aListener 
) [inherited]

Called when a user forces the classification of a message.

Should cause the training set to be updated appropriately.

  • aMsgURI URI of the message to be classified
  • aOldUserClassification Was it previously manually classified by the user? If so, how?
  • aNewClassification New manual classification.
  • aListener Callback

void shutdown ( ) [inherited]

Do any necessary cleanup: flush and close any open files, etc.

void nsBayesianFilter::TimerCallback ( nsITimer *  aTimer,
void *  aClosure 
) [static, protected]

Definition at line 963 of file nsBayesianFilter.cpp.

{
    // we will flush the training data to disk after enough time has passed
    // since the first time a message has been classified after the last flush

    nsBayesianFilter *filter = NS_STATIC_CAST(nsBayesianFilter *, aClosure);
    filter->writeTrainingData();
}
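
The callback above relies on a common pattern: a static function receives an untyped closure pointer and casts it back to the owning object. The sketch below shows that pattern together with the "arm the timer only on the clean-to-dirty transition" rule from observeMessage. It is an illustration under stated assumptions: `Filter`, `TimerFunc`, `fireTimer`, and `timersArmed` are hypothetical, and no real timer service is involved.

```cpp
#include <cassert>

// Hypothetical C-style callback type; the closure pointer is handed back
// to the callback untouched.
typedef void (*TimerFunc)(void* closure);

class Filter {
public:
    bool dirty = false;
    int timersArmed = 0;   // how many one-shot flushes were scheduled
    int flushCount = 0;    // how many times training data was "written"

    // Mark the training data dirty; schedule a flush only if it was clean.
    void markDirty() {
        bool wasDirty = dirty;
        dirty = true;
        if (!wasDirty)
            ++timersArmed;
    }

    // Stand-in for writing the training file.
    void writeTrainingData() {
        ++flushCount;
        dirty = false;
    }

    // Static entry point: recover the object from the closure and flush.
    static void TimerCallback(void* closure) {
        assert(closure != nullptr);
        Filter* self = static_cast<Filter*>(closure);
        self->writeTrainingData();
    }
};

// Stand-in for the timer firing after the flush interval has elapsed.
void fireTimer(TimerFunc func, void* closure) {
    func(closure);
}
```

Because the callback is static, it carries no implicit `this`; the object must outlive any armed timer, which is why the destructor in the listing cancels the timer before shutting down.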


nsresult nsBayesianFilter::tokenizeMessage ( const char *  messageURI,
nsIMsgWindow *  aMsgWindow,
TokenAnalyzer *  analyzer 
)

Definition at line 1040 of file nsBayesianFilter.cpp.

{

    nsCOMPtr <nsIMsgMessageService> msgService;
    nsresult rv = GetMessageServiceFromURI(aMessageURI, getter_AddRefs(msgService));
    NS_ENSURE_SUCCESS(rv, rv);

    aAnalyzer->setSource(aMessageURI);
    return msgService->StreamMessage(aMessageURI, aAnalyzer->mTokenListener, aMsgWindow,
                                          nsnull, PR_TRUE /* convert data */, 
                                                "filter", nsnull);
}


void nsBayesianFilter::writeTrainingData ( )

Definition at line 1471 of file nsBayesianFilter.cpp.

{
  PR_LOG(BayesianFilterLogModule, PR_LOG_ALWAYS, ("writeTrainingData() entered"));
  if (!mTrainingFile) 
    return;

  // open the file, and write out training data
  FILE* stream;
  nsresult rv = mTrainingFile->OpenANSIFileDesc("wb", &stream);
  if (NS_FAILED(rv)) 
    return;

  if (!((fwrite(kMagicCookie, sizeof(kMagicCookie), 1, stream) == 1) &&
        (writeUInt32(stream, mGoodCount) == 1) &&
        (writeUInt32(stream, mBadCount) == 1) &&
         writeTokens(stream, mGoodTokens) &&
         writeTokens(stream, mBadTokens))) 
  {
    NS_WARNING("failed to write training data.");
    fclose(stream);
    // delete the training data file, since it is potentially corrupt.
    mTrainingFile->Remove(PR_FALSE);
  } 
  else 
  {
    fclose(stream);
    mTrainingDataDirty = PR_FALSE;
  }
}
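
The read and write listings agree on the order of the file's leading fields: a four-byte cookie, the good-message count, then the junk-message count, followed by the two token sets. The header portion can be sketched as a round trip. This is an illustration under stated assumptions: `kSketchCookie` is a made-up marker (the real `kMagicCookie` bytes are not shown on this page), the big-endian encoding of the counts is assumed because `readUInt32` and `writeUInt32` are not shown, and the token sections are omitted for the same reason.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical 4-byte marker, standing in for kMagicCookie.
static const unsigned char kSketchCookie[4] = { 'B', 'F', 'T', 'D' };

// Assumed encoding: 4 bytes, most significant byte first.
bool writeU32(std::FILE* f, std::uint32_t v) {
    unsigned char b[4] = {
        static_cast<unsigned char>((v >> 24) & 0xFF),
        static_cast<unsigned char>((v >> 16) & 0xFF),
        static_cast<unsigned char>((v >> 8) & 0xFF),
        static_cast<unsigned char>(v & 0xFF)
    };
    return std::fwrite(b, sizeof b, 1, f) == 1;
}

bool readU32(std::FILE* f, std::uint32_t* v) {
    unsigned char b[4];
    if (std::fread(b, sizeof b, 1, f) != 1)
        return false;
    *v = (std::uint32_t(b[0]) << 24) | (std::uint32_t(b[1]) << 16) |
         (std::uint32_t(b[2]) << 8) | std::uint32_t(b[3]);
    return true;
}

// Write cookie, good count, junk count, in that order.
bool writeHeader(std::FILE* f, std::uint32_t goodCount, std::uint32_t junkCount) {
    assert(f != nullptr);
    return std::fwrite(kSketchCookie, sizeof kSketchCookie, 1, f) == 1 &&
           writeU32(f, goodCount) &&
           writeU32(f, junkCount);
}

// Read the header back; fails on a short file or a cookie mismatch.
bool readHeader(std::FILE* f, std::uint32_t* goodCount, std::uint32_t* junkCount) {
    assert(f != nullptr);
    unsigned char cookie[4];
    return std::fread(cookie, sizeof cookie, 1, f) == 1 &&
           std::memcmp(cookie, kSketchCookie, sizeof cookie) == 0 &&
           readU32(f, goodCount) &&
           readU32(f, junkCount);
}
```

Checking the cookie before trusting the counts lets a reader reject an unrelated or truncated file early, which is the role the cookie comparison plays in readTrainingData.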



Member Data Documentation

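const nsMsgJunkStatus GOOD = 1 [inherited]

Definition at line 87 of file nsIMsgFilterPlugin.idl.

const nsMsgJunkStatus JUNK = 2 [inherited]

Definition at line 88 of file nsIMsgFilterPlugin.idl.

PRUint32 mBadCount [protected]

Definition at line 163 of file nsBayesianFilter.h.

Tokenizer mBadTokens [protected]

Definition at line 161 of file nsBayesianFilter.h.

PRUint32 mGoodCount [protected]

Definition at line 163 of file nsBayesianFilter.h.

Tokenizer mGoodTokens [protected]

Definition at line 161 of file nsBayesianFilter.h.

double mJunkProbabilityThreshold [protected]

Definition at line 162 of file nsBayesianFilter.h.

PRInt32 mMinFlushInterval [protected]

Definition at line 165 of file nsBayesianFilter.h.

nsCOMPtr< nsITimer > mTimer [protected]

Definition at line 167 of file nsBayesianFilter.h.

PRPackedBool mTrainingDataDirty [protected]

Definition at line 164 of file nsBayesianFilter.h.

nsCOMPtr< nsILocalFile > mTrainingFile [protected]

Definition at line 168 of file nsBayesianFilter.h.

readonly attribute boolean shouldDownloadAllHeaders [inherited]

Some protocols (e.g. IMAP) can, as an optimization, avoid downloading all message header lines.

If your plugin doesn't need any more than the minimal set, it can return false for this attribute.

Definition at line 65 of file nsIMsgFilterPlugin.idl.

const nsMsgJunkStatus UNCLASSIFIED = 0 [inherited]

Message classifications.

Definition at line 86 of file nsIMsgFilterPlugin.idl.

readonly attribute boolean userHasClassified [inherited]

Definition at line 118 of file nsIMsgFilterPlugin.idl.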


The documentation for this class was generated from the following files: