Keeping up with multiple high-traffic mailing lists can be a fool's errand. Machine learning and automatic text classification have for many years promised a better solution to this problem. This project seeks to put that promise to the test by building a custom model of the emails of interest to a particular user.

In the long run, this can be turned into a multi-user website incorporating not only a trained model, but also hard constraints (show all messages that mention a particular person or project), plus thread-pattern-based filtering (see Joey's famous blog post referenced below).

In the future, this can be part of a personalized information dashboard incorporating RSS feeds, Twitter feeds and user activities via http://zeitgeist-project.com/ (and others). While the focus here is 100% on mailing lists, I do so want to live in such a future!

See a demo and consider joining the SourceForge project.

Version 0

This is a first version to get things moving and start collecting training material.

  • Single user
  • Backend code in Perl
  • Re-purposed spam detection technology
  • Gmane for displaying messages (Gmane doesn't track all the mailing lists I follow)
  • SQLite3 backend
  • AJAX front-end written in Scala and NextApp Echo3
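
As an illustration of how training material could be collected in SQLite, here is a minimal sketch. It is written in Python rather than the project's Perl purely for brevity, and the table name, column names and helper functions are all hypothetical, not the project's actual schema.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) the label store; table and column names are hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS message (
               message_id  TEXT PRIMARY KEY,   -- Message-ID header
               list_name   TEXT NOT NULL,      -- mailing list the message came from
               subject     TEXT NOT NULL,
               interesting INTEGER             -- 1, 0, or NULL while unlabeled
           )"""
    )
    return conn

def add_message(conn, message_id, list_name, subject):
    # INSERT OR IGNORE keeps re-imports of the same message harmless.
    conn.execute(
        "INSERT OR IGNORE INTO message (message_id, list_name, subject) "
        "VALUES (?, ?, ?)",
        (message_id, list_name, subject),
    )

def label_message(conn, message_id, interesting):
    # Record the user's verdict so the message can be used for training.
    conn.execute(
        "UPDATE message SET interesting = ? WHERE message_id = ?",
        (1 if interesting else 0, message_id),
    )

def training_set(conn, list_name):
    """Return (subject, interesting) pairs for the labeled messages of one list."""
    rows = conn.execute(
        "SELECT subject, interesting FROM message "
        "WHERE list_name = ? AND interesting IS NOT NULL ORDER BY message_id",
        (list_name,),
    )
    return [(subject, bool(flag)) for subject, flag in rows]
```

Keeping unlabeled messages in the same table (with a NULL label) means every message the user later marks becomes training material without a second import step.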

Details

  • An existing trainable spam classifier, trained once a week per mailing list.
  • A mailing list model, trained on time of day and number of interesting messages left (not for version 0)
  • An existing ML model mixing the different scores, retrained every time a new email is classified (not for version 0)
  • A rule engine, with rules written in Perl
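
The way the version 0 pieces above could fit together can be sketched roughly as follows. This is only an assumption-laden illustration: it is in Python rather than Perl, a small hand-written naive Bayes model stands in for the existing trainable spam classifier, and the names `NaiveBayes`, `classify` and the rule signature are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())

class NaiveBayes:
    """Two-class multinomial naive Bayes with add-one smoothing (a stand-in)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, interesting):
        self.doc_counts[interesting] += 1
        self.word_counts[interesting].update(tokenize(text))

    def score(self, text):
        """Return the estimated probability that the text is interesting."""
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        log_prob = {}
        for label in (True, False):
            # Smoothed class prior, then smoothed per-word likelihoods.
            lp = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                lp += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_prob[label] = lp
        # Turn the two log scores into a probability; clamp to avoid overflow.
        diff = max(-700.0, min(700.0, log_prob[False] - log_prob[True]))
        return 1.0 / (1.0 + math.exp(diff))

def classify(models, rules, list_name, text, threshold=0.5):
    """Rules act as hard overrides; otherwise fall back to the per-list model."""
    for rule in rules:
        verdict = rule(list_name, text)  # True, False, or None for "no opinion"
        if verdict is not None:
            return verdict
    model = models.get(list_name)
    return model is not None and model.score(text) >= threshold
```

A rule here is any function that returns True, False, or None, so a check such as "always show messages that mention a particular person" takes precedence over the learned score. Keeping one model per mailing list mirrors the per-list training described in the list above.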

Possible software to use

References

  • Text Classification from Labeled and Unlabeled Documents using EM, by Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell. Machine Learning, 1–34 (1999).
  • Experience with Rule Induction and k-Nearest Neighbor Methods for Interface Agents that Learn, by Terry R. Payne, Peter Edwards and Claire L. Green. IEEE Transactions on Knowledge and Data Engineering, vol. 9, no. 2, March-April 1997.