Keeping up with multiple high-traffic mailing lists can be a fool's errand. Machine learning and automatic text classification have for many years promised a better solution to this problem. This project seeks to put that promise to the test by building a custom model of the emails of interest to a particular user.
In the long run, this can be turned into a multi-user website incorporating not only a trained model, but also hard constraints (show all messages that mention a particular person or project), plus thread-pattern-based filtering (see Joey's famous blog post referenced below).
In the future, this can be part of a personalized information dashboard incorporating RSS feeds, Twitter feeds and user activities via http://zeitgeist-project.com/ (and others). While the focus here is 100% on mailing lists, I do so want to live in such a future!
See a demo and consider joining the SourceForge project.
Version 0
This is a first version to get things moving and start collecting training material.
- Single user
- Backend code in Perl
- Using re-purposed spam detection technology.
- Gmane for showing the message (Gmane doesn't track all the mailing lists I follow)
- SQLite3 backend
- AJAX front-end written in Scala and NextApp Echo3
Details
- An existing trainable spam classifier, trained once a week per mailing list.
- A mailing list model, trained on the time of day and the number of interesting messages left (not for version 0)
- An existing ML method mixing the different scores, retrained every time a new email is classified (not for version 0)
- A rule engine, with rules written in Perl
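As a rough illustration of how the first and last pieces could fit together, here is a minimal sketch in Python (the planned backend is Perl; Python is used only for brevity). Everything in it is an assumption made for illustration: the `InterestClassifier` class is a plain multinomial naive Bayes standing in for whichever existing spam classifier ends up being used, the message dictionaries with `list`, `subject` and `body` keys are a hypothetical representation, and `mentions` is a hypothetical rule constructor for the "show all messages that mention a particular person or project" kind of hard constraint.

```python
import math
import re
from collections import Counter, defaultdict

TOKEN = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN.findall(text.lower())


class InterestClassifier:
    """Multinomial naive Bayes with add-one smoothing.

    Hypothetical stand-in for the re-purposed spam classifier:
    'interesting' plays the role of ham and 'boring' the role of spam.
    """

    LABELS = ("interesting", "boring")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def score(self, text):
        """Return the estimated P(interesting | text), between 0 and 1."""
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(text)
        log_post = {}
        for label in self.LABELS:
            counts = self.word_counts[label]
            # Smoothed class prior, then smoothed per-token likelihoods.
            logp = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            denom = sum(counts.values()) + len(vocab) + 1
            for tok in tokens:
                logp += math.log((counts[tok] + 1) / denom)
            log_post[label] = logp
        # Normalise in a numerically stable way.
        top = max(log_post.values())
        weights = {k: math.exp(v - top) for k, v in log_post.items()}
        return weights["interesting"] / sum(weights.values())


def mentions(term):
    """Build a hard-constraint rule: the message mentions `term`."""
    needle = term.lower()
    return lambda msg: needle in (msg["subject"] + " " + msg["body"]).lower()


# One classifier per mailing list, created on first use.
classifiers = defaultdict(InterestClassifier)


def is_interesting(msg, rules, threshold=0.5):
    """Rules win outright; otherwise fall back to the per-list classifier."""
    if any(rule(msg) for rule in rules):
        return True
    text = msg["subject"] + " " + msg["body"]
    return classifiers[msg["list"]].score(text) >= threshold
```

In this sketch a weekly retraining job would simply rebuild each list's `InterestClassifier` from the labels collected so far, and an untrained classifier returns a neutral score of 0.5. The mailing list model and the score mixer marked "not for version 0" are deliberately left out.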
Possible software to use
- Lurker: http://lurker.sourceforge.net
  - Try to collaborate with the author? http://www.dvs.tu-darmstadt.de/staff/terpstra/
  - Lurker in action: https://lists.exim.org/lurker/message/20110911.031744.c6d23255.en.html
- libbow: http://www.cs.cmu.edu/~mccallum/bow/
  - Try to get it back into Debian: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=525229
- CRF++: http://crfpp.sourceforge.net/
- dbacl, a digramic Bayesian classifier: http://dbacl.sourceforge.net/
  - Also used as a chess player! http://dbacl.sourceforge.net/spam_chess-12.html
- JMAP software list: http://jmap.io/software.html
- Apache James: https://james.apache.org/
References
- Text Classification from Labeled and Unlabeled Documents using EM, by Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell. Machine Learning, 1–34 (1999).
- Experience with Rule Induction and k-Nearest Neighbor Methods for Interface Agents that Learn, by Terry R. Payne, Peter Edwards and Claire L. Green. IEEE Transactions on Knowledge and Data Engineering, vol. 9, no. 2, March-April 1997.
- Joey's thread patterns: http://joey.kitenet.net/blog/entry/thread_patterns/