Keeping up with multiple high-traffic mailing lists can be a fool's errand. Machine learning and automatic text classification have for many years promised a better solution to this problem. This project seeks to put that promise to the test by building a custom model of the emails of interest to a particular user.
In the long run, this can be turned into a multi-user website incorporating not only a trained model, but also hard constraints (show all messages that mention a particular person or project), plus thread-pattern-based filtering (see Joey's famous blog post referenced below).
In the future, this can be part of a personalized information dashboard incorporating RSS feeds, Twitter feeds and user activities via http://zeitgeist-project.com/ (and others). While the focus here is 100% on mailing lists, I do so want to live in such a future!
This is a first version to get things moving and start collecting training material.
- Single user
- Backend code in perl
- Using re-purposed spam detection technology.
- Gmane for showing the message (Gmane doesn't track all the mailing lists I follow)
- SQLite3 backend
- AJAX front-end written in Scala and NextApp Echo3
- An existing trainable spam classifier, trained once a week per mailing list.
- A mailing list model, trained on time of day and number of interesting messages left (not for version 0)
- An existing ML package mixing the different scores, retrained every time a new email is classified (not for version 0)
- A rule engine, with rules written in Perl
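As a rough illustration of how hard-constraint rules could sit on top of a classifier score, here is a minimal sketch. It is written in Python purely for readability (the plan above is for rules written in Perl), and every name in it (`Message`, `Rule`, `mentions`, `is_interesting`, the 0.5 threshold) is hypothetical rather than part of the project. The assumption is that a matching rule always forces a message to be shown, and the classifier score only decides when no rule fires.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Message:
    sender: str
    subject: str
    body: str


@dataclass
class Rule:
    name: str
    matches: Callable[[Message], bool]


def mentions(term: str) -> Callable[[Message], bool]:
    """Build a predicate that is true when `term` appears as a whole word
    (case-insensitive) in the subject or the body."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return lambda msg: bool(pattern.search(msg.subject) or pattern.search(msg.body))


def is_interesting(
    msg: Message,
    rules: List[Rule],
    classifier_score: float,
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the project
) -> Tuple[bool, Optional[str]]:
    """Return (show_message, name_of_rule_that_fired_or_None)."""
    # Hard constraints win: any matching rule forces the message to be shown.
    for rule in rules:
        if rule.matches(msg):
            return True, rule.name
    # Otherwise fall back to the score from the trained classifier.
    return classifier_score >= threshold, None


rules = [Rule("mentions libbow", mentions("libbow"))]
msg = Message("someone@example.org", "Packaging libbow again", "Any news on the bug?")
print(is_interesting(msg, rules, classifier_score=0.1))
```

Returning the name of the rule that fired makes it possible to tell the user why a low-scoring message was shown anyway.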
- Lurker: http://lurker.sourceforge.net
- Try to collaborate with the author? http://www.dvs.tu-darmstadt.de/staff/terpstra/
- Lurker in action: https://lists.exim.org/lurker/message/20110911.031744.c6d23255.en.html
- libbow: http://www.cs.cmu.edu/~mccallum/bow/
- Try to get it back into Debian: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=525229
- CRF++: http://crfpp.sourceforge.net/
- dbacl, a digramic Bayesian classifier: http://dbacl.sourceforge.net/
- Also used as a chess player! http://dbacl.sourceforge.net/spam_chess-12.html
- Text Classification from Labeled and Unlabeled Documents using EM, by Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell. Machine Learning, 1–34 (1999).
- Experience with Rule Induction and k-Nearest Neighbor Methods for Interface Agents that Learn, by Terry R. Payne, Peter Edwards, and Claire L. Green. IEEE Transactions on Knowledge and Data Engineering, vol. 9, no. 2, March-April 1997.
- Joey's thread patterns: http://joey.kitenet.net/blog/entry/thread_patterns/