@InProceedings{ Duboue:12,
  title = {Extractive email thread summarization: Can we do better than He Said She Said?},
  author = "Pablo A. Duboue",
  location = "Starved Rock, IL",
  booktitle = "7th International Conference on Natural Language Generation",
  month = "June",
  year = "2012",
  url = "http://duboue.net/pablo/papers/INLG2012duboue.pdf"
}

He Said She Said Annotation


This paper marks the start of my work as an independent scientist. It is also a short paper, a type of communication I have preferred for this journey, as it allows me to convey ideas precisely and succinctly. Bringing ideas from FLOSS into research, I sought out research problems that had a direct impact on my life, in this case email summarization (a topic on which I led a town-hall meeting at the Debian Conference in 2012). While working on a Smart Mailing List Reader, I found the Kernel Traffic newsletter, which contained hand-written summaries of the Linux Kernel mailing list.

From there I extracted all the verbs its author used to introduce quoted speech. Interestingly, he used many verbs beyond "he said / she said": 344 in total. I grouped them by hand into 39 categories, within each of which the verbs could be used interchangeably. I also volunteered my opinion on the type of semantic processing needed to enable some categories (e.g., if the system wants to say "joked", it would need sentiment analysis).
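As a rough illustration of the extraction and grouping steps described above (not the code used for the paper), a minimal Python sketch might look like the following. The sample text, the regular expression, and the category names are all assumptions made up for this example; the actual 344 verbs and 39 categories were compiled by hand.

```python
import re
from collections import Counter

# Hypothetical sample text in the style of a hand-written thread summary.
SAMPLE = (
    'Alice Smith asked, "Does this patch build on ARM?" '
    'Bob Jones replied, "It does." '
    'Alice Smith joked, "Then ship it." '
    'Carol Wu said, "Please wait for review."'
)

# Hypothetical categories for illustration only; the paper's 39
# hand-built categories are not reproduced here.
CATEGORIES = {
    "asked": "question",
    "replied": "response",
    "said": "neutral",
    "joked": "humor",
}

# Assumed surface pattern: a capitalized name, one lowercase word (the
# candidate verb), an optional comma, and then an opening double quote.
PATTERN = re.compile(r'\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\s+([a-z]+),?\s+"')


def extract_verbs(text):
    """Count the words that directly introduce quoted speech."""
    return Counter(m.group(1) for m in PATTERN.finditer(text))


def group_by_category(verbs, categories):
    """Group verbs under their category; unknown verbs go to 'uncategorized'."""
    grouped = {}
    for verb in verbs:
        grouped.setdefault(categories.get(verb, "uncategorized"), []).append(verb)
    return grouped


if __name__ == "__main__":
    verbs = extract_verbs(SAMPLE)
    print(verbs)
    print(group_by_category(verbs, CATEGORIES))
```

A pattern this simple would miss many real constructions (verbs placed after the quote, multi-word verbs, indirect speech), which is one reason manual review of the extracted list remains necessary.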

The paper was well received and, to my surprise, it has gathered a modest number of citations over the years.

The data used in the paper was originally released under an open source license by the author of the Kernel Traffic newsletter. It seems to have since disappeared from the web, and I have received requests for it. I am hosting it for download on my website.

A poster presentation for the paper is also available.