IE4OpenData

Information Extraction for Open Data is a project seeking to empower citizens by automatically sifting through large amounts of text.

This project sought to combine rule-based techniques such as Apache UIMA Ruta with statistical IE methods such as conditional random fields (CRFs). For this project, I put together a large IE pipeline, Octroy, and taught two courses studying it in detail. Even though I taught its internals to close to 150 people, no collaborator ever materialized.

Besides Octroy, the project also sparked Vozyvoto, started at Hackatong2016. The objective of Vozyvoto ("voice and vote") was to study group participation, both in terms of speaking and of being addressed by other speakers, in the proceedings of government assemblies. That data was key to my 2019 paper evaluating the impact of dialect on BERT embeddings.

Moreover, the Open Data movement seems to have moved away from requesting text dumps as part of "open data". That makes sense: without a project such as IE4OpenData, text data is too opaque to work with.

A wrap-up of the project can be seen in this 2018 talk at the Data Science for Good Vancouver Meetup.