This idea is part of the A Dollar Worth of Ideas series: potential open source, research, or data science projects or contributions for people to pursue. I would be interested in mentoring some of them; just contact me for details.


This is a very straightforward idea; most probably there are already similar solutions floating around.

Take a collection of research papers, such as the Association for Computational Linguistics Anthology Reference Corpus (ACL ARC), and extract two blobs of text:

  • Problem being addressed.
  • Dataset used.


(Extracting these is a traditional information extraction task. I'm fond of anchor-based solutions using rules, but with suitable annotations there are plenty of ML approaches.)
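As a rough illustration of the anchor-based route, the sketch below scans a paper's plain text for a handful of trigger phrases and keeps a window of text around each match. The anchor phrases, the `extract_spans` helper, and the `acl_paper.txt` file are all assumptions for the example, not a tested rule set.

```python
import re

# A few illustrative anchor phrases; a real rule set would be far richer.
PROBLEM_ANCHORS = [
    r"we address the problem of",
    r"this paper focuses on",
    r"we tackle the task of",
]
DATASET_ANCHORS = [
    r"we use the .{1,60}?(corpus|dataset|treebank)",
    r"experiments (are conducted |were run )?on the",
]

def extract_spans(text, anchors, window=200):
    """Return snippets of text surrounding each anchor match."""
    spans = []
    for pattern in anchors:
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append(text[max(0, match.start() - 50):match.end() + window])
    return spans

paper_text = open("acl_paper.txt").read()   # hypothetical plain-text version of one paper
problem_blobs = extract_spans(paper_text, PROBLEM_ANCHORS)
dataset_blobs = extract_spans(paper_text, DATASET_ANCHORS)
```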

Then train a system that, given a problem statement, suggests datasets that could be used to address it. The datasets do not need to be publicly available; sometimes just finding out what data has been collected and annotated in the past is valuable in itself.

A memory-based (i.e., k-nearest neighbors) solution can have the added value of linking back to the original research paper for further inspiration.
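A minimal sketch of that memory-based variant, assuming the extraction step above has already produced (problem, dataset, paper) records: embed the problem statements with TF-IDF and retrieve nearest neighbours, returning the associated datasets and paper identifiers. The two records and their paper ids are illustrative placeholders only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy records standing in for (problem, dataset, paper) triples extracted from the corpus.
records = [
    {"problem": "cross-lingual named entity recognition for low-resource languages",
     "dataset": "WikiAnn", "paper": "paper-0001"},   # placeholder paper id
    {"problem": "abstractive summarization of news articles",
     "dataset": "CNN/DailyMail", "paper": "paper-0002"},
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([r["problem"] for r in records])
index = NearestNeighbors(metric="cosine").fit(matrix)

def suggest_datasets(problem_statement, k=2):
    """Return the k most similar past problems, with their datasets and source papers."""
    query = vectorizer.transform([problem_statement])
    _, neighbor_ids = index.kneighbors(query, n_neighbors=min(k, len(records)))
    return [records[i] for i in neighbor_ids[0]]

print(suggest_datasets("named entity recognition for under-resourced languages", k=1))
```

Because each suggestion is a stored record, the paper field comes back for free, which is exactly the link-back-for-inspiration property mentioned above.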

If enough data is available, however, a transformer-based solution might even suggest new and original dataset ideas for consideration.
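A generation sketch under that assumption, using the Hugging Face transformers API and a hypothetical checkpoint name (my-problem2dataset-t5) standing in for a seq2seq model fine-tuned on problem-statement-to-dataset-description pairs:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# "my-problem2dataset-t5" is a placeholder for your own fine-tuned checkpoint,
# not a published model.
tokenizer = AutoTokenizer.from_pretrained("my-problem2dataset-t5")
model = AutoModelForSeq2SeqLM.from_pretrained("my-problem2dataset-t5")

problem = "detecting sarcasm in code-switched social media posts"
inputs = tokenizer("suggest dataset: " + problem, return_tensors="pt")

# Sample a few candidate dataset descriptions rather than a single greedy answer.
outputs = model.generate(**inputs, max_new_tokens=64,
                         do_sample=True, top_p=0.9, num_return_sequences=3)
for candidate in outputs:
    print(tokenizer.decode(candidate, skip_special_tokens=True))
```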