This idea is part of the A Dollar Worth of Ideas series, which collects potential open source, research, or data science projects and contributions for people to pursue. I would be interested in mentoring some of them; just contact me for details.


Pretrained models are usually the best bet for addressing new problems with deep learning.

For practitioners using non-deep-learning methods, such as k-Means clustering, it can be useful to have a pretrained distance function between instances, bringing the power of pretrained models to otherwise simple algorithms.
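As a minimal sketch of the idea, the snippet below clusters a few text reviews with k-Means on top of embeddings from a pretrained encoder; the model name, toy data, and number of clusters are illustrative assumptions, not a prescribed setup.

```python
# Sketch: k-Means over pretrained embeddings, so the "distance" between
# instances comes from a pretrained model rather than raw features.
# Assumes sentence-transformers and scikit-learn are installed.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

reviews = [
    "The battery lasts all day, very happy with it.",
    "Stopped working after a week, terrible quality.",
    "Shipping was fast and the packaging was neat.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any pretrained encoder works
embeddings = model.encode(reviews, normalize_embeddings=True)

# With normalized embeddings, Euclidean distance behaves like cosine distance,
# so vanilla k-Means effectively uses a pretrained similarity.
kmeans = KMeans(n_clusters=2, random_state=0).fit(embeddings)
print(kmeans.labels_)
```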

For common types of data that practitioners cluster (e.g., text reviews, biographical data about people, places), it might be possible to gather weak distances (or devise a self-training procedure) and train a Siamese-network-style embedding, as sketched below. See this Google Colab Notebook for an example in the curriculum alignment domain.
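The following is a hedged sketch of that Siamese-style training loop using a contrastive loss over weakly labeled pairs. The encoder architecture, feature dimensionality, margin, and random "weak labels" are all placeholder assumptions for illustration, not the notebook's actual setup.

```python
# Sketch: learn an embedding from weak pairwise similarity labels, then use it
# as a pretrained distance for clustering. Assumes PyTorch is installed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    """Maps a raw feature vector to a normalized low-dimensional embedding."""
    def __init__(self, in_dim: int, emb_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, emb_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def contrastive_loss(z1, z2, similar, margin: float = 0.5):
    """Pull weakly-similar pairs together, push dissimilar ones past a margin."""
    d = (z1 - z2).pow(2).sum(dim=-1).sqrt()
    return (similar * d.pow(2) +
            (1 - similar) * F.relu(margin - d).pow(2)).mean()

# Toy training loop over weakly labeled pairs (x1, x2, similar in {0, 1}).
encoder = SiameseEncoder(in_dim=300)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

x1 = torch.randn(64, 300)                     # stand-in instance features
x2 = torch.randn(64, 300)
similar = torch.randint(0, 2, (64,)).float()  # weak similarity labels

for _ in range(100):
    loss = contrastive_loss(encoder(x1), encoder(x2), similar)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained encoder's output space defines the learned distance and can be
# fed directly to k-Means or any other non-deep-learning method.
```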

The pretrained models can then be shared with the community.

Related work: