Getting the Word Out for Word2Vec

This idea is part of the A Dollar Worth of Ideas series, with potential open source, research or data science projects or contributions for people to pursue. I would be interested in mentoring some of them. Just contact me for details.

The explosive growth of NLP in recent years has left little time to put its innovations to full use.

Almost a decade ago, Tomas Mikolov and other researchers at Google devised Word2Vec, a program that computes embeddings: mappings from words to vectors in n dimensions, such that the semantic distances between words are captured by ordinary distances in n-dimensional space.
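The core idea can be sketched with a few lines of Python. This is not Word2Vec itself, just an illustration: each word is a vector, and closeness of meaning is measured by cosine similarity, the same kind of score Word2Vec's tools report. The 3-dimensional vectors below are made up for the example; real embeddings are learned from text and typically have 100-300 dimensions.

```python
import math

def cosine_similarity(u, v):
    # Dot product of the two vectors, divided by the product of their lengths.
    # Values near 1.0 mean the vectors point in nearly the same direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings: two related words point in similar directions,
# an unrelated word does not.
chore = [0.9, 0.1, 0.2]
drudgery = [0.8, 0.2, 0.3]
banana = [0.1, 0.9, 0.4]

print(cosine_similarity(chore, drudgery))  # high: related meanings
print(cosine_similarity(chore, banana))    # much lower: unrelated words
```

The point is that once words live in a vector space, "semantically close" becomes something you can compute with arithmetic.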

Computing quality embeddings had been researched for decades before Word2Vec; what sets the algorithm apart is that it computes the embeddings without having to undergo a computationally expensive global optimization process.

Along with the algorithm itself, the authors released a highly optimized parallel implementation in portable C, as well as word embeddings computed on several collections of documents.

The field has since moved on to better embeddings computed on larger collections of documents, and Word2Vec is now confined to textbooks. That doesn't mean Word2Vec is no longer useful; it is still used in many applications (for example, for improving search).

This idea relates to my very modest fiction writing. There have been multiple occasions when Word2Vec and its GoogleNews vectors have come in handy while writing.

For example, here is partial output for the query boring chores against the 3.4 GB GoogleNews embeddings:

  • chore 0.710083
  • household_chores 0.642860
  • monotonous 0.627925
  • tedious 0.622424
  • drudgery 0.616152
  • menial_tasks 0.605626
  • drudge 0.587488
  • dull 0.586221
  • busywork 0.586098
  • mundane_tasks 0.581152

The numbers next to the words (or phrases) are similarity scores. I'd argue that "drudgery" is a fantastic word for capturing the original query, and one that clearly enhances the prose. It allows the text to go beyond "he said, she said" language.
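A query like the one above can be sketched in miniature. This is not the original word2vec distance tool, only an illustration of roughly what such a query does: average the vectors of the query words, then rank the rest of the vocabulary by cosine similarity to that average. The tiny 3-dimensional vocabulary below is made up; the real GoogleNews vectors cover millions of words and phrases in 300 dimensions.

```python
import math

# Hypothetical toy embeddings standing in for the GoogleNews vectors.
embeddings = {
    "boring":   [0.7, 0.3, 0.1],
    "chores":   [0.6, 0.1, 0.4],
    "drudgery": [0.7, 0.2, 0.3],
    "tedious":  [0.8, 0.3, 0.2],
    "banana":   [0.1, 0.9, 0.3],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def most_similar(query_words, topn=3):
    # Average the query word vectors, then rank every other word in the
    # vocabulary by cosine similarity to that average.
    dims = len(next(iter(embeddings.values())))
    query = [sum(embeddings[w][i] for w in query_words) / len(query_words)
             for i in range(dims)]
    candidates = [(w, cosine(query, v)) for w, v in embeddings.items()
                  if w not in query_words]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:topn]

for word, score in most_similar(["boring", "chores"]):
    print(f"{word} {score:.6f}")
```

On this toy vocabulary, "drudgery" and "tedious" come out on top and "banana" at the bottom, mirroring the shape of the real query output above.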

This could be put into a website, and I'm sure such websites exist. But a desktop app would make more sense: one that generates embeddings from a writer's own texts, or from a collection the writer has curated for quality and interest. Also, in my experience, querying the GoogleNews embeddings with Word2Vec takes an amount of RAM that might make it unsuitable for a self-sustainable website.

But what is needed most, by far, is simply outreach to the writers' community, through events such as National Novel Writing Month or the Mundial de la Escritura.