
Overfitting the Web

This is a research project exploring the hypothesis that training a language model on the whole web amounts to overfitting: by evaluation time, nearly every text has already been seen during training.

One way to explore this hypothesis is with pseudo-words (e.g., "bananadoor"); see Manning & Schütze (1999), Chapter 7, Section 7.1.2 (pseudowords).
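The pseudoword setup can be sketched in a few lines: merge two unrelated words into one artificial token, keep the original words as gold labels, and test whether a model can recover the intended sense from context alone. This is a minimal illustration, not the project's actual evaluation code; the function name and example sentence are made up here.

```python
import re

def make_pseudoword_corpus(text, word_a, word_b, pseudo=None):
    """Replace every occurrence of two distinct words with a single
    artificial pseudo-word, recording the original words as gold
    labels, so a model must recover the sense from context alone."""
    pseudo = pseudo or word_a + word_b
    pattern = re.compile(r"\b({}|{})\b".format(re.escape(word_a), re.escape(word_b)))
    labels = [m.group(1) for m in pattern.finditer(text)]  # gold senses, in order
    corpus = pattern.sub(pseudo, text)
    return corpus, labels

corpus, labels = make_pseudoword_corpus(
    "I opened the door and ate a banana by the door.", "banana", "door")
# corpus: "I opened the bananadoor and ate a bananadoor by the bananadoor."
# labels: ["door", "banana", "door"]
```

Because pseudo-words are guaranteed never to appear in any web text, they sidestep the "already seen" problem entirely, which is what makes them useful for this hypothesis.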

Another potential way: evaluate the model on very recent content (published after the training cutoff) and compare its performance with performance on historical content.
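The recent-vs-historical comparison boils down to measuring perplexity on two held-out sets under the same model. As a stand-in for a real language model, the toy sketch below uses an add-alpha smoothed unigram model; the corpora, function names, and thresholds are all invented for illustration, and the same protocol applies to any model's per-token log-likelihood.

```python
import math
from collections import Counter

def avg_logprob(train_tokens, eval_tokens, alpha=1.0):
    """Per-token log-probability of eval_tokens under an add-alpha
    smoothed unigram model fit on train_tokens. A toy stand-in for
    a real LM's log-likelihood."""
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(eval_tokens)  # toy: vocab depends on eval set
    total = sum(counts.values()) + alpha * len(vocab)
    lp = sum(math.log((counts[t] + alpha) / total) for t in eval_tokens)
    return lp / len(eval_tokens)

def perplexity(train_tokens, eval_tokens):
    return math.exp(-avg_logprob(train_tokens, eval_tokens))

# Hypothetical corpora: the comparison of interest is
# perplexity on recent text vs. on held-out historical text.
train = "the cat sat on the mat the dog sat on the rug".split()
historical = "the cat sat on the rug".split()
recent = "quantum blockchain vibes".split()
assert perplexity(train, recent) > perplexity(train, historical)
```

If the hypothesis holds, the gap between the two perplexities should be larger than what generalization alone would predict, since the model has effectively memorized the historical distribution.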