This is a research project exploring the hypothesis that training a language model on the whole web is akin to overfitting, since virtually any test text has effectively already been seen during training.
A potential way to explore this hypothesis is by using pseudo-words (e.g., "bananadoor"), which are guaranteed to be absent from the training data. See Manning & Schütze (1999), Chapter 7, section 7.1.2 (pseudowords).
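A minimal sketch of the pseudo-word setup: merge two unrelated real words into one artificial token throughout a test corpus. Since the original word is known at each occurrence, this yields labeled, provably unseen test data for free. The function name and example words here are illustrative, not from the source.

```python
import re

def make_pseudoword_corpus(text, word_a, word_b, pseudo=None):
    """Replace every occurrence of two real words with one merged
    pseudo-word the model cannot have seen in training. The true
    identity (word_a vs. word_b) is recorded per occurrence, giving
    labeled evaluation data (the Manning & Schutze pseudoword setup)."""
    pseudo = pseudo or word_a + word_b  # e.g. "banana" + "door" -> "bananadoor"
    labels = []

    def repl(match):
        labels.append(match.group(0).lower())  # remember the original word
        return pseudo

    pattern = re.compile(
        r"\b(%s|%s)\b" % (re.escape(word_a), re.escape(word_b)),
        re.IGNORECASE,
    )
    return pattern.sub(repl, text), labels

corpus, gold = make_pseudoword_corpus(
    "The banana was ripe. The door creaked.", "banana", "door"
)
# corpus: "The bananadoor was ripe. The bananadoor creaked."
# gold:   ["banana", "door"]
```

A model that has merely memorized the web should struggle with "bananadoor" exactly as it would with a genuinely novel word, which is what makes this a probe for the overfitting hypothesis.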
Another potential approach: evaluate on very recent content (published after the model's training cutoff) and compare performance against historical content.
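The recency comparison could be operationalized as a perplexity gap between post-cutoff and historical text. This is only a sketch; `model_logprob` is a hypothetical hook standing in for whatever LM scoring interface is actually used.

```python
import math

def mean_nll(model_logprob, texts):
    """Average per-token negative log-likelihood over a set of texts.
    `model_logprob(text)` is a hypothetical hook returning
    (total_log_likelihood, n_tokens) for the LM under test."""
    total, n = 0.0, 0
    for t in texts:
        log_lik, n_tokens = model_logprob(t)
        total += -log_lik
        n += n_tokens
    return total / n

def recency_gap(model_logprob, recent_texts, historical_texts):
    """Perplexity gap between recent (post-cutoff) and historical text.
    A large positive gap is consistent with the hypothesis: data held
    out in time behaves like genuinely unseen data, while historical
    data benefits from having been (effectively) seen in training."""
    ppl_recent = math.exp(mean_nll(model_logprob, recent_texts))
    ppl_historical = math.exp(mean_nll(model_logprob, historical_texts))
    return ppl_recent - ppl_historical
```

One design caveat worth controlling for: recent text may also differ in topic and style, not just novelty, so the two corpora should be matched on domain as closely as possible.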