Today, Google released the Google Books NGram Viewer, a beautiful frontend to a historical ngram model. They have a separate ngram model for each year and for each corpus (English, American English, British English, Simplified Chinese, etc.).
To some extent, this already existed in the Corpus of Historical American English (COHA), but that’s only American English and it doesn’t seem to produce pretty graphs. However, COHA allows for richer queries.
To use the tool, you input a list of terms (phrases work too) and pick a corpus. It checks the language models and produces a graph like so:
How does it work and how can you use it? The data source is the result of Google’s work in scanning books and applying character recognition. It seems that all of the books in Google Books are included, though that isn’t entirely clear.
They record 5-gram models for each corpus and year. Amazingly, the raw ngram models are available online for free (though not all of the data is online yet).
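To give a feel for what you could do with the raw data, here is a minimal sketch of turning per-year counts into the kind of relative frequency the graphs plot. I'm assuming tab-separated lines of the form ngram, year, match count, followed by other columns, and a separate table of total counts per year; check the download page for the real layout. The function names and the sample numbers are made up for illustration.

```python
from collections import defaultdict

def yearly_counts(lines):
    """Sum match counts per (ngram, year).

    Assumed layout (not verified against the real files):
    ngram <TAB> year <TAB> match_count <TAB> ...other columns
    """
    counts = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        ngram, year, match_count = fields[0], int(fields[1]), int(fields[2])
        counts[(ngram, year)] += match_count
    return counts

def relative_frequency(counts, totals, ngram):
    """Share of all tokens in each year that are `ngram`.

    `totals` maps year -> total token count for that year (assumed to be
    available separately). Lookup is case-sensitive, like the viewer.
    """
    return {
        year: counts.get((ngram, year), 0) / totals[year]
        for year in sorted(totals)
        if totals[year] > 0
    }

# Made-up sample rows, purely to show the shape of the computation.
sample = [
    "However\t1900\t10\t8\t5",
    "however\t1900\t30\t20\t9",
    "However\t1950\t30\t25\t12",
]
totals = {1900: 1000, 1950: 2000}

counts = yearly_counts(sample)
print(relative_frequency(counts, totals, "However"))
```

Note that "However" and "however" are counted separately here, so a lowercase query sees different numbers than a capitalized one.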
One funny note: the term smoothing in the context of ngrams really makes me think of ngram model smoothing, but here it means smoothing the graph. Each plotted point is the average of the values for that year and the adjacent years, which evens out year-to-year noise.
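The graph smoothing can be sketched as a centered moving average. This is my reading of the description rather than Google's actual code; the function name `smooth`, the `window` parameter, and the handling of the edges (averaging over whatever neighbors exist) are assumptions.

```python
def smooth(values, window):
    """Centered moving average.

    Each point becomes the mean of itself and up to `window` neighbors on
    each side. Near the ends, only the neighbors that exist are averaged.
    """
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - window)
        hi = min(len(values), i + window + 1)
        neighborhood = values[lo:hi]
        smoothed.append(sum(neighborhood) / len(neighborhood))
    return smoothed

# Toy yearly values, not real data.
yearly = [1.0, 5.0, 3.0, 7.0, 4.0]
print(smooth(yearly, 1))
```

With `window` set to 0 the series comes back unchanged, which corresponds to looking at the raw, unsmoothed curve.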
In terms of how you can use it, it’s an interesting way to choose between near-synonyms. In that sense, it’s a little like the popular Google Battle. For example, consider the choice of a contrast discourse marker:
I didn’t expect to see “However” increase in popularity at the expense of “But”. I wonder whether this reflects changes in the language or changes in the distribution of book types.
Note that it’s case-sensitive, so you find different things depending on case. Also, if you intend to use it for anything serious, I suggest taking a look at their pages to understand the tokenization methods.
More than anything, this seems to be an incredible resource for corpus linguistics.