On December 16, 2010, Google launched a new tool called ‘Ngram Viewer’ at https://books.google.com/ngrams. Anyone driving to work and listening to NPR got a quick preview – I’ve been playing with it ever since. This utility is a direct result of Google’s massive book scanning program. Some sources say that 4% of all the world’s books have been scanned so far, while others say 11%. It lets the user search for phrases of up to five words in this database, which spans the years 1500 to 2008. Presenting results as data rather than as coherent sentences avoids a lot of trouble with copyright law.
A paper on this work has been published in the journal Science: http://www.sciencemag.org/content/early/2010/12/15/science.1199644. You need a subscription to the AAAS (American Association for the Advancement of Science) to read it. I’ll give a report on the paper later as a comment to this post.
An ‘n-gram’ is formally defined as a subsequence of n objects from a defined sequence. Google appears to be using the term for the number of adjacent words in a phrase from their word database. So the word ‘lace’ would be a unigram, ‘Alencon lace’ a bigram, etc. You can search for up to five words, and that restriction is probably due to unreasonably long search times as ‘n’ increases. The user can also choose from material organized into 10 different ‘language’ categories or ‘corpora’. For example, a search on the word ‘lace’ in the English corpus gives the following:
Choosing the ‘British English’ corpus, which is the subset of English-language works published in Great Britain, the result is:
Obviously the algorithm isn’t going to distinguish between the textile ‘lace’ and something like ‘lace’ up your shoe. So you have to use this with caution. Case also matters, so searching for ‘Lace’ is different from searching for ‘lace’. And the ngram must appear in 40 different works in order to show up on the graph. I’m not sure if that holds true for the very early books, or for a subset corpus like British English, which may account for some of the differences between these two graphs.
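To make the n-gram idea and the case sensitivity concrete, here is a minimal Python sketch. The `ngrams` helper and the sample sentence are my own inventions for illustration; Google's actual tokenization is not described here and is surely more careful than splitting on whitespace.

```python
def ngrams(text, n):
    """Return every run of n adjacent words in text, in order.

    Assumption: words are whatever str.split() produces; this is a
    simplification, not Google's tokenizer.
    """
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# Hypothetical sample sentence, invented for this example.
sample = "Lace makers sold Alencon lace and Chantilly lace"

unigrams = ngrams(sample, 1)
bigrams = ngrams(sample, 2)

print(bigrams)
# Counting is case sensitive, so 'lace' and 'Lace' are tallied separately.
print(unigrams.count("lace"), unigrams.count("Lace"))
```

Note that this counts ‘lace’ the same way whether it is a textile or a verb, which is exactly the caution raised above.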
The entire database is searched for your ngram, and the number of times that the ngram appears is counted for every year. That value is normalized by the number of books published in that year, and this is the quantity plotted on the vertical axis. I assume this means that if the ngram is used more than once in a single book, all the instances of usage are counted. If there were very few books published in one year, you will get a spike – this can be smoothed by averaging over nearby years with the smoothing parameter. (Actually, I’m starting to rethink this – the count may instead be the number of books in which the term appears, otherwise the percentages could get greater than 1.0.)
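The two readings I am weighing can be compared on a toy example. This is only a sketch of my own guesses, not Google's documented method: the `books` list is invented, the `yearly_ratios` function is hypothetical, and it handles single words only.

```python
from collections import defaultdict

# Hypothetical toy corpus of (publication year, text) pairs, invented here.
books = [
    (1850, "lace and more lace and still more lace"),
    (1850, "a book about ships"),
    (1851, "fine lace"),
    (1851, "a history of wool"),
]


def yearly_ratios(books, term):
    """For each year return (occurrences / books, books containing term / books)."""
    occurrences = defaultdict(int)  # every instance of the term
    containing = defaultdict(int)   # books that use the term at least once
    total = defaultdict(int)        # books published that year
    for year, text in books:
        hits = text.split().count(term)
        total[year] += 1
        occurrences[year] += hits
        if hits:
            containing[year] += 1
    return {
        year: (occurrences[year] / total[year], containing[year] / total[year])
        for year in sorted(total)
    }


ratios = yearly_ratios(books, "lace")
for year, (per_book, share_of_books) in ratios.items():
    print(year, per_book, share_of_books)
```

In this made-up data the first ratio for 1850 is 1.5, because one book uses the word three times, while the second ratio can never exceed 1.0. That is the reason for my second thoughts.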
Here are a few more searches – first on the term ‘dentelle’ in the French corpus (caution, dentelle can also be associated with teeth and jagged mountains):
Several search terms can be plotted on the same graph. Below are searches for “Alencon lace”, “Valenciennes Lace” and “Chantilly lace” in the English corpus using a smoothing factor of 2 (meaning the data is averaged over the two previous and two following years). I couldn’t find anything earlier than the spike around 1745. The English search did not seem to understand the ç in ‘Alençon’, but did pick up ‘Alencon’.
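The smoothing described above amounts to a centered moving average. Here is a minimal sketch in Python; the `smooth` function and the `raw` numbers are made up for illustration, and the way I treat the first and last years (averaging over whatever neighbours exist) is my assumption rather than something I have confirmed.

```python
def smooth(values, window):
    """Average each yearly value with up to `window` years on either side.

    Assumption: near the ends of the series, only the years that
    actually exist are averaged.
    """
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - window)
        hi = min(len(values), i + window + 1)
        neighbours = values[lo:hi]
        smoothed.append(sum(neighbours) / len(neighbours))
    return smoothed


# Hypothetical yearly values with a single-year spike.
raw = [0.0, 0.0, 10.0, 0.0, 0.0, 0.0]
print(smooth(raw, 2))
```

With a smoothing factor of 2, the spike of 10.0 is spread over five years and its peak drops to 2.0, which shows how a lone spike turns into a low, wide bump on the graph.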
What does this all mean? The first appearance of terms is quite valuable, if you understand the definition of the search term at the time it was published. Unfortunately you cannot learn what specific book contained the phrase. I have a feeling that the shapes of the curves are influenced by the normalization process – think about what happens if a term is mentioned many times in just a few books.
The datasets can be downloaded so that users can run their own searches, which will certainly help in understanding the process. I’ll revise this post as I learn more about this fascinating utility.