As many of you know, in recent years Google has been scanning books by the millions. Its Ngram Viewer lets users see how often a word or phrase was used over time in a chosen subset of the collection of scanned books. Subsets include American English (books published in the U.S.), British English, English, and several other languages. Ngram calls the entire collection, and each of the individual collections, a corpus.
We used “Joseph Smith” and “Mormon” to create the Ngram above from the American English corpus. You can quickly see the birth of the movement surrounding The Church of Jesus Christ of Latter-day Saints and how closely the term Mormon was tied to Joseph Smith in the very early years. As the popularity of the terms rose, they began to cycle. The peak around 1889 appears related to several issues then in the news, among them Utah’s quest for statehood, polygamy coming to an end, and several prominent trials involving Mormons or the Church.
The percentage for a single word is the share of all words in the corpus for that year that were your word, counting only words mentioned often enough to be indexed. The percentage for a phrase is the share of all phrases in the corpus for that year with the same number of words as your phrase that were your phrase, again counting only phrases used often enough to be indexed. Note that this process normalizes the number of uses of your term by the total number of words each year, allowing a truer comparison of occurrences of your term across the years.
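To make the normalization concrete, here is a minimal Python sketch of the idea. The function name `relative_frequency` and the counts are made up for illustration; they are not Google's code or real Ngram data.

```python
def relative_frequency(term_counts, total_counts):
    """Return {year: percentage} for one term.

    term_counts:  {year: times the term appeared that year}
    total_counts: {year: total indexed words (or same-length phrases) that year}
    """
    percentages = {}
    for year, count in term_counts.items():
        total = total_counts.get(year, 0)
        if total > 0:  # skip years with no indexed text
            percentages[year] = 100.0 * count / total
    return percentages


# Hypothetical counts, for illustration only.
term_counts = {1880: 50, 1889: 300}
total_counts = {1880: 1_000_000, 1889: 2_000_000}

percentages = relative_frequency(term_counts, total_counts)
```

In this made-up example the raw count grows sixfold between the two years, but the percentage only triples, because twice as many words were printed in the later year. That is the distortion the normalization removes.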
Google processes the books in each corpus in advance, year by year, so that the data for every word and phrase mentioned often enough to be included is quickly available. When you type terms into Ngram, it consults that database for the answers.
The actual indexes, which are also available for download, allow access to how many books mentioned your term each year as well as how many total times your term was mentioned. For less frequently used terms, many of their occurrences may come from just a few books.
While Google has indexed over 15 million books, Ngram allows access to a subset of about five million of those books that represent about four percent of all the books ever published.
Books are an interesting comparison to news media feeds and other cultural sources. The news quickly covers an event and it is often over. Books are more likely to discuss larger ongoing social issues, movements, and ideas (especially religious ideas). Plus some books create a movement in and of themselves (like the Book of Mormon). That movement often comes some time after the book is published, and may then be reflected in the news media.
Ngram allows charting some terms back to 1500. When doing so you need to remember that much of the drive behind developing and deploying the printing press was to publish religious materials. As a result, religious words and phrases often have very high percentages in those early years.
While Google Ngram does a great job of illustrating the rising or falling popularity of certain words or phrases, it does not explain why those words or phrases waxed or waned. Sometimes the meanings of words change as well. We will suggest some causes for major changes in trend lines if they are not already obvious to LDS viewers (such as those we offered above for the 1889 spike in the example chart).
Another feature of Ngram is the ability to smooth your chart by averaging the data for each year with the data for nearby years. For example, with a smoothing of 3 (the default value), each data point is an average of seven values: the value for the specific year, the values for each of the preceding three years, and the values for each of the following three years, summed and divided by seven. The algorithm also has a method for handling each end of the chart (the first few values and the last few values), where it averages fewer data points.
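The smoothing described above amounts to a centered moving average. Here is a minimal Python sketch. The function name `smooth` is ours, and the way the ends are handled (averaging whatever neighbors exist) is an assumption consistent with the description, not Google's published code.

```python
def smooth(values, smoothing=3):
    """Centered moving average over `smoothing` years on each side.

    Near either end of the list the window is cut short, so fewer
    values are averaged (an assumption about the edge handling).
    """
    smoothed = []
    n = len(values)
    for i in range(n):
        lo = max(0, i - smoothing)       # first year in the window
        hi = min(n, i + smoothing + 1)   # one past the last year in the window
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


yearly = [1, 2, 3, 4, 5, 6, 7, 8, 9]
result = smooth(yearly, smoothing=3)
# result[4] averages the seven values 2 through 8, giving 5.0
# result[0] averages only the four values 1 through 4, giving 2.5
```

A smoothing of 0 leaves the raw yearly values unchanged, which is useful when you want to see exactly which year a spike occurred in.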
You need to be cautious to make sure there are sufficient mentions of your term or phrase in the time frame and the corpus you plan to use or your results may be meaningless. You can verify there are enough occurrences of your term by watching how small the percentages are OR by clicking on the links below the chart to see how many books contained the word or phrase in that time increment. If the percentages get extremely small, or the book count gets extremely small, you need to be careful in making generalizations from your data.
For those who wonder where the term Ngram came from, a “gram” is a string of characters uninterrupted by a space. A “gram” may be a word, a group of numbers like 3.141592654, or gibberish like %$H^$(*H#. Punctuation marks at the beginning or end of a gram are broken off as separate tokens. An apostrophe in the middle of a word splits the word into two grams, so “Gary’s” becomes “Gary” and “’s”. “N” is an integer that signifies how many “grams” are in the phrase. Ngram is often written as “n-gram”. Wikipedia describes an n-gram as a subsequence of “n” items from a given sequence.
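The rules above can be sketched in a few lines of Python. This is a simplified illustration only: the names `tokenize` and `ngrams` and the short punctuation list are our own assumptions, and Google's actual tokenizer handles many more cases.

```python
# Punctuation we break off the ends of a gram (an illustrative subset).
PUNCT = set('.,;:!?"()[]')
APOSTROPHES = "'’"  # straight and curly


def tokenize(text):
    """Split text into grams following the simplified rules above."""
    tokens = []
    for chunk in text.split():
        leading, trailing = [], []
        while chunk and chunk[0] in PUNCT:
            leading.append(chunk[0])
            chunk = chunk[1:]
        while chunk and chunk[-1] in PUNCT:
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        tokens.extend(leading)
        if chunk:
            # Split once at an apostrophe in the middle of the word.
            for i, ch in enumerate(chunk):
                if ch in APOSTROPHES and 0 < i < len(chunk) - 1:
                    tokens.extend([chunk[:i], chunk[i:]])
                    break
            else:
                tokens.append(chunk)
        tokens.extend(trailing)
    return tokens


def ngrams(tokens, n):
    """Return every run of n consecutive grams as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


grams = tokenize("Gary's book, 3.141592654.")
# grams is ['Gary', "'s", 'book', ',', '3.141592654', '.']
pairs = ngrams(grams, 2)  # the 2-grams: ('Gary', "'s"), ("'s", 'book'), ...
```

Note that the period inside 3.141592654 stays put because it is in the middle of the gram, while the period that ends the sentence is broken off as its own token.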
The technology behind Google Ngram is described in “Quantitative Analysis of Culture Using Millions of Digitized Books”. Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, William Brockman, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. Science. Vol. 331. Pgs. 176-182. January 14, 2011.
We will be posting more Ngrams in the future. We just wanted to lay some groundwork with this post.