The Google Books Ngram Data (raw data available here) is a pretty amazing resource. Version 2, released in July 2012, contains 1gram through 5gram frequency counts derived from 6% of all books ever published (!).
There's clearly a lot an enterprising young gentleman or gentlewoman could accomplish with a resource such as this at his or her disposal, so in this post I explain how to download a local copy of the Google Books Ngrams data, and how to process this data using PyTables so that it can be queried efficiently.
Getting The Data
While it wouldn't be hard to write code to iterate over the URLs of the raw data files and download them one by one, fortunately someone has already done this work for us. google-ngram-downloader provides a nice command-line tool to download the raw data files. Installing it and setting it running to download the bigram data into a directory inside your current working directory is as easy as
pip install -U google-ngram-downloader
google-ngram-downloader download -n 2
The script ran more or less without a hitch (I think I had to restart it once after it stalled), and after a couple of days spent downloading (YMMV of course) I had a whole bunch of files containing gzip compressed bigram data occupying about 150GB of space on my hard drive.
First Attempt At Querying The Data: zgrep
Each file in the Google Books Ngram Data is a compressed TSV file containing all ngrams beginning with a given two-letter sequence (e.g. all bigrams beginning with the letters "ca"). Each line in a file records a specific ngram, a year, the frequency count of the ngram in that year, and the number of books it appeared in.
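For concreteness, each line is tab-separated and (as far as I can recall) has four fields: the ngram itself, with its words separated by spaces; the year; the match count; and the number of volumes it occurred in. A minimal parsing sketch, using a made-up example line:

# Hypothetical bigram line for illustration; the real files contain millions of these.
line = "collocational analysis\t1991\t13\t8\n"
ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
w1, w2 = ngram.split(" ")
print(w1, w2, int(year), int(match_count), int(volume_count))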
Because the files are gzipped, my first attempt at querying them used zgrep. This got a little fiddly: I wanted to extract lines where the ngram included a word of interest in any position, which meant writing regular expressions containing tab characters, and for some reason it's hard to get tab characters into a regular expression on the command line (in OS X at least). Once I got it working, it became pretty apparent that even searching single files this way was going to be relatively time consuming. The other disadvantage of this approach is that it would have entailed querying each of the many bigram files individually, when what I really wanted was a single database to query.
Second Attempt At Querying The Data: PyTables To The Rescue
PyTables is a pretty sophisticated tool allowing large amounts of hierarchical data to be stored on disk and queried rapidly. It uses the HDF5 library as its storage engine, and focuses on compressing and decompressing fixed-width records extremely quickly in order to work around the bottleneck of disk I/O latency and so increase throughput. I've been interested in the potential of PyTables for a while now, so the Google Books Ngram Data seemed like a great test case to evaluate whether it could deliver on its claims.
The first stage in using PyTables is to specify the storage format for your table. I used the following table definition, which represents the first and second word of each bigram as a 20-character string, and the frequency count associated with that bigram as an unsigned 64-bit integer (breaking out the component words of the bigram, rather than keeping them as a single string, makes it possible to query on them separately)
import tables

# Bigram table definition
class Bigram(tables.IsDescription):
    w1 = tables.StringCol(20)      # first word of the bigram
    w2 = tables.StringCol(20)      # second word of the bigram
    freq = tables.UInt64Col()      # frequency count, summed over years
Obviously I could have added an additional column to the table definition to represent the year of the ngram count, but I decided in the present context to integrate out year by summing over the multiple rows corresponding to different years of the same ngram. So much for diachronics!
Before data can be imported into the table, the table needs to be associated with an HDF5 file, which is accomplished in PyTables as follows
# Initialize the H5 file.
h5filename = "google_bigrams.h5"
filters = tables.Filters(complevel=9, complib="blosc", fletcher32=False)
h5f = tables.open_file(h5filename, mode="w", title="Google Books Bigrams", filters=filters)
bigrams_table = h5f.create_table("/", "bigrams", Bigram, "Bigrams Table")
One thing to note here is that I elected to use the BLOSC compression library to compress the fixed-width rows of my table. While BLOSC may not attain the highest compression ratios, it is extremely quick at compressing and decompressing data, so I used it in the hope of maximizing the throughput of the bigram data in my queries.
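If you want to check that tradeoff for yourself, it's straightforward to write the same kind of synthetic rows under different compressors and compare the elapsed time and resulting file size. Here's a rough harness along those lines (a sketch of my own, not part of the original pipeline; it reuses the Bigram class from above, and the row count and file names are arbitrary)

import os
import time
import tables

def compression_trial(complib, n_rows=1000000):
    """Write n_rows synthetic bigram rows with the given compressor; return (seconds, bytes)."""
    fname = "compression_trial_{}.h5".format(complib)
    filters = tables.Filters(complevel=9, complib=complib)
    start = time.time()
    with tables.open_file(fname, mode="w", filters=filters) as f:
        trial_table = f.create_table("/", "bigrams", Bigram, "Compression Trial")
        row = trial_table.row
        for i in range(n_rows):
            row["w1"], row["w2"], row["freq"] = "word{:d}".format(i % 5000), "next", i
            row.append()
        trial_table.flush()
    return time.time() - start, os.path.getsize(fname)

for lib in ("zlib", "blosc"):
    print(lib, compression_trial(lib))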
Adding the bigram data to the table was accomplished with the following inner loop
# Write the bigrams to the table.
bigram, rows = bigrams_table.row, 0
for ngram in compressed_ngram_yielder(2):
    bigram["w1"], bigram["w2"], bigram["freq"] = ngram
    bigram.append()
    rows += 1
    if rows % 100000 == 0:
        h5f.flush()
        print("{:d} ngrams written...".format(rows))
h5f.close()
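The compressed_ngram_yielder generator used above isn't shown; here's a minimal sketch of what it might look like, on the assumptions that the downloaded *.gz files sit in a single directory (the google_ngrams/2gram default below is just a placeholder path) and that each raw file is sorted by ngram, so that rows for the same bigram in different years are adjacent and can be summed on the fly

import glob
import gzip
import os

def compressed_ngram_yielder(n, data_dir=None):
    """Yield (word_1, ..., word_n, freq) tuples with freq summed over years.

    Assumes the gzipped data files live under data_dir and are sorted by
    ngram, so that rows differing only in year are adjacent. The default
    directory name is a guess; point it wherever the downloader put your files.
    """
    data_dir = data_dir or "google_ngrams/{:d}gram".format(n)
    for path in sorted(glob.glob(os.path.join(data_dir, "*.gz"))):
        current, total = None, 0
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
                words = tuple(ngram.split(" "))
                if len(words) != n:
                    continue  # skip malformed lines
                if words != current:
                    if current is not None:
                        yield current + (total,)
                    current, total = words, 0
                total += int(match_count)
        if current is not None:
            yield current + (total,)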
I elected to index the "w1" and "w2" columns only after importing the data, as I wanted insertions to occur as quickly as possible (the word on the street is that inserting into an indexed table can slow things down appreciably). Even so, I think it took a couple of nights for my code to walk the gzipped files and insert their contents into the HDF5 file, though whether the majority of this time was spent decompressing and walking the data stream or inserting the table rows, I don't know.
Indexing the word columns after all the rows had been inserted into the table was accomplished with the following commands
h5f.root.bigrams.cols.w1.create_index()
h5f.root.bigrams.cols.w2.create_index()
h5f.flush()
I didn't keep a close eye on this process, but I believe indexing the columns took around 1-2 hours, and added about 2GB to the size of the 9.5GB compressed HDF5 data file. This was a good tradeoff to make in terms of querying speed though, as we will see shortly.
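As a sanity check that subsequent queries will actually hit these indexes, each column exposes an is_indexed flag, and (if I'm remembering the API correctly) the table has a will_query_use_indexing method that reports which indexes a given condition will use

# Quick check that the word columns are indexed and that a typical query will use them.
print(h5f.root.bigrams.cols.w1.is_indexed, h5f.root.bigrams.cols.w2.is_indexed)
print(h5f.root.bigrams.will_query_use_indexing("w1 == 'foo'"))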
Playtime
At the end of this process we're left with a bigrams table containing 667.3M distinct bigrams encompassing a total of 4.067T bigram occurrences. This yields an average token frequency for the bigrams of a little over 6000. This is a large sample of language by any standards I am accustomed to.
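For what it's worth, here's a sketch of one way those totals can be computed (not necessarily how I computed them at the time); the chunk size is arbitrary, and as noted below a full pass over the table like this is slow

table = h5f.root.bigrams
n_bigrams = table.nrows  # number of distinct bigrams (cheap: stored as table metadata)
total_tokens, chunk = 0, 10000000
for start in range(0, table.nrows, chunk):
    stop = min(start + chunk, table.nrows)
    total_tokens += int(table.read(start, stop, field="freq").sum())
print(n_bigrams, total_tokens)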
Interrogating all of the bigrams in the table is still relatively time consuming (for example, it took my computer 332s to count the number of bigrams and the total token frequency reported above). However, because we've indexed the bigrams by both the first and the second word, we can use these indexes to query subsets of the data efficiently. For example, we can retrieve all and only the bigrams where the first word is "foo" with something like
for bg in h5f.root.bigrams.where("w1 == 'foo'"):
    print(bg["w2"], bg["freq"])
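Queries like this can also be run through read_where, which returns the matching rows as a NumPy structured array; for example, here is a sketch (with "foo" again standing in for whatever word you care about) that pulls out the ten most frequent right-hand neighbours of a word

import numpy as np

# All bigrams whose first word is "foo", sorted by descending frequency.
rows = h5f.root.bigrams.read_where("w1 == 'foo'")
top = np.sort(rows, order="freq")[::-1][:10]
for r in top:
    print(r["w2"].decode("utf-8"), int(r["freq"]))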
Even without indexing the word columns, querying with PyTables is orders of magnitude faster than the zgrep approach. An average query without indexing takes around a minute to complete, which is probably less than it took zgrep to search a single data file, let alone the whole data set. Indexing improves things even further: a query on the set of bigrams constrained in this manner finishes in around one second on my machine, which is pretty good considering it involves interrogating bigram counts drawn from 6% of all the books ever published.
Overall I am really impressed with the performance, and ease-of-use, of PyTables. I have no doubt that this level of performance far outstrips what would be attainable with a (semi-)equivalent DB such as SQLite. Obviously we're giving up ACID-compliance and other niceties in using PyTables, but given that I have a frequent need to query large data sets in this kind of fashion, I'll surely be using PyTables more in the future.