Andrei Dubovik
Blog

A Hash Table Based Data Matching Exercise, Benchmarked

The village of Den Haag
Sept. 12, 2020

Recently I was asked to help with a data matching exercise. For that particular task, a hash table based algorithm seemed most appropriate. The code base was in R and while R lacks native hash tables, environments can be used as a substitute (at least, when keys are strings). When I finished coding the matching algorithm in R, I had the feeling it was running too slow. To be perfectly honest, I have the same feeling about most software nowadays. It is an age related mental disorder: I grew up watching hardware getting exponentially better, while software performance seemingly stagnated. In this case, however, it turned out I was right, R was indeed the slowest in its class. Today I want to discuss this data matching exercise and benchmark the corresponding algorithm with popular and less popular interactive programming languages, namely with R, Python, and Lisp. For comparison, I’m also adding a C++ benchmark.

I’ll do data introduction and the first benchmark in Python. The data is split into two files, documents.txt (1.7GB) and topics.txt (280KB). Each line of documents.txt corresponds to a single document and contains space separated terms present in that document. Analogously for topics.txt. The original data is private, so what I have done is I have replaced each term from the original data with a random 2-gram based on popular English words. In this way, the data structure is preserved and I can share the files.

Let’s load the data and preview it:

from reprlib import Repr

def read_words(path):
    '''Read bags of words from a file'''
    with open(path, 'r') as f:
        return [line.strip().split(' ') for line in f]

def preview(obj):
    '''A wrapper around Repr.repr()'''
    r = Repr()
    r.maxlist = 3
    print(r.repr(obj))

documents = read_words('documents.txt')
topics = read_words('topics.txt')

preview(documents)
# [['soon.hours', 'sincerely.assign', 'genetic.attacks', ...],
#  ['covering.benefited', 'greeted.device', 'matches.loosely', ...],
#  ['related.preserved', 'occurrence.literature', 'desires.hormone', ...],
#  ...]

print(len(documents))
# 2000000

print(sum(len(d) for d in documents)/len(documents))
# 53.0899165

preview(topics)
# [['venous.murmur', 'collecting.truths'],
#  ['influenced.prospect', 'collecting.truths'],
#  ['enable.gastric', 'musicians.disposition'],
#  ...]

print(len(topics))
# 8929

print(sum(len(t) for t in topics)/len(topics))
# 1.9723373278082652

On a side note, who would have thought that making random pairs of popular English words could produce such hilarious results? “venous.murmur,” “collecting.truths.” The next time a website tells me the chosen login is already in use, I would know what to do. Muhaha.

The merging task is as follows. For each topic, find all the documents that contain each of the terms of that topic.

Running ahead, the solution turns out to be sparse. Less than half of the topics match any documents at all. And those topics that do match one or more documents, match only 928 documents on average (out of the total number of 2,000,000 documents). There are a number of different algorithms of the same time complexity that solve this matching task, but they still differ substantially in how fast they are due to the sparsity of the matching outcome. The following algorithm is the fastest that I am aware of.

  1. For each term found in any of the documents, collect the documents that have that term.

  2. For any term in any given topic, we then have a set of corresponding documents. Intersecting those sets for all the terms in a given topic, we obtain the solution for that topic.

There may be faster algorithms out there, but for the purpose of comparing hash table performance on production data, the above algorithm works nicely. (To be precise, it’s a mix of hash tables and set intersections, so I’m benchmarking both.) In Python, this algorithm can be implemented concisely.

from functools import reduce

def match(documents, topics):
    '''For each topic, find documents containing all the words from that topic'''
    word_docs = {}
    for i, d in enumerate(documents):
        for w in d:
            word_docs.setdefault(w, []).append(i)

    rslt = []
    for t in topics:
        it = (word_docs.get(w, []) for w in t)
        ds = reduce(lambda x, y: x.intersection(y), it, set(next(it)))
        rslt.append(ds)

    return rslt

Let’s run it. We can also check out some of the results.

rslt = match(documents, topics)

print(sum(len(d) > 0 for d in rslt))
# 3973

preview(next(sorted(d) for d in rslt if len(d) > 0))
# [924, 1254, 4887, ...]

print(max(len(d) for d in rslt))
# 254315

I have also written mostly the same algorithm in R, Lisp, and C++. According to a quick benchmark run by Eduardo Ariño de la Rubia, environments are the fastest way to implement hash tables in R. So, I have used environments in R. In Lisp I have used built-in hash tables, and in C++ I have used unordered_map from the STL.

With these type of benchmarks my idea is not to try to use exactly the same algorithms across different languages, but rather to give preference to built-in functions and popular libraries. After all, if you write your program only in one language, that’s how you’re most likely going to approach it. This choice often results in a faster code anyways.

So then, set intersection is handled differently across my implementations. In Python, I have used native sets for that, which are hash table based I believe. In Lisp and C++ I have presorted the corresponding vectors, which then allows for trivial set intersection in linear time. In R I have used the built-in intersect(), which is the fastest solution I can find. (E.g. using environments for set intersection is really much slower in R.) What does intersect() do? The documentation is silent on the matter, so I have posted a question on Stackoverflow and have been told to “Use the source, Luke.” Professor, how do I take this integral? Go to the library and consult the integration tables. Anyhoo, consulting the source code reveals that intersect() uses hash tables under the hood, albeit in a convoluted fashion, where a hash table needs to be constructed twice for each function call.

But enough chit-chat for one blog, let’s see the benchmarks. (In the table, language names link to the respective code.)

LanguageMergingTotalCPUMemory (USS)Caveats
C++10.5s18.7s99%7.7GB0
Python62.3s88.8s105%10.1GB0
Lisp62.3s105.7s103%8.2GB2
R292.3s327.7s102%2.8GB2

In all cases, the data was first read from disk and loaded into memory, as a vector of vectors of strings, and then merged. “Merging” gives the time it took to merge the data, “Total“ gives total time.

Caveats, Conclusions, and Surprises

  • Lisp has a built-in intersection function, which at least in SBCL is implemented with quadratic time complexity and so it’s unusable for this matching exercise. I had to implement a set intersection function myself. Secondly, Lisp stores strings as wide strings by default (32 bits per character), and so I had to cast them to byte strings to avoid a huge memory footprint. Hence, 2 caveats for Lisp.

  • R’s built-in input, output functions are rubbish. If you want to read or write a big file before you coffee gets cold, look for a package. (In my code I’ve used data.table for reading in text data.) Also, how do you grow vector x so that the amortized time complexity is linear? While the answer is simple, x[length(x) + 1] <- ..., finding it is not. So, 2 caveats for R.

  • R is slow. C++ is fast but not interactive. Lisp and Python are very similar in performance on this task. Still, programming in Python was a breeze, while Lisp took effort due to the caveats. For me, Python is a clear winner here.

  • Why does R use so little memory? Because there are 106,179,833 strings in total but only 1,013,798 unique strings, and R automatically uses shared memory for duplicated strings. Did I ask for this optimization? No. Is it nice to have it in an interactive language? Maybe. Does it give R any bonus points? Unfortunately, not on this task. Step 1 of the algorithm—hash table construction—can be done directly from disk, without the need to store duplicated strings in memory. The reason I didn’t do it, is because I wanted to benchmark the merging speed, and for that I needed to preload the data into memory.

Coming back to this last point on memory consumption, what if the data needs to be preloaded? How easy is it to implement in Python the same deduplication trick that R uses? Very easy. We change

def read_words(path):
    '''Read bags of words from a file'''
    with open(path, 'r') as f:
        return [line.strip().split(' ') for line in f]

to

def read_words(path):
    '''Read bags of words from a file, deduplicate'''
    dt = {}
    dedup = lambda w: dt.setdefault(w, w)
    with open(path, 'r') as f:
        return [[dedup(w) for w in line.strip().split(' ')] for line in f]

and the memory consumption goes down from 10.1GB to 2.5GB (and it’s a tad faster too).

Need for Speed

I have a meager AMD Ryzen 5 3600, but let’s imagine for a moment that I have an AMD Ryzen Threadripper 3990X. Wouldn’t I want to use all the 64 cores in their full glory for my data matching exercise? Of course I would!

R is long out of competition. Python has GIL, so—also out for this one. Lisp works nicely with threads but having a first hand experience with the Lisp ecosystem, I assume I will need to implement a concurrent hash table from scratch myself, so Lisp is out. C++ it is then, helped by TBB.

In many ways a modern C++ works at the same level of abstraction as, say, Python. If not for the absence of interactivity, I would have chosen C++ over Python more often. Let’s take the hash table construction as an example. In Python, we have

word_docs = {}
for i, d in enumerate(documents):
    for w in d:
        word_docs.setdefault(w, []).append(i)

which translates to

unordered_map<string,vector<size_t>> word_docs;
for(size_t i = 0; i < documents.size(); i++) {
  for(auto& w: documents[i]) {
    word_docs[w].push_back(i);
  }
}

in C++. Not that bad at all. Admittedly, it does get more c like with a thread-safe hash table from TBB and parallel execution, but only marginally so:

using Map = concurrent_hash_map<string,vector<size_t>>;
Map word_docs;
parallel_for(0ul, documents.size(), [&](auto i) {
  for(auto& w: documents[i]) {
    Map::accessor acc;
    word_docs.insert(acc, w);
    acc->second.push_back(i);
  }
});

The complete code can be found here. Benchmarking across different number of threads gives the following results.

ThreadsMergingCPU
112.8s97%
212.8s151%
38.35s189%
65.07s263%
123.66s365%
243.62s365%

(The number of threads is inclusive the control thread.)

My processor has 6 physical cores and 12 virtual cores. I’m not a computer scientist, so I have no idea why utilizing all virtual cores gives better performance than simply utilizing all physical cores, but clearly it does. If I go above my number of virtual cores, there is no benefit anymore, as expected.

So, we went from 292.3s in R down to 3.66s in multi-threaded C++, an almost 80 times speedup. Oh reader, please, don’t let me be misunderstood. I use R regularly as it has the best collection of statistical packages, but that is precisely it. R packages are the only compelling reason to use R, because base R is inferior to alternatives as today’s exercise alludes. Python is often faster and always easier, and C++ is substantially faster, without losing much in the level of abstraction, surprisingly, but of course losing in interactivity.

Postscriptum

So far, this blog has been mostly about programming. However, till now it lacked syntax highlighting for code snippets. There were many packages out there for syntax highlighting, both client-side and server-side. Hence, to minimize choice related stress, I programmed a rudimentary server-side solution myself. Technically it was straightforward, except for one “argh” moment when it came to template types in C++. What was difficult, though, was choosing colours.

Strings are coloured red,
Function names are always blue,
Green is not easy.

# Hash Tables
Lambda Pipes
On Lisp
Motorbike Diaries
Raspberry Pi
1