More Randfacts Testing
Feb 4, 2026 · 7 minute read
randfacts rust python math
I’ve covered this topic before in a previous post, but I’ve recently gotten back into improving randfacts after switching it from poetry to uv. I’m really not sure why, but after improving the unit tests, I had an urge to see how much faster I could make the checkduplicates test.
As a brief historical overview, the checkduplicates test finds any facts that are duplicated in the dataset. This sounds easy at first, but I originally wanted it to work for facts with different phrasings but the same meaning. From this, I stumbled across the Levenshtein Distance algorithm. As of the time I’m writing this, there have been four major iterations of this test, listed below:
- Python: I implemented this first version with the `fuzzywuzzy` module, but this was pretty slow because it's all Python.
- Python/C++: After a few years of using this, I rewrote it to use `rapidfuzz`, which is a much faster C++-based implementation of Levenshtein. This worked decently well for a while, but I again got tired of the slowdown.
- Rust #1: The third time around I had recently learned Rust and wanted to try it out, so I did. However, if I remember correctly it was actually slower than the C++ version at first, probably because there's not much speedup to be had with a recursive algorithm. This is when I discovered Wagner-Fischer, a dynamic programming approach to Levenshtein distance. My initial implementation was already drastically faster, and I go over the particular optimizations that I did in my other blog post.
- Rust #2: This is my fourth and most recent development. On my personal desktop, I've brought the runtime from ~17.10s down to ~165.52ms, which is roughly a 103x speedup, and this ratio seems to be similar on my laptop.
Dice-Sørensen
There were a few main optimizations that I made. First, I completely switched algorithms. I found a formula called the Dice-Sørensen coefficient, which is used for calculating the similarity between two samples. It's defined as follows:

$$DSC = \frac{2|X \cap Y|}{|X| + |Y|}$$

Between two sets, it computes the ratio of twice the size of their intersection to the sum of their sizes. For example, if you have the two sets {i, love, programming} and {programming, is, what, i, love}, their intersection is {i, programming, love} of size three, giving us:

$$DSC = \frac{2 \cdot 3}{3 + 5} = \frac{6}{8} = 0.75$$
Once we establish a similarity threshold that works best (turned out to be 0.70 for me), then we can just use this to compare similarity without having to do recursion or complex matrix edit distance calculations.
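To make the definition concrete, here's a minimal set-based sketch in Rust; the function name and example sets are mine for illustration, not the actual randfacts code:

```rust
use std::collections::HashSet;

/// Dice-Sørensen coefficient over two sets of tokens:
/// 2 * |X ∩ Y| / (|X| + |Y|)
fn dice_sorensen(a: &HashSet<&str>, b: &HashSet<&str>) -> f64 {
    let intersection = a.intersection(b).count();
    (2 * intersection) as f64 / (a.len() + b.len()) as f64
}

fn main() {
    let x: HashSet<&str> = ["i", "love", "programming"].into_iter().collect();
    let y: HashSet<&str> = ["programming", "is", "what", "i", "love"].into_iter().collect();
    let score = dice_sorensen(&x, &y); // 2*3 / (3+5) = 0.75
    println!("{score}");
    // With a threshold of 0.70, this pair would be flagged as a duplicate.
    assert!(score >= 0.70);
}
```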
The actual base algorithm itself ended up being fairly simple, as it is provided a sorted array of u64s (we’ll discuss why/how u64 in a moment), so we’re able to calculate the intersection in one linear pass:
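The original snippet didn't survive here, but a minimal sketch of the merge-style linear pass described above (my reconstruction; the name is illustrative, not the actual randfacts code) could look like:

```rust
/// Dice-Sørensen coefficient for two *sorted*, deduplicated slices of
/// token hashes. Because both inputs are sorted, the intersection size
/// can be counted in a single merge-style pass: O(|a| + |b|).
fn dice_sorensen_sorted(a: &[u64], b: &[u64]) -> f64 {
    let (mut i, mut j, mut common) = (0usize, 0usize, 0usize);
    while i < a.len() && j < b.len() {
        if a[i] == b[j] {
            common += 1;
            i += 1;
            j += 1;
        } else if a[i] < b[j] {
            i += 1;
        } else {
            j += 1;
        }
    }
    (2 * common) as f64 / (a.len() + b.len()) as f64
}

fn main() {
    // The slices share the hashes 1 and 8: 2*2 / (3+5) = 0.5
    let score = dice_sorensen_sorted(&[1, 3, 8], &[1, 2, 5, 8, 9]);
    println!("{score}");
}
```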
Hashing/Tokenizing
I mentioned in the last section that the algorithm is provided a list (set) of u64s. This is another reason why it can run so much faster. Since integer comparisons are so much quicker than full string comparisons, why not just tokenize the string (which we were already doing) and hash each word? This allows us to store a vector of hashes and check whether two numbers are equal (a single constant-time comparison) instead of checking whether two strings are equal (a character-by-character walk). In this same normalization pass, I also added "stop word" removal. This is a term from natural language processing for words such as "i", "was", "a", "or", "at", or anything else that doesn't really contribute "meaning" to a sample. By removing all of these words from the set of tokens, we can construct a set that more closely represents the core "subject" of the fact.
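A minimal sketch of that normalization pass might look like the following; the stop-word list here is a tiny illustrative subset, and `DefaultHasher` is just one possible hasher, not necessarily what randfacts uses:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical stop-word list for illustration; a real implementation
// would use a much larger set from an NLP resource.
const STOP_WORDS: &[&str] = &["i", "was", "a", "an", "or", "at", "the", "is", "of"];

/// Normalize a fact into a sorted, deduplicated vector of token hashes.
fn tokenize(fact: &str) -> Vec<u64> {
    let mut hashes: Vec<u64> = fact
        .to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty() && !STOP_WORDS.contains(w))
        .map(|w| {
            let mut h = DefaultHasher::new();
            w.hash(&mut h);
            h.finish()
        })
        .collect();
    hashes.sort_unstable();
    hashes.dedup();
    hashes
}

fn main() {
    // Stop words removed, case normalized, duplicates collapsed:
    assert_eq!(tokenize("I was at a park"), tokenize("A park!"));
    println!("{:?}", tokenize("I was at a park"));
}
```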
Length Difference Ratio Termination
The other huge optimization that this approach allows is that we can now calculate the mathematical limit at which two facts could not possibly be considered similar once their lengths differ by a large enough amount. This was a little bit complex to do, but I'll try to annotate the math here:

We already know the following formula for calculating the coefficient:

$$DSC = \frac{2|X \cap Y|}{|X| + |Y|}$$

Since this is a ratio against the sum of the sizes of the sets, we should be able to derive at what size of $|Y|$ (given $|X|$) the ratio can no longer surpass the set threshold $t$. We know that the list is sorted from smallest to largest by length, and from that, we know that as we traverse from start to finish (comparing every $X$ against everything after it), $|X| \le |Y|$. This tells us that the maximum possible intersection size between the two sets is $|X|$. From this, we can construct an inequality representing this maximum case:

$$\frac{2|X|}{|X| + |Y|} \ge t$$

Solving for $|Y|$ gives the largest set still worth comparing against:

$$|Y| \le \frac{(2 - t)\,|X|}{t}$$

With $t = 0.70$, any fact with more than about $1.86|X|$ tokens can be skipped outright.
In my final Rust implementation, I was able to use this formula to break off processing once we’ve reached this limit, shown here:
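The exact snippet isn't reproduced here, but a sketch of how that early break could work is below; `find_duplicates`, `THRESHOLD`, and the coefficient helper are illustrative names and my reconstruction, not the actual randfacts code:

```rust
const THRESHOLD: f64 = 0.70;

/// Dice-Sørensen coefficient for two sorted, deduplicated slices of hashes.
fn dice_sorensen_sorted(a: &[u64], b: &[u64]) -> f64 {
    let (mut i, mut j, mut common) = (0usize, 0usize, 0usize);
    while i < a.len() && j < b.len() {
        if a[i] == b[j] {
            common += 1;
            i += 1;
            j += 1;
        } else if a[i] < b[j] {
            i += 1;
        } else {
            j += 1;
        }
    }
    (2 * common) as f64 / (a.len() + b.len()) as f64
}

/// `facts` holds each fact's sorted token hashes, ordered shortest to
/// longest, so the inner loop can stop as soon as a candidate exceeds the
/// maximum length at which the threshold is still reachable.
fn find_duplicates(facts: &[Vec<u64>]) -> Vec<(usize, usize)> {
    let mut pairs = Vec::new();
    for i in 0..facts.len() {
        // |Y| <= |X| * (2 - t) / t is the longest candidate worth checking.
        let max_len = (facts[i].len() as f64 * (2.0 - THRESHOLD) / THRESHOLD) as usize;
        for j in (i + 1)..facts.len() {
            if facts[j].len() > max_len {
                break; // everything after this is even longer
            }
            if dice_sorensen_sorted(&facts[i], &facts[j]) >= THRESHOLD {
                pairs.push((i, j));
            }
        }
    }
    pairs
}

fn main() {
    let facts = vec![vec![1, 2, 3], vec![1, 2, 3], vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10]];
    // Only the first two facts match; the third is skipped by the length bound.
    println!("{:?}", find_duplicates(&facts));
}
```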
Speed Comparison and Conclusion
| | Python | C++/Python | Wagner-Fischer Rust | Dice-Sørensen Rust |
|---|---|---|---|---|
| Runtime | 419.22 s (6m59s) | 68.34 s (1m08s) | 17.1 s | 0.165 s |
| Approximate iterations/second | ~60,000–70,000 | ~400,000 | ~1,550,000 | ~50,000¹ |
| Source Code Permalink | Link | Link | Link | Link |
I found this comparison kind of interesting because the new Rust version is by far the fastest, yet it also has by far the fewest iterations per second. While that may seem contradictory at first, the speed comes from the test having far fewer comparisons to do than any of the other implementations, thanks to the new ability to only compare facts that could possibly be similar. I've actually found this new implementation to have picked up duplicate facts that the Levenshtein and Wagner-Fischer implementations missed, while also having almost no false positives. Overall, this has been a very worthwhile investment, and I had a lot of fun learning about this new algorithm.
¹ This is kind of difficult to measure because of how fast it runs, but manually calculating it lands around this figure, and the quick flash of the progress bar shows something in the same range.