<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Blog on Tabulate</title><link>http://tabulate.tech/blog/</link><description>Recent content in Blog on Tabulate</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Wed, 04 Feb 2026 15:57:04 -0500</lastBuildDate><atom:link href="http://tabulate.tech/blog/index.xml" rel="self" type="application/rss+xml"/><item><title>More Randfacts Testing</title><link>http://tabulate.tech/blog/more_randfacts_testing/</link><pubDate>Wed, 04 Feb 2026 15:57:04 -0500</pubDate><category>randfacts</category><category>rust</category><category>python</category><category>math</category><guid>http://tabulate.tech/blog/more_randfacts_testing/</guid><description>&lt;p&gt;I&amp;rsquo;ve covered this topic before in a &lt;a href="http://tabulate.tech/blog/randfacts_checkduplicates/"&gt;previous post&lt;/a&gt;, but I&amp;rsquo;ve recently gotten back into improving &lt;a href="https://github.com/TabulateJarl8/randfacts"&gt;randfacts&lt;/a&gt; after switching it from poetry to uv. I&amp;rsquo;m really not sure why, but after improving the unit tests, I had an urge to see how much faster I could make the checkduplicates test.&lt;/p&gt;
&lt;p&gt;As a brief historical overview, the checkduplicates test finds any facts that are duplicated in the dataset. This sounds easy at first, but I originally wanted it to work for facts with different phrasings but the same meaning. From this, I stumbled across the &lt;a href="https://en.wikipedia.org/wiki/Levenshtein_distance"&gt;Levenshtein Distance&lt;/a&gt; algorithm. As of the time I&amp;rsquo;m writing this, there have been four major iterations of this test, listed below:&lt;/p&gt;</description><content:encoded><![CDATA[<p>I&rsquo;ve covered this topic before in a <a href="http://tabulate.tech/blog/randfacts_checkduplicates/">previous post</a>, but I&rsquo;ve recently gotten back into improving <a href="https://github.com/TabulateJarl8/randfacts">randfacts</a> after switching it from poetry to uv. I&rsquo;m really not sure why, but after improving the unit tests, I had an urge to see how much faster I could make the checkduplicates test.</p>
<p>As a brief historical overview, the checkduplicates test finds any facts that are duplicated in the dataset. This sounds easy at first, but I originally wanted it to work for facts with different phrasings but the same meaning. From this, I stumbled across the <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein Distance</a> algorithm. As of the time I&rsquo;m writing this, there have been four major iterations of this test, listed below:</p>
<ul>
<li><strong>Python</strong>: I implemented this first version with the <a href="https://github.com/seatgeek/fuzzywuzzy"><code>fuzzywuzzy</code></a> module, but this was pretty slow because it&rsquo;s all Python.</li>
<li><strong>Python/C++</strong>: After a few years of using this, I rewrote it to use <a href="https://github.com/rapidfuzz/RapidFuzz"><code>rapidfuzz</code></a>, which is a much faster C++ based implementation of Levenshtein. This worked decently well for a while, but I again got tired of the slowdown.</li>
<li><strong>Rust #1</strong>: The third time around I had recently learned Rust and wanted to try it out, so I did. However, if I remember correctly, my first attempt was actually slower than the C++ version, probably because there&rsquo;s not much you can speed up when using a recursive algorithm. This is when I discovered <a href="https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm">Wagner-Fischer</a>, which is a dynamic programming approach to Levenshtein distance. My initial Wagner-Fischer implementation was already drastically faster, and I go over the particular optimizations that I made in <a href="http://tabulate.tech/blog/randfacts_checkduplicates/">my other blog post</a>.</li>
<li><strong>Rust #2</strong>: This is my fourth and most recent iteration. On my personal desktop I&rsquo;ve cut the runtime from ~17.10s to ~165.52ms, which is roughly a 103x speedup (about 99% less time), and the ratio seems to be similar on my laptop.</li>
</ul>
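<p>For anyone who hasn&rsquo;t seen Wagner-Fischer before, here is a minimal two-row sketch of the idea. It is not the implementation from randfacts: the name <code>levenshtein</code> and the byte-wise comparison are assumptions made purely for illustration. It only shows how the recursion is replaced by filling in a table row by row:</p>
<pre><code class="language-rs">/// Levenshtein distance via Wagner-Fischer, keeping only two rows of the table.
/// Compares bytes, so it is only exact for ASCII input (an assumption of this sketch).
fn levenshtein(a: &str, b: &str) -> usize {
    let a = a.as_bytes();
    let b = b.as_bytes();

    // prev[j] = distance between the first i bytes of `a` and the first j bytes of `b`
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];

    for (i, &ca) in a.iter().enumerate() {
        // turning i + 1 bytes of `a` into an empty string takes i + 1 deletions
        curr[0] = i + 1;
        for (j, &cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr[j + 1] = substitution.min(deletion).min(insertion);
        }
        // the row we just filled becomes the previous row for the next byte of `a`
        std::mem::swap(&mut prev, &mut curr);
    }

    prev[b.len()]
}</code></pre>
<p>For example, <code>levenshtein("kitten", "sitting")</code> evaluates to <code>3</code>.</p>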
<h2 id="dice-s248rensen">Dice-Sørensen</h2>
<p>There were a few main optimizations that I made. Firstly, I completely switched algorithms. I found a formula called the <a href="https://en.wikipedia.org/wiki/Dice-S%C3%B8rensen_coefficient">Dice-Sørensen coefficient</a> which is used for calculating the similarity between two samples. It&rsquo;s defined as follows:</p>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>D</mi><mi>S</mi><mi>C</mi><mo>=</mo><mfrac><mrow><mn>2</mn><mrow><mo fence="true">∣</mo><mi>X</mi><mo>∩</mo><mi>Y</mi><mo fence="true">∣</mo></mrow></mrow><mrow><mrow><mo fence="true">∣</mo><mi>X</mi><mo fence="true">∣</mo></mrow><mo>+</mo><mrow><mo fence="true">∣</mo><mi>Y</mi><mo fence="true">∣</mo></mrow></mrow></mfrac></mrow><annotation encoding="application/x-tex">DSC=\frac{2\left|X\cap Y\right|}{\left|X\right| + \left|Y\right|}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em;"></span><span class="mord mathnormal" style="margin-right:0.02778em;">D</span><span class="mord mathnormal" style="margin-right:0.07153em;">SC</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:2.363em;vertical-align:-0.936em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.427em;"><span style="top:-2.314em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="minner"><span class="mopen delimcenter" style="top:0em;">∣</span><span class="mord mathnormal" style="margin-right:0.07847em;">X</span><span class="mclose delimcenter" style="top:0em;">∣</span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;">∣</span><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span class="mclose delimcenter" 
style="top:0em;">∣</span></span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">2</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;">∣</span><span class="mord mathnormal" style="margin-right:0.07847em;">X</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">∩</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span class="mclose delimcenter" style="top:0em;">∣</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.936em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span></span><p>Between two sets, it computes the ratio of twice the number of items in their intersection to the sum of their lengths. For example, if you have two sets of <code>{i, love, programming}</code> and <code>{programming, is, what, i, love}</code>, their intersection is <code>{i, programming, love}</code> of length three, giving us:</p>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mfrac><mrow><mn>2</mn><mo>⋅</mo><mn>3</mn></mrow><mrow><mn>3</mn><mo>+</mo><mn>5</mn></mrow></mfrac><mo>=</mo><mn>0.75</mn></mrow><annotation encoding="application/x-tex">\frac{2\cdot 3}{3+5}=0.75</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:2.0908em;vertical-align:-0.7693em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.3214em;"><span style="top:-2.314em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">3</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord">5</span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">2</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord">3</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.7693em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.6444em;"></span><span class="mord">0.75</span></span></span></span></span><p>Once we establish a similarity 
threshold that works best (turned out to be <code>0.70</code> for me), we can just use this coefficient to compare similarity without having to do recursion or complex edit distance matrix calculations.</p>
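<p>As a quick sanity check of the formula, the example above can be reproduced with plain <code>HashSet</code>s. This is a minimal sketch rather than the code used in randfacts, and the function name <code>dice_sorensen</code> is made up for illustration:</p>
<pre><code class="language-rs">use std::collections::HashSet;

/// Dice-Sørensen coefficient of two sets, in the range 0.0..=1.0.
/// Two empty sets are treated as identical (a choice made for this sketch).
fn dice_sorensen(x: &HashSet<&str>, y: &HashSet<&str>) -> f64 {
    if x.is_empty() && y.is_empty() {
        return 1.0;
    }
    // |X intersect Y|
    let shared = x.intersection(y).count();
    (2.0 * shared as f64) / ((x.len() + y.len()) as f64)
}</code></pre>
<p>Calling it with <code>HashSet::from(["i", "love", "programming"])</code> and <code>HashSet::from(["programming", "is", "what", "i", "love"])</code> gives <code>0.75</code>, matching the hand calculation.</p>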
<p>The core algorithm ended up being fairly simple. It is given two sorted arrays of <code>u64</code>s (we&rsquo;ll discuss why/how <code>u64</code> in a moment), so we&rsquo;re able to count the intersection in one linear pass:</p>
<pre><code class="language-rs">#[inline(always)]
fn dice_sorensen_sorted(set1: &[u64], set2: &[u64]) -> f64 {
    let len1 = set1.len();
    let len2 = set2.len();

    let mut intersect = 0;
    let mut i = 0;
    let mut j = 0;

    // calculate the intersection of the two sorted vecs of tokens:
    while i < len1 && j < len2 {
        let x = set1[i];
        let y = set2[j];

        if x == y {
            intersect += 1;
            i += 1;
            j += 1;
        } else if x < y {
            i += 1;
        } else {
            j += 1;
        }
    }

    // DSC = (2|X intersect Y|)/(|X| + |Y|), scaled to a percentage (0-100).
    // Note: two empty inputs give 0.0 / 0.0, which is NaN.
    (2.0 * intersect as f64) / ((len1 + len2) as f64) * 100.0
}</code></pre>
<h2 id="hashingtokenizing">Hashing/Tokenizing</h2>
<p>I mentioned in the last section that the algorithm is provided a list (set) of <code>u64</code>s. This is another reason why it can run so much faster. Since integer comparisons are so much quicker than full string comparisons, why not just tokenize the string (which we were already doing) and hash each word? This allows us to just store a vector of hashes and check whether two numbers are equal (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>O</mi><mo stretchy="false">(</mo><mn>1</mn><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">O(1)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.02778em;">O</span><span class="mopen">(</span><span class="mord">1</span><span class="mclose">)</span></span></span></span>) instead of checking whether two strings are equal (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>O</mi><mo stretchy="false">(</mo><mi>n</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">O(n)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.02778em;">O</span><span class="mopen">(</span><span class="mord mathnormal">n</span><span class="mclose">)</span></span></span></span> in the length of the word). In this same normalization pass, I also added &ldquo;stop word&rdquo; removal. This is a term from natural language processing that describes words such as &ldquo;i&rdquo;, &ldquo;was&rdquo;, &ldquo;a&rdquo;, &ldquo;or&rdquo;, &ldquo;at&rdquo;, and similar words that don&rsquo;t really contribute &ldquo;meaning&rdquo; to a sample. 
By removing all of these words from the set of tokens, we can construct a set that more closely represents the core &ldquo;subject&rdquo; of the fact.</p>
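<p>To make the normalization pass a bit more concrete, here is a rough sketch of what tokenizing, dropping stop words, and hashing could look like. It is not the exact code from randfacts: the <code>STOP_WORDS</code> list, the <code>tokenize</code> name, and the use of the standard library&rsquo;s <code>DefaultHasher</code> are all assumptions for illustration.</p>
<pre><code class="language-rs">use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// A tiny, hypothetical stop word list; a real one would be much longer.
const STOP_WORDS: [&str; 7] = ["i", "was", "a", "or", "at", "the", "is"];

/// Lowercases `fact`, splits it into alphanumeric words, drops stop words,
/// and returns the sorted, deduplicated hashes of the remaining words.
fn tokenize(fact: &str) -> Vec<u64> {
    let lowered = fact.to_lowercase();
    let mut tokens: Vec<u64> = lowered
        .split(|c: char| !c.is_alphanumeric())
        .filter(|word| !word.is_empty() && !STOP_WORDS.contains(word))
        .map(|word| {
            let mut hasher = DefaultHasher::new();
            word.hash(&mut hasher);
            hasher.finish()
        })
        .collect();

    // Sorting lets the intersection be counted in one linear merge pass,
    // and deduplicating makes the vector behave like a set.
    tokens.sort_unstable();
    tokens.dedup();
    tokens
}</code></pre>
<p>With this sketch, <code>tokenize("I was at the Moon")</code> and <code>tokenize("moon!")</code> return the same single-element vector, since everything except &ldquo;moon&rdquo; is either a stop word or punctuation. Two different words could in theory collide on the same <code>u64</code>, but I&rsquo;d expect that to be rare enough to ignore for a test like this.</p>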
<h2 id="length-difference-ratio-termination">Length Difference Ratio Termination</h2>
<p>The other huge optimization that this approach allows is early termination: we can now calculate the point past which two facts could not possibly be considered similar, because their lengths differ by more than some amount <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi></mrow><annotation encoding="application/x-tex">x</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em;"></span><span class="mord mathnormal">x</span></span></span></span>. This was a little bit complex to work out, but I&rsquo;ll try to annotate the math here:</p>
<p>We already know the following formula for calculating the coefficient:</p>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>D</mi><mi>S</mi><mi>C</mi><mo>=</mo><mfrac><mrow><mn>2</mn><mrow><mo fence="true">∣</mo><mi>X</mi><mo>∩</mo><mi>Y</mi><mo fence="true">∣</mo></mrow></mrow><mrow><mrow><mo fence="true">∣</mo><mi>X</mi><mo fence="true">∣</mo></mrow><mo>+</mo><mrow><mo fence="true">∣</mo><mi>Y</mi><mo fence="true">∣</mo></mrow></mrow></mfrac></mrow><annotation encoding="application/x-tex">DSC=\frac{2\left|X\cap Y\right|}{\left|X\right| + \left|Y\right|}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em;"></span><span class="mord mathnormal" style="margin-right:0.02778em;">D</span><span class="mord mathnormal" style="margin-right:0.07153em;">SC</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:2.363em;vertical-align:-0.936em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.427em;"><span style="top:-2.314em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="minner"><span class="mopen delimcenter" style="top:0em;">∣</span><span class="mord mathnormal" style="margin-right:0.07847em;">X</span><span class="mclose delimcenter" style="top:0em;">∣</span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;">∣</span><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span class="mclose delimcenter" 
style="top:0em;">∣</span></span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">2</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;">∣</span><span class="mord mathnormal" style="margin-right:0.07847em;">X</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">∩</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span class="mclose delimcenter" style="top:0em;">∣</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.936em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span></span><p>Since this is a ratio against the sum of the lengths of the sets, we should be able to derive at what length of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>Y</mi></mrow><annotation encoding="application/x-tex">Y</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em;"></span><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span></span></span></span> (given <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="normal">∣</mi><mi>X</mi><mi mathvariant="normal">∣</mi></mrow><annotation encoding="application/x-tex">|X|</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span 
class="mord">∣</span><span class="mord mathnormal" style="margin-right:0.07847em;">X</span><span class="mord">∣</span></span></span></span>) will the ratio not surpass the set threshold. We know that the list is sorted from least to greatest by length, and from that, we know that as we traverse from start to finish (comparing every <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>X</mi></mrow><annotation encoding="application/x-tex">X</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em;"></span><span class="mord mathnormal" style="margin-right:0.07847em;">X</span></span></span></span> against everything after it), <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="normal">∣</mi><mi>X</mi><mi mathvariant="normal">∣</mi><mo>≤</mo><mi mathvariant="normal">∣</mi><mi>Y</mi><mi mathvariant="normal">∣</mi></mrow><annotation encoding="application/x-tex">|X| \leq |Y|</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord">∣</span><span class="mord mathnormal" style="margin-right:0.07847em;">X</span><span class="mord">∣</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">≤</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord">∣</span><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span class="mord">∣</span></span></span></span>. 
This tells us that the maximum possible intersection length between the two sets is <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="normal">∣</mi><mi>X</mi><mi mathvariant="normal">∣</mi></mrow><annotation encoding="application/x-tex">|X|</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord">∣</span><span class="mord mathnormal" style="margin-right:0.07847em;">X</span><span class="mord">∣</span></span></span></span>. From this, we can construct an inequality representing this maximum case:</p>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mtable rowspacing="0.25em" columnalign="right left" columnspacing="0em"><mtr><mtd><mstyle scriptlevel="0" displaystyle="true"><mfrac><mrow><mn>2</mn><mo>⋅</mo><mi mathvariant="normal">∣</mi><mi>X</mi><mi mathvariant="normal">∣</mi></mrow><mrow><mi mathvariant="normal">∣</mi><mi>X</mi><mi mathvariant="normal">∣</mi><mo>+</mo><mi mathvariant="normal">∣</mi><mi>Y</mi><mi mathvariant="normal">∣</mi></mrow></mfrac></mstyle></mtd><mtd><mstyle scriptlevel="0" displaystyle="true"><mrow><mrow></mrow><mo>≥</mo><mtext>threshold</mtext></mrow></mstyle></mtd></mtr><mtr><mtd><mstyle scriptlevel="0" displaystyle="true"><mrow><mn>2</mn><mo>⋅</mo><mi mathvariant="normal">∣</mi><mi>X</mi><mi mathvariant="normal">∣</mi></mrow></mstyle></mtd><mtd><mstyle scriptlevel="0" displaystyle="true"><mrow><mrow></mrow><mo>≥</mo><mtext>threshold</mtext><mo>⋅</mo><mrow><mo fence="true">(</mo><mi mathvariant="normal">∣</mi><mi>X</mi><mi mathvariant="normal">∣</mi><mo>+</mo><mi mathvariant="normal">∣</mi><mi>Y</mi><mi mathvariant="normal">∣</mi><mo fence="true">)</mo></mrow></mrow></mstyle></mtd></mtr><mtr><mtd><mstyle scriptlevel="0" displaystyle="true"><mrow><mn>2</mn><mo>⋅</mo><mi mathvariant="normal">∣</mi><mi>X</mi><mi mathvariant="normal">∣</mi></mrow></mstyle></mtd><mtd><mstyle scriptlevel="0" displaystyle="true"><mrow><mrow></mrow><mo>≥</mo><mtext>threshold</mtext><mo>⋅</mo><mi mathvariant="normal">∣</mi><mi>X</mi><mi mathvariant="normal">∣</mi><mo>+</mo><mtext>threshold</mtext><mo>⋅</mo><mi mathvariant="normal">∣</mi><mi>Y</mi><mi mathvariant="normal">∣</mi></mrow></mstyle></mtd></mtr><mtr><mtd><mstyle scriptlevel="0" displaystyle="true"><mrow><mn>2</mn><mo>⋅</mo><mi mathvariant="normal">∣</mi><mi>X</mi><mi mathvariant="normal">∣</mi><mo>−</mo><mtext>threshold</mtext><mo>⋅</mo><mi mathvariant="normal">∣</mi><mi>X</mi><mi 
mathvariant="normal">∣</mi></mrow></mstyle></mtd><mtd><mstyle scriptlevel="0" displaystyle="true"><mrow><mrow></mrow><mo>≥</mo><mtext>threshold</mtext><mo>⋅</mo><mi mathvariant="normal">∣</mi><mi>Y</mi><mi mathvariant="normal">∣</mi></mrow></mstyle></mtd></mtr><mtr><mtd><mstyle scriptlevel="0" displaystyle="true"><mrow><mi mathvariant="normal">∣</mi><mi>X</mi><mi mathvariant="normal">∣</mi><mrow><mo fence="true">(</mo><mn>2</mn><mo>−</mo><mtext>threshold</mtext><mo fence="true">)</mo></mrow></mrow></mstyle></mtd><mtd><mstyle scriptlevel="0" displaystyle="true"><mrow><mrow></mrow><mo>≥</mo><mtext>threshold</mtext><mo>⋅</mo><mi mathvariant="normal">∣</mi><mi>Y</mi><mi mathvariant="normal">∣</mi></mrow></mstyle></mtd></mtr><mtr><mtd><mstyle scriptlevel="0" displaystyle="true"><mfrac><mrow><mi mathvariant="normal">∣</mi><mi>X</mi><mi mathvariant="normal">∣</mi><mrow><mo fence="true">(</mo><mn>2</mn><mo>−</mo><mtext>threshold</mtext><mo fence="true">)</mo></mrow></mrow><mtext>threshold</mtext></mfrac></mstyle></mtd><mtd><mstyle scriptlevel="0" displaystyle="true"><mrow><mrow></mrow><mo>≥</mo><mi mathvariant="normal">∣</mi><mi>Y</mi><mi mathvariant="normal">∣</mi></mrow></mstyle></mtd></mtr></mtable><annotation encoding="application/x-tex">
\begin{aligned}
\frac{2\cdot |X|}{|X| + |Y|} &amp; \geq \text{threshold} \\
2\cdot |X| &amp; \geq \text{threshold}\cdot\left(|X|+|Y|\right) \\
2\cdot |X| &amp; \geq \text{threshold}\cdot|X| + \text{threshold}\cdot|Y| \\
2\cdot |X| - \text{threshold}\cdot |X| &amp; \geq \text{threshold}\cdot|Y| \\
|X|\left(2 - \text{threshold}\right) &amp; \geq \text{threshold}\cdot|Y| \\
\frac{|X|\left(2 - \text{threshold}\right)}{\text{threshold}} &amp; \geq |Y| \\
\end{aligned}
</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:11.076em;vertical-align:-5.288em;"></span><span class="mord"><span class="mtable"><span class="col-align-r"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:5.788em;"><span style="top:-7.788em;"><span class="pstrut" style="height:3.427em;"></span><span class="mord"><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.427em;"><span style="top:-2.314em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">∣</span><span class="mord mathnormal" style="margin-right:0.07847em;">X</span><span class="mord">∣</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord">∣</span><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span class="mord">∣</span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">2</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord">∣</span><span class="mord mathnormal" style="margin-right:0.07847em;">X</span><span class="mord">∣</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.936em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span><span style="top:-5.712em;"><span class="pstrut" style="height:3.427em;"></span><span class="mord"><span 
class="mord">2</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord">∣</span><span class="mord mathnormal" style="margin-right:0.07847em;">X</span><span class="mord">∣</span></span></span><span style="top:-4.212em;"><span class="pstrut" style="height:3.427em;"></span><span class="mord"><span class="mord">2</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord">∣</span><span class="mord mathnormal" style="margin-right:0.07847em;">X</span><span class="mord">∣</span></span></span><span style="top:-2.712em;"><span class="pstrut" style="height:3.427em;"></span><span class="mord"><span class="mord">2</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord">∣</span><span class="mord mathnormal" style="margin-right:0.07847em;">X</span><span class="mord">∣</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord text"><span class="mord">threshold</span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord">∣</span><span class="mord mathnormal" style="margin-right:0.07847em;">X</span><span class="mord">∣</span></span></span><span style="top:-1.212em;"><span class="pstrut" style="height:3.427em;"></span><span class="mord"><span class="mord">∣</span><span class="mord mathnormal" style="margin-right:0.07847em;">X</span><span class="mord">∣</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;">(</span><span class="mord">2</span><span 
class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord text"><span class="mord">threshold</span></span><span class="mclose delimcenter" style="top:0em;">)</span></span></span></span><span style="top:0.875em;"><span class="pstrut" style="height:3.427em;"></span><span class="mord"><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.427em;"><span style="top:-2.314em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord text"><span class="mord">threshold</span></span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">∣</span><span class="mord mathnormal" style="margin-right:0.07847em;">X</span><span class="mord">∣</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;">(</span><span class="mord">2</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord text"><span class="mord">threshold</span></span><span class="mclose delimcenter" style="top:0em;">)</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.686em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:5.288em;"><span></span></span></span></span></span><span class="col-align-l"><span class="vlist-t vlist-t2"><span 
class="vlist-r"><span class="vlist" style="height:5.788em;"><span style="top:-7.788em;"><span class="pstrut" style="height:3.427em;"></span><span class="mord"><span class="mord"></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">≥</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mord text"><span class="mord">threshold</span></span></span></span><span style="top:-5.712em;"><span class="pstrut" style="height:3.427em;"></span><span class="mord"><span class="mord"></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">≥</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mord text"><span class="mord">threshold</span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;">(</span><span class="mord">∣</span><span class="mord mathnormal" style="margin-right:0.07847em;">X</span><span class="mord">∣</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord">∣</span><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span class="mord">∣</span><span class="mclose delimcenter" style="top:0em;">)</span></span></span></span><span style="top:-4.212em;"><span class="pstrut" style="height:3.427em;"></span><span class="mord"><span class="mord"></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">≥</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mord text"><span class="mord">threshold</span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord">∣</span><span class="mord mathnormal" 
style="margin-right:0.07847em;">X</span><span class="mord">∣</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord text"><span class="mord">threshold</span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord">∣</span><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span class="mord">∣</span></span></span><span style="top:-2.712em;"><span class="pstrut" style="height:3.427em;"></span><span class="mord"><span class="mord"></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">≥</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mord text"><span class="mord">threshold</span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord">∣</span><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span class="mord">∣</span></span></span><span style="top:-1.212em;"><span class="pstrut" style="height:3.427em;"></span><span class="mord"><span class="mord"></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">≥</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mord text"><span class="mord">threshold</span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord">∣</span><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span class="mord">∣</span></span></span><span style="top:0.875em;"><span class="pstrut" style="height:3.427em;"></span><span class="mord"><span class="mord"></span><span class="mspace" style="margin-right:0.2778em;"></span><span 
class="mrel">≥</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mord">∣</span><span class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span class="mord">∣</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:5.288em;"><span></span></span></span></span></span></span></span></span></span></span></span><p>In my final Rust implementation, I was able to use this formula to break off processing once we&rsquo;ve reached this limit, shown here:</p>
<pre><code class="language-rust">// ...
// this would take 70.0 and make it into 0.7
let fractional_ratio = SIMILARITY_THRESHOLD / 100.0;
// use the formula we just calculated
let length_diff_ratio = (2.0 - fractional_ratio) / fractional_ratio;

// Process facts in parallel
facts
    .par_iter()
    .enumerate()
    .progress_with(pb)
    .flat_map(|(i, f1)| {
        let mut matches = vec![];
        // here is the length of X, or |X|
        let len1 = f1.token_hashes.len();

        for f2 in &facts[i + 1..] {
            // here is the length of the fact we're comparing against, or |Y|
            let len2 = f2.token_hashes.len();

            // if lengths are too different to possibly be the same, don't try any of the
            // remaining facts
            if (len2 as f64) > (len1 as f64) * length_diff_ratio {
                break;
            }

            let ratio = dice_sorensen_sorted(&f1.token_hashes, &f2.token_hashes);
            if ratio > SIMILARITY_THRESHOLD {
                matches.push((f1.original.clone(), f2.original.clone(), ratio));
            }
        }

        matches
    })
    .collect()</code></pre>
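<p>As a rough sketch of the same idea outside of Rust, here is a small Python version of the length cutoff. It is only an illustration under a few assumptions: it uses plain sets of lowercase words instead of the sorted token hashes, it runs sequentially, and the names <code>dice_sorensen</code> and <code>find_duplicates</code> as well as the default threshold of 70 are placeholders of mine. The early <code>break</code> only works because the facts are sorted by token count first.</p>

```python
def dice_sorensen(x, y):
    """Dice-Sorensen similarity of two token sets, scaled to 0-100."""
    total = len(x) + len(y)
    if total == 0:
        # assumption: two empty token sets are treated as not similar
        return 0.0
    return 200.0 * len(x & y) / total


def find_duplicates(facts, threshold=70.0):
    """Return (fact1, fact2, score) for every pair scoring above threshold."""
    fractional_ratio = threshold / 100.0
    # |Y| can be at most |X| * (2 - t) / t for the pair to reach the threshold
    length_diff_ratio = (2.0 - fractional_ratio) / fractional_ratio

    # sort by token count so that every later fact is at least as long
    items = sorted(
        ((set(fact.lower().split()), fact) for fact in facts),
        key=lambda item: len(item[0]),
    )

    matches = []
    for i, (tokens1, fact1) in enumerate(items):
        for tokens2, fact2 in items[i + 1:]:
            # every remaining fact is even longer, so none of them can match
            if len(tokens2) > len(tokens1) * length_diff_ratio:
                break
            score = dice_sorensen(tokens1, tokens2)
            if score > threshold:
                matches.append((fact1, fact2, score))
    return matches
```

<p>With a threshold of 70, a five-token fact is only ever compared against facts of up to nine tokens, since 5 &middot; (2 &minus; 0.7) / 0.7 is roughly 9.3.</p>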
<h2 id="speed-comparison-and-conclusion">Speed Comparison and Conclusion</h2>
<table>
  <thead>
      <tr>
          <th></th>
          <th>Python</th>
          <th>C++/Python</th>
          <th>Wagner-Fischer Rust</th>
          <th>Dice-Sørensen Rust</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Runtime</td>
          <td>419.22 seconds (6m59s)</td>
          <td>68.34 seconds (1m08s)</td>
          <td>17.1 seconds</td>
          <td>0.165 seconds</td>
      </tr>
      <tr>
          <td>Approximate Iterations/second</td>
          <td>~60,000-70,000</td>
          <td>~400,000</td>
          <td>~1,550,000</td>
          <td>~50,000<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></td>
      </tr>
      <tr>
          <td>Source Code Permalink</td>
          <td><a href="https://github.com/TabulateJarl8/randfacts/blob/021bd555f1b1931343acc7dfe7e0746af9003afe/tests/checkduplicates.py">Link</a></td>
          <td><a href="https://github.com/TabulateJarl8/randfacts/blob/de5f66ff1eb4545de82c14c62405fd33c7cd07e7/tests/checkduplicates.py">Link</a></td>
          <td><a href="https://github.com/TabulateJarl8/randfacts/blob/4c56325c1c8a529c12cca2eebfdc2a2eac6307d0/tests/checkduplicates/src/main.rs">Link</a></td>
          <td><a href="https://github.com/TabulateJarl8/randfacts/blob/94029d363bb7a3a5e8e15c70a8eec16864f5f7aa/tests/checkduplicates/src/main.rs">Link</a></td>
      </tr>
  </tbody>
</table>
<p>I found this comparison kind of interesting because the new Rust version is so fast, yet it has by far the fewest iterations/second. While that may make it seem slower at first, the speed comes from the test having far fewer comparisons to do than any of the other implementations, because it can now skip facts that couldn&rsquo;t possibly be similar. I&rsquo;ve actually found that this new implementation picks up duplicate facts that the Levenshtein and Wagner-Fischer implementations missed, while also having almost no false positives. Overall, this has been a very worthwhile investment and I had a lot of fun learning about this new algorithm.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>This is kind of difficult to measure because of how fast it runs, but manually calculating it gives around this. The quick flash of the progress bar also shows something around this.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded></item><item><title>Python Bug</title><link>http://tabulate.tech/blog/python_bug/</link><pubDate>Mon, 24 Feb 2025 01:03:22 -0500</pubDate><category>python</category><guid>http://tabulate.tech/blog/python_bug/</guid><description>&lt;p&gt;I was recently working on my &lt;a href="http://tabulate.tech/software/vapor/"&gt;Vapor&lt;/a&gt; project, which is a TUI program written in Python using &lt;a href="https://github.com/Textualize/textual"&gt;Textual&lt;/a&gt;. I was updating the project&amp;rsquo;s typing to utilize newer features which were introduced in Python 3.10, such as using the bitwise OR operator instead of &lt;code&gt;Union&lt;/code&gt;. This involved rewriting things like &lt;code&gt;x: Union[str, int]&lt;/code&gt; as &lt;code&gt;x: str | int&lt;/code&gt;. In this process, I came across the following piece of code:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-py"&gt;yield Container(
 ...,
 DataTable[Union[str, Text]](zebra_stripes=True)
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;DataTable&lt;/code&gt; in Textual accepts a generic type parameter, which in my case is a string or a &lt;code&gt;Text&lt;/code&gt; object. This seems like it&amp;rsquo;d be pretty easy to update, so I rewrote it as this:&lt;/p&gt;</description><content:encoded><![CDATA[<p>I was recently working on my <a href="/software/vapor/">Vapor</a> project, which is a TUI program written in Python using <a href="https://github.com/Textualize/textual">Textual</a>. I was updating the project&rsquo;s typing to utilize newer features which were introduced in Python 3.10, such as using the bitwise OR operator instead of <code>Union</code>. This involved rewriting things like <code>x: Union[str, int]</code> as <code>x: str | int</code>. In this process, I came across the following piece of code:</p>
<pre><code class="language-py">yield Container(
    ...,
    DataTable[Union[str, Text]](zebra_stripes=True)
)</code></pre>
<p>The <code>DataTable</code> in Textual accepts a generic type parameter, which in my case is a string or a <code>Text</code> object. This seems like it&rsquo;d be pretty easy to update, so I rewrote it as this:</p>
<pre><code class="language-py">yield Container(
    ...,
    DataTable[str | Text](zebra_stripes=True)
)</code></pre>
<p>After this, I realized that in order to backport this behavior into Python versions before 3.10, you need to add <code>from __future__ import annotations</code> to the top of each file which uses this newer typing syntax. From my understanding, this sets a compiler flag that stores annotations as strings instead of evaluating them, so static type checkers can still read the types while the interpreter never evaluates them when the program is running. After adding this and running my unit tests in Python 3.9, I realized that the <code>DataTable</code> generic type was raising a <code>TypeError</code>. I looked around for a while, eventually coming to the conclusion that this might be a bug in Python itself. I was then able to produce the following minimal reproducible example:</p>
<pre><code class="language-py">from __future__ import annotations
from typing import Generic, TypeVar

T = TypeVar('T')

class Node(Generic[T]):
    x = None

    def __init__(self, label: T = None) -> None:
        pass

    def __str__(self) -> str:
        return str(self.x)

print(Node[str | int](''))</code></pre>
<p>This example will raise a <code>TypeError</code> in Python 3.9. I thought about fixing this bug, but with Python 3.9 reaching EOL in October, it is only receiving security fixes. While talking about this with some others, the only other possible conclusion that we could come to is that this behavior is intentional. Technically, this <code>Node[str | int]</code> syntax could be valid in 3.9 if you had a metaclass which defined <code>__getitem__</code> and then indexed into a class&rsquo;s attributes with an object that defined <code>__or__</code>. Such an example could be something like this:</p>
<pre><code class="language-py">class Subscriptable(type):
    def __getitem__(self, item):
        return self.__dict__[item]

class Subscript(metaclass=Subscriptable):
    testing = '1'

class BitwiseORString:
    def __init__(self, data):
        self.data = data

    def __or__(self, other):
        if isinstance(other, BitwiseORString):
            return self.data + other.data
        return ''

s1 = BitwiseORString('test')
s2 = BitwiseORString('ing')
print(Subscript[s1 | s2])</code></pre>
<p>In my personal opinion (if I were designing the language), leaving this out seems too inconsistent, especially since this generic syntax works for single types in Python 3.9, just not when they&rsquo;re OR&rsquo;d together. This means that the parser can differentiate between the two; it just seems that this edge case was forgotten. You could maybe argue that it was intentionally left out to avoid breaking code that relied on something like this. However, if you&rsquo;re using <code>from __future__ import annotations</code>, I would guess that you&rsquo;re doing so for compatibility with older versions of Python, so you&rsquo;d want your entire codebase to behave the same way instead of having weird discontinuities like this. Thankfully, the fix is pretty simple: you can just quote the types yourself like so:</p>
<pre><code class="language-py">yield Container(
    ...,
    DataTable["str | Text"](zebra_stripes=True)
)</code></pre>
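<p>To see why quoting helps, here is a minimal sketch that reuses a hypothetical <code>Node</code> class like the one from the reproduction above. Once the type is quoted, the subscript receives an ordinary string, so the <code>|</code> operator is never applied to <code>str</code> and <code>int</code>, and the <code>typing</code> module treats the string as a forward reference.</p>

```python
from typing import Generic, TypeVar

T = TypeVar("T")


class Node(Generic[T]):
    # hypothetical stand-in for a generic class such as Textual's DataTable
    def __init__(self, label=None):
        self.label = label


# The subscript argument is just a string here, so nothing is OR'd at runtime.
node = Node["str | int"]("x")
print(type(node).__name__, node.label)
```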
<p>If nothing else, maybe this will help someone who also comes across the same issue, as I couldn&rsquo;t really find much talk about this online. The associated PR <a href="https://github.com/TabulateJarl8/vapor/pull/22/">can be found here</a>.</p>
]]></content:encoded></item><item><title>Rewriting the randfacts duplicate facts test</title><link>http://tabulate.tech/blog/randfacts_checkduplicates/</link><pubDate>Mon, 18 Nov 2024 14:44:19 -0500</pubDate><category>randfacts</category><category>rust</category><category>python</category><guid>http://tabulate.tech/blog/randfacts_checkduplicates/</guid><description>&lt;p&gt;Recently, I was working on a Python web backend project for work, and I noticed something strange with the LSP I was using, Pyright. For some reason, it couldn&amp;rsquo;t automatically detect and import modules that I referenced. This seemed like a pretty standard and basic feature, so after a quick search, I stumbled upon &lt;a href="https://github.com/microsoft/pyright/issues/4263"&gt;microsoft/pyright#4263&lt;/a&gt;. Someone posted an issue asking Microsoft about why this feature wasn&amp;rsquo;t available in Pyright, and they responded with this:&lt;/p&gt;</description><content:encoded><![CDATA[<p>Recently, I was working on a Python web backend project for work, and I noticed something strange with the LSP I was using, Pyright. For some reason, it couldn&rsquo;t automatically detect and import modules that I referenced. This seemed like a pretty standard and basic feature, so after a quick search, I stumbled upon <a href="https://github.com/microsoft/pyright/issues/4263">microsoft/pyright#4263</a>. Someone posted an issue asking Microsoft about why this feature wasn&rsquo;t available in Pyright, and they responded with this:</p>
<blockquote>
<p>This is a language service feature that is included in pylance, Microsoft&rsquo;s premium Python language server for VS Code. We don&rsquo;t have plans to port it to pyright. If you want this functionality, please switch to pylance.</p>
</blockquote>
<p>This was pretty annoying, as I have switched to Neovim as my primary editor, and I didn&rsquo;t want to switch back to VSCode. Fortunately, I learned about basedpyright in the same issue, and the author commented that they had pushed an update to the LSP which added this feature. Along with this, it also seemed to give more warnings about typing issues in the code, so I started going through some of my projects and transitioning them to be fully typed. Eventually, I got to my randfacts Python module and this is where the story really begins.</p>
<h2 id="some-background">Some Background</h2>
<p>Randfacts is a Python module that I created with a very simple purpose, which is to provide a developer with an easy-to-use interface to a database of random facts. I had made this for a Discord bot, as nothing else existed at the time, and I wasn&rsquo;t expecting it to actually be anything. At the same time, I was also starting to learn about publishing PyPI modules, so I figured I&rsquo;d throw it up on there to learn how the whole publishing process worked. After a while, however, I noticed that the downloads started going up a lot more than I expected, so I started maintaining the project some more, and I eventually got to where I am today. At the time of writing this, the module has about 1.2 million downloads, which isn&rsquo;t a ton compared to some other modules, but it&rsquo;s pretty cool to me.</p>
<h2 id="the-checkduplicates-test">The checkduplicates test</h2>
<p>After a while of maintaining the module, I noticed a problem. Since the facts were being scraped off of the web, I inevitably ended up with some duplicates. To address this, I wrote a test in Python that would go through all of the facts and use the Levenshtein distance algorithm to compute the similarity between each pair of strings. On top of Levenshtein distance, I used a token sort ratio preprocessor, which converts each string to lowercase, removes any punctuation, and sorts the resulting tokens, because this usually gave more accurate results. With this method of string similarity checking, I could accurately match strings with the same meaning but different wording, such as &ldquo;Jupiter is the biggest planet in the solar system&rdquo; and &ldquo;The biggest planet in the solar system is Jupiter&rdquo;. This test worked fine for a while, but every time I added another fact, it needed to be compared with every fact before it. With the current list of over 7,000 facts, the test needs to compute about 27.5 million string comparisons. The Python version of the test could compute about 400k-500k string comparisons per second, which ended up taking a bit over a minute just to check for duplicate facts.</p>
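<p>To make the preprocessing and the pair count concrete, here is a small sketch. The helper names <code>token_sort_key</code> and <code>pair_count</code> are mine and are not taken from the test itself, so treat this as an approximation of the idea, not the actual implementation.</p>

```python
import string


def token_sort_key(text):
    """Lowercase the text, strip ASCII punctuation, and sort the tokens."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(cleaned.split()))


def pair_count(n):
    """Number of unordered pairs among n facts: n * (n - 1) / 2."""
    return n * (n - 1) // 2


a = token_sort_key("Jupiter is the biggest planet in the solar system")
b = token_sort_key("The biggest planet in the solar system is Jupiter.")
print(a == b)
print(pair_count(7_000))
```

<p>After this preprocessing, the two Jupiter facts become the same string, so their edit distance is zero. The pair count grows quadratically: 7,000 facts already give 24,496,500 pairs, and a figure of about 27.5 million corresponds to roughly 7,400 facts.</p>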
<p>In comes Rust. When I was first learning Rust, I started to rewrite this test, as I thought using a compiled language would at least provide a small benefit in computation time, but this doesn&rsquo;t address the underlying problem of why the test is so slow. When I came back to randfacts with my new LSP, I rediscovered this half-finished implementation and decided that it would be fun to finish now that I know more about Rust.</p>
<h2 id="finishing-the-rewrite">Finishing the Rewrite</h2>
<p>My goal was for the Rust test to have similar, if not the same, functionality as the Python test.</p>
<h3 id="algorithm-optimizations">Algorithm Optimizations</h3>
<p>The first problem I addressed was the efficiency of the Levenshtein distance algorithm. Since Levenshtein distance is originally a mathematical definition and wasn&rsquo;t designed for programming, computing it directly isn&rsquo;t particularly efficient. This is where Wagner-Fischer comes in. Wagner-Fischer is an algorithm for computing Levenshtein distance that uses dynamic programming to avoid redundant calculations. The textbook definition of Levenshtein distance is also recursive, which Wagner-Fischer is not, avoiding that extra recursive overhead. I chose to go with a Wagner-Fischer implementation that only keeps two rows instead of a full matrix to hopefully get even better performance. The full algorithm is below:</p>
<pre><code class="language-rs">#[inline(always)]
fn wagner_fischer_2row(s1: &[char], s2: &[char]) -> usize {
    // Ensure s1 is the shorter sequence for optimization
    let (s1, s2) = if s1.len() < s2.len() {
        (s1, s2)
    } else {
        (s2, s1)
    };

    let len1 = s1.len();
    let len2 = s2.len();

    // handle empty string cases
    if len1 == 0 {
        return len2;
    }
    if len2 == 0 {
        return len1;
    }

    // Initialize two rows for the dynamic programming matrix
    let mut prev_row = vec![0; len2 + 1];
    let mut curr_row = vec![0; len2 + 1];

    // Initialize first row with incremental values
    (0..=len2).for_each(|i| {
        prev_row[i] = i;
    });

    // Fill the matrix using only two rows
    for (i, c1) in s1.iter().enumerate() {
        curr_row[0] = i + 1;

        for (j, c2) in s2.iter().enumerate() {
            curr_row[j + 1] = if c1 == c2 {
                // No edit needed
                prev_row[j]
            } else {
                // Take minimum of three possible operations (insert, delete, substitute)
                1 + prev_row[j].min(prev_row[j + 1]).min(curr_row[j])
            };
        }

        // Swap rows using mem::swap for better performance
        std::mem::swap(&mut prev_row, &mut curr_row);
    }

    prev_row[len2]
}</code></pre>
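<p>For readers who don&rsquo;t know Rust, the same two-row idea can be sketched in Python. This is a straightforward translation meant for experimenting, not the code the test runs, and the function name <code>levenshtein_two_row</code> is just a placeholder.</p>

```python
def levenshtein_two_row(s1, s2):
    """Levenshtein distance that keeps only two rows of the DP matrix."""
    # keep s1 as the shorter sequence, mirroring the Rust version
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    # first row: distance from the empty prefix of s1 to each prefix of s2
    prev_row = list(range(len(s2) + 1))

    for i, c1 in enumerate(s1, start=1):
        curr_row = [i]
        for j, c2 in enumerate(s2):
            if c1 == c2:
                # no edit needed, carry the diagonal value over
                curr_row.append(prev_row[j])
            else:
                # substitution, deletion, or insertion, whichever is cheapest
                curr_row.append(1 + min(prev_row[j], prev_row[j + 1], curr_row[j]))
        prev_row = curr_row

    return prev_row[-1]


print(levenshtein_two_row("kitten", "sitting"))
```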
<h3 id="tokenization-optimizations">Tokenization Optimizations</h3>
<p>To speed it up a bit more, I added the following check to the <code>token_sort_ratio</code> function:</p>
<pre><code class="language-rs">if (len1 as f64 / len2 as f64) < 0.5 || (len2 as f64 / len1 as f64) < 0.5 {
    return 0.0;
}</code></pre>
<p>This snippet checks whether the lengths of the strings we&rsquo;re comparing differ by more than a factor of two. If they do, we can reasonably assume that the strings are different. While this assumption may not always hold, the performance gain is great enough to justify it being in the algorithm. It means that on some comparisons we can completely skip the Wagner-Fischer computation, which is an O(m*n) algorithm, with m and n being the lengths of the strings.</p>
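<p>The same check can be written a little more compactly by comparing the shorter length against the longer one. The sketch below is a Python rendering of that idea; the function name and the explicit handling of empty strings are my own assumptions.</p>

```python
def lengths_too_different(len1, len2, min_ratio=0.5):
    """True when one string is less than min_ratio times as long as the other."""
    if len1 == 0 or len2 == 0:
        # assumption: an empty string only "matches" another empty string
        return len1 != len2
    return min(len1, len2) / max(len1, len2) < min_ratio


print(lengths_too_different(10, 25))
print(lengths_too_different(10, 20))
```

<p>Skipping a pair this way costs a couple of comparisons, while running Wagner-Fischer on two 100-character strings would fill in about 10,000 cells.</p>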
<h3 id="iteration-optimizations">Iteration Optimizations</h3>
<p>Other than the algorithm implementation, this may be the most important part to focus on. There are so many different ways to iterate over every combination of facts, so choosing the correct way is crucial to a fast algorithm. Let&rsquo;s take a look at the iteration line by line:</p>
<pre><code class="language-rs">// Generate all possible indices combinations
let indices: Vec<_> = (0..all_facts.len())
    .flat_map(|i| ((i + 1)..all_facts.len()).map(move |j| (i, j)))
    .collect();</code></pre>
<p>Instead of generating an iterable structure that contains all of the facts pre-paired, we can generate all pairs of indices instead. Each element of the <code>all_facts</code> array is a struct with information about the fact, such as the fact itself and the line number in the file where the fact can be located. The fact itself isn&rsquo;t just a <code>String</code>, but rather an <code>Arc&lt;String&gt;</code>. This allows us to have cheaper clones, which is crucial for performance. Next, we can look at how these indices are used:</p>
<pre><code class="language-rs">// Process combinations in parallel
indices
    .into_par_iter()
    .progress_with(pb)
    .filter_map(|(i, j)| {
        let facts = &all_facts;
        let fact1 = &facts[i];
        let fact2 = &facts[j];

        let ratio = token_sort_ratio(&fact1.fact, &fact2.fact);
        if ratio > SIMILARITY_THRESHOLD {
            Some((fact1.clone(), fact2.clone(), ratio))
        } else {
            None
        }
    })
    .collect()</code></pre>
<p>If we take a look at the first part of this block, the <code>into_par_iter()</code> call, we can see where a huge amount of the improved performance lies. I&rsquo;m using a Rust library called <a href="https://github.com/rayon-rs/rayon">Rayon</a>, which makes it incredibly easy to convert a sequential iterator into a parallel iterator. This means that instead of doing one string comparison at a time, I can take advantage of all of my CPU cores and do many computations at once, drastically speeding up the time it takes to find duplicate facts.</p>
<p>The next part, the body of the <code>filter_map</code> closure, is pretty simple. We take references to the facts to avoid copying/cloning and calculate the similarity ratio. If it&rsquo;s above the threshold, we add the pair to the removal list and continue. I found that a good threshold for this particular algorithm is 82.5.</p>
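<p>The overall shape of this scan can be sketched in Python as well. Note the assumptions: this version runs sequentially, so it does not reproduce Rayon&rsquo;s parallelism, and it uses <code>difflib.SequenceMatcher</code> from the standard library as a stand-in for <code>token_sort_ratio</code>, so the scores will not match the Rust test exactly.</p>

```python
from difflib import SequenceMatcher
from itertools import combinations

SIMILARITY_THRESHOLD = 82.5


def similarity(a, b):
    """Stand-in similarity score on a 0-100 scale (not token_sort_ratio)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def find_similar_pairs(facts, threshold=SIMILARITY_THRESHOLD):
    """Compare every unordered pair of facts and keep those above threshold."""
    matches = []
    # combinations() yields each index pair (i, j) with i < j exactly once
    for i, j in combinations(range(len(facts)), 2):
        ratio = similarity(facts[i], facts[j])
        if ratio > threshold:
            matches.append((facts[i], facts[j], ratio))
    return matches
```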
<h3 id="ci-caching">CI Caching</h3>
<p>The one downfall of the Rust version is that it takes time to compile which can slow down the CI, and that defeats the purpose of having a faster test. To solve this issue, I used GitHub&rsquo;s <a href="https://github.com/actions/cache">actions/cache</a> action. Here&rsquo;s the relevant section of the CI:</p>
<pre><code class="language-yml">- name: Cache checkduplicates binary
  uses: actions/cache@v4
  id: cache
  with:
    path: |
      tests/checkduplicates/target/release/checkduplicates
    key: ${{ runner.os }}-cargo-${{ hashFiles('tests/checkduplicates/Cargo.lock', 'tests/checkduplicates/Cargo.toml', 'tests/checkduplicates/src/**') }}
    restore-keys: |
      ${{ runner.os }}-cargo-

- name: Build checkduplicates test
  if: steps.cache.outputs.cache-hit != 'true'
  run: |
    cd tests/checkduplicates
    cargo build --release

- name: Check for duplicate facts
  run: ./tests/checkduplicates/target/release/checkduplicates</code></pre>
<p>To explain this simply, the cache key is built from a hash of <code>Cargo.toml</code>, <code>Cargo.lock</code>, and everything in <code>src/**</code>, so the cache action can tell whether any of them has changed. If one has, we&rsquo;ll assume that the cache is expired and the test should be rebuilt, which is what the <code>Build checkduplicates test</code> step does. If the cache is not expired, the cached <code>checkduplicates</code> binary is placed in the appropriate place. After building, or if building is skipped, we then run the resulting binary. This allows us to skip the build time if nothing has changed in the test, while still letting it automatically build if something has changed.</p>
<h2 id="conclusion">Conclusion</h2>
<p>After all of this work, was it worth it? Let&rsquo;s let the numbers speak for themselves.</p>
<table>
  <thead>
      <tr>
          <th></th>
          <th>Python Test</th>
          <th>Rust Test</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Approximate iterations/sec</td>
          <td>550,000</td>
          <td>2,200,000</td>
      </tr>
      <tr>
          <td>Time Taken</td>
          <td>48 seconds</td>
          <td>12 seconds</td>
      </tr>
  </tbody>
</table>
<p>This benchmark was performed on my Framework 16 Laptop. I have a Ryzen 7 7840HS @ 3.8GHz, 16GB of DDR5-5600 RAM, and I was using the &ldquo;Performance&rdquo; profile with power profiles daemon on Arch Linux. In this case, the Rust version of the test performed 4× faster than the Python version of the test.</p>
<p>This metric, along with the CI caching, led to a huge performance gain in the duplicate fact checking. That&rsquo;s all I have for now so hopefully you learned something or just enjoyed this post. The full source code for the new test can be found below, just note that I&rsquo;ve pinned the commit so there may be a more up to date version on the master branch.</p>
<p><a href="https://github.com/TabulateJarl8/randfacts/tree/5e6786e8b536efc2895880ce5f0e88a8f442454b/tests/checkduplicates">https://github.com/TabulateJarl8/randfacts/tree/5e6786e8b536efc2895880ce5f0e88a8f442454b/tests/checkduplicates</a></p>
]]></content:encoded></item><item><title>College Range Assignment</title><link>http://tabulate.tech/blog/college_range_assignment/</link><pubDate>Wed, 17 Jan 2024 19:10:05 -0500</pubDate><category>assembly</category><category>python</category><guid>http://tabulate.tech/blog/college_range_assignment/</guid><description>&lt;p&gt;A friend of mine is enrolled in a college intro to programming course. This course had a very simple entrance test: they needed to write a program in any language to display the numbers 5-60 prefixed with &amp;ldquo;number &amp;ldquo;, like so:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-txt"&gt;number 5
number 6
number 7
number 8
...
number 60&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After he told me about this assignment, I thought that it was pretty funny, and I wanted to write it in assembly as a joke. I started off with stealing some integer printing code that I had written for another project.&lt;/p&gt;</description><content:encoded><![CDATA[<p>A friend of mine is enrolled in a college intro to programming course. This course had a very simple entrance test: they needed to write a program in any language to display the numbers 5-60 prefixed with &ldquo;number &ldquo;, like so:</p>
<pre><code class="language-txt">number 5
number 6
number 7
number 8
...
number 60</code></pre>
<p>After he told me about this assignment, I thought that it was pretty funny, and I wanted to write it in assembly as a joke. I started off with stealing some integer printing code that I had written for another project.</p>
<details>
	<summary>The integer printing code</summary>
	<pre><code class="language-nasm">; Print "number i" for i in range(5, 61)
section .data
    str_buffer db 0 ; for printing integers

section .text
    global _start

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Integer printing system ;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

calculate_ten_power:
    ; calculate the power of 10 that corresponds to an integer
    ; for example, 100 for 543, 1000 for 8956, and 10000 for 15236

    ; returns the integer in rcx

    ; rsp is the return address, add 8 to get the argument
    mov rcx, [rsp+8] ; rcx should be the integer to find the power of 10 for

    mov rax, 1 ; we need to calculate the power of 10 that corresponds to rcx
    ; for example 100 for 543 and 1000 for 8753
    mov rbx, 10
    calculate_ten:
        mul rbx
        cmp rax, rcx
        jg finish_power_ten ; if number is greater than target, divide by 10 and ret
        jmp calculate_ten

    finish_power_ten:
        ; divide rax by 10 to finish the calculation
        xor rdx, rdx
        div rbx
        ; now rax contains the power of 10
        mov rcx, rax
    ret


print_digit:
    push rcx
    mov rcx, [rsp+16]
    add rcx, '0' ; convert digit to ASCII

    mov byte [str_buffer], cl ; assign lower 8 bits of rcx to buffer

    mov rsi, str_buffer ; buffer pointer
    mov rax, 1 ; write
    mov rdi, 1 ; stdout
    mov rdx, 1 ; len
    syscall   ; call kernel

    pop rcx ; restore rcx

    ret


print_integer:
    ; takes the integer in from rax
    push rax ; push rax for the next function to consume
    call calculate_ten_power ; power of 10 is now in rcx

    pop rax ; mov the argument (number to print) that was pushed into rax

    iter_number:
        ; num_to_print: rax
        ; base_10_place: rcx
        ; formula for accessing number: (num_to_print // base_10_place) % 10
        ; base_10_place is the power of 10 that corresponds to the place of number to print
        ; using 123 for example, 100 will get the 1, 10 will get the 2, and 1 will get the 3

        ; first, make sure we have a copy of rax
        push rax

        ; 10 for use in modulo
        mov rbx, 10

        ; next, floor divide rax by rcx
        xor rdx, rdx
        div rcx
        ; result is stored in rax, mod 10
        ; clear out rdx because thats where remainder is stored
        xor rdx, rdx
        div rbx

        ; rdx now contains our digit to print
        push rdx
        call print_digit
        add rsp, 8 ; remove the rdx that never got popped from print_digit from the stack


        ; check if rcx is equal to 1. if so, we just did the last digit
        mov rax, 1
        cmp rax, rcx
        je exit_print_integer

        ; divide our power of 10 by 10 to get the next digit
        xor rdx, rdx
        mov rbx, 10
        mov rax, rcx

        div rbx
        mov rcx, rax

        ; restore our original number to print
        pop rax

        ; loop to iter_number until rcx is 1 (we've done the last digit)
        jmp iter_number

    exit_print_integer:
    pop rax ; pop off our original number so that we return to the correct address

    ret

_start:
    mov rax, 60
    mov rdi, 0
    syscall ; call kernel</code></pre>

</details>
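<p>If the assembly is hard to follow, the digit extraction it performs can be modeled in a few lines of Python. This is only a sketch of the same <code>(num_to_print // base_10_place) % 10</code> formula, with a hypothetical function name, and it collects the digits in a list instead of printing them one at a time.</p>

```python
def digits_by_place(n):
    """Return the decimal digits of a non-negative integer, left to right."""
    # find the largest power of 10 that is not greater than n (1 for n < 10)
    place = 1
    while place * 10 <= n:
        place *= 10

    digits = []
    while place >= 1:
        # same formula as the assembly: (num_to_print // base_10_place) % 10
        digits.append((n // place) % 10)
        place //= 10
    return digits


print(digits_by_place(543))
```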

<p>After setting this up, I wrote a quick assembly program to iterate over the numbers 5-60 and print each one:</p>
<pre><code class="language-nasm">section .data
    line_text db "number "
    line_text_len equ $ - line_text

    str_buffer db 0 ; for printing integers

section .text
    global _start

; integer printing code here

_start:
    mov rcx, 5
    loop_start:
        cmp rcx, 61 ; if at 61, jump to exit
        je exit

        push rcx

        ; print line text
        mov rax, 1
        mov rdi, 1
        mov rsi, line_text
        mov rdx, line_text_len
        syscall

        mov rax, [rsp] ; put current increment in rax to print

        call print_integer

        ; print newline
        push 0xa ; newline
        mov rax, 1
        mov rdi, 1
        mov rsi, rsp
        mov rdx, 1
        syscall

        ; pop newline from stack
        add rsp, 8
        pop rcx

        inc rcx
        jmp loop_start

    exit:
        mov rax, 60
        mov rdi, 0
        syscall ; call kernel</code></pre>
<hr>
<p>This was pretty simple, and I felt good about it, so I sent it to my friend. He then responded with, &ldquo;Unfortunately a requirement was that it be &lsquo;significantly less than 55 lines&rsquo;&rdquo;. This could only be taken as a challenge, of course. He suggested that I hardcode a block of memory to contain the numbers and text that I have to print, and I thought that was a great idea, so I got to work.</p>
<h2 id="generating-the-data">Generating the data</h2>
<p>My first task was to write a throwaway Python script to generate this data. After some fiddling, I came up with this:</p>
<pre><code class="language-python">import textwrap
data = ''

# "number {i}\n"

for num in [f'6e756d62657220{str(i).encode().hex():0<4}0a' for i in range(5, 61)]:
    data += ', '.join(['0x' + split for split in textwrap.wrap(num, 2)]) + ', '

data = data.rstrip(', ')

print(data)</code></pre>
<p>This program is pretty simple if you break it down. Let&rsquo;s focus on the list comprehension first:</p>
<pre><code class="language-python">    [f'6e756d62657220{str(i).encode().hex():0<4}0a' for i in range(5, 61)]</code></pre>
<p>Inside the list comprehension, you can see the numbers we&rsquo;re iterating over in <code>range(5, 61)</code>. For every number in this range, we add a new string to the list:</p>
<pre><code class="language-python">    f'6e756d62657220{str(i).encode().hex():0<4}0a'</code></pre>
<p>The first chunk of this string, <code>6e756d62657220</code>, represents the text prefixing the number (&ldquo;number &rdquo;) in hex. Then, we convert the current number in the loop iteration to a string and generate the ASCII hex representation of each of its digits. For example, <code>str(12).encode().hex()</code> would return <code>3132</code>, since the ASCII character <code>1</code> is <code>0x31</code> and <code>2</code> is <code>0x32</code>. You may have noticed the <code>:0&lt;4</code> at the end of the f-string. This fixes a bug that I discovered. I want each line to be the same length so that it&rsquo;s super easy to print in assembly; however, single-digit numbers are only one byte (two hex digits), while double-digit numbers are two bytes (four hex digits). To solve this, I introduced null-padding into the numbers. This means that if a number is only one byte long, I add a null byte after it. This null byte is not rendered by the terminal, so it shouldn&rsquo;t interfere with how the display is formatted. For example, if we have the number 7 (<code>0x37</code>), this little part of the f-string pads <code>37</code> to the four hex digits <code>3700</code>, giving us the two bytes <code>0x37, 0x00</code>. The only part remaining is the <code>0a</code> at the end of the string, which is hex for a newline (<code>\n</code>). This covers the list comprehension of the Python. Next is the second step:</p>
<pre><code class="language-python">    data += ', '.join(['0x' + split for split in textwrap.wrap(num, 2)]) + ', '</code></pre>
<p>This part is pretty simple as well. We take each sequence of hex digits generated in the previous step (e.g. <code>3132</code>) and convert it into a format that the assembler can read. First, we use the textwrap module (very strange usage of this module but I guess it works) to split the data into chunks of 2 hex digits. For example, <code>'3132'</code> would be split into <code>['31', '32']</code>. We then prepend <code>0x</code> to each of these strings to tell the assembler that this is a hexadecimal number and not a base 10 number. The rest of the code on that line just strings together each of these new numbers with commas, giving you a final result of <code>0x31, 0x32</code>.</p>
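<p>To make the two steps easier to poke at, here is a small sketch that builds a single record and checks its size. It is not the script from above: the helper names <code>encode_line</code> and <code>to_nasm_bytes</code> are made up for illustration, and <code>str.ljust</code> stands in for the left-aligned zero fill in the f-string (it should behave the same for numbers of one or two digits).</p>
<pre><code class="language-python">import textwrap

# "number " as hex: 6e756d62657220
PREFIX_HEX = "number ".encode().hex()


def encode_line(i):
    """Return the hex string for one fixed-width "number N" record."""
    digits = str(i).encode().hex()   # e.g. 12 becomes '3132'
    padded = digits.ljust(4, "0")    # pad single digits with a null byte
    return PREFIX_HEX + padded + "0a"


def to_nasm_bytes(hex_string):
    """Split a hex string into 0x.. byte literals joined by commas."""
    return ", ".join("0x" + pair for pair in textwrap.wrap(hex_string, 2))


record = bytes.fromhex(encode_line(7))
print(len(record))                      # 10 bytes for every record
print(to_nasm_bytes(encode_line(12)))</code></pre>
<p>Decoding the hex back with <code>bytes.fromhex</code> is a quick way to confirm that every record really is 10 bytes before pasting the data into the assembly file.</p>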
<h2 id="writing-the-assembly-program">Writing the assembly program</h2>
<p>Next step was to write the program. I had a pretty strong mental image of how this should go:</p>
<ol>
<li>Initialize the hardcoded data</li>
<li>Check if the address that I&rsquo;m reading from is past the bounds of the data
<ul>
<li>If it is, exit</li>
<li>If it&rsquo;s not, continue</li>
</ul>
</li>
<li>Print a set amount of data from the memory</li>
<li>Increment the address that I&rsquo;m reading from and loop</li>
</ol>
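<p>As a rough model of the plan above (in Python rather than assembly, purely for illustration), the loop could look like the sketch below. The names <code>print_records</code> and <code>RECORD_SIZE</code> are hypothetical, and the sketch assumes the data length is an exact multiple of the record size, just like the equality check in the steps above.</p>
<pre><code class="language-python">import sys

RECORD_SIZE = 10  # bytes per padded "number N" record


def print_records(data, out=sys.stdout):
    """Write fixed-size records until the read offset reaches the end."""
    offset = 0                      # step 1: the data is already in memory
    while offset != len(data):      # step 2: exit once we are past the data
        chunk = data[offset:offset + RECORD_SIZE]
        out.write(chunk.decode())   # step 3: print a set amount of data
        offset += RECORD_SIZE       # step 4: move to the next record, loop


sample = b"number 5\x00\n" + b"number 12\n"
print_records(sample)</code></pre>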
<p>I started writing, and after a while I came up with a very basic implementation. I had a counter that incremented until it was equal to the length of the data block, and when it was, it exited. However, I realized that I could do some math to replace the counter entirely, and this removed a few lines of code. Here&rsquo;s what I ended up with:</p>
<pre><code class="language-nasm">section .data
    numbers db 0x6e, 0x75, 0x6d, ...
    numbers_len equ $ - numbers
section .text
    global _start

_start:
    lea rsi, [numbers] ; load address of numbers into rsi

    printing_loop:

        lea rdi, [numbers + numbers_len]
        cmp rsi, rdi
        je exit
        mov rdi, 1
        mov rax, 1
        mov rdx, 10 ; the number of bytes to print from rsi
        syscall
        add rsi, 10
        jmp printing_loop

    exit:
        mov rax, 60 ; exit
        mov rdi, 0
        syscall</code></pre>
<p>As you can see, this implementation worked and it was significantly shorter than the previous implementation due to its hardcoded nature. Line 2 initializes the data, line 8 loads the address of the data into <code>rsi</code> before we start the loop, and then we start printing. Lines 12-14 are for bounds checking. We load the address of <code>numbers</code> plus the length of that block of data into <code>rdi</code>, and then compare it with the address we&rsquo;re currently reading from, <code>rsi</code>. If they&rsquo;re equal, that means we&rsquo;ve read all of the data, and we can exit. If not, we continue and load data into the appropriate registers in order to print 10 bytes from our current memory address. Think of it as taking 10 bytes at a time from our huge list of bytes. Then, we print these 10 bytes to the screen and end up with something like <code>number 12</code>. Then, we check if there are another 10 bytes available, and if there are, we continue doing this. However, I was really invested now and wanted to try and make it as short as possible, so I looked for ways to optimize it. Even when I remove linebreaks and put labels on the same lines as code, I still end up with this:</p>
<pre><code class="language-nasm">section .data
    numbers db 0x6e, 0x75, 0x6d, ...
    numbers_len equ $ - numbers
section .text
    global _start
_start: lea rsi, [numbers] ; load address of numbers into rsi
    printing_loop: lea rdi, [numbers + numbers_len]
        cmp rsi, rdi
        je exit
        mov rdi, 1
        mov rax, 1
        mov rdx, 10 ; the number of bytes to print from rsi
        syscall
        add rsi, 10
        jmp printing_loop
    exit: mov rax, 60 ; exit
        mov rdi, 0
        syscall</code></pre>
<p>This is 18 lines, which is better than 25 but I still felt like I could do even better. I kept analyzing, and then all of a sudden, I saw it in a way I hadn&rsquo;t seen before, and I quickly moved some stuff around. Here&rsquo;s the final product:</p>
<pre><code class="language-nasm">section .data
    numbers db 0x6e, 0x75, 0x6d, ..., 0x0
section .text
    global _start

_start:
    lea rsi, [numbers] ; load address of numbers into rsi

    printing_loop:
        mov rdi, 1
        mov rax, 1
        mov rdx, 10 ; print 10 bytes. rsi is already loaded
        syscall

        ; increment rsi by 10. this changes the address to start at the next number, since each padded "number xx\n" record is 10 bytes long
        add rsi, 10

        cmp byte [rsi], 0 ; check if the first byte in the next sequence is null
        jne printing_loop ; if it isn't we haven't reached the end, keep printing

        mov rax, 60 ; exit
        mov rdi, 0
        syscall</code></pre>
<p>You may have noticed something a little different, which is the trailing null byte at the end of the data block. I realized that the first byte in the chunk of 10 will never be null unless I explicitly set it, since the first thing in each line is the letter &ldquo;n&rdquo; (<code>0x6e</code>). After adding this null byte, I could get rid of the <code>numbers_len</code> variable completely and all of the extra lines that came along with it. The flow of the program is still mostly the same. Instead of bounds checking first, I print our 10 bytes first. Then, I increment our address by 10 and check if the first byte in this next chunk of 10 is <code>0x0</code>. If it is <strong>not</strong>, then that means we&rsquo;re not done and we jump back up to the printing loop. This little inversion of checking saves us an extra line of code because we don&rsquo;t need an unconditional <code>jmp printing_loop</code> at the bottom of the loop; the jump only happens if we&rsquo;re not finished. If we are finished, the program will continue reading top down, skipping the jump to <code>printing_loop</code>, and we&rsquo;ll exit with status code 0. When we remove all blank lines and comment-only lines and collapse labels, we end up with a final total of 15 lines, which is pretty good in my opinion. The full code for all of these files can be found below:</p>
<p><a href="https://github.com/TabulateJarl8/random-junk/tree/94c72e4d746f0e7d9116e38230301fffaee47b82/asm/range_print">https://github.com/TabulateJarl8/random-junk/tree/94c72e4d746f0e7d9116e38230301fffaee47b82/asm/range_print</a></p>
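<p>For anyone who would rather read the sentinel trick in Python, here is a minimal model of the final loop. It is only a sketch: <code>build_data</code> and <code>print_until_null</code> are illustrative names, and the data is rebuilt here with <code>bytes.ljust</code> instead of the generator script.</p>
<pre><code class="language-python">import sys

RECORD_SIZE = 10


def build_data():
    """Rebuild the padded records for 5 through 60 plus the trailing null."""
    records = (
        f"number {i}".encode().ljust(9, b"\x00") + b"\n"
        for i in range(5, 61)
    )
    return b"".join(records) + b"\x00"


def print_until_null(data, out=sys.stdout):
    """Print a record first, then stop if the next record starts with 0x0."""
    offset = 0
    while True:
        out.write(data[offset:offset + RECORD_SIZE].decode())
        offset += RECORD_SIZE
        if data[offset] == 0:   # sentinel: no real record starts with null
            break


data = build_data()
# The padding nulls only ever sit at index 8 of a record, never index 0.
assert all(data[i] == 0x6E for i in range(0, len(data) - 1, RECORD_SIZE))
print_until_null(data)</code></pre>
<p>The assertion spells out the assumption the assembly relies on: every record starts with &ldquo;n&rdquo;, so a null in the first position can only be the terminator.</p>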
]]></content:encoded></item><item><title>Trying Out Typst</title><link>http://tabulate.tech/blog/trying_typst/</link><pubDate>Tue, 05 Dec 2023 00:51:22 -0500</pubDate><category>markup</category><category>latex</category><category>typst</category><guid>http://tabulate.tech/blog/trying_typst/</guid><description>&lt;p&gt;Recently, I came across a new project, &lt;a href="https://typst.app/"&gt;Typst&lt;/a&gt;. From their GitHub, &amp;ldquo;Typst is a new markup-based typesetting system that is designed to be as powerful as LaTeX while being much easier to learn and use&amp;rdquo;. It&amp;rsquo;s written in Rust which I was immediately a fan of, and I was super interested in an alternative to LaTeX as I use it heavily for school papers, and while it&amp;rsquo;s super powerful, it can be annoying to set up and the compile times can start to get slow when you start compiling 70 page documents. I started checking out their examples, and I couldn&amp;rsquo;t find an APA template, so I figured that was a great way to start learning.&lt;/p&gt;</description><content:encoded><![CDATA[<p>Recently, I came across a new project, <a href="https://typst.app/">Typst</a>. From their GitHub, &ldquo;Typst is a new markup-based typesetting system that is designed to be as powerful as LaTeX while being much easier to learn and use&rdquo;. It&rsquo;s written in Rust which I was immediately a fan of, and I was super interested in an alternative to LaTeX as I use it heavily for school papers, and while it&rsquo;s super powerful, it can be annoying to set up and the compile times can start to get slow when you start compiling 70 page documents. I started checking out their examples, and I couldn&rsquo;t find an APA template, so I figured that was a great way to start learning.</p>
<p><img src="/img/blog/typst_showcase.png" alt="Showcase of basic typst document"></p>
<h2 id="similarities-and-differences-from-latex">Similarities and Differences from LaTeX</h2>
<p>I first noticed a few things that really set Typst aside from LaTeX. The first thing was how all of the packages I needed were just built in to Typst. For the APA paper I was recreating, I needed to import 8 packages, some of which need to be manually installed by the person compiling the document. With Typst, I just needed to import the Cetz package for more advanced graphing stuff, as my paper included bar charts. The Cetz package is also included within Typst, so I didn&rsquo;t have to install any extra dependencies. I also noticed that commands in Typst start with #, as opposed to LaTeX where they start with \. Typst has different elements like figures, blocks, and text, and the styling of these can be overridden with the show command. This can also be used very easily to dynamically override element styles. For example, the APA spec requires a specific and different type of heading for each different heading level (1: centered + bold, 2: align left + bold, 3: align left + italic, &hellip;). This is different from the set command which allows you to configure different elements, for example, setting the global text size/font, or setting the spacing around lines. Below is an example of the usage of the show command to create APA headings.</p>
<p><img src="/img/blog/typst_apa_headings.png" alt="APA headings implemented in Typst"></p>
<p>In my LaTeX paper, I was able to set the document class to APA, which provided me with macros to create the title, such as author, affiliation, course, due date, etc. In Typst, I had to implement this myself since, obviously, there wasn&rsquo;t any other template. However, scripting in Typst is much easier than in LaTeX and I&rsquo;ll talk about that a little bit more later.</p>
<h2 id="graphs">Graphs</h2>
<p>One of the main components of my original APA paper was graphs that I created to showcase the research I did. Looking into what Typst had built in, there was some rudimentary graphing stuff, but to get more complicated graphics I needed to import a 3rd-party package, <a href="https://github.com/johannes-wolf/cetz">Cetz</a>, that comes pre-bundled with Typst. It&rsquo;s the exact same situation in LaTeX, so that&rsquo;s fine. I wrote a quick rule in my template to format figures according to the APA spec, and then started reimplementing my graphs. Graphs using Cetz are much more readable than graphs using Tikz/Pgfplots in LaTeX.</p>
<p><img src="/img/blog/cetz_graph_code.png" alt="Cetz graph code"></p>
<p>However, I noticed something strange after I finished writing the code. There was a lot of left padding on the graph, and it was difficult to fit some bigger graphs. I started an issue (<a href="https://github.com/johannes-wolf/cetz/issues/341">johannes-wolf/cetz#341</a>) asking the developer about this issue, and he explained to me that he was currently in the process of rewriting the ColumnChart to be a wrapper around the Plot API. This would allow users to manually adjust the <code>x-min</code> and <code>x-max</code> values, solving the issue with the extra padding. This library needs a bit of work because of how new it is, but I could see it very easily evolving into a suitable replacement for Tikz.</p>
<h2 id="bibliography-issues">Bibliography Issues</h2>
<p>Next step was to complete the bibliography, and fortunately Typst has tons of bibliography formats built in, APA being one of them. The Typst developers have made an alternative format to BibTeX, called <a href="https://github.com/typst/hayagriva">Hayagriva</a>, which is just YAML. However, they also fully support using legacy BibTeX files from your old documents, which is useful since many automatic citation generators and other tools don&rsquo;t support their newer format yet. After constructing a bibliography, I noticed another issue. When an author is missing, the APA style guidelines say that the source should be referenced by the source title, and when a date is missing, the reference should include &ldquo;(n.d.)&rdquo;. There were a few issues that I was experiencing:</p>
<ol>
<li>When provided an author but not a date, (n.d.) was missing</li>
<li>When provided a date but not an author, the source title was missing</li>
<li>When neither an author nor a date was provided, the source was missing but (n.d.) was included</li>
</ol>
<p>After asking about this issue in the Typst Discord server, one of the maintainers of Hayagriva reached out and asked a few questions. Afterwards I was referred to an existing issue (<a href="https://github.com/typst/typst/issues/2762">typst/typst#2762</a>). This issue documents my exact issue, and it&rsquo;s currently being resolved, which is nice to see.</p>
<h2 id="indentation-issues">Indentation Issues</h2>
<p>Typst has a known issue where the first paragraph under a heading ignores the indentation rules set by <code>par(first-line-indent: size)</code>. This issue is currently being tracked (<a href="https://github.com/typst/typst/issues/311">typst/typst#311</a>), but in the meantime, I needed a workaround. After looking through the issue, I found some people who made workarounds, but there were issues with all of them. The closest one was pretty simple, but it added a bit of vertical spacing underneath each header. To counter this, I just added some negative vertical space right next to the added horizontal space. The following code snippet loops through all headings that are levels 1-3, and adds a 0.5in indent and -0.67in of vertical space:</p>
<p><img src="/img/blog/typst_indentation_fix.png" alt="Typst Indentation Fix"></p>
<h2 id="small-things">Small Things</h2>
<p>There are a few small things that are nice about Typst, and since they&rsquo;re not big enough for their own section, I&rsquo;ll just list them all here.</p>
<ul>
<li>Typst has really nice errors. Coming from LaTeX that&rsquo;s really not a high bar to pass, but it&rsquo;s still nice nevertheless. They&rsquo;re clear and concise, and point out the exact line that the issue lies on.</li>
<li>Typst scripting is much easier than scripting in LaTeX. As you can see from the few screenshots I&rsquo;ve included, the code is readable and easy to understand, which is great for maintainability.</li>
<li>Typst has syntax highlighting. Not just this, but it also has inline syntax highlighting which is really nice.</li>
<li>Typst build times are much faster than build times in LaTeX.</li>
<li>There are no auxiliary files in Typst like in LaTeX. When I would compile a LaTeX document, I would get tons of auxiliary files, like <code>.aux</code>, <code>.bbl</code>, <code>.blg</code>, <code>.fls</code>, <code>.out</code>, and <code>.log</code> to name a few. Typst just has your <code>.typ</code> markup file, your bibliography if you have one, and then it generates a PDF without any of the extra junk that LaTeX uses.</li>
</ul>
<h2 id="final-thoughts">Final Thoughts</h2>
<p>Typst seems like a really nice tool and I&rsquo;m excited to see how it matures. It definitely needs some refinement as you saw from the types of issues that were open, but it&rsquo;s nothing that&rsquo;s unfixable. They could also do with some more commands to reduce boilerplate, such as the <code>\doublespacing</code> command from LaTeX. In Typst, you need to implement double spacing yourself, and while it&rsquo;s not too difficult and it&rsquo;s only 2 lines of code, it would still be more friendly to beginners to add more commands like that. If you have a paper you need to write, or if you&rsquo;re just curious, give Typst a try. My completed APA template can be found in my random-junk repository for now, until I decide if I want to put this into its own Typst package: <a href="https://github.com/TabulateJarl8/random-junk/blob/master/typst/apa.typ">https://github.com/TabulateJarl8/random-junk/blob/master/typst/apa.typ</a>.</p>
]]></content:encoded></item><item><title>TiO2</title><link>http://tabulate.tech/blog/tio2/</link><pubDate>Fri, 20 Oct 2023 00:29:03 -0400</pubDate><category>programming</category><category>parsing</category><category>rust</category><category>python</category><guid>http://tabulate.tech/blog/tio2/</guid><description>&lt;p&gt;Some of you may know of my TI-BASIC to Python transpiler, &lt;a href="http://tabulate.tech/software/ti842py"&gt;ti842py&lt;/a&gt;. While not very practical, this project was pretty fun for me to work on because I was having to find all of these different ways to implement TI-BASIC functions in Python. This project was based on a project that I found by &lt;a href="https://github.com/thenaterhood"&gt;thenaterhood&lt;/a&gt; called basically-ti-basic, which could decompile and (almost) compile the TI calculator .8XP files. He did a lot of the hard work of reverse engineering the bytecode, and his program helped me out a lot. I forked his project, reverse engineered some instructions that he missed, and then packaged it for PyPI so I could more easily use it in ti842py.&lt;/p&gt;</description><content:encoded><![CDATA[<p>Some of you may know of my TI-BASIC to Python transpiler, <a href="/software/ti842py">ti842py</a>. While not very practical, this project was pretty fun for me to work on because I was having to find all of these different ways to implement TI-BASIC functions in Python. This project was based on a project that I found by <a href="https://github.com/thenaterhood">thenaterhood</a> called basically-ti-basic, which could decompile and (almost) compile the TI calculator .8XP files. He did a lot of the hard work of reverse engineering the bytecode, and his program helped me out a lot. I forked his project, reverse engineered some instructions that he missed, and then packaged it for PyPI so I could more easily use it in ti842py.</p>
<h2 id="the-problem">The Problem</h2>
<p>My program worked fine for a while, and I implemented many features such as matrix support, <code>Goto</code>/<code>Lbl</code>, <code>getKey</code>, and many others. However, it was the goto support that eventually broke everything. I had been using a <a href="https://github.com/snoack/python-goto/pull/23">fork</a> of snoack&rsquo;s <code>goto-statement</code> Python module, which modified the Python bytecode to allow for jumping to labels. After some recent Python update, the Python developers changed how the interpreter&rsquo;s internal instructions work, and it broke the goto module. Someone did fork the project to add support for Python 3.11; however, if I switched to this fork, I would lose a nice feature from the fork I was using: goto into blocks. While it probably didn&rsquo;t matter too much, I figured that this wasn&rsquo;t maintainable and I should look for another solution.</p>
<h2 id="the-solution">The Solution</h2>
<p>Since I&rsquo;m a huge fan of Rust, I decided that I should rewrite my project in Rust, but do it better and do things correctly this time. I created a new project, and with the help of <a href="https://turbowafflz.gitlab.io">a friend</a>, named it TiO2. The name is a play on &ldquo;TI&rdquo; from Texas Instruments, and the &ldquo;Oxidize&rdquo; trend in naming Rust projects, as the compound TiO2 is titanium dioxide. I&rsquo;ve been working on this project a lot for the past few weeks, and I&rsquo;m excited seeing how far it&rsquo;s come. At this point, I&rsquo;ve completely rewritten the basically-ti-basic project in Rust, including fixing the compiler. This means that TiO2 will be able to both decompile .8XP files as well as compile to them from plain text (this is a lot harder than it sounds, barely any of the 8XP bytecode is documented). Since 2 of the 3 features are complete, I&rsquo;ve now moved on to the most difficult and largest part of the project, which is building the interpreter. I&rsquo;ve opted to go with a bytecode interpreter rather than a plaintext interpreter or transpiling to a different language, as I feel that this is the most maintainable route to go. TI-BASIC can be represented in plaintext in too many different ways, and other programming languages can change, but the bytecode is going to remain the same, so if I implement it once, I (hopefully) never have to look at it again.</p>
<p><img src="/img/blog/tio2_compile_debug_output.png" alt="Debug output from compiling a small program"></p>
<h2 id="where-i-am-now">Where I Am Now</h2>
<p>Currently, I&rsquo;m trying to figure out the best way to implement the parser. If I was able to somehow parse the bytecode into postfix notation, that would be really helpful; however, that sounds pretty difficult. I may take inspiration from postfix though, as it does seem like a smart idea if it can be done. I&rsquo;m about to stop programming for the night, but the last thing that I was stuck on was trying to figure out a way to gather tokens together, such as in a number or a string. Since each digit or character is only one byte, I need to find a way to group the tokens together that are all part of one object, such as the number <code>-3.56</code> or the text in the command <code>Disp &quot;HELLO WORLD&quot;</code>. I might come up with a list of which bytes represent functions, and if the interpreter comes across a function, it will add the following bytes to the top argument in an argument stack until a comma is reached, which signifies the end of an argument and the beginning of a new one. Once the end of the line or, in some cases, a closing parenthesis is reached, the arguments will be popped back into the function and then evaluated. That&rsquo;s just one idea I have, but I suppose we&rsquo;ll have to see what works out.</p>
<p><img src="/img/blog/tio2_tokens_rs_nvim.png" alt="A screenshot of the tokens.rs file"></p>
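<p>To make the argument-grouping idea a bit more concrete, here is a rough Python sketch of it. This is not code from TiO2: it works on readable string tokens instead of real 8XP bytes, the <code>FUNCTIONS</code> set and the <code>group_arguments</code> name are hypothetical, and it adds one assumption of its own, which is that commas and parentheses inside a quoted string should not split arguments.</p>
<pre><code class="language-python"># Hypothetical function tokens; a real table would map bytes, not strings.
FUNCTIONS = {"Disp", "Output("}
ARG_SEPARATOR = ","
LINE_END = "\n"


def group_arguments(tokens):
    """Group one-character tokens into (function, [arguments]) pairs."""
    calls = []
    name, args, in_string = None, [], False
    for token in tokens:
        if name is None:
            # Outside a call, only a function token starts collecting.
            if token in FUNCTIONS:
                name, args, in_string = token, [""], False
            continue
        if token == LINE_END or (token == ")" and not in_string):
            calls.append((name, args))   # arguments are handed to the call
            name = None
        elif token == ARG_SEPARATOR and not in_string:
            args.append("")              # a comma starts the next argument
        else:
            if token == '"':
                in_string = not in_string
            args[-1] += token            # extend the top argument
    if name is not None:
        calls.append((name, args))       # input ended without a newline
    return calls


print(group_arguments(["Disp"] + list('"HELLO WORLD"') + ["\n"]))
print(group_arguments(["Output("] + list("1,2,-3.56)")))</code></pre>
<p>Each call ends up with its arguments as whole strings, such as <code>-3.56</code>, which a later step could parse into numbers or string values before evaluating the function.</p>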
]]></content:encoded></item><item><title>G502 Hero Mouse Repair</title><link>http://tabulate.tech/blog/g502_mouse_repair/</link><pubDate>Tue, 17 Oct 2023 00:24:15 -0400</pubDate><category>repair</category><category>hardware</category><guid>http://tabulate.tech/blog/g502_mouse_repair/</guid><description>&lt;p&gt;This is the first blog post I&amp;rsquo;ve made so it might be a little strange until I get used to it. I use the G502 Hero mouse made by Logitech, and it&amp;rsquo;s the best mouse I&amp;rsquo;ve ever used. I won&amp;rsquo;t get too far into the details as of why but it&amp;rsquo;s just really good. Anyway, I was in my dorm and I was doing things on my computer like normal, when I reached out for my mouse and accidentally knocked over my cup full of ramen water which subsequently spilled all over my entire desk and everything on it, including my mouse. I dried everything off and my mouse seemed to work still which was nice, and I didn&amp;rsquo;t think much of it. A bit later when I was programming, I noticed something a bit odd. My scroll wheel would stop scrolling for a few lines every now and then, and I needed to fix that.&lt;/p&gt;</description><content:encoded><![CDATA[<p>This is the first blog post I&rsquo;ve made so it might be a little strange until I get used to it. I use the G502 Hero mouse made by Logitech, and it&rsquo;s the best mouse I&rsquo;ve ever used. I won&rsquo;t get too far into the details as of why but it&rsquo;s just really good. Anyway, I was in my dorm and I was doing things on my computer like normal, when I reached out for my mouse and accidentally knocked over my cup full of ramen water which subsequently spilled all over my entire desk and everything on it, including my mouse. I dried everything off and my mouse seemed to work still which was nice, and I didn&rsquo;t think much of it. A bit later when I was programming, I noticed something a bit odd. 
My scroll wheel would stop scrolling for a few lines every now and then, and I needed to fix that.</p>
<p><img src="/img/blog/mouse_electronics.jpg" alt="The insides of the mouse"></p>
<h2 id="the-repair">The Repair</h2>
<p>First, I tried cleaning it with isopropyl alcohol and drying it with compressed air and a paper towel, which didn&rsquo;t seem to fix the issue. I couldn&rsquo;t think of much else to try except to take it apart and try to fix it, so I did just that. It was about 11PM so I set up a desk lamp and brought out all of my electronics repair tools, and got to work. The first step in disassembling a G502 is to remove the pads on the bottom, which can easily be done with a spudger, and unscrew the screws underneath. After this, I used spudgers and prying picks to open the mouse the rest of the way. I was then able to dry out and clean the inside of the top cover, and I was able to carefully clean the electronics. I started with the scroll wheel since that was the main issue, but I also noticed a lot of moisture on the primary and secondary click buttons, which I also cleaned off. I plugged the mouse back in, and the scroll wheel seemed to work again. After reassembling the entire mouse, I noticed something else was off, which was middle click. I hadn&rsquo;t tested this, and it turns out that it had somehow broken. I then proceeded to disassemble the entire mouse again, and now I had to fix the middle mouse click. I checked it visually and couldn&rsquo;t see anything wrong, so I figured I might as well just try to take off the entire scroll wheel assembly, clean it, and then reseat it. In order to take off the scroll wheel assembly on a G502, you need to use a pointy spudger or something similar to push the back pin out, and then you can lift off the scroll wheel assembly, taking care not to lose the two tiny springs at the tip of the mouse. I cleaned every surface of both the wheel and the electronics under the wheel that I couldn&rsquo;t access before. I then plugged in the mouse and tested middle click by manually pressing the gold button that the scroll wheel presses down on, and it seemed to work.
I then reseated the scroll wheel assembly, and after testing it again, I was able to successfully put the mouse back together.</p>
<h2 id="results">Results</h2>
<p>After my repair, I was able to fix both the scrolling issue as well as the middle mouse click issue. I did end up having to order new bottom pads for the mouse, however they were only around $12 so it wasn&rsquo;t too bad. Overall, it was a pretty fun experience to take one of these apart.</p>
]]></content:encoded></item></channel></rss>