Blog on Tabulate

Python Bug

Mon, 24 Feb 2025 01:03:22 -0500

I was recently working on my Vapor project, which is a TUI program written in Python using Textual. I was updating the project’s typing to utilize newer features which were introduced in Python 3.10, such as using the bitwise OR operator instead of Union. This involved rewriting things like x: Union[str, int] as x: str | int. In this process, I came across the following piece of code:

yield Container(
    ...,
    DataTable[Union[str, Text]](zebra_stripes=True)
)

The DataTable in Textual accepts a generic type parameter, which in my case is a string or a Text object. This seems like it’d be pretty easy to update, so I rewrote it as this:

yield Container(
    ...,
    DataTable[str | Text](zebra_stripes=True)
)

After this, I realized that in order to backport this behavior into Python versions before 3.10, you need to add from __future__ import annotations to the top of each file which uses this newer typing syntax. From my understanding, this sets a compiler flag which stores annotations as strings instead of evaluating them at runtime, allowing static type checkers to still read the types while the interpreter ignores them when the program runs. After adding this and running my unit tests in Python 3.9, I found that the DataTable generic type was raising a TypeError. I looked around for a while, eventually coming to the conclusion that this might be a bug in Python itself. I was then able to produce the following minimal reproducible example:

from __future__ import annotations
from typing import Generic, TypeVar

T = TypeVar('T')

class Node(Generic[T]):
    x = None

    def __init__(self, label: T = None) -> None:
        pass

    def __str__(self) -> str:
        return str(self.x)

print(Node[str | int](''))

This example will raise a TypeError in Python 3.9. I thought about fixing this bug, however with Python 3.9 being EOL in October, they’re only accepting security fixes. While talking about this with some others, the only other possible conclusion that we could come to is that this behavior is intentional. Technically, this Node[str | int] syntax could be valid in 3.9 if you had a metaclass which defined __getitem__ and then indexed into a class’s attributes with an object that defined __or__. Such an example could be something like this:

class Subscriptable(type):
    def __getitem__(self, item):
        return self.__dict__[item]

class Subscript(metaclass=Subscriptable):
    testing = '1'

class BitwiseORString:
    def __init__(self, data):
        self.data = data

    def __or__(self, other):
        if isinstance(other, BitwiseORString):
            return self.data + other.data
        return ''

s1 = BitwiseORString('test')
s2 = BitwiseORString('ing')
print(Subscript[s1 | s2])
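For what it’s worth, here is a small sketch of what the future import does and doesn’t cover, as far as I understand PEP 563. The function name label is made up for illustration, and the source is run through exec only so that the future import stays the first statement of its own compilation unit:

```python
# The source is exec'd from a string so that the __future__ import is the
# first statement of its own compilation unit, as Python requires.
SOURCE = '''
from __future__ import annotations

def label(value: str | int) -> None:
    pass
'''

namespace: dict = {}
exec(SOURCE, namespace)

# With the future import, annotations are stored as plain strings and are
# not evaluated, so `str | int` is harmless inside them even on 3.9.
stored = namespace["label"].__annotations__
print(stored)
```

The call Node[str | int]('') in my example is not an annotation, though. It’s an ordinary expression, so str | int is still evaluated when that line runs, and on 3.9 plain classes like str don’t implement the | operator yet.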

In my personal opinion (if I were designing the language), this seems like something that would be too inconsistent to leave out, especially since this generic syntax works for singular types in Python 3.9, just not when they’re OR’d together. This means that the parser has the ability to differentiate between the two; it just seems that they’ve forgotten about this edge case. You could argue that they intentionally left this out to avoid breaking code that was using something like this. However, if you’re using from __future__ import annotations, I would guess that you’re using it for backwards compatibility with older versions of Python, so you’d want your entire codebase to behave the same way instead of having weird discontinuities like this. Thankfully, the fix is pretty simple: you can just quote the types yourself like so:

yield Container(
    ...,
    DataTable["str | Text"](zebra_stripes=True)
)
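The same workaround can be sketched against a cut-down version of the Node class from my minimal example (the label attribute and the QuotedNode name are only there for illustration). Because the subscript is a string, nothing has to evaluate str | int at runtime:

```python
from typing import Generic, TypeVar, get_args

T = TypeVar("T")


class Node(Generic[T]):
    def __init__(self, label: T) -> None:
        self.label = label


# The subscript is a string, so `str | int` is never evaluated as an expression.
QuotedNode = Node["str | int"]
node = QuotedNode("")

print(type(node).__name__)   # Node
print(get_args(QuotedNode))  # a single forward reference wrapping the string
```

Static type checkers still read the quoted string as a union, so the typing information isn’t lost.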

If nothing else, maybe this will help someone who also comes across the same issue, as I couldn’t really find much talk about this online. The associated PR can be found here.

Rewriting the randfacts duplicate facts test

Mon, 18 Nov 2024 14:44:19 -0500

Recently, I was working on a Python web backend project for work, and I noticed something strange with the LSP I was using, Pyright. For some reason, it couldn’t automatically detect and import modules that I referenced. This seemed like a pretty standard and basic feature, so after a quick search, I stumbled upon microsoft/pyright#4263. Someone posted an issue asking Microsoft about why this feature wasn’t available in Pyright, and they responded with this:

This is a language service feature that is included in pylance, Microsoft’s premium Python language server for VS Code. We don’t have plans to port it to pyright. If you want this functionality, please switch to pylance.

This was pretty annoying, as I have switched to Neovim as my primary editor, and I didn’t want to switch back to VSCode. Fortunately, I learned about basedpyright in the same issue, and the author commented that they had pushed an update to the LSP which added this feature. Along with this, it also seemed to give more warnings about typing issues in the code, so I started going through some of my projects and transitioning them to be fully typed. Eventually, I got to my randfacts Python module and this is where the story really begins.

Some Background

Randfacts is a Python module that I created with a very simple purpose, which is to provide a developer with an easy-to-use interface to a database of random facts. I had made this for a Discord bot, as nothing else existed at the time, and I wasn’t expecting it to actually be anything. At the same time, I was also starting to learn about publishing PyPI modules, so I figured I’d throw it up on there to learn how the whole publishing process worked. After a while, however, I noticed that the downloads started going up a lot more than I expected, so I started maintaining the project some more, and I eventually got to where I am today. At the time of writing this, the module has about 1.2 million downloads, which isn’t a ton compared to some other modules, but it’s pretty cool to me.

The checkduplicates test

After a while of maintaining the module, I noticed a problem. Since the facts were being scraped off of the web, I inevitably ended up with some duplicates. To address this, I wrote a test in Python that would go through all of the facts and use the Levenshtein distance algorithm to compute the similarity between each pair of strings. On top of Levenshtein distance, I used a token sort ratio preprocessor, which normalizes each string by converting it to lowercase, removing any punctuation, and sorting its tokens, because this usually gave more accurate results. With this method of string similarity checking, I could accurately match strings with the same meaning but different wording, such as “Jupiter is the biggest planet in the solar system” and “The biggest planet in the solar system is Jupiter”. This test worked fine for a while, but every time I added another fact, it needed to be compared with every fact before it. With the current list of over 7,000 facts, the test needs to compute about 27.5 million string comparisons. The Python version of the test could compute about 400k-500k string comparisons per second, which ended up taking a bit over a minute just to check for duplicate facts.
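To show why the preprocessing helps with reworded facts, here is a minimal sketch of the normalization step. The token_sort helper is a hypothetical name and this is not the actual randfacts code; it assumes that “token sort” means lowercasing, stripping punctuation, and sorting the whitespace-separated tokens:

```python
import string


def token_sort(text: str) -> str:
    """Lowercase, strip punctuation, then sort the whitespace-separated tokens."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(cleaned.split()))


a = token_sort("Jupiter is the biggest planet in the solar system")
b = token_sort("The biggest planet in the solar system is Jupiter")

print(a)       # biggest in is jupiter planet solar system the the
print(a == b)  # True
```

Once both facts are normalized like this, the two Jupiter sentences become identical, so the edit distance between them is zero.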

In comes Rust. When I was first learning Rust, I started to rewrite this test, as I thought using a compiled language would at least provide a small benefit in computation time, but this doesn’t address the underlying problem of why the test is so slow. When I came back to randfacts with my new LSP, I rediscovered this half-finished implementation and decided that it would be fun to finish now that I know more about Rust.

Finishing the Rewrite

My goal was for the Rust test to have similar, if not the same functionality as the Python test.

Algorithm Optimizations

The first problem I addressed was the efficiency of the Levenshtein distance algorithm. Since Levenshtein distance is originally a mathematical definition and wasn’t designed for programming, computing it directly isn’t particularly efficient. This is where Wagner-Fischer comes in. Wagner-Fischer is an algorithm for computing Levenshtein distance that uses dynamic programming to avoid redundant calculations. The textbook definition of Levenshtein distance is also recursive, which Wagner-Fischer is not, avoiding that extra recursive overhead. I chose to go with a Wagner-Fischer implementation that only uses two arrays instead of a full matrix to hopefully get even better performance. The full algorithm is below:

#[inline(always)]
fn wagner_fischer_2row(s1: &[char], s2: &[char]) -> usize {
    // Ensure s1 is the shorter sequence for optimization
    let (s1, s2) = if s1.len() < s2.len() {
        (s1, s2)
    } else {
        (s2, s1)
    };

    let len1 = s1.len();
    let len2 = s2.len();

    // handle empty string cases
    if len1 == 0 {
        return len2;
    }
    if len2 == 0 {
        return len1;
    }

    // Initialize two rows for the dynamic programming matrix
    let mut prev_row = vec![0; len2 + 1];
    let mut curr_row = vec![0; len2 + 1];

    // Initialize first row with incremental values
    (0..=len2).for_each(|i| {
        prev_row[i] = i;
    });

    // Fill the matrix using only two rows
    for (i, c1) in s1.iter().enumerate() {
        curr_row[0] = i + 1;

        for (j, c2) in s2.iter().enumerate() {
            curr_row[j + 1] = if c1 == c2 {
                // No edit needed
                prev_row[j]
            } else {
                // Take minimum of three possible operations (insert, delete, substitute)
                1 + prev_row[j].min(prev_row[j + 1]).min(curr_row[j])
            };
        }

        // Swap rows using mem::swap for better performance
        std::mem::swap(&mut prev_row, &mut curr_row);
    }

    prev_row[len2]

}

Tokenization Optimizations

To speed it up a bit more, I added the following check to the token_sort_ratio function:

if (len1 as f64 / len2 as f64) < 0.5 || (len2 as f64 / len1 as f64) < 0.5 {
    return 0.0;
}

This snippet checks whether one of the strings we’re comparing is less than half the length of the other. If it is, we can reasonably assume that the strings are different. While this may not always hold true, the performance gain is great enough to justify it being in the algorithm. This makes it so that on some comparisons we can completely skip the Wagner-Fischer computation, which is an O(m*n) algorithm, with m and n being the lengths of the strings.
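The same early exit can be sketched in Python. The function name lengths_too_different and the cutoff parameter are made up for illustration; comparing the shorter length against the longer one is equivalent to the two-sided check above:

```python
def lengths_too_different(len1: int, len2: int, cutoff: float = 0.5) -> bool:
    """Return True when the shorter length is under `cutoff` times the longer one."""
    if len1 == 0 or len2 == 0:
        # Python raises ZeroDivisionError on division by zero, so empty inputs
        # are handled explicitly: an empty string only passes against another
        # empty string.
        return len1 != len2
    return min(len1, len2) / max(len1, len2) < cutoff


print(lengths_too_different(40, 100))  # True  (40 / 100 = 0.4)
print(lengths_too_different(60, 100))  # False (60 / 100 = 0.6)
```

Any pair that returns True here can be given a ratio of 0.0 without ever running the O(m*n) distance computation.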

Iteration Optimizations

Other than the algorithm implementation, this may be the most important part to focus on. There are so many different ways to iterate over every combination of facts, so choosing the correct way is crucial to a fast algorithm. Let’s take a look at the iteration line by line:

// Generate all possible indices combinations
let indices: Vec<_> = (0..all_facts.len())
    .flat_map(|i| ((i + 1)..all_facts.len()).map(move |j| (i, j)))
    .collect();

Instead of generating an iterable structure that contains all of the facts pre-paired, we can generate all pairs of indices instead. The all_facts array contains a struct with information about the fact, such as the fact itself and the line number in the file where the fact can be located. The fact itself isn’t just a String, but rather an Arc. This allows us to have cheaper clones which is crucial for performance. Next, we can look at how these indices are used:

// Process combinations in parallel
indices
    .into_par_iter()
    .progress_with(pb)
    .filter_map(|(i, j)| {
        let facts = &all_facts;
        let fact1 = &facts[i];
        let fact2 = &facts[j];

        let ratio = token_sort_ratio(&fact1.fact, &fact2.fact);
        if ratio > SIMILARITY_THRESHOLD {
            Some((fact1.clone(), fact2.clone(), ratio))
        } else {
            None
        }
    })
    .collect()

If we take a look at the first part of this chain, the call to into_par_iter, we can see where a huge amount of the improved performance lies. I’m using a Rust library called Rayon which makes it incredibly easy to convert a sequential iterator into a parallel iterator. This means that instead of doing one string comparison at a time, I can take advantage of all of my CPU cores and do many comparisons at once, drastically reducing the time it takes to find duplicate facts.

The next part, the filter_map closure, is pretty simple. We take references to the facts to avoid copying/cloning while comparing, and calculate the similarity ratio. If it’s above the threshold, the pair is added to the removal list; otherwise it’s skipped. I found that a good threshold with this particular algorithm is 82.5.
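To make the index-pair idea from the start of this section concrete, here is a small Python sketch using placeholder data. It produces the same (i, j) pairs with i < j that the Rust flat_map generates, and shows why the amount of work grows quadratically with the number of facts:

```python
from itertools import combinations
from math import comb

facts = ["fact a", "fact b", "fact c", "fact d"]  # placeholder data

# Every unordered pair (i, j) with i < j, in the same order as the Rust flat_map.
index_pairs = list(combinations(range(len(facts)), 2))
print(index_pairs)
# [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

# The number of pairs is n * (n - 1) / 2, so doubling the facts roughly
# quadruples the number of comparisons.
print(len(index_pairs) == comb(len(facts), 2))  # True
```

Only the small integer pairs are materialized up front; the facts themselves are looked up by index inside the parallel loop.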

CI Caching

The one downside of the Rust version is that it takes time to compile, which can slow down the CI and defeat the purpose of having a faster test. To solve this issue, I used GitHub’s actions/cache action. Here’s the relevant section of the CI:

      - name: Cache checkduplicates binary
        uses: actions/cache@v4
        id: cache
        with:
          path: |
            tests/checkduplicates/target/release/checkduplicates
          key: ${{ runner.os }}-cargo-${{ hashFiles('tests/checkduplicates/Cargo.lock', 'tests/checkduplicates/Cargo.toml', 'tests/checkduplicates/src/**') }}
          restore-keys: |
            ${{ runner.os }}-cargo-

      - name: Build checkduplicates test
        if: steps.cache.outputs.cache-hit != 'true'
        run: |
          cd tests/checkduplicates
          cargo build --release

      - name: Check for duplicate facts
        run: ./tests/checkduplicates/target/release/checkduplicates

To explain this simply, the cache action will check if Cargo.toml, Cargo.lock, or anything in src/** has changed. If anything has, we’ll assume that the cache is expired and the test should be rebuilt, which you can see in the if condition and the two build commands of the “Build checkduplicates test” step. If the cache is not expired, the cached checkduplicates binary is placed in the appropriate place. After building, or if building is skipped, we then run the resulting binary. This allows us to skip the build time if nothing has changed in the test, while still letting it automatically build if something has changed.
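The idea behind a content-derived cache key can be sketched in a few lines of Python. This is only an analogy and not how GitHub implements hashFiles; the cache_key helper, the main.rs file, and the Linux-cargo prefix are all made up for illustration:

```python
import hashlib
import tempfile
from pathlib import Path


def cache_key(prefix: str, files: list[Path]) -> str:
    """Build a key that changes whenever the content of any input file changes."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.read_bytes())
    return f"{prefix}-{digest.hexdigest()}"


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "main.rs"  # stand-in for a file under src/
    source.write_text("fn main() {}\n")
    first = cache_key("Linux-cargo", [source])

    # Same content gives the same key, so the cached binary can be reused.
    unchanged = cache_key("Linux-cargo", [source]) == first

    # Edited content gives a new key, which is a cache miss and triggers a rebuild.
    source.write_text('fn main() { println!("hi"); }\n')
    changed = cache_key("Linux-cargo", [source]) != first

print(unchanged, changed)  # True True
```

Keying the cache on the inputs rather than on a timestamp means the binary is rebuilt exactly when something that affects it has been edited.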

Conclusion

After all of this work, was it worth it? Let’s let the numbers speak for themselves.

	Python Test	Rust Test
Approximate iterations/sec	550,000	2,200,000
Time Taken	48 seconds	12 seconds

This benchmark was performed on my Framework 16 Laptop. I have a Ryzen 7 7840HS @ 3.8GHz, 16GB of DDR5-5600 RAM, and I was using the “Performance” profile with power profiles daemon on Arch Linux. In this case, the Rust version of the test performed 4× faster than the Python version of the test.
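As a quick sanity check, the figures in the table line up with the roughly 27.5 million comparisons mentioned earlier. The variable names below are just for illustration:

```python
comparisons = 27_500_000  # approximate pair count quoted earlier in the post
python_rate = 550_000     # iterations/sec from the table
rust_rate = 2_200_000     # iterations/sec from the table

print(comparisons / python_rate)  # 50.0 seconds, close to the measured 48
print(comparisons / rust_rate)    # 12.5 seconds, close to the measured 12
print(rust_rate / python_rate)    # 4.0
print(48 / 12)                    # 4.0
```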

This metric, along with the CI caching, led to a huge performance gain in the duplicate fact checking. That’s all I have for now so hopefully you learned something or just enjoyed this post. The full source code for the new test can be found below, just note that I’ve pinned the commit so there may be a more up to date version on the master branch.

https://github.com/TabulateJarl8/randfacts/tree/5e6786e8b536efc2895880ce5f0e88a8f442454b/tests/checkduplicates

College Range Assignment

Wed, 17 Jan 2024 19:10:05 -0500

A friend of mine is enrolled in a college intro to programming course. This course had a very simple entrance test: they needed to write a program in any language to display the numbers 5-60 prefixed with “number “, like so:

number 5
number 6
number 7
number 8
...
number 60

After he told me about this assignment, I thought that it was pretty funny, and I wanted to write it in assembly as a joke. I started off by stealing some integer printing code that I had written for another project.

Trying Out Typst

Tue, 05 Dec 2023 00:51:22 -0500

Recently, I came across a new project, Typst. From their GitHub, “Typst is a new markup-based typesetting system that is designed to be as powerful as LaTeX while being much easier to learn and use”. It’s written in Rust, which I was immediately a fan of, and I was super interested in an alternative to LaTeX as I use it heavily for school papers. While LaTeX is super powerful, it can be annoying to set up, and the compile times can get slow when you start compiling 70-page documents. I started checking out their examples, and I couldn’t find an APA template, so I figured that writing one was a great way to start learning.

Similarities and Differences from LaTeX

I first noticed a few things that really set Typst apart from LaTeX. The first thing was how all of the packages I needed were just built into Typst. For the LaTeX version of the APA paper I was recreating, I needed to import 8 packages, some of which need to be manually installed by the person compiling the document. With Typst, I just needed to import the Cetz package for more advanced graphing stuff, as my paper included bar charts. The Cetz package is also included within Typst, so I didn’t have to install any extra dependencies. I also noticed that commands in Typst start with #, as opposed to LaTeX where they start with \. Typst has different elements like figures, blocks, and text, and the styling of these can be overridden with the show command. This can also be used very easily to dynamically override element styles. For example, the APA spec requires a specific and different type of heading for each different heading level (1: centered + bold, 2: align left + bold, 3: align left + italic, …). The show command is different from the set command, which allows you to configure different elements, for example, setting the global text size/font, or setting the spacing around lines. Below is an example of the usage of the show command to create APA headings.

In my LaTeX paper, I was able to set the document class to APA, which provided me with macros to create the title, such as author, affiliation, course, due date, etc. In Typst, I had to implement this myself since, obviously, there wasn’t any other template. However, scripting in Typst is much easier than in LaTeX and I’ll talk about that a little bit more later.

Graphs

One of the main components of my original APA paper was graphs that I created to showcase the research I did. Looking into what Typst had built in, there was some rudimentary graphing stuff, but I needed to import a third-party package, Cetz, that comes pre-bundled with Typst in order to get more complicated graphics. It’s the exact same in LaTeX, though, so that’s fine. I wrote a quick rule in my template to format figures according to the APA spec, and then started reimplementing my graphs. Graphs using Cetz are much more readable than graphs using Tikz/Pgfplots in LaTeX.

However, I noticed something strange after I finished writing the code. There was a lot of left padding on the graph, and it was difficult to fit some bigger graphs. I opened an issue (johannes-wolf/cetz#341) asking the developer about this, and he explained to me that he was currently in the process of rewriting the ColumnChart to be a wrapper around the Plot API. This would allow users to manually adjust the x-min and x-max values, solving the issue with the extra padding. This library needs a bit of work because of how new it is, but I could see it very easily evolving into a suitable replacement for Tikz.

Bibliography Issues

The next step was to complete the bibliography, and fortunately Typst has tons of bibliography formats built in, APA being one of them. The Typst developers have made an alternative format to BibTeX, called Hayagriva, which is just YAML. However, they also fully support using legacy BibTeX files from your old documents, which is useful since many automatic citation generators and other tools don’t support their newer format yet. After constructing a bibliography, I noticed another issue. When an author is missing, the APA style guidelines say that the source should be referenced by the source title, and when missing a date, it should include “(n.d.)”. There were a few issues that I was experiencing:

When provided an author but not a date, (n.d.) was missing
When provided a date but not an author, the source title was missing
When provided neither an author nor a date, the source title was missing but (n.d.) was present

After asking about this issue in the Typst Discord server, one of the maintainers of Hayagriva reached out and asked a few questions. Afterwards I was referred to an existing issue (typst/typst#2762). This issue documents my exact issue, and it’s currently being resolved, which is nice to see.

Indentation Issues

Typst has a known issue where the first paragraph under a heading ignores the indentation rules set by par(first-line-indent: size). This issue is currently being tracked (typst/typst#311), but in the meantime, I needed a workaround. After looking through the issue, I found some people who made workarounds, but there were issues with all of them. The closest one was pretty simple, but it added a bit of vertical spacing underneath each header. To counter this, I just added some negative vertical space right next to the added horizontal space. The following code snippet loops through all headings that are levels 1-3, and adds a 0.5in indent and -0.67in of vertical space:

Small Things

There are a few small things that are nice about Typst, and since they’re not big enough for their own section, I’ll just list them all here.

Typst has really nice errors. Coming from LaTeX that’s really not a high bar to pass, but it’s still nice nevertheless. They’re clear and concise, and point out the exact line that the issue lies on.
Typst scripting is much easier than scripting in LaTeX. As you can see from the few screenshots I’ve included, the code is readable and easy to understand, which is great for maintainability.
Typst has syntax highlighting. Not just this, but it also has inline syntax highlighting which is really nice.
Typst build times are much faster than build times in LaTeX.
There are no auxiliary files in Typst like in LaTeX. When I would compile a LaTeX document, I would get tons of auxiliary files, like .aux, .bbl, .blg, .fls, .out, and .log to name a few. Typst just has your .typ markup file, your bibliography if you have one, and then it generates a PDF without any of the extra junk that LaTeX uses.

Final Thoughts

Typst seems like a really nice tool and I’m excited to see how it matures. It definitely needs some refinement as you saw from the types of issues that were open, but it’s nothing that’s unfixable. They could also do with some more commands to reduce boilerplate, such as the \doublespacing command from LaTeX. In Typst, you need to implement double spacing yourself, and while it’s not too difficult and it’s only 2 lines of code, it would still be more friendly to beginners to add more commands like that. If you have a paper you need to write, or if you’re just curious, give Typst a try. My completed APA template can be found in my random-junk repository for now, until I decide if I want to put this into its own Typst package. https://github.com/TabulateJarl8/random-junk/blob/master/typst/apa.typ.

TiO2

Fri, 20 Oct 2023 00:29:03 -0400

Some of you may know of my TI-BASIC to Python transpiler, ti842py. While not very practical, this project was pretty fun for me to work on because I was having to find all of these different ways to implement TI-BASIC functions in Python. This project was based on a project that I found by thenaterhood called basically-ti-basic, which could decompile and (almost) compile the TI calculator .8XP files. He did a lot of the hard work of reverse engineering the bytecode, and his program helped me out a lot. I forked his project, reverse engineered some instructions that he missed, and then packaged it for PyPI so I could more easily use it in ti842py.

The Problem

My program worked fine for a while, and I implemented many features such as matrix support, Goto/Lbl, getKey, and many others. However, it was the goto support that eventually broke everything. I had been using a fork of snoack’s goto-statement Python module which modified the Python bytecode to allow for jumping to labels, and a recent Python update changed how the interpreter’s internal instructions work, which broke the goto module. Someone did fork the project to add support for Python 3.11, however if I switched to this fork, I would lose a nice feature from the fork I was using: goto into blocks. While it probably didn’t matter too much, I figured that this wasn’t maintainable and I should look for another solution.

The Solution

Since I’m a huge fan of Rust, I decided that I should rewrite my project in Rust, but do it better and do things correctly this time. I created a new project, and with the help of a friend, named it TiO2. The name is a play on “TI” from Texas Instruments, and the “Oxidize” trend in naming Rust projects, as the compound TiO2 is titanium dioxide. I’ve been working on this project a lot for the past few weeks, and I’m excited to see how far it’s come. At this point, I’ve completely rewritten the basically-ti-basic project in Rust, including fixing the compiler. This means that TiO2 will be able to both decompile .8XP files as well as compile to them from plain text (this is a lot harder than it sounds, barely any of the 8XP bytecode is documented). Since 2/3 features are complete, I’ve now moved on to the most difficult and largest part of the project, which is building the interpreter. I’ve opted to go with a bytecode interpreter rather than a plaintext interpreter or transpiling to a different language, as I feel that this is the most maintainable route to go. TI-BASIC can be represented in plaintext in too many different ways, and other programming languages can change, but the bytecode is going to remain the same, so if I implement it once, I (hopefully) never have to look at it again.

Where I Am Now

Currently, I’m trying to figure out the best way to implement the parser. If I was able to somehow parse the bytecode into postfix notation, that would be really helpful, however that sounds pretty difficult. I may take inspiration from postfix though, as it does seem like a smart idea if it was able to be done. I’m about to stop programming for the night, however the last thing that I was stuck on was trying to figure out a way to gather tokens together, such as in a number or a string. Since each number or character is only one byte, I need to find a way to group the tokens together that are all part of one object, such as the number -3.56 or the text in the command Disp "HELLO WORLD". I might come up with a list of which bytes represent functions, and if the interpreter comes across a function, it will add the following bytes to the top argument in an argument stack until a comma is reached, which signifies the end of an argument and the beginning of a new one. Once the end of the line or, in some cases, a closing parenthesis is reached, the arguments will be popped back into the function and then evaluated. That’s just one idea I have, but I suppose we’ll have to see what works out.
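As a rough sketch of that grouping idea (written in Python just to keep it short, and not code from TiO2), the tokens following a function could be collected like this. The group_arguments helper is hypothetical, and plain characters stand in for the real one-byte tokens:

```python
def group_arguments(tokens: list[str]) -> list[str]:
    """Collect single-character tokens into whole arguments.

    A comma outside of a string starts a new argument; a newline or a closing
    parenthesis outside of a string ends the argument list.
    """
    args = [""]
    in_string = False
    for tok in tokens:
        if tok == '"':
            # Quotes toggle string mode and stay part of the argument.
            in_string = not in_string
            args[-1] += tok
        elif in_string:
            # Inside a string, commas and parentheses are just text.
            args[-1] += tok
        elif tok == ",":
            args.append("")
        elif tok in (")", "\n"):
            break
        else:
            args[-1] += tok
    return args


print(group_arguments(list('"HELLO, WORLD",-3.56\n')))
# ['"HELLO, WORLD"', '-3.56']
```

This version doesn’t handle nested function calls, which is where an actual stack of argument lists would come in.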

G502 Hero Mouse Repair

Tue, 17 Oct 2023 00:24:15 -0400

This is the first blog post I’ve made so it might be a little strange until I get used to it. I use the G502 Hero mouse made by Logitech, and it’s the best mouse I’ve ever used. I won’t get too far into the details as to why but it’s just really good. Anyway, I was in my dorm and I was doing things on my computer like normal, when I reached out for my mouse and accidentally knocked over my cup full of ramen water which subsequently spilled all over my entire desk and everything on it, including my mouse. I dried everything off and my mouse seemed to work still which was nice, and I didn’t think much of it. A bit later when I was programming, I noticed something a bit odd. My scroll wheel would stop scrolling for a few lines every now and then, and I needed to fix that.

The Repair

First, I tried cleaning it with isopropyl alcohol and drying it with compressed air and a paper towel, which didn’t seem to fix the issue. I couldn’t think of much else to try except to take it apart and try and fix it, so I did just that. It was about 11 PM so I set up a desk lamp and brought out all of my electronics repair tools, and got to work. The first step in disassembling a G502 is to remove the pads on the bottom, which can easily be done with a spudger, and unscrew the screws underneath. After this, I used spudgers and prying picks to open the mouse the rest of the way. I was then able to dry out and clean the inside of the top cover, and I was able to carefully clean the electronics. I started with the scroll wheel since that was the main issue, but I also noticed a lot of moisture on the primary and secondary click buttons, which I also cleaned off. I plugged the mouse back in, and the scroll wheel seemed to work again. After reassembling the entire mouse, I noticed something else was off, which was middle click. I hadn’t tested this, and it turns out that it had somehow broken. I then proceeded to disassemble the entire mouse again, and now I had to fix the middle mouse click. I checked it visually and couldn’t see anything wrong, so I figured I might as well just try to take off the entire scroll wheel assembly, clean it, and then reseat it. In order to take off the scroll wheel assembly on a G502, you need to use a pointy spudger or something similar to push the back pin out, and then you can lift off the scroll wheel assembly, taking care not to lose the two tiny springs at the tip of the mouse. I cleaned every surface of both the wheel and the electronics under the wheel that I couldn’t access before. I then plugged in the mouse and tested middle click by manually pressing the gold button that the scroll wheel presses down on, and it seemed to work. I then reseated the scroll wheel assembly, and after testing it again, I was able to successfully put the mouse back together.

Results

My repair fixed both the scrolling issue and the middle mouse click issue. I did end up having to order new bottom pads for the mouse, however they were only around $12 so it wasn’t too bad. Overall, it was a pretty fun experience to take one of these apart.