<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://johnlarkin1.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://johnlarkin1.github.io/" rel="alternate" type="text/html" /><updated>2026-04-03T23:33:37+00:00</updated><id>https://johnlarkin1.github.io/feed.xml</id><title type="html">Where will you go next?</title><subtitle>John Larkin&apos;s personal coding blog and thought journal. Exploring various ideas in software engineering, math, data, and more in my (limited) free time.
</subtitle><author><name>johnlarkin1</name></author><entry><title type="html">Vanity Mirror</title><link href="https://johnlarkin1.github.io/2026/larkin-vanity-mirror/" rel="alternate" type="text/html" title="Vanity Mirror" /><published>2026-02-08T00:00:00+00:00</published><updated>2026-02-08T00:00:00+00:00</updated><id>https://johnlarkin1.github.io/2026/larkin-vanity-mirror</id><content type="html" xml:base="https://johnlarkin1.github.io/2026/larkin-vanity-mirror/"><![CDATA[<h1 id="context">Context</h1>

<p>In the age of AI, shipping has become easier than ever. And also, borderline more addictive than ever. I will certainly rant about it in a Substack post at some point in the future, but as my projects grow, I wanted an easy way to keep track of various metrics (sadly, I’m curious and vain and want to see what people like). I built this “Vanity Mirror” as a way to do that, and figured it was fine to share publicly (even if another <a href="https://react2shell.com/">React2Shell</a> RCE vulnerability occurs… are hackers really gonna want limited read-only permissions to my Google Analytics properties?).</p>

<h1 id="demo">Demo</h1>

<p>Feel free to check out the website here:</p>

<div class="project-registry">
  <a href="https://larkin-vanity-mirror.vercel.app" target="_blank" rel="noopener" class="registry-card web">
    <span class="lang-icon">🪞</span>
    <span class="lang-badge">Web</span>
    <span class="registry-name">Vanity Mirror</span>
  </a>
</div>

<p>But also it’s embedded here:</p>

<div class="vanity-mirror-iframe-wrapper">
  <iframe class="vanity-mirror-iframe" src="https://larkin-vanity-mirror.vercel.app/blog" title="Larkin Vanity Mirror Dashboard" width="1440" height="900" loading="lazy" allowfullscreen="">
  </iframe>
</div>

<h1 id="domain-name">Domain Name?</h1>

<p>I was too lazy / too broke to buy an official domain (although I’m sure the market for <code class="language-plaintext highlighter-rouge">larkin-vanity-mirror.xyz</code> can’t be too high). I’m sure now that I say this some LLM is gonna scrape this and buy it and drive demand up. c’est la vie.</p>

<h1 id="favorite-part">Favorite Part</h1>

<p>Regardless, my favorite part about this is that I took the shortcut of making this a <a href="https://developer.mozilla.org/en-US/docs/Web/Progressive_web_apps">PWA</a> so now it’s very easily hooked up into my mobile experience.</p>

<div class="video-container-mobile">
  <div class="video-wrapper-dark">
    <video src="https://www.dropbox.com/scl/fi/f6jbjd325w1irgknjw4l5/vanity-mirror-screen-recording.mp4?rlkey=swt2k91wtift2epfeqtf2z3oh&amp;st=gl332k68&amp;raw=1" muted="" autoplay="" loop="" controls="">
    </video>
  </div>
</div>

<p>To download, just follow these steps:</p>

<ol>
  <li>Go to <a href="https://larkin-vanity-mirror.vercel.app/"><strong>larkin-vanity-mirror.vercel.app</strong></a> on your mobile device</li>
  <li>Click context menu three dots in bottom right (on newer iOS)</li>
  <li>Click <code class="language-plaintext highlighter-rouge">Share</code></li>
  <li>Scroll down and go to <code class="language-plaintext highlighter-rouge">Add to Home Screen</code></li>
  <li>Voila 🎉</li>
</ol>

<p>Feel free to email / let me know if there’s enough interest and I can try to generalize it. Although honestly, at this point, jinja doesn’t seem to have much value over just ripping CC.</p>

<p>Thanks!</p>]]></content><author><name>johnlarkin1</name></author><category term="Development" /><category term="Reflection" /><summary type="html"><![CDATA[Context]]></summary></entry><entry><title type="html">Multi Armed Bandit</title><link href="https://johnlarkin1.github.io/2026/multi-armed-bandit/" rel="alternate" type="text/html" title="Multi Armed Bandit" /><published>2026-02-01T00:00:00+00:00</published><updated>2026-02-01T00:00:00+00:00</updated><id>https://johnlarkin1.github.io/2026/multi-armed-bandit</id><content type="html" xml:base="https://johnlarkin1.github.io/2026/multi-armed-bandit/"><![CDATA[<!-- 
<div class="markdown-alert markdown-alert-caution">
<p>This was meant to be for a take-home... I submitted some version of the first draft, but then couldn't stop and here we are. If parts trail off, it's because I shouldn't have even gone this deep into other more pressing matter
</p>
</div>

<br> -->

<div class="project-registry">
  <a href="https://github.com/johnlarkin1/multi-armed-bandit" target="_blank" rel="noopener" class="registry-card github">
    <span class="lang-icon">🐙</span>
    <span class="lang-badge">Source</span>
    <span class="registry-name">GitHub</span>
  </a>
</div>

<h1 id="motivation">Motivation</h1>

<p>Here is a motivating visual to build up some momentum to read on. This is our dashboard tool to compare various multi-armed bandit strategies. We’ll understand this more thoroughly at the end of this blog post.</p>

<div class="video-container">
  <div class="video-wrapper-dark">
    <video src="/videos/multi-armed-bandit/multi-armed-bandit.mp4" type="video/mp4" muted="" autoplay="" loop="" controls="" style="width: 100%; height: auto;">
    </video>
  </div>
</div>

<h1 id="context">Context</h1>

<p>Recently, I responded to some recruiters and fielded a couple of interviews.</p>

<p>I generally abhor interviewing. There are parts I absolutely love - meeting new people, learning about new technical challenges, studying up on businesses or industries - but there are also parts I <em>abhor</em>. Getting grilled on usage of the Web Speech API (man oh man was I in the wrong interview) or how to <a href="https://leetcode.com/problems/decode-string/description/">decode a string</a> in 2026 does feel… a bit perplexing. I’ll rant about it on Substack at some point in time.</p>

<p>However! I do genuinely enjoy take homes (as exemplified by <a href="/2024/book-brain">Book Brain</a>), despite them often being a bigger time constraint and more of a commitment.</p>

<p>This blog post is going to go over a concept and problem that (embarrassingly enough), I hadn’t yet seen before the take home. For more context, I had accepted another offer in the same timeframe, and withdrew from this specific takehome process. It’s unfortunate too because I do genuinely believe the company will be a $10BN company in no time, and the engineering seems fascinating.</p>

<p>While I ultimately withdrew from this interviewing cycle, and sent them only my thoughts on the problem, this blog post is going to talk about a take home question I received from that company. I’m anonymizing the company to keep the sanctity of their interview process.</p>

<p>The company restricted AI usage during the take-home, so I did a ton of research / watched YouTube videos. However, for this blog post, some details of implementation will be left to Claude. The repo includes documentation and detail, with various transcripts between Claude and me. So let’s begin with the problem.</p>

<h1 id="setup">Setup</h1>

<p>This blog post is going to focus on the <a href="https://en.wikipedia.org/wiki/Multi-armed_bandit">multi-armed bandit</a> problem, which is commonly abbreviated as MAB. There is a lot here, so I won’t be able to cover everything, but I’ll cover the parts that the corresponding Github repo covers.</p>

<h1 id="multi-armed-bandit-problem-mab">Multi-Armed Bandit Problem (MAB)</h1>

<p>The <a href="https://en.wikipedia.org/wiki/Multi-armed_bandit">traditional multi-armed bandit</a> is pretty well encapsulated by a hypothetical situation. I’ll give you the long / fun version, and then I’ll give you an abbreviated Wikipedia version.</p>

<hr />

<p>Imagine, you wake up.</p>

<p><img src="/assets/svg/multi-armed-bandit/life-is-full.svg" alt="Life is full" class="center-super-medium-shrink lightbox-image" /></p>

<p>You live in a beautiful city (let’s say Cincinnati).</p>

<p><img src="https://assets.simpleviewinc.com/sv-cincy/image/upload/c_fill,h_840,q_75,w_1200/v1/cms_resources/clients/cincy/msvachphotography_Instagram_1244_ig_17864840302845987_89e72393-d2a6-4837-bb9c-865845b1366b.jpg" alt="Cincinnati Skyline" class="center-shrink lightbox-image" /></p>

<div class="image-caption">Kudos to @msvachphotography for the shot from Mt. Echo Park</div>
<p><br /></p>

<p>But then you realize you have too much money in your pockets. You decide to gamble (I discourage this, especially after seeing how the sausage is made).</p>

<p>So you hit the casino!</p>

<p><img src="/assets/svg/multi-armed-bandit/too-much-mula.svg" alt="Life is full" class="center-shrink lightbox-image" /></p>

<p>However, because it’s Cincinnati, this is a very nice casino. You actually have a chance to win. However, they only have single-armed bandits - commonly known as slot machines! These are unique slot machines, and their underlying probability distributions become more apparent over time.</p>

<p>Despite having too much money in your pockets, you love winning, so you do want to win. Your problem therefore is to figure out the optimal strategy for which machines to play, when to play those machines, how many times to play them, and when you need to switch.</p>

<hr />

<p>Wikipedia more blandly (but also more succinctly) puts this as:</p>

<blockquote>
  <p>More generally, it is a problem in which a decision maker iteratively selects one of multiple fixed choices (i.e., arms or actions) when the properties of each choice are only partially known at the time of allocation, and may become better understood as time passes. A fundamental aspect of bandit problems is that choosing an arm does not affect the properties of the arm or other arms.[4]</p>
</blockquote>

<h2 id="stochastic-mab-approaches">Stochastic MAB Approaches</h2>

<p>Before we go any further, let’s fully dissect this problem.</p>

<p>There are really two main focuses that I covered in code and fully studied up on. I will not be talking about $\epsilon$-greedy approaches, but here are <a href="https://www.geeksforgeeks.org/machine-learning/epsilon-greedy-algorithm-in-reinforcement-learning/">some</a> <a href="https://www.geeksforgeeks.org/machine-learning/epsilon-greedy-algorithm-in-reinforcement-learning/">other</a> <a href="https://www.geeksforgeeks.org/machine-learning/epsilon-greedy-algorithm-in-reinforcement-learning/">resources</a>. We’re actually going to focus on UCB vs Thompson Sampling, which are two methods that work very well. I’ll discuss further below, in the implementation, my thoughts on how I modified them to handle the take-home explicitly.</p>

<h3 id="upper-confidence-bound"><a href="https://en.wikipedia.org/wiki/Upper_Confidence_Bound">Upper Confidence Bound</a></h3>

<p>The theory behind UCB is that we are trying to optimistically explore. UCB1 is meant to balance the level of exploration vs exploitation.</p>

<p>I am not going to go into the full derivation, but it references something called <a href="https://en.wikipedia.org/wiki/Hoeffding%27s_inequality">Hoeffding’s Inequality</a> to build up a framework.</p>

<p>It eventually lets us get to:</p>

\[UCB_i(t) = \bar{x}_i + \underbrace{c \cdot \sqrt{\frac{\ln(t)}{n_i}}}_{\text{exploration bonus}}\]

<p>Where:</p>
<ul>
  <li>$\bar{x}_i$ = empirical success rate of server $i$</li>
  <li>$t$ = total number of requests across all servers</li>
  <li>$n_i$ = number of times server $i$ has been tried</li>
  <li>$c$ = exploration constant (default: $\sqrt{2}$)</li>
</ul>

<p>Normally, you’ll see this kind of folded up with $c$ being part of the square root, but that exploration bonus was key in my modified UCB approach.</p>
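<p>To make the formula concrete, here is a minimal Python sketch of UCB1 arm selection. This is my own illustrative version, not the repo’s exact code; the name <code class="language-plaintext highlighter-rouge">ucb1_select</code> is hypothetical.</p>

```python
import math

def ucb1_select(successes, pulls, t, c=math.sqrt(2)):
    """Pick the arm maximizing x_bar_i + c * sqrt(ln(t) / n_i).

    successes[i] / pulls[i] is the empirical success rate of arm i,
    t is the total number of pulls so far, and c scales the
    exploration bonus (default sqrt(2), as in the formula above).
    """
    best_arm, best_score = 0, float("-inf")
    for i, n_i in enumerate(pulls):
        if n_i == 0:
            return i  # an untried arm has an unbounded bonus: try it
        score = successes[i] / n_i + c * math.sqrt(math.log(t) / n_i)
        if score > best_score:
            best_arm, best_score = i, score
    return best_arm
```

<p>Tuning $c$ up makes the strategy explore longer; tuning it down makes it exploit the current best arm sooner.</p>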

<h3 id="thompson-sampling"><a href="https://en.wikipedia.org/wiki/Thompson_sampling">Thompson Sampling</a></h3>

<p>With this approach, the derivation can actually make a bit more sense (in my opinion). It’s also (probably relatedly) the approach I like the most.</p>

<p>We model the process for the specific outcome of the arm $a$ as a <a href="https://en.wikipedia.org/wiki/Bernoulli_distribution">Bernoulli distribution</a>. Basically, it means we have a $p$ probability of getting a 1 (in this case, a reward; in our specific case further down, a successful downstream server request). The value 0 has a probability $q = 1 - p$ of occurring.</p>

<p>We can then model this uncertainty about the Bernoulli parameter $p$ as a <a href="https://en.wikipedia.org/wiki/Beta_distribution">beta distribution</a>. We’re trying to figure out the probability $p$ for each arm $a$ (or further on as we’ll see, the downstream server).</p>

<p>Think of using our beta distribution as a heuristic for what we actually think about each arm. With Thompson sampling, we’re basically maintaining a best-guess distribution for each of the arms and updating it as we go and learn more information. I believe the technical term for this is <em>conjugate prior</em>: the beta distribution is conjugate to the Bernoulli likelihood, so our posterior is also a beta distribution.</p>

<p>Formally, the beta distribution has an $\alpha$ and a $\beta$ that control the shape of the distribution. They are exponents of the variable and the variable’s complement respectively. So again, this can be written as:</p>

\[f(x; \alpha, \beta) = \text{constant} \cdot x^{\alpha - 1} \, (1 - x)^{\beta - 1}\]

<p>Then our logic is pretty straightforward given how we’re modeling this. For every success of the arm, we can update our $\alpha$ with a simple $\alpha' = \alpha + 1$, and for every failure, we can update our $\beta$ (given it’s modeling the complement) as $\beta' = \beta + 1$.</p>
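<p>In code, the whole Beta-Bernoulli loop is only a few lines. Here is a hedged sketch (my own illustrative version, not the repo’s exact strategy class; the function names are hypothetical):</p>

```python
import random

def thompson_select(alphas, betas, rng=random):
    """Draw one sample from each arm's Beta(alpha, beta) posterior
    and play the arm whose sampled success probability is highest."""
    draws = [rng.betavariate(a, b) for a, b in zip(alphas, betas)]
    return max(range(len(draws)), key=draws.__getitem__)

def thompson_update(alphas, betas, arm, success):
    """A success bumps alpha; a failure bumps beta (the complement)."""
    if success:
        alphas[arm] += 1
    else:
        betas[arm] += 1
```

<p>Starting from $\alpha = \beta = 1$ (a uniform prior), the posteriors sharpen around each arm’s true $p$ as outcomes accumulate, which is exactly what the visualization below animates.</p>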

<p>A picture is worth a thousand words, so an interactive visualization must be worth at least a million right? This is a Claude generated vanilla JS + Chart.js artifact. I’d recommend autoplaying or doing the <code class="language-plaintext highlighter-rouge">Run Thompson Round</code>, but you can also see results by adding success and failures to the various arms. The main point is that you’ll see how our beta distributions should steadily converge to the real $p$ with increasing accuracy.</p>

<div class="interactive-beta-viz" data-arms="3" data-true-probs="0.7,0.4,0.55"></div>

<h1 id="multi-armed-bandit-variants">Multi-Armed Bandit Variants</h1>

<p>The situation I described above is really the stochastic MAB. There’s a finite set of arms, and the reward distribution is unknown. As I learned throughout this process, there are many variants and generalizations of this problem. Specifically, these are <em>generalizations</em> where the MAB is extended by adding some information or structure to the problem. Namely:</p>

<ul>
  <li><a href="https://en.wikipedia.org/wiki/Multi-armed_bandit#Adversarial_bandit">adversarial bandits</a>
    <ul>
      <li>this is probably my favorite variant. the notion is that you have an adversary that is trying to <strong>maximize</strong> your regret, while you’re trying to minimize your regret. so they’re basically trying to trick or con your algorithm.</li>
      <li>if you’re asking yourself (like I did), ok well then why doesn’t the adversary just assign $r_{a,t} = 0$ as the reward function for all arms $a$ at time $t$, well… you shouldn’t really think about it in terms of reward. Reward is relative. We instead want to think about it in terms of <em>regret</em> which I’ll talk more about later. There are two subvariants (<a href="https://www.cs.cornell.edu/~rdk/papers/anytime.pdf">oblivious adversary</a> and <a href="https://ui.adsabs.harvard.edu/abs/2006cs........2053D/abstract">adaptive adversary</a>), but we’re not going to discuss those - although a very interesting extension is the <a href="https://en.wikipedia.org/wiki/Multi-armed_bandit#:~:text=%5Bedit%5D-,Exp3,-%5Bedit%5D">EXP3</a> algorithm.</li>
    </ul>
  </li>
  <li><a href="https://towardsdatascience.com/an-overview-of-contextual-bandits-53ac3aa45034/">contextual bandits</a>
    <ul>
      <li>the notion here is that instead of learning $E[r \mid a]$ where again $r$ is the reward and $a$ is the arm you pick, you’re learning $E[r \mid x, a]$ where $x$ is some additional bit of context at time $t$ that you’re exposed to.</li>
    </ul>
  </li>
  <li><a href="https://doogkong.github.io/2017/slides/Yue.pdf">dueling bandits</a>
    <ul>
      <li>an interesting variant where instead of being exposed to the reward, your information is limited to just picking two bandits and only knowing which one is better comparatively… but again it’s stochastic. So you can inquire about the same two arms and it’s very feasible that you’ll get different results for the comparison. The whole notion is that you’re building up this preference matrix. Seems like an incredibly difficult problem.</li>
    </ul>
  </li>
</ul>

<h1 id="bandit-with-knapsack-bwk-variant">Bandit with Knapsack (BwK) Variant</h1>

<p>I’m going to preempt the reader and discuss another variant, where I’ll spend a bit more time. That model is the Bandit with Knapsack problem.</p>

<p>The original paper is from <a href="https://sites.google.com/site/ashwinkumarbv/home">Ashwinkumar Badanidiyuru</a>, <a href="https://www.cs.cornell.edu/~rdk/">Robert Kleinberg</a>, and <a href="https://scholar.google.com/citations?user=f2x233wAAAAJ&amp;hl=en">Aleksandrs Slivkins</a>. People who I’d love to be an iota as smart as. You can see the paper <a href="https://www.alphaxiv.org/abs/1305.2545">here</a>. It’s a 55-page paper, and I’d be lying if I said I read past the <strong>Preliminaries</strong> section. Section 3+ have some heavy math that is over my head.</p>

<p>The problem statement is relatively simple though. Your arms now have resources associated with them that they consume. I honestly think it’s easier to draw it out mathematically and reference the actual paper (also, shoutout to <a href="https://www.alphaxiv.org/">alphaxiv</a>; it’s got most of the normal arXiv features, just with some AI-native question answering and highlighting, which has been nice).</p>

<h2 id="formal-declaration">Formal Declaration</h2>

<p>I’d like to state that the paper starts out with the generalized form of <em>many</em> resources being managed and consumed. It makes sense given it’s a professional paper and the general case is more interesting. However, you can imagine $d$ being 1 and that we have a single resource that we’re managing.</p>

<p>So again, we have a finite set $X$ of arms from 1 to $m$. An individual arm can be declared as $x$. Formally, we can say</p>

\[X = \{ 1,\, 2,\, \ldots,\, x, \, \ldots, \,m-1,\, m \}\]

<p>There are $T$ rounds (which, interestingly enough, is known ahead of time in this variant). So $t$ indexes the round (one round per time increment).</p>

\[t \in \{1,\,2,\, \ldots,\, T-1,\, T \}\]

<p>There are $d$ resources, where $d \geq 1$, indexed by $i$ from $1,\, \ldots,\, d$. (The $d$ in our specific example is still going to be the number of servers, because each server has its own rate limit.)</p>

<p>So the problem now changes because at round $t$ when arm $x$ is pulled we now don’t just get a reward, but we instead get a reward and a consumption vector indicating how much of the resources were consumed. In other words,</p>

\[\left( r_t, c_{t,1}, \ldots , c_{t,d} \right)\]

<p>The paper declares this as $\pi_x$ where $\pi_x$ is an <strong>unknown latent distribution</strong> over $[0,1]^{d+1}$.</p>

<p>Now “latent spaces” have gotten a ton of usage since LLMs blew up, but basically this just means there is some distribution, and it is fixed, but it’s unknown to the learner.</p>

<p>Just to also break down the syntax, since $[0,1]^{d+1}$ can be a bit misleading: this just means</p>

\[[0,1]^{d+1} = \underbrace{[0,1] \times [0,1] \times \cdots \times [0,1]}_{d+1\ \text{times}}\]

<p>So it’s really just a vector of length $d+1$ (the +1 is because we have $d$ resources, but then one reward $r$, so it’s kind of a shorthand).</p>

<p>$\pi_x$ is a <strong>joint probability distribution</strong> over $(r, c_1, \ldots, c_d)$, i.e.,
\((r,\, c_1,\, \ldots,\, c_d) \sim \pi_x\)</p>

<p>meaning when you pull an arm, you draw one vector from this distribution.</p>

<p>This of course leads us to budgeting. Each resource $i$ has a budget $B_i \geq 0$.</p>

<p>The overall process stops as soon as we have exhausted <strong>ANY</strong> resource budget.</p>
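<p>Putting the pieces together, the BwK interaction loop looks something like this. This is a generic harness I am sketching purely for illustration (not from the paper or my repo); <code class="language-plaintext highlighter-rouge">select_arm</code> stands in for whichever BwK strategy you plug in.</p>

```python
def run_bwk(arms, budgets, T, select_arm):
    """Pull arms for up to T rounds, stopping as soon as ANY
    resource budget is exhausted.

    arms[x]() returns one draw (reward, [c_1, ..., c_d]) from the
    latent distribution pi_x; budgets has one entry per resource.
    """
    remaining = list(budgets)
    total_reward = 0.0
    for t in range(1, T + 1):
        x = select_arm(t, remaining)
        reward, costs = arms[x]()  # one sample from pi_x
        total_reward += reward
        for i, c in enumerate(costs):
            remaining[i] -= c
        if any(b <= 0 for b in remaining):
            break  # some budget B_i just ran out
    return total_reward, remaining
```

<p>Note that the strategy gets to see the remaining budgets each round, which is exactly the signal the primal-dual approach below exploits.</p>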

<h2 id="algorithms-presented">Algorithms Presented</h2>

<p>The paper presents two algorithms - <code class="language-plaintext highlighter-rouge">BalancedExploration</code> and <code class="language-plaintext highlighter-rouge">PrimalDualWithBK</code>.</p>

<h3 id="balancedexploration"><code class="language-plaintext highlighter-rouge">BalancedExploration</code></h3>
<p>At a high level, <code class="language-plaintext highlighter-rouge">BalancedExploration</code> tries to explore as much as possible while avoiding suboptimal strategies. It tries to converge to an LP-perfect distribution. LP-perfect here refers to an LP relaxation called LP-primal (LP = linear programming). So basically, if they can reduce some of the constraints in this LP-primal approach, then they can have an optimal algorithm. This LP-primal not only reduces the constraints; it also assumes that we know the average reward for each arm (removing the uncertainty), and it lets us perform fractional tasks rather than full tasks (this gets into the integer programming formulation, which is helpful for the second part).</p>

<p>The algorithm is “simple,” as the authors put it, but somewhat abstracted. In each phase, it eliminates any mix of tasks that is obviously not LP-perfect. It maintains a confidence interval of potentially LP-perfect distributions.</p>

<p>Then for each task, it tries to explore that task as much as possible, and gathers the information. It then repeats until it runs out of time or resources.</p>

<p>Transparently, I get it at this level, but I don’t understand the underlying math pinning it. That confidence interval calculation is… unclear to me, and I don’t even have an implementation for it in my repo (which is the point of this post).</p>

<p>Ah, actually! After giving Claude enough context and framing for this, it does make sense for my repo. It’s still using UCB / LCB for reward and cost respectively, and then forming that as the score, i.e.:</p>

<details>
  <summary style="padding: 10px; border-radius: 5px; cursor: pointer; color: #D77656; font-weight: bold; border: 1px solid rgba(215, 118, 86, 0.4);">
    <svg xmlns="http://www.w3.org/2000/svg" width="20" height="17" viewBox="12 22 96 70" style="vertical-align: middle; margin-right: 8px;">
      <path d="M0 0 C23.76 0 47.52 0 72 0 C72 9.24 72 18.48 72 28 C75.96 28 79.92 28 84 28 C84 32.62 84 37.24 84 42 C80.04 42 76.08 42 72 42 C72 46.62 72 51.24 72 56 C70.02 56 68.04 56 66 56 C66 60.62 66 65.24 66 70 C64.02 70 62.04 70 60 70 C60 65.38 60 60.76 60 56 C58.02 56 56.04 56 54 56 C54 60.62 54 65.24 54 70 C52.02 70 50.04 70 48 70 C48 65.38 48 60.76 48 56 C40.08 56 32.16 56 24 56 C24 60.62 24 65.24 24 70 C22.02 70 20.04 70 18 70 C18 65.38 18 60.76 18 56 C16.02 56 14.04 56 12 56 C12 60.62 12 65.24 12 70 C10.02 70 8.04 70 6 70 C6 65.38 6 60.76 6 56 C4.02 56 2.04 56 0 56 C0 51.38 0 46.76 0 42 C-3.96 42 -7.92 42 -12 42 C-12 37.38 -12 32.76 -12 28 C-8.04 28 -4.08 28 0 28 C0 18.76 0 9.52 0 0 Z" fill="#D77656" transform="translate(24,22)" />
      <path d="M0 0 C1.98 0 3.96 0 6 0 C6 4.62 6 9.24 6 14 C4.02 14 2.04 14 0 14 C0 9.38 0 4.76 0 0 Z" fill="#000000" transform="translate(78,36)" />
      <path d="M0 0 C1.98 0 3.96 0 6 0 C6 4.62 6 9.24 6 14 C4.02 14 2.04 14 0 14 C0 9.38 0 4.76 0 0 Z" fill="#000000" transform="translate(36,36)" />
    </svg>
    Claude Code: BalancedExploration Explanation
  </summary>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>

 ▐▛███▜▌   Claude Code v2.1.17
▝▜█████▛▘  Opus 4.5 · Claude Max
  ▘▘ ▝▝    ~/Documents/coding/multi-armed-bandit

❯ /clear
  ⎿  (no content)

❯ Can you explain how the `BalancedExploration` BwK algorithm could be applied here?

⏺ The Balanced Exploration algorithm from the Bandits with Knapsacks (BwK) literature is a natural fit for your load balancer because you
  have a resource constraint built into the problem: the retry penalty system.

  BwK Problem Mapping
  ┌──────────────────────┬─────────────────────────────────────────────────────────────┐
  │     BwK Concept      │                     Your Load Balancer                      │
  ├──────────────────────┼─────────────────────────────────────────────────────────────┤
  │ Arms                 │ Downstream servers (T1/T2/T3 tiers)                         │
  ├──────────────────────┼─────────────────────────────────────────────────────────────┤
  │ Reward               │ Successful request (+1)                                     │
  ├──────────────────────┼─────────────────────────────────────────────────────────────┤
  │ Resource consumption │ Retries needed (penalty after 3 attempts)                   │
  ├──────────────────────┼─────────────────────────────────────────────────────────────┤
  │ Budget               │ Implicit via score formula: success - 0.5 * penalty_retries │
  └──────────────────────┴─────────────────────────────────────────────────────────────┘
  How Balanced Exploration Works

  Standard MAB algorithms (like your Thompson/UCB strategies) only track reward (success rate). BwK algorithms track both:

  1. Reward distribution: P(success | server)
  2. Cost distribution: E[retries needed | server]

  The key insight: a server with 80% success rate that usually fails on first attempt but succeeds on retry might be worse than a 70%
  success server that either works immediately or fails completely.

  Algorithm Sketch

  class BalancedExplorationStrategy(BaseStrategy):
      def __init__(self):
          # Track both success AND retry cost per server
          self.reward_estimates = {}  # E[success]
          self.cost_estimates = {}    # E[attempts needed]
          self.confidence_bounds = {}

      def select_server(self, servers, attempt_number):
          # Compute "bang per buck" with confidence bounds
          for server in servers:
              reward_ucb = self.reward_estimates[server] + confidence_bonus
              cost_lcb = max(self.cost_estimates[server] - confidence_bonus, epsilon)

              # Key BwK idea: optimize reward/cost ratio
              server.score = reward_ucb / cost_lcb

          # Balanced exploration: mix greedy + uniform exploration
          if should_explore():  # Based on remaining budget
              return uniform_random(servers)
          return argmax(server.score for server in servers)

</code></pre></div></div>
</details>

<h3 id="primaldualwithbk">PrimalDualWithBK</h3>

<p>The difference here is that we basically have a shadow parameter per resource that is the <em>dual vector</em> portion of this. It represents unit costs for resources and then helps balance resource consumption.</p>

<p>This $\lambda$ is our internal or shadow price that indicates how much we’re willing to pay for a given resource.</p>

<p>Primal-dual is different because it’s adaptive to our remaining budget. If we’re burning through resources too quickly, $\lambda$ rises and we do our best to avoid costly servers.</p>

<details>
  <summary style="padding: 10px; border-radius: 5px; cursor: pointer; color: #D77656; font-weight: bold; border: 1px solid rgba(215, 118, 86, 0.4);">
    <svg xmlns="http://www.w3.org/2000/svg" width="20" height="17" viewBox="12 22 96 70" style="vertical-align: middle; margin-right: 8px;">
      <path d="M0 0 C23.76 0 47.52 0 72 0 C72 9.24 72 18.48 72 28 C75.96 28 79.92 28 84 28 C84 32.62 84 37.24 84 42 C80.04 42 76.08 42 72 42 C72 46.62 72 51.24 72 56 C70.02 56 68.04 56 66 56 C66 60.62 66 65.24 66 70 C64.02 70 62.04 70 60 70 C60 65.38 60 60.76 60 56 C58.02 56 56.04 56 54 56 C54 60.62 54 65.24 54 70 C52.02 70 50.04 70 48 70 C48 65.38 48 60.76 48 56 C40.08 56 32.16 56 24 56 C24 60.62 24 65.24 24 70 C22.02 70 20.04 70 18 70 C18 65.38 18 60.76 18 56 C16.02 56 14.04 56 12 56 C12 60.62 12 65.24 12 70 C10.02 70 8.04 70 6 70 C6 65.38 6 60.76 6 56 C4.02 56 2.04 56 0 56 C0 51.38 0 46.76 0 42 C-3.96 42 -7.92 42 -12 42 C-12 37.38 -12 32.76 -12 28 C-8.04 28 -4.08 28 0 28 C0 18.76 0 9.52 0 0 Z" fill="#D77656" transform="translate(24,22)" />
      <path d="M0 0 C1.98 0 3.96 0 6 0 C6 4.62 6 9.24 6 14 C4.02 14 2.04 14 0 14 C0 9.38 0 4.76 0 0 Z" fill="#000000" transform="translate(78,36)" />
      <path d="M0 0 C1.98 0 3.96 0 6 0 C6 4.62 6 9.24 6 14 C4.02 14 2.04 14 0 14 C0 9.38 0 4.76 0 0 Z" fill="#000000" transform="translate(36,36)" />
    </svg>
    Claude Code: PrimalDual Explanation
  </summary>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
⏺ Great question—these are two fundamentally different algorithmic paradigms for BwK.

  Core Philosophical Difference
  ┌────────────────────────────────────────────┬───────────────────────────────────────────────────┐
  │            Balanced Exploration            │                  Primal-Dual BwK                  │
  ├────────────────────────────────────────────┼───────────────────────────────────────────────────┤
  │ Ratio-based: Optimize reward/cost directly │ Price-based: Learn a "shadow price" for resources │
  ├────────────────────────────────────────────┼───────────────────────────────────────────────────┤
  │ Explore to reduce uncertainty              │ Adapt prices based on consumption rate            │
  ├────────────────────────────────────────────┼───────────────────────────────────────────────────┤
  │ UCB-style confidence bounds                │ Lagrangian relaxation of constraints              │
  └────────────────────────────────────────────┴───────────────────────────────────────────────────┘
  Primal-Dual Intuition

  Instead of optimizing a ratio, Primal-Dual converts the constrained problem into an unconstrained one:

  Original: maximize E[reward] subject to E[cost] ≤ budget

  Primal-Dual: maximize E[reward] - λ * E[cost]
               where λ (dual variable) is learned online

  The dual variable λ is the shadow price—how much you're willing to "pay" to consume one unit of resource.

  Algorithm Sketch

  class PrimalDualBwKStrategy(BaseStrategy):
      def __init__(self):
          self.lambda_price = 0.0  # Shadow price for retries
          self.total_penalty_retries = 0
          self.total_requests = 0

      def select_server(self, servers, attempt_number):
          # Pick server maximizing: E[reward] - λ * E[cost]
          for server in servers:
              reward = self.success_estimate[server]
              cost = self.retry_cost_estimate[server]
              server.score = reward - self.lambda_price * cost

          return argmax(server.score)

      def update(self, server_url, success, attempts_used):
          # Update reward/cost estimates (standard)
          self._update_estimates(server_url, success, attempts_used)

          # KEY DIFFERENCE: Update the shadow price
          self.total_penalty_retries += max(0, attempts_used - 3)
          self.total_requests += 1

          # If consuming resources too fast, increase price
          target_rate = 0.5  # Expected penalty retries per request
          actual_rate = self.total_penalty_retries / self.total_requests

          # Multiplicative weights update
          self.lambda_price *= (1 + eta * (actual_rate - target_rate))

</code></pre></div></div>
</details>

<h1 id="take-home-multi-arm-bandit-variant">Take Home Multi-Arm Bandit Variant</h1>
<p>The take-home I received had an interesting twist on this. The twist: <strong>you are only penalized for a failing server request after $k$ tries.</strong> So you are still trying to maximize your “score” (i.e. reward), but you’re also given some leeway.</p>
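<p>To make the twist concrete, here is a hedged sketch of what a per-request score looks like under that rule. The function name and the unit penalty weight are my own assumptions (with the free-retry allowance defaulting to $k=3$), not the take-home’s actual spec:</p>

```python
def request_score(success: bool, attempts_used: int, k: int = 3,
                  reward: float = 1.0, penalty_per_retry: float = 1.0) -> float:
    """Reward a success; only charge for retries beyond the free allowance k."""
    penalized_retries = max(0, attempts_used - k)
    return (reward if success else 0.0) - penalty_per_retry * penalized_retries
```

<p>So a success on the second attempt costs nothing extra, while a success on the fifth attempt still pays for two penalized retries.</p>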

<p>It was not until deep research with Claude / ChatGPT that I learned the problem could (I think) best be framed as a <a href="#bandit-with-knapsack-bwk-variant"><strong>BwK</strong></a> problem.</p>

<h1 id="flaky-server---bwk-framing">Flaky Server - BwK Framing</h1>

<p>For more context on the take-home, the MAB portion was framed as building a load balancer where the downstream servers are flaky and you’re trying to minimize penalties (which are assessed after your load balancer request fails). They simply sent a binary (which I actually dislike; I think it’s very sketchy to send a binary with no details, certs, signatures, notarization, etc.). The binary opened up the following:</p>
<ul>
  <li>10 ports with Config 1 (constant error rate)</li>
  <li>10 ports with Config 2 (constant error rate + constant rate limit)</li>
  <li>10 ports with Config 3 (constant error rate + complex rate limit)</li>
</ul>

<h2 id="approach">Approach</h2>

<h3 id="aggressively-inspecting-the-binary">Aggressively Inspecting the Binary</h3>
<p>No fucking way am I blindly running a binary on my personal computer.</p>

<p>I am familiar with some basic CLI tools for inspecting binaries (<code class="language-plaintext highlighter-rouge">otool</code>, <code class="language-plaintext highlighter-rouge">strings</code>, <code class="language-plaintext highlighter-rouge">xattr</code> from the Dropbox days). However, this was something that I freely threw at Claude, with explicit instructions not to run the binary and not to tell me anything about the underlying load balancer config implementations (I’ll get to the de-compilation step in a bit).</p>

<p>I also knew that for all commands actually starting the load balancer binary, we would be running them in a restricted mode using <a href="https://igorstechnoclub.com/sandbox-exec/">sandbox-exec</a>, which I hadn’t stumbled upon until this project. The blog I just linked does a fantastic job, so you should feel comfortable giving it some site traffic and peeking into that one. TL;DR: it’s a way to run a binary in a sandboxed environment so that it only has access to the resources you permit.</p>

<p>All of this looked good, so I was onto the actual implementation.</p>

<h3 id="load-balancer">Load Balancer</h3>

<p>This was obviously the meat of the problem and the most fun to reason and think about, probably because it was the most math / stats intensive. I wrote a couple of versions myself, tried them and saw the failures (Claude found bugs with how I was calculating the beta distribution’s variance, for example), and kept iterating. It’s the part of the code I know the best, and I can walk through the various implementations.</p>

<p>The later versions where we get into the BwK approaches (<code class="language-plaintext highlighter-rouge">v6</code> - <code class="language-plaintext highlighter-rouge">v8</code>) are implementations by Claude, but still interesting to see how they perform relative to the original ideas.</p>

<p>At this point, I’m pretty burnt out on this project and I’m technically on vacation, so I am going to summarize and leave it as an exercise to the reader to investigate the code and understand the underlying logic.</p>

<p><strong>These versions are all basic MAB approaches, not BwK specific.</strong></p>

<table>
  <thead>
    <tr>
      <th style="text-align: center">Code Version</th>
      <th style="text-align: center">Method</th>
      <th style="text-align: center">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center">V1</td>
      <td style="text-align: center">Larkin Intuition</td>
      <td style="text-align: center">We still model things as a Beta distribution.<br /> We have a <code class="language-plaintext highlighter-rouge">DISCOVER_LIMIT</code>. While we’re in <code class="language-plaintext highlighter-rouge">DISCOVER_MODE</code>, we select the arm / server with highest beta variance, and fire off attempts to that server. If that fails, we re-evaluate. We continue until we fail. After the discover limit, then we statically pick the best server to send requests to.</td>
    </tr>
    <tr>
      <td style="text-align: center">V2</td>
      <td style="text-align: center">Vanilla UCB</td>
      <td style="text-align: center">This is the UCB method described above. We first prioritize any untried servers (since technically they have an infinite UCB score). Then for each server, we calculate the UCB score using the formula:<br /> \(UCB = \text{success_rate} + \sqrt{\frac{2 \ln(\text{total_requests})}{\text{num_attempts}}}\)</td>
    </tr>
    <tr>
      <td style="text-align: center">V3</td>
      <td style="text-align: center">Adjusted UCB</td>
      <td style="text-align: center">Very similar to the above, however this time we play games with our exploration constant. It’s no longer $\sqrt{2}$; it’s 3 (chosen arbitrarily, just bigger than $\sqrt{2}$) for the first three attempts, and then 1 after that, once we’re starting to get penalized.</td>
    </tr>
    <tr>
      <td style="text-align: center">V4</td>
      <td style="text-align: center">Vanilla Thompson Sampling</td>
      <td style="text-align: center">What we described above, we pick the server with the highest $p$ and then we go from there. Either way if it’s a success or a failure, we update our $\alpha$ and $\beta$.</td>
    </tr>
    <tr>
      <td style="text-align: center">V5</td>
      <td style="text-align: center">Modified Thompson Sampling</td>
      <td style="text-align: center">In a somewhat similar game to the modified UCB, we scale alpha and beta based on the number of requests to encourage exploration. We use an exponential decay and if we’re at 3 attempts or more, we do not scale at all and just revert back to normal TS. Our <code class="language-plaintext highlighter-rouge">scale_factor</code> then becomes <code class="language-plaintext highlighter-rouge">max(2, total/variance_scale) / total</code> where <code class="language-plaintext highlighter-rouge">total = alpha + beta</code>. We then multiply $\alpha$ and $\beta$ by those coefficients.</td>
    </tr>
  </tbody>
</table>
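<p>For a concrete feel of how V2 and V4 actually pick a server, here is a minimal sketch of both scoring rules (illustrative only; the names and numbers are made up, not the repo’s code):</p>

```python
import math
import random

def ucb_score(successes: int, attempts: int, total_requests: int) -> float:
    """V2-style UCB1: empirical success rate plus an exploration bonus."""
    if attempts == 0:
        return float("inf")  # untried servers get explored first
    return successes / attempts + math.sqrt(2 * math.log(total_requests) / attempts)

def thompson_sample(alpha: float, beta: float) -> float:
    """V4-style Thompson sampling: draw a plausible success rate from Beta(alpha, beta)."""
    return random.betavariate(alpha, beta)

# Pick the arm with the best UCB score.
stats = {"s1": (8, 10), "s2": (1, 5)}  # (successes, attempts)
total = sum(a for _, a in stats.values())
best_ucb = max(stats, key=lambda s: ucb_score(stats[s][0], stats[s][1], total))
```

<p>V2 takes the argmax of the UCB scores deterministically, while V4 re-draws from each server’s Beta posterior on every request, so its choices are stochastic.</p>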

<p><strong>These approaches, in honesty, were CC-generated, but they are rate-limit aware and targeted at BwK-style approaches.</strong></p>

<table>
  <thead>
    <tr>
      <th style="text-align: center">Code Version</th>
      <th style="text-align: center">Method</th>
      <th style="text-align: center">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center">V6</td>
      <td style="text-align: center">Thompson Masked</td>
      <td style="text-align: center">A slight departure from the original Thompson Sampling. Here, <code class="language-plaintext highlighter-rouge">429</code>s indicate that we have been rate limited, and we exclude rate-limited servers from the selection pool. Note, we also mark a server as rate-limited if we’ve gotten a 429 in the past second. The key idea is that 429s are treated differently than failures: we do not update $\beta$ when we get one, we instead just mark the server as rate limited. If all of our servers are rate limited, we pick the server whose rate limit is most likely to expire soonest. This is probably best for Config Type T2.</td>
    </tr>
    <tr>
      <td style="text-align: center">V7</td>
      <td style="text-align: center">Sliding Window</td>
      <td style="text-align: center">Here, given that we have the notion of temporal and dynamic rate limiting, we only remember a set amount of request history. I chose 30 basically arbitrarily. Again, perhaps ideally we could learn the rate limits and dynamically adapt this. Our $\alpha$ and $\beta$ params are only updated based on that window of history.</td>
    </tr>
    <tr>
      <td style="text-align: center">V8</td>
      <td style="text-align: center">Blocking Bandit</td>
      <td style="text-align: center">And here is the adaptive cooldown / blocking that <code class="language-plaintext highlighter-rouge">V7</code> was lacking. The difference is now if we hit a 429 we start to exponentially increase the wait time to block the incoming requests from going to a server that we know is rate-limited.</td>
    </tr>
  </tbody>
</table>
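<p>The V8 blocking behavior can be sketched roughly as follows (my illustrative sketch with an assumed doubling schedule and cap, not the repo’s actual V8 implementation):</p>

```python
class CooldownTracker:
    """Per-server exponential cooldown after 429s: each repeat doubles the wait, up to a cap."""

    def __init__(self, base: float = 1.0, cap: float = 60.0):
        self.base = base
        self.cap = cap
        self.blocked_until: dict[str, float] = {}  # server -> timestamp when it becomes usable
        self.strikes: dict[str, int] = {}          # consecutive 429s per server

    def record_429(self, server: str, now: float) -> None:
        strikes = self.strikes.get(server, 0) + 1
        self.strikes[server] = strikes
        wait = min(self.cap, self.base * (2 ** (strikes - 1)))  # 1s, 2s, 4s, ... capped
        self.blocked_until[server] = now + wait

    def record_success(self, server: str) -> None:
        # A successful request clears the strike count and any block.
        self.strikes.pop(server, None)
        self.blocked_until.pop(server, None)

    def available(self, servers, now: float):
        return [s for s in servers if self.blocked_until.get(s, 0.0) <= now]

tracker = CooldownTracker(base=1.0, cap=60.0)
tracker.record_429("a", now=0.0)  # first 429: "a" is blocked for 1 second
```

<p>The first 429 blocks a server for <code class="language-plaintext highlighter-rouge">base</code> seconds, each repeat doubles the wait up to <code class="language-plaintext highlighter-rouge">cap</code>, and a success resets the server’s strikes.</p>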

<h3 id="simulation-harness">Simulation Harness</h3>

<p>The simulation harness is almost entirely vibe-coded but basically sends requests to our load balancer at the prescribed rate of 10 RPS. For more information, I would check the <code class="language-plaintext highlighter-rouge">flaky-load-balancer/flaky_load_balancer/harness.py</code> file out. It’s on GH <a href="https://github.com/johnlarkin1/multi-armed-bandit/blob/main/flaky-load-balancer/flaky_load_balancer/harness.py">here</a>.</p>

<h3 id="dashboard">Dashboard</h3>

<p>The dashboard was a fun vibe-coded NextJS app. There’s a decent amount of functionality here, so I’ll cover some of the highlights. This NextJS project is meant to summarize and compare the results from the various strategies (<code class="language-plaintext highlighter-rouge">V1</code>-<code class="language-plaintext highlighter-rouge">V8</code>) against the various config types (<code class="language-plaintext highlighter-rouge">T1</code>-<code class="language-plaintext highlighter-rouge">T3</code>). It also has a comparison route that compares all of them for a given run.</p>

<p>It connects and listens to the FastAPI server (basically to our load balancer) so that we get SSE streams for things like the heartbeat, metrics, and connected state. So what I would suggest is running <code class="language-plaintext highlighter-rouge">make harness</code> and that will start your FastAPI load balancer, start the dashboard, start the downstream <code class="language-plaintext highlighter-rouge">flakyservers</code> binary, and then start firing off requests.</p>

<p>Here is a demo:</p>

<div class="video-container">
  <div class="video-wrapper-dark">
    <video src="/videos/multi-armed-bandit/multi-armed-bandit.mp4" type="video/mp4" muted="" autoplay="" loop="" controls="" style="width: 100%; height: auto;">
    </video>
  </div>
</div>

<p>And furthermore, here are some screenshots from the comparison page:</p>

<p><img src="/images/multi-armed-bandit/comparison.png" alt="compare" class="center lightbox-image" /></p>

<p><img src="/images/multi-armed-bandit/compare-viz.png" alt="compare-viz" class="center lightbox-image" /></p>

<h1 id="conclusion">Conclusion</h1>

<p>So! What were the results?</p>

<p><img src="/images/multi-armed-bandit/results.png" alt="results" class="center lightbox-image" /></p>

<p>Unsurprisingly, our Modified Thompson Sampling seemed to do the best on <code class="language-plaintext highlighter-rouge">T1</code>; somewhat surprisingly, the Sliding Window did the best on <code class="language-plaintext highlighter-rouge">T2</code> (probably because the underlying binary’s rate limit is sinusoidal, and there was some benefit to the cadence and the window size being used). Finally, for <code class="language-plaintext highlighter-rouge">T3</code>, the Blocking Bandit and Thompson Masked seemed to do the best.</p>

<hr />

<p>There’s a lot more I could talk about here, but this has already spilled over on the time budgeting so I will end here. If interested, feel free to reach out!</p>]]></content><author><name>johnlarkin1</name></author><category term="Development" /><category term="AI" /><summary type="html"><![CDATA[&lt;!– This was meant to be for a take-home... I submitted some version of the first draft, but then couldn't stop and here we are. If parts trail off, it's because I shouldn't have even gone this deep into other more pressing matter]]></summary></entry><entry><title type="html">iMessage Data Foundry</title><link href="https://johnlarkin1.github.io/2026/imessage-data-foundry/" rel="alternate" type="text/html" title="iMessage Data Foundry" /><published>2026-01-18T00:00:00+00:00</published><updated>2026-01-18T00:00:00+00:00</updated><id>https://johnlarkin1.github.io/2026/imessage-data-foundry</id><content type="html" xml:base="https://johnlarkin1.github.io/2026/imessage-data-foundry/"><![CDATA[<div class="project-registry">
  <a href="https://github.com/johnlarkin1/imessage-data-foundry" target="_blank" rel="noopener" class="registry-card github">
    <span class="lang-icon">🐙</span>
    <span class="lang-badge">Source</span>
    <span class="registry-name">GitHub</span>
  </a>
  <a href="https://pypi.org/project/imessage-data-foundry/" target="_blank" rel="noopener" class="registry-card python">
    <span class="lang-icon">🐍</span>
    <span class="lang-badge">Python</span>
    <span class="registry-name">PyPI</span>
  </a>
</div>

<h1 id="context">Context</h1>

<p>Recently, for one of my projects, I needed synthetic data generated to match a macOS-compatible iMessage <code class="language-plaintext highlighter-rouge">chat.db</code> as well as the <code class="language-plaintext highlighter-rouge">AddressBook.db</code>.</p>

<p>This is going to be a short post, because the <a href="https://github.com/johnlarkin1/imessage-data-foundry">GitHub repo’s</a> <code class="language-plaintext highlighter-rouge">README.md</code> has a lot more information. So check that out.</p>

<p>Alternatively, feel free to watch this demo video:</p>

<div class="video-container">
  <div class="video-wrapper-dark">
    <iframe src="https://www.youtube.com/embed/t_6QnWvlkCI" title="iMessage Data Foundry Demo" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="">
    </iframe>
  </div>
</div>

<h1 id="installation">Installation</h1>

<p>There are many ways to install it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ uv tool install imessage-data-foundry
$ uvx imessage-data-foundry
$ pip install imessage-data-foundry
$ pipx install imessage-data-foundry
</code></pre></div></div>

<h1 id="conclusion">Conclusion</h1>

<p>Thanks! Feel free to check out the GH repo or reach out if there are any questions / concerns. Also it’s open source so feel free to submit issues / PRs.</p>]]></content><author><name>johnlarkin1</name></author><category term="Development" /><category term="AI" /><summary type="html"><![CDATA[🐙 Source GitHub 🐍 Python PyPI]]></summary></entry><entry><title type="html">larkin-mcp</title><link href="https://johnlarkin1.github.io/2025/larkin-mcp/" rel="alternate" type="text/html" title="larkin-mcp" /><published>2025-12-14T00:00:00+00:00</published><updated>2025-12-14T00:00:00+00:00</updated><id>https://johnlarkin1.github.io/2025/larkin-mcp</id><content type="html" xml:base="https://johnlarkin1.github.io/2025/larkin-mcp/"><![CDATA[<p><img src="/images/larkin-mcp/hero.png" alt="larkin-mcp" class="center-shrink lightbox-image" /></p>

<div class="image-caption">You can either interact with mine, or clone the repo <a href="https://github.com/johnlarkin1/yourname-mcp">here</a>, and get started with the second one.</div>
<p><br /></p>

<div class="project-registry">
  <a href="https://pypi.org/project/larkin-mcp/" target="_blank" rel="noopener" class="registry-card python">
    <span class="lang-icon">🐍</span>
    <span class="lang-badge">Python</span>
    <span class="registry-name">PyPI</span>
  </a>
  <a href="https://www.npmjs.com/package/@johnlarkin1/larkin-mcp" target="_blank" rel="noopener" class="registry-card typescript">
    <span class="lang-icon">📘</span>
    <span class="lang-badge">TypeScript</span>
    <span class="registry-name">npm</span>
  </a>
  <a href="https://crates.io/crates/larkin-mcp" target="_blank" rel="noopener" class="registry-card rust">
    <span class="lang-icon">🦀</span>
    <span class="lang-badge">Rust</span>
    <span class="registry-name">crates.io</span>
  </a>
</div>

<div class="image-caption">Check out any of the links above for the various published packages. Note, Claude did the css here.</div>
<p><br /></p>
<div class="template-card-wrapper">
  <a href="https://github.com/johnlarkin1/yourname-mcp" target="_blank" rel="noopener" class="template-card">
    <span class="template-icon">📋</span>
    <span class="template-text">yourname-mcp template</span>
    <span class="template-arrow">→</span>
  </a>
</div>

<p><br /></p>

<p>I’m working on a much bigger project, but honestly, needed to take a break from that. It has been a grind. I have burned many early mornings on that.</p>

<p>So as a break, I have wanted to explore building my own MCP server and templatizing this to make it easier for others to install and set this up as well. This is not going to be a long post, but I’m hoping the repos speak for themselves, and this provides ample motivation.</p>

<!--
# Table of Contents

- [Table of Contents](#table-of-contents)
- [Motivation](#motivation)
  - [Personal Insights](#personal-insights)
  - [Interactive Timeline](#interactive-timeline)
    - [Example 1:](#example-1)
    - [Example 2:](#example-2)
  - [Personalized Study Guide](#personalized-study-guide)
- [Context](#context)
- [Why?](#why)
- [`yourname-mcp`](#yourname-mcp)
  - [Demo](#demo)
  - [Security](#security)
  - [Rust](#rust)
- [Conclusion](#conclusion)
-->

<h1 id="motivation">Motivation</h1>

<p>To provide some motivation (and perhaps earn a few stars on the template repo), here are practical examples of what you can do with this specific MCP server.</p>

<h2 id="personal-insights">Personal Insights</h2>

<blockquote>
  <p>What do you think was John Larkin’s hardest tennis match?</p>
</blockquote>

<p><strong>Result:</strong></p>

<p><img src="/images/larkin-mcp/hardest-match.png" alt="larkin-mcp" class="center-shrink lightbox-image" /></p>

<p><strong>Rude</strong>!! Hallucination. I didn’t get <em>bageled</em>, I got <em>breadsticked</em>. In other words, it was 1-6 not 0-6. But yes, shoutout to Phillip Locklear…</p>

<h2 id="interactive-timeline">Interactive Timeline</h2>

<h3 id="example-1">Example 1:</h3>

<p><strong>Prompt:</strong></p>

<blockquote>
  <p>Can you give me John’s experience’s as a beautiful timeline? Please create a html file with that visualization</p>
</blockquote>

<p><strong>Result:</strong> <a href="/assets/html/larkin-mcp/john-larkin-profile.html" target="_blank" rel="noopener">View the timeline <svg class="external-link-icon" width="12" height="12" viewBox="0 0 12 12" fill="none" xmlns="http://www.w3.org/2000/svg" style="display:inline-block;vertical-align:middle;margin-left:2px;"><path d="M10.5 1.5L5.5 6.5M10.5 1.5H7M10.5 1.5V5M10.5 7V10C10.5 10.2761 10.2761 10.5 10 10.5H2C1.72386 10.5 1.5 10.2761 1.5 10V2C1.5 1.72386 1.72386 1.5 2 1.5H5" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round"></path></svg></a></p>

<div class="mcp-demo-iframe-wrapper">
  <iframe class="mcp-demo-iframe-container" src="/assets/html/larkin-mcp/john-larkin-profile.html" title="John Larkin Timeline" width="1120" height="630" allowfullscreen="">
  </iframe>
</div>
<p><br /></p>

<h3 id="example-2">Example 2:</h3>

<blockquote>
  <p>Can you use your frontend-design skill and build a beautiful interactive timeline of John’s work experience and personal project timeline as a single html file visualization?</p>
</blockquote>

<p><strong>Result:</strong> <a href="/assets/html/larkin-mcp/john-larkin-timeline.html" target="_blank" rel="noopener">View the timeline <svg class="external-link-icon" width="12" height="12" viewBox="0 0 12 12" fill="none" xmlns="http://www.w3.org/2000/svg" style="display:inline-block;vertical-align:middle;margin-left:2px;"><path d="M10.5 1.5L5.5 6.5M10.5 1.5H7M10.5 1.5V5M10.5 7V10C10.5 10.2761 10.2761 10.5 10 10.5H2C1.72386 10.5 1.5 10.2761 1.5 10V2C1.5 1.72386 1.72386 1.5 2 1.5H5" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round"></path></svg></a></p>

<div class="mcp-demo-iframe-wrapper">
  <iframe class="mcp-demo-iframe-container" src="/assets/html/larkin-mcp/john-larkin-timeline.html" title="John Larkin Experience Timeline" width="1120" height="630" allowfullscreen="">
  </iframe>
</div>
<p><br /></p>

<p>Honestly, the second one is pretty slick, although it’s a bit… devoid of personality, I guess.</p>

<p>fwiw, here is the usage in CC:</p>

<p><img src="/images/larkin-mcp/claude-code-example.png" alt="larkin-mcp" class="center-super-shrink lightbox-image" /></p>

<h2 id="personalized-study-guide">Personalized Study Guide</h2>

<p><strong>Prompt:</strong></p>

<blockquote>
  <p>Can you help John Larkin prepare for an Anthropic interview given his resume and past experience? Please search and find open roles and then prepare a study guide for his various gaps.</p>
</blockquote>

<p><strong>Result:</strong></p>

<p>Not sharing the whole thing, but you can see this from Claude Desktop:</p>

<p><img src="/images/larkin-mcp/claude-desktop-example.png" alt="larkin-mcp" class="center-small lightbox-image" /></p>

<h1 id="context">Context</h1>

<p>I wanted to set up a local MCP server that you can install to ask questions about the user. There are two versions:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">larkin-mcp</code> - my materialized repo that has details about myself (largely professional, markdown files are online, but I’m guessing in the age of the internet, this level of detail is fine).</li>
  <li><code class="language-plaintext highlighter-rouge">yourname-mcp</code> - the templated repo where you can clone this, run a script, and optionally publish (caution: the info that you put in your <code class="language-plaintext highlighter-rouge">resources/content</code> markdown files will then be indexable / probably ingested by some AI… but my theory is that most of that stuff is already going to be out there)</li>
</ul>

<h1 id="why">Why?</h1>

<p>Yeah, so this was something my PM girlfriend asked me almost immediately. Why do this? Can’t you just feed your resume into ChatGPT and it’ll basically be able to do the same? I think yes, partially, but (at least in my case) my resume is still missing a ton of context. So I think my response is multi-fold:</p>

<blockquote>
  <p>Can’t you just feed your resume into ChatGPT and ask questions of that?</p>
</blockquote>

<ol>
  <li>Feeding in your resume as a <code class="language-plaintext highlighter-rouge">pdf</code> or <code class="language-plaintext highlighter-rouge">md</code> file is going to bloat your context window. MCP provides more selective invocations.</li>
  <li>I don’t want to do that every time I need something with my context and personality</li>
  <li>It’s still missing a ton of context about who I am and some more ephemeral things about me. (note: i know that 90% of companies won’t care about that, and 99.9% of recruiters won’t care about it)</li>
  <li>I wanted to be able to distribute this. There’s a world I could imagine where recruiters just run <code class="language-plaintext highlighter-rouge">uvx larkin-mcp</code> and then ask questions to get a feel for my work and who I am</li>
  <li>I want to control the level of detail and insight that this MCP server has</li>
  <li>I wanted to build an MCP server… I hadn’t done it, even at work.</li>
  <li>I wanted to explore the tooling around it as well.</li>
  <li>I wanted to build an MCP server in Typescript and Rust explicitly, given I’m trying to work on my Rust skills and I’m less involved in those communities</li>
  <li>I thought it would be a useful thing to templatize and set up some infrastructure so less technical users could <code class="language-plaintext highlighter-rouge">git clone &lt;repo&gt; &amp;&amp; ./run-install.sh</code> and that would ask them a couple of questions, analyze their resume, convert it into markdown, they could write some markdown to provide more context, and then boom, they could also publish it and others could use it if they wanted.</li>
  <li>As stated previously, I needed a break from my other project.</li>
</ol>

<p>And if you’re thinking like <em>well, what about Claude memory or ChatGPT memory?</em>, I’m really not a fan of that. I don’t think Simon Willison is either. And I don’t trust it to not sycophant it up or pull information that perhaps I don’t want for the questions I’m asking.</p>

<p>Hopefully, that’s enough rationale for personal motivation.</p>

<h1 id="yourname-mcp"><code class="language-plaintext highlighter-rouge">yourname-mcp</code></h1>

<p>This is hopefully your template of interest. The point is that this has enough scaffolding that you can run the install script, populate a couple markdown files, upload to PyPI, and then you’re off and running. There will be more info in the actual repo <a href="https://github.com/johnlarkin1/yourname-mcp">here</a>.</p>

<h2 id="demo">Demo</h2>

<p>Here is a demo showcasing the functionality:</p>

<div class="video-container">
  <div class="video-wrapper-dark">
    <video src="https://www.dropbox.com/scl/fi/v7ljkkxf3p8d24vk8wlpk/yourname-mcp-demo-lg.mp4?rlkey=95ha9lg6gpwngufkdvq9l2t0q&amp;st=dlhyft2u&amp;raw=1" muted="" autoplay="" loop="" controls="" style="width: 100%; height: auto;">
    </video>
  </div>
</div>

<h2 id="security">Security</h2>

<p>I - like basically every other engineer - am slightly cautious about MCP. There are going to be a large number of attacks, given the trust people are placing in MCP and in binary executables (i.e. <code class="language-plaintext highlighter-rouge">bunx</code> or <code class="language-plaintext highlighter-rouge">uvx</code>).</p>

<p>This is from 6 days ago (at time of writing):</p>

<blockquote class="reddit-embed-bq" style="height:316px" data-embed-theme="dark" data-embed-height="316"><a href="https://www.reddit.com/r/MCPservers/comments/1poelh4/is_anyone_else_terrified_by_the_lack_of_security/">Is anyone else terrified by the lack of security in standard MCP?</a><br /> by<a href="https://www.reddit.com/user/RaceInteresting3814/">u/RaceInteresting3814</a> in<a href="https://www.reddit.com/r/MCPservers/">MCPservers</a></blockquote>
<script async="" src="https://embed.reddit.com/widgets.js" charset="UTF-8"></script>

<p><br /></p>

<p>Even with this project… while I utilize <code class="language-plaintext highlighter-rouge">uvx</code> and <code class="language-plaintext highlighter-rouge">bunx</code> for the convenience, I am 100% afraid of impersonations, security attacks, and people injecting malicious code via careless distributors. This is obviously nuanced. I am a huge fan of making software easily disseminated, but the increase in malicious code and actors (which is only exacerbated by the AI wave) is extremely alarming. I mean, just look at npm in the <a href="https://semgrep.dev/blog/2025/chalk-debug-and-color-on-npm-compromised-in-new-supply-chain-attack/">past</a> <a href="https://securitylabs.datadoghq.com/articles/shai-hulud-2.0-npm-worm/">couple</a> <a href="https://www.crowdstrike.com/en-us/blog/crowdstrike-falcon-prevents-npm-package-supply-chain-attacks/?utm_source=chatgpt.com">months</a>?</p>

<h2 id="rust">Rust</h2>

<p>I could have used something like <a href="https://crates.io/crates/cargo-binstall"><code class="language-plaintext highlighter-rouge">cargo-binstall</code></a>, but didn’t quite get to it. As a result, if you want to set this up in Claude Code or Claude Desktop, you’ll need to do something like <code class="language-plaintext highlighter-rouge">cargo install larkin-mcp</code> and then point to that corresponding built binary:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   Compiling larkin-mcp v1.0.2
    Finished <span class="sb">`</span>release<span class="sb">`</span> profile <span class="o">[</span>optimized] target<span class="o">(</span>s<span class="o">)</span> <span class="k">in </span>14.83s
  Installing /Users/johnlarkin/.cargo/bin/larkin-mcp
   Installed package <span class="sb">`</span>larkin-mcp v1.0.2<span class="sb">`</span> <span class="o">(</span>executable <span class="sb">`</span>larkin-mcp<span class="sb">`</span><span class="o">)</span>
</code></pre></div></div>
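<p>For instance, a minimal Claude Desktop config entry pointing at that built binary might look like the following (the path is the one <code class="language-plaintext highlighter-rouge">cargo install</code> reported above; adjust for your machine):</p>

```json
{
  "mcpServers": {
    "larkin-mcp": {
      "command": "/Users/johnlarkin/.cargo/bin/larkin-mcp"
    }
  }
}
```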

<p>Rust was my favorite to implement, although the code structure is perhaps not as idiomatic Rust as it should be. In my opinion, <code class="language-plaintext highlighter-rouge">rmcp</code>, which is the canonical framework for Rust MCP servers, is slightly less ergonomic. It matches a lot of the Python decorators with Rust macros, but there are some tricks around public traits and understanding what is actually going on underneath the function calls.</p>

<h1 id="conclusion">Conclusion</h1>

<p>If you like this, or think it will be useful, please check out the basically templated repo <code class="language-plaintext highlighter-rouge">yourname-mcp</code> where the <code class="language-plaintext highlighter-rouge">README.md</code> will walk you through what you need to do! Always feel free to email or leave comments if need be.</p>]]></content><author><name>johnlarkin1</name></author><category term="Development" /><category term="AI" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Understanding Muon</title><link href="https://johnlarkin1.github.io/2025/understanding-muon/" rel="alternate" type="text/html" title="Understanding Muon" /><published>2025-10-28T00:00:00+00:00</published><updated>2025-10-28T00:00:00+00:00</updated><id>https://johnlarkin1.github.io/2025/understanding-muon</id><content type="html" xml:base="https://johnlarkin1.github.io/2025/understanding-muon/"><![CDATA[<div class="markdown-alert markdown-alert-note">
<p>So while I tried to mainly focus on optimizers, this post kinda splayed out some. It was my first time trying <b><a href="https://pyodide.org/en/stable/">Pyodide</a></b> and incorporating that logic into my blog. It was my first time using <b><a href="https://docs.manim.community/en/stable/">manim</a></b>, which was exciting because I'm a big fan of the 3Blue1Brown channel. I also introduced quizzes (see AdamW section) for more interactivity. All of this is open source though, so if you have any questions, I'd be flattered if you emailed, but obviously you can just ask ChatGPT / Claude. 
</p>
</div>

<p><br /></p>

<h1 id="motivating-visualization">Motivating Visualization</h1>

<div class="video-container">
  <div class="video-wrapper-dark">
    <video src="https://www.dropbox.com/scl/fi/399366yvev1jq03cvmu5w/muon-overview.mp4?rlkey=w8sh3t2ucnvboo4l72apzfmdj&amp;st=vvvy6k3s&amp;raw=1" muted="" autoplay="" loop="" controls="" style="width: 100%; height: auto;">
    </video>
  </div>
</div>

<div class="image-caption">Read on to understand the above visualization. My manim skills aren't fantastic so the timing of above could be improved.</div>
<p><br /></p>

<p>Today, we’re going to try and understand as much of this animation as possible. We’ll cover optimizers as a construct, look at an example, take a walk through history (again high level) and then we’ll investigate Muon, which is a more recent optimizer that has been sweeping the community. Note, we will not cover Newton-Schulz iteration or approximation of the SVD calc, but I’m hoping to cover that in another blog post.</p>

<div class="markdown-alert markdown-alert-tip">
<p>Also if you're curious the visualization code (which is a bit of a mess) is <b><a href="https://github.com/johnlarkin1/understanding-muon">here.</a></b></p>
</div>

<h1 id="background">Background</h1>

<p><a href="https://github.com/karpathy/nanochat"><code class="language-plaintext highlighter-rouge">nanochat</code></a> just dropped a couple of weeks ago and one element that I was extremely interested in was <a href="https://kellerjordan.github.io/posts/muon/">muon</a>. It’s a pretty recent state of the art optimizer that has shown competitive performance in training speed challenges.</p>

<p>First of all, if you are not familiar with some of this, you should start with Keller Jordan’s blog that I linked above. He’s the creator of the approach and it’s pretty ingenious. Second of all, if you’re not familiar with linear algebra at all (which is ok), I’d recommend this <a href="https://little-book-of.github.io/linear-algebra/">Little Book of Linear Algebra</a>. I ran through it over the past couple weeks so that I could ensure a strong base / have a refresher for some of the concepts that I haven’t seen since college. You can check out the <a href="https://github.com/johnlarkin1/little-book-of-linalg">Jupyter notebooks here</a>.</p>

<p>This post is going to try and take you as close from $0 \to 1$ as possible (one huge benefit of running through the book + lab linked above is my latex got way better. Not going to help me land a job at Anthropic but c’est la vie).</p>

<!--
# Table of Contents

- [Motivating Visualization](#motivating-visualization)
- [Background](#background)
- [Table of Contents](#table-of-contents)
- [(optional) Reading + Videos](#optional-reading--videos)
- [Deep Learning (simplified)](#deep-learning-simplified)
- [Tour of Popular Optimizers](#tour-of-popular-optimizers)
  - [Loss Function](#loss-function)
    - [Visualization](#visualization)
  - [Stochastic Gradient Descent](#stochastic-gradient-descent)
  - [SGD with Momentum](#sgd-with-momentum)
    - [Computational Cost of Momentum](#computational-cost-of-momentum)
    - [Variations](#variations)
  - [Adaptive Learning Rates (AdaGrad / RMSProp)](#adaptive-learning-rates-adagrad--rmsprop)
    - [AdaGrad (2010)](#adagrad-2010)
      - [Variations](#variations-1)
    - [RMSProp (2012)](#rmsprop-2012)
      - [Variations](#variations-2)
  - [Bias Correction (finally meeting Adam Optimizer, 2015)](#bias-correction-finally-meeting-adam-optimizer-2015)
    - [Comparison so Far](#comparison-so-far)
    - [Plain English](#plain-english)
    - [Viz](#viz)
  - [Weight Decay Coupling (the "W" in AdamW, 2017)](#weight-decay-coupling-the-w-in-adamw-2017)
    - [L2 Regularization](#l2-regularization)
    - [Viz](#viz-1)
- [Muon (MomentUm Orthogonalized by Newton-Schulz) (2025)](#muon-momentum-orthogonalized-by-newton-schulz-2025)
  - [Theory](#theory)
    - [Odd Polynomial Matrix](#odd-polynomial-matrix)
    - [Newton-Schulz Iteration](#newton-schulz-iteration)
    - [Overview](#overview)
  - [Implementation](#implementation)
- [Conclusion](#conclusion)
-->

<h1 id="optional-reading--videos">(optional) Reading + Videos</h1>

<p>These are a couple of helpful resources for you all to get started. I would actually think that if you’re starting from close to scratch or near scratch (haven’t studied AdamW) then you should probably come back to these after my article.</p>

<ul>
  <li>Videos
    <ul>
      <li><a href="https://www.youtube.com/watch?v=bO5nvE289ec"><strong>This Simple Optimizer Is Revolutionizing How We Train AI (Muon)</strong></a> (p.s. god the amount of clickbaiting people do is just suffocating me… however, this is a good video)</li>
    </ul>
  </li>
  <li>Reading
    <ul>
      <li><a href="https://kellerjordan.github.io/posts/muon/"><strong>Muon: An optimizer for hidden layers in neural networks</strong></a> - <em>Keller Jordan</em></li>
      <li><a href="https://jeremybernste.in/writing/deriving-muon"><strong>Deriving Muon</strong></a> - <em>Jeremy Bernstein</em></li>
      <li><a href="https://www.lakernewhouse.com/writing/muon-1"><strong>Understanding Muon</strong></a> - <em>Laker Newhouse</em>
        <ul>
          <li>this series (after doing my own research and investigation) is hilariously written. lots of Matrix allusions</li>
        </ul>
      </li>
    </ul>
  </li>
</ul>

<h1 id="deep-learning-simplified">Deep Learning (simplified)</h1>

<p>I’m not going to take you from the very beginning, but the language of deep learning is basically just… linear algebra.</p>

<p>We have these “deep learning” models that are really neural networks. All that means is that they’re layers of parameters (weights and biases) that take various inputs and make predictions. They normally are <em>affine transformations</em> followed by a (usually) non-linear activation.</p>

<p>Generally, the flow for training in deep learning goes like this:</p>

<ol>
  <li>forward pass (feeding data in)</li>
  <li>loss function (so we know how we did)</li>
  <li>backward pass (so we know how to adapt)</li>
  <li>gradient descent (or flavors thereof… where we actually adjust our weights)</li>
</ol>

<p>There’s fascinating math at all points of this process. However, we’re going to spend the day focusing on step 4 - and specifically on the subset of <strong>optimizers</strong>. Modern optimizers modify gradients using momentum, adaptive learning rates, etc.</p>
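<p>The four steps above can be sketched end-to-end in a few lines of NumPy. This is my own toy example (a single linear layer with an analytic gradient), not code from any real framework:</p>

```python
import numpy as np

# Toy end-to-end version of the four steps: a single linear layer
# y_hat = X @ w trained with MSE loss and plain gradient descent.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))            # batch of inputs
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                          # targets

w = np.zeros(3)                         # parameters (theta)
eta = 0.1                               # learning rate

for _ in range(200):
    y_hat = X @ w                            # 1. forward pass
    loss = np.mean((y_hat - y) ** 2)         # 2. loss function
    grad = 2 * X.T @ (y_hat - y) / len(X)    # 3. backward pass (analytic here)
    w = w - eta * grad                       # 4. gradient descent update

print(loss, w)  # loss shrinks toward 0, w approaches true_w
```

Real models swap step 3 for automatic differentiation and step 4 for one of the optimizers below, but the loop shape stays the same.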

<p>Here is a high level visualization of what’s happening:</p>

<div class="video-container">
  <div class="video-wrapper-dark">
    <video src="https://www.dropbox.com/scl/fi/11h7n3gwa30gmo57yj0zo/ch1-ml-training-process.mp4?rlkey=z29nnmou3ab8zvvj5hliphi25&amp;st=nov43ilo&amp;raw=1" muted="" autoplay="" loop="" controls="" style="width: 100%; height: auto;">
    </video>
  </div>
</div>

<div class="image-caption">Courtesy of me and Claude hammering on manim</div>
<p><br /></p>

<p>Note that $\eta$ here is the learning rate.</p>

<h1 id="tour-of-popular-optimizers">Tour of Popular Optimizers</h1>

<p>Ok, the canonical example with optimizers is that we’re basically trying to find the lowest point in a valley. This assumes our search space is $\mathbb{R}^3$, really, but that’s fine for now.</p>

<p>So like let’s take an actual example with the Grand Canyon. Imagine you’re standing on top of the Grand Canyon - how are you going to find the lowest point in the Grand Canyon?</p>

<p><img src="https://www.jasonweissphotography.com/images/960/grand-canyon-toroweap-sunrise.jpg" alt="top-of-grand-canyon" class="center-shrink lightbox-image" /></p>

<div class="image-caption"><a href="https://www.jasonweissphotography.com/photo/grand-canyon-sunrise-toroweap/">Kudos</a> to Jason Weiss</div>
<p><br /></p>

<p>Now, the optimizer is basically telling us <em>how</em> to walk down that space. It’s obviously a lot easier if we have a topographic map, but we certainly do not in deep learning, and even with the topographic map, it can be tough to search across.</p>

<p><img src="https://databayou.com/grand/images/grandcanyonelevation.webp" alt="grand-canyon-topo" class="center-shrink lightbox-image" /></p>

<div class="image-caption"><a href="https://databayou.com/grand/canyon.html">Kudos</a> to DataByYou</div>
<p><br /></p>

<p>In this analogy, elevation is basically how “wrong” we are. You can think of it as the output of our loss function $L(\hat{y}, y)$. So we compute gradients to determine which direction reduces that loss. However, we still don’t know how big each step would be (the $\eta$ mentioned above) or how to adjust over time or how to avoid getting caught in local minima, etc.</p>

<h2 id="loss-function">Loss Function</h2>

<h3 id="visualization">Visualization</h3>

<p>I don’t have a loss function that is equivalent to the Grand Canyon (sadly), but we are going to look at the <a href="https://www.sfu.ca/~ssurjano/stybtang.html">Styblinski-Tang function</a> as our example loss function. This isn’t going to be accurate, but imagine that the loss surface of our deep learning process lives in just 3D and has a shape we can write down as a function. For a two-dimensional input, the Styblinski-Tang function looks like this:</p>

\[\begin{align}
f(\mathbf{x}) &amp;= \frac{1}{2}\sum_{i=1}^{d} \big(x_i^4 - 16x_i^2 + 5x_i \big) \\
f(x,y) &amp;= \frac{1}{2}\big[ (x^4 - 16x^2 + 5x) + (y^4 - 16y^2 + 5y) \big]
\end{align}\]

<p>Here’s a visualization of this function:</p>

<div class="video-container">
  <div class="video-wrapper-dark">
    <video src="https://www.dropbox.com/scl/fi/evcgoniyxavkrqs8gajru/ch2-loss-function.mp4?rlkey=r74573378jt9njeltk19w5912&amp;st=rmyax9oh&amp;raw=1" muted="" autoplay="" loop="" controls="" style="width: 100%; height: auto;">
    </video>
  </div>
</div>

<div class="image-caption">Courtesy of me and Claude hammering on manim</div>
<p><br /></p>

<h2 id="stochastic-gradient-descent">Stochastic Gradient Descent</h2>

<p>Conceptually, with standard stochastic gradient descent (SGD), we update our weights so that we move in the opposite direction of the gradient (since the gradient points in the direction of steepest ascent).</p>

<p>Mathematically speaking, this is:</p>

\[\theta_{t+1} = \theta_t - \eta \nabla_{\theta} L (\theta_t)\]

<p>SGD works pretty well, but it’s far from the best. Think back to our Grand Canyon analogy. Imagine there are steep stairs that zig-zag back and forth down the canyon. There might be a ramp that is less steep but gets us to the lowest point in the valley more directly. If our landscape is anything more dynamic than a vanilla bowl, the best path is almost certainly not straight, and therefore SGD isn’t the most <em>efficient</em>. This is basically what happens to SGD in ravines: there is high curvature in one dimension, but not in another.</p>

<p>Furthermore, the step size isn’t dynamic enough. A single global step size ignores how much each model parameter (and its gradient) actually needs to be adjusted, so we can overshoot our targets.</p>

<p>Here’s an example of where SGD could get caught in a local minimum.</p>

<div class="video-container">
  <div class="video-wrapper-dark">
    <video src="https://www.dropbox.com/scl/fi/nx9o5mbx5kc206ksf2wc1/ch3-sgd-trap.mp4?rlkey=znd4b6bl69dg3roi1t7aox3sq&amp;st=21l0we8j&amp;raw=1" muted="" autoplay="" loop="" controls="" style="width: 100%; height: auto;">
    </video>
  </div>
</div>

<div class="image-caption">Courtesy of me and Claude hammering on manim</div>
<p><br /></p>

<p>And if 3D isn’t really your style (especially given my <code class="language-plaintext highlighter-rouge">manim</code> skills are pretty poor), here’s some Python code that will visualize SGD on a 2D contour plot:</p>

<!-- prettier-ignore-start -->
<div class="interactive-python">
<pre><code class="language-python">
import numpy as np
import matplotlib.pyplot as plt
from itertools import product

def styblinski_tang_fn(x: float, y: float) -&gt; float:
    return 0.5 * ((x**4 - 16 * x**2 + 5 * x) + (y**4 - 16 * y**2 + 5 * y))

def styblinski_tang_grad(x: float, y: float) -&gt; np.ndarray:
    dfx = 2 * x**3 - 16 * x + 2.5
    dfy = 2 * y**3 - 16 * y + 2.5
    return np.array([dfx, dfy], dtype=float)

eta = 0.01
steps = 80
theta = np.array([3.5, -3.5], dtype=float)

"""SGD!!! This is the important part here. Implementing the exact math above."""
path = [theta.copy()]
for _ in range(steps):
    grad = styblinski_tang_grad(*theta)
    theta -= eta * grad
    path.append(theta.copy())
path = np.array(path)

"""find stationary points (we can just look at derivative because repeated)"""
roots = np.roots([2.0, 0.0, -16.0, 2.5])
roots = np.real(roots[np.isreal(roots)])          # keep real roots
"""this is basically using second derivative to determine minima"""
minima_1d = [r for r in roots if (6*r*r - 16) &gt; 0]  # two minima
local_minima_2d = np.array(list(product(minima_1d, repeat=2)), dtype=float)
vals = np.array([styblinski_tang_fn(x, y) for x, y in local_minima_2d])
gmin_idx = np.argmin(vals)
gmin_pt = local_minima_2d[gmin_idx]
gmin_val = vals[gmin_idx]

"""viz"""
x = y = np.linspace(-5, 5, 300)
X, Y = np.meshgrid(x, y)
Z = styblinski_tang_fn(X, Y)
plt.figure(figsize=(7, 6))
cs = plt.contour(X, Y, Z, levels=40, cmap="viridis", alpha=0.85)
plt.clabel(cs, inline=True, fmt="%.0f", fontsize=8)
plt.plot(path[:, 0], path[:, 1], 'r.-', label='GD Path', zorder=2)
plt.scatter(path[0, 0], path[0, 1], color='orange', s=80, label='Start', zorder=3)
plt.scatter(path[-1, 0], path[-1, 1], color='blue', s=80, label='End', zorder=3)
mask = np.ones(len(local_minima_2d), dtype=bool)
mask[gmin_idx] = False
if np.any(mask):
    plt.scatter(local_minima_2d[mask, 0], local_minima_2d[mask, 1],
                marker='v', s=120, edgecolor='k', facecolor='white',
                label='Local minima', zorder=4)
plt.scatter(gmin_pt[0], gmin_pt[1], marker='*', s=220, edgecolor='k',
            facecolor='gold', label=f'Global min ({gmin_pt[0]:.4f}, {gmin_pt[1]:.4f})\n f={gmin_val:.4f}', zorder=5)
plt.title("Gradient Descent on Styblinski–Tang: Local vs Global Minima")
plt.xlabel("x"); plt.ylabel("y"); plt.legend(loc='upper right'); plt.grid(alpha=0.3); plt.tight_layout();
plt.show()
</code></pre>
</div>
<!-- prettier-ignore-end-->

<h2 id="sgd-with-momentum">SGD with Momentum</h2>

<p>So the natural question is: how can we do better than plain SGD?</p>

<p>This idea has been around forever (1964) compared to Muon, which is basically 2025. Boris Polyak introduced momentum with a physical intuition: if you roll a heavy ball down a hill with valleys along the way, it doesn’t get trapped in every local minimum. It has momentum to carry it over local minima, which helps it find the global minimum.</p>

<p>Mathematically, it’s a pretty simple extension from our previous. The general idea is that now we have two equations governing how we update our parameters:</p>

\[\begin{align}
v_{t+1} &amp;= \beta v_t - \eta \nabla_{\theta} L (\theta_{t}) \\
\theta_{t+1} &amp;= \theta_{t} + v_{t+1}
\end{align}\]

<p>We’ve got some new parameters, so let’s define those:</p>

<ul>
  <li>$v_{t}$ - is the “velocity”: the accumulated gradient, basically our physical momentum</li>
  <li>$\beta$ - is the “momentum coefficient”; it controls how much history we remember and how much we want to propagate</li>
  <li>$\eta$ - is still our learning rate</li>
</ul>

<p>A key insight is that if you take $\beta \to 0$ and substitute $v_{t+1}$ into the update, the whole thing falls back to SGD (which is good).</p>
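<p>A quick numerical sanity check of that claim (my own sketch, on a simple quadratic bowl rather than anything from the papers):</p>

```python
import numpy as np

def grad(theta):
    # gradient of the quadratic bowl f(theta) = 0.5 * ||theta||^2
    return theta

def run_momentum(theta0, eta=0.1, beta=0.9, steps=50):
    theta = np.array(theta0, float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = beta * v - eta * grad(theta)   # v_{t+1} = beta * v_t - eta * grad
        theta = theta + v                  # theta_{t+1} = theta_t + v_{t+1}
    return theta

def run_sgd(theta0, eta=0.1, steps=50):
    theta = np.array(theta0, float)
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

# With beta = 0, the momentum update collapses to plain SGD.
print(np.allclose(run_momentum([3.0, -2.0], beta=0.0), run_sgd([3.0, -2.0])))  # True
```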

<p>A core paradigm shift here was that this was the first time gradient descent carried with it the notion of memory. It’s a bit more stateful.</p>

<p>Once again, a 3D version, and a 2D version.</p>

<div class="video-container">
  <div class="video-wrapper-dark">
    <video src="https://www.dropbox.com/scl/fi/ihq86vaebowf17g7dacql/ch4-sgd-mom.mp4?rlkey=1c5xshmajrfy66z7brnfq61hu&amp;st=oladf1ox&amp;raw=1" muted="" autoplay="" loop="" controls="" style="width: 100%; height: auto;">
    </video>
  </div>
</div>

<div class="image-caption">Courtesy of me and Claude hammering on manim</div>
<p><br /></p>

<p>And the 2D visualization:</p>

<div class="interactive-python">
<pre><code class="language-python">
import numpy as np
import matplotlib.pyplot as plt
from itertools import product

def styblinski_tang_fn(x: float, y: float) -&gt; float:
    return 0.5 * ((x**4 - 16 * x**2 + 5 * x) + (y**4 - 16 * y**2 + 5 * y))

def styblinski_tang_grad(x: float, y: float) -&gt; np.ndarray:
    dfx = 2 * x**3 - 16 * x + 2.5
    dfy = 2 * y**3 - 16 * y + 2.5
    return np.array([dfx, dfy], dtype=float)

def stationary_points_and_global_min():
    roots = np.roots([2.0, 0.0, -16.0, 2.5])
    roots = np.real(roots[np.isreal(roots)])
    minima_1d = [r for r in roots if (6*r*r - 16) &gt; 0]
    mins2d = np.array(list(product(minima_1d, repeat=2)), dtype=float)
    vals = np.array([styblinski_tang_fn(x, y) for x, y in mins2d])
    gidx = np.argmin(vals)
    return mins2d, mins2d[gidx], vals[gidx]

def run_sgd(theta0, eta=0.02, steps=1200):
    theta = np.array(theta0, float)
    path = [theta.copy()]
    for _ in range(steps):
        theta -= eta * styblinski_tang_grad(*theta)
        path.append(theta.copy())
    return np.array(path)

"""
again, re-call beta is our momentum coefficient
eta is still our learning rate
extension: Nesterov Momentum
"""
def run_momentum(theta0, eta=0.02, beta=0.90, steps=1200):
    theta = np.array(theta0, float)
    v = np.zeros_like(theta)
    path = [theta.copy()]
    for _ in range(steps):
        grad = styblinski_tang_grad(*theta)
        v = beta * v - eta * grad
        theta = theta + v
        path.append(theta.copy())
    return np.array(path)

"""params"""
theta_start = np.array([4.1, 4.5], dtype=float)
eta = 0.02
beta = 0.90
steps = 1200
use_nesterov = False  # flip to True to experiment

sgd_path = run_sgd(theta_start, eta=eta, steps=steps)
mom_path = run_momentum(theta_start, eta=eta, beta=beta, steps=steps)
mins2d, gmin_pt, gmin_val = stationary_points_and_global_min()

"""viz"""
x = y = np.linspace(-5, 5, 400)
X, Y = np.meshgrid(x, y)
Z = styblinski_tang_fn(X, Y)

plt.figure(figsize=(8, 7))
cs = plt.contour(X, Y, Z, levels=50, cmap="viridis", alpha=0.85)
plt.clabel(cs, inline=True, fmt="%.0f", fontsize=7)
plt.plot(sgd_path[:, 0], sgd_path[:, 1], 'r.-', lw=1.5, ms=3, label='SGD')
plt.plot(mom_path[:, 0], mom_path[:, 1], 'b.-', lw=1.5, ms=3,
         label=f'Momentum (β={beta}, nesterov={use_nesterov})')
plt.scatter(sgd_path[0, 0], sgd_path[0, 1], c='orange', s=80, label='Start', zorder=3)
plt.scatter(sgd_path[-1, 0], sgd_path[-1, 1], c='red', s=70, label='SGD End', zorder=3)
plt.scatter(mom_path[-1, 0], mom_path[-1, 1], c='blue', s=70, label='Momentum End', zorder=3)
vals = np.array([styblinski_tang_fn(x0, y0) for x0, y0 in mins2d])
mask = np.ones(len(mins2d), dtype=bool)
mask[np.argmin(vals)] = False
if np.any(mask):
    plt.scatter(mins2d[mask, 0], mins2d[mask, 1],
                marker='v', s=120, edgecolor='k', facecolor='white',
                label='Local minima', zorder=4)
plt.scatter(gmin_pt[0], gmin_pt[1], marker='*', s=220, edgecolor='k',
            facecolor='gold', label=f'Global min ({gmin_pt[0]:.4f}, {gmin_pt[1]:.4f})\n f={gmin_val:.4f}', zorder=5)

plt.title("SGD vs Momentum on Styblinski–Tang: Escaping a Local Minimum")
plt.xlabel("x"); plt.ylabel("y")
plt.legend(loc='lower right'); plt.grid(alpha=0.3); plt.tight_layout()
plt.show()
</code></pre>

</div>

<h3 id="computational-cost-of-momentum">Computational Cost of Momentum</h3>

<p>So while momentum is great and improves training, let’s look at the change. In SGD, we have:</p>

<ul>
  <li>for each parameter $\theta_i$:
    <ul>
      <li>parameter itself</li>
      <li>gradient $\nabla_{\theta_{i}} L$</li>
    </ul>
  </li>
</ul>

<p>But now that we’re carrying around velocity $v_i$ for each parameter, we store:</p>

<ul>
  <li>memory wise:
    <ul>
      <li>one extra tensor the same size as $\theta$ (i.e. one extra value per parameter)</li>
    </ul>
  </li>
  <li>comp wise:
    <ul>
      <li>this is relatively inexpensive given it’s basically one more computation to make</li>
      <li>but a good callout is that it’s not free</li>
    </ul>
  </li>
</ul>

<p>Again, even if you didn’t read the code above and just ran it, note this part:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">run_momentum</span><span class="p">(</span><span class="n">theta0</span><span class="p">,</span> <span class="n">eta</span><span class="o">=</span><span class="mf">0.02</span><span class="p">,</span> <span class="n">beta</span><span class="o">=</span><span class="mf">0.90</span><span class="p">,</span> <span class="n">steps</span><span class="o">=</span><span class="mi">1200</span><span class="p">):</span>
    <span class="n">theta</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">theta0</span><span class="p">,</span> <span class="nb">float</span><span class="p">)</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">theta</span><span class="p">)</span>
    <span class="n">path</span> <span class="o">=</span> <span class="p">[</span><span class="n">theta</span><span class="p">.</span><span class="n">copy</span><span class="p">()]</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">steps</span><span class="p">):</span>
        <span class="n">grad</span> <span class="o">=</span> <span class="n">styblinski_tang_grad</span><span class="p">(</span><span class="o">*</span><span class="n">theta</span><span class="p">)</span>
        <span class="n">v</span> <span class="o">=</span> <span class="n">beta</span> <span class="o">*</span> <span class="n">v</span> <span class="o">-</span> <span class="n">eta</span> <span class="o">*</span> <span class="n">grad</span>
        <span class="n">theta</span> <span class="o">=</span> <span class="n">theta</span> <span class="o">+</span> <span class="n">v</span>
        <span class="n">path</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">theta</span><span class="p">.</span><span class="n">copy</span><span class="p">())</span>
    <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">path</span><span class="p">)</span>
</code></pre></div></div>

<p>That <code class="language-plaintext highlighter-rouge">v</code> didn’t exist before with standard SGD.</p>

<p>This is a general tradeoff we’ll need to keep in mind with optimizer design. At the massive scale of modern training, every extra tensor and operation has a significant impact, leading to real $ signs.</p>
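<p>To make that concrete, here’s a back-of-the-envelope sketch. The model size and fp32-state assumption are mine, purely for illustration:</p>

```python
# Back-of-the-envelope optimizer state memory, assuming fp32 (4 bytes) state.
n_params = 7_000_000_000           # e.g. a hypothetical 7B-parameter model
bytes_per_param = 4

weights_gb = n_params * bytes_per_param / 1e9
sgd_extra_gb = 0 * weights_gb       # SGD: no extra state
momentum_extra_gb = 1 * weights_gb  # momentum: one velocity tensor, same shape as theta
adam_extra_gb = 2 * weights_gb      # Adam (coming up): first + second moment tensors

print(f"weights:  {weights_gb:.0f} GB")     # 28 GB
print(f"momentum: +{momentum_extra_gb:.0f} GB of state")  # +28 GB
print(f"Adam:     +{adam_extra_gb:.0f} GB of state")      # +56 GB
```

So the "one extra tensor" of momentum literally doubles optimizer-adjacent memory, before we even get to Adam's two moment buffers.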

<h3 id="variations">Variations</h3>

<p>I won’t go into these in detail, but as with everything, there are numerous variations.</p>

<ul>
  <li><a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Momentum">Nesterov momentum (a.k.a NAG)</a></li>
</ul>

<h2 id="adaptive-learning-rates-adagrad--rmsprop">Adaptive Learning Rates (AdaGrad / RMSProp)</h2>

<p>Great, so momentum is going to help us smooth learning. The next area of improvement people focused on was $\eta$. It sucks that it’s the same for every parameter, so the whole idea was to make the learning rate adaptive per parameter.</p>

<p>This section is where the math starts to get a bit more interesting.</p>

<h3 id="adagrad-2010">AdaGrad (2010)</h3>

<p>Adaptive gradient came first. <a href="https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf">Here’s the original paper</a> written in 2010 by John Duchi, Elad Hazan, and Yoram Singer.</p>

<p>The general idea is:</p>

<ul>
  <li>we keep track of how large each parameter’s past gradients have been</li>
  <li>we use the history to scale down updates for params that have seen a lot of gradient action</li>
</ul>

<p>So the core idea here is that we’re going to track the <strong>sum of each parameter’s squared gradients over time</strong>. This helps with issues like vanishing and exploding gradients (which were also an annoyance in <a href="/2025/teaching-a-computer-to-write/"><strong>Teaching a Computer How to Write</strong></a>).</p>

<p>In other words,</p>

\[r_{t,i} = \sum_{k=1}^t g_{k,i}^2\]

<p>So basically $r_{t,i}$, for the $i$th parameter at time $t$, tells you how much gradient “energy” that coordinate has seen so far.</p>

<p>Then our <strong>update rule</strong> rescales the learning rate for each parameter coordinate:</p>

\[\theta_{t+1, i} = \theta_{t,i} - \frac{\eta}{\sqrt{r_{t,i}} + \varepsilon} g_{t,i}\]

<p>This can be written in a vectorized format like:</p>

\[\theta_{t+1} = \theta_{t} - \eta D_{t}^{-1/2} g_{t}\]

<p>where $D_t = \text{diag}(r_t)$ and each diagonal element corresponds to one coordinate’s cumulative gradient magnitude. So we’re basically folding the per-coordinate index $i$ into the shape of the vectors themselves.</p>

<p>Again, DL loves big matrices.</p>
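<p>A tiny check (my own sketch, ignoring $\varepsilon$) that the matrix form really is just the per-coordinate rule:</p>

```python
import numpy as np

# The vectorized update eta * D_t^{-1/2} g with D_t = diag(r_t)
# is just the elementwise rule eta * g / sqrt(r_t).
rng = np.random.default_rng(42)
g = rng.normal(size=4)             # a pretend gradient
r = rng.uniform(1.0, 5.0, size=4)  # accumulated squared gradients (positive)
eta = 0.1

D = np.diag(r)
# sqrt of a diagonal matrix is elementwise, so inv(sqrt(D)) = diag(1/sqrt(r))
matrix_form = eta * np.linalg.inv(np.sqrt(D)) @ g
elementwise = eta * g / np.sqrt(r)

print(np.allclose(matrix_form, elementwise))  # True
```

In practice everyone implements the elementwise version; the diagonal-matrix form is just the notation that generalizes nicely.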

<p>I am not going to try and do a 3D visualization given those take me awhile to get to an acceptable place.</p>

<div class="interactive-python">
<pre><code class="language-python">
import numpy as np
import matplotlib.pyplot as plt
from sympy import Matrix, pprint, init_printing
from itertools import product
from IPython.display import display
init_printing(use_unicode=True)

def styblinski_tang_fn(x: float, y: float) -&gt; float:
    return 0.5 * ((x**4 - 16 * x**2 + 5 * x) + (y**4 - 16 * y**2 + 5 * y))

def styblinski_tang_grad(x: float, y: float) -&gt; np.ndarray:
    dfx = 2 * x**3 - 16 * x + 2.5
    dfy = 2 * y**3 - 16 * y + 2.5
    return np.array([dfx, dfy], dtype=float)

def stationary_points_and_global_min():
    roots = np.roots([2.0, 0.0, -16.0, 2.5])
    roots = np.real(roots[np.isreal(roots)])
    minima_1d = [r for r in roots if (6*r*r - 16) &gt; 0]
    mins2d = np.array(list(product(minima_1d, repeat=2)), dtype=float)
    vals = np.array([styblinski_tang_fn(x, y) for x, y in mins2d])
    gidx = np.argmin(vals)
    return mins2d, mins2d[gidx], vals[gidx]

def run_sgd(theta0, eta=0.02, steps=1200):
    theta = np.array(theta0, float)
    path = [theta.copy()]
    for _ in range(steps):
        theta -= eta * styblinski_tang_grad(*theta)
        path.append(theta.copy())
    return np.array(path)

def run_momentum(theta0, eta=0.02, beta=0.90, steps=1200):
    theta = np.array(theta0, float)
    v = np.zeros_like(theta)
    path = [theta.copy()]
    for _ in range(steps):
        grad = styblinski_tang_grad(*theta)
        v = beta * v - eta * grad
        theta = theta + v
        path.append(theta.copy())
    return np.array(path)

def run_adagrad(theta0, eta=0.40, eps=1e-8, steps=1200):
    """
    r_t &lt;- r_{t-1} + g_t^2
    theta &lt;- theta - (eta / (sqrt(r_t) + eps)) * g_t
    """
    theta = np.array(theta0, float)
    r = np.zeros_like(theta)         
    path = [theta.copy()]
    for step in range(steps):
        g = styblinski_tang_grad(*theta)
        r = r + g * g
        lr = eta / (np.sqrt(r) + eps) # elementwise effective LR
        if step % 100 == 0 and step &lt; 600:
            D = np.diag(r)
            print(f"\nStep {step}:  Dt = diag(r_step)")
            display(Matrix(D))
        if step == steps - 1:
            D = np.diag(r)
            print(f"\nFinal Step {step}:  Dt = diag(r_step)")
            display(Matrix(D))
        theta = theta - lr * g
        path.append(theta.copy())
    return np.array(path)

"""params"""
theta_start = np.array([4.1, 4.5], dtype=float)
eta = 0.02
beta = 0.90
steps = 1200
eta_adagrad = 0.40
eps_adagrad = 1e-8

sgd_path = run_sgd(theta_start, eta=eta, steps=steps)
mom_path = run_momentum(theta_start, eta=eta, beta=beta, steps=steps)
ada_path = run_adagrad(theta_start, eta=eta_adagrad, eps=eps_adagrad, steps=steps)
mins2d, gmin_pt, gmin_val = stationary_points_and_global_min()

"""viz"""
x = y = np.linspace(-5, 5, 400)
X, Y = np.meshgrid(x, y)
Z = styblinski_tang_fn(X, Y)

plt.figure(figsize=(8, 7))
cs = plt.contour(X, Y, Z, levels=50, alpha=0.85)   # (kept close; removed explicit cmap for portability)
plt.clabel(cs, inline=True, fmt="%.0f", fontsize=7)

plt.plot(sgd_path[:, 0], sgd_path[:, 1], 'r.-', lw=1.5, ms=3, label='SGD')
plt.plot(mom_path[:, 0], mom_path[:, 1], 'b.-', lw=1.5, ms=3,
         label=f'Momentum (β={beta})')
plt.plot(ada_path[:, 0], ada_path[:, 1], 'g.-', lw=1.5, ms=3,
         label=f'AdaGrad (η₀={eta_adagrad})')

plt.scatter(sgd_path[0, 0], sgd_path[0, 1], c='orange', s=80, label='Start', zorder=3)
plt.scatter(sgd_path[-1, 0], sgd_path[-1, 1], c='red', s=70, label='SGD End', zorder=3)
plt.scatter(mom_path[-1, 0], mom_path[-1, 1], c='blue', s=70, label='Momentum End', zorder=3)
plt.scatter(ada_path[-1, 0], ada_path[-1, 1], c='green', s=70, label='AdaGrad End', zorder=3)

vals = np.array([styblinski_tang_fn(x0, y0) for x0, y0 in mins2d])
mask = np.ones(len(mins2d), dtype=bool)
mask[np.argmin(vals)] = False
if np.any(mask):
    plt.scatter(mins2d[mask, 0], mins2d[mask, 1],
                marker='v', s=120, edgecolor='k', facecolor='white',
                label='Local minima', zorder=4)
plt.scatter(gmin_pt[0], gmin_pt[1], marker='*', s=220, edgecolor='k',
            facecolor='gold', label=f'Global min ({gmin_pt[0]:.4f}, {gmin_pt[1]:.4f})\n f={gmin_val:.4f}', zorder=5)

plt.title("SGD vs Momentum vs AdaGrad on Styblinski–Tang")
plt.xlabel("x"); plt.ylabel("y")
plt.legend(loc='lower right'); plt.grid(alpha=0.3); plt.tight_layout()
plt.show()
</code></pre>

</div>

<h4 id="variations-1">Variations</h4>

<p>Arguably, RMSProp is a variation of AdaGrad, but I decided to split it out given how often RMSProp is discussed.</p>

<p>However, similar to AdaGrad, there’s also</p>

<ul>
  <li><a href="https://optimization.cbe.cornell.edu/index.php?title=AdaGrad#AdaDelta">AdaDelta</a>
    <ul>
      <li>basically does an exponential weighted average</li>
    </ul>
  </li>
</ul>

<h3 id="rmsprop-2012">RMSProp (2012)</h3>

<p>RMSProp, or Root Mean Square Propagation, allows the effective learning rate to increase or decrease; it does away with the effective LR shrinking monotonically.</p>

<p>Confusingly but importantly, RMSProp is identical to AdaDelta, just without the running average over parameter updates.</p>

<p>The whole notion of RMSProp is that we keep an <strong>exponential weighted moving average</strong> (EMA) of recent gradients per parameter.</p>

<p>We scale the raw gradient by the inverse root of that EMA.</p>

<p>In other words,</p>

\[\begin{align}
s_t &amp;= \rho s_{t-1} + (1-\rho) g_t^2 \\
\theta_{t+1} &amp;= \theta_t - \eta \frac{g_t}{\sqrt{s_t} + \varepsilon}
\end{align}\]

<p>Sometimes people use $\beta$ instead of $\rho$. But here is what these mean:</p>

<ul>
  <li>$s_t$ - accumulated moving average of squared gradients at time $t$</li>
  <li>$\rho$ - the decay rate, typically between 0.9 and 0.99</li>
  <li>$g_t$ - still represents our gradient at time $t$</li>
</ul>

<p>And once again, for matrix math, similar to AdaGrad we can play a similar game with vectorizing it:</p>

<p>\(\theta_{t+1} = \theta_{t} - \eta \tilde{D}_t^{-\frac{1}{2}} g_t\)
where 
\(\tilde{D}_t = \text{diag}(s_t + \varepsilon)\)</p>

<p>So the net result is that large, consistently steep coordinates get downscaled, and quiet coordinates get a healthier step. Because we use a moving window, step sizes don’t vanish over time.</p>

<p>The EMA is meant to focus on recent gradients, maintaining a steady effective learning rate and preventing premature decay. With AdaGrad, the effective LR shrinks monotonically and can stall on long runs.</p>
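<p>You can see that difference with a tiny experiment (my own sketch): feed both accumulators a constant gradient and compare the effective learning rate $\eta / \sqrt{\cdot}$ after many steps.</p>

```python
import numpy as np

# Constant gradient g = 1: AdaGrad's accumulator r grows without bound, so
# its effective LR shrinks toward 0. RMSProp's EMA s converges to g^2 = 1,
# so its effective LR settles near eta.
eta, rho, g = 0.1, 0.9, 1.0
r, s = 0.0, 0.0
for _ in range(1000):
    r += g * g                        # AdaGrad: sum of squared gradients
    s = rho * s + (1 - rho) * g * g   # RMSProp: EMA of squared gradients

print(eta / np.sqrt(r))  # ~0.003, still shrinking with every step
print(eta / np.sqrt(s))  # ~0.1, stable
```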

<p>Again, I am not going to try and do a 3D visualization given those take me awhile to get to an acceptable place.</p>

<div class="interactive-python">
<pre><code class="language-python">
import numpy as np
import matplotlib.pyplot as plt
from itertools import product
from sympy import Matrix
from IPython.display import display

def styblinski_tang_fn(x: float, y: float) -&gt; float:
    return 0.5 * ((x**4 - 16 * x**2 + 5 * x) + (y**4 - 16 * y**2 + 5 * y))


def styblinski_tang_grad(x: float, y: float) -&gt; np.ndarray:
    dfx = 2 * x**3 - 16 * x + 2.5
    dfy = 2 * y**3 - 16 * y + 2.5
    return np.array([dfx, dfy], dtype=float)


def stationary_points_and_global_min():
    roots = np.roots([2.0, 0.0, -16.0, 2.5])
    roots = np.real(roots[np.isreal(roots)])
    minima_1d = [r for r in roots if (6 * r * r - 16) &gt; 0]
    mins2d = np.array(list(product(minima_1d, repeat=2)), dtype=float)
    vals = np.array([styblinski_tang_fn(x, y) for x, y in mins2d])
    gidx = np.argmin(vals)
    return mins2d, mins2d[gidx], vals[gidx]


def run_sgd(theta0, eta=0.02, steps=1200):
    theta = np.array(theta0, float)
    path = [theta.copy()]
    for _ in range(steps):
        theta -= eta * styblinski_tang_grad(*theta)
        path.append(theta.copy())
    return np.array(path)


def run_momentum(theta0, eta=0.02, beta=0.90, steps=1200):
    theta = np.array(theta0, float)
    v = np.zeros_like(theta)
    path = [theta.copy()]
    for _ in range(steps):
        g = styblinski_tang_grad(*theta)
        v = beta * v - eta * g
        theta = theta + v
        path.append(theta.copy())
    return np.array(path)


def run_adagrad(theta0, eta=0.40, eps=1e-8, steps=1200):
    theta = np.array(theta0, float)
    r = np.zeros_like(theta)
    path = [theta.copy()]
    for _ in range(steps):
        g = styblinski_tang_grad(*theta)
        r = r + g * g
        lr_eff = eta / (np.sqrt(r) + eps)
        theta = theta - lr_eff * g
        path.append(theta.copy())
    return np.array(path)

def run_rmsprop(theta0, eta=1e-2, rho=0.9, eps=1e-8, steps=1200):
    """
    s_t = rho * s_{t-1} + (1 - rho) * g_t^2
    theta &lt;- theta - eta * g_t / (sqrt(s_t) + eps)
    """
    theta = np.array(theta0, float)
    s = np.zeros_like(theta)
    path = [theta.copy()]
    for step in range(steps):
        g = styblinski_tang_grad(*theta)
        s = rho * s + (1 - rho) * (g * g)
        if step % 100 == 0 and step &lt; 600:
            S = np.diag(s)
            print(f"\nStep {step}:  s_t (EMA of squared gradients)")
            display(Matrix(S))
        if step == steps - 1:
            S = np.diag(s)
            print(f"\nFinal Step {step}:  s_t (EMA of squared gradients)")
            display(Matrix(S))
        theta = theta - eta * g / (np.sqrt(s) + eps)
        path.append(theta.copy())
    return np.array(path)

def run_rmsprop_centered(theta0, eta=1e-2, rho=0.9, eps=1e-8, steps=1200):
    """
    m_t = rho * m_{t-1} + (1 - rho) * g_t
    s_t = rho * s_{t-1} + (1 - rho) * g_t^2
    denom = sqrt(s_t - m_t^2) + eps   # variance-based
    """
    theta = np.array(theta0, float)
    m = np.zeros_like(theta)
    s = np.zeros_like(theta)
    path = [theta.copy()]
    for _ in range(steps):
        g = styblinski_tang_grad(*theta)
        m = rho * m + (1 - rho) * g
        s = rho * s + (1 - rho) * (g * g)
        denom = np.sqrt(np.maximum(s - m * m, 0.0)) + eps
        theta = theta - eta * g / denom
        path.append(theta.copy())
    return np.array(path)


theta_start = np.array([4.1, 4.5], dtype=float)
steps = 1200

eta_sgd = 0.02
eta_mom, beta = 0.02, 0.90
eta_adagrad = 0.40
eta_rms, rho, eps = 1e-2, 0.9, 1e-8
eta_rms_c = 1e-2

sgd_path = run_sgd(theta_start, eta=eta_sgd, steps=steps)
mom_path = run_momentum(theta_start, eta=eta_mom, beta=beta, steps=steps)
ada_path = run_adagrad(theta_start, eta=eta_adagrad, steps=steps)
rms_path = run_rmsprop(theta_start, eta=eta_rms, rho=rho, eps=eps, steps=steps)
rmsc_path = run_rmsprop_centered(theta_start, eta=eta_rms_c, rho=rho, eps=eps, steps=steps)

mins2d, gmin_pt, gmin_val = stationary_points_and_global_min()

x = y = np.linspace(-5, 5, 400)
X, Y = np.meshgrid(x, y)
Z = styblinski_tang_fn(X, Y)

plt.figure(figsize=(9, 8))
cs = plt.contour(X, Y, Z, levels=50, alpha=0.85)
plt.clabel(cs, inline=True, fmt="%.0f", fontsize=7)

plt.plot(sgd_path[:, 0], sgd_path[:, 1], '.-', lw=1.2, ms=3, label='SGD')
plt.plot(mom_path[:, 0], mom_path[:, 1], '.-', lw=1.2, ms=3, label=f'Momentum (β={beta})')
plt.plot(ada_path[:, 0], ada_path[:, 1], '.-', lw=1.2, ms=3, label='AdaGrad')
plt.plot(rms_path[:, 0], rms_path[:, 1], '.-', lw=1.2, ms=3, label=f'RMSProp (ρ={rho})')
plt.plot(rmsc_path[:, 0], rmsc_path[:, 1], '.-', lw=1.2, ms=3, label='RMSProp (centered)')

plt.scatter(sgd_path[0, 0], sgd_path[0, 1], s=80, label='Start', zorder=3)
plt.scatter(sgd_path[-1, 0], sgd_path[-1, 1], s=60, label='SGD End', zorder=3)
plt.scatter(mom_path[-1, 0], mom_path[-1, 1], s=60, label='Momentum End', zorder=3)
plt.scatter(ada_path[-1, 0], ada_path[-1, 1], s=60, label='AdaGrad End', zorder=3)
plt.scatter(rms_path[-1, 0], rms_path[-1, 1], s=60, label='RMSProp End', zorder=3)
plt.scatter(rmsc_path[-1, 0], rmsc_path[-1, 1], s=60, label='RMSProp (centered) End', zorder=3)

vals = np.array([styblinski_tang_fn(x0, y0) for x0, y0 in mins2d])
mask = np.ones(len(mins2d), dtype=bool)
mask[np.argmin(vals)] = False
if np.any(mask):
    plt.scatter(mins2d[mask, 0], mins2d[mask, 1],
                marker='v', s=120, edgecolor='k', facecolor='white',
                label='Local minima', zorder=4)
plt.scatter(gmin_pt[0], gmin_pt[1], marker='*', s=220, edgecolor='k',
            facecolor='gold', label=f'Global min ({gmin_pt[0]:.4f}, {gmin_pt[1]:.4f})\n f={gmin_val:.4f}', zorder=5)

plt.title("SGD vs Momentum vs AdaGrad vs RMSProp (and Centered) on Styblinski–Tang")
plt.xlabel("x")
plt.ylabel("y")
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
</code></pre>

</div>

<p>Again, the important code part is here:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">run_rmsprop</span><span class="p">(</span><span class="n">theta0</span><span class="p">,</span> <span class="n">eta</span><span class="o">=</span><span class="mf">1e-2</span><span class="p">,</span> <span class="n">rho</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span> <span class="n">eps</span><span class="o">=</span><span class="mf">1e-8</span><span class="p">,</span> <span class="n">steps</span><span class="o">=</span><span class="mi">1200</span><span class="p">):</span>
    <span class="s">"""
    s_t = rho * s_{t-1} + (1 - rho) * g_t^2
    theta &lt;- theta - eta * g_t / (sqrt(s_t) + eps)
    """</span>
    <span class="n">theta</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">theta0</span><span class="p">,</span> <span class="nb">float</span><span class="p">)</span>
    <span class="n">s</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">theta</span><span class="p">)</span>
    <span class="n">path</span> <span class="o">=</span> <span class="p">[</span><span class="n">theta</span><span class="p">.</span><span class="n">copy</span><span class="p">()]</span>
    <span class="k">for</span> <span class="n">step</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">steps</span><span class="p">):</span>
        <span class="n">g</span> <span class="o">=</span> <span class="n">styblinski_tang_grad</span><span class="p">(</span><span class="o">*</span><span class="n">theta</span><span class="p">)</span>
        <span class="n">s</span> <span class="o">=</span> <span class="n">rho</span> <span class="o">*</span> <span class="n">s</span> <span class="o">+</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">rho</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">g</span> <span class="o">*</span> <span class="n">g</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">step</span> <span class="o">%</span> <span class="mi">100</span> <span class="o">==</span> <span class="mi">0</span> <span class="ow">and</span> <span class="n">step</span> <span class="o">&lt;</span> <span class="mi">600</span><span class="p">:</span>
            <span class="n">S</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">diag</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Step </span><span class="si">{</span><span class="n">step</span><span class="si">}</span><span class="s">:  s_t (EMA of squared gradients)"</span><span class="p">)</span>
            <span class="n">display</span><span class="p">(</span><span class="n">Matrix</span><span class="p">(</span><span class="n">S</span><span class="p">))</span>
        <span class="k">if</span> <span class="n">step</span> <span class="o">==</span> <span class="n">steps</span> <span class="o">-</span> <span class="mi">1</span><span class="p">:</span>
            <span class="n">S</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">diag</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Final Step </span><span class="si">{</span><span class="n">step</span><span class="si">}</span><span class="s">:  s_t (EMA of squared gradients)"</span><span class="p">)</span>
            <span class="n">display</span><span class="p">(</span><span class="n">Matrix</span><span class="p">(</span><span class="n">S</span><span class="p">))</span>
        <span class="n">theta</span> <span class="o">=</span> <span class="n">theta</span> <span class="o">-</span> <span class="n">eta</span> <span class="o">*</span> <span class="n">g</span> <span class="o">/</span> <span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="o">+</span> <span class="n">eps</span><span class="p">)</span>
        <span class="n">path</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">theta</span><span class="p">.</span><span class="n">copy</span><span class="p">())</span>
    <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">path</span><span class="p">)</span>
</code></pre></div></div>

<h4 id="variations-2">Variations</h4>

<ul>
  <li>RMSProp (centered)</li>
</ul>

<h2 id="bias-correction-finally-meeting-adam-optimizer-2015">Bias Correction (finally meeting Adam Optimizer, 2015)</h2>

<p>RMSProp is fantastic but still subject to getting caught in local minima.</p>

<p>Ok, finally, in 2015 <a href="https://arxiv.org/abs/1412.6980">Adam</a> was introduced. It basically marries the momentum idea with the second-moment scaling from RMSProp / AdaGrad. The key addition, however, is bias-correcting the EMAs, which start at zero and are therefore biased toward zero early on. The update uses the <strong>direction</strong> $\hat{m}_t$ and the <strong>scale</strong> $\sqrt{\hat{v}_t}$.</p>

<p>Mathematically, we now have:</p>

<ul>
  <li><strong>momentum</strong> part (exp avg of raw gradients, our first moment (i.e. understanding magnitude of gradient updates)) \(m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t\)</li>
  <li><strong>rms prop</strong> part (exp avg of squared gradients, our second moment (i.e. understanding energy / dispersion of gradient updates)) \(v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\)</li>
  <li><strong>bias correction</strong> part (new) - getting around the fact that both start from 0, so we divide by $ 1 - \beta_i^t $ \(\hat{m}_t = \frac{m_t}{1- \beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}\)</li>
</ul>

<p>with our final update being:</p>

\[\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\varepsilon}\]

<p>Again, same thing with the vectorization, we’re always just modifying our $D$ matrix:</p>

\[\theta_{t+1} = \theta_t - \eta D_{t}^{-\frac{1}{2}}\hat{m}_t, \quad D_t = \text{diag}(\hat{v}_t + \varepsilon)\]
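<p>To see why the bias correction matters, here is a minimal sketch (toy numbers of my own, assuming a constant gradient): at $t=1$ the raw EMA is only $(1-\beta_1)$ of the true gradient, and dividing by $1-\beta_1^t$ recovers it exactly.</p>

```python
beta1, beta2 = 0.9, 0.999
g = 2.0  # pretend the gradient is constant at 2.0

m = v = 0.0
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1**t)   # bias-corrected first moment
    v_hat = v / (1 - beta2**t)   # bias-corrected second moment
    print(t, round(m, 4), round(m_hat, 4))

# Output:
# 1 0.2 2.0    (raw m underestimates g=2.0 by 10x; m_hat is exact)
# 2 0.38 2.0
# 3 0.542 2.0
```

<p>For a constant gradient the corrected estimates are exact at every step, while the raw EMAs take dozens of steps to warm up; the same holds for $\hat{v}_t$, which equals $g^2$ throughout.</p>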

<h3 id="comparison-so-far">Comparison so Far</h3>

<p>I had ChatGPT create this table, which does a good job of capturing the nuances between them:</p>

<table>
  <thead>
    <tr>
      <th>Optimizer</th>
      <th>Tracks mean of gradients?</th>
      <th>Tracks mean of squared gradients?</th>
      <th>Bias correction?</th>
      <th>Update uses</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Momentum (Polyak)</strong></td>
      <td>✅ $m_t = \beta m_{t-1} + (1-\beta) g_t$</td>
      <td>❌</td>
      <td>❌</td>
      <td>$ \theta_{t+1} = \theta_t - \eta m_t $</td>
    </tr>
    <tr>
      <td><strong>RMSProp (Hinton)</strong></td>
      <td>❌</td>
      <td>✅ $s_t = \rho s_{t-1} + (1-\rho) g_t^2$</td>
      <td>❌</td>
      <td>$ \theta_{t+1} = \theta_t - \eta \dfrac{g_t}{\sqrt{s_t}+\varepsilon} $</td>
    </tr>
    <tr>
      <td><strong>Adam (Kingma &amp; Ba)</strong></td>
      <td>✅ $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$</td>
      <td>✅ $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$</td>
      <td>✅ divides by $1-\beta_i^t$</td>
      <td>$ \theta_{t+1} = \theta_t - \eta \dfrac{\hat m_t}{\sqrt{\hat v_t}+\varepsilon} $</td>
    </tr>
  </tbody>
</table>

<h3 id="plain-english">Plain English</h3>

<p>My plain-English understanding of what each step adds:</p>

<ul>
  <li>SGD - takes only the size of the current gradient into account</li>
  <li>SGD with momentum - adds smoothing via an EMA of past gradients (introduces $m_t$)</li>
  <li>RMSProp - scales each coordinate by a recent average of squared gradients rather than the full history</li>
  <li>Adam - combines both and adds bias correction to fix underestimation at early timesteps</li>
</ul>

<h3 id="viz">Viz</h3>
<div class="interactive-python">
<pre><code class="language-python">
import numpy as np
import matplotlib.pyplot as plt
from itertools import product
from sympy import Matrix
from IPython.display import display

def styblinski_tang_fn(x: float, y: float) -&gt; float:
    return 0.5 * ((x**4 - 16 * x**2 + 5 * x) + (y**4 - 16 * y**2 + 5 * y))


def styblinski_tang_grad(x: float, y: float) -&gt; np.ndarray:
    dfx = 2 * x**3 - 16 * x + 2.5
    dfy = 2 * y**3 - 16 * y + 2.5
    return np.array([dfx, dfy], dtype=float)


def stationary_points_and_global_min():
    roots = np.roots([2.0, 0.0, -16.0, 2.5])
    roots = np.real(roots[np.isreal(roots)])
    minima_1d = [r for r in roots if (6 * r * r - 16) &gt; 0]
    mins2d = np.array(list(product(minima_1d, repeat=2)), dtype=float)
    vals = np.array([styblinski_tang_fn(x, y) for x, y in mins2d])
    gidx = np.argmin(vals)
    return mins2d, mins2d[gidx], vals[gidx]


def run_sgd(theta0, eta=0.02, steps=1200):
    theta = np.array(theta0, float)
    path = [theta.copy()]
    for _ in range(steps):
        theta -= eta * styblinski_tang_grad(*theta)
        path.append(theta.copy())
    return np.array(path)


def run_momentum(theta0, eta=0.02, beta=0.90, steps=1200):
    theta = np.array(theta0, float)
    v = np.zeros_like(theta)
    path = [theta.copy()]
    for _ in range(steps):
        g = styblinski_tang_grad(*theta)
        v = beta * v - eta * g
        theta = theta + v
        path.append(theta.copy())
    return np.array(path)


def run_adagrad(theta0, eta=0.40, eps=1e-8, steps=1200):
    theta = np.array(theta0, float)
    r = np.zeros_like(theta)
    path = [theta.copy()]
    for _ in range(steps):
        g = styblinski_tang_grad(*theta)
        r = r + g * g
        lr_eff = eta / (np.sqrt(r) + eps)
        theta = theta - lr_eff * g
        path.append(theta.copy())
    return np.array(path)

def run_rmsprop(theta0, eta=1e-2, rho=0.9, eps=1e-8, steps=1200):
    """
    s_t = rho * s_{t-1} + (1 - rho) * g_t^2
    theta &lt;- theta - eta * g_t / (sqrt(s_t) + eps)
    """
    theta = np.array(theta0, float)
    s = np.zeros_like(theta)
    path = [theta.copy()]
    for step in range(steps):
        g = styblinski_tang_grad(*theta)
        s = rho * s + (1 - rho) * (g * g)
        theta = theta - eta * g / (np.sqrt(s) + eps)
        path.append(theta.copy())
    return np.array(path)

def run_rmsprop_centered(theta0, eta=1e-2, rho=0.9, eps=1e-8, steps=1200):
    """
    m_t = rho * m_{t-1} + (1 - rho) * g_t
    s_t = rho * s_{t-1} + (1 - rho) * g_t^2
    denom = sqrt(s_t - m_t^2) + eps   # variance-based
    """
    theta = np.array(theta0, float)
    m = np.zeros_like(theta)
    s = np.zeros_like(theta)
    path = [theta.copy()]
    for _ in range(steps):
        g = styblinski_tang_grad(*theta)
        m = rho * m + (1 - rho) * g
        s = rho * s + (1 - rho) * (g * g)
        denom = np.sqrt(np.maximum(s - m * m, 0.0)) + eps
        theta = theta - eta * g / denom
        path.append(theta.copy())
    return np.array(path)

def run_adam(theta0, eta=1e-2, beta1=0.9, beta2=0.999, eps=1e-8, steps=1200):
    theta = np.array(theta0, float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    path = [theta.copy()]
    for t in range(1, steps + 1):
        g = styblinski_tang_grad(*theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * (g * g)
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
        path.append(theta.copy())
    return np.array(path)

theta_start = np.array([4.1, 4.5], dtype=float)
steps = 1200

eta_sgd = 0.02
eta_mom, beta = 0.02, 0.90
eta_adagrad = 0.40
eta_rms, rho, eps = 1e-2, 0.9, 1e-8
eta_rms_c = 1e-2

sgd_path = run_sgd(theta_start, eta=eta_sgd, steps=steps)
mom_path = run_momentum(theta_start, eta=eta_mom, beta=beta, steps=steps)
ada_path = run_adagrad(theta_start, eta=eta_adagrad, steps=steps)
rms_path = run_rmsprop(theta_start, eta=eta_rms, rho=rho, eps=eps, steps=steps)
rmsc_path = run_rmsprop_centered(theta_start, eta=eta_rms_c, rho=rho, eps=eps, steps=steps)
adam_path = run_adam(theta_start)
mins2d, gmin_pt, gmin_val = stationary_points_and_global_min()

x = y = np.linspace(-5, 5, 400)
X, Y = np.meshgrid(x, y)
Z = styblinski_tang_fn(X, Y)

plt.figure(figsize=(9, 8))
cs = plt.contour(X, Y, Z, levels=50, alpha=0.85)
plt.clabel(cs, inline=True, fmt="%.0f", fontsize=7)

plt.plot(sgd_path[:, 0], sgd_path[:, 1], '.-', lw=1.2, ms=3, label='SGD')
plt.plot(mom_path[:, 0], mom_path[:, 1], '.-', lw=1.2, ms=3, label=f'Momentum (β={beta})')
plt.plot(ada_path[:, 0], ada_path[:, 1], '.-', lw=1.2, ms=3, label='AdaGrad')
plt.plot(rms_path[:, 0], rms_path[:, 1], '.-', lw=1.2, ms=3, label=f'RMSProp (ρ={rho})')
plt.plot(rmsc_path[:, 0], rmsc_path[:, 1], '.-', lw=1.2, ms=3, label='RMSProp (centered)')
plt.plot(adam_path[:, 0], adam_path[:, 1], '.-', lw=1.2, ms=3, label='Adam')

plt.scatter(sgd_path[0, 0], sgd_path[0, 1], s=80, label='Start', zorder=3)
plt.scatter(sgd_path[-1, 0], sgd_path[-1, 1], s=60, label='SGD End', zorder=3)
plt.scatter(mom_path[-1, 0], mom_path[-1, 1], s=60, label='Momentum End', zorder=3)
plt.scatter(ada_path[-1, 0], ada_path[-1, 1], s=60, label='AdaGrad End', zorder=3)
plt.scatter(rms_path[-1, 0], rms_path[-1, 1], s=60, label='RMSProp End', zorder=3)
plt.scatter(rmsc_path[-1, 0], rmsc_path[-1, 1], s=60, label='RMSProp (centered) End', zorder=3)
plt.scatter(adam_path[-1, 0], adam_path[-1, 1], s=60, label='Adam End', zorder=3)

vals = np.array([styblinski_tang_fn(x0, y0) for x0, y0 in mins2d])
mask = np.ones(len(mins2d), dtype=bool)
mask[np.argmin(vals)] = False
if np.any(mask):
    plt.scatter(mins2d[mask, 0], mins2d[mask, 1],
                marker='v', s=120, edgecolor='k', facecolor='white',
                label='Local minima', zorder=4)
plt.scatter(gmin_pt[0], gmin_pt[1], marker='*', s=220, edgecolor='k',
            facecolor='gold', label=f'Global min ({gmin_pt[0]:.4f}, {gmin_pt[1]:.4f})\n f={gmin_val:.4f}', zorder=5)

plt.title("SGD vs Momentum vs AdaGrad vs RMSProp (and Centered) vs Adam on Styblinski–Tang")
plt.xlabel("x")
plt.ylabel("y")
plt.legend(loc='upper left')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
</code></pre>

</div>

<h2 id="weight-decay-coupling-the-w-in-adamw-2017">Weight Decay Coupling (the “W” in AdamW, 2017)</h2>

<p>A very slight distinction, but this was a key change that led to better generalization. AdamW was used to train BERT, GPT, and others, and most modern frameworks (PyTorch, TensorFlow, JAX ecosystems) now provide it out of the box.</p>

<h3 id="l2-regularization">L2 Regularization</h3>

<p>Ok, so before we get to AdamW cleanly, we have to discuss L2 regularization (the penalty commonly written as $\frac{\lambda}{2} \| \theta \|^2$). The idea is that we add a penalty to our loss function so that the model doesn’t overfit: we want to minimize both the loss and the size of the weights.</p>

<p>When you use Adam, you compute the gradient of your total loss. So basically from start to finish, walking through:</p>

\[\begin{align}

L_{total} ( \theta ) &amp;= L_{data} (\theta) + \frac{\lambda}{2} \| \theta \|^2 \\

\nabla_{\theta} L_{total} (\theta) &amp;= \nabla_{\theta} L_{data} (\theta) + \lambda \theta

\end{align}\]

<p>This means that every gradient update has two parts:</p>

<ol>
  <li>a data term</li>
  <li>a regularization term</li>
</ol>

<p>And again, $\lambda$ is the regularization strength: basically how much we want to penalize large weights.</p>

<p>Now let’s incorporate this into Adam, where we normally compute the gradient of our total loss:</p>

\[g_t = \nabla_{\theta} L_{total} (\theta_t) = \nabla_{\theta} L_{data} (\theta_t) + \lambda \theta_t\]

<p>But the downside is that the $+ \lambda \theta_t$ term becomes part of the gradient update. That’s an issue for us because Adam does its adaptive scaling magic:</p>

\[\theta_{t+1} = \theta_{t} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}\]

<p>but that adaptive scaling portion also now includes the $\lambda \theta_t$ portion meaning some weights get decayed more than others, all depending on their individual $v_t$ values.</p>
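<p>A minimal single-step sketch of the difference (toy numbers of my own, not from the runs elsewhere in this post): with coupled L2, the decay term $\lambda \theta$ rides through the adaptive scaling and gets divided by each coordinate’s $\sqrt{v}$; with decoupled decay, every weight shrinks by the same fraction $\eta \lambda$.</p>

```python
import numpy as np

theta = np.array([1.0, 1.0])
g_data = np.array([10.0, 0.1])  # one steep coordinate, one quiet one
eta, lam, eps = 0.1, 0.01, 1e-8

# Coupled (Adam + L2): lambda * theta is folded into the gradient,
# so its contribution is divided by sqrt(v), unevenly per coordinate.
g = g_data + lam * theta
v = g * g  # single-step stand-in for v_hat
decay_coupled = eta * (lam * theta) / (np.sqrt(v) + eps)

# Decoupled (AdamW): decay is applied outside the adaptive scaling.
decay_decoupled = eta * lam * theta

print(decay_coupled)    # roughly [0.0001, 0.0091], depends on each v
print(decay_decoupled)  # [0.001, 0.001], a uniform shrink
```

<p>Under the coupled scheme the steep coordinate is barely decayed while the quiet one is over-decayed; decoupling restores a uniform penalty, which is the whole point of AdamW.</p>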

<div class="markdown-quiz" data-quiz-id="adam-w-vs-adam-pt1">
  <span class="markdown-quiz__eyebrow">
    Knowledge check
  </span>

  
    <h3 class="markdown-quiz__question">If we're not applying L2 regularization as part of our loss function, is AdamW going to behave any differently from Adam?</h3>
  

  
    <p class="markdown-quiz__prompt">Select the answer</p>
  

  
  <ol class="markdown-quiz__choices">
    
      
      
      
        
        
        
        <li class="markdown-quiz__option" data-correct="false" data-feedback="that&#39;s not quite right! Both the loss function and the optimizer can apply L2 regularization. Even if it&#39;s not explicitly in our loss function, having the L2 regularization in the AdamW optimizer can help to penalize large weights">
          <label for="adam-w-vs-adam-pt1-choice-1">
            <input type="radio" name="adam-w-vs-adam-pt1" id="adam-w-vs-adam-pt1-choice-1" value="No, L2 regularization is linked to AdamW so if it isn&#39;t in our loss function, then we don&#39;t need AdamW" />
            <span>No, L2 regularization is linked to AdamW so if it isn't in our loss function, then we don't need AdamW</span>
          </label>
        </li>
      
    
      
      
      
        
        
        
        <li class="markdown-quiz__option" data-correct="true" data-feedback="Yes, there are multiple points of injection for L2 regularization, and using AdamW can still be beneficial even if your loss function / training code doesn&#39;t include an explicit L2 term.">
          <label for="adam-w-vs-adam-pt1-choice-2">
            <input type="radio" name="adam-w-vs-adam-pt1" id="adam-w-vs-adam-pt1-choice-2" value="Yes, it can still be beneficial" />
            <span>Yes, it can still be beneficial</span>
          </label>
        </li>
      
    
  </ol>

  <div class="markdown-quiz__actions">
    <button class="markdown-quiz__submit" type="button">
      Check answer
    </button>
    <button class="markdown-quiz__reset" type="button">
      Try again
    </button>
  </div>

  <div class="markdown-quiz__feedback" role="status" aria-live="polite"></div>

  
    <div class="markdown-quiz__explanation" hidden="">
      L2 regularization is a broader concept, and whether we include it in the loss function or as part of the optimizer are distinct choices.
    </div>
  
</div>

<p>Here’s another one. Note how we can call the AdamW optimizer in PyTorch:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">optimizer</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">AdamW</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="n">learning_rate</span><span class="p">,</span> <span class="n">weight_decay</span><span class="o">=</span><span class="n">this_is_equiv_to_lambda</span><span class="p">)</span>
</code></pre></div></div>

<div class="markdown-quiz" data-quiz-id="adam-w-vs-adam-pt2">
  <span class="markdown-quiz__eyebrow">
    Knowledge check
  </span>

  
    <h3 class="markdown-quiz__question">If I explicitly wanted to use Adam instead of AdamW, could I modify the above PyTorch code to get that (and if so, how)?</h3>
  

  
    <p class="markdown-quiz__prompt">Select the answer</p>
  

  
  <ol class="markdown-quiz__choices">
    
      
      
      
        
        
        
        <li class="markdown-quiz__option" data-correct="false" data-feedback="Revisit the mathematical formulas above">
          <label for="adam-w-vs-adam-pt2-choice-1">
            <input type="radio" name="adam-w-vs-adam-pt2" id="adam-w-vs-adam-pt2-choice-1" value="trick question! you can&#39;t" />
            <span>trick question! you can't</span>
          </label>
        </li>
      
    
      
      
      
        
        
        
        <li class="markdown-quiz__option" data-correct="false" data-feedback="Nope! No other parameters">
          <label for="adam-w-vs-adam-pt2-choice-2">
            <input type="radio" name="adam-w-vs-adam-pt2" id="adam-w-vs-adam-pt2-choice-2" value="yes, there&#39;s another AdamW parameter in pytorch that enables vanilla Adam" />
            <span>yes, there's another AdamW parameter in pytorch that enables vanilla Adam</span>
          </label>
        </li>
      
    
      
      
      
        
        
        
        <li class="markdown-quiz__option" data-correct="true" data-feedback="exactly, if we drop our lambda to 0 then the math falls out and Adam is equiv to AdamW">
          <label for="adam-w-vs-adam-pt2-choice-3">
            <input type="radio" name="adam-w-vs-adam-pt2" id="adam-w-vs-adam-pt2-choice-3" value="yes, we can change the weight_decay" />
            <span>yes, we can change the weight_decay</span>
          </label>
        </li>
      
    
  </ol>

  <div class="markdown-quiz__actions">
    <button class="markdown-quiz__submit" type="button">
      Check answer
    </button>
    <button class="markdown-quiz__reset" type="button">
      Try again
    </button>
  </div>

  <div class="markdown-quiz__feedback" role="status" aria-live="polite"></div>

  
    <div class="markdown-quiz__explanation" hidden="">
      While subtle, decoupling the weight-decay penalty from the adaptive scaling has resulted in massive wins for the AdamW optimizer
    </div>
  
</div>

<h3 id="viz-1">Viz</h3>

<div class="interactive-python">
<pre><code class="language-python">
import numpy as np
import matplotlib.pyplot as plt
from itertools import product
from sympy import Matrix
from IPython.display import display

def styblinski_tang_fn(x: float, y: float) -&gt; float:
    return 0.5 * ((x**4 - 16 * x**2 + 5 * x) + (y**4 - 16 * y**2 + 5 * y))

def styblinski_tang_grad(x: float, y: float) -&gt; np.ndarray:
    dfx = 2 * x**3 - 16 * x + 2.5
    dfy = 2 * y**3 - 16 * y + 2.5
    return np.array([dfx, dfy], dtype=float)

def stationary_points_and_global_min():
    roots = np.roots([2.0, 0.0, -16.0, 2.5])
    roots = np.real(roots[np.isreal(roots)])
    minima_1d = [r for r in roots if (6 * r * r - 16) &gt; 0]
    mins2d = np.array(list(product(minima_1d, repeat=2)), dtype=float)
    vals = np.array([styblinski_tang_fn(x, y) for x, y in mins2d])
    gidx = np.argmin(vals)
    return mins2d, mins2d[gidx], vals[gidx]

def run_sgd(theta0, eta=0.02, steps=1200):
    theta = np.array(theta0, float)
    path = [theta.copy()]
    for _ in range(steps):
        theta -= eta * styblinski_tang_grad(*theta)
        path.append(theta.copy())
    return np.array(path)

def run_momentum(theta0, eta=0.02, beta=0.90, steps=1200):
    theta = np.array(theta0, float)
    v = np.zeros_like(theta)
    path = [theta.copy()]
    for _ in range(steps):
        g = styblinski_tang_grad(*theta)
        v = beta * v - eta * g
        theta = theta + v
        path.append(theta.copy())
    return np.array(path)

def run_adagrad(theta0, eta=0.40, eps=1e-8, steps=1200):
    theta = np.array(theta0, float)
    r = np.zeros_like(theta)
    path = [theta.copy()]
    for _ in range(steps):
        g = styblinski_tang_grad(*theta)
        r = r + g * g
        lr_eff = eta / (np.sqrt(r) + eps)
        theta = theta - lr_eff * g
        path.append(theta.copy())
    return np.array(path)

def run_rmsprop(theta0, eta=1e-2, rho=0.9, eps=1e-8, steps=1200):
    """
    s_t = rho * s_{t-1} + (1 - rho) * g_t^2
    theta &lt;- theta - eta * g_t / (sqrt(s_t) + eps)
    """
    theta = np.array(theta0, float)
    s = np.zeros_like(theta)
    path = [theta.copy()]
    for step in range(steps):
        g = styblinski_tang_grad(*theta)
        s = rho * s + (1 - rho) * (g * g)
        theta = theta - eta * g / (np.sqrt(s) + eps)
        path.append(theta.copy())
    return np.array(path)

def run_rmsprop_centered(theta0, eta=1e-2, rho=0.9, eps=1e-8, steps=1200):
    """
    m_t = rho * m_{t-1} + (1 - rho) * g_t
    s_t = rho * s_{t-1} + (1 - rho) * g_t^2
    denom = sqrt(s_t - m_t^2) + eps   # variance-based
    """
    theta = np.array(theta0, float)
    m = np.zeros_like(theta)
    s = np.zeros_like(theta)
    path = [theta.copy()]
    for _ in range(steps):
        g = styblinski_tang_grad(*theta)
        m = rho * m + (1 - rho) * g
        s = rho * s + (1 - rho) * (g * g)
        denom = np.sqrt(np.maximum(s - m * m, 0.0)) + eps
        theta = theta - eta * g / denom
        path.append(theta.copy())
    return np.array(path)

def run_adam(theta0, eta=1e-2, beta1=0.9, beta2=0.999, eps=1e-8, steps=1200):
    theta = np.array(theta0, float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    path = [theta.copy()]
    for t in range(1, steps + 1):
        g = styblinski_tang_grad(*theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * (g * g)
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
        path.append(theta.copy())
    return np.array(path)

def run_adamw(theta0, eta=1e-2, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01, steps=1200):
    """
    AdamW: decoupled weight decay
      theta &lt;- theta - eta * ( m_hat / (sqrt(v_hat)+eps) )  # adaptive step
      theta &lt;- theta - eta * weight_decay * theta           # uniform shrink
    Note: setting weight_decay=0.0 makes AdamW identical to Adam.
    """
    theta = np.array(theta0, float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    path = [theta.copy()]
    for t in range(1, steps + 1):
        g = styblinski_tang_grad(*theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * (g * g)
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)

        # adaptive update
        theta = theta - eta * (m_hat / (np.sqrt(v_hat) + eps))
        # decoupled weight decay (uniform; not scaled by v_hat)
        theta = theta - eta * weight_decay * theta

        path.append(theta.copy())
    return np.array(path)

"""----- params -----"""
theta_start = np.array([4.1, 4.5], dtype=float)
steps = 1200

eta_sgd = 0.02
eta_mom, beta = 0.02, 0.90
eta_adagrad = 0.40
eta_rms, rho, eps = 1e-2, 0.9, 1e-8
eta_rms_c = 1e-2

"""Adam / AdamW hyperparams"""
eta_adam = 1e-2
beta1, beta2 = 0.9, 0.999
eps_adam = 1e-8
wd = 1e-2     # try 0.0 (Adam-equivalent) vs 1e-3 vs 1e-2

"""----- runs -----"""
sgd_path  = run_sgd(theta_start, eta=eta_sgd, steps=steps)
mom_path  = run_momentum(theta_start, eta=eta_mom, beta=beta, steps=steps)
ada_path  = run_adagrad(theta_start, eta=eta_adagrad, steps=steps)
rms_path  = run_rmsprop(theta_start, eta=eta_rms, rho=rho, eps=eps, steps=steps)
rmsc_path = run_rmsprop_centered(theta_start, eta=eta_rms_c, rho=rho, eps=eps, steps=steps)
adam_path = run_adam(theta_start, eta=eta_adam, beta1=beta1, beta2=beta2, eps=eps_adam, steps=steps)
adamw_path = run_adamw(theta_start, eta=eta_adam, beta1=beta1, beta2=beta2, eps=eps_adam, weight_decay=wd, steps=steps)

mins2d, gmin_pt, gmin_val = stationary_points_and_global_min()

"""----- viz -----"""
x = y = np.linspace(-5, 5, 400)
X, Y = np.meshgrid(x, y)
Z = styblinski_tang_fn(X, Y)

plt.figure(figsize=(9, 8))
cs = plt.contour(X, Y, Z, levels=50, alpha=0.85)
plt.clabel(cs, inline=True, fmt="%.0f", fontsize=7)

plt.plot(sgd_path[:, 0],   sgd_path[:, 1],   '.-', lw=1.2, ms=3, label='SGD')
plt.plot(mom_path[:, 0],   mom_path[:, 1],   '.-', lw=1.2, ms=3, label=f'Momentum (β={beta})')
plt.plot(ada_path[:, 0],   ada_path[:, 1],   '.-', lw=1.2, ms=3, label='AdaGrad')
plt.plot(rms_path[:, 0],   rms_path[:, 1],   '.-', lw=1.2, ms=3, label=f'RMSProp (ρ={rho})')
plt.plot(rmsc_path[:, 0],  rmsc_path[:, 1],  '.-', lw=1.2, ms=3, label='RMSProp (centered)')
plt.plot(adam_path[:, 0],  adam_path[:, 1],  '.-', lw=1.2, ms=3, label='Adam')
plt.plot(adamw_path[:, 0], adamw_path[:, 1], '.-', lw=1.2, ms=3, label=f'AdamW (wd={wd})')

plt.scatter(sgd_path[0, 0], sgd_path[0, 1], s=80, label='Start', zorder=3)
plt.scatter(sgd_path[-1, 0], sgd_path[-1, 1], s=60, label='SGD End', zorder=3)
plt.scatter(mom_path[-1, 0], mom_path[-1, 1], s=60, label='Momentum End', zorder=3)
plt.scatter(ada_path[-1, 0], ada_path[-1, 1], s=60, label='AdaGrad End', zorder=3)
plt.scatter(rms_path[-1, 0], rms_path[-1, 1], s=60, label='RMSProp End', zorder=3)
plt.scatter(rmsc_path[-1, 0], rmsc_path[-1, 1], s=60, label='RMSProp (centered) End', zorder=3)
plt.scatter(adam_path[-1, 0], adam_path[-1, 1], s=60, label='Adam End', zorder=3)
plt.scatter(adamw_path[-1, 0], adamw_path[-1, 1], s=60, label='AdamW End', zorder=3)

plt.scatter(gmin_pt[0], gmin_pt[1], marker='*', s=220, edgecolor='k',
            facecolor='gold', label=f'Global min ({gmin_pt[0]:.4f}, {gmin_pt[1]:.4f})\n f={gmin_val:.4f}', zorder=5)

vals = np.array([styblinski_tang_fn(x0, y0) for x0, y0 in mins2d])
mask = np.ones(len(mins2d), dtype=bool)
mask[np.argmin(vals)] = False
if np.any(mask):
    plt.scatter(mins2d[mask, 0], mins2d[mask, 1],
                marker='v', s=120, edgecolor='k', facecolor='white',
                label='Local minima', zorder=4)

plt.title("SGD, Momentum, AdaGrad, RMSProp (+centered), Adam, AdamW on Styblinski–Tang")
plt.xlabel("x"); plt.ylabel("y")
plt.legend(loc='upper left')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
</code></pre>

</div>

<p>Before we get to Muon, note that this example is almost entirely contrived. The Styblinski–Tang function is a good example of a loss surface where it’s easy to get caught in a local minimum and hard to find the global one, but just because <code class="language-plaintext highlighter-rouge">SGD with Momentum</code> happens to find the global minimum in my contrived setup does not mean it generalizes well. In practice, AdamW has been the de facto winner.</p>

<h1 id="muon-momentum-orthogonalized-by-newton-schulz-2025">Muon (MomentUm Orthogonalized by Newton-Schulz) (2025)</h1>

<p>Alright jeez, all of the above was a bit accidental, but I wanted to give you a build-up / very quick run-through of the various optimizers that have evolved. Again, the space is definitely iterative (pun intended), and these optimizers all build off of each other. Muon is no different.</p>

<h2 id="theory">Theory</h2>

<p>The idea is that we’re still working with our momentum matrix. In practice, the momentum matrix tends to become low-rank, which means only a couple of directions dominate the updates.</p>

<p>Muon tries to orthogonalize our momentum matrix. Rare directions are amplified by the orthogonalization. Again, recall from the <a href="https://little-book-of.github.io/linear-algebra/">Little Book of Linear Algebra</a> that this means:</p>

<p>\(\text{Ortho}(M) = \text{argmin}_O \| O - M \|_F\)
where $OO^T = I$ or $O^T O = I$</p>

<p>Ok, while this is hard to compute directly… what do we turn to besides our good friend - the swiss army knife of linalg - SVD (singular value decomposition)?</p>

\[M = U S V^T\]

<p>So we would compute the SVD and then replace $S$ with the identity matrix, i.e. set every singular value to 1, giving $\text{Ortho}(M) = UV^T$.</p>

<p>However, once again, SVD is computationally expensive, so we need a cheaper way to approximate this orthogonalization.</p>
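<p>As a quick numerical sanity check, here’s a minimal NumPy sketch of the exact (expensive) SVD route - the function name <code class="language-plaintext highlighter-rouge">ortho_via_svd</code> is my own, not from any library:</p>

```python
import numpy as np

def ortho_via_svd(M):
    # exact orthogonalization: keep U and V^T, replace the singular values with 1s
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

M = np.random.default_rng(0).normal(size=(4, 3))
O = ortho_via_svd(M)
# M is tall here, so O^T O = I (the columns of O are orthonormal)
print(np.allclose(O.T @ O, np.eye(3)))  # True
```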

<h3 id="odd-polynomial-matrix">Odd Polynomial Matrix</h3>

<p>An odd matrix polynomial, in its simplest form, looks like:</p>

\[\rho (X) = aX + b(X X^T)X\]

<p>so we could do:</p>

\[\rho (M) = aM + b(MM^T)M\]

<p>So let’s go ahead and do some math, substituting in the SVD $M = USV^T$.</p>

\[\begin{aligned}
\rho (M) &amp;= aM + b(MM^T)M  \\
\rho (M) &amp;= (a + b(MM^T))M \\
\rho (M) &amp;= (a + b((USV^T)(VSU^T)))(USV^T) \\
\rho (M) &amp;= (a + b(USV^TVSU^T))(USV^T) \\ \\
&amp;\quad \text{because $V$ is orthonormal, $V^TV = I$} \\ \\
\rho (M) &amp;= (a + b(USSU^T))(USV^T) \\ \\
&amp;\quad \text{and $S$ is diagonal so $SS = S^2$} \\ \\
\rho (M) &amp;= (a + b(US^2U^T))(USV^T) \\
\rho (M) &amp;= a(USV^T) + b(US^2U^TUSV^T) \\ \\ 
&amp;\quad \text{because $U$ is orthonormal, $U^TU = I$} \\ \\
\rho (M) &amp;= a(USV^T) + b(US^2SV^T) \\
\rho (M) &amp;= a(USV^T) + b(US^3V^T) \\ \\
&amp;\quad \text{simplifying gives} \\ \\
\rho (M) &amp;= U(aS + bS^3)V^T
\end{aligned}\]

<p>So… <strong>applying an odd matrix polynomial to $M$ acts directly on its singular values: it’s equivalent to applying the same polynomial to each singular value individually and then reconstructing the matrix with the original $U$ and $V^T$</strong>.</p>

<p>This extends to higher-order odd polynomials, so take it for granted or derive it for yourself:</p>

\[\begin{align}
\rho (M) &amp;= aM + b(MM^T)M + c(MM^T)^2 M \\
\vdots \\
\rho (M) &amp;= U(aS + bS^3 + cS^5)V^T \\
\end{align}\]
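<p>We can verify this claim numerically - a quick sketch (the variable names are mine): the singular values of $\rho(M) = aM + b(MM^T)M$ should be exactly $|as + bs^3|$ for each singular value $s$ of $M$:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(3, 3))
a, b = 1.5, -0.5

# rho(M) = aM + b(M M^T) M
rho_M = a * M + b * (M @ M.T) @ M

S = np.linalg.svd(M, compute_uv=False)
# claim: rho(M) = U (aS + bS^3) V^T, so its singular values are |a*s + b*s^3|
expected = np.sort(np.abs(a * S + b * S**3))[::-1]
actual = np.linalg.svd(rho_M, compute_uv=False)  # sorted descending
print(np.allclose(actual, expected))  # True
```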

<p>Again, we want $S$ to end up as the identity (all singular values equal to 1)… So this becomes an optimization problem in itself. We’re trying to pick the coefficients $a, b, c$ so that repeatedly applying the polynomial drives us toward <code class="language-plaintext highlighter-rouge">S = np.eye(S.shape[0])</code>.</p>

<p>So how do we pick the coefficients that will get us there?</p>

<h3 id="newton-schulz-iteration">Newton-Schulz Iteration</h3>

<p>Again, <a href="https://www.youtube.com/watch?v=bO5nvE289ec">this video</a> is fantastic. However, this part was a little too abstracted. We’ll turn back to <code class="language-plaintext highlighter-rouge">manim</code> here for some more helpful visualizations and understanding.</p>

<p>I was going to dive into the derivation here, but it’s interesting enough that I’m going to cover it in another blog post and link it here.</p>

<p>For now, assume that we have these params:</p>

<ul>
  <li>$a = 3.4445$</li>
  <li>$b = -4.7750$</li>
  <li>$c = 2.0315$</li>
</ul>

<p>and those are going to be the parameters of our Newton-Schulz iteration that help us converge to what we consider a valid $S$ for the singular-value part of the SVD - one whose singular values are close-ish to 1.</p>
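<p>To build intuition, here’s a tiny scalar sketch (my own, not from the paper): since the polynomial acts on each singular value independently, we can iterate $f(s) = as + bs^3 + cs^5$ on a single small, normalized singular value and watch it get pushed toward 1. Note these coefficients were tuned for speed, so the iterates hover near 1 rather than converging exactly:</p>

```python
a, b, c = 3.4445, -4.7750, 2.0315

def ns_step(s):
    # the quintic polynomial applied to one singular value
    return a * s + b * s**3 + c * s**5

s = 0.2  # a small singular value (after normalization they live in (0, 1])
for _ in range(5):
    s = ns_step(s)
# after 5 steps, s is much closer to 1 than where it started
print(0.5 < s < 1.5)  # True
```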

<h3 id="overview">Overview</h3>

<p>So now we have:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for step in steps:
    compute gradient 
    compute momentum
    normalize momentum matrix
    orthogonalization
    update parameters
</code></pre></div></div>

<p>Now there is also a variant of Muon with decoupled weight decay, similar to what we did with AdamW.</p>

<p>So we have:</p>

\[\begin{align}
G_t &amp;\leftarrow \nabla L_t (\theta_{t-1}) \\
M_t &amp;\leftarrow \beta M_{t-1} + G_t \\
M'_t &amp;\leftarrow \frac{M_t}{\| M_t \|_F} \\
O_t &amp;\leftarrow \text{NewtonSchulz5}(M'_t) \\
\theta_t &amp;\leftarrow \theta_{t-1} - \alpha \left(0.2 \sqrt{\text{max}(n,m)} \cdot O_t + \lambda \theta_{t-1}\right)
\end{align}\]

<h2 id="implementation">Implementation</h2>

<p>I actually want to introduce this section by looking at <a href="https://docs.pytorch.org/docs/stable/generated/torch.optim.Muon.html">PyTorch’s documentation</a>. This was added recently, but let’s look here:</p>

<p><img src="/images/understanding-muon/muon-pytorch.png" alt="muon-pytorch" class="center-small lightbox-image" /></p>

<p>This should look super familiar to the code that we’ve been covering!! The only tricky part is the <code class="language-plaintext highlighter-rouge">AdjustLR</code> step, which deviates slightly between what the video / I cover above (Moonshot’s implementation) and Keller Jordan’s original implementation of $\sqrt{\text{max}(1, \frac{B}{A})}$.</p>
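<p>To make that difference concrete, here’s a small side-by-side sketch for a weight matrix of shape $(A, B)$ - the function names are mine, with the Moonshot-style scale matching the <code class="language-plaintext highlighter-rouge">adjust_lr</code> used in the code below, and the other following the original heuristic as I’ve written it above:</p>

```python
import math

def adjust_lr_moonshot(A, B):
    # Moonshot-style scale: 0.2 * sqrt(max(n, m))
    return 0.2 * math.sqrt(max(A, B))

def adjust_lr_original(A, B):
    # Keller Jordan's heuristic as described above: sqrt(max(1, B/A))
    return math.sqrt(max(1.0, B / A))

# for a 1024 x 4096 layer the two scales differ noticeably
print(round(adjust_lr_moonshot(1024, 4096), 2))  # 12.8
print(round(adjust_lr_original(1024, 4096), 2))  # 2.0
```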

<p>There are a couple of tricky parts with implementing this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">newton_schulz_5</span><span class="p">(</span><span class="n">M_matrix</span><span class="p">,</span> <span class="n">steps</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">eps</span><span class="o">=</span><span class="mf">1e-7</span><span class="p">):</span>
    <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="p">(</span><span class="mf">3.4445</span><span class="p">,</span> <span class="o">-</span><span class="mf">4.7750</span><span class="p">,</span> <span class="mf">2.0315</span><span class="p">)</span> <span class="c1"># from Keller Jordan
</span>    <span class="n">X</span> <span class="o">=</span> <span class="n">M_matrix</span><span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">copy</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="c1"># speed up in practice
</span>    
    <span class="k">if</span> <span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&gt;</span> <span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]:</span>
        <span class="n">X</span> <span class="o">=</span> <span class="n">X</span><span class="p">.</span><span class="n">T</span>

    <span class="n">X</span> <span class="o">=</span> <span class="n">X</span> <span class="o">/</span> <span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">norm</span><span class="p">(</span><span class="n">X</span><span class="p">)</span> <span class="o">+</span> <span class="n">eps</span><span class="p">)</span> <span class="c1"># frobenius norm by def
</span>    <span class="c1"># so this is tricky but we're looking here
</span>    <span class="c1"># \rho (M) &amp;= aM + b(MM^T)M + c(MM^T)^2 M
</span>    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">steps</span><span class="p">):</span>
        <span class="n">A</span> <span class="o">=</span> <span class="n">X</span> <span class="o">@</span> <span class="n">X</span><span class="p">.</span><span class="n">T</span>
        <span class="n">B</span> <span class="o">=</span> <span class="n">b</span> <span class="o">*</span> <span class="n">A</span> <span class="o">+</span> <span class="n">c</span> <span class="o">*</span> <span class="n">A</span> <span class="o">@</span> <span class="n">A</span>
        <span class="n">X</span> <span class="o">=</span> <span class="n">a</span> <span class="o">*</span> <span class="n">X</span> <span class="o">+</span> <span class="n">B</span> <span class="o">@</span> <span class="n">X</span>
    <span class="k">if</span> <span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&gt;</span> <span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]:</span>
        <span class="n">X</span> <span class="o">=</span> <span class="n">X</span><span class="p">.</span><span class="n">T</span>
    <span class="k">return</span> <span class="n">X</span>


<span class="k">def</span> <span class="nf">run_muon_muonshot</span><span class="p">(</span><span class="n">theta0</span><span class="p">,</span> <span class="n">eta</span><span class="o">=</span><span class="mf">1e-2</span><span class="p">,</span> <span class="n">beta</span><span class="o">=</span><span class="mf">0.95</span><span class="p">,</span> <span class="n">weight_decay</span><span class="o">=</span><span class="mf">1e-2</span><span class="p">,</span> <span class="n">steps</span><span class="o">=</span><span class="mi">1200</span><span class="p">,</span>
             <span class="n">ns_steps</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">eps</span><span class="o">=</span><span class="mf">1e-7</span><span class="p">,</span> <span class="n">use_nesterov</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span>
    <span class="n">theta</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">theta0</span><span class="p">,</span> <span class="nb">float</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">theta</span><span class="p">.</span><span class="n">ndim</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
        <span class="n">theta</span> <span class="o">=</span> <span class="n">theta</span><span class="p">[:,</span> <span class="bp">None</span><span class="p">]</span>
    <span class="k">elif</span> <span class="n">theta</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">==</span> <span class="mi">1</span> <span class="ow">and</span> <span class="n">theta</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">:</span>
        <span class="n">theta</span> <span class="o">=</span> <span class="n">theta</span><span class="p">.</span><span class="n">T</span>

    <span class="k">def</span> <span class="nf">adjust_lr</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">):</span>
        <span class="k">return</span> <span class="mf">0.2</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="nb">float</span><span class="p">(</span><span class="nb">max</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">)))</span>

    <span class="n">A</span><span class="p">,</span> <span class="n">B</span> <span class="o">=</span> <span class="n">theta</span><span class="p">.</span><span class="n">shape</span>
    <span class="n">path</span> <span class="o">=</span> <span class="p">[</span><span class="n">theta</span><span class="p">.</span><span class="n">copy</span><span class="p">()]</span>
    <span class="n">B_momentum_buffer</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">theta</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">steps</span><span class="p">):</span>
        <span class="n">g</span> <span class="o">=</span> <span class="n">styblinski_tang_grad</span><span class="p">(</span><span class="o">*</span><span class="n">theta</span><span class="p">)</span>  
        <span class="n">B_momentum_buffer</span> <span class="o">=</span> <span class="n">beta</span> <span class="o">*</span> <span class="n">B_momentum_buffer</span> <span class="o">+</span> <span class="n">g</span>
        <span class="c1"># didn't cover nesterov but pytorch has it
</span>        <span class="n">M_eff</span> <span class="o">=</span> <span class="n">g</span> <span class="o">+</span> <span class="n">beta</span> <span class="o">*</span> <span class="n">B_momentum_buffer</span> <span class="k">if</span> <span class="n">use_nesterov</span> <span class="k">else</span> <span class="n">B_momentum_buffer</span>
        <span class="n">O</span> <span class="o">=</span> <span class="n">newton_schulz_5</span><span class="p">(</span><span class="n">M_eff</span><span class="p">,</span> <span class="n">steps</span><span class="o">=</span><span class="n">ns_steps</span><span class="p">,</span> <span class="n">eps</span><span class="o">=</span><span class="n">eps</span><span class="p">)</span>
        <span class="c1"># decoupled weight decay (uniform shrink)
</span>        <span class="n">theta</span> <span class="o">=</span> <span class="n">theta</span> <span class="o">-</span> <span class="n">eta</span> <span class="o">*</span> <span class="p">(</span><span class="n">adjust_lr</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">)</span> <span class="o">*</span> <span class="n">O</span> <span class="o">+</span> <span class="n">weight_decay</span> <span class="o">*</span> <span class="n">theta</span><span class="p">)</span>
        <span class="n">path</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">theta</span><span class="p">.</span><span class="n">copy</span><span class="p">())</span>

    <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">path</span><span class="p">)</span>
</code></pre></div></div>

<p>Once again, here is visualization code:</p>

<div class="interactive-python">
<pre><code class="language-python">
import numpy as np
import matplotlib.pyplot as plt
from itertools import product
from sympy import Matrix
from IPython.display import display

def styblinski_tang_fn(x: float, y: float) -&gt; float:
    return 0.5 * ((x**4 - 16 * x**2 + 5 * x) + (y**4 - 16 * y**2 + 5 * y))

def styblinski_tang_grad(x: float, y: float) -&gt; np.ndarray:
    dfx = 2 * x**3 - 16 * x + 2.5
    dfy = 2 * y**3 - 16 * y + 2.5
    return np.array([dfx, dfy], dtype=float)

def stationary_points_and_global_min():
    roots = np.roots([2.0, 0.0, -16.0, 2.5])
    roots = np.real(roots[np.isreal(roots)])
    minima_1d = [r for r in roots if (6 * r * r - 16) &gt; 0]
    mins2d = np.array(list(product(minima_1d, repeat=2)), dtype=float)
    vals = np.array([styblinski_tang_fn(x, y) for x, y in mins2d])
    gidx = np.argmin(vals)
    return mins2d, mins2d[gidx], vals[gidx]

def run_sgd(theta0, eta=0.02, steps=1200):
    theta = np.array(theta0, float)
    path = [theta.copy()]
    for _ in range(steps):
        theta -= eta * styblinski_tang_grad(*theta)
        path.append(theta.copy())
    return np.array(path)

def run_momentum(theta0, eta=0.02, beta=0.90, steps=1200):
    theta = np.array(theta0, float)
    v = np.zeros_like(theta)
    path = [theta.copy()]
    for _ in range(steps):
        g = styblinski_tang_grad(*theta)
        v = beta * v - eta * g
        theta = theta + v
        path.append(theta.copy())
    return np.array(path)

def run_adagrad(theta0, eta=0.40, eps=1e-8, steps=1200):
    theta = np.array(theta0, float)
    r = np.zeros_like(theta)
    path = [theta.copy()]
    for _ in range(steps):
        g = styblinski_tang_grad(*theta)
        r = r + g * g
        lr_eff = eta / (np.sqrt(r) + eps)
        theta = theta - lr_eff * g
        path.append(theta.copy())
    return np.array(path)

def run_rmsprop(theta0, eta=1e-2, rho=0.9, eps=1e-8, steps=1200):
    """
    s_t = rho * s_{t-1} + (1 - rho) * g_t^2
    theta &lt;- theta - eta * g_t / (sqrt(s_t) + eps)
    """
    theta = np.array(theta0, float)
    s = np.zeros_like(theta)
    path = [theta.copy()]
    for step in range(steps):
        g = styblinski_tang_grad(*theta)
        s = rho * s + (1 - rho) * (g * g)
        theta = theta - eta * g / (np.sqrt(s) + eps)
        path.append(theta.copy())
    return np.array(path)

def run_rmsprop_centered(theta0, eta=1e-2, rho=0.9, eps=1e-8, steps=1200):
    """
    m_t = rho * m_{t-1} + (1 - rho) * g_t
    s_t = rho * s_{t-1} + (1 - rho) * g_t^2
    denom = sqrt(s_t - m_t^2) + eps   # variance-based
    """
    theta = np.array(theta0, float)
    m = np.zeros_like(theta)
    s = np.zeros_like(theta)
    path = [theta.copy()]
    for _ in range(steps):
        g = styblinski_tang_grad(*theta)
        m = rho * m + (1 - rho) * g
        s = rho * s + (1 - rho) * (g * g)
        denom = np.sqrt(np.maximum(s - m * m, 0.0)) + eps
        theta = theta - eta * g / denom
        path.append(theta.copy())
    return np.array(path)

def run_adam(theta0, eta=1e-2, beta1=0.9, beta2=0.999, eps=1e-8, steps=1200):
    theta = np.array(theta0, float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    path = [theta.copy()]
    for t in range(1, steps + 1):
        g = styblinski_tang_grad(*theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * (g * g)
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
        path.append(theta.copy())
    return np.array(path)

def run_adamw(theta0, eta=1e-2, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01, steps=1200):
    """
    AdamW: decoupled weight decay
      theta &lt;- theta - eta * ( m_hat / (sqrt(v_hat)+eps) )  # adaptive step
      theta &lt;- theta - eta * weight_decay * theta           # uniform shrink
    Note: setting weight_decay=0.0 makes AdamW identical to Adam.
    """
    theta = np.array(theta0, float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    path = [theta.copy()]
    for t in range(1, steps + 1):
        g = styblinski_tang_grad(*theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * (g * g)
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)

        # adaptive update
        theta = theta - eta * (m_hat / (np.sqrt(v_hat) + eps))
        # decoupled weight decay (uniform; not scaled by v_hat)
        theta = theta - eta * weight_decay * theta

        path.append(theta.copy())
    return np.array(path)

def newton_schulz_5(M_matrix, steps=5, eps=1e-7):
    # from Keller Jordan
    a, b, c = (3.4445, -4.7750, 2.0315)

    # speed up in practice
    X = M_matrix.astype(np.float32, copy=False)

    transposed = False
    if X.shape[0] &gt; X.shape[1]:
        X = X.T
        transposed = True

    # frobenius norm
    X = X / (np.linalg.norm(X) + eps)
    
    # so this is tricky but we're looking here
    # \rho (M) &amp;= aM + b(MM^T)M + c(MM^T)^2 M
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X

    if transposed:
        X = X.T
    return X


def run_muon_muonshot(theta0, eta=1e-2, beta=0.95, weight_decay=1e-2, steps=1200,
             ns_steps=5, eps=1e-7, use_nesterov=True):

    theta = np.array(theta0, float)
    if theta.ndim == 1:
        theta = theta[:, None]          # (n,) -&gt; (n,1)
    elif theta.shape[0] == 1 and theta.shape[1] &gt; 1:
        theta = theta.T                 # (1,n) -&gt; (n,1)

    def adjust_lr(A, B):
        return 0.2 * np.sqrt(float(max(A, B)))

    A, B = theta.shape

    path = [theta.copy()]
    B_momentum_buffer = np.zeros_like(theta)
    for _ in range(steps):
        g = styblinski_tang_grad(*theta)  
        B_momentum_buffer = beta * B_momentum_buffer + g
        # didn't cover nesterov but pytorch has it
        M_eff = g + beta * B_momentum_buffer if use_nesterov else B_momentum_buffer
        O = newton_schulz_5(M_eff, steps=ns_steps, eps=eps)
        # decoupled weight decay (uniform shrink)
        theta = theta - eta * (adjust_lr(A, B) * O + weight_decay * theta)
        path.append(theta.copy())

    return np.array(path)

theta_start = np.array([4.1, 4.5], dtype=float)
steps = 1200

eta_sgd = 0.02
eta_mom, beta = 0.02, 0.90
eta_adagrad = 0.40
eta_rms, rho, eps = 1e-2, 0.9, 1e-8
eta_rms_c = 1e-2

eta_adam = 1e-2
beta1, beta2 = 0.9, 0.999
eps_adam = 1e-8
wd = 1e-2     # try 0.0 (Adam-equivalent) vs 1e-3 vs 1e-2

sgd_path  = run_sgd(theta_start, eta=eta_sgd, steps=steps)
mom_path  = run_momentum(theta_start, eta=eta_mom, beta=beta, steps=steps)
ada_path  = run_adagrad(theta_start, eta=eta_adagrad, steps=steps)
rms_path  = run_rmsprop(theta_start, eta=eta_rms, rho=rho, eps=eps, steps=steps)
rmsc_path = run_rmsprop_centered(theta_start, eta=eta_rms_c, rho=rho, eps=eps, steps=steps)
adam_path = run_adam(theta_start, eta=eta_adam, beta1=beta1, beta2=beta2, eps=eps_adam, steps=steps)
adamw_path = run_adamw(theta_start, eta=eta_adam, beta1=beta1, beta2=beta2, eps=eps_adam, weight_decay=wd, steps=steps)

eta_muon = 1e-2
beta_mu = 0.95
wd_mu = 1e-2
ns_steps = 5
eps_ns = 1e-7
use_nesterov = True

muon_path_raw = run_muon_muonshot(
    theta_start,
    eta=eta_muon,
    beta=beta_mu,
    weight_decay=wd_mu,
    steps=steps,
    ns_steps=ns_steps,
    eps=eps_ns,
    use_nesterov=use_nesterov
)
muon_path = muon_path_raw.squeeze(-1) if muon_path_raw.ndim == 3 else muon_path_raw  # (T,2,1) -&gt; (T,2)

mins2d, gmin_pt, gmin_val = stationary_points_and_global_min()

x = y = np.linspace(-5, 5, 400)
X, Y = np.meshgrid(x, y)
Z = styblinski_tang_fn(X, Y)

plt.figure(figsize=(9, 8))
cs = plt.contour(X, Y, Z, levels=50, alpha=0.85)
plt.clabel(cs, inline=True, fmt="%.0f", fontsize=7)

plt.plot(sgd_path[:, 0],   sgd_path[:, 1],   '.-', lw=1.2, ms=3, label='SGD')
plt.plot(mom_path[:, 0],   mom_path[:, 1],   '.-', lw=1.2, ms=3, label=f'Momentum (β={beta})')
plt.plot(ada_path[:, 0],   ada_path[:, 1],   '.-', lw=1.2, ms=3, label='AdaGrad')
plt.plot(rms_path[:, 0],   rms_path[:, 1],   '.-', lw=1.2, ms=3, label=f'RMSProp (ρ={rho})')
plt.plot(rmsc_path[:, 0],  rmsc_path[:, 1],  '.-', lw=1.2, ms=3, label='RMSProp (centered)')
plt.plot(adam_path[:, 0],  adam_path[:, 1],  '.-', lw=1.2, ms=3, label='Adam')
plt.plot(adamw_path[:, 0], adamw_path[:, 1], '.-', lw=1.2, ms=3, label=f'AdamW (wd={wd})')

plt.plot(muon_path[:, 0],  muon_path[:, 1],  '.-', lw=1.4, ms=3, label=f'Muon (NS={ns_steps}, β={beta_mu})')
plt.scatter(muon_path[-1, 0],  muon_path[-1, 1],  s=60, label='Muon End', zorder=3)

plt.scatter(sgd_path[0, 0], sgd_path[0, 1], s=80, label='Start', zorder=3)
plt.scatter(sgd_path[-1, 0], sgd_path[-1, 1], s=60, label='SGD End', zorder=3)
plt.scatter(mom_path[-1, 0], mom_path[-1, 1], s=60, label='Momentum End', zorder=3)
plt.scatter(ada_path[-1, 0], ada_path[-1, 1], s=60, label='AdaGrad End', zorder=3)
plt.scatter(rms_path[-1, 0], rms_path[-1, 1], s=60, label='RMSProp End', zorder=3)
plt.scatter(rmsc_path[-1, 0], rmsc_path[-1, 1], s=60, label='RMSProp (centered) End', zorder=3)
plt.scatter(adam_path[-1, 0],  adam_path[-1, 1],  s=60, label='Adam End', zorder=3)
plt.scatter(adamw_path[-1, 0], adamw_path[-1, 1], s=60, label='AdamW End', zorder=3)

plt.scatter(gmin_pt[0], gmin_pt[1], marker='*', s=220, edgecolor='k',
            facecolor='gold', label=f'Global min ({gmin_pt[0]:.4f}, {gmin_pt[1]:.4f})\n f={gmin_val:.4f}', zorder=5)

vals = np.array([styblinski_tang_fn(x0, y0) for x0, y0 in mins2d])
mask = np.ones(len(mins2d), dtype=bool)
mask[np.argmin(vals)] = False
if np.any(mask):
    plt.scatter(mins2d[mask, 0], mins2d[mask, 1],
                marker='v', s=120, edgecolor='k', facecolor='white',
                label='Local minima', zorder=4)

plt.title("SGD, Momentum, AdaGrad, RMSProp (+centered), Adam, AdamW, Muon on Styblinski–Tang")
plt.xlabel("x"); plt.ylabel("y")
plt.legend(loc='upper left')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

</code></pre>
</div>

<h1 id="conclusion">Conclusion</h1>

<p>Ok! I hope you have learned something. There is obviously a ton more I could write about here, but I think actually writing the code, understanding the paths that we’re taking, and this very detailed step-through is helpful. Muon is very interesting, and while it’s still pretty hotly debated whether it’ll scale (despite Kimi being trained with it at the 1T-parameter scale), more research will certainly go into this area.</p>

<p>I’m hoping to dive more into the Newton-Schulz iteration and build some interesting visualizations there, but as always, this has burned more of my time than I maybe should have allocated.</p>

<p>Once again, <a href="https://github.com/johnlarkin1/understanding-muon">visualization code is here too</a> if you need.</p>]]></content><author><name>johnlarkin1</name></author><category term="Algorithms" /><category term="AI" /><category term="M.L." /><summary type="html"><![CDATA[So while I tried to mainly focus on optimizers, this post kinda splayed out some. It was my first time trying Pyodide and incorporating that logic into my blog. It was my first time using manim, which was exciting because I'm a big fan of the 3Blue1Brown channel. I also introduced quizzes (see AdamW section) for more interactivity. All of this is open source though, so if you have any questions, I'd be flattered if you emailed, but obviously you can just ask ChatGPT / Claude.]]></summary></entry><entry><title type="html">Disjoint Set Union</title><link href="https://johnlarkin1.github.io/2025/disjoint-set-union/" rel="alternate" type="text/html" title="Disjoint Set Union" /><published>2025-10-13T00:00:00+00:00</published><updated>2025-10-13T00:00:00+00:00</updated><id>https://johnlarkin1.github.io/2025/disjoint-set-union</id><content type="html" xml:base="https://johnlarkin1.github.io/2025/disjoint-set-union/"><![CDATA[<p>Recently, I did an interview. I got absolutely flamed and one of the reasons was I wasn’t familiar with a Disjoint Set Union (and I certainly couldn’t complete the C++ interview in time building this data structure naturally).</p>

<p>I figure I would go back to this and do a deeper dive because I wasn’t as familiar with it.</p>

<!--
# Table of Contents

- [Table of Contents](#table-of-contents)
- [Theory](#theory)
  - [What is a Disjoint Set?](#what-is-a-disjoint-set)
    - [Trees](#trees)
    - [Forests](#forests)
- [Example Usage](#example-usage)
  - [Kruskal's Algo for MST](#kruskals-algo-for-mst)
- [The Problem](#the-problem)
  - [Disjoint Set Operations](#disjoint-set-operations)
    - [Creating a new set](#creating-a-new-set)
    - [Find an item's representative](#find-an-items-representative)
    - [Union / merge subsets](#union--merge-subsets)
- [Visualization](#visualization)
- [Optimizations](#optimizations)
  - [Path Compression](#path-compression)
  - [Union by Rank](#union-by-rank)
- [Rust Implementation](#rust-implementation)
-->

<h1 id="theory">Theory</h1>

<h2 id="what-is-a-disjoint-set">What is a Disjoint Set?</h2>

<p><em>Disjoint set</em>, <em>union find</em>, and <em>disjoint set union</em> are all names for the same data structure. It is a data structure optimized for handling various set operations, and it mainly focuses on two methods: <code class="language-plaintext highlighter-rouge">union</code> and <code class="language-plaintext highlighter-rouge">find</code> (hence one of the names).</p>

<p>The whole goal is: <strong>detecting if a member is in a set, and if sets are connected in a fast and performant manner</strong>.</p>

<p>So we’ll partition our elements into subsets, representing each subset as an inverted tree (i.e. all the child nodes point back toward the root, which acts as the subset’s representative).</p>
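<p>As a minimal sketch of that inverted-tree idea (the helper names are mine), the structure is usually just a parent mapping, where a node that is its own parent is the root / representative of its subset:</p>

```python
def make_set(parent, x):
    # each new element starts as the root of its own one-node tree
    parent[x] = x

def find(parent, x):
    # follow parent pointers up to the root (no optimizations yet)
    while parent[x] != x:
        x = parent[x]
    return x

parent = {}
for node in "abc":
    make_set(parent, node)
parent["b"] = "a"  # point b's tree at a, merging the two subsets

print(find(parent, "b"))  # a
print(find(parent, "c"))  # c
```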

<h3 id="trees">Trees</h3>

<p>As a reminder, trees are a specific form of a graph where:</p>

<ul>
  <li>undirected</li>
  <li>at most 1 path between any 2 nodes</li>
  <li>acyclic</li>
</ul>

<p>Rooted trees come in two types: <strong>out-tree</strong> (edges point away from the root) and <strong>in-tree</strong> (edges point toward the root).</p>

<p><img src="/images/disjoint-union-set/tree-types.png" alt="tree-types" class="center-small lightbox-image" /></p>

<p>Out-trees are probably the most common, but we’re going to be focusing on an in-tree.</p>

<h3 id="forests">Forests</h3>

<p><strong>A forest is a collection of trees</strong>. It’s an undirected acyclic graph, where each connected component is a tree. It’s a disjoint union of trees.</p>

<h1 id="example-usage">Example Usage</h1>

<h2 id="kruskals-algo-for-mst">Kruskal’s Algo for MST</h2>

<p>Kruskal’s algorithm is a way of finding a <a href="https://en.wikipedia.org/wiki/Minimum_spanning_tree">minimum spanning tree</a>. In a very basic phrasing, a minimum spanning tree is a subset of the edges of a connected, undirected graph that connects all the nodes with the minimum possible total edge weight. Basically, we want one “connected component” at the lowest cost.</p>

<p>Imagine we’re building a road network that has to reach all of our target cities in the cheapest way possible (cheapest doesn’t always mean best for travel times).</p>

<p>Kruskal’s is basically:</p>

<ol>
  <li>Sort edges (by weight)</li>
  <li>Pick cheapest edge (if no cycle created)</li>
  <li>Continue while MST is not complete</li>
</ol>

<p>This greedy algorithm utilizes a DSU when we need to check whether adding an edge would create a cycle: if <code class="language-plaintext highlighter-rouge">find</code> returns the same representative for both endpoints, they’re already connected and the edge is skipped.</p>

<p>Basically, in very lightweight pseudocode:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">edge</span> <span class="ow">in</span> <span class="n">sorted_edges</span><span class="p">:</span>
    <span class="k">if</span> <span class="n">find</span><span class="p">(</span><span class="n">edge</span><span class="p">.</span><span class="n">u</span><span class="p">)</span> <span class="o">!=</span> <span class="n">find</span><span class="p">(</span><span class="n">edge</span><span class="p">.</span><span class="n">v</span><span class="p">):</span>
        <span class="n">mst</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">edge</span><span class="p">)</span>
        <span class="n">union</span><span class="p">(</span><span class="n">edge</span><span class="p">.</span><span class="n">u</span><span class="p">,</span> <span class="n">edge</span><span class="p">.</span><span class="n">v</span><span class="p">)</span>
</code></pre></div></div>
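<p>To make that pseudocode concrete, here is a runnable sketch of Kruskal’s (the graph and weights below are made up for illustration, and the DSU inside is the bare naive version with no optimizations):</p>

```python
def kruskal(num_nodes, edges):
    """edges: list of (weight, u, v) tuples; returns the MST edge list."""
    parent = list(range(num_nodes))

    def find(x):
        # naive find: walk up until we hit the root
        while parent[x] != x:
            x = parent[x]
        return x

    mst = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # no cycle created -> safe to take this edge
            mst.append((weight, u, v))
            parent[root_v] = root_u  # union
    return mst

# 4 "cities" with weighted roads between them
edges = [(4, 0, 1), (1, 1, 2), (3, 0, 2), (2, 2, 3)]
mst = kruskal(4, edges)
print(mst)  # [(1, 1, 2), (2, 2, 3), (3, 0, 2)] -- 3 edges, total weight 6
```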

<h1 id="the-problem">The Problem</h1>

<p>Again, how do we quickly check whether $x$ and $y$ are in the same subset? The answer is obviously DSUs.</p>

<h2 id="disjoint-set-operations">Disjoint Set Operations</h2>

<p>And so with that, this data structure is going to have:</p>

<ol>
  <li>Create a new set</li>
  <li>Find an item’s set representative (basically like the root of the subset tree)</li>
  <li>Union, merge subsets</li>
</ol>

<h3 id="creating-a-new-set">Creating a new set</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">DisjointSet</span><span class="p">:</span>
  <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="bp">self</span><span class="p">.</span><span class="n">parent</span> <span class="o">=</span> <span class="p">{}</span>

  <span class="k">def</span> <span class="nf">make_set</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span>
</code></pre></div></div>

<h3 id="find-an-items-representative">Find an item’s representative</h3>

<p>How can we rapidly check whether two elements are in the same subset? This is basically the whole point of the data structure: we climb up the tree until we hit the root, and two elements are in the same subset exactly when they share a root.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">DisjointSet</span><span class="p">:</span>
  <span class="k">def</span> <span class="nf">find</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
    <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">==</span> <span class="n">x</span><span class="p">:</span>
      <span class="k">return</span> <span class="n">x</span>
    <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">x</span><span class="p">])</span>
</code></pre></div></div>

<h3 id="union--merge-subsets">Union / merge subsets</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">DisjointSet</span><span class="p">:</span>
  <span class="k">def</span> <span class="nf">union</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="n">root_x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="n">root_y</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>

    <span class="c1"># only merge if x and y are not in the same set
</span>    <span class="k">if</span> <span class="n">root_x</span> <span class="o">!=</span> <span class="n">root_y</span><span class="p">:</span>
      <span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">root_y</span><span class="p">]</span> <span class="o">=</span> <span class="n">root_x</span>

</code></pre></div></div>
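<p>Putting the three operations together (a small hypothetical example; the class is repeated here so the snippet is self-contained):</p>

```python
class DisjointSet:
    def __init__(self) -> None:
        self.parent = {}

    def make_set(self, x: int) -> None:
        self.parent[x] = x

    def find(self, x: int) -> int:
        if self.parent[x] == x:
            return x
        return self.find(self.parent[x])

    def union(self, x: int, y: int) -> None:
        root_x, root_y = self.find(x), self.find(y)
        # only merge if x and y are not in the same set
        if root_x != root_y:
            self.parent[root_y] = root_x

ds = DisjointSet()
for i in range(5):
    ds.make_set(i)
ds.union(0, 1)
ds.union(1, 2)
print(ds.find(2) == ds.find(0))  # True: {0, 1, 2} share a representative
print(ds.find(3) == ds.find(4))  # False: 3 and 4 are still singleton sets
```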

<p>Basically we are just stitching these subsets together: we point the root of $y$’s tree at the root of $x$’s tree, and the two trees become one.</p>

<h1 id="visualization">Visualization</h1>

<p>I thought about having Claude spin up a visualizer, but it didn’t seem worth it. There are lots of good resources; the best I’ve seen is <a href="https://visualgo.net/en/ufds">here at visualgo</a>, and there are visualization slides on the DSU <a href="https://visualgo.net/en/ufds?slide=1">here</a>.</p>

<h1 id="optimizations">Optimizations</h1>

<p>There are two big optimizations that people generally hammer home for DSUs: <strong>path compression</strong> and <strong>union by rank</strong>.</p>

<h2 id="path-compression">Path Compression</h2>

<p>So this is a neat trick that is invoked on the <code class="language-plaintext highlighter-rouge">find</code> call. When we’re climbing up the tree to the root, we “flatten” the tree along the way by making each visited node point directly to the root. That way, the next <code class="language-plaintext highlighter-rouge">find</code> on any of those nodes takes $\mathcal{O}(1)$ time. In pseudocode,</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># original
</span><span class="k">def</span> <span class="nf">find</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
  <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">==</span> <span class="n">x</span><span class="p">:</span>
    <span class="k">return</span> <span class="n">x</span>
  <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">x</span><span class="p">])</span>

<span class="c1"># with path compression
</span><span class="k">def</span> <span class="nf">find</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
  <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">!=</span> <span class="n">x</span><span class="p">:</span>
    <span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">x</span><span class="p">])</span>
  <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">x</span><span class="p">]</span>
</code></pre></div></div>

<p>This helps keep our tree flat and wide. So for example <code class="language-plaintext highlighter-rouge">find(5)</code> would potentially take 5 recursive calls if we had <code class="language-plaintext highlighter-rouge">5 -&gt; 4 -&gt; 3 -&gt; 2 -&gt; 1</code>, but the next <code class="language-plaintext highlighter-rouge">find(4)</code> would be $\mathcal{O}(1)$.</p>
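<p>You can watch the flattening happen with a tiny contrived chain (the values here are just illustrative):</p>

```python
# a degenerate chain rooted at 1: 5 -> 4 -> 3 -> 2 -> 1
parent = {1: 1, 2: 1, 3: 2, 4: 3, 5: 4}

def find(x):
    if parent[x] != x:
        parent[x] = find(parent[x])  # point each visited node at the root
    return parent[x]

root = find(5)  # walks the whole chain once...
print(root)     # 1
print(parent)   # ...and now every node points directly at 1
```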

<h2 id="union-by-rank">Union by Rank</h2>

<p>This is another cool trick. When we <code class="language-plaintext highlighter-rouge">union</code>, we attach the smaller tree under the larger one. That, once again, keeps the trees shallow so that our <code class="language-plaintext highlighter-rouge">find</code> operations stay fast.</p>

<p>To do this, we keep track of <code class="language-plaintext highlighter-rouge">rank</code> - a measure of the tree’s height. When performing <code class="language-plaintext highlighter-rouge">union</code>, we compare ranks and attach the root with the smaller rank under the root with the larger rank.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">DisjointSet</span><span class="p">:</span>
  <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="bp">self</span><span class="p">.</span><span class="n">parent</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="bp">self</span><span class="p">.</span><span class="n">rank</span> <span class="o">=</span> <span class="p">{}</span>

  <span class="k">def</span> <span class="nf">make_set</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span>
    <span class="bp">self</span><span class="p">.</span><span class="n">rank</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>

  <span class="c1"># find...
</span>  <span class="k">def</span> <span class="nf">union</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="n">root_x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="n">root_y</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">root_x</span> <span class="o">==</span> <span class="n">root_y</span><span class="p">:</span>
      <span class="k">return</span>

    <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">rank</span><span class="p">[</span><span class="n">root_x</span><span class="p">]</span> <span class="o">&lt;</span> <span class="bp">self</span><span class="p">.</span><span class="n">rank</span><span class="p">[</span><span class="n">root_y</span><span class="p">]:</span>
      <span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">root_x</span><span class="p">]</span> <span class="o">=</span> <span class="n">root_y</span>
    <span class="k">elif</span> <span class="bp">self</span><span class="p">.</span><span class="n">rank</span><span class="p">[</span><span class="n">root_x</span><span class="p">]</span> <span class="o">&gt;</span> <span class="bp">self</span><span class="p">.</span><span class="n">rank</span><span class="p">[</span><span class="n">root_y</span><span class="p">]:</span>
      <span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">root_y</span><span class="p">]</span> <span class="o">=</span> <span class="n">root_x</span>
    <span class="k">else</span><span class="p">:</span>
      <span class="bp">self</span><span class="p">.</span><span class="n">parent</span><span class="p">[</span><span class="n">root_y</span><span class="p">]</span> <span class="o">=</span> <span class="n">root_x</span>
      <span class="bp">self</span><span class="p">.</span><span class="n">rank</span><span class="p">[</span><span class="n">root_x</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span>
</code></pre></div></div>

<p>A super interesting note here though: according to most references, rank is only used to make merging decisions, so it doesn’t <strong>need</strong> to be an exact height. I thought it was weird at first that we don’t bump the rank in the <code class="language-plaintext highlighter-rouge">if</code> / <code class="language-plaintext highlighter-rouge">elif</code> branches, but that’s because we attach the shorter tree directly to the taller root, so the merged tree’s height is still bounded by the bigger rank. This is a very important point: rank is loosely tracked, a rough heuristic that upper-bounds the height (especially once path compression starts shrinking trees beneath it).</p>
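<p>A quick sanity check of that point (a minimal sketch of the dict-based version above): after path compression shrinks the tree, the stored rank can overshoot the real height, and the structure still works fine because rank is only ever compared between roots.</p>

```python
parent, rank = {}, {}

def make_set(x):
    parent[x], rank[x] = x, 0

def find(x):
    if parent[x] != x:
        parent[x] = find(parent[x])  # path compression
    return parent[x]

def union(x, y):
    root_x, root_y = find(x), find(y)
    if root_x == root_y:
        return
    if rank[root_x] < rank[root_y]:
        parent[root_x] = root_y
    elif rank[root_x] > rank[root_y]:
        parent[root_y] = root_x
    else:
        parent[root_y] = root_x
        rank[root_x] += 1

for i in range(4):
    make_set(i)
union(0, 1)  # equal ranks -> 0 becomes root, rank[0] = 1
union(2, 3)  # equal ranks -> 2 becomes root, rank[2] = 1
union(0, 2)  # equal ranks -> 0 becomes root, rank[0] = 2
find(3)      # compression: 3 now points straight at 0
print(rank[0])  # 2 -- still an upper bound, though the tree is now height 1
```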

<h1 id="rust-implementation">Rust Implementation</h1>

<p>I’ve been trying to learn more Rust given it’s everyone’s favorite programming language. So I wanted to build this up again in Rust instead of Python, both for practice and for tighter memory management. The code is pretty readable and clean (somewhat similar to the Python), so I won’t describe too much else. I also track the size of each set for debugging.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">#[derive(Debug,</span> <span class="nd">Clone)]</span>
<span class="k">pub</span> <span class="k">struct</span> <span class="n">DisjointSetUnion</span> <span class="p">{</span>
    <span class="n">parent</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">usize</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">rank</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">usize</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">size</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">usize</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">sets</span><span class="p">:</span> <span class="nb">usize</span><span class="p">,</span>
<span class="p">}</span>

<span class="k">impl</span> <span class="n">DisjointSetUnion</span> <span class="p">{</span>
    <span class="cd">/// create a new disjoint set union with n elements</span>
    <span class="k">pub</span> <span class="k">fn</span> <span class="nf">new</span><span class="p">(</span><span class="n">n</span><span class="p">:</span> <span class="nb">usize</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="k">Self</span> <span class="p">{</span>
        <span class="k">Self</span> <span class="p">{</span>
            <span class="n">parent</span><span class="p">:</span> <span class="p">(</span><span class="mi">0</span><span class="o">..</span><span class="n">n</span><span class="p">)</span><span class="nf">.collect</span><span class="p">(),</span>
            <span class="n">rank</span><span class="p">:</span> <span class="nd">vec!</span><span class="p">[</span><span class="mi">0</span><span class="p">;</span> <span class="n">n</span><span class="p">],</span>
            <span class="n">size</span><span class="p">:</span> <span class="nd">vec!</span><span class="p">[</span><span class="mi">1</span><span class="p">;</span> <span class="n">n</span><span class="p">],</span>
            <span class="n">sets</span><span class="p">:</span> <span class="n">n</span><span class="p">,</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="cd">/// number of disjoint sets</span>
    <span class="k">pub</span> <span class="k">fn</span> <span class="nf">num_disjoint_sets</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">usize</span> <span class="p">{</span>
        <span class="k">self</span><span class="py">.sets</span>
    <span class="p">}</span>

    <span class="cd">/// find the root of the set containing x</span>
    <span class="k">pub</span> <span class="k">fn</span> <span class="nf">find</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="k">mut</span> <span class="n">x</span><span class="p">:</span> <span class="nb">usize</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">usize</span> <span class="p">{</span>
        <span class="k">while</span> <span class="k">self</span><span class="py">.parent</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">!=</span> <span class="n">x</span> <span class="p">{</span>
            <span class="k">let</span> <span class="n">parent</span> <span class="o">=</span> <span class="k">self</span><span class="py">.parent</span><span class="p">[</span><span class="n">x</span><span class="p">];</span>
            <span class="k">self</span><span class="py">.parent</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="k">self</span><span class="py">.parent</span><span class="p">[</span><span class="n">parent</span><span class="p">];</span>
            <span class="n">x</span> <span class="o">=</span> <span class="k">self</span><span class="py">.parent</span><span class="p">[</span><span class="n">parent</span><span class="p">];</span>
        <span class="p">}</span>
        <span class="n">x</span>
    <span class="p">}</span>

    <span class="cd">/// union the sets containing x and y</span>
    <span class="k">pub</span> <span class="k">fn</span> <span class="k">union</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="nb">usize</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="nb">usize</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">usize</span> <span class="p">{</span>
        <span class="k">let</span> <span class="n">root_x</span> <span class="o">=</span> <span class="k">self</span><span class="nf">.find</span><span class="p">(</span><span class="n">x</span><span class="p">);</span>
        <span class="k">let</span> <span class="n">root_y</span> <span class="o">=</span> <span class="k">self</span><span class="nf">.find</span><span class="p">(</span><span class="n">y</span><span class="p">);</span>

        <span class="c1">// same component</span>
        <span class="k">if</span> <span class="n">root_x</span> <span class="o">==</span> <span class="n">root_y</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">root_x</span><span class="p">;</span>
        <span class="p">}</span>

        <span class="c1">// we want smaller rank tree under higher rank tree</span>
        <span class="c1">// to try and keep things as flat as possible</span>
        <span class="k">if</span> <span class="k">self</span><span class="py">.rank</span><span class="p">[</span><span class="n">root_x</span><span class="p">]</span> <span class="o">&lt;</span> <span class="k">self</span><span class="py">.rank</span><span class="p">[</span><span class="n">root_y</span><span class="p">]</span> <span class="p">{</span>
            <span class="k">self</span><span class="py">.parent</span><span class="p">[</span><span class="n">root_x</span><span class="p">]</span> <span class="o">=</span> <span class="n">root_y</span><span class="p">;</span>
            <span class="k">self</span><span class="py">.size</span><span class="p">[</span><span class="n">root_y</span><span class="p">]</span> <span class="o">+=</span> <span class="k">self</span><span class="py">.size</span><span class="p">[</span><span class="n">root_x</span><span class="p">];</span>
            <span class="k">self</span><span class="py">.sets</span> <span class="o">-=</span> <span class="mi">1</span><span class="p">;</span>
            <span class="k">return</span> <span class="n">root_y</span><span class="p">;</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="k">self</span><span class="py">.rank</span><span class="p">[</span><span class="n">root_x</span><span class="p">]</span> <span class="o">&gt;</span> <span class="k">self</span><span class="py">.rank</span><span class="p">[</span><span class="n">root_y</span><span class="p">]</span> <span class="p">{</span>
            <span class="k">self</span><span class="py">.parent</span><span class="p">[</span><span class="n">root_y</span><span class="p">]</span> <span class="o">=</span> <span class="n">root_x</span><span class="p">;</span>
            <span class="k">self</span><span class="py">.size</span><span class="p">[</span><span class="n">root_x</span><span class="p">]</span> <span class="o">+=</span> <span class="k">self</span><span class="py">.size</span><span class="p">[</span><span class="n">root_y</span><span class="p">];</span>
            <span class="k">self</span><span class="py">.sets</span> <span class="o">-=</span> <span class="mi">1</span><span class="p">;</span>
            <span class="k">return</span> <span class="n">root_x</span><span class="p">;</span>
        <span class="p">}</span>

        <span class="c1">// otherwise, they're equal</span>
        <span class="k">self</span><span class="py">.parent</span><span class="p">[</span><span class="n">root_y</span><span class="p">]</span> <span class="o">=</span> <span class="n">root_x</span><span class="p">;</span>
        <span class="k">self</span><span class="py">.rank</span><span class="p">[</span><span class="n">root_x</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
        <span class="k">self</span><span class="py">.size</span><span class="p">[</span><span class="n">root_x</span><span class="p">]</span> <span class="o">+=</span> <span class="k">self</span><span class="py">.size</span><span class="p">[</span><span class="n">root_y</span><span class="p">];</span>
        <span class="k">self</span><span class="py">.sets</span> <span class="o">-=</span> <span class="mi">1</span><span class="p">;</span>
        <span class="k">return</span> <span class="n">root_x</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="cd">/// check if x and y are in the same set</span>
    <span class="k">pub</span> <span class="k">fn</span> <span class="nf">connected</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="nb">usize</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="nb">usize</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">bool</span> <span class="p">{</span>
        <span class="k">self</span><span class="nf">.find</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">==</span> <span class="k">self</span><span class="nf">.find</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>
    <span class="p">}</span>

    <span class="cd">/// size of the set containing x</span>
    <span class="k">pub</span> <span class="k">fn</span> <span class="nf">size_of</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="nb">usize</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">usize</span> <span class="p">{</span>
        <span class="c1">// ugh perils of Rust</span>
        <span class="c1">// i wanted to do: self.size[self.find(x)]</span>
        <span class="c1">// but because the borrow checker we cannot</span>
        <span class="c1">// indexing into self.size immutably borrows self.size</span>
        <span class="c1">// and thus self for the duration of the indexing expression</span>
        <span class="c1">// as a result, when we do self.find we need a MUTABLE borrow</span>
        <span class="c1">// of self - so this conflict causes the break</span>
        <span class="k">let</span> <span class="n">root</span> <span class="o">=</span> <span class="k">self</span><span class="nf">.find</span><span class="p">(</span><span class="n">x</span><span class="p">);</span>
        <span class="k">self</span><span class="py">.size</span><span class="p">[</span><span class="n">root</span><span class="p">]</span>
    <span class="p">}</span>

    <span class="k">pub</span> <span class="k">fn</span> <span class="nf">rank_of</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="nb">usize</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">usize</span> <span class="p">{</span>
        <span class="k">let</span> <span class="n">root</span> <span class="o">=</span> <span class="k">self</span><span class="nf">.find</span><span class="p">(</span><span class="n">x</span><span class="p">);</span>
        <span class="k">self</span><span class="py">.rank</span><span class="p">[</span><span class="n">root</span><span class="p">]</span>
    <span class="p">}</span>

    <span class="c1">// Claude added these</span>
    <span class="cd">/// (solely for viz) - reference to the parent array</span>
    <span class="k">pub</span> <span class="k">fn</span> <span class="nf">parent</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="o">&amp;</span><span class="p">[</span><span class="nb">usize</span><span class="p">]</span> <span class="p">{</span>
        <span class="o">&amp;</span><span class="k">self</span><span class="py">.parent</span>
    <span class="p">}</span>

    <span class="cd">/// (solely for viz) - reference to the rank array</span>
    <span class="k">pub</span> <span class="k">fn</span> <span class="nf">rank</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="o">&amp;</span><span class="p">[</span><span class="nb">usize</span><span class="p">]</span> <span class="p">{</span>
        <span class="o">&amp;</span><span class="k">self</span><span class="py">.rank</span>
    <span class="p">}</span>

    <span class="cd">/// (solely for viz) - reference to the size array</span>
    <span class="k">pub</span> <span class="k">fn</span> <span class="nf">size</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="o">&amp;</span><span class="p">[</span><span class="nb">usize</span><span class="p">]</span> <span class="p">{</span>
        <span class="o">&amp;</span><span class="k">self</span><span class="py">.size</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The visualization code was entirely autogenerated by Claude and then I used <a href="https://github.com/charmbracelet/vhs"><code class="language-plaintext highlighter-rouge">vhs</code></a> to create the animation. Here is the demo:</p>

<p><img src="/images/disjoint-union-set/demo.gif" alt="demo-gif" class="basic-center lightbox-image" /></p>

<p><a href="https://github.com/johnlarkin1/disjoint-set-union">Here is the code</a> if you want to check it out. I’m guessing most people will just deep dive with ChatGPT which is ok too!</p>]]></content><author><name>johnlarkin1</name></author><category term="Algorithms" /><category term="Rust" /><summary type="html"><![CDATA[Recently, I did an interview. I got absolutely flamed and one of the reasons was I wasn’t familiar with a Disjoint Set Union (and I certainly couldn’t complete the C++ interview in time building this data structure naturally).]]></summary></entry><entry><title type="html">Teaching a Computer How to Write</title><link href="https://johnlarkin1.github.io/2025/teaching-a-computer-to-write/" rel="alternate" type="text/html" title="Teaching a Computer How to Write" /><published>2025-10-01T00:00:00+00:00</published><updated>2025-10-01T00:00:00+00:00</updated><id>https://johnlarkin1.github.io/2025/teaching-a-computer-to-write</id><content type="html" xml:base="https://johnlarkin1.github.io/2025/teaching-a-computer-to-write/"><![CDATA[<h1 id="️-motivating-visualizations">✍️ Motivating Visualizations</h1>

<p>Today, we’re going to learn how to teach a computer to write. I don’t mean generating text (which probably would have been a better thing to study in college); I mean learning to write like a human learns to write with a pen and paper. My results (eventually) were pretty good. Here are some motivating visualizations.</p>

<p>Let’s look at one. My family used to have this hung over our kitchen sink when I was a kid. I ate breakfast every day looking at it.</p>

<div class="featured-quote">
  <p class="featured-quote__text">The heart has its reasons which reason knows nothing of</p>
  <p class="featured-quote__attribution">
    <span class="featured-quote__author">Blaise Pascal</span>, <span class="featured-quote__source">"Pensées"</span>
  </p>
</div>

<p><img src="/images/generative-handwriting/synth_outputs/heart_has_its_reason/writing_cleansed.png" alt="heart-writing-cleansed" class="basic-center lightbox-image" /></p>

<p><img src="/images/generative-handwriting/synth_outputs/heart_has_its_reason/writing.gif" alt="heart-writing-gif" class="basic-center lightbox-image" /></p>

<p><img src="/images/generative-handwriting/synth_outputs/heart_has_its_reason/mdn_aggregate.png" alt="heart-mdn-aggregate" class="basic-center lightbox-image" /></p>

<p><img src="/images/generative-handwriting/synth_outputs/heart_has_its_reason/attention_combined.gif" alt="heart-attention-gif" class="basic-center lightbox-image" /></p>

<p><img src="/images/generative-handwriting/synth_outputs/heart_has_its_reason/mdn.png" alt="heart-mdn" class="basic-center lightbox-image" /></p>

<p><img src="/images/generative-handwriting/synth_outputs/heart_has_its_reason/sampling.gif" alt="heart-sampling-gif" class="basic-center lightbox-image" /></p>

<p>If you want to skip ahead, I’d recommend jumping down to <a href="#synthesis-model-sampling">Synthesis Model Sampling</a>, which is arguably the best part of this post. I’ll discuss what all these visualizations mean in detail there.</p>

<hr />

<p><br /></p>

<div class="markdown-alert markdown-alert-disclaimer">

<p>This is a relatively long post! If you're trying to learn from 0 -&gt; 1, I'd encourage you to read the whole thing, but feel free to jump around as you wish. There are three main portions: concept, theory, and code.</p>

<p>My purpose here was to build up from the basics and really understand the flow. I provide quite a few models so we can see the progression: from a simple neural net, to a basic LSTM, to a Peephole LSTM, to a stacked cascade of Peephole LSTMs, to Mixture Density Networks, to the Attention Mechanism, to the Attention RNN, to the Handwriting Prediction Network, and finally throwing it all together into the full Handwriting Synthesis Network that Graves originally wrote about.</p>

<p>There are other things that maybe I'll discuss in the future, like the need to pickle JAX models (because if they're XLA compatible then you can't run inference on your CPU) and issues like that. Another thing I didn't really discuss was temperature and bias for sampling. I also (sadly) didn't cover priming. However, I spent far more time on this than I should have. If you have any questions - as always - feel free to reach out if curious. </p>

<p>Enjoy!</p>

</div>

<div class="markdown-alert markdown-alert-note">
<p>One thing that I would highly recommend - if you're interested in the theory of LSTMs and why sigmoid vs tanh activations were chosen, I would really encourage reading Chris Olah's <b><a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">Understanding LSTMs</a></b> blog post. It does a fantastic job.</p>
</div>

<p><br /></p>

<hr />

<!--
# Table of Contents

- [✍️ Motivating Visualizations](#️-motivating-visualizations)
- [Table of Contents](#table-of-contents)
- [🥅 Motivation](#-motivation)
- [👨‍🏫 History](#-history)
  - [Tom and My Engineering Thesis](#tom-and-my-engineering-thesis)
- [🙏 Acknowledgements](#-acknowledgements)
- [📝 Concept](#-concept)
- [👾 Software](#-software)
  - [Tensorflow](#tensorflow)
    - [Programming Paradigm](#programming-paradigm)
    - [Versions - How the times have changed](#versions---how-the-times-have-changed)
    - [Tensorboard](#tensorboard)
  - [Pytorch](#pytorch)
  - [JAX](#jax)
    - [Programming Paradigm](#programming-paradigm-1)
- [📊 Data](#-data)
- [🧠 Base Neural Network Theory](#-base-neural-network-theory)
  - [Lions, Bears, and Many Neural Networks, oh my](#lions-bears-and-many-neural-networks-oh-my)
  - [Basic Neural Network](#basic-neural-network)
    - [Hyper Parameters](#hyper-parameters)
  - [Feedforward Neural Network](#feedforward-neural-network)
    - [Backpropagation](#backpropagation)
  - [Recurrent Neural Network](#recurrent-neural-network)
  - [Long Short Term Memory Networks](#long-short-term-memory-networks)
  - [Understanding the LSTM Structure](#understanding-the-llm-structure)
- [🧬 Concepts to Code](#-concepts-to-code)
  - [LSTM Cell with Peephole Connections](#lstm-cell-with-peephole-connections)
    - [Theory](#theory)
    - [Code](#code)
  - [Gaussian Mixture Models](#gaussian-mixture-models)
    - [Theory](#theory-1)
    - [Code](#code-1)
  - [Mixture Density Networks](#mixture-density-networks)
    - [Theory](#theory-2)
    - [Code](#code-2)
  - [Mixture Density Loss](#mixture-density-loss)
    - [Theory](#theory-3)
    - [Code](#code-3)
  - [Attention Mechanism](#attention-mechanism)
    - [Theory](#theory-4)
    - [Code](#code-4)
  - [Stacked LSTM](#stacked-lstm)
    - [Theory](#theory-5)
    - [Code](#code-5)
  - [Final Result](#final-result)
- [🏋️ Training Results](#️-training-results)
  - [Vast AI GPU Enabled Execution](#vast-ai-gpu-enabled-execution)
    - [Problem #1 - Gradient Explosion Problem](#problem-1---gradient-explosion-problem)
    - [Problem #2 - OOM Galore](#problem-2---oom-galore)
    - [Sanity Check - Validating Model Dimensions (with AI... so somewhat)](#sanity-check---validating-model-dimensions-with-ai-so-somewhat)
- [✍️ Visualizations](#️-visualizations)
  - [Learning with Dummy Data](#learning-with-dummy-data)
  - [Synthesis Model Sampling](#synthesis-model-sampling)
    - [Heart has its reasons](#heart-has-its-reasons)
    - [Loved and lost](#loved-and-lost)
    - [It has to be symphonic](#it-has-to-be-symphonic)
    - [Is a model a lie?](#is-a-model-a-lie)
    - [Fish folly](#fish-folly)
- [Conclusion](#conclusion)
-->

<h1 id="-motivation">🥅 Motivation</h1>

<p>This motivation is clear - this is something that I have wanted to find the time to do right since college. My engineering thesis was on this Graves paper. My senior year, I worked with my good friend (also he’s a brilliant engineer) <a href="https://www.linkedin.com/in/tom-wilmots-030781a6/">Tom Wilmots</a> to understand and dive into this paper.</p>

<p>I’m going to pull pieces of that, but times have changed, and I wanted to revisit some of the work we did, hopefully clean it up, and finally put a nail in it (so my girlfriend / friends don’t have to keep hearing about it).</p>

<h1 id="‍-history">👨‍🏫 History</h1>

<p><a href="https://www.linkedin.com/in/tom-wilmots-030781a6/">Tom</a> and I were very interested in the concept of teaching a computer how to write in college. There is a very famous <a href="https://arxiv.org/abs/1308.0850">paper</a> that was published around 2013 from Canadian computer scientist <a href="https://en.wikipedia.org/wiki/Alex_Graves_(computer_scientist)">Alex Graves</a>, titled <a href="https://arxiv.org/abs/1308.0850"><em>Generating Sequences With Recurrent Neural Networks</em></a>. At <a href="http://www.swarthmore.edu/">Swarthmore</a>, you have to do Engineering thesis, called <a href="https://www.swarthmore.edu/engineering/e90-senior-design-project">E90s</a>. It’s basically a year (although I’d argue it’s more of a semester when it all shakes out) long project focused on doing a piece of work you’re proud of.</p>

<h2 id="tom-and-my-engineering-thesis">Tom and My Engineering Thesis</h2>

<p>For the actual paper that we wrote, check it out here:</p>

<div style="text-align: center;">
    <embed src="/pdfs/Handwriting-Synthesis-E90.pdf" width="500" height="375" type="application/pdf" />
</div>

<p>You can also check it out here: <a href="https://arxiv.org/abs/1308.0850"><strong>Application of Neural Networks with Handwriting Samples</strong></a>.</p>

<h1 id="-acknowledgements">🙏 Acknowledgements</h1>

<p>Before I dive in, I do want to make some acknowledgements, given that this is a partial resumption of earlier work.</p>

<ul>
  <li><strong><a href="https://www.linkedin.com/in/tom-wilmots-030781a6/">Tom Wilmots</a></strong> - One of the brightest and best engineers I’ve worked with. He was an Engineering and Economics double major from <a href="http://www.swarthmore.edu/">Swarthmore</a>. Pretty sure I would have failed my E90 thesis without him.</li>
  <li><strong><a href="https://mzucker.github.io/">Matt Zucker</a></strong> - One of my role models and constant inspirations, Matt was kind enough to be Tom and my academic advisor for this final engineering project. He is the best professor I’ve come across.</li>
  <li><strong><a href="https://en.wikipedia.org/wiki/Alex_Graves_(computer_scientist)">Alex Graves</a></strong> - A professor that both Tom and I had the pleasure of working with. <strong>He responded to our emails, which I’m still very appreciative of</strong>. You can see more about his work at the University of Toronto <a href="https://www.cs.toronto.edu/~graves/">here</a>). He is the author of <a href="https://arxiv.org/abs/1308.0850">this paper</a>, which Matt found for us and pretty much was the basis of our project. He’s also the creator of the <a href="https://arxiv.org/abs/1410.5401">Neural Turing Machine</a>, which peaked my interest after having taken <a href="https://www.cs.swarthmore.edu/~fontes/cs46/17s/index.php">Theory of Computation</a>, with my other fantastic professor <a href="https://www.cs.swarthmore.edu/~fontes/">Lila Fontes</a> and learning about <a href="https://en.wikipedia.org/wiki/Turing_machine">Turing machines</a>.</li>
  <li><strong><a href="https://www.linkedin.com/in/david-ha-168a012/">David Ha</a></strong> - Another brilliant scientist who we had the privilege of corresponding with. Check out his blog <a href="http://blog.otoro.net/">here</a>. It’s beautiful. He also is very prolific on <a href="https://arxiv.org/">ArXiv</a> which is always cool to see.</li>
</ul>

<h1 id="-concept">📝 Concept</h1>

<p>This section is going to be for non-technical people to understand what we were trying to do. It’s relatively simple. At a very high level, <strong>we are trying to teach a computer how to generate human-looking handwriting</strong>. To do that, we are going to train a neural network on a public dataset, the <a href="https://fki.tic.heia-fr.ch/databases/iam-on-line-handwriting-database">IAM Online Handwriting Database</a>. This dataset was collected by having a ton of people write on a tablet while the pen position was recorded. It consists of sets of <code class="language-plaintext highlighter-rouge">Stroke</code> data, which are tuples of $(x, y, t)$, where $(x, y)$ are the coordinates on the tablet and $t$ is the timestamp. We’ll use this data to train a model so that, across all of the participants, we get a blended approach of how to write like a human.</p>
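<p>To make the data concrete, here’s a toy sketch (hypothetical code, not the project’s actual pipeline) of how absolute $(x, y, t)$ samples can be turned into per-step pen offsets, which is the kind of representation a sequence model typically consumes:</p>

```python
# Toy illustration (not the project's actual code): converting a
# sequence of absolute (x, y, t) pen samples into per-step offsets.

def strokes_to_offsets(points):
    """Convert absolute (x, y, t) samples into (dx, dy) deltas.

    Working with deltas rather than absolute coordinates keeps the
    inputs in a small, roughly stationary range, which is easier for
    a neural network to model.
    """
    offsets = []
    for (x0, y0, _), (x1, y1, _) in zip(points, points[1:]):
        offsets.append((x1 - x0, y1 - y0))
    return offsets

points = [(100, 200, 0.00), (103, 201, 0.01), (107, 199, 0.02)]
print(strokes_to_offsets(points))  # [(3, 1), (4, -2)]
```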

<h1 id="-software">👾 Software</h1>

<p>In college, we decided between <a href="https://www.tensorflow.org/">Tensorflow</a> and <a href="https://pytorch.org/">Pytorch</a>, and went with Tensorflow. Given the times, I wanted to resume our Tensorflow approach with updated designs, but I also wanted to try <a href="https://github.com/jax-ml/jax">JAX</a>. <a href="https://github.com/jax-ml/jax">JAX</a> is… newer. But it’s gotten some hype online and I think there’s a solid amount of adoption across the bigger AI labs now. In my opinion, Tensorflow is dying, Pytorch is the new status quo, and JAX is the new kid on the block. However, I’m not an ML researcher clearing millions of dollars, so grain of salt. This <a href="https://neel04.github.io/my-website/blog/pytorch_rant/">clickbaity article</a>, which declares <em>“Pytorch is dead. Long live JAX”</em>, got a ton of flak online, but regardless… it piqued my interest enough to try it here.</p>

<p>I’ll cover all three here and yeah probably dive deepest into tensorflow… but feel free to skip this section.</p>

<h2 id="tensorflow">Tensorflow</h2>

<h3 id="programming-paradigm">Programming Paradigm</h3>

<p>Tensorflow has this interesting programming paradigm where you are more or less creating a graph. You define <code class="language-plaintext highlighter-rouge">Tensor</code>s and operations up front, and the actual computation only happens when you run your dependency graph.</p>

<p>I have this quote from the Tensorflow API:</p>

<blockquote>
  <p>There’s only two things that go into Tensorflow.</p>

  <ol>
    <li>Building your computational dependency graph.</li>
    <li>Running your dependency graph.</li>
  </ol>
</blockquote>

<p>This was the old way, but now that’s not totally true. Apparently, Tensorflow 2.0 helped out a lot with the computational model and the notion of eagerly executing, rather than building the graph and then having everything run at once.</p>
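<p>To make the two-phase idea concrete, here’s a toy deferred-evaluation sketch in plain Python (explicitly <em>not</em> real Tensorflow code): we first describe a computation as a graph of nodes, and only later run it with concrete inputs.</p>

```python
# Toy sketch (plain Python, not actual Tensorflow) of the old
# "build the graph, then run it" paradigm: nodes only describe the
# computation; nothing is evaluated until we explicitly run the graph.

class Node:
    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value = op, inputs, value

    def run(self, feed):
        # Recursively evaluate the dependency graph.
        if self.op == "const":
            return self.value
        if self.op == "placeholder":
            return feed[self]
        args = [n.run(feed) for n in self.inputs]
        if self.op == "add":
            return args[0] + args[1]
        if self.op == "mul":
            return args[0] * args[1]
        raise ValueError(self.op)

# Phase 1: build the dependency graph (nothing computes here).
x = Node("placeholder")
w = Node("const", value=3)
b = Node("const", value=1)
y = Node("add", inputs=(Node("mul", inputs=(w, x)), b))  # y = 3x + 1

# Phase 2: run the graph with concrete inputs.
print(y.run({x: 4}))  # 13
```

Tensorflow 2.x's eager mode collapses these two phases into ordinary imperative execution, which is a big part of why it feels so different from the v0.x days.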

<h3 id="versions---how-the-times-have-changed">Versions - How the times have changed</h3>

<p>So - another fun fact - when we were doing this in college, we were on <strong>tensorflow version v0.11</strong>!!! They hadn’t even released a major version. Now, I’m doing this on Tensorflow <strong>2.16.1</strong>. So the times have definitely changed.</p>

<p><img src="/images/generative-handwriting/being_old.jpeg" alt="being-old" class="center-shrink" /></p>

<p>Definitely haven’t been able to keep up with all those changes.</p>

<h3 id="tensorboard">Tensorboard</h3>

<p>Another cool thing about <a href="https://www.tensorflow.org/">Tensorflow</a> that should be mentioned is <a href="https://www.tensorflow.org/get_started/summaries_and_tensorboard">Tensorboard</a>. This is a visualization suite that serves a local website where you can interactively visualize your dependency graph and live-stream training metrics. You can do cool things like confirm that the error is actually decreasing over the epochs.</p>

<p>We used this a bit more in college. I didn’t get a real chance to dive into the updates made from this.</p>

<h2 id="pytorch">Pytorch</h2>

<p><a href="https://pytorch.org/">PyTorch</a> is now basically the defacto standard for most serious research labs and AI shops. To me, it seems like things are still somewhat ported to Tensorflow for production, but I’m not totally sure about convention.</p>

<p>Pytorch seems to thread the needle between Tensorflow and JAX. Functions don’t necessarily need to be pure to be utilized; you can loop and mutate state in a <code class="language-plaintext highlighter-rouge">nn.Module</code> just fine.</p>

<p>I won’t be covering pytorch but I certainly will come back around to it in later projects.</p>

<h2 id="jax">JAX</h2>

<p>The new up and comer! I think it’s largely a crowd favorite for its speed. Documentation is obviously worse. One Redditor summarized it nicely:</p>

<blockquote class="reddit-embed-bq" data-embed-theme="dark" data-embed-height="396"><a href="https://www.reddit.com/r/MachineLearning/comments/1b08qv6/comment/ks6u1e2/">Comment</a><br /> by<a href="https://www.reddit.com/user/Few-Pomegranate4369/">u/Few-Pomegranate4369</a> from discussion<a href="https://www.reddit.com/r/MachineLearning/comments/1b08qv6/d_is_it_worth_switching_to_jax_from/"></a><br /> in<a href="https://www.reddit.com/r/MachineLearning/">MachineLearning</a></blockquote>
<script async="" src="https://embed.reddit.com/widgets.js" charset="UTF-8"></script>

<p><br /></p>

<p>I hit numerous roadblocks where functions weren’t actually pure and then the JIT compile portion basically failed on startup.</p>

<h3 id="programming-paradigm-1">Programming Paradigm</h3>

<p>JAX and Pytorch are definitely the closest to traditional imperative Python flow. The main restriction in JAX is that transformed functions must be pure. Tensorflow, meanwhile, is gradually moving away from the compile-your-graph-then-run-it paradigm.</p>

<h1 id="-data">📊 Data</h1>

<p>We’re using the <a href="https://fki.tic.heia-fr.ch/databases/iam-on-line-handwriting-database">IAM Online Handwriting Database</a>. Specifically, I’m looking at <code class="language-plaintext highlighter-rouge">data/lineStrokes-all.tar.gz</code>, which is XML data that looks like this:</p>

<p><img src="/images/generative-handwriting/example_data.png" alt="data" class="center-super-shrink lightbox-image" /></p>

<div class="image-caption">Example Handwriting IAM Data</div>
<p><br /></p>
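<p>If you want to poke at the raw data yourself, here’s a minimal parsing sketch using Python’s <code class="language-plaintext highlighter-rouge">xml.etree</code>. The inline sample mirrors the rough shape of the lineStrokes XML (a <code class="language-plaintext highlighter-rouge">StrokeSet</code> of <code class="language-plaintext highlighter-rouge">Stroke</code> elements containing <code class="language-plaintext highlighter-rouge">Point</code> elements with x/y/time attributes); the real files carry more metadata, so treat the exact schema here as an assumption.</p>

```python
# A minimal parsing sketch for IAM-style lineStrokes XML. The sample
# below only approximates the real schema; the actual files include
# extra session metadata that this sketch ignores.
import xml.etree.ElementTree as ET

SAMPLE = """
<WhiteboardCaptureSession>
  <StrokeSet>
    <Stroke>
      <Point x="835" y="1021" time="1185.2"/>
      <Point x="838" y="1018" time="1185.21"/>
    </Stroke>
  </StrokeSet>
</WhiteboardCaptureSession>
"""

def parse_strokes(xml_text):
    """Return a list of strokes, each a list of (x, y, t) tuples."""
    root = ET.fromstring(xml_text)
    strokes = []
    for stroke in root.iter("Stroke"):
        strokes.append(
            [
                (int(p.get("x")), int(p.get("y")), float(p.get("time")))
                for p in stroke.iter("Point")
            ]
        )
    return strokes

print(parse_strokes(SAMPLE))
```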

<p>There’s also this note:</p>

<blockquote>
  <p>The database is divided into 4 parts, a training set, a first validation set, a second validation set and a final test set. The training set may be used for training the recognition system, while the two validation sets may be used for optimizing some meta-parameters. The final test set must be left unseen until the final test is performed. Note that you are allowed to use also other data for training etc, but report all the changes when you publish your experimental results and let the test set unchanged (It contains 3859 sequences, i.e. XML-files - one for each text line).</p>
</blockquote>

<p>So that determines our training set, validation set, second validation set, and a final test set.</p>

<h1 id="-base-neural-network-theory">🧠 Base Neural Network Theory</h1>

<p>I am not going to dive into details as much as we did for our senior E90 thesis, but I do want to cover a couple of the building blocks.</p>

<h2 id="lions-bears-and-many-neural-networks-oh-my">Lions, Bears, and Many Neural Networks, oh my</h2>

<p>I would highly encourage you to check out this website: <a href="https://www.asimovinstitute.org/neural-network-zoo/">https://www.asimovinstitute.org/neural-network-zoo/</a>. I remember seeing it in college when working on this thesis and was stunned. If you’re too lazy to click, check out the fun picture:</p>

<p><img src="/images/generative-handwriting/neural_network_zoo.png" alt="neural-network-zoo" class="center-super-shrink lightbox-image" /></p>

<div class="image-caption">Courtesy of <a href="https://www.asimovinstitute.org/neural-network-zoo/">Asimov Institute</a></div>
<p><br /></p>

<p>We’re going to explore some of the zoo in a bit more detail, specifically, focusing on <a href="https://en.wikipedia.org/wiki/Long_short-term_memory">LSTMs</a>.</p>

<h2 id="basic-neural-network">Basic Neural Network</h2>

<p><img src="https://aiml.com/wp-content/uploads/2023/08/Illustration-of-a-neural-net-1024x594.png" alt="basic-nn" class="center-shrink lightbox-image" /></p>

<div class="image-caption">Courtesy of <a href="https://aiml.com/what-is-the-basic-architecture-of-an-artificial-neural-network-ann/">AI ML</a></div>
<p><br /></p>

<p>The core structure of a neural network is the connections between all of the neurons. Each connection carries an activation signal of varying strength. If the incoming signal to a neuron is strong enough, then the signal is propagated through the next stages of the network.</p>

<p>There is an input layer that feeds the data into the hidden layer. The outputs from the hidden layer are then passed to the output layer. Every connection between nodes carries a weight determining the amount of information that gets passed through.</p>

<h3 id="hyper-parameters">Hyper Parameters</h3>

<p>For a basic neural network, there are generally three <a href="https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)">hyperparameters</a>:</p>

<ul>
  <li>pattern of connections between all neurons</li>
  <li>weights of connections between neurons</li>
  <li>activation functions of the neurons</li>
</ul>

<p>In our project however, we focus on a specific class of neural networks called Recurrent Neural Networks (RNNs), and the more specific variation of RNNs called Long Short Term Memory networks (<a href="https://en.wikipedia.org/wiki/Long_short-term_memory">LSTMs</a>).</p>

<p>However, let’s give a bit more context. There are really two broad types of neural networks:</p>

<ul>
  <li><strong><a href="https://en.wikipedia.org/wiki/Feedforward_neural_network">Feedforward Neural Network</a></strong></li>
  <li><strong><a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">Recurrent Neural Networks</a></strong></li>
</ul>

<h2 id="feedforward-neural-network">Feedforward Neural Network</h2>

<p>These neural networks channel information in <strong>one direction</strong>.</p>

<p>The figure above is showing a feedforward neural network because <strong>the connections do not allow for the same input data to be seen multiple times by the same node.</strong></p>

<p>These networks are generally well suited to mapping raw data to categories, for example classifying a face from an image.</p>

<p>Every node outputs a numerical value that it then passes to all its successor nodes. In other words:</p>

\[\begin{align}
y_j = f(x_j)
\end{align}
\tag{1}\]

<p>where</p>

\[\begin{align}
x_j = \sum_{i \in P_j} w_{ij} y_i
\end{align}
\tag{2}\]

<p>where</p>

<ul>
  <li>$y_j$ is the output of node $j$</li>
  <li>$x_j$ is the total weighted input for node $j$</li>
  <li>$w_{ij}$ is the weight from node $i$ to node $j$</li>
  <li>$y_i$ is the output from node $i$</li>
  <li>$P_j$ represents the set of predecessor nodes to node $j$</li>
</ul>

<p>Also note, $f(x)$ should be a smooth non-linear activation function that maps outputs to a reasonable domain. Some common activation functions include $\tanh(x)$ or the <a href="https://en.wikipedia.org/wiki/Sigmoid_function">sigmoid function</a>. These complex functions are necessary because the neural network is <em>literally</em> trying to learn a non-linear pattern.</p>
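<p>As a sanity check on equations (1) and (2), here’s a tiny hand-rolled sketch (hypothetical code, not from the project) of a single node’s forward computation:</p>

```python
# Equations (1) and (2) in code: each node sums its weighted inputs
# and applies a nonlinearity. A hand-rolled sketch for intuition only.
import math

def node_output(weights, predecessor_outputs):
    # x_j = sum_i w_ij * y_i  (eq. 2), then y_j = f(x_j)  (eq. 1)
    x_j = sum(w * y for w, y in zip(weights, predecessor_outputs))
    return math.tanh(x_j)

# A node with two predecessors:
y = node_output([0.5, -0.25], [1.0, 2.0])  # tanh(0.5 - 0.5) = 0.0
print(y)
```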

<h3 id="backpropagation">Backpropagation</h3>

<p><a href="https://en.wikipedia.org/wiki/Backpropagation">Backpropagation</a> is the mechanism in which we pass the error back through the network starting at the output node. Generally, we minimize using [stochastic gradient descent][stoch-grad-desc]. Again, lots of different ways we can define our error, but we can use sum of squared residuals between our $k$ targets and the output of $k$ nodes of the network.</p>

\[\begin{align}
E = \frac{1}{2} \sum_{k}(t_k - y_k)^2
\end{align}
\tag{3}\]

<p>The gradient descent part comes in next. We compute the gradient of the error with respect to each weight and move the weights in the direction that decreases the error:</p>

\[\begin{align}
g_{ij} = - \frac{\delta E}{\delta w_{ij}}
\end{align}
\tag{4}\]

<p>So overall, we’re continually altering the weights, each in the direction that reduces its individual contribution to the overall error of the outputs.</p>
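<p>To see equations (3) and (4) in action, here’s a deliberately tiny gradient descent sketch (one weight, one training example; illustrative only, since real networks do this for every weight via backpropagation):</p>

```python
# A tiny gradient-descent sketch for the error in eq. (3): one weight,
# one training pair, repeatedly nudged along -dE/dw (eq. 4).
def train(w, x, t, lr=0.1, steps=100):
    for _ in range(steps):
        y = w * x                 # forward pass
        grad = -(t - y) * x       # dE/dw for E = 0.5 * (t - y)^2
        w -= lr * grad            # step against the gradient
    return w

w = train(w=0.0, x=2.0, t=3.0)
print(w)  # converges toward 1.5, since 1.5 * 2.0 = 3.0
```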

<p>The major downfall of this simple network is that we don’t have full context. With sequences, there’s not enough information about the previous words, so the context is missing. And that leads us to our next structure.</p>

<h2 id="recurrent-neural-network">Recurrent Neural Network</h2>

<p><a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">Recurrent Neural Networks (RNNs)</a> have a capacity to remember. This memory stems from the fact that their input is not only the current input vector but also a variation of what they output at previous time steps.</p>

<p>This visualization from <a href="https://colah.github.io/">Christopher Olah</a> (who, holy hell, I just realized is a co-founder of <a href="https://www.anthropic.com/">Anthropic</a>, and who Tom and I used to follow closely in college) is great:</p>

<p><img src="/images/generative-handwriting/rnn_unrolled.png" alt="rnn-unrolled" class="center-shrink lightbox-image" /></p>

<div class="image-caption">Courtesy of Chris Olah's <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">Understanding LSTMs</a></div>
<p><br /></p>

<p>This RNN module is being unrolled over multiple time steps. Information is passed from the module at time step $t$ to the module at $t+1$.</p>

<p>Per Tom and my paper,</p>

<blockquote>
  <p>An ideal RNN would theoretically be able to remember as far back as was necessary in order to make an accurate prediction. However, as with many things, the theory does not carry over to reality. RNNs have trouble learning long term dependencies due to the vanishing gradient problem. An example of such a long term dependency might be if we are trying to predict the last word in the following sentence ”My family originally comes from Belgium so my native language is PREDICTION”. A normal RNN would possibly be able to recognize that the prediction should be a language but it would need the earlier context of Belgium to be able to accurately predict DUTCH.</p>
</blockquote>

<p>Topically, this is why the craze around LLMs is so impressive. There’s a lot more going on with LLMs… which… I will not cover here.</p>

<p>The notion of <a href="https://en.wikipedia.org/wiki/Backpropagation">backpropagation</a> is basically the same, except we also have the added dimension of time (backpropagation through time).</p>

<p>The crux of the issue is that an unrolled RNN is effectively a very deep network, and repeated multiplication during backpropagation pushes the derivatives toward zero. The gradients become too small and underflow. In practical terms, the network then ceases to be able to learn.</p>
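<p>You can see the vanishing gradient numerically in a few lines of Python: chaining many time steps multiplies many per-step derivatives together, and since each tanh derivative is below 1 away from zero, the product collapses.</p>

```python
# Why gradients vanish: backpropagating through T time steps multiplies
# T per-step derivatives together. If each factor is below 1 (as tanh
# derivatives are away from zero), the product shrinks exponentially.
import math

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

grad = 1.0
for _ in range(100):          # 100 unrolled time steps
    grad *= tanh_grad(1.0)    # each factor is ~0.42
print(grad)                   # astronomically small: early steps learn nothing
```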

<p>However, <a href="https://en.wikipedia.org/wiki/Sepp_Hochreiter">Sepp Hochreiter</a> and <a href="https://en.wikipedia.org/wiki/J%C3%BCrgen_Schmidhuber">Juergen Schmidhuber</a> developed the <a href="https://en.wikipedia.org/wiki/Long_short-term_memory">Long Short Term Memory (LSTM)</a> unit that solved this vanishing gradient problem.</p>

<h2 id="long-short-term-memory-networks">Long Short Term Memory Networks</h2>

<p><a href="https://en.wikipedia.org/wiki/Long_short-term_memory">Long Short Term Memory (LSTM)</a> networks are specifically designed to learn long term dependencies.</p>

<p>Every form of RNN has repeating modules that pass information across timesteps, and LSTMs are no different. Where they differ is in the inner structure of each module. While a standard RNN module might have a single neural network layer, LSTM modules have four.</p>

<p><img src="/images/generative-handwriting/lstm.png" alt="lstm-viz" class="center-shrink lightbox-image" /></p>

<div class="image-caption">Courtesy of Chris Olah's <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">Understanding LSTMs</a></div>
<p><br /></p>

<h3 id="understanding-the-llm-structure">Understanding the LLM Structure</h3>

<p>So let’s better understand the structure above. There’s a far more comprehensive walkthrough in Chris Olah’s post <a href="https://colah.github.io/">here</a>; I’d encourage you to check it out.</p>

<p><img src="/images/generative-handwriting/single_lstm_module.png" alt="lstm-viz" class="center-super-shrink lightbox-image" /></p>

<div class="image-caption">Courtesy of Chris Olah's <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">Understanding LSTMs</a></div>
<p><br /></p>

<p>The top line is key to the LSTM’s ability to remember. It is called the cell state. We’ll reference it as $C_t$.</p>

<p>The first neural network layer is a sigmoid function. It takes as input the concatenation of the current input $x_t$ and the output of the previous module, $h_{t-1}$. This is the forget gate: it is in control of what the cell state forgets. The sigmoid function is a good architectural decision here because it outputs numbers in $[0, 1]$, indicating how much the layer should let through.</p>

<p>We piecewise multiply the output of the sigmoid layer $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$, with the cell state from the previous module $C_{t-1}$, forgetting the things that it doesn’t see as important.</p>

<p>Then right in the center of the image above there are two neural network layers which make up the update gate. First, the concatenation $[h_{t-1}, x_t]$ is pushed through both a sigmoid ($\sigma$) layer and a $\tanh$ layer. The output of the sigmoid layer, $i_t = \sigma (W_i \cdot [h_{t-1}, x_t] + b_i)$, determines which values to update, and the output of the $\tanh$ layer, $\tilde{C}_t = \tanh (W_C \cdot [h_{t-1}, x_t] + b_C)$, proposes an entirely new candidate cell state. These two results are then piecewise multiplied and added to the current cell state (which we just edited using the forget gate), yielding the new cell state $C_t$.</p>

<p>The final neural network layer is called the output gate. It determines the relevant portion of the cell state to output as $h_t$. Once again, we feed the concatenation $[h_{t-1}, x_t]$ through a sigmoid layer whose output, $o_t = \sigma (W_o \cdot [h_{t-1}, x_t] + b_o)$, we piecewise multiply with $\tanh(C_t)$. The result of the multiplication is the output of the LSTM module. Note that the <span style="color:purple"><strong>purple</strong></span> $\tanh$ is not a neural network layer, but a pointwise $\tanh$ intended to push the current cell state into a reasonable domain.</p>
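<p>Putting the three gates together, here’s a single LSTM step written out with numpy. This is a sketch of the standard (non-peephole) cell for intuition; the dimensions and fused-gate layout are my own choices, not the post’s model code.</p>

```python
# A single LSTM step in numpy, matching the gate equations above
# (standard LSTM, no peepholes). Illustrative sketch only.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W maps the concatenation [h_prev, x_t] to all four gates at once."""
    concat = np.concatenate([h_prev, x_t])
    n = h_prev.shape[0]
    z = W @ concat + b                      # shape (4n,)
    f_t = sigmoid(z[0:n])                   # forget gate
    i_t = sigmoid(z[n:2 * n])               # input gate
    c_hat = np.tanh(z[2 * n:3 * n])         # candidate cell state
    o_t = sigmoid(z[3 * n:4 * n])           # output gate
    c_t = f_t * c_prev + i_t * c_hat        # new cell state
    h_t = o_t * np.tanh(c_t)                # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
W = rng.normal(size=(4 * n_hidden, n_hidden + n_in))
b = np.zeros(4 * n_hidden)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```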

<div class="markdown-alert markdown-alert-note">
<p><b>I'm serious... you guys should check out Olah's <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">Understanding LSTMs</a> blog post. Here he is back in 2015, strongly foreshadowing transformers with his focus on attention (which is truly the hardest part of all this).</b></p>
</div>

<p><img src="/images/generative-handwriting/olah-attention.png" alt="olah-attention" class="center-shrink" /></p>

<div class="image-caption">Courtesy of Chris Olah's <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">Understanding LSTMs</a></div>
<p><br /></p>

<p><br /></p>

<h1 id="-concepts-to-code">🧬 Concepts to Code</h1>

<p>When I very first started this project, I kind of figured that I would be able to use some of my college code. But looking back, it’s quite a mess, and I don’t think that’s the way to go about it.</p>

<p>I thought for a while about how best to structure this part, meaning the code, but also how to show it in this blog post. With all the buzz about JAX, I wanted to try that too, so I thought it’d be helpful to show a side by side translation of the Tensorflow vs JAX code. My hope is that we’ll walk through the concepts and have a good understanding of the theory, and then the code will make a bit more sense. One note: I was a bit burnt out on this project by the end, so in the JAX code I tried to use <a href="https://github.com/google-deepmind/optax"><code class="language-plaintext highlighter-rouge">optax</code> (link)</a> and <a href="https://github.com/google/flax"><code class="language-plaintext highlighter-rouge">flax</code> (link)</a> as much as possible to cut down on the bulkiness of the code.</p>

<p>So we’ll walk through the building blocks (in terms of code) and then show the code translations.</p>

<h2 id="lstm-cell-with-peephole-connections">LSTM Cell with Peephole Connections</h2>

<h3 id="theory">Theory</h3>

<p>The basic LSTM cell (<code class="language-plaintext highlighter-rouge">tf.keras.layers.LSTMCell</code>) does not actually have the notion of peephole connections.</p>

<p>According to the very functional code that <a href="https://github.com/sjvasquez">sjvasquez</a> wrote, I don’t think we actually need it, but I figured it would be fun to implement regardless. Back in the old days, when Tensorflow supported add-ons, there was some work around this <a href="https://www.tensorflow.org/addons/api_docs/python/tfa/rnn/PeepholeLSTMCell">here</a>, but that project has since been deprecated.</p>

<p>That being said…. the JAX / Flax code also doesn’t have LSTMs out of the gate with peepholes and so…. I just used the normal ones. The JAX model actually trained a bit better, but I think part of that was also just patience.</p>
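<p>For reference, here are the peephole equations as I recall them from Graves’s paper: the $W_{c\,\cdot}$ terms are the diagonal peephole matrices that let each gate look directly at the cell state (note the output gate peeks at the <em>current</em> cell state $c_t$, while the input and forget gates see $c_{t-1}$):</p>

\[\begin{align}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \\
c_t &= f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \\
h_t &= o_t \tanh(c_t)
\end{align}\]

<p>Because the $W_{c\,\cdot}$ matrices are diagonal, element $m$ of each gate only receives input from element $m$ of the cell vector, which is exactly what the docstring in the code below describes.</p>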

<h3 id="code">Code</h3>

<!-- prettier-ignore-start -->

<div class="code-toggle">
  <div class="code-toggle__tabs">
    <button class="code-toggle__tab code-toggle__tab--active" data-tab="tensorflow">TensorFlow</button>
    <button class="code-toggle__tab" data-tab="jax">JAX</button>
  </div>
  <div class="code-toggle__content">
    <div class="code-toggle__pane code-toggle__pane--active" data-pane="tensorflow">

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">call</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">inputs</span><span class="p">:</span> <span class="n">tf</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="n">Tuple</span><span class="p">[</span><span class="n">tf</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]):</span>
    <span class="s">"""
    This is basically implementing Graves's equations on page 5
    https://www.cs.toronto.edu/~graves/preprint.pdf
    equations 5-11.

    From the paper,
    * sigma is the logistic sigmoid function
    * i -&gt; input gate
    * f -&gt; forget gate
    * o -&gt; output gate
    * c -&gt; cell state
    * W_{hi} - hidden-input gate matrix
    * W_{xo} - input-output gate matrix
    * W_{ci} - are diagonal
        + so element m in each gate vector only receives input from
        + element m of the cell vector
    """</span>

    <span class="c1"># going to be shape (?, num_lstm_units)
</span>    <span class="n">h_tm1</span><span class="p">,</span> <span class="n">c_tm1</span> <span class="o">=</span> <span class="n">state</span>

    <span class="c1"># basically the meat of eq, 7, 8, 9, 10
</span>    <span class="n">z</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">kernel</span><span class="p">)</span> <span class="o">+</span> <span class="n">tf</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">h_tm1</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">recurrent_kernel</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">bias</span>
    <span class="n">i_lin</span><span class="p">,</span> <span class="n">f_lin</span><span class="p">,</span> <span class="n">g_lin</span><span class="p">,</span> <span class="n">o_lin</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">z</span><span class="p">,</span> <span class="n">num_or_size_splits</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

    <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">should_apply_peephole</span><span class="p">:</span>
        <span class="n">pw_i</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">peephole_weights</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="n">pw_f</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">peephole_weights</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="n">i_lin</span> <span class="o">=</span> <span class="n">i_lin</span> <span class="o">+</span> <span class="n">c_tm1</span> <span class="o">*</span> <span class="n">pw_i</span>
        <span class="n">f_lin</span> <span class="o">=</span> <span class="n">f_lin</span> <span class="o">+</span> <span class="n">c_tm1</span> <span class="o">*</span> <span class="n">pw_f</span>

    <span class="c1"># apply activation functions! see Olah's blog
</span>    <span class="n">i</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">i_lin</span><span class="p">)</span>
    <span class="n">f</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">f_lin</span><span class="p">)</span>
    <span class="n">g</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">g_lin</span><span class="p">)</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">f</span> <span class="o">*</span> <span class="n">c_tm1</span> <span class="o">+</span> <span class="n">i</span> <span class="o">*</span> <span class="n">g</span>

    <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">should_apply_peephole</span><span class="p">:</span>
        <span class="n">pw_o</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">peephole_weights</span><span class="p">[:,</span> <span class="mi">2</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="n">o_lin</span> <span class="o">=</span> <span class="n">o_lin</span> <span class="o">+</span> <span class="n">c</span> <span class="o">*</span> <span class="n">pw_o</span>

    <span class="n">o</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">o_lin</span><span class="p">)</span>

    <span class="c1"># final hidden state -&gt; eq. 11
</span>    <span class="n">h</span> <span class="o">=</span> <span class="n">o</span> <span class="o">*</span> <span class="n">tf</span><span class="p">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">h</span><span class="p">,</span> <span class="p">[</span><span class="n">h</span><span class="p">,</span> <span class="n">c</span><span class="p">]</span></code></pre></figure>

</div>
<div class="code-toggle__pane" data-pane="jax">

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">HandwritingModel</span><span class="p">(</span><span class="n">nnx</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">config</span><span class="p">:</span> <span class="n">ModelConfig</span><span class="p">,</span>
        <span class="n">rngs</span><span class="p">:</span> <span class="n">nnx</span><span class="p">.</span><span class="n">Rngs</span><span class="p">,</span>
        <span class="n">synthesis_mode</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">False</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">config</span> <span class="o">=</span> <span class="n">config</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">synthesis_mode</span> <span class="o">=</span> <span class="n">synthesis_mode</span>

        <span class="c1"># rngs is basically a set of random keys / number generators
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">lstm_cells</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_build_lstm_stack</span><span class="p">(</span><span class="n">rngs</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">synthesis_mode</span><span class="p">:</span>
            <span class="c1"># i mean we really only care about synthesis mode, but in
</span>            <span class="c1"># this case we can make it explicit that if we have it then we should add our
</span>            <span class="c1"># attention layer
</span>            <span class="bp">self</span><span class="p">.</span><span class="n">attention_layer</span> <span class="o">=</span> <span class="n">nnx</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span>
                <span class="n">config</span><span class="p">.</span><span class="n">hidden_size</span> <span class="o">+</span> <span class="n">config</span><span class="p">.</span><span class="n">alphabet_size</span> <span class="o">+</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">3</span> <span class="o">*</span> <span class="n">config</span><span class="p">.</span><span class="n">num_attention_gaussians</span><span class="p">,</span> <span class="n">rngs</span><span class="o">=</span><span class="n">rngs</span>
            <span class="p">)</span>

        <span class="c1"># mdn portion
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">mdn_layer</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_build_mdn_head</span><span class="p">(</span><span class="n">rngs</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">_build_lstm_stack</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">rngs</span><span class="p">:</span> <span class="n">nnx</span><span class="p">.</span><span class="n">Rngs</span><span class="p">):</span>
        <span class="n">cells</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">num_layers</span><span class="p">):</span>
            <span class="k">if</span> <span class="n">i</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
                <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">synthesis_mode</span><span class="p">:</span>
                    <span class="c1"># so if we're in synthesis mode, then we need to add the alphabet size
</span>                    <span class="c1"># and the 3 dimensions of the input stroke
</span>                    <span class="c1"># that's because our alphabet size is the number of characters in our alphabet
</span>                    <span class="c1"># and the 3 dimensions of the input stroke are the x, y, and eos values
</span>                    <span class="n">in_size</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">alphabet_size</span> <span class="o">+</span> <span class="mi">3</span>
                <span class="k">else</span><span class="p">:</span>
                    <span class="n">in_size</span> <span class="o">=</span> <span class="mi">3</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="c1"># similar in both (just in synthesis we only care if we need to expand by the alphabet size)
</span>                <span class="n">in_size</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">hidden_size</span> <span class="o">+</span> <span class="mi">3</span>
                <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">synthesis_mode</span><span class="p">:</span>
                    <span class="n">in_size</span> <span class="o">+=</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">alphabet_size</span>

            <span class="c1"># ok... being lazy but this is just standard LSTM
</span>            <span class="n">cells</span><span class="p">.</span><span class="n">append</span><span class="p">(</span>
                <span class="p">{</span><span class="s">"linear"</span><span class="p">:</span> <span class="n">nnx</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">in_size</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">hidden_size</span><span class="p">,</span> <span class="mi">4</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">hidden_size</span><span class="p">,</span> <span class="n">rngs</span><span class="o">=</span><span class="n">rngs</span><span class="p">)}</span>
            <span class="p">)</span>
        <span class="k">return</span> <span class="n">cells</span>

    <span class="k">def</span> <span class="nf">lstm_cell</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">,</span> <span class="n">h</span><span class="p">:</span> <span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">,</span> <span class="n">c</span><span class="p">:</span> <span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">,</span> <span class="n">layer_idx</span><span class="p">:</span> <span class="nb">int</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Tuple</span><span class="p">[</span><span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">,</span> <span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">]:</span>
        <span class="c1"># just think about this as grabbing the W and b for our matrix mults
</span>        <span class="n">linear</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">lstm_cells</span><span class="p">[</span><span class="n">layer_idx</span><span class="p">][</span><span class="s">"linear"</span><span class="p">]</span>

        <span class="n">combined</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">x</span><span class="p">,</span> <span class="n">h</span><span class="p">],</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">gates</span> <span class="o">=</span> <span class="n">linear</span><span class="p">(</span><span class="n">combined</span><span class="p">)</span>

        <span class="n">i</span><span class="p">,</span> <span class="n">f</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">o</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">gates</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>

        <span class="c1"># activations
</span>        <span class="n">i</span> <span class="o">=</span> <span class="n">nnx</span><span class="p">.</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
        <span class="n">f</span> <span class="o">=</span> <span class="n">nnx</span><span class="p">.</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
        <span class="n">g</span> <span class="o">=</span> <span class="n">nnx</span><span class="p">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">g</span><span class="p">)</span>
        <span class="n">o</span> <span class="o">=</span> <span class="n">nnx</span><span class="p">.</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">o</span><span class="p">)</span>

        <span class="c1"># get new LSTM cell state
</span>        <span class="n">c_new</span> <span class="o">=</span> <span class="n">f</span> <span class="o">*</span> <span class="n">c</span> <span class="o">+</span> <span class="n">i</span> <span class="o">*</span> <span class="n">g</span>
        <span class="n">h_new</span> <span class="o">=</span> <span class="n">o</span> <span class="o">*</span> <span class="n">nnx</span><span class="p">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">c_new</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">h_new</span><span class="p">,</span> <span class="n">c_new</span></code></pre></figure>

</div>

  </div>
</div>

<!-- prettier-ignore-end -->

<h2 id="gaussian-mixture-models">Gaussian Mixture Models</h2>

<h3 id="theory-1">Theory</h3>

<p><img src="https://miro.medium.com/v2/resize:fit:996/1*kJYirC6ewCqX1M6UiXmLHQ.gif" alt="gmm-viz" class="center-super-shrink lightbox-image" /></p>

<div class="image-caption"><a href="https://miro.medium.com/v2/resize:fit:996/1*kJYirC6ewCqX1M6UiXmLHQ.gif">reference</a></div>
<p><br /></p>

<p><a href="https://brilliant.org/wiki/gaussian-mixture-model/">Gaussian Mixture Models</a> are an unsupervised technique to learn an underlying probabilistic model.</p>

<p>Brilliant has an incredible explanation walking through the theory <a href="https://brilliant.org/wiki/gaussian-mixture-model/">here</a>. I’d encourage you to check it out, but at a very high level:</p>

<ol>
  <li>A number of Gaussians is specified by the user</li>
  <li>The algo learns various parameters that represent the data while maximizing the likelihood of seeing such data</li>
</ol>

<p>So if we have $k$ components, for a <em>multivariate</em> Gaussian mixture model, we’ll learn $k$ means, $k$ variances, $k$ mixture weights, and $k$ correlations through expectation maximization (EM).</p>

<p>From <a href="https://brilliant.org/">Brilliant</a>, there are really two steps for the EM step:</p>

<blockquote>
  <p>The first step, known as the expectation step or E step, consists of calculating the expectation of the component assignments $C_k$ for each data point $x_i \in X$ given the model parameters $\phi_k, \mu_k$ , and $\sigma_k$ .</p>

  <p>The second step is known as the maximization step or M step, which consists of maximizing the expectations calculated in the E step with respect to the model parameters. This step consists of updating the values $\phi_k, \mu_k$ , and $\sigma_k$ .</p>
</blockquote>
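<p>To make the E and M steps above concrete, here’s a minimal sketch of one EM iteration for a one-dimensional two-component mixture. This is purely illustrative (plain numpy, variable names are mine) and is not part of the handwriting model itself:</p>

```python
import numpy as np

def em_step(x, pi, mu, sigma):
    """One EM iteration for a 1-D Gaussian mixture.

    x: (N,) data; pi, mu, sigma: (K,) mixture weights, means, std devs.
    """
    # E step: responsibilities r[n, k] = Pr(component k | x_n)
    dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    r = pi * dens
    r /= r.sum(axis=1, keepdims=True)

    # M step: re-estimate phi (pi), mu, sigma from the responsibilities
    nk = r.sum(axis=0)
    pi_new = nk / len(x)
    mu_new = (r * x[:, None]).sum(axis=0) / nk
    sigma_new = np.sqrt((r * (x[:, None] - mu_new) ** 2).sum(axis=0) / nk)
    return pi_new, mu_new, sigma_new

# two well-separated clusters; EM should recover their means
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(3, 1.0, 500)])
pi, mu, sigma = np.ones(2) / 2, np.array([-1.0, 1.0]), np.ones(2)
for _ in range(50):
    pi, mu, sigma = em_step(x, pi, mu, sigma)
```

After 50 iterations the learned means sit close to the true cluster centers of -2 and 3.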

<h3 id="code-1">Code</h3>

<p>There’s actually not a whole lot of code to provide here. GMMs are more of a technique that we’ll combine with the output of a neural network, which leads us smoothly into the next section.</p>

<h2 id="mixture-density-networks">Mixture Density Networks</h2>

<h3 id="theory-2">Theory</h3>

<p><a href="https://www.katnoria.com/mdn/">Mixture Density Networks</a> combine a neural network with a GMM: the network’s outputs parametrize a mixture probability distribution.</p>

<p><img src="https://towardsdatascience.com/wp-content/uploads/2024/05/1UKuoYsGWis22cOV7KpLjVg.png" alt="mdn-viz" class="basic-center lightbox-image" /></p>

<div class="image-caption">Courtesy of <a href="https://towardsdatascience.com/wp-content/uploads/2024/05/1UKuoYsGWis22cOV7KpLjVg.png">Towards Data Science</a></div>
<p><br /></p>

<p>Per our paper:</p>

<blockquote>
  <p>The idea is relatively simple - we take the output from a neural network and parametrize the learned parameters of the GMM. The result is that we can infer probabilistic prediction from our learned parameters. If our neural network is reasonably predicting where the next point might be, the GMM will then learn probabilistic parameters that model the distribution of the next point. This is different in a few key aspects. Namely, we now have target values because our data is sequential. Therefore, when we feed in our targets, we minimize the log likelihood based on those expectations, thus altering the GMM portion of the model to learn the predicted values.</p>
</blockquote>

<p>More or less though, the problem we’re trying to solve is predicting the next input given our output vector. Essentially, we’re asking for $\text{Pr}(x_{t+1} | y_t)$. I’m not going to show the proof (we didn’t in our paper either), but the equation for the conditional probability is shown below:</p>

\[\begin{align}
\text{Pr}(x_{t+1} | y_t) = \sum_{j=1}^{M} \pi_{j}^t \mathcal{N} (x_{t+1} \mid \mu_j^t, \sigma_j^t, \rho_j^t)
\end{align}
\tag{5}\]

<p>where</p>

\[\begin{align}
\mathcal{N}(x \mid \mu, \sigma, \rho) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt[]{1-\rho^2}} \exp \left[\frac{-Z}{2(1-\rho^2)}\right]
\end{align}
\tag{6}\]

<p>and</p>

\[\begin{align}
Z = \frac{(x_1 - \mu_1)^2 }{\sigma_1^2} + \frac{(x_2 - \mu_2)^2}{\sigma_2^2} - \frac{2\rho (x_1 - \mu_1) (x_2 - \mu_2) }{\sigma_1 \sigma_2}
\end{align}
\tag{7}\]
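<p>Equations (6) and (7) translate almost directly into code. Here’s a hedged numpy sketch of the bivariate density (function and argument names are mine, for illustration only):</p>

```python
import numpy as np

def bivariate_normal(x1, x2, mu1, mu2, sigma1, sigma2, rho):
    """Bivariate Gaussian density N(x | mu, sigma, rho) per eqs. (6) and (7)."""
    # Z from eq. (7)
    z = ((x1 - mu1) ** 2 / sigma1 ** 2
         + (x2 - mu2) ** 2 / sigma2 ** 2
         - 2 * rho * (x1 - mu1) * (x2 - mu2) / (sigma1 * sigma2))
    # normalizer from eq. (6)
    norm = 2 * np.pi * sigma1 * sigma2 * np.sqrt(1 - rho ** 2)
    return np.exp(-z / (2 * (1 - rho ** 2))) / norm
```

A quick sanity check: with $\rho = 0$ and unit variances, the density at the mean is $1 / (2\pi)$, the product of two standard univariate normals at their means.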

<p>Now, there’s a slight variation here because we have a handwriting specific <em>end-of-stroke</em> parameter. So we modify our conditional probability formula to result in our final calculation of:</p>

\[\begin{align}
\textrm{Pr}(x_{t+1} \mid y_t ) = \sum\limits_{j=1}\limits^{M} \pi_j^t \; \mathcal{N} (x_{t+1} \mid \mu_j^t, \sigma_j^t, \rho_j^t)
\begin{cases}
e_t &amp; \textrm{if } (x_{t+1})_3 = 1 \\
1-e_t &amp; \textrm{otherwise}
\end{cases}
\end{align}
\tag{8}\]

<p>And that’s it! That’s our final probability output from the MDN. Once we have this, performing our expectation maximization is simple, as the loss function we choose to minimize is just:</p>

\[\begin{align}
\mathcal{L}(\mathbf{x}) = - \sum\limits_{t=1}^{T} \log \textrm{Pr}(x_{t+1} \mid y_t)
\end{align}
\tag{9}\]
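<p>Putting equations (8) and (9) together, the per-timestep loss can be sketched as follows. Again, this is an illustrative numpy version (names like <code class="language-plaintext highlighter-rouge">mdn_loss_t</code> are mine, not from the actual model code), assuming the network has already produced the mixture parameters and the end-of-stroke probability $e_t$:</p>

```python
import numpy as np

def mdn_loss_t(x_next, pi, mu1, mu2, sigma1, sigma2, rho, e_t, eps=1e-8):
    """Negative log-likelihood of one target point, per eqs. (8) and (9).

    x_next: (x1, x2, eos_flag); pi, mu*, sigma*, rho: (M,) mixture params;
    e_t: scalar end-of-stroke probability from the network.
    """
    x1, x2, eos = x_next
    # bivariate density of each mixture component (eqs. 6-7)
    z = ((x1 - mu1) ** 2 / sigma1 ** 2
         + (x2 - mu2) ** 2 / sigma2 ** 2
         - 2 * rho * (x1 - mu1) * (x2 - mu2) / (sigma1 * sigma2))
    dens = np.exp(-z / (2 * (1 - rho ** 2))) / (
        2 * np.pi * sigma1 * sigma2 * np.sqrt(1 - rho ** 2))
    # mixture over M components, weighted by pi (eq. 8)
    mixture = np.sum(pi * dens)
    # Bernoulli end-of-stroke factor (eq. 8)
    bernoulli = e_t if eos == 1 else 1 - e_t
    # negative log-likelihood for this timestep (one term of eq. 9)
    return -np.log(mixture * bernoulli + eps)
```

Summing this over all $T$ timesteps gives the sequence loss $\mathcal{L}(\mathbf{x})$; the small <code class="language-plaintext highlighter-rouge">eps</code> guards against $\log 0$, mirroring the NaN-avoidance clipping in the real model.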

<h3 id="code-2">Code</h3>

<p>Here’s the corresponding code section for my mixture density network.</p>

<!-- prettier-ignore-start -->

<div class="code-toggle">
  <div class="code-toggle__tabs">
    <button class="code-toggle__tab code-toggle__tab--active" data-tab="tensorflow">TensorFlow</button>
    <button class="code-toggle__tab" data-tab="jax">JAX</button>
  </div>
  <div class="code-toggle__content">
    <div class="code-toggle__pane code-toggle__pane--active" data-pane="tensorflow">

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">MixtureDensityLayer</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">Layer</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">num_components</span><span class="p">,</span>
        <span class="n">name</span><span class="o">=</span><span class="s">"mdn"</span><span class="p">,</span>
        <span class="n">temperature</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span>
        <span class="n">enable_regularization</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
        <span class="n">sigma_reg_weight</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span>
        <span class="n">rho_reg_weight</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span>
        <span class="n">entropy_reg_weight</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span>
        <span class="o">**</span><span class="n">kwargs</span><span class="p">,</span>
    <span class="p">):</span>
        <span class="nb">super</span><span class="p">(</span><span class="n">MixtureDensityLayer</span><span class="p">,</span> <span class="bp">self</span><span class="p">).</span><span class="n">__init__</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="n">name</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">num_components</span> <span class="o">=</span> <span class="n">num_components</span>
        <span class="c1"># The number of parameters per mixture component: 2 means, 2 standard deviations, 1 correlation, 1 weight , 1 for eos
</span>        <span class="c1"># so that's our constant num_mixture_components_per_component
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">output_dim</span> <span class="o">=</span> <span class="n">num_components</span> <span class="o">*</span> <span class="n">NUM_MIXTURE_COMPONENTS_PER_COMPONENT</span> <span class="o">+</span> <span class="mi">1</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">mod_name</span> <span class="o">=</span> <span class="n">name</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">temperature</span> <span class="o">=</span> <span class="n">temperature</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">enable_regularization</span> <span class="o">=</span> <span class="n">enable_regularization</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">sigma_reg_weight</span> <span class="o">=</span> <span class="n">sigma_reg_weight</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">rho_reg_weight</span> <span class="o">=</span> <span class="n">rho_reg_weight</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">entropy_reg_weight</span> <span class="o">=</span> <span class="n">entropy_reg_weight</span>

    <span class="k">def</span> <span class="nf">build</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_shape</span><span class="p">):</span>
        <span class="n">graves_initializer</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">initializers</span><span class="p">.</span><span class="n">TruncatedNormal</span><span class="p">(</span><span class="n">mean</span><span class="o">=</span><span class="mf">0.0</span><span class="p">,</span> <span class="n">stddev</span><span class="o">=</span><span class="mf">0.075</span><span class="p">)</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">input_units</span> <span class="o">=</span> <span class="n">input_shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
        <span class="c1"># weights
</span>        <span class="c1"># lots of weight initialization here... could simplify here too
</span>
        <span class="c1"># biases
</span>        <span class="c1"># lots of bias initialization here... could simplify this part by just doing a massive 
</span>        <span class="c1"># and splitting... see the code if you're curious
</span>        <span class="nb">super</span><span class="p">().</span><span class="n">build</span><span class="p">(</span><span class="n">input_shape</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">call</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">inputs</span><span class="p">,</span> <span class="n">training</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="n">temperature</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="k">if</span> <span class="ow">not</span> <span class="n">training</span> <span class="k">else</span> <span class="bp">self</span><span class="p">.</span><span class="n">temperature</span>

        <span class="n">pi_logits</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">W_pi</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">b_pi</span>
        <span class="n">pi</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">pi_logits</span> <span class="o">/</span> <span class="n">temperature</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>  <span class="c1"># [B, T, K]
</span>        <span class="c1"># clipping here... I was getting cooked by NaN creep
</span>        <span class="n">pi</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">clip_by_value</span><span class="p">(</span><span class="n">pi</span><span class="p">,</span> <span class="mf">1e-6</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">)</span>

        <span class="n">mu</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">W_mu</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">b_mu</span>  <span class="c1"># [B, T, 2K]
</span>        <span class="n">mu1</span><span class="p">,</span> <span class="n">mu2</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>

        <span class="n">log_sigma</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">W_sigma</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">b_sigma</span>  <span class="c1"># [B, T, 2K]
</span>        <span class="c1"># again, this might be overkill but seems realistic for clipping
</span>        <span class="n">log_sigma</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">clip_by_value</span><span class="p">(</span><span class="n">log_sigma</span><span class="p">,</span> <span class="o">-</span><span class="mf">5.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">)</span>
        <span class="n">sigma</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">log_sigma</span><span class="p">)</span>
        <span class="n">sigma1</span><span class="p">,</span> <span class="n">sigma2</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">sigma</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>

        <span class="n">rho_raw</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">W_rho</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">b_rho</span>
        <span class="n">rho</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">rho_raw</span><span class="p">)</span> <span class="o">*</span> <span class="mf">0.9</span>

        <span class="n">eos_logit</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">W_eos</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">b_eos</span>

        <span class="k">return</span> <span class="n">tf</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">pi</span><span class="p">,</span> <span class="n">mu1</span><span class="p">,</span> <span class="n">mu2</span><span class="p">,</span> <span class="n">sigma1</span><span class="p">,</span> <span class="n">sigma2</span><span class="p">,</span> <span class="n">rho</span><span class="p">,</span> <span class="n">eos_logit</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span></code></pre></figure>

</div>
<div class="code-toggle__pane" data-pane="jax">

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">HandwritingModel</span><span class="p">(</span><span class="n">nnx</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">config</span><span class="p">:</span> <span class="n">ModelConfig</span><span class="p">,</span>
        <span class="n">rngs</span><span class="p">:</span> <span class="n">nnx</span><span class="p">.</span><span class="n">Rngs</span><span class="p">,</span>
        <span class="n">synthesis_mode</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">False</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">config</span> <span class="o">=</span> <span class="n">config</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">synthesis_mode</span> <span class="o">=</span> <span class="n">synthesis_mode</span>

        <span class="c1"># rngs is basically a set of random keys / number generators
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">lstm_cells</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_build_lstm_stack</span><span class="p">(</span><span class="n">rngs</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">synthesis_mode</span><span class="p">:</span>
            <span class="c1"># i mean we really only care about synthesis mode, but in
</span>            <span class="c1"># this case we can make it explicit that if we have it then we should add our
</span>            <span class="c1"># attention layer
</span>            <span class="bp">self</span><span class="p">.</span><span class="n">attention_layer</span> <span class="o">=</span> <span class="n">nnx</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span>
                <span class="n">config</span><span class="p">.</span><span class="n">hidden_size</span> <span class="o">+</span> <span class="n">config</span><span class="p">.</span><span class="n">alphabet_size</span> <span class="o">+</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">3</span> <span class="o">*</span> <span class="n">config</span><span class="p">.</span><span class="n">num_attention_gaussians</span><span class="p">,</span> <span class="n">rngs</span><span class="o">=</span><span class="n">rngs</span>
            <span class="p">)</span>

        <span class="c1"># mdn portion
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">mdn_layer</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_build_mdn_head</span><span class="p">(</span><span class="n">rngs</span><span class="p">)</span>

    <span class="c1">#....
</span>    
    <span class="k">def</span> <span class="nf">__call__</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">inputs</span><span class="p">:</span> <span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">,</span>
        <span class="n">char_seq</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
        <span class="n">char_lens</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
        <span class="n">initial_state</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">RNNState</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
        <span class="n">return_state</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">False</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">:</span>
        <span class="n">batch_size</span><span class="p">,</span> <span class="n">seq_len</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">inputs</span><span class="p">.</span><span class="n">shape</span>

        <span class="k">if</span> <span class="n">initial_state</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">h</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">num_layers</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">hidden_size</span><span class="p">),</span> <span class="n">inputs</span><span class="p">.</span><span class="n">dtype</span><span class="p">)</span>
            <span class="n">c</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">h</span><span class="p">)</span>
            <span class="n">kappa</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">batch_size</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">num_attention_gaussians</span><span class="p">),</span> <span class="n">inputs</span><span class="p">.</span><span class="n">dtype</span><span class="p">)</span>
            <span class="n">window</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">batch_size</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">alphabet_size</span><span class="p">),</span> <span class="n">inputs</span><span class="p">.</span><span class="n">dtype</span><span class="p">)</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">h</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">initial_state</span><span class="p">.</span><span class="n">hidden</span><span class="p">,</span> <span class="n">initial_state</span><span class="p">.</span><span class="n">cell</span>
            <span class="n">kappa</span><span class="p">,</span> <span class="n">window</span> <span class="o">=</span> <span class="n">initial_state</span><span class="p">.</span><span class="n">kappa</span><span class="p">,</span> <span class="n">initial_state</span><span class="p">.</span><span class="n">window</span>

        <span class="k">def</span> <span class="nf">step</span><span class="p">(</span><span class="n">carry</span><span class="p">,</span> <span class="n">x_t</span><span class="p">):</span>
            <span class="n">h</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">kappa</span><span class="p">,</span> <span class="n">window</span> <span class="o">=</span> <span class="n">carry</span>
            <span class="n">h_layers</span> <span class="o">=</span> <span class="p">[]</span>
            <span class="n">c_layers</span> <span class="o">=</span> <span class="p">[]</span>

            <span class="c1"># layer1
</span>            <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">synthesis_mode</span><span class="p">:</span>
                <span class="n">layer1_input</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">window</span><span class="p">,</span> <span class="n">x_t</span><span class="p">],</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="n">layer1_input</span> <span class="o">=</span> <span class="n">x_t</span>

            <span class="n">h1</span><span class="p">,</span> <span class="n">c1</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">lstm_cell</span><span class="p">(</span><span class="n">layer1_input</span><span class="p">,</span> <span class="n">h</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">c</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">0</span><span class="p">)</span>
            <span class="n">h_layers</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">h1</span><span class="p">)</span>
            <span class="n">c_layers</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">c1</span><span class="p">)</span>

            <span class="c1"># layer1 -&gt; attention
</span>            <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">synthesis_mode</span> <span class="ow">and</span> <span class="n">char_seq</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="ow">and</span> <span class="n">char_lens</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
                <span class="n">window</span><span class="p">,</span> <span class="n">kappa</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">compute_attention</span><span class="p">(</span><span class="n">h1</span><span class="p">,</span> <span class="n">kappa</span><span class="p">,</span> <span class="n">window</span><span class="p">,</span> <span class="n">x_t</span><span class="p">,</span> <span class="n">char_seq</span><span class="p">,</span> <span class="n">char_lens</span><span class="p">)</span>

            <span class="c1"># attention -&gt; layer2 and layer3
</span>            <span class="k">for</span> <span class="n">layer_idx</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">num_layers</span><span class="p">):</span>
                <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">synthesis_mode</span><span class="p">:</span>
                    <span class="n">layer_input</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">x_t</span><span class="p">,</span> <span class="n">h_layers</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">window</span><span class="p">],</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
                <span class="k">else</span><span class="p">:</span>
                    <span class="n">layer_input</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">x_t</span><span class="p">,</span> <span class="n">h_layers</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]],</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>

                <span class="n">h_new</span><span class="p">,</span> <span class="n">c_new</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">lstm_cell</span><span class="p">(</span><span class="n">layer_input</span><span class="p">,</span> <span class="n">h</span><span class="p">[</span><span class="n">layer_idx</span><span class="p">],</span> <span class="n">c</span><span class="p">[</span><span class="n">layer_idx</span><span class="p">],</span> <span class="n">layer_idx</span><span class="p">)</span>
                <span class="n">h_layers</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">h_new</span><span class="p">)</span>
                <span class="n">c_layers</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">c_new</span><span class="p">)</span>

            <span class="n">h_new</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">stack</span><span class="p">(</span><span class="n">h_layers</span><span class="p">)</span>
            <span class="n">c_new</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">stack</span><span class="p">(</span><span class="n">c_layers</span><span class="p">)</span>

            <span class="c1"># mdn output from final hidden state
</span>            <span class="n">mdn_out</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">mdn_layer</span><span class="p">(</span><span class="n">h_layers</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>  <span class="c1"># [B, 6M+1]
</span>
            <span class="k">return</span> <span class="p">(</span><span class="n">h_new</span><span class="p">,</span> <span class="n">c_new</span><span class="p">,</span> <span class="n">kappa</span><span class="p">,</span> <span class="n">window</span><span class="p">),</span> <span class="n">mdn_out</span>

        <span class="c1"># this was the major unlock for JAX performance
</span>        <span class="c1"># it allows us to vectorize the computation over the time dimension
</span>        <span class="c1"># transpose inputs from [B, T, 3] to [T, B, 3] for scan
</span>        <span class="n">inputs_transposed</span> <span class="o">=</span> <span class="n">inputs</span><span class="p">.</span><span class="n">swapaxes</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
        <span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">kappa</span><span class="p">,</span> <span class="n">window</span><span class="p">),</span> <span class="n">outputs</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">lax</span><span class="p">.</span><span class="n">scan</span><span class="p">(</span><span class="n">step</span><span class="p">,</span> <span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">kappa</span><span class="p">,</span> <span class="n">window</span><span class="p">),</span> <span class="n">inputs_transposed</span><span class="p">)</span>

        <span class="c1"># transpose back
</span>        <span class="n">outputs</span> <span class="o">=</span> <span class="n">outputs</span><span class="p">.</span><span class="n">swapaxes</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>

        <span class="k">if</span> <span class="n">return_state</span><span class="p">:</span>
            <span class="n">final_state</span> <span class="o">=</span> <span class="n">RNNState</span><span class="p">(</span><span class="n">hidden</span><span class="o">=</span><span class="n">h</span><span class="p">,</span> <span class="n">cell</span><span class="o">=</span><span class="n">c</span><span class="p">,</span> <span class="n">kappa</span><span class="o">=</span><span class="n">kappa</span><span class="p">,</span> <span class="n">window</span><span class="o">=</span><span class="n">window</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">outputs</span><span class="p">,</span> <span class="n">final_state</span>

        <span class="k">return</span> <span class="n">outputs</span></code></pre></figure>

</div>

  </div>
</div>

<!-- prettier-ignore-end -->
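<p>The comment in the JAX model above calls <code>jax.lax.scan</code> the major unlock, so it's worth spelling out the contract it expects. Below is a stdlib-only sketch of the semantics (not the real implementation): a step function <code>step(carry, x) -&gt; (new_carry, y)</code> threaded over the leading (time) axis. <code>jax.lax.scan</code> compiles that same loop into a single fused XLA op, which is what makes a sequential RNN fast under <code>jit</code>.</p>

```python
# Stdlib-only sketch of the carry/output contract that jax.lax.scan expects:
# step(carry, x) -> (new_carry, y). The real lax.scan runs this loop as one
# fused XLA operation over the leading (time) axis instead of a Python loop.
def scan(step, init, xs):
    carry, ys = init, []
    for x in xs:
        carry, y = step(carry, x)
        ys.append(y)
    return carry, ys

# Toy step: running sum as the carry, squared input as the per-step output.
final, outs = scan(lambda c, x: (c + x, x * x), 0, [1, 2, 3])
# final == 6, outs == [1, 4, 9]
```

<p>In the model above, the carry is the tuple <code>(h, c, kappa, window)</code> and the per-step output is the MDN parameter vector.</p>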

<h2 id="mixture-density-loss">Mixture Density Loss</h2>

<h3 id="theory-3">Theory</h3>

<p>I already covered the theory above, so I won’t rehash it here; I just found it cleaner to split the code between the network itself and the loss calculation. Note that there’s some fairly aggressive clipping going on, because I hit a lot of numerical instability with JAX. I suspect it was partly my implementation and partly the clipping itself, but the symptom was the loss silently collapsing to 0 rather than the program crashing. To be clear, loss going to zero was not desired.</p>
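<p>Before the code, a quick illustration (plain Python, hypothetical values) of why working in log space matters here: with log-densities as small as a long stroke sequence can produce, naively exponentiating underflows to zero and the log of the mixture becomes <code>-inf</code>, while the max-shifted log-sum-exp recovers the right value.</p>

```python
import math

def logsumexp(vals):
    # subtract the max before exponentiating so the largest term is exp(0) == 1
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

# component log-densities small enough that exp() underflows in float64
logs = [-1000.0, -1001.0, -1002.0]

naive_sum = sum(math.exp(v) for v in logs)  # underflows to exactly 0.0
stable = logsumexp(logs)                    # ~= -999.59
```

<p>This shift is what <code>tf.reduce_logsumexp</code> does internally, which is why the loss below combines <code>log_pi + log_gauss</code> instead of multiplying densities and taking the log afterward.</p>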

<h3 id="code-3">Code</h3>

<!-- prettier-ignore-start -->

<div class="code-toggle">
  <div class="code-toggle__tabs">
    <button class="code-toggle__tab code-toggle__tab--active" data-tab="tensorflow">TensorFlow</button>
    <button class="code-toggle__tab" data-tab="jax">JAX</button>
  </div>
  <div class="code-toggle__content">
    <div class="code-toggle__pane code-toggle__pane--active" data-pane="tensorflow">

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">@</span><span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">register_keras_serializable</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">mdn_loss</span><span class="p">(</span><span class="n">y_true</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">,</span> <span class="n">stroke_lengths</span><span class="p">,</span> <span class="n">num_components</span><span class="p">,</span> <span class="n">eps</span><span class="o">=</span><span class="mf">1e-8</span><span class="p">):</span>
    <span class="s">"""
    Mixture density negative log-likelihood computed fully in log-space.

    y_true: [B, T, 3]  -&gt; (x, y, eos ∈ {0,1})
    y_pred: [B, T, 6*K + 1] -&gt; (pi, mu1, mu2, sigma1, sigma2, rho, eos_logit)

    The log space change was because I was getting absolutely torched by the
    gradients when using the normal space.
    """</span>
    <span class="n">out_pi</span><span class="p">,</span> <span class="n">mu1</span><span class="p">,</span> <span class="n">mu2</span><span class="p">,</span> <span class="n">sigma1</span><span class="p">,</span> <span class="n">sigma2</span><span class="p">,</span> <span class="n">rho</span><span class="p">,</span> <span class="n">eos_logits</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">split</span><span class="p">(</span>
        <span class="n">y_pred</span><span class="p">,</span>
        <span class="p">[</span><span class="n">num_components</span><span class="p">]</span> <span class="o">*</span> <span class="mi">6</span> <span class="o">+</span> <span class="p">[</span><span class="mi">1</span><span class="p">],</span>
        <span class="n">axis</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
    <span class="p">)</span>

    <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">eos_targets</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">y_true</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>

    <span class="n">sigma1</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">clip_by_value</span><span class="p">(</span><span class="n">sigma1</span><span class="p">,</span> <span class="mf">1e-2</span><span class="p">,</span> <span class="mf">10.0</span><span class="p">)</span>
    <span class="n">sigma2</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">clip_by_value</span><span class="p">(</span><span class="n">sigma2</span><span class="p">,</span> <span class="mf">1e-2</span><span class="p">,</span> <span class="mf">10.0</span><span class="p">)</span>
    <span class="n">rho</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">clip_by_value</span><span class="p">(</span><span class="n">rho</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.9</span><span class="p">,</span> <span class="mf">0.9</span><span class="p">)</span>
    <span class="n">out_pi</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">clip_by_value</span><span class="p">(</span><span class="n">out_pi</span><span class="p">,</span> <span class="n">eps</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">)</span>

    <span class="n">log_2pi</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">constant</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="mf">2.0</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">pi</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">y_pred</span><span class="p">.</span><span class="n">dtype</span><span class="p">)</span>
    <span class="n">one_minus_rho2</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">clip_by_value</span><span class="p">(</span><span class="mf">1.0</span> <span class="o">-</span> <span class="n">tf</span><span class="p">.</span><span class="n">square</span><span class="p">(</span><span class="n">rho</span><span class="p">),</span> <span class="n">eps</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">)</span>
    <span class="n">log_one_minus_rho2</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">math</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">one_minus_rho2</span><span class="p">)</span>
    <span class="n">z1</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">mu1</span><span class="p">)</span> <span class="o">/</span> <span class="n">sigma1</span>
    <span class="n">z2</span> <span class="o">=</span> <span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="n">mu2</span><span class="p">)</span> <span class="o">/</span> <span class="n">sigma2</span>

    <span class="n">quad</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">square</span><span class="p">(</span><span class="n">z1</span><span class="p">)</span> <span class="o">+</span> <span class="n">tf</span><span class="p">.</span><span class="n">square</span><span class="p">(</span><span class="n">z2</span><span class="p">)</span> <span class="o">-</span> <span class="mf">2.0</span> <span class="o">*</span> <span class="n">rho</span> <span class="o">*</span> <span class="n">z1</span> <span class="o">*</span> <span class="n">z2</span>
    <span class="n">quad</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">clip_by_value</span><span class="p">(</span><span class="n">quad</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">100.0</span><span class="p">)</span>
    <span class="n">log_norm</span> <span class="o">=</span> <span class="o">-</span><span class="p">(</span><span class="n">log_2pi</span> <span class="o">+</span> <span class="n">tf</span><span class="p">.</span><span class="n">math</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">sigma1</span><span class="p">)</span> <span class="o">+</span> <span class="n">tf</span><span class="p">.</span><span class="n">math</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">sigma2</span><span class="p">)</span> <span class="o">+</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">log_one_minus_rho2</span><span class="p">)</span>
    <span class="n">log_gauss</span> <span class="o">=</span> <span class="n">log_norm</span> <span class="o">-</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">quad</span> <span class="o">/</span> <span class="n">one_minus_rho2</span>  <span class="c1"># [B, T, K]
</span>
    <span class="c1"># log mixture via log-sum-exp
</span>    <span class="n">log_pi</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">math</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">out_pi</span><span class="p">)</span>  <span class="c1"># [B, T, K]
</span>    <span class="n">log_gmm</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">reduce_logsumexp</span><span class="p">(</span><span class="n">log_pi</span> <span class="o">+</span> <span class="n">log_gauss</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>  <span class="c1"># [B, T]
</span>
    <span class="c1"># bce (bernoulli cross entropy) to help out with stability
</span>    <span class="n">eos_nll</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">sigmoid_cross_entropy_with_logits</span><span class="p">(</span><span class="n">labels</span><span class="o">=</span><span class="n">eos_targets</span><span class="p">,</span> <span class="n">logits</span><span class="o">=</span><span class="n">eos_logits</span><span class="p">)</span>  <span class="c1"># [B, T, 1]
</span>    <span class="n">eos_nll</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">squeeze</span><span class="p">(</span><span class="n">eos_nll</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>  <span class="c1"># [B, T]
</span>
    <span class="n">nll</span> <span class="o">=</span> <span class="o">-</span><span class="n">log_gmm</span> <span class="o">+</span> <span class="n">eos_nll</span>  <span class="c1"># [B, T]
</span>    <span class="k">if</span> <span class="n">stroke_lengths</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">mask</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">sequence_mask</span><span class="p">(</span><span class="n">stroke_lengths</span><span class="p">,</span> <span class="n">maxlen</span><span class="o">=</span><span class="n">tf</span><span class="p">.</span><span class="n">shape</span><span class="p">(</span><span class="n">y_true</span><span class="p">)[</span><span class="mi">1</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">nll</span><span class="p">.</span><span class="n">dtype</span><span class="p">)</span>
        <span class="n">nll</span> <span class="o">=</span> <span class="n">nll</span> <span class="o">*</span> <span class="n">mask</span>
        <span class="n">denom</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">maximum</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">reduce_sum</span><span class="p">(</span><span class="n">mask</span><span class="p">),</span> <span class="mf">1.0</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">tf</span><span class="p">.</span><span class="n">reduce_sum</span><span class="p">(</span><span class="n">nll</span><span class="p">)</span> <span class="o">/</span> <span class="n">denom</span>

    <span class="k">return</span> <span class="n">tf</span><span class="p">.</span><span class="n">reduce_mean</span><span class="p">(</span><span class="n">nll</span><span class="p">)</span></code></pre></figure>

</div>
<div class="code-toggle__pane" data-pane="jax">

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">compute_loss</span><span class="p">(</span>
    <span class="n">predictions</span><span class="p">:</span> <span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">,</span>
    <span class="n">targets</span><span class="p">:</span> <span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">,</span>
    <span class="n">lengths</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
    <span class="n">num_mixtures</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">NUM_BIVARIATE_GAUSSIAN_MIXTURE_COMPONENTS</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">:</span>
    <span class="n">nc</span> <span class="o">=</span> <span class="n">num_mixtures</span>
    <span class="n">pi</span><span class="p">,</span> <span class="n">mu1</span><span class="p">,</span> <span class="n">mu2</span><span class="p">,</span> <span class="n">s1</span><span class="p">,</span> <span class="n">s2</span><span class="p">,</span> <span class="n">rho</span><span class="p">,</span> <span class="n">eos_pred</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">predictions</span><span class="p">,</span> <span class="p">[</span><span class="n">nc</span><span class="p">,</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">nc</span><span class="p">,</span> <span class="mi">3</span> <span class="o">*</span> <span class="n">nc</span><span class="p">,</span> <span class="mi">4</span> <span class="o">*</span> <span class="n">nc</span><span class="p">,</span> <span class="mi">5</span> <span class="o">*</span> <span class="n">nc</span><span class="p">,</span> <span class="mi">6</span> <span class="o">*</span> <span class="n">nc</span><span class="p">],</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>

    <span class="n">pi</span> <span class="o">=</span> <span class="n">nnx</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">pi</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">s1</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">jnp</span><span class="p">.</span><span class="n">clip</span><span class="p">(</span><span class="n">s1</span><span class="p">,</span> <span class="o">-</span><span class="mi">10</span><span class="p">,</span> <span class="mi">3</span><span class="p">))</span>
    <span class="n">s2</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">jnp</span><span class="p">.</span><span class="n">clip</span><span class="p">(</span><span class="n">s2</span><span class="p">,</span> <span class="o">-</span><span class="mi">10</span><span class="p">,</span> <span class="mi">3</span><span class="p">))</span>
    <span class="n">rho</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">clip</span><span class="p">(</span><span class="n">nnx</span><span class="p">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">rho</span><span class="p">)</span> <span class="o">*</span> <span class="mf">0.95</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.95</span><span class="p">,</span> <span class="mf">0.95</span><span class="p">)</span>
    <span class="n">eos_pred</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">clip</span><span class="p">(</span><span class="n">nnx</span><span class="p">.</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">eos_pred</span><span class="p">),</span> <span class="mf">1e-8</span><span class="p">,</span> <span class="mi">1</span> <span class="o">-</span> <span class="mf">1e-8</span><span class="p">)</span>

    <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">eos</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">targets</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">],</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>

    <span class="c1"># major change: compute log probabilities with better numerical stability
</span>    <span class="n">rho_sq</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">clip</span><span class="p">(</span><span class="n">rho</span><span class="o">**</span><span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mf">0.9025</span><span class="p">)</span>
    <span class="n">one_minus_rho_sq</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">maximum</span><span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">rho_sq</span><span class="p">,</span> <span class="mf">1e-6</span><span class="p">)</span>
    <span class="n">norm</span> <span class="o">=</span> <span class="o">-</span><span class="n">jnp</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">jnp</span><span class="p">.</span><span class="n">pi</span><span class="p">)</span> <span class="o">-</span> <span class="n">jnp</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">s1</span><span class="p">)</span> <span class="o">-</span> <span class="n">jnp</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">s2</span><span class="p">)</span> <span class="o">-</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">jnp</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">one_minus_rho_sq</span><span class="p">)</span>

    <span class="n">z1</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">mu1</span><span class="p">)</span> <span class="o">/</span> <span class="n">jnp</span><span class="p">.</span><span class="n">maximum</span><span class="p">(</span><span class="n">s1</span><span class="p">,</span> <span class="mf">1e-6</span><span class="p">)</span>
    <span class="n">z2</span> <span class="o">=</span> <span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="n">mu2</span><span class="p">)</span> <span class="o">/</span> <span class="n">jnp</span><span class="p">.</span><span class="n">maximum</span><span class="p">(</span><span class="n">s2</span><span class="p">,</span> <span class="mf">1e-6</span><span class="p">)</span>

    <span class="n">exp_term</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.5</span> <span class="o">/</span> <span class="n">one_minus_rho_sq</span> <span class="o">*</span> <span class="p">(</span><span class="n">z1</span><span class="o">**</span><span class="mi">2</span> <span class="o">+</span> <span class="n">z2</span><span class="o">**</span><span class="mi">2</span> <span class="o">-</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">rho</span> <span class="o">*</span> <span class="n">z1</span> <span class="o">*</span> <span class="n">z2</span><span class="p">)</span>
    <span class="n">exp_term</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">clip</span><span class="p">(</span><span class="n">exp_term</span><span class="p">,</span> <span class="o">-</span><span class="mi">50</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
    <span class="n">log_probs</span> <span class="o">=</span> <span class="n">norm</span> <span class="o">+</span> <span class="n">exp_term</span>
    <span class="n">log_pi</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">jnp</span><span class="p">.</span><span class="n">maximum</span><span class="p">(</span><span class="n">pi</span><span class="p">,</span> <span class="mf">1e-8</span><span class="p">))</span>
    <span class="n">log_mixture</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">logsumexp</span><span class="p">(</span><span class="n">log_pi</span> <span class="o">+</span> <span class="n">log_probs</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>

    <span class="n">eos_loss</span> <span class="o">=</span> <span class="o">-</span><span class="n">jnp</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">eos</span> <span class="o">*</span> <span class="n">jnp</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">eos_pred</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">eos</span><span class="p">)</span> <span class="o">*</span> <span class="n">jnp</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">eos_pred</span><span class="p">),</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>

    <span class="n">loss</span> <span class="o">=</span> <span class="o">-</span><span class="n">log_mixture</span> <span class="o">+</span> <span class="n">eos_loss</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">jnp</span><span class="p">.</span><span class="n">isnan</span><span class="p">(</span><span class="n">loss</span><span class="p">)</span> <span class="o">|</span> <span class="n">jnp</span><span class="p">.</span><span class="n">isinf</span><span class="p">(</span><span class="n">loss</span><span class="p">),</span> <span class="mf">0.0</span><span class="p">,</span> <span class="n">loss</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">lengths</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">mask</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">predictions</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="o">&lt;</span> <span class="n">lengths</span><span class="p">[:,</span> <span class="bp">None</span><span class="p">]</span>
        <span class="n">loss</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">mask</span><span class="p">,</span> <span class="n">loss</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">)</span>
        <span class="n">total_loss</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">loss</span><span class="p">)</span> <span class="o">/</span> <span class="n">jnp</span><span class="p">.</span><span class="n">maximum</span><span class="p">(</span><span class="n">jnp</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">mask</span><span class="p">),</span> <span class="mi">1</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">jnp</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">jnp</span><span class="p">.</span><span class="n">isnan</span><span class="p">(</span><span class="n">total_loss</span><span class="p">)</span> <span class="o">|</span> <span class="n">jnp</span><span class="p">.</span><span class="n">isinf</span><span class="p">(</span><span class="n">total_loss</span><span class="p">),</span> <span class="mf">0.0</span><span class="p">,</span> <span class="n">total_loss</span><span class="p">)</span>

    <span class="n">mean_loss</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">loss</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">jnp</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">jnp</span><span class="p">.</span><span class="n">isnan</span><span class="p">(</span><span class="n">mean_loss</span><span class="p">)</span> <span class="o">|</span> <span class="n">jnp</span><span class="p">.</span><span class="n">isinf</span><span class="p">(</span><span class="n">mean_loss</span><span class="p">),</span> <span class="mf">0.0</span><span class="p">,</span> <span class="n">mean_loss</span><span class="p">)</span></code></pre></figure>

</div>

  </div>
</div>
<!-- prettier-ignore-end -->

<h2 id="attention-mechanism">Attention Mechanism</h2>

<h3 id="theory-4">Theory</h3>

<p>The attention mechanism really only comes into play with the Synthesis Network, which sadly <a href="https://www.linkedin.com/in/tom-wilmots-030781a6/">Tom</a> and I never got to in college. The idea (as with most attention schemes) is to tell the model more specifically where to focus. This isn’t the transformer-style attention from the famous “Attention Is All You Need” paper; instead, a mixture of Gaussians probabilistically indicates which part of the character sequence the model should be attending to. We one-hot encode the input characters to give them a clean numerical representation. The question we’re essentially answering is “I see a ‘w’ character, so roughly how far along the stroke sequence should I be while writing it?”, which also helps the model decide when to terminate.</p>

<p>The mathematical representation is here:</p>

<blockquote>
  <p>Given a length $U$ character sequence $\mathbf{c}$ and a length $T$ data sequence $\mathbf{x}$, the soft window $w_t$ into $\mathbf{c}$ at timestep $t$ ($1 \leq t \leq T$) is defined by the following discrete convolution with a mixture of $K$ Gaussian functions</p>

\[\begin{align}
\phi(t, u) &amp;= \sum_{k=1}^K \alpha^k_t\exp\left(-\beta_t^k\left(\kappa_t^k-u\right)^2\right)\\
w_t &amp;= \sum_{u=1}^U \phi(t, u)c_u
\end{align}\]

  <p>where $\phi(t, u)$ is the <em>window weight</em> of $c_u$ at timestep $t$.</p>

  <p>Intuitively, the $\kappa_t$ parameters control the location of the window, the $\beta_t$ parameters control the width of the window and the $\alpha_t$ parameters control the importance of the window within the mixture.</p>

  <p>The size of the soft window vectors is the same as the size of the character vectors $c_u$ (assuming a one-hot encoding, this will be the number of characters in the alphabet).</p>

  <p>Note that the window mixture is not normalised and hence does not determine a probability distribution; however the window weight $\phi(t, u)$ can be loosely interpreted as the network’s belief that it is writing character $c_u$ at time $t$.</p>
</blockquote>
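<p>Before the full TF/JAX implementations below, here is a minimal NumPy sketch of those two equations for a single timestep. All the values (<code>alpha</code>, <code>beta</code>, <code>kappa</code>, the toy one-hot matrix, and the sizes) are made up purely for illustration:</p>

```python
import numpy as np

# Toy sizes: U = char-seq length, A = alphabet size, K = num Gaussians.
U, A, K = 5, 4, 2
rng = np.random.default_rng(0)
c = np.eye(A)[rng.integers(0, A, size=U)]  # [U, A] one-hot character sequence

# Hypothetical window parameters for one timestep t.
alpha = np.array([1.0, 0.5])   # importance of each Gaussian in the mixture
beta = np.array([2.0, 1.0])    # width of each Gaussian
kappa = np.array([1.5, 3.0])   # location of each Gaussian

# phi(t, u) = sum_k alpha_k * exp(-beta_k * (kappa_k - u)^2)
u = np.arange(1, U + 1)                    # character positions 1..U
phi = (alpha[:, None] * np.exp(-beta[:, None] * (kappa[:, None] - u) ** 2)).sum(axis=0)  # [U]

# w_t = sum_u phi(t, u) * c_u  -- the soft window into the character sequence
w_t = phi @ c  # [A]

print(phi.shape, w_t.shape)  # (5,) (4,)
```

<p>Note that, exactly as Graves calls out, <code>phi</code> is not normalized into a probability distribution; each one-hot row of <code>c</code> sums to one, so <code>w_t</code> carries the same total mass as <code>phi</code>.</p>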

<h3 id="code-4">Code</h3>

<!-- prettier-ignore-start -->

<div class="code-toggle">
  <div class="code-toggle__tabs">
    <button class="code-toggle__tab code-toggle__tab--active" data-tab="tensorflow">TensorFlow</button>
    <button class="code-toggle__tab" data-tab="jax">JAX</button>
  </div>
  <div class="code-toggle__content">
    <div class="code-toggle__pane code-toggle__pane--active" data-pane="tensorflow">

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">@</span><span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">register_keras_serializable</span><span class="p">()</span>
<span class="k">class</span> <span class="nc">AttentionMechanism</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">Layer</span><span class="p">):</span>
    <span class="s">"""
    Attention mechanism for the handwriting synthesis model.
    This is a version of the attention mechanism used in
    the original paper by Alex Graves. It uses a Gaussian
    window to focus on different parts of the character sequence
    at each time step.

    See section: 5.0 / 5.1
    """</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">num_gaussians</span><span class="p">,</span> <span class="n">num_chars</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">"attention"</span><span class="p">,</span> <span class="n">debug</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="nb">super</span><span class="p">(</span><span class="n">AttentionMechanism</span><span class="p">,</span> <span class="bp">self</span><span class="p">).</span><span class="n">__init__</span><span class="p">(</span><span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">num_gaussians</span> <span class="o">=</span> <span class="n">num_gaussians</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">num_chars</span> <span class="o">=</span> <span class="n">num_chars</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">name_mod</span> <span class="o">=</span> <span class="n">name</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">debug</span> <span class="o">=</span> <span class="n">debug</span>

    <span class="k">def</span> <span class="nf">call</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">inputs</span><span class="p">,</span>  <span class="c1"># shape: [batch_size, num_gaussians, 3]
</span>        <span class="n">prev_kappa</span><span class="p">,</span>  <span class="c1"># shape: [batch_size, num_gaussians]
</span>        <span class="n">char_seq_one_hot</span><span class="p">,</span>  <span class="c1"># shape: [batch_size, char_len, num_chars]
</span>        <span class="n">sequence_lengths</span><span class="p">,</span>  <span class="c1"># shape: [batch_size]
</span>    <span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">tf</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="n">raw</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">attention_kernel</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">attention_bias</span>
        <span class="n">alpha_hat</span><span class="p">,</span> <span class="n">beta_hat</span><span class="p">,</span> <span class="n">kappa_hat</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">raw</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>  <span class="c1"># shape: [batch_size, num_gaussians, 1]
</span>
        <span class="n">eps</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">constant</span><span class="p">(</span><span class="mf">1e-6</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">inputs</span><span class="p">.</span><span class="n">dtype</span><span class="p">)</span>
        <span class="n">scaling</span> <span class="o">=</span> <span class="mf">0.1</span>  <span class="c1"># Gentler activation
</span>        <span class="n">alpha</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">softplus</span><span class="p">(</span><span class="n">alpha_hat</span> <span class="o">*</span> <span class="n">scaling</span><span class="p">)</span> <span class="o">+</span> <span class="n">eps</span>  <span class="c1"># [B, G]
</span>        <span class="n">beta</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">softplus</span><span class="p">(</span><span class="n">beta_hat</span> <span class="o">*</span> <span class="n">scaling</span><span class="p">)</span> <span class="o">+</span> <span class="n">eps</span>  <span class="c1"># [B, G]
</span>        <span class="n">dkap</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">softplus</span><span class="p">(</span><span class="n">kappa_hat</span> <span class="o">*</span> <span class="n">scaling</span><span class="p">)</span> <span class="o">+</span> <span class="n">eps</span>

        <span class="n">alpha</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">clip_by_value</span><span class="p">(</span><span class="n">alpha</span><span class="p">,</span> <span class="mf">0.01</span><span class="p">,</span> <span class="mf">10.0</span><span class="p">)</span>
        <span class="n">beta</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">clip_by_value</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span> <span class="mf">0.01</span><span class="p">,</span> <span class="mf">10.0</span><span class="p">)</span>
        <span class="n">dkap</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">clip_by_value</span><span class="p">(</span><span class="n">dkap</span><span class="p">,</span> <span class="mf">1e-5</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">)</span>

        <span class="n">kappa</span> <span class="o">=</span> <span class="n">prev_kappa</span> <span class="o">+</span> <span class="n">dkap</span>
        <span class="n">kappa</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">clip_by_value</span><span class="p">(</span><span class="n">kappa</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">30.0</span><span class="p">)</span>

        <span class="n">char_len</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">shape</span><span class="p">(</span><span class="n">char_seq_one_hot</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span>
        <span class="n">batch_size</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">shape</span><span class="p">(</span><span class="n">inputs</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
        <span class="n">u</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">cast</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">char_len</span> <span class="o">+</span> <span class="mi">1</span><span class="p">),</span> <span class="n">tf</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
        <span class="n">u</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">u</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">])</span>  <span class="c1"># shape: [1, 1, char_len]
</span>        <span class="n">u</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">tile</span><span class="p">(</span><span class="n">u</span><span class="p">,</span> <span class="p">[</span><span class="n">batch_size</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">num_gaussians</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span>  <span class="c1"># shape: [batch_size, num_gaussians, char_len]
</span>
        <span class="n">alpha</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">alpha</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>  <span class="c1"># shape: [batch_size, num_gaussians, 1]
</span>        <span class="n">beta</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>  <span class="c1"># shape: [batch_size, num_gaussians, 1]
</span>        <span class="n">kappa</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">kappa</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>  <span class="c1"># shape: [batch_size, num_gaussians, 1]
</span>
        <span class="n">exponent</span> <span class="o">=</span> <span class="o">-</span><span class="n">beta</span> <span class="o">*</span> <span class="n">tf</span><span class="p">.</span><span class="n">square</span><span class="p">(</span><span class="n">kappa</span> <span class="o">-</span> <span class="n">u</span><span class="p">)</span>
        <span class="n">exponent</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">clip_by_value</span><span class="p">(</span><span class="n">exponent</span><span class="p">,</span> <span class="o">-</span><span class="mf">50.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">)</span>
        <span class="n">phi</span> <span class="o">=</span> <span class="n">alpha</span> <span class="o">*</span> <span class="n">tf</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">exponent</span><span class="p">)</span>  <span class="c1"># shape: [batch_size, num_gaussians, char_len]
</span>        <span class="n">phi</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">reduce_sum</span><span class="p">(</span><span class="n">phi</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>  <span class="c1"># Sum over gaussians: [B, L]
</span>
        <span class="n">sequence_mask</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">sequence_mask</span><span class="p">(</span><span class="n">sequence_lengths</span><span class="p">,</span> <span class="n">maxlen</span><span class="o">=</span><span class="n">char_len</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">tf</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
        <span class="n">phi</span> <span class="o">=</span> <span class="n">phi</span> <span class="o">*</span> <span class="n">sequence_mask</span>  <span class="c1"># mask paddings
</span>
        <span class="n">phi</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">math</span><span class="p">.</span><span class="n">is_finite</span><span class="p">(</span><span class="n">phi</span><span class="p">),</span> <span class="n">phi</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">phi</span><span class="p">))</span>
        <span class="c1"># we don't normalize here - Graves calls that out specifically!
</span>        <span class="c1"># &gt; Note that the window mixture is not normalised
</span>        <span class="c1"># &gt; and hence does not determine a probability distribution; however the window
</span>        <span class="c1"># &gt; weight φ(t,u) can be loosely interpreted as the network's belief that it is writ-
</span>        <span class="c1"># &gt; ing character cu at time t.
</span>        <span class="c1"># still section 5.1
</span>
        <span class="c1"># window vec
</span>        <span class="n">phi</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">phi</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>  <span class="c1"># shape: [batch_size, char_len, 1]
</span>        <span class="n">w</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">reduce_sum</span><span class="p">(</span><span class="n">phi</span> <span class="o">*</span> <span class="n">char_seq_one_hot</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>  <span class="c1"># shape: [batch_size, num_chars]
</span>
        <span class="n">w</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">math</span><span class="p">.</span><span class="n">is_finite</span><span class="p">(</span><span class="n">w</span><span class="p">),</span> <span class="n">w</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">w</span><span class="p">))</span>
        <span class="k">return</span> <span class="n">w</span><span class="p">,</span> <span class="n">kappa</span><span class="p">[:,</span> <span class="p">:,</span> <span class="mi">0</span><span class="p">]</span></code></pre></figure>

</div>
<div class="code-toggle__pane" data-pane="jax">

<figure class="highlight"><pre><code class="language-python" data-lang="python">    <span class="k">def</span> <span class="nf">compute_attention</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">h</span><span class="p">:</span> <span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">,</span>  <span class="c1"># [B, H]
</span>        <span class="n">prev_kappa</span><span class="p">:</span> <span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">,</span>  <span class="c1"># [B, G]
</span>        <span class="n">window</span><span class="p">:</span> <span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">,</span>  <span class="c1"># [B, A]
</span>        <span class="n">x</span><span class="p">:</span> <span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">,</span>  <span class="c1"># [B, 3]
</span>        <span class="n">char_seq</span><span class="p">:</span> <span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">,</span>  <span class="c1"># [B, U, A] one-hot
</span>        <span class="n">char_lens</span><span class="p">:</span> <span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">,</span>  <span class="c1"># [B] lengths
</span>    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Tuple</span><span class="p">[</span><span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">,</span> <span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">]:</span>
        <span class="s">"""Compute Gaussian window attention over character sequence."""</span>

        <span class="n">attention_input</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">window</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">h</span><span class="p">],</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">params</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">attention_layer</span><span class="p">(</span><span class="n">attention_input</span><span class="p">)</span>  <span class="c1"># [B, 3G]
</span>        <span class="n">params</span> <span class="o">=</span> <span class="n">nnx</span><span class="p">.</span><span class="n">softplus</span><span class="p">(</span><span class="n">params</span><span class="p">)</span>
        <span class="n">alpha</span><span class="p">,</span> <span class="n">beta</span><span class="p">,</span> <span class="n">kappa_inc</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">params</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>

        <span class="c1"># again... probably sliiiiightly overkill
</span>        <span class="n">alpha</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">maximum</span><span class="p">(</span><span class="n">alpha</span><span class="p">,</span> <span class="mf">1e-4</span><span class="p">)</span>
        <span class="n">beta</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">maximum</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span> <span class="mf">1e-4</span><span class="p">)</span>
        <span class="n">kappa_inc</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">maximum</span><span class="p">(</span><span class="n">kappa_inc</span><span class="p">,</span> <span class="mf">1e-4</span><span class="p">)</span>

        <span class="c1"># ok this was a trick from svasquez - the dividing by 25.0
</span>        <span class="c1"># is to help kappa learn given that 25 is roughly the average
</span>        <span class="c1"># number of strokes per sequence
</span>        <span class="n">kappa</span> <span class="o">=</span> <span class="n">prev_kappa</span> <span class="o">+</span> <span class="n">kappa_inc</span> <span class="o">/</span> <span class="mf">25.0</span>

        <span class="n">U</span> <span class="o">=</span> <span class="n">char_seq</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
        <span class="n">positions</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">U</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">jnp</span><span class="p">.</span><span class="n">float32</span><span class="p">)[</span><span class="bp">None</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="p">:]</span>  <span class="c1"># [1, 1, U]
</span>        <span class="n">kappa_exp</span> <span class="o">=</span> <span class="n">kappa</span><span class="p">[:,</span> <span class="p">:,</span> <span class="bp">None</span><span class="p">]</span>  <span class="c1"># [B, G, 1]
</span>        <span class="n">alpha_exp</span> <span class="o">=</span> <span class="n">alpha</span><span class="p">[:,</span> <span class="p">:,</span> <span class="bp">None</span><span class="p">]</span>  <span class="c1"># [B, G, 1]
</span>        <span class="n">beta_exp</span> <span class="o">=</span> <span class="n">beta</span><span class="p">[:,</span> <span class="p">:,</span> <span class="bp">None</span><span class="p">]</span>  <span class="c1"># [B, G, 1]
</span>
        <span class="c1"># gaussian window
</span>        <span class="n">phi</span> <span class="o">=</span> <span class="n">alpha_exp</span> <span class="o">*</span> <span class="n">jnp</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">beta_exp</span> <span class="o">*</span> <span class="p">(</span><span class="n">kappa_exp</span> <span class="o">-</span> <span class="n">positions</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span>  <span class="c1"># [B, G, U]
</span>        <span class="n">phi</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">phi</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

        <span class="c1"># mask out positions beyond char_lens
</span>        <span class="n">mask</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">U</span><span class="p">)[</span><span class="bp">None</span><span class="p">,</span> <span class="p">:]</span> <span class="o">&lt;</span> <span class="n">char_lens</span><span class="p">[:,</span> <span class="bp">None</span><span class="p">]</span>  <span class="c1"># [B, U]
</span>        <span class="n">phi</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">mask</span><span class="p">,</span> <span class="n">phi</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">)</span>

        <span class="c1"># Graves notes the window mixture is not normalised;
</span>        <span class="c1"># we normalise here anyway for numerical stability
</span>        <span class="n">phi</span> <span class="o">=</span> <span class="n">phi</span> <span class="o">/</span> <span class="p">(</span><span class="n">jnp</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">phi</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="o">+</span> <span class="mf">1e-8</span><span class="p">)</span>

        <span class="c1"># Apply to character sequence
</span>        <span class="c1"># window: [B, A] = sum_u phi[b,u]*char_seq[b,u,:]
</span>        <span class="n">window_new</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">einsum</span><span class="p">(</span><span class="s">"bu,bua-&gt;ba"</span><span class="p">,</span> <span class="n">phi</span><span class="p">,</span> <span class="n">char_seq</span><span class="p">)</span>

        <span class="k">return</span> <span class="n">window_new</span><span class="p">,</span> <span class="n">kappa</span></code></pre></figure>

</div>

  </div>
</div>

<!-- prettier-ignore-end -->

<h2 id="stacked-lstm">Stacked LSTM</h2>

<h3 id="theory-5">Theory</h3>

<p>The main distinction between Graves’s setup and a standard LSTM is that Graves uses a <em>cascade</em> of LSTMs. We still use the MDN to generate the probabilistic prediction, but the neural network feeding it is now this cascade of LSTMs.</p>

<p>Per our paper:</p>

<blockquote>
  <p>The LSTM cascade buys us a few different things. As Graves aptly points out, it mitigates the vanishing gradient problem even more greatly than a single LSTM could. This is because of the skip-connections. All hidden layers have access to the input and all hidden layers are also directly connected to the output node. As a result, there are less processing steps from the bottom of the network to the top.</p>
</blockquote>

<p>So it looks something like this:</p>

<p><img src="/images/generative-handwriting/graves_stacked_lstm.png" alt="graves-stacked-lstm" class="center-super-shrink lightbox-image" /></p>

<div class="image-caption">Courtesy of Alex Graves's <a href="https://arxiv.org/abs/1308.0850">paper</a></div>
<p><br /></p>

<p>One thing to note is that the input dimensionality grows as we stack these hidden layers. Tom and I broke this down in our paper here:</p>

<blockquote>
  <p>Let’s observe the $x_{t-1}$ input. $h_{t-1}^1$ only has $x_{t-1}$ as its input which is in $\mathbb{R}^3$ because $(x, y, eos)$. However, we also pass our input $x_{t-1}$ into $h_{t-1}^2$. We assume that we simply concatenate the original input and the output of the first hidden layer. Because LSTMs do not scale dimensionality, we know the output is going to be in $\mathbb{R}^3$ as well. Therefore, after this concatenation, the input into the second hidden layer will be in $\mathbb{R}^6$. We can follow this process through and see that, the input to the third hidden layer will be in $\mathbb{R}^9$. Finally, we concatenate all of the LSTM cells (i.e. the hidden layers) together, thus getting a final dimension of $\mathbb{R}^{18}$ fed into our MDN. Note, this is for $m=3$ hidden layers, but more generally, we can observe the relation as</p>

\[\begin{align} \textrm{final dimension} = k \frac{m(m+1)}{2} \end{align}\]
</blockquote>
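<p>As a sanity check on that bookkeeping, here is a minimal NumPy sketch (not the actual model; it assumes, as in the quote above, that each hidden layer preserves the dimensionality of its input, with <code class="language-plaintext highlighter-rouge">np.tanh</code> standing in for an LSTM layer):</p>

```python
import numpy as np


def cascade_forward(x, layers):
    """Graves-style cascade: every hidden layer sees the raw input x
    (skip connection) concatenated with the previous layer's output,
    and all hidden outputs are concatenated before the output layer."""
    hidden = []
    prev = None
    for layer in layers:
        layer_in = x if prev is None else np.concatenate([x, prev])
        prev = layer(layer_in)  # dimensionality-preserving stand-in for an LSTM
        hidden.append(prev)
    return np.concatenate(hidden)


# (x, y, eos) input in R^3, m = 3 hidden layers:
# layer inputs are R^3, R^6, R^9, so the concatenated output is in R^18
out = cascade_forward(np.zeros(3), [np.tanh] * 3)
print(out.shape)  # -> (18,)
```

The shape matches the relation above: $k \frac{m(m+1)}{2} = 3 \cdot \frac{3 \cdot 4}{2} = 18$.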

<p>My take: I actually like how I constructed the TensorFlow version more from a composability perspective. I think the code is cleaner. However, c’est la vie.</p>

<h3 id="code-5">Code</h3>

<p>This is where TensorFlow’s <code class="language-plaintext highlighter-rouge">cell</code> vs. <code class="language-plaintext highlighter-rouge">layer</code> distinction was very nice.</p>

<p>You can see here how the parts all come together smoothly. The custom RNN cell takes the <code class="language-plaintext highlighter-rouge">lstm_cells</code> (which are stacked) and operates on each individual time step without our having to introduce another explicit <code class="language-plaintext highlighter-rouge">for</code> loop. This is beneficial because it lets TensorFlow batch the per-step computation and run it efficiently on a GPU when the time comes.</p>
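<p>If you haven’t seen the cell-vs-layer pattern before, here’s a stripped-down sketch using stock Keras cells (rather than the custom peephole and attention cells in the actual model below), just to show the split:</p>

```python
import tensorflow as tf

# A "cell" defines the computation for one time step; tf.keras.layers.RNN
# unrolls it over the whole sequence, handling batching for us.
cells = [tf.keras.layers.LSTMCell(8) for _ in range(3)]
stacked_cell = tf.keras.layers.StackedRNNCells(cells)
rnn_layer = tf.keras.layers.RNN(stacked_cell, return_sequences=True)

x = tf.zeros([2, 5, 3])  # [batch, time, (x, y, eos)]
y = rnn_layer(x)
print(y.shape)  # (2, 5, 8) -- the last cell's units, one output per time step
```

Note that <code class="language-plaintext highlighter-rouge">StackedRNNCells</code> here is a plain vertical stack; the actual model below uses a custom cell to get Graves’s skip connections and the attention window.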

<!-- prettier-ignore-start -->

<div class="code-toggle">
  <div class="code-toggle__tabs">
    <button class="code-toggle__tab code-toggle__tab--active" data-tab="tensorflow">TensorFlow</button>
    <button class="code-toggle__tab" data-tab="jax">JAX</button>
  </div>
  <div class="code-toggle__content">
    <div class="code-toggle__pane code-toggle__pane--active" data-pane="tensorflow">

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">@</span><span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">register_keras_serializable</span><span class="p">()</span>
<span class="k">class</span> <span class="nc">DeepHandwritingSynthesisModel</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">Model</span><span class="p">):</span>
    <span class="s">"""
    A similar implementation to the previous model,
    but now we're throwing the good old attention mechanism back into the mix.
    """</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">units</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">NUM_LSTM_CELLS_PER_HIDDEN_LAYER</span><span class="p">,</span>
        <span class="n">num_layers</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">NUM_LSTM_HIDDEN_LAYERS</span><span class="p">,</span>
        <span class="n">num_mixture_components</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">NUM_BIVARIATE_GAUSSIAN_MIXTURE_COMPONENTS</span><span class="p">,</span>
        <span class="n">num_chars</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">ALPHABET_SIZE</span><span class="p">,</span>
        <span class="n">num_attention_gaussians</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">NUM_ATTENTION_GAUSSIAN_COMPONENTS</span><span class="p">,</span>
        <span class="n">gradient_clip_value</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="n">GRADIENT_CLIP_VALUE</span><span class="p">,</span>
        <span class="n">enable_mdn_regularization</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">False</span><span class="p">,</span>
        <span class="n">debug</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
        <span class="o">**</span><span class="n">kwargs</span><span class="p">,</span>
    <span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">(</span><span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">units</span> <span class="o">=</span> <span class="n">units</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">num_layers</span> <span class="o">=</span> <span class="n">num_layers</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">num_mixture_components</span> <span class="o">=</span> <span class="n">num_mixture_components</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">num_chars</span> <span class="o">=</span> <span class="n">num_chars</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">num_attention_gaussians</span> <span class="o">=</span> <span class="n">num_attention_gaussians</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">gradient_clip_value</span> <span class="o">=</span> <span class="n">gradient_clip_value</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">enable_mdn_regularization</span> <span class="o">=</span> <span class="n">enable_mdn_regularization</span>
        <span class="c1"># Store LSTM cells as tracked attributes instead of list
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">lstm_cells</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">idx</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_layers</span><span class="p">):</span>
            <span class="n">cell</span> <span class="o">=</span> <span class="n">LSTMPeepholeCell</span><span class="p">(</span><span class="n">units</span><span class="p">,</span> <span class="n">idx</span><span class="p">)</span>
            <span class="nb">setattr</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="sa">f</span><span class="s">'lstm_cell_</span><span class="si">{</span><span class="n">idx</span><span class="si">}</span><span class="s">'</span><span class="p">,</span> <span class="n">cell</span><span class="p">)</span>  <span class="c1"># Register as tracked attribute
</span>            <span class="bp">self</span><span class="p">.</span><span class="n">lstm_cells</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">cell</span><span class="p">)</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">attention_mechanism</span> <span class="o">=</span> <span class="n">AttentionMechanism</span><span class="p">(</span><span class="n">num_gaussians</span><span class="o">=</span><span class="n">num_attention_gaussians</span><span class="p">,</span> <span class="n">num_chars</span><span class="o">=</span><span class="n">num_chars</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">attention_rnn_cell</span> <span class="o">=</span> <span class="n">AttentionRNNCell</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">lstm_cells</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">attention_mechanism</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">num_chars</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">rnn_layer</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">RNN</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">attention_rnn_cell</span><span class="p">,</span> <span class="n">return_sequences</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">mdn_layer</span> <span class="o">=</span> <span class="n">MixtureDensityLayer</span><span class="p">(</span><span class="n">num_mixture_components</span><span class="p">,</span> <span class="n">enable_regularization</span><span class="o">=</span><span class="n">enable_mdn_regularization</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">debug</span> <span class="o">=</span> <span class="n">debug</span>

        <span class="c1"># metrics
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">loss_tracker</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">metrics</span><span class="p">.</span><span class="n">Mean</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">"loss"</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">nll_tracker</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">metrics</span><span class="p">.</span><span class="n">Mean</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">"nll"</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">eos_accuracy_tracker</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">metrics</span><span class="p">.</span><span class="n">Mean</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">"eos_accuracy"</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">eos_prob_tracker</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">metrics</span><span class="p">.</span><span class="n">Mean</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">"eos_prob"</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">call</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span> <span class="n">inputs</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">Tensor</span><span class="p">],</span> <span class="n">training</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">bool</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span> <span class="n">mask</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">tf</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">tf</span><span class="p">.</span><span class="n">Tensor</span><span class="p">:</span>
        <span class="n">input_strokes</span> <span class="o">=</span> <span class="n">inputs</span><span class="p">[</span><span class="s">"input_strokes"</span><span class="p">]</span>
        <span class="n">input_chars</span> <span class="o">=</span> <span class="n">inputs</span><span class="p">[</span><span class="s">"input_chars"</span><span class="p">]</span>
        <span class="n">input_char_lens</span> <span class="o">=</span> <span class="n">inputs</span><span class="p">[</span><span class="s">"input_char_lens"</span><span class="p">]</span>

        <span class="c1"># one-hot encode the character sequence and set RNN cell attributes
</span>        <span class="n">char_seq_one_hot</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">one_hot</span><span class="p">(</span><span class="n">input_chars</span><span class="p">,</span> <span class="n">depth</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">num_chars</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">attention_rnn_cell</span><span class="p">.</span><span class="n">char_seq_one_hot</span> <span class="o">=</span> <span class="n">char_seq_one_hot</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">attention_rnn_cell</span><span class="p">.</span><span class="n">char_seq_len</span> <span class="o">=</span> <span class="n">input_char_lens</span>

        <span class="c1"># initial states
</span>        <span class="n">batch_size</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">shape</span><span class="p">(</span><span class="n">input_strokes</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
        <span class="n">initial_states</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">attention_rnn_cell</span><span class="p">.</span><span class="n">get_initial_state</span><span class="p">(</span><span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">input_strokes</span><span class="p">.</span><span class="n">dtype</span><span class="p">)</span>
        <span class="n">initial_states_list</span> <span class="o">=</span> <span class="p">[</span>
            <span class="n">initial_states</span><span class="p">[</span><span class="s">"lstm_0_h"</span><span class="p">],</span>
            <span class="n">initial_states</span><span class="p">[</span><span class="s">"lstm_0_c"</span><span class="p">],</span>
            <span class="n">initial_states</span><span class="p">[</span><span class="s">"lstm_1_h"</span><span class="p">],</span>
            <span class="n">initial_states</span><span class="p">[</span><span class="s">"lstm_1_c"</span><span class="p">],</span>
            <span class="n">initial_states</span><span class="p">[</span><span class="s">"lstm_2_h"</span><span class="p">],</span>
            <span class="n">initial_states</span><span class="p">[</span><span class="s">"lstm_2_c"</span><span class="p">],</span>
            <span class="n">initial_states</span><span class="p">[</span><span class="s">"kappa"</span><span class="p">],</span>
            <span class="n">initial_states</span><span class="p">[</span><span class="s">"w"</span><span class="p">],</span>
        <span class="p">]</span>

        <span class="c1"># then through our RNN (which wraps stacked LSTM cells + attention mechanism)
</span>        <span class="c1"># and then through our MDN layer
</span>        <span class="n">outputs</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">rnn_layer</span><span class="p">(</span><span class="n">input_strokes</span><span class="p">,</span> <span class="n">initial_state</span><span class="o">=</span><span class="n">initial_states_list</span><span class="p">,</span> <span class="n">training</span><span class="o">=</span><span class="n">training</span><span class="p">)</span>
        <span class="n">final_output</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">mdn_layer</span><span class="p">(</span><span class="n">outputs</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">final_output</span></code></pre></figure>

</div>
<div class="code-toggle__pane" data-pane="jax">

<figure class="highlight"><pre><code class="language-python" data-lang="python">    <span class="k">def</span> <span class="nf">__call__</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">inputs</span><span class="p">:</span> <span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">,</span>
        <span class="n">char_seq</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
        <span class="n">char_lens</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
        <span class="n">initial_state</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">RNNState</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
        <span class="n">return_state</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">False</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">jnp</span><span class="p">.</span><span class="n">ndarray</span><span class="p">:</span>
        <span class="n">batch_size</span><span class="p">,</span> <span class="n">seq_len</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">inputs</span><span class="p">.</span><span class="n">shape</span>

        <span class="k">if</span> <span class="n">initial_state</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">h</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">num_layers</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">hidden_size</span><span class="p">),</span> <span class="n">inputs</span><span class="p">.</span><span class="n">dtype</span><span class="p">)</span>
            <span class="n">c</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">h</span><span class="p">)</span>
            <span class="n">kappa</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">batch_size</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">num_attention_gaussians</span><span class="p">),</span> <span class="n">inputs</span><span class="p">.</span><span class="n">dtype</span><span class="p">)</span>
            <span class="n">window</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">batch_size</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">alphabet_size</span><span class="p">),</span> <span class="n">inputs</span><span class="p">.</span><span class="n">dtype</span><span class="p">)</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">h</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">initial_state</span><span class="p">.</span><span class="n">hidden</span><span class="p">,</span> <span class="n">initial_state</span><span class="p">.</span><span class="n">cell</span>
            <span class="n">kappa</span><span class="p">,</span> <span class="n">window</span> <span class="o">=</span> <span class="n">initial_state</span><span class="p">.</span><span class="n">kappa</span><span class="p">,</span> <span class="n">initial_state</span><span class="p">.</span><span class="n">window</span>

        <span class="k">def</span> <span class="nf">step</span><span class="p">(</span><span class="n">carry</span><span class="p">,</span> <span class="n">x_t</span><span class="p">):</span>
            <span class="n">h</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">kappa</span><span class="p">,</span> <span class="n">window</span> <span class="o">=</span> <span class="n">carry</span>
            <span class="n">h_layers</span> <span class="o">=</span> <span class="p">[]</span>
            <span class="n">c_layers</span> <span class="o">=</span> <span class="p">[]</span>

            <span class="c1"># layer1
</span>            <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">synthesis_mode</span><span class="p">:</span>
                <span class="n">layer1_input</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">window</span><span class="p">,</span> <span class="n">x_t</span><span class="p">],</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="n">layer1_input</span> <span class="o">=</span> <span class="n">x_t</span>

            <span class="n">h1</span><span class="p">,</span> <span class="n">c1</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">lstm_cell</span><span class="p">(</span><span class="n">layer1_input</span><span class="p">,</span> <span class="n">h</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">c</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">0</span><span class="p">)</span>
            <span class="n">h_layers</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">h1</span><span class="p">)</span>
            <span class="n">c_layers</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">c1</span><span class="p">)</span>

            <span class="c1"># layer1 -&gt; attention
</span>            <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">synthesis_mode</span> <span class="ow">and</span> <span class="n">char_seq</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="ow">and</span> <span class="n">char_lens</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
                <span class="n">window</span><span class="p">,</span> <span class="n">kappa</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">compute_attention</span><span class="p">(</span><span class="n">h1</span><span class="p">,</span> <span class="n">kappa</span><span class="p">,</span> <span class="n">window</span><span class="p">,</span> <span class="n">x_t</span><span class="p">,</span> <span class="n">char_seq</span><span class="p">,</span> <span class="n">char_lens</span><span class="p">)</span>

            <span class="c1"># attention -&gt; layer2 and layer3
</span>            <span class="k">for</span> <span class="n">layer_idx</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">num_layers</span><span class="p">):</span>
                <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">synthesis_mode</span><span class="p">:</span>
                    <span class="n">layer_input</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">x_t</span><span class="p">,</span> <span class="n">h_layers</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">window</span><span class="p">],</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
                <span class="k">else</span><span class="p">:</span>
                    <span class="n">layer_input</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">x_t</span><span class="p">,</span> <span class="n">h_layers</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]],</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>

                <span class="n">h_new</span><span class="p">,</span> <span class="n">c_new</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">lstm_cell</span><span class="p">(</span><span class="n">layer_input</span><span class="p">,</span> <span class="n">h</span><span class="p">[</span><span class="n">layer_idx</span><span class="p">],</span> <span class="n">c</span><span class="p">[</span><span class="n">layer_idx</span><span class="p">],</span> <span class="n">layer_idx</span><span class="p">)</span>
                <span class="n">h_layers</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">h_new</span><span class="p">)</span>
                <span class="n">c_layers</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">c_new</span><span class="p">)</span>

            <span class="n">h_new</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">stack</span><span class="p">(</span><span class="n">h_layers</span><span class="p">)</span>
            <span class="n">c_new</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">stack</span><span class="p">(</span><span class="n">c_layers</span><span class="p">)</span>

            <span class="c1"># mdn output from final hidden state
</span>            <span class="n">mdn_out</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">mdn_layer</span><span class="p">(</span><span class="n">h_layers</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>  <span class="c1"># [B, 6M+1]
</span>
            <span class="k">return</span> <span class="p">(</span><span class="n">h_new</span><span class="p">,</span> <span class="n">c_new</span><span class="p">,</span> <span class="n">kappa</span><span class="p">,</span> <span class="n">window</span><span class="p">),</span> <span class="n">mdn_out</span>

        <span class="c1"># this was the major unlock for JAX performance
</span>        <span class="c1"># jax.lax.scan compiles the time loop into one XLA op instead of retracing each step
</span>        <span class="c1"># transpose inputs from [B, T, 3] to [T, B, 3] for scan
</span>        <span class="n">inputs_transposed</span> <span class="o">=</span> <span class="n">inputs</span><span class="p">.</span><span class="n">swapaxes</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
        <span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">kappa</span><span class="p">,</span> <span class="n">window</span><span class="p">),</span> <span class="n">outputs</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">lax</span><span class="p">.</span><span class="n">scan</span><span class="p">(</span><span class="n">step</span><span class="p">,</span> <span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">kappa</span><span class="p">,</span> <span class="n">window</span><span class="p">),</span> <span class="n">inputs_transposed</span><span class="p">)</span>

        <span class="c1"># transpose back
</span>        <span class="n">outputs</span> <span class="o">=</span> <span class="n">outputs</span><span class="p">.</span><span class="n">swapaxes</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>

        <span class="k">if</span> <span class="n">return_state</span><span class="p">:</span>
            <span class="n">final_state</span> <span class="o">=</span> <span class="n">RNNState</span><span class="p">(</span><span class="n">hidden</span><span class="o">=</span><span class="n">h</span><span class="p">,</span> <span class="n">cell</span><span class="o">=</span><span class="n">c</span><span class="p">,</span> <span class="n">kappa</span><span class="o">=</span><span class="n">kappa</span><span class="p">,</span> <span class="n">window</span><span class="o">=</span><span class="n">window</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">outputs</span><span class="p">,</span> <span class="n">final_state</span>

        <span class="k">return</span> <span class="n">outputs</span></code></pre></figure>

</div>

  </div>
</div>

<!-- prettier-ignore-end -->

<h2 id="final-result">Final Result</h2>

<p>Alright finally! So what do we have, and what can we do now?</p>

<p>We now feed the output of our LSTM cascade into the GMM to build a probabilistic prediction model for the next stroke point. During training, the GMM is also given the actual next point, so that the deviation can be measured and the loss properly minimized.</p>
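<p>To make that concrete, here is a minimal NumPy sketch (illustrative, not the actual training code) of the per-timestep negative log-likelihood for a bivariate Gaussian mixture plus an end-of-stroke Bernoulli term, following Graves’ formulation:</p>

```python
import numpy as np

def mdn_stroke_nll(target, pi, mu, sigma, rho, eos_p, eos_target, eps=1e-8):
    """Negative log-likelihood of one (dx, dy, eos) step under the MDN.

    target: (2,) actual next-point offset. pi: (M,) mixture weights summing to 1.
    mu: (M, 2) means. sigma: (M, 2) positive std devs. rho: (M,) correlations.
    eos_p: scalar pen-lift probability. eos_target: 0 or 1.
    """
    dx = target[0] - mu[:, 0]
    dy = target[1] - mu[:, 1]
    s1, s2 = sigma[:, 0], sigma[:, 1]
    q = 1.0 - rho ** 2
    # quadratic form for a correlated bivariate Gaussian
    z = (dx / s1) ** 2 + (dy / s2) ** 2 - 2.0 * rho * dx * dy / (s1 * s2)
    density = np.exp(-z / (2.0 * q)) / (2.0 * np.pi * s1 * s2 * np.sqrt(q))
    point_ll = np.log(np.sum(pi * density) + eps)
    eos_ll = np.log((eos_p if eos_target else 1.0 - eos_p) + eps)
    return -(point_ll + eos_ll)
```

<p>A prediction whose mean sits on the actual next point scores a much lower loss than one far away, which is exactly the gradient signal that shapes the mixture during training.</p>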

<h1 id="️-training-results">🏋️ Training Results</h1>

<h2 id="vast-ai-gpu-enabled-execution">Vast AI GPU Enabled Execution</h2>

<details>
  <summary style="background-color: #d4edda; padding: 10px; border-radius: 5px; cursor: pointer; color: #155724; font-weight: bold;">
    Vast AI GPU Enabled Running
  </summary>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>

2024-04-21 19:01:02.183969: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
train data found. Loading...
test data found. Loading...
valid2 data found. Loading...
valid1 data found. Loading...
2024-04-21 19:01:04.798925: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22455 MB memory: -&gt; device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:82:00.0, compute capability: 8.6
2024-04-21 19:01:05.887036: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
2024-04-21 19:01:05.887070: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
2024-04-21 19:01:05.887164: I external/local_xla/xla/backends/profiler/gpu/cupti_tracer.cc:1239] Profiler found 1 GPUs
2024-04-21 19:01:05.917572: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
2024-04-21 19:01:05.917763: I external/local_xla/xla/backends/profiler/gpu/cupti_tracer.cc:1364] CUPTI activity buffer flushed
Epoch 1/10000
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1713726072.109654 2329 service.cc:145] XLA service 0x7ad5bc004600 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1713726072.109731 2329 service.cc:153] StreamExecutor device (0): NVIDIA GeForce RTX 3090, Compute Capability 8.6
2024-04-21 19:01:12.346749: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
W0000 00:00:1713726072.691839 2329 assert_op.cc:38] Ignoring Assert operator assert_greater/Assert/AssertGuard/Assert
W0000 00:00:1713726072.694098 2329 assert_op.cc:38] Ignoring Assert operator assert_greater_1/Assert/AssertGuard/Assert
W0000 00:00:1713726072.696267 2329 assert_op.cc:38] Ignoring Assert operator assert_near/Assert/AssertGuard/Assert
2024-04-21 19:01:13.095183: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:465] Loaded cuDNN version 8906
2024-04-21 19:01:14.883021: W external/local_xla/xla/service/hlo_rematerialization.cc:2941] Can't reduce memory use below 17.97GiB (19297974672 bytes) by rematerialization; only reduced to 20.51GiB (22027581828 bytes), down from 20.67GiB (22193496744 bytes) originally
I0000 00:00:1713726076.329853 2329 device_compiler.h:188] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
167/168 ━━━━━━━━━━━━━━━━━━━━ 0s 685ms/step - loss: 2.8113W0000 00:00:1713726191.427557 2333 assert_op.cc:38] Ignoring Assert operator assert_greater/Assert/AssertGuard/Assert
W0000 00:00:1713726191.429182 2333 assert_op.cc:38] Ignoring Assert operator assert_greater_1/Assert/AssertGuard/Assert
W0000 00:00:1713726191.430622 2333 assert_op.cc:38] Ignoring Assert operator assert_near/Assert/AssertGuard/Assert
2024-04-21 19:03:13.488256: W external/local_xla/xla/service/hlo_rematerialization.cc:2941] Can't reduce memory use below 17.97GiB (19298282069 bytes) by rematerialization; only reduced to 19.75GiB (21203023676 bytes), down from 19.87GiB (21340423652 bytes) originally
168/168 ━━━━━━━━━━━━━━━━━━━━ 0s 709ms/step - loss: 2.8097
Epoch 1: Saving model.

Epoch 1: Loss improved from None to 0.0, saving model.
Model parameters after the 1st epoch:
Model: "deep_handwriting_synthesis_model"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type) ┃ Output Shape ┃ Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ lstm_peephole_cell │ ? │ 764,400 │
│ (LSTMPeepholeCell) │ │ │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ lstm_peephole_cell_1 │ ? │ 1,404,400 │
│ (LSTMPeepholeCell) │ │ │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ lstm_peephole_cell_2 │ ? │ 1,404,400 │
│ (LSTMPeepholeCell) │ │ │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ attention (AttentionMechanism) │ ? │ 14,310 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ attention_rnn_cell │ ? │ 3,587,510 │
│ (AttentionRNNCell) │ │ │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ rnn (RNN) │ ? │ 3,587,510 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ mdn (MixtureDensityLayer) │ ? │ 48,521 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
Total params: 7,272,064 (27.74 MB)
Trainable params: 3,636,031 (13.87 MB)
Non-trainable params: 0 (0.00 B)
Optimizer params: 3,636,033 (13.87 MB)

All parameters:

[[lstm_peephole_kernel1]] shape: (76, 1600)
[[lstm_peephole_recurrent_kernel1]] shape: (400, 1600)
[[lstm_peephole_weights1]] shape: (400, 3)
[[lstm_peephole_bias1]] shape: (1600,)
[[lstm_peephole_kernel2]] shape: (476, 1600)
[[lstm_peephole_recurrent_kernel2]] shape: (400, 1600)
[[lstm_peephole_weights2]] shape: (400, 3)
[[lstm_peephole_bias2]] shape: (1600,)
[[lstm_peephole_kernel3]] shape: (476, 1600)
[[lstm_peephole_recurrent_kernel3]] shape: (400, 1600)
[[lstm_peephole_weights3]] shape: (400, 3)
[[lstm_peephole_bias3]] shape: (1600,)
[[kernel]] shape: (476, 30)
[[bias]] shape: (30,)
[[mdn_W_pi]] shape: (400, 20)
[[mdn_W_mu]] shape: (400, 40)
[[mdn_W_sigma]] shape: (400, 40)
[[mdn_W_rho]] shape: (400, 20)
[[mdn_W_eos]] shape: (400, 1)
[[mdn_b_pi]] shape: (20,)
[[mdn_b_mu]] shape: (40,)
[[mdn_b_sigma]] shape: (40,)
[[mdn_b_rho]] shape: (20,)
[[mdn_b_eos]] shape: (1,)

Trainable parameters:

(same here)

Trainable parameter count:

3636031
168/168 ━━━━━━━━━━━━━━━━━━━━ 133s 728ms/step - loss: 2.7931
Epoch 2/10000
60/168 ━━━━━━━━━━━━━━━━━━━━ 1:14 686ms/step - loss: 2.4870

</code></pre></div></div>

</details>

<p><br /></p>
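<p>One quick consistency check on that parameter dump: for a peephole LSTM cell, every shape follows mechanically from the input dimension and the number of units, so the printout can be verified with a few lines (a sketch with illustrative names, not part of the actual codebase):</p>

```python
def expected_peephole_lstm_shapes(input_dim, units):
    """Expected parameter shapes for one peephole LSTM cell (4 gates, 3 peepholes)."""
    return {
        "kernel": (input_dim, 4 * units),
        "recurrent_kernel": (units, 4 * units),
        "peephole_weights": (units, 3),
        "bias": (4 * units,),
    }

# First cell reports 76 input features; later cells see 476 (x_t + h + window)
print(expected_peephole_lstm_shapes(76, 400)["kernel"])   # (76, 1600)
print(expected_peephole_lstm_shapes(476, 400)["kernel"])  # (476, 1600)
```

<p>Both match the kernel, recurrent kernel, peephole, and bias shapes in the log above.</p>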

<p>Ok, so that’s all well and good, and some fun math and neural-network construction, but the meat of this project is what we’re actually building with this theory. So let’s lay out our to-do list.</p>

<h3 id="problem-1---gradient-explosion-problem">Problem #1 - Gradient Explosion Problem</h3>

<p>Somehow, on my first run-through of this, I was still getting exploding gradients in the later stages of training my model.</p>

<p>As a result, I chose the laborious and time-consuming route of running training on CPU so that I could print out debugging information, and then ran <code class="language-plaintext highlighter-rouge">tensorboard</code>’s Debugger tool to inspect which gradients were exploding to <code class="language-plaintext highlighter-rouge">nan</code> or the dreaded <code class="language-plaintext highlighter-rouge">inf</code>.</p>

<p>Here’s an example of what that looked like:</p>

<p><img src="/images/generative-handwriting/tensorboard-debugging.png" alt="tensorboard" class="center-shrink lightbox-image" /></p>

<p>This was made even more annoying by this TensorFlow issue: <a href="https://github.com/tensorflow/tensorflow/issues/59215">https://github.com/tensorflow/tensorflow/issues/59215</a>.</p>
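<p>Before spinning up the full TensorBoard workflow, a cheaper first pass is to scan each gradient tensor for bad values every few steps. A minimal sketch, assuming you can get the gradients as NumPy arrays keyed by variable name (e.g. via <code class="language-plaintext highlighter-rouge">tape.gradient</code> and <code class="language-plaintext highlighter-rouge">.numpy()</code> in eager mode):</p>

```python
import numpy as np

def find_bad_gradients(grads):
    """Return {name: (nan_count, inf_count)} for gradients containing nan or inf."""
    bad = {}
    for name, g in grads.items():
        nans, infs = int(np.isnan(g).sum()), int(np.isinf(g).sum())
        if nans or infs:
            bad[name] = (nans, infs)
    return bad

grads = {"ok": np.ones((2, 2)), "blown": np.array([1.0, np.inf, np.nan])}
print(find_bad_gradients(grads))  # {'blown': (1, 1)}
```

<p>The first variable that starts reporting bad values narrows down where the explosion originates, before you reach for the Debugger UI.</p>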

<h3 id="problem-2---oom-galore">Problem #2 - OOM Galore</h3>

<p>Uh oh, looks like the <code class="language-plaintext highlighter-rouge">vast.ai</code> instance I used didn’t have enough GPU memory. Here is an example of one of the errors I ran into:</p>

<details>
  <summary style="background-color: #f8d7da; padding: 10px; border-radius: 5px; cursor: pointer; color: #721c24; font-weight: bold;">
    Out of memory error here
  </summary>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Out of memory while trying to allocate 22271409880 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
             parameter allocation:   29.54MiB
              constant allocation:         4B
        maybe_live_out allocation:   27.74MiB
     preallocated temp allocation:   20.74GiB
                 total allocation:   20.77GiB
Peak buffers:
        Buffer 1:
                Size: 3.40GiB
                Operator: op_type="EmptyTensorList" op_name="gradient_tape/deep_handwriting_synthesis_model_1/rnn_1/while/deep_handwriting_synthesis_model_1/rnn_1/while/attention_rnn_cell_1/lstm_peephole_cell_2_1/MatMul/ReadVariableOp_0/accumulator" source_file="/root/code/venv/lib/python3.11/site-packages/tensorflow/python/framework/ops.py" source_line=1177
                XLA Label: fusion
                Shape: f32[1200,476,1600]
                ==========================

        Buffer 2:
                Size: 3.40GiB
                Operator: op_type="EmptyTensorList" op_name="gradient_tape/deep_handwriting_synthesis_model_1/rnn_1/while/deep_handwriting_synthesis_model_1/rnn_1/while/attention_rnn_cell_1/lstm_peephole_cell_2_1/MatMul/ReadVariableOp_0/accumulator" source_file="/root/code/venv/lib/python3.11/site-packages/tensorflow/python/framework/ops.py" source_line=1177
                XLA Label: fusion
                Shape: f32[1200,476,1600]
                ==========================

        Buffer 3:
                Size: 2.86GiB
                Operator: op_type="EmptyTensorList" op_name="gradient_tape/deep_handwriting_synthesis_model_1/rnn_1/while/deep_handwriting_synthesis_model_1/rnn_1/while/attention_rnn_cell_1/lstm_peephole_cell_2_1/MatMul_1/ReadVariableOp_0/accumulator" source_file="/root/code/venv/lib/python3.11/site-packages/tensorflow/python/framework/ops.py" source_line=1177
                XLA Label: fusion
                Shape: f32[1200,400,1600]
                ==========================

        Buffer 4:
                Size: 2.86GiB
                Operator: op_type="EmptyTensorList" op_name="gradient_tape/deep_handwriting_synthesis_model_1/rnn_1/while/deep_handwriting_synthesis_model_1/rnn_1/while/attention_rnn_cell_1/lstm_peephole_cell_2_1/MatMul_1/ReadVariableOp_0/accumulator" source_file="/root/code/venv/lib/python3.11/site-packages/tensorflow/python/framework/ops.py" source_line=1177
                XLA Label: fusion
                Shape: f32[1200,400,1600]
                ==========================

        Buffer 5:
                Size: 2.86GiB
                Operator: op_type="EmptyTensorList" op_name="gradient_tape/deep_handwriting_synthesis_model_1/rnn_1/while/deep_handwriting_synthesis_model_1/rnn_1/while/attention_rnn_cell_1/lstm_peephole_cell_2_1/MatMul_1/ReadVariableOp_0/accumulator" source_file="/root/code/venv/lib/python3.11/site-packages/tensorflow/python/framework/ops.py" source_line=1177
                XLA Label: fusion
                Shape: f32[1200,400,1600]
                ==========================

        Buffer 6:
                Size: 556.64MiB
                Operator: op_type="EmptyTensorList" op_name="gradient_tape/deep_handwriting_synthesis_model_1/rnn_1/while/deep_handwriting_synthesis_model_1/rnn_1/while/attention_rnn_cell_1/lstm_peephole_cell_1/MatMul/ReadVariableOp_0/accumulator" source_file="/root/code/venv/lib/python3.11/site-packages/tensorflow/python/framework/ops.py" source_line=1177
                XLA Label: fusion
                Shape: f32[1200,76,1600]
                ==========================

        Buffer 7:
                Size: 219.73MiB
                Operator: op_type="While" op_name="deep_handwriting_synthesis_model_1/rnn_1/while" source_file="/root/code/venv/lib/python3.11/site-packages/tensorflow/python/framework/ops.py" source_line=1177
                XLA Label: fusion
                Shape: f32[1200,64,10,75]
                ==========================

        Buffer 8:
                Size: 219.73MiB
                Operator: op_type="While" op_name="deep_handwriting_synthesis_model_1/rnn_1/while" source_file="/root/code/venv/lib/python3.11/site-packages/tensorflow/python/framework/ops.py" source_line=1177
                XLA Label: fusion
                Shape: f32[1200,64,10,75]
                ==========================

        Buffer 9:
                Size: 219.73MiB
                Operator: op_type="While" op_name="deep_handwriting_synthesis_model_1/rnn_1/while" source_file="/root/code/venv/lib/python3.11/site-packages/tensorflow/python/framework/ops.py" source_line=1177
                XLA Label: fusion
                Shape: f32[1200,64,10,75]
                ==========================

        Buffer 10:
                Size: 139.45MiB
                Operator: op_type="While" op_name="deep_handwriting_synthesis_model_1/rnn_1/while" source_file="/root/code/venv/lib/python3.11/site-packages/tensorflow/python/framework/ops.py" source_line=1177
                XLA Label: fusion
                Shape: f32[1200,64,476]
                ==========================

        Buffer 11:
                Size: 139.45MiB
                Operator: op_type="While" op_name="deep_handwriting_synthesis_model_1/rnn_1/while" source_file="/root/code/venv/lib/python3.11/site-packages/tensorflow/python/framework/ops.py" source_line=1177
                XLA Label: fusion
                Shape: f32[1200,64,476]
                ==========================

        Buffer 12:
                Size: 139.45MiB
                Operator: op_type="While" op_name="deep_handwriting_synthesis_model_1/rnn_1/while" source_file="/root/code/venv/lib/python3.11/site-packages/tensorflow/python/framework/ops.py" source_line=1177
                XLA Label: fusion
                Shape: f32[1200,64,476]
                ==========================

        Buffer 13:
                Size: 117.19MiB
                Operator: op_type="While" op_name="deep_handwriting_synthesis_model_1/rnn_1/while" source_file="/root/code/venv/lib/python3.11/site-packages/tensorflow/python/framework/ops.py" source_line=1177
                XLA Label: fusion
                Shape: f32[1200,64,400]
                ==========================

        Buffer 14:
                Size: 117.19MiB
                Operator: op_type="While" op_name="deep_handwriting_synthesis_model_1/rnn_1/while" source_file="/root/code/venv/lib/python3.11/site-packages/tensorflow/python/framework/ops.py" source_line=1177
                XLA Label: fusion
                Shape: f32[1200,64,400]
                ==========================

        Buffer 15:
                Size: 117.19MiB
                Operator: op_type="While" op_name="deep_handwriting_synthesis_model_1/rnn_1/while" source_file="/root/code/venv/lib/python3.11/site-packages/tensorflow/python/framework/ops.py" source_line=1177
                XLA Label: fusion
                Shape: f32[1200,64,400]
                ==========================


         [[]]

</code></pre></div></div>

</details>

<p><br /></p>
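<p>The peak buffers in that dump are backprop accumulators shaped <code class="language-plaintext highlighter-rouge">[seq_len, input_dim, 4 * units]</code>: one copy of each kernel gradient per unrolled timestep. A quick back-of-envelope sketch reproduces the reported sizes and makes clear that sequence length and hidden size are the levers to pull:</p>

```python
def bptt_accumulator_gib(seq_len, input_dim, units, dtype_bytes=4):
    """GiB held by one [seq_len, input_dim, 4 * units] float gradient accumulator."""
    return seq_len * input_dim * 4 * units * dtype_bytes / 2 ** 30

# Buffer 1 from the dump: f32[1200, 476, 1600]
print(round(bptt_accumulator_gib(1200, 476, 400), 2))  # 3.4
# Buffer 3, the recurrent-kernel accumulator: f32[1200, 400, 1600]
print(round(bptt_accumulator_gib(1200, 400, 400), 2))  # 2.86
```

<p>Note that the batch size does not even appear in these dominant buffers, so shrinking the batch alone would not have fixed this particular OOM; truncating backprop through time halves every one of these accumulators when the sequence length is halved.</p>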

<h3 id="sanity-check---validating-model-dimensions-with-ai-so-somewhat">Sanity Check - Validating Model Dimensions (with AI… so somewhat)</h3>

<p>So where does using AI come in? I wanted to validate that the shapes of my training parameters all looked right. Again, I print out the shapes on the very first epoch to get some more detail and do a quick dimensionality-alignment check. I then fed that into ChatGPT along with the paper and asked it to double-check all my params. Here was its output:</p>

<details>
  <summary style="background-color: #d4edda; padding: 10px; border-radius: 5px; cursor: pointer; color: #155724; font-weight: bold;">
    Model Summary
  </summary>
<blockquote><h3 id="model-summary">Model Summary</h3><p><strong>Model Name:</strong> deep_handwriting_synthesis_model<br /> <strong>Total Parameters:</strong> 7,272,063 (27.74 MB)<br /> <strong>Trainable Parameters:</strong> 3,636,031 (13.87 MB)<br /> <strong>Non-Trainable Parameters:</strong> 0 (0.00 B)<br /> <strong>Optimizer Parameters:</strong> 3,636,032 (13.87 MB)</p><hr /><h3 id="layer-wise-breakdown">Layer-wise Breakdown</h3><h4 id="1-lstm-peephole-cells">1. LSTM Peephole Cells</h4><p>You have three LSTM peephole cells:</p><ul><li><code class="language-plaintext highlighter-rouge">lstm_peephole_cell</code></li><li><code class="language-plaintext highlighter-rouge">lstm_peephole_cell_1</code></li><li><code class="language-plaintext highlighter-rouge">lstm_peephole_cell_2</code></li></ul><h5 id="parameters-and-shapes">Parameters and Shapes</h5><p>For each LSTM peephole cell:</p><ul><li><strong>Kernel Shape:</strong> <code class="language-plaintext highlighter-rouge">(input_dim, 4 * units)</code></li><li><strong>Recurrent Kernel Shape:</strong> <code class="language-plaintext highlighter-rouge">(units, 4 * units)</code></li><li><strong>Peephole Weights Shape:</strong> <code class="language-plaintext highlighter-rouge">(units, 3)</code></li><li><strong>Bias Shape:</strong> <code class="language-plaintext highlighter-rouge">(4 * units,)</code></li></ul><h5 id="shapes">Shapes:</h5><p><strong>First LSTM Peephole Cell (lstm_peephole_cell):</strong></p><ul><li><strong>Input Dimension:</strong> 76</li><li><strong>Units:</strong> 400</li><li><strong>Kernel Shape:</strong> <code class="language-plaintext highlighter-rouge">(76, 1600)</code> &nbsp;&nbsp;&nbsp;(76, 4 * 400)</li><li><strong>Recurrent Kernel Shape:</strong> <code class="language-plaintext highlighter-rouge">(400, 1600)</code> &nbsp;&nbsp;&nbsp;(400, 4 * 400)</li><li><strong>Peephole Weights Shape:</strong> <code class="language-plaintext highlighter-rouge">(400, 3)</code> &nbsp;&nbsp;&nbsp;(400 units, 3 
gates)</li><li><strong>Bias Shape:</strong> <code class="language-plaintext highlighter-rouge">(1600,)</code> &nbsp;&nbsp;&nbsp;(4 * 400,)</li></ul><p><strong>Second and Third LSTM Peephole Cells (lstm_peephole_cell_1 and lstm_peephole_cell_2):</strong></p><ul><li><strong>Input Dimension:</strong> 476<br /> This includes concatenated inputs from previous layers and attention outputs.</li><li><strong>Units:</strong> 400</li><li><strong>Kernel Shape:</strong> <code class="language-plaintext highlighter-rouge">(476, 1600)</code> &nbsp;&nbsp;&nbsp;(476, 4 * 400)</li><li><strong>Recurrent Kernel Shape:</strong> <code class="language-plaintext highlighter-rouge">(400, 1600)</code> &nbsp;&nbsp;&nbsp;(400, 4 * 400)</li><li><strong>Peephole Weights Shape:</strong> <code class="language-plaintext highlighter-rouge">(400, 3)</code> &nbsp;&nbsp;&nbsp;(400 units, 3 gates)</li><li><strong>Bias Shape:</strong> <code class="language-plaintext highlighter-rouge">(1600,)</code> &nbsp;&nbsp;&nbsp;(4 * 400,)</li></ul><h5 id="conclusion">Conclusion</h5><p>The shapes of the LSTM peephole cells’ parameters are correct and align with standard LSTM architectures with peephole connections.</p><hr /><h4 id="2-attention-mechanism">2. 
Attention Mechanism</h4><ul><li><strong>Layer:</strong> <code class="language-plaintext highlighter-rouge">attention</code> (AttentionMechanism)</li></ul><h5 id="parameters-and-shapes-1">Parameters and Shapes</h5><ul><li><strong>Kernel Shape:</strong> <code class="language-plaintext highlighter-rouge">(476, 30)</code></li><li><strong>Bias Shape:</strong> <code class="language-plaintext highlighter-rouge">(30,)</code></li></ul><p><strong>Explanation:</strong></p><ul><li><strong>Input Dimension:</strong> 476<br /> This likely comes from concatenating the window vector <code class="language-plaintext highlighter-rouge">w</code>, the current input <code class="language-plaintext highlighter-rouge">x_t</code>, and the output from the first LSTM layer <code class="language-plaintext highlighter-rouge">s1_out</code>.</li><li><strong>Output Dimension:</strong> 30<br /> Corresponds to parameters needed for the attention mechanism, such as the attention weights.</li></ul><h5 id="conclusion-1">Conclusion</h5><p>The attention mechanism’s parameter shapes are appropriate.</p><hr /><h4 id="3-mixture-density-network-mdn-layer">3. 
Mixture Density Network (MDN) Layer</h4><ul><li><strong>Layer:</strong> <code class="language-plaintext highlighter-rouge">mdn</code> (MixtureDensityLayer)</li></ul><h5 id="parameters-and-shapes-2">Parameters and Shapes</h5><ul><li><strong>mdn_W_pi:</strong> <code class="language-plaintext highlighter-rouge">(400, 20)</code></li><li><strong>mdn_b_pi:</strong> <code class="language-plaintext highlighter-rouge">(20,)</code></li><li><strong>mdn_W_mu:</strong> <code class="language-plaintext highlighter-rouge">(400, 40)</code></li><li><strong>mdn_b_mu:</strong> <code class="language-plaintext highlighter-rouge">(40,)</code></li><li><strong>mdn_W_sigma:</strong> <code class="language-plaintext highlighter-rouge">(400, 40)</code></li><li><strong>mdn_b_sigma:</strong> <code class="language-plaintext highlighter-rouge">(40,)</code></li><li><strong>mdn_W_rho:</strong> <code class="language-plaintext highlighter-rouge">(400, 20)</code></li><li><strong>mdn_b_rho:</strong> <code class="language-plaintext highlighter-rouge">(20,)</code></li><li><strong>mdn_W_eos:</strong> <code class="language-plaintext highlighter-rouge">(400, 1)</code></li><li><strong>mdn_b_eos:</strong> <code class="language-plaintext highlighter-rouge">(1,)</code></li></ul><p><strong>Explanation:</strong></p><ul><li><strong>Hidden Units from Last LSTM Layer:</strong> 400</li><li><strong>Number of Mixture Components:</strong> 20</li></ul><h5 id="mdn-outputs">MDN Outputs:</h5><ul><li><strong>pi</strong> (mixture weights): 20 components</li><li><strong>mu</strong> (means): 2 coordinates per component * 20 components = 40 outputs</li><li><strong>sigma</strong> (standard deviations): 2 per component * 20 components = 40 outputs</li><li><strong>rho</strong> (correlation coefficients): 20 components</li><li><strong>eos</strong> (end-of-stroke probability): 1 output</li></ul><h5 id="parameter-shapes">Parameter Shapes:</h5><ul><li><strong>Weights:</strong></li><li><code class="language-plaintext 
highlighter-rouge">mdn_W_pi</code>: <code class="language-plaintext highlighter-rouge">(400, 20)</code></li><li><code class="language-plaintext highlighter-rouge">mdn_W_mu</code>: <code class="language-plaintext highlighter-rouge">(400, 40)</code></li><li><code class="language-plaintext highlighter-rouge">mdn_W_sigma</code>: <code class="language-plaintext highlighter-rouge">(400, 40)</code></li><li><code class="language-plaintext highlighter-rouge">mdn_W_rho</code>: <code class="language-plaintext highlighter-rouge">(400, 20)</code></li><li><code class="language-plaintext highlighter-rouge">mdn_W_eos</code>: <code class="language-plaintext highlighter-rouge">(400, 1)</code></li><li><strong>Biases:</strong></li><li><code class="language-plaintext highlighter-rouge">mdn_b_pi</code>: <code class="language-plaintext highlighter-rouge">(20,)</code></li><li><code class="language-plaintext highlighter-rouge">mdn_b_mu</code>: <code class="language-plaintext highlighter-rouge">(40,)</code></li><li><code class="language-plaintext highlighter-rouge">mdn_b_sigma</code>: <code class="language-plaintext highlighter-rouge">(40,)</code></li><li><code class="language-plaintext highlighter-rouge">mdn_b_rho</code>: <code class="language-plaintext highlighter-rouge">(20,)</code></li><li><code class="language-plaintext highlighter-rouge">mdn_b_eos</code>: <code class="language-plaintext highlighter-rouge">(1,)</code></li></ul><h5 id="conclusion-2">Conclusion</h5><p>The MDN layer’s parameter shapes correctly correspond to the required outputs for a mixture of 20 bivariate Gaussian components.</p><hr /><h4 id="4-total-parameters-and-trainable-parameters">4. 
Total Parameters and Trainable Parameters</h4><ul><li><strong>Total Parameters:</strong> 7,272,063</li><li><strong>Trainable Parameters:</strong> 3,636,031</li></ul><p><strong>Observation:</strong></p><ul><li>The total parameters are approximately double the trainable parameters.</li><li>This may be due to counting both the weights and optimizer states (like moments in the Adam optimizer).</li><li><strong>Optimizer Parameters:</strong> 3,636,032 (slightly off by one, which could be due to rounding or reporting).</li></ul><h5 id="conclusion-3">Conclusion</h5><p>The parameter counts make sense, and the discrepancy is acceptable.</p><hr /><h3 id="your-models-parameters">Your Model’s Parameters</h3><h4 id="all-parameters">All Parameters</h4><p>All parameters listed are correctly shaped and correspond to their respective layers.</p><h4 id="trainable-parameters">Trainable Parameters</h4><p>All trainable parameters match the parameters in the model and have the correct shapes.</p><hr /><h3 id="answering-your-question">Answering Your Question</h3><p>Yes, based on the model summary and the parameter shapes you’ve provided, everything appears to be in order. 
The shapes align with the expected dimensions given your model’s architecture:</p><ul><li><strong>LSTM Peephole Cells:</strong></li><li>Input dimensions and units are consistent.</li><li>Kernels and recurrent kernels have correct shapes.</li><li><p>Peephole weights and biases are appropriately sized.</p></li><li><strong>Attention Mechanism:</strong></li><li><p>The input and output dimensions match the concatenated inputs and expected output size.</p></li><li><strong>MDN Layer:</strong></li><li><p>The weights and biases correspond correctly to the number of mixture components and the outputs required for the mixture density network.</p></li><li><strong>Parameter Counts:</strong></li><li>The total number of parameters is reasonable for a model of this complexity.</li><li>The trainable parameters are correctly identified.</li></ul><hr /><h3 id="final-thoughts">Final Thoughts</h3><ul><li><p><strong>Proceed with Training:</strong> You can confidently continue running your training code.</p></li><li><strong>Monitoring:</strong> Keep an eye on training metrics such as loss, especially since you have confirmed that negative loss values are acceptable in your MDN context.</li><li><strong>Validation:</strong> Consider using a validation set to monitor overfitting and generalization performance.</li><li><strong>Performance Checks:</strong> If you encounter any unexpected behavior during training, revisit the data preprocessing steps and ensure that the inputs to your model are correctly shaped and normalized.</li></ul><hr /><h3 id="additional-recommendations">Additional Recommendations</h3><ul><li><strong>Logging:</strong> Utilize TensorBoard or similar tools to visualize training progress.</li><li><strong>Batch Size and Sequence Length:</strong> Ensure that your batch sizes and sequence lengths are manageable to prevent memory issues, as discussed earlier.</li><li><strong>Model Saving:</strong> Regularly save model checkpoints to avoid losing progress in case of
interruptions.</li></ul></blockquote>
</details>

<p><br /></p>

<h1 id="️-visualizations">✍️ Visualizations</h1>

<h2 id="learning-with-dummy-data">Learning with Dummy Data</h2>

<p>Again, we started with dummy data to ensure our various components were learning and converging correctly.</p>

<p>I’m not going to burn too many pixels with these visualizations given I think they’re less interesting.</p>

<p>Here is our entire network, sampling only from the means (not showing the mixture densities) across the example datasets. One thing to note here is how well the LSTMs can still handle these larger contexts. It pales in comparison to modern transformer context lengths, but it’s still impressive.</p>

<p><img src="/images/generative-handwriting/viz/loop_epoch200_mixtures5.gif" alt="handwriting_loop_lstm_simple" class="center-shrink" /></p>

<p><img src="/images/generative-handwriting/viz/zigzag_epoch200_mixtures5.gif" alt="handwriting_zig_lstm_simple" class="center-shrink" /></p>

<h2 id="synthesis-model-sampling">Synthesis Model Sampling</h2>

<p>So again, given the above information, $\phi(t, u)$ represents the network’s belief that it’s writing character $c_u$ at time $t$. It’s monotonically increasing (which makes sense and is enforced mathematically), and you can see that the increase is fairly stepwise.</p>
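<p>For the curious, the monotonicity falls out of how Graves parameterizes the window: each attention component’s center $\kappa_k$ is only ever incremented by a positive amount. Here’s a minimal NumPy sketch of that idea (variable names are mine, not the repo’s):</p>

```python
import numpy as np

def attention_phi(alpha, beta, kappa, num_chars):
    """Graves-style window weights phi(t, u) for a single timestep.

    alpha, beta, kappa: (K,) parameters for K attention components.
    Returns a (num_chars,) vector of (unnormalized) attention weights.
    """
    u = np.arange(num_chars)  # character positions 0..U-1
    bumps = alpha[:, None] * np.exp(-beta[:, None] * (kappa[:, None] - u) ** 2)
    return bumps.sum(axis=0)

# kappa only ever moves forward: kappa_t = kappa_{t-1} + exp(kappa_hat_t),
# which is what enforces the monotone, roughly stepwise sweep over characters.
kappa = np.zeros(1)
for kappa_hat in [0.1, 0.4, 0.2]:
    kappa = kappa + np.exp(kappa_hat)  # always a positive increment

phi = attention_phi(np.ones(1), 2.0 * np.ones(1), kappa, num_chars=10)
```

After three positive increments, $\kappa \approx 3.8$, so the window’s belief peaks around the fourth character and can never slide backwards.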

<p>One of my favorite portions of these visualizations is the <strong>mixture component weights</strong>. You can see the various Gaussians activating for different parts of the synthesis network. For example, for end-of-stroke signals, we have separate Gaussians owning that portion of the model.</p>

<p>Most of these were generated like so:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">╭─</span><span class="n">johnlarkin</span><span class="o">@</span><span class="n">Mac</span> <span class="o">~/</span><span class="n">Documents</span><span class="o">/</span><span class="n">coding</span><span class="o">/</span><span class="n">generative</span><span class="o">-</span><span class="n">handwriting</span><span class="o">-</span><span class="n">jax</span> <span class="err">‹</span><span class="n">main</span><span class="o">*</span><span class="err">›</span>
<span class="err">╰─➤</span>  <span class="n">uv</span> <span class="n">run</span> <span class="n">python</span> <span class="n">generative_handwriting</span><span class="o">/</span><span class="n">generate</span><span class="o">/</span><span class="n">generate_handwriting_cpu</span><span class="p">.</span><span class="n">py</span> \
    <span class="o">--</span><span class="n">checkpoint</span> <span class="s">"checkpoints_saved/synthesis/loss_-2.59/checkpoint_216_cpu.pkl"</span> \
    <span class="o">--</span><span class="n">text</span> <span class="s">"It has to be symphonic"</span> \
    <span class="o">--</span><span class="n">bias</span> <span class="s">"0.75"</span> \
    <span class="o">--</span><span class="n">temperature</span> <span class="s">"0.75"</span> \
    <span class="o">--</span><span class="n">fps</span> <span class="s">"60"</span> \
    <span class="o">--</span><span class="n">formats</span> <span class="s">"all"</span> \
    <span class="o">--</span><span class="n">seed</span> <span class="s">"42"</span>
</code></pre></div></div>

<p>Another note: my termination-condition logic could probably be improved. Remember, we’re doing one-hot encoding, which includes the null terminator, so the null terminator sits at index <code class="language-plaintext highlighter-rouge">len(line_text)</code>. Attention spans the full sequence. Specifically, $\phi$ has shape <code class="language-plaintext highlighter-rouge">[batch, char_seq_length]</code>, so we can take our single sample (i.e. batch index 0) and look across the character sequence to see where the attention currently sits. In code, here’s what I’m doing:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        # char_seq includes null terminator at index len(line_text)
        if phi is not None and t &gt;= len(line_text) * 2:
            char_idx = int(jnp.argmax(phi[0]))
            sampled_eos = stroke[2] &gt; 0.5

            # we can stop when:
            # 1. attention has reached the null terminator (char_idx == len(line_text)) AND we sampled EOS
            # 2. attention weight on null terminator is dominant (&gt; 0.5)
            # 3. we're well past the text and sampled EOS multiple times
            null_attention = float(phi[0][len(line_text)]) if len(phi[0]) &gt; len(line_text) else 0.0

            if char_idx == len(line_text) and sampled_eos:
                # this hits most
                break
            elif null_attention &gt; 0.5:
                # attention strongly focused on null terminator
                break
            elif char_idx &gt;= len(line_text) and t &gt; len(line_text) * 10:
                # failsafe: past text and generated way too much
                break
</code></pre></div></div>

<p>Finally, on the visualization front, I’m generating everything with bias 0.75 and temperature 0.75. I’m not going to discuss those here, but the original paper goes into more detail.</p>
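<p>For a rough sense of what the bias knob does, here is a sketch of the probability-bias trick from the original Graves paper (this is illustrative, not the repo’s actual sampling code): a bias $b$ shrinks the standard deviations and sharpens the mixture weights, so higher bias means neater, more deterministic handwriting.</p>

```python
import numpy as np

def biased_mixture_params(pi_hat, sigma_hat, bias):
    """Apply a sampling bias b to raw MDN outputs, a la Graves (2013):
    sigma_j = exp(sigma_hat_j - b), pi_j proportional to exp(pi_hat_j * (1 + b)).
    """
    sigma = np.exp(sigma_hat - bias)      # smaller std devs as bias grows
    logits = pi_hat * (1.0 + bias)        # sharpen the mixture weights
    pi = np.exp(logits - logits.max())    # numerically stable softmax
    return pi / pi.sum(), sigma

pi_hat = np.array([1.0, 0.0, -1.0])
sigma_hat = np.array([0.5, 0.5, 0.5])
pi0, sigma0 = biased_mixture_params(pi_hat, sigma_hat, bias=0.0)
pi1, sigma1 = biased_mixture_params(pi_hat, sigma_hat, bias=0.75)
```

With bias 0.75, the dominant component’s weight grows and every sigma shrinks, which is exactly the “cleaner handwriting” effect you see in the samples above.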

<hr />

<h3 id="heart-has-its-reasons">Heart has its reasons</h3>

<div class="featured-quote">
  <p class="featured-quote__text">The heart has its reasons which reason knows nothing of</p>
  <p class="featured-quote__attribution">
    <span class="featured-quote__author">Blaise Pascal</span>, <span class="featured-quote__source">Pensées</span>
  </p>
</div>

<p>One thing to note is that we are still constrained by line length. For example, if we try to specify this as a single line, the attention starts to drift and the context starts to fail. Part of this is that if we exceed the line length we trained on (in terms of stroke-sequence or input-text length), the model starts to flail.</p>

<p>So note the discrepancy between these two when we introduce a line break:</p>

<p><img src="/images/generative-handwriting/synth_outputs/heart_has_its_reason/mdn_aggregate.png" alt="heart-mdn-aggregate" class="center-small lightbox-image" /></p>

<p>vs</p>

<p><img src="/images/generative-handwriting/synth_outputs/heart_has_its_reason_single/mdn_aggregate.png" alt="heart-oneliner-mdn-aggregate" class="center-small lightbox-image" /></p>

<p>You can see how the model is less well trained here, given the higher deviations towards the end of the line. Note: the MDN heatmap graphs on the bottom are created by taking the three highest-weighted $\pi$ components per timestep and then aggregating them across all timesteps.</p>
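<p>Concretely, the per-timestep selection behind those heatmaps looks something like this (a simplified sketch with hypothetical names; the real plotting code also renders the corresponding Gaussians):</p>

```python
import numpy as np

def top_k_components(pi, k=3):
    """Indices and weights of the k highest-weighted mixture components
    at each timestep. pi has shape (T, num_components)."""
    idx = np.argsort(pi, axis=1)[:, ::-1][:, :k]   # descending by weight
    weights = np.take_along_axis(pi, idx, axis=1)
    return idx, weights

# two timesteps, four mixture components
pi = np.array([[0.1, 0.5, 0.2, 0.2],
               [0.7, 0.1, 0.1, 0.1]])
idx, weights = top_k_components(pi)
```

Aggregating these top-3 components across every timestep produces the heatmap: wherever many timesteps put weight, the plot is hot.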

<p>Furthermore, the <code class="language-plaintext highlighter-rouge">eos</code> signals generally have the highest uncertainty and the most spread-out sigmas, which makes sense given that they are the most variable part of the sequence.</p>

<hr />

<h3 id="loved-and-lost">Loved and lost</h3>

<div class="featured-quote">
  <p class="featured-quote__text">Better to have loved and lost than never to have loved at all</p>
  <p class="featured-quote__attribution">
    <span class="featured-quote__author">Lord Alfred Tennyson</span>, <span class="featured-quote__source">In Memoriam A. H. H.</span>
  </p>
</div>

<p><img src="/images/generative-handwriting/synth_outputs/loved_and_lost/writing_colored.gif" alt="better-to-have-loved-writing" class="center-small lightbox-image" /></p>

<p><img src="/images/generative-handwriting/synth_outputs/loved_and_lost/writing_cleansed.png" alt="better-to-have-loved-mdn" class="basic-center lightbox-image" /></p>

<hr />

<h3 id="it-has-to-be-symphonic">It has to be symphonic</h3>

<div class="featured-quote">
  <p class="featured-quote__text">It has to be symphonic</p>
  <p class="featured-quote__attribution">
    <span class="featured-quote__author">Andrew Zimmern</span>, <span class="featured-quote__source">Takeaway (The Potash Twins ft. Andrew Zimmern)</span>
  </p>
</div>

<p><img src="/images/generative-handwriting/synth_outputs/symphonic/writing_colored.gif" alt="symphonic-writing" class="basic-center lightbox-image" /></p>

<p><img src="/images/generative-handwriting/synth_outputs/symphonic/sampling.gif" alt="symphonic-sampling" class="center-small lightbox-image" /></p>

<hr />

<h3 id="is-a-model-a-lie">Is a model a lie?</h3>

<div class="featured-quote">
  <p class="featured-quote__text">A model is a lie that helps you see the truth</p>
  <p class="featured-quote__attribution">
    <span class="featured-quote__author">Howard Skipper</span>, <span class="featured-quote__source">requoted by Siddhartha Mukherjee in "The Emperor of All Maladies"</span>
  </p>
</div>

<p><img src="/images/generative-handwriting/synth_outputs/model_lie/writing_cleansed.png" alt="model-lie-writing" class="basic-center lightbox-image" /></p>

<p><img src="/images/generative-handwriting/synth_outputs/model_lie/writing_colored.gif" alt="model-lie-writing-colored" class="center-small lightbox-image" /></p>

<p><img src="/images/generative-handwriting/synth_outputs/model_lie/mdn_aggregate.png" alt="model-lie-mdn" class="center-small lightbox-image" /></p>

<hr />

<h3 id="fish-folly">Fish folly</h3>

<p>Hadn’t ever heard of this one but it’s my best friend’s favorite quote.</p>

<div class="featured-quote">
  <p class="featured-quote__text">Folly to love a fish. Or anyone who might leave us. But oh, what a gift.</p>
  <p class="featured-quote__attribution">
    <span class="featured-quote__author">Ann V. Klotz</span>, <span class="featured-quote__source">I Think on Thee, Dear Friend</span>
  </p>
</div>

<p><img src="/images/generative-handwriting/synth_outputs/fish_folly/writing_cleansed.png" alt="fish-folly-attention" class="basic-center lightbox-image" /></p>

<p><img src="/images/generative-handwriting/synth_outputs/fish_folly/attention.png" alt="fish-folly-attention" class="center-small lightbox-image" /></p>

<h1 id="conclusion">Conclusion</h1>

<p>This - again - was a bit of a bear of a project. It was maybe not my best use of time, but it was a labor of love.</p>

<p>I don’t think I’ll embark on a project of this nature for a while (sadly). However, I hope you’ve enjoyed it. Feel free to pull the code and dive in yourself.</p>

<p>When re-reading my old draft blog post, I liked the way I ended things. So here it is:</p>

<blockquote>
  <p>Finally, I want to leave with a quote from our academic advisor <a href="https://mzucker.github.io/">Matt Zucker</a>. When I asked him when we know that our model is good enough, he responded with the following.</p>

  <blockquote>
    <p>“Learning never stops.”</p>
  </blockquote>
</blockquote>]]></content><author><name>johnlarkin1</name></author><category term="⭐️ Favorites" /><category term="Algorithms" /><category term="Development" /><category term="AI" /><category term="M.L." /><summary type="html"><![CDATA[✍️ Motivating Visualizations]]></summary></entry><entry><title type="html">Launching Scrollz</title><link href="https://johnlarkin1.github.io/2025/launching-scrollz/" rel="alternate" type="text/html" title="Launching Scrollz" /><published>2025-09-16T00:00:00+00:00</published><updated>2025-09-16T00:00:00+00:00</updated><id>https://johnlarkin1.github.io/2025/launching-scrollz</id><content type="html" xml:base="https://johnlarkin1.github.io/2025/launching-scrollz/"><![CDATA[<p>I’m happy to announce that as of <strong>September 16th, 2025</strong>, our application <a href="https://apps.apple.com/us/app/scrollz-app/id6745718779">Scrollz App</a> is officially in the iOS app store (and the Android store). With that, it’s out to 175 countries.</p>

<h1 id="context">Context</h1>

<p>We built <a href="https://www.scrollz.co/">Scrollz</a> because we were sick of scrolling through and cluttering our inboxes every day with various newsletters. We wanted better search, more sophisticated note-taking, sharing, social features, etc. Lots of that we’re still working on. It’s a cool app, and a couple of us put a solid chunk of time and energy into it, so I’d appreciate a download and a review. Selfishly, I obviously needed to hit my New Year’s resolution of being accepted onto the App Store.</p>

<p><img src="/images/scrollz/main.jpg" alt="scrollz" class="center-image lightbox-image" /></p>

<p>Anyways! Enjoy. Open for thoughts, comments, feedback, etc.</p>]]></content><author><name>johnlarkin1</name></author><category term="Scrollz" /><summary type="html"><![CDATA[I’m happy to announce that as of September 16th, 2025, our application Scrollz App is officially in the iOS app store (and the Android store). With that, it’s out to 175 countries.]]></summary></entry><entry><title type="html">Tennis Scorigami</title><link href="https://johnlarkin1.github.io/2025/tennis-scorigami/" rel="alternate" type="text/html" title="Tennis Scorigami" /><published>2025-06-11T00:00:00+00:00</published><updated>2025-06-11T00:00:00+00:00</updated><id>https://johnlarkin1.github.io/2025/tennis-scorigami</id><content type="html" xml:base="https://johnlarkin1.github.io/2025/tennis-scorigami/"><![CDATA[<p>This post is going to be focused on discussing how we built our <a href="https://www.tennis-scorigami.com/">Tennis Scorigami</a> project from a technical standpoint. I’ll discuss the current architecture, some of the design decisions I made, and where I want the project to go next.</p>

<p>If you haven’t yet checked out the main site, feel free to here:</p>

<div class="tennis-scorigami-unfurl">
  <a href="https://www.tennis-scorigami.com/" target="_blank" rel="noopener noreferrer">
    <div class="unfurl-container">
      <video autoplay="" loop="" muted="" playsinline="" class="unfurl-video">
        <source src="/videos/tennis-scorigami/hero-section.mp4" type="video/mp4" />
      </video>
      <div class="unfurl-overlay">
        <div class="unfurl-content">
          <h3 class="unfurl-title">Tennis Scorigami</h3>
          <p class="unfurl-domain">tennis-scorigami.com</p>
        </div>
      </div>
    </div>
  </a>
</div>

<style>
.tennis-scorigami-unfurl {
  margin: 2rem 0;
  border-radius: 12px;
  overflow: hidden;
  box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06);
  transition: transform 0.2s ease, box-shadow 0.2s ease;
}

.tennis-scorigami-unfurl:hover {
  transform: translateY(-2px);
  box-shadow: 0 10px 15px -3px rgba(0, 0, 0, 0.1), 0 4px 6px -2px rgba(0, 0, 0, 0.05);
}

.tennis-scorigami-unfurl a {
  text-decoration: none;
  color: inherit;
}

.unfurl-container {
  position: relative;
  width: 100%;
  background: #1a1a2e;
  cursor: pointer;
}

.unfurl-video {
  width: 100%;
  height: auto;
  display: block;
  object-fit: cover;
  max-height: 400px;
}

.unfurl-overlay {
  position: absolute;
  bottom: 0;
  left: 0;
  right: 0;
  background: linear-gradient(to top, rgba(0, 0, 0, 0.9) 0%, rgba(0, 0, 0, 0.7) 50%, transparent 100%);
  padding: 3rem 2rem 1.5rem;
}

.unfurl-content {
  color: white;
  max-width: 600px;
}

.unfurl-title {
  font-size: 1.75rem;
  font-weight: 700;
  margin: 0 0 0.25rem 0;
  color: #ffffff;
}

.unfurl-domain {
  font-size: 1rem;
  margin: 0 0 0.5rem 0;
  color: #a8b2d1;
  font-weight: 400;
}

@media (max-width: 768px) {
  .unfurl-video {
    max-height: 250px;
  }
  
  .unfurl-overlay {
    padding: 2rem 1.5rem 1rem;
  }
  
  .unfurl-title {
    font-size: 1.5rem;
  }
  
  .unfurl-domain {
    font-size: 0.875rem;
  }
  
  .unfurl-date {
    right: 1.5rem;
  }
}
</style>

<!--
# Table of Contents

- [Table of Contents](#table-of-contents)
- [Motivation](#motivation)
  - [More Specific Motivation](#more-specific-motivation)
- [Demo](#demo)
- [Features](#features)
- [Challenges](#challenges)
  - [Data Consolidation](#data-consolidation)
  - [Being Cheap](#being-cheap)
  - [Fetching 108k Nodes](#fetching-108k-nodes)
  - [Rendering 108k Nodes](#rendering-108k-nodes)
    - [2D Sigma Graph](#2d-sigma-graph)
    - [3D Force Graph](#3d-force-graph)
    - [Streaming + NDJSON](#streaming--ndjson)
  - [Unfurl Previews](#unfurl-previews)
- [Surprises](#surprises)
- [Tech Stack](#tech-stack)
- [Engineering + Design](#engineering--design)
- [Other Fun Visualizations](#other-fun-visualizations)
  - [Player Rank History](#player-rank-history)
- [Conclusion](#conclusion)
-->

<h1 id="motivation">Motivation</h1>

<p>Given, once again, the impending collapse of my profession due to automation, my lack of time, and the fact that there are details on the main website, I will try not to repeat myself.</p>

<p>Our motivation here was largely love of tennis, data, and friendship. <a href="https://www.linkedin.com/in/sebastian-hoar-a71a5b112/">Sebastian</a> and <a href="https://www.linkedin.com/in/jebhenryhead/">Henry</a> were chatting in the group chat about football scorigami, and Seb asked: I wonder if tennis has any scorigamis? And so began an interesting conversation; the group chat started to explore where we could get data, whether anyone had done this before, and all of that.</p>

<h2 id="more-specific-motivation">More Specific Motivation</h2>

<p>More specifically, besides my craving to “wow” my friends, the real pie in the sky goal for us (read: <a href="https://www.linkedin.com/in/johnlarkin/">me</a>), was to get <a href="https://en.wikipedia.org/wiki/Andy_Roddick">Andy Roddick</a> to re-tweet / view this project.</p>

<p><img src="/images/tennis-scorigami/driving-motivation-pt1.png" alt="driving-motivation" class="center-super-shrink" /></p>

<p>Andy Roddick has been one of our favorite tennis players since growing up, and <a href="https://www.linkedin.com/in/jebhenryhead/">Henry</a> and I watched as many of his matches as we could get a hold of. So that target was a stretch New Years Resolution of mine.</p>

<h1 id="demo">Demo</h1>

<p>If you’re too lazy to visit the <a href="https://www.tennis-scorigami.com/explore">website and explore</a>, here’s a demo:</p>

<div class="video-container">
  <div class="video-wrapper-dark">
    <video src="https://www.dropbox.com/scl/fi/96xzymtsl5j7zkbgdh2gu/tennis-scorigami-demo-smaller.mp4?rlkey=da9r4uyuj45wpakl4daffkyqw&amp;st=7txvjr91&amp;raw=1" muted="" autoplay="" loop="" controls="" style="width: 100%; height: auto;">
    </video>
  </div>
</div>

<h1 id="features">Features</h1>

<p>There’s numerous features here that we’re proud of. I’m going to list some of them, and then I’ll discuss them in further detail below:</p>

<ul>
  <li><strong>Graph-Based Score Sequence Modeling</strong>
    <ul>
      <li>this was originally <a href="https://www.linkedin.com/in/jebhenryhead/">Henry’s</a> idea but given how many nodes we have in this search space, it’s not as feasible to just arrange it as a grid like with the NFL</li>
      <li>we thought a graph as a novel (and visually appealing approach)</li>
      <li>what this means technically is that we pre-computed all permutations and did some background processing so that we could store this information in as close to a frontend-ready format as possible for fast visualization and processing… which gets into our next point</li>
    </ul>
  </li>
  <li><strong>Performance-Optimized Materialized Views</strong>
    <ul>
      <li>we built out specific materialized views to help with the performance (given our existing FE filters) so that we can ensure latency is not noticeable</li>
    </ul>
  </li>
  <li><strong>Streaming Visualizations</strong>
    <ul>
      <li>Still kind of working on setting this up ideally for the 5-set graph. There are 125,062 nodes that we need to render for 5-set match permutations.</li>
      <li>I had to turn to <a href="https://apidog.com/blog/ndjson/">NDJSON</a> (basically just newline json chunked) which I had never used before</li>
      <li>This helped reduce both the latency and the incremental memory on receiving that information and parsing it on the FE</li>
    </ul>
  </li>
  <li><strong><a href="https://orm.drizzle.team/">Drizzle</a></strong>
    <ul>
      <li>Ok fair fine you got me. Technically I used both SQLAlchemy and then ported over to Drizzle so that was a little bit of a mess, but I have heard great things about drizzle and liked that experience a lot. <code class="language-plaintext highlighter-rouge">drizzle-kit</code> is very sleek and the latest major release for drizzle is fantastic.</li>
    </ul>
  </li>
  <li><strong>(somewhat?) Decent Testing</strong>
    <ul>
      <li>Yeah, obviously I wouldn’t say I went crazy here, but I did set up <a href="https://jestjs.io/"><code class="language-plaintext highlighter-rouge">jest</code></a>, <a href="https://playwright.dev/"><code class="language-plaintext highlighter-rouge">playwright</code></a>, and <a href="https://storybook.js.org/"><code class="language-plaintext highlighter-rouge">storybook</code></a></li>
      <li>These are all things that the very talented <a href="https://www.linkedin.com/in/barlock/">Michael Barlock</a> first introduced to the team.</li>
      <li>I learned a lot from him and since then, yeah I’ve been trying to incorporate / adopt these a bit more</li>
    </ul>
  </li>
  <li><strong>Unfurl Link Coverage</strong>
    <ul>
      <li>Perhaps trivial, but try sending <code class="language-plaintext highlighter-rouge">https://tennis-scorigami.com/</code> over iMessage? Yup, it uses the mp4 video with autoplay. What about over Discord? Falls to the native sub-5MB gif that plays. LinkedIn? That same gif? Twitter - just uses a static image as a fallback, and Slack? Also has coverage with the gif / animation support.</li>
      <li>I spent a non-trivial amount of time on this because it’s the little details that might help Andy Roddick actually click retweet (although, yeah, I should probably fix that Twitter share link then)</li>
    </ul>
  </li>
</ul>
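<p>To make the streaming point concrete, here’s a minimal, framework-agnostic sketch of NDJSON consumption (the real frontend is TypeScript; names here are mine). The win over a single JSON array is that each node can be parsed, and handed to the renderer, as soon as its line completes, instead of waiting for the full payload:</p>

```python
import io
import json

def iter_ndjson(stream, chunk_size=8192):
    """Incrementally yield one object per completed NDJSON line."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line.strip():
                yield json.loads(line)
    if buffer.strip():  # last line may lack a trailing newline
        yield json.loads(buffer)

# nodes arrive one per line rather than as one giant JSON array
stream = io.StringIO('{"id": 1}\n{"id": 2}\n{"id": 3}')
nodes = list(iter_ndjson(stream, chunk_size=4))
```

Because memory usage is bounded by the longest line plus one chunk, this scales to six figures of nodes without the parse-everything-at-once spike a plain <code class="language-plaintext highlighter-rouge">JSON.parse</code> incurs.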

<h1 id="challenges">Challenges</h1>

<h2 id="data-consolidation">Data Consolidation</h2>

<div class="markdown-alert markdown-alert-disclaimer">
<p>There's a couple important notes I want to make here.</p>

<p>Our data cutoff is then (partially) hinged on <a href="https://github.com/jeffsackmann">Jeff Sackmann</a>. Currently it's good until the end of 2024. I hate that. I'm not running LLMs, I shouldn't have a data cutoff. I'm planning on building a web scraper for ATP results and setting up my own data feeds because why the f not, and it's 2025 and you can roll (most) software if push really comes to shove. I can go on a full blown rant on my Substack or something, but I am aware of this limitation, and I dislike it more than you.</p>

<p>Secondly, consolidating all of this data from disparate sources always presents a challenge. There's the age-old problem of the same logical data arriving via different ingestion sources. I have pulled some of the 2024 matches with SportRadar, with RapidApi, and with Sackmann, so consolidating that definitely took a bit of elbow grease. Obviously, I used LLMs for parts of this project, but that part was probably the most hands-on and manually driven. Yes, don't worry - I set up my <a href="https://www.npmjs.com/package/@modelcontextprotocol/server-postgres">Postgres MCP</a> and, after porting to Neon, my <a href="https://neon.com/docs/ai/neon-mcp-server">Neon MCP</a>, but man yeah... still early days I suppose.</p>

</div>
<p><br /></p>

<p>Again, as referenced <a href="https://www.tennis-scorigami.com/about#data-collection">here</a>, tennis data is a hot commodity. It is insanely annoying and hard to get clean data. I am adamant that another side project spinning out of this will be a free, publicly available API for people to query tennis data from.</p>

<p>I tried numerous things:</p>

<ul>
  <li><a href="https://sportradar.com/media-tech/data-content/sports-data-api/">SportRadar</a>
    <ul>
      <li>they’re one of the world’s best data providers</li>
      <li>however, they are absurdly expensive. there’s more info in the Reddit below but they don’t have set plans. as of 2 years ago, for a small time project, they were $1250 a month</li>
      <li>I tried their free trial, ripped as much as I could (of recent tournaments), and then got rate limited, and my trial expired</li>
      <li>Needless to say that was a bit of a miss</li>
    </ul>
  </li>
</ul>

<blockquote class="reddit-embed-bq" style="height:316px" data-embed-theme="dark" data-embed-height="356"><a href="https://www.reddit.com/r/Sportradar/comments/s9j4tl/api_pricing/">API Pricing?</a><br /> by<a href=""></a> in<a href="https://www.reddit.com/r/Sportradar/">Sportradar</a></blockquote>
<script async="" src="https://embed.reddit.com/widgets.js" charset="UTF-8"></script>

<ul>
  <li>(continuation)
    <ul>
      <li>one nice thing about SportRadar though was that they publish an <code class="language-plaintext highlighter-rouge">OpenAPI</code> (API not AI 🙄) spec of their endpoints.</li>
      <li>You can see their <a href="https://api.sportradar.com/tennis/production/v3/openapi/swagger/index.html#/">Swagger docs here</a>, and download the <code class="language-plaintext highlighter-rouge">openapi.yaml</code> file <a href="https://api.sportradar.com/tennis/production/v3/openapi/openapi.yaml">here</a></li>
      <li>This made generating a Python client for interacting with it very easy.</li>
      <li>I even created a SportRadar specific Python client for this.</li>
      <li>That repo was (more or less) entirely created with: <code class="language-plaintext highlighter-rouge">openapi-python-client generate --path ~/Downloads/openapi.yaml --config tennis-config.yaml</code> with my <code class="language-plaintext highlighter-rouge">tennis-config.yaml</code> being as simple as:</li>
    </ul>
  </li>
</ul>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">project_name_override</span><span class="pi">:</span> <span class="s">sportradar-tennis-v3</span>
<span class="na">package_name_override</span><span class="pi">:</span> <span class="s">sportradar_tennis_v3</span>
</code></pre></div></div>

<p><a href="https://github.com/johnlarkin1/sportradar-tennis-v3"><strong>Check out the GH repo for the SportRadar client here</strong></a></p>

<div class="github-repo-card" data-repo="johnlarkin1/sportradar-tennis-v3">
  <div class="github-repo-loading">
    <div class="loading-spinner"></div>
    <p>Loading repository data...</p>
  </div>
  <div class="github-repo-content" style="display: none">
    <div class="github-repo-header">
      <svg class="github-icon" height="20" width="20" viewBox="0 0 16 16" version="1.1" aria-hidden="true">
        <path fill-rule="evenodd" d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.01 1.08.58 1.23.82.72 1.21 1.87.87 2.33.66.07-.52.28-.87.51-1.07-1.78-.2-3.64-.89-3.64-3.95 0-.87.31-1.59.82-2.15-.08-.2-.36-1.02.08-2.12 0 0 .67-.21 2.2.82.64-.18 1.32-.27 2-.27.68 0 1.36.09 2 .27 1.53-1.04 2.2-.82 2.2-.82.44 1.1.16 1.92.08 2.12.51.56.82 1.27.82 2.15 0 3.07-1.87 3.75-3.65 3.95.29.25.54.73.54 1.48 0 1.07-.01 1.93-.01 2.2 0 .21.15.46.55.38A8.013 8.013 0 0016 8c0-4.42-3.58-8-8-8z"></path>
      </svg>
      <h3 class="github-repo-name">
        <a href="" target="_blank" rel="noopener noreferrer"></a>
      </h3>
    </div>
    <div class="github-repo-meta">
      <span class="github-repo-author">by <a href="" target="_blank" rel="noopener noreferrer"></a></span>
    </div>
    <p class="github-repo-description"></p>
    <div class="github-repo-stats">
      <div class="github-repo-stat github-repo-language" style="display: none">
        <span class="language-color"></span>
        <span class="language-name"></span>
      </div>
      <div class="github-repo-stat">
        <svg class="github-stat-icon" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true">
          <path fill-rule="evenodd" d="M8 .25a.75.75 0 01.673.418l1.882 3.815 4.21.612a.75.75 0 01.416 1.279l-3.046 2.97.719 4.192a.75.75 0 01-1.088.791L8 12.347l-3.766 1.98a.75.75 0 01-1.088-.79l.72-4.194L.818 6.374a.75.75 0 01.416-1.28l4.21-.611L7.327.668A.75.75 0 018 .25zm0 2.445L6.615 5.5a.75.75 0 01-.564.41l-3.097.45 2.24 2.184a.75.75 0 01.216.664l-.528 3.084 2.769-1.456a.75.75 0 01.698 0l2.77 1.456-.53-3.084a.75.75 0 01.216-.664l2.24-2.183-3.096-.45a.75.75 0 01-.564-.41L8 2.694v.001z"></path>
        </svg>
        <span class="stars-count">0</span>
      </div>
      <div class="github-repo-stat">
        <svg class="github-stat-icon" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true">
          <path fill-rule="evenodd" d="M5 3.25a.75.75 0 11-1.5 0 .75.75 0 011.5 0zm0 2.122a2.25 2.25 0 10-1.5 0v.878A2.25 2.25 0 005.75 8.5h1.5v2.128a2.251 2.251 0 101.5 0V8.5h1.5a2.25 2.25 0 002.25-2.25v-.878a2.25 2.25 0 10-1.5 0v.878a.75.75 0 01-.75.75h-4.5A.75.75 0 015 6.25v-.878zm3.75 7.378a.75.75 0 11-1.5 0 .75.75 0 011.5 0zm3-8.75a.75.75 0 100-1.5.75.75 0 000 1.5z"></path>
        </svg>
        <span class="forks-count">0</span>
      </div>
    </div>
    <div class="github-repo-topics" style="display: none"></div>
  </div>
  <div class="github-repo-error" style="display: none">
    <p>Unable to load repository data</p>
  </div>
</div>

<ul>
  <li><a href="https://rapidapi.com/">RapidAPI</a>
    <ul>
      <li>Ok, but I moved on rapidly (pun intended), because I wasn’t about to pay $1,250 a month (unless I was running betting strats)</li>
      <li>I switched to <a href="https://rapidapi.com/">RapidAPI</a> given their generous free plans</li>
      <li>The data quality also suffered here, and a lot of the APIs had pretty stringent rate limits (2k calls <em>per month</em>)</li>
      <li>Given that, I eventually turned away after pulling what I could</li>
    </ul>
  </li>
  <li><a href="https://github.com/jeffsackmann">Sackmann</a>
    <ul>
      <li>I seriously need to buy Jeff Sackmann a beer</li>
      <li>He’s consolidated <strong>years</strong> of tennis data into a decently well organized format</li>
      <li>Sure, there are duplicate players, strange score formats, partial data, conflicts with accents, etc.</li>
      <li>Lots of the traditional data quality issues, but it’s easier to start from an excess and cleanse than to pull data out of thin air</li>
      <li>So we stuck with Jeff for our historical data</li>
    </ul>
  </li>
</ul>

<h2 id="being-cheap">Being Cheap</h2>

<p>This one will be quick, but another challenge was simply my desire not to spend money, specifically on a hosted database service. I started with <a href="https://supabase.com/">Supabase</a>, which I love and use for many other projects, but then pivoted to <a href="https://aiven.io/">Aiven</a>, which I honestly liked even more. However, Aiven didn’t have connection pooling, and I figured that if this did go viral I would get burned, and people would say I was a bad engineer as we throttled on Aiven’s free-plan limit of roughly 10 open database connections. So finally, I ended up with <a href="https://neon.com/">Neon</a>, because I’ve been wanting to try them and they’re slightly cheaper than <a href="https://supabase.com/">Supabase</a>. Here’s an AI-generated table summarizing the pros, cons, and decisions:</p>

<div class="markdown-alert markdown-alert-ai">
<p>This table was generated using AI.</p>
</div>

<table>
  <thead>
    <tr>
      <th>Provider</th>
      <th>Pros</th>
      <th>Cons</th>
      <th>Decision Rationale</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="https://supabase.com/">Supabase</a></td>
      <td>- Great developer experience<br />- Integrated auth + storage<br />- Rich UI/dashboard</td>
      <td>- Slightly pricier for scaling<br />- Overhead from extra services if only DB is needed</td>
      <td>Preferred for full-stack apps, but overkill + cost when only DB needed</td>
    </tr>
    <tr>
      <td><a href="https://aiven.io/">Aiven</a></td>
      <td>- Excellent reliability<br />- Flexible managed Postgres<br />- Simple CLI/tools</td>
      <td>- No built-in connection pooling<br />- Free tier limit: ~10 connections</td>
      <td>Risky for viral spikes; free tier would throttle app &amp; reflect poorly on eng.</td>
    </tr>
    <tr>
      <td><a href="https://neon.com/">Neon</a></td>
      <td>- Built-in connection pooling<br />- Autoscaling<br />- Cheaper than Supabase<br />- Separation of storage &amp; compute</td>
      <td>- Newer platform, less mature<br />- Limited ecosystem/integrations compared to Supabase</td>
      <td>Chosen for price/perf tradeoff; avoids pooling issues; good opportunity to test</td>
    </tr>
  </tbody>
</table>

<h2 id="fetching-108k-nodes">Fetching 108k Nodes</h2>

<p>This was actually pretty fine, truth be told, in terms of backend performance. The big win of utilizing <a href="https://en.wikipedia.org/wiki/Materialized_view">materialized views</a> was that I could pull the pertinent information and shape it into exactly the form that my various frontend graph views require.</p>

<p>Once we get to live data, I’ll set up cronjobs or triggers to refresh these and build out my data pipeline a bit more. However, for the moment, these materialized views were sufficient.</p>

<p>Here’s one as an example:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="n">MATERIALIZED</span> <span class="k">VIEW</span> <span class="k">public</span><span class="p">.</span><span class="n">mv_slug_stats_3_men</span>
<span class="n">TABLESPACE</span> <span class="n">pg_default</span>
<span class="k">AS</span> <span class="k">SELECT</span> <span class="n">s</span><span class="p">.</span><span class="n">event_id</span><span class="p">,</span>
    <span class="n">ss</span><span class="p">.</span><span class="n">sequence_id</span> <span class="k">AS</span> <span class="n">id</span><span class="p">,</span>
    <span class="n">ss</span><span class="p">.</span><span class="n">slug</span><span class="p">,</span>
    <span class="n">ss</span><span class="p">.</span><span class="n">depth</span><span class="p">,</span>
    <span class="n">ss</span><span class="p">.</span><span class="n">winner_sets</span><span class="p">,</span>
    <span class="n">ss</span><span class="p">.</span><span class="n">loser_sets</span><span class="p">,</span>
    <span class="n">is_terminal_3</span><span class="p">(</span><span class="n">ss</span><span class="p">.</span><span class="n">winner_sets</span><span class="p">,</span> <span class="n">ss</span><span class="p">.</span><span class="n">loser_sets</span><span class="p">)</span> <span class="k">AS</span> <span class="n">is_terminal</span><span class="p">,</span>
    <span class="mi">3</span> <span class="k">AS</span> <span class="n">best_of</span><span class="p">,</span>
    <span class="n">COALESCE</span><span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">played</span><span class="p">,</span> <span class="k">false</span><span class="p">)</span> <span class="k">AS</span> <span class="n">played</span><span class="p">,</span>
    <span class="n">COALESCE</span><span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">occurrences</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="k">AS</span> <span class="n">occurrences</span>
   <span class="k">FROM</span> <span class="n">score_sequence</span> <span class="n">ss</span>
     <span class="k">LEFT</span> <span class="k">JOIN</span> <span class="n">mv_sequence_stats_3_men</span> <span class="n">s</span> <span class="k">ON</span> <span class="n">s</span><span class="p">.</span><span class="n">sequence_id</span> <span class="o">=</span> <span class="n">ss</span><span class="p">.</span><span class="n">sequence_id</span>
  <span class="k">WHERE</span> <span class="n">ss</span><span class="p">.</span><span class="n">best_of</span> <span class="o">&lt;=</span> <span class="mi">3</span>
<span class="k">WITH</span> <span class="k">DATA</span><span class="p">;</span>

<span class="c1">-- this is so that we can filter by event_id which is what happens on the frontend when</span>
<span class="c1">-- a user selects either a tournament or a year (more or less)</span>
<span class="k">CREATE</span> <span class="k">INDEX</span> <span class="n">idx_mv_slug_stats_3_men_event</span> <span class="k">ON</span> <span class="k">public</span><span class="p">.</span><span class="n">mv_slug_stats_3_men</span> <span class="k">USING</span> <span class="n">btree</span> <span class="p">(</span><span class="n">event_id</span><span class="p">);</span>
</code></pre></div></div>
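<p>When live data lands and the refresh step gets scheduled, it would presumably look something like the following. Note that <code class="language-plaintext highlighter-rouge">REFRESH MATERIALIZED VIEW CONCURRENTLY</code> requires a unique index on the view (the <code class="language-plaintext highlighter-rouge">event_id</code> index above is not unique), so this sketch adds one first, assuming <code class="language-plaintext highlighter-rouge">id</code> is unique per row:</p>

```sql
-- Sketch of the eventual refresh step (not yet deployed).
-- CONCURRENTLY lets readers keep querying while the view rebuilds,
-- but it requires a UNIQUE index; this assumes id (sequence_id) is unique.
CREATE UNIQUE INDEX IF NOT EXISTS idx_mv_slug_stats_3_men_id
    ON public.mv_slug_stats_3_men (id);

REFRESH MATERIALIZED VIEW CONCURRENTLY public.mv_slug_stats_3_men;
```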

<h2 id="rendering-108k-nodes">Rendering 108k Nodes</h2>

<p>There are numerous repos (some of which I’m using) that help process and render beautiful graphics. I didn’t roll my own physics engine or force-graph library; I used these:</p>

<h3 id="2d-sigma-graph">2D Sigma Graph</h3>

<div class="github-repo-card" data-repo="jacomyal/sigma.js">
  <div class="github-repo-loading">
    <div class="loading-spinner"></div>
    <p>Loading repository data...</p>
  </div>
  <div class="github-repo-content" style="display: none">
    <div class="github-repo-header">
      <svg class="github-icon" height="20" width="20" viewBox="0 0 16 16" version="1.1" aria-hidden="true">
        <path fill-rule="evenodd" d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.01 1.08.58 1.23.82.72 1.21 1.87.87 2.33.66.07-.52.28-.87.51-1.07-1.78-.2-3.64-.89-3.64-3.95 0-.87.31-1.59.82-2.15-.08-.2-.36-1.02.08-2.12 0 0 .67-.21 2.2.82.64-.18 1.32-.27 2-.27.68 0 1.36.09 2 .27 1.53-1.04 2.2-.82 2.2-.82.44 1.1.16 1.92.08 2.12.51.56.82 1.27.82 2.15 0 3.07-1.87 3.75-3.65 3.95.29.25.54.73.54 1.48 0 1.07-.01 1.93-.01 2.2 0 .21.15.46.55.38A8.013 8.013 0 0016 8c0-4.42-3.58-8-8-8z"></path>
      </svg>
      <h3 class="github-repo-name">
        <a href="" target="_blank" rel="noopener noreferrer"></a>
      </h3>
    </div>
    <div class="github-repo-meta">
      <span class="github-repo-author">by <a href="" target="_blank" rel="noopener noreferrer"></a></span>
    </div>
    <p class="github-repo-description"></p>
    <div class="github-repo-stats">
      <div class="github-repo-stat github-repo-language" style="display: none">
        <span class="language-color"></span>
        <span class="language-name"></span>
      </div>
      <div class="github-repo-stat">
        <svg class="github-stat-icon" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true">
          <path fill-rule="evenodd" d="M8 .25a.75.75 0 01.673.418l1.882 3.815 4.21.612a.75.75 0 01.416 1.279l-3.046 2.97.719 4.192a.75.75 0 01-1.088.791L8 12.347l-3.766 1.98a.75.75 0 01-1.088-.79l.72-4.194L.818 6.374a.75.75 0 01.416-1.28l4.21-.611L7.327.668A.75.75 0 018 .25zm0 2.445L6.615 5.5a.75.75 0 01-.564.41l-3.097.45 2.24 2.184a.75.75 0 01.216.664l-.528 3.084 2.769-1.456a.75.75 0 01.698 0l2.77 1.456-.53-3.084a.75.75 0 01.216-.664l2.24-2.183-3.096-.45a.75.75 0 01-.564-.41L8 2.694v.001z"></path>
        </svg>
        <span class="stars-count">0</span>
      </div>
      <div class="github-repo-stat">
        <svg class="github-stat-icon" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true">
          <path fill-rule="evenodd" d="M5 3.25a.75.75 0 11-1.5 0 .75.75 0 011.5 0zm0 2.122a2.25 2.25 0 10-1.5 0v.878A2.25 2.25 0 005.75 8.5h1.5v2.128a2.251 2.251 0 101.5 0V8.5h1.5a2.25 2.25 0 002.25-2.25v-.878a2.25 2.25 0 10-1.5 0v.878a.75.75 0 01-.75.75h-4.5A.75.75 0 015 6.25v-.878zm3.75 7.378a.75.75 0 11-1.5 0 .75.75 0 011.5 0zm3-8.75a.75.75 0 100-1.5.75.75 0 000 1.5z"></path>
        </svg>
        <span class="forks-count">0</span>
      </div>
    </div>
    <div class="github-repo-topics" style="display: none"></div>
  </div>
  <div class="github-repo-error" style="display: none">
    <p>Unable to load repository data</p>
  </div>
</div>

<h3 id="3d-force-graph">3D Force Graph</h3>

<div class="github-repo-card" data-repo="vasturiano/react-force-graph">
  <div class="github-repo-loading">
    <div class="loading-spinner"></div>
    <p>Loading repository data...</p>
  </div>
  <div class="github-repo-content" style="display: none">
    <div class="github-repo-header">
      <svg class="github-icon" height="20" width="20" viewBox="0 0 16 16" version="1.1" aria-hidden="true">
        <path fill-rule="evenodd" d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.01 1.08.58 1.23.82.72 1.21 1.87.87 2.33.66.07-.52.28-.87.51-1.07-1.78-.2-3.64-.89-3.64-3.95 0-.87.31-1.59.82-2.15-.08-.2-.36-1.02.08-2.12 0 0 .67-.21 2.2.82.64-.18 1.32-.27 2-.27.68 0 1.36.09 2 .27 1.53-1.04 2.2-.82 2.2-.82.44 1.1.16 1.92.08 2.12.51.56.82 1.27.82 2.15 0 3.07-1.87 3.75-3.65 3.95.29.25.54.73.54 1.48 0 1.07-.01 1.93-.01 2.2 0 .21.15.46.55.38A8.013 8.013 0 0016 8c0-4.42-3.58-8-8-8z"></path>
      </svg>
      <h3 class="github-repo-name">
        <a href="" target="_blank" rel="noopener noreferrer"></a>
      </h3>
    </div>
    <div class="github-repo-meta">
      <span class="github-repo-author">by <a href="" target="_blank" rel="noopener noreferrer"></a></span>
    </div>
    <p class="github-repo-description"></p>
    <div class="github-repo-stats">
      <div class="github-repo-stat github-repo-language" style="display: none">
        <span class="language-color"></span>
        <span class="language-name"></span>
      </div>
      <div class="github-repo-stat">
        <svg class="github-stat-icon" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true">
          <path fill-rule="evenodd" d="M8 .25a.75.75 0 01.673.418l1.882 3.815 4.21.612a.75.75 0 01.416 1.279l-3.046 2.97.719 4.192a.75.75 0 01-1.088.791L8 12.347l-3.766 1.98a.75.75 0 01-1.088-.79l.72-4.194L.818 6.374a.75.75 0 01.416-1.28l4.21-.611L7.327.668A.75.75 0 018 .25zm0 2.445L6.615 5.5a.75.75 0 01-.564.41l-3.097.45 2.24 2.184a.75.75 0 01.216.664l-.528 3.084 2.769-1.456a.75.75 0 01.698 0l2.77 1.456-.53-3.084a.75.75 0 01.216-.664l2.24-2.183-3.096-.45a.75.75 0 01-.564-.41L8 2.694v.001z"></path>
        </svg>
        <span class="stars-count">0</span>
      </div>
      <div class="github-repo-stat">
        <svg class="github-stat-icon" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true">
          <path fill-rule="evenodd" d="M5 3.25a.75.75 0 11-1.5 0 .75.75 0 011.5 0zm0 2.122a2.25 2.25 0 10-1.5 0v.878A2.25 2.25 0 005.75 8.5h1.5v2.128a2.251 2.251 0 101.5 0V8.5h1.5a2.25 2.25 0 002.25-2.25v-.878a2.25 2.25 0 10-1.5 0v.878a.75.75 0 01-.75.75h-4.5A.75.75 0 015 6.25v-.878zm3.75 7.378a.75.75 0 11-1.5 0 .75.75 0 011.5 0zm3-8.75a.75.75 0 100-1.5.75.75 0 000 1.5z"></path>
        </svg>
        <span class="forks-count">0</span>
      </div>
    </div>
    <div class="github-repo-topics" style="display: none"></div>
  </div>
  <div class="github-repo-error" style="display: none">
    <p>Unable to load repository data</p>
  </div>
</div>

<p>Regardless, I wanted a very slick and performant frontend to render all of these nodes. The 5-set 3D force graph is not yet built out. You’ll note that <code class="language-plaintext highlighter-rouge">vasturiano</code> has a large-graph demo <a href="https://vasturiano.github.io/react-force-graph/example/large-graph/">here</a>:</p>

<div style="text-align: center;">
<iframe src="https://vasturiano.github.io/react-force-graph/example/large-graph/" width="800px" height="600px" frameborder="0"></iframe>
</div>

<p>And that looks fantastic. However, if you look at the source code…</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// https://github.com/vasturiano/react-force-graph/blob/master/example/large-graph/index.html</span>
<span class="o">&lt;</span><span class="nx">head</span><span class="o">&gt;</span>
  <span class="o">&lt;</span><span class="nx">style</span><span class="o">&gt;</span> <span class="nx">body</span> <span class="p">{</span> <span class="nl">margin</span><span class="p">:</span> <span class="mi">0</span><span class="p">;</span> <span class="p">}</span> <span class="o">&lt;</span><span class="sr">/style</span><span class="err">&gt;
</span>
  <span class="o">&lt;</span><span class="nx">script</span> <span class="nx">type</span><span class="o">=</span><span class="dl">"</span><span class="s2">importmap</span><span class="dl">"</span><span class="o">&gt;</span><span class="p">{</span> <span class="dl">"</span><span class="s2">imports</span><span class="dl">"</span><span class="p">:</span> <span class="p">{</span>
    <span class="dl">"</span><span class="s2">react</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">https://esm.sh/react</span><span class="dl">"</span><span class="p">,</span>
    <span class="dl">"</span><span class="s2">react-dom</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">https://esm.sh/react-dom/client</span><span class="dl">"</span>
  <span class="p">}}</span><span class="o">&lt;</span><span class="sr">/script</span><span class="err">&gt;
</span>
<span class="c">&lt;!--</span>  <span class="o">&lt;</span><span class="nx">script</span> <span class="nx">type</span><span class="o">=</span><span class="dl">"</span><span class="s2">module</span><span class="dl">"</span><span class="o">&gt;</span><span class="k">import</span> <span class="o">*</span> <span class="k">as</span> <span class="nx">React</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">react</span><span class="dl">'</span><span class="p">;</span> <span class="nb">window</span><span class="p">.</span><span class="nx">React</span> <span class="o">=</span> <span class="nx">React</span><span class="p">;</span><span class="o">&lt;</span><span class="sr">/script&gt;--</span><span class="err">&gt;
</span><span class="c">&lt;!--</span>  <span class="o">&lt;</span><span class="nx">script</span> <span class="nx">src</span><span class="o">=</span><span class="dl">"</span><span class="s2">../../src/packages/react-force-graph-3d/dist/react-force-graph-3d.js</span><span class="dl">"</span> <span class="nx">defer</span><span class="o">&gt;&lt;</span><span class="sr">/script&gt;--</span><span class="err">&gt;
</span><span class="o">&lt;</span><span class="sr">/head</span><span class="err">&gt;
</span>
<span class="o">&lt;</span><span class="nx">body</span><span class="o">&gt;</span>
  <span class="o">&lt;</span><span class="nx">div</span> <span class="nx">id</span><span class="o">=</span><span class="dl">"</span><span class="s2">graph</span><span class="dl">"</span><span class="o">&gt;&lt;</span><span class="sr">/div</span><span class="err">&gt;
</span>
  <span class="o">&lt;</span><span class="nx">script</span> <span class="nx">src</span><span class="o">=</span><span class="dl">"</span><span class="s2">//cdn.jsdelivr.net/npm/@babel/standalone</span><span class="dl">"</span><span class="o">&gt;&lt;</span><span class="sr">/script</span><span class="err">&gt;
</span>  <span class="o">&lt;</span><span class="nx">script</span> <span class="nx">type</span><span class="o">=</span><span class="dl">"</span><span class="s2">text/jsx</span><span class="dl">"</span> <span class="nx">data</span><span class="o">-</span><span class="nx">type</span><span class="o">=</span><span class="dl">"</span><span class="s2">module</span><span class="dl">"</span><span class="o">&gt;</span>
    <span class="k">import</span> <span class="nx">ForceGraph3D</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">https://esm.sh/react-force-graph-3d?external=react</span><span class="dl">'</span><span class="p">;</span>
    <span class="k">import</span> <span class="nx">React</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">react</span><span class="dl">'</span><span class="p">;</span>
    <span class="k">import</span> <span class="p">{</span> <span class="nx">createRoot</span> <span class="p">}</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">react-dom</span><span class="dl">'</span><span class="p">;</span>

    <span class="nx">fetch</span><span class="p">(</span><span class="dl">'</span><span class="s1">../datasets/blocks.json</span><span class="dl">'</span><span class="p">).</span><span class="nx">then</span><span class="p">(</span><span class="nx">res</span> <span class="o">=&gt;</span> <span class="nx">res</span><span class="p">.</span><span class="nx">json</span><span class="p">()).</span><span class="nx">then</span><span class="p">(</span><span class="nx">data</span> <span class="o">=&gt;</span> <span class="p">{</span>
      <span class="nx">createRoot</span><span class="p">(</span><span class="nb">document</span><span class="p">.</span><span class="nx">getElementById</span><span class="p">(</span><span class="dl">'</span><span class="s1">graph</span><span class="dl">'</span><span class="p">)).</span><span class="nx">render</span><span class="p">(</span>
        <span class="o">&lt;</span><span class="nx">ForceGraph3D</span>
          <span class="nx">graphData</span><span class="o">=</span><span class="p">{</span><span class="nx">data</span><span class="p">}</span>
          <span class="nx">nodeLabel</span><span class="o">=</span><span class="p">{</span><span class="nx">node</span> <span class="o">=&gt;</span> <span class="o">&lt;</span><span class="nx">div</span><span class="o">&gt;&lt;</span><span class="nx">b</span><span class="o">&gt;</span><span class="p">{</span><span class="nx">node</span><span class="p">.</span><span class="nx">user</span><span class="p">}</span><span class="o">&lt;</span><span class="sr">/b&gt;: {node.description}&lt;/</span><span class="nx">div</span><span class="o">&gt;</span><span class="p">}</span>
          <span class="nx">nodeAutoColorBy</span><span class="o">=</span><span class="dl">"</span><span class="s2">user</span><span class="dl">"</span>
          <span class="nx">linkDirectionalParticles</span><span class="o">=</span><span class="p">{</span><span class="mi">1</span><span class="p">}</span>
        <span class="sr">/</span><span class="err">&gt;
</span>      <span class="p">);</span>
    <span class="p">});</span>
  <span class="o">&lt;</span><span class="sr">/script</span><span class="err">&gt;
</span><span class="o">&lt;</span><span class="sr">/body</span><span class="err">&gt;
</span></code></pre></div></div>

<p>And then look at the underlying <code class="language-plaintext highlighter-rouge">../datasets/blocks.json</code>, do a little bit of Python/JSON handling… and you’ll note:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>╭─johnlarkin@Mac ~/Documents/coding/tennis-scorigami ‹feature/john/example-blog-post<span class="k">*</span>›
╰─➤  python                                                                                                                          127 ↵
Python 3.11.7 <span class="o">(</span>v3.11.7:fa7a6f2303, Dec  4 2023, 15:22:56<span class="o">)</span> <span class="o">[</span>Clang 13.0.0 <span class="o">(</span>clang-1300.0.29.30<span class="o">)]</span> on darwin
Type <span class="s2">"help"</span>, <span class="s2">"copyright"</span>, <span class="s2">"credits"</span> or <span class="s2">"license"</span> <span class="k">for </span>more information.
<span class="o">&gt;&gt;&gt;</span> import json
<span class="o">&gt;&gt;&gt;</span> import sys
<span class="o">&gt;&gt;&gt;</span> from pathlib import Path
<span class="o">&gt;&gt;&gt;</span> raw_path <span class="o">=</span> Path<span class="o">(</span><span class="s1">'/Users/johnlarkin/Downloads/blocks.json'</span><span class="o">)</span>
<span class="o">&gt;&gt;&gt;</span> raw_content <span class="o">=</span> raw_path.read_text<span class="o">()</span>
<span class="o">&gt;&gt;&gt;</span> data <span class="o">=</span> json.loads<span class="o">(</span>raw_content<span class="o">)</span>
<span class="o">&gt;&gt;&gt;</span> data[<span class="s1">'nodes'</span><span class="o">][</span>:2]
<span class="o">[{</span><span class="s1">'id'</span>: <span class="s1">'4062045'</span>, <span class="s1">'user'</span>: <span class="s1">'mbostock'</span>, <span class="s1">'description'</span>: <span class="s1">'Force-Directed Graph'</span><span class="o">}</span>, <span class="o">{</span><span class="s1">'id'</span>: <span class="s1">'1341021'</span>, <span class="s1">'user'</span>: <span class="s1">'mbostock'</span>, <span class="s1">'description'</span>: <span class="s1">'Parallel Coordinates'</span><span class="o">}]</span>
<span class="o">&gt;&gt;&gt;</span> len<span class="o">(</span>data[<span class="s1">'nodes'</span><span class="o">])</span>
1238 <span class="c"># ... so... not actually that large</span>
<span class="o">&gt;&gt;&gt;</span> len<span class="o">(</span>data[<span class="s1">'links'</span><span class="o">])</span>
2602
</code></pre></div></div>

<p>So… compared to 108k nodes and a similar number of edges… not exactly a lift and shift.</p>

<hr />

<p>However, the 2D Sigma graph (with a bit of streaming) can handle this with relative ease. Note, we are playing a bit of a visual game by doing some edge reduction on the farther-out layers of the 5-set graph, basically following this logic:</p>

<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">edgesToRender</span> <span class="o">=</span> <span class="nx">data</span><span class="p">.</span><span class="nx">edges</span><span class="p">.</span><span class="nx">filter</span><span class="p">((</span><span class="nx">edge</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
  <span class="kd">const</span> <span class="nx">fromNode</span> <span class="o">=</span> <span class="nx">nodeMap</span><span class="p">.</span><span class="kd">get</span><span class="p">(</span><span class="nx">edge</span><span class="p">.</span><span class="nx">frm</span><span class="p">);</span>
  <span class="kd">const</span> <span class="nx">toNode</span> <span class="o">=</span> <span class="nx">nodeMap</span><span class="p">.</span><span class="kd">get</span><span class="p">(</span><span class="nx">edge</span><span class="p">.</span><span class="nx">to</span><span class="p">);</span>

  <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="nx">fromNode</span> <span class="o">||</span> <span class="o">!</span><span class="nx">toNode</span><span class="p">)</span> <span class="k">return</span> <span class="kc">false</span><span class="p">;</span>

  <span class="c1">// Always keep early depth edges (structure)</span>
  <span class="k">if</span> <span class="p">(</span><span class="nb">Math</span><span class="p">.</span><span class="nx">max</span><span class="p">(</span><span class="nx">fromNode</span><span class="p">.</span><span class="nx">depth</span><span class="p">,</span> <span class="nx">toNode</span><span class="p">.</span><span class="nx">depth</span><span class="p">)</span> <span class="o">&lt;=</span> <span class="mi">2</span><span class="p">)</span> <span class="k">return</span> <span class="kc">true</span><span class="p">;</span>

  <span class="c1">// Keep edges to/from unscored nodes (discovery)</span>
  <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="nx">fromNode</span><span class="p">.</span><span class="nx">played</span> <span class="o">||</span> <span class="o">!</span><span class="nx">toNode</span><span class="p">.</span><span class="nx">played</span><span class="p">)</span> <span class="k">return</span> <span class="kc">true</span><span class="p">;</span>

  <span class="c1">// Keep edges with high occurrence nodes</span>
  <span class="k">if</span> <span class="p">(</span><span class="nx">fromNode</span><span class="p">.</span><span class="nx">occurrences</span> <span class="o">&gt;</span> <span class="mi">100</span> <span class="o">||</span> <span class="nx">toNode</span><span class="p">.</span><span class="nx">occurrences</span> <span class="o">&gt;</span> <span class="mi">100</span><span class="p">)</span> <span class="k">return</span> <span class="kc">true</span><span class="p">;</span>

  <span class="c1">// For deeper levels, only keep a sample</span>
  <span class="k">return</span> <span class="nb">Math</span><span class="p">.</span><span class="nx">random</span><span class="p">()</span> <span class="o">&lt;</span> <span class="mf">0.1</span><span class="p">;</span> <span class="c1">// Keep 10% of remaining edges</span>
<span class="p">});</span>
</code></pre></div></div>

<h3 id="streaming--ndjson">Streaming + NDJSON</h3>

<p>However, the crux of keeping the frontend lightweight for the 5-set rendering was to use NDJSON.</p>

<p>I had heard of NDJSON as a streaming mechanism but hadn’t used it in a production workflow. <a href="https://www.anthropic.com/claude-code">Claude</a> was obviously a huge help here and took care of implementing that part. The crux of the code looks like this:</p>

<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">response</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">fetchGraphStream</span><span class="p">({</span>
  <span class="cm">/* {...filters} */</span>
  <span class="na">maxEdgesPerDepth</span><span class="p">:</span> <span class="nx">GRAPH_CONFIG</span><span class="p">.</span><span class="nx">maxEdgesPerDepth</span><span class="p">,</span>
  <span class="na">minOccurrences</span><span class="p">:</span> <span class="nx">GRAPH_CONFIG</span><span class="p">.</span><span class="nx">minOccurrences</span><span class="p">,</span>
  <span class="na">signal</span><span class="p">:</span> <span class="nx">abortController</span><span class="p">.</span><span class="nx">signal</span><span class="p">,</span>
<span class="p">});</span>

<span class="kd">const</span> <span class="nx">reader</span> <span class="o">=</span> <span class="nx">response</span><span class="p">.</span><span class="nx">body</span><span class="p">.</span><span class="nx">getReader</span><span class="p">();</span>
<span class="kd">const</span> <span class="nx">decoder</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">TextDecoder</span><span class="p">();</span>
<span class="kd">let</span> <span class="nx">buffer</span> <span class="o">=</span> <span class="dl">""</span><span class="p">;</span>

<span class="cm">/* more code here */</span>

<span class="k">while</span> <span class="p">(</span><span class="kc">true</span><span class="p">)</span> <span class="p">{</span>
  <span class="kd">const</span> <span class="p">{</span> <span class="nx">done</span><span class="p">,</span> <span class="nx">value</span> <span class="p">}</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">reader</span><span class="p">.</span><span class="nx">read</span><span class="p">();</span>
  <span class="k">if</span> <span class="p">(</span><span class="nx">done</span><span class="p">)</span> <span class="k">break</span><span class="p">;</span>

  <span class="nx">buffer</span> <span class="o">+=</span> <span class="nx">decoder</span><span class="p">.</span><span class="nx">decode</span><span class="p">(</span><span class="nx">value</span><span class="p">,</span> <span class="p">{</span> <span class="na">stream</span><span class="p">:</span> <span class="kc">true</span> <span class="p">});</span>
  <span class="kd">const</span> <span class="nx">lines</span> <span class="o">=</span> <span class="nx">buffer</span><span class="p">.</span><span class="nx">split</span><span class="p">(</span><span class="dl">"</span><span class="se">\n</span><span class="dl">"</span><span class="p">);</span>
  <span class="nx">buffer</span> <span class="o">=</span> <span class="nx">lines</span><span class="p">.</span><span class="nx">pop</span><span class="p">()</span> <span class="o">||</span> <span class="dl">""</span><span class="p">;</span>

  <span class="k">for</span> <span class="p">(</span><span class="kd">const</span> <span class="nx">line</span> <span class="k">of</span> <span class="nx">lines</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="nx">line</span><span class="p">.</span><span class="nx">trim</span><span class="p">())</span> <span class="k">continue</span><span class="p">;</span>

    <span class="k">try</span> <span class="p">{</span>
      <span class="kd">const</span> <span class="na">message</span><span class="p">:</span> <span class="nx">StreamMessage</span> <span class="o">=</span> <span class="nx">JSON</span><span class="p">.</span><span class="nx">parse</span><span class="p">(</span><span class="nx">line</span><span class="p">);</span>

      <span class="k">switch</span> <span class="p">(</span><span class="nx">message</span><span class="p">.</span><span class="kd">type</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">case</span> <span class="dl">"</span><span class="s2">meta</span><span class="dl">"</span><span class="p">:</span>
          <span class="nx">streamTotalNodes</span> <span class="o">=</span> <span class="nx">message</span><span class="p">.</span><span class="nx">totalNodes</span><span class="p">;</span>
          <span class="nx">streamTotalEdges</span> <span class="o">=</span> <span class="nx">message</span><span class="p">.</span><span class="nx">totalEdges</span><span class="p">;</span>
          <span class="nx">setTotalNodes</span><span class="p">(</span><span class="nx">streamTotalNodes</span><span class="p">);</span>
          <span class="nx">setTotalEdges</span><span class="p">(</span><span class="nx">streamTotalEdges</span><span class="p">);</span>
          <span class="cm">/* more code here */</span>
          <span class="k">break</span><span class="p">;</span>

        <span class="k">case</span> <span class="dl">"</span><span class="s2">nodes</span><span class="dl">"</span><span class="p">:</span>
          <span class="nx">tempNodes</span><span class="p">.</span><span class="nx">push</span><span class="p">(...</span><span class="nx">message</span><span class="p">.</span><span class="nx">data</span><span class="p">);</span>
          <span class="nx">setLoadedNodes</span><span class="p">(</span><span class="nx">tempNodes</span><span class="p">.</span><span class="nx">length</span><span class="p">);</span>
          <span class="cm">/* more code here */</span>
          <span class="k">break</span><span class="p">;</span>

        <span class="k">case</span> <span class="dl">"</span><span class="s2">edges</span><span class="dl">"</span><span class="p">:</span>
          <span class="nx">tempEdges</span><span class="p">.</span><span class="nx">push</span><span class="p">(...</span><span class="nx">message</span><span class="p">.</span><span class="nx">data</span><span class="p">);</span>
          <span class="nx">setLoadedEdges</span><span class="p">(</span><span class="nx">tempEdges</span><span class="p">.</span><span class="nx">length</span><span class="p">);</span>
          <span class="cm">/* more code here */</span>
          <span class="k">break</span><span class="p">;</span>

        <span class="k">case</span> <span class="dl">"</span><span class="s2">complete</span><span class="dl">"</span><span class="p">:</span>
          <span class="nx">setData</span><span class="p">({</span> <span class="na">nodes</span><span class="p">:</span> <span class="nx">tempNodes</span><span class="p">,</span> <span class="na">edges</span><span class="p">:</span> <span class="nx">tempEdges</span> <span class="p">});</span>
          <span class="cm">/* more code here */</span>
          <span class="k">break</span><span class="p">;</span>
      <span class="p">}</span>
    <span class="p">}</span> <span class="k">catch</span> <span class="p">(</span><span class="nx">e</span><span class="p">)</span> <span class="p">{</span>
      <span class="nx">console</span><span class="p">.</span><span class="nx">error</span><span class="p">(</span><span class="dl">"</span><span class="s2">Failed to parse stream message:</span><span class="dl">"</span><span class="p">,</span> <span class="nx">e</span><span class="p">);</span>
    <span class="p">}</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This works because my backend API route creates a <code class="language-plaintext highlighter-rouge">ReadableStream</code> and returns it as the response body, basically doing something like this:</p>

<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">stream</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">ReadableStream</span><span class="p">({...});</span>
<span class="k">return</span> <span class="k">new</span> <span class="nx">Response</span><span class="p">(</span><span class="nx">stream</span><span class="p">,</span> <span class="p">{...});</span>
</code></pre></div></div>

<p>Then on the frontend, <code class="language-plaintext highlighter-rouge">response.body</code> is that same <code class="language-plaintext highlighter-rouge">ReadableStream</code>, which has the built-in <code class="language-plaintext highlighter-rouge">getReader()</code> method (returning a <code class="language-plaintext highlighter-rouge">ReadableStreamDefaultReader</code>).</p>

<p>Then to use that <code class="language-plaintext highlighter-rouge">reader</code>, it’s as simple as: <code class="language-plaintext highlighter-rouge">const { done, value } = await reader.read();</code>.</p>

<p>This helped a lot: it reduced memory load on the frontend and kept the UI interactive and non-blocking while the data streamed in.</p>
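For completeness, here is a hedged sketch of what the producing side can look like. The message shapes (`meta`, `nodes`, `edges`, `complete`) mirror the client switch statement above, but the helper names and chunk size are illustrative, not the actual implementation:

```typescript
// Sketch of server-side NDJSON production: one JSON message per line.
// StreamMessage mirrors the client code above; chunkSize is illustrative.
type StreamMessage =
  | { type: "meta"; totalNodes: number; totalEdges: number }
  | { type: "nodes"; data: unknown[] }
  | { type: "edges"; data: unknown[] }
  | { type: "complete" };

function* ndjsonMessages(
  nodes: unknown[],
  edges: unknown[],
  chunkSize = 500
): Generator<string> {
  const line = (m: StreamMessage) => JSON.stringify(m) + "\n";
  yield line({ type: "meta", totalNodes: nodes.length, totalEdges: edges.length });
  for (let i = 0; i < nodes.length; i += chunkSize)
    yield line({ type: "nodes", data: nodes.slice(i, i + chunkSize) });
  for (let i = 0; i < edges.length; i += chunkSize)
    yield line({ type: "edges", data: edges.slice(i, i + chunkSize) });
  yield line({ type: "complete" });
}

// Wrap the generator in the ReadableStream that the route handler returns.
function toStream(messages: Generator<string>): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  return new ReadableStream({
    pull(controller) {
      const { done, value } = messages.next();
      if (done) controller.close();
      else controller.enqueue(encoder.encode(value));
    },
  });
}
```

Because every message ends in `\n`, the client's `buffer.split("\n")` plus `buffer = lines.pop()` dance correctly handles messages that arrive split across network chunks.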

<h2 id="unfurl-previews">Unfurl Previews</h2>

<p>Just a quick note, but the various unfurl nuances between different platforms are an absolute headache. Someone needs to sort that out soon. There was basically this decision pipeline for me every time:</p>

<p><img src="/images/tennis-scorigami/unfurl-pipeline.png" alt="unfurl-decision-tree" class="center-small lightbox-image" /></p>
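In practice, most of that pipeline collapses to declaring both Open Graph and Twitter card metadata up front. Here is a hedged sketch in the shape of Next.js App Router's `metadata` export (in a real app you would type it as `Metadata` from <code>next</code> and export it from a layout or page; the titles, descriptions, and image path here are illustrative, not the site's actual values):

```typescript
// Hypothetical Next.js App Router metadata object covering the main unfurl
// targets: Open Graph (Slack, iMessage, LinkedIn, ...) plus Twitter card tags.
const metadata = {
  title: "Tennis Scorigami",
  description: "Which tennis scorelines have never happened?",
  openGraph: {
    title: "Tennis Scorigami",
    description: "Which tennis scorelines have never happened?",
    // 1200x630 is the de facto safe image size across platforms
    images: [{ url: "/og-image.png", width: 1200, height: 630 }],
  },
  twitter: {
    card: "summary_large_image",
    title: "Tennis Scorigami",
    images: ["/og-image.png"],
  },
};
```

The annoying part is that some platforms only read Open Graph tags, while Twitter prefers its own `twitter:*` tags, so you end up declaring both.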

<h1 id="surprises">Surprises</h1>

<div class="markdown-alert markdown-alert-note">
<p>I was very surprised (and satisfied) when we saw the 8.5 number. 8.5% seemed shockingly low given how many 5-set matches have been played (note: tiebreak in the fifth). Part of it is that 6-0 and 0-6 sets are very uncommon, which leaves a huge part of the search tree never played before. I chatted with ChatGPT or Claude about this as well, and thought the response was interesting (and partially helped convince me).</p>
<p>That being said, if there was a data quality or data ingestion issue, don't kill the messenger.</p>
</div>

<p>There are numerous reasons for this though. Here are some of them:</p>

<ul>
  <li>there are 108k possible final score outcomes</li>
  <li>there have only been ~40k five-set matches in the Open Era
    <ul>
      <li>that’s a rough estimate but I have only loaded in 40k</li>
    </ul>
  </li>
  <li>we’re excluding 2025
    <ul>
      <li>this is exciting, it means there are lots of scorigamis left to happen</li>
    </ul>
  </li>
  <li>concentrated probability distribution
    <ul>
      <li>consider this valid but unlikely scoreline: <code class="language-plaintext highlighter-rouge">6-7, 7-6, 6-0, 1-6, 7-5</code></li>
      <li>that would require:
        <ul>
          <li>losing a tiebreak</li>
          <li>then winning a tiebreak</li>
          <li>then bageling opponent</li>
          <li>then getting destroyed 1-6</li>
          <li>then a battle in the final set</li>
        </ul>
      </li>
      <li>this is obviously possible but unlikely at the pro level</li>
    </ul>
  </li>
  <li>order matters
    <ul>
      <li>permutations not combinations! so that increases the space</li>
    </ul>
  </li>
</ul>
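The 108k figure in the first bullet is quick to sanity-check: a set can be won by 7 distinct scores (6-0 through 7-6), and there are 1, 3, and 6 W/L orderings for 3-, 4-, and 5-set matches respectively:

```typescript
// Count distinct best-of-5 final scorelines: 7 winning set scores per set,
// times the number of W/L orderings for each match length.
const SET_SCORES = 7; // 6-0, 6-1, 6-2, 6-3, 6-4, 7-5, 7-6
const patterns = [
  { sets: 3, orderings: 1 }, // WWW
  { sets: 4, orderings: 3 }, // lone L in one of the first 3 slots
  { sets: 5, orderings: 6 }, // two L's among the first 4 slots: C(4,2)
];
const total = patterns.reduce(
  (sum, p) => sum + p.orderings * SET_SCORES ** p.sets,
  0
);
console.log(total); // 108388

// Observed coverage with roughly 9,200 distinct finals actually played:
console.log(((9200 / total) * 100).toFixed(1)); // "8.5"
```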

<p>Here’s an excerpt from the conversation I had with ChatGPT about this:</p>

<blockquote>
  <p>Below is a concise, self-contained recap of the <strong>math</strong> followed by a list of reasons the realised coverage is only ≈ 8 %.</p>

  <hr />

  <h4 id="the-counting">The counting</h4>

  <table>
    <thead>
      <tr>
        <th>Step</th>
        <th>What we are counting</th>
        <th>Result</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td><strong>1. Set-level possibilities</strong></td>
        <td>A player can <strong>win</strong> a set by any of seven legal scores (6-0 … 7-6). From the winner’s viewpoint those 7 options are all we need for every set.</td>
        <td>7 outcomes per set</td>
      </tr>
      <tr>
        <td><strong>2. Match-length patterns</strong></td>
        <td>Arrange “W” (winner’s set) and “L” (loser’s set) with the <strong>last set forced to be W</strong>: <br />• 3-0 → <code class="language-plaintext highlighter-rouge">WWW</code> → 1 pattern <br />• 3-1 → choose where the lone <code class="language-plaintext highlighter-rouge">L</code> goes among the first 3 slots ⇒ C(3,1)=3 <br />• 3-2 → choose where the two <code class="language-plaintext highlighter-rouge">L</code>’s go among the first 4 slots ⇒ C(4,2)=6</td>
        <td>1 + 3 + 6 = <strong>10 patterns</strong></td>
      </tr>
      <tr>
        <td><strong>3. Fill each pattern with scores</strong></td>
        <td>Every set in a k-set pattern has 7 choices ⇒ $7^{k}$. <br />• 3-set patterns: $7^{3}=343$ <br />• 4-set patterns: $7^{4}=2\,401$ <br />• 5-set patterns: $7^{5}=16\,807$</td>
        <td>multiply per pattern</td>
      </tr>
      <tr>
        <td><strong>4. Grand total</strong></td>
        <td>$\sum \text{patterns}\times 7^{k}$</td>
        <td>$1\times 343 \;+\; 3\times 2\,401 \;+\; 6\times 16\,807 = \boxed{108\,388}$ distinct best-of-5 finals</td>
      </tr>
    </tbody>
  </table>

  <hr />

  <h4 id="why-only-8--have-shown-up-on-court">Why only ~8 % have shown up on court</h4>

  <ol>
    <li>
      <p><strong>Matches &lt; boxes</strong>
The Open-era database has only <strong>≈ 45 000</strong> men’s best-of-5 matches. Even if every one were unique we could fill at most 45 000 / 108 388 ≈ 41 % of the grid. Coverage must start below that ceiling.</p>
    </li>
    <li>
      <p><strong>Uneven probability mass</strong>
Tennis scoring is <em>not</em> uniform. Straight-set wins with mundane margins (6-3, 6-4, 6-4) occur constantly, while exotic mixtures such as 6-7, 7-6, 6-0, 1-6, 7-5 require a domination-collapse-rebound sequence that is extraordinarily rare. Most of the 108 388 boxes sit under microscopic probability.</p>
    </li>
    <li>
      <p><strong>Five-set inflation</strong>
Five-setters account for <strong>93 %</strong> of the theoretical space (16 807 × 6 patterns), yet only about <strong>15 %</strong> of real matches go the full distance. That mismatch alone wipes out a huge swath of cells.</p>
    </li>
    <li>
      <p><strong>Correlation between sets</strong>
Set scores are not independent draws: if one player is dominating (say 6-1, 6-2) the conditional chance of the next set being a nail-biter (7-6) is lower than the unconditional 1/7. Correlation compresses the practical outcome space far below the naïve $7^{k}$.</p>
    </li>
  </ol>

  <p>Put differently:</p>

\[\text{Observed coverage} \;=\; \frac{\approx 9\,200 \;\text{distinct finals}}{108\,388} \;\approx\; 8.5\%.\]

  <p>Given the <strong>combinatorial overhang</strong>, the <strong>skewed probability distribution</strong>, and the <strong>shortfall of five-set matches</strong>, landing in the single-digit-percent range is not a surprise—it is the mathematically natural result.</p>
</blockquote>

<h1 id="tech-stack">Tech Stack</h1>

<p>Rather than just word vomit what I used here, I’m actually going to just defer to <a href="https://www.anthropic.com/claude-code">Claude</a> and then summarize with a <a href="https://www.mermaidchart.com/">Mermaid</a> mindmap. What a sentence.</p>

<pre><code class="language-mermaid">mindmap
  root((Tennis Scorigami))
    Frontend
      Next.js 14
        App Router
        RSC
        API Routes
      React 19
      TypeScript
      Tailwind CSS
        Custom Tennis Theme
        Dark Mode
      UI Libraries
        Radix UI
        Shadcn
      Visualization
        SigmaJS
        3D Force Graph
        Dynamic Imports
      State
        Jotai
        React Query
    Backend
      Python
        SQLAlchemy
        Alembic
        Pydantic
      Data Sources
        Sackmann Datasets
        SportRadar API
    Database
      PostgreSQL
        Neon
          Neon MCP
          Connection Pooling
        Supabase
        Aiven
      Drizzle ORM
        Type Safety
        Migrations
      Optimization
        Materialized Views
        Strategic Indexes
        Graph Structure
    Infrastructure
      Deployment
        Vercel
        Edge Functions
      Monitoring
        PostHog Analytics
        Error Tracking
      Performance
        CDN Caching
        SSL/TLS
        Turbopack
</code></pre>

<div class="image-caption">Technology stack mindmap</div>
<p><br /></p>

<p>Here’s a fun example of Neon MCP plugging away with Claude Code (feel free to click on the image to magnify):</p>

<p><img src="/images/tennis-scorigami/neon-mcp-example.png" alt="neon-mcp-example" class="center-small lightbox-image" /></p>

<h1 id="engineering--design">Engineering + Design</h1>

<p>Again, I’m short on time for this blog post (always willing to chat more about it), so I’m just going to summarize with a Claude-generated Mermaid diagram as well. I know it’s a bit tough to see, so feel free to zoom in.</p>

<pre><code class="language-mermaid">graph TB
    %%=== Node Classes for Better Readability ===%%
    classDef source fill:#4F46E5,color:#fff,stroke:#4338CA,stroke-width:2px
    classDef db fill:#059669,color:#fff,stroke:#047857,stroke-width:2px
    classDef pipeline fill:#EA580C,color:#fff,stroke:#C2410C,stroke-width:2px
    classDef api fill:#DC2626,color:#fff,stroke:#B91C1C,stroke-width:2px
    classDef frontend fill:#7C3AED,color:#fff,stroke:#6D28D9,stroke-width:2px
    classDef infra fill:#0891B2,color:#fff,stroke:#0E7490,stroke-width:2px
    classDef viz fill:#DB2777,color:#fff,stroke:#BE185D,stroke-width:2px
    classDef analytics fill:#16A34A,color:#fff,stroke:#15803D,stroke-width:2px
    classDef neutral fill:#6B7280,color:#fff,stroke:#4B5563,stroke-width:2px
    classDef faded fill:#E5E7EB,color:#374151,stroke:#D1D5DB,stroke-width:1px

    %%=== Data Sources ===%%
    subgraph "Data Sources"
        DS1[Jeff Sackmann Datasets]
        DS2[SportRadar API]
        class DS1,DS2 source
    end

    %%=== Data Pipeline ===%%
    subgraph "Data Pipeline (Python)"
        PY1[SQLAlchemy Models]
        PY2[Data Ingestion Scripts]
        PY3[Alembic Migrations]
        PY4[Batch Processing]
        class PY1,PY2,PY3,PY4 pipeline

        DS1 --&gt; PY2
        DS2 --&gt; PY2
        PY2 --&gt; PY1
        PY1 --&gt; PY4
        PY3 --&gt; PY1
    end

    %%=== Database ===%%
    subgraph "Database Layer (PostgreSQL on Neon)"
        DB1[(Main Tables)]
        DB2[(Score Sequences)]
        DB3[(Materialized Views)]
        class DB1,DB2,DB3 db

        DB1 --&gt; |"players, matches, tournaments"| DB2
        DB2 --&gt; |"Pre-computed aggregations"| DB3

        subgraph "Materialized Views"
            MV1[mv_graph_sets_3_men]
            MV2[mv_graph_sets_5_men]
            MV3[mv_graph_sets_3_women]
            MV4[...more views]
            class MV1,MV2,MV3,MV4 db
        end

        DB3 --&gt; MV1
        DB3 --&gt; MV2
        DB3 --&gt; MV3
        DB3 --&gt; MV4
    end

    %%=== API Layer ===%%
    subgraph "API Layer (Next.js)"
        API1["/api/v1/matches"]
        API2["/api/v1/graph"]
        API3["/api/v1/search"]
        API4["/api/v1/filters"]
        API5["/api/v1/scores"]
        class API1,API2,API3,API4,API5 api

        CACHE[Cache Layer&lt;br/&gt;5 min revalidation]
        class CACHE neutral

        API1 --&gt; CACHE
        API2 --&gt; CACHE
        API3 --&gt; CACHE
        API4 --&gt; CACHE
        API5 --&gt; CACHE
    end

    %%=== Frontend ===%%
    subgraph "Frontend (Next.js App Router)"
        subgraph "Pages"
            P1[Home Page]
            P2[Explore Page]
            P3[Search Page]
            P4[About]
            class P1,P2,P3,P4 frontend
        end

        subgraph "State Management"
            J1[Jotai Atoms]
            J2[Graph Controls]
            J3[Filter State]
            class J1,J2,J3 frontend
        end

        subgraph "Visualization Components"
            V1[SigmaJS 2D Graph]
            V2[3D Force Graph]
            V3[Streaming Graph]
            class V1,V2,V3 viz
        end

        P2 --&gt; V1
        P2 --&gt; V2
        P2 --&gt; V3

        J1 --&gt; J2
        J1 --&gt; J3
        J2 --&gt; V1
        J2 --&gt; V2
        J3 --&gt; API2
    end

    %%=== Infrastructure ===%%
    subgraph "Infrastructure"
        I1[Drizzle ORM]
        I2[Connection Pool]
        I3[SSL/TLS]
        I4[PostHog Analytics]
        I5[Turbopack]
        class I1,I2,I3,I5 infra
        class I4 analytics
    end

    PY4 --&gt; DB1
    DB3 --&gt; I1
    I1 --&gt; I2
    I2 --&gt; API1
    I2 --&gt; API2
    I2 --&gt; API3
    I2 --&gt; API4
    I2 --&gt; API5

    CACHE --&gt; P1
    CACHE --&gt; P2
    CACHE --&gt; P3
    CACHE --&gt; P4

    P1 --&gt; I4
    P2 --&gt; I4
</code></pre>

<h1 id="other-fun-visualizations">Other Fun Visualizations</h1>

<h2 id="player-rank-history">Player Rank History</h2>

<p>I was intrigued when I saw that <a href="https://github.com/jeffsackmann">Sackmann</a> also had week-over-week player rank history. There’s more I want to do with <a href="https://www.tennis-scorigami.com/">tennis-scorigami</a>, but for now, I thought it was fun to create some of these visualizations:</p>

<p><img src="/images/tennis-scorigami/atp_number_one_timeline.png" alt="atp-number-one-timeline" class="center-image lightbox-image" /></p>

<div class="image-caption">ATP #1 Ranking Timeline</div>
<p><br /></p>

<p><img src="/images/tennis-scorigami/wta_number_one_timeline.png" alt="wta-number-one-timeline" class="center-image lightbox-image" /></p>

<div class="image-caption">WTA #1 Ranking Timeline</div>
<p><br /></p>

<h1 id="conclusion">Conclusion</h1>

<p>This was an awesome project to work on and I still think there’s a ton we could do here. If you find any data quality issues, please reach out. Some thoughts about where we could take this:</p>

<ul>
  <li>Richer match data leading to embeddings and vector search</li>
  <li>WebRTC for real time collaboration in some way</li>
  <li>Popularity of searches</li>
  <li>More exposure of conditional probabilities given a player and a score, what might happen next
    <ul>
      <li>Perhaps leverage this into some type of betting.</li>
    </ul>
  </li>
</ul>

<p>As always, feel free to reach out with any questions.</p>]]></content><author><name>johnlarkin1</name></author><category term="⭐️ Favorites" /><category term="Development" /><category term="Friends" /><summary type="html"><![CDATA[This post focuses on how we built our Tennis Scorigami project from a technical standpoint. I’ll discuss the current architecture, some of the design decisions I made, and where I want the project to go next.]]></summary></entry><entry><title type="html">Walk in the Parquet</title><link href="https://johnlarkin1.github.io/2025/walk-in-the-parquet/" rel="alternate" type="text/html" title="Walk in the Parquet" /><published>2025-03-31T00:00:00+00:00</published><updated>2025-03-31T00:00:00+00:00</updated><id>https://johnlarkin1.github.io/2025/walk-in-the-parquet</id><content type="html" xml:base="https://johnlarkin1.github.io/2025/walk-in-the-parquet/"><![CDATA[<p>At <a href="https://mojo.com/">Mojo</a>, we use <a href="https://parquet.apache.org/">Parquet</a> files to store some of our simulation data. I - however - have been increasingly frustrated by the lack of support on macOS for natively viewing them. They are (normally) compressed with the <a href="https://en.wikipedia.org/wiki/Snappy_(compression)">snappy</a> algorithm, and Apple doesn’t have a native application to open them.</p>

<p>So I decided to build one - to help myself out, my teammates at work out, and hopefully some other random engineers out in the wild. In the very least, this blog post will detail how you can build your own desktop application, specifically in this case using <a href="https://tauri.app/">Tauri</a>.</p>

<p>There’s more information (i.e. lame marketing) here: <a href="https://www.walkintheparquet.com/">walkintheparquet.com</a>. Here’s an iframe if you don’t want to leave this page:</p>

<div style="text-align: center;">
<div style="max-width: 800px; margin: 0 auto; box-shadow: 0 12px 28px rgba(0, 0, 0, 0.4), 0 0 0 1px rgba(255, 255, 255, 0.05); border-radius: 4px; overflow: hidden; background-color: #1a1a1a; padding: 15px;">
<iframe src="https://www.walkintheparquet.com/" width="100%" height="450px" frameborder="0" allowfullscreen=""></iframe>
</div>
</div>

<p><br /></p>

<div class="markdown-alert markdown-alert-tip">
<p>Also! If you have feature requests or issues, you can head over to the Canny board for this project and leave some notes. There's a link at the bottom of the main website (/ iframe above), or you can go <a href="https://walk-in-the-parquet.canny.io/">here</a>. Obviously feel free to email me too!</p>
</div>

<!--
# Table of Contents

- [Table of Contents](#table-of-contents)
- [Driving Motivation](#driving-motivation)
  - [Are there really no other solutions?](#are-there-really-no-other-solutions)
- [What is Parquet?](#what-is-parquet)
- [What's the Problem?](#whats-the-problem)
- [Engineering + Design](#engineering--design)
  - [Desktop Decisions](#desktop-decisions)
  - [Challenges and Conquests](#challenges-and-conquests)
    - [Documentation](#documentation)
    - [Supporting Structs](#supporting-structs)
    - [App Store Annoyance](#app-store-annoyance)
- [Conclusion](#conclusion)
  - [Kudos](#kudos)
-->

<h1 id="driving-motivation">Driving Motivation</h1>

<p>To inspire you a little bit, here’s what we’ve built. This is also available (again) <a href="https://www.walkintheparquet.com/">here</a>, but also available on the <a href="https://apps.apple.com/us/app/walk-in-the-parquet/id6743959514?mt=12">App Store</a>.</p>

<p><img src="/images/walk-in-the-parquet/slideshow.gif" alt="walk-in-the-parquet-show" class="center-image" /></p>

<p>And here it is in the <a href="https://apps.apple.com/us/app/walk-in-the-parquet/id6743959514?mt=12">App Store</a>:</p>

<p><img src="/images/walk-in-the-parquet/app-store.png" alt="app-store" class="center-small" /></p>

<p>Again, this blog post is going to talk a bit more about the actual building process, but if you want to see more about the product and download it, head over to the <a href="https://www.walkintheparquet.com/">main website</a>.</p>

<p>And before we go any deeper, I know there’s this question…</p>

<h2 id="are-there-really-no-other-solutions">Are there really no other solutions?</h2>

<p>Yeah I mean there’s this:</p>

<div style="display: flex; justify-content: center; align-items: center;">
    <div style="margin: 10px;">
        <img src="/images/walk-in-the-parquet/other-app-store-pt1.png" alt="Image 1" style="width: 300px; height: auto;" />
    </div>
    <div style="margin: 10px;">
        <img src="/images/walk-in-the-parquet/other-app-store-pt2.png" alt="Image 2" style="width: 600px; height: auto;" />
    </div>
</div>
<div class="image-caption">Can't wait until people are leaving me the same reviews. But hey mine is free. And it's a v1. </div>
<p><br /></p>

<p>Which… yeah, I don’t love people trying to charge for that.</p>

<p>The one that I’ve seen the most is this: <a href="https://www.parquet-viewer.com/">https://www.parquet-viewer.com/</a>, which I use but they <em>also</em> have some dumb paywalling features. And to be honest, I don’t really love uploading potentially sensitive files to the web.</p>

<p>And finally, it seems like the third alternative is a VSCode Extension, but the downside there is it’s just <code class="language-plaintext highlighter-rouge">json</code> I believe. Again, totally fine - I won’t be upset if you want to do that. It’s not as smooth to query or see some top level analytics, but c’est la vie.</p>

<h1 id="what-is-parquet">What is Parquet?</h1>

<p>Let’s back up a little bit for those not even familiar with Parquet. I won’t go into too much detail because there’s enough information out there on the web, but I’ll give a high level overview.</p>

<p><a href="https://parquet.apache.org/">Parquet</a> is a file type that is optimized and used prevalently for data processing frameworks. It was introduced by Apache and has numerous benefits, some of which I’ll get into below.</p>

<p>The big distinguisher for Parquet is that it’s a columnar storage format. This has some big wins, especially in terms of compression. For example, if a column stores many repeated player-ids, it can achieve a vastly higher compression rate because the column has so much redundancy.</p>
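As a toy illustration of why that redundancy matters, here is run-length encoding, one of several encodings Parquet can apply per column (this is a teaching sketch, not Parquet's actual on-disk format):

```typescript
// Toy run-length encoding: a repetitive column collapses to (value, count)
// pairs, whereas row-oriented storage would interleave these values with
// every other field in the row, destroying the runs.
function runLengthEncode<T>(column: T[]): Array<[T, number]> {
  const runs: Array<[T, number]> = [];
  for (const value of column) {
    const last = runs[runs.length - 1];
    if (last && last[0] === value) last[1] += 1;
    else runs.push([value, 1]);
  }
  return runs;
}

const playerIds = ["p1", "p1", "p1", "p1", "p2", "p2", "p3"];
console.log(runLengthEncode(playerIds));
// [["p1", 4], ["p2", 2], ["p3", 1]]: 7 values stored as 3 runs
```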

<p>These columns are then split into row groups, which are logical, horizontal partitions of the data.</p>

<p>These row groups are incredibly clutch for parallel processing because each one can be read independently. Furthermore, readers can use row-group metadata to skip ahead, so only the <em>relevant</em> row groups are read at all.</p>

<p>So this then becomes a bit of a hyperparameter performance optimization question, right? What’s the ideal size for your row groups? Well… yeah, this takes a bit of experimentation. There’s surprisingly little documentation around what is best, but you’re generally trading compression off against performance. I’ve seen people suggest row groups of around 128MB to 512MB, and AWS seems to default to 128MB <a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/apache-iceberg-on-aws/best-practices-read.html">here</a>.</p>

<p>You can think about it like so:</p>

<table>
  <thead>
    <tr>
      <th>Row Group Size</th>
      <th>Pros</th>
      <th>Cons</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Larger Row Groups</strong></td>
      <td>- Better read performance (fewer metadata reads, more sequential IO)</td>
      <td>- Higher memory usage during write</td>
    </tr>
    <tr>
      <td> </td>
      <td>- Better compression (larger chunks compress more efficiently)</td>
      <td>- Slower write performance if memory is constrained</td>
    </tr>
    <tr>
      <td><strong>Smaller Row Groups</strong></td>
      <td>- Lower memory usage during write</td>
      <td>- Slower reads (more metadata overhead and disk seeks)</td>
    </tr>
    <tr>
      <td> </td>
      <td>- Faster writes in streaming or frequent-flush scenarios</td>
      <td>- Worse compression</td>
    </tr>
    <tr>
      <td> </td>
      <td> </td>
      <td>- Less effective filtering (min/max stats less meaningful)</td>
    </tr>
  </tbody>
</table>
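Given that 128MB-512MB guidance, picking a row count per group is just arithmetic on your average row size. A back-of-envelope helper (the 200-byte average row is a made-up example; substitute whatever your schema actually averages):

```typescript
// Back-of-envelope: how many rows per row group to hit a target group size.
function rowsPerGroup(targetBytes: number, avgRowBytes: number): number {
  return Math.floor(targetBytes / avgRowBytes);
}

const MB = 1024 * 1024;
// e.g. ~200 bytes/row (illustrative) targeting the 128MB lower bound:
console.log(rowsPerGroup(128 * MB, 200)); // 671088
```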

<p>There’s also other fields like column chunks and pages. The best overview I’ve seen is actually from <a href="https://celerdata.com/">CelerData</a> <a href="https://celerdata.com/glossary/parquet-file-format">here</a>.</p>

<p>This image from CelerData does a good job breaking out the different parts of the underlying structure:</p>

<p><img src="https://parquet.apache.org/images/FileLayout.gif" alt="parquet-file-layout" class="center-shrink" /></p>

<div class="image-caption">Full credit to CelerData for the image</div>
<p><br /></p>

<p>But! If you don’t like that one, noooo worries. Databricks has a $62B valuation and they also wrote about it <a href="https://www.databricks.com/glossary/what-is-parquet">here</a>. So feel free to check out some other links.</p>

<h1 id="whats-the-problem">What’s the Problem?</h1>

<p>Well, the problem I wanted to address is that there’s not a great way to open these files. I discussed some alternatives and their downsides above, but it’s dumb that my only options were a paywalled App Store app or uploading things to the web and some dude’s random server.</p>

<p><img src="/images/walk-in-the-parquet/no-default-application.png" alt="app-store" class="center-small" /></p>

<p>The other problem? I haven’t worked with Rust in a while, and I still desperately want to get better at it, so that was the selfish motivation. It’s a borderline smooth transition into the next section.</p>

<h1 id="engineering--design">Engineering + Design</h1>

<h2 id="desktop-decisions">Desktop Decisions</h2>

<p>Ah the desktop application game - what a question.</p>

<p>Now, I’ve worked with <a href="https://www.electronjs.org/">Electron</a> at <a href="https://www.dropbox.com/">Dropbox</a>, so I was generally familiar with that architecture and paradigm, although it has been a minute since I’ve dealt with <a href="https://www.electronjs.org/docs/latest/tutorial/process-model#preload-scripts">preload scripts</a> or the <a href="https://www.electronjs.org/docs/latest/tutorial/ipc">ipcMain vs ipcRenderer distinction</a>.</p>

<p>The downside (in this case), and why I didn’t choose Electron, was that I didn’t really want an all-TypeScript backend.</p>

<p>Truthfully, I really wanted a Python backend, both because that’s what I’m best at, and because I wanted to use <a href="https://duckdb.org/"><code class="language-plaintext highlighter-rouge">duckdb</code></a> for loading the files and doing analysis quickly, on-disk, and keeping things lightweight. But I hadn’t loaded Parquet files in TypeScript before, I’d been seeing more and more about <a href="https://tauri.app/">Tauri</a>, and I figured this was a better use case for it.</p>

<p>Additionally, I knew from googling Apache Parquet documentation (at work we integrate with <a href="https://go.dev/">Golang</a>, <a href="https://cplusplus.com/">C++</a>, and <a href="https://www.python.org/">Python</a>) that they DO have Rust support. I remember this vividly because I personally think most of the documentation Apache puts out blows. The other noticeable benefit of Rust and Tauri is that Tauri produces a much lighter-weight desktop application.</p>

<p><a href="https://www.coditation.com/">Coditation</a> sums it up pretty well below (<a href="https://www.coditation.com/blog/electron-vs-tauri#:~:text=and%20CPU%20resources.-,Tauri%20is%20designed%20to%20be%20more%20lightweight%20and%20faster%20than,run%20more%20efficiently%20than%20Electron.">ref</a>):</p>

<blockquote>
  <p>Tauri is designed to be more lightweight and faster than Electron, as it uses less memory and CPU resources, which means that Tauri is designed to run more efficiently than Electron.</p>

  <p>Tauri uses Rust as a native layer instead of JavaScript and web technologies, which results in lower memory usage and CPU usage compared to Electron. Additionally, Tauri is also designed to be more lightweight overall, which means that it has less overhead and a smaller binary size than Electron.</p>
</blockquote>

<p>In other words,</p>

<table>
  <thead>
    <tr>
      <th>Feature</th>
      <th>Electron</th>
      <th>Tauri (what I picked)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Backend Lang</td>
      <td>JavaScript/TS</td>
      <td>Rust</td>
    </tr>
    <tr>
      <td>Binary Size</td>
      <td>Large</td>
      <td>Small</td>
    </tr>
    <tr>
      <td>Memory Usage</td>
      <td>Higher</td>
      <td>Lower</td>
    </tr>
  </tbody>
</table>

<h2 id="challenges-and-conquests">Challenges and Conquests</h2>

<h3 id="documentation">Documentation</h3>

<p>So this is a core part of it: recently, I have been one of the many to get hit with the “yoooo how much did you vibe code” question. There have been <a href="https://www.reddit.com/r/ProgrammerHumor/comments/1jcjrzf/vibecoding/?utm_source=share&amp;utm_medium=web3x&amp;utm_name=web3xcss&amp;utm_term=1&amp;utm_content=share_button">many</a> <a href="https://preview.redd.it/viberagingnow-v0-1vjd0a87owpe1.jpeg?auto=webp&amp;s=00830b4959b1426e6280068dd59b528257aa8c3b">good</a> <a href="https://preview.redd.it/vibe-coding-v0-hwsv07yperre1.jpeg?auto=webp&amp;s=004dcaea56ead53a5c453efa24d93d174865fa57">memes</a> about this.</p>

<p>The thing is, <strong>I basically did vibecode the entire website. A simple, lightweight static Next.js frontend is a perfect use case for it</strong>. I’m not at all a frontend designer, so no, of course I’m not going to be ripping that manually or going into Figma first or anything like that. So that was lovely. Way faster and way quicker to ship.</p>

<p>The interesting part (at least for me) was how best to architect this with Tauri and handle that handoff. The challenges were about that design, as well as the blatant lack of LLMs trained on Tauri v2 and the Parquet crate versions I was using in Rust.</p>

<p>Specifically,</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>arrow = "54.3.0"
arrow-schema = "54.3.0"
parquet = "54.3.0"
</code></pre></div></div>

<p>these crates had virtually no LLM support (what a breath of fresh air).</p>

<h3 id="supporting-structs">Supporting Structs</h3>

<p>As a result, it meant working from the documentation and figuring out exactly why some of my string data was being parsed as a <code class="language-plaintext highlighter-rouge">Utf8View</code> vs a <code class="language-plaintext highlighter-rouge">Utf8</code>.</p>

<p>In terms of code, it meant that I had in my <code class="language-plaintext highlighter-rouge">sql.rs</code> parsing engine, a match statement (one of Rust’s best features imo) like this:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// ... many more types before</span>

        <span class="nn">DataType</span><span class="p">::</span><span class="n">Int64</span> <span class="k">=&gt;</span> <span class="p">{</span>
            <span class="k">let</span> <span class="n">array</span> <span class="o">=</span> <span class="n">column</span>
                <span class="nf">.as_any</span><span class="p">()</span>
                <span class="py">.downcast_ref</span><span class="p">::</span><span class="o">&lt;</span><span class="n">Int64Array</span><span class="o">&gt;</span><span class="p">()</span>
                <span class="nf">.ok_or_else</span><span class="p">(||</span> <span class="nn">QueryError</span><span class="p">::</span><span class="nf">Other</span><span class="p">(</span><span class="s">"Failed to downcast to Int64Array"</span><span class="nf">.to_string</span><span class="p">()))</span><span class="o">?</span><span class="p">;</span>
            <span class="nn">serde_json</span><span class="p">::</span><span class="nn">Value</span><span class="p">::</span><span class="nf">Number</span><span class="p">(</span><span class="nn">serde_json</span><span class="p">::</span><span class="nn">Number</span><span class="p">::</span><span class="nf">from</span><span class="p">(</span><span class="n">array</span><span class="nf">.value</span><span class="p">(</span><span class="n">row_idx</span><span class="p">)))</span>
        <span class="p">}</span>

        <span class="nn">DataType</span><span class="p">::</span><span class="n">UInt8</span> <span class="k">=&gt;</span> <span class="p">{</span>
            <span class="k">let</span> <span class="n">array</span> <span class="o">=</span> <span class="n">column</span>
                <span class="nf">.as_any</span><span class="p">()</span>
                <span class="py">.downcast_ref</span><span class="p">::</span><span class="o">&lt;</span><span class="n">UInt8Array</span><span class="o">&gt;</span><span class="p">()</span>
                <span class="nf">.ok_or_else</span><span class="p">(||</span> <span class="nn">QueryError</span><span class="p">::</span><span class="nf">Other</span><span class="p">(</span><span class="s">"Failed to downcast to UInt8Array"</span><span class="nf">.to_string</span><span class="p">()))</span><span class="o">?</span><span class="p">;</span>
            <span class="nn">serde_json</span><span class="p">::</span><span class="nn">Value</span><span class="p">::</span><span class="nf">Number</span><span class="p">(</span><span class="nn">serde_json</span><span class="p">::</span><span class="nn">Number</span><span class="p">::</span><span class="nf">from</span><span class="p">(</span><span class="n">array</span><span class="nf">.value</span><span class="p">(</span><span class="n">row_idx</span><span class="p">)</span> <span class="k">as</span> <span class="nb">u64</span><span class="p">))</span>
        <span class="p">}</span>

        <span class="nn">DataType</span><span class="p">::</span><span class="n">UInt16</span> <span class="k">=&gt;</span> <span class="p">{</span>
            <span class="k">let</span> <span class="n">array</span> <span class="o">=</span> <span class="n">column</span>
                <span class="nf">.as_any</span><span class="p">()</span>
                <span class="py">.downcast_ref</span><span class="p">::</span><span class="o">&lt;</span><span class="n">UInt16Array</span><span class="o">&gt;</span><span class="p">()</span>
                <span class="nf">.ok_or_else</span><span class="p">(||</span> <span class="p">{</span>
                    <span class="nn">QueryError</span><span class="p">::</span><span class="nf">Other</span><span class="p">(</span><span class="s">"Failed to downcast to UInt16Array"</span><span class="nf">.to_string</span><span class="p">())</span>
                <span class="p">})</span><span class="o">?</span><span class="p">;</span>
            <span class="nn">serde_json</span><span class="p">::</span><span class="nn">Value</span><span class="p">::</span><span class="nf">Number</span><span class="p">(</span><span class="nn">serde_json</span><span class="p">::</span><span class="nn">Number</span><span class="p">::</span><span class="nf">from</span><span class="p">(</span><span class="n">array</span><span class="nf">.value</span><span class="p">(</span><span class="n">row_idx</span><span class="p">)</span> <span class="k">as</span> <span class="nb">u64</span><span class="p">))</span>
        <span class="p">}</span>

<span class="c1">// ... many more types after</span>
</code></pre></div></div>

<p>There are numerous <code class="language-plaintext highlighter-rouge">DataType</code>s that get pulled in with <code class="language-plaintext highlighter-rouge">use datafusion::arrow::datatypes::*;</code>. I tried to handle most of them, but of course Parquet files can be arbitrarily complex, so as a <code class="language-plaintext highlighter-rouge">v1.0.0</code> I am not promising complete support. There is basic support for nested structures, as seen here:</p>

<p><img src="/images/walk-in-the-parquet/nested-structure.png" alt="nested-structure" class="center-shrink" /></p>

<p>However, handling this recursively is a bit of a challenge, and I expect there are some corner cases I missed.</p>
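<p>To make the shape of that recursion concrete, here’s a minimal, self-contained sketch. To be clear, this is not the real arrow-rs API: the <code class="language-plaintext highlighter-rouge">Cell</code> enum and <code class="language-plaintext highlighter-rouge">to_json</code> function are toy stand-ins I made up for illustration. The actual code does the same walk over Arrow’s struct and list children.</p>

```rust
// Illustrative only: `Cell` is a toy stand-in for nested Arrow values,
// not the real arrow-rs types. The recursion pattern is the point.
#[derive(Debug)]
enum Cell {
    Int(i64),
    Str(String),
    List(Vec<Cell>),
    Struct(Vec<(String, Cell)>),
}

fn to_json(cell: &Cell) -> String {
    match cell {
        // Leaf values serialize directly...
        Cell::Int(n) => n.to_string(),
        Cell::Str(s) => format!("\"{}\"", s),
        // ...while lists and structs recurse into their children,
        // which is exactly where the missed corner cases tend to hide.
        Cell::List(items) => {
            let inner: Vec<String> = items.iter().map(to_json).collect();
            format!("[{}]", inner.join(","))
        }
        Cell::Struct(fields) => {
            let inner: Vec<String> = fields
                .iter()
                .map(|(name, value)| format!("\"{}\":{}", name, to_json(value)))
                .collect();
            format!("{{{}}}", inner.join(","))
        }
    }
}
```

<p>The nice property of this approach is that arbitrarily deep struct-of-list-of-struct nesting falls out of the same four match arms; the hard part in the real version is that each nested Arrow array has to be downcast to its concrete type before you can recurse.</p>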

<h3 id="app-store-annoyance">App Store Annoyance</h3>

<p>By far, though, the biggest learning was around bundling a package for the App Store and the numerous steps to get that going.</p>

<p>There’s already quite a bit out there about <a href="https://developer.apple.com/documentation/security/notarizing-macos-software-before-distribution">notarization</a> and <a href="https://support.apple.com/guide/security/app-code-signing-process-sec7c917bf14/web">code-signing</a>, but I think the most helpful thing was putting it all into a <code class="language-plaintext highlighter-rouge">post-build.sh</code> script.</p>

<p>So basically after running this:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>╭─johnlarkin@Mac ~/Documents/coding/walk-in-the-parquet ‹main›
╰─➤  npm run tauri build <span class="nt">--</span> <span class="nt">--target</span> universal-apple-darwin
</code></pre></div></div>

<p>You can get a <a href="https://v1.tauri.app/v1/guides/building/macos#binary-targets">fat client</a> from Tauri that will support both Apple silicon and Intel-based archs.</p>

<p>After doing that, the setup I finally landed on looked like this (thanks Claude for some of the emojis):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="nb">set</span> <span class="nt">-euo</span> pipefail

<span class="nv">APP_NAME</span><span class="o">=</span><span class="s2">"Walk in the Parquet"</span>
<span class="nv">ENTITLEMENTS_PATH</span><span class="o">=</span><span class="s2">"src-tauri/entitlements.plist"</span>
<span class="nv">PKG_NAME</span><span class="o">=</span><span class="s2">"WalkInTheParquet.pkg"</span>

<span class="c"># I was trying to automatically detect the app-path / dmg-path, but I was</span>
<span class="c"># occasionally picking the wrong app / dmg, and yeah, I was too lazy to fix this</span>
<span class="c"># APP_PATH=$(find src-tauri/target/universal-apple-darwin -type d -name "$APP_NAME.app" | head -n 1)</span>
<span class="nv">APP_PATH</span><span class="o">=</span><span class="s2">"src-tauri/target/universal-apple-darwin/release/bundle/macos/Walk in the Parquet.app"</span>
<span class="nv">DMG_PATH</span><span class="o">=</span><span class="s2">"src-tauri/target/universal-apple-darwin/release/bundle/dmg/Walk in the Parquet_1.0.0_universal.dmg"</span>

<span class="c"># you basically need `APPLE_ISSUER_ID`, `APPLE_PARQUET_KEY_ID`</span>
<span class="c"># set up in your env</span>
<span class="k">if</span> <span class="o">[[</span> <span class="nt">-z</span> <span class="s2">"</span><span class="k">${</span><span class="nv">APPLE_ISSUER_ID</span><span class="k">:-}</span><span class="s2">"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then
  </span><span class="nb">echo</span> <span class="s2">"🚨 Error: Environment variable APPLE_ISSUER_ID is not set"</span>
  <span class="nb">exit </span>1
<span class="k">else
  </span><span class="nb">echo</span> <span class="s2">"✅ Environment variable APPLE_ISSUER_ID is set"</span>
<span class="k">fi

if</span> <span class="o">[[</span> <span class="nt">-z</span> <span class="s2">"</span><span class="k">${</span><span class="nv">APPLE_PARQUET_KEY_ID</span><span class="k">:-}</span><span class="s2">"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then
  </span><span class="nb">echo</span> <span class="s2">"🚨 Error: Environment variable APPLE_PARQUET_KEY_ID is not set"</span>
  <span class="nb">exit </span>1
<span class="k">else
  </span><span class="nb">echo</span> <span class="s2">"✅ Environment variable APPLE_PARQUET_KEY_ID is set"</span>
<span class="k">fi
</span><span class="nb">echo</span> <span class="s2">"🔑 All required environment variables are set"</span>

<span class="k">if</span> <span class="o">[[</span> <span class="o">!</span> <span class="nt">-d</span> <span class="s2">"</span><span class="nv">$APP_PATH</span><span class="s2">"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then
  </span><span class="nb">echo</span> <span class="s2">"🚨 Error: .app bundle not found at expected path: </span><span class="nv">$APP_PATH</span><span class="s2">"</span>
  <span class="nb">exit </span>1
<span class="k">else
  </span><span class="nb">echo</span> <span class="s2">"✅ .app bundle found at: </span><span class="nv">$APP_PATH</span><span class="s2">"</span>
<span class="k">fi
if</span> <span class="o">[[</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$DMG_PATH</span><span class="s2">"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then
  </span><span class="nb">echo</span> <span class="s2">"🚨 Error: DMG file not found!"</span>
  <span class="nb">exit </span>1
<span class="k">else
  </span><span class="nb">echo</span> <span class="s2">"✅ DMG file found at: </span><span class="nv">$DMG_PATH</span><span class="s2">"</span>
<span class="k">fi

</span><span class="nb">echo</span> <span class="s2">"🔐 Re-signing .app with entitlements using 3rd Party Application cert..."</span>
codesign <span class="nt">--entitlements</span> <span class="s2">"</span><span class="nv">$ENTITLEMENTS_PATH</span><span class="s2">"</span> <span class="nt">--deep</span> <span class="nt">--force</span> <span class="nt">--options</span> runtime <span class="se">\</span>
  <span class="nt">--sign</span> &lt;REDACTED-BUT-PUT-YOUR-KEYCHAIN-NAME-HERE&gt; <span class="s2">"</span><span class="nv">$APP_PATH</span><span class="s2">"</span>

<span class="nb">echo</span> <span class="s2">"🧳 Rebuilding and signing .pkg with 3rd Party Installer cert..."</span>
productbuild <span class="se">\</span>
  <span class="nt">--component</span> <span class="s2">"</span><span class="nv">$APP_PATH</span><span class="s2">"</span> /Applications <span class="se">\</span>
  <span class="nt">--sign</span> &lt;REDACTED-BUT-PUT-YOUR-KEYCHAIN-NAME-HERE&gt; <span class="se">\</span>
  <span class="s2">"</span><span class="nv">$PKG_NAME</span><span class="s2">"</span>

<span class="nb">echo</span> <span class="s2">"🚀 Submitting DMG to notarization..."</span>
xcrun notarytool submit <span class="s2">"</span><span class="nv">$DMG_PATH</span><span class="s2">"</span> <span class="se">\</span>
  <span class="nt">--key</span> &lt;THIS IS THE PATH TO YOUR p8 KEY you downloaded from APPLE&gt; <span class="se">\</span>
  <span class="nt">--key-id</span> <span class="s2">"</span><span class="nv">$APPLE_PARQUET_KEY_ID</span><span class="s2">"</span> <span class="se">\</span>
  <span class="nt">--issuer</span> <span class="s2">"</span><span class="nv">$APPLE_ISSUER_ID</span><span class="s2">"</span> <span class="se">\</span>
  <span class="nt">--keychain-profile</span> <span class="s2">"notarytool-password"</span> <span class="se">\</span>
  <span class="nt">--wait</span>

<span class="c"># check the arch, then staple and validate the notarization ticket</span>
lipo <span class="nt">-info</span> <span class="s2">"</span><span class="nv">$APP_PATH</span><span class="s2">/Contents/MacOS/walk-in-the-parquet"</span>
xcrun stapler staple <span class="s2">"</span><span class="nv">$DMG_PATH</span><span class="s2">"</span>
xcrun stapler validate <span class="s2">"</span><span class="nv">$DMG_PATH</span><span class="s2">"</span>
hdiutil imageinfo <span class="s2">"</span><span class="nv">$DMG_PATH</span><span class="s2">"</span> | <span class="nb">grep </span>Format

<span class="nb">echo</span> <span class="s2">"📦 .pkg ready to be uploaded via Transporter:"</span>
<span class="nb">echo</span> <span class="s2">"   -&gt; </span><span class="nv">$PKG_NAME</span><span class="s2">"</span>
<span class="nb">echo</span> <span class="s2">""</span>
<span class="nb">echo</span> <span class="s2">"🚀 Open Transporter and upload the package manually if needed."</span>
</code></pre></div></div>

<p>The key part is that you’ll want your <code class="language-plaintext highlighter-rouge">--sign</code> argument to be <code class="language-plaintext highlighter-rouge">3rd Party Mac Developer Application &lt;etc&gt;</code>. That is the third-party developer certificate you use for signing the app (with its Installer counterpart doing the same job for the .pkg).</p>
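<p>If you’re not sure what the exact identity string is on your machine, <code class="language-plaintext highlighter-rouge">security find-identity -v -p codesigning</code> will list them. Here’s a small sketch of pulling the Application cert name out of that output so you can paste it into <code class="language-plaintext highlighter-rouge">--sign</code>; note the sample output below (hashes, names, team ID) is made up for illustration:</p>

```shell
# On a real Mac, replace the sample variable with the actual command output:
#   security find-identity -v -p codesigning
# The identities below are fabricated for illustration.
sample_output='  1) ABC123DEF456 "3rd Party Mac Developer Application: Jane Doe (TEAMID1234)"
  2) 789GHI012JKL "3rd Party Mac Developer Installer: Jane Doe (TEAMID1234)"'

# Grab just the Application identity (the one codesign --sign wants),
# stripping everything but the quoted certificate name.
app_cert=$(printf '%s\n' "$sample_output" \
  | grep '3rd Party Mac Developer Application' \
  | sed -E 's/.*"(.*)".*/\1/')

echo "$app_cert"
# -> 3rd Party Mac Developer Application: Jane Doe (TEAMID1234)
```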

<h1 id="conclusion">Conclusion</h1>

<p>Anyway, I have sunk more time than allocated into this, but it was a fun project, and I’m looking forward to working on this in the future. If you have issues or feedback requests, feel free to blow up that Canny board.</p>

<p>Enjoy the application and I hope I’ve helped some random stranger out there.</p>

<h2 id="kudos">Kudos</h2>

<p>Oh also thank you to my girlfriend for coming up with the name. Better than what I could have thought of.</p>]]></content><author><name>johnlarkin1</name></author><category term="⭐️ Favorites" /><category term="Development" /><summary type="html"><![CDATA[At Mojo, we use Parquet files to store some of our simulation data. I - however - have been increasingly frustrated by the lack of support on MacOS to natively view them. They are (normally) compressed through a snappy algorithm, and Apple doesn’t have a native application to open them.]]></summary></entry></feed>