Motivation
Here is a motivating visual to build up some momentum to read on. This is our dashboard tool to compare various multi-armed bandit strategies. We'll understand this more thoroughly at the end of this blog post.
Context
Recently, I responded to some recruiters and fielded a couple of interviews.
I generally abhor interviewing. There are parts I absolutely love - meeting new people, learning about new technical challenges, studying up on businesses or industries - but there are also parts I dread. Getting grilled on usage of the Web Speech API (man oh man was I in the wrong interview) or on how to decode a string in 2026 does feel… a bit perplexing. I'll rant about it on Substack at some point.
However! I do genuinely enjoy take-homes (as exemplified by Book Brain), despite them often being a bigger time constraint and more of a commitment.
This blog post is going to go over a concept and problem that (embarrassingly enough) I hadn't seen before the take-home. For more context, I accepted another offer in the same timeframe and withdrew from this specific take-home process. It's unfortunate, too, because I genuinely believe the company will be a $10BN company in no time, and the engineering seems fascinating.
While I ultimately withdrew from this interviewing cycle and sent them only my thoughts on the problem, this blog post is going to talk about the take-home question I received from that company. I'm anonymizing the company to preserve the sanctity of their interview process.
The company restricted AI usage during the take-home, so I did a ton of research and watched a lot of YouTube videos. However, for this blog post, some implementation details will be left to Claude. The repo has documentation and detail, including various transcripts between Claude and me. So let's begin with the problem.
Setup
This blog post is going to focus on the multi-armed bandit problem, commonly abbreviated as MAB. There is a lot here, so I won't be able to cover everything, but I'll cover the parts that the corresponding GitHub repo covers.
Multi-Armed Bandit Problem (MAB)
The traditional multi-armed bandit is pretty well encapsulated by a hypothetical situation. I'll give you the long / fun version, and then I'll give you an abbreviated Wikipedia version.
Imagine, you wake up.
You live in a beautiful city (letβs say Cincinnati).

But then you realize you have too much money in your pockets. You decide to gamble (I discourage this, especially after seeing how the sausage is made).
So you hit the casino!
Because it's Cincinnati, this is a very nice casino - you actually have a chance to win. However, they only have single-armed bandits, commonly known as slot machines! These are unique slot machines, and their underlying probability distributions become more apparent over time.
Despite having too much money in your pockets, you love winning, so you do want to win. Your problem, therefore, is to figure out the optimal strategy: which machines to play, when to play them, how many times to play them, and when to switch.
Wikipedia more blandly (but also more succinctly) puts this as:
More generally, it is a problem in which a decision maker iteratively selects one of multiple fixed choices (i.e., arms or actions) when the properties of each choice are only partially known at the time of allocation, and may become better understood as time passes. A fundamental aspect of bandit problems is that choosing an arm does not affect the properties of the arm or other arms.[4]
Stochastic MAB Approaches
Before we go any further, let's fully dissect this problem.
There are really two main approaches that I covered in code and fully studied up on. I will not be talking about $\epsilon$-greedy approaches, but here are some other resources. We're actually going to focus on UCB vs Thompson Sampling, two methods that work very well. I'll discuss further below, in the implementation section, how I modified them to handle the take-home explicitly.
Upper Confidence Bound
The theory behind UCB is that we are trying to explore optimistically. UCB1 is meant to balance exploration against exploitation.
I am not going to go into the full derivation, but it uses something called Hoeffding's Inequality to build up a framework.
It eventually lets us get to:
\[UCB_i(t) = \bar{x}_i + \underbrace{c \cdot \sqrt{\frac{\ln(t)}{n_i}}}_{\text{exploration bonus}}\]
Where:
- $\bar{x}_i$ = empirical success rate of server $i$
- $t$ = total number of requests across all servers
- $n_i$ = number of times server $i$ has been tried
- $c$ = exploration constant (default: $\sqrt{2}$)
Normally you'll see this written with $c$ folded into the square root, but keeping that exploration bonus as its own term was key in my modified UCB approach.
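To make the formula concrete, here is a minimal UCB1 selection sketch in Python. The `ArmStats` container and the arm names are mine for illustration, not the repo's.

```python
import math
from dataclasses import dataclass

@dataclass
class ArmStats:
    successes: int = 0
    attempts: int = 0

def ucb_select(stats: dict[str, ArmStats], c: float = math.sqrt(2)) -> str:
    """Pick the arm with the highest UCB score (untried arms go first)."""
    total = sum(s.attempts for s in stats.values())
    best_arm, best_score = None, float("-inf")
    for arm, s in stats.items():
        if s.attempts == 0:
            return arm  # untried arms have an effectively infinite bonus
        mean = s.successes / s.attempts                       # empirical success rate
        bonus = c * math.sqrt(math.log(total) / s.attempts)   # exploration bonus
        if mean + bonus > best_score:
            best_arm, best_score = arm, mean + bonus
    return best_arm
```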
Thompson Sampling
With this approach, the derivation can actually make a bit more sense (in my opinion). It's also (probably relatedly) the approach I like the most.
We model the outcome of pulling arm $a$ as a Bernoulli distribution. Basically, it means we have a $p$ probability of getting a 1 (in this case a reward; in our specific case further down, a successful downstream server request). The value 0 has a probability $q = 1 - p$ of occurring.
We can then model this uncertainty about the Bernoulli parameter $p$ as a beta distribution. We're trying to figure out the probability $p$ for each arm $a$ (or, as we'll see further on, for each downstream server).
Think of the beta distribution as a heuristic for what we actually believe about each arm. With Thompson sampling, we're basically maintaining a best-guess distribution for each of the arms and updating it as we go and learn more information. I believe the technical term for this is that we're using the beta distribution as a conjugate prior for the Bernoulli likelihood, which keeps the posterior a beta distribution as well.
Formally, the beta distribution has an $\alpha$ and a $\beta$ that control the shape of the distribution. They appear as exponents of the variable and the variable's complement, respectively. So again, this can be written as:
\[f(x; \alpha, \beta) = \text{constant} \cdot x^{\alpha - 1} \, (1 - x)^{\beta - 1}\]
Then our logic is pretty straightforward given how we're modeling this. For every success of the arm, we update $\alpha$ with a simple $\alpha' = \alpha + 1$, and for every failure, we update $\beta$ (given it's modeling the complement) as $\beta' = \beta + 1$.
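Here is a minimal sketch of one Thompson sampling round in Python - the arm names and the `pull_arm` callback are illustrative, not the repo's actual interface.

```python
import random

# Beta(1, 1) priors, i.e. uniform over p, for each (made-up) arm
posteriors = {"server-a": [1, 1], "server-b": [1, 1], "server-c": [1, 1]}

def thompson_round(pull_arm) -> str:
    """Sample p for each arm, play the best sample, update that arm's posterior."""
    samples = {arm: random.betavariate(a, b) for arm, (a, b) in posteriors.items()}
    chosen = max(samples, key=samples.get)
    success = pull_arm(chosen)          # True/False reward from the environment
    if success:
        posteriors[chosen][0] += 1      # alpha' = alpha + 1 on success
    else:
        posteriors[chosen][1] += 1      # beta' = beta + 1 on failure
    return chosen

# e.g. thompson_round(lambda arm: random.random() < 0.7)
```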
A picture is worth a thousand words, so an interactive visualization must be worth at least a million, right? This is a Claude-generated vanilla JS + Chart.js artifact. I'd recommend autoplaying or doing the Run Thompson Round, but you can also see results by adding successes and failures to the various arms. The main point is that you'll see how our beta distributions steadily converge to the real $p$ with increasing accuracy.
Multi-Armed Bandit Variants
The situation I described above is really the stochastic MAB: there's a finite set of arms, and the reward distribution is unknown. As I learned throughout this process, there are many variants and generalizations of this problem - specifically, generalizations where the MAB is extended by adding some information or structure to the problem. Namely:
- adversarial bandits
- this is probably my favorite variant. the notion is that you have an adversary trying to maximize your regret while you're trying to minimize it. so they're basically trying to trick or con your algorithm.
- if you're asking yourself (like I did) why the adversary doesn't just assign $r_{a,t} = 0$ as the reward for every arm $a$ at time $t$: you shouldn't really think about it in terms of reward. Reward is relative. We instead want to think in terms of regret, which I'll talk more about later. There are two subvariants (oblivious adversary and adaptive adversary), but we're not going to discuss those - although a very interesting extension is the EXP3 algorithm.
- contextual bandits
- the notion here is that instead of learning $E[r \mid a]$, where again $r$ is the reward and $a$ is the arm you pick, you're learning $E[r \mid x, a]$, where $x$ is some additional bit of context at time $t$ that you're exposed to.
- dueling bandits
- an interesting variant where, instead of observing the reward, your information is limited to picking two bandits and only learning which one was better comparatively… but again it's stochastic, so you can query the same two arms and very feasibly get different results for the comparison. The whole notion is that you're building up a preference matrix. Seems like an incredibly difficult problem.
Bandits with Knapsacks (BwK) Variant
I'm going to preempt the reader and discuss one more variant, where I'll spend a bit more time: the Bandits with Knapsacks problem.
The original paper is from Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins - people I'd love to be an iota as smart as. You can see the paper here. It's a 55-page paper, and I'd be lying if I said I read past the Preliminaries section. Sections 3 and onward have some heavy math that is over my head.
The problem statement is relatively simple though. Your arms now have resources associated with them that they consume. I honestly think it's easier to lay it out mathematically and reference the actual paper (also, shoutout to alphaxiv - it has most of the normal arXiv features, just with some AI-native question answering and highlighting, which has been nice).
Formal Declaration
I'd like to note that the paper starts out with the generalized form of many resources being managed and consumed. That makes sense given it's an academic paper and the general case is more interesting. However, you can imagine $d$ being 1, so that we have a single resource we're managing.
So again, we have a finite set $X$ of arms from 1 to $m$. An individual arm can be written as $x$. Formally, we can say
\[X = \{ 1,\, 2,\, \ldots,\, x, \, \ldots, \,m-1,\, m \}\]
There are $T$ rounds (which, interestingly enough, is known ahead of time in this variant). So $t$ is the round index (one round per time increment),
\[t \in \{1,\,2,\, \ldots,\, T-1,\, T \}\]
There are $d$ resources, where $d \geq 1$, indexed by $i$ from $1,\, \ldots,\, d$. (The $d$ in our specific example is still going to be the number of servers, because each server has its own rate limit.)
So the problem now changes: at round $t$, when arm $x$ is pulled, we don't just get a reward - we get a reward and a consumption vector indicating how much of each resource was consumed. In other words,
\[\left( r_t, c_{t,1}, \ldots , c_{t,d} \right)\]
The paper declares this as a draw from $\pi_x$, where $\pi_x$ is an unknown latent distribution over $[0,1]^{d+1}$.
Now "latent spaces" have gotten a ton of usage since LLMs blew up, but basically this just means there is some distribution, it is fixed, and it's unknown to the learner.
Just to break down the syntax, since $[0,1]^{d+1}$ can be a bit misleading: this just means
\[[0,1]^{d+1} = \underbrace{[0,1] \times [0,1] \times \cdots \times [0,1]}_{d+1\ \text{times}}\]
So it's really just a vector of length $d+1$ (the +1 is because we have $d$ resources plus one reward $r$, so it's kind of a shorthand).
$\pi_x$ is a joint probability distribution over $(r, c_1, \ldots, c_d)$, or \((r, c_1, \ldots, c_d) \sim \pi_x\), meaning when you pull an arm, you draw one vector from this distribution.
This of course leads us to budgeting: each resource $i$ has a budget $B_i \geq 0$.
The overall process stops as soon as we have exhausted ANY resource budget.
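To make the protocol concrete, here is a tiny sketch of the BwK interaction loop as I read the setup above. Everything here (the function names, the `draw` callback) is made up for illustration - the paper's algorithms are entirely about choosing `choose_arm` cleverly.

```python
def run_bwk(arms, budgets, T, choose_arm, draw):
    """Generic BwK loop: stop after T rounds or when any resource budget runs out.

    draw(x) returns one sample (r, c_1, ..., c_d) ~ pi_x, each entry in [0, 1].
    """
    remaining = list(budgets)               # one B_i per resource i
    total_reward = 0.0
    for t in range(1, T + 1):
        x = choose_arm(t, arms, remaining)  # the learner's policy lives here
        r, *costs = draw(x)                 # reward plus consumption vector
        total_reward += r
        remaining = [b - c for b, c in zip(remaining, costs)]
        if any(b <= 0 for b in remaining):
            break                           # stop as soon as ANY budget is exhausted
    return total_reward
```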
Algorithms Presented
The paper presents two algorithms - BalancedExploration and PrimalDualWithBK.
BalancedExploration
At a high level, BalancedExploration tries to explore as much as possible while avoiding clearly suboptimal strategies. It tries to converge to an LP-perfect distribution, where LP-perfect is defined relative to an LP relaxation the paper calls LP-primal (LP = linear programming). The relaxation loosens some of the constraints: it assumes we know the average reward for each arm (removing the uncertainty) and lets us play fractional arms rather than whole ones (this gets into the integer programming formulation, which is helpful for the second part).
The algorithm is "simple," as the authors put it, but somewhat abstracted. In each phase, it eliminates any mix of arms that is obviously not LP-perfect, maintaining a confidence set of potentially LP-perfect distributions. Then, for each arm, it explores that arm as much as possible and gathers the information, repeating until it runs out of time or resources.
Transparently, I get it at this level, but I don't understand the underlying math pinning it down. That confidence interval calculation is… unclear to me. And I didn't even have an implementation for it in my repo (which is kind of the point of this post).
Ah, actually! After giving Claude enough context and framing, it does make sense for my repo. It's still using UCB / LCB for reward and cost respectively, and then forming the score from that, i.e.:
Claude Code: BalancedExploration Explanation
Claude Code v2.1.17 · Opus 4.5 · Claude Max · ~/Documents/coding/multi-armed-bandit
❯ Can you explain how the `BalancedExploration` BwK algorithm could be applied here?
⏺ The Balanced Exploration algorithm from the Bandits with Knapsacks (BwK) literature is a natural fit for your load balancer because you have a resource constraint built into the problem: the retry penalty system.
BwK Problem Mapping

| BwK Concept | Your Load Balancer |
|---|---|
| Arms | Downstream servers (T1/T2/T3 tiers) |
| Reward | Successful request (+1) |
| Resource consumption | Retries needed (penalty after 3 attempts) |
| Budget | Implicit via score formula: success - 0.5 * penalty_retries |
How Balanced Exploration Works
Standard MAB algorithms (like your Thompson/UCB strategies) only track reward (success rate). BwK algorithms track both:
1. Reward distribution: P(success | server)
2. Cost distribution: E[retries needed | server]
The key insight: a server with 80% success rate that usually fails on first attempt but succeeds on retry might be worse than a 70%
success server that either works immediately or fails completely.
Algorithm Sketch
```python
class BalancedExplorationStrategy(BaseStrategy):
    def __init__(self):
        # Track both success AND retry cost per server
        self.reward_estimates = {}    # E[success]
        self.cost_estimates = {}      # E[attempts needed]
        self.confidence_bounds = {}

    def select_server(self, servers, attempt_number):
        # Compute "bang per buck" with confidence bounds
        for server in servers:
            reward_ucb = self.reward_estimates[server] + confidence_bonus
            cost_lcb = max(self.cost_estimates[server] - confidence_bonus, epsilon)
            # Key BwK idea: optimize reward/cost ratio
            server.score = reward_ucb / cost_lcb

        # Balanced exploration: mix greedy + uniform exploration
        if should_explore():  # Based on remaining budget
            return uniform_random(servers)
        return max(servers, key=lambda server: server.score)
```
PrimalDualWithBK
The difference here is that we basically have a shadow parameter per resource - the dual vector portion of the formulation. It represents a unit cost for each resource and helps balance resource consumption.
This $\lambda$ is our internal or shadow price, indicating how much we're willing to pay for a given resource.
Primal-dual is different because it's adaptive to our remaining budget. If we're burning through resources too quickly, $\lambda$ rises and we do our best to avoid costly servers.
Claude Code: PrimalDual Explanation
⏺ Great question - these are two fundamentally different algorithmic paradigms for BwK.
Core Philosophical Difference

| Balanced Exploration | Primal-Dual BwK |
|---|---|
| Ratio-based: Optimize reward/cost directly | Price-based: Learn a "shadow price" for resources |
| Explore to reduce uncertainty | Adapt prices based on consumption rate |
| UCB-style confidence bounds | Lagrangian relaxation of constraints |
Primal-Dual Intuition
Instead of optimizing a ratio, Primal-Dual converts the constrained problem into an unconstrained one:
Original: maximize E[reward] subject to E[cost] ≤ budget
Primal-Dual: maximize E[reward] - λ * E[cost]
where λ (dual variable) is learned online
The dual variable λ is the shadow price - how much you're willing to "pay" to consume one unit of resource.
Algorithm Sketch
```python
class PrimalDualBwKStrategy(BaseStrategy):
    def __init__(self):
        self.lambda_price = 0.1   # Shadow price for retries (start > 0 so the multiplicative update can move)
        self.eta = 0.1            # Learning rate for the price update
        self.total_penalty_retries = 0
        self.total_requests = 0

    def select_server(self, servers, attempt_number):
        # Pick server maximizing: E[reward] - λ * E[cost]
        for server in servers:
            reward = self.success_estimate[server]
            cost = self.retry_cost_estimate[server]
            server.score = reward - self.lambda_price * cost
        return max(servers, key=lambda server: server.score)

    def update(self, server_url, success, attempts_used):
        # Update reward/cost estimates (standard)
        self._update_estimates(server_url, success, attempts_used)

        # KEY DIFFERENCE: Update the shadow price
        self.total_penalty_retries += max(0, attempts_used - 3)
        self.total_requests += 1

        # If consuming resources too fast, increase price
        target_rate = 0.5  # Expected penalty retries per request
        actual_rate = self.total_penalty_retries / self.total_requests

        # Multiplicative weights update
        self.lambda_price *= (1 + self.eta * (actual_rate - target_rate))
```
Take-Home Multi-Armed Bandit Variant
The take-home I received had an interesting twist on this. The change: you are only penalized for a failing server request after $k$ tries. So you are still trying to maximize your "score" (i.e. reward), but you're also given some leeway.
It was not until deep research with Claude / ChatGPT that I learned the problem could (I think) best be framed as a BwK problem.
Flaky Server - BwK Framing
For more context on the take-home, the MAB portion was framed as building a load balancer where the downstream servers are flaky and you're trying to minimize penalties (which are assessed after your failing load balancer request). They simply sent a binary (which I actually dislike and think is very sketchy - a binary with no details, certs, signatures, notarization, etc.). The binary opened up the following:
- 10 ports with Config 1 (constant error rate)
- 10 ports with Config 2 (constant error rate + constant rate limit)
- 10 ports with Config 3 (constant error rate + complex rate limit)
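Concretely, my reading of the per-request scoring looks roughly like the sketch below. The 3 free attempts and the 0.5 penalty weight come from the Claude transcript earlier in this post, not from an official spec, so treat the constants as assumptions.

```python
def request_score(success: bool, attempts_used: int,
                  free_attempts: int = 3, penalty_weight: float = 0.5) -> float:
    """Approximate score contribution of one request: +1 for a success, minus a
    penalty for every retry beyond the free attempts. Constants are my reading
    of the take-home, not the grader's actual formula."""
    penalty_retries = max(0, attempts_used - free_attempts)
    return (1.0 if success else 0.0) - penalty_weight * penalty_retries
```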
Approach
Aggressively Inspecting the Binary
No fucking way am I blindly running a binary on my personal computer.
I am familiar with some basic CLI tools for inspecting binaries (otool, strings, xattr from the Dropbox days). However, this was something that I freely threw at Claude with explicit instructions not to run the binary and not to tell me anything about the underlying load balancer config implementations (I'll get to the decompilation step in a bit).
I also knew that any command actually starting the load balancer binary would run in a restricted mode using sandbox-exec, which I hadn't stumbled upon until this project. The blog I just linked does a fantastic job, so you should feel comfortable giving it some site traffic and peeking into that one. TLDR: it's a way to run a binary in a sandboxed environment so that it only has permissions to the resources you explicitly permit.
All of this looked good, so I was onto the actual implementation.
Load Balancer
This was obviously the meat of the problem and the most fun to reason and think about, probably because it was the most math / stats intensive. I wrote a couple of versions myself, tried them and saw the failures (Claude found bugs with how I was calculating the beta distribution's variance, for example), and kept iterating. It's the part of the code I know the best, and I can walk through the various implementations.
The later versions, where we get into the BwK approaches (V6-V8), are implementations by Claude, but it's still interesting to see how they perform relative to the original ideas.
At this point, I'm pretty burnt out on this project and I'm technically on vacation, so I am going to summarize and leave it as an exercise to the reader to investigate the code and understand the underlying logic.
These versions are all basic MAB approaches, not BwK specific.
| Code Version | Method | Description |
|---|---|---|
| V1 | Larkin Intuition | We still model things as a Beta distribution. We have a DISCOVER_LIMIT. While we're in DISCOVER_MODE, we select the arm / server with the highest beta variance and fire off attempts to that server. If that fails, we re-evaluate. We continue until we fail. After the discover limit, we statically pick the best server to send requests to. |
| V2 | Vanilla UCB | This is the UCB method described above. We first prioritize any untried servers (since technically they have an infinite UCB score). Then for each server, we calculate the UCB score using the formula: \(UCB = \text{success_rate} + \sqrt{\frac{2 \ln(\text{total_requests})}{\text{num_attempts}}}\) |
| V3 | Adjusted UCB | Very similar to the above, however this time we play games with our exploration constant. It's no longer $\sqrt{2}$; it's 3 (chosen arbitrarily, just bigger than $\sqrt{2}$) for the first three attempts and then 1 after that, when we're starting to get penalized. |
| V4 | Vanilla Thompson Sampling | What we described above: we pick the server with the highest sampled $p$ and go from there. Success or failure, we update our $\alpha$ or $\beta$ accordingly. |
| V5 | Modified Thompson Sampling | In a somewhat similar game to the modified UCB, we scale alpha and beta based on the number of requests to encourage exploration. We use an exponential decay, and if we're at 3 attempts or more, we do not scale at all and just revert back to normal TS. Our scale_factor becomes max(2, total/variance_scale) / total where total = alpha + beta. We then multiply $\alpha$ and $\beta$ by that factor. |
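For the curious, here is roughly what the V5 scaling does, reconstructed from the description above. The `variance_scale` constant is a stand-in and the sketch omits the exponential decay, so check the repo for the real values.

```python
import random

def sample_scaled(alpha: float, beta: float, attempt_number: int,
                  variance_scale: float = 4.0) -> float:
    """V5-style draw: flatten the posterior early on to encourage exploration,
    then fall back to vanilla Thompson sampling from the 3rd attempt onward."""
    if attempt_number < 3:
        total = alpha + beta
        scale = max(2.0, total / variance_scale) / total  # shrinks both params
        alpha, beta = alpha * scale, beta * scale
    return random.betavariate(alpha, beta)
```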
In all honesty, these approaches were generated with Claude Code, but they are rate-limit aware and targeted at BwK-style approaches.
| Code Version | Method | Description |
|---|---|---|
| V6 | Thompson Masked | A slight departure from the original Thompson Sampling. Here, 429s indicate that we have been rate limited, and we exclude rate-limited servers from the selection pool. Note, we also mark a server as rate-limited if we've gotten a 429 from it in the past second. The big notion is that 429s are treated differently than failures: we do not update $\beta$ when we get one, we instead just mark the server as rate limited. If all of our servers are rate limited, we pick the server whose rate limit is most likely to expire soon. This is probably best for Config Type T2. |
| V7 | Sliding Window | Here, given that we have the notion of temporal and dynamic rate limiting, we only remember a set amount of request history. I chose 30 basically arbitrarily. Again, perhaps ideally we could learn the rate limits and dynamically adapt this. Our $\alpha$ and $\beta$ params are only updated based on that window of history. |
| V8 | Blocking Bandit | And here is the adaptive cooldown / blocking that V7 was lacking. The difference is that now, if we hit a 429, we exponentially increase the wait time to block incoming requests from going to a server that we know is rate-limited. |
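To give a flavor of V8's blocking behavior, here is a cooldown tracker sketch. The base and max cooldowns are made up, and the repo wires this into the strategy differently.

```python
import time

class CooldownTracker:
    """V8-style adaptive cooldown: each consecutive 429 doubles how long a
    server is excluded from the selection pool (constants are illustrative)."""

    def __init__(self, base_cooldown: float = 0.5, max_cooldown: float = 30.0):
        self.base = base_cooldown
        self.max = max_cooldown
        self.strikes: dict[str, int] = {}
        self.blocked_until: dict[str, float] = {}

    def record_429(self, server: str) -> None:
        self.strikes[server] = self.strikes.get(server, 0) + 1
        cooldown = min(self.base * 2 ** (self.strikes[server] - 1), self.max)
        self.blocked_until[server] = time.monotonic() + cooldown

    def record_success(self, server: str) -> None:
        self.strikes.pop(server, None)      # reset the backoff once it recovers

    def available(self, servers: list[str]) -> list[str]:
        now = time.monotonic()
        return [s for s in servers if self.blocked_until.get(s, 0.0) <= now]
```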
Simulation Harness
The simulation harness is almost entirely vibe-coded, but basically it sends requests to our load balancer at the prescribed rate of 10 RPS. For more information, check out the flaky-load-balancer/flaky_load_balancer/harness.py file. It's on GH here.
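If you just want the flavor of it without opening the repo, a stripped-down 10 RPS loop might look like this - the endpoint URL is a placeholder and does not necessarily match the repo's actual routes.

```python
import asyncio
import httpx

async def fire_requests(url: str = "http://localhost:8000/proxy",
                        rps: int = 10, duration_s: int = 30) -> None:
    """Fire `rps` GET requests per second at the load balancer for `duration_s` seconds."""
    async with httpx.AsyncClient(timeout=5.0) as client:
        for _ in range(duration_s):
            tick = asyncio.get_running_loop().time()
            await asyncio.gather(
                *(client.get(url) for _ in range(rps)),
                return_exceptions=True,  # a flaky downstream shouldn't kill the loop
            )
            # sleep off whatever is left of this one-second tick
            elapsed = asyncio.get_running_loop().time() - tick
            await asyncio.sleep(max(0.0, 1.0 - elapsed))

if __name__ == "__main__":
    asyncio.run(fire_requests())
```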
Dashboard
The dashboard was a fun vibe-coded NextJS app. There's a decent amount of functionality here, so I'll cover some of the highlights. This NextJS project is meant to summarize and compare the results from the various strategies (V1-V8) against the various config types (T1-T3). It also has a comparison route that compares all of them for a given run.
It connects and listens to the FastAPI server (basically to our load balancer) so that we get SSE streams for things like the heartbeat, metrics, and connected state. So what I would suggest is running `make harness` - that will start your FastAPI load balancer, start the dashboard, start the downstream flakyservers binary, and then start firing off requests.
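For the curious, the general shape of one of those SSE streams on the FastAPI side is something like the sketch below - the route and payload are illustrative, not the repo's actual endpoints.

```python
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def metrics_stream():
    """Yield a metrics snapshot every second in SSE framing ('data: ...' + blank line)."""
    while True:
        snapshot = {"heartbeat": True, "connected": True}  # stand-in payload
        yield f"data: {json.dumps(snapshot)}\n\n"
        await asyncio.sleep(1.0)

@app.get("/events/metrics")
async def events_metrics():
    return StreamingResponse(metrics_stream(), media_type="text/event-stream")
```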
Here is a demo:
And furthermore, here are some screenshots from the comparison page:


Conclusion
So! What were the results?

Unsurprisingly, our Modified Thompson Sampling seemed to do the best on T1. The Sliding Window somewhat surprisingly did the best on T2 (probably because the underlying binary is sinusoidal and there was some benefit to the cadence and the window being used). Finally, for T3 the Blocking Bandit or Thompson Masked seemed to do the best.
There's a lot more I could talk about here, but this has already spilled over the time budget, so I will end here. If interested, feel free to reach out!