About

i started in trading. not the kind where you have opinions about the market -- the kind where your system processes 25 TB before the bell rings or you lose actual money. sub-millisecond latency. real consequences. that is where i learned what production means. not the conference talk version. the version where something breaks at 2am and the P&L moves in the wrong direction.

i have been a founding engineer, an ML platform builder, and the person who gets paged when production systems need to be rearchitected without downtime. every role had the same job -- make the system reliable enough that nobody has to think about it.

the layers between the model and the user -- inference runtimes, GPU scheduling, observability, cost attribution -- those are the parts i care about. not because they are glamorous. because they are the parts that determine whether an AI product is viable at scale... or just a prototype with good funding.

i am currently a member of technical staff at Rational Dynamics, working on AI reasoning systems for tasks of high cognitive complexity. the infrastructure has to disappear so the reasoning is the only thing left to get right.

Tools

i reach for CUDA before Python when latency is the constraint. the kernel is where the real work happens -- everything above it is just scheduling.

for serving: vLLM for standard inference, custom PagedAttention extensions when the access patterns break vLLM's assumptions. TensorRT-LLM for transformer layers where kernel fusion matters.

Kubernetes for orchestration. not because it's simple -- it isn't -- but because the failure modes are documented and the escape hatches exist.

Rust when i need systems-level control without the undefined behavior tax. Go for the infrastructure glue. C++ when i'm talking directly to the GPU driver.

Prometheus + OpenTelemetry everywhere. a system you can't measure is a system you can't trust.