Work

Five years of building the infrastructure behind AI systems, trading platforms, and ML pipelines.

Member of Technical Staff, Machine Learning

Rational Dynamics (Voleon) · rationaldynamics.ai

Jun 2026 – Present

AI reasoning systems for tasks of high cognitive complexity.

Founding AI Infrastructure & Systems Engineer

4MINDS · 4minds.ai

May 2025 – Jun 2026

Founding infrastructure engineer. Built the platform infrastructure 0→1, including the full inference, deployment, and observability stack, before the team grew around it. The company launched on it in the AWS and Azure Marketplaces, joined the AWS Global Startup Program, and earned a place in Microsoft's invite-only Pegasus program.

  • Built SYMI's execution sandbox. Agents take real actions across email, CRM, and external systems, and you can't trust the model's calls. So every session runs walled off: Linux namespaces, cgroups, seccomp, microVM boundaries.
  • Put eight tenants on one GPU and held inference under 50ms doing it: MIG partitioning and speculative decoding for 8:1 sharing without the latency tax, on a vLLM stack delivering 12x throughput at 60% less GPU memory.
  • GPU-accelerated the knowledge graph the platform retrieves from. Graph traversal and similarity ranking run on the GPU, so retrieval scales to millions of nodes without becoming the bottleneck that inference waits on.
  • Designed the deployment model: one build, every target. SaaS multi-tenant, single-tenant inside a customer's own AWS, Azure, or Google Cloud account, or fully on-prem and air-gapped. Kubernetes-native with no managed-cloud service dependencies, so the same artifact ships to all three major clouds and a private datacenter with no rewrite.
  • Built the model serving behind Constellation, the verification layer between draft and final response: a DAG of agents, parallelized where dependencies allow and bounded so cognition stays predictable. Also built the model-as-judge harness that grades every customized model against a gpt-oss-120b baseline before it ships.
  • Led the compliance effort to SOC 2 Type II and ISO 27001, with GDPR and CCPA on top: single-tenant isolation, JWT and SSO auth, RBAC down to resource:action grants, and a governance kernel that stops the model from executing anything it wasn't approved for. Cut infrastructure cost 70% and held 99.9% uptime while running it.

Python, Kubernetes, PyTorch, Ray, vLLM, TensorRT, TensorRT-LLM, torch.compile, CUDA, Custom CUDA Kernels, TransformerEngine, FlashAttention, Nsight Compute, Nsight Systems, ArgoCD, Helm, Kustomize, Prometheus, OpenTelemetry, Grafana, AWS, Docker, GitOps, CI/CD, GPU Scheduling, Mixed Precision, ONNX Runtime
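
A minimal sketch of what "RBAC down to resource:action grants" can look like as a deny-by-default check. It assumes each role holds a set of `resource:action` strings and that `*` acts as a wildcard; the `Role` class, the `is_allowed` helper, and the role and resource names are hypothetical, not the 4MINDS implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    """A named role holding a set of 'resource:action' grant strings (hypothetical model)."""
    name: str
    grants: set[str] = field(default_factory=set)


def _matches(grant: str, resource: str, action: str) -> bool:
    """True if one 'resource:action' grant covers the request.

    Assumption: '*' in either position acts as a wildcard.
    """
    granted_resource, _, granted_action = grant.partition(":")
    return granted_resource in ("*", resource) and granted_action in ("*", action)


def is_allowed(roles: list[Role], resource: str, action: str) -> bool:
    """Deny by default; allow only if some grant on some role matches."""
    return any(_matches(g, resource, action) for role in roles for g in role.grants)


analyst = Role("analyst", {"crm.contacts:read", "email:read"})
operator = Role("operator", {"crm.contacts:*"})

print(is_allowed([analyst], "crm.contacts", "read"))     # True
print(is_allowed([analyst], "crm.contacts", "delete"))   # False
print(is_allowed([operator], "crm.contacts", "delete"))  # True
```

Deny-by-default means a missing grant fails closed: anything not explicitly granted is refused.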

Machine Learning Engineer

GoodRx · goodrx.com

May 2024 – May 2025

Rearchitected batch systems into real-time streaming. Built an observability platform from scratch and presented it to exec leadership. Optimized SageMaker endpoints until inference costs stopped being a line item anyone questioned.

Apache Airflow, Python, AWS, SageMaker, gRPC, Databricks, Kubernetes, Docker, Helm, Terraform, Prometheus, OpenTelemetry, Distributed Tracing, CI/CD Pipelines, MLflow, Model Serving, ETL Pipelines, SQL, Load Balancing, IAM

ML Engineer, Quantitative Research

Tier-1 Market Making Firm

Aug 2022 – May 2024

25TB of market data. Every day. Sub-millisecond latency. I built the tick-level processing system behind $2M+ in annual trading decisions. Cut order execution latency by 78%.

C++, Python, Apache Kafka, Apache Spark, Low-Latency Networking, GPU Profiling, TLS, DNS, Network Optimization, Real-Time Analytics, gRPC, Bash

Data Engineer

VHN

May 2021 – Sep 2021

Seven business units with zero interoperability. I wired ML platforms into legacy Teradata and Oracle systems. Cross-system compatibility up 65%. Data quality up 85%.

Python, SQL, Teradata, Oracle, ETL, Data Pipelines, Data Governance, Java

Proprietary Work

Closed source. Built privately.

WMServe

Production inference for video world models. Custom spatiotemporal PagedAttention. Sub-50ms latency at 10K+ concurrent requests. 99.99% availability. 85%+ GPU utilization. Built for robotics-control-loop latencies.

Go, CUDA C++, Python, PagedAttention, FlashAttention, Kubernetes, gRPC, Raft Consensus, OpenTelemetry, GPU Memory Management, Kernel Fusion, Occupancy Optimization, Model Serving Architecture, Quantization (FP16), Nsight Compute

FlowLLM

Custom hypervisor for AI inference. No Linux kernel. No CUDA driver. No Python runtime. Direct GPU control in Rust and Assembly. 95% overhead reduction. 15-70 microsecond stack latency. Boots in 50 microseconds. Linux takes 30 seconds.

Rust, Assembly, CUDA, Bare Metal, GPU Programming, Warp-Level Primitives, GPU Memory Management, Custom CUDA Kernels, Nsight Systems, Profiling

APEX

GPU-native vector database. 3.5M queries per second per GPU. 1.8 microsecond p50 latency. 500K inserts per second. 10x cheaper than cloud vector providers. Built from first principles on tensor cores.

CUDA, Tensor Cores, Rust, NVLink, GPUDirect, Lock-Free Algorithms, GPU FinOps, Kernel Fusion, Occupancy Optimization, Custom CUDA Kernels

SchemaForge

Declarative database infrastructure. No migrations. Bidirectional state convergence with SMT-verified invariants. O(n log n) complexity guarantees. Parallel DDL via dependency graph. Adopted by an internal-tooling team at a FAANG company.

Rust, SMT Solver, PostgreSQL, Formal Verification, Graph Theory, CI/CD, Distributed Systems
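
One common way to apply DDL in parallel from a dependency graph is to split the statements into topological "waves": everything in a wave depends only on earlier waves, so a wave's statements can run concurrently. A sketch using Kahn's algorithm, assuming dependencies are given as a map from each object to the objects it requires; `ddl_waves` and the table names are hypothetical, and SchemaForge's actual planner (in Rust) may work differently.

```python
from collections import defaultdict


def ddl_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group schema objects into waves that can be applied concurrently.

    `deps` maps each object to the objects that must exist before it.
    Raises ValueError if the graph contains a cycle.
    """
    # Every object mentioned only as a dependency is a node too.
    nodes = set(deps) | {d for ds in deps.values() for d in ds}
    indegree = {n: len(deps.get(n, set())) for n in nodes}
    dependents = defaultdict(list)
    for node, ds in deps.items():
        for d in ds:
            dependents[d].append(node)

    waves = []
    ready = sorted(n for n, k in indegree.items() if k == 0)
    placed = 0
    while ready:
        waves.append(ready)
        placed += len(ready)
        unlocked = []
        for n in ready:
            for m in dependents[n]:
                indegree[m] -= 1
                if indegree[m] == 0:  # all of m's dependencies are now applied
                    unlocked.append(m)
        ready = sorted(unlocked)
    if placed != len(nodes):
        raise ValueError("dependency cycle detected")
    return waves


schema = {
    "users": set(),
    "products": set(),
    "orders": {"users"},
    "order_items": {"orders", "products"},
    "idx_orders_user": {"orders"},
}
print(ddl_waves(schema))
# [['products', 'users'], ['orders'], ['idx_orders_user', 'order_items']]
```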

Open Source

PHANTOM

Multi-agent LLM serving for Apple Silicon's unified memory. Existing systems were designed for discrete GPUs, where weights must be copied over PCIe. On M-series chips, the CPU, GPU, and Neural Engine share one physical memory pool, so that copy is unnecessary. PHANTOM eliminates it. With 10 agents sharing a 50-page document, the prefix is stored once, not 10 times. DualRadixTree copy-on-write KV cache. MESI coherence formally specified in TLA+. Formally verified scheduler. M0 proven: zero-copy GPU pipeline working end to end.

Rust, Apple Silicon, Metal, Unified Memory, TLA+, Formal Verification, Multi-Agent Systems, KV Cache, Copy-on-Write, Neural Engine
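
The "prefix stored once, not 10 times" idea can be illustrated with reference-counted KV-cache blocks and copy-on-write: forking an agent shares its parent's blocks, and a block is copied only when an agent writes into one that is still shared. This is a simplified, hypothetical Python model (block size, class names, and storing token ids in place of key/value tensors are all assumptions); it is not PHANTOM's DualRadixTree, which is written in Rust.

```python
from dataclasses import dataclass
from typing import Optional

BLOCK_SIZE = 4  # tokens per KV block (assumed; real systems use larger blocks)


@dataclass
class Block:
    tokens: list  # stand-in for the block's key/value tensors
    refcount: int = 1


class KVPool:
    """Toy pool of KV blocks shared between sequences via reference counts."""

    def __init__(self) -> None:
        self.blocks: list = []

    def alloc(self, tokens: list) -> int:
        self.blocks.append(Block(list(tokens)))
        return len(self.blocks) - 1


class AgentSeq:
    """One agent's view of the cache: an ordered table of block ids."""

    def __init__(self, pool: KVPool, table: Optional[list] = None) -> None:
        self.pool = pool
        self.table = table if table is not None else []

    def fork(self) -> "AgentSeq":
        """Share every existing block with the new agent instead of copying it."""
        for bid in self.table:
            self.pool.blocks[bid].refcount += 1
        return AgentSeq(self.pool, list(self.table))

    def append(self, token: int) -> None:
        last = self.pool.blocks[self.table[-1]] if self.table else None
        if last is None or len(last.tokens) == BLOCK_SIZE:
            # No partial block to extend: start a fresh private block.
            self.table.append(self.pool.alloc([token]))
            return
        if last.refcount > 1:
            # Copy-on-write: detach a private copy before mutating a shared block.
            last.refcount -= 1
            self.table[-1] = self.pool.alloc(last.tokens)
        self.pool.blocks[self.table[-1]].tokens.append(token)


pool = KVPool()
doc = AgentSeq(pool)
for t in range(10):  # the shared "document" prefix: blocks [0..3], [4..7], [8, 9]
    doc.append(t)
agents = [doc.fork() for _ in range(10)]
print(len(pool.blocks))  # 3: one copy of the prefix for all 11 sequences
agents[0].append(100)    # only agent 0's partial last block is copied
print(len(pool.blocks))  # 4
```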

NEMESIS

Autonomous GPU cluster orchestration. Replaces on-call SRE judgment with a hierarchy of specialized agents that perceive hardware degradation before it becomes a failure. Topology-aware scheduling, not just GPU counts. Heals running training jobs without a restart using NCCL 2.27 Communicator Shrink. Validated against the Alibaba Cluster Trace dataset. Every benchmark reproducible from a single command.

Rust, Python, NCCL, Kubernetes, GPU Scheduling, Distributed Systems, Multi-Agent Systems, Fault Tolerance

TASFT

Task-Aware Sparse Fine-Tuning. Co-trains LoRA adapters with block-sparse attention gates. 2-5x decode throughput at 70-85% sparsity. 676 tests passing. Cuts inference costs without pretending accuracy doesn't matter.

Python, PyTorch, LoRA/QLoRA, CUDA, FlashAttention-2, Block-Sparse Attention, vLLM, Quantization, Model Compilation, Transformer Architecture Optimization, Mixed Precision, Gradient Checkpointing
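
At inference time, a block-sparse attention gate reduces to a per-query-block mask that keeps only the highest-scoring key blocks. The sketch below shows that selection step at a fixed sparsity under a causal constraint. The gate scores are supplied as plain numbers, the `block_mask` name and the always-keep-the-diagonal rule are assumptions, and how TASFT learns the scores jointly with LoRA adapters is not shown.

```python
import math


def block_mask(scores: list[list[float]], sparsity: float) -> list[list[bool]]:
    """Select which key blocks each query block attends to.

    scores[i][j] is a gate score for query block i and key block j. Only causal
    blocks (j <= i) are eligible. The diagonal block is always kept, and the
    remaining budget goes to the highest-scoring earlier blocks.
    """
    n = len(scores)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        # Causal blocks to keep for this row; round() guards against float
        # noise (e.g. 1 - 0.7 is not exactly 0.3 in binary floating point).
        keep = max(1, math.ceil(round((i + 1) * (1.0 - sparsity), 9)))
        earlier = sorted(range(i), key=lambda j: scores[i][j], reverse=True)
        for j in [i] + earlier[: keep - 1]:
            mask[i][j] = True
    return mask


scores = [
    [1.0, 0.0, 0.0, 0.0],
    [0.2, 1.0, 0.0, 0.0],
    [0.9, 0.1, 1.0, 0.0],
    [0.3, 0.8, 0.1, 1.0],
]
for row in block_mask(scores, sparsity=0.5):
    print("".join("#" if kept else "." for kept in row))
# #...
# .#..
# #.#.
# .#.#
```

An attention kernel would then skip every masked-out block entirely, which is where the saved decode work comes from.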

KubeBalance

Kubernetes scheduler plugin. Network topology-aware, cost-based, and performance-driven pod placement. The scheduler your cluster should have shipped with.

Go, Kubernetes, Docker, Helm, GPU Scheduling, Cold-Start Optimization, Multi-Region, Ingress, Load Balancing
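
A scheduler plugin of this kind typically filters out nodes that can't fit the pod, then scores the rest. The sketch below shows a scoring step that blends network distance to the pod's peers, hourly cost, and current load. The weights, the `Node` fields, and the zone/region distance model are illustrative assumptions, not KubeBalance's actual scoring, which is written in Go against the Kubernetes scheduling framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    zone: str
    region: str
    free_gpus: int
    cost_per_hour: float
    utilization: float  # fraction of capacity in use, 0.0 to 1.0


def distance(a: Node, b: Node) -> int:
    """Assumed network model: 0 same zone, 1 same region, 3 cross-region."""
    if a.zone == b.zone:
        return 0
    return 1 if a.region == b.region else 3


def score(node: Node, peers: list, max_cost: float,
          w_net: float = 0.5, w_cost: float = 0.3, w_load: float = 0.2) -> float:
    """Weighted score in [0, 1]; higher is better."""
    if peers:
        avg = sum(distance(node, p) for p in peers) / len(peers)
        net = 1.0 - avg / 3
    else:
        net = 1.0  # no peers: topology doesn't matter
    cost = 1.0 - node.cost_per_hour / max_cost
    load = 1.0 - node.utilization
    return w_net * net + w_cost * cost + w_load * load


def place(nodes: list, peers: list, gpus_needed: int) -> Optional[str]:
    """Filter nodes that fit, then return the name of the best-scoring one."""
    feasible = [n for n in nodes if n.free_gpus >= gpus_needed]
    if not feasible:
        return None
    max_cost = max(n.cost_per_hour for n in feasible) or 1.0
    return max(feasible, key=lambda n: score(n, peers, max_cost)).name


nodes = [
    Node("a1", "us-east-1a", "us-east-1", 4, 12.0, 0.7),
    Node("a2", "us-east-1b", "us-east-1", 8, 10.0, 0.2),
    Node("w1", "us-west-2a", "us-west-2", 8, 6.0, 0.1),
    Node("a3", "us-east-1a", "us-east-1", 1, 8.0, 0.0),
]
print(place(nodes, peers=[], gpus_needed=2))          # w1: cheapest, least loaded
print(place(nodes, peers=[nodes[0]], gpus_needed=2))  # a1: co-located with its peer
```

With no peers, cost and load decide; once a peer sits in us-east-1a, the topology term pulls the pod next to it even though that node is the most expensive.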

AirflowLLM

Generate production-ready Airflow DAGs from natural language. 45 tokens/sec on CodeLlama 7B. ~700ms on an M2 Pro. No API calls. No cloud dependency. Your DAGs, your machine.

Python, Apache Airflow, LLMs, Ollama, vLLM, Model Serving

EdgeTrain

Neural network training in the browser. WebGPU compute shaders. No server. No Python. The model trains on your GPU, in your tab.

TypeScript, WebGPU, WGSL

SimTextGuard

AI-generated text detection in C++. Jaccard similarity against known AI responses. Fast enough to run inline on submission.

C++, NLP, Pybind11
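
Jaccard similarity is the size of the intersection of two sets divided by the size of their union. A minimal Python sketch of the check described above, assuming word-trigram shingles and a 0.5 flagging threshold (both hypothetical choices; the project itself is C++ with Pybind11 bindings):

```python
import re


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Lowercased word k-grams; texts shorter than k yield one short shingle."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def max_similarity(submission: str, known_ai_responses: list[str]) -> float:
    """Highest Jaccard score against any known AI response."""
    s = shingles(submission)
    return max((jaccard(s, shingles(r)) for r in known_ai_responses), default=0.0)


def is_flagged(submission: str, known: list[str], threshold: float = 0.5) -> bool:
    return max_similarity(submission, known) >= threshold


known = ["As an AI language model, I cannot browse the internet."]
submission = "As an AI language model, I cannot browse the web."
print(round(max_similarity(submission, known), 3))  # 0.778 (7 shared trigrams of 9)
print(is_flagged(submission, known))                # True
```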

PokerGenius

Poker AI. Monte Carlo tree search, neural hand evaluation, adaptive opponent modeling. Game theory applied to a game most people think is about luck.

Python, Game Theory, Monte Carlo, Neural Networks
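
The selection step of Monte Carlo tree search usually picks the child that maximizes the UCB1 (UCT) score, trading off a move's average reward against how rarely it has been tried. This is a generic sketch of that one step; the `Child` structure, the action names, and the exploration constant c = √2 are standard textbook choices, not details taken from PokerGenius.

```python
import math
from dataclasses import dataclass


@dataclass
class Child:
    action: str
    visits: int
    total_reward: float


def ucb1(child: Child, parent_visits: int, c: float = math.sqrt(2)) -> float:
    """Mean reward plus an exploration bonus; unvisited children are tried first."""
    if child.visits == 0:
        return math.inf
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore


def select(children: list[Child]) -> Child:
    """Pick the child with the highest UCB1 score."""
    parent_visits = sum(ch.visits for ch in children)
    return max(children, key=lambda ch: ucb1(ch, parent_visits))


children = [Child("fold", 10, 2.0), Child("call", 10, 6.0), Child("raise", 2, 1.5)]
print(select(children).action)  # raise: decent mean and barely explored
```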