Work
Five years of building the infrastructure behind AI systems, trading platforms, and ML pipelines.
Member of Technical Staff, Machine Learning
Rational Dynamics (Voleon) · rationaldynamics.ai
AI reasoning systems for tasks of high cognitive complexity.
Founding AI Infrastructure & Systems Engineer
4MINDS · 4minds.ai
Founding infrastructure engineer. Built the platform infrastructure 0→1, the full inference/deployment/observability stack, before the team grew around it. The company launched on it into the AWS and Azure Marketplaces and the AWS Global Startup Program, and earned a place in Microsoft's invite-only Pegasus program.
- Built SYMI's execution sandbox. Agents take real actions across email, CRM, and external systems, and you can't trust the model's calls. So every session runs walled off: Linux namespaces, cgroups, seccomp, microVM boundaries.
- Put eight tenants on one GPU and held inference under 50ms doing it: MIG partitioning, speculative decoding, 8:1 sharing without the latency tax, on a vLLM stack running 12x throughput at 60% less GPU memory.
- Did the GPU acceleration on the knowledge graph the platform retrieves from. Graph traversal and similarity ranking run on the GPU, so retrieval scales to millions of nodes without becoming the bottleneck inference waits on.
- Designed the deployment model: one build, every target. SaaS multi-tenant, single-tenant inside a customer's own AWS, Azure, or Google Cloud account, or fully on-prem and air-gapped. Kubernetes-native with no managed-cloud service dependencies, so the same artifact ships to all three major clouds and a private datacenter with no rewrite.
- Built the model serving behind Constellation, the verification layer between draft and final response: a DAG of agents parallelized where dependencies allow, bounded so cognition stays predictable. I also built the model-as-judge harness that grades every customized model against a gpt-oss-120b baseline before it ships.
- Led the compliance effort to SOC-2 Type II and ISO 27001, GDPR and CCPA on top: single-tenant isolation, JWT and SSO auth, RBAC down to resource:action grants, and a governance kernel that stops the model executing anything it wasn't approved to. Cut infrastructure cost 70% and held 99.9% uptime running it.
Python, Kubernetes, PyTorch, Ray, vLLM, TensorRT, TensorRT-LLM, torch.compile, CUDA, Custom CUDA Kernels, TransformerEngine, FlashAttention, Nsight Compute, Nsight Systems, ArgoCD, Helm, Kustomize, Prometheus, OpenTelemetry, Grafana, AWS, Docker, GitOps, CI/CD, GPU Scheduling, Mixed Precision, ONNX Runtime
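To illustrate the "RBAC down to resource:action grants" idea from the compliance bullet, here is a minimal Python sketch, not the production code. The `Role` class, the `is_allowed` helper, the wildcard forms (`resource:*`, `*:*`), and the example roles are all assumptions made for illustration. The one property it aims to show is deny-by-default: a call is permitted only when some role carries a matching grant.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Role:
    """A named bundle of grants, each written as 'resource:action'."""
    name: str
    grants: frozenset = field(default_factory=frozenset)


def is_allowed(roles, resource, action):
    """Deny by default: True only if some role holds a matching grant.

    Wildcard grants ('resource:*' and '*:*') are an assumption of this sketch.
    """
    wanted = {f"{resource}:{action}", f"{resource}:*", "*:*"}
    return any(wanted & role.grants for role in roles)


# Hypothetical roles, for illustration only.
viewer = Role("viewer", frozenset({"model:read", "dataset:read"}))
operator = Role("operator", frozenset({"model:deploy", "sandbox:*"}))
```

With these roles, `is_allowed([viewer], "model", "deploy")` is False, while `is_allowed([viewer, operator], "sandbox", "exec")` is True through the `sandbox:*` grant.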
Machine Learning Engineer
GoodRx · goodrx.com
Rearchitected batch systems into real-time streaming. Built an observability platform from scratch and presented it to exec leadership. Optimized SageMaker endpoints until inference costs stopped being a line item anyone questioned.
Apache Airflow, Python, AWS, SageMaker, gRPC, Databricks, Kubernetes, Docker, Helm, Terraform, Prometheus, OpenTelemetry, Distributed Tracing, CI/CD Pipelines, MLflow, Model Serving, ETL Pipelines, SQL, Load Balancing, IAM
ML Engineer, Quantitative Research
Tier-1 Market Making Firm
25TB of market data. Every day. Sub-millisecond latency. I built the tick-level processing system behind $2M+ in annual trading decisions. Cut order execution latency by 78%.
C++, Python, Apache Kafka, Apache Spark, Low-Latency Networking, GPU Profiling, TLS, DNS, Network Optimization, Real-Time Analytics, gRPC, Bash
Data Engineer
VHN
Seven business units with zero interoperability. I wired ML platforms into legacy Teradata and Oracle systems. Cross-system compatibility up 65%. Data quality up 85%.
Python, SQL, Teradata, Oracle, ETL, Data Pipelines, Data Governance, Java
Proprietary Work
Closed source. Built privately.
WMServe
Production inference for video world models. Custom spatiotemporal PagedAttention. Sub-50ms latency at 10K+ concurrent requests. 99.99% availability. 85%+ GPU utilization. Built for robotics-control-loop latencies.
Go, CUDA C++, Python, PagedAttention, FlashAttention, Kubernetes, gRPC, Raft Consensus, OpenTelemetry, GPU Memory Management, Kernel Fusion, Occupancy Optimization, Model Serving Architecture, Quantization (FP16), Nsight Compute
FlowLLM
Custom hypervisor for AI inference. No Linux kernel. No CUDA driver. No Python runtime. Direct GPU control in Rust and Assembly. 95% overhead reduction. 15-70 microsecond stack latency. Boots in 50 microseconds. Linux takes 30 seconds.
Rust, Assembly, CUDA, Bare Metal, GPU Programming, Warp-Level Primitives, GPU Memory Management, Custom CUDA Kernels, Nsight Systems, Profiling
APEX
GPU-native vector database. 3.5M queries per second per GPU. 1.8 microsecond p50 latency. 500K inserts per second. 10x cheaper than cloud vector providers. Built from first principles on tensor cores.
CUDA, Tensor Cores, Rust, NVLink, GPUDirect, Lock-Free Algorithms, GPU FinOps, Kernel Fusion, Occupancy Optimization, Custom CUDA Kernels
SchemaForge
Declarative database infrastructure. No migrations. Bidirectional state convergence with SMT-verified invariants. O(n log n) complexity guarantees. Parallel DDL via dependency graph. Adopted by an internal-tooling team at a FAANG company.
Rust, SMT Solver, PostgreSQL, Formal Verification, Graph Theory, CI/CD, Distributed Systems
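"Parallel DDL via dependency graph" can be pictured as layering a topological sort: statements whose dependencies are all satisfied form a batch that can run concurrently. The sketch below is an assumption-laden Python illustration using the standard library's `graphlib`, not SchemaForge's Rust implementation. The `ddl_batches` helper and the example schema are hypothetical.

```python
from graphlib import TopologicalSorter


def ddl_batches(depends_on):
    """Group schema objects into batches; each batch can be applied in parallel.

    `depends_on` maps an object name to the set of objects it requires.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything unblocked right now
        batches.append(ready)
        sorter.done(*ready)  # marking these done unblocks their dependents
    return batches


# Hypothetical schema, for illustration only.
schema = {
    "users": set(),
    "products": set(),
    "orders": {"users", "products"},
    "idx_orders_user": {"orders"},
}
```

For this schema, `ddl_batches(schema)` returns `[["products", "users"], ["orders"], ["idx_orders_user"]]`: the two independent tables go first, together, and the index waits for the table it covers.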
Open Source
PHANTOM
Multi-agent LLM serving for Apple Silicon's unified memory. Existing systems were designed for discrete GPUs, where weights must be copied over PCIe. On M-series chips, the CPU, GPU, and Neural Engine share one physical memory pool, so that copy is unnecessary. PHANTOM eliminates it. 10 agents sharing a 50-page document: prefix stored once, not 10 times. DualRadixTree copy-on-write KV cache. MESI coherence formally specified in TLA+. Formally verified scheduler. M0 proven: zero-copy GPU pipeline working end to end.
Rust, Apple Silicon, Metal, Unified Memory, TLA+, Formal Verification, Multi-Agent Systems, KV Cache, Copy-on-Write, Neural Engine
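The "prefix stored once, not 10 times" claim rests on prefix sharing. The Python sketch below shows that idea with a plain reference-counted trie over token IDs. It is a simplified stand-in, not the DualRadixTree or its copy-on-write logic, and the names `PrefixNode` and `PrefixCache` are hypothetical.

```python
class PrefixNode:
    """One cached token position; `refs` counts the sequences passing through it."""

    def __init__(self):
        self.children = {}
        self.refs = 0


class PrefixCache:
    """Plain trie over token IDs; shared prefixes occupy a single path."""

    def __init__(self):
        self.root = PrefixNode()
        self.node_count = 0  # stand-in for the number of KV entries actually stored

    def insert(self, tokens):
        node = self.root
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:  # first sequence to reach this position allocates it
                child = PrefixNode()
                node.children[tok] = child
                self.node_count += 1
            child.refs += 1  # later sequences only bump the reference count
            node = child


# Ten hypothetical agents share a 50-token document prefix, then diverge by one token.
cache = PrefixCache()
document = list(range(50))
for agent_id in range(10):
    cache.insert(document + [1000 + agent_id])
```

Here `cache.node_count` is 60 (50 shared positions plus 10 divergent ones), against 510 if every agent kept a private copy.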
NEMESIS
Autonomous GPU cluster orchestration. Replaces on-call SRE judgment with a hierarchy of specialized agents that perceive hardware degradation before it becomes failure. Topology-aware scheduling, not just GPU counts. Heals running training jobs without restart using NCCL 2.27 Communicator Shrink. Validated against the Alibaba Cluster Trace dataset. Every benchmark reproducible from a single command.
Rust, Python, NCCL, Kubernetes, GPU Scheduling, Distributed Systems, Multi-Agent Systems, Fault Tolerance
TASFT
Task-Aware Sparse Fine-Tuning. Co-trains LoRA adapters with block-sparse attention gates. 2-5x decode throughput at 70-85% sparsity. 676 tests passing. Cuts inference costs without pretending accuracy doesn't matter.
Python, PyTorch, LoRA/QLoRA, CUDA, FlashAttention-2, Block-Sparse Attention, vLLM, Quantization, Model Compilation, Transformer Architecture Optimization, Mixed Precision, Gradient Checkpointing
KubeBalance
Kubernetes scheduler plugin. Network topology-aware, cost-based, and performance-driven pod placement. The scheduler your cluster should have shipped with.
Go, Kubernetes, Docker, Helm, GPU Scheduling, Cold-Start Optimization, Multi-Region, Ingress, Load Balancing
AirflowLLM
Generate production-ready Airflow DAGs from natural language. 45 tokens/sec on CodeLlama 7B. ~700ms on an M2 Pro. No API calls. No cloud dependency. Your DAGs, your machine.
Python, Apache Airflow, LLMs, Ollama, vLLM, Model Serving
EdgeTrain
Neural network training in the browser. WebGPU compute shaders. No server. No Python. The model trains on your GPU, in your tab.
TypeScript, WebGPU, WGSL
SimTextGuard
AI-generated text detection in C++. Jaccard similarity against known AI responses. Fast enough to run inline on submission.
C++, NLP, Pybind11
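For readers unfamiliar with the measure, Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The Python sketch below shows how a submission could be scored against known AI responses. It is an illustration only, not SimTextGuard's C++ code. Word-level tokenization and the `tokens`, `jaccard`, and `max_similarity` helpers are assumptions.

```python
import re


def tokens(text):
    """Lowercased word set; a real detector might use word or character shingles."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def max_similarity(submission, known_responses):
    """Highest Jaccard score between the submission and any known response."""
    sub = tokens(submission)
    return max((jaccard(sub, tokens(r)) for r in known_responses), default=0.0)
```

A caller would flag a submission when `max_similarity` exceeds a threshold. The threshold is a tuning choice that the original text does not specify.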
PokerGenius
Poker AI. Monte Carlo tree search, neural hand evaluation, adaptive opponent modeling. Game theory applied to a game most people think is about luck.
Python, Game Theory, Monte Carlo, Neural Networks