Work
Four years of building the infrastructure behind AI systems, trading platforms, and ML pipelines.
Founding AI Infrastructure & Systems Engineer
4MINDS — 4minds.ai
Production inference, training pipelines, GPU scheduling across multi-region Kubernetes. Custom CUDA kernels where the off-the-shelf runtimes couldn't hit latency targets.
Python, Kubernetes, PyTorch, Ray, vLLM, TensorRT, TensorRT-LLM, torch.compile, CUDA, Custom CUDA Kernels, TransformerEngine, FlashAttention, Nsight Compute, Nsight Systems, ArgoCD, Helm, Kustomize, Prometheus, OpenTelemetry, Grafana, AWS, Docker, GitOps, CI/CD, GPU Scheduling, Mixed Precision, ONNX Runtime
Machine Learning Engineer
GoodRx
Rearchitected batch systems into real-time streaming. Built an observability platform from scratch and presented it to exec leadership. Optimized SageMaker endpoints until inference costs stopped being a line item anyone questioned.
Apache Airflow, Python, AWS, SageMaker, gRPC, Databricks, Kubernetes, Docker, Helm, Terraform, Prometheus, OpenTelemetry, Distributed Tracing, CI/CD Pipelines, MLflow, Model Serving, ETL Pipelines, SQL, Load Balancing, IAM
ML Engineer, Quantitative Research
Tier-1 Market Making Firm
25TB of market data. Every day. Sub-millisecond latency. I built the tick-level processing system behind $2M+ in annual trading decisions. Cut order execution latency by 78%.
C++, Python, Apache Kafka, Apache Spark, Low-Latency Networking, GPU Profiling, TLS, DNS, Network Optimization, Real-Time Analytics, gRPC, Bash
Data Engineer
VHN
Seven business units with zero interoperability. I wired ML platforms into legacy Teradata and Oracle systems. Cross-system compatibility up 65%. Data quality up 85%.
Python, SQL, Teradata, Oracle, ETL, Data Pipelines, Data Governance, Java
Proprietary Work
Closed source. Patent pending.
WMServe
Production inference for video world models. Custom spatiotemporal PagedAttention. Sub-50ms latency at 10K+ concurrent requests. 99.99% availability. 85%+ GPU utilization. The first system that makes world models fast enough for robotics control loops.
Go, CUDA C++, Python, PagedAttention, FlashAttention, Kubernetes, gRPC, Raft Consensus, OpenTelemetry, GPU Memory Management, Kernel Fusion, Occupancy Optimization, Model Serving Architecture, Quantization (FP16), Nsight Compute
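The serving stack itself is Go and CUDA C++; the core idea behind PagedAttention-style KV caching can be sketched in a few lines of Python. Each sequence maps logical token positions to fixed-size physical blocks, so KV memory is allocated on demand instead of reserved for the worst-case sequence length. All names here are illustrative, not WMServe internals.

```python
BLOCK_SIZE = 16  # tokens per physical KV block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id, pos):
        """Ensure a physical block backs logical position `pos`;
        return (physical block id, offset within block)."""
        table = self.block_tables.setdefault(seq_id, [])
        needed = pos // BLOCK_SIZE + 1
        while len(table) < needed:
            table.append(self.free_blocks.pop())  # allocate on demand
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=64)
block, offset = cache.append_token(seq_id=0, pos=0)   # first block of seq 0
block2, _ = cache.append_token(seq_id=0, pos=17)      # crosses into a second block
```

The on-demand allocation is what makes high concurrent request counts and high GPU utilization compatible: no sequence holds memory it hasn't used yet.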
FlowLLM
Custom hypervisor for AI inference. No Linux kernel. No CUDA driver. No Python runtime. Direct GPU control in Rust and Assembly. 95% overhead reduction. 15-70 microsecond stack latency. Boots in 50 microseconds. Linux takes 30 seconds.
Rust, Assembly, CUDA, Bare Metal, GPU Programming, Warp-Level Primitives, GPU Memory Management, Custom CUDA Kernels, Nsight Systems, Profiling
APEX
GPU-native vector database. 3.5M queries per second per GPU. 1.8 microsecond p50 latency. 500K inserts per second. 10x cheaper than cloud vector providers. Built from first principles on tensor cores.
CUDA, Tensor Cores, Rust, NVLink, GPUDirect, Lock-Free Algorithms, GPU FinOps, Kernel Fusion, Occupancy Optimization, Custom CUDA Kernels
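The tensor-core kernels are the hard part; the math they accelerate is a batched inner-product scan fused with a top-k reduction. A NumPy model of that core loop, with illustrative shapes rather than APEX internals:

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 64)).astype(np.float32)   # stored vectors
db /= np.linalg.norm(db, axis=1, keepdims=True)           # unit-normalize once at insert

def topk_cosine(queries, k=5):
    """Exact top-k by cosine similarity via one GEMM plus a partial sort."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    scores = q @ db.T                                      # the tensor-core hot loop
    idx = np.argpartition(-scores, k, axis=1)[:, :k]       # unordered top-k
    order = np.argsort(np.take_along_axis(-scores, idx, axis=1), axis=1)
    return np.take_along_axis(idx, order, axis=1)          # ordered best-first

hits = topk_cosine(db[:3], k=5)  # each query's best match is itself
```

Framing search as a single GEMM is what lets a GPU implementation saturate tensor cores instead of chasing pointers through an index.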
SchemaForge
Declarative database infrastructure. No migrations. Bidirectional state convergence with SMT-verified invariants. O(n log n) complexity guarantees. Parallel DDL via dependency graph. Production-tested at FAANG scale.
Rust, SMT Solver, PostgreSQL, Formal Verification, Graph Theory, CI/CD, Distributed Systems
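Parallel DDL via a dependency graph reduces to layered topological sorting: statements with no unmet dependencies run in the same batch. A minimal sketch (Kahn's algorithm, layer by layer) with hypothetical statement names, not SchemaForge's actual planner:

```python
from collections import defaultdict

def ddl_batches(deps):
    """deps: {stmt: set of stmts it depends on} -> list of parallel batches."""
    indeg = {s: len(d) for s, d in deps.items()}
    children = defaultdict(list)
    for stmt, parents in deps.items():
        for parent in parents:
            children[parent].append(stmt)
    ready = sorted(s for s, n in indeg.items() if n == 0)
    batches = []
    while ready:
        batches.append(ready)                  # everything here can run concurrently
        nxt = []
        for stmt in ready:
            for child in children[stmt]:
                indeg[child] -= 1
                if indeg[child] == 0:
                    nxt.append(child)
        ready = sorted(nxt)
    if sum(len(b) for b in batches) != len(deps):
        raise ValueError("dependency cycle in DDL graph")
    return batches

plan = ddl_batches({
    "create_table_users": set(),
    "create_table_orders": set(),
    "add_fk_orders_users": {"create_table_users", "create_table_orders"},
    "create_index_orders": {"create_table_orders"},
})
# batch 1 creates both tables in parallel; batch 2 adds the FK and the index
```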
Open Source
TASFT
Task-Aware Sparse Fine-Tuning. Co-trains LoRA adapters with block-sparse attention gates. 2-5x decode throughput at 70-85% sparsity. 676 tests passing. Cuts inference costs without pretending accuracy doesn't matter.
Python, PyTorch, LoRA/QLoRA, CUDA, FlashAttention-2, Block-Sparse Attention, vLLM, Quantization, Model Compilation, Transformer Architecture Optimization, Mixed Precision, Gradient Checkpointing
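The throughput win comes from the gate deciding which (query-block, key-block) tiles of the attention score matrix get computed at all. A dense NumPy sketch of that tiling, with illustrative shapes and a fixed gate standing in for the learned one:

```python
import numpy as np

BLOCK = 4  # tile edge, in tokens

def block_sparse_scores(q, k, gate):
    """q, k: (seq, dim); gate: (seq//BLOCK, seq//BLOCK) boolean keep-mask.
    Skipped tiles stay at -inf, i.e. masked out before softmax."""
    seq = q.shape[0]
    scores = np.full((seq, seq), -np.inf, dtype=np.float32)
    for bi in range(seq // BLOCK):
        for bj in range(seq // BLOCK):
            if not gate[bi, bj]:
                continue  # on real hardware this tile costs nothing
            rows = slice(bi * BLOCK, (bi + 1) * BLOCK)
            cols = slice(bj * BLOCK, (bj + 1) * BLOCK)
            scores[rows, cols] = q[rows] @ k[cols].T
    return scores

rng = np.random.default_rng(1)
q = rng.standard_normal((8, 16)).astype(np.float32)
k = rng.standard_normal((8, 16)).astype(np.float32)
gate = np.array([[True, False], [True, True]])  # 75% density in this toy example
scores = block_sparse_scores(q, k, gate)
```

Co-training the gate with the LoRA adapters is what keeps the kept tiles aligned with what the task actually needs; the sketch only shows the compute-skipping mechanics.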
KubeBalance
Kubernetes scheduler plugin. Network topology-aware, cost-based, and performance-driven pod placement. The scheduler your cluster should have shipped with.
Go, Kubernetes, Docker, Helm, GPU Scheduling, Cold-Start Optimization, Multi-Region, Ingress, Load Balancing
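The plugin is Go against the Kubernetes scheduling framework; the shape of a combined topology/cost/performance score is easy to sketch in Python. Weights and node fields below are illustrative, not KubeBalance's actual model:

```python
def score_node(node, pod_zone, w_topology=0.5, w_cost=0.3, w_perf=0.2):
    """Higher is better; each term is normalized to [0, 1]."""
    topology = 1.0 if node["zone"] == pod_zone else 0.0     # keep traffic in-zone
    cost = 1.0 - node["hourly_cost"] / node["max_hourly_cost"]
    perf = 1.0 - node["p99_latency_ms"] / node["max_latency_ms"]
    return w_topology * topology + w_cost * cost + w_perf * perf

nodes = [
    {"name": "a", "zone": "us-east-1a", "hourly_cost": 1.0, "max_hourly_cost": 4.0,
     "p99_latency_ms": 20, "max_latency_ms": 100},
    {"name": "b", "zone": "us-east-1b", "hourly_cost": 0.5, "max_hourly_cost": 4.0,
     "p99_latency_ms": 10, "max_latency_ms": 100},
]
# node "b" is cheaper and faster, but the zone match dominates
best = max(nodes, key=lambda n: score_node(n, pod_zone="us-east-1a"))
```

In the real plugin this logic lives in the framework's Score extension point, after Filter has already removed infeasible nodes.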
AirflowLLM
Generate production-ready Airflow DAGs from natural language. 45 tokens/sec on CodeLlama 7B. ~700ms on an M2 Pro. No API calls. No cloud dependency. Your DAGs, your machine.
Python, Apache Airflow, LLMs, Ollama, vLLM, Model Serving
EdgeTrain
Neural network training in the browser. WebGPU compute shaders. No server. No Python. The model trains on your GPU, in your tab.
TypeScript, WebGPU, WGSL
SimTextGuard
AI-generated text detection in C++. Jaccard similarity against known AI responses. Fast enough to run inline on submission.

C++, NLP, Pybind11
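The detector itself is C++; the core comparison looks roughly like this Python sketch: Jaccard similarity over word shingles against a corpus of known AI responses. Shingle size and threshold here are illustrative.

```python
def shingles(text, n=3):
    """Set of n-word shingles from a text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def looks_ai_generated(text, known_responses, threshold=0.5):
    probe = shingles(text)
    return any(jaccard(probe, shingles(r)) >= threshold for r in known_responses)

known = ["as an ai language model i cannot provide that information"]
flagged = looks_ai_generated("As an AI language model I cannot provide that", known)
```

Set intersections over small shingle sets are cheap, which is what makes the inline-on-submission latency plausible.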
PokerGenius
Poker AI. Monte Carlo tree search, neural hand evaluation, adaptive opponent modeling. Game theory applied to a game most people think is about luck.
Python, Game Theory, Monte Carlo, Neural Networks
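The Monte Carlo core is the same sampling loop regardless of game complexity. Reduced here to a toy "highest card wins" game so it stays readable; the real engine swaps in a full poker hand evaluator, but the estimator structure is identical.

```python
import random

def estimate_equity(my_card, trials=20_000, seed=0):
    """Estimate win probability by sampling opponent cards from the unseen deck.
    Cards are 0-51; rank = card % 13 (12 = ace). Ties count as losses here."""
    rng = random.Random(seed)
    deck = [c for c in range(52) if c != my_card]
    wins = 0
    for _ in range(trials):
        opp = rng.choice(deck)               # sample one unseen opponent card
        wins += my_card % 13 > opp % 13      # compare ranks, suits ignored
    return wins / trials

high = estimate_equity(my_card=12)   # an ace: should win almost always
low = estimate_equity(my_card=0)     # a deuce: can never win outright
```

The estimate converges at O(1/sqrt(trials)), which is why the same loop scales from this toy to full hand-vs-range equity with nothing changed but the evaluator.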