Work

Four years of building the infrastructure behind AI systems, trading platforms, and ML pipelines.

Founding AI Infrastructure & Systems Engineer

4MINDS — 4minds.ai

May 2025 – Present

Built production inference, training pipelines, and GPU scheduling across multi-region Kubernetes. Wrote custom CUDA kernels where off-the-shelf runtimes couldn't hit latency targets.

Python, Kubernetes, PyTorch, Ray, vLLM, TensorRT, TensorRT-LLM, torch.compile, CUDA, Custom CUDA Kernels, TransformerEngine, FlashAttention, Nsight Compute, Nsight Systems, ArgoCD, Helm, Kustomize, Prometheus, OpenTelemetry, Grafana, AWS, Docker, GitOps, CI/CD, GPU Scheduling, Mixed Precision, ONNX Runtime

Machine Learning Engineer

GoodRx

May 2024 – May 2025

Rearchitected batch systems into real-time streaming. Built an observability platform from scratch and presented it to exec leadership. Optimized SageMaker endpoints until inference costs stopped being a line item anyone questioned.

Apache Airflow, Python, AWS, SageMaker, gRPC, Databricks, Kubernetes, Docker, Helm, Terraform, Prometheus, OpenTelemetry, Distributed Tracing, CI/CD Pipelines, MLflow, Model Serving, ETL Pipelines, SQL, Load Balancing, IAM

ML Engineer, Quantitative Research

Tier-1 Market Making Firm

Aug 2022 – May 2024

25 TB of market data. Every day. Sub-millisecond latency. I built the tick-level processing system behind $2M+ in annual trading decisions. Cut order execution latency by 78%.

C++, Python, Apache Kafka, Apache Spark, Low-Latency Networking, GPU Profiling, TLS, DNS, Network Optimization, Real-Time Analytics, gRPC, Bash

Data Engineer

VHN

May 2021 – Sep 2021

Seven business units with zero interoperability. I wired ML platforms into legacy Teradata and Oracle systems. Cross-system compatibility up 65%. Data quality up 85%.

Python, SQL, Teradata, Oracle, ETL, Data Pipelines, Data Governance, Java

Proprietary Work

Closed source. Patent pending.

WMServe

Production inference for video world models. Custom spatiotemporal PagedAttention. Sub-50ms latency at 10K+ concurrent requests. 99.99% availability. 85%+ GPU utilization. The first system that makes world models fast enough for robotics control loops.

Go, CUDA C++, Python, PagedAttention, FlashAttention, Kubernetes, gRPC, Raft Consensus, OpenTelemetry, GPU Memory Management, Kernel Fusion, Occupancy Optimization, Model Serving Architecture, Quantization (FP16), Nsight Compute
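The paged KV-cache idea behind this kind of serving can be sketched in a few lines. This is a minimal illustration, not WMServe's actual API: the `BlockTable` name, block size, and free-list allocator are all assumptions.

```python
# Sketch of a PagedAttention-style block table: the KV cache is split into
# fixed-size physical blocks, and each sequence keeps a logical->physical
# mapping, so memory is allocated on demand instead of preallocated per request.
# Names (BlockTable, BLOCK_SIZE) are illustrative only.

BLOCK_SIZE = 16  # tokens per physical KV block

class BlockTable:
    def __init__(self, free_list):
        self.free_list = free_list
        self.blocks = []       # physical block ids, in logical order
        self.num_tokens = 0

    def append_token(self):
        """Reserve space for one more token, grabbing a new block at boundaries."""
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.free_list.pop())
        self.num_tokens += 1

    def physical_slot(self, token_idx):
        """Map a logical token index to (physical_block, offset)."""
        return self.blocks[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

free_blocks = list(range(1024))  # toy free list
table = BlockTable(free_blocks)
for _ in range(40):
    table.append_token()

print(len(table.blocks))        # 40 tokens at 16 per block -> 3 blocks
print(table.physical_slot(17))  # lands in the second allocated block
```

The payoff is that a sequence only ever holds the blocks it has actually filled, which is what makes high concurrency fit in fixed GPU memory.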

FlowLLM

Custom hypervisor for AI inference. No Linux kernel. No CUDA driver. No Python runtime. Direct GPU control in Rust and Assembly. 95% overhead reduction. 15-70 microsecond stack latency. Boots in 50 microseconds. Linux takes 30 seconds.

Rust, Assembly, CUDA, Bare Metal, GPU Programming, Warp-Level Primitives, GPU Memory Management, Custom CUDA Kernels, Nsight Systems, Profiling

APEX

GPU-native vector database. 3.5M queries per second per GPU. 1.8 microsecond p50 latency. 500K inserts per second. 10x cheaper than cloud vector providers. Built from first principles on tensor cores.

CUDA, Tensor Cores, Rust, NVLink, GPUDirect, Lock-Free Algorithms, GPU FinOps, Kernel Fusion, Occupancy Optimization, Custom CUDA Kernels
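What a fused GPU query scan computes can be pinned down with a CPU-side reference. On the GPU this is tensor-core matmuls plus a warp-level k-selection; the pure-Python version below only fixes the semantics and is not APEX code.

```python
import heapq

# Reference semantics for a fused score-and-select pass: score every stored
# vector against the query (inner product here) and keep the top-k in one
# streaming pass over the data.

def top_k(query, vectors, k):
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # heapq.nlargest maintains a size-k heap while streaming over the scores
    scored = ((dot(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in heapq.nlargest(k, scored)]

vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(top_k([1.0, 0.0], vectors, 2))  # indices of the two best matches
```

Keeping scoring and selection in one pass is the point: the candidate set never round-trips through host memory.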

SchemaForge

Declarative database infrastructure. No migrations. Bidirectional state convergence with SMT-verified invariants. O(n log n) complexity guarantees. Parallel DDL via dependency graph. Production-tested at FAANG scale.

Rust, SMT Solver, PostgreSQL, Formal Verification, Graph Theory, CI/CD, Distributed Systems
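The "parallel DDL via dependency graph" piece reduces to topological scheduling: statements with no unmet dependencies run together in one wave. A minimal sketch using Kahn's algorithm, with illustrative statement names (not SchemaForge's internal model):

```python
# Wave-parallel DDL scheduling: each wave contains statements whose
# dependencies have all executed, so everything in a wave can run concurrently.

def ddl_waves(deps):
    """deps maps statement -> set of statements it must run after."""
    remaining = {s: set(d) for s, d in deps.items()}
    waves = []
    while remaining:
        ready = sorted(s for s, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency cycle")
        waves.append(ready)
        for s in ready:
            del remaining[s]
        for d in remaining.values():
            d.difference_update(ready)
    return waves

deps = {
    "create users": set(),
    "create orders": {"create users"},          # FK to users
    "index orders.user_id": {"create orders"},
    "create audit_log": set(),
}
print(ddl_waves(deps))  # users and audit_log in parallel, then orders, then the index
```

Cycle detection falls out for free: if no statement is ready but work remains, the declared state is unsatisfiable.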

Open Source

TASFT

code

Task-Aware Sparse Fine-Tuning. Co-trains LoRA adapters with block-sparse attention gates. 2-5x decode throughput at 70-85% sparsity. 676 tests passing. Cuts inference costs without pretending accuracy doesn't matter.

Python, PyTorch, LoRA/QLoRA, CUDA, FlashAttention-2, Block-Sparse Attention, vLLM, Quantization, Model Compilation, Transformer Architecture Optimization, Mixed Precision, Gradient Checkpointing
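The gating idea can be shown without any framework code: attention runs over fixed-size blocks, and a per-block gate decides which key blocks each query block keeps. In TASFT the gate scores are trained jointly with the LoRA adapters; here they are hand-picked constants, and all names are illustrative.

```python
# Turn per-block gate scores into a block-sparse attention mask, and measure
# the resulting sparsity (the fraction of key blocks skipped at decode time).

def block_mask(gate_scores, threshold=0.5, keep_diagonal=True):
    """gate_scores[q][k] in [0,1]; returns which key blocks each query block keeps."""
    n = len(gate_scores)
    mask = [[score >= threshold for score in row] for row in gate_scores]
    if keep_diagonal:  # never drop a block's attention to itself
        for i in range(n):
            mask[i][i] = True
    return mask

def sparsity(mask):
    total = sum(len(row) for row in mask)
    kept = sum(v for row in mask for v in row)
    return 1 - kept / total

scores = [
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.3, 0.1, 0.0],
    [0.2, 0.7, 0.9, 0.1],
    [0.1, 0.1, 0.6, 0.4],
]
m = block_mask(scores)
print(sparsity(m))  # fraction of block-pairs the decode kernel never touches
```

Skipped blocks are work the sparse kernel never launches, which is where the decode-throughput multiplier comes from.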

KubeBalance

code

Kubernetes scheduler plugin. Network topology-aware, cost-based, and performance-driven pod placement. The scheduler your cluster should have shipped with.

Go, Kubernetes, Docker, Helm, GPU Scheduling, Cold-Start Optimization, Multi-Region, Ingress, Load Balancing
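The scoring a plugin does at the scheduler's Score extension point amounts to blending a few signals into a 0-100 number per candidate node. A toy version in Python (the plugin itself is Go); the weights and node fields are illustrative, not KubeBalance's actual config:

```python
# Blend spot price, zone locality to the pod's chattiest peers, and current
# load headroom into a single per-node score; the scheduler binds to the max.

def score_node(node):
    cost = 1 - node["hourly_cost"] / node["max_hourly_cost"]  # cheaper is better
    topo = 1.0 if node["zone"] == node["peer_zone"] else 0.3  # same-zone bonus
    load = 1 - node["cpu_utilization"]                        # headroom is better
    return round(100 * (0.4 * cost + 0.4 * topo + 0.2 * load))

nodes = [
    {"name": "a", "hourly_cost": 1.0, "max_hourly_cost": 4.0,
     "zone": "us-east-1a", "peer_zone": "us-east-1a", "cpu_utilization": 0.7},
    {"name": "b", "hourly_cost": 3.0, "max_hourly_cost": 4.0,
     "zone": "us-east-1b", "peer_zone": "us-east-1a", "cpu_utilization": 0.2},
]
best = max(nodes, key=score_node)
print(best["name"])  # the cheap, zone-local node wins despite higher load
```

The default scheduler only sees the load term; folding in cost and topology is what changes placements.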

AirflowLLM

code

Generate production-ready Airflow DAGs from natural language. 45 tokens/sec on CodeLlama 7B. ~700ms on an M2 Pro. No API calls. No cloud dependency. Your DAGs, your machine.

Python, Apache Airflow, LLMs, Ollama, vLLM, Model Serving
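One guardrail any generate-then-deploy pipeline needs, whether or not this is how AirflowLLM implements it: statically check that the model's output parses and actually instantiates a DAG before it reaches the scheduler. A sketch using only the standard library:

```python
import ast

# Reject model output that either fails to parse or never calls DAG(...).
# This is a hypothetical validation step, shown for illustration.

def looks_like_dag(source):
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    calls = [n for n in ast.walk(tree) if isinstance(n, ast.Call)]
    # match both DAG(...) and airflow.DAG(...) call forms
    return any(getattr(c.func, "id", getattr(c.func, "attr", None)) == "DAG"
               for c in calls)

good = "from airflow import DAG\ndag = DAG('etl', schedule='@daily')"
print(looks_like_dag(good))            # accepted
print(looks_like_dag("def broken(:"))  # rejected: doesn't even parse
```

Because it inspects the AST rather than executing the code, the check is safe to run on untrusted model output.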

EdgeTrain

code

Neural network training in the browser. WebGPU compute shaders. No server. No Python. The model trains on your GPU, in your tab.

TypeScript, WebGPU, WGSL
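The math a training compute shader dispatches per step is ordinary gradient descent. A CPU reference in Python for a tiny linear model, y = w*x + b with mean-squared error; the browser version does the same arithmetic per element across workgroups, and the numbers here are illustrative:

```python
# One SGD step on a 1-D linear model: compute the MSE gradients over the
# batch, then move the parameters against them.

def sgd_step(w, b, xs, ys, lr=0.1):
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return w - lr * grad_w, b - lr * grad_b

xs, ys = [0.0, 1.0, 2.0], [1.0, 3.0, 5.0]  # generated by w=2, b=1
w, b = 0.0, 0.0
for _ in range(200):
    w, b = sgd_step(w, b, xs, ys)
print(round(w, 2), round(b, 2))  # converges toward w=2, b=1
```

Expressing the gradient sums as parallel reductions is essentially the whole WGSL port.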

SimTextGuard

code

AI-generated text detection in C++. Jaccard similarity against known AI responses. Fast enough to run inline on submission.

C++, NLP, Pybind11
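The core check fits in a dozen lines; here in Python for brevity (the shipped detector is C++ behind Pybind11). Shingle both texts into word n-grams and compare Jaccard overlap against a flag threshold. The n-gram size and threshold below are illustrative choices, not the project's tuned values.

```python
# Jaccard similarity over word trigrams: |A ∩ B| / |A ∪ B| of the two
# shingle sets. Cheap enough to run inline on every submission.

def shingles(text, n=3):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=3):
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

known = "as an ai language model i cannot provide that information"
submission = "as an ai language model i cannot provide legal advice"
print(jaccard(known, submission) > 0.4)  # overlap is high enough to flag
```

Using shingle sets rather than raw strings makes the score robust to small edits at the tail of a copied response.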

PokerGenius

code

Poker AI. Monte Carlo tree search, neural hand evaluation, adaptive opponent modeling. Game theory applied to a game most people think is about luck.

Python, Game Theory, Monte Carlo, Neural Networks
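The Monte Carlo half of this is easy to show on a stripped-down game: single hole card, high card wins. The real engine rolls out full boards with a neural evaluator inside the tree search; this sketch is purely illustrative.

```python
import random

# Estimate equity (win probability, ties counted as half) by sampling
# opponent cards from the remaining deck.

def equity(my_card, trials=10_000, rng=None):
    rng = rng or random.Random(0)                         # seeded for repeatability
    deck = [c for c in range(2, 15) for _ in range(4)]    # ranks 2..14, four suits
    deck.remove(my_card)                                  # my card is no longer live
    wins = ties = 0
    for _ in range(trials):
        opp = rng.choice(deck)
        if my_card > opp:
            wins += 1
        elif my_card == opp:
            ties += 1
    return (wins + 0.5 * ties) / trials

print(equity(14) > equity(7) > equity(2))  # ace beats a seven beats a deuce
```

The same loop generalizes to real poker: replace "draw one card" with "roll out a full board" and the comparison with a hand evaluator, and equity estimates drive the tree search.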