Part-Time AI Infrastructure & Reliability Engineer
Part-time • Remote • Start ASAP • No H-1B sponsorship
Gensee runs AI agent infrastructure at scale — Kubernetes-based, GPU-heavy, and increasingly multi-cloud. We're looking for a part-time AI Infrastructure & Reliability Engineer to optimize that infrastructure and keep our clusters reliable as we grow quickly. You'll own AI infra and reliability across distributed model serving, traffic management, and multi-tier SLO enforcement. This is a high-leverage, hands-on role where your work directly impacts every user session.
Responsibilities:
- Design, deploy, and maintain Kubernetes clusters across GCP and alternative GPU cloud providers (RunPod, GMI Cloud, etc.), ensuring smooth cross-datacenter operations.
- Own observability: instrument services with metrics, logs, and traces; build and maintain dashboards and alerts for GPU utilization, inference latency, error rates, and queue depths.
- Manage and tune distributed model serving workloads — large multimodal language models, embedding models, and other GPU-resident services — ensuring efficient resource usage and low-latency inference.
- Implement and enforce multi-tier SLO/SLA frameworks; define error budgets and drive reliability improvements across service tiers.
- Design and maintain traffic load balancing across data centers, including failover, geo-routing, and capacity-aware scheduling.
- Automate infrastructure provisioning, scaling, and incident response.
- Participate in on-call rotation (lightweight, commensurate with part-time hours); triage and resolve incidents, write blameless post-mortems.
- Part-time, 20–30 hrs/week; fully remote; immediate start.
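To give a flavor of the SLO and error-budget work above, here is a minimal sketch of the standard availability arithmetic. The tier names and targets are purely illustrative, not Gensee's actual service tiers:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)

# Hypothetical tiers: stricter tiers get smaller error budgets.
tiers = {"premium": 0.999, "standard": 0.995, "batch": 0.99}
budgets = {tier: error_budget_minutes(slo) for tier, slo in tiers.items()}
```

A 99.9% tier, for example, leaves roughly 43 minutes of budget per 30-day window; burning it faster than that is the signal to pause feature work and drive reliability fixes.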
Qualifications:
- Bachelor's degree in Computer Science or a related field, or equivalent practical experience. Master's degree or equivalent industry experience is preferred.
- 1+ years of SRE, DevOps, or infrastructure engineering experience in production environments.
- Deep hands-on experience with Kubernetes and containers — cluster administration, networking, autoscaling, and mixed CPU/GPU node management.
- Experience operating both CPU-based services and GPU workloads in the same cluster; comfort managing resource scheduling and isolation across heterogeneous node pools.
- Familiarity with traditional virtualization and container isolation techniques (VMs, namespaces, cgroups) alongside modern ML serving systems like vLLM and SGLang.
- Experience running GPU workloads in cloud infrastructure; familiarity with GPU scheduling primitives (device plugins, MIG, time-slicing) is a plus.
- Experience with multi-cloud or hybrid-cloud environments; comfort evaluating and onboarding new cloud providers.
- Strong understanding of networking and load balancing fundamentals.
- Ability to work independently and communicate clearly in an async, distributed team.
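As context for the load-balancing fundamentals mentioned above, one common capacity-aware scheme is weighted random selection over healthy backends. The sketch below is illustrative only — the datacenter names and health model are hypothetical, not a description of Gensee's routing layer:

```python
import random

def pick_datacenter(capacity: dict, healthy: dict, rng=random) -> str:
    """Pick a datacenter, weighting each healthy one by its spare capacity.

    Unhealthy or zero-capacity datacenters are excluded, which gives
    failover for free: traffic shifts to whatever capacity remains.
    """
    candidates = {dc: cap for dc, cap in capacity.items()
                  if healthy.get(dc) and cap > 0}
    if not candidates:
        raise RuntimeError("no healthy capacity available")
    dcs, weights = zip(*candidates.items())
    return rng.choices(dcs, weights=weights, k=1)[0]
```

Geo-routing and latency-aware scheduling layer on top of the same idea by adjusting the weights per client region.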
Why Join:
- Work on genuinely hard infrastructure. Our stack is a CPU-GPU hybrid — traditional virtualization and container isolation running side-by-side with distributed ML serving systems like vLLM and SGLang. Cross-datacenter traffic balancing, multi-tier SLOs, heterogeneous node pools — problems most engineers never get near.
- Rare breadth in one role. You'll apply classical SRE practices (SLOs, error budgets, on-call, post-mortems) to infrastructure that also spans GPU scheduling, model serving, and multi-cloud orchestration. It's a uniquely full-stack infrastructure challenge.
- High ownership from day one. Small team means your decisions land in production quickly. No bureaucracy, no waiting for approval chains.
- Work with state-of-the-art AI systems. You'll be running the infrastructure that serves large multimodal models and embedding pipelines — the kind of systems that are defining the next generation of AI.
Work authorization
Candidates must already be authorized to work in the US, or in the country where they live. We do not sponsor H-1B visas.
Start
As soon as possible.
About GenseeAI:
GenseeAI is a research-driven, fast-growing startup building the foundational infrastructure layer for the future of AI agents. Instead of building yet another agent, we focus on the harder and more important layer underneath: making agents execute with far better efficiency, safety, security, and privacy in real production environments. GenseeAI was founded by a UCSD professor and an ex-Googler (L7) who spent 10+ years managing core ML/AI infrastructure at Google. If you join now, you won't just be joining a startup — you'll be helping build a piece of the stack that could become essential to the entire AI agent ecosystem.