AI Infrastructure Engineer

Lead AI Infrastructure engagements — GPU clusters, model serving, MLOps, LLM observability. The work that makes production AI as boring as the rest of the stack.

Remote (India / EU-friendly hours) · Full-time · Senior

What you’d do

Lead AI Infrastructure engagements. Discovery, architecture, hands-on delivery. The clients are mostly Series A–C teams putting their first ML or LLM workloads into production.
Design GPU cluster topologies. EKS / GKE / AKS GPU node groups, NVIDIA GPU Operator, Spot vs. on-demand mix, multi-region resilience.
Stand up serving runtimes. vLLM, TGI, Triton — picking, deploying, tuning, autoscaling. You know which flag matters and which is theatre.
Build the MLOps platform. Argo Workflows, Kubeflow, MLflow — depending on the client’s existing stack. Paved-path templates so data scientists ship without filing platform tickets.
Wire LLM observability. Langfuse / Phoenix / Helicone alongside the standard stack. Make hallucinations a debuggable artefact.

What we’re looking for

4+ years on production infrastructure, with at least 1 year on AI/ML workloads
Deep Kubernetes — you’ve shipped GPU workloads, you know the failure modes
Comfort with the modern serving stack (vLLM, TGI, etc.) and the trade-offs between them
FinOps instincts — GPU spend is a different beast and clients hire us partly to control it
Ability to write — blog posts, runbooks, deliverable documents

Nice to have

Open-source contributions to vLLM, Kubeflow, Argo, Langfuse, or similar
Prior experience scaling ML inference at meaningful traffic
Background that spans both ops and hands-on ML (we don’t need a PhD; we do need someone who can pair fluently with data scientists)

Why CloudWizz

A real AI infrastructure practice. Not “we tried it once.” Real engagements, real metrics, real on-call.
AI-native workflow. AI does the routine; you do the senior judgment.
Open-source aligned. Our work feeds back into the OSS we publish.
Sustainable pace. No graveyard rotations.

How we hire

30-min intro call
Technical conversation on a real (anonymized) AI infra problem
Paid trial — 1–2 weeks on a real engagement with another engineer
Offer

Apply below.