ABOUT THE OPPORTUNITY
Join a technically ambitious organisation operating at the intersection of infrastructure, security, and cutting-edge AI/ML workloads. This is a senior individual contributor role where your work directly shapes the reliability, security posture, and scalability of critical platform infrastructure. You'll be embedded in a high-trust engineering culture that values ownership, technical depth, and continuous improvement — not ticket-pushing.
The environment is complex, the challenges are real, and the impact is immediate. If you thrive in platforms where GPU-accelerated workloads, zero-trust security, and cloud-native tooling converge, this role was built for you.
PROJECT & CONTEXT
You will be working on a mature but evolving internal platform that supports advanced data and ML workloads running on Kubernetes across both cloud (AKS) and on-premises environments. The platform serves cross-functional engineering teams and demands the highest standards in security hardening, observability, and operational resilience.
The stack is modern, the team is senior, and the expectations are high — in the best possible way. You'll be operating in a GitOps-first, security-first culture where Infrastructure as Code is the norm and every change is deliberate and auditable.
WHAT WE'RE LOOKING FOR (Required)
Experience
• 7+ years in DevOps / Platform Engineering
• 2+ years in a dedicated DevSecOps capacity
Kubernetes & Infrastructure
• Deep expertise in Kubernetes — AKS, upstream K8s, or enterprise distributions
• Hands-on experience with on-premises Kubernetes: RKE2, K3s, or OpenShift
• IaC proficiency: Terraform, Helm, Kustomize, YAML, GitOps workflows
• CI/CD pipeline ownership: Azure DevOps and/or GitHub Actions
GPU & ML Workloads
• Experience deploying and managing GPU-accelerated workloads using NVIDIA operators, GPU device plugins, and/or Run:AI
Security (Non-Negotiable)
• Experience operating Kubernetes environments hardened to CIS Benchmarks
• Zero Trust network architecture principles
• Container security tooling: Trivy, Aqua Security, Prisma Cloud, or equivalent
• SAST/DAST/SCA toolchain implementation
• RBAC, NetworkPolicy, and Pod Security Admission configuration
• Encryption at rest and in transit; certificate lifecycle management
• Secrets management: HashiCorp Vault and/or Azure Key Vault
• Keycloak configuration and identity management
Observability
• Prometheus and Grafana — deployment, configuration, and dashboard ownership
Networking & Linux
• Strong Linux fundamentals and networking depth: TLS, DNS, Ingress, OAuth/OIDC, Azure VNet and VNet peering, VPN and jump host configuration
Data & Storage
• Experience managing MinIO, MLflow, and PostgreSQL (HA configurations and backup strategies)
Scripting & Languages
• Strong scripting in Python and Bash
NICE TO HAVE (Preferred)
• Experience supporting ML platforms: Kubeflow, MLflow, KServe
• Knowledge of distributed storage systems: Ceph or NetApp
• Background in regulated industries — automotive, aerospace, medical, energy, or railway
• Programming experience in Go