We are hiring a Lead Infrastructure Engineer to design, build, and operate scalable, production-grade infrastructure from the ground up for enterprise and big-bank clients with critical data. You'll own and develop our deployment pipelines, observability systems, and cloud infrastructure as we transition to a Kubernetes-based architecture. This is an onsite role in San Francisco, because close collaboration with product, engineering, and leadership is critical.
You’ll have significant autonomy: we're looking for someone who can operate independently, set technical direction, and build systems that scale without extensive guidance.
Responsibilities
- Architect and implement a new scalable, reliable, and secure infrastructure for real-time AI-driven voice services.
- Lead our migration to Kubernetes (from ECS) and establish infrastructure best practices.
- Build, optimize, and maintain CI/CD pipelines to support rapid and safe deployments.
- Own monitoring, alerting, and incident response systems to ensure uptime and performance.
- Serve as the primary point of contact (PoC) for on-call responsibilities to keep systems up 24/7.
- Automate operational workflows and infrastructure provisioning (Infrastructure-as-Code).
- Collaborate with engineering teams to debug live issues, improve system resilience, and optimize performance.
Requirements
- 5+ years of DevOps, SRE, or infrastructure engineering experience, including experience leading projects independently.
- Deep expertise in cloud environments (AWS, GCP, or similar).
- Strong experience with containerization and orchestration (Docker & Kubernetes) in production environments.
- Strong awareness of networking concepts and how to implement them within AWS (DNS, HTTP(S), SSH, FTP, SMTP, firewalls, NAT).
- Proficiency with infrastructure-as-code tools (Terraform, Helm, etc.).
- Strong software engineering skills, with demonstrated proficiency in languages commonly used for infrastructure automation, such as Python, Go, and Bash.
- Experience designing monitoring and alerting systems (e.g., Prometheus, Datadog, Grafana).
- Strong understanding of security, reliability, and scaling best practices for cloud-native systems.
- Excellent communication skills and a hands-on, ownership-driven mindset.
- Willingness to work long hours: 8 am to 7 pm is a good day.
- In-person in San Francisco four days a week.
Nice to have
- Experience working with real-time communication systems (e.g., SIP, WebRTC, LiveKit).
- Background in highly regulated industries (e.g., financial services, healthcare).