Job Description:
• Design, implement, and maintain scalable and reliable infrastructure to support the Netflix Streaming Suite.
• Collaborate with engineering and product teams to integrate observability, reliability, and security considerations into the entire software development lifecycle.
• Develop and implement automation tools for monitoring, deployment, and incident response to ensure efficient and reliable operations.
• Participate in on-call rotations to ensure the 24/7 health of the Netflix Streaming Suite and contribute to incident response, diagnosis, and resolution.
• Implement and maintain a robust incident response framework, including blame-aware incident reviews to learn from operational surprises.
• Proactively identify sources of instability in distributed systems and analyze how complex systems fail from a reliability and resilience perspective.
• Champion and embed a culture of reliability across the Ads organization.
• Act as a force multiplier by creating clear documentation, developing best-practice guides, and building tooling to roll out reliability enhancements automatically.
Requirements:
• 5+ years of experience as a Site Reliability Engineer (SRE), Production Engineer, or similar role supporting business-critical, high-traffic services.
• Write code to solve problems.
• Proficient in one or more languages like Python, Go, or Java.
• Fluent in modern cloud infrastructure.
• Hands-on experience with cloud providers such as AWS/Azure/GCP.
• Experience with Infrastructure as Code such as Terraform.
• Experience with container orchestration systems like Kubernetes.
• Understand large-scale distributed systems, including their common failure modes and edge cases.
• Excellent communication skills and a proven ability to build relationships with engineering partners.
• Experience with incident management and response.
• Calmly navigate complex production issues, identify root causes, and implement effective, lasting solutions.
• Possess a growth mindset: relentlessly curious and committed to continuous improvement.
Benefits:
• Health Plans
• Mental Health support
• 401(k) Retirement Plan with employer match
• Stock Option Program
• Disability Programs
• Health Savings and Flexible Spending Accounts
• Family-forming benefits
• Life and Serious Injury Benefits
• Paid leave of absence programs
• Full-time hourly employees accrue 35 days of paid time off annually, to be used for vacation, holidays, and sick time.
• Full-time salaried employees are immediately entitled to flexible time off.