About the Team:
The Live SRE team is a core part of the Netflix Streaming organization, responsible for the reliability and performance of systems that directly impact the playback experience for our 270M+ subscribers.

About the Role:
As a Site Reliability Engineer 5 on the Live SRE team, you will be instrumental in ensuring the highest level of reliability, scalability, and efficiency of Netflix’s streaming infrastructure.

Key Responsibilities:
Drive the design, implementation, and maintenance of highly reliable, scalable, and performant systems for Netflix’s streaming infrastructure.
Lead efforts to improve system observability, monitoring, and alerting, ensuring proactive identification and resolution of potential issues.
Develop and implement robust incident response procedures, participating in on-call rotations and leading efforts to minimize MTTR.
Collaborate with engineering teams across Netflix to promote SRE best practices, influence architectural decisions, and ensure system resilience.
Identify and address performance bottlenecks, capacity constraints, and reliability risks, implementing solutions to optimize system health.
Automate operational tasks and build tools to improve efficiency and reduce manual toil.
Mentor and guide junior engineers, fostering a culture of continuous learning and operational excellence.

What You Bring to the Team:
10+ years of experience in Site Reliability Engineering, DevOps, or a related field, with a strong focus on distributed systems.
Deep expertise in cloud computing platforms (AWS, Azure, GCP) and containerization technologies (Docker, Kubernetes).
Proficiency in one or more programming languages (Java, Python, Go, Rust).
Extensive experience with monitoring and observability tools (Prometheus, Grafana, ELK Stack, Jaeger).
Proven track record of leading complex incident response efforts and performing root cause analysis.
Strong understanding of networking, operating systems, and distributed consensus algorithms.
Excellent problem-solving skills, with a proactive and methodical approach to troubleshooting.
Exceptional communication and collaboration skills, with the ability to influence and lead technical discussions.
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience.