Senior Site Reliability Engineer

The organization is looking for a Senior Site Reliability Engineer to join our Core Platforms organization. In this role, you will be part of a team responsible for ensuring that our critical infrastructure, which serves millions of users daily, is reliable, performant, and scalable. You will work on a diverse set of technologies, contributing to the stability and growth of one of the world’s most visited websites.

As a Senior SRE, you will be responsible for:

Designing, implementing, and maintaining scalable and reliable infrastructure systems across various environments.
Proactively identifying and addressing potential issues, bottlenecks, and areas for improvement in our systems.
Developing and implementing monitoring, alerting, and observability solutions to ensure the health and performance of our services.
Automating operational tasks, deploying new services, and improving existing deployment processes.
Participating in on-call rotations to respond to incidents and ensure timely resolution of production issues.
Collaborating with other engineering teams to provide SRE expertise, promote best practices, and contribute to system architecture design.
Mentoring junior SREs and contributing to the growth of the team’s technical capabilities.
Documenting procedures, configurations, and system designs to ensure maintainability and knowledge sharing.

We’d love to hear from you if you have:

6+ years of experience in Site Reliability Engineering, DevOps, or a related field, with a strong focus on distributed systems.
Expertise in at least one major cloud provider (e.g., AWS, GCP, Azure) as well as in on-premises infrastructure.
Proficiency in infrastructure as code (IaC) tools such as Terraform, Ansible, or Puppet.
Strong programming skills in at least one language (e.g., Python, Go, Java, Rust) and experience with scripting for automation.
Experience with containerization and orchestration technologies (e.g., Docker, Kubernetes).
In-depth knowledge of Linux operating systems, networking, and security best practices.
Experience with monitoring and observability tools (e.g., Prometheus, Grafana, ELK stack).
Excellent problem-solving skills, with a track record of diagnosing and resolving complex technical issues.
Strong communication and collaboration skills, with the ability to work effectively in a distributed team environment.
Ability to participate in on-call rotations and respond to incidents outside of regular business hours.
Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience.

Bonus points if you have:

Experience with large-scale, high-traffic web applications.
Familiarity with open-source technologies and community contributions.
Experience with database systems (e.g., MySQL, PostgreSQL, Cassandra).
Familiarity with content delivery networks (CDNs).
Prior experience working in a nonprofit or mission-driven organization.

The organization is an equal opportunity employer, and we value diversity. We don’t discriminate on the basis of age, ancestry, citizenship, color, disability, ethnicity, family status, gender identity or expression, marital status, national origin, race, religion, sex, sexual orientation, or veteran status. We strive to provide an inclusive and equitable environment where every employee feels respected and valued.

The organization is a 501(c)(3) nonprofit organization that operates Wikipedia and its sister sites. We believe in making knowledge available to everyone for free, and we rely on the support of our readers and volunteers to make that happen.

If you’re excited by the prospect of contributing to a global mission and working on a platform that impacts millions worldwide, we encourage you to apply!
