Data Center Incident Program Manager

About the Role

As a Data Center Incident Program Manager, you will be responsible for the reliability and stability of our data center infrastructure. You will lead critical incidents from detection to resolution, develop robust incident response strategies, and continuously improve our incident management processes.

You will play a pivotal role in minimizing downtime, mitigating risks, and driving operational excellence across our global data center footprint. This is a unique opportunity to shape the future of incident management within a rapidly growing and innovative organization.

Key Responsibilities:

  • Incident Management Leadership: Lead and manage critical incidents within data center environments, ensuring rapid response, containment, resolution, and post-incident analysis.
  • Program Development: Design, implement, and continuously improve incident management processes, playbooks, and tools to enhance efficiency and effectiveness.
  • Cross-Functional Collaboration: Partner with engineering, operations, security, and other teams to develop robust incident response strategies and ensure seamless communication during incidents.
  • Post-Incident Review: Conduct thorough root cause analyses (RCAs) for all significant incidents, identify preventative measures, and track the implementation of corrective actions.
  • Training & Mentorship: Develop and deliver training programs for incident responders, fostering a culture of continuous learning and improvement.
  • Metrics & Reporting: Establish key performance indicators (KPIs) for incident management, track performance, and provide regular reports to stakeholders on incident trends and program health.
  • Tooling & Automation: Identify opportunities to leverage automation and new tools to streamline incident response workflows and reduce manual effort.
  • On-Call Rotation: Participate in an on-call rotation to provide 24/7 incident support as needed.

You might thrive in this role if you have:

  • 7+ years of experience in incident management, technical program management, or a related role within a data center or large-scale infrastructure environment.
  • Demonstrated experience leading critical incidents from detection through resolution, including post-incident review and follow-up.
  • Deep understanding of data center operations, network infrastructure, server hardware, and cloud computing concepts.
  • Proficiency in incident management frameworks (e.g., ITIL) and best practices.
  • Strong analytical and problem-solving skills with the ability to quickly assess complex technical issues.
  • Excellent communication (written and verbal), interpersonal, and leadership skills.
  • Ability to remain calm and effective under pressure, making sound decisions in high-stress situations.
  • Experience working with mission-critical infrastructure or distributed systems.

Job Category: Technology
Job Type: Remote
Job Location: USA
Organization: Job Hunting U
