Microsoft Director, System Reliability Engineering

Job is more than three months old.

Job Details

Posted date: May 12, 2025

Category: Hardware Engineering

Location: Redmond, WA

Estimated salary: $215,800
Range: $137,600 - $294,000

Employment type: Full-Time

Travel amount: 25.0%

Work location type: Up to 50% work from home

Role: People Manager

Description

Microsoft Silicon, Cloud Hardware Infrastructure Engineering (SCHIE) is the team behind Microsoft’s expanding Cloud Infrastructure and responsible for powering Microsoft’s “Intelligent Cloud” mission. SCHIE delivers the core infrastructure and foundational technologies for Microsoft's over 200 online businesses, including Bing, MSN, Office 365, Xbox Live, Teams, OneDrive and the Microsoft Azure platform, globally with our server and data center infrastructure, security and compliance, operations, globalization, and manageability solutions. Our focus is on smart growth, high efficiency, and delivering a trusted experience to customers and partners worldwide, and we are looking for passionate, high-energy engineers to help achieve that mission.

As Microsoft's Cloud business continues to grow, the ability to deploy new offerings and HW infrastructure on time, in high volume, with high quality and at the lowest cost is of paramount importance. To achieve this goal, the Hardware, Infrastructure Management, and Fundamentals Engineering (HIFE) team is instrumental in defining and delivering operational measures of success for Cloud infrastructure reliability, improving the planning process, manufacturing, quality, delivery at scale, serviceability and sustainability. We are looking for a System Reliability Engineering Leader with a passion for customer-focused solutions, insight and industry knowledge to envision and implement future technical solutions that will optimize the Cloud infrastructure and its reliability.

We are looking for an experienced Director, System Reliability Engineering who will drive reliability performance across architecture, design, component and material selections, manufacturing and integration of datacenter hardware, ensuring that all electrical, mechanical, thermal, environmental, transportation and operational aspects, along with telemetry, diagnostics and the SW/FW stack of the cloud solution, are optimized throughout the lifecycle of each cloud service. The candidate will interact with Engineering, Supply Chain, Sourcing, Manufacturing & Quality, Fleet Management, Datacenter Operations, and other internal and external stakeholders.

Lead the design, implementation, and continuous improvement of reliability practices across our AI infrastructure. Ensure the performance, scalability, and resilience of AI systems in production environments.

Lead the development and execution of both systems and components’ reliability engineering strategies for all Cloud platforms and services.

Collaborate across HW and SW architecture, data engineering, and platform teams to ensure robust deployment of resilient solutions and services.

Lead strategic innovations and develop processes to integrate industry practices to ensure efficiency in achieving high reliability and quality.

Design and implement observability frameworks tailored to AI workloads.

Drive incident response, root cause analysis, and postmortem processes for HW system outages or degradations.

Establish and monitor SLAs (Availability, Node In Service, Time to restore Availability) for all cloud services, ensuring alignment with business goals and product requirements.

Foster a culture of reliability, automation, consistency of execution and continuous improvement across engineering teams.

Support manufacturing, datacenter operation, troubleshooting and diagnostic methods to optimize the cloud infrastructure reliability.

Qualifications

Required/minimum qualifications

Bachelor's Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 8+ years technical engineering experience OR Master's Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 7+ years technical engineering experience OR Doctorate Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 5+ years technical engineering experience.

5+ years of people management including resource planning, career development and performance management.

5+ years of experience in system reliability, site reliability engineering, or infrastructure engineering, with at least 1 year focused on AI systems.

Other Requirements

Ability to meet Microsoft, customer and/or government security screening requirements is necessary for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check.

Preferred Qualifications

Bachelor's Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 12+ years technical engineering experience OR Master's Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 10+ years technical engineering experience OR Doctorate Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 7+ years technical engineering experience.

Experience in AI lifecycle, including model training, deployment, monitoring, and retraining.

Experience in cloud fleet management, telemetry, diagnostic and troubleshooting of IT systems.

Experience and knowledge in the server industry product development process.

Experience in managing cross-functional teams and large-scale distributed systems.

Experience with system reliability, manufacturing process and datacenter operations, leading continuous improvements through automation.

Experience with liquid cooling infrastructure for IT racks.

Reliability Engineering M5 - The typical base pay range for this role across the U.S. is USD $137,600 - $267,000 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $180,400 - $294,000 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay

Microsoft will accept applications for the role until May 26th, 2025.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request via the Accommodation request form.

Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.

#azurehwjobs #HIFE #Azure #Cloud #Hardware #AHSI
