Microsoft Principal Supercomputing Operations Engineering Manager

New job, posted less than a week ago!

Job Details

Posted date: Feb 27, 2026

Category: Software Engineering

Location: Multiple Locations

Estimated salary: $222,050
Range: $139,900 - $304,200

Employment type: Full-Time

Work location type: 0 days / week in-office – remote

Role: People Manager

Description

Overview

Microsoft Azure’s Artificial Intelligence and High Performance Computing (AI/HPC) organization powers some of the world’s largest cloud-native supercomputers used for frontier AI training, scientific computing, and large-scale distributed simulations. Our team builds and operates hyperscale GPU clusters that consistently place Azure among global leaders in the Top500, MLPerf, and Graph500 benchmarks. By joining us, you step into the engineering core responsible for ensuring these systems remain reliable, performant, and ready for the next wave of AI innovation.

At this scale, interconnect fabrics are a first-order reliability system that directly determines GPU availability, training throughput, and customer SLAs. As a Principal Supercomputing Operations Engineering Manager, you own the operational strategy and organizational execution for interconnect fabric reliability across flagship AI supercomputing environments. You lead teams that operate InfiniBand and GPU interconnect fabrics as a single end-to-end reliability domain, defining how they are operated, debugged, hardened, and scaled in production.

This is a hands-on technical leadership role combined with people and operational management. You are accountable not only for technical outcomes, but for building and leading high-performing engineering teams that consistently deliver availability, correctness, and resilience under extreme scale and ambiguity. You set expectations, drive execution through others, and ensure your team is prepared to respond decisively to the most complex production failures.

You lead and oversee the most severe fabric-related incidents, guiding technical direction, escalation strategy, and risk trade-offs while empowering senior engineers to execute deep investigations. Beyond incident response, you define operational strategy, reliability models, and systemic prevention mechanisms that reduce recurrence at fleet scale. Your impact multiplies through organizational leadership: developing talent, setting operational standards, influencing engineering direction across organizations, and partnering deeply with platform, hardware, firmware, and service teams to deliver durable reliability improvements.

You are responsible for ensuring that your organization produces high-quality automation, diagnostics, telemetry, playbooks, and escalation models that materially improve operability and debuggability across the platform. Through your leadership, judgment, and technical direction, Azure’s largest AI supercomputing platforms scale safely, predictably, and sustainably to meet the demands of next-generation AI workloads.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Responsibilities

Own and drive the end-to-end operational strategy for InfiniBand and GPU interconnect fabric reliability across large-scale AI supercomputing environments, ensuring sustained GPU availability, training stability, and SLA compliance.

Lead, manage, and grow a team of senior and principal engineers responsible for fabric operations, setting clear expectations, developing talent, and holding the organization accountable for outcomes.

Provide senior technical leadership and executive decision making during high-severity fabric incidents, guiding investigation strategy, escalation paths, and risk trade-offs while ensuring effective execution through the team.

Ensure consistent, high-quality incident response, root cause analysis, and post-incident follow-through across the organization, with a strong emphasis on systemic prevention over one-off fixes.

Drive operational excellence by defining reliability models, failure domains, and long-term corrective strategies, and ensuring adoption of authoritative TSGs, playbooks, and escalation frameworks.

Partner deeply with platform, hardware, firmware, and service teams to align roadmaps, influence design decisions, and close systemic reliability gaps impacting interconnect fabrics at scale.

Sponsor and prioritize automation, telemetry, diagnostics, and tooling investments that improve detection, observability, debuggability, and time to mitigation across the fleet.

Qualifications

Required Qualifications:

Bachelor's Degree in Computer Science, or related technical discipline AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.

Other Qualifications:

Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Preferred Qualifications:

Bachelor's Degree in Computer Science OR related technical field AND 10+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.

4+ years people management experience.

6+ years of experience operating large-scale distributed systems, high-performance computing (HPC), or artificial intelligence (AI) infrastructure in production environments.

Demonstrated experience leading engineering teams responsible for mission-critical production infrastructure with direct impact on service availability, GPU workloads, and customer SLAs.

Strong hands-on background in operating and debugging interconnect fabrics or similarly complex infrastructure supporting large-scale compute workloads.

Solid Linux systems knowledge with experience reasoning across operating systems, drivers, services, and hardware layers.

Proven ability to make high-impact technical and organizational decisions under ambiguity while balancing availability, risk, long-term correctness, and business impact.

Software Engineering M5 - The typical base pay range for this role across the U.S. is USD $139,900 - $274,800 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $188,000 - $304,200 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here:

https://careers.microsoft.com/us/en/us-corporate-pay

This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.
