Microsoft Principal Supercomputing Operations Software Engineer

New job, posted less than a week ago!

Job Details

Posted date: Mar 03, 2026

Category: Software Engineering

Location: Multiple Locations

Estimated salary: $222,050
Range: $139,900 - $304,200

Employment type: Full-Time

Work location type: 0 days / week in-office – remote

Role: Individual Contributor

Description

Overview

Microsoft Azure’s Artificial Intelligence and High Performance Computing (AI/HPC) organization powers some of the world’s largest cloud-native supercomputers used for frontier AI training, scientific computing, and large-scale distributed simulations. Our team builds and operates hyperscale GPU clusters that consistently place Azure among global leaders in the Top500, MLPerf, and Graph500 benchmarks. By joining us, you step into the engineering core responsible for ensuring these systems remain reliable, performant, and ready for the next wave of AI innovation.

At this scale, interconnect fabrics are a first-order reliability system that directly determines GPU availability, training throughput, and customer SLAs. As a Principal Supercomputing Operations Engineer, you serve as the technical authority and strategic owner for interconnect fabric operations across flagship AI supercomputing environments. You treat InfiniBand and GPU interconnect fabrics as a single end-to-end reliability domain, defining how they are operated, debugged, hardened, and scaled in production. This is a hands-on, production-first leadership role operating at the intersection of architecture, live operations, and reliability engineering.

You will lead the most complex and impactful fabric-related incidents, making high-stakes technical decisions under ambiguity while balancing availability, risk, long-term correctness, and customer impact. Beyond resolving incidents, you define failure models, operational strategy, and systemic prevention mechanisms that reduce recurrence at fleet scale. Your impact multiplies through technical leadership: setting operational standards, influencing engineering direction across teams, mentoring senior engineers, and partnering deeply with platform, hardware, firmware, and service teams to drive durable reliability improvements.

You will architect and drive automation, diagnostics, and telemetry that materially improve operability and debuggability of interconnect fabrics, and author authoritative playbooks, TSGs, and escalation models relied on across the organization. Through your judgment, designs, and operational strategy, Azure’s largest AI platforms scale safely, predictably, and sustainably to meet the demands of next-generation AI workloads.

Microsoft’s mission is to empower every person and organization on the planet to achieve more. We work with a growth mindset, innovate to empower others, and collaborate to realize shared goals. Our culture is rooted in respect, integrity, and accountability, and we strive to build an environment where every engineer can learn, grow, and have real impact. As part of this team, you’ll help shape the next generation of cloud-scale AI infrastructure and contribute to an inclusive culture where your expertise makes a difference every day.

Responsibilities

Serve as the technical authority and DRI for InfiniBand and GPU interconnect fabric operations across large-scale AI supercomputing environments, ensuring sustained GPU availability, training stability, and SLA compliance

Lead and orchestrate complex, high-severity fabric incidents end to end, including detection, triage, mitigation, recovery, and root cause analysis, making high-impact decisions under ambiguity

Perform deep, multi-layer systems debugging across InfiniBand, Subnet Manager, GPU interconnect, PCIe, GPUs, firmware, drivers, and OS layers to identify true root causes at fleet scale

Drive operational excellence and systemic prevention by identifying recurring failure patterns, defining reliability models and failure domains, and authoring authoritative TSGs, playbooks, and escalation frameworks adopted across teams

Architect and drive automation, telemetry, diagnostics, and tooling that materially improve detection, observability, debuggability, and mean time to mitigation, raising the operational bar for interconnect fabrics across the platform

Qualifications

Required Qualifications:

Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python

OR equivalent experience.

Other Qualifications:

Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Preferred Qualifications:

Bachelor's Degree in Computer Science or related technical field AND 10+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python

OR Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python

OR equivalent experience.

6+ years of experience operating large‑scale distributed systems, high‑performance computing (HPC), or artificial intelligence (AI) infrastructure in production environments

Demonstrated ownership of mission‑critical production infrastructure with direct impact on service availability, GPU workloads, and customer SLAs

Hands‑on experience operating and debugging interconnect fabrics supporting large‑scale compute workloads

Strong Linux systems knowledge with experience debugging low‑level infrastructure issues across operating systems, drivers, and services

Proven ability to reason across hardware, firmware, drivers, and software stacks to diagnose and resolve complex production issues

Software Engineering IC5 - The typical base pay range for this role across the U.S. is USD $139,900 - $274,800 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $188,000 - $304,200 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here:

https://careers.microsoft.com/us/en/us-corporate-pay

This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.
