Microsoft Principal AI Operations Engineer

Job Details

Posted date: Jan 30, 2026

Category: Software Engineering

Location: Multiple Locations

Estimated salary: $222,050
Range: $139,900 - $304,200

Employment type: Full-Time

Work location type: 0 days / week in-office – remote

Role: Individual Contributor


Description

Overview

Security represents the most critical priorities for our customers in a world awash in digital threats, regulatory scrutiny, and estate complexity. Microsoft Security aspires to make the world a safer place for all. We want to reshape security and empower every user, customer, and developer with a security cloud that protects them with end-to-end, simplified solutions. The Microsoft Security organization accelerates Microsoft’s mission and bold ambitions to ensure that our company and industry are securing digital technology platforms, devices, and clouds in our customers’ heterogeneous environments, as well as ensuring the security of our own internal estate.

The Security AI Platform team builds and operates production infrastructure that powers AI-native security capabilities at Microsoft scale. We are organized into two focused groups: Platform + Apps develops the core product, microservices, and architecture; AI Operations ensures reliability, deployments, and operational excellence. Together, we deliver mission-critical services that process millions of requests daily.

We are seeking a Principal AI Operations Engineer to define the technical direction for the AI Operations group. In this role, you will design and architect operational systems and establish standards for branch health, CI/CD pipelines, production deployments, and on-call processes. You will drive reliability initiatives, maintain production health and uptime, and ensure the platform meets its SLOs. You will be the escalation point for complex incidents and work closely with the Platform team to ensure services are operationally ready.

Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

In alignment with our Microsoft values, we are committed to cultivating an inclusive work environment for all employees to positively impact our culture every day.

This role requires the ability to work East Coast (U.S. Eastern Time) business hours to ensure alignment with team collaboration, customer needs, and operational requirements. Candidates need to be available to work a schedule that aligns with Eastern Time, regardless of their physical location.

Responsibilities

Define the operational vision, standards, and roadmap for the platform; establish SLOs, error budgets, and reliability targets
Drive technical direction for the AI Operations group: architecture for deployments, pipelines, branch health, and production reliability
Own CI/CD pipeline architecture: Azure DevOps/GitHub Actions pipelines, build optimization, artifact management, and deployment automation
Manage Kubernetes infrastructure: AKS cluster operations, Helm chart management, node pool configuration, GPU resource allocation, and autoscaling (KEDA)
Drive production deployments: canary/ring rollouts, safe deployment practices, rollback procedures, and release coordination with the Platform team
Establish and operate first-level on-call: incident response procedures, escalation paths, runbooks, and post-incident reviews
Build and maintain observability infrastructure: Prometheus, Grafana, OpenTelemetry collectors, alerting rules, and dashboard curation
Manage infrastructure as code: Bicep templates for Azure resources, Helm charts for Kubernetes deployments, and environment parity
Ensure branch health and code quality gates: PR validation pipelines, automated testing, security scanning, and merge policies
Debug and diagnose production issues: analyze logs (Kusto/ADX), traces, and metrics to identify root causes and drive resolution
Collaborate with the Platform team on operational readiness: review service designs for operability, define deployment requirements, and validate runbooks
Drive reliability improvements: capacity planning, performance optimization, chaos engineering, and disaster recovery testing
Guide and mentor operations engineers; establish effective operational practices and a culture of continuous improvement
Embody our culture and values

Qualifications

Required Qualifications:

Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
6+ years technical engineering experience in DevOps, SRE, or platform operations
6+ years driving complex operational initiatives across teams; demonstrated success leading without authority
4+ years hands-on experience with Kubernetes in production environments
3+ years building and maintaining CI/CD pipelines at scale

Other Requirements:

Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Preferred Qualifications:

Experience with Kubernetes: cluster operations, Helm, troubleshooting, autoscaling, and production management
Proficiency with CI/CD platforms: Azure DevOps, GitHub Actions, or similar pipeline tooling
Experience with cloud platforms (Azure preferred): AKS, networking, identity management, and resource provisioning
Infrastructure as Code: Bicep, Terraform, or Helm chart development
Observability tooling: Prometheus, Grafana, OpenTelemetry, and log analytics (Kusto/KQL)

Software Engineering IC5 - The typical base pay range for this role across the U.S. is USD $139,900 - $274,800 per year. A different range applies to specific work locations within the San Francisco Bay Area and New York City metropolitan area; the base pay range for this role in those locations is USD $188,000 - $304,200 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here:

https://careers.microsoft.com/us/en/us-corporate-pay

This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.


