SITE RELIABILITY ENGINEERING MANAGER

Ensure Reliability of Systems that Move the Nation's Food Supply

Who We Are

US Cold owns and operates one of the most complex temperature-controlled logistics networks in North America. Every day, our systems coordinate the storage and movement of food at national scale across a network of state-of-the-art distribution centers, including multiple highly automated warehouse facilities.

We continue to advance our core warehouse and logistics platforms, with a current focus on modular, event-driven, API-first, and cloud architectures. We are also strengthening our SRE and AI practices to enhance reliability and accelerate engineering productivity. This is a large investment in innovation that drives operational excellence at our facilities.

If you want to build durable systems that operate in the physical world at scale, this is that opportunity.

The Role

The SRE Manager will design and implement the company’s SRE framework from the ground up.

You will define what reliability means at US Cold.
You will establish SLIs and SLOs.
You will modernize monitoring and incident response.
You will build the playbook others will follow.

This is both a hands-on technical role and a practice-building leadership position.

You will report to the Director of IT Operations.

What You’ll Own

Establish the company’s first SRE practice including principles, standards, tooling, and operational processes
Define SLIs, SLOs, and error budgets across SaaS, on-prem, and custom services
Build reliability dashboards and executive-level reporting
Implement and evolve observability across logs, metrics, and distributed tracing
Mature incident response, outage management, and post-incident review processes
Partner with engineering to design resilient systems and reduce operational toil
Strengthen CI/CD reliability using safe deploy strategies such as canary and blue/green patterns
Implement cost visibility and cloud governance in partnership with Finance
Build runbooks, playbooks, and operational standards
Establish on-call structures and escalation clarity
Assist in hiring, mentoring, and developing future SRE team members

This is foundational work. The systems and practices you design will shape how engineering operates for years.

Technical Environment

Azure cloud infrastructure
Infrastructure as Code using Bicep, Terraform, or ARM
GitHub Actions for CI/CD orchestration
Safe deployment patterns including gated releases, canary, and blue/green
Observability across logging, metrics, and distributed tracing
Python scripting for automation and reliability tooling
SaaS integrations, on-prem infrastructure, and custom-built services

What We’re Looking For

5–7+ years in SRE, DevOps, Infrastructure, or Production Engineering
Hands-on ownership of production services
Proven experience implementing SLIs, SLOs, observability, and automation
Leadership in major incident response and post-incident reviews
Deep CI/CD expertise, particularly GitHub Actions
Strong Python scripting for automation and operational tooling
Practical knowledge of cloud cost optimization and FinOps principles
Ability to influence cross-functional teams

Education:
Bachelor’s degree in Computer Science, Engineering, or equivalent experience.

Why This Role Is Different

This is not an inherited SRE function.
There is no existing framework to simply maintain.

You will:

Define the reliability bar
Build the operating model
Influence architectural decisions
Establish executive-level visibility into system health
Create a culture where reliability is engineered, not reactive

This is an opportunity to build something durable inside a company modernizing its core technology platform.

Compensation & Structure

Salary Range: $160,000 - $190,000

Bonus Eligible

Full-time, exempt

Reports to: Director of IT Operations

Travel less than 10%

Location: Hybrid, Greater Philadelphia

Operational Context

This role is primarily technical and office-based, with occasional interaction in operational environments depending on system needs.

Benefits Include

If annual hours are attained, these benefits may apply. Medical, Dental, Vision, Prescription, Legal Insurance, Pet Discount, Critical Illness, Accident Insurance, Hospital Indemnity, Long Term Care + Permanent Life Insurance, Identity Theft Protection, Short Term Disability Insurance, Long Term Disability Insurance, Supplemental Disability Insurance, Basic Life Insurance, Accidental Death and Dismemberment Insurance, Supplemental Life Insurance, Supplemental Spouse Life Insurance, Child Life Insurance, Loan Solution, Health Flexible Spending Account, Dependent Flexible Spending Account, Telemedicine, Virtual Primary Care, Prescription Savings Plan, Prescription Specialty Copay Assistance Program, Weight Management Program, Chronic Condition Management, Care Navigator Program, 24/7 Nurse Line, Expert Medical Opinion, Precious Additions Maternity Program, Health Advocacy, Employee Assistance Program, Digital Cognitive Behavioral Therapy, Digital Physical Therapy, Behavioral and Mental Health Platforms, Auto and home discount program, Secure Travel Protection, Discount Programs, 401(k) plan, Education Assistance, Paid Time Off, Referral program & Commuter Benefit (NJ ONLY).

Physical & Operational Context

May require physical effort associated with using the computer to access information, or occasional standing, walking, and lifting needed to carry out everyday activities. Effective communication, vision, and hearing are essential for safety and productivity. Operate scanners, tablets, radios, phones, computers, and other essential equipment as required. Additional work hours may be requested by management to help manage employee production, projects, and/or special events. Engage in frequent personal interaction and communication. Attend in-person meetings and/or training on a regular basis. Possess strong arithmetic and reading skills. Follow verbal instructions, written instructions, and company policies. Work independently and coordinate with others. Work in a fast-paced environment, managing stress and meeting productivity standards.

Additional Information

Job functions may vary based on the area of operation. This description outlines the most common tasks required for the job. Reasonable accommodation may be provided to enable individuals with disabilities to perform essential duties. This job description may not encompass all tasks necessary to complete the role. This role collaborates across Software Engineering, Customer Integration Technology, Data Engineering, Infrastructure, and Security.
