LancasterPARecruiter Since 2001
the smart solution for Lancaster jobs

Lead Site Reliability Engineer

Company: Bridge Defense
Location: Washington
Posted on: April 2, 2026

Job Description:

About the Role

As the Lead Site Reliability Engineer for our ComputeBridge Engagement, you’ll be responsible for the reliability, scalability, and performance of one of the largest hardware and AI infrastructure efforts in the U.S. defense sector. You will lead the deployment, management, and automation of a high-performance computing mesh across multiple secure environments, ensuring operational excellence and mission continuity for a 9-figure government program. This is a hands-on engineering leadership role that bridges physical infrastructure and modern DevOps automation, ideal for someone who thrives at the intersection of hardware systems, distributed computing, and AI/ML workflows.

What You’ll Do

- Lead infrastructure design, deployment, and operations for ComputeBridge hardware clusters across secure and distributed environments
- Install and configure physical systems, including high-density GPU servers, networking gear, and storage arrays
- Build and deploy secure Linux images and containerized workloads using OpenShift and other orchestration platforms
- Develop and manage automation pipelines for provisioning, configuration management, and monitoring using modern DevOps toolchains (Ansible, Terraform, etc.)
- Operate and maintain distributed networking meshes across multiple classified and unclassified domains
- Implement and manage out-of-band management tools (IPMI, iDRAC, BMC, etc.) for remote troubleshooting and control
- Integrate and optimize NVIDIA GPU infrastructure for AI/ML training and inference workloads
- Collaborate with mission engineers, software teams, and government operators to ensure system readiness and performance
- Provide on-site technical leadership for deployments, troubleshooting, and continuous improvement
- Mentor junior engineers and establish operational best practices across the ComputeBridge program as the contract grows

What You’ll Bring

- 3 years of experience in site reliability, systems engineering, or hardware operations roles
- Deep expertise with physical infrastructure: server racking, cabling, diagnostics, and troubleshooting
- Strong experience with Linux systems administration, imaging, and automated deployment
- Hands-on experience managing large-scale clusters or distributed systems in OpenShift or Kubernetes environments
- Familiarity with DevOps automation (Ansible, Terraform, CI/CD pipelines)
- Experience configuring and managing networking and mesh architectures
- Direct experience with NVIDIA GPUs, CUDA, and related AI/ML frameworks
- Proficiency with out-of-band management and IPMI/iDRAC tooling
- Certifications: Linux and Security (required or in-progress)
- Excellent communication, documentation, and problem-solving skills
- Clearance: Active TS/SCI required or ability to obtain

Bonus Points For

- Experience operating in secure DoD or intelligence environments
- Familiarity with Palantir platforms or other government data systems
- Prior experience supporting AI/ML infrastructure in production or tactical settings
- Experience with performance tuning and monitoring of HPC or GPU-accelerated clusters

General Factors:

- Depending on project requirements, may be required to work within a compressed schedule; overtime should be expected when schedules demand it.
- Willing to travel, if needed.
- No relocation.

Why Bridge Defense

- Shape how advanced computing supports national security missions at scale
- Lead engineering for a major government program with direct mission impact
- Competitive compensation, benefits, and growth opportunities in a mission-driven environment

Bridge Defense is committed to building a collaborative and mission-focused team. Bridge Defense reserves the right to modify job duties or requirements at any time. Employment with Bridge Defense is at-will. Candidates must be eligible to work in the United States and complete any required background checks or security clearance processes as a condition of employment.

Keywords: Bridge Defense, Lancaster, Lead Site Reliability Engineer, IT / Software / Systems, Washington, Pennsylvania

