posted Jun 05

Senior DevOps Infrastructure Engineer (US)

Ansible Bash Cloud Go Grafana Kubernetes MySQL Prometheus Puppet Python SDLC VMware senior

Job Location: Remote

Salary: $120,000 - $155,000 a year

Job Description

• Administration - Participate in maintenance and operations of our production environment, including patching, deployment, server administration, and troubleshooting, either manually or with configuration-as-code tooling.
• Reliability & Performance - Ensure the reliability, availability, and performance of services. Respond to incidents and resolve them before they become customer-impacting.
• Projects - Deliver complex solutions that traverse all layers of the technology stack: Operating System, Virtualisation, Network, Storage & Cloud.
• Data Centre - Participate in and coordinate on-site deployments of critical hardware, including servers and storage.
• Collaboration - Work closely with teammates and with software and security teams to rapidly meet customer, business, and compliance needs.
• Automation - Drive the automation of operational tasks, and ensure our infrastructure is more like cattle than pets.
• Observability - Develop and maintain internal, commercial, and OSS tools to improve system health, performance, and deployment.
• Continuous Improvement - Drive never-ending improvement in SRE processes, tools, and methodologies. Take a leading role in blameless post-mortems to avoid repeat issues or mistakes, and clearly document all lessons learned for others. If you love writing actionable documentation, we’d love to set up an interview.
• On-Call - Participate in a rotating 24x7 on-call schedule with your team to ensure availability of services across the production environment.

Qualifications

• 5+ years of experience in Site Reliability Engineering, DevOps, System Administration, or similar roles.
• Deep experience working in colocation facilities. We have a hybrid footprint, so if you have only worked in the public cloud space, this role is not a great fit for you.
• Experience using Puppet, Ansible, or other common configuration-as-code tooling to deploy and configure systems.
• Strong familiarity with Linux systems (any distro is fine, but we have a preference for RHEL downstreams).
• Experience using Proxmox, VMware, or KVM as virtualization platforms for large-scale production environments.
• Experience administering enterprise-grade SANs and load balancers is necessary to be successful in this role.
• Demonstrated proficiency in one or more scripting or programming languages (e.g., Python, Go, Bash/Zsh).
• Multiple years of experience proactively implementing and responding to infrastructure, application, and network alerts using industry-standard or homebrew toolchains.
• Strong problem-solving skills and experience working in extremely high-availability production environments (99.95% or greater) with high performance requirements are required.