Posted Jun 29

Manager, Reliability Engineering

Bash Grafana Prometheus Python mid

Job Location: Los Angeles, California

Salary: $110,600 - $138,300 a year

Job Description

• Define, manage, and measure incident response engineering practices
• Liaise with engineering teams to ensure work discovered during incident response is prioritized
• Participate in incident response engineering duties as necessary
• Manage a global Reliability Operations team (3 to 6+ Reliability operations engineers across NAMER, EMEA, APAC)
• Adaptive management style according to level and proficiency of engineering reports
• Ability to understand technical employee career paths and collaboratively develop career plans
• Scheduling a global team through holidays, sickness and vacation leaves, across timezones
• Understanding of large-scale distributed system architectures (e.g., databases, web services, application services)
• Familiarity with monitoring tools (e.g., Prometheus, Grafana, Nagios)
• Ability to author scripts to facilitate troubleshooting as well as configure alerts
• Proficiency in scripting languages (e.g., Python, Bash) is a plus
• Ability to prioritize and manage incidents based on severity, with a focus on customer impact
• Ability to remain calm under pressure and quickly diagnose issues
• Understanding of system logs, metrics, telemetry
• Ability to take command and confidently direct engineering resources in ambiguous situations
• Ability to communicate effectively with stakeholders during an incident
• Ability to maintain and update trouble-shooting guides (TSGs) and operational documentation

Qualifications

• Bachelor’s degree from a four-year university or relevant substitute experience
• 6+ years of relevant work experience in Technical and/or Application Support, with strong knowledge of technical troubleshooting
• 2-5 years of management experience with direct reports

Benefits

• Comprehensive healthcare (medical, dental, and vision) with premiums paid in full for employees and dependents
• Retirement benefits such as a 401k plan and company match
• Short and long-term disability coverage
• Basic life insurance
• Well-being benefits
• Reimbursement for certain tuition expenses
• Parental leave
• Sick time of 1 hour per 30 hours worked
• Vacation time for full-time employees of up to 120 hours through the first year and 160 hours thereafter
• Around 13 paid holidays per year
• Option to purchase The Trade Desk stock at a discount through The Trade Desk’s Employee Stock Purchase Plan
