Staff Software Engineer, Platform

Related skills

Datadog, AWS, Kubernetes, EKS, observability

Description

Own and drive SRE strategy in observability, incidents, reliability, and platform ops
Serve as go-to consultant for infrastructure and reliability across teams
Lead architecture decisions for monitoring, alerting, and SLO frameworks; write RFCs
Provide L2 on-call support for complex incidents; build incident response capability
Define multi-quarter SRE initiatives with cross-team dependencies
Define and maintain SLIs/SLOs for tier-1 flows: contributions, disbursements, reporting
Contribute to ActBlue's reliability roadmap; anticipate upstream decisions
Prefer automation over manual processes; reduce toil through tooling

Requirements

8+ years in SRE, DevOps, or systems/infrastructure engineering
Deep expertise in observability tooling (Datadog)
Strong Kubernetes and cloud-native infra (AWS EKS)
Experience defining and operating SLIs and SLOs in production
Proven ability to lead cross-functional reliability initiatives
Strong incident management: on-call, post-mortems, blameless culture

Benefits

Flexible schedules and unlimited time off
Comprehensive health, dental, and vision insurance for you and your family
401K with employer match
Paid medical, family and parental leave
Home-office setup allowance for remote employees
Snack and digital subscription perks
