Roles & Responsibilities Framework

DevOps, Broken Down.

Ten disciplines that define modern DevOps — each with a clear objective, what good looks like, the tools teams actually use, and the bottlenecks that don't show up in job descriptions.

10 DISCIPLINES  ·  REAL BOTTLENECKS  ·  TOOLS  ·  BIG QUESTIONS
All Disciplines
CI/CD
Infrastructure as Code
Config & Dependency Mgmt
Observability
Automation & Platform Eng
Security & Compliance
FinOps
Reliability Engineering
Continuous Improvement
Culture & Collaboration
10 Core Disciplines
60+ Tools Across the Stack
80+ Real Bottlenecks
40+ Big Questions to Answer
01 — All Disciplines

The Full Framework

01 /CI/CD

Continuous Integration
& Delivery

Get code from commit to production safely, repeatably, and fast. The pipeline is the product — treat it like one. Speed without reliability is just recklessness.

What Good Looks Like
Deployments are boring — automated, tested, and reversible
Developers get pipeline feedback in under 10 minutes
Rollbacks are a one-step operation, not a postmortem
Feature flags decouple deploy from release (see the sketch after this list)
Secrets are never hardcoded — ever
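A minimal illustration of "deploy is not release": the new code path ships dark and only runs once a flag flips, so releasing it never requires another deploy. This is only a sketch, using a hypothetical environment-variable flag source; real teams would back it with LaunchDarkly, Flagsmith, or a config service.

```python
import os

# Hypothetical flag source. In production this would be a flag service client;
# the point is that the new code path is already deployed and flipping the flag
# releases it without another deploy.
def flag_enabled(name: str) -> bool:
    return os.environ.get(f"FLAG_{name.upper()}", "false").lower() == "true"

def checkout(cart: dict) -> str:
    if flag_enabled("new_pricing_engine"):
        return new_pricing_flow(cart)   # released to users only when the flag is on
    return legacy_pricing_flow(cart)    # safe default until then

def new_pricing_flow(cart: dict) -> str:
    return f"new engine priced {len(cart)} items"

def legacy_pricing_flow(cart: dict) -> str:
    return f"legacy engine priced {len(cart)} items"

if __name__ == "__main__":
    print(checkout({"sku-1": 1}))  # legacy until FLAG_NEW_PRICING_ENGINE=true
```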
Common Tools
GitHub Actions · GitLab CI/CD · CircleCI · Jenkins · Argo CD · Flux · Harness · Spinnaker · Argo Rollouts · LaunchDarkly · Vault · Docker · Kaniko · Artifactory · Cypress
Real Bottlenecks
Flaky tests that no one fixes. They erode trust in the pipeline until engineers start skipping CI entirely.
Fear of the deploy button. When deployments are risky, teams batch changes. Batching makes deployments riskier still. A vicious cycle.
Manual approval gates that are rubber stamps. They create delay without adding safety. Replace with automated quality gates.
Pipelines owned by no one. Shared pipeline infrastructure that "everyone" is responsible for means no one fixes it.
Hardcoded secrets in repositories. Even "old" secrets in git history are exploitable. Rotation alone doesn't solve leakage.
Environments that don't match production. "Works on staging" is not a deployment guarantee. Config parity is non-negotiable.
Questions Worth Asking
How long does it take from a merged PR to production? Is that acceptable?
When was the last time you tested a rollback under real conditions?
Do developers trust the pipeline enough to merge on a Friday?
If a secret was accidentally committed, how would you find out?
Most CI/CD problems aren't tool problems — they're ownership and discipline problems. The best pipeline is one your team actually trusts.
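The last question above is answerable with a small amount of automation. A minimal sketch that replays the full git history and greps it for a couple of credential patterns; the regexes are illustrative only, and a real setup would wire a purpose-built scanner such as gitleaks or trufflehog into CI.

```python
import re
import subprocess

# Illustrative patterns only; real scanners ship hundreds of rules.
PATTERNS = {
    "AWS access key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "Private key header": re.compile(r"-----BEGIN (RSA|EC|OPENSSH) PRIVATE KEY-----"),
    "Generic secret assignment": re.compile(r"(?i)(api[_-]?key|password|secret)\s*[:=]\s*['\"][^'\"]{12,}"),
}

def scan_history() -> list[str]:
    # `git log -p --all` replays every diff ever committed, including deleted files,
    # so "we removed it in a later commit" does not hide a leaked secret.
    diff = subprocess.run(
        ["git", "log", "-p", "--all"], capture_output=True, text=True, check=True
    ).stdout
    findings = []
    for lineno, line in enumerate(diff.splitlines(), 1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append(f"{name} near diff line {lineno}: {line.strip()[:80]}")
    return findings

if __name__ == "__main__":
    for finding in scan_history():
        print(finding)
```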
02 /IaC

Infrastructure
as Code

Infrastructure should be reproducible, reviewable, and version-controlled — the same discipline applied to application code. "Just click around in the console" is not a runbook.

What Good Looks Like
Every infrastructure change goes through a PR — including hotfixes
Environments are spun up and torn down in minutes, not days
Drift between declared and real state is detected automatically
IaC modules are reusable, documented, and owned by someone
Policy-as-code blocks non-compliant resources before they're created
Common Tools
Terraform · Terragrunt · Pulumi · AWS CDK · AWS CloudFormation · Helm · Kustomize · Atlantis · tfsec · checkov · TFLint · Infracost · SOPS
Real Bottlenecks
State file as a single point of failure. Corrupted or locked Terraform state can freeze an entire team. Remote state with locking is not optional.
Nobody owns the modules. Shared module libraries that were built once and never maintained become a liability nobody wants to touch.
Drift between code and reality. Manual changes made "just this once" accumulate. Without drift detection, you eventually lose trust in what the code says.
IaC used for deployments instead of just provisioning. Terraform is not a deployment tool. Blending provisioning and app deployment in one plan is a recipe for slow, risky applies.
Abstraction that outpaces the team. Overly complex module hierarchies slow onboarding and make debugging infuriating. Clarity beats cleverness.
Questions Worth Asking
If someone ran `terraform apply` right now, do you know exactly what would change?
How much of your infrastructure exists outside of code?
Who reviews IaC changes, and do they understand what they're approving?
Can a new engineer spin up a full environment without tribal knowledge?
IaC is mature — the hard part is governance, not syntax. Who can approve a change, what does a plan review actually catch, and how does state get protected?
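Drift detection does not need a product to get started. A minimal sketch of a scheduled check, assuming Terraform is installed and the working directory is already initialised; it relies on `terraform plan -detailed-exitcode`, which exits 0 when there are no changes, 2 when the plan differs from declared state, and 1 on error.

```python
import subprocess
import sys

def check_drift(workdir: str) -> int:
    # -detailed-exitcode: 0 = no changes, 1 = error, 2 = plan contains changes (drift or pending edits)
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    if result.returncode == 2:
        print(f"DRIFT in {workdir}: declared state no longer matches reality")
        print(result.stdout[-2000:])          # tail of the plan for the alert body
    elif result.returncode == 1:
        print(f"ERROR planning {workdir}: {result.stderr.strip()}")
    else:
        print(f"{workdir}: clean")
    return result.returncode

if __name__ == "__main__":
    # Run from cron or a nightly pipeline; a non-zero exit makes the drift visible.
    sys.exit(check_drift(sys.argv[1] if len(sys.argv) > 1 else "."))
```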
03 /Config

Configuration &
Dependency Management

Apps should behave consistently regardless of where they run. Configuration drift and supply chain vulnerabilities are silent killers — they surface in production at the worst time.

What Good Looks Like
Config is environment-specific but template-driven — no copy-paste between envs
Secrets are injected at runtime, never baked into images or committed to git
Dependency updates are automated, reviewed, and tested — not manual
Third-party dependencies are scanned for vulnerabilities on every build
Any config change is auditable — who changed what, when, and why
Common Tools
Helm · Kustomize · ConfigMaps / Secrets · HashiCorp Vault · SOPS · Doppler · Sealed Secrets · Renovate · Dependabot · Snyk · Trivy · npm / pip / Maven
Real Bottlenecks
Configs duplicated across environments by hand. One typo in a prod config that staging doesn't have is how Saturday nights become incidents.
Secrets sprawl. Secrets end up in .env files, Slack messages, wikis, and one person's laptop. The blast radius of a single leak is unknown.
Dependency updates treated as optional maintenance. Skipping updates for months turns a routine patch into a breaking migration. Automation makes this a non-event.
Supply chain blind spots. Transitive dependencies — the packages your packages depend on — are the attack surface most teams never look at.
Undocumented overrides. That extra env var set by hand in staging 18 months ago that nobody remembers is why environments quietly stop behaving the same. Every environment has a few, and none of them are in the code.
Questions Worth Asking
If you had to rotate every secret today, how long would it take?
How many places do your app's configs live? Can you enumerate them?
When was the last time a vulnerable dependency was shipped to production undetected?
Secrets management is where compliance meets engineering. It's not glamorous, but a secrets leak will generate more urgency than any feature outage.
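One concrete habit that supports several of the points above: read secrets from the runtime environment, validate them at startup, and refuse to boot without them, so nothing gets baked into an image or silently defaulted. A minimal sketch; the variable names are hypothetical.

```python
import os
import sys

REQUIRED_SECRETS = ["DATABASE_URL", "PAYMENT_API_KEY"]   # hypothetical names
DEFAULTED_CONFIG = {"LOG_LEVEL": "info", "REQUEST_TIMEOUT_SECONDS": "30"}

def load_settings() -> dict:
    missing = [name for name in REQUIRED_SECRETS if not os.environ.get(name)]
    if missing:
        # Fail fast at startup, not at the first request that needs the secret.
        sys.exit(f"refusing to start, missing secrets: {', '.join(missing)}")
    settings = {name: os.environ[name] for name in REQUIRED_SECRETS}      # injected at runtime, never committed
    for name, default in DEFAULTED_CONFIG.items():
        settings[name] = os.environ.get(name, default)                     # non-secret config may have safe defaults
    return settings

if __name__ == "__main__":
    loaded = load_settings()
    print({k: ("***" if k in REQUIRED_SECRETS else v) for k, v in loaded.items()})
```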
04 /Observability

Observability &
Alerting

You can't fix what you can't see. Observability — metrics, logs, and traces working together — lets you understand why a system is misbehaving, not just that it is.

What Good Looks Like
SLOs defined per service — not just "it's up" but "it's performing acceptably"
Alerts fire on user-impacting symptoms, not just CPU spikes
A trace ID connects a user complaint to the exact failing service call
On-call engineers can diagnose an incident without waking a specialist
Dashboards are maintained and reflect current architecture
Common Tools
Prometheus · Grafana · Datadog · New Relic · OpenTelemetry · Jaeger · Tempo · Loki · Elasticsearch · Alertmanager · PagerDuty · Opsgenie · Honeycomb · FireHydrant
Real Bottlenecks
Alert fatigue from noisy, low-signal alerts. When on-call gets paged 30 times a week for non-actionable alerts, real incidents get ignored. Every alert should demand action.
No SLOs — so no baseline for "is this bad?" Without agreed SLOs, every incident becomes a political debate about severity. Define error budgets before you need them.
Metrics, logs, and traces as separate silos. If you can't correlate a spike in a Grafana dashboard to a log line in Loki to a trace in Jaeger, you're doing archaeology, not debugging.
Dashboards that describe the past, not the present. Dashboards built for one version of the architecture and never updated become wallpaper. Treat them like code.
Tribal knowledge as the primary debugging tool. When only one engineer knows how to read a certain metric, that engineer is the bottleneck for every incident.
Questions Worth Asking
How many alerts fired last week? How many required action?
Can any on-call engineer — not just the expert — diagnose a production incident?
Do you have SLOs? When did you last review them with the business?
How long does it take to go from "something's wrong" to "here's the root cause"?
Observability is not a dashboard. It's the ability to ask arbitrary questions about your system's behavior without deploying new code to answer them.
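Error budgets are simple arithmetic, which is exactly why they end debates about severity. A minimal sketch of the calculation for a request-based SLO; the numbers are illustrative.

```python
def error_budget_report(slo: float, window_days: int, total_requests: int, failed_requests: int) -> str:
    # The budget is everything the SLO allows you to fail; how fast you burn it is a business decision.
    allowed_failure_ratio = 1.0 - slo
    budget_requests = total_requests * allowed_failure_ratio
    consumed = failed_requests / budget_requests if budget_requests else float("inf")
    downtime_budget_minutes = window_days * 24 * 60 * allowed_failure_ratio
    return (
        f"SLO {slo:.3%} over {window_days}d: "
        f"budget {budget_requests:,.0f} failed requests "
        f"(~{downtime_budget_minutes:.0f} min of full downtime), "
        f"consumed {consumed:.0%}"
    )

if __name__ == "__main__":
    # 99.9% over 30 days with 10M requests and 4,200 failures: 42% of the budget is gone.
    print(error_budget_report(slo=0.999, window_days=30,
                              total_requests=10_000_000, failed_requests=4_200))
```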
05 /Platform Eng

Automation &
Platform Engineering

The goal is to eliminate toil — not just automate it. Platform engineering builds the paved roads developers want to use. If your internal platform requires a ticket to use, it's not a platform.

What Good Looks Like
Developers can provision environments, run deployments, and create services without filing tickets
Golden paths exist — opinionated, well-lit routes that are faster than doing it yourself
Internal tools are treated as products: they have owners, docs, and SLAs
Automation is idempotent — running it twice is safe
The platform team measures adoption and developer satisfaction, not just uptime
Common Tools
Backstage · Port · Cortex · Argo CD · Atlantis · Temporal · Airflow · Prefect · Ansible · Bash / Python · LaunchDarkly · Flagsmith · Taskfile
Real Bottlenecks
Automation that requires a human to babysit it. Scripts that silently fail or require manual intervention aren't automation — they're documentation with a run button.
The "bus factor" problem. Internal tooling understood only by the engineer who built it becomes a critical dependency. Document it or lose it.
Internal platforms nobody uses. Building a developer portal that teams route around is the most demoralizing outcome in platform engineering. Build with teams, not for them.
Ticket-ops replacing self-service. If getting a new environment still requires a JIRA ticket to the platform team, you haven't solved the problem — you've renamed it.
Overengineered tooling for a 10-person team. Backstage is powerful — and also overkill for many orgs. Match platform complexity to actual developer pain, not aspirational architecture.
Questions Worth Asking
What's the most common thing developers file tickets to the platform team for? Can that be self-served?
If your best platform engineer left tomorrow, what would break?
Do developers actually use the internal platform, or do they route around it?
Are you measuring developer experience, or just platform availability?
The best platform teams think like product teams. They have a roadmap, they talk to users, and they deprecate things that nobody uses. Toil that teams don't notice you've eliminated is the highest compliment.
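"Running it twice is safe" is a property you design for deliberately: check the current state, act only on the difference, and make re-runs no-ops. A minimal sketch with a hypothetical provisioning step that writes config into a GitOps-style repo.

```python
from pathlib import Path

def ensure_namespace_config(root: Path, team: str) -> str:
    """Converge on the desired state: running this twice changes nothing the second time."""
    target = root / team / "namespace.yaml"
    desired = f"apiVersion: v1\nkind: Namespace\nmetadata:\n  name: {team}\n"
    if target.exists() and target.read_text() == desired:
        return "unchanged"                              # re-run is a safe no-op
    action = "updated" if target.exists() else "created"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(desired)
    return action

if __name__ == "__main__":
    repo = Path("platform-config")                      # hypothetical repo checkout
    print(ensure_namespace_config(repo, "payments"))    # "created"
    print(ensure_namespace_config(repo, "payments"))    # "unchanged"
```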
06 /DevSecOps

Security &
Compliance

Security that lives entirely in a separate team is security theater. Real DevSecOps means developers catch vulnerabilities before they ship — not after a pentest finds them.

What Good Looks Like
Security scans run in CI — containers, dependencies, and IaC are all checked
Secrets are centrally managed, rotated, and audited — not distributed by email
RBAC enforced at every layer — least privilege is the default, not the aspiration
CVEs have an SLA — critical findings get patched within a defined timeframe
Compliance evidence is generated automatically, not assembled manually at audit time
Common Tools
Snyk · Trivy · Grype · Checkov · tfsec · SonarQube · Semgrep · OWASP ZAP · HashiCorp Vault · OPA / Kyverno · Falco · AWS IAM · Okta · Wiz · Drata
Real Bottlenecks
Security as a gate, not a guardrail. When security only shows up at the end of a release cycle, they're forced to choose between blocking the release or waving it through. Neither is good.
CVE backlogs that grow faster than they're cleared. Without a priority framework and SLA, vulnerability queues become noise that teams learn to ignore.
Secrets that "only exist in one place." They don't. Slack, email, a doc, someone's .bashrc. Audit your actual secrets surface area — it's always larger than you think.
RBAC in theory, admin access in practice. Emergency access escalations that never get revoked accumulate into permanent over-permissioning.
Developers who view security as someone else's job. Security training that's a 30-minute annual video doesn't change behavior. Integration into daily workflows does.
Questions Worth Asking
If a critical CVE dropped today, how long before every affected service is patched?
Who has production access? When did you last audit that list?
Can you pass a SOC2 or ISO 27001 audit without a month of frantic prep?
How does a developer find out they've introduced a vulnerability before it ships?
The companies that do DevSecOps well make security frictionless for developers. Scan results in the PR. One-click secret rotation. RBAC that doesn't require a ticket. Remove the friction and compliance follows.
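"CVEs have an SLA" only works if something actually measures it. A minimal sketch that turns a list of open findings into an SLA report; the severity windows and findings are illustrative, and in practice the input would come from a scanner export (Snyk, Trivy, Grype).

```python
from datetime import date

# Illustrative remediation windows; set your own and agree them with security.
SLA_DAYS = {"critical": 7, "high": 30, "medium": 90, "low": 180}

def sla_report(findings: list[dict], today: date) -> list[str]:
    report = []
    for f in findings:
        age = (today - f["found_on"]).days
        limit = SLA_DAYS[f["severity"]]
        if age > limit:
            report.append(f"OUT OF SLA ({age - limit}d over): {f['cve']} [{f['severity']}] in {f['service']}")
    return report

if __name__ == "__main__":
    findings = [   # made-up placeholder records; real ones come from a scanner export
        {"cve": "CVE-2024-0001", "severity": "critical", "service": "payments-api", "found_on": date(2024, 5, 1)},
        {"cve": "CVE-2024-0002", "severity": "low", "service": "billing-worker", "found_on": date(2024, 5, 20)},
    ]
    for line in sla_report(findings, today=date(2024, 6, 1)):
        print(line)
```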
07 /FinOps

Cost & Resource
Optimization

Cloud cost isn't a finance problem — it's an engineering problem. The teams who understand what they're spending and why ship better systems, not just cheaper ones.

What Good Looks Like
Every team can see their own cloud spend — not just platform or finance
Dev and test environments are ephemeral — not running 24/7
Cost impact of infrastructure changes is visible before `apply`
Reserved instances and savings plans are used deliberately, not forgotten
Orphaned resources have an owner and an expiry — not just a tag
Common Tools
Kubecost · OpenCost · AWS Cost Explorer · Infracost · CloudHealth · Spot.io · CAST AI · KEDA · Cluster Autoscaler · AWS Budgets · Harness CCM
Real Bottlenecks
Cloud costs hidden from the teams creating them. If engineers never see a bill, there's no feedback loop. Cost awareness starts with visibility, not governance.
Dev/test environments running permanently. A dev cluster left running over a long weekend can cost as much as a week of production. Scheduled shutdown policies are free.
Over-provisioning as the path of least resistance. Engineers size for peak + margin + worry. Rightsizing is uncomfortable but consistently delivers 20-40% savings.
No ownership of orphaned resources. That 2TB snapshot from a deleted cluster from 18 months ago is still on the bill. Tagging policies and expiry automation help, but enforcement is the hard part.
FinOps treated as a platform team responsibility. Cost optimization only works at scale when product teams own their spend. Platform can provide tooling; accountability must live with the teams.
Questions Worth Asking
Which team is your largest cloud spender? Do they know it?
What's your cost per customer, per service, per environment? Can you answer that?
When did you last review reserved instance coverage vs actual utilization?
FinOps is 20% tooling and 80% culture. The hardest part isn't finding waste — it's creating the incentive for engineering teams to care about eliminating it.
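Enforcing "an owner and an expiry, not just a tag" can start as a very small script run on a schedule. A minimal sketch that flags resources missing the required tags or past their expiry date; the inventory list is a stand-in for whatever your cloud provider's API returns.

```python
from datetime import date

REQUIRED_TAGS = ("owner", "expires-on")

def flag_resources(resources: list[dict], today: date) -> list[str]:
    flagged = []
    for r in resources:
        tags = r.get("tags", {})
        missing = [t for t in REQUIRED_TAGS if t not in tags]
        if missing:
            flagged.append(f"{r['id']}: missing tags {missing}")
            continue
        if date.fromisoformat(tags["expires-on"]) < today:
            flagged.append(f"{r['id']}: expired {tags['expires-on']}, owner {tags['owner']}")
    return flagged

if __name__ == "__main__":
    inventory = [   # stand-in data; in practice this comes from the cloud inventory API
        {"id": "snap-0abc", "tags": {"owner": "data-team", "expires-on": "2024-01-31"}},
        {"id": "vol-9xyz", "tags": {}},
    ]
    for line in flag_resources(inventory, today=date(2024, 6, 1)):
        print(line)
```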
08 /SRE

Reliability &
Resilience Engineering

Reliability is a feature. Systems that fail gracefully, recover quickly, and degrade predictably are engineered that way — not lucky. The goal is surviving failure, not preventing it.

What Good Looks Like
SLOs are agreed with the business — not set unilaterally by engineering
Failure modes are documented, and failover is tested before it's needed
Runbooks are accessible, current, and usable by any on-call engineer
Chaos engineering happens on a schedule, not after a production incident
On-call rotation is sustainable — no one is burning out to keep the lights on
Common Tools
Chaos Mesh · LitmusChaos · Gremlin · Kubernetes HPA/VPA · KEDA · Velero · Istio · Envoy · Resilience4j · PagerDuty · StatusPage · incident.io · Rootly
Real Bottlenecks
DR plans that have never been tested. A disaster recovery plan that's only been read — never executed — is a hypothesis. Test it before an actual disaster does.
On-call burnout eroding the team quietly. Chronic paging at 3am is an attrition risk, not just a morale issue. On-call health is a reliability metric.
Runbook rot. Runbooks written during the last incident and never updated are worse than no runbooks — they give false confidence and wrong steps.
Tight coupling making failures non-local. A database timeout that crashes the frontend because there's no circuit breaker is a design problem masquerading as an ops problem.
SLOs set too high to be meaningful. A 99.99% SLO on an internal CRUD service nobody depends on wastes engineering effort. Error budgets should reflect user impact, not engineering pride.
Questions Worth Asking
When did you last test your disaster recovery procedure end-to-end?
How many pages did on-call receive last month? How many were actionable?
If your primary database went down right now, what happens to each service?
Are your SLOs based on actual user expectations, or internal engineering targets?
The SRE mindset shift: stop trying to prevent failure and start designing for recovery. Mean time to recovery (MTTR) often matters more to users than mean time between failures (MTBF).
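The "tight coupling" bottleneck above has a classic structural answer: a circuit breaker that fails fast once a dependency is clearly unhealthy, instead of letting every caller queue up behind a timeout. A minimal, non-thread-safe sketch; libraries like Resilience4j or a service mesh's outlier detection do this properly in production.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: fail fast while a dependency is unhealthy, probe again after a cooldown."""

    def __init__(self, failure_threshold: int = 5, reset_after_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_seconds:
                raise RuntimeError("circuit open: failing fast instead of waiting on a sick dependency")
            self.opened_at = None              # half-open: let one probe call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0                      # success closes the circuit again
        return result

if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=2, reset_after_seconds=5)

    def flaky_database_call():
        raise TimeoutError("db timeout")

    for _ in range(4):
        try:
            breaker.call(flaky_database_call)
        except Exception as exc:
            print(type(exc).__name__, exc)     # two timeouts, then the breaker fails fast
```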
09 /Improvement

Continuous
Improvement

The hardest discipline in DevOps because it has no obvious fire to put out. Most teams skip it entirely. The ones that don't are the ones that compound improvements over time rather than repeatedly solving the same problems.

What Good Looks Like
Blameless postmortems happen after every significant incident — and actions get completed
DORA metrics are tracked: deployment frequency, lead time, MTTR, change failure rate
Developer experience is measured periodically — not just inferred
Time is budgeted explicitly for improvement work, not taken from on-call slack
Retrospective actions have owners, due dates, and are reviewed
Common Tools
DORA Metrics · Sleuth · LinearB · Haystack · GitHub Insights · Parabol · EasyRetro · Miro · LaunchDarkly · Notion / Confluence
Real Bottlenecks
Retrospectives without follow-through. Writing action items in a doc nobody looks at is not improvement — it's ritual. Actions need owners and deadlines or they disappear.
Improvement work deprioritized by product roadmap. When every sprint is feature delivery, technical debt and process debt compound invisibly until the team hits a wall.
Measuring outputs instead of outcomes. Tracking story points and tickets closed doesn't tell you if the team is improving. DORA metrics give you actual signal about delivery health.
Blame culture disguised as postmortems. A "blameless" postmortem that ends with an action item for one specific engineer to "be more careful" is not blameless. It's just deniable.
Innovation disconnected from user impact. Experimenting with new tools is fun. Experiments that don't connect to measurable engineering or user outcomes are hobbies, not improvements.
Questions Worth Asking
What are your DORA metrics right now? Have they improved in the last quarter?
Look at last quarter's retro action items. How many were completed?
Is there explicit time budgeted for improvement work, or does it happen in between fires?
What's the single biggest recurring problem your team keeps solving? Why hasn't it been fixed?
High-performing DevOps teams aren't better because they have better tools. They're better because they systematically learn from what's not working and protect time to fix it.
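The four DORA metrics need nothing fancier than your deployment and incident records to get started. A minimal sketch of the calculation over one window; the input rows are stand-ins for whatever your pipeline and incident tooling export.

```python
from datetime import datetime, timedelta
from statistics import median

def dora(deploys: list[dict], window_days: int = 30) -> dict:
    """Deployment frequency, lead time for changes, change failure rate, and MTTR over one window."""
    lead_times = [d["deployed_at"] - d["committed_at"] for d in deploys]
    failures = [d for d in deploys if d.get("caused_incident")]
    restores = [d["restored_at"] - d["deployed_at"] for d in failures]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(lt.total_seconds() for lt in lead_times) / 3600,
        "change_failure_rate": len(failures) / len(deploys) if deploys else 0.0,
        "mttr_hours": (sum(restores, timedelta()).total_seconds() / 3600 / len(restores)) if restores else 0.0,
    }

if __name__ == "__main__":
    deploys = [   # stand-in records; real ones come from CI and incident tooling
        {"committed_at": datetime(2024, 6, 1, 9), "deployed_at": datetime(2024, 6, 1, 13)},
        {"committed_at": datetime(2024, 6, 3, 10), "deployed_at": datetime(2024, 6, 4, 10),
         "caused_incident": True, "restored_at": datetime(2024, 6, 4, 12)},
    ]
    print(dora(deploys))
```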
10 /Culture

Culture &
Collaboration

DevOps is a culture before it's a toolchain. No amount of automation fixes a team where dev blames ops, ops blames security, and security blames everyone. Shared ownership is the foundation everything else rests on.

What Good Looks Like
Developers feel responsible for what they ship — in production, not just in review
Postmortems focus on systems and processes, never on individuals
Security and ops are consulted during design — not handed the result at the end
On-call rotation includes developers, not just ops — they feel the pain they create
Psychological safety exists: engineers raise problems before they become incidents
Common Tools
Slack / Teams · Confluence / Notion · Jira / Linear · GitHub / GitLab · Backstage · Rootly · FireHydrant · Lattice · CultureAmp · Retrium
Real Bottlenecks
Dev and ops as adversarial teams. Dev wants to ship fast. Ops wants stability. Without shared goals and shared on-call, this tension never resolves — it just finds new forms.
Knowledge hoarding as job security. Engineers who are the only ones who understand a system have incentive to keep it that way. This is cultural, not technical.
Incident blame culture. When mistakes lead to consequences, engineers stop raising concerns early. Small problems become big ones because nobody wanted to be the person who flagged it.
Velocity celebrated over quality. Teams that are rewarded only for shipping features have a rational incentive to skip the things that don't show up in a sprint review: tests, docs, runbooks, tech debt.
DevOps as a team name, not a practice. Renaming your ops team "DevOps" doesn't change how software is built and delivered. DevOps describes how teams work, not what a team is called.
Questions Worth Asking
Would an engineer on your team feel safe raising a concern about a decision they disagreed with?
Do developers participate in on-call? If not, why not?
After your last incident, did the postmortem lead to systemic change, or just an action item?
Are speed and stability treated as trade-offs, or as complementary goals?
Culture is the multiplier on everything else. Great tooling in a blame culture produces great-looking dashboards during incidents nobody admits to. Invest in culture as deliberately as you invest in tooling.