Roles & Responsibilities Framework

DevOps, Broken Down.

Ten disciplines that define modern DevOps — each with a clear objective, what good looks like, the tools teams actually use, and the bottlenecks that don't show up in job descriptions.

10 DISCIPLINES  ·  REAL BOTTLENECKS  ·  TOOLS  ·  BIG QUESTIONS
All Disciplines
CI/CD
Infrastructure as Code
Config & Dependency Mgmt
Observability
Automation & Platform Eng
Security & Compliance
FinOps
Reliability Engineering
Continuous Improvement
Culture & Collaboration
10 Core Disciplines
60+ Tools Across the Stack
80+ Real Bottlenecks
40+ Big Questions to Answer
01 — All Disciplines

The Full Framework

01 /CI/CD

Continuous Integration
& Delivery

Get code from commit to production safely, repeatably, and fast. The pipeline is the product — treat it like one. Speed without reliability is just recklessness.

What Good Looks Like
Deployments are boring — automated, tested, and reversible
Developers get pipeline feedback in under 10 minutes
Rollbacks are a one-step operation, not a postmortem
Feature flags decouple deploy from release (see the sketch after this list)
Secrets are never hardcoded — ever
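A minimal illustration of "deploy is not release": the new code path ships dark and only runs once a flag flips, so releasing it never requires another deploy. This is only a sketch, using a hypothetical environment-variable flag source; real teams would back it with LaunchDarkly, Flagsmith, or a config service.

```python
import os

# Hypothetical flag source. In production this would be a flag service client;
# the point is that the new code path is already deployed and flipping the flag
# releases it without another deploy.
def flag_enabled(name: str) -> bool:
    return os.environ.get(f"FLAG_{name.upper()}", "false").lower() == "true"

def checkout(cart: dict) -> str:
    if flag_enabled("new_pricing_engine"):
        return new_pricing_flow(cart)   # released to users only when the flag is on
    return legacy_pricing_flow(cart)    # safe default until then

def new_pricing_flow(cart: dict) -> str:
    return f"new engine priced {len(cart)} items"

def legacy_pricing_flow(cart: dict) -> str:
    return f"legacy engine priced {len(cart)} items"

if __name__ == "__main__":
    print(checkout({"sku-1": 1}))  # legacy until FLAG_NEW_PRICING_ENGINE=true
```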
Common Tools
GitHub Actions · GitLab CI/CD · CircleCI · Jenkins · Argo CD · Flux · Harness · Spinnaker · Argo Rollouts · LaunchDarkly · Vault · Docker · Kaniko · Artifactory · Cypress
Real Bottlenecks
Flaky tests that no one fixes. They erode trust in the pipeline until engineers start skipping CI entirely.
Fear of the deploy button. When deployments are risky, teams batch changes. Batching makes deployments riskier still. A vicious cycle.
Manual approval gates that are rubber stamps. They create delay without adding safety. Replace with automated quality gates.
Pipelines owned by no one. Shared pipeline infrastructure that "everyone" is responsible for means no one fixes it.
Hardcoded secrets in repositories. Even "old" secrets in git history are exploitable. Rotation alone doesn't solve leakage.
Environments that don't match production. "Works on staging" is not a deployment guarantee. Config parity is non-negotiable.
Questions Worth Asking
How long does it take from a merged PR to production? Is that acceptable?
When was the last time you tested a rollback under real conditions?
Do developers trust the pipeline enough to merge on a Friday?
If a secret was accidentally committed, how would you find out?
Most CI/CD problems aren't tool problems — they're ownership and discipline problems. The best pipeline is one your team actually trusts.
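The last question above is answerable with a small amount of automation. A minimal sketch that replays the full git history and greps it for a couple of credential patterns; the regexes are illustrative only, and a real setup would wire a purpose-built scanner such as gitleaks or trufflehog into CI.

```python
import re
import subprocess

# Illustrative patterns only; real scanners ship hundreds of rules.
PATTERNS = {
    "AWS access key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "Private key header": re.compile(r"-----BEGIN (RSA|EC|OPENSSH) PRIVATE KEY-----"),
    "Generic secret assignment": re.compile(r"(?i)(api[_-]?key|password|secret)\s*[:=]\s*['\"][^'\"]{12,}"),
}

def scan_history() -> list[str]:
    # `git log -p --all` replays every diff ever committed, including deleted files,
    # so "we removed it in a later commit" does not hide a leaked secret.
    diff = subprocess.run(
        ["git", "log", "-p", "--all"], capture_output=True, text=True, check=True
    ).stdout
    findings = []
    for lineno, line in enumerate(diff.splitlines(), 1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append(f"{name} near diff line {lineno}: {line.strip()[:80]}")
    return findings

if __name__ == "__main__":
    for finding in scan_history():
        print(finding)
```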
02 /IaC

Infrastructure
as Code

Infrastructure should be reproducible, reviewable, and version-controlled — the same discipline applied to application code. "Just click around in the console" is not a runbook.

What Good Looks Like
Every infrastructure change goes through a PR — including hotfixes
Environments are spun up and torn down in minutes, not days
Drift between declared and real state is detected automatically
IaC modules are reusable, documented, and owned by someone
Policy-as-code blocks non-compliant resources before they're created
Common Tools
Terraform · Terragrunt · Pulumi · AWS CDK · AWS CloudFormation · Helm · Kustomize · Atlantis · tfsec · checkov · TFLint · Infracost · SOPS
Real Bottlenecks
State file as a single point of failure. Corrupted or locked Terraform state can freeze an entire team. Remote state with locking is not optional.
Nobody owns the modules. Shared module libraries that were built once and never maintained become a liability nobody wants to touch.
Drift between code and reality. Manual changes made "just this once" accumulate. Without drift detection, you eventually lose trust in what the code says.
IaC used for deployments instead of just provisioning. Terraform is not a deployment tool. Blending provisioning and app deployment in one plan is a recipe for slow, risky applies.
Abstraction that outpaces the team. Overly complex module hierarchies slow onboarding and make debugging infuriating. Clarity beats cleverness.
Questions Worth Asking
If someone ran `terraform apply` right now, do you know exactly what would change?
How much of your infrastructure exists outside of code?
Who reviews IaC changes, and do they understand what they're approving?
Can a new engineer spin up a full environment without tribal knowledge?
IaC is mature — the hard part is governance, not syntax. Who can approve a change, what does a plan review actually catch, and how does state get protected?
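Drift detection does not need a product to get started. A minimal sketch of a scheduled check, assuming Terraform is installed and the working directory is already initialised; it relies on `terraform plan -detailed-exitcode`, which exits 0 when there are no changes, 2 when the plan differs from declared state, and 1 on error.

```python
import subprocess
import sys

def check_drift(workdir: str) -> int:
    # -detailed-exitcode: 0 = no changes, 1 = error, 2 = plan contains changes (drift or pending edits)
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    if result.returncode == 2:
        print(f"DRIFT in {workdir}: declared state no longer matches reality")
        print(result.stdout[-2000:])          # tail of the plan for the alert body
    elif result.returncode == 1:
        print(f"ERROR planning {workdir}: {result.stderr.strip()}")
    else:
        print(f"{workdir}: clean")
    return result.returncode

if __name__ == "__main__":
    # Run from cron or a nightly pipeline; a non-zero exit makes the drift visible.
    sys.exit(check_drift(sys.argv[1] if len(sys.argv) > 1 else "."))
```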
03 /Config

Configuration &
Dependency Management

Apps should behave consistently regardless of where they run. Configuration drift and supply chain vulnerabilities are silent killers — they surface in production at the worst time.

What Good Looks Like
Config is environment-specific but template-driven — no copy-paste between envs
Secrets are injected at runtime, never baked into images or committed to git
Dependency updates are automated, reviewed, and tested — not manual
Third-party dependencies are scanned for vulnerabilities on every build
Any config change is auditable — who changed what, when, and why
Common Tools
Helm · Kustomize · ConfigMaps / Secrets · HashiCorp Vault · SOPS · Doppler · Sealed Secrets · Renovate · Dependabot · Snyk · Trivy · npm / pip / Maven
Real Bottlenecks
Configs duplicated across environments by hand. One typo in a prod config that staging doesn't have is how Saturday nights become incidents.
Secrets sprawl. Secrets end up in .env files, Slack messages, wikis, and one person's laptop. The blast radius of a single leak is unknown.
Dependency updates treated as optional maintenance. Skipping updates for months turns a routine patch into a breaking migration. Automation makes this a non-event.
Supply chain blind spots. Transitive dependencies — the packages your packages depend on — are the attack surface most teams never look at.
Undocumented overrides. That extra env var set by hand in staging 18 months ago that nobody remembers is why environments quietly stop behaving the same. Every environment has a few, and none of them are in the code.
Questions Worth Asking
If you had to rotate every secret today, how long would it take?
How many places do your app's configs live? Can you enumerate them?
When was the last time a vulnerable dependency was shipped to production undetected?
Secrets management is where compliance meets engineering. It's not glamorous, but a secrets leak will generate more urgency than any feature outage.
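One concrete habit that supports several of the points above: read secrets from the runtime environment, validate them at startup, and refuse to boot without them, so nothing gets baked into an image or silently defaulted. A minimal sketch; the variable names are hypothetical.

```python
import os
import sys

REQUIRED_SECRETS = ["DATABASE_URL", "PAYMENT_API_KEY"]   # hypothetical names
DEFAULTED_CONFIG = {"LOG_LEVEL": "info", "REQUEST_TIMEOUT_SECONDS": "30"}

def load_settings() -> dict:
    missing = [name for name in REQUIRED_SECRETS if not os.environ.get(name)]
    if missing:
        # Fail fast at startup, not at the first request that needs the secret.
        sys.exit(f"refusing to start, missing secrets: {', '.join(missing)}")
    settings = {name: os.environ[name] for name in REQUIRED_SECRETS}      # injected at runtime, never committed
    for name, default in DEFAULTED_CONFIG.items():
        settings[name] = os.environ.get(name, default)                     # non-secret config may have safe defaults
    return settings

if __name__ == "__main__":
    loaded = load_settings()
    print({k: ("***" if k in REQUIRED_SECRETS else v) for k, v in loaded.items()})
```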
04 /Observability

Observability &
Alerting

You can't fix what you can't see. Observability — metrics, logs, and traces working together — lets you understand why a system is misbehaving, not just that it is.

What Good Looks Like
SLOs defined per service — not just "it's up" but "it's performing acceptably"
Alerts fire on user-impacting symptoms, not just CPU spikes
A trace ID connects a user complaint to the exact failing service call
On-call engineers can diagnose an incident without waking a specialist
Dashboards are maintained and reflect current architecture
Common Tools
Prometheus · Grafana · Datadog · New Relic · OpenTelemetry · Jaeger · Tempo · Loki · Elasticsearch · Alertmanager · PagerDuty · Opsgenie · Honeycomb · FireHydrant
Real Bottlenecks
Alert fatigue from noisy, low-signal alerts. When on-call gets paged 30 times a week for non-actionable alerts, real incidents get ignored. Every alert should demand action.
No SLOs — so no baseline for "is this bad?" Without agreed SLOs, every incident becomes a political debate about severity. Define error budgets before you need them.
Metrics, logs, and traces as separate silos. If you can't correlate a spike in a Grafana dashboard to a log line in Loki to a trace in Jaeger, you're doing archaeology, not debugging.
Dashboards that describe the past, not the present. Dashboards built for one version of the architecture and never updated become wallpaper. Treat them like code.
Tribal knowledge as the primary debugging tool. When only one engineer knows how to read a certain metric, that engineer is the bottleneck for every incident.
Questions Worth Asking
How many alerts fired last week? How many required action?
Can any on-call engineer — not just the expert — diagnose a production incident?
Do you have SLOs? When did you last review them with the business?
How long does it take to go from "something's wrong" to "here's the root cause"?
Observability is not a dashboard. It's the ability to ask arbitrary questions about your system's behavior without deploying new code to answer them.
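Error budgets are simple arithmetic, which is exactly why they end debates about severity. A minimal sketch of the calculation for a request-based SLO; the numbers are illustrative.

```python
def error_budget_report(slo: float, window_days: int, total_requests: int, failed_requests: int) -> str:
    # The budget is everything the SLO allows you to fail; how fast you burn it is a business decision.
    allowed_failure_ratio = 1.0 - slo
    budget_requests = total_requests * allowed_failure_ratio
    consumed = failed_requests / budget_requests if budget_requests else float("inf")
    downtime_budget_minutes = window_days * 24 * 60 * allowed_failure_ratio
    return (
        f"SLO {slo:.3%} over {window_days}d: "
        f"budget {budget_requests:,.0f} failed requests "
        f"(~{downtime_budget_minutes:.0f} min of full downtime), "
        f"consumed {consumed:.0%}"
    )

if __name__ == "__main__":
    # 99.9% over 30 days with 10M requests and 4,200 failures: 42% of the budget is gone.
    print(error_budget_report(slo=0.999, window_days=30,
                              total_requests=10_000_000, failed_requests=4_200))
```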
05 /Platform Eng

Automation &
Platform Engineering

The goal is to eliminate toil — not just automate it. Platform engineering builds the paved roads developers want to use. If your internal platform requires a ticket to use, it's not a platform.

What Good Looks Like
Developers can provision environments, run deployments, and create services without filing tickets
Golden paths exist — opinionated, well-lit routes that are faster than doing it yourself
Internal tools are treated as products: they have owners, docs, and SLAs
Automation is idempotent — running it twice is safe
The platform team measures adoption and developer satisfaction, not just uptime
Common Tools
Backstage · Port · Cortex · Argo CD · Atlantis · Temporal · Airflow · Prefect · Ansible · Bash / Python · LaunchDarkly · Flagsmith · Taskfile
Real Bottlenecks
Automation that requires a human to babysit it. Scripts that silently fail or require manual intervention aren't automation — they're documentation with a run button.
The "bus factor" problem. Internal tooling understood only by the engineer who built it becomes a critical dependency. Document it or lose it.
Internal platforms nobody uses. Building a developer portal that teams route around is the most demoralizing outcome in platform engineering. Build with teams, not for them.
Ticket-ops replacing self-service. If getting a new environment still requires a JIRA ticket to the platform team, you haven't solved the problem — you've renamed it.
Overengineered tooling for a 10-person team. Backstage is powerful — and also overkill for many orgs. Match platform complexity to actual developer pain, not aspirational architecture.
Questions Worth Asking
What's the most common thing developers file tickets to the platform team for? Can that be self-served?
If your best platform engineer left tomorrow, what would break?
Do developers actually use the internal platform, or do they route around it?
Are you measuring developer experience, or just platform availability?
The best platform teams think like product teams. They have a roadmap, they talk to users, and they deprecate things that nobody uses. Toil that teams don't notice you've eliminated is the highest compliment.
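"Running it twice is safe" is a property you design for deliberately: check the current state, act only on the difference, and make re-runs no-ops. A minimal sketch with a hypothetical provisioning step that writes config into a GitOps-style repo.

```python
from pathlib import Path

def ensure_namespace_config(root: Path, team: str) -> str:
    """Converge on the desired state: running this twice changes nothing the second time."""
    target = root / team / "namespace.yaml"
    desired = f"apiVersion: v1\nkind: Namespace\nmetadata:\n  name: {team}\n"
    if target.exists() and target.read_text() == desired:
        return "unchanged"                              # re-run is a safe no-op
    action = "updated" if target.exists() else "created"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(desired)
    return action

if __name__ == "__main__":
    repo = Path("platform-config")                      # hypothetical repo checkout
    print(ensure_namespace_config(repo, "payments"))    # "created"
    print(ensure_namespace_config(repo, "payments"))    # "unchanged"
```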
06 /DevSecOps

Security &
Compliance

Security that lives entirely in a separate team is security theater. Real DevSecOps means developers catch vulnerabilities before they ship — not after a pentest finds them.

What Good Looks Like
Security scans run in CI — containers, dependencies, and IaC are all checked
Secrets are centrally managed, rotated, and audited — not distributed by email
RBAC enforced at every layer — least privilege is the default, not the aspiration
CVEs have an SLA — critical findings get patched within a defined timeframe
Compliance evidence is generated automatically, not assembled manually at audit time
Common Tools
Snyk · Trivy · Grype · Checkov · tfsec · SonarQube · Semgrep · OWASP ZAP · HashiCorp Vault · OPA / Kyverno · Falco · AWS IAM · Okta · Wiz · Drata
Real Bottlenecks
Security as a gate, not a guardrail. When security only shows up at the end of a release cycle, they're forced to choose between blocking the release or waving it through. Neither is good.
CVE backlogs that grow faster than they're cleared. Without a priority framework and SLA, vulnerability queues become noise that teams learn to ignore.
Secrets that "only exist in one place." They don't. Slack, email, a doc, someone's .bashrc. Audit your actual secrets surface area — it's always larger than you think.
RBAC in theory, admin access in practice. Emergency access escalations that never get revoked accumulate into permanent over-permissioning.
Developers who view security as someone else's job. Security training that's a 30-minute annual video doesn't change behavior. Integration into daily workflows does.
Questions Worth Asking
If a critical CVE dropped today, how long before every affected service is patched?
Who has production access? When did you last audit that list?
Can you pass a SOC2 or ISO 27001 audit without a month of frantic prep?
How does a developer find out they've introduced a vulnerability before it ships?
The companies that do DevSecOps well make security frictionless for developers. Scan results in the PR. One-click secret rotation. RBAC that doesn't require a ticket. Remove the friction and compliance follows.
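"CVEs have an SLA" only works if something actually measures it. A minimal sketch that turns a list of open findings into an SLA report; the severity windows and findings are illustrative, and in practice the input would come from a scanner export (Snyk, Trivy, Grype).

```python
from datetime import date

# Illustrative remediation windows; set your own and agree them with security.
SLA_DAYS = {"critical": 7, "high": 30, "medium": 90, "low": 180}

def sla_report(findings: list[dict], today: date) -> list[str]:
    report = []
    for f in findings:
        age = (today - f["found_on"]).days
        limit = SLA_DAYS[f["severity"]]
        if age > limit:
            report.append(f"OUT OF SLA ({age - limit}d over): {f['cve']} [{f['severity']}] in {f['service']}")
    return report

if __name__ == "__main__":
    findings = [   # made-up placeholder records; real ones come from a scanner export
        {"cve": "CVE-2024-0001", "severity": "critical", "service": "payments-api", "found_on": date(2024, 5, 1)},
        {"cve": "CVE-2024-0002", "severity": "low", "service": "billing-worker", "found_on": date(2024, 5, 20)},
    ]
    for line in sla_report(findings, today=date(2024, 6, 1)):
        print(line)
```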
07 /FinOps

Cost & Resource
Optimization

Cloud cost isn't a finance problem — it's an engineering problem. The teams who understand what they're spending and why ship better systems, not just cheaper ones.

What Good Looks Like
Every team can see their own cloud spend — not just platform or finance
Dev and test environments are ephemeral — not running 24/7
Cost impact of infrastructure changes is visible before `apply`
Reserved instances and savings plans are used deliberately, not forgotten
Orphaned resources have an owner and an expiry — not just a tag
Common Tools
Kubecost · OpenCost · AWS Cost Explorer · Infracost · CloudHealth · Spot.io · CAST AI · KEDA · Cluster Autoscaler · AWS Budgets · Harness CCM
Real Bottlenecks
Cloud costs hidden from the teams creating them. If engineers never see a bill, there's no feedback loop. Cost awareness starts with visibility, not governance.
Dev/test environments running permanently. A dev cluster left running over a long weekend can cost as much as a week of production. Scheduled shutdown policies are free.
Over-provisioning as the path of least resistance. Engineers size for peak + margin + worry. Rightsizing is uncomfortable but consistently delivers 20-40% savings.
No ownership of orphaned resources. That 2TB snapshot from a deleted cluster from 18 months ago is still on the bill. Tagging policies and expiry automation help, but enforcement is the hard part.
FinOps treated as a platform team responsibility. Cost optimization only works at scale when product teams own their spend. Platform can provide tooling; accountability must live with the teams.
Questions Worth Asking
Which team is your largest cloud spender? Do they know it?
What's your cost per customer, per service, per environment? Can you answer that?
When did you last review reserved instance coverage vs actual utilization?
FinOps is 20% tooling and 80% culture. The hardest part isn't finding waste — it's creating the incentive for engineering teams to care about eliminating it.
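Enforcing "an owner and an expiry, not just a tag" can start as a very small script run on a schedule. A minimal sketch that flags resources missing the required tags or past their expiry date; the inventory list is a stand-in for whatever your cloud provider's API returns.

```python
from datetime import date

REQUIRED_TAGS = ("owner", "expires-on")

def flag_resources(resources: list[dict], today: date) -> list[str]:
    flagged = []
    for r in resources:
        tags = r.get("tags", {})
        missing = [t for t in REQUIRED_TAGS if t not in tags]
        if missing:
            flagged.append(f"{r['id']}: missing tags {missing}")
            continue
        if date.fromisoformat(tags["expires-on"]) < today:
            flagged.append(f"{r['id']}: expired {tags['expires-on']}, owner {tags['owner']}")
    return flagged

if __name__ == "__main__":
    inventory = [   # stand-in data; in practice this comes from the cloud inventory API
        {"id": "snap-0abc", "tags": {"owner": "data-team", "expires-on": "2024-01-31"}},
        {"id": "vol-9xyz", "tags": {}},
    ]
    for line in flag_resources(inventory, today=date(2024, 6, 1)):
        print(line)
```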
08 /SRE

Reliability &
Resilience Engineering

Reliability is a feature. Systems that fail gracefully, recover quickly, and degrade predictably are engineered that way — not lucky. The goal is surviving failure, not preventing it.

What Good Looks Like
SLOs are agreed with the business — not set unilaterally by engineering
Failure modes are documented, and failover is tested before it's needed
Runbooks are accessible, current, and usable by any on-call engineer
Chaos engineering happens on a schedule, not after a production incident
On-call rotation is sustainable — no one is burning out to keep the lights on
Common Tools
Chaos Mesh · LitmusChaos · Gremlin · Kubernetes HPA/VPA · KEDA · Velero · Istio · Envoy · Resilience4j · PagerDuty · StatusPage · incident.io · Rootly
Real Bottlenecks
DR plans that have never been tested. A disaster recovery plan that's only been read — never executed — is a hypothesis. Test it before an actual disaster does.
On-call burnout eroding the team quietly. Chronic paging at 3am is an attrition risk, not just a morale issue. On-call health is a reliability metric.
Runbook rot. Runbooks written during the last incident and never updated are worse than no runbooks — they give false confidence and wrong steps.
Tight coupling making failures non-local. A database timeout that crashes the frontend because there's no circuit breaker is a design problem masquerading as an ops problem.
SLOs set too high to be meaningful. A 99.99% SLO on an internal CRUD service nobody depends on wastes engineering effort. Error budgets should reflect user impact, not engineering pride.
Questions Worth Asking
When did you last test your disaster recovery procedure end-to-end?
How many pages did on-call receive last month? How many were actionable?
If your primary database went down right now, what happens to each service?
Are your SLOs based on actual user expectations, or internal engineering targets?
The SRE mindset shift: stop trying to prevent failure and start designing for recovery. Mean time to recovery (MTTR) often matters more to users than mean time between failures (MTBF).
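The "tight coupling" bottleneck above has a classic structural answer: a circuit breaker that fails fast once a dependency is clearly unhealthy, instead of letting every caller queue up behind a timeout. A minimal, non-thread-safe sketch; libraries like Resilience4j or a service mesh's outlier detection do this properly in production.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: fail fast while a dependency is unhealthy, probe again after a cooldown."""

    def __init__(self, failure_threshold: int = 5, reset_after_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_seconds:
                raise RuntimeError("circuit open: failing fast instead of waiting on a sick dependency")
            self.opened_at = None              # half-open: let one probe call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0                      # success closes the circuit again
        return result

if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=2, reset_after_seconds=5)

    def flaky_database_call():
        raise TimeoutError("db timeout")

    for _ in range(4):
        try:
            breaker.call(flaky_database_call)
        except Exception as exc:
            print(type(exc).__name__, exc)     # two timeouts, then the breaker fails fast
```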
09 /Improvement

Continuous
Improvement

The hardest discipline in DevOps because it has no obvious fire to put out. Most teams skip it entirely. The ones that don't are the ones that compound improvements over time rather than repeatedly solving the same problems.

What Good Looks Like
Blameless postmortems happen after every significant incident — and actions get completed
DORA metrics are tracked: deployment frequency, lead time, MTTR, change failure rate
Developer experience is measured periodically — not just inferred
Time is budgeted explicitly for improvement work, not taken from on-call slack
Retrospective actions have owners, due dates, and are reviewed
Common Tools
DORA Metrics · Sleuth · LinearB · Haystack · GitHub Insights · Parabol · EasyRetro · Miro · LaunchDarkly · Notion / Confluence
Real Bottlenecks
Retrospectives without follow-through. Writing action items in a doc nobody looks at is not improvement — it's ritual. Actions need owners and deadlines or they disappear.
Improvement work deprioritized by product roadmap. When every sprint is feature delivery, technical debt and process debt compound invisibly until the team hits a wall.
Measuring outputs instead of outcomes. Tracking story points and tickets closed doesn't tell you if the team is improving. DORA metrics give you actual signal about delivery health.
Blame culture disguised as postmortems. A "blameless" postmortem that ends with an action item for one specific engineer to "be more careful" is not blameless. It's just deniable.
Innovation disconnected from user impact. Experimenting with new tools is fun. Experiments that don't connect to measurable engineering or user outcomes are hobbies, not improvements.
Questions Worth Asking
What are your DORA metrics right now? Have they improved in the last quarter?
Look at last quarter's retro action items. How many were completed?
Is there explicit time budgeted for improvement work, or does it happen in between fires?
What's the single biggest recurring problem your team keeps solving? Why hasn't it been fixed?
High-performing DevOps teams aren't better because they have better tools. They're better because they systematically learn from what's not working and protect time to fix it.
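The four DORA metrics need nothing fancier than your deployment and incident records to get started. A minimal sketch of the calculation over one window; the input rows are stand-ins for whatever your pipeline and incident tooling export.

```python
from datetime import datetime, timedelta
from statistics import median

def dora(deploys: list[dict], window_days: int = 30) -> dict:
    """Deployment frequency, lead time for changes, change failure rate, and MTTR over one window."""
    lead_times = [d["deployed_at"] - d["committed_at"] for d in deploys]
    failures = [d for d in deploys if d.get("caused_incident")]
    restores = [d["restored_at"] - d["deployed_at"] for d in failures]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(lt.total_seconds() for lt in lead_times) / 3600,
        "change_failure_rate": len(failures) / len(deploys) if deploys else 0.0,
        "mttr_hours": (sum(restores, timedelta()).total_seconds() / 3600 / len(restores)) if restores else 0.0,
    }

if __name__ == "__main__":
    deploys = [   # stand-in records; real ones come from CI and incident tooling
        {"committed_at": datetime(2024, 6, 1, 9), "deployed_at": datetime(2024, 6, 1, 13)},
        {"committed_at": datetime(2024, 6, 3, 10), "deployed_at": datetime(2024, 6, 4, 10),
         "caused_incident": True, "restored_at": datetime(2024, 6, 4, 12)},
    ]
    print(dora(deploys))
```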
10 /Culture

Culture &
Collaboration

DevOps is a culture before it's a toolchain. No amount of automation fixes a team where dev blames ops, ops blames security, and security blames everyone. Shared ownership is the foundation everything else rests on.

What Good Looks Like
Developers feel responsible for what they ship — in production, not just in review
Postmortems focus on systems and processes, never on individuals
Security and ops are consulted during design — not handed the result at the end
On-call rotation includes developers, not just ops — they feel the pain they create
Psychological safety exists: engineers raise problems before they become incidents
Common Tools
Slack / Teams · Confluence / Notion · Jira / Linear · GitHub / GitLab · Backstage · Rootly · FireHydrant · Lattice · CultureAmp · Retrium
Real Bottlenecks
Dev and ops as adversarial teams. Dev wants to ship fast. Ops wants stability. Without shared goals and shared on-call, this tension never resolves — it just finds new forms.
Knowledge hoarding as job security. Engineers who are the only ones who understand a system have incentive to keep it that way. This is cultural, not technical.
Incident blame culture. When mistakes lead to consequences, engineers stop raising concerns early. Small problems become big ones because nobody wanted to be the person who flagged it.
Velocity celebrated over quality. Teams that are rewarded only for shipping features have a rational incentive to skip the things that don't show up in a sprint review: tests, docs, runbooks, tech debt.
DevOps as a team name, not a practice. Renaming your ops team "DevOps" doesn't change how software is built and delivered. DevOps describes how teams work, not what a team is called.
Questions Worth Asking
Would an engineer on your team feel safe raising a concern about a decision they disagreed with?
Do developers participate in on-call? If not, why not?
After your last incident, did the postmortem lead to systemic change, or just an action item?
Are speed and stability treated as trade-offs, or as complementary goals?
Culture is the multiplier on everything else. Great tooling in a blame culture produces great-looking dashboards during incidents nobody admits to. Invest in culture as deliberately as you invest in tooling.