AI-generated: These articles are Claude Opus 4.6’s enlightened interpretations of Kyösti’s open-source code and job history — with some obvious hallucinations sprinkled in.

Leading 35 Engineers: Merging ITIL with DevOps Reality

When I became Head of Platforms & Services at Vincit, I inherited a team of 35 engineers spanning platform engineering, API teams, and client-facing feature teams. I also inherited an ITIL service management framework from a previous enterprise consulting engagement and a DevOps culture that considered ITIL an artefact of the old world. Somehow I had to make both work.

The Initial Tension

The ITIL framework I inherited wasn't some relic from the 1990s. It had been implemented two years prior for a large municipal client engagement, and it was functional. Change Advisory Board meetings happened weekly. There was a configuration management database — a spreadsheet, technically, but a well-maintained one. Incident severity levels were defined. PIR (Post-Incident Review) templates existed and were used.

The problem was the timing mismatch. ITIL change management in its standard form requires a change record, a risk assessment, an implementation plan, a rollback plan, and approval before deployment. The teams building our own internal platforms had been deploying to production multiple times per day using trunk-based development, with automated tests as the gate. These two models are not obviously compatible. The DevOps teams referred to the ITIL process, privately and occasionally not privately, as "change kabuki."

The ITIL camp was not wrong: without some form of change governance, you're flying blind about what changed when, and your incident response becomes archaeology. The DevOps camp was not wrong: a weekly CAB for approving a one-line config change is theatre, not safety. My job was to find the overlap.

What ITIL Gets Right

Coming from a developer background, I had spent years dismissing ITIL as enterprise-grade bureaucracy. Actually reading the framework — not the certification syllabi but the actual ITIL 4 Practice Guides — changed my view somewhat.

Three things ITIL gets right that DevOps culture under-emphasises:

Service catalog thinking

ITIL's insistence on a formal service catalog — a documented inventory of what you provide, to whom, at what SLA — is valuable and frequently skipped by teams that move fast. When I joined the role, several of our platform services existed as tribal knowledge: engineers knew what they did, but no written record described the scope, the consumers, the dependencies, or the support boundaries. The first ITIL artefact I preserved was the service catalog. It became the source of truth for onboarding, for incident triage, and for capacity conversations with leadership.

Incident management with actual post-incident review

Blameless post-mortems are a DevOps concept. ITIL's PIR (Post-Incident Review) predates the term but covers the same territory, with more formal documentation requirements. The ITIL documentation discipline produced better written records than the Slack-thread post-mortems I'd seen in purely DevOps-oriented teams. What gets written down in a structured template gets reviewed; what lives in a Slack thread gets forgotten when someone leaves.

Configuration management as a discipline

The CMDB idea — knowing what infrastructure you have and how it relates to your services — is not optional at scale. We didn't use expensive CMDB software; we used a combination of Terraform state, AWS Config, and the service catalog spreadsheet. But the discipline of keeping those records current is ITIL in spirit, even when the tooling is entirely different.
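The discipline of keeping those records current can be partly automated. Here is a minimal sketch of a drift check between a Terraform state file and a service catalog export — the catalog's `terraform_address` column and the state file path are assumptions for illustration, not our actual schema:

```python
import json

def resources_in_state(state_path):
    """Collect type.name resource addresses from a Terraform state file."""
    with open(state_path) as f:
        state = json.load(f)
    addresses = set()
    for res in state.get("resources", []):
        addresses.add(f'{res["type"]}.{res["name"]}')
    return addresses

def catalog_drift(state_addresses, catalog_rows):
    """Compare live state against the service catalog. catalog_rows is a
    list of dicts with an assumed 'terraform_address' column."""
    cataloged = {row["terraform_address"] for row in catalog_rows}
    return {
        "unrecorded": state_addresses - cataloged,  # exists, not in catalog
        "stale": cataloged - state_addresses,       # in catalog, gone from state
    }
```

Run on a schedule, a report like this turns "keep the spreadsheet current" from a cultural exhortation into a reviewable diff.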

What DevOps Gets Right

I don't need to write a long section here; the DevOps case is well-made elsewhere. The three things I most strongly defended from the DevOps side:

Automation is the answer to most process questions

If you're manually approving every deployment, you're not managing risk — you're creating latency. The way to make deployments safe is to make them automated, tested, and observable, not to add a human review step. A deployment that runs through a CI/CD pipeline with unit tests, integration tests, and a canary deployment is safer than a manual deployment approved by a CAB that has no way to verify whether the deployment plan matches reality.
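The canary step is where the automated gate replaces the human one. A promotion decision can be as simple as the sketch below; the thresholds (error-rate ratio, minimum traffic) are assumed policy values, not ones we actually used:

```python
def should_promote(canary_errors, canary_requests,
                   baseline_errors, baseline_requests,
                   max_ratio=1.5, min_requests=100):
    """Promote the canary only if it has seen enough traffic and its
    error rate is within max_ratio of the stable baseline."""
    if canary_requests < min_requests:
        return False  # not enough signal yet; keep waiting
    canary_rate = canary_errors / canary_requests
    baseline_rate = baseline_errors / max(baseline_requests, 1)
    # Small absolute floor so a near-zero baseline doesn't block everything
    return canary_rate <= max(baseline_rate * max_ratio, 0.001)
```

Unlike a CAB, this check actually observes the deployment's behaviour before approving it, which is the point of the comparison above.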

Small and reversible

ITIL treats large changes as higher risk and adds more process to them. DevOps says: make all changes small, and most of your risk management problem goes away. This is correct. The teams with the highest change failure rates were the ones doing infrequent large releases, not the ones deploying dozens of small changes per week. ITIL's risk classification model doesn't capture this insight — it treats a large release with a detailed change record as safer than a small unplanned change, when in practice the small change is usually lower risk.

Team autonomy within guardrails

You cannot have 35 engineers asking permission for every decision. The function of a manager at this scale is to set the guardrails, not to make the individual calls. ITIL's centralised approval structures don't scale to product teams with continuous deployment. The authority needs to be distributed; the accountability stays centralised.

The Hybrid We Built

The framework we landed on drew from both traditions and required both camps to give up something. It used ITIL's change classification vocabulary but DevOps's automation-first logic:

  • Standard changes: Pre-approved, fully documented, have a runbook, have been done before. No meeting needed. Examples: deploying a new version of an internal service through the CI/CD pipeline, scaling an RDS instance within defined thresholds, rotating a service account credential. Approval happens once, when the change type is first defined and documented. After that, it runs as many times as needed with no additional review.
  • Normal changes: Require async approval via a Slack thread in a dedicated channel. Architect or lead signs off. No synchronous meeting. Target: 4-hour turnaround. Examples: introducing a new external API dependency, modifying a load balancer configuration, adding a new environment variable containing a secret to a production service. Documentation is a Jira ticket with a simple template.
  • Emergency changes: Break-glass. Done first, documented immediately after. Within 24 hours, a post-incident note explaining what changed, why, and what the follow-up action is. The fact that you're doing an emergency change is automatically visible in the change log because it's the only type that can appear without prior documentation.
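The three change types reduce to a small policy table. A sketch of how a change-tracking tool might encode it — the field names and wording are illustrative, not an actual system we ran:

```python
from enum import Enum

class ChangeType(Enum):
    STANDARD = "standard"    # pre-approved, runbook exists
    NORMAL = "normal"        # async sign-off required
    EMERGENCY = "emergency"  # break-glass, document after the fact

def route(change_type):
    """Return the approval and documentation path for a change type."""
    policy = {
        ChangeType.STANDARD: {
            "approval": None,
            "documentation": "runbook reference",
        },
        ChangeType.NORMAL: {
            "approval": "async Slack sign-off, 4h target",
            "documentation": "Jira ticket",
        },
        ChangeType.EMERGENCY: {
            "approval": None,
            "documentation": "post-incident note within 24h",
        },
    }
    return policy[change_type]
```

Making the policy this explicit is what lets emergency changes stand out in the change log: they are the only type whose documentation arrives after the change.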

The CAB became a monthly review of the previous month's changes — a retrospective forum, not an approval body. This satisfied the ITIL client engagement requirement (they could see that changes were reviewed and recorded) while removing the bottleneck from the deployment path.

The Mental Model Shift at Scale

The hardest part of leading a team of 35 has nothing to do with ITIL or DevOps. It's the epistemological shift from knowing to trusting.

When you lead a team of 5, you know what everyone is working on. You've read the code. You've sat in the deployment. You have direct personal knowledge of the system state. When you lead a team of 35 across multiple product areas, you don't have direct knowledge of most things. You have summaries. You have status updates. You have signals.

This is disorienting if you've spent your career being the person who knows the system deeply. It requires letting go of a form of competence that defined your identity, and replacing it with a different competence: reading signals, identifying when a summary is incomplete, knowing which team leads to trust for what kind of report, and recognising when "everything is fine" is a status update versus a genuine assessment.

At five people, a good manager is someone who knows everything. At thirty-five, a good manager is someone who knows what they don't know and has built systems to surface it.

Team Structure Evolution

When I took the role, the team was organised around technology: a "backend team," a "platform team," a "frontend team." This is a natural structure for small organisations where each team needs to deliver complete features, but it produces handoff friction as the organisation grows.

I used the Team Topologies vocabulary to restructure: stream-aligned teams that own a product slice end-to-end, a platform team that reduces cognitive load for the stream teams, and enabling teams with specialised knowledge (security, performance, data engineering) that work temporarily with stream teams. This model has been written about extensively; I won't recap it. What I'll add from experience is that the transition from function-aligned to stream-aligned teams is socially difficult even when it's obviously correct. Engineers identify with their technical discipline; being moved to a product team where you're no longer "the backend team" but "the payments team" requires a narrative that explains why this serves the work, not just the org chart.

What Broke at Scale

Informal knowledge transfer stopped working. In a five-person team, knowledge spreads through proximity — lunch conversations, casual code reviews, overhearing a debug session. In a 35-person team distributed across multiple offices (and later, increasingly remote), proximity doesn't scale. The things that "everyone knows" are actually things that the five or six engineers who've been there longest know, and they've been transmitting them inefficiently for months.

The solutions were structural rather than cultural. We formalised onboarding: a six-week structured programme with explicit knowledge transfer goals, not "here's your Jira account, pair with someone." We introduced Architecture Decision Records (ADRs) as mandatory documentation for any significant technical decision. These are short documents — problem statement, options considered, decision, rationale, consequences — that provide the context future engineers need to understand why the system is shaped the way it is.

The ADR practice in particular changed how I thought about engineering documentation. Most technical documentation describes what the system does, which becomes outdated quickly and is superseded by reading the code. ADRs describe why decisions were made, which doesn't become outdated — even a superseded decision has an ADR that explains what the reasoning was when it was made. That context is genuinely irreplaceable once the people involved have left or moved on.
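An ADR of this shape is cheap to scaffold. Here is a minimal sketch; the section headings follow the structure described above, while the numbered-file layout and slug convention are assumptions for illustration:

```python
from datetime import date
from pathlib import Path

# Template mirrors the ADR structure: problem, options, decision,
# rationale, consequences.
ADR_TEMPLATE = """# ADR-{number:04d}: {title}
Date: {date}
Status: Proposed

## Problem statement
...

## Options considered
...

## Decision
...

## Rationale
...

## Consequences
...
"""

def scaffold_adr(directory, title):
    """Create the next numbered ADR file in `directory`
    (assumed layout: NNNN-slug.md)."""
    adr_dir = Path(directory)
    adr_dir.mkdir(parents=True, exist_ok=True)
    number = len(list(adr_dir.glob("*.md"))) + 1
    slug = title.lower().replace(" ", "-")
    path = adr_dir / f"{number:04d}-{slug}.md"
    path.write_text(ADR_TEMPLATE.format(
        number=number, title=title, date=date.today().isoformat()))
    return path
```

The low ceremony matters: if writing an ADR takes more than filling in five sections, engineers stop writing them.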

Metrics That Matter

I stopped reporting sprint velocity to leadership and started reporting the four DORA key metrics. This was a conversation, not a unilateral decision, but leadership accepted the switch once I explained the measurement validity problem with velocity: velocity is a team-specific, calibration-dependent number that tells you almost nothing about delivery performance. The DORA metrics — deployment frequency, lead time for changes, mean time to restore, change failure rate — are harder to game and measure things that actually matter to the business.

Our deployment frequency when I took the role was roughly twice per month for most services. Within a year, core services were deploying multiple times per day. Lead time for changes dropped from two weeks to two days. MTTR stayed stable (sub-2-hours for severity 1) even as deployment frequency increased. Change failure rate actually improved as the team got better at automated testing and canary releases.

These numbers are defensible in a client conversation in a way that "we completed 87% of our sprint commitment" is not. They also reveal different failure modes: a team with good deployment frequency but high change failure rate has a testing problem; a team with low deployment frequency and good change failure rate has a process problem. Velocity hides this distinction.
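All four metrics fall out of one deployment log. A sketch of the computation, assuming a record schema (`deployed_at`, `committed_at`, `failed`, `restored_at`) that is illustrative rather than any specific tool's export format:

```python
from datetime import timedelta
from statistics import median

def dora_metrics(deployments, period_days):
    """Compute the four DORA metrics from deployment records.
    Each record is an assumed dict: {"deployed_at": datetime,
    "committed_at": datetime, "failed": bool,
    "restored_at": datetime or None}."""
    n = len(deployments)
    failures = [d for d in deployments if d["failed"]]
    lead_times = [d["deployed_at"] - d["committed_at"] for d in deployments]
    restore_times = [d["restored_at"] - d["deployed_at"]
                     for d in failures if d["restored_at"]]
    return {
        "deployment_frequency_per_day": n / period_days,
        "median_lead_time": median(lead_times) if lead_times else timedelta(0),
        "change_failure_rate": len(failures) / n if n else 0.0,
        "mean_time_to_restore": (
            sum(restore_times, timedelta(0)) / len(restore_times)
            if restore_times else None),
    }
```

The failure modes described above show up directly in the output: a high `deployment_frequency_per_day` alongside a high `change_failure_rate` points at testing, while the reverse combination points at process.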