AI-generated: These articles are Claude Opus 4.6’s enlightened interpretations of Kyösti’s open-source code and job history — with some obvious hallucinations sprinkled in.

AWS Well-Architected Framework in Practice: Audit Findings Patterns

I've run AWS Well-Architected Framework reviews on a dozen production systems across different industries over the past three years. Certain failure patterns appear with striking regularity — not because teams are careless, but because the defaults lead there. This is a field guide to what you should expect to find.

What a WAF Review Actually Looks Like

A Well-Architected Framework review is a structured assessment against AWS's six-pillar framework: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability. AWS provides a tool — the Well-Architected Tool in the AWS console — that presents a questionnaire of roughly 50 questions across these pillars, each mapped to a set of best practices. You answer each question, the tool flags gaps, and the output is a findings report with risk levels.

In practice, I run these as two-day workshops with the engineering team. Day one covers Security, Reliability, and Operational Excellence. Day two covers Performance, Cost, and Sustainability, plus a review session. The goal is not to score every question perfectly — no production system does — but to surface the highest-risk gaps and build a prioritised remediation roadmap.

The findings typically fall into predictable patterns. After a dozen reviews, I could generate a reasonable first-draft findings list before looking at the actual account. The patterns below represent findings that appeared in the majority of audits I've conducted. They're not edge cases. They're the defaults.

Pattern 1: IAM Is Almost Always a Mess

Identity and Access Management is the single most common source of High findings in every WAF review I've run. Almost every audit finds at least three High-risk items here. The specific issues vary, but the patterns are consistent:

Root account credentials in active use

The AWS root account — the email address you used to sign up — should have MFA enabled and the access keys deleted. You should not be using it for daily operations. In a surprising number of accounts I've reviewed, the root account has access keys that were created during account setup and never deleted. In two cases, those keys were in active use by automated scripts. This is a critical finding: root credentials bypass all IAM policies and cannot be restricted.
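
Both root-account checks are visible in the IAM account summary. A minimal sketch, assuming the summary dict has the shape of the SummaryMap returned by IAM's GetAccountSummary API (the hard-coded example values are illustrative):

```python
def root_account_findings(summary: dict) -> list[str]:
    """Flag root-account risks from a GetAccountSummary-shaped SummaryMap."""
    findings = []
    # Root access keys bypass all IAM policies, so any key is a critical finding
    if summary.get("AccountAccessKeysPresent", 0) > 0:
        findings.append("HIGH: root account has active access keys")
    if summary.get("AccountMFAEnabled", 0) != 1:
        findings.append("HIGH: root account has no MFA")
    return findings

# Example: an account set up in a hurry and never hardened
print(root_account_findings({"AccountAccessKeysPresent": 1, "AccountMFAEnabled": 0}))
```

In practice the dict would come from `boto3.client("iam").get_account_summary()["SummaryMap"]`; the function itself needs no AWS access, which makes it easy to unit-test.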

Overly permissive IAM roles

The most common finding is IAM roles with * actions on * resources — effectively full admin rights. This usually starts as "I'll fix the permissions later" during initial development and "later" never arrives. Lambda functions with AdministratorAccess, EC2 instance profiles that can write to any S3 bucket, CodeBuild jobs that can modify IAM. These are not theoretical risks: they're the blast radius problem. When a service gets compromised, how much of your account can the attacker access?
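
The `*`-on-`*` pattern is mechanical to detect in a policy document. A sketch that scans the standard IAM policy JSON shape, where Action and Resource may each be a string or a list:

```python
def wildcard_statements(policy: dict) -> list[dict]:
    """Return Allow statements granting * actions on * resources."""
    flagged = []
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        # IAM policy JSON allows either a single string or a list here
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        if "*" in actions and "*" in resources:
            flagged.append(stmt)
    return flagged

# The classic "fix it later" policy: full admin rights
admin_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}],
}
print(wildcard_statements(admin_policy))  # the statement is flagged
```

This only catches the literal `*:*` case; partial wildcards like `s3:*` on `*` deserve scrutiny too, but that triage is more judgment than mechanics.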

Long-lived access keys with no rotation

Access keys more than 90 days old without rotation show up in every audit. The AWS IAM credential report makes this trivially easy to check; teams either don't know it exists or run it once and don't act on it. The fix is equally straightforward: AWS Secrets Manager can automate rotation for most cases, and for programmatic access from AWS services, instance profiles and IAM roles are the correct mechanism, not access keys at all.
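
The credential report is a CSV, so the stale-key check is a few lines. A sketch using the report's real column names (`access_key_1_active`, `access_key_1_last_rotated`); only key 1 is checked for brevity, and key 2 works the same way:

```python
import csv
import io
from datetime import datetime, timedelta, timezone

def stale_keys(report_csv: str, max_age_days: int = 90) -> list[str]:
    """List users whose active access key 1 is older than max_age_days."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    stale = []
    for row in csv.DictReader(io.StringIO(report_csv)):
        if row["access_key_1_active"] != "true":
            continue
        rotated = row["access_key_1_last_rotated"]
        if rotated in ("N/A", "no_information"):
            continue
        # Timestamps are ISO 8601; normalise a possible Z suffix
        if datetime.fromisoformat(rotated.replace("Z", "+00:00")) < cutoff:
            stale.append(row["user"])
    return stale

# Illustrative report body (real reports carry many more columns)
report = (
    "user,access_key_1_active,access_key_1_last_rotated\n"
    "deploy-bot,true,2021-01-15T10:00:00+00:00\n"
    "alice,true,2099-01-01T00:00:00+00:00\n"
)
print(stale_keys(report))  # → ['deploy-bot']
```

The real CSV comes from `iam generate-credential-report` followed by `iam get-credential-report`; the parsing and the cutoff logic are identical.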

Pattern 2: Logging Gaps

CloudTrail — AWS's audit log service — records 90 days of management-event history by default in new accounts, but durable logging requires an explicitly created trail, and trail configurations have meaningful gaps in practice. The most common finding: a trail enabled in one or two regions but not all, which means API calls in the unmonitored regions generate no lasting audit trail. Attackers know this. Resources created in us-east-1 while CloudTrail is only watching eu-west-1 are invisible to your logging.

S3 access logs are off by default and frequently missed. If you're storing sensitive data in S3, you need to know who is reading it, not just who is writing to it. CloudTrail doesn't log S3 object-level operations by default — you need to explicitly enable Data Events.

Log integrity validation — the CloudTrail feature that uses digital signatures to detect if logs have been tampered with — is disabled in almost every account I review. This matters in an incident response context: if an attacker has had access to your CloudTrail S3 bucket, you need to be able to verify whether the logs you're reviewing are complete and unmodified.
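
Both of these gaps are visible in the trail configuration itself. A sketch over trail entries shaped like CloudTrail's DescribeTrails response (fields `Name`, `IsMultiRegionTrail`, `LogFileValidationEnabled`):

```python
def trail_findings(trails: list[dict]) -> list[str]:
    """Flag CloudTrail gaps from DescribeTrails-shaped trail entries."""
    findings = []
    # At least one trail should cover every region
    if not any(t.get("IsMultiRegionTrail") for t in trails):
        findings.append("HIGH: no multi-region trail; other regions are unlogged")
    for t in trails:
        if not t.get("LogFileValidationEnabled"):
            findings.append(f"MED: trail {t['Name']} has log file validation disabled")
    return findings

# A typical finding: one regional trail, no integrity validation
trails = [{"Name": "main", "IsMultiRegionTrail": False,
           "LogFileValidationEnabled": False}]
for finding in trail_findings(trails):
    print(finding)
```

With boto3 the input would be `boto3.client("cloudtrail").describe_trails()["trailList"]`; separating the check from the API call keeps it testable offline.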

Finally: alerting on log anomalies. Having logs and not having alerts is a passive posture. CloudWatch metric filters or Amazon GuardDuty can alert on spikes in access-denied errors, unusual API call patterns, or console logins from unexpected geographies. Most accounts have neither configured.
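
The signal a metric filter would catch is easy to see offline. A sketch that counts access-denied API calls per identity from CloudTrail event records (the `errorCode` and `userIdentity.arn` fields are real CloudTrail record fields; the sample events are illustrative):

```python
from collections import Counter

DENIED = ("AccessDenied", "UnauthorizedOperation")

def denied_calls_by_user(records: list[dict]) -> Counter:
    """Count access-denied API calls per IAM identity."""
    counts = Counter()
    for r in records:
        # errorCode is absent on successful calls
        if any(code in r.get("errorCode", "") for code in DENIED):
            arn = r.get("userIdentity", {}).get("arn", "unknown")
            counts[arn] += 1
    return counts

sample = [
    {"errorCode": "AccessDenied",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/ci"}},
    {"errorCode": "Client.UnauthorizedOperation",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/ci"}},
    {"eventName": "GetObject",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/app"}},
]
print(denied_calls_by_user(sample))  # the ci user has 2 denied calls
```

In production you would let a CloudWatch metric filter on `errorCode` do this continuously and alarm on the count, rather than batch-processing logs yourself.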

Pattern 3: Multi-AZ Deferred for Cost

The scenario that plays out repeatedly: the system was originally deployed in a single Availability Zone during development, because that was simpler. Production launches with the same architecture. Multi-AZ is on the roadmap. It never happens because it would require an RDS migration window and there's always something more urgent.

A specific finding I've seen twice: RDS Multi-AZ turned off "temporarily" six months earlier to reduce costs during a period of budget pressure. The "temporarily" becomes permanent by default.

The related finding: no tested failover runbook. Even when Multi-AZ is correctly configured, teams often haven't run a failover drill. They believe the system is resilient but have no empirical evidence of the RTO in a real failure. AWS Fault Injection Simulator can help here; most teams haven't used it.

Pattern 4: Cost Waste From Orphaned Resources

Cost Optimization findings are usually medium risk rather than high, but the cumulative waste is often significant. The patterns:

  • Orphaned EBS volumes: EC2 instances are terminated; EBS volumes are not. The default behaviour when you terminate an instance is to delete the root volume, but additional volumes are retained. A one-year-old account with active development often has 15–20 detached EBS volumes from deprecated test environments.
  • Unattached Elastic IPs: AWS charges for Elastic IPs that are allocated but not associated with a running instance. Small individually, but they accumulate.
  • Oversized baseline instances: The "let's start big" mentality at launch produces t3.large instances running at 4% CPU utilisation. AWS Compute Optimizer flags these clearly; teams often haven't looked at the recommendations.
  • No Reserved Instance coverage: Predictable baseline workloads — production databases, always-on application servers — are almost always cheaper with 1-year Reserved Instances or Savings Plans. On-Demand pricing for workloads that have been running continuously for 18 months is expensive and unnecessary.

Pattern 5: Secrets in the Wrong Place

This finding appears in almost every audit and is consistently the most uncomfortable one to present to teams, because it reveals that practices everyone knows are wrong are still in active use.

Database passwords in Lambda environment variables. Database connection strings in EC2 user data scripts that end up in CloudFormation templates committed to version control. .env files uploaded to S3 "for convenience" with public read access (I have seen this twice, and both times the bucket had already been indexed by third-party scanner services). API keys for third-party services stored in plaintext in Systems Manager Parameter Store without the SecureString type.

AWS Secrets Manager exists, it's cheap for most use cases ($0.40 per secret per month, plus a small per-call charge), it supports automatic rotation for RDS databases, and it integrates directly with Lambda and ECS. The friction to adopting it is low. But teams in development mode adopt a working pattern that doesn't use it, and the pattern persists into production.
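
The usual Lambda pattern is to fetch the secret once per container lifetime rather than reading an environment variable. A minimal sketch; the client is injected so the caching behaviour is visible without AWS access, and `StubClient` is a stand-in for the real thing:

```python
import json

_cache: dict[str, dict] = {}

def get_secret(client, secret_id: str) -> dict:
    """Fetch and cache a JSON secret for the lifetime of the process.

    `client` is anything exposing Secrets Manager's get_secret_value
    (in Lambda: boto3.client("secretsmanager")); caching avoids one
    API call per invocation on a warm container.
    """
    if secret_id not in _cache:
        resp = client.get_secret_value(SecretId=secret_id)
        _cache[secret_id] = json.loads(resp["SecretString"])
    return _cache[secret_id]

class StubClient:
    """Stands in for boto3's Secrets Manager client in this sketch."""
    def __init__(self):
        self.calls = 0
    def get_secret_value(self, SecretId):
        self.calls += 1
        return {"SecretString": json.dumps({"username": "app",
                                            "password": "s3cret"})}

stub = StubClient()
get_secret(stub, "prod/db")
get_secret(stub, "prod/db")
print(stub.calls)  # 1: the second read hit the cache
```

Rotation adds one wrinkle: a cached secret can go stale mid-rotation, so production code typically retries the fetch once on an authentication failure.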

When I encounter one of these, I mark it as a High finding, document the specific resource, and make it the first item on the remediation list.

Top 10 Most Common Findings

Finding                                       | Pillar      | Risk | Typical Effort
----------------------------------------------|-------------|------|---------------
Overly permissive IAM roles (*:*)             | Security    | HIGH | Days
Root account access keys not deleted          | Security    | HIGH | Hours
Secrets in environment variables or S3        | Security    | HIGH | Days
CloudTrail not enabled in all regions         | Security    | HIGH | Hours
No log integrity validation                   | Security    | MED  | Hours
No GuardDuty or anomaly alerting              | Security    | MED  | Hours
RDS in single AZ, no failover plan            | Reliability | HIGH | Weeks
No tested disaster recovery runbook           | Reliability | MED  | Days
Orphaned EBS volumes and Elastic IPs          | Cost        | MED  | Hours
No Reserved Instance or Savings Plan coverage | Cost        | MED  | Hours

The Remediation Approach That Works

After delivering a WAF findings report, I consistently recommend the same sequencing: spend the first 90 days exclusively on Security pillar High findings. Don't start on Reliability or Cost until the Security Highs are resolved. The reasoning is partly about risk — a compromised account can undo all your Reliability improvements — but also about stakeholder visibility. Security improvements are the easiest to communicate to non-technical stakeholders ("we had passwords stored insecurely; they are now stored in Secrets Manager and rotate automatically"). Cost improvements can feel abstract until you see the bill.

A 90-day focused sprint on Security Highs typically yields: deletion of unused root access keys, IAM role permissions reduced to least privilege for the three or four highest-risk services, CloudTrail enabled across all regions, Secrets Manager adopted for the top three most sensitive secrets, and GuardDuty enabled. This is achievable by a small team alongside normal development work and makes a meaningful, measurable difference to the security posture.

The most common response when I present WAF findings is "we knew about that one." Which raises the question: why is it still in the finding list? Usually the answer is that knowing and having time to fix are different things, and nobody made it a priority until there was an external reason to do so. That's what a review provides: the external reason.

The WAF as Annual Discipline

Systems drift. A system that passes a WAF review cleanly in January will have new findings by July — not because the team got worse, but because the system changed: new services were added, experiments weren't cleaned up, costs drifted as usage patterns evolved. Running a WAF review once and treating it as a one-time certification misses the point.

I recommend annual reviews as a minimum, and semi-annual for systems in regulated industries or with complex IAM environments. The incremental cost of the review is low compared to the cost of discovering a compliance gap during an external audit or, worse, during a security incident. The Well-Architected Tool in the AWS console retains historical review data; being able to show a trend of improving scores over successive reviews is useful context when briefing a client's security team or a regulator.