DevOps / Cloud Engineering

AWS-focused cloud infrastructure, automation, and platform engineering work.

Fixing Auto-Deploy

Before and After auto-deploy architecture

Auto-deploy is a feature where if there’s a new build image, it will deploy it… automatically. Customers started occasionally reporting that auto-deploy wasn’t working on their environments. The usual fix was to restart the cron job.

That only goes so far.

I didn’t know how it worked before picking the issue up. I learned that there was a cron job called schedule_deployments. For each environment, it called the AWS API to 1) fetch the latest ECR image and 2) fetch the deployed image from ECS, then triggered a deploy if the two did not match. That meant two API calls per environment, for all environments. I started to understand why it had worked fine before, but not anymore as we grew.
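
To make the scaling problem concrete, here is a minimal sketch of the polling pattern described above. It is not the original implementation: FakeAws, its method names, and the image tags are hypothetical stand-ins that count calls instead of talking to AWS.

```python
from dataclasses import dataclass


@dataclass
class FakeAws:
    """Hypothetical stand-in for the AWS API; counts calls, makes no requests."""
    latest_images: dict    # environment name -> newest image tag in ECR
    deployed_images: dict  # environment name -> image tag running in ECS
    calls: int = 0

    def latest_ecr_image(self, env):
        self.calls += 1
        return self.latest_images[env]

    def deployed_ecs_image(self, env):
        self.calls += 1
        return self.deployed_images[env]


def schedule_deployments(envs, aws):
    """Polling approach: every environment is checked on every cron run."""
    to_deploy = []
    for env in envs:
        latest = aws.latest_ecr_image(env)      # API call 1
        deployed = aws.deployed_ecs_image(env)  # API call 2
        if latest != deployed:
            to_deploy.append((env, latest))
    return to_deploy


envs = ["env-a", "env-b", "env-c"]
aws = FakeAws(
    latest_images={"env-a": "v2", "env-b": "v5", "env-c": "v1"},
    deployed_images={"env-a": "v1", "env-b": "v5", "env-c": "v1"},
)
pending = schedule_deployments(envs, aws)
print(pending, aws.calls)
```

Only one of the three environments has a new build, yet the run still makes six calls. The call count grows with the number of environments, not with the number of builds.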

My initial solution was to optimize the current approach

CodeBuild already posts an app package manifest to our dashboard at the end of every build (from another feature). I thought I could piggyback on that event and store a new_build flag on the environment when it triggers. I would then add a filter in schedule_deployments() to only process environments with new builds.

It was working. Fewer API calls, less chance of failing. PR ready for review.

The question that came up in the review was: if we already know when there is a new build, why are we still polling?

It made a lot of sense, and made me wonder why I restricted myself to optimizing only.

The final fix

New PR created, now with the fan-out approach. I removed schedule_deployments entirely. In its place, a new_build_created hook that triggers when the app package manifest comes in. From this hook, we check if auto-deploy is enabled, and if it is we schedule a deploy.

As a side effect, this also fixed a bug in our auto-deploy logic: if a customer rolled back to an older build image, the next cron run would revert it to the latest image (oops).
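
A rough sketch of the event-driven shape, under stated assumptions: only the hook name new_build_created comes from the description above. Environment, DeployQueue, rollback, and their fields are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    name: str
    auto_deploy: bool
    deployed_image: str


@dataclass
class DeployQueue:
    """Hypothetical scheduler; records deploys instead of running them."""
    scheduled: list = field(default_factory=list)

    def schedule(self, env, image):
        self.scheduled.append((env.name, image))
        env.deployed_image = image


def new_build_created(env, image, queue):
    """Runs once per build, when the app package manifest arrives."""
    if not env.auto_deploy:
        return False
    queue.schedule(env, image)
    return True


def rollback(env, image):
    """Customer-initiated rollback; no build event is involved."""
    env.deployed_image = image


queue = DeployQueue()
prod = Environment("prod", auto_deploy=True, deployed_image="v1")
staging = Environment("staging", auto_deploy=False, deployed_image="v1")

new_build_created(prod, "v2", queue)     # auto-deploy on: v2 is scheduled
new_build_created(staging, "v2", queue)  # auto-deploy off: skipped
rollback(prod, "v1")                     # nothing re-deploys v2 afterwards
```

Because a deploy is only scheduled in response to a build event, a rollback stays in place until the next build arrives. With polling, the next cron run would have seen a mismatch and deployed the latest image again.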

Obvious in hindsight, of course. What’s important is there have been no auto-deploy complaints since this shipped to production.

Another fun, satisfying thing I was able to work on.

Passed AWS Solutions Architect Professional & DevOps Engineer Professional

AWS Solutions Architect Professional and DevOps Engineer Professional badges

I passed AWS Solutions Architect Associate last December. At the end of that post I mentioned I’d pursue the Professional tier next. Passed Solutions Architect Professional on February 15. Then took DevOps Engineer Professional on March 19.

One day of prep for each (the day before). I mostly banked on production experience. Most of what came up in the exam, I’ve encountered at work in some form.

SAP felt like a more comprehensive version of SAA. I finished in two hours out of three.

DOP was a lot harder than I expected. It was less about architecture and more about the specific behavior of AWS services. A lot of EKS questions came up and we don’t use Kubernetes at work. I guessed on those. I flagged 30 questions total. I went back and re-read each one carefully. I used the whole three hours, super exhausted. I was honestly bracing myself for retaking it. Luckily, I passed.

I joined my current role without even knowing what Terraform was. I only knew web development and LAMP stacks. I always had this doubt at the back of my mind whether I’m actually a cloud engineer or just got lucky landing the role. SAA felt too easy and didn’t really move the needle on that.

DOP being tough and still passing did. It’s not something someone could LLM their way around. Almost everything I relied on was something I actually learned on the job. That’s what made it feel different. I feel much more genuine with my title now. More confident with my voice. These exams showed me I can reason out architectural design decisions.

I don’t plan to take more certifications anytime soon. But if I ever did, it’d be CKA or CKAD. The number of EKS questions on both exams made me realize how much I’m missing there. ECS is what we use at work but it’s AWS-only. Kubernetes is cloud-agnostic, can run anywhere including self-hosted, and it’s almost synonymous with DevOps at this point.

I want to widen the net.

Passed AWS Solutions Architect Associate

AWS Certified Solutions Architect - Associate certificate

So, I finally took and passed AWS SAA-C03. 4 years of procrastination, 2 exam reschedules, and 3 days of actually soaking my head in SAA topics.

As with most things in my life, I spent more time stressing about it, and when I actually pushed through and worked on it, it turned out to be easier than expected.

I spent 40 mins out of the 130 mins allowed. There were about 10 questions I had absolutely no idea about. The rest were fairly straightforward. The questions I did not know weren’t architecture questions; they were more about familiarity with what a service actually does. Anything CloudFront, S3, EC2, ECS, Lambda, VPC I was able to answer confidently. These are services we actually run in production. Having real hands-on experience helped a lot.

I think getting a certification served its purpose: filling in gaps in my AWS knowledge on things we don’t use but that could actually be useful. Fun to learn about different storage solutions and appropriate uses. Fun to learn about AWS Org.

The exam felt unusually easy. It didn’t remove my impostor syndrome about being a qualified cloud engineer. Instead, it clarified that the gaps are less about fundamentals and more about reasoning across larger systems. I decided to pursue the tougher exam, Solutions Architect – Professional (scheduled Q1 next year).

Let’s see where this goes.

Shipping Log Shipping: From Spike to Customer Environment

Log Shipping feature in the customer dashboard

Log Shipping lets customers automatically ship a copy of their access logs to an S3 bucket they own. Small feature, but it touched every layer of the stack. I had a chance to work on every part of it.

Spike: picking the right approach

Before writing any code, we do a spike to research and decide the best approach. For Log Shipping, two options came up:

  • Option 1: A Lambda function that polls and copies S3 objects on a schedule every minute.
  • Option 2: S3 Event Notifications that trigger a Lambda when new objects are added.

Option 2 was the right call. No polling means fewer API calls, better scaling, and a cleaner architecture. The cost difference was negligible.
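
As an illustration of Option 2, the sketch below shows roughly what a Lambda triggered by S3 Event Notifications has to do. It is not the shipped function: copy_new_logs, RecordingS3, and the bucket names are hypothetical, and encryption settings are left out. The S3 client is passed in so the logic can be exercised without AWS; a boto3 S3 client accepts the same copy_object arguments.

```python
from urllib.parse import unquote_plus


def copy_new_logs(event, s3_client, destination_bucket):
    """Copy every object named in an S3 event notification to another bucket."""
    copied = []
    for record in event.get("Records", []):
        source_bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        s3_client.copy_object(
            Bucket=destination_bucket,
            Key=key,
            CopySource={"Bucket": source_bucket, "Key": key},
        )
        copied.append(key)
    return copied


class RecordingS3:
    """Hypothetical test double that records calls instead of copying."""
    def __init__(self):
        self.calls = []

    def copy_object(self, **kwargs):
        self.calls.append(kwargs)


event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "platform-access-logs"},
                "object": {"key": "logs/2024/access+log%3A01.gz"},
            }
        }
    ]
}
s3 = RecordingS3()
copied = copy_new_logs(event, s3, "customer-owned-bucket")
print(copied)
```

The function only runs when an object is created, so there is no schedule to tune and no listing of the source bucket.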

Infrastructure

The bulk of the work. Log Shipping is primarily an infrastructure feature: S3 event notifications, Lambda, IAM permissions, and the wiring between them. It was implemented as a Terraform module so it can be enabled or disabled per customer environment without touching the core stack.

Backend API

Once the infrastructure was in place, the feature needed to be surfaced through the API. The backend integrates with AWS and reads from Terraform state to get the configuration details of an environment: whether log shipping is enabled, the destination bucket, and the encryption key in use.

Frontend (React)

The API data gets displayed in the customer-facing dashboard. This is what the customer actually sees and interacts with. My background in web development came in useful here.

Documentation

The feature shipped with updated customer-facing documentation describing what Log Shipping does and how to enable it. Documentation is part of shipping, not an afterthought.

Rollout

After merging, the rollout followed a set sequence: infrastructure release, backend deploy, frontend deploy, then a change request to enable the feature in a specific customer environment. The last step involved coordinating with the customer directly to confirm their destination bucket and encryption preferences.

Why ECS Task Definitions Kept Changing on Every Apply

Every Terraform apply was creating new ECS task definition revisions even when nothing had actually changed. Our test environment had accumulated 17,000 task definitions. ECS task definitions are tracked by AWS Config, so this was quietly contributing to increased costs.

In AWS ECS, a task definition is a blueprint that describes how a container should run — what image to use, environment variables, resource limits, health checks, and so on. Every time Terraform sees a difference between what it expects and what AWS has, it creates a new revision. When that happens on every apply with no real change, something is off.

Finding the root cause

The way to debug this is to compare what Terraform has in state against what the AWS API actually returns. Pull the task definition from Terraform state and diff it against the output of aws ecs describe-task-definition. The differences tell you exactly what Terraform thinks it needs to change.

Two causes came up.

Environment variable ordering. AWS reorders environment variables alphabetically when storing a task definition. Terraform was building the environment variable list using merge(), a built-in function that combines maps but does not guarantee key order. On each plan, the order could differ from what AWS had stored, so Terraform saw a change that wasn’t really there.

Missing properties in the container templates. Some optional fields were absent from the JSON templates we used to define containers. AWS fills these in with default values and includes them in the API response. When Terraform compared its template against the API response, it saw those extra fields as additions it needed to remove — which triggered another replacement.

Fields like healthCheck.interval, healthCheck.retries, healthCheck.timeout, systemControls, mountPoints, and portMappings all fell into this category.
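
The state-versus-API comparison can be sketched as a small diff over two container definitions. This is an illustration, not the tooling actually used: flatten, diff, and the sample values are assumptions, and in practice the two inputs would come from Terraform state and from aws ecs describe-task-definition.

```python
MISSING = "<missing>"


def flatten(value, prefix=""):
    """Flatten nested dicts and lists into {"path.to.field": leaf} pairs."""
    if isinstance(value, dict) and value:
        items = list(value.items())
    elif isinstance(value, list) and value:
        items = list(enumerate(value))
    else:
        # Leaves, plus empty containers such as "mountPoints": []
        return {prefix: value}
    flat = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(child, path))
    return flat


def diff(state, api):
    """Return {path: (state_value, api_value)} for every path that differs."""
    a, b = flatten(state), flatten(api)
    return {
        path: (a.get(path, MISSING), b.get(path, MISSING))
        for path in sorted(a.keys() | b.keys())
        if a.get(path, MISSING) != b.get(path, MISSING)
    }


# Illustrative values only.
from_state = {
    "name": "php",
    "environment": [{"name": "B", "value": "2"}, {"name": "A", "value": "1"}],
    "healthCheck": {"command": ["CMD-SHELL", "true"]},
}
from_api = {
    "name": "php",
    "environment": [{"name": "A", "value": "1"}, {"name": "B", "value": "2"}],
    "healthCheck": {
        "command": ["CMD-SHELL", "true"],
        "interval": 30,
        "retries": 3,
        "timeout": 5,
    },
    "mountPoints": [],
}

changes = diff(from_state, from_api)
for path, (ours, theirs) in changes.items():
    print(f"{path}: state={ours!r} api={theirs!r}")
```

Both causes show up in the output: the environment entries differ only by position, and the health check and mountPoints fields exist on the API side only.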

The fix

For environment variable ordering, the fix is to sort the keys explicitly before building the list so Terraform and AWS always agree on the order.

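locals {
  # Merge base and extra variables; later arguments win on duplicate keys.
  merged_env_vars = merge(local.php_environment_json, var.extra_ecs_env_vars)

  # Build the list in sorted key order so it matches what AWS stores.
  php_environment_ecs = [
    for k in sort(keys(local.merged_env_vars)) : {
      name  = k
      value = local.merged_env_vars[k]
    }
  ]
}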

For the missing properties, the fix is to add them explicitly to the container JSON templates to match what AWS returns. Once both sides agree, Terraform stops seeing phantom changes.

After both fixes, running the plan twice in a row showed no changes on the second run.

Working Around a Long Standing Terraform AWS Provider Bug

Our CloudFront distribution was showing up in every Terraform plan even when nothing changed. The culprit was origin_shield.

When it’s disabled, AWS doesn’t include it in the refreshed state. Terraform sees it missing and re-adds it with enabled = false every run. The fix was to make the block dynamic so it only appears when actually enabled.

dynamic "origin_shield" {
  for_each = var.enable_cloudfront_origin_shield ? [1] : []
  content {
    enabled              = var.enable_cloudfront_origin_shield
    origin_shield_region = data.aws_region.current.name
  }
}

Drift was gone. But I noticed something else: disabling origin shield had stopped working.

When enable_cloudfront_origin_shield flips to false, the dynamic block stops emitting. Terraform plans to remove origin_shield. The plan looks right. Apply runs. Origin shield is still on in the AWS Console. The provider does nothing. Filed in April 2022, still open.

Static block causes drift. Dynamic block breaks disabling. Neither works on its own.

I can’t fix this upstream. So I worked around it.

I added a null_resource that only exists when origin shield is disabled. When it runs, it calls the AWS API directly to force OriginShield.Enabled = false.

resource "null_resource" "cloudfront_origin_shield_disable" {
  count = var.create_cloudfront == "yes" && !var.enable_cloudfront_origin_shield ? 1 : 0

  triggers = {
    enable_origin_shield = var.enable_cloudfront_origin_shield
    distribution_id      = aws_cloudfront_distribution.default[0].id
    script_hash          = filemd5("${path.module}/bin/disable-cloudfront-origin-shield.sh")
  }

  provisioner "local-exec" {
    command = "${path.module}/bin/disable-cloudfront-origin-shield.sh ${aws_cloudfront_distribution.default[0].id} wordpress"
  }

  depends_on = [aws_cloudfront_distribution.default]
}

The script fetches the current distribution config, checks if origin shield is already disabled, exits early if so. Then patches the config with jq and pushes it back via the AWS CLI.

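# DISTRIBUTION_ID and ORIGIN_ID are set earlier in the script (not shown).

# Fetch the current config; the ETag is required for the conditional update.
CLOUDFRONT_CONFIG=$(aws cloudfront get-distribution-config --id "$DISTRIBUTION_ID" --output json)
ETAG=$(echo "$CLOUDFRONT_CONFIG" | jq -r '.ETag')
DISTRIBUTION_CONFIG=$(echo "$CLOUDFRONT_CONFIG" | jq '.DistributionConfig')

# Disable origin shield on the matching origin only and push the result back.
# The file:// prefix makes the AWS CLI read the JSON document from stdin.
jq --arg origin_id "$ORIGIN_ID" '
  .Origins.Items = [
    .Origins.Items[] |
    if .Id == $origin_id then
      .OriginShield.Enabled = false
    else
      .
    end
  ]
' <<< "$DISTRIBUTION_CONFIG" | aws cloudfront update-distribution \
    --id "$DISTRIBUTION_ID" \
    --distribution-config file:///dev/stdin \
    --if-match "$ETAG"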
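
The early-exit check is not part of the excerpt above. One way to express the same "patch only if needed" logic is sketched below, on a plain dictionary shaped like the DistributionConfig returned by get-distribution-config. The function name, the second origin, and the region value are hypothetical.

```python
import copy


def disable_origin_shield(distribution_config, origin_id):
    """Return (changed, config) with Origin Shield turned off for one origin."""
    patched = copy.deepcopy(distribution_config)  # leave the input untouched
    changed = False
    for origin in patched["Origins"]["Items"]:
        shield = origin.get("OriginShield", {})
        if origin["Id"] == origin_id and shield.get("Enabled"):
            origin["OriginShield"]["Enabled"] = False
            changed = True
    return changed, patched


config = {
    "Origins": {
        "Quantity": 2,
        "Items": [
            {"Id": "wordpress",
             "OriginShield": {"Enabled": True, "OriginShieldRegion": "us-east-1"}},
            {"Id": "uploads",
             "OriginShield": {"Enabled": True, "OriginShieldRegion": "us-east-1"}},
        ],
    }
}

changed, patched = disable_origin_shield(config, "wordpress")
changed_again, _ = disable_origin_shield(patched, "wordpress")
print(changed, changed_again)
```

When changed is False there is nothing to push, which is what makes the workaround safe to run more than once.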

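The null_resource only runs once, on the transition to disabled. Re-enable origin shield and count drops to 0, which destroys the resource. Disable it again and the resource is recreated, so the provisioner runs again. A change to the distribution ID or the script hash in triggers also forces a re-run.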

It’s a workaround. The bug has been open for years. Infrastructure needs to work today.

Reducing Terraform Drift in Production

Terraform plan showing no changes

Our Terraform runs were showing changes on every apply, even when nothing had actually changed. When the plan is always noisy, it’s easy to miss a change that actually matters.

I picked this up on my own during spare time. Originally thinking one drift fix per sprint. Seeing the reduced change set gave me a dopamine hit and I just kept going until there were none left.

Five sources of drift across CloudFront, ECS, Secrets Manager, and Terraform provider bugs. All cleaned up.

The fixes

Make origin_shield dynamic

  • When origin_shield is disabled on a CloudFront distribution, AWS removes it from the refreshed state entirely, causing drift on every plan
  • Made the block dynamic so it is only included when enabled
  • There was a secondary provider bug when disabling via a dynamic block that needed its own workaround

Fix null_resources

  • timestamp() in triggers caused Terraform to show changes every run since the value always differs
  • Moved script execution to external data sources to remove drift while still running scripts every apply
  • Split dependency installation into a separate null_resource so it only runs once

Fix task definitions

  • Every apply created new ECS task definition revisions even with no real changes. Task definitions are tracked by AWS Config so this had a cost implication
  • Identified mismatches in environment variable ordering and missing optional properties by comparing Terraform state with the AWS API response
  • Standardized ordering and completed missing fields in the template
  • Full write-up

Webhook secrets

  • An AWS provider issue caused webhook secrets to always show as changed
  • Added ignore_changes for the secret value

Resources not managed by Terraform

  • Some resources existed in AWS but were intentionally not managed by Terraform, which resulted in drift
  • Added the relevant attributes to ignore_changes

This has been shipped and running across all our environments for a couple of months, and nothing broke.

This was also a good stress test of how well I understand our infrastructure.

It was a fun exercise.

New Role: Cloud Engineer at Human Made

I recently joined Human Made as a Cloud Engineer. The application process spanned a few weeks and I was super anxious about the whole thing. I was actually bracing myself for rejection. But when the offer was finally made, I was jacked to the tits!

This is a big win and exciting for me for a multitude of reasons:

Culture

Start with trust, and be trustworthy

The company’s values align with mine and make a lot of sense. I thrive in a trusting environment. I naturally give my best to people who support me. The company has a big focus on the well-being of its people (also called Humans).

HR policies reflect company values. On my first day: 1) they provided more than adequate equipment 2) health care coverage for me and all my dependents 3) pro-rated paid time-off that I’m required to use, and other things that just show how much they care.

Working with top-notch talent

I couldn’t believe it when I realized that our Director of Product (my boss) was the one who created WP-API.

I distinctly remember getting really excited when it first launched because I was working on a WP project that was AJAX/jQuery-heavy. The code base started to feel hacky the more features were implemented. When I tried WP-API (even though it was still in beta) together with AngularJS, development started to become enjoyable and easier to maintain. It made my life and that project’s code so much nicer!

I feel small around other Humans because of the sheer amount of experience everybody has, which is a good thing. Feeling small only expands my opportunity for growth. Observing how other people do things makes me realize gaps in my knowledge that I need to fill in. I also have access to them and can learn from them. This position will only accelerate my growth.

New career path

I’m coming from full-stack web development. I’ve been doing this for more than a decade. However, I’ve always been drawn to servers, networking, and automation, ever since I was a kid and tried sharing our dial-up connection between two computers using a crossover LAN cable.

Applying required me to do a trial. The trial involved troubleshooting a broken stack. They provided all the documentation and access to the AWS console. I was told that I didn’t have to fix it but needed to write down my troubleshooting steps. I had so much fun doing the trial. It took me 4 hours of intense focus and frustration, but the satisfaction of finally making it work was just priceless.

I still have a lot of gaps in DevOps skills. I plan to spend the next few years focusing on and honing those to give more value back to the company. And also have fun learning along the way.

Validation of my current skillset

“There’s a deep satisfaction when you know how valuable you are, and the world agrees.” - Derek Sivers

I try not to seek validation. But it just feels good to know that my skills are on par with first-world talent.

#

Despite the emphasis on my skills, I still largely credit luck (being blessed) for how I landed at Human Made. My skills only maximized my chances, but overall I just got incredibly lucky (blessed).