When Your GitLab Pipeline Fails: Debugging OIDC Authentication and Rethinking Remote State


    The Setup

    I was riding high after deploying AWS Config aggregator infrastructure across my organization. The Terraform code was clean, the OIDC federation was configured, and my GitLab pipeline had worked perfectly… two days ago. But today, pushing the latest Config recorder changes triggered this frustrating error:

    $ export AWS_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/devops-operator"
    $ aws sts assume-role-with-web-identity --role-arn ${AWS_ROLE_ARN} ...
    
    An error occurred (ValidationError) when calling the AssumeRoleWithWebIdentity operation:
    Request ARN is invalid

    The pipeline had been working. The OIDC provider was configured. The IAM role existed. What changed?

    The Investigation

    Step 1: Verify the Infrastructure

    First, I checked if the AWS resources still existed:

    # Check OIDC provider
    $ aws iam list-open-id-connect-providers
    {
        "OpenIDConnectProviderList": [
            {
                "Arn": "arn:aws:iam::266735821834:oidc-provider/gitlab.com"
            }
        ]
    }
    
    # Check IAM role
    $ aws iam get-role --role-name devops-operator
    {
        "Role": {
            "RoleName": "devops-operator",
            "Arn": "arn:aws:iam::266735821834:role/devops-operator",
            "AssumeRolePolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [{
                    "Effect": "Allow",
                    "Principal": {
                        "Federated": "arn:aws:iam::266735821834:oidc-provider/gitlab.com"
                    },
                    "Action": "sts:AssumeRoleWithWebIdentity",
                    "Condition": {
                        "StringEquals": {
                            "gitlab.com:aud": "https://gitlab.com"
                        },
                        "StringLike": {
                            "gitlab.com:sub": "project_path:squinky/aws-sec:*"
                        }
                    }
                }]
            }
        }
    }

    Everything looked perfect. The OIDC provider existed, the role existed, and the trust policy was correct. So why was the ARN invalid?

    Step 2: Examine the Pipeline Logs

    Looking closer at the error, I noticed something subtle:

    $ export AWS_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/devops-operator"
    # This expands to: arn:aws:iam::/role/devops-operator
    #                                  ↑ Missing account ID!

    The ${AWS_ACCOUNT_ID} variable was empty, so the interpolated ARN had no account ID—and AWS rejected it as invalid.
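    Shell parameter expansion can catch this class of failure before the request ever reaches AWS. A minimal sketch (the inline account ID is for illustration only; in the real pipeline it comes from CI/CD variables):

    ```shell
    #!/bin/sh
    # ':?' aborts with an error message when the variable is unset or empty,
    # so the job fails immediately instead of sending a malformed ARN to AWS.
    set -eu
    AWS_ACCOUNT_ID="266735821834"   # illustration only; supplied by GitLab in CI
    AWS_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID:?AWS_ACCOUNT_ID is not set}:role/devops-operator"
    echo "${AWS_ROLE_ARN}"
    ```

    Had the export line used `${AWS_ACCOUNT_ID:?}` instead of `${AWS_ACCOUNT_ID}`, the pipeline would have failed with a clear message rather than a cryptic ARN validation error.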

    Step 3: The Root Cause

    I had never configured the AWS_ACCOUNT_ID environment variable in GitLab CI/CD settings.

    But wait—the pipeline worked before! How?

    Digging through git history revealed the answer: when I first tested the pipeline, I was on a branch that hardcoded the account ID for testing. When I moved to the main branch with the parameterized version, I never added the variable to GitLab.

    The lesson: Environment-specific configuration belongs in CI/CD variables, not in code. But you have to actually set them up.

    The Fix

    The solution was embarrassingly simple:

    1. Navigate to GitLab Project → Settings → CI/CD → Variables
    2. Click “Add variable”
    3. Set:
      • Key: AWS_ACCOUNT_ID
      • Value: 266735821834
      • Type: Variable
      • Protect variable: ☐ Unchecked (to allow use on all branches)
      • Mask variable: ☐ Unchecked (account IDs aren’t sensitive)
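    For reference, here is a sketch of the job side that consumes the variable, using GitLab's native `id_tokens` mechanism for OIDC (the job name and session name are assumptions, not the exact pipeline config):

    ```yaml
    # Sketch of the consuming job; names are illustrative.
    terraform-plan:
      id_tokens:
        GITLAB_OIDC_TOKEN:
          aud: https://gitlab.com   # must match the gitlab.com:aud condition in the trust policy
      script:
        - export AWS_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/devops-operator"
        - >
          aws sts assume-role-with-web-identity
          --role-arn "${AWS_ROLE_ARN}"
          --role-session-name "gitlab-${CI_PIPELINE_ID}"
          --web-identity-token "${GITLAB_OIDC_TOKEN}"
    ```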

    After adding the variable and re-running the pipeline:

    $ aws sts get-caller-identity
    {
        "UserId": "AROAT4...:gitlab-...",
        "Account": "266735821834",
        "Arn": "arn:aws:sts::266735821834:assumed-role/devops-operator/gitlab-..."
    }
    SUCCESS! GitLab can authenticate to AWS!

    Pipeline status: ✅ Passing

    The Bigger Picture: Rethinking State Management

    While debugging the pipeline, I confronted a larger problem that had been slowing me down: remote state management overhead.

    The Problem with Remote State (For Solo Development)

    When I started this project, I followed best practices and set up a proper Terraform state backend:

    terraform {
      backend "s3" {
        bucket         = "terraform-state-266735821834-us-west-1"
        key            = "infrastructure/terraform.tfstate"
        region         = "us-west-1"
        dynamodb_table = "terraform-state-lock"
        encrypt        = true
      }
    }

    This is the “right way” to do it for team collaboration. The DynamoDB table prevents concurrent modifications, S3 versioning provides history, and remote access enables team coordination.

    But here’s what it actually meant for my workflow:

    1. Separate infrastructure to manage: The state backend itself required Terraform code, deployment, and maintenance
    2. State locking overhead: Every terraform plan acquired and released a DynamoDB lock, adding 10-15 seconds
    3. Cleanup complexity: When I wanted to remove it, I had to:
      • Download the state file (aws s3 cp s3://...)
      • Delete all object versions (S3 versioning creates delete markers)
      • Delete the DynamoDB table
      • Update providers.tf
      • Reinitialize Terraform
    4. Cost (minor but real): $1-2/month for resources I didn’t need

    The Trade-off Analysis

    I asked myself: What am I actually getting from remote state?

    Benefits I was using:

    • ✅ State file backup (but I could just commit to git or back up manually)
    • ✅ Infrastructure tracking (local state does this too)

    Benefits I wasn’t using:

    • ❌ Team collaboration (I’m the only developer)
    • ❌ Concurrent modification prevention (no team = no concurrent access)
    • ❌ Remote access (I only work from one machine)

    Costs I was paying:

    • ⏱️ 10-15 seconds added to every Terraform operation
    • 💰 $1-2/month in AWS charges
    • 🧠 Mental overhead of managing additional infrastructure
    • 🐌 Slower iteration cycles

    The Migration

    I decided to migrate to local state management. Here’s how:

    # 1. Download existing state (preserve resource tracking)
    aws s3 cp s3://terraform-state-266735821834-us-west-1/infrastructure/terraform.tfstate ./terraform.tfstate
    
    # 2. Update providers.tf to remove backend config
    # Before:
    terraform {
      required_version = ">= 1.0"
      backend "s3" { ... }
    }
    
    # After:
    terraform {
      required_version = ">= 1.0"
      # Local backend (default)
    }
    
    # 3. Reinitialize Terraform
    terraform init
    
    # 4. Verify state is preserved
    terraform plan  # Should show "No changes"
    
    # 5. Clean up remote backend resources
    # Delete all S3 object versions
    aws s3api list-object-versions --bucket terraform-state-266735821834-us-west-1 \
      --query 'Versions[].{Key:Key,VersionId:VersionId}' \
      | jq -r '.[]? | "--key \(.Key) --version-id \(.VersionId)"' \
      | xargs -r -I {} sh -c "aws s3api delete-object --bucket terraform-state-266735821834-us-west-1 {}"
    
    # Delete delete markers
    aws s3api list-object-versions --bucket terraform-state-266735821834-us-west-1 \
      --query 'DeleteMarkers[].{Key:Key,VersionId:VersionId}' \
      | jq -r '.[]? | "--key \(.Key) --version-id \(.VersionId)"' \
      | xargs -r -I {} sh -c "aws s3api delete-object --bucket terraform-state-266735821834-us-west-1 {}"
    
    # Delete bucket
    aws s3 rb s3://terraform-state-266735821834-us-west-1 --force
    
    # Delete DynamoDB table
    aws dynamodb delete-table --table-name terraform-state-lock

    The Results

    Performance improvement:

    • terraform apply: 45 seconds (down from 60+ seconds)
    • terraform plan: 30 seconds (down from 45+ seconds)

    Workflow simplification:

    • No separate state backend infrastructure to maintain
    • No DynamoDB lock acquisition delays
    • Faster iteration cycles

    Cost savings:

    • $1-2/month eliminated (minor but satisfying)

    Trade-offs accepted:

    • Manual state backup responsibility (I commit terraform.tfstate to private git repo)
    • No team collaboration features (not needed for solo development)
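    The manual backup responsibility is easy to script. A small sketch of the habit I settled into (the `backups/` directory name is an assumption):

    ```shell
    #!/bin/sh
    # Snapshot the local state file after each apply.
    # 'backups/' is an assumed path inside the (private!) repo.
    set -eu
    mkdir -p backups
    STAMP="$(date -u +%Y%m%dT%H%M%SZ)"
    cp terraform.tfstate "backups/terraform.tfstate.${STAMP}"
    echo "state backed up to backups/terraform.tfstate.${STAMP}"
    ```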

    Lessons Learned

    1. Environment Variables Are Not Optional

    When your CI/CD pipeline references ${VARIABLE_NAME}, you MUST configure it in your CI/CD settings. This seems obvious in retrospect, but it’s easy to overlook when:

    • Variables work in local development (where you have .envrc or shell exports)
    • You’re migrating from hardcoded values to parameterized config
    • You’re copying pipeline configurations from other projects

    Best practice: Create a checklist for new CI/CD pipelines:

    • All environment variables defined in CI/CD settings
    • Variables scoped correctly (protected vs. unprotected)
    • Sensitive values marked as masked
    • Test with a fresh pipeline run (not just re-runs that might cache variables)
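    The first checklist item can even be enforced in the pipeline itself. A minimal sketch (the variable names are an assumption; adjust the list to your project):

    ```shell
    #!/bin/sh
    # Fail the job early if any required CI/CD variable is missing or empty.
    # The list of names is an assumption for illustration.
    for v in AWS_ACCOUNT_ID AWS_DEFAULT_REGION; do
      if [ -z "$(eval "echo \${$v:-}")" ]; then
        echo "Missing CI/CD variable: $v" >&2
        exit 1
      fi
    done
    echo "all required variables present"
    ```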

    2. Cryptic Errors Often Have Simple Causes

    The error “Request ARN is invalid” suggested complex problems:

    • IAM permission issues?
    • OIDC trust policy misconfiguration?
    • AWS service outage?

    But the actual cause was simple: an empty environment variable.

    Debugging approach:

    1. Start with the simplest possible explanation
    2. Verify assumptions (print variable values, check they’re not empty)
    3. Compare working vs. broken states (what changed?)
    4. Only escalate to complex debugging when simple causes are ruled out

    3. “Best Practices” Depend on Context

    Remote state management with S3 and DynamoDB is a best practice for teams. The benefits (concurrent access prevention, remote access, state locking) are valuable when multiple people modify infrastructure.

    But for solo development, these benefits don’t justify the costs:

    • ⏱️ Slower iteration cycles
    • 💰 Additional infrastructure costs
    • 🧠 Mental overhead
    • 🔧 Maintenance burden

    The principle: Adopt best practices when they solve problems you actually have. Don’t cargo-cult solutions designed for different contexts.

    4. State Management Is a Spectrum

    The debate isn’t “local vs. remote.” It’s about choosing the right approach for your situation:

    Scenario                                    Recommended Approach
    Solo developer, rapid prototyping           Local state + git backup
    Solo developer, production infrastructure   Local state + automated S3 backup
    Small team (2-3 people)                     Remote state with locking
    Large team                                  Remote state + Terraform Cloud/Enterprise
    Multi-team organization                     Separate state files per component + remote backend

    I started at the “small team” level when I should have been at the “solo developer, rapid prototyping” level. Recognizing this and adapting saved time and reduced complexity.

    Practical Takeaways

    For GitLab OIDC Authentication

    If your GitLab pipeline fails with OIDC errors:

    1. Check environment variables first:

      # In your pipeline, temporarily add:
      script:
        - echo "AWS_ACCOUNT_ID=${AWS_ACCOUNT_ID}"
        - echo "Role ARN would be: arn:aws:iam::${AWS_ACCOUNT_ID}:role/devops-operator"
    2. Verify the role ARN is complete (contains account ID)

    3. Check GitLab CI/CD variable settings:

      • Variables must be added in Settings → CI/CD → Variables
      • Unprotected variables work on all branches
      • Protected variables only work on protected branches
      • Masked variables don’t appear in logs (good for secrets, bad for debugging)

    For Terraform State Management

    Ask yourself:

    1. Do I need team collaboration?

      • No → Local state is simpler
      • Yes → Remote state is necessary
    2. Do I need state history/rollback?

      • No → Local state with git commits
      • Yes → S3 versioning or Terraform Cloud
    3. Do I need to prevent concurrent modifications?

      • No → Local state (you can’t modify concurrently alone)
      • Yes → DynamoDB locking or Terraform Cloud
    4. Am I in rapid prototyping mode or production mode?

      • Prototyping → Local state for speed
      • Production → Remote state for safety

    Conclusion

    What started as a frustrating pipeline failure became a valuable lesson in systematic debugging and in questioning accepted practices.

    The immediate problem—missing environment variable—was simple to fix. But it prompted a deeper evaluation of my infrastructure choices. By migrating from remote to local state management, I eliminated unnecessary complexity and improved iteration speed.

    The broader lesson: Infrastructure decisions aren’t permanent. When your context changes (from team to solo, prototype to production, learning to production), re-evaluate your choices. The best practice for your situation might differ from the best practice you read in a blog post.

    And always, always check your environment variables first.


    Want to discuss state management strategies or CI/CD debugging? I’d love to hear about your experiences—find me on GitHub @hmbldv.