When Your GitLab Pipeline Fails: Debugging OIDC Authentication and Rethinking Remote State


    The Setup

    I was riding high after deploying AWS Config aggregator infrastructure across my organization. The Terraform code was clean, the OIDC federation was configured, and my GitLab pipeline had worked perfectly… two days ago. But today, pushing the latest Config recorder changes triggered this frustrating error:

    $ export AWS_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/devops-operator"
    $ aws sts assume-role-with-web-identity --role-arn ${AWS_ROLE_ARN} ...
    
    An error occurred (ValidationError) when calling the AssumeRoleWithWebIdentity operation:
    Request ARN is invalid

    The pipeline had been working. The OIDC provider was configured. The IAM role existed. What changed?

    The Investigation

    Step 1: Verify the Infrastructure

    First, I checked if the AWS resources still existed:

    # Check OIDC provider
    $ aws iam list-open-id-connect-providers
    {
        "OpenIDConnectProviderList": [
            {
                "Arn": "arn:aws:iam::266735821834:oidc-provider/gitlab.com"
            }
        ]
    }
    
    # Check IAM role
    $ aws iam get-role --role-name devops-operator
    {
        "Role": {
            "RoleName": "devops-operator",
            "Arn": "arn:aws:iam::266735821834:role/devops-operator",
            "AssumeRolePolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [{
                    "Effect": "Allow",
                    "Principal": {
                        "Federated": "arn:aws:iam::266735821834:oidc-provider/gitlab.com"
                    },
                    "Action": "sts:AssumeRoleWithWebIdentity",
                    "Condition": {
                        "StringEquals": {
                            "gitlab.com:aud": "https://gitlab.com"
                        },
                        "StringLike": {
                            "gitlab.com:sub": "project_path:squinky/aws-sec:*"
                        }
                    }
                }]
            }
        }
    }

    Everything looked perfect. The OIDC provider existed, the role existed, and the trust policy was correct. So why was the ARN invalid?

    Step 2: Examine the Pipeline Logs

    Looking closer at the error, I noticed something subtle:

    $ export AWS_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/devops-operator"
    # This expands to: arn:aws:iam::/role/devops-operator
    #                                  ↑ Missing account ID!

    The ${AWS_ACCOUNT_ID} variable was empty, so the interpolated ARN had no account ID—and AWS rejected it as invalid.
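    Shell parameter expansion can catch this class of failure before the request ever reaches AWS. A minimal sketch (the inline account ID is for illustration only; in the real pipeline it comes from CI/CD variables):

    ```shell
    #!/bin/sh
    # ':?' aborts with an error message when the variable is unset or empty,
    # so the job fails immediately instead of sending a malformed ARN to AWS.
    set -eu
    AWS_ACCOUNT_ID="266735821834"   # illustration only; supplied by GitLab in CI
    AWS_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID:?AWS_ACCOUNT_ID is not set}:role/devops-operator"
    echo "${AWS_ROLE_ARN}"
    ```

    Had the export line used `${AWS_ACCOUNT_ID:?}` instead of `${AWS_ACCOUNT_ID}`, the pipeline would have failed with a clear message rather than a cryptic ARN validation error.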

    Step 3: The Root Cause

    I had never configured the AWS_ACCOUNT_ID environment variable in GitLab CI/CD settings.

    But wait—the pipeline worked before! How?

    Digging through git history revealed the answer: when I first tested the pipeline, I was on a branch that hardcoded the account ID for testing. When I moved to the main branch with the parameterized version, I never added the variable to GitLab.

    The lesson: Environment-specific configuration belongs in CI/CD variables, not in code. But you have to actually set them up.

    The Fix

    The solution was embarrassingly simple:

    1. Navigate to GitLab Project → Settings → CI/CD → Variables
    2. Click “Add variable”
    3. Set:
      • Key: AWS_ACCOUNT_ID
      • Value: 266735821834
      • Type: Variable
      • Protect variable: ☐ Unchecked (to allow use on all branches)
      • Mask variable: ☐ Unchecked (account IDs aren’t sensitive)
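    For reference, here is a sketch of the job side that consumes the variable, using GitLab's native `id_tokens` mechanism for OIDC (the job name and session name are assumptions, not the exact pipeline config):

    ```yaml
    # Sketch of the consuming job; names are illustrative.
    terraform-plan:
      id_tokens:
        GITLAB_OIDC_TOKEN:
          aud: https://gitlab.com   # must match the gitlab.com:aud condition in the trust policy
      script:
        - export AWS_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/devops-operator"
        - >
          aws sts assume-role-with-web-identity
          --role-arn "${AWS_ROLE_ARN}"
          --role-session-name "gitlab-${CI_PIPELINE_ID}"
          --web-identity-token "${GITLAB_OIDC_TOKEN}"
    ```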

    After adding the variable and re-running the pipeline:

    $ aws sts get-caller-identity
    {
        "UserId": "AROAT4...:gitlab-...",
        "Account": "266735821834",
        "Arn": "arn:aws:sts::266735821834:assumed-role/devops-operator/gitlab-..."
    }
    SUCCESS! GitLab can authenticate to AWS!

    Pipeline status: ✅ Passing

    The Bigger Picture: Rethinking State Management

    While debugging the pipeline, I confronted a larger problem that had been slowing me down: remote state management overhead.

    The Problem with Remote State (For Solo Development)

    When I started this project, I followed best practices and set up a proper Terraform state backend:

    terraform {
      backend "s3" {
        bucket         = "terraform-state-266735821834-us-west-1"
        key            = "infrastructure/terraform.tfstate"
        region         = "us-west-1"
        dynamodb_table = "terraform-state-lock"
        encrypt        = true
      }
    }

    This is the “right way” to do it for team collaboration. The DynamoDB table prevents concurrent modifications, S3 versioning provides history, and remote access enables team coordination.

    But here’s what it actually meant for my workflow:

    1. Separate infrastructure to manage: The state backend itself required Terraform code, deployment, and maintenance
    2. State locking overhead: Every terraform plan acquired and released a DynamoDB lock, adding 10-15 seconds
    3. Cleanup complexity: When I wanted to remove it, I had to:
      • Download the state file (aws s3 cp s3://...)
      • Delete all object versions (S3 versioning creates delete markers)
      • Delete the DynamoDB table
      • Update providers.tf
      • Reinitialize Terraform
    4. Cost (minor but real): $1-2/month for resources I didn’t need

    The Trade-off Analysis

    I asked myself: What am I actually getting from remote state?

    Benefits I was using:

    • ✅ State file backup (but I could just commit to git or back up manually)
    • ✅ Infrastructure tracking (local state does this too)

    Benefits I wasn’t using:

    • ❌ Team collaboration (I’m the only developer)
    • ❌ Concurrent modification prevention (no team = no concurrent access)
    • ❌ Remote access (I only work from one machine)

    Costs I was paying:

    • ⏱️ 10-15 seconds added to every Terraform operation
    • 💰 $1-2/month in AWS charges
    • 🧠 Mental overhead of managing additional infrastructure
    • 🐌 Slower iteration cycles

    The Migration

    I decided to migrate to local state management. Here’s how:

    # 1. Download existing state (preserve resource tracking)
    aws s3 cp s3://terraform-state-266735821834-us-west-1/infrastructure/terraform.tfstate ./terraform.tfstate
    
    # 2. Update providers.tf to remove backend config
    # Before:
    terraform {
      required_version = ">= 1.0"
      backend "s3" { ... }
    }
    
    # After:
    terraform {
      required_version = ">= 1.0"
      # Local backend (default)
    }
    
    # 3. Reinitialize Terraform
    terraform init
    
    # 4. Verify state is preserved
    terraform plan  # Should show "No changes"
    
    # 5. Clean up remote backend resources
    # Delete all S3 object versions
    aws s3api list-object-versions --bucket terraform-state-266735821834-us-west-1 \
      --query 'Versions[].{Key:Key,VersionId:VersionId}' \
      | jq -r '.[]? | "--key \(.Key) --version-id \(.VersionId)"' \
      | xargs -r -I {} sh -c "aws s3api delete-object --bucket terraform-state-266735821834-us-west-1 {}"
    
    # Delete delete markers
    aws s3api list-object-versions --bucket terraform-state-266735821834-us-west-1 \
      --query 'DeleteMarkers[].{Key:Key,VersionId:VersionId}' \
      | jq -r '.[]? | "--key \(.Key) --version-id \(.VersionId)"' \
      | xargs -r -I {} sh -c "aws s3api delete-object --bucket terraform-state-266735821834-us-west-1 {}"
    
    # Delete bucket
    aws s3 rb s3://terraform-state-266735821834-us-west-1 --force
    
    # Delete DynamoDB table
    aws dynamodb delete-table --table-name terraform-state-lock

    The Results

    Performance improvement:

    • terraform apply: 45 seconds (down from 60+ seconds)
    • terraform plan: 30 seconds (down from 45+ seconds)

    Workflow simplification:

    • No separate state backend infrastructure to maintain
    • No DynamoDB lock acquisition delays
    • Faster iteration cycles

    Cost savings:

    • $1-2/month eliminated (minor but satisfying)

    Trade-offs accepted:

    • Manual state backup responsibility (I commit terraform.tfstate to private git repo)
    • No team collaboration features (not needed for solo development)
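    The manual backup responsibility is easy to script. A small sketch of the habit I settled into (the `backups/` directory name is an assumption):

    ```shell
    #!/bin/sh
    # Snapshot the local state file after each apply.
    # 'backups/' is an assumed path inside the (private!) repo.
    set -eu
    mkdir -p backups
    STAMP="$(date -u +%Y%m%dT%H%M%SZ)"
    cp terraform.tfstate "backups/terraform.tfstate.${STAMP}"
    echo "state backed up to backups/terraform.tfstate.${STAMP}"
    ```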

    Lessons Learned

    1. Environment Variables Are Not Optional

    When your CI/CD pipeline references ${VARIABLE_NAME}, you MUST configure it in your CI/CD settings. This seems obvious in retrospect, but it’s easy to overlook when:

    • Variables work in local development (where you have .envrc or shell exports)
    • You’re migrating from hardcoded values to parameterized config
    • You’re copying pipeline configurations from other projects

    Best practice: Create a checklist for new CI/CD pipelines:

    • All environment variables defined in CI/CD settings
    • Variables scoped correctly (protected vs. unprotected)
    • Sensitive values marked as masked
    • Test with a fresh pipeline run (not just re-runs that might cache variables)
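    The first checklist item can even be enforced in the pipeline itself. A minimal sketch (the variable names are an assumption; adjust the list to your project):

    ```shell
    #!/bin/sh
    # Fail the job early if any required CI/CD variable is missing or empty.
    # The list of names is an assumption for illustration.
    for v in AWS_ACCOUNT_ID AWS_DEFAULT_REGION; do
      if [ -z "$(eval "echo \${$v:-}")" ]; then
        echo "Missing CI/CD variable: $v" >&2
        exit 1
      fi
    done
    echo "all required variables present"
    ```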

    2. Cryptic Errors Often Have Simple Causes

    The error “Request ARN is invalid” suggested complex problems:

    • IAM permission issues?
    • OIDC trust policy misconfiguration?
    • AWS service outage?

    But the actual cause was simple: an empty environment variable.

    Debugging approach:

    1. Start with the simplest possible explanation
    2. Verify assumptions (print variable values, check they’re not empty)
    3. Compare working vs. broken states (what changed?)
    4. Only escalate to complex debugging when simple causes are ruled out

    3. “Best Practices” Depend on Context

    Remote state management with S3 and DynamoDB is a best practice for teams. The benefits (concurrent access prevention, remote access, state locking) are valuable when multiple people modify infrastructure.

    But for solo development, these benefits don’t justify the costs:

    • ⏱️ Slower iteration cycles
    • 💰 Additional infrastructure costs
    • 🧠 Mental overhead
    • 🔧 Maintenance burden

    The principle: Adopt best practices when they solve problems you actually have. Don’t cargo-cult solutions designed for different contexts.

    4. State Management Is a Spectrum

    The debate isn’t “local vs. remote.” It’s about choosing the right approach for your situation:

    Scenario                                    Recommended Approach
    Solo developer, rapid prototyping           Local state + git backup
    Solo developer, production infrastructure   Local state + automated S3 backup
    Small team (2-3 people)                     Remote state with locking
    Large team                                  Remote state + Terraform Cloud/Enterprise
    Multi-team organization                     Separate state files per component + remote backend

    I started at the “small team” level when I should have been at the “solo developer, rapid prototyping” level. Recognizing this and adapting saved time and reduced complexity.

    Practical Takeaways

    For GitLab OIDC Authentication

    If your GitLab pipeline fails with OIDC errors:

    1. Check environment variables first:

      # In your pipeline, temporarily add:
      script:
        - echo "AWS_ACCOUNT_ID=${AWS_ACCOUNT_ID}"
        - echo "Role ARN would be: arn:aws:iam::${AWS_ACCOUNT_ID}:role/devops-operator"
    2. Verify the role ARN is complete (contains account ID)

    3. Check GitLab CI/CD variable settings:

      • Variables must be added in Settings → CI/CD → Variables
      • Unprotected variables work on all branches
      • Protected variables only work on protected branches
      • Masked variables don’t appear in logs (good for secrets, bad for debugging)

    For Terraform State Management

    Ask yourself:

    1. Do I need team collaboration?

      • No → Local state is simpler
      • Yes → Remote state is necessary
    2. Do I need state history/rollback?

      • No → Local state with git commits
      • Yes → S3 versioning or Terraform Cloud
    3. Do I need to prevent concurrent modifications?

      • No → Local state (you can’t modify concurrently alone)
      • Yes → DynamoDB locking or Terraform Cloud
    4. Am I in rapid prototyping mode or production mode?

      • Prototyping → Local state for speed
      • Production → Remote state for safety

    Conclusion

    What started as a frustrating pipeline failure became a valuable lesson in systematic debugging and in questioning accepted practices.

    The immediate problem—missing environment variable—was simple to fix. But it prompted a deeper evaluation of my infrastructure choices. By migrating from remote to local state management, I eliminated unnecessary complexity and improved iteration speed.

    The broader lesson: Infrastructure decisions aren’t permanent. When your context changes (from team to solo, prototype to production, learning to production), re-evaluate your choices. The best practice for your situation might differ from the best practice you read in a blog post.

    And always, always check your environment variables first.


    Want to discuss state management strategies or CI/CD debugging? I’d love to hear about your experiences—find me on GitHub @hmbldv.