When Your GitLab Pipeline Fails: Debugging OIDC Authentication and Rethinking Remote State
The Setup
I was riding high after deploying AWS Config aggregator infrastructure across my organization. The Terraform code was clean, the OIDC federation was configured, and my GitLab pipeline had worked perfectly… two days ago. But today, pushing the latest Config recorder changes triggered this frustrating error:
$ export AWS_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/devops-operator"
$ aws sts assume-role-with-web-identity --role-arn ${AWS_ROLE_ARN} ...
An error occurred (ValidationError) when calling the AssumeRoleWithWebIdentity operation:
Request ARN is invalid
The pipeline had been working. The OIDC provider was configured. The IAM role existed. What changed?
The Investigation
Step 1: Verify the Infrastructure
First, I checked if the AWS resources still existed:
# Check OIDC provider
$ aws iam list-open-id-connect-providers
{
"OpenIDConnectProviderList": [
{
"Arn": "arn:aws:iam::266735821834:oidc-provider/gitlab.com"
}
]
}
# Check IAM role
$ aws iam get-role --role-name devops-operator
{
"Role": {
"RoleName": "devops-operator",
"Arn": "arn:aws:iam::266735821834:role/devops-operator",
"AssumeRolePolicyDocument": {
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::266735821834:oidc-provider/gitlab.com"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"gitlab.com:aud": "https://gitlab.com"
},
"StringLike": {
"gitlab.com:sub": "project_path:squinky/aws-sec:*"
}
}
}]
}
}
}
Everything looked perfect. The OIDC provider existed, the role existed, and the trust policy was correct. So why was the ARN invalid?
Step 2: Examine the Pipeline Logs
Looking closer at the error, I noticed something subtle:
$ export AWS_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/devops-operator"
# This expands to: arn:aws:iam::/role/devops-operator
# ↑ Missing account ID!
The ${AWS_ACCOUNT_ID} variable was empty, so the exported ARN was missing its account ID entirely—hence "Request ARN is invalid."
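A cheap guard catches this class of failure before any AWS call runs. The sketch below simulates both the broken and fixed pipeline; require_var is a hypothetical helper, not part of the original pipeline:

```shell
# Sketch: fail fast when a CI variable is unset or empty, instead of
# letting STS reject a half-built ARN. require_var is a hypothetical
# helper, not from the original pipeline.
set -u

require_var() {
  # Indirectly read the variable named in $1; unset and empty both fail.
  eval "value=\${$1:-}"
  if [ -z "$value" ]; then
    echo "ERROR: \$$1 is unset or empty" >&2
    return 1
  fi
}

unset AWS_ACCOUNT_ID 2>/dev/null || true   # simulate the broken pipeline
require_var AWS_ACCOUNT_ID || echo "caught before STS ever saw the bad ARN"

AWS_ACCOUNT_ID=266735821834                # simulate the fixed pipeline
require_var AWS_ACCOUNT_ID
echo "Role ARN: arn:aws:iam::${AWS_ACCOUNT_ID}:role/devops-operator"
```

Running this as the first script line of the CI job turns a cryptic STS error into a one-line message naming the missing variable.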
Step 3: The Root Cause
I had never configured the AWS_ACCOUNT_ID environment variable in GitLab CI/CD settings.
But wait—the pipeline worked before! How?
Digging through git history revealed the answer: When I initially tested the pipeline, I was working with a different branch that had hardcoded the account ID for testing. When I moved to the main branch with the parameterized version, I never added the variable to GitLab.
The lesson: Environment-specific configuration belongs in CI/CD variables, not in code. But you have to actually set them up.
The Fix
The solution was embarrassingly simple:
- Navigate to GitLab Project → Settings → CI/CD → Variables
- Click “Add variable”
- Set:
  - Key: AWS_ACCOUNT_ID
  - Value: 266735821834
  - Type: Variable
  - Protect variable: ☐ Unchecked (to allow use on all branches)
  - Mask variable: ☐ Unchecked (account IDs aren’t sensitive)
After adding the variable and re-running the pipeline:
$ aws sts get-caller-identity
{
"UserId": "AROAT4...:gitlab-...",
"Account": "266735821834",
"Arn": "arn:aws:sts::266735821834:assumed-role/devops-operator/gitlab-..."
}
SUCCESS! GitLab can authenticate to AWS!
Pipeline status: ✅ Passing
The Bigger Picture: Rethinking State Management
While debugging the pipeline, I confronted a larger problem that had been slowing me down: remote state management overhead.
The Problem with Remote State (For Solo Development)
When I started this project, I followed best practices and set up a proper Terraform state backend:
terraform {
backend "s3" {
bucket = "terraform-state-266735821834-us-west-1"
key = "infrastructure/terraform.tfstate"
region = "us-west-1"
dynamodb_table = "terraform-state-lock"
encrypt = true
}
}
This is the “right way” to do it for team collaboration. The DynamoDB table prevents concurrent modifications, S3 versioning provides history, and remote access enables team coordination.
But here’s what it actually meant for my workflow:
- Separate infrastructure to manage: The state backend itself required Terraform code, deployment, and maintenance
- State locking overhead: Every terraform plan acquired and released a DynamoDB lock, adding 10-15 seconds
- Cleanup complexity: When I wanted to remove it, I had to:
  - Download the state file (aws s3 cp s3://...)
  - Delete all object versions (S3 versioning creates delete markers)
  - Delete the DynamoDB table
  - Update providers.tf
  - Reinitialize Terraform
- Cost (minor but real): $1-2/month for resources I didn’t need
The Trade-off Analysis
I asked myself: What am I actually getting from remote state?
Benefits I was using:
- ✅ State file backup (but I could just commit to git or back it up manually)
- ✅ Infrastructure tracking (local state does this too)
Benefits I wasn’t using:
- ❌ Team collaboration (I’m the only developer)
- ❌ Concurrent modification prevention (no team = no concurrent access)
- ❌ Remote access (I only work from one machine)
Costs I was paying:
- ⏱️ 10-15 seconds added to every Terraform operation
- 💰 $1-2/month in AWS charges
- 🧠 Mental overhead of managing additional infrastructure
- 🐌 Slower iteration cycles
The Migration
I decided to migrate to local state management. Here’s how:
# 1. Download existing state (preserve resource tracking)
aws s3 cp s3://terraform-state-266735821834-us-west-1/infrastructure/terraform.tfstate ./terraform.tfstate
# 2. Update providers.tf to remove backend config
# Before:
terraform {
required_version = ">= 1.0"
backend "s3" { ... }
}
# After:
terraform {
required_version = ">= 1.0"
# Local backend (default)
}
# 3. Reinitialize Terraform
terraform init
# 4. Verify state is preserved
terraform plan # Should show "No changes"
# 5. Clean up remote backend resources
# Delete all S3 object versions
aws s3api list-object-versions --bucket terraform-state-266735821834-us-west-1 \
--query 'Versions[].{Key:Key,VersionId:VersionId}' \
| jq -r '.[] | "--key \(.Key) --version-id \(.VersionId)"' \
| xargs -I {} aws s3api delete-object --bucket terraform-state-266735821834-us-west-1 {}
# Delete delete markers
aws s3api list-object-versions --bucket terraform-state-266735821834-us-west-1 \
--query 'DeleteMarkers[].{Key:Key,VersionId:VersionId}' \
| jq -r '.[] | "--key \(.Key) --version-id \(.VersionId)"' \
| xargs -I {} aws s3api delete-object --bucket terraform-state-266735821834-us-west-1 {}
# Delete bucket
aws s3 rb s3://terraform-state-266735821834-us-west-1 --force
# Delete DynamoDB table
aws dynamodb delete-table --table-name terraform-state-lock
The Results
Performance improvement:
- terraform apply: 45 seconds (down from 60+ seconds)
- terraform plan: 30 seconds (down from 45+ seconds)
Workflow simplification:
- No separate state backend infrastructure to maintain
- No DynamoDB lock acquisition delays
- Faster iteration cycles
Cost savings:
- $1-2/month eliminated (minor but satisfying)
Trade-offs accepted:
- Manual state backup responsibility (I commit terraform.tfstate to a private git repo)
- No team collaboration features (not needed for solo development)
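The backup step itself is a couple of git commands. The sketch below runs in a throwaway repo so it is safe to execute anywhere; in practice, the same if/else runs in the real (private) infra repo:

```shell
# Sketch of the "commit terraform.tfstate to a private repo" backup step.
# Demonstrated in a throwaway repo created here so the script is safe to
# run anywhere; the repo path and commit message are illustrative.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "demo"

echo '{"version": 4}' > terraform.tfstate   # stand-in for real state

# Commit the state only when it is new or has changed.
if git ls-files --error-unmatch terraform.tfstate >/dev/null 2>&1 \
   && git diff --quiet -- terraform.tfstate; then
  echo "state unchanged; nothing to back up"
else
  git add terraform.tfstate
  git commit -q -m "backup: terraform state $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  echo "state backed up"
fi
```

Wrapping this in a post-apply alias (or a Makefile target) makes the "manual" backup a reflex rather than a chore.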
Lessons Learned
1. Environment Variables Are Not Optional
When your CI/CD pipeline references ${VARIABLE_NAME}, you MUST configure it in your CI/CD settings. This seems obvious in retrospect, but it’s easy to overlook when:
- Variables work in local development (where you have .envrc or shell exports)
- You’re migrating from hardcoded values to parameterized config
- You’re copying pipeline configurations from other projects
Best practice: Create a checklist for new CI/CD pipelines:
- All environment variables defined in CI/CD settings
- Variables scoped correctly (protected vs. unprotected)
- Sensitive values marked as masked
- Test with a fresh pipeline run (not just re-runs that might cache variables)
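The first checklist item can itself be automated as a preflight CI step that reports every missing variable at once. A sketch, with an illustrative REQUIRED_VARS list (put whatever your pipeline actually references there):

```shell
# Sketch: automate checklist item 1 as a preflight step listing every
# missing CI variable in one pass. REQUIRED_VARS is illustrative.
set -u

REQUIRED_VARS="AWS_ACCOUNT_ID AWS_DEFAULT_REGION"
missing=0

unset AWS_ACCOUNT_ID AWS_DEFAULT_REGION 2>/dev/null || true  # simulate a fresh project

for name in $REQUIRED_VARS; do
  eval "value=\${$name:-}"
  if [ -z "$value" ]; then
    echo "MISSING: $name"
    missing=$((missing + 1))
  fi
done

echo "$missing variable(s) missing"
# In a real pipeline you would fail here: [ "$missing" -eq 0 ] || exit 1
```

Reporting all missing variables at once beats a guard that stops at the first one, since a fresh project typically lacks several.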
2. Cryptic Errors Often Have Simple Causes
The error “Request ARN is invalid” suggested complex problems:
- IAM permission issues?
- OIDC trust policy misconfiguration?
- AWS service outage?
But the actual cause was simple: an empty environment variable.
Debugging approach:
- Start with the simplest possible explanation
- Verify assumptions (print variable values, check they’re not empty)
- Compare working vs. broken states (what changed?)
- Only escalate to complex debugging when simple causes are ruled out
3. “Best Practices” Depend on Context
Remote state management with S3 and DynamoDB is a best practice for teams. The benefits (concurrent access prevention, remote access, state locking) are valuable when multiple people modify infrastructure.
But for solo development, these benefits don’t justify the costs:
- ⏱️ Slower iteration cycles
- 💰 Additional infrastructure costs
- 🧠 Mental overhead
- 🔧 Maintenance burden
The principle: Adopt best practices when they solve problems you actually have. Don’t cargo-cult solutions designed for different contexts.
4. State Management Is a Spectrum
The debate isn’t “local vs. remote.” It’s about choosing the right approach for your situation:
| Scenario | Recommended Approach |
|---|---|
| Solo developer, rapid prototyping | Local state + git backup |
| Solo developer, production infrastructure | Local state + automated S3 backup |
| Small team (2-3 people) | Remote state with locking |
| Large team | Remote state + Terraform Cloud/Enterprise |
| Multi-team organization | Separate state files per component + remote backend |
I started at the “small team” level when I should have been at the “solo developer, rapid prototyping” level. Recognizing this and adapting saved time and reduced complexity.
Practical Takeaways
For GitLab OIDC Authentication
If your GitLab pipeline fails with OIDC errors:
1. Check environment variables first:
# In your pipeline, temporarily add:
script:
  - echo "AWS_ACCOUNT_ID=${AWS_ACCOUNT_ID}"
  - echo "Role ARN would be: arn:aws:iam::${AWS_ACCOUNT_ID}:role/devops-operator"
2. Verify the role ARN is complete (contains the account ID)
3. Check GitLab CI/CD variable settings:
  - Variables must be added in Settings → CI/CD → Variables
  - Unprotected variables work on all branches
  - Protected variables only work on protected branches
  - Masked variables don’t appear in logs (good for secrets, bad for debugging)
For Terraform State Management
Ask yourself:
1. Do I need team collaboration?
  - No → Local state is simpler
  - Yes → Remote state is necessary
2. Do I need state history/rollback?
  - No → Local state with git commits
  - Yes → S3 versioning or Terraform Cloud
3. Do I need to prevent concurrent modifications?
  - No → Local state (you can’t modify concurrently alone)
  - Yes → DynamoDB locking or Terraform Cloud
4. Am I in rapid prototyping mode or production mode?
  - Prototyping → Local state for speed
  - Production → Remote state for safety
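For the "automated backup" middle ground, a post-apply hook can be as small as a timestamped copy. Sketched with a local copy here; swapping cp for aws s3 cp gives the S3 variant (paths are illustrative):

```shell
# Sketch: timestamped state backup after each apply. Shown with a local
# copy; replacing cp with "aws s3 cp" gives the automated-S3-backup
# option. Paths and the stand-in state file are illustrative.
set -eu

workdir=$(mktemp -d)
cd "$workdir"
echo '{"version": 4}' > terraform.tfstate   # stand-in for real state

backup_dir=".state-backups"
mkdir -p "$backup_dir"
stamp=$(date -u +%Y%m%dT%H%M%SZ)
cp terraform.tfstate "$backup_dir/terraform.tfstate.$stamp"

ls "$backup_dir"
```

Timestamped copies give you the rollback history that S3 versioning provided, without any infrastructure to maintain.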
Conclusion
What started as a frustrating pipeline failure became a valuable lesson in systematic debugging and in questioning accepted practices.
The immediate problem—missing environment variable—was simple to fix. But it prompted a deeper evaluation of my infrastructure choices. By migrating from remote to local state management, I eliminated unnecessary complexity and improved iteration speed.
The broader lesson: Infrastructure decisions aren’t permanent. When your context changes (from team to solo, prototype to production, learning to production), re-evaluate your choices. The best practice for your situation might differ from the best practice you read in a blog post.
And always, always check your environment variables first.
Related Resources:
- GitLab OIDC Authentication Documentation
- Terraform Backend Configuration
- AWS Config Infrastructure Project
Want to discuss state management strategies or CI/CD debugging? I’d love to hear about your experiences—find me on GitHub @hmbldv.