Sanitizing Git History: A Security-First Approach to Public Repositories
The Problem Nobody Talks About
You’ve built something useful. You want to share it. You push to GitHub and wait. Did you just publish your AWS account ID? Your email? That SSH key comment with your personal address?
This happened to me. Not in a catastrophic way, but in the slow-burn realization that my “public” repositories contained breadcrumbs of information that didn’t need to be public. Account IDs in Terraform variables. GitLab usernames in OIDC trust policies. Email addresses embedded in SSH public keys.
None of this is immediately exploitable. AWS account IDs aren’t secrets per se, but they’re pieces of a puzzle. Combined with other information, they make reconnaissance easier. And in security, we don’t make reconnaissance easier.
The Dual-Remote Reality
My setup uses two remotes:
- GitLab: Production. CI/CD pipelines. Real infrastructure deployment.
- GitHub: Public portfolio. Code samples. Open source contributions.
This separation exists because GitLab CI/CD needs actual values to deploy infrastructure. The OIDC trust policy must reference my real GitLab project path. Terraform needs my actual AWS account ID to name S3 buckets.
But GitHub? GitHub is for showing work, not running it. The code should be instructive, not operational.
The challenge: How do you maintain one codebase that serves both purposes?
The Wrong Approach: Manual Scrubbing
My first instinct was to manually edit files before pushing to GitHub. Find-and-replace account IDs. Delete the GitLab-specific configs. Push a “clean” version.
This fails for three reasons:
1. Git remembers everything. Even if you edit a file, the old version lives in history. `git log -p` reveals all.
2. It's error-prone. Miss one file, one commit, one variable, and you've leaked data.
3. It doesn't scale. Every commit requires re-sanitization. One slip and you're back to square one.
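The first failure mode is easy to demonstrate in a throwaway repo: commit a value, delete the file, and the value is still reachable from history. Everything below runs in a temp directory, and the account ID is fake.

```shell
set -e
# Throwaway repo: commit a fake "sensitive" value, then delete the file.
tmp=$(mktemp -d) && cd "$tmp"
git init -q
echo 'account_id = "111122223333"' > main.tf
git add main.tf
git -c user.name=demo -c user.email=demo@example.com commit -q -m "add config"
git rm -q main.tf
git -c user.name=demo -c user.email=demo@example.com commit -q -m "remove config"
# The file is gone at HEAD, but every blob in history is still searchable:
git grep "111122223333" $(git rev-list --all)
```

The final command prints the match from the old commit: deleting a file only removes it from the current tree, not from the object store.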
The Right Approach: Rewrite History
Git's history isn't immutable; it just feels that way. Tools like git-filter-repo can rewrite every commit, replacing sensitive strings across the entire repository history.
Here’s what I did:
Step 1: Create a Replacement Map
```
XXXXXXXXXXXX==>123456789012
YYYYYYYYYYYY==>234567890123
sensitive-username==>username
personal-email@domain.com==>user@example.com
```
Each line maps a sensitive value to a safe placeholder. The tool processes every blob in every commit, performing these substitutions.
Step 2: Run the Filter
```shell
git-filter-repo --replace-text replacements.txt --force
```
This rewrites history. Every commit that contained XXXXXXXXXXXX now contains 123456789012. The SHA hashes change (because the content changed), but the commit messages, dates, and structure remain.
Step 3: Force Push
```shell
git push origin main --force
```
Yes, force push. This replaces the remote history with the sanitized version. Anyone who cloned before will have conflicts, but that's the point. The old, sensitive history no longer exists on the remote.
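One caveat worth adding: even after the rewrite, pre-rewrite objects can linger in your local clone's reflog until garbage collection runs. Expiring the reflog and pruning removes them locally (shown here in a scratch repo so the commands run anywhere):

```shell
set -e
# Scratch repo standing in for the freshly rewritten local clone:
tmp=$(mktemp -d) && cd "$tmp" && git init -q
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "sanitized"
# Drop reflog entries that could still point at pre-rewrite objects,
# then prune unreachable objects immediately instead of waiting for gc:
git reflog expire --expire=now --all
git gc --prune=now --quiet
echo "local objects pruned"
```

In a real cleanup you'd run the last two commands inside the rewritten repository itself.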
The Template Strategy
Sanitizing existing repos is reactive. For the portfolio, I wanted something proactive: a template that’s clean from the start.
The approach:
- Copy the production repo to a new directory
- Remove `.git` (fresh history)
- Replace all personal content with placeholders
- Create example files that demonstrate structure without leaking data
- Initialize new repo and push to GitHub
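As a sketch, the whole flow looks like this. The paths, file contents, and the name being replaced are all illustrative (and `sed -i` here is the GNU form); the demo builds a tiny fake "production" repo in a temp directory so it runs end to end:

```shell
set -e
# Fake production repo, standing in for the real portfolio:
src=$(mktemp -d)
mkdir -p "$src/src"
echo 'export const AUTHOR = "Real Name";' > "$src/src/consts.ts"
git init -q "$src"

# Copy to a new directory
tmpl=$(mktemp -d)/portfolio-template
cp -r "$src" "$tmpl"
# Remove .git for a fresh history
rm -rf "$tmpl/.git"
# Replace personal content with placeholders
sed -i 's/Real Name/Your Name/g' "$tmpl/src/consts.ts"
# Initialize the new repo (the final push to GitHub is omitted here)
cd "$tmpl"
git init -q
git add -A
git -c user.name=demo -c user.email=demo@example.com commit -q -m "Initial commit: sanitized template"
grep AUTHOR src/consts.ts
```

The key move is `rm -rf .git`: the template's history starts at the sanitized state, so there is nothing older to leak.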
The result: `portfolio-template`, a fully functional Astro portfolio with zero personal information. Fork it, customize it, deploy it.
What Got Replaced
| File | Personal Content | Template Placeholder |
|---|---|---|
| `src/consts.ts` | Site title with my name | `Your Name \| Portfolio` |
| `src/pages/index.astro` | Bio, title, specializations | Generic placeholders |
| `src/components/Header.astro` | LinkedIn, GitHub, email URLs | `your-linkedin`, `your-github` |
| `src/data/certifications.ts` | My actual certifications | Example certification objects |
| `src/content/projects/*.md` | Detailed project writeups | Single example template |
Terraform: The Trickier Case
Infrastructure code is harder to sanitize because it needs real values to work. My solution uses gitignored variable files:
Committed (public):
```hcl
# variables.tf
variable "aws_account_id" {
  description = "AWS account ID"
  type        = string
  default     = "" # Set via terraform.tfvars
}
```
Gitignored (local only):
```hcl
# terraform.tfvars
aws_account_id      = "XXXXXXXXXXXX"
gitlab_project_path = "sensitive-username/aws-sec"
```
The committed code has empty defaults. The real values live in .tfvars files that never touch GitHub. Users who clone the repo create their own .tfvars with their own values.
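The enforcement lives in `.gitignore`. The exact patterns below are my assumption about what such a repo would carry, not a listing from the post:

```gitignore
# Keep real values local; the public remote only ever sees empty defaults
*.tfvars
*.tfvars.json
backend.hcl
```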
For the S3 backend (which can’t use variables), I use a separate backend.hcl:
```hcl
# backend.hcl (gitignored)
bucket = "terraform-state-XXXXXXXXXXXX-us-west-1"
```
Initialize with:
```shell
terraform init -backend-config=backend.hcl
```
Lessons Learned
1. Commit Hygiene Matters From Day One
It’s easier to never commit secrets than to remove them later. Before every commit:
- Check `git diff` for account IDs, emails, keys
- Use `.gitignore` aggressively
- Consider pre-commit hooks that scan for patterns
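A minimal version of such a hook is just a grep over the staged diff. The patterns below (12-digit account IDs, private key headers, email addresses) are my own starting set, not an exhaustive one. The script scans a hard-coded sample string so it runs anywhere; in a real `.git/hooks/pre-commit` you'd feed it `git diff --cached` and `exit 1` on a match:

```shell
#!/bin/sh
# Candidate secret patterns: 12-digit AWS account IDs, PEM private key
# headers, and email addresses. Tune for your own codebase.
pattern='[0-9]{12}|BEGIN [A-Z ]*PRIVATE KEY|[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'

# In a real hook this would be: sample=$(git diff --cached)
sample='aws_account_id = "123456789012"'

if echo "$sample" | grep -Eq "$pattern"; then
  echo "potential secret detected; review before committing"
else
  echo "no known patterns found"
fi
```

For anything beyond a quick grep, purpose-built scanners catch far more patterns than a hand-rolled regex, but even this crude check stops the most common slips.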
2. AWS Account IDs Aren’t Secrets, But Treat Them Like One
AWS says account IDs aren’t sensitive. They’re in every ARN, every CloudTrail log, every error message. But:
- They enable targeted attacks
- They’re used in social engineering
- They identify you across services
Minimizing exposure is good hygiene, not paranoia.
3. Git History Is a Liability
Every commit is permanent until you actively rewrite it. That quick test with hardcoded credentials? It’s in your history. That config file you deleted? Still there.
Assume anything committed will be found. Act accordingly.
4. Separation of Concerns Applies to Repos
Not everything belongs in the same repository. Production configs, personal data, and example code have different audiences and different risk profiles. Separate them.
5. Templates Are Documentation
A well-structured template teaches more than a tutorial. It shows the right file structure, the correct frontmatter format, the expected configuration. Users learn by customizing, not by reading.
The Power of Doubling Back
The most valuable skill in this process wasn't any particular git command; it was the willingness to revisit decisions. To look at a "finished" repo and ask: "Is this actually ready to be public?"
Security isn’t a feature you add at the end. It’s a lens you apply continuously. Every commit, every push, every new file deserves the question: “What am I exposing?”
This blog post exists because I doubled back. I looked at repos I'd already pushed and realized they weren't as clean as I thought. The fix took a few hours. The alternative, leaving sensitive data exposed indefinitely, wasn't acceptable.
Double back. Check your history. Clean what needs cleaning. Your future self will thank you.
Resources
- git-filter-repo - The modern replacement for `git filter-branch`
- BFG Repo-Cleaner - Alternative for removing large files and passwords
- GitHub: Removing sensitive data
- Portfolio Template - The sanitized template discussed in this post
Security isn't about perfection; it's about continuous improvement. Every repo you clean, every secret you catch before commit, every habit you build makes the next project safer.