Sanitizing Git History: A Security-First Approach to Public Repositories

    The Problem Nobody Talks About

    You’ve built something useful. You want to share it. You push to GitHub and wait. Did you just publish your AWS account ID? Your email? That SSH key comment with your personal address?

    This happened to me. Not in a catastrophic way, but in the slow-burn realization that my “public” repositories contained breadcrumbs of information that didn’t need to be public. Account IDs in Terraform variables. GitLab usernames in OIDC trust policies. Email addresses embedded in SSH public keys.

    None of this is immediately exploitable. AWS account IDs aren’t secrets per se, but they’re pieces of a puzzle. Combined with other information, they make reconnaissance easier. And in security, we don’t make reconnaissance easier.

    The Dual-Remote Reality

    My setup uses two remotes:

    • GitLab: Production. CI/CD pipelines. Real infrastructure deployment.
    • GitHub: Public portfolio. Code samples. Open source contributions.

    This separation exists because GitLab CI/CD needs actual values to deploy infrastructure. The OIDC trust policy must reference my real GitLab project path. Terraform needs my actual AWS account ID to name S3 buckets.

    But GitHub? GitHub is for showing work, not running it. The code should be instructive, not operational.

    The challenge: How do you maintain one codebase that serves both purposes?

    The Wrong Approach: Manual Scrubbing

    My first instinct was to manually edit files before pushing to GitHub. Find-and-replace account IDs. Delete the GitLab-specific configs. Push a “clean” version.

    This fails for three reasons:

    1. Git remembers everything. Even if you edit a file, the old version lives in history. git log -p reveals all.

    2. It’s error-prone. Miss one file, one commit, one variable, and you’ve leaked data.

    3. It doesn’t scale. Every commit requires re-sanitization. One slip and you’re back to square one.

    The Right Approach: Rewrite History

    Git’s history isn’t immutable; it just feels that way. Tools like git-filter-repo can rewrite every commit, replacing sensitive strings across the entire repository history.

    Here’s what I did:

    Step 1: Create a Replacement Map

    XXXXXXXXXXXX==>123456789012
    YYYYYYYYYYYY==>234567890123
    sensitive-username==>username
    personal-email@domain.com==>user@example.com

    Each line maps a sensitive value to a safe placeholder. The tool processes every blob in every commit, performing these substitutions.
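    To make the rule format concrete, here is a minimal sketch of what a single literal old==>new rule does to one piece of text. It is an illustration only: apply_replacements is a hypothetical helper I made up for this post, not part of git-filter-repo, and it ignores the tool's other rule types.

```shell
#!/bin/sh
# Illustration only: applies literal "old==>new" rules from a map file to text on stdin.
# apply_replacements is a hypothetical helper, not a git-filter-repo command.
apply_replacements() {
  local map_file="$1" text rest out line from to
  text="$(cat)"                          # the "blob" to rewrite, read from stdin
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      *"==>"*) ;;                        # only handle explicit old==>new rules
      *) continue ;;                     # skip blank or unsupported lines
    esac
    from="${line%%==>*}"                 # text before the first ==>
    to="${line#*==>}"                    # text after the first ==>
    [ -n "$from" ] || continue
    rest="$text"; out=""
    while :; do                          # replace every literal occurrence
      case "$rest" in
        *"$from"*)
          out=$out${rest%%"$from"*}$to
          rest=${rest#*"$from"}
          ;;
        *) break ;;
      esac
    done
    text="$out$rest"
  done < "$map_file"
  printf '%s\n' "$text"
}
```

    Piping a file through it, for example `apply_replacements replacements.txt < main.tf`, previews the substitutions on one file before any history is rewritten.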

    Step 2: Run the Filter

    git-filter-repo --replace-text replacements.txt --force

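    This rewrites history. Every commit whose files contained XXXXXXXXXXXX now contains 123456789012. The commit hashes change (because the content changed), but the commit messages, dates, and structure remain. Note that --replace-text only rewrites file contents; a sensitive string in a commit message needs git-filter-repo’s separate --replace-message option.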
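    Before pushing anything, it seems worth checking that the rewrite actually caught everything. The sketch below assumes it runs inside the rewritten repository; history_contains is a hypothetical helper name, and it relies on git log's -S option, which searches every commit for changes that add or remove a literal string.

```shell
#!/bin/sh
# history_contains is a hypothetical helper: it succeeds if any commit on any ref
# added or removed the given literal string in a file.
history_contains() {
  local needle="$1"
  # -S matches commits that change the number of occurrences of the string;
  # --all covers every ref, not just the current branch.
  [ -n "$(git log --all --oneline -S"$needle")" ]
}
```

    For example, `history_contains "sensitive-username" && echo "still present"` should print nothing after a successful rewrite. Like --replace-text, this only looks at file contents, not commit messages.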

    Step 3: Force Push

    git push origin main --force

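    Yes, force push. This replaces the remote branch with the sanitized version. Anyone who cloned before will have conflicts, but that’s the point. Two caveats: git-filter-repo normally removes the origin remote as a safety measure, so it may need to be re-added with git remote add before this push works. And while the old, sensitive history is no longer reachable from the branch, existing clones and forks still hold it, and the hosting provider may keep unreferenced commits for a while.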

    The Template Strategy

    Sanitizing existing repos is reactive. For the portfolio, I wanted something proactive: a template that’s clean from the start.

    The approach:

    1. Copy the production repo to a new directory
    2. Remove .git (fresh history)
    3. Replace all personal content with placeholders
    4. Create example files that demonstrate structure without leaking data
    5. Initialize new repo and push to GitHub
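    Steps 1 and 2 can be sketched as a small shell function. This is a rough outline under my own assumptions: make_template is a hypothetical name, the paths are whatever you pass in, and it deliberately refuses to overwrite an existing directory.

```shell
#!/bin/sh
# make_template is a hypothetical helper covering steps 1 and 2:
# copy the production repo, then drop its history from the copy.
make_template() {
  local src="$1" dest="$2"
  if [ ! -d "$src" ]; then
    echo "source not found: $src" >&2
    return 1
  fi
  if [ -e "$dest" ]; then
    echo "destination already exists: $dest" >&2
    return 1
  fi
  cp -R "$src" "$dest"       # step 1: copy to a new directory
  rm -rf "$dest/.git"        # step 2: remove .git only in the copy
}
```

    Steps 3 and 4 are manual editing. Step 5 is then a plain `git init` in the new directory followed by a first commit and push, so the template's history starts from the already-clean state.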

    The result: portfolio-template, a fully functional Astro portfolio with zero personal information. Fork it, customize it, deploy it.

    What Got Replaced

    File                         Personal Content              Template Placeholder
    src/consts.ts                Site title with my name       Your Name | Portfolio
    src/pages/index.astro        Bio, title, specializations   Generic placeholders
    src/components/Header.astro  LinkedIn, GitHub, email URLs  your-linkedin, your-github
    src/data/certifications.ts   My actual certifications      Example certification objects
    src/content/projects/*.md    Detailed project writeups     Single example template

    Terraform: The Trickier Case

    Infrastructure code is harder to sanitize because it needs real values to work. My solution uses gitignored variable files:

    Committed (public):

    # variables.tf
    variable "aws_account_id" {
      description = "AWS account ID"
      type        = string
      default     = "" # Set via terraform.tfvars
    }

    Gitignored (local only):

    # terraform.tfvars
    aws_account_id = "XXXXXXXXXXXX"
    gitlab_project_path = "sensitive-username/aws-sec"

    The committed code has empty defaults. The real values live in .tfvars files that never touch GitHub. Users who clone the repo create their own .tfvars with their own values.

    For the S3 backend (which can’t use variables), I use a separate backend.hcl:

    # backend.hcl (gitignored)
    bucket = "terraform-state-XXXXXXXXXXXX-us-west-1"

    Initialize with:

    terraform init -backend-config=backend.hcl
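    All of this depends on the ignore rules actually being in place. As a sketch, the function below appends patterns to an ignore file only if they are not already listed. ensure_ignored is a hypothetical helper, and the patterns in the usage line are my assumption about what this layout needs.

```shell
#!/bin/sh
# ensure_ignored is a hypothetical helper: it adds each pattern to the given
# ignore file once, leaving existing entries untouched.
ensure_ignored() {
  local ignore_file="$1" pattern
  shift
  touch "$ignore_file"
  for pattern in "$@"; do
    # -x whole-line match, -F literal string (so "*" is not treated as a regex)
    grep -qxF -- "$pattern" "$ignore_file" || printf '%s\n' "$pattern" >> "$ignore_file"
  done
}
```

    Usage would look like `ensure_ignored .gitignore '*.tfvars' 'backend.hcl'`. Keep in mind that .gitignore has no effect on files that are already tracked; those have to be removed from the index first with `git rm --cached`.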

    Lessons Learned

    1. Commit Hygiene Matters From Day One

    It’s easier to never commit secrets than to remove them later. Before every commit:

    • Check git diff for account IDs, emails, keys
    • Use .gitignore aggressively
    • Consider pre-commit hooks that scan for patterns
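    A pattern scan does not need to be elaborate to be useful. The sketch below reads text on stdin and fails if anything looks like an account ID, an email address, or a private key header. The function name and the three patterns are illustrative assumptions, and they will produce false positives (placeholders such as 123456789012 match too), so a real hook would likely want an allowlist.

```shell
#!/bin/sh
# scan_for_sensitive is a hypothetical helper: prints suspicious lines from stdin
# and returns 1 if any were found, 0 otherwise.
scan_for_sensitive() {
  # Illustrative patterns: a standalone 12-digit number (the shape of an AWS
  # account ID), an email address, and a private key header.
  local patterns='(^|[^0-9])[0-9]{12}([^0-9]|$)|[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}|BEGIN [A-Z ]*PRIVATE KEY'
  if grep -nE -- "$patterns"; then
    echo "possible sensitive data found" >&2
    return 1
  fi
  return 0
}
```

    In a pre-commit hook this could be fed the staged changes, for example `git diff --cached | scan_for_sensitive || exit 1`, which blocks the commit when something matches.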

    2. AWS Account IDs Aren’t Secrets, But Treat Them Like Secrets

    AWS says account IDs aren’t sensitive. They’re in every ARN, every CloudTrail log, every error message. But:

    • They enable targeted attacks
    • They’re used in social engineering
    • They identify you across services

    Minimizing exposure is good hygiene, not paranoia.

    3. Git History Is a Liability

    Every commit is permanent until you actively rewrite it. That quick test with hardcoded credentials? It’s in your history. That config file you deleted? Still there.

    Assume anything committed will be found. Act accordingly.

    4. Separation of Concerns Applies to Repos

    Not everything belongs in the same repository. Production configs, personal data, and example code have different audiences and different risk profiles. Separate them.

    5. Templates Are Documentation

    A well-structured template teaches more than a tutorial. It shows the right file structure, the correct frontmatter format, the expected configuration. Users learn by customizing, not by reading.

    The Power of Doubling Back

    The most valuable skill in this process wasn’t any particular git command—it was the willingness to revisit decisions. To look at a “finished” repo and ask: “Is this actually ready to be public?”

    Security isn’t a feature you add at the end. It’s a lens you apply continuously. Every commit, every push, every new file deserves the question: “What am I exposing?”

    This blog post exists because I doubled back. I looked at repos I’d already pushed and realized they weren’t as clean as I thought. The fix took a few hours. The alternative, leaving sensitive data exposed indefinitely, wasn’t acceptable.

    Double back. Check your history. Clean what needs cleaning. Your future self will thank you.

    Security isn’t about perfection; it’s about continuous improvement. Every repo you clean, every secret you catch before commit, every habit you build makes the next project safer.