
Beyond Regex: LLMs Are the Final Nail in the Coffin for Hardcoded Secrets

 

TL;DR

  • Breaches have real impact on users and can impose massive fines on organizations.
  • LLM-generated code can introduce vulnerabilities, including hardcoded secrets.
  • Academics are starting to corner the problem of using LLMs to identify secrets in code.
  • Secrets can persist in git repos unless you make a special effort to clean them out.
  • AI-driven detection scales to tens of millions of files, and modern compute means it happens fast.

The following post is a tool that consultants and security engineers can use to convince developers that hardcoded credentials in source code are a serious problem. I will highlight the impact on the organizations and humans involved, show what embedded creds have led to (disclosures and breaches), and at the end highlight how AI is contributing to both the detection and the prevalence of hardcoded creds. The article also provides a quick walkthrough of a tool you can use to hunt down creds in public repos, and a way to fix creds that persist in yours. Enjoy.

It’s easy to think of “hacking” as a sophisticated, cinematic event involving zero-day exploits and complex rootkits. But the reality is often far more mundane—and far more preventable. We’ve seen folks take this for granted before[9,10,11], with massive incidents stemming directly from public repositories exposing credentials in source code. Take Toyota[9], for instance, which left an access key exposed on GitHub for nearly five years, potentially compromising customer data. Or look at the Uber breach, where attackers gained initial access and then moved laterally through the network by finding admin credentials hardcoded directly into a PowerShell script. These aren’t just “oops” moments; they are catastrophic failures of basic hygiene that hand the keys to the kingdom to anyone with an internet connection.

At this point in cybersecurity’s young history it may seem like a one-off or rare occurrence, but we’re coming up on a good handful of these breach events now. You might want to drum up some of these examples next time you’re in the room with a shareholder, a developer, or anyone else who doesn’t grasp the scale and impact of the problem. Here’s a table covering a few of the known breaches that stem, at least in part, from credential disclosure through source code and repos.

| Company | Year | Mechanism of Failure | Impact / Consequence |
|---|---|---|---|
| Uber[13,14] | 2016 | Hardcoded credentials for an admin account found within a PowerShell script on a private GitHub repository. | Attackers gained access to Uber’s Amazon Web Services (AWS) account, leading to the theft of personal data for 57 million customers and drivers. |
| LastPass[12] | 2022 | Attackers stole the company’s source code, which contained hardcoded credentials and keys needed to access internal cloud storage services. | The attackers pivoted into a third-party cloud environment holding backups of customer data, including vault metadata and encrypted vaults. |
| Toyota[9] | 2017–2022 | An AWS access key for a cloud environment containing customer data was accidentally committed to a public GitHub repository by a vendor, remaining exposed for nearly five years. | The exposed key granted access to server data, leading to a breach involving up to 3.1 million customer records. |
| Samsung | 2022 | The Lapsus$ hacking group breached Samsung and leaked 190 GB of confidential source code for Galaxy devices. | Analysis of the code revealed hardcoded internal API keys and secrets, exposing the internal infrastructure and security logic of the devices. |
| SolarWinds | Pre-2019 | A development team member committed an FTP server password (solarwinds123) to a public GitHub repository. | Although not the primary vector for the massive “Sunburst” supply chain attack, the exposed, weak credential became a symbol of the company’s lax security culture, demonstrating a clear vulnerability for any attacker. |

Disclosing credentials is essentially letting someone in, and this can get really nightmarish for all parties involved. It puts all your users at risk and leaves their data in the hands of people who might exploit them, track them for the rest of their lives, and make their day-to-day existence profoundly unsafe to navigate. A disclosure is seldom a one-off event either: given our political and social climate, giving away someone’s data makes them an easy target for hate and terrorist groups (in my opinion, given this reality of the internet, we let orgs off far too easy in this regard). So it’s never just one disclosure that everyone moves on from; your customer, client, and shareholder data can be traded and passed around the darknet for years. For your users it can become a lifelong nightmare, something they may never escape, and for organizations there’s a minefield of compliance impact, which is briefly discussed in the next section.

The Ripple Effect: When “Just This Once” Becomes a Crisis

When you hardcode credentials, you aren’t just taking a shortcut; you are creating a fragile ecosystem where a single mistake can topple your entire security posture. The most immediate danger is the “single point of failure” problem. If a developer’s laptop is compromised, it’s no longer just that one machine at risk: that machine becomes a fully authorised gateway to your production databases, cloud infrastructure, and customer PII. You are effectively pinning the security of your entire organisation on a single endpoint.

Beyond the immediate breach risk, there is the nightmare of compliance. Standards like NIST SP 800-53 (specifically control IA-5) explicitly forbid embedding unencrypted static authenticators in code, and FIPS 140-3 compliance is impossible without secure key management and the “zeroization” protocols that hardcoding inherently violates. Furthermore, these floating copies of credentials become unmanageable. You can’t rotate a password that is buried in a thousand lines of code across fifty different versions of an app. You lose the ability to patch, track, and revoke access swiftly, turning credential management into an impossible game of whack-a-mole. To provide a quotable resource, I’ve put together a simple table of the compliance standards with controls that speak to hardcoded credentials, and how the issue affects standardisation and benchmarking around authentication and identity management.

| Standard / Framework | Specific Control / Requirement | Why Hardcoded Creds Fail This Standard |
|---|---|---|
| PCI DSS v4.0 | Requirement 8.6.2 | Explicit prohibition. The standard strictly states: “Passwords/passphrases for system and application accounts are not hard coded in scripts, configuration files, or source code.” Hardcoding is a direct violation of this requirement, resulting in immediate non-compliance for payment processors. |
| NIST SP 800-53 (Rev. 5) | Control IA-5(1) (Authenticator Management) | Inability to manage. This control requires that authenticators (passwords/keys) are changed/refreshed periodically and protected from unauthorized disclosure. Hardcoded keys cannot be rotated without redeploying code, violating the “management” aspect of this control. |
| OWASP ASVS 4.0 | V2.10.4 & V14.3.2 | Service auth & unintended disclosure. ASVS requires that “service authentication credentials are stored in a secure local storage.” Hardcoding places them directly in source code, violating the separation of configuration and code. |
| ISO 27001:2022 | Annex A 8.28 (Secure Coding) | Insecure development. This control requires organizations to establish secure coding principles. Hardcoding is universally cited as a poor coding practice in ISO guidance, failing the requirement to prevent unintended information leakage in software development. |
| FIPS 140-3 | Area 6 (Cryptographic Key Management) | Zeroization failure. FIPS requires that cryptographic modules be able to “zeroize” (permanently destroy) sensitive security parameters (SSPs). Hardcoded keys are burned into the binary/read-only memory and cannot be effectively zeroized or wiped if compromised. |
| CIS Controls (v8) | Control 16.6 (Securely Manage Secrets) | Plaintext exposure. This control mandates that you “use a dedicated secrets management solution” and never store secrets in code or config files. Hardcoding violates the core tenet of moving secrets out of the application logic. |

So we know what the compliance sins of disclosure are, but what is the financial impact? How bad can it get when you are too lenient with your controls? What has happened in the past? I know people like numbers, and being able to say something like “y’know, in 20XX org Y was made to pay a fine of a gajillion dollars” helps a lot. It takes the discussion out of pure, impractical audit findings and into real cost impact. The following table summarises this impact per country and compliance framework; feel free to whip it out at your next developer mindmeld or whatever trendy name people use for brainstorm/strategy sessions at your org.


| Jurisdiction / Framework | Type of Cost Imposed | Illustrative Maximum Statutory Fine | Notable Enforcement Example(s) | Source (Max Fine) | Source (Example) |
|---|---|---|---|---|---|
| 🇪🇺 European Union (GDPR) | Regulatory fine (Article 83), lawsuits, breach notification costs | €20 million or 4% of global annual turnover, whichever is higher | Amazon (2021): fined €746 million by the Luxembourg DPA for privacy violations | Article 83, GDPR | Amazon Fine Report |
| 🇺🇸 United States (CCPA/CPRA, California) | State civil penalties (per violation), class action lawsuits, FTC fines, mandatory notification | USD 2,663 to USD 7,988 per intentional violation (per consumer) | Equifax (2019): settled for up to $700 million with the FTC and states for the 2017 breach | CCPA Penalties & Fines | Equifax Settlement Report |
| 🇬🇧 United Kingdom (UK GDPR) | Regulatory fine (enforced by the ICO), lawsuits, mandated audits | £17.5 million or 4% of global annual turnover, whichever is higher | British Airways (2020): fined £20 million by the ICO for a 2018 data breach | ICO Penalty Authority | BA Fine Report |
| 🇦🇺 Australia (Privacy Act 1988) | Regulatory fine (enforced by the OAIC), civil penalties, remediation costs | The greater of A$50 million, 3x the value of the benefit, or 30% of domestic turnover | Australian Clinical Labs (2024): ordered to pay A$5.8 million (under the old, lower penalty regime) for a 2022 breach | OAIC Maximum Penalties | ACL Penalty Report |
| 🇨🇦 Canada (PIPEDA) | Administrative fine (failure to report), mandated breach reporting, reputational harm | Up to C$100,000 per violation for failure to report breaches, plus court-ordered damages | Enforcement focuses on compliance orders and remediation, with fines typically applied to failures to report/record breaches | PIPEDA Penalties | PIPEDA Summary |

So basically you could be fined a percentage of your global turnover, sometimes up to 4% (and for AUS-based businesses it could be 30% of domestic turnover!), and we can see that huge organizations like Amazon[16] and WhatsApp[15] have been hit. What’s eye-opening is that these examples are not due to breaches; they are due to privacy violations from working, functioning software. Okay, I think that’s enough data on the horror stories that can and have played out; let’s take a turn to some more practical discussion: managing secrets, and one or two of the nuances of git repositories.

Sniffing Out Your Secrets (It’s Not Just Luck)

Gone are the days when an attacker had to stumble upon your secrets by accident. Today, finding hardcoded credentials is an automated industry. Attackers actively scan platforms like GitHub, GitLab, and Bitbucket using sophisticated scripts that look for high-entropy strings (like API keys) or specific variable names like AWS_SECRET_KEY. For instance, consider tools like git-hound, which lets you trawl through public repos for just about anything that matches a search term. Setting it up is super easy on Mac and Ubuntu (and I’m sure on Windows as well). The following section walks through the setup to get you sniffing credentials in no time.
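To make the “high-entropy string” idea concrete, here is a minimal sketch of how that class of scanner works. This is my own illustration, not git-hound’s actual code: it flags tokens whose Shannon entropy looks more like random key material than prose.

    import math
    import re

    # Candidate tokens: long runs of base64-ish characters, as key material often is.
    CANDIDATE = re.compile(r"[A-Za-z0-9+/_=-]{20,}")

    def shannon_entropy(s: str) -> float:
        """Bits of entropy per character, estimated from character frequencies."""
        freq = {c: s.count(c) / len(s) for c in set(s)}
        return -sum(p * math.log2(p) for p in freq.values())

    def find_suspects(text: str, threshold: float = 4.0):
        """Yield substrings that look like random key material.

        English text sits around 3 bits/char while random base64 approaches 6,
        so anything above ~4 is worth a human look.
        """
        for match in CANDIDATE.finditer(text):
            token = match.group()
            if shannon_entropy(token) > threshold:
                yield token

    if __name__ == "__main__":
        line = 'aws_secret = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"'
        for suspect in find_suspects(line):
            print(f"possible secret: {suspect}")

Real scanners layer variable-name regexes and known key prefixes on top of this, but the entropy check is what catches the keys that don’t match any known pattern.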

Setting up Git-Hound

According to the README on GitHub: “GitHound hunts down exposed API keys, secrets, and credentials across GitHub by pairing GitHub dorks with pattern matching, contextual detection, and commit-history analysis.” – https://github.com/tillson/git-hound

  1. I recommend cloning the repo first, so you have access to sample files you might need later: git clone https://github.com/tillson/git-hound.git
  2. Then pull down one of the latest releases:
    1. macOS (amd64): https://github.com/tillson/git-hound/releases/download/v3.2/git-hound_darwin_amd64.zip
    2. Linux (amd64): https://github.com/tillson/git-hound/releases/download/v3.2/git-hound_linux_amd64.zip
    3. Windows (amd64): https://github.com/tillson/git-hound/releases/download/v3.2/git-hound_windows_amd64.zip
  3. In your release folder you should have the following files: config.yml and git-hound.
  4. Next, sort out config.yml, because git-hound needs a GitHub token. To get your token, follow these steps:
    1. Click on your profile pic and go to Settings
    2. Go to “Developer Settings”
    3. Under “Personal access tokens” select “Fine-grained tokens”
    4. Make sure “Public repositories” is selected
    5. Click “Generate token”
  5. You now have your token; edit config.yml (line 6, if you have my version of the file) where it says github_access_token and paste it in. Please ensure you include the quotes (a sketch of the finished line follows this list).
  6. Last step: copy the rules/ folder from the repository into the folder with your git-hound release.
  7. You’re ready to rock!
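For reference, the finished token line in config.yml ends up looking something like this (the placeholder token is invented, and the exact line position may differ in your copy of the file):

    # config.yml - the field git-hound needs for a basic run
    github_access_token: "github_pat_XXXXXXXXXXXXXXXXXXXX"   # keep the quotes, as noted above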

Let’s have git-hound chase up some AWS secret keys and see what we get. The invocation looks something like this:
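A sketch of the run, based on the project README at the time of writing; flag names vary between releases, so treat ./git-hound --help as authoritative for yours:

    # Search queries go in on stdin; matches stream out as they are found.
    # --dig-commits extends the search into commit history rather than just HEAD.
    echo "AWS_SECRET_ACCESS_KEY" | ./git-hound --dig-commits

What comes back is a stream of matched files and commits flagged by the patterns in rules/. Treat every hit as sensitive and disclose responsibly.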


Voila! We can look up some secrets. I think in general this is great for hunting bad code patterns to investigate; if you’re wondering how horribly people can misuse, say, a crypto API, or set up IVs and keys for some crypto, people are pretty terrible at it, so it’s very entertaining! The more compute you have, the more you can find. It’s obviously an interesting problem to set up monitoring that alerts you when someone uploads code with embedded creds; I’m sure the OSINT folks will love git-hound for that.

There’s one more git-related horror story I’d like to bring to your attention, and it comes down to how git commit history works. Some of the more seasoned developers reading this will have realised that just because you remove or change code with secrets embedded in it doesn’t mean it’s removed from the repo entirely; it may still be accessible from the commit history or other cached files on disk. So let’s say you have a secret config.txt in your repo, set up like this:

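Reconstructed as commands, using the same dummy key that reappears in the remediation section later (the key is invented for illustration):

    # create a config file with an embedded secret and commit it
    echo 'api_key = "sk_live_very_secret_key_123456789"' > config.txt
    git add config.txt
    git commit -m "add application config"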

Okay, so now this config.txt is part of commit 1. Maybe later some handsome pentester named Keith points out that this is a very bad idea; the guilt overcomes you, you feel spurred to heroically remove it from your repo, and you do the following:
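Presumably something like this: open config.txt, delete the key by hand, and commit the “fix”:

    # config.txt now has the key stripped out; commit the cleaned file
    git add config.txt
    git commit -m "remove hardcoded API key"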

Fixed, right? Wrong! Check this out: you can still find the secret in the commit history, as demonstrated below:
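With the two-commit history above, either of these will happily cough the key back up:

    # walk the full diff history of the file...
    git log -p -- config.txt
    # ...or just print the file as it existed one commit ago
    git show HEAD~1:config.txt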


Pretty scary stuff. I think it’s really important people understand the nature of a git repository: it is totally agnostic to your secrets and serves as an infallible record of the absolute circus that is development in the modern age. So we have a problem now: how do we get this secret out of the git history? This is where git-filter-repo[5] comes in, a nifty Python script with the ability to overwrite all these troubles.

Erasing the Digital Footprint: git-filter-repo for Complete Remediation

Setting up git-filter-repo is easy as pie: you essentially download the single Python file and make sure you can call it from your repo:

  1. Grab the file: wget https://raw.githubusercontent.com/newren/git-filter-repo/main/git-filter-repo (optionally cp git-filter-repo git-filter-repo.py), then chmod 700 git-filter-repo
  2. Make a replacements file that looks like this: echo "sk_live_very_secret_key_123456789==>***REMOVED***" > replacements.txt
  3. Run it and watch the secrets melt away: git filter-repo --replace-text replacements.txt
  4. Confirm the secrets are removed by checking the git log.

Here’s what those instructions look like all together as one smooth terminal session, to concretely prove I know what I’m saying:
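A sketch of that session (git-filter-repo insists on running in a fresh clone; add --force at your own risk if you run it in an existing working copy):

    # fetch git-filter-repo and make it executable
    wget https://raw.githubusercontent.com/newren/git-filter-repo/main/git-filter-repo
    chmod 700 git-filter-repo

    # map the leaked literal to a scrubbed marker
    echo "sk_live_very_secret_key_123456789==>***REMOVED***" > replacements.txt

    # rewrite every commit in the history
    git filter-repo --replace-text replacements.txt

    # verify: the old diffs now show ***REMOVED*** instead of the key
    git log -p -- config.txt

Two caveats worth stating: a history rewrite means force-pushing and having every collaborator re-clone, and if the key was ever public you should rotate it regardless; scrubbing the history doesn’t un-leak anything already copied.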

Yay, no more secrets! No more stress! So this gives us the power to look up secrets in one or two instances here and there, but what about scale? What about something more sophisticated than regexes and string matching? This is where AI starts becoming a real threat to organizations that take hardcoded creds for granted.

 

Needle in a Haystack? AI Just Brought a Magnet

You don’t need a 500 IQ to guess that attackers will soon leverage Large Language Models (LLMs) to parse code more intelligently, applying context to code patterns in order to find secrets that traditional regex scanners miss. Recent movement in the academic ether shows that LLMs are growing ever more capable at sniffing out credentials [1,2,3,4,6]. It gets really scary when you start to think about scale and how fast attackers will be able to react to the knowledge of a secret. In some studies, researchers trawled through around 80 million files from a combination of sources like GitHub, WeChat, and PyPI and found around 30% of them exposing secrets; that’s on the order of 24 million files! Aside from sheer volume, I believe the compute power of scaled AI will close the gap between discovery and exploitation very, very fast. Commit the code at minute x, and by x+2 someone may already be trawling around your AWS environment, causing Amazon to panic and cut off your infrastructure for a week or two, whether or not the attacker actually managed some cool stunt. Not a good look for any company courting investors or handling personal data.

The other side of this AI coin is the emerging risk of generative AI inadvertently “learning” bad patterns and reproducing them, hardcoded secrets included [7].

Our experimental results show that NCCTs can not only return the precise piece of their training data but also inadvertently leak additional secret strings [7]

It is an automated arms race, and if your secrets are in plain text, the machines will find them before you do, and then copy the pattern into potentially thousands of innocent vibe-coding projects. The real driver behind this is memorization and the rigidity of code patterns. Unlike images, code doesn’t allow the freedom of arbitrary pixel values, nor the creative space of poetry, stories, or blog posts. For code there are, unfortunately, a few patterns that get trained in, and if the training data contains a lot of them, they are absorbed with very strong weightings and returned in generative steps very reliably.

Any seasoned code reviewer will tell you how hard it is to unlearn certain anti-patterns in humans; for machines with artificial contextualisation (no awareness of the human impact of getting things wrong), built on simple statistical patterns, it will probably be more of a nightmare. Personally, I think it’s just a matter of time before we put a saddle on the problem of static credentials, i.e. hardcoded strings with secrets in them. But patterns like (i) modulo-biased (or equivalently biased) self-styled PRNGs, (ii) implicitly seeded random number generators, (iii) GUIDs used as pseudorandom tokens, and (iv) disclosure of secrets through immutable types or type casting into objects with the wrong runtime behaviour, to list a few, will be harder to detect. Patterns that are not simply recognisable in text, but only become apparent in emergent behaviour, will be a bigger, more subtle problem to squash later down the line if we’re talking about generating mountains of code that never get reviewed properly, even with LLM-powered (purely code-analysis-driven) code review.
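To make patterns (i) and (ii) concrete, here is a small, self-contained illustration (my own example, not from any of the cited papers) of a bug that no string matcher will flag, because it only shows up in the statistics of the output:

    import collections
    import random
    import secrets
    import string

    ALPHABET = string.ascii_letters + string.digits  # 62 characters

    def biased_token(n: int = 16) -> str:
        """Anti-pattern: a non-cryptographic, implicitly seeded PRNG, plus a
        modulo step that over-selects the first 256 % 62 = 8 characters."""
        rng = random.Random()  # Mersenne Twister; state recoverable from output
        return "".join(ALPHABET[rng.getrandbits(8) % 62] for _ in range(n))

    def safe_token(n: int = 16) -> str:
        """secrets.choice draws uniformly from the OS CSPRNG: no bias, no seed."""
        return "".join(secrets.choice(ALPHABET) for _ in range(n))

    if __name__ == "__main__":
        # Measure the bias: the first 8 characters land ~25% more often (a 5:4 ratio).
        counts = collections.Counter(c for _ in range(20_000) for c in biased_token())
        head = sum(counts[c] for c in ALPHABET[:8]) / 8
        tail = sum(counts[c] for c in ALPHABET[8:]) / 54
        print(f"avg hits, first 8 chars: {head:.0f}; remaining 54: {tail:.0f}")

Both functions return strings that look identical in a code review; only the distribution of biased_token’s output gives the game away, which is exactly why this class of bug survives text-only analysis.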

So in closing, I would say that the idea of unguided, unreviewed vibe coding seems magical, but it could also mean you spend your entire security testing budget fixing thousands of CWE-786s that mean absolutely nothing to your security posture. And hardcoding secrets may have a bigger impact than we anticipate in the world of vibe coding, especially for setups where private repos are being used to train locally managed LLMs. One way to solve the problem is to throw LLMs at it, but many LLM-driven solutions are costly, require a lot of natural resources to run, and may not uncover all instances nor bring context to findings as impactfully as good ol’ meat n’ bone does.

Hope you enjoyed the blog post, leave a comment and share, thanks for reading!

Takeaways

  • Embedding secrets in code can escalate into compliance problems.
  • Secrets can persist in code and git histories; use git-filter-repo to replace them before they propagate through the wider repo.
  • AI can rapidly detect these secrets, and with the compute accessible today, it can happen fast.
  • If academics are catching on, attackers will catch on too; it’s only a matter of time.

References and Reading

  1. Detecting Hard-Coded Credentials in Software Repositories via LLMs – https://dl.acm.org/doi/full/10.1145/3744756 
  2. Secret Breach Detection in Source Code with Large Language Models – https://arxiv.org/abs/2504.18784 
  3. Large Language Models Versus Static Code Analysis Tools: A Systematic Benchmark for Vulnerability Detection – https://arxiv.org/abs/2508.04448 
  4. Evaluating Large Language Models in detecting Secrets in Android Apps – https://arxiv.org/abs/2510.18601 
  5. Git Filter Repo – https://github.com/newren/git-filter-repo 
  6. Hey, Your Secrets Leaked! Detecting and Characterizing Secret Leakage in the Wild – https://kee1ongz.github.io/paper/sp25-secret.pdf 
  7. Your Code Secret Belongs to Me: Neural Code Completion Tools Can Memorize Hard-Coded Credentials – https://arxiv.org/abs/2309.07639 
  8. Reflecting on the 2023 Toyota Data Breach – https://cloudsecurityalliance.org/blog/2025/07/21/reflecting-on-the-2023-toyota-data-breach 
  9. Toyota Suffered a Data Breach by Accidentally Exposing A Secret Key Publicly On GitHub – https://blog.gitguardian.com/toyota-accidently-exposed-a-secret-key-publicly-on-github-for-five-years/ 
  10. Samsung and Nvidia are the latest companies to involuntarily go open-source leaking company secrets – https://blog.gitguardian.com/samsung-and-nvidia-are-the-latest-companies-to-involuntarily-go-open-source-potentially-leaking-company-secrets/ 
  11. Source Code as a Vulnerability – A Deep Dive into the Real Security Threats From the Twitch Leak – https://blog.gitguardian.com/security-threats-from-the-twitch-leak/
  12. 12-22-2022: Notice of Security Incident – https://blog.lastpass.com/posts/notice-of-recent-security-incident 
  13. Uber Breaches (2014 & 2016) – https://www.breaches.cloud/incidents/uber/
  14. 2016 Data Security Incident – https://www.uber.com/newsroom/2016-data-incident/ 
  15. WhatsApp issued second-largest GDPR fine of €225m – https://www.bbc.com/news/technology-58422465 
  16. Amazon loses court fight against record $812 mln Luxembourg privacy fine – https://www.reuters.com/technology/amazon-loses-court-fight-against-record-812-mln-luxembourg-privacy-fine-2025-03-19/ 

 

 

Keith is the founder of KMSecurity (Pty) Ltd. and a passionate security researcher with a storied career of helping clients all over the world become aware of the information security risks they face. Keith has worked for clients in Europe, the Americas, and Asia, gaining experience assessing a plethora of industries and technologies. That experience leaves him ready to tackle any application, network, or organisation his clients need help with, and he is always eager to learn new environments. As a security researcher, Keith has uncovered bugs in some prominent applications and services, including the Google Chrome browser, various Google services, and components of the Mozilla web browser.
