PII Detection at the Edge: WASM, Entropy, and Zero Trust
Your server-side secret scanner found the API key after it already hit the remote repo, CI logs, and three caches. PII-Shield runs WASM at the edge to catch secrets before they cross any trust boundary.
Your server-side secret scanner found the API key. Great. It found it after the key was already pushed to a remote repository, transmitted over a network, and cached in CI logs. The secret was exposed the moment it left the developer’s machine.
I built PII-Shield to solve this timing problem. It runs as a WebAssembly module at the edge — on the developer’s machine, inside the GitHub Action runner, before the code goes anywhere else. If a secret or PII value is in the diff, PII-Shield catches it before it crosses a trust boundary.
Why Server-Side Scanning Is Too Late
The typical secret scanning architecture looks like this: code gets pushed, a webhook fires, a server-side scanner analyzes the commit, and if it finds a secret, it sends an alert. Some services even auto-revoke the key.
The problem is everything that happened between “pushed” and “alert.” The secret traveled from the developer’s machine to GitHub’s servers. GitHub’s infrastructure processed it. CI runners downloaded it. Log collectors might have captured it. By the time the scanner fires, the secret has touched multiple systems you do not control.
This is not a theoretical concern. GitHub’s own secret scanning documentation acknowledges the gap. Their push protection feature helps, but it only covers secrets that match known provider patterns. A custom database password, a private signing key, or a patient identifier in a test fixture won’t match any pattern in their library.
The zero trust position is clear: do not send sensitive data to systems that do not need it. Scan before you transmit, not after.
How PII-Shield Works
PII-Shield is a WASM module compiled from Rust. It runs inside the CodeGuard GitHub Action and can also run locally as a pre-commit hook. The detection pipeline has three stages.
Stage 1: Pattern matching. Regex-based detection for known secret formats. AWS access keys (AKIA prefix, 20 characters), GitHub tokens (ghp_, gho_, ghs_ prefixes), Stripe keys (sk_live_, pk_live_), private key headers (BEGIN RSA PRIVATE KEY), and about 40 other patterns. This catches the obvious stuff — the secrets that have a predictable format.
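To make the Stage 1 idea concrete, here is a minimal sketch of two of those format checks. PII-Shield itself uses regex patterns; these hand-rolled equivalents show the same shape, and the function names are mine, not the module's API.

```rust
// Illustrative Stage 1 checks (simplified; the real module uses regexes).

/// AWS access key IDs: "AKIA" prefix, 20 uppercase alphanumerics total.
fn looks_like_aws_key(s: &str) -> bool {
    s.len() == 20
        && s.starts_with("AKIA")
        && s.chars().all(|c| c.is_ascii_uppercase() || c.is_ascii_digit())
}

/// GitHub tokens: ghp_/gho_/ghs_ prefix followed by an alphanumeric body.
fn looks_like_github_token(s: &str) -> bool {
    ["ghp_", "gho_", "ghs_"].iter().any(|&p| s.starts_with(p))
        && s.len() > 4
        && s[4..].chars().all(|c| c.is_ascii_alphanumeric())
}

fn main() {
    // AWS's documented example key ID.
    assert!(looks_like_aws_key("AKIAIOSFODNN7EXAMPLE"));
    assert!(looks_like_github_token("ghp_x0y1z2a3b4c5d6e7f8"));
    assert!(!looks_like_aws_key("hello_world_constant"));
    println!("pattern checks ok");
}
```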
Stage 2: Entropy analysis. For strings that don’t match any known pattern, PII-Shield calculates Shannon entropy over the string’s character distribution. A random 32-character hex string approaches 4.0 bits of entropy per character, the maximum for a 16-symbol alphabet. Typical English words and variable names, with their repeated letters and small effective alphabets, land closer to 2 or 3. PII-Shield flags strings with entropy above a configurable threshold (default: 3.5 bits per character) and length above 16 characters.
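The calculation itself fits in a few lines. This is a minimal Shannon-entropy sketch matching the Stage 2 description — bits per character over the string's byte distribution, with the documented defaults (3.5 bits/char, length above 16) as the trigger; it is not PII-Shield's exact implementation.

```rust
// Shannon entropy in bits per character over a string's byte counts.
fn shannon_entropy(s: &str) -> f64 {
    let bytes = s.as_bytes();
    if bytes.is_empty() {
        return 0.0;
    }
    let mut counts = [0usize; 256];
    for &b in bytes {
        counts[b as usize] += 1;
    }
    let n = bytes.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// Stage 2 trigger with the documented defaults.
fn entropy_flags(s: &str) -> bool {
    s.len() > 16 && shannon_entropy(s) > 3.5
}

fn main() {
    // 32 distinct characters: exactly 5.0 bits per character.
    assert!(entropy_flags("abcdefghijklmnopqrstuvwxyz012345"));
    // Repetition drives entropy toward zero regardless of length.
    assert!(!entropy_flags("aaaaaaaaaaaaaaaaaaaaaaaa"));
    println!("entropy checks ok");
}
```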
This is where most scanners stop, and it is where most false positives come from. High entropy does not mean “secret.” SHA-256 hashes, UUID v4 values, and base64-encoded data are all high-entropy strings that are often perfectly safe to commit.
Stage 3: Context classification. PII-Shield examines the surrounding code to reduce false positives. A high-entropy string assigned to a variable named password or api_key gets escalated. The same string assigned to content_hash or file_checksum gets downgraded. The classifier uses variable names, function names, file paths, and nearby comments to make this call.
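The variable-name signal from Stage 3 can be sketched as follows. The real classifier also weighs function names, file paths, and nearby comments; the keyword lists below are illustrative, not PII-Shield's actual lists.

```rust
// Sketch of Stage 3's variable-name signal (keyword lists are examples).
#[derive(Debug, PartialEq)]
enum Verdict {
    Escalate,
    Downgrade,
    Neutral,
}

fn classify_variable(var_name: &str) -> Verdict {
    let name = var_name.to_ascii_lowercase();
    let sensitive = ["password", "secret", "api_key", "token", "credential"];
    let benign = ["hash", "checksum", "digest", "fingerprint", "uuid"];
    if sensitive.iter().any(|&k| name.contains(k)) {
        Verdict::Escalate
    } else if benign.iter().any(|&k| name.contains(k)) {
        Verdict::Downgrade
    } else {
        Verdict::Neutral
    }
}

fn main() {
    assert_eq!(classify_variable("api_key"), Verdict::Escalate);
    assert_eq!(classify_variable("content_hash"), Verdict::Downgrade);
    assert_eq!(classify_variable("user_input"), Verdict::Neutral);
    println!("classifier checks ok");
}
```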
All three stages run in a single WASM execution. The total time for a typical PR diff is under 200 milliseconds. No network calls. No external API. The code and the analysis never leave the machine where the WASM module is running.
The SHA-256 Hash Whitelisting Problem
Here is the false positive that drove me to build the context classifier.
GuardSpine’s evidence bundles contain SHA-256 hashes. Every evidence item has a content hash. The hash chain is made of hashes. The root hash is a hash. A typical evidence bundle contains 10 to 30 high-entropy strings that are not secrets — they are integrity proofs.
Without whitelisting, PII-Shield flagged every single evidence bundle change as a potential secret leak. The entropy stage would fire on every hash, and every PR that touched the evidence system would get blocked with false positives.
The fix is a whitelist system that operates on two levels:
Field-level whitelist. Any JSON field matching *_hash (which covers content_hash, chain_hash, and root_hash) or bundle_id is excluded from entropy analysis. The field name itself is sufficient context — these fields are designed to hold high-entropy values.
Pattern-level whitelist. You can define custom patterns in the policy file for your repository. If your codebase stores cryptographic material in specific fields or file paths, you add those patterns and PII-Shield stops flagging them.
pii_shield:
  entropy_threshold: 3.5
  min_length: 16
  whitelist_fields:
    - "*_hash"
    - "bundle_id"
    - "checksum"
    - "fingerprint"
  whitelist_paths:
    - "test/fixtures/**"
    - "migrations/**"
The whitelist is additive. GuardSpine’s defaults cover the common cases. Your policy file extends them.
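Matching an entry like *_hash against a field name needs only a tiny amount of glob support. Here is a minimal sketch handling leading-* suffix globs and exact names; the real policy engine may support fuller glob syntax.

```rust
// Minimal field-level whitelist matcher: "*suffix" globs and exact names.
fn field_whitelisted(field: &str, patterns: &[&str]) -> bool {
    patterns.iter().any(|&p| {
        if let Some(suffix) = p.strip_prefix('*') {
            field.ends_with(suffix)
        } else {
            field == p
        }
    })
}

fn main() {
    let patterns = ["*_hash", "bundle_id", "checksum", "fingerprint"];
    assert!(field_whitelisted("content_hash", &patterns));
    assert!(field_whitelisted("bundle_id", &patterns));
    assert!(!field_whitelisted("api_key", &patterns));
    println!("whitelist checks ok");
}
```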
Why WASM Instead of Native Code
I get asked this one a lot. Rust compiles to native binaries that would be faster than WASM. Why add the WASM layer?
Three reasons.
Portability. A WASM module runs anywhere a WASM runtime exists. Linux runners, macOS runners, Windows runners, inside a browser for a future web-based scan tool — one binary, every platform. Native Rust would require cross-compilation and separate binaries for each target.
Sandboxing. WASM modules execute in a sandbox with no filesystem access, no network access, and no system calls unless explicitly granted. PII-Shield takes a diff as input and returns findings as output. It cannot exfiltrate the code it is scanning. If you are worried about supply chain attacks in your security tooling — and you should be — the WASM sandbox is a hard boundary.
Auditability. The WASM binary is deterministic. Same source code, same compiler version, same flags — same binary. You can reproduce the build and verify the module you are running matches the source code in the repository. This matters when the tool that scans for secrets is itself a trust dependency.
The performance cost is real but small. WASM runs at roughly 80% of native speed for compute-bound tasks. PII-Shield’s bottleneck is regex matching, which is CPU-bound, and the typical analysis completes in under 200ms. For a pre-commit hook or CI step, that is invisible.
PII Beyond Secrets
Secret detection is the entry point, but PII-Shield also catches personally identifiable information that should not be in source code.
Email addresses in test fixtures. Phone numbers in seed data. Social security numbers in configuration examples. Medical record numbers in comments. These are not secrets in the traditional sense — they are not authentication credentials. But they are regulated data under GDPR, HIPAA, CCPA, and a growing list of frameworks.
The pattern matching stage includes PII patterns: email (RFC 5322), phone numbers (E.164 and North American formats), social security numbers (NNN-NN-NNNN), and credit card numbers (Luhn validation). Each pattern category can be enabled or disabled independently. A fintech company might care about credit card numbers. A healthcare company might care about medical record numbers. A GDPR-regulated company might flag any email address that appears outside of a designated test data directory.
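The Luhn validation mentioned above is what keeps arbitrary 16-digit runs from being flagged as card numbers. A minimal sketch of the check (the length floor of 12 is my simplification; real card numbers run 12 to 19 digits):

```rust
// Luhn checksum: from the right, double every second digit, subtract 9
// from any result over 9, and require the total to be divisible by 10.
fn luhn_valid(candidate: &str) -> bool {
    let ds: Vec<u32> = candidate
        .chars()
        .filter_map(|c| c.to_digit(10))
        .collect();
    if ds.len() < 12 {
        return false; // too short to be a card number
    }
    let sum: u32 = ds
        .iter()
        .rev()
        .enumerate()
        .map(|(i, &d)| {
            if i % 2 == 1 {
                let doubled = d * 2;
                if doubled > 9 { doubled - 9 } else { doubled }
            } else {
                d
            }
        })
        .sum();
    sum % 10 == 0
}

fn main() {
    // Classic Visa test number passes; off-by-one fails.
    assert!(luhn_valid("4111 1111 1111 1111"));
    assert!(!luhn_valid("4111 1111 1111 1112"));
    println!("luhn checks ok");
}
```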
The Evidence Trail
When PII-Shield finds something, it does not just block the commit. It creates an evidence item that goes into the review bundle.
The finding includes: what was detected, which pattern or entropy rule triggered, the file and line number, the context classification decision (escalated, downgraded, or neutral), and a recommended action (block, warn, or log).
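A finding record mirroring those fields might look like the sketch below. Every field and type name here is hypothetical; PII-Shield's actual evidence schema may differ.

```rust
// Hypothetical shape of a finding record (names are illustrative).
#[derive(Debug)]
enum ContextDecision {
    Escalated,
    Downgraded,
    Neutral,
}

#[derive(Debug)]
enum RecommendedAction {
    Block,
    Warn,
    Log,
}

#[derive(Debug)]
struct Finding {
    detected: String,          // what was detected (redacted summary)
    rule: String,              // which pattern or entropy rule triggered
    file: String,
    line: u32,
    context: ContextDecision,  // escalated, downgraded, or neutral
    action: RecommendedAction, // block, warn, or log
}

fn main() {
    let finding = Finding {
        detected: "high-entropy string (redacted)".into(),
        rule: "entropy > 3.5 bits/char".into(),
        file: "config/settings.py".into(),
        line: 42,
        context: ContextDecision::Escalated,
        action: RecommendedAction::Block,
    };
    println!("{:?}", finding);
}
```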
This means your compliance team can answer the question: “How many PII incidents did the scanner catch before they left the development environment?” Not how many leaked and were caught. How many were stopped at the source.
That distinction matters for SOC 2 Type II and HIPAA. Preventing incidents is stronger evidence than detecting and remediating them.
PII-Shield ships inside codeguard-action. Install the Action and it runs automatically on every PR. Want to see entropy-based detection running on your codebase? Book a walkthrough.