How I Built a JPEG Encoder Fingerprint DB

Upload a JPEG to snapWONDERS and one of the things the analysis returns is an “encoder fingerprint” — the name of the software that created the file. Not from metadata you could fake with a text editor. From the mathematical structure of the compression tables baked into the file at the moment it was encoded.

This is DQT fingerprinting. It has been a known forensic technique for about as long as JPEG analysis has existed: the method was well understood and the data model was clear. What it needed was the engineering time to build the accumulation infrastructure properly: a clean provenance model, a trust pipeline that guards against poisoned data, and a seed database derived from first principles rather than copied from existing tools. That work is now done.

What a DQT table actually is

Every JPEG file is compressed using quantisation tables — arrays of 64 numbers arranged in an 8×8 block that determine how aggressively each frequency component of the image is compressed. High-frequency detail (fine texture, sharp edges) gets compressed more aggressively; low-frequency structure (colour gradients, large shapes) is preserved more faithfully.

These tables are written into the file ahead of the compressed image data, in segments introduced by the DQT (Define Quantisation Table) marker. Encoders tend to ship their own characteristic tables, chosen for a combination of technical performance and, in some cases, IP reasons. The libjpeg library, which underlies most of the internet’s JPEG handling, scales the example tables from Annex K of the ISO/IEC 10918-1 standard with a fixed quality formula. Adobe Photoshop uses different tables. Canon camera firmware uses different tables still. A smartphone running stock Android uses different tables from one running Samsung’s camera app.
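
To make the file layout concrete, here is a minimal sketch of pulling the quantisation tables out of a JPEG byte string. The function name `read_dqt_tables` is hypothetical and this is not the snapWONDERS parser; it only assumes the standard segment layout (marker `0xFFDB`, a two-byte big-endian length, then one or more tables, each prefixed by a precision/table-id byte).

```python
import struct

# Markers that carry no length field: SOI, EOI, TEM and RST0-RST7.
STANDALONE = {0xD8, 0xD9, 0x01} | set(range(0xD0, 0xD8))

def read_dqt_tables(data: bytes) -> dict[int, list[int]]:
    """Return {table_id: 64 coefficients in stored (zigzag) order}."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    tables: dict[int, list[int]] = {}
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:            # fill byte before a marker, skip it
            pos += 1
            continue
        if marker in STANDALONE:
            pos += 2
            continue
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segment = data[pos + 4:pos + 2 + length]
        if marker == 0xDB:            # DQT: may hold several tables
            i = 0
            while i < len(segment):
                precision, table_id = segment[i] >> 4, segment[i] & 0x0F
                i += 1
                size = 64 if precision == 0 else 128
                raw = segment[i:i + size]
                if len(raw) != size:
                    raise ValueError("truncated DQT segment")
                # Precision 0 means 8-bit values, otherwise 16-bit big-endian.
                tables[table_id] = (list(raw) if precision == 0
                                    else list(struct.unpack(">64H", raw)))
                i += size
        if marker == 0xDA:            # SOS: entropy-coded data follows
            break
        pos += 2 + length
    return tables
```

The values come back in the order they are stored in the file, which is zigzag order rather than row-by-row. The sketch stops at the first start-of-scan marker, so tables redefined later in a multi-scan file would be missed.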

These differences are stable. They don’t vary between images from the same encoder at the same quality level. An Adobe Photoshop JPEG at quality 85 has the same DQT tables regardless of what the image actually shows.

That makes them a fingerprint.

The constraint: I couldn’t use existing databases

The obvious approach would have been to copy from existing tools. Several open-source JPEG utilities maintain encoder table databases compiled from years of accumulated research. But those tools are GPL-licensed. That data is their authors’ work. I wasn’t going to copy it.

The database had to be built from sources I could authoritatively claim. There turned out to be two.

Source 1: The maths

For libjpeg — the encoder underlying most image processing libraries, Android camera firmware, mirrorless cameras, GIMP, LibreOffice, and most browser canvas-to-JPEG operations — the DQT tables at any quality level are mathematically derivable from the ISO/IEC 10918-1 Annex K base matrices.

The formula is deterministic. Given a quality level 1–100, the coefficient at each of the 64 positions in the luma and chroma tables is calculable. I implemented this as a pure in-memory function — no database query at all. One hundred entries, all derivable from the standard, all verifiable.
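
As an illustration of that derivation, the sketch below follows the quality scaling used by the reference libjpeg source: a quality below 50 maps to a scale of 5000/quality, anything else to 200 - 2 × quality, and each Annex K base value is multiplied by the scale, rounded, and clamped. The function names are hypothetical, and the upper clamp of 255 assumes baseline-compatible output (libjpeg allows larger values when baseline is not forced).

```python
# Example tables from ISO/IEC 10918-1 Annex K (K.1 luminance, K.2 chrominance),
# listed row by row (natural order, not the zigzag order used inside a file).
BASE_LUMA = [
    16, 11, 10, 16, 24, 40, 51, 61,
    12, 12, 14, 19, 26, 58, 60, 55,
    14, 13, 16, 24, 40, 57, 69, 56,
    14, 17, 22, 29, 51, 87, 80, 62,
    18, 22, 37, 56, 68, 109, 103, 77,
    24, 35, 55, 64, 81, 104, 113, 92,
    49, 64, 78, 87, 103, 121, 120, 101,
    72, 92, 95, 98, 112, 100, 103, 99,
]
BASE_CHROMA = [
    17, 18, 24, 47, 99, 99, 99, 99,
    18, 21, 26, 66, 99, 99, 99, 99,
    24, 26, 56, 99, 99, 99, 99, 99,
    47, 66, 99, 99, 99, 99, 99, 99,
] + [99] * 32

def scale_factor(quality: int) -> int:
    """Map a 1-100 quality setting to libjpeg's percentage scale."""
    quality = max(1, min(100, quality))
    return 5000 // quality if quality < 50 else 200 - 2 * quality

def libjpeg_table(base: list[int], quality: int) -> list[int]:
    """Scale a base table, round to nearest, clamp to 1..255 (baseline assumed)."""
    scale = scale_factor(quality)
    return [max(1, min(255, (value * scale + 50) // 100)) for value in base]

def libjpeg_tables(quality: int) -> tuple[list[int], list[int]]:
    """Luma and chroma tables for one quality level, computed on demand."""
    return libjpeg_table(BASE_LUMA, quality), libjpeg_table(BASE_CHROMA, quality)
```

At quality 50 the scale is 100, so the result is the Annex K tables unchanged; at quality 100 every coefficient clamps to 1. A DQT segment stores coefficients in zigzag order, so these row-by-row tables would need reordering before being compared with bytes read from a file.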

This covers the single largest category of JPEGs in the wild. Most images on the internet were created by software that uses libjpeg.

Source 2: Live accumulation

From the point the system was deployed, every JPEG analysed on snapWONDERS feeds the database. A matching DQT hash increments an existing entry’s confirmation count. A non-matching hash creates a new candidate record, and its seen count starts climbing. The database is self-building — it grows with every upload.
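
The matching step can be sketched in a few lines. Everything here is an assumption for illustration: the choice of SHA-256, the serialisation of the tables, and the two plain dictionaries standing in for the verified and candidate stores. None of it is the production schema.

```python
import hashlib

def dqt_hash(tables: dict[int, list[int]]) -> str:
    """Digest the tables in a canonical order so equal tables give equal hashes."""
    digest = hashlib.sha256()
    for table_id in sorted(tables):
        digest.update(bytes([table_id]))
        for value in tables[table_id]:
            digest.update(value.to_bytes(2, "big"))  # fits 8- and 16-bit tables
    return digest.hexdigest()

def record(verified: dict[str, int], candidates: dict[str, int],
           tables: dict[int, list[int]]) -> str:
    """Count one upload against the verified entry, or against a candidate."""
    key = dqt_hash(tables)
    if key in verified:
        verified[key] += 1                            # confirmation count
        return "confirmed"
    candidates[key] = candidates.get(key, 0) + 1      # seen count
    return "candidate"
```

Hashing the table id along with the values means the same 64 numbers used as table 0 in one file and table 1 in another produce different digests, which is one possible design choice rather than a requirement.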

The challenge here is that EXIF metadata can be faked. A file claiming Make: Canon, Model: EOS 5D Mark IV doesn’t have to have been taken by that camera — those are text fields, trivially editable. A naive system that simply trusted EXIF attribution could be poisoned.

The accumulation pipeline addresses this by requiring consistent attribution across many independent uploads before any candidate is automatically promoted to the verified database. Candidates that show conflicting device claims across different contributing files are flagged for mandatory manual review and cannot be automatically promoted regardless of volume. The practical effect is that genuine camera firmware fingerprints — seen consistently across many real uploads from the same device model — rise naturally, while anything with inconsistent or suspicious attribution stays quarantined until a human looks at it.
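
A rough sketch of that promotion rule follows. The threshold value, the status strings, and the function name are all hypothetical, and the sketch does not attempt to model what makes uploads independent; it only shows that a conflict blocks promotion no matter how large the counts are.

```python
from collections import Counter

MIN_UPLOADS = 25  # hypothetical threshold; the real value is not stated

def review_status(claims: Counter, min_uploads: int = MIN_UPLOADS) -> str:
    """Decide what happens to a candidate.

    `claims` maps each EXIF (make, model) pair seen with this DQT hash
    to the number of uploads that asserted it.
    """
    if len(claims) > 1:
        return "manual_review"   # conflicting device claims: never auto-promote
    if sum(claims.values()) >= min_uploads:
        return "promote"         # one consistent claim, seen often enough
    return "pending"             # consistent so far, but too few uploads
```

For example, `review_status(Counter({("Canon", "EOS R5"): 40}))` returns `"promote"`, while adding a single upload that claims a different device to the same counter flips the result to `"manual_review"`.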

What it shows in the analysis report

When a JPEG’s DQT hash matches a verified entry, the analysis report shows an “Encoder Fingerprint” field: libjpeg q85, Adobe Photoshop q10, Canon EOS R5 firmware, or whatever the match resolves to.

This is meaningfully different from the Software EXIF tag. The Software tag is text: it can be deleted or modified, or it may never have been set in the first place. The DQT fingerprint comes from the mathematical structure of the compressed image data itself. You cannot change the quantisation tables after the fact without re-encoding the entire image, which would leave its own evidence.

For forensic purposes, a mismatch is the interesting case. A file claiming to be an unedited camera original whose DQT tables match Photoshop’s signature at a specific quality level has almost certainly been re-saved by Photoshop, regardless of what the metadata says.
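
As a toy illustration of that check, the sketch below compares a claimed camera model with a fingerprint label. The label format and the substring comparison are assumptions; a real report would compare structured attribution rather than display strings.

```python
from typing import Optional

def attribution_mismatch(exif_model: Optional[str],
                         fingerprint: Optional[str]) -> bool:
    """True when both values are known and the fingerprint label
    does not mention the camera model the metadata claims."""
    if not exif_model or not fingerprint:
        return False  # nothing to compare, so no mismatch can be asserted
    return exif_model.lower() not in fingerprint.lower()
```

With the example labels from this post, a file claiming `Canon EOS R5` whose fingerprint resolves to `Adobe Photoshop q10` is flagged, while one that resolves to `Canon EOS R5 firmware` is not.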

What this has to do with snapWONDERS Vaultify

Here’s the part that took me a while to articulate clearly.

The DQT database is a map of how every major JPEG encoder writes its characteristic quantisation tables. That knowledge tells you what statistical fingerprint any JPEG should have — and conversely, which statistical properties are stable, characteristic, and therefore cannot be disturbed without destroying the fingerprint.

When Vaultify hides encrypted data inside a JPEG, it operates on the pixel values — not the compression tables in the file header. The DQT coefficients are untouched. The statistical properties the fingerprint database maps stay intact.

But deeper than that: understanding how JPEG quantisation works, how lossy compression interacts with different regions of an image, and where the statistical properties that forensic detectors measure actually live — that understanding is what makes it possible to embed data in ways that don’t raise flags in steganalysis. The forensic work and the steganography work draw from exactly the same knowledge base. One reveals what a file contains. The other puts something into it. Both require understanding the file’s mathematical structure at the same level of depth.

I’ve started thinking of snapWONDERS and Vaultify as two directions of the same work. The analysis exposes signatures. The embedding avoids disturbing them. Two sides of the same coin — and you cannot build one without deeply understanding the other.

The database is live and continuously growing. If you upload a JPEG to snapWONDERS, there’s a good chance the analysis can tell you exactly what created it. And if there isn’t a match yet, your upload contributes to building one.

Try it — upload a JPEG to snapWONDERS