Live Forensic Intelligence

Forensic Dossiers

Multi-layer cryptographic evidence of training data provenance. Each dossier combines structural probes, deep crystallography, resonance fingerprinting, behavioral analysis, and genomic correlation. Dossiers are generated autonomously and are machine-verifiable.

How to read forensic evidence

This registry presents converging forensic signals across monitored models. Each card combines directional evidence, confidence estimates, similarity fingerprints, and internal activation patterns to help interpret whether a model shows signs consistent with a tracked evidence profile.

PEER-REVIEWED STUDY PUBLISHED

Verbatim in the Weights

Quantifying Copyright Memorization Across 81 Open-Weight Language Models from 12 Organizations. Three novel forensic methods reveal systematic, industry-wide memorization of copyrighted material.

DOI: 10.5281/zenodo.19431804

Dataset Fingerprinting

METHOD 1

Does the model know the exact published text, or just the story? A ratio above 1.0 indicates that the model has memorized the specific character sequence.

Memorization Scaling Law

METHOD 2

Larger models memorize more copyrighted content. Bits per character (BPC) drops by 10-14% with each step up in model size, across three independent model families.
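As a rough illustration of the metric behind this card, BPC can be computed from per-token log-probabilities. The sketch below assumes the log-probabilities are natural logs, as most inference libraries return them; the function names `bits_per_character` and `relative_bpc_reduction` are hypothetical and are not taken from the study.

```python
import math


def bits_per_character(token_logprobs, text):
    """Total negative log-likelihood in bits, divided by character count.

    token_logprobs: natural-log probabilities the model assigned to each
    token of `text` (assumption: natural log, one value per token).
    """
    if not text:
        raise ValueError("text must be non-empty")
    total_nats = -sum(token_logprobs)
    total_bits = total_nats / math.log(2)  # convert nats to bits
    return total_bits / len(text)


def relative_bpc_reduction(bpc_smaller_model, bpc_larger_model):
    """Fractional drop in BPC when moving from a smaller to a larger model."""
    return (bpc_smaller_model - bpc_larger_model) / bpc_smaller_model


if __name__ == "__main__":
    # Toy input: 8 tokens, each with probability 1/2, over a 4-character text.
    print(bits_per_character([-math.log(2)] * 8, "abcd"))
    print(relative_bpc_reduction(1.00, 0.88))
```

Lower BPC means the model finds the text less surprising. Under this reading, a 10-14% drop corresponds to `relative_bpc_reduction` returning a value between 0.10 and 0.14.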

Evasion Detection

METHOD 3

Does RLHF or instruction tuning remove copyright knowledge? Min-K% membership inference (K = 20) is run on base vs. instruct model pairs.
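The Min-K% idea can be sketched in a few lines: score a text by the average log-probability of its least likely K% of tokens, then compare that score between a base model and its instruct variant. This is a minimal sketch under assumptions; the names `min_k_score` and `base_vs_instruct_gap` are hypothetical, and the study's exact scoring and thresholds are not given on this page.

```python
def min_k_score(token_logprobs, k_percent=20):
    """Mean log-probability of the lowest k_percent of tokens.

    A higher (less negative) score suggests that even the hardest tokens
    were predictable, which is the signal membership inference looks for.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    # Integer ceiling of len * k_percent / 100, with at least one token.
    n = max(1, (len(token_logprobs) * k_percent + 99) // 100)
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


def base_vs_instruct_gap(base_logprobs, instruct_logprobs, k_percent=20):
    """Positive gap: the base model finds the text more familiar."""
    return min_k_score(base_logprobs, k_percent) - min_k_score(
        instruct_logprobs, k_percent
    )


if __name__ == "__main__":
    base = [-0.1, -0.2, -0.1, -0.3, -0.2, -0.1, -0.4, -0.2, -0.1, -0.5]
    instruct = [-0.2, -0.3, -0.2, -0.6, -0.3, -0.2, -0.9, -0.4, -0.2, -1.1]
    print(min_k_score(base), min_k_score(instruct))
    print(base_vs_instruct_gap(base, instruct))
```

A gap near zero on a known text would suggest that tuning left the underlying familiarity largely intact; how large a gap counts as meaningful is a calibration question this sketch does not address.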

Extraction Bypass

CRITICAL

Simple prompts extract verbatim copyrighted text from safety-trained models. No jailbreaking required.

Cross-Organizational BPC

81 MODELS

Average BPC by text category. Green = low BPC (memorized); red = high BPC (unfamiliar). The pattern holds across all organizations.

Pipeline Progress

Completed Dossiers

Each card presents converging evidence, from the interpretive verdict down to the underlying signal layers.

⚠ CDS Recalculation Notice: CDS (Calibrated Differential Surprise) and Gene Y scores shown in dossier cards are being recalculated with validated Pile test texts (v4). Current values are preliminary and may change. Dataset Genome, Gotcha Report, and ZK Certificates sections below use independently verified data.

Pipeline Queue

Models awaiting completion of one or more forensic analysis stages.

Dataset Genome Project

Cross-model × cross-dataset forensic fingerprinting. Each cell shows size-normalized familiarity (0–100). A higher score means the model is disproportionately familiar with that dataset.
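One plausible way to turn a raw BPC grid into a size-normalized 0–100 score is sketched below. It is an assumption, not the project's documented procedure: each model's mean BPC is subtracted as a stand-in for size normalization, the residuals are z-scored per dataset across models, and the z-score is mapped to 0–100 with a normal CDF. The function name `familiarity_scores` and the 0–100 mapping are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def familiarity_scores(bpc):
    """bpc: {model: {dataset: BPC}} -> {model: {dataset: score in 0..100}}."""
    models = list(bpc)
    datasets = list(bpc[models[0]])
    # Step 1: remove each model's overall level (proxy for model size).
    residual = {
        m: {d: bpc[m][d] - mean(list(bpc[m].values())) for d in datasets}
        for m in models
    }
    scores = {m: {} for m in models}
    for d in datasets:
        column = [residual[m][d] for m in models]
        mu, sigma = mean(column), pstdev(column)
        for m in models:
            # Step 2: z-score across models for this dataset.
            z = 0.0 if sigma == 0 else (residual[m][d] - mu) / sigma
            # Step 3: lower-than-expected BPC maps to a higher score.
            scores[m][d] = 100 * NormalDist().cdf(-z)
    return scores


if __name__ == "__main__":
    toy = {
        "model_a": {"dataset_x": 1.0, "dataset_y": 1.0},
        "model_b": {"dataset_x": 0.8, "dataset_y": 0.8},
        "model_c": {"dataset_x": 0.5, "dataset_y": 0.9},
    }
    for model, row in familiarity_scores(toy).items():
        print(model, {d: round(s, 1) for d, s in row.items()})
```

In the toy grid, `model_c` has unusually low BPC on `dataset_x` relative to its own average, so it scores above 50 there, while the two models with flat profiles score below 50.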

🚨 Anomalous Familiarity Report

Forensic evidence of training data familiarity patterns that are not explained by developers' declared training sources. Each finding is backed by BPC measurements on 1,128 real text excerpts, size-normalized z-scores, and cross-family validation.

Methodology Note: These findings show statistical anomalies in model behavior, not definitive proof of training data inclusion. High familiarity with a dataset may result from: (a) undisclosed training data, (b) content overlap between web datasets, (c) knowledge distillation from larger models, or (d) emergent generalization. We present the evidence and let readers draw conclusions. All measurements are reproducible from published model weights and public dataset excerpts.

ZK Forensic Certificates

Cryptographically signed certificates for each model. Each certificate combines Pedersen commitments (secp256k1) to the BPC values, a Merkle tree over the genome vector, a Fiat-Shamir proof of knowledge, and an Ed25519 signature.
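Of the certificate components listed above, the Merkle tree is the simplest to illustrate with the standard library alone. The sketch below assumes SHA-256, big-endian 64-bit float leaves, leaf/node prefix bytes for domain separation, and duplication of the last node on odd levels; none of these choices are stated on this page, and `merkle_root` is a hypothetical name. The Pedersen commitments, Fiat-Shamir proof, and Ed25519 signature are not covered here.

```python
import hashlib
import struct


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(genome_vector):
    """Hex Merkle root over a list of floats (assumed leaf encoding)."""
    if not genome_vector:
        raise ValueError("genome_vector must be non-empty")
    # Leaves: prefix 0x00, then the value as a big-endian IEEE 754 double.
    level = [_sha256(b"\x00" + struct.pack(">d", v)) for v in genome_vector]
    while len(level) > 1:
        if len(level) % 2:  # odd count: duplicate the last node (simplification)
            level.append(level[-1])
        # Internal nodes: prefix 0x01, then the two child hashes.
        level = [
            _sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


if __name__ == "__main__":
    original = [61.2, 48.9, 73.5]
    tampered = [61.2, 48.9, 73.6]
    print(merkle_root(original))
    print(merkle_root(original) == merkle_root(tampered))
```

Changing any single entry of the vector changes the root, which is what lets a signed root attest to the whole vector without listing every value in the signature payload.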

Forensic Methodology

Five independent analysis engines converge on each model. No single signal determines a verdict.

Structural Probe

Sonar v2 latent crystallography: detects domain-specific structural patterns in model embeddings without accessing training data.

Observed Signal

Deep Crystallography

Extracts attention layer tensors and measures domain-specific crystallization gaps against negative controls. Surface vs. deep gap analysis reveals the training signal.

Observed Signal

Resonance Probe

Vocabulary-space cluster analysis. Discovers token communities via attention resonance, measures separation from noise floor, and traces assimilation chains.

Observed Signal

CDS v4

Calibrated Differential Surprise: measures statistical deviation from reference models on dataset-specific texts, with bootstrap confidence intervals and neutral calibration.
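The general shape of such a measurement can be sketched as follows. This is an assumed reading, not the CDS v4 definition: per-text surprise is taken to be BPC, the differential is reference minus target, the neutral-text differential is subtracted as a calibration offset, and a percentile bootstrap gives the interval. The names `bootstrap_ci` and `differential_surprise` are hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so results are repeatable
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def differential_surprise(target, reference, neutral_target, neutral_reference):
    """Calibrated mean of (reference BPC - target BPC) with a 95% CI.

    The neutral texts estimate how much the two models differ on material
    neither should have a special advantage on; that offset is removed.
    """
    offset = mean([r - t for t, r in zip(neutral_target, neutral_reference)])
    diffs = [(r - t) - offset for t, r in zip(target, reference)]
    return mean(diffs), bootstrap_ci(diffs)


if __name__ == "__main__":
    estimate, (low, high) = differential_surprise(
        target=[0.71, 0.65, 0.80, 0.69, 0.74],
        reference=[1.02, 0.98, 1.10, 0.95, 1.07],
        neutral_target=[1.00, 1.05, 0.97],
        neutral_reference=[1.08, 1.12, 1.06],
    )
    print(estimate, low, high)
```

An interval that excludes zero would indicate that the target model is less surprised by the dataset texts than calibration alone explains, which is a statistical anomaly rather than proof of inclusion.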

Derived Similarity

Gene Y

Ranking fingerprint correlation. Models trained on shared data produce correlated perplexity rankings. Spearman ρ across Pile-style test texts.

Derived Similarity

Dataset Genome

Size-normalized BPC fingerprinting across 13 datasets × 16 models. Z-score normalization removes model-size bias, revealing true training data signal.

Observed Signal

ZK Forensic Certificate

Pedersen commitments on BPC values, Merkle tree on genome vector, Fiat-Shamir proof, Ed25519 signature. Tamper-evident cryptographic attestation of forensic results.

Cryptographic Proof
