De-identification of Student Work

We de-identify student work before any AI model sees it.

Our pipeline overwrites identifying zones with white pixels, irreversibly, and was validated at a success rate above 99.9% across four months of human quality assurance. It is the technical foundation of our privacy posture.

  • ✓ Applied to 100% of student work before inference

  • ✓ Validated by human QA (Sept–Dec 2024)

  • ✓ FERPA-compliant "de-identification" standard (34 CFR §99.31(b))

Why de-identification matters

Large language models are powerful pattern-matchers. Vendor contracts that prohibit retaining or training on data are the first line of defense. De-identification is the second: it ensures that no identifying information is sent in the first place.

Under FERPA, "de-identified" education records are records from which all personally identifiable information (PII) has been removed, such that a "reasonable person in the school community" could not identify the student. Our pipeline targets this bar and exceeds it.

The pipeline

The pipeline runs in four steps before any student work reaches a language model: detection of identifying zones, irreversible white-pixel masking, OCR of the masked image, and the LLM call on the transcription only.
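The ordering of the four steps can be sketched as plain function composition. This is a minimal illustration, not Ed.ai's implementation: `run_pipeline`, `MaskedImage`, and the four callables are hypothetical names, and the image is modeled as a nested list of grayscale values.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Pixels = List[List[int]]
Box = Tuple[int, int, int, int]  # hypothetical layout: (top, left, bottom, right)


@dataclass(frozen=True)
class MaskedImage:
    """Marker type (an assumption of this sketch): produced only by the mask step."""
    pixels: Pixels


def run_pipeline(
    raw_pixels: Pixels,
    detect: Callable[[Pixels], List[Box]],
    mask: Callable[[Pixels, List[Box]], MaskedImage],
    ocr: Callable[[MaskedImage], str],
    call_llm: Callable[[str], str],
) -> str:
    zones = detect(raw_pixels)        # Step 1: find identifying zones
    masked = mask(raw_pixels, zones)  # Step 2: overwrite those zones
    if not isinstance(masked, MaskedImage):
        raise TypeError("OCR must only receive a MaskedImage")
    text = ocr(masked)                # Step 3: OCR sees the masked image only
    return call_llm(text)             # Step 4: the model sees text only
```

Passing the steps in as callables keeps the sketch independent of any particular detection model, OCR service, or LLM client. The point is only that the raw image never reaches steps 3 or 4.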

Step 1 — Detection

A computer-vision model trained specifically on classroom documents identifies zones likely to carry identifying information:

  • The student name field (top of page, margin, header)

  • Student ID or roster number if printed

  • Class label / section label

  • Teacher name where present

Step 2 — Masking

Detected zones are overwritten with opaque white pixels. The operation is:

  • Irreversible — the pixel values in those zones are replaced; the original is not embedded anywhere in the masked copy

  • Applied to the image that is transcribed — not just visually hidden

  • Performed on Ed.ai's infrastructure (US Azure), before anything leaves for inference
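As an illustration of what "replaced, not hidden" means at the pixel level, here is a minimal sketch in pure Python. It assumes a grayscale image stored as a nested list and zones given as end-exclusive `(top, left, bottom, right)` boxes. The name `mask_zones` and the box layout are hypothetical, not taken from Ed.ai's code.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive (assumed)
WHITE = 255


def mask_zones(pixels: List[List[int]], zones: List[Box]) -> List[List[int]]:
    """Return a new grayscale image with every zone overwritten by white."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    masked = [row[:] for row in pixels]  # work on a copy of the input
    for top, left, bottom, right in zones:
        # Clamp each box to the image bounds before writing.
        for y in range(max(0, top), min(height, bottom)):
            for x in range(max(0, left), min(width, right)):
                # The original value is overwritten, not layered over.
                masked[y][x] = WHITE
    return masked


page = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
]
masked = mask_zones(page, [(0, 1, 2, 3)])
```

The masked copy holds no trace of the original values inside the zone, so nothing downstream of this step can recover them from that copy.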

Step 3 — OCR

The masked image is sent to Azure AI's OCR service (US region). The OCR sees only the masked image — it cannot recover masked content.

Step 4 — LLM call

The OCR transcription is sent to the appropriate language model (Azure OpenAI, Azure Claude, Azure Mistral, or Google Gemini — all US-hosted). The model:

  • Receives only transcribed math content and instructional context

  • Does not receive the image

  • Does not receive roster information, class labels, or teacher identifiers

  • Operates under no-retention, no-training contractual terms
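One way to enforce the "receives only" rule in code is an allowlist: build the request from named fields instead of forwarding a whole record. The sketch below assumes a hypothetical internal record shaped as a dict. `LlmRequest`, `build_llm_request`, and every field name are illustrative only.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LlmRequest:
    """The only fields this sketch allows across the AI boundary."""
    transcription: str
    instructional_context: str


def build_llm_request(record: dict) -> dict:
    # Copy allowlisted fields by name; anything else in the record is ignored.
    request = LlmRequest(
        transcription=record["transcription"],
        instructional_context=record["instructional_context"],
    )
    return asdict(request)


# Hypothetical internal record; all keys and values are made up for illustration.
record = {
    "transcription": "3x + 5 = 20, so x = 5",
    "instructional_context": "Solve one-step and two-step linear equations.",
    "image_bytes": b"...",
    "class_label": "Section B",
    "teacher_name": "Example Teacher",
}
payload = build_llm_request(record)
```

An allowlist fails safe: a field added to the internal record later stays out of the payload until someone adds it to `LlmRequest` on purpose.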

Validation

How we measured it

Between September and December 2024, every single image processed by Ed.ai was also manually reviewed by a human QA team. The QA team checked whether any identifying information survived masking.

Results

  • >99.9% success rate — fewer than 1 image per 1,000 showed any residual identifying information

  • 4 months of 100% human QA coverage before the pipeline was trusted to run at scale

  • Zero reported re-identification incidents since launch
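The arithmetic behind "fewer than 1 image per 1,000" is simple enough to check directly. In the sketch below the function names are hypothetical and the counts are made up for illustration. They are not figures from the 2024 QA campaign.

```python
def success_rate(total_images: int, defective_images: int) -> float:
    """Fraction of images with no residual identifying information."""
    if total_images <= 0:
        raise ValueError("total_images must be positive")
    if not 0 <= defective_images <= total_images:
        raise ValueError("defective_images must be between 0 and total_images")
    return 1 - defective_images / total_images


def meets_target(total_images: int, defective_images: int) -> bool:
    """True when strictly fewer than 1 image per 1,000 is defective."""
    if total_images <= 0:
        raise ValueError("total_images must be positive")
    # Integer comparison avoids floating-point rounding at the boundary.
    return defective_images * 1000 < total_images


# Made-up counts, for illustration only.
EXAMPLE_TOTAL = 50_000
EXAMPLE_DEFECTS = 30
example_ok = meets_target(EXAMPLE_TOTAL, EXAMPLE_DEFECTS)
```

With these made-up counts the defect rate is 30 in 50,000, or 0.6 per 1,000, so the target is met. At exactly 50 defects in 50,000 it would not be, because the bar is strictly fewer than 1 per 1,000.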

What we do with the rare defect

When residual identifiers are detected:

  • The specific image is re-masked and re-processed (or deleted, depending on teacher choice)

  • The detection model is retrained on the failure case

  • The teacher is notified only if the identifier actually left Ed.ai's infrastructure (which has not happened to date)
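The three rules above amount to a small decision procedure, sketched here. The types, field names, and action strings are all hypothetical. The sketch only encodes the rules as stated and does not describe Ed.ai's actual tooling.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class TeacherChoice(Enum):
    REMASK = "remask"
    DELETE = "delete"


@dataclass(frozen=True)
class Defect:
    image_id: str
    teacher_choice: TeacherChoice
    left_infrastructure: bool  # did the identifier leave Ed.ai's infrastructure?


def plan_actions(defect: Defect) -> List[str]:
    actions: List[str] = []
    # Rule 1: re-mask and re-process, or delete, depending on teacher choice.
    if defect.teacher_choice is TeacherChoice.DELETE:
        actions.append("delete_image")
    else:
        actions.append("remask_and_reprocess")
    # Rule 2: the failure case always feeds detection-model retraining.
    actions.append("add_to_retraining_set")
    # Rule 3: notify the teacher only if the identifier actually left.
    if defect.left_infrastructure:
        actions.append("notify_teacher")
    return actions
```

For example, `plan_actions(Defect("img-1", TeacherChoice.REMASK, False))` yields the re-mask and retraining actions with no notification.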

What de-identification does NOT mean

We want to be precise about what this pipeline protects against — and what it doesn't:

  • Covered: masking of names, IDs, class labels, and teacher names on the scanned page. Not covered: the content of the math work itself, which is the reason we process the page.

  • Covered: no direct identifiers are sent to LLMs. Not covered: a guarantee against re-identification by someone with privileged access to all of our internal systems. That residual risk is handled by our security controls (see /security).

  • Covered: the FERPA "de-identification" standard under normal operations. Not covered: adversarial re-identification attacks against our own production systems (that's /security's job).

In short: de-identification reduces the risk at the AI boundary. Security controls address risk everywhere else.

FERPA anchor

Under 34 CFR §99.31(b), student records are considered de-identified when:

  • All direct identifiers have been removed, AND

  • Reasonable steps have been taken to reduce risk that an individual's identity can be determined through indirect identifiers (contextual cues, small-cohort inference, etc.)

Ed.ai's pipeline handles direct identifiers at the image layer. For indirect identifiers:

  • Small-cohort risks (e.g., "the only student in grade 11 with this specific answer pattern") are mitigated by keeping downstream analytics within the school's own scope — no cross-district aggregation, no anonymized dataset pooling.

  • Analytics released to teachers and admins include only teachers' own classes.
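The scoping rule for a teacher's view can be sketched as a filter applied before anything is returned. This assumes hypothetical analytics rows tagged with `school_id` and `teacher_id`. The row shape and the function name are illustrative, not Ed.ai's schema.

```python
from typing import Dict, List

Row = Dict[str, object]


def analytics_for_teacher(rows: List[Row], school_id: str, teacher_id: str) -> List[Row]:
    """Keep only rows from this teacher's own classes within their own school."""
    return [
        row
        for row in rows
        if row["school_id"] == school_id and row["teacher_id"] == teacher_id
    ]


# Made-up rows for illustration.
rows = [
    {"school_id": "s1", "teacher_id": "t1", "class_id": "c1", "mean_score": 0.82},
    {"school_id": "s1", "teacher_id": "t2", "class_id": "c2", "mean_score": 0.77},
    {"school_id": "s2", "teacher_id": "t1", "class_id": "c9", "mean_score": 0.90},
]
visible = analytics_for_teacher(rows, school_id="s1", teacher_id="t1")
```

Filtering on both keys means a teacher identifier that happens to repeat across schools cannot pull in another school's rows.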

Requesting more detail

District DPOs, CTOs, and researchers can request:

  • The detection model's architecture and evaluation results (under NDA)

  • Sample "before / after" images showing masking (synthetic, no real student data)

  • Audit summary of the 2024 QA campaign

Write to privacy@ed.ai.

Closing

De-identification is how we keep a simple promise: the AI that grades your students' work never sees who your students are.
