De-identification of Student Work
We de-identify student work before any AI model sees it.
Our irreversible white-pixel masking pipeline was validated at a success rate above 99.9% across four months of human quality assurance. It is the technical foundation of our privacy posture.
✓ Applied to 100% of student work before inference
✓ Validated by human QA (Sept–Dec 2024)
✓ FERPA-compliant "de-identification" standard (34 CFR §99.31(b))
Large language models are powerful pattern-matchers. Even when vendors contractually agree not to retain or train on data, the safest stance is to ensure no identifying information is ever sent in the first place. Those contractual terms are the first line of defense; de-identification is the second.
Under FERPA, "de-identified" education records are records from which all personally identifiable information (PII) has been removed, such that a "reasonable person in the school community" could not identify the student. Our pipeline is designed to meet this bar and, where possible, exceed it.
The pipeline runs in four steps before any student work reaches a language model: detection of identifying zones, irreversible white-pixel masking, OCR of the masked image, and the LLM call on the transcription only.
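The ordering of the four steps can be sketched as a small Python skeleton. This is an illustration under stated assumptions, not Ed.ai's production code: `DeidPipeline`, the `Image` and `Zone` types, and all four stub stages are hypothetical names invented here. The point it shows is structural: OCR is only ever handed the masked copy, and the model is only ever handed text.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# All names below are hypothetical, chosen for illustration only.
Image = List[List[int]]            # grayscale pixels, 0 (black) to 255 (white)
Zone = Tuple[int, int, int, int]   # (top, left, bottom, right), end-exclusive


@dataclass
class DeidPipeline:
    detect: Callable[[Image], List[Zone]]        # step 1: find identifying zones
    mask: Callable[[Image, List[Zone]], Image]   # step 2: white-pixel masking
    ocr: Callable[[Image], str]                  # step 3: transcribe masked image
    llm: Callable[[str], str]                    # step 4: model call on text only

    def run(self, page: Image) -> str:
        zones = self.detect(page)
        masked = self.mask(page, zones)
        text = self.ocr(masked)   # OCR is handed the masked copy, never `page`
        return self.llm(text)     # the model is handed text, never an image


# Stub stages, just enough to exercise the ordering.
def stub_detect(page: Image) -> List[Zone]:
    return [(0, 0, 1, len(page[0]))]          # pretend row 0 holds the name field


def stub_mask(page: Image, zones: List[Zone]) -> Image:
    out = [row[:] for row in page]
    for top, left, bottom, right in zones:
        for y in range(top, bottom):
            for x in range(left, right):
                out[y][x] = 255
    return out


def stub_ocr(image: Image) -> str:
    # Stands in for a real OCR call; reports a leak if row 0 was not masked.
    return "2x + 3 = 7" if all(p == 255 for p in image[0]) else "NAME LEAKED"


def stub_llm(text: str) -> str:
    return f"feedback for: {text}"


pipeline = DeidPipeline(stub_detect, stub_mask, stub_ocr, stub_llm)
result = pipeline.run([[0, 0, 0], [40, 80, 120]])
```

Passing the stages in as callables keeps the ordering testable without a real detector, OCR service, or model.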
A computer-vision model trained specifically on classroom documents identifies zones likely to carry identifying information:
The student name field (top of page, margin, header)
Student ID or roster number if printed
Class label / section label
Teacher name where present
Detected zones are overlaid with opaque white pixels. The operation is:
Irreversible — the pixel values in those zones are replaced; the original is not embedded anywhere in the masked copy
Applied to the image that is transcribed — not just visually hidden
Performed on Ed.ai's infrastructure (US Azure), before anything leaves for inference
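As a minimal sketch of what white-pixel masking means at the pixel level, the function below overwrites each detected zone with white. It assumes a grayscale image stored as nested lists and zones given as end-exclusive rectangles. `mask_zones`, `Image`, and `Zone` are hypothetical names for illustration and do not describe Ed.ai's actual implementation.

```python
from typing import List, Tuple

# Hypothetical types for illustration only.
Image = List[List[int]]            # grayscale pixels, 0 (black) to 255 (white)
Zone = Tuple[int, int, int, int]   # (top, left, bottom, right), end-exclusive

WHITE = 255


def mask_zones(image: Image, zones: List[Zone]) -> Image:
    """Return a new image in which every zone is overwritten with white pixels."""
    masked = [row[:] for row in image]   # work on a copy of the pixel rows
    height = len(masked)
    width = len(masked[0]) if masked else 0
    for top, left, bottom, right in zones:
        # Clamp each zone to the image bounds before overwriting.
        for y in range(max(0, top), min(height, bottom)):
            for x in range(max(0, left), min(width, right)):
                masked[y][x] = WHITE     # the original value is replaced, not layered over
    return masked


page = [[10, 20, 30, 40],
        [50, 60, 70, 80],
        [90, 100, 110, 120]]
name_field = (0, 0, 1, 3)                # hypothetical zone: first three pixels of row 0
masked_page = mask_zones(page, [name_field])
```

Because the masked copy holds only the replacement values inside each zone, nothing downstream that receives `masked_page` can read back what was there.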
The masked image is sent to Azure AI's OCR service (US region). The OCR sees only the masked image — it cannot recover masked content.
The OCR transcription is sent to the appropriate language model (Azure OpenAI, Azure Claude, Azure Mistral, or Google Gemini — all US-hosted). The model:
Receives only transcribed math content and instructional context
Does not receive the image
Does not receive roster information, class labels, or teacher identifiers
Operates under no-retention, no-training contractual terms
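One way to enforce "text only" at the model boundary is an allow-list: rather than deleting known-sensitive fields, copy across only the fields that are permitted. The sketch below illustrates the idea. The field names (`transcription`, `instructional_context`) and the function `build_model_payload` are assumptions made for this example, not Ed.ai's actual schema.

```python
from typing import Any, Dict

# Hypothetical field names; only these may cross the model boundary.
ALLOWED_FIELDS = ("transcription", "instructional_context")


def build_model_payload(record: Dict[str, Any]) -> Dict[str, str]:
    """Copy only allow-listed text fields into the outbound payload."""
    payload: Dict[str, str] = {}
    for field in ALLOWED_FIELDS:
        value = record.get(field)
        if isinstance(value, str):       # images and other binary data never qualify
            payload[field] = value
    if "transcription" not in payload:
        raise ValueError("a transcription is required before calling the model")
    return payload


record = {
    "transcription": "3/4 + 1/8 = 7/8",
    "instructional_context": "Grade 5 fractions, unit 3",
    "image": b"\x89PNG...",              # placeholder bytes, dropped by the allow-list
    "class_label": "PLACEHOLDER-CLASS",
    "teacher_name": "PLACEHOLDER-TEACHER",
}
payload = build_model_payload(record)
```

An allow-list fails closed: a field added to the record later stays out of the payload unless someone deliberately permits it.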
Between September and December 2024, every single image processed by Ed.ai was also manually reviewed by a human QA team. The QA team checked whether any identifying information survived masking.
>99.9% success rate — fewer than 1 image per 1,000 showed any residual identifying information
4 months of 100% human QA coverage before the pipeline was trusted to run at scale
Zero reported re-identification incidents since launch
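The two ways of stating the result, a success rate above 99.9% and fewer than 1 residual image per 1,000, are the same threshold. The short sketch below makes that arithmetic explicit. The counts in it are made-up numbers for illustration only, not figures from the 2024 QA campaign, and the function names are hypothetical.

```python
def residual_per_thousand(reviewed: int, residual: int) -> float:
    """Images with surviving identifiers, per 1,000 images reviewed."""
    if reviewed <= 0:
        raise ValueError("reviewed must be positive")
    if not 0 <= residual <= reviewed:
        raise ValueError("residual must be between 0 and reviewed")
    return 1000 * residual / reviewed


def success_rate(reviewed: int, residual: int) -> float:
    """Fraction of reviewed images with no surviving identifiers."""
    return 1 - residual_per_thousand(reviewed, residual) / 1000


def meets_bar(reviewed: int, residual: int) -> bool:
    """True when fewer than 1 image per 1,000 shows residual identifiers."""
    return residual_per_thousand(reviewed, residual) < 1.0


# Illustrative counts only (NOT real QA data).
example_reviewed = 20_000
example_residual = 12
example_ok = meets_bar(example_reviewed, example_residual)
```

With these illustrative counts the residual rate is 0.6 per 1,000, which clears the bar; exactly 1 per 1,000 would not.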
When residual identifiers are detected:
The specific image is re-masked and re-processed (or deleted, depending on teacher choice)
The detection model is retrained on the failure case
The teacher is notified only if the identifier actually left Ed.ai's infrastructure (which has not happened to date)
We want to be precise about what this pipeline protects against — and what it doesn't:
What it covers:
Masking of names, IDs, class labels, and teacher names on the scanned page
No direct identifiers sent to LLMs
FERPA "de-identification" standard under normal operations
What it does not cover:
Content of the math work itself, which is the reason we're processing the page
A guarantee against re-identification by someone with privileged access to all of our internal systems; that residual risk is handled by our security controls (see /security)
Adversarial re-identification attacks against our own production systems (that's /security's job)
In short: de-identification reduces the risk at the AI boundary. Security controls address risk everywhere else.
Under 34 CFR §99.31(b), student records are considered de-identified when:
All direct identifiers have been removed, AND
Reasonable steps have been taken to reduce risk that an individual's identity can be determined through indirect identifiers (contextual cues, small-cohort inference, etc.)
Ed.ai's pipeline handles direct identifiers at the image layer. For indirect identifiers:
Small-cohort risks (e.g., "the only student in grade 11 with this specific answer pattern") are mitigated by keeping downstream analytics within the school's own scope — no cross-district aggregation, no anonymized dataset pooling.
Analytics released to teachers and admins include only teachers' own classes.
District DPOs, CTOs, and researchers can request:
The detection model's architecture and evaluation results (under NDA)
Sample "before / after" images showing masking (synthetic, no real student data)
Audit summary of the 2024 QA campaign
Write to privacy@ed.ai.
De-identification is how we keep a simple promise: the AI that grades your students' work never sees who your students are.
Related pages:
Trust Pledge — the commitment framework
Security — the controls around the pipeline
AI Transparency — the role AI plays in each product