
Batch Anonymization

Status: Proposed

Batch anonymization extends the single-document anonymization pipeline to support processing multiple files in one guided session. The workflow follows a state-machine model: upload → extract → review → anonymize → completed.
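As a rough illustration of the state-machine model, the five states and their forward transitions could be sketched as follows. The `BatchState` and `advance` names are hypothetical, and the sketch assumes a strictly linear flow with no backward or skipped transitions, which the proposal does not spell out.

```python
from enum import Enum


class BatchState(Enum):
    UPLOAD = "upload"
    EXTRACT = "extract"
    REVIEW = "review"
    ANONYMIZE = "anonymize"
    COMPLETED = "completed"


# Assumption: each state has exactly one successor; COMPLETED is terminal.
_NEXT = {
    BatchState.UPLOAD: BatchState.EXTRACT,
    BatchState.EXTRACT: BatchState.REVIEW,
    BatchState.REVIEW: BatchState.ANONYMIZE,
    BatchState.ANONYMIZE: BatchState.COMPLETED,
}


def advance(state: BatchState) -> BatchState:
    """Return the successor state, or raise if the batch is already completed."""
    if state not in _NEXT:
        raise ValueError(f"no transition out of {state.value}")
    return _NEXT[state]
```

Keeping the transitions in a single lookup table makes it easy to reject out-of-order requests, for example an anonymize call on a batch that is still extracting.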

Overview

Users can upload up to 100 files (admin-configurable) in a single request. DocuDesk processes them sequentially, extracting text and entities from each file, then presenting a consolidated entity review before applying anonymization. A CSV audit report is available for download after completion.

Batch state is persisted in Nextcloud ICache with a 2-hour TTL. No batch data is stored permanently; only the anonymized output files are saved to the user's DocuDesk folder.
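To make the transient-state behaviour concrete, here is a minimal in-memory sketch of a store whose entries expire after the 2-hour TTL. It is a stand-in for illustration only and does not reproduce the Nextcloud ICache interface; `TtlStore`, its methods, and the injectable `clock` are assumptions.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

BATCH_TTL_SECONDS = 2 * 60 * 60  # the 2-hour TTL stated above


class TtlStore:
    """In-memory stand-in for a TTL cache; not the Nextcloud ICache interface."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._items: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any, ttl: int = BATCH_TTL_SECONDS) -> None:
        # Store the absolute expiry time alongside the value.
        self._items[key] = (self._clock() + ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired batch state is dropped on access and reported as missing.
            del self._items[key]
            return None
        return value
```

A caller that receives `None` for a known batchId can treat the batch as expired and ask the user to upload again.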

Workflow Steps

  1. Upload: POST /api/anonymization/batch/upload. Upload multiple files and receive a batchId.
  2. Extract: POST /api/anonymization/batch/{batchId}/extract. Process one file per call until all files are extracted.
  3. Review: Entity review. Consolidated entity list with toggle controls.
  4. Anonymize: POST /api/anonymization/batch/{batchId}/anonymize. Apply anonymization with the reviewed entity list.
  5. Report: GET /api/anonymization/batch/{batchId}/report. Download the CSV audit report.
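Because the extract step handles one file per call, a client has to repeat the request until the batch is done. The loop below is a sketch under stated assumptions: the `post` callable is a hypothetical transport that returns the decoded JSON body, and the `remaining` response field is invented for illustration, since the proposal does not define the response shape.

```python
from typing import Any, Callable, Dict

# Hypothetical transport: takes a request path, returns the decoded JSON body.
Post = Callable[[str], Dict[str, Any]]


def extract_all(post: Post, batch_id: str, max_files: int = 100) -> int:
    """Call the extract endpoint until no files remain; return the number of calls.

    Assumes (not confirmed by the proposal) that each response carries a
    'remaining' integer counting the files still to be extracted.
    """
    path = f"/api/anonymization/batch/{batch_id}/extract"
    # Bound the loop by the batch size limit so a misbehaving server cannot spin forever.
    for calls in range(1, max_files + 1):
        if post(path)["remaining"] == 0:
            return calls
    raise RuntimeError("batch did not finish within max_files extract calls")
```

Passing the transport in as a parameter keeps the loop independent of any particular HTTP library and makes it straightforward to exercise with a fake.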

API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| POST | /api/anonymization/batch/upload | Upload multiple files; returns batchId |
| POST | /api/anonymization/batch/{batchId}/extract | Extract next unprocessed file in batch |
| GET | /api/anonymization/batch/{batchId}/status | Polling endpoint; returns batch status and per-file progress |
| GET | /api/anonymization/batch/{batchId}/entities | Consolidated entity list for review |
| POST | /api/anonymization/batch/{batchId}/anonymize | Apply anonymization with reviewed entity list |
| GET | /api/anonymization/batch/{batchId}/report | Download CSV audit report (post-completion) |

Audit Report

The CSV report includes: fileName, originalFileId, anonymizedFileId, entityCount, replacementCount, status, timestamp. Entity values are excluded (GDPR data minimization, Recital 26).
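One way to render a report with exactly these columns is Python's standard `csv.DictWriter`. This is a sketch rather than the DocuDesk implementation; the `build_report` name and the row dictionaries are assumptions. Rejecting unknown keys is one possible guard against an entity value slipping into the output.

```python
import csv
import io
from typing import Dict, Iterable

# Column order as listed above.
REPORT_COLUMNS = [
    "fileName", "originalFileId", "anonymizedFileId",
    "entityCount", "replacementCount", "status", "timestamp",
]


def build_report(rows: Iterable[Dict[str, object]]) -> str:
    """Render rows as CSV text; raise ValueError on keys outside REPORT_COLUMNS."""
    buffer = io.StringIO()
    # extrasaction="raise" rejects unexpected keys, such as an entity value field.
    writer = csv.DictWriter(buffer, fieldnames=REPORT_COLUMNS, extrasaction="raise")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()
```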

Standards

  • GDPR / AVG — Batch state is transient (ICache TTL 2h); entity values excluded from audit report
  • WOO — Anonymization profiles aligned with WOO publication requirements
  • GEMMA Media-behandelingcomponent
  • TEC-DMS-7 (Workflow Management)

Limits

| Parameter | Default | Config Key |
|-----------|---------|------------|
| Max files per batch | 100 | docudesk_batch_max_files (IAppConfig) |
| Batch TTL | 2 hours | Hardcoded (ICache) |
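The file-count limit could be enforced at upload time with a check along these lines. The function name is hypothetical, and rejecting an empty upload is an assumption that the proposal does not state.

```python
DEFAULT_MAX_FILES = 100  # default for docudesk_batch_max_files per the table above


def check_batch_size(file_count: int, max_files: int = DEFAULT_MAX_FILES) -> None:
    """Raise ValueError if an upload is empty or exceeds the configured maximum."""
    if file_count < 1:
        # Assumption: an upload with no files is rejected rather than creating a batch.
        raise ValueError("a batch needs at least one file")
    if file_count > max_files:
        raise ValueError(f"{file_count} files exceeds the limit of {max_files}")
```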