Author: Denis Avetisyan
A new workflow leverages the power of artificial intelligence to automatically identify and protect sensitive information within detailed accident reports.

This paper details an agentic system combining rule-based and fine-tuned large language models for improved Personally Identifiable Information (PII) detection in crash narratives, balancing accuracy with privacy preservation for enhanced safety analysis.
Despite the crucial role of contextual information in crash reports for traffic safety analysis, the reports' broader utility is hindered by the presence of sensitive personally identifiable information (PII). This paper introduces ‘An Agentic Workflow for Detecting Personally Identifiable Information in Crash Narratives’, a locally deployable system that combines rule-based methods with fine-tuned large language models and an agentic verification stage to improve PII detection accuracy and privacy. Achieving a precision of 0.82 and recall of 0.94 on a real-world crash dataset, this workflow demonstrably outperforms existing methods. Could this approach unlock more comprehensive and responsible utilization of crash data, ultimately leading to enhanced safety interventions while safeguarding individual privacy?
Unveiling Sensitive Data Within Crash Reports
Crash reports filed with the Wisconsin Department of Transportation, while crucial for improving road safety, frequently contain unexpectedly detailed personal information. These narratives, often composed by responding officers, routinely include not only details about the incident itself, but also observations about the individuals involved – names, addresses, and even potentially sensitive medical details gleaned from on-scene interviews or visible injuries. This embedded Personally Identifiable Information (PII) presents a significant privacy risk, as publicly accessible reports could expose individuals to identity theft, harassment, or unwanted attention. The sheer volume of these reports, combined with the unstructured, narrative format, makes manual redaction impractical, necessitating automated solutions to protect the privacy of those involved in traffic incidents.
The inherent complexity of natural language presents a significant hurdle for automated PII detection within the free-form text of crash narratives. Traditional rule-based systems, reliant on predefined patterns like regular expressions, often fail to capture the nuances of phrasing, resulting in both false positives and, more critically, false negatives – missed instances of sensitive data. Similarly, dictionary-based approaches struggle with variations in names, addresses, and vehicle identifiers, or the presence of misspellings common in quickly-recorded reports. This incomplete identification not only compromises individual privacy but also hinders the effectiveness of safety analyses; inaccurate or redacted data can skew investigations into crash causes and impede the development of preventative measures, ultimately limiting the utility of these valuable public safety resources.
An Agentic Workflow for Intelligent PII Detection
The PII detection process employs an Agentic Workflow, a modular system where distinct agents perform specific tasks in sequence. This architecture facilitates systematic identification of Personally Identifiable Information (PII) within crash narratives by breaking down the complex problem into manageable, specialized components. Each agent is designed to address a particular aspect of PII detection – such as initial information extraction, contextual analysis, and validation – and passes its output to the subsequent agent in the workflow. This sequential processing enhances both the accuracy and the efficiency of PII identification compared to monolithic approaches, allowing for focused optimization of individual agent capabilities and easier adaptation to evolving PII types and data formats.
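The sequential hand-off between agents can be sketched as a simple pipeline. The agent names and heuristics below are illustrative stand-ins, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import Callable, List

# A candidate PII span: character offsets plus a category label.
@dataclass
class Span:
    start: int
    end: int
    label: str

# Each agent maps (narrative, spans-so-far) to an updated span list.
Agent = Callable[[str, List[Span]], List[Span]]

def run_pipeline(text: str, agents: List[Agent]) -> List[Span]:
    """Run agents in sequence, threading each agent's output to the next."""
    spans: List[Span] = []
    for agent in agents:
        spans = agent(text, spans)
    return spans

# Toy stand-ins for the extraction and verification stages.
def toy_extractor(text: str, spans: List[Span]) -> List[Span]:
    idx = text.find("John")  # a real extractor would combine rules and an LLM
    return spans + [Span(idx, idx + 4, "NAME")] if idx >= 0 else spans

def toy_verifier(text: str, spans: List[Span]) -> List[Span]:
    # Keep only capitalized spans, a stand-in for a confidence check.
    return [s for s in spans if text[s.start:s.end][:1].isupper()]

result = run_pipeline("Driver John exited the vehicle.", [toy_extractor, toy_verifier])
```

Because each stage has the same signature, individual agents can be retrained or swapped without touching the rest of the pipeline.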
The PII detection workflow employs a Hybrid Extractor to maximize both recall and precision. This component integrates traditional rule-based methods – regular expressions and keyword lists – with a fine-tuned Large Language Model (LLM). The rule-based system rapidly identifies likely PII based on predefined patterns, while the LLM leverages contextual understanding to refine these initial detections and identify more nuanced instances of PII. Following extraction, a Verifier component assesses the confidence of each identified PII element, reducing false positives and ensuring a higher level of accuracy in the overall process. This dual approach combines the speed of rule-based systems with the adaptability of LLMs, and the Verifier provides a final layer of quality control.
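A minimal sketch of the hybrid merge, assuming rule hits take precedence over overlapping LLM hits (the paper does not spell out its tie-breaking rule); the LLM output is stubbed:

```python
import re

def rule_candidates(text):
    """Fast regex pass for pattern-like PII (US-style phone numbers, as an example)."""
    return [(m.start(), m.end(), "PHONE")
            for m in re.finditer(r"\b\d{3}-\d{3}-\d{4}\b", text)]

def merge(rule_spans, llm_spans):
    """Union the two candidate sets, keeping rule spans on overlap."""
    merged = list(rule_spans)
    for span in llm_spans:
        overlaps = any(span[0] < r[1] and r[0] < span[1] for r in merged)
        if not overlaps:
            merged.append(span)
    return sorted(merged)

# The LLM candidates would come from the fine-tuned model; stubbed here.
text = "Call Jane at 555-123-4567."
llm_spans = [(5, 9, "NAME")]
candidates = merge(rule_candidates(text), llm_spans)
```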
Refining Contextual Understanding with Large Language Models
The Llama 3.1-8B large language model was adapted for the specific task of identifying context-dependent Personally Identifiable Information (PII), such as names and addresses, through the application of Low-Rank Adaptation (LoRA) and carefully designed prompts. LoRA minimizes the number of trainable parameters, improving training efficiency and reducing computational costs, while prompt engineering focuses on structuring input to guide the LLM toward accurate PII identification based on surrounding text. This fine-tuning process enabled the model to distinguish between ambiguous terms and accurately extract PII only when contextually relevant, surpassing the performance of generic PII detection methods.
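The parameter savings from LoRA can be made concrete. In the standard LoRA formulation (not specific to this paper), the frozen pretrained weight $W_0$ is augmented by a low-rank update:

```latex
W' = W_0 + \Delta W = W_0 + BA,
\qquad B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

Only $B$ and $A$ are trained, so the trainable parameter count drops from $dk$ to $r(d+k)$; at rank $r = 8$ on a $4096 \times 4096$ projection, that is roughly 65K parameters instead of 16.8M.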
Ensemble Learning was implemented to enhance recall in PII detection by leveraging multiple inferences from the fine-tuned Llama 3.1-8B model. This technique involves running the model on the same input text multiple times and then pooling the resulting candidate spans. By combining these outputs, the system reduces the likelihood of missing valid PII instances, effectively boosting overall detection rates. The aggregation of predictions from repeated runs mitigates the impact of potential variations in model output and improves the robustness of the PII identification process.
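The recall-oriented pooling reduces to a union of the spans surfaced across repeated runs; a minimal sketch with stubbed inference outputs:

```python
def pool_spans(runs):
    """Recall-oriented pooling: union the candidate spans from repeated runs."""
    pooled = set()
    for run in runs:
        pooled.update(run)
    return sorted(pooled)

# Three stubbed inference passes over the same narrative; sampling variation
# means each pass may surface a different subset of the true PII spans.
runs = [
    [(0, 4, "NAME")],
    [(0, 4, "NAME"), (20, 31, "ADDRESS")],
    [(20, 31, "ADDRESS")],
]
pooled = pool_spans(runs)
```

Pooling trades a little precision for recall, which is the right trade for redaction, where a missed span is costlier than an extra one; the downstream Verifier then claws back precision.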
The Hybrid Extractor combines rule-based identification with large language model (LLM) analysis for comprehensive PII detection. Specifically, it leverages the Presidio library to identify structured PII entities, such as phone numbers and email addresses, while employing a fine-tuned Llama 3.1-8B LLM to detect context-dependent PII. Evaluation demonstrates a precision of 0.94 for phone number detection using Presidio and a precision of 1.00 for email address detection, indicating high accuracy in identifying these specific structured PII types.
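The workflow itself relies on Presidio's ready-made recognizers for these structured types; a minimal regex stand-in (not Presidio's actual patterns) conveys the idea:

```python
import re

# Minimal illustrative patterns; Presidio's PHONE_NUMBER and EMAIL_ADDRESS
# recognizers are considerably more thorough.
PATTERNS = {
    "PHONE_NUMBER": re.compile(r"\(?\b\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "EMAIL_ADDRESS": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def detect_structured(text):
    """Collect (start, end, label) hits for each structured entity type."""
    hits = []
    for label, pattern in PATTERNS.items():
        hits += [(m.start(), m.end(), label) for m in pattern.finditer(text)]
    return sorted(hits)

hits = detect_structured("Reached at (608) 555-0199 or jdoe@example.com.")
```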

Validating Accuracy and Enhancing Reliability
The system incorporates a dedicated Verifier component designed to resolve uncertainties inherent in Personally Identifiable Information (PII) detection. This module concentrates on accurately identifying challenging data points, specifically home addresses and alphanumeric identifiers – categories frequently misclassified by standard methods. By focusing on these ambiguous cases, the Verifier significantly reduces both false positives and false negatives, ensuring a higher degree of confidence in PII identification. Leveraging the same foundational Llama 3.1-8B model as the primary detection system, it provides a consistent and reliable secondary assessment, effectively functioning as a quality control measure for complex PII data.
The system’s Verifier component establishes a crucial layer of quality control by employing the same Llama 3.1-8B model used in initial PII detection, ensuring consistency and reliability in its secondary assessments. This approach significantly reduces both false positives – incorrectly identifying data as PII – and false negatives – failing to identify actual PII. Notably, the Verifier achieves perfect precision – a score of 1.00 – in detecting home addresses, meaning every instance flagged as an address is, in fact, a correctly identified address; this high level of accuracy demonstrates the effectiveness of the model and the workflow in handling complex and potentially ambiguous data categories.
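The verification pass can be sketched as a second filter over the ambiguous categories; the judge below is a toy heuristic standing in for the Llama 3.1-8B verification call:

```python
# The categories the paper singles out as frequently misclassified.
AMBIGUOUS = {"ADDRESS", "ALPHANUMERIC_ID"}

def verify(text, candidates, judge):
    """Second pass: keep an ambiguous candidate only if the judge confirms it."""
    verified = []
    for start, end, label in candidates:
        if label in AMBIGUOUS and not judge(text, text[start:end], label):
            continue  # judge rejected the span: treat as a false positive
        verified.append((start, end, label))
    return verified

# Toy judge standing in for the LLM verification call: here it simply
# requires a digit somewhere in an address span.
toy_judge = lambda text, span, label: any(ch.isdigit() for ch in span)

text = "Towed to 12 Oak St; highway ramp nearby."
candidates = [(9, 18, "ADDRESS"), (20, 27, "ADDRESS")]
kept = verify(text, candidates, toy_judge)
```

Routing only the ambiguous categories through the judge keeps the extra inference cost proportional to the hard cases rather than to the whole candidate set.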
The implemented agentic workflow demonstrates a marked improvement in Personally Identifiable Information (PII) detection, achieving an F1-score of 0.84 for name identification and 0.63 for alphanumeric identifiers. These results represent a substantial gain in performance compared to conventional baseline methods, indicating a greater capacity to accurately pinpoint sensitive data. Notably, the system exhibits enhanced robustness when dealing with ambiguous PII categories (those instances where clear identification proves challenging), suggesting a more reliable and nuanced approach to data privacy and security. This improved performance is crucial for applications requiring high precision in PII redaction, data governance, and compliance with privacy regulations.
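The F1-scores above combine precision and recall; as a worked check, the overall precision of 0.82 and recall of 0.94 reported for the workflow imply an overall F1 of about 0.88:

```python
def f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# The workflow's reported overall precision (0.82) and recall (0.94).
overall_f1 = f1(0.82, 0.94)  # ≈ 0.88
```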
The presented agentic workflow emphasizes a holistic approach to PII detection within crash narratives, mirroring the interconnectedness of systems. Just as a flawed component can disrupt an entire organism, a weakness in PII identification can compromise the integrity of safety analysis. This research underscores that effective data handling isn’t merely about isolated techniques; it demands an understanding of the entire data lifecycle. As Bertrand Russell aptly stated, “To be happy, one must be able to change.” This sentiment applies to data science; adapting to evolving privacy concerns and refining detection methods is crucial for responsible innovation in this domain.
The Road Ahead
The presented agentic workflow, while a step toward robust PII detection in unstructured crash narratives, merely shifts the locus of the problem. The core challenge isn’t simply finding the data, but understanding the implicit contracts embedded within its use. A system that successfully redacts names and addresses offers a superficial privacy; the true leakage occurs when patterns of behavior, inferred from the remaining data, reveal identities through association. This work optimizes for a local maximum (accurate redaction) but neglects the global minimum of genuine data utility and ethical practice.
Future research must move beyond feature engineering and model refinement toward a more holistic consideration of data provenance and purpose. Domain adaptation, while necessary, is a perpetual game of catch-up. A more elegant solution lies in minimizing the need for extraction altogether: perhaps through differential privacy techniques applied directly to the narrative text, or by developing synthetic datasets that preserve statistical properties without exposing individual details. The cost of freedom, in this case, is not dependency on complex algorithms, but a fundamental re-evaluation of what questions can, and should, be asked of this sensitive data.
Good architecture is invisible until it breaks. The true test of this, or any similar system, will not be its performance on benchmark datasets, but its resilience in the face of adversarial attacks and unforeseen edge cases. Until we acknowledge that simplicity scales while cleverness does not, the pursuit of perfect PII detection will remain a Sisyphean task.
Original article: https://arxiv.org/pdf/2604.15369.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/