Building a Safer Tomorrow: AI Spots Hazards on Construction Sites

Author: Denis Avetisyan


New research shows artificial intelligence can proactively identify safety risks in construction by analyzing both images and textual reports.

The system explores predictive prompting to anticipate accident scenarios, effectively framing the challenge as a forecasting problem rather than a reactive one.
The system explores predictive prompting to anticipate accident scenarios, effectively framing the challenge as a forecasting problem rather than a reactive one.

Prompt-engineered large language and vision-language models enable effective hazard detection without extensive model training or fine-tuning, improving construction safety and OSHA compliance.

Despite the increasing digitization of construction, synthesizing safety insights from disparate data sources-accident reports, inspection logs, and site imagery-remains a significant challenge. This research, ‘Automated Hazard Detection in Construction Sites Using Large Language and Vision-Language Models’, introduces a multimodal AI framework leveraging large language and vision-language models to automatically identify hazards on construction sites. Results demonstrate that strategically prompted models can effectively analyze both textual and visual data without extensive model training, offering a cost-aware and scalable approach to safety monitoring. Could this methodology pave the way for proactive, real-time hazard mitigation and a substantial reduction in construction site accidents?


The Illusion of Control: Construction’s Persistent Peril

Despite decades of safety protocols and increasingly rigorous regulations, the construction industry persistently ranks among the most hazardous workplaces. Statistics reveal a disproportionately high incidence of serious injuries and fatalities compared to other sectors, stemming from falls, struck-by incidents, electrocution, and caught-in/between hazards. This enduring risk isn’t due to a lack of rules, but rather the inherent complexity of construction sites – constantly evolving environments with numerous moving parts, diverse trades working in close proximity, and frequently changing conditions. Moreover, factors like tight deadlines, economic pressures, and a transient workforce can contribute to shortcuts and compromised safety practices, highlighting the need for continuous improvement and proactive hazard mitigation strategies beyond simple compliance.

Conventional safety protocols within construction largely depend on periodic inspections, a necessary but often insufficient approach to risk management. These evaluations typically occur after potential hazards are already present, focusing on identifying and correcting existing issues rather than preventing them from arising. On increasingly complex job sites – characterized by dynamic workflows, multiple subcontractors, and ever-changing environments – the reactive nature of these inspections struggles to keep pace. Inspectors face the challenge of assessing a snapshot of conditions that may rapidly evolve, leaving a significant window for unaddressed dangers. This limitation highlights the need for more predictive and continuous monitoring systems capable of proactively identifying risks before they escalate into incidents, moving beyond simple compliance checks towards a truly preventative safety culture.

Construction sites generate an immense flood of visual information – workers, machinery, materials, and constantly shifting conditions – that routinely surpasses the capacity of human inspectors. This cognitive overload leads to inevitable gaps in hazard detection, as crucial safety violations or precarious situations can be missed amidst the visual complexity. Traditional inspection methods, reliant on manual observation, struggle to process this data effectively, increasing the risk of incidents. The human eye, while adept at identifying obvious dangers, simply cannot maintain consistent vigilance across expansive and dynamic environments, highlighting the need for technological solutions capable of augmenting inspection processes and proactively identifying potential risks before they escalate into accidents.

This prompt effectively filters out high-risk hazards, ensuring safer operation.
This prompt effectively filters out high-risk hazards, ensuring safer operation.

Shifting the Burden: Automated Hazard Detection Systems

Automated Hazard Detection systems utilize artificial intelligence to provide continuous monitoring of construction site imagery, shifting from reactive incident reporting to proactive risk mitigation. These systems ingest visual data – typically from site cameras or drone footage – and employ algorithms to identify potential safety violations and hazardous conditions in real-time. This continuous analysis allows for immediate alerts to site supervisors, enabling prompt corrective action before incidents occur. The implementation of such systems reduces reliance on manual inspections, increases the frequency of hazard identification, and contributes to a safer working environment by addressing risks as they emerge rather than after an event has transpired.

The training of automated hazard detection models relies on large, annotated datasets like the ConstructionSite10k Dataset, which comprises over 10,000 images of construction environments with detailed bounding box annotations for common hazards. These hazards include objects such as missing guardrails, improper personal protective equipment (PPE) like hard hats and safety vests, standing water, tools left unattended, and material stockpiling violations. The scale of these datasets is critical; the large number of images allows for the development of robust models capable of generalizing to diverse construction site conditions and minimizing false positive detections. Datasets also frequently include metadata regarding lighting conditions, time of day, and weather, further improving model accuracy and reliability in real-world applications.

Research indicates that prompt-based methodologies leveraging pre-trained large language and vision-language models offer a viable alternative to traditional fine-tuning for construction safety analysis. These models, trained on extensive general datasets, can be adapted to identify hazards through carefully constructed prompts without requiring substantial additional training on construction-specific data. Performance evaluations demonstrate competitive accuracy in hazard detection when compared to models requiring extensive fine-tuning, suggesting a reduction in both data labeling effort and computational resources. This approach facilitates rapid deployment and scalability of automated hazard detection systems in dynamic construction environments.

The Illusion of Intelligence: Harnessing LLMs and Prompt Engineering

GPT-4o, a multimodal large language model (LLM) and vision language model (VLM), was utilized to automatically generate detailed textual descriptions of construction site images. These descriptions serve as a foundational element for subsequent hazard assessment processes by providing critical contextual information derived directly from visual data. The model processes images and outputs comprehensive scene descriptions, enabling the identification of objects, spatial relationships, and potential safety concerns without manual image review. This automated description generation is a key component in scaling hazard identification across large datasets of construction site imagery.

Prompt engineering was critical to maximizing the performance of the LLM in identifying potential accident scenarios. This involved techniques such as semantic prompting, which focuses on conveying the meaning of the desired output rather than relying on specific keywords. By structuring prompts to emphasize contextual understanding and desired reasoning processes, the model’s ability to accurately assess construction site images was significantly improved. This optimization process moved beyond simple instruction-following and encouraged the LLM to interpret visual data with a focus on safety-relevant features, leading to enhanced identification of hazards and potential accident precursors.

The developed textual pipeline demonstrated high performance in scene description, achieving 89% accuracy when evaluated against a manually labeled test set. Comparative analysis using alternative models revealed that Qwen2-VL-2B attained a 72.6% F1 score, and Molmo-7B achieved a 67.2% F1 score. Importantly, both Qwen2-VL-2B and Molmo-7B achieved these results utilizing prompt ensembles, indicating that combining multiple prompts improves model performance beyond single-prompt approaches.

The image illustrates the prompt used to generate a scene description.
The image illustrates the prompt used to generate a scene description.

Beyond Band-Aids: Towards a Truly Safer Construction Future

This innovative system furnishes actionable intelligence for bolstering construction safety regulations by moving beyond simple hazard identification. It achieves this through the continuous collection and analysis of on-site data, discerning patterns and predicting potential risks with a granularity previously unattainable. The resulting data-driven insights allow regulatory bodies to refine existing standards based on real-world conditions, rather than relying solely on generalized guidelines or post-incident analysis. This proactive approach facilitates the development of more effective preventative measures, leading to a demonstrably safer working environment and a reduction in costly accidents. Furthermore, the system’s ability to highlight areas of frequent non-compliance informs targeted training programs and resource allocation, ensuring safety protocols are consistently upheld across all construction projects.

The automated monitoring of Personal Protective Equipment (PPE) compliance represents a crucial advancement in construction site safety. This system moves beyond simple hazard identification to actively verify that workers are utilizing essential safety gear – hard hats, safety vests, and eye protection – in real-time. By flagging instances of non-compliance, the technology not only prompts immediate corrective action, minimizing worker exposure to potential injuries, but also generates a detailed audit trail. This documentation is invaluable for demonstrating due diligence, reducing potential liability in the event of an accident, and supporting comprehensive safety training programs. Consequently, proactive PPE compliance monitoring fosters a stronger safety culture and significantly lowers the risk of costly incidents and legal repercussions.

Construction sites present inherent risks, but proactive hazard mitigation can substantially decrease accident rates. This system focuses on identifying potential fall protection failures – a leading cause of serious injury and fatality – by analyzing visual data for missing or improperly used safety equipment. Simultaneously, it assesses hazard proximity, alerting personnel to dangerous conditions like unprotected edges or active machinery before incidents occur. By shifting from reactive responses to preventative measures, this technology doesn’t merely document unsafe acts, but actively contributes to a safer work environment and a demonstrable reduction in construction-related accidents, fostering a culture of safety through real-time awareness and intervention.

The pursuit of automated hazard detection, as detailed in this research, feels predictably optimistic. It proposes leveraging large language and vision-language models to identify dangers on construction sites without burdensome retraining. This approach, while promising, will inevitably encounter the realities of production environments. As Marvin Minsky observed, “The most effective way to learn is through experiencing failure.” The system might elegantly parse reports and images in a controlled setting, but the moment it faces the chaotic input of a real construction site – inconsistent lighting, obscured views, ambiguous descriptions – the cracks will appear. The hazard detection will degrade, requiring constant patching and refinement, transforming a novel architecture into another layer of technical debt. It’s a useful step, certainly, but one destined to become commonplace, and then, eventually, problematic.

What’s Next?

The demonstrated capacity of large language and vision-language models to identify construction hazards without bespoke training is… predictably encouraging. It feels a bit like discovering a new way to automate something people were already doing imperfectly, and the inevitable regression to the mean will be fascinating. The current reliance on prompt engineering, while effective, introduces a fragility. A slightly altered report format, a different camera angle, and the whole system requires another round of carefully crafted queries. One suspects the next generation of ‘innovation’ will involve endlessly chasing those edge cases.

The real challenge isn’t simply detecting hazards, it’s preventing them. This work identifies that a trench lacks shoring, but doesn’t address why the shoring is missing – the logistical breakdown, the time pressure, the supervisor who signed off on it. Those are the problems that remain stubbornly resistant to algorithmic solutions. Expect to see a proliferation of increasingly complex multimodal systems, each layering more abstraction on top of the same fundamental human failings.

Ultimately, this feels less like a breakthrough and more like a shift in where the technical debt accumulates. Previously, it was in hand-labeled datasets and custom algorithms. Now, it’s in the prompt libraries and the ever-shifting landscape of model APIs. Everything new is just the old thing with worse docs.


Original article: https://arxiv.org/pdf/2511.15720.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

See also:

2025-11-21 21:49