Author: Denis Avetisyan
A new analysis reveals the varying accuracy of search engines, large language models, and AI-powered overviews in delivering factual information to Chinese web users.

This research provides a cross-system evaluation of misinformation exposure across Chinese search engines, large language models, and AI Overviews, highlighting topic- and geography-dependent variations in factual reliability.
Despite increasing reliance on AI-mediated information access, the factual reliability of these systems remains a critical concern, particularly in non-English digital ecosystems. This study, ‘Misinformation Exposure in the Chinese Web: A Cross-System Evaluation of Search Engines, LLMs, and AI Overviews’, presents a comprehensive evaluation of the factual accuracy of traditional search engines, standalone Large Language Models, and AI-generated overviews across a diverse set of real-world Chinese queries. The analysis reveals substantial performance differences across both systems and topics, suggesting variable exposure to misinformation for Chinese users – a risk further complicated by regional disparities in search behavior. How can we build more robust and transparent information tools to mitigate these risks and ensure trustworthy access to knowledge in an increasingly AI-driven world?
The Illusion of Truth: Navigating a Sea of Misinformation
The modern information landscape, while seemingly boundless, is surprisingly vulnerable to the spread of misinformation. Although search engines and artificial intelligence models are designed to deliver relevant answers, they are susceptible to propagating inaccuracies present within their source materials or amplified through algorithmic biases. This presents a significant challenge, as reliance on these tools for factual knowledge increases daily; a single incorrect result can influence decisions ranging from personal health choices to broader societal understandings. Consequently, ensuring the reliability of information retrieval systems is paramount, demanding constant vigilance and innovative approaches to verifying and validating the data upon which these technologies depend. The potential for widespread dissemination of false information necessitates a robust framework for evaluating and improving the factual accuracy of search results and AI-generated responses.
The pursuit of genuinely reliable information retrieval necessitates more than just algorithms; it demands a solid foundation of verified data for consistent evaluation. A large-scale dataset of authentic user queries, meticulously checked for factual accuracy, serves as a crucial benchmark for assessing the performance of both search engines and increasingly sophisticated artificial intelligence models. Without such a resource, improvements in information access remain difficult to quantify and potentially mask the continued propagation of misinformation. This dataset doesn’t merely test what a system returns, but rather confirms whether that information aligns with established truth, providing a robust measure of trustworthiness essential for building confidence in automated knowledge systems.
The foundation for a rigorous evaluation of information retrieval systems lies in the newly constructed Factual Query Dataset, built upon the existing T2Ranking Dataset. This dataset comprises 12,161 meticulously verified Yes/No questions, representing a substantial leap forward in benchmarking factual accuracy. Unlike synthetic datasets, these queries are derived from actual Chinese search engine behavior, capturing the nuances of how people genuinely seek information. This grounding in real-world search patterns ensures that evaluations are not merely academic exercises, but directly reflect performance in practical scenarios. By utilizing verified answers, the dataset enables precise measurement of a system’s ability to discern truth, offering a robust tool for advancing the reliability of search and artificial intelligence.
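Evaluation against such a dataset reduces to a simple accuracy computation. A minimal sketch follows, assuming a list of query/answer records and a pluggable predictor; the field names and the always-Yes stub are illustrative, not the paper’s actual schema or systems.

```python
# Minimal sketch of accuracy scoring on a Yes/No factual query dataset.
# The "query"/"answer" field names and the predictor stub are assumptions.

def accuracy(dataset, predict):
    """Fraction of queries whose predicted Yes/No answer matches the verified one."""
    correct = sum(1 for item in dataset if predict(item["query"]) == item["answer"])
    return correct / len(dataset)

# Toy example: three verified queries and a naive predictor.
dataset = [
    {"query": "Is the Yangtze the longest river in China?", "answer": "Yes"},
    {"query": "Is Beijing south of Shanghai?", "answer": "No"},
    {"query": "Was paper invented in China?", "answer": "Yes"},
]

def stub(query):
    return "Yes"  # a predictor that always answers "Yes"

print(accuracy(dataset, stub))  # 2 of 3 correct -> 0.6666666666666666
```

The same function scores a search engine, an LLM, or an AI Overview identically, which is what makes a shared verified dataset a cross-system benchmark.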

How Search Engines Really Work (And Where They Fail)
Traditional retrieval-based search engines, including those utilized by Sogou, Baidu, and Bing, operate by matching user queries to indexed documents based on keyword presence and ranking algorithms. This initial layer of information access relies on pre-existing content and established indexing methods, contrasting with more advanced approaches like generative AI. The process typically involves crawling the web, indexing discovered pages, and then retrieving results based on the identified keywords in a user’s query. Ranking algorithms then prioritize these results, attempting to present the most relevant information first; however, relevance is determined solely by textual matches and associated ranking factors, not semantic understanding.
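The crawl-index-retrieve-rank pipeline described above can be sketched with a toy inverted index. Real engines layer vastly richer ranking signals on top; this shows only the keyword-matching skeleton that distinguishes retrieval from semantic understanding.

```python
# Hedged sketch of retrieval-based search: index documents by keyword,
# retrieve candidates by term match, rank by crude term overlap.
from collections import defaultdict

def build_index(docs):
    """Map each lowercased term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, docs, query):
    """Return candidate doc ids, most query-term overlap first."""
    terms = query.lower().split()
    candidates = set().union(*(index.get(t, set()) for t in terms))
    score = lambda d: sum(t in docs[d].lower().split() for t in terms)
    return sorted(candidates, key=score, reverse=True)

docs = {
    "d1": "The Great Wall is in northern China",
    "d2": "Pandas live in Sichuan China",
    "d3": "Paris is the capital of France",
}
index = build_index(docs)
print(search(index, docs, "china wall"))  # ['d1', 'd2']
```

Note that relevance here is purely lexical: a document stating a falsehood about the Great Wall would rank just as highly, which is exactly the failure mode the accuracy evaluation targets.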
Accurate information retrieval is paramount to maintaining user confidence in search engines and the validity of resulting decisions. Inaccurate results erode trust, potentially leading users to base critical choices – regarding health, finance, or current events – on flawed data. This impact extends beyond individual users; systemic inaccuracies can contribute to the spread of misinformation and negatively affect public discourse. Consequently, evaluating search engine accuracy isn’t merely a technical exercise, but a crucial component of responsible information provision and maintaining a reliable digital ecosystem.
Comparative analysis of leading search engines – including Sogou, Baidu, and Bing – indicates statistically comparable performance in factual query response. Specifically, the Baidu Search Engine demonstrated a 63.7% accuracy rate when evaluated against a dedicated factual query dataset. This metric, representing the proportion of correctly predicted answers, establishes a baseline for assessing the reliability of information delivered to users and facilitates ongoing evaluation of search engine performance improvements and potential biases in information retrieval.

Large Language Models: A False Promise of Truth?
The application of large language models (LLMs) to automated fact-checking represents a growing trend in computational verification. Models such as DeepSeek, Qwen, and LLaMA are being utilized to assess the veracity of statements by leveraging their ability to process and understand natural language. This automation aims to scale fact-checking efforts beyond manual capabilities, enabling the analysis of larger volumes of information. Current implementations involve prompting these LLMs with claims and requesting a determination of their factual accuracy, often framed as a binary true/false classification or a confidence score. The increasing prevalence of LLMs in this domain is driven by their potential to reduce the time and resources required for fact-checking, although ongoing research focuses on mitigating potential inaccuracies and biases within these models.
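The prompting pattern described here can be sketched as follows. The `call_model` parameter is a placeholder for any LLM API, and the prompt wording and reply-parsing rules are assumptions for illustration, not the study’s actual protocol.

```python
# Hedged sketch of binary fact-checking via LLM prompting.
# call_model is a stand-in for a real LLM client; none is assumed here.

def build_prompt(claim: str) -> str:
    return (
        "You are a fact-checker. Answer with exactly one word, Yes or No.\n"
        f"Claim: {claim}\nIs this claim factually correct?"
    )

def parse_verdict(reply: str):
    """Map a free-form model reply to a Yes/No label, or None if unparseable."""
    stripped = reply.strip()
    head = stripped.split()[0].rstrip(".,").lower() if stripped else ""
    return {"yes": "Yes", "no": "No"}.get(head)

def fact_check(claims, call_model):
    return {claim: parse_verdict(call_model(build_prompt(claim))) for claim in claims}

# Usage with a stub standing in for a real model call.
def stub(prompt):
    return "Yes." if "Water" in prompt else "No, that is false."

claims = ["Water boils at 100°C at sea level.", "The Moon is larger than Earth."]
print(fact_check(claims, stub))
```

The fragile step is `parse_verdict`: models often hedge or elaborate instead of answering in one word, and any unparseable reply must be handled explicitly rather than silently coerced to a label.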
Assessing the accuracy of large language models (LLMs) is critical due to their propensity to generate outputs that, while grammatically correct and contextually relevant, contain factual inaccuracies. This phenomenon, often termed “hallucination,” arises from the probabilistic nature of LLM text generation; models predict the next token based on patterns in training data, not necessarily verified truth. Consequently, LLMs can confidently present false or misleading information, making rigorous evaluation – employing established fact-checking benchmarks and methodologies – essential before deploying these models in applications requiring reliable information, such as news aggregation, research assistance, or legal analysis.
Recent evaluations indicate substantial performance in automated fact-checking by several large language models. Qwen currently leads with an accuracy rate of 68.5% in correctly identifying factual claims. DeepSeek follows with 63.3% accuracy, while LLaMA demonstrates competitive, though slightly lower, performance. This computational scrutiny, facilitated by these LLMs, represents a significant advancement in the automated evaluation of information and provides a quantifiable basis for assessing their reliability in fact-checking applications.

Mapping the Spread: Where Misinformation Takes Root
Researchers developed a novel Information Exposure Metric to assess the potential reach of inaccurate information, moving beyond simple search counts. The metric integrates two data points: the volume of searches on a given topic, as captured by the Baidu Index, and a measured system accuracy reflecting the reliability of answers returned by search engines and large language models alike. Combining search interest with informational correctness yields a quantifiable estimate of how likely users are to encounter inaccurate responses, and enables comparative analysis across regions and topics. The metric thus captures not merely the presence of false claims but the scale at which users are likely exposed to them, accounting for both demand and reliability – a valuable tool for tracking and mitigating the spread of false narratives.
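While the paper’s exact formula is not reproduced here, one plausible form of such a metric weights per-topic search volume by the system’s error rate. A sketch with hypothetical numbers:

```python
# Hedged sketch of an exposure-style metric: search volume (e.g. from the
# Baidu Index) weighted by the system's error rate. The paper's actual
# formula may differ; topics and figures below are hypothetical.

def misinformation_exposure(volume_by_topic, accuracy_by_topic):
    """Estimated count of searches likely to return an inaccurate answer."""
    return {
        topic: volume_by_topic[topic] * (1.0 - accuracy_by_topic[topic])
        for topic in volume_by_topic
    }

volumes = {"health": 10_000, "finance": 5_000}      # hypothetical search volumes
accuracies = {"health": 0.50, "finance": 0.75}      # hypothetical system accuracy
print(misinformation_exposure(volumes, accuracies))
# {'health': 5000.0, 'finance': 1250.0}
```

The design choice matters: a topic with modest accuracy but enormous search volume can expose far more users to misinformation than a topic where the system is worse but demand is low, which is why volume and accuracy must be combined rather than reported separately.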
Recent analysis of information access across China demonstrates marked geographic disparities in exposure to misinformation, suggesting uneven access to accurate knowledge. The study reveals that certain provinces face substantially higher risks of encountering inaccurate information compared to others, potentially exacerbating existing inequalities. Notably, Baidu’s AI Overview feature exhibited the highest overall accuracy – reaching 69.8% – in discerning factual content from misinformation within the assessed search results. This indicates that while misinformation remains prevalent, AI-powered search tools are beginning to play a role in filtering potentially harmful content, though considerable room for improvement persists in ensuring equitable access to reliable information nationwide.

The study meticulously maps the contours of misinformation – a familiar landscape, unfortunately. It details how Chinese search engines, LLMs, and those shiny AI Overviews stumble over factual accuracy, varying wildly depending on the topic. They’ll call it AI and raise funding, of course. But the core finding – geographic disparities in exposure – feels particularly bleak. It’s a reminder that elegant models, however sophisticated, ultimately reflect the messy reality of data. As Blaise Pascal observed, ‘The eloquence of the tongue makes the ears deaf.’ In this case, the eloquence of AI can easily mask a deafening silence regarding truth, and the documentation lied again about how well it would handle nuanced information.
The Road Ahead
The observed moderate accuracy across Chinese search engines, LLMs, and AI Overviews feels less like a triumph and more like a temporary reprieve. Any system currently labeled ‘fact-checking’ will inevitably become a catalog of what was misinformation, rendered useless by the next wave of narratives. The variations across topics suggest a landscape where reliability isn’t a property of the technology, but a function of how thoroughly a falsehood has been debunked – or, more accurately, how consistently it’s been re-debunked. The geographic disparities are predictable; information control, after all, isn’t about blocking everything, but about curating the acceptable falsehoods.
Future work will undoubtedly focus on ‘robustness’ and ‘alignment.’ These are engineering terms for ‘delaying the inevitable.’ A system that confidently presents information, even if accurate today, is a system primed for exploitation tomorrow. Documentation of these systems’ limitations will be, as always, a collective self-delusion. If a bug is reproducible, it signifies a stable system – one that will consistently fail in a predictable manner. The real challenge isn’t building more sophisticated algorithms, but accepting that information retrieval, at scale, is fundamentally an exercise in managed uncertainty.
The pursuit of ‘truth’ in these systems feels increasingly like polishing the brass on a sinking ship. More valuable, perhaps, is a deeper investigation into why misinformation spreads, and how systems can be designed to gracefully degrade – to admit ignorance rather than confidently propagate error. Anything self-healing just hasn’t broken yet.
Original article: https://arxiv.org/pdf/2602.22221.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-02-28 09:00