Presentation

Increasing transparency of LLM systems does not always improve people's verification behavior and performance: Results from an empirical study of an AI-assisted intelligence reporting task
Description

Large language models like GPT-4 offer powerful capabilities for summarizing and retrieving targeted information from large text corpora, making them attractive tools for rapidly completing mission-driven tasks in high-stakes domains like national security. However, their tendency to generate plausible but inaccurate outputs poses challenges for safe integration. Practitioner-derived standards for information quality like the Multisource AI Scorecard Table have introduced criteria for transparent, verifiable analytic outputs, but their effectiveness in supporting human-AI interaction remains under-explored. We developed an experimental platform called MASTOPIA that uses GPT-4 with retrieval-augmented generation to simulate an AI assistant that can analyze simulated news reports to aid in an intelligence reporting task. In a transparency-enhanced version of the system, backend prompts were designed to meet certain criteria as specified in the Multisource AI Scorecard Table. Online participants from Prolific completed an intelligence reporting task while interacting with either a baseline or the transparency-enhanced version of the system. Results from 288 participants indicate that transparency enhancements did not improve performance, and in some marginal circumstances, decreased it. Potentially, additional explanatory information overwhelmed participants or induced over-reliance, reducing analytical engagement. These findings highlight that simply increasing transparency in large language model outputs may not support better human decision-making in areas where people are not already experts. Information overload and interface complexity must be managed carefully.