Sensemaking in User-Driven Algorithm Auditing: A Case Study on Gender Bias in an Image Captioning Model

Best Paper
CHI '26

Authors

  • BM, Université Paris-Saclay
  • JF, Université Paris-Saclay
  • MG, Université Paris-Saclay
  • BC, Sorbonne University

Explainable AI (XAI), Algorithmic Transparency & Auditability, Privacy by Design & User Control, AI/ML Researchers & Engineers, UI/UX Designers, HCI Researchers

Publication Info

  • Topic area: User-driven algorithm auditing for detecting gender bias in AI systems.
  • Keywords: Sensemaking, algorithm auditing, gender bias, image captioning, user interfaces, non-expert users, visual-linguistic models, transparency, accountability, human-AI interaction.

Background and Problem

  • Problem / challenge: Algorithmic systems often exhibit biases, such as gender bias, but tools for non-expert users to audit these systems are limited. Existing tools focus on performance evaluation rather than open-ended exploration, leaving gaps in supporting iterative sensemaking processes.
  • Significance: Addressing bias in AI systems is critical for ensuring fairness, transparency, and accountability, particularly as these systems influence societal norms and decision-making.
  • Motivation and related work: Prior research has documented gender bias in image captioning models, including captions that reinforce stereotypes and misclassify roles based on gender. Expert-led audits have identified such biases, but user-driven audits by non-experts remain underexplored. This study builds on the sensemaking framework to design tools that help non-experts uncover and reason about biases.

Solution

  • Proposed approach: Development and evaluation of three interfaces (Baseline, Image Masking Tool, and Text Filtering Tool) designed to support non-experts in auditing gender bias in image captioning models through iterative sensemaking.
  • Novelty:
    1. Application of the sensemaking framework to user-driven algorithm auditing.
    2. Design and evaluation of specialized tools (Masking and Filtering) to support hypothesis generation and evidence collection.
    3. Empirical demonstration of how interface design shapes bias detection and user confidence.
    4. Thematic analysis of gender bias patterns identified by non-expert auditors.
  • Procedure and key techniques:
    • Conducted a between-subjects study with 60 participants using the Salesforce BLIP image captioning model and the Visogender dataset.
    • Participants audited the model under one of three conditions: Baseline (open-ended exploration), Masking (manipulating visual inputs), and Filtering (querying captions by keywords).
    • Data collected included bias cards (hypotheses and evidence) and confidence ratings; the identified biases were then analyzed thematically.
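
The two tool conditions described above can be illustrated with a minimal Python sketch. It is not the study's implementation: the keyword lists, the `Caption` record, and the `filter_captions` and `mask_region` helpers are hypothetical names introduced here, and the image is stood in for by a plain 2D grid of pixel values. The sketch only shows the general idea of querying captions by keyword (Filtering) and hiding a rectangular region of the visual input before re-captioning (Masking).

```python
import re
from dataclasses import dataclass

# Hypothetical keyword lists; the filter terms used in the study are not given here.
FEMININE_TERMS = {"woman", "women", "female", "she", "her"}
MASCULINE_TERMS = {"man", "men", "male", "he", "his", "him"}


@dataclass
class Caption:
    image_id: str
    text: str


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def filter_captions(captions, keywords):
    """Return the captions that contain at least one keyword as a whole word."""
    wanted = {k.lower() for k in keywords}
    return [c for c in captions if wanted & set(tokenize(c.text))]


def mask_region(pixels, top, left, height, width, fill=0):
    """Return a copy of a 2D pixel grid with one rectangle overwritten by `fill`.

    The rectangle is clipped to the grid, and the input grid is left unchanged.
    """
    masked = [row[:] for row in pixels]
    for r in range(max(top, 0), min(top + height, len(masked))):
        for c in range(max(left, 0), min(left + width, len(masked[r]))):
            masked[r][c] = fill
    return masked


# Illustrative captions, invented for this sketch (not model output).
example_captions = [
    Caption("img1", "a woman in a white coat talking to a patient"),
    Caption("img2", "a doctor looking at an x-ray"),
    Caption("img3", "a man holding a stethoscope"),
]
feminine_hits = filter_captions(example_captions, FEMININE_TERMS)
masculine_hits = filter_captions(example_captions, MASCULINE_TERMS)
```

Matching on whole-word tokens rather than substrings matters for this kind of audit: a substring search for "man" would also match "woman" and blur exactly the asymmetry the auditor is trying to observe.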

Results

  • Concrete findings:
    • Participants identified four main patterns of gender bias: reinforcement of stereotypes, prioritization of gender over profession, biased reliance on visual cues, and gendered language hierarchies.
    • The Masking Tool revealed inconsistencies in role attribution based on visual cues, while the Filtering Tool exposed systemic linguistic asymmetries.
    • Participants in the Filtering condition collected significantly more evidence per hypothesis (mean = 5.25 items) than participants in the Masking condition.
    • Confidence ratings correlated with the amount of evidence collected in tool-enabled conditions but not in the Baseline condition.
  • Advantage over baselines:
    • Masking enabled fine-grained, counterfactual testing of visual cues, uncovering biases like role misattribution when gender cues were obscured.
    • Filtering facilitated the detection of broader linguistic patterns, such as markedness and gendered descriptors.
    • Both tools supported more diverse and systematic bias identification compared to the Baseline interface.
  • Experiments / evaluation:
    • Participants: 60 non-experts (balanced gender, diverse educational backgrounds, no prior auditing experience).
    • Dataset: 80 images from the Visogender dataset focusing on medical professions.
    • Metrics: Number of bias cards, evidence items per hypothesis, thematic distribution of biases, and confidence ratings.
  • Limitations and future work:
    • Limited to a fixed dataset and specific domain (medical professions).
    • The participant sample did not include members of marginalized communities, which may have narrowed the range of perspectives.
    • Focused on individual audits; future work should explore collaborative and longitudinal auditing.
    • Collecting interaction logs and studying audits in real-world settings could provide deeper insight into sensemaking processes.
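
As a rough illustration of the metrics listed above, the sketch below computes the mean number of evidence items per hypothesis for each condition and the correlation between evidence count and confidence rating within a condition. Everything in it is an assumption made for illustration: the `CARDS` records are invented toy values rather than the study's data, the function names are hypothetical, and Pearson's coefficient is used only because this summary does not say which correlation measure the authors applied.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean

# Toy bias-card records: (condition, evidence items, confidence rating).
# Invented values for illustration only; NOT the study's data.
CARDS = [
    ("Baseline", 1, 5), ("Baseline", 2, 3), ("Baseline", 3, 4),
    ("Masking", 2, 3), ("Masking", 3, 4), ("Masking", 4, 5),
    ("Filtering", 5, 4), ("Filtering", 6, 5), ("Filtering", 4, 3),
]


def mean_evidence_by_condition(cards):
    """Average number of evidence items per hypothesis, grouped by condition."""
    grouped = defaultdict(list)
    for condition, evidence, _confidence in cards:
        grouped[condition].append(evidence)
    return {condition: mean(values) for condition, values in grouped.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


def confidence_evidence_correlation(cards, condition):
    """Correlation between evidence count and confidence within one condition."""
    rows = [(e, c) for cond, e, c in cards if cond == condition]
    return pearson([e for e, _ in rows], [c for _, c in rows])


for name, avg in mean_evidence_by_condition(CARDS).items():
    r = confidence_evidence_correlation(CARDS, name)
    print(f"{name}: mean evidence = {avg:.2f}, r = {r:.2f}")
```

With real data, a rank-based coefficient may be a better fit for ordinal confidence ratings, and a significance test would be needed before drawing any conclusion from the coefficients.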

Summary

This study demonstrates how interface design grounded in the sensemaking framework can empower non-experts to audit gender bias in AI systems. Through a case study on an image captioning model, participants using the Masking and Filtering tools identified diverse patterns of bias, such as stereotyped role assignments and linguistic asymmetries. The tools shaped the granularity of observations and confidence in hypotheses, highlighting the interplay between visual and linguistic signals in model behavior. Future research should expand this approach to other domains, support collaborative auditing, and integrate sensemaking tools into everyday AI interactions to foster transparency and accountability.