Policy Maps: Tools for Guiding the Unbounded Space of LLM Behaviors
AI policy sets boundaries on acceptable behavior for AI models, but this is challenging in the context of large language models (LLMs): how do you ensure coverage over a vast behavior space? We introduce policy maps, an approach to AI policy design inspired by the practice of physical mapmaking. Instead of aiming for full coverage, policy maps aid effective navigation through intentional design choices about which aspects to capture and which to abstract away. With Policy Projector, an interactive tool for designing LLM policy maps, an AI practitioner can survey the landscape of model input-output pairs, define custom regions (e.g., “violence”), and navigate these regions with if-then policy rules that can act on LLM outputs (e.g., if output contains “violence” and “graphic details,” then rewrite without “graphic details”). Policy Projector supports interactive policy authoring using LLM classification and steering and a map visualization reflecting the AI practitioner’s work. In an evaluation with 12 AI safety experts, our system helps policy designers craft policies around problematic model behaviors such as incorrect gender assumptions and handling of immediate physical safety threats.
2025 · Michelle S. Lam et al. · Explainable AI (XAI); Algorithmic Transparency & Auditability; Algorithmic Fairness & Bias · UIST

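The if-then rule from the abstract can be sketched as a small, declarative check. This is a minimal illustration, not the paper's implementation: the `contains` keyword matcher below is a hypothetical stand-in for the LLM classifiers Policy Projector actually uses, and all names and lexicons are illustrative.

```python
# Hypothetical sketch of an if-then policy rule in the style described by
# Policy Projector. A toy lexicon matcher stands in for LLM classification.

CONCEPTS = {
    "violence": ["attack", "stab", "shoot"],
    "graphic details": ["blood", "gore", "wound"],
}

def contains(text, concept):
    """Toy concept detector; the real system uses LLM classifiers."""
    return any(term in text.lower() for term in CONCEPTS[concept])

def apply_policy(output):
    """If output matches 'violence' and 'graphic details', flag for rewrite."""
    if contains(output, "violence") and contains(output, "graphic details"):
        return ("rewrite", "remove graphic details")
    return ("allow", None)
```

A rule engine of this shape makes each policy region auditable: the condition and the action are both explicit data rather than prompt text.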
SQUIRE: Interactive UI Authoring via Slot QUery Intermediate REpresentations
Frontend developers create UI prototypes to evaluate alternatives, which is a time-consuming process of repeated iteration and refinement. Generative AI code assistants enable rapid prototyping simply by prompting through a chat interface rather than writing code. However, while this interaction gives developers flexibility since they can write any prompt they wish, it makes it challenging to control what is generated. First, natural language on its own can be ambiguous, making it difficult for developers to precisely communicate their intentions. Second, the model may respond unpredictably, requiring the developer to re-prompt through trial-and-error to repair any undesired changes. To address these weaknesses, we introduce SQUIRE, a system designed for guided prototype exploration and refinement. In SQUIRE, the developer incrementally builds a UI component tree by pointing and clicking on different alternatives suggested by the system. Additional affordances let the developer refine the appearance of the targeted UI. All interactions are explicitly scoped, with guarantees on what portions of the UI will and will not be mutated. The system is supported by a novel intermediate representation called SQUIREIR with language support for controlled exploration and refinement. Through a user study where 11 frontend developers used SQUIRE to implement mobile web prototypes, we find that developers effectively explore and iterate on different UI alternatives with high levels of perceived control. Developers additionally scored SQUIRE positively for usability and general satisfaction. Our findings suggest the strong potential for code generation to be controlled in rapid UI prototyping tools by combining chat with explicitly scoped affordances.
2025 · Alan Leung et al. · Human-LLM Collaboration; Knowledge Worker Tools & Workflows · UIST

From Interaction to Impact: Towards Safer AI Agents Through Understanding and Evaluating Mobile UI Operation Impacts
With advances in generative AI, there is increasing work towards creating autonomous agents that can manage daily tasks by operating user interfaces (UIs). While prior research has studied the mechanics of how AI agents might navigate UIs and understand UI structure, the effects of agents and their autonomous actions—particularly those that may be risky or irreversible—remain under-explored. In this work, we investigate the real-world impacts and consequences of mobile UI actions taken by AI agents. We began by developing a taxonomy of the impacts of mobile UI actions through a series of workshops with domain experts. Following this, we conducted a data synthesis study to gather realistic mobile UI screen traces and action data that users perceive as impactful. We then used our impact categories to annotate our collected data and data repurposed from existing mobile UI navigation datasets. Our quantitative evaluations of different large language models (LLMs) and variants demonstrate how well different LLMs can understand the impacts of mobile UI actions that might be taken by an agent. We show that our taxonomy enhances the reasoning capabilities of these LLMs for understanding the impacts of mobile UI actions, but our findings also reveal significant gaps in their ability to reliably classify more nuanced or complex categories of impact.
2025 · Zhuohao (Jerry) Zhang et al. · Generative AI (Text, Image, Music, Video); AI-Assisted Decision-Making & Automation · IUI

ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations
Multimodal Vision-Language Models (VLMs) enable powerful applications from their fused understanding of images and language, but many perform poorly on UI tasks due to the lack of UI training data. In this paper, we adapt a recipe for generating paired text-image training data for VLMs to the UI domain by combining existing pixel-based methods with a Large Language Model (LLM). Unlike prior art, our method requires no human-provided annotations, and it can be applied to any dataset of UI screenshots. We generate a dataset of 353K conversational examples paired with UIs that cover Q&A, UI descriptions, and planning, and use it to fine-tune a conversational VLM for UI tasks. To assess the performance of our model, we benchmark it on UI element detection tasks, evaluate response quality, and showcase its applicability to UI verification.
2025 · Yue Jiang et al. · Voice User Interface (VUI) Design; Human-LLM Collaboration · IUI

Text-to-SQL Domain Adaptation via Human-LLM Collaborative Data Annotation
Text-to-SQL models, which parse natural language (NL) questions to executable SQL queries, are increasingly adopted in real-world applications. However, deploying such models in the real world often requires adapting them to the highly specialized database schemas used in specific applications. We observe that the performance of existing text-to-SQL models drops dramatically when applied to a new schema, primarily due to the lack of domain-specific data for fine-tuning. Furthermore, this lack of data for the new schema also hinders our ability to effectively evaluate the model's performance in the new domain. Nevertheless, it is expensive to continuously obtain text-to-SQL data for an evolving schema in most real-world applications. To bridge this gap, we propose SQLsynth, a human-in-the-loop text-to-SQL data annotation system. SQLsynth streamlines the creation of high-quality text-to-SQL datasets through collaboration between humans and a large language model in a structured workflow. A within-subject user study comparing SQLsynth to manual annotation and ChatGPT reveals that SQLsynth significantly accelerates text-to-SQL data annotation, reduces cognitive load, and produces datasets that are more accurate, natural, and diverse. Our code is available at https://github.com/adobe/nl_sql_analyzer.
2025 · Yuan Tian et al. · Human-LLM Collaboration; AutoML Interfaces · IUI

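A text-to-SQL annotation of the kind SQLsynth helps produce is, at minimum, an NL question paired with a SQL query that must execute against the target schema. The sketch below shows one such hypothetical record and a sanity check using an in-memory SQLite database; the field names and schema are illustrative assumptions, not the system's actual format.

```python
import sqlite3

# Hypothetical shape of one text-to-SQL annotation record (illustrative only).
record = {
    "question": "How many orders were placed in 2024?",
    "sql": "SELECT COUNT(*) FROM orders WHERE strftime('%Y', placed_at) = '2024'",
}

def validate_sql(record, schema_ddl):
    """Check that the annotated SQL at least executes against the schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema_ddl)  # build the target schema in memory
    try:
        conn.execute(record["sql"])
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()
```

Executability checks like this catch schema drift (renamed columns, dropped tables) cheaply, though they cannot verify that the SQL actually answers the NL question.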
VibE: A Visual Analytics Workflow for Semantic Error Analysis of CVML Models at Subgroup Level
Effective error analysis is critical for the successful development and deployment of computer vision and machine learning (CVML) models. One approach to understanding model errors is to summarize the common characteristics of error samples. This can be particularly challenging in tasks that utilize unstructured, complex data such as images, where patterns are not always obvious. Another method is to analyze error distributions across pre-defined categories, which requires analysts to hypothesize about potential error causes in advance. However, forming such hypotheses without access to explicit labels or annotations makes it difficult to isolate meaningful subgroups or patterns, as analysts must rely on manual inspection, prior expertise, or intuition. This lack of structured guidance can hinder a comprehensive understanding of where models fail. To address these challenges, we introduce VibE, a semantic error analysis workflow designed to identify where and why CVML models fail at the subgroup level, even when labels or annotations are unavailable. VibE incorporates several core features to enhance error analysis: semantic subgroup generation, semantic summarization, candidate issue proposals, semantic concept search, and interactive subgroup analysis. By leveraging large foundation models (such as CLIP and GPT-4) alongside visual analytics, VibE enables developers to semantically interpret and analyze CVML model errors. This interactive workflow helps identify errors through subgroup discovery, supports hypothesis generation with auto-generated subgroup summaries and suggested issues, and allows hypothesis validation through semantic concept search and comparative analysis. Through three diverse CVML tasks and in-depth expert interviews, we demonstrate how VibE can assist error understanding and analysis.
2025 · Jun Yuan et al. · Human-LLM Collaboration; Interactive Data Visualization · IUI

Towards AI-driven Sign Language Generation with Non-manual Markers
Sign languages are essential for the Deaf and Hard-of-Hearing (DHH) community. Sign language generation systems have the potential to support communication by translating from written languages, such as English, into signed videos. However, current systems often fail to meet user needs due to poor translation of grammatical structures, the absence of facial cues and body language, and insufficient visual and motion fidelity. We address these challenges by building on recent advances in LLMs and video generation models to translate English sentences into natural-looking videos of AI ASL signers. The text component of our model extracts information for manual and non-manual components of ASL, which are used to synthesize skeletal pose sequences and corresponding video frames. Our findings from a user study with 30 DHH participants and thorough technical evaluations demonstrate significant progress and identify critical areas necessary to meet user needs.
2025 · Han Zhang et al. (University of Washington, Paul G. Allen School of Computer Science and Engineering) · Voice Accessibility; Deaf & Hard-of-Hearing Support (Captions, Sign Language, Vibration) · CHI

Exploring Empty Spaces: Human-in-the-Loop Data Augmentation
Data augmentation is crucial to make machine learning models more robust and safe. However, augmenting data can be challenging as it requires generating diverse data points to rigorously evaluate model behavior on edge cases and mitigate potential harms. Creating high-quality augmentations that cover these "unknown unknowns" is a time- and creativity-intensive task. In this work, we introduce Amplio, an interactive tool to help practitioners navigate "unknown unknowns" in unstructured text datasets and improve data diversity by systematically identifying empty data spaces to explore. Amplio includes three human-in-the-loop data augmentation techniques: Augment with Concepts, Augment by Interpolation, and Augment with Large Language Model. In a user study with 18 professional red teamers, we demonstrate the utility of our augmentation methods in helping generate high-quality, diverse, and relevant model safety prompts. We find that Amplio enabled red teamers to augment data quickly and creatively, highlighting the transformative potential of interactive augmentation workflows.
2025 · Catherine Yeh et al. (Harvard University) · Generative AI (Text, Image, Music, Video); Human-LLM Collaboration; Computational Methods in HCI · CHI

InterLink: Linking Text with Code and Output in Computational Notebooks
Computational notebooks, widely used for ad-hoc analysis and often shared with others, can be difficult to understand because the standard linear layout is not optimized for reading. In particular, related text, code, and outputs may be spread across the UI, making it difficult to draw connections. In response, we introduce InterLink, a plugin designed to present the relationships between text, code, and outputs, thereby making notebooks easier to understand. In a formative study, we identify pain points and derive design requirements for identifying and navigating relationships among various pieces of information within notebooks. Based on these requirements, InterLink features a new layout that separates text from code and outputs into two columns. It uses visual links to signal relationships between text and associated code and outputs and offers interactions for navigating related pieces of information. In a user study with 12 participants, those using InterLink were 13.6% more accurate at finding and integrating information from complex analyses in computational notebooks. These results show the potential of notebook layouts that make them easier to understand.
2025 · Yanna Lin et al. (The Hong Kong University of Science and Technology, Department of Computer Science and Engineering; Human-Computer Interaction Institute, Carnegie Mellon University) · Interactive Data Visualization; Knowledge Worker Tools & Workflows · CHI

Misty: UI Prototyping Through Interactive Conceptual Blending
UI prototyping often involves iterating and blending elements from examples such as screenshots and sketches, but current tools offer limited support for incorporating these examples. Inspired by the cognitive process of conceptual blending, we introduce a novel UI workflow that allows developers to rapidly incorporate diverse aspects from design examples into work-in-progress UIs. We prototyped this workflow as Misty. Through an exploratory first-use study with 14 frontend developers, we assessed Misty's effectiveness and gathered feedback on this workflow. Our findings suggest that Misty's conceptual blending workflow helps developers kickstart creative explorations, flexibly specify intent in different stages of prototyping, and inspires developers through serendipitous UI blends. Misty demonstrates the potential for tools that blur the boundaries between developers and designers.
2025 · Yuwen Lu et al. (University of Notre Dame, Computer Science and Engineering) · Knowledge Worker Tools & Workflows; Prototyping & User Testing · CHI

ProxiCycle: Passively Mapping Cyclist Safety Using Smart Handlebars for Near-Miss Detection
Active transportation is a valuable tool to prevent some of the most common causes of mortality worldwide, but is severely underutilized. The primary factors preventing cyclist adoption are safety concerns, specifically, the fear of collisions with automobiles. One solution to address this concern is to direct cyclists to known safe routes to minimize risk and stress, thus making cycling more approachable. However, few localized safety priors are available, hindering safety-based routing. Specifically, road user behavior is unknown. To address this issue, we develop a novel handlebar attachment to passively monitor the proximity of passing cars as an indicator of cycling safety along historically traveled routes. We deploy this sensor with 15 experienced cyclists in a two-month longitudinal study to source a citywide map of car passing distance. We then compare this signal to both historic collisions and perceived safety reported by experienced and inexperienced cyclists.
2025 · Joseph Breda et al. (University of Washington, Paul G. Allen School of Computer Science & Engineering) · Motion Sickness & Passenger Experience; Pedestrian & Cyclist Safety · CHI

eaSEL: Promoting Social-Emotional Learning and Parent-Child Interaction through AI-Mediated Content Consumption
As children increasingly consume media on devices, parents look for ways this usage can support learning and growth, especially in domains like social-emotional learning. We introduce eaSEL, a system that (a) integrates social-emotional learning (SEL) curricula into children’s video consumption by generating reflection activities and (b) facilitates parent-child discussions around digital media without requiring co-consumption of videos. We present a technical evaluation of our system’s ability to detect social-emotional moments within a transcript and to generate high-quality SEL-based activities for both children and parents. Through a user study with N = 20 parent-child dyads, we find that after completing an eaSEL activity, children reflect more on the emotional content of videos. Furthermore, parents find that the tool promotes meaningful active engagement and could scaffold deeper conversations around content. Our work paves the way for AI to support children’s social-emotional reflection on media and family connections in the digital age.
2025 · Jocelyn J Shen et al. (Massachusetts Institute of Technology, MIT Media Lab) · Early Childhood Education Technology; Collaborative Learning & Peer Teaching; Mental Health Apps & Online Support Communities · CHI

Perceptions of the Fairness Impacts of Multiplicity in Machine Learning
Machine learning (ML) is increasingly used in high-stakes settings, yet multiplicity – the existence of multiple good models – means that some predictions are essentially arbitrary. ML researchers and philosophers posit that multiplicity poses a fairness risk, but no studies have investigated whether stakeholders agree. In this work, we conduct a survey to see how multiplicity impacts lay stakeholders’ – i.e., decision subjects’ – perceptions of ML fairness, and which approaches to address multiplicity they prefer. We investigate how these perceptions are modulated by task characteristics (e.g., stakes and uncertainty). Survey respondents think that multiplicity threatens the fairness of model outcomes, but not the appropriateness of using the model, even though existing work suggests the opposite. Participants are strongly against resolving multiplicity by using a single model (effectively ignoring multiplicity) or by randomizing the outcomes. Our results indicate that model developers should be intentional about dealing with multiplicity in order to maintain fairness.
2025 · Anna P. Meyer et al. (University of Wisconsin - Madison) · Explainable AI (XAI); Algorithmic Fairness & Bias · CHI

Towards Automated Accessibility Report Generation for Mobile Apps
ACM DL: https://dl.acm.org/doi/full/10.1145/3674967
Many apps have basic accessibility issues, like missing labels or low contrast. To supplement manual testing, automated tools can help developers and QA testers find basic accessibility issues, but they can be laborious to use or require writing dedicated tests. To motivate our work, we interviewed eight accessibility QA professionals at a large technology company. From these interviews, we synthesized three design goals for accessibility report generation systems. Motivated by these goals, we developed a system to generate whole app accessibility reports by combining varied data collection methods (e.g., app crawling, manual recording) with an existing accessibility scanner. Many such scanners are based on single-screen scanning, and a key problem in whole app accessibility reporting is to effectively de-duplicate and summarize issues collected across an app. To this end, we developed a screen grouping model with 96.9% accuracy (88.8% F1-score) and UI element matching heuristics with 97% accuracy (98.2% F1-score). We combine these technologies in a system to report and summarize unique issues across an app, and enable a unique pixel-based ignore feature to help engineers and testers better manage reported issues across their app’s lifetime. We conducted a user study where 19 accessibility engineers and testers used multiple tools to create lists of prioritized issues in the context of an accessibility audit. Our system helped them create lists they were more satisfied with while addressing key limitations of current accessibility scanning tools.
2024 · Amanda Swearngin et al. · Voice Accessibility; Universal & Inclusive Design; Privacy Perception & Decision-Making · UIST

UIClip: A Data-driven Model for Assessing User Interface Design
User interface (UI) design is a difficult yet important task for ensuring the usability, accessibility, and aesthetic qualities of applications. In our paper, we develop a machine-learned model, UIClip, for assessing the design quality and visual relevance of a UI given its screenshot and natural language description. To train UIClip, we used a combination of automated crawling, synthetic augmentation, and human ratings to construct a large-scale dataset of UIs, collated by description and ranked by design quality. Through training on the dataset, UIClip implicitly learns properties of good and bad designs by (i) assigning a numerical score that represents a UI design's relevance and quality and (ii) providing design suggestions. In an evaluation that compared the outputs of UIClip and other baselines to UIs rated by 12 human designers, we found that UIClip achieved the highest agreement with ground-truth rankings. Finally, we present three example applications that demonstrate how UIClip can facilitate downstream applications that rely on instantaneous assessment of UI design quality: (i) UI code generation, (ii) UI design tips generation, and (iii) quality-aware UI example search.
2024 · Jason Wu et al. · 360° Video & Panoramic Content; Human-LLM Collaboration; Explainable AI (XAI) · UIST

Vision-Based Hand Gesture Customization from a Single Demonstration
Hand gesture recognition is becoming a more prevalent mode of human-computer interaction, especially as cameras proliferate across everyday devices. Despite continued progress in this field, gesture customization is often underexplored. Customization is crucial since it enables users to define and demonstrate gestures that are more natural, memorable, and accessible. However, customization requires efficient usage of user-provided data. We introduce a method that enables users to easily design bespoke gestures with a monocular camera from one demonstration. We employ transformers and meta-learning techniques to address few-shot learning challenges. Unlike prior work, our method supports any combination of one-handed, two-handed, static, and dynamic gestures, including different viewpoints, and the ability to handle irrelevant hand movements. We implement three real-world applications using our customization method, conduct a user study, and achieve up to 94% average recognition accuracy from one demonstration. Our work provides a viable path for vision-based gesture customization, laying the foundation for future advancements in this domain.
2024 · Soroush Shahi et al. · Hand Gesture Recognition; Human Pose & Activity Recognition; Fitness Tracking & Physical Activity Monitoring · UIST

Model Compression in Practice: Lessons Learned from Practitioners Creating On-device Machine Learning Experiences
On-device machine learning (ML) promises to improve the privacy, responsiveness, and proliferation of new, intelligent user experiences by moving ML computation onto everyday personal devices. However, today's large ML models must be drastically compressed to run efficiently on-device, a hurdle that requires deep, yet currently niche expertise. To engage the broader human-centered ML community in on-device ML experiences, we present the results from an interview study with 30 experts at Apple that specialize in producing efficient models. We compile tacit knowledge that experts have developed through practical experience with model compression across different hardware platforms. Our findings offer pragmatic considerations missing from prior work, covering the design process, trade-offs, and technical strategies that go into creating efficient models. Finally, we distill design recommendations for tooling to help ease the difficulty of this work and bring on-device ML into more widespread practice.
2024 · Fred Hohman et al. (Apple) · Human-LLM Collaboration; AutoML Interfaces · CHI

AXNav: Replaying Accessibility Tests from Natural Language
Developers and quality assurance testers often rely on manual testing to test accessibility features throughout the product lifecycle. Unfortunately, manual testing can be tedious, often has an overwhelming scope, and can be difficult to schedule amongst other development milestones. Recently, Large Language Models (LLMs) have been used for a variety of tasks including automation of UIs. However, to our knowledge, no one has yet explored the use of LLMs in controlling assistive technologies for the purposes of supporting accessibility testing. In this paper, we explore the requirements of a natural-language-based accessibility testing workflow, starting with a formative study. From this, we build a system that takes a manual accessibility test instruction in natural language (e.g., "Search for a show in VoiceOver") as input and uses an LLM combined with pixel-based UI understanding models to execute the test and produce a chaptered, navigable video. In each video, to help QA testers, we apply heuristics to detect and flag accessibility issues (e.g., text size not increasing with Large Text enabled, VoiceOver navigation loops). We evaluate this system through a 10-participant user study with accessibility QA professionals who indicated that the tool would be very useful in their current work and performed tests similarly to how they would manually test the features. The study also reveals insights for future work on using LLMs for accessibility testing.
2024 · Maryam Taeb et al. (Florida State University) · Voice Accessibility; Human-LLM Collaboration · CHI

Talaria: Interactively Optimizing Machine Learning Models for Efficient Inference
On-device machine learning (ML) moves computation from the cloud to personal devices, protecting user privacy and enabling intelligent user experiences. However, fitting models on devices with limited resources presents a major technical challenge: practitioners need to optimize models and balance hardware metrics such as model size, latency, and power. To help practitioners create efficient ML models, we designed and developed Talaria: a model visualization and optimization system. Talaria enables practitioners to compile models to hardware, interactively visualize model statistics, and simulate optimizations to test the impact on inference metrics. Since its internal deployment two years ago, we have evaluated Talaria using three methodologies: (1) a log analysis highlighting its growth of 800+ practitioners submitting 3,600+ models; (2) a usability survey with 26 users assessing the utility of 20 Talaria features; and (3) a qualitative interview with the 7 most active users about their experience using Talaria.
2024 · Fred Hohman et al. (Apple) · Explainable AI (XAI); AutoML Interfaces; Interactive Data Visualization · CHI