SQUIRE: Interactive UI Authoring via Slot QUery Intermediate REpresentations
Frontend developers create UI prototypes to evaluate alternatives, which is a time-consuming process of repeated iteration and refinement. Generative AI code assistants enable rapid prototyping simply by prompting through a chat interface rather than writing code. However, while this interaction gives developers flexibility since they can write any prompt they wish, it makes it challenging to control what is generated. First, natural language on its own can be ambiguous, making it difficult for developers to precisely communicate their intentions. Second, the model may respond unpredictably, requiring the developer to re-prompt through trial-and-error to repair any undesired changes. To address these weaknesses, we introduce SQUIRE, a system designed for guided prototype exploration and refinement. In SQUIRE, the developer incrementally builds a UI component tree by pointing and clicking on different alternatives suggested by the system. Additional affordances let the developer refine the appearance of the targeted UI. All interactions are explicitly scoped, with guarantees on what portions of the UI will and will not be mutated. The system is supported by a novel intermediate representation called SQUIREIR with language support for controlled exploration and refinement. Through a user study where 11 frontend developers used SQUIRE to implement mobile web prototypes, we find that developers effectively explore and iterate on different UI alternatives with high levels of perceived control. Developers additionally scored SQUIRE positively for usability and general satisfaction. Our findings suggest the strong potential for code generation to be controlled in rapid UI prototyping tools by combining chat with explicitly scoped affordances.
2025 · Alan Leung et al. · Human-LLM Collaboration · Knowledge Worker Tools & Workflows · UIST
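The abstract does not detail SQUIREIR, so the following is only a minimal sketch of how a slot-scoped component tree could guarantee that untargeted UI is never mutated. `UINode`, `edit_slot`, and the `slot` field are hypothetical names, not SQUIRE's actual API.

```python
from dataclasses import dataclass, field, replace
from typing import Optional

@dataclass(frozen=True)
class UINode:
    """One component in a UI tree; a node may expose a named slot."""
    kind: str                       # e.g. "Stack", "Image", "Button"
    props: dict = field(default_factory=dict)
    children: tuple = ()
    slot: Optional[str] = None      # slots are the only legal edit targets

def edit_slot(root: UINode, slot: str, new_subtree: UINode) -> UINode:
    """Replace exactly the subtree whose node is tagged `slot`; every other
    subtree is structurally shared, so untargeted UI cannot change."""
    if root.slot == slot:
        return new_subtree
    return replace(root, children=tuple(
        edit_slot(c, slot, new_subtree) for c in root.children))

# Swap only the "hero" slot; the sibling Button is reused object-for-object.
ui = UINode("Stack", children=(
    UINode("Image", {"src": "logo.png"}, slot="hero"),
    UINode("Button", {"label": "Buy"}),
))
ui2 = edit_slot(ui, "hero", UINode("Carousel", {"items": 3}))
assert ui2.children[1] is ui.children[1]   # untouched subtree is shared
```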
From Interaction to Impact: Towards Safer AI Agents Through Understanding and Evaluating Mobile UI Operation Impacts
With advances in generative AI, there is increasing work towards creating autonomous agents that can manage daily tasks by operating user interfaces (UIs). While prior research has studied the mechanics of how AI agents might navigate UIs and understand UI structure, the effects of agents and their autonomous actions—particularly those that may be risky or irreversible—remain under-explored. In this work, we investigate the real-world impacts and consequences of mobile UI actions taken by AI agents. We began by developing a taxonomy of the impacts of mobile UI actions through a series of workshops with domain experts. Following this, we conducted a data synthesis study to gather realistic mobile UI screen traces and action data that users perceive as impactful. We then used our impact categories to annotate our collected data and data repurposed from existing mobile UI navigation datasets. Our quantitative evaluations of different large language models (LLMs) and variants demonstrate how well different LLMs can understand the impacts of mobile UI actions that might be taken by an agent. We show that our taxonomy enhances the reasoning capabilities of these LLMs for understanding the impacts of mobile UI actions, but our findings also reveal significant gaps in their ability to reliably classify more nuanced or complex categories of impact.
2025 · Zhuohao (Jerry) Zhang et al. · Generative AI (Text, Image, Music, Video) · AI-Assisted Decision-Making & Automation · IUI
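One way to picture the quantitative evaluation is a prompt that injects the taxonomy before asking an LLM to classify an action's impact. The categories below are illustrative placeholders; the paper's taxonomy came out of expert workshops and is not reproduced in the abstract.

```python
# Illustrative categories only; the paper's actual taxonomy is not listed here.
IMPACT_CATEGORIES = [
    "no lasting effect", "reversible", "hard to reverse",
    "irreversible", "consequences for other people",
]

def build_impact_prompt(screen: str, action: str) -> str:
    """Inject the taxonomy into the prompt: the manipulation the paper
    evaluates is whether taxonomy-in-context improves impact classification."""
    options = "\n".join(f"- {c}" for c in IMPACT_CATEGORIES)
    return (
        "An autonomous agent is about to act on a mobile UI.\n"
        f"Screen: {screen}\n"
        f"Action: {action}\n"
        "Classify the action's impact as exactly one of:\n"
        f"{options}\n"
        "Reply with the category and a one-sentence justification."
    )

print(build_impact_prompt("Banking app, transfer form", "Tap 'Confirm transfer'"))
```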
ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations
Multimodal Vision-Language Models (VLMs) enable powerful applications from their fused understanding of images and language, but many perform poorly on UI tasks due to the lack of UI training data. In this paper, we adapt a recipe for generating paired text-image training data for VLMs to the UI domain by combining existing pixel-based methods with a Large Language Model (LLM). Unlike prior art, our method requires no human-provided annotations, and it can be applied to any dataset of UI screenshots. We generate a dataset of 353K conversational examples paired with UIs that cover Q&A, UI descriptions, and planning, and use it to fine-tune a conversational VLM for UI tasks. To assess the performance of our model, we benchmark it on UI element detection tasks, evaluate response quality, and showcase its applicability to UI verification.
2025 · Yue Jiang et al. · Voice User Interface (VUI) Design · Human-LLM Collaboration · IUI
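A minimal sketch of the annotation-free recipe: serialize pixel-derived detections into text and let an LLM author a grounded conversation. All names here are assumptions; the paper's actual pipeline and prompts are not given in the abstract.

```python
def synthesize_ui_conversation(elements, llm):
    """Serialize pixel-derived detections into text, then have an LLM invent
    a grounded Q&A pair, with no human annotation in the loop."""
    scene = "; ".join(
        f"{e['type']} '{e.get('text', '')}' at {e['bbox']}" for e in elements)
    prompt = (f"UI contents: {scene}\n"
              "Write one question a user might ask about this screen, "
              "then a faithful answer.")
    return llm(prompt)

# Stub LLM so the sketch runs end to end; a real pipeline would call a model.
fake_llm = lambda p: ("Q: How do I submit the form?", "A: Tap the 'Submit' button.")
print(synthesize_ui_conversation(
    [{"type": "button", "text": "Submit", "bbox": (20, 400, 120, 440)}], fake_llm))
```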
Misty: UI Prototyping Through Interactive Conceptual Blending
UI prototyping often involves iterating and blending elements from examples such as screenshots and sketches, but current tools offer limited support for incorporating these examples. Inspired by the cognitive process of conceptual blending, we introduce a novel UI workflow that allows developers to rapidly incorporate diverse aspects from design examples into work-in-progress UIs. We prototyped this workflow as Misty. Through an exploratory first-use study with 14 frontend developers, we assessed Misty's effectiveness and gathered feedback on this workflow. Our findings suggest that Misty's conceptual blending workflow helps developers kickstart creative explorations and flexibly specify intent at different stages of prototyping, and inspires them through serendipitous UI blends. Misty demonstrates the potential for tools that blur the boundaries between developers and designers.
2025 · Yuwen Lu et al. (University of Notre Dame, Computer Science and Engineering) · Knowledge Worker Tools & Workflows · Prototyping & User Testing · CHI
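As a rough illustration of conceptual blending, one can think of copying only selected aspects of an example's spec into the work-in-progress UI. This dictionary sketch is entirely hypothetical; Misty operates on richer UI representations than flat key-value specs.

```python
def blend(target: dict, example: dict, aspects: list) -> dict:
    """Conceptual-blending-style merge: copy only the chosen aspects
    (e.g., colors, spacing) from an example UI spec into the work-in-progress
    one, leaving everything else untouched. Keys are invented for this sketch."""
    return {**target, **{k: example[k] for k in aspects if k in example}}

wip = {"layout": "list", "accent": "#888", "font": "Inter"}
ex  = {"layout": "grid", "accent": "#e91e63"}
print(blend(wip, ex, aspects=["accent"]))  # only the accent color is blended in
```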
Towards Automated Accessibility Report Generation for Mobile Apps
ACM DL: https://dl.acm.org/doi/full/10.1145/3674967
Many apps have basic accessibility issues, like missing labels or low contrast. To supplement manual testing, automated tools can help developers and QA testers find basic accessibility issues, but they can be laborious to use or require writing dedicated tests. To motivate our work, we interviewed eight accessibility QA professionals at a large technology company. From these interviews, we synthesized three design goals for accessibility report generation systems. Motivated by these goals, we developed a system to generate whole app accessibility reports by combining varied data collection methods (e.g., app crawling, manual recording) with an existing accessibility scanner. Many such scanners are based on single-screen scanning, and a key problem in whole app accessibility reporting is to effectively de-duplicate and summarize issues collected across an app. To this end, we developed a screen grouping model with 96.9% accuracy (88.8% F1-score) and UI element matching heuristics with 97% accuracy (98.2% F1-score). We combine these technologies in a system to report and summarize unique issues across an app, and enable a unique pixel-based ignore feature to help engineers and testers better manage reported issues across their app’s lifetime. We conducted a user study where 19 accessibility engineers and testers used multiple tools to create lists of prioritized issues in the context of an accessibility audit. Our system helped them create lists they were more satisfied with while addressing key limitations of current accessibility scanning tools.
2024 · Amanda Swearngin et al. · Voice Accessibility · Universal & Inclusive Design · Privacy Perception & Decision-Making · UIST
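The de-duplication step can be sketched as keying every finding on (screen group, matched element), which is the role the screen grouping model and element matching heuristics play before summarization. The grouping and matching functions below are toy stand-ins, not the paper's models.

```python
from collections import defaultdict

def deduplicate_issues(issues, group_of_screen, key_of_element):
    """Collapse per-screenshot findings into unique app-level issues, keyed
    on (screen group, matched element)."""
    unique = defaultdict(list)
    for issue in issues:
        key = (group_of_screen(issue["screen"]), key_of_element(issue["element"]))
        unique[key].append(issue)
    return unique

# Toy stand-ins: group screens by name prefix, match elements by their label.
issues = [
    {"screen": "settings_1", "element": "Back", "type": "missing label"},
    {"screen": "settings_2", "element": "Back", "type": "missing label"},
    {"screen": "home_1",     "element": "Buy",  "type": "low contrast"},
]
merged = deduplicate_issues(issues, lambda s: s.split("_")[0], lambda e: e)
print(len(merged))  # 2 unique issues instead of 3 raw findings
```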
UIClip: A Data-driven Model for Assessing User Interface Design
User interface (UI) design is a difficult yet important task for ensuring the usability, accessibility, and aesthetic qualities of applications. In our paper, we develop a machine-learned model, UIClip, for assessing the design quality and visual relevance of a UI given its screenshot and natural language description. To train UIClip, we used a combination of automated crawling, synthetic augmentation, and human ratings to construct a large-scale dataset of UIs, collated by description and ranked by design quality. Through training on the dataset, UIClip implicitly learns properties of good and bad designs by (i) assigning a numerical score that represents a UI design's relevance and quality and (ii) providing design suggestions. In an evaluation that compared the outputs of UIClip and other baselines to UIs rated by 12 human designers, we found that UIClip achieved the highest agreement with ground-truth rankings. Finally, we present three example applications that demonstrate how UIClip can facilitate downstream applications that rely on instantaneous assessment of UI design quality: (i) UI code generation, (ii) UI design tips generation, and (iii) quality-aware UI example search.
2024 · Jason Wu et al. · 360° Video & Panoramic Content · Human-LLM Collaboration · Explainable AI (XAI) · UIST
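The scoring interface suggested by the abstract is CLIP-like: embed a screenshot and a description, then compare them. The sketch below shows only that relevance-scoring shape with made-up embeddings; UIClip's learned quality signal and actual encoders are not modeled here.

```python
import numpy as np

def clip_style_score(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """CLIP-style relevance: cosine similarity between a screenshot embedding
    and a description embedding. UIClip's single score also reflects design
    quality learned from the ranked dataset; this sketch omits that part."""
    a = image_emb / np.linalg.norm(image_emb)
    b = text_emb / np.linalg.norm(text_emb)
    return float(a @ b)

# Rank candidate UIs for one description, e.g. to pick the best generation.
desc = np.array([0.6, 0.8])
candidates = {"ui_a": np.array([0.5, 0.85]), "ui_b": np.array([0.9, -0.2])}
print(max(candidates, key=lambda k: clip_style_score(candidates[k], desc)))  # ui_a
```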
AXNav: Replaying Accessibility Tests from Natural Language
Developers and quality assurance testers often rely on manual testing to test accessibility features throughout the product lifecycle. Unfortunately, manual testing can be tedious, often has an overwhelming scope, and can be difficult to schedule amongst other development milestones. Recently, Large Language Models (LLMs) have been used for a variety of tasks including automation of UIs. However, to our knowledge, no one has yet explored the use of LLMs in controlling assistive technologies for the purposes of supporting accessibility testing. In this paper, we explore the requirements of a natural language based accessibility testing workflow, starting with a formative study. From this we build a system that takes a manual accessibility test instruction in natural language (e.g., "Search for a show in VoiceOver") as input and uses an LLM combined with pixel-based UI understanding models to execute the test and produce a chaptered, navigable video. In each video, to help QA testers, we apply heuristics to detect and flag accessibility issues (e.g., text size not increasing with Large Text enabled, VoiceOver navigation loops). We evaluate this system through a 10-participant user study with accessibility QA professionals who indicated that the tool would be very useful in their current work and performed tests similarly to how they would manually test the features. The study also reveals insights for future work on using LLMs for accessibility testing.
2024 · Maryam Taeb et al. (Florida State University) · Voice Accessibility · Human-LLM Collaboration · CHI
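One of the issue-flagging heuristics named in the abstract, text size not increasing with Large Text enabled, can be sketched as a comparison of detected text heights before and after the setting is toggled. The threshold and the pixel-height inputs are illustrative, not the paper's values.

```python
from statistics import median

def large_text_not_scaling(heights_before, heights_after, min_ratio=1.1):
    """Flag a screen where enabling Large Text did not grow the detected
    text boxes: if the median height barely changes, report an issue."""
    return median(heights_after) / median(heights_before) < min_ratio

print(large_text_not_scaling([12, 14, 12], [12, 14, 13]))  # True -> flag it
```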
Never-ending Learning of User Interfaces
Machine learning models have been trained to predict semantic information about user interfaces (UIs) to make apps more accessible, easier to test, and to automate. Currently, most models rely on datasets that are collected and labeled by human crowd-workers, a process that is costly and surprisingly error-prone for certain tasks. For example, it is possible to guess if a UI element is “tappable” from a screenshot (i.e., based on visual signifiers) or from potentially unreliable metadata (e.g., a view hierarchy), but one way to know for certain is to programmatically tap the UI element and observe the effects. We built the Never-ending UI Learner, an app crawler that automatically installs real apps from a mobile app store and crawls them to discover new and challenging training examples to learn from. The Never-ending UI Learner has crawled for more than 5,000 device-hours, performing over half a million actions on 6,000 apps to train three computer vision models for (i) tappability prediction, (ii) draggability prediction, and (iii) screen similarity.
2023 · Jason Wu et al. · Human-LLM Collaboration · AI-Assisted Decision-Making & Automation · AutoML Interfaces · UIST
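The core idea, labeling by acting rather than guessing, can be sketched as a tap-and-observe loop. The device driver below is a stub with invented method names; the real crawler installs apps from a mobile app store and runs on devices.

```python
from dataclasses import dataclass

@dataclass
class Element:
    center: tuple

class FakeDevice:
    """Stub standing in for a real device driver (hypothetical interface)."""
    def __init__(self): self.state = 0
    def detect_elements(self): return [Element((50, 100)), Element((50, 300))]
    def screenshot(self): return self.state
    def tap(self, pt): self.state += (pt == (50, 100))  # only first elem reacts
    def go_back(self): pass

def label_tappability(dev):
    """Derive ground-truth 'tappable' labels by acting, not guessing:
    tap each detected element and record whether the screen changed."""
    labels = []
    for elem in dev.detect_elements():
        before = dev.screenshot()
        dev.tap(elem.center)
        labels.append((elem, dev.screenshot() != before))
        dev.go_back()
    return labels

print(label_tappability(FakeDevice()))  # first element tappable, second not
```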
Understanding Screen Relationships from Screenshots of Smartphone Applications
All graphical user interfaces are composed of one or more screens that may be shown to the user depending on their interactions. Identifying different screens of an app and understanding the type of changes that happen on the screens is a challenging task that can be applied in many areas including automatic app crawling, playback of app automation macros, and large-scale app dataset analysis. For example, an automated app crawler needs to understand if the screen it is currently viewing is the same as any previous screen that it has encountered, so it can focus its efforts on portions of the app that it has not yet explored. Moreover, identifying the type of change on the screen, such as whether any dialogs or keyboards have opened or closed, is useful for an automatic crawler to handle such events while crawling. Understanding screen relationships is a difficult task, as instances of the same screen may have visual and structural variation, for example due to different content in a database-backed application, scrolling, dialog boxes opening or closing, or content loading delays. At the same time, instances of different screens from the same app may share some similarities in terms of design, structure, and content. This paper uses a dataset of screenshots from more than 1K iPhone applications to train two ML models that understand similarity in different ways: (1) a screen similarity model that combines a UI object detector with a transformer model architecture to recognize instances of the same screen from a collection of screenshots from a single app, and (2) a screen transition model that uses a siamese network architecture to identify both similarity and three types of events that appear in an interaction trace: the keyboard or a dialog box appearing or disappearing, and scrolling. Our models achieve an F1 score of 0.83 on the screen similarity task, improving on comparable baselines, and an average F1 score of 0.71 across all events in the transition task.
2022 · Shirin Feiz et al. · Eye Tracking & Gaze Interaction · Context-Aware Computing · IUI
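The crawling use case can be sketched as a novelty check over screen embeddings: only screens far from everything already seen are worth exploring. The embeddings and threshold here are made up; the paper's models supply the real similarity signal.

```python
import numpy as np

def is_novel(embedding, seen, tau=0.8):
    """A crawler's use of screen similarity: a screen deserves exploration
    only if its embedding is not close to any previously seen screen."""
    e = embedding / np.linalg.norm(embedding)
    return all(float(e @ s) < tau for s in seen)

seen = []
for emb in [np.array([1.0, 0.0]), np.array([0.99, 0.1]), np.array([0.0, 1.0])]:
    if is_novel(emb, seen):
        seen.append(emb / np.linalg.norm(emb))
print(len(seen))  # 2: the near-duplicate second screen is skipped
```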
Towards Complete Icon Labeling in Mobile Applications
Accurately recognizing icon types in mobile applications is integral to many tasks, including accessibility improvement, UI design search, and conversational agents. Existing research focuses on recognizing the most frequent icon types, but these technologies fail when encountering an unrecognized low-frequency icon. In this paper, we work towards complete coverage of icons in the wild. After annotating a large-scale icon dataset (327,879 icons) from iPhone apps, we found a highly uneven distribution: 98 common icon types covered 92.8% of icons, while the remaining 7.2% spanned more than 331 long-tail icon types. In order to label icons with widely varying occurrences in apps, our system uses an image classification model to recognize common icon types with an average of 3,000 examples each (96.3% accuracy) and applies a few-shot learning model to classify long-tail icon types with an average of 67 examples each (78.6% accuracy). Our system also detects contextual information that helps characterize icon semantics, including nearby text (95.3% accuracy) and modifier symbols added to the icon (87.4% accuracy). In a validation study with workers (n=23), we verified the usefulness of our generated icon labels. The icon types supported by our work cover 99.5% of collected icons, improving on the previously highest 78% coverage in icon classification work.
2022 · Jieshan Chen et al. (Australian National University) · Human-LLM Collaboration · Recommender System UX · CHI
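The abstract does not specify the few-shot model; a common instance of few-shot classification is nearest-prototype matching, sketched below with toy 2-D embeddings. This is an assumption about the general technique, not the paper's architecture.

```python
import numpy as np

def fewshot_classify(query, support):
    """Prototype-style few-shot classification, one way to cover long-tail
    icon types from only tens of examples: average each class's support
    embeddings into a prototype and assign the nearest one."""
    protos = {label: np.mean(embs, axis=0) for label, embs in support.items()}
    return min(protos, key=lambda l: np.linalg.norm(query - protos[l]))

support = {
    "wifi":      [np.array([0.9, 0.1]), np.array([1.0, 0.0])],
    "bluetooth": [np.array([0.1, 0.9]), np.array([0.0, 1.0])],
}
print(fewshot_classify(np.array([0.8, 0.2]), support))  # -> "wifi"
```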
Screen Parsing: Towards Reverse Engineering of UI Models from Screenshots
Automated understanding of user interfaces (UIs) from their pixels can improve accessibility, enable task automation, and facilitate interface design without relying on developers to comprehensively provide metadata. A first step is to infer what UI elements exist on a screen, but current approaches are limited in inferring how those elements are semantically grouped into structured interface definitions. In this paper, we motivate the problem of screen parsing, the task of predicting UI elements and their relationships from a screenshot. We describe our implementation of screen parsing and provide an effective training procedure that optimizes its performance. In an evaluation comparing the accuracy of the generated output, we find that our implementation significantly outperforms current systems (up to 23%). Finally, we show three example applications that are facilitated by screen parsing: (i) UI similarity search, (ii) accessibility enhancement, and (iii) code generation from UI screenshots.
2021 · Jason Wu et al. · Interactive Data Visualization · Prototyping & User Testing · UIST
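A first structural step of screen parsing, grouping detected elements into rows, can be sketched with a simple vertical-proximity pass. Real screen parsing predicts a full element hierarchy; the dictionary keys and gap threshold here are assumptions for the sketch.

```python
def group_into_rows(elements, gap=8):
    """Sort detected elements top-to-bottom and merge those whose tops are
    within a small gap into one row container: the first structural step
    toward a grouped interface definition."""
    rows, current = [], []
    for el in sorted(elements, key=lambda e: e["top"]):
        if current and el["top"] - current[-1]["top"] > gap:
            rows.append(current)
            current = []
        current.append(el)
    if current:
        rows.append(current)
    return rows

els = [{"top": 10, "id": "icon"}, {"top": 12, "id": "label"}, {"top": 60, "id": "button"}]
print([[e["id"] for e in row] for row in group_into_rows(els)])
# [['icon', 'label'], ['button']]
```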
Screen Recognition: Creating Accessibility Metadata for Mobile Applications from Pixels
Many accessibility features available on mobile platforms require applications (apps) to provide complete and accurate metadata describing user interface (UI) components. Unfortunately, many apps do not provide sufficient metadata for accessibility features to work as expected. In this paper, we explore inferring accessibility metadata for mobile apps from their pixels, as the visual interfaces often best reflect an app's full functionality. We trained a robust, fast, memory-efficient, on-device model to detect UI elements using a dataset of 77,637 screens (from 4,068 iPhone apps) that we collected and annotated. To further improve UI detections and add semantic information, we introduced heuristics (e.g., UI grouping and ordering) and additional models (e.g., recognize UI content, state, interactivity). We built Screen Recognition to generate accessibility metadata to augment iOS VoiceOver. In a study with 9 screen reader users, we validated that our approach improves the accessibility of existing mobile apps, enabling even previously inaccessible apps to be used.
2021 · Xiaoyi Zhang et al. (Apple Inc) · Visual Impairment Technologies (Screen Readers, Tactile Graphics, Braille) · Universal & Inclusive Design · CHI
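One grouping/ordering heuristic of the kind the paper layers on top of detection can be sketched as bucketing elements into coarse rows and reading left to right, giving VoiceOver a natural traversal order. The keys and the row height are assumptions about detector output, not the paper's values.

```python
def reading_order(elements, row_height=40):
    """Order detected elements top-to-bottom, left-to-right by bucketing
    tops into coarse rows, so a screen reader visits them naturally."""
    return sorted(elements, key=lambda e: (e["top"] // row_height, e["left"]))

els = [{"id": "b", "top": 12, "left": 200},
       {"id": "a", "top": 8,  "left": 16},
       {"id": "c", "top": 90, "left": 16}]
print([e["id"] for e in reading_order(els)])  # ['a', 'b', 'c']
```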