The GenUI Study: Exploring the Design of Generative UI Tools to Support UX Practitioners and BeyondAI can now generate high-fidelity UI mock-up screens from a high-level textual description, promising to support UX practitioners' work. However, it remains unclear how UX practitioners would adopt such Generative UI (GenUI) models in a way that is integral and beneficial to their work. To answer this question, we conducted a formative study with 37 UX-related professionals that consisted of four roles: UX designers, UX researchers, software engineers, and product managers. Using a state-of-the-art GenUI tool, each participant went through a week-long, individual mini-project exercise with role-specific tasks, keeping a daily journal of their usage and experiences with GenUI, followed by a semi-structured interview. We report findings on participants' workflow using the GenUI tool, how GenUI can support all and each specific roles, and existing gaps between GenUI and users' needs and expectations, which lead to design implications to inform future work on GenUI development.2025XCXiang 'Anthony' Chen et al.Generative AI (Text, Image, Music, Video)Human-LLM CollaborationAI-Assisted Creative WritingDIS
UICrit: Enhancing Automated Design Evaluation with a UI Critique DatasetAutomated UI evaluation can be beneficial for the design process; for example, to compare different UI designs, or conduct automated heuristic evaluation. LLM-based UI evaluation, in particular, holds the promise of generalizability to a wide variety of UI types and evaluation tasks. However, current LLM-based techniques do not yet match the performance of human evaluators. We hypothesize that automatic evaluation can be improved by collecting a targeted UI feedback dataset and then using this dataset to enhance the performance of general-purpose LLMs. We present a targeted dataset of 3,059 design critiques and quality ratings for 983 mobile UIs, collected from seven designers, each with at least a year of professional design experience. We carried out an in-depth analysis to characterize the dataset's features. We then applied this dataset to achieve a 55\% performance gain in LLM-generated UI feedback via various few-shot and visual prompting techniques. We also discuss future applications of this dataset, including training a reward model for generative UI techniques, and fine-tuning a tool-agnostic multi-modal LLM that automates UI evaluation.2024PDPeitong Duan et al.Human-LLM CollaborationExplainable AI (XAI)UIST
Generating Automatic Feedback on UI Mockups with Large Language ModelsFeedback on user interface (UI) mockups is crucial in design. However, human feedback is not always readily available. We explore the potential of using large language models for automatic feedback. Specifically, we focus on \changes{applying GPT-4 to automate heuristic evaluation}, which currently entails a human expert assessing a UI’s compliance with a set of design guidelines. We implemented a Figma plugin that takes in a UI design and a set of written heuristics, and renders automatically-generated feedback as constructive suggestions. We assessed performance on 51 UIs using three sets of guidelines, compared GPT-4-generated design suggestions with those from human experts, and conducted a study with 12 expert designers to understand fit with existing practice. We found that GPT-4-based feedback is useful for catching subtle errors, improving text, and considering UI semantics, but feedback also decreased in utility over iterations. Participants described several uses for this plugin despite its imperfect suggestions.2024PDPeitong Duan et al.UC BerkeleyHuman-LLM CollaborationPrototyping & User TestingCHI
Enabling Conversational Interaction with Mobile UI using Large Language ModelsConversational agents show the promise to allow users to interact with mobile devices using language. However, to perform diverse UI tasks with natural language, developers typically need to create separate datasets and models for each specific task, which is expensive and effort-consuming. Recently, pre-trained large language models (LLMs) have been shown capable of generalizing to various downstream tasks when prompted with a handful of examples from the target task. This paper investigates the feasibility of enabling versatile conversational interactions with mobile UIs using a single LLM. We designed prompting techniques to adapt an LLM to mobile UIs. We experimented with four important modeling tasks that address various scenarios in conversational interaction. Our method achieved competitive performance on these challenging tasks without requiring dedicated datasets and training, offering a lightweight and generalizable approach to enable language-based mobile interaction.2023BWBryan Wang et al.University of TorontoVoice User Interface (VUI) DesignHuman-LLM CollaborationCHI
Predicting and Explaining Mobile UI Tappability with Vision Modeling and Saliency AnalysisUI designers often correct false affordances and improve the discoverability of features when users have trouble determining if elements are tappable. We contribute a novel system that models the perceived tappability of mobile UI elements with a vision-based deep neural network and helps provide design insights with dataset-level and instance-level explanations of model predictions. Our system retrieves designs from similar mobile UI examples from our dataset using the latent space of our model. We also contribute a novel use of an interpretability algorithm, XRAI, to generate a heatmap of UI elements that contribute to a given tappability prediction. Through several examples, we show how our system can help automate elements of UI usability analysis and provide insights for designers to iterate their designs. In addition, we share findings from an exploratory evaluation with professional designers to learn how AI-based tools can aid UI design and evaluation for tappability issues.2022ESEldon Schoop et al.University of California, BerkeleyExplainable AI (XAI)AI-Assisted Decision-Making & AutomationCHI
Learning to Denoise Raw Mobile UI Layouts for Improving Datasets at ScaleThe layout of a mobile screen is a critical data source for UI design research and semantic understanding of the screen. However, UI layouts in existing datasets are often noisy, have mismatches with their visual representation, or consists of generic or app-specific types that are difficult to analyze and model. In this paper, we propose the CLAY pipeline that uses a deep learning approach for denoising UI layouts, allowing us to automatically improve existing mobile UI layout datasets at scale. Our pipeline takes both the screenshot and the raw UI layout, and annotates the raw layout by removing incorrect nodes and assigning a semantically meaningful type to each node. To experiment with our data-cleaning pipeline, we create the CLAY dataset of 59,555 human-annotated screen layouts, based on screenshots and raw layouts from Rico, a public mobile UI corpus. Our deep models achieve high accuracy with F1 scores of 82.7% for detecting layout objects that do not have a valid visual representation and 85.9% for recognizing object types, which significantly outperforms a heuristic baseline. Our work lays a foundation for creating large-scale high quality UI layout datasets for data-driven mobile UI research and reduces the need of manual labeling efforts that are prohibitively expensive.2022GLGang Li et al.Google ResearchPrototyping & User TestingComputational Methods in HCICHI
Screen2Words: Automatic Mobile UI Summarization with Multimodal LearningMobile User Interface Summarization generates succinct language descriptions of mobile screens for conveying important contents and functionalities of the screen, which can be useful for many language-based application scenarios. We present Screen2Words, a novel screen summarization approach that automatically encapsulates essential information of a UI screen into a coherent language phrase. Summarizing mobile screens requires a holistic understanding of the multi-modal data of mobile UIs, including text, image, structures as well as UI semantics, motivating our multi-modal learning approach. We collected and analyzed a large-scale screen summarization dataset annotated by human workers. Our dataset contains more than 112k language summarization across ~22k unique UI screens. We then experimented with a set of deep models with different configurations. Our evaluation of these models with both automatic accuracy metrics and human rating shows that our approach can generate high-quality summaries for mobile screens. We demonstrate potential use cases of Screen2Words and open-source our dataset and model to lay the foundations for further bridging language and user interfaces.2021BWBryan Wang et al.Voice User Interface (VUI) DesignHuman-LLM CollaborationUIST
HelpViz: Automatic Generation of Contextual Visual Mobile Tutorials from Text-Based InstructionsWe present HelpViz, a tool for generating contextual visual mobile tutorials from text-based instructions that are abundant on the web. HelpViz transforms text instructions to graphical tutorials in batch, by extracting a sequence of actions from each text instruction through an instruction parsing model, and executing the extracted actions on a simulation infrastructure that manages an array of Android emulators. The automatic execution of each instruction produces a set of graphical and structural assets, including images, videos, and metadata such as clicked elements for each step. HelpViz then synthesizes a tutorial by combining parsed text instructions with the generated assets, and contextualizes the tutorial to user interaction by tracking the user's progress and highlighting the next step. Our experiments with HelpViz indicate that our pipeline improved tutorial execution robustness and that participants preferred tutorials generated by HelpViz over text-based instructions. HelpViz promises a cost-effective approach for generating contextual visual tutorials for mobile interaction at scale.2021MZMingyuan Zhong et al.Interactive Data VisualizationData StorytellingUIST
Spacewalker: Rapid UI Design Exploration Using Lightweight Markup Enhancement and Crowd Genetic ProgrammingUser interface design is a complex task that involves designers examining a wide range of options. We present Spacewalker, a tool that allows designers to rapidly search a large design space for an optimal web UI with integrated support. Designers first annotate each attribute they want to explore in a typical HTML page, using a simple markup extension we designed. Spacewalker then parses the annotated HTML specification, and intelligently generates and distributes various configurations of the web UI to crowd workers for evaluation. We enhanced a genetic algorithm to accommodate crowd worker responses from pairwise comparison of UI designs, which is crucial for obtaining reliable feedback. Based on our experiments, Spacewalker allows designers to effectively search a large design space of a UI, using the language they are familiar with, and improve their design rapidly at a minimal cost.2021MZMingyuan Zhong et al.University of WashingtonCrowdsourcing Task Design & Quality ControlPrototyping & User TestingCHI
TapNet: The Design, Training, Implementation, and Applications of a Multi-Task Learning CNN for Off-Screen Mobile InputTo make off-screen interaction without specialized hardware practical, we investigate using deep learning methods to process the common built-in IMU sensor (accelerometers and gyroscopes) on mobile phones into a useful set of one-handed interaction events. We present the design, training, implementation and applications of TapNet, a multi-task network that detects tapping on the smartphone. With phone form factor as auxiliary information, TapNet can jointly learn from data across devices and simultaneously recognize multiple tap properties, including tap direction and tap location. We developed two datasets consisting of over 135K training samples, 38K testing samples, and 32 participants in total. Experimental evaluation demonstrated the effectiveness of the TapNet design and its significant improvement over the state of the art. Along with the datasets, codebase, and extensive experiments, TapNet establishes a new technical foundation for off-screen mobile input.2021MHMichael Xuelin Huang et al.GoogleFoot & Wrist InteractionComputational Methods in HCICHI
Modeling Human Visual Search Performance on Realistic Webpages Using Analytical and Deep Learning MethodsModeling visual search not only offers an opportunity to predict the usability of an interface before actually testing it on real users but also advances scientific understanding about human behavior. In this work, we first conduct a set of analyses on a large-scale dataset of visual search tasks on realistic webpages. We then present a deep neural network that learns to predict the scannability of webpage content, i.e., how easy it is for a user to find a specific target. Our model leverages both heuristic-based features such as target size and unstructured features such as raw image pixels. This approach allows us to model complex interactions that might be involved in a realistic visual search task, which can not be achieved by traditional analytical models. We analyze the model behavior to offer our insights into how the salience map learned by the model aligns with human intuition.2020AYArianna Yuan et al.Stanford UniversityExplainable AI (XAI)Visualization Perception & CognitionCHI
Modeling Mobile Interface Tappability Using Crowdsourcing and Deep LearningTapping is an immensely important gesture in mobile touchscreen interfaces, yet people still frequently are required to learn which elements are tappable through trial and error. Predicting human behavior for this everyday gesture can help mobile app designers understand an important aspect of the usability of their apps without having to run a user study. In this paper, we present an approach for modeling tappability of mobile interfaces at scale. We conducted large-scale data collection of interface tappability over a rich set of mobile apps using crowdsourcing and computationally investigated a variety of signifiers that people use to distinguish tappable versus not-tappable elements. Based on the dataset, we developed and trained a deep neural network that predicts how likely a user will perceive an interface element as tappable versus not tappable. Using the trained tappability model, we developed TapShoe, a tool that automatically diagnoses mismatches between the tappability of each element as perceived by a human user---predicted by our model, and the intended or actual tappable state of the element specified by the developer or designer. Our model achieved reasonable accuracy: mean precision 90.2% and recall 87.0%, in matching human perception on identifying tappable UI elements. The tappability model and TapShoe were well received by designers via an informal evaluation with 7 professional interaction designers.2019ASAmanda M Swearngin et al.University of Washington & Google ResearchPrototyping & User TestingComputational Methods in HCICHI
Predicting Human Performance in Vertical Menu Selection Using Deep LearningPredicting human performance in interaction tasks allows designers or developers to understand the expected performance of a target interface without actually testing it with real users. In this work, we present a deep neural net to model and predict human performance in performing a sequence of UI tasks. In particular, we focus on a dominant class of tasks, i.e., target selection from a vertical list or menu. We experimented with our deep neural net using a public dataset collected from a desktop laboratory environment and a dataset collected from hundreds of touchscreen smartphone users via crowdsourcing. Our model significantly outperformed previous methods on these datasets. Importantly, our method, as a deep model, can easily incorporate additional UI attributes such as visual appearance and content semantics without changing model architectures. By understanding about how a deep learning model learns from human behaviors, our approach can be seen as a vehicle to discover new patterns about human behaviors to advance analytical modeling.2018YLYang Li et al.Google Research & Machine IntelligenceRecommender System UXComputational Methods in HCICHI
M3 Gesture Menu: Design and Experimental Analyses of Marking Menus for Touchscreen Mobile InteractionDespite their learning advantages in theory, marking menus have faced adoption challenges in practice, even on today's touchscreen-based mobile devices. We address these challenges by designing, implementing, and evaluating multiple versions of M3 Gesture Menu (M3), a reimagination of marking menus targeted at mobile interfaces. M3 is defined on a grid rather than in a radial space, relies on gestural shapes rather than directional marks, and has constant and stationary space use. Our first controlled experiment on expert performance showed M3 was faster and less error-prone by a factor of two than traditional marking menus. A second experiment on learning demonstrated for the first time that users could successfully transition to recall-based execution of a dozen commands after three ten-minute practice sessions with both M3 and Multi-Stroke Marking Menu. Together, M3, with its demonstrated resolution, learning, and space use benefits, contributes to the design and understanding of menu selection in the mobile-first era of end-user computing.2018JZJingjie Zheng et al.University of Waterloo, Google, Inc.Hand Gesture RecognitionCHI
Doppio: Tracking UI Flows and Code Changes for App DevelopmentDeveloping interactive systems often involves a large set of callback functions for handling user interaction, which makes it challenging to manage UI behaviors, create descriptive documentation, and track code revisions. We developed Doppio, a tool that automatically tracks and visualizes UI flows and their changes based on source code. For each input event listener of a widget, e.g., onClick of an Android View class, Doppio captures and associates its UI output from a program execution with its code snippet from the codebase. It automatically generates a screenflow diagram organized by the callback methods and interaction flow, where developers can review the code and UI revisions interactively. Doppio, as an IDE plugin, is seamlessly integrated into a common development workflow. Our studies show that our tool is able to generate quality visual documentation and helped participants understand unfamiliar source code and track changes.2018PCPei-Yu (Peggy) Chi et al.Google Inc.Knowledge Worker Tools & WorkflowsComputational Methods in HCICHI