StepWrite: Adaptive Planning for Speech-Driven Text GenerationPeople frequently use speech-to-text systems to compose short texts with voice. However, current voice-based interfaces struggle to support composing more detailed, contextually complex texts, especially in scenarios where users are on the move and cannot visually track progress. Longer-form communication, such as composing structured emails or thoughtful responses, requires persistent context tracking, structured guidance, and adaptability to evolving user intentions---capabilities that conventional dictation tools and voice assistants do not support. We introduce StepWrite, a large language model-driven voice-based interaction system that augments human writing ability by enabling structured, hands-free and eyes-free composition of longer-form texts while on the move. StepWrite decomposes the writing process into manageable subtasks and sequentially guides users with contextually-aware non-visual audio prompts. StepWrite reduces cognitive load by offloading the context-tracking and adaptive planning tasks to the models. Unlike baseline methods like standard dictation features (e.g., Microsoft Word) and conversational voice assistants (e.g., ChatGPT Advanced Voice Mode), StepWrite dynamically adapts its prompts based on the evolving context and user intent, and provides coherent guidance without compromising user autonomy. An empirical evaluation with 25 participants engaging in mobile or stationary hands-occupied activities demonstrated that StepWrite significantly reduces cognitive load, improves usability and user satisfaction compared to baseline methods. Technical evaluations further confirmed StepWrite's capability in dynamic contextual prompt generation, accurate tone alignment, and effective fact checking. This work highlights the potential of structured, context-aware voice interactions in enhancing hands-free and eye-free communication in everyday multitasking scenarios.2025HAHamza El Alaoui et al.Voice User Interface (VUI) DesignHuman-LLM CollaborationUIST
Morae: Proactively Pausing UI Agents for User ChoicesUser interface (UI) agents promise to make inaccessible or complex UIs easier to access for blind and low-vision (BLV) users. However, current UI agents typically perform tasks end-to-end without involving users in critical choices or making them aware of important contextual information, thus reducing user agency. For example, in our field study, a BLV participant asked to buy the cheapest available sparkling water, and the agent automatically chose one from several equally priced options, without mentioning alternative products with different flavors or better ratings. To address this problem, we introduce Morae, a UI agent that automatically identifies decision points during task execution and pauses so that users can make choices. Morae uses large multimodal models to interpret user queries alongside UI code and screenshots, and prompt users for clarification when there is a choice to be made. In a study over real-world web tasks with BLV participants, Morae helped users complete more tasks and select options that better matched their preferences, as compared to baseline agents, including OpenAI Operator. More broadly, this work exemplifies a mixed-initiative approach in which users benefit from the automation of UI agents while being able to express their preferences.2025YPYi-Hao Peng et al.Intelligent Voice Assistants (Alexa, Siri, etc.)Voice AccessibilityUIST
CodeA11y: Making AI Coding Assistants Useful for Accessible Web DevelopmentA persistent challenge in accessible computing is ensuring developers produce web UI code that supports assistive technologies. Despite numerous specialized accessibility tools, novice developers often remain unaware of them, leading to ~96% of web pages that contain accessibility violations. AI coding assistants, such as GitHub Copilot, could offer potential by generating accessibility-compliant code, but their impact remains uncertain. Our formative study with 16 developers without accessibility training revealed three key issues in AI-assisted coding: failure to prompt AI for accessibility, omitting crucial manual steps like replacing placeholder attributes, and the inability to verify compliance. To address these issues, we developed CodeA11y, a GitHub Copilot Extension, that suggests accessibility-compliant code and displays manual validation reminders. We evaluated it through a controlled study with another 20 novice developers. Our findings demonstrate its effectiveness in guiding novice developers by reinforcing accessibility practices throughout interactions, representing a significant step towards integrating accessibility into AI coding assistants.2025PMPeya Mowar et al.Carnegie Mellon University, Robotics InstituteGenerative AI (Text, Image, Music, Video)Universal & Inclusive DesignCHI
UIClip: A Data-driven Model for Assessing User Interface DesignUser interface (UI) design is a difficult yet important task for ensuring the usability, accessibility, and aesthetic qualities of applications. In our paper, we develop a machine-learned model, UIClip, for assessing the design quality and visual relevance of a UI given its screenshot and natural language description. To train UIClip, we used a combination of automated crawling, synthetic augmentation, and human ratings to construct a large-scale dataset of UIs, collated by description and ranked by design quality. Through training on the dataset, UIClip implicitly learns properties of good and bad designs by (i) assigning a numerical score that represents a UI design's relevance and quality and (ii) providing design suggestions. In an evaluation that compared the outputs of UIClip and other baselines to UIs rated by 12 human designers, we found that UIClip achieved the highest agreement with ground-truth rankings. Finally, we present three example applications that demonstrate how UIClip can facilitate downstream applications that rely on instantaneous assessment of UI design quality: (i) UI code generation, (ii) UI design tips generation, and (iii) quality-aware UI example search.2024JWJason Wu et al.360° Video & Panoramic ContentHuman-LLM CollaborationExplainable AI (XAI)UIST
"This really let's us see the entire world:" Designing a conversational telepresence robot for homebound older adultsIn this paper, we explore the design and use of conversational telepresence robots to help homebound older adults interact with the external world. An initial needfinding study (N=8) using video vignettes revealed older adults' experiential needs for robot-mediated remote experiences such as exploration, reminiscence and social participation. We then designed a prototype system to support these goals and conducted a technology probe study (N=11) to garner a deeper understanding of user preferences for remote experiences. The study revealed user interactive patterns in each desired experience, highlighting the need of robot guidance in exploration and social engagements for reminiscence. Our work identifies a novel design space where conversational telepresence robots can be used to foster meaningful interactions in the remote physical environment. We offer design insights into the robot's proactive role in providing guidance and using dialogue to create personalized, contextualized and meaningful experiences.2024YHYaxin Hu et al.Aging-in-Place Assistance SystemsTeleoperation & TelepresenceDIS
GenAssist: Making Image Generation AccessibleBlind and low vision (BLV) creators use images to communicate with sighted audiences. However, creating or retrieving images is challenging for BLV creators as it is difficult to use authoring tools or assess image search results. Thus, creators limit the types of images they create or recruit sighted collaborators. While text-to-image generation models let creators generate high-fidelity images based on a text description (i.e. prompt), it is difficult to assess the content and quality of generated images. We present GenAssist, a system to make text-to-image generation accessible. Using our interface, creators can verify whether generated image candidates followed the prompt, access additional details in the image not specified in the prompt, and skim a summary of similarities and differences between image candidates. To power the interface, GenAssist uses a large language model to generate visual questions, vision-language models to extract answers, and a large language model to summarize the results. Our study with 12 BLV creators demonstrated that GenAssist enables and simplifies the process of image selection and generation, making visual authoring more accessible to all.2023MHMina Huh et al.Generative AI (Text, Image, Music, Video)Visual Impairment Technologies (Screen Readers, Tactile Graphics, Braille)UIST
Slide Gestalt: Automatic Structure Extraction in Slide Decks for Non-Visual AccessPresentation slides commonly use visual patterns for structural navigation, such as titles, dividers, and build slides. However, screen readers do not capture such intention, making it time-consuming and less accessible for blind and visually impaired (BVI) users to linearly consume slides with repeated content. We present Slide Gestalt, an automatic approach that identifies the hierarchical structure in a slide deck. Slide Gestalt computes the visual and textual correspondences between slides to generate hierarchical groupings. Readers can navigate the slide deck from the higher-level section overview to the lower-level description of a slide group or individual elements interactively with our UI. We derived side consumption and authoring practices from interviews with BVI readers and sighted creators and an analysis of 100 decks. We performed our pipeline with 50 real-world slide decks and a large dataset. Feedback from eight BVI participants showed that Slide Gestalt helped navigate a slide deck by anchoring content more efficiently, compared to using accessible slides.2023YPYi-Hao Peng et al.Carnegie Mellon UniversityVisual Impairment Technologies (Screen Readers, Tactile Graphics, Braille)CHI
Reality Rifts: Wonder-ful Interfaces by Disrupting Perceptual CausalityReality Rifts are interfaces between the physical and the virtual reality, where incoherent observations of physical behavior lead users to imagine comprehensive and plausible end-to-end dynamics. Reality Rifts emerge in interactive physical systems that lack one or more components that are central to their operation, yet where the physical end-to-end interaction persists with plausible outcomes. Even in the presence of a Reality Rift, users can still interact with a system—much like they would with the unaltered and complete counterpart—leading them to implicitly infer the existence and imagine the behavior of the lacking components from observable phenomena and outcomes. Therefore, dynamic systems with Reality Rifts trigger doubt, curiosity, and rumination—a sense of wonder that users experience when observing a Reality Rift due to their innate curiosity. In this paper, we explore how interactive systems can elicit and guide the user's imagination by integrating Reality Rifts. We outline the design process for opening a Reality Rift in interactive physical systems, describe the resulting design space, and explore it through six characteristic prototypes. To understand to what extent and with which qualities these prototypes indeed induce a sense of wonder during an interaction, we evaluated \projectName\ in the form of a field deployment with 50 participants. We discuss participants' behavior and derive factors for the implementation of future wonder-ful experiences.2023LCLung-Pan Cheng et al.National Taiwan UniversityDesign FictionDigital Art Installations & Interactive PerformanceCHI
AVscript: Accessible Video Editing with Audio-Visual ScriptsSighted and blind and low vision (BLV) creators alike use videos to communicate with broad audiences. Yet, video editing remains inaccessible to BLV creators. Our formative study revealed that current video editing tools make it difficult to access the visual content, assess the visual quality, and efficiently navigate the timeline. We present AVscript, an accessible text-based video editor. AVscript enables users to edit their video using a script that embeds the video's visual content, visual errors (e.g., dark or blurred footage), and speech. Users can also efficiently navigate between scenes and visual errors or locate objects in the frame or spoken words of interest. A comparison study (N=12) showed that AVscript significantly lowered BLV creators' mental demands while increasing confidence and independence in video editing. We further demonstrate the potential of AVscript through an exploratory study (N=3) where BLV creators edited their own footage.2023MHMina Huh et al.University of Texas, AustinVisual Impairment Technologies (Screen Readers, Tactile Graphics, Braille)Accessible GamingVideo Production & EditingCHI
WebUI: A Dataset for Enhancing Visual UI Understanding with Web SemanticsModeling user interfaces (UIs) from visual information allows systems to make inferences about the functionality and semantics needed to support use cases in accessibility, app automation, and testing. Current datasets for training machine learning models are limited in size due to the costly and time-consuming process of manually collecting and annotating UIs. We crawled the web to construct WebUI, a large dataset of 400,000 rendered web pages associated with automatically extracted metadata. We analyze the composition of WebUI and show that while automatically extracted data is noisy, most examples meet basic criteria for visual UI modeling. We applied several strategies for incorporating semantics found in web pages to increase the performance of visual UI understanding models in the mobile domain, where less labeled data is available: (i) element detection, (ii) screen classification and (iii) screen similarity.2023JWJason Wu et al.Carnegie Mellon UniversityUniversal & Inclusive DesignPrototyping & User TestingComputational Methods in HCICHI
Diffscriber: Describing Visual Design Changes to Support Mixed-Ability Collaborative Presentation AuthoringVisual slide-based presentations are ubiquitous, yet slide authoring tools are largely inaccessible to people who are blind or visually impaired (BVI). When authoring presentations, the 9 BVI presenters in our formative study usually work with sighted collaborators to produce visual slides based on the text content they produce. While BVI presenters valued collaborators’ visual design skills, the collaborators often felt they could not fully review and provide feedback on the visual changes that were made. We present Diffscriber, a system that identifies and describes changes to a slide’s content, layout, and style for presentation authoring. Using our system, BVI presentation authors can efficiently review changes to their presentation by navigating either a summary of high-level changes or individual slide elements. To learn more about changes of interest, presenters can use a generated change hierarchy to navigate to lower-level change details and element styles. BVI presenters using Diffscriber were able to identify slide design changes and provide feedback more easily as compared to using only the slides alone. More broadly, Diffscriber illustrates how advances in detecting and describing visual differences can improve mixed-ability collaboration.2022YPYi-Hao Peng et al.Visual Impairment Technologies (Screen Readers, Tactile Graphics, Braille)Universal & Inclusive DesignUIST
ImageExplorer: Multi-Layered Touch Exploration to Encourage Skepticism Towards Imperfect AI-Generated Image CaptionsBlind users rely on alternative text (alt-text) to understand an image; however, alt-text is often missing. AI-generated captions are a more scalable alternative, but they often miss crucial details or are completely incorrect, which users may still falsely trust. In this work, we sought to determine how additional information could help users better judge the correctness of AI-generated captions. We developed ImageExplorer, a touch-based multi-layered image exploration system that allows users to explore the spatial layout and information hierarchies of images, and compared it with popular text-based (Facebook) and touch-based (Seeing AI) image exploration systems in a study with 12 blind participants. We found that exploration was generally successful in encouraging skepticism towards imperfect captions. Moreover, many participants preferred ImageExplorer for its multi-layered and spatial information presentation, and Facebook for its summary and ease of use. Finally, we identify design improvements for effective and explainable image exploration systems for blind users.2022JLJaewook Lee et al.University of Illinois at Urbana-ChampaignExplainable AI (XAI)Visual Impairment Technologies (Screen Readers, Tactile Graphics, Braille)CHI