StepWrite: Adaptive Planning for Speech-Driven Text GenerationPeople frequently use speech-to-text systems to compose short texts with voice. However, current voice-based interfaces struggle to support composing more detailed, contextually complex texts, especially in scenarios where users are on the move and cannot visually track progress. Longer-form communication, such as composing structured emails or thoughtful responses, requires persistent context tracking, structured guidance, and adaptability to evolving user intentions, capabilities that conventional dictation tools and voice assistants do not support. We introduce StepWrite, a large language model-driven voice-based interaction system that augments human writing ability by enabling structured, hands-free and eyes-free composition of longer-form texts while on the move. StepWrite decomposes the writing process into manageable subtasks and sequentially guides users with contextually-aware non-visual audio prompts. StepWrite reduces cognitive load by offloading the context-tracking and adaptive planning tasks to the models. Unlike baseline methods like standard dictation features (e.g., Microsoft Word) and conversational voice assistants (e.g., ChatGPT Advanced Voice Mode), StepWrite dynamically adapts its prompts based on the evolving context and user intent, and provides coherent guidance without compromising user autonomy. An empirical evaluation with 25 participants engaging in mobile or stationary hands-occupied activities demonstrated that StepWrite significantly reduces cognitive load, improves usability and user satisfaction compared to baseline methods. Technical evaluations further confirmed StepWrite's capability in dynamic contextual prompt generation, accurate tone alignment, and effective fact checking.
This work highlights the potential of structured, context-aware voice interactions in enhancing hands-free and eyes-free communication in everyday multitasking scenarios.2025HAHamza El Alaoui et al.Voice User Interface (VUI) DesignHuman-LLM CollaborationUIST
Morae: Proactively Pausing UI Agents for User ChoicesUser interface (UI) agents promise to make inaccessible or complex UIs easier to access for blind and low-vision (BLV) users. However, current UI agents typically perform tasks end-to-end without involving users in critical choices or making them aware of important contextual information, thus reducing user agency. For example, in our field study, a BLV participant asked to buy the cheapest available sparkling water, and the agent automatically chose one from several equally priced options, without mentioning alternative products with different flavors or better ratings. To address this problem, we introduce Morae, a UI agent that automatically identifies decision points during task execution and pauses so that users can make choices. Morae uses large multimodal models to interpret user queries alongside UI code and screenshots, and prompt users for clarification when there is a choice to be made. In a study over real-world web tasks with BLV participants, Morae helped users complete more tasks and select options that better matched their preferences, as compared to baseline agents, including OpenAI Operator. More broadly, this work exemplifies a mixed-initiative approach in which users benefit from the automation of UI agents while being able to express their preferences.2025YPYi-Hao Peng et al.Intelligent Voice Assistants (Alexa, Siri, etc.)Voice AccessibilityUIST
CodeA11y: Making AI Coding Assistants Useful for Accessible Web DevelopmentA persistent challenge in accessible computing is ensuring developers produce web UI code that supports assistive technologies. Despite numerous specialized accessibility tools, novice developers often remain unaware of them, leading to ~96% of web pages that contain accessibility violations. AI coding assistants, such as GitHub Copilot, could offer potential by generating accessibility-compliant code, but their impact remains uncertain. Our formative study with 16 developers without accessibility training revealed three key issues in AI-assisted coding: failure to prompt AI for accessibility, omitting crucial manual steps like replacing placeholder attributes, and the inability to verify compliance. To address these issues, we developed CodeA11y, a GitHub Copilot Extension, that suggests accessibility-compliant code and displays manual validation reminders. We evaluated it through a controlled study with another 20 novice developers. Our findings demonstrate its effectiveness in guiding novice developers by reinforcing accessibility practices throughout interactions, representing a significant step towards integrating accessibility into AI coding assistants.2025PMPeya Mowar et al.Carnegie Mellon University, Robotics InstituteGenerative AI (Text, Image, Music, Video)Universal & Inclusive DesignCHI
"This really let's us see the entire world:" Designing a conversational telepresence robot for homebound older adultsIn this paper, we explore the design and use of conversational telepresence robots to help homebound older adults interact with the external world. An initial needfinding study (N=8) using video vignettes revealed older adults' experiential needs for robot-mediated remote experiences such as exploration, reminiscence and social participation. We then designed a prototype system to support these goals and conducted a technology probe study (N=11) to garner a deeper understanding of user preferences for remote experiences. The study revealed user interactive patterns in each desired experience, highlighting the need of robot guidance in exploration and social engagements for reminiscence. Our work identifies a novel design space where conversational telepresence robots can be used to foster meaningful interactions in the remote physical environment. We offer design insights into the robot's proactive role in providing guidance and using dialogue to create personalized, contextualized and meaningful experiences.2024YHYaxin Hu et al.Aging-in-Place Assistance SystemsTeleoperation & TelepresenceDIS
Deconstructing the Veneer of Simplicity: Co-Designing Introductory Generative AI Workshops with Local EntrepreneursGenerative AI platforms and features are permeating many aspects of work. Entrepreneurs from lean economies in particular are well positioned to outsource tasks to generative AI given limited resources. In this paper, we work to address a growing disparity in use of these technologies by building on a four-year partnership with a local entrepreneurial hub dedicated to equity in tech and entrepreneurship. Together, we co-designed an interactive workshop series aimed at onboarding local entrepreneurs to generative AI platforms. Alongside four community-driven and iterative workshops with entrepreneurs across five months, we conducted interviews with 15 local entrepreneurs and community providers. We detail the importance of communal and supportive exposure to generative AI tools for local entrepreneurs, scaffolding actionable use (and supporting non-use), demystifying generative AI technologies by emphasizing entrepreneurial power, while simultaneously deconstructing the veneer of simplicity to address the many operational skills needed for successful application.2024YKYasmine Kotturi et al.Carnegie Mellon UniversityGenerative AI (Text, Image, Music, Video)Participatory DesignCHI
COMPA: Using Conversation Context to Achieve Common Ground in AACGroup conversations often shift quickly from topic to topic, leaving a small window of time for participants to contribute. AAC users often miss this window due to the speed asymmetry between using speech and using AAC devices. AAC users may take over a minute longer to contribute, and this speed difference can cause mismatches between the ongoing conversation and the AAC user's response. This results in misunderstandings and missed opportunities to participate. We present COMPA, an add-on tool for online group conversations that seeks to support conversation partners in achieving common ground. COMPA uses a conversation's live transcription to enable AAC users to mark conversation segments they intend to address (Context Marking) and generate contextual starter phrases related to the marked conversation segment (Phrase Assistance) and a selected user intent. We study COMPA in 5 different triadic group conversations, each composed of a researcher, an AAC user and a conversation partner (n=10) and share findings on how conversational context supports conversation partners in achieving common ground.2024SVStephanie Valencia et al.Carnegie Mellon University, University of Maryland College ParkConversational ChatbotsMultilingual & Cross-Cultural Voice InteractionAugmentative & Alternative Communication (AAC)CHI
WebUI: A Dataset for Enhancing Visual UI Understanding with Web SemanticsModeling user interfaces (UIs) from visual information allows systems to make inferences about the functionality and semantics needed to support use cases in accessibility, app automation, and testing. Current datasets for training machine learning models are limited in size due to the costly and time-consuming process of manually collecting and annotating UIs. We crawled the web to construct WebUI, a large dataset of 400,000 rendered web pages associated with automatically extracted metadata. We analyze the composition of WebUI and show that while automatically extracted data is noisy, most examples meet basic criteria for visual UI modeling. We applied several strategies for incorporating semantics found in web pages to increase the performance of visual UI understanding models in the mobile domain, where less labeled data is available: (i) element detection, (ii) screen classification and (iii) screen similarity.2023JWJason Wu et al.Carnegie Mellon UniversityUniversal & Inclusive DesignPrototyping & User TestingComputational Methods in HCICHI
From User Perceptions to Technical Improvement: Enabling People Who Stutter to Better Use Speech RecognitionConsumer speech recognition systems do not work as well for many people with speech differences, such as stuttering, relative to the rest of the general population. However, what is not clear is the degree to which these systems do not work, how they can be improved, or how much people want to use them. In this paper, we first address these questions using results from a 61-person survey from people who stutter and find participants want to use speech recognition but are frequently cut off, misunderstood, or speech predictions do not represent intent. In a second study, where 91 people who stutter recorded voice assistant commands and dictation, we quantify how dysfluencies impede performance in a consumer-grade speech recognition system. Through three technical investigations, we demonstrate how many common errors can be prevented, resulting in a system that cuts utterances off 79.1% less often and improves word error rate from 25.4% to 9.9%.2023CLColin Lea et al.AppleIntelligent Voice Assistants (Alexa, Siri, etc.)Voice AccessibilityCHI
Diffscriber: Describing Visual Design Changes to Support Mixed-Ability Collaborative Presentation AuthoringVisual slide-based presentations are ubiquitous, yet slide authoring tools are largely inaccessible to people who are blind or visually impaired (BVI). When authoring presentations, the 9 BVI presenters in our formative study usually work with sighted collaborators to produce visual slides based on the text content they produce. While BVI presenters valued collaborators’ visual design skills, the collaborators often felt they could not fully review and provide feedback on the visual changes that were made. We present Diffscriber, a system that identifies and describes changes to a slide’s content, layout, and style for presentation authoring. Using our system, BVI presentation authors can efficiently review changes to their presentation by navigating either a summary of high-level changes or individual slide elements. To learn more about changes of interest, presenters can use a generated change hierarchy to navigate to lower-level change details and element styles. BVI presenters using Diffscriber were able to identify slide design changes and provide feedback more easily as compared to using only the slides alone. More broadly, Diffscriber illustrates how advances in detecting and describing visual differences can improve mixed-ability collaboration.2022YPYi-Hao Peng et al.Visual Impairment Technologies (Screen Readers, Tactile Graphics, Braille)Universal & Inclusive DesignUIST
"It's Complicated": Negotiating Accessibility and (Mis)Representation in Image Descriptions of Race, Gender, and DisabilityContent creators are instructed to write textual descriptions of visual content to make it accessible; yet existing guidelines lack specifics on how to write about people's appearance, particularly while remaining mindful of consequences of (mis)representation. In this paper, we report on interviews with screen reader users who were also Black, Indigenous, People of Color, Non-binary, and/or Transgender on their current image description practices and preferences, and experiences negotiating theirs and others' appearances non-visually. We discuss these perspectives, and the ethics of humans and AI describing appearance characteristics that may convey the race, gender, and disabilities of those photographed. In turn, we share considerations for more carefully describing appearance, and contexts in which such information is perceived salient. Finally, we offer tensions and questions for accessibility research to equitably consider politics and ecosystems in which technologies will embed, such as potential risks of human and AI biases amplifying through image descriptions.2021CBCynthia L. Bennett et al.Carnegie Mellon UniversityVoice AccessibilityAI Ethics, Fairness & AccountabilityUniversal & Inclusive DesignCHI
Twitter A11y: A Browser Extension to Make Twitter Images AccessibleSocial media platforms are integral to public and private discourse, but are becoming less accessible to people with vision impairments due to an increase in user-posted images. Some platforms (e.g., Twitter) let users add image descriptions (alternative text), but only 0.1% of images include these. To address this accessibility barrier, we created Twitter A11y, a browser extension to add alternative text on Twitter using six methods. For example, screenshots of text are common, so we detect textual images, and create alternative text using optical character recognition. Twitter A11y also leverages services to automatically generate alternative text or reuse them from across the web. We compare the coverage and quality of Twitter A11y's six alt-text strategies by evaluating the timelines of 50 self-identified blind Twitter users. We find that Twitter A11y increases alt-text coverage from 7.6% to 78.5%, before crowdsourcing descriptions for the remaining images. We estimate that 57.5% of returned descriptions are high-quality. We then report on the experiences of 10 participants with visual impairments using the tool during a week-long deployment. Twitter A11y increases access to social media platforms for people with visual impairments by providing high-quality automatic descriptions for user-posted images.2020CGCole Gleason et al.Carnegie Mellon UniversityVisual Impairment Technologies (Screen Readers, Tactile Graphics, Braille)CHI
Conversational Agency in Augmentative and Alternative CommunicationAugmented communicators (ACs) use augmentative and alternative communication (AAC) technologies to speak. Prior work in AAC research has looked to improve efficiency and expressivity of AAC via device improvements and user training. However, ACs also face constraints in communication beyond their device and individual abilities such as when they can speak, what they can say, and who they can address. In this work, we recast and broaden this prior work using conversational agency as a new frame to study AC communication. We investigate AC conversational agency with a study examining different conversational tasks between four triads of expert ACs, their close conversation partners (paid aide or parent), and a third party (experimenter). We define metrics to analyze AAC conversational agency quantitatively and qualitatively. We conclude with implications for future research to enable ACs to easily exercise conversational agency.2020SVStephanie Valencia et al.Carnegie Mellon UniversityAugmentative & Alternative Communication (AAC)Universal & Inclusive DesignCHI
Automated Class Discovery and One-Shot Interactions for Acoustic Activity RecognitionAcoustic activity recognition has emerged as a foundational element for imbuing devices with context-driven capabilities, enabling richer, more assistive, and more accommodating computational experiences. Traditional approaches rely either on custom models trained in situ, or general models pre-trained on preexisting data, with each approach having accuracy and user burden implications. We present Listen Learner, a technique for activity recognition that gradually learns events specific to a deployed environment while minimizing user burden. Specifically, we built an end-to-end system for self-supervised learning of events labelled through one-shot interaction. We describe and quantify system performance 1) on preexisting audio datasets, 2) on real-world datasets we collected, and 3) through user studies which uncovered system behaviors suitable for this new type of interaction. Our results show that our system can accurately and automatically learn acoustic events across environments (e.g., 97% precision, 87% recall), while adhering to users' preferences for non-intrusive interactive behavior.2020JWJason Wu et al.Carnegie Mellon University & Apple Inc.Human Pose & Activity RecognitionContext-Aware ComputingCHI
A Data-Driven Analysis of Workers’ Earnings on Amazon Mechanical TurkA growing number of people are working as part of on-line crowd work. Crowd work is often thought to be low wage work. However, we know little about the wage distribution in practice and what causes low/high earnings in this setting. We recorded 2,676 workers performing 3.8 million tasks on Amazon Mechanical Turk. Our task-level analysis revealed that workers earned a median hourly wage of only ~$2/h, and only 4% earned more than $7.25/h. While the average requester pays more than $11/h, lower-paying requesters post much more work. Our wage calculations are influenced by how unpaid work is accounted for, e.g., time spent searching for tasks, working on tasks that are rejected, and working on tasks that are ultimately not submitted. We further explore the characteristics of tasks and working patterns that yield higher hourly wages. Our analysis informs platform design and worker tools to create a more positive future for crowd work.2018KHKotaro Hara et al.Singapore Management University, Carnegie Mellon UniversityCrowdsourcing Task Design & Quality ControlImpact of Automation on WorkCHI
Evorus: A Crowd-powered Conversational Assistant Built to Automate Itself Over TimeCrowd-powered conversational assistants have been shown to be more robust than automated systems, but do so at the cost of higher response latency and monetary costs. A promising direction is to combine the two approaches for high quality, low latency, and low cost solutions. In this paper, we introduce Evorus, a crowd-powered conversational assistant built to automate itself over time by (i) allowing new chatbots to be easily integrated to automate more scenarios, (ii) reusing prior crowd answers, and (iii) learning to automatically approve response candidates. Our 5-month-long deployment with 80 participants and 281 conversations shows that Evorus can automate itself without compromising conversation quality. Crowd-AI architectures have long been proposed as a way to reduce cost and latency for crowd-powered systems; Evorus demonstrates how automation can be introduced successfully in a deployed system. Its architecture allows future researchers to make further innovation on the underlying automated components in the context of a deployed open domain dialog system.2018THTing-Hao (Kenneth) Huang et al.Carnegie Mellon UniversityIntelligent Voice Assistants (Alexa, Siri, etc.)Conversational ChatbotsHuman-LLM CollaborationCHI