SemanticCommit: Helping Users Update Intent Specifications for AI Memory at Scale
Priyan Vaithilingam et al. UIST 2025. Topics: Human-LLM Collaboration; AI-Assisted Decision-Making & Automation; Algorithmic Transparency & Auditability.
As AI agents increasingly rely on memory systems to align with user intent, updating these memories presents challenges of semantic conflict and ambiguity. Inspired by impact analysis in software engineering, we introduce SemanticCommit, a mixed-initiative interface to help users integrate new intent into intent specifications—natural language documents like AI memory lists, Cursor Rules, and game design documents—while maintaining consistency. SemanticCommit detects potential semantic conflicts using a knowledge graph-based retrieval-augmented generation pipeline, and assists users in resolving them with LLM support. Through a within-subjects study with 12 participants comparing SemanticCommit to a chat-with-document baseline (OpenAI Canvas), we find differences in workflow: half of our participants adopted a workflow of impact analysis when using SemanticCommit, where they would first flag conflicts without AI revisions, then resolve conflicts locally, despite having access to a global revision feature. Additionally, users felt SemanticCommit offered a greater sense of control without increasing workload. Our findings indicate that AI agent interfaces should help users validate AI retrieval independently from generation, suggesting that the benefits from improved control can offset the costs of manual review. Our work speaks to the need for AI system designers to think about updating memory as a process that involves human feedback and decision-making.
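The retrieval-then-flag pattern the abstract describes can be sketched in miniature. This toy stands in for SemanticCommit's actual pipeline only loosely: the real system retrieves via a knowledge graph and judges conflicts with an LLM, whereas here retrieval is plain word overlap and nothing is auto-revised — candidates are merely surfaced for the user to review, mirroring the "flag first, resolve locally" workflow. All names (`flag_candidates`, `overlap`) are invented for illustration.

```python
import re

def words(s: str) -> set[str]:
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"[a-z]+", s.lower()))

def overlap(a: str, b: str) -> float:
    """Jaccard similarity between two strings' word sets."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def flag_candidates(new_intent: str, memory: list[str], k: int = 3) -> list[str]:
    """Return the k memory entries most related to the new intent.
    These are flagged for human review, not automatically rewritten."""
    ranked = sorted(memory, key=lambda m: overlap(new_intent, m), reverse=True)
    return ranked[:k]

memory = [
    "Always respond in formal English.",
    "Use casual tone in chat replies.",
    "Prefer bullet points for summaries.",
]
# The entry most likely to conflict with the new intent is ranked first.
print(flag_candidates("Adopt a casual, friendly tone everywhere", memory, k=2))
```

Keeping retrieval (which entries are related?) separate from resolution (how should they change?) is exactly the separation the paper argues interfaces should expose to users.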
AbstractExplorer: Leveraging Structure-Mapping Theory to Enhance Comparative Close Reading at Scale
Ziwei Gu et al. UIST 2025. Topics: Human-LLM Collaboration; Interactive Data Visualization.
Individual flagship conferences today can have over a thousand papers; even reading just the abstract of every paper at the latest relevant conference to keep up with the research is time and memory prohibitive. Previous visualizations in this domain have ubiquitously followed Shneiderman's Visual Information-Seeking Mantra, with details available on demand. However, recently in other domains, system designers have leveraged Structure-Mapping Theory (SMT) to facilitate seeing both the overview and the details at the same time, facilitating abstraction without losing context. We compose and evaluate a system, called AbstractExplorer, with analogous SMT-derived characteristics for the domain of scientific abstract corpus familiarization. AbstractExplorer has a unique combination of LLM-powered (1) faceted comparative close reading with (2) role highlighting enhanced by (3) structure-based ordering and (4) alignment. An ablation study (N=24) validated that these features work best together. A summative study (N=16) describes how these features support users in familiarizing themselves with a corpus of paper abstracts from a single large conference with over 1000 papers.
Supporting Co-Adaptive Machine Teaching through Human Concept Learning and Cognitive Theories
Simret Araya Gebreegziabher et al. (University of Notre Dame, Department of Computer Science and Engineering). CHI 2025. Topics: Explainable AI (XAI); AI-Assisted Decision-Making & Automation.
An important challenge in interactive machine learning, particularly in subjective or ambiguous domains, is fostering bi-directional alignment between humans and models. Users teach models their concept definition through data labeling, while refining their own understandings throughout the process. To facilitate this, we introduce MOCHA, an interactive machine learning tool informed by two theories of human concept learning and cognition. First, it utilizes a neuro-symbolic pipeline to support Variation Theory-based counterfactual data generation. By asking users to annotate counterexamples that are syntactically and semantically similar to already-annotated data but predicted to have different labels, the system can learn more effectively while helping users understand the model and reflect on their own label definitions. Second, MOCHA uses Structural Alignment Theory to present groups of counterexamples, helping users comprehend alignable differences between data items and annotate them in batch. We validated MOCHA's effectiveness and usability through a lab study with 18 participants.
Creative Writers’ Attitudes on Writing as Training Data for Large Language Models
Katy Ilonka Gero et al. (Harvard University, Department of Computer Science). CHI 2025. Topics: Human-LLM Collaboration; AI Ethics, Fairness & Accountability; Algorithmic Fairness & Bias.
The use of creative writing as training data for large language models (LLMs) is highly contentious and many writers have expressed outrage at the use of their work without consent or compensation. In this paper, we seek to understand how creative writers reason about the real or hypothetical use of their writing as training data. We interviewed 33 writers with variation across genre, method of publishing, degree of professionalization, and attitudes toward and engagement with LLMs. We report on core principles that writers express (support of the creative chain, respect for writers and writing, and the human element of creativity) and how these principles can be at odds with their realistic expectations of the world (a lack of control, industry-scale impacts, and interpretation of scale). Collectively these findings demonstrate that writers have a nuanced understanding of LLMs and are more concerned with power imbalances than the technology itself.
CorpusStudio: Surfacing Emergent Patterns In A Corpus Of Prior Work While Writing
Hai Dang et al. (University of Bayreuth, HCI+AI). CHI 2025. Topics: AI-Assisted Creative Writing; Creative Collaboration & Feedback Systems.
Many communities, including the scientific community, develop implicit writing norms. Understanding them is crucial for effective communication with that community. Writers gradually develop an implicit understanding of norms by reading papers and receiving feedback on their writing. However, it is difficult to both externalize this knowledge and apply it to one's own writing. We propose two new writing support concepts that reify document and sentence-level patterns in a given text corpus: (1) an ordered distribution over section titles and (2) given the user's draft and cursor location, many retrieved contextually relevant sentences. Recurring words in the latter are algorithmically highlighted to help users see any emergent norms. Study results (N=16) show that participants revised the structure and content using these concepts, gaining confidence in aligning with or breaking norms after reviewing many examples. These results demonstrate the value of reifying distributions over other authors’ writing choices during the writing process.
Imagining a Future of Designing with AI: Dynamic Grounding, Constructive Negotiation, and Sustainable Motivation
Priyan Vaithilingam et al. DIS 2024. Topics: Human-LLM Collaboration; Design Fiction.
We ideate a future design workflow that involves AI technology. Drawing from activity and communication theory, we attempt to isolate the new value that large AI models can provide design compared to past technologies. We arrive at three affordances - dynamic grounding, constructive negotiation, and sustainable motivation - that summarize latent qualities of natural language-enabled foundation models that, if explicitly designed for, can support the process of design. Through design fiction, we then imagine a future interface as a diegetic prototype, the story of Squirrel Game, that demonstrates each of our three affordances in a realistic usage scenario. Our design process, terminology, and diagrams aim to contribute to future discussions about the relative affordances of AI technology with regard to collaborating with human designers.
DynaVis: Dynamically Synthesized UI Widgets for Visualization Editing
Priyan Vaithilingam et al. (Harvard University). CHI 2024. Topics: Interactive Data Visualization; Prototyping & User Testing.
Users often rely on GUIs to edit and interact with visualizations - a daunting task due to the large space of editing options. As a result, users are either overwhelmed by a complex UI or constrained by a custom UI with a tailored, fixed subset of options with limited editing flexibility. Natural Language Interfaces (NLIs) are emerging as a feasible alternative for users to specify edits. However, NLIs forgo the advantages of traditional GUIs: the ability to explore and repeat edits and see instant visual feedback. We introduce DynaVis, which blends natural language and dynamically synthesized UI widgets. As the user describes an editing task in natural language, DynaVis performs the edit and synthesizes a persistent widget that the user can interact with to make further modifications. Study participants (n=24) preferred DynaVis over the NLI-only interface, citing ease of further edits and editing confidence due to immediate visual feedback.
An AI-Resilient Text Rendering Technique for Reading and Skimming Documents
Ziwei Gu et al. (Harvard University). CHI 2024. Topics: Generative AI (Text, Image, Music, Video); Explainable AI (XAI); Visualization Perception & Cognition.
Readers find text difficult to consume for many reasons. Summarization can address some of these difficulties, but can introduce others, such as omitting, misrepresenting, or hallucinating information, which can be hard for a reader to notice. One approach to addressing this problem is to instead modify how the original text is rendered to make important information more salient. We introduce Grammar-Preserving Text Saliency Modulation (GP-TSM), a text rendering method with a novel means of identifying what to de-emphasize. Specifically, GP-TSM uses a recursive sentence compression method to identify successive levels of detail beyond the core meaning of a passage, which are de-emphasized by rendering words in successively lighter but still legible gray text. In a lab study (n=18), participants preferred GP-TSM over pre-existing word-level text rendering methods and were able to answer GRE reading comprehension questions more efficiently.
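The rendering idea behind GP-TSM can be illustrated with a toy: once successive compression levels of a sentence exist, each word's saliency is just the index of the most-compressed level that still contains it. Note the hedge — the paper derives these levels automatically via recursive sentence compression, whereas this sketch takes hand-written levels as input, and `gray_levels` is an invented name, not the system's API.

```python
def gray_levels(levels: list[set[str]], sentence: str) -> dict[str, int]:
    """Map each word of the sentence to a saliency tier.

    levels[0] is the most compressed version (the core meaning); each later
    level adds detail. A word's tier is the first level containing it, so
    lower tiers = more salient = rendered darker. Words absent from every
    level get the lightest tier, len(levels).
    """
    out = {}
    for w in sentence.split():
        out[w] = next((i for i, lvl in enumerate(levels) if w in lvl), len(levels))
    return out

# Hand-written compression levels for a toy sentence (in GP-TSM these
# would come from the recursive compression pipeline).
levels = [{"fox", "jumps"}, {"fox", "jumps", "over", "dog"}]
gm = gray_levels(levels, "the quick brown fox jumps over the lazy dog")
print(gm)
```

A renderer would then map tier 0 to near-black and higher tiers to successively lighter (but still legible) grays, e.g. via CSS `color` values, which is how the de-emphasis stays readable rather than elided.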
ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing
Ian Arawjo et al. (Harvard University). CHI 2024. Topics: Human-LLM Collaboration; Explainable AI (XAI); Recommender System UX.
Evaluating outputs of large language models (LLMs) is challenging, requiring making—and making sense of—many responses. Yet tools that go beyond basic prompting tend to require knowledge of programming APIs, focus on narrow domains, or are closed-source. We present ChainForge, an open-source visual toolkit for prompt engineering and on-demand hypothesis testing of text generation LLMs. ChainForge provides a graphical interface for comparison of responses across models and prompt variations. Our system was designed to support three tasks: model selection, prompt template design, and hypothesis testing (e.g., auditing). We released ChainForge early in its development and iterated on its design with academics and online users. Through in-lab and interview studies, we find that a range of people could use ChainForge to investigate hypotheses that matter to them, including in real-world settings. We identify three modes of prompt engineering and LLM hypothesis testing: opportunistic exploration, limited evaluation, and iterative refinement.
Supporting Sensemaking of Large Language Model Outputs at Scale
Katy Ilonka Gero et al. (Harvard University). CHI 2024. Topics: Human-LLM Collaboration; Explainable AI (XAI); Interactive Data Visualization.
Large language models (LLMs) are capable of generating multiple responses to a single prompt, yet little effort has been expended to help end-users or system designers make use of this capability. In this paper, we explore how to present many LLM responses at once. We design five features, which include both pre-existing and novel methods for computing similarities and differences across textual documents, as well as how to render their outputs. We report on a controlled user study (n=24) and eight case studies evaluating these features and how they support users in different tasks. We find that the features support a wide variety of sensemaking tasks and even make tasks tractable that our participants previously considered to be too difficult to attempt. Finally, we present design guidelines to inform future explorations of new LLM interfaces.
PaTAT: Human-AI Collaborative Qualitative Coding with Explainable Interactive Rule Synthesis
Simret Araya Gebreegziabher et al. (University of Notre Dame). CHI 2023. Topics: Explainable AI (XAI); AI-Assisted Decision-Making & Automation; Computational Methods in HCI.
Over the years, the task of AI-assisted data annotation has seen remarkable advancements. However, a specific type of annotation task, the qualitative coding performed during thematic analysis, has characteristics that make effective human-AI collaboration difficult. Informed by a formative study, we designed PaTAT, a new AI-enabled tool that uses an interactive program synthesis approach to learn flexible and expressive patterns over user-annotated codes in real-time as users annotate data. To accommodate the ambiguous, uncertain, and iterative nature of thematic analysis, the use of user-interpretable patterns allows users to understand and validate what the system has learned, make direct fixes, and easily revise, split, or merge previously annotated codes. This new approach also helps human users to learn data characteristics and form new theories in addition to facilitating the "learning" of the AI model. PaTAT’s usefulness and effectiveness were evaluated in a lab user study.
Concept-Labeled Examples for Library Comparison
Litao Yan et al. UIST 2022. Topics: Knowledge Worker Tools & Workflows; Prototyping & User Testing.
Programmers often rely on online resources—such as code examples, documentation, blogs, and Q&A forums—to compare similar libraries and select the one most suitable for their own tasks and contexts. However, this comparison task is often done in an ad-hoc manner, which may result in suboptimal choices. Inspired by Analogical Learning and Variation Theory, we hypothesize that rendering many concept-annotated code examples from different libraries side-by-side can help programmers (1) develop a more comprehensive understanding of the libraries' similarities and distinctions and (2) make more robust, appropriate library selections. We designed a novel interactive interface, ParaLib, and used it as a technical probe to explore to what extent many side-by-side concept-annotated examples can facilitate the library comparison and selection process. A within-subjects user study with 20 programmers shows that, when using ParaLib, participants made more consistent, suitable library selections and provided more comprehensive summaries of libraries' similarities and differences.
A Comparative Evaluation of Interventions Against Misinformation: Augmenting the WHO Checklist
Morgan Klaus Scheuerman et al. (University of Bremen). CHI 2022. Topics: Content Moderation & Platform Governance; Misinformation & Fact-Checking.
During the COVID-19 pandemic, the World Health Organization provided a checklist to help people distinguish between accurate information and misinformation. In controlled experiments in the United States and Germany, we investigated the utility of this ordered checklist and designed an interactive version to lower the cost of acting on checklist items. Across interventions, we observe non-trivial differences in participants' performance in distinguishing accurate information from misinformation between the two countries and discuss some possible reasons that may predict the future helpfulness of the checklist in different environments. The checklist item that provides source labels was most frequently followed and was considered most helpful. Based on our empirical findings, we recommend practitioners focus on providing source labels rather than interventions that support readers performing their own fact-checks, even though this recommendation may be influenced by the WHO's chosen order. We discuss the complexity of providing such source labels and provide design recommendations.
Assuage: Assembly Synthesis Using A Guided Exploration
Kehang Zhu et al. UIST 2021. Topics: Human-LLM Collaboration; Computational Methods in HCI.
Assembly programming is challenging, even for experts. Program synthesis, as an alternative to manual implementation, has the potential to enable both expert and non-expert users to generate programs in an automated fashion. However, current tools and techniques are unable to synthesize assembly programs larger than a few instructions. We present Assuage: ASsembly Synthesis Using A Guided Exploration, which is a parallel interactive assembly synthesizer that engages the user as an active collaborator, enabling synthesis to scale beyond current limits. Using Assuage, users can provide two types of semantically meaningful hints that expedite synthesis and allow for exploration of multiple possibilities simultaneously. Assuage exposes information about the underlying synthesis process using multiple representations to help users guide synthesis. We conducted a within-subjects study with twenty-one participants working on assembly programming tasks. With Assuage, participants with a wide range of expertise were able to achieve significantly higher success rates, perceived less subjective workload, and preferred the usefulness and usability of Assuage over a state-of-the-art synthesis tool.
Visualizing Examples of Deep Neural Networks at Scale
Litao Yan et al. (Harvard University). CHI 2021. Topics: Human-LLM Collaboration; Interactive Data Visualization.
Many programmers want to use deep learning due to its superior accuracy in many challenging domains. Yet our formative study with ten programmers indicated that, when constructing their own deep neural networks (DNNs), they often had a difficult time choosing appropriate model structures and hyperparameter values. This paper presents ExampleNet, a novel interactive visualization system for exploring common and uncommon design choices in a large collection of open-source DNN projects. ExampleNet provides a holistic view of the distribution over model structures and hyperparameter settings in the corpus of DNNs, so users can easily filter the corpus down to projects tackling similar tasks and compare design choices made by others. We evaluated ExampleNet in a within-subjects study with sixteen participants. Compared with the control condition (i.e., online search), participants using ExampleNet were able to inspect more online examples, make more data-driven design decisions, and make fewer design mistakes.
Evaluating the Interpretability of Generative Models by Interactive Reconstruction
Andrew Ross et al. (Harvard University). CHI 2021. Topics: Explainable AI (XAI); AI-Assisted Decision-Making & Automation; Algorithmic Transparency & Auditability.
For machine learning models to be most useful in numerous sociotechnical systems, many have argued that they must be human-interpretable. However, despite increasing interest in interpretability, there remains no firm consensus on how to measure it. This is especially true in representation learning, where interpretability research has focused on "disentanglement" measures only applicable to synthetic datasets and not grounded in human factors. We introduce a task to quantify the human-interpretability of generative model representations, where users interactively modify representations to reconstruct target instances. On synthetic datasets, we find performance on this task much more reliably differentiates entangled and disentangled models than baseline approaches. On a real dataset, we find it differentiates between representation learning methods widely believed but never shown to produce more or less interpretable models. In both cases, we ran small-scale think-aloud studies and large-scale experiments on Amazon Mechanical Turk to confirm that our qualitative and quantitative results agreed.
Interpretable Program Synthesis
Tianyi Zhang et al. (Harvard University). CHI 2021. Topics: Explainable AI (XAI); Prototyping & User Testing.
Program synthesis, which generates programs based on user-provided specifications, can be obscure and brittle: users have few ways to understand and recover from synthesis failures. We propose interpretable program synthesis, a novel approach that unveils the synthesis process and enables users to monitor and guide the synthesis. We designed three representations that explain the underlying synthesis process with different levels of fidelity. We implemented an interpretable synthesizer and conducted a within-subjects study with eighteen participants on three challenging regular expression programming tasks. With interpretable synthesis, participants were able to reason about synthesis failures and strategically provide feedback, achieving a significantly higher success rate compared with a state-of-the-art synthesizer. In particular, participants with a high engagement tendency (as measured by NCS-6) preferred a deductive representation that shows the synthesis process in a search tree, while participants with a relatively low engagement tendency preferred an inductive representation that renders representative samples of programs enumerated during synthesis.
Interactive Program Synthesis by Augmented Examples
Tianyi Zhang et al. UIST 2020. Topics: Programming Education & Computational Thinking; Computational Methods in HCI.
Programming-by-example (PBE) has become an increasingly popular component in software development tools, human-robot interaction, and end-user programming. A long-standing challenge in PBE is the inherent ambiguity in user-provided examples. This paper presents an interaction model to disambiguate user intent and reduce the cognitive load of understanding and validating synthesized programs. Our model provides two types of augmentations to user-given examples: 1) semantic augmentation, where a user can specify how different aspects of an example should be treated by a synthesizer via light-weight annotations, and 2) data augmentation, where the synthesizer generates additional examples to help the user understand and validate synthesized programs. We implement and demonstrate this interaction model in the domain of regular expressions, which is a popular mechanism for text processing and data wrangling and is often considered hard to master even for experienced programmers. A within-subjects user study with twelve participants shows that, compared with only inspecting and annotating synthesized programs, interacting with augmented examples significantly increases the success rate of finishing a programming task with less time and increases users’ confidence in synthesized programs.
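The "semantic augmentation" idea above can be made concrete with a deliberately tiny sketch: the user annotates spans of one example string with how a synthesizer should treat them, and a regex is built accordingly. The two annotation kinds here (`literal`, `digits`) and the function name `example_to_regex` are invented for illustration; the paper's actual annotation vocabulary and synthesizer are far richer.

```python
import re

def example_to_regex(example: str, spans: list[tuple[int, int, str]]) -> str:
    """Build a regex from one example plus user annotations.

    spans: (start, end, kind) tuples over the example string, where kind is
    'literal' (match exactly this text) or 'digits' (generalize to any
    digit run). Unannotated text is treated as literal.
    """
    parts, pos = [], 0
    for start, end, kind in sorted(spans):
        parts.append(re.escape(example[pos:start]))   # unannotated prefix
        seg = example[start:end]
        parts.append(re.escape(seg) if kind == "literal" else r"\d+")
        pos = end
    parts.append(re.escape(example[pos:]))            # unannotated suffix
    return "^" + "".join(parts) + "$"

# The user marks characters 6..10 of "order-1234" as "any digits":
pattern = example_to_regex("order-1234", [(6, 10, "digits")])
print(pattern)
```

The same pattern now accepts `order-9999` but rejects `order-abc` — one annotation resolved the ambiguity of whether `1234` meant that literal number or any number, which is the kind of intent disambiguation the interaction model targets.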
Enabling Data-Driven API Design with Community Usage Data: A Need-Finding Study
Tianyi Zhang et al. (Harvard University). CHI 2020. Topics: Computational Methods in HCI.
APIs are becoming the fundamental building block of modern software and their usability is crucial to programming efficiency and software quality. Yet API designers find it hard to gather and interpret user feedback on their APIs. To close the gap, we interviewed 23 API designers from 6 companies and 11 open-source projects to understand their practices and needs. The primary way of gathering user feedback is through bug reports and peer reviews, as formal usability testing is prohibitively expensive to conduct in practice. Participants expressed a strong desire to gather real-world use cases and understand users' mental models, but there was a lack of tool support for such needs. In particular, participants were curious about where users got stuck, their workarounds, common mistakes, and unanticipated corner cases. We highlight several opportunities to address those unmet needs, including developing new mechanisms that systematically elicit users' mental models, building mining frameworks that identify recurring patterns beyond shallow statistics about API usage, and exploring alternative design choices made in similar libraries.