Generative Rotoscoping: A First-Person Autobiographical Exploration on Generative Video-to-Video Practices
David Ledo. C&C 2025. Topics: Generative AI (Text, Image, Music, Video); Video Production & Editing.
This paper contributes a first-person exploration of AI video-to-video technologies, which I call "Generative Rotoscoping". This includes: insights into the creation process, a set of prototype explorations, and an integrated workflow for multi-modal video generation. Generative video is rapidly evolving and delivering higher-quality outputs. While video generation models have potential for film-making and content creation, they lack controllability for creative expression: viable videos can require hundreds of unsuccessful attempts. To understand this emergent practice, and given the constant evolution of models and the limited number of early adopters, I explored Generative Rotoscoping over 12 months and created AI workflows producing over 40,000 video and image files, examining a variety of models and techniques including structural guidance, frame consistency, image referencing and masks, and compositing, among others. Insights from this work can serve as a starting point for designing the next generation of video authoring tools.

WhatIF: Branched Narrative Fiction Visualization for Authoring Emergent Narratives using Large Language Models
Aditi Mishra et al. C&C 2025. Topics: Generative AI (Text, Image, Music, Video); AI-Assisted Creative Writing.
Branched Narrative Fiction (BNF) comprises non-linear, text-based narrative games in which the player is an active participant shaping the story. Unlike linear narratives, BNF allows players to influence the direction, outcomes, and progression of the plot. A narrative fiction developer designs these branching storylines, creating a dynamic interaction between the player and the narrative, which requires significant time and skill. In this work we build and investigate the use of a visual analytics tool to help narrative fiction developers generate and plan these parallel worlds within a BNF. We present WhatIF, a visual analytics tool that helps BNF developers create BNF graphs, edit the graphs, obtain recommendations, visualize differences between storylines, and finally verify their BNF against custom metrics. Through a formative study (3 participants) and a user study (11 participants), we observe that WhatIF helps users plan and prototype their BNF, supports iterative refinement of the narrative, and aids in removing writer's block. Furthermore, we explore how contemporary generative AI (GenAI) tools can empower game developers to build richer and more immersive narratives.

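To make the notion of a BNF graph concrete, here is a minimal sketch of a branched narrative structure of the kind such a tool might visualize and edit. The node layout, field names, and toy story are assumptions for illustration, not WhatIF's actual data model.

```python
# Minimal sketch of a branched narrative graph (hypothetical, for illustration):
# each node holds a passage of text and its outgoing choices.
from dataclasses import dataclass, field

@dataclass
class Passage:
    text: str
    choices: dict[str, str] = field(default_factory=dict)  # choice label -> node id

story = {
    "start": Passage("You wake up in a quiet village.",
                     {"Explore the forest": "forest", "Stay home": "home"}),
    "forest": Passage("The trees close in around you."),
    "home": Passage("The day passes uneventfully."),
}

def playthrough(node_id: str = "start") -> None:
    """Follow the first choice at every node (one possible storyline)."""
    while node_id in story:
        node = story[node_id]
        print(node.text)
        if not node.choices:
            break
        node_id = next(iter(node.choices.values()))

playthrough()
```
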
Exploring the Potential of Metacognitive Support Agents for Human-AI Co-Creation
Frederic Gmeiner et al. DIS 2025. Topics: Generative AI (Text, Image, Music, Video); Human-LLM Collaboration; Creative Collaboration & Feedback Systems.
Despite the potential of generative AI (GenAI) design tools to enhance design processes, professionals often struggle to integrate AI into their workflows. Fundamental cognitive challenges include the need to specify all design criteria as distinct parameters upfront (intent formulation) and designers' reduced cognitive involvement in the design process due to cognitive offloading, which can lead to insufficient problem exploration, underspecification, and limited ability to evaluate outcomes. Motivated by these challenges, we envision novel metacognitive support agents that assist designers in working more reflectively with GenAI. To explore this vision, we conducted exploratory prototyping through a Wizard of Oz elicitation study with 20 mechanical designers probing multiple metacognitive support strategies. We found that agent-supported users created more feasible designs than non-supported users, with differing impacts between support strategies. Based on these findings, we discuss opportunities and tradeoffs of metacognitive support agents and considerations for future AI-based design tools.

VideoMix: Aggregating How-To Videos for Task-Oriented Learning
Saelyne Yang et al. IUI 2025. Topics: Online Learning & MOOC Platforms; Intelligent Tutoring Systems & Learning Analytics.
Tutorial videos are a valuable resource for people looking to learn new tasks. People often watch multiple tutorial videos to gain an overall understanding of a task and to compare different approaches to achieving it. However, navigating through multiple videos can be time-consuming and mentally demanding, as these videos are scattered and not easy to skim. We propose VideoMix, a system that helps users gain a holistic understanding of a how-to task by aggregating information from multiple videos on the task. Insights from our formative study (N=12) reveal that learners value understanding potential outcomes, required materials, alternative methods, and important details shared by different videos. Powered by a Vision-Language Model pipeline, VideoMix extracts and organizes this information, presenting concise textual summaries alongside relevant video clips, enabling users to quickly digest and navigate the content. A comparative user study (N=12) demonstrated that VideoMix enabled participants to gain a more comprehensive understanding of tasks with greater efficiency than a baseline video interface, where videos are viewed independently. Our findings highlight the potential of a task-oriented, multi-video approach where videos are organized around a shared goal, offering an enhanced alternative to conventional video-based learning.

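As a rough illustration of the aggregation idea (not the VideoMix pipeline itself), the sketch below merges per-video extractions into categories such as outcomes and materials while keeping track of which videos mention each item. The extraction step is stubbed out with a hypothetical function standing in for a vision-language model pass.

```python
# Illustrative sketch: aggregating task information across how-to videos.
from collections import defaultdict

def extract_from_video(video_id: str) -> dict:
    """Hypothetical stub standing in for a vision-language model extraction."""
    return {
        "outcomes": ["glossy finish"],
        "materials": ["sandpaper", "varnish"],
        "methods": ["spray application"],
    }

def aggregate(video_ids: list[str]) -> dict[str, dict[str, list[str]]]:
    """Group each piece of information by category, keeping video provenance."""
    combined: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for vid in video_ids:
        for category, items in extract_from_video(vid).items():
            for item in items:
                combined[category][item].append(vid)  # which videos mention it
    return {category: dict(items) for category, items in combined.items()}

print(aggregate(["video_a", "video_b"]))
```
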
WhatELSE: Shaping Narrative Spaces at Configurable Level of Abstraction for AI-bridged Interactive Storytelling
Zhuoran Lu et al. CHI 2025. Autodesk Research; Purdue University, Computer Science. Topics: Generative AI (Text, Image, Music, Video); AI-Assisted Creative Writing; Interactive Narrative & Immersive Storytelling.
Generative AI significantly enhances player agency in interactive narratives (IN) by enabling just-in-time content generation that adapts to player actions. While delegating generation to AI makes IN more interactive, it becomes challenging for authors to control the space of possible narratives, within which the final story experienced by the player emerges from their interaction with the AI. In this paper, we present WhatELSE, an AI-bridged IN authoring system that creates narrative possibility spaces from example stories. WhatELSE provides three views (narrative pivot, outline, and variants) to help authors understand the narrative space, and corresponding tools leveraging linguistic abstraction to control the boundaries of the narrative space. Using LLM-based narrative planning, WhatELSE further unfolds the narrative space into executable game events. Through a user study (N=12) and technical evaluations, we found that WhatELSE enables authors to perceive and edit the narrative space and generates engaging interactive narratives at play-time.

Paratrouper: Exploratory Creation of Character Cast Visuals Using Generative AI
Joanne Leong et al. CHI 2025. MIT, MIT Media Lab. Topics: Generative AI (Text, Image, Music, Video); 3D Modeling & Animation.
Great characters are critical to the success of many forms of media, such as comics, games, and films. Designing visually compelling casts of characters requires significant skill and consideration, and there is a lack of specialized tools to support this endeavor. We investigate how AI-driven image-generation techniques can empower creatives to explore a variety of visual design possibilities for individual characters and groups of characters. Informed by interviews with character designers, Paratrouper is a multi-modal system that enables creating and experimenting with multiple permutations for character casts and visualizing them in various contexts as part of a holistic approach to design. We demonstrate how Paratrouper supports different aspects of the character design process, and share insights from its use by eight creators. Our work highlights the interplay between creative agency and serendipity, as well as the visual interrelationships among character aesthetics.

To Use or Not to Use: Impatience and Overreliance When Using Generative AI Productivity Support Tools
Han Qiao et al. CHI 2025. Autodesk Research. Topics: Generative AI (Text, Image, Music, Video); AI-Assisted Decision-Making & Automation.
Generative AI has the potential to assist people with completing various tasks, but increased productivity is not guaranteed due to challenges such as uncertainty in output quality and unclear processing time. Through an online crowdsourced experiment (N=508), leveraging a “paint by numbers” task to simulate properties of GenAI assistance, we explore how, and how well, users decide whether or not to use automation to maximize their productivity given varying waiting times and output quality. We observed gaps between users' actual choices and their optimal choices and characterized these gaps as the “gulf of impatience” and the “gulf of overreliance”. We also distilled strategies that participants adopted when making their decisions. We discuss design considerations for supporting users in making more informed decisions when interacting with GenAI tools and for making these tools more useful for improving users' task performance, productivity, and satisfaction.

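The underlying trade-off can be framed as a simple expected-time comparison. The sketch below is not the study's model; the cost assumptions (a fixed waiting time, plus fixing a fixed fraction of the task the AI gets wrong) and the numbers are illustrative only.

```python
# Illustrative sketch of the "use AI or do it manually" decision.
def expected_time_with_ai(wait_s: float, quality: float, manual_s: float) -> float:
    """Wait for the AI, then manually fix the fraction it got wrong (assumed model)."""
    return wait_s + (1.0 - quality) * manual_s

def should_use_ai(wait_s: float, quality: float, manual_s: float) -> bool:
    """Use the AI only when its expected total time beats doing the task by hand."""
    return expected_time_with_ai(wait_s, quality, manual_s) < manual_s

# Example: a 60 s manual task, 20 s of waiting, output 80% usable:
# 20 + 0.2 * 60 = 32 s < 60 s, so using the AI pays off here.
print(should_use_ai(wait_s=20, quality=0.8, manual_s=60))
```
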
Exploring User Engagement Through an Interaction Lens: What Textual Cues Can Tell Us about Human-Chatbot Interactions
Linwei He et al. CUI 2024. Topics: Conversational Chatbots; Explainable AI (XAI); AI-Assisted Decision-Making & Automation.
Monitoring and maintaining user engagement in human-chatbot interactions is challenging. Researchers often use cues observed in the interactions as indicators to infer engagement. However, evaluation of these cues is lacking. In this study, we collected an inventory of potential textual engagement cues from the literature, including linguistic features, utterance features, and interaction features. These cues were subsequently used to annotate a dataset of 291 user-chatbot interactions, and we examined which of these cues predicted self-reported user engagement. Our results show that engagement can indeed be recognized at the level of individual utterances. Notably, words indicating cognitive thinking processes and motivational utterances were strong indicators of engagement. An overall negative tone could also predict engagement, highlighting the importance of nuanced interpretation and contextual awareness of user utterances. Our findings demonstrate the initial feasibility of recognizing utterance-level cues and using them to infer user engagement, although further validation is needed across different content domains.

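To illustrate the general approach of relating utterance-level textual cues to engagement (not the authors' feature set or analysis), here is a minimal sketch using made-up word lists and a toy dataset with scikit-learn's logistic regression.

```python
# Illustrative sketch: simple utterance-level cues as predictors of engagement.
from sklearn.linear_model import LogisticRegression

# Hypothetical cue lexicons; real studies use validated dictionaries.
COGNITIVE_WORDS = {"think", "because", "know", "realize", "understand"}
MOTIVATION_WORDS = {"want", "try", "hope", "goal", "plan"}

def cue_features(utterance: str) -> list[float]:
    tokens = utterance.lower().split()
    return [
        float(sum(t in COGNITIVE_WORDS for t in tokens)),   # cognitive-process cue
        float(sum(t in MOTIVATION_WORDS for t in tokens)),   # motivational cue
        float(len(tokens)),                                  # utterance length
    ]

# Toy data: utterances with made-up self-reported engagement labels.
utterances = [
    "I think I finally understand what you mean",
    "ok",
    "I want to try that because it sounds useful",
    "whatever",
]
engaged = [1, 0, 1, 0]

X = [cue_features(u) for u in utterances]
model = LogisticRegression().fit(X, engaged)
print(model.predict([cue_features("I plan to try this because I want to know more")]))
```
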
AQuA: Automated Question-Answering in Software Tutorial Videos with Visual Anchors
Saelyne Yang et al. CHI 2024. Autodesk Research, School of Computing, KAIST. Topics: Human-LLM Collaboration; Online Learning & MOOC Platforms.
Tutorial videos are a popular help source for learning feature-rich software. However, getting quick answers to questions about tutorial videos is difficult. We present an automated approach for responding to tutorial questions. By analyzing 633 questions found in 5,944 video comments, we identified different question types and observed that users frequently described parts of the video in questions. We then asked participants (N=24) to watch tutorial videos and ask questions while annotating the video with relevant visual anchors. Most visual anchors referred to UI elements and the application workspace. Based on these insights, we built AQuA, a pipeline that generates useful answers to questions with visual anchors. We demonstrate this for Fusion 360, showing that we can recognize UI elements in visual anchors and generate answers using GPT-4 augmented with that visual information and software documentation. An evaluation study (N=16) demonstrates that our approach provides better answers than baseline methods.

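A schematic of the kind of pipeline described, with the UI-element recognition, documentation lookup, and answer generation left as hypothetical stubs; this is not the AQuA implementation, and the prompt wording is an assumption.

```python
# Illustrative sketch of a visually anchored question-answering pipeline.
def recognize_ui_elements(anchor_image_path: str) -> list[str]:
    """Hypothetical stub: return names of UI elements visible in the anchor region."""
    return ["Extrude button", "Sketch toolbar"]

def retrieve_documentation(ui_elements: list[str]) -> str:
    """Hypothetical stub: look up relevant passages from the software documentation."""
    return "Extrude: converts a closed sketch profile into a 3D body."

def build_answer_prompt(question: str, anchor_image_path: str) -> str:
    """Compose an LLM prompt augmented with visual and documentation context."""
    elements = recognize_ui_elements(anchor_image_path)
    docs = retrieve_documentation(elements)
    return (
        f"A viewer asked about a software tutorial video: {question}\n"
        f"The question refers to a frame region containing: {', '.join(elements)}\n"
        f"Relevant documentation:\n{docs}\n"
        "Answer the question concisely, grounded in the documentation."
    )

print(build_answer_prompt("What does this button do?", "anchor_frame.png"))
```
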
SwitchSpace: Understanding Context-Aware Peeking Between VR and Desktop Interfaces
Johann Wentzel et al. CHI 2024. University of Waterloo. Topics: Mixed Reality Workspaces; Context-Aware Computing.
Cross-reality tasks, like creating or consuming virtual reality (VR) content, often involve inconvenient or distracting switches between desktop and VR. An initial formative study explores cross-reality switching habits, finding most switches are momentary "peeks" between interfaces, with specific habits determined by current context. The results inform a design space for context-aware "peeking" techniques that allow users to view or interact with desktop from VR, and vice versa, without fully switching. We implemented a set of peeking techniques and evaluated them in two levels of a cross-reality task: one requiring only viewing, and another requiring input and viewing. Peeking techniques made task completion faster, with increased input accuracy and reduced perceived workload.

GlucoMaker: Enabling Collaborative Customization of Glucose Monitors
Sabrina Lakhdhir et al. CHI 2024. University of Victoria. Topics: Chronic Disease Self-Management (Diabetes, Hypertension, etc.); Customizable & Personalized Objects.
Millions of individuals with diabetes use glucose monitors to track blood sugar levels. Research shows that such individuals seek to customize different aspects of their interactions with these devices, including how they engage with, decorate, and wear them. However, it remains challenging to tailor both device form and function to accommodate individual needs. To address this challenge, we introduce GlucoMaker, a system for collaboratively customizing physical design aspects of glucose monitors. Prior to designing GlucoMaker, we conducted a prototyping and focus group study to understand customization preferences and collaboration benefits. GlucoMaker provides individuals with the ability to a) select monitor form and function preferences, b) alter predefined and downloadable digital model files, c) receive feedback on monitor designs from stakeholders, and d) learn technical design aspects. We further demonstrate the versatility and design space of GlucoMaker with three examples of different form and function use cases.

TimeTunnel: Integrating Spatial and Temporal Motion Editing for Character Animation in Virtual Reality
Qian Zhou et al. CHI 2024. Autodesk Research. Topics: Immersion & Presence Research; 3D Modeling & Animation.
Editing character motion in Virtual Reality is challenging as it requires working with both spatial and temporal data using controls with multiple degrees of freedom. The spatial and temporal controls are separated, making it difficult to adjust poses over time and predict the effects across adjacent frames. To address this challenge, we propose TimeTunnel, an immersive motion editing interface that integrates spatial and temporal control for 3D character animation in VR. TimeTunnel provides an approachable editing experience via KeyPoses and Trajectories. KeyPoses are a set of representative poses automatically computed to concisely depict motion. Trajectories are 3D animation curves that pass through the joints of KeyPoses to represent in-betweens. TimeTunnel integrates spatial and temporal control by superimposing Trajectories and KeyPoses onto a 3D character. We conducted two studies to evaluate TimeTunnel. In our quantitative study, TimeTunnel reduced the time required to edit motion and the effort needed to locate target poses. Our qualitative study with domain experts showed that TimeTunnel is an approachable interface that can simplify motion editing while still preserving a direct representation of motion.

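The abstract does not specify how KeyPoses are computed; as one plausible illustration of picking representative poses from a motion clip, the sketch below greedily keeps the frames that deviate most from a linear interpolation of the poses already kept. This is an assumption for illustration, not TimeTunnel's algorithm.

```python
# Illustrative sketch: greedy selection of representative "key poses" from motion data.
import numpy as np

def select_keyposes(poses: np.ndarray, num_keyposes: int) -> list[int]:
    """poses: (num_frames, num_joints, 3) joint positions. Returns kept frame indices."""
    keep = [0, len(poses) - 1]                      # always keep the endpoints
    while len(keep) < num_keyposes:
        keep.sort()
        best_frame, best_err = None, -1.0
        for a, b in zip(keep, keep[1:]):            # scan each kept interval
            for f in range(a + 1, b):
                t = (f - a) / (b - a)
                interp = (1 - t) * poses[a] + t * poses[b]   # linear in-between
                err = float(np.linalg.norm(poses[f] - interp))
                if err > best_err:                  # frame worst explained by interpolation
                    best_frame, best_err = f, err
        if best_frame is None:
            break
        keep.append(best_frame)
    return sorted(keep)

# Example: a toy 2-joint motion with 60 frames of random-walk positions.
rng = np.random.default_rng(0)
toy_motion = np.cumsum(rng.normal(size=(60, 2, 3)), axis=0)
print(select_keyposes(toy_motion, 5))
```
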
DirectGPT: A Direct Manipulation Interface to Interact with Large Language Models
Damien Masson et al. CHI 2024. University of Waterloo. Topics: Human-LLM Collaboration; Explainable AI (XAI); AI-Assisted Decision-Making & Automation.
We characterize and demonstrate how the principles of direct manipulation can improve interaction with large language models. This includes: continuous representation of generated objects of interest; reuse of prompt syntax in a toolbar of commands; manipulable outputs to compose or control the effect of prompts; and undo mechanisms. This idea is exemplified in DirectGPT, a user interface layer on top of ChatGPT that works by transforming direct manipulation actions into engineered prompts. A study shows participants were 50% faster and relied on 50% fewer and 72% shorter prompts to edit text, code, and vector images compared to baseline ChatGPT. Our work contributes a validated approach to integrating LLMs into traditional software using direct manipulation. Data, code, and demo available at https://osf.io/3wt6s.

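To ground the idea of turning a direct-manipulation action into an engineered prompt, here is a minimal sketch; the data structures, prompt wording, and toolbar command are hypothetical and not DirectGPT's actual prompt engineering.

```python
# Illustrative sketch: a selection plus a reusable toolbar command becomes an LLM prompt.
from dataclasses import dataclass

@dataclass
class Selection:
    document: str   # full text the user is editing
    start: int      # selection start offset
    end: int        # selection end offset

def action_to_prompt(selection: Selection, command: str) -> str:
    """Compose an engineered prompt from a selection and a reusable command.

    `command` might be a previously issued prompt stored as a toolbar button,
    e.g. "translate to French" or "make this more formal" (assumed workflow).
    """
    selected = selection.document[selection.start:selection.end]
    return (
        "You are editing the following document:\n"
        f"{selection.document}\n\n"
        f"Apply this instruction ONLY to the selected span: {command}\n"
        f"Selected span: \"{selected}\"\n"
        "Return the full document with only that span changed."
    )

sel = Selection("The quick brown fox jumps over the lazy dog.", 4, 19)
print(action_to_prompt(sel, "make this more vivid"))
```
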
WorldSmith: A Multi-Modal Image Synthesis Tool for Fictional World Building
Hai Duong Dang et al. UIST 2023. Topics: Generative AI (Text, Image, Music, Video); AI-Assisted Creative Writing; Graphic Design & Typography Tools.
Crafting a rich and unique environment is crucial for fictional world-building, but can be difficult to achieve since illustrating a world from scratch requires time and significant skill. We investigate the use of recent multi-modal image generation systems to enable users to iteratively visualize and modify elements of their fictional world using a combination of text input, sketching, and region-based filling. WorldSmith enables novice world builders to quickly visualize a fictional world with layered edits and hierarchical compositions. Through a formative study (4 participants) and a first-use study (13 participants), we demonstrate that WorldSmith offers more expressive interactions with prompt-based models. With this work, we explore how creatives can be empowered to leverage prompt-based generative AI as a tool in their creative process, beyond current "click-once" prompting UI paradigms.

3DALL-E: Integrating Text-to-Image AI in 3D Design Workflows
Vivian Liu et al. DIS 2023. Topics: Generative AI (Text, Image, Music, Video); Customizable & Personalized Objects.
Text-to-image AI models are capable of generating novel images for inspiration, but their applications for 3D design workflows and how designers can build 3D models using AI-provided inspiration have not yet been explored. To investigate this, we integrated DALL-E, GPT-3, and CLIP within CAD software in 3DALL-E, a plugin that generates 2D image inspiration for 3D design. 3DALL-E allows users to construct text and image prompts based on what they are modeling. In a study with 13 designers, we found that designers saw great potential for 3DALL-E within their workflows and could use text-to-image AI to produce reference images, prevent design fixation, and inspire design considerations. We elaborate on prompting patterns observed across 3D modeling tasks and provide measures of prompt complexity observed across participants. From our findings, we discuss how 3DALL-E can merge with existing generative design workflows and propose prompt bibliographies as a form of human-AI design history.

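As a small illustration of constructing a text-to-image prompt from the designer's current modeling context (not the 3DALL-E plugin's actual prompt construction; the context fields, part names, and keywords are made up):

```python
# Illustrative sketch: composing an image-generation prompt from CAD context.
def build_inspiration_prompt(part_name: str, keywords: list[str],
                             existing_parts: list[str]) -> str:
    """Combine the part being modeled, style keywords, and surrounding parts."""
    context = f" to attach to {', '.join(existing_parts)}" if existing_parts else ""
    return f"Concept sketch of a {part_name}{context}, " + ", ".join(keywords)

print(build_inspiration_prompt("drone chassis",
                               ["lightweight", "aerodynamic", "carbon fiber"],
                               ["camera gimbal", "battery tray"]))
```
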
Immersive Sampling: Exploring Sampling for Future Creative Practices in Media-Rich, Immersive Spaces
Evgeny Stemasov et al. DIS 2023. Topics: Immersion & Presence Research; Interactive Narrative & Immersive Storytelling.
Creative practitioners rely on sampling to understand, explore, and construct problems, or to gather resources for later use. Although practitioners can now experience immersive environments, sampling from them remains limited to primarily visual captures (e.g., screenshots, videos), which overlook the richness and variety of available media. To address these challenges, we describe "Immersive Sampling" as a new way to frame information gathering in the context of immersive environments. In Immersive Sampling, practitioners experience immersive environments while capturing, organizing, revisiting, and remixing found content. We situate this subset of tasks in the literature and argue for its importance for emerging, future content creation domains. To further explore how Immersive Sampling might take place, we created VRicolage, a proof-of-concept prototype showcasing a set of interactions in Virtual Reality to sample, revisit, and remix captures. Given the democratization of immersive environments, Immersive Sampling provides practitioners with a means to collect, revisit, and remix digital materials.

Tesseract: Querying Spatial Design Recordings by Manipulating Worlds in Miniature
Karthik Mahadevan et al. CHI 2023. University of Toronto. Topics: Mixed Reality Workspaces; Computational Methods in HCI.
New immersive 3D design tools enable the creation of spatial design recordings, capturing collaborative design activities. By reviewing captured spatial design sessions, which include user activities, workflows, and tool use, users can reflect on their own design processes, learn new workflows, and understand others' design rationale. However, finding interesting moments in design activities can be challenging: they contain multimodal data (such as user motion and logged events) occurring over time which can be difficult to specify when searching, and are typically distributed over many sessions or recordings. We present Tesseract, a Worlds-in-Miniature-based system to expressively query VR spatial design recordings. Tesseract consists of the Search Cube interface acting as a centralized stage-to-search container, and four querying tools for specifying multimodal data to enable users to find interesting moments in past design activities. We studied ten participants who used Tesseract and found support for our miniature-based stage-to-search approach.

Escapement: A Tool for Interactive Prototyping with Video via Sensor-Mediated Abstraction of Time
Molly Jane Nicholas et al. CHI 2023. UC Berkeley. Topics: Teleoperation & Telepresence; Prototyping & User Testing.
We present Escapement, a video prototyping tool that introduces a powerful new concept for prototyping screen-based interfaces by flexibly mapping sensor values to dynamic playback control of videos. This recasts the time dimension of video mock-ups as sensor-mediated interaction. This abstraction of time as interaction, which we dub video-escapement prototyping, empowers designers to rapidly explore and viscerally experience direct touch or sensor-mediated interactions across one or more device displays. Our system affords cross-device and bidirectional remote (tele-present) experiences via cloud-based state sharing across multiple devices. This makes Escapement especially potent for exploring multi-device, dual-screen, or remote-work interactions for screen-based applications. We introduce the core concept of sensor-mediated abstraction of time for quickly generating video-based interactive prototypes of screen-based applications, share the results of observations of long-term usage of video-escapement techniques with experienced interaction designers, and articulate design choices for supporting a reflective, iterative, and open-ended creative design process.

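The core mapping, as a minimal sketch: a sensor reading is normalized and mapped to a playback timestamp in the prototype video. The linear mapping and the sensor range are assumptions for illustration, not Escapement's implementation.

```python
# Illustrative sketch: mapping a live sensor value to a video playback position.
def sensor_to_timestamp(sensor_value: float,
                        sensor_min: float, sensor_max: float,
                        video_duration_s: float) -> float:
    """Linearly map a sensor reading onto a video timestamp (seconds)."""
    t = (sensor_value - sensor_min) / (sensor_max - sensor_min)
    t = max(0.0, min(1.0, t))          # clamp out-of-range readings
    return t * video_duration_s

# Example: a 0-1023 rotation sensor scrubbing a 12-second interaction mock-up.
print(sensor_to_timestamp(512, 0, 1023, 12.0))  # roughly 6.0 s
```
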
Beyond Instructions: A Taxonomy of Information Types in How-to Videos
Saelyne Yang et al. CHI 2023. School of Computing, KAIST. Topics: Interactive Data Visualization; Data Storytelling.
How-to videos are rich in information: they not only give instructions but also provide justifications or descriptions. People seek different information to meet their needs, and identifying different types of information present in the video can improve access to the desired knowledge. Thus, we present a taxonomy of information types in how-to videos. Through an iterative open coding of 4k sentences in 48 videos, 21 information types under 8 categories emerged. The taxonomy represents diverse information types that instructors provide beyond instructions. We first show how our taxonomy can serve as an analytical framework for video navigation systems. Then, we demonstrate through a user study (n=9) how type-based navigation helps participants locate the information they needed. Finally, we discuss how the taxonomy enables a wide range of video-related tasks, such as video authoring, viewing, and analysis. To allow researchers to build upon our taxonomy, we release a dataset of 120 videos containing 9.9k sentences labeled using the taxonomy.

AvatAR: An Immersive Analysis Environment for Human Motion Data Combining Interactive 3D Avatars and Trajectories
Patrick Reipschläger et al. CHI 2022. Autodesk Research, Technische Universität Dresden. Topics: Human Pose & Activity Recognition; Social & Collaborative VR; AR Navigation & Context Awareness.
Analysis of human motion data can reveal valuable insights about the utilization of space and interaction of humans with their environment. To support this, we present AvatAR, an immersive analysis environment for the in-situ visualization of human motion data, that combines 3D trajectories, virtual avatars of people's movement, and a detailed representation of their posture. Additionally, we describe how to embed visualizations directly into the environment, showing what a person looked at or what surfaces they touched, and how the avatar's body parts can be used to access and manipulate those visualizations. AvatAR combines an AR HMD with a tablet to provide both mid-air and touch interaction for system control, as well as an additional overview to help users navigate the environment. We implemented a prototype and present several scenarios to show that AvatAR can enhance the analysis of human motion data by making data not only explorable, but experienceable.