FlowRing: Integrated Microgesture and Surface Interaction Ring for Versatile XR Input
As Extended Reality (XR) advances, a single device may be used across contexts, from immersive productivity at a desk to on-the-go public scenarios. Existing input solutions lack the versatility to provide both high-throughput, mouse-grade input and subtle, ergonomic interaction. We introduce FlowRing, a novel ring-form device that combines microgestures with precise 2D mouse-like input on surfaces. FlowRing supports five microgestures for discreet interaction and 2D input for richer tasks, using an optical flow sensor, skin-contact microphone, and IMU at the base of the finger. In a study with 11 participants, FlowRing achieved 93.6% microgesture recognition accuracy across sessions and 85.2% across unseen users, rising to 90.1% with just four examples of the gesture set from a new user. A separate 2D Fitts' law study demonstrated its effectiveness for continuous input on various surfaces. FlowRing emerges as a versatile, user-friendly solution for the future of interactive technology.
(MobileHCI 2025 · Ishan Chatterjee et al. · Tags: Hand Gesture Recognition; Foot & Wrist Interaction)

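For readers unfamiliar with how such pointing studies are scored, here is a minimal sketch of the standard Shannon formulation of Fitts' law (index of difficulty and throughput). It is illustrative only; the paper's actual analysis pipeline, target geometry, and trial format are not reproduced here.

```python
import math

def fitts_id(distance: float, width: float) -> float:
    """Shannon formulation of Fitts' index of difficulty (bits)."""
    return math.log2(distance / width + 1)

def throughput(trials) -> float:
    """Average throughput (bits/s) over (distance, width, movement_time) trials."""
    return sum(fitts_id(d, w) / mt for d, w, mt in trials) / len(trials)

# Hypothetical pointing trials: (distance px, target width px, movement time s)
trials = [(300, 40, 0.9), (600, 40, 1.3), (600, 20, 1.6)]
print(f"throughput = {throughput(trials):.2f} bits/s")
```
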
Unknown Word Detection for English as a Second Language (ESL) Learners using Gaze and Pre-trained Language Models
English as a Second Language (ESL) learners often encounter unknown words that hinder their text comprehension. Automatically detecting these words as users read can enable computing systems to provide just-in-time definitions, synonyms, or contextual explanations, thereby helping users learn vocabulary in a natural and seamless manner. This paper presents EyeLingo, a transformer-based machine learning method that predicts the probability of unknown words based on text content and eye gaze trajectory in real time with high accuracy. A 20-participant user study revealed that our method achieves an accuracy of 97.6% and an F1-score of 71.1%. We implemented a real-time reading assistance prototype to show the effectiveness of EyeLingo. The user study shows improvement in willingness to use and usefulness compared to baseline methods.
(CHI 2025 · Jiexin Ding et al. · Tsinghua University, Key Laboratory of Pervasive Computing, Ministry of Education, Department of Computer Science and Technology, Global Innovation Exchange (GIX) Institute; University of Washington, Paul G. Allen School of Computer Science & Engineering · Tags: Human Pose & Activity Recognition; Human-LLM Collaboration)

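EyeLingo's exact architecture is not detailed in the abstract; the sketch below only illustrates the general idea of fusing pre-trained language model token embeddings with per-word gaze features to score unknown-word probability. The dimensions, feature choices, and module names here are all assumptions.

```python
import torch
import torch.nn as nn

class GazeWordClassifier(nn.Module):
    """Hypothetical fusion head: per-token text embeddings plus gaze features
    produce a probability that the token is unknown to the reader."""
    def __init__(self, text_dim=768, gaze_dim=6, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + gaze_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, text_emb, gaze_feats):
        # text_emb: (batch, tokens, text_dim) from a pre-trained LM
        # gaze_feats: (batch, tokens, gaze_dim), e.g. fixation count/duration, regressions
        x = torch.cat([text_emb, gaze_feats], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)  # per-token unknown-word probability

model = GazeWordClassifier()
probs = model(torch.randn(1, 12, 768), torch.randn(1, 12, 6))  # placeholder inputs
```
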
The Odyssey Journey: Top-Tier Medical Resource Seeking for Specialized Disorder in China
It is pivotal for patients to receive accurate health information, diagnoses, and timely treatments. However, in China, the significantly imbalanced doctor-to-patient ratio intensifies the information and power asymmetries in doctor-patient relationships. Health information-seeking, which enables patients to collect information from sources beyond doctors, is a potential approach to mitigating these asymmetries. While HCI research predominantly focuses on common chronic conditions, our study focuses on specialized disorders, which are often familiar to specialists but not to general practitioners or the public. Taking Hemifacial Spasm (HFS) as an example, we aim to understand patients' health information and top-tier medical resource seeking journeys in China. Through interviews with three neurosurgeons and 12 HFS patients from rural and urban areas, and applying Actor-Network Theory, we provide empirical insights into the roles, interactions, and workflows of various actors in the health information-seeking network. We also identify five strategies patients adopted to mitigate asymmetries and access top-tier medical resources, illustrating these strategies as subnetworks within the broader health information-seeking network and outlining their advantages and challenges.
(CHI 2025 · Ka I Chan et al. · Tsinghua University, Global Innovation Exchange · Tags: Chronic Disease Self-Management (Diabetes, Hypertension, etc.); Telemedicine & Remote Patient Monitoring)

VAction: A Lightweight and Integrated VR Training System for Authentic Film-Shooting Experience
The film industry exerts significant economic and cultural influence, and its rapid development depends on the expertise of industry professionals, underscoring the critical importance of film-shooting education. However, such training typically requires repeated practice in complex professional venues with expensive equipment, a significant obstacle for ordinary learners who struggle to access these environments. Although VR technology has already shown its potential in education, existing research has not addressed the crucial learning component of replicating the shooting process, and the limited functionality of traditional controllers hinders the fulfillment of these educational requirements. We therefore developed VAction, a VR system that combines high-fidelity virtual environments with a custom-designed controller to simulate the real-world camera operation experience. The system's lightweight design ensures cost-effective and efficient deployment. Experiment results demonstrated that VAction significantly outperforms traditional methods in both practice effectiveness and user experience, indicating its potential and usefulness in film-shooting education.
(CHI 2025 · Shaocong Wang et al. · Tsinghua University, Department of Computer Science and Technology · Tags: Mixed Reality Workspaces; Home Energy Management)

Actual Achieved Gain and Optimal Perceived Gain: Modeling Human Take-over Decisions Towards Automated Vehicles' Suggestions
Driver decision quality in take-overs is critical for effective human-Autonomous Driving System (ADS) collaboration. However, current research lacks detailed analysis of its variations. This paper introduces two metrics, Actual Achieved Gain (AAG) and Optimal Perceived Gain (OPG), to assess decision quality, with OPG representing optimal decisions and AAG reflecting actual outcomes. Both are calculated as weighted averages of perceived gains and losses, influenced by ADS accuracy. Study 1 (N=315) used a 21-point Thurstone scale to measure perceived gains and losses (the key components of AAG and OPG) across typical tasks: route selection, overtaking, and collision avoidance. Studies 2 (N=54) and 3 (N=54) modeled decision quality under varying ADS accuracy and decision time. Results show that with sufficient time (>3.5 s), AAG converges towards OPG, indicating rational decision-making, while limited time leads to intuitive and deterministic choices. Study 3 also linked AAG-OPG deviations to irrational behaviors. An intervention study (N=8) and a pilot (N=4) employing voice alarms and multi-modal alarms based on these deviations demonstrated AAG's potential to improve decision quality.
(CHI 2025 · Haihua Zhang et al. · Tsinghua University, Institute for Network Sciences and Cyberspace · Tags: Automated Driving Interface & Takeover Design; Head-Up Display (HUD) & Advanced Driver Assistance Systems (ADAS); AI-Assisted Decision-Making & Automation)

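The abstract defines AAG and OPG as weighted averages of perceived gains and losses under ADS accuracy. Below is a toy calculation of that idea only; the paper's exact formulation and its Thurstone-scaled values are not given here, so all numbers are made up.

```python
def perceived_gain(p_ads_correct, gain_if_correct, loss_if_wrong):
    """Weighted average of a decision's perceived gain (when the ADS suggestion
    is correct) and perceived loss (when it is wrong), weighted by ADS accuracy."""
    return p_ads_correct * gain_if_correct + (1 - p_ads_correct) * loss_if_wrong

p = 0.85                                        # assumed ADS accuracy
g_accept = perceived_gain(p, 10.0, -8.0)        # expected gain of following the suggestion
g_reject = perceived_gain(p, -6.0, 5.0)         # expected gain of overriding it
opg = max(g_accept, g_reject)                   # OPG: gain of the optimal decision
aag = g_reject                                  # AAG: gain of the decision actually made
print(f"OPG={opg:.2f}  AAG={aag:.2f}  deviation={opg - aag:.2f}")
```
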
Modeling the Impact of Visual Stimuli on Redirection Noticeability with Gaze Behavior in Virtual Reality
While users can embody virtual avatars that mirror their physical movements in Virtual Reality, these avatars' motions can be redirected to enable novel interactions. Excessive redirection, however, can break the user's sense of embodiment due to perceptual conflicts between vision and proprioception. While prior work focused on avatar-related factors influencing the noticeability of redirection, we investigate how the visual stimuli in the surrounding virtual environment affect user behavior and, in turn, the noticeability of redirection. Given the wide variety of visual stimuli and their tendency to elicit varying individual reactions, we propose to use users' gaze behavior as an indicator of their response to the stimuli and to model the noticeability of redirection. We conducted two user studies to collect users' gaze behavior and noticeability, investigating the relationship between them and identifying the most effective gaze behavior features for predicting noticeability. Based on the data, we developed a regression model that takes users' gaze behavior as input and outputs the noticeability of redirection. We then conducted an evaluation study to test our model on unseen visual stimuli, achieving a mean squared error (MSE) of 0.012. We further implemented an adaptive redirection technique and conducted a proof-of-concept study to evaluate its effectiveness with complex visual stimuli in two applications. The results indicated that participants experienced lower physical demand and a stronger sense of body ownership when using our adaptive technique, demonstrating the potential of our model to support real-world use cases.
(CHI 2025 · Zhipeng Li et al. · ETH Zürich, Department of Computer Science · Tags: Eye Tracking & Gaze Interaction; Mixed Reality Workspaces; Immersion & Presence Research)

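The regression model itself is not specified in the abstract; below is a minimal sketch of the general setup, predicting a continuous noticeability score from per-trial gaze features and scoring it with MSE. The feature names, model family, and placeholder data are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Placeholder per-trial gaze features (e.g. fixation duration, saccade rate,
# dwell on the redirected hand); the paper's actual feature set differs.
X_train, y_train = rng.random((200, 5)), rng.random(200)   # noticeability in [0, 1]
X_test, y_test = rng.random((50, 5)), rng.random(50)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
print("MSE:", mean_squared_error(y_test, model.predict(X_test)))
```
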
UbiPhysio: Support Daily Functioning, Fitness, and Rehabilitation with Action Understanding and Feedback in Natural Language
Wang et al. developed the UbiPhysio system, which supports users' daily functional exercise, fitness, and rehabilitation training through action understanding and natural-language feedback.
(UbiComp 2024 · Chongyang Wang et al. · Tags: Vibrotactile Feedback & Skin Stimulation; Full-Body Interaction & Embodied Input)

The EarSAVAS Dataset: Enabling Subject-Aware Vocal Activity Sensing on Earables
Zhang et al. built the EarSAVAS dataset to enable subject-aware vocal activity sensing on smart earables and to advance related algorithm research.
(UbiComp 2024 · Xiyuxing Zhang et al. · Tags: Biosensors & Physiological Monitoring)

ReHEarSSE: Recognizing Hidden-in-the-Ear Silently Spelled Expressions
Silent speech interaction (SSI) allows users to discreetly input text without using their hands. Existing wearable SSI systems typically require custom devices and are limited to a small lexicon, restricting their utility to a small set of command words. This work proposes ReHEarSSE, an earbud-based ultrasonic SSI system capable of generalizing to words that do not appear in its training dataset, providing support for nearly an entire dictionary's worth of words. As a user silently spells words, ReHEarSSE uses autoregressive features to identify subtle changes in ear canal shape. It infers words using a deep learning model trained to optimize connectionist temporal classification (CTC) loss with an intermediate embedding that accounts for individual letters and the transitions between them. We find that ReHEarSSE recognizes 100 unseen words with an accuracy of 89.3%.
(CHI 2024 · Xuefu Dong et al. · The University of Tokyo · Tags: Electrical Muscle Stimulation (EMS); Augmentative & Alternative Communication (AAC))

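The abstract names CTC training over spelled-letter sequences as the core of the recognizer. The sketch below shows that component in generic form, assuming a recurrent encoder over echo-feature frames; ReHEarSSE's actual network, features, and intermediate-embedding loss are not reproduced here.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 27  # 26 letters + CTC blank (index 0)

class SpellEncoder(nn.Module):
    """Toy frame encoder: per-frame letter logits for CTC training."""
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, NUM_CLASSES)

    def forward(self, x):                       # x: (batch, frames, feat_dim)
        h, _ = self.rnn(x)
        return self.head(h).log_softmax(-1)     # (batch, frames, classes)

model, ctc = SpellEncoder(), nn.CTCLoss(blank=0)
x = torch.randn(2, 100, 64)                     # two placeholder echo-feature sequences
targets = torch.randint(1, 27, (2, 8))          # spelled letters, indices 1..26
log_probs = model(x).transpose(0, 1)            # CTC expects (frames, batch, classes)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 100, dtype=torch.long),
           target_lengths=torch.full((2,), 8, dtype=torch.long))
loss.backward()
```
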
PepperPose: Full-Body Pose Estimation with a Companion Robot
Accurate full-body pose estimation across diverse actions in a user-friendly and location-agnostic manner paves the way for interactive applications in realms like sports, fitness, and healthcare. This task becomes challenging in real-world scenarios due to factors like the user's dynamic positioning, the diversity of actions, and the varying acceptability of the pose-capturing system. In this context, we present PepperPose, a novel companion robot system tailored for optimized pose estimation. Unlike traditional methods, PepperPose actively tracks the user and refines its viewpoint, facilitating enhanced pose accuracy across different locations and actions. This allows users to enjoy a seamless action-sensing experience. Our evaluation, involving 30 participants undertaking daily functioning and exercise actions in a home-like space, underscores the robot's promising capabilities. Moreover, we demonstrate the opportunities that PepperPose presents for human-robot interaction, its current limitations, and future developments.
(CHI 2024 · Chongyang Wang et al. · Tsinghua University · Tags: Human Pose & Activity Recognition; Human-Robot Collaboration (HRC))

Time2Stop: Adaptive and Explainable Human-AI Loop for Smartphone Overuse Intervention
Despite a rich history of investigating smartphone overuse intervention techniques, AI-based just-in-time adaptive intervention (JITAI) methods for overuse reduction are lacking. We develop Time2Stop, an intelligent, adaptive, and explainable JITAI system that leverages machine learning to identify optimal intervention timings, introduces interventions with transparent AI explanations, and collects user feedback to establish a human-AI loop and adapt the intervention model over time. We conducted an 8-week field experiment (N=71) to evaluate the effectiveness of both the adaptation and explanation aspects of Time2Stop. Our results indicate that our adaptive models significantly outperform the baseline methods on intervention accuracy (>32.8% relatively) and receptivity (>8.0%). In addition, incorporating explanations further enhances effectiveness by 53.8% and 11.4% on accuracy and receptivity, respectively. Moreover, Time2Stop significantly reduces overuse, decreasing app visit frequency by 7.0–8.9%. Our subjective data also echoed these quantitative measures. Participants preferred the adaptive interventions and rated the system highly on intervention time accuracy, effectiveness, and level of trust. We envision this work can inspire future research on JITAI systems with a human-AI loop that evolve with users.
(CHI 2024 · Adiba Orzikulova et al. · KAIST · Tags: Explainable AI (XAI); AI-Assisted Decision-Making & Automation; Notification & Interruption Management)

MMTSA: Multi-Modal Temporal Segment Attention Network for Efficient Human Activity Recognition
Multimodal sensors provide complementary information for developing accurate machine-learning methods for human activity recognition (HAR), but introduce significantly higher computational load, which reduces efficiency. This paper proposes an efficient multimodal neural architecture for HAR using an RGB camera and inertial measurement units (IMUs), called the Multimodal Temporal Segment Attention Network (MMTSA). MMTSA first transforms IMU sensor data into a temporal and structure-preserving gray-scale image using the Gramian Angular Field (GAF), representing the inherent properties of human activities. MMTSA then applies a multimodal sparse sampling method to reduce data redundancy. Lastly, MMTSA adopts an inter-segment attention module for efficient multimodal fusion. Using three well-established public datasets, we evaluated MMTSA's effectiveness and efficiency in HAR. Results show that our method achieves superior performance (an 11.13% cross-subject F1-score improvement on the MMAct dataset) over previous state-of-the-art (SOTA) methods. The ablation study and analysis confirm MMTSA's effectiveness in fusing multimodal data for accurate HAR. An efficiency evaluation on an edge device showed that MMTSA achieved significantly better accuracy, lower computational load, and lower inference latency than SOTA methods. https://doi.org/10.1145/3610872
(UbiComp 2023 · Ziqi Gao et al. · Tags: Human Pose & Activity Recognition)

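The Gramian Angular Field step mentioned above is a standard time-series-to-image encoding. A minimal sketch of the textbook GAF follows; MMTSA's exact preprocessing (normalization, channel stacking, image size) may differ.

```python
import numpy as np

def gramian_angular_field(x: np.ndarray) -> np.ndarray:
    """Gramian Angular (Summation) Field: encode a 1D series as a 2D image
    that preserves temporal structure via pairwise angular sums."""
    # Rescale to [-1, 1] so arccos is defined
    x = 2 * (x - x.min()) / (x.max() - x.min() + 1e-8) - 1
    phi = np.arccos(np.clip(x, -1.0, 1.0))       # polar-coordinate angles
    return np.cos(phi[:, None] + phi[None, :])   # (len(x), len(x)) gray-scale image

imu_axis = np.sin(np.linspace(0, 4 * np.pi, 128))  # stand-in for one IMU channel
img = gramian_angular_field(imu_axis)              # 128 x 128 image for the CNN branch
```
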
DRG-Keyboard: Enabling Subtle Gesture Typing on the Fingertip with Dual IMU Rings
We present DRG-Keyboard, a gesture keyboard enabled by dual IMU rings, allowing the user to swipe the thumb on the index fingertip to perform word gesture typing as if typing on a miniature QWERTY keyboard. With dual IMUs attached to the user's thumb and index finger, DRG-Keyboard can 1) measure the relative attitude while mapping it to 2D fingertip coordinates and 2) detect the thumb's touch-down and touch-up events by combining the relative attitude data with synchronous frequency-domain data, based on which a fingertip gesture keyboard can be implemented. To understand users' typing behavior on the index fingertip with DRG-Keyboard, we collected and analyzed user data for two typing manners. Based on the statistics of the gesture data, we enhanced the elastic matching algorithm with rigid pruning and a distance measurement transform. The user study showed DRG-Keyboard achieved an input speed of 12.9 WPM (68.3% of participants' gesture typing speed on the smartphone). A follow-up study also demonstrated DRG-Keyboard's superiority in form factor and breadth of usage scenarios. In sum, DRG-Keyboard not only achieves a good text entry speed on a tiny fingertip input surface, but is also well accepted by participants for its subtleness, accuracy, good haptic feedback, and availability. https://doi.org/10.1145/3569463
(UbiComp 2023 · Chen Liang et al. · Tags: Vibrotactile Feedback & Skin Stimulation; Haptic Wearables; Hand Gesture Recognition)

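As a rough illustration of the gesture-keyboard decoding idea (matching the swiped path against each lexicon word's ideal key-to-key path), here is a minimal elastic-matching sketch using plain dynamic time warping. The key layout and resampling are simplified stand-ins, and DRG-Keyboard's enhancements (rigid pruning, distance measurement transform) are not reproduced.

```python
import numpy as np

# Toy key layout: letter -> (x, y) grid position (a stand-in for real QWERTY geometry)
KEYS = {c: ((ord(c) - 97) % 10, (ord(c) - 97) // 10) for c in "abcdefghijklmnopqrstuvwxyz"}

def word_template(word, n=32):
    """Resample the ideal key-to-key path of a word to n points."""
    pts = np.array([KEYS[c] for c in word], dtype=float)
    seg = np.linspace(0, 1, len(pts))
    t = np.linspace(0, 1, n)
    return np.column_stack([np.interp(t, seg, pts[:, k]) for k in (0, 1)])

def elastic_distance(a, b):
    """Dynamic time warping between two 2D paths: the core of elastic matching."""
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[-1, -1] / (len(a) + len(b))

def decode(gesture, lexicon):
    """Return the lexicon word whose template best matches the swiped path."""
    return min(lexicon, key=lambda w: elastic_distance(gesture, word_template(w)))

print(decode(word_template("hello"), ["hello", "world", "helm"]))  # -> "hello"
```
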
Modeling the Trade-off of Privacy Preservation and Activity Recognition on Low-Resolution Images
A computer vision system using low-resolution image sensors can provide intelligent services (e.g., activity recognition) while avoiding the capture of unnecessary visual privacy information at the hardware level. However, preserving visual privacy and enabling accurate machine recognition place adversarial demands on image resolution. Modeling the trade-off between privacy preservation and machine recognition performance can guide future privacy-preserving computer vision systems that use low-resolution image sensors. In this paper, using at-home activities of daily living (ADLs) as the scenario, we first obtained the most important visual privacy features through a user survey. We then quantified and analyzed the effects of image resolution on human and machine recognition performance in activity recognition and privacy awareness tasks. We also investigated how modern image super-resolution techniques influence these effects. Based on the results, we propose a method for modeling the trade-off of privacy preservation and activity recognition on low-resolution images.
(CHI 2023 · Yuntao Wang et al. · Tsinghua University · Tags: Human Pose & Activity Recognition; Privacy Perception & Decision-Making)

Enabling Voice-Accompanying Hand-to-Face Gesture Recognition with Cross-Device Sensing
Gestures performed while speaking are essential for voice interaction, conveying complementary semantics for interaction purposes such as wake-up state and input modality. In this paper, we investigate voice-accompanying hand-to-face (VAHF) gestures for voice interaction. We target hand-to-face gestures because such gestures relate closely to speech and yield significant acoustic features (e.g., impeding voice propagation). We conducted a user study to explore the design space of VAHF gestures, in which we first gathered candidate gestures and then applied a structural analysis to them along different dimensions (e.g., contact position and type), yielding a total of 8 VAHF gestures with good usability and minimal confusion. To facilitate VAHF gesture recognition, we propose a novel cross-device sensing method that leverages heterogeneous channels (vocal, ultrasound, and IMU) of data from commodity devices (earbuds, watches, and rings). Our recognition model achieved an accuracy of 97.3% for recognizing 3 gestures and 91.5% for recognizing 8 gestures (excluding the "empty" gesture), demonstrating high applicability. Quantitative analysis also sheds light on the recognition capability of each sensor channel and their different combinations. Finally, we illustrate feasible use cases and their design principles to demonstrate the applicability of our system in various scenarios.
(CHI 2023 · Zisu Li et al. · The Hong Kong University of Science and Technology · Tags: Hand Gesture Recognition; Voice User Interface (VUI) Design)

Color-to-Depth Mappings as Depth Cues in Virtual Reality
Despite significant improvements to Virtual Reality (VR) technologies, most VR displays are fixed-focus, and depth perception remains a key issue that limits the user experience and interaction performance. To supplement humans' inherent depth cues (e.g., retinal blur, motion parallax), we investigate users' perceptual mappings of distance to virtual objects' appearance to generate visual cues aimed at enhancing depth perception. As a first step, we explore color-to-depth mappings for virtual objects so that their appearance differs in saturation and value to reflect their distance. Through a series of controlled experiments, we elicit and analyze users' strategies of mapping a virtual object's hue, saturation, value, and a combination of saturation and value to its depth. Based on the collected data, we implement a computational model that generates color-to-depth mappings fulfilling adjustable requirements on confusion probability, number of depth levels, and consistent saturation/value changing tendency. We demonstrate the effectiveness of color-to-depth mappings in a 3D sketching task, showing that compared to single-colored targets and strokes, with our mappings, users were more confident in their accuracy without extra cognitive load and reduced the perceived depth error by 60.8%. We also implement four VR applications and demonstrate how our color cues can benefit the user experience and interaction performance in VR.
(UIST 2022 · Zhipeng Li et al. · Tags: Immersion & Presence Research; Medical & Scientific Data Visualization)

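As an illustration of the idea (not the paper's fitted model), the sketch below linearly maps an object's depth to HSV saturation and value so that nearer objects render more saturated and brighter. The depth range and coefficients are arbitrary assumptions.

```python
import colorsys

def depth_to_color(depth, d_min=0.5, d_max=5.0, hue=0.6):
    """Map depth (meters) to an RGB color whose saturation and value
    fade with distance; a toy linear stand-in for a fitted mapping."""
    t = max(0.0, min(1.0, (depth - d_min) / (d_max - d_min)))  # 0 = near, 1 = far
    saturation = 1.0 - 0.7 * t   # desaturate with distance
    value = 1.0 - 0.6 * t        # darken with distance
    return colorsys.hsv_to_rgb(hue, saturation, value)

for d in (0.5, 2.0, 5.0):
    print(d, tuple(round(c, 2) for c in depth_to_color(d)))
```
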
FaceOri: Tracking Head Position and Orientation Using Ultrasonic Ranging on Earphones
Face orientation can often indicate a user's intended interaction target. In this paper, we propose FaceOri, a novel face tracking technique based on acoustic ranging using earphones. FaceOri leverages the speaker on a commodity device to emit an ultrasonic chirp, which is picked up by the set of microphones on the user's earphones and then processed to calculate the distance from each microphone to the device. These measurements are used to derive the user's face orientation and distance with respect to the device. We conduct a ground truth comparison and a user study to evaluate FaceOri's performance. The results show that the system can determine whether the user is oriented toward the device with 93.5% accuracy within a 1.5 m range. Furthermore, FaceOri can continuously track the user's head orientation with a median absolute error of 10.9 mm in distance, 3.7° in yaw, and 5.8° in pitch. FaceOri enables convenient hands-free control of devices and more intelligent context-aware interaction.
(CHI 2022 · Yuntao Wang et al. · Tsinghua University · Tags: Eye Tracking & Gaze Interaction; Context-Aware Computing)

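The geometry behind such orientation estimates can be sketched with a far-field approximation: the difference between the left-ear and right-ear distances, divided by the inter-ear baseline, gives the head's yaw relative to the device. FaceOri's full method also recovers distance and pitch; the baseline and time-of-flight numbers below are illustrative.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s at roughly 20 degrees C

def mic_distance(time_of_flight_s):
    """Chirp time of flight -> one-way distance from device speaker to one ear mic."""
    return time_of_flight_s * SPEED_OF_SOUND

def head_yaw_deg(d_left, d_right, ear_baseline=0.18):
    """Far-field approximation: path difference over the inter-ear baseline
    gives the head's yaw toward the device (positive = turned right)."""
    x = max(-1.0, min(1.0, (d_left - d_right) / ear_baseline))
    return math.degrees(math.asin(x))

print(head_yaw_deg(mic_distance(0.00292), mic_distance(0.00295)))  # about -3.3 degrees
```
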
TypeOut: Leveraging Just-in-Time Self-Affirmation for Smartphone Overuse Reduction
Smartphone overuse is related to a variety of issues such as lack of sleep and anxiety. We explore the application of Self-Affirmation Theory to smartphone overuse intervention in a just-in-time manner. We present TypeOut, a just-in-time intervention technique that integrates two components: an in-situ typing-based unlock process to improve user engagement, and self-affirmation-based typing content to enhance effectiveness. We hypothesize that the integration of typing and self-affirmation content can better reduce smartphone overuse. We conducted a 10-week within-subject field experiment (N=54) and compared TypeOut against two baselines: one only showing the self-affirmation content (a common notification-based intervention), and one only requiring typing non-semantic content (a state-of-the-art method). TypeOut reduces app usage by over 50%, and both app opening frequency and usage duration by over 25%, all significantly outperforming the baselines. TypeOut can potentially be used in other domains where an intervention may benefit from integrating self-affirmation exercises with an engaging just-in-time mechanism.
(CHI 2022 · Xuhai Xu et al. · University of Washington · Tags: Mental Health Apps & Online Support Communities; Notification & Interruption Management)

ReflecTrack: Enabling 3D Acoustic Position Tracking Using Commodity Dual-Microphone Smartphones
3D position tracking on smartphones has the potential to unlock a variety of novel applications, but has not been made widely available due to limitations in smartphone sensors. In this paper, we propose ReflecTrack, a novel 3D acoustic position tracking method for commodity dual-microphone smartphones. A ubiquitous speaker (e.g., on a smartwatch or earbud) generates inaudible Frequency Modulated Continuous Wave (FMCW) acoustic signals that are picked up by both smartphone microphones. To enable 3D tracking with two microphones, we introduce a reflective surface that can be easily found in everyday objects near the smartphone. The microphones can thus receive sound from the speaker and echoes from the surface for FMCW-based acoustic ranging. To simultaneously estimate the distances along the direct and reflective paths, we propose an echo-aware FMCW technique with a new signal pattern and target detection process. Our user study shows that ReflecTrack achieves a median error of 28.4 mm in a 60 cm × 60 cm × 60 cm space and 22.1 mm in a 30 cm × 30 cm × 30 cm space for 3D positioning. We demonstrate the easy accessibility of ReflecTrack using everyday surfaces and objects with several typical applications of 3D position tracking, including 3D input for smartphones, fine-grained gesture recognition, and motion tracking in smartphone-based VR systems.
(UIST 2021 · Yuzhou Zhuang et al. · Tags: Full-Body Interaction & Embodied Input; Biosensors & Physiological Monitoring)

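For context, FMCW acoustic ranging converts a propagation delay into a beat frequency after mixing the received chirp with the transmitted one. A minimal sketch of that textbook relation follows; the chirp parameters are arbitrary, and ReflecTrack's echo-aware signal pattern and target detection process are not reproduced.

```python
import numpy as np

C = 343.0                          # speed of sound, m/s
F0, BW, T = 18_000, 4_000, 0.02    # chirp start (Hz), bandwidth (Hz), duration (s)

def beat_to_distance(f_beat, round_trip=False):
    """FMCW ranging: a delay tau shifts the received chirp, producing a beat
    frequency f_beat = (BW / T) * tau after mixing (dechirping)."""
    tau = f_beat * T / BW
    d = C * tau
    return d / 2 if round_trip else d   # halve for a round-trip (echo) path

def dominant_beat(mixed, fs=48_000):
    """Pick the strongest beat frequency from the mixed (dechirped) signal."""
    spec = np.abs(np.fft.rfft(mixed * np.hanning(len(mixed))))
    return np.fft.rfftfreq(len(mixed), 1 / fs)[np.argmax(spec)]

# e.g. a 150 Hz beat over a synchronized one-way link:
print(f"{beat_to_distance(150):.3f} m")   # about 0.257 m
```
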
Auth+Track: Enabling Authentication Free Interaction on Smartphone by Continuous User Tracking
In this paper, we propose Auth+Track, a novel authentication model that aims to reduce redundant authentication in everyday smartphone usage. Through sparse authentication and continuous tracking of the user's status, Auth+Track eliminates the "gap" authentication between fragmented sessions and enables "Authentication Free when User is Around". To instantiate the Auth+Track model, we present PanoTrack, an implementation that integrates body and near-field hand information for user tracking. We install a fisheye camera on the top of the phone to achieve panoramic vision that captures both the user's body and their on-screen hand. Based on the captured video stream, we develop an algorithm pipeline to extract the key features for user tracking, including body keypoints and their temporal and spatial associations, near-field hand status, and features for user identity assignment. By analyzing system performance and user experience in real-life scenarios, we demonstrate that our system outperforms existing solutions.
(CHI 2021 · Chen Liang et al. · Tsinghua University · Tags: Human Pose & Activity Recognition; Passwords & Authentication)