Time2Stop: Adaptive and Explainable Human-AI Loop for Smartphone Overuse Intervention
Despite a rich history of investigating smartphone overuse intervention techniques, AI-based just-in-time adaptive intervention (JITAI) methods for overuse reduction are lacking. We develop Time2Stop, an intelligent, adaptive, and explainable JITAI system that leverages machine learning to identify optimal intervention timings, introduces interventions with transparent AI explanations, and collects user feedback to establish a human-AI loop and adapt the intervention model over time. We conducted an 8-week field experiment (N=71) to evaluate the effectiveness of both the adaptation and explanation aspects of Time2Stop. Our results indicate that our adaptive models significantly outperform the baseline methods on intervention accuracy (a relative improvement of more than 32.8%) and receptivity (more than 8.0%). Incorporating explanations further improves accuracy by 53.8% and receptivity by 11.4%. Moreover, Time2Stop significantly reduces overuse, decreasing app visit frequency by 7.0–8.9%. Our subjective data echoed these quantitative measures: participants preferred the adaptive interventions and rated the system highly on intervention timing accuracy, effectiveness, and level of trust. We envision this work inspiring future research on JITAI systems with a human-AI loop that evolve with their users.
2024 · Adiba Orzikulova et al. (KAIST) · CHI · Explainable AI (XAI); AI-Assisted Decision-Making & Automation; Notification & Interruption Management
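The human-AI loop the abstract describes, predict an intervention moment, show it, fold the user's reaction back into the model, can be sketched with an online classifier. This is a minimal illustration under assumed features and an assumed model choice, not the authors' implementation:

```python
# Minimal sketch of a human-AI loop for intervention timing: an online
# logistic model scores the current moment, and each piece of user feedback
# is used to adapt the model. Feature names are illustrative assumptions.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss", random_state=0)
classes = np.array([0, 1])  # 0 = do not intervene, 1 = intervene now

def should_intervene(features: np.ndarray, threshold: float = 0.5) -> bool:
    """Predict whether the current moment is a good intervention timing."""
    try:
        prob = model.predict_proba(features.reshape(1, -1))[0, 1]
    except Exception:  # model not fitted yet: fall back to a cold-start rule
        return True
    return prob >= threshold

def record_feedback(features: np.ndarray, accepted: bool) -> None:
    """Adapt the model with one labeled example from user feedback."""
    model.partial_fit(features.reshape(1, -1), [int(accepted)], classes=classes)

# Example context: (session length in min, hour of day, app visits last hour)
x = np.array([12.0, 22.0, 5.0])
record_feedback(x, accepted=True)   # user accepted a past intervention here
print(should_intervene(x))
```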
DRG-Keyboard: Enabling Subtle Gesture Typing on the Fingertip with Dual IMU Rings
We present DRG-Keyboard, a gesture keyboard enabled by dual IMU rings that lets the user swipe the thumb on the index fingertip to perform word-gesture typing, as if typing on a miniature QWERTY keyboard. With dual IMUs attached to the user's thumb and index finger, DRG-Keyboard can 1) measure the relative attitude and map it to 2D fingertip coordinates and 2) detect the thumb's touch-down and touch-up events by combining the relative attitude data with synchronous frequency-domain data, on which a fingertip gesture keyboard can be built. To understand users' typing behavior on the index fingertip with DRG-Keyboard, we collected and analyzed user data in two typing manners. Based on the statistics of the gesture data, we enhanced the elastic matching algorithm with rigid pruning and a distance measurement transform. The user study showed that DRG-Keyboard achieved an input speed of 12.9 WPM across all participants (68.3% of their gesture typing speed on a smartphone). A follow-up study further demonstrated DRG-Keyboard's advantages in form factor and its suitability for wider usage scenarios. In sum, DRG-Keyboard not only achieves good text entry speed on a tiny fingertip input surface but was also well received by participants for its subtleness, accuracy, good haptic feedback, and availability. https://doi.org/10.1145/3569463
2023 · Chen Liang et al. · UbiComp · Vibrotactile Feedback & Skin Stimulation; Haptic Wearables; Hand Gesture Recognition
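For readers unfamiliar with gesture-keyboard decoding, the elastic matching the abstract enhances can be sketched as DTW between a swipe trace and each word's key-center polyline, with endpoint pruning as a crude stand-in for "rigid pruning". The layout, pruning radius, and scoring below are illustrative assumptions, not the paper's enhanced algorithm:

```python
import numpy as np

ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (col + 0.5 * row, float(row))
           for row, line in enumerate(ROWS) for col, ch in enumerate(line)}

def resample(path, n=32):
    """Resample a polyline to n points spaced evenly by arc length."""
    path = np.asarray(path, dtype=float)
    keep = np.concatenate([[True], np.any(np.diff(path, axis=0) != 0, axis=1)])
    path = path[keep]  # drop zero-length segments (repeated keys)
    seg = np.linalg.norm(np.diff(path, axis=0), axis=1)
    d = np.concatenate([[0.0], np.cumsum(seg)])
    t = np.linspace(0.0, d[-1] if d[-1] > 0 else 1.0, n)
    return np.column_stack([np.interp(t, d, path[:, i]) for i in range(2)])

def dtw(a, b):
    """Elastic (DTW) matching cost between two resampled traces."""
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[-1, -1]

def decode(trace, lexicon, prune_radius=1.5):
    """Rank candidate words, pruning words whose first/last keys sit far
    from the trace endpoints."""
    trace = resample(trace)
    scored = []
    for word in lexicon:
        tmpl = resample([KEY_POS[c] for c in word])
        if (np.linalg.norm(trace[0] - tmpl[0]) > prune_radius or
                np.linalg.norm(trace[-1] - tmpl[-1]) > prune_radius):
            continue
        scored.append((dtw(trace, tmpl), word))
    return [w for _, w in sorted(scored)]

# An ideal swipe through h-e-l-l-o should rank "hello" first.
print(decode([KEY_POS[c] for c in "hello"], ["hello", "help", "world"]))
```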
Modeling the Trade-off of Privacy Preservation and Activity Recognition on Low-Resolution Images
A computer vision system using low-resolution image sensors can provide intelligent services (e.g., activity recognition) while excluding unnecessary visual privacy information at the hardware level. However, preserving visual privacy and enabling accurate machine recognition place adversarial demands on image resolution. Modeling this trade-off between privacy preservation and machine recognition performance can guide future privacy-preserving computer vision systems that use low-resolution image sensors. In this paper, taking at-home activities of daily living (ADLs) as the scenario, we first identified the most important visual privacy features through a user survey. We then quantified and analyzed the effects of image resolution on human and machine recognition performance in activity recognition and privacy awareness tasks, and investigated how modern image super-resolution techniques influence these effects. Based on the results, we propose a method for modeling the trade-off of privacy preservation and activity recognition on low-resolution images.
2023 · Yuntao Wang et al. (Tsinghua University) · CHI · Human Pose & Activity Recognition; Privacy Perception & Decision-Making
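The core measurement behind such a trade-off curve is a resolution sweep: train and evaluate the same recognizer on copies of a dataset downsampled to decreasing resolutions. A minimal sketch, using the bundled scikit-learn digits set as a stand-in for ADL camera frames (not the paper's data or models):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def downsample(imgs, factor):
    """Block-average downsampling, mimicking a lower-resolution sensor."""
    n, h, w = imgs.shape
    h2, w2 = h // factor, w // factor
    return imgs[:, :h2 * factor, :w2 * factor].reshape(
        n, h2, factor, w2, factor).mean(axis=(2, 4))

digits = load_digits()  # 8x8 images as a placeholder for ADL frames
for factor in (1, 2, 4):  # sweep 8x8 -> 4x4 -> 2x2
    x = downsample(digits.images, factor).reshape(len(digits.images), -1)
    xtr, xte, ytr, yte = train_test_split(x, digits.target, random_state=0)
    acc = LogisticRegression(max_iter=1000).fit(xtr, ytr).score(xte, yte)
    print(f"{8 // factor}x{8 // factor}: accuracy = {acc:.2f}")
```

Plotting machine accuracy against a privacy-awareness score from the same sweep yields the trade-off curve the paper models.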
Enabling Voice-Accompanying Hand-to-Face Gesture Recognition with Cross-Device Sensing
Gestures performed while speaking are essential for voice interaction, conveying complementary semantics such as wake-up state and input modality. In this paper, we investigate voice-accompanying hand-to-face (VAHF) gestures for voice interaction. We target hand-to-face gestures because they relate closely to speech and yield significant acoustic features (e.g., impeding voice propagation). We conducted a user study to explore the design space of VAHF gestures, first gathering candidate gestures and then applying a structural analysis along different dimensions (e.g., contact position and type), yielding a total of 8 VAHF gestures with good usability and minimal confusion. To facilitate VAHF gesture recognition, we propose a novel cross-device sensing method that leverages heterogeneous channels (vocal, ultrasound, and IMU) of data from commodity devices (earbuds, watches, and rings). Our recognition model achieved an accuracy of 97.3% for 3 gestures and 91.5% for 8 gestures (excluding the "empty" gesture), demonstrating high applicability. Quantitative analysis also sheds light on the recognition capability of each sensor channel and their various combinations. Finally, we illustrate feasible use cases and their design principles to demonstrate the applicability of our system in various scenarios.
2023 · Zisu Li et al. (The Hong Kong University of Science and Technology) · CHI · Hand Gesture Recognition; Voice User Interface (VUI) Design
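A hedged sketch of the cross-device idea: per-channel feature vectors are concatenated (feature-level fusion) and fed to one classifier, and zeroing a channel approximates the per-channel ablation the abstract mentions. All dimensions, data, and the model are invented placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400                                   # synthetic training windows
vocal = rng.normal(size=(n, 13))          # e.g., MFCC summary per window
ultra = rng.normal(size=(n, 8))           # e.g., ultrasound echo features
imu = rng.normal(size=(n, 6))             # e.g., accel/gyro statistics
# Pseudo-labels that depend on two channels, so ablation has a visible cost.
labels = (vocal[:, 0] > 0).astype(int) + 2 * (imu[:, 0] > 0)

fused = np.hstack([vocal, ultra, imu])    # feature-level fusion
clf = RandomForestClassifier(random_state=0).fit(fused[:300], labels[:300])

test_x, test_y = fused[300:], labels[300:]
ablated = test_x.copy()
ablated[:, -6:] = 0.0                     # knock out the IMU channel
print(clf.score(test_x, test_y), clf.score(ablated, test_y))
```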
Selecting Real-World Objects via User-Perspective Phone Occlusion
Perceiving the region of interest (ROI) and target object from the user's first-person perspective can enable diverse spatial interactions on smartphones. In this paper, we propose a novel ROI input method and target selection method for smartphones that utilize user-perspective phone occlusion. Turning the phone into a real-world physical cursor benefits from proprioception, removes the constraint of the camera preview, and allows users to rapidly and accurately select target objects. Our method also provides a resizable and rotatable rectangular ROI to disambiguate dense targets. We implemented a prototype system that simultaneously positions the user's iris with the front camera and estimates the rectangular area blocked by the phone with the rear camera, followed by a target prediction algorithm based on a distance-weighted Jaccard index. We analyzed the behavioral models of using our method and evaluated the prototype's pointing accuracy and usability. Results showed that our method is well accepted by users for its convenience, accuracy, and efficiency.
2023 · Yue Qin et al. (Tsinghua University) · CHI · Hand Gesture Recognition; Eye Tracking & Gaze Interaction; Context-Aware Computing
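To give a flavor of a distance-weighted Jaccard score: overlap pixels are weighted by their closeness to the ROI center, so an object centered in the ROI outranks one clipped at its edge. The inverse-distance weighting below is an assumption for illustration, not the paper's exact formulation:

```python
import numpy as np

def distance_weighted_jaccard(roi_mask: np.ndarray, obj_mask: np.ndarray) -> float:
    """Jaccard index with pixels weighted by closeness to the ROI center."""
    h, w = roi_mask.shape
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = np.mean(np.argwhere(roi_mask), axis=0) if roi_mask.any() else (h / 2, w / 2)
    weight = 1.0 / (1.0 + np.hypot(yy - cy, xx - cx))  # assumed weighting
    inter = ((roi_mask & obj_mask) * weight).sum()
    union = ((roi_mask | obj_mask) * weight).sum()
    return float(inter / union) if union > 0 else 0.0

# Toy scene: a 100x100 frame, a rectangular ROI, two candidate objects.
roi = np.zeros((100, 100), bool)
roi[30:70, 30:70] = True
centered = np.zeros_like(roi)
centered[40:60, 40:60] = True
clipped = np.zeros_like(roi)
clipped[60:90, 60:90] = True
print(distance_weighted_jaccard(roi, centered) >
      distance_weighted_jaccard(roi, clipped))  # True: centered object wins
```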
Color-to-Depth Mappings as Depth Cues in Virtual Reality
Despite significant improvements to Virtual Reality (VR) technologies, most VR displays are fixed-focus, and depth perception remains a key issue that limits user experience and interaction performance. To supplement humans' inherent depth cues (e.g., retinal blur, motion parallax), we investigate users' perceptual mappings from distance to virtual objects' appearance, generating visual cues aimed at enhancing depth perception. As a first step, we explore color-to-depth mappings for virtual objects so that their appearance differs in saturation and value to reflect their distance. Through a series of controlled experiments, we elicit and analyze users' strategies for mapping a virtual object's hue, saturation, value, and a combination of saturation and value to its depth. Based on the collected data, we implement a computational model that generates color-to-depth mappings satisfying adjustable requirements on confusion probability, number of depth levels, and a consistent saturation/value changing tendency. We demonstrate the effectiveness of color-to-depth mappings in a 3D sketching task: compared with single-colored targets and strokes, our mappings made users more confident in their accuracy without extra cognitive load and reduced the perceived depth error by 60.8%. We also implement four VR applications and demonstrate how our color cues can benefit user experience and interaction performance in VR.
2022 · Zhipeng Li et al. · UIST · Immersion & Presence Research; Medical & Scientific Data Visualization
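The shape of such a mapping is easy to sketch: quantize depth into N levels and map each level to a monotonically varying saturation/value at a fixed hue. The endpoints, directions, and level count below are illustrative; the paper's model additionally optimizes confusion probability against elicited data:

```python
import colorsys

def color_for_depth(depth, d_min, d_max, levels=6,
                    s_range=(0.25, 1.0), v_range=(1.0, 0.45), hue=0.6):
    """Return an (r, g, b) cue color for a depth value in [d_min, d_max]."""
    t = min(max((depth - d_min) / (d_max - d_min), 0.0), 1.0)
    level = min(int(t * levels), levels - 1)   # quantized depth level
    u = level / (levels - 1)                   # consistent changing tendency
    s = s_range[0] + u * (s_range[1] - s_range[0])  # assumed: farther = more saturated
    v = v_range[0] + u * (v_range[1] - v_range[0])  # assumed: farther = darker
    return colorsys.hsv_to_rgb(hue, s, v)

for d in (0.5, 1.5, 3.0):  # depths in meters
    print(d, tuple(round(c, 2) for c in color_for_depth(d, 0.5, 3.0)))
```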
FaceOri: Tracking Head Position and Orientation Using Ultrasonic Ranging on Earphones
Face orientation often indicates a user's intended interaction target. In this paper, we propose FaceOri, a novel face tracking technique based on acoustic ranging using earphones. FaceOri leverages the speaker on a commodity device to emit an ultrasonic chirp, which is picked up by the microphones on the user's earphones and processed to calculate the distance from each microphone to the device. These measurements are used to derive the user's face orientation and distance with respect to the device. We conducted a ground-truth comparison and a user study to evaluate FaceOri's performance. The results show that the system determines whether the user is oriented toward the device with 93.5% accuracy within a 1.5 m range. Furthermore, FaceOri can continuously track the user's head orientation with a median absolute error of 10.9 mm in distance, 3.7° in yaw, and 5.8° in pitch. FaceOri enables convenient hands-free device control and more intelligent context-aware interaction.
2022 · Yuntao Wang et al. (Tsinghua University) · CHI · Eye Tracking & Gaze Interaction; Context-Aware Computing
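A simplified sketch of chirp-based ranging of this kind: correlate the received audio with the known chirp to find the time of flight, then turn the left/right distance difference into a yaw estimate. Real systems must handle synchronization, noise, and multipath; all constants here (sample rate, sweep band, inter-microphone span) are illustrative assumptions:

```python
import numpy as np
from scipy.signal import chirp, correlate

FS = 48_000                                  # sample rate (Hz)
T = 0.01                                     # chirp duration (s)
t = np.arange(int(FS * T)) / FS
probe = chirp(t, f0=18_000, f1=22_000, t1=T)  # ultrasonic sweep

def distance_m(received: np.ndarray, c: float = 343.0) -> float:
    """Estimate speaker-to-microphone distance from the correlation peak."""
    lag = np.argmax(correlate(received, probe, mode="full")) - (len(probe) - 1)
    return lag / FS * c

# Toy echoless channel: the chirp arrives after a 70 cm one-way flight.
delay = int(0.70 / 343.0 * FS)
rx = np.concatenate([np.zeros(delay), probe, np.zeros(100)])
d_left = distance_m(rx)                      # distance at the left microphone
d_right = d_left + 0.01                      # assume 1 cm farther at the right
mic_span = 0.18                              # assumed distance between mics (m)
yaw_deg = np.degrees(np.arcsin((d_right - d_left) / mic_span))
print(round(d_left, 3), round(yaw_deg, 1))   # ~0.7 m, ~3.2 degrees
```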
Towards Understanding Diminished Reality
Diminished reality (DR) refers to the concept of removing content from a user's visual environment. While its implementation is becoming feasible, it is still unclear how users perceive and interact in DR-enabled environments and what applications it benefits. To address this challenge, we first conduct a formative study to compare user perceptions of DR and mediated reality effects (e.g., changing the color or size of target elements) in four example scenarios. Participants preferred removing objects through opacity reduction (i.e., the standard DR implementation) and appreciated mechanisms for maintaining a contextual understanding of diminished items (e.g., outlining). In a second study, we explore the user experience of performing tasks within DR-enabled environments. Participants selected which objects to diminish and the magnitude of the effects when performing two separate tasks (video viewing, assembly). Participants were comfortable with decreased contextual understanding, particularly for less mobile tasks. Based on the results, we define guidelines for creating general DR-enabled environments.
2022 · Yifei Cheng et al. (Swarthmore College) · CHI · Mixed Reality Workspaces; Immersion & Presence Research; Context-Aware Computing
ReflecTrack: Enabling 3D Acoustic Position Tracking Using Commodity Dual-Microphone Smartphones
3D position tracking on smartphones has the potential to unlock a variety of novel applications but has not been widely available due to limitations in smartphone sensors. In this paper, we propose ReflecTrack, a novel 3D acoustic position tracking method for commodity dual-microphone smartphones. A ubiquitous speaker (e.g., a smartwatch or earbud) generates inaudible Frequency Modulated Continuous Wave (FMCW) acoustic signals that are picked up by both smartphone microphones. To enable 3D tracking with two microphones, we introduce a reflective surface, easily found among everyday objects near the smartphone, so that the microphones receive both direct sound from the speaker and echoes from the surface for FMCW-based acoustic ranging. To simultaneously estimate the distances along the direct and reflective paths, we propose an echo-aware FMCW technique with a new signal pattern and target detection process. Our user study shows that ReflecTrack achieves a median 3D positioning error of 28.4 mm in a 60 cm × 60 cm × 60 cm space and 22.1 mm in a 30 cm × 30 cm × 30 cm space. We demonstrate the easy accessibility of ReflecTrack using everyday surfaces and objects in several typical applications of 3D position tracking, including 3D input for smartphones, fine-grained gesture recognition, and motion tracking in smartphone-based VR systems.
2021 · Yuzhou Zhuang et al. · UIST · Full-Body Interaction & Embodied Input; Biosensors & Physiological Monitoring
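For intuition, here is bare-bones FMCW ranging of the kind ReflecTrack builds on: the received sweep is mixed (multiplied) with the transmitted one, and the beat frequency read off the FFT is proportional to the path delay. The echo-aware signal pattern and target detection are not reproduced; constants are illustrative:

```python
import numpy as np

FS = 48_000               # sample rate (Hz)
T = 0.02                  # sweep duration (s)
B = 4_000.0               # sweep bandwidth (Hz), e.g. an 18-22 kHz band
C = 343.0                 # speed of sound (m/s)
t = np.arange(int(FS * T)) / FS
tx = np.cos(2 * np.pi * (18_000 * t + 0.5 * (B / T) * t ** 2))

def fmcw_range(rx: np.ndarray) -> float:
    """Estimate one-way path length from the dominant beat frequency."""
    beat = rx * tx                                   # mixing step
    spec = np.abs(np.fft.rfft(beat * np.hanning(len(beat))))
    spec[:5] = 0                                     # drop DC leakage
    f_beat = np.fft.rfftfreq(len(beat), 1 / FS)[np.argmax(spec)]
    return f_beat * T / B * C                        # delay = f_beat / slope

# Simulate a sweep arriving after a 50 cm flight.
delay = int(0.50 / C * FS)
rx = np.roll(tx, delay)
print(round(fmcw_range(rx), 3))                      # ~0.5 m
```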
SemanticAdapt: Optimization-based Adaptation of Mixed Reality Layouts Leveraging Virtual-Physical Semantic Connections
We present an optimization-based approach that automatically adapts Mixed Reality (MR) interfaces to different physical environments. Current MR layouts, including the position and scale of virtual interface elements, must be manually adapted by users whenever they move between environments and whenever they switch tasks. This process is tedious and time-consuming, and arguably needs to be automated for MR systems to be beneficial to end users. We contribute an approach that formulates this challenge as a combinatorial optimization problem and automatically decides the placement of virtual interface elements in new environments. To achieve this, we exploit the semantic association between virtual interface elements and physical objects in an environment. Our optimization further considers the utility of elements for the user's current task, layout factors, and spatio-temporal consistency with previous layouts. All these factors are combined in a single linear program, which is used to adapt the layout of MR interfaces in real time. We demonstrate a set of application scenarios showcasing the versatility and applicability of our approach. Finally, we show that compared to a naive adaptive baseline that does not take semantic associations into account, our approach decreased the number of manual interface adaptations by 33%.
2021 · Yifei Cheng et al. · UIST · AR Navigation & Context Awareness; Mixed Reality Workspaces; Context-Aware Computing
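The flavor of the optimization can be shown with a much-reduced version of the problem: assign each virtual element to the physical anchor that maximizes semantic association weighted by task utility. The score matrix is invented, and a simple assignment solver stands in for the paper's full linear program:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

elements = ["music player", "recipe panel", "chat window"]
anchors = ["speaker", "kitchen counter", "desk"]

# Assumed semantic association scores: rows = elements, cols = anchors.
semantic = np.array([[0.9, 0.1, 0.2],
                     [0.1, 0.8, 0.3],
                     [0.2, 0.2, 0.7]])
utility = np.array([0.5, 1.0, 0.8])   # element importance for the current task
score = semantic * utility[:, None]   # combined objective

rows, cols = linear_sum_assignment(score, maximize=True)
for r, c in zip(rows, cols):
    print(f"{elements[r]} -> {anchors[c]}")
```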
HulaMove: Using Commodity IMU for Waist Interaction
We present HulaMove, a novel interaction technique that leverages waist movement as an eyes-free and hands-free input method for both the physical and virtual worlds. We first conducted a user study (N=12) to understand users' ability to control their waist. We found that users could easily discriminate eight shifting directions and two rotating orientations, and could quickly confirm actions by returning to the original position (quick return). Based on these results, we developed a design space of eight gestures for waist interaction and implemented an IMU-based real-time system. Using a hierarchical machine learning model, our system recognizes waist gestures with an accuracy of 97.5%. Finally, we conducted a second user study (N=12) for usability testing in both real-world and virtual reality settings. HulaMove significantly reduced interaction time by 41.8% compared to a touchscreen method and greatly improved users' sense of presence in the virtual world. This novel technique provides an additional input method when users' eyes or hands are busy, accelerates daily operations, and augments immersive experiences in the virtual world.
2021 · Xuhai Xu et al. (University of Washington) · CHI · Full-Body Interaction & Embodied Input; Immersion & Presence Research
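A sketch of the hierarchical recognition idea: a cheap first stage gates on motion energy, and a second-stage classifier labels the gesture only for active windows. The gesture names, features, data, and thresholds are assumptions for illustration, not HulaMove's model:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

GESTURES = ["shift_N", "shift_NE", "shift_E", "shift_SE",
            "shift_S", "shift_SW", "shift_W", "shift_NW"]

def window_features(imu_window: np.ndarray) -> np.ndarray:
    """Summary statistics over a (samples, 6) accel+gyro window."""
    return np.concatenate([imu_window.mean(0), imu_window.std(0),
                           np.abs(imu_window).max(0)])

def is_active(imu_window: np.ndarray, threshold: float = 0.3) -> bool:
    """Stage 1: reject idle windows by gyroscope energy."""
    return float(np.mean(imu_window[:, 3:] ** 2)) > threshold

# Stage 2: train a gesture classifier (synthetic data as a placeholder).
rng = np.random.default_rng(0)
train_x = np.stack([window_features(rng.normal(size=(50, 6)))
                    for _ in range(160)])
train_y = np.repeat(np.arange(8), 20)
stage2 = RandomForestClassifier(random_state=0).fit(train_x, train_y)

window = rng.normal(size=(50, 6))
if is_active(window):  # only active windows reach the heavier stage 2
    print(GESTURES[stage2.predict(window_features(window)[None])[0]])
```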
ProxiMic: Convenient Voice Activation via Close-to-Mic Speech Detected by a Single Microphone
Wake-up-free techniques (e.g., Raise-to-Speak) are important for improving the voice input experience. We present ProxiMic, a close-to-mic (within 5 cm) speech sensing technique using only one microphone. With ProxiMic, a user holds a microphone-embedded device close to the mouth and speaks directly to the device without wake-up phrases or button presses. To detect close-to-mic speech, we use the pop noise observed when a user speaks and blows air onto the microphone. Sound input is first passed through a low-pass adaptive threshold filter and then analyzed by a CNN that detects subtle close-to-mic features (mainly pop noise). Our two-stage algorithm achieves 94.1% activation recall and 12.3 false accepts per week per user (FAWU) with a 68 KB memory footprint, and runs at 352 fps on a smartphone. The user study shows that ProxiMic is efficient, user-friendly, and practical.
2021 · Yue Qin et al. (Tsinghua University, Ministry of Education) · CHI · Voice User Interface (VUI) Design; Intelligent Voice Assistants (Alexa, Siri, etc.)
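The two-stage structure is the interesting part: a cheap low-pass energy gate passes candidate frames, and only those reach the heavier pop-noise CNN. A minimal sketch of a stage-one gate, with invented filter constants (the CNN stage is omitted):

```python
import numpy as np

def lowpass_energy_gate(frames: np.ndarray, alpha=0.95, margin=3.0):
    """Stage 1: flag frames whose low-frequency energy jumps over an
    adaptively tracked noise floor (pop noise is low-frequency heavy)."""
    floor, flags = None, []
    for frame in frames:
        # crude low-pass: average adjacent samples before measuring energy
        low = (frame[:-1] + frame[1:]) / 2.0
        e = float(np.mean(low ** 2))
        floor = e if floor is None else alpha * floor + (1 - alpha) * e
        flags.append(e > margin * floor)
    return np.array(flags)

rng = np.random.default_rng(0)
frames = rng.normal(0, 0.01, size=(100, 480))   # background-noise frames
frames[60] += rng.normal(0, 0.5, size=480)      # one loud "pop" candidate
candidates = np.flatnonzero(lowpass_energy_gate(frames))
print(candidates)  # only these frames would be passed to the CNN (stage 2)
```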
FaceSight: Enabling Hand-to-Face Gesture Interaction on AR Glasses with a Downward-Facing Camera Vision
We present FaceSight, a computer vision-based hand-to-face gesture sensing technique for AR glasses. FaceSight fixes an infrared camera onto the bridge of the AR glasses to provide extra sensing capability for the lower face and hand behaviors. We derived 21 hand-to-face gestures and demonstrate their potential interaction benefits through five AR applications. We designed and implemented an algorithm pipeline that segments facial regions, detects hand-face contact (F1 score: 98.36%), and trains convolutional neural network (CNN) models to classify the hand-to-face gestures. The supported input includes gesture classification, nose deformation estimation, and continuous fingertip movement. Our algorithm classifies all gestures with an accuracy of 83.06%, evaluated on data from 10 users. Given its compact form factor and rich gesture set, we see FaceSight as a practical solution for augmenting the input capabilities of future AR glasses.
2021 · Yueting Weng et al. (Tsinghua University, Ministry of Education) · CHI · Hand Gesture Recognition; AR Navigation & Context Awareness
FrownOnError: Interrupting Responses from Smart Speakers by Facial Expressions
In conversations with smart speakers, misunderstandings of users' requests lead to erroneous responses. We propose FrownOnError, a novel interaction technique that enables users to interrupt such responses with intentional but natural facial expressions. The method leverages the natural human tendency for facial expressions to change upon receiving unexpected responses. We conducted a first user study (N=12) to understand users' intuitive reactions to correct and incorrect responses. The results reveal significant differences in the frequency and intensity of users' facial expressions between the two conditions, and show that frowning and raising eyebrows are intuitive to perform and easy to control. Our second user study (N=16) evaluated the user experience and interruption efficiency of FrownOnError, and a third user study (N=12) explored suitable conversation recovery strategies after an interruption. Our results show that FrownOnError can be detected accurately (precision: 97.4%, recall: 97.6%), provides the most timely interruption compared to the baseline methods of wake-up word and button press, and was rated the most intuitive and easiest to perform.
2020 · Yukang Yan et al. (Tsinghua University) · CHI · Hand Gesture Recognition; Human Pose & Activity Recognition; Voice User Interface (VUI) Design
EarBuddy: Enabling On-Face Interaction via Wireless Earbuds
Past research on on-body interaction has typically required custom sensors, limiting scalability and generalizability. We propose EarBuddy, a real-time system that leverages the microphone in commercial wireless earbuds to detect tapping and sliding gestures near the face and ears. We developed a design space of 27 valid gestures and conducted a user study (N=16) to select the eight gestures optimal for both human preference and microphone detectability. We collected a dataset for these eight gestures (N=20) and trained deep learning models for gesture detection and classification. Our optimized classifier achieved an accuracy of 95.3%. Finally, we conducted a user study (N=12) to evaluate EarBuddy's usability. The results show that EarBuddy can facilitate novel interactions and that users feel very positive about the system. EarBuddy provides a new eyes-free, socially acceptable input method that is compatible with commercial wireless earbuds and has potential for scalability and generalizability.
2020 · Xuhai Xu et al. (University of Washington & Tsinghua University) · CHI · Haptic Wearables; Foot & Wrist Interaction
PrivateTalk: Activating Voice Input with Hand-On-Mouth Gesture Detected by Bluetooth Earphones
We introduce PrivateTalk, an on-body interaction technique that activates voice input with a Hand-On-Mouth gesture. When "talking" to a computing device (e.g., a smartphone), the user covers their mouth with a hand from one side. PrivateTalk provides two essential benefits for voice input simultaneously. First, the Hand-On-Mouth gesture protects the privacy of the conversation by reducing how far the voice carries and by shielding mouth and lip movements from the view of people nearby. Second, as an on-body hand gesture, it removes the need for a wake-up word and is more accessible than a physical or software button when the device is not in hand. We propose a novel sensing technique that recognizes this asymmetric Hand-On-Mouth gesture with Bluetooth earphones (e.g., AirPods). When the mouth is covered from one side, the amplitude and Mel-frequency cepstral coefficients (MFCCs) of the two audio signals from the earphones' microphones differ significantly. This feature is stable and robust across users, and we leverage it to recognize the Hand-On-Mouth gesture. Evaluation results show a recognition accuracy as high as 98.33%. Users' subjective feedback indicates that PrivateTalk is intuitive and consistently appreciated.
2019 · Yukang Yan et al. · UIST · Eye Tracking & Gaze Interaction; Voice User Interface (VUI) Design; Privacy by Design & User Control
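The left/right asymmetry feature is easy to illustrate: compare amplitude and MFCCs between the two channels and score their disagreement. A hedged sketch with synthetic signals and an invented scoring rule, not the paper's detector:

```python
import numpy as np
import librosa

SR = 16_000  # sample rate (Hz)

def asymmetry(left: np.ndarray, right: np.ndarray) -> float:
    """Amplitude log-ratio plus mean MFCC distance between the two channels."""
    amp = abs(np.log(np.sqrt(np.mean(left ** 2)) /
                     np.sqrt(np.mean(right ** 2))))
    m_l = librosa.feature.mfcc(y=left, sr=SR, n_mfcc=13)
    m_r = librosa.feature.mfcc(y=right, sr=SR, n_mfcc=13)
    return amp + float(np.mean(np.abs(m_l - m_r)))

rng = np.random.default_rng(0)
voice = rng.normal(size=SR)                 # placeholder for 1 s of speech
uncovered = asymmetry(voice, voice * 0.98)  # both mics hear nearly the same
covered = asymmetry(voice, voice * 0.4 + rng.normal(0, 0.1, SR))  # one damped
print(uncovered < covered)                  # True: the gesture is separable
```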
Eyes-Free Target Acquisition in Interaction Space around the Body for Virtual Reality
Eyes-free target acquisition is a basic and important human ability for interacting with the surrounding physical world, relying on the sense of space and proprioception. In this research, we leverage this ability to improve interaction in virtual reality (VR) by allowing users to acquire a virtual object without looking at it. We expect this eyes-free approach to effectively reduce head movements and focus changes, speeding up interaction and alleviating fatigue and VR sickness. We conducted three lab studies to progressively investigate the feasibility and usability of eyes-free target acquisition in VR. Results show that, compared with the eyes-engaged manner, the eyes-free approach is significantly faster, provides satisfying accuracy, and introduces less fatigue and sickness; most participants (13/16) preferred this approach. We also measured the accuracy of motion control and evaluated users' subjective experience when acquiring targets at different locations around the body. Based on the results, we make suggestions on designing appropriate target layouts and discuss several design issues for eyes-free target acquisition in VR.
2018 · Yukang Yan et al. (Tsinghua University) · CHI · Full-Body Interaction & Embodied Input; Immersion & Presence Research
VirtualGrasp: Leveraging Experience of Interacting with Physical Objects to Facilitate Digital Object Retrieval
We propose VirtualGrasp, a novel gestural approach for retrieving virtual objects in virtual reality. Using VirtualGrasp, a user retrieves an object by performing a barehanded gesture as if grasping its physical counterpart. The object-gesture mapping under this metaphor is highly intuitive, enabling users to easily discover and remember the gestures that retrieve each object. We conducted three user studies to demonstrate the feasibility and effectiveness of the approach, progressively investigating the consensus of the object-gesture mapping across users, the expressivity of grasping gestures, and the learnability and performance of the approach. Results showed that users achieved high agreement on the mapping, with an average agreement score [35] of 0.68 (SD=0.27). Without prior exposure to the gestures, users successfully retrieved 76% of objects with VirtualGrasp. A week after learning the mapping, they could recall the gestures for 93% of objects.
2018 · Yukang Yan et al. (Tsinghua University) · CHI · Hand Gesture Recognition; 3D Modeling & Animation
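In the standard guessability formulation, the agreement score for one referent is the sum of squared proportions of identical proposals: A_r = Σ_i (|P_i| / |P|)². A quick sketch, assuming the cited [35] refers to that standard formula:

```python
from collections import Counter

def agreement_score(proposals: list[str]) -> float:
    """Agreement among gesture proposals elicited for one referent (object)."""
    total = len(proposals)
    return sum((n / total) ** 2 for n in Counter(proposals).values())

# Hypothetical referent: 8 of 10 users grasp a "cup" by its handle, 2 by its body.
print(agreement_score(["handle"] * 8 + ["body"] * 2))  # 0.8^2 + 0.2^2 = 0.68
```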