Exploring the Impact of Emotional Voice Integration in Sign-to-Speech Translators for Deaf-to-Hearing Communication
Emotional voice communication plays a crucial role in effective daily interactions. Deaf and Hard of Hearing (DHH) individuals, who often have limited use of voice, rely on facial expressions to supplement sign language and convey emotions. However, in American Sign Language (ASL), facial expressions serve not only emotional purposes but also function as linguistic markers that can alter the meaning of signs. This dual role can confuse non-signers when they try to interpret a signer's emotional state. In this paper, we present studies that (1) confirm the challenges non-signers face when interpreting emotions from facial expressions in ASL communication, and (2) demonstrate how integrating emotional voice into translation systems can enhance hearing individuals' understanding of a signer's emotional intent. An online survey with 45 hearing participants (non-ASL signers) revealed frequent misinterpretations of signers' emotions when emotional and linguistic facial expressions were used simultaneously. The findings show that incorporating emotional voice into translation systems significantly improves emotion recognition, by 32%. Additionally, a follow-up survey with 48 DHH participants highlights design considerations for implementing emotional voice features, emphasizing the importance of emotional voice integration in bridging communication gaps between DHH and hearing communities.
Hyunchul Lim et al. · CSCW 2025 · Topics: Deaf and Hard-of-Hearing Research
EchoGuide: Active Acoustic Guidance for LLM-Based Eating Event Analysis from Egocentric Videos
Self-recording eating behaviors is a step towards a healthy lifestyle recommended by many health professionals. However, the current practice of manually recording eating activities using paper records or smartphone apps is often unsustainable and inaccurate. Smart glasses have emerged as a promising wearable form factor for tracking eating behaviors, but existing systems primarily identify when eating occurs without capturing details of the eating activities (e.g., what is being eaten). In this paper, we present EchoGuide, an application and system pipeline that leverages low-power active acoustic sensing to guide head-mounted cameras to capture egocentric videos, enabling efficient and detailed analysis of eating activities. By combining active acoustic sensing for eating detection with video captioning models and large language models for retrieval augmentation, EchoGuide intelligently clips and analyzes videos to create concise, relevant records of eating activity. We evaluated EchoGuide with 9 participants in naturalistic settings involving eating activities, demonstrating high-quality summarization and significant reductions in the amount of video data needed, paving the way for practical, scalable eating activity tracking.
Vineet Parikh et al. · UbiComp 2024 · Topics: Diet Tracking & Nutrition Management; Sleep & Stress Monitoring; Biosensors & Physiological Monitoring
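The efficiency gain in EchoGuide comes from letting the low-power acoustic eating detector decide which egocentric video segments are worth keeping for captioning and LLM-based summarization. The sketch below illustrates only that gating step, merging per-second detections into padded clip intervals; the function name, threshold, padding, and gap values are illustrative assumptions rather than the paper's parameters.

```python
import numpy as np

def eating_clip_intervals(eating_prob, threshold=0.5, pad_s=2, min_gap_s=5):
    """Merge per-second eating detections from an acoustic classifier into
    padded (start, end) clip intervals, in seconds of recording time."""
    detected = np.flatnonzero(np.asarray(eating_prob) >= threshold)
    if detected.size == 0:
        return []
    intervals, start, prev = [], detected[0], detected[0]
    for t in detected[1:]:
        if t - prev > min_gap_s:              # long gap: close the current clip
            intervals.append((int(max(0, start - pad_s)), int(prev + pad_s)))
            start = t
        prev = t
    intervals.append((int(max(0, start - pad_s)), int(prev + pad_s)))
    return intervals

# Example: a 60-second recording where eating is detected around t = 20-30 s;
# only the returned interval would be clipped, captioned, and summarized.
probs = np.zeros(60)
probs[20:31] = 0.9
print(eating_clip_intervals(probs))           # [(18, 32)]
```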
MunchSonic: Tracking Fine-grained Dietary Actions through Active Acoustic Sensing on Eyeglasses
We introduce MunchSonic, an AI-powered active acoustic sensing system integrated into eyeglasses to track fine-grained dietary actions. MunchSonic emits inaudible ultrasonic waves from the eyeglass frame, with the reflected signals capturing detailed positions and movements of the body parts involved in eating, including the mouth, jaw, arms, and hands. These signals are processed by a deep learning pipeline to classify six actions: hand-to-mouth movements for food intake, chewing, drinking, talking, face-hand touching, and other activities (null). In an unconstrained study with 12 participants, MunchSonic achieved a 93.5% macro F1-score in a user-independent evaluation at a 2-second resolution, and also demonstrated its effectiveness in tracking eating episodes and food intake frequency within those episodes.
Saif Mahmud et al. · UbiComp 2024 · Topics: Diet Tracking & Nutrition Management; Biosensors & Physiological Monitoring
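MunchSonic's headline result is a macro F1-score at a 2-second resolution. Below is a minimal sketch of how such a windowed evaluation could be computed; the 2-second window follows the abstract, while the majority-vote framing, label encoding, and use of scikit-learn are assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score

def frame_labels(per_second_labels, window_s=2):
    """Collapse per-second action labels into one label per 2-second window
    by majority vote (assumed framing, not necessarily the paper's)."""
    labels = np.asarray(per_second_labels)
    n = len(labels) // window_s * window_s
    windows = labels[:n].reshape(-1, window_s)
    return np.array([np.bincount(w).argmax() for w in windows])

# 0=null, 1=hand-to-mouth, 2=chewing, 3=drinking, 4=talking, 5=face-hand touch
truth = frame_labels([1, 1, 2, 2, 2, 2, 0, 0, 4, 4])
pred  = frame_labels([1, 1, 2, 2, 0, 2, 0, 0, 4, 4])
print(f1_score(truth, pred, average="macro"))
```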
SeamPose: Repurposing Seams as Capacitive Sensors in a Shirt for Upper-Body Pose Tracking
Seams are areas of overlapping fabric formed by stitching two or more pieces of fabric together in the cut-and-sew apparel manufacturing process. In SeamPose, we repurposed seams as capacitive sensors in a shirt for continuous upper-body pose estimation. Compared to previous all-textile motion-capturing garments that place electrodes on the clothing surface, our solution leverages the existing seams inside a shirt by machine-sewing insulated conductive threads over them. The seams' inherent invisibility and placement allow the sensing shirt to look and wear like a conventional shirt while providing exciting pose-tracking capabilities. To validate this approach, we implemented a proof-of-concept untethered shirt with 8 capacitive sensing seams. In a 12-participant user study, our customized deep-learning pipeline accurately estimated the upper-body 3D joint positions relative to the pelvis with a mean per joint position error (MPJPE) of 6.0 cm. SeamPose represents a step towards the unobtrusive integration of smart clothing for everyday pose estimation.
Tianhong Catherine Yu et al. · UIST 2024 · Topics: Haptic Wearables; Human Pose & Activity Recognition; Biosensors & Physiological Monitoring
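SeamPose evaluates pose tracking with mean per joint position error (MPJPE) on pelvis-relative 3D joints. A small sketch of that metric under its standard definition (assumed to be the variant used) follows:

```python
import numpy as np

def mpjpe_pelvis_relative(pred, gt, pelvis_idx=0):
    """Mean per joint position error in the input units (e.g., cm).

    pred, gt: arrays of shape (frames, joints, 3).
    Both poses are re-expressed relative to the pelvis joint before comparison.
    """
    pred_rel = pred - pred[:, pelvis_idx:pelvis_idx + 1, :]
    gt_rel = gt - gt[:, pelvis_idx:pelvis_idx + 1, :]
    return np.linalg.norm(pred_rel - gt_rel, axis=-1).mean()

# Toy example: 100 frames, 10 upper-body joints, coordinates in cm.
rng = np.random.default_rng(0)
gt = rng.normal(size=(100, 10, 3)) * 30
pred = gt + rng.normal(size=gt.shape) * 3
print(f"MPJPE: {mpjpe_pelvis_relative(pred, gt):.2f} cm")
```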
EyeEcho: Continuous and Low-power Facial Expression Tracking on Glasses
In this paper, we introduce EyeEcho, a minimally-obtrusive acoustic sensing system designed to enable glasses to continuously monitor facial expressions. It utilizes two pairs of speakers and microphones mounted on glasses to emit encoded inaudible acoustic signals directed towards the face, capturing subtle skin deformations associated with facial expressions. The reflected signals are processed through a customized machine-learning pipeline to estimate full facial movements. EyeEcho samples at 83.3 Hz with a relatively low power consumption of 167 mW. Our user study involving 12 participants demonstrates that, with just four minutes of training data, EyeEcho achieves highly accurate tracking performance across different real-world scenarios, including sitting, walking, and after remounting the devices. Additionally, a semi-in-the-wild study involving 10 participants further validates EyeEcho's performance in naturalistic scenarios while participants engage in various daily activities. Finally, we showcase EyeEcho's potential to be deployed on a commercial off-the-shelf (COTS) smartphone, offering real-time facial expression tracking.
Ke Li et al. (Cornell University) · CHI 2024 · Topics: Hand Gesture Recognition; Eye Tracking & Gaze Interaction; Human Pose & Activity Recognition
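EyeEcho infers expressions from how the emitted inaudible signals reflect off the skin. One common way to turn the transmitted and received audio into a distance-resolved representation is to cross-correlate them into an echo profile; the sketch below shows only that generic step, with a synthetic noise-like transmit frame and a simulated 40-sample echo, not EyeEcho's actual signal design.

```python
import numpy as np

def echo_profile(tx_frame, rx_frame):
    """Cross-correlate one frame of the transmitted signal with the received
    signal; each lag corresponds to a round-trip delay of lag / fs seconds."""
    corr = np.correlate(rx_frame, tx_frame, mode="full")
    return corr[len(tx_frame) - 1:]              # keep non-negative lags only

rng = np.random.default_rng(1)
tx = rng.standard_normal(600)                    # placeholder "encoded" frame
rx = np.zeros(600)
rx[40:] = 0.3 * tx[:560]                         # simulated echo, 40-sample delay
profile = echo_profile(tx, rx)
print("strongest echo at lag", int(np.argmax(np.abs(profile))), "samples")
```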
PoseSonic: 3D Upper Body Pose Estimation Through Egocentric Acoustic Sensing on Smartglasses
In this paper, we introduce PoseSonic, an intelligent acoustic sensing solution for smartglasses that estimates upper body poses. Our system only requires two pairs of microphones and speakers on the hinges of the eyeglasses to emit FMCW-encoded inaudible acoustic signals and receive the reflected signals for body pose estimation. Using a customized deep learning model, PoseSonic estimates the 3D positions of 9 body joints, including the shoulders, elbows, wrists, hips, and nose. We adopt a cross-modal supervision strategy to train our model using synchronized RGB video frames as ground truth. We conducted in-lab and semi-in-the-wild user studies with 22 participants to evaluate PoseSonic; our user-independent model achieved a mean per joint position error of 6.17 cm in the lab setting and 14.12 cm in the semi-in-the-wild setting when predicting the 9 body joint positions in 3D. Our further studies show that performance was not significantly impacted by different surroundings, device remounting, or real-world environmental noise. Finally, we discuss the opportunities, challenges, and limitations of deploying PoseSonic in real-world applications.
Saif Mahmud et al. · UbiComp 2023 · https://doi.org/10.1145/3610895 · Topics: Human Pose & Activity Recognition; Context-Aware Computing
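PoseSonic transmits FMCW-encoded inaudible acoustic signals from the eyeglass hinges. The snippet below only generates one inaudible linear FMCW chirp frame; the 48 kHz sample rate, 18-21.5 kHz band, and 12 ms frame length are illustrative assumptions, not the paper's exact parameters.

```python
import numpy as np

fs = 48_000                 # assumed sample rate (Hz)
f0, f1 = 18_000, 21_500     # assumed inaudible sweep band (Hz)
frame_len = 0.012           # assumed chirp duration (s)

t = np.arange(int(fs * frame_len)) / fs
k = (f1 - f0) / frame_len                        # sweep rate (Hz/s)
phase = 2 * np.pi * (f0 * t + 0.5 * k * t ** 2)  # linear FMCW phase
chirp = np.sin(phase).astype(np.float32)

# A continuous transmission is this frame repeated back to back; echoes are
# then compared against the known sweep to resolve reflection distances.
tx = np.tile(chirp, 10)
print(chirp.shape, tx.shape)
```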
HPSpeech: Silent Speech Interface for Commodity Headphones
We present HPSpeech, a silent speech interface for commodity headphones. HPSpeech utilizes the existing speakers of the headphones to emit inaudible acoustic signals. The movements of the temporomandibular joint (TMJ) during speech modify the reflection pattern of these signals, which are captured by a microphone positioned inside the headphones. To evaluate the performance of HPSpeech, we tested it on two headphones with a total of 18 participants. The results demonstrated that HPSpeech successfully recognized 8 popular silent speech commands for controlling a music player with an accuracy of over 90%. While our tests used modified commodity hardware (both with and without active noise cancellation), our results show that sensing the movement of the TMJ could be as simple as a firmware update for ANC headsets, which already include a microphone inside the ear cup. This leads us to believe that this technique has great potential for rapid deployment in the near future. We further discuss the challenges that need to be addressed before deploying HPSpeech at scale.
Ruidong Zhang et al. · UbiComp 2023 · Topics: Voice User Interface (VUI) Design
EchoNose: Sensing Mouth, Breathing and Tongue Gestures inside Oral Cavity using a Non-contact Nose Interface
Sensing movements and gestures inside the oral cavity has been a long-standing challenge for the wearable research community. This paper introduces EchoNose, a novel nose interface that explores a unique sensing approach to recognize gestures related to mouth, breathing, and tongue by analyzing the acoustic signal reflections inside the nasal and oral cavities. The interface incorporates a speaker and a microphone placed at the nostrils, emitting inaudible acoustic signals and capturing the corresponding reflections. These received signals were processed using a customized data processing and machine learning pipeline, enabling the distinction of 16 gestures involving speech, tongue, and breathing. A user study with 10 participants demonstrates that EchoNose achieves an average accuracy of 93.7% in recognizing these 16 gestures. Based on these promising results, we discuss the potential opportunities and challenges associated with applying this innovative nose interface in various future applications.
Rujia Sun et al. · UbiComp 2023 · Topics: Electrical Muscle Stimulation (EMS); Hand Gesture Recognition; Brain-Computer Interface (BCI) & Neurofeedback
EchoSpeech: Continuous Silent Speech Recognition on Minimally-obtrusive Eyewear Powered by Acoustic Sensing
We present EchoSpeech, a minimally-obtrusive silent speech interface (SSI) powered by low-power active acoustic sensing. EchoSpeech uses speakers and microphones mounted on a glasses frame and emits inaudible sound waves towards the skin. By analyzing echoes from multiple paths, EchoSpeech captures subtle skin deformations caused by silent utterances and uses them to infer silent speech. In a user study with 12 participants, we demonstrate that EchoSpeech can recognize 31 isolated commands and 3-6 figure connected digits with 4.5% (std 3.5%) and 6.1% (std 4.2%) Word Error Rate (WER), respectively. We further evaluated EchoSpeech under scenarios including walking and noise injection to test its robustness. We then demonstrated EchoSpeech in real-time demo applications running at 73.3 mW, where the real-time pipeline was implemented on a smartphone with only 1-6 minutes of training data. We believe that EchoSpeech takes a solid step towards minimally-obtrusive wearable SSIs for real-life deployment.
Ruidong Zhang et al. (Cornell University) · CHI 2023 · Topics: Vibrotactile Feedback & Skin Stimulation; Voice User Interface (VUI) Design; Biosensors & Physiological Monitoring
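EchoSpeech reports Word Error Rate (WER), i.e., the word-level edit distance between the recognized and reference transcripts divided by the number of reference words. A minimal sketch of that standard definition (not the authors' evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("three five seven two", "three five seven"))   # 0.25
```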
ReMotion: Supporting Remote Collaboration in Open Space with Automatic Robotic Embodiment
Design activities, such as brainstorming or critique, often take place in open spaces that combine whiteboards and tables for presenting artefacts. In co-located settings, peripheral awareness enables participants to understand each other's locus of attention with ease. However, these spatial cues are mostly lost when using videoconferencing tools. Telepresence robots could bring back a sense of presence, but controlling them is distracting. To address this problem, we present ReMotion, a fully automatic robotic proxy designed to explore a new way of supporting non-collocated open-space design activities. ReMotion combines a commodity body tracker (Kinect), which captures a user's location and orientation over a wide area, with a minimally invasive wearable system (NeckFace) that captures facial expressions. Thanks to its omnidirectional platform, the ReMotion embodiment can render a wide range of body movements. A formative evaluation indicated that our system enhances the sharing of attention and the sense of co-presence, enabling seamless movement in space during a design review task.
Mose Sakashita et al. (Cornell University) · CHI 2023 · Topics: Human-Robot Collaboration (HRC); Teleoperation & Telepresence
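ReMotion automatically maps the remote user's tracked position and orientation onto an omnidirectional robotic proxy. Purely as an illustration of that idea (not the authors' controller), a simple proportional controller could drive such a platform toward a tracked target pose:

```python
import math

def omni_velocity(robot_pose, target_pose, k_lin=1.0, k_ang=1.5):
    """Proportional velocity command for an omnidirectional base.

    Poses are (x, y, heading) in meters/radians; the target would be the remote
    user's tracked pose mapped into the local room. Gains are assumed values.
    Returns (vx, vy, w) in the robot's body frame.
    """
    dx = target_pose[0] - robot_pose[0]
    dy = target_pose[1] - robot_pose[1]
    heading = robot_pose[2]
    # Rotate the world-frame position error into the robot's body frame.
    vx = k_lin * (math.cos(-heading) * dx - math.sin(-heading) * dy)
    vy = k_lin * (math.sin(-heading) * dx + math.cos(-heading) * dy)
    # Wrap the heading error to [-pi, pi) before applying the angular gain.
    dtheta = (target_pose[2] - heading + math.pi) % (2 * math.pi) - math.pi
    return vx, vy, k_ang * dtheta

print(omni_velocity((0.0, 0.0, 0.0), (1.0, 0.5, math.pi / 2)))
```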
HandyTrak: Recognizing the Holding Hand on a Commodity Smartphone from Body Silhouette Images
Understanding which hand a user holds a smartphone with can help improve the mobile interaction experience. For instance, the layout of the user interface (UI) can be adapted to the holding hand. In this paper, we present HandyTrak, an AI-powered software system that recognizes the holding hand on a commodity smartphone using body silhouette images captured by the front-facing camera. The silhouette images are processed and sent to a customized user-dependent deep learning model (CNN) to infer how the user holds the smartphone (left hand, right hand, or both hands). We evaluated our system on each participant's smartphone at five possible front camera positions in a user study with ten participants, under two hand positions (in the middle and skewed) and three common usage cases (standing, sitting, and resting against a desk). The results showed that HandyTrak was able to continuously recognize the holding hand with an average accuracy of 89.03% (SD: 8.98%) at a 2 Hz sampling rate. We also discuss the challenges and opportunities of deploying HandyTrak on different commodity smartphones and potential applications in real-world scenarios.
Hyunchul Lim et al. · UIST 2021 · Topics: Hand Gesture Recognition; Eye Tracking & Gaze Interaction
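HandyTrak feeds silhouette images from the front camera into a user-dependent CNN that outputs left, right, or both hands. The PyTorch model below is only an illustrative stand-in for such a three-way classifier; the layer sizes and the 96x96 input resolution are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class HoldingHandCNN(nn.Module):
    """Toy 3-class CNN (left / right / both) over single-channel silhouettes."""

    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = HoldingHandCNN()
silhouettes = torch.rand(4, 1, 96, 96)   # assumed 96x96 silhouette crops
print(model(silhouettes).shape)          # torch.Size([4, 3])
```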
TeethTap: Recognizing Discrete Teeth Gestures using Motion and Acoustic Sensing on an Earpiece
Teeth gestures are an emerging alternative input modality for different situations and accessibility purposes. In this paper, we present TeethTap, a novel eyes-free and hands-free input technique that can recognize up to 13 discrete teeth tapping gestures. TeethTap adopts a wearable 3D-printed earpiece with an IMU sensor and a contact microphone behind both ears, which work in tandem to detect jaw movement and sound data, respectively. TeethTap uses a support vector machine to separate gestures from noise by fusing acoustic and motion data, and implements K-Nearest-Neighbor (KNN) classification with a Dynamic Time Warping (DTW) distance measure over the motion data. A user study with 11 participants demonstrated that TeethTap could recognize 13 gestures with a real-time classification accuracy of 90.9% in a laboratory environment. We further examined the accuracy differences across teeth gestures when sensing on a single side versus both sides, and explored the activation gesture in real-world conditions, including eating, speaking, walking, and jumping. Based on our findings, we discuss potential applications and practical challenges of integrating TeethTap into future devices.
Wei Sun et al. · IUI 2021 · Topics: Haptic Wearables; Hand Gesture Recognition; Full-Body Interaction & Embodied Input
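TeethTap's gesture classifier is K-Nearest-Neighbor with a Dynamic Time Warping distance over motion data. A compact sketch of that combination follows; the toy templates and 1-D feature are placeholders, and the SVM-based gesture/noise separation mentioned in the abstract is omitted.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D motion sequences."""
    n, m = len(a), len(b)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[n, m]

def knn_predict(query, templates, k=1):
    """templates: list of (label, sequence) pairs recorded per gesture."""
    dists = sorted((dtw_distance(query, seq), label) for label, seq in templates)
    top = [label for _, label in dists[:k]]
    return max(set(top), key=top.count)

templates = [("tap_left", np.array([0, 1, 3, 1, 0.0])),
             ("tap_right", np.array([0, -1, -3, -1, 0.0]))]
print(knn_predict(np.array([0, 1, 2.5, 1, 0.0]), templates))   # tap_left
```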
C-Face: Continuously reconstructing facial expressions by deep learning contours of the face with ear-mounted miniature cameras
C-Face (Contour-Face) is an ear-mounted wearable sensing technology that uses two miniature cameras to continuously reconstruct facial expressions by deep learning contours of the face. When facial muscles move, the contours of the face change from the point of view of the ear-mounted cameras. These subtle changes are fed into a deep learning model that continuously outputs 42 facial feature points representing the shapes and positions of the mouth, eyes, and eyebrows. To evaluate C-Face, we embedded our technology into headphones and earphones and conducted a user study with nine participants, comparing the output of our system to the feature points produced by a state-of-the-art computer vision library (Dlib) from a front-facing camera. We found that the mean error across all 42 feature points was 0.77 mm for earphones and 0.74 mm for headphones. The mean error for the 20 major feature points capturing the most active areas of the face was 1.43 mm for earphones and 1.39 mm for headphones. The ability to continuously reconstruct facial expressions introduces new opportunities in a variety of applications. As a demonstration, we implemented and evaluated C-Face for two applications: facial expression detection (outputting emojis) and silent speech recognition. We further discuss the opportunities and challenges of deploying C-Face in real-world applications.
Tuochao Chen et al. · UIST 2020 · Topics: Haptic Wearables; Human Pose & Activity Recognition
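C-Face reports the mean error of the 42 reconstructed feature points, both overall and for the 20 points covering the most active facial regions. A minimal sketch of that evaluation metric follows; the example data and subset indices are arbitrary, not the paper's.

```python
import numpy as np

def mean_landmark_error(pred, gt, subset=None):
    """Mean Euclidean distance between predicted and reference feature points.

    pred, gt: arrays of shape (frames, 42, 2) in millimeters.
    subset: optional list of landmark indices (e.g., the 20 most active points).
    """
    if subset is not None:
        pred, gt = pred[:, subset], gt[:, subset]
    return np.linalg.norm(pred - gt, axis=-1).mean()

rng = np.random.default_rng(0)
gt = rng.normal(size=(50, 42, 2)) * 20
pred = gt + rng.normal(size=gt.shape)
print(mean_landmark_error(pred, gt))                         # all 42 points
print(mean_landmark_error(pred, gt, subset=list(range(20)))) # a 20-point subset
```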