WatchHand: Enabling Continuous Hand Pose Tracking On Off-the-Shelf Smartwatches
Authors
Korea Advanced Institute of Science and Technology
Cornell University
Korea Advanced Institute of Science and Technology
Cornell University
Cornell University
Korea Advanced Institute of Science and Technology
Cornell University
Publication Info
- Topic area: Continuous 3D hand pose tracking using commercial smartwatches.
- Keywords: Hand pose tracking, smartwatch, active acoustic sensing, COTS devices, deep learning, 3D hand tracking, gesture recognition, wearable computing, human-computer interaction, privacy-preserving sensing.
Background and Problem
- Problem / challenge: Continuous 3D hand pose tracking is not available on commercial smartwatches because existing approaches depend on external sensors or custom hardware. Prior systems either recognize only discrete gestures or require bespoke hardware configurations, which limits scalability and real-world applicability.
- Significance: Enabling continuous hand pose tracking on off-the-shelf (COTS) smartwatches could unlock expressive, context-aware interactions for millions of devices already in use, enhancing user experience and accessibility.
- Motivation and related work: Prior work has explored sensing modalities like cameras, radar, and EMG, but these approaches face challenges such as privacy concerns, hardware requirements, or limited generalizability. WatchHand builds on recent advances in active acoustic sensing, repurposing built-in smartwatch sensors to achieve continuous hand pose tracking without additional hardware.
Solution
- Proposed approach: WatchHand, a system leveraging the built-in speaker and microphone of COTS smartwatches to perform continuous 3D hand pose tracking using active acoustic sensing and deep learning.
- Novelty:
  - First system to achieve continuous 3D hand pose tracking exclusively using built-in sensors on commercial smartwatches.
  - Development of a deep-learning pipeline for robust hand pose estimation across diverse conditions (e.g., hardware, postures, noise).
  - Extensive empirical evaluations on multiple smartwatch models, yielding sub-centimeter accuracy in cross-session tests.
  - Practical considerations for real-world deployment, including on-device processing and privacy-preserving design.
- Procedure and key techniques:
  - Emit inaudible frequency-modulated continuous waves (18–21 kHz) via the smartwatch speaker and capture reflections with the microphone.
  - Process acoustic signals into spatiotemporal echo profiles using cross-correlation and differential techniques.
  - Use a FastViT-based deep-learning model to estimate 3D positions of 20 finger joints.
  - Evaluate performance across multiple conditions (e.g., hardware, postures, noise) and adapt models with fine-tuning for unseen scenarios.
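The first two steps of the procedure above can be illustrated with a minimal sketch. It is not the authors' implementation: the 48 kHz sample rate, the 240-sample sweep length, the linear sweep shape, the use of circular cross-correlation, and all function names are assumptions made for illustration. Only the 18–21 kHz band comes from the summary. The sketch also assumes that each received frame is already aligned to one transmitted sweep.

```python
import math

SAMPLE_RATE = 48_000                  # Hz; assumed, not stated in the summary
F_START, F_END = 18_000.0, 21_000.0   # Hz; sweep band given in the summary
CHIRP_LEN = 240                       # samples per sweep (5 ms); assumed


def make_chirp(n=CHIRP_LEN, fs=SAMPLE_RATE, f0=F_START, f1=F_END):
    """Return one linear up-chirp sweeping f0 -> f1 over n samples."""
    duration = n / fs
    rate = (f1 - f0) / duration       # sweep rate in Hz per second
    return [
        math.sin(2.0 * math.pi * (f0 * t + 0.5 * rate * t * t))
        for t in (i / fs for i in range(n))
    ]


def echo_profile(frame, chirp):
    """Circular cross-correlation of one received frame with the chirp.

    Entry k is the correlation at a delay of k samples, which corresponds
    to a round-trip path length of k * c / fs (c = speed of sound).
    """
    n = len(chirp)
    return [
        sum(frame[(i + lag) % n] * chirp[i] for i in range(n))
        for lag in range(n)
    ]


def differential_profiles(profiles):
    """Subtract each echo profile from the next one in time."""
    return [
        [cur_v - prev_v for prev_v, cur_v in zip(prev, cur)]
        for prev, cur in zip(profiles, profiles[1:])
    ]
```

Stacking successive echo profiles over time gives a two-dimensional (delay by time) array, which is the kind of input an image-style network can consume. Differencing consecutive profiles is commonly used in acoustic sensing to suppress reflections from static surfaces, so that what remains is dominated by moving reflectors such as the fingers.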
Results
- Concrete findings:
  - Achieved a mean per-joint position error (MPJPE) of 7.87 mm in cross-session tests and 14.88 mm in cross-user tests.
  - Maintained robustness under diverse conditions, including body postures (MPJPE reduced to 6.58 mm with fine-tuning) and noise scenarios (e.g., loud music, walking).
  - Demonstrated sub-centimeter accuracy in within-session tests (MPJPE: 6.02 mm).
- Advantage over baselines:
  - Outperformed prior systems like DiscoBand (17.87 mm MPJPE) and EITPose (17.81 mm MPJPE) in cross-session evaluations.
  - Enabled real-time, on-device processing with a latency of 0.115 seconds per prediction.
- Experiments / evaluation:
  - Conducted four studies with 40 participants, testing across three smartwatch models (Samsung, Xiaomi, Google) and various conditions (e.g., postures, noise, dynamic hand pose variations).
  - Evaluated models using cross-session, within-session, and cross-user protocols.
- Limitations and future work:
  - Performance drops in cross-user scenarios due to inter-user variability.
  - Challenges with object interactions and unseen hand poses.
  - Future work includes expanding datasets, integrating object-awareness, and exploring self-supervised learning for improved generalization.
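The MPJPE figures reported above follow a simple definition: the Euclidean distance between each predicted joint and its ground-truth position, averaged over all joints and all frames. The sketch below is a minimal illustration that assumes positions are given in millimeters and that no alignment step (such as root-relative normalization) is applied, since the summary does not say whether the authors use one. The function name `mpjpe` is illustrative.

```python
import math


def mpjpe(predicted, ground_truth):
    """Mean per-joint position error, in the units of the inputs.

    Both arguments are sequences of frames; each frame is a sequence of
    (x, y, z) joint positions, with joints in the same order.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("frame counts differ")
    distances = []
    for pred_frame, true_frame in zip(predicted, ground_truth):
        if len(pred_frame) != len(true_frame):
            raise ValueError("joint counts differ")
        # Euclidean distance between each predicted and true joint.
        distances.extend(
            math.dist(p, g) for p, g in zip(pred_frame, true_frame)
        )
    if not distances:
        raise ValueError("no joints to evaluate")
    return sum(distances) / len(distances)
```

For example, if every one of the 20 predicted joints in a frame is displaced from its true position by 3 mm along x and 4 mm along y, each joint error is 5 mm and so is the MPJPE.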
Summary
WatchHand introduces a novel approach to continuous 3D hand pose tracking using only the built-in speaker and microphone of commercial smartwatches. By combining active acoustic sensing with deep learning, the system achieves sub-centimeter accuracy in within-session and cross-session tests (6.02 mm and 7.87 mm MPJPE), while cross-user error rises to 14.88 mm. Evaluations across three smartwatch models, multiple body postures, and noise scenarios demonstrate its robustness and adaptability, with potential applications in gesture-based interaction, accessibility, and cross-device interfaces. WatchHand represents a significant step toward scalable, privacy-preserving hand tracking on millions of existing smartwatches, with opportunities for further optimization and dataset expansion.
https://hci.top/en/papers/chi/223529/2026