Multimodal Silent Speech-based Text Entry with Word-initials Conditioned LLM
Although silent speech interfaces show great potential for enabling seamless communication between humans and conversational agents, large-vocabulary recognition remains challenging for them. In this research, we propose a novel interaction technique that combines silent speech and typing to enable more efficient text entry while preserving privacy. The technique allows users to enter abbreviated phrases while still ensuring high accuracy by leveraging visual information. By fine-tuning a large language model with a visual speech encoder, we condition the model to decode the speech content using word initials as hints. Evaluations on existing datasets show that our model reduces the Word Error Rate from 20.3% to 9.19% compared to state-of-the-art visual speech recognition models. Results from a user study demonstrated significant improvements in input speed and keystroke savings. Participants reported that our prototype, LipType, leads to an overall lower perceived workload, particularly in the effort and physical demand dimensions.
2025 · Zixiong Su et al. · CUI · Topics: Electrical Muscle Stimulation (EMS), Hand Gesture Recognition, Human-LLM Collaboration
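The conditioning idea can be illustrated with a toy decoder: each predicted word is constrained to start with the typed initial, and a language model ranks the remaining candidates. The sketch below is hypothetical; a tiny bigram model stands in for the paper's fine-tuned LLM, and the visual speech features that would further disambiguate candidates are omitted.

```python
# Minimal sketch (hypothetical, not the paper's actual model): greedily decode
# a phrase from word initials, constraining each candidate word to start with
# the given initial and scoring it with a toy bigram language model.
from collections import defaultdict

# Tiny corpus standing in for the fine-tuned LLM's knowledge.
corpus = [
    "how are you", "have a wonderful day", "how was your day",
    "happy to help you", "see you tomorrow",
]

bigrams = defaultdict(lambda: defaultdict(int))
vocab = set()
for sent in corpus:
    words = ["<s>"] + sent.split()
    vocab.update(words[1:])
    for prev, cur in zip(words, words[1:]):
        bigrams[prev][cur] += 1

def decode_from_initials(initials):
    """For each initial, pick the most likely vocabulary word starting with it."""
    prev, out = "<s>", []
    for ch in initials:
        candidates = [w for w in vocab if w.startswith(ch)]
        # Score by bigram count; visual-speech features would refine this.
        best = max(candidates, key=lambda w: bigrams[prev].get(w, 0))
        out.append(best)
        prev = best
    return " ".join(out)

print(decode_from_initials(["h", "a", "y"]))  # e.g. "how are you"
```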
Dynamik: Syntactically-Driven Dynamic Font Sizing for Emphasis of Key Information
In today's globalized world, there are increasing opportunities for individuals to communicate using a common non-native language (lingua franca). Non-native speakers often have opportunities to listen to foreign languages but may not comprehend them as fully as native speakers do. To aid real-time comprehension, live transcription of subtitles is frequently used in everyday life (e.g., during Zoom conversations, when watching YouTube videos, or on social networking sites). However, simultaneously reading subtitles while listening can increase cognitive load. In this study, we propose Dynamik, a system that reduces cognitive load during reading by decreasing the size of less important words and enlarging important ones, thereby enhancing sentence contrast. Our results indicate that Dynamik can reduce certain aspects of cognitive load, specifically perceived performance and effort, and enhance the sense of comprehension among participants with low English proficiency. We further discuss our method's applicability to other languages, potential improvements, and directions for further research.
2025 · Naoto Nishida et al. · IUI · Topics: Voice User Interface (VUI) Design, Voice Accessibility
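As a rough illustration of the sizing mechanism (not Dynamik's actual syntactic analysis, which the abstract does not detail), the sketch below shrinks function words and enlarges content words using a simple stopword heuristic, emitting HTML for a subtitle line.

```python
# Minimal sketch (my own heuristic, not Dynamik's syntactic analysis): scale
# each word's font size by a crude importance score, shrinking function words
# and enlarging content words, then emit HTML for one subtitle line.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are",
             "on", "for", "with", "at", "it"}

def dynamic_sizes(sentence, base_px=16, lo=0.75, hi=1.35):
    spans = []
    for word in sentence.split():
        scale = lo if word.lower().strip(".,!?") in STOPWORDS else hi
        spans.append(f'<span style="font-size:{base_px * scale:.0f}px">{word}</span>')
    return " ".join(spans)

print(dynamic_sizes("The committee approved the budget for the new project"))
```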
Watch Your Mouth: Silent Speech Recognition with Depth Sensing
Silent speech recognition is a promising technology that decodes human speech without requiring audio signals, enabling private human-computer interactions. In this paper, we propose Watch Your Mouth, a novel method that leverages depth sensing to enable accurate silent speech recognition. By leveraging depth information, our method provides unique resilience against environmental factors such as variations in lighting and device orientation, and further addresses privacy concerns by eliminating the need for sensitive RGB data. We started by building a deep-learning model that locates lips using depth data. We then designed a deep-learning pipeline that efficiently learns from point clouds and translates lip movements into commands and sentences. We evaluated our technique and found it effective across diverse sensor locations: On-Head, On-Wrist, and In-Environment. Watch Your Mouth outperformed the state-of-the-art RGB-based method, demonstrating its potential as an accurate and reliable input technique.
2024 · Xue Wang et al. · University of California, Los Angeles · CHI · Topics: Eye Tracking & Gaze Interaction, Cognitive Impairment & Neurodiversity (Autism, ADHD, Dyslexia)
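A plausible building block for such a pipeline is sketched below: back-projecting a depth frame of the lip region to a point cloud and encoding it with a PointNet-style shared MLP plus max pooling. The camera intrinsics, sizes, and architecture are assumptions for illustration, not the paper's exact design.

```python
# Minimal sketch (assumed architecture, not the paper's exact pipeline):
# back-project a depth crop to a point cloud, then encode it with a
# PointNet-style per-point MLP and max pooling, as one frame of a sequence.
import torch
import torch.nn as nn

def depth_to_points(depth, fx=500.0, fy=500.0, cx=64.0, cy=64.0):
    """Back-project an (H, W) depth map to an (N, 3) point cloud."""
    h, w = depth.shape
    v, u = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    z = depth.flatten()
    x = (u.flatten() - cx) * z / fx
    y = (v.flatten() - cy) * z / fy
    pts = torch.stack([x, y, z], dim=-1)
    return pts[z > 0]  # drop invalid (zero-depth) pixels

class PointEncoder(nn.Module):
    """Order-invariant encoder: per-point MLP followed by max pooling."""
    def __init__(self, dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, dim))
    def forward(self, pts):                      # pts: (N, 3)
        return self.mlp(pts).max(dim=0).values   # (dim,)

depth = torch.rand(128, 128)                     # stand-in for one sensor frame
print(PointEncoder()(depth_to_points(depth)).shape)  # torch.Size([128])
```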
Telextiles: End-to-end Remote Transmission of Fabric Tactile Sensation
The tactile sensation of textiles is critical in determining the comfort of clothing. In remote settings such as online shopping, users cannot physically touch the textile of clothes, making it difficult to evaluate its tactile sensation. Tactile sensing and actuation devices are therefore required to transmit the tactile sensation of textiles. The sensing device needs to recognize different garments even with hand-held sensors, and existing actuation devices can only present a limited number of known patterns and cannot transmit unknown tactile sensations. To address these issues, we propose Telextiles, an interface that remotely transmits tactile sensations of textiles by creating a latent space that reflects the proximity of textiles through contrastive self-supervised learning. We confirm via a two-dimensional plot that textiles with similar tactile features are located close to each other in the latent space. We then compress the latent features of the known textile samples into a 1D distance and arrange the 16 samples on the roller in order of that distance. If an unknown textile is detected, the roller rotates to select the sample with the closest features.
2023 · Takekazu Kitagishi et al. · UIST · Topics: Haptic Wearables, Textile Art & Craft Digitization
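The 1D compression and selection steps can be sketched as follows, assuming PCA as the compression and nearest-neighbor matching for unknown textiles; the trained contrastive encoder is replaced here by random stand-in embeddings.

```python
# Minimal sketch (assumed post-processing, not Telextiles' trained encoder):
# compress latent features of 16 known textiles to a 1D coordinate via PCA,
# order them along the roller, and match an unknown textile to the nearest.
import numpy as np

rng = np.random.default_rng(0)
latents = rng.normal(size=(16, 32))   # stand-in for learned embeddings

# 1D PCA: project onto the first principal component.
centered = latents - latents.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ vt[0]             # (16,) 1D coordinate per sample

roller_order = np.argsort(coords)     # physical arrangement on the roller

def select_sample(unknown_latent):
    """Return the index of the known sample nearest in the 1D coordinate."""
    c = (unknown_latent - latents.mean(axis=0)) @ vt[0]
    return int(np.argmin(np.abs(coords - c)))

print(roller_order, select_sample(rng.normal(size=32)))
```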
LipLearner: Customizable Silent Speech Interactions on Mobile Devices
Silent speech interfaces are a promising technology that enables private communication in natural language. However, previous approaches support only a small and inflexible vocabulary, which limits expressiveness. We leverage contrastive learning to learn efficient lipreading representations, enabling few-shot command customization with minimal user effort. Our model exhibits high robustness to different lighting, posture, and gesture conditions on an in-the-wild dataset. For 25-command classification, an F1-score of 0.8947 is achievable using only one shot, and performance can be further boosted by adaptively learning from more data. This generalizability allowed us to develop a mobile silent speech interface empowered with on-device fine-tuning and visual keyword spotting. A user study demonstrated that with LipLearner, users could define their own commands with high reliability, guaranteed by an online incremental learning scheme. Subjective feedback indicated that our system provides essential functionalities for customizable silent speech interactions with high usability and learnability.
2023 · Zixiong Su et al. · The University of Tokyo · CHI · Topics: Brain-Computer Interface (BCI) & Neurofeedback, Conversational Chatbots, Voice Accessibility
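A few-shot command classifier of this flavor can be sketched with prototype embeddings: one registered shot per command, cosine-similarity classification, and a running mean for incremental learning. The embedding function itself (LipLearner's contrastive encoder) is replaced by random vectors here.

```python
# Minimal sketch (assumed few-shot head, not LipLearner's full model): classify
# a lip-movement embedding against per-command prototypes by cosine similarity,
# updating prototypes incrementally as new shots arrive.
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

class FewShotCommands:
    def __init__(self):
        self.protos = {}   # command name -> (mean embedding, shot count)

    def register(self, name, emb):
        mean, n = self.protos.get(name, (np.zeros_like(emb), 0))
        self.protos[name] = ((mean * n + emb) / (n + 1), n + 1)  # running mean

    def classify(self, emb):
        q = normalize(emb)
        return max(self.protos, key=lambda k: normalize(self.protos[k][0]) @ q)

rng = np.random.default_rng(1)
fs = FewShotCommands()
fs.register("play music", rng.normal(size=64))   # one shot is enough to start
fs.register("stop", rng.normal(size=64))
print(fs.classify(rng.normal(size=64)))
```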
DualVoice: Speech Interaction that Discriminates between Normal and Whispered Voice Input
Interactions based on automatic speech recognition (ASR) have become widely used, with speech input increasingly utilized to create documents. However, because there is no easy way to distinguish spoken commands from text to be entered, misrecognitions are difficult to identify and correct, and documents therefore need to be edited and corrected manually. Inputting symbols and commands is also challenging because they may be misrecognized as text. To address these problems, this study proposes a speech interaction method called DualVoice, in which commands are entered in a whispered voice and text in a normal voice. The proposed method requires no specialized hardware beyond a regular microphone, enabling completely hands-free interaction. It can be used in a wide range of situations where speech recognition is already available, from text input to mobile and wearable computing. Two neural networks were designed in this study: one that discriminates normal speech from whispered speech, and one that recognizes whispered speech. A prototype text input system was then developed to show how normal and whispered voice can be combined in speech-based text entry. Other potential applications of DualVoice are also discussed.
2022 · Jun Rekimoto · UIST · Topics: Voice User Interface (VUI) Design, Intelligent Voice Assistants (Alexa, Siri, etc.)
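The discrimination task exploits the fact that whispered speech lacks periodic glottal excitation. The sketch below is a simple signal-processing stand-in for the paper's first neural network: it routes audio by measuring voicing strength via the normalized autocorrelation peak in the plausible pitch-lag range.

```python
# Minimal sketch (signal heuristic, not DualVoice's neural networks): route
# audio to "command" or "text" by detecting voicing. Whispered speech has no
# periodic excitation, so its autocorrelation peak is weak.
import numpy as np

def voicing_strength(x, sr=16000, fmin=75, fmax=400):
    """Peak normalized autocorrelation within the plausible pitch-lag range."""
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    ac /= ac[0] + 1e-9
    lo, hi = sr // fmax, sr // fmin
    return ac[lo:hi].max()

def route(x, threshold=0.4):
    return "text (normal voice)" if voicing_strength(x) > threshold else "command (whisper)"

sr, t = 16000, np.linspace(0, 0.05, 800, endpoint=False)
voiced = np.sin(2 * np.pi * 150 * t)                  # harmonic stand-in for voice
whisper = np.random.default_rng(2).normal(size=800)   # noise stand-in for whisper
print(route(voiced), "|", route(whisper))
```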
Aware: Intuitive Device Activation Using Prosody for Natural Voice Interactions
Voice-interactive devices often use keyword spotting for device activation. However, this approach suffers from misrecognition of keywords and can respond to keywords not intended as calls to the device (e.g., "You can ask Alexa about it."), causing accidental activations. We propose a method that leverages prosodic features to differentiate calling from not-calling voices (F1 score: 0.869), allowing devices to respond only when actually called upon. As a proof of concept, we built a prototype smart speaker called Aware that lets users control device activation by speaking the keyword with specific prosody patterns. These patterns were chosen to represent people's natural calling and not-calling voices, uncovered in a study that collected such voices and investigated their prosodic differences. A user study comparing Aware with Amazon Echo shows that Aware activates more correctly (F1 score 0.93 vs. 0.56) and is easy to learn and use.
2022 · Xinlei Zhang et al. · The University of Tokyo · CHI · Topics: Voice User Interface (VUI) Design, Intelligent Voice Assistants (Alexa, Siri, etc.)
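As an illustration of prosody-based call detection (with an assumed feature set, not the paper's), the sketch below trains a tiny logistic regression on two features per keyword token, mean pitch and final pitch slope, using synthetic data.

```python
# Minimal sketch (assumed features and synthetic data, not Aware's classifier):
# separate "calling" from "not-calling" keyword tokens with two prosodic
# features, mean pitch (Hz) and final pitch slope (Hz/s).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
# Synthetic stand-in data: calling voices tend to be higher and rising.
calling = np.column_stack([rng.normal(220, 20, 100), rng.normal(30, 10, 100)])
noncall = np.column_stack([rng.normal(180, 20, 100), rng.normal(-10, 10, 100)])
X = np.vstack([calling, noncall])
y = np.array([1] * 100 + [0] * 100)    # 1 = calling, 0 = not calling

clf = LogisticRegression().fit(X, y)
token = [[225.0, 25.0]]                # (mean F0, final slope) of a new token
print("activate" if clf.predict(token)[0] else "ignore")
```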
SilentSpeller: Towards mobile, hands-free, silent speech text entry using electropalatography
Speech is inappropriate in many situations, limiting when voice control can be used. Most unvoiced speech text entry systems cannot be used on the go due to movement artifacts. Using a dental retainer with capacitive touch sensors, SilentSpeller tracks tongue movement, enabling users to type by spelling words without voicing. SilentSpeller achieves 97% average character accuracy in offline isolated-word testing on a 1164-word dictionary. Walking has little effect on accuracy: average offline character accuracy was roughly equivalent on 107 phrases entered while walking (97.5%) or seated (96.5%). To demonstrate extensibility, the system was tested on 100 unseen words, achieving 94% average accuracy. Live text entry speeds for seven participants averaged 37 words per minute at 87% accuracy. Comparing silent spelling to current practice suggests that SilentSpeller may be a viable alternative for silent mobile text entry.
2022 · Naoki Kimura et al. · The University of Tokyo · CHI · Topics: Electrical Muscle Stimulation (EMS), Augmentative & Alternative Communication (AAC)
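Spelling-based entry invites dictionary correction of noisy letter sequences. The sketch below is my own post-processing illustration, not SilentSpeller's recognizer: it snaps a decoded character string to the closest word in a small dictionary by edit distance.

```python
# Minimal sketch (my own post-processing, not SilentSpeller's recognizer):
# the per-letter decoder is imperfect, so snap a decoded character string to
# the closest dictionary word by Levenshtein edit distance.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

DICTIONARY = ["hello", "help", "silent", "speller", "type", "text"]

def correct(decoded):
    return min(DICTIONARY, key=lambda w: edit_distance(decoded, w))

print(correct("siland"))   # -> "silent"
```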
Design in Action: Unpacking the Artists' Role in Performance-Led Research
This paper illustrates design work carried out to develop an interactive theater performance. HCI has started to address the challenges of designing interactive performances, as both audience and performers' experiences must be considered and a variety of professional expertise is involved. Nevertheless, research has overlooked how such design unfolds in practice and what role artists play in exploring both the creative opportunities and the challenges of interweaving digital technologies. A two-day workshop was conducted to tailor the use of ChameleonMask, a telepresence technology, within a performance. The analysis highlights the artists' efforts to make the mask work while framing, exploring, and conceptualizing its use. The discussion outlines the artists' skills and design expertise, and how these redefine the role of HCI in performance-led research.
2021 · Jakob Tholander et al. · Stockholm University · CHI · Topics: Digital Art Installations & Interactive Performance, Interactive Narrative & Immersive Storytelling
WithYou: Automated Adaptive Speech Tutoring With Context-Dependent Speech Recognition
Learning to speak a foreign language is hard. Speech shadowing, in which a learner listens to a native speech template and repeats it as simultaneously as possible, has emerged as a proven way to practice speaking. However, shadowing can be difficult in practice: learners frequently fail to keep up with the speech and unintentionally interrupt a practice session. Worse, because no technique for evaluating shadowing performance in real time has been established, no automated solutions are available to help. In this paper, we propose a technical framework that uses context-dependent speech recognition to evaluate shadowing in real time. Building on it, we present a shadowing tutor system called WithYou, which automatically adjusts the playback and difficulty of a speech template when learners fail, making shadowing smooth and tailored. Results from a user study show that WithYou yields greater speech improvements (14%) than the conventional method (2.7%) with a lower cognitive load.
2020 · Xinlei Zhang et al. · The University of Tokyo · CHI · Topics: Voice User Interface (VUI) Design, Intelligent Voice Assistants (Alexa, Siri, etc.), Intelligent Tutoring Systems & Learning Analytics
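The adaptive playback logic can be sketched as a small control loop (an assumed design, since the abstract does not specify the exact policy): track the learner's alignment with the template and, on sustained mismatch, rewind a few words and lower the playback rate.

```python
# Minimal sketch (assumed control loop, not WithYou's recognizer): track how
# far the learner's shadowed words align with the template; on a sustained
# mismatch, rewind and slow playback so the session continues smoothly.
TEMPLATE = "the quick brown fox jumps over the lazy dog".split()

class ShadowTutor:
    def __init__(self, max_misses=2):
        self.pos, self.misses, self.rate = 0, 0, 1.0
        self.max_misses = max_misses

    def on_recognized(self, word):
        if self.pos < len(TEMPLATE) and word == TEMPLATE[self.pos]:
            self.pos, self.misses = self.pos + 1, 0
        else:
            self.misses += 1
            if self.misses >= self.max_misses:       # learner lost the thread
                self.pos = max(0, self.pos - 3)      # rewind a few words
                self.rate = max(0.6, self.rate - 0.1)  # ease the difficulty
                self.misses = 0
        return self.pos, self.rate

tutor = ShadowTutor()
for w in ["the", "quick", "crown", "box", "fox"]:    # learner stumbles mid-phrase
    print(w, "->", tutor.on_recognized(w))
```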
ElasticVR: Providing Multilevel Continuously-Changing Resistive Force and Instant Impact Using Elasticity for VR
Resistive force (e.g., from object elasticity) and impact (e.g., from recoil) are common effects in daily life. Resistive force changes continuously with the user's movements, while impact occurs instantly when an event triggers it. Such feedback is still not realistically provided by current VR haptic methods. In this paper, we propose ElasticVR, a wearable device consisting of an elastic band, servo motors, and mechanical brakes that delivers continuously-changing resistive force and instantly-occurring impact to the user's hand to enhance VR realism. By changing two physical properties of the elastic band, its length and its extension distance, ElasticVR provides multilevel resistive force with no delay and impact with little delay, respectively, for realistic and versatile VR applications. A force perception study was performed to observe users' ability to distinguish levels of resistive force and impact, and the prototype was built based on its results. A VR experience study further shows that the resistive force and impact from ElasticVR both outperform those of current approaches in realism. Applications using ElasticVR are also demonstrated.
2019 · Hsin-Ruey Tsai et al. · National Taiwan University & University of Tokyo · CHI · Topics: Force Feedback & Pseudo-Haptic Weight, Immersion & Presence Research
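An idealized version of the force model, assuming the band behaves as a Hooke spring whose stiffness scales inversely with its length, shows how a controller could solve for the extension needed to render a target force level. The constants below are illustrative, not the device's calibration.

```python
# Minimal sketch (idealized physics, not ElasticVR's calibrated device): model
# the band as a Hooke spring whose stiffness depends on its free length, then
# solve for the extension the servos must apply for a target force level.
def band_force(extension_m, length_m, k0=4.0):
    """Resistive force of the band; a shorter band is stiffer (assumed model)."""
    k = k0 / length_m            # effective stiffness in N/m, with k0 in N
    return k * extension_m

def extension_for(target_n, length_m, k0=4.0):
    return target_n * length_m / k0

for level, target in enumerate([1.0, 2.0, 4.0], 1):   # multilevel force targets
    ext = extension_for(target, length_m=0.10)
    print(f"level {level}: extend {ext * 100:.1f} cm -> "
          f"{band_force(ext, 0.10):.1f} N")
```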
SottoVoce: An Ultrasound Imaging-Based Silent Speech Interaction Using Deep Neural Networks
Digital devices operated by voice are expanding rapidly, yet the applications of voice interfaces remain restricted. For example, speaking in public places can annoy surrounding people, secret information should not be uttered aloud, and environmental noise may reduce the accuracy of speech recognition. To address these limitations, we propose a system that detects a user's unvoiced utterances. From internal articulatory information observed by an ultrasonic imaging sensor attached to the underside of the jaw, the system recognizes the utterance content without the user producing any voice. Our deep neural network model obtains acoustic features from a sequence of ultrasound images. We confirmed that audio signals generated by our system can control existing smart speakers. We also observed that users can learn to adjust their oral movements and thereby improve recognition accuracy.
2019 · Naoki Kimura et al. · The University of Tokyo · CHI · Topics: Brain-Computer Interface (BCI) & Neurofeedback, Voice User Interface (VUI) Design
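The image-to-acoustics mapping can be sketched as a small CNN that turns each ultrasound frame into one mel-spectrogram column; a vocoder (not shown) would then synthesize the waveform that drives a smart speaker. Shapes and layers here are assumptions, not the published architecture.

```python
# Minimal sketch (assumed shapes, not SottoVoce's published architecture): a
# small CNN maps each ultrasound frame to one mel-spectrogram column; a
# separate vocoder would turn the mel frames into audible speech.
import torch
import torch.nn as nn

class Ultrasound2Mel(nn.Module):
    def __init__(self, n_mels=80):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, n_mels)

    def forward(self, frames):                 # (T, 1, H, W) ultrasound sequence
        feats = self.conv(frames).flatten(1)   # (T, 32)
        return self.head(feats)                # (T, n_mels) mel frames

frames = torch.rand(50, 1, 128, 128)           # 50 frames of a silent utterance
print(Ultrasound2Mel()(frames).shape)          # torch.Size([50, 80])
```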
ExtVision: Augmentation of Visual Experiences with Generation of Context Images for a Peripheral Vision Using DNN
We propose ExtVision, a system that augments visual experiences by generating context images and projecting them onto the periphery of a television or computer screen. Peripheral projection of a context image is one of the most effective techniques for enhancing visual experiences, but it is not commonly used at present because preparing the context image is difficult. In this paper, we propose a deep neural network-based method for generating context images for peripheral projection. A user study was performed to investigate how the proposed system augments traditional visual experiences. In addition, we present applications and future prospects of the developed system.
2018 · Naoki Kimura et al. · The University of Tokyo · CHI · Topics: Immersion & Presence Research, 360° Video & Panoramic Content
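A classical stand-in for the generator clarifies what the DNN must produce: a plausible extension of the frame into the peripheral margin. The sketch below fakes it with edge replication plus blurring; the paper's contribution is replacing such a heuristic with learned, coherent content.

```python
# Minimal sketch (classical stand-in, not ExtVision's DNN): extend a frame into
# its periphery by edge replication and heavy blurring; the learned generator's
# job is to fill this margin with plausible, coherent context instead.
import numpy as np

def naive_context(frame, margin=64, blur_passes=20):
    """frame: (H, W, 3) uint8. Returns the frame padded with blurred context."""
    padded = np.pad(frame.astype(float),
                    ((margin, margin), (margin, margin), (0, 0)), mode="edge")
    for _ in range(blur_passes):          # crude iterated 5-point box blur
        padded = (padded + np.roll(padded, 1, 0) + np.roll(padded, -1, 0)
                  + np.roll(padded, 1, 1) + np.roll(padded, -1, 1)) / 5
    padded[margin:-margin, margin:-margin] = frame   # keep the original crisp
    return padded.astype(np.uint8)

frame = (np.random.default_rng(4).random((90, 160, 3)) * 255).astype(np.uint8)
print(naive_context(frame).shape)         # (218, 288, 3)
```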