Tap&Say: Touch Location-Informed Large Language Model for Multimodal Text Correction on Smartphones
While voice input offers a convenient alternative to traditional text editing on mobile devices, practical implementations face two key challenges: 1) reliably distinguishing between editing commands and content dictation, and 2) effortlessly pinpointing the intended edit location. We propose Tap&Say, a novel multimodal system that combines touch interactions with Large Language Models (LLMs) for accurate text correction. By tapping near an error, users signal their edit intent and location, addressing both challenges. Then, the user speaks the correction text. Tap&Say utilizes the touch location, voice input, and existing text to generate contextually relevant correction suggestions. We propose a novel touch location-informed attention layer that integrates the tap location into the LLM's attention mechanism, enabling it to utilize the tap location for text correction. We fine-tuned the touch location-informed LLM on synthetic touch locations and correction commands, achieving significantly higher correction accuracy than the state-of-the-art method VT. A 16-person user study demonstrated that Tap&Say outperforms VT with 16.4% shorter task completion time and 47.5% fewer keyboard clicks and is preferred by users.
2025 · Maozheng Zhao et al. · Stony Brook University, Department of Computer Science · CHI · Topics: Human-LLM Collaboration
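The idea of a touch location-informed attention layer can be illustrated with a minimal sketch: bias the attention logits toward tokens rendered near the tap before the softmax. This is purely illustrative; the function name, the isotropic Gaussian bias, and the pixel-space parameters are assumptions, not the paper's fine-tuned layer.

```python
import numpy as np

def touch_biased_attention(scores, token_positions, tap_xy, sigma=40.0):
    """Bias one query's attention toward tokens rendered near the tap.

    scores: (n_tokens,) raw attention logits.
    token_positions: (n_tokens, 2) on-screen (x, y) centre of each token.
    tap_xy: (2,) reported tap location in the same pixel coordinates.
    sigma: assumed spatial spread of a tap, in pixels.
    """
    d2 = np.sum((token_positions - np.asarray(tap_xy)) ** 2, axis=1)
    bias = -d2 / (2 * sigma ** 2)        # log of an isotropic Gaussian around the tap
    w = np.exp(scores + bias)
    return w / w.sum()                   # softmax over the biased logits
```

With equal raw scores, the token whose rendered position is closest to the tap receives the highest attention weight, which is the behavior the abstract describes at a high level.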
Simulating Errors in Touchscreen Typing
Empirical evidence shows that typing on touchscreen devices is prone to errors and that correcting them poses a major detriment to users’ performance. Design of text entry systems that better serve users, across their broad capability range, necessitates understanding the cognitive mechanisms that underpin these errors. However, prior models of typing cover only motor slips. The paper reports on extending the scope of computational modeling of typing to cover the cognitive mechanisms behind the three main types of error: slips (inaccurate execution), lapses (forgetting), and mistakes (incorrect knowledge). Given a phrase, a keyboard, and user parameters, Typoist simulates eye and finger movements while making human-like insertion, omission, substitution, and transposition errors. Its main technical contribution is the formulation of a supervisory control problem wherein the controller allocates cognitive resources to detect and fix errors generated by the various mechanisms. The model generates predictions of typing performance that can inform the design of better text entry systems.
2025 · Danqing Shi et al. · Aalto University · CHI · Topics: Force Feedback & Pseudo-Haptic Weight; Computational Methods in HCI
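The four error types named above (insertion, omission, substitution, transposition) can be sketched as a single-pass error injector. This toy sampler is not Typoist's supervisory controller; the per-character error rate, the toy key-adjacency table, and the choice of duplicating a character for insertions are all assumptions for illustration.

```python
import random

# Toy adjacency table for substitution slips (a real model would use the
# full keyboard geometry).
NEIGHBOURS = {"a": "qsz", "e": "wrd", "o": "ipl", "t": "ryg"}

def inject_errors(phrase, rate=0.05, rng=random):
    """Return `phrase` with human-like typing errors injected at `rate`."""
    out = []
    chars = list(phrase)
    i = 0
    while i < len(chars):
        c = chars[i]
        r = rng.random()
        if r < rate / 4:                                  # omission: drop the char
            pass
        elif r < rate / 2:                                # insertion: duplicate it
            out += [c, c]
        elif r < 3 * rate / 4 and c in NEIGHBOURS:        # substitution: neighbour slip
            out.append(rng.choice(NEIGHBOURS[c]))
        elif r < rate and i + 1 < len(chars):             # transposition: swap the pair
            out += [chars[i + 1], c]
            i += 1
        else:                                             # correct keystroke
            out.append(c)
        i += 1
    return "".join(out)
```

At `rate=0.0` the phrase passes through unchanged, which makes the injector easy to sanity-check before wiring it into a larger simulation.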
SkipWriter: LLM-Powered Abbreviated Writing on Tablets
Large Language Models (LLMs) may offer transformative opportunities for text input, especially for physically demanding modalities like handwriting. We studied a form of abbreviated handwriting by designing, developing, and evaluating a prototype, named SkipWriter, that converts handwritten strokes of a variable-length prefix-based abbreviation (e.g., "ho a y" as handwritten strokes) into the intended full phrase (e.g., "how are you" in the digital format) based on the preceding context. SkipWriter consists of an in-production handwriting recognizer and an LLM fine-tuned on this task. With flexible pen input, SkipWriter allows the user to add and revise prefix strokes when predictions do not match the user's intent. A user evaluation demonstrated a 60% reduction in motor movements with an average speed of 25.78 WPM. We also showed that this reduction is close to the ceiling of our model in an offline simulation.
2024 · Zheer Xu et al. · UIST · Topics: Human-LLM Collaboration; Motor Impairment Assistive Input Technologies; Knowledge Worker Tools & Workflows
Can Capacitive Touch Images Enhance Mobile Keyboard Decoding?
Capacitive touch sensors capture the two-dimensional spatial profile (referred to as a touch heatmap) of a finger's contact with a mobile touchscreen. However, the research and design of touchscreen mobile keyboards, one of the most speed- and accuracy-demanding touch interfaces, has focused on the location of the touch centroid derived from the touch image heatmap as the input, discarding the rest of the raw spatial signals. In this paper, we investigate whether touch heatmaps can be leveraged to further improve the tap decoding accuracy for mobile touchscreen keyboards. Specifically, we developed and evaluated machine-learning models that interpret user taps by using the centroids and/or the heatmaps as their input and studied the contribution of the heatmaps to model performance. The results show that adding the heatmap into the input feature set led to a 21.4% relative reduction of character error rates on average, compared to using the centroid alone. Furthermore, we conducted a live user study with the centroid-based and heatmap-based decoders built into Pixel 6 Pro devices and observed a lower error rate, faster typing speed, and higher self-reported satisfaction scores with the heatmap-based decoder than with the centroid-based decoder. These findings underline the promise of utilizing touch heatmaps for improving the typing experience on mobile keyboards.
2024 · Piyawat Lertvittayakumjorn et al. · UIST · Topics: Vibrotactile Feedback & Skin Stimulation; Knowledge Worker Tools & Workflows
CRTypist: Simulating Touchscreen Typing Behavior via Computational Rationality
Touchscreen typing requires coordinating the fingers and visual attention for button-pressing, proofreading, and error correction. Computational models need to account for the associated fast pace, coordination issues, and closed-loop nature of this control problem, which is further complicated by the immense variety of keyboards and users. The paper introduces CRTypist, which generates human-like typing behavior. Its key feature is a reformulation of the supervisory control problem, with the visual attention and motor system being controlled with reference to a working memory representation tracking the text typed thus far. Movement policy is assumed to asymptotically approach optimal performance in line with cognitive and design-related bounds. This flexible model works directly from pixels, without requiring hand-crafted feature engineering for keyboards. It aligns with human data in terms of movements and performance, covers individual differences, and can generalize to diverse keyboard designs. Though limited to skilled typists, the model generates useful estimates of the typing performance achievable under various conditions.
2024 · Danqing Shi et al. · Aalto University · CHI · Topics: Knowledge Worker Tools & Workflows; Computational Methods in HCI
Rambler: Supporting Writing With Speech via LLM-Assisted Gist Manipulation
Dictation enables efficient text input on mobile devices. However, writing with speech can produce disfluent, wordy, and incoherent text and thus requires heavy post-processing. This paper presents Rambler, an LLM-powered graphical user interface that supports gist-level manipulation of dictated text with two main sets of functions: gist extraction and macro revision. Gist extraction generates keywords and summaries as anchors to support the review and interaction with spoken text. LLM-assisted macro revisions allow users to respeak, split, merge, and transform dictated text without specifying precise editing locations. Together they pave the way for interactive dictation and revision that help close gaps between spontaneously spoken words and well-structured writing. In a comparative study with 12 participants performing verbal composition tasks, Rambler outperformed the baseline of a speech-to-text editor + ChatGPT, as it better facilitates iterative revisions with enhanced user control over the content while supporting surprisingly diverse user strategies.
2024 · Susan Lin et al. · UC Berkeley · CHI · Topics: Voice User Interface (VUI) Design; Generative AI (Text, Image, Music, Video); Human-LLM Collaboration
C-PAK: Correcting and Completing Variable-length Prefix-based Abbreviated Keystrokes
Improving keystroke savings is a long-term goal of text input research. We present a study into the design space of an abbreviated style of text input called C-PAK (Correcting and completing variable-length Prefix-based Abbreviated Keystrokes) for text entry on mobile devices. Given a variable-length and potentially inaccurate input string (e.g., 'li g t m'), C-PAK aims to expand it into a complete phrase (e.g., 'looks good to me'). We develop a C-PAK prototype keyboard, PhraseWriter, based on a current state-of-the-art mobile keyboard consisting of 1.3 million n-grams and 164,000 words. Using computational simulations on a large dataset of realistic input text, we found that, in comparison to conventional single-word suggestions, PhraseWriter improves the maximum keystroke savings rate by 6.7% (from 46.3% to 49.4%), reduces the word error rate by 14.7%, and is particularly advantageous for common phrases. We conducted a lab study of novice user behavior and performance which found that users could quickly utilize the C-PAK style abbreviations implemented in PhraseWriter, achieving a higher keystroke savings rate than forward suggestions (25% vs. 16%). Furthermore, they intuitively and successfully abbreviated more with common phrases. However, users had a lower overall text entry rate due to their limited experience with the system (28.5 words per minute vs. 37.7). We outline future technical directions to improve C-PAK over the PhraseWriter baseline, and further opportunities to study the perceptual, cognitive, and physical action trade-offs that underlie the learning curve of C-PAK systems.
2023 · Tianshi Li et al. · UIST · Topics: Human-LLM Collaboration; AI-Assisted Decision-Making & Automation
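The expand-and-correct behavior in the 'li g t m' example can be sketched with a toy matcher: each abbreviation token must be a prefix of the corresponding word, tolerating one slipped character, and candidates are ranked by frequency. The phrase table, counts, and one-character tolerance are invented for illustration; PhraseWriter's actual decoder uses a full n-gram language model.

```python
# Hypothetical phrase-frequency table (counts are made up).
PHRASES = {
    "looks good to me": 120,
    "let me know": 300,
    "good to know": 90,
    "looks great to me": 15,
}

def token_matches(token, word):
    """True if `token` is a prefix of `word`, or a prefix with at most one
    slipped character (the 'correcting' half of C-PAK, crudely)."""
    if word.startswith(token):
        return True
    return (len(token) > 1 and len(word) >= len(token)
            and sum(a != b for a, b in zip(token, word)) <= 1)

def expand(abbrev):
    """Rank phrases whose words match the abbreviation tokens, by frequency."""
    tokens = abbrev.split()
    matches = [(count, phrase) for phrase, count in PHRASES.items()
               if len(phrase.split()) == len(tokens)
               and all(token_matches(t, w)
                       for t, w in zip(tokens, phrase.split()))]
    return [p for _, p in sorted(matches, reverse=True)]
```

Here 'li g t m' still recovers 'looks good to me' even though 'li' is not a true prefix of 'looks', mirroring the inaccurate-input case in the abstract.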
TouchType-GAN: Modeling Touch Typing with Generative Adversarial Network
Models that can generate touch typing tasks are important to the development of touch typing keyboards. We propose TouchType-GAN, a Conditional Generative Adversarial Network that can simulate locations and time stamps of touch points in touch typing. TouchType-GAN takes arbitrary text as input to generate realistic touch typing both spatially (i.e., (x, y) coordinates of touch points) and temporally (i.e., timestamps of touch points). TouchType-GAN introduces a variational generator that estimates Gaussian Distributions for every target letter to prevent mode collapse. Our experiments on a dataset with 3k typed sentences show that TouchType-GAN outperforms existing touch typing models, including the Rotational Dual Gaussian model for simulating the distribution of touch points, and the Finger-Fitts Euclidean Model for simulating typing time. Overall, our research demonstrates that the proposed GAN structure can learn the distribution of user typed touch points, and the resulting TouchType-GAN can also estimate typing movements. TouchType-GAN can serve as a valuable tool for designing and evaluating touch typing input systems.
2023 · Jeremy Chu et al. · UIST · Topics: Force Feedback & Pseudo-Haptic Weight; Human-LLM Collaboration
WordGesture-GAN: Modeling Word-Gesture Movement with Generative Adversarial Network
Word-gesture production models that can synthesize word-gestures are critical to the training and evaluation of word-gesture keyboard decoders. We propose WordGesture-GAN, a conditional generative adversarial network that takes arbitrary text as input to generate realistic word-gesture movements in both spatial (i.e., (x, y) coordinates of touch points) and temporal (i.e., timestamps of touch points) dimensions. WordGesture-GAN introduces a Variational Auto-Encoder to extract and embed variations of user-drawn gestures into a Gaussian distribution which can be sampled to control variation in generated gestures. Our experiments on a dataset with 38k gesture samples show that WordGesture-GAN outperforms existing gesture production models including the minimum jerk model [37] and the style-transfer GAN [31, 32] in generating realistic gestures. Overall, our research demonstrates that the proposed GAN structure can learn variations in user-drawn gestures, and the resulting WordGesture-GAN can generate word-gesture movement and predict the distribution of gestures. WordGesture-GAN can serve as a valuable tool for designing and evaluating gestural input systems.
2023 · Jeremy Chu et al. · Stony Brook University · CHI · Topics: Hand Gesture Recognition; Human-LLM Collaboration; Creative Coding & Computational Art
EyeSayCorrect: Eye Gaze and Voice Based Hands-free Text Correction for Mobile Devices
Text correction on mobile devices usually requires precise and repetitive manual control. In this paper, we present EyeSayCorrect, an eye gaze and voice based hands-free text correction method for mobile devices. To correct text with EyeSayCorrect, the user first utilizes the gaze point on the screen to select a word, then speaks the new phrase. EyeSayCorrect would then infer the user's correction intention based on the inputs and the text context. EyeSayCorrect can accommodate ambiguities and noisy input signals. We used a Bayesian approach for determining the selected word given an eye-gaze trajectory. Given each sampling point in an eye-gaze trajectory, the posterior probability of selecting a word is calculated and accumulated. The target word would be selected when its accumulated interest is larger than a threshold. The misspelled words have higher priors. Our evaluation showed that EyeSayCorrect can correct text with promising performance. The mean ± 95% CI of the task completion time (in seconds) with priors is 11.63 ± 1.07 for large font size (28 pt) and 11.57 ± 1.14 for small font size (14 pt). Using priors for misspelled words reduced the task-completion time of large text by 9.26% and small text by 23.79%, and it reduced the text-selecting time of large text by 23.49% and small text by 40.35%. The subjective ratings are also in favor of the method with priors for misspelled words. Overall, EyeSayCorrect utilizes the advantages of eye gaze and voice input, making hands-free text correction available and efficient on mobile devices.
2022 · Maozheng Zhao et al. · IUI · Topics: Eye Tracking & Gaze Interaction; Voice User Interface (VUI) Design
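The accumulate-until-threshold selection described above can be sketched directly: each gaze sample contributes prior-weighted posterior mass to every candidate word, and a word is selected once its total crosses a threshold. The Gaussian likelihood, threshold, and prior boost for misspelled words are illustrative values, not the paper's parameters.

```python
import math

def select_word(gaze_points, word_centres, misspelled, sigma=30.0,
                threshold=5.0, prior_boost=3.0):
    """Accumulate per-word posterior evidence over a gaze trajectory.

    gaze_points: iterable of (x, y) gaze samples.
    word_centres: {word: (x, y)} on-screen centre of each candidate word.
    misspelled: set of words given a higher prior.
    Returns the first word whose accumulated score exceeds `threshold`.
    """
    score = {w: 0.0 for w in word_centres}
    for gx, gy in gaze_points:
        # Likelihood of this sample under an isotropic Gaussian per word.
        lik = {w: math.exp(-((gx - x) ** 2 + (gy - y) ** 2) / (2 * sigma ** 2))
               for w, (x, y) in word_centres.items()}
        total = sum(lik.values()) or 1.0
        for w in lik:
            prior = prior_boost if w in misspelled else 1.0
            score[w] += prior * lik[w] / total   # accumulate posterior mass
            if score[w] > threshold:
                return w
    return None
```

Because misspelled words carry a larger prior, they cross the threshold after fewer gaze samples, which is how priors shorten selection time in the evaluation above.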
TypeAnywhere: A QWERTY-Based Text Entry Solution for Ubiquitous Computing
We present a QWERTY-based text entry system, TypeAnywhere, for use in off-desktop computing environments. Using a wearable device that can detect finger taps, users can leverage their touch-typing skills from physical keyboards to perform text entry on any surface. TypeAnywhere decodes typing sequences based only on finger-tap sequences without relying on tap locations. To achieve optimal decoding performance, we trained a neural language model and achieved a 1.6% character error rate (CER) in an offline evaluation, compared to a 5.3% CER from a traditional n-gram language model. Our user study showed that participants achieved an average performance of 70.6 WPM, or 80.4% of their physical keyboard speed, and 1.50% CER after 2.5 hours of practice over five days on a table surface. They also achieved 43.9 WPM and 1.37% CER when typing on their laps. Our results demonstrate the strong potential of QWERTY typing as a ubiquitous text entry solution.
2022 · Mingrui Ray Zhang et al. · University of Washington · CHI · Topics: Haptic Wearables; Voice User Interface (VUI) Design; Intelligent Voice Assistants (Alexa, Siri, etc.)
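Decoding from finger identity alone, without tap locations, can be illustrated with a toy dictionary decoder: under standard touch typing, each letter maps to one finger, so a word is recoverable from its finger sequence whenever the language model can break ties. The finger grouping and word frequencies below are assumptions; TypeAnywhere itself uses a neural language model rather than a lookup table.

```python
# Standard touch-typing finger assignment (left pinky .. right pinky,
# collapsed into 8 finger classes for the letter keys).
FINGER = {}
for f, letters in enumerate(
        ["qaz", "wsx", "edc", "rfvtgb", "yhnujm", "ik", "ol", "p"]):
    for ch in letters:
        FINGER[ch] = f

# Toy unigram "language model" (counts are made up).
WORD_FREQ = {"the": 500, "rue": 3, "thy": 20, "ten": 80}

def decode(finger_seq):
    """Candidate words whose finger sequence matches, most frequent first."""
    cands = [(n, w) for w, n in WORD_FREQ.items()
             if len(w) == len(finger_seq)
             and all(FINGER[c] == f for c, f in zip(w, finger_seq))]
    return [w for _, w in sorted(cands, reverse=True)]
```

Note that 'the' and 'rue' share the finger sequence [3, 4, 2], so the decoder must fall back on word frequency; this ambiguity is exactly why a strong language model matters when locations are discarded.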
Modeling Touch Point Distribution with Rotational Dual Gaussian Model
Touch point distribution models are important tools for designing touchscreen interfaces. In this paper, we investigate how the finger movement direction affects the touch point distribution, and how to account for it in modeling. We propose the Rotational Dual Gaussian model, a refinement and generalization of the Dual Gaussian model, to account for the finger movement direction in predicting touch point distribution. In this model, the major axis of the prediction ellipse of the touch point distribution is along the finger movement direction, and the minor axis is perpendicular to the finger movement direction. We also propose using projected target width and height, in lieu of nominal target width and height, to model touch point distribution. Evaluation on three empirical datasets shows that the new model reflects the observation that the touch point distribution is elongated along the finger movement direction, and outperforms the original Dual Gaussian Model in all prediction tests. Compared with the original Dual Gaussian model, the Rotational Dual Gaussian model reduces the RMSE of touch error rate prediction from 8.49% to 4.95%, and more accurately predicts the touch point distribution in target acquisition. Using the Rotational Dual Gaussian model can also improve the soft keyboard decoding accuracy on smartwatches.
2021 · Yan Ma et al. · UIST · Topics: Hand Gesture Recognition; Eye Tracking & Gaze Interaction
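The geometric core of the model, an ellipse whose major axis follows the finger movement direction, amounts to rotating a diagonal covariance matrix. The sketch below shows that construction; the sigma values are illustrative, and the model's projected-width refinements are omitted.

```python
import numpy as np

def touch_covariance(move_dir_deg, sigma_major=6.0, sigma_minor=3.0):
    """Covariance of the predicted touch distribution, with the major axis
    aligned to the finger movement direction (angle in degrees)."""
    t = np.deg2rad(move_dir_deg)
    R = np.array([[np.cos(t), -np.sin(t)],
                  [np.sin(t),  np.cos(t)]])        # rotation into movement frame
    S = np.diag([sigma_major ** 2, sigma_minor ** 2])
    return R @ S @ R.T

def sample_touches(target_xy, move_dir_deg, n=1000, seed=0):
    """Draw simulated touch points around a target for a given approach angle."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(target_xy, touch_covariance(move_dir_deg), n)
```

Sampling with a horizontal approach direction produces a point cloud visibly elongated along x, matching the elongation the datasets show along the movement direction.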
Voice and Touch Based Error-tolerant Multimodal Text Editing and Correction for Smartphones
Editing operations such as cut, copy, and paste, as well as correcting errors in typed text, are often tedious and challenging to perform on smartphones. In this paper, we present VT, a voice and touch-based multimodal text editing and correction method for smartphones. To edit text with VT, the user glides over a text fragment with a finger and dictates a command, such as "bold" to change the format of the fragment, or the user can tap inside a text area and speak a command such as "highlight this paragraph" to edit the text. For text correction, the user taps approximately at the area of the erroneous text fragment and dictates the new content for substitution or insertion. VT combines touch and voice inputs with language context, such as a language model and phrase similarity, to infer a user's editing intention, which can handle ambiguities and noisy input signals. This is a major advantage over existing error correction methods (e.g., iOS’s Voice Control), which require precise cursor control or text selection. Our evaluation shows that VT significantly improves the efficiency of text editing and text correction on smartphones over the touch-only method and the iOS’s Voice Control method. Our user studies showed that VT reduced the text editing time by 30.80% and text correction time by 29.97% over the touch-only method, and reduced the text editing time by 30.81% and text correction time by 47.96% over the iOS’s Voice Control method.
2021 · Maozheng Zhao et al. · UIST · Topics: Voice User Interface (VUI) Design; Explainable AI (XAI)
PhraseFlow: Designs and Empirical Studies of Phrase-Level Input
Decoding at the phrase level may afford higher correction accuracy than at the word level, according to previous research. However, how phrase-level input affects user typing behavior, and how to design the interaction to make it practical, remain underexplored. We present PhraseFlow, a phrase-level input keyboard that is able to correct previous text based on the subsequently input sequences. Computational studies show that phrase-level input reduces the error rate of autocorrection by over 16%. We found that phrase-level input introduced extra cognitive load to the user that hindered their performance. Through an iterative design-implement-research process, we optimized the design of PhraseFlow to alleviate the cognitive load. An in-lab study shows that users could adopt PhraseFlow quickly, resulting in 19% fewer errors without losing speed. In real-life settings, we conducted a six-day deployment study with 42 participants, showing that 78.6% of the users would like to have the phrase-level input feature in future keyboards.
2021 · Mingrui Ray Zhang et al. · University of Washington · CHI · Topics: Generative AI (Text, Image, Music, Video); Human-LLM Collaboration; AI-Assisted Creative Writing
TapNet: The Design, Training, Implementation, and Applications of a Multi-Task Learning CNN for Off-Screen Mobile Input
To make off-screen interaction without specialized hardware practical, we investigate using deep learning methods to process the common built-in IMU sensor (accelerometers and gyroscopes) on mobile phones into a useful set of one-handed interaction events. We present the design, training, implementation and applications of TapNet, a multi-task network that detects tapping on the smartphone. With phone form factor as auxiliary information, TapNet can jointly learn from data across devices and simultaneously recognize multiple tap properties, including tap direction and tap location. We developed two datasets consisting of over 135K training samples, 38K testing samples, and 32 participants in total. Experimental evaluation demonstrated the effectiveness of the TapNet design and its significant improvement over the state of the art. Along with the datasets, codebase, and extensive experiments, TapNet establishes a new technical foundation for off-screen mobile input.
2021 · Michael Xuelin Huang et al. · Google · CHI · Topics: Foot & Wrist Interaction; Computational Methods in HCI
Modeling Two Dimensional Touch Pointing
Modeling touch pointing is essential to touchscreen interface development and research, as pointing is one of the most basic and common touch actions users perform on touchscreen devices. Finger-Fitts Law [4] revised the conventional Fitts’ law into a 1D (one-dimensional) pointing model for finger touch by explicitly accounting for the fat finger ambiguity (absolute error) problem which was unaccounted for in the original Fitts’ law. We generalize Finger-Fitts law to 2D touch pointing by solving two critical problems. First, we extend two of the most successful 2D Fitts law forms to accommodate finger ambiguity. Second, we discovered that using nominal target width and height is a conceptually simple yet effective approach for defining amplitude and directional constraints for 2D touch pointing across different movement directions. The evaluation shows our derived 2D Finger-Fitts law models can be both principled and powerful. Specifically, they outperformed the existing 2D Fitts’ laws, as measured by the regression coefficient and model selection information criteria (e.g., Akaike Information Criterion) considering the number of parameters. Finally, 2D Finger-Fitts laws also advance our understanding of touch pointing and thereby serve as the basis for touch interface designs.
2020 · Yu-Jung Ko et al. · UIST · Topics: Force Feedback & Pseudo-Haptic Weight; Hand Gesture Recognition; Prototyping & User Testing
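The general shape of a 2D Fitts-law fit can be sketched with one of the classic forms the paper builds on, the weighted-Euclidean index of difficulty, fitted by least squares. This is the baseline form, not the paper's finger-ambiguity-adjusted models, and the eta weight and data below are illustrative.

```python
import numpy as np

def index_of_difficulty(A, W, H, eta=0.5):
    """Weighted-Euclidean 2D index of difficulty (bits) for amplitude A and
    target width W / height H. The Finger-Fitts variants further correct
    W and H for touch ambiguity, which is omitted here."""
    return np.log2(np.sqrt((A / W) ** 2 + eta * (A / H) ** 2) + 1)

def fit_fitts(A, W, H, MT, eta=0.5):
    """Least-squares fit of MT = a + b * ID; returns intercept a and slope b."""
    ID = index_of_difficulty(np.asarray(A, float), np.asarray(W, float),
                             np.asarray(H, float), eta)
    X = np.column_stack([np.ones_like(ID), ID])
    (a, b), *_ = np.linalg.lstsq(X, np.asarray(MT, float), rcond=None)
    return a, b
```

The regression coefficient of such a fit is one of the model-comparison measures the evaluation above refers to.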
Text Entry Throughput: Towards Unifying Speed and Accuracy in a Single Performance Metric
Human-computer input performance inherently involves speed-accuracy tradeoffs---the faster users act, the more inaccurate those actions are. Therefore, comparing speeds and accuracies separately can result in ambiguous outcomes: Does a fast but inaccurate technique perform better or worse overall than a slow but accurate one? For pointing, speed and accuracy have been unified for over 60 years as throughput (bits/s) (Crossman 1957, Welford 1968), but to date, no similar metric has been established for text entry. In this paper, we introduce a text entry method-independent throughput metric based on Shannon information theory (1948). To explore the practical usability of the metric, we conducted an experiment in which 16 participants typed with a laptop keyboard using different cognitive sets, i.e., speed-accuracy biases. Our results show that as a performance metric, text entry throughput remains relatively stable under different speed-accuracy conditions. We also evaluated a smartphone keyboard with 12 participants, finding that throughput varied least compared to other text entry metrics. This work allows researchers to characterize text entry performance with a single unified measure of input efficiency.
2019 · Mingrui Ray Zhang et al. · University of Washington · CHI · Topics: User Research Methods (Interviews, Surveys, Observation); Computational Methods in HCI
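The information-theoretic idea behind such a metric can be sketched as mutual information per unit time: estimate how many bits each transcribed character conveys about the presented character, then scale by entry speed. This is a simplified illustration of the Shannon framing, not the paper's exact metric or alignment procedure.

```python
import math
from collections import Counter

def throughput(pairs, seconds):
    """Estimate throughput (bits/s) from aligned (presented, transcribed)
    character pairs and the total entry time in seconds."""
    n = len(pairs)
    joint = Counter(pairs)                      # empirical joint distribution
    px = Counter(p for p, _ in pairs)           # presented-character marginal
    py = Counter(t for _, t in pairs)           # transcribed-character marginal
    # Mutual information I(X;Y) = sum p(x,y) * log2(p(x,y) / (p(x) p(y))).
    mi = sum((c / n) * math.log2(c * n / (px[p] * py[t]))
             for (p, t), c in joint.items())
    return mi * n / seconds                     # bits/char * chars/sec
```

Perfect transcription transmits the full entropy of the presented text, while errors shrink the mutual information, so fast-but-sloppy and slow-but-accurate entry are penalized on the same scale.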
Touchscreen Haptic Augmentation Effects on Tapping, Drag and Drop, and Path Following
We study the effects of haptic augmentation on tapping, path following, and drag & drop tasks based on a recent flagship smartphone with refined touch sensing and haptic actuator technologies. Results show actuated haptic confirmation on tapping targets was subjectively appreciated by some users but did not improve tapping speed or accuracy. For drag & drop, a clear performance improvement was measured when haptic feedback was applied to target boundary crossing, particularly when the targets were small. For path following tasks, virtual haptic feedback improved accuracy at a reduced speed in a sitting condition. Stronger results were achieved in a physical haptic mock-up. Overall, we found actuated touchscreen haptic feedback particularly effective when the touched object was visually occluded by the finger. Participants’ subjective experience of haptic feedback in all tasks tended to be more positive than their time or accuracy performance suggests. We compare and discuss these findings with previous results on early generations of devices. The work provides an empirical foundation for product design and future research of touch input and haptic systems.
2019 · Mitchell L. Gordon et al. · Stanford University · CHI · Topics: Vibrotactile Feedback & Skin Stimulation
i'sFree: Eyes-Free Gesture Typing via a Touch-Enabled Remote Control
Entering text without having to pay attention to the keyboard is compelling but challenging due to the lack of visual guidance. We propose i'sFree to enable eyes-free gesture typing on a distant display from a touch-enabled remote control. i'sFree does not display the keyboard or gesture trace but decodes gestures drawn on the remote control into text according to an invisible and shifting Qwerty layout. i'sFree decodes gestures similar to a general gesture typing decoder, but learns from the instantaneous and historical input gestures to dynamically adjust the keyboard location. We designed it based on the understanding of how users perform eyes-free gesture typing. Our evaluation shows eyes-free gesture typing is feasible: reducing visual guidance on the distant display hardly affects the typing speed. Results also show that the i'sFree gesture decoding algorithm is effective, enabling an input speed of 23 WPM, 46% faster than the baseline eyes-free condition built on a general gesture decoder. Finally, i'sFree is easy to learn: participants reached 22 WPM in the first ten minutes, even though 40% of them were first-time gesture typing users.
2019 · Suwen Zhu et al. · Stony Brook University · CHI · Topics: Hand Gesture Recognition; Foot & Wrist Interaction
Active Edge: Designing Squeeze Gestures for the Google Pixel 2
Active Edge is a feature of Google Pixel 2 smartphone devices that creates a force-sensitive interaction surface along their sides, allowing users to perform gestures by holding and squeezing their device. Supported by strain gauge elements adhered to the inner sidewalls of the device chassis, these gestures can be more natural and ergonomic than on-screen (touch) counterparts. Developing these interactions is an integration of several components: (1) an insight and understanding of the user experiences that benefit from squeeze gestures; (2) hardware with the sensitivity and reliability to sense a user's squeeze in any operating environment; (3) a gesture design that discriminates intentional squeezes from innocuous handling; and (4) an interaction design to promote a discoverable and satisfying user experience. This paper describes the design and evaluation of Active Edge in these areas as part of the product's development and engineering.
2019 · Philip Quinn et al. · Google · CHI · Topics: Force Feedback & Pseudo-Haptic Weight