Paper Title

µCap: Instrumental Music Captions for Deaf and Hard-of-Hearing Individuals

Publication Info

Topic area: Accessibility solutions for instrumental music via automated captioning.
Keywords: instrumental music, captions, Deaf and Hard-of-Hearing (DHH), accessibility, Retrieval-Augmented Generation (RAG), phonetic transcription, music visualization, human-computer interaction (HCI), audio-to-text, music accessibility.

Background and Problem

Problem / challenge: Instrumental music remains largely inaccessible to Deaf and Hard-of-Hearing (DHH) individuals due to the absence of effective captioning systems. Current captioning approaches for music focus on lyrics or vague descriptions, leaving instrumental music uncaptioned or minimally represented.
Significance: Making instrumental music accessible can enhance cultural participation and emotional connection for DHH audiences, addressing a significant gap in media accessibility.
Motivation and related work: Prior research has explored visual and haptic representations of sound (e.g., graphical displays, vibrations) and affective captioning for speech, but it has not addressed the unique challenges of representing instrumental music. Existing systems cannot generate non-semantic, sound-mimetic captions that align with musical flow.

Solution

Proposed approach: µCap (Music Captions), an automated system that generates phonetic-like, time-synchronized textual captions for instrumental music, augmented with visual representations.
Novelty:
1. Introduction of a phonetic captioning schema for instrumental music based on audio features and linguistic insights.
2. Integration of Retrieval-Augmented Generation (RAG) to enhance caption quality using a curated dataset of music-text pairs.
3. Development of expert-informed guidelines for mapping musical features to textual representations.
4. Empirical validation of the system’s impact on music appreciation and immersion for DHH participants.
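The retrieval idea in item 2 can be illustrated with a minimal sketch. It assumes music-text pairs are stored alongside embedding vectors and ranked by cosine similarity; the summary does not state which similarity measure or vector database µCap actually uses. The function names, the three-dimensional vectors, the descriptions, and the romanized captions below are all invented placeholders, not data from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, store, k=2):
    """Return the k stored pairs whose embeddings are closest to query_vec."""
    ranked = sorted(store, key=lambda pair: cosine(query_vec, pair["vec"]), reverse=True)
    return ranked[:k]


def build_prompt(description, examples):
    """Assemble a few-shot prompt from the retrieved music-text pairs."""
    lines = ["Write a sound-mimetic caption for the clip described below.", "Examples:"]
    for ex in examples:
        lines.append(f"- {ex['description']} -> {ex['caption']}")
    lines.append(f"Clip: {description}")
    return "\n".join(lines)


# Hypothetical store: every vector, description, and caption is a placeholder.
STORE = [
    {"vec": [0.9, 0.1, 0.0], "description": "slow low cello", "caption": "doooom"},
    {"vec": [0.1, 0.9, 0.1], "description": "fast high violin", "caption": "ti-ti-ti"},
    {"vec": [0.0, 0.2, 0.9], "description": "soft piano chord", "caption": "taang"},
]

top = retrieve([0.2, 0.8, 0.1], STORE, k=2)
prompt = build_prompt("fast high flute", top)
```

The assembled prompt would then be passed to the language model. In a real system the query vector would come from an audio or text embedding model rather than being written by hand.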
Procedure and key techniques:
- Audio feature extraction (e.g., pitch, volume, onset strength).
- Instrument classification using deep learning models.
- Retrieval of relevant annotations from a vector database.
- Caption generation via GPT-4o, guided by expert-derived heuristics.
- Visualization of captions with dynamic text properties (e.g., font size for volume, baseline shift for pitch).
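The last step above, mapping volume to font size and pitch to baseline shift, might look roughly like the following. This is a sketch under stated assumptions: the linear rescaling, the log-frequency treatment of pitch, the pixel ranges, and the names `StyledSyllable`, `rescale`, and `style_caption` are illustrative choices, not details reported for µCap.

```python
import math
from dataclasses import dataclass


@dataclass
class StyledSyllable:
    text: str
    start_s: float            # onset time in seconds
    font_px: float            # larger value = louder sound
    baseline_shift_px: float  # positive value = raised text = higher pitch


def rescale(value, lo, hi, out_lo, out_hi):
    """Linearly map value from [lo, hi] onto [out_lo, out_hi], clamping at the ends."""
    if hi == lo:
        return (out_lo + out_hi) / 2.0  # no variation: use the middle of the range
    t = min(max((value - lo) / (hi - lo), 0.0), 1.0)
    return out_lo + t * (out_hi - out_lo)


def style_caption(events, font_range=(14.0, 40.0), shift_range=(-12.0, 12.0)):
    """Style caption syllables from (text, start_s, volume, pitch_hz) tuples.

    Volume and pitch are normalized within the clip; pitch_hz must be positive
    because it is compared on a log2 scale (an assumption of this sketch).
    """
    volumes = [vol for _, _, vol, _ in events]
    log_pitches = [math.log2(hz) for _, _, _, hz in events]
    styled = []
    for (text, start_s, vol, _), log_hz in zip(events, log_pitches):
        styled.append(
            StyledSyllable(
                text=text,
                start_s=start_s,
                font_px=rescale(vol, min(volumes), max(volumes), *font_range),
                baseline_shift_px=rescale(
                    log_hz, min(log_pitches), max(log_pitches), *shift_range
                ),
            )
        )
    return styled
```

Normalizing within each clip keeps quiet pieces legible, at the cost of making sizes incomparable across clips; a fixed global range would be the alternative.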

Results

Concrete findings:
- µCap enhanced music appreciation for 18 out of 20 DHH participants in User Evaluation 1, with 15 reporting improved immersion and comprehension.
- Captions significantly improved rhythm perception compared to baseline methods in User Evaluation 2.
- Captions were rated higher than waveform visualizations for immersion and understanding (e.g., M=5.43 for captions vs. M=4.19 for waveforms on a 7-point scale).
Advantage over baselines:
- µCap outperformed heuristic and simpler automated methods (e.g., µCap-zero, µCap-mini) in generating more intuitive and contextually relevant captions.
- While manual captions still ranked highest, µCap showed statistically significant improvements over other automated approaches.
Experiments / evaluation:
- Two user evaluations with DHH participants (n=20 and n=15) and a supplementary evaluation with hearing participants (n=5).
- Metrics included ranking of caption quality, Likert-scale ratings for immersion and comprehension, and qualitative feedback.
- Datasets included 3,060 annotated audio clips across genres like classical and jazz.
Limitations and future work:
- Current implementation is limited to the Korean language; future work should explore multilingual extensions.
- Captions sometimes felt unnatural due to overly literal phonetic transcription.
- System evaluation was restricted to classical music; broader genre coverage is needed.
- Further refinement of RAG and dataset expansion could improve caption quality and naturalness.

Summary

µCap is an automated system designed to make instrumental music accessible to DHH individuals by generating phonetic-like, time-aligned captions augmented with visual features. The system leverages Retrieval-Augmented Generation (RAG) and expert-informed guidelines to produce intuitive and expressive captions. User evaluations demonstrated that µCap enhances music appreciation, immersion, and rhythm perception for DHH audiences, though manual captions remain the gold standard. Future work will focus on multilingual support, broader genre coverage, and improving the naturalness of captions. This research represents a significant step toward inclusive cultural participation for DHH individuals.