Hikaru Kamioka, Satoshi Maeda, Masayuki Hashimoto
Abstract: In typical LLM-based voice dialogue systems, the system waits for the user to finish speaking before processing the utterance and generating a response, which introduces a delay. We propose an approach in which the LLM generates response text for an incomplete utterance, without waiting for the user to finish speaking. This allows response generation to start earlier, potentially reducing delay. However, despite their predictive capabilities, LLMs do not always produce appropriate responses to incomplete utterances. If voice data acquisition is terminated too early, the likelihood of an inappropriate response increases; if termination is delayed, the latency reduction diminishes. It is therefore crucial to identify the optimal timing for terminating voice data acquisition. To determine when to stop audio capture and initiate response generation, we use changes in the Sentence-BERT embedding representation of the dialogue history up to that point. Specifically, we investigate changes in the similarity measure S_t(X_0, X_t) between the embedding X_0 at the start of the user's utterance and the embedding X_t at an intermediate point t during the utterance. These changes in similarity appear to correlate with the validity of the LLM's responses. We therefore propose several methods that determine the cutoff point based on these changes and evaluate their effectiveness in simulation experiments on Japanese conversations. As a result, we demonstrate that our proposed method can stop audio capture 14.6 characters before the end of the user's utterance while achieving a 0.80 probability of generating a valid response; this reduction of 14.6 characters corresponds to approximately 2.4 seconds of Japanese speech.
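The cutoff criterion described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a simple thresholding rule on the cosine similarity S_t(X_0, X_t), where the threshold value is a placeholder and the embeddings stand in for Sentence-BERT embeddings of the dialogue history at the utterance start (X_0) and at each intermediate point t (X_t).

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_cutoff(embeddings, threshold=0.6):
    """Return the first intermediate index t at which S_t(X_0, X_t)
    falls below `threshold`, taken here (as an illustrative rule) as the
    point where enough new content has arrived to stop audio capture and
    start response generation. Returns None if no such point is found.

    embeddings[0] plays the role of X_0; embeddings[t] of X_t.
    The threshold of 0.6 is an arbitrary placeholder, not a value
    from the paper.
    """
    x0 = embeddings[0]
    for t, xt in enumerate(embeddings[1:], start=1):
        if cosine_similarity(x0, xt) < threshold:
            return t
    return None

# Toy 2-D "embeddings" that gradually diverge from the initial state X_0.
toy = [np.array([1.0, 0.0]),   # X_0
       np.array([0.9, 0.1]),   # X_1: still very similar
       np.array([0.5, 0.5]),   # X_2: similarity ~0.71
       np.array([0.1, 0.9])]   # X_3: similarity ~0.11, below threshold
print(find_cutoff(toy))  # → 3
```

In a real system the loop would run incrementally as ASR partial results arrive, re-embedding the dialogue history at each step; the thresholding rule is only one of the change-based criteria the abstract alludes to.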
Keywords: Response delay, Dialogue systems, Large Language Models.
Date Published: December 10, 2024 DOI: 10.11159/jmids.2024.017