AAD-LLM: Neural Attention-Driven Auditory Scene Understanding

Xilin Jiang^1★, Sukru Samet Dindar^1★, Vishal Choudhari^1★, Stephan Bickel^3,4, Ashesh Mehta^3,4, Guy M McKhann^1, Daniel Friedman^2, Adeen Flinker^2, Nima Mesgarani^1★

First two authors contribute equally.


Abstract: Auditory foundation models, including auditory large language models (LLMs), process all sound inputs equally, independent of listener perception. However, human auditory perception is inherently selective: listeners focus on specific speakers while ignoring others in complex auditory scenes. Existing models do not incorporate this selectivity, limiting their ability to generate perception-aligned responses. To address this, we introduce Intention-Informed Auditory Scene Understanding (II-ASU) and present Auditory Attention-Driven LLM (AAD-LLM), a prototype system that integrates brain signals to infer listener attention. AAD-LLM extends an auditory LLM by incorporating intracranial electroencephalography (iEEG) recordings to decode which speaker a listener is attending to and refine responses accordingly. The model first predicts the attended speaker from neural activity, then conditions response generation on this inferred attentional state. We evaluate AAD-LLM on speaker description, speech transcription and extraction, and question answering in multitalker scenarios, with both objective and subjective ratings showing improved alignment with listener intention. By taking a first step toward intention-aware auditory AI, this work explores a new paradigm where listener perception informs machine listening, paving the way for future listener-centered auditory systems.

In the clinical setting, the listener's brain signal is used to decode the attended speaker. We show example responses of AAD-LLM compared with responses of SALMONN and Qwen2-Audio. The Oracle Answer is the response of a finetuned Qwen2-Audio trained and evaluated on the oracle (ground-truth attended/unattended) speaker, representing the performance upper bound.
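The two-stage idea described above (decode the attended speaker first, then condition the response on it) can be illustrated with a toy sketch. This is not the authors' implementation: the cosine-similarity matching, the three-dimensional vectors, the speaker names, and the functions `decode_attended` and `build_prompt` are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def decode_attended(neural_estimate, candidates):
    """Stage 1 (assumed): pick the candidate speaker whose vector is
    most similar to the speaker vector estimated from neural activity."""
    return max(candidates, key=lambda name: cosine(neural_estimate, candidates[name]))

def build_prompt(question, attended):
    """Stage 2 (assumed): condition the question on the decoded speaker.
    The prompt format here is invented for illustration."""
    return f"[attended speaker: {attended}] {question}"

# Hypothetical speaker vectors for the two talkers in a mixture.
candidates = {
    "speaker_a": [0.9, 0.1, 0.2],
    "speaker_b": [0.1, 0.8, 0.3],
}
# Hypothetical speaker vector estimated from the listener's brain signal.
neural_estimate = [0.7, 0.2, 0.1]

attended = decode_attended(neural_estimate, candidates)
prompt = build_prompt("Transcribe the attended speaker.", attended)
print(prompt)
```

In a real system the second stage would pass the decoded attentional state to the auditory LLM together with the speech mixture; the string prompt above only marks where that conditioning would enter.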

Clinical Sample 1: Female and Male

🎧 Speech Mixture

💡 Attended Speech (attention decoded from brain)

Question 1	Model Answers 1	Question 2	Model Answers 2	Question 3	Model Answers 3
Who is the speaker that the subject is paying attention to?	SALMONN: The speaker is a woman. Qwen2-Audio: The specific identity of the speaker is unknown without additional context. AAD-LLM: A female speaker with high pitch and normal tempo.	Transcribe the attended speaker.	SALMONN: We do offer sessions with our five minute personal trainers. Qwen2-Audio: um sandwich place with a five minute drive from this area. AAD-LLM: We do offer sessions with our personal trainers.	What is the unattended speaker's speech about?	SALMONN: Offering sessions with personal personal trainers. Qwen2-Audio: The unattended speaker's speech is inaudible... AAD-LLM: Tom's Sandwich Place is five minutes away.
Oracle Answer 1		Oracle Answer 2		Oracle Answer 3
A female speaker with high pitch and normal tempo.		We do offer sessions with our personal trainers.		Tom's Sandwich Place is five minutes away.

Clinical Sample 2: Female and Female

🎧 Speech Mixture

💡 Attended Speech

Question 1	Model Answers 1	Question 2	Model Answers 2	Question 3	Model Answers 3
Please write down all words spoken by the attended speaker.	SALMONN: I'm a vintage changer to my second twentieth class cost. Qwen2-Audio: Mike gave a ginger to me on my twenty-first birthday. AAD-LLM: My uncle gifted Ginger to me on my twentieth birthday.	What was the occasion for the gift?	SALMONN: The occasion for the gift is not mentioned in the given sentence. Qwen2-Audio: The occasion for the gift was Mike's birthday. AAD-LLM: The occasion was my twentieth birthday.	Can you identify the person the subject is NOT listening to?	SALMONN: No, I cannot identify the person the subject is not listening to. Qwen2-Audio: Mike. AAD-LLM: A female speaker with high pitch and normal tempo.
Oracle Answer 1		Oracle Answer 2		Oracle Answer 3
My uncle gifted ginger to me on my twentieth birthday.		The gift was given on the speaker's twentieth birthday.		A female speaker with high pitch and normal tempo.

Clinical Sample 3: Male and Male

🎧 Speech Mixture

💡 Attended Speech

Question 1	Model Answers 1	Question 2	Model Answers 2	Question 3	Model Answers 3
What is the attended speech about?	SALMONN: A high school reunion. Qwen2-Audio: The high time to start thinking about having children. AAD-LLM: Time to consider carbon neutrality options.	What phrase indicates urgency in the speaker's statement?	SALMONN: I THINK IT IS HIGH TIME THAT WE START THINKING ABOUT THE NOT TOO DISTANT FUTURE. Qwen2-Audio: The phrase "it is high time that we start thinking about" indicates urgency. AAD-LLM: The phrase "high time" indicates urgency.	In the unattended speech, on what floor does the speaker want the room?	SALMONN: The speaker wants the room on the top floor. Qwen2-Audio: The speaker wants the room on the ground floor. AAD-LLM: The ninth floor.
Oracle Answer 1		Oracle Answer 2		Oracle Answer 3
Time to consider carbon neutrality now.		The phrase "high time" indicates urgency in the speaker's statement.		The speaker wants the room on the ninth floor.

For the same-topic samples below, we replaced the background speaker with another speaker talking about the same topic as the foreground speaker. Selecting the correct speaker is therefore necessary to answer the question correctly.

Same-Topic Sample 1

🎧 Speech Mixture

💡 Attended Speech

🔊 Distractor Speech

Question	Model Answers
Listen to the attended talker: what is the address mentioned?	Qwen2-Audio: The address mentioned is 378456 apartments north avenue. AAD-LLM: Three Seven Eight three seven eight South Street, Apartment C.
Oracle Answer
The address is three seven eight, three seven eight South Street, Apartment C.

Same-Topic Sample 2

🎧 Speech Mixture

💡 Attended Speech

🔊 Distractor Speech

Question	Model Answers
Listen to the attended talker: how often does the speaker go hiking?	Qwen2-Audio: The speaker goes hiking rarely. AAD-LLM: The speaker goes hiking very often.
Oracle Answer
The speaker goes hiking very often.

Same-Topic Sample 3

🎧 Speech Mixture

💡 Attended Speech

🔊 Distractor Speech

Question	Model Answers
Listen to the attended talker: what does the speaker imply about the progress toward space travel?	Qwen2-Audio: The speaker implies that space travel remains a distant dream despite recent advancements in private space companies. AAD-LLM: The speaker implies that there has been significant progress made by private companies in making space travel more feasible.
Oracle Answer
The speaker implies that there is significant progress being made in achieving their dreams of floating in space.

AAD-LLM can also run without the brain signal. In that case, the user provides the speaker vector (label) of the target speaker. We show example responses on synthetic LibriTTS + DEMAND speech mixtures with one of the speakers chosen as the target.

LibriTTS+DEMAND Sample 1

🎧 Speech Mixture

💡 Attended Speech (by user-provided speaker label)

Question 1	Model Answers 1	Question 2	Model Answers 2	Question 3	Model Answers 3
ASR the attended speaker.	SALMONN: Somewhere however the desire for these changes must arise in the minds of individuals. The old man turned around and around carefully examined and discovered on the napkin a ring several strange red characters. Qwen2-Audio: The old man turned around, examined the ground carefully, and discovered on the napkin several strange red characters. AAD-LLM: Somewhere, however, the desire for these changes must arise in the minds of individuals.	ASR the unattended speaker.	SALMONN: Somewhere however the desire for these changes must arise in the minds of individuals. The old man turned around and around carefully examined and discovered on the napkin a ring several strange red characters. Qwen2-Audio: The old man turned to ground and round carefully examined it discovered on the napkin several strange red characters. AAD-LLM: The old man turned it round and round, carefully examined it, and discovered on the nape of its neck several strange red characters.	Describe the attended speaker.	SALMONN: The attended speaker is describing the examination of changes on the nape of the neck in several strange red characters. Somewhere however the desire for these changes must arise in the minds of individuals. Qwen2-Audio: The speaker is male, speaks English, and conveys a disgusted mood in his speech. AAD-LLM: A male speaker with normal pitch and low tempo.
Oracle Answer 1		Oracle Answer 2		Oracle Answer 3
Somewhere, however, the desire for these changes must arise in the minds of individuals.		The old man turned it round and round, carefully examined it, and discovered on the nape of its neck several strange red characters.		A male speaker with normal pitch and low tempo.

LibriTTS+DEMAND Sample 2

🎧 Speech Mixture

💡 Attended Speech

Question 1	Model Answers 1	Question 2	Model Answers 2	Question 3	Model Answers 3
What is the main subject of the attended speaker's speech?	SALMONN: Not specified in the given sentences. I shut the queen up in the bath and tried to suffocate her and the old woman had her own ugly daughter and twilight gathered that the happy family did not know together where she was. Qwen2-Audio: Not specified in the provided subtitles. AAD-LLM: Old woman hides daughter from king.	Can you provide an overview of what the unattended speaker is saying?	SALMONN: And shut the queen up in the bath and tried to suffocate her father and the guests had their own ugly daughter and twilight gathered that the happy family did not know she was alive. I swear i've been crying but the girls would not leave her so woman with her abandoned daughter in the twilight bathed and came and sat together round the fire. Qwen2-Audio: A shut up had been placed in front of the bath and tried to suffocate her and the old woman for her own ugly daughter and in the queen's bed that the king might not know she was away. AAD-LLM: Guests left as family gathered by fire.	What was the intention of shutting the queen up in the bath?	SALMONN: The intention was to suffocate the queen and her father. Qwen2-Audio: The intention was to prevent the queen from leaving, as the girls would not let her go without their father. AAD-LLM: The intention was to suffocate her.
Oracle Answer 1		Oracle Answer 2		Oracle Answer 3
Queen imprisoned, daughter hidden from king.		Guests left early; family gathered by fire.		They tried to suffocate her.

AAD-LLM may fail if the two speakers are acoustically too similar, which makes it hard to decode and select the attended speaker, as shown in the example below.
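One way to picture this failure mode is to look at how far apart the two candidates score against the decoded estimate. The sketch below is purely illustrative and makes several assumptions not taken from the paper: speakers are represented as small hand-written vectors, similarity is measured with cosine similarity, and `attention_margin` is a hypothetical helper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def attention_margin(neural_estimate, speaker_vectors):
    """Gap between the best and second-best similarity scores.
    A small gap means the two speakers are hard to tell apart."""
    scores = sorted(
        (cosine(neural_estimate, vec) for vec in speaker_vectors),
        reverse=True,
    )
    return scores[0] - scores[1]

# Hypothetical estimate decoded from neural activity.
estimate = [0.9, 0.1]

# Two acoustically distinct speakers versus two very similar ones.
distinct = [[1.0, 0.0], [0.0, 1.0]]
similar = [[1.0, 0.0], [0.99, 0.05]]

margin_distinct = attention_margin(estimate, distinct)
margin_similar = attention_margin(estimate, similar)
print(f"distinct speakers: margin = {margin_distinct:.3f}")
print(f"similar speakers:  margin = {margin_similar:.3f}")
```

With similar speakers the margin shrinks to nearly zero, so a small amount of decoding noise is enough to flip the selection. That is the situation the failure sample illustrates.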

Failure Sample

🎧 Speech Mixture

💡 Attended Speech

🔊 Distractor Speech

Question 1	Model Answers 1	Question 2	Model Answers 2	Question 3	Model Answers 3
Transcribe the words of the speaker being listened to.	SALMONN: It seems like i set the bar too high for myself this week. Qwen2-Audio: It seems like I said the bar will be there in about forty minutes. AAD-LLM: To the offer, The driver will be there in about forty minutes.	What is the attended speech about?	SALMONN: Setting the bar too high for oneself. Qwen2-Audio: Someone named Barto coming over in about forty minutes. AAD-LLM: Driver will arrive in forty minutes.	Identify the person whom the subject is listening to.	SALMONN: It is not specified in the given sentence who the subject is listening to. Qwen2-Audio: Harry. AAD-LLM: A male speaker with low pitch and high tempo. (The description is accurate, but it is true for both speakers.)
Oracle Answer 1		Oracle Answer 2		Oracle Answer 3
Seems like I set the bar too high for myself this time.		I’ve set unrealistic expectations for myself.		A male speaker with low pitch and high tempo.

Disclaimer: Within this study, approval of all ethical and experimental procedures and protocols was granted by the university’s Institutional Review Board (IRB). The iEEG participants provided informed consent as per the local IRB regulations (IRB protocol number AAAD5482).