Imagine for a minute. You’re in a coffee shop, or a bar, or at a swanky cocktail party (whichever you prefer). There are people around, chatting nearby. But you’re speaking to the person directly across from you. Somehow, you can pick their voice out of the chatter and attend to what they’re saying, even though the conversations around you might be just as loud as, or louder than (especially in a bar!), the one you’re interested in.

Have you ever wondered how you do that? I know I have. It’s kind of a mind-boggling problem (and is, in fact, called the cocktail party problem): how do you separate out one stream of speech, and make sense of it, against all the noise? And it’s not just something for us humans to think about. Voice recognition and recording technology wrestles with this all the time: how do you pick out the voice from the crowd?

As it turns out, it’s all about attention, and how that attention can change your brain.

Mesgarani and Chang. “Selective cortical representation of attended speaker in multi-talker speech perception.” Nature, 2012.

The authors of this study were interested in what happens in the brain when someone tries to pick out a single speaker in a room full of people. To look at this, they used electrodes implanted subdurally (beneath the dura mater, the tough membrane on the outside of the brain) in three human patients. Three is a very small number, but the recordings could only come from patients who were already receiving electrode implants clinically, in this case for treatment of epilepsy, and who were known to have normal hearing and language skills.

These subdural electrodes were implanted over the cortex, and in particular over the superior temporal lobe of the brain.


This area is not primary auditory cortex; it is secondary auditory cortex, closely tied to areas that are important for speech perception. Electrodes over this region can detect the firing of populations of neurons underneath them, and that activity showed reliable changes in response to sound.

They then had the patients listen to speech samples from two different voices. The sentences used made no sense (an example from the paper is “ready tiger go to red two now”), but each contained a call sign and a color-number sequence that the patients had to identify (here the call sign is “tiger” and the color-number sequence is “red two”). The patients had to indicate when they heard their particular call sign and sequence spoken, showing that they could hear and understand the speech. They heard a male and a female voice separately saying similar sentences with similar call signs and sequences, and, unsurprisingly, were very good at the task (100% accuracy).

Then the patients listened to a mixture of the two voices speaking together, saying the same types of sentences, and were told to pick out the call signs. The target speaker varied from trial to trial, and so did the target call sign, meaning the patients had to keep a careful ear tuned to keep up. At this point their performance dropped to about 75% accuracy, which is still not bad.

Above you can see a visual representation of what the sounds looked like when presented individually and then when overlapped (i in the figure). There’s a lot of overlap between the two voices, even though they differ in gender, pitch, and intonation; but when the patients attended to one of them (the thin lines displayed in i), they were able to pick that voice out and focus on the differences between the voices.
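If you want a feel for just how entangled two voices are in time and frequency, you can plot their spectrograms yourself. Here’s a minimal sketch (mine, not the paper’s), assuming you have two mono recordings at the same sample rate; the filenames speaker1.wav and speaker2.wav are placeholders.

```python
# Minimal sketch (not from the paper): spectrograms of two voices and their
# mixture, to see how heavily they overlap in time-frequency.
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram
import matplotlib.pyplot as plt

rate1, voice1 = wavfile.read("speaker1.wav")   # placeholder filename
rate2, voice2 = wavfile.read("speaker2.wav")   # placeholder filename
assert rate1 == rate2, "both recordings should share one sample rate"

# Trim to a common length and mix the two voices at equal level
n = min(len(voice1), len(voice2))
voice1, voice2 = voice1[:n].astype(float), voice2[:n].astype(float)
mixture = voice1 + voice2

fig, axes = plt.subplots(3, 1, sharex=True, figsize=(8, 6))
for ax, (label, signal) in zip(
    axes, [("speaker 1", voice1), ("speaker 2", voice2), ("mixture", mixture)]
):
    f, t, sxx = spectrogram(signal, fs=rate1, nperseg=512)
    ax.pcolormesh(t, f, 10 * np.log10(sxx + 1e-12), shading="auto")
    ax.set_ylabel("Hz")
    ax.set_title(label)
axes[-1].set_xlabel("time (s)")
plt.tight_layout()
plt.show()
```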

But what is going on in the brain? When each speaker was presented alone, the electrodes picked up a distinct pattern of activity (shown below as the dotted lines for speaker 1 and speaker 2).

You can see that when the patients were given the mixture and told to attend to one voice or the other, their electrode responses took on forms very similar to those recorded when the speakers were speaking alone (the solid lines). In particular, they showed an increased response to the high-frequency sounds of the attended speaker, and a suppressed response to the sounds of the other speaker.
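To give a flavor of that logic, here is a toy version of the comparison, not the authors’ actual analysis (which reconstructs spectrograms from the population response). The idea is simply: treat the responses to each speaker alone as “templates”, then ask which template a mixture-trial response correlates with more strongly. The data below are made up.

```python
# Toy illustration: which single-speaker "template" does a mixture-trial
# response resemble more? (Hypothetical data, plain correlation only.)
import numpy as np

def which_speaker(mixture_response, template_1, template_2):
    """Correlate a mixture-trial response with each single-speaker template
    and return (best_match, r1, r2)."""
    r1 = np.corrcoef(mixture_response, template_1)[0, 1]
    r2 = np.corrcoef(mixture_response, template_2)[0, 1]
    return (1 if r1 > r2 else 2), r1, r2

# Fake data: 500 time points of (hypothetical) neural activity at one electrode
rng = np.random.default_rng(0)
template_1 = rng.standard_normal(500)                       # speaker 1 alone
template_2 = rng.standard_normal(500)                       # speaker 2 alone
attending_1 = template_1 + 0.5 * rng.standard_normal(500)   # mixture, attending 1

print(which_speaker(attending_1, template_1, template_2))   # should usually pick 1
```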

So it appears that when you have to pay attention to a single speaker in a group (or in this case a pair), your brain can emphasize the signals related to what you recognize as that speaker. This is probably more than just paying better attention; it likely also involves things like increased responses to that person’s particular speech patterns and intonation (say, if the voice is female, you’d become more sensitive to that range). The neural signals also suggested that when the patients messed up the task, it was because their brain was “losing track” of the features it was supposed to follow, and the responses from the electrodes faltered.
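One way to picture that “emphasize the attended speaker” idea is as a gain knob on certain frequency channels. The sketch below is purely a caricature of attentional gain that I made up to illustrate the concept; the frequency band and gain values are invented, not parameters from the study.

```python
# Toy model of attentional gain: boost power in the attended frequency band,
# damp it elsewhere. Invented numbers, for illustration only.
import numpy as np

def apply_attention(power, freqs, attended_band, gain=2.0, suppression=0.5):
    """Scale time-frequency power up inside the attended band, down outside it."""
    out = power.copy()
    in_band = (freqs >= attended_band[0]) & (freqs <= attended_band[1])
    out[in_band, :] *= gain          # emphasize the attended voice's range
    out[~in_band, :] *= suppression  # de-emphasize everything else
    return out

# Example with a random "spectrogram": 64 frequency bins x 200 time bins,
# attending to an invented 165-255 Hz band (roughly a typical female pitch range).
freqs = np.linspace(0, 8000, 64)
power = np.abs(np.random.default_rng(1).standard_normal((64, 200)))
attended = apply_attention(power, freqs, attended_band=(165, 255))
```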

This means that picking out speech isn’t just a response to the sounds in your environment; it’s also your brain picking out the parts of those sounds that make them relevant to you as speech, probably through several processes.

Of course, this was a simplistic study: they only dealt with two voices, not a crowd. It’d be interesting to see how well the signals for the attended voice stand out as the number of voices increases. And of course, it’s a very small sample, and a group that isn’t a standard control, so there might be some differences in their speech processing compared to people without epilepsy (though that is unlikely here, since their hearing and speech were normal). But I personally think this is a really cool study, and it provides some interesting insights into how our brains allocate attention where it is needed. So the next time you’re in a crowded room, shouting to someone over the crowd, don’t worry. Their brain will help them pick you out, and with any luck they’ll even understand what you’re saying.

Mesgarani N, & Chang EF (2012). Selective cortical representation of attended speaker in multi-talker speech perception. Nature, 485(7397), 233-236. PMID: 22522927