The Complexities of Audio Signal Processing in Video Conferencing

By The NUROUM Team November 12, 2022

Video conferencing is about communicating between two remote parties using video and audio. There are three major factors that influence video conferencing quality:

  1. Media transmission quality, e.g., no jitter, HD quality;

  2. Audio quality, e.g., less noise, more voice clarity, invariance to voice pickup distance;

  3. Video quality, e.g., better picture clarity, more tracking intelligence.

Thanks to new SaaS-based video conferencing solutions such as Zoom, Google Meet and Microsoft Teams, as well as faster access-point and broadband connections, factor 1) has been well addressed. We can now have very smooth, jitter-free video conferencing almost anywhere at almost any time.

However, factor 2) is not a trivial problem. A simple drawing is used to illustrate the video conferencing scenario.

img

People in the near-end room, denoted N, need to communicate with people at the far end, denoted F.

For the near-end room, there are many possibilities:

  • N1: Small huddle room fitting fewer than 3 people.

  • N2: Medium-sized room fitting a team of around 6-8 people.

  • N3: Large meeting room fitting a team of around 15 people.

  • N4: Extra-large room such as a training room or board room.

For the far end, there are even more possibilities:

  • F1: people might dial in from a quiet conference room using professional devices

  • F2: people might dial in from their work desk with an earphone

  • F3: people might dial in from their car, or even from a train

  • F4: people might dial in from a home office

Let's try to make the problem a little bit simpler by assuming that the near end room is well-furnished, quiet and does not suffer from reverberation. Let's analyze the complexities of each of the cases.

  • N1: This is the most straightforward case. The person usually sits quite close to the audio capture device, so a device that handles echo cancellation and provides basic noise suppression and automatic gain control should be enough. However, most audio devices cannot handle even this case, because they do not handle double talk well. This case mainly involves one issue.

    • Double talk: When the far end is talking, or has background sound, while someone at the near end is also speaking, it becomes a double-talk scenario. For the far end to hear the conversation clearly, the audio device needs strong double-talk capability, i.e., full duplex.

  • N2 & N3: These cases usually involve two problems.

    • Distance to audio device: With only one device in the room, some people sit close to the device while others sit far from it. This makes the captured voice uneven: for a person sitting close, the captured voice is rich and powerful, while for a person sitting far away, it is thin and weak.

    • Noise: More people in the conference room means more noise. One person might be typing on a keyboard, another knocking on the desk, and another sneezing. Unfortunately, none of the technologies marketed as "NoiseBlock" can handle these noises when they occur while people are talking. We are working on a next-generation, deep-learning-based solution to this issue, so stay tuned for our future product updates.

  • N4: Training rooms and all-hands rooms are usually intended for large gatherings or meetings. All of the issues encountered in N1, N2 and N3 are even more serious in N4.
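To make the distance problem in N2 and N3 more concrete, here is a minimal Python sketch of automatic gain control, the kind of level equalization mentioned for N1. It is only an illustration under simplifying assumptions, not a description of any particular product: the function names, the target level of 0.1, the gain cap of 20 and the smoothing factor of 0.9 are all hypothetical choices.

```python
import math

def tone(amplitude, length=160, period=16):
    """Hypothetical test signal: one frame of a sine wave standing in for a voice."""
    return [amplitude * math.sin(2 * math.pi * n / period) for n in range(length)]

def rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def simple_agc(frames, target_rms=0.1, max_gain=20.0, smoothing=0.9):
    """Scale each frame toward target_rms, smoothing the gain between frames."""
    gain = 1.0
    out = []
    for frame in frames:
        level = rms(frame)
        if level > 1e-6:  # hold the gain during near-silence instead of boosting it
            desired = min(target_rms / level, max_gain)
            gain = smoothing * gain + (1.0 - smoothing) * desired
        out.append([s * gain for s in frame])
    return out

# A close talker (amplitude 0.5) and a distant talker (amplitude 0.02),
# each processed as 100 consecutive frames.
near_out = simple_agc([tone(0.5)] * 100)
far_out = simple_agc([tone(0.02)] * 100)
```

Once the gain has settled, both talkers come out at roughly the same level. A real device also has to avoid amplifying the room noise along with the quiet talker, which is part of what makes the distant-talker case hard.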

For the far end, there are also many challenges:

  • F1: When people dial in from quiet environments, the conference system should be able to handle double talk. As stated earlier, most existing solutions fail to handle double talk, while our solution handles it very well.

  • F2: When people dial in from their desks, their microphone might pick up background sounds such as people walking by. This is a continuous double-talk case.

  • F3: People might also dial in from their car, where the far end is full of sudden, short-lived noise.

  • F4: People might dial in from a home office.
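Double talk comes up in several of the cases above, so a small illustration may help. The sketch below uses the classic Geigel detector, a textbook technique that is shown here purely as an example and is not necessarily what any given product uses. It assumes that the echo reaching the microphone is at least 6 dB quieter than the far-end signal (the 0.5 threshold) and that the window is long enough to cover the echo delay. The function names and parameter values are hypothetical.

```python
import math
from collections import deque

def sine(length, period, amplitude=1.0):
    """Hypothetical stand-in for a speech signal."""
    return [amplitude * math.sin(2 * math.pi * n / period) for n in range(length)]

def geigel_double_talk(far_end, mic, window=64, threshold=0.5):
    """Return one flag per sample: True where near-end speech is likely present.

    The microphone sample is compared with the largest recent far-end sample.
    Pure echo should stay below threshold * that maximum; anything louder is
    attributed to someone talking in the near-end room.
    """
    recent = deque(maxlen=window)  # magnitudes of the last `window` far-end samples
    flags = []
    for x, d in zip(far_end, mic):
        recent.append(abs(x))
        flags.append(abs(d) > threshold * max(recent))
    return flags

# Far-end speech, and its echo at the microphone (delayed 5 samples, scaled by 0.3).
far = sine(400, period=16)
echo = [0.0] * 5 + [0.3 * x for x in far[:-5]]

# A near-end talker starts speaking halfway through, on top of the echo.
near = [0.0] * 200 + sine(200, period=20)
mic = [e + s for e, s in zip(echo, near)]

flags = geigel_double_talk(far, mic)
```

With echo alone, no sample is flagged. Once the near-end talker joins in, the detector starts raising flags, which an echo canceller would typically use to pause its adaptation.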

Conference environments also vary widely, from small offices with glass windows to large offices with marble floors. The transmission paths in these different rooms pose even more challenges for conference devices.
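To see why the room's transmission path matters, consider a rough Python sketch of an echo canceller based on the normalized least-mean-squares (NLMS) adaptive filter, a standard textbook approach. This is an illustration only and not a description of any specific device. The room impulse response `[0.5, 0.3, -0.2]`, the white-noise far-end signal, and all function names and parameters are made-up assumptions.

```python
import random

def simulate_echo(far_end, room_path):
    """Convolve the far-end signal with a hypothetical room impulse response."""
    return [
        sum(h * far_end[n - k] for k, h in enumerate(room_path) if n - k >= 0)
        for n in range(len(far_end))
    ]

def nlms_echo_canceller(far_end, mic, taps=8, mu=0.5, eps=1e-6):
    """Subtract an adaptively estimated echo of far_end from mic; return the residual."""
    w = [0.0] * taps  # current estimate of the room's echo path
    x = [0.0] * taps  # most recent far-end samples, newest first
    residual = []
    for sample, d in zip(far_end, mic):
        x = [sample] + x[:-1]
        echo_estimate = sum(wi * xi for wi, xi in zip(w, x))
        e = d - echo_estimate  # what is left after removing the estimated echo
        norm = sum(xi * xi for xi in x) + eps
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        residual.append(e)
    return residual

def power(signal):
    """Mean-square power of a list of samples."""
    return sum(v * v for v in signal) / len(signal)

rng = random.Random(0)
far = [rng.uniform(-1.0, 1.0) for _ in range(3000)]
mic = simulate_echo(far, [0.5, 0.3, -0.2])  # echo only, no near-end speech

long_enough = nlms_echo_canceller(far, mic, taps=8)
too_short = nlms_echo_canceller(far, mic, taps=2)
```

When the filter is long enough to cover the room's path, the residual echo shrinks to almost nothing once the filter has adapted. When the filter is shorter than the path, as with `taps=2` here, part of the echo always remains. Reflective rooms have much longer paths than this toy example, so they demand far longer filters.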