The Complexities of Audio Signal Processing in Video Conferencing

Video conferencing lets two remote parties communicate using video and audio. Three major factors influence video conferencing quality:

  1. Media transmission quality, i.e., no jitter, HD quality;
  2. Audio quality, i.e., less noise, more voice clarity, invariance to voice pickup distance;
  3. Video quality, i.e., better picture clarity, more tracking intelligence.
Thanks to SaaS-based video conferencing solutions such as Zoom, Google Meet and Microsoft Teams, as well as faster internet at both the access point and the broadband connection, factor 1) has been well addressed. We can have very smooth, jitter-free video conferencing almost anywhere at almost any time.
However, factor 2) is not a trivial problem. A simple drawing is used to illustrate the video conferencing scenario.
People in the near end room, denoted by N, need to communicate with people at the far end, denoted by F.

For the near end room, there are lots of possibilities:
  • N1: Small huddle room fitting fewer than 3 people.
  • N2: Medium-sized room fitting a team of around 6-8 people.
  • N3: Large meeting room fitting a team of around 15 people.
  • N4: Extra-large room such as a training room or board room.
For the far end, there are even more possibilities:
  • F1: People might dial in from a quiet conference room using professional devices.
  • F2: People might dial in from their work desk with an earphone.
  • F3: People might dial in from their car, or even from a train.
  • F4: People might dial in from a home office.
Let's simplify the problem by assuming that the near end room is well furnished, quiet and free of reverberation. Let's analyze the complexities of each case.
  • N1: This is the most straightforward case. The person usually sits close to the audio capture device, so a device with echo cancellation, basic noise suppression and automatic gain control should be enough. However, most audio devices cannot handle even this case, because they do not handle double talk well. This case mainly involves one issue:
    • Double talk: When the far end is talking, or has background sound, while the near end is speaking, both ends are active at once. This is the double talk scenario. For the far end to hear the conversation clearly, the audio device needs strong double talk capability, i.e., full duplex.
  • N2 & N3: These cases usually involve two problems:
    • Distance to audio device: With only one device in the room, some people sit close to it while others sit far away. This makes the captured voice uneven: for people seated close to the device the captured voice is rich and powerful, while for people seated far away it is thin and weak. For this case, we recommend daisy-chained devices so that all seats are covered equally. The following graphs show the difference between our device and the Jabra 710 device. Our device captures a much more powerful and richer voice at distances of up to 3 meters.


      [Figures: aligned_jabra_1m, aligned_jabra_3m (Jabra 710 at 1 m and 3 m); aligned_aw_1m, aligned_aw_3m (our device at 1 m and 3 m)]


    • Noise: More people in the conference room means more noise. One person might be typing on a keyboard, another knocking on the desk, and another sneezing. Unfortunately, the technologies marketed as "NoiseBlock" cannot handle these noises when they occur while people are talking. We are working on a next-generation deep learning based solution to handle this issue, so stay tuned for our future product updates.
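To get a feel for the distance problem described for N2 and N3, the level drop between a close seat and a distant seat can be estimated with the inverse distance law. This is only a rough sketch: it assumes a point-like talker in a free field, which roughly matches the no-reverberation assumption made earlier, and the function name `level_drop_db` is ours, chosen for illustration. It is not a measurement of any device.

```python
import math


def level_drop_db(near_m, far_m):
    """Estimate how many dB quieter a talker is at far_m than at near_m.

    Assumes the inverse distance law (sound pressure proportional to 1/r),
    which only holds for a point source in a free field.
    """
    if near_m <= 0 or far_m <= 0:
        raise ValueError("distances must be positive")
    return 20.0 * math.log10(far_m / near_m)


# Compare a seat 1 m from the device with a seat 3 m away.
drop = level_drop_db(1.0, 3.0)
print(f"{drop:.1f} dB")  # prints "9.5 dB"
```

Under these assumptions, a talker at 3 meters arrives about 9.5 dB quieter than the same talker at 1 meter. That is the gap that gain control or additional daisy-chained devices have to make up.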
For the far end, there are also lots of challenges:
  • F1: When people dial in from quiet environments, the conference system should be able to handle double talk. As stated earlier, most existing solutions fail to handle double talk, while our solution handles it very well.


      [Figures: aligned_aw_dtd, aligned_jabra_dtd (double talk comparison)]

  • F2: When people dial in from their desks, their microphone might pick up background sounds such as people walking by. This is a continuous double talk case.


      [Figures: aligned_jabra_sbg, aligned_aw_sbg (far-end background sound comparison)]


  • F3: People might also dial in from their car, where the far end is full of instantaneous noise.
  • F4: People might dial in from a home office.
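The double talk cases above (N1, F1 and F2) all depend on the device noticing when the near end and the far end are active at the same time. A minimal sketch of one classic textbook approach, the Geigel detector, is shown here for illustration only. It is not a description of how our device or any other product works. The window length, the 0.5 threshold (which assumes the echo path attenuates the loudspeaker signal by at least 6 dB), and the synthetic test signals are all assumptions.

```python
import math
from collections import deque


def geigel_double_talk(far_end, mic, window=128, threshold=0.5):
    """Flag samples where double talk is suspected (Geigel detector).

    far_end: samples sent to the loudspeaker.
    mic:     samples captured by the microphone (echo plus near-end speech).
    A sample is flagged when the microphone magnitude exceeds `threshold`
    times the largest far-end magnitude seen in the last `window` samples.
    Note: if the far end is silent, any microphone activity is flagged,
    so this simple test cannot tell near-end-only speech from double talk.
    """
    recent = deque(maxlen=window)  # sliding window of far-end magnitudes
    flags = []
    for x, d in zip(far_end, mic):
        recent.append(abs(x))
        flags.append(abs(d) > threshold * max(recent))
    return flags


# Synthetic example (all values are assumptions, not measured data).
far = [math.sin(2 * math.pi * 0.05 * n) for n in range(400)]
echo = [0.3 * s for s in far]  # hypothetical echo path: plain 0.3 gain
near = [0.0] * 200 + [0.8 * math.sin(2 * math.pi * 0.11 * n) for n in range(200)]
mic = [e + s for e, s in zip(echo, near)]

flags = geigel_double_talk(far, mic)
print("double talk in first half: ", any(flags[:200]))
print("double talk in second half:", any(flags[200:]))
```

In a real echo canceller, a flag like this is typically used to freeze adaptation of the echo filter during double talk so that the filter does not diverge.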
As for the conference environments, there are small offices with glass windows and large offices, some even with marble floors. The transmission paths in these different rooms create even more challenges for conference devices.
Handling all these cases requires a holistic approach. Fortunately, after two years of hard work by more than 30 engineers and scientists, we are able to deliver such a device.
