Speaker Track 60 Camera and Ceiling Speakers

mohamedbineesh · ‎09-21-2020

We have an auditorium-type conference room, 10 meters long and 8 meters wide. The camera and video wall display will be placed on the longer side, as the audience/participants will sit facing that side. I am planning to use a Webex Codec Pro with a SpeakerTrack 60 camera. The question is this: all audio outputs from the codec (remote-party audio, or local audio from a third-party microphone in the conference room) will be connected to ceiling-mounted BOSE speakers.

Will this audio from the ceiling speakers have any impact on speaker tracking? The built-in microphone array of the SpeakerTrack 60 may have difficulty listening and deciding the source location, as it could confuse audio coming from the BOSE ceiling speakers with the actual human speaker.

Nithin Eluvathingal · ‎09-21-2020

There are four parts to how SpeakerTrack 60 does what it does.

Audio triangulation - The microphone array, located behind the fabric panel that sits behind the camera, is able to accurately locate voices within the room. The microphones are used only for audio triangulation.
Facial detection - Identification of a full or partial face at the same location as the voice is required to form a positive match. One camera quickly locates a close-up of the active speaker while the other gets ready to seek and display the next active speaker.
Camera control - With a positive match, the processor in the camera base instructs the cameras directly where to move.
Camera switching - The processor in the camera base instructs the codec which camera to use. The codec does the actual camera switching.
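
To make the "positive match" idea above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not Cisco's actual algorithm: the names `Face` and `match_speaker`, the use of a single horizontal angle, and the 5-degree tolerance are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Face:
    # Hypothetical: horizontal angle of a detected face, in degrees,
    # measured from the camera centerline.
    azimuth_deg: float


def match_speaker(
    voice_azimuth_deg: Optional[float],
    faces: Sequence[Face],
    tolerance_deg: float = 5.0,  # assumed value, for illustration only
) -> Optional[Face]:
    """Return the face nearest to the voice direction if it lies within
    the tolerance; otherwise return None (no positive match)."""
    if voice_azimuth_deg is None or not faces:
        return None
    nearest = min(faces, key=lambda f: abs(f.azimuth_deg - voice_azimuth_deg))
    if abs(nearest.azimuth_deg - voice_azimuth_deg) <= tolerance_deg:
        return nearest
    return None


faces = [Face(-20.0), Face(15.0)]
matched = match_speaker(13.0, faces)    # voice close to the face at 15 degrees
unmatched = match_speaker(60.0, faces)  # sound from a direction with no face
```

In this simplified model, a sound located where no face is detected returns no match, which mirrors the requirement that a face be found at the same location as the voice.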

To give a better feel for how this works, SpeakerTrack 60 has a diagnostic mode. Below is a screenshot with diagnostic mode turned on: a face is detected, but there is no audio match.

Below, the engineer is now speaking, and the green square indicates a positive match between facial detection and voice. The system will therefore zoom in on this person.

The indicators at the bottom of the screen correspond to the following:

F = 10.1% detected voice
T = 91.6% non-noise
E = 0% voice from far end
C = 28.1% camera movement
U = 0% ultrasound detected
N = 89.9% silence
S = 178 samples from sound algorithm