All Cisco video endpoints except the SX10 support one video call plus one audio call without the MultiSite license. The key is to connect the video call first: if you dial the audio call first without specifying that it is an audio call, the system ties up the video resources and your second call will be limited to audio as well.
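To make the ordering concrete, here is a rough sketch that only builds the two dial commands as text, video first and audio second. It assumes the endpoint's xAPI `xCommand Dial` accepts a `CallType` parameter with `Video` and `Audio` values; check that against the API reference for your software version. The helper name `dial_commands` and the example addresses are made up for illustration.

```python
def dial_commands(video_uri: str, audio_number: str) -> list[str]:
    """Return the dial commands in the recommended order (video, then audio).

    Assumption: the command text follows the xAPI form
    'xCommand Dial Number: "<address>" CallType: <Video|Audio>'.
    Verify the syntax on your endpoint before using it.
    """
    return [
        # 1. Video call first, so it claims the video resources.
        f'xCommand Dial Number: "{video_uri}" CallType: Video',
        # 2. Audio call second, explicitly marked as audio.
        f'xCommand Dial Number: "{audio_number}" CallType: Audio',
    ]


if __name__ == "__main__":
    # Hypothetical addresses, for illustration only.
    for line in dial_commands("meeting@example.com", "5551234"):
        print(line)
```

Marking the second call as audio explicitly means it should not try to claim video resources, whichever way you end up issuing the commands.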
That said, in your scenario it would be better to connect that audio call directly to the CMS conference rather than cascading it off one of the endpoints dialed into the conference.