I don't believe you're going to run into QoS problems in this scenario. It takes very high jitter or latency to degrade a call, and you should not see either going over a single switch.
This is more likely an interoperability problem, and it has to do with the media that the devices are exchanging.
I would get a packet capture and use Ethereal to do a voice stream analysis and see what that turns up.
They may have RTP stream trouble, or codec trouble.
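If you want to sanity-check the capture by hand, here is a rough Python sketch of the two things I'd look at: the payload type in each RTP header (codec) and the interarrival jitter. It assumes you've already pulled the raw UDP payloads and arrival times out of the capture yourself. The function names are mine, not from any tool, and the 8000 Hz clock rate is just an assumption that fits G.711.

```python
import struct

def parse_rtp_header(packet: bytes):
    """Return (payload_type, sequence, timestamp, ssrc) from a raw RTP packet.

    Only the fixed 12-byte header is read; CSRC lists and extensions are ignored.
    """
    if len(packet) < 12:
        raise ValueError("RTP header needs at least 12 bytes")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    if b0 >> 6 != 2:
        raise ValueError("not RTP version 2")
    # The low 7 bits of the second byte are the payload type; the top bit is the marker.
    return b1 & 0x7F, seq, ts, ssrc

def interarrival_jitter_ms(samples, clock_rate=8000):
    """Estimate jitter in milliseconds using the running estimator from RFC 3550.

    samples: iterable of (arrival_seconds, rtp_timestamp) in arrival order.
    clock_rate: assumed RTP clock in Hz (8000 suits G.711; adjust for your codec).
    Timestamp wraparound is not handled in this sketch.
    """
    jitter = 0.0
    prev_transit = None
    for arrival, ts in samples:
        transit = arrival * clock_rate - ts   # in RTP timestamp units
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16.0     # smoothing factor from RFC 3550
        prev_transit = transit
    return jitter / clock_rate * 1000.0
```

If the payload types differ between the two directions (for example 0 for PCMU one way and 8 for PCMA the other), that points at codec trouble. If they match and the jitter figure stays low, look harder at the RTP stream itself (sequence gaps, changing SSRC, and so on).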
hth,
nick