There is no hard number. It depends on the conditions and the settings, and you can make it more or less sensitive. If there are multiple voice sources in the room, it may take longer to determine where to focus; if not, it will respond more quickly. It will also get a little quicker over time as it learns the room dynamics. It's not just a simple "three seconds" or anything like that.
Hope this helps.