While I understand why you'd think that, it actually is stressed on the right syllable. A native speaker would be able to tell the difference. The problem is that, because this is a question, the TTS recording has the voice rise very high and (excessively, IMO) emphasise the last syllable. But, overall, it is voiced correctly.
I hear the stress in the right place :) and I am not a native speaker.