If there is a focus, it's the part of the sentence right before the verb, yes. But there does not need to be a focus, in which case the sentence is more neutral. For verbs without prefixes, like "lát", there is no way to tell from the written form whether the focus is empty or not. (For verbs with prefixes, the prefix moves behind the verb when there is a focus.)
In spoken language, stress makes it possible to disambiguate this: If there is a focus, it is typically stressed. In the audio for this sentence, the speaker does put this stress on "kint", so it's the focus. If it weren't in the focus, it would not be stressed, and typically there would be a bit more stress on the verb.
So yes, there is a strong case for "outside I see a writer" here. Still, even in English you can say "I see a writer outside" with the stress on "outside", with the same property that there is a stress that is not reflected in writing.
I think those would be two sentences that are not even slightly related: one about me being outside and one about me seeing something. Of course you can connect them using "és" or something, but to be honest, the thought itself doesn't sound natural to me... can't we just assume the listener already knows your location by the time you get into details like "I see this certain thing/person"?
az írót (the one writer you're looking for): use látom. egy írót (any old writer will do): use látok.
If there's no object at all, you can default to látok (though it seems there can be nuances if the object is implied). However, if there is an object, you need to ask the follow-up question of whether the sentence talks about a specific item (the writer) or a general class of item (a writer). In this case it's the latter.
As far as I know, there is ongoing work to turn the voice section into some mix of text-to-speech (for slow mode) and actual voice recordings. (Don't take this for granted though.)
Besides, I don't think this recording could get better, or, well, has any reason to. Contrary to what you are saying, it's well articulated. This is how anyone would read something out loud to you; it is by no means the tempo of "people on the streets", and anything slower would sound quite unnatural. (Actually, the text-to-speech of other courses isn't any slower than this; the only difference is that those aren't clear while this one is.)