https://www.duolingo.com/chubbard

Mis-scraped sentences

For many, many of the translations I've worked on, the "sentence" to be translated is actually a combination of a heading (typically the article heading or, often for Wikipedia articles, a section heading) and the immediately following sentence.

I understand how this happens. Article or section headings are rarely if ever terminated with any kind of punctuation, and I assume the bot that slices and dices the web content into individual sentences relies on ending punctuation to mark the end of a sentence.
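
To illustrate what I assume is going on, here is a minimal sketch of a splitter that relies only on ending punctuation. The function name `naive_split` and the sample text are made up for illustration; I have no knowledge of the actual code behind the site.

```python
import re

def naive_split(text):
    """Split text into sentences using only ending punctuation.

    Hypothetical stand-in for the scraper's sentence splitter; the real
    implementation is not known.
    """
    # Collapse line breaks and runs of spaces, as happens when a page is
    # flattened to plain text.
    flat = " ".join(text.split())
    # Break wherever ., ! or ? is followed by whitespace.
    return [part for part in re.split(r"(?<=[.!?])\s+", flat) if part]

# A section heading (no ending punctuation) followed by body text.
flattened = "History\nThe town was founded in 1850. It grew quickly."
sentences = naive_split(flattened)
print(sentences)
```

Because the heading "History" has no ending punctuation, it ends up glued to the front of the first body sentence, which is exactly the kind of "sentence" I keep seeing.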

This seems like a fairly serious problem, given how frequently it occurs (hint: OFTEN!), especially if our translations are expected to be taken as-is and turned back into a web page. The translation input box assumes we are working on a single sentence, and there is no good way to insert blank lines or otherwise (re)separate heading material from body material. I wonder if the good folks at Duolingo would consider enhancing the parser to also treat a change in font size and/or font family (perhaps followed by a capital letter) as the beginning of a new sentence. I think this would catch most heading/body transitions, although, if not done carefully, it might cause problems in body text that contains italic or boldfaced segments. I think that could be figured out and worked around, though.
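
As a rough sketch of the idea, here is one way a parser could keep headings separate. Note that it swaps in a simpler signal than the font change I suggested: it treats HTML heading and paragraph tags as hard boundaries, on the assumption that the scraper still has the markup available. The names `BlockCollector` and `split_sentences` are hypothetical, and only Python's standard `html.parser` and `re` modules are used.

```python
import re
from html.parser import HTMLParser

# Tags assumed to mark a hard boundary between text units.
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p"}

class BlockCollector(HTMLParser):
    """Collect the text of each heading or paragraph as a separate block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []

    def _flush(self):
        # Normalize whitespace and store the block if it is not empty.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Inline tags such as <i> and <b> are not boundaries, so their
        # text simply accumulates in the current block.
        self._buf.append(data)

def split_sentences(html):
    """Split within each block on ending punctuation, never across blocks."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser._flush()
    sentences = []
    for block in parser.blocks:
        sentences.extend(s for s in re.split(r"(?<=[.!?])\s+", block) if s)
    return sentences

page = "<h2>History</h2><p>The town was <i>founded</i> in 1850. It grew quickly.</p>"
result = split_sentences(page)
print(result)
```

Here the heading comes out as its own item, and the italicized word stays inside its sentence because only block-level tags count as boundaries. A version based on font size or family would need the rendered styles rather than just the tags, so this is only a starting point.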

10/21/2012, 8:22:44 AM

