https://www.duolingo.com/Zanoma

Why does Duolingo mess up Wikipedia articles so badly?

Zanoma

Just take a look at this article on Duolingo:

http://www.duolingo.com/translation/261939bfb886c1917cffda5263fc0501

and the source on Wikipedia:

http://es.wikipedia.org/wiki/Rub%C3%A9n_Wolkowyski

The first thing that struck me as odd was that there's actually more info in the Duolingo article than there is on Wikipedia. His wife, children and place of residence aren't on Wikipedia.

Then comes the table, which is understandably a big block of text. However, that block of text also swallows the subsection "Títulos obtenidos en clubes" below the table.

After that the Duolingo article more or less stabilizes until it reaches the header "Menciones", where the text gets meshed together in weird ways again.

The problem with all of this is that it's harder to translate things when you don't know the context.

I understand that this happens partly because of periods. Hopefully there is more information about what causes it, so that the problems can be fixed or at least spotted more easily, and it becomes less of a necessity to go through both the Duolingo article and the source article.

5 years ago

1 Comment


https://www.duolingo.com/_pinkodoug_
_pinkodoug_

Some of the issues you cite as examples are due to the fact that Wikipedia articles, by their very nature, are constantly evolving. Duolingo imported a version of the article which has since changed significantly. The "Vida Personal" section of the article that Duolingo still includes was removed from the Wikipedia article a couple of years ago. It can still be found in older versions of the page by looking through the revision history for that document on Wikipedia. Duolingo's Immersion function has no facility to keep up with the constant evolution of Wiki articles (or any others), and can only work with what was in the article at the time it was imported.

The blob of text in the "Trayectoria" section is now presented in the Wiki article as an organized table, but in the version of the document that Duo has, it was an unpunctuated list of bullet points. The lack of punctuation is the issue here as Duo relies heavily on end-of-sentence punctuation to delineate sentences for presentation in the Immersion section. This could, of course, be greatly improved upon if the scripts which handle processing imported documents could perform a deeper inspection of the document, relying on HTML tags and CSS to recognize breaks in text that aren't obvious by punctuation alone. This is the same issue that causes the problem with the "Menciones" section, unlike the "Títulos internacionales en selecciones nacionales" and "Presencias internacionales en selecciones" sections, which did include end-of-sentence punctuation.
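To make the difference concrete, here is a rough Python sketch of the two approaches. This is not Duolingo's actual import code, which I haven't seen; the function names, the `BLOCK_TAGS` set, and the sample HTML are all made up for illustration. The first splitter only breaks on end-of-sentence punctuation, while the second also treats block-level HTML tags as boundaries before applying the same punctuation rule inside each block.

```python
import re
from html.parser import HTMLParser

# Hypothetical choice of tags that count as hard breaks in the text.
BLOCK_TAGS = {"p", "li", "td", "th", "tr", "div", "br",
              "h1", "h2", "h3", "h4", "h5", "h6"}

# Split after ., ! or ? when followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_by_punctuation(text):
    """Naive splitter: relies only on end-of-sentence punctuation."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


class BlockAwareSplitter(HTMLParser):
    """Collects text, cutting a new block at every block-level tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []

    def _flush(self):
        # Join buffered text and normalize runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def split_by_structure(html):
    """Split on block-level tags first, then on punctuation within each block."""
    parser = BlockAwareSplitter()
    parser.feed(html)
    parser.close()
    segments = []
    for block in parser.blocks:
        segments.extend(split_by_punctuation(block))
    return segments


# Made-up sample: an unpunctuated bullet list followed by a normal paragraph.
sample_html = (
    "<h2>Trayectoria</h2>"
    "<ul><li>Club A (1990)</li><li>Club B (1995)</li></ul>"
    "<p>First sentence. Second sentence.</p>"
)
flattened = "Trayectoria Club A (1990) Club B (1995) First sentence. Second sentence."

print(split_by_punctuation(flattened))
print(split_by_structure(sample_html))
```

With punctuation alone, the heading and both bullet points get glued onto the first real sentence, which is the kind of blob you see in the "Trayectoria" section. With the tag-aware version, the heading and each bullet come out as separate segments even though none of them ends in a period.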

Ultimately, Duo is limited to parsing whatever is in the original document, and as the old CS adage goes, "garbage in, garbage out." If the input article is a mess of poorly punctuated text, Duo can only do so much with it.

5 years ago