Does Duolingo Keep Two Sets of Books?
Summary: When Duolingo chooses which sentences to review you on, I think it uses a different measure of "word weakness" than the one we see on the Words page. I also think the questions are not chosen at random--even for the general "strengthen" exercises. Instead, they are clustered in a way related to the lessons where they were introduced.
Details: In December 2014, I started Duolingo German. My daily routine is two review lessons and one new lesson. I regularly look at the Words page to see how different words are doing. At this stage, there are never more than a dozen words with just one bar.
You'd think that those one-bar words would come up for review, but you'd be wrong. Zitrone went four weeks (that's fifty strengthening exercises) before Duolingo finally reviewed it today. There have been several days where none of the one-bar words was chosen for review. When I look at my French, Spanish, and Italian trees, there are lots of words that have not been reviewed in over a year.
Looking at the German words that were chosen for review (ignoring function words), they all seem to come from two or three lessons--not all from the same skill. Reviews in the other trees show similar behavior, but it's easier to see when the tree is small.
And yet, when I look at those year-old words, they generally are words that I know pretty well. In fact, even for German, I saw Zitrone get used in questions before it got counted as "reviewed." It's as though Duolingo made a distinction between when a word is "formally reviewed" vs. when it just happened to come up in a review (or other) lesson. A distinction that doesn't affect whether a word is shown as weak on the Words page, but which very much affects whether Duo thinks you actually need to review the word.
This goes a long way toward explaining why doing general strengthening exercises seems to have so little impact on weak skills. I think the weak skills really are calculated based on the word strengths shown on the Words page, but that's only loosely related to how Duolingo decides what you really need to review. When you do a skill-specific review, I suspect it uses the same algorithm, but restricting it to questions that are actually inside that skill greatly limits what it can choose. At worst, you might need to do two or three "extra" strengthens (ones that appear to have no effect on the skill), but you can eventually force it to review everything.
I think this also explains why a general strengthen might only give you 3 XP; it was Duo's choice to drill you on relatively strong words, even though it had a huge number of weak words to choose from.
The funny thing is that I think Duo's secret algorithm is probably better for purposes of learning the language. It seems to do a good job of drilling you on the things you most recently studied, and it doesn't bother you too much with really easy, obvious sentences. (To such an extent that the easy obvious ones are welcome breaks.)
Anyway, that's my thinking at the moment. I'd be interested if anyone has had similar thoughts and/or any data that might serve to support (or even contradict) these observations.
Yes, your observation is generally correct, except that there's one extra element to the sauce: Duolingo also uses the global weak/strong words of all users to determine which words users generally find hard or easy, and it uses this as a tie breaker when words have the same strength. This may explain why it sometimes chooses strong words, resulting in low XP: one person's strong word is another person's weak word, and vice versa.
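In code terms, the tie-break described here might look something like the following sketch (purely hypothetical data and function names; nothing here is Duolingo's actual implementation): rank candidates by the user's own word strength, and break ties with a global difficulty score.

```python
# Hypothetical sketch of the tie-break described above.
# Candidates are ranked by the user's own strength (weakest first);
# ties are broken by a global difficulty score aggregated across
# all users (hardest first). All numbers are invented.

def pick_review_words(words, n):
    """words: list of (word, user_strength, global_difficulty) tuples."""
    ranked = sorted(words, key=lambda w: (w[1], -w[2]))
    return [w[0] for w in ranked[:n]]

words = [
    ("Zitrone", 0.25, 0.9),  # weak for this user, hard globally
    ("Apfel",   0.25, 0.2),  # same personal strength, easy globally
    ("Brot",    0.75, 0.5),  # strong for this user
]
print(pick_review_words(words, 2))  # ['Zitrone', 'Apfel']
```

Under a scheme like this, Zitrone beats Apfel despite identical personal strength, which is the kind of tie-break tatou described.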
Another problem/scenario is that the words tab may sometimes be buggy and display the wrong strength for the word. It also rates different word forms differently. You may have seen "set" as a noun but not as a verb. So one of those word forms might be strong while the other is weak. That's why people sometimes complain that Duolingo is testing them on words they know well, when in effect it is sometimes testing different word forms.
Well, I'm not sure it is a different system, but it is something that isn't entirely related to only a particular user's vocabulary. Like I said in my post above, sometimes (because users may learn a lot of words at the same time, or because many words are randomly at the same strength) the algorithm picks words for practice that many other users find hard, using that to break the tie and choose the next word for practice (at least, that's how it was according to the staff member tatou).
What I described above is not a normal element of a spaced repetition algorithm. This may be why Duolingo is sometimes better and sometimes worse than most other similar software.
It's worth noting that there are currently a number of A/B experiment options that appear to affect how practice sessions are generated (tie breaks among them). The app's behavior can vary pretty dramatically from one user to the next. Over the last few months a number of session generation tests have come and gone--it seems to be one of the most active areas for testing. Additionally, I've noted a significant change in the way that Duo handles strength (primarily in how it's aggregated at the skill and lexeme level rather than at the individual word level) and strength decay since I last studied it in the summer of 2014. I wouldn't necessarily expect Tatou's statement (posted at 17:53:52 UTC on 2013-09-05) to still be true today (although it certainly might be).
I generally agree that things may have changed, but I highly doubt that they would not make use of global data on users' word strengths.
I did ask tatou some time ago to give us the current state of affairs of word strength, but my guess is that because the decay/strength system is relatively stable, they haven't felt the need to make any announcements for a long time.
I'm replying to your other comment, which was indented too deeply for a direct reply.
I have (so far) seen two cases where a gold tree had under 50% estimated strength. It was close enough that I could attribute that to measurement error. (Which speaks to your other point, of course.) :-)
I haven't been willing to make the time to decipher the vocabulary object, verify that it really represents the words we're seeing, and then write my own code to extract what's there. If you can point me to something that documents it, I'll have a look at it.
I suspect you're correct that this is an area they are actively playing with at the moment. I was surprised today to discover that Duolingo did not strengthen a word that I missed during review. This was a noun that came up several times, so it seems likely that it was a targeted word, not an accidental one. The other words I had thought were targeted (and which I got right) all went to 4 bars. I've seen Duo fail to strengthen a word that only appeared once (even though I got it right), but this was different. If missing a word means it doesn't get strengthened, that would be a big change in their algorithm.
Any thoughts as to why little or nothing is ever at strength 2 or 3?
Any thoughts as to why little or nothing is ever at strength 2 or 3?
I think it's probably a combination of a couple of things, the biggest being that Duo does a relatively poor job of targeting weak words in reviews. It appears that their algorithm selects based upon lexeme strength rather than individual word. Like you suggest above, and I've suggested elsewhere, when a lexeme is targeted, questions are selected randomly from the pool of questions that match both the skill being exercised and the targeted lexeme (this is why you don't get a bunch of infinitives or participles in a review of present tense verbs, even though they're all in the same lexeme). Some word forms are better represented within this pool of questions, so they tend to come up more often, while those that are less well represented atrophy.
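A sketch of that selection idea (hypothetical data model; Duolingo's real question pool is not public): the candidate pool is the intersection of questions belonging to the exercised skill and questions containing the targeted lexeme, so word forms that appear in more of those questions surface more often.

```python
# Hypothetical sketch: questions are drawn only from the pool that
# matches both the skill being exercised and the targeted lexeme.
# The data below is invented for illustration.

def candidate_questions(questions, skill, lexeme):
    return [q for q in questions
            if q["skill"] == skill and lexeme in q["lexemes"]]

questions = [
    {"id": 1, "skill": "Present", "lexemes": {"essen"}},
    {"id": 2, "skill": "Present", "lexemes": {"essen", "Brot"}},
    {"id": 3, "skill": "Past",    "lexemes": {"essen"}},  # wrong skill, excluded
]
pool = candidate_questions(questions, "Present", "essen")
print([q["id"] for q in pool])  # [1, 2]
```

This would explain why a review of present-tense verbs never serves infinitives or participles even when they share a lexeme: those questions fail the skill filter.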
Additionally, I think that their skill strength modeling contributes because a user can easily end up with several skills whose average word strength is well below 50%. Most words just don't come up for practice often enough to keep the individual word strengths out of the basement.
Finally, I'm sure some of it has to do with the way that many of us practice. If I had time, it would be interesting to go through a course without taking any shortcut exams, engaging in minimal review along the way, in order to track the observable differences between that and an account with more aggressive skill maintenance.
I was thinking a bit about this last night, and a thought occurred to me. I've not yet dug into the data or Duo's client code in an attempt to verify this, but I thought I'd share my hypothesis with you.
I think Duo's algorithm for determining a given skill's strength is making some assumptions, and that it doesn't care that a number of word forms are weak so long as one of the forms for that lexeme taught within a particular skill is strong. I think that it might be assuming, for example, that if you know that pasabas is the 2nd person singular imperfect form of the verb to happen, then you also know that pasaban is the third person plural, regardless of pasaban's strength rating. If this is actually happening, a user might have 5 forms of any lexeme in a particular verb skill fully degraded, and the skill could remain golden as long as enough lexemes have at least one constituent taught within that skill with a relatively high strength value.
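The hypothesis can be stated as a one-line rule (a guess, not verified behavior; the threshold and data are invented): a skill stays gold if every lexeme it teaches has at least one strong form.

```python
# Sketch of the hypothesis above (unverified): a skill remains gold as
# long as each lexeme has at least one form above the strength threshold,
# regardless of how degraded the other forms are.

def skill_is_gold(lexeme_forms, threshold=0.8):
    """lexeme_forms: dict mapping lexeme -> list of per-form strengths."""
    return all(max(strengths) >= threshold
               for strengths in lexeme_forms.values())

forms = {
    "pasar": [1.0, 0.1, 0.1, 0.1, 0.1, 0.1],  # one strong form, five degraded
    "comer": [0.9, 0.2],
}
print(skill_is_gold(forms))  # True under this hypothesis
```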
Later this afternoon when I have some time, I'll look a bit more deeply into this.
Finally had a few minutes to look at this, and it doesn't look promising at all.
There have got to be some other metrics being factored into the calculation. Some possible candidates: the number of times the word has been practiced (which is tracked but is no longer exposed to the client), the time elapsed since it was last practiced, and/or the elapsed time since the word was initially learned.
It seems to do a good job of drilling you on the things you most recently studied, and it doesn't bother you too much with really easy, obvious sentences.
Yes, one of tatou's last posts indicated that was one of the last tweaks they reported making: the newest words had high decay rates at the time.
I do know most staff posts on this subject are older than a year, but they are the only glimpse users have into how the algorithm worked. As they've often stated, they carefully make small changes so they don't cause wide-scale problems, so it is possible that important parts still work in a similar manner.
Thanks for those links. In both cases, those were folks who were trying to use the general strengthen even though they had weak skills. (I suggested to both of them that they not do general reviews if they had any weak skills.)
Of course even if the general review is slower than the per-skill review, it shouldn't be that bad. Perhaps the algorithm can get into a bad state where it isn't reviewing the words it should, but doing a skill-specific review is enough to knock it out of it?
Yes, I think skill specific practice can always find those pesky words. The trouble of course is finding out which skills need skill specific review when the whole tree is golden.
Also, people tend to relax when the skills are golden because they fail to realize that Duolingo is only attempting to get 80% average word strength in all skills. So if each skill has maybe 5 one-bar words, it won't be enough to cause it to lose its glitter.
The problem is that can easily result in 300 (5 * 60 skills in a tree) or more weak words.
It's not 80%--it's 50%. Have a look at the list of words for a tree. Count how many have strengths 1, 2, 3, and 4. In a newly gold tree, half will be 1, half will be 4, and almost none will be 2 or 3.
Everyone keeps quoting this 80% number, but it does not survive the simplest experiment. When a skill goes below 50% strong words, the skill drops from 5 bars to 4 bars.
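That experiment amounts to a simple tally (bar counts below are invented for illustration, not real Words-page data): count the share of words at full strength and watch where the skill drops a bar.

```python
# Sketch of the 50% check: tally a skill's words by bar count and
# compute the share of "strong" (4-bar) words.
from collections import Counter

bars = [1, 1, 1, 1, 4, 4, 4, 4]  # hypothetical Words-page bar counts
counts = Counter(bars)
strong_share = counts[4] / len(bars)
print(strong_share)  # 0.5 -- right at the claimed 5-bar/4-bar boundary
```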
@greg - You're right that the 80% number no longer stands up to scrutiny (though it once did). However, neither does 50%.
There have clearly been some changes since the last time I dove into it last summer. The previous models no longer explain Duo's current behavior (which probably goes a ways towards explaining some of our previous disagreements). The divisions are accurate (keep an eye on the skill strength reported within the user object's language data to see that in action), but the way the strength of a skill is calculated is not a simple average of the strengths of the words taught within a skill, nor the average strength of the lexemes within a skill. As an example, I currently have skills for which the average word strength is below 40%, yet they're still gold--the lowest "gold" skill has an average word strength of ~32%! I'm still gathering data to work out how skill strength is currently calculated, but I don't yet have a good handle on it.
As an aside, I really suggest you work with the actual strength numbers included within the vocabulary overview object rather than estimating based on the number of strength bars a word has. A 1 bar word might be anywhere from 0 to 20% strength, a 4 bar word anywhere from 80 to 100, which makes a pretty big difference in the resulting averages.
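The difference is easy to see with a toy example (made-up strength values): averaging raw strengths and averaging bar counts give noticeably different answers for the same words.

```python
# Made-up numbers showing how bar counts distort averages: a 1-bar word
# can be anywhere in 0-20% strength, a 4-bar word anywhere in 80-100%.

def avg(xs):
    return sum(xs) / len(xs)

strengths = [0.0, 0.125, 0.875, 1.0]  # two 1-bar words, two 4-bar words
bars = [1, 1, 4, 4]                   # what the Words page displays

print(avg(strengths))  # 0.5   from the raw strength values
print(avg(bars) / 4)   # 0.625 from naively scaling bar counts
```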
It's not 80%--it's 50%. Have a look at the list of words for a tree. Count how many have strengths 1, 2, 3, and 4.
Hmm, it is actually a fact that it's above 75%. I did a small experiment: I picked a 4/5-strong skill and strengthened it. You can see the screenshots here.
Before I did the practice:
mon 4, ton 3, son 4, ma 3, ta 3, sa 4, leur 3, mes 3, tes 2, ses 4, nos 3, vos 3, leurs 3, notre 3, votre 3
It also had a rating of 75%, as seen in the screenshots. Interestingly enough, all of those lessons had a strength of 75%; somehow Duolingo seems to take the strength of the weakest lesson as the skill strength. Also, if I add up my word strengths and divide by 60, I get exactly 80%.
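That 80% figure checks out from the bar counts listed above: the sum of the per-word bars, divided by the maximum possible (15 words x 4 bars = 60), is exactly 0.8.

```python
# Reproducing the arithmetic from the bar counts listed above:
# mon 4, ton 3, son 4, ma 3, ta 3, sa 4, leur 3, mes 3,
# tes 2, ses 4, nos 3, vos 3, leurs 3, notre 3, votre 3
bars = [4, 3, 4, 3, 3, 4, 3, 3, 2, 4, 3, 3, 3, 3, 3]
total = sum(bars)                      # 48
print(total, total / (len(bars) * 4))  # 48 0.8
```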
After the exercise (you can see the values yourself in the screenshot), most of these were strengthened to 100%, resulting in a golden skill.
But you don't have to take my word for it; Duolingo's own API reports it as such. Have a look at this: just observe the network requests and look at the JSON related to the skill you're learning.
Interesting. You are only counting the words that are displayed with the lessons. That's not unreasonable, but it's not what I've been looking at. If you look across all the words on the Words page, the boundary seems to be 50%, not 80%.
This speaks again to Duo keeping two sets of books, but it gives a hint as to what the other set might be. A good question might be why compute it that way. The per-lesson words are a weird subset of word forms, after all.
By the way, some of your links are dead.
There is another phenomenon I have run into several times (3-4). After I have completed or reviewed a skill, I have been notified that I have strengthened another, different skill further up the tree. This suggests that those two sets of books are somehow being crosschecked, at least partially. Maybe the algorithm is more complex than it first appears.
Greg already pointed out that you were looking at the wrong values, but I just want to elaborate:
Duolingo seems to have a function which takes some lexemes as input and outputs either 0.25 (1/5 bars), 0.50 (3/5 bars), 0.75 (4/5 bars) or 1 (5/5 bars). This is the function that is used to determine the strength of a skill.
The interesting question is which average word strength the lexemes should have for the function to output 1. What you are looking at is instead the per lesson output of this function.
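A sketch of that step function (the 0.25/0.50/0.75/1 output levels come from the post above; the cutoff values are placeholders, since finding the real ones is exactly the open question):

```python
# Step function mapping average lexeme strength to the four observed
# outputs. The output levels are from the discussion above; the cutoffs
# below are invented placeholders, not measured values.

def skill_strength(avg_strength, cutoffs=(0.5, 0.75, 0.9)):
    lo, mid, hi = cutoffs
    if avg_strength >= hi:
        return 1.0    # 5/5 bars: skill shows gold
    if avg_strength >= mid:
        return 0.75   # 4/5 bars
    if avg_strength >= lo:
        return 0.50   # 3/5 bars
    return 0.25       # 1/5 bars

print(skill_strength(0.95), skill_strength(0.6))  # 1.0 0.5
```

Pinning down the real cutoffs would take the kind of before/after measurements described elsewhere in this thread.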
This speaks again to Duo keeping two sets of books, but it gives a hint as to what the other set might be.
Duolingo seems to be running several A/B tests related to strengthening skills, so our experience may vary.
By the way, some of your links are dead.
The imgur link should be working, but the duolingo link will only work when viewed using the browser's console. You will only see that particular json if you navigate to a skill and look at the network objects being downloaded. A json object named after a skill will contain the information I stated.
When I do my flashcards I get the same words over and over and over. There have been fewer than 40 words used across about a dozen separate flashcard sessions over the past couple of days. It's very annoying to get the same words I already know over and over when I've been getting them for days.
It doesn't exist for Dutch yet, but if you go into French, Spanish, German, Italian, or Portuguese, the words tab is up there with home, activity and discussion (plus immersion in those languages). I think I've heard that words has a good chance of being added to the newer courses like Dutch soonish, although I personally don't find it all that useful...
I think you may have missed the link in Dessamator's reply, which is the very top reply on this thread (it's easy to overlook, so here it is). The one you linked was posted at 18:14:03 UTC on 2013-06-12. Because of the way that Duo rounds dates to "days/weeks/months/years ago", it will continue to say "1 year ago" until 18:14:03 UTC on 2015-06-12.
There's nothing more recent from tatou as his responsibilities within Duolingo have changed since these posts were made, and I've not seen any comparable information offered from any other Duo employees.
In my experience this seems to be the case: the questions I'm asked in general strengthening exercises are useful and varied, while the flashcards are utterly useless and maddeningly repetitive. Individual word strength, and consequently flashcards and skill strength, seem utterly unrelated to how well or how often I've translated the words: the same ten to fifteen words have appeared in flashcards every day all week, even though I knew 90% of them well from the beginning. I get them right, tell it so, and the next day they're down to two strength again. Meanwhile there are words I've met just once which have never dipped below 4 bars.
I think the flashcards could be useful if they were hooked to the same kind of factors that decide when I need to cover a word in an exercise. I'm happy that what general strengthening exercises ask me to do is helping me learn.