@DuoLingo: Metrics, Statistics and Control Groups
[EDIT: If you downvote this, please also post a comment saying why. Especially if you are a member of Duolingo core team.]
Dear DuoLingo staff,
while everyone is praising or trashing the new design, I would like to talk to you a little about statistics. I have a couple of questions that I have asked in random other places, but it turns out they are all related. I hope you guys don't feel like I am trying to insult your intelligence. I am perfectly aware that you are a bunch of very smart people. But you have been very busy lately, and even the smartest people can only see things they choose to look at in the first place, unless those things bite them in the leg.
Question 1: Did you have a control group that could switch between the two designs?
Reason for asking: Having a group like that would have provided some very useful metrics, such as user preference, which does not necessarily correlate with retention and performance, and also whether having the ability to switch themes influences people's other performance indicators. Since the design currently in use is obviously just a flag in the database at the moment, this would have been really easy to implement.
Question 2: Do you have the ability to control for age, and did you use it? (related question: do you know what percentage of your user base is less than 16 years old?)
Reason for asking: A couple of hours after reading Luis' post, the following block jumped out at me:
"According to our metrics with over 300,000 users, [...]People spend more time on the site, get farther down the tree, come back more often, share more on Facebook, get more items on the store, etc."
Now, to me those particular metrics indicate:
- people having more time
- people being more comfortable with social media
- people having better learning ability
To put it briefly: kids. As I wrote in a different post, the very people who will be more flexible and comfortable using the "less boring" new design are the same people who will innately have a better ability to learn languages -- and incidentally also show the rest of the above behaviour patterns.
So if you DIDN'T control for age, what your metrics might really show might not be (quoting Luis) "that getting used to the design change is simply a matter of time" but that kids are more comfortable and less bored using the new DuoLingo design. The "random" nature of it would look like this:
- An adult person might get either design, and stick to it just because they are doing it for the sake of language learning, and suffer through whatever problems they might have with the design for the sake of learning.
- Alternatively, they might log in for the first time, see the new design, and (judging by the comments) leave with a headache for a different web site or a text book.
- A young person might log into the old design, screw around with the site, and get bored.
- The same young person might be more motivated by the new design to stick to it for longer.
So you would have adults as "background noise" that could learn equally well or equally badly, unless they get a headache and leave, and the younger demographic who are more likely to stick around if they are less bored, i.e. with the new design, and who would perform better, have more time, share it more on Facebook, and so on.
So if you didn't or couldn't control for that, there would indeed be a correlation between the new theme and better KPIs, but not a causal one, and probably not for the reasons that you'd like.
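To make that mechanism concrete, here is a minimal sketch with entirely invented numbers. It assumes two age groups, equal random assignment to both designs, and (the key assumption) that a retained user behaves exactly the same under either design. Only the share of each group that sticks around differs. The names `AgeGroup` and `retained_average` and every rate are hypothetical, not anything Duolingo has published.

```python
from dataclasses import dataclass


@dataclass
class AgeGroup:
    name: str
    assigned_per_design: int  # users randomly assigned to EACH design
    stay_rate_old: float      # invented share that keeps using the old design
    stay_rate_new: float      # invented share that keeps using the new design
    minutes_per_day: float    # activity of a retained user, same for both designs


# All figures below are made up purely for illustration.
GROUPS = [
    AgeGroup("adults", 1000, stay_rate_old=0.60, stay_rate_new=0.50, minutes_per_day=15.0),
    AgeGroup("kids", 1000, stay_rate_old=0.30, stay_rate_new=0.60, minutes_per_day=40.0),
]


def retained_average(groups, design):
    """Average minutes per day among the users who stayed with `design`."""
    total_users = 0.0
    total_minutes = 0.0
    for g in groups:
        rate = g.stay_rate_old if design == "old" else g.stay_rate_new
        retained = g.assigned_per_design * rate
        total_users += retained
        total_minutes += retained * g.minutes_per_day
    return total_minutes / total_users


old_avg = retained_average(GROUPS, "old")
new_avg = retained_average(GROUPS, "new")
print(f"old design: {old_avg:.1f} min/day, new design: {new_avg:.1f} min/day")
```

In this toy setup the new design shows a higher average even though nobody's individual behaviour changed. The whole difference comes from who stayed, which is the kind of effect the question about age is getting at.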
Your tu oso bebe mucha cerveza, ahem, nosotros somos tortugas.
If you have a control group that can choose which interface to use, then obviously that will affect the results. Prior knowledge/experience affects one's ability to use software:
As expected, perceived ease of use of an IT tool is affected by the functionality of the tool and the individual's experience with the tool.
Controlling for age may seem logical, but it is not practical. How exactly will you determine the age of 300,000 people? Short of grabbing their IDs or having scientists physically determine how old they are, it is simply not feasible, and people lie for no other reason than vanity, personal preference, or playfulness. Also, age may not play as important a role as one may believe it does:
The literature supports the idea that adults are very capable of learning well into their seventies which is a good reason to accept lifelong learning as more than just a pleasant mantra.
I think users are trying to find ways to explain that their liking of something has to coincide with everyone else's. However, users are different, and there are millions of users from different cultures on Duolingo. In fact, studies have shown that culture affects the [perceived] usability of software.
Of course it would affect the results. That effect is exactly what would be interesting.
And of course an individual's experience with the tool matters. If that group, too, starts either with the choice to select one in the beginning, or randomly with either of the two, that would be background noise. But anything that moves away from 50/50 would be signal.
And yes, we can learn until we are 70. Yet both the way you learn and your retention differ vastly between the time when you are a prepubescent kid and when you are an adult. So do your choices.
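How far from 50/50 a split has to move before it counts as signal rather than noise can be checked with an exact binomial test. The following is a minimal sketch, assuming each user in the switch-capable group ends up on exactly one design and chooses independently of the others. The function name `two_sided_p_value` and the sample counts are made up for illustration.

```python
from math import comb


def two_sided_p_value(chose_new: int, total: int) -> float:
    """Exact two-sided binomial test against a 50/50 split.

    Returns the probability of seeing a split at least this lopsided
    (in either direction) if users had no real preference.
    """
    # With p = 0.5 the distribution is symmetric, so look at the larger side.
    k = max(chose_new, total - chose_new)
    upper_tail = sum(comb(total, i) for i in range(k, total + 1)) / 2 ** total
    return min(1.0, 2 * upper_tail)


# Hypothetical counts: the same 60/40 split at two different sample sizes.
for chose_new, total in [(60, 100), (600, 1000)]:
    p = two_sided_p_value(chose_new, total)
    print(f"{chose_new}/{total} chose the new design: p = {p:.3g}")
```

The same 60/40 split is borderline with 100 users and overwhelming with 1000, so how much data such a group would produce matters as much as the split itself.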
My point is, it would be interesting to know whether there are different groups which benefit from different layouts. "One size fits all" doesn't even work with condoms, so why try to squeeze everyone and every device into the same schema? Why not try finding out whether differences matter?
While I agree with your condom analogy, there is no way one can make a good judgement when one is affected by other nuisances. For instance, there are many people who don't eat certain kinds of food because they think it looks ugly or smells bad. Yet once they taste it they may change their judgement, either for the better or the worse. But whatever their perception of the taste is initially, it will certainly be affected by the prior knowledge (ugly or smells bad).
Now, if we take the food, remove the smell, blindfold people and feed it to them, we have a judgement that is based solely on taste, nothing more and nothing else. In fact, Prof von Ahn was also so biased that he at one point attempted to make the half-hearts system more appealing, simply because it seemed nice and for whatever reason the Prof liked it. So knowing of the existence of a different interface affects your judgement of how good the current one is, and not liking change will make you hesitant to use the second interface, and so on.
That is their reasoning based on existing research, I presume.
P.S. The methodology you propose may work too; they just don't want to use it for whatever reason. Or perhaps they did use it: the moderators and other people took part in helping to submit bug reports and improvement suggestions. So they did exactly what you propose (most likely excluding age) in a kind of limited case study.
You've gotta love any thought that begins with "While I agree with your condom analogy".....
I admit that I'm here primarily because I saw that on my stream. Some things, you just gotta know the backstory.
The dirty secret of it all is probably that user retention for a totally voluntary, difficult thing like learning a second language is very low. So it is worth the risk to upset a lot of people if you can boost that even a little. It reminds me a little of Digg version 4 and their desperate attempt to boost Digg's numbers at all cost. The difference is the new Duolingo version basically does still work. I was shocked that none of the problems I noted a month ago were addressed and it was rolled out to 50% of users anyway. So they really were purely looking at data and not user feedback. Pretty crazy way to test something.
In my opinion that's the best way. People can lie about their like or dislike of something, but they can't lie about using a website, using functionalities, and what they click or don't click. In fact, users are notorious for never knowing what they want until they see it.
Yeah sure that's an easy way to measure dislike. But I was talking about obvious bug reports.
What I am writing up there is not "user feedback", it is "what layout the user decides to use and how long he sticks with it". The user doesn't even need to know that it is feedback, he just needs to use what's more comfortable for him, as opposed to being forced to use something and then his performance being measured against other users who are different from him and have different likes, dislikes and set of abilities.
Well, the bottom line is that Duolingo is a language learning service. If the users are committed to learning a language, they will stick around or find solutions to overcome the perceived "ugliness" of the interface. If they aren't, they will simply move on. When users first came to Duolingo, they were presented with an interface, the "old" one, which was probably itself a new one developed after several iterations. At least from what I read, I know the immersion interface changed; not knowing what it was before allows me to give an unbiased and objective view of the current interface.
The "old" one was a new old one. Trees used to be organized differently. There weren't levels, but lines that connected different lessons. No one really complained then. The functionality of the website hasn't changed (save the vocab section, which the team is reconsidering and most likely reworking). The people that are complaining are just whining about a new coat of paint.
That's all the update is. A new coat of paint.
Why are you so against the new design? You're being a ludicrous reactionary. You'll get used to it after a while. The team will continue to improve the site, but the old format is gone.
I am not against the new design, even though I don't particularly like it. I can get used to it. What I am against is a bunch of people who could benefit from language learning and have the time to do it (namely older people) leaving the site because one stupid button ("switch designs") has been left off. Search for "headache", or "contrast", or "eye sight", and you'll see what I mean. I am against a "one size fits all" solution, because one size just doesn't.
If eye strain is your main problem and aesthetics mean nothing, then you can invert the screen, just like I am now. It's really easy to invert screen colors on Mac, PC, and Linux.
It would be too hard for the team to develop for two different themes. For every update, they would have to add and remove graphics for both versions, which would be too hectic. It would get in the way of languages, which is the point of the site.
If people are bothered, it's easy to invert the colors.
And how is inverting going to fix the font being too small in comparison to the icons, so that you either have to scroll, or, if you zoom out, you can't read anything?
Also, aesthetics are very subjective. I don't find Duolingo aesthetically pleasing (never did in fact, even though some of the details were cute), but I am just here for the language. If I had let the aesthetics deter me, I would have left after the last design update.
Luis has stated that they are working on those problems so that zoom won't be an issue. In any case, I'm done. The new look is here to stay. I saw farther down that you are arguing over whether age should have been taken into account. Duo doesn't ask for age during registration, so there is no way to know. They were using a random sample of users and users here are not divided according to age in the first place.
Argue 'til you're blue in the face. It won't do anything.
Ahem, can you please look again at the original post. I am not arguing, I am asking DuoLingo whether they have controlled for it. And hey, no, the new look is not here to stay. First of all as you are saying in your very post, they are already working on changing it right now. Second, they probably will change it a year from now to fit the newest design fad. And third, as I have already written several times, I truly don't care what this site looks like. If it was black text on white background, I would use it anyway, because it is a great resource.
The question is whether it is necessary to have people "find solutions" to something that is supposed to be designed for them to use it most effectively (which involves people being comfortable with it), or whether you can actually make both groups happy now, and then find out whether a middle ground exists.
Yes, it is as you said, a condom does not fit all. There are cases where there may be sizes that the manufacturers didn't account for. This even happens with women's undergarments (bras), that simply cannot fit for everyone. In these cases, the women or men must request custom sizes. Indeed, even the site is also being customized for various special cases, it has been localized to Russian, Turkish, and so forth. They may even localize it to a language that will never be taught, if there is a sufficient demand.
That's a simple reality of life.
This may be a silly question, but why would people lie about liking/disliking something? Unless you're asking them questions about condom-related situations or the like. Then that's a big thing. But what kind of motive would people have to lie en masse about their likes and dislikes on a site like this?
Well, there are many reasons, but one simple reason is 'group-think'. When someone perceives that the majority likes something, they may also lean that way, regardless of whether they do or not. There are also kids around who may just randomly answer the questions, and other people who may just rush through the survey.
In fact, even if I were to ask people in this thread their ages, chances are that some if not most of them would lie about it, even though they are protected by their pseudonyms. It is just human nature.
Hmm, that's a good point. I remember reading through journal articles on group decision-making. I don't have any links handy, but I can say that the literature made me nervous about how many committees are making decisions in my government. I also admit that I lie on surveys for no reason. I don't lie about questions that I think are important, especially anything related to personality or morality. I won't purposefully lie if I think it might cause someone grief, or damage a worthy cause. I don't think I've ever run into a pure pathological liar with no underlying disorder, which is kind of odd given the statistics.
Oh, no, let's not talk about Digg.
Unfortunately, this redesign (and the associated complaints) seemed very reminiscent of what happened when Digg changed their format. I tend to think the same thing that happened to Digg won't happen here because what DL offers is so much more valuable, but it does remind me that no one likes huge, dramatic changes all at once.
Although, I'm as surprised as you at the seeming lack of support when this new version was released in December (I think). I think I saw a report by you in troubleshooting, and when I was able to use the new version, I still had the same problem two weeks later. I checked back on your post, and there was no response. I think they would have received a lot less push-back if the new version had functioned a bit better from the get-go.
Currently, I am still in the 'old version'; hopefully some of the kinks will be worked out by the time I'm updated.
Is your second question implying that kids younger than sixteen should not be learning languages on Duolingo? I believe that people of all ages, young and old, should have the ability to learn new languages whenever they want. In fact, in terms of statistics, the younger a person starts learning a new language, the better they'll learn it and the more likely they'll remember it. Also, can't adults spend a lot of time using Duolingo? If so, does it mean that they have too much time on their hands? Some people are very devoted to learning new languages and sharing these types of websites with people who need them. Don't you think so? (I'm sorry, I don't mean to sound rude here, I was just wondering what you mean by that)
I am absolutely not implying that kids should not learn using Duolingo (in fact I think everyone SHOULD). I wrote right there that kids learn faster than adults (you didn't read that far, huh?). I am also not implying that anyone has "too much time on their hands", or that no adult has a lot of time. In fact, I am not implying anything at all. I am saying quite clearly that statistically, the group of people that best matches the KPIs that Luis uses are people who haven't yet entered working life -- and that it might have been a good idea to find out whether different user groups / age groups benefit differently from different layouts. And I am asking the Duolingo team whether they have done that or not.
Oh, I apologize, I suppose I had to question this because, well, I'm considered a "kid". And yes, I did read that far, but I was slightly skimming over it...
That reminds me: I told Duolingo about the page scaling problems a month ago, and that's functionality that an older "farsighted" age group might depend on more.
So, let me get this straight: You are suggesting that a huge shift in demographics took place with the new version, and somehow Duolingo's metrics were unable to detect this. How again is that not a bit insulting to the Duolingo team's abilities?
It seems to me that you already decided that the new version would only be preferred by kids, and now you try to fit the data to your conclusion. This is called bias, and it is something you want to avoid!
"Now, to me those particular metrics indicate
- people having more time"
Maybe if you already concluded that a shift in demographics took place, but otherwise there is really nothing about "spend[ing] more time on the site" that suggests how much time these people have.
- "people being more comfortable with social media."
I disagree, but I do not have anything to back my claim up with, so your guess can be as good as mine. However, when I look at posts on Facebook, it seems to me that the people who post the most app posts are the people who are the least comfortable with Facebook.
- "people having better learning ability"
This seems to me to be entirely made up. Which part of the metrics suggests that people had an easier time learning?
I find it quite interesting how people tell me what I am suggesting, pulling out words that were nowhere to be seen in my post. For example, "huge shift in demographics". No one said anything about huge, neither I nor, if I am not mistaken, Luis. "Significant" often means "statistically significant", which in demographic experiments often is only a few percent away from "not statistically significant". As long as we don't know the numbers, anything here, especially "huge", is pure guesswork.
About your bullet points:
If you have more free time, you can spend more of it doing different things, even if the percentages of the things you are doing are the same as for everybody else. If you have 8 hours of spare time a day, you have the leisure to devote more of it to any one task. For example, you could spend 25% of it doing DuoLingo and the other 6 hours doing something else. Whereas if I only have 2 hours of spare time a day, even if I spend 100% (!) of my spare time Duolingoing, I am only barely catching up with your 25%.
If you are going farther down the tree, there can be several explanations for that:
- You have more fun doing it
- You have more time for doing it (this makes it dependent on the first variable, time).
- Or you are doing it faster, because it is easier for you, even if the other two factors are 100% identical. This is why, when it is easier for you to learn, you are going to get further down the tree than me if all other factors are the same.
Just being on Facebook and using it to sign in to Duolingo is a certain kind of bias: Either you are comfortable sharing your information with the world (because you are adapted to it in one way or another), or you don't care because you don't know or care enough about privacy (yet). Most tech-savvy people my age in my circles either don't use Facebook at all, or have it fenced off and don't share everything they do, due to lack of time (yet again, that dependent variable) or privacy concerns. Often those people don't even MAKE it into DuoLingo test groups because they have ad- or tracker-blockers and that interferes with the data acquisition.
As to me having already reached a conclusion: What I have seen on the threads is a lot of complaints by people with bad eyesight or headaches triggered by the contrast. Not one complaint, not even half a dozen - many. I can't help but be biased by that to some degree.
However, notice that I am not shouting loudly "Duolingo, you made a mistake". I am asking whether it is possible that they have missed a certain source of bias, and explaining my reasoning now, risking making a fool out of myself, instead of springing it like a trap on them later if they actually do have a problem with the data. Why? Because I am not interested in these kinds of games - I am interested in DuoLingo working properly and everyone being able to enjoy it, no matter the demographic. And yes, I am even proposing an easy, affordable workaround if that problem exists. Check out how many other complainers do that.
Okay. Fair point, but changing the word from huge to significant does not really change anything with regard to my question. You think a significant shift in demographics took place with the new version, and somehow Duolingo's metrics were unable to detect this?
You do not have to spell out how more free time can lead to more time spent on Duolingo; this is still assuming that a shift in demographics took place.
If you go farther down the tree, it might simply be because you spend more time on the site.
You think that it is probable that Duolingo simply looked at the number of Facebook posts, without adjusting for the number of users who have connected their Facebook account to their Duolingo account and how much time these users spend on Duolingo? I find this hard to believe!
I am sorry, but I fail to see that there is anything in the data, that suggests your conclusion. Yes, you can fit the data to your conclusion if you try to, but this is really not how one is supposed to do statistics.
I am not sure what field you are coming from, so forgive me if I state a couple of facts that you are already familiar with. A lot of people are not.
When you gather data, in general you have a model that you want to test. And if your model is "user behaviour", it might not model "user demographics" very well. So yes, it is not impossible that DuoLingo doesn't have the data to detect shifts of demographics if it wasn't their goal to look for shifts of demographics. Especially if that data is hard to collect, and was thought to be irrelevant to the experiment. This is precisely what I meant when I wrote "you can only see what you decide to look at."
Often, people rely on random sampling to avoid statistical bias. I have seen Luis bring up random sampling as a solution to some of the statistical problems people posed. However, specifically with demographic experiments, it is often not enough, because you have dependent variables that could make your results LOOK one particular way, even if it is not so. This is what http://en.wikipedia.org/wiki/Controlling_for_a_variable is for.
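One common way to control for a variable is stratification: compare the two designs within each age group separately, then combine the per-group differences weighted by group size. The sketch below uses a tiny invented data set and assumes an age group is known for every user, which is exactly the thing in question here. The record layout and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# (age_group, design, minutes_per_day) for retained users; every value is invented.
records = (
    [("adult", "old", 15)] * 6 + [("kid", "old", 40)] * 3
    + [("adult", "new", 14)] * 5 + [("kid", "new", 42)] * 6
)


def naive_difference(records):
    """New-minus-old difference in mean minutes, ignoring age entirely."""
    by_design = defaultdict(list)
    for _, design, minutes in records:
        by_design[design].append(minutes)
    return mean(by_design["new"]) - mean(by_design["old"])


def age_adjusted_difference(records):
    """New-minus-old difference computed within each age group,
    then averaged with weights proportional to the size of each group."""
    cells = defaultdict(list)
    group_sizes = defaultdict(int)
    for age, design, minutes in records:
        cells[(age, design)].append(minutes)
        group_sizes[age] += 1
    total = len(records)
    adjusted = 0.0
    for age, size in group_sizes.items():
        within_group = mean(cells[(age, "new")]) - mean(cells[(age, "old")])
        adjusted += (size / total) * within_group
    return adjusted


print(f"naive difference:        {naive_difference(records):.2f} min/day")
print(f"age-adjusted difference: {age_adjusted_difference(records):.2f} min/day")
```

With these made-up numbers the naive comparison suggests a large gain from the new design, while the age-adjusted one is close to zero and hides opposite effects in the two groups. Real data could of course look completely different, which is why the question is whether the data exists at all.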
Notice that my question is "do you have the ability to control for age". I am explaining why not controlling for age might affect their results, and by which mechanism. It is a model that they can easily disprove if they have the data to do it.
I just do not think you made a very good case for your hypothesis, as some of your reasoning seemed a bit preconceived.
If their metrics showed that people had been learning faster with the new design, do you then think that Luis would not have mentioned it in his post, where he tried to convince people that their metrics actually did show that people learned better with the new design?
Do you think the Duolingo team would not have noticed if the increase in Facebook posts were due to a higher percentage of "high activity" profiles that were linked to a Facebook account?
Duolingo does not need to rely on users telling them their age to detect a demographic shift. There are many other factors they can look at, for example the time of the day people are using Duolingo. These data are not hard to collect, actually it is rather hard to avoid collecting them.
The fact that some people have had age related problems with the new design does not imply that young people like it more or that old people like it less. They are only a minority, and I am sure that Duolingo will do their best to resolve these issues.
Luis did directly say that his conclusion was that the design was more effective, i.e. made people learn better.
Depends on whether they have access to that much of your data on Facebook (i.e. how much you post overall).
Retired people will have a similar time-of-day access profile to kids, but very different problems. People who have to be at school will have a similar time-of-day profile to working people, but different ages. ALSO, in order to know what "time of day" is for those people, you have to analyze their IPs for geographic location. They could do that with their existing data, but did they?
I said that the site might have age-related benefits and problems. It doesn't imply like or dislike, it implies suitability for different age groups.
How about we stop guessing and wait for someone to comment who actually knows what data they gathered and whether they can make conclusions from it or not?
A couple of disclaimers, because I have the impression people have a bunch of misconceptions about my motives for asking this.
So that people understand better where I am coming from: I have studied Bioinformatics and live with a person who does epidemiology and social science for a living. I am a computer programmer.
I like some of the ideas behind DuoLingo very much, and am devoting a significant percentage of my time both to studying languages with it and to trying to help make it better as a(n unofficial) volunteer.
I definitely have no gripe with kids using this site. I started surfing the web when I was 12 myself, and would have started much earlier if it had been available back then. I think this resource is excellent for kids (despite a couple of less than appropriate sentences), because they can benefit the most from it. I wish I had had something like this when I was a kid, when I had the time and brain capacity to utilize it better. I would be speaking a lot more languages now.
I am here to learn. I didn't like the "new old" duolingo design, and don't like the new one either, so for me it truly doesn't matter whether they switch back, change things, or do whatever (even though I hope they fix a couple of things that are truly broken.)
I don't downvote any of the comments in responses to this post or any of my comments on it (I do it in other people's posts since there I am a part of the audience, here I am not.) So every "-" that you see here is feedback by other users.