The Common Voice project needs your voice in Esperanto - and could help the Esperanto Duolingo tree a lot
I hope this doesn’t count as advertising since this is a non-profit project that could be very useful for Duolingo too.
The Common Voice project by Mozilla is building a huge voice database in many languages, including Esperanto. This open database exists to make speech recognition better and available in more languages. So this is basically a project to make machine learning for speech recognition accessible to everyone, especially for smaller languages and for startups without much capital. This is the website in Esperanto:
You can do two things there:
- Donate your voice - you record short sentences in a language of your choice. It is very important that this project gets a diverse collection of recordings. Right now, only 10% of the audio is from female donors, and they are also looking for people under 18 and over 40.
- Listen to the recordings and validate that they are correct. This part is great for training your listening skills in Esperanto, and you learn a lot about different Esperanto accents from different countries.
Right now (Aug. 2019) there are 20 hours of recordings in Esperanto from 144 speakers; 15 hours are already validated. The complete dataset is released under a free license (CC0), so everyone can use it without any restrictions for their own private or commercial machine learning projects.
So how could this be useful for Duolingo? Many language versions of Duolingo support speech recognition, but the Esperanto tree doesn't. I assume that this is because of the complete lack of Esperanto speech recognition software. The Mozilla voice project is the first project I know of that really could change this situation. A lot of startups and big companies will use this dataset simply because it is free. If they find a good dataset for Esperanto, there is a chance that at least a few of them will also train their systems with this data. (For example, Google already supports Esperanto in Google Translate.)
So in the end we might get speech recognition for Esperanto in a few services, and maybe also here on Duolingo. Plus, this dataset could also be very useful for speech synthesizers that use machine learning.
I find it a lot of fun to donate some time to this project every now and then. What do you think about it?
"Right now (Aug. 2019) there are 20 hours of recordings in Esperanto from 144 speakers, ..."
This project has created by far the biggest dataset for Esperanto (and for many other small languages too). But the official aim for every dataset is 10,000 hours, because they say this is the amount of data one needs to train a neural network and get good results. This aim is very ambitious for most small languages. But I do believe that Esperanto could work with much less data, because the hard part in machine learning is teaching all the exceptions and irregularities to the computer. In Esperanto one only has to teach the different accents and voices to the computer, since the pronunciation and the spelling are always completely regular.
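To illustrate what "completely regular" means here, this is a minimal sketch of turning written Esperanto into sounds with a plain lookup table and no exception list. The IPA values are my own approximation, and `LETTER_TO_IPA` and `to_phonemes` are names I made up for this example; they are not part of Common Voice or of any speech recognition toolkit.

```python
import unicodedata

# Approximate IPA value for each of the 28 Esperanto letters.
# Assumption: a simplified one-letter-to-one-sound mapping, for illustration only.
LETTER_TO_IPA = {
    "a": "a", "b": "b", "c": "ts", "ĉ": "tʃ", "d": "d", "e": "e", "f": "f",
    "g": "g", "ĝ": "dʒ", "h": "h", "ĥ": "x", "i": "i", "j": "j", "ĵ": "ʒ",
    "k": "k", "l": "l", "m": "m", "n": "n", "o": "o", "p": "p", "r": "r",
    "s": "s", "ŝ": "ʃ", "t": "t", "u": "u", "ŭ": "w", "v": "v", "z": "z",
}


def to_phonemes(word: str) -> list[str]:
    """Map each letter to its sound; characters outside the alphabet are skipped."""
    # NFC makes sure "ĉ" is a single code point and not "c" plus a combining accent.
    text = unicodedata.normalize("NFC", word.lower())
    return [LETTER_TO_IPA[ch] for ch in text if ch in LETTER_TO_IPA]


if __name__ == "__main__":
    for word in ["paroli", "aŭskulti", "ĉu"]:
        print(word, to_phonemes(word))
```

A comparable table for English would need thousands of exceptions, which is the part I expect a neural network would not have to learn for Esperanto.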
AFAIK no one has ever tried machine learning with a constructed language (due to the lack of data to train the neural network). That is why I'm looking forward to the first experiments with the Esperanto dataset. Maybe we will see some surprises. I believe 100 hours of recordings in the next two years is a realistic aim for the Esperanto community.
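To put the numbers from this thread in perspective, here is a small back-of-the-envelope calculation. It only uses the figures quoted above (20 hours recorded, 15 validated, 144 speakers, the 10,000 hour official aim, and my suggested 100 hours in two years); the variable names are made up for this sketch, and it assumes recordings arrive at a steady pace.

```python
# Figures quoted in this thread (Aug. 2019).
recorded_hours = 20
validated_hours = 15
speakers = 144
official_goal_hours = 10_000

# Suggested community goal: 100 hours within two years.
community_goal_hours = 100
months_available = 24

# Share of the recorded audio that has already been validated.
validated_share = validated_hours / recorded_hours

# Average contribution per speaker so far, in minutes.
minutes_per_speaker = recorded_hours * 60 / speakers

# New recordings needed per month to reach the community goal (steady pace assumed).
hours_needed_per_month = (community_goal_hours - recorded_hours) / months_available

# How the community goal compares with the official aim.
goal_share_of_official = community_goal_hours / official_goal_hours

print(f"Validated so far: {validated_share:.0%}")
print(f"Average per speaker: {minutes_per_speaker:.1f} minutes")
print(f"Needed per month: {hours_needed_per_month:.1f} hours")
print(f"Community goal vs. official aim: {goal_share_of_official:.0%}")
```

So even the 100 hour goal would be only a small fraction of the official aim, which is why the regularity of Esperanto matters so much for this experiment.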
Thank you for sharing this! I will definitely be contributing in the future.
Well, that was exactly my thought, as I participate in the course. Besides, I am 55, so I definitely fit your needs. The problem for me is the technical details; I simply do not know how to do it. Could anyone in the know enlighten me?
That's great! I will try to enlighten you :) Maybe the English version of the project is easier for you to understand (or any other language that you can find in the dropdown in the upper right corner of the website).
You can do everything without creating an account, but if you want to track your work, an account can be very useful. In both cases you can start on the homepage (https://voice.mozilla.org/eo) by clicking on the round icon right next to "paroli" (speak) or "aŭskulti" (listen). If you chose "paroli", the first time you try to record your voice you will likely get a little dialog in which your browser asks whether this website may use your microphone. You have to allow this to start the recording.
If you have more questions, always feel free to ask.
Great project, well worth supporting! I will arrange for people to contribute.