Have you ever wondered what a language you understand sounds like to someone who doesn't understand a single word of it? I've heard it said that my native Swedish has a lot of melody to it, like we're almost singing. I can't hear that at all myself, but perhaps it's easier to hear things like this if you don't understand the words being said?
A long time ago I had a bit of inspiration on how this could be done. It's been so long now since I started fiddling with this that I no longer remember what sparked my interest, but from time to time I've been returning to the project. And now, after a lot of work and tedious web searching, I think I may have found a way – and if I'm right this is what English sounds like:
I started out looking for a way to do this with Swedish, but for reasons discussed later it turned out to be easier to work with English. Plus, it's more likely that whoever reads this understands English better than Swedish.
To know for sure if my hypothesis is correct I suppose I'd have to locate a hundred people or so who don't know a single word of English and set up a test where they classify a series of audio clips as either language A or language B, and if their combined classifications don't correctly separate the real English clips from the fake ones I can be fairly sure that my hypothesis is correct… Seems like a lot of work though, so I'll just assume that I'm correct!
So, how did I generate the "English" above? It's quite simple, really: first a bit of processing of proper English, then a bit of generating "English".
- Download the English Wiktionary database
- Extract IPAs for English words
- Download a long English text
- Convert the text into a stream of IPA phonemes
- Markov-process the IPA stream
Some of the steps above might not be totally obvious to everyone, so here's a quick intro to some of the concepts.
IPA, isn't that hipster beer..? Nope, not in this case: here IPA is short for "International Phonetic Alphabet". It's a way to describe how a word is pronounced. There may be only one way to spell "program", but there are many ways to pronounce it:
- BBC English/Received pronunciation: pɹəʊɡɹæm
- General American: pɹoʊˌɡɹæm
- Southern American: pɹoʊɡɹəm
Examples from Wiktionary.
If I could find it somewhere, I would have added IPA to show the Swedish pronunciation too, but my search skills have failed me. This is the main reason I didn't create fake Swedish instead of fake English: I couldn't find a good source of IPA for lots and lots of Swedish words.
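Pulling the IPA out of the Wiktionary dump boils down to spotting pronunciation templates in each page's wikitext. Here's a minimal sketch of the idea; the template format varies between entries, so a real extractor needs to handle more cases, and the `extract_ipa` helper is just a name I made up:

```python
import re

# Wiktionary wikitext marks pronunciations with templates like {{IPA|en|/.../}}.
# Real entries vary (multiple accents, extra parameters), so treat this as a sketch.
IPA_TEMPLATE = re.compile(r"\{\{IPA\|en\|/([^/]+)/")

def extract_ipa(wikitext):
    """Return the first English IPA transcription in a page's wikitext, or None."""
    m = IPA_TEMPLATE.search(wikitext)
    return m.group(1) if m else None

sample = "===Pronunciation===\n* {{IPA|en|/ˈpɹoʊˌɡɹæm/}}\n"
print(extract_ipa(sample))  # ˈpɹoʊˌɡɹæm
```

Run over every page in the dump, with the page title as the word, this gives us the word→IPA dictionary used below.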
Markov chains are a bit of fancy mathematics for modelling how likely various sequences are in whatever you're analyzing. In our case we want to know what the most likely next sound is while we're generating a word. If we've started with "th", it's pretty likely that some vowel sound will follow, and pretty unlikely that we'll see a "z", for example.
We generate our chains by analyzing all the proper English IPAs we've downloaded from Wiktionary, basically creating a long list of sequences that occur and how common each sequence is. To make things a bit more realistic than just analyzing each word in the dictionary once, we use a real text as source - this will give us a better picture of how common the sounds are in actual use. We give more weight to common words, basically. I suppose we could do this by finding a database with information on how commonly words are used, but with the current approach we can easily emulate different usages of our target language.
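The counting itself is simple. Here's a toy version, assuming the IPA stream has already been tokenized into single-character phonemes (real IPA has multi-character sounds, so a real version would tokenize first) with "#" marking end of word:

```python
from collections import Counter, defaultdict

def build_chain(phoneme_stream):
    """Count, for each phoneme, how often each possible successor follows it."""
    chain = defaultdict(Counter)
    for cur, nxt in zip(phoneme_stream, phoneme_stream[1:]):
        chain[cur][nxt] += 1
    return chain

# Toy stream: "this that this", with '#' as the end-of-word marker.
stream = list("ðɪs#ðæt#ðɪs#")
chain = build_chain(stream)
print(chain["ð"].most_common())  # [('ɪ', 2), ('æ', 1)]
```

Feeding the whole IPA-converted source text through this gives us the statistics we sample from later.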
For example, my source text is Romeo and Juliet by William Shakespeare, who probably uses different words than we'd find if we fed pop culture gossip into the machine. I used the "Plain Text UTF-8" variant, since we're only interested in the words and not fancy formatting.
We get the raw data for our Markov analysis by replacing each word in the text with its IPA representation, skipping any word we don't have IPA for. This will skew the data, but for a first test it'll have to be good enough.
Now that we have a model of English, all we have to do is create some fake English. We do this in two steps:
- Generate new IPAs
- Use a text-to-speech engine to convert to audio
Generating our fake English IPA sequence is pretty straightforward: take the Markov analysis and run it in reverse. First select a probable first sound, then ask the statistics which sounds are most likely to come next and randomly select one of them (taking their relative probabilities into account). Continue until you've generated a suitably long sequence of sounds.
In our analysis we consider "end of word" to be a "sound", so we'll automatically get our output separated into words (and likewise for sentences).
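A minimal sketch of that sampling loop, reusing the end-of-word "#" convention (the toy chain here is hand-made; the real one comes from the Romeo and Juliet analysis):

```python
import random
from collections import Counter

def generate(chain, start="#", n_words=4, rng=random):
    """Walk the chain, sampling each next phoneme in proportion to its count."""
    out, cur, words = [], start, 0
    while words < n_words:
        choices = chain.get(cur)
        if not choices:
            break
        phonemes, counts = zip(*choices.items())
        cur = rng.choices(phonemes, weights=counts)[0]
        if cur == "#":              # "end of word" is a sound too
            out.append(" ")
            words += 1
        else:
            out.append(cur)
    return "".join(out).split()

# Hand-made toy chain; every word it can produce is either "ðɪs" or "ðæt".
chain = {
    "#": Counter({"ð": 2}),
    "ð": Counter({"ɪ": 2, "æ": 1}),
    "ɪ": Counter({"s": 2}),
    "s": Counter({"#": 2}),
    "æ": Counter({"t": 1}),
    "t": Counter({"#": 1}),
}
random.seed(0)
fake = generate(chain, n_words=3)
print(fake)  # three "words", each either ðɪs or ðæt
```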
Text to speech
Ok, so now we have a bunch of IPA that should sound like English. But how can we tell? The tedious and hard way would be to learn how to pronounce all the IPA characters we've used and then painstakingly sound out each fake word. Possible, but not very appealing.
When I'd gotten this far, I looked into various text to speech systems. There are a few open source variants, but I wasn't terribly impressed with the realism of the generated sound. I'm sure it's usable enough for most purposes, but since what we're after is to know if it sounds natural… We need something better.
I dabbled a bit with macOS's built-in voices, but it seems the new, good voices used by Siri don't understand IPA. Some older voices do, kind of, but they don't sound very natural.
I found Google's Cloud Text-to-Speech API, and thought I'd found what I was looking for. The audio generated when you enter normal text sounds pretty good! So I tried giving it a bunch of IPA, but instead of speaking all of the phonemes, it used only a few of them - likely the ones it interpreted as normal characters. I did a bit of searching around, and it seems they don't support phoneme input. Boo!
Next in the list of search results we find Amazon's Alexa. They've got an API for creating "skills", the Alexa Skills Kit, where we see that they do support phoneme input! If only there was a way to use the API without having to set up a developer account and create a "skill"… Oh well. I'm sure I'll have a use for that login sooner or later.
But wow, there were a lot of agreements, licenses and whatnot to approve before I got far enough to create my skill! Hopefully I won't be in too much trouble for using their tech to pronounce gibberish.
On the Alexa developer console, on the Test tab, I found something useful: the Voice & Tone sub-tab lets you enter SSML (Speech Synthesis Markup Language)!
We input this:
<speak>
  <phoneme alphabet="ipa" ph="ðɪs ˈmɔː tuː klaʊdz ɒv maɪn lɜːn miː ɹɒks ɪt æm ðaʊ diːmz tuː"></phoneme>.
  <phoneme alphabet="ipa" ph="aɪ ɹɪˈfɔː(ɹ)"></phoneme>.
  <phoneme alphabet="ipa" ph="ʃiː ˈsaʊn tɹuːm sʌt͡ʃən wiː kænɒt fɪəfəli tɪɹz ˈmɔː pleɪ"></phoneme>.
  <phoneme alphabet="ipa" ph="bɹɔːt ˈɛkst ðaɪ ˈnɪʃməntl̩ tɛkt ðiː teɪn pɑːt eɪ̯ ˈmʌnθ"></phoneme>.
</speak>
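Assembling that SSML from the generated IPA "sentences" is just string glue. A minimal sketch (the `to_ssml` helper is my own name for it, and how you group words into sentences is up to you):

```python
def to_ssml(ipa_sentences):
    """Wrap each generated IPA sentence in an SSML <phoneme> element."""
    parts = ["<speak>"]
    for sent in ipa_sentences:
        parts.append(f'<phoneme alphabet="ipa" ph="{sent}"></phoneme>.')
    parts.append("</speak>")
    return " ".join(parts)

ssml = to_ssml(["ðɪs ˈmɔː tuː", "aɪ ɹɪˈfɔː(ɹ)"])
print(ssml)
```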
And we can enjoy listening to Marklish!
Some carefully applied use of Rogue Amoeba's Loopback, and we've captured the audio and put it in a file. Then ffmpeg can convert it to ogg and mp3 so that most people can hear it on this page. Let's hear it again!
So, there you have it: Marklish, a Markov-generated English-like language.
I note that some of the fake English is actually proper English, but that's only to be expected since a lot of words are both common and short - easily replicated by the random generator. Perhaps it would have been better to add a tweak to the generation, so that we skip any words that are known real words. But on the other hand, there are only so many short sound combinations possible - and we still want this to sound like English.
If you'd like to give it a try yourself, the messy code I've used to do this is available on Bitbucket. The code is a ball of mud, created by borrowing various bits and pieces from anywhere I could find them. Hopefully I haven't forgotten to note where I found the pieces… In any case, thanks to everyone who put their building blocks on the net for anyone to use to build things!