I’m thinking of writing a couple of posts about ideas for things to do with machine learning. This is partly just to get them out of my head - they’ve mostly been swirling around in there for a few years not doing a whole lot, and it might be healthy for them to move out on their own for a while. It might clear my head to dive into other things. Maybe I’ll even get the urge to work on one of them.
Or maybe I’ll get bored of the subject after one post and write about something else.
In any case, let’s start with a fun one:
Using word embeddings to investigate the Voynich manuscript
One of the most unusual organisations to start up in the past few years must be the Earth Species Project. Their aim is to decode animal languages without labelled examples or excessive human interpretation, in the hope of triggering a shift in the global consciousness around environmental issues. The way they plan to do this is by repurposing work on unsupervised word embeddings and unsupervised translation for human languages.
It’s a cool project, although the deep reliance of word embedding techniques on grammatical structure makes me doubt whether the idea will quite work as stated. To my knowledge, there’s no evidence that any non-human animals use language with a grammatical structure that could act as a foothold.
But even if it doesn’t end up working for animal language, there are still fragments of human writing systems that have never been deciphered. Could these unsupervised ML techniques provide a foothold there?
Sadly, this probably won’t be a fruitful approach in most cases due to the extreme paucity of written material - far too little to train a decent word embedding model. The Voynich manuscript is an exception to this rule, containing around 40,000 words in total.
The problem with the Voynich manuscript is its controversial status. It contains a number of illustrations which seem to suggest it’s a book of herbal medicine. It was once widely assumed to be a substitution cipher of a known language, but the statistical properties of the characters are so unusual that some believe it to be a hoax written in a meaningless pseudo-language. However, other linguists have pointed to the natural patterns of word distributions to argue that it’s not a hoax. There have been a number of claimed decipherments, none of which seem to have much support beyond their inventors. In short, the crackpot index is higher than might be ideal. (If you want to dive into the literature, this paper seems like a good place to start.)
Despite the messy status of the manuscript, it does seem like there is useful work that could be done using word embeddings. For example:
Do the word embeddings learned from the Voynich manuscript have a comparable structure to those learned from natural language sources of similar length and type?
Is it possible to use the learned embeddings to determine the grammatical category of any of the words (e.g. which words are nouns) - or at least which words belong to the same category?
Do the embeddings have more in common with some known languages than others?
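To make the first of these questions concrete, here is a minimal sketch of how one might train embeddings on a transcription. I’m using count-based embeddings (a co-occurrence matrix reweighted by positive PMI, then reduced with SVD) rather than word2vec, purely so the example is self-contained; the tokens in the usage example are EVA-style stand-ins I made up, not a real transcription, and `window` and `dim` are arbitrary choices.

```python
from collections import Counter

import numpy as np


def ppmi_svd_embeddings(tokens, window=2, dim=16):
    """Count-based word embeddings: co-occurrence counts -> PPMI -> truncated SVD."""
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)

    # Symmetric co-occurrence counts within a fixed window.
    counts = np.zeros((n, n))
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                counts[index[w], index[tokens[j]]] += 1

    # Positive pointwise mutual information: log P(w, c) / (P(w) P(c)), clipped at 0.
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row * col))
    ppmi = np.where(np.isfinite(pmi), np.maximum(pmi, 0.0), 0.0)

    # Truncated SVD gives dense low-dimensional vectors, scaled by singular values.
    u, s, _ = np.linalg.svd(ppmi)
    k = min(dim, n)
    return vocab, u[:, :k] * s[:k]


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

With real data, one would run this over a full EVA transcription of the manuscript and over comparably sized natural-language corpora, then compare properties of the resulting vector spaces (nearest-neighbour structure, or clusters of vectors as candidate grammatical categories for the second question).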
As a closing note, after a little bit of searching I did find something similar that has been tried in this unpublished work. However, the embeddings used there rely on sub-word units carrying semantic meaning, which may not be a safe assumption given the oddness of the Voynich character-level properties. It would be interesting to do something similar with “standard” word embeddings.
Anyway. Don’t take this as more than one person’s idle musings. I am not a linguist, etc.