As neural networks tease apart the structure of language, they are finding a hidden gender bias that nobody knew was there.
From the MIT Technology Review:
Back in 2013, a handful of researchers at Google set loose a neural network on a corpus of three million words taken from Google News texts. The neural net’s goal was to look for patterns in the way words appear next to each other.
What it found was complex, but the Google team discovered it could represent these patterns using vectors in a vector space with some 300 dimensions.
It turned out that words with similar meanings occupied similar parts of this vector space. And the relationships between words could be captured by simple vector algebra. For example, “man is to king as woman is to queen” or, using the common notation, “man : king :: woman : queen.” Other relationships quickly emerged too, such as “sister : woman :: brother : man,” and so on. The vector representations that capture these relationships are known as word embeddings.
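To make the vector algebra concrete, here is a minimal sketch in Python. The three-dimensional vectors are hand-made toy values chosen for illustration, not real Word2vec output, and the helper names `cosine` and `analogy` are hypothetical. The idea is that "man : king :: woman : ?" can be answered by computing king - man + woman and picking the remaining word whose vector points in the most similar direction.

```python
import math

# Toy vectors (an assumption for illustration): axes are [gender, royalty, sibling].
# Real embeddings have about 300 learned dimensions with no such readable labels.
VECTORS = {
    "man":     [1.0, 0.0, 0.0],
    "woman":   [-1.0, 0.0, 0.0],
    "king":    [1.0, 1.0, 0.0],
    "queen":   [-1.0, 1.0, 0.0],
    "brother": [1.0, 0.0, 1.0],
    "sister":  [-1.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(a, b, c, vectors=VECTORS):
    """Answer 'a : b :: c : ?' by finding the word closest to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words so the answer is a new word.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("man", "king", "woman"))
print(analogy("sister", "woman", "brother"))
```

With these toy values the two calls recover the analogies quoted above. On real embeddings the same arithmetic is applied to learned vectors, which is how any bias present in the training text can surface in the answers.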
This data set is called Word2vec and is hugely powerful. Numerous researchers have begun to use it to better understand everything from machine translation to intelligent Web searching.
But today Tolga Bolukbasi at Boston University and a few pals from Microsoft Research say there is a problem with this database: it is blatantly sexist.
More here. [Thanks to Farrukh Azfar.]