Ashutosh Jogalekar in The Curious Wavefunction:
My colleague Patrick Riley from Google has a good piece in Nature in which he describes three very common errors in applying machine learning to real-world problems. The errors are general enough to apply to all uses of machine learning irrespective of field, so they certainly apply to a lot of the machine learning work that has been going on in drug discovery and chemistry.
The first kind of error is an incomplete split between training and test sets. People who do ML in drug discovery have encountered this problem often; the test set can be very similar to the training set, or, as Patrick mentions here, the training and test sets aren’t really picked at random. There should be a clear separation between the two sets, and the impressive algorithms are the ones which extrapolate non-trivially from the former to the latter. Only careful examination of both sets can confirm that they genuinely differ.
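To make the splitting problem concrete, here is a minimal sketch of one common way to keep the two sets apart: assign whole groups of related items (for example, compounds sharing a scaffold) to either training or test, never both. This is an illustration and not the method from Patrick's piece; the function `group_split`, its parameters, and the toy compound data are all hypothetical.

```python
import random
from collections import defaultdict


def group_split(items, group_of, test_fraction=0.2, seed=0):
    """Split items so that no group appears in both train and test.

    Hypothetical helper: `group_of` maps an item to its group label
    (e.g. a chemical scaffold), which is an assumption of this sketch.
    """
    groups = defaultdict(list)
    for item in items:
        groups[group_of(item)].append(item)

    keys = sorted(groups)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(keys)

    target = test_fraction * len(items)
    train, test = [], []
    for key in keys:
        # Fill the test set one whole group at a time until it reaches
        # the target size; every remaining group goes to training.
        if len(test) < target:
            test.extend(groups[key])
        else:
            train.extend(groups[key])
    return train, test


# Made-up toy data: (compound id, scaffold label) pairs.
compounds = [
    ("c1", "indole"), ("c2", "indole"),
    ("c3", "pyridine"), ("c4", "pyridine"),
    ("c5", "quinoline"), ("c6", "furan"),
]

train, test = group_split(compounds, group_of=lambda c: c[1], test_fraction=0.3)

# Scaffolds present in both sets; empty by construction.
shared = {s for _, s in train} & {s for _, s in test}
print(len(train), len(test), shared)
```

A purely random split of the same six compounds could easily put `c1` in training and `c2` in test, so the model would be scored on a scaffold it has already seen. Because whole groups are moved at once, the test set here can come out somewhat larger than the requested fraction.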
Another, more serious problem with training data is, of course, the many human biases that have been exposed over the last few years in fields ranging from hiring to facial recognition. The problem is that it’s almost impossible to find training data that doesn’t have some sort of human bias (general image data usually fares pretty well on this count because of the sheer number of random images human beings capture), and it’s very likely that this hidden bias is what your model will then capture.