eDiscovery Science Lesson in Machine Learning

In a recent blog post, Herbert Roitblat explores the idea that the algorithm used for machine learning matters less than the amount of data used to train it (e.g., Domingos, 2012; “More data beats a cleverer algorithm”). In one study, Fernández-Delgado and colleagues tested 179 machine learning categorizers on 121 data sets and found that a large majority of them were essentially identical in their accuracy. In fact, 121 of the categorizers (the match with the number of data sets is a coincidence) were within ±5 percentage points of one another when accuracy was averaged across all of the data sets. Read the entire article by Herbert L. Roitblat.