The rise of machine learning means everyone is talking about data. Data is now the currency of the land and can make or break a project or even a company. But what exactly is data? And more importantly, how do we value it? If data is a currency, then it is critical that we can place a tangible value on it. At RoadBotics, we never set out to value our data, but we ended up doing it anyway.
Lots of people talk about data and make statements like, “Wow, you have a lot of data, so your machine learning gets smarter and smarter all the time.” But this is only half the story. Access to data is necessary, but it is certainly not sufficient to create business value. I think most people mistake raw data for labeled data. The distinction might not be apparent to everyone, so I’d like to take this opportunity to make it clear: first, because you will get fewer eye rolls from nerds, but more importantly, so that we can talk about data value.
Now, some companies get raw data that is automatically or implicitly labeled. The famous example is Amazon’s “if you like this, you might also like X.” Basically, every server log entry says “person Y bought this product, then bought that product.” Do that a million times and you can draw conclusions. That raw data is implicitly labeled because, as it streams into their servers, they can use it to make better predictions.
This is incredible for businesses that can get access to it. It means that you can get network effects from your data usage. That is, the more people use Amazon, the smarter Amazon gets, for free, at making these predictions. Other examples of these recommender systems are Google’s search system and Netflix’s recommendations. Internet companies abound with famous examples of this kind of application.
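To make "implicitly labeled" concrete, here is a minimal sketch of how a purchase log can label itself. Everything in it is an assumption for illustration: the `purchase_log` entries are made up, and `co_purchase_counts` and `also_bought` are hypothetical helper names, not anything Amazon actually runs. The point is only that nobody had to annotate the data; the act of buying is the label.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase log: (customer, product) pairs, invented for this example.
purchase_log = [
    ("alice", "tent"), ("alice", "sleeping bag"),
    ("bob", "tent"), ("bob", "sleeping bag"), ("bob", "stove"),
    ("carol", "tent"), ("carol", "stove"),
    ("dave", "tent"), ("dave", "sleeping bag"),
]

def co_purchase_counts(log):
    """Count how many customers bought each ordered pair of products."""
    # Group products by customer; the log itself tells us who bought what.
    baskets = {}
    for customer, product in log:
        baskets.setdefault(customer, set()).add(product)

    counts = Counter()
    for products in baskets.values():
        for a, b in combinations(sorted(products), 2):
            # Record the pair in both directions so lookups are symmetric.
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def also_bought(product, counts, top_n=3):
    """Return the products most often bought by customers who bought `product`."""
    related = [(other, n) for (p, other), n in counts.items() if p == product]
    # Most frequent first; break ties alphabetically for a stable order.
    related.sort(key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _ in related[:top_n]]

counts = co_purchase_counts(purchase_log)
print(also_bought("tent", counts))  # ['sleeping bag', 'stove']
```

With four customers this is a toy, but the same counting gets sharper with every new purchase, which is the "smarter for free" effect described above. A real recommender would be far more sophisticated than raw co-occurrence counts.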
While these are by no means “easy” problems, this kind of problem is far more accessible to a machine learning solution because the data sizes are gigantic, which makes inference easier. But what about building machine learning systems where the meaning of the data is locked away?
This is the situation that RoadBotics faces. We find cracks, potholes, alligator cracks, raveling, etc.: all issues that a trained civil engineer would know and be able to recognize. This is a straight-up image analysis problem. We take an image from the dashboard of a car and then train an algorithm to find these distresses.
Well great, strap a camera to a car and drive in circles for several hours anywhere in the United States and you will have a bajillion (technical term) samples of all of these distresses.
Or will you?
No, not exactly. There is no “implicit” information content in these images. You could maybe have someone in the passenger seat point at each pothole so the camera can see them pointing, and in that way you would have “annotated implicit labels”… Not realistic.
So as in many cases for machine learning startups, the RoadBotics data does not come with any labels for free. So how do we get them?
Have you heard of MS Paint?