The rise of machine learning means everyone is talking about data. Data is now the currency of the land and can make or break a project or even a company. But what exactly is the data? And more importantly, how do we value it? If data is a currency then it is critical that we can place a tangible value on it. At RoadBotics, we never set out to value our data but we ended up doing it anyway.
Lots of people talk about data and make statements like, “wow, you have a lot of data so your machine learning gets smarter and smarter all the time.” But this is only half the story. Access to data is necessary, but it is certainly not sufficient to create business value. I think most people mistake raw data for labeled data, and while the distinction might not be apparent to everyone, I’d like to take this opportunity to make it clear. First, because you will get fewer eye rolls from nerds, but more importantly, so that we can talk about data value.
Now, some companies get raw data that is automatically or implicitly labeled. The famous example is Amazon’s “if you like this you might also like X.” Basically, every server log says “person Y bought this product, then bought that product.” Do that a million times and you can draw conclusions. That raw data is implicitly labeled because, as it streams into their servers, they can use it to make better predictions.
This is incredible for businesses that can get access to it. It means that you can get network effects from your data usage. That is, the more people use Amazon, the smarter Amazon gets, for free, at making these predictions. Other examples of these recommender systems are Google’s search system and Netflix’s recommendations. Internet companies abound with famous examples of this kind of application.
While they are by no means “easy” problems, this kind of problem is far more accessible to a machine learning solution because the data sizes are gigantic, making inference easier. But what about building machine learning systems where the meaning of the data is locked away?
This is the situation that RoadBotics faces. We find cracks, potholes, alligator cracks, raveling, etc. All issues that a trained Civil Engineer would know and be able to recognize. This is a straight-up image analysis problem. We take an image from a dashboard of a car and then train an algorithm to find these distresses.
Well great, strap a camera to a car and drive in circles for several hours anywhere in the United States and you will have a bajillion (technical term) samples of all of these distresses.
Or will you?
No, not exactly. There is no “implicit” information content in these images. You could maybe have someone in the passenger seat point at each pothole so the camera can see you pointing and in that way you would have “annotated implicit labels”…. Not realistic.
So as in many cases for machine learning startups, the RoadBotics data does not come with any labels for free. So how do we get them?
Have you heard of MS Paint?
Remember that program you would doodle on like 20 years ago? That is basically what we do, but cloud-based and with more features. Members of the RoadBotics team, trained to understand the different distress types, “paint” on our collected images and tell the system “blue is an unsealed crack”, or “red is a pothole”, etc. These painted images are then fed into our machine learning pipeline so that the model can be trained on the labels. We are effectively asking the machine, “ok here is a new image, find and paint the same distresses”. This is semantic segmentation.
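To make the “painting” idea concrete, here is a minimal sketch of how painted colors could be turned into the per-pixel class labels a segmentation model trains on. This is an illustration only, not RoadBotics’s actual pipeline: the palette, the class ids, and the function name `painted_to_class_mask` are all assumptions, and a real image would be far larger than this toy grid.

```python
# Hypothetical palette: the post only says blue = unsealed crack and
# red = pothole. The exact RGB values and class ids are assumptions.
COLOR_TO_CLASS = {
    (0, 0, 0): 0,      # unpainted pixel, treated as background
    (0, 0, 255): 1,    # blue = unsealed crack
    (255, 0, 0): 2,    # red = pothole
}


def painted_to_class_mask(painted):
    """Turn a painted image (rows of RGB tuples) into a grid of class ids.

    Any color that is not in the palette falls back to background (0).
    """
    return [[COLOR_TO_CLASS.get(pixel, 0) for pixel in row] for row in painted]


# A tiny 2x3 "image": two blue pixels on the top row, one red pixel below.
painted = [
    [(0, 0, 0), (0, 0, 255), (0, 0, 255)],
    [(0, 0, 0), (255, 0, 0), (0, 0, 0)],
]
mask = painted_to_class_mask(painted)
print(mask)
```

The grid of class ids is what makes this semantic segmentation: every pixel gets a label, and the model is asked to produce the same kind of grid for an image it has never seen.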
Great. We have a system to get labels. So we do that. A lot. This is time consuming and expensive but it gets us the data we need. The currency for us to operate. But wait. We wanted to know how valuable those labels are.
One way to try to answer this question is to ask: if we left one of the labeled images out of the set used for training, like skipping a day of school, how well does the machine learning do? Then we could know just how valuable that single day of school really was. While interesting, this assumes that we already have the labeled data and then decide not to use it (so you both go to school and don’t go to school that day). Hardly a mechanism for bettering our system.
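For the curious, the “skip a day of school” idea can be sketched in a few lines. This is a toy, not our production setup: a one-dimensional nearest-neighbor classifier stands in for the real model, the numbers are made up, and the names (`predict_1nn`, `leave_one_out_values`) are hypothetical. The value of a training sample is simply the test accuracy with it minus the test accuracy without it.

```python
def predict_1nn(train, x):
    """Return the label of the training point closest to x (toy model)."""
    return min(train, key=lambda item: abs(item[0] - x))[1]


def accuracy(train, test):
    """Fraction of test points the toy model labels correctly."""
    correct = sum(predict_1nn(train, x) == y for x, y in test)
    return correct / len(test)


def leave_one_out_values(train, test):
    """Value of each training sample: accuracy with it minus accuracy without it."""
    full = accuracy(train, test)
    values = []
    for i in range(len(train)):
        reduced = train[:i] + train[i + 1:]  # "skip" sample i
        values.append(full - accuracy(reduced, test))
    return values


# Made-up data: (feature, label). Two similar "ok" samples, one "pothole".
train = [(0.0, "ok"), (1.0, "ok"), (5.0, "pothole")]
test = [(0.2, "ok"), (0.9, "ok"), (4.5, "pothole"), (6.0, "pothole")]

values = leave_one_out_values(train, test)
print(values)  # [0.0, 0.0, 0.5]
```

In this toy case the two “ok” samples are redundant with each other, so dropping either one costs nothing, while the single “pothole” sample is worth half the test score. Note that the sketch also shows the catch: every sample had to be labeled already before its value could be measured.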
The Tutor – A Feedback Loop
Instead, the better approach is to ask the machine where it got things wrong after its last training bout and the last test. Look at where there were errors and then try to prioritize learning based upon that result.
The procedure is very much like working with a tutor. You sit with the tutor and ideally you already know some of the subject but not very much. The first thing the tutor needs to understand in order to help is “what do you know now?”.
So you’re given your first test. The tutor then looks at the test results, sees where you did well and where you did poorly. Then a custom curriculum is drafted that the tutor thinks can help shore up your weaknesses.
At this point, we prioritize certain data to be labeled rather than broadly studying everything. We work on that for a bit until we think we can make progress. To test this progress the tutor gives you another test and the procedure repeats itself.
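Here is a rough sketch of one round of that tutor loop, under some loud assumptions: the test results are simple (true class, predicted class) pairs, each unlabeled image comes with the model’s own guess at what it contains, and the names `per_class_error` and `prioritize_for_labeling` are made up for illustration. It shows the general pattern (often called active learning), not the exact RoadBotics implementation.

```python
from collections import defaultdict


def per_class_error(test_results):
    """Grade the test: error rate per class from (true, predicted) pairs."""
    totals, wrong = defaultdict(int), defaultdict(int)
    for true_cls, pred_cls in test_results:
        totals[true_cls] += 1
        if pred_cls != true_cls:
            wrong[true_cls] += 1
    return {cls: wrong[cls] / totals[cls] for cls in totals}


def prioritize_for_labeling(unlabeled, error_by_class, budget):
    """Draft the curriculum: pick images whose likely class is weakest.

    unlabeled is a list of (image_id, likely_class) pairs, where likely_class
    is an assumed rough guess from the current model.
    """
    ranked = sorted(
        unlabeled,
        key=lambda item: error_by_class.get(item[1], 0.0),
        reverse=True,
    )
    return [image_id for image_id, _ in ranked[:budget]]


# Made-up results from the last test.
test_results = [
    ("pothole", "pothole"), ("pothole", "crack"),
    ("crack", "crack"), ("crack", "crack"),
    ("raveling", "crack"), ("raveling", "crack"),
]
errors = per_class_error(test_results)

unlabeled = [
    ("img_a", "crack"), ("img_b", "raveling"),
    ("img_c", "pothole"), ("img_d", "raveling"),
]
to_label = prioritize_for_labeling(unlabeled, errors, budget=3)
print(to_label)  # ['img_b', 'img_d', 'img_c']
```

The model failed every raveling question on the test, so the raveling images go to the front of the labeling queue, while the crack image, a subject the model already aces, is left for later. After the new labels are added and the model is retrained, the same two functions run again on the next test.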
The Value of Data
This tutor has given us the ability to value each label and to see how much value we get by concentrating on the areas that return the most to the main task. By creating this positive feedback loop and feeding our model’s output back into how we extract information from our data, we are now able to build more powerful models more quickly and at a lower expense.
That’s the power of feedback in machine learning.