The RoadBotics Tutor – How We Built a Data Labeling Feedback Loop

RoadBotics Blog

February 15, 2019

The rise of machine learning means everyone is talking about data. Data is now the currency of the land and can make or break a project or even a company. But what exactly is the data? And more importantly, how do we value it? If data is a currency then it is critical that we can place a tangible value on it. At RoadBotics, we never set out to value our data but we ended up doing it anyway.

Lots of people talk about data and make statements like, “wow, you have a lot of data so your machine learning gets smarter and smarter all the time.” But this is only half the story. Access to data is necessary, but it is certainly not sufficient to turn that data into business value. I think most people mistake raw data for labeled data, and while the distinction might not be apparent to everyone, I’d like to take this opportunity to make it clear: first, because you will get fewer eye rolls from nerds, but more importantly, so that we can talk about data value.

Data Bonanza

Now, some companies get raw data that is automatically or implicitly labeled. The famous example is Amazon’s “if you like this you might also like X”. Basically, every server log entry says “person Y bought this product, then bought that product”…do that a million times and you can draw conclusions. That raw data is implicitly labeled because, as it streams into their servers, they can use it to make better predictions.
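To make "implicitly labeled" concrete, here is a minimal sketch in Python. It is not how Amazon does it; the purchase logs, the product names, and the helper functions co_purchase_counts and recommend are all made up for illustration. The point is only that each "bought this, then bought that" pair arrives as a label without anyone annotating it.

```python
from collections import Counter, defaultdict

def co_purchase_counts(orders):
    """Count how often each product is directly followed by another in a shopper's history."""
    follows = defaultdict(Counter)
    for history in orders.values():
        # Each consecutive pair is an implicit label: "bought first, then bought then".
        for first, then in zip(history, history[1:]):
            follows[first][then] += 1
    return follows

def recommend(follows, product, k=1):
    """Return up to k products most often bought right after `product`."""
    return [item for item, _ in follows[product].most_common(k)]

# Hypothetical purchase logs: shopper id -> ordered list of purchases.
orders = {
    "shopper_1": ["tent", "sleeping bag", "lantern"],
    "shopper_2": ["tent", "sleeping bag"],
    "shopper_3": ["tent", "stove"],
}
follows = co_purchase_counts(orders)
print(recommend(follows, "tent"))
```

Nobody had to sit down and label anything here. The labels fall out of the logs, which is why more shoppers means better counts for free.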

This is incredible for businesses that can get access to it. First, it means that you can get network effects from your data usage. That is, the more people use Amazon, the smarter Amazon gets, for free, at making these predictions. Other examples of such “recommender systems” are Google’s search system and Netflix’s recommendations. Famous internet companies abound with this kind of application.

While these are by no means “easy” problems, they are far more accessible to a machine learning solution because the data sizes are gigantic, making “inferencing” easier. But what about making machine learning systems where the meaning of the data is locked away?

This is the situation that RoadBotics faces. We find cracks, potholes, alligator cracks, raveling, etc.: all issues that a trained Civil Engineer would know and be able to recognize. This is a straight-up image analysis problem. We take an image from the dashboard of a car and then train an algorithm to find these distresses.

Well great, strap a camera to a car and drive in circles for several hours anywhere in the United States and you will have a bajillion (technical term) samples of all of these distresses.

Or will you?

No, not exactly. There is no “implicit” information content in these images. You could maybe have someone in the passenger seat point at each pothole so the camera can see them pointing, and in that way you would have “annotated implicit labels”…but that’s not realistic.

So as in many cases for machine learning startups, the RoadBotics data does not come with any labels for free. So how do we get them?

Have you heard of MS Paint?

Remember that program you would doodle on like 20 years ago? That is basically what we do, but cloud-based and with more features. We have members of the RoadBotics team, trained to understand the different distress types, “paint” on our collected images and tell the tool “blue is an unsealed crack”, or “red is a pothole”, etc. These painted images can then be fed into our machine learning pipeline so that a model can be trained on the labels. We are effectively asking the machine, “ok, here is a new image, find and paint the same distresses”. This is semantic segmentation.
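As a rough sketch of what a painted label turns into, the Python below maps each painted pixel color to a distress class. Only the blue and red meanings come from the text above; the exact RGB values, the class names, the "background" fallback, and the tiny 2x3 image are assumptions made for illustration, not the actual RoadBotics format.

```python
# Hypothetical color legend; the exact RGB values are assumed.
COLOR_TO_CLASS = {
    (0, 0, 255): "unsealed_crack",  # blue
    (255, 0, 0): "pothole",         # red
}
BACKGROUND = "background"

def painted_to_labels(painted):
    """Map every painted RGB pixel to a class name; any other color becomes background."""
    return [
        [COLOR_TO_CLASS.get(tuple(pixel), BACKGROUND) for pixel in row]
        for row in painted
    ]

def pixel_counts(label_mask):
    """Count how many pixels carry each class name."""
    counts = {}
    for row in label_mask:
        for name in row:
            counts[name] = counts.get(name, 0) + 1
    return counts

# A made-up 2x3 painted image: black means "nothing painted here".
painted = [
    [(0, 0, 0), (0, 0, 255), (0, 0, 255)],
    [(0, 0, 0), (255, 0, 0), (0, 0, 0)],
]
label_mask = painted_to_labels(painted)
counts = pixel_counts(label_mask)
print(counts)
```

A per-pixel class mask like label_mask is the kind of target a semantic segmentation model is trained to reproduce on new images.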

Great. We have a system to get labels. So we do that. A lot. This is time-consuming and expensive, but it gets us the data we need: the currency we need to operate. But wait. We wanted to know how valuable those labels are.

One way to try to answer this question is to ask: if we left one of the labeled images out of the set used for training (like skipping a day of school), how well would the machine learning do? Then we could know just how valuable that single day of school really was. While interesting, this assumes we both have the labeled data and decide not to use it (so you both go to school and don’t go to school that day). Hardly a mechanism for bettering our system.
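The leave-one-out idea can be sketched in a few lines of Python. Everything here is a toy stand-in: the "model" is a 1-nearest-neighbour rule on a single made-up number per image, and the training and held-out samples are invented. What matters is the bookkeeping: the value of a label is the score with it minus the score without it.

```python
def predict(train, x):
    """1-nearest-neighbour on a single numeric feature (a stand-in for a real model)."""
    nearest = min(train, key=lambda sample: abs(sample[0] - x))
    return nearest[1]

def accuracy(train, held_out):
    """Fraction of held-out samples the toy model gets right."""
    correct = sum(predict(train, x) == y for x, y in held_out)
    return correct / len(held_out)

def leave_one_out_values(train, held_out):
    """Value of sample i = accuracy with every label minus accuracy without label i."""
    baseline = accuracy(train, held_out)
    values = []
    for i in range(len(train)):
        without_i = train[:i] + train[i + 1:]  # "skip that day of school"
        values.append(baseline - accuracy(without_i, held_out))
    return values

# Made-up (feature, label) pairs.
train = [(0.1, "ok"), (0.2, "ok"), (0.9, "pothole")]
held_out = [(0.15, "ok"), (0.8, "pothole"), (0.95, "pothole")]
values = leave_one_out_values(train, held_out)
print(values)
```

In this toy case the two near-identical "ok" labels are each worth nothing on their own, while the lone "pothole" label carries most of the score. Note the cost, though: one retraining run per label, and every label has to be paid for before it can be valued.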

The Tutor – A Feedback Loop

Instead, the better approach is to ask the machine where it got things wrong after its last training bout and the last test. Look at where there were errors and then try to prioritize learning based upon that result.

The procedure is very much like working with a tutor. You sit with the tutor and ideally you already know some of the subject but not very much. The first thing the tutor needs to understand in order to help is “what do you know now?”.

So you’re given your first test. The tutor then looks at the test results, sees where you did well and where you did poorly. Then a custom curriculum is drafted that the tutor thinks can help shore up your weaknesses.

At this point, we prioritize certain data to be labeled rather than broadly studying everything. We work on that for a bit until we think we can make progress. To test this progress the tutor gives you another test and the procedure repeats itself.
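One way to picture a single round of this tutor loop is the Python sketch below. It is an assumption-laden illustration, not the RoadBotics implementation: the "test" is scored with per-class intersection-over-union (a common segmentation metric chosen here for the example), and the "curriculum" is a labeling budget split in proportion to each class's error. The function names, the class names, and the budget of 70 images are all hypothetical.

```python
def per_class_iou(truth, predicted, classes):
    """Intersection-over-union per class, from flat lists of per-pixel class names."""
    scores = {}
    for name in classes:
        intersection = sum(t == name and p == name for t, p in zip(truth, predicted))
        union = sum(t == name or p == name for t, p in zip(truth, predicted))
        # A class that appears nowhere is treated as perfectly handled.
        scores[name] = intersection / union if union else 1.0
    return scores

def labeling_priorities(scores, budget):
    """Split a labeling budget across classes in proportion to their error (1 - IoU)."""
    errors = {name: 1.0 - score for name, score in scores.items()}
    total = sum(errors.values())
    if total == 0:
        return {name: 0 for name in scores}
    return {name: round(budget * err / total) for name, err in errors.items()}

# Made-up test results: the model misses half of the pothole pixels.
truth = ["pothole", "pothole", "crack", "crack", "crack", "crack"]
predicted = ["pothole", "crack", "crack", "crack", "crack", "crack"]

scores = per_class_iou(truth, predicted, ["pothole", "crack"])
plan = labeling_priorities(scores, budget=70)
print(scores, plan)
```

Here the weaker class (potholes) gets the larger share of the next labeling batch. After labeling and retraining, the same test is run again and the split is recomputed, which is the repeat step of the loop.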

The Value of Data

This tutor has given us the ability to value each label and to see how much value we get by concentrating on the areas that return the most to the main task. By creating this positive feedback loop and feeding our model’s output back into how we extract information from our data, we are now able to build more powerful models more quickly and at lower expense.

That’s the power of feedback in machine learning.


Author

Benjamin Schmidt, PhD
