Machine Learning Data for Self-Driving Cars: Shared or Proprietary


By Dr. Lance B. Eliot, the AI Insider for AI Trends and a regular contributor

The crux of any machine learning approach involves data. You need lots and lots of usable data to be able to “teach” a machine. One of the reasons that machine learning has progressed lately is the advent of Big Data, meaning tons of data that can be readily captured, stored, and processed. Why is it necessary to have an abundance of data for machine learning? Let’s use a simple but illustrative example to explain this. Imagine that you wanted to learn about birds and someone showed you only a single picture of a bird (and furthermore, let’s assume you had never seen any birds in your lifetime). It might be difficult to generalize from one picture and discern the actual characteristics of a bird. If you saw perhaps 50 pictures you’d have a greater chance of discovering that birds have wings, they have beaks, etc. If you saw thousands and thousands of pictures of birds you’d be able to really begin to figure out their characteristics, and even be able to classify birds by aspects such as distinctive colors, distinctive wing shapes, and so on.

For self-driving cars, many of the self-driving car makers are utilizing machine learning to imbue their AI systems with an ability to drive a car. What kind of data are the developers using to “teach” the automation to drive a car? The developers are capturing huge amounts of data that arises while a car is being driven, collecting the data from a myriad of sensors on the car. These sensors include cameras that are capturing images and video, radar devices that capture radar signals, LIDAR devices that capture laser-based distance points data, and the like. All of this data can be fed into a massive dataset, and then crunched and processed by machine learning algorithms.  Indeed, Tesla does this data collection over-the-air from their Tesla cars and can enhance their existing driving algorithms by examining the data and using it to learn new aspects about how their Autopilot software can improve as a driver of the car.

How much data are we talking about?

One estimate by Intel is the following:

Radar data: 10 to 100 KB per second

Camera data: 20 to 40 MB per second

Sonar data: 10 to 100 KB per second

GPS: 50 KB per second

LIDAR: 10 to 70 MB per second

If you add all that up at the upper end of those ranges, you get a few thousand gigabytes per day (Intel’s headline figure is about 4,000 GB), assuming that a car is being driven about 8 hours per day. As a basis for comparison, it is estimated that the average tech-savvy person uses only about 650 MB per day when you add up all of the online social media, online video watching, online video chatting, and other such uses on a typical day.
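To make the arithmetic concrete, here is a minimal sketch that sums the per-sensor rates listed above over an 8-hour driving day. The variable and function names are hypothetical, and the sketch assumes decimal units (1 MB = 1,000 KB, 1 GB = 1,000,000 KB). Under those assumptions the listed ranges work out to somewhere between roughly 870 GB and 3,200 GB per day, which is in the same ballpark as the quoted daily figure.

```python
# Per-sensor data rates in kilobytes per second, as (low, high) ranges,
# copied from the estimate listed above. Decimal units are an assumption.
SENSOR_RATES_KB_PER_S = {
    "radar": (10, 100),
    "camera": (20_000, 40_000),
    "sonar": (10, 100),
    "gps": (50, 50),
    "lidar": (10_000, 70_000),
}

HOURS_DRIVEN_PER_DAY = 8
SECONDS_PER_HOUR = 3_600
KB_PER_GB = 1_000_000


def daily_volume_gb(rates, hours_per_day=HOURS_DRIVEN_PER_DAY):
    """Return the (low, high) total data volume in GB for one day of driving."""
    low_kb_per_s = sum(low for low, _ in rates.values())
    high_kb_per_s = sum(high for _, high in rates.values())
    seconds = hours_per_day * SECONDS_PER_HOUR
    return (low_kb_per_s * seconds / KB_PER_GB,
            high_kb_per_s * seconds / KB_PER_GB)


low_gb, high_gb = daily_volume_gb(SENSOR_RATES_KB_PER_S)
print(f"Low estimate:  {low_gb:,.0f} GB per day")
print(f"High estimate: {high_gb:,.0f} GB per day")
```

Cameras and LIDAR dominate the total; the radar, sonar, and GPS streams barely move the result.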

The estimates of the data amounts being collected by self-driving cars vary somewhat among the analysts and experts commenting on the data deluge. For example, it is said that Google Waymo’s self-driving cars are generating about 1 GB every second while on the road, which makes it 60 GB per minute, or 3,600 GB per hour, and thus for 8 hours it would be about 28,800 GB (nearly 29 TB). Based on how much time the average human driver drives a car annually, it would be around 2 petabytes of data per year if you used the Waymo suggested collection rate of data.
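The unit conversion for a collection rate of 1 GB per second can be checked with a few lines of Python. This is only a sketch: the constant names are hypothetical and decimal units (1 PB = 1,000,000 GB) are assumed. It also shows how many hours of driving per year would be implied by a 2 petabyte annual total at that rate.

```python
RATE_GB_PER_SECOND = 1      # collection rate quoted above
SECONDS_PER_MINUTE = 60
SECONDS_PER_HOUR = 3_600
GB_PER_PB = 1_000_000       # assumption: decimal units

gb_per_minute = RATE_GB_PER_SECOND * SECONDS_PER_MINUTE
gb_per_hour = RATE_GB_PER_SECOND * SECONDS_PER_HOUR
gb_per_8_hours = gb_per_hour * 8

# Driving time needed to accumulate 2 PB in a year at this rate.
hours_for_2_pb = 2 * GB_PER_PB / gb_per_hour

print(f"{gb_per_minute:,} GB per minute")
print(f"{gb_per_hour:,} GB per hour")
print(f"{gb_per_8_hours:,} GB per 8-hour day")
print(f"{hours_for_2_pb:,.0f} hours of driving for 2 PB")
```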

There’s not much point in arguing about exactly how much data is being collected; instead we need to focus on the simple and clear-cut fact that it is a lot of data. A barrage of data. A torrent of data. And that’s a good thing for this reason: the more data we have, the greater the chances of using it wisely for doing machine learning. Notice that I said we need to use the data wisely. If we just feed all this raw data into just anything that we call “machine learning” the results will not likely be very useful. Keep in mind that machine learning is not magic. It cannot miraculously turn data into supreme knowledge.

The data being fed into machine learning algorithms needs to be pre-processed in various fashions. The machine learning algorithms need to be set up to train on the datasets and adjust their internal parameters according to what is found. One of the dangers of most machine learning algorithms is that what they have “learned” becomes a hidden morass of internal mathematical aspects. We cannot dig into this and figure out why it knows what it knows. There is no particular logical explanation for what it deems to be “knowledge” about what it is doing.

This is one of the great divides between more conventional AI programming and the purist’s approach to machine learning. In conventional AI programming, the human developer has used some form of logic and explicit rules to set up the system. For machine learning, it is typically algorithms that merely adjust mathematically based on data patterns, and you cannot in some sense poke into them to find out “why” the system believes something to be the case.

Let’s take an example of making a right turn on red. One approach to programming a self-driving car would be to indicate that if it “sees” a red light and if it wants to make a right turn, it can come to a stop in the rightmost lane, verify that there isn’t anyone in the pedestrian walkway, verify that there is no oncoming traffic to block the turn, and then make the right turn. This is all a logical step-by-step approach. We can use the camera on the self-driving car to detect the red light, we can use the radar to detect if there are any pedestrians in the walkway, and we can use the LIDAR to detect if any cars are oncoming. The sensory devices generate their data, and the AI of the self-driving car fuses the data together, applies the rules it has been programmed with, and then makes the right turn accordingly.
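The step-by-step rules just described could be sketched roughly as follows. This is an illustration only, not how any particular self-driving system is built: the FusedPerception class, its field names, and the action strings are all hypothetical, and the pairing of each check with a sensor simply follows the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class FusedPerception:
    """Hypothetical fused view of the intersection, built from sensor data."""
    light_is_red: bool           # from the camera
    pedestrian_in_walkway: bool  # from the radar
    oncoming_traffic: bool       # from the LIDAR
    in_rightmost_lane: bool
    vehicle_is_stopped: bool


def right_turn_on_red_action(p: FusedPerception) -> str:
    """Return the next action when the car wants to turn right at a light."""
    if not p.light_is_red:
        return "not_applicable"   # these rules only cover the red-light case
    if not p.in_rightmost_lane:
        return "move_to_rightmost_lane"
    if not p.vehicle_is_stopped:
        return "stop"
    if p.pedestrian_in_walkway or p.oncoming_traffic:
        return "wait"
    return "turn_right"


clear_corner = FusedPerception(
    light_is_red=True,
    pedestrian_in_walkway=False,
    oncoming_traffic=False,
    in_rightmost_lane=True,
    vehicle_is_stopped=True,
)
print(right_turn_on_red_action(clear_corner))
```

Every branch is an explicit rule that a developer wrote and can read back, which is what makes this style of programming explainable.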

Compare this approach to a machine learning approach. We could collect data involving cars that are making right turns at red lights. We feed that into a machine learning algorithm. It might ultimately identify that the red light is associated with the cars coming to a halt. It might ultimately identify that the cars only move forward to do the right turn when there aren’t any pedestrians in the walkway, etc. This can be accomplished in a supervised manner, wherein the machine learning is guided toward these aspects, or in an unsupervised manner, meaning that it “discovers” these facets without direct guidance.

Similar to my comments earlier regarding learning about birds, the machine learning approach to learning about right turns on red would need an enormous amount of data to figure out the various complexities of the turn aspects. It might also figure out things that aren’t related and yet believe that they are. Suppose that the data had a pattern that a right turn on red typically took place when there was a mailbox at the corner. It might therefore expect to detect a mailbox on a corner and only be willing to make the right turn when one is there, and otherwise refuse to make the right turn on red.
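The mailbox problem can be reproduced with a toy example. The sketch below uses a one-feature “decision stump”, a deliberately tiny stand-in for a real machine learning algorithm, and a made-up training set in which a mailbox happens to be present every time a turn was made. All feature names and data values are hypothetical.

```python
# Each made-up training example: (observed features, whether the car turned).
TRAINING_DATA = [
    ({"walkway_clear": True,  "traffic_clear": True,  "mailbox": True},  True),
    ({"walkway_clear": True,  "traffic_clear": True,  "mailbox": True},  True),
    ({"walkway_clear": True,  "traffic_clear": False, "mailbox": False}, False),
    ({"walkway_clear": False, "traffic_clear": True,  "mailbox": False}, False),
    ({"walkway_clear": False, "traffic_clear": False, "mailbox": False}, False),
    ({"walkway_clear": True,  "traffic_clear": True,  "mailbox": True},  True),
]


def stump_accuracy(feature, data):
    """Fraction of examples where the feature's value equals the label."""
    correct = sum(1 for features, label in data if features[feature] == label)
    return correct / len(data)


def fit_stump(data):
    """Pick the single feature that best predicts the label on the data."""
    feature_names = data[0][0].keys()
    return max(feature_names, key=lambda name: stump_accuracy(name, data))


learned_feature = fit_stump(TRAINING_DATA)

# A new corner where turning is safe, but there is no mailbox.
new_corner = {"walkway_clear": True, "traffic_clear": True, "mailbox": False}
will_turn = new_corner[learned_feature]

print(f"Learned to key on: {learned_feature}")
print(f"Turns at the new corner: {will_turn}")
```

In this training set the mailbox matches the outcome in all six examples, while each of the genuinely relevant features matches in only five, so the learner latches onto the mailbox and refuses to turn at the new corner.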

There would be no easy way to inspect the machine learning algorithm to ferret out what it assumed was the case for making the right turn on red. For example, in a small-scale artificial neural network we can often inspect the weights and values to try to reverse engineer what the “logic” might be, but for massive neural networks this is not readily feasible. There are some innovative approaches emerging to try to do this, but by and large for large-scale settings it is pretty much a mystery. We cannot explain what it is doing, whereas with conventional AI programming we could do so (the rules-of-the-road approach).

In spite of these limitations, machine learning has a great advantage. Rather than trying to program everything in a conventional AI way, which takes specialized programmers hours and hours to do and which might not even cover all of the various potentialities, the machine learning algorithm can pretty much run on its own. The machine learning algorithm can merely consume processing cycles and keep running until it seems to find useful patterns. It might also discover facets that weren’t apparent to the human developers.

This is not to suggest that we must choose between a machine learning approach and a more conventional AI programming approach. It is not a one-size-fits-all kind of circumstance. Complex systems such as self-driving cars consist of a mixture of both approaches. Some elements are based on machine learning, while other elements are based on conventional AI programming. They work hand-in-hand.

Suppose, though, that you are developing a self-driving car and you don’t have sufficient data on which to turn loose a machine learning algorithm. This is one of the issues currently being debated, at times loudly, in the halls of self-driving car makers and across the industry.

If you believe that humanity deserves to have self-driving cars, you might then take the position that whoever has self-driving car data ought to make it available to others. For example, some believe that Tesla should make available its self-driving car data and allow other self-driving car makers to make use of it. Likewise, some believe that Google Waymo should share its self-driving car data. If Tesla and Google were to readily share their data, presumably all the other self-driving car makers could leverage it and more readily make viable self-driving cars.

On the other hand, it seems a bit over-the-top to assert that private companies that have invested heavily in developing self-driving cars and that have amassed data at their own cost should have to suddenly turn it over to their competitors. Why should they provide a competitor with something that will allow the competitor to avoid similar costs? Why should they enable their competitors to easily catch up with them without having to make similar investments? You can imagine that the self-driving car makers that have such precious data argue that this data is proprietary and not to be handed out to whoever wants it.

There are some publicly available datasets of driving data, but they are relatively small and sparse. Some have argued that the government should be collecting and providing driving data, making it available to anyone who wants it. There are also more complicated questions, such as what the data should consist of and in what way it would be representative. In other words, if you have driving data only from the roads in Palo Alto, does that provide sufficiently generalizable data for machine learning to achieve an appropriate driving ability in Boston or New York?

Most of this data so far comes from self-driving cars, which makes sense because those are the cars that have all the needed sensory devices to collect the data. Another approach involves taking a human-driven car, putting the sensory devices onto it, and using that data to learn from. This arguably makes even more sense: why try to learn from a self-driving car, which is still just a novice at driving, when you could instead learn from the maneuvers of a human-driven car that presumably involves a savvy driver and savvy driving?

This is reminiscent of a famous story from World War II. When Allied bombers returned to their bases, the planes were studied to determine where the bullet holes were. The thinking was that those holes marked vulnerable places on the plane that should be armored heavily on future planes, in the hope that future planes would withstand aerial attacks better than the existing ones. A mathematician involved in the analysis had a different idea. He pointed out that the planes being studied were the ones that had survived their damage, and that the planes that didn’t return were the ones that had been shot down. The places where those lost planes had been hit, which the returning planes could not show, would be the spots to armor. This was thinking outside the box, and it makes perfectly good sense when you consider it.

The same can be said of collecting self-driving car data. Right now, we are obsessed with collecting the data from self-driving cars, but it might be more sensible to also collect the data from human driven cars. We could include not only well-driven human-driven cars, but also human drivers that are prone to accidents. In this manner, the machine learning algorithm could try to discern between proper driving and improper driving. The improper driving would help keep the self-driving car from falling into the trap of driving in the same ways that bad drivers drive.

Those who believe fervently that self-driving cars will change society and democratize the world are pushing to make all data about self-driving cars available to all comers. Will regulators agree and force companies to do so? Will companies want to voluntarily provide their data? Should this data be made available but perhaps at a fee that would compensate those companies that provide it? Will the data become a privacy issue if it provides a capability to drill into the data down to the actual driving of a particular car? When there are accidents involving self-driving cars, will this data be available for purposes of lawsuits?

We are just starting to see an awareness of the importance of data when it comes to self-driving cars. Up until now, the innovators trying to move forward on self-driving cars have been doing their own thing. As the self-driving car market matures, we’re likely to see increased attention to the data, and to who should have it and how. Machine learning algorithms hunger for data. Feeding them is essential to ongoing advances of self-driving cars. Society is going to bring pressure to bear on this field of endeavor, and I assure you that the question of whether self-driving car data is proprietary or shared will one day become a highly visible and contentious topic. Right now, it’s only known to those in the know. Be on the watch for this to break into the limelight, sooner rather than later.