
Loss tolerance v. perfect data: An algorithmic parable
By Tim Panagos
What’s the difference between a loss-tolerant algorithm and one that’s trained for perfect data in the real world? We are so glad you asked!
Perfect data is very expensive to come by. Real life injects many factors into any data system that degrade the quality and consistency of its data.
Most algorithms are trained and developed by data scientists using engineered data, that is, data that has been highly processed to perfect its formatting, remove gaps, and ensure consistent units of measure. In fact, ask any data science team where most of their time is spent and they will tell you it isn’t training AIs or generating whiz-bang algos, but rather the plodding work of data engineering: preparing data at scale to train perfect algorithms. It is expensive, but the results can be profound.
Algorithms created this way are elegant and explainable, but they depend on high-quality data to produce meaningful results. Without high-quality data? Well, the results are questionable.
Data engineering is so time-consuming because we don’t live in a cookie-cutter world. Real-world data is messy, and the effort required to make chaotic data appear in the preferred orderly form is proportionally high. The more money and time you spend, the closer you can get your data to a perfect fit.
‘Cs get degrees’
There are many factors that make real-world data imperfect, particularly where data is generated by sensors and by humans; it is hard to say which is less reliable. In IoT-based autonomous remote sensing, there are device malfunctions, battery depletion, network noise and instability, and just simple breakage and theft. Any and all of these factors will affect every large-scale, real-world deployment. Each of them is constantly at play, and the result is a gap between the imperfect data you collect and the engineered-to-perfection data that most data scientists prefer.
It is possible to close this gap if you stuff enough cash into it. You can spend more money on more, and higher-accuracy, devices. You can use yearly calibration testing and adjustments. You can create redundant networks, use licensed and controlled spectrum, or hard-wire the infrastructure. You can run data through high-throughput, high-accuracy data centers and store it in hyper-accurate, full-size formats. Each of these measures drives the cost to deploy exponentially higher as you approach the perfect horizon.
But is this cost justified? In high-risk, high-margin scenarios, the answer is yes. Healthcare and pharmaceutical research come to mind. But in low-risk, low-margin business applications, can you really afford to double operating costs even if the resulting accuracy and precision are perfect? Will your customers pay for an A+ grade? Or do Cs get degrees?
The efficient frontier
At Microshare, we strive to strike the optimal balance of cost to predictiveness, what we call the efficient frontier, because we see that in our businesses the majority of predictive value is gained from the least expensive deployments (see the Pareto Principle).
That’s why we focus on loss-tolerant algorithms: algorithms built to eat imperfect data, so that real-world, real-time data can be used to create predictive insights that drive business value at minimal cost; algorithms designed to handle the gaps in data quality that come with these real-world scenarios.
Intuitively, you can grasp this if you consider that, for a perfect-data algorithm, “no news is good news.” In other words, if you don’t hear from a sensor, an algorithm designed around perfect data will interpret that signal gap as meaning “no activity,” whereas a loss-tolerant algorithm will recognize gaps in data as belonging to this class of real-world blindness, solving for the chaotic world in which we operate.
So a loss-tolerant system expects there to be a difference between a sensor from which no reading has arrived and a sensor that has actually reported no activity. That difference is very important in creating valuable insights and accurate predictions, and it is the most important feature of a data system that is meant to read and interpret real-life scenarios.
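To make the distinction concrete, here is a minimal Python sketch, not Microshare’s actual implementation: the readings, the time window, and both functions are hypothetical. It shows how a perfect-data approach silently treats a missing reading as zero activity, while a loss-tolerant approach treats it as unknown and reports how much of the window was actually observed.

```python
from statistics import mean

# Hypothetical readings keyed by minute offset; None means the sensor never
# reported for that interval (battery depletion, network loss, breakage, ...).
readings = {0: 3, 1: 5, 2: None, 3: None, 4: 4}

def naive_activity(readings):
    """Perfect-data assumption: 'no news is good news', so a missing
    reading is silently treated as zero activity."""
    return mean(0 if value is None else value for value in readings.values())

def loss_tolerant_activity(readings):
    """Loss-tolerant assumption: a missing reading is real-world blindness,
    not evidence of inactivity, so it is excluded from the estimate."""
    observed = [value for value in readings.values() if value is not None]
    coverage = len(observed) / len(readings)  # share of the window actually seen
    estimate = mean(observed) if observed else None
    return estimate, coverage

print(naive_activity(readings))          # 2.4 -- biased low by the two gaps
print(loss_tolerant_activity(readings))  # (4.0, 0.6) -- estimate plus coverage
```

Reporting the coverage alongside the estimate lets whatever consumes the data weigh how much confidence the reading deserves, rather than mistaking blindness for quiet.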
At Microshare, we have a process where we train and test our algorithms against real-world data. We have the most real-world data about rodent behavior of any organization in the world right now, and we use that data to create outcomes that are loss-tolerant, predictive, and value-generating. This optimizes for the ideal ratio of cost per unit of insight: the data environment can be inexpensive while the value generated from it is very high. That is the ideal situation for IoT in pest control and in cleaning, where margins are small but insights are valuable.
Tim Panagos is the Chief Technology Officer and a Co-Founder of Microshare.
