Data Quality Measurement
Overview/Purpose
CloudFactory wants to sell quality data to customers and needs to quantify it for SLAs (service level agreements), sales, marketing, etc. We cannot simply claim high quality and then let customers come back with their own, possibly fabricated, metrics and complaints; that leaves us in a vulnerable position. To prevent this, we should devise a way to quantify how good the labeled data is in a human-understandable way.
A further benefit is that we can quantify our differentiator, proving to the customer that our delivery is high quality.
Audience
This guide is for the Sales, Delivery, and CTS teams. It outlines how we will measure the quality of the data we label. We can include this information in the contracting and ordering process, and it is how we will substantiate what we deliver later.
What you’ll need
If you are uncomfortable with the measures outlined in the document below, such as precision and recall, which stem from a confusion matrix, read how they work here.
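For quick reference, here is a minimal sketch (not project code) of how precision and recall fall out of the confusion-matrix counts:

```python
# Minimal sketch: precision and recall from confusion-matrix counts.
# tp = true positives, fp = false positives, fn = false negatives.

def precision(tp: int, fp: int) -> float:
    # Of all the labels we produced, how many were correct?
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    # Of all the labels that should exist, how many did we produce?
    return tp / (tp + fn) if (tp + fn) else 0.0

print(precision(tp=90, fp=10))  # 0.9
print(recall(tp=90, fn=30))     # 0.75
```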
Proposal
We propose using AI CS to review the data as an iterative process. If we assume AI CS returns the most probable errors, we can (after the human-in-the-loop review) calculate metrics comparing the data before and after the review for further validation.
Step 1: evaluate the AI CS actuals.
This step is counterintuitive. If we accept many of the suggested AI CS errors, we can assume the original data was low quality, and vice versa. So, if we look at the AI CS results and the suggestions are poor (i.e., the suggested errors are not actually errors), then the underlying data is of good quality.
If there are only a few accepted changes, the data quality is high; if there are many, the quality is poor. If the data is poor, we can continue to iterate and might see that the AI CS models return fewer errors, and that those suggestions are accepted as errors less frequently.
The figures for this can be found by generating this report.
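As an illustration of the Step 1 signal, the sketch below tracks the share of AI CS error suggestions that reviewers accept as real errors across review passes. The numbers and the (suggested, accepted) shape are illustrative assumptions, not the actual report schema.

```python
# Step 1 signal sketch: share of AI CS error suggestions confirmed by the human review.
# The counts below are illustrative only; the real figures come from the generated report.

def acceptance_rate(suggested: int, accepted: int) -> float:
    """Fraction of AI CS suggestions that reviewers accepted as real errors."""
    return accepted / suggested if suggested else 0.0

# On poor data we would expect the rate to drop across iterations as the data gets cleaner.
iterations = [(120, 95), (60, 30), (25, 5)]  # (suggested, accepted) per review pass
for i, (suggested, accepted) in enumerate(iterations, start=1):
    print(f"Iteration {i}: {acceptance_rate(suggested, accepted):.0%} of suggestions accepted")
```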
Step 2: calculate key metrics on the original vs. the clean data.
We can calculate key metrics for a given error type on the original data vs. the clean data. The metric serves as further validation and answers the question of what is “clean enough.” Because “clean” is in most cases not absolute but asymptotic (quality approaches 100% while never being perfect), this also provides objective grounds for a reasonable SLA.
The metric used depends on the project. The sales team can determine which error type is least acceptable for the project, and we will adopt the corresponding metric. Once the data hits the metrics agreed upon with the customer, we can deliver it.
| Error Type | Metric | Model family |
| --- | --- | --- |
| Misclassified label | Accuracy (pot. Hamming score) | OD, IS, SS |
| Missing label | Recall | OD, IS, SS |
| Extra label | Precision | OD, IS, SS |
| Low IoU | IoU | OD, IS, SS |
| Misclassified tag | Accuracy (pot. Hamming score) | Image classification or attributes |
| Missing tag | Recall | Image classification or attributes |
| Extra tag | Precision | Image classification or attributes |

Table 1: Proposed metrics
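As an illustration only, here is a minimal sketch of how the metrics in Table 1 could be computed against the cleaned data, under the simplifying assumption that original and clean labels have already been matched one-to-one. The data shapes, function names, and the matching step are assumptions for the sketch, not the actual pipeline.

```python
# Sketch of Step 2: compare original labels against the cleaned (reference) labels.
# Assumes a prior matching step has paired original labels with their clean counterparts.

def label_quality_metrics(matched_pairs, extra_original, missing_in_original):
    """matched_pairs: list of (original_class, clean_class) for labels present in both sets.
    extra_original: count of original labels with no clean counterpart ("Extra label").
    missing_in_original: count of clean labels with no original counterpart ("Missing label")."""
    matched = len(matched_pairs)
    correct = sum(1 for orig, clean in matched_pairs if orig == clean)
    return {
        # "Misclassified label" -> accuracy over matched labels
        "accuracy": correct / matched if matched else 0.0,
        # "Extra label" -> precision: matched originals out of all original labels
        "precision": matched / (matched + extra_original) if (matched + extra_original) else 0.0,
        # "Missing label" -> recall: matched originals out of all clean labels
        "recall": matched / (matched + missing_in_original) if (matched + missing_in_original) else 0.0,
    }

def box_iou(a, b):
    """IoU for two axis-aligned boxes (x1, y1, x2, y2), for the "Low IoU" error type."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0
```

For example, label_quality_metrics([("car", "car"), ("car", "truck")], extra_original=1, missing_in_original=2) returns 0.5 accuracy, 2/3 precision, and 0.5 recall.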
Caveats and complications
We assume that the metrics we calculate are conservative estimates, as AI CS flags the most likely errors rather than a random subset of the data. This means the metric represents a “worst-case scenario.”
This method depends on having good, reliable AI CS models. Sometimes the use case is very complicated, or the project is at an early stage, and AI CS may perform poorly. In such cases, more extensive reviews may be needed, potentially on random subsets of the data.
A “change” might be poorly defined in some cases; e.g., a user might accept a model suggestion for a mask but then edit the mask slightly afterwards.