AI Consensus Scoring (AI CS) Review Guide
Overview
This guide details how to trigger AI Consensus Scoring and how to interpret its results. Please review the batch workflow guide for more information on when to use AI-CS, and see the data quality section in the project performance report guide for more on quantifying data quality with AI-CS.
What you’ll need
Prerequisite Guides: Batch Workflow Guide, QA Workflow Guide
Tools: Hasty App
Related Guides: Reviewing AI Assistant Performance Guide, Project Report Guide
Running and reviewing AI-CS
Before going into the details, let’s start with a high-level overview of AI-CS.
Any labeled dataset contains an unknown percentage of incorrect annotations. Finding them is tedious, like finding a needle in a haystack. AI-CS helps by suggesting a subset of annotations with a high likelihood of being erroneous, which a human then reviews. Human involvement is critical because an AI can only indicate which annotations it is uncertain about; by definition, it doesn't inherently know what would be correct.
Behind the scenes, we use a mix of Confident Learning and Bayesian Neural Networks to power AI-CS. You can learn more about it in the Hasty announcement blog post. As with any ML implementation, results are not always black and white; instead, it's crucial to develop an understanding of its nuances.
Evaluating when to start using AI-CS
When should we move from a manual QA workflow to an AI-CS-enabled one? When does it make sense to try creating a run and see if the results are usable? The short answer: start using AI-CS when the models trained for the labeling automation perform well.
Assessing model performance is not trivial: which ML metrics to evaluate is unclear and differs from project to project. So instead of looking at one hard metric, we need to determine how well the models perform at automating the first-pass labeling. This only makes sense once at least 2,000 annotations have been created. Then conduct the assessment in three steps:
Try the assistants for yourself to develop an initial feeling for the model’s performance.
Get qualitative feedback from the cloud workers. They’re working with the tools daily; use this to our advantage.
Monitor the assistance rate (% of annotations created by the level 2 tools). Once it picks up, automation is kicking in, indicating that the models are starting to produce usable results. For this metric, contact Tobias Rosario and he'll ensure you get the data you need.
If you feel that the assistants work well, but the assistance rate doesn't pick up, this often means that (some) cloud workers are not using the tools correctly. In that case, check the assistance rate per cloud worker and address any issues with the delivery lead.
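For illustration, here is a minimal sketch of how such a per-worker breakdown could be computed from an annotation export. The column names, tool labels, and file name are assumptions, not the actual Hasty export schema; adapt them to whatever export you receive.

```python
# Minimal sketch: assistance rate overall and per cloud worker from a
# hypothetical annotation export. The column names ("worker", "tool"), the
# tool labels, and the file name are assumptions, not the actual Hasty schema.
import pandas as pd

LEVEL_2_TOOLS = {"object_detection_assistant", "instance_segmentation_assistant"}  # assumed labels

annotations = pd.read_csv("annotations_export.csv")  # hypothetical export
assisted = annotations["tool"].isin(LEVEL_2_TOOLS)

print(f"Overall assistance rate: {assisted.mean():.1%}")

per_worker = assisted.groupby(annotations["worker"]).mean().sort_values()
print(per_worker.map("{:.1%}".format))  # workers at the bottom may need coaching
```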
Creating an AI-CS run
Once the model performance picks up, it’s time to try AI-CS.
To create an AI-CS run, please follow the instructions in the Hasty documentation with the following settings:
Dataset: Select the dataset associated with the current batch.
Image status: By default, choose ‘DONE’, ‘TO REVIEW’, and ‘AUTO LABELLED’.
Preview mode: ‘False’.
Retrain model*: If we added significantly more data since the last retraining or changed anything in the taxonomy, ‘True’, otherwise ‘False’.
*The model trains on a maximum of 14,000 images. If more images should be used, a custom model must be trained in Model Playground.
Evaluation of an AI-CS run
Now that we created an AI-CS run, the question is: can we trust the results?
To answer this, first check the results on the first page, which displays the suggestions with the highest confidence. If the model performance is good enough, more than 50% of those suggestions should be correct.
Then, review the confidence histograms on the AI-CS summary page to get a feeling for how confident AI-CS is. They should be predominantly left-skewed, i.e., most of the mass should sit at high confidence values, with the tail stretching toward the left.
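If you prefer a number over eyeballing the histogram, skewness can be computed directly from the raw confidence values; 'left-skewed' corresponds to a negative skewness. A minimal sketch, assuming the confidence values are available as a plain list:

```python
# Minimal sketch: quantify how left-skewed a confidence distribution is.
# A left-skewed (negatively skewed) histogram has most of its mass at high
# confidence values, with a tail stretching toward the low end.
from scipy.stats import skew

confidences = [0.97, 0.95, 0.92, 0.90, 0.88, 0.60, 0.35]  # made-up example values

print(f"skewness = {skew(confidences):.2f}")  # negative value => left-skewed, which is what we want
```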
Interpretation of AI-CS results
If we made it here, fantastic! It means that AI-CS works as intended, and we can move on to the next project phase. This section covers interpreting the AI-CS results and adjusting the strategy based on the insights.
We can only interpret AI-CS results meaningfully once the cloud workers finish the first review task. Before the task is completed, we don’t know if the results are correct. To ensure the correct execution of the review tasks, the delivery team performs spot checks described in the QA workflow guide.
Go through the steps below after each AI-CS run to spot potential errors. In theory, if the first AI-CS runs went smoothly, later ones should as well, but you never know.
After completing the review, we calculate the true error rate as the % of annotations flagged as suggested errors multiplied by the % of suggestions accepted during review. This gives us valuable insight into the cloud workers' performance and surfaces potential issues with the data strategy that we should feed back to the client.
We should first break down the true error rate by cloud worker, since misunderstood instructions are a recurring source of error. If one or several cloud workers have a much higher true error rate than the average, address the issue with the delivery team.
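For concreteness, here is a minimal sketch of the true error rate calculation and a per-worker breakdown, assuming a hypothetical review export in which each suggestion records the cloud worker who created the original annotation and whether the reviewer accepted it. The column names, file name, and batch size are assumptions.

```python
# Minimal sketch: true error rate = (% of annotations flagged as suggested errors)
#                                   x (% of those suggestions accepted in review).
# Worked example: flagging 10% of 5,000 annotations (500 suggestions) with 60%
# of them accepted gives a true error rate of 0.10 * 0.60 = 6%.
import pandas as pd

TOTAL_ANNOTATIONS = 5000  # assumed batch size
suggestions = pd.read_csv("aics_review_export.csv")  # hypothetical export with a "worker"
                                                     # column and a boolean "accepted" column

suggested_rate = len(suggestions) / TOTAL_ANNOTATIONS
acceptance_rate = suggestions["accepted"].mean()
print(f"True error rate: {suggested_rate * acceptance_rate:.1%}")

# Per-worker view: count confirmed errors per worker; dividing by each worker's
# total annotation count would give a per-worker true error rate.
confirmed_per_worker = suggestions[suggestions["accepted"]].groupby("worker").size()
print(confirmed_per_worker.sort_values(ascending=False))
```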
Then, we can deep-dive and address different error types, more specifically:
Classification errors (created by a Classification, Tagger, or Attribute run)
Low IoU errors (created by an Object Detection or Instance Segmentation run)
Extra and missing label errors (created by an Object Detection or Instance Segmentation run)
All three error types can occur in one project, but they don’t have to. Further, we should always review the true error rate by individual classes for all three error types to understand the potential root cause better.
Classification errors
When we see predominantly classification errors, we can use the confusion matrix on the summary page to uncover quality hotspots in the dataset. A hotspot generally has one or more of the reasons listed below; root cause analysis can be used to determine the contributing reason(s) and the preventative or corrective actions needed (a rough tabulation sketch follows the checks below).
Skewness of data—There may be a hotspot simply because we haven't labeled enough data for specific classes, so the AI-CS model hasn't been trained on them enough. Insights in Hasty should make this determination objective.
CHECK: see if all label classes are somewhat represented on the project summary page.
Ambiguity in instructions—Class definitions might overlap (e.g., ‘car’ and ‘SUV’). Ideally, we should be able to prevent some of this ambiguity via best practice guidance to clients upfront for documenting instructions, seed stage rigor to validate and refine instructions, and edge case handling (and further instructions refinement as needed) early in the onboarding process. But when ambiguity occurs downstream, our operational syncs with clients are good opportunities to review, discuss, and resolve issues in real-time together.
CHECK: often, this ambiguity can be picked up with common sense. Ask yourself, “could I easily tell which class something belongs to by ONLY looking at a crop of the annotation, or is additional context needed to assess that?”
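The tabulation sketch mentioned above, assuming hypothetical exports of the annotations and the classification suggestions; the column names and file names are assumptions, not the Hasty schema. Class pairs with large off-diagonal counts in the confusion matrix are the hotspots worth investigating first.

```python
# Minimal sketch: check class representation and find confusion hotspots.
# Column names and file names are assumptions, not the Hasty export schema.
import pandas as pd

# Check 1: are all label classes somewhat represented?
annotations = pd.read_csv("annotations_export.csv")  # hypothetical export
print(annotations["class_name"].value_counts())

# Check 2: which class pairs get confused most often? Large off-diagonal
# cells hint at ambiguity (e.g., 'car' vs. 'SUV') or under-represented classes.
suggestions = pd.read_csv("aics_classification_export.csv")  # hypothetical export
confusion = pd.crosstab(suggestions["annotated_class"], suggestions["suggested_class"])
print(confusion)
```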
Low IoU errors
Low IoU errors occur when annotations are not drawn tightly enough, typically because of:
Unclear instructions—This should have been ruled out during the discovery analysis, but additional edge cases may arise over time. Use our operational syncs with clients to address, discuss, and resolve potential issues.
CHECK: are the instructions specific about which parts of an object to annotate and which not to (e.g., should the window frame be part of a ‘window’ annotation?)
The IoU histogram gives you an indication of how severe the issues are: the more left-skewed the distribution, the less severe the errors.
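As a refresher on the metric itself, IoU (intersection over union) measures how tightly two shapes overlap. A minimal sketch for axis-aligned bounding boxes (not Hasty's implementation):

```python
# Minimal sketch: IoU (intersection over union) for two axis-aligned boxes
# given as (x_min, y_min, x_max, y_max). An IoU of 1.0 means a perfect overlap;
# low IoU between annotation and prediction is what this error type flags.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((10, 10, 50, 50), (12, 14, 52, 55)))  # a loose annotation lands well below 1.0
```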
Extra and missing label errors
A large number of extra or missing labels is usually caused by one or more of the causes listed below.
Missing edge cases in instructions—Sometimes, a clear differentiation of objects that should not be annotated is missing, which may confuse cloud workers. Ideally, we should be able to prevent some of these issues by reviewing the client's instructions before starting a project. However, those edge cases often reveal themselves only after some time, as they're not always obvious. When they come up, it's an excellent opportunity to discuss them with the client and resolve potential issues.
CHECK: Such edge cases can often be picked up with common sense. Review the individual errors and ask yourself: “How could a cloud worker confuse this? Is there a prevalent pattern?”
Blind trust in level 2 assistants—Especially with many annotations per image, cloud workers sometimes accept level 2 suggestions without checking them in detail. As the level 2 models are not perfect yet, some of their suggestions are expected to be wrong. If this is happening, address the issue with the delivery team lead.
CHECK: spot-check the level 2 assistants. Are they systematically missing particular objects or creating erroneous suggestions?
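To make that spot check a bit more systematic, here is a minimal sketch that counts confirmed errors per class, split by whether the original annotation came from a level 2 assistant. The column names, tool labels, and file name are assumptions.

```python
# Minimal sketch: do confirmed errors cluster around level 2 assisted annotations
# for particular classes? Column names, tool labels, and file name are assumptions.
import pandas as pd

LEVEL_2_TOOLS = {"object_detection_assistant", "instance_segmentation_assistant"}  # assumed labels

suggestions = pd.read_csv("aics_review_export.csv")  # hypothetical export with boolean "accepted"
confirmed = suggestions[suggestions["accepted"]]

breakdown = (
    confirmed.assign(assisted=confirmed["tool"].isin(LEVEL_2_TOOLS))
    .groupby(["class_name", "assisted"])
    .size()
    .unstack(fill_value=0)
)
print(breakdown)  # classes dominated by the assisted column deserve a closer look
```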