Changes

no edit summary
Line 3: Line 3:  
As part of the suite of Conductivity, Temperature, Depth (CTD) AI tools being produced by the Office of the Chief Data Steward (OCDS), we are developing a model to assist with identifying and deleting poor-quality scans during the CTD quality control process. Using a combination of a Gaussian Mixture Model (GMM) to cluster CTD scans into groups with similar physical properties and Multi-Layer Perceptrons to classify the scans in each group, we are able to automatically flag the poor-quality scans to be deleted with a high degree of accuracy. Through the deployment of the model as a real-time online endpoint and the support of model communication through a client-side program, we have successfully integrated an experimental model into the client's business process in a field testing environment. The continuation of this line of work will now look to bring the model into a production environment for regular usage in the quality control process.
 
As part of the suite of Conductivity, Temperature, Depth (CTD) AI tools being produced by the Office of the Chief Data Steward (OCDS), we are developing a model to assist with identifying and deleting poor-quality scans during the CTD quality control process. Using a combination of a Gaussian Mixture Model (GMM) to cluster CTD scans into groups with similar physical properties and Multi-Layer Perceptrons to classify the scans in each group, we are able to automatically flag the poor-quality scans to be deleted with a high degree of accuracy. Through the deployment of the model as a real-time online endpoint and the support of model communication through a client-side program, we have successfully integrated an experimental model into the client's business process in a field testing environment. The continuation of this line of work will now look to bring the model into a production environment for regular usage in the quality control process.
 
== Use Case Objectives ==
 
== Use Case Objectives ==
[[File:Business_process.png|alt=|right|382x382px]]
+
[[File:Business process fr.png|alt=|right|402x402px]]
 
The quality control process for CTD data is a highly time-intensive task. An oceanographer performs a visual inspection to identify poor-quality scans in each CTD profile using graphical editing software. As the task requires careful inspection of the data and the CTD profiles collected during one year number in the thousands, this work consumes a large amount of time and effort. To ease this burden, we have produced an AI tool that is capable of flagging the poor-quality CTD scans such that these flags can be displayed to the oceanographer within the graphical editing software. This allows for the task to be sped up by providing a quick reference for the areas in the CTD profiles where attention is required. By providing flags for assisted decision-making rather than using the model for fully automated decision-making, we preserve the ability of oceanographer to use their domain expertise to make the best possible decisions. As the model becomes more mature and its performance improves, we may explore options to increase the degree of automation.                                     
 
The quality control process for CTD data is a highly time-intensive task. An oceanographer performs a visual inspection to identify poor-quality scans in each CTD profile using graphical editing software. As the task requires careful inspection of the data and the CTD profiles collected during one year number in the thousands, this work consumes a large amount of time and effort. To ease this burden, we have produced an AI tool that is capable of flagging the poor-quality CTD scans such that these flags can be displayed to the oceanographer within the graphical editing software. This allows for the task to be sped up by providing a quick reference for the areas in the CTD profiles where attention is required. By providing flags for assisted decision-making rather than using the model for fully automated decision-making, we preserve the ability of oceanographer to use their domain expertise to make the best possible decisions. As the model becomes more mature and its performance improves, we may explore options to increase the degree of automation.                                     
 
    
 
    
Line 27: Line 27:     
Finally, data classification is achieved using one MLP per cluster of CTD scans. The MLP is trained for binary classification to flag CTD scans as poor-quality (to be deleted) or good quality (to be preserved). The ground truth used to produce the training data is derived from historical instances of quality control, using the decisions made by oceanographers in the past regarding which scans should be deleted. The deleted scans are far fewer than the preserved scans, making the training data highly imbalanced. This is often problematic in a machine learning setting as it causes the model to learn in a fashion biased towards the majority class. To address this, we apply oversampling to randomly duplicate deleted scans in the training data in order to balance their numbers with the preserved scans.
 
Finally, data classification is achieved using one MLP per cluster of CTD scans. The MLP is trained for binary classification to flag CTD scans as poor-quality (to be deleted) or good quality (to be preserved). The ground truth used to produce the training data is derived from historical instances of quality control, using the decisions made by oceanographers in the past regarding which scans should be deleted. The deleted scans are far fewer than the preserved scans, making the training data highly imbalanced. This is often problematic in a machine learning setting as it causes the model to learn in a fashion biased towards the majority class. To address this, we apply oversampling to randomly duplicate deleted scans in the training data in order to balance their numbers with the preserved scans.
[[File:Ml_pipeline.png|alt=|center|992x992px]]
+
[[File:Ml pipeline fr.png|alt=|center|992x992px]]
    
    
 
    
 
==Experimental Model Performance==
 
==Experimental Model Performance==
[[File:Performance.png|alt=|right|477x477px]]
+
[[File:Performance fr.png|alt=|right|476x476px]]
 
In order to validate the performance of the model, we have performed experimentation using historical CTD data covering the years 2013-2021. We use a 5-fold cross-validation methodology to ensure robustness in the measured results.
 
In order to validate the performance of the model, we have performed experimentation using historical CTD data covering the years 2013-2021. We use a 5-fold cross-validation methodology to ensure robustness in the measured results.
   Line 50: Line 50:  
For the field test setting, we have used Azure Databricks as the cloud analytics platform through which the model is deployed. A real-time online endpoint is used to make the model available via an Application Programming Interface (API). This means that authenticated users can access the model from any machine that is connected to the internet. We have produced a light-weight client-side program that runs on the user's machine to manage communication with the model. This program provides a simple graphical interface used to select the CTD files to send to the model for processing. The program then receives a response with the flags generated by the model and saves new files containing the results. These files can be loaded directly into the graphical editor used during the standard CTD quality control process. This provides the user with a simple method to use the model with minimal overhead and integrates the model results directly into the existing quality control software.
 
For the field test setting, we have used Azure Databricks as the cloud analytics platform through which the model is deployed. A real-time online endpoint is used to make the model available via an Application Programming Interface (API). This means that authenticated users can access the model from any machine that is connected to the internet. We have produced a light-weight client-side program that runs on the user's machine to manage communication with the model. This program provides a simple graphical interface used to select the CTD files to send to the model for processing. The program then receives a response with the flags generated by the model and saves new files containing the results. These files can be loaded directly into the graphical editor used during the standard CTD quality control process. This provides the user with a simple method to use the model with minimal overhead and integrates the model results directly into the existing quality control software.
   −
The user can optionally send the CTD files back to the model endpoint after the quality control process has been finished. This allows for the scan deletion decisions made by the user to be compared against the model's predicted flags in order to assess the performance of the model. The results are then analyzed to provide a breakdown of the model performance across various metrics which are presented visually in a Power BI report.[[File:Model_communications.png|alt=|center]]
+
The user can optionally send the CTD files back to the model endpoint after the quality control process has been finished. This allows for the scan deletion decisions made by the user to be compared against the model's predicted flags in order to assess the performance of the model. The results are then analyzed to provide a breakdown of the model performance across various metrics which are presented visually in a Power BI report.
 +
[[File:Model communications fr.png|alt=|center]]
     
121

edits