Finally, data classification is achieved using one MLP per cluster of CTD scans. The MLP is trained for binary classification to flag each CTD scan as poor quality (to be deleted) or good quality (to be preserved). The ground truth used to produce the training data is derived from historical instances of quality control, using the past decisions made by oceanographers on which scans should be deleted. The deleted scans are far fewer than the preserved scans, making the training data highly imbalanced. This is problematic in a machine learning setting because it biases the model towards the majority class. To address this, we apply oversampling, randomly duplicating deleted scans in the training data until their numbers are balanced with the preserved scans.
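
The following is a minimal sketch of this per-cluster training step, assuming scan features have already been extracted into an array. The oversampling follows the description above, but the feature handling, hidden-layer sizes, and other settings shown here are illustrative placeholders rather than the project's actual configuration.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_cluster_model(X, y, random_state=0):
    """Train one binary MLP for a single cluster of CTD scans.

    X: feature matrix for the scans in the cluster (illustrative layout).
    y: labels from historical quality control (1 = deleted, 0 = preserved).
    """
    rng = np.random.default_rng(random_state)
    deleted_idx = np.flatnonzero(y == 1)
    preserved_idx = np.flatnonzero(y == 0)

    # Oversample: randomly duplicate deleted scans until both classes match in size.
    resampled = rng.choice(deleted_idx, size=len(preserved_idx), replace=True)
    idx = np.concatenate([preserved_idx, resampled])

    # Hidden-layer sizes and iteration count are illustrative, not the real settings.
    model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                          random_state=random_state)
    model.fit(X[idx], y[idx])
    return model
</syntaxhighlight>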
 
[[File:Ml_pipeline.png|alt=Three-step process used in the machine learning pipeline.|center|992x992px]]
==Model Deployment and Integration==
For the AI model to deliver practical value, it must be integrated into the client's business process in an easy-to-use manner. We have taken the approach of making the model available via a real-time online endpoint deployed on a cloud analytics platform. The client then uses the model by communicating with the endpoint through a client-side program that runs on the user's machine.
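
As an illustration of this pattern, the sketch below shows what a scoring call from the client-side program to such an endpoint could look like. The endpoint URL, token handling, and JSON payload layout are assumptions for illustration, not the deployed endpoint's actual contract.

<syntaxhighlight lang="python">
import requests

def score_scans(scans, endpoint_url, api_token):
    """Send a batch of CTD scans to the model endpoint and return its flags.

    scans: list of dicts, one per scan, holding the model's input features
    (assumed layout).
    """
    response = requests.post(
        endpoint_url,
        headers={"Authorization": f"Bearer {api_token}"},
        json={"scans": scans},
        timeout=60,
    )
    response.raise_for_status()
    # Assumed response shape: {"predictions": [0, 1, 0, ...]} with 1 = delete.
    return response.json()["predictions"]
</syntaxhighlight>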
For the field test setting, we have used Azure Databricks as the cloud analytics platform through which the model is deployed. A real-time online endpoint is used to make the model available via an Application Programming Interface (API). This means that authenticated users can access the model from any machine that is connected to the internet. We have produced a lightweight client-side program that runs on the user's machine to manage communication with the model. This program provides a simple graphical interface used to select the CTD files to send to the model for processing. The program receives the flags generated by the model and saves new files containing the results. These files can then be loaded directly into the graphical editor used during the standard CTD quality control process. This gives the user a simple way to run the model with minimal overhead and integrates the model results directly into the existing quality control software.
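
A sketch of that client-side workflow is given below, assuming the CTD scans are read from and written to CSV files and reusing a score_scans helper like the one sketched above. The file format, column names, and flag encoding are hypothetical; the real program writes files in the format expected by the quality control editor.

<syntaxhighlight lang="python">
import pandas as pd
from tkinter import Tk, filedialog

def flag_selected_files(score_scans, endpoint_url, api_token):
    """Let the user pick CTD files, score them, and save flagged copies."""
    root = Tk()
    root.withdraw()  # hide the empty main window; only the file dialog is shown
    paths = filedialog.askopenfilenames(title="Select CTD files to process")
    for path in paths:
        scans = pd.read_csv(path)  # one row per CTD scan (assumed layout)
        flags = score_scans(scans.to_dict("records"), endpoint_url, api_token)
        scans["model_flag"] = flags  # 1 = flagged for deletion, 0 = preserve
        out_path = path.replace(".csv", "_flagged.csv")
        scans.to_csv(out_path, index=False)
</syntaxhighlight>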
The user can optionally send the CTD files back to the model endpoint after the quality control process is finished. This allows the scan deletion decisions made by the user to be compared against the model's predicted flags in order to assess the model's performance. The results are then analyzed to provide a breakdown of model performance across various metrics, which are presented visually in a Power BI report.
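
The comparison amounts to scoring the model's flags against the analyst's final deletions. A minimal sketch is shown below; the column names are hypothetical, and the actual breakdown and visuals are produced in the Power BI report rather than in code like this.

<syntaxhighlight lang="python">
import pandas as pd
from sklearn.metrics import classification_report

def evaluate_flags(reviewed: pd.DataFrame) -> str:
    """Compare model flags with the analyst's decisions for a reviewed file.

    reviewed: must hold a 'model_flag' column (the model's prediction) and a
    'user_deleted' column (1 if the analyst deleted the scan, 0 otherwise).
    """
    return classification_report(
        reviewed["user_deleted"],
        reviewed["model_flag"],
        target_names=["preserved", "deleted"],
    )
</syntaxhighlight>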
[[File:Model_communications.png|alt=Information flow in the integration of the model deployment into the business process.|center]]
==Next Steps==