Changes

no edit summary
Line 27: Line 27:  
Finally, data classification is achieved using one MLP per cluster of CTD scans. Each MLP is trained for binary classification to flag CTD scans as poor quality (to be deleted) or good quality (to be preserved). The ground truth used to produce the training data is derived from historical instances of quality control, using past decisions made by oceanographers on which scans should be deleted. The deleted scans are far fewer than the preserved scans, making the training data highly imbalanced. This is often problematic in a machine learning setting, as it biases the model towards the majority class. To address this, we apply oversampling, randomly duplicating deleted scans in the training data to balance their numbers with the preserved scans.
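The oversampling step described above can be illustrated with a minimal sketch. It is not the project's actual implementation: the function name <code>oversample_minority</code>, the label convention (1 = deleted, 0 = preserved), and the toy data are assumptions made for illustration only.

```python
import random
from collections import Counter


def oversample_minority(scans, labels, minority_label=1, seed=0):
    """Randomly duplicate minority-class scans until both classes are the same size.

    Assumed convention (not from the source): label 1 = deleted, 0 = preserved.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Nothing to do if the minority class is empty or already large enough.
    if not minority or len(minority) >= len(majority):
        return list(scans), list(labels)
    # Draw with replacement from the minority indices to fill the gap.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    indices = list(range(len(labels))) + extra
    rng.shuffle(indices)
    return [scans[i] for i in indices], [labels[i] for i in indices]


# Toy data: 8 preserved scans (label 0) and 2 deleted scans (label 1).
scans = [[float(i), float(i) * 0.5] for i in range(10)]
labels = [0] * 8 + [1] * 2

balanced_scans, balanced_labels = oversample_minority(scans, labels)
counts = Counter(balanced_labels)
print(counts[0], counts[1])
```

In a setup like this, oversampling would normally be applied to the training split only, so that duplicated scans do not leak into the validation or test data.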
 
 
[[File:Ml_pipeline.png|alt=|center|992x992px]]
 
==Experimental Model Performance==
 
Line 43: Line 37:     
Perfect accuracy will likely never be attainable, as there is uncertainty in the decision-making even for oceanographers. Many complex factors influence the data that is ultimately recorded: choppy water causes the scanning equipment to descend irregularly, winds and strong weather conditions mix waters near the surface, and currents mix waters below the surface. Because the distinction between proper and corrupted scan data can be very difficult to identify under these conditions, the human decision-making process is not guaranteed to be perfect. This highlights an area where more mature AI models that exploit additional information on these factors could eventually augment human decisions and reach higher overall accuracy.
 
       