Changes

AI-Assisted Quality Control of CTD Data

Revision as of 12:20, 23 December 2022

1 byte removed, 12:20, 23 December 2022

no edit summary

Line 25: Line 25:

The data partitioning step is used to break the CTD scans up into groups that have similar physical properties to each other. Since the CTD profiles are collected across a vast geographical area (the North-East Pacific), a large span of depths (0 to 4300 meters below surface) and across all seasons, these spatio-temporal variations lead to a highly varied distribution of physical properties in the CTD scans. This degree of variation is difficult for a single MLP to cope with and learn how to distinguish between good or poor-quality scans in all observed conditions. Performance can be greatly improved by first partitioning the scans and then training a separate MLP for each partition. We achieve this using a GMM trained on the primary physical properties of the scans, namely the temperature, conductivity and salinity values.
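The partitioning step described above could be sketched roughly as follows. This is a minimal illustration using scikit-learn's `GaussianMixture`, not the pipeline's actual code: the synthetic data, the number of components, the covariance type and the standardization step are all assumptions, since the text does not specify them.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for CTD scans (illustrative values only).
# Columns: temperature (deg C), conductivity (S/m), salinity (PSU).
warm_surface = rng.normal([14.0, 3.9, 32.5], [1.0, 0.1, 0.3], size=(300, 3))
cold_deep = rng.normal([2.0, 3.2, 34.5], [0.3, 0.05, 0.1], size=(300, 3))
scans = np.vstack([warm_surface, cold_deep])

# Assumption: standardize so no property dominates because of its units.
features = StandardScaler().fit_transform(scans)

# Assumption: n_components=2 is a placeholder; the source does not state
# how many partitions are used.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
partition = gmm.fit_predict(features)

# Group scan indices by partition; each group would get its own MLP.
groups = {k: np.flatnonzero(partition == k) for k in range(gmm.n_components)}
for k, idx in groups.items():
    print(f"partition {k}: {idx.size} scans")
```

At prediction time, a new scan would be transformed with the same scaler, assigned a partition with `gmm.predict`, and then passed to the classifier trained for that partition.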

−

Finally, data classification is achieved using ~~once~~ MLP per cluster of CTD scans. The MLP is trained for binary classification to flag CTD as poor-quality (to be deleted) or good quality (to be preserved). The ground truth used to produce the training data is derived from historical instances of quality control, using the decisions made by oceanographers in the past on which scans should be deleted. The deleted scans are far fewer than the preserved scans, making the training data highly imbalanced. This is often problematic in a machine learning setting as it causes the model to learn in a fashion biased towards the majority class. To address this, we apply oversampling to randomly duplicate deleted scans in the training data in order to balance their numbers with the preserved scans.

+

Finally, data classification is achieved using one MLP per cluster of CTD scans. The MLP is trained for binary classification to flag CTD as poor-quality (to be deleted) or good quality (to be preserved). The ground truth used to produce the training data is derived from historical instances of quality control, using the decisions made by oceanographers in the past on which scans should be deleted. The deleted scans are far fewer than the preserved scans, making the training data highly imbalanced. This is often problematic in a machine learning setting as it causes the model to learn in a fashion biased towards the majority class. To address this, we apply oversampling to randomly duplicate deleted scans in the training data in order to balance their numbers with the preserved scans.
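As a rough sketch of the oversampling and per-cluster training described above, the following uses NumPy and scikit-learn's `MLPClassifier`. The helper `oversample_minority`, the synthetic data, the labelling rule and the network size are hypothetical choices made for illustration; the text does not give the actual architecture or hyperparameters.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier


def oversample_minority(X, y, rng):
    """Randomly duplicate minority-class rows until both classes are equal in size."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    # Sample with replacement, so a given scan may be duplicated several times.
    extra = rng.choice(minority_idx, size=deficit, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]


rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 3 features per scan, a cluster id
# per scan, and a label (1 = deleted / poor quality, 0 = preserved).
n = 1000
X = rng.normal(size=(n, 3))
cluster = rng.integers(0, 2, size=n)
y = (X[:, 0] > 1.65).astype(int)  # roughly 5% of scans labelled as deleted

# Train one binary classifier per cluster on its balanced training set.
models = {}
for k in np.unique(cluster):
    Xk, yk = X[cluster == k], y[cluster == k]
    Xk_bal, yk_bal = oversample_minority(Xk, yk, rng)
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
    models[k] = clf.fit(Xk_bal, yk_bal)
    print(f"cluster {k}: class counts after oversampling = {np.bincount(yk_bal)}")
```

Oversampling is normally applied to the training split only. Duplicating scans before splitting would leak copies of the same scan into the validation data and inflate the measured performance.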

[[File:Ml_pipeline.png|alt=|center|992x992px]]

Lee.croft (121 edits)