tl;dr
For the release of CoralNet 1.0, we have created a new deep learning engine that is more accurate and faster. The accuracy of automatic classification of benthic images depends on many factors, including imaging conditions (e.g., resolution, lighting, turbidity, motion blur, camera quality), the particular taxa of interest, and the number of training annotations. On a test set of 43,049 images from 26 sources, the average error rate of CoralNet 1.0 was 22.4% lower relative to the error rate of CoralNet Beta. To accomplish this, we trained a new convolutional neural network (EfficientNet-B0) with an order of magnitude more highly curated training images provided by CoralNet users, and we replaced the final classifier with a multi-layer perceptron.
Introduction to the CoralNet Deep Learning Engine
CoralNet’s machine learning engine performs automatic point annotation: for a given pixel location in an image, the location is classified using the labels in a source’s Labelset. The input is an image patch cropped from the full-size image around that point, and the output is a score, ranging from 0.0 to 1.0, for each of the n mutually exclusive labels in the Labelset. For a typical CoralNet source there are between 5 and 80 labels, with most sources having fewer than 20. The label with the highest score is taken to be the annotation for the point. The n scores sum to 1.0, akin to a probability distribution, and provide a measure of the confidence of the decision. This confidence score is what CoralNet’s alleviation feature relies on.
The CoralNet engine is built on a transfer learning approach with two stages. The first stage is a feature extractor: a convolutional neural network that takes an image patch (usually 224x224 pixels) centered on the given point and outputs a feature vector whose length ranges from 1024 to 4096 depending on the particular network. The second stage is a classifier that takes the feature vector as input and outputs the n scores. See the figure.
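To make the two-stage design concrete, here is a minimal sketch in PyTorch of how a patch flows through a backbone and a per-source classifier head. The backbone weights, head sizes, and label count are illustrative placeholders, not CoralNet's production code.

```python
import torch
import torch.nn as nn
from torchvision import models

# Stage 1: CNN backbone mapping a 224x224 patch to a feature vector.
backbone = models.efficientnet_b0(weights=None)   # CoralNet retrains its backbone on benthic imagery
backbone.classifier = nn.Identity()               # keep only the pooled 1280-d feature vector
backbone.eval()

# Stage 2: a small per-source classifier head over the source's n labels.
n_labels = 20                                     # a typical Labelset size (placeholder)
head = nn.Sequential(nn.Linear(1280, 100), nn.ReLU(), nn.Linear(100, n_labels))

patch = torch.rand(1, 3, 224, 224)                # image patch cropped around one point
with torch.no_grad():
    feature = backbone(patch)                     # feature extraction
    scores = head(feature).softmax(dim=1)         # n scores that sum to 1.0
```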
Generally speaking, supervised networks require thousands of examples of each class (label) to be trained effectively, which is typically not available for an individual source in the CoralNet scenario. The alternative we developed for CoralNet is to train a single deep network feature extractor by pooling a large collection of data, and then to train per-source classifiers. When an image is uploaded to CoralNet and a set of point locations is specified for the image, the feature vector for each point is computed and stored in a database. Then, for each source, a classifier is trained from the computed features and the manually annotated labels of that source. During the lifecycle of a source, classifiers are repeatedly trained for a number of reasons:
- Users manually annotate additional points.
- Users confirm automatic annotations.
- Users may decide to add or remove labels from the source’s Labelset.
With more labelled training data, we expect the classifier to become more accurate, and so the classifier is retrained.
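As a rough illustration of this per-source step, the sketch below trains a classifier from cached feature vectors and the current set of confirmed labels; the helper name, toy data, and the logistic regression choice (as in CoralNet Beta) are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_source_classifier(features, labels):
    """features: (n_points, feature_dim) cached vectors; labels: (n_points,) confirmed label ids."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features, labels)
    return clf

# Toy example: 500 cached feature vectors with confirmed labels from one source.
features = np.random.rand(500, 1280)
labels = np.random.randint(0, 15, size=500)
robot = train_source_classifier(features, labels)
scores = robot.predict_proba(features[:1])   # per-label scores for one point
```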
This basic architecture of separate feature extractors and classifiers has persisted since CoralNet Alpha was launched in 2011. It followed the architecture in [1], with a feature extractor based on texture and color filters and a Bag of Visual Words using hierarchical pooling, along with a support vector machine for classification. CoralNet Beta, launched in 2017, replaced the feature extractor with a convolutional neural network based on VGG16 [6], written in Caffe and retrained on benthic images, and replaced the classifier with logistic regression. Over the past year and a half, we sought to improve the accuracy and efficiency of CoralNet by considering alternative feature extractors and classifiers. The result is being deployed with the launch of CoralNet 1.0 in January 2021: an EfficientNet-B0 [4] (retrained) as the feature extractor and a multi-layer perceptron as the classifier, all written in PyTorch.
Training and Evaluation Data
When the CoralNet Beta network was trained, we used 62,000 training images. Since then, CoralNet has been heavily used by marine scientists, with more than 1 million images uploaded by the time we started this project; today there are 1.6M images. We wanted to leverage this well-curated labelled data to further improve CoralNet’s accuracy. While CoralNet was first created for the purpose of estimating coverage of coral reefs from benthic image surveys, our users have found value in a broader range of image data, from seagrasses and cold-water rocky habitats to oil rigs, pier pilings, and ARMS plates. While the vast majority of CoralNet sources are from tropical coral reefs, uploaded images range from as far south as Antarctica to as far north as Scotland.
While we went through a few iterations of data preprocessing, in the end we exported 330 representative sources from CoralNet, with 591,604 images and a total of 16,533,651 annotated points. The sources were divided into a training set of 304 sources and a held-out test set of 26 sources. The test set sources had between 201 and 15,258 images each.
In CoralNet today, there are a total of 4,489 user-defined labels, but we also knew that many were duplicates: different labels and names for the same taxa. We were fortunate that Jessica Bouwmeester joined our team and helped us verify the duplicate labels. Eventually, we identified 315 duplicate labels covering 5,436,343 image annotations, and we ended up using 1,275 labels for training the deep network feature extractor. As described in an earlier blog post, the CoralNet user interface was rewritten so that existing labels are more discoverable, with the hope that fewer duplicates will be created in the future.
Training a New Feature Extractor
The goal of the feature extractor is to create features that are discriminative between classes and that work well across the wide range of classes and imaging conditions CoralNet users are interested in. Starting from the full training dataset above, there were many choices to be made: Which networks to evaluate? What image patch size to use? What optimization method works well? What loss function? What fraction of the data to use? Which part of the data to use? How should class imbalance be handled? Many experiments were performed to make these decisions, requiring 9,751 GPU hours provided by UCSD’s Nautilus HyperCluster.
For each experiment, our methodology was to train a network with a loss function evaluated across all training images. A trained network was then evaluated by creating separate classifiers for each test source, and the metrics were computed over the labelled images in that test source. This simulates how the feature extractor is used within CoralNet. Two metrics were used: average source accuracy (also called recall) and average source F1 score. The F1 score is the harmonic mean of precision and recall. We note that higher precision of the classifier is often aligned with higher accuracy on rarer classes.
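For reference, with precision P and recall R measured on a test source, the F1 score is:

```latex
F_1 = \frac{2\,P\,R}{P + R}
```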
The training set for the feature extractor was created by taking the union of labels across the training sources, yielding 1,275 classes. We trained the network through a softmax layer and cross entropy loss function over these classes. Because of the severe class imbalance, we went through a number of iterations on the dataset, deciding whether we should restrict training to classes with sufficient representation. We also explored the benefits of balancing the dataset through data augmentation and resampling — rotating and flipping image patches from rare classes to create additional samples coupled with subsampling the dominant classes.
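The sketch below illustrates one way to implement this kind of balancing: augmenting rare classes with rotations and flips while subsampling dominant ones. The per-class target and the specific transformations are assumptions for illustration, not the values used in training.

```python
import random
from PIL import Image

TARGET_PER_CLASS = 5000   # assumed per-class target, not the value used in training

def balance(patches_by_label):
    """patches_by_label: dict mapping label -> list of PIL image patches."""
    balanced = {}
    for label, patches in patches_by_label.items():
        if len(patches) > TARGET_PER_CLASS:
            # Subsample dominant classes.
            balanced[label] = random.sample(patches, TARGET_PER_CLASS)
        else:
            # Augment rare classes with rotations and flips until the target is reached.
            augmented = list(patches)
            while len(augmented) < TARGET_PER_CLASS:
                p = random.choice(patches).rotate(random.choice([90, 180, 270]))
                if random.random() < 0.5:
                    p = p.transpose(Image.FLIP_LEFT_RIGHT)
                augmented.append(p)
            balanced[label] = augmented
    return balanced
```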
We settled on using the 1-cycle policy for optimization [2] and training for 5 epochs. Typically it took 3-5 days to train one network on a GPU cluster, where the time depended on the size of the deep network (EfficientNet-B0 was fast, ResNet101 was slow). The training data was split into a training set (90%) and an evaluation set (10%). At the end of each epoch, the loss on the evaluation set was computed to detect overfitting. When all the experiments were complete and we had settled on the final architecture, we continued the EfficientNet-B0 production training for 20 epochs, starting with a learning rate of 10⁻³ and manually decreasing it to 10⁻⁷ based on the classification accuracy of the validation set.
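A minimal sketch of such a training loop, using PyTorch's OneCycleLR scheduler, is shown below. The data loader, batch size, and maximum learning rate are placeholders; the production run used the schedule described above.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

model = models.efficientnet_b0(num_classes=1275)   # softmax head over the 1,275 pooled labels

# Placeholder data loader; the real training set has millions of annotated patches.
dummy = TensorDataset(torch.rand(8, 3, 224, 224), torch.randint(0, 1275, (8,)))
train_loader = DataLoader(dummy, batch_size=4)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

epochs = 5
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, epochs=epochs, steps_per_epoch=len(train_loader))

for epoch in range(epochs):
    for patches, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(patches), labels)
        loss.backward()
        optimizer.step()
        scheduler.step()   # 1-cycle: learning rate ramps up then anneals each step
    # evaluate on the held-out 10% split here to monitor overfitting
```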
A New Classifier for CoralNet
The CoralNet classifiers (robots) are trained on the extracted features using manually labelled and confirmed annotations. We take a source’s labelled data and partition it, with 7/8 going into the classifier training set and 1/8 into the classifier testing set; 10% of the classifier training set is used to calibrate the classifier. CoralNet Beta uses Calibrated Logistic Regression for classification. We explored whether alternative classifiers would be more accurate, including Linear Support Vector Machines (SVM), Random Forests, Gaussian models, and variations of a Multilayer Perceptron (MLP). We also explored data augmentation strategies using geometric augmentation of rare classes and using SMOTE [5]. We found that logistic regression worked as well as or better than the alternatives, except for some variations of the MLP. Most importantly, we found that while the MLP gave a modest improvement in accuracy, increasing it from 76.35% to 76.64%, the F1 score increased from 74.8% to 76.5% in an experiment. For the launch of CoralNet 1.0, we are using a multi-layer perceptron with a hierarchical hyper-parameter setting: if the source has fewer than 50,000 annotated points, one hidden layer with 100 hidden units and a learning rate of 10⁻³ is used; otherwise, two hidden layers with 200 and 100 hidden units and a learning rate of 10⁻⁴ are used. The ReLU activation function and the Adam optimizer are used in both cases.
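A minimal sketch of this hierarchical hyper-parameter rule, written with scikit-learn's MLPClassifier, follows; the production implementation may differ in its details.

```python
from sklearn.neural_network import MLPClassifier

def build_source_mlp(n_annotations):
    """Pick MLP hyper-parameters based on the number of annotated points in the source."""
    if n_annotations < 50_000:
        hidden_layers, lr = (100,), 1e-3        # one hidden layer with 100 units
    else:
        hidden_layers, lr = (200, 100), 1e-4    # two hidden layers with 200 and 100 units
    return MLPClassifier(hidden_layer_sizes=hidden_layers,
                         activation="relu",
                         solver="adam",
                         learning_rate_init=lr)
```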
Performance Comparison
We compared the accuracy of automatic classification between CoralNet 1.0 and CoralNet Beta using 26 test sources with 43,049 images and 2,188,956 annotations. This test set included Source 1579, an extremely well-curated NOAA source with 15,258 annotated images.
Accuracy of Logistic Regression vs. MLP
To validate the performance of the Multi-layer Perceptron (MLP) classifier vs. the Logistic Regression (LR) classifier, we trained both on the features extracted with CoralNet 1.0 (EfficientNet-B0) from images in the 26 test sources. In absolute accuracy, the MLP was up to 3% more accurate than LR, though it performed slightly worse on 10 out of 26 test sources; the average accuracy improvement was 0.5%. More importantly, the F1 score for the MLP improved over LR by 1.2% on average, while it decreased only slightly for 5 out of 26 test sources. Not shown in these plots is that for small sources (20-40 images), the average accuracy improvement was about three times greater, which will facilitate the early stages of semi-automatic labelling using alleviation.
Accuracy of CoralNet 1.0 vs. CoralNet Beta
The following plots compare the accuracy of CoralNet Beta, which is built on VGG16 and logistic regression, with CoralNet 1.0, which is built on an EfficientNet-B0 feature extractor trained on 16 million annotations and an MLP classifier. First, we look at the accuracy improvement of CoralNet 1.0 over CoralNet Beta for the 26 test sources, and see that an improvement of up to 19.86% absolute accuracy was achieved. However, the accuracy did decrease for 3 sources.
The following is a histogram over the 26 sources of the accuracy improvement of CoralNet 1.0 over CoralNet Beta, and shows that the median improvement in accuracy is about 5%.
Finally, below are summary statistics of CoralNet Beta and CoralNet 1.0 over all sources. Note that we do not have a way to compute F1 score on CoralNet.
Inference Speed
We explored the inference speed of different networks on CPUs vs. GPUs. When deciding which network to deploy in CoralNet 1.0, we were concerned with both accuracy and computational speed. A faster network provides a better user experience and lowers the cost of operating CoralNet. If the accuracy of two networks were comparable, we would favor the faster network.
Speed is highly dependent on the specific machine; for the following experiment we used an AMD Ryzen 7 3700X CPU (8 cores, 16 threads) and an RTX 2080 Ti GPU. Because GPU performance can be dominated by load times for small batches, we also looked at the impact of batch size. As seen below, a batch size of 10 or greater is adequate for GPU compute to dominate over load times, and the GPU is about 30 times faster than the CPU.
(Note scale of CPU is in seconds and GPU in ms.)
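For readers who want to reproduce a rough version of this measurement, the sketch below times a forward pass for a few batch sizes on CPU and, if available, GPU. The methodology is simplified (no warm-up runs or averaging), so absolute numbers will differ from those reported here.

```python
import time
import torch
from torchvision import models

model = models.efficientnet_b0(weights=None).eval()

def time_batch(device, batch_size):
    m = model.to(device)
    x = torch.rand(batch_size, 3, 224, 224, device=device)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        m(x)                       # one feature-extraction forward pass
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - start

for bs in (1, 10, 100):
    print("cpu", bs, time_batch("cpu", bs))
    if torch.cuda.is_available():
        print("gpu", bs, time_batch("cuda", bs))
```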
After drilling into a few different networks that we’ve used in other contexts, we settled on exploring the ResNet family [3] and the EfficientNet family [4], and we also evaluated the tradeoff from using an input patch size of 168x168 vs. 224x224. Note that the network running on CoralNet Beta is VGG16 with an input size of 224x224. The following plot summarizes our study of the tradeoff between compute time and accuracy for an early evaluation of 8 alternative network/size options vs. VGG on CoralNet Beta.
From this we concluded that
- Nearly all of the networks are faster than CoralNet Beta, potentially twice as fast.
- The accuracy of CoralNet Beta is 71.3%, and all of the networks are more accurate by a solid margin.
- Two networks stood out to us: ResNet101 was the most accurate network at 77.59%. EfficientNet-B0 paid a small price in accuracy (76.68%), but it was more than twice as fast as ResNet101.
Considering the inference speed, network performance, and CPU/GPU instance price, we chose EfficientNet-B0 as the new backend network. With CPU instances on AWS running the backend, EfficientNet-B0 is 2-3 times faster than VGG16, which significantly speeds up feature extraction.
Classifier Online Training Time
While the Multi-layer Perceptron classifier is more accurate than the Logistic Regression classifier, we wanted to ensure that it wouldn’t be prohibitively slow or expensive to train. During the course of annotating a large dataset using our alleviate strategy, the classifier is retrained whenever there are an additional 10% more manually labelled or confirmed annotations. For a large source with, say, 10,000 images, the classifier will be retrained 55 times. As shown in the plot below, the MLP’s training speed is nearly the same as the Logistic Regression classifier on small sources and even faster on large sources.
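As a small illustration, the retraining trigger can be thought of as the rule sketched below (assumed form: retrain once confirmed annotations have grown by 10% since the last training run); the exact bookkeeping in CoralNet may differ.

```python
def should_retrain(confirmed_now, confirmed_at_last_training):
    """Retrain once confirmed annotations grow by 10% since the last training run."""
    return confirmed_now >= 1.1 * confirmed_at_last_training

# Example: a robot last trained with 9,000 confirmed points is retrained at 9,900.
print(should_retrain(9_900, 9_000))   # True
```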
Bibliography
[1] Beijbom, Oscar, Peter J. Edmunds, David I. Kline, B. Greg Mitchell, and David Kriegman. "Automated annotation of coral reef survey images." In 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1170-1177, 2012.
[2] Smith, Leslie N. "A disciplined approach to neural network hyper-parameters: Part 1--learning rate, batch size, momentum, and weight decay." arXiv preprint arXiv:1803.09820 (2018).
[3] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
[4] Tan, Mingxing, and Quoc Le. "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks." International Conference on Machine Learning. 2019.
[5] Chawla, Nitesh V., et al. "SMOTE: synthetic minority over-sampling technique." Journal of artificial intelligence research 16 (2002): 321-357.
[6] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).