Those of you following the Google Group forum probably know that CoralNet's gone through a recent period of hampered performance. There's been a fair bit of work done in the past few weeks to get the site running more smoothly again, and to improve performance monitoring going forward. I think it's a good time to give a rundown of key events and changes that were made during this period.
About source checks and other types of jobs
In a blog post last year, I introduced the source Jobs dashboard:
But I didn't really go into detail about the types of jobs you might see on this page. Here's a rundown:
- Check source: Checks whether it's time to schedule more feature-extraction, training, or classification jobs for the source. The general strategy this job uses is:
  - First, extract features for all images.
  - Then, check whether the criteria are met for a new training job.
  - Then, make sure all non-confirmed images are classified with the source's selected classifier.

  Later in this blog post, I'll go into more detail on how source checks themselves get scheduled. (There's also a rough sketch of the above strategy just after this list.)
- Extract features: A pre-processing step that must be run for every image before that image can be classified or used for training a classifier.
- Train classifier: Trains a new classifier for the source, using all the confirmed and feature-extracted images in the source up to that point. The new classifier may or may not be saved, depending on how its evaluation results compare to those of the source's previous classifiers.
- Classify: Classifies an image using the source's selected classifier. This updates machine suggestions for the annotation tool and for exports, and also updates annotations if they haven't been confirmed yet.
- Reset classifiers for source: Deletes all previously trained classifiers saved in the source. Certain changes, such as changing the source's labelset, will result in this job being scheduled.
- Reset features for source: Deletes all previously extracted features saved in the source. Changing the source's feature-extractor setting, or changing the selected classifier to one that uses a different extractor, results in this job being scheduled.
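To make that first job type a bit more concrete, here's a rough Python sketch of the source-check strategy. All of the names and structures below are made up for illustration; they aren't CoralNet's actual code.

```python
# Rough sketch of the "Check source" strategy described above.
# All names here are illustrative stand-ins, not CoralNet's actual code.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Image:
    has_features: bool = False
    is_confirmed: bool = False
    classified_with: Optional[int] = None  # id of the classifier behind the current machine annotations

@dataclass
class Source:
    images: list = field(default_factory=list)
    selected_classifier_id: Optional[int] = None
    meets_training_criteria: bool = False

def check_source(source: Source) -> list:
    """Return the list of (job name, target) pairs this check would schedule."""
    jobs = []

    # 1) First, extract features for any images that don't have them yet.
    needing_features = [im for im in source.images if not im.has_features]
    jobs += [('extract_features', im) for im in needing_features]
    if needing_features:
        # Hold off on training/classification until features are ready;
        # a later source check picks things up from there.
        return jobs

    # 2) Then, check whether the criteria for a new training job are met.
    if source.meets_training_criteria:
        jobs.append(('train_classifier', source))

    # 3) Then, make sure all non-confirmed images are classified with the
    #    source's currently selected classifier.
    for im in source.images:
        if not im.is_confirmed and im.classified_with != source.selected_classifier_id:
            jobs.append(('classify', im))

    return jobs
```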
Speed of source checks
The source Jobs dashboard mentioned above was introduced in November 2023. While working on that, I noticed issues with the source check's logic for figuring out which images to schedule classifications for. So, in that same month, I updated the logic to be more accurate. This made the logic less likely to miss images that need classification, and less likely to redundantly classify images that are already up to date.
Unfortunately, the tradeoff for more accuracy was that the check became slower as well. On top of that, I believe the slowness didn't manifest immediately, because the new logic uses a database table that was brand new in November 2023. Now, almost a year later, that table has gradually grown to have 1.8 million entries in it, enough for us to really notice that accuracy vs. slowness tradeoff.
As of October 23, 2024, the logic for scheduling classifications has changed to be even more accurate, but I believe it's still on a similar level as far as speed goes. I tried to make it faster, but ultimately I focused elsewhere to get the speedups we needed.
Frequency of source checks
Way back in 2016, we introduced a periodic background-job that would try to run every 24 hours. The job would check all sources to see which ones needed to train a new classifier.
This logic remained largely the same until 2022, when I generalized "check for needing training" to the modern concept of source checks - "check for needing features, training, or classification". This was done to address various issues involving those jobs not getting queued when they needed to be.
In the past year, I've greatly enhanced the ability to monitor background jobs and see what the system is getting stuck on. While looking at the job dashboards again and again in the past few weeks, I noticed that the system seemed to be spending about 90% of its time running source checks.
Now, we've known that running source checks this way was quite wasteful in the sense that, on any given day, the vast majority of sources are not being actively worked on. However, until quite recently, the logic to schedule source checks based on other actions (such as image uploads) still didn't seem good enough, and it seemed like there'd be a lot of sources that would not get the checking they need if we stopped the daily automatic checks.
Well, after all these years, a combination of factors - the ever-growing number of sources, the slower source checks introduced a year ago, and the high website activity this fall - has finally made the daily source checks a critical bottleneck for CoralNet. So:
- We've finally retired the daily source checks.
- I've done what I could to improve the other types of logic for scheduling source checks.
- There's now a button to request a source check:
CoralNet will continue making its best effort to automatically schedule source checks at appropriate times, but in some cases you may have to request a source check yourself. As long as no other jobs are running or pending in the source, you should see a "Run a source check" button near the top of this page.
Note that the button does have a cooldown period. For now, if the last check was run within the last 30 minutes, you'll have to wait before your next request goes through.
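In code terms, the cooldown amounts to a simple time comparison. Here's a minimal sketch, assuming the 30-minute window is measured from the time of the last check; the names are illustrative, not CoralNet's actual implementation.

```python
# Minimal sketch of the source-check button's cooldown. Illustrative only.

from datetime import datetime, timedelta, timezone
from typing import Optional

CHECK_COOLDOWN = timedelta(minutes=30)

def can_request_source_check(last_check_time: Optional[datetime],
                             now: Optional[datetime] = None) -> bool:
    """Allow a new check only if no check was done within the cooldown window."""
    if last_check_time is None:
        return True
    now = now or datetime.now(timezone.utc)
    return now - last_check_time >= CHECK_COOLDOWN
```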
Swapping the daily checks for a check button has been a game-changer for getting the job queues caught up. Hopefully, the logic to automatically schedule checks is indeed smart enough so that they won't have to be manually requested too often. If you notice a situation where you have to keep requesting source checks yourself, I'd like to improve that, so don't hesitate to report it.
On the other hand, there are still situations where many source checks are automatically run in a row without accomplishing anything useful. I believe I fixed at least one of these situations recently, but I'll look out for ways to do even better and further minimize the waste.
Server downtimes
While this was going on, there was a separate major issue in the past month. On three occasions, the CoralNet server became unresponsive and I had to manually restart it: on September 28, October 6, and October 16.
We had these unresponsiveness episodes more frequently years ago, but they've been rare recently, so I wasn't too familiar with how our updated background-job infrastructure would react to the server going unresponsive. On September 28, I didn't notice that some background processes had failed to restart, and basically nothing was happening in the vision backend as a result; that was fixed on September 30. But there was still a lingering problem that prevented existing pending jobs from getting started, and I didn't fix that until October 4.
Recovery from the next two downtimes went more smoothly after the first experience, but obviously the downtimes themselves are really annoying for everyone. Luckily, once I had taken other steps to get the queues caught up, the reduced noise made it possible to spot the common thread: all three downtimes occurred during 'Reset classifiers for source' jobs in very large sources.
From there, I was able to identify a way that this type of job could exhaust the server's memory capacity. So I reworked that job's code to be smarter about memory usage. Hopefully that did the trick, and if the episodes of unexpected downtime continue, then I'll keep using all the clues I can find to debug it.
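I won't reproduce the actual fix here, but the general pattern for making a bulk-delete job easier on memory is to work through the rows in batches rather than materializing everything at once. Here's a generic Django-style sketch of that pattern, with hypothetical model names; it isn't CoralNet's actual code.

```python
# Generic pattern for a memory-friendlier bulk delete: work through primary keys
# in fixed-size batches instead of loading every object (and its cascades) at once.
# Model and field names below are hypothetical, not CoralNet's schema.

def delete_in_batches(queryset, batch_size=1000):
    """Delete everything in `queryset`, `batch_size` rows at a time."""
    model = queryset.model
    while True:
        # Fetch only a batch of primary keys, not full objects.
        pks = list(queryset.values_list('pk', flat=True)[:batch_size])
        if not pks:
            break
        model.objects.filter(pk__in=pks).delete()

# Hypothetical usage for a 'reset classifiers for source' style job:
# delete_in_batches(Classifier.objects.filter(source=source))
```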
New page showing job-processing status
Although the source job dashboards can give you clues that something is wrong with your source's progress, there's still the question: is it just stuck on your end, or is it slow for everyone else on CoralNet as well? That's what this new page sets out to answer:
This page is linked as "Status" in the website footer (the bottom of each page), as long as you are logged in.
If this page shows the sitewide 90th percentile wait time as 4 hours, and your jobs have been pending for 1 day, then there's a fair chance something is stuck with your source specifically. Note that you can change the time interval of analysis - 3 days, 1 week, 4 hours, etc. - to whichever makes the most sense for the situation you're looking at.
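For reference, the 90th percentile statistic itself is just "the wait time that 90% of jobs stayed under" within the chosen interval. A tiny illustration of the idea (not CoralNet's actual implementation):

```python
# Illustration of a 90th percentile wait time, using the nearest-rank method.
# Not CoralNet's actual implementation.

def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile: the value below which ~pct% of the values fall."""
    ordered = sorted(values)
    rank = max(1, round(pct / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical wait times (hours) of jobs completed in the chosen interval.
wait_times = [0.1, 0.3, 0.5, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0]
print(percentile(wait_times, 90))  # -> 3.0
```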
The "Number of incomplete jobs" section aims to show whether or not the job queues have been overloaded recently. Though note that it only takes data from 7 points in time, so the average and the graph can be skewed by very short bursts of activity. There's room for improvement to make this page more accurate and relevant, but hopefully it offers some value in its current form.
Other performance improvements
- There is now a stricter limit on how many image classifications (non-API) can be automatically queued for a single source in one go. The limit is anywhere from a few hundred images to nearly 1,000, depending on the source's point count. Previously, the system could spend a few hours on end running classifications from a single source, which isn't fair to other sources. (There's a rough sketch of how such a limit might scale just after this list.)
- Besides that, image classification jobs themselves (again, non-API) now run slightly faster on average.
- There used to be only one dedicated process on the web server for running certain types of background jobs. Now there are two such processes, so if one job takes an unusually long time, other jobs can still run in parallel.
- There is a site-wide background job, run once a week, that recomputes certain label details such as popularities. Its runtime has been sped up from about 170 minutes to 13 minutes.
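On the first item in that list: one way to picture the limit is as a per-batch budget on total annotation points, so that sources with denser point sampling get fewer images queued at once. Here's a rough sketch along those lines; the budget and bounds are made up, not CoralNet's actual values.

```python
# Illustrative sketch of a per-source cap on auto-queued classifications that
# scales with points per image. The budget and bounds here are made up.

def classification_batch_limit(points_per_image: int,
                               point_budget: int = 50_000,
                               min_images: int = 200,
                               max_images: int = 1000) -> int:
    """Fewer images per batch when each image carries more annotation points."""
    by_budget = point_budget // max(points_per_image, 1)
    return max(min_images, min(max_images, by_budget))

print(classification_batch_limit(10))   # sparse points -> near the upper bound (1000)
print(classification_batch_limit(200))  # dense points  -> a few hundred (250)
```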
Hopefully, with efforts like these, CoralNet will run a little more reliably and efficiently than it has in recent months!