One day at work I noticed that emails were taking much longer to be sent out from our app. I narrowed down the problem to our background queue, which is responsible for sending the emails. The solution prompted me to write this DelayedJob “survival guide” to help others who may encounter the same issue in the future.

Asynchronous (background) processing is an important part of a web application: it keeps code that does not need to run synchronously from blocking the rest of the process. Common examples are sending emails or calling a third-party API or service.
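For example, with DelayedJob we can defer a method call using the delay proxy. Here is a minimal sketch, assuming the delayed_job gem is set up and a hypothetical UserMailer with a welcome_email method exists:

# Synchronous: the request waits until the email has been sent.
UserMailer.welcome_email(user).deliver_now

# Asynchronous: delayed_job serializes the call into the delayed_jobs
# table and a background worker sends the email later.
UserMailer.delay.welcome_email(user)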

There are many solutions for this, such as Redis-backed programs like Sidekiq or Resque. There are also database-backed programs like DelayedJob. The advantage of using a database-backed solution is its simplicity: you don’t need an external dependency (such as Redis) to run it. Instead, you can use your existing database to manage your background processing.

This simplicity also has a disadvantage: you are now constrained by your database, and database issues can directly affect your background processing.

The Problem

We had a new feature that required processing old data in the system. It used the background queue, since each individual task takes a few seconds to process. Eventually these tasks accumulated, resulting in more than half a million jobs in the DelayedJob queue.

When I noticed that the queue was not being processed as fast as I expected, I looked at the database logs. In the MySQL slow query log, almost all of the entries looked like this:

UPDATE delayed_jobs
SET `delayed_jobs`.`locked_at` = '2018-06-05 11:48:28',
    `delayed_jobs`.`locked_by` = 'delayed_job.2 host:ip-10-203-174-216 pid:3226'
WHERE ((run_at <= '2018-06-05 11:48:28'
  AND (locked_at IS NULL OR locked_at < '2018-06-05 07:48:28')
  OR locked_by = 'delayed_job.2 host:ip-10-203-174-216 pid:3226')
  AND failed_at IS NULL)
ORDER BY priority ASC, run_at ASC
LIMIT 1;

DelayedJob updates the locking information (timestamp and PID) when reserving jobs for processing. However, this UPDATE does not use the table's index, at least on older MySQL versions (5.6 or below). As the number of entries in the queue grows, the UPDATE becomes much slower.
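If you want to verify this on your own setup, you can inspect the query plan from the Rails console. This is only a rough sketch: it mirrors the reservation query above with placeholder values, and EXPLAIN on UPDATE statements requires MySQL 5.6.3 or newer.

# Print the plan for a reservation-style UPDATE to see whether an index is used.
# The timestamps and the locked_by value below are placeholders.
plan = ActiveRecord::Base.connection.exec_query(<<~SQL)
  EXPLAIN UPDATE delayed_jobs
  SET locked_at = NOW(), locked_by = 'manual-check'
  WHERE run_at <= NOW()
    AND (locked_at IS NULL OR locked_at < NOW() - INTERVAL 4 HOUR)
    AND failed_at IS NULL
  ORDER BY priority ASC, run_at ASC
  LIMIT 1
SQL
plan.to_a.each { |row| puts row.inspect }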

This is the problem with database-backed asynchronous queues: the database serves both as the state manager and as the storage and retrieval mechanism for the queue, which results in lock contention.

Emergency Processing

Since queue processing had become very slow, some critical tasks were not being performed, so we needed to run some jobs manually (using the Ruby/Rails console). We can invoke a DelayedJob worker manually with this command:

Delayed::Worker.new.run(delayed_job_object)

However, we may want to run all of the jobs in a given queue, say important_queue. We can query the database for every job in that queue and invoke the worker manually for each one:

Delayed::Job.where(queue: "important_queue").find_each do |dj|
  Delayed::Worker.new.run(dj)
end

In this manner we were able to quickly run the critical tasks that needed to happen immediately. However, this is not a scalable solution, as everything is done manually, and it won't solve the problem of having hundreds of thousands of jobs in the backlog.

Queue “Storage”

Searching the internet, I found others who had encountered this problem. Their solution is documented here and here. The gist of it is to temporarily move most (or all) of the rows in the delayed_jobs table into a separate table to “unclog” the background queue.

In this example, we will create a new table called delayed_jobs_storage with the same columns as the original delayed_jobs table. The examples also assume we are using MySQL as our database:

CREATE TABLE delayed_jobs_storage LIKE delayed_jobs;

Once the “storage” table has been created, we can move jobs into it. In this example, we limit the query to jobs in the huge_queue queue.

INSERT INTO delayed_jobs_storage (SELECT * FROM delayed_jobs WHERE queue='huge_queue');

Then we remove the jobs that we moved from the original delayed_jobs table:

DELETE FROM delayed_jobs WHERE queue='huge_queue';

At this point, the background processing speed returns to normal as the size of the table is now greatly reduced. The next step is to gradually move back some jobs from the delayed_jobs_storage table into the delayed_jobs table so they are processed.

This involves some trial and error, as we want to determine the optimal batch size: we want to move as many jobs back as possible without slowing down the queue. In my experiment, I found that we could transfer around 100,000 jobs back to the queue without impacting performance.

To move the first 100k jobs back into the delayed_jobs table:

INSERT INTO delayed_jobs (SELECT * FROM delayed_jobs_storage ORDER BY id ASC LIMIT 100000);

Then we need to remove those jobs from our “storage” table:

DELETE FROM delayed_jobs_storage ORDER BY id ASC LIMIT 100000;

We wait until all of those jobs have been processed and the queue returns to its minimal state, and then repeat the process until every job stored in delayed_jobs_storage has been moved back to the delayed_jobs table.
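If the move-back has to be repeated many times, it can also be scripted from the Rails console. The sketch below assumes the delayed_jobs_storage table created above; the batch size and the "queue is drained" threshold are guesses that you would tune for your own setup.

BATCH_SIZE = 100_000
conn = ActiveRecord::Base.connection

until conn.select_value("SELECT COUNT(*) FROM delayed_jobs_storage").to_i.zero?
  # Copy the oldest batch back into the live queue, then remove it from storage.
  conn.execute("INSERT INTO delayed_jobs (SELECT * FROM delayed_jobs_storage ORDER BY id ASC LIMIT #{BATCH_SIZE})")
  conn.execute("DELETE FROM delayed_jobs_storage ORDER BY id ASC LIMIT #{BATCH_SIZE}")

  # Wait for the workers to drain the queue before moving the next batch.
  sleep 60 while Delayed::Job.where(failed_at: nil).count > 1_000
end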

Afterthoughts

While this workaround will get you out of a bind when your background queue is clogged, it is not a long-term solution. As much as possible, we want to avoid this scenario in the first place!

Here are some ideas that you can implement:

  • Analyze each background job to look for areas of optimization. If the code running in a job is not optimized, it will run slower and consume more resources. Check your database queries and your code's performance to make sure they run as fast as possible, for example by adding table indexes and removing N+1 queries (see the sketch after this list).
  • Reorganize how you add jobs to the background queue. Sometimes we just add tasks to the queue without thinking about how they impact the rest of the jobs. Can you make your code add less to the queue by removing redundancy? Does combining smaller jobs into a larger job make sense? Should longer-running jobs be given a lower priority than faster ones?
  • Consider moving to a Redis-based solution such as Sidekiq. This eliminates the dependency on your main database and lets you use a separate (and more efficient) store for your background jobs.
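To make the first point concrete, here is a rough sketch of what adding an index and removing an N+1 query might look like in a Rails app. The Order, Post, and Comment models are hypothetical and not from the original application:

# Hypothetical migration: index a column that is filtered on frequently.
class AddStatusIndexToOrders < ActiveRecord::Migration[5.2]
  def change
    add_index :orders, :status
  end
end

# N+1 query: each post triggers an extra query to load its comments.
Post.all.each { |post| puts post.comments.size }

# Fixed: eager load the comments in a single additional query.
Post.includes(:comments).each { |post| puts post.comments.size }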

Photo by James Pond on Unsplash

 
