2

I have a Bull queue running lengthy video upload jobs, which can take anywhere from under a minute up to many minutes.

The jobs stall after the default 30 seconds, so I increased the timeout to several minutes, but this is not respected. If I set the timeout to 10 ms the job stalls immediately, so the timeout is being taken into account.

Job {
  opts: {
    attempts: 1,
    timeout: 600000,
    delay: 0,
    timestamp: 1634753060062,
    backoff: undefined
  },
  ...
}
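
For reference, the job is added roughly like this (simplified sketch; the queue name and payload are placeholders, only the opts match the dump above):

const Bull = require('bull')

const videoQueue = new Bull('video-upload')

// Enqueue an upload job with a single attempt and a 10-minute timeout,
// matching the opts shown above.
videoQueue.add(
  { filePath: '/tmp/video.mp4' }, // placeholder payload
  { attempts: 1, timeout: 600000 }
)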

Despite the timeout, I am receiving a stalled event, and the job starts to process again.

EDIT: I thought "stalling" was the same as timing out, but apparently there is a separate interval controlling how often Bull checks for stalled jobs. In other words, the real problem is why jobs are considered "stalled" even though they are busy performing an upload.

Simon

2 Answers

2

The problem seems to be that your job is stalling because the operation you are running blocks the event loop. You could convert your code into a non-blocking version and solve the problem that way.
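
As a rough sketch of what "non-blocking" could look like here (the chunked payload and uploadChunk are hypothetical, not from the question): do the heavy work with async I/O and yield between chunks, so the event loop stays free and Bull can keep renewing the job lock:

const Bull = require('bull')
const queue = new Bull('queue')

queue.process(async (job) => {
  // Hypothetical chunked upload: each await gives the event loop a chance
  // to run Bull's lock-renewal timer, so the job is not seen as stalled.
  for (const chunk of job.data.chunks) {                 // hypothetical payload shape
    await uploadChunk(chunk)                             // hypothetical async upload call
    await new Promise((resolve) => setImmediate(resolve)) // explicitly yield to the event loop
  }
})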

That being said, the stalled-interval check can be configured in the queue settings when the queue is created (more of a quick fix):

const Bull = require('bull')

const queue = new Bull('queue', {
  redis: {
    port: 6379,
    host: 'localhost',
    db: 0,
  },
  settings: {
    stalledInterval: 60 * 60 * 1000, // raise the default of 30 sec to 1 hour; set 0 to disable the stalled check
  },
})

Based on Bull's docs:

  • timeout: The number of milliseconds after which the job should fail with a timeout error
  • stalledInterval: How often to check for stalled jobs (use 0 for never checking)

Increasing the stalledInterval (or disabling it by setting it to 0) removes the check that verifies the event loop is still running, effectively making Bull ignore the stalled state.

Again, from the docs:

When a worker is processing a job it will keep the job "locked" so other workers can't process it.

It's important to understand how locking works to prevent your jobs from losing their lock - becoming _stalled_ -
and being restarted as a result. Locking is implemented internally by creating a lock for `lockDuration` on interval
`lockRenewTime` (which is usually half `lockDuration`). If `lockDuration` elapses before the lock can be renewed,
the job will be considered stalled and is automatically restarted; it will be __double processed__. This can happen when:
1. The Node process running your job processor unexpectedly terminates.
2. Your job processor was too CPU-intensive and stalled the Node event loop, and as a result, Bull couldn't renew the job lock (see [#488](https://github.com/OptimalBits/bull/issues/488) for how we might better detect this). You can fix this by breaking your job processor into smaller parts so that no single part can block the Node event loop. Alternatively, you can pass a larger value for the `lockDuration` setting (with the tradeoff being that it will take longer to recognize a real stalled job).

As such, you should always listen for the `stalled` event and log this to your error monitoring system, as this means your jobs are likely getting double-processed.

As a safeguard so problematic jobs won't get restarted indefinitely (e.g. if the job processor always crashes its Node process), jobs will be recovered from a stalled state a maximum of `maxStalledCount` times (default: `1`).
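
If splitting the work isn't practical, the lock parameters mentioned above live in the same settings object as stalledInterval; a rough sketch (the values are arbitrary examples, not recommendations):

const Bull = require('bull')

// Raise the lock duration so a busy (but healthy) worker is not marked
// as stalled; the tradeoff is slower detection of genuinely dead workers.
const queue = new Bull('queue', {
  redis: { port: 6379, host: 'localhost', db: 0 },
  settings: {
    lockDuration: 5 * 60 * 1000,    // example: 5 minutes instead of the 30 sec default
    lockRenewTime: 2.5 * 60 * 1000, // typically half of lockDuration
    maxStalledCount: 1,             // default: how many times a stalled job may be restarted
  },
})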
alikh31
  • What is the difference between a job's timeout and the stalledInterval setting? – Simon Nov 10 '22 at 16:10
  • @Simon just updated the answer to add some info about the differences between timeout and stalledInterval – alikh31 Nov 12 '22 at 19:18
  • Thank you for clarifying. Actually, looking at your quote from the docs, it seems to me the fundamental problem here is why the job is stalling if it is still uploading. Probably the correct solution isn't to allow the job to stall, which is what I think increasing the stalledInterval would do, but rather to avoid the job stalling in the first place. I will update the question to reflect this. – Simon Nov 14 '22 at 01:39
  • Very true, I will also reflect that in the answer. Although in certain cases, like the one I had where blocking operations on the job's thread caused it to stall, turning off the stall checker might be a useful quick fix. – alikh31 Nov 14 '22 at 12:46
0

A better approach is to use the job.progress() function: a long-running task can report its progress at regular intervals so the job is not marked as stalled while the upload is running.

https://github.com/OptimalBits/bull/blob/HEAD/REFERENCE.md#jobprogress
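
A rough sketch of what that could look like (the queue name, chunk loop, and uploadChunk call are placeholders, not from the question):

const Bull = require('bull')
const queue = new Bull('video-upload')

queue.process(async (job) => {
  const totalChunks = 100                   // placeholder chunk count
  for (let i = 0; i < totalChunks; i++) {
    await uploadChunk(i)                    // placeholder async upload of one chunk
    await job.progress(Math.round(((i + 1) / totalChunks) * 100)) // report 0-100
  }
})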

Also, you can listen for the stalled event and log it to help you troubleshoot:

queue.on('stalled', function (job) {
  // A job has been marked as stalled. This is useful for debugging job
  // workers that crash or pause the event loop.
  console.warn(`Job ${job.id} stalled and will be reprocessed`)
})