Deadlock error when running a model using many CPU cores

I am trying to run a model using 72 CPU cores and have come across a deadlock error:

00:54:54.094 [scheduler-thread] ERROR n.o.riskscape.engine.sched.Scheduler - Deadlock detected!  No worker threads are running and all tasks are blocked
00:54:54.100 [scheduler-thread] WARN  n.o.riskscape.engine.sched.Scheduler -   Blocked on full output:
00:54:54.101 [scheduler-thread] WARN  n.o.riskscape.engine.sched.Scheduler -     ChainTask(select_6:[select])
00:54:54.103 [scheduler-thread] WARN  n.o.riskscape.engine.sched.Scheduler -     ChainTask(select:[select], unnest:[unnest], select_2:[select], select_3:[select], select_4:[select], exposures:[select])
00:54:54.103 [scheduler-thread] WARN  n.o.riskscape.engine.sched.Scheduler -   Blocked on no input:
00:54:54.103 [scheduler-thread] WARN  n.o.riskscape.engine.sched.Scheduler -     ChainTask(select_7:[select], sampled:[select], select_9:[select], analysis:[select], select_12:[select], select_12-sink:[select]) - 22 / 70 tasks blocked
00:54:54.106 [scheduler-thread] ERROR n.o.riskscape.engine.sched.Scheduler - Unhandled exception in scheduler, exiting thread loop!
nz.org.riskscape.engine.RiskscapeException: Problem(ERROR: DEADLOCK)
        at nz.org.riskscape.engine.sched.Scheduler.checkForDeadlock(Scheduler.java:196)
        at nz.org.riskscape.engine.sched.Scheduler.runOnce(Scheduler.java:222)
        at nz.org.riskscape.engine.sched.Scheduler.run(Scheduler.java:275)
        at java.lang.Thread.run(Thread.java:748)
Problems found with pipeline model
  - Execution of your data processing pipeline failed. The reasons for this follow:
    - Failed to evaluate `{*, sample_centroid(geometry: exposure, coverage: event.coverage) as hazard}`
      - java.lang.RuntimeException: Waiting thread received a null tile.

This model runs fine, albeit very slowly, on a 4 core machine.

I have run the model “getting-started” using this many cores and that worked fine.

Could you confirm you’re using RiskScape v1.2.0? We have fixed a scheduler deadlock bug recently (GL816), but it is not in a formal RiskScape release yet. I’ll see if I can get you a build to try out.

I'm using v1.2.0, but a build from July:

RiskScape Core Engine v1.2.0
----------------------------
Build time - Fri Jul 29 12:27:23 UTC 2022
Git SHA1   - f406ef1705a1e287867b6ced8325f5d801d42b0b

Plugins:
defaults     1.2.0  nz.org.riskscape.engine.defaults.Plugin
postgis      1.2.0  nz.org.riskscape.postgis.Plugin
jython       1.2.0  nz.org.riskscape.jython.Plugin
wizard       1.2.0  nz.org.riskscape.wizard.WizardPlugin
cpython      1.2.0  nz.org.riskscape.cpython.CPythonPlugin
wizard-cli   1.2.0  nz.org.riskscape.wizard.WizardCliPlugin

System:
Linux 3.10.0-693.2.2.el7.x86_64
Java 1.8.0_144 OpenJDK 64-Bit Server VM 25.144-b01

I have updated to the dev version:

RiskScape Core Engine v1.3.0-dev
--------------------------------
Build time - Wed Sep 07 11:32:58 UTC 2022
Git SHA1   - 47bcb9a1417e934eb95860335ca4730c404a3451

Plugins:
defaults     1.3.0-dev  nz.org.riskscape.engine.defaults.Plugin
postgis      1.3.0-dev  nz.org.riskscape.postgis.Plugin
jython       1.3.0-dev  nz.org.riskscape.jython.Plugin
wizard       1.3.0-dev  nz.org.riskscape.wizard.WizardPlugin
cpython      1.3.0-dev  nz.org.riskscape.cpython.CPythonPlugin
wizard-cli   1.3.0-dev  nz.org.riskscape.wizard.WizardCliPlugin

System:
Linux 3.10.0-693.2.2.el7.x86_64
Java 1.8.0_144 OpenJDK 64-Bit Server VM 25.144-b01

But I am still getting what looks like the same issue.

Problems found with pipeline model
  - Execution of your data processing pipeline failed. The reasons for this follow:
    - Failed to evaluate `{*, sample_centroid(geometry: exposure, coverage: event.coverage) as hazard}`
      - java.lang.RuntimeException: Waiting thread received a null tile.

That looks like a slightly different problem. I’ll raise a new bug for it.

You could try using the riskscape --pipeline-threads CLI option to reduce the number of CPU cores being used, to see if that helps at all, e.g. try it with 32 cores.

I ran it again, using Slurm to limit the process to 32 cores:

03:56:53.504 [scheduler-thread] ERROR n.o.riskscape.engine.sched.Scheduler - Deadlock detected!  No worker threads are running and all tasks are blocked
Problems found with pipeline model
  - Execution of your data processing pipeline failed. The reasons for this follow:
    - Failed to evaluate `{*, sample_centroid(geometry: exposure, coverage: event.coverage) as hazard}`
      - java.lang.RuntimeException: Waiting thread received a null tile.
Exit exception thrown from nz.org.riskscape.engine.cli.model.RunCommand.doCommand(RunCommand.java:85)
Caused by:
nz.org.riskscape.engine.cli.ExitException: Problem(ERROR: EXECUTION_FAILEDchildren=[Problem(ERROR: NONE exception=nz.org.riskscape.engine.rl.EvalException: Problem(ERROR: NONE exception=java.lang.RuntimeException: Waiting thread received a null tile.children=[Problem(ERROR: CAUGHT_EXCEPTION['java.lang.RuntimeException: Waiting thread received a null tile.'] exception=java.lang.RuntimeException: Waiting thread received a null tile.)])children=[Problem(ERROR: CAUGHT_EXCEPTION['java.lang.RuntimeException: Waiting thread received a null tile.'] exception=java.lang.RuntimeException: Waiting thread received a null tile.)])])
        at nz.org.riskscape.engine.cli.pipeline.CliPipelineRunner.run(CliPipelineRunner.java:156)
        at nz.org.riskscape.engine.cli.model.RunCommand.doCommand(RunCommand.java:83)
        at nz.org.riskscape.engine.cli.ApplicationCommand.run(ApplicationCommand.java:150)
        at nz.org.riskscape.engine.cli.Main.runCommand(Main.java:244)
        at nz.org.riskscape.engine.cli.Main.runMain(Main.java:210)
        at nz.org.riskscape.engine.cli.Main.main(Main.java:95)
Caused by: nz.org.riskscape.engine.rl.EvalException: Problem(ERROR: NONE exception=java.lang.RuntimeException: Waiting thread received a null tile.children=[Problem(ERROR: CAUGHT_EXCEPTION['java.lang.RuntimeException: Waiting thread received a null tile.'] exception=java.lang.RuntimeException: Waiting thread received a null tile.)])
        at nz.org.riskscape.engine.rl.DefaultExpressionRealizer$Realized.evaluate(DefaultExpressionRealizer.java:134)
        at nz.org.riskscape.engine.rl.StructDeclarationRealizer.lambda$create$0(StructDeclarationRealizer.java:100)
        at nz.org.riskscape.engine.rl.DefaultExpressionRealizer$Realized.evaluate(DefaultExpressionRealizer.java:132)
        at nz.org.riskscape.engine.projection.SelectProjector.apply(SelectProjector.java:49)
        at nz.org.riskscape.engine.projection.SelectProjector.apply(SelectProjector.java:15)
        at nz.org.riskscape.engine.task.ChainTask.processPage(ChainTask.java:184)
        at nz.org.riskscape.engine.task.ChainTask.run(ChainTask.java:142)
        at nz.org.riskscape.engine.task.WorkerTask.runPublic(WorkerTask.java:97)
        at nz.org.riskscape.engine.sched.Worker.run(Worker.java:104)
        at java.lang.Thread.run(Thread.java:748)
03:56:53.510 [scheduler-thread] WARN  n.o.riskscape.engine.sched.Scheduler -   Blocked on full output:
Caused by: java.lang.RuntimeException: Waiting thread received a null tile.
        at com.sun.media.jai.util.SunTileScheduler.scheduleTile(SunTileScheduler.java:963)
        at javax.media.jai.OpImage.getTile(OpImage.java:1129)
        at javax.media.jai.RenderedOp.getTile(RenderedOp.java:2257)
        at org.geotools.coverage.grid.GridCoverage2D.evaluate(GridCoverage2D.java:504)
        at org.geotools.coverage.grid.GridCoverage2D.evaluate(GridCoverage2D.java:433)
        at nz.org.riskscape.engine.data.coverage.GridTypedCoverage.evaluate(GridTypedCoverage.java:71)
        at nz.org.riskscape.engine.function.geometry.SampleCoverageAtCentroid$1.call(SampleCoverageAtCentroid.java:38)
        at nz.org.riskscape.engine.CoercingFunctionWrapper.call(CoercingFunctionWrapper.java:74)
        at nz.org.riskscape.engine.rl.DefaultExpressionRealizer$Instance.lambda$realizeRaw$22(DefaultExpressionRealizer.java:621)
        at nz.org.riskscape.engine.rl.DefaultExpressionRealizer$Realized.evaluate(DefaultExpressionRealizer.java:132)
        ... 9 more

So the error is happening in a library that RiskScape uses to read the hazard-layer data (com.sun.media.jai, i.e. Java Advanced Imaging). This library code has an optimization for multi-threaded access, and it’s this code that’s throwing an exception.

I’m not sure exactly what the root cause of the problem is. It could be that there is some underlying problem with the hazard-layer data, and this is being masked by the multi-threaded access. For example, it looks like someone else hit a similar problem after zipping data on Windows and then unzipping it on Linux.

As a sanity-check, you could try running your model with a single CPU core, i.e. riskscape --pipeline-threads=1 .... This will completely take the multi-threaded access out of the equation. It looks like some errors in that JAI code will only be displayed to stderr, so keep an eye out for errors.
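Since some of those JAI errors may only go to stderr, one way to avoid losing them is to send stderr to its own file. This is a minimal sketch, assuming a POSIX shell; the helper name `run_capturing_stderr` and the log file name used below are made up for illustration and are not part of RiskScape.

```shell
# Hypothetical helper (not part of RiskScape): run a command and send
# everything it writes to stderr into a separate file, so library messages
# are not lost among the normal output.
run_capturing_stderr() {
    log_file="$1"   # first argument: where stderr should be written
    shift           # remaining arguments: the command to run
    "$@" 2> "$log_file"
}
```

You would call it as `run_capturing_stderr riskscape-stderr.log riskscape --pipeline-threads=1 ...`, with the rest of your usual arguments in place of the `...`, and then look through `riskscape-stderr.log` afterwards.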

Also note that Slurm will limit the CPU resources that RiskScape gets, but not the number of threads it tries to use. So I think RiskScape will still try to use 72 threads, even though you’ve told Slurm to only let it use 32 cores.
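To keep the two numbers in step, the thread count could be taken from the Slurm allocation rather than typed in by hand. This is a rough sketch, assuming your job requests its cores with `--cpus-per-task` (Slurm only sets `SLURM_CPUS_PER_TASK` in that case). The function names `pipeline_threads` and `run_riskscape` are made up for illustration.

```shell
# Derive the thread count from the Slurm allocation.
# Falls back to 1 when SLURM_CPUS_PER_TASK is not set (e.g. outside Slurm,
# or when the job did not request --cpus-per-task).
pipeline_threads() {
    printf '%s\n' "${SLURM_CPUS_PER_TASK:-1}"
}

# Hypothetical wrapper: prints the command that would be run.
# Remove the leading `echo` to actually execute it.
# "$@" is whatever you normally pass to riskscape.
run_riskscape() {
    echo riskscape --pipeline-threads="$(pipeline_threads)" "$@"
}
```

Inside a job submitted with `--cpus-per-task=32`, `run_riskscape` followed by your usual arguments would then print a command that uses `--pipeline-threads=32`.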

Alternatively, you could try sharing your project data with me privately, and I can see if I can reproduce the problem on my system.

Finally, I think the scheduler deadlock message here is a bit of a red herring now. I think the root problem is the ‘null tile’ issue, and the scheduler cleanup is perhaps just a little untidy (possibly due to the number of worker threads here).

Actually, another quick way to sanity-check the hazard-layer would be to run:

riskscape --pipeline-threads=1 pipeline evaluate "input('YOUR-HAZARD-DATA.tif')"

This will try to read each pixel in the GeoTIFF, turn it into a polygon, and write the result to a shapefile.