I am trying to run a model using 72 CPU cores and have come across a deadlock error:
00:54:54.094 [scheduler-thread] ERROR n.o.riskscape.engine.sched.Scheduler - Deadlock detected! No worker threads are running and all tasks are blocked
00:54:54.100 [scheduler-thread] WARN n.o.riskscape.engine.sched.Scheduler - Blocked on full output:
00:54:54.101 [scheduler-thread] WARN n.o.riskscape.engine.sched.Scheduler - ChainTask(select_6:[select])
00:54:54.103 [scheduler-thread] WARN n.o.riskscape.engine.sched.Scheduler - ChainTask(select:[select], unnest:[unnest], select_2:[select], select_3:[select], select_4:[select], exposures:[select])
00:54:54.103 [scheduler-thread] WARN n.o.riskscape.engine.sched.Scheduler - Blocked on no input:
00:54:54.103 [scheduler-thread] WARN n.o.riskscape.engine.sched.Scheduler - ChainTask(select_7:[select], sampled:[select], select_9:[select], analysis:[select], select_12:[select], select_12-sink:[select]) - 22 / 70 tasks blocked
00:54:54.106 [scheduler-thread] ERROR n.o.riskscape.engine.sched.Scheduler - Unhandled exception in scheduler, exiting thread loop!
nz.org.riskscape.engine.RiskscapeException: Problem(ERROR: DEADLOCK)
at nz.org.riskscape.engine.sched.Scheduler.checkForDeadlock(Scheduler.java:196)
at nz.org.riskscape.engine.sched.Scheduler.runOnce(Scheduler.java:222)
at nz.org.riskscape.engine.sched.Scheduler.run(Scheduler.java:275)
at java.lang.Thread.run(Thread.java:748)
Problems found with pipeline model
- Execution of your data processing pipeline failed. The reasons for this follow:
- Failed to evaluate `{*, sample_centroid(geometry: exposure, coverage: event.coverage) as hazard}`
- java.lang.RuntimeException: Waiting thread received a null tile.
This model runs fine, albeit very slowly, on a 4 core machine.
I have also run the “getting-started” model using this many cores, and that worked fine.
Could you confirm you’re using RiskScape v1.2.0? We have fixed a scheduler deadlock bug recently (GL816), but it is not in a formal RiskScape release yet. I’ll see if I can get you a build to try out.
RiskScape Core Engine v1.3.0-dev
--------------------------------
Build time - Wed Sep 07 11:32:58 UTC 2022
Git SHA1 - 47bcb9a1417e934eb95860335ca4730c404a3451
Plugins:
defaults 1.3.0-dev nz.org.riskscape.engine.defaults.Plugin
postgis 1.3.0-dev nz.org.riskscape.postgis.Plugin
jython 1.3.0-dev nz.org.riskscape.jython.Plugin
wizard 1.3.0-dev nz.org.riskscape.wizard.WizardPlugin
cpython 1.3.0-dev nz.org.riskscape.cpython.CPythonPlugin
wizard-cli 1.3.0-dev nz.org.riskscape.wizard.WizardCliPlugin
System:
Linux 3.10.0-693.2.2.el7.x86_64
Java 1.8.0_144 OpenJDK 64-Bit Server VM 25.144-b01
But I am still getting what looks like the same issue.
Problems found with pipeline model
- Execution of your data processing pipeline failed. The reasons for this follow:
- Failed to evaluate `{*, sample_centroid(geometry: exposure, coverage: event.coverage) as hazard}`
- java.lang.RuntimeException: Waiting thread received a null tile.
That looks like a slightly different problem. I’ll raise a new bug for it.
You could try using the riskscape --pipeline-threads CLI option to reduce the number of CPU cores being used, to see if that helps at all, e.g. try it with 32 cores.
I ran it again, using Slurm to limit the process to 32 cores:
03:56:53.504 [scheduler-thread] ERROR n.o.riskscape.engine.sched.Scheduler - Deadlock detected! No worker threads are running and all tasks are blocked
Problems found with pipeline model
- Execution of your data processing pipeline failed. The reasons for this follow:
- Failed to evaluate `{*, sample_centroid(geometry: exposure, coverage: event.coverage) as hazard}`
- java.lang.RuntimeException: Waiting thread received a null tile.
Exit exception thrown from nz.org.riskscape.engine.cli.model.RunCommand.doCommand(RunCommand.java:85)
Caused by:
nz.org.riskscape.engine.cli.ExitException: Problem(ERROR: EXECUTION_FAILEDchildren=[Problem(ERROR: NONE exception=nz.org.riskscape.engine.rl.EvalException: Problem(ERROR: NONE exception=java.lang.RuntimeException: Waiting thread received a null tile.children=[Problem(ERROR: CAUGHT_EXCEPTION['java.lang.RuntimeException: Waiting thread received a null tile.'] exception=java.lang.RuntimeException: Waiting thread received a null tile.)])children=[Problem(ERROR: CAUGHT_EXCEPTION['java.lang.RuntimeException: Waiting thread received a null tile.'] exception=java.lang.RuntimeException: Waiting thread received a null tile.)])])
at nz.org.riskscape.engine.cli.pipeline.CliPipelineRunner.run(CliPipelineRunner.java:156)
at nz.org.riskscape.engine.cli.model.RunCommand.doCommand(RunCommand.java:83)
at nz.org.riskscape.engine.cli.ApplicationCommand.run(ApplicationCommand.java:150)
at nz.org.riskscape.engine.cli.Main.runCommand(Main.java:244)
at nz.org.riskscape.engine.cli.Main.runMain(Main.java:210)
at nz.org.riskscape.engine.cli.Main.main(Main.java:95)
Caused by: nz.org.riskscape.engine.rl.EvalException: Problem(ERROR: NONE exception=java.lang.RuntimeException: Waiting thread received a null tile.children=[Problem(ERROR: CAUGHT_EXCEPTION['java.lang.RuntimeException: Waiting thread received a null tile.'] exception=java.lang.RuntimeException: Waiting thread received a null tile.)])
at nz.org.riskscape.engine.rl.DefaultExpressionRealizer$Realized.evaluate(DefaultExpressionRealizer.java:134)
at nz.org.riskscape.engine.rl.StructDeclarationRealizer.lambda$create$0(StructDeclarationRealizer.java:100)
at nz.org.riskscape.engine.rl.DefaultExpressionRealizer$Realized.evaluate(DefaultExpressionRealizer.java:132)
at nz.org.riskscape.engine.projection.SelectProjector.apply(SelectProjector.java:49)
at nz.org.riskscape.engine.projection.SelectProjector.apply(SelectProjector.java:15)
at nz.org.riskscape.engine.task.ChainTask.processPage(ChainTask.java:184)
at nz.org.riskscape.engine.task.ChainTask.run(ChainTask.java:142)
at nz.org.riskscape.engine.task.WorkerTask.runPublic(WorkerTask.java:97)
at nz.org.riskscape.engine.sched.Worker.run(Worker.java:104)
at java.lang.Thread.run(Thread.java:748)
03:56:53.510 [scheduler-thread] WARN n.o.riskscape.engine.sched.Scheduler - Blocked on full output:
Caused by: java.lang.RuntimeException: Waiting thread received a null tile.
at com.sun.media.jai.util.SunTileScheduler.scheduleTile(SunTileScheduler.java:963)
at javax.media.jai.OpImage.getTile(OpImage.java:1129)
at javax.media.jai.RenderedOp.getTile(RenderedOp.java:2257)
at org.geotools.coverage.grid.GridCoverage2D.evaluate(GridCoverage2D.java:504)
at org.geotools.coverage.grid.GridCoverage2D.evaluate(GridCoverage2D.java:433)
at nz.org.riskscape.engine.data.coverage.GridTypedCoverage.evaluate(GridTypedCoverage.java:71)
at nz.org.riskscape.engine.function.geometry.SampleCoverageAtCentroid$1.call(SampleCoverageAtCentroid.java:38)
at nz.org.riskscape.engine.CoercingFunctionWrapper.call(CoercingFunctionWrapper.java:74)
at nz.org.riskscape.engine.rl.DefaultExpressionRealizer$Instance.lambda$realizeRaw$22(DefaultExpressionRealizer.java:621)
at nz.org.riskscape.engine.rl.DefaultExpressionRealizer$Realized.evaluate(DefaultExpressionRealizer.java:132)
... 9 more
So the error is happening in a library that RiskScape uses to read the hazard-layer data (com.sun.media.jai, i.e. Java Advanced Imaging, as shown in the stack trace). This library code has an optimization for multi-threaded access, and it’s this code that’s throwing the exception.
I’m not sure exactly what the root cause of the problem is. It could be that there is some underlying problem with the hazard-layer data, and this is being masked by the multi-threaded access. For example, it looked like someone hit a similar problem due to zipping data on Windows and then unzipping it on Linux:
As a sanity check, you could try running your model with a single CPU core, i.e. riskscape --pipeline-threads=1 .... This takes the multi-threaded access out of the equation completely. It looks like some errors in that JAI code are only written to stderr, so keep an eye out for those.
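As a rough sketch of that sanity check, the wrapper below runs a command while keeping a copy of its stderr in a file, so any JAI messages can be searched afterwards. The helper name run_keep_stderr and the log file names are made up for illustration, and the RiskScape invocation in the comment uses a placeholder model name.

```shell
# Hypothetical helper: run a command, save its stderr to a file,
# then replay the saved stderr to the terminal so nothing is hidden.
run_keep_stderr() {
  log="$1"
  shift
  "$@" 2> "$log"
  status=$?
  cat "$log" >&2
  return "$status"
}

# Intended use (placeholder model name, not executed here):
#   run_keep_stderr jai-stderr.log riskscape --pipeline-threads=1 model run <your-model>
#   grep -i -E 'error|exception' jai-stderr.log

# Harmless demonstration with a stand-in command that writes to stderr and fails.
run_keep_stderr demo-stderr.log sh -c 'echo "Error: simulated tile failure" >&2; exit 3' \
  || echo "exit status: $?"

# Count the lines in the saved log that mention an error or exception.
grep -c -i -E 'error|exception' demo-stderr.log
```

The helper passes the command's exit status through, so a failed run is still visible to whatever script calls it.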
Also note that Slurm will limit the CPU resources that RiskScape gets, but not the number of threads it tries to use. So I think RiskScape will still try to use 72 threads, even though you’ve told Slurm to only let it use 32 cores.
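One way to keep the two numbers in step is to derive the thread count from the Slurm allocation and pass it to --pipeline-threads explicitly. This is only a sketch: it assumes the job was submitted with --cpus-per-task (Slurm sets SLURM_CPUS_PER_TASK only in that case), the helper name pipeline_threads is made up, and the command is printed rather than run.

```shell
# Print the thread count to pass to --pipeline-threads:
# the Slurm per-task CPU allocation if it is set, otherwise 1.
pipeline_threads() {
  echo "${SLURM_CPUS_PER_TASK:-1}"
}

threads="$(pipeline_threads)"

# Print the command instead of running it, so the sketch is safe to try anywhere.
echo "riskscape --pipeline-threads=${threads} model run FloodExposure"
```

Falling back to 1 rather than to the machine's core count is a deliberately cautious choice for a shared node; adjust it to taste.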
Alternatively, you could try sharing your project data with me privately, and I can see if I can reproduce the problem on my system.
Finally, I think the scheduler deadlock message here is a bit of a red herring now. I think the root problem is the ‘null tile’ issue, and the scheduler cleanup is perhaps just a little untidy (possibly due to the number of worker threads here).
I was able to run the model to completion using 1, 4 or 16 cores with:
region="STHL"
/nesi/project/niwa03670/riskscape/bin/riskscape --pipeline-threads=16 model run FloodExposure --progress-indicator=progress_$region.txt --output output/$region -p "region_filter = 'region = \'$region\''"
But when I change --pipeline-threads to 64, I get the null tile error again:
07:19:57.149 [scheduler-thread] ERROR n.o.riskscape.engine.sched.Scheduler - Deadlock detected! No worker threads are running and all tasks are blocked
07:19:57.158 [scheduler-thread] WARN n.o.riskscape.engine.sched.Scheduler - Blocked on no input:
07:19:57.159 [scheduler-thread] WARN n.o.riskscape.engine.sched.Scheduler - ChainTask(select_7:[select], sampled:[select], select_9:[select], analysis:[select], select_12:[select], select_12-sink:[select]) - 49 / 64 tasks blocked
07:19:57.160 [scheduler-thread] WARN n.o.riskscape.engine.sched.Scheduler - ChainTask(select_6:[select]) - 14 / 64 tasks blocked
07:19:57.163 [scheduler-thread] ERROR n.o.riskscape.engine.sched.Scheduler - Unhandled exception in scheduler, exiting thread loop!
nz.org.riskscape.engine.RiskscapeException: Problem(ERROR: DEADLOCK)
at nz.org.riskscape.engine.sched.Scheduler.checkForDeadlock(Scheduler.java:196)
at nz.org.riskscape.engine.sched.Scheduler.runOnce(Scheduler.java:222)
at nz.org.riskscape.engine.sched.Scheduler.run(Scheduler.java:275)
at java.lang.Thread.run(Thread.java:748)
Problems found with pipeline model
- Execution of your data processing pipeline failed. The reasons for this follow:
- Failed to evaluate `{*, sample_centroid(geometry: exposure, coverage: event.coverage) as hazard}`
- java.lang.RuntimeException: Waiting thread received a null tile.
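To narrow down where between 16 and 64 threads the failure starts, the thread counts could be stepped through in a loop. The sketch below is an illustration only: first_failing_threads and fake_run are hypothetical names, and fake_run is a stand-in that pretends anything above 32 threads fails (that threshold is invented, not a measured result).

```shell
# Hypothetical helper: try each thread count in turn and print the first one that fails.
# "$1" is the name of a command or function that takes the thread count as its only argument.
first_failing_threads() {
  run="$1"
  shift
  for n in "$@"; do
    if ! "$run" "$n"; then
      echo "$n"
      return 0
    fi
  done
  echo "none"
}

# Stand-in for the real run. To use it for real, replace the body with the
# riskscape command from this thread, passing --pipeline-threads="$1".
# Here it simply pretends that anything above 32 threads fails.
fake_run() {
  [ "$1" -le 32 ]
}

first_failing_threads fake_run 16 24 32 48 64
```

Since the null tile error may be timing-dependent, a single pass is unlikely to be conclusive; repeating each thread count a few times would give a more reliable picture.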