Deadlock error when running a model using many CPU cores

JPowell · 7 September 2022 01:49

I am trying to run a model using 72 CPU cores and have hit a deadlock error:

00:54:54.094 [scheduler-thread] ERROR n.o.riskscape.engine.sched.Scheduler - Deadlock detected!  No worker threads are running and all tasks are blocked
00:54:54.100 [scheduler-thread] WARN  n.o.riskscape.engine.sched.Scheduler -   Blocked on full output:
00:54:54.101 [scheduler-thread] WARN  n.o.riskscape.engine.sched.Scheduler -     ChainTask(select_6:[select])
00:54:54.103 [scheduler-thread] WARN  n.o.riskscape.engine.sched.Scheduler -     ChainTask(select:[select], unnest:[unnest], select_2:[select], select_3:[select], select_4:[select], exposures:[select])
00:54:54.103 [scheduler-thread] WARN  n.o.riskscape.engine.sched.Scheduler -   Blocked on no input:
00:54:54.103 [scheduler-thread] WARN  n.o.riskscape.engine.sched.Scheduler -     ChainTask(select_7:[select], sampled:[select], select_9:[select], analysis:[select], select_12:[select], select_12-sink:[select]) - 22 / 70 tasks blocked
00:54:54.106 [scheduler-thread] ERROR n.o.riskscape.engine.sched.Scheduler - Unhandled exception in scheduler, exiting thread loop!
nz.org.riskscape.engine.RiskscapeException: Problem(ERROR: DEADLOCK)
        at nz.org.riskscape.engine.sched.Scheduler.checkForDeadlock(Scheduler.java:196)
        at nz.org.riskscape.engine.sched.Scheduler.runOnce(Scheduler.java:222)
        at nz.org.riskscape.engine.sched.Scheduler.run(Scheduler.java:275)
        at java.lang.Thread.run(Thread.java:748)
Problems found with pipeline model
  - Execution of your data processing pipeline failed. The reasons for this follow:
    - Failed to evaluate `{*, sample_centroid(geometry: exposure, coverage: event.coverage) as hazard}`
      - java.lang.RuntimeException: Waiting thread received a null tile.

This model runs fine, albeit very slowly, on a 4-core machine.

I have also run the “getting-started” model with this many cores, and that worked fine.

timbeale · 7 September 2022 01:54

Could you confirm you’re using RiskScape v1.2.0? We have fixed a scheduler deadlock bug recently (GL816), but it is not in a formal RiskScape release yet. I’ll see if I can get you a build to try out.

JPowell · 7 September 2022 02:07

I'm using v1.2.0, but from a July build:

RiskScape Core Engine v1.2.0
----------------------------
Build time - Fri Jul 29 12:27:23 UTC 2022
Git SHA1   - f406ef1705a1e287867b6ced8325f5d801d42b0b

Plugins:
defaults     1.2.0  nz.org.riskscape.engine.defaults.Plugin
postgis      1.2.0  nz.org.riskscape.postgis.Plugin
jython       1.2.0  nz.org.riskscape.jython.Plugin
wizard       1.2.0  nz.org.riskscape.wizard.WizardPlugin
cpython      1.2.0  nz.org.riskscape.cpython.CPythonPlugin
wizard-cli   1.2.0  nz.org.riskscape.wizard.WizardCliPlugin

System:
Linux 3.10.0-693.2.2.el7.x86_64
Java 1.8.0_144 OpenJDK 64-Bit Server VM 25.144-b01

JPowell · 7 September 2022 22:01

I have updated to the dev version

RiskScape Core Engine v1.3.0-dev
--------------------------------
Build time - Wed Sep 07 11:32:58 UTC 2022
Git SHA1   - 47bcb9a1417e934eb95860335ca4730c404a3451

Plugins:
defaults     1.3.0-dev  nz.org.riskscape.engine.defaults.Plugin
postgis      1.3.0-dev  nz.org.riskscape.postgis.Plugin
jython       1.3.0-dev  nz.org.riskscape.jython.Plugin
wizard       1.3.0-dev  nz.org.riskscape.wizard.WizardPlugin
cpython      1.3.0-dev  nz.org.riskscape.cpython.CPythonPlugin
wizard-cli   1.3.0-dev  nz.org.riskscape.wizard.WizardCliPlugin

System:
Linux 3.10.0-693.2.2.el7.x86_64
Java 1.8.0_144 OpenJDK 64-Bit Server VM 25.144-b01

But I am still getting what looks like the same issue.

Problems found with pipeline model
  - Execution of your data processing pipeline failed. The reasons for this follow:
    - Failed to evaluate `{*, sample_centroid(geometry: exposure, coverage: event.coverage) as hazard}`
      - java.lang.RuntimeException: Waiting thread received a null tile.

timbeale · 7 September 2022 23:40

That looks like a slightly different problem. I’ll raise a new bug for it.

You could try using the riskscape --pipeline-threads CLI option to reduce the number of CPU cores being used, to see if that helps at all, e.g. try it with 32 cores.

JPowell · 8 September 2022 21:38

I ran it again, using Slurm to limit the process to 32 cores:

03:56:53.504 [scheduler-thread] ERROR n.o.riskscape.engine.sched.Scheduler - Deadlock detected!  No worker threads are running and all tasks are blocked
Problems found with pipeline model
  - Execution of your data processing pipeline failed. The reasons for this follow:
    - Failed to evaluate `{*, sample_centroid(geometry: exposure, coverage: event.coverage) as hazard}`
      - java.lang.RuntimeException: Waiting thread received a null tile.
Exit exception thrown from nz.org.riskscape.engine.cli.model.RunCommand.doCommand(RunCommand.java:85)
Caused by:
nz.org.riskscape.engine.cli.ExitException: Problem(ERROR: EXECUTION_FAILEDchildren=[Problem(ERROR: NONE exception=nz.org.riskscape.engine.rl.EvalException: Problem(ERROR: NONE exception=java.lang.RuntimeException: Waiting thread received a null tile.children=[Problem(ERROR: CAUGHT_EXCEPTION['java.lang.RuntimeException: Waiting thread received a null tile.'] exception=java.lang.RuntimeException: Waiting thread received a null tile.)])children=[Problem(ERROR: CAUGHT_EXCEPTION['java.lang.RuntimeException: Waiting thread received a null tile.'] exception=java.lang.RuntimeException: Waiting thread received a null tile.)])])
        at nz.org.riskscape.engine.cli.pipeline.CliPipelineRunner.run(CliPipelineRunner.java:156)
        at nz.org.riskscape.engine.cli.model.RunCommand.doCommand(RunCommand.java:83)
        at nz.org.riskscape.engine.cli.ApplicationCommand.run(ApplicationCommand.java:150)
        at nz.org.riskscape.engine.cli.Main.runCommand(Main.java:244)
        at nz.org.riskscape.engine.cli.Main.runMain(Main.java:210)
        at nz.org.riskscape.engine.cli.Main.main(Main.java:95)
Caused by: nz.org.riskscape.engine.rl.EvalException: Problem(ERROR: NONE exception=java.lang.RuntimeException: Waiting thread received a null tile.children=[Problem(ERROR: CAUGHT_EXCEPTION['java.lang.RuntimeException: Waiting thread received a null tile.'] exception=java.lang.RuntimeException: Waiting thread received a null tile.)])
        at nz.org.riskscape.engine.rl.DefaultExpressionRealizer$Realized.evaluate(DefaultExpressionRealizer.java:134)
        at nz.org.riskscape.engine.rl.StructDeclarationRealizer.lambda$create$0(StructDeclarationRealizer.java:100)
        at nz.org.riskscape.engine.rl.DefaultExpressionRealizer$Realized.evaluate(DefaultExpressionRealizer.java:132)
        at nz.org.riskscape.engine.projection.SelectProjector.apply(SelectProjector.java:49)
        at nz.org.riskscape.engine.projection.SelectProjector.apply(SelectProjector.java:15)
        at nz.org.riskscape.engine.task.ChainTask.processPage(ChainTask.java:184)
        at nz.org.riskscape.engine.task.ChainTask.run(ChainTask.java:142)
        at nz.org.riskscape.engine.task.WorkerTask.runPublic(WorkerTask.java:97)
        at nz.org.riskscape.engine.sched.Worker.run(Worker.java:104)
        at java.lang.Thread.run(Thread.java:748)
03:56:53.510 [scheduler-thread] WARN  n.o.riskscape.engine.sched.Scheduler -   Blocked on full output:
Caused by: java.lang.RuntimeException: Waiting thread received a null tile.
        at com.sun.media.jai.util.SunTileScheduler.scheduleTile(SunTileScheduler.java:963)
        at javax.media.jai.OpImage.getTile(OpImage.java:1129)
        at javax.media.jai.RenderedOp.getTile(RenderedOp.java:2257)
        at org.geotools.coverage.grid.GridCoverage2D.evaluate(GridCoverage2D.java:504)
        at org.geotools.coverage.grid.GridCoverage2D.evaluate(GridCoverage2D.java:433)
        at nz.org.riskscape.engine.data.coverage.GridTypedCoverage.evaluate(GridTypedCoverage.java:71)
        at nz.org.riskscape.engine.function.geometry.SampleCoverageAtCentroid$1.call(SampleCoverageAtCentroid.java:38)
        at nz.org.riskscape.engine.CoercingFunctionWrapper.call(CoercingFunctionWrapper.java:74)
        at nz.org.riskscape.engine.rl.DefaultExpressionRealizer$Instance.lambda$realizeRaw$22(DefaultExpressionRealizer.java:621)
        at nz.org.riskscape.engine.rl.DefaultExpressionRealizer$Realized.evaluate(DefaultExpressionRealizer.java:132)
        ... 9 more

timbeale · 9 September 2022 00:37

So the error is happening in a library that RiskScape uses to read the hazard-layer data (com.sun.media.jai). This library code has an optimization for multi-threaded access, and it’s this code that’s throwing the exception.

I’m not sure exactly what the root cause of the problem is. It could be that there is some underlying problem with the hazard-layer data, and this is being masked by the multi-threaded access. For example, it looks like someone else hit a similar problem after zipping data on Windows and then unzipping it on Linux.

As a sanity-check, you could try running your model with a single CPU core, i.e. riskscape --pipeline-threads=1 .... This will completely take the multi-threaded access out of the equation. It looks like some errors in that JAI code will only be displayed to stderr, so keep an eye out for errors.
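One way to keep an eye on stderr during that single-threaded check is to send it to its own file and look at it after the run. This is only a minimal sketch, assuming a POSIX-style shell: the helper name `run_with_stderr_log` and the log file name are made up for illustration, and RiskScape's own log messages may well end up in the same file alongside anything the JAI code prints.

```shell
# Run a command, keep its stderr in a separate log file, and report
# whether anything was written there. The helper name and the log file
# name are hypothetical, not part of RiskScape.
run_with_stderr_log() {
    log_file="$1"
    shift
    "$@" 2> "$log_file"
    rc=$?
    if [ -s "$log_file" ]; then
        echo "stderr output captured in $log_file"
    fi
    return "$rc"
}

# Intended use (arguments elided, as in the command above):
#   run_with_stderr_log jai-stderr.log riskscape --pipeline-threads=1 ...
```

The helper passes the command's exit status through, so a failed run still shows up as a failure in a job script.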

Also note that Slurm will limit the CPU resources that RiskScape gets, but not the number of threads it tries to use. So I think RiskScape will still try to use 72 threads, even though you’ve told Slurm to only let it use 32 cores.
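To keep the two limits in step, the thread count can be taken from the Slurm allocation and passed to `--pipeline-threads` explicitly. A rough sketch, assuming the job was submitted with `--cpus-per-task` (Slurm only sets `SLURM_CPUS_PER_TASK` in that case); the function name `pipeline_threads` and the default cap of 32 are just for illustration.

```shell
# Pick a --pipeline-threads value from the Slurm allocation.
# SLURM_CPUS_PER_TASK is only set when the job requests --cpus-per-task,
# so fall back to a single thread otherwise. The optional cap argument
# is a hypothetical knob, not a RiskScape or Slurm setting.
pipeline_threads() {
    cap="${1:-32}"
    n="${SLURM_CPUS_PER_TASK:-1}"
    if [ "$n" -gt "$cap" ]; then
        n="$cap"
    fi
    echo "$n"
}

# Intended use inside the job script:
#   riskscape --pipeline-threads="$(pipeline_threads 32)" model run FloodExposure
```

With this in place, a job that asks Slurm for 32 CPUs would also tell RiskScape to use 32 threads, rather than leaving RiskScape to choose its own thread count.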

Alternatively, you could try sharing your project data with me privately, and I can see if I can reproduce the problem on my system.

Finally, I think the scheduler deadlock message here is a bit of a red herring now. I think the root problem is the ‘null tile’ issue, and the scheduler cleanup is perhaps just a little untidy (possibly due to the number of worker threads here).

timbeale · 9 September 2022 01:23

Actually, another quick way to sanity-check the hazard-layer would be to run:

riskscape --pipeline-threads=1 pipeline evaluate "input('YOUR-HAZARD-DATA.tif')"

This will try to read each pixel in the GeoTIFF, turn it into a polygon, and write the result to a shapefile.

JPowell · 18 January 2023 21:41

I was able to run the model to completion using 1, 4 or 16 cores with:

region="STHL"
/nesi/project/niwa03670/riskscape/bin/riskscape --pipeline-threads=16 model run FloodExposure --progress-indicator=progress_$region.txt --output output/$region -p "region_filter = 'region = \'$region\''"

However, when I change --pipeline-threads to 64, I get the null tile error again:

07:19:57.149 [scheduler-thread] ERROR n.o.riskscape.engine.sched.Scheduler - Deadlock detected!  No worker threads are running and all tasks are blocked
07:19:57.158 [scheduler-thread] WARN  n.o.riskscape.engine.sched.Scheduler -   Blocked on no input:
07:19:57.159 [scheduler-thread] WARN  n.o.riskscape.engine.sched.Scheduler -     ChainTask(select_7:[select], sampled:[select], select_9:[select], analysis:[select], select_12:[select], select_12-sink:[select]) - 49 / 64 tasks blocked
07:19:57.160 [scheduler-thread] WARN  n.o.riskscape.engine.sched.Scheduler -     ChainTask(select_6:[select]) - 14 / 64 tasks blocked
07:19:57.163 [scheduler-thread] ERROR n.o.riskscape.engine.sched.Scheduler - Unhandled exception in scheduler, exiting thread loop!
nz.org.riskscape.engine.RiskscapeException: Problem(ERROR: DEADLOCK)
        at nz.org.riskscape.engine.sched.Scheduler.checkForDeadlock(Scheduler.java:196)
        at nz.org.riskscape.engine.sched.Scheduler.runOnce(Scheduler.java:222)
        at nz.org.riskscape.engine.sched.Scheduler.run(Scheduler.java:275)
        at java.lang.Thread.run(Thread.java:748)
Problems found with pipeline model
  - Execution of your data processing pipeline failed. The reasons for this follow:
    - Failed to evaluate `{*, sample_centroid(geometry: exposure, coverage: event.coverage) as hazard}`
      - java.lang.RuntimeException: Waiting thread received a null tile.
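For anyone trying to narrow down the thread count at which this starts failing, the same run can be repeated over a list of values. This is a sketch only: `sweep_threads` and `run_model` are hypothetical names, and `run_model` stands in for the full riskscape command above with the thread count swapped in.

```shell
# Run the same command at several thread counts and report which succeed.
# The first argument is the name of a function or command that takes the
# thread count as its only argument; the rest are the counts to try.
# Output of each run is kept in run_<count>.log in the current directory.
sweep_threads() {
    runner="$1"
    shift
    for n in "$@"; do
        if "$runner" "$n" > "run_$n.log" 2>&1; then
            echo "$n threads: ok"
        else
            echo "$n threads: FAILED (see run_$n.log)"
        fi
    done
}

# Intended use, wrapping the riskscape command from above
# (other options omitted here for brevity):
#   run_model() {
#       riskscape --pipeline-threads="$1" model run FloodExposure
#   }
#   sweep_threads run_model 1 4 16 64
```

This relies on riskscape returning a non-zero exit status when the pipeline fails, which is an assumption worth checking against a known-bad run first.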

JPowell · 26 February 2023 00:34

This has been fixed in a 1.4 development build:

RiskScape Core Engine v1.4.0-dev
--------------------------------
Build time - Fri Jan 27 14:42:44 UTC 2023
Git SHA1   - 66990971c415920eca26814e6bb7d59eb32c0c10

The model now runs fine with 72 pipeline threads.
