Insert output file into the next step of pipeline

How can I use a file produced in an earlier step of the pipeline further down, without manually adding it myself? Ideally, I would be able to run the pipeline and the output files would flow down to the next step, be altered, and then move down to the step after that.
My steps are roughly as follows:

  1. [spatial-join] of building data to area data
  2. data is edited, i.e. (floor x area)
  3. (floor x area) is grouped via polygon
    3.1. A CSV file is joined back into the pipeline - this allows the total polygon area to be added to each property within the polygon. (I also wonder if there is a way to do this without manually making a bookmark and setting the attribute as a float so that it doesn’t come through as text.)
  4. (area x floor) / total polygon (area x floor) = division factor
  5. the variables of interest * division factor are grouped by polygon to give a new value
  6. Input the hazard and the multiple scenarios

Any advice on stepping away from manual handling is appreciated. At the moment all of these steps are separate chunks of pipeline, and I am muting the unnecessary bits as I run it bit by bit. Ideally, these would all be steps within the same pipeline, if that is possible.

Hi Abby,

In general, you should be able to keep chaining processing steps onto the end of your pipeline using the -> operator. If you want to save the output at intermediate steps as well, you can do so by naming the step and using the save() pipeline step, e.g.

select({*}) as first_step
# output goes into the next pipeline step
-> select({*}) as next_step
# in a separate pipeline branch, we can also save the same output from the first step to file
first_step -> save('intermediary-results')

However, what you’re trying to do here is slightly trickier. Here’s an example pipeline that I think does roughly what you’re after. It should run against the RiskScape getting-started data.

input('Buildings_SE_Upolu.shp', name: 'exposure')
 ->
# join buildings to region
select({ *, sample_one(exposure, to_coverage(bookmark('Samoa_constituencies.shp'))) as region })
 ->
# aggregate total building area by region
group(by: region,
      select: {
          region.Region,
          sum(exposure.area) as total_area
      })
# join the results back to the exposure-layer
-> join_total_area.rhs

# next 2 pipeline steps are duplicated to join the buildings to region again
input('Buildings_SE_Upolu.shp', name: 'exposure')
 ->
select({ *, sample_one(exposure, to_coverage(bookmark('Samoa_constituencies.shp'))) as region })
 ->
# join the buildings to the total_area by region and calculate a division factor 
join(on: region.Region = Region) as join_total_area
 ->
select({ *, exposure.area / total_area as division_factor })
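
From there, steps 5 and 6 should just be more links in the same chain. As a rough sketch (value_of_interest is a placeholder here for whichever exposure attribute you want to apportion), the pipeline above could carry on with something like:

 ->
# apportion the variable of interest using the division factor
select({ *, exposure.value_of_interest * division_factor as scaled_value })
 ->
# re-aggregate the scaled values by polygon (your step 5)
group(by: region,
      select: {
          region.Region,
          sum(scaled_value) as new_value
      })
 ->
save('scaled-by-region')

Sampling your hazard layer for step 6 could then follow the same pattern as the region join, i.e. sample_one() against a to_coverage(bookmark(...)) of the hazard data.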

I’ve duplicated the steps that match the building data to the regions here, because it gets tricky joining data back to itself (at least when the group step uses a by parameter). Without the duplicated steps, the pipeline mechanics can unfortunately result in deadlock.

The other thing to note is that you can sometimes move some of the pipeline processing into the bookmark, which might help simplify things. E.g. step 2 (data is edited) could potentially be done using set-attribute in the bookmark:

[bookmark buildings]
location = buildings.shp
set-attribute.tot_area = area * floor_levels
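
On your side question about the CSV attribute coming through as text: the bookmark is also the place to sort that out. As a sketch (the file and bookmark names here are placeholders, and assuming your RiskScape version has a float() casting function - worth double-checking against the function reference):

[bookmark polygon_totals]
location = polygon-totals.csv
set-attribute.total_area = float(total_area)

That said, with the example pipeline above you shouldn’t need the CSV round-trip at all, since the per-polygon total gets computed within the pipeline itself.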

Hope that helps.

Cheers,
Tim