Is there an easy way I can limit the number of rows in my input dataset so that I can speed up debugging and testing?
A random sample of the input data can be selected using the random_uniform function, e.g.
input(bookmark('data'), name: 'data')
-> select({*, random_uniform(0,1) as randomFilter})
-> filter(randomFilter <= .2)
-> save('output.csv', format: 'csv')
will select ~20% of the input dataset (the exact rows kept will vary from run to run, since the filter is random).
That’s right. You could also make this a model parameter, so you can leave it in your pipeline and enable it when needed. For example, the pipeline below uses a proportion_input parameter, which should default to 1.0 (100%), but you could switch to including only 20% of the data with: riskscape model run foo -p proportion_input=0.2
input(bookmark('data'), name: 'data')
-> filter(random_uniform(0,1) <= $proportion_input)
-> save('output.csv', format: 'csv')
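So with the default of 1.0 the model runs over the full dataset as normal, and a debug run just overrides the parameter, e.g. (assuming the model is named foo as above):
riskscape model run foo
riskscape model run foo -p proportion_input=0.2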
The other way to do this is to use the input step's limit parameter, which will restrict the input to a specific number of rows. E.g. to only include 1000 rows of data you could use:
input(bookmark('data'), limit: 1000)
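In the context of the first example, a minimal sketch of a debug pipeline using limit (same bookmark and output as above, just without the random filter) would be:
input(bookmark('data'), name: 'data', limit: 1000)
-> save('output.csv', format: 'csv')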
Thanks Tim, very helpful to see how it can be done without adding another column.
Is it possible to use an integer limit within a pipeline? I have a few steps I need to run on the full dataset at the start.
Something like:
-> filter(1:nrow(data) <= 1000)
The only way to do that currently would be to limit the number of rows of input data, e.g.
input(bookmark('data'), name: 'data', limit: $num_rows)
The num_rows parameter could have a default value of maxint() (i.e. all the data), and then if you only wanted 1000 rows for debugging, you could use -p num_rows=1000.
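For what it's worth, the two approaches could also be combined in the one pipeline, e.g. a sketch that caps the rows read and then samples a proportion of them, reusing both parameters from above:
input(bookmark('data'), name: 'data', limit: $num_rows)
-> filter(random_uniform(0,1) <= $proportion_input)
-> save('output.csv', format: 'csv')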
Sweet as. In this case I am reading in a dataset that covers NZ and then clipping it down to the relevant area, so I can't limit the input to the first n rows as this might not include any assets from the area I'm looking at:
input(relation: 'buildings_geopackage', name: 'exposure') as exposures_input
-> select({*, intersects(exposure.geom, bounds(bookmark('hazard_clip', {location: $hazard_clip_location})))})
-> filter("intersects") as exposure_clipped
But the randomised filter works well in this case so I will continue to use that.
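E.g. a rough sketch of how I plan to slot it in after the clipping step, reusing the $proportion_input parameter idea from above (the parameter name is just illustrative):
input(relation: 'buildings_geopackage', name: 'exposure') as exposures_input
-> select({*, intersects(exposure.geom, bounds(bookmark('hazard_clip', {location: $hazard_clip_location})))})
-> filter("intersects") as exposure_clipped
-> filter(random_uniform(0,1) <= $proportion_input)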