Limit number of rows in input

Is there an easy way I can limit the number of rows in my input dataset so that I can speed up debugging and testing?

You can take a random sample of the input data using the random_uniform function, e.g.

input(bookmark('data'), name: 'data')
-> select({
     *,
     random_uniform(0,1) as randomFilter
   })
-> filter(randomFilter <= .2)
-> save('output.csv', format: 'csv')

This will select ~20% of the input dataset.
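
As a side note on why this is "~20%" rather than exactly 20%: each row is kept or dropped independently, so the sample size varies a little from run to run. Here is a minimal Python sketch of the same idea. It is not RiskScape code, and the sample_rows name and the seed are only for illustration:

```python
import random

def sample_rows(rows, proportion, seed=None):
    """Keep each row independently with probability `proportion`."""
    rng = random.Random(seed)
    # Mirrors the random_uniform(0,1) <= proportion filter: one draw per row.
    return [row for row in rows if rng.random() <= proportion]

rows = list(range(10_000))
sample = sample_rows(rows, 0.2, seed=42)
print(len(sample))  # close to 2000, but usually not exactly 2000
```

With a proportion of 1.0 every row passes the filter, which is why 1.0 works as an "off" setting.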

That’s right. You could also make this a model parameter, so you can leave it in your pipeline and enable it when needed. For example, this uses a proportion_input parameter, which should default to 1.0 (100%), but you could switch to including only 20% of the data with: riskscape model run foo -p proportion_input=0.2

input(bookmark('data'), name: 'data')
-> filter(random_uniform(0,1) <= $proportion_input)
-> save('output.csv', format: 'csv')

Alternatively, the input step has a limit parameter that restricts the input to a specific number of rows. E.g. to only include 1000 rows of data you could use:

input(bookmark('data'), limit: 1000)

Thanks Tim, very helpful to see how it can be done without adding another column.

Is it possible to use an integer limit within a pipeline? I have a few steps I need to run on the full dataset at the start.
Something like:

-> filter(1:nrow(data) <= 1000)

The only way to do that currently would be to limit the number of rows of input data, e.g.

input(bookmark('data'), name: 'data', limit: $num_rows)

The num_rows parameter could have a default value of maxint() (i.e. all the data), and then if you only wanted 1000 rows for debugging, you could use -p num_rows=1000.

Sweet as. In this case I am reading in a dataset that covers NZ and then clipping it down to the relevant area, so I can't limit the input to the first n rows as this might not include any assets from the area I'm looking at.

input(relation: 'buildings_geopackage', name: 'exposure') as exposures_input
-> select({*, intersects(exposure.geom, bounds(bookmark('hazard_clip', {location: $hazard_clip_location})))})
-> filter("intersects") as exposure_clipped

But the randomised filter works well in this case so I will continue to use that.
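
For anyone else weighing up the two approaches, here is a rough Python sketch (not RiskScape code) of why a first-n limit and a random filter behave differently when a clip step follows. The dataset, the region labels and the clip function are all made up for illustration:

```python
import random

# Hypothetical national dataset: 100,000 assets, ordered so that the area
# of interest ("target") only appears near the end of the file.
regions = ["other"] * 95_000 + ["target"] * 5_000

def clip(rows):
    """Stand-in for the intersects/filter clipping steps."""
    return [r for r in rows if r == "target"]

# A row limit on the input is applied before the clip ever runs.
first_n = clip(regions[:1000])

# A random filter keeps a proportion of rows from across the whole dataset.
rng = random.Random(0)
sampled = clip([r for r in regions if rng.random() <= 0.2])

print(len(first_n))  # 0: none of the first 1000 rows fall in the target area
print(len(sampled))  # roughly 20% of the 5,000 target rows
```

The trade-off is that the random filter still has to read every input row, whereas the limit stops reading early.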