Is there an easy way I can limit the number of rows in my input dataset so that I can speed up debugging and testing?
A random sample of the input data can be selected using the random_uniform function, e.g.
input(bookmark('data'), name: 'data')
-> select({*, random_uniform(0,1) as randomFilter})
-> filter(randomFilter <= .2)
-> save('output.csv', format: 'csv')
will select ~20% of the input dataset (the exact rows kept will vary from run to run, since the filter is random).
That’s right. You could also make this a model parameter, so you can leave it in your pipeline and enable it when needed. For example, the pipeline below uses a proportion_input parameter, which should default to 1.0 (100%), but you could switch to including only 20% of the data with: riskscape model run foo -p proportion_input=0.2
input(bookmark('data'), name: 'data')
-> filter(random_uniform(0,1) <= $proportion_input)
-> save('output.csv', format: 'csv')
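So with the default of 1.0 the model runs over the full dataset as normal, and a debug run just overrides the parameter, e.g. (assuming the model is named foo as above):
riskscape model run foo
riskscape model run foo -p proportion_input=0.2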
The other way to do this is to use the input step's limit parameter, which will restrict the input to a specific number of rows. E.g. to only include 1000 rows of data you could use:
input(bookmark('data'), limit: 1000)
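In the context of the first example, a minimal sketch of a debug pipeline using limit (same bookmark and output as above, just without the random filter) would be:
input(bookmark('data'), name: 'data', limit: 1000)
-> save('output.csv', format: 'csv')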
Thanks Tim, very helpful to see how it can be done without adding another column.
Is it possible to use an integer limit within a pipeline? I have a few steps I need to run on the full dataset at the start.
Something like:
-> filter(1:nrow(data) <= 1000)
The only way to do that currently would be to limit the number of rows of input data, e.g.
input(bookmark('data'), name: 'data', limit: $num_rows)
The num_rows parameter could have a default value of maxint() (i.e. all the data), and then if you only wanted 1000 rows for debugging, you could use -p num_rows=1000.
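For what it's worth, the two approaches could also be combined in the one pipeline, e.g. a sketch that caps the rows read and then samples a proportion of them, reusing both parameters from above:
input(bookmark('data'), name: 'data', limit: $num_rows)
-> filter(random_uniform(0,1) <= $proportion_input)
-> save('output.csv', format: 'csv')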
Sweet as. In this case I am reading in a dataset that covers NZ and then clipping it down to the relevant area, so I can't limit the input to the first n rows as this might not include any assets from the area I'm looking at:
input(relation: 'buildings_geopackage', name: 'exposure') as exposures_input
-> select({*, intersects(exposure.geom, bounds(bookmark('hazard_clip', {location: $hazard_clip_location})))})
-> filter("intersects") as exposure_clipped
But the randomised filter works well in this case so I will continue to use that.
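E.g. a rough sketch of how I plan to slot it in after the clipping step, reusing the $proportion_input parameter idea from above (the parameter name is just illustrative):
input(relation: 'buildings_geopackage', name: 'exposure') as exposures_input
-> select({*, intersects(exposure.geom, bounds(bookmark('hazard_clip', {location: $hazard_clip_location})))})
-> filter("intersects") as exposure_clipped
-> filter(random_uniform(0,1) <= $proportion_input)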