Find max value without grouping

This pipeline will return the maximum value for each unique group in column ‘A’

-> group(
by: A,
select: {max: {A, max(B)}}
)
-> save(name: 'output')

so

A,B
a,1
a,4
g,3
g,2

will return

max.A,max.max_B
a,4
g,3

I would like to get the maximum value in column ‘B’ but removing the group function

input(relation: 'test.csv')
-> select({max(B)})
-> save(name: 'output')

gives the error

Failed to validate model for execution
  - Failed to validate 'select({max(B)})' step for execution
    - Failed to validate expression '{max(B)}' against input type {A=>Text, B=>Text}
      - Problems found with 'max' riskscape function
        - Wrong number of arguments provided. Expected 2, but got 1

How can I use max with a column of data outside of a group function?

Hi John,

Why do you want to get rid of the group step?

Basically the group step aggregates across several rows of data, whereas the select step processes each row of data separately (i.e. it operates on a scalar value). So the select step only ever sees one value for column B at a time.

The slightly confusing thing here is that the max() function can be used both in scalar-mode (i.e. select step) and aggregate-mode (i.e. group step). In the scalar version it expects to compare two numbers and return the higher number, e.g. max(B, C). This is why it’s giving you this error.

If all you want to do here is get the max across all the rows of data, then you could just remove the by clause from the group step, i.e.

input(relation: 'test.csv')
-> group({max(B)})
-> save(name: 'output')

Cheers,
Tim

1 Like