domingo, 13 de janeiro de 2019

Spark "first" function behavior on pandas dataframe

Spark first function is used to choose a value after aggregating some dataset value.

In python pandas we don't have this behavior as default after aggregating some dataframe, but we can do it easily we a few lines of code.

Since I could not find this solution on Stack Overflow. But first, let's see what happen with a column with string type when we do not use a function like firtst:

import pandas as pd
df = pd.DataFrame.from_records([
       dict(k=1, i=10, t="a"),
       dict(k=1, i=20, t="b"),
       dict(k=1, i=20, t="c"),
df.groupby("k", as_index=False).sum()

If we run the code bellow, we get this result:

  k i
0 1 50

You can see that the column t was removed since you cannot sum it.

Now, let's add the first aggregation function to this column:

first = lambda a: a.values[0] if len(a) > 0 else None
df.groupby("k", as_index=False).agg({'i': sum, 't': first})

If we run the code bellow, we get this result:

  k i  t
0 1 50 a