terça-feira, 12 de março de 2019

The non-experimentation result anti-pattern

Many of us have gone through the scenario of putting new software functionality into production and seeing tremendous improvement in business metrics. These results are often celebrated and attributed to this new achievement and this is used in presentations and corporate bonuses definitions.

But can we attribute causal relation between the new functionality and the new business metrics?

Before answering this question, however difficult it may be to abandon the emotive side of the functionality we have created, let us try to imagine some hypothesis that might have caused this improvement other than our change:

  • An advertising campaign
  • An unexpected viral
  • The functionality created by another squad
  • A concurrent product crashes
  • It is a normal seasonal movement

These are just a few assumptions that could cause improvement in metrics. It is possible to mitigate each of them through some data analysis. But can you anticipate all other possible causal assumptions to be able to rule them out? Difficultly.

Since you can not assign causality between the two events, the most you can say is that there is a correlation between them. And yet it may be a spurious correlation. Some examples of this kind of correlations can be seen on this site here. It speaks, for example, about the strong correlation between the number of Nicolas Cage films per year and the number of drownings in the United States per year.

So the answer to the initial question is no! It is not correct to attribute the improvement of the business metric just because of its new functionality. If you want to check whether a new feature causes a change in some metric, an AB experiment is required.

In an AB experiment we have a variant (or alternative) called control (usually A) and a variant containing the modification to be validated (B). A random sample of users will receive the version of their product with variant A, and another random sample of users will see variant B. Everything in the two variants must be the same, except the modification made in variant B. With this you can control the environment and all those possible causes we reported above and more the others hypothesis we could not predict before.

At the end of the experiment we ate going to be able to attribute the causality of the new functionality to the incremental metric because it was the only thing different in the whole environment. Taking into account of course all statistical variance due to the use of a sample, which can lead us to possible false positives in a minimal percentage of cases.

AB experiments are the most accepted technique currently for you to determine the causality of a change in your product. They are not a new thing and are widely used as a part of the scientific method.

However there is currently a large discussion in the statistical area regarding causal inferences, which would be models for inferring cause from one event to another without necessarily an experiment controlling all variables. This discussion became even stronger after the publication of The Book of Why.

However, if you do not master the techniques of causal inference, it is best not to disclose cause and effect without being sure. I know it's very hard for us to leave our emotions aside for having participated in the development of that piece of software. But it's important to stay cool. And of course, always do AB experiments before putting something into production.

This post is part of a series about Experimentation Anti-Patterns, a talk that I presented recently.

domingo, 13 de janeiro de 2019

Spark "first" function behavior on pandas dataframe

Spark first function is used to choose a value after aggregating some dataset value.

In python pandas we don't have this behavior as default after aggregating some dataframe, but we can do it easily we a few lines of code.

Since I could not find this solution on Stack Overflow. But first, let's see what happen with a column with string type when we do not use a function like firtst:

import pandas as pd
df = pd.DataFrame.from_records([
       dict(k=1, i=10, t="a"),
       dict(k=1, i=20, t="b"),
       dict(k=1, i=20, t="c"),
df.groupby("k", as_index=False).sum()

If we run the code bellow, we get this result:

  k i
0 1 50

You can see that the column t was removed since you cannot sum it.

Now, let's add the first aggregation function to this column:

first = lambda a: a.values[0] if len(a) > 0 else None
df.groupby("k", as_index=False).agg({'i': sum, 't': first})

If we run the code bellow, we get this result:

  k i  t
0 1 50 a