terça-feira, 12 de março de 2019

The non-experimentation result anti-pattern

Many of us have gone through the scenario of putting new software functionality into production and seeing tremendous improvement in business metrics. These results are often celebrated and attributed to this new achievement and this is used in presentations and corporate bonuses definitions.

But can we attribute causal relation between the new functionality and the new business metrics?

Before answering this question, however difficult it may be to abandon the emotive side of the functionality we have created, let us try to imagine some hypothesis that might have caused this improvement other than our change:

  • An advertising campaign
  • An unexpected viral
  • The functionality created by another squad
  • A concurrent product crashes
  • It is a normal seasonal movement

These are just a few assumptions that could cause improvement in metrics. It is possible to mitigate each of them through some data analysis. But can you anticipate all other possible causal assumptions to be able to rule them out? Difficultly.

Since you can not assign causality between the two events, the most you can say is that there is a correlation between them. And yet it may be a spurious correlation. Some examples of this kind of correlations can be seen on this site here. It speaks, for example, about the strong correlation between the number of Nicolas Cage films per year and the number of drownings in the United States per year.

So the answer to the initial question is no! It is not correct to attribute the improvement of the business metric just because of its new functionality. If you want to check whether a new feature causes a change in some metric, an AB experiment is required.

In an AB experiment we have a variant (or alternative) called control (usually A) and a variant containing the modification to be validated (B). A random sample of users will receive the version of their product with variant A, and another random sample of users will see variant B. Everything in the two variants must be the same, except the modification made in variant B. With this you can control the environment and all those possible causes we reported above and more the others hypothesis we could not predict before.

At the end of the experiment we ate going to be able to attribute the causality of the new functionality to the incremental metric because it was the only thing different in the whole environment. Taking into account of course all statistical variance due to the use of a sample, which can lead us to possible false positives in a minimal percentage of cases.

AB experiments are the most accepted technique currently for you to determine the causality of a change in your product. They are not a new thing and are widely used as a part of the scientific method.

However there is currently a large discussion in the statistical area regarding causal inferences, which would be models for inferring cause from one event to another without necessarily an experiment controlling all variables. This discussion became even stronger after the publication of The Book of Why.

However, if you do not master the techniques of causal inference, it is best not to disclose cause and effect without being sure. I know it's very hard for us to leave our emotions aside for having participated in the development of that piece of software. But it's important to stay cool. And of course, always do AB experiments before putting something into production.

This post is part of a series about Experimentation Anti-Patterns, a talk that I presented recently.

domingo, 13 de janeiro de 2019

Spark "first" function behavior on pandas dataframe

Spark first function is used to choose a value after aggregating some dataset value.

In python pandas we don't have this behavior as default after aggregating some dataframe, but we can do it easily we a few lines of code.

Since I could not find this solution on Stack Overflow. But first, let's see what happen with a column with string type when we do not use a function like firtst:

import pandas as pd
df = pd.DataFrame.from_records([
       dict(k=1, i=10, t="a"),
       dict(k=1, i=20, t="b"),
       dict(k=1, i=20, t="c"),
df.groupby("k", as_index=False).sum()

If we run the code bellow, we get this result:

  k i
0 1 50

You can see that the column t was removed since you cannot sum it.

Now, let's add the first aggregation function to this column:

first = lambda a: a.values[0] if len(a) > 0 else None
df.groupby("k", as_index=False).agg({'i': sum, 't': first})

If we run the code bellow, we get this result:

  k i  t
0 1 50 a


domingo, 3 de setembro de 2017

"Learning to ranking" with xCLiMF python implementation

At Globo.com we try to optimize our personalized recommendation providing better content in a ranked list. We have a hybrid system that can automatically essemble collaborative filtering, content based and non personalized algorithms.

But none of our implemented algorithms are intented to rank. Our best collaborative filtering algorithm, based on the paper Collaborative Filtering for Implicit Feedback Datasets, doesn't optimizes a ranking metric like MRR or NDCG.

Even so we have good results of those algorithms choosing the best hyper parameters focusing in MRR, we thought we could get better results using algorithms focusing in optimizing a ranking metric: Our first hypothesis is xCLiMF.

xCLiMF is a latent factor collaborative filtering variant wich optimizes a lower bound of the smoothed reciprocal rank of "relevant" items in ranked recommendation list. It is a evolution o CLiMF algorithm that can handle using multiple levels of relevance data instead of using only unary user preferences.

Implementation references

Since xCLiMF is very similar to CLiMF, I implemented xCLiMF based on this python CLiMF repository ( https://github.com/gamboviol/climf ). During the process I found an error and submitted a pull request ( https://github.com/gamboviol/climf/pull/2 ).

I have also found a C++ xCLiMF implementantion by other brazilian guy ( https://github.com/gpoesia/xclimf ). But this solution have two problems: The first one is a bug reported here ( https://github.com/gpoesia/xclimf/issues/1 ) and the second one is related to performance, since it updates user and item vectors separately as two loops of other two chained loops.

My implementation

So my implementation also have some differences from original papers:

  • In the paper they don't mention any kind of previously normalization, but it's always a good practice to do it for machine learning algorithms input data. So I normalized dividing all rating by the maximum rating in the dataset. Using original ratings lead to some math errors:
    • Exponential function of large values causes math range error;
    • Exponential function of very large negative numbers get 0, and lead to some division by zero error; 
    • Log function of negative numbers, caused by large exponentials, causes math domain error
  • The paper experimentation setup uses N random ratings by user in training and in the cross validation data. Even so I have implemented this way, I have put an option to select 2N top ratings by user, randomly distributed in training and cross validation. For me, this last protocol reflected a more realistic kind of problem that xCLiMF is going to solve.
  • In the original paper they described the algorithm as two chained loops. Firstly I implemented as described in the paper. But them I updated the code to use a more vectorized solution, having a 54% improvement in the performance. I maintained the chained loop solution for comparison purpose ( https://github.com/timotta/xclimf/blob/master/tests/looping_update.py )
  • For the objective function, the paper describes that it's not necessary to divide by the number of users, because it is only used for validating if the gradient ascending is still increasing after many interactions. But I made this division for interpretability purposes.


Using the experimentation protocol of N random ratings by user in training and in the cross validation, applied to the MovieLens 1M dataset, I got MRR variying from 0.08 to 0.09 at the 50 iteration using the hyper-parameters described in the paper.

Using the top K experimentation protocol in the same dataset I got MRR varying from 0.2 to 0.26. I tried those two experimentation protocol with the Alternative Least Squares implemented at Spark and the maximum MRR I got was 0.01.

The sequence of experiments, and how to run can be seen at https://github.com/timotta/xclimf

Conclusions and future

Now we have a working xCLiMF implementation in python, but it is not feasible to train it with Globo.com dataset. Running it using MovieLens 1M dataset that have 6k users took 4 minutes. With MovieLens 20M dataset that have 130k users took more than one hour. Globo.com datasets are much bigger than that, if we count only the logged users, our dataset of users preferences on shows has more than 4 millions users, it would take more than 40 hours to run.

That's why my next step is implementing this same algorithm on Spark. Since the updating user by user step is independent, it is possible to scale it running in parallel on our hadoop cluster.

I appreciate any help, correction, suggestion or criticism


CLiMF: Learning to Maximize Reciprocal Rank with Collaborative Less-is-More Filtering
Yue Shi, Martha Larson, Alexandros Karatzoglou, Nuria Oliver, Linas Baltrunas, Alan Hanjalic
ACM RecSys 2012

xCLiMF: Optimizing Expected Reciprocal Rank for Data with Multiple Levels of Relevance
Yue Shia, Alexandros Karatzogloub, Linas Baltrunasb, Martha Larsona, Alan Hanjalica
ACM RecSys 2013

Collaborative Filtering for Implicit Feedback Datasets
Yifan Hu, Yehuda Koren, Chris Volinsky
IEEE, 2008

sábado, 28 de novembro de 2015

Dados abertos do Web Democracia

Nas minhas conversas com os participantes do RecSys de 2015 em Viena foi possível notar o esgotamento deles em relação aos dados existentes para análise. Os acadêmicos longe de grandes empresas como Linkedin, Google, Facebook e Netflix, estavam sempre em busca de mais dados para tunar e implementar novos algoritmos. Em resumo, ninguém aguenta mais o MovieLens.

Por essa razão resolvi disponibilizar todos os dados de avaliações de políticos no Web Democracia publicamente.

Para garantir a privacidade dos usuários utilizei a total aleatorização dos IDs de forma a evitar qualquer engenharia reversa, deixando então as avaliações anônimas.

A descrição completa dos dados de avaliações de políticos do Web Democracia você encontra aqui.

Um exemplo simples de análise desses dados com pandas pode ser visto aqui.

Agora então os quase meio milhão de avaliações de políticos do Web Democracia é mais uma fonte de informação e de um domínio bem diferente que os tradicionais livros e filmes.

segunda-feira, 14 de outubro de 2013

Meu primeiro aplicativo Android

Publiquei essa semana no Google Play meu primeiro aplicativo Android. O 9Gag Offline trata-se de um aplicativo que permite que você baixe os gags do site 9Gag para momentos em que não haja conexão com internet. Por exemplo se você for viajar, no avião você ficará isolado do mundo mas poderá rir bastante com os gags que estarão guardados no seu celular.

Já fazia mais ou menos uns seis meses que estava estudando esporadicamente o sistema operacional da Google e fazendo diversos testes. A um mês atrás resolvi fechar de vez o aplicativo e colocá-lo ao público. Deu pra aprender bastante sobre o framework e reviver os problemas de sincronia e concorrência que haviam na época em que programava para desktop com Delphi, Kylix e Java Swing.

Espero que gostem e se acharem qualquer problema ou tiverem alguma sugestão é só avisar. Para baixar o 9Gag Offline basta clicar aqui e ir ao Google Play.

quinta-feira, 25 de julho de 2013

Plugin de Fonética Portuguesa para ElasticSearch

O ElasticSearch possui um plugin fonético que abrange diversas regras e padrões. No entanto nenhuma dessas regras é boa o bastante para as peculiaridades da lingua portuguesa.

Pensando nisso e de posse de uma gramática portuguesa, forkei o repositório e criei um encoder com as regras de nosso estimado idioma lusitano. Como alguns meses depois ainda não obtive resposta alguma da equipe do plugin original, promovi então este encoder a um novo plugin: Portuguese Phonetic Plugin.

Para utilizá-lo você precisa clonar o projeto no github e instalá-lo:

git clone https://github.com/timotta/elasticsearch-fonetica-portuguesa.git
cd elasticsearch-fonetica-portuguesa
./script/install.sh ~/Programas/elasticsearch-0.20.5
Depois é preciso configurar um analyser em config/elasticsearch.yml:
index :
    analysis :
        analyzer :
            fonetico :
                type : custom
                tokenizer : standard
                filter : [ standard, lowercase, foneticaportuguesa_filter, asciifolding ]
        filter :
                type : foneticaportuguesa
                replace : false
Com tudo configurado, você pode criar um novo indice usando este analyser, como é mostrado abaixo:
$ curl localhost:9200/comfonetica -XPUT -d '
  "mappings" :{
     "document" : {
        "analyzer" : "fonetico"
Com o indice criado e configurado é possivel verificar as transformações de texto da seguinte forma:
$ curl -XGET 'localhost:9200/comfonetica/_analyze?analyzer=fonetico&pretty=true' -d chiclete
  "tokens" : [ {
    "token" : "XICLETE",
    "start_offset" : 0,
    "end_offset" : 8,
    "type" : "",
    "position" : 1
  }, {
    "token" : "chiclete",
    "start_offset" : 0,
    "end_offset" : 8,
    "type" : "",
    "position" : 1
  } ]
Repare que a palavra se transformou em duas. Isso acontece porque a configuração replace do filter está como false.

Atualmente só testei o plugin com a versão 0.20.5 do elasticsearch, se testarem em outras versões peço que reporte no github. Além disso, nem todas as regras fonéticas foram implementadas, então se você precisar de alguma que esteja faltando, colabore ou solicite lá também.

terça-feira, 16 de julho de 2013

Complexidade Ciclomática com Pygenie recursivo

Pygenie é uma biblioteca python bem simples para calcular a complexidade ciclomática de seus método. É tão simples, mas tão simples que para projetos mais complexos é preciso scriptar um pouco para que todos os diretórios sejam analisados.

Para facilitar então a nossa vida aqui na empresa, fiz uma contribuição (com pouca esperança de ser aceita pois o projeto está parado a dois anos) para que seja possível obter os resultado de forma recursiva. Com esta contribuição é possivel informar o parâmetro -r que analisará recursivamente todos os arquivos .py dentro de um diretório.

$ pygenie complexity mylib -r

Se dentro de seu projeto houver algum arquivo ou diretório que você não deseja que seja analisado você pode utilizar um outro parâmetro adicionado, o -e, onde você informar um pattern para ser excluído. Por exemplo se houver um diretório de testes:

$ pygenie complexity mylib -r -e tests

Caso você deseje utilizar desde já pode instalar o pygenie através do meu fork e neste meio tempo tentar incentivar nosso amigo Matthew Von Rocketstein a aceitar o pull request.