As mentioned in the previous blog post, algorithmic methods for assigning credit to media channels when a user converts are becoming increasingly popular, replacing older heuristics such as first-touch and last-touch attribution. Dalessandro et al. presented a paper that goes beyond a regression framework to explain such attributions, and I’ll go over it in the next few sections.
Attribution and Causality
Dalessandro et al. propose a counterfactual analysis to estimate the causal effect of advertising channels on user conversion. Some strict assumptions must hold before causal conclusions can be drawn from the data, which Dalessandro et al. state as follows:
- The ad treatment precedes the outcome (conversion of a user)
- Any attribute that may affect both ad treatment and conversion outcome is observed and accounted for, i.e., there are no unobserved confounders.
- Every user has a non-zero probability of receiving an ad treatment.
In practice, the second and third conditions are nearly impossible to verify in any attribution analysis. An ad campaign may be targeted at a certain demographic, so some users have zero probability of seeing it, which violates the third condition. Confounders such as users’ existing biases towards certain products and services may well be unmeasurable, which violates the second. In the interest of brevity, we will not dwell on the mathematical formulation of such an analysis, since its practicality is dubious. In the next section, I will discuss an approximate causal model that Dalessandro et al. introduce, which recasts the causal estimation problem as a channel importance problem and applies better to real-world data.
Channel Importance Attribution
Before getting into any convoluted equations, I’ll quickly introduce important notation:
- C = {C1, C2, …, Ck} is the set of media channels that have shown ads to a group of people
- W is a vector of user attributes measured before exposure to any ads (for example, demographics or prior internet searches)
- Y is a boolean indicating whether or not a user has converted after exposure to ads
- (γ = Σ Y, n) is the dataset of n users who have seen the same ads from channels in C and have the same values W = w, producing γ = Σ Y total conversions
- S is a subset of C that excludes Ck
- ωS,k is the probability that an ordering of C begins with the sequence {S, Ck, …}, under some distribution Ω over possible orderings
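The expected contribution of channel Ck to conversions, taken over all subsets S of the other channels, is given as Vk:
$$V_k = \sum_{S \subseteq C \setminus \{C_k\}} \omega_{S,k} \left( E[\gamma \mid S \cup \{C_k\}] - E[\gamma \mid S] \right)$$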
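To understand this better, consider an example with only two channels, C1 and C2. Here ω∅,1 is the probability that C1 serves first, and ω2,1 is the probability that C2 serves before C1 (and likewise for ω∅,2 and ω1,2). The attribution values for the channels are:
$$V_1 = \omega_{\emptyset,1} \left( E[\gamma \mid \{C_1\}] - E[\gamma \mid \emptyset] \right) + \omega_{2,1} \left( E[\gamma \mid \{C_1, C_2\}] - E[\gamma \mid \{C_2\}] \right)$$
$$V_2 = \omega_{\emptyset,2} \left( E[\gamma \mid \{C_2\}] - E[\gamma \mid \emptyset] \right) + \omega_{1,2} \left( E[\gamma \mid \{C_1, C_2\}] - E[\gamma \mid \{C_1\}] \right)$$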
We can see in this simplified form that the attribution values depend on the order in which the channels serve their advertisements to the user.
Interestingly, for an observed ad campaign we already know the order in which channels delivered their ads, so the ωS,k probabilities are always 0 or 1. The paper discusses why using this observed information can actually be harmful when computing attribution values. Let’s take a look at an example.
Consider C = {C1, C2}. Let E[γ|∅] = E[γ|{C1}] = E[γ|{C2}] = 0 and E[γ|{C1,C2}] = δ > 0, and assume that C2 always serves its ads after C1. These assumptions tell us that neither C1 nor C2 causes any conversions on its own, but the two together do lead to some conversions. Using the formula described above, the attribution values are:
$$V_1 = \omega_{\emptyset,1} \cdot 0 + \omega_{2,1} \, \delta = 0 \cdot \delta = 0$$
$$V_2 = \omega_{\emptyset,2} \cdot 0 + \omega_{1,2} \, \delta = 1 \cdot \delta = \delta$$
Because the order is observed (C2 always serves after C1), ω2,1 = 0 and ω1,2 = 1, so the attribution values are V1 = 0 and V2 = δ. This means all the credit for the joint effect of C1 and C2 goes to C2, simply because C2 serves its ads after C1. This conclusion is harmful because it generalizes: channels that serve their ads later receive more credit for user conversions. The method effectively collapses into a last-touch attribution model, which is badly flawed.
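The two-channel example can be checked numerically. The following is a minimal sketch, not code from the paper: the function names `attribution` and `observed_weight` and the value δ = 0.3 are assumptions chosen for illustration. It sums ωS,k times the marginal lift of Ck over every subset S of the other channels, with ωS,k set to 0 or 1 from a single observed ordering.

```python
from itertools import combinations

def attribution(channels, expected, weight):
    """V_k = sum over subsets S of the other channels of
    weight(S, k) * (expected[S plus k] - expected[S])."""
    values = {}
    for k in channels:
        others = [c for c in channels if c != k]
        total = 0.0
        for r in range(len(others) + 1):
            for subset in combinations(others, r):
                s = frozenset(subset)
                total += weight(s, k) * (expected[s | {k}] - expected[s])
        values[k] = total
    return values

def observed_weight(order):
    """Weights for one observed ordering: 1 if the ordering begins with
    the channels in S followed by k, otherwise 0."""
    def weight(s, k):
        i = len(s)
        return 1.0 if frozenset(order[:i]) == s and order[i] == k else 0.0
    return weight

delta = 0.3  # hypothetical joint effect of C1 and C2, chosen for illustration
expected = {
    frozenset(): 0.0,
    frozenset({"C1"}): 0.0,
    frozenset({"C2"}): 0.0,
    frozenset({"C1", "C2"}): delta,
}

# C2 always serves after C1, so the only ordering is (C1, C2).
values = attribution(["C1", "C2"], expected, observed_weight(("C1", "C2")))
print(values)
```

With these inputs C1 receives no credit and C2 receives all of δ, which is the last-touch behaviour described above.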
Dalessandro et al. recognize that using these observed probabilities leads to poor recognition of interaction effects among channels, and instead propose a different way to calculate ωS,k, which is the crux of their idea. They define Ω as a uniform distribution over all possible orderings of C, so that ωS,k can be calculated as:
$$\omega_{S,k} = \frac{|S|! \, (|C| - |S| - 1)!}{|C|!}$$
For two channels every weight equals 1/2, so the example above yields V1 = V2 = δ/2, and the joint effect is split evenly instead of going entirely to the later channel.
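As a rough sketch of what the uniform distribution implies (again with illustrative names and a hypothetical δ = 0.3, not code from the paper), the snippet below uses the standard Shapley weight |S|!(|C| − |S| − 1)!/|C|!, checks it against brute-force enumeration of all orderings, and recomputes the two-channel example.

```python
from itertools import combinations, permutations
from math import factorial, isclose

def uniform_weight(subset_size, n_channels):
    """Shapley weight: the probability that a uniformly random ordering
    begins with a given subset of this size, followed by channel k."""
    return (factorial(subset_size)
            * factorial(n_channels - subset_size - 1)
            / factorial(n_channels))

def weight_by_enumeration(s, k, channels):
    """Same probability, computed by counting orderings explicitly."""
    m = len(s)
    orders = list(permutations(channels))
    hits = sum(1 for o in orders if frozenset(o[:m]) == s and o[m] == k)
    return hits / len(orders)

def shapley_attribution(channels, expected):
    """V_k under uniform ordering weights."""
    n = len(channels)
    values = {}
    for k in channels:
        others = [c for c in channels if c != k]
        total = 0.0
        for r in range(n):
            for subset in combinations(others, r):
                s = frozenset(subset)
                total += uniform_weight(r, n) * (expected[s | {k}] - expected[s])
        values[k] = total
    return values

# Check the closed form against enumeration on three hypothetical channels.
three = ["C1", "C2", "C3"]
closed_form = uniform_weight(1, len(three))
enumerated = weight_by_enumeration(frozenset({"C1"}), "C2", three)

# Two-channel example: neither channel converts alone, both together give delta.
delta = 0.3  # hypothetical value, chosen for illustration
expected = {
    frozenset(): 0.0,
    frozenset({"C1"}): 0.0,
    frozenset({"C2"}): 0.0,
    frozenset({"C1", "C2"}): delta,
}
values = shapley_attribution(["C1", "C2"], expected)
print(closed_form, enumerated, values)
```

Under these weights the two channels split δ equally, and the values always sum to the difference between the expected conversions with all channels and with none, which is a standard property of Shapley values.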
Fully understanding this equation requires a good grasp of Shapley values, a concept from cooperative game theory for allocating credit among players. Due to the limited scope of this blog, I will not discuss them here. But if there is one thing to take away from the paper’s approach, it is that the observed ordering probabilities ωS,k should be ignored in favor of the authors’ equation above.