A/B Test Reporting in Looker

Kickstarter Engineering
Jul 2, 2015

One of the Data team’s priorities this year has been improving the Kickstarter A/B testing process. To this end, I’ve been focused on making it easier to set up, run, and analyze experiments. This will make it more likely we’ll use data from experiments to inform product design.

Until recently, we monitored A/B tests in an ad hoc way. We use our event tracking infrastructure to log A/B test data, so while a test was running, a Product Manager or Data Analyst watched the number of users in the experiment’s event stream until it reached the required sample size. At that point, we ran the numbers through an R script and sent out the results.

Enter Looker

Kickstarter recently adopted a business intelligence tool called Looker to support data reporting and ad hoc analysis. Looker connects directly to our Redshift cluster, which is where we store the raw event data from our A/B tests. This made me wonder whether we could use Looker to monitor experiments and report results.

One feature of Looker we like is the ability to save and schedule queries, with the results delivered via email. If we could find a way to analyze A/B tests via SQL, then Looker could handle the rest.

Back to School

How can we do statistics in SQL without access to probability distributions? There are methods for generating normally distributed data in SQL, but this approach seems like overkill. We don’t need to recreate a standard normal distribution on the fly. The values don’t change.

My aha moment was remembering my old statistics textbooks with look-up tables in the back. By adding look-up tables for probability distributions to Redshift, we can get good approximations of power, p-values, and confidence intervals for the typical A/B tests we run. Although this means we’re approximating a continuous distribution with a discrete one, we don’t rely exclusively on p-values to interpret our tests, so an error of a few thousandths of a point doesn’t matter much.

The Nitty Gritty

As an example, I’m going to use a common type of test we run — a hypothesis test of the difference of two proportions. (If you’d like to learn more about the statistics behind this test, this is a good place to start).

To make this concrete, let’s say we’re testing a new design of the Discover page, and we want to know whether it affects the number of users clicking through to project pages.
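
For reference, the test statistic that the SQL below builds up step by step is the standard two-proportion z-statistic:

z = (p_control - p_experimental) / sqrt( p_pooled * (1 - p_pooled) * (1/n_control + 1/n_experimental) )

where p_pooled is the click-through proportion across both groups combined. Under the null hypothesis that the two designs perform the same, z is approximately standard normal.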

To turn that test statistic into a p-value, we need the standard normal distribution. I generated a set of z-scores and their cumulative probabilities in R and loaded this into Redshift as standard_normal_distribution.
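
The look-up table itself is just two columns. Here’s a minimal sketch of the schema (the actual column types, precision, and load step may differ):

-- Sketch of the look-up table schema; probability is the cumulative
-- probability P(Z <= z_score), which is what the p-value calculation
-- below relies on.
CREATE TABLE standard_normal_distribution (
  z_score     DOUBLE PRECISION,
  probability DOUBLE PRECISION
);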

The table looks something like this:

+------------------------+-------------------+
| z_score                | probability       |
+------------------------+-------------------+
| 0                      | 0.5               |
| 0.0000009999999992516  | 0.50000039894228  |
| 0.00000199999999939138 | 0.500000797884561 |
| 0.00000299999999953116 | 0.500001196826841 |
| 0.00000399999999967093 | 0.500001595769122 |
| 0.00000499999999981071 | 0.500001994711402 |
| 0.00000599999999995049 | 0.500002393653682 |
| 0.00000700000000009027 | 0.500002792595963 |
| 0.00000799999999934187 | 0.500003191538243 |
| 0.00000899999999948164 | 0.500003590480523 |
| 0.00000999999999962142 | 0.500003989422804 |
| 0.0000109999999997612  | 0.500004388365084 |
| 0.000011999999999901   | 0.500004787307365 |
| 0.0000130000000000408  | 0.500005186249645 |
| 0.0000139999999992924  | 0.500005585191925 |
+------------------------+-------------------+

Now let’s say we’ve already calculated the results of our experiment for two groups: control and experimental. For each group, we have the number of unique users n who visited the Discover page, the number of unique users x who clicked through to a project page, and the proportion p = x / n. This can all be done with a query.
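
As an illustration, those control and experimental CTEs might look something like the following. The ab_test_events table, its columns, and the experiment name are hypothetical; the real query depends on how our event tracking stores A/B test assignments and clicks.

WITH control AS (
  -- Unique visitors (n), unique clickers (x), and the conversion proportion (p)
  -- for the control variant; the ::float cast avoids integer division
  SELECT
    COUNT(DISTINCT user_id) AS n,
    COUNT(DISTINCT CASE WHEN clicked_project THEN user_id END) AS x,
    COUNT(DISTINCT CASE WHEN clicked_project THEN user_id END)::float
      / COUNT(DISTINCT user_id) AS p
  FROM ab_test_events
  WHERE experiment = 'discover_redesign' AND variant = 'control'
), experimental AS (
  -- Same aggregates for the experimental variant
  SELECT
    COUNT(DISTINCT user_id) AS n,
    COUNT(DISTINCT CASE WHEN clicked_project THEN user_id END) AS x,
    COUNT(DISTINCT CASE WHEN clicked_project THEN user_id END)::float
      / COUNT(DISTINCT user_id) AS p
  FROM ab_test_events
  WHERE experiment = 'discover_redesign' AND variant = 'experimental'
), ...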

In the sections below, I’ll use the output of that query to calculate the sample proportion, standard error, and other sample statistics using named subqueries called common table expressions (CTEs). If you aren’t familiar with this flavor of SQL syntax, you can think of CTEs as forming temporary tables that can be used in subsequent parts of the query.

Using a CTE, we calculate p_hat, the pooled proportion under the null hypothesis:

... ), p_hat AS (
  -- Pooled proportion under the null hypothesis; the cast guards against
  -- integer division if x and n are integer counts
  SELECT
    (control.x + experimental.x)::float / (control.n + experimental.n) AS p
  FROM
    control, experimental
), ...

Next we calculate the pooled standard error under the null hypothesis:

... ), se_pooled AS (
  -- Pooled standard error under the null hypothesis; 1.0 (not 1) avoids
  -- integer division when n is an integer count
  SELECT
    SQRT((p_hat.p * (1 - p_hat.p)) * (1.0 / control.n + 1.0 / experimental.n)) AS se
  FROM
    control, experimental, p_hat
), ...

This allows us to calculate an exact z-score from the data:

... ), z_exact AS (
  -- Absolute z-score for the observed difference in proportions
  SELECT
    ABS((control.p - experimental.p) / se_pooled.se) AS z
  FROM
    control, experimental, se_pooled
), ...

Then we find the nearest z-score in our standard normal look-up table and use that to calculate a p-value:

... ), z_nearest AS (
  -- Closest z-score in the look-up table to the exact z-score
  SELECT
    standard_normal_distribution.z_score AS z_score
  FROM
    standard_normal_distribution, z_exact
  ORDER BY ABS(standard_normal_distribution.z_score - z_exact.z) ASC
  LIMIT 1
), p_value AS (
  -- Two-sided p-value from the cumulative probability of the nearest z-score
  SELECT
    (1 - standard_normal_distribution.probability) * 2 AS p
  FROM
    z_nearest
    INNER JOIN standard_normal_distribution ON z_nearest.z_score = standard_normal_distribution.z_score
), ...

Having a p-value is a good start, but we also want to generate confidence intervals for the test. While we’re at it, we’d also like to conduct a power analysis so the test results only display when we’ve reached the minimum sample size.

To do that properly, we need some details about the test design: the significance level, the power, and the minimum change to detect between the two variants. These are all added to the query using Looker’s templated filters, which take user input and insert it into the query as parameters.

Unfortunately, Looker cannot simply insert an arbitrary value (e.g. 0.05) into any part of a query. To get around this, I filter a table column with the user input and then use the resulting value.

For example, in the following section, the query takes user input as significance_level, matches it against the probability column of the standard_normal_distribution table (after some rounding to ensure a match), and saves that value as alpha:

... ), significance_level AS (
  SELECT
    ROUND(probability, 3) AS alpha
  FROM
    standard_normal_distribution
  WHERE
    {% condition significance_level %} ROUND(probability, 3) {% endcondition %}
  LIMIT 1
), ...

Note Looker’s syntax for what it calls templated filters:

WHERE
{% condition significance_level %} ROUND(probability, 3) {% endcondition %}

If the user input is 0.05 for the significance_level filter, Looker converts this to:

WHERE
ROUND(probability, 3) = 0.05
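
Once alpha (and, by the same pattern, the desired power and minimum detectable effect) is available as a CTE, the remaining pieces can reuse the look-up table. As a sketch of the idea, not the exact query in the appendix, the critical z-value and confidence interval might be computed like this:

... ), z_critical AS (
  -- z-value whose cumulative probability is closest to 1 - alpha / 2,
  -- i.e. the two-sided critical value for the chosen significance level
  SELECT
    standard_normal_distribution.z_score AS z
  FROM
    standard_normal_distribution, significance_level
  ORDER BY ABS(standard_normal_distribution.probability - (1 - significance_level.alpha / 2)) ASC
  LIMIT 1
), confidence_interval AS (
  -- Confidence interval for the difference in proportions, using the
  -- unpooled standard error
  SELECT
    (experimental.p - control.p) - z_critical.z * SQRT(
      control.p * (1 - control.p) / control.n
      + experimental.p * (1 - experimental.p) / experimental.n
    ) AS lower_bound,
    (experimental.p - control.p) + z_critical.z * SQRT(
      control.p * (1 - control.p) / control.n
      + experimental.p * (1 - experimental.p) / experimental.n
    ) AS upper_bound
  FROM
    control, experimental, z_critical
), ...

The minimum-sample-size check for the power analysis follows the same pattern, combining the critical z-value, the z-value for the chosen power, and the minimum detectable effect.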

See the appendix below for the entire query.

Admittedly, doing all this in SQL is kind of preposterous, but it means that we can add it to Looker as the engine of an A/B Test Dashboard. The dashboard abstracts away all the calculations and presents a clean UI for taking user input on the parameters of the test design, allowing people without any special engineering or data expertise to use it. Now that it’s built into Looker, it’s part of our larger data reporting infrastructure.

Filters on the dashboard take user input on details about the test design

After taking input about the test design, the dashboard calculates the number of users in each variant, their conversion rates, and the minimum sample size. If the sample size has been met, the dashboard also outputs a p-value and confidence interval for the test. The dashboard can be scheduled to run daily, and we can even set it up to email only when there are results to report.

Emailed results

Now when we implement a new A/B test, we add it to Looker so we can get daily status emails, including statistical results when the test is complete. This can be done by someone on the Product or Engineering teams, freeing up Data team resources to focus on designing experiments well and running more complex tests.

The Results

This kind of dashboard pushes Looker to its limits, so naturally there are some drawbacks to doing A/B test reporting this way. It separates the implementation of the test from the implementation of the reporting, so there is some duplicated effort. Furthermore, it only works for specific types of tests where the math can be handled by a SQL query and a static probability distribution.

On the other hand, we’re happy that Looker is flexible enough to allow us to prototype internal data tools. The A/B Test Dashboard has automated what was a very manual process before, and it has reduced the dependency on the Data team for monitoring and reporting the results of common types of tests. This all means we can run more experiments to create a better experience for our users.

Find this interesting?

If this kind of thing gets your data juices flowing, you should know we’re hiring for a Data Scientist! Head over to the job description to learn more.

Appendix

Our query in full:

Written by Jeremy Salfen.
