Coffee Data Science

Double-Blind Coffee Studies

A discussion on why I don’t do them

Robert McKeon Aloe
Towards Data Science
6 min read · Feb 1, 2022


Over the past three years, I’ve collected a lot of data on coffee, and I’ve published quite a bit of it, especially on improving taste and extraction. However, I haven’t done double-blind studies or triangulated studies (with two different samples and a control sample in a double-blind). I don’t think I will either, and I’d like to talk about why.

Definitions

A blind study is where I would make two shots of coffee, put a label underneath, and have someone else move the cups around so that I wouldn’t know until after the study which was which.

A double-blind study in coffee is where the barista makes two shots of espresso (with a difference in variables) with labels hidden under the cups. Then they mix up the shots, walk away, and the tester comes in. Neither the barista nor the tester knows the order in which the tester will try the shots.

A triangulated study in coffee uses the same two shots as in a double-blind but adds a third, unrelated shot as a control.

The aim of double-blind studies is to remove the bias that arises when people know which variables are going into the recipe. This is very important for something like medications, where there are multiple clinical trials. For coffee, I’m not convinced every experiment should be held to a clinical trial standard.

All images by author. A sample of a study I did without double blind or even blind testing.

Metrics of Performance

I use two metrics for evaluating the differences between techniques: Final Score and Coffee Extraction.

Final score is the average of a scorecard of 7 metrics (Sharp, Rich, Syrup, Sweet, Sour, Bitter, and Aftertaste). These scores are subjective, of course, but they are calibrated to my tastes and have helped me improve my shots. There is some variation in the scores. My aim is to be consistent for each metric, but sometimes the granularity is difficult.

Total Dissolved Solids (TDS) is measured using a refractometer, and this number combined with the output weight of the shot and the input weight of the coffee is used to determine the percentage of coffee extracted into the cup, called Extraction Yield (EY).
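To make the relationship concrete, here is a minimal sketch of the usual EY calculation: TDS multiplied by the output weight, divided by the input weight. The function name and the sample numbers are hypothetical and only illustrate the arithmetic; they are not measurements from my data.

```python
def extraction_yield(tds_percent: float, output_g: float, input_g: float) -> float:
    """Return extraction yield (EY) as a percentage.

    tds_percent: refractometer reading, in percent
    output_g:    weight of the shot in the cup, in grams
    input_g:     weight of the dry coffee dose, in grams
    """
    if input_g <= 0:
        raise ValueError("input_g must be positive")
    return tds_percent * output_g / input_g


# Hypothetical shot: 18 g of coffee in, 36 g of espresso out, 10% TDS.
ey = extraction_yield(tds_percent=10.0, output_g=36.0, input_g=18.0)
print(f"EY = {ey:.1f}%")  # EY = 20.0%
```

With the same TDS reading, a longer shot (more output weight) gives a higher EY, which is why all three numbers are needed.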

Challenges in Coffee

Double-blind studies are challenging for a few reasons.

  1. You need more than one person.
  2. You need to adjust for how the tongue is affected by something as rich as espresso, which is why Q-grading is done on more diluted coffee.
  3. You need multiple people to taste from different parts of the world.
  4. You need multiple coffees, roast levels, and ages.
  5. You need multiple grinder/espresso machine setups.

It may help to see how I think through an experiment, to better understand the variable space.

Design of Experiment Example

Let’s talk about my recent Rok vs Niche grinder experiments. My original design used the same espresso machine with paired shots over multiple roasts and different roast ages, and I didn’t have the pairs of shots back-to-back but rather I changed grinders as I had shots throughout the day. I found that the Rok grinder was better in both taste and extraction, and then I did further investigations to find a root-cause.

However, this experiment was just me tasting, and it wasn’t double-blind. Let’s set aside extraction yield (EY) for the moment. EY showed better results for the ROK, but people argue about taste. Let’s just look at a subjective metric such as taste.

We can redesign this experiment. Assume the original experiment took N person-hours, where person-hours account for how many people spent how many hours on it.

  1. First, let’s make it double-blind, and it will now take 2*N person-hours.
  2. Let’s push the shots closer together in time, which doesn’t add time, but having to rinse my tongue between shots would greatly affect the tasting.
  3. Increase the number of tasters to 10, so now we are at 11*N (assuming 1 barista makes the shots for everyone).
  4. I did six roasts and multiple ages, but not multiple roast levels. Test 4 roast levels, which increases the time to 44*N.
  5. We can use a Decent Espresso machine to mimic other machines, and do 4 common machines. Time increases to 176*N.
  6. Should we try across different pressure/temperature profiles or just one? For multiple profiles, the total increases again, of course.
  7. Because these are grinders, do we care about blade dullness? Are we concerned with the grinders only after they are broken in?
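The multipliers in the steps above can be reproduced with a few lines of arithmetic. This is only a sketch under a simple assumption: the cost scales with the number of people involved (tasters plus barista) times the number of roast levels times the number of machines. The function name and parameters are hypothetical.

```python
def person_hour_multiplier(tasters: int, baristas: int = 1,
                           roast_levels: int = 1, machines: int = 1) -> int:
    """Multiplier on the original N person-hours for a redesigned study."""
    people = tasters + baristas
    return people * roast_levels * machines


# Step 1: double-blind with one taster and one barista.
print(person_hour_multiplier(tasters=1))                               # 2
# Step 3: ten tasters, one barista.
print(person_hour_multiplier(tasters=10))                              # 11
# Step 4: add four roast levels.
print(person_hour_multiplier(tasters=10, roast_levels=4))              # 44
# Step 5: add four machines.
print(person_hour_multiplier(tasters=10, roast_levels=4, machines=4))  # 176
```

Every additional variable, such as pressure/temperature profiles or grinder wear, would add another factor to the product.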

I’m not trying to be hyperbolic. I want to use this example to illustrate the challenges in coffee data science.

Often, people have argued I didn’t have extraction yield to back my findings, and then when I did, they would say my taste findings aren’t double-blind. I have seen few double-blind studies in espresso, and when there are double-blind studies, only a few samples are tasted, which is also insufficient. So I don’t think those should be considered the standard.

The trouble with this idea is that it limits the number of people who could form a data-backed opinion about coffee to just the few with the expertise, time, and money. There isn’t a peer review process other than people repeating an experiment or set of experiments for themselves.

My experience has been that people look towards well-established coffee people who lack data even when faced with people who have data. Then when presented with data, they argue without data themselves.

First fines migration test with chalk

A good example of this has been fines migration. I searched in multiple ways to see if fines migrate, and I found that only a small amount of fines move, to the tune of 4%. However, the pushback from many in coffee was that I didn’t disprove the theory of fines migration, all the while ignoring the lack of proof of said theory. It had been accepted as proven in the minds of the coffee community as it developed over the past two decades, but it is a theory without evidence. Even the discussion of the origins of the theory is murky, suggesting it started out as an oral myth to explain problematic shots.

How to Accommodate

I present both taste and extraction. My aim isn’t to prove a certain product or method is better because I want it to be. My aim is to improve my own espresso experience. To do this, I must aim for honesty. In this vein, I have also published on weird and failed experiments because I believe their results may be important too.

I experiment with shots throughout the day and the week, and I try to randomize the variables involved. Additionally, I take steps to go deeper than others in my analysis: I don’t want to just quantify whether a method is better or worse, but to understand the causes and conditions.
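For anyone who wants to try something similar, one simple way to randomize is to build a balanced list of conditions and shuffle it before pulling any shots. This is a generic sketch and not a description of my exact process; the function name, the number of shots, and the seed are all hypothetical.

```python
import random


def randomized_schedule(conditions, shots_per_condition, seed=None):
    """Return a shuffled, balanced order in which to test each condition."""
    rng = random.Random(seed)  # a fixed seed makes the schedule reproducible
    schedule = [c for c in conditions for _ in range(shots_per_condition)]
    rng.shuffle(schedule)
    return schedule


# Hypothetical plan: 5 shots per grinder, in random order.
schedule = randomized_schedule(["Rok", "Niche"], shots_per_condition=5, seed=42)
for i, grinder in enumerate(schedule, start=1):
    print(f"shot {i}: {grinder}")
```

Each condition still gets the same number of shots, but time of day and roast age are less likely to line up with one grinder by accident.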

The world of espresso is not yet a world filled with data, so any data is better than no data, and anyone is welcome to collect any amount of data to show I am wrong. I am open to debating data-driven conclusions.


I’m in love with my Wife, my Kids, Espresso, Data Science, tomatoes, cooking, engineering, talking, family, Paris, and Italy, not necessarily in that order.