Root Cause Analysis using Rothmans Causal Pies


It sometimes seems to me that in the tech industry, perhaps because we’re often playing with new technologies and innovating in our organisation or even our field (when we’re not trying to pay down tech debt and keep legacy systems running), we’re guilty of not looking outside our own sphere for better practices and new (or even old) ideas.

Whilst studying for my Master’s degree in Global Health, I discovered the concept of “Rothman’s Causal Pies”.

The Epidemiological Triad

Epidemiology is the study of why and how diseases (including non-communicable diseases) occur. As a field, it encompasses the entire realm of human existence, from environmental and biological aspects to heuristics and even economics. It’s a real exercise in Systems Thinking, which is kinda why I love it.

In epidemiology, there is a concept known as the “Epidemiological Triad”, which describes the necessary relationship between agent, host, and environment. When all three are present, the disease can occur. Without one or more of those three factors, the disease cannot occur. It’s a very simplistic but useful model. As we know, all models are wrong, but some are useful.

This concept is useful because through understanding this triad, it’s possible to identify an intervention to reduce the incidence of, or even eradicate, a disease, such as by changing something in the environment (say, by providing clean drinking water) or a vaccination programme (changing something about the host).

What the triad doesn’t provide, however, is a description of the various factors necessary for the disease to occur, and this is especially relevant to non-infectious diseases, such as back pain, coronary heart disease, or a mental health problem. In these cases, there may be many different components, or causal factors. Some of these may be “necessary”, whilst others merely contribute. There may be many different combinations of causes that result in the disease.

To use heart disease as an example, the component causes, or “risk factors”, could include poor diet, little or no exercise, genetic predisposition, smoking, alcohol, and many more. No single component is sufficient to cause the disease, and one (genetic predisposition, for example) may be necessary in all cases.

In 1976, Kenneth Rothman proposed a model that demonstrates this multifactorial nature of causation.

Rothman’s Causal Pies

An individual factor that contributes to cause disease is shown as a piece of a pie, like the triangles in the game Trivial Pursuit. After all the pieces of a pie fall into place, the pie is complete, and disease occurs.

The individual factors are called component causes. The complete pie, which is termed a causal pathway, is called a sufficient cause. A disease may have more than one sufficient cause, with each sufficient cause being composed of several component causes that may or may not overlap. A component that appears in every single pie or pathway is called a necessary cause, because without it, disease does not occur. An example of this is the role that genetic factors play in haemophilia in humans – haemophilia will not occur without a specific gene defect, but the gene defect is not believed to be sufficient in isolation to cause the disease.

An example: Note in the image below that component cause A is a necessary cause because it appears in every pie.
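This model is simple enough to sketch in code. Below is a minimal illustration (the pies and component names are hypothetical, echoing the diagram’s A-labelling, not any real disease): each sufficient cause is a set of component causes, disease occurs when any one set is fully present, and the necessary cause is whatever appears in every set.

```python
# Minimal sketch of Rothman's model (hypothetical pies, not real epidemiology).
# Each sufficient cause ("pie") is a set of component causes.

sufficient_causes = [
    {"A", "B", "C"},
    {"A", "D", "E"},
    {"A", "B", "F"},
]

def necessary_causes(pies):
    """Components present in every pie: without one, disease cannot occur."""
    common = set(pies[0])
    for pie in pies[1:]:
        common &= pie
    return common

def disease_occurs(present, pies):
    """True if the factors present complete at least one pie."""
    return any(pie <= present for pie in pies)

print(necessary_causes(sufficient_causes))                 # {'A'}
print(disease_occurs({"A", "B", "C"}, sufficient_causes))  # True
print(disease_occurs({"B", "C", "D"}, sufficient_causes))  # False: A missing
```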

Root Cause Analysis

I’m a huge proponent of holding regular retrospectives (for incidents, failures, successes, and simply at regular intervals), but it seems that in technology, particularly when we’re carrying out a Root Cause Analysis due to an incident, there’s a tendency to assume one single “root cause” – the smoking gun that caused the problem.

We may tend towards assuming that once we’ve found this necessary cause, we’re finished. And whilst that’s certainly a useful exercise, it’s important to recognise that there are other component causes and there may be more than one sufficient cause.

The Five Whys model is a great example of this – it fails to probe into other component factors, and only looks for a single root cause. As any resilience engineer will tell you: There is no Root Cause.

The Five Whys takes the team down a single linear path, and will certainly find a root cause, but leaves the team blind to other potential component or sufficient causes – and, even worse, it leads the team to believe that they’ve identified the problem. In the worst-case scenario, a team may identify “human error” as a root cause, which could re-affirm a faulty, overly simplistic world view and result not only in the wrong cause being identified, but in harm to the team’s ability to carry out RCAs in the future.

Read more about the flaws in the “five whys” model in John Allspaw’s “Infinite Hows”.

In reality, we’re dealing with complex, maybe even chaotic, systems, alongside human interactions. There exist multiple causal factors, some necessary for the “incident” to have occurred, and some simply component causes that together become sufficient – the completed pie!

Takeaway: There is usually more than one causal pie.

An improved approach could be to use Ishikawa diagrams, but in my experience, when dealing with complex systems, these diagrams very quickly become visibly cluttered and complex, which makes them hard to use. Additionally, because each “fish bone” is treated as a separate pathway, interrelationships between causes may not be identified.

Instead of a complex fishbone diagram, try identifying all the component causes, and visually complete (on a whiteboard for example) all the pies that could (or did) result in the outcome. You almost certainly won’t identify all of them, but that doesn’t matter very much.

If we adopt Rothman’s causal pie model instead of approaches such as the Five Whys or Ishikawa, it provides us with an easy-to-use, easy-to-visualise tool that can model not only “what caused this incident”, but “what factors, if present, could cause this incident to occur again?”.

In order to prevent the incident (the disease, in epidemiological terms), the key factor we’re looking for is the “necessary cause” – component A in the pies diagram. But we’re also looking for the other component causes.

Application: The prevention of future incidents.

Suppose we can’t easily address component A – maybe it’s a third-party system that’s outside our control – but we can control components B and C, which also occur in every causal pie. If we control for those instead, we don’t need to worry about component A at all!
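That argument can be sketched with the same set-based model (hypothetical pies again): any component that appears in every sufficient cause is a single point of intervention, whether or not it is the one we happened to label A.

```python
# Hypothetical pies in which A, B and C all appear in every sufficient cause.
sufficient_causes = [
    {"A", "B", "C"},
    {"A", "B", "C", "D"},
    {"A", "B", "C", "E"},
]

def single_interventions(pies):
    """Components whose removal alone prevents every pie from completing."""
    return set.intersection(*pies)

# If A (say, a third-party system) is out of reach, controlling B or C
# still blocks every causal pathway.
print(single_interventions(sufficient_causes))  # A, B and C all qualify
```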

Next time you’re carrying out a Root Cause Analysis or retrospective, try using Rothman’s Causal Pies, and please let me know how it goes.

Addendum: “Post-Mortem” exercises.

Even though the term “post-mortem” is ubiquitously used in the technology industry as a descriptor for analysis into root causes, I don’t like it.

Firstly, in the vast majority of tech incidents, nobody died – post-mortem literally means “after death”. It implies that a Very Bad Thing happened, but if we’re trying to hold constructive, open exercises where everyone present possesses enough psychological safety in order to contribute honestly and without fear, we should phrase the exercise in less morbid terms. The incident has already happened – we should treat it as a learning opportunity, not a punitive sounding exercise.

Secondly, we should run these root cause analysis exercises for successes, not just for failures. You don’t learn the secrets of a great marriage by studying divorce. The term “post-mortem” isn’t particularly appropriate for studying the root causes of successes.

Simpson’s Paradox and the Ecological Fallacy [Data Science]


I’m currently studying for a Master’s Degree in Global Health at The University of Manchester, and I’m absolutely loving it. Right now, we’re studying epidemiology and study design, which also involves a great deal of statistical analysis.

Some data was presented to us from an ecological study (a type of scientific study that looks at large-scale, population-level data) called The WHO MONICA Project, which showed mean cholesterol vs mean height, grouped by population centre (e.g. China-Beijing or UK-Glasgow).

In this chart, you can see a positive correlation between height and cholesterol, with a coefficient of 0.36, suggesting that height may be a potential risk factor for higher cholesterol.

However, when the analysis was re-run using raw data (not averaged for each of the population centres), the correlation coefficient was -0.11.

So, when using mean measures of each population centre, it appears that height could be a risk factor for higher cholesterol, whilst the raw data actually shows the opposite is slightly more likely to be true!

This is known as an “ecological fallacy” – taking population-level data and drawing erroneous conclusions about individual-level effects.

This is a great example of Simpson’s Paradox.

Simpson’s Paradox occurs when a trend appears in several different groups of data but disappears or reverses when the groups are combined.

Table 1 in Wang (2018) is a relatively easy example. (This is fictional test score data for two schools.)

(Also, please ignore for a moment the author’s possible bias in scoring male students higher – maybe this is a test about ability to grow facial hair.)

| School    | Male n | Male average | Female n | Female average |
|-----------|--------|--------------|----------|----------------|
| Alpha (1) | 80     | 84           | 20       | 80             |
| Beta (2)  | 20     | 85           | 80       | 81             |

It’s clear if you look at the numbers that the Beta school has higher average scores (85 and 81 for male students and female students respectively).

However, if you calculate the averaged scores for individuals in the schools, Alpha school has an average score of 83.2 and Beta has just 81.8.

So whilst Beta school *looks* like the higher-performing school when broken down by gender, it is actually Alpha school that has the higher average score.

In this case, it’s quite clear why: if you only look at the average scores by gender, it’s easy to assume that the proportion of male and female pupils for each school is roughly the same, when in fact 80 pupils at Alpha school are male (and 20 female), but only 20 are male at the Beta school, with 80 female.

Using gender to segment the data hides this disproportion of gender between the schools. This may be appropriate to show in some cases, but can lead to false assumptions being made.
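The pooled figures above are just pupil-weighted means, which is easy to verify directly from the table’s numbers:

```python
# Fictional test-score data from Table 1: (n pupils, average score).
scores = {
    ("Alpha", "male"):   (80, 84),
    ("Alpha", "female"): (20, 80),
    ("Beta",  "male"):   (20, 85),
    ("Beta",  "female"): (80, 81),
}

def school_average(school):
    """Pupil-weighted average across both gender groups."""
    groups = [v for (s, _), v in scores.items() if s == school]
    return sum(n * avg for n, avg in groups) / sum(n for n, _ in groups)

print(school_average("Alpha"))  # 83.2
print(school_average("Beta"))   # 81.8
```

Beta wins within each gender, but the 80/20 gender split reverses the ranking once the groups are pooled.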

The same issue can be seen in Covid-19 Case Fatality Rate (CFR) data when comparing Italy and China. von Kügelgen et al. (2020) found that CFRs were lower in Italy for every age group, but higher overall (see table (a) in the paper).

The reason, when you see table (b), is clear. The CFRs for the 70-79 and 80+ groups are far higher than for all other age groups, and these age groups are significantly over-represented in Italy’s confirmed cases of Covid-19. This means that Italy’s overall CFR is higher than China’s only by dint of recording a “much higher proportion of confirmed cases in older patients compared to China.” China simply didn’t report as many Covid-19 cases in older individuals, and the fatality rate is far higher in older individuals. Italy has a more elderly population (median age of 45.4 as opposed to China’s 38.4), but other factors such as testing strategies and social dynamics may also be playing a part.
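The mechanism is again a weighted average. The figures below are invented for the sketch (they are NOT the paper’s actual numbers): one country has a lower CFR in every age group, yet an older case mix still pushes its overall CFR higher.

```python
# Illustrative CFRs per age group (invented, not the paper's figures).
cfr_china = {"<60": 0.005, "60-69": 0.035, "70+": 0.150}
cfr_italy = {"<60": 0.004, "60-69": 0.030, "70+": 0.140}  # lower in EVERY group

# Share of each country's confirmed cases falling in each age group.
cases_china = {"<60": 0.80, "60-69": 0.12, "70+": 0.08}
cases_italy = {"<60": 0.45, "60-69": 0.20, "70+": 0.35}   # much older case mix

overall_china = sum(cfr_china[g] * cases_china[g] for g in cfr_china)
overall_italy = sum(cfr_italy[g] * cases_italy[g] for g in cfr_italy)

print(f"China overall CFR: {overall_china:.4f}")  # 0.0202
print(f"Italy overall CFR: {overall_italy:.4f}")  # 0.0568 -- higher overall
```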

Another example of Simpson’s Paradox appeared, in reverse, in gender bias among graduate admissions to the University of California, Berkeley. In 1973, the admission figures appeared to show that men were more likely to be admitted than women, and the difference was significant enough that it was unlikely to be due to chance alone. However, the data for the individual departments showed a “small but statistically significant bias in favour of women” (Bickel et al., 1975). Bickel et al.’s conclusion was that women were applying to more competitive departments such as English, whilst men were applying to departments such as engineering and chemistry, which typically had higher rates of admission.

(Whether this still constitutes bias is the subject of a different debate.)

The crux of Simpson’s Paradox is: if you pool data without regard to the underlying causality, you can get the wrong results.

References:

Wang, B. (2018) “Simpson’s Paradox: Examples”, Shanghai Archives of Psychiatry, 30(2), p. 139. Available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5936043/ (Accessed: 21 October 2020).

von Kügelgen, J., Gresele, L. and Schölkopf, B. (2020) “Simpson’s paradox in Covid-19 case fatality rates: a mediation analysis of age-related causal effects”, arXiv.org. Available at: https://arxiv.org/pdf/2005.07180.pdf (Accessed: 21 October 2020).

Bickel, P.J., Hammel, E.A. and O’Connell, J.W. (1975) “Sex Bias in Graduate Admissions: Data From Berkeley”, Science, 187(4175), pp. 398–404. doi:10.1126/science.187.4175.398. Available at: https://homepage.stat.uiowa.edu/~mbognar/1030/Bickel-Berkeley.pdf

WHO MONICA Project Principal Investigators (1988) “The World Health Organization MONICA Project (Monitoring Trends and Determinants in Cardiovascular Disease): a major international collaboration”, Journal of Clinical Epidemiology, 41(2), pp. 105–114. doi:10.1016/0895-4356(88)90084-4.