Visualising uncertainty with ggdibbler

Harriet Mason, Dianne Cook, Sarah Goodwin, Susan Vanderplas

Monash University, Australia

I am going to have to submit my thesis soon

So I should look for a job

Pro Tip: Interviews do NOT go well when you tell the interviewer the statistics they are doing are technically fraud!

I have been thinking of expanding my search

But I have a small issue…

My Dog Bosco

No international travel for Bosco

Bosco’s minor hobbies

Bosco’s BIG hobby

Bosco has demands

and Bosco has demands

Calculating possum concentration

We have the number of possums seen in each area, as reported by local dogs.
The dogs have asked to remain anonymous for data privacy reasons.
- We have been given:
  - The average number of possums seen per walk
  - The standard error on that estimate
  - The number of walks that contributed to that average
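
For context, these three summaries could be computed from raw per-walk counts as in the minimal base R sketch below. The vector `walk_counts` is hypothetical and its values are invented for illustration; the dogs' real reports are not available.

```r
# Hypothetical per-walk possum counts for one area (invented for illustration)
walk_counts <- c(0, 2, 1, 1, 3, 0, 1, 2, 1, 1, 1)

n_dogwalks  <- length(walk_counts)                 # number of walks
possum_mean <- mean(walk_counts)                   # average possums seen per walk
possum_se   <- sd(walk_counts) / sqrt(n_dogwalks)  # standard error of that average

summary_row <- data.frame(possum_mean, possum_se, n_dogwalks)
print(summary_row)
```

Only `summary_row` would be shared, which is why the later plots have to work from an estimate and its error rather than from the original data.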

Possum Data

areaID	possum_mean	possum_se	n_dogwalks
1	1.182	0.226	11
2	1.034	0.189	29
3	0.800	0.374	5
4	1.059	0.218	17
5	0.737	0.214	19

Communication goals

Everyone in the house needs to be in agreement on:

1) The spatial trend and hot spots in possum counts

2) The statistical strength of these spatial trends

MUST communicate with a visualisation

Bosco doesn’t understand maths, he is literally a dog

The house includes

Tom is just an average guy

…and..

I am easily tricked

And also…

Mum, whose standard of evidence is almost overwhelming

And also…

Mum needs an overwhelming amount of evidence

So, an uncertainty visualisation should…

Reinforce justified signals

The signal should convey its strength, so that my mum can trust the results

Hide signals that are just noise

I don’t even want to see something that isn’t there (or I will fall for another scam)

Can also think of it as…

Ideally we would just plot the original data

But the original data isn’t always…

Available
- e.g. anonymised data, theoretical values, etc.
Relevant
- Might need to use a sampling distribution rather than original data
Deterministic
- e.g. bounded data, estimated values, etc.

So we will visualise the mean

But what if the error is worse?

We get reports that dogs have been inflating their possum numbers
- They are desperate to have Bosco move to their neighbourhood
- We estimate that our current standard error is likely the lower bound
- However, the upper bound could be six times that amount.
- We need to see both scenarios to make an informed decision
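
The two scenarios can be set up side by side in a few lines of base R. This is a sketch using the values from the possum data table; the column names `se_low`, `se_high`, `width_low` and `width_high` are assumptions introduced here for illustration.

```r
# Values copied from the possum data table
possum <- data.frame(
  areaID      = 1:5,
  possum_mean = c(1.182, 1.034, 0.800, 1.059, 0.737),
  possum_se   = c(0.226, 0.189, 0.374, 0.218, 0.214)
)

# Low-uncertainty scenario: the reported standard error (lower bound)
possum$se_low  <- possum$possum_se
# High-uncertainty scenario: six times the reported standard error (upper bound)
possum$se_high <- 6 * possum$possum_se

# Approximate 95% interval widths under each scenario (normal approximation)
possum$width_low  <- 2 * qnorm(0.975) * possum$se_low
possum$width_high <- 2 * qnorm(0.975) * possum$se_high

print(possum)
```

Comparing the interval widths with the spread of `possum_mean` across areas gives a first sense of whether any spatial trend could survive the high-uncertainty case.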

Spot the difference

Solution 1: add an axis for uncertainty

Questions to think about…

Is there a visible difference between the high and low uncertainty cases?
Is the trend still visible in the high uncertainty case?
Is this approach accessible?
- Not colour blind friendly
Is making this plot accessible?

Aside: why doesn’t this work?

Uncertainty is not just another variable…
- It presents an interesting perceptual problem
We usually do not want variables to interfere with each other
- In uncertainty visualisation, the opposite is true
- We want the uncertainty to interfere with our ability to read the signal

Solution 2: blend the colours together

Questions to think about…

Is there a visible difference between the high and low uncertainty cases?
Is the trend still visible in the high uncertainty case?
Is this approach accessible?
At what level of uncertainty should you blend two colours together?
Is making this plot accessible?
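
One way to picture colour blending is as a weighted mix of each area's colour with a shared middle colour, where the weight grows with the uncertainty. The base R sketch below only illustrates that idea. The helper `blend_colours()` and the rule that turns a standard error into a blending weight are assumptions, not the method used to draw the plots in this talk.

```r
# Mix two colours channel by channel; w = 0 returns col_a, w = 1 returns col_b
blend_colours <- function(col_a, col_b, w) {
  a <- col2rgb(col_a)[, 1]
  b <- col2rgb(col_b)[, 1]
  mixed <- (1 - w) * a + w * b
  rgb(mixed[1], mixed[2], mixed[3], maxColorValue = 255)
}

possum_mean <- c(1.182, 1.034, 0.800, 1.059, 0.737)
possum_se   <- c(0.226, 0.189, 0.374, 0.218, 0.214)

# Map each mean onto a white-to-green ramp
ramp      <- colorRamp(c("white", "darkgreen"))
scaled    <- (possum_mean - min(possum_mean)) / diff(range(possum_mean))
ramp_rgb  <- ramp(scaled)
area_cols <- rgb(ramp_rgb[, 1], ramp_rgb[, 2], ramp_rgb[, 3], maxColorValue = 255)

# Shared colour that every area is pulled towards: the middle of the ramp
mid_rgb <- ramp(0.5)
mid_col <- rgb(mid_rgb[1], mid_rgb[2], mid_rgb[3], maxColorValue = 255)

# Hypothetical rule: weight = standard error relative to the range of the means, capped at 1
w <- pmin(possum_se / diff(range(possum_mean)), 1)

blended <- mapply(blend_colours, col_a = area_cols, w = w,
                  MoreArgs = list(col_b = mid_col), USE.NAMES = FALSE)
print(data.frame(area_cols, w = round(w, 2), blended))
```

The weight rule is the part that has to be chosen by hand, which is exactly the question raised above about when two colours should be blended.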

Solution 3: simulate a sample

Questions to think about…

Is there a visible difference between the high and low uncertainty cases?
Is the trend still visible in the high uncertainty case?
Is this approach accessible?
What has replaced the manual colour blending in this approach?
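
The sampling idea can be sketched in base R: draw several values per area from a normal distribution centred on the estimate, once with the reported standard error and once with six times that amount. The helper `simulate_areas()` and the choice of nine draws per area are assumptions made for illustration.

```r
set.seed(2024)  # fixed seed so the sketch is repeatable

possum_mean <- c(1.182, 1.034, 0.800, 1.059, 0.737)
possum_se   <- c(0.226, 0.189, 0.374, 0.218, 0.214)
n_draws     <- 9  # hypothetical number of draws per area

# Returns a matrix with one row per area and one column per simulated draw
simulate_areas <- function(mu, se, n) {
  t(mapply(function(m, s) rnorm(n, mean = m, sd = s), mu, se))
}

draws_low  <- simulate_areas(possum_mean, possum_se, n_draws)
draws_high <- simulate_areas(possum_mean, 6 * possum_se, n_draws)

# Spread of the draws within each area under each scenario
print(round(apply(draws_low, 1, sd), 2))
print(round(apply(draws_high, 1, sd), 2))
```

If each draw is given its own colour, areas with a large standard error end up with visibly mixed colours, so the randomness of the sample does the blending.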

Solution: For Bosco!

Why are the plots hard to make?

Not tibble friendly

Existing packages take uncertainty in basically any format
- quantile function from qnorm
- estimate and error as two separate variables
- a distribution from distributional
Existing packages are difficult to integrate into the ggplot workflow
- Often require using bespoke data wrangling functions that are run separately to the plots
- The code can be finicky and often the examples don’t run
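
To make the three formats concrete, the sketch below expresses the same uncertainty for area 1 in each of them and checks that they agree. It assumes a normal sampling distribution, and the object names are chosen here for illustration.

```r
library(distributional)

est <- 1.182  # possum_mean for area 1
se  <- 0.226  # possum_se for area 1

# 1. Estimate and error as two separate numbers, combined by hand into an interval
interval_manual <- est + c(-1, 1) * qnorm(0.975) * se

# 2. Quantile function via qnorm
interval_qnorm <- qnorm(c(0.025, 0.975), mean = est, sd = se)

# 3. A distribution object from distributional
possum_dist   <- dist_normal(mu = est, sigma = se)
interval_dist <- c(quantile(possum_dist, 0.025), quantile(possum_dist, 0.975))

print(rbind(interval_manual, interval_qnorm, interval_dist))
```

All three carry the same information, but only the third keeps it in a single object that can live in one column of a data frame.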

Ideal uncertainty map code

ggplot recognises the random variable input and changes the visualisation accordingly
I want to touch as few ggplot settings as possible

# Pseudocode
ggplot(data) +
  geom_sf(aes(geometry = geometry,
              fill = random_variable))

`ggplot2` uses the grammar of graphics

It is designed to take in data

Not theoretical distributions

This is what `ggdibbler` is for

`ggdibbler` uses `distributional`

distributional lets you store distributions in a tibble as distribution objects

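library(dplyr)           # mutate()
library(distributional)  # dist_normal()

# Store each area's estimate and standard error as one normal distribution
possum_dist_df <- possum_area_mean |> 
  mutate(possum_dist = dist_normal(mu = possum_mean,
                                   sigma = possum_se))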

areaID	possum_dist
1	N(1.2, 0.051)
2	N(1, 0.036)
3	N(0.8, 0.14)
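
The second number in each printed distribution appears to be the variance rather than the standard error: 0.226 squared is roughly 0.051, which matches the first row. The sketch below rebuilds the first three distributions directly from the table values, without the spatial data, so the summaries can be compared against the inputs.

```r
library(distributional)

possum_mean <- c(1.182, 1.034, 0.800)
possum_se   <- c(0.226, 0.189, 0.374)

possum_dist <- dist_normal(mu = possum_mean, sigma = possum_se)

print(possum_dist)                  # how the distribution vector prints
print(mean(possum_dist))            # recovers possum_mean
print(sqrt(variance(possum_dist)))  # recovers possum_se
print(round(possum_se^2, 3))        # squared standard errors, to compare with the printout
```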

The object is literally a distribution

Can also mix different distributions together (don’t all need to be the same family)
Can use a sample rather than a theoretical distribution
- e.g. if you want to implement bootstrapping

class(possum_dist_df$possum_dist)

[1] "distribution" "vctrs_vctr"   "list"
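
Both points can be tried out in a short sketch. The Poisson entry and the `walk_counts` vector are invented for illustration, and the bootstrap shown is a plain resampling of the mean, not a procedure taken from the talk.

```r
library(distributional)

# Different families can sit side by side in one distribution vector
mixed <- c(dist_normal(mu = 1.182, sigma = 0.226),
           dist_poisson(lambda = 1.034))
print(class(mixed))

# An empirical distribution built from bootstrap resamples of the mean
set.seed(1)
walk_counts <- c(0, 2, 1, 1, 3)  # hypothetical per-walk counts
boot_means  <- replicate(500, mean(sample(walk_counts, replace = TRUE)))
boot_dist   <- dist_sample(list(boot_means))

print(mean(boot_dist))  # mean of the bootstrap distribution
```

`dist_sample()` takes a list with one numeric vector per distribution, so a column of bootstrap distributions can be stored in the same way as a column of normal distributions.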

The `ggdibbler` package

Named after the dibbler, an Australian marsupial
- Wanted it to be next to `ggdist` alphabetically in the package list
  - ggdist visualises distributions (not signal suppression)
- dibble = distributional tibble was an accident

Comparing `ggplot` to `ggdibbler`

`ggplot` code

possum_dist_df |> 
  ggplot() + 
  geom_sf(aes(geometry = geometry,
              fill = possum_mean)) +
  scale_fill_distiller(palette="Greens", direction=1) +
  labs(fill = "Possums/Walk") +
  theme_minimal()

`ggdibbler` code

possum_dist_df |> 
  ggplot() + 
  geom_sf_sample(aes(geometry = geometry,
                     fill=possum_dist)) +
  scale_fill_distiller(palette="Greens", direction=1) +
  labs(fill = "Possums/Walk") +
  theme_minimal()

The plots are random

`ggplot` code

possum_dist_df |> 
  ggplot() + 
  geom_sf(aes(geometry = geometry,
              fill = possum_mean)) +
  scale_fill_distiller(palette="Greens", direction=1) +
  labs(fill = "Possums/Walk") +
  theme_minimal()

`ggdibbler` code

possum_dist_df |> 
  ggplot() + 
  geom_sf_sample(aes(geometry = geometry,
                     fill=possum_dist)) +
  scale_fill_distiller(palette="Greens", direction=1) +
  labs(fill = "Possums/Walk") +
  theme_minimal()
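
Because the fill values are drawn at random, two renders of the same code will generally differ. Assuming the draws come from R's random number generator, as they do for `distributional::generate()`, setting a seed before sampling should make the result repeatable. The sketch below shows this with `generate()` directly rather than with a `ggdibbler` call.

```r
library(distributional)

possum_dist <- dist_normal(mu    = c(1.182, 1.034, 0.800),
                           sigma = c(0.226, 0.189, 0.374))

set.seed(42)
draw_a <- generate(possum_dist, times = 4)  # list with one numeric vector per area

set.seed(42)
draw_b <- generate(possum_dist, times = 4)  # same seed, so the same draws

draw_c <- generate(possum_dist, times = 4)  # no reseed, so new draws

print(identical(draw_a, draw_b))
print(identical(draw_a, draw_c))
```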

Cool Stuff in the Branch

Scatter plot

`ggplot` code

ggplot() + 
  geom_point(data = scatter_data, 
             aes(x = xmean, y=ymean))

`ggdibbler` code

ggplot() + 
  stat_sample(data = scatter_data, 
              aes(x = xdist, y = ydist), n = 300)

Both Scatter Plots
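
As a rough mental model only, and not a description of how `ggdibbler` is implemented, a sampled scatter plot can be approximated by hand: draw many values from each point's x and y distributions, then plot the draws with ordinary `ggplot2`. The small data frame below is a hypothetical stand-in for `scatter_data`.

```r
library(distributional)
library(ggplot2)

# Hypothetical stand-in for scatter_data: three points with increasing error
scatter_data <- data.frame(xmean = c(1, 2, 3),
                           ymean = c(2, 4, 6),
                           se    = c(0.1, 0.3, 0.5))

xdist <- dist_normal(mu = scatter_data$xmean, sigma = scatter_data$se)
ydist <- dist_normal(mu = scatter_data$ymean, sigma = scatter_data$se)

# generate() returns one vector of draws per distribution; stack them into long form
set.seed(7)
n_draws <- 300
sampled <- data.frame(
  point = rep(seq_len(nrow(scatter_data)), each = n_draws),
  x     = unlist(generate(xdist, times = n_draws)),
  y     = unlist(generate(ydist, times = n_draws))
)

p <- ggplot(sampled) +
  geom_point(aes(x = x, y = y), alpha = 0.3)
print(p)
```

Each original point becomes a cloud whose size reflects its error, so precise points stay crisp while uncertain ones smear into their neighbours.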

Density plot

`ggplot` code

ggplot() + 
  geom_density(data = density_data, aes(x=xmean))

`ggdibbler` code

ggplot() + 
  stat_density_sample(data = density_data, 
                      aes(x=xdist), n=100)

Both Density Plots

`ggdist` vs `ggdibbler`

`ggdist` code

ggplot() +
  stat_slab(data = density_data, 
            aes(xdist = xdist, y = group)) +
  theme_ggdist()

`ggdibbler` code

ggplot() + 
  stat_density_sample(data = density_data, 
                      aes(x = xdist), n = 100)

Bye Bye

End
