I am going to have to submit my thesis soon

So I should look for a job

Pro Tip: Interviews do NOT go well when you tell the interviewer the statistics they are doing are technically fraud!

But I have a small issue…

My Dog Bosco

No international travel for Bosco

Bosco’s minor hobbies

Bosco’s BIG hobby

Bosco has demands

and Bosco has demands

Calculating possum concentration

  • We have the number of possums seen in each area, as reported by local dogs.
  • The dogs have requested to maintain their anonymity for data privacy reasons
    • We have been given:
      • The average number of possums seen per walk
      • The standard error on that estimate
      • The number of walks that contributed to that average

Possum Data

areaID possum_mean possum_se n_dogwalks
1 1.182 0.226 11
2 1.034 0.189 29
3 0.800 0.374 5
4 1.059 0.218 17
5 0.737 0.214 19
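As a minimal sketch, the table above can be typed into R by hand and each standard error turned into a rough 95% interval (mean ± 1.96 × SE). The name `possum_area_mean` matches the data frame used later in these slides. The `lower` and `upper` columns are illustrative additions, and the interval assumes an approximately normal sampling distribution. The real data would also carry an sf `geometry` column, which is not reproduced here.

```r
# Possum summary data, copied from the table above
possum_area_mean <- data.frame(
  areaID      = 1:5,
  possum_mean = c(1.182, 1.034, 0.800, 1.059, 0.737),
  possum_se   = c(0.226, 0.189, 0.374, 0.218, 0.214),
  n_dogwalks  = c(11, 29, 5, 17, 19)
)

# Rough 95% interval, assuming a normal sampling distribution (an assumption)
z <- qnorm(0.975)
possum_area_mean$lower <- possum_area_mean$possum_mean - z * possum_area_mean$possum_se
possum_area_mean$upper <- possum_area_mean$possum_mean + z * possum_area_mean$possum_se

possum_area_mean
```

The intervals overlap heavily across areas, which is the reason the rest of the talk cares about showing uncertainty rather than just the means.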

Communication goals


Everyone in the house needs to be in agreement on:


  1. The spatial trend and hot spots in possum counts
  2. The statistical strength of these spatial trends

MUST communicate with a visualisation

Bosco doesn’t understand maths, he is literally a dog

The house includes

Tom is just an average guy

…and…

I am easily tricked

And also…

Mum whose standard of evidence is almost overwhelming

And also…

Mum needs an overwhelming amount of evidence

So, an uncertainty visualisation should…


  1. Reinforce justified signals
    • The signal should convey its strength, so that my mum can trust the results
  2. Hide signals that are just noise
    • I don’t even want to see something that isn’t there (or I will fall for another scam)

Can also think of it as…

Ideally we would just plot the original data

But the original data isn’t always…

  • Available
    • e.g. anonymised data, theoretical values, etc.
  • Relevant
    • Might need to use a sampling distribution rather than original data
  • Deterministic
    • e.g. bounded data, estimated values, etc.

So we will visualise the mean

But what if the error is worse?

  • We get reports that dogs have been inflating their possum numbers
    • They are desperate to have Bosco move to their neighbourhood
    • We estimate that our current standard error is likely the lower bound
    • However, the upper bound could be six times that amount.
    • We need to see both scenarios to make an informed decision
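The two scenarios can be sketched in a few lines of base R. The standard errors and means come from the Possum Data table; the names `se_low`, `se_high`, and `spread` are illustrative and not part of the original analysis.

```r
# Means and standard errors from the Possum Data table
possum_mean <- c(1.182, 1.034, 0.800, 1.059, 0.737)
possum_se   <- c(0.226, 0.189, 0.374, 0.218, 0.214)

# Low-uncertainty scenario: the reported SE is the lower bound
se_low <- possum_se
# High-uncertainty scenario: the SE could be up to six times larger
se_high <- 6 * possum_se

scenarios <- data.frame(areaID = 1:5, se_low = se_low, se_high = se_high)
scenarios

# Difference between the highest and lowest area mean
spread <- diff(range(possum_mean))
spread

# Is every high-scenario SE larger than the whole spread of the means?
all(se_high > spread)
```

In the high scenario every standard error exceeds the full spread of the area means, so a good uncertainty visualisation should make the spatial trend hard or impossible to see in that case.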

Spot the difference

Solution 1: add an axis for uncertainty

Questions to think about…

  • Is there a visible difference between the high and low uncertainty cases?
  • Is the trend still visible in the high uncertainty case?
  • Is this approach accessible?
  • Is making this plot accessible?

Not colour blind friendly


Aside: why doesn’t this work?

  • Uncertainty is not just another variable…
    • It presents an interesting perceptual problem
  • Usually do not want variables to interfere with each other
    • In uncertainty visualisation, the opposite is true
    • We want the uncertainty to interfere with our ability to read the signal

Solution 2: blend the colours together

Questions to think about…

  • Is there a visible difference between the high and low uncertainty cases?
  • Is the trend still visible in the high uncertainty case?
  • Is this approach accessible?
  • At what level of uncertainty should you blend two colours together?
  • Is making this plot accessible?

Solution 3: simulate a sample

Questions to think about…

  • Is there a visible difference between the high and low uncertainty cases?
  • Is the trend still visible in the high uncertainty case?
  • Is this approach accessible?
  • What has replaced the manual colour blending in this approach?
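One way to read Solution 3: instead of colouring each area by its mean, draw several values from the area's sampling distribution and colour small sub-regions by those draws. The base R sketch below only does the simulation step, assuming a normal sampling distribution. The helper `simulate_areas()`, the value of `n_draws`, and the seed are all hypothetical choices for illustration.

```r
set.seed(2024)  # arbitrary seed so the sketch is reproducible

possum_mean <- c(1.182, 1.034, 0.800, 1.059, 0.737)
possum_se   <- c(0.226, 0.189, 0.374, 0.218, 0.214)
n_draws     <- 50  # hypothetical number of draws per area

# One row per draw: each area contributes n simulated values
simulate_areas <- function(means, ses, n) {
  data.frame(
    areaID = rep(seq_along(means), each = n),
    draw   = rnorm(length(means) * n,
                   mean = rep(means, each = n),
                   sd   = rep(ses, each = n))
  )
}

low  <- simulate_areas(possum_mean, possum_se, n_draws)
high <- simulate_areas(possum_mean, 6 * possum_se, n_draws)

# Spread of the draws within each area: larger spread means noisier colours
tapply(low$draw, low$areaID, sd)
tapply(high$draw, high$areaID, sd)
```

Randomness does the blending here: when the draws within an area vary more than the means vary between areas, neighbouring areas become visually indistinguishable without anyone choosing a blending threshold by hand.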

Solution: For Bosco!

Why are the plots hard to make?

Not tibble friendly

  • Existing packages take uncertainty in basically any format
    • quantile function from qnorm
    • estimate and error as two separate variables
    • a distribution from distributional
  • Existing packages are difficult to integrate into the ggplot workflow
    • Often require using bespoke data wrangling functions that are run separately to the plots
    • The code can be finicky and often the examples don’t run

Ideal uncertainty map code

  • ggplot recognises the random variable input and changes the visualisation accordingly
  • I want to touch as few ggplot settings as possible
# Pseudocode
ggplot(data) +
  geom_sf(aes(geometry = geometry,
              fill = random_variable))

ggplot2 uses the grammar of graphics

It is designed to take in data

Not theoretical distributions

This is what ggdibbler is for

ggdibbler uses distributional

  • distributional lets you store distributions in a tibble as distribution objects
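library(dplyr)
library(distributional)
# Store each area's estimate as a normal distribution object
possum_dist_df <- possum_area_mean |>
  mutate(possum_dist = dist_normal(mu = possum_mean,
                                   sigma = possum_se))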
areaID possum_dist
1 N(1.2, 0.051)
2 N(1, 0.036)
3 N(0.8, 0.14)

The object is literally a distribution

  • Can also mix different distributions together (don’t all need to be the same family)
  • Can use a sample rather than a theoretical distribution
    • e.g. if you want to implement bootstrapping
class(possum_dist_df$possum_dist)
[1] "distribution" "vctrs_vctr"   "list"        
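A small sketch of the two bullet points above, using functions from the distributional package (`dist_normal()`, `dist_poisson()`, `dist_sample()`, `generate()`). The Poisson rate and the values inside `dist_sample()` are made up for illustration and are not from the possum data.

```r
library(distributional)

# Different families can sit in the same distribution vector
mixed <- c(dist_normal(mu = 1.182, sigma = 0.226),
           dist_poisson(lambda = 1))

# A sample-based distribution, e.g. bootstrap means (hypothetical values)
boot <- dist_sample(list(c(0.9, 1.1, 1.3, 1.0, 1.2)))

mean(mixed)      # mean of each distribution in the vector
variance(mixed)  # variance of each distribution in the vector
mean(boot)       # mean of the stored sample

set.seed(1)
generate(mixed, times = 3)  # a list holding three random draws per distribution
```

Because the column is a vector of distributions, any plotting layer that knows how to sample from it can do so without extra data wrangling.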

The ggdibbler package

  • Named after the dibbler, an Australian marsupial
    • Wanted it to be next to ggdist alphabetically in the package list
      • ggdist visualises distributions (not signal suppression)
    • dibble = distributional tibble was an accident

Comparing ggplot to ggdibbler

ggplot code

possum_dist_df |> 
  ggplot() + 
  geom_sf(aes(geometry = geometry,
                     fill=possum_mean)) +
  scale_fill_distiller(palette="Greens", direction=1) +
  labs(fill = "Possums/Walk") +
  theme_minimal()  

ggdibbler code

possum_dist_df |> 
  ggplot() + 
  geom_sf_sample(aes(geometry = geometry,
                     fill=possum_dist)) +
  scale_fill_distiller(palette="Greens", direction=1) +
  labs(fill = "Possums/Walk") +
  theme_minimal() 

The plots are random

ggplot code

possum_dist_df |> 
  ggplot() + 
  geom_sf(aes(geometry = geometry,
                     fill=possum_mean)) +
  scale_fill_distiller(palette="Greens", direction=1) +
  labs(fill = "Possums/Walk") +
  theme_minimal() 

ggdibbler code

possum_dist_df |> 
  ggplot() + 
  geom_sf_sample(aes(geometry = geometry,
                     fill=possum_dist)) +
  scale_fill_distiller(palette="Greens", direction=1) +
  labs(fill = "Possums/Walk") +
  theme_minimal() 

Cool Stuff in the Branch

Scatter plot

ggplot code

ggplot() + 
  geom_point(data = scatter_data, 
             aes(x = xmean, y=ymean))

ggdibbler code

ggplot() + 
  stat_sample(data = scatter_data, 
              aes(x = xdist, y = ydist), n = 300)

Both Scatter Plots

Density plot

ggplot code

ggplot() + 
  geom_density(data = density_data, aes(x=xmean)) 

ggdibbler code

ggplot() + 
  stat_density_sample(data = density_data, 
                      aes(x=xdist), n=100) 

Both Density Plots

ggdist vs ggdibbler

ggdist code

ggplot() +
  stat_slab(data = density_data, 
            aes(xdist = xdist, y = group)) +
  theme_ggdist()

ggdibbler code

ggplot() + 
  stat_density_sample(data = density_data, 
  aes(x=xdist), n=100) 

Bye Bye

End