What is uncertainty visualisation?

Can graphics lie?

Citizen Scientists

  • The Bureau of Meteorology has been getting reports about weird temperatures in Iowa
  • Apparently, there is a strange spatial pattern in the data
  • We get some citizen scientists to measure data at their home and report back

NOT SO FAST CHUMP

  • Doing science is now illegal in Iowa
  • We need to maintain anonymity
  • Can only provide the county of each scientist

The citizen scientist data

toy_temp
  scientistID   county_name        recorded_temp
  #74991        Lyon County                 21.1
  #22780        Dubuque County              28.9
  #55325        Crawford County             26.4
  #46379        Allamakee County            27.1
  #84259        Jones County                34.2

990 citizen scientists participated

Visualisation goals

  • We need to look at the data and identify:
    • The spatial trend (does it exist or not)
    • The statistical strength of the spatial trend

Approach needs to work for…

…and…

…and…

Uncertainty visualisation should…


  1. Reinforce justified signals
    • We want my mum to trust the results
  2. Hide signals that are just noise
    • I don’t want to see something that isn’t there

Can also think of it as…

We could just plot the data…
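As a rough sketch of what “just plotting the data” could look like here, the code below plots every recorded temperature against its county, using the toy_temp columns from the table above; the plot actually shown on the slide may differ.

library(ggplot2)

# One point per citizen scientist, grouped by county
# This keeps the raw spread, small sample sizes and any outliers visible
# before anything is summarised away
ggplot(toy_temp, aes(x = recorded_temp, y = county_name)) +
  geom_point(alpha = 0.5) +
  labs(x = "Recorded temperature", y = "County")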

Benefits of plotting the original data

  • Plots of the original data can reveal its limitations
    • Can highlight small sample sizes, high variability, missing values, etc.
  • Understanding these limitations can prevent false conclusions

But the original data isn’t always…

  • Available
    • e.g. anonymised data, theoretical values, etc.
  • Relevant
    • Might need to use a sampling distribution rather than original data
  • Deterministic
    • e.g. bounded data, estimated values, etc.

Going back to the example

After we send in our data plot, the Bureau of Meteorology calls us and lets us know that we are only looking for a trend in the average values of the counties

  • Need to use the sampling distribution
    • We do not have data for the sampling distribution
    • Usually we do not bother, instead we do something like…

Estimate the county mean

# Calculate the county means
library(dplyr)

toy_temp |>
  group_by(county_name) |>
  summarise(temp_mean = mean(recorded_temp))
  county_name        temp_mean
  Adair County            29.7
  Adams County            29.6
  Allamakee County        26.3
  Appanoose County        22.8
  Audubon County          27.6

Visualise with a choropleth map
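A hedged sketch of what this choropleth could look like. It assumes an sf object of Iowa county polygons, called county_geometry here with a county_name column; that object is not defined on these slides. The county means are joined on and mapped to fill.

library(dplyr)
library(ggplot2)
library(sf)

county_means <- toy_temp |>
  group_by(county_name) |>
  summarise(temp_mean = mean(recorded_temp))

# county_geometry is an assumed sf object of county polygons
county_geometry |>
  left_join(county_means, by = "county_name") |>
  ggplot() +
  geom_sf(aes(fill = temp_mean)) +
  scale_fill_distiller(palette = "YlOrRd", direction = 1) +
  labs(fill = "Mean temperature")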

But what if the error is worse?

  • Citizens are using some pretty old tools
  • The standard error could be our estimate or up to three times our estimate.
  • The Bureau wants to see both cases

Spot the difference

How do we make an honest plot?

Solution: add an axis for uncertainty
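The slide itself presumably uses a proper bivariate (value by saturation) palette; as a rough stand-in using only standard ggplot2, the sketch below keeps fill for the county mean and maps the standard error to alpha, so high-uncertainty counties wash out. county_geometry is the same assumed sf object of county polygons as before.

library(dplyr)
library(ggplot2)

county_est <- toy_temp |>
  group_by(county_name) |>
  summarise(temp_mean = mean(recorded_temp),
            temp_se   = sd(recorded_temp) / sqrt(n()))

county_geometry |>
  left_join(county_est, by = "county_name") |>
  ggplot() +
  geom_sf(aes(fill = temp_mean, alpha = temp_se)) +
  scale_fill_distiller(palette = "YlOrRd", direction = 1) +
  # larger standard error -> more transparent (the second "axis")
  scale_alpha_continuous(range = c(1, 0.2)) +
  labs(fill = "Mean temperature", alpha = "Standard error")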

Does this work? Not really

  • Pro
    • Included uncertainty and increased transparency
  • Cons
    • High uncertainty signal still very visible so I am still getting scammed
    • 2D palette is harder to read
      • Colour is not a simple 3D space
      • Using saturation hurts accessibility

Aside: why doesn’t this work?

  • Uncertainty is not just another variable…
    • It presents an interesting perceptual problem
  • Usually do not want variables to interfere with each other
    • In uncertainty visualisation, the opposite is true
    • This is the core of the signal suppression approach we implement

Solution: blend the colours together!
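A minimal sketch of the blending idea: each county's colour is faded towards grey in proportion to its standard error. The threshold at which a colour becomes fully grey (blend_at below) is exactly the arbitrary choice criticised on the next slide, and county_geometry is again the assumed sf object of county polygons.

library(dplyr)
library(ggplot2)
library(scales)

# Fade a vector of colours towards grey: weight 0 keeps the colour,
# weight 1 gives plain grey
blend_to_grey <- function(cols, weight, grey = "grey80") {
  mapply(function(col, w) {
    mixed <- (1 - w) * col2rgb(col) + w * col2rgb(grey)
    rgb(mixed[1], mixed[2], mixed[3], maxColorValue = 255)
  }, cols, weight)
}

blend_at <- 2  # made-up standard error at which a colour is fully grey

county_est <- toy_temp |>
  group_by(county_name) |>
  summarise(temp_mean = mean(recorded_temp),
            temp_se   = sd(recorded_temp) / sqrt(n())) |>
  mutate(base_col  = col_numeric("YlOrRd", domain = NULL)(temp_mean),
         blend_col = blend_to_grey(base_col, pmin(temp_se / blend_at, 1)))

county_geometry |>
  left_join(county_est, by = "county_name") |>
  ggplot() +
  geom_sf(aes(fill = blend_col)) +
  scale_fill_identity()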

Does this work? Kind of…

  • Pros
    • Included uncertainty and increased transparency
    • No false signals
  • Cons
    • Still have 2D Colour palette
    • Standard error at which to blend colours is made up
      • Blend at 1? 2? 4? 37?
      • Impossible to align with hypothesis testing

Solution: simulate a sample
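A minimal sketch of the simulation step, treating each county mean as normal with the estimated mean and standard error (object names other than toy_temp's columns are made up here). Turning these draws into the sub-county “pixels” shown on the slide is the part that gets painful, which is where the next slides pick up.

library(dplyr)
library(tidyr)

n_draws <- 7  # simulated values per county

county_draws <- toy_temp |>
  group_by(county_name) |>
  summarise(temp_mean = mean(recorded_temp),
            temp_se   = sd(recorded_temp) / sqrt(n())) |>
  rowwise() |>
  mutate(simulated_temp = list(rnorm(n_draws, temp_mean, temp_se))) |>
  ungroup() |>
  unnest(simulated_temp)

# Each county now has n_draws plausible means; a pixel map then assigns
# each draw to one small polygon inside that county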

Does this work? Almost!

  • Pros
    • Included uncertainty
    • High uncertainty interferes with reading of plot (?)
    • 1D colour palette
  • Cons
    • Nightmare to make

Why are the plots hard to make?

Storing uncertain data

  • The main issues stem from the difficulty of storing uncertain data
  • Best to represent uncertain data as a distribution
    • That distribution should take up one cell
    • Set up a vectorised distribution object?

Thank god for distributional

library(distributional)

toy_temp_est <- toy_temp |>
  group_by(county_name) |>
  summarise(temp_dist = dist_normal(mu = mean(recorded_temp),
                                    sigma = sd(recorded_temp) / sqrt(n())))
  county_name        temp_dist
  Adair County       N(30, 0.82)
  Adams County       N(30, 1)
  Allamakee County   N(26, 0.3)
  Appanoose County   N(23, 0.69)
  Audubon County     N(28, 0.8)

What does it do?

  • distributional lets you store distributions in a tibble as distribution objects
class(toy_temp_est$temp_dist)
[1] "distribution" "vctrs_vctr"   "list"        
  • Can also mix different distributions together (they don’t all need to be the same family)
  • Can use a sample rather than a theoretical distribution
    • e.g. bootstrapping (see the sketch below)
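A small sketch of those last two points. dist_normal(), dist_student_t(), and dist_sample() are real distributional functions; the bootstrap example and the object names around them are made up for illustration.

library(distributional)

# Different families can sit in the same distribution vector
mixed <- c(dist_normal(mu = 30, sigma = 1), dist_student_t(df = 5))

# Store a bootstrap sample of county means as an empirical distribution
adair_temps <- toy_temp$recorded_temp[toy_temp$county_name == "Adair County"]
boot_means  <- replicate(1000, mean(sample(adair_temps, replace = TRUE)))
adair_dist  <- dist_sample(list(boot_means))

mean(adair_dist)  # behaves like any other distribution object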

What if we don’t use distributional?

  • Existing tidy data structures are not great for uncertain data
  • e.g. Vizumap
    • Makes Bivariate maps and Pixel (sample) maps
    • Package is designed specifically for uncertainty
  • Issues
    • ggplot2 flexibility is lost
      • e.g. you can only use one of three specific palettes
    • Very computationally expensive
      • A simple map can take over a minute to run
    • Need to make every component separately then combine

Code to make pixel map

Ideal pixelation map code

  • ggplot2 recognises the random-variable input and changes the visualisation accordingly
  • Again, touch as few ggplot settings as possible
# Pseudo code
ggplot(data) +
  geom_sf(aes(geometry = geometry,
              fill = random_variable))

Vizumap code

# load the package
library(Vizumap)
library(sf)
sf_use_s2(FALSE)

# Step 1: Format data using bespoke data formatting function
data <- read.uv(data = original_data, 
                estimate = "mean", 
                error = "standard_error")

# Step 2: Pixelate the shapefile
pixelation <- pixelate(geoData = geometry_data, 
                       id = "ID", 
                       # improved - set number of pixels
                       pixelSize = 100)


# Step 3: Build pixel map
pixel_map <- build_pmap(data = data, 
                         distribution = "normal", 
                         pixelGeo = pixelation, 
                         id = "ID", 
                         # You can only use a set palette
                         palette = "Oranges"
                         border = geometry_data)

# Step 4: Print pixel map
view(pixel_map)

Can we plot the pixel map now?

  • Have tidy format for distributions
  • Now need to extend ggplot2 to implement it

ggplot2 uses the grammar of graphics

It is designed to take in data

Not theoretical distributions

This is what ggdibbler is for

ggdibbler

The ggdibbler package

  • Named after the dibbler, a small Australian marsupial
    • Wanted it to sit next to ggdist alphabetically in the package list
      • ggdist visualises distributions (not signal suppression)
    • dibble = distributional tibble was an accident

ggdibbler Example

library(ggdibbler)
toy_temp_dist |> 
  ggplot() + 
  geom_sf_sample(aes(geometry = county_geometry,
                     fill=temp_dist))

Wow, look at that software go

ggplot(toy_temp_dist) +
  geom_sf_sample(aes(geometry=county_geometry, fill=temp_dist),  linewidth=0, n=7) +
  geom_sf(aes(geometry = county_geometry), fill=NA, linewidth=0.5, colour="white") +
  theme_minimal() +
  scale_fill_distiller(palette = "YlOrRd", direction= 1) +
  xlab("Longitude") +
  ylab("Latitude") +
  labs(fill = "Temperature") +
  ggtitle("A super cool and customised plot")

Wow, look at that software go x2

ggplot(toy_temp_dist) +
  geom_sf_sample(aes(geometry=county_geometry, fill=temp_dist),  linewidth=0, n=7) +
  geom_sf(aes(geometry = county_geometry), fill=NA, linewidth=0.5, colour="white") +
  theme_minimal() +
  scale_fill_distiller(palette = "YlOrRd", direction= 1) +
  xlab("Longitude") +
  ylab("Latitude") +
  labs(fill = "Temperature") +
  ggtitle("A super cool and customised plot")

ggdibbler Future Plans

  • Implement ggdibbler variations of other geom_*() functions
    • e.g. geom_point(), etc.
    • currently only has one function (I just came back from leave)
  • Might implement VSUP into the package
    • ggplot2 was not designed for accessing colour space directly
  • Integrate dibble object so that geom_sf() automatically does geom_sf_sample() if you pass a distribution in

ggdibbler Future Plans

If you care about software

If you care about theory

End