What is uncertainty visualisation?

Can graphics lie?

Citizen Scientists

  • The Bureau of Meteorology has been getting reports about weird temperatures in Iowa
  • Apparently, there is a strange spatial pattern in the data
  • We get some citizen scientists to measure data at their home and report back

NOT SO FAST CHUMP

  • Doing science is now illegal in Iowa
  • We need to maintain anonymity
  • Can only provide the county of each scientist

The citizen scientist data

toy_temp
  scientistID   county_name        recorded_temp
  #74991        Lyon County                 21.1
  #22780        Dubuque County              28.9
  #55325        Crawford County             26.4
  #46379        Allamakee County            27.1
  #84259        Jones County                34.2

990 citizen scientists participated

Visualisation goals

  • We need to look at the data and identify:
    • The spatial trend (does it exist or not)
    • The statistical strength of the spatial trend

Approach needs to work for…

…and…

…and…

Uncertainty visualisation should…


  1. Reinforce justified signals
    • We want my mum to trust the results
  2. Hide signals that are just noise
    • I don’t want to see something that isn’t there

Can also think of it as…

We could just plot the data…
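As a rough sketch of what “just plotting the data” could look like here, the code below plots every recorded temperature against its county, using the toy_temp columns from the table above; the plot actually shown on the slide may differ.

library(ggplot2)

# One point per citizen scientist, grouped by county
# This keeps the raw spread, small sample sizes and any outliers visible
# before anything is summarised away
ggplot(toy_temp, aes(x = recorded_temp, y = county_name)) +
  geom_point(alpha = 0.5) +
  labs(x = "Recorded temperature", y = "County")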

Benefits of plotting the original data

  • Plots of the original data can reveal its limitations
    • Can highlight small sample sizes, high variability, missing values, etc.
  • Understanding these limitations can prevent false conclusions

But the original data isn’t always…

  • Available
    • e.g. anonymised data, theoretical values, etc.
  • Relevant
    • Might need to use a sampling distribution rather than original data
  • Deterministic
    • e.g. bounded data, estimated values, etc.

Going back to the example

After we send in our data plot, the Bureau of Meteorology calls us and lets us know that we are only looking for a trend in the average values of the counties

  • Need to use the sampling distribution
    • We do not have data for the sampling distribution
    • Usually we do not bother, instead we do something like…

Estimate the county mean

# Calculate the county means
library(dplyr)

toy_temp |>
  group_by(county_name) |>
  summarise(temp_mean = mean(recorded_temp))
  county_name        temp_mean
  Adair County            29.7
  Adams County            29.6
  Allamakee County        26.3
  Appanoose County        22.8
  Audubon County          27.6

Visualise with a choropleth map
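A hedged sketch of what this choropleth could look like. It assumes an sf object of Iowa county polygons, called county_geometry here with a county_name column; that object is not defined on these slides. The county means are joined on and mapped to fill.

library(dplyr)
library(ggplot2)
library(sf)

county_means <- toy_temp |>
  group_by(county_name) |>
  summarise(temp_mean = mean(recorded_temp))

# county_geometry is an assumed sf object of county polygons
county_geometry |>
  left_join(county_means, by = "county_name") |>
  ggplot() +
  geom_sf(aes(fill = temp_mean)) +
  scale_fill_distiller(palette = "YlOrRd", direction = 1) +
  labs(fill = "Mean temperature")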

But what if the error is worse?

  • Citizens are using some pretty old tools
  • The standard error could be our estimate or up to three times our estimate.
  • The Bureau wants to see both cases

Spot the difference

How do we make an honest plot?

Solution: add an axis for uncertainty
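The slide itself presumably uses a proper bivariate (value by saturation) palette; as a rough stand-in using only standard ggplot2, the sketch below keeps fill for the county mean and maps the standard error to alpha, so high-uncertainty counties wash out. county_geometry is the same assumed sf object of county polygons as before.

library(dplyr)
library(ggplot2)

county_est <- toy_temp |>
  group_by(county_name) |>
  summarise(temp_mean = mean(recorded_temp),
            temp_se   = sd(recorded_temp) / sqrt(n()))

county_geometry |>
  left_join(county_est, by = "county_name") |>
  ggplot() +
  geom_sf(aes(fill = temp_mean, alpha = temp_se)) +
  scale_fill_distiller(palette = "YlOrRd", direction = 1) +
  # larger standard error -> more transparent (the second "axis")
  scale_alpha_continuous(range = c(1, 0.2)) +
  labs(fill = "Mean temperature", alpha = "Standard error")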

Does this work? Not really

  • Pro
    • Included uncertainty and increased transparency
  • Cons
    • High uncertainty signal still very visible so I am still getting scammed
    • 2D palette is harder to read
      • Colour is not a simple 3D space
      • Using saturation hurts accessibility

Aside: why doesn’t this work?

  • Uncertainty is not just another variable…
    • It presents an interesting perceptual problem
  • Usually do not want variables to interfere with each other
    • In uncertainty visualisation, the opposite is true
    • This is the core of the signal suppression approach we implement

Solution: blend the colours together!
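A minimal sketch of the blending idea: each county's colour is faded towards grey in proportion to its standard error. The threshold at which a colour becomes fully grey (blend_at below) is exactly the arbitrary choice criticised on the next slide, and county_geometry is again the assumed sf object of county polygons.

library(dplyr)
library(ggplot2)
library(scales)

# Fade a vector of colours towards grey: weight 0 keeps the colour,
# weight 1 gives plain grey
blend_to_grey <- function(cols, weight, grey = "grey80") {
  mapply(function(col, w) {
    mixed <- (1 - w) * col2rgb(col) + w * col2rgb(grey)
    rgb(mixed[1], mixed[2], mixed[3], maxColorValue = 255)
  }, cols, weight)
}

blend_at <- 2  # made-up standard error at which a colour is fully grey

county_est <- toy_temp |>
  group_by(county_name) |>
  summarise(temp_mean = mean(recorded_temp),
            temp_se   = sd(recorded_temp) / sqrt(n())) |>
  mutate(base_col  = col_numeric("YlOrRd", domain = NULL)(temp_mean),
         blend_col = blend_to_grey(base_col, pmin(temp_se / blend_at, 1)))

county_geometry |>
  left_join(county_est, by = "county_name") |>
  ggplot() +
  geom_sf(aes(fill = blend_col)) +
  scale_fill_identity()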

Does this work? Kind of…

  • Pros
    • Included uncertainty and increased transparency
    • No false signals
  • Cons
    • Still have 2D Colour palette
    • Standard error at which to blend colours is made up
      • Blend at 1? 2? 4? 37?
      • Impossible to align with hypothesis testing

Solution: simulate a sample
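A minimal sketch of the simulation step, treating each county mean as normal with the estimated mean and standard error (object names other than toy_temp's columns are made up here). Turning these draws into the sub-county “pixels” shown on the slide is the part that gets painful, which is where the next slides pick up.

library(dplyr)
library(tidyr)

n_draws <- 7  # simulated values per county

county_draws <- toy_temp |>
  group_by(county_name) |>
  summarise(temp_mean = mean(recorded_temp),
            temp_se   = sd(recorded_temp) / sqrt(n())) |>
  rowwise() |>
  mutate(simulated_temp = list(rnorm(n_draws, temp_mean, temp_se))) |>
  ungroup() |>
  unnest(simulated_temp)

# Each county now has n_draws plausible means; a pixel map then assigns
# each draw to one small polygon inside that county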

Does this work? Almost!

  • Pros
    • Included uncertainty
    • High uncertainty interferes with reading of plot (?)
    • 1D colour palette
  • Cons
    • Nightmare to make

Why are the plots hard to make?

Storing uncertain data

  • The main issues stem from the difficulty of storing uncertain data
  • Best to represent uncertain data as a distribution
    • That distribution should take up one cell
    • Set up a vectorised distribution object?

Thank god for distributional

library(distributional)

toy_temp_est <- toy_temp |>
  group_by(county_name) |>
  summarise(temp_dist = dist_normal(mu = mean(recorded_temp),
                                    sigma = sd(recorded_temp) / sqrt(n())))
  county_name        temp_dist
  Adair County       N(30, 0.82)
  Adams County       N(30, 1)
  Allamakee County   N(26, 0.3)
  Appanoose County   N(23, 0.69)
  Audubon County     N(28, 0.8)

What does it do?

  • distributional lets you store distributions in a tibble as distribution objects
class(toy_temp_est$temp_dist)
[1] "distribution" "vctrs_vctr"   "list"        
  • Can also mix different distributions together (they don’t all need to be the same family)
  • Can use a sample rather than a theoretical distribution
    • e.g. bootstrapping (see the sketch below)
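A small sketch of those last two points. dist_normal(), dist_student_t(), and dist_sample() are real distributional functions; the bootstrap example and the object names around them are made up for illustration.

library(distributional)

# Different families can sit in the same distribution vector
mixed <- c(dist_normal(mu = 30, sigma = 1), dist_student_t(df = 5))

# Store a bootstrap sample of county means as an empirical distribution
adair_temps <- toy_temp$recorded_temp[toy_temp$county_name == "Adair County"]
boot_means  <- replicate(1000, mean(sample(adair_temps, replace = TRUE)))
adair_dist  <- dist_sample(list(boot_means))

mean(adair_dist)  # behaves like any other distribution object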

What if we don’t use distributional?

  • Existing tidy data structures are not great for uncertain data
  • e.g. Vizumap
    • Makes Bivariate maps and Pixel (sample) maps
    • Package is designed specifically for uncertainty
  • Issues
    • ggplot2 flexibility is lost
      • e.g. you can only use one of three specific palettes
    • Very computationally expensive
      • A simple map can take over a minute to run
    • Need to make every component separately then combine

Code to make pixel map

Ideal pixelation map code

  • ggplot2 recognises the random-variable input and changes the visualisation accordingly
  • Again, touch as few ggplot settings as possible
# Pseudo code
ggplot(data) +
  geom_sf(aes(geometry = geometry,
              fill = random_variable))

Vizumap code

# load the package
library(Vizumap)
library(sf)
sf_use_s2(FALSE)

# Step 1: Format data using bespoke data formatting function
data <- read.uv(data = original_data, 
                estimate = "mean", 
                error = "standard_error")

# Step 2: Pixelate the shapefile
pixelation <- pixelate(geoData = geometry_data, 
                       id = "ID", 
                       # improved - set number of pixels
                       pixelSize = 100)


# Step 3: Build pixel map
pixel_map <- build_pmap(data = data, 
                         distribution = "normal", 
                         pixelGeo = pixelation, 
                         id = "ID", 
                         # You can only use a set palette
                         palette = "Oranges"
                         border = geometry_data)

# Step 4: Print pixel map
view(pixel_map)

Can we plot the pixel map now?

  • Have tidy format for distributions
  • Now need to extend ggplot2 to implement it

ggplot2 uses the grammar of graphics

It is designed to take in data

Not theoretical distributions

This is what ggdibbler is for

ggdibbler

The ggdibbler package

  • Named after the dibbler, a small Australian marsupial
    • Wanted it to sit next to ggdist alphabetically in the package list
      • ggdist visualises distributions (not signal suppression)
    • dibble = distributional tibble was an accident

ggdibbler Example

library(ggdibbler)
toy_temp_dist |> 
  ggplot() + 
  geom_sf_sample(aes(geometry = county_geometry,
                     fill=temp_dist))

Wow, look at that software go

ggplot(toy_temp_dist) +
  geom_sf_sample(aes(geometry=county_geometry, fill=temp_dist),  linewidth=0, n=7) +
  geom_sf(aes(geometry = county_geometry), fill=NA, linewidth=0.5, colour="white") +
  theme_minimal() +
  scale_fill_distiller(palette = "YlOrRd", direction= 1) +
  xlab("Longitude") +
  ylab("Latitude") +
  labs(fill = "Temperature") +
  ggtitle("A super cool and customised plot")

Wow, look at that software go x2

ggplot(toy_temp_dist) +
  geom_sf_sample(aes(geometry=county_geometry, fill=temp_dist),  linewidth=0, n=7) +
  geom_sf(aes(geometry = county_geometry), fill=NA, linewidth=0.5, colour="white") +
  theme_minimal() +
  scale_fill_distiller(palette = "YlOrRd", direction= 1) +
  xlab("Longitude") +
  ylab("Latitude") +
  labs(fill = "Temperature") +
  ggtitle("A super cool and customised plot")

ggdibbler Future Plans

  • Implement ggdibbler variations of other geom_*() functions
    • e.g. geom_point(), etc.
    • currently only has one function (I just came back from leave)
  • Might implement VSUP into the package
    • ggplot2 was not designed for accessing colour space directly
  • Integrate dibble object so that geom_sf() automatically does geom_sf_sample() if you pass a distribution in

ggdibbler Future Plans

If you care about software

If you care about theory

End