Walktober

  • Our department had a walkathon in October where we all competed to see how many steps we could walk each day

Data quality angel

Before the competition started, I searched for the most accurate and cost-effective pedometer

Pedometer = bad

  • It turns out, pedometers are wildly inaccurate

Data quality demons

  • It turns out I was the ONLY person concerned about data quality
  • We conducted a survey after Walktober to see if we could quantify the pedometer error
  • Turns out the measurement error was the least of my worries

Survey response

Quantifiable vs unquantifiable uncertainty

  • I can incorporate pedometer error estimates into our analysis, but I CANNOT work with completely falsified data
  • This is the difference between quantifiable and unquantifiable uncertainty
  • We are going to try and quantify the uncertainty that we can quantify
    • “Anything worth doing is worth doing poorly” - G. K. Chesterton

Not an uncommon scenario

Often our data is…

  • Unavailable,
    • e.g. anonymised data, measurement error, etc.
  • Non-deterministic
    • e.g. bounded data, estimated values, etc.
  • or Theoretical
    • e.g. estimates based on theory, latent variables, etc.

How many statisticians does it take to visualise a random variable?

  • Even though we usually work with random variables, we are unable to visualise them effectively
  • Our choice of error distribution might change the conclusion of our analysis in unexpected ways
  • Often our solution is to just ignore the inherent uncertainty in our data

The visualisation challenge

  • Our department decides to do a visualisation challenge of the walktober data

I don’t want to ignore it

  • But I did all that reading about pedometers, so I would like to incorporate that uncertainty

Spot the difference

  • Maps of temperature in Iowa counties
  • I chose two error distributions, can you spot the difference?

How do we include the uncertainty?

Exceedance probability map

  • If you care about the uncertainty, visualise the uncertainty
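
As a rough sketch of what sits behind an exceedance probability map: for each area we take an estimate and its standard error and compute the probability that the true value is above some threshold. The county values, the threshold, and the normal error model below are all made up for illustration; they are not the Iowa data from the slide.

```r
# Hypothetical area-level estimates: a mean and a standard error per county.
estimates <- data.frame(
  county = c("A", "B", "C"),
  mean   = c(21, 25, 25),
  se     = c(1, 1, 6)
)

threshold <- 23  # illustrative cut-off, not taken from the talk

# Exceedance probability P(X > threshold), assuming a normal error model.
estimates$p_exceed <- pnorm(
  threshold,
  mean = estimates$mean,
  sd = estimates$se,
  lower.tail = FALSE
)

print(estimates)
```

Counties B and C share the same estimate, but C's larger standard error pulls its exceedance probability towards 0.5. The map is now a map of the probabilities, not of the original estimates.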

A terrible vet

Uncertainty as signal vs noise

  • Uncertainty can play two roles in an analysis
    • Sometimes it is used to hedge or dampen our conclusions on other statistics
    • Sometimes it is a statistic of inference itself
  • A visualisation is a statistic, which means that, just like other statistics, we use it to draw inference
    • If we want to draw inference on uncertainty: visualise uncertainty as signal
    • If it is supposed to hedge our inference from the plot: it is noise
  • An exceedance probability map is fine if we want to draw inference on our uncertainty, but not fine if we were trying to hedge the original plot

Solution: add an axis for uncertainty

  • 2D palette is harder to read
  • Says: “We have a wave pattern, but it is uncertain”

I keep getting scammed

Why doesn’t this work?

  • Uncertainty is not just another variable…
    • It presents an interesting perceptual problem
  • Usually we do not want variables to interfere with each other
    • In uncertainty visualisation, the opposite is true

Uncertainty visualisation for signal suppression

  • Statistical validity translates to perceptual ease
    • The higher the variance on an estimate, the harder that estimate is to extract from the plot

Solution: blend the colours together!

  • Made signal harder to see… but maybe too hard?
  • Still have 2D colour palette
  • Standard error at which to blend colours is made up
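
A minimal sketch of the blending idea, assuming a linear blend towards a neutral grey. The function `blend_colour()` and its `se_max` argument are hypothetical names for illustration, not part of any package mentioned in this talk. Note that `se_max`, the standard error at which the colour is fully washed out, has to be picked by hand, which is exactly the "made up" problem above.

```r
# Blend each estimate's colour towards a neutral grey as its standard error grows.
# `se_max` is an arbitrary choice: at or above it, the colour is fully neutral.
blend_colour <- function(colour, se, se_max, neutral = "grey80") {
  w <- pmin(se / se_max, 1)           # 0 = keep the colour, 1 = fully neutral
  rgb_est <- col2rgb(colour)          # 3 x n matrix of red, green, blue
  rgb_neu <- col2rgb(neutral)[, 1]    # length-3 vector for the neutral colour
  mixed <- t(rgb_est) * (1 - w) + outer(w, rgb_neu)
  rgb(mixed[, 1], mixed[, 2], mixed[, 3], maxColorValue = 255)
}

# The same estimate colour shown at three levels of uncertainty.
blended <- blend_colour(rep("red", 3), se = c(0, 2.5, 5), se_max = 5)
print(blended)
```

Changing `se_max` changes how quickly the signal disappears, so two analysts with the same data could produce quite different-looking maps.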

Free yourself from the two variable approach

  • Realistically, we are trying to add information back in that we just shouldn’t have dropped
  • We need a more holistic approach that doesn’t allow us to pick and choose when and how we include uncertainty
  • Uncertainty visualisation doesn’t have units of data; it has units of “random variables”, so we should directly input random variables

Vectorise random variables with distributional

steps_dist                  team     name
N(23679, 4633687)[0,Inf]    iwalk()  A
N(18322, 2774223)[0,Inf]    iwalk()  A
N(24562, 5e+06)[0,Inf]      iwalk()  A
N(26128, 5642050)[0,Inf]    iwalk()  A
N(10238, 866202)[0,Inf]     iwalk()  A
N(16568, 2268638)[0,Inf]    iwalk()  A
N(12270, 1244340)[0,Inf]    iwalk()  A
N(17226, 2452356)[0,Inf]    iwalk()  A

  • It turns out you can input random variables directly.
  • These columns are made using distributional
  • They are truncated normal random variables
  • This is some of Mitch’s software; I am not going to explain it because Mitch is going to talk about it immediately after me
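
For the curious, here is a hedged sketch of how a column like `steps_dist` could be built with distributional. It assumes the second number printed in `N(23679, 4633687)` is a variance, so the standard deviation passed to `dist_normal()` is its square root. The three rows are copied from the table above; this is not necessarily how the Walktober column was actually constructed.

```r
library(distributional)

# Means and variances read off the first, second, and fifth rows of the table.
step_mean <- c(23679, 18322, 10238)
step_var  <- c(4633687, 2774223, 866202)

# A vector of normal random variables, truncated below at zero
# (you cannot walk a negative number of steps).
steps_dist <- dist_truncated(
  dist_normal(mu = step_mean, sigma = sqrt(step_var)),
  lower = 0
)
print(steps_dist)

# Each element is a random variable, so we can draw values from it.
set.seed(2024)
draws <- generate(steps_dist, times = 5)
print(draws)
```

The point is that each cell of the column holds a whole distribution, not a single number, so nothing about the uncertainty has been thrown away before plotting.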

Solution: simulate a sample

  • Made using Vizumap’s pixelmap function
  • Gives the best overall understanding of our random variables
  • Not actually making any top level decisions, just letting the variance from the random variables carry through to the visual system
  • The signal seems harder to read
  • 1D colour palette
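
The sampling idea can be sketched in base R without any mapping package. Each area is split into pixels and every pixel gets its own draw from that area's random variable, so a high-variance area ends up visibly noisy. The two areas, their parameters, and the helper `r_trunc_norm()` are invented for illustration; this is not Vizumap's implementation.

```r
set.seed(1)

# Inverse-CDF draw from a normal distribution truncated below at `lower`.
r_trunc_norm <- function(n, mean, sd, lower = 0) {
  p_low <- pnorm(lower, mean, sd)
  u <- runif(n, min = p_low, max = 1)
  qnorm(u, mean, sd)
}

# Two hypothetical areas with the same mean but very different variance.
areas <- data.frame(
  area = c("low variance", "high variance"),
  mean = c(17000, 17000),
  sd   = c(500, 6000)
)
n_pixels <- 100  # pixels per area

# One independent draw per pixel, stacked into a long data frame.
pixels <- do.call(rbind, lapply(seq_len(nrow(areas)), function(i) {
  data.frame(
    area  = areas$area[i],
    pixel = seq_len(n_pixels),
    steps = r_trunc_norm(n_pixels, areas$mean[i], areas$sd[i])
  )
}))

# The spread of pixel values within each area is what the eye reads as noise.
pixel_spread <- aggregate(steps ~ area, data = pixels, FUN = sd)
print(pixel_spread)
```

Mapping `steps` to a single fill scale would then give a 1D colour palette in which the high-variance area looks speckled and the low-variance area looks almost flat.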

But let’s take this one step further…

Universal application in ggdibbler

  • ggdibbler applies this concept to every plot and every aesthetic

Contour plots

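library(ggplot2)
library(ggdibbler)

# Standard contour plot of a deterministic density (faithfuld ships with ggplot2)
ggplot(faithfuld, aes(waiting, eruptions, z = density)) +
  ggtitle("ggplot2") +
  geom_contour() +
  theme(aspect.ratio = 1)

# Same plot with a distribution-valued z (uncertain_faithfuld is assumed to
# be the example data shipped with ggdibbler); contours are drawn from samples
ggplot(uncertain_faithfuld, aes(waiting, eruptions, z = density2)) +
  ggtitle("ggdibbler") +
  geom_contour_sample(alpha = 0.2) +
  theme(aspect.ratio = 1)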

Text plot

Spatial pixel map

Bar charts

Raster plots

ggdibbler also ensures your plots have nice statistical properties

  • Statistical properties are what differentiate us from the animals

Visual Continuous mapping theorem

Example in geom_tile

ggdibbler guarantees these properties

  • Not the default in ggplot2, you need nested positions

Back to walktober example

Have a go yourself

Future Plans

  • Future of the software
    • multivariate distributions and other more complex joint distributions
    • build out the nested position system
    • expand on the scales to accept more object types
  • Unemployment
    • I also need a job (I am holding my software hostage)
    • If you want to give me a job, my email is harriet.m.mason@gmail.com

Acknowledgements

  • My Supervisors: Di Cook, Susan Vanderplas, and Sarah Goodwin
  • AEMO Zema Energy Scholarship
  • Australian RTP Stipend
  • Numbat Hackathon (for the walktober data)
  • Mitch O’Hara-Wild and Cynthia Huang