Prologue: Bayesian reconstruction of Jouni Helske, 2010-2019

  • 2010-2015: PhD in Statistics at JYU
    • State space models (time series),
      prediction problems, statistical software

2016-2017: First postdoc at JYU

  • Bayesian statistics, MCMC, sequential Monte Carlo
  • Visited Oxford’s Department of Statistics for a year
  • In 2017, moved to Sweden and visited Uppsala University

2017-2019: Postdoc at Linköping University

  • Postdoc in information visualization
  • First on the H2020 project Safeclouds, then free rein to choose own topics

2019->: Second postdoc at JYU

  • Causal inference, Bayesian models, missing data, MCMC, SMC, HMC, PMCMC, HMM, SSM, (add your own acronym!), … ?

And now…

Bayesian reconstruction of historical population in Finland 1647-1850

By Jouni Helske, based on the paper of the same name by Voutilainen, Helske, and Högmander (2019), conditionally accepted to Demography

Historical population estimates

  • Interesting for history, demography, and economic research
    • Important contextual / macro-economic variables
    • Also interesting in their own right
  • Only a few countries have annual population series available before the 19th century
    • No proper censuses, estimates based on tax or church records
  • Our aim was to estimate Finnish population from 1647 to 1850
    • Use incomplete parish records and Bayesian methods
    • Incorporate prior knowledge, report uncertainties of the estimates

Problems with current population estimates

  • Based on fiscal and/or parish records of poor quality with ad-hoc correction procedures
  • Only point estimates
    • arguing over differences of a few thousand in the estimates probably does not make much sense
  • Difficult to assess the validity of the results
    • no ground truth, only a single data set

Perfect data: Problem solved

Given the initial population estimate and full birth and death records:

\[ \mu_t = \mu_{t-1} + B_t - D_t, \quad t=1648, \ldots, 1850, \]

where

  • \(\mu_t\) is the population

  • \(B_t = \sum_{i=1}^{197}b_{t,i}\) is the total number of births

  • \(D_t= \sum_{i=1}^{197}d_{t,i}\) is the total number of deaths at year \(t\).

  • But we don’t know \(\mu_{1647}\) nor yearly births and deaths…

Our aim

  • Build a hierarchical model using
    • Parish level data on baptisms and burials (197 parishes)
    • External estimates of military casualties (taken as given)
    • The few available censuses (22) from 1749 onward
    • Prior knowledge (population level in the mid-1600s, characteristics of the parish records, …)

Data quality

Three types of missingness:

  • Baptisms do not equal births and burials do not equal deaths
  • Records are not available for all parishes for every year
  • Not all burials and baptisms are registered

Census data \(C_t\) is likely not accurate either, but we’ll assume that

\[ C_t \sim N(\mu_t, \sigma_C^2), \quad t \in \mathcal{T}, \]

  • \(\sigma_C \sim Gamma(5, 0.005)\)
  • \(\mathcal{T} = \{1749, 1751, 1754, 1757, \ldots, 1775, 1780, 1800, 1805, \ldots, 1850\}\)
  • This leads to an increase in the relative accuracy of the censuses over time as the population grows (sketched below)
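
As an illustration, a minimal Stan-style sketch of this census likelihood could look as follows; the data names and the simple placeholder prior on \(\mu\) are chosen here for exposition and are not the actual model code:

    data {
      int<lower=1> n_years;                               // years 1647, ..., 1850
      int<lower=1> n_cens;                                // 22 census years
      array[n_cens] int<lower=1, upper=n_years> cens_idx; // positions of the census years
      vector<lower=0>[n_cens] C;                          // census counts
    }
    parameters {
      vector<lower=0>[n_years] mu;                        // latent population
      real<lower=0> sigma_C;                              // census measurement sd
    }
    model {
      sigma_C ~ gamma(5, 0.005);                          // prior mean 1000
      mu ~ gamma(215, 0.0005);                            // placeholder; the real model uses the population process
      C ~ normal(mu[cens_idx], sigma_C);                  // constant sd, so relative accuracy grows with mu
    }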

Census data

Parish data

Population level

\[ \begin{aligned} \mu_{1647} &\sim Gamma(215, 0.0005),\\ \mu_t &\sim Gamma(\psi_{\mu} (\mu_{t-1} + \beta_t - \delta_t - s_t), \psi_{\mu}), \quad t \neq 1697 \end{aligned} \]

  • \(\beta_t\) and \(\delta_t\) are annual birth and death estimates, \(s_t\) are the military casualties at year \(t\)

  • The prior for the first year translates to a prior mean of 430,000 and a prior sd of 30,000

  • Special treatment of year 1697 is needed due to the Great Famine\(^\ast\)

  • The conditional mean of \(\mu_t\) is \(\mu_{t-1} + \beta_t - \delta_t - s_t\), and its conditional sd is \(\sqrt{(\mu_{t-1} + \beta_t - \delta_t - s_t) / \psi_{\mu}}\)
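
A hedged Stan-style sketch of this population process, treating \(\beta_t\), \(\delta_t\), and \(s_t\) as data for illustration, skipping the special year 1697, and omitting the hyperprior on \(\psi_{\mu}\):

    data {
      int<lower=2> n_years;
      vector<lower=0>[n_years] beta_t;   // annual birth estimates (latent in the real model)
      vector<lower=0>[n_years] delta_t;  // annual death estimates (latent in the real model)
      vector<lower=0>[n_years] s;        // military casualties, taken as given
    }
    parameters {
      vector<lower=0>[n_years] mu;       // population for 1647, ..., 1850
      real<lower=0> psi_mu;
    }
    model {
      mu[1] ~ gamma(215, 0.0005);        // prior mean 430,000, prior sd about 30,000
      for (t in 2:n_years)
        // conditional mean mu[t-1] + beta_t[t] - delta_t[t] - s[t], sd = sqrt(mean / psi_mu)
        mu[t] ~ gamma(psi_mu * (mu[t - 1] + beta_t[t] - delta_t[t] - s[t]), psi_mu);
    }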

Parish level

We consider the 177 parishes (out of 197) that have at least 10 years of baptism and burial records

  • Denote parishes with recorded baptisms at year \(t\) as \(\Omega_t^b\)
  • Define baptism process \(\beta_t\)

\[ \begin{aligned} \beta_t &= \frac{1}{\lambda_t^b}\left( \sum_{i \in \Omega_t^b} b_{t,i} + \sum_{j \notin \Omega_t^b} \hat{b}_{t,j}\right), \end{aligned} \]

  • The first term in parentheses corresponds to the observed baptisms
  • The latter term corresponds to estimates for completely missing records at year \(t\) (see the sketch after this list)
  • Similar construction for burials with \(\Omega_t^d\) and \(\delta_t\)
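
A sketch of how this aggregation could be written in Stan, assuming a rectangular baptism matrix with a 0/1 observation indicator; the names and data layout are illustrative, not the actual implementation:

    data {
      int<lower=1> n_years;
      int<lower=1> n_par;                             // 177 parishes
      matrix<lower=0>[n_years, n_par] b;              // recorded baptisms, 0 where missing
      matrix<lower=0, upper=1>[n_years, n_par] obs;   // 1 if a record exists for (t, i)
    }
    parameters {
      matrix<lower=0>[n_years, n_par] b_hat;          // imputed baptisms for missing records
      vector<lower=0, upper=1>[n_years] lambda_b;     // correction coefficients
    }
    transformed parameters {
      vector[n_years] beta_t;                         // implied total births
      for (t in 1:n_years)
        beta_t[t] = (dot_product(obs[t], b[t]) + dot_product(1 - obs[t], b_hat[t])) / lambda_b[t];
    }
    model {
      // priors for b_hat and lambda_b omitted here; see the parish-process and lambda slides
    }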

Correction coefficients

  • The coefficients \(\lambda_t^b\) and \(\lambda_t^d\) are the proportions of recorded baptisms out of all births (and similarly of burials out of all deaths)
  • Try to capture
    • partial missingness
    • omission of 20 parishes
    • discrepancy between baptisms and births (burials and deaths)
  • \(0 < \lambda_t < 1\); the coefficients increase over time as partial missingness decreases

Assumptions for \(\lambda\)s

  • Based on prior knowledge:
    • \(\lambda^d_{1648} < \lambda^b_{1648}\)
    • \(\lambda^d_{1850} = \lambda^b_{1850} = \lambda_{1850}\)
  • \(\lambda_{1850} \sim Beta\) with 99% prior interval of \((0.55, 1)\) (mean 0.9) \[ \lambda^d_{1648}/\lambda_{1850} = a_1, \quad \lambda^b_{1648}/\lambda_{1850} = a_1 + a_2, \]
  • with \((a_1,a_2,a_3) \sim Dirichlet(\frac{1}{2}, \frac{1}{4}, \frac{1}{4})\) leading to 99% prior intervals \((0.2, 0.8)\) and \((0.5, 1)\) for the proportions respectively.
  • Assume smooth increase of \(\lambda\)s over time

\(\lambda\)s as logistic curves

  • Define the \(\lambda\)s as scaled logistic curves:

\[ \begin{aligned} \tilde{\lambda}^j_t &= \left[1 + \exp(-r_j (t - m_j)) \right]^{-1}, \\ \lambda_t^j &= \frac{\tilde{\lambda}^j_t - \tilde{\lambda}^j_{1648}}{\tilde{\lambda}^j_{1850} - \tilde{\lambda}^j_{1648}}(\lambda_{1850} - \lambda^j_{1648}) + \lambda^j_{1648}, \quad j = b, d, \end{aligned} \] with midpoint \(1648 < m_j < 1830\) and growth rate \(0 < r_j\), \(j = b, d\).
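
A hedged Stan sketch of this construction; the Beta shape parameters for \(\lambda_{1850}\) are placeholders that only match the stated prior mean of 0.9, and priors for the midpoints and growth rates are omitted:

    data {
      int<lower=1> n_years;                    // years 1648, ..., 1850
    }
    parameters {
      real<lower=0, upper=1> lambda_1850;      // common endpoint for births and deaths
      simplex[3] a;                            // (a1, a2, a3)
      vector<lower=1648, upper=1830>[2] m;     // logistic midpoints, j = b, d
      vector<lower=0>[2] r;                    // logistic growth rates
    }
    transformed parameters {
      // lambda^b_1648 = (a1 + a2) * lambda_1850 and lambda^d_1648 = a1 * lambda_1850
      vector[2] lambda_1648 = [a[1] + a[2], a[1]]' * lambda_1850;
      matrix[n_years, 2] lambda;
      for (j in 1:2) {
        real l0 = inv_logit(r[j] * (1648 - m[j]));
        real l1 = inv_logit(r[j] * (1850 - m[j]));
        for (t in 1:n_years) {
          real lt = inv_logit(r[j] * (1647 + t - m[j]));
          lambda[t, j] = (lt - l0) / (l1 - l0) * (lambda_1850 - lambda_1648[j]) + lambda_1648[j];
        }
      }
    }
    model {
      lambda_1850 ~ beta(20, 2.2);             // placeholder shapes, prior mean 0.9
      a ~ dirichlet([0.5, 0.25, 0.25]');
    }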

Parish processes

\[ \begin{aligned} b_{t,i} &\sim Gamma(\psi_{bd}\exp(\nu^b_{t,i}), \psi_{bd}),\\ \hat{b}_{t,j} &\sim Gamma(\psi_{bd}\exp(\nu^b_{t,j}), \psi_{bd}), \\ \nu^b_{t,i} &\sim N(\nu^b_{t-1,i} + \eta_t^b, \sigma^2_{b,\nu}),\\ \eta^b_t &\sim N(\eta^b_{t-1}, \sigma^2_{b,\eta}), \end{aligned} \] and similarly for burials, except for year 1697\(^\ast\)
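
A Stan-style sketch of the baptism part of this hierarchy; initial-state priors and hyperpriors are placeholders, and partial missingness is ignored here for brevity:

    data {
      int<lower=2> n_years;
      int<lower=1> n_par;
      matrix<lower=0>[n_years, n_par] b;        // observed baptisms (assumed positive here)
    }
    parameters {
      matrix[n_years, n_par] nu_b;              // parish log-intensities: random walk with drift
      vector[n_years] eta_b;                    // common drift, itself a random walk
      real<lower=0> sigma_b_nu;
      real<lower=0> sigma_b_eta;
      real<lower=0> psi_bd;
    }
    model {
      eta_b[1] ~ normal(0, 1);                  // placeholder initial prior
      eta_b[2:n_years] ~ normal(eta_b[1:(n_years - 1)], sigma_b_eta);
      for (i in 1:n_par) {
        nu_b[1, i] ~ normal(0, 1);              // placeholder initial prior
        for (t in 2:n_years)
          nu_b[t, i] ~ normal(nu_b[t - 1, i] + eta_b[t], sigma_b_nu);
        b[, i] ~ gamma(psi_bd * exp(nu_b[, i]), psi_bd);   // Gamma observation model
      }
    }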

\(^\ast\) Handling Great Famine

  • The exceptionally cold period in 1695–1697 resulted in poor yields and a subsequent famine
  • Based on the literature, it is estimated that 15-30% of the population died during this period
  • Such a drastic drop needs special attention
  • We define \(\pi = \frac{\mu_{1697}}{\mu_{1695}} \sim Beta\) with 99% prior interval of (0.7, 0.85), and \[ \begin{aligned} \mu_{1697} &\sim Gamma(\psi_{\mu} \pi \mu_{1695}, \psi_{\mu}),\\ \hat{d}_{t,i} &\sim Gamma(\psi_{bd}\exp(\phi_{\delta})\exp(\nu^d_{t,i}), \psi_{bd}), \\ \phi_{\delta} &\sim N(2, 0.25^2) \end{aligned} \]
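
A small self-contained Stan sketch of this adjustment; the Beta shape parameters are chosen here only to roughly match the stated 99% interval, and \(\mu_{1695}\) and \(\psi_{\mu}\) are passed as data purely for illustration:

    data {
      real<lower=0> mu_1695;              // population in 1695 (latent in the full model)
      real<lower=0> psi_mu;               // precision hyperparameter (estimated in the full model)
    }
    parameters {
      real<lower=0, upper=1> pi_famine;   // pi = mu_1697 / mu_1695
      real<lower=0> mu_1697;
      real phi_delta;                     // extra log-scale mortality for imputed famine-year burials
    }
    model {
      pi_famine ~ beta(159, 46);          // roughly a 99% prior interval of (0.7, 0.85)
      phi_delta ~ normal(2, 0.25);
      mu_1697 ~ gamma(psi_mu * pi_famine * mu_1695, psi_mu);
      // imputed famine-year burials then use psi_bd * exp(phi_delta) * exp(nu_d[t, i]) as the Gamma shape
    }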

Estimation alternatives

  • Total number of unknown parameters: 72489 (18 hyperparameters and 72471 latent variables)
  • Need Markov chain Monte Carlo (MCMC) for fully Bayesian inference
  • Gibbs sampling or naive Metropolis-type MCMC won’t work due to the large number of (correlated) parameters
  • State-of-the-art alternatives:
    • Hamiltonian Monte Carlo using the probabilistic programming language Stan
    • Particle marginal Metropolis–Hastings (PMCMC) using an efficient particle filter (PF)

Pros and cons of Stan

  • No need to code up your own MCMC stuff
  • Simple way to write your model (BUGS-like syntax)
  • Claimed to be very efficient for “arbitrary” problems with a large number of parameters
    • in practice, in order to be efficient you need to reparametrize and scale your variables
      • for efficient adaptation of the sampler, variables should be uncorrelated and on unit scale (in terms of posterior variance)
  • Many convergence diagnostics
    • The theoretical aspects of many diagnostics are unclear: when should you be worried, and when can you ignore the warnings?
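
As an example of what “reparametrize” means here in practice, a non-centered parametrization of a random walk such as \(\nu^b_{t,i}\) could look as follows (illustrative names, hyperpriors omitted); the sampler then explores standard-normal innovations instead of the strongly correlated states themselves:

    data {
      int<lower=2> n_years;
      int<lower=1> n_par;
    }
    parameters {
      matrix[n_years, n_par] z;            // standard-normal innovations
      vector[n_years] eta_b;               // common drift
      real<lower=0> sigma_b_nu;
    }
    transformed parameters {
      matrix[n_years, n_par] nu_b;
      for (i in 1:n_par) {
        nu_b[1, i] = z[1, i];              // placeholder scale for the initial state
        for (t in 2:n_years)
          nu_b[t, i] = nu_b[t - 1, i] + eta_b[t] + sigma_b_nu * z[t, i];
      }
    }
    model {
      to_vector(z) ~ normal(0, 1);         // uncorrelated, unit-scale sampling space
      // hyperpriors for eta_b and sigma_b_nu omitted in this sketch
    }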

Pros and cons of PMCMC

  • Stan is “dumb”, as it does not differentiate between hyperparameters and latent variables
  • PMCMC handles these separately, by marginalizing over the latent variables using PF
    • using e.g. random walk Metropolis for the 18 hyperparameters is easy
  • But the large number of missing observations makes the bootstrap PF infeasible (it needs too many particles)
  • Twisted PFs with proposals based on an approximating Gaussian model?
    • initial tests showed that the Gaussian approximation of the Gamma model is not stable with respect to the hyperparameters (needs too many particles)

Estimation

  • We started prototyping the model formulation with Stan
  • The initial belief was that the modelling would be done quickly, so there was no need to bother with implementing a custom PMCMC
  • In the end, we stuck with Stan despite some hiccups
  • Total estimation time of the final model was around 110h on Amazon AWS (16 parallel chains of length 15,000)
    • One run of the twisted PF for the parish level took ~30 s in Julia
      • 100 particles, not really enough without some tricks
      • 300+ days for 15,000 iterations without parallelisation
      • not fully optimized code/algorithm, but not an encouraging starting point…

Results: \(\lambda\)-coefficients

Results: yearly differences

Results: Birth and death estimates

Results: Population estimates

Results: Posterior predictive check

  • (using posterior samples of hyperparameters, \(\beta_t\), and \(\delta_t\))
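
One way such a check could be implemented is as a standalone generated-quantities program that re-simulates the population path from posterior draws of \(\psi_{\mu}\), \(\mu_{1647}\), \(\beta_t\), and \(\delta_t\); the names and data layout below are illustrative, and the program would be run once per posterior draw with the fixed-parameter sampler:

    data {
      int<lower=2> n_years;
      real<lower=0> psi_mu;                // one posterior draw of the precision hyperparameter
      real<lower=0> mu_init;               // one posterior draw of mu_1647
      vector<lower=0>[n_years] beta_t;     // one posterior draw of the birth estimates
      vector<lower=0>[n_years] delta_t;    // one posterior draw of the death estimates
      vector<lower=0>[n_years] s;          // military casualties
    }
    model {
    }
    generated quantities {
      vector[n_years] mu_rep;              // re-simulated population path
      mu_rep[1] = mu_init;
      for (t in 2:n_years)
        mu_rep[t] = gamma_rng(psi_mu * (mu_rep[t - 1] + beta_t[t] - delta_t[t] - s[t]), psi_mu);
    }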

Thank you!

Some references:

  • Paper: Voutilainen M, Helske J, Högmander H (2019). A Bayesian reconstruction of historical population in Finland 1647-1850. Conditionally accepted to Demography.
  • Twisted PF etc.: Vihola M, Helske J, Franks J (2018). Importance sampling type estimators based on approximate marginal MCMC. arXiv preprint.
  • Stan: Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, Brubaker M, Guo J, Li P, Riddell A (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1).
  • PMCMC: Andrieu C, Doucet A, Holenstein R (2010). Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72: 269-342.