Bayesian reconstruction of historical population in Finland 1647-1850

Jouni Helske (with Miikka Voutilainen and Harri Högmander)

Prologue: Bayesian reconstruction of Jouni Helske in 2010-2019

  • 2010-2015 PhD in Statistics at JYU
    • State space models (time series),
      prediction problems, statistical software

2016-2017: First postdoc at JYU

  • Bayesian statistics, MCMC, sequential Monte Carlo
  • Visited Oxford’s Department of Statistics for a year
  • 2017 moved to Sweden, visited Uppsala University

2017-2019: Postdoc at Linköping University

  • Postdoc in information visualization
  • First at H2020 project Safeclouds, then free hands

2019->: Second postdoc at JYU

  • Causal inference, Bayesian models, missing data, MCMC, SMC, HMC, PMCMC, HMM, SSM, (add your own acronym!), … ?

And now…

Bayesian reconstruction of historical population in Finland 1647-1850

By Jouni Helske, based on the paper of the same name by Voutilainen, Helske, and Högmander (2019), conditionally accepted to Demography

Historical population estimates

  • Interesting for history, demography, and economic research
    • Important contextual / macro-economic variables
    • Also interesting in their own right
  • Only a few countries have annual population series available before the 19th century
    • No proper censuses; estimates are based on tax or church records
  • Our aim was to estimate Finnish population from 1647 to 1850
    • Use incomplete parish records and Bayesian methods
    • Incorporate prior knowledge, report uncertainties of the estimates

Problems with current population estimates

  • Based on fiscal and/or parish records of poor quality, with ad hoc correction procedures
  • Only point estimates
    • arguing over differences of a few thousand in the estimates probably does not make much sense
  • Difficult to assess the validity of the results
    • no ground truth, only one dataset

Perfect data: Problem solved

Given the initial population estimate and full birth and death records:

$$\mu_t = \mu_{t-1} + B_t - D_t, \qquad t = 1648, \ldots, 1850,$$

where

  • $\mu_t$ is the population

  • $B_t = \sum_{i=1}^{197} b_{t,i}$ is the total number of births

  • $D_t = \sum_{i=1}^{197} d_{t,i}$ is the total number of deaths at year $t$.

  • But we don’t know $\mu_{1647}$ nor the yearly births and deaths…
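With complete records the reconstruction would be pure bookkeeping. A minimal sketch of the recursion (the yearly totals below are made up for illustration, not the real records):

```python
# Population identity mu_t = mu_{t-1} + B_t - D_t, applied year by year.
def reconstruct_population(mu_init, births, deaths):
    """Return the population path given the initial level and yearly totals."""
    mu = [mu_init]
    for b, d in zip(births, deaths):
        mu.append(mu[-1] + b - d)
    return mu

# Made-up totals for two years, starting from the prior mean of 430,000:
print(reconstruct_population(430_000, [18_000, 17_500], [15_000, 16_000]))
# [430000, 433000, 434500]
```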

Our aim

  • Build a hierarchical model using
    • Parish level data on baptisms and burials (197 parishes)
    • External estimates on military casualties (taken as given)
    • Few available censuses (22) from 1749 onward
    • Prior knowledge (population level mid-1600s, characteristics of parish records, …)

Data quality

Three types of missingness:

  • Baptisms do not equal births and burials do not equal deaths
  • Records are not available for all parishes for every year
  • Not all burials and baptisms are registered

Census data $C_t$ is likely not accurate either, but we’ll assume that

$$C_t \sim N(\mu_t, \sigma_C^2), \qquad t \in T,$$

  • $\sigma_C \sim \textrm{Gamma}(5, 0.005)$
  • $T = \{1749, 1751, 1754, 1757, \ldots, 1775, 1780, 1800, 1805, \ldots, 1850\}$
  • This implies that the relative accuracy of the censuses increases over time as the population grows
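The point about relative accuracy can be checked in a couple of lines, assuming the shape–rate parameterization of the Gamma distribution (so the prior mean of $\sigma_C$ is $5/0.005 = 1000$); the population levels below are illustrative only:

```python
# A fixed absolute census sd implies shrinking *relative* error as the
# population grows; sigma_C ~ Gamma(5, 0.005) has prior mean shape/rate = 1000
# (shape-rate parameterization assumed).
prior_mean_sigma_c = 5 / 0.005  # 1000.0
for mu in (500_000, 1_000_000, 1_600_000):  # illustrative population levels
    print(f"mu = {mu}: relative sd = {prior_mean_sigma_c / mu:.2%}")
```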

Census data

Parish data

Population level

$$\mu_{1647} \sim \textrm{Gamma}(215, 0.0005), \qquad \mu_t \sim \textrm{Gamma}(\psi_\mu(\mu_{t-1} + \beta_t - \delta_t - s_t), \psi_\mu), \quad t \neq 1697$$

  • $\beta_t$ and $\delta_t$ are the annual birth and death estimates, and $s_t$ are the military casualties at year $t$

  • The prior for the first year translates to a prior mean of 430,000 and a prior sd of about 30,000

  • Special treatment of year 1697 is needed due to the Great Famine

  • The mean of $\mu_t$ is $\mu_{t-1} + \beta_t - \delta_t - s_t$, and its sd is $\sqrt{\mu_t / \psi_\mu}$
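The stated prior mean and sd can be verified directly, assuming the shape–rate parameterization (mean = shape/rate, sd = √shape/rate):

```python
import math

# Gamma(215, 0.0005) prior for mu_1647 (shape-rate parameterization assumed):
shape, rate = 215, 0.0005
mean = shape / rate           # 430000.0
sd = math.sqrt(shape) / rate  # ~29326, i.e. roughly 30,000
print(mean, round(sd))
```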

Parish level

We consider the 177 of the 197 parishes which have at least 10 years of baptism and burial records

  • Denote the set of parishes with recorded baptisms at year $t$ as $\Omega_t^b$
  • Define the baptism process $\beta_t$

$$\beta_t = \frac{1}{\lambda_t^b} \left( \sum_{i \in \Omega_t^b} b_{t,i} + \sum_{j \notin \Omega_t^b} \hat{b}_{t,j} \right),$$

  • The first term in parentheses corresponds to the observed baptisms
  • The latter term corresponds to estimates for the completely missing records at year $t$
  • A similar construction is used for burials, with $\Omega_t^d$ and $\delta_t$
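Numerically the construction is a simple scaling; a sketch with made-up parish counts and a hypothetical λ value:

```python
# beta_t = (observed baptisms + imputed baptisms for missing parishes) / lambda_t^b
def yearly_births(observed, imputed, lam):
    """Scale recorded plus imputed baptisms up to a birth estimate."""
    return (sum(observed) + sum(imputed)) / lam

# Three observed parishes, one imputed parish, lambda = 0.8 (all values made up):
print(yearly_births([1_200, 800, 3_000], [500], 0.8))  # 6875.0
```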

Correction coefficients

  • Coefficients $\lambda_t^b$ and $\lambda_t^d$ are the proportions of the total number of baptisms out of all births (and similarly for burials and deaths)
  • They try to capture
    • partial missingness
    • the omission of 20 parishes
    • the discrepancy between baptisms and births (burials and deaths)
  • $0 < \lambda_t < 1$, increasing over time as partial missingness decreases

Assumptions for λs

  • Based on prior knowledge:
    • $\lambda_{1648}^d < \lambda_{1648}^b$
    • $\lambda_{1850}^d = \lambda_{1850}^b = \lambda_{1850}$
  • $\lambda_{1850} \sim$ Beta with 99% prior interval $(0.55, 1)$ (mean 0.9), and $\lambda_{1648}^d / \lambda_{1850} = a_1$, $\lambda_{1648}^b / \lambda_{1850} = a_1 + a_2$,
  • with $(a_1, a_2, a_3) \sim \textrm{Dirichlet}(12, 14, 14)$, leading to 99% prior intervals $(0.2, 0.8)$ and $(0.5, 1)$ for the proportions, respectively.
  • Assume a smooth increase of the $\lambda$s over time
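The ratio construction guarantees the ordering $\lambda_{1648}^d < \lambda_{1648}^b < \lambda_{1850}$ for any point in the interior of the simplex; a small sketch (the weights here are arbitrary examples, not draws from the actual prior):

```python
# Simplex weights (a1, a2, a3) map to ordered starting values:
#   lambda_d_1648 = a1 * lambda_1850,  lambda_b_1648 = (a1 + a2) * lambda_1850,
# so lambda_d_1648 < lambda_b_1648 < lambda_1850 holds by construction.
def lambda_endpoints(lam_1850, a1, a2):
    return a1 * lam_1850, (a1 + a2) * lam_1850

lam_d, lam_b = lambda_endpoints(0.9, 0.4, 0.3)  # arbitrary example weights
print(lam_d, lam_b)
assert lam_d < lam_b < 0.9
```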

λs as logistic curves

  • Define the $\lambda$s as scaled logistic curves:

$$\tilde{\lambda}_t^j = \left[1 + \exp(-r_j(t - m_j))\right]^{-1}, \qquad \lambda_t^j = \frac{\tilde{\lambda}_t^j - \tilde{\lambda}_{1648}^j}{\tilde{\lambda}_{1850}^j - \tilde{\lambda}_{1648}^j} (\lambda_{1850} - \lambda_{1648}^j) + \lambda_{1648}^j, \qquad j = b, d,$$

with midpoint $1648 < m_j < 1830$ and growth rate $r_j > 0$, $j = b, d$
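A direct implementation of the scaled curve (parameter values below are illustrative, not posterior estimates); by construction the curve equals the endpoint values exactly at 1648 and 1850:

```python
import math

def scaled_logistic(t, r, m, lam_start, lam_end, t0=1648, t1=1850):
    """Raw logistic with midpoint m and growth rate r, rescaled so that
    the curve equals lam_start at t0 and lam_end at t1."""
    raw = lambda s: 1.0 / (1.0 + math.exp(-r * (s - m)))
    return (raw(t) - raw(t0)) / (raw(t1) - raw(t0)) * (lam_end - lam_start) + lam_start

# Illustrative parameters: midpoint 1750, growth rate 0.05, endpoints 0.4 -> 0.9
print(scaled_logistic(1648, 0.05, 1750, 0.4, 0.9))  # 0.4
print(scaled_logistic(1850, 0.05, 1750, 0.4, 0.9))  # 0.9
```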

Parish processes

$$b_{t,i} \sim \textrm{Gamma}(\psi_{bd} \exp(\nu_{t,i}^b), \psi_{bd}), \qquad \hat{b}_{t,j} \sim \textrm{Gamma}(\psi_{bd} \exp(\nu_{t,j}^b), \psi_{bd}),$$
$$\nu_{t,i}^b \sim N(\nu_{t-1,i}^b + \eta_t^b, \sigma_{b,\nu}^2), \qquad \eta_t^b \sim N(\eta_{t-1}^b, \sigma_{b,\eta}^2),$$

and similarly for burials, except for year 1697
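To make the hierarchy concrete, here is a simulation sketch of a single parish series under this model (the hyperparameter values are illustrative; this is not the authors' Stan implementation):

```python
import math
import random

def simulate_parish(T, psi, sigma_nu, sigma_eta, nu0=5.0, seed=1):
    """Simulate yearly baptism counts for one parish: the log-intensity nu_t
    is a random walk with slowly drifting slope eta_t, and counts are
    Gamma(psi * exp(nu_t), psi) so that E[b_t] = exp(nu_t)."""
    rng = random.Random(seed)
    nu, eta, counts = nu0, 0.0, []
    for _ in range(T):
        eta = rng.gauss(eta, sigma_eta)     # eta_t ~ N(eta_{t-1}, sigma_eta^2)
        nu = rng.gauss(nu + eta, sigma_nu)  # nu_t ~ N(nu_{t-1} + eta_t, sigma_nu^2)
        counts.append(rng.gammavariate(psi * math.exp(nu), 1.0 / psi))
    return counts

counts = simulate_parish(T=10, psi=50.0, sigma_nu=0.05, sigma_eta=0.01)
print(len(counts), all(c > 0 for c in counts))  # 10 True
```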

Handling Great Famine

  • The particularly cold period in 1695–1697 resulted in poor yields and a subsequent famine
  • Based on the literature, it is estimated that 15–30% of the population died during this period
  • Such a drastic drop needs special attention
  • We define $\pi = \mu_{1697} / \mu_{1695} \sim$ Beta with 99% prior interval $(0.7, 0.85)$, and
$$\mu_{1697} \sim \textrm{Gamma}(\psi_\mu \pi \mu_{1695}, \psi_\mu), \qquad \hat{d}_{t,i} \sim \textrm{Gamma}(\psi_{bd} \exp(\phi_\delta) \exp(\nu_{t,i}^d), \psi_{bd}), \qquad \phi_\delta \sim N(2, 0.25^2)$$
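The prior interval for the survival ratio matches the 15–30% mortality estimate from the literature:

```python
# pi = mu_1697 / mu_1695 with 99% prior interval (0.7, 0.85):
# a survival ratio of 0.7 is a 30% drop, 0.85 is a 15% drop.
for pi in (0.7, 0.85):
    print(f"pi = {pi}: population drop = {1 - pi:.0%}")
```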

Estimation alternatives

  • Total number of unknown parameters: 72,489 (18 hyperparameters and 72,471 latent variables)
  • We need Markov chain Monte Carlo (MCMC) for fully Bayesian inference
  • Gibbs sampling or naive Metropolis-type MCMC won’t work due to the large number of (correlated) parameters
  • State-of-the-art alternatives:
    • Hamiltonian Monte Carlo using probabilistic programming language Stan
    • particle marginal Metropolis–Hastings (PMCMC) using efficient particle filter (PF)

Pros and cons of Stan

  • No need to code up your own MCMC stuff
  • Simple way to write your model (BUGS-like syntax)
  • Claimed to be very efficient for “arbitrary” problems with a large number of parameters
    • in practice, to be efficient you need to reparametrize and scale your variables
      • for efficient adaptation of the sampler, variables should be uncorrelated and on unit scale (in terms of posterior variance)
  • Many convergence diagnostics
    • Theoretical aspects of many diagnostics are unclear: when should you be worried, and when can you ignore the warnings?
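One standard example of such a reparameterization (not necessarily the exact one used for this model) is the non-centered form of a random walk, which moves the scale parameter out of the sampled variables:

```python
# Non-centered parameterization: sample unit-scale "raw" innovations and
# transform, instead of sampling the correlated walk states directly.
# A generic illustration, not the paper's Stan code.
def noncentered_walk(raw, start, sigma):
    """nu_t = nu_{t-1} + sigma * raw_t, with raw_t ~ N(0, 1) in the sampler."""
    nu = [start]
    for z in raw:
        nu.append(nu[-1] + sigma * z)
    return nu

print(noncentered_walk([1.0, -0.5], start=0.0, sigma=2.0))  # [0.0, 2.0, 1.0]
```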

Pros and cons of PMCMC

  • Stan is “dumb” in the sense that it does not differentiate between hyperparameters and latent variables
  • PMCMC handles these separately, marginalizing over the latent variables with a PF
    • using e.g. random walk Metropolis for the 18 hyperparameters is easy
  • But the large number of missing observations makes the bootstrap PF infeasible (it needs too many particles)
  • Twisted PFs with proposals based on an approximating Gaussian model?
    • initial tests showed that the Gaussian approximation of the Gamma model is not stable with respect to the hyperparameters (it needs too many particles)

Estimation

  • We started prototyping the model formulation with Stan
  • Our initial belief was that the modelling would be done quickly, so there was no need to bother implementing a custom PMCMC
  • In the end, we stuck with Stan despite some hiccups
  • Total estimation time of the final model was around 110h on Amazon AWS (16 parallel chains of length 15,000)
    • One run of the twisted PF for the parish level takes ~30 s in Julia
      • 100 particles, not really enough without some tricks
      • 300+ days for 15,000 iterations without parallelisation
      • not fully optimized code/algorithm, but not an encouraging starting point…

Results: λ-coefficients

Results: yearly differences

Results: Birth and death estimates

Results: Population estimates

Results: Posterior predictive check

  • (using posterior samples of hyperparameters, $\beta_t$, and $\delta_t$)

Thank you!

Some references:

  • Paper: Voutilainen M, Helske J, Högmander H (2019). A Bayesian reconstruction of historical population in Finland 1647–1850. Conditionally accepted to Demography.
  • Twisted PF etc.: Vihola M, Helske J, Franks J (2018). Importance sampling type estimators based on approximate marginal MCMC. arXiv preprint.
  • Stan: Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, Brubaker M, Guo J, Li P, Riddell A (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1).
  • PMCMC: Andrieu C, Doucet A, Holenstein R (2010). Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72: 269-342.