Documentation for wfmm

This documents the basics of how to use the code provided for implementing the Bayesian wavelet-based functional mixed models methodology introduced in Morris and Carroll (2006). The code implements the Markov Chain Monte Carlo (MCMC) procedure described in Section 5 of the paper and outputs posterior samples for many model quantities.

Morris, JS and Carroll, RJ (2006). Wavelet-based functional mixed models, Journal of the Royal Statistical Society, Series B, 68(2): 179-199.

We apologize that this initial documentation is incomplete. More extensive documentation will be provided in the near future, along with scripts for performing some of the most commonly used Bayesian estimation, inference, and prediction procedures from the MCMC output from this program. We also plan to develop an R interface for this program, which we will also post when it is ready.

Sample program call (from DOS window):

wfmm c:\work\input.mat output.mat > log_file.log

Input: The program requires an input file (“input.mat” in the call) that is a matlab file

containing the following objects:

Y: N –by- T matrix, each row containing one of the observed functions on an equally-spaced grid of length T. This is the only variable that is required. Defaults will be taken for everything else if they are omitted.
model: Matlab structure indicating details of model. The following elements can be specified:

X: N-by-p matrix containing the desired covariates for the p fixed effect functions in the model. Default: Nx1 column of 1’s.
Hstar: integer indicating the number of levels of random effect functions to compute GCP-by-blocks, default: 0. This should only be used for very large m, (>70).
Z{h}, h=1, …, H: N-by-m_h matrices containing covariates for each set of random effect functions. If omitted, it is assumed that it is a fixed effects model.
The following two variables should just be included as given here

i. covQ=2

ii. covS=2

These are artifacts of some exploration of different ways of doing the modeling – very soon these will be removed from the code, but for now they are still there and must be included.

e. C: N-by-c matrix, the columns indicate the functions that share a common residual error S; by default this

is an Nx1 matrix of all 1’s.

3. wavespecs: Matlab structure describing wavelet settings to be used. The following elements can be specified:

a. wavelet: text string indicating the wavelet basis to use. Abbreviations for wavelet bases are given below. Default: ‘db4’.

b. nlevels: integer indicating the number of levels of decomposition. If omitted, an optimal number of levels is computed.

c. boundary: boundary correction method assumed, default: ‘periodic’.

d. extendedMode: whether or not to keep extra boundary wavelet coefficients (so K>T) (1 by default)

e. P: fraction of “energy” retained during wavelet compression, default: 1 (no compression)

f. t: number of functions that must be greater than threshold P in order to be retained during compression, default: 0 , included if any functions are greater than P.

4. MCMCspecs: Matlab structure describing details of MCMC. The following elements can be specified:

a. B= number of MCMC samples to obtain, default: 1000.

b. burnin= burn-in length, default: 1000.

c. thin= thinning parameter; e.g. if 10, then keep every 10 samples in MCMC, default: 5.

d. propvarTheta= multiple of var(MLE) to use in proposal variance for variance components in step 2 of the MCMC, default: 1.5

e. nj_nosmooth= number of lowest frequency wavelet levels for which we want a vague prior (no smoothing), default: 2.

The following parameters are simply for numerical stability:

f. minp= minimum value for any p_ij, default: 10^-14

g. minT= minimum value for T_ij, default: 1

h. bigT= value to use for T_ij when vague prior desired (no smoothing), default: 1000

i. maxO = maximum odds ratio, by default: 10²⁰ (prevents overflow)

j. minVC = minimum value of variance component, default: 10^-6 (prevents instability of variance components wandering near zero)

k. VC0_thresh = minimum size for important variance component, default: 0.01.

l. delta_theta = multiple for prior on theta: “number of datasets of information” in prior (see discussion in Morris, et al. 2003 JASA), default: 10^-4.

m. thetaMOM_maxiter = maximum number of iterations in finding MOM starting values for variance components, default: 100

n. thetaMOM_convcrit= convergence criteria for iterative procedure for finding MOM starting values for variance components, default: 10^-3

o. time_update=number of iterations between updates to the log file during MCMC loop, default: 100.

p. missing_data=flag indicating whether to process normal data Y or imputed data Vstar (0=normal,

1=imputed), default: 0.

Output:

The output of the program is a Matlab data file (“output.mat” in the sample call), containing the following Matlab objects, as well as a log file.

The following variables are stored in the input_Init.mat

· Y, model, wavespecs, MCMCspecs: as input

· D: matrix containing wavelet coefficients for observed data

· PI: matrix containing the p_ij estimated by the Empirical Bayes procedure described in Section 4.4 of the paper, based on theta_MLE.

PI_MOM: matrix containing the p_ij estimated by the Empirical Bayes procedure described in Section 4.4 of the paper, based on theta_MOM.

· Tau: matrix containing the T_ijby the Empirical Bayes procedure described in Section 4.5 of Morris and Carroll (2004), based on theta_MLE.

Tau_MOM: matrix containing the T_ijby the Empirical Bayes procedure,

based on theta_MOM.

· theta_MOM: matrix containing method of moments starting values for the wavelet-space variance components q_jk and s_jk in model (3).

· theta_MLE: matrix containing profile maximum likelihood starting values for the wavelet-space variance components.

· se_Theta: estimate of the variance of theta_MLE, to use in automatic proposal variances in Metropolis-Hastings procedure described in step 2 of Section 5 in Morris and Carroll (2004)

· betans: matrix containing non-shrunken estimate of wavelet coefficients for fixed effects conditioning on starting values of variance components, given by equation (6) in Morris and Carroll (2004).

· Vbetans: Matrix containing variance of these wavelet-spaced estimates, given by equation (7) in Morris and Carroll (2004).

· alpha: Matrix containing starting values for shrinkages for wavelet coefficients for fixed effect functions, which are their posterior probabilities of being “nonzero”. Condition on theta_MLE for variance components.

alpha_MOM: same as alpha, only based on theta_MOM.

· prior_Theta_a, prior_Theta_b: matrices containing the prior hyperparameters for the inverse gamma distributions on the wavelet-space variance components

· Wv: structure containing, for each wavelet coefficient, the following statistics, using starting values of the variance components for S_jk

o XvX=X’(S_jk)^-1X

o XvZ=X’(S_jk)^-1Z

o XvD=X’(S_jk)^-1D

o ZvZ=Z’(S_jk)^-1Z

o ZvD=Z’(S_jk)^-1D

o dvd=diag(D’(S_jk)^-1D)

o L1=det(S_jk)

o L2= (d_jk-X B_jk)’ (S_jk)^-1 (d_jk-X B_jk)

The following variables are stored in the output.mat

· Y, model, wavespecs, MCMCspecs: as input

· D: matrix containing wavelet coefficients for observed data

· PI: matrix containing the p_ij estimated by the Empirical Bayes procedure described in Section 4.4 of the paper.

· Tau: matrix containing the T_ijby the Empirical Bayes procedure described in Section 4.5 of Morris and Carroll (2004).

· betahat: betans*alpha; shrinkage starting values for betas.

· ghat: p-by-T matrix containing the posterior mean for each fixed effect function.

ghatns: Inverse discrete wavelet transform of betans.

· Q05_ghat: p-by-T matrix containing the pointwise .05 quantile for each fixed effect function, which is the lower bound for the 90% posterior credible interval.

· Q95_ghat: p-by-T matrix containing the pointwise .95 quantile for each fixed effect function, which is the upper bound for the 90% posterior credible interval.

theta: Mean of MCMC theta samples.

· MCMC_wbeta: matrix containing MCMC posterior samples for wavelet coefficients for fixed effects. Each row contains all B^*_ijk , i=1, …, p, j=1, .., J, k=1, …, K_j , for one iteration of MCMC.

· MCMC_beta: matrix containing MCMC posterior samples for data-space fixed effect functions. Each row contains B_ij , i=1, …, p, j=1, …, T, for one iteration of MCMC.

· MCMC_theta: matrix containing MCMC posterior samples for variance components in wavelet space. Each row contains {q_hjk, h=1, …, H, s_jk}, in that order, for j=1, …, J, k=1, …, K_j

· MCMC_newtheta: vector containing Metropolis-Hastings acceptance probabilities for the set of variance components for each wavelet coefficient, indexed by scale j and location k.

Comments:

· This code has been compiled and used on Microsoft Windows systems. In the near future, we will also test it on Linux, as well.

· The current interface assumes you create the input files and want to post-process the output files in Matlab. We will write an R wrapper for this program in the near future, and post on our web site when it is ready.

· The current version of the code assumes:

1. You want to estimate the shrinkage hyperparameters using the empirical Bayes method.

2. You want vague proper priors for the variance components, centered at the starting values with information equivalent to delta_Theta observations.

3. The random effect functions are independent and identically distributed, so P=R=I

Future versions of the code will allow the users to specify shrinkage hyperparameters and priors for the variance components, and will also allow other covariance structures on the random effect functions.

· This code yields MCMC samples for the quantities in the wavelet-space model, (3) in Morris and Carroll (2004), plus MCMC samples for the fixed effect functions B in the data space model (2). You will notice a number of output files with *.dat extensions. These are binary scratch files storing the MCMC samples so that they do not overflow memory.

· MCMC samples of Q_h and S_i matrices can be obtained by applying the 2-D IDWT to the corresponding diagonal wavelet-space matrices. They are not generated by default because they are frequently not of direct interest, and their large size will cause memory issues in large data sets. We are currently writing code to compute and store the diagonal elements of these matrices in a more efficient manner. These diagonal elements are useful for certain purposes we discuss in future publications.

· MCMC samples of the random effect functions and their wavelet coefficients are not calculated by default, again to save memory and computing resources. We have matlab scripts to compute these that we are in the process of converting over to C, and will post once they are complete.

· Our scripts for post-processing the posterior samples from the MCMC to do various types of Bayesian inference are in Matlab. We are in the process of converting some of them over to C, at which time they will be posted along with their documentation.

· It is possible to parallelize the MCMC step if parallel processing is available – a parallel processing version of the code will be posted in due time.

· For extremely large data sets, one may need to compute the memory requirements of fitting their data and compare it with the capabilities of their machine. See formulas below for estimating RAM and disk usage.

· There are numerous extensions and improvements in the works for this method – stay tuned for updated code.

Wavelet Bases: The current version of accepts both columns of abbreviations for the wavelet bases specified below. Here are the available wavelets, and the corresponding notations for wfmm and Matlab:

WFMM wavelets Matlab equivalent

"haar", “db1” or “haar”

"d4", “db2”

"d6", “db3”

"d8", “db4”

"d10", “db5”

"d12", “d6”

"d14", “db7”

"d16", “db8”

"d18", “db9”

"d20", “db10”

"s4", “sym2”

"s6", “sym3”

"s8", “sym4”

"s10", “sym5”

"s12", “sym6”

"s14", “sym7”

"s16", “sym8”

"c6", “coif1”

"c12", “coif2”

"c18", “coif3”

"c24", “coif4”

"c30" “coif5”

Estimating disk and RAM usage:

N = Number of Functions

p = Number of fixed effect functions

T = Number of observations/function

B = Number of MCMC samples

K = Number of wavelet coefficients

K’ = Number of non-thresholded wavelet coefficients

m = Number of random effect functions H = Number of levels of random effect functions

c = Number of strata for residual error functions

B’= Number of samples in memory. This is computed automatically, but should be set to 1 to find the minimum memory required to run.

Disk Usage » 8 [BK’(p+H+c+1) + pT(B+4) + 3NT+ 2NK’ + K’(p^2+m^2+pm+m+3+7p+7(H+c)) ]

RAM Usage » 8[B’(2pT+(H+c+2)K’+T) + (0.05B+4)pT + 3(H+c)K’]