=======================
PyMC Multilevel Example
=======================
A nice example of using PyMC for multilevel (aka "Random Effects")
modeling came through on the PyMC mailing list today. I've put it
into a git repo so that I can play around with it a little, and
collect up the feedback that the list generates.
The original message, from Scott B, lays out the situation with
refreshing precision. I continue to be impressed with the skill-level
that new PyMC users frequently begin with (for fun, compare to the
questions on the NetworkX mailing list sometime... I guess this is an
indication of the relative success in the popularization of network
science and Bayesian statistics).
Anyways, here is the original message::
Hi All,
New to this group and to PyMC (and mostly new to Python). In any case,
I'm writing to ask for help in specifying a multilevel model (mixed
model) for what I call partially clustered designs. An example of a
partially clustered design is a 2-condition randomized psychotherapy
study where subjects are randomly assigned to conditions. In condition
1, subjects are treated in small groups of say 8 people apiece. In
condition 2, usually a control, subjects are only assessed on the
outcome (they don't receive an intervention). Thus you have clustering
in condition 1 but not condition 2.
The model for a 2-condition study looks like (just a single time point
to keep things simpler):
y_ij = b_0 + b_1*Tx + u_j*Tx + e_ij
where y_ij is the outcome for the ith person in cluster j (in most
multilevel modeling software and in PROC MCMC in SAS, subjects in the
unclustered condition are all in clusters of just 1 person), b_0 is
the overall intercept, b_1 is the treatment effect, Tx is a dummy
coded variable coded as 1 for the clustered condition and 0 for the
unclustered condition, u_j is the random effect for cluster j, and
e_ij is the residual error. The variance among clusters is \sigma^2_u
and the residual term is \sigma^2_e (ideally you would estimate a
unique residual by condition).
Because u_j interacts with Tx, the random effect only contributes
to the clustered condition.
In my PyMC model, I expressed the model in matrix form - I find it
easier to deal with especially for dealing with the cluster effects.
Namely:
Y = XB + ZU + R
where X is an n x 2 design matrix for the overall intercept and
intervention effect, B is a 1 x 2 matrix of regression coefficients, Z
is an n x c design matrix for the cluster effects (where c is equal to
the number of clusters in the clustered condition), and U is a c x 1
matrix of cluster effects. The way I've written the model below, I
have R as an n x n diagonal matrix with \sigma^2_e on the diagonal and
0's on the off-diagonal.
All priors below are based on a model fit in PROC MCMC in SAS. I'm
trying to replicate the analyses in PyMC so I don't want to change the
priors.
The data for my code can be downloaded here:
http://dl.dropbox.com/u/613463/run1.csv
The data consist of 200 total subjects. 100 in the clustered condition
and 100 in the unclustered. In the clustered condition there are 10
clusters of 10 people each. There is a single outcome variable.
The PyMC set-up file can be downloaded here (also pasted below):
http://dl.dropbox.com/u/613463/analysis1.py
The file for running the PyMC set-up file can be downloaded here:
http://dl.dropbox.com/u/613463/run_analysis1.py
I have 3 specific questions about the model:
1 - Given the description of the model, have I successfully specified
the model? The results are quite similar to PROC MCMC, though the
cluster variance (\sigma^2_u) differs more than I would expect due to
Monte Carlo error. The differences make me wonder if I haven't got it
quite right in PyMC.
2 - Is there a better (or more efficient) way to set up the model? The
model runs quickly but I am trying to learn Python and to get better
at optimizing how to set up models (especially multilevel models).
3 - How can I change my specification so that I can estimate unique
residual variances for clustered and unclustered conditions? Right now
I've estimated just a single residual variance. But I typically want
separate estimates for the residual variances per intervention
condition.
Thanks so much for your help. My code follows my signature.
Best,
Scott
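Before my notes, here is a minimal sketch of the model Scott describes,
written as a plain-NumPy log-posterior rather than PyMC stochs.  The
simulated data, the flat prior on B, and the placeholder outcome are
all my assumptions standing in for run1.csv and the PROC MCMC priors;
only the structure Y = XB + ZU + R is from the message::

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for run1.csv: 100 clustered subjects (10 clusters of 10)
# followed by 100 unclustered subjects; the outcome is just noise here.
n, c = 200, 10
treat = np.repeat([1, 0], 100)                 # Tx dummy variable
cluster = np.where(treat == 1, np.arange(n) // 10, -1)
y = rng.normal(size=n)

X = np.column_stack([np.ones(n), treat])       # intercept + treatment
Z = np.zeros((n, c))                           # cluster design matrix
Z[cluster >= 0, cluster[cluster >= 0]] = 1.0   # one column per cluster

def log_posterior(B, U, var_e, var_u):
    # Gaussian likelihood for Y = XB + ZU + R plus a Gaussian prior on
    # the cluster effects U; B gets an (assumed) flat prior here, since
    # Scott's actual priors come from his PROC MCMC fit.
    resid = y - X @ B - Z @ U
    ll = -0.5 * np.sum(resid**2 / var_e + np.log(2 * np.pi * var_e))
    lp = -0.5 * np.sum(U**2 / var_u + np.log(2 * np.pi * var_u))
    return ll + lp

print(log_posterior(np.zeros(2), np.zeros(c), 1.0, 0.4))
```

The diagonal R never needs to be formed as an n x n matrix; a scalar
(or per-observation vector) of residual variances does the same job.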
Here are my notes on the matter:
1. This code worked for me without any modification. :)  When I
tried to run it twice in the same ipython session, though, it gave me
strange complaints (for pymc version 2.1alpha, wall time 78s).
For the newest version in the git repo (pymc version 2.2grad,
commit ca77b7aa28c75f6d0e8172dd1f1c3f2cf7358135, wall time 75s) it
didn't complain.
2. I find the data wrangling section of the model quite opaque. If
there is a difference between the PROC MCMC and PyMC results, this
is the first place I would look. I've been searching for the most
transparent ways to deal with data in Python, so I can share some
of my findings as applied to this block of code.
3. The model could probably be faster. This is the time for me to
recall the two cardinal rules of program optimization: 1) Don't
Optimize, and 2) (for experts only) Don't Optimize Yet.
That said, the biggest change to the time PyMC takes to run is in
the step methods. But adjusting this is a delicate operation.
More on this to follow.
4. Changing the specification is the true power of the PyMC approach,
and why this is worth the extra effort, since a random effects
model like yours is one line of STATA. So I'd like to write out
in detail how to change it. More on this to follow.
5. Indentation should be 4 spaces. Diverging from this inane detail
will make python people itchy.
I think that speeding up the MCMC is more fun than cleaning up the
data wrangling, so I'll assume that the data is being set up as
desired for now and mess around with the step methods. Commit linked
here
https://github.com/aflaxman/pymc-multilevel-example/commit/03d0e64653a99085ac5f391acd0ad5839ddfaae4
is a small example of this, which doesn't speed things up, but could
mean fewer samples are required of the chain.
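The idea behind grouping stochs into one step method can be sketched
without PyMC at all: propose every component at once, as a single
AdaptiveMetropolis does.  The toy correlated-Gaussian target below is
my illustration (it stands in for the beta0/beta1 posterior, whose
correlation is exactly why joint proposals help); nothing here is the
repo's actual code::

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy target: a correlated 2-d Gaussian standing in for the joint
# posterior of beta0 and beta1.
prec = np.linalg.inv(np.array([[1.0, 0.9], [0.9, 1.0]]))

def log_post(theta):
    return -0.5 * theta @ prec @ theta

def blocked_metropolis(log_post, theta0, n_iter=5000, scale=0.5):
    # Random-walk Metropolis that proposes all components jointly,
    # the way grouping stochs under one AdaptiveMetropolis does.
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    trace = np.empty((n_iter, theta.size))
    accepted = 0
    for i in range(n_iter):
        prop = theta + scale * rng.normal(size=theta.size)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
            accepted += 1
        trace[i] = theta
    return trace, accepted / n_iter

trace, rate = blocked_metropolis(log_post, [0.0, 0.0])
print(rate)
```

The adaptive part of AdaptiveMetropolis goes further, tuning the
proposal covariance from the accumulating trace, which is what makes
it delicate to reason about.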
To actually speed things up (and also potentially reduce the number of
samples required), I'm going to combine the beta0 and beta1
stochs. This reduces the wall time to 58s, but the diagnostic plots
make me think it needs a longer burn-in period. I like to throw
out the first half of my samples, so I'll make that change, too. Pro
tip: I added a <code>reload(analysis1)</code> to the run_analysis1.py
script, so that my changes to the model automatically make it into
ipython when I say <code>time run run_analysis1</code>. Commit linked
here
https://github.com/aflaxman/pymc-multilevel-example/commit/6bd55cef9c5d854f7755924c705ab8e111e20b3c
I'm not keeping careful track of whether the results change as I modify the
step methods, which is probably not a recommended practice. But it is
fun to see the methods get faster and faster as I group more and more
in the AM step method. (43s wall time for M.B and M.U in one AM,
commit here; 32s wall time for all stochs together, but the results
are not pretty)
https://github.com/aflaxman/pymc-multilevel-example/commit/49dea954c0c681ad74a2b427a2e6ca594e501c4f
Sometimes I've had good luck with choosing initial values for the MCMC
by maximizing the posterior or something related to it. Here is code
to do that, which doesn't take very long (36s wall time, and
respectable-looking results are back)
https://github.com/aflaxman/pymc-multilevel-example/commit/e8c376b8685c64c7183a8aeb14ec05b095042652
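Maximizing the posterior for initial values looks roughly like this; a
library-free sketch on a toy normal model (the mu/log-sigma
parameterization and the flat priors are my assumptions, not the
repo's model)::

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
y = rng.normal(loc=1.0, scale=2.0, size=500)

def neg_log_post(params):
    # mu and log-sigma of a simple normal model with (assumed) flat
    # priors, so this is just the negative log-likelihood.
    mu, log_sigma = params
    return (0.5 * np.sum((y - mu) ** 2) / np.exp(2 * log_sigma)
            + y.size * log_sigma)

start = minimize(neg_log_post, x0=[0.0, 0.0]).x
print(start)  # near [1, log(2)]; use as initial stoch values
```

Setting each stoch's value to the corresponding entry of the optimum
before calling the sampler is all the "good initial values" trick
amounts to.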
So that little burst of messing around has halved the run time. Too
bad I didn't work on the things I was supposed to be doing for the
last hour.
Another hour of messing around has given me some confidence that 200K
samples is certainly enough, and 20K samples probably is enough to
guide model development. To assess this, I relied on the
autocorrelation plot of the stoch traces, which I added to the
Matplot.plot routine, since I think everyone should use this
approach. Commit linked here
https://github.com/aflaxman/pymc-multilevel-example/commit/e8545ca5bec6e3a1e5dbb7ba20bb957bc67c0d95
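The quantity that autocorrelation panel displays is roughly the
following; the AR(1) series here is my stand-in for a sticky MCMC
trace::

```python
import numpy as np

def autocorr(trace, max_lag=50):
    # Sample autocorrelation of a 1-d trace at lags 0..max_lag.
    # Slowly decaying values mean the chain needs more samples
    # (or a longer burn-in, or thinning).
    x = np.asarray(trace, dtype=float) - np.mean(trace)
    var = x @ x
    return np.array([(x[: x.size - lag] @ x[lag:]) / var
                     for lag in range(max_lag + 1)])

rng = np.random.default_rng(3)
# AR(1) series with coefficient 0.9, mimicking a sticky trace.
ar = np.zeros(5000)
for t in range(1, ar.size):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()

acf = autocorr(ar, max_lag=5)
print(acf)  # starts at 1.0 and decays roughly like 0.9**lag
```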
Now I'm ready to try simplifying the data wrangling section, and then
everything will be in order to extend the model to have different
variance for the clustered and unclustered observations.
I like to use pylab's csv2rec function for loading data files. It
deals nicely with the column headings, and converts data to the
correct type pretty routinely. I'm going to do two commits in the
process of changing this code, one which checks carefully that I'm not
breaking anything, and then another to yield a nice short block of
code. Commits linked here
https://github.com/aflaxman/pymc-multilevel-example/commit/8c06499b287e1bbd9dd7a4146bc627d6e9187ad8
and here
https://github.com/aflaxman/pymc-multilevel-example/commit/11ff14dab2377491e8d4d155acfd74b7ca6c93d0
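For anyone without pylab handy, numpy's genfromtxt with names=True does
the same header-aware loading that csv2rec does.  The column names in
this in-memory stand-in are invented for illustration; check them
against run1.csv's real header::

```python
import io
import numpy as np

# An in-memory stand-in for run1.csv; the column names are made up.
csv = io.StringIO("id,treat,y\n1,1,0.52\n2,1,-0.11\n3,0,0.30\n")

# names=True reads the header row, like pylab's csv2rec, and each
# column becomes addressable by name in the resulting record array.
data = np.genfromtxt(csv, delimiter=",", names=True)
print(data["treat"])  # [1. 1. 0.]
```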
Now, on to changing the model! I added a separate variance parameter
for the treatment and control groups in a slightly sneaky way. It
only took 10 keystrokes (roughly), but maybe it would be better to use
more code and have a clearer approach. Commit linked here
https://github.com/aflaxman/pymc-multilevel-example/commit/a46531085c673e62458b8b28153f50aebae036de
Before:
B: [-0.07 0.3 ]
u: [-0.14 0.02 0.3 0.5 0.36 -1.02 0.33 0.23 -0.28 -0.35]
var_e1: 0.94
var_u: 0.4
After:
B: [-0.06 0.31]
u: [-0.13 0. 0.28 0.53 0.39 -1.08 0.34 0.22 -0.31 -0.35]
var_e1: [ 1.12 0.79]
var_u: 0.41
Since I sneakily used the indicator run1.treat to index the var_e1
array, var_e1[0] is the control variance and var_e1[1] is the
treatment variance. So the change in the model doesn't change any of
the parameter means, but it does show that the variance of the
control group is higher than for the treatment group.
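The sneaky indexing can be seen in miniature; the indicator and the
two variances below just reuse the posterior means quoted above::

```python
import numpy as np

treat = np.array([0, 0, 1, 1, 1])    # run1.treat-style 0/1 indicator
var_e1 = np.array([1.12, 0.79])      # control, treatment (means above)

# Indexing the length-2 array by the indicator hands every observation
# its own condition's residual variance in a single expression.
obs_var = var_e1[treat]
print(obs_var)  # [1.12 1.12 0.79 0.79 0.79]
```

That one expression is why the change to the model only took about 10
keystrokes.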
Too bad I didn't generate before-and-after numbers on the uncertainty
intervals; that is something that might have changed in an interesting
way with this more complicated model.
In conclusion, here is why I think that this approach is worth the
extra effort, compared to using xtmixed in STATA. Although the STATA
model is almost done once you figure out the single line <code>xtmixed y
treat ||j:</code>, that is not quite what you want. The control
groups are not supposed to have random effects, so you have to find a
hack for that, maybe <code>xtmixed y treat ||j: treat,
nointercept</code>. And then you want to add different variances for
treatment and control, which is doable, but also hacky, maybe with
<code>xtmixed y treat ||j: treat, nointercept ||treat:</code>. And
now you've got a very short line, but one that is going to be very
hard to understand in the future. And what if you want to fiddle with
the priors? Or change the noise model? Much better to have 30 clear
lines of PyMC, in my opinion.
p.s. Django and Rails have gotten a lot of mileage out of emphasizing
_convention_ in frequently performed tasks, and I think that PyMC
models could also benefit from this approach. I'm sure I can't
develop our conventions myself, but I have changed all the file names
to move towards what I think we might want them to look like. Commit
linked here
https://github.com/aflaxman/pymc-multilevel-example/commit/603112c1de3d0fd8fa8fdc76a21ec3f290d30a06
My analyses often have these basic parts: data, model, fitting code,
graphics code. Maybe yours do too.