Important: This is just a draft! I haven’t 1) finished creating reasonable data sets for each model or 2) made even one pretty picture! You’re stuck with the chain summaries for now

!

using Turing
using Bijectors
using Gadfly
using DataFrames, DataFramesMeta
Gadfly.set_default_plot_size(900px, 300px)

Item Response

Item response models are used to simultaneously make inferences about two interacting populations. Commonly, a population of test questions and a population of test takers (students) with the result (success/failure) of each student on each question they've seen. This is an interesting problem in that even in the basic case:

students have different levels of aptitude
questions have different levels of difficulty
not every student sees every question
not every question needs to be seen by the same number of students
we should be able to make relative inferences between students (resp. questions) that have no overlapping questions (resp. students)
the data is nonetheless fairly simple: [correct (Boolean), student_id (categorical), question_id (categorical]

I love these models because they're easy to extend in an intuitive way. I'm going to add a few random bells and whistles to the most vanilla version, and if you're interested the Stan user guide has some good content on this topic and many others.

Some IRT Models

Vanilla Item-Response (aka 1PL)

For each student $s$, we have an aptitude $\alpha_s$ and for each question $q$ we have a difficulty $\gamma_q$. The likelihood of a correct response is informed by the difference between these two quantities:

$$\begin{aligned} \alpha_s &\sim \mathrm{Normal}(0,5)\\ \gamma_q &\sim \mathrm{Normal}(0,5)\\ \beta_{s, q} &= \mathrm{logit}^{-1}(\alpha_s - \gamma_q)\\ \mathrm{correct_{s,q}} &\sim \mathrm{Bernoulli}(\beta_{s,q})\\ \end{aligned}$$

logit = bijector(Beta())   # bijection:  (0, 1)  →  ℝ
inv_logit = inv(logit)     # bijection:       ℝ  →  (0, 1)

student = [1,1,1,1,2,2,2,2,3,3,3,3]
question = [1,2,3,4,2,3,4,5,3,4,5,1]
correct = [
    true, true, true, false, 
    true, false, false, true, 
    false, false, false, true];

Some observations on the toy data:

Everyone got question 1 correct (expect this to be rated as low difficulty)
Everyone got question 4 wrong (high difficulty)
Student 1 got all tested questions correct except question 4
Student 3 got all tested questions incorrect except question 1
Question 5 was only seen by student 3 (incorrect)

So, here's the model set up in Turing, and the result of the sampler below.

@model function irt_1pl(correct::Array{Bool}, student::Array{Int64}, question::Array{Int64})
    aptitude = Vector(undef, maximum(student))
    difficulty = Vector(undef, maximum(question))
    
    # priors
    for i in 1:length(aptitude)
        aptitude[i] ~ Normal(0,5)
    end
    for i in 1:length(difficulty)
        difficulty[i] ~ Normal(0,5)
    end        
    
    β = Vector(undef, length(correct))
    for i in 1:length(correct)
        β[i] = aptitude[student[i]] - difficulty[question[i]]
        correct[i] ~ Bernoulli(inv_logit(β[i]))
    end
end;

# Settings of the Hamiltonian Monte Carlo (HMC) sampler.
iterations = 1000
ϵ = 0.05
τ = 10;

irt_1pl_ch = sample(
    irt_1pl(correct, student, question), 
    HMC(ϵ, τ), iterations, 
    progress=true, drop_warmup=true)

Sampling: 100%|█████████████████████████████████████████| Time: 0:00:00

Chains MCMC chain (1000×17×1 Array{Float64,3}):

Iterations        = 1:1000
Thinning interval = 1
Chains            = 1
Samples per chain = 1000
parameters        = aptitude[1], aptitude[2], aptitude[3], difficulty[1], difficulty[2], difficulty[3], difficulty[4], difficulty[5]
internals         = acceptance_rate, hamiltonian_energy, hamiltonian_energy_error, is_accept, log_density, lp, n_steps, nom_step_size, step_size

Summary Statistics
     parameters      mean       std   naive_se      mcse       ess      rhat 
         Symbol   Float64   Float64    Float64   Float64   Float64   Float64 

    aptitude[1]    3.2740    1.9350     0.0612    0.3207   22.7341    1.0402
    aptitude[2]    0.9977    2.2009     0.0696    0.5345    5.0679    1.2607
    aptitude[3]   -2.7975    3.1851     0.1007    0.9097    3.2885    1.5668
  difficulty[1]   -6.6429    4.3688     0.1382    1.3718    2.4382    2.1207
  difficulty[2]   -3.1028    2.2644     0.0716    0.5146   13.6893    1.0655
  difficulty[3]    2.0820    2.0146     0.0637    0.4029   10.0855    1.1187
  difficulty[4]    5.3785    2.3564     0.0745    0.6279   10.9535    1.0266
  difficulty[5]   -0.7591    3.7529     0.1187    1.0978    3.7110    1.4890

Quantiles
     parameters       2.5%      25.0%     50.0%     75.0%     97.5% 
         Symbol    Float64    Float64   Float64   Float64   Float64 

    aptitude[1]     0.1558     1.7989    3.0630    4.7257    6.9881
    aptitude[2]    -2.7293    -0.6033    0.8091    2.4061    5.2246
    aptitude[3]    -9.5331    -4.5425   -2.9342   -0.7317    2.7270
  difficulty[1]   -12.9759   -10.1684   -7.3905   -3.1434    2.1283
  difficulty[2]    -9.1851    -4.3815   -2.6233   -1.5957    0.3462
  difficulty[3]    -2.0318     0.6111    2.2832    3.5960    5.3615
  difficulty[4]     1.1812     3.6750    5.2200    6.9466   10.4091
  difficulty[5]    -7.9556    -2.9220   -1.3910    1.0610    8.3417

Interesting, the only surprise for me is that I expected a wider spread for difficulty[5], but otherwise looks very reasonable!

Question Quality (aka 2PL)

The purpose of asking questions is to probe the aptitude of the test taker, and some questions will do a much better job of guaranteeing a minimum skill level given a successful response. This is called "discrimination". Intuitively, a highly discriminating question would magnify the difference between a student's ability and the question's difficulty, so that both

students with sufficient aptitude are more likely to succeed
students with insufficient aptitude are more likely to fail

We can see that $\eta$ will accomplish this in the model below:

$$\begin{aligned} \alpha_s &\sim \mathrm{Normal}(0,5)\\ \gamma_q &\sim \mathrm{Normal}(0,5)\\ \eta_q &\sim \mathrm{LogNormal}(0,2)\\ \beta_{s, q} &= \mathrm{logit}^{-1}(\eta_q * (\alpha_s - \gamma_q))\\ \mathrm{correct_{s,q}} &\sim \mathrm{Bernoulli}(\beta_{s,q})\\ \end{aligned}$$

@model function irt_2pl(correct::Array{Bool}, student::Array{Int64}, question::Array{Int64})
    aptitude = Vector(undef, maximum(student))
    difficulty = Vector(undef, maximum(question))
    discr = Vector(undef, maximum(question))
    
    # priors
    for i in 1:length(aptitude)
        aptitude[i] ~ Normal(0,5)
    end
    for i in 1:length(difficulty)
        difficulty[i] ~ Normal(0,5)
    end        
    for i in 1:length(difficulty)
        discr[i] ~ LogNormal(0,2)
    end        
    
    β = Vector(undef, length(correct))
    for i in 1:length(correct)
        β[i] = discr[question[i]] * (aptitude[student[i]] - difficulty[question[i]])
        correct[i] ~ Bernoulli(inv_logit(β[i]))
    end
end;

irt_2pl_ch = sample(
    irt_2pl(correct, student, question), 
    HMC(ϵ, τ), iterations, 
    progress=true, drop_warmup=true)

Sampling:  37%|███████████████▏                         |  ETA: 0:00:02┌ Warning: The current proposal will be rejected due to numerical error(s).
│   isfinite.((θ, r, ℓπ, ℓκ)) = (true, false, false, false)
└ @ AdvancedHMC /Users/brad/.julia/packages/AdvancedHMC/MIxdK/src/hamiltonian.jl:47
Sampling:  38%|███████████████▋                         |  ETA: 0:00:02┌ Warning: The current proposal will be rejected due to numerical error(s).
│   isfinite.((θ, r, ℓπ, ℓκ)) = (true, false, false, false)
└ @ AdvancedHMC /Users/brad/.julia/packages/AdvancedHMC/MIxdK/src/hamiltonian.jl:47
Sampling: 100%|█████████████████████████████████████████| Time: 0:00:03

Chains MCMC chain (1000×22×1 Array{Float64,3}):

Iterations        = 1:1000
Thinning interval = 1
Chains            = 1
Samples per chain = 1000
parameters        = aptitude[1], aptitude[2], aptitude[3], difficulty[1], difficulty[2], difficulty[3], difficulty[4], difficulty[5], discr[1], discr[2], discr[3], discr[4], discr[5]
internals         = acceptance_rate, hamiltonian_energy, hamiltonian_energy_error, is_accept, log_density, lp, n_steps, nom_step_size, step_size

Summary Statistics
     parameters      mean       std   naive_se      mcse       ess      rhat 
         Symbol   Float64   Float64    Float64   Float64   Float64   Float64 

    aptitude[1]    1.2991    2.5842     0.0817    0.6109   15.4744    1.0188
    aptitude[2]   -0.4991    2.5806     0.0816    0.7655    4.9059    1.1926
    aptitude[3]   -1.8909    3.3642     0.1064    0.9387    6.3762    1.1386
  difficulty[1]   -2.1490    3.1013     0.0981    0.8999    3.7623    1.3383
  difficulty[2]   -3.7273    1.9332     0.0611    0.4666   14.1960    1.0006
  difficulty[3]   -3.8600    4.1205     0.1303    1.2739    2.9892    1.7147
  difficulty[4]    3.3615    2.0557     0.0650    0.4997    5.5086    1.3375
  difficulty[5]   -2.0982    3.6023     0.1139    1.0232    3.3308    1.5764
       discr[1]    4.2717   13.8163     0.4369    2.4126   24.7575    1.0508
       discr[2]    4.3264   11.0954     0.3509    1.7288   36.1833    1.0601
       discr[3]    1.1143    3.4122     0.1079    0.6271   13.4357    1.0971
       discr[4]    7.9696   12.9932     0.4109    1.7747   55.6834    1.0197
       discr[5]    2.0524    4.9542     0.1567    0.8716   17.5768    1.0340

Quantiles
     parameters       2.5%     25.0%     50.0%     75.0%     97.5% 
         Symbol    Float64   Float64   Float64   Float64   Float64 

    aptitude[1]    -2.7132   -0.5813    0.8765    2.8893    7.2904
    aptitude[2]    -4.8826   -2.5355   -0.6822    1.5337    4.2364
    aptitude[3]    -8.0107   -4.8561   -1.1829    0.8619    3.2192
  difficulty[1]   -10.9402   -3.1502   -1.5466   -0.1949    2.7139
  difficulty[2]    -8.1221   -4.9271   -3.3535   -2.3146   -0.8306
  difficulty[3]   -11.1577   -7.2108   -3.4431   -0.7276    3.0153
  difficulty[4]    -1.2933    2.1900    3.4221    4.8106    6.8774
  difficulty[5]    -7.7892   -4.9410   -2.8640    0.7741    4.9000
       discr[1]     0.0254    0.1579    0.6579    2.1601   36.1958
       discr[2]     0.0213    0.3799    0.8490    2.7462   38.2713
       discr[3]     0.0074    0.0353    0.0986    0.5403    8.5258
       discr[4]     0.0623    0.5002    2.8361   10.1297   45.3600
       discr[5]     0.0119    0.0991    0.2953    1.3438   16.8309

Guessing Behavior

It's common knowledge that guessing is advantageous on the SATs if you can eliminate at least 1 answer. This is because there are usually 5 responses and an incorrect response is penalized by 1/4 point. In the previous examples we assumed that question difficulty and student aptitude accounted for a span of possible $P(\mathrm{correct})$ covering $(0,1)$, but if the test taker can opportunistically guess (ie on a multiple choice test) then the true probabilities have some other lower bound, $(\delta, 1), \delta > 0$.

Modifying our first model to account for this is relatively straightforward:

$$\begin{aligned} \delta &\sim \mathrm{Beta}(1, 2)\\ \alpha_s &\sim \mathrm{Normal}(0,5)\\ \gamma_q &\sim \mathrm{Normal}(0,5)\\ \beta_{s, q} &= \delta + (1-\delta)\mathrm{logit}^{-1}(\alpha_s - \gamma_q)\\ \mathrm{correct_{s,q}} &\sim \mathrm{Bernoulli}(\beta_{s,q})\\ \end{aligned}$$

@model function irt_guess(correct::Array{Bool}, student::Array{Int64}, question::Array{Int64})
    aptitude = Vector(undef, maximum(student))
    difficulty = Vector(undef, maximum(question))
    
    # priors
    for i in 1:length(aptitude)
        aptitude[i] ~ Normal(0,5)
    end
    for i in 1:length(difficulty)
        difficulty[i] ~ Normal(0,5)
    end
    guess_factor ~ Beta(1,2)
    
    β = Vector(undef, length(correct))
    for i in 1:length(correct)
        β[i] = aptitude[student[i]] - difficulty[question[i]]
        correct[i] ~ Bernoulli(guess_factor + (1-guess_factor)*inv_logit(β[i]))
    end
end;

irt_guess_ch = sample(
    irt_guess(correct, student, question), 
    HMC(ϵ, τ), iterations, 
    progress=true, drop_warmup=true)

Sampling: 100%|█████████████████████████████████████████| Time: 0:00:00

Chains MCMC chain (1000×18×1 Array{Float64,3}):

Iterations        = 1:1000
Thinning interval = 1
Chains            = 1
Samples per chain = 1000
parameters        = aptitude[1], aptitude[2], aptitude[3], difficulty[1], difficulty[2], difficulty[3], difficulty[4], difficulty[5], guess_factor
internals         = acceptance_rate, hamiltonian_energy, hamiltonian_energy_error, is_accept, log_density, lp, n_steps, nom_step_size, step_size

Summary Statistics
     parameters      mean       std   naive_se      mcse       ess      rhat 
         Symbol   Float64   Float64    Float64   Float64   Float64   Float64 

    aptitude[1]    2.4869    2.5707     0.0813    0.7223    3.2350    1.6621
    aptitude[2]    0.0467    2.1501     0.0680    0.4649    4.9934    1.3398
    aptitude[3]   -4.6788    2.3852     0.0754    0.5883   12.5814    1.0098
  difficulty[1]   -5.4882    3.0988     0.0980    0.7927   11.2340    1.0712
  difficulty[2]   -1.3732    1.8141     0.0574    0.3498   12.9492    1.1089
  difficulty[3]    2.2928    3.9825     0.1259    1.2095    2.8536    1.6963
  difficulty[4]    5.0951    1.9785     0.0626    0.4414    5.7468    1.2786
  difficulty[5]   -0.9364    2.5749     0.0814    0.6004   17.2439    0.9997
   guess_factor    0.2061    0.1453     0.0046    0.0167   49.5365    0.9990

Quantiles
     parameters       2.5%     25.0%     50.0%     75.0%     97.5% 
         Symbol    Float64   Float64   Float64   Float64   Float64 

    aptitude[1]    -2.7677    0.6181    2.6535    4.2176    7.0921
    aptitude[2]    -4.4554   -1.3947    0.2112    1.5970    3.9855
    aptitude[3]    -9.1785   -6.4195   -4.4813   -2.8343   -0.2408
  difficulty[1]   -10.9864   -7.4053   -5.9243   -3.7231    2.0833
  difficulty[2]    -5.1880   -2.6083   -1.2140    0.0306    1.6720
  difficulty[3]    -3.9818   -0.5712    1.5472    3.9268   11.2302
  difficulty[4]     1.4102    3.7990    5.0364    6.2808    9.6207
  difficulty[5]    -7.2441   -2.3121   -0.8704    0.7449    3.9909
   guess_factor     0.0136    0.0825    0.1808    0.3095    0.5229

Two Kinds of Questions

Students aren't universally adept at answering questions of different types, so let's add that to the model! For questions of type $t_i$ (ie $t(q)=t_i$), we apply the student's aptitude from that question type.

$$\begin{aligned} \alpha_{s, t} &\sim \mathrm{Normal}(0,5)\\ \gamma_q &\sim \mathrm{Normal}(0,5)\\ \beta_{s, q, t} &= \mathrm{logit}^{-1}(\alpha_{s,t(q)} - \gamma_q)\\ \mathrm{correct_{s,q}} &\sim \mathrm{Bernoulli}(\beta_{s,q})\\ \end{aligned}$$

More fun (but maybe too much fun for this post ) is that with multiple question types it would be pretty simple to bake in correlations in student aptitude across question types.

question_types = [1,2,1,2,1,2,1,2,1,2,1,2]

@model function irt_2types(
        correct::Array{Bool}, 
        student::Array{Int64}, 
        question::Array{Int64}, question_type::Array{Int64}
    )
    aptitude_1 = Vector(undef, maximum(student))
    aptitude_2 = Vector(undef, maximum(student))
    difficulty = Vector(undef, maximum(question))
    
    # priors
    for i in 1:length(aptitude_1)
        aptitude_1[i] ~ Normal(0,5)
        aptitude_2[i] ~ Normal(0,5)
    end
    for i in 1:length(difficulty)
        difficulty[i] ~ Normal(0,5)
    end
    
    β = Vector(undef, length(correct))
    for i in 1:length(correct)
        if question_type[i] == 1
            β[i] = aptitude_1[student[i]] - difficulty[question[i]]
        else
            β[i] = aptitude_2[student[i]] - difficulty[question[i]]
        end
        correct[i] ~ Bernoulli(inv_logit(β[i]))
    end
end;

irt_2types_ch = sample(
    irt_2types(correct, student, question, question_types), 
    HMC(ϵ, τ), iterations, 
    progress=true, drop_warmup=true)

Sampling: 100%|█████████████████████████████████████████| Time: 0:00:00

Chains MCMC chain (1000×20×1 Array{Float64,3}):

Iterations        = 1:1000
Thinning interval = 1
Chains            = 1
Samples per chain = 1000
parameters        = aptitude_1[1], aptitude_1[2], aptitude_1[3], aptitude_2[1], aptitude_2[2], aptitude_2[3], difficulty[1], difficulty[2], difficulty[3], difficulty[4], difficulty[5]
internals         = acceptance_rate, hamiltonian_energy, hamiltonian_energy_error, is_accept, log_density, lp, n_steps, nom_step_size, step_size

Summary Statistics
     parameters      mean       std   naive_se      mcse       ess      rhat 
         Symbol   Float64   Float64    Float64   Float64   Float64   Float64 

  aptitude_1[1]    4.1125    2.8127     0.0889    0.7651   12.7687    1.0016
  aptitude_1[2]    2.0198    2.9273     0.0926    0.7668    4.0284    1.4551
  aptitude_1[3]   -4.7104    2.8762     0.0910    0.8053    6.7425    1.0645
  aptitude_2[1]    3.2786    3.0173     0.0954    0.7514    5.4254    1.2778
  aptitude_2[2]   -0.6002    2.5990     0.0822    0.7241    5.6558    1.1660
  aptitude_2[3]    1.9599    3.2568     0.1030    0.8387   10.0610    1.1281
  difficulty[1]   -2.7944    3.3292     0.1053    0.8177   14.5871    1.0535
  difficulty[2]   -3.8994    3.5653     0.1127    1.0350    2.7201    1.9978
  difficulty[3]    0.7524    2.8349     0.0896    0.8140    8.2372    1.0174
  difficulty[4]    9.5894    3.2614     0.1031    0.8923    3.0798    1.6956
  difficulty[5]   -2.1624    3.0569     0.0967    0.8869    5.3197    1.1394

Quantiles
     parameters       2.5%     25.0%     50.0%     75.0%     97.5% 
         Symbol    Float64   Float64   Float64   Float64   Float64 

  aptitude_1[1]    -0.7504    2.0848    4.1293    5.5508   10.6943
  aptitude_1[2]    -3.0840   -0.3602    1.8441    4.4293    7.2316
  aptitude_1[3]    -9.6865   -6.8462   -5.0387   -2.2494    0.4745
  aptitude_2[1]    -1.0901    0.8448    2.8353    5.0615    9.8895
  aptitude_2[2]    -5.0098   -2.5673   -1.0476    1.4656    4.3434
  aptitude_2[3]    -3.6354   -0.6446    1.8742    4.6117    7.8919
  difficulty[1]    -8.3563   -5.2372   -3.3224   -0.2932    3.9075
  difficulty[2]   -10.8610   -6.4562   -3.6604   -0.9632    2.1568
  difficulty[3]    -4.5461   -1.2926    0.4437    2.8840    6.4110
  difficulty[4]     3.5535    7.4417    9.4532   11.5941   16.2606
  difficulty[5]    -6.8662   -4.6146   -2.6714    0.1172    3.9333

Test-taker Fatigue

Imagine the test is several hours long. The test taker is pretty likely to perform differently (let's assume worse) by the end of the test, and that fatigue factor is probably pretty specific to the person. Thus, for the $i^{th}$ question we introduce a linear penalty as a first stab at the idea:

$$\begin{aligned} \alpha_s &\sim \mathrm{Normal}(0,5)\\ \phi_s &\sim \mathrm{LogNormal}(0,1)\\ \gamma_q &\sim \mathrm{Normal}(0,5)\\ \beta_{s, q, i} &= \mathrm{logit}^{-1}(\alpha_s - \gamma_q - i\phi_s)\\ \mathrm{correct_{s,q,i}} &\sim \mathrm{Bernoulli}(\beta_{s,q,i})\\ \end{aligned}$$

question_seq = [1,2,3,4, 1,2,3,4, 1,2,3,4]
@model function irt_fatigue(
        correct::Array{Bool}, student::Array{Int64}, 
        question::Array{Int64}, question_seq::Array{Int64}
    )
    aptitude = Vector(undef, maximum(student))
    wimpiness = Vector(undef, maximum(student))
    difficulty = Vector(undef, maximum(question))
    
    # priors
    for i in 1:length(aptitude)
        aptitude[i] ~ Normal(0,5)
        wimpiness[i] ~ LogNormal(0,2)
    end
    for i in 1:length(difficulty)
        difficulty[i] ~ Normal(0,5)
    end        
    
    β = Vector(undef, length(correct))
    for i in 1:length(correct)
        β[i] = aptitude[student[i]] - difficulty[question[i]] - wimpiness[student[i]] * i * question_seq[i]
        correct[i] ~ Bernoulli(inv_logit(β[i]))
    end
end;

irt_fatigue_ch = sample(
    irt_fatigue(correct, student, question, question_seq), 
    HMC(ϵ, τ), iterations, 
    progress=true, drop_warmup=true)

Sampling: 100%|█████████████████████████████████████████| Time: 0:00:00

Chains MCMC chain (1000×20×1 Array{Float64,3}):

Iterations        = 1:1000
Thinning interval = 1
Chains            = 1
Samples per chain = 1000
parameters        = aptitude[1], aptitude[2], aptitude[3], difficulty[1], difficulty[2], difficulty[3], difficulty[4], difficulty[5], wimpiness[1], wimpiness[2], wimpiness[3]
internals         = acceptance_rate, hamiltonian_energy, hamiltonian_energy_error, is_accept, log_density, lp, n_steps, nom_step_size, step_size

Summary Statistics
     parameters      mean       std   naive_se      mcse       ess      rhat 
         Symbol   Float64   Float64    Float64   Float64   Float64   Float64 

    aptitude[1]    5.5689    3.2672     0.1033    0.9661    5.5541    1.0900
    aptitude[2]   -0.6428    2.2023     0.0696    0.5558    5.5129    1.2839
    aptitude[3]   -2.0090    2.7835     0.0880    0.7741    6.2247    1.1309
  difficulty[1]   -7.5951    3.1442     0.0994    0.8897   10.2621    1.0804
  difficulty[2]   -3.5479    3.0344     0.0960    0.8538    2.8263    1.9409
  difficulty[3]    0.3819    2.0660     0.0653    0.4835   15.8760    1.0060
  difficulty[4]    5.3452    3.8158     0.1207    1.0966    6.8170    1.1893
  difficulty[5]   -3.1397    2.0446     0.0647    0.5193    7.1755    1.2067
   wimpiness[1]    0.3146    0.2430     0.0077    0.0457   22.9879    0.9995
   wimpiness[2]    0.0530    0.0488     0.0015    0.0073   44.2052    1.0160
   wimpiness[3]    0.0864    0.0808     0.0026    0.0220    4.2962    1.2917

Quantiles
     parameters       2.5%     25.0%     50.0%     75.0%     97.5% 
         Symbol    Float64   Float64   Float64   Float64   Float64 

    aptitude[1]    -0.0760    2.7155    5.5546    8.5856   10.8985
    aptitude[2]    -4.8763   -2.0208   -0.7854    0.5966    4.2086
    aptitude[3]    -6.2673   -4.0574   -2.2741   -0.5126    4.9629
  difficulty[1]   -13.9438   -9.4788   -7.9346   -5.4993   -0.8206
  difficulty[2]    -8.2400   -5.8861   -3.8871   -1.4654    2.6733
  difficulty[3]    -3.2067   -1.1498    0.2113    1.9139    4.7973
  difficulty[4]    -1.0724    2.2905    5.4179    8.2336   12.3365
  difficulty[5]    -6.8396   -4.4525   -3.1455   -1.9591    1.3959
   wimpiness[1]     0.0278    0.1198    0.2535    0.4515    0.9220
   wimpiness[2]     0.0016    0.0162    0.0393    0.0767    0.1814
   wimpiness[3]     0.0028    0.0211    0.0614    0.1266    0.2839

Note on the model specifications

You might be wondering "where did these prior values come from?" or "how did Brad choose these distributions? Why Normal instead of, I dunno, t?". Good questions! The answer is I didn't think too hard and just wrote down the first thing that seemed reasonable either in terms of the values ($\mathrm{logit}^{-1}(5)$ is a very high - 99-ish% - but not insurmountable level of confidence) or theoretical properties (basically, choose a simple distribution with the right domain and range).

Perhaps you're also wondering "what's up with the random mixing in of Greek letters?" You got me there.

Comparison to Collaborative Filtering

Item Response may seem very similar to collaborative filtering, so it's worth highlighting the differences.

Collaborative filtering aims to complete a sparsely determined preferences/ratings matrix of consumer - item scores (e.g. "User 513 gave product 149 3.5 s") $M$. A common approach is alternating least squares which iteratively factors the matrix into a product-feature matrix $P$ and a customer-preference matrix $C$. The goal is to create these so that their product "accurately completes" $M$, ie if $CP = \overline{M}$ then the difference $M - \overline{M}$ is small wherever we have entries for $M$ (remember, $M$ is incomplete).

A key fact is that the matrix $\overline{M}$ (the list of recommendations) is the important output here, and the factors $C$ and $P$ are intriguing but not critical. This is different from Item Response where the background variables describing difficulty and aptitude for each question, student are the primary desired outputs (but we could infer $P(\mathrm{correct} | \mathrm{student\_id}, \mathrm{question\_id})$ for unseen pairings!).

The other distinction worth mentioning is that the IR models have enormous flexibility in how they inform the probability of success, as we've seen above. Collaborative filtering, at least with ALS, is just optimizing a matrix factorization task. Since $\overline{M} = CP$, the user-product score can only be the dot product of the product-feature vector and the customer-preference vector, it attempts to encode "how well do the consumer preferences overlap with the product features." It does not lend itself to any extensions to capture more domain-specific nuance as we did here.