Important: This is just a draft! I haven’t 1) finished creating reasonable data sets for each model or 2) made even one pretty picture! You’re stuck with the chain summaries for now!

```julia
using Turing
using Bijectors
using Gadfly
using DataFrames, DataFramesMeta

Gadfly.set_default_plot_size(900px, 300px)
```


# Item Response

Item response models are used to simultaneously make inferences about two interacting populations: commonly, a population of test questions and a population of test takers (students), with the observed result (success/failure) of each student on each question they've seen. This is an interesting problem in that, even in the basic case:

- students have different levels of aptitude
- questions have different levels of difficulty
- not every student sees every question
- not every question needs to be seen by the same number of students
- we should be able to make relative inferences between students (resp. questions) that have no overlapping questions (resp. students)
- the data is nonetheless fairly simple: [correct (Boolean), student_id (categorical), question_id (categorical)]

I love these models because they're easy to extend in an intuitive way. I'm going to add a few random bells and whistles to the most vanilla version, and if you're interested, the Stan user guide has some good content on this topic and many others.

# Some IRT Models

## Vanilla Item-Response (aka 1PL)

For each student $s$, we have an aptitude $\alpha_s$ and for each question $q$ we have a difficulty $\gamma_q$. The likelihood of a correct response is informed by the difference between these two quantities:

$$
\begin{aligned}
\alpha_s &\sim \mathrm{Normal}(0,5)\\
\gamma_q &\sim \mathrm{Normal}(0,5)\\
\beta_{s,q} &= \mathrm{logit}^{-1}(\alpha_s - \gamma_q)\\
\mathrm{correct}_{s,q} &\sim \mathrm{Bernoulli}(\beta_{s,q})
\end{aligned}
$$
```julia
logit = bijector(Beta())   # bijection:  (0, 1)  →  ℝ
inv_logit = inv(logit)     # bijection:       ℝ  →  (0, 1)

student  = [1,1,1,1,2,2,2,2,3,3,3,3]
question = [1,2,3,4,2,3,4,5,3,4,5,1]
correct  = [
    true,  true,  true,  false,
    true,  false, false, true,
    false, false, false, true];
```


Some observations on the toy data:

- Everyone got question 1 correct (expect this to be rated as low difficulty)
- Everyone got question 4 wrong (high difficulty)
- Student 1 got all tested questions correct except question 4
- Student 3 got all tested questions incorrect except question 1
- Question 5 was seen only by students 2 (correct) and 3 (incorrect)

So, here's the model set up in Turing, and the result of the sampler below.

```julia
@model function irt_1pl(correct::Array{Bool}, student::Array{Int64}, question::Array{Int64})
    aptitude   = Vector(undef, maximum(student))
    difficulty = Vector(undef, maximum(question))

    # priors
    for i in 1:length(aptitude)
        aptitude[i] ~ Normal(0, 5)
    end
    for i in 1:length(difficulty)
        difficulty[i] ~ Normal(0, 5)
    end

    # likelihood
    β = Vector(undef, length(correct))
    for i in 1:length(correct)
        β[i] = aptitude[student[i]] - difficulty[question[i]]
        correct[i] ~ Bernoulli(inv_logit(β[i]))
    end
end;
```

```julia
# Settings of the Hamiltonian Monte Carlo (HMC) sampler.
iterations = 1000
ϵ = 0.05   # leapfrog step size
τ = 10;    # number of leapfrog steps per iteration

irt_1pl_ch = sample(
    irt_1pl(correct, student, question),
    HMC(ϵ, τ), iterations,
    progress=true, drop_warmup=true)
```

```
Sampling: 100%|█████████████████████████████████████████| Time: 0:00:00

Chains MCMC chain (1000×17×1 Array{Float64,3}):

Iterations        = 1:1000
Thinning interval = 1
Chains            = 1
Samples per chain = 1000
parameters        = aptitude[1], aptitude[2], aptitude[3], difficulty[1], difficulty[2], difficulty[3], difficulty[4], difficulty[5]
internals         = acceptance_rate, hamiltonian_energy, hamiltonian_energy_error, is_accept, log_density, lp, n_steps, nom_step_size, step_size

Summary Statistics
parameters      mean       std   naive_se      mcse       ess      rhat
Symbol   Float64   Float64    Float64   Float64   Float64   Float64

aptitude[1]    3.2740    1.9350     0.0612    0.3207   22.7341    1.0402
aptitude[2]    0.9977    2.2009     0.0696    0.5345    5.0679    1.2607
aptitude[3]   -2.7975    3.1851     0.1007    0.9097    3.2885    1.5668
difficulty[1]   -6.6429    4.3688     0.1382    1.3718    2.4382    2.1207
difficulty[2]   -3.1028    2.2644     0.0716    0.5146   13.6893    1.0655
difficulty[3]    2.0820    2.0146     0.0637    0.4029   10.0855    1.1187
difficulty[4]    5.3785    2.3564     0.0745    0.6279   10.9535    1.0266
difficulty[5]   -0.7591    3.7529     0.1187    1.0978    3.7110    1.4890

Quantiles
parameters       2.5%      25.0%     50.0%     75.0%     97.5%
Symbol    Float64    Float64   Float64   Float64   Float64

aptitude[1]     0.1558     1.7989    3.0630    4.7257    6.9881
aptitude[2]    -2.7293    -0.6033    0.8091    2.4061    5.2246
aptitude[3]    -9.5331    -4.5425   -2.9342   -0.7317    2.7270
difficulty[1]   -12.9759   -10.1684   -7.3905   -3.1434    2.1283
difficulty[2]    -9.1851    -4.3815   -2.6233   -1.5957    0.3462
difficulty[3]    -2.0318     0.6111    2.2832    3.5960    5.3615
difficulty[4]     1.1812     3.6750    5.2200    6.9466   10.4091
difficulty[5]    -7.9556    -2.9220   -1.3910    1.0610    8.3417
```


Interesting! The only surprise for me is that I expected a wider spread for difficulty[5], but otherwise this looks very reasonable.
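
If you want to poke at a marginal yourself, here's one way (a quick sketch; I'm assuming the usual chain-to-DataFrame conversion is available in your MCMCChains version):

```julia
using Statistics

# Posterior quantiles for difficulty[5], straight from the chain.
df = DataFrame(irt_1pl_ch)
quantile(df[!, "difficulty[5]"], [0.025, 0.25, 0.5, 0.75, 0.975])
```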

## Question Quality (aka 2PL)

The purpose of asking questions is to probe the aptitude of the test taker, and some questions will do a much better job of guaranteeing a minimum skill level given a successful response. This is called "discrimination". Intuitively, a highly discriminating question would magnify the difference between a student's ability and the question's difficulty, so that both

- students with sufficient aptitude are more likely to succeed
- students with insufficient aptitude are more likely to fail

We can see that $\eta$ will accomplish this in the model below:

$$
\begin{aligned}
\alpha_s &\sim \mathrm{Normal}(0,5)\\
\gamma_q &\sim \mathrm{Normal}(0,5)\\
\eta_q &\sim \mathrm{LogNormal}(0,2)\\
\beta_{s,q} &= \mathrm{logit}^{-1}\bigl(\eta_q\,(\alpha_s - \gamma_q)\bigr)\\
\mathrm{correct}_{s,q} &\sim \mathrm{Bernoulli}(\beta_{s,q})
\end{aligned}
$$
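A quick numeric check of that claim (my own illustration, using the `inv_logit` defined above): sweeping $\eta$ for a fixed aptitude gap of $\pm 1$.

```julia
for η in (0.1, 1.0, 5.0)
    println("η = $η:  P(gap = +1) = ", round(inv_logit(η * 1.0), digits=3),
            "   P(gap = -1) = ", round(inv_logit(η * -1.0), digits=3))
end
# η = 0.1 gives 0.525 vs 0.475 (barely informative);
# η = 5.0 gives 0.993 vs 0.007 (nearly deterministic).
```

And here's the model in Turing: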
```julia
@model function irt_2pl(correct::Array{Bool}, student::Array{Int64}, question::Array{Int64})
    aptitude   = Vector(undef, maximum(student))
    difficulty = Vector(undef, maximum(question))
    discr      = Vector(undef, maximum(question))

    # priors
    for i in 1:length(aptitude)
        aptitude[i] ~ Normal(0, 5)
    end
    for i in 1:length(difficulty)
        difficulty[i] ~ Normal(0, 5)
    end
    for i in 1:length(discr)
        discr[i] ~ LogNormal(0, 2)
    end

    # likelihood
    β = Vector(undef, length(correct))
    for i in 1:length(correct)
        β[i] = discr[question[i]] * (aptitude[student[i]] - difficulty[question[i]])
        correct[i] ~ Bernoulli(inv_logit(β[i]))
    end
end;
```


```julia
irt_2pl_ch = sample(
    irt_2pl(correct, student, question),
    HMC(ϵ, τ), iterations,
    progress=true, drop_warmup=true)
```

```
Sampling:  37%|███████████████▏                         |  ETA: 0:00:02┌ Warning: The current proposal will be rejected due to numerical error(s).
│   isfinite.((θ, r, ℓπ, ℓκ)) = (true, false, false, false)
└ @ AdvancedHMC /Users/brad/.julia/packages/AdvancedHMC/MIxdK/src/hamiltonian.jl:47
Sampling:  38%|███████████████▋                         |  ETA: 0:00:02┌ Warning: The current proposal will be rejected due to numerical error(s).
│   isfinite.((θ, r, ℓπ, ℓκ)) = (true, false, false, false)
└ @ AdvancedHMC /Users/brad/.julia/packages/AdvancedHMC/MIxdK/src/hamiltonian.jl:47
Sampling: 100%|█████████████████████████████████████████| Time: 0:00:03

Chains MCMC chain (1000×22×1 Array{Float64,3}):

Iterations        = 1:1000
Thinning interval = 1
Chains            = 1
Samples per chain = 1000
parameters        = aptitude[1], aptitude[2], aptitude[3], difficulty[1], difficulty[2], difficulty[3], difficulty[4], difficulty[5], discr[1], discr[2], discr[3], discr[4], discr[5]
internals         = acceptance_rate, hamiltonian_energy, hamiltonian_energy_error, is_accept, log_density, lp, n_steps, nom_step_size, step_size

Summary Statistics
parameters      mean       std   naive_se      mcse       ess      rhat
Symbol   Float64   Float64    Float64   Float64   Float64   Float64

aptitude[1]    1.2991    2.5842     0.0817    0.6109   15.4744    1.0188
aptitude[2]   -0.4991    2.5806     0.0816    0.7655    4.9059    1.1926
aptitude[3]   -1.8909    3.3642     0.1064    0.9387    6.3762    1.1386
difficulty[1]   -2.1490    3.1013     0.0981    0.8999    3.7623    1.3383
difficulty[2]   -3.7273    1.9332     0.0611    0.4666   14.1960    1.0006
difficulty[3]   -3.8600    4.1205     0.1303    1.2739    2.9892    1.7147
difficulty[4]    3.3615    2.0557     0.0650    0.4997    5.5086    1.3375
difficulty[5]   -2.0982    3.6023     0.1139    1.0232    3.3308    1.5764
discr[1]    4.2717   13.8163     0.4369    2.4126   24.7575    1.0508
discr[2]    4.3264   11.0954     0.3509    1.7288   36.1833    1.0601
discr[3]    1.1143    3.4122     0.1079    0.6271   13.4357    1.0971
discr[4]    7.9696   12.9932     0.4109    1.7747   55.6834    1.0197
discr[5]    2.0524    4.9542     0.1567    0.8716   17.5768    1.0340

Quantiles
parameters       2.5%     25.0%     50.0%     75.0%     97.5%
Symbol    Float64   Float64   Float64   Float64   Float64

aptitude[1]    -2.7132   -0.5813    0.8765    2.8893    7.2904
aptitude[2]    -4.8826   -2.5355   -0.6822    1.5337    4.2364
aptitude[3]    -8.0107   -4.8561   -1.1829    0.8619    3.2192
difficulty[1]   -10.9402   -3.1502   -1.5466   -0.1949    2.7139
difficulty[2]    -8.1221   -4.9271   -3.3535   -2.3146   -0.8306
difficulty[3]   -11.1577   -7.2108   -3.4431   -0.7276    3.0153
difficulty[4]    -1.2933    2.1900    3.4221    4.8106    6.8774
difficulty[5]    -7.7892   -4.9410   -2.8640    0.7741    4.9000
discr[1]     0.0254    0.1579    0.6579    2.1601   36.1958
discr[2]     0.0213    0.3799    0.8490    2.7462   38.2713
discr[3]     0.0074    0.0353    0.0986    0.5403    8.5258
discr[4]     0.0623    0.5002    2.8361   10.1297   45.3600
discr[5]     0.0119    0.0991    0.2953    1.3438   16.8309
```


## Guessing Behavior

It's common knowledge that guessing is advantageous on the SATs if you can eliminate at least one answer: there are usually five responses and an incorrect response is penalized by 1/4 point, so a blind guess breaks even in expectation and anything better has positive expected value. In the previous examples we assumed that question difficulty and student aptitude accounted for a span of possible $P(\mathrm{correct})$ covering $(0,1)$, but if the test taker can guess opportunistically (i.e. on a multiple-choice test) then the true probabilities have a higher lower bound: they live in $(\delta, 1)$ for some $\delta > 0$.

Modifying our first model to account for this is relatively straightforward:

$$
\begin{aligned}
\delta &\sim \mathrm{Beta}(1, 2)\\
\alpha_s &\sim \mathrm{Normal}(0,5)\\
\gamma_q &\sim \mathrm{Normal}(0,5)\\
\beta_{s,q} &= \delta + (1-\delta)\,\mathrm{logit}^{-1}(\alpha_s - \gamma_q)\\
\mathrm{correct}_{s,q} &\sim \mathrm{Bernoulli}(\beta_{s,q})
\end{aligned}
$$
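A quick check (mine, not part of the model) that $\delta$ really acts as a floor on the success probability:

```julia
δ = 0.2
p(β) = δ + (1 - δ) * inv_logit(β)
# Even a hopeless student (β very negative) succeeds with probability ≈ δ.
p(-10), p(0), p(10)   # ≈ (0.200, 0.600, 1.000)
```

In Turing: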
```julia
@model function irt_guess(correct::Array{Bool}, student::Array{Int64}, question::Array{Int64})
    aptitude   = Vector(undef, maximum(student))
    difficulty = Vector(undef, maximum(question))

    # priors
    for i in 1:length(aptitude)
        aptitude[i] ~ Normal(0, 5)
    end
    for i in 1:length(difficulty)
        difficulty[i] ~ Normal(0, 5)
    end
    guess_factor ~ Beta(1, 2)

    # likelihood
    β = Vector(undef, length(correct))
    for i in 1:length(correct)
        β[i] = aptitude[student[i]] - difficulty[question[i]]
        correct[i] ~ Bernoulli(guess_factor + (1 - guess_factor) * inv_logit(β[i]))
    end
end;
```


```julia
irt_guess_ch = sample(
    irt_guess(correct, student, question),
    HMC(ϵ, τ), iterations,
    progress=true, drop_warmup=true)
```

```
Sampling: 100%|█████████████████████████████████████████| Time: 0:00:00

Chains MCMC chain (1000×18×1 Array{Float64,3}):

Iterations        = 1:1000
Thinning interval = 1
Chains            = 1
Samples per chain = 1000
parameters        = aptitude[1], aptitude[2], aptitude[3], difficulty[1], difficulty[2], difficulty[3], difficulty[4], difficulty[5], guess_factor
internals         = acceptance_rate, hamiltonian_energy, hamiltonian_energy_error, is_accept, log_density, lp, n_steps, nom_step_size, step_size

Summary Statistics
parameters      mean       std   naive_se      mcse       ess      rhat
Symbol   Float64   Float64    Float64   Float64   Float64   Float64

aptitude[1]    2.4869    2.5707     0.0813    0.7223    3.2350    1.6621
aptitude[2]    0.0467    2.1501     0.0680    0.4649    4.9934    1.3398
aptitude[3]   -4.6788    2.3852     0.0754    0.5883   12.5814    1.0098
difficulty[1]   -5.4882    3.0988     0.0980    0.7927   11.2340    1.0712
difficulty[2]   -1.3732    1.8141     0.0574    0.3498   12.9492    1.1089
difficulty[3]    2.2928    3.9825     0.1259    1.2095    2.8536    1.6963
difficulty[4]    5.0951    1.9785     0.0626    0.4414    5.7468    1.2786
difficulty[5]   -0.9364    2.5749     0.0814    0.6004   17.2439    0.9997
guess_factor    0.2061    0.1453     0.0046    0.0167   49.5365    0.9990

Quantiles
parameters       2.5%     25.0%     50.0%     75.0%     97.5%
Symbol    Float64   Float64   Float64   Float64   Float64

aptitude[1]    -2.7677    0.6181    2.6535    4.2176    7.0921
aptitude[2]    -4.4554   -1.3947    0.2112    1.5970    3.9855
aptitude[3]    -9.1785   -6.4195   -4.4813   -2.8343   -0.2408
difficulty[1]   -10.9864   -7.4053   -5.9243   -3.7231    2.0833
difficulty[2]    -5.1880   -2.6083   -1.2140    0.0306    1.6720
difficulty[3]    -3.9818   -0.5712    1.5472    3.9268   11.2302
difficulty[4]     1.4102    3.7990    5.0364    6.2808    9.6207
difficulty[5]    -7.2441   -2.3121   -0.8704    0.7449    3.9909
guess_factor     0.0136    0.0825    0.1808    0.3095    0.5229
```


## Two Kinds of Questions

Students aren't universally adept across different types of questions, so let's add that to the model! For a question $q$ of type $t(q)$, we apply the student's aptitude for that question type.

$$
\begin{aligned}
\alpha_{s,t} &\sim \mathrm{Normal}(0,5)\\
\gamma_q &\sim \mathrm{Normal}(0,5)\\
\beta_{s,q} &= \mathrm{logit}^{-1}(\alpha_{s,t(q)} - \gamma_q)\\
\mathrm{correct}_{s,q} &\sim \mathrm{Bernoulli}(\beta_{s,q})
\end{aligned}
$$

More fun (but maybe too much fun for this post) is that with multiple question types it would be pretty simple to bake in correlations in student aptitude across question types; here's a rough sketch of that idea.
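
This sketch is mine and isn't fit anywhere in this post; it assumes the `LKJ` prior from Distributions.jl and Turing's `filldist`, neither of which appears elsewhere here.

```julia
using LinearAlgebra

# Rough sketch: jointly-distributed per-type aptitudes with an LKJ prior
# on their correlation matrix and per-type scales σ.
@model function irt_2types_corr(
    correct::Array{Bool}, student::Array{Int64},
    question::Array{Int64}, question_type::Array{Int64}
)
    n_students = maximum(student)

    Ω ~ LKJ(2, 1.0)                                   # 2×2 correlation matrix
    σ ~ filldist(truncated(Normal(0, 5), 0, Inf), 2)  # per-type scales
    Σ = Symmetric(Diagonal(σ) * Ω * Diagonal(σ))      # covariance

    # One aptitude 2-vector per student, drawn jointly (2 × n_students).
    aptitude ~ filldist(MvNormal(zeros(2), Σ), n_students)

    difficulty = Vector(undef, maximum(question))
    for i in 1:length(difficulty)
        difficulty[i] ~ Normal(0, 5)
    end

    for i in 1:length(correct)
        β = aptitude[question_type[i], student[i]] - difficulty[question[i]]
        correct[i] ~ Bernoulli(inv_logit(β))
    end
end;
```

But back to the uncorrelated version: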

```julia
question_types = [1,2,1,2,1,2,1,2,1,2,1,2]

@model function irt_2types(
    correct::Array{Bool},
    student::Array{Int64},
    question::Array{Int64}, question_type::Array{Int64}
)
    aptitude_1 = Vector(undef, maximum(student))
    aptitude_2 = Vector(undef, maximum(student))
    difficulty = Vector(undef, maximum(question))

    # priors
    for i in 1:length(aptitude_1)
        aptitude_1[i] ~ Normal(0, 5)
        aptitude_2[i] ~ Normal(0, 5)
    end
    for i in 1:length(difficulty)
        difficulty[i] ~ Normal(0, 5)
    end

    # likelihood: pick the aptitude matching the question's type
    β = Vector(undef, length(correct))
    for i in 1:length(correct)
        if question_type[i] == 1
            β[i] = aptitude_1[student[i]] - difficulty[question[i]]
        else
            β[i] = aptitude_2[student[i]] - difficulty[question[i]]
        end
        correct[i] ~ Bernoulli(inv_logit(β[i]))
    end
end;
```


```julia
irt_2types_ch = sample(
    irt_2types(correct, student, question, question_types),
    HMC(ϵ, τ), iterations,
    progress=true, drop_warmup=true)
```

```
Sampling: 100%|█████████████████████████████████████████| Time: 0:00:00

Chains MCMC chain (1000×20×1 Array{Float64,3}):

Iterations        = 1:1000
Thinning interval = 1
Chains            = 1
Samples per chain = 1000
parameters        = aptitude_1[1], aptitude_1[2], aptitude_1[3], aptitude_2[1], aptitude_2[2], aptitude_2[3], difficulty[1], difficulty[2], difficulty[3], difficulty[4], difficulty[5]
internals         = acceptance_rate, hamiltonian_energy, hamiltonian_energy_error, is_accept, log_density, lp, n_steps, nom_step_size, step_size

Summary Statistics
parameters      mean       std   naive_se      mcse       ess      rhat
Symbol   Float64   Float64    Float64   Float64   Float64   Float64

aptitude_1[1]    4.1125    2.8127     0.0889    0.7651   12.7687    1.0016
aptitude_1[2]    2.0198    2.9273     0.0926    0.7668    4.0284    1.4551
aptitude_1[3]   -4.7104    2.8762     0.0910    0.8053    6.7425    1.0645
aptitude_2[1]    3.2786    3.0173     0.0954    0.7514    5.4254    1.2778
aptitude_2[2]   -0.6002    2.5990     0.0822    0.7241    5.6558    1.1660
aptitude_2[3]    1.9599    3.2568     0.1030    0.8387   10.0610    1.1281
difficulty[1]   -2.7944    3.3292     0.1053    0.8177   14.5871    1.0535
difficulty[2]   -3.8994    3.5653     0.1127    1.0350    2.7201    1.9978
difficulty[3]    0.7524    2.8349     0.0896    0.8140    8.2372    1.0174
difficulty[4]    9.5894    3.2614     0.1031    0.8923    3.0798    1.6956
difficulty[5]   -2.1624    3.0569     0.0967    0.8869    5.3197    1.1394

Quantiles
parameters       2.5%     25.0%     50.0%     75.0%     97.5%
Symbol    Float64   Float64   Float64   Float64   Float64

aptitude_1[1]    -0.7504    2.0848    4.1293    5.5508   10.6943
aptitude_1[2]    -3.0840   -0.3602    1.8441    4.4293    7.2316
aptitude_1[3]    -9.6865   -6.8462   -5.0387   -2.2494    0.4745
aptitude_2[1]    -1.0901    0.8448    2.8353    5.0615    9.8895
aptitude_2[2]    -5.0098   -2.5673   -1.0476    1.4656    4.3434
aptitude_2[3]    -3.6354   -0.6446    1.8742    4.6117    7.8919
difficulty[1]    -8.3563   -5.2372   -3.3224   -0.2932    3.9075
difficulty[2]   -10.8610   -6.4562   -3.6604   -0.9632    2.1568
difficulty[3]    -4.5461   -1.2926    0.4437    2.8840    6.4110
difficulty[4]     3.5535    7.4417    9.4532   11.5941   16.2606
difficulty[5]    -6.8662   -4.6146   -2.6714    0.1172    3.9333
```


## Test-taker Fatigue

Imagine the test is several hours long. The test taker is pretty likely to perform differently (let's assume worse) by the end of the test, and that fatigue factor is probably pretty specific to the person. Thus, for the $i^{th}$ question we introduce a linear penalty as a first stab at the idea:

$$
\begin{aligned}
\alpha_s &\sim \mathrm{Normal}(0,5)\\
\phi_s &\sim \mathrm{LogNormal}(0,2)\\
\gamma_q &\sim \mathrm{Normal}(0,5)\\
\beta_{s,q,i} &= \mathrm{logit}^{-1}(\alpha_s - \gamma_q - i\,\phi_s)\\
\mathrm{correct}_{s,q,i} &\sim \mathrm{Bernoulli}(\beta_{s,q,i})
\end{aligned}
$$
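To see the shape of the penalty (my illustration, not part of the fit): with a fixed aptitude gap and wimpiness $\phi = 0.5$, success probability decays with position in the test.

```julia
# α = 2, γ = 0, ϕ = 0.5: P(correct) at sequence positions 1 through 4.
let α = 2.0, γ = 0.0, ϕ = 0.5
    [round(inv_logit(α - γ - i * ϕ), digits=3) for i in 1:4]
end
# ≈ [0.818, 0.731, 0.622, 0.5]
```

The Turing version: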
```julia
question_seq = [1,2,3,4, 1,2,3,4, 1,2,3,4]

@model function irt_fatigue(
    correct::Array{Bool}, student::Array{Int64},
    question::Array{Int64}, question_seq::Array{Int64}
)
    aptitude   = Vector(undef, maximum(student))
    wimpiness  = Vector(undef, maximum(student))
    difficulty = Vector(undef, maximum(question))

    # priors
    for i in 1:length(aptitude)
        aptitude[i] ~ Normal(0, 5)
        wimpiness[i] ~ LogNormal(0, 2)
    end
    for i in 1:length(difficulty)
        difficulty[i] ~ Normal(0, 5)
    end

    # likelihood: the fatigue penalty grows linearly with the question's
    # position in the student's sequence
    β = Vector(undef, length(correct))
    for i in 1:length(correct)
        β[i] = aptitude[student[i]] - difficulty[question[i]] -
               wimpiness[student[i]] * question_seq[i]
        correct[i] ~ Bernoulli(inv_logit(β[i]))
    end
end;
```


```julia
irt_fatigue_ch = sample(
    irt_fatigue(correct, student, question, question_seq),
    HMC(ϵ, τ), iterations,
    progress=true, drop_warmup=true)
```

```
Sampling: 100%|█████████████████████████████████████████| Time: 0:00:00

Chains MCMC chain (1000×20×1 Array{Float64,3}):

Iterations        = 1:1000
Thinning interval = 1
Chains            = 1
Samples per chain = 1000
parameters        = aptitude[1], aptitude[2], aptitude[3], difficulty[1], difficulty[2], difficulty[3], difficulty[4], difficulty[5], wimpiness[1], wimpiness[2], wimpiness[3]
internals         = acceptance_rate, hamiltonian_energy, hamiltonian_energy_error, is_accept, log_density, lp, n_steps, nom_step_size, step_size

Summary Statistics
parameters      mean       std   naive_se      mcse       ess      rhat
Symbol   Float64   Float64    Float64   Float64   Float64   Float64

aptitude[1]    5.5689    3.2672     0.1033    0.9661    5.5541    1.0900
aptitude[2]   -0.6428    2.2023     0.0696    0.5558    5.5129    1.2839
aptitude[3]   -2.0090    2.7835     0.0880    0.7741    6.2247    1.1309
difficulty[1]   -7.5951    3.1442     0.0994    0.8897   10.2621    1.0804
difficulty[2]   -3.5479    3.0344     0.0960    0.8538    2.8263    1.9409
difficulty[3]    0.3819    2.0660     0.0653    0.4835   15.8760    1.0060
difficulty[4]    5.3452    3.8158     0.1207    1.0966    6.8170    1.1893
difficulty[5]   -3.1397    2.0446     0.0647    0.5193    7.1755    1.2067
wimpiness[1]    0.3146    0.2430     0.0077    0.0457   22.9879    0.9995
wimpiness[2]    0.0530    0.0488     0.0015    0.0073   44.2052    1.0160
wimpiness[3]    0.0864    0.0808     0.0026    0.0220    4.2962    1.2917

Quantiles
parameters       2.5%     25.0%     50.0%     75.0%     97.5%
Symbol    Float64   Float64   Float64   Float64   Float64

aptitude[1]    -0.0760    2.7155    5.5546    8.5856   10.8985
aptitude[2]    -4.8763   -2.0208   -0.7854    0.5966    4.2086
aptitude[3]    -6.2673   -4.0574   -2.2741   -0.5126    4.9629
difficulty[1]   -13.9438   -9.4788   -7.9346   -5.4993   -0.8206
difficulty[2]    -8.2400   -5.8861   -3.8871   -1.4654    2.6733
difficulty[3]    -3.2067   -1.1498    0.2113    1.9139    4.7973
difficulty[4]    -1.0724    2.2905    5.4179    8.2336   12.3365
difficulty[5]    -6.8396   -4.4525   -3.1455   -1.9591    1.3959
wimpiness[1]     0.0278    0.1198    0.2535    0.4515    0.9220
wimpiness[2]     0.0016    0.0162    0.0393    0.0767    0.1814
wimpiness[3]     0.0028    0.0211    0.0614    0.1266    0.2839
```


# Note on the model specifications

You might be wondering "where did these prior values come from?" or "how did Brad choose these distributions? Why Normal instead of, I dunno, t?". Good questions! The answer is I didn't think too hard and just wrote down the first thing that seemed reasonable, either in terms of the values ($\mathrm{logit}^{-1}(5)$ is a very high, roughly 99%, but not insurmountable level of confidence) or theoretical properties (basically, choose a simple distribution with the right domain and range).
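
(A quick check of that 99% claim, using the `inv_logit` bijection defined at the top of the post:)

```julia
inv_logit(5)   # ≈ 0.9933
```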

Perhaps you're also wondering "what's up with the random mixing in of Greek letters?" You got me there.

# Comparison to Collaborative Filtering

Item Response may seem very similar to collaborative filtering, so it's worth highlighting the differences.

Collaborative filtering aims to complete a sparsely observed preferences/ratings matrix $M$ of consumer-item scores (e.g. "User 513 gave product 149 3.5 stars"). A common approach is alternating least squares (ALS), which iteratively factors the matrix into a product-feature matrix $P$ and a customer-preference matrix $C$. The goal is to construct these so that their product "accurately completes" $M$; i.e. if $CP = \overline{M}$, then the difference $M - \overline{M}$ is small wherever we have entries for $M$ (remember, $M$ is incomplete).
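
For concreteness, here's a minimal ALS sketch (my illustration, definitely not production code): ridge-regularized alternating solves over only the observed entries.

```julia
using LinearAlgebra

# Factor a partially observed ratings matrix M ≈ C * P, fitting only the
# non-missing entries, then return the completed matrix M̄ = C * P.
function als(M::Matrix{Union{Missing,Float64}}, k::Int; λ=0.1, iters=20)
    n_users, n_items = size(M)
    C = randn(n_users, k)
    P = randn(k, n_items)
    for _ in 1:iters
        for i in 1:n_users                      # user step, items fixed
            js = findall(!ismissing, M[i, :])
            A = P[:, js]
            C[i, :] = (A * A' + λ * I) \ (A * Float64.(M[i, js]))
        end
        for j in 1:n_items                      # item step, users fixed
            is = findall(!ismissing, M[:, j])
            B = C[is, :]
            P[:, j] = (B' * B + λ * I) \ (B' * Float64.(M[is, j]))
        end
    end
    C * P
end;
```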

A key fact is that the matrix $\overline{M}$ (the list of recommendations) is the important output here; the factors $C$ and $P$ are intriguing but not critical. This is different from Item Response, where the background variables describing difficulty and aptitude for each question and student are the primary desired outputs (though we could infer $P(\mathrm{correct} \mid \mathrm{student\_id}, \mathrm{question\_id})$ for unseen pairings!).
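
As a sketch of that (again assuming the chain-to-DataFrame conversion used earlier): the posterior predictive probability that student 1, who never saw question 5, would answer it correctly under the 1PL model.

```julia
using Statistics

df = DataFrame(irt_1pl_ch)
# Average P(correct) = inv_logit(aptitude - difficulty) over posterior draws.
mean(inv_logit.(df[!, "aptitude[1]"] .- df[!, "difficulty[5]"]))
```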

The other distinction worth mentioning is that the IR models have enormous flexibility in how they inform the probability of success, as we've seen above. Collaborative filtering, at least with ALS, is just optimizing a matrix factorization task: since $\overline{M} = CP$, a user-product score can only ever be the dot product of a product-feature vector and a customer-preference vector. In other words, it encodes "how well do the consumer's preferences overlap with the product's features?" and doesn't lend itself to the kind of domain-specific extensions we made here.