Item Response Models

using Turing
using Bijectors
using Gadfly
using DataFrames, DataFramesMeta
Gadfly.set_default_plot_size(900px, 300px)
Item Response
Item response models are used to make simultaneous inferences about two interacting populations: commonly, a population of test questions and a population of test takers (students), where the observed data are the results (success/failure) of each student on each question they've seen. This is an interesting problem because, even in the basic case:
- students have different levels of aptitude
- questions have different levels of difficulty
- not every student sees every question
- not every question needs to be seen by the same number of students
- we should be able to make relative inferences between students (resp. questions) that have no overlapping questions (resp. students)
- the data is nonetheless fairly simple:
[correct (Boolean), student_id (categorical), question_id (categorical)]
I love these models because they're easy to extend in an intuitive way. I'm going to add a few random bells and whistles to the most vanilla version, and if you're interested the Stan user guide has some good content on this topic and many others.
Vanilla Item-Response (aka 1PL)
For each student $s$, we have an aptitude $\alpha_s$ and for each question $q$ we have a difficulty $\gamma_q$. The likelihood of a correct response is informed by the difference between these two quantities:
$$\begin{aligned} \alpha_s &\sim \mathrm{Normal}(0,5)\\ \gamma_q &\sim \mathrm{Normal}(0,5)\\ \beta_{s, q} &= \mathrm{logit}^{-1}(\alpha_s - \gamma_q)\\ \mathrm{correct_{s,q}} &\sim \mathrm{Bernoulli}(\beta_{s,q})\\ \end{aligned}$$

logit = bijector(Beta())  # bijection: (0, 1) → ℝ
inv_logit = inv(logit)    # bijection: ℝ → (0, 1)
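As a quick sanity check (this snippet is my addition, not part of the original post), the bijector pair really is the familiar logit/logistic pair:

logit(0.5)                   # ≈ 0.0
inv_logit(0.0)               # = 0.5
inv_logit(logit(0.9)) ≈ 0.9  # round trip: true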
student  = [1,1,1,1,2,2,2,2,3,3,3,3]
question = [1,2,3,4,2,3,4,5,3,4,5,1]
correct  = [
    true,  true,  true,  false,
    true,  false, false, true,
    false, false, false, true];
Some observations on the toy data (tabulated in the snippet after this list):
- Everyone got question 1 correct (expect this to be rated as low difficulty)
- Everyone got question 4 wrong (high difficulty)
- Student 1 got all tested questions correct except question 4
- Student 3 got all tested questions incorrect except question 1
- Question 5 was only seen by student 3 (incorrect)
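Since DataFrames is already loaded, here's one way to tabulate those observations (my addition; the Statistics import is an extra line the post doesn't show):

using Statistics: mean

toy = DataFrame(student=student, question=question, correct=correct)
combine(groupby(toy, :question), :correct => mean => :accuracy, nrow => :attempts)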
So, here's the model set up in Turing, and the result of the sampler below.
@model function irt_1pl(correct::Array{Bool}, student::Array{Int64}, question::Array{Int64})
    aptitude = Vector(undef, maximum(student))
    difficulty = Vector(undef, maximum(question))
    # priors
    for i in 1:length(aptitude)
        aptitude[i] ~ Normal(0,5)
    end
    for i in 1:length(difficulty)
        difficulty[i] ~ Normal(0,5)
    end
    β = Vector(undef, length(correct))
    for i in 1:length(correct)
        β[i] = aptitude[student[i]] - difficulty[question[i]]
        correct[i] ~ Bernoulli(inv_logit(β[i]))
    end
end;
# Settings of the Hamiltonian Monte Carlo (HMC) sampler.
iterations = 1000
ϵ = 0.05  # leapfrog step size
τ = 10;   # number of leapfrog steps
irt_1pl_ch = sample(
    irt_1pl(correct, student, question),
    HMC(ϵ, τ), iterations,
    progress=true, drop_warmup=true)
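If you want a quick numeric summary of the posterior (this line is my addition), MCMCChains' describe prints means, standard deviations, and quantiles for every parameter:

describe(irt_1pl_ch)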
Interesting! The only surprise for me is that I expected a wider spread for difficulty[5], but otherwise it looks very reasonable!
Question Quality (aka 2PL)
The purpose of asking questions is to probe the aptitude of the test taker, and some questions will do a much better job than others of guaranteeing a minimum skill level given a successful response. This is called "discrimination". Intuitively, a highly discriminating question magnifies the difference between a student's ability and the question's difficulty (see the quick check after this list), so that both
- students with sufficient aptitude are more likely to succeed
- students with insufficient aptitude are more likely to fail
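To make that concrete, here's a quick numeric check (my addition) using the inv_logit defined earlier, with an aptitude-minus-difficulty gap of 1:

gap = 1.0
inv_logit(0.5 * gap), inv_logit(3.0 * gap)    # low vs. high discrimination: ≈ (0.62, 0.95)
inv_logit(-0.5 * gap), inv_logit(-3.0 * gap)  # same, but the gap is negative: ≈ (0.38, 0.05)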
We can see that $\eta$ will accomplish this in the model below:
$$\begin{aligned} \alpha_s &\sim \mathrm{Normal}(0,5)\\ \gamma_q &\sim \mathrm{Normal}(0,5)\\ \eta_q &\sim \mathrm{LogNormal}(0,2)\\ \beta_{s, q} &= \mathrm{logit}^{-1}(\eta_q \cdot (\alpha_s - \gamma_q))\\ \mathrm{correct_{s,q}} &\sim \mathrm{Bernoulli}(\beta_{s,q})\\ \end{aligned}$$

@model function irt_2pl(correct::Array{Bool}, student::Array{Int64}, question::Array{Int64})
    aptitude = Vector(undef, maximum(student))
    difficulty = Vector(undef, maximum(question))
    discr = Vector(undef, maximum(question))
    # priors
    for i in 1:length(aptitude)
        aptitude[i] ~ Normal(0,5)
    end
    for i in 1:length(difficulty)
        difficulty[i] ~ Normal(0,5)
    end
    for i in 1:length(discr)
        discr[i] ~ LogNormal(0,2)
    end
    β = Vector(undef, length(correct))
    for i in 1:length(correct)
        β[i] = discr[question[i]] * (aptitude[student[i]] - difficulty[question[i]])
        correct[i] ~ Bernoulli(inv_logit(β[i]))
    end
end;
irt_2pl_ch = sample(
    irt_2pl(correct, student, question),
    HMC(ϵ, τ), iterations,
    progress=true, drop_warmup=true)
Guessing Behavior
It's common knowledge that guessing is advantageous on the SATs if you can eliminate at least one answer: there are usually 5 responses and an incorrect response is penalized by 1/4 point. In the previous examples we assumed that question difficulty and student aptitude accounted for a span of possible $P(\mathrm{correct})$ covering $(0,1)$, but if the test taker can opportunistically guess (i.e. on a multiple-choice test) then the true probabilities have a higher lower bound: they lie in $(\delta, 1)$ for some $\delta > 0$.
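For intuition (my addition), with a guessing floor of δ = 0.2 even a badly overmatched student succeeds about a fifth of the time, while a strong student is barely affected:

δ = 0.2
δ + (1 - δ) * inv_logit(-4.0)  # ≈ 0.21
δ + (1 - δ) * inv_logit(4.0)   # ≈ 0.99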
Modifying our first model to account for this is relatively straightforward:
$$\begin{aligned} \delta &\sim \mathrm{Beta}(1, 2)\\ \alpha_s &\sim \mathrm{Normal}(0,5)\\ \gamma_q &\sim \mathrm{Normal}(0,5)\\ \beta_{s, q} &= \delta + (1-\delta)\mathrm{logit}^{-1}(\alpha_s - \gamma_q)\\ \mathrm{correct_{s,q}} &\sim \mathrm{Bernoulli}(\beta_{s,q})\\ \end{aligned}$$

@model function irt_guess(correct::Array{Bool}, student::Array{Int64}, question::Array{Int64})
    aptitude = Vector(undef, maximum(student))
    difficulty = Vector(undef, maximum(question))
    # priors
    for i in 1:length(aptitude)
        aptitude[i] ~ Normal(0,5)
    end
    for i in 1:length(difficulty)
        difficulty[i] ~ Normal(0,5)
    end
    guess_factor ~ Beta(1,2)
    β = Vector(undef, length(correct))
    for i in 1:length(correct)
        β[i] = aptitude[student[i]] - difficulty[question[i]]
        correct[i] ~ Bernoulli(guess_factor + (1-guess_factor)*inv_logit(β[i]))
    end
end;
irt_guess_ch = sample(
    irt_guess(correct, student, question),
    HMC(ϵ, τ), iterations,
    progress=true, drop_warmup=true)
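Gadfly is already loaded, so a quick look at the posterior of the guessing floor might go something like this (my addition; I'm assuming the usual MCMCChains indexing by parameter name):

plot(x=vec(Array(irt_guess_ch[:guess_factor])), Geom.histogram,
     Guide.xlabel("guess_factor"), Guide.ylabel("posterior samples"))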
Two Kinds of Questions
Students aren't universally adept at answering questions of different types, so let's add that to the model! For questions of type $t_i$ (i.e. those with $t(q) = t_i$), we apply the student's aptitude for that question type.
$$\begin{aligned} \alpha_{s, t} &\sim \mathrm{Normal}(0,5)\\ \gamma_q &\sim \mathrm{Normal}(0,5)\\ \beta_{s, q} &= \mathrm{logit}^{-1}(\alpha_{s,t(q)} - \gamma_q)\\ \mathrm{correct_{s,q}} &\sim \mathrm{Bernoulli}(\beta_{s,q})\\ \end{aligned}$$

More fun (but maybe too much fun for this post) is that with multiple question types it would be pretty simple to bake in correlations in student aptitude across question types; there's a sketch of that after the sampler run below.
question_types = [1,2,1,2,1,2,1,2,1,2,1,2]
@model function irt_2types(
    correct::Array{Bool}, student::Array{Int64},
    question::Array{Int64}, question_type::Array{Int64}
)
    aptitude_1 = Vector(undef, maximum(student))
    aptitude_2 = Vector(undef, maximum(student))
    difficulty = Vector(undef, maximum(question))
    # priors
    for i in 1:length(aptitude_1)
        aptitude_1[i] ~ Normal(0,5)
        aptitude_2[i] ~ Normal(0,5)
    end
    for i in 1:length(difficulty)
        difficulty[i] ~ Normal(0,5)
    end
    β = Vector(undef, length(correct))
    for i in 1:length(correct)
        if question_type[i] == 1
            β[i] = aptitude_1[student[i]] - difficulty[question[i]]
        else
            β[i] = aptitude_2[student[i]] - difficulty[question[i]]
        end
        correct[i] ~ Bernoulli(inv_logit(β[i]))
    end
end;
irt_2types_ch = sample(
    irt_2types(correct, student, question, question_types),
    HMC(ϵ, τ), iterations,
    progress=true, drop_warmup=true)
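And here's the correlated-aptitude idea mentioned above, as a minimal sketch (my own extension, not something from the post). With only two question types, the correlation can be expressed through a single parameter ρ by conditioning the second aptitude on the first, keeping both marginals at Normal(0, 5):

@model function irt_2types_corr(
    correct::Array{Bool}, student::Array{Int64},
    question::Array{Int64}, question_type::Array{Int64}
)
    ρ ~ Uniform(-1, 1)  # correlation between the two per-type aptitudes
    aptitude_1 = Vector(undef, maximum(student))
    aptitude_2 = Vector(undef, maximum(student))
    difficulty = Vector(undef, maximum(question))
    for i in 1:length(aptitude_1)
        aptitude_1[i] ~ Normal(0, 5)
        # conditional of a bivariate normal: same Normal(0,5) marginal, correlation ρ
        aptitude_2[i] ~ Normal(ρ * aptitude_1[i], 5 * sqrt(1 - ρ^2))
    end
    for i in 1:length(difficulty)
        difficulty[i] ~ Normal(0, 5)
    end
    for i in 1:length(correct)
        apt = question_type[i] == 1 ? aptitude_1[student[i]] : aptitude_2[student[i]]
        correct[i] ~ Bernoulli(inv_logit(apt - difficulty[question[i]]))
    end
end;

Sampling it would look just like the other models, e.g. sample(irt_2types_corr(correct, student, question, question_types), HMC(ϵ, τ), iterations).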
Test-taker Fatigue
Imagine the test is several hours long. The test taker is pretty likely to perform differently (let's assume worse) by the end of the test, and that fatigue factor is probably pretty specific to the person. So, for the question in position $i$ of a student's test sequence, we introduce a linear penalty as a first stab at the idea:
$$\begin{aligned} \alpha_s &\sim \mathrm{Normal}(0,5)\\ \phi_s &\sim \mathrm{LogNormal}(0,2)\\ \gamma_q &\sim \mathrm{Normal}(0,5)\\ \beta_{s, q, i} &= \mathrm{logit}^{-1}(\alpha_s - \gamma_q - i\phi_s)\\ \mathrm{correct_{s,q,i}} &\sim \mathrm{Bernoulli}(\beta_{s,q,i})\\ \end{aligned}$$

question_seq = [1,2,3,4, 1,2,3,4, 1,2,3,4]
@model function irt_fatigue(
    correct::Array{Bool}, student::Array{Int64},
    question::Array{Int64}, question_seq::Array{Int64}
)
    aptitude = Vector(undef, maximum(student))
    wimpiness = Vector(undef, maximum(student))
    difficulty = Vector(undef, maximum(question))
    # priors
    for i in 1:length(aptitude)
        aptitude[i] ~ Normal(0,5)
        wimpiness[i] ~ LogNormal(0,2)
    end
    for i in 1:length(difficulty)
        difficulty[i] ~ Normal(0,5)
    end
    β = Vector(undef, length(correct))
    for i in 1:length(correct)
        # fatigue penalty grows linearly with the question's position in the sequence
        β[i] = aptitude[student[i]] - difficulty[question[i]] - wimpiness[student[i]] * question_seq[i]
        correct[i] ~ Bernoulli(inv_logit(β[i]))
    end
end;
irt_fatigue_ch = sample(
    irt_fatigue(correct, student, question, question_seq),
    HMC(ϵ, τ), iterations,
    progress=true, drop_warmup=true)
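To see what the fatigue term does (my addition), here's the implied success probability over the four positions in the sequence for a hypothetical student with aptitude 2 and wimpiness 0.5, facing questions of difficulty 0:

[inv_logit(2.0 - 0.0 - 0.5 * pos) for pos in 1:4]  # ≈ [0.82, 0.73, 0.62, 0.50]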
Note on the model specifications
You might be wondering "where did these prior values come from?" or "how did Brad choose these distributions? Why Normal instead of, I dunno, t?" Good questions! The answer is I didn't think too hard and just wrote down the first thing that seemed reasonable, either in terms of the values ($\mathrm{logit}^{-1}(5)$ is a very high, roughly 99%, but not insurmountable level of confidence) or in terms of theoretical properties (basically, choose a simple distribution with the right domain and range).
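That "roughly 99%" claim is easy to verify with the bijector from the top of the post (the check itself is my addition):

inv_logit(5.0)  # ≈ 0.9933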
Perhaps you're also wondering "what's up with the random mixing in of Greek letters?" You got me there.
Comparison to Collaborative Filtering
Item Response may seem very similar to collaborative filtering, so it's worth highlighting the differences.
Collaborative filtering aims to complete a sparsely observed matrix $M$ of consumer-item preference/rating scores (e.g. "User 513 gave product 149 3.5 stars"). A common approach is alternating least squares (ALS), which iteratively factors the matrix into a product-feature matrix $P$ and a customer-preference matrix $C$. The goal is to construct these so that their product "accurately completes" $M$, i.e. if $CP = \overline{M}$ then the difference $M - \overline{M}$ is small wherever we have entries for $M$ (remember, $M$ is incomplete).
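To make the "accurately completes" idea concrete, here's a tiny illustration (entirely my addition, with made-up numbers): a 3×4 ratings matrix with missing entries, random factors C and P, and the loss ALS would drive down on the observed entries only.

M = [5.0     missing 3.0     missing;
     missing 4.0     missing 2.0;
     1.0     missing missing 5.0]
k = 2                 # latent feature dimension
C = randn(3, k)       # customer-preference matrix
P = randn(k, 4)       # product-feature matrix
Mhat = C * P          # the completed matrix, i.e. the recommendations
loss = sum(abs2, skipmissing(M .- Mhat))  # only observed entries contribute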
A key fact is that the matrix $\overline{M}$ (the list of recommendations) is the important output here; the factors $C$ and $P$ are intriguing but not critical. This is different from Item Response, where the background variables describing the difficulty of each question and the aptitude of each student are the primary desired outputs (though we could also infer $P(\mathrm{correct} | \mathrm{student\_id}, \mathrm{question\_id})$ for unseen pairings!).
The other distinction worth mentioning is that the IR models have enormous flexibility in how they inform the probability of success, as we've seen above. Collaborative filtering, at least with ALS, is just optimizing a matrix factorization task. Since $\overline{M} = CP$, the user-product score can only be the dot product of a product-feature vector and a customer-preference vector; it encodes "how well do the consumer's preferences overlap with the product's features?" and doesn't lend itself to the kind of domain-specific extensions we added here.