Calibrating with a Weakly-Informative, Biased LLM (2 days ago)
The setup | Naive pooling inherits the bias | $\lambda$ moves efficiency, not bias | Choosing $\lambda$ | Takeaways | Reproducing
Choosing Lambda in Mixed-Subjects IRT (2 days ago)
Two objectives, two estimators | Example data | Ability-risk tuning: Minimizing $\mathbb{E}[g'\Sigma_\gamma g]$ | Cross-fit $\lambda$ tuning (recommended workflow) | Frozen expected-count estimator (fast approximation) | Minimizing $\text{Tr}\big[\Sigma_\gamma\big]$ (diagnostic only) | Choosing a procedure
IRT Linking and Gradient Asymmetry: Diagnostic Guide (2 days ago)
Background | Background (frozen expected-count estimator) | Linking implementations | Simulation | Fitting human and LLM models | Applying the three methods | Parameter alignment after linking | TCC alignment | Gradient asymmetry: what linking fixes and what it does not | Lambda sweep: how $\lambda$ interacts with linking quality | The role of power tuning | Validation: what does $\lambda^*$ measure? | Test A — Perfect paired surrogate ($F = Y$) | Test B — Partially overlapping predictions | Test C — Stochastic LLM predictions (practical baseline) | Summary: PPI++ score vs. ability risk | Summary of findings | Recommendation | The marginal-MML fix
Mixed-Subjects IRT Calibration (2 days ago)
Simulate example data | Step 1: Fit the human baseline | Step 2: Fit the MML mixed-subjects model | Step 3: Select $\lambda$ by ability-score risk | Step 3b (recommended workflow): cross-fit $\lambda$ tuning | Step 4: Inspect the covariance | Compare calibrations | When the LLM is uninformative | Validation
Per-Item Lambda (Experimental) (2 days ago)
Why per-item lambda? | Simulate a heterogeneous test | Step 1: Fit 2PL baseline and get global scalar lambda | Step 2: PPI++ score per item (fast diagnostic) | Step 3: Per-item ability-risk tuning | Step 4: Compare scalar vs. per-item parameter recovery | Important note on initialization | Approximation caveat
Simulation Validation of the Mixed-Subjects MML Estimator (2 days ago)
Design | Does $\lambda$-selection track predictor quality? | Do standard errors achieve appropriate coverage? | Does the method improve downstream scoring? | What is the role of cross-fitting? | Is coverage valid at the tuned $\lambda$? | Summary | Reproducing these results
Understanding Ability-Risk Tuning (2 days ago)
Why this vignette exists | Key Intuition | The three response matrices | 1. Observed human responses: $O$ | 2. Paired LLM-predicted human responses: $P$ | 3. Additional LLM-generated responses: $G$ | The mixed-subjects IRT objective | What lambda is learning | Ability-risk tuning | Why row alignment matters | Case A: perfect paired prediction | Case B: row-shuffled perfect predictions | Case C: same DGP, fresh Bernoulli draw | What kind of LLM data produces higher lambda? | One approach to row alignment: leave-one-item-out prediction | Another approach: covariate-based prediction | Something that probably won't work: item-text-only generation | How to generate $G$ | Summary | Technical Explanation | Overview: four objects, one objective | 1. The estimator and its estimating equation | 2. The sandwich covariance of $\hat\gamma$ | 3. Ability scoring and the implicit gradient | 4. Delta-method propagation and the risk | 5. Why this differs from the PPI++ trace objective
Mixed-Subjects 1PL Calibration (2 days ago)
Simulate a 1PL test | Step 1: Fit the 1PL baseline | Step 2: Fit mixed-subjects MML (1PL) | Step 3: Correct covariance — $(J+1) \times (J+1)$ sandwich | Step 4: Ability-score risk and lambda tuning | Step 5: Verify — F = Y gives lambda > 0 | Compare 1PL and 2PL | Ability-score risk: 1PL vs 2PL parameterization