Statistical Analysis Plan
Overview
This document is the integrated CO-LUMINATE Statistical Analysis Plan (SAP). It is organized by analytical objective and structured for staged publication. The current release includes:
- Section A: Objective 2 (Depression Harmonization)
- Section B: Objective 3 (Social Determinant Trajectories)
- Section C: Objective 4 (Behavioral Mediators)
Section D (Objective 5: Causal Mediation) remains an intentional placeholder pending the finalization of prespecified mediation model families and implementation diagnostics.
Section A: Objective 2 - Depression Harmonization SAP
A1 Background and Rationale
This SAP component defines the psychometric protocol for harmonizing depression measurement between the Shona Symptom Questionnaire (SSQ) and the Patient Health Questionnaire (PHQ-9). The methodological challenge is multi-instrument comparability in a longitudinal setting where cohorts differ in instrument availability and local validity context.
The PHQ-9 is treated as the external benchmark for probable depression (Kroenke, Spitzer, and Williams 2001), while the SSQ captures culturally grounded distress phenotypes relevant to Southern African settings (Patel et al. 1997; Haney et al. 2014; Chibanda et al. 2019). Harmonization therefore requires defensible latent-construct alignment, subgroup stability checks, and threshold calibration rather than simple raw-score translation (Putnick and Bornstein 2016; Woods 2009).
A3 Objectives and Estimands
Let \(Y_i^{PHQ} = \mathbb{1}\{PHQ9_i \ge 10\}\) denote PHQ-referenced probable depression for participant \(i\).
The primary estimand is the harmonized binary endpoint:
\[ Y_i^{H} = \mathbb{1}\{\hat{\theta}_i \ge \tau^\ast\}. \]
Parameter definitions (Eq. A3.1):
- \(Y_i^{H}\): harmonized depression endpoint for participant \(i\).
- \(\mathbb{1}\{\cdot\}\): indicator function (equals 1 when the condition is true, 0 otherwise).
- \(\hat{\theta}_i\): calibrated latent depression score for participant \(i\), estimated from joint PHQ-SSQ modeling.
- \(\tau^\ast\): pre-specified latent-score threshold linked to the PHQ-9 reference scale.
For the PHQ-referenced comparator used above, \(PHQ9_i\) denotes the total PHQ-9 score for participant \(i\), with the threshold of 10 defining probable depression on the benchmark instrument.
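As a minimal sketch of the two indicator definitions above, the following Python fragment applies the PHQ-9 cutoff of 10 and a latent threshold to a few records. The function names, the record layout, and the value `tau_star = 0.8` are hypothetical placeholders for illustration only; the project pipeline itself is implemented in R.

```python
PHQ9_CUTOFF = 10  # benchmark threshold stated in this SAP


def phq_endpoint(phq9_total):
    """Y^PHQ: 1 if the total PHQ-9 score is at least 10, else 0."""
    return int(phq9_total >= PHQ9_CUTOFF)


def harmonized_endpoint(theta_hat, tau_star):
    """Y^H: 1 if the calibrated latent score reaches tau*, else 0."""
    return int(theta_hat >= tau_star)


# Hypothetical values, for illustration only.
tau_star = 0.8
records = [
    {"phq9": 4, "theta_hat": -0.35},
    {"phq9": 10, "theta_hat": 0.80},
    {"phq9": 17, "theta_hat": 1.62},
]

y_phq = [phq_endpoint(r["phq9"]) for r in records]
y_h = [harmonized_endpoint(r["theta_hat"], tau_star) for r in records]
```

Both rules use a closed lower bound, so a record sitting exactly on the threshold is classified as positive.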
Secondary estimands include calibration transportability and prevalence alignment in SSQ-only cohorts:
\[ \Delta_p = \Pr(Y_i^{H}=1) - \Pr(Y_i^{PHQ}=1). \]
Parameter definitions (Eq. A3.2):
- \(\Delta_p\): prevalence drift between harmonized and PHQ-referenced endpoints.
- \(\Pr(Y_i^{H}=1)\): marginal prevalence under the harmonized endpoint.
- \(\Pr(Y_i^{PHQ}=1)\): marginal prevalence under the PHQ-referenced endpoint.
\(\Delta_p\) is interpreted as endpoint-prevalence drift under candidate thresholding strategies.
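The drift quantity can be illustrated with a short Python sketch that compares two candidate thresholding strategies against the same PHQ-referenced indicator. The indicator vectors and strategy labels are invented for illustration, and the sketch assumes a sample in which both endpoints are observed.

```python
def prevalence(indicator):
    """Marginal prevalence of a 0/1 indicator."""
    if not indicator:
        raise ValueError("indicator must be non-empty")
    return sum(indicator) / len(indicator)


def prevalence_drift(y_h, y_phq):
    """Delta_p = Pr(Y^H = 1) - Pr(Y^PHQ = 1)."""
    return prevalence(y_h) - prevalence(y_phq)


# Hypothetical indicators, for illustration only.
y_phq = [1, 0, 0, 1, 0, 0, 0, 1]
candidates = {
    "anchored": [1, 0, 0, 1, 0, 0, 0, 1],
    "sensitivity_prioritized": [1, 1, 0, 1, 0, 1, 0, 1],
}

drift = {name: prevalence_drift(y_h, y_phq) for name, y_h in candidates.items()}
```

A positive value means the harmonized endpoint classifies more participants as depressed than the PHQ-referenced endpoint does; a value of zero means the two prevalences agree, although individual classifications may still differ.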
A4 Study Design and Analytic Samples
The development sample comprises participants aged 16-24 years with concurrent SSQ and PHQ observations. The transport sample consists of SSQ-only records in linked cohorts.
Eligibility is staged by model:
- factor-analytic stages require complete item vectors for included indicators;
- IRT calibration requires complete item responses under the final item set;
- endpoint assignment is restricted to records with valid latent score estimates.
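The staged eligibility rules can be sketched as simple record filters. This is an illustrative Python fragment under assumed names: the item labels (`ssq_1`, `phq_1`), the `theta_hat` field, and the record layout are hypothetical and do not reflect the governed data dictionary.

```python
import math


def has_complete_items(record, items):
    """True when every listed item has a non-missing response."""
    return all(record.get(item) is not None for item in items)


def has_valid_score(record):
    """True when a finite latent score estimate is present."""
    theta = record.get("theta_hat")
    return theta is not None and math.isfinite(theta)


# Hypothetical records and item names, for illustration only.
records = [
    {"id": 1, "ssq_1": 1, "ssq_2": 0, "phq_1": 2, "theta_hat": 0.4},
    {"id": 2, "ssq_1": 1, "ssq_2": None, "phq_1": 1, "theta_hat": None},
    {"id": 3, "ssq_1": 0, "ssq_2": 1, "phq_1": 0, "theta_hat": float("nan")},
]
final_items = ["ssq_1", "ssq_2", "phq_1"]

# IRT calibration stage: complete responses under the final item set.
irt_ids = [r["id"] for r in records if has_complete_items(r, final_items)]
# Endpoint assignment stage: valid latent score estimates only.
endpoint_ids = [r["id"] for r in records if has_valid_score(r)]
```

Keeping one predicate per stage makes it possible to report the analytic sample size separately for each model.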
A5 Structural Modeling Strategy
For ordinal symptom items, exploratory and confirmatory analyses are estimated using polychoric correlation structures and robust estimators for ordered-categorical data (Holgado–Tello et al. 2010; Wu and Estabrook 2016).
Exploratory factor analysis (EFA) is run in Monte Carlo split samples to stabilize dimensionality decisions. Confirmatory factor analysis (CFA) is then evaluated on hold-out subsets with fit diagnostics interpreted jointly (CFI, RMSEA, SRMR) rather than by single-cutoff rules (Hu and Bentler 1999; MacCallum et al. 2002).
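A repeated random split of this kind might look like the following Python sketch. The split fraction, the number of repetitions, the seed, and the function name are assumptions made for illustration; the SAP does not fix them here.

```python
import random


def monte_carlo_splits(ids, n_splits, explore_fraction=0.5, seed=2024):
    """Yield (exploration, hold_out) ID lists for repeated random splits."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    n_explore = int(len(ids) * explore_fraction)
    for _ in range(n_splits):
        shuffled = list(ids)
        rng.shuffle(shuffled)
        yield shuffled[:n_explore], shuffled[n_explore:]


participant_ids = list(range(1, 11))  # hypothetical participant IDs
splits = list(monte_carlo_splits(participant_ids, n_splits=3))
```

Within each split the exploration and hold-out subsets are disjoint, so the EFA and CFA stages never see the same participant in the same repetition.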
Let \(\mathbf{x}_i\) denote the ordinal item vector and \(\theta_i\) the latent depression factor. The CFA measurement model is:
\[ \mathbf{x}_i \sim f(\theta_i; \boldsymbol{\lambda}, \boldsymbol{\tau}). \]
Parameter definitions (Eq. A5.1):
- \(\mathbf{x}_i\): vector of observed ordinal symptom responses for participant \(i\).
- \(\theta_i\): latent depression factor score.
- \(f(\cdot)\): ordinal measurement response function implied by the CFA specification.
- \(\boldsymbol{\lambda}\): factor-loading parameters.
- \(\boldsymbol{\tau}\): item-threshold parameters.
Measurement invariance and differential item functioning are evaluated across sex and age strata using parameter-constraint comparisons and practical fit-change diagnostics (Putnick and Bornstein 2016; Vandenberg and Lance 2000; Woods 2009).
A6 IRT Harmonization and Threshold Derivation
Harmonization uses graded response models (GRM) implemented in mirt (Samejima 1969; Chalmers 2012). For item \(j\) and category threshold \(k\):
\[ \Pr(X_{ij} \ge k \mid \theta_i) = \operatorname{logit}^{-1}\left[a_j(\theta_i - b_{jk})\right]. \]
Parameter definitions (Eq. A6.1):
- \(\Pr(X_{ij} \ge k \mid \theta_i)\): cumulative probability of endorsing category \(k\) or higher for item \(j\).
- \(X_{ij}\): ordinal response of participant \(i\) on item \(j\).
- \(\theta_i\): latent depression severity for participant \(i\).
- \(a_j\): GRM discrimination parameter for item \(j\).
- \(b_{jk}\): GRM threshold/location parameter for item \(j\) at threshold \(k\).
- \(\operatorname{logit}^{-1}(\cdot)\): inverse-logit link mapping linear predictors to probabilities.
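To make the GRM expression concrete, the Python sketch below evaluates the cumulative probabilities and differences adjacent ones to obtain category probabilities. The parameter values are hypothetical and the functions are not part of mirt; they only restate the formula above.

```python
import math


def inv_logit(x):
    """Inverse-logit link."""
    return 1.0 / (1.0 + math.exp(-x))


def grm_cumulative(theta, a, b_k):
    """Pr(X >= k | theta) = inv_logit(a * (theta - b_k))."""
    return inv_logit(a * (theta - b_k))


def grm_category_probs(theta, a, thresholds):
    """Pr(X = k | theta) for k = 0..K, given ordered thresholds b_1 < ... < b_K."""
    # Pr(X >= 0) is 1 and Pr(X >= K + 1) is 0 by definition.
    cumulative = [1.0] + [grm_cumulative(theta, a, b) for b in thresholds] + [0.0]
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical parameters for one four-category item.
a_j = 1.5
b_j = [-0.5, 0.7, 1.8]
probs = grm_category_probs(theta=0.7, a=a_j, thresholds=b_j)
```

When \(\theta_i = b_{jk}\), the cumulative probability is exactly 0.5, which is why \(b_{jk}\) is read as the location of threshold \(k\) on the latent scale.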
Joint PHQ-SSQ calibration maps both instruments to a common latent continuum. The anchored rule selects \(\tau^\ast\) via PHQ test-characteristic alignment to the clinical PHQ-9 benchmark near 10, while ROC/Youden and sensitivity-prioritized alternatives are retained for robustness comparison.
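One way to read the anchored rule is as a root-finding problem: find the latent value at which the expected PHQ-9 total score under the fitted GRM equals 10. The Python sketch below does this by bisection. It is an illustration under stated assumptions, not the project's implementation: the item parameters are invented, all nine items share the same values only for brevity, and the exact alignment rule used in the pipeline may differ.

```python
import math


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def expected_item_score(theta, a, thresholds):
    """E[X | theta] for an item scored 0..K: the sum of Pr(X >= k | theta)."""
    return sum(inv_logit(a * (theta - b)) for b in thresholds)


def phq_tcc(theta, items):
    """Test characteristic curve: expected total score at theta."""
    return sum(expected_item_score(theta, a, bs) for a, bs in items)


def anchor_threshold(items, target=10.0, lo=-6.0, hi=6.0, tol=1e-8):
    """Bisection for the theta at which the expected total equals the target."""
    if not phq_tcc(lo, items) <= target <= phq_tcc(hi, items):
        raise ValueError("target is outside the TCC range on [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if phq_tcc(mid, items) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical GRM parameters for nine PHQ-9 items (a_j, [b_j1, b_j2, b_j3]).
phq_items = [(1.5, [0.0, 1.0, 2.0])] * 9
tau_star = anchor_threshold(phq_items)
```

Bisection is sufficient here because the test characteristic curve is monotone increasing in \(\theta\) whenever all discrimination parameters are positive.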
A7 Performance Metrics and Uncertainty
Endpoint performance is summarized by:
- sensitivity: \(\Pr(Y^H=1 \mid Y^{PHQ}=1)\)
- specificity: \(\Pr(Y^H=0 \mid Y^{PHQ}=0)\)
- PPV, NPV, accuracy, AUC, and Cohen’s \(\kappa\)
- prevalence drift \(\Delta_p\)
Metric definitions: sensitivity is the probability that the harmonized endpoint is positive among PHQ-positive participants; specificity is the probability that the harmonized endpoint is negative among PHQ-negative participants; PPV and NPV denote positive and negative predictive value, respectively; AUC is the area under the receiver-operating-characteristic curve; and Cohen’s \(\kappa\) summarizes agreement beyond chance.
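The classification metrics defined above can be computed from the 2x2 cross-tabulation of the two endpoints, as in this Python sketch. The data are hypothetical, and AUC is omitted because it requires the continuous latent score rather than the binary endpoint.

```python
def safe_ratio(num, den):
    """Return num / den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")


def performance(y_h, y_phq):
    """Agreement metrics for Y^H against the PHQ-referenced endpoint."""
    pairs = list(zip(y_h, y_phq))
    tp = sum(1 for h, p in pairs if h == 1 and p == 1)
    fp = sum(1 for h, p in pairs if h == 1 and p == 0)
    fn = sum(1 for h, p in pairs if h == 0 and p == 1)
    tn = sum(1 for h, p in pairs if h == 0 and p == 0)
    n = tp + fp + fn + tn
    accuracy = safe_ratio(tp + tn, n)
    # Chance agreement from the marginal totals of the 2x2 table.
    p_exp = safe_ratio((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn), n * n)
    return {
        "sensitivity": safe_ratio(tp, tp + fn),
        "specificity": safe_ratio(tn, tn + fp),
        "ppv": safe_ratio(tp, tp + fp),
        "npv": safe_ratio(tn, tn + fn),
        "accuracy": accuracy,
        "kappa": safe_ratio(accuracy - p_exp, 1 - p_exp),
    }


# Hypothetical endpoints, for illustration only.
y_phq = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_h = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = performance(y_h, y_phq)
```

In this example there are 3 true positives, 1 false negative, 1 false positive, and 5 true negatives, giving sensitivity 0.75, accuracy 0.80, and \(\kappa = 7/12\).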
Interval estimation uses resampling-based uncertainty summaries where appropriate (Efron and Tibshirani 1994). Threshold selection is governed by a joint decision criterion balancing discrimination, calibration, and prevalence distortion rather than discrimination alone.
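As one example of a resampling-based summary, the Python sketch below computes a percentile bootstrap interval for the prevalence drift. The choice of the percentile method, the number of replicates, the seed, and the data are assumptions made for illustration; the SAP does not prescribe them in this section.

```python
import random


def drift(pairs):
    """Prevalence drift for a list of (y_h, y_phq) pairs."""
    n = len(pairs)
    return sum(h for h, _ in pairs) / n - sum(p for _, p in pairs) / n


def bootstrap_percentile_ci(pairs, statistic, n_boot=1000, alpha=0.05, seed=7):
    """Percentile interval from resampling participants with replacement."""
    rng = random.Random(seed)
    n = len(pairs)
    replicates = sorted(
        statistic([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower = replicates[int((alpha / 2) * n_boot)]
    upper = replicates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Hypothetical paired endpoints: (harmonized, PHQ-referenced).
pairs = [(1, 1)] * 12 + [(1, 0)] * 5 + [(0, 1)] * 3 + [(0, 0)] * 40
point = drift(pairs)
lower, upper = bootstrap_percentile_ci(pairs, drift)
```

Resampling whole participants keeps each person's two endpoints together, so the interval reflects the dependence between \(Y^H\) and \(Y^{PHQ}\).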
A8 Planned Outputs and Reporting
Planned outputs are:
- Item and structural diagnostics (EFA/CFA, invariance, DIF).
- IRT calibration tables and threshold-mapping plots.
- Comparative endpoint performance tables across threshold strategies.
- Final harmonized endpoint definition for downstream Objectives 3-5.
A9 Symbol-to-Implementation Crosswalk (Depression Harmonization SAP)
To align the Depression Harmonization SAP (Objective 2) notation with implementation evidence in the public repository, Table A9.1 maps each core quantity to scripts under scripts/psychometric/ and to the corresponding generated output artifacts.
| SAP quantity | Mathematical definition in this SAP | Implementation locus (OBJ03 Psychometric) | Primary output artifact(s) |
|---|---|---|---|
| \(\mathbf{x}_i\), retained item structure | Observed ordinal symptom response vector used in EFA/CFA stages | EFA stability and retention pipeline in scripts/psychometric/programs/utils/efa_stability/ with orchestration via scripts/psychometric/programs/utils/efa_stability/05_run_pipeline_efa_master.R | Stability/retention figures in SCAR notebook outputs and EFA artifacts |
| \(\theta_i\), \(\boldsymbol{\lambda}\), \(\boldsymbol{\tau}\) | CFA latent trait and ordered-threshold measurement parameters | CFA + invariance + ROC workflow under scripts/psychometric/programs/utils/cfa_roc/ (notably 03_mc_loop.R, 04_summary.R, 06_master.R) | 03_IRT_fitted_models_1factor.rds, 03_IRT_invariance_models_1factor.rds, fit/invariance PDFs |
| \(a_j\), \(b_{jk}\) | GRM discrimination and threshold/location parameters | IRT calibration engine in scripts/psychometric/programs/utils/irt/full_itr.R and batch execution via scripts/psychometric/programs/analysis/run_irt_batch.R | 03_IRT_Harmonization_Results.rds, batch model objects in output/batch/ |
| \(\tau^\ast\) and operating-point scenarios | Applied latent threshold used for harmonized endpoint assignment | Threshold scenario comparison in scripts/psychometric/programs/analysis/irt_roc_compare.R with harmonization rules consolidated by batch outputs | SSQ10_Harmonized_Scoring_Engine.rds, ROC metric figures/tables |
| \(Y_i^H\) and prevalence alignment \(\Delta_p\) | Harmonized depression endpoint and drift vs PHQ-referenced prevalence | Endpoint performance summaries from IRT + ROC comparison outputs and Monte Carlo result summaries | main_results_boot_10000_Isisekelo_Sempilo.rds, boot_10000_Isisekelo_Sempilo_PHQ-09_(Original)_results.rds, ROC/performance figures |
This crosswalk is scoped to Objective 2 implementation evidence in the public script bundle at scripts/psychometric/. Retired scripts are intentionally excluded from active SAP traceability in public release.
Section C: Objective 4 - Behavioral Mediators SAP
C1 Background and Rationale
Objective 4 identifies behavioral pathways through which childhood social determinant trajectories may influence depression risk. This section defines mediator-screening and mediator-structuring analyses that precede full causal mediation decomposition.
C3 Objectives and Estimands
Primary estimands are adjusted mediator-outcome associations conditional on exposure trajectory class and confounders.
For mediator \(M_j\):
\[ \operatorname{logit}\Pr(Y_i^H=1) = \alpha_0 + \alpha_1 M_{ij} + \alpha_2 C_i + \boldsymbol{\alpha}_3^\top \mathbf{Z}_i. \]
Parameter definitions (Eq. C3.1):
- \(\Pr(Y_i^H=1)\): probability of harmonized depression for participant \(i\).
- \(\alpha_0\): model intercept.
- \(\alpha_1\): association parameter for mediator \(M_{ij}\).
- \(M_{ij}\): value of mediator \(j\) for participant \(i\).
- \(\alpha_2\): effect parameter for trajectory representation \(C_i\).
- \(C_i\): trajectory class/exposure representation.
- \(\mathbf{Z}_i\): confounder vector.
- \(\boldsymbol{\alpha}_3\): confounder-effect coefficient vector.
- \(\operatorname{logit}(\cdot)\): log-odds transformation, \(\log\{p/(1-p)\}\).
- \(\boldsymbol{\alpha}_3^\top \mathbf{Z}_i\): linear predictor contribution of measured confounders.
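The model in Eq. C3.1 can be read off numerically with a short Python sketch that converts the linear predictor to a probability and recovers the mediator odds ratio \(\exp(\alpha_1)\). All coefficient values are hypothetical, and \(C_i\) is treated as a single 0/1 indicator for simplicity; a multi-class trajectory variable would need one indicator per non-reference class.

```python
import math


def predicted_probability(m, c, z, alpha0, alpha1, alpha2, alpha3):
    """Pr(Y^H = 1) from the logit-linear model in Eq. C3.1."""
    eta = alpha0 + alpha1 * m + alpha2 * c + sum(a * zk for a, zk in zip(alpha3, z))
    return 1.0 / (1.0 + math.exp(-eta))


def odds(p):
    return p / (1.0 - p)


# Hypothetical coefficients and confounder values, for illustration only.
coef = dict(alpha0=-2.0, alpha1=0.4, alpha2=0.7, alpha3=[0.05, -0.3])
z_i = [18.0, 1.0]

p0 = predicted_probability(m=0.0, c=1, z=z_i, **coef)
p1 = predicted_probability(m=1.0, c=1, z=z_i, **coef)
odds_ratio = odds(p1) / odds(p0)  # equals exp(alpha1) for a one-unit change in M
```

The odds ratio for a one-unit change in the mediator does not depend on the values chosen for \(C_i\) or \(\mathbf{Z}_i\), which is what makes \(\alpha_1\) interpretable as a conditional association.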
Secondary estimands include mediator inter-association structure and ranking stability across adjustment sets.
C4 Mediator Definitions and Coding
Candidate mediators are operationalized from harmonized adolescent behavioral domains (e.g., school attachment, social support, violence exposure, food insecurity, and related behavioral indicators). Recoding, scaling, and missingness handling follow pre-specified rules from the governed data-preparation pipeline.
C5 Modeling Strategy
Three layers are prespecified:
- unadjusted models (\(Y_i^H \sim M_{ij}\)),
- confounder-adjusted models (\(Y_i^H \sim M_{ij} + \mathbf{Z}_i\)),
- jointly adjusted mediator models (\(Y_i^H \sim \mathbf{M}_i + \mathbf{Z}_i\)).
Model-term definitions: in these compact model expressions, \(Y_i^H\) is the harmonized depression endpoint, \(M_{ij}\) is a single mediator for participant \(i\), \(\mathbf{M}_i\) is the vector of mediators entered jointly, \(\mathbf{Z}_i\) is the confounder vector, and the symbol \(\sim\) indicates the set of predictors included in the working regression model.
Given potential mediator collinearity, interpretation emphasizes effect-direction consistency, precision, and stability under alternate adjustment sets rather than single-model significance alone.
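A stability summary across the three layers could be tabulated along the lines of the Python sketch below. The estimates, standard errors, layer labels, and the specific summaries (sign agreement, Wald interval, relative change from the unadjusted estimate) are illustrative assumptions and are not prespecified decision rules of this SAP.

```python
def summarize_stability(estimates, z_crit=1.96):
    """Summarize direction, precision, and attenuation across model layers.

    estimates maps layer name -> (log-odds coefficient, standard error).
    """
    signs = {(b > 0) - (b < 0) for b, _ in estimates.values()}
    reference = estimates["unadjusted"][0]
    return {
        # All layers share one non-zero sign.
        "direction_consistent": len(signs) == 1 and 0 not in signs,
        "wald_interval": {
            layer: (b - z_crit * se, b + z_crit * se)
            for layer, (b, se) in estimates.items()
        },
        # Proportional change relative to the unadjusted coefficient.
        "relative_change": {
            layer: (b - reference) / reference
            for layer, (b, _) in estimates.items()
        },
    }


# Hypothetical log-odds estimates for one mediator.
estimates = {
    "unadjusted": (0.62, 0.15),
    "confounder_adjusted": (0.48, 0.17),
    "jointly_adjusted": (0.31, 0.21),
}
summary = summarize_stability(estimates)
```

In this invented example the direction is consistent across layers, but the jointly adjusted estimate is attenuated by half and its interval includes zero, which is the kind of pattern the interpretive emphasis above is meant to surface.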
C6 Mediator Prioritization and Phenotype Planning
Mediators are prioritized for Objective 5 if they show:
- stable adjusted associations with \(Y_i^H\),
- coherent epidemiologic interpretation,
- acceptable overlap structure for mediation decomposition.
Mediator phenotype summaries are treated as supportive representation tools and not substitutes for prespecified individual mediator effects.
C7 Sensitivity and Robustness
Sensitivity analyses include alternate coding schemes, complete-case vs missing-data-aware pipelines, and subgroup checks by age/sex strata. E-value style bias-sensitivity summaries may be reported for key associations where decision-relevant (Mathur et al. 2022; VanderWeele and Ding 2017).
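For reference, the E-value for a risk ratio is \(RR + \sqrt{RR(RR-1)}\) after inverting ratios below 1 (VanderWeele and Ding 2017). The Python sketch below implements that formula, together with the square-root approximation those authors describe for odds ratios when the outcome is common. The function names and input values are hypothetical, and whether the common-outcome approximation applies to a given analysis is a judgment this sketch does not make.

```python
import math


def e_value_rr(rr):
    """E-value for a risk ratio point estimate."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    if rr < 1:
        rr = 1.0 / rr  # protective estimates are inverted first
    return rr + math.sqrt(rr * (rr - 1.0))


def e_value_or_common_outcome(odds_ratio):
    """E-value for an odds ratio, approximating RR by sqrt(OR)."""
    return e_value_rr(math.sqrt(odds_ratio))


# Hypothetical estimates, for illustration only.
e_rr = e_value_rr(2.0)
e_or = e_value_or_common_outcome(2.0)
```

An estimate of exactly 1 yields an E-value of 1, meaning no unmeasured confounding is needed to explain a null association.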
C8 Planned Outputs and Reporting
Outputs include mediator-association tables (unadjusted and adjusted), mediator inter-association summaries, and a transparent shortlist of mediators carried into Objective 5 causal mediation modeling.
Supplementary: Glossary of Symbols and Parameters
This supplementary glossary consolidates symbol and parameter notation for quick reference. Equation-specific parameter definitions remain directly below each model/equation in Sections A-C.
Objective 2 Glossary
- \(i\): participant index.
- \(j\): item index.
- \(k\): ordered-category threshold index for item \(j\).
- \(Y_i^{PHQ}\): PHQ-referenced probable-depression indicator for participant \(i\).
- \(Y_i^H\): harmonized depression indicator for participant \(i\).
- \(\hat{\theta}_i\): participant-specific estimated latent depression score (EAP score from joint calibration).
- \(\tau^\ast\): selected latent threshold used to classify \(Y_i^H\).
- \(\Delta_p\): prevalence drift between harmonized and PHQ-referenced endpoint prevalence.
- \(\mathbf{x}_i\): vector of observed ordinal item responses for participant \(i\).
- \(\theta_i\): latent depression factor in structural models.
- \(\boldsymbol{\lambda}\): factor-loading parameter vector (or matrix).
- \(\boldsymbol{\tau}\): item-threshold parameter vector (or matrix) in CFA parameterization.
- \(a_j\): GRM discrimination parameter for item \(j\).
- \(b_{jk}\): GRM category-threshold (difficulty/location) parameter for item \(j\), threshold \(k\).
Objective 3 Glossary
- \(i\): participant index.
- \(t\): time index within developmental window.
- \(c, h\): latent-class indices.
- \(K\): total number of latent classes under a candidate model.
- \(E_{it}\): observed exposure measure for participant \(i\) at time \(t\).
- \(C_i\): latent trajectory class membership for participant \(i\).
- \(\pi_c\): marginal class proportion for class \(c\).
- \(Y_i^H\): harmonized binary depression outcome from Objective 2.
- \(c_{ref}\): pre-specified reference class for class-contrast models.
- \(\eta_{0c}\): class-specific intercept (initial level) parameter.
- \(\eta_{1c}\): class-specific linear time-slope parameter.
- \(\eta_{2c}\): class-specific quadratic time-slope parameter.
- \(\varepsilon_{it}\): residual term for participant \(i\) at time \(t\).
- \(\sigma_c^2\): class-specific residual variance.
- \(\alpha_c\): multinomial logit intercept parameter governing class-membership probability.
- \(\beta_0\): outcome-model intercept.
- \(\beta_c\): class-effect coefficient for class \(c\) relative to \(c_{ref}\).
- \(\mathbf{Z}_i\): confounder vector for participant \(i\).
- \(\boldsymbol{\gamma}\): confounder-effect coefficient vector.
- \(\text{OR}_c\): odds-ratio contrast in depression risk for class \(c\) versus \(c_{ref}\).
Objective 4 Glossary
- \(i\): participant index.
- \(j\): mediator index.
- \(Y_i^H\): harmonized binary depression outcome.
- \(M_{ij}\): value of mediator \(j\) for participant \(i\).
- \(C_i\): trajectory-class summary or exposure-trajectory representation for participant \(i\).
- \(\mathbf{Z}_i\): confounder vector.
- \(\alpha_0\): model intercept.
- \(\alpha_1\): regression coefficient for mediator \(M_{ij}\).
- \(\alpha_2\): regression coefficient for trajectory-class representation \(C_i\).
- \(\boldsymbol{\alpha}_3\): coefficient vector for confounders \(\mathbf{Z}_i\).
- \(\mathbf{M}_i\): vector of candidate mediators in jointly adjusted models.
Remaining Sections
- Section D: Objective 5 - Causal Mediation SAP (in progress). This section will formalize natural direct and indirect effect estimands, identification assumptions, and sensitivity procedures for unmeasured mediator-outcome confounding (Imai, Keele, and Yamamoto 2010; Valeri and VanderWeele 2013; VanderWeele and Hernan 2012; VanderWeele et al. 2016).