07 — Experiments: Harder Dataset¶

Primary author: Victoria

Builds on:

  • 05_dataset_construction.ipynb (Victoria — harder dataset with cosine-similarity-based distractors, balanced 1:1, 32 features after removing the 15 context-free cosine features)
  • 06_experiments_easy.ipynb (Victoria — experiment scaffolding: run_experiment(), hyperparameter grids, GroupKFold setup, results summary formatting)
  • 03_feature_engineering.ipynb (Victoria — 47-feature computation logic, feature group column lists extracted to scripts/feature_utils.py per Decision 18)
  • Hans's Hans_Supervised_Learning_Models.ipynb (KNN, LogReg, RF scaffolding with 5-fold stratified CV — adapted to GroupKFold per Decision 7, expanded features, and restructured for the A/B ablation design)

Prompt engineering: Victoria
AI assistance: Claude Code (Anthropic)
Environment: Local (CPU only)

This notebook implements PLAN.md Step 8 — running Experiments 2A and 2B on the harder (cosine-similarity distractor) dataset (design doc Section 8.3).

This is where the misdirection hypothesis is tested via classification. Unlike the easy dataset (NB 06), where random distractors are trivially distinguishable, the harder dataset's distractors are chosen by cosine similarity to the definition (Decision 6). This forces the model to rely on subtler features to distinguish real from distractor pairs. The A/B ablation — removing the 6 context-informed features — directly tests whether the clue's surface reading helps or hurts classification.

Experiment 2A — All 32 features¶

Three model families (KNN, Logistic Regression, Random Forest) are trained on 32 features: 6 context-informed + 22 relationship + 4 surface. The 15 context-free cosine features have been removed because they are artifacts of the cosine-similarity-based distractor construction (Decision 6).

Experiment 2B — 26 features (context-informed removed)¶

The 6 context-informed cosine features (involving word1_clue_context) are removed, leaving 26 features (22 relationship + 4 surface). Comparing 2A vs. 2B measures whether clue context helps or hurts classification when the task is genuinely difficult:

  • If Exp 2A < Exp 2B: context-informed features hurt classification — the clue's surface reading shifts the definition embedding away from the true answer, supporting the misdirection hypothesis through the classifier lens.
  • If Exp 2A > Exp 2B: context-informed features help — the classifier extracts useful signal from the contextual shift that outweighs any misdirection effect.

Input: data/dataset_harder.parquet (Step 7)
Output: outputs/results_harder.csv, saved hyperparameters


1. Configuration¶

SAMPLE_MODE: Set to True (default) for fast iteration with 20,000 rows. Set to False for final runs on the full dataset. The sample is stratified by label to preserve the 1:1 balance.

In [1]:
import sys
import warnings

import numpy as np
import pandas as pd
from pathlib import Path

from sklearn.model_selection import GroupKFold

warnings.filterwarnings('ignore', category=FutureWarning)

# --- Environment Auto-Detection ---
# Same pattern as prior notebooks: detect Colab vs. local / Great Lakes.
try:
    IS_COLAB = 'google.colab' in str(get_ipython())
except NameError:
    IS_COLAB = False

if IS_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    PROJECT_ROOT = Path('/content/drive/MyDrive/SIADS 692 Milestone II/'
                        'Milestone II - NLP Cryptic Crossword Clues/'
                        'clue_misdirection')
else:
    try:
        PROJECT_ROOT = Path(__file__).resolve().parent.parent
    except NameError:
        PROJECT_ROOT = Path.cwd().parent

DATA_DIR = PROJECT_ROOT / 'data'
OUTPUT_DIR = PROJECT_ROOT / 'outputs'
SCRIPTS_DIR = PROJECT_ROOT / 'scripts'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# --- Add scripts/ to sys.path so feature_utils is importable ---
# Decision 18: feature column lists are defined in feature_utils.py
# (extracted from NB 03) so all downstream notebooks use the same names.
if str(SCRIPTS_DIR) not in sys.path:
    sys.path.insert(0, str(SCRIPTS_DIR))

from feature_utils import (
    CONTEXT_INFORMED_COLS,
    RELATIONSHIP_COLS,
    SURFACE_COLS,
)

# --- Experiment parameters ---
RANDOM_SEED = 42
N_FOLDS = 5
SAMPLE_MODE = True   # <-- Set to False for final runs
SAMPLE_SIZE = 20_000

# --- Feature sets ---
# The harder dataset has already had the 15 context-free cosine features
# removed (Decision 6: they are artifacts of the cosine-similarity-based
# distractor construction). The remaining 32 features are:
#   6 context-informed + 22 relationship + 4 surface
HARDER_FEATURE_COLS = (
    CONTEXT_INFORMED_COLS + list(RELATIONSHIP_COLS) + SURFACE_COLS
)

# Exp 2A: all 32 harder features
# Exp 2B: remove the 6 context-informed cosine features (those involving
# word1_clue_context) → 26 features. These features capture how the
# definition's meaning shifts when embedded within the clue sentence.
# Removing them tests whether clue context helps or hurts classification
# on the harder task — the central misdirection question.
EXP_2B_COLS = [c for c in HARDER_FEATURE_COLS if c not in CONTEXT_INFORMED_COLS]

print(f'Environment: {"Google Colab" if IS_COLAB else "Local / Great Lakes"}')
print(f'Project root: {PROJECT_ROOT}')
print(f'Data directory: {DATA_DIR}')
print(f'Output directory: {OUTPUT_DIR}')
print(f'\nRandom seed: {RANDOM_SEED}')
print(f'CV folds: {N_FOLDS}')
print(f'Sample mode: {SAMPLE_MODE}'
      f'{f" ({SAMPLE_SIZE:,} rows)" if SAMPLE_MODE else " (full dataset)"}')
print(f'\nFeature sets:')
print(f'  Exp 2A (all harder features):        {len(HARDER_FEATURE_COLS)}')
print(f'  Context-informed (to remove for 2B):  {len(CONTEXT_INFORMED_COLS)}')
print(f'  Exp 2B (relationship + surface only): {len(EXP_2B_COLS)}')
Environment: Local / Great Lakes
Project root: /Users/victoria/Desktop/MADS/ccc-project/clue_misdirection
Data directory: /Users/victoria/Desktop/MADS/ccc-project/clue_misdirection/data
Output directory: /Users/victoria/Desktop/MADS/ccc-project/clue_misdirection/outputs

Random seed: 42
CV folds: 5
Sample mode: True (20,000 rows)

Feature sets:
  Exp 2A (all harder features):        32
  Context-informed (to remove for 2B):  6
  Exp 2B (relationship + surface only): 26

2. Load the Harder Dataset¶

In [2]:
# ============================================================
# Load dataset_harder.parquet (Step 7 output)
# ============================================================
# The harder dataset uses cosine-similarity-based distractors
# (Decision 6): for each real definition, distractor answers are
# sampled from the top-k most similar answer words by cosine
# similarity between word1_average and word2_average. This makes
# the classification task genuinely difficult — distractors are
# semantically plausible, not random. The 15 context-free cosine
# features have already been removed in NB 05 because they are
# artifacts of the cosine-based construction.
dataset_path = DATA_DIR / 'dataset_harder.parquet'
assert dataset_path.exists(), (
    f'Missing input file: {dataset_path}\n'
    f'Run 05_dataset_construction.ipynb first to produce this file.'
)

df = pd.read_parquet(dataset_path)
print(f'Loaded dataset_harder.parquet: {len(df):,} rows × {len(df.columns)} columns')

# --- Validate expected columns ---
# The harder dataset should contain the 32 features (6 context-informed
# + 22 relationship + 4 surface) plus metadata. It should NOT contain
# the 15 context-free cosine features.
missing_feat = [c for c in HARDER_FEATURE_COLS if c not in df.columns]
assert not missing_feat, f'Missing feature columns: {missing_feat}'
assert 'label' in df.columns, 'Missing label column'
assert 'definition_wn' in df.columns, 'Missing definition_wn column'
assert 'answer_wn' in df.columns, 'Missing answer_wn column'

# --- Sample mode ---
# When iterating quickly, take a stratified subsample to speed up
# cross-validation. Stratification preserves the 1:1 label balance.
# We sample each label group separately and concatenate, rather than
# using groupby().apply(), which can drop the grouping column.
if SAMPLE_MODE:
    sampled_parts = []
    for label_val in df['label'].unique():
        group = df[df['label'] == label_val]
        sampled_parts.append(
            group.sample(n=min(SAMPLE_SIZE // 2, len(group)),
                         random_state=RANDOM_SEED)
        )
    df = pd.concat(sampled_parts, ignore_index=True)
    print(f'\n⚠ SAMPLE MODE: subsampled to {len(df):,} rows '
          f'(set SAMPLE_MODE = False for final runs)')

# --- Summary ---
print(f'\nShape: {df.shape}')
print(f'\nLabel distribution:')
print(df['label'].value_counts().to_string())
print(f'\nUnique definition_wn values: {df["definition_wn"].nunique():,}')
print(f'Unique answer_wn values:     {df["answer_wn"].nunique():,}')

# Number of unique (definition_wn, answer_wn) pairs — this is the
# grouping unit for GroupKFold. Each pair may appear in multiple clue
# rows (different clue surfaces for the same definition–answer pair),
# and each real row has a corresponding distractor row with a different
# answer_wn. GroupKFold ensures all rows sharing the same pair stay in
# the same fold, preventing near-duplicate feature vectors from leaking
# across train/test splits.
n_unique_pairs = df.groupby(['definition_wn', 'answer_wn']).ngroups
print(f'Unique (definition_wn, answer_wn) pairs: {n_unique_pairs:,}')

# --- Validate no NaNs in feature columns ---
feat_nulls = df[HARDER_FEATURE_COLS].isnull().any()
if feat_nulls.any():
    print(f'\nWARNING: NaN values found in features:')
    print(feat_nulls[feat_nulls].to_string())
else:
    print(f'\nNo NaN values in any of the {len(HARDER_FEATURE_COLS)} feature columns ✓')
Loaded dataset_harder.parquet: 480,422 rows × 47 columns

⚠ SAMPLE MODE: subsampled to 20,000 rows (set SAMPLE_MODE = False for final runs)

Shape: (20000, 47)

Label distribution:
label
1    10000
0    10000

Unique definition_wn values: 8,151
Unique answer_wn values:     14,156
Unique (definition_wn, answer_wn) pairs: 19,112

No NaN values in any of the 32 feature columns ✓

3. GroupKFold Assignment¶

We use GroupKFold (Decision 7) rather than StratifiedKFold because multiple clue rows can share the same (definition_wn, answer_wn) pair. These rows have near-identical feature vectors (differing only in the 6 context-informed features, which depend on the specific clue surface). If they were split across train and test folds, the model would effectively see the test example during training — leaking information and inflating accuracy.

GroupKFold guarantees that all rows belonging to the same (definition_wn, answer_wn) group are assigned to the same fold. The same fold assignments are reused for both Exp 2A and Exp 2B to ensure a fair comparison.

In [3]:
# ============================================================
# Create group key and assign folds
# ============================================================
# Build a composite group key from (definition_wn, answer_wn). All rows
# sharing this pair — whether real or distractor, and across multiple
# clue surfaces — land in the same fold.
groups = df['definition_wn'].astype(str) + '|||' + df['answer_wn'].astype(str)

gkf = GroupKFold(n_splits=N_FOLDS)

# GroupKFold.split() yields (train_idx, test_idx) tuples. We only need
# the fold assignment for each row, so we iterate and record which fold
# each row's test set falls into.
df['fold'] = -1
for fold_idx, (_, test_idx) in enumerate(gkf.split(df, y=df['label'], groups=groups)):
    df.loc[df.index[test_idx], 'fold'] = fold_idx

assert (df['fold'] >= 0).all(), 'Some rows were not assigned to any fold'

# ============================================================
# Verify: no (definition_wn, answer_wn) pair spans multiple folds
# ============================================================
folds_per_group = (
    df.groupby(['definition_wn', 'answer_wn'])['fold']
      .nunique()
)
leaked_groups = folds_per_group[folds_per_group > 1]
assert len(leaked_groups) == 0, (
    f'{len(leaked_groups)} groups span multiple folds — GroupKFold failed!\n'
    f'Examples: {leaked_groups.head(5).to_dict()}'
)
print(f'GroupKFold verification passed: no (definition_wn, answer_wn) '
      f'pair spans multiple folds ✓')

# ============================================================
# Print fold sizes and label balance
# ============================================================
print(f'\n{"Fold":<6s} {"Size":>8s} {"Label=1":>10s} {"Label=0":>10s} {"% Positive":>12s}')
print('-' * 50)
for fold_idx in range(N_FOLDS):
    fold_mask = df['fold'] == fold_idx
    fold_size = fold_mask.sum()
    n_pos = (df.loc[fold_mask, 'label'] == 1).sum()
    n_neg = (df.loc[fold_mask, 'label'] == 0).sum()
    pct_pos = n_pos / fold_size * 100
    print(f'{fold_idx:<6d} {fold_size:>8,d} {n_pos:>10,d} {n_neg:>10,d} {pct_pos:>11.1f}%')

print(f'\nTotal rows: {len(df):,}')
print(f'Unique groups: {groups.nunique():,}')
GroupKFold verification passed: no (definition_wn, answer_wn) pair spans multiple folds ✓

Fold       Size    Label=1    Label=0   % Positive
--------------------------------------------------
0         4,000      2,043      1,957        51.1%
1         4,000      2,023      1,977        50.6%
2         4,000      1,975      2,025        49.4%
3         4,000      1,991      2,009        49.8%
4         4,000      1,968      2,032        49.2%

Total rows: 20,000
Unique groups: 19,112

4. Experiment Design¶

We run two experiments on the harder (cosine-similarity distractor) dataset, following the design in Section 8.3 of the design document (Table 7):

Experiment   Features                   Count   Description
----------   ------------------------   -----   ------------------------------------------------
Exp 2A       All harder features        32      6 context-informed + 22 relationship + 4 surface
Exp 2B       Context-informed removed   26      22 relationship + 4 surface

Note: the 15 context-free cosine features are excluded from both experiments (Decision 6) because they are artifacts of the cosine-similarity-based distractor construction — distractors were selected to be close to the definition in context-free embedding space, so those features would encode the construction method rather than genuine signal.

Three model families are trained under each condition:

  1. K-Nearest Neighbors (KNN) — instance-based, non-parametric. Features are scaled with StandardScaler fitted on the training fold.
  2. Logistic Regression — probabilistic, linear. Scaled. Elastic-net penalty with l1_ratio spanning pure L2 (0.0) to pure L1 (1.0); solver='saga' supports elastic-net.
  3. Random Forest — tree-based, non-linear. Scale-invariant, so no scaling applied. Uses RandomizedSearchCV due to larger grid.

Hyperparameters are tuned via inner 3-fold stratified CV within each training fold. The outer 5-fold GroupKFold (assigned in Section 3) provides the train/test split. The same folds are used for both experiments to ensure a fair A vs. B comparison (Decision 7).

Expected outcome: Accuracy will be lower than the easy dataset because cosine-similarity distractors are semantically plausible. The key question is the sign of Δ Hard (2A − 2B):

  • Δ Hard < 0: context-informed features hurt — the clue's surface reading shifts the definition embedding away from the true answer, consistent with the misdirection hypothesis.
  • Δ Hard > 0: context-informed features help — the classifier extracts useful signal from contextual shift despite misdirection.
In [4]:
import time
from sklearn.base import clone
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     StratifiedKFold)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# ============================================================
# Hyperparameter grids
# ============================================================
# Same grids as NB 06 — identical model families and search spaces
# ensure results are comparable across easy and harder datasets.
# Full grids for final runs; reduced grids when SAMPLE_MODE is True
# to keep iteration time under a few minutes.

if SAMPLE_MODE:
    knn_grid = {
        'n_neighbors': [3, 7, 15],
        'weights': ['uniform', 'distance'],
    }
    logreg_grid = {
        'C': [0.1, 1.0, 10.0],
        'l1_ratio': [0.0, 0.5, 1.0],
    }
    rf_grid = {
        'n_estimators': [100, 200],
        'max_depth': [5, 10, None],
        'min_samples_split': [2, 5],
        'min_samples_leaf': [1, 2],
    }
    RF_N_ITER = 10
else:
    knn_grid = {
        'n_neighbors': [3, 5, 7, 11, 15, 21],
        'weights': ['uniform', 'distance'],
        'metric': ['euclidean', 'manhattan'],
    }
    logreg_grid = {
        'C': [0.01, 0.1, 1.0, 10.0, 100.0],
        'l1_ratio': [0.0, 0.5, 1.0],
    }
    rf_grid = {
        'n_estimators': [100, 200, 500],
        'max_depth': [5, 10, 20, None],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'max_features': ['sqrt', 'log2'],
    }
    RF_N_ITER = 20

# --- Model configurations ---
# Each entry defines the base estimator, its search grid, search
# strategy (grid vs. randomized), and whether StandardScaler should
# be applied before fitting.  Random Forest is scale-invariant, so
# it receives unscaled features directly.
model_configs = {
    'KNN': {
        'estimator': KNeighborsClassifier(),
        'param_grid': knn_grid,
        'search': 'grid',
        'scale': True,
    },
    'Logistic Regression': {
        'estimator': LogisticRegression(
            solver='saga', penalty='elasticnet',
            max_iter=5000, random_state=RANDOM_SEED),
        'param_grid': logreg_grid,
        'search': 'grid',
        'scale': True,
    },
    'Random Forest': {
        'estimator': RandomForestClassifier(random_state=RANDOM_SEED),
        'param_grid': rf_grid,
        'search': 'random',
        'scale': False,
        'n_iter': RF_N_ITER,
    },
}

# Print grid sizes so we know how long tuning will take
for name, cfg in model_configs.items():
    total = 1
    for v in cfg['param_grid'].values():
        total *= len(v)
    if cfg['search'] == 'grid':
        print(f'{name}: GridSearchCV — {total} combinations × 3 inner folds')
    else:
        n_it = cfg.get('n_iter', 20)
        print(f'{name}: RandomizedSearchCV — {n_it} of {total} '
              f'combinations × 3 inner folds')
KNN: GridSearchCV — 6 combinations × 3 inner folds
Logistic Regression: GridSearchCV — 9 combinations × 3 inner folds
Random Forest: RandomizedSearchCV — 10 of 24 combinations × 3 inner folds
In [5]:
def run_experiment(df, feature_cols, experiment_name, model_configs,
                   n_folds=5):
    """Run a classification experiment using pre-assigned GroupKFold splits.

    For each outer fold, hyperparameters are tuned via inner 3-fold
    StratifiedKFold CV on the training portion, then the best model is
    evaluated on the held-out test fold.  StandardScaler is fitted on
    the training fold only for scale-sensitive models (KNN, LogReg).

    Parameters
    ----------
    df : pd.DataFrame
        Must contain ``feature_cols``, ``'label'``, and ``'fold'``.
    feature_cols : list of str
        Feature columns to use as model input.
    experiment_name : str
        Label for this experiment (e.g., ``"Exp_2A"``).
    model_configs : dict
        Model definitions — see hyperparameter grids cell above.
    n_folds : int
        Number of outer CV folds (must match ``'fold'`` column values).

    Returns
    -------
    results_df : pd.DataFrame
        One row per (model, fold) with accuracy, F1, precision, recall,
        ROC AUC, and best hyperparameters.
    best_params : dict
        ``{model_name: {fold_idx: best_params_dict}}``.
    """
    X = df[feature_cols].values
    y = df['label'].values

    all_results = []
    best_params = {name: {} for name in model_configs}

    # Inner CV for hyperparameter tuning.  We use StratifiedKFold (not
    # GroupKFold) for the inner loop because the outer GroupKFold has
    # already separated definition-answer groups across folds — further
    # group separation within a single training fold is unnecessary and
    # would complicate the search with minimal benefit.
    inner_cv = StratifiedKFold(
        n_splits=3, shuffle=True, random_state=RANDOM_SEED)

    t0_exp = time.time()
    print(f'\n{"="*65}')
    print(f'{experiment_name}: {len(feature_cols)} features, {len(df):,} rows')
    print(f'{"="*65}')

    for fold_idx in range(n_folds):
        test_mask = (df['fold'] == fold_idx).values
        train_mask = ~test_mask

        X_train, X_test = X[train_mask], X[test_mask]
        y_train, y_test = y[train_mask], y[test_mask]

        print(f'\nFold {fold_idx}: '
              f'train={train_mask.sum():,}  test={test_mask.sum():,}')

        for model_name, config in model_configs.items():
            t0 = time.time()
            estimator = clone(config['estimator'])

            # --- Feature scaling ---
            # StandardScaler is fitted on the training fold only, then
            # applied to the test fold.  This prevents information about
            # test-set feature distributions from leaking into training.
            # Random Forest is scale-invariant and skips this step.
            if config['scale']:
                scaler = StandardScaler()
                X_tr = scaler.fit_transform(X_train)
                X_te = scaler.transform(X_test)
            else:
                X_tr = X_train
                X_te = X_test

            # --- Inner CV hyperparameter search ---
            if config['search'] == 'random':
                search = RandomizedSearchCV(
                    estimator, config['param_grid'],
                    n_iter=config.get('n_iter', 20),
                    cv=inner_cv, scoring='accuracy',
                    n_jobs=-1, random_state=RANDOM_SEED,
                )
            else:
                search = GridSearchCV(
                    estimator, config['param_grid'],
                    cv=inner_cv, scoring='accuracy',
                    n_jobs=-1,
                )

            search.fit(X_tr, y_train)
            best_params[model_name][fold_idx] = search.best_params_

            # --- Evaluate on held-out test fold ---
            y_pred = search.predict(X_te)
            y_prob = search.predict_proba(X_te)[:, 1]

            acc = accuracy_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred)
            prec = precision_score(y_test, y_pred)
            rec = recall_score(y_test, y_pred)
            roc = roc_auc_score(y_test, y_prob)
            elapsed = time.time() - t0

            all_results.append({
                'experiment': experiment_name,
                'model': model_name,
                'fold': fold_idx,
                'accuracy': acc,
                'f1': f1,
                'precision': prec,
                'recall': rec,
                'roc_auc': roc,
                'best_params': str(search.best_params_),
            })

            print(f'  {model_name:<22s} '
                  f'Acc={acc:.4f}  F1={f1:.4f}  AUC={roc:.4f}  '
                  f'[{elapsed:.1f}s]')

    elapsed_total = time.time() - t0_exp
    print(f'\n{experiment_name} complete in {elapsed_total:.0f}s')

    return pd.DataFrame(all_results), best_params
In [6]:
# ============================================================
# Run Experiment 2A: all 32 harder features
# ============================================================
results_2a, params_2a = run_experiment(
    df, HARDER_FEATURE_COLS, "Exp_2A", model_configs, n_folds=N_FOLDS)

# ============================================================
# Run Experiment 2B: 26 features (context-informed removed)
# ============================================================
results_2b, params_2b = run_experiment(
    df, EXP_2B_COLS, "Exp_2B", model_configs, n_folds=N_FOLDS)
=================================================================
Exp_2A: 32 features, 20,000 rows
=================================================================

Fold 0: train=16,000  test=4,000
  KNN                    Acc=0.7288  F1=0.7153  AUC=0.7885  [2.9s]
  Logistic Regression    Acc=0.7075  F1=0.6952  AUC=0.7696  [5.1s]
  Random Forest          Acc=0.7392  F1=0.7267  AUC=0.8122  [8.4s]

Fold 1: train=16,000  test=4,000
  KNN                    Acc=0.7272  F1=0.7148  AUC=0.7904  [0.8s]
  Logistic Regression    Acc=0.7222  F1=0.7079  AUC=0.7769  [3.5s]
  Random Forest          Acc=0.7498  F1=0.7333  AUC=0.8202  [9.8s]

Fold 2: train=16,000  test=4,000
  KNN                    Acc=0.7322  F1=0.7157  AUC=0.7853  [0.8s]
  Logistic Regression    Acc=0.7205  F1=0.7028  AUC=0.7816  [2.6s]
  Random Forest          Acc=0.7410  F1=0.7197  AUC=0.8073  [8.6s]

Fold 3: train=16,000  test=4,000
  KNN                    Acc=0.7242  F1=0.7107  AUC=0.7892  [0.8s]
  Logistic Regression    Acc=0.7288  F1=0.7139  AUC=0.7833  [3.4s]
  Random Forest          Acc=0.7575  F1=0.7431  AUC=0.8259  [8.8s]

Fold 4: train=16,000  test=4,000
  KNN                    Acc=0.7232  F1=0.7006  AUC=0.7743  [0.8s]
  Logistic Regression    Acc=0.7225  F1=0.7030  AUC=0.7718  [3.9s]
  Random Forest          Acc=0.7358  F1=0.7173  AUC=0.8081  [8.7s]

Exp_2A complete in 69s

=================================================================
Exp_2B: 26 features, 20,000 rows
=================================================================

Fold 0: train=16,000  test=4,000
  KNN                    Acc=0.6412  F1=0.6219  AUC=0.6983  [0.7s]
  Logistic Regression    Acc=0.6560  F1=0.5967  AUC=0.7056  [4.7s]
  Random Forest          Acc=0.6647  F1=0.5834  AUC=0.7311  [3.6s]

Fold 1: train=16,000  test=4,000
  KNN                    Acc=0.6452  F1=0.6237  AUC=0.6990  [0.8s]
  Logistic Regression    Acc=0.6645  F1=0.6130  AUC=0.7130  [3.0s]
  Random Forest          Acc=0.6750  F1=0.5930  AUC=0.7372  [3.9s]

Fold 2: train=16,000  test=4,000
  KNN                    Acc=0.6470  F1=0.6278  AUC=0.6977  [0.8s]
  Logistic Regression    Acc=0.6675  F1=0.6100  AUC=0.7110  [2.5s]
  Random Forest          Acc=0.6660  F1=0.6119  AUC=0.7124  [4.2s]

Fold 3: train=16,000  test=4,000
  KNN                    Acc=0.6505  F1=0.6264  AUC=0.7041  [0.7s]
  Logistic Regression    Acc=0.6655  F1=0.6110  AUC=0.7193  [3.3s]
  Random Forest          Acc=0.6737  F1=0.6262  AUC=0.7232  [4.3s]

Fold 4: train=16,000  test=4,000
  KNN                    Acc=0.6418  F1=0.6149  AUC=0.6936  [0.8s]
  Logistic Regression    Acc=0.6707  F1=0.6109  AUC=0.7061  [3.4s]
  Random Forest          Acc=0.6680  F1=0.5741  AUC=0.7320  [4.1s]

Exp_2B complete in 41s
In [7]:
# ============================================================
# Results Summary: mean +/- SD across folds
# ============================================================
results_all = pd.concat([results_2a, results_2b], ignore_index=True)

metrics = ['accuracy', 'f1', 'precision', 'recall', 'roc_auc']
summary_rows = []

for exp_name in ['Exp_2A', 'Exp_2B']:
    for model_name in model_configs:
        mask = ((results_all['experiment'] == exp_name) &
                (results_all['model'] == model_name))
        subset = results_all[mask]
        row = {'Experiment': exp_name, 'Model': model_name}
        for m in metrics:
            mean_val = subset[m].mean()
            std_val = subset[m].std()
            row[f'{m}_mean'] = mean_val
            row[f'{m}_std'] = std_val
            row[m] = f'{mean_val:.4f} +/- {std_val:.4f}'
        summary_rows.append(row)

summary_df = pd.DataFrame(summary_rows)

# --- Display formatted table ---
display_cols = ['Experiment', 'Model'] + metrics
print('RESULTS SUMMARY — Harder Dataset (mean +/- SD across 5 folds)')
print('=' * 115)
print(summary_df[display_cols].to_string(index=False))

# --- Delta Hard = Exp 2A - Exp 2B per model ---
# This is the central comparison for the misdirection hypothesis
# (design doc Section 8.4). A negative delta means context-informed
# features hurt classification — consistent with misdirection shifting
# the definition embedding away from the true answer.
print(f'\n{"="*65}')
print('Delta Hard (Exp 2A - Exp 2B)')
print(f'{"="*65}')
delta_rows = []
for model_name in model_configs:
    row_2a = summary_df[(summary_df['Experiment'] == 'Exp_2A') &
                        (summary_df['Model'] == model_name)].iloc[0]
    row_2b = summary_df[(summary_df['Experiment'] == 'Exp_2B') &
                        (summary_df['Model'] == model_name)].iloc[0]
    delta_row = {'Model': model_name}
    for m in metrics:
        delta_row[f'delta_{m}'] = row_2a[f'{m}_mean'] - row_2b[f'{m}_mean']
    delta_rows.append(delta_row)
    print(f'  {model_name:<22s}  '
          f'dAcc={delta_row["delta_accuracy"]:+.4f}  '
          f'dF1={delta_row["delta_f1"]:+.4f}  '
          f'dAUC={delta_row["delta_roc_auc"]:+.4f}')

delta_df = pd.DataFrame(delta_rows)

# --- Best hyperparameters (fold 0 as representative) ---
print(f'\n{"="*65}')
print('Best Hyperparameters (fold 0, representative)')
print(f'{"="*65}')
for model_name in model_configs:
    print(f'\n  {model_name}:')
    print(f'    Exp 2A: {params_2a[model_name][0]}')
    print(f'    Exp 2B: {params_2b[model_name][0]}')

# --- Save to CSV ---
save_df = summary_df[['Experiment', 'Model'] +
                      [f'{m}_mean' for m in metrics] +
                      [f'{m}_std' for m in metrics]]
save_path = OUTPUT_DIR / 'results_harder.csv'
save_df.to_csv(save_path, index=False)
print(f'\nSaved summary to {save_path}')

# Per-fold results (including best_params) for reproducibility
fold_path = OUTPUT_DIR / 'results_harder_per_fold.csv'
results_all.to_csv(fold_path, index=False)
print(f'Per-fold results saved to {fold_path}')
RESULTS SUMMARY — Harder Dataset (mean +/- SD across 5 folds)
===================================================================================================================
Experiment               Model          accuracy                f1         precision            recall           roc_auc
    Exp_2A                 KNN 0.7271 +/- 0.0036 0.7114 +/- 0.0064 0.7549 +/- 0.0105 0.6728 +/- 0.0102 0.7855 +/- 0.0066
    Exp_2A Logistic Regression 0.7203 +/- 0.0078 0.7046 +/- 0.0069 0.7466 +/- 0.0069 0.6671 +/- 0.0097 0.7767 +/- 0.0060
    Exp_2A       Random Forest 0.7446 +/- 0.0088 0.7280 +/- 0.0105 0.7786 +/- 0.0145 0.6837 +/- 0.0121 0.8147 +/- 0.0081
    Exp_2B                 KNN 0.6452 +/- 0.0038 0.6229 +/- 0.0051 0.6646 +/- 0.0101 0.5864 +/- 0.0101 0.6985 +/- 0.0038
    Exp_2B Logistic Regression 0.6649 +/- 0.0055 0.6083 +/- 0.0066 0.7318 +/- 0.0079 0.5207 +/- 0.0126 0.7110 +/- 0.0056
    Exp_2B       Random Forest 0.6695 +/- 0.0046 0.5977 +/- 0.0212 0.7664 +/- 0.0411 0.4929 +/- 0.0446 0.7272 +/- 0.0097

=================================================================
Delta Hard (Exp 2A - Exp 2B)
=================================================================
  KNN                     dAcc=+0.0820  dF1=+0.0885  dAUC=+0.0870
  Logistic Regression     dAcc=+0.0554  dF1=+0.0962  dAUC=+0.0657
  Random Forest           dAcc=+0.0751  dF1=+0.1303  dAUC=+0.0875

=================================================================
Best Hyperparameters (fold 0, representative)
=================================================================

  KNN:
    Exp 2A: {'n_neighbors': 15, 'weights': 'distance'}
    Exp 2B: {'n_neighbors': 15, 'weights': 'uniform'}

  Logistic Regression:
    Exp 2A: {'C': 0.1, 'l1_ratio': 1.0}
    Exp 2B: {'C': 0.1, 'l1_ratio': 1.0}

  Random Forest:
    Exp 2A: {'n_estimators': 100, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_depth': None}
    Exp 2B: {'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': 10}

Saved summary to /Users/victoria/Desktop/MADS/ccc-project/clue_misdirection/outputs/results_harder.csv
Per-fold results saved to /Users/victoria/Desktop/MADS/ccc-project/clue_misdirection/outputs/results_harder_per_fold.csv

5. Discussion¶

The harder dataset uses cosine-similarity distractors — answer words sampled from the most semantically similar candidates to the definition (Decision 6). Unlike the easy dataset where random distractors are trivially distinguishable, these distractors share genuine semantic overlap with the definition. This makes the classification task substantially harder and is where the misdirection hypothesis is tested through the classifier lens (design doc Section 8.4).

Interpreting Δ Hard (Exp 2A − Exp 2B)¶

The 6 context-informed features measure how the definition's embedding shifts when it is read within the full clue sentence (the "surface reading"). By comparing Exp 2A (with these features) to Exp 2B (without them), we directly test whether clue context helps or hurts the classifier's ability to distinguish real definition–answer pairs from distractor pairs:

  • If Δ Hard < 0 (2A worse than 2B): The context-informed features hurt classification. This means the clue's surface reading shifts the definition embedding in a direction that makes it harder to identify the true answer — direct evidence that cryptic clues misdirect embedding-based models, consistent with the misdirection hypothesis.

  • If Δ Hard > 0 (2A better than 2B): The context-informed features help classification. The classifier is able to extract useful signal from the contextual shift that outweighs any misdirection effect. This would suggest that while misdirection exists (as shown by the retrieval analysis), the classifier can learn to exploit the contextual information rather than being fooled by it.

  • If Δ Hard ≈ 0: The context-informed features are uninformative on the harder task — neither helping nor hurting. This would suggest the contextual shift is essentially noise relative to the relationship and surface features.
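Beyond the sign of the mean, the per-fold deltas indicate how consistent the gap is. As a rough sketch, using the Random Forest per-fold accuracies printed in cell [6] (a paired t-statistic across CV folds is only indicative, since folds share training data and are not independent; scipy.stats.ttest_rel would give the same statistic):

```python
import numpy as np

# Random Forest per-fold test accuracies from cell [6] above.
acc_2a = np.array([0.7392, 0.7498, 0.7410, 0.7575, 0.7358])
acc_2b = np.array([0.6647, 0.6750, 0.6660, 0.6737, 0.6680])

# Per-fold delta and a paired t-statistic computed by hand:
# t = mean(delta) / (sd(delta) / sqrt(n)).
delta = acc_2a - acc_2b
n = len(delta)
t_stat = delta.mean() / (delta.std(ddof=1) / np.sqrt(n))

print(f'Delta Hard (RF accuracy): {delta.mean():+.4f} +/- {delta.std(ddof=1):.4f}')
print(f'paired t across folds:    t = {t_stat:.1f}')
```

Here the delta is positive in every fold, so the conclusion does not hinge on any single split.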

Comparison with the easy dataset (NB 06)¶

The easy dataset (Exp 1A/1B) serves as a control. Because random distractors are semantically unrelated, the context-informed features add little signal regardless — Δ Easy should be near zero. The interesting comparison is whether Δ Hard differs meaningfully from Δ Easy:

  • Δ Hard < Δ Easy: Context is more harmful (or less helpful) on the harder task, suggesting that misdirection specifically targets the semantic similarity channel that the harder distractors exploit.
  • Δ Hard > Δ Easy: Context is more helpful on the harder task, suggesting the contextual shift provides discriminative signal precisely when the task is difficult.
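Once both summary files exist, the Δ Hard vs. Δ Easy comparison reduces to a per-model subtraction over the two CSVs. The sketch below assumes the easy-dataset summary shares the schema of results_harder.csv (Experiment, Model, accuracy_mean); the Exp 1A/1B numbers are illustrative placeholders, not actual NB 06 results:

```python
import pandas as pd

# Stand-ins for the two saved summaries. The hard-dataset number is the
# Random Forest mean from the summary above; the easy numbers are
# placeholders for illustration only.
hard = pd.DataFrame({
    'Experiment': ['Exp_2A', 'Exp_2B'],
    'Model': ['Random Forest', 'Random Forest'],
    'accuracy_mean': [0.7446, 0.6695],
})
easy = pd.DataFrame({
    'Experiment': ['Exp_1A', 'Exp_1B'],
    'Model': ['Random Forest', 'Random Forest'],
    'accuracy_mean': [0.98, 0.97],   # placeholder values
})

def delta(summary, exp_a, exp_b):
    """Per-model accuracy gap between two experiments."""
    a = summary.loc[summary['Experiment'] == exp_a].set_index('Model')['accuracy_mean']
    b = summary.loc[summary['Experiment'] == exp_b].set_index('Model')['accuracy_mean']
    return a - b

d_hard = delta(hard, 'Exp_2A', 'Exp_2B')
d_easy = delta(easy, 'Exp_1A', 'Exp_1B')
# Positive values → context is more helpful on the harder task.
print((d_hard - d_easy).round(4))
```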

Caveats¶

  • Results under SAMPLE_MODE = True use a 20,000-row subsample and reduced hyperparameter grids. Final conclusions should be drawn from the full-dataset run (SAMPLE_MODE = False).
  • The classifier analysis complements but does not replace the retrieval analysis (PLAN.md Step 9). The retrieval analysis directly measures whether clue context degrades the rank of the true answer among all candidate words — a more direct test of misdirection. The classifier tests whether the effect is strong enough to degrade a supervised model trained to exploit all available features.
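The rank metric referenced in the last caveat can be sketched as follows. This is an illustrative toy, not Step 9's actual implementation: the embeddings are random, and the helper simply ranks candidates by cosine similarity to a query vector.

```python
import numpy as np

# Toy setup: 100 random candidate "answer embeddings"; the query is a
# noisy view of the true answer's embedding (standing in for the
# context-shifted definition embedding).
rng = np.random.default_rng(42)
candidates = rng.normal(size=(100, 8))
true_idx = 7
query = candidates[true_idx] + rng.normal(scale=0.5, size=8)

def rank_of_true(query, candidates, true_idx):
    """Rank of the true answer among all candidates by cosine similarity
    (1 = best). Rank is 1 + the number of strictly higher-scoring candidates."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = c @ q
    return 1 + int((sims > sims[true_idx]).sum())

print(rank_of_true(query, candidates, true_idx))
```

A rank that worsens as the query moves from the bare definition to its in-context reading is the retrieval-side signature of misdirection.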