06 — Experiments: Easy Dataset¶

Primary author: Victoria

Builds on:

  • 05_dataset_construction.ipynb (Victoria — easy dataset with random distractors, balanced 1:1 real/distractor, all 47 features)
  • 03_feature_engineering.ipynb (Victoria — 47-feature computation logic, feature group column lists extracted to scripts/feature_utils.py per Decision 18)
  • Hans's Hans_Supervised_Learning_Models.ipynb (KNN, LogReg, RF scaffolding with 5-fold stratified CV — adapted to GroupKFold per Decision 7, expanded from 10 to 47 features, and restructured for the A/B ablation design)

Prompt engineering: Victoria
AI assistance: Claude Code (Anthropic)
Environment: Local (CPU only)

This notebook implements PLAN.md Step 6 — running Experiments 1A and 1B on the easy (random-distractor) dataset (design doc Section 8.3).

Experiment 1A — All 47 features¶

Three model families (KNN, Logistic Regression, Random Forest) are trained on all 47 features: 15 context-free meaning + 6 context-informed meaning + 22 relationship + 4 surface. Because random distractors are semantically unrelated to the definition, we expect high accuracy across all models — this is a baseline sanity check.

Experiment 1B — 41 features (context-informed removed)¶

The 6 context-informed cosine features (involving word1_clue_context) are removed, leaving 41 features. Comparing 1A vs. 1B measures whether clue context helps or hurts classification on the easy task. Because random distractors are trivially distinguishable, we expect the Δ (1A − 1B) to be negligible — the interesting comparison is on the harder dataset (NB 07).

Input: data/dataset_easy.parquet (Step 5)
Output: outputs/results_easy.csv, saved hyperparameters


1. Configuration¶

SAMPLE_MODE: Set to True (default) for fast iteration with 20,000 rows. Set to False for final runs on the full dataset. The sample is stratified by label to preserve the 1:1 balance.

In [1]:
import sys
import warnings

import numpy as np
import pandas as pd
from pathlib import Path

from sklearn.model_selection import GroupKFold

warnings.filterwarnings('ignore', category=FutureWarning)

# --- Environment Auto-Detection ---
# Same pattern as prior notebooks: detect Colab vs. local / Great Lakes.
try:
    IS_COLAB = 'google.colab' in str(get_ipython())
except NameError:
    IS_COLAB = False

if IS_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    PROJECT_ROOT = Path('/content/drive/MyDrive/SIADS 692 Milestone II/'
                        'Milestone II - NLP Cryptic Crossword Clues/'
                        'clue_misdirection')
else:
    try:
        PROJECT_ROOT = Path(__file__).resolve().parent.parent
    except NameError:
        PROJECT_ROOT = Path.cwd().parent

DATA_DIR = PROJECT_ROOT / 'data'
OUTPUT_DIR = PROJECT_ROOT / 'outputs'
SCRIPTS_DIR = PROJECT_ROOT / 'scripts'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# --- Add scripts/ to sys.path so feature_utils is importable ---
# Decision 18: feature column lists are defined in feature_utils.py
# (extracted from NB 03) so all downstream notebooks use the same names.
if str(SCRIPTS_DIR) not in sys.path:
    sys.path.insert(0, str(SCRIPTS_DIR))

from feature_utils import (
    ALL_FEATURE_COLS,
    CONTEXT_INFORMED_COLS,
)

# --- Experiment parameters ---
RANDOM_SEED = 42
N_FOLDS = 5
SAMPLE_MODE = True   # <-- Set to False for final runs
SAMPLE_SIZE = 20_000

# --- Feature sets ---
# Exp 1A: all 47 features
# Exp 1B: remove the 6 context-informed cosine features (those involving
# word1_clue_context) → 41 features. These features capture how the
# definition's meaning shifts when embedded within the clue sentence.
# Removing them tests whether clue context helps or hurts classification.
EXP_1B_COLS = [c for c in ALL_FEATURE_COLS if c not in CONTEXT_INFORMED_COLS]

print(f'Environment: {"Google Colab" if IS_COLAB else "Local / Great Lakes"}')
print(f'Project root: {PROJECT_ROOT}')
print(f'Data directory: {DATA_DIR}')
print(f'Output directory: {OUTPUT_DIR}')
print(f'\nRandom seed: {RANDOM_SEED}')
print(f'CV folds: {N_FOLDS}')
print(f'Sample mode: {SAMPLE_MODE}'
      f'{f" ({SAMPLE_SIZE:,} rows)" if SAMPLE_MODE else " (full dataset)"}')
print(f'\nFeature sets:')
print(f'  Exp 1A (all features):              {len(ALL_FEATURE_COLS)}')
print(f'  Context-informed (to remove for 1B): {len(CONTEXT_INFORMED_COLS)}')
print(f'  Exp 1B (context-free only):          {len(EXP_1B_COLS)}')
Environment: Local / Great Lakes
Project root: /Users/victoria/Desktop/MADS/ccc-project/clue_misdirection
Data directory: /Users/victoria/Desktop/MADS/ccc-project/clue_misdirection/data
Output directory: /Users/victoria/Desktop/MADS/ccc-project/clue_misdirection/outputs

Random seed: 42
CV folds: 5
Sample mode: True (20,000 rows)

Feature sets:
  Exp 1A (all features):              47
  Context-informed (to remove for 1B): 6
  Exp 1B (context-free only):          41

2. Load the Easy Dataset¶

In [2]:
# ============================================================
# Load dataset_easy.parquet (Step 5 output)
# ============================================================
dataset_path = DATA_DIR / 'dataset_easy.parquet'
assert dataset_path.exists(), (
    f'Missing input file: {dataset_path}\n'
    f'Run 05_dataset_construction.ipynb first to produce this file.'
)

df = pd.read_parquet(dataset_path)
print(f'Loaded dataset_easy.parquet: {len(df):,} rows × {len(df.columns)} columns')

# --- Validate expected columns ---
missing_feat = [c for c in ALL_FEATURE_COLS if c not in df.columns]
assert not missing_feat, f'Missing feature columns: {missing_feat}'
assert 'label' in df.columns, 'Missing label column'
assert 'definition_wn' in df.columns, 'Missing definition_wn column'
assert 'answer_wn' in df.columns, 'Missing answer_wn column'

# --- Sample mode ---
# When iterating quickly, take a stratified subsample to speed up
# cross-validation. Stratification preserves the 1:1 label balance.
# We sample each label group separately and concatenate; this keeps
# index handling simple compared with groupby().apply(), which
# prepends the grouping key to the index.
if SAMPLE_MODE:
    sampled_parts = []
    for label_val in df['label'].unique():
        group = df[df['label'] == label_val]
        sampled_parts.append(
            group.sample(n=min(SAMPLE_SIZE // 2, len(group)),
                         random_state=RANDOM_SEED)
        )
    df = pd.concat(sampled_parts, ignore_index=True)
    print(f'\n⚠ SAMPLE MODE: subsampled to {len(df):,} rows '
          f'(set SAMPLE_MODE = False for final runs)')

# --- Summary ---
print(f'\nShape: {df.shape}')
print(f'\nLabel distribution:')
print(df['label'].value_counts().to_string())
print(f'\nUnique definition_wn values: {df["definition_wn"].nunique():,}')
print(f'Unique answer_wn values:     {df["answer_wn"].nunique():,}')

# Number of unique (definition_wn, answer_wn) pairs — this is the
# grouping unit for GroupKFold. Each pair may appear in multiple clue
# rows (different clue surfaces for the same definition–answer pair),
# and each real row has a corresponding distractor row with a different
# answer_wn. GroupKFold ensures all rows sharing the same pair stay in
# the same fold, preventing near-duplicate feature vectors from leaking
# across train/test splits.
n_unique_pairs = df.groupby(['definition_wn', 'answer_wn']).ngroups
print(f'Unique (definition_wn, answer_wn) pairs: {n_unique_pairs:,}')

# --- Validate no NaNs in feature columns ---
feat_nulls = df[ALL_FEATURE_COLS].isnull().any()
if feat_nulls.any():
    print(f'\nWARNING: NaN values found in features:')
    print(feat_nulls[feat_nulls].to_string())
else:
    print(f'\nNo NaN values in any of the {len(ALL_FEATURE_COLS)} feature columns ✓')
Loaded dataset_easy.parquet: 480,422 rows × 62 columns

⚠ SAMPLE MODE: subsampled to 20,000 rows (set SAMPLE_MODE = False for final runs)

Shape: (20000, 62)

Label distribution:
label
1    10000
0    10000

Unique definition_wn values: 8,151
Unique answer_wn values:     15,114
Unique (definition_wn, answer_wn) pairs: 19,339

No NaN values in any of the 47 feature columns ✓

3. GroupKFold Assignment¶

We use GroupKFold (Decision 7) rather than StratifiedKFold because multiple clue rows can share the same (definition_wn, answer_wn) pair. These rows have near-identical feature vectors (differing only in the 6 context-informed features, which depend on the specific clue surface). If they were split across train and test folds, the model would effectively see the test example during training — leaking information and inflating accuracy.

GroupKFold guarantees that all rows belonging to the same (definition_wn, answer_wn) group are assigned to the same fold. The same fold assignments are reused for both Exp 1A and Exp 1B to ensure a fair comparison.
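The guarantee can be seen on a toy example (synthetic data, not project data): with GroupKFold, every test fold contains only whole groups, so rows sharing a group id never straddle a train/test boundary.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Six rows, three groups of two near-identical rows each.
X = np.arange(12).reshape(6, 2)
y = np.array([1, 0, 1, 0, 1, 0])
groups = np.array(['a', 'a', 'b', 'b', 'c', 'c'])

gkf = GroupKFold(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups)):
    # Each test fold holds exactly one whole group; its rows never
    # appear in that fold's training set.
    print(fold, sorted(set(groups[test_idx])))
```

StratifiedKFold, by contrast, would happily place one row of a pair in train and its twin in test.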

In [3]:
# ============================================================
# Create group key and assign folds
# ============================================================
# Build a composite group key from (definition_wn, answer_wn). All rows
# sharing this pair — whether real or distractor, and across multiple
# clue surfaces — land in the same fold.
groups = df['definition_wn'].astype(str) + '|||' + df['answer_wn'].astype(str)

gkf = GroupKFold(n_splits=N_FOLDS)

# GroupKFold.split() yields (train_idx, test_idx) tuples. We only need
# the fold assignment for each row, so we iterate and record which fold
# each row's test set falls into.
df['fold'] = -1
for fold_idx, (_, test_idx) in enumerate(gkf.split(df, y=df['label'], groups=groups)):
    df.loc[df.index[test_idx], 'fold'] = fold_idx

assert (df['fold'] >= 0).all(), 'Some rows were not assigned to any fold'

# ============================================================
# Verify: no (definition_wn, answer_wn) pair spans multiple folds
# ============================================================
folds_per_group = (
    df.groupby(['definition_wn', 'answer_wn'])['fold']
      .nunique()
)
leaked_groups = folds_per_group[folds_per_group > 1]
assert len(leaked_groups) == 0, (
    f'{len(leaked_groups)} groups span multiple folds — GroupKFold failed!\n'
    f'Examples: {leaked_groups.head(5).to_dict()}'
)
print(f'GroupKFold verification passed: no (definition_wn, answer_wn) '
      f'pair spans multiple folds ✓')

# ============================================================
# Print fold sizes and label balance
# ============================================================
print(f'\n{"Fold":<6s} {"Size":>8s} {"Label=1":>10s} {"Label=0":>10s} {"% Positive":>12s}')
print('-' * 50)
for fold_idx in range(N_FOLDS):
    fold_mask = df['fold'] == fold_idx
    fold_size = fold_mask.sum()
    n_pos = (df.loc[fold_mask, 'label'] == 1).sum()
    n_neg = (df.loc[fold_mask, 'label'] == 0).sum()
    pct_pos = n_pos / fold_size * 100
    print(f'{fold_idx:<6d} {fold_size:>8,d} {n_pos:>10,d} {n_neg:>10,d} {pct_pos:>11.1f}%')

print(f'\nTotal rows: {len(df):,}')
print(f'Unique groups: {groups.nunique():,}')
GroupKFold verification passed: no (definition_wn, answer_wn) pair spans multiple folds ✓

Fold       Size    Label=1    Label=0   % Positive
--------------------------------------------------
0         4,000      2,024      1,976        50.6%
1         4,000      1,965      2,035        49.1%
2         4,000      1,978      2,022        49.5%
3         4,000      1,980      2,020        49.5%
4         4,000      2,053      1,947        51.3%

Total rows: 20,000
Unique groups: 19,339

4. Experiment Design¶

We run two experiments on the easy (random-distractor) dataset, following the design in Section 8.3 of the design document (Table 7):

| Experiment | Features | Count | Description |
|---|---|---|---|
| Exp 1A | All | 47 | 15 context-free + 6 context-informed + 22 relationship + 4 surface |
| Exp 1B | Context-informed removed | 41 | 15 context-free + 22 relationship + 4 surface |

Three model families are trained under each condition:

  1. K-Nearest Neighbors (KNN) — instance-based, non-parametric. Features are scaled with StandardScaler fitted on the training fold.
  2. Logistic Regression — probabilistic, linear. Scaled. The penalty is tuned via l1_ratio over L2 (0.0), elastic-net (0.5), and L1 (1.0), all of which solver='saga' with penalty='elasticnet' supports.
  3. Random Forest — tree-based, non-linear. Scale-invariant, so no scaling applied. Uses RandomizedSearchCV due to larger grid.

Hyperparameters are tuned via inner 3-fold stratified CV within each training fold. The outer 5-fold GroupKFold (assigned in Section 3) provides the train/test split. The same folds are used for both experiments to ensure a fair A vs. B comparison (Decision 7).

Expected outcome: High accuracy in both experiments. Random distractors are semantically unrelated to the definition, so even simple features should discriminate well. The Δ (1A − 1B) should be small — the real test of misdirection is on the harder dataset (NB 07, Exp 2A vs. 2B).
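One design note on scaling: in the implementation below, StandardScaler is fitted once on the whole outer training fold before the inner search, so inner-CV validation rows influence the scaling statistics. This leakage stays confined to the training fold and is a common simplification, but a stricter alternative (sketched here for KNN under the same grid; not what this notebook runs) wraps scaler and estimator in a Pipeline so the scaler is refit inside every inner split:

```python
# Sketch of a leak-free alternative to scale-then-search: a Pipeline
# refits StandardScaler on each inner-CV training split automatically.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])
# Tuned parameters gain the step-name prefix ('knn__') inside a Pipeline.
param_grid = {
    'knn__n_neighbors': [3, 7, 15],
    'knn__weights': ['uniform', 'distance'],
}
search = GridSearchCV(
    pipe, param_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring='accuracy', n_jobs=-1,
)
# search.fit(X_train, y_train) would then scale and tune without any
# inner-validation rows touching the scaler.
```

With distractors this easy, the practical difference is almost certainly negligible, which is why the simpler scale-then-search structure is retained.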

In [4]:
import time
from sklearn.base import clone
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     StratifiedKFold)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# ============================================================
# Hyperparameter grids
# ============================================================
# Full grids for final runs; reduced grids when SAMPLE_MODE is True
# to keep iteration time under a few minutes.

if SAMPLE_MODE:
    knn_grid = {
        'n_neighbors': [3, 7, 15],
        'weights': ['uniform', 'distance'],
    }
    logreg_grid = {
        'C': [0.1, 1.0, 10.0],
        'l1_ratio': [0.0, 0.5, 1.0],
    }
    rf_grid = {
        'n_estimators': [100, 200],
        'max_depth': [5, 10, None],
        'min_samples_split': [2, 5],
        'min_samples_leaf': [1, 2],
    }
    RF_N_ITER = 10
else:
    knn_grid = {
        'n_neighbors': [3, 5, 7, 11, 15, 21],
        'weights': ['uniform', 'distance'],
        'metric': ['euclidean', 'manhattan'],
    }
    logreg_grid = {
        'C': [0.01, 0.1, 1.0, 10.0, 100.0],
        'l1_ratio': [0.0, 0.5, 1.0],
    }
    rf_grid = {
        'n_estimators': [100, 200, 500],
        'max_depth': [5, 10, 20, None],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'max_features': ['sqrt', 'log2'],
    }
    RF_N_ITER = 20

# --- Model configurations ---
# Each entry defines the base estimator, its search grid, search
# strategy (grid vs. randomized), and whether StandardScaler should
# be applied before fitting.  Random Forest is scale-invariant, so
# it receives unscaled features directly.
model_configs = {
    'KNN': {
        'estimator': KNeighborsClassifier(),
        'param_grid': knn_grid,
        'search': 'grid',
        'scale': True,
    },
    'Logistic Regression': {
        'estimator': LogisticRegression(
            solver='saga', penalty='elasticnet',
            max_iter=5000, random_state=RANDOM_SEED),
        'param_grid': logreg_grid,
        'search': 'grid',
        'scale': True,
    },
    'Random Forest': {
        'estimator': RandomForestClassifier(random_state=RANDOM_SEED),
        'param_grid': rf_grid,
        'search': 'random',
        'scale': False,
        'n_iter': RF_N_ITER,
    },
}

# Print grid sizes so we know how long tuning will take
for name, cfg in model_configs.items():
    total = 1
    for v in cfg['param_grid'].values():
        total *= len(v)
    if cfg['search'] == 'grid':
        print(f'{name}: GridSearchCV \u2014 {total} combinations \u00d7 3 inner folds')
    else:
        n_it = cfg.get('n_iter', 20)
        print(f'{name}: RandomizedSearchCV \u2014 {n_it} of {total} '
              f'combinations \u00d7 3 inner folds')
KNN: GridSearchCV — 6 combinations × 3 inner folds
Logistic Regression: GridSearchCV — 9 combinations × 3 inner folds
Random Forest: RandomizedSearchCV — 10 of 24 combinations × 3 inner folds
In [5]:
def run_experiment(df, feature_cols, experiment_name, model_configs,
                   n_folds=5):
    """Run a classification experiment using pre-assigned GroupKFold splits.

    For each outer fold, hyperparameters are tuned via inner 3-fold
    StratifiedKFold CV on the training portion, then the best model is
    evaluated on the held-out test fold.  StandardScaler is fitted on
    the training fold only for scale-sensitive models (KNN, LogReg).

    Parameters
    ----------
    df : pd.DataFrame
        Must contain ``feature_cols``, ``'label'``, and ``'fold'``.
    feature_cols : list of str
        Feature columns to use as model input.
    experiment_name : str
        Label for this experiment (e.g., ``"Exp_1A"``).
    model_configs : dict
        Model definitions — see hyperparameter grids cell above.
    n_folds : int
        Number of outer CV folds (must match ``'fold'`` column values).

    Returns
    -------
    results_df : pd.DataFrame
        One row per (model, fold) with accuracy, F1, precision, recall,
        ROC AUC, and best hyperparameters.
    best_params : dict
        ``{model_name: {fold_idx: best_params_dict}}``.
    """
    X = df[feature_cols].values
    y = df['label'].values

    all_results = []
    best_params = {name: {} for name in model_configs}

    # Inner CV for hyperparameter tuning.  We use StratifiedKFold (not
    # GroupKFold) for the inner loop because the outer GroupKFold has
    # already separated definition-answer groups across folds — further
    # group separation within a single training fold is unnecessary and
    # would complicate the search with minimal benefit.
    inner_cv = StratifiedKFold(
        n_splits=3, shuffle=True, random_state=RANDOM_SEED)

    t0_exp = time.time()
    print(f'\n{"="*65}')
    print(f'{experiment_name}: {len(feature_cols)} features, {len(df):,} rows')
    print(f'{"="*65}')

    for fold_idx in range(n_folds):
        test_mask = (df['fold'] == fold_idx).values
        train_mask = ~test_mask

        X_train, X_test = X[train_mask], X[test_mask]
        y_train, y_test = y[train_mask], y[test_mask]

        print(f'\nFold {fold_idx}: '
              f'train={train_mask.sum():,}  test={test_mask.sum():,}')

        for model_name, config in model_configs.items():
            t0 = time.time()
            estimator = clone(config['estimator'])

            # --- Feature scaling ---
            # StandardScaler is fitted on the training fold only, then
            # applied to the test fold.  This prevents information about
            # test-set feature distributions from leaking into training.
            # Random Forest is scale-invariant and skips this step.
            if config['scale']:
                scaler = StandardScaler()
                X_tr = scaler.fit_transform(X_train)
                X_te = scaler.transform(X_test)
            else:
                X_tr = X_train
                X_te = X_test

            # --- Inner CV hyperparameter search ---
            if config['search'] == 'random':
                search = RandomizedSearchCV(
                    estimator, config['param_grid'],
                    n_iter=config.get('n_iter', 20),
                    cv=inner_cv, scoring='accuracy',
                    n_jobs=-1, random_state=RANDOM_SEED,
                )
            else:
                search = GridSearchCV(
                    estimator, config['param_grid'],
                    cv=inner_cv, scoring='accuracy',
                    n_jobs=-1,
                )

            search.fit(X_tr, y_train)
            best_params[model_name][fold_idx] = search.best_params_

            # --- Evaluate on held-out test fold ---
            y_pred = search.predict(X_te)
            y_prob = search.predict_proba(X_te)[:, 1]

            acc = accuracy_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred)
            prec = precision_score(y_test, y_pred)
            rec = recall_score(y_test, y_pred)
            roc = roc_auc_score(y_test, y_prob)
            elapsed = time.time() - t0

            all_results.append({
                'experiment': experiment_name,
                'model': model_name,
                'fold': fold_idx,
                'accuracy': acc,
                'f1': f1,
                'precision': prec,
                'recall': rec,
                'roc_auc': roc,
                'best_params': str(search.best_params_),
            })

            print(f'  {model_name:<22s} '
                  f'Acc={acc:.4f}  F1={f1:.4f}  AUC={roc:.4f}  '
                  f'[{elapsed:.1f}s]')

    elapsed_total = time.time() - t0_exp
    print(f'\n{experiment_name} complete in {elapsed_total:.0f}s')

    return pd.DataFrame(all_results), best_params
In [6]:
# ============================================================
# Run Experiment 1A: all 47 features
# ============================================================
results_1a, params_1a = run_experiment(
    df, ALL_FEATURE_COLS, "Exp_1A", model_configs, n_folds=N_FOLDS)

# ============================================================
# Run Experiment 1B: 41 features (context-informed removed)
# ============================================================
results_1b, params_1b = run_experiment(
    df, EXP_1B_COLS, "Exp_1B", model_configs, n_folds=N_FOLDS)
=================================================================
Exp_1A: 47 features, 20,000 rows
=================================================================

Fold 0: train=16,000  test=4,000
  KNN                    Acc=0.8620  F1=0.8542  AUC=0.9265  [3.1s]
  Logistic Regression    Acc=0.8678  F1=0.8657  AUC=0.9393  [25.2s]
  Random Forest          Acc=0.8678  F1=0.8669  AUC=0.9406  [14.6s]

Fold 1: train=16,000  test=4,000
  KNN                    Acc=0.8552  F1=0.8420  AUC=0.9255  [0.9s]
  Logistic Regression    Acc=0.8615  F1=0.8540  AUC=0.9358  [34.9s]
  Random Forest          Acc=0.8695  F1=0.8636  AUC=0.9401  [17.3s]

Fold 2: train=16,000  test=4,000
  KNN                    Acc=0.8588  F1=0.8480  AUC=0.9265  [0.9s]
  Logistic Regression    Acc=0.8705  F1=0.8662  AUC=0.9367  [28.2s]
  Random Forest          Acc=0.8712  F1=0.8685  AUC=0.9398  [15.1s]

Fold 3: train=16,000  test=4,000
  KNN                    Acc=0.8538  F1=0.8402  AUC=0.9249  [0.9s]
  Logistic Regression    Acc=0.8688  F1=0.8630  AUC=0.9410  [20.7s]
  Random Forest          Acc=0.8738  F1=0.8691  AUC=0.9396  [14.9s]

Fold 4: train=16,000  test=4,000
  KNN                    Acc=0.8540  F1=0.8467  AUC=0.9256  [0.9s]
  Logistic Regression    Acc=0.8700  F1=0.8686  AUC=0.9416  [19.2s]
  Random Forest          Acc=0.8788  F1=0.8788  AUC=0.9418  [17.7s]

Exp_1A complete in 214s

=================================================================
Exp_1B: 41 features, 20,000 rows
=================================================================

Fold 0: train=16,000  test=4,000
  KNN                    Acc=0.8580  F1=0.8508  AUC=0.9231  [0.8s]
  Logistic Regression    Acc=0.8668  F1=0.8644  AUC=0.9364  [27.7s]
  Random Forest          Acc=0.8652  F1=0.8632  AUC=0.9355  [13.0s]

Fold 1: train=16,000  test=4,000
  KNN                    Acc=0.8522  F1=0.8401  AUC=0.9215  [0.9s]
  Logistic Regression    Acc=0.8618  F1=0.8543  AUC=0.9346  [28.2s]
  Random Forest          Acc=0.8678  F1=0.8611  AUC=0.9385  [14.9s]

Fold 2: train=16,000  test=4,000
  KNN                    Acc=0.8550  F1=0.8462  AUC=0.9219  [0.8s]
  Logistic Regression    Acc=0.8665  F1=0.8617  AUC=0.9348  [28.8s]
  Random Forest          Acc=0.8650  F1=0.8609  AUC=0.9339  [12.5s]

Fold 3: train=16,000  test=4,000
  KNN                    Acc=0.8528  F1=0.8417  AUC=0.9249  [0.8s]
  Logistic Regression    Acc=0.8670  F1=0.8609  AUC=0.9392  [21.3s]
  Random Forest          Acc=0.8708  F1=0.8658  AUC=0.9371  [12.6s]

Fold 4: train=16,000  test=4,000
  KNN                    Acc=0.8565  F1=0.8510  AUC=0.9239  [0.8s]
  Logistic Regression    Acc=0.8690  F1=0.8673  AUC=0.9401  [18.1s]
  Random Forest          Acc=0.8732  F1=0.8729  AUC=0.9404  [12.8s]

Exp_1B complete in 194s
In [7]:
# ============================================================
# Results Summary: mean +/- SD across folds
# ============================================================
results_all = pd.concat([results_1a, results_1b], ignore_index=True)

metrics = ['accuracy', 'f1', 'precision', 'recall', 'roc_auc']
summary_rows = []

for exp_name in ['Exp_1A', 'Exp_1B']:
    for model_name in model_configs:
        mask = ((results_all['experiment'] == exp_name) &
                (results_all['model'] == model_name))
        subset = results_all[mask]
        row = {'Experiment': exp_name, 'Model': model_name}
        for m in metrics:
            mean_val = subset[m].mean()
            std_val = subset[m].std()
            row[f'{m}_mean'] = mean_val
            row[f'{m}_std'] = std_val
            row[m] = f'{mean_val:.4f} +/- {std_val:.4f}'
        summary_rows.append(row)

summary_df = pd.DataFrame(summary_rows)

# --- Display formatted table ---
display_cols = ['Experiment', 'Model'] + metrics
print('RESULTS SUMMARY — Easy Dataset (mean +/- SD across 5 folds)')
print('=' * 115)
print(summary_df[display_cols].to_string(index=False))

# --- Delta Easy = Exp 1A - Exp 1B per model ---
print(f'\n{"="*65}')
print('Delta Easy (Exp 1A - Exp 1B)')
print(f'{"="*65}')
delta_rows = []
for model_name in model_configs:
    row_1a = summary_df[(summary_df['Experiment'] == 'Exp_1A') &
                        (summary_df['Model'] == model_name)].iloc[0]
    row_1b = summary_df[(summary_df['Experiment'] == 'Exp_1B') &
                        (summary_df['Model'] == model_name)].iloc[0]
    delta_row = {'Model': model_name}
    for m in metrics:
        delta_row[f'delta_{m}'] = row_1a[f'{m}_mean'] - row_1b[f'{m}_mean']
    delta_rows.append(delta_row)
    print(f'  {model_name:<22s}  '
          f'dAcc={delta_row["delta_accuracy"]:+.4f}  '
          f'dF1={delta_row["delta_f1"]:+.4f}  '
          f'dAUC={delta_row["delta_roc_auc"]:+.4f}')

delta_df = pd.DataFrame(delta_rows)

# --- Best hyperparameters (fold 0 as representative) ---
print(f'\n{"="*65}')
print('Best Hyperparameters (fold 0, representative)')
print(f'{"="*65}')
for model_name in model_configs:
    print(f'\n  {model_name}:')
    print(f'    Exp 1A: {params_1a[model_name][0]}')
    print(f'    Exp 1B: {params_1b[model_name][0]}')

# --- Save to CSV ---
save_df = summary_df[['Experiment', 'Model'] +
                      [f'{m}_mean' for m in metrics] +
                      [f'{m}_std' for m in metrics]]
save_path = OUTPUT_DIR / 'results_easy.csv'
save_df.to_csv(save_path, index=False)
print(f'\nSaved summary to {save_path}')

# Per-fold results (including best_params) for reproducibility
fold_path = OUTPUT_DIR / 'results_easy_per_fold.csv'
results_all.to_csv(fold_path, index=False)
print(f'Per-fold results saved to {fold_path}')
RESULTS SUMMARY — Easy Dataset (mean +/- SD across 5 folds)
===================================================================================================================
Experiment               Model          accuracy                f1         precision            recall           roc_auc
    Exp_1A                 KNN 0.8568 +/- 0.0035 0.8462 +/- 0.0055 0.9129 +/- 0.0056 0.7887 +/- 0.0091 0.9258 +/- 0.0007
    Exp_1A Logistic Regression 0.8677 +/- 0.0036 0.8635 +/- 0.0057 0.8913 +/- 0.0069 0.8375 +/- 0.0087 0.9389 +/- 0.0026
    Exp_1A       Random Forest 0.8722 +/- 0.0043 0.8694 +/- 0.0057 0.8888 +/- 0.0094 0.8508 +/- 0.0074 0.9404 +/- 0.0009
    Exp_1B                 KNN 0.8549 +/- 0.0024 0.8459 +/- 0.0051 0.9012 +/- 0.0085 0.7972 +/- 0.0069 0.9230 +/- 0.0014
    Exp_1B Logistic Regression 0.8662 +/- 0.0027 0.8617 +/- 0.0049 0.8912 +/- 0.0075 0.8342 +/- 0.0064 0.9370 +/- 0.0025
    Exp_1B       Random Forest 0.8684 +/- 0.0036 0.8648 +/- 0.0050 0.8889 +/- 0.0077 0.8420 +/- 0.0051 0.9371 +/- 0.0025

=================================================================
Delta Easy (Exp 1A - Exp 1B)
=================================================================
  KNN                     dAcc=+0.0019  dF1=+0.0003  dAUC=+0.0028
  Logistic Regression     dAcc=+0.0015  dF1=+0.0018  dAUC=+0.0019
  Random Forest           dAcc=+0.0038  dF1=+0.0046  dAUC=+0.0033

=================================================================
Best Hyperparameters (fold 0, representative)
=================================================================

  KNN:
    Exp 1A: {'n_neighbors': 15, 'weights': 'distance'}
    Exp 1B: {'n_neighbors': 15, 'weights': 'distance'}

  Logistic Regression:
    Exp 1A: {'C': 0.1, 'l1_ratio': 0.0}
    Exp 1B: {'C': 0.1, 'l1_ratio': 0.0}

  Random Forest:
    Exp 1A: {'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': None}
    Exp 1B: {'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': None}

Saved summary to /Users/victoria/Desktop/MADS/ccc-project/clue_misdirection/outputs/results_easy.csv
Per-fold results saved to /Users/victoria/Desktop/MADS/ccc-project/clue_misdirection/outputs/results_easy_per_fold.csv

5. Discussion¶

The easy dataset uses random distractors — answer words sampled uniformly at random from the pool of known answers. Because random pairings almost never share meaningful semantic overlap with the definition, classifiers achieve high accuracy by learning a simple rule: "does this definition–answer pair share any semantic relationship at all?" (design doc Section 8.3).

What the results tell us¶

  • High accuracy across all models confirms the pipeline is working correctly and the features capture genuine signal. This is a sanity check, not the main experiment.

  • Small Δ Easy (1A − 1B) is expected. On random distractors, the 6 context-informed features (which measure how the definition’s embedding shifts when read in the clue sentence) add little discriminative power. The context-free meaning and relationship features already distinguish real from random pairs with high confidence.

  • Consistent ROC AUC across models reflects that all three families can separate the classes well regardless of their underlying mechanism. Model choice matters less than feature quality when the task is easy.

Why this matters for the harder dataset¶

The real test of the misdirection hypothesis comes in NB 07 (Exp 2A vs. 2B on the harder dataset), where distractors are chosen by cosine similarity to the definition (Decision 6). In that setting:

  • The 15 context-free cosine features are removed (they are artifacts of the cosine-similarity-based construction).
  • The task is much harder: distractors are semantically plausible.
  • If Exp 2A (with context) < Exp 2B (without context), clue context actively hurts classification — direct evidence for misdirection through the classifier lens.
  • If Exp 2A > Exp 2B, context helps despite misdirection, suggesting the classifier extracts useful signal that the retrieval analysis missed.

The easy dataset results in this notebook establish the baseline against which the harder dataset results (NB 07) should be interpreted.
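Because 1A and 1B reuse the same fold assignments, the per-fold accuracies form paired samples, so a paired t-test is a natural (if underpowered, with only 5 folds) check on whether a Δ exceeds fold noise. This is not run in the notebook; the sketch below illustrates it with the KNN per-fold accuracies printed in Section 4's output:

```python
# Illustrative check, not part of the pipeline: paired t-test on
# per-fold accuracies, valid because Exp 1A and 1B share folds.
from scipy import stats

# KNN per-fold accuracies copied from the Exp 1A / Exp 1B output above.
acc_1a = [0.8620, 0.8552, 0.8588, 0.8538, 0.8540]
acc_1b = [0.8580, 0.8522, 0.8550, 0.8528, 0.8565]

t_stat, p_value = stats.ttest_rel(acc_1a, acc_1b)
print(f'paired t = {t_stat:.3f}, p = {p_value:.3f}')
# With n = 5 folds the test has little power; a non-significant p here
# is consistent with, but does not prove, a negligible Δ.
```

The same comparison on the harder dataset (NB 07) is where a significant Δ would carry real interpretive weight.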