05 — Dataset Construction¶

Primary author: Victoria

Builds on:

  • 03_feature_engineering.ipynb (Victoria — 47-feature computation logic, extracted to scripts/feature_utils.py per Decision 18)
  • Hans's Hans_Supervised_Learning_Models.ipynb (random distractor generation pattern for the easy dataset — adapted for our expanded feature set and embedding-based feature computation)
  • Hans's Hans_Negative_Strategies_Experiment.ipynb (reference for distractor strategies — our harder dataset uses cosine-similarity-based selection per Decision 6, rather than matching Hans's three-strategy comparison)

Prompt engineering: Victoria
AI assistance: Claude Code (Anthropic)
Environment: Local (CPU only)

This notebook implements PLAN.md Steps 5 and 7 — constructing the two balanced binary classification datasets used in the supervised learning experiments.

Step 5: Easy Dataset¶

For each real (clue, definition, answer) row, we generate one distractor by keeping the same (clue, definition) and substituting a randomly sampled answer word from the pool of known answers. Because random distractors are almost always semantically unrelated to the definition, classifiers should easily distinguish real from distractor pairs — this serves as a baseline to verify the pipeline works before moving to harder distractors.

Step 7: Harder Dataset (second half of notebook)¶

Distractors are drawn from the top-k most cosine-similar answer words to each definition (Decision 6). This forces the classifier to rely on subtler features rather than raw cosine similarity. The 15 context-free meaning features are removed because they are artifacts of this construction method.

Inputs:

  • data/features_all.parquet (Step 3 — all 47 features + metadata)
  • data/embeddings/ — embedding .npy arrays + index CSVs (Step 2)
  • data/embeddings/clue_context_phrases.csv (Step 2 — composite key for clue-context embedding lookups)
  • scripts/feature_utils.py (extracted from NB 03 per Decision 18)

Outputs:

  • data/dataset_easy.parquet — balanced 1:1, all 47 features, label column
  • data/dataset_harder.parquet — balanced 1:1, 32 features (no context-free meaning), label column

Imports and Path Setup¶

In [1]:
import warnings
import sys

import numpy as np
import pandas as pd
from pathlib import Path

import nltk
from nltk.corpus import wordnet as wn

from tqdm.auto import tqdm

warnings.filterwarnings('ignore', category=FutureWarning)

# --- Environment Auto-Detection ---
# Same pattern as 02_embedding_generation.ipynb, 03_feature_engineering.ipynb,
# and 04_retrieval_analysis.ipynb: detect Colab, Great Lakes, or local and
# set paths accordingly.
try:
    IS_COLAB = 'google.colab' in str(get_ipython())
except NameError:
    IS_COLAB = False

if IS_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    PROJECT_ROOT = Path('/content/drive/MyDrive/SIADS 692 Milestone II/'
                        'Milestone II - NLP Cryptic Crossword Clues/'
                        'clue_misdirection')
else:
    # Local or Great Lakes: notebook is in clue_misdirection/notebooks/,
    # so parent is the clue_misdirection project root.
    try:
        PROJECT_ROOT = Path(__file__).resolve().parent.parent
    except NameError:
        PROJECT_ROOT = Path.cwd().parent

DATA_DIR = PROJECT_ROOT / 'data'
EMBEDDINGS_DIR = DATA_DIR / 'embeddings'
OUTPUT_DIR = PROJECT_ROOT / 'outputs'
SCRIPTS_DIR = PROJECT_ROOT / 'scripts'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# --- Add scripts/ to sys.path so feature_utils is importable ---
# Decision 18: feature computation logic is extracted from NB 03 into
# scripts/feature_utils.py to avoid duplicating code across NB 05 and NB 07.
if str(SCRIPTS_DIR) not in sys.path:
    sys.path.insert(0, str(SCRIPTS_DIR))

from feature_utils import (
    CONTEXT_FREE_COLS, CONTEXT_INFORMED_COLS, RELATIONSHIP_COLS,
    SURFACE_COLS, ALL_FEATURE_COLS, METADATA_COLS,
    rowwise_cosine,
    compute_surface_features,
    get_wordnet_synsets,
    compute_relationship_features,
    compute_cosine_features_for_pair,
)

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
rng = np.random.RandomState(RANDOM_SEED)

# Download WordNet data — needed for relationship features on distractor pairs.
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)

# Embedding dimension for CALE-MBERT-en
EMBED_DIM = 1024

print(f'Environment: {"Google Colab" if IS_COLAB else "Local / Great Lakes"}')
print(f'Project root: {PROJECT_ROOT}')
print(f'Data directory: {DATA_DIR}')
print(f'Embeddings directory: {EMBEDDINGS_DIR}')
print(f'\nFeature group sizes (from feature_utils):')
print(f'  Context-free:     {len(CONTEXT_FREE_COLS)}')
print(f'  Context-informed: {len(CONTEXT_INFORMED_COLS)}')
print(f'  Relationship:     {len(RELATIONSHIP_COLS)}')
print(f'  Surface:          {len(SURFACE_COLS)}')
print(f'  Total:            {len(ALL_FEATURE_COLS)}')
Environment: Local / Great Lakes
Project root: /Users/victoria/Desktop/MADS/ccc-project/clue_misdirection
Data directory: /Users/victoria/Desktop/MADS/ccc-project/clue_misdirection/data
Embeddings directory: /Users/victoria/Desktop/MADS/ccc-project/clue_misdirection/data/embeddings

Feature group sizes (from feature_utils):
  Context-free:     15
  Context-informed: 6
  Relationship:     22
  Surface:          4
  Total:            47

Load Data¶

We load two types of data:

  1. features_all.parquet (Step 3) — the 47 features + metadata for all 240,211 real (clue, definition, answer) rows. These become the label=1 rows in both datasets.

  2. Embedding files (Step 2) — needed to compute cosine features for distractor pairs. When we swap in a new answer word, we need its pre-computed embeddings (allsense, common, obscure) to compute the 15 context-free and 6 context-informed cosine similarities. Definition embeddings and clue-context embeddings stay the same (same definition, same clue), only the answer side changes.

We build word-to-row-index lookup dicts so we can quickly retrieve any word's embedding from the .npy arrays.

In [2]:
# ============================================================
# Load features_all.parquet (Step 3 output)
# ============================================================
features_path = DATA_DIR / 'features_all.parquet'
assert features_path.exists(), (
    f'Missing input file: {features_path}\n'
    f'Run 03_feature_engineering.ipynb first to produce this file.'
)
df_real = pd.read_parquet(features_path)
print(f'features_all.parquet: {len(df_real):,} rows × {len(df_real.columns)} columns')
print(f'  Unique (definition, answer) pairs: '
      f'{df_real["def_answer_pair_id"].nunique():,}')
print(f'  Unique definition_wn values:       '
      f'{df_real["definition_wn"].nunique():,}')
print(f'  Unique answer_wn values:           '
      f'{df_real["answer_wn"].nunique():,}')

# Validate expected columns exist
missing_feat = [c for c in ALL_FEATURE_COLS if c not in df_real.columns]
assert not missing_feat, f'Missing feature columns in parquet: {missing_feat}'
missing_meta = [c for c in METADATA_COLS if c not in df_real.columns]
assert not missing_meta, f'Missing metadata columns in parquet: {missing_meta}'

# ============================================================
# Load embedding arrays and index files
# ============================================================
# CRITICAL: keep_default_na=False on all index CSVs — the word "nan"
# (grandmother) is a valid crossword definition/answer.

definition_embeddings = np.load(EMBEDDINGS_DIR / 'definition_embeddings.npy')
definition_index = pd.read_csv(
    EMBEDDINGS_DIR / 'definition_index.csv', index_col=0,
    keep_default_na=False)

answer_embeddings = np.load(EMBEDDINGS_DIR / 'answer_embeddings.npy')
answer_index = pd.read_csv(
    EMBEDDINGS_DIR / 'answer_index.csv', index_col=0,
    keep_default_na=False)

clue_context_embeddings = np.load(
    EMBEDDINGS_DIR / 'clue_context_embeddings.npy')

# --- Load clue_context_phrases.csv for composite-key lookups ---
# We use clue_context_phrases.csv (not clue_context_index.csv) because
# clue_id is non-unique for double-definition clues. The composite key
# (clue_id, definition) disambiguates rows. This is the same approach
# used in NB 03 and NB 04.
cc_phrases = pd.read_csv(
    EMBEDDINGS_DIR / 'clue_context_phrases.csv', keep_default_na=False)
cc_phrases['cc_row_position'] = np.arange(len(cc_phrases))

# --- Print shapes ---
print(f'\n{"File":<35s} {"Shape":<25s} {"Memory":>8s}')
print(f'{"-"*35} {"-"*25} {"-"*8}')
for name, arr in [
    ('definition_embeddings.npy', definition_embeddings),
    ('answer_embeddings.npy', answer_embeddings),
    ('clue_context_embeddings.npy', clue_context_embeddings),
]:
    mb = arr.nbytes / 1024**2
    print(f'{name:<35s} {str(arr.shape):<25s} {mb:>6.1f} MB')

print(f'\nIndex sizes:')
print(f'  definition_index:        {len(definition_index):,} rows')
print(f'  answer_index:            {len(answer_index):,} rows')
print(f'  clue_context_phrases:    {len(cc_phrases):,} rows')

# --- Shape assertions ---
n_def = len(definition_index)
n_ans = len(answer_index)
n_cc = len(cc_phrases)
assert definition_embeddings.shape == (n_def, 3, EMBED_DIM)
assert answer_embeddings.shape == (n_ans, 3, EMBED_DIM)
assert clue_context_embeddings.shape == (n_cc, EMBED_DIM)

# ============================================================
# Build word → row-index lookup dicts for embedding retrieval
# ============================================================
# definition_index and answer_index have integer row indices (0, 1, 2, ...)
# and a 'word' column. We create dicts mapping word → row position in the
# corresponding .npy array.
#
# The embedding .npy arrays have shape (N, 3, 1024) where the 3 slots are:
#   [0] = allsense_avg, [1] = common synset, [2] = obscure synset
# So to get e.g. answer "rose"'s common embedding:
#   answer_embeddings[ans_word_to_idx['rose'], 1, :]

def_word_to_idx = pd.Series(
    definition_index.index, index=definition_index['word']).to_dict()
ans_word_to_idx = pd.Series(
    answer_index.index, index=answer_index['word']).to_dict()

# Build a mapping from (clue_id, definition) → cc_row_position for
# clue-context embedding lookups. This composite key handles double-
# definition clues where multiple rows share the same clue_id.
cc_lookup = cc_phrases.set_index(
    ['clue_id', 'definition'])['cc_row_position'].to_dict()

print(f'\nLookup dicts built:')
print(f'  def_word_to_idx:  {len(def_word_to_idx):,} entries')
print(f'  ans_word_to_idx:  {len(ans_word_to_idx):,} entries')
print(f'  cc_lookup:        {len(cc_lookup):,} entries')
features_all.parquet: 240,211 rows × 60 columns
  Unique (definition, answer) pairs: 128,961
  Unique definition_wn values:       27,356
  Unique answer_wn values:           45,183

File                                Shape                       Memory
----------------------------------- ------------------------- --------
definition_embeddings.npy           (27385, 3, 1024)           320.9 MB
answer_embeddings.npy               (45254, 3, 1024)           530.3 MB
clue_context_embeddings.npy         (240211, 1024)             938.3 MB

Index sizes:
  definition_index:        27,385 rows
  answer_index:            45,254 rows
  clue_context_phrases:    240,211 rows

Lookup dicts built:
  def_word_to_idx:  27,385 entries
  ans_word_to_idx:  45,254 entries
  cc_lookup:        240,211 entries
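The `keep_default_na=False` requirement flagged in the cell above can be demonstrated in isolation. A minimal sketch with synthetic CSV data (not the project's actual index files):

```python
import io
import pandas as pd

csv_text = 'row,word\n0,nan\n1,rose\n'

# Default behavior: pandas parses the literal string "nan" as a missing value.
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default['word'].isna().tolist())   # [True, False] — "nan" was lost

# With keep_default_na=False, "nan" survives as an ordinary string —
# essential here, since "nan" (grandmother) is a valid crossword word.
df_safe = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(df_safe['word'].tolist())             # ['nan', 'rose']
```

Forgetting the flag on any one of the index files would silently turn the word "nan" into a missing value and break the word-to-index lookups built below.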

Step 5: Easy Dataset Construction¶

The easy dataset tests whether a classifier can distinguish real definition–answer pairs from random ones. For each of the 240,211 real (clue, definition, answer) rows, we generate one distractor by keeping the same (clue, definition) and substituting a randomly sampled answer word from the pool of all known answer words (excluding the true answer).

Because random distractors are overwhelmingly unrelated to the definition (e.g., pairing the definition "flower" with the answer "parliament"), we expect classifiers to achieve very high accuracy on this dataset. The cosine similarity between definition and distractor answer embeddings should be much lower than for real pairs, making classification trivial.

This is intentional — the easy dataset serves as a sanity check baseline. It verifies that the feature engineering pipeline and modeling code work correctly before we move on to the harder dataset (Step 7), where distractors are chosen to be semantically similar and the classification task becomes genuinely informative about misdirection.

In [3]:
# ============================================================
# Generate Easy Distractors
# ============================================================
# For each real row, sample one random answer_wn from the pool of all
# known answer words, excluding the true answer for that row.

all_answer_words = df_real['answer_wn'].unique()
print(f'Pool of candidate answer words: {len(all_answer_words):,}')

# Pre-build a set for fast membership checks
all_answer_set = set(all_answer_words)

# Sample one distractor answer per row. We iterate rather than vectorize
# because each row must exclude its own true answer from the pool.
distractor_answers = []
true_answers = df_real['answer_wn'].values

for i in tqdm(range(len(df_real)), desc='Sampling distractors'):
    true_ans = true_answers[i]
    # Sample from the full pool; resample if we hit the true answer.
    # With ~45K candidates, collision probability is ~0.002%, so this
    # almost never loops more than once.
    while True:
        sampled = all_answer_words[rng.randint(len(all_answer_words))]
        if sampled != true_ans:
            break
    distractor_answers.append(sampled)

df_real_copy = df_real.copy()
df_real_copy['distractor_answer_wn'] = distractor_answers

print(f'\nGenerated {len(distractor_answers):,} distractor assignments')
print(f'  Unique distractor answers used: '
      f'{len(set(distractor_answers)):,}')

# Sanity check: no row's distractor is the same as its true answer
assert (df_real_copy['answer_wn'] != df_real_copy['distractor_answer_wn']).all(), \
    'Some distractors match the true answer!'
Pool of candidate answer words: 45,183
Sampling distractors:   0%|          | 0/240211 [00:00<?, ?it/s]
Generated 240,211 distractor assignments
  Unique distractor answers used: 44,974
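The per-row loop above is simple and correct, but slow in pure Python. For reference, a vectorized alternative (a sketch, not what this notebook uses — it draws in a different order, so it would not reproduce the seeded output above) samples all distractors at once and redraws only the rows that collide with their own true answer:

```python
import numpy as np

def sample_distractors(true_answers, pool, rng):
    """Sample one distractor per row from pool, vectorized.

    Rows whose draw collides with their own true answer are redrawn in
    bulk until none remain; with a ~45K-word pool this almost never
    needs more than one extra pass.
    """
    true_answers = np.asarray(true_answers)
    pool = np.asarray(pool)
    draws = pool[rng.randint(len(pool), size=len(true_answers))]
    collisions = draws == true_answers
    while collisions.any():
        draws[collisions] = pool[rng.randint(len(pool), size=collisions.sum())]
        collisions = draws == true_answers
    return draws

# Toy demonstration with hypothetical words
rng_demo = np.random.RandomState(0)
pool_demo = np.array(['alpha', 'beta', 'gamma', 'delta'])
truth = np.array(['alpha', 'alpha', 'beta'])
out = sample_distractors(truth, pool_demo, rng_demo)
```

No draw ever equals its row's true answer, matching the loop's exclusion guarantee.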

Compute Features for Easy Distractors¶

Each distractor row needs all 47 features recomputed for the new (definition, distractor_answer) pair:

  • Cosine features (21): We look up the definition's 3 pre-computed embeddings (allsense, common, obscure) and the distractor answer's 3 embeddings from the .npy arrays, plus the clue-context embedding for that row. The definition and clue-context embeddings are unchanged from the real row (same clue, same definition) — only the answer side changes.

  • Relationship features (22): Computed from WordNet using compute_relationship_features(definition_wn, distractor_answer_wn). These depend only on the word strings, not embeddings.

  • Surface features (4): Computed from the word strings using compute_surface_features(definition_wn, distractor_answer_wn).

All feature functions are imported from scripts/feature_utils.py (Decision 18), ensuring consistency with the NB 03 feature computation.

Efficiency optimization: Relationship and surface features depend only on the (definition_wn, distractor_answer_wn) pair — not on the clue. Many rows may share the same pair (e.g., if the same definition appears in multiple clues and gets the same distractor). We compute these features once per unique pair and merge back, following the same deduplication pattern used in NB 03.

In [4]:
# ============================================================
# Step 1: Compute relationship + surface features per unique pair
# ============================================================
# Deduplicate to unique (definition_wn, distractor_answer_wn) pairs first,
# then compute WordNet relationship and surface features once per pair.
# This avoids redundant WordNet lookups — the same optimization NB 03 uses.

unique_pairs = df_real_copy[['definition_wn', 'distractor_answer_wn']].drop_duplicates()
print(f'Unique (definition_wn, distractor_answer_wn) pairs: '
      f'{len(unique_pairs):,}')
print(f'  (vs {len(df_real_copy):,} total rows — '
      f'{len(df_real_copy) / len(unique_pairs):.1f}x dedup ratio)')

# Compute relationship features (22) and surface features (4) per pair
pair_features = []
for _, row in tqdm(unique_pairs.iterrows(), total=len(unique_pairs),
                   desc='Relationship + surface features'):
    def_w = row['definition_wn']
    ans_w = row['distractor_answer_wn']
    feats = compute_relationship_features(def_w, ans_w)
    feats.update(compute_surface_features(def_w, ans_w))
    feats['definition_wn'] = def_w
    feats['distractor_answer_wn'] = ans_w
    pair_features.append(feats)

pair_feat_df = pd.DataFrame(pair_features)
print(f'\nPair feature table: {pair_feat_df.shape}')

# ============================================================
# Step 2: Compute cosine features per row
# ============================================================
# Cosine features involve the clue-context embedding, which is row-specific
# (different clues have different context embeddings even for the same
# definition). So we compute these per row.

cosine_features_list = []

for idx in tqdm(range(len(df_real_copy)), desc='Cosine features'):
    row = df_real_copy.iloc[idx]
    def_wn = row['definition_wn']
    dist_ans_wn = row['distractor_answer_wn']

    # Look up definition embeddings (unchanged from real row)
    def_row_idx = def_word_to_idx[def_wn]
    def_embs = {
        'allsense': definition_embeddings[def_row_idx, 0, :],
        'common':   definition_embeddings[def_row_idx, 1, :],
        'obscure':  definition_embeddings[def_row_idx, 2, :],
    }

    # Look up distractor answer embeddings
    ans_row_idx = ans_word_to_idx[dist_ans_wn]
    ans_embs = {
        'allsense': answer_embeddings[ans_row_idx, 0, :],
        'common':   answer_embeddings[ans_row_idx, 1, :],
        'obscure':  answer_embeddings[ans_row_idx, 2, :],
    }

    # Look up clue-context embedding (unchanged — same clue, same definition)
    cc_key = (row['clue_id'], row['definition'])
    cc_pos = cc_lookup[cc_key]
    clue_emb = clue_context_embeddings[cc_pos, :]

    # Compute all 21 cosine features (15 context-free + 6 context-informed)
    cos_feats = compute_cosine_features_for_pair(
        def_embs, ans_embs, clue_emb=clue_emb)
    cosine_features_list.append(cos_feats)

cosine_feat_df = pd.DataFrame(cosine_features_list, index=df_real_copy.index)
print(f'\nCosine feature table: {cosine_feat_df.shape}')

# ============================================================
# Step 3: Merge all distractor features together
# ============================================================
# Drop the original 47 feature columns inherited from features_all.parquet.
# These are the real-pair features — we're replacing them with distractor-pair
# features. Without this, the merge below would produce _x/_y suffixed
# duplicates for the 26 relationship + surface columns that appear in both
# df_real_copy and pair_feat_df.
df_real_copy = df_real_copy.drop(
    columns=[c for c in ALL_FEATURE_COLS if c in df_real_copy.columns])

# Merge the per-pair relationship/surface features back to all rows
distractor_df = df_real_copy.merge(
    pair_feat_df,
    on=['definition_wn', 'distractor_answer_wn'],
    how='left'
)

# Add cosine features (already aligned by index)
for col in cosine_feat_df.columns:
    distractor_df[col] = cosine_feat_df[col].values

# ============================================================
# Validation: no NaN in any feature column
# ============================================================
nan_counts = distractor_df[ALL_FEATURE_COLS].isnull().sum()
nan_cols = nan_counts[nan_counts > 0]
assert len(nan_cols) == 0, (
    f'NaN values found in distractor feature columns — violates Decision 3:\n'
    f'{nan_cols}')

print(f'\nDistractor feature computation complete.')
print(f'  {len(distractor_df):,} distractor rows')
print(f'  {len(ALL_FEATURE_COLS)} features validated (no NaN)')
Unique (definition_wn, distractor_answer_wn) pairs: 239,972
  (vs 240,211 total rows — 1.0x dedup ratio)
Relationship + surface features:   0%|          | 0/239972 [00:00<?, ?it/s]
Pair feature table: (239972, 28)
Cosine features:   0%|          | 0/240211 [00:00<?, ?it/s]
Cosine feature table: (240211, 21)

Distractor feature computation complete.
  240,211 distractor rows
  47 features validated (no NaN)

Assemble Easy Dataset¶

We combine the real rows (label=1) and distractor rows (label=0) into a single balanced binary classification dataset. The dataset includes all 47 features, metadata columns for downstream analysis, and a label column for the classification target.

Distractor rows also carry a distractor_source column recording which answer word was substituted (real rows hold NaN there, which makes filtering straightforward). The answer_wn column in distractor rows is set to the distractor answer (not the original true answer), because this is the word whose features were actually computed.

In [5]:
# ============================================================
# Build real-pair rows (label = 1)
# ============================================================
real_rows = df_real[METADATA_COLS + ALL_FEATURE_COLS].copy()
real_rows['label'] = 1
real_rows['distractor_source'] = np.nan  # not a distractor

# ============================================================
# Build distractor rows (label = 0)
# ============================================================
# For distractor rows, answer_wn is the distractor answer (the word whose
# features were computed), and distractor_source records which word it is.
dist_rows = distractor_df[METADATA_COLS + ALL_FEATURE_COLS].copy()

# Overwrite answer_wn with the distractor answer and record the source
dist_rows['answer_wn'] = distractor_df['distractor_answer_wn'].values
dist_rows['label'] = 0
dist_rows['distractor_source'] = distractor_df['distractor_answer_wn'].values

# ============================================================
# Concatenate and shuffle
# ============================================================
dataset_easy = pd.concat([real_rows, dist_rows], ignore_index=True)
dataset_easy = dataset_easy.sample(
    frac=1, random_state=RANDOM_SEED).reset_index(drop=True)

# ============================================================
# Validation
# ============================================================
n_real = (dataset_easy['label'] == 1).sum()
n_dist = (dataset_easy['label'] == 0).sum()
assert n_real == n_dist, (
    f'Dataset is not balanced: {n_real:,} real vs {n_dist:,} distractor')
assert n_real == len(df_real), (
    f'Real row count mismatch: {n_real:,} vs {len(df_real):,}')

# Verify all 47 feature columns present and NaN-free
missing_cols = [c for c in ALL_FEATURE_COLS if c not in dataset_easy.columns]
assert not missing_cols, f'Missing feature columns: {missing_cols}'
assert not dataset_easy[ALL_FEATURE_COLS].isnull().any().any(), (
    'NaN values found in easy dataset features — violates Decision 3')

print(f'Easy dataset assembled:')
print(f'  Total rows:       {len(dataset_easy):,}')
print(f'  Real (label=1):   {n_real:,}')
print(f'  Distractor (0):   {n_dist:,}')
print(f'  Feature columns:  {len(ALL_FEATURE_COLS)}')
print(f'  Total columns:    {len(dataset_easy.columns)}')

# ============================================================
# Feature statistics summary (real vs distractor)
# ============================================================
print(f'\n--- Feature Statistics by Label ---')
print(f'{"Feature":<35s} {"Real mean":>10s} {"Real std":>10s}'
      f' {"Dist mean":>10s} {"Dist std":>10s}')
print(f'{"-"*35} {"-"*10} {"-"*10} {"-"*10} {"-"*10}')
for col in ALL_FEATURE_COLS[:10]:  # show first 10 for brevity
    r = dataset_easy.loc[dataset_easy['label'] == 1, col]
    d = dataset_easy.loc[dataset_easy['label'] == 0, col]
    print(f'{col:<35s} {r.mean():>10.4f} {r.std():>10.4f}'
          f' {d.mean():>10.4f} {d.std():>10.4f}')
print(f'  ... ({len(ALL_FEATURE_COLS) - 10} more features omitted)')

# ============================================================
# Save to parquet
# ============================================================
easy_path = DATA_DIR / 'dataset_easy.parquet'
dataset_easy.to_parquet(easy_path, index=False)
file_size_mb = easy_path.stat().st_size / (1024 * 1024)
print(f'\nSaved to {easy_path}')
print(f'File size: {file_size_mb:.1f} MB')

# Round-trip verification
reloaded = pd.read_parquet(easy_path)
assert reloaded.shape == dataset_easy.shape, (
    f'Shape mismatch after reload: {reloaded.shape} vs {dataset_easy.shape}')
print(f'Round-trip verification passed: '
      f'{reloaded.shape[0]:,} rows × {reloaded.shape[1]} columns')
Easy dataset assembled:
  Total rows:       480,422
  Real (label=1):   240,211
  Distractor (0):   240,211
  Feature columns:  47
  Total columns:    62

--- Feature Statistics by Label ---
Feature                              Real mean   Real std  Dist mean   Dist std
----------------------------------- ---------- ---------- ---------- ----------
cos_w1all_w1common                      0.8638     0.1257     0.8638     0.1257
cos_w1all_w1obscure                     0.8636     0.1168     0.8636     0.1168
cos_w1all_w2all                         0.6482     0.1366     0.4304     0.1189
cos_w1all_w2common                      0.5954     0.1575     0.4045     0.1175
cos_w1all_w2obscure                     0.5940     0.1545     0.4061     0.1179
cos_w1common_w1obscure                  0.6891     0.2331     0.6891     0.2331
cos_w1common_w2all                      0.5686     0.1710     0.3640     0.1194
cos_w1common_w2common                   0.5311     0.1871     0.3435     0.1184
cos_w1common_w2obscure                  0.5171     0.1867     0.3426     0.1192
cos_w1obscure_w2all                     0.5593     0.1640     0.3755     0.1207
  ... (37 more features omitted)

Saved to /Users/victoria/Desktop/MADS/ccc-project/clue_misdirection/data/dataset_easy.parquet
File size: 103.1 MB
Round-trip verification passed: 480,422 rows × 62 columns

Easy Dataset Validation¶

Before moving on to the harder dataset, we verify that the easy dataset has the expected properties: balanced labels, no missing features, and — most importantly — a clear separation between real and distractor pairs on key features. If random distractors don't look obviously different from real pairs, something is wrong with the feature computation.

In [6]:
# ============================================================
# Basic Statistics
# ============================================================
print('=== Easy Dataset Summary ===')
print(f'Total rows:                    {len(dataset_easy):,}')
print(f'Label=1 (real):                {(dataset_easy["label"] == 1).sum():,}')
print(f'Label=0 (distractor):          {(dataset_easy["label"] == 0).sum():,}')
print(f'Unique definition_wn values:   '
      f'{dataset_easy["definition_wn"].nunique():,}')
print(f'Unique answer_wn values:       '
      f'{dataset_easy["answer_wn"].nunique():,}')
print(f'Unique def_answer_pair_id:     '
      f'{dataset_easy["def_answer_pair_id"].nunique():,}')

# ============================================================
# Key Feature Comparison: Real vs Distractor
# ============================================================
# We expect real pairs to have significantly higher cosine similarity
# and WordNet path similarity than random distractors. This confirms
# the "easy" nature of the dataset.

validation_features = [
    'cos_w1all_w2all',       # primary cross-word cosine similarity
    'cos_w1common_w2common', # common-sense cosine
    'wn_max_path_sim',       # WordNet structural similarity
    'wn_shared_synset_count',# direct synonym overlap
    'surface_edit_distance', # orthographic similarity
]

real_mask = dataset_easy['label'] == 1
dist_mask = dataset_easy['label'] == 0

print(f'\n{"Feature":<30s} {"Real":>18s}  {"Distractor":>18s}  {"Gap":>8s}')
print(f'{"-"*30} {"-"*18}  {"-"*18}  {"-"*8}')

for feat in validation_features:
    r_mean = dataset_easy.loc[real_mask, feat].mean()
    r_std  = dataset_easy.loc[real_mask, feat].std()
    d_mean = dataset_easy.loc[dist_mask, feat].mean()
    d_std  = dataset_easy.loc[dist_mask, feat].std()
    gap = r_mean - d_mean
    print(f'{feat:<30s} {r_mean:>7.4f} ± {r_std:<7.4f}  '
          f'{d_mean:>7.4f} ± {d_std:<7.4f}  {gap:>+7.4f}')

# ============================================================
# Confirm distractors are substantially different
# ============================================================
# Real pairs should have notably higher cos_w1all_w2all than distractors.
# If the gap is near zero, feature computation may be incorrect.
cos_gap = (dataset_easy.loc[real_mask, 'cos_w1all_w2all'].mean()
           - dataset_easy.loc[dist_mask, 'cos_w1all_w2all'].mean())
print(f'\ncos_w1all_w2all gap (real - distractor): {cos_gap:+.4f}')
assert cos_gap > 0.05, (
    f'Cosine gap is suspiciously small ({cos_gap:.4f}). '
    f'Random distractors should be much less similar than real pairs.')
print('Validation passed: distractors are substantially less similar '
      'than real pairs, as expected for the easy dataset.')
=== Easy Dataset Summary ===
Total rows:                    480,422
Label=1 (real):                240,211
Label=0 (distractor):          240,211
Unique definition_wn values:   27,356
Unique answer_wn values:       45,183
Unique def_answer_pair_id:     128,961

Feature                                      Real          Distractor       Gap
------------------------------ ------------------  ------------------  --------
cos_w1all_w2all                 0.6482 ± 0.1366    0.4304 ± 0.1189   +0.2178
cos_w1common_w2common           0.5311 ± 0.1871    0.3435 ± 0.1184   +0.1876
wn_max_path_sim                 0.4352 ± 0.2902    0.1477 ± 0.0836   +0.2876
wn_shared_synset_count          0.2081 ± 0.5066    0.0008 ± 0.0617   +0.2073
surface_edit_distance           6.7299 ± 2.2153    7.6067 ± 2.2683   -0.8768

cos_w1all_w2all gap (real - distractor): +0.2178
Validation passed: distractors are substantially less similar than real pairs, as expected for the easy dataset.

Step 7: Harder Dataset Construction¶

The easy dataset uses random distractors, which are trivially separable from real pairs (as confirmed above). The harder dataset tests whether a classifier can still distinguish real from distractor pairs when distractors are selected to be semantically similar to the definition.

Construction (Design Doc Section 7.2, Decision 6): For each real row's definition, we rank all candidate answer words by cosine similarity between the definition's allsense embedding (Word1_allsense, no clue context) and each answer's allsense embedding (Word2_allsense). Distractors are sampled from the top-k most similar answer words, excluding the true answer.

Since real definition–answer pairs average ~0.65 cosine similarity (cos_w1all_w2all), choosing distractors from the top-k ensures they fall in a similar range — making the task genuinely harder than random selection.
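The ranking step described above can be sketched with synthetic embeddings. This is illustrative only — all names here are made up, the real arrays are ~27K × 1024 and ~45K × 1024, and the notebook's actual chunked implementation appears later in this section:

```python
import numpy as np

rng_demo = np.random.RandomState(0)

# Synthetic stand-ins for the definition/answer allsense embedding matrices.
n_defs, n_answers, dim, k = 5, 100, 16, 10
def_vecs = rng_demo.randn(n_defs, dim).astype(np.float32)
ans_vecs = rng_demo.randn(n_answers, dim).astype(np.float32)

# L2-normalize so a plain dot product equals cosine similarity.
def_vecs /= np.linalg.norm(def_vecs, axis=1, keepdims=True)
ans_vecs /= np.linalg.norm(ans_vecs, axis=1, keepdims=True)

sims = def_vecs @ ans_vecs.T                      # (n_defs, n_answers)

# argpartition isolates the k largest columns per row in O(n); sorting
# just those k columns then orders them by descending similarity.
part = np.argpartition(-sims, k, axis=1)[:, :k]
order = np.argsort(-np.take_along_axis(sims, part, axis=1), axis=1)
top_k_idx = np.take_along_axis(part, order, axis=1)
top_k_sim = np.take_along_axis(sims, top_k_idx, axis=1)
```

In the real pipeline, each row's true answer must then be excluded from its candidate list before a distractor is sampled, and k controls how hard the resulting dataset is.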

Critical consequence: Because distractors are selected by cosine similarity between context-free allsense embeddings, the 15 context-free meaning features are artifacts of dataset construction — they directly encode the selection criterion. Including them would let the classifier exploit the construction method rather than learning meaningful patterns. Per Decision 6, these 15 features are removed from the harder dataset.

Remaining features (32):

  • 6 context-informed meaning features (cos_w1clue_*) — these involve the clue-context embedding, which was NOT used in distractor selection
  • 22 WordNet relationship features — structural, not embedding-based
  • 4 surface features — orthographic similarity

This is the feature set for Experiment 2A. Experiment 2B further removes the 6 context-informed features, leaving 26 features (relationship + surface only).

Compute Definition–Answer Allsense Similarity¶

To select harder distractors, we need the cosine similarity between each unique definition's allsense embedding and every unique answer's allsense embedding. This lets us rank all candidate answer words for each definition by their semantic similarity.

Rather than materializing the full ~27K × ~45K similarity matrix (~5 GB in float32), we process definitions in batches and store only the top-K most similar answers per definition. This keeps memory usage manageable while giving us enough candidates to tune the difficulty level.
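The memory trade-off can be checked with quick arithmetic, and the per-row top-K trick demonstrated on a random vector (the counts here are the run's own n_def/n_ans/MAX_K values; the similarity vector is synthetic):

```python
import numpy as np

# Full-matrix footprint: n_def x n_ans float32 values, 4 bytes each
n_def, n_ans, max_k = 27_385, 45_254, 500
full_matrix_gb = n_def * n_ans * 4 / 1024**3
top_k_mb = n_def * max_k * 4 / 1024**2      # one float32 (or int32) array
print(f'Full matrix: {full_matrix_gb:.1f} GB; top-K store: {top_k_mb:.1f} MB per array')

# Per-row top-K: argpartition is O(n); only the K survivors get fully sorted
sims_row = np.random.default_rng(1).random(n_ans).astype(np.float32)
top = np.argpartition(sims_row, -max_k)[-max_k:]   # unordered top-K indices
top = top[np.argsort(sims_row[top])[::-1]]         # now descending by similarity
assert sims_row[top[0]] == sims_row.max()
```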

In [7]:
# ============================================================
# Compute top-K most similar answers for each unique definition
# ============================================================
# We use allsense embeddings (slot [0]) for both definitions and answers,
# matching the "Word1_allsense vs Word2_allsense" specification in Step 7.
# Cosine similarity is computed as a dot product of L2-normalized vectors.

# --- Extract and normalize allsense embeddings ---
def_allsense = definition_embeddings[:, 0, :].astype(np.float32)
ans_allsense = answer_embeddings[:, 0, :].astype(np.float32)

def_norms = np.linalg.norm(def_allsense, axis=1, keepdims=True)
ans_norms = np.linalg.norm(ans_allsense, axis=1, keepdims=True)
def_normed = def_allsense / def_norms
ans_normed = ans_allsense / ans_norms

# --- Map from answer_index row positions to answer_wn strings ---
ans_words_arr = answer_index['word'].values
def_words_arr = definition_index['word'].values

# --- Compute top-K candidates per definition in chunks ---
# We store a generous upper bound (MAX_K=500) so we can tune k downward.
MAX_K = 500

top_k_indices_arr = np.zeros((n_def, MAX_K), dtype=np.int32)
top_k_sims_arr = np.zeros((n_def, MAX_K), dtype=np.float32)

CHUNK_SIZE = 500
n_chunks = (n_def + CHUNK_SIZE - 1) // CHUNK_SIZE

for chunk_start in tqdm(range(0, n_def, CHUNK_SIZE), total=n_chunks,
                        desc='Computing similarity (chunked)'):
    chunk_end = min(chunk_start + CHUNK_SIZE, n_def)
    # Cosine similarity between this chunk of definitions and all answers
    chunk_sims = def_normed[chunk_start:chunk_end] @ ans_normed.T

    for i in range(chunk_end - chunk_start):
        row_sims = chunk_sims[i]
        # argpartition is O(n), much faster than full sort for 45K candidates
        top_idx = np.argpartition(row_sims, -MAX_K)[-MAX_K:]
        sorted_order = np.argsort(row_sims[top_idx])[::-1]
        top_k_indices_arr[chunk_start + i] = top_idx[sorted_order]
        top_k_sims_arr[chunk_start + i] = row_sims[top_idx[sorted_order]]

# --- Build lookup: definition_wn -> row position in top_k arrays ---
def_word_to_topk_row = {w: i for i, w in enumerate(def_words_arr)}

print(f'Similarity computation complete:')
print(f'  Definitions processed: {n_def:,}')
print(f'  Answer candidates:     {n_ans:,}')
print(f'  Top-K stored per def:  {MAX_K}')
print(f'  Equivalent full matrix shape: ({n_def:,}, {n_ans:,})')
print(f'  Storage: {top_k_indices_arr.nbytes / 1024**2:.1f} MB (indices) + '
      f'{top_k_sims_arr.nbytes / 1024**2:.1f} MB (similarities)')

# --- Summary of similarity ranges ---
real_cos_mean = df_real['cos_w1all_w2all'].mean()
print(f'\nSimilarity ranges across definitions (mean of each rank):')
for rank in [1, 10, 50, 100, 200, 500]:
    idx = min(rank - 1, MAX_K - 1)
    mean_val = top_k_sims_arr[:, idx].mean()
    med_val = np.median(top_k_sims_arr[:, idx])
    print(f'  Rank {rank:>4d}: mean={mean_val:.4f}, median={med_val:.4f}')
print(f'\nReal pairs mean cos_w1all_w2all: {real_cos_mean:.4f}')
Similarity computation complete:
  Definitions processed: 27,385
  Answer candidates:     45,254
  Top-K stored per def:  500
  Equivalent full matrix shape: (27,385, 45,254)
  Storage: 52.2 MB (indices) + 52.2 MB (similarities)

Similarity ranges across definitions (mean of each rank):
  Rank    1: mean=0.9613, median=1.0000
  Rank   10: mean=0.8178, median=0.8251
  Rank   50: mean=0.7735, median=0.7825
  Rank  100: mean=0.7523, median=0.7618
  Rank  200: mean=0.7291, median=0.7393
  Rank  500: mean=0.6943, median=0.7049

Real pairs mean cos_w1all_w2all: 0.6482

Select Harder Distractors¶

For each real row's definition, we sample one distractor from the top-k most similar answer words (excluding the true answer). The choice of k controls the difficulty: smaller k means more similar distractors (harder task), larger k allows more diversity.

We first explore a few k values to estimate the resulting distractor similarity, then finalize with the chosen k.

In [8]:
# ============================================================
# Explore different k values
# ============================================================
# For each k, compute the expected mean similarity of a uniformly-sampled
# distractor from the top-k pool. This helps us choose k so that harder
# distractors are meaningfully closer to the definition than random ones.

k_candidates = [20, 50, 100, 200, 500]

print(f'Expected mean distractor similarity by k:')
print(f'  {"k":>6s}  {"Mean(top-k)":>12s}  {"Median(top-k)":>14s}  '
      f'{"Floor (rank k)":>14s}')
print(f'  {"-"*6}  {"-"*12}  {"-"*14}  {"-"*14}')

for k in k_candidates:
    # Mean similarity across all definitions, averaged over the top-k ranks
    mean_topk = top_k_sims_arr[:, :k].mean()
    median_topk = np.median(top_k_sims_arr[:, :k].mean(axis=1))
    floor_val = top_k_sims_arr[:, k-1].mean()
    print(f'  {k:>6d}  {mean_topk:>12.4f}  {median_topk:>14.4f}  '
          f'{floor_val:>14.4f}')

real_cos_mean = df_real['cos_w1all_w2all'].mean()
easy_dist_cos_mean = dataset_easy.loc[
    dataset_easy['label'] == 0, 'cos_w1all_w2all'].mean()
print(f'\nReference points:')
print(f'  Real pairs mean cos_w1all_w2all:          {real_cos_mean:.4f}')
print(f'  Easy (random) distractor mean:            {easy_dist_cos_mean:.4f}')

# ============================================================
# Choose k and sample distractors
# ============================================================
# k=100 gives distractors from the top ~0.2% of the answer pool for each
# definition. These are semantically plausible alternatives that should
# challenge the classifier without being indistinguishable from real pairs.

TOP_K_FINAL = 100
print(f'\n--- Selected TOP_K = {TOP_K_FINAL} ---')

harder_distractor_answers = []
harder_distractor_sims = []
true_answers_arr = df_real['answer_wn'].values
def_wns_arr = df_real['definition_wn'].values

for i in tqdm(range(len(df_real)), desc='Sampling harder distractors'):
    def_wn = def_wns_arr[i]
    true_ans = true_answers_arr[i]

    topk_row = def_word_to_topk_row[def_wn]
    candidate_indices = top_k_indices_arr[topk_row, :TOP_K_FINAL]
    candidate_sims = top_k_sims_arr[topk_row, :TOP_K_FINAL]
    candidate_words = ans_words_arr[candidate_indices]

    # Exclude the true answer from candidates
    valid_mask = candidate_words != true_ans
    valid_words = candidate_words[valid_mask]
    valid_sims = candidate_sims[valid_mask]

    if len(valid_words) == 0:
        # Extremely unlikely fallback: true answer is the only top-K candidate.
        # Fall back to random selection.
        while True:
            sampled = all_answer_words[rng.randint(len(all_answer_words))]
            if sampled != true_ans:
                break
        harder_distractor_answers.append(sampled)
        harder_distractor_sims.append(np.nan)
    else:
        idx = rng.randint(len(valid_words))
        harder_distractor_answers.append(valid_words[idx])
        harder_distractor_sims.append(float(valid_sims[idx]))

harder_distractor_answers = np.array(harder_distractor_answers)
harder_distractor_sims = np.array(harder_distractor_sims)

# ============================================================
# Report statistics
# ============================================================
print(f'\nHarder distractor statistics (k={TOP_K_FINAL}):')
print(f'  Count:              {len(harder_distractor_answers):,}')
print(f'  Unique distractors: {len(set(harder_distractor_answers)):,}')
print(f'  Mean cos sim:       {np.nanmean(harder_distractor_sims):.4f}')
print(f'  Median cos sim:     {np.nanmedian(harder_distractor_sims):.4f}')
print(f'  Std cos sim:        {np.nanstd(harder_distractor_sims):.4f}')
print(f'  NaN (fallback):     {np.isnan(harder_distractor_sims).sum()}')

print(f'\nComparison:')
print(f'  Real pairs mean:         {real_cos_mean:.4f}')
print(f'  Harder distractor mean:  {np.nanmean(harder_distractor_sims):.4f}')
print(f'  Easy distractor mean:    {easy_dist_cos_mean:.4f}')
print(f'  Gap (real - harder):     '
      f'{real_cos_mean - np.nanmean(harder_distractor_sims):+.4f}')
print(f'  Gap (real - easy):       '
      f'{real_cos_mean - easy_dist_cos_mean:+.4f}')

# Sanity: no distractor matches the true answer
assert (harder_distractor_answers != true_answers_arr).all(), \
    'Some harder distractors match the true answer!'
Expected mean distractor similarity by k:
       k   Mean(top-k)   Median(top-k)  Floor (rank k)
  ------  ------------  --------------  --------------
      20        0.8281          0.8348          0.7991
      50        0.8017          0.8094          0.7735
     100        0.7817          0.7902          0.7523
     200        0.7606          0.7694          0.7291
     500        0.7299          0.7399          0.6943

Reference points:
  Real pairs mean cos_w1all_w2all:          0.6482
  Easy (random) distractor mean:            0.4304

--- Selected TOP_K = 100 ---
Harder distractor statistics (k=100):
  Count:              240,211
  Unique distractors: 37,829
  Mean cos sim:       0.7703
  Median cos sim:     0.7749
  Std cos sim:        0.0682
  NaN (fallback):     0

Comparison:
  Real pairs mean:         0.6482
  Harder distractor mean:  0.7703
  Easy distractor mean:    0.4304
  Gap (real - harder):     -0.1221
  Gap (real - easy):       +0.2178

Compute Features for Harder Distractors¶

The harder distractors need the same 47 features as the easy distractors — we compute all of them now and defer the removal of the 15 context-free features to the assembly step. This keeps the computation logic identical between the two datasets and makes it easy to verify feature consistency.

The computation follows the same two-stage approach used for easy distractors (see "Compute Features for Easy Distractors" above):

  1. Relationship + surface features are computed per unique (definition, distractor_answer) pair, then broadcast to all rows sharing that pair. This deduplication avoids redundant WordNet lookups.
  2. Cosine features are computed per row, because the clue-context embedding varies across clues even when the definition word is the same.
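The dedup-then-broadcast pattern in stage 1 can be sketched with a toy frame (the column names match the real pipeline; the pair values are stand-ins, not real WordNet lookups):

```python
import pandas as pd

# Toy rows: two rows share the same (definition, distractor) pair
rows = pd.DataFrame({
    'definition_wn': ['nimble', 'nimble', 'talent'],
    'distractor_answer_wn': ['grains', 'grains', 'bravery'],
})

# Stage 1: compute the expensive pair-level feature once per unique pair...
pairs = rows[['definition_wn', 'distractor_answer_wn']].drop_duplicates()
pairs['wn_max_path_sim'] = [0.2, 0.091]   # stand-in values

# ...then broadcast back to every row sharing that pair via a left merge
out = rows.merge(pairs, on=['definition_wn', 'distractor_answer_wn'], how='left')
print(out)
```

A left merge preserves the row order of the left frame, which is why the per-row cosine features can later be attached positionally.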
In [9]:
# ============================================================
# Compute Features for Harder Distractors
# ============================================================
# Same approach as the easy distractor feature computation above:
# relationship + surface features per unique pair (deduplicated),
# cosine features per row (because clue-context embedding is row-specific).
# We compute all 47 features first, then drop the 15 context-free columns
# in the assembly step.

df_harder_copy = df_real.copy()
df_harder_copy['distractor_answer_wn'] = harder_distractor_answers
df_harder_copy['distractor_sim'] = harder_distractor_sims

# ============================================================
# Step 1: Relationship + surface features per unique pair
# ============================================================
unique_harder_pairs = df_harder_copy[
    ['definition_wn', 'distractor_answer_wn']].drop_duplicates()
print(f'Unique (definition_wn, distractor_answer_wn) pairs: '
      f'{len(unique_harder_pairs):,}')
print(f'  (vs {len(df_harder_copy):,} total rows — '
      f'{len(df_harder_copy) / len(unique_harder_pairs):.1f}x dedup ratio)')

harder_pair_features = []
for _, row in tqdm(unique_harder_pairs.iterrows(),
                   total=len(unique_harder_pairs),
                   desc='Harder: relationship + surface features'):
    def_w = row['definition_wn']
    ans_w = row['distractor_answer_wn']
    feats = compute_relationship_features(def_w, ans_w)
    feats.update(compute_surface_features(def_w, ans_w))
    feats['definition_wn'] = def_w
    feats['distractor_answer_wn'] = ans_w
    harder_pair_features.append(feats)

harder_pair_feat_df = pd.DataFrame(harder_pair_features)
print(f'\nPair feature table: {harder_pair_feat_df.shape}')

# ============================================================
# Step 2: Cosine features per row
# ============================================================
harder_cosine_features_list = []

for idx in tqdm(range(len(df_harder_copy)),
                desc='Harder: cosine features'):
    row = df_harder_copy.iloc[idx]
    def_wn = row['definition_wn']
    dist_ans_wn = row['distractor_answer_wn']

    # Definition embeddings (unchanged from real row)
    def_row_idx = def_word_to_idx[def_wn]
    def_embs = {
        'allsense': definition_embeddings[def_row_idx, 0, :],
        'common':   definition_embeddings[def_row_idx, 1, :],
        'obscure':  definition_embeddings[def_row_idx, 2, :],
    }

    # Distractor answer embeddings
    ans_row_idx = ans_word_to_idx[dist_ans_wn]
    ans_embs = {
        'allsense': answer_embeddings[ans_row_idx, 0, :],
        'common':   answer_embeddings[ans_row_idx, 1, :],
        'obscure':  answer_embeddings[ans_row_idx, 2, :],
    }

    # Clue-context embedding (unchanged — same clue, same definition)
    cc_key = (row['clue_id'], row['definition'])
    cc_pos = cc_lookup[cc_key]
    clue_emb = clue_context_embeddings[cc_pos, :]

    # Compute all 21 cosine features (15 context-free + 6 context-informed)
    cos_feats = compute_cosine_features_for_pair(
        def_embs, ans_embs, clue_emb=clue_emb)
    harder_cosine_features_list.append(cos_feats)

harder_cosine_feat_df = pd.DataFrame(
    harder_cosine_features_list, index=df_harder_copy.index)
print(f'\nCosine feature table: {harder_cosine_feat_df.shape}')

# ============================================================
# Step 3: Merge all harder distractor features
# ============================================================
# Drop original feature columns before merging distractor features
df_harder_copy = df_harder_copy.drop(
    columns=[c for c in ALL_FEATURE_COLS if c in df_harder_copy.columns])

harder_distractor_df = df_harder_copy.merge(
    harder_pair_feat_df,
    on=['definition_wn', 'distractor_answer_wn'],
    how='left'
)

# Add cosine features (aligned by index)
for col in harder_cosine_feat_df.columns:
    harder_distractor_df[col] = harder_cosine_feat_df[col].values

# ============================================================
# Validate: no NaN in any of the 47 feature columns
# ============================================================
nan_counts = harder_distractor_df[ALL_FEATURE_COLS].isnull().sum()
nan_cols = nan_counts[nan_counts > 0]
assert len(nan_cols) == 0, (
    f'NaN found in harder distractor features:\n{nan_cols}')

print(f'\nHarder distractor feature computation complete.')
print(f'  {len(harder_distractor_df):,} distractor rows')
print(f'  {len(ALL_FEATURE_COLS)} features computed (no NaN)')
print(f'  15 context-free features will be dropped in assembly step')
Unique (definition_wn, distractor_answer_wn) pairs: 186,068
  (vs 240,211 total rows — 1.3x dedup ratio)
Pair feature table: (186068, 28)
Cosine feature table: (240211, 21)

Harder distractor feature computation complete.
  240,211 distractor rows
  47 features computed (no NaN)
  15 context-free features will be dropped in assembly step

Assemble Harder Dataset¶

Per Decision 6, the 15 context-free meaning features must be removed from the harder dataset. These features are direct cosine similarities between context-free definition and answer embeddings — the same metric used to select the harder distractors. Including them would let the classifier trivially exploit the construction criterion.

Features removed (15 context-free meaning): All pairwise cosine similarities among w1all, w1common, w1obscure, w2all, w2common, w2obscure — i.e., every cos_* column that does NOT involve w1clue.

Features retained (32):

  • 6 context-informed: cos_w1clue_w1all, cos_w1clue_w1common, cos_w1clue_w1obscure, cos_w1clue_w2all, cos_w1clue_w2common, cos_w1clue_w2obscure
  • 22 relationship: 20 boolean WordNet types + wn_max_path_sim + wn_shared_synset_count
  • 4 surface: surface_edit_distance, surface_length_ratio, surface_shared_first_letter, surface_char_overlap_ratio
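The 15/6 split follows directly from the embedding slots: 15 is the number of unordered pairs among the six context-free embeddings, and 6 is w1clue paired with each of them (21 cosine features in total). A quick check, assuming the column-naming convention used above:

```python
from itertools import combinations

context_free_slots = ['w1all', 'w1common', 'w1obscure',
                      'w2all', 'w2common', 'w2obscure']

# All pairwise similarities among context-free embeddings: C(6, 2) = 15
context_free = [f'cos_{a}_{b}' for a, b in combinations(context_free_slots, 2)]
# w1clue against each of the six: 6 context-informed features
context_informed = [f'cos_w1clue_{s}' for s in context_free_slots]

print(len(context_free), len(context_informed))   # 15 6
assert len(context_free) + len(context_informed) == 21
```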
In [10]:
# ============================================================
# Define the 32 retained feature columns
# ============================================================
HARDER_FEATURE_COLS = (
    CONTEXT_INFORMED_COLS + RELATIONSHIP_COLS + SURFACE_COLS
)
assert len(HARDER_FEATURE_COLS) == 32, (
    f'Expected 32 harder features, got {len(HARDER_FEATURE_COLS)}')

print(f'Feature groups in harder dataset:')
print(f'  Context-informed: {len(CONTEXT_INFORMED_COLS)}')
print(f'  Relationship:     {len(RELATIONSHIP_COLS)}')
print(f'  Surface:          {len(SURFACE_COLS)}')
print(f'  TOTAL:            {len(HARDER_FEATURE_COLS)}')
print(f'  Removed:          {len(CONTEXT_FREE_COLS)} context-free')

# ============================================================
# Build real-pair rows (label = 1) — drop context-free features
# ============================================================
real_harder = df_real[METADATA_COLS + HARDER_FEATURE_COLS].copy()
real_harder['label'] = 1
real_harder['distractor_source'] = np.nan

# ============================================================
# Build distractor rows (label = 0) — drop context-free features
# ============================================================
dist_harder = harder_distractor_df[
    METADATA_COLS + HARDER_FEATURE_COLS].copy()
# Overwrite answer_wn with the distractor answer
dist_harder['answer_wn'] = harder_distractor_df['distractor_answer_wn'].values
dist_harder['label'] = 0
dist_harder['distractor_source'] = (
    harder_distractor_df['distractor_answer_wn'].values)

# ============================================================
# Concatenate and shuffle
# ============================================================
dataset_harder = pd.concat([real_harder, dist_harder], ignore_index=True)
dataset_harder = dataset_harder.sample(
    frac=1, random_state=RANDOM_SEED).reset_index(drop=True)

# ============================================================
# Validation
# ============================================================
n_real_h = (dataset_harder['label'] == 1).sum()
n_dist_h = (dataset_harder['label'] == 0).sum()
assert n_real_h == n_dist_h, (
    f'Dataset not balanced: {n_real_h:,} real vs {n_dist_h:,} distractor')
assert n_real_h == len(df_real), (
    f'Real row count mismatch: {n_real_h:,} vs {len(df_real):,}')

# Verify exactly 32 feature columns, no context-free columns present
feature_cols_present = [c for c in dataset_harder.columns
                        if c in ALL_FEATURE_COLS]
assert len(feature_cols_present) == 32, (
    f'Expected 32 features, found {len(feature_cols_present)}')
context_free_present = [c for c in CONTEXT_FREE_COLS
                        if c in dataset_harder.columns]
assert not context_free_present, (
    f'Context-free features should be removed: {context_free_present}')

# No NaN in retained features
assert not dataset_harder[HARDER_FEATURE_COLS].isnull().any().any(), (
    'NaN values found in harder dataset features')

print(f'\nHarder dataset assembled:')
print(f'  Total rows:       {len(dataset_harder):,}')
print(f'  Real (label=1):   {n_real_h:,}')
print(f'  Distractor (0):   {n_dist_h:,}')
print(f'  Feature columns:  {len(HARDER_FEATURE_COLS)}')
print(f'  Total columns:    {len(dataset_harder.columns)}')

# ============================================================
# Save to parquet
# ============================================================
harder_path = DATA_DIR / 'dataset_harder.parquet'
dataset_harder.to_parquet(harder_path, index=False)
file_size_mb = harder_path.stat().st_size / (1024 * 1024)
print(f'\nSaved to {harder_path}')
print(f'File size: {file_size_mb:.1f} MB')

# Round-trip verification
reloaded_h = pd.read_parquet(harder_path)
assert reloaded_h.shape == dataset_harder.shape, (
    f'Shape mismatch: {reloaded_h.shape} vs {dataset_harder.shape}')
print(f'Round-trip verification passed: '
      f'{reloaded_h.shape[0]:,} rows x {reloaded_h.shape[1]} columns')
Feature groups in harder dataset:
  Context-informed: 6
  Relationship:     22
  Surface:          4
  TOTAL:            32
  Removed:          15 context-free

Harder dataset assembled:
  Total rows:       480,422
  Real (label=1):   240,211
  Distractor (0):   240,211
  Feature columns:  32
  Total columns:    47

Saved to /Users/victoria/Desktop/MADS/ccc-project/clue_misdirection/data/dataset_harder.parquet
File size: 71.2 MB
Round-trip verification passed: 480,422 rows x 47 columns

Harder Dataset Validation¶

We now compare the easy and harder datasets to verify that the cosine-similarity-based distractor selection genuinely makes the classification task harder. Two key diagnostic features help us assess this:

  1. cos_w1clue_w2all (context-informed cosine similarity) — this feature was not used to select distractors, so it provides an independent signal. If harder distractors are truly more confusing, the gap between real and distractor values on this feature should shrink or even flip compared to the easy dataset.
  2. wn_max_path_sim (WordNet max path similarity) — measures how closely the definition and answer are related in WordNet's taxonomy. The gap should narrow for harder distractors because cosine-similar words tend to share WordNet neighborhoods.

We also show example pairs for qualitative spot-checking and a side-by-side summary table comparing the two datasets.

In [11]:
# ============================================================
# Harder Dataset Validation
# ============================================================
# Verify that the harder distractors are genuinely harder to distinguish
# from real pairs compared to the easy dataset's random distractors.

print('=== Harder Dataset Validation ===\n')

real_h_mask = dataset_harder['label'] == 1
dist_h_mask = dataset_harder['label'] == 0

# ============================================================
# 1. Context-informed cosine similarity (retained feature)
# ============================================================
# cos_w1clue_w2all measures how similar the definition (in clue context)
# is to the answer's allsense embedding. This is a context-informed
# feature — NOT used in distractor selection — so it provides an
# independent signal. The gap between real and distractor rows should
# be *smaller* than in the easy dataset, confirming the harder dataset
# is genuinely more challenging.

feat = 'cos_w1clue_w2all'
real_vals = dataset_harder.loc[real_h_mask, feat]
dist_vals = dataset_harder.loc[dist_h_mask, feat]
easy_real_vals = dataset_easy.loc[dataset_easy['label'] == 1, feat]
easy_dist_vals = dataset_easy.loc[dataset_easy['label'] == 0, feat]

print(f'cos_w1clue_w2all (context-informed, retained):')
print(f'  Easy dataset:    real={easy_real_vals.mean():.4f} +/- {easy_real_vals.std():.4f}  '
      f'dist={easy_dist_vals.mean():.4f} +/- {easy_dist_vals.std():.4f}  '
      f'gap={easy_real_vals.mean() - easy_dist_vals.mean():+.4f}')
print(f'  Harder dataset:  real={real_vals.mean():.4f} +/- {real_vals.std():.4f}  '
      f'dist={dist_vals.mean():.4f} +/- {dist_vals.std():.4f}  '
      f'gap={real_vals.mean() - dist_vals.mean():+.4f}')

# ============================================================
# 2. WordNet max path similarity (retained feature)
# ============================================================
feat2 = 'wn_max_path_sim'
real_path = dataset_harder.loc[real_h_mask, feat2]
dist_path = dataset_harder.loc[dist_h_mask, feat2]
easy_real_path = dataset_easy.loc[dataset_easy['label'] == 1, feat2]
easy_dist_path = dataset_easy.loc[dataset_easy['label'] == 0, feat2]

print(f'\nwn_max_path_sim (relationship, retained):')
print(f'  Easy dataset:    real={easy_real_path.mean():.4f} +/- {easy_real_path.std():.4f}  '
      f'dist={easy_dist_path.mean():.4f} +/- {easy_dist_path.std():.4f}  '
      f'gap={easy_real_path.mean() - easy_dist_path.mean():+.4f}')
print(f'  Harder dataset:  real={real_path.mean():.4f} +/- {real_path.std():.4f}  '
      f'dist={dist_path.mean():.4f} +/- {dist_path.std():.4f}  '
      f'gap={real_path.mean() - dist_path.mean():+.4f}')

# ============================================================
# 3. Example distractor pairs
# ============================================================
# Show a few examples to spot-check plausibility. Good harder distractors
# should be semantically related to the definition (not random words).

print(f'\n--- Example Harder Distractor Pairs ---')
print(f'{"Definition":<20s} {"True Answer":<15s} {"Distractor":<15s} '
      f'{"Cos Sim":>7s}  {"PathSim(real)":>13s}  {"PathSim(dist)":>13s}')
print(f'{"-"*20} {"-"*15} {"-"*15} {"-"*7}  {"-"*13}  {"-"*13}')

example_indices = rng.choice(len(df_real), size=10, replace=False)
for ex_idx in example_indices:
    row = df_real.iloc[ex_idx]
    def_wn = row['definition_wn']
    true_ans = row['answer_wn']
    dist_ans = harder_distractor_answers[ex_idx]
    dist_sim = harder_distractor_sims[ex_idx]

    real_path_val = row['wn_max_path_sim']
    dist_feats = compute_relationship_features(def_wn, dist_ans)
    dist_path_val = dist_feats['wn_max_path_sim']

    print(f'{def_wn:<20s} {true_ans:<15s} {dist_ans:<15s} '
          f'{dist_sim:>7.3f}  {real_path_val:>13.3f}  {dist_path_val:>13.3f}')

# ============================================================
# 4. Summary comparison: easy vs harder
# ============================================================
print(f'\n--- Easy vs Harder Dataset Summary ---')
print(f'{"Metric":<40s} {"Easy":>12s} {"Harder":>12s}')
print(f'{"-"*40} {"-"*12} {"-"*12}')
print(f'{"Total rows":<40s} {len(dataset_easy):>12,} {len(dataset_harder):>12,}')
print(f'{"Feature columns":<40s} {len(ALL_FEATURE_COLS):>12} '
      f'{len(HARDER_FEATURE_COLS):>12}')
print(f'{"cos_w1clue_w2all gap (real-dist)":<40s} '
      f'{easy_real_vals.mean() - easy_dist_vals.mean():>+12.4f} '
      f'{real_vals.mean() - dist_vals.mean():>+12.4f}')
print(f'{"wn_max_path_sim gap (real-dist)":<40s} '
      f'{easy_real_path.mean() - easy_dist_path.mean():>+12.4f} '
      f'{real_path.mean() - dist_path.mean():>+12.4f}')
=== Harder Dataset Validation ===

cos_w1clue_w2all (context-informed, retained):
  Easy dataset:    real=0.5446 +/- 0.1544  dist=0.3607 +/- 0.1156  gap=+0.1839
  Harder dataset:  real=0.5446 +/- 0.1544  dist=0.6148 +/- 0.1225  gap=-0.0702

wn_max_path_sim (relationship, retained):
  Easy dataset:    real=0.4352 +/- 0.2902  dist=0.1477 +/- 0.0836  gap=+0.2876
  Harder dataset:  real=0.4352 +/- 0.2902  dist=0.2798 +/- 0.1985  gap=+0.1554

--- Example Harder Distractor Pairs ---
Definition           True Answer     Distractor      Cos Sim  PathSim(real)  PathSim(dist)
-------------------- --------------- --------------- -------  -------------  -------------
culmination          apogee          tailspin          0.761          1.000          0.100
promises             assurances      promise           0.927          0.333          1.000
nimble               agile           grains            0.814          1.000          0.200
chews                champs          grinds            0.737          0.500          0.500
drier                tea_towel       abrasive          0.853          0.100          0.333
titan                prometheus      callisto          0.777          0.500          0.250
weak                 anaemic         weedy             0.793          0.333          0.333
hide                 leather         disgorge          0.769          0.333          0.250
talent               flair           bravery           0.721          0.500          0.091
prevent              block           proscribed        0.682          0.500          0.333

--- Easy vs Harder Dataset Summary ---
Metric                                           Easy       Harder
---------------------------------------- ------------ ------------
Total rows                                    480,422      480,422
Feature columns                                    47           32
cos_w1clue_w2all gap (real-dist)              +0.1839      -0.0702
wn_max_path_sim gap (real-dist)               +0.2876      +0.1554

Summary¶

This notebook implements PLAN.md Steps 5 and 7 — constructing the two balanced binary classification datasets that will be used in the supervised learning experiments.

What was done¶

  • Easy dataset (Step 5): For each of the 240,211 real (clue, definition, answer) rows, generated one distractor by substituting a randomly sampled answer word. Recomputed all 47 features for each distractor pair. Combined real (label=1) and distractor (label=0) rows into a balanced dataset.
  • Harder dataset (Step 7): For each real row, selected a distractor from the top-k most cosine-similar answer words to the definition (Decision 6). Recomputed all 47 features, then removed the 15 context-free meaning features (which would leak the construction criterion), leaving 32 features.

Dataset statistics¶

                                             Easy        Harder
Total rows                                480,422       480,422
Label balance                               50/50         50/50
Feature columns                                47            32 (15 context-free removed)
cos_w1all_w2all gap (real − distractor)    +0.218     (removed)
cos_w1clue_w2all gap (real − distractor)   +0.184        −0.070
wn_max_path_sim gap (real − distractor)    +0.288        +0.155

The negative cos_w1clue_w2all gap in the harder dataset is expected and validates the harder distractor design. It means that distractors are now more similar to definitions than real answers on this context-informed metric — raw cosine similarity alone can no longer distinguish real from distractor. The classifier must rely on subtler signals (WordNet relationships, surface features, and interactions among context-informed features) to succeed.
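A rough Gaussian sketch supports this claim (means and standard deviations taken from the validation output above; synthetic draws, not the real rows): the best single-threshold classifier on cos_w1clue_w2all alone achieves only modest accuracy on the harder dataset.

```python
import numpy as np

sim_rng = np.random.default_rng(42)
n = 100_000
# Gaussian stand-ins for the reported cos_w1clue_w2all distributions (harder set)
real = sim_rng.normal(0.5446, 0.1544, n)   # label 1
dist = sim_rng.normal(0.6148, 0.1225, n)   # label 0

# Distractors are now the MORE similar class, so the rule becomes
# "predict real when similarity is below the threshold"
thresholds = np.linspace(0.2, 1.0, 400)
acc = np.array([((real < t).mean() + (dist >= t).mean()) / 2 for t in thresholds])
best = acc.max()
print(f'best single-threshold accuracy: {best:.3f}')
```

Under this approximation the best threshold lands near 0.60 accuracy, far below what the classifiers achieve on the easy dataset, which is exactly the gap the subtler features must close.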

Output files¶

File                           Description
data/dataset_easy.parquet      480,422 rows × 47 features + metadata + label
data/dataset_harder.parquet    480,422 rows × 32 features + metadata + label

What comes next¶

  • Step 6 (NB 06): Runs Experiments 1A and 1B on the easy dataset — three model families (Logistic Regression, Random Forest, KNN) with full features vs. ablation conditions. This establishes the baseline classification performance with random distractors.
  • Step 8 (NB 07): Runs Experiments 2A and 2B on the harder dataset — same three model families, testing whether classifiers can still distinguish real from distractor when raw similarity no longer works.
In [ ]:
# ============================================================
# Cross-check: Reload and verify both datasets from disk
# ============================================================
# Independent verification that the saved parquet files have the expected
# properties. This catches any corruption during the save/load round-trip.

easy_check = pd.read_parquet(DATA_DIR / 'dataset_easy.parquet')
harder_check = pd.read_parquet(DATA_DIR / 'dataset_harder.parquet')

# --- Easy dataset ---
assert easy_check.shape[0] == 480_422, (
    f'Easy dataset row count: expected 480,422, got {easy_check.shape[0]:,}')
easy_feat_cols = [c for c in easy_check.columns if c in ALL_FEATURE_COLS]
assert len(easy_feat_cols) == 47, (
    f'Easy dataset feature columns: expected 47, got {len(easy_feat_cols)}')
easy_label_counts = easy_check['label'].value_counts()
assert easy_label_counts[1] == easy_label_counts[0], (
    f'Easy dataset not balanced: {easy_label_counts.to_dict()}')

# --- Harder dataset ---
assert harder_check.shape[0] == 480_422, (
    f'Harder dataset row count: expected 480,422, got {harder_check.shape[0]:,}')
# Only the 32 retained features should be present
harder_feat_only = [c for c in harder_check.columns if c in HARDER_FEATURE_COLS]
assert len(harder_feat_only) == 32, (
    f'Harder dataset feature columns: expected 32, got {len(harder_feat_only)}')
harder_label_counts = harder_check['label'].value_counts()
assert harder_label_counts[1] == harder_label_counts[0], (
    f'Harder dataset not balanced: {harder_label_counts.to_dict()}')

# --- No context-free features in harder dataset ---
context_free_in_harder = [c for c in CONTEXT_FREE_COLS
                          if c in harder_check.columns]
assert not context_free_in_harder, (
    f'Context-free features found in harder dataset: {context_free_in_harder}')

print('All cross-checks passed.')
print(f'  Easy:   {easy_check.shape[0]:,} rows, {len(easy_feat_cols)} features, '
      f'balanced ({easy_label_counts[1]:,} per class)')
print(f'  Harder: {harder_check.shape[0]:,} rows, {len(harder_feat_only)} features, '
      f'balanced ({harder_label_counts[1]:,} per class)')