03 — Feature Engineering¶

Primary author: Victoria

Builds on:

  • 02_embedding_generation.ipynb (Victoria — embedding files and index contract)
  • Hans's Hans_Supervised_Learning.ipynb and Hans_Supervised_Learning_Models.ipynb (cosine similarity computation pattern, WordNet relationship features, surface features — expanding from 10 to 47)

Prompt engineering: Victoria
AI assistance: Claude Code (Anthropic)
Environment: Local (CPU only)

Computes all 47 features for each (clue, definition, answer) row. Loads embeddings from Step 2, then computes 15 context-free + 6 context-informed cosine similarities, 22 WordNet relationship features, and 4 surface features. Outputs data/features_all.parquet.

Implements PLAN.md Step 3.

Inputs:

  • data/clues_filtered.csv (Step 1)
  • data/embeddings/ — 3 .npy arrays + 3 index CSVs (Step 2)
  • data/embeddings/clue_context_phrases.csv (Step 2 — provides definition_wn and answer_wn lookup keys)
  • WordNet (via NLTK)

Output:

  • data/features_all.parquet — one row per (clue, definition, answer) with all 47 features plus metadata columns

Imports¶

In [1]:
import warnings

import numpy as np
import pandas as pd
from pathlib import Path

import nltk
from nltk.corpus import wordnet as wn
from sklearn.metrics.pairwise import cosine_similarity

from tqdm.auto import tqdm

warnings.filterwarnings('ignore', category=FutureWarning)

Environment and Paths¶

In [2]:
# --- Environment Auto-Detection ---
# Same pattern as 02_embedding_generation.ipynb: detect Colab, Great Lakes,
# or local and set paths accordingly.
try:
    IS_COLAB = 'google.colab' in str(get_ipython())
except NameError:
    IS_COLAB = False

if IS_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    PROJECT_ROOT = Path('/content/drive/MyDrive/SIADS 692 Milestone II/'
                        'Milestone II - NLP Cryptic Crossword Clues/'
                        'clue_misdirection')
else:
    # Local or Great Lakes: notebook is in clue_misdirection/notebooks/,
    # so parent is the clue_misdirection project root.
    try:
        PROJECT_ROOT = Path(__file__).resolve().parent.parent
    except NameError:
        PROJECT_ROOT = Path.cwd().parent

DATA_DIR = PROJECT_ROOT / 'data'
EMBEDDINGS_DIR = DATA_DIR / 'embeddings'
OUTPUT_DIR = PROJECT_ROOT / 'outputs'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# Download WordNet data — needed for the 22 relationship features computed
# later in this notebook.
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)

print(f'Environment: {"Google Colab" if IS_COLAB else "Local / Great Lakes"}')
print(f'Project root: {PROJECT_ROOT}')
print(f'Data directory: {DATA_DIR}')
print(f'Embeddings directory: {EMBEDDINGS_DIR}')
Environment: Local / Great Lakes
Project root: /Users/victoria/Desktop/MADS/ccc-project/clue_misdirection
Data directory: /Users/victoria/Desktop/MADS/ccc-project/clue_misdirection/data
Embeddings directory: /Users/victoria/Desktop/MADS/ccc-project/clue_misdirection/data/embeddings

Load Data¶

We load two CSV files:

  1. clues_filtered.csv (Step 1) — 241,397 rows with all metadata columns (clue_id, clue, surface, definition, answer, def_answer_pair_id, etc.).

  2. clue_context_phrases.csv (Step 2) — 240,211 rows that survived the Step 2 cleanup (dropping 1,186 rows for duplicate definitions in the surface text and fully unresolvable words). This file provides the definition_wn and answer_wn columns — the WordNet-ready lookup keys needed to find the correct row in the embedding index files.

We inner-merge on (clue_id, definition) to produce a 240,211-row working set that has both the original metadata and the embedding lookup keys. Merging on clue_id alone would be wrong: double-definition clues have multiple rows per clue_id (one per valid definition), so a clue_id-only join would produce a many-to-many cross product. Adding definition as a second key disambiguates these rows.

We also record each row's position in clue_context_phrases.csv as cc_row_position. Since clue_context_phrases.csv and clue_context_embeddings.npy are in identical row order (verified in Step 2), this gives us a direct index into the clue-context embedding array — no need to map through clue_context_index.csv, which only has clue_id and would suffer the same double-definition ambiguity.
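The many-to-many hazard can be seen on a toy example (hypothetical values; clue_id 2 stands in for a double-definition clue):

```python
import pandas as pd

# Toy stand-ins for clues_filtered and clue_context_phrases.
# clue_id 2 is a double-definition clue: two rows, one per definition.
clues = pd.DataFrame({
    'clue_id':    [1, 2, 2],
    'definition': ['acquisitive', 'sack', 'fire'],
    'answer':     ['COVETOUS', 'DISMISS', 'DISMISS'],
})
phrases = pd.DataFrame({
    'clue_id':       [1, 2, 2],
    'definition':    ['acquisitive', 'sack', 'fire'],
    'definition_wn': ['acquisitive', 'sack', 'fire'],
})

# clue_id-only join: the two clue_id-2 rows on each side cross-multiply
# (2 x 2 = 4 rows for that clue), inflating the total from 3 to 5.
bad = clues.merge(phrases[['clue_id', 'definition_wn']], on='clue_id')

# (clue_id, definition) join: each row matches exactly one partner.
good = clues.merge(phrases, on=['clue_id', 'definition'])

print(len(bad), len(good))   # 5 3
```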

In [3]:
# --- Load clues_filtered.csv (Step 1 output) ---
clues_path = DATA_DIR / 'clues_filtered.csv'
assert clues_path.exists(), (
    f'Missing input file: {clues_path}\n'
    f'Run 01_data_cleaning.ipynb first to produce this file.'
)
clues_df = pd.read_csv(clues_path)
print(f'clues_filtered.csv: {len(clues_df):,} rows')

# --- Load clue_context_phrases.csv (Step 2 intermediate) ---
# This file provides definition_wn and answer_wn — the WordNet-ready lookup
# keys that map to the embedding index files. It also identifies which rows
# survived Step 2's cleanup (240,211 of the original 241,397).
# CRITICAL: keep_default_na=False prevents pandas from interpreting the word
# "nan" (grandmother) as NaN — see DATA.md.
cc_phrases_path = EMBEDDINGS_DIR / 'clue_context_phrases.csv'
assert cc_phrases_path.exists(), (
    f'Missing input file: {cc_phrases_path}\n'
    f'Run 02_embedding_generation.ipynb first to produce this file.'
)
cc_phrases = pd.read_csv(cc_phrases_path, keep_default_na=False)
print(f'clue_context_phrases.csv: {len(cc_phrases):,} rows')

# --- Record each row's position in cc_phrases / clue_context_embeddings ---
# clue_context_phrases.csv and clue_context_embeddings.npy are in identical
# row order (verified in Step 2). By recording the row position here, we get
# a direct index into the clue-context embedding array after the merge —
# avoiding the ambiguity of mapping through clue_context_index.csv, which
# only has clue_id and can't disambiguate double-definition clues.
cc_phrases['cc_row_position'] = np.arange(len(cc_phrases))

# --- Merge to get definition_wn and answer_wn onto the clue rows ---
# Inner merge restricts to the 240,211 rows that have embeddings.
# We merge on (clue_id, definition) — NOT clue_id alone — because
# double-definition clues have multiple rows per clue_id. A clue_id-only
# merge would produce a many-to-many cross product for those clues,
# inflating the row count. The 'definition' column appears in both files
# (original-case definition text) and disambiguates which definition each
# row corresponds to.
df = clues_df.merge(
    cc_phrases[['clue_id', 'definition', 'definition_wn', 'answer_wn',
                'def_num_usable_synsets', 'ans_num_usable_synsets',
                'cc_row_position']],
    on=['clue_id', 'definition'],
    how='inner'
)

# Verify the merge produced exactly the expected number of rows — no
# inflation from many-to-many joins and no unexpected drops.
assert len(df) == len(cc_phrases), (
    f'Merge produced {len(df):,} rows, expected {len(cc_phrases):,}. '
    f'This likely means a double-definition clue was not disambiguated '
    f'correctly by the (clue_id, definition) key.')

print(f'\nWorking set after merge: {len(df):,} rows')
print(f'  (dropped {len(clues_df) - len(df):,} rows without embeddings)')
print(f'  Unique (definition, answer) pairs: '
      f'{df["def_answer_pair_id"].nunique():,}')
print(f'\nColumns: {list(df.columns)}')
df.head(3)
clues_filtered.csv: 241,397 rows
clue_context_phrases.csv: 240,211 rows

Working set after merge: 240,211 rows
  (dropped 1,186 rows without embeddings)
  Unique (definition, answer) pairs: 128,961

Columns: ['clue_id', 'clue', 'surface', 'surface_normalized', 'definition', 'answer', 'answer_format', 'num_definitions', 'def_answer_pair_id', 'definition_wn', 'answer_wn', 'def_num_usable_synsets', 'ans_num_usable_synsets', 'cc_row_position']
Out[3]:
clue_id clue surface surface_normalized definition answer answer_format num_definitions def_answer_pair_id definition_wn answer_wn def_num_usable_synsets ans_num_usable_synsets cc_row_position
0 1 Acquisitive chap, as we see it (8) Acquisitive chap, as we see it acquisitive chap as we see it Acquisitive COVETOUS 8 1 0 acquisitive covetous 1 2 0
1 2 Back yard fencing weak and sagging (6) Back yard fencing weak and sagging back yard fencing weak and sagging sagging DROOPY 6 1 1 sagging droopy 3 1 1
2 3 Stripping off uniform, love holding colonel's ... Stripping off uniform, love holding colonel's ... stripping off uniform love holding colonels coat Stripping UNCLOTHING 10 1 2 stripping unclothing 14 3 2

Load Embeddings¶

Load all 6 embedding files from Step 2 (3 .npy arrays + 3 index CSVs). The index CSVs map row positions in the .npy arrays to word strings (definition_wn / answer_wn) or clue_id.
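The keep_default_na=False flag used when reading these index CSVs matters because pandas otherwise converts the literal string "nan" to a missing value. A minimal reproduction of the failure mode, on a hypothetical two-row CSV:

```python
import io
import pandas as pd

# "nan" (grandmother) is a legitimate crossword word, not a missing value.
csv = 'word\nnan\nrose\n'

# Default behavior: the string "nan" is parsed as float NaN — the lookup
# key is silently destroyed.
default = pd.read_csv(io.StringIO(csv))

# With keep_default_na=False, every cell stays a string.
safe = pd.read_csv(io.StringIO(csv), keep_default_na=False)

print(default['word'].isna().sum(), safe['word'].tolist())
```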

In [4]:
# Load embedding arrays and index files.
# CRITICAL: keep_default_na=False on all index CSVs — the word "nan"
# (grandmother) is a valid crossword definition/answer.

definition_embeddings = np.load(EMBEDDINGS_DIR / 'definition_embeddings.npy')
definition_index = pd.read_csv(
    EMBEDDINGS_DIR / 'definition_index.csv', index_col=0,
    keep_default_na=False)

answer_embeddings = np.load(EMBEDDINGS_DIR / 'answer_embeddings.npy')
answer_index = pd.read_csv(
    EMBEDDINGS_DIR / 'answer_index.csv', index_col=0,
    keep_default_na=False)

clue_context_embeddings = np.load(
    EMBEDDINGS_DIR / 'clue_context_embeddings.npy')
clue_context_index = pd.read_csv(
    EMBEDDINGS_DIR / 'clue_context_index.csv', index_col=0,
    keep_default_na=False)

# --- Print shapes and sizes ---
EMBED_DIM = 1024
print(f'{"File":<35s} {"Shape":<25s} {"Memory":>8s}')
print(f'{"-"*35} {"-"*25} {"-"*8}')
for name, arr in [
    ('definition_embeddings.npy', definition_embeddings),
    ('answer_embeddings.npy', answer_embeddings),
    ('clue_context_embeddings.npy', clue_context_embeddings),
]:
    mb = arr.nbytes / 1024**2
    print(f'{name:<35s} {str(arr.shape):<25s} {mb:>6.1f} MB')

total_mb = (definition_embeddings.nbytes + answer_embeddings.nbytes
            + clue_context_embeddings.nbytes) / 1024**2
print(f'\nTotal embedding memory: {total_mb:.1f} MB')
print(f'\nIndex sizes:')
print(f'  definition_index:     {len(definition_index):,} rows')
print(f'  answer_index:         {len(answer_index):,} rows')
print(f'  clue_context_index:   {len(clue_context_index):,} rows')

# --- Shape and consistency assertions ---
n_def = len(definition_index)
n_ans = len(answer_index)
n_cc = len(clue_context_index)

assert definition_embeddings.shape == (n_def, 3, EMBED_DIM), (
    f'definition_embeddings shape mismatch: expected ({n_def}, 3, {EMBED_DIM}), '
    f'got {definition_embeddings.shape}')
assert answer_embeddings.shape == (n_ans, 3, EMBED_DIM), (
    f'answer_embeddings shape mismatch: expected ({n_ans}, 3, {EMBED_DIM}), '
    f'got {answer_embeddings.shape}')
assert clue_context_embeddings.shape == (n_cc, EMBED_DIM), (
    f'clue_context_embeddings shape mismatch: expected ({n_cc}, {EMBED_DIM}), '
    f'got {clue_context_embeddings.shape}')

print(f'\nAll shape assertions passed.')
File                                Shape                       Memory
----------------------------------- ------------------------- --------
definition_embeddings.npy           (27385, 3, 1024)           320.9 MB
answer_embeddings.npy               (45254, 3, 1024)           530.3 MB
clue_context_embeddings.npy         (240211, 1024)             938.3 MB

Total embedding memory: 1789.6 MB

Index sizes:
  definition_index:     27,385 rows
  answer_index:         45,254 rows
  clue_context_index:   240,211 rows

All shape assertions passed.

Build Embedding Matrices¶

For each of the 240,211 rows in our working set, we need 7 embedding vectors (each 1024-dim):

Abbrev      Source       Sense                Slot in .npy array
w1_all      definition   all-sense average    definition_embeddings[idx, 0, :]
w1_common   definition   common synset        definition_embeddings[idx, 1, :]
w1_obscure  definition   obscure synset       definition_embeddings[idx, 2, :]
w2_all      answer       all-sense average    answer_embeddings[idx, 0, :]
w2_common   answer       common synset        answer_embeddings[idx, 1, :]
w2_obscure  answer       obscure synset       answer_embeddings[idx, 2, :]
w1_clue     definition   in clue context      clue_context_embeddings[idx, :]

Rather than looping row-by-row, we build full (N, 1024) matrices for each embedding type using bulk index lookups, then fancy-index into the .npy arrays. For definition and answer embeddings, we map definition_wn / answer_wn strings to their row positions in the index files. For clue-context embeddings, we use the cc_row_position column recorded during the merge — this is a direct positional index into the .npy array and correctly handles double-definition clues (where multiple rows share the same clue_id but have different clue-context embeddings).
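The bulk-lookup pattern can be sketched on toy arrays (names, words, and shapes below are illustrative, not the real files):

```python
import numpy as np
import pandas as pd

# Toy embedding store: 3 words x 3 sense slots x 4 dims.
emb = np.arange(3 * 3 * 4, dtype=float).reshape(3, 3, 4)
index = pd.DataFrame({'word': ['cat', 'dog', 'eel']})   # row i -> word

# word -> row position, as in def_word_to_idx below.
word_to_idx = pd.Series(index.index, index=index['word'])

# The working set references words in arbitrary order, with repeats.
keys = pd.Series(['dog', 'cat', 'dog'])
rows = keys.map(word_to_idx).astype(int).values          # [1, 0, 1]

# One fancy-index pulls the slot-0 vector for every row at once.
w_all = emb[rows, 0, :]                                  # shape (3, 4)
print(rows, w_all.shape)
```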

In [5]:
# --- Build word → row-position mappings for O(1) lookup ---
# definition_index and answer_index have integer row indices (0, 1, 2, ...)
# and a 'word' column. We create a Series mapping word string → row position.
def_word_to_idx = pd.Series(
    definition_index.index, index=definition_index['word'])
ans_word_to_idx = pd.Series(
    answer_index.index, index=answer_index['word'])

# --- Map each row's lookup key to its position in the embedding arrays ---
# .map() returns NaN for any key not found — we assert none are missing.
def_indices = df['definition_wn'].map(def_word_to_idx)
assert def_indices.notna().all(), (
    f'{def_indices.isna().sum()} definition_wn values not found in '
    f'definition_index. Examples: '
    f'{df.loc[def_indices.isna(), "definition_wn"].head().tolist()}')
def_indices = def_indices.astype(int).values

ans_indices = df['answer_wn'].map(ans_word_to_idx)
assert ans_indices.notna().all(), (
    f'{ans_indices.isna().sum()} answer_wn values not found in '
    f'answer_index. Examples: '
    f'{df.loc[ans_indices.isna(), "answer_wn"].head().tolist()}')
ans_indices = ans_indices.astype(int).values

# For clue-context embeddings, we use cc_row_position directly. This column
# was set to np.arange(len(cc_phrases)) before the merge, giving each row
# its position in clue_context_embeddings.npy. Unlike clue_context_index.csv
# (which only has clue_id), cc_row_position correctly handles double-definition
# clues where multiple rows share the same clue_id but have different
# clue-context embeddings.
cc_indices = df['cc_row_position'].values

# --- Fancy-index into embedding arrays to build (N, 1024) matrices ---
# Each matrix has one row per clue in our working set.
N = len(df)
print(f'Building 7 embedding matrices of shape ({N:,}, {EMBED_DIM})...')

w1_all     = definition_embeddings[def_indices, 0, :]   # (N, 1024)
w1_common  = definition_embeddings[def_indices, 1, :]   # (N, 1024)
w1_obscure = definition_embeddings[def_indices, 2, :]   # (N, 1024)
w2_all     = answer_embeddings[ans_indices, 0, :]       # (N, 1024)
w2_common  = answer_embeddings[ans_indices, 1, :]       # (N, 1024)
w2_obscure = answer_embeddings[ans_indices, 2, :]       # (N, 1024)
w1_clue    = clue_context_embeddings[cc_indices, :]     # (N, 1024)

print(f'Done. Each matrix: {w1_all.shape}')
print(f'Total memory for 7 matrices: '
      f'{7 * w1_all.nbytes / 1024**2:.1f} MB')
Building 7 embedding matrices of shape (240,211, 1024)...
Done. Each matrix: (240211, 1024)
Total memory for 7 matrices: 6568.3 MB

Cosine Similarity Features (21 total)¶

We compute 21 pairwise cosine similarities organized into two groups:

Context-Free Meaning (15): All ${6 \choose 2} = 15$ pairwise cosine similarities among the 6 context-free embeddings (3 definition × 3 answer cross-word pairs = 9, plus 3 within-definition and 3 within-answer pairs). These capture the semantic relationship between definition and answer senses without any influence from the clue's surface reading.

Context-Informed Meaning (6): Cosine similarity between word1_clue_context (the definition embedded within the clue surface using CALE's <t></t> delimiters) and each of the 6 context-free embeddings. These capture how the clue's surface reading shifts the definition's embedding — the core signal for misdirection.

We use row-wise dot products on L2-normalized vectors rather than sklearn.metrics.pairwise.cosine_similarity on individual pairs. This computes all N similarities for a given pair of embedding types in one vectorized operation, which is orders of magnitude faster than looping.
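The equivalence behind this shortcut can be sanity-checked on random data: the row-wise dot product of L2-normalized rows matches a naive per-pair cosine loop (illustrative shapes only):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 8))
B = rng.normal(size=(5, 8))

# Vectorized: normalize each row, then take row-wise dot products.
fast = np.sum(
    (A / np.linalg.norm(A, axis=1, keepdims=True))
    * (B / np.linalg.norm(B, axis=1, keepdims=True)), axis=1)

# Naive per-pair loop for comparison.
slow = np.array([
    a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    for a, b in zip(A, B)])

assert np.allclose(fast, slow)
```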

In [6]:
def rowwise_cosine(A, B):
    """Compute row-wise cosine similarity between corresponding rows of A and B.

    Parameters
    ----------
    A, B : np.ndarray, shape (N, D)
        Two matrices with the same shape.

    Returns
    -------
    np.ndarray, shape (N,)
        Cosine similarity for each row pair: cos(A[i], B[i]).
    """
    # L2-normalize each row, then take the row-wise dot product.
    A_norm = np.linalg.norm(A, axis=1, keepdims=True)
    B_norm = np.linalg.norm(B, axis=1, keepdims=True)
    return np.sum((A / A_norm) * (B / B_norm), axis=1)


# --- Context-Free Meaning: 15 features ---
# Cross-word pairs (definition × answer): 3 × 3 = 9
df['cos_w1all_w2all']       = rowwise_cosine(w1_all, w2_all)
df['cos_w1all_w2common']    = rowwise_cosine(w1_all, w2_common)
df['cos_w1all_w2obscure']   = rowwise_cosine(w1_all, w2_obscure)
df['cos_w1common_w2all']    = rowwise_cosine(w1_common, w2_all)
df['cos_w1common_w2common'] = rowwise_cosine(w1_common, w2_common)
df['cos_w1common_w2obscure']= rowwise_cosine(w1_common, w2_obscure)
df['cos_w1obscure_w2all']   = rowwise_cosine(w1_obscure, w2_all)
df['cos_w1obscure_w2common']= rowwise_cosine(w1_obscure, w2_common)
df['cos_w1obscure_w2obscure'] = rowwise_cosine(w1_obscure, w2_obscure)

# Within-definition pairs: C(3,2) = 3
df['cos_w1all_w1common']    = rowwise_cosine(w1_all, w1_common)
df['cos_w1all_w1obscure']   = rowwise_cosine(w1_all, w1_obscure)
df['cos_w1common_w1obscure']= rowwise_cosine(w1_common, w1_obscure)

# Within-answer pairs: C(3,2) = 3
df['cos_w2all_w2common']    = rowwise_cosine(w2_all, w2_common)
df['cos_w2all_w2obscure']   = rowwise_cosine(w2_all, w2_obscure)
df['cos_w2common_w2obscure']= rowwise_cosine(w2_common, w2_obscure)

# --- Context-Informed Meaning: 6 features ---
# word1_clue_context paired with each of the 6 context-free embeddings.
df['cos_w1clue_w1all']      = rowwise_cosine(w1_clue, w1_all)
df['cos_w1clue_w1common']   = rowwise_cosine(w1_clue, w1_common)
df['cos_w1clue_w1obscure']  = rowwise_cosine(w1_clue, w1_obscure)
df['cos_w1clue_w2all']      = rowwise_cosine(w1_clue, w2_all)
df['cos_w1clue_w2common']   = rowwise_cosine(w1_clue, w2_common)
df['cos_w1clue_w2obscure']  = rowwise_cosine(w1_clue, w2_obscure)

print('Computed 21 cosine similarity features.')
Computed 21 cosine similarity features.

Verification: Cosine Similarity Features¶

In [7]:
# --- Identify the 21 cosine columns ---
cos_cols = [c for c in df.columns if c.startswith('cos_')]

# Assert exactly 21 cosine similarity columns
assert len(cos_cols) == 21, (
    f'Expected 21 cosine similarity columns, found {len(cos_cols)}: {cos_cols}')

# Assert no NaN values in any cosine column (Decision 3: no NaN features).
# Cosine similarities are guaranteed non-NaN as long as embeddings are non-zero,
# which we verified in Step 2.
nan_counts = df[cos_cols].isnull().sum()
assert (nan_counts == 0).all(), (
    f'NaN values found in cosine columns:\n'
    f'{nan_counts[nan_counts > 0]}')

print(f'Verification passed: 21 cosine columns, 0 NaN values.')
print(f'Rows: {len(df):,}\n')

# --- Descriptive statistics ---
# Context-free features (15): cross-word similarities tend to be moderate
# (definition and answer are semantically related but not identical), while
# within-word similarities tend to be high (different senses of the same word
# still share much of the embedding space).
# Context-informed features (6): the clue context shifts the definition
# embedding — the degree of shift is the misdirection signal.
print('--- Descriptive Statistics (21 Cosine Similarity Features) ---\n')
stats = df[cos_cols].describe().T[['mean', 'std', 'min', 'max']]
# Use wider display so columns don't wrap
with pd.option_context('display.float_format', '{:.4f}'.format,
                       'display.max_colwidth', 30):
    print(stats.to_string())

print(f'\n--- First 5 Rows (Cosine Similarity Features) ---\n')
with pd.option_context('display.float_format', '{:.4f}'.format,
                       'display.max_columns', None):
    display(df[cos_cols].head())
Verification passed: 21 cosine columns, 0 NaN values.
Rows: 240,211

--- Descriptive Statistics (21 Cosine Similarity Features) ---

                          mean    std     min    max
cos_w1all_w2all         0.6482 0.1366  0.0909 1.0000
cos_w1all_w2common      0.5954 0.1575  0.0180 0.9689
cos_w1all_w2obscure     0.5940 0.1545  0.0148 0.9689
cos_w1common_w2all      0.5686 0.1710  0.0085 0.9689
cos_w1common_w2common   0.5311 0.1871 -0.0402 1.0000
cos_w1common_w2obscure  0.5171 0.1867 -0.0420 0.9768
cos_w1obscure_w2all     0.5593 0.1640  0.0128 0.9689
cos_w1obscure_w2common  0.5096 0.1801 -0.0444 0.9768
cos_w1obscure_w2obscure 0.5171 0.1754 -0.0407 1.0000
cos_w1all_w1common      0.8638 0.1257  0.2143 1.0000
cos_w1all_w1obscure     0.8636 0.1168  0.3153 1.0000
cos_w1common_w1obscure  0.6891 0.2331 -0.0262 1.0000
cos_w2all_w2common      0.9150 0.1021  0.2143 1.0000
cos_w2all_w2obscure     0.9145 0.0982  0.3153 1.0000
cos_w2common_w2obscure  0.7798 0.2194 -0.0262 1.0000
cos_w1clue_w1all        0.7453 0.1171  0.1011 0.9772
cos_w1clue_w1common     0.6721 0.1780 -0.0199 0.9748
cos_w1clue_w1obscure    0.6374 0.1794 -0.0490 0.9763
cos_w1clue_w2all        0.5446 0.1544 -0.0028 0.9404
cos_w1clue_w2common     0.5019 0.1695 -0.0382 0.9427
cos_w1clue_w2obscure    0.4987 0.1676 -0.0545 0.9441

--- First 5 Rows (Cosine Similarity Features) ---

cos_w1all_w2all cos_w1all_w2common cos_w1all_w2obscure cos_w1common_w2all cos_w1common_w2common cos_w1common_w2obscure cos_w1obscure_w2all cos_w1obscure_w2common cos_w1obscure_w2obscure cos_w1all_w1common cos_w1all_w1obscure cos_w1common_w1obscure cos_w2all_w2common cos_w2all_w2obscure cos_w2common_w2obscure cos_w1clue_w1all cos_w1clue_w1common cos_w1clue_w1obscure cos_w1clue_w2all cos_w1clue_w2common cos_w1clue_w2obscure
0 0.5553 0.5385 0.5349 0.5553 0.5385 0.5349 0.5553 0.5385 0.5349 1.0000 1.0000 1.0000 0.9660 0.9669 0.8680 0.8058 0.8058 0.8058 0.5530 0.5572 0.5120
1 0.7398 0.7398 0.7398 0.7110 0.7110 0.7110 0.7100 0.7100 0.7100 0.9444 0.9177 0.8282 1.0000 1.0000 1.0000 0.6649 0.6840 0.6812 0.5610 0.5610 0.5610
2 0.8343 0.6033 0.7967 0.7282 0.5366 0.6508 0.7749 0.5748 0.7374 0.9136 0.8823 0.7993 0.8615 0.9469 0.7181 0.6194 0.5856 0.6176 0.4217 0.2459 0.3888
3 0.3062 0.2764 0.3207 0.3062 0.2764 0.3207 0.3062 0.2764 0.3207 1.0000 1.0000 1.0000 0.9747 0.9738 0.8985 0.8256 0.8256 0.8256 0.3337 0.2933 0.3575
4 0.7116 0.7116 0.7116 0.6720 0.6720 0.6720 0.5386 0.5386 0.5386 0.9113 0.7305 0.4987 1.0000 1.0000 1.0000 0.8402 0.8220 0.5556 0.6772 0.6772 0.6772

WordNet Relationship Features (22 total)¶

WordNet relationship features capture how the definition and answer words are connected in the WordNet semantic hierarchy. Unlike cosine similarity features (which measure embedding-space proximity), these features capture discrete structural relationships — e.g., whether the answer is a hypernym (more general category) or hyponym (more specific instance) of the definition.

Cryptic crossword setters often exploit these relationships: the definition might be a hypernym of the answer (e.g., "flower" → ROSE), or the two might share a two-hop path (e.g., "plant" and "aster" are both hyponyms of "organism" — a co-hyponymy pattern captured by hyponym_of_hypernym). By encoding which relationship paths exist, we give the classifier structural information about the type of semantic connection — not just whether one exists.
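The two-hop traversal idea can be sketched without WordNet at all, using plain dicts as a toy relationship graph (hypothetical words; the real implementation below operates on NLTK synsets):

```python
# Toy relationship graph — NOT real WordNet data.
hypernyms = {'rose': ['flower'], 'tulip': ['flower'], 'flower': ['plant']}
hyponyms  = {'flower': ['rose', 'tulip'], 'plant': ['flower']}

def reachable(start, target, hops):
    """Follow a sequence of relationship maps from start; check for target."""
    current = {start}
    for rel in hops:
        current = {n for s in current for n in rel.get(s, [])}
        if not current:
            return False
    return target in current

# Co-hyponymy ("hyponym_of_hypernym"): rose -> hypernyms -> {flower}
# -> hyponyms -> {rose, tulip}, which contains tulip.
print(reachable('rose', 'tulip', [hypernyms, hyponyms]))   # True
```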

Feature breakdown:

  • 20 boolean features (wn_rel_*): whether the answer is reachable from the definition via specific one-hop or two-hop WordNet relationships. Covers all relationship types from Table 3 of the design doc: synonym, hypernym, hyponym, meronyms, holonyms, and their two-hop compounds. The original plan estimated 19 types; we implement 20 to include hypernym_of_hyponym, which also appears in the data.
  • wn_max_path_sim (float): maximum path similarity across all definition–answer synset pairs. Uses path_similarity() for consistency with Hans's preliminary findings, where wn_path_sim was the single most predictive feature (removing it alone dropped accuracy by 3.4%). Default 0.0 if no synsets connect.
  • wn_shared_synset_count (int): number of synsets shared by both definition and answer words (both appear as lemmas in the same synset). Default 0.

Missing value handling (Decision 3): Pairs with no WordNet connection get False for all 20 booleans, 0.0 for path similarity, 0 for shared synset count. No NaN values are ever produced.

Optimization: Many rows share the same (definition_wn, answer_wn) pair — different clues for the same definition–answer combination. Since relationship features depend only on the word pair (not the clue text), we compute them once per unique pair and merge back, avoiding redundant WordNet traversals.

Implementation note (Decision 18): All functions are standalone so they can later be extracted into scripts/feature_utils.py for reuse when computing features for distractor pairs in Steps 5 and 7.
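The once-per-unique-pair optimization follows a standard deduplicate-then-merge pattern. A minimal sketch with a hypothetical stand-in feature function (any function that depends only on the word pair works the same way):

```python
import pandas as pd

df = pd.DataFrame({
    'clue_id':       [1, 2, 3, 4],
    'definition_wn': ['flower', 'flower', 'dog', 'flower'],
    'answer_wn':     ['rose', 'rose', 'dismiss', 'rose'],
})

# Hypothetical stand-in for compute_wordnet_relationship_features.
def pair_features(def_word, ans_word):
    return {'same_first_letter': def_word[0] == ans_word[0]}

# Compute once per unique (definition_wn, answer_wn) pair...
pairs = (df[['definition_wn', 'answer_wn']]
         .drop_duplicates().reset_index(drop=True))
feats = pd.DataFrame([pair_features(d, a) for d, a in
                      zip(pairs['definition_wn'], pairs['answer_wn'])])
pair_feats = pd.concat([pairs, feats], axis=1)

# ...then merge back: every row sharing a pair gets the same values.
out = df.merge(pair_feats, on=['definition_wn', 'answer_wn'], how='left')
print(len(pair_feats), len(out))   # 2 4
```

Here the feature function runs twice (once per unique pair) instead of four times; on 240,211 rows with ~129k unique pairs the saving is substantial.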

In [8]:
# ============================================================
# WordNet Relationship Feature Functions
# ============================================================
# These functions are standalone and can be extracted into
# scripts/feature_utils.py for reuse in distractor feature
# computation (Steps 5 and 7). See Decision 18.
# ============================================================


def get_wordnet_synsets(word):
    """Look up all WordNet synsets for a word, handling multi-word entries.

    Tries the word as-is first (which works for both single-word and
    underscore-separated multi-word entries like "ice_cream"). If no
    synsets are found and the word contains spaces, retries with spaces
    converted to underscores — WordNet's convention for multi-word entries.

    Parameters
    ----------
    word : str
        The word to look up (from the definition_wn or answer_wn column).

    Returns
    -------
    list of nltk.corpus.reader.wordnet.Synset
        All synsets found for the word. Empty list if no synsets exist.
    """
    synsets = wn.synsets(word)
    if not synsets and ' ' in word:
        synsets = wn.synsets(word.replace(' ', '_'))
    return synsets


def check_synonym(def_synsets, ans_word):
    """Check if the answer word is a synonym of the definition in WordNet.

    Two words are WordNet synonyms if they share at least one synset — i.e.,
    the answer word appears as a lemma in some synset of the definition word.
    This is a lemma-level check, not a synset-graph traversal: we look at
    the set of lemma names within each definition synset and check whether
    the answer word is among them.

    Parameters
    ----------
    def_synsets : list of Synset
        All synsets for the definition word.
    ans_word : str
        The answer word (lowercase, underscored for multi-word; from the
        answer_wn column).

    Returns
    -------
    bool
        True if the answer appears as a lemma in any definition synset.
    """
    for syn in def_synsets:
        lemma_names = {lemma.name().lower() for lemma in syn.lemmas()}
        if ans_word in lemma_names:
            return True
    return False


def check_synset_reachable(def_synsets, ans_synsets_set, hops):
    """Check if any answer synset is reachable from definition synsets via
    a sequence of WordNet relationship hops.

    For single-hop relationships (e.g., "hyponym"), ``hops`` contains one
    method name — we follow that relationship from each definition synset and
    check if any answer synset appears among the targets. For compound
    two-hop relationships (e.g., "hyponym_of_hypernym"), ``hops`` contains
    two method names: the first hop follows the second word of the compound
    name (hypernyms), then the second hop follows the first word (hyponyms).
    This right-to-left reading matches the English semantics: "hyponym OF
    hypernym" = take hypernyms first, then take hyponyms of those.

    Parameters
    ----------
    def_synsets : list of Synset
        Starting synsets (from the definition word).
    ans_synsets_set : set of Synset
        Target synsets (from the answer word).
    hops : list of str
        Sequence of WordNet synset method names to follow. Each must be
        a valid method on nltk.corpus.reader.wordnet.Synset (e.g.,
        'hypernyms', 'hyponyms', 'part_meronyms').

    Returns
    -------
    bool
        True if any answer synset is reachable via the specified path.
    """
    current = set(def_synsets)
    for method_name in hops:
        next_level = set()
        for synset in current:
            next_level.update(getattr(synset, method_name)())
        current = next_level
        if not current:
            return False
    return bool(current & ans_synsets_set)


def compute_max_path_similarity(def_synsets, ans_synsets):
    """Compute the maximum path similarity across all definition-answer
    synset pairs.

    Path similarity (Rada et al., 1989) measures the inverse of the shortest
    path length between two synsets in the WordNet hypernym/hyponym hierarchy,
    normalized to [0, 1]. We use path_similarity (not wup_similarity) for
    consistency with Hans's preliminary classification experiments, where
    wn_path_sim was the single most predictive feature.

    Parameters
    ----------
    def_synsets : list of Synset
        All synsets for the definition word.
    ans_synsets : list of Synset
        All synsets for the answer word.

    Returns
    -------
    float
        Maximum path similarity across all synset pairs. Returns 0.0 if
        no synset pair connects or if either word has no synsets.
    """
    max_sim = 0.0
    for ds in def_synsets:
        for as_ in ans_synsets:
            sim = ds.path_similarity(as_)
            if sim is not None and sim > max_sim:
                max_sim = sim
    return max_sim


def compute_shared_synset_count(def_synsets, ans_synsets):
    """Count synsets that contain both the definition and answer as lemmas.

    A shared synset means both words can express the same concept — they are
    direct synonyms for that particular meaning. This is related to but
    distinct from the synonym boolean: check_synonym looks at lemma names
    within synsets, while this counts the actual synset overlap. In practice
    these are equivalent, but shared_synset_count captures *how many* senses
    overlap, not just whether any do.

    Parameters
    ----------
    def_synsets : list of Synset
        All synsets for the definition word.
    ans_synsets : list of Synset
        All synsets for the answer word.

    Returns
    -------
    int
        Number of synsets appearing in both lists. Returns 0 if no overlap.
    """
    return len(set(def_synsets) & set(ans_synsets))


# --- Relationship type definitions ---
# Maps each relationship type name to the sequence of WordNet synset methods
# needed to check reachability. For compound types "X_of_Y", the first hop
# follows Y and the second hop follows X (read right-to-left). For example,
# "hyponym_of_hypernym" means: from definition synsets, follow hypernyms()
# (first hop), then follow hyponyms() of those results (second hop), and
# check if any answer synset appears in the final set.
#
# These 19 types (plus synonym handled separately = 20 total) cover all
# relationship types from Table 3 of the design doc.
RELATIONSHIP_HOPS = {
    # --- Single-hop relationships (6) ---
    # Direct taxonomic and part-whole relationships.
    'hyponym':           ['hyponyms'],
    'hypernym':          ['hypernyms'],
    'part_holonym':      ['part_holonyms'],
    'part_meronym':      ['part_meronyms'],
    'substance_meronym': ['substance_meronyms'],
    'member_meronym':    ['member_meronyms'],

    # --- Two-hop relationships (13) ---
    # Taxonomic two-hop: moving up/down the is-a hierarchy twice.
    'hyponym_of_hypernym':  ['hypernyms', 'hyponyms'],   # co-hyponymy
    'hypernym_of_hyponym':  ['hyponyms', 'hypernyms'],   # co-hypernymy
    'hyponym_of_hyponym':   ['hyponyms', 'hyponyms'],    # grandchild
    'hypernym_of_hypernym': ['hypernyms', 'hypernyms'],  # grandparent

    # Mixed taxonomic + part-whole: one taxonomic hop + one part-whole hop.
    'part_holonym_of_hyponym':      ['hyponyms', 'part_holonyms'],
    'hyponym_of_part_holonym':      ['part_holonyms', 'hyponyms'],
    'substance_meronym_of_hyponym': ['hyponyms', 'substance_meronyms'],
    'part_meronym_of_hyponym':      ['hyponyms', 'part_meronyms'],
    'hyponym_of_part_meronym':      ['part_meronyms', 'hyponyms'],
    'part_meronym_of_hypernym':     ['hypernyms', 'part_meronyms'],
    'part_holonym_of_hypernym':     ['hypernyms', 'part_holonyms'],

    # Part-whole two-hop: two part-whole hops.
    'part_holonym_of_part_meronym':     ['part_meronyms', 'part_holonyms'],
    'member_meronym_of_member_holonym': ['member_holonyms', 'member_meronyms'],
}


def compute_wordnet_relationship_features(def_word, ans_word):
    """Compute all WordNet relationship features for a (definition, answer) pair.

    This is the master function that combines all relationship checks into
    a single feature dictionary. It is designed to be called once per unique
    (definition_wn, answer_wn) pair; the results are then merged back to all
    rows sharing that pair.

    Parameters
    ----------
    def_word : str
        The definition word (lowercase, underscored for multi-word; from the
        definition_wn column).
    ans_word : str
        The answer word (lowercase, underscored for multi-word; from the
        answer_wn column).

    Returns
    -------
    dict
        Keys are feature column names, values are feature values:
        - 20 boolean features (int 0/1): wn_rel_synonym, wn_rel_hyponym, ...
        - wn_max_path_sim (float): max path similarity, default 0.0
        - wn_shared_synset_count (int): shared synset count, default 0
    """
    def_synsets = get_wordnet_synsets(def_word)
    ans_synsets = get_wordnet_synsets(ans_word)
    ans_synsets_set = set(ans_synsets)

    features = {}

    # --- Synonym (lemma-based, one hop) ---
    features['wn_rel_synonym'] = int(check_synonym(def_synsets, ans_word))

    # --- Synset-based relationship checks (one-hop and two-hop) ---
    for rel_name, hops in RELATIONSHIP_HOPS.items():
        features[f'wn_rel_{rel_name}'] = int(
            check_synset_reachable(def_synsets, ans_synsets_set, hops)
        )

    # --- Max path similarity ---
    features['wn_max_path_sim'] = compute_max_path_similarity(
        def_synsets, ans_synsets)

    # --- Shared synset count ---
    features['wn_shared_synset_count'] = compute_shared_synset_count(
        def_synsets, ans_synsets)

    return features


# Quick sanity check: verify the function produces the expected keys
# and sensible values for a known word pair.
_test = compute_wordnet_relationship_features('flower', 'rose')
_expected_keys = (
    ['wn_rel_synonym']
    + [f'wn_rel_{r}' for r in RELATIONSHIP_HOPS]
    + ['wn_max_path_sim', 'wn_shared_synset_count']
)
assert set(_test.keys()) == set(_expected_keys), (
    f'Key mismatch. Expected {len(_expected_keys)} keys, got {len(_test)}')
print(f'Function sanity check passed (flower → rose):')
print(f'  wn_rel_synonym={_test["wn_rel_synonym"]}, '
      f'wn_rel_hypernym={_test["wn_rel_hypernym"]}, '
      f'wn_rel_hyponym={_test["wn_rel_hyponym"]}, '
      f'wn_max_path_sim={_test["wn_max_path_sim"]:.4f}, '
      f'wn_shared_synset_count={_test["wn_shared_synset_count"]}')
print(f'  Total features per pair: {len(_test)}')
Function sanity check passed (flower → rose):
  wn_rel_synonym=0, wn_rel_hypernym=0, wn_rel_hyponym=0, wn_max_path_sim=0.2500, wn_shared_synset_count=0
  Total features per pair: 22
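To make the right-to-left hop convention concrete, the reachability semantics can be sketched without WordNet. MockSynset and the toy motor-vehicle taxonomy below are hypothetical illustrations only; the real check_synset_reachable (defined earlier in this notebook) applies the same hop logic to actual WordNet Synset objects.

```python
# Minimal mock of the hop-based reachability check. MockSynset is a
# hypothetical stand-in for nltk's Synset, exposing only the two hop
# methods this sketch needs.
class MockSynset:
    def __init__(self, name):
        self.name = name
        self._hypernyms, self._hyponyms = [], []

    def hypernyms(self):
        return self._hypernyms

    def hyponyms(self):
        return self._hyponyms


def reachable(start_synsets, target_set, hops):
    """Follow each hop method in order; True if any target is reached."""
    frontier = list(start_synsets)
    for hop in hops:
        frontier = [s for syn in frontier for s in getattr(syn, hop)()]
    return any(s in target_set for s in frontier)


# Toy taxonomy: car and truck are both hyponyms of motor_vehicle.
motor_vehicle = MockSynset('motor_vehicle')
car, truck = MockSynset('car'), MockSynset('truck')
motor_vehicle._hyponyms = [car, truck]
car._hypernyms = [motor_vehicle]
truck._hypernyms = [motor_vehicle]

# 'hyponym_of_hypernym' (co-hyponymy): one hop up, then one hop back down.
print(reachable([car], {truck}, ['hypernyms', 'hyponyms']))  # True: siblings
print(reachable([car], {truck}, ['hypernyms']))              # False: not parent/child
```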
In [9]:
# --- Deduplicate to unique (definition_wn, answer_wn) pairs ---
# Many clue rows share the same definition–answer pair (different clue
# sentences for the same word pair). Relationship features depend only on
# the word pair, not the clue text, so we compute them once per unique pair
# and merge back. This avoids redundant WordNet traversals and dramatically
# reduces computation time.

unique_pairs = df[['definition_wn', 'answer_wn']].drop_duplicates()
print(f'Total rows:                              {len(df):>10,}')
print(f'Unique (definition_wn, answer_wn) pairs: {len(unique_pairs):>10,}')
print(f'Deduplication ratio:                     {len(df) / len(unique_pairs):>10.2f}x')
print(f'\nComputing WordNet relationship features for {len(unique_pairs):,} '
      f'unique pairs...')
print(f'(This may take 10–30 minutes on CPU.)\n')

# --- Compute features for each unique pair ---
# Use itertuples (faster than iterrows) to iterate over the unique pairs.
pair_list = list(unique_pairs.itertuples(index=False, name=None))
results = []
for def_word, ans_word in tqdm(pair_list, desc='WordNet relationships'):
    feats = compute_wordnet_relationship_features(def_word, ans_word)
    feats['definition_wn'] = def_word
    feats['answer_wn'] = ans_word
    results.append(feats)

rel_features_df = pd.DataFrame(results)
print(f'\nComputed {len(rel_features_df):,} unique-pair feature rows.')

# --- Merge relationship features back to all rows ---
# Left merge preserves all 240,211 rows in df. Each row gets the
# relationship features for its (definition_wn, answer_wn) pair.
n_before = len(df)
df = df.merge(rel_features_df, on=['definition_wn', 'answer_wn'], how='left')
assert len(df) == n_before, (
    f'Merge changed row count: {n_before:,} → {len(df):,}. '
    f'This should not happen with a left merge on unique keys.')

print(f'Merged relationship features back to {len(df):,} rows.')
Total rows:                                 240,211
Unique (definition_wn, answer_wn) pairs:    127,608
Deduplication ratio:                           1.88x

Computing WordNet relationship features for 127,608 unique pairs...
(This may take 10–30 minutes on CPU.)

WordNet relationships:   0%|          | 0/127608 [00:00<?, ?it/s]
Computed 127,608 unique-pair feature rows.
Merged relationship features back to 240,211 rows.
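As a side note, the row-count invariant asserted after the merge can also be enforced by pandas itself: passing validate='many_to_one' makes merge raise MergeError if the right-hand keys are not unique. A minimal sketch with toy data (the values are illustrative, not from the real dataset):

```python
import pandas as pd

# Toy version of the dedup-and-merge pattern above. validate='many_to_one'
# asks pandas to verify the right side has unique (definition_wn, answer_wn)
# keys -- the same invariant the row-count assertion checks.
toy = pd.DataFrame({'definition_wn': ['flower', 'flower', 'hide'],
                    'answer_wn':     ['rose',   'rose',   'conceal']})
pairs = toy[['definition_wn', 'answer_wn']].drop_duplicates()
feats = pairs.assign(wn_rel_synonym=[0, 1])  # illustrative feature values
merged = toy.merge(feats, on=['definition_wn', 'answer_wn'],
                   how='left', validate='many_to_one')
assert len(merged) == len(toy)  # left merge on unique keys keeps every row
print(merged['wn_rel_synonym'].tolist())  # [0, 0, 1]
```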

Verification: WordNet Relationship Features¶

In [10]:
# --- Identify relationship columns ---
wn_rel_bool_cols = sorted([c for c in df.columns if c.startswith('wn_rel_')])
wn_continuous_cols = ['wn_max_path_sim', 'wn_shared_synset_count']
wn_all_cols = wn_rel_bool_cols + wn_continuous_cols

print(f'Boolean relationship columns ({len(wn_rel_bool_cols)}):')
for col in wn_rel_bool_cols:
    print(f'  {col}')
print(f'\nContinuous relationship columns ({len(wn_continuous_cols)}):')
for col in wn_continuous_cols:
    print(f'  {col}')
print(f'\nTotal relationship columns: {len(wn_all_cols)}')

# --- Assert expected counts ---
# The design doc Table 3 lists 20 relationship types (19 from the original
# plan + hypernym_of_hyponym). Together with path similarity and shared
# synset count, that gives 22 total relationship columns — one more than
# the original PLAN.md estimate of 21 (which assumed 19 boolean types).
assert len(wn_rel_bool_cols) == 20, (
    f'Expected 20 boolean relationship columns, found {len(wn_rel_bool_cols)}')
assert len(wn_all_cols) == 22, (
    f'Expected 22 total relationship columns, found {len(wn_all_cols)}')

# --- Assert no NaN values (Decision 3) ---
nan_counts = df[wn_all_cols].isnull().sum()
assert (nan_counts == 0).all(), (
    f'NaN values found in relationship columns:\n'
    f'{nan_counts[nan_counts > 0]}')
print(f'\nVerification passed: {len(wn_rel_bool_cols)} boolean + '
      f'{len(wn_continuous_cols)} continuous = {len(wn_all_cols)} '
      f'relationship columns, 0 NaN values.')
print(f'Rows: {len(df):,}\n')

# --- Boolean feature prevalence ---
# Which relationship types are most common? We expect hyponym and synonym
# to dominate, since cryptic crossword definitions typically use "is-a"
# relationships (hypernym→answer) or direct synonymy.
print(f'--- Boolean Feature Prevalence ---')
print(f'{"Feature":<45s} {"True":>8s} {"False":>8s} {"% True":>8s}')
print(f'{"-"*45} {"-"*8} {"-"*8} {"-"*8}')
for col in wn_rel_bool_cols:
    n_true = int(df[col].sum())
    n_false = len(df) - n_true
    pct = 100 * n_true / len(df)
    print(f'{col:<45s} {n_true:>8,} {n_false:>8,} {pct:>7.2f}%')

# --- Rows with at least one boolean True ---
any_true = df[wn_rel_bool_cols].any(axis=1).sum()
pct_any = 100 * any_true / len(df)
print(f'\nRows with at least one boolean True: '
      f'{any_true:,} / {len(df):,} ({pct_any:.1f}%)')
print(f'(Design doc preliminary finding: ~31–34%)')

# --- Continuous feature statistics ---
print(f'\n--- Continuous Feature Statistics ---\n')
with pd.option_context('display.float_format', '{:.4f}'.format):
    print(df[wn_continuous_cols].describe().to_string())

# --- Distribution of wn_shared_synset_count ---
print(f'\n--- wn_shared_synset_count value distribution ---')
ssc_counts = df['wn_shared_synset_count'].value_counts().sort_index()
for val, cnt in ssc_counts.items():
    print(f'  {val}: {cnt:,} rows ({100 * cnt / len(df):.2f}%)')
Boolean relationship columns (20):
  wn_rel_hypernym
  wn_rel_hypernym_of_hypernym
  wn_rel_hypernym_of_hyponym
  wn_rel_hyponym
  wn_rel_hyponym_of_hypernym
  wn_rel_hyponym_of_hyponym
  wn_rel_hyponym_of_part_holonym
  wn_rel_hyponym_of_part_meronym
  wn_rel_member_meronym
  wn_rel_member_meronym_of_member_holonym
  wn_rel_part_holonym
  wn_rel_part_holonym_of_hypernym
  wn_rel_part_holonym_of_hyponym
  wn_rel_part_holonym_of_part_meronym
  wn_rel_part_meronym
  wn_rel_part_meronym_of_hypernym
  wn_rel_part_meronym_of_hyponym
  wn_rel_substance_meronym
  wn_rel_substance_meronym_of_hyponym
  wn_rel_synonym

Continuous relationship columns (2):
  wn_max_path_sim
  wn_shared_synset_count

Total relationship columns: 22

Verification passed: 20 boolean + 2 continuous = 22 relationship columns, 0 NaN values.
Rows: 240,211

--- Boolean Feature Prevalence ---
Feature                                           True    False   % True
--------------------------------------------- -------- -------- --------
wn_rel_hypernym                                 14,158  226,053    5.89%
wn_rel_hypernym_of_hypernym                      3,008  237,203    1.25%
wn_rel_hypernym_of_hyponym                      25,010  215,201   10.41%
wn_rel_hyponym                                  45,084  195,127   18.77%
wn_rel_hyponym_of_hypernym                      51,674  188,537   21.51%
wn_rel_hyponym_of_hyponym                       16,322  223,889    6.79%
wn_rel_hyponym_of_part_holonym                     651  239,560    0.27%
wn_rel_hyponym_of_part_meronym                     318  239,893    0.13%
wn_rel_member_meronym                               60  240,151    0.02%
wn_rel_member_meronym_of_member_holonym            856  239,355    0.36%
wn_rel_part_holonym                                443  239,768    0.18%
wn_rel_part_holonym_of_hypernym                    132  240,079    0.05%
wn_rel_part_holonym_of_hyponym                     836  239,375    0.35%
wn_rel_part_holonym_of_part_meronym              2,021  238,190    0.84%
wn_rel_part_meronym                                471  239,740    0.20%
wn_rel_part_meronym_of_hypernym                    211  240,000    0.09%
wn_rel_part_meronym_of_hyponym                     416  239,795    0.17%
wn_rel_substance_meronym                            35  240,176    0.01%
wn_rel_substance_meronym_of_hyponym                392  239,819    0.16%
wn_rel_synonym                                  36,121  204,090   15.04%

Rows with at least one boolean True: 120,854 / 240,211 (50.3%)
(Design doc preliminary finding: ~31–34%)

--- Continuous Feature Statistics ---

       wn_max_path_sim  wn_shared_synset_count
count      240211.0000             240211.0000
mean            0.4352                  0.2081
std             0.2902                  0.5066
min             0.0385                  0.0000
25%             0.2500                  0.0000
50%             0.3333                  0.0000
75%             0.5000                  0.0000
max             1.0000                 16.0000

--- wn_shared_synset_count value distribution ---
  0: 197,886 rows (82.38%)
  1: 36,660 rows (15.26%)
  2: 4,336 rows (1.81%)
  3: 934 rows (0.39%)
  4: 244 rows (0.10%)
  5: 87 rows (0.04%)
  6: 43 rows (0.02%)
  7: 16 rows (0.01%)
  14: 1 rows (0.00%)
  16: 4 rows (0.00%)
In [11]:
# What's the percentage at the unique-pair level?
pair_df = df.drop_duplicates(subset=['definition_wn', 'answer_wn'])
pair_any_true = pair_df[wn_rel_bool_cols].any(axis=1).sum()
print(f"Unique pairs with at least one boolean True: "
      f"{pair_any_true:,} / {len(pair_df):,} "
      f"({100 * pair_any_true / len(pair_df):.1f}%)")
Unique pairs with at least one boolean True: 56,220 / 127,608 (44.1%)

Interpretation¶

WordNet connectivity: The percentage of rows with at least one boolean relationship True should be compared to the ~31–34% reported in Hans's preliminary analysis (design doc Table 3). Any difference likely reflects the broader dataset (we include multi-word entries and double-definition clues that the preliminary analysis excluded) and the improved WordNet lookup (underscore conversion for multi-word entries — see Decision 17).

Relationship type distribution: Cryptic crossword definitions typically use hypernym relationships ("flower" for ROSE, "creature" for CAT) or direct synonymy ("hide" for CONCEAL). hyponym_of_hypernym (co-hyponymy) captures pairs like "plant"/"aster" — both children of a shared parent concept — which is common when the definition is a sibling category rather than a direct parent. Two-hop types involving meronyms and holonyms are rarer since part–whole relationships are less common in crossword definitions.

Path similarity: wn_max_path_sim was the single most predictive feature in Hans's preliminary experiments (removing it alone dropped accuracy by 3.4%). Its distribution, including the proportion of zero values where no WordNet path exists, informs whether this feature will remain dominant within the expanded 47-feature set. A value of 1.0 means the definition and answer share a synset; 0.5 indicates a direct parent–child link; values in the 0.05–0.1 range indicate distant connections.
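For reference, NLTK defines path similarity as 1 / (1 + d), where d is the length of the shortest hypernym/hyponym path between the two synsets (d = 0 for a shared synset). A quick sketch of how the score decays with path length:

```python
# Path similarity as a function of shortest-path length d, following
# NLTK's definition: path_similarity = 1 / (1 + d).
def path_sim_from_distance(d):
    return 1.0 / (1.0 + d)

for d in [0, 1, 2, 3, 9]:
    print(f'd = {d}: path_similarity = {path_sim_from_distance(d):.4f}')
# d = 0 (shared synset) gives 1.0; d = 1 (parent-child) gives 0.5;
# d = 3 gives 0.25, the flower -> rose value in the sanity check above.
```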

Shared synsets: Pairs with wn_shared_synset_count > 0 are definitional synonyms — the definition and answer literally share a WordNet sense — making classification trivially easy for those rows. The proportion of such pairs contributes to the baseline difficulty of the classification task.

Surface Features (4 total)¶

Surface features capture shallow orthographic similarity between the definition and answer words. These are the simplest features in the pipeline — they require no embeddings or WordNet lookups, just string comparison.

Unlike cosine similarity features (which measure semantic similarity in embedding space) and relationship features (which measure structural connections in WordNet), surface features measure how the words look — whether they share characters, start with the same letter, or have similar lengths. In cryptic crosswords, surface similarity can be a red herring (e.g., "flower" for ROSE exploits the "something that flows" reading, not orthographic overlap) or occasionally a genuine signal in compound clues where the answer overlaps with definition characters.

We compute these from definition_wn and answer_wn (the WordNet-ready lowercase strings), consistent with how the relationship features use these columns.

Features:

  1. surface_edit_distance — Levenshtein edit distance: minimum number of single-character insertions, deletions, or substitutions to transform one string into the other. Higher values = less similar.
  2. surface_length_ratio — len(shorter) / len(longer), always in [0, 1]. Values near 1.0 mean the words are similar length.
  3. surface_shared_first_letter — 1 if the words start with the same character, 0 otherwise.
  4. surface_char_overlap_ratio — Jaccard similarity of character sets: |chars(def) ∩ chars(ans)| / |chars(def) ∪ chars(ans)|. Measures what fraction of distinct letters appear in both words.
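The four definitions above can be sketched in plain Python (a reference sketch only; the notebook's actual implementation, below, uses nltk.edit_distance, which runs the same dynamic program):

```python
# Pure-Python reference for the four surface features. levenshtein uses
# the standard two-row dynamic program (insert/delete/substitute, cost 1).
def levenshtein(a, b):
    prev = list(range(len(b) + 1))  # prev[j] = distance(a[:i-1], b[:j])
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def surface_sketch(def_word, ans_word):
    chars_d, chars_a = set(def_word), set(ans_word)
    return {
        'surface_edit_distance': levenshtein(def_word, ans_word),
        'surface_length_ratio': (min(len(def_word), len(ans_word))
                                 / max(len(def_word), len(ans_word))),
        'surface_shared_first_letter': int(def_word[0] == ans_word[0]),
        'surface_char_overlap_ratio': (len(chars_d & chars_a)
                                       / len(chars_d | chars_a)),
    }


print(surface_sketch('flower', 'rose'))
# edit distance 4, length ratio 4/6, shared first letter 0, char overlap 3/7
```

(Unlike the full implementation below, this sketch does not guard against empty strings.)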
In [12]:
# ============================================================
# Surface Feature Function
# ============================================================
# This function is standalone and can be extracted into
# scripts/feature_utils.py for reuse in distractor feature
# computation (Steps 5 and 7). See Decision 18.
# ============================================================


def compute_surface_features(def_word, ans_word):
    """Compute all 4 surface features for a (definition, answer) pair.

    Surface features measure shallow orthographic similarity between words.
    This function is standalone and can be extracted into
    scripts/feature_utils.py for reuse in distractor feature computation
    (Steps 5 and 7). See Decision 18.

    Parameters
    ----------
    def_word : str
        The definition word (from the definition_wn column).
    ans_word : str
        The answer word (from the answer_wn column).

    Returns
    -------
    dict
        Keys are feature column names, values are feature values:
        - surface_edit_distance (int): Levenshtein edit distance
        - surface_length_ratio (float): len(shorter) / len(longer), in [0, 1]
        - surface_shared_first_letter (int): 1 if same first character, else 0
        - surface_char_overlap_ratio (float): Jaccard similarity of character sets
    """
    features = {}

    # --- Levenshtein edit distance ---
    # Uses NLTK's edit_distance which computes the standard dynamic-programming
    # Levenshtein distance (insertions, deletions, substitutions each cost 1).
    features['surface_edit_distance'] = nltk.edit_distance(def_word, ans_word)

    # --- Length ratio: len(shorter) / len(longer) ---
    # Always in [0, 1]. A value of 1.0 means identical lengths.
    # Default to 0.0 if both strings are empty (should not occur given
    # WordNet filter, but be safe).
    len_def = len(def_word)
    len_ans = len(ans_word)
    if len_def == 0 and len_ans == 0:
        features['surface_length_ratio'] = 0.0
    else:
        shorter = min(len_def, len_ans)
        longer = max(len_def, len_ans)
        features['surface_length_ratio'] = shorter / longer

    # --- Shared first letter ---
    # Binary indicator: do the words start with the same character?
    if def_word and ans_word:
        features['surface_shared_first_letter'] = int(
            def_word[0] == ans_word[0])
    else:
        features['surface_shared_first_letter'] = 0

    # --- Character overlap ratio (Jaccard similarity of character sets) ---
    # Measures what fraction of distinct characters appear in both words.
    # For example, "plant" {p,l,a,n,t} and "aster" {a,s,t,e,r} share
    # {a, t} out of {p,l,a,n,t,s,e,r} → 2/8 = 0.25.
    def_chars = set(def_word)
    ans_chars = set(ans_word)
    union = def_chars | ans_chars
    if len(union) == 0:
        features['surface_char_overlap_ratio'] = 0.0
    else:
        features['surface_char_overlap_ratio'] = (
            len(def_chars & ans_chars) / len(union))

    return features


# Quick sanity check with a known pair
_test_surf = compute_surface_features('flower', 'rose')
print('Sanity check (flower → rose):')
for k, v in _test_surf.items():
    print(f'  {k} = {v}')
assert len(_test_surf) == 4

print(f'\nComputing surface features for {len(df):,} rows...')

# --- Apply to all rows ---
# Surface features are cheap to compute (no WordNet or embedding lookups),
# so we compute them directly on all rows without deduplication.
# Using a list comprehension over zip is faster than df.apply(axis=1).
surf_results = [
    compute_surface_features(d, a)
    for d, a in zip(df['definition_wn'], df['answer_wn'])
]
surf_df = pd.DataFrame(surf_results, index=df.index)
for col in surf_df.columns:
    df[col] = surf_df[col]

print(f'Added {len(surf_df.columns)} surface feature columns to df.')
Sanity check (flower → rose):
  surface_edit_distance = 4
  surface_length_ratio = 0.6666666666666666
  surface_shared_first_letter = 0
  surface_char_overlap_ratio = 0.42857142857142855

Computing surface features for 240,211 rows...
Added 4 surface feature columns to df.

Verification: Surface Features¶

In [13]:
# --- Identify surface columns ---
SURFACE_FEATURE_COLS = ['surface_edit_distance', 'surface_length_ratio',
                        'surface_shared_first_letter', 'surface_char_overlap_ratio']
surface_cols = SURFACE_FEATURE_COLS

# --- Assert exactly 4 surface columns ---
assert len(surface_cols) == 4, (
    f'Expected 4 surface columns, found {len(surface_cols)}: {surface_cols}')

# --- Assert no NaN values (Decision 3) ---
nan_counts = df[surface_cols].isnull().sum()
assert (nan_counts == 0).all(), (
    f'NaN values found in surface columns:\n'
    f'{nan_counts[nan_counts > 0]}')

print(f'Verification passed: 4 surface columns, 0 NaN values.')
print(f'Rows: {len(df):,}\n')

# --- Descriptive statistics ---
print('--- Descriptive Statistics (Surface Features) ---\n')
with pd.option_context('display.float_format', '{:.4f}'.format):
    print(df[surface_cols].describe().to_string())
Verification passed: 4 surface columns, 0 NaN values.
Rows: 240,211

--- Descriptive Statistics (Surface Features) ---

       surface_edit_distance  surface_length_ratio  surface_shared_first_letter  surface_char_overlap_ratio
count            240211.0000           240211.0000                  240211.0000                 240211.0000
mean                  6.7299                0.7259                       0.0738                      0.2304
std                   2.2153                0.1836                       0.2615                      0.1451
min                   0.0000                0.0667                       0.0000                      0.0000
25%                   5.0000                0.6000                       0.0000                      0.1250
50%                   6.0000                0.7500                       0.0000                      0.2222
75%                   8.0000                0.8571                       0.0000                      0.3333
max                  26.0000                1.0000                       1.0000                      1.0000

Quick sanity checks:

Edit distance mean 6.7 — definitions and answers are typically quite different strings, which makes sense.

Shared first letter 7.4% — only about 1 in 13 pairs start with the same letter. This is about double the uniform-chance rate (1/26 ≈ 3.8%), though English first letters are not uniformly distributed (which raises the chance baseline), and some definitions are closely related word forms of the answer.

Character overlap 23% — modest Jaccard overlap of character sets, reasonable.

Length ratio 0.73 mean — definitions and answers tend to be somewhat similar in length.

Final Assembly, Validation, and Save¶

All 47 features have been computed:

  • 15 context-free cosine similarities (no clue context involved)
  • 6 context-informed cosine similarities (definition in clue context vs. each embedding)
  • 22 WordNet relationship features (20 boolean types + max path similarity + shared synset count)
  • 4 surface features (edit distance, length ratio, shared first letter, character overlap)

This section assembles the feature columns and metadata into the final output DataFrame, validates that all features are present and NaN-free (Decision 3), and saves to data/features_all.parquet.

The feature count is 47 rather than the 46 originally planned in PLAN.md because hypernym_of_hyponym was added as a 20th boolean relationship type (the original plan assumed 19 boolean types).

In [14]:
# ============================================================
# Feature Group Definitions
# ============================================================

# --- Context-Free Meaning (15): cosine similarities not involving clue context ---
CONTEXT_FREE_COLS = sorted([
    c for c in df.columns if c.startswith('cos_') and 'w1clue' not in c
])

# --- Context-Informed Meaning (6): cosine similarities involving w1clue ---
CONTEXT_INFORMED_COLS = sorted([
    c for c in df.columns if c.startswith('cos_') and 'w1clue' in c
])

# --- Relationship (22): 20 boolean + path similarity + shared synset count ---
RELATIONSHIP_COLS = (
    sorted([c for c in df.columns if c.startswith('wn_rel_')])
    + ['wn_max_path_sim', 'wn_shared_synset_count']
)

# --- Surface (4): orthographic similarity ---
SURFACE_COLS = ['surface_edit_distance', 'surface_length_ratio',
                'surface_shared_first_letter', 'surface_char_overlap_ratio']

# --- All features ---
ALL_FEATURE_COLS = (
    CONTEXT_FREE_COLS + CONTEXT_INFORMED_COLS
    + RELATIONSHIP_COLS + SURFACE_COLS
)

print('Feature group counts:')
print(f'  Context-free cosine:     {len(CONTEXT_FREE_COLS):>3d}')
print(f'  Context-informed cosine: {len(CONTEXT_INFORMED_COLS):>3d}')
print(f'  WordNet relationship:    {len(RELATIONSHIP_COLS):>3d}')
print(f'  Surface:                 {len(SURFACE_COLS):>3d}')
print(f'  {"─" * 35}')
print(f'  Total features:          {len(ALL_FEATURE_COLS):>3d}')

# --- Assert expected group sizes ---
assert len(CONTEXT_FREE_COLS) == 15, (
    f'Expected 15 context-free, got {len(CONTEXT_FREE_COLS)}')
assert len(CONTEXT_INFORMED_COLS) == 6, (
    f'Expected 6 context-informed, got {len(CONTEXT_INFORMED_COLS)}')
assert len(RELATIONSHIP_COLS) == 22, (
    f'Expected 22 relationship, got {len(RELATIONSHIP_COLS)}')
assert len(SURFACE_COLS) == 4, (
    f'Expected 4 surface, got {len(SURFACE_COLS)}')
assert len(ALL_FEATURE_COLS) == 47, (
    f'Expected 47 total features, got {len(ALL_FEATURE_COLS)}')

# ============================================================
# Metadata Columns
# ============================================================
# Carry through identifiers and lookup keys needed by downstream notebooks.
# Includes def_num_usable_synsets and ans_num_usable_synsets for downstream
# stratification by polysemy level (Decision 19).
# Does NOT include raw embedding matrices or intermediate columns (e.g.,
# cc_row_position used for embedding array indexing).

METADATA_COLS = [
    'clue_id', 'clue', 'surface', 'surface_normalized',
    'definition', 'answer', 'definition_wn', 'answer_wn',
    'def_answer_pair_id', 'answer_format', 'num_definitions',
    'def_num_usable_synsets', 'ans_num_usable_synsets',
]

# Verify all metadata columns exist in df
missing_meta = [c for c in METADATA_COLS if c not in df.columns]
assert not missing_meta, f'Missing metadata columns: {missing_meta}'

# ============================================================
# Build Output DataFrame
# ============================================================
output_df = df[METADATA_COLS + ALL_FEATURE_COLS].copy()

# ============================================================
# Validation
# ============================================================
expected_rows = len(df)
assert len(output_df) == expected_rows, (
    f'Row count mismatch: {len(output_df):,} vs {expected_rows:,}')
assert not output_df[ALL_FEATURE_COLS].isnull().any().any(), (
    'NaN values found in feature columns — violates Decision 3')
assert len(ALL_FEATURE_COLS) == 47, (
    f'Expected 47 features, got {len(ALL_FEATURE_COLS)}')

print(f'\nOutput DataFrame: {len(output_df):,} rows × '
      f'{len(output_df.columns)} columns')
print(f'  Metadata columns: {len(METADATA_COLS)}')
print(f'  Feature columns:  {len(ALL_FEATURE_COLS)}')
print(f'  Total columns:    {len(METADATA_COLS) + len(ALL_FEATURE_COLS)}')

# ============================================================
# Feature Statistics (all 47 features, transposed)
# ============================================================
print(f'\n--- Feature Statistics (all 47 features) ---\n')
with pd.option_context('display.float_format', '{:.4f}'.format,
                       'display.max_rows', 50):
    print(output_df[ALL_FEATURE_COLS].describe().T[
        ['mean', 'std', 'min', '25%', '50%', '75%', 'max']
    ].to_string())

# ============================================================
# Correlation Highlights
# ============================================================
# Flag the top 5 most correlated feature pairs. Some multicollinearity is
# expected: within-word cosine pairs (e.g., cos_w1all_w1common and
# cos_w1all_w1obscure) share most of their variance because allsense is an
# average that includes the common and obscure senses. Certain relationship
# booleans (synonym and hyponym_of_hypernym) may also co-occur frequently.
print(f'\n--- Top 5 Most Correlated Feature Pairs ---\n')
corr_matrix = output_df[ALL_FEATURE_COLS].corr()
# Extract upper triangle (exclude diagonal and lower triangle)
upper_tri = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
# Stack and sort by absolute correlation
corr_pairs = upper_tri.stack().reset_index()
corr_pairs.columns = ['feature_1', 'feature_2', 'correlation']
corr_pairs['abs_corr'] = corr_pairs['correlation'].abs()
top_5 = corr_pairs.nlargest(5, 'abs_corr')
for _, row in top_5.iterrows():
    print(f'  {row["feature_1"]:<40s} × {row["feature_2"]:<40s}'
          f'  r = {row["correlation"]:+.4f}')

# ============================================================
# Save to Parquet
# ============================================================
output_path = DATA_DIR / 'features_all.parquet'
output_df.to_parquet(output_path, index=False)
file_size_mb = output_path.stat().st_size / (1024 * 1024)
print(f'\nSaved to {output_path}')
print(f'File size: {file_size_mb:.1f} MB')

# --- Round-trip verification ---
reloaded = pd.read_parquet(output_path)
assert reloaded.shape == output_df.shape, (
    f'Shape mismatch after reload: {reloaded.shape} vs {output_df.shape}')
print(f'Round-trip verification passed: '
      f'{reloaded.shape[0]:,} rows × {reloaded.shape[1]} columns')
Feature group counts:
  Context-free cosine:      15
  Context-informed cosine:   6
  WordNet relationship:     22
  Surface:                   4
  ───────────────────────────────────
  Total features:           47

Output DataFrame: 240,211 rows × 60 columns
  Metadata columns: 13
  Feature columns:  47
  Total columns:    60

--- Feature Statistics (all 47 features) ---

                                          mean    std     min    25%    50%    75%     max
cos_w1all_w1common                      0.8638 0.1257  0.2143 0.8146 0.8939 0.9521  1.0000
cos_w1all_w1obscure                     0.8636 0.1168  0.3153 0.8004 0.8887 0.9503  1.0000
cos_w1all_w2all                         0.6482 0.1366  0.0909 0.5597 0.6622 0.7512  1.0000
cos_w1all_w2common                      0.5954 0.1575  0.0180 0.4850 0.6090 0.7172  0.9689
cos_w1all_w2obscure                     0.5940 0.1545  0.0148 0.4862 0.6071 0.7112  0.9689
cos_w1common_w1obscure                  0.6891 0.2331 -0.0262 0.5217 0.7244 0.8780  1.0000
cos_w1common_w2all                      0.5686 0.1710  0.0085 0.4431 0.5815 0.7019  0.9689
cos_w1common_w2common                   0.5311 0.1871 -0.0402 0.3819 0.5337 0.6800  1.0000
cos_w1common_w2obscure                  0.5171 0.1867 -0.0420 0.3694 0.5192 0.6640  0.9768
cos_w1obscure_w2all                     0.5593 0.1640  0.0128 0.4390 0.5676 0.6856  0.9689
cos_w1obscure_w2common                  0.5096 0.1801 -0.0444 0.3693 0.5072 0.6487  0.9768
cos_w1obscure_w2obscure                 0.5171 0.1754 -0.0407 0.3818 0.5153 0.6509  1.0000
cos_w2all_w2common                      0.9150 0.1021  0.2143 0.8729 0.9430 1.0000  1.0000
cos_w2all_w2obscure                     0.9145 0.0982  0.3153 0.8696 0.9409 1.0000  1.0000
cos_w2common_w2obscure                  0.7798 0.2194 -0.0262 0.6318 0.8236 1.0000  1.0000
cos_w1clue_w1all                        0.7453 0.1171  0.1011 0.6814 0.7681 0.8318  0.9772
cos_w1clue_w1common                     0.6721 0.1780 -0.0199 0.5655 0.7170 0.8105  0.9748
cos_w1clue_w1obscure                    0.6374 0.1794 -0.0490 0.5127 0.6725 0.7809  0.9763
cos_w1clue_w2all                        0.5446 0.1544 -0.0028 0.4318 0.5546 0.6642  0.9404
cos_w1clue_w2common                     0.5019 0.1695 -0.0382 0.3701 0.5033 0.6360  0.9427
cos_w1clue_w2obscure                    0.4987 0.1676 -0.0545 0.3683 0.4994 0.6303  0.9441
wn_rel_hypernym                         0.0589 0.2355  0.0000 0.0000 0.0000 0.0000  1.0000
wn_rel_hypernym_of_hypernym             0.0125 0.1112  0.0000 0.0000 0.0000 0.0000  1.0000
wn_rel_hypernym_of_hyponym              0.1041 0.3054  0.0000 0.0000 0.0000 0.0000  1.0000
wn_rel_hyponym                          0.1877 0.3905  0.0000 0.0000 0.0000 0.0000  1.0000
wn_rel_hyponym_of_hypernym              0.2151 0.4109  0.0000 0.0000 0.0000 0.0000  1.0000
wn_rel_hyponym_of_hyponym               0.0679 0.2517  0.0000 0.0000 0.0000 0.0000  1.0000
wn_rel_hyponym_of_part_holonym          0.0027 0.0520  0.0000 0.0000 0.0000 0.0000  1.0000
wn_rel_hyponym_of_part_meronym          0.0013 0.0364  0.0000 0.0000 0.0000 0.0000  1.0000
wn_rel_member_meronym                   0.0002 0.0158  0.0000 0.0000 0.0000 0.0000  1.0000
wn_rel_member_meronym_of_member_holonym 0.0036 0.0596  0.0000 0.0000 0.0000 0.0000  1.0000
wn_rel_part_holonym                     0.0018 0.0429  0.0000 0.0000 0.0000 0.0000  1.0000
wn_rel_part_holonym_of_hypernym         0.0005 0.0234  0.0000 0.0000 0.0000 0.0000  1.0000
wn_rel_part_holonym_of_hyponym          0.0035 0.0589  0.0000 0.0000 0.0000 0.0000  1.0000
wn_rel_part_holonym_of_part_meronym     0.0084 0.0913  0.0000 0.0000 0.0000 0.0000  1.0000
wn_rel_part_meronym                     0.0020 0.0442  0.0000 0.0000 0.0000 0.0000  1.0000
wn_rel_part_meronym_of_hypernym         0.0009 0.0296  0.0000 0.0000 0.0000 0.0000  1.0000
wn_rel_part_meronym_of_hyponym          0.0017 0.0416  0.0000 0.0000 0.0000 0.0000  1.0000
wn_rel_substance_meronym                0.0001 0.0121  0.0000 0.0000 0.0000 0.0000  1.0000
wn_rel_substance_meronym_of_hyponym     0.0016 0.0404  0.0000 0.0000 0.0000 0.0000  1.0000
wn_rel_synonym                          0.1504 0.3574  0.0000 0.0000 0.0000 0.0000  1.0000
wn_max_path_sim                         0.4352 0.2902  0.0385 0.2500 0.3333 0.5000  1.0000
wn_shared_synset_count                  0.2081 0.5066  0.0000 0.0000 0.0000 0.0000 16.0000
surface_edit_distance                   6.7299 2.2153  0.0000 5.0000 6.0000 8.0000 26.0000
surface_length_ratio                    0.7259 0.1836  0.0667 0.6000 0.7500 0.8571  1.0000
surface_shared_first_letter             0.0738 0.2615  0.0000 0.0000 0.0000 0.0000  1.0000
surface_char_overlap_ratio              0.2304 0.1451  0.0000 0.1250 0.2222 0.3333  1.0000

--- Top 5 Most Correlated Feature Pairs ---

  cos_w1common_w2all                       × cos_w1common_w2common                     r = +0.8850
  cos_w1obscure_w2all                      × cos_w1obscure_w2obscure                   r = +0.8731
  cos_w1common_w2all                       × cos_w1common_w2obscure                    r = +0.8710
  cos_w1obscure_w2all                      × cos_w1obscure_w2common                    r = +0.8653
  cos_w2all_w2obscure                      × cos_w2common_w2obscure                    r = +0.8624

Saved to /Users/victoria/Desktop/MADS/ccc-project/clue_misdirection/data/features_all.parquet
File size: 46.8 MB
Round-trip verification passed: 240,211 rows × 60 columns

Summary¶

This notebook computed 47 features across 4 groups for all 240,211 (clue, definition, answer) rows. The feature count is 47 rather than the 46 originally planned in PLAN.md because hypernym_of_hyponym was added as a 20th boolean WordNet relationship type during implementation (the original plan assumed 19 boolean types).

Feature groups¶

  Group                     Count  Description
  ─────────────────────────────────────────────
  Context-free cosine          15  Pairwise cosine similarities among the 6 sense-specific embeddings (definition × answer cross-word, within-definition, within-answer)
  Context-informed cosine       6  Cosine similarities between the clue-contextualized definition embedding and each of the 6 sense-specific embeddings
  WordNet relationship         22  20 boolean two-hop relationship types + max path similarity + shared synset count
  Surface                       4  Edit distance, length ratio, shared first letter, character overlap ratio
  ─────────────────────────────────────────────
  Total                        47

Key statistics by group¶

Cosine similarities (21 features):

  • Cross-word cosine similarities (definition vs. answer) average 0.50–0.65, reflecting moderate semantic relatedness between definition–answer pairs.
  • Within-word cosine similarities (e.g., allsense vs. common for the same word) average 0.69–0.92, as expected — different senses of the same word share much of the embedding space.
  • Context-informed features show the misdirection signal: cos_w1clue_w2all (mean ~0.54) is lower than cos_w1all_w2all (mean ~0.65), consistent with the hypothesis that clue context pulls the definition embedding away from the answer.
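Both cosine groups reduce to the same row-wise operation; a minimal sketch, with small random arrays standing in for the real row-aligned sense-specific embedding arrays (the variable names here are illustrative):

```python
import numpy as np

def row_cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two row-aligned (n, d) arrays."""
    num = np.einsum('ij,ij->i', a, b)
    return num / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

rng = np.random.default_rng(0)
w1_all = rng.normal(size=(4, 8))  # stand-in: definition allsense embeddings
w2_all = rng.normal(size=(4, 8))  # stand-in: answer allsense embeddings

cos_w1all_w2all = row_cosine(w1_all, w2_all)  # one similarity per row
```

Note that `sklearn.metrics.pairwise.cosine_similarity` returns a full n × n matrix, so a row-wise reduction like the `einsum` above (or taking the matrix diagonal) is what yields one value per (definition, answer) row.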

WordNet relationship features (22 features):

  • Synonym, hyponym, and hyponym_of_hypernym (co-hyponymy) are the most prevalent boolean types, consistent with how cryptic crossword definitions typically use "is-a" or sibling-category relationships.
  • wn_max_path_sim was the single most predictive feature in Hans's preliminary experiments; its distribution over the full dataset will show whether it remains dominant once all 47 features are available.

Surface features (4 features):

  • Edit distance captures the raw character-level difference between definition and answer strings. Since cryptic definitions are rarely orthographically similar to their answers, most values are moderate to high (mean 6.73 in this run, with a maximum of 26).
  • Length ratio, shared first letter, and character overlap ratio provide complementary views of surface-level similarity.
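The four surface features can be sketched in a few lines; the dict keys follow the column names in the statistics table, while the Levenshtein implementation here is a generic dynamic-programming version, not necessarily the one the notebook uses:

```python
def surface_features(definition: str, answer: str) -> dict:
    """Four surface features computed on lowercase strings."""
    d, a = definition.lower(), answer.lower()

    # Levenshtein edit distance via the classic two-row DP
    prev = list(range(len(a) + 1))
    for i, cd in enumerate(d, 1):
        cur = [i]
        for j, ca in enumerate(a, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (cd != ca)))  # substitution
        prev = cur

    return {
        'surface_edit_distance': prev[-1],
        'surface_length_ratio': min(len(d), len(a)) / max(len(d), len(a)),
        'surface_shared_first_letter': float(d[:1] == a[:1]),
        'surface_char_overlap_ratio':
            len(set(d) & set(a)) / len(set(d) | set(a)),
    }
```

For example, `surface_features('dog', 'god')` gives edit distance 2, length ratio 1.0, no shared first letter, and full character overlap.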

Notable observations¶

  • Single-synset words: Words with only one usable WordNet synset (35% of definitions, 46% of answers) have identical allsense/common/obscure embeddings, causing within-word cosine features to be exactly 1.0 for those rows. def_num_usable_synsets and ans_num_usable_synsets are carried as metadata columns for downstream stratification (Decision 19).
  • No NaN values: All 47 features are valid numbers for all rows, satisfying Decision 3.
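Both observations are mechanical invariants that can be asserted directly on the feature table; a sketch against a synthetic two-row stand-in (the column names come from this notebook, the values are made up):

```python
import pandas as pd

# Synthetic stand-in for two rows of features_all
df = pd.DataFrame({
    'def_num_usable_synsets': [1, 3],
    'cos_w1all_w1common': [1.0, 0.81],
    'wn_max_path_sim': [0.25, 0.5],
})
feature_cols = ['cos_w1all_w1common', 'wn_max_path_sim']

# Decision 3: no NaN values in any feature column
assert not df[feature_cols].isna().any().any()

# Single-synset definitions collapse allsense/common/obscure to one
# embedding, so within-word cosines are exactly 1.0 for those rows
single = df['def_num_usable_synsets'] == 1
assert (df.loc[single, 'cos_w1all_w1common'] == 1.0).all()
```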

Output¶

  • File: data/features_all.parquet
  • Shape: 240,211 rows × 60 columns (13 metadata + 47 features)

FINDINGS.md update¶

Step 3 findings to record:

  • 47 features computed (15 + 6 + 22 + 4), one more than the originally planned 46 due to the addition of hypernym_of_hyponym as a 20th boolean relationship type.
  • Surface features are computed from the lowercase definition_wn/answer_wn strings; all 4 are NaN-free across 240,211 rows.
  • The complete feature table is saved as data/features_all.parquet with 13 metadata columns for downstream use.