Stage 7: Definitions as a Control Condition¶

Primary author: Nathan (NC)
Builds on:

  • 02_embedding_generation.ipynb (Victoria — BGE-M3 embedding pipeline)
  • 04_clustering.ipynb (Victoria — HDBSCAN and agglomerative clustering pipeline)

Prompt engineering: Victoria
AI assistance: Claude (Anthropic)
Environment: Great Lakes (GPU required for Section 2 embedding generation); Colab or Local for all other sections


Research Question¶

"Is the structure we found in indicator embeddings specific to wordplay, or would any set of words from the same clues cluster similarly?"

Why This Notebook Exists¶

In Notebooks 02–05, we clustered CCC indicator words — phrases like scrambled, inside, and sounds like that signal what kind of wordplay is happening in a cryptic crossword clue. Those indicators formed semantically coherent clusters that partially reflect the wordplay taxonomy.

But there is a competing explanation: maybe any short English phrases drawn from the same clues would cluster just as well, simply because BGE-M3 and UMAP are powerful general-purpose tools that find structure in any language. Under this explanation, our indicator clustering results would not tell us anything specific about wordplay.

To test this, we run the exact same pipeline on the definitions from the same clues. In a cryptic crossword clue, the definition is the part that gives a conventional synonym of the answer (e.g., "place to sleep" in a clue whose answer is BED). Definitions come from arbitrary semantic domains — animals, places, emotions, anything — with no inherent connection to wordplay operations.

Prediction:

  • If indicator clustering captures wordplay-specific structure, definition clusters should organize by topic (animals, geography, emotions) rather than by conceptual metaphors, or produce meaningfully different quality metrics.
  • If definitions cluster just as well and in the same way, that challenges the wordplay-specificity interpretation and becomes a finding in itself.

This notebook is part of Stage 5 (Constrained and Targeted Experiments). Output files feed into Notebook 06 (Evaluation and Report Figures).

Running on Google Colab¶

Section 2 (embedding generation) requires a GPU. For all other sections, CPU is fine.

If running on Colab:

  1. Go to Runtime > Change runtime type
  2. Select a GPU accelerator (T4 on free tier is sufficient)
  3. Click Save, then run all cells

Embedding definitions takes approximately 2–8 minutes on a T4 GPU, depending on the number of unique definitions extracted.

Great Lakes session settings (if running on the cluster):

  • Partition: gpu
  • GPUs: 1 (V100 or A40)
  • CPUs: 4
  • Memory: 32 GB
  • Wall time: 1 hour (embeddings + UMAP + clustering)

Section 0: Setup¶

Imports¶

We import the same libraries used in Notebooks 02 and 04 to ensure the pipeline is identical and any differences in results come from the content of the data, not the method.

  • re / unicodedata: used in Section 1 to normalize text and verify definition presence in clue surfaces (copied from 01b_data_cleaning.ipynb)
  • sentence_transformers: provides the BGE-M3 model for generating embeddings
  • umap: dimensionality reduction (same parameters as NB 03)
  • hdbscan: density-based clustering
  • sklearn: agglomerative clustering and evaluation metrics
  • matplotlib / seaborn: visualization

Expected output: No output. If any import fails, install the missing package (pip install sentence-transformers umap-learn hdbscan).

In [ ]:
import os
import re
import unicodedata
import numpy as np
import pandas as pd
from pathlib import Path

from sentence_transformers import SentenceTransformer
import umap

import hdbscan
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score, davies_bouldin_score

import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns

matplotlib.rcParams['figure.dpi'] = 120

Environment Auto-Detection and Path Configuration¶

This cell detects whether we are running on Google Colab, the UMich Great Lakes cluster, or a local machine, and sets file paths accordingly. This is the standard pattern used across all notebooks in this project — see CLAUDE.md for details.

What BATCH_SIZE controls: How many definitions are passed to the embedding model at once. Larger batches are faster but use more GPU memory. Reduce to 64 if you see CUDA out-of-memory errors on Colab's T4 GPU.

Expected output: Three printed paths confirming your working environment.

In [ ]:
try:
    IS_COLAB = 'google.colab' in str(get_ipython())
except NameError:
    IS_COLAB = False

IS_GREATLAKES = 'SLURM_JOB_ID' in os.environ

if IS_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    PROJECT_ROOT = Path('/content/drive/MyDrive/SIADS 692 Milestone II/Milestone II - NLP Cryptic Crossword Clues')
elif IS_GREATLAKES:
    # Update YOUR_UNIQNAME to your actual UMich uniqname
    PROJECT_ROOT = Path('/home/nycantwe/ccc_project')
else:
    # Local: notebooks/ is one level below project root
    PROJECT_ROOT = Path.cwd().parent

DATA_DIR = PROJECT_ROOT / 'data'
OUTPUT_DIR = PROJECT_ROOT / 'outputs'
FIGURES_DIR = OUTPUT_DIR / 'figures'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
FIGURES_DIR.mkdir(parents=True, exist_ok=True)

# Larger batches are faster; reduce to 64 if CUDA OOM on Colab T4
BATCH_SIZE = 32 if IS_COLAB else 256

env_name = 'Colab' if IS_COLAB else ('Great Lakes' if IS_GREATLAKES else 'Local')
print(f'Environment : {env_name}')
print(f'Project root: {PROJECT_ROOT}')
print(f'Data dir    : {DATA_DIR}')
print(f'Batch size  : {BATCH_SIZE}')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Environment : Colab
Project root: /content/drive/MyDrive/SIADS 692 Milestone II/Milestone II - NLP Cryptic Crossword Clues
Data dir    : /content/drive/MyDrive/SIADS 692 Milestone II/Milestone II - NLP Cryptic Crossword Clues/data
Batch size  : 32
In [ ]:
# Set the global random seed for reproducibility.
# Every stochastic step (UMAP, HDBSCAN soft assignments) must use this seed
# so results match across runs — a hard requirement from CLAUDE.md.
np.random.seed(42)
print('Random seed set to 42.')
Random seed set to 42.

Section 1: Extract Definitions¶

What is a definition in a cryptic crossword clue?¶

Every cryptic crossword clue (CCC) has exactly three components:

  1. Indicator — a word or phrase signalling what wordplay operation to perform (e.g., scrambled, inside, sounds like). This is what Notebooks 02–05 embedded and clustered.
  2. Fodder — the raw letters that the wordplay operation acts on.
  3. Definition — a conventional synonym or near-synonym of the answer, placed at the start or end of the clue (e.g., "place to sleep" for the answer BED).

The key difference: indicators are drawn from a constrained semantic space shaped by wordplay conventions (motion words, containment words, sound words, etc.). Definitions are drawn from the open vocabulary of English — any topic is fair game.

If our clustering captures something specific to wordplay structure, definitions should cluster differently from indicators. This section extracts those definitions.

How Definitions Are Extracted¶

We use clues_raw.csv (Stage 0 output), which contains clue_id, answer, definition, and the full clue text for all 660,613 clues.

Following the same checksum verification approach that Victoria developed for indicators in 01b_data_cleaning.ipynb, we verify that each definition text appears word-for-word (using word boundaries) in the normalized clue surface before accepting it. This produces verified_definition_clues.csv — a set of (clue_id, definition) pairs we can trust. We then extract unique definition strings from that file for embedding.

Required Input File¶

Section 1 generates verified_definition_clues.csv directly from clues_raw.csv (Stage 0 output). No Stage 1 files are required — the verification logic runs here.

This cell raises a clear error if the file is missing rather than failing silently deeper in the notebook — a convention required by CLAUDE.md.

Expected output: [OK] and a confirmation message.

In [ ]:
required = {
    'clues_raw.csv': DATA_DIR / 'clues_raw.csv',
}

all_present = True
for name, path in required.items():
    status = 'OK     ' if path.exists() else 'MISSING'
    print(f'  [{status}] {name}')
    if not path.exists():
        all_present = False

if not all_present:
    raise FileNotFoundError(
        'Required input file missing.\n'
        'Run 00_data_extraction.ipynb first to produce clues_raw.csv.'
    )
print('\nRequired file present. Proceeding.')
  [OK     ] clues_raw.csv

Required file present. Proceeding.

Step 1: Generate verified_definition_clues.csv¶

We apply the same checksum verification approach Victoria developed for indicators in 01_data_cleaning.ipynb, now for definitions. Rather than trusting the definition field in the database to always correctly identify what was used as the definition in the actual clue, we verify that the definition text appears word-for-word in the normalized clue surface.

This produces verified_definition_clues.csv — a set of (clue_id, definition) pairs where we are confident the definition string genuinely appears in the clue.

The steps below mirror 01b_data_cleaning.ipynb exactly:

  1. Load clues_raw.csv and define a normalization function
  2. Normalize clue surfaces, answers, and definitions (remove accents, punctuation, lowercase)
  3. Extract and validate the answer letter-count format
  4. For each definition, verify it appears intact in the clue surface
  5. Export the verified (clue_id, definition) pairs
In [ ]:
df_clues = pd.read_csv(DATA_DIR / 'clues_raw.csv')
print(f'Loaded {len(df_clues):,} rows from clues_raw.csv')
print(f'Columns: {df_clues.columns.tolist()}')

def normalize(s: str) -> str:
    """Remove accents and punctuation, convert to lowercase.
    Keeps letters, digits, and spaces only — matches 01b_data_cleaning logic."""
    s_normalized = ''.join(
        ch for ch in unicodedata.normalize('NFD', s)
        if unicodedata.category(ch).startswith(('L', 'N', 'Zs'))
    ).lower()
    return s_normalized
Loaded 660,613 rows from clues_raw.csv
Columns: ['clue_id', 'clue', 'answer', 'definition', 'clue_number', 'puzzle_date', 'puzzle_name', 'source_url', 'source']

Normalize Clue Text, Answers, and Definitions¶

We create normalized (lowercase, no punctuation, no accents) versions of three fields:

  • surface: the clue text with the trailing (n) letter-count hint removed
  • surface_normalized: fully normalized surface, used for substring matching
  • answer_normalized: normalized answer string
  • definition_normalized: normalized definition — the key field checked against surface_normalized

We also compute req_ans_format and ans_format_valid (copied faithfully from NB 01b) even though they are not used directly in generating verified_definition_clues.csv.

Note: rows with a null definition are not dropped here (matching 01b behavior — they will simply produce no verified matches and be filtered out later).

Expected output: Row count after dropping null clue/answer rows.

In [ ]:
# Drop rows where clue, answer, or definition is missing; null definitions are kept (01b behavior)
df_clues.dropna(subset=['clue', 'answer', 'definition'], inplace=True)

# Strip trailing letter-count parentheses, e.g. "(5,2,3)" -> preserve surface only
df_clues['surface'] = df_clues['clue'].astype(str).apply(
    lambda x: re.sub(r'\s*\(\d+(?:[,\s-]+\d+)*\)$', '', x)
)

df_clues['surface_normalized']    = df_clues['surface'].astype(str).apply(normalize)
df_clues['answer_normalized']     = df_clues['answer'].astype(str).apply(normalize)
df_clues['definition_normalized'] = df_clues['definition'].astype(str).apply(normalize)

print(f'Rows after dropping null clue/answer: {len(df_clues):,}')
Rows after dropping null clue/answer: 510,886
In [ ]:
# Extract the required answer format from the parenthetical at the end of the clue
df_clues['req_ans_format'] = df_clues['clue'].astype(str).str.extract(
    r'\((\d+(?:[,\s-]+\d+)*)\)$'
)
df_clues['req_ans_letter_count'] = df_clues['req_ans_format'].apply(
    lambda x: sum(int(n) for n in re.findall(r'\d+', str(x))) if pd.notnull(x) else 0
)

def check_format_match(row):
    answer     = str(row['answer'])
    req_format = str(row['req_ans_format'])
    required_lengths = [int(n) for n in re.findall(r'\d+', req_format)]
    answer_segments  = re.findall(r'[a-zA-Z0-9]+', answer)
    answer_lengths   = [len(seg) for seg in answer_segments]
    return required_lengths == answer_lengths

df_clues['ans_format_valid'] = df_clues.apply(check_format_match, axis=1)
print(f'ans_format_valid: {df_clues["ans_format_valid"].sum():,} / {len(df_clues):,} rows')
ans_format_valid: 473,397 / 510,886 rows

Verify That Each Definition Appears in Its Clue Surface¶

This is the checksum verification step. For each (definition, clue_id) pair, we check whether the normalized definition text appears as intact words (using word boundaries \b) in the normalized clue surface.

This filter removes cases where the database's definition field does not match what actually appears in the clue — a known quality issue in parsed crossword data.

verify_clues(definition, clue_ids) returns only the clue_ids where the match holds. An empty return list means no verified match for that (definition, clue) pair, and those rows are dropped before export.

Expected output: Counts of unique definitions before and after verification.

In [ ]:
# Build a fast lookup dictionary: clue_id -> normalized surface
clue_lookup = df_clues.set_index('clue_id')['surface_normalized'].to_dict()

def verify_clues(definition, clue_ids):
    """Return subset of clue_ids where `definition` appears word-boundary-intact."""
    if not isinstance(clue_ids, list):
        clue_ids = []
    if not clue_ids:
        return []
    pattern = rf'\b{re.escape(str(definition))}\b'
    verified = []
    for cid in clue_ids:
        surface = clue_lookup.get(cid)
        if surface and re.search(pattern, surface):
            verified.append(cid)
    return verified
In [ ]:
# 1. Build definition -> clue_ids mapping
definition_to_clues = (
    df_clues.groupby('definition_normalized')['clue_id']
    .apply(list)
    .to_dict()
)

# 2. Verify once per unique definition
verified_map = {}

for definition, clue_ids in definition_to_clues.items():
    verified_map[definition] = verify_clues(definition, clue_ids)

# 3. Map results back
df_clues['clue_ids'] = df_clues['definition_normalized'].map(definition_to_clues)
df_clues['clue_ids_verified'] = df_clues['definition_normalized'].map(verified_map)
df_clues['num_clues_verified'] = df_clues['clue_ids_verified'].map(len)

# 4. Print summary
print(f'Unique definitions total   : {df_clues["definition_normalized"].nunique():,}')
print(f'Unique definitions verified: {df_clues[df_clues["num_clues_verified"] > 0]["definition_normalized"].nunique():,}')
Unique definitions total   : 229,604
Unique definitions verified: 196,636
In [ ]:
# Keep only rows where at least one clue verified the definition
df_export = df_clues[df_clues['num_clues_verified'] > 0].copy()
df_export['clue_id'] = df_export['clue_id'].astype(int)

# Drop rows where 'definition' is erroneously the entire clue.
df_export = df_export[df_export['surface_normalized']!=df_export['definition_normalized']]

# Drop rows with string value 'nan' as the definition value.
df_export['definition'] = df_export['definition'].replace('nan', np.nan)
df_export = df_export.dropna(subset=['definition'])

# Select and rename to the final two-column schema
df_export = (
    df_export[['clue_id', 'definition_normalized']]
    .rename(columns={'definition_normalized': 'definition'})
    .replace('nan', np.nan)
    .replace('', np.nan)
    .dropna(subset=['clue_id', 'definition'])
)

print(f'Unique clue_ids    : {df_export["clue_id"].nunique():,}')
print(f'Unique definitions : {df_export["definition"].nunique():,}')

df_export.to_csv(DATA_DIR / 'verified_definition_clues.csv', index=False)
print(f'\nSaved {len(df_export):,} rows to verified_definition_clues.csv')
Unique clue_ids    : 464,054
Unique definitions : 185,160

Saved 464,054 rows to verified_definition_clues.csv
In [ ]:
# Some definitions of exceptional length.
df_export.sort_values(by='definition', key=lambda col: col.str.len(), ascending=False).head(10)
Out[ ]:
clue_id definition
422689 422690 song message why not be our ruler one day ver...
139087 139088 what leading characters from harlequins and hu...
250678 250679 start to swiftly forget when heated romance br...
389901 389902 for him and four others here a fantastic route...
549309 549310 these decisions conveyed by sherlock holmes to...
504160 504161 woman instrumental in romes transition from a ...
512141 512142 a description of clocks having no similarities...
292202 292203 to recap with it roman somehow encapsulated h...
161880 161881 migrants originally one saw in fringes of camr...
240926 240927 thunderous and grey in conclusion cumulonimbus...
In [ ]:
# Example long-form definition that is correctly represented.
print(df_clues[df_clues['clue_id']==504161].surface_normalized.values)
print(df_clues[df_clues['clue_id']==504161].definition_normalized.values)
print(df_clues[df_clues['clue_id']==504161].answer.values)
['moving article about posh woman instrumental in romes transition from a monarchy to a republic']
['woman instrumental in romes transition from a monarchy to a republic']
['LUCRETIA']

Step 2: Load verified_definition_clues.csv and Extract Unique Definitions¶

With verified_definition_clues.csv now on disk, we load it and extract the unique definition strings. These unique strings are what we embed — one embedding per unique definition, matching how verified_indicators_unique.csv has one row per unique indicator in the main pipeline.

Loading from the saved file here (rather than using df_export directly) means this cell can also be run independently if the kernel is restarted after generation.

Expected output: Row counts, sample definitions, and unique string count.

In [ ]:
df_verified_defs = pd.read_csv(DATA_DIR / 'verified_definition_clues.csv')
print(f'verified_definition_clues.csv: {len(df_verified_defs):,} rows')
print(f'Unique clue_ids   : {df_verified_defs["clue_id"].nunique():,}')
print(f'Unique definitions: {df_verified_defs["definition"].nunique():,}')
print(f'\nSample rows:')
print(df_verified_defs.sample(5, random_state=42).to_string(index=False))

# Unique definition strings — preserving original case (matches indicator handling)
unique_def_strings = df_verified_defs['definition'].str.strip().unique()
df_definitions = pd.DataFrame({'definition': sorted(unique_def_strings)})
print(f'\nUnique definition strings for embedding: {len(df_definitions):,}')
verified_definition_clues.csv: 464,054 rows
Unique clue_ids   : 464,054
Unique definitions: 185,160

Sample rows:
 clue_id                definition
  131756                    finger
  614135                      fish
  442849                    floors
  129419 something to fall back on
  445486  accommodating of callers

Unique definition strings for embedding: 185,061

Save definitions_unique.csv¶

We save the deduplicated definition strings as definitions_unique.csv in the project data directory. This file is the sole input to Section 2 (embedding generation), matching how verified_indicators_unique.csv is the sole input to Notebook 02.

Expected output: Confirmation of file path and row count.

In [ ]:
defs_csv_path = DATA_DIR / 'definitions_unique.csv'
df_definitions.to_csv(defs_csv_path, index=False)

print(f'Saved {len(df_definitions):,} unique definitions to:')
print(f'  {defs_csv_path}')
print(f'\nLength distribution:')
lengths = df_definitions['definition'].str.split().str.len()
print(lengths.describe().to_string())
Saved 185,061 unique definitions to:
  /content/drive/MyDrive/SIADS 692 Milestone II/Milestone II - NLP Cryptic Crossword Clues/data/definitions_unique.csv

Length distribution:
count    185061.000000
mean          2.707632
std           1.285482
min           1.000000
25%           2.000000
50%           2.000000
75%           3.000000
max          16.000000

Section 2: Embed Definitions with BGE-M3¶

This section requires a GPU. Run on Great Lakes (gpu partition) or Colab with GPU enabled.

Why the Same Model Matters¶

We use BAAI/bge-m3 — exactly the model used in Notebook 02 for indicators. This is essential for a valid comparison: if we used a different model, any difference in clustering results could be due to the model rather than the content.

BGE-M3 produces 1024-dimensional dense embeddings and belongs to the CALE family (Concept-Aligned Language Embeddings) — models pretrained to distinguish word senses contextually. The settled decision from FINDINGS_AND_DECISIONS.md is to embed each phrase in isolation (not within its clue context). We apply the same rule here.

Section 2 outputs¶

  • embeddings_bge_m3_definitions.npy — shape (N_defs, 1024)
  • definition_index.csv — maps row number to definition string

Load the BGE-M3 Model¶

The first run downloads the model weights (~2.3 GB) from Hugging Face and caches them. Subsequent runs load from cache and are much faster.

Expected output: The model name and its embedding dimension (should be 1024).

In [ ]:
# Load the same BGE-M3 model used in NB 02 for indicators.
# First run: ~2.3 GB download. Subsequent runs: loads from local cache in seconds.
model = SentenceTransformer('BAAI/bge-m3')

print(f'Model loaded : BAAI/bge-m3')
print(f'Embedding dim: {model.get_sentence_embedding_dimension()}')

Encode All Unique Definitions¶

We pass every definition string through BGE-M3 in isolation — no surrounding clue context. This matches the approach taken for indicators in NB 02, which was settled in FINDINGS_AND_DECISIONS.md: embedding in context ties an embedding to a specific clue instance rather than to the phrase as a general token.

The show_progress_bar=True argument displays a tqdm progress bar. On a Great Lakes V100 or A40, this should complete in 2–5 minutes.

Expected output: Progress bar, then shape (N_defs, 1024) and memory usage.

In [ ]:
# Load definitions_unique.csv (works even if kernel was restarted after Section 1)
df_definitions = pd.read_csv(DATA_DIR / 'definitions_unique.csv')
definitions_list = df_definitions['definition'].tolist()
print(f'Encoding {len(definitions_list):,} definitions  (batch_size={BATCH_SIZE})...')

embeddings_defs = model.encode(
    definitions_list,
    batch_size=BATCH_SIZE,
    show_progress_bar=True
)

print(f'\nEmbeddings shape: {embeddings_defs.shape}')
print(f'Dtype           : {embeddings_defs.dtype}')
print(f'Memory          : {embeddings_defs.nbytes / 1024**2:.1f} MB')

Save Embeddings and Definition Index¶

We save two files that together form the contract for downstream sections:

  1. embeddings_bge_m3_definitions.npy — the embedding matrix, shape (N, 1024). Row i is the 1024-dim embedding of definition i.
  2. definition_index.csv — maps row number to definition string. Row i in the CSV corresponds to row i in the .npy file.

Downstream sections (UMAP, clustering) load these files rather than re-running the embedding model. This is the same contract used by NB 02 for indicators.

Expected output: File paths and a verification that reload shapes match.

In [ ]:
emb_path   = DATA_DIR / 'embeddings_bge_m3_definitions.npy'
index_path = DATA_DIR / 'definition_index.csv'

np.save(emb_path, embeddings_defs)
df_definitions.to_csv(index_path, index=True)  # integer row index is the key

print(f'Saved embeddings : {emb_path.name}  {embeddings_defs.shape}')
print(f'Saved index      : {index_path.name}  ({len(df_definitions):,} rows)')
Saved embeddings : embeddings_bge_m3_definitions.npy  (185061, 1024)
Saved index      : definition_index.csv  (185,061 rows)
In [ ]:
# Reload and verify: shapes must match, and embeddings must be non-zero
emb_check   = np.load(DATA_DIR / 'embeddings_bge_m3_definitions.npy')
index_check = pd.read_csv(DATA_DIR / 'definition_index.csv', index_col=0)

assert emb_check.shape[0] == len(index_check), \
    f'Row mismatch: embeddings={emb_check.shape[0]}, index={len(index_check)}'
assert emb_check.shape[1] == 1024, \
    f'Expected 1024 dimensions, got {emb_check.shape[1]}'

sample_norm = np.linalg.norm(emb_check[0])
print(f'Verification passed.')
print(f'Embeddings : {emb_check.shape}  (N definitions x 1024 dims)')
print(f'Index rows : {len(index_check):,}')
print(f'Row-0 L2 norm: {sample_norm:.4f}  (should be ~1 for normalised BGE-M3 output)')
Verification passed.
Embeddings : (185061, 1024)  (N definitions x 1024 dims)
Index rows : 185,061
Row-0 L2 norm: 1.0000  (should be ~1 for normalised BGE-M3 output)

Section 3: UMAP Dimensionality Reduction¶

Raw BGE-M3 embeddings are 1024-dimensional. Before clustering, we reduce them using UMAP (Uniform Manifold Approximation and Projection). UMAP preserves both local and global structure better than PCA for high-dimensional semantic embeddings — a settled decision confirmed in NB 03 (PCA's top component explained only ~4% of variance).

We produce two separate UMAP reductions, each optimised for a different purpose:

Reduction Dimensions Purpose
10D 10 Input to clustering algorithms (preserves structure; reduces noise)
2D 2 Scatter plots for visualisation only

Parameters — identical to NB 03 to ensure comparability:

  • n_neighbors=15 — balances local vs. global structure
  • min_dist=0.1 — allows tight clusters while preserving separation
  • metric='cosine' — cosine similarity is appropriate for normalised sentence embeddings
  • random_state=42 — reproducibility

Runtime note: UMAP on ~20K+ definitions at 1024 dims takes 5–20 minutes on CPU. Run this section on Great Lakes or Colab with GPU for best performance.

In [ ]:
# Load embeddings from file so this section can run independently
# (e.g., if the kernel was restarted after completing Section 2)
emb_path = DATA_DIR / 'embeddings_bge_m3_definitions.npy'
assert emb_path.exists(), (
    f'Missing: {emb_path}\nRun Section 2 first to generate definition embeddings.'
)
embeddings_defs = np.load(emb_path)
print(f'Loaded embeddings for UMAP: {embeddings_defs.shape}')
Loaded embeddings for UMAP: (185061, 1024)

10-Dimensional UMAP (for Clustering)¶

We reduce from 1024 to 10 dimensions. Why 10 and not 2 or 3? Two-dimensional UMAP distorts the embedding space to make it human-readable — it's great for plots but loses information needed for accurate clustering. Ten dimensions retain more of the geometric structure while still greatly reducing noise and computation time.

Expected output: Shape (N_defs, 10) and a min/max range check.

In [ ]:
# UMAP parameters are fixed to match NB 03 exactly
UMAP_PARAMS = dict(n_neighbors=15, min_dist=0.1, metric='cosine', random_state=42)

print('Fitting 10D UMAP (for clustering input)...')
print(f'Parameters: n_neighbors={UMAP_PARAMS["n_neighbors"]}, '
      f'min_dist={UMAP_PARAMS["min_dist"]}, metric={UMAP_PARAMS["metric"]}')

reducer_10d = umap.UMAP(n_components=10, **UMAP_PARAMS)
embeddings_umap_10d = reducer_10d.fit_transform(embeddings_defs)

print(f'\n10D UMAP complete. Shape: {embeddings_umap_10d.shape}')
print(f'Value range: [{embeddings_umap_10d.min():.3f}, {embeddings_umap_10d.max():.3f}]')
Fitting 10D UMAP (for clustering input)...
Parameters: n_neighbors=15, min_dist=0.1, metric=cosine
/usr/local/lib/python3.12/dist-packages/umap/umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(
10D UMAP complete. Shape: (185061, 10)
Value range: [-1.695, 11.346]

2-Dimensional UMAP (for Visualisation)¶

We fit a second, separate UMAP reduction to 2 dimensions for scatter plots. This is NOT used as input to any clustering algorithm — it is purely for visualisation in Section 5. Running the two UMAPs separately (rather than extracting 2D from the 10D) ensures each is optimised for its specific purpose.

Expected output: Shape (N_defs, 2).

In [ ]:
print('Fitting 2D UMAP (for visualisation only)...')

reducer_2d = umap.UMAP(n_components=2, **UMAP_PARAMS)
embeddings_umap_2d = reducer_2d.fit_transform(embeddings_defs)

print(f'2D UMAP complete. Shape: {embeddings_umap_2d.shape}')
Fitting 2D UMAP (for visualisation only)...
/usr/local/lib/python3.12/dist-packages/umap/umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(
2D UMAP complete. Shape: (185061, 2)
In [ ]:
# Save both reductions. File names are prefixed with "definitions_" to distinguish
# from the indicator UMAP files (embeddings_umap_10d.npy, embeddings_umap_2d.npy).
np.save(DATA_DIR / 'embeddings_umap_10d_definitions.npy', embeddings_umap_10d)
np.save(DATA_DIR / 'embeddings_umap_2d_definitions.npy',  embeddings_umap_2d)

print('Saved UMAP reductions:')
print(f'  embeddings_umap_10d_definitions.npy  {embeddings_umap_10d.shape}')
print(f'  embeddings_umap_2d_definitions.npy   {embeddings_umap_2d.shape}')
Saved UMAP reductions:
  embeddings_umap_10d_definitions.npy  (185061, 10)
  embeddings_umap_2d_definitions.npy   (185061, 2)

Section 4: Clustering¶

We run the same two methods used in NB 04 on indicators, at the same parameter settings. This ensures that any differences in results are due to the data, not the methodology.

Methods:

  1. HDBSCAN at eps=0.0 (the NB 04 baseline — most fine-grained run)
  2. Agglomerative clustering with Ward's linkage at k=8, k=10, and k=34 (the three key k values from NB 04: the number of labeled types, the local silhouette optimum, and the mid-range granularity)

Metrics computed for each run:

  • Silhouette score [-1, 1]: how well-separated clusters are from each other (higher is better; for HDBSCAN, computed on non-noise points only)
  • Davies-Bouldin index [0, ∞): average similarity of each cluster to its most similar neighbour (lower is better)

All cluster label assignments are saved as CSVs prefixed with definitions_.

In [ ]:
# Load UMAP outputs and definition names from file.
# This makes Section 4 independently runnable after Sections 2–3 have completed.
embeddings_umap_10d = np.load(DATA_DIR / 'embeddings_umap_10d_definitions.npy')
embeddings_umap_2d  = np.load(DATA_DIR / 'embeddings_umap_2d_definitions.npy')
df_def_index = pd.read_csv(DATA_DIR / 'definition_index.csv', index_col=0)
definition_names = df_def_index['definition'].values  # aligned with UMAP rows

print(f'10D UMAP  : {embeddings_umap_10d.shape}')
print(f'2D UMAP   : {embeddings_umap_2d.shape}')
print(f'Definitions: {len(definition_names):,}')
10D UMAP  : (185061, 10)
2D UMAP   : (185061, 2)
Definitions: 185,061

HDBSCAN at eps=0.0 (Baseline Run)¶

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) automatically determines the number of clusters based on data density. It assigns a label of -1 to noise points — points that do not belong to any dense region.

We use eps=0.0 (no cluster merging threshold), which matches the NB 04 baseline run that produced 282 clusters for indicators. At this setting, HDBSCAN finds the finest possible cluster structure — likely morphological variant groups.

min_cluster_size=10 means any dense region with fewer than 10 points is treated as noise rather than a cluster. This prevents single-word outliers from forming spurious micro-clusters.

Expected output: Cluster count, noise percentage, and metrics. Compare to the indicator baseline: 282 clusters, 33.4% noise, silhouette=0.631.

In [ ]:
print('Running HDBSCAN  eps=0.0  min_cluster_size=10 ...')

clusterer_hdbscan = hdbscan.HDBSCAN(
    min_cluster_size=10,
    cluster_selection_epsilon=0.0,  # no merging — finds finest structure
    allow_single_cluster=False
)
labels_hdbscan = clusterer_hdbscan.fit_predict(embeddings_umap_10d)

n_clusters_h = len(set(labels_hdbscan)) - (1 if -1 in labels_hdbscan else 0)
n_noise_h    = int(np.sum(labels_hdbscan == -1))
noise_pct_h  = 100 * n_noise_h / len(labels_hdbscan)
print(f'Clusters    : {n_clusters_h}')
print(f'Noise points: {n_noise_h:,}  ({noise_pct_h:.1f}%)')
Running HDBSCAN  eps=0.0  min_cluster_size=10 ...
Clusters    : 2726
Noise points: 81,382  (44.0%)
In [ ]:
# Silhouette and Davies-Bouldin are computed on non-noise points only.
# Including noise points (label=-1) would unfairly penalise the score.
mask_h = labels_hdbscan != -1
sil_h, db_h = np.nan, np.nan

if len(set(labels_hdbscan[mask_h])) >= 2:
    sil_h = silhouette_score(embeddings_umap_10d[mask_h], labels_hdbscan[mask_h])
    db_h  = davies_bouldin_score(embeddings_umap_10d[mask_h], labels_hdbscan[mask_h])
else:
    print('Warning: fewer than 2 clusters found — metrics undefined.')

print(f'Metrics (on {mask_h.sum():,} non-noise points):')
print(f'  Silhouette     : {sil_h:.4f}')
print(f'  Davies-Bouldin : {db_h:.4f}')

# Save cluster label assignments
hdbscan_out = DATA_DIR / 'definitions_cluster_labels_hdbscan_eps_0p0.csv'
pd.DataFrame({'definition': definition_names, 'cluster': labels_hdbscan}).to_csv(
    hdbscan_out, index=False
)
print(f'\nSaved: {hdbscan_out.name}')
Metrics (on 103,679 non-noise points):
  Silhouette     : 0.5943
  Davies-Bouldin : 0.5202

Saved: definitions_cluster_labels_hdbscan_eps_0p0.csv

Agglomerative Clustering with Ward's Linkage¶

Agglomerative clustering builds a hierarchy by starting with every point as its own cluster and repeatedly merging the two closest clusters until k clusters remain. Ward's linkage minimises the total within-cluster variance at each merge step, producing compact, roughly spherical clusters (per KCT's guidance from Feb 15).

Unlike HDBSCAN, agglomerative clustering assigns every point to a cluster — there are no noise points. This means silhouette and Davies-Bouldin are computed on all points.

We test k=8, 10, 34 — the same k values highlighted in NB 04:

  • k=8 matches the number of wordplay types in the Ho dataset
  • k=10 is the local silhouette optimum for indicators (the most informative comparison)
  • k=34 is a mid-range granularity that emerged as interesting in NB 04

Expected output: Silhouette and DB scores for each k, with CSVs saved.

In [ ]:
# ============================================================
# Downsampling Definition Embeddings
# ============================================================
#
# Rationale:
# Ward's agglomerative clustering requires O(n^2) memory.
# With 185,061 embeddings, this is computationally infeasible.
#
# To maintain symmetry with the ~12k indicator embeddings used
# in prior stages while preserving representativeness, we
# randomly downsample a fixed percentage of the full dataset.
#
# The sampling is:
# - Random
# - Reproducible (seed=42)
# - Without replacement
#
# This preserves the global semantic distribution while making
# Ward clustering tractable.
# ============================================================

n_total = embeddings_umap_10d.shape[0]
print(f"Total definitions: {n_total:,}")

target_percent = 0.55   # max size possible for running on Great Lakes cluster
target_size = int(n_total * target_percent)
print(f"Target percent: {target_percent*100:.1f}%")
print(f"Target sample size: {target_size:,}")

indices = np.random.choice(
    n_total,
    size=target_size,
    replace=False
)

embeddings_umap_10d_sample = embeddings_umap_10d[indices]
embeddings_umap_2d_sample = embeddings_umap_2d[indices]
definition_names_sample = definition_names[indices]
print(f"Downsampled shape: {embeddings_umap_10d_sample.shape}")

# Save indices for reproducibility
np.save(DATA_DIR / 'definition_downsample_indices.npy', indices)
print("Downsampling complete.")
Total definitions: 185,061
Target percent: 55.0%
Target sample size: 101,783
Downsampled shape: (101783, 10)
Downsampling complete.
In [ ]:
K_VALUES = [8, 10, 34]
agglo_metrics = []

for k in K_VALUES:
    clusterer = AgglomerativeClustering(n_clusters=k, linkage='ward')
    labels_k  = clusterer.fit_predict(embeddings_umap_10d_sample)

    sil = silhouette_score(embeddings_umap_10d_sample, labels_k)
    db  = davies_bouldin_score(embeddings_umap_10d_sample, labels_k)
    agglo_metrics.append({'method': 'Agglomerative', 'k': k,
                          'n_clusters': k, 'silhouette': round(sil, 4),
                          'davies_bouldin': round(db, 4)})

    # Save label assignments to CSV
    out_path = DATA_DIR / f'definitions_cluster_labels_agglo_k{k}.csv'
    pd.DataFrame({'definition': definition_names_sample, 'cluster': labels_k}).to_csv(
        out_path, index=False
    )
    print(f'  k={k:>3}: silhouette={sil:.4f}  DB={db:.4f}  → saved {out_path.name}')
  k=  8: silhouette=0.1640  DB=1.5789  → saved definitions_cluster_labels_agglo_k8.csv
  k= 10: silhouette=0.1718  DB=1.4916  → saved definitions_cluster_labels_agglo_k10.csv
  k= 34: silhouette=0.1853  DB=1.5083  → saved definitions_cluster_labels_agglo_k34.csv

Compile and Save the Definitions Metrics Summary¶

We collect all clustering metrics into a single DataFrame and save it as definitions_clustering_metrics.csv. This file mirrors the structure of clustering_metrics_summary.csv from NB 04, making the comparison in Section 5 straightforward.

Expected output: A printed metrics table with one row per clustering run.

In [ ]:
hdbscan_row = {
    'method': 'HDBSCAN', 'k': float('nan'),
    'n_clusters': n_clusters_h, 'n_noise': n_noise_h,
    'noise_pct': round(noise_pct_h, 2),
    'silhouette': round(sil_h, 4) if not np.isnan(sil_h) else float('nan'),
    'davies_bouldin': round(db_h, 4) if not np.isnan(db_h) else float('nan'),
}

df_def_metrics = pd.DataFrame([hdbscan_row] + agglo_metrics)
df_def_metrics['n_noise']   = df_def_metrics.get('n_noise', pd.NA)
df_def_metrics['noise_pct'] = df_def_metrics.get('noise_pct', pd.NA)

metrics_out = OUTPUT_DIR / 'definitions_clustering_metrics.csv'
df_def_metrics.to_csv(metrics_out, index=False)

print('Definitions clustering metrics:')
print(df_def_metrics.to_string(index=False))
print(f'\nSaved to: {metrics_out.name}')
Definitions clustering metrics:
       method    k  n_clusters  n_noise  noise_pct  silhouette  davies_bouldin
      HDBSCAN  NaN        2726  81382.0      43.98      0.5943          0.5202
Agglomerative  8.0           8      NaN        NaN      0.1640          1.5789
Agglomerative 10.0          10      NaN        NaN      0.1718          1.4916
Agglomerative 34.0          34      NaN        NaN      0.1853          1.5083

Saved to: definitions_clustering_metrics.csv

Section 5: Comparison to Indicator Results¶

This is the core of the control experiment. We load the indicator clustering metrics from NB 04 and compare them side-by-side with the definition metrics computed above.

Three forms of comparison:

  1. Numeric metrics table — silhouette and Davies-Bouldin for indicators vs. definitions at matched k values
  2. UMAP scatter plot — definition clusters at k=10 coloured by cluster assignment, to visualise the geometric structure
  3. Qualitative inspection — centroid-nearest definitions per cluster, to understand the principle of organisation (topic vs. conceptual metaphor)

The most important finding is not just whether scores differ numerically, but whether the clusters are organised by different principles.

Load Indicator Metrics from NB 04¶

We load clustering_metrics_summary.csv produced by Notebook 04. If this file does not yet exist (because NB 04 has not been run to completion), we proceed with definitions-only analysis and print a clear warning.

Expected output: The indicator metrics table, or a warning if the file is missing.

In [ ]:
ind_metrics_path = DATA_DIR / 'clustering_metrics_summary.csv'
df_ind_metrics   = None

if ind_metrics_path.exists():
    df_ind_metrics = pd.read_csv(ind_metrics_path)
    print(f'Loaded indicator metrics: {df_ind_metrics.shape}')
    print(f'Columns: {df_ind_metrics.columns.tolist()}')
    print()
    print(df_ind_metrics.to_string(index=False))
else:
    print('WARNING: clustering_metrics_summary.csv not found.')
    print(f'Expected at: {ind_metrics_path}')
    print('Run 04_clustering.ipynb first to generate indicator metrics.')
    print('Proceeding with definitions-only analysis.')
Loaded indicator metrics: (30, 8)
Columns: ['method', 'parameters', 'n_clusters', 'n_noise', 'noise_pct', 'silhouette', 'davies_bouldin', 'calinski_harabasz']

              method                      parameters  n_clusters  n_noise  noise_pct  silhouette  davies_bouldin  calinski_harabasz
             HDBSCAN    min_cluster_size=10, eps=0.0         282     4212  33.370306    0.630992        0.470028                NaN
             HDBSCAN  min_cluster_size=10, eps=0.214         244     3783  29.971478    0.584483        0.508532                NaN
             HDBSCAN  min_cluster_size=10, eps=0.428          62     1306  10.347013   -0.117756        1.047558                NaN
             HDBSCAN  min_cluster_size=10, eps=0.642          17      313   2.479797   -0.296464        0.976518                NaN
             HDBSCAN min_cluster_size=10, eps=0.7788          11      114   0.903185   -0.185984        0.775157                NaN
             HDBSCAN  min_cluster_size=10, eps=0.856          10       28   0.221835   -0.167519        0.797766                NaN
             HDBSCAN   min_cluster_size=10, eps=1.07           6        0   0.000000   -0.120162        0.782393                NaN
             HDBSCAN  min_cluster_size=10, eps=1.284           6        0   0.000000   -0.120162        0.782393                NaN
             HDBSCAN  min_cluster_size=10, eps=1.498           4        0   0.000000    0.230049        0.549086                NaN
             HDBSCAN min_cluster_size=10, eps=1.9327           4        0   0.000000    0.230049        0.549086                NaN
             HDBSCAN min_cluster_size=10, eps=2.2334           3        0   0.000000    0.385651        0.461465                NaN
             HDBSCAN min_cluster_size=10, eps=2.4729           3        0   0.000000    0.385651        0.461465                NaN
             HDBSCAN min_cluster_size=10, eps=2.6847           3        0   0.000000    0.385651        0.461465                NaN
Agglomerative (Ward)                             k=4           4        0   0.000000    0.246021        1.456096        3909.220873
Agglomerative (Ward)                             k=6           6        0   0.000000    0.259034        1.322581        3920.874006
Agglomerative (Ward)                             k=8           8        0   0.000000    0.272435        1.267374        3897.154308
Agglomerative (Ward)                             k=9           9        0   0.000000    0.288711        1.184741        3844.271002
Agglomerative (Ward)                            k=10          10        0   0.000000    0.298516        1.163678        3788.683596
Agglomerative (Ward)                            k=11          11        0   0.000000    0.281040        1.184305        3654.531355
Agglomerative (Ward)                            k=12          12        0   0.000000    0.281408        1.225761        3552.264089
Agglomerative (Ward)                            k=16          16        0   0.000000    0.278622        1.274775        3319.894167
Agglomerative (Ward)                            k=20          20        0   0.000000    0.295055        1.193753        3203.470119
Agglomerative (Ward)                            k=26          26        0   0.000000    0.304119        1.172194        3072.139975
Agglomerative (Ward)                            k=34          34        0   0.000000    0.321995        1.067513        2992.948844
Agglomerative (Ward)                            k=50          50        0   0.000000    0.343506        1.010616        2944.236720
Agglomerative (Ward)                            k=75          75        0   0.000000    0.370112        1.002773        2838.968311
Agglomerative (Ward)                           k=100         100        0   0.000000    0.377600        0.978082        2786.891253
Agglomerative (Ward)                           k=150         150        0   0.000000    0.388251        0.959118        2734.306732
Agglomerative (Ward)                           k=200         200        0   0.000000    0.418647        0.890381        2744.780282
Agglomerative (Ward)                           k=250         250        0   0.000000    0.431290        0.883640        2801.724667

Side-by-Side Metrics Comparison¶

We extract the agglomerative runs at k=8, 10, 34 from the indicator metrics and pair them with the corresponding definition runs.

How to read the table:

  • ind_silhouette — silhouette score for indicator clustering at that k
  • def_silhouette — silhouette score for definition clustering at that k
  • sil_diff = indicator − definition (positive means indicators cluster better)
  • For Davies-Bouldin, lower is better; a positive db_diff means definitions have worse (higher) DB than indicators

Expected output: A 3-row comparison table (one row per k value).

In [ ]:
if df_ind_metrics is not None:
    # Locate the agglomerative rows — handle varying column naming conventions
    method_col = 'method' if 'method' in df_ind_metrics.columns else None
    k_col      = 'k' if 'k' in df_ind_metrics.columns else 'n_clusters'

    mask_agglo = (
        df_ind_metrics[method_col].str.lower().str.contains('agglo', na=False)
        if method_col else pd.Series([True] * len(df_ind_metrics))
    )
    ind_agglo = (
        df_ind_metrics[mask_agglo & df_ind_metrics[k_col].isin(K_VALUES)]
        [[k_col, 'silhouette', 'davies_bouldin']]
        .rename(columns={'silhouette': 'ind_sil', 'davies_bouldin': 'ind_db', k_col: 'k'})
    )
    def_agglo = (
        df_def_metrics[df_def_metrics['method'] == 'Agglomerative']
        [['k', 'silhouette', 'davies_bouldin']]
        .rename(columns={'silhouette': 'def_sil', 'davies_bouldin': 'def_db'})
    )
    cmp = pd.merge(ind_agglo.astype({'k': int}), def_agglo.astype({'k': int}), on='k')
    cmp['sil_diff'] = (cmp['ind_sil'] - cmp['def_sil']).round(4)
    cmp['db_diff']  = (cmp['def_db'] - cmp['ind_db']).round(4)
    print('=== Indicator vs. Definition Clustering (Agglomerative Ward\'s) ===')
    print('(sil_diff > 0 → indicators cluster better; db_diff > 0 → definitions worse)')
    print(cmp.to_string(index=False))
else:
    print('Indicator metrics unavailable — run 04_clustering.ipynb first.')
=== Indicator vs. Definition Clustering (Agglomerative Ward's) ===
(sil_diff > 0 → indicators cluster better; db_diff > 0 → definitions worse)
 k  ind_sil   ind_db  def_sil  def_db  sil_diff  db_diff
 8 0.272435 1.267374   0.1640  1.5789    0.1084   0.3115
10 0.298516 1.163678   0.1718  1.4916    0.1267   0.3279
34 0.321995 1.067513   0.1853  1.5083    0.1367   0.4408

UMAP Scatter Plot of Definition Clusters (k=10)¶

We plot the 2D UMAP embedding of definitions, coloured by their agglomerative cluster assignment at k=10. The local silhouette optimum at k=10 for indicators was one of the most interpretable results in NB 04, so matching this k makes the visual comparison most meaningful.

What to look for: Do the definition clusters form compact, well-separated blobs (similar to what indicator clusters look like at k=10)? Or are they more diffuse and intermixed? The geometry often tells a different story than the metrics alone.

Expected output: A scatter plot saved to figures/definitions_umap_agglo_k10.png.

In [ ]:
labels_k10 = pd.read_csv(
    DATA_DIR / 'definitions_cluster_labels_agglo_k10.csv'
)['cluster'].values

fig, ax = plt.subplots(figsize=(10, 7))
scatter = ax.scatter(
    embeddings_umap_2d_sample[:, 0], embeddings_umap_2d_sample[:, 1],
    c=labels_k10, cmap='tab10', s=4, alpha=0.55
)
plt.colorbar(scatter, ax=ax, label='Cluster')
ax.set_title('Definition Embeddings — UMAP 2D, Agglomerative Ward k=10', fontsize=13)
ax.set_xlabel('UMAP Dimension 1')
ax.set_ylabel('UMAP Dimension 2')
plt.tight_layout()

fig_path = FIGURES_DIR / 'definitions_umap_agglo_k10.png'
plt.savefig(fig_path, dpi=150)
plt.show()
print(f'Saved: {fig_path}')
Saved: /home/nycantwe/ccc_project/outputs/figures/definitions_umap_agglo_k10.png

Qualitative Inspection: Centroid-Nearest Definitions¶

Metrics tell us how well things cluster; qualitative inspection tells us by what principle. For each cluster we find the 5 definitions closest to the cluster centroid in 10D UMAP space. These centroid-nearest examples are the most representative definitions for their cluster.

What to look for:

  • Topic-based clustering (expected for definitions): clusters centred on words from the same semantic domain — e.g., all animals, all places, all emotions
  • Metaphor-based clustering (what we saw for indicators): clusters organised around a conceptual theme like disorder, containment, or auditory perception

If definitions cluster by topic and indicators cluster by metaphor, that is evidence that indicator clustering captures something specific to the wordplay domain.

Expected output: 10 cluster summaries, each with 5 representative definitions.

In [ ]:
K_INSPECT = 10
labels_inspect = pd.read_csv(
    DATA_DIR / f'definitions_cluster_labels_agglo_k{K_INSPECT}.csv'
)['cluster'].values

print(f'=== Centroid-Nearest Definitions per Cluster (k={K_INSPECT}) ===\n')
for cluster_id in range(K_INSPECT):
    mask_c    = labels_inspect == cluster_id
    centroid  = embeddings_umap_10d_sample[mask_c].mean(axis=0)
    dists     = np.linalg.norm(embeddings_umap_10d_sample[mask_c] - centroid, axis=1)
    top5_idx  = np.argsort(dists)[:5]
    top5_defs = definition_names_sample[mask_c][top5_idx]
    n_in_clust = mask_c.sum()
    print(f'Cluster {cluster_id}  (n={n_in_clust:,}):')
    for d in top5_defs:
        print(f'    · {d}')
    print()
=== Centroid-Nearest Definitions per Cluster (k=10) ===

Cluster 0  (n=18,799):
    · they ruminate
    · they may be high
    · rays
    · this bloke
    · get cracking with this

Cluster 1  (n=9,570):
    · with measles
    · inyourface
    · it could be critical
    · what faces
    · get bent

Cluster 2  (n=16,473):
    · aloof
    · to pry
    · ones highmaintenance
    · the most bent
    · faced

Cluster 3  (n=13,133):
    · entanglements
    · confab
    · commie
    · ray
    · ones peaked

Cluster 4  (n=13,347):
    · be
    · comb
    · hammer this
    · giddyup
    · potential share

Cluster 5  (n=4,666):
    · that might get trump high
    · a ray
    · they may be wearing rings
    · as shown by face
    · they are balmy

Cluster 6  (n=9,299):
    · compere
    · on the dot
    · entrenched peak
    · enticed
    · enfeebled

Cluster 7  (n=5,121):
    · have an inkling of
    · facet
    · one may generate
    · one on top in something vulgar
    · it could be dramatic

Cluster 8  (n=5,785):
    · tendency to pry
    · compel
    · encumbered
    · they might be tall
    · it may create space

Cluster 9  (n=5,590):
    · be in cahoots
    · another bloke
    · crowning point
    · a tyrants thing
    · he could give you a ring


Section 6: Interpretation¶

Fill in this section after running the notebook and reviewing the outputs. The template below guides the analysis; replace the bracketed placeholders.


6.1 Cluster Quality Comparison¶

At the matched granularity of k=10 (the local optimum for indicators in NB 04):

Metric Indicators (NB 04) Definitions (this notebook) Interpretation
Silhouette [value from NB 04] [value from Section 4] Higher = more compact clusters
Davies-Bouldin [value from NB 04] [value from Section 4] Lower = more separated clusters

Interpretation: [Did definitions score higher, lower, or similar to indicators? Note that a higher silhouette for definitions does not necessarily mean definitions have "better" structure — it may just mean topically similar words are more homogeneous than wordplay-indicator vocabulary.]


6.2 Cluster Organisation Principle¶

From the centroid-nearest inspection in Section 5:

  • Are definition clusters organised by topic? (e.g., animals, geography, body parts, occupations) — [Yes / No / Partially — describe what you see]
  • Are indicator clusters organised by conceptual metaphor? (disorder, containment, auditory, directionality) — [Reference NB 04/05 findings]

If the two types of clusters are organised by different principles, that is the most meaningful finding regardless of whether the numeric metrics differ.


6.3 What This Tells Us About Indicator Clustering¶

If definitions cluster WORSE than indicators (lower silhouette, higher DB): This strengthens the claim that indicator clustering detects wordplay-specific structure. Wordplay indicators have a more constrained semantic vocabulary than definitions, which is why they form tighter clusters.

If definitions cluster SIMILARLY to indicators: BGE-M3 and UMAP find semantic structure in both. This does not invalidate the indicator findings but suggests the clustering is detecting general-purpose semantic similarity rather than wordplay-specific organisation. The qualitative difference (topic vs. metaphor) then becomes the key argument.

If definitions cluster BETTER than indicators: Topic vocabulary (animals, places, emotions) may be more cohesive than the cross-type indicator vocabulary. The indicators' value lies not in compact clustering but in the interpretability of clusters as wordplay metaphors — point to the ARI contrasts from NB 05 (homophone/reversal ARI=0.611 vs. hidden/container/insertion ARI=0.045) as evidence that indicator structure tracks wordplay distinctions.


6.4 For the Report¶

Cite this notebook as evidence for or against the specificity of indicator clustering. Key points to include in the report:

  1. The definitions experiment is a null hypothesis baseline: if random English phrases cluster as well as indicators, our indicator clusters do not reflect wordplay-specific structure.
  2. Even if metrics are similar, the qualitative character of clusters (topic vs. conceptual metaphor) is a meaningful distinction worth reporting.
  3. Pair this result with the ARI contrast from NB 05 to make the argument: indicator clusters that align with theoretically distinct types (homophone, reversal) are unlikely to emerge from arbitrary text.

Output File Summary¶

File Location Description
verified_definition_clues.csv data/ Verified (clue_id, definition) pairs
definitions_unique.csv data/ Unique definition strings for embedding
embeddings_bge_m3_definitions.npy data/ BGE-M3 embeddings (N × 1024)
definition_index.csv data/ Row number → definition string
embeddings_umap_10d_definitions.npy data/ 10D UMAP for clustering
embeddings_umap_2d_definitions.npy data/ 2D UMAP for visualisation
definitions_cluster_labels_hdbscan_eps_0p0.csv data/ HDBSCAN labels
definition_downsample_indices.npy data/ Definition sample indices for agglomerative clustering
definitions_cluster_labels_agglo_k8.csv data/ Agglomerative k=8 labels
definitions_cluster_labels_agglo_k10.csv data/ Agglomerative k=10 labels
definitions_cluster_labels_agglo_k34.csv data/ Agglomerative k=34 labels
definitions_clustering_metrics.csv outputs/ Metrics summary (all runs)
definitions_umap_agglo_k10.png outputs/figures/ Cluster scatter plot
In [ ]: