University of Michigan MADS Milestone II Project — SIADS 696, Winter 2026
Team: Victoria Winters, Sahana Sundar, Nathan Cantwell, Hans Li
Faculty Advisor: Dr. Kevyn Collins-Thompson
This project applies natural language processing to cryptic crossword clues (CCCs), a domain that poses unique challenges for language models due to deliberate semantic misdirection and strict hidden grammatical structure.
The project has two independent components that investigate complementary aspects of CCC language:
### [`indicator_clustering/`](indicator_clustering/)

Unsupervised clustering of 12,622 unique CCC indicator words and phrases to explore whether their semantic embeddings naturally reflect the structure of CCC wordplay types. Uses BGE-M3 embeddings (1024-dim) with UMAP dimensionality reduction, HDBSCAN, and agglomerative clustering.
Key findings:
### [`clue_misdirection/`](clue_misdirection/)

Supervised learning experiments quantifying how much the surface text of a cryptic clue misleads embedding-based models attempting to connect a definition to its answer. Uses CALE-MBERT-en embeddings with `<t></t>` target-word delimiters, retrieval analysis, and binary classification with 47 engineered features.
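The two core mechanics here — wrapping the target word in `<t></t>` delimiters and ranking candidate answers by cosine similarity — can be sketched with hypothetical helpers (`mark_target` and `rank_answers` are illustrative names, not functions from the repo; the real pipeline encodes the marked text with CALE-MBERT-en via `sentence-transformers`):

```python
# Hypothetical helpers illustrating the <t></t> delimiter convention and
# cosine-similarity retrieval used in the clue_misdirection component.
import numpy as np

def mark_target(text: str, target: str) -> str:
    """Wrap the first occurrence of the target word in <t></t> delimiters."""
    return text.replace(target, f"<t>{target}</t>", 1)

def rank_answers(query_vec: np.ndarray, answer_vecs: np.ndarray) -> np.ndarray:
    """Return answer indices ordered by descending cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    A = answer_vecs / np.linalg.norm(answer_vecs, axis=1, keepdims=True)
    return np.argsort(A @ q)[::-1]

print(mark_target("surface text of a clue", "clue"))
```

Retrieval rank of the true answer then serves as a per-clue measure of how much the surface text pulls the embedding away from the definition.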
Key findings:
Both components use George Ho’s CCC dataset (660,613 clues), available under the Open Database License (ODbL v1.0):
Place `data.sqlite3` in `indicator_clustering/data/` before running the notebooks. See `DATA_LICENSE` for full attribution and license details.
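A quick way to get oriented in the downloaded SQLite file is to list its tables before writing any extraction queries. `list_tables` below is a hypothetical helper, not repo code, and it assumes nothing about the dataset's schema — it discovers table names from `sqlite_master`:

```python
# Hypothetical helper for a first look at the SQLite dump; table names are
# discovered from sqlite_master rather than assumed.
import sqlite3

def list_tables(db_path: str) -> list[str]:
    """Return table names in a SQLite file, e.g. the downloaded data.sqlite3."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        )
        return [name for (name,) in rows]

# Usage: list_tables("indicator_clustering/data/data.sqlite3")
```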
```
ccc-project/
  indicator_clustering/         # Unsupervised clustering component
    notebooks/                  # Pipeline notebooks 00–07 (run in order)
    archive/                    # Superseded and exploratory notebooks
    data/                       # Data files (see setup above)
    outputs/                    # Metrics CSVs and generated figures
    figures/report/             # Publication-quality figures
    docs/                       # Rendered HTML notebooks (GitHub Pages)
    README.md                   # Index of rendered notebooks
    CLAUDE.md                   # Claude Code project configuration
    PROJECT_OVERVIEW.md         # Research context and task definitions
    DOMAIN_KNOWLEDGE.md         # CCC wordplay taxonomy
    FINDINGS_AND_DECISIONS.md   # Empirical results and advisor guidance
    OPEN_QUESTIONS.md           # Unresolved decisions
  clue_misdirection/            # Supervised learning component
    notebooks/                  # Pipeline notebooks 00–08 (run in order)
    archive/                    # Superseded and exploratory notebooks
    scripts/                    # Python and shell scripts for GPU jobs
    data/                       # Data files and embeddings (~1.8 GB)
    outputs/                    # Results CSVs and generated figures
    figures/                    # Retrieval, importance, and evaluation plots
    docs/                       # Rendered HTML notebooks (GitHub Pages)
    README.md                   # Index of rendered notebooks
    CLAUDE.md                   # Claude Code project configuration
    PLAN.md                     # 12-step pipeline plan
    FINDINGS.md                 # Research findings log
    DECISIONS.md                # Key decisions and rationale
    DATA.md                     # Data dictionary and schemas
    NOTEBOOKS.md                # Notebook descriptions and purposes
    requirements.txt            # Component-specific dependencies
  README.md                     # This file
  requirements.txt              # Python dependencies (CPU/analysis)
  LICENSE                       # MIT License
  DATA_LICENSE                  # ODbL v1.0 for derived datasets
```
Rendered notebooks with full outputs (figures, tables, metrics) are available via GitHub Pages:
https://vwintumich.github.io/ccc-project/
Each component’s notebooks form a sequential pipeline — run them in numerical order. Later notebooks depend on outputs from earlier stages.
**`indicator_clustering/` notebooks:**

| Stage | Notebook | Environment |
|---|---|---|
| 0 | 00_data_extraction | Local |
| 1 | 01_data_cleaning | Local |
| 2 | 02_embedding_generation | GPU (Great Lakes / Colab) |
| 3 | 03_dimensionality_reduction | GPU (Great Lakes / Colab) |
| 4 | 04_clustering | Local |
| 5 | 05_constrained_and_targeted | Local |
| 6 | 06_evaluation_and_figures | Local |
| 7 | 07_definitions_control | GPU for Section 2; Local otherwise |
**`clue_misdirection/` notebooks:**

| Stage | Notebook | Environment |
|---|---|---|
| 0 | 00_model_comparison | Local |
| 1 | 01_data_cleaning | Local |
| 2 | 02_embedding_generation | GPU (Great Lakes / Colab) |
| 3 | 03_feature_engineering | Local |
| 4 | 04_retrieval_analysis | Local |
| 5 | 05_dataset_construction | Local |
| 6 | 06_experiments_easy | Local (or Great Lakes for full data) |
| 7 | 07_experiments_harder | Local (or Great Lakes for full data) |
| 8 | 08_results_and_evaluation | Local |
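The shape of the classification experiments (stages 6–7) can be sketched roughly as below. The data is synthetic — random features standing in for the 47 engineered ones — and logistic regression is used only as an illustrative baseline, not a claim about which models the notebooks actually run:

```python
# Illustrative sketch of the binary-classification stage: synthetic features
# stand in for the 47 engineered misdirection features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 47))                     # 47 stand-in features
# Synthetic label driven by two features plus noise
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(round(auc, 3))
```

Held-out AUC on the real features, compared across the "easy" and "harder" dataset constructions, is what stage 8 evaluates.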
Embedding generation notebooks require a GPU and `sentence-transformers` /
`torch`. They are designed to run on the University of Michigan Great Lakes
cluster or Google Colab with a GPU runtime (Runtime > Change runtime type >
T4 GPU). Generated embedding files are not included in this repository due to
their size and must be produced by running the relevant notebooks.
CPU/analysis dependencies (both components):

```bash
pip install -r requirements.txt
```
GPU/embedding dependencies (`clue_misdirection/`):

```bash
pip install -r clue_misdirection/requirements.txt
```
This installs `sentence-transformers`, `torch`, and `pyarrow` in addition to
the base dependencies. On Great Lakes and Colab, `torch` with CUDA support is
pre-installed.
Key libraries: `scikit-learn`, `hdbscan`, `umap-learn`, `pandas`, `numpy`, `nltk`,
`matplotlib`, `seaborn`. The indicator clustering component uses BGE-M3
(`BAAI/bge-m3`) and the clue misdirection component uses CALE-MBERT-en
(`oskar-h/cale-modernbert-base`); both are downloaded automatically by
`sentence-transformers` on first use.