ccc-project

Exploring Wordplay and Misdirection in Cryptic Crossword Clues

University of Michigan MADS Milestone II Project — SIADS 696, Winter 2026

Team: Victoria Winters, Sahana Sundar, Nathan Cantwell, Hans Li

Faculty Advisor: Dr. Kevyn Collins-Thompson

Project Overview

This project applies natural language processing to cryptic crossword clues (CCCs), a domain that poses unique challenges for language models due to deliberate semantic misdirection and strict hidden grammatical structure.

The project has two independent components that investigate complementary aspects of CCC language:

1. Indicator Clustering (indicator_clustering/)

Unsupervised clustering of 12,622 unique CCC indicator words and phrases to explore whether their semantic embeddings naturally reflect the structure of CCC wordplay types. Uses BGE-M3 embeddings (1024-dim) with UMAP dimensionality reduction, HDBSCAN, and agglomerative clustering.
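The shape of that pipeline (reduce high-dimensional embeddings, then cluster) can be sketched as follows. This is an illustration only, not the project's code: the toy vectors are random stand-ins for BGE-M3 embeddings, and PCA stands in for UMAP as a deterministic, widely available reducer.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

# Toy stand-in for 1024-dim indicator embeddings (the real pipeline uses BGE-M3).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 1024))

# Reduce dimensionality before clustering (the project uses UMAP; PCA is shown
# here only as a simple stand-in).
reduced = PCA(n_components=10, random_state=0).fit_transform(embeddings)

# Agglomerative clustering on the reduced vectors: one label per indicator.
labels = AgglomerativeClustering(n_clusters=8).fit_predict(reduced)
print(labels.shape)
```

The actual notebooks additionally run HDBSCAN, which discovers the number of clusters and marks noise points rather than forcing every indicator into a cluster.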

Key findings:

2. Clue Misdirection (clue_misdirection/)

Supervised learning experiments quantifying how much the surface text of a cryptic clue misleads embedding-based models attempting to connect a definition to its answer. Uses CALE-MBERT-en embeddings with <t></t> target-word delimiters, retrieval analysis, and binary classification with 47 engineered features.
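The `<t></t>` delimiters mark the target (definition) span inside a clue before embedding. A minimal sketch of that preprocessing step, using a hypothetical helper name and example clue (the project's actual implementation may differ):

```python
def mark_target(clue: str, target: str) -> str:
    """Wrap the first occurrence of `target` in <t></t> delimiters.

    Hypothetical helper for illustration only.
    """
    idx = clue.lower().find(target.lower())
    if idx == -1:
        raise ValueError(f"target {target!r} not found in clue")
    end = idx + len(target)
    return clue[:idx] + "<t>" + clue[idx:end] + "</t>" + clue[end:]

print(mark_target("Sailor upset a cold dish", "cold dish"))
# → Sailor upset a <t>cold dish</t>
```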

Key findings:

Data

Both components use George Ho’s CCC dataset (660,613 clues), available under the Open Database License (ODbL v1.0).

See DATA_LICENSE for full attribution and license details.

Repository Structure

ccc-project/
  indicator_clustering/          # Unsupervised clustering component
    notebooks/                   # Pipeline notebooks 00–07 (run in order)
      archive/                   # Superseded and exploratory notebooks
    data/                        # Data files (see setup above)
    outputs/                     # Metrics CSVs and generated figures
      figures/report/            # Publication-quality figures
    docs/                        # Rendered HTML notebooks (GitHub Pages)
      README.md                  # Index of rendered notebooks
    CLAUDE.md                    # Claude Code project configuration
    PROJECT_OVERVIEW.md          # Research context and task definitions
    DOMAIN_KNOWLEDGE.md          # CCC wordplay taxonomy
    FINDINGS_AND_DECISIONS.md    # Empirical results and advisor guidance
    OPEN_QUESTIONS.md            # Unresolved decisions

  clue_misdirection/             # Supervised learning component
    notebooks/                   # Pipeline notebooks 00–08 (run in order)
      archive/                   # Superseded and exploratory notebooks
    scripts/                     # Python and shell scripts for GPU jobs
    data/                        # Data files and embeddings (~1.8 GB)
    outputs/                     # Results CSVs and generated figures
      figures/                   # Retrieval, importance, and evaluation plots
    docs/                        # Rendered HTML notebooks (GitHub Pages)
      README.md                  # Index of rendered notebooks
    CLAUDE.md                    # Claude Code project configuration
    PLAN.md                      # 12-step pipeline plan
    FINDINGS.md                  # Research findings log
    DECISIONS.md                 # Key decisions and rationale
    DATA.md                      # Data dictionary and schemas
    NOTEBOOKS.md                 # Notebook descriptions and purposes
    requirements.txt             # Component-specific dependencies

  README.md                      # This file
  requirements.txt               # Python dependencies (CPU/analysis)
  LICENSE                        # MIT License
  DATA_LICENSE                   # ODbL v1.0 for derived datasets

Rendered Notebooks

Rendered notebooks with full outputs (figures, tables, metrics) are available via GitHub Pages:

https://vwintumich.github.io/ccc-project/

Running the Notebooks

Each component’s notebooks form a sequential pipeline — run them in numerical order. Later notebooks depend on outputs from earlier stages.

Indicator Clustering

Stage  Notebook                       Environment
0      00_data_extraction             Local
1      01_data_cleaning               Local
2      02_embedding_generation        GPU (Great Lakes / Colab)
3      03_dimensionality_reduction    GPU (Great Lakes / Colab)
4      04_clustering                  Local
5      05_constrained_and_targeted    Local
6      06_evaluation_and_figures      Local
7      07_definitions_control         GPU for Section 2; Local otherwise

Clue Misdirection

Stage  Notebook                       Environment
0      00_model_comparison            Local
1      01_data_cleaning               Local
2      02_embedding_generation        GPU (Great Lakes / Colab)
3      03_feature_engineering         Local
4      04_retrieval_analysis          Local
5      05_dataset_construction        Local
6      06_experiments_easy            Local (or Great Lakes for full data)
7      07_experiments_harder          Local (or Great Lakes for full data)
8      08_results_and_evaluation      Local

Note on GPU steps

Embedding generation notebooks require a GPU and sentence-transformers / torch. They are designed to run on the University of Michigan Great Lakes cluster or Google Colab with a GPU runtime (Runtime > Change runtime type > T4 GPU). Generated embedding files are not included in this repository due to size and must be produced by running the relevant notebooks.
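Before launching an embedding notebook, it can help to confirm a GPU is actually visible to torch. A minimal check (assuming torch is installed; falls back to CPU, including when torch is absent):

```python
try:
    import torch
    # torch.cuda.is_available() reports whether a CUDA device is visible.
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:  # torch not installed in this environment
    device = "cpu"

print(f"Embeddings will be generated on: {device}")
```

If this prints `cpu` on Colab, switch the runtime to a GPU type before running the embedding cells.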

Environment

CPU/analysis dependencies (both components):

pip install -r requirements.txt

GPU/embedding dependencies (clue_misdirection):

pip install -r clue_misdirection/requirements.txt

This installs sentence-transformers, torch, and pyarrow in addition to the base dependencies. On Great Lakes and Colab, torch with CUDA support is pre-installed.

Key libraries: scikit-learn, hdbscan, umap-learn, pandas, numpy, nltk, matplotlib, seaborn. The indicator clustering component uses BGE-M3 (BAAI/bge-m3) and the clue misdirection component uses CALE-MBERT-en (oskar-h/cale-modernbert-base); both are downloaded automatically by sentence-transformers on first use.

References