Stage 3: Dimensionality Reduction¶
Primary author: Victoria

Builds on:
- Hierarchical_Clustering_Indicators_with_BGE_M3_Embeddings.ipynb (Victoria/Sahana — UMAP reduction approach)
- NC_PCA_Analysis.ipynb (Nathan — PCA exploration and visualization approach)
- 02_embedding_generation.ipynb (Stage 2 output: BGE-M3 embeddings)

Prompt engineering: Victoria
AI assistance: Claude (Anthropic)
Environment: Great Lakes or Colab (GPU helps for UMAP)
This notebook takes the 1024-dimensional BGE-M3 embeddings from Stage 2 and reduces them to lower dimensions for two purposes:
- 10 dimensions for clustering input (Stage 4) — reduces noise while preserving structure
- 2 dimensions for visualization (Stage 5) — enables scatter plots of the embedding space
Two methods are applied: PCA (linear baseline) and UMAP (nonlinear, structure-preserving).
Running on Google Colab¶
If running on Google Colab:
- Go to Runtime > Change runtime type
- Select a GPU accelerator (T4 is sufficient)
- Click Save, then run all cells
UMAP will run on CPU in ~2-5 minutes for this dataset size (the umap-learn backend used here does not itself use the GPU). PCA runs in seconds on any hardware.
Imports¶
import os
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.decomposition import PCA
import umap # install via: pip install umap-learn
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['figure.dpi'] = 120
Environment Auto-Detection and Paths¶
# --- Environment Auto-Detection ---
try:
    IS_COLAB = 'google.colab' in str(get_ipython())
except NameError:
    IS_COLAB = False

if IS_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    PROJECT_ROOT = Path('/content/drive/MyDrive/SIADS 692 Milestone II/Milestone II - NLP Cryptic Crossword Clues')
else:
    try:
        PROJECT_ROOT = Path(__file__).resolve().parent.parent
    except NameError:
        PROJECT_ROOT = Path.cwd().parent
DATA_DIR = PROJECT_ROOT / 'data'
OUTPUT_DIR = PROJECT_ROOT / 'outputs'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
print(f'Project root: {PROJECT_ROOT}')
print(f'Data directory: {DATA_DIR}')
print(f'Output directory: {OUTPUT_DIR}')
Project root: /home/vwinters/ccc-project/indicator_clustering
Data directory: /home/vwinters/ccc-project/indicator_clustering/data
Output directory: /home/vwinters/ccc-project/indicator_clustering/outputs
Note: All figures produced by this notebook are saved as .png files to the
outputs/ directory (OUTPUT_DIR) for viewing outside the notebook environment.
np.random.seed(42)
Input File Validation¶
Before proceeding, we verify that the required input files from Stage 2 exist.
These are produced by 02_embedding_generation.ipynb:
- embeddings_bge_m3_all.npy — the 1024-dimensional BGE-M3 embedding matrix
- indicator_index_all.csv — maps each row index to its indicator string
# Check that both input files exist before proceeding
embedding_file = DATA_DIR / 'embeddings_bge_m3_all.npy'
index_file = DATA_DIR / 'indicator_index_all.csv'
assert embedding_file.exists(), (
f'Missing embedding file: {embedding_file}\n'
f'Run 02_embedding_generation.ipynb first to produce this file.'
)
assert index_file.exists(), (
f'Missing index file: {index_file}\n'
f'Run 02_embedding_generation.ipynb first to produce this file.'
)
print('Input files found:')
print(f' {embedding_file}')
print(f' {index_file}')
Input files found:
  /home/vwinters/ccc-project/indicator_clustering/data/embeddings_bge_m3_all.npy
  /home/vwinters/ccc-project/indicator_clustering/data/indicator_index_all.csv
Load Embeddings¶
We load the BGE-M3 embedding matrix and the indicator index. The embedding matrix has shape (12622, 1024) — one 1024-dimensional vector per unique verified indicator. The index CSV maps row number to indicator string so we can trace any point back to its indicator after dimensionality reduction.
embeddings = np.load(embedding_file)
df_index = pd.read_csv(index_file, index_col=0)
print(f'Embeddings shape: {embeddings.shape}')
print(f'Index rows: {len(df_index)}')
assert embeddings.shape == (12622, 1024), (
f'Expected shape (12622, 1024), got {embeddings.shape}'
)
assert len(df_index) == embeddings.shape[0], (
f'Index length {len(df_index)} does not match embedding rows {embeddings.shape[0]}'
)
print(f'Shape verified: {embeddings.shape[0]:,} indicators x {embeddings.shape[1]} dimensions')
Embeddings shape: (12622, 1024)
Index rows: 12622
Shape verified: 12,622 indicators x 1024 dimensions
Why Reduce Dimensionality?¶
Our BGE-M3 embeddings live in a 1024-dimensional space. While this high dimensionality captures rich semantic information, it poses challenges for downstream analysis:
The curse of dimensionality: In very high dimensions, distances between points become increasingly uniform. Clustering algorithms rely on meaningful distance differences, so this uniformity degrades their performance.
Computational cost: Pairwise distance computation is O(n^2 x d). Reducing d from 1024 to 10 gives a ~100x speedup for distance-based methods like HDBSCAN and DBSCAN.
Visualization: Humans can only perceive 2D or 3D plots. Reducing to 2D lets us inspect whether the embedding space has visible structure.
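To make the first point concrete, the short sketch below uses synthetic Gaussian data (not our embeddings) to show how the spread between the nearest and farthest distances from a reference point shrinks as dimensionality grows. It is illustrative only and not part of the pipeline.

# Illustration only: distance concentration on synthetic Gaussian data (not the BGE-M3 embeddings).
# As d grows, the nearest and farthest points from a reference become nearly equidistant,
# which is what undermines distance-based clustering in the raw 1024-D space.
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 1024]:
    points = rng.normal(size=(500, d))
    dists = np.linalg.norm(points[1:] - points[0], axis=1)  # distances from one reference point
    relative_spread = (dists.max() - dists.min()) / dists.min()
    print(f'd={d:>4}: relative spread of distances = {relative_spread:.2f}')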
We apply two methods:
- PCA (Principal Component Analysis) — a linear projection that maximizes variance. Fast and deterministic, but cannot capture nonlinear relationships.
- UMAP (Uniform Manifold Approximation and Projection) — a nonlinear method that preserves both local neighborhood structure and global topology. Better for semantic embeddings but slower and stochastic.
PCA: A Linear Baseline¶
What is PCA?¶
Principal Component Analysis (PCA) finds the directions of maximum variance in the data and projects onto those directions. The first principal component captures the most variance, the second captures the next most (orthogonal to the first), and so on.
Why is PCA a baseline, not our primary method? PCA assumes that the important structure in the data is linear — that the most informative differences between points lie along straight-line directions. Semantic embeddings, however, often have nonlinear structure: words with similar meanings form curved manifolds in high-dimensional space, not flat planes. PCA will miss this curvature and may distort meaningful relationships.
We include PCA for two reasons:
- Comparison: If UMAP and PCA produce similar clusterings, the structure is linear and PCA suffices. If UMAP performs much better, the structure is genuinely nonlinear.
- Explained variance analysis: PCA's scree plot tells us how much of the total variance each dimension captures, which helps us understand the intrinsic dimensionality of the embedding space.
Note: PCA implicitly uses Euclidean distance (it maximizes variance, which is based on
squared Euclidean distances). For text embeddings, cosine similarity is usually more
appropriate — this is one reason we prefer UMAP with metric='cosine' for the primary
reduction.
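As a small sanity check on that point: for unit-length vectors the identity ||a - b||^2 = 2 * (1 - cos(a, b)) holds, so if the BGE-M3 embeddings are L2-normalized (an assumption worth verifying on the loaded matrix), Euclidean and cosine distances induce the same neighbor ordering. A minimal numerical sketch:

# Numerical check of ||a - b||^2 == 2 * (1 - cos(a, b)) for unit vectors (illustration only).
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 1024))
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(np.sum((a - b) ** 2), 2 * (1 - np.dot(a, b)))  # the two values agree to float precision

# Whether the real embeddings are unit-length can be checked directly (assumption, not verified here):
# np.allclose(np.linalg.norm(embeddings, axis=1), 1.0, atol=1e-3)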
Fit PCA and Examine Explained Variance¶
We fit PCA with 100 components (out of 1024) to see how quickly variance is captured. The scree plot shows:
- Individual explained variance ratio per component (bar plot)
- Cumulative explained variance (line plot)
What to look for: An "elbow" in the cumulative curve indicates a natural dimensionality — the point where adding more components yields diminishing returns. If 10 components capture 80%+ of variance, the data is relatively low-dimensional. If 100 components are needed for 80%, the structure is more distributed across many dimensions.
# Fit PCA with 100 components to examine the variance structure
pca_full = PCA(n_components=100, random_state=42)
pca_full.fit(embeddings)
# Print cumulative variance at key thresholds.
# Note: np.searchsorted returns 100 when a threshold is not reached within the 100 fitted
# components, so a printed value of 101 means "more than 100 components needed".
cumvar = np.cumsum(pca_full.explained_variance_ratio_)
for threshold in [0.50, 0.80, 0.90, 0.95, 0.99]:
    n_needed = np.searchsorted(cumvar, threshold) + 1
    print(f'{threshold:.0%} variance explained by {n_needed} components')
# Scree plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Left: individual explained variance (first 50 components)
ax1.bar(range(1, 51), pca_full.explained_variance_ratio_[:50],
color='steelblue', alpha=0.8)
ax1.set_xlabel('Principal Component')
ax1.set_ylabel('Explained Variance Ratio')
ax1.set_title('Individual Explained Variance (first 50 components)')
# Right: cumulative explained variance (all 100 components)
ax2.plot(range(1, 101), cumvar, 'o-', markersize=3, color='steelblue')
ax2.axhline(y=0.90, color='red', linestyle='--', alpha=0.5, label='90% variance')
ax2.axhline(y=0.80, color='orange', linestyle='--', alpha=0.5, label='80% variance')
ax2.axvline(x=10, color='green', linestyle='--', alpha=0.5, label='10 components')
ax2.set_xlabel('Number of Components')
ax2.set_ylabel('Cumulative Explained Variance')
ax2.set_title('Cumulative Explained Variance')
ax2.legend()
plt.tight_layout()
fig.savefig(OUTPUT_DIR / 'pca_explained_variance.png', dpi=150, bbox_inches='tight')
plt.show()
50% variance explained by 45 components
80% variance explained by 101 components
90% variance explained by 101 components
95% variance explained by 101 components
99% variance explained by 101 components
Project to 10D and 2D¶
We project the full 1024-dimensional embeddings down to 10 and 2 dimensions using PCA.
- 10D PCA will serve as a baseline for clustering comparison against UMAP 10D in Stage 4.
- 2D PCA will serve as a baseline for visualization comparison against UMAP 2D in Stage 5.
# Project to 10 dimensions (for clustering comparison)
pca_10d = PCA(n_components=10, random_state=42)
embeddings_pca_10d = pca_10d.fit_transform(embeddings)
print(f'PCA 10D shape: {embeddings_pca_10d.shape}')
print(f'Variance explained by 10 components: {pca_10d.explained_variance_ratio_.sum():.1%}')
# Project to 2 dimensions (for visualization comparison)
pca_2d = PCA(n_components=2, random_state=42)
embeddings_pca_2d = pca_2d.fit_transform(embeddings)
print(f'PCA 2D shape: {embeddings_pca_2d.shape}')
print(f'Variance explained by 2 components: {pca_2d.explained_variance_ratio_.sum():.1%}')
PCA 10D shape: (12622, 10)
Variance explained by 10 components: 22.3%
PCA 2D shape: (12622, 2)
Variance explained by 2 components: 7.1%
UMAP: Nonlinear Dimensionality Reduction¶
What is UMAP?¶
UMAP (Uniform Manifold Approximation and Projection) is a nonlinear dimensionality reduction technique that preserves both local and global structure of high-dimensional data. Unlike PCA, UMAP can capture curved manifolds — for example, if all anagram indicators form a curved cluster in 1024D space, UMAP will preserve that cluster's shape in the reduced space, while PCA might flatten or distort it.
UMAP works in two phases:
- Build a neighborhood graph in the original high-dimensional space
- Optimize a low-dimensional layout that preserves the graph's edge structure
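To make the first phase concrete, the sketch below builds an exact cosine k-nearest-neighbor graph with scikit-learn over the loaded embeddings. This is a rough stand-in only: umap-learn internally uses approximate nearest neighbors and fuzzy edge weights rather than this exact graph.

# Rough stand-in for UMAP's phase 1: an exact cosine kNN graph over the loaded embeddings.
# umap-learn uses approximate nearest neighbors plus fuzzy edge weighting, so this is conceptual only.
from sklearn.neighbors import NearestNeighbors

knn = NearestNeighbors(n_neighbors=15, metric='cosine').fit(embeddings)
knn_graph = knn.kneighbors_graph(embeddings, mode='distance')  # sparse (12622, 12622) adjacency
print(knn_graph.shape, knn_graph.nnz)  # roughly 12622 * 15 stored edges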
Key Parameters¶
- n_neighbors (default: 15) — Controls the balance between local and global structure. Small values (5-10) emphasize fine-grained local clusters. Large values (50+) preserve more global topology but blur local detail. We use 15 as a balanced starting point.
- min_dist (default: 0.1) — Controls how tightly UMAP packs points together. Smaller values (0.0-0.05) produce tighter, more separated clusters. Larger values (0.5-1.0) produce a more uniform spread. We use 0.1 as a moderate default.
- metric — The distance measure used in the original space. We use 'cosine' because cosine similarity is the standard metric for comparing text embeddings. It measures the angle between vectors, making it robust to differences in vector magnitude.
- random_state — UMAP is stochastic (it uses random initialization and stochastic gradient descent). Setting random_state=42 ensures reproducibility across runs.
These parameters have not been systematically tuned (see OPEN_QUESTIONS.md). The values used here are reasonable defaults for a first pass. Stage 4 clustering notebooks should investigate sensitivity to these choices.
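As a pointer for that follow-up, a sensitivity sweep could look roughly like the sketch below. It is not run here (nine UMAP fits would be slow), the parameter grid is arbitrary, and trustworthiness is only one candidate diagnostic; Stage 4 may prefer cluster-quality metrics instead.

# Sketch of a possible sensitivity sweep (not executed in this stage; grid values are illustrative).
from sklearn.manifold import trustworthiness

for nn in [5, 15, 50]:
    for md in [0.0, 0.1, 0.5]:
        reducer = umap.UMAP(n_components=10, n_neighbors=nn, min_dist=md,
                            metric='cosine', random_state=42)
        reduced = reducer.fit_transform(embeddings)
        score = trustworthiness(embeddings, reduced, n_neighbors=15, metric='cosine')
        print(f'n_neighbors={nn:>2}, min_dist={md}: trustworthiness={score:.3f}')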
UMAP to 10 Dimensions (for Clustering Input)¶
The 10-dimensional UMAP embedding is the primary input for clustering in Stage 4. Ten dimensions is a common choice that balances noise reduction with information preservation. This step typically takes 1-3 minutes depending on hardware.
# UMAP reduction to 10 dimensions for clustering input
reducer_10d = umap.UMAP(
n_components=10,
n_neighbors=15,
min_dist=0.1,
metric='cosine',
random_state=42
)
embeddings_umap_10d = reducer_10d.fit_transform(embeddings)
print(f'UMAP 10D shape: {embeddings_umap_10d.shape}')
/home/vwinters/.conda/envs/nlp_env/lib/python3.12/site-packages/umap/umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism. warn(
UMAP 10D shape: (12622, 10)
UMAP to 2 Dimensions (for Visualization)¶
The 2-dimensional UMAP embedding is used for scatter plots in Stage 5. Note that UMAP must be fitted separately for each target dimensionality — we cannot simply take the first 2 columns of the 10D output, because UMAP optimizes the layout for the specific number of dimensions requested.
# UMAP reduction to 2 dimensions for visualization
reducer_2d = umap.UMAP(
n_components=2,
n_neighbors=15,
min_dist=0.1,
metric='cosine',
random_state=42
)
embeddings_umap_2d = reducer_2d.fit_transform(embeddings)
print(f'UMAP 2D shape: {embeddings_umap_2d.shape}')
/home/vwinters/.conda/envs/nlp_env/lib/python3.12/site-packages/umap/umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism. warn(
UMAP 2D shape: (12622, 2)
Quick Visual Check¶
Before saving, we plot both 2D projections side by side to verify that the reductions look reasonable. At this stage we do not color by cluster or wordplay type — that comes in Stage 5. We are checking that:
- Points are not all collapsed to a single blob (would indicate a degenerate reduction)
- There is visible structure — clumps, filaments, or separated regions
- No obvious artifacts (e.g., a uniform grid pattern would indicate a bug)
Expect UMAP to show more distinct clumps than PCA, since UMAP preserves local neighborhood structure while PCA only maximizes variance along linear directions.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
# UMAP 2D
ax1.scatter(
embeddings_umap_2d[:, 0],
embeddings_umap_2d[:, 1],
s=1, alpha=0.3, color='steelblue'
)
ax1.set_title('UMAP 2D Projection')
ax1.set_xlabel('UMAP 1')
ax1.set_ylabel('UMAP 2')
# PCA 2D for comparison
ax2.scatter(
embeddings_pca_2d[:, 0],
embeddings_pca_2d[:, 1],
s=1, alpha=0.3, color='coral'
)
ax2.set_title('PCA 2D Projection (for comparison)')
ax2.set_xlabel('PC 1')
ax2.set_ylabel('PC 2')
plt.tight_layout()
fig.savefig(OUTPUT_DIR / 'umap_vs_pca_2d_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
# Print coordinate ranges as a sanity check
print(f'UMAP 2D range — '
f'x: [{embeddings_umap_2d[:, 0].min():.1f}, {embeddings_umap_2d[:, 0].max():.1f}], '
f'y: [{embeddings_umap_2d[:, 1].min():.1f}, {embeddings_umap_2d[:, 1].max():.1f}]')
print(f'PCA 2D range — '
f'x: [{embeddings_pca_2d[:, 0].min():.2f}, {embeddings_pca_2d[:, 0].max():.2f}], '
f'y: [{embeddings_pca_2d[:, 1].min():.2f}, {embeddings_pca_2d[:, 1].max():.2f}]')
UMAP 2D range — x: [2.6, 15.8], y: [-5.5, 7.9]
PCA 2D range — x: [-0.36, 0.43], y: [-0.34, 0.33]
Save All Outputs¶
Four files are saved to DATA_DIR:
| File | Shape | Purpose |
|---|---|---|
| embeddings_umap_10d.npy | (12622, 10) | Clustering input for Stage 4 |
| embeddings_umap_2d.npy | (12622, 2) | Visualization in Stage 5 |
| embeddings_pca_10d.npy | (12622, 10) | PCA baseline for clustering comparison |
| embeddings_pca_2d.npy | (12622, 2) | PCA baseline for visualization comparison |
Row i in every output file corresponds to row i in indicator_index_all.csv
(the same mapping as the input embeddings).
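For example, any point in a reduced matrix can be traced back to its indicator string through this shared row order (the exact column layout of the index CSV is not assumed here):

# Trace a reduced point back to its indicator via shared row order (illustration only).
i = 123                                           # any row number
print('UMAP 2D coords:', embeddings_umap_2d[i])
print('Index row:', df_index.iloc[i].to_dict())   # same row in indicator_index_all.csv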
np.save(DATA_DIR / 'embeddings_umap_10d.npy', embeddings_umap_10d)
np.save(DATA_DIR / 'embeddings_umap_2d.npy', embeddings_umap_2d)
np.save(DATA_DIR / 'embeddings_pca_10d.npy', embeddings_pca_10d)
np.save(DATA_DIR / 'embeddings_pca_2d.npy', embeddings_pca_2d)
print('Saved:')
for fname in ['embeddings_umap_10d.npy', 'embeddings_umap_2d.npy',
              'embeddings_pca_10d.npy', 'embeddings_pca_2d.npy']:
    fpath = DATA_DIR / fname
    print(f' {fname} ({fpath.stat().st_size / 1024:.0f} KB)')
Saved:
  embeddings_umap_10d.npy (493 KB)
  embeddings_umap_2d.npy (99 KB)
  embeddings_pca_10d.npy (493 KB)
  embeddings_pca_2d.npy (99 KB)
Verification¶
Reload all saved files and verify that shapes match expectations. This catches silent corruption (e.g., truncated writes on network-mounted filesystems like Google Drive or Great Lakes scratch).
expected_shapes = {
'embeddings_umap_10d.npy': (12622, 10),
'embeddings_umap_2d.npy': (12622, 2),
'embeddings_pca_10d.npy': (12622, 10),
'embeddings_pca_2d.npy': (12622, 2),
}
for fname, expected_shape in expected_shapes.items():
    loaded = np.load(DATA_DIR / fname)
    assert loaded.shape == expected_shape, (
        f'{fname}: expected {expected_shape}, got {loaded.shape}'
    )
    print(f'{fname}: {loaded.shape} OK')
print('\nAll verification checks passed.')
embeddings_umap_10d.npy: (12622, 10) OK
embeddings_umap_2d.npy: (12622, 2) OK
embeddings_pca_10d.npy: (12622, 10) OK
embeddings_pca_2d.npy: (12622, 2) OK

All verification checks passed.
Interpretation of Results¶
PCA Explained Variance¶
The scree plot shows no sharp "elbow" — variance is spread gradually across many components. The top component explains only ~4%, and even 100 components reach only about 70%. This is typical of high-dimensional semantic embeddings where meaning is distributed across many dimensions rather than concentrated in a few. The 22.3% captured by 10 PCA components means that a 10-dimensional linear projection discards most of the variance, which is why we treat PCA as a baseline rather than the primary reduction.
UMAP vs PCA Comparison¶
PCA shows a diffuse blob with no visible structure. UMAP reveals clear local structure: many small dense clumps connected by sparser regions, with some isolated clusters at the periphery. UMAP is preserving local neighborhood relationships that PCA misses entirely.
Morphological Variant Clustering¶
Many of the tight clumps visible in the UMAP plot likely correspond to groups of morphological variants of the same root concept. Because we did not stem or lemmatize indicators before embedding (a settled decision — the embedding model handles morphological variation in semantic space), variants like "contribute to", "contributes to", "contributing", "contributing in", "contributing to", "contribution from", "contribution to", and "contributors to" all receive very similar embeddings and land in tight local neighborhoods. This is the embedding model working as intended, but it means that fine-grained clustering algorithms like HDBSCAN will likely find hundreds of small variant-level clusters (as seen in the earlier exploratory finding of 353 HDBSCAN clusters) rather than a smaller number of wordplay-type-level clusters. This observation is relevant to the open question about two-stage clustering (see OPEN_QUESTIONS.md, Q5).
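A quick spot-check of that claim is sketched below. It assumes the first column of indicator_index_all.csv holds the indicator text and that the listed variants are present, so it may need adjusting to the actual schema.

# Spot-check: morphological variants should show near-1.0 pairwise cosine similarity.
# Assumes the first column of df_index holds the indicator strings (adjust if the schema differs).
variants = ['contribute to', 'contributes to', 'contributing to']
col = df_index.columns[0]
positions = [np.flatnonzero(df_index[col].astype(str).values == v) for v in variants]
found = [int(p[0]) for p in positions if p.size > 0]
if len(found) >= 2:
    vecs = embeddings[found]
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    print(np.round(vecs @ vecs.T, 3))   # pairwise cosine similarities
else:
    print('Variants not found under this column; adjust the lookup.')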
Implications for Clustering¶
The UMAP plot is encouraging. The visible local structure confirms there is real semantic organization in the indicator space. However, the overall shape is one large connected mass rather than 8 cleanly separated islands, which is consistent with the primary hypothesis that indicators will not separate cleanly into 8 wordplay types. Density-based methods (HDBSCAN) can find clusters within this connected space, but the natural granularity may be at the variant/conceptual-metaphor level rather than the wordplay-type level. The clustering notebook (Stage 4) will first attempt k=8 as a baseline to characterize how well wordplay types can be recovered, then explore other granularities.
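As a rough preview of that baseline (illustrative only; the Stage 4 notebook defines the actual setup), a k=8 run on the 10-dimensional UMAP output might look like:

# Illustrative preview of the Stage 4 k=8 baseline; results here are not used downstream.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

kmeans = KMeans(n_clusters=8, n_init=10, random_state=42)
labels_k8 = kmeans.fit_predict(embeddings_umap_10d)
print('Cluster sizes:', np.bincount(labels_k8))
print('Silhouette (UMAP 10D):', round(silhouette_score(embeddings_umap_10d, labels_k8), 3))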