Stage 2: Embedding Generation with BGE-M3¶
Primary author: Victoria
Builds on:
- Hierarchical_Clustering_Indicators_with_BGE_M3_Embeddings.ipynb (Victoria/Sahana — BGE-M3 model selection and inline embedding approach)
- NC_Comprehensive_Embeddings.ipynb (Nathan — multi-model comparison that helped justify BGE-M3 as the primary model)
- 01_data_cleaning.ipynb (Stage 1 output: verified unique indicators)
Prompt engineering: Victoria
AI assistance: Claude (Anthropic)
Environment: Great Lakes (GPU required) or Google Colab (GPU enabled)
This notebook loads the deduplicated list of 12,622 verified unique indicator strings
produced by Stage 1, generates 1024-dimensional embeddings using the BGE-M3 sentence
transformer model, and saves the results as .npy and .csv files for downstream
dimensionality reduction (Stage 3) and clustering (Stage 4).
Great Lakes session settings:
- Partition: gpu
- GPUs: 1 (V100 or A40)
- CPUs: 4
- Memory: 32GB
- Wall time: 1 hour (embedding takes ~2-5 min; most time is model download on first run)
Running on Google Colab¶
If you are running this notebook on Google Colab after the course ends:
- Go to Runtime > Change runtime type
- Select a GPU accelerator:
- T4 is available on the free tier and is sufficient for this notebook
- A100 is available with Colab Pro and will be faster
- Click Save, then run all cells
Embedding 12,622 short phrases takes approximately 2-5 minutes on a T4 GPU. Without a GPU, it will still work but may take 15-30 minutes on CPU.
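Before running the embedding cell, it can help to confirm that a GPU is actually visible to PyTorch (sentence-transformers picks it up automatically when available). A minimal check, with a fallback in case torch is not installed:

```python
# Report whether a CUDA GPU is visible; sentence-transformers will
# use it automatically when one is available.
try:
    import torch
    gpu = torch.cuda.is_available()
    device = torch.cuda.get_device_name(0) if gpu else 'CPU'
except ImportError:
    gpu, device = False, 'CPU (torch not installed)'

print(f'GPU available: {gpu} ({device})')
```

If this prints `False` on Colab, re-check Runtime > Change runtime type before proceeding.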
Imports¶
import os
import numpy as np
import pandas as pd
from pathlib import Path
from sentence_transformers import SentenceTransformer
Environment Auto-Detection and Paths¶
# --- Environment Auto-Detection ---
try:
    IS_COLAB = 'google.colab' in str(get_ipython())
except NameError:
    IS_COLAB = False

if IS_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    PROJECT_ROOT = Path('/content/drive/MyDrive/SIADS 692 Milestone II/Milestone II - NLP Cryptic Crossword Clues')
else:
    try:
        PROJECT_ROOT = Path(__file__).resolve().parent.parent
    except NameError:
        PROJECT_ROOT = Path.cwd().parent
DATA_DIR = PROJECT_ROOT / 'data'
OUTPUT_DIR = PROJECT_ROOT / 'outputs'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
# Batch size for embedding generation.
# Colab free-tier T4 GPUs have 16GB VRAM — use a smaller batch to avoid OOM.
# Great Lakes V100/A40 and local GPUs with more VRAM can handle larger batches.
BATCH_SIZE = 32 if IS_COLAB else 64
print(f'Project root: {PROJECT_ROOT}')
print(f'Data directory: {DATA_DIR}')
print(f'Batch size: {BATCH_SIZE}')
np.random.seed(42)
Load Unique Indicators¶
The input file verified_indicators_unique.csv is produced by 01_data_cleaning.ipynb.
It contains one row per unique indicator string (12,622 indicators), with no wordplay labels.
Labels are stored separately in verified_clues_labeled.csv and can be joined by indicator
string whenever needed for evaluation.
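The join itself is a plain pandas merge on the indicator string. A sketch with toy stand-ins for the two Stage 1 files (the `wordplay_type` column name is an assumption for illustration):

```python
import pandas as pd

# Toy stand-ins for verified_indicators_unique.csv and
# verified_clues_labeled.csv; 'wordplay_type' is an assumed column name.
df_unique = pd.DataFrame({'indicator': ['scrambled', 'hidden', 'about']})
df_labeled = pd.DataFrame({
    'indicator': ['scrambled', 'hidden', 'about', 'scrambled'],
    'wordplay_type': ['anagram', 'hidden', 'reversal', 'anagram'],
})

# Left join keeps every unique indicator; the labeled file has one row per
# clue, so deduplicate first if one label per indicator is wanted.
joined = df_unique.merge(
    df_labeled.drop_duplicates('indicator'), on='indicator', how='left'
)
print(joined)
```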
# Check that the input file exists before proceeding
input_file = DATA_DIR / 'verified_indicators_unique.csv'
assert input_file.exists(), (
    f'Missing input file: {input_file}\n'
    f'Run 01_data_cleaning.ipynb first to produce this file.'
)
df_indicators = pd.read_csv(input_file)
indicators_list = df_indicators['indicator'].tolist()
print(f'Loaded {len(indicators_list):,} unique indicators')
print(f'Examples: {indicators_list[:5]}')
shortest = min(indicators_list, key=len)
longest = max(indicators_list, key=len)
print(f'Shortest: "{shortest}" ({len(shortest)} chars)')
print(f'Longest: "{longest}" ({len(longest)} chars)')
Generate BGE-M3 Embeddings¶
We use the BAAI/bge-m3 model from the
sentence-transformers library. BGE-M3 produces 1024-dimensional dense embeddings;
the "M3" stands for multi-functionality, multi-linguality, and multi-granularity,
meaning the model is trained to represent text from short phrases to long passages.
Why BGE-M3? Our indicators are short phrases (1-6 words) that carry specific semantic meaning related to wordplay operations. BGE-M3 handles short text well and produces embeddings where semantically similar phrases (e.g., "scrambled" and "mixed up") are close in vector space. This is the settled model choice per FINDINGS_AND_DECISIONS.md.
What we are NOT doing: We embed each indicator in isolation (not within its clue context). This is a settled decision — see FINDINGS_AND_DECISIONS.md for the rationale.
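The model itself is too large to demo inline, but the "close in vector space" claim reduces to cosine similarity between embedding rows. A toy numpy sketch, with made-up 4-dimensional vectors standing in for real 1024-dimensional embeddings:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up toy vectors standing in for real BGE-M3 embeddings:
scrambled = np.array([0.9, 0.1, 0.0, 0.2])
mixed_up  = np.array([0.8, 0.2, 0.1, 0.2])
hidden    = np.array([0.0, 0.1, 0.9, 0.1])

print(cosine(scrambled, mixed_up))  # high: near-synonyms sit close
print(cosine(scrambled, hidden))    # low: unrelated wordplay operations
```

With real embeddings, the same function applied to the rows for "scrambled" and "mixed up" should give a noticeably higher score than for unrelated indicators.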
# Load the BGE-M3 model
# First run will download the model (~2.3 GB). Subsequent runs use the cached version.
model = SentenceTransformer('BAAI/bge-m3')
print(f'Model loaded: {model.get_sentence_embedding_dimension()} dimensions')
# Generate embeddings for all unique indicators
# show_progress_bar=True displays a tqdm progress bar during encoding
embeddings = model.encode(
    indicators_list,
    batch_size=BATCH_SIZE,
    show_progress_bar=True
)
print(f'Embeddings shape: {embeddings.shape}')
print(f'Dtype: {embeddings.dtype}')
print(f'Memory: {embeddings.nbytes / 1024**2:.1f} MB')
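At float32, a (12,622, 1024) matrix is roughly 49 MB. If storage ever becomes a concern, downcasting to float16 halves it with little effect on cosine geometry; this is optional and not what this notebook saves. A sketch on a same-shaped random array:

```python
import numpy as np

# Same-shaped random stand-in for the real float32 embedding matrix.
emb = np.random.default_rng(42).normal(size=(12622, 1024)).astype(np.float32)
print(f'{emb.nbytes / 1024**2:.1f} MB')   # ~49.3 MB

# float16 halves storage; downstream code must tolerate the dtype.
emb16 = emb.astype(np.float16)
print(f'{emb16.nbytes / 1024**2:.1f} MB')  # ~24.7 MB
```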
Save Outputs¶
Two files are saved:
- embeddings_bge_m3_all.npy — NumPy array of shape (N, 1024), where N is the number of unique indicators. Row i in this array corresponds to row i in the indicator index CSV.
- indicator_index_all.csv — Maps each row number to its indicator string. The CSV index (first column) is the row number in the embedding array. This is the contract between the embedding file and the indicator identity.
Downstream notebooks (Stage 3, 4, 5) should load these files rather than recomputing embeddings.
# Save the embedding matrix
np.save(DATA_DIR / 'embeddings_bge_m3_all.npy', embeddings)
print(f'Saved embeddings to {DATA_DIR / "embeddings_bge_m3_all.npy"}')
# Save the indicator index (row number -> indicator string)
df_indicators.to_csv(DATA_DIR / 'indicator_index_all.csv', index=True)
print(f'Saved indicator index to {DATA_DIR / "indicator_index_all.csv"}')
Verification¶
Reload the saved files and verify that shapes match and the row mapping is correct.
# Reload and verify
embeddings_check = np.load(DATA_DIR / 'embeddings_bge_m3_all.npy')
index_check = pd.read_csv(DATA_DIR / 'indicator_index_all.csv', index_col=0)
assert embeddings_check.shape[0] == len(index_check), (
    f'Shape mismatch: embeddings has {embeddings_check.shape[0]} rows, '
    f'index has {len(index_check)} rows'
)
assert embeddings_check.shape[1] == 1024, (
    f'Expected 1024 dimensions, got {embeddings_check.shape[1]}'
)
print(f'Embeddings: {embeddings_check.shape}')
print(f'Index: {len(index_check)} rows')
print('All checks passed.')
# Spot-check: find a known indicator and verify it has a non-zero embedding
spot_check = 'about'
matches = index_check[index_check['indicator'] == spot_check]
assert not matches.empty, f'Indicator "{spot_check}" not found in index'
row = matches.index[0]
norm = np.linalg.norm(embeddings_check[row])
print(f'\nSpot check: "{spot_check}" is at row {row}, embedding L2 norm = {norm:.4f}')
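A stronger qualitative check is to look at an indicator's nearest neighbors by cosine similarity and eyeball whether they are near-synonyms. A self-contained sketch of the neighbor lookup, demonstrated on a tiny stand-in matrix (real use would pass the saved embedding matrix and a larger k):

```python
import numpy as np

def nearest_neighbors(embeddings, row, k=3):
    """Return the k row indices most cosine-similar to `row` (excluding itself)."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[row]          # cosine similarity to every row
    order = np.argsort(-sims)            # most similar first
    return [i for i in order if i != row][:k]

# Tiny stand-in for the (N, 1024) matrix: rows 0 and 1 are nearly parallel.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
print(nearest_neighbors(toy, 0, k=2))  # row 1 first, then row 2
```

Neighbor indices map back to indicator strings via the rows of indicator_index_all.csv.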