Stage 0: Data Extraction from SQLite¶
Primary author: Victoria
Based on: George Ho's dataset schema (cryptics.georgeho.org)
Prompt engineering: Victoria
AI assistance: Claude (Anthropic)
Environment: Local or Colab
This notebook extracts six tables from the raw data.sqlite3 database (downloaded from
cryptics.georgeho.org) and saves them as CSV files with consistent, descriptive column
names. These CSVs are the starting point for all downstream notebooks.
Dataset created by George Ho.
Available for download: https://cryptics.georgeho.org/data.db
It renames features so that they are descriptive and consistent across all tables:

- `rowid` becomes `clue_id`, `ind_id`, or `charade_id`, depending on the table
- `answer` becomes `charade_answer` when it refers to the charade, and remains `answer` when it refers to the clue answer
- `clue_rowids` becomes `clue_ids` for consistency
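The renames described above, applied cell by cell below, can be collected into a single mapping per table. This is only a restatement of that list as a sketch; the keys match the SQLite table names used in this notebook:

```python
import pandas as pd

# Column renames applied to each extracted table (summary of the cells below)
RENAMES = {
    "indicators": {"rowid": "ind_id", "clue_rowids": "clue_ids"},
    "indicators_by_clue": {"clue_rowid": "clue_id"},
    "clues": {"rowid": "clue_id"},
    "charades": {"rowid": "charade_id", "answer": "charade_answer", "clue_rowids": "clue_ids"},
    "charades_by_clue": {"clue_rowid": "clue_id", "answer": "charade_answer"},
}

# Example: applying the mapping to a toy frame shaped like the charades table
toy = pd.DataFrame({"rowid": [1], "charade": ["a"], "answer": ["AN"], "clue_rowids": ["[79856]"]})
renamed = toy.rename(columns=RENAMES["charades"])
print(list(renamed.columns))  # ['charade_id', 'charade', 'charade_answer', 'clue_ids']
```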
In [1]:
# imports
import os
import sqlite3
import pandas as pd
import numpy as np
from pathlib import Path
In [2]:
# --- Environment Auto-Detection ---
try:
IS_COLAB = 'google.colab' in str(get_ipython())
except NameError:
IS_COLAB = False
if IS_COLAB:
from google.colab import drive
drive.mount('/content/drive')
PROJECT_ROOT = Path('/content/drive/MyDrive/SIADS 692 Milestone II/Milestone II - NLP Cryptic Crossword Clues')
else:
try:
PROJECT_ROOT = Path(__file__).resolve().parent.parent
except NameError:
PROJECT_ROOT = Path.cwd().parent
DATA_DIR = PROJECT_ROOT / "data"
print(f'Project root: {PROJECT_ROOT}')
print(f'Data directory: {DATA_DIR}')
Project root: /home/vwinters/ccc-project/indicator_clustering Data directory: /home/vwinters/ccc-project/indicator_clustering/data
In [3]:
# Connect to the sqlite3 file
data_file = str(DATA_DIR / "data.sqlite3")
conn = sqlite3.connect(data_file)
In [4]:
# List the tables in the database (uncomment the last line to display them)
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table';", conn)
#tables
In [5]:
# Keep track of all tables that might be of interest from the original dataset
# Display the names and sizes of all tables.
tables = [
"clues",
"indicators",
"charades",
"indicators_by_clue",
"charades_by_clue",
"indicators_consolidated"
]
summary = []
for t in tables:
# count rows
row_count = pd.read_sql(f"SELECT COUNT(*) AS n FROM {t};", conn).iloc[0]["n"]
# count columns from the table metadata
col_info = pd.read_sql(f"PRAGMA table_info({t});", conn)
col_count = len(col_info)
summary.append({
"table": t,
"rows": row_count,
"columns": col_count
})
summary_df = pd.DataFrame(summary)
summary_df.style.format({"rows": "{:,}"}) # display with commas
Out[5]:
|   | table | rows | columns |
|---|---|---|---|
| 0 | clues | 660,613 | 9 |
| 1 | indicators | 15,735 | 4 |
| 2 | charades | 57,289 | 4 |
| 3 | indicators_by_clue | 88,037 | 9 |
| 4 | charades_by_clue | 126,825 | 3 |
| 5 | indicators_consolidated | 1 | 8 |
In [6]:
# Create the dataframes related to indicators
df_indicators = pd.read_sql("SELECT * FROM indicators;", conn)
df_ind_by_clue = pd.read_sql("SELECT * FROM indicators_by_clue;", conn)
df_indicators_consolidated = pd.read_sql("SELECT * FROM indicators_consolidated;", conn)
# Create dataframes pertaining to clue and charade
df_clues = pd.read_sql("SELECT * FROM clues;", conn)
df_charades = pd.read_sql("SELECT * FROM charades;", conn)
df_charades_by_clue = pd.read_sql("SELECT * FROM charades_by_clue;", conn)
In [7]:
# Indicators
display(df_indicators.head(3))
# Rename columns for consistency across tables/dataframes
df_indicators = df_indicators.rename(columns={'rowid': 'ind_id', 'clue_rowids': 'clue_ids'})
display(df_indicators.head(3))
|   | rowid | wordplay | indicator | clue_rowids |
|---|---|---|---|---|
| 0 | 1 | alternation | abnormal | [623961](/data/clues/623961) |
| 1 | 2 | alternation | after odd losses | [139327](/data/clues/139327) |
| 2 | 3 | alternation | after regular excisions | [107211](/data/clues/107211) |
|   | ind_id | wordplay | indicator | clue_ids |
|---|---|---|---|---|
| 0 | 1 | alternation | abnormal | [623961](/data/clues/623961) |
| 1 | 2 | alternation | after odd losses | [139327](/data/clues/139327) |
| 2 | 3 | alternation | after regular excisions | [107211](/data/clues/107211) |
In [8]:
# Indicators by Clue
display(df_ind_by_clue.head(4))
print()
# Rename columns for consistency across tables/dataframes
df_ind_by_clue = df_ind_by_clue.rename(columns={'clue_rowid': 'clue_id'})
display(df_ind_by_clue.head(4))
# See how many contextualized indicators are in this table
#print("Instances of each CONTEXTUALIZED wordplay (multiple per clue, redundant indicators)")
#df_ind_by_clue.replace("", np.nan).count()
|   | clue_rowid | alternation | anagram | container | deletion | hidden | homophone | insertion | reversal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 90 | transforming | |||||||
| 1 | 97 | ground | |||||||
| 2 | 101 | verbally | |||||||
| 3 | 142 | about to go back |
|   | clue_id | alternation | anagram | container | deletion | hidden | homophone | insertion | reversal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 90 | transforming | |||||||
| 1 | 97 | ground | |||||||
| 2 | 101 | verbally | |||||||
| 3 | 142 | about to go back |
In [9]:
# Indicators Consolidated
display(df_indicators_consolidated.head())
print()
|   | alternation | anagram | container | deletion | hidden | homophone | insertion | reversal |
|---|---|---|---|---|---|---|---|---|
| 0 | abnormal\nafter odd losses\nafter regular exci... | a bad way\na bit differently\na brew of\na coc... | a single\naboard\nabout\nabout t\nabout/confin... | a certain amount off\nabandoned\nabandoned by\... | a bit\na bit of\na bit of this\na bunch of\na ... | a report on\naccording to announcement\naccord... | a bit of\na contingent of\na feature of\na fic... | a cleric raised\na fastening device put up\na ... |
In [10]:
# Clues
display(df_clues.head(3))
# Rename columns for consistency across tables/dataframes
df_clues = df_clues.rename(columns={'rowid': 'clue_id'})
display(df_clues.head(3))
|   | rowid | clue | answer | definition | clue_number | puzzle_date | puzzle_name | source_url | source |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Acquisitive chap, as we see it (8) | COVETOUS | Acquisitive | 1a | 2019-08-08 | Times 27424 | https://times-xwd-times.livejournal.com/218581... | times_xwd_times |
| 1 | 2 | Back yard fencing weak and sagging (6) | DROOPY | sagging | 5a | 2019-08-08 | Times 27424 | https://times-xwd-times.livejournal.com/218581... | times_xwd_times |
| 2 | 3 | Stripping off uniform, love holding colonel's ... | UNCLOTHING | Stripping | 8a | 2019-08-08 | Times 27424 | https://times-xwd-times.livejournal.com/218581... | times_xwd_times |
|   | clue_id | clue | answer | definition | clue_number | puzzle_date | puzzle_name | source_url | source |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Acquisitive chap, as we see it (8) | COVETOUS | Acquisitive | 1a | 2019-08-08 | Times 27424 | https://times-xwd-times.livejournal.com/218581... | times_xwd_times |
| 1 | 2 | Back yard fencing weak and sagging (6) | DROOPY | sagging | 5a | 2019-08-08 | Times 27424 | https://times-xwd-times.livejournal.com/218581... | times_xwd_times |
| 2 | 3 | Stripping off uniform, love holding colonel's ... | UNCLOTHING | Stripping | 8a | 2019-08-08 | Times 27424 | https://times-xwd-times.livejournal.com/218581... | times_xwd_times |
In [11]:
# Charades
display(df_charades.head(3))
# Rename columns for consistency across tables/dataframes
df_charades = df_charades.rename(columns={'rowid': 'charade_id', 'answer':'charade_answer', 'clue_rowids': 'clue_ids'})
display(df_charades.head(3))
|   | rowid | charade | answer | clue_rowids |
|---|---|---|---|---|
| 0 | 1 | a | ALPHA | [108948](/data/clues/108948) |
| 1 | 2 | a | AN | [79856](/data/clues/79856), [162234](/data/clu... |
| 2 | 3 | a | AR | [384876](/data/clues/384876) |
|   | charade_id | charade | charade_answer | clue_ids |
|---|---|---|---|---|
| 0 | 1 | a | ALPHA | [108948](/data/clues/108948) |
| 1 | 2 | a | AN | [79856](/data/clues/79856), [162234](/data/clu... |
| 2 | 3 | a | AR | [384876](/data/clues/384876) |
In [12]:
# Charades by Clue
display(df_charades_by_clue.head(3))
# Rename columns for consistency across tables/dataframes
df_charades_by_clue = df_charades_by_clue.rename(columns={'clue_rowid': 'clue_id', 'answer':'charade_answer'})
display(df_charades_by_clue.head(3))
|   | clue_rowid | charade | answer |
|---|---|---|---|
| 0 | 1 | chap | COVE |
| 1 | 1 | as we see it | TO US |
| 2 | 21 | give the thing acclaim | LAUD IT |
|   | clue_id | charade | charade_answer |
|---|---|---|---|
| 0 | 1 | chap | COVE |
| 1 | 1 | as we see it | TO US |
| 2 | 21 | give the thing acclaim | LAUD IT |
In [13]:
# Write each dataframe to a CSV file in the data directory (without the index)
df_indicators.to_csv(DATA_DIR / "indicators_raw.csv", index=False)
df_ind_by_clue.to_csv(DATA_DIR / "indicators_by_clue_raw.csv", index=False)
df_indicators_consolidated.to_csv(DATA_DIR / "indicators_consolidated_raw.csv", index=False)
df_clues.to_csv(DATA_DIR / "clues_raw.csv", index=False)
df_charades.to_csv(DATA_DIR / "charades_raw.csv", index=False)
df_charades_by_clue.to_csv(DATA_DIR / "charades_by_clue_raw.csv", index=False)
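As an optional sanity check after writing the CSVs, each file can be reloaded to confirm the renamed columns survived the round trip. The helper below is a hypothetical sketch, not part of the pipeline; it reads only the header row, so it is cheap even on the large `clues_raw.csv`:

```python
import pandas as pd
from pathlib import Path

def check_columns(path: Path, required: set) -> None:
    """Raise ValueError if the CSV at `path` is missing any required column."""
    cols = set(pd.read_csv(path, nrows=0).columns)  # nrows=0 reads the header only
    missing = required - cols
    if missing:
        raise ValueError(f"{path.name}: missing columns {sorted(missing)}")

# Example usage (paths assume the DATA_DIR layout used above):
# check_columns(DATA_DIR / "clues_raw.csv", {"clue_id", "clue", "answer"})
# check_columns(DATA_DIR / "charades_raw.csv", {"charade_id", "charade_answer", "clue_ids"})
```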
In [ ]: