Stage 0: Data Extraction from SQLite¶

Primary author: Victoria
Based on: George Ho's dataset schema (cryptics.georgeho.org)
Prompt engineering: Victoria
AI assistance: Claude (Anthropic)
Environment: Local or Colab

This notebook extracts six tables from the raw data.sqlite3 database (downloaded from cryptics.georgeho.org) and saves them as CSV files with consistent, descriptive column names. These CSVs are the starting point for all downstream notebooks.

Dataset created by George Ho.

Available for download: https://cryptics.georgeho.org/data.db

The notebook also renames columns so that they are descriptive and consistent across all tables:

  • rowid becomes clue_id, ind_id, or charade_id
  • answer becomes charade_answer when it refers to a charade, and remains answer when it refers to the clue answer
  • clue_rowids becomes clue_ids for consistency
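The renames above reduce to per-table mappings. As a minimal sketch (the `RENAME_MAPS` name and the toy DataFrame are illustrative, not part of the notebook), the same effect can be expressed with `DataFrame.rename`:

```python
import pandas as pd

# Rename maps mirroring the conventions described above;
# keys are the original column names from the SQLite tables.
RENAME_MAPS = {
    "indicators": {"rowid": "ind_id", "clue_rowids": "clue_ids"},
    "clues": {"rowid": "clue_id"},
    "charades": {"rowid": "charade_id", "answer": "charade_answer",
                 "clue_rowids": "clue_ids"},
}

# Toy frame standing in for the real `indicators` table.
df = pd.DataFrame({"rowid": [1], "wordplay": ["alternation"],
                   "indicator": ["abnormal"], "clue_rowids": ["[623961]"]})
df = df.rename(columns=RENAME_MAPS["indicators"])
print(list(df.columns))  # ['ind_id', 'wordplay', 'indicator', 'clue_ids']
```

Columns absent from the mapping pass through unchanged, so the same map can be applied safely even if a table gains new columns.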
In [1]:
# imports
import os
import sqlite3
import pandas as pd
import numpy as np
from pathlib import Path
In [2]:
# --- Environment Auto-Detection ---
try:
    IS_COLAB = 'google.colab' in str(get_ipython())
except NameError:
    IS_COLAB = False

if IS_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    PROJECT_ROOT = Path('/content/drive/MyDrive/SIADS 692 Milestone II/Milestone II - NLP Cryptic Crossword Clues')
else:
    try:
        PROJECT_ROOT = Path(__file__).resolve().parent.parent
    except NameError:
        PROJECT_ROOT = Path.cwd().parent

DATA_DIR = PROJECT_ROOT / "data"

print(f'Project root: {PROJECT_ROOT}')
print(f'Data directory: {DATA_DIR}')
Project root: /Users/victoria/Desktop/MADS/ccc-project/indicator_clustering
Data directory: /Users/victoria/Desktop/MADS/ccc-project/indicator_clustering/data
In [3]:
# Connect to the sqlite3 file
data_file = str(DATA_DIR / "data.sqlite3")
conn = sqlite3.connect(data_file)
In [4]:
# Query the names of all tables in the database (uncomment `tables` below to display them)
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table';", conn)
#tables
In [5]:
# Keep track of all tables that might be of interest from the original dataset
# Display the names and sizes of all tables.

tables = [
    "clues",
    "indicators",
    "charades",
    "indicators_by_clue",
    "charades_by_clue",
    "indicators_consolidated"
]

summary = []

for t in tables:
    # count rows
    row_count = pd.read_sql(f"SELECT COUNT(*) AS n FROM {t};", conn).iloc[0]["n"]
    
    # count columns
    col_info = pd.read_sql(f"PRAGMA table_info({t});", conn)
    col_count = len(col_info)

    summary.append({
        "table": t,
        "rows": row_count,
        "columns": col_count
    })

summary_df = pd.DataFrame(summary)
summary_df.style.format({"rows": "{:,}"}) # display with commas 
Out[5]:
  table rows columns
0 clues 660,613 9
1 indicators 15,735 4
2 charades 57,289 4
3 indicators_by_clue 88,037 9
4 charades_by_clue 126,825 3
5 indicators_consolidated 1 8
In [6]:
# Create the dataframes related to indicators
df_indicators = pd.read_sql("SELECT * FROM indicators;", conn)
df_ind_by_clue = pd.read_sql("SELECT * FROM indicators_by_clue;", conn)
df_indicators_consolidated = pd.read_sql("SELECT * FROM indicators_consolidated;", conn)

# Create dataframes pertaining to clue and charade
df_clues = pd.read_sql("SELECT * FROM clues;", conn)
df_charades = pd.read_sql("SELECT * FROM charades;", conn)
df_charades_by_clue = pd.read_sql("SELECT * FROM charades_by_clue;", conn)
In [7]:
# Indicators
display(df_indicators.head(3))

# Rename columns for consistency across tables/dataframes
df_indicators = df_indicators.rename(columns={'rowid': 'ind_id', 'clue_rowids': 'clue_ids'})
display(df_indicators.head(3))
rowid wordplay indicator clue_rowids
0 1 alternation abnormal [623961](/data/clues/623961)
1 2 alternation after odd losses [139327](/data/clues/139327)
2 3 alternation after regular excisions [107211](/data/clues/107211)
ind_id wordplay indicator clue_ids
0 1 alternation abnormal [623961](/data/clues/623961)
1 2 alternation after odd losses [139327](/data/clues/139327)
2 3 alternation after regular excisions [107211](/data/clues/107211)
In [8]:
# Indicators by Clue
display(df_ind_by_clue.head(4))
print()

# Rename columns for consistency across tables/dataframes
df_ind_by_clue = df_ind_by_clue.rename(columns={'clue_rowid': 'clue_id'})
display(df_ind_by_clue.head(4))

# See how many contextualized indicators are in this table
#print("Instances of each CONTEXTUALIZED wordplay (multiple per clue, redundant indicators)")
#df_ind_by_clue.replace("", np.nan).count()
clue_rowid alternation anagram container deletion hidden homophone insertion reversal
0 90 transforming
1 97 ground
2 101 verbally
3 142 about to go back

clue_id alternation anagram container deletion hidden homophone insertion reversal
0 90 transforming
1 97 ground
2 101 verbally
3 142 about to go back
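The commented-out count in the cell above works because absent indicators are stored as empty strings: replacing `""` with `NaN` first makes `count()` return only the real entries per wordplay column. A small sketch on toy data (not the actual table):

```python
import numpy as np
import pandas as pd

# Toy slice shaped like indicators_by_clue: "" marks a missing indicator.
toy = pd.DataFrame({
    "clue_id": [90, 97, 101],
    "anagram": ["transforming", "ground", ""],
    "homophone": ["", "", "verbally"],
})

# Without the replace, count() would report 3 for every column,
# since empty strings are not NaN.
counts = toy.replace("", np.nan).count()
print(counts)  # clue_id 3, anagram 2, homophone 1
```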
In [9]:
# Indicators Consolidated
display(df_indicators_consolidated.head())
print()
alternation anagram container deletion hidden homophone insertion reversal
0 abnormal\nafter odd losses\nafter regular exci... a bad way\na bit differently\na brew of\na coc... a single\naboard\nabout\nabout t\nabout/confin... a certain amount off\nabandoned\nabandoned by\... a bit\na bit of\na bit of this\na bunch of\na ... a report on\naccording to announcement\naccord... a bit of\na contingent of\na feature of\na fic... a cleric raised\na fastening device put up\na ...

In [10]:
# Clues
display(df_clues.head(3))

# Rename columns for consistency across tables/dataframes
df_clues = df_clues.rename(columns={'rowid': 'clue_id'})
display(df_clues.head(3))
rowid clue answer definition clue_number puzzle_date puzzle_name source_url source
0 1 Acquisitive chap, as we see it (8) COVETOUS Acquisitive 1a 2019-08-08 Times 27424 https://times-xwd-times.livejournal.com/218581... times_xwd_times
1 2 Back yard fencing weak and sagging (6) DROOPY sagging 5a 2019-08-08 Times 27424 https://times-xwd-times.livejournal.com/218581... times_xwd_times
2 3 Stripping off uniform, love holding colonel's ... UNCLOTHING Stripping 8a 2019-08-08 Times 27424 https://times-xwd-times.livejournal.com/218581... times_xwd_times
clue_id clue answer definition clue_number puzzle_date puzzle_name source_url source
0 1 Acquisitive chap, as we see it (8) COVETOUS Acquisitive 1a 2019-08-08 Times 27424 https://times-xwd-times.livejournal.com/218581... times_xwd_times
1 2 Back yard fencing weak and sagging (6) DROOPY sagging 5a 2019-08-08 Times 27424 https://times-xwd-times.livejournal.com/218581... times_xwd_times
2 3 Stripping off uniform, love holding colonel's ... UNCLOTHING Stripping 8a 2019-08-08 Times 27424 https://times-xwd-times.livejournal.com/218581... times_xwd_times
In [11]:
# Charades
display(df_charades.head(3))

# Rename columns for consistency across tables/dataframes
df_charades = df_charades.rename(columns={'rowid': 'charade_id', 'answer':'charade_answer', 'clue_rowids': 'clue_ids'})
display(df_charades.head(3))
rowid charade answer clue_rowids
0 1 a ALPHA [108948](/data/clues/108948)
1 2 a AN [79856](/data/clues/79856), [162234](/data/clu...
2 3 a AR [384876](/data/clues/384876)
charade_id charade charade_answer clue_ids
0 1 a ALPHA [108948](/data/clues/108948)
1 2 a AN [79856](/data/clues/79856), [162234](/data/clu...
2 3 a AR [384876](/data/clues/384876)
In [12]:
# Charades by Clue
display(df_charades_by_clue.head(3))

# Rename columns for consistency across tables/dataframes
df_charades_by_clue = df_charades_by_clue.rename(columns={'clue_rowid': 'clue_id', 'answer':'charade_answer'})
display(df_charades_by_clue.head(3))
clue_rowid charade answer
0 1 chap COVE
1 1 as we see it TO US
2 21 give the thing acclaim LAUD IT
clue_id charade charade_answer
0 1 chap COVE
1 1 as we see it TO US
2 21 give the thing acclaim LAUD IT
In [13]:
# Write each dataframe to a CSV file in the data directory (without the index)
df_indicators.to_csv(DATA_DIR / "indicators_raw.csv", index=False)
df_ind_by_clue.to_csv(DATA_DIR / "indicators_by_clue_raw.csv", index=False)
df_indicators_consolidated.to_csv(DATA_DIR / "indicators_consolidated_raw.csv", index=False)
df_clues.to_csv(DATA_DIR / "clues_raw.csv", index=False)
df_charades.to_csv(DATA_DIR / "charades_raw.csv", index=False)
df_charades_by_clue.to_csv(DATA_DIR / "charades_by_clue_raw.csv", index=False)
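As a quick sanity check on the exports (a sketch, not part of the original notebook; `roundtrip_ok` and the toy frame are illustrative), each written CSV can be read back and its shape compared to the in-memory DataFrame:

```python
import tempfile
from pathlib import Path
import pandas as pd

def roundtrip_ok(df, path):
    """Write df to CSV without the index and verify the re-read shape matches."""
    df.to_csv(path, index=False)
    back = pd.read_csv(path)
    return back.shape == df.shape

tmp = Path(tempfile.mkdtemp())
toy = pd.DataFrame({"clue_id": [1, 2], "answer": ["COVETOUS", "DROOPY"]})
print(roundtrip_ok(toy, tmp / "clues_raw.csv"))  # True
```

A shape match will not catch dtype drift (e.g. integer IDs read back as floats when NaNs are present), but it does confirm no rows were dropped or duplicated in the round trip.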
In [ ]: