glazing.wordnet.converter

Converting WordNet database to JSON Lines.

converter

WordNet database file parser.

This module parses WordNet 3.1 database files: index files, data files, the sense index, and morphological exception files.

CLASS DESCRIPTION
WordNetConverter

Parse WordNet database files into JSON Lines format.

FUNCTION DESCRIPTION
parse_index_file

Parse WordNet index file (index.noun, index.verb, etc.).

parse_data_file

Parse WordNet data file (data.noun, data.verb, etc.).

parse_sense_index

Parse WordNet sense index file.

parse_exception_file

Parse morphological exception file.

Examples:

>>> from pathlib import Path
>>> from glazing.wordnet.converter import WordNetConverter
>>> converter = WordNetConverter()
>>> synsets = converter.parse_data_file("data.noun", "n")
>>> index_entries = converter.parse_index_file("index.verb", "v")
>>> # Convert entire WordNet database
>>> converter.convert_wordnet_database(
...     wordnet_dir="wordnet31/dict",
...     output_file="wordnet_jsonl/wordnet.jsonl"
... )

Classes

WordNetConverter

Parse WordNet database files into structured models.

Handles parsing of WordNet 3.1 database files including index files, data files, sense index, and morphological exception files.

METHOD DESCRIPTION
parse_data_file

Parse WordNet data file into list of Synset models.

parse_index_file

Parse WordNet index file into list of IndexEntry models.

parse_sense_index

Parse sense index file into list of Sense models.

parse_exception_file

Parse morphological exception file.

convert_wordnet_database

Convert entire WordNet database to JSON Lines.

Functions
convert_exceptions(wordnet_dir: Path | str, output_file: Path | str) -> int

Parse *.exc files and output ExceptionEntry objects to JSONL.

PARAMETER DESCRIPTION
wordnet_dir

Directory containing WordNet database files.

TYPE: Path | str

output_file

Output JSON Lines file path.

TYPE: Path | str

RETURNS DESCRIPTION
int

Number of exception entries written.
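
The write step of this method follows the usual JSON Lines pattern: one serialized object per line, with the entry count returned. A minimal sketch of that pattern, using only the standard library. `ExceptionRecord` and its field names are hypothetical stand-ins for `ExceptionEntry`, and `json.dumps` stands in for `model_dump_json()`:

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class ExceptionRecord:
    """Hypothetical stand-in for ExceptionEntry; field names are assumptions."""

    inflected_form: str
    base_forms: list[str]
    pos: str = ""


def write_jsonl(records: list[ExceptionRecord], output_file: Path) -> int:
    """Write one JSON object per line and return the number of records written."""
    # Create missing parent directories, as the method above does.
    output_file.parent.mkdir(parents=True, exist_ok=True)
    with output_file.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(asdict(record)) + "\n")
    return len(records)


records = [
    ExceptionRecord("geese", ["goose"], pos="n"),
    ExceptionRecord("went", ["go"], pos="v"),
]
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "nested" / "exceptions.jsonl"
    written = write_jsonl(records, out)
    lines = out.read_text(encoding="utf-8").splitlines()
```

Because each line is an independent JSON document, the output can be streamed or appended to without re-reading the whole file.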

Source code in src/glazing/wordnet/converter.py
def convert_exceptions(self, wordnet_dir: Path | str, output_file: Path | str) -> int:
    """Parse *.exc files and output ExceptionEntry objects to JSONL.

    Parameters
    ----------
    wordnet_dir : Path | str
        Directory containing WordNet database files.
    output_file : Path | str
        Output JSON Lines file path.

    Returns
    -------
    int
        Number of exception entries written.
    """
    wordnet_dir = Path(wordnet_dir)
    output_file = Path(output_file)

    output_file.parent.mkdir(parents=True, exist_ok=True)

    all_entries: list[ExceptionEntry] = []

    exc_files: list[tuple[str, WordNetPOS]] = [
        ("noun.exc", "n"),
        ("verb.exc", "v"),
        ("adj.exc", "a"),
        ("adv.exc", "r"),
    ]

    for exc_name, pos in exc_files:
        exc_path = wordnet_dir / exc_name
        if exc_path.exists():
            entries = self.parse_exception_file(exc_path)
            for entry in entries:
                entry.pos = pos
            all_entries.extend(entries)

    with output_file.open("w", encoding="utf-8") as f:
        for entry in all_entries:
            f.write(f"{entry.model_dump_json()}\n")

    return len(all_entries)
convert_sense_index(wordnet_dir: Path | str, output_file: Path | str) -> int

Parse index.sense and output Sense objects to JSONL.

PARAMETER DESCRIPTION
wordnet_dir

Directory containing WordNet database files.

TYPE: Path | str

output_file

Output JSON Lines file path.

TYPE: Path | str

RETURNS DESCRIPTION
int

Number of sense entries written.

RAISES DESCRIPTION
FileNotFoundError

If index.sense file does not exist.
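
A downstream consumer can read the resulting file back one line at a time. The sketch below assumes nothing about the real `Sense` schema: the field names and the counts in the sample records are invented for illustration.

```python
import io
import json
from collections.abc import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded object per non-empty line."""
    for raw in lines:
        line = raw.strip()
        if line:
            yield json.loads(line)


# Illustrative records only; field names are assumptions, not the Sense model.
sample = io.StringIO(
    '{"sense_key": "dog%1:05:00::", "sense_number": 1, "tag_count": 42}\n'
    "\n"
    '{"sense_key": "dog%2:38:00::", "sense_number": 1, "tag_count": 2}\n'
)
by_key = {record["sense_key"]: record for record in iter_jsonl(sample)}
```

Any open text file can be passed in place of the `io.StringIO` sample, since both iterate line by line.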

Source code in src/glazing/wordnet/converter.py
def convert_sense_index(self, wordnet_dir: Path | str, output_file: Path | str) -> int:
    """Parse index.sense and output Sense objects to JSONL.

    Parameters
    ----------
    wordnet_dir : Path | str
        Directory containing WordNet database files.
    output_file : Path | str
        Output JSON Lines file path.

    Returns
    -------
    int
        Number of sense entries written.

    Raises
    ------
    FileNotFoundError
        If index.sense file does not exist.
    """
    wordnet_dir = Path(wordnet_dir)
    output_file = Path(output_file)

    output_file.parent.mkdir(parents=True, exist_ok=True)

    sense_file = wordnet_dir / "index.sense"
    senses = self.parse_sense_index(sense_file)

    with output_file.open("w", encoding="utf-8") as f:
        for sense in senses:
            f.write(f"{sense.model_dump_json()}\n")

    return len(senses)
convert_wordnet_database(wordnet_dir: Path | str, output_file: Path | str) -> dict[str, int]

Convert entire WordNet database to JSON Lines.

PARAMETER DESCRIPTION
wordnet_dir

Directory containing WordNet database files.

TYPE: Path | str

output_file

Output JSON Lines file path.

TYPE: Path | str

RETURNS DESCRIPTION
dict[str, int]

Counts of processed items by file type.

RAISES DESCRIPTION
FileNotFoundError

If WordNet directory does not exist.
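
During enrichment, each word is matched against index.sense through a sense key rebuilt from the synset's fields. A minimal sketch of that construction, mirroring the format string in the source below. `build_sense_key` is a hypothetical helper name, not part of the library:

```python
# Mapping from ss_type letter to the number used inside a sense key.
SS_TYPE_NUM = {"n": 1, "v": 2, "a": 3, "r": 4, "s": 5}


def build_sense_key(lemma: str, ss_type: str, lex_filenum: int, lex_id: int) -> str:
    """Build lemma%ss_type:lex_filenum:lex_id:: with empty head_word and head_id."""
    ss_num = SS_TYPE_NUM.get(ss_type, 1)  # unknown types fall back to 1, as in the source
    return f"{lemma.lower()}%{ss_num}:{lex_filenum:02d}:{lex_id:02d}::"


key = build_sense_key("Dog", "n", 5, 0)
```

One caveat, as far as the WordNet sense-key format goes: keys for adjective satellites also carry a head word and head id in the last two fields, so a key built with those fields empty would likely not match satellite entries in index.sense.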

Source code in src/glazing/wordnet/converter.py
def convert_wordnet_database(
    self, wordnet_dir: Path | str, output_file: Path | str
) -> dict[str, int]:
    """Convert entire WordNet database to JSON Lines.

    Parameters
    ----------
    wordnet_dir : Path | str
        Directory containing WordNet database files.
    output_file : Path | str
        Output JSON Lines file path.

    Returns
    -------
    dict[str, int]
        Counts of processed items by file type.

    Raises
    ------
    FileNotFoundError
        If WordNet directory does not exist.
    """
    wordnet_dir = Path(wordnet_dir)
    output_file = Path(output_file)

    if not wordnet_dir.exists():
        msg = f"WordNet directory not found: {wordnet_dir}"
        raise FileNotFoundError(msg)

    output_file.parent.mkdir(parents=True, exist_ok=True)

    counts = {}
    all_synsets = []

    # Process data files
    pos_mappings: list[tuple[str, WordNetPOS]] = [
        ("noun", "n"),
        ("verb", "v"),
        ("adj", "a"),
        ("adv", "r"),
    ]
    for pos_name, pos_code in pos_mappings:
        data_file = wordnet_dir / f"data.{pos_name}"
        if data_file.exists():
            synsets = self.parse_data_file(data_file, pos_code)
            all_synsets.extend(synsets)
            counts[f"synsets_{pos_name}"] = len(synsets)

    # Parse supplementary files for enrichment
    framestext = self.parse_verb_framestext(wordnet_dir / "verb.Framestext")
    sents = self.parse_verb_sentences(wordnet_dir / "sents.vrb")

    # Build sense_key → (sense_number, tag_count) map from index.sense
    sense_map: dict[str, tuple[int, int]] = {}
    sense_index_file = wordnet_dir / "index.sense"
    if sense_index_file.exists():
        with sense_index_file.open("r", encoding="utf-8") as f:
            for line_raw in f:
                line = line_raw.strip()
                if not line:
                    continue
                parts = line.split()
                if len(parts) != 4:
                    continue
                try:
                    sk = parts[0]
                    sense_number = int(parts[2])
                    tag_count = int(parts[3])
                    sense_map[sk] = (sense_number, tag_count)
                except ValueError:
                    continue

    # Parse cntlist to enhance tag_count data
    cntlist = self.parse_cntlist(wordnet_dir / "cntlist")
    for sk, count in cntlist.items():
        if sk in sense_map:
            sn, _ = sense_map[sk]
            sense_map[sk] = (sn, count)
        else:
            sense_map[sk] = (0, count)

    # ss_type to number mapping for sense key construction
    ss_type_num_map: dict[str, int] = {
        "n": 1,
        "v": 2,
        "a": 3,
        "r": 4,
        "s": 5,
    }

    # Enrich synsets with sense data and verb frame templates
    for synset in all_synsets:
        ss_num = ss_type_num_map.get(synset.ss_type, 1)

        # Enrich words with sense_number and tag_count
        for word in synset.words:
            lemma_lower = word.lemma.lower()
            sense_key = f"{lemma_lower}%{ss_num}:{synset.lex_filenum:02d}:{word.lex_id:02d}::"
            if sense_key in sense_map:
                sn, tc = sense_map[sense_key]
                if sn > 0:
                    word.sense_number = sn
                word.tag_count = tc

        # Enrich verb frames with template and example_sentence
        if synset.frames:
            for frame in synset.frames:
                fn = frame.frame_number
                if fn in framestext:
                    frame.template = framestext[fn]
                if fn in sents:
                    frame.example_sentence = sents[fn]

    # Write all synsets to single output file
    with output_file.open("w", encoding="utf-8") as f:
        for synset in all_synsets:
            f.write(f"{synset.model_dump_json()}\n")

    counts["total_synsets"] = len(all_synsets)

    return counts
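
The cntlist merge step in the method above can be read in isolation: counts from cntlist override the tag count taken from index.sense, and keys seen only in cntlist get sense number 0. A small sketch of that rule, with a hypothetical function name and invented sample counts:

```python
def merge_tag_counts(
    sense_map: dict[str, tuple[int, int]], cntlist: dict[str, int]
) -> dict[str, tuple[int, int]]:
    """Return a copy of sense_map whose tag counts are replaced by cntlist values."""
    merged = dict(sense_map)
    for sense_key, count in cntlist.items():
        # Keep the known sense number, or use 0 when the key is new.
        sense_number, _ = merged.get(sense_key, (0, 0))
        merged[sense_key] = (sense_number, count)
    return merged


sense_map = {"dog%1:05:00::": (1, 40)}
cntlist = {"dog%1:05:00::": 42, "cat%1:05:00::": 3}
merged = merge_tag_counts(sense_map, cntlist)
```

A sense number of 0 acts as "unknown": the enrichment loop only assigns `word.sense_number` when the stored value is greater than 0.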
parse_cntlist(filepath: Path | str) -> dict[str, int]

Parse cntlist into a mapping of sense key to frequency count.

PARAMETER DESCRIPTION
filepath

Path to cntlist file.

TYPE: Path | str

RETURNS DESCRIPTION
dict[str, int]

Mapping from sense key to frequency count.
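
The parser reads the count from the first whitespace-separated field and the sense key from the second, and silently skips anything that does not fit. A sketch of that behavior on in-memory input. The counts are invented, and `parse_counts` is a hypothetical name:

```python
import io
from collections.abc import Iterable


def parse_counts(lines: Iterable[str]) -> dict[str, int]:
    """Map sense key (second field) to count (first field), skipping bad lines."""
    counts: dict[str, int] = {}
    for raw in lines:
        parts = raw.split()
        if len(parts) < 2:
            continue  # blank line or a line with a single field
        try:
            counts[parts[1]] = int(parts[0])
        except ValueError:
            continue  # first field is not an integer
    return counts


sample = io.StringIO(
    "15 dog%1:05:00:: 1\n"
    "\n"
    "bad dog%2:38:00:: 1\n"
    "3 cat%1:05:00:: 1\n"
    "lonely\n"
)
counts = parse_counts(sample)
```

Fields after the second one are ignored, and a missing file yields an empty mapping in the method itself.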

Source code in src/glazing/wordnet/converter.py
def parse_cntlist(self, filepath: Path | str) -> dict[str, int]:
    """Parse cntlist into a mapping of sense key to frequency count.

    Parameters
    ----------
    filepath : Path | str
        Path to cntlist file.

    Returns
    -------
    dict[str, int]
        Mapping from sense key to frequency count.
    """
    filepath = Path(filepath)
    if not filepath.exists():
        return {}

    counts: dict[str, int] = {}

    with filepath.open("r", encoding="utf-8") as f:
        for line_raw in f:
            line = line_raw.strip()
            if not line:
                continue

            parts = line.split()
            if len(parts) < 2:
                continue

            try:
                count = int(parts[0])
                sense_key = parts[1]
                counts[sense_key] = count
            except ValueError:
                continue

    return counts
parse_data_file(filepath: Path | str, pos: WordNetPOS) -> list[Synset]

Parse WordNet data file into list of Synset models.

PARAMETER DESCRIPTION
filepath

Path to WordNet data file (e.g., data.noun).

TYPE: Path | str

pos

Part of speech code used as a filter: only synsets whose ss_type equals this value are returned.

TYPE: WordNetPOS

RETURNS DESCRIPTION
list[Synset]

List of parsed Synset models.

RAISES DESCRIPTION
FileNotFoundError

If the data file does not exist.

ValueError

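If line format is invalid. In the source shown here, a ValueError raised while parsing a line is caught, reported with print, and the line is skipped.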
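
For orientation, a data line starts with the synset offset, lexicographer file number, synset type, and a hexadecimal word count, followed by lemma and lex_id pairs, a pointer count, and finally the gloss after a `|`. The sketch below parses only those leading parts. `parse_data_head` is a hypothetical helper, not the library's `_parse_data_line`, and the sample line is illustrative:

```python
def parse_data_head(line: str) -> dict:
    """Parse the leading fields, the word list, and the gloss of one data line."""
    body, _, gloss = line.partition(" | ")
    fields = body.split()
    w_cnt = int(fields[3], 16)  # two-digit hexadecimal word count
    words = [
        (fields[4 + 2 * i], int(fields[5 + 2 * i], 16))  # (lemma, lex_id)
        for i in range(w_cnt)
    ]
    return {
        "offset": fields[0],
        "lex_filenum": int(fields[1]),
        "ss_type": fields[2],
        "words": words,
        "p_cnt": int(fields[4 + 2 * w_cnt]),  # decimal pointer count
        "gloss": gloss.strip(),
    }


sample_lines = [
    "  1 This software and database is being provided to you, the LICENSEE\n",
    "00001740 03 n 01 entity 0 003 ~ 00001930 n 0000 ~ 00002137 n 0000 "
    "~ 04424418 n 0000 | that which is perceived or known or inferred to have "
    "its own distinct existence (living or nonliving)\n",
]
# Skip license-header lines (two leading spaces) and blank lines.
parsed = [
    parse_data_head(raw.strip())
    for raw in sample_lines
    if not raw.startswith("  ") and raw.strip()
]
```

Pointers and verb frames follow the pointer count and are left unparsed here.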

Source code in src/glazing/wordnet/converter.py
def parse_data_file(self, filepath: Path | str, pos: WordNetPOS) -> list[Synset]:
    """Parse WordNet data file into list of Synset models.

    Parameters
    ----------
    filepath : Path | str
        Path to WordNet data file (e.g., data.noun).
    pos : WordNetPOS
        Part of speech code used as a filter: only synsets whose ss_type
        equals this value are returned.

    Returns
    -------
    list[Synset]
        List of parsed Synset models.

    Raises
    ------
    FileNotFoundError
        If the data file does not exist.
    ValueError
        If line format is invalid.
    """
    filepath = Path(filepath)
    if not filepath.exists():
        msg = f"WordNet data file not found: {filepath}"
        raise FileNotFoundError(msg)

    synsets = []

    with filepath.open("r", encoding="utf-8") as f:
        for line_num, line_raw in enumerate(f, 1):
            # Skip license header (lines starting with two spaces)
            if line_raw.startswith("  "):
                continue

            line = line_raw.strip()
            if not line:
                continue

            try:
                synset = self._parse_data_line(line)
                if synset and synset.ss_type == pos:
                    synsets.append(synset)
            except ValueError as e:
                # Log parsing error but continue
                print(f"Error parsing line {line_num} in {filepath}: {e}")
                continue

    return synsets
parse_exception_file(filepath: Path | str) -> list[ExceptionEntry]

Parse morphological exception file.

PARAMETER DESCRIPTION
filepath

Path to exception file (e.g., verb.exc).

TYPE: Path | str

RETURNS DESCRIPTION
list[ExceptionEntry]

List of parsed ExceptionEntry models.

RAISES DESCRIPTION
FileNotFoundError

If the exception file does not exist.
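
Each line of an exception file lists an inflected form followed by one or more base forms, separated by spaces. A minimal sketch of splitting such a line. The helper name is hypothetical, and the second sample line is only illustrative of the multi-base case:

```python
def split_exception_line(line: str) -> tuple[str, list[str]]:
    """Return (inflected_form, base_forms) for one exception-file line."""
    parts = line.split()
    if len(parts) < 2:
        raise ValueError(f"expected an inflected form and a base form: {line!r}")
    return parts[0], parts[1:]


single = split_exception_line("geese goose")
multiple = split_exception_line("axes ax axis")  # illustrative two-base line
```

Raising `ValueError` for short lines matches the error type the method above catches and reports per line.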

Source code in src/glazing/wordnet/converter.py
def parse_exception_file(self, filepath: Path | str) -> list[ExceptionEntry]:
    """Parse morphological exception file.

    Parameters
    ----------
    filepath : Path | str
        Path to exception file (e.g., verb.exc).

    Returns
    -------
    list[ExceptionEntry]
        List of parsed ExceptionEntry models.

    Raises
    ------
    FileNotFoundError
        If the exception file does not exist.
    """
    filepath = Path(filepath)
    if not filepath.exists():
        msg = f"WordNet exception file not found: {filepath}"
        raise FileNotFoundError(msg)

    entries = []

    with filepath.open("r", encoding="utf-8") as f:
        for line_num, line_raw in enumerate(f, 1):
            # Skip license header
            if line_raw.startswith("  "):
                continue

            line = line_raw.strip()
            if not line:
                continue

            try:
                entry = self._parse_exception_line(line)
                if entry:
                    entries.append(entry)
            except ValueError as e:
                print(f"Error parsing line {line_num} in {filepath}: {e}")
                continue

    return entries
parse_index_file(filepath: Path | str, pos: WordNetPOS) -> list[IndexEntry]

Parse WordNet index file into list of IndexEntry models.

PARAMETER DESCRIPTION
filepath

Path to WordNet index file (e.g., index.noun).

TYPE: Path | str

pos

Part of speech for validation.

TYPE: WordNetPOS

RETURNS DESCRIPTION
list[IndexEntry]

List of parsed IndexEntry models.

RAISES DESCRIPTION
FileNotFoundError

If the index file does not exist.

ValueError

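If line format is invalid. In the source shown here, a ValueError raised while parsing a line is caught, reported with print, and the line is skipped.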
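
An index line has a variable-length layout: lemma, part of speech, synset count, pointer count, that many pointer symbols, then sense count, tagged-sense count, and the synset offsets. The sketch below shows one way to walk it. It is not the library's `_parse_index_line`, and the sample line uses synthetic offsets:

```python
def split_index_line(line: str) -> dict:
    """Split one index line into its named parts."""
    fields = line.split()
    synset_cnt = int(fields[2])
    p_cnt = int(fields[3])
    ptr_symbols = fields[4 : 4 + p_cnt]  # exactly p_cnt pointer symbols
    rest = fields[4 + p_cnt :]
    offsets = rest[2:]
    if len(offsets) != synset_cnt:
        raise ValueError(f"expected {synset_cnt} offsets, found {len(offsets)}")
    return {
        "lemma": fields[0],
        "pos": fields[1],
        "ptr_symbols": ptr_symbols,
        "sense_cnt": int(rest[0]),
        "tagsense_cnt": int(rest[1]),
        "synset_offsets": offsets,
    }


# Synthetic line: two synsets, two pointer symbols, one tagged sense.
entry = split_index_line("example n 2 2 @ ~ 2 1 00000001 00000002")
```

Checking the offset count against `synset_cnt` is a cheap way to detect a truncated or misaligned line.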

Source code in src/glazing/wordnet/converter.py
def parse_index_file(self, filepath: Path | str, pos: WordNetPOS) -> list[IndexEntry]:
    """Parse WordNet index file into list of IndexEntry models.

    Parameters
    ----------
    filepath : Path | str
        Path to WordNet index file (e.g., index.noun).
    pos : WordNetPOS
        Part of speech for validation.

    Returns
    -------
    list[IndexEntry]
        List of parsed IndexEntry models.

    Raises
    ------
    FileNotFoundError
        If the index file does not exist.
    ValueError
        If line format is invalid.
    """
    filepath = Path(filepath)
    if not filepath.exists():
        msg = f"WordNet index file not found: {filepath}"
        raise FileNotFoundError(msg)

    entries = []

    with filepath.open("r", encoding="utf-8") as f:
        for line_num, line_raw in enumerate(f, 1):
            # Skip license header (lines starting with two spaces)
            if line_raw.startswith("  "):
                continue

            line = line_raw.strip()
            if not line:
                continue

            try:
                entry = self._parse_index_line(line, pos)
                if entry:
                    entries.append(entry)
            except ValueError as e:
                print(f"Error parsing line {line_num} in {filepath}: {e}")
                continue

    return entries
parse_sense_index(filepath: Path | str) -> list[Sense]

Parse WordNet sense index file.

PARAMETER DESCRIPTION
filepath

Path to sense index file (index.sense).

TYPE: Path | str

RETURNS DESCRIPTION
list[Sense]

List of parsed Sense models.

RAISES DESCRIPTION
FileNotFoundError

If the sense index file does not exist.
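
Each index.sense line holds four fields: sense key, synset offset, sense number, and tag count. The sense key itself splits into a lemma and five colon-separated parts. A sketch under those assumptions, with a hypothetical helper name and an illustrative sample line:

```python
def split_sense_line(line: str) -> dict:
    """Split one index.sense line and decompose its sense key."""
    parts = line.split()
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, found {len(parts)}")
    sense_key, synset_offset, sense_number, tag_cnt = parts
    lemma, _, lex_sense = sense_key.partition("%")
    ss_type, lex_filenum, lex_id, head_word, head_id = lex_sense.split(":")
    return {
        "sense_key": sense_key,
        "lemma": lemma,
        "ss_type": int(ss_type),
        "lex_filenum": int(lex_filenum),
        "lex_id": int(lex_id),
        "head_word": head_word,  # empty unless the sense is an adjective satellite
        "head_id": head_id,
        "synset_offset": synset_offset,
        "sense_number": int(sense_number),
        "tag_count": int(tag_cnt),
    }


sense = split_sense_line("dog%1:05:00:: 02084071 1 42")  # illustrative values
```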

Source code in src/glazing/wordnet/converter.py
def parse_sense_index(self, filepath: Path | str) -> list[Sense]:
    """Parse WordNet sense index file.

    Parameters
    ----------
    filepath : Path | str
        Path to sense index file (index.sense).

    Returns
    -------
    list[Sense]
        List of parsed Sense models.

    Raises
    ------
    FileNotFoundError
        If the sense index file does not exist.
    """
    filepath = Path(filepath)
    if not filepath.exists():
        msg = f"WordNet sense index file not found: {filepath}"
        raise FileNotFoundError(msg)

    senses = []

    with filepath.open("r", encoding="utf-8") as f:
        for line_num, line_raw in enumerate(f, 1):
            # Skip license header
            if line_raw.startswith("  "):
                continue

            line = line_raw.strip()
            if not line:
                continue

            try:
                sense = self._parse_sense_line(line)
                if sense:
                    senses.append(sense)
            except ValueError as e:
                print(f"Error parsing line {line_num} in {filepath}: {e}")
                continue

    return senses
parse_verb_framestext(filepath: Path | str) -> dict[int, str]

Parse verb.Framestext into a mapping of frame number to template string.

PARAMETER DESCRIPTION
filepath

Path to verb.Framestext file.

TYPE: Path | str

RETURNS DESCRIPTION
dict[int, str]

Mapping from frame number to template string.
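
The key detail in this parser is `split(None, 1)`: it splits on the first run of whitespace only, so the template keeps its internal spaces. A short sketch on in-memory lines, with a hypothetical function name:

```python
import io
from collections.abc import Iterable


def parse_numbered_lines(lines: Iterable[str]) -> dict[int, str]:
    """Map a leading integer to the rest of the line, skipping malformed lines."""
    result: dict[int, str] = {}
    for raw in lines:
        # maxsplit=1 separates the number from the untouched remainder.
        parts = raw.strip().split(None, 1)
        if len(parts) < 2:
            continue
        try:
            result[int(parts[0])] = parts[1]
        except ValueError:
            continue
    return result


sample = io.StringIO(
    "1 Something ----s\n"
    "2 Somebody ----s\n"
    "\n"
    "x not numbered\n"
    "8 Somebody ----s something\n"
)
frames = parse_numbered_lines(sample)
```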

Source code in src/glazing/wordnet/converter.py
def parse_verb_framestext(self, filepath: Path | str) -> dict[int, str]:
    """Parse verb.Framestext into a mapping of frame number to template string.

    Parameters
    ----------
    filepath : Path | str
        Path to verb.Framestext file.

    Returns
    -------
    dict[int, str]
        Mapping from frame number to template string.
    """
    filepath = Path(filepath)
    if not filepath.exists():
        return {}

    frames: dict[int, str] = {}

    with filepath.open("r", encoding="utf-8") as f:
        for line_raw in f:
            line = line_raw.strip()
            if not line:
                continue

            parts = line.split(None, 1)
            if len(parts) < 2:
                continue

            try:
                frame_num = int(parts[0])
                template = parts[1]
                frames[frame_num] = template
            except ValueError:
                continue

    return frames
parse_verb_sentences(filepath: Path | str) -> dict[int, str]

Parse sents.vrb into a mapping of frame number to example sentence.

PARAMETER DESCRIPTION
filepath

Path to sents.vrb file.

TYPE: Path | str

RETURNS DESCRIPTION
dict[int, str]

Mapping from frame number to example sentence.
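
Example sentences in sents.vrb contain a `%s` placeholder where the verb goes. Assuming the returned mapping keeps that placeholder verbatim, a caller could fill it in as sketched below. `fill_example` is a hypothetical helper, not part of the library:

```python
def fill_example(sentence: str, verb_form: str) -> str:
    """Replace the first %s placeholder with the given verb form."""
    # str.replace avoids %-formatting errors if the sentence has other % signs.
    return sentence.replace("%s", verb_form, 1)


sentences = {1: "The children %s to the playground"}
filled = fill_example(sentences[1], "run")
```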

Source code in src/glazing/wordnet/converter.py
def parse_verb_sentences(self, filepath: Path | str) -> dict[int, str]:
    """Parse sents.vrb into a mapping of frame number to example sentence.

    Parameters
    ----------
    filepath : Path | str
        Path to sents.vrb file.

    Returns
    -------
    dict[int, str]
        Mapping from frame number to example sentence.
    """
    filepath = Path(filepath)
    if not filepath.exists():
        return {}

    sentences: dict[int, str] = {}

    with filepath.open("r", encoding="utf-8") as f:
        for line_raw in f:
            line = line_raw.strip()
            if not line:
                continue

            parts = line.split(None, 1)
            if len(parts) < 2:
                continue

            try:
                sent_num = int(parts[0])
                sentence = parts[1]
                sentences[sent_num] = sentence
            except ValueError:
                continue

    return sentences

Functions

convert_wordnet_database(wordnet_dir: Path | str, output_file: Path | str) -> dict[str, int]

Convert entire WordNet database to JSON Lines.

PARAMETER DESCRIPTION
wordnet_dir

WordNet database directory.

TYPE: Path | str

output_file

Output JSON Lines file path.

TYPE: Path | str

RETURNS DESCRIPTION
dict[str, int]

Processing counts by file type.

Source code in src/glazing/wordnet/converter.py
def convert_wordnet_database(wordnet_dir: Path | str, output_file: Path | str) -> dict[str, int]:
    """Convert entire WordNet database to JSON Lines.

    Parameters
    ----------
    wordnet_dir : Path | str
        WordNet database directory.
    output_file : Path | str
        Output JSON Lines file path.

    Returns
    -------
    dict[str, int]
        Processing counts by file type.
    """
    converter = WordNetConverter()
    return converter.convert_wordnet_database(wordnet_dir, output_file)

parse_data_file(filepath: Path | str, pos: WordNetPOS) -> list[Synset]

Parse WordNet data file into list of Synset models.

PARAMETER DESCRIPTION
filepath

Path to WordNet data file.

TYPE: Path | str

pos

Part of speech code.

TYPE: WordNetPOS

RETURNS DESCRIPTION
list[Synset]

List of parsed synsets.

Source code in src/glazing/wordnet/converter.py
def parse_data_file(filepath: Path | str, pos: WordNetPOS) -> list[Synset]:
    """Parse WordNet data file into list of Synset models.

    Parameters
    ----------
    filepath : Path | str
        Path to WordNet data file.
    pos : WordNetPOS
        Part of speech code.

    Returns
    -------
    list[Synset]
        List of parsed synsets.
    """
    converter = WordNetConverter()
    return converter.parse_data_file(filepath, pos)

parse_exception_file(filepath: Path | str) -> list[ExceptionEntry]

Parse morphological exception file.

PARAMETER DESCRIPTION
filepath

Path to exception file.

TYPE: Path | str

RETURNS DESCRIPTION
list[ExceptionEntry]

List of parsed exception entries.

Source code in src/glazing/wordnet/converter.py
def parse_exception_file(filepath: Path | str) -> list[ExceptionEntry]:
    """Parse morphological exception file.

    Parameters
    ----------
    filepath : Path | str
        Path to exception file.

    Returns
    -------
    list[ExceptionEntry]
        List of parsed exception entries.
    """
    converter = WordNetConverter()
    return converter.parse_exception_file(filepath)

parse_index_file(filepath: Path | str, pos: WordNetPOS) -> list[IndexEntry]

Parse WordNet index file into list of IndexEntry models.

PARAMETER DESCRIPTION
filepath

Path to WordNet index file.

TYPE: Path | str

pos

Part of speech code.

TYPE: WordNetPOS

RETURNS DESCRIPTION
list[IndexEntry]

List of parsed index entries.

Source code in src/glazing/wordnet/converter.py
def parse_index_file(filepath: Path | str, pos: WordNetPOS) -> list[IndexEntry]:
    """Parse WordNet index file into list of IndexEntry models.

    Parameters
    ----------
    filepath : Path | str
        Path to WordNet index file.
    pos : WordNetPOS
        Part of speech code.

    Returns
    -------
    list[IndexEntry]
        List of parsed index entries.
    """
    converter = WordNetConverter()
    return converter.parse_index_file(filepath, pos)

parse_sense_index(filepath: Path | str) -> list[Sense]

Parse WordNet sense index file.

PARAMETER DESCRIPTION
filepath

Path to sense index file.

TYPE: Path | str

RETURNS DESCRIPTION
list[Sense]

List of parsed senses.

Source code in src/glazing/wordnet/converter.py
def parse_sense_index(filepath: Path | str) -> list[Sense]:
    """Parse WordNet sense index file.

    Parameters
    ----------
    filepath : Path | str
        Path to sense index file.

    Returns
    -------
    list[Sense]
        List of parsed senses.
    """
    converter = WordNetConverter()
    return converter.parse_sense_index(filepath)