glazing.wordnet.loader

Loading WordNet data from JSON Lines.

loader

WordNet database loader with index building and caching.

This module provides functionality to load WordNet data from JSON Lines files, build efficient indices for fast lookups, and construct relation graphs for traversal operations.

CLASS DESCRIPTION
WordNetLoader

Loads and indexes a WordNet database from JSON Lines files; data is loaded on initialization by default.

FUNCTION DESCRIPTION
load_wordnet

Load a complete WordNet database from JSON Lines files.

Examples:

>>> from glazing.wordnet.loader import WordNetLoader
>>> # Data loads automatically on initialization
>>> loader = WordNetLoader()
>>> synset = loader.get_synset("00001740")
>>> senses = loader.get_senses_by_lemma("dog", pos="n")
>>>
>>> # Or disable autoload for manual control
>>> loader = WordNetLoader(autoload=False)
>>> loader.load()  # Load manually when needed
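
The workflow described for this module (read one JSON object per line, then index the records for fast lookup) can be sketched without the library. This is a minimal illustration, not the loader's implementation: the record fields offset and gloss mirror attribute names used in the examples on this page, but the real JSONL schema may differ, and the sample records and the index_by_offset helper are made up for the demonstration.

```python
import io
import json

# Hypothetical two-record JSONL payload; the real schema may differ.
SAMPLE_JSONL = (
    '{"offset": "00000001", "gloss": "placeholder gloss A"}\n'
    '\n'
    '{"offset": "00000002", "gloss": "placeholder gloss B"}\n'
)


def index_by_offset(stream):
    """Parse one JSON object per line and key each record by its offset."""
    index = {}
    for line in stream:
        line = line.strip()
        if not line:  # tolerate blank lines between records
            continue
        record = json.loads(line)
        index[record["offset"]] = record
    return index


synsets = index_by_offset(io.StringIO(SAMPLE_JSONL))
print(sorted(synsets))
```

Keying by offset makes a lookup such as synsets.get("00000002") a single dictionary access, which is the same shape as the synsets attribute documented below.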

Classes

WordNetLoader(data_path: Path | str | None = None, lazy: bool = False, autoload: bool = True, cache_size: int = 1000)

Load and index a WordNet database from JSON Lines files, loading data on initialization by default.

This class provides efficient loading and indexing of WordNet data, including synsets, senses, and morphological exceptions. It builds multiple indices for fast lookups and supports lazy loading of large datasets. By default, data is loaded automatically on initialization.

PARAMETER DESCRIPTION
data_path

Path to the WordNet JSONL file (e.g., wordnet.jsonl). If None, the path is resolved by get_default_data_path("wordnet.jsonl").

TYPE: Path | str | None DEFAULT: None

lazy

If True, load synsets on demand rather than all at once.

TYPE: bool DEFAULT: False

autoload

Whether to automatically load data on initialization. Only applies when lazy=False.

TYPE: bool DEFAULT: True

cache_size

Number of synsets to cache when using lazy loading.

TYPE: int DEFAULT: 1000

ATTRIBUTE DESCRIPTION
synsets

All loaded synsets indexed by offset.

TYPE: dict[SynsetOffset, Synset]

lemma_index

Index from lemmas to synset offsets by POS.

TYPE: dict[str, dict[WordNetPOS, list[SynsetOffset]]]

sense_index

Index from sense keys to sense objects.

TYPE: dict[SenseKey, Sense]

exceptions

Morphological exceptions by POS.

TYPE: dict[WordNetPOS, dict[str, list[str]]]

METHOD DESCRIPTION
load

Load all WordNet data from JSON Lines files.

get_synset

Get a synset by its offset.

get_senses_by_lemma

Get all senses for a lemma and optional POS.

get_sense_by_key

Get a sense by its unique sense key.

Examples:

>>> # Automatic loading (default)
>>> loader = WordNetLoader()
>>> dog_synsets = loader.get_synsets_by_lemma("dog", "n")
>>> for synset in dog_synsets:
...     print(f"{synset.offset}: {synset.gloss}")
>>> # Manual loading
>>> loader = WordNetLoader(autoload=False)
>>> loader.load()
>>> synsets = loader.synsets  # Now accessible

Initialize WordNet loader.

PARAMETER DESCRIPTION
data_path

Path to the WordNet JSONL file (e.g., wordnet.jsonl). If None, the path is resolved by get_default_data_path("wordnet.jsonl").

TYPE: Path | str | None DEFAULT: None

lazy

If True, load synsets on demand.

TYPE: bool DEFAULT: False

autoload

Whether to automatically load data on initialization. Only applies when lazy=False.

TYPE: bool DEFAULT: True

cache_size

Size of LRU cache for lazy loading.

TYPE: int DEFAULT: 1000
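
To make the cache_size parameter concrete, here is a minimal sketch of least-recently-used eviction. It assumes nothing about the library's own LRUCache class; TinyLRU and its get/put methods are hypothetical names used only for illustration.

```python
from collections import OrderedDict


class TinyLRU:
    """Illustrative LRU cache; not the library's LRUCache."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used entry


cache = TinyLRU(max_size=2)
cache.put("00000001", "synset A")
cache.put("00000002", "synset B")
cache.get("00000001")              # touch A so B becomes the eviction candidate
cache.put("00000003", "synset C")  # cache is full, so B is evicted
evicted = cache.get("00000002")
print(evicted)
```

A larger cache_size trades memory for fewer repeated reads when the same synsets are requested often.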

Source code in src/glazing/wordnet/loader.py
def __init__(
    self,
    data_path: Path | str | None = None,
    lazy: bool = False,
    autoload: bool = True,
    cache_size: int = 1000,
) -> None:
    """Initialize WordNet loader.

    Parameters
    ----------
    data_path : Path | str | None, optional
        Path to the WordNet JSONL file (e.g., wordnet.jsonl).
        If None, uses default path from environment.
    lazy : bool, default=False
        If True, load synsets on demand.
    autoload : bool, default=True
        Whether to automatically load data on initialization.
        Only applies when lazy=False.
    cache_size : int, default=1000
        Size of LRU cache for lazy loading.
    """
    if data_path is None:
        data_path = get_default_data_path("wordnet.jsonl")
    self.data_path = Path(data_path)
    self.lazy = lazy
    self.cache_size = cache_size

    # Core data structures
    self.synsets: dict[SynsetOffset, Synset] = {}
    self.lemma_index: dict[str, dict[WordNetPOS, list[SynsetOffset]]] = defaultdict(dict)
    self.sense_index: dict[SenseKey, Sense] = {}
    self.exceptions: dict[WordNetPOS, dict[str, list[str]]] = {}

    # Relation indices for efficient traversal
    self.hypernym_index: dict[SynsetOffset, list[SynsetOffset]] = defaultdict(list)
    self.hyponym_index: dict[SynsetOffset, list[SynsetOffset]] = defaultdict(list)
    self.meronym_index: dict[SynsetOffset, list[SynsetOffset]] = defaultdict(list)
    self.holonym_index: dict[SynsetOffset, list[SynsetOffset]] = defaultdict(list)

    # File index for lazy loading (offset -> byte position in file)
    self._synset_file_index: dict[SynsetOffset, int] = {}

    # Cache for lazy loading
    if lazy:
        self._cache: LRUCache[Synset] | None = LRUCache(cache_size)
    else:
        self._cache = None

    # Track loaded state
    self._loaded = False

    # Autoload data if requested and not lazy loading
    if autoload and not lazy:
        self.load()
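
The comment on _synset_file_index describes a mapping from offset to byte position in the file. One way such an index can support on-demand reads is sketched below. This is an assumption-laden illustration, not the code behind _build_file_index or _load_synset_lazy: the helper names, the record fields, and the in-memory sample data are all hypothetical.

```python
import io
import json


def build_byte_index(stream):
    """Map each record's offset field to the byte position where its line starts."""
    index = {}
    while True:
        position = stream.tell()
        line = stream.readline()
        if not line:  # end of file
            break
        if line.strip():
            index[json.loads(line)["offset"]] = position
    return index


def read_record(stream, index, offset):
    """Seek straight to one record instead of scanning the whole file."""
    position = index.get(offset)
    if position is None:
        return None
    stream.seek(position)
    return json.loads(stream.readline())


# In-memory stand-in for a JSONL file opened in binary mode.
data = io.BytesIO(
    b'{"offset": "00000001", "gloss": "first"}\n'
    b'{"offset": "00000002", "gloss": "second"}\n'
)
byte_index = build_byte_index(data)
record = read_record(data, byte_index, "00000002")
print(record["gloss"])
```

Reading in binary mode keeps tell() and seek() positions exact byte counts, which is why the sketch uses BytesIO rather than a text stream.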
Functions
get_exceptions(pos: WordNetPOS) -> dict[str, list[str]]

Get morphological exceptions for a POS.

PARAMETER DESCRIPTION
pos

The part of speech.

TYPE: WordNetPOS

RETURNS DESCRIPTION
dict[str, list[str]]

Mapping from inflected forms to base forms.
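
A short sketch of how a mapping of this shape might be consumed. The exception table below is hypothetical sample data, and base_forms is an illustrative helper rather than part of the library. It only consults the exception list; a full lemmatizer would also apply regular suffix rules.

```python
# Hypothetical exception table shaped like dict[pos][inflected form] -> base forms.
exceptions = {"n": {"geese": ["goose"], "mice": ["mouse"]}}


def base_forms(word, pos):
    """Return the listed irregular base forms, or the word unchanged if none are listed."""
    return exceptions.get(pos, {}).get(word, [word])


print(base_forms("geese", "n"))
print(base_forms("dog", "n"))
```

The chained .get calls mirror the method's own fallback to an empty dict when no exceptions exist for the requested part of speech.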

Source code in src/glazing/wordnet/loader.py
def get_exceptions(self, pos: WordNetPOS) -> dict[str, list[str]]:
    """Get morphological exceptions for a POS.

    Parameters
    ----------
    pos : WordNetPOS
        The part of speech.

    Returns
    -------
    dict[str, list[str]]
        Mapping from inflected forms to base forms.
    """
    return self.exceptions.get(pos, {})
get_holonyms(synset: Synset) -> list[Synset]

Get all holonyms (wholes) of a synset.

PARAMETER DESCRIPTION
synset

The synset to get holonyms for.

TYPE: Synset

RETURNS DESCRIPTION
list[Synset]

List of holonym synsets.

Source code in src/glazing/wordnet/loader.py
def get_holonyms(self, synset: Synset) -> list[Synset]:
    """Get all holonyms (wholes) of a synset.

    Parameters
    ----------
    synset : Synset
        The synset to get holonyms for.

    Returns
    -------
    list[Synset]
        List of holonym synsets.
    """
    holonyms = []
    for offset in self.holonym_index.get(synset.offset, []):
        holonym = self.get_synset(offset)
        if holonym:
            holonyms.append(holonym)
    return holonyms
get_hypernyms(synset: Synset) -> list[Synset]

Get direct hypernyms of a synset.

PARAMETER DESCRIPTION
synset

The synset to get hypernyms for.

TYPE: Synset

RETURNS DESCRIPTION
list[Synset]

List of hypernym synsets.
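
Because this method returns only direct hypernyms, collecting every ancestor requires repeated lookups. The sketch below shows one possible breadth-first traversal over an index shaped like hypernym_index. The sample index and the all_hypernyms helper are hypothetical; the library may offer its own traversal utilities.

```python
from collections import deque

# Hypothetical hypernym index: offset -> offsets of direct hypernyms.
hypernym_index = {
    "00000004": ["00000003", "00000002"],
    "00000003": ["00000001"],
    "00000002": ["00000001"],
}


def all_hypernyms(start, index):
    """Collect every ancestor of start in breadth-first order, visiting each once."""
    seen = set()
    order = []
    queue = deque(index.get(start, []))
    while queue:
        offset = queue.popleft()
        if offset in seen:  # reached again through another path
            continue
        seen.add(offset)
        order.append(offset)
        queue.extend(index.get(offset, []))
    return order


ancestors = all_hypernyms("00000004", hypernym_index)
print(ancestors)
```

The seen set matters because a synset can have several hypernyms that share a common ancestor, as the two paths to 00000001 do here.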

Source code in src/glazing/wordnet/loader.py
def get_hypernyms(self, synset: Synset) -> list[Synset]:
    """Get direct hypernyms of a synset.

    Parameters
    ----------
    synset : Synset
        The synset to get hypernyms for.

    Returns
    -------
    list[Synset]
        List of hypernym synsets.
    """
    hypernyms = []
    for offset in self.hypernym_index.get(synset.offset, []):
        hypernym = self.get_synset(offset)
        if hypernym:
            hypernyms.append(hypernym)
    return hypernyms
get_hyponyms(synset: Synset) -> list[Synset]

Get direct hyponyms of a synset.

PARAMETER DESCRIPTION
synset

The synset to get hyponyms for.

TYPE: Synset

RETURNS DESCRIPTION
list[Synset]

List of hyponym synsets.

Source code in src/glazing/wordnet/loader.py
def get_hyponyms(self, synset: Synset) -> list[Synset]:
    """Get direct hyponyms of a synset.

    Parameters
    ----------
    synset : Synset
        The synset to get hyponyms for.

    Returns
    -------
    list[Synset]
        List of hyponym synsets.
    """
    hyponyms = []
    for offset in self.hyponym_index.get(synset.offset, []):
        hyponym = self.get_synset(offset)
        if hyponym:
            hyponyms.append(hyponym)
    return hyponyms
get_meronyms(synset: Synset) -> list[Synset]

Get all meronyms (parts) of a synset.

PARAMETER DESCRIPTION
synset

The synset to get meronyms for.

TYPE: Synset

RETURNS DESCRIPTION
list[Synset]

List of meronym synsets.

Source code in src/glazing/wordnet/loader.py
def get_meronyms(self, synset: Synset) -> list[Synset]:
    """Get all meronyms (parts) of a synset.

    Parameters
    ----------
    synset : Synset
        The synset to get meronyms for.

    Returns
    -------
    list[Synset]
        List of meronym synsets.
    """
    meronyms = []
    for offset in self.meronym_index.get(synset.offset, []):
        meronym = self.get_synset(offset)
        if meronym:
            meronyms.append(meronym)
    return meronyms
get_sense_by_key(sense_key: SenseKey) -> Sense | None

Get a sense by its unique sense key.

PARAMETER DESCRIPTION
sense_key

The unique sense key.

TYPE: SenseKey

RETURNS DESCRIPTION
Sense | None

The sense or None if not found.

Examples:

>>> sense = loader.get_sense_by_key("dog%1:05:00::")
>>> print(sense.synset_offset)
Source code in src/glazing/wordnet/loader.py
def get_sense_by_key(self, sense_key: SenseKey) -> Sense | None:
    """Get a sense by its unique sense key.

    Parameters
    ----------
    sense_key : SenseKey
        The unique sense key.

    Returns
    -------
    Sense | None
        The sense or None if not found.

    Examples
    --------
    >>> sense = loader.get_sense_by_key("dog%1:05:00::")
    >>> print(sense.synset_offset)
    """
    return self.sense_index.get(sense_key)
get_senses_by_lemma(lemma: str, pos: WordNetPOS | None = None) -> list[Sense]

Get all senses for a lemma.

PARAMETER DESCRIPTION
lemma

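The word lemma to search for. Matching is exact and case-sensitive: the source compares sense.lemma == lemma without lowercasing.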

TYPE: str

pos

Part of speech filter.

TYPE: WordNetPOS | None DEFAULT: None

RETURNS DESCRIPTION
list[Sense]

List of senses for the lemma, sorted by sense number.

Examples:

>>> senses = loader.get_senses_by_lemma("run", "v")
>>> for sense in senses:
...     print(f"{sense.sense_key}: {sense.sense_number}")
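
The filter-then-sort behaviour can be reproduced on stand-in data. FakeSense and senses_for below are hypothetical; only the attribute names lemma, ss_type, and sense_number are taken from the source listing.

```python
from dataclasses import dataclass


@dataclass
class FakeSense:
    """Stand-in carrying only the attributes the filter and sort rely on."""
    lemma: str
    ss_type: str
    sense_number: int


senses = [
    FakeSense("run", "v", 2),
    FakeSense("run", "n", 1),
    FakeSense("run", "v", 1),
    FakeSense("walk", "v", 1),
]


def senses_for(lemma, pos=None):
    """Linear scan with an exact lemma match, then sort by sense number."""
    matches = [
        s for s in senses
        if s.lemma == lemma and (pos is None or s.ss_type == pos)
    ]
    matches.sort(key=lambda s: s.sense_number)
    return matches


print([s.sense_number for s in senses_for("run", "v")])
print(senses_for("Run"))  # exact match, so a capitalised query finds nothing
```

The scan visits every sense on each call, so repeated queries over a large sense index may be worth caching on the caller's side.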
Source code in src/glazing/wordnet/loader.py
def get_senses_by_lemma(self, lemma: str, pos: WordNetPOS | None = None) -> list[Sense]:
    """Get all senses for a lemma.

    Parameters
    ----------
    lemma : str
        The word lemma to search for.
    pos : WordNetPOS | None, default=None
        Part of speech filter.

    Returns
    -------
    list[Sense]
        List of senses for the lemma, sorted by sense number.

    Examples
    --------
    >>> senses = loader.get_senses_by_lemma("run", "v")
    >>> for sense in senses:
    ...     print(f"{sense.sense_key}: {sense.sense_number}")
    """
    senses = []

    for sense in self.sense_index.values():
        if sense.lemma == lemma and (pos is None or sense.ss_type == pos):
            senses.append(sense)

    # Sort by sense number (frequency order)
    senses.sort(key=lambda s: s.sense_number)

    return senses
get_synset(offset: SynsetOffset) -> Synset | None

Get a synset by its offset.

PARAMETER DESCRIPTION
offset

The 8-digit synset offset.

TYPE: SynsetOffset

RETURNS DESCRIPTION
Synset | None

The synset or None if not found.

Examples:

>>> synset = loader.get_synset("02084442")
>>> print(synset.gloss)
Source code in src/glazing/wordnet/loader.py
def get_synset(self, offset: SynsetOffset) -> Synset | None:
    """Get a synset by its offset.

    Parameters
    ----------
    offset : SynsetOffset
        The 8-digit synset offset.

    Returns
    -------
    Synset | None
        The synset or None if not found.

    Examples
    --------
    >>> synset = loader.get_synset("02084442")
    >>> print(synset.gloss)
    """
    if self.lazy:
        return self._load_synset_lazy(offset)
    return self.synsets.get(offset)
get_synsets_by_lemma(lemma: str, pos: WordNetPOS | None = None) -> list[Synset]

Get all synsets containing a lemma.

PARAMETER DESCRIPTION
lemma

The word lemma to search for. It is lowercased before the index lookup.

TYPE: str

pos

Part of speech filter. If None, returns all POS.

TYPE: WordNetPOS | None DEFAULT: None

RETURNS DESCRIPTION
list[Synset]

List of synsets containing the lemma.

Examples:

>>> synsets = loader.get_synsets_by_lemma("run", "v")
>>> for synset in synsets:
...     print(synset.gloss)
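
The two-level lookup (lemma, then part of speech, then offsets) can be sketched on stand-in dictionaries. The data and the lookup helper below are hypothetical and store plain strings where the loader stores Synset objects.

```python
# Hypothetical indices shaped like lemma_index and synsets.
lemma_index = {"run": {"v": ["00000010", "00000011"], "n": ["00000020"]}}
synsets = {"00000010": "gloss A", "00000011": "gloss B"}  # 00000020 deliberately missing


def lookup(lemma, pos=None):
    """Resolve lemma -> POS -> offsets, skipping offsets that do not resolve."""
    by_pos = lemma_index.get(lemma.lower(), {})
    pos_tags = [pos] if pos else list(by_pos)
    found = []
    for tag in pos_tags:
        for offset in by_pos.get(tag, []):
            if offset in synsets:
                found.append(synsets[offset])
    return found


print(lookup("Run", "v"))
print(lookup("run", "n"))
```

Unresolvable offsets are dropped silently, so an empty result can mean either an unknown lemma or offsets with no loaded synset.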
Source code in src/glazing/wordnet/loader.py
def get_synsets_by_lemma(self, lemma: str, pos: WordNetPOS | None = None) -> list[Synset]:
    """Get all synsets containing a lemma.

    Parameters
    ----------
    lemma : str
        The word lemma to search for.
    pos : WordNetPOS | None, default=None
        Part of speech filter. If None, returns all POS.

    Returns
    -------
    list[Synset]
        List of synsets containing the lemma.

    Examples
    --------
    >>> synsets = loader.get_synsets_by_lemma("run", "v")
    >>> for synset in synsets:
    ...     print(synset.gloss)
    """
    synsets: list[Synset] = []
    lemma_lower = lemma.lower()

    if lemma_lower not in self.lemma_index:
        return synsets

    # Get POS tags to search
    pos_tags: list[WordNetPOS]
    if pos:
        pos_tags = [pos] if pos in self.lemma_index[lemma_lower] else []
    else:
        pos_tags = list(self.lemma_index[lemma_lower].keys())

    # Collect synsets from offset lists
    for pos_tag in pos_tags:
        for offset in self.lemma_index[lemma_lower].get(pos_tag, []):
            synset = self.get_synset(offset)
            if synset:
                synsets.append(synset)

    return synsets
load() -> None

Load all WordNet data from JSON Lines files.

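This method loads synsets from the primary JSONL file, builds lemma and relation indices from the loaded data, and loads supplementary sense and exception data if available. With lazy=True it builds a file index instead of loading every synset and skips the lemma and relation indices. Calling load() again after a successful load returns immediately.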

RAISES DESCRIPTION
FileNotFoundError

If the primary JSONL file doesn't exist.

ValidationError

If JSON data doesn't match expected schema.

Source code in src/glazing/wordnet/loader.py
def load(self) -> None:
    """Load all WordNet data from JSON Lines files.

    This method loads synsets from the primary JSONL file, builds
    lemma and relation indices from loaded data, and optionally loads
    supplementary sense and exception data.

    Raises
    ------
    FileNotFoundError
        If the primary JSONL file doesn't exist.
    ValidationError
        If JSON data doesn't match expected schema.
    """
    if self._loaded:
        return

    # Load synsets from single JSONL file
    if self.lazy:
        self._build_file_index()
    else:
        self._load_all_synsets()

    # Build lemma index from loaded synsets
    if not self.lazy:
        self._build_lemma_index()

    # Load supplementary data if available
    self._load_sense_index()
    self._load_exceptions()

    # Build relation indices
    if not self.lazy:
        self._build_relation_indices()

    self._loaded = True

Functions

load_wordnet(data_path: Path | str, lazy: bool = False, cache_size: int = 1000) -> WordNetLoader

Load a WordNet database from JSON Lines files.

PARAMETER DESCRIPTION
data_path

Path to the WordNet JSONL file (e.g., wordnet.jsonl).

TYPE: Path | str

lazy

If True, load synsets on demand.

TYPE: bool DEFAULT: False

cache_size

Size of LRU cache for lazy loading.

TYPE: int DEFAULT: 1000

RETURNS DESCRIPTION
WordNetLoader

Loaded WordNet database.

Examples:

>>> wn = load_wordnet("data/wordnet.jsonl")
>>> dog = wn.get_synsets_by_lemma("dog", "n")[0]
>>> print(dog.gloss)
Source code in src/glazing/wordnet/loader.py
def load_wordnet(
    data_path: Path | str, lazy: bool = False, cache_size: int = 1000
) -> WordNetLoader:
    """Load a WordNet database from JSON Lines files.

    Parameters
    ----------
    data_path : Path | str
        Path to the WordNet JSONL file (e.g., wordnet.jsonl).
    lazy : bool, default=False
        If True, load synsets on demand.
    cache_size : int, default=1000
        Size of LRU cache for lazy loading.

    Returns
    -------
    WordNetLoader
        Loaded WordNet database.

    Examples
    --------
    >>> wn = load_wordnet("data/wordnet.jsonl")
    >>> dog = wn.get_synsets_by_lemma("dog", "n")[0]
    >>> print(dog.gloss)
    """
    loader = WordNetLoader(data_path, lazy=lazy, cache_size=cache_size, autoload=False)
    loader.load()
    return loader