glazing.wordnet.loader¶

Loading WordNet data from JSON Lines.

`loader` ¶

WordNet database loader with index building and caching.

This module provides functionality to load WordNet data from JSON Lines files, build efficient indices for fast lookups, and construct relation graphs for traversal operations.

CLASS	DESCRIPTION
`WordNetLoader`	Loads and indexes a WordNet database from JSON Lines format, loading automatically by default.

FUNCTION	DESCRIPTION
`load_wordnet`	Load a complete WordNet database from JSON Lines files.

Examples:

>>> from glazing.wordnet.loader import WordNetLoader
>>> # Data loads automatically on initialization
>>> loader = WordNetLoader()
>>> synset = loader.get_synset("00001740", "n")  # qualify: a bare offset can be ambiguous
>>> senses = loader.get_senses_by_lemma("dog", pos="n")
>>>
>>> # Or disable autoload for manual control
>>> loader = WordNetLoader(autoload=False)
>>> loader.load()  # Load manually when needed

Classes¶

`WordNetLoader(data_path: Path | str | None = None, lazy: bool = False, autoload: bool = True, cache_size: int = 1000)` ¶

Load and index a WordNet database from JSON Lines format, loading automatically by default.

This class provides efficient loading and indexing of WordNet data, including synsets, senses, and morphological exceptions. It builds multiple indices for fast lookups and supports lazy loading of large datasets. By default, data is loaded automatically on initialization.

PARAMETER	DESCRIPTION
`data_path`	Path to the WordNet JSONL file (e.g., wordnet.jsonl). If None, uses default path from environment. TYPE: `Path \| str \| None` DEFAULT: `None`
`lazy`	If True, load synsets on demand rather than all at once. TYPE: `bool` DEFAULT: `False`
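`autoload`	Whether to automatically load data on initialization. Only applies when lazy=False; with lazy=True, nothing is loaded at initialization, and `load()` builds the file index when called. TYPE: `bool` DEFAULT: `True`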
`cache_size`	Number of synsets to cache when using lazy loading. TYPE: `int` DEFAULT: `1000`

ATTRIBUTE	DESCRIPTION
`synsets`	All loaded synsets indexed by canonical key (offset plus normalized POS, e.g. `"00001740n"`). Offsets alone collide across parts of speech and so cannot key this mapping. TYPE: `dict[SynsetKey, Synset]`
`lemma_index`	Index from lemmas to synset keys by POS. Adjective satellites appear under both `"s"` and `"a"`. TYPE: `dict[str, dict[WordNetPOS, list[SynsetKey]]]`
`sense_index`	Index from sense keys to sense objects. TYPE: `dict[SenseKey, Sense]`
`exceptions`	Morphological exceptions by POS. TYPE: `dict[WordNetPOS, dict[str, list[str]]]`

METHOD	DESCRIPTION
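`load`	Load all WordNet data from JSON Lines files.
`get_synset`	Get a synset by its offset or canonical key.
`get_synsets_by_lemma`	Get all synsets containing a lemma, with an optional POS filter.
`get_senses_by_lemma`	Get all senses for a lemma and optional POS.
`get_sense_by_key`	Get a sense by its unique sense key.
`get_hypernyms`	Get direct hypernyms of a synset.
`get_hyponyms`	Get direct hyponyms of a synset.
`get_meronyms`	Get all meronyms (parts) of a synset.
`get_holonyms`	Get all holonyms (wholes) of a synset.
`get_exceptions`	Get morphological exceptions for a POS.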

Examples:

>>> # Automatic loading (default)
>>> loader = WordNetLoader()
>>> dog_synsets = loader.get_synsets_by_lemma("dog", "n")
>>> for synset in dog_synsets:
...     print(f"{synset.offset}: {synset.gloss}")

>>> # Manual loading
>>> loader = WordNetLoader(autoload=False)
>>> loader.load()
>>> synsets = loader.synsets  # Now accessible

Initialize WordNet loader.

PARAMETER	DESCRIPTION
`data_path`	Path to the WordNet JSONL file (e.g., wordnet.jsonl). If None, uses default path from environment. TYPE: `Path \| str \| None` DEFAULT: `None`
`lazy`	If True, load synsets on demand. TYPE: `bool` DEFAULT: `False`
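`autoload`	Whether to automatically load data on initialization. Only applies when lazy=False; with lazy=True, nothing is loaded at initialization, and `load()` builds the file index when called. TYPE: `bool` DEFAULT: `True`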
`cache_size`	Size of LRU cache for lazy loading. TYPE: `int` DEFAULT: `1000`

Source code in src/glazing/wordnet/loader.py

def __init__(
    self,
    data_path: Path | str | None = None,
    lazy: bool = False,
    autoload: bool = True,
    cache_size: int = 1000,
) -> None:
    """Initialize WordNet loader.

    Parameters
    ----------
    data_path : Path | str | None, optional
        Path to the WordNet JSONL file (e.g., wordnet.jsonl).
        If None, uses default path from environment.
    lazy : bool, default=False
        If True, load synsets on demand.
    autoload : bool, default=True
        Whether to automatically load data on initialization.
        Only applies when lazy=False.
    cache_size : int, default=1000
        Size of LRU cache for lazy loading.
    """
    if data_path is None:
        data_path = get_default_data_path("wordnet.jsonl")
    self.data_path = Path(data_path)
    self.lazy = lazy
    self.cache_size = cache_size

    # Core data structures, keyed by canonical synset key (offset + POS)
    self.synsets: dict[SynsetKey, Synset] = {}
    self.lemma_index: dict[str, dict[WordNetPOS, list[SynsetKey]]] = defaultdict(dict)
    self.sense_index: dict[SenseKey, Sense] = {}
    self.exceptions: dict[WordNetPOS, dict[str, list[str]]] = {}

    # Relation indices for efficient traversal
    self.hypernym_index: dict[SynsetKey, list[SynsetKey]] = defaultdict(list)
    self.hyponym_index: dict[SynsetKey, list[SynsetKey]] = defaultdict(list)
    self.meronym_index: dict[SynsetKey, list[SynsetKey]] = defaultdict(list)
    self.holonym_index: dict[SynsetKey, list[SynsetKey]] = defaultdict(list)

    # Bare offset -> keys sharing it, for resolving unqualified lookups
    self._keys_by_offset: dict[SynsetOffset, list[SynsetKey]] = defaultdict(list)

    # File index for lazy loading (key -> byte position in file)
    self._synset_file_index: dict[SynsetKey, int] = {}

    # Cache for lazy loading
    if lazy:
        self._cache: LRUCache[Synset] | None = LRUCache(cache_size)
    else:
        self._cache = None

    # Track loaded state
    self._loaded = False

    # Autoload data if requested and not lazy loading
    if autoload and not lazy:
        self.load()
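
The `cache_size` parameter bounds the LRU cache that is created only when `lazy=True`. The library's `LRUCache` class is not shown on this page, so the following is only a generic sketch of least-recently-used eviction; the class name `TinyLRU` and its `get`/`put` methods are assumptions made for illustration and may not match the real interface.

```python
from __future__ import annotations

from collections import OrderedDict


class TinyLRU:
    """Hypothetical stand-in for a bounded least-recently-used cache."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self._items: OrderedDict[str, object] = OrderedDict()

    def get(self, key: str) -> object | None:
        """Return the cached value and mark it as recently used."""
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key: str, value: object) -> None:
        """Store a value, evicting the least recently used entry if full."""
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)


cache = TinyLRU(max_size=2)
cache.put("00000001n", "first")
cache.put("00000002n", "second")
cache.get("00000001n")            # touch the first entry
cache.put("00000003n", "third")   # evicts "00000002n", the oldest untouched entry
print(cache.get("00000002n"))
```

A larger `cache_size` trades memory for fewer repeated reads from disk in lazy mode.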

Methods:¶

`get_exceptions(pos: WordNetPOS) -> dict[str, list[str]]` ¶

Get morphological exceptions for a POS.

PARAMETER	DESCRIPTION
`pos`	The part of speech. TYPE: `WordNetPOS`

RETURNS	DESCRIPTION
`dict[str, list[str]]`	Mapping from inflected forms to base forms.

Source code in src/glazing/wordnet/loader.py

def get_exceptions(self, pos: WordNetPOS) -> dict[str, list[str]]:
    """Get morphological exceptions for a POS.

    Parameters
    ----------
    pos : WordNetPOS
        The part of speech.

    Returns
    -------
    dict[str, list[str]]
        Mapping from inflected forms to base forms.
    """
    return self.exceptions.get(pos, {})
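
As a rough illustration of how the returned mapping might be consumed, the sketch below looks up an inflected form in a hand-made dictionary shaped like the return value of `get_exceptions`. The entries and the helper `base_forms` are hypothetical and not part of the library.

```python
from __future__ import annotations

# Toy data shaped like the return value of get_exceptions("n").
# The entries are illustrative, not read from a real WordNet file.
noun_exceptions: dict[str, list[str]] = {
    "geese": ["goose"],
    "mice": ["mouse"],
}


def base_forms(word: str, exceptions: dict[str, list[str]]) -> list[str]:
    """Return base forms for an irregular word, or the word itself."""
    # Hypothetical helper: fall back to the input when no exception exists.
    return exceptions.get(word.lower(), [word])


print(base_forms("geese", noun_exceptions))
print(base_forms("dog", noun_exceptions))
```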

`get_holonyms(synset: Synset) -> list[Synset]` ¶

Get all holonyms (wholes) of a synset.

PARAMETER	DESCRIPTION
`synset`	The synset to get holonyms for. TYPE: `Synset`

RETURNS	DESCRIPTION
`list[Synset]`	List of holonym synsets.

Source code in src/glazing/wordnet/loader.py

def get_holonyms(self, synset: Synset) -> list[Synset]:
    """Get all holonyms (wholes) of a synset.

    Parameters
    ----------
    synset : Synset
        The synset to get holonyms for.

    Returns
    -------
    list[Synset]
        List of holonym synsets.
    """
    holonyms = []
    for key in self.holonym_index.get(synset.key, []):
        holonym = self.get_synset(key)
        if holonym:
            holonyms.append(holonym)
    return holonyms

`get_hypernyms(synset: Synset) -> list[Synset]` ¶

Get direct hypernyms of a synset.

PARAMETER	DESCRIPTION
`synset`	The synset to get hypernyms for. TYPE: `Synset`

RETURNS	DESCRIPTION
`list[Synset]`	List of hypernym synsets.

Source code in src/glazing/wordnet/loader.py

def get_hypernyms(self, synset: Synset) -> list[Synset]:
    """Get direct hypernyms of a synset.

    Parameters
    ----------
    synset : Synset
        The synset to get hypernyms for.

    Returns
    -------
    list[Synset]
        List of hypernym synsets.
    """
    hypernyms = []
    for key in self.hypernym_index.get(synset.key, []):
        hypernym = self.get_synset(key)
        if hypernym:
            hypernyms.append(hypernym)
    return hypernyms
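
`get_hypernyms` returns only direct hypernyms. To reach every ancestor, a caller could repeat the lookup breadth-first. The sketch below does this over a toy index keyed by canonical synset key; the keys and the helper `hypernym_closure` are made up for illustration and are not part of the library.

```python
from __future__ import annotations

from collections import deque

# Toy hypernym index: synset key -> keys of its direct hypernyms.
hypernym_index: dict[str, list[str]] = {
    "00000003n": ["00000002n"],
    "00000002n": ["00000001n"],
    "00000001n": [],
}


def hypernym_closure(key: str, index: dict[str, list[str]]) -> list[str]:
    """Collect all ancestors of `key` breadth-first, without repeats."""
    seen: set[str] = set()
    order: list[str] = []
    queue = deque(index.get(key, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        # Enqueue the next level of hypernyms.
        queue.extend(index.get(current, []))
    return order


print(hypernym_closure("00000003n", hypernym_index))
```

The `seen` set guards against visiting a synset twice when two paths lead to the same ancestor.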

`get_hyponyms(synset: Synset) -> list[Synset]` ¶

Get direct hyponyms of a synset.

PARAMETER	DESCRIPTION
`synset`	The synset to get hyponyms for. TYPE: `Synset`

RETURNS	DESCRIPTION
`list[Synset]`	List of hyponym synsets.

Source code in src/glazing/wordnet/loader.py

def get_hyponyms(self, synset: Synset) -> list[Synset]:
    """Get direct hyponyms of a synset.

    Parameters
    ----------
    synset : Synset
        The synset to get hyponyms for.

    Returns
    -------
    list[Synset]
        List of hyponym synsets.
    """
    hyponyms = []
    for key in self.hyponym_index.get(synset.key, []):
        hyponym = self.get_synset(key)
        if hyponym:
            hyponyms.append(hyponym)
    return hyponyms

`get_meronyms(synset: Synset) -> list[Synset]` ¶

Get all meronyms (parts) of a synset.

PARAMETER	DESCRIPTION
`synset`	The synset to get meronyms for. TYPE: `Synset`

RETURNS	DESCRIPTION
`list[Synset]`	List of meronym synsets.

Source code in src/glazing/wordnet/loader.py

def get_meronyms(self, synset: Synset) -> list[Synset]:
    """Get all meronyms (parts) of a synset.

    Parameters
    ----------
    synset : Synset
        The synset to get meronyms for.

    Returns
    -------
    list[Synset]
        List of meronym synsets.
    """
    meronyms = []
    for key in self.meronym_index.get(synset.key, []):
        meronym = self.get_synset(key)
        if meronym:
            meronyms.append(meronym)
    return meronyms

`get_sense_by_key(sense_key: SenseKey) -> Sense | None` ¶

Get a sense by its unique sense key.

PARAMETER	DESCRIPTION
`sense_key`	The unique sense key. TYPE: `SenseKey`

RETURNS	DESCRIPTION
`Sense \| None`	The sense or None if not found.

Examples:

>>> sense = loader.get_sense_by_key("dog%1:05:00::")
>>> print(sense.synset_offset)

Source code in src/glazing/wordnet/loader.py

def get_sense_by_key(self, sense_key: SenseKey) -> Sense | None:
    """Get a sense by its unique sense key.

    Parameters
    ----------
    sense_key : SenseKey
        The unique sense key.

    Returns
    -------
    Sense | None
        The sense or None if not found.

    Examples
    --------
    >>> sense = loader.get_sense_by_key("dog%1:05:00::")
    >>> print(sense.synset_offset)
    """
    return self.sense_index.get(sense_key)

`get_senses_by_lemma(lemma: str, pos: WordNetPOS | None = None) -> list[Sense]` ¶

Get all senses for a lemma.

PARAMETER	DESCRIPTION
`lemma`	The word lemma to search for. TYPE: `str`
`pos`	Part of speech filter. TYPE: `WordNetPOS \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`list[Sense]`	List of senses for the lemma, sorted by sense number.

Examples:

>>> senses = loader.get_senses_by_lemma("run", "v")
>>> for sense in senses:
...     print(f"{sense.sense_key}: {sense.sense_number}")

Source code in src/glazing/wordnet/loader.py

def get_senses_by_lemma(self, lemma: str, pos: WordNetPOS | None = None) -> list[Sense]:
    """Get all senses for a lemma.

    Parameters
    ----------
    lemma : str
        The word lemma to search for.
    pos : WordNetPOS | None, default=None
        Part of speech filter.

    Returns
    -------
    list[Sense]
        List of senses for the lemma, sorted by sense number.

    Examples
    --------
    >>> senses = loader.get_senses_by_lemma("run", "v")
    >>> for sense in senses:
    ...     print(f"{sense.sense_key}: {sense.sense_number}")
    """
    senses = []

    for sense in self.sense_index.values():
        if sense.lemma == lemma and (pos is None or sense.ss_type == pos):
            senses.append(sense)

    # Sort by sense number (frequency order)
    senses.sort(key=lambda s: s.sense_number)

    return senses

`get_synset(offset: SynsetOffset | SynsetKey, pos: WordNetPOS | None = None) -> Synset | None` ¶

Get a synset by its offset or canonical key.

PARAMETER	DESCRIPTION
`offset`	An 8-digit synset offset, or a 9-character canonical key with the part of speech appended (e.g., `"00001740n"`). TYPE: `SynsetOffset \| SynsetKey`
`pos`	Part of speech qualifying `offset`. Ignored when `offset` is already a canonical key. TYPE: `WordNetPOS \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`Synset \| None`	The synset, or None if not found or if a bare offset is ambiguous.

Notes

Offsets are byte offsets into a per-POS data file, so 330 of them name more than one synset in WordNet 3.1. An unqualified lookup on such an offset returns None rather than choosing arbitrarily; pass pos or a canonical key to disambiguate.

Examples:

>>> synset = loader.get_synset("02084442")
>>> print(synset.gloss)
>>> loader.get_synset("00001740", "v")  # qualified
>>> loader.get_synset("00001740v")  # equivalent

Source code in src/glazing/wordnet/loader.py

def get_synset(
    self, offset: SynsetOffset | SynsetKey, pos: WordNetPOS | None = None
) -> Synset | None:
    """Get a synset by its offset or canonical key.

    Parameters
    ----------
    offset : SynsetOffset | SynsetKey
        An 8-digit synset offset, or a 9-character canonical key with the
        part of speech appended (e.g., ``"00001740n"``).
    pos : WordNetPOS | None, default=None
        Part of speech qualifying `offset`. Ignored when `offset` is
        already a canonical key.

    Returns
    -------
    Synset | None
        The synset, or None if not found or if a bare offset is ambiguous.

    Notes
    -----
    Offsets are byte offsets into a per-POS data file, so 330 of them name
    more than one synset in WordNet 3.1. An unqualified lookup on such an
    offset returns None rather than choosing arbitrarily; pass `pos` or a
    canonical key to disambiguate.

    Examples
    --------
    >>> synset = loader.get_synset("02084442")
    >>> print(synset.gloss)
    >>> loader.get_synset("00001740", "v")  # qualified
    >>> loader.get_synset("00001740v")  # equivalent
    """
    key = self._resolve_key(offset, pos)
    if key is None:
        return None
    if self.lazy:
        return self._load_synset_lazy(key)
    return self.synsets.get(key)
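
The body of `_resolve_key` is not shown on this page. The sketch below is one way the documented behavior could be implemented, assuming a mapping from bare offsets to the canonical keys that share them (similar in spirit to `_keys_by_offset`). The function `resolve_key`, the toy keys, and the omission of any satellite normalization are assumptions for illustration.

```python
from __future__ import annotations

from collections import defaultdict

POS_TAGS = {"n", "v", "a", "r", "s"}

# Hypothetical index: bare offset -> canonical keys sharing that offset.
keys_by_offset: dict[str, list[str]] = defaultdict(list)
for canonical in ["00001740n", "00001740v", "02084442n"]:
    keys_by_offset[canonical[:8]].append(canonical)


def resolve_key(offset: str, pos: str | None = None) -> str | None:
    """Turn an offset (optionally qualified by POS) into a canonical key."""
    # A 9-character value ending in a POS letter is already a key.
    if len(offset) == 9 and offset[-1] in POS_TAGS:
        return offset
    # A bare offset qualified by POS maps directly to a key.
    if pos is not None:
        return f"{offset}{pos}"
    # An unqualified offset resolves only when exactly one key uses it.
    candidates = keys_by_offset.get(offset, [])
    return candidates[0] if len(candidates) == 1 else None


print(resolve_key("00001740"))        # ambiguous in the toy index
print(resolve_key("00001740", "v"))
print(resolve_key("02084442"))
```

Returning `None` for an ambiguous offset, rather than the first match, keeps the failure visible to the caller.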

`get_synsets_by_lemma(lemma: str, pos: WordNetPOS | None = None) -> list[Synset]` ¶

Get all synsets containing a lemma.

PARAMETER	DESCRIPTION
`lemma`	The word lemma to search for. TYPE: `str`
`pos`	Part of speech filter. If None, returns all POS. TYPE: `WordNetPOS \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`list[Synset]`	List of synsets containing the lemma.

Examples:

>>> synsets = loader.get_synsets_by_lemma("run", "v")
>>> for synset in synsets:
...     print(synset.gloss)

Source code in src/glazing/wordnet/loader.py

def get_synsets_by_lemma(self, lemma: str, pos: WordNetPOS | None = None) -> list[Synset]:
    """Get all synsets containing a lemma.

    Parameters
    ----------
    lemma : str
        The word lemma to search for.
    pos : WordNetPOS | None, default=None
        Part of speech filter. If None, returns all POS.

    Returns
    -------
    list[Synset]
        List of synsets containing the lemma.

    Examples
    --------
    >>> synsets = loader.get_synsets_by_lemma("run", "v")
    >>> for synset in synsets:
    ...     print(synset.gloss)
    """
    synsets: list[Synset] = []
    lemma_lower = lemma.lower()

    if lemma_lower not in self.lemma_index:
        return synsets

    # Get POS tags to search
    pos_tags: list[WordNetPOS]
    if pos:
        pos_tags = [pos] if pos in self.lemma_index[lemma_lower] else []
    else:
        pos_tags = list(self.lemma_index[lemma_lower].keys())

    # Collect synsets by key. Satellites are indexed under both "s" and
    # "a", so track seen keys to avoid returning them twice.
    seen: set[SynsetKey] = set()
    for pos_tag in pos_tags:
        for key in self.lemma_index[lemma_lower].get(pos_tag, []):
            if key in seen:
                continue
            seen.add(key)
            synset = self.get_synset(key)
            if synset:
                synsets.append(synset)

    return synsets
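
To make the de-duplication step concrete, the following sketch runs the same "seen keys" idea over a toy lemma index in which one satellite key is listed under both `"s"` and `"a"`. The lemma, the keys, the assumption that satellite keys end in `"a"`, and the helper `keys_for` are all hypothetical.

```python
from __future__ import annotations

# Toy lemma index: lemma -> POS -> canonical synset keys.
# The second key is listed under both "a" and "s".
lemma_index: dict[str, dict[str, list[str]]] = {
    "bright": {
        "a": ["00000010a", "00000020a"],
        "s": ["00000020a"],
    },
}


def keys_for(lemma: str, pos: str | None = None) -> list[str]:
    """Return synset keys for a lemma, each at most once, in index order."""
    by_pos = lemma_index.get(lemma.lower(), {})
    tags = [pos] if pos else list(by_pos)
    seen: set[str] = set()
    ordered: list[str] = []
    for tag in tags:
        for key in by_pos.get(tag, []):
            if key not in seen:
                seen.add(key)
                ordered.append(key)
    return ordered


print(keys_for("bright"))
print(keys_for("Bright", "s"))
```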

`load() -> None` ¶

Load all WordNet data from JSON Lines files.

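This method loads synsets from the primary JSONL file, builds lemma and relation indices from the loaded data, and loads supplementary sense and exception data when available. With lazy=True, it builds only a file index of synset positions and skips the lemma and relation indices. Calling `load()` again after a successful load does nothing.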

RAISES	DESCRIPTION
`FileNotFoundError`	If the primary JSONL file doesn't exist.
`ValidationError`	If JSON data doesn't match expected schema.

Source code in src/glazing/wordnet/loader.py

def load(self) -> None:
    """Load all WordNet data from JSON Lines files.

    This method loads synsets from the primary JSONL file, builds
    lemma and relation indices from loaded data, and optionally loads
    supplementary sense and exception data.

    Raises
    ------
    FileNotFoundError
        If the primary JSONL file doesn't exist.
    ValidationError
        If JSON data doesn't match expected schema.
    """
    if self._loaded:
        return

    # Load synsets from single JSONL file
    if self.lazy:
        self._build_file_index()
    else:
        self._load_all_synsets()

    # Build lemma index from loaded synsets
    if not self.lazy:
        self._build_lemma_index()

    # Load supplementary data if available
    self._load_sense_index()
    self._load_exceptions()

    # Build relation indices
    if not self.lazy:
        self._build_relation_indices()

    self._loaded = True
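
In lazy mode, `load()` builds a file index instead of reading every synset. `_build_file_index` and `_load_synset_lazy` are not shown on this page, so the sketch below only illustrates the general technique: record the byte position of each JSONL line, then seek to it on demand. The record fields `offset` and `ss_type`, and the helper names, are assumptions.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def build_file_index(path: Path) -> dict[str, int]:
    """Map each record's key to the byte position where its line starts."""
    index: dict[str, int] = {}
    with path.open("rb") as handle:
        while True:
            position = handle.tell()
            line = handle.readline()
            if not line:
                break
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            # Assumed record layout: "offset" and "ss_type" fields.
            index[record["offset"] + record["ss_type"]] = position
    return index


def read_record(path: Path, index: dict[str, int], key: str) -> dict | None:
    """Seek to the indexed position and parse just that one line."""
    position = index.get(key)
    if position is None:
        return None
    with path.open("rb") as handle:
        handle.seek(position)
        return json.loads(handle.readline())


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text(
        '{"offset": "00000001", "ss_type": "n", "gloss": "first"}\n'
        '{"offset": "00000002", "ss_type": "v", "gloss": "second"}\n',
        encoding="utf-8",
    )
    file_index = build_file_index(sample)
    second = read_record(sample, file_index, "00000002v")
    missing = read_record(sample, file_index, "99999999n")
    print(second["gloss"])
```

Opening the file in binary mode keeps `tell()` and `seek()` in plain byte positions, which is what makes the stored index reusable.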

Functions:¶

`load_wordnet(data_path: Path | str, lazy: bool = False, cache_size: int = 1000) -> WordNetLoader` ¶

Load a WordNet database from JSON Lines files.

PARAMETER	DESCRIPTION
`data_path`	Path to the WordNet JSONL file (e.g., wordnet.jsonl). TYPE: `Path \| str`
`lazy`	If True, load synsets on demand. TYPE: `bool` DEFAULT: `False`
`cache_size`	Size of LRU cache for lazy loading. TYPE: `int` DEFAULT: `1000`

RETURNS	DESCRIPTION
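`WordNetLoader`	Loader on which `load()` has already been called. With lazy=True, only the synset file index and the supplementary sense and exception data are loaded up front.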

Examples:

>>> wn = load_wordnet("data/wordnet.jsonl")
>>> dog = wn.get_synsets_by_lemma("dog", "n")[0]
>>> print(dog.gloss)

Source code in src/glazing/wordnet/loader.py

def load_wordnet(
    data_path: Path | str, lazy: bool = False, cache_size: int = 1000
) -> WordNetLoader:
    """Load a WordNet database from JSON Lines files.

    Parameters
    ----------
    data_path : Path | str
        Path to the WordNet JSONL file (e.g., wordnet.jsonl).
    lazy : bool, default=False
        If True, load synsets on demand.
    cache_size : int, default=1000
        Size of LRU cache for lazy loading.

    Returns
    -------
    WordNetLoader
        Loaded WordNet database.

    Examples
    --------
    >>> wn = load_wordnet("data/wordnet.jsonl")
    >>> dog = wn.get_synsets_by_lemma("dog", "n")[0]
    >>> print(dog.gloss)
    """
    loader = WordNetLoader(data_path, lazy=lazy, cache_size=cache_size, autoload=False)
    loader.load()
    return loader
