glazing.downloader¶

Dataset downloading utilities.

`downloader` ¶

Dataset downloaders for linguistic resources.

This module provides automatic downloading capabilities for FrameNet, PropBank, VerbNet, and WordNet datasets. Each downloader handles version tracking, progress indication, and archive extraction.

CLASS	DESCRIPTION
`BaseDownloader`	Abstract base class for dataset downloaders.
`VerbNetDownloader`	Downloads VerbNet from GitHub with commit hash versioning.
`PropBankDownloader`	Downloads PropBank from GitHub with commit hash versioning.
`WordNetDownloader`	Downloads WordNet 3.1 from Princeton University.
`FrameNetDownloader`	Downloads FrameNet 1.7 from the NLTK data repository on GitHub.

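FUNCTION	DESCRIPTION
`download_dataset`	Download a specific dataset by name.
`download_all`	Download all available datasets.
`get_downloader`	Get downloader instance for a dataset.
`get_available_datasets`	Get list of available datasets for download.
`get_dataset_info`	Get information about a dataset.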

Examples:

>>> from pathlib import Path
>>> from glazing.downloader import download_dataset
>>> path = download_dataset("verbnet", Path("data/raw"))
>>> print(f"VerbNet downloaded to: {path}")

>>> from glazing.downloader import VerbNetDownloader
>>> downloader = VerbNetDownloader()
>>> path = downloader.download(Path("data/raw"))

Classes¶

`BaseDownloader` ¶

Bases: ABC

Abstract base class for dataset downloaders.

Provides common functionality for downloading and extracting datasets with progress tracking and error handling.

ATTRIBUTE	DESCRIPTION
`dataset_name`	Human-readable name of the dataset. TYPE: `str`
`version`	Version string or commit hash for the dataset. TYPE: `str`

METHOD	DESCRIPTION
`download`	Download the dataset to the specified directory.

Attributes¶

`dataset_name: str` `abstractmethod` `property` ¶

Name of the dataset.

RETURNS	DESCRIPTION
`str`	Human-readable dataset name.

`version: str` `abstractmethod` `property` ¶

Version or commit hash.

RETURNS	DESCRIPTION
`str`	Version identifier for reproducible downloads.

Methods:¶

`download(output_dir: Path) -> Path` `abstractmethod` ¶

Download dataset to output directory.

PARAMETER	DESCRIPTION
`output_dir`	Directory to download the dataset to. TYPE: `Path`

RETURNS	DESCRIPTION
`Path`	Path to the downloaded and extracted dataset.

RAISES	DESCRIPTION
`DownloadError`	If download fails.
`ExtractionError`	If archive extraction fails.
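
As a rough illustration of the contract above, a subclass supplies the two properties and `download`. The sketch below is self-contained, so it declares a stand-in `BaseDownloader` that mirrors only the documented interface rather than importing the real class; `LocalFixtureDownloader` and the file it writes are hypothetical.

```python
from abc import ABC, abstractmethod
from pathlib import Path
import tempfile


class BaseDownloader(ABC):
    """Stand-in mirroring the documented interface (not the real class)."""

    @property
    @abstractmethod
    def dataset_name(self) -> str: ...

    @property
    @abstractmethod
    def version(self) -> str: ...

    @abstractmethod
    def download(self, output_dir: Path) -> Path: ...


class LocalFixtureDownloader(BaseDownloader):
    """Hypothetical downloader that writes a tiny fixture instead of fetching."""

    @property
    def dataset_name(self) -> str:
        return "fixture"

    @property
    def version(self) -> str:
        return "0.1"

    def download(self, output_dir: Path) -> Path:
        # Create the dataset directory and return its path, as the contract requires.
        target = output_dir / f"{self.dataset_name}-{self.version}"
        target.mkdir(parents=True, exist_ok=True)
        (target / "README.txt").write_text("placeholder data\n")
        return target


with tempfile.TemporaryDirectory() as tmp:
    fixture_path = LocalFixtureDownloader().download(Path(tmp))
    fixture_files = sorted(p.name for p in fixture_path.iterdir())
```

Because every member is abstract, instantiating the base class directly raises `TypeError`; only complete subclasses can be created.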

Source code in src/glazing/downloader.py

@abstractmethod
def download(self, output_dir: Path) -> Path:
    """Download dataset to output directory.

    Parameters
    ----------
    output_dir : Path
        Directory to download the dataset to.

    Returns
    -------
    Path
        Path to the downloaded and extracted dataset.

    Raises
    ------
    DownloadError
        If download fails.
    ExtractionError
        If archive extraction fails.
    """

`DownloadError` ¶

Bases: Exception

Raised when a download operation fails.

`ExtractionError` ¶

Bases: Exception

Raised when archive extraction fails.
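
Because the two exception types are distinct, a caller can treat them differently, for example retrying a network failure while letting an extraction failure surface at once. A minimal sketch follows; the exception classes are local stand-ins for the ones documented here, and `download_with_retry` and `FlakyDownloader` are hypothetical names that are not part of the module.

```python
from pathlib import Path


class DownloadError(Exception):
    """Stand-in for the documented DownloadError."""


class ExtractionError(Exception):
    """Stand-in for the documented ExtractionError."""


def download_with_retry(downloader, output_dir: Path, attempts: int = 3) -> Path:
    """Retry download failures; let extraction failures propagate untouched."""
    last_error = None
    for _ in range(attempts):
        try:
            return downloader.download(output_dir)
        except DownloadError as e:
            last_error = e  # possibly transient, so try again
    raise DownloadError(f"gave up after {attempts} attempts") from last_error


class FlakyDownloader:
    """Hypothetical downloader that fails twice before succeeding."""

    def __init__(self) -> None:
        self.calls = 0

    def download(self, output_dir: Path) -> Path:
        self.calls += 1
        if self.calls < 3:
            raise DownloadError("simulated network failure")
        return output_dir / "dataset"


flaky = FlakyDownloader()
result = download_with_retry(flaky, Path("data/raw"))
```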

`FrameNetDownloader` ¶

Bases: BaseDownloader

Downloads FrameNet from NLTK data repository.

Downloads FrameNet v1.7 from the NLTK data GitHub repository, which hosts the dataset for direct download without a separate license request.

ATTRIBUTE	DESCRIPTION
`dataset_name`	"framenet" TYPE: `str`
`version`	"1.7" TYPE: `str`
`commit_hash`	"427fc05d3a8cc1ca99e7ff93bdea937507cc9e7a" TYPE: `str`

METHOD	DESCRIPTION
`download`	Download FrameNet from NLTK data repository.

Attributes¶

`commit_hash: str` `property` ¶

NLTK data repository commit hash.

`dataset_name: str` `property` ¶

Name of the dataset.

`version: str` `property` ¶

Version of FrameNet.

Methods:¶

`download(output_dir: Path) -> Path` ¶

Download FrameNet from NLTK data repository.

PARAMETER	DESCRIPTION
`output_dir`	Directory to download FrameNet into. TYPE: `Path`

RETURNS	DESCRIPTION
`Path`	Path to the extracted FrameNet directory.

RAISES	DESCRIPTION
`DownloadError`	If download fails.
`ExtractionError`	If extraction fails.

Source code in src/glazing/downloader.py

def download(self, output_dir: Path) -> Path:
    """Download FrameNet from NLTK data repository.

    Parameters
    ----------
    output_dir : Path
        Directory to download FrameNet into.

    Returns
    -------
    Path
        Path to the extracted FrameNet directory.

    Raises
    ------
    DownloadError
        If download fails.
    ExtractionError
        If extraction fails.
    """
    url = f"https://raw.githubusercontent.com/nltk/nltk_data/{self.commit_hash}/packages/corpora/framenet_v17.zip"
    archive_path = output_dir / f"framenet-{self.version}.zip"

    try:
        print(f"Downloading {self.dataset_name} v{self.version}...")
        self._download_file(url, archive_path)

        print(f"Extracting {archive_path.name}...")
        extracted_path = self._extract_archive(archive_path, output_dir)

    except (DownloadError, ExtractionError):
        # Already one of the documented error types; propagate unchanged.
        raise

    except Exception as e:
        msg = f"Failed to download {self.dataset_name}: {e}"
        raise DownloadError(msg) from e

    else:
        return extracted_path

    finally:
        # Runs on success and failure alike: always remove the archive.
        archive_path.unlink(missing_ok=True)
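
The cleanup in the method above relies on `finally` running even when the `else` branch returns. The following self-contained sketch reproduces that control flow with a placeholder file; `process_and_clean` and the simulated failure are hypothetical and stand in for the real download and extraction steps.

```python
from pathlib import Path
import tempfile


def process_and_clean(archive_path: Path, fail: bool) -> str:
    """Mimic the cleanup flow: the archive is removed on success and on failure."""
    try:
        if fail:
            raise OSError("simulated failure")
        outcome = "extracted"
    except OSError as e:
        raise RuntimeError(f"wrapped: {e}") from e
    else:
        return outcome  # finally still runs before the value is handed back
    finally:
        archive_path.unlink(missing_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "archive.zip"

    # Success path: the function returns and the file is gone.
    archive.write_bytes(b"placeholder")
    ok = process_and_clean(archive, fail=False)
    removed_on_success = not archive.exists()

    # Failure path: the wrapped exception propagates and the file is gone.
    archive.write_bytes(b"placeholder")
    try:
        process_and_clean(archive, fail=True)
        raised = False
    except RuntimeError:
        raised = True
    removed_on_failure = not archive.exists()
```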

`PropBankDownloader` ¶

Bases: BaseDownloader

Downloads PropBank from GitHub repository.

Downloads the PropBank frames from the official GitHub repository using a specific commit hash for reproducibility.

ATTRIBUTE	DESCRIPTION
`dataset_name`	"propbank" TYPE: `str`
`version`	"3.4.0" TYPE: `str`
`commit_hash`	"7280a04806b6ca3955ec82e28c4df96b6da76aef" TYPE: `str`

METHOD	DESCRIPTION
`download`	Download PropBank dataset.

Attributes¶

`commit_hash: str` `property` ¶

GitHub repository commit hash.

`dataset_name: str` `property` ¶

Name of the dataset.

`version: str` `property` ¶

Version of PropBank.

Methods:¶

`download(output_dir: Path) -> Path` ¶

Download PropBank dataset.

PARAMETER	DESCRIPTION
`output_dir`	Directory to download PropBank to. TYPE: `Path`

RETURNS	DESCRIPTION
`Path`	Path to the extracted PropBank directory.

RAISES	DESCRIPTION
`DownloadError`	If download fails.
`ExtractionError`	If extraction fails.

Source code in src/glazing/downloader.py

def download(self, output_dir: Path) -> Path:
    """Download PropBank dataset.

    Parameters
    ----------
    output_dir : Path
        Directory to download PropBank to.

    Returns
    -------
    Path
        Path to the extracted PropBank directory.

    Raises
    ------
    DownloadError
        If download fails.
    ExtractionError
        If extraction fails.
    """
    url = f"https://github.com/propbank/propbank-frames/archive/{self.commit_hash}.zip"
    archive_name = f"propbank-{self.version}.zip"
    archive_path = output_dir / archive_name

    self._download_file(url, archive_path)

    try:
        extracted_dir = self._extract_archive(archive_path, output_dir)
    except ExtractionError:
        # Clean up failed download
        if archive_path.exists():
            archive_path.unlink()
        raise
    else:
        # Clean up archive file
        archive_path.unlink()
        return extracted_dir
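
Pinning the URL to a full commit hash is what makes the download reproducible: the same hash always resolves to the same tree. A small sketch of that URL construction is given below; `github_archive_url` is a hypothetical helper, and the 40-character check is an added assumption rather than something the module is known to do.

```python
import re


def github_archive_url(owner: str, repo: str, commit_hash: str) -> str:
    """Build the zip URL for one exact commit (hypothetical helper)."""
    # Assumption: require a full 40-character SHA-1 so the pin is unambiguous.
    if not re.fullmatch(r"[0-9a-f]{40}", commit_hash):
        raise ValueError(f"not a full commit hash: {commit_hash!r}")
    return f"https://github.com/{owner}/{repo}/archive/{commit_hash}.zip"


# Commit hash copied from the PropBank attribute table.
PROPBANK_COMMIT = "7280a04806b6ca3955ec82e28c4df96b6da76aef"
url = github_archive_url("propbank", "propbank-frames", PROPBANK_COMMIT)
```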

`VerbNetDownloader` ¶

Bases: BaseDownloader

Downloads VerbNet from GitHub repository.

Downloads the VerbNet dataset from the official GitHub repository using a specific commit hash for reproducibility.

ATTRIBUTE	DESCRIPTION
`dataset_name`	"verbnet" TYPE: `str`
`version`	"3.4" TYPE: `str`
`commit_hash`	"ae8e9cfdc2c0d3414b748763612f1a0a34194cc1" TYPE: `str`

METHOD	DESCRIPTION
`download`	Download VerbNet dataset.

Attributes¶

`commit_hash: str` `property` ¶

GitHub repository commit hash.

`dataset_name: str` `property` ¶

Name of the dataset.

`version: str` `property` ¶

Version of VerbNet.

Methods:¶

`download(output_dir: Path) -> Path` ¶

Download VerbNet dataset.

PARAMETER	DESCRIPTION
`output_dir`	Directory to download VerbNet to. TYPE: `Path`

RETURNS	DESCRIPTION
`Path`	Path to the extracted VerbNet directory.

RAISES	DESCRIPTION
`DownloadError`	If download fails.
`ExtractionError`	If extraction fails.

Source code in src/glazing/downloader.py

def download(self, output_dir: Path) -> Path:
    """Download VerbNet dataset.

    Parameters
    ----------
    output_dir : Path
        Directory to download VerbNet to.

    Returns
    -------
    Path
        Path to the extracted VerbNet directory.

    Raises
    ------
    DownloadError
        If download fails.
    ExtractionError
        If extraction fails.
    """
    url = f"https://github.com/cu-clear/verbnet/archive/{self.commit_hash}.zip"
    archive_name = f"verbnet-{self.version}.zip"
    archive_path = output_dir / archive_name

    self._download_file(url, archive_path)

    try:
        extracted_dir = self._extract_archive(archive_path, output_dir)
    except ExtractionError:
        # Clean up failed download
        if archive_path.exists():
            archive_path.unlink()
        raise
    else:
        # Clean up archive file
        archive_path.unlink()
        return extracted_dir
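
`_extract_archive` is not shown on this page, so the following is only a guess at the kind of work it does, not its actual implementation. GitHub commit archives unpack into a single top-level folder named `<repo>-<commit>`; the sketch extracts a zip and returns that folder. `extract_single_root` is a hypothetical name, and the archive is built locally for the demonstration.

```python
from pathlib import Path
import tempfile
import zipfile


def extract_single_root(archive_path: Path, output_dir: Path) -> Path:
    """Extract a zip and return its single top-level directory."""
    with zipfile.ZipFile(archive_path) as zf:
        # Collect the first path component of every member.
        roots = {name.split("/", 1)[0] for name in zf.namelist()}
        if len(roots) != 1:
            raise ValueError(f"expected one top-level folder, found {sorted(roots)}")
        zf.extractall(output_dir)
    return output_dir / roots.pop()


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    archive = tmp_path / "verbnet-3.4.zip"
    # Build a tiny archive shaped like a GitHub commit download.
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("verbnet-abc123/README.md", "demo")
        zf.writestr("verbnet-abc123/data/put-9.1.xml", "<VNCLASS/>")

    root = extract_single_root(archive, tmp_path / "out")
    root_name = root.name
    has_readme = (root / "README.md").is_file()
```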

`WordNetDownloader` ¶

Bases: BaseDownloader

Downloads WordNet 3.1 from Princeton University.

Downloads the WordNet 3.1 database from the official Princeton University distribution site.

ATTRIBUTE	DESCRIPTION
`dataset_name`	"wordnet" TYPE: `str`
`version`	"3.1" TYPE: `str`

METHOD	DESCRIPTION
`download`	Download WordNet dataset.

Attributes¶

`dataset_name: str` `property` ¶

Name of the dataset.

`version: str` `property` ¶

Version of WordNet.

Methods:¶

`download(output_dir: Path) -> Path` ¶

Download WordNet dataset.

PARAMETER	DESCRIPTION
`output_dir`	Directory to download WordNet to. TYPE: `Path`

RETURNS	DESCRIPTION
`Path`	Path to the extracted WordNet directory.

RAISES	DESCRIPTION
`DownloadError`	If download fails.
`ExtractionError`	If extraction fails.

Source code in src/glazing/downloader.py

def download(self, output_dir: Path) -> Path:
    """Download WordNet dataset.

    Parameters
    ----------
    output_dir : Path
        Directory to download WordNet to.

    Returns
    -------
    Path
        Path to the extracted WordNet directory.

    Raises
    ------
    DownloadError
        If download fails.
    ExtractionError
        If extraction fails.
    """
    url = "https://wordnetcode.princeton.edu/wn3.1.dict.tar.gz"
    archive_name = "wordnet-3.1.tar.gz"
    archive_path = output_dir / archive_name

    self._download_file(url, archive_path)

    # WordNet tar.gz contains a 'dict' folder with all the data files
    try:
        # Extract to temp location first to see structure
        with tempfile.TemporaryDirectory() as temp_dir:
            temp_path = Path(temp_dir)

            # Extract archive
            with tarfile.open(str(archive_path), "r:gz") as tar_ref:
                tar_ref.extractall(temp_path, filter="data")

            # The archive contains a 'dict' folder
            extracted_dict = temp_path / "dict"
            if not extracted_dict.exists():
                raise ExtractionError("Expected 'dict' folder in WordNet archive")

            # Move to final location
            final_dict = output_dir / "wn31-dict"
            if final_dict.exists():
                shutil.rmtree(final_dict)
            shutil.move(str(extracted_dict), str(final_dict))

            # Clean up archive file
            archive_path.unlink()
            return final_dict

    except (tarfile.TarError, OSError) as e:
        # Clean up failed download
        if archive_path.exists():
            archive_path.unlink()
        msg = f"Failed to extract WordNet archive: {e}"
        raise ExtractionError(msg) from e
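
The extract-then-move pattern used above can be tried without network access. The sketch below builds a tiny `.tar.gz` containing a `dict` folder and runs the same steps on it; `unpack_dict_folder` is a hypothetical name, and the fallback for interpreters without extraction filters is an added assumption.

```python
from pathlib import Path
import io
import shutil
import tarfile
import tempfile


def unpack_dict_folder(archive_path: Path, output_dir: Path, final_name: str) -> Path:
    """Extract to a scratch directory, then move the 'dict' folder into place."""
    with tempfile.TemporaryDirectory() as temp_dir:
        temp_path = Path(temp_dir)
        with tarfile.open(archive_path, "r:gz") as tar:
            if hasattr(tarfile, "data_filter"):
                # Safer extraction: rejects absolute paths, links outside the tree, etc.
                tar.extractall(temp_path, filter="data")
            else:
                tar.extractall(temp_path)
        extracted = temp_path / "dict"
        if not extracted.exists():
            raise FileNotFoundError("expected a 'dict' folder in the archive")
        final = output_dir / final_name
        if final.exists():
            shutil.rmtree(final)  # replace any earlier copy
        shutil.move(str(extracted), str(final))
    return final


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    archive = tmp_path / "demo.tar.gz"
    payload = b"placeholder index\n"
    # Write one in-memory file as dict/index.noun inside the archive.
    with tarfile.open(archive, "w:gz") as tar:
        info = tarfile.TarInfo("dict/index.noun")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

    final_dir = unpack_dict_folder(archive, tmp_path, "wn31-dict")
    final_name = final_dir.name
    index_text = (final_dir / "index.noun").read_text()
```

Extracting into a temporary directory first means a malformed archive never leaves partial files in `output_dir`.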

Functions:¶

`download_all(output_dir: Path, datasets: list[DatasetType] | None = None) -> dict[DatasetType, Path | Exception]` ¶

Download all available datasets.

PARAMETER	DESCRIPTION
`output_dir`	Directory to download datasets to. TYPE: `Path`
`datasets`	List of datasets to download. If None, downloads all supported datasets. TYPE: `list[DatasetType] \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`dict[DatasetType, Path \| Exception]`	Mapping of dataset names to either the download path (success) or the exception that occurred (failure).

Examples:

>>> from pathlib import Path
>>> results = download_all(Path("data/raw"))
>>> for dataset, result in results.items():
...     if isinstance(result, Path):
...         print(f"{dataset}: success -> {result}")
...     else:
...         print(f"{dataset}: failed -> {result}")

Source code in src/glazing/downloader.py

def download_all(
    output_dir: Path,
    datasets: list[DatasetType] | None = None,
) -> dict[DatasetType, Path | Exception]:
    """Download all available datasets.

    Parameters
    ----------
    output_dir : Path
        Directory to download datasets to.
    datasets : list[DatasetType] | None, default=None
        List of datasets to download. If None, downloads all supported datasets.

    Returns
    -------
    dict[DatasetType, Path | Exception]
        Mapping of dataset names to either the download path (success)
        or the exception that occurred (failure).

    Examples
    --------
    >>> from pathlib import Path
    >>> results = download_all(Path("data/raw"))
    >>> for dataset, result in results.items():
    ...     if isinstance(result, Path):
    ...         print(f"{dataset}: success -> {result}")
    ...     else:
    ...         print(f"{dataset}: failed -> {result}")
    """
    if datasets is None:
        datasets = list(_DOWNLOADERS.keys())

    results: dict[DatasetType, Path | Exception] = {}

    for dataset in datasets:
        try:
            path = download_dataset(dataset, output_dir)
            results[dataset] = path
            print(f"✓ {dataset}: {path}")
        except (DownloadError, ExtractionError, NotImplementedError) as e:
            results[dataset] = e
            print(f"✗ {dataset}: {e}")

    return results

`download_dataset(dataset: DatasetType | str, output_dir: Path) -> Path` ¶

Download a specific dataset.

PARAMETER	DESCRIPTION
`dataset`	Name of the dataset to download (case-insensitive). TYPE: `DatasetType \| str`
`output_dir`	Directory to download the dataset to. TYPE: `Path`

RETURNS	DESCRIPTION
`Path`	Path to the downloaded dataset directory.

RAISES	DESCRIPTION
`ValueError`	If dataset is not supported.
`DownloadError`	If download fails.
`ExtractionError`	If extraction fails.
`NotImplementedError`	If the dataset's downloader does not support automatic download.

Examples:

>>> from pathlib import Path
>>> path = download_dataset("verbnet", Path("data/raw"))
>>> print(f"Downloaded to: {path}")

Source code in src/glazing/downloader.py

def download_dataset(dataset: DatasetType | str, output_dir: Path) -> Path:
    """Download a specific dataset.

    Parameters
    ----------
    dataset : DatasetType | str
        Name of the dataset to download (case-insensitive).
    output_dir : Path
        Directory to download the dataset to.

    Returns
    -------
    Path
        Path to the downloaded dataset directory.

    Raises
    ------
    ValueError
        If dataset is not supported.
    DownloadError
        If download fails.
    ExtractionError
        If extraction fails.
    NotImplementedError
        If the dataset's downloader does not support automatic download.

    Examples
    --------
    >>> from pathlib import Path
    >>> path = download_dataset("verbnet", Path("data/raw"))
    >>> print(f"Downloaded to: {path}")
    """
    downloader = get_downloader(dataset)
    return downloader.download(output_dir)

`get_available_datasets() -> list[DatasetType]` ¶

Get list of available datasets for download.

RETURNS	DESCRIPTION
`list[DatasetType]`	List of supported dataset names.

Examples:

>>> datasets = get_available_datasets()
>>> print(datasets)
['verbnet', 'propbank', 'wordnet', 'framenet']

Source code in src/glazing/downloader.py

def get_available_datasets() -> list[DatasetType]:
    """Get list of available datasets for download.

    Returns
    -------
    list[DatasetType]
        List of supported dataset names.

    Examples
    --------
    >>> datasets = get_available_datasets()
    >>> print(datasets)
    ['verbnet', 'propbank', 'wordnet', 'framenet']
    """
    return list(_DOWNLOADERS.keys())

`get_dataset_info(dataset: DatasetType | str) -> dict[str, str]` ¶

Get information about a dataset.

PARAMETER	DESCRIPTION
`dataset`	Name of the dataset (case-insensitive). TYPE: `DatasetType \| str`

RETURNS	DESCRIPTION
`dict[str, str]`	Dictionary with the keys `name`, `version`, and `class` (the downloader's class name).

RAISES	DESCRIPTION
`ValueError`	If dataset is not supported.

Examples:

>>> info = get_dataset_info("verbnet")
>>> print(info["version"])
3.4

Source code in src/glazing/downloader.py

def get_dataset_info(dataset: DatasetType | str) -> dict[str, str]:
    """Get information about a dataset.

    Parameters
    ----------
    dataset : DatasetType | str
        Name of the dataset (case-insensitive).

    Returns
    -------
    dict[str, str]
        Dictionary with the keys ``name``, ``version``, and ``class``
        (the downloader's class name).

    Raises
    ------
    ValueError
        If dataset is not supported.

    Examples
    --------
    >>> info = get_dataset_info("verbnet")
    >>> print(info["version"])
    3.4
    """
    downloader = get_downloader(dataset)
    return {
        "name": downloader.dataset_name,
        "version": downloader.version,
        "class": downloader.__class__.__name__,
    }

`get_downloader(dataset: DatasetType | str) -> BaseDownloader` ¶

Get downloader instance for a dataset.

PARAMETER	DESCRIPTION
`dataset`	Name of the dataset to get downloader for (case-insensitive). TYPE: `DatasetType \| str`

RETURNS	DESCRIPTION
`BaseDownloader`	Downloader instance for the specified dataset.

RAISES	DESCRIPTION
`ValueError`	If dataset is not supported.

Examples:

>>> downloader = get_downloader("verbnet")
>>> print(downloader.version)
3.4

Source code in src/glazing/downloader.py

def get_downloader(dataset: DatasetType | str) -> BaseDownloader:
    """Get downloader instance for a dataset.

    Parameters
    ----------
    dataset : DatasetType | str
        Name of the dataset to get downloader for (case-insensitive).

    Returns
    -------
    BaseDownloader
        Downloader instance for the specified dataset.

    Raises
    ------
    ValueError
        If dataset is not supported.

    Examples
    --------
    >>> downloader = get_downloader("verbnet")
    >>> print(downloader.version)
    3.4
    """
    # Normalize to lowercase for case-insensitive lookup. The cast narrows the
    # plain ``str`` to the key type for the lookup; ``.get`` still returns None
    # for any unsupported name, so the cast never asserts a false membership.
    dataset_lower = cast(DatasetType, dataset.lower())
    downloader_class = _DOWNLOADERS.get(dataset_lower)
    if downloader_class is None:
        supported = ", ".join(_DOWNLOADERS.keys())
        msg = f"Unsupported dataset: {dataset}. Supported: {supported}"
        raise ValueError(msg)
    return downloader_class()
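
The lookup above is a small registry pattern: lowercase keys map to classes, and user input is normalized before the lookup. A self-contained sketch of the same idea follows; the stub classes, `_REGISTRY`, and `get_stub` are hypothetical stand-ins for the real downloader classes and `_DOWNLOADERS`.

```python
class VerbNetStub:
    """Stand-in downloader class for the demonstration."""

    version = "3.4"


class WordNetStub:
    """Stand-in downloader class for the demonstration."""

    version = "3.1"


# Keys are lowercase so that lookups can normalize user input.
_REGISTRY = {"verbnet": VerbNetStub, "wordnet": WordNetStub}


def get_stub(dataset: str):
    """Return a fresh instance for a dataset name, ignoring case."""
    cls = _REGISTRY.get(dataset.lower())
    if cls is None:
        supported = ", ".join(_REGISTRY)
        raise ValueError(f"Unsupported dataset: {dataset}. Supported: {supported}")
    return cls()


version = get_stub("VerbNet").version

try:
    get_stub("conceptnet")
    error_message = ""
except ValueError as e:
    error_message = str(e)
```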
