glazing.downloader

Dataset downloading utilities.

downloader

Dataset downloaders for linguistic resources.

This module provides automatic downloading capabilities for FrameNet, PropBank, VerbNet, and WordNet datasets. Each downloader handles version tracking, progress indication, and archive extraction.

CLASS DESCRIPTION
BaseDownloader

Abstract base class for dataset downloaders.

VerbNetDownloader

Downloads VerbNet from GitHub with commit hash versioning.

PropBankDownloader

Downloads PropBank from GitHub with commit hash versioning.

WordNetDownloader

Downloads WordNet 3.1 from Princeton University.

FrameNetDownloader

Downloads FrameNet v1.7 from the NLTK data repository.

FUNCTION DESCRIPTION
download_dataset

Download a specific dataset by name.

download_all

Download all available datasets.

get_downloader

Get downloader instance for a dataset.

Examples:

>>> from pathlib import Path
>>> from glazing.downloader import download_dataset
>>> path = download_dataset("verbnet", Path("data/raw"))
>>> print(f"VerbNet downloaded to: {path}")
>>> from glazing.downloader import VerbNetDownloader
>>> downloader = VerbNetDownloader()
>>> path = downloader.download(Path("data/raw"))

Classes

BaseDownloader

Bases: ABC

Abstract base class for dataset downloaders.

Provides common functionality for downloading and extracting datasets with progress tracking and error handling.

ATTRIBUTE DESCRIPTION
dataset_name

Human-readable name of the dataset.

TYPE: str

version

Version string or commit hash for the dataset.

TYPE: str

METHOD DESCRIPTION
download

Download the dataset to the specified directory.

Attributes
dataset_name: str abstractmethod property

Name of the dataset.

RETURNS DESCRIPTION
str

Human-readable dataset name.

version: str abstractmethod property

Version or commit hash.

RETURNS DESCRIPTION
str

Version identifier for reproducible downloads.

Functions
download(output_dir: Path) -> Path abstractmethod

Download dataset to output directory.

PARAMETER DESCRIPTION
output_dir

Directory to download the dataset to.

TYPE: Path

RETURNS DESCRIPTION
Path

Path to the downloaded and extracted dataset.

RAISES DESCRIPTION
DownloadError

If download fails.

ExtractionError

If archive extraction fails.

Source code in src/glazing/downloader.py
@abstractmethod
def download(self, output_dir: Path) -> Path:
    """Download dataset to output directory.

    Parameters
    ----------
    output_dir : Path
        Directory to download the dataset to.

    Returns
    -------
    Path
        Path to the downloaded and extracted dataset.

    Raises
    ------
    DownloadError
        If download fails.
    ExtractionError
        If archive extraction fails.
    """

DownloadError

Bases: Exception

Raised when a download operation fails.

ExtractionError

Bases: Exception

Raised when archive extraction fails.
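
A concrete downloader has to supply the two properties and the `download` method described for `BaseDownloader`. The sketch below mirrors that documented interface with a local stand-in class so that it runs without `glazing` installed. `LocalCopyDownloader`, its `source` argument, and the target directory naming are hypothetical and not part of the library.

```python
import shutil
from abc import ABC, abstractmethod
from pathlib import Path


class BaseDownloader(ABC):
    """Local stand-in mirroring the documented interface (an assumption)."""

    @property
    @abstractmethod
    def dataset_name(self) -> str: ...

    @property
    @abstractmethod
    def version(self) -> str: ...

    @abstractmethod
    def download(self, output_dir: Path) -> Path: ...


class LocalCopyDownloader(BaseDownloader):
    """Hypothetical downloader that copies a dataset from a local directory."""

    def __init__(self, source: Path) -> None:
        self._source = source

    @property
    def dataset_name(self) -> str:
        return "localcopy"

    @property
    def version(self) -> str:
        return "0.1"

    def download(self, output_dir: Path) -> Path:
        # Hypothetical layout: one "<name>-<version>" folder per dataset.
        target = output_dir / f"{self.dataset_name}-{self.version}"
        output_dir.mkdir(parents=True, exist_ok=True)
        if target.exists():
            shutil.rmtree(target)  # replace any earlier copy
        shutil.copytree(self._source, target)
        return target
```

Because all three members are abstract, instantiating the base class directly raises `TypeError`; only subclasses that define every member can be created.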

FrameNetDownloader

Bases: BaseDownloader

Downloads FrameNet from NLTK data repository.

Downloads FrameNet v1.7 from the NLTK data GitHub repository, which provides the dataset without license restrictions.

ATTRIBUTE DESCRIPTION
dataset_name

"framenet"

TYPE: str

version

"1.7"

TYPE: str

commit_hash

"427fc05d3a8cc1ca99e7ff93bdea937507cc9e7a"

TYPE: str

METHOD DESCRIPTION
download

Download FrameNet from NLTK data repository.

Attributes
commit_hash: str property

NLTK data repository commit hash.

dataset_name: str property

Name of the dataset.

version: str property

Version of FrameNet.

Functions
download(output_dir: Path) -> Path

Download FrameNet from NLTK data repository.

PARAMETER DESCRIPTION
output_dir

Directory to download FrameNet into.

TYPE: Path

RETURNS DESCRIPTION
Path

Path to the extracted FrameNet directory.

RAISES DESCRIPTION
DownloadError

If download fails.

ExtractionError

If extraction fails.

Source code in src/glazing/downloader.py
def download(self, output_dir: Path) -> Path:
    """Download FrameNet from NLTK data repository.

    Parameters
    ----------
    output_dir : Path
        Directory to download FrameNet into.

    Returns
    -------
    Path
        Path to the extracted FrameNet directory.

    Raises
    ------
    DownloadError
        If download fails.
    ExtractionError
        If extraction fails.
    """
    url = f"https://raw.githubusercontent.com/nltk/nltk_data/{self.commit_hash}/packages/corpora/framenet_v17.zip"
    archive_path = output_dir / f"framenet-{self.version}.zip"

    try:
        print(f"Downloading {self.dataset_name} v{self.version}...")
        self._download_file(url, archive_path)

        print(f"Extracting {archive_path.name}...")
        extracted_path = self._extract_archive(archive_path, output_dir)

    except Exception as e:
        if isinstance(e, DownloadError | ExtractionError):
            raise
        msg = f"Failed to download {self.dataset_name}: {e}"
        raise DownloadError(msg) from e

    else:
        return extracted_path

    finally:
        # Always remove the archive, on success and on failure alike
        if archive_path.exists():
            archive_path.unlink()
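
A `finally` clause runs on both the success and the failure path, so a single cleanup there is enough to guarantee the archive is removed. The sketch below isolates that control flow with a stand-in for the download and extraction step. `fetch_and_extract` and `do_work` are hypothetical names, not part of glazing.

```python
from pathlib import Path
from typing import Callable


def fetch_and_extract(archive_path: Path, do_work: Callable[[Path], Path]) -> Path:
    """Run do_work on archive_path, then delete the archive no matter what."""
    try:
        return do_work(archive_path)
    finally:
        # Runs after a normal return and after an exception alike.
        archive_path.unlink(missing_ok=True)
```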

PropBankDownloader

Bases: BaseDownloader

Downloads PropBank from GitHub repository.

Downloads the PropBank frames from the official GitHub repository using a specific commit hash for reproducibility.

ATTRIBUTE DESCRIPTION
dataset_name

"propbank"

TYPE: str

version

"3.4.0"

TYPE: str

commit_hash

"7280a04806b6ca3955ec82e28c4df96b6da76aef"

TYPE: str

METHOD DESCRIPTION
download

Download PropBank dataset.

Attributes
commit_hash: str property

GitHub repository commit hash.

dataset_name: str property

Name of the dataset.

version: str property

Version of PropBank.

Functions
download(output_dir: Path) -> Path

Download PropBank dataset.

PARAMETER DESCRIPTION
output_dir

Directory to download PropBank to.

TYPE: Path

RETURNS DESCRIPTION
Path

Path to the extracted PropBank directory.

RAISES DESCRIPTION
DownloadError

If download fails.

ExtractionError

If extraction fails.

Source code in src/glazing/downloader.py
def download(self, output_dir: Path) -> Path:
    """Download PropBank dataset.

    Parameters
    ----------
    output_dir : Path
        Directory to download PropBank to.

    Returns
    -------
    Path
        Path to the extracted PropBank directory.

    Raises
    ------
    DownloadError
        If download fails.
    ExtractionError
        If extraction fails.
    """
    url = f"https://github.com/propbank/propbank-frames/archive/{self.commit_hash}.zip"
    archive_name = f"propbank-{self.version}.zip"
    archive_path = output_dir / archive_name

    self._download_file(url, archive_path)

    try:
        extracted_dir = self._extract_archive(archive_path, output_dir)
    except ExtractionError:
        # Clean up failed download
        if archive_path.exists():
            archive_path.unlink()
        raise
    else:
        # Clean up archive file
        archive_path.unlink()
        return extracted_dir
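
`PropBankDownloader` and `VerbNetDownloader` pin the archive to a commit hash rather than a branch name, so the same URL always yields the same files. A minimal sketch of that URL scheme follows. `github_archive_url` is a hypothetical helper written for illustration, not a glazing function.

```python
def github_archive_url(owner: str, repo: str, commit_hash: str) -> str:
    """Build the zip-archive URL for one exact commit of a GitHub repository."""
    return f"https://github.com/{owner}/{repo}/archive/{commit_hash}.zip"


# Commit hash taken from the PropBankDownloader attributes documented above.
url = github_archive_url(
    "propbank", "propbank-frames", "7280a04806b6ca3955ec82e28c4df96b6da76aef"
)
```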

VerbNetDownloader

Bases: BaseDownloader

Downloads VerbNet from GitHub repository.

Downloads the VerbNet dataset from the official GitHub repository using a specific commit hash for reproducibility.

ATTRIBUTE DESCRIPTION
dataset_name

"verbnet"

TYPE: str

version

"3.4"

TYPE: str

commit_hash

"ae8e9cfdc2c0d3414b748763612f1a0a34194cc1"

TYPE: str

METHOD DESCRIPTION
download

Download VerbNet dataset.

Attributes
commit_hash: str property

GitHub repository commit hash.

dataset_name: str property

Name of the dataset.

version: str property

Version of VerbNet.

Functions
download(output_dir: Path) -> Path

Download VerbNet dataset.

PARAMETER DESCRIPTION
output_dir

Directory to download VerbNet to.

TYPE: Path

RETURNS DESCRIPTION
Path

Path to the extracted VerbNet directory.

RAISES DESCRIPTION
DownloadError

If download fails.

ExtractionError

If extraction fails.

Source code in src/glazing/downloader.py
def download(self, output_dir: Path) -> Path:
    """Download VerbNet dataset.

    Parameters
    ----------
    output_dir : Path
        Directory to download VerbNet to.

    Returns
    -------
    Path
        Path to the extracted VerbNet directory.

    Raises
    ------
    DownloadError
        If download fails.
    ExtractionError
        If extraction fails.
    """
    url = f"https://github.com/cu-clear/verbnet/archive/{self.commit_hash}.zip"
    archive_name = f"verbnet-{self.version}.zip"
    archive_path = output_dir / archive_name

    self._download_file(url, archive_path)

    try:
        extracted_dir = self._extract_archive(archive_path, output_dir)
    except ExtractionError:
        # Clean up failed download
        if archive_path.exists():
            archive_path.unlink()
        raise
    else:
        # Clean up archive file
        archive_path.unlink()
        return extracted_dir

WordNetDownloader

Bases: BaseDownloader

Downloads WordNet 3.1 from Princeton University.

Downloads the WordNet 3.1 database from the official Princeton University distribution site.

ATTRIBUTE DESCRIPTION
dataset_name

"wordnet"

TYPE: str

version

"3.1"

TYPE: str

METHOD DESCRIPTION
download

Download WordNet dataset.

Attributes
dataset_name: str property

Name of the dataset.

version: str property

Version of WordNet.

Functions
download(output_dir: Path) -> Path

Download WordNet dataset.

PARAMETER DESCRIPTION
output_dir

Directory to download WordNet to.

TYPE: Path

RETURNS DESCRIPTION
Path

Path to the extracted WordNet directory.

RAISES DESCRIPTION
DownloadError

If download fails.

ExtractionError

If extraction fails.

Source code in src/glazing/downloader.py
def download(self, output_dir: Path) -> Path:
    """Download WordNet dataset.

    Parameters
    ----------
    output_dir : Path
        Directory to download WordNet to.

    Returns
    -------
    Path
        Path to the extracted WordNet directory.

    Raises
    ------
    DownloadError
        If download fails.
    ExtractionError
        If extraction fails.
    """
    url = "https://wordnetcode.princeton.edu/wn3.1.dict.tar.gz"
    archive_name = "wordnet-3.1.tar.gz"
    archive_path = output_dir / archive_name

    self._download_file(url, archive_path)

    # WordNet tar.gz contains a 'dict' folder with all the data files
    try:
        # Extract to temp location first to see structure
        with tempfile.TemporaryDirectory() as temp_dir:
            temp_path = Path(temp_dir)

            # Extract archive
            with tarfile.open(str(archive_path), "r:gz") as tar_ref:
                tar_ref.extractall(temp_path, filter="data")

            # The archive contains a 'dict' folder
            extracted_dict = temp_path / "dict"
            if not extracted_dict.exists():
                raise ExtractionError("Expected 'dict' folder in WordNet archive")

            # Move to final location
            final_dict = output_dir / "wn31-dict"
            if final_dict.exists():
                shutil.rmtree(final_dict)
            shutil.move(str(extracted_dict), str(final_dict))

            # Clean up archive file
            archive_path.unlink()
            return final_dict

    except (tarfile.TarError, OSError) as e:
        # Clean up failed download
        if archive_path.exists():
            archive_path.unlink()
        msg = f"Failed to extract WordNet archive: {e}"
        raise ExtractionError(msg) from e
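
The extract-to-temp-then-move step can be tried without any network access. The sketch below is a simplified version under stated assumptions: `extract_subfolder` is a hypothetical helper, it raises `FileNotFoundError` instead of the library's `ExtractionError`, and it falls back to unfiltered extraction on Python versions that lack tar extraction filters.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def extract_subfolder(archive_path: Path, folder: str, final_dir: Path) -> Path:
    """Extract a .tar.gz into a temp dir and move one top-level folder out."""
    with tempfile.TemporaryDirectory() as temp_dir:
        temp_path = Path(temp_dir)
        with tarfile.open(archive_path, "r:gz") as tar_ref:
            if hasattr(tarfile, "data_filter"):
                # Extraction filters are available: reject unsafe members.
                tar_ref.extractall(temp_path, filter="data")
            else:
                # Older Python without extraction filters.
                tar_ref.extractall(temp_path)
        extracted = temp_path / folder
        if not extracted.is_dir():
            raise FileNotFoundError(f"Expected {folder!r} folder in {archive_path.name}")
        if final_dir.exists():
            shutil.rmtree(final_dir)  # replace any earlier extraction
        shutil.move(str(extracted), str(final_dir))
    return final_dir
```

Extracting into a temporary directory first means a malformed archive never leaves partial files in the output directory.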

Functions

download_all(output_dir: Path, datasets: list[DatasetType] | None = None) -> dict[DatasetType, Path | Exception]

Download all available datasets.

PARAMETER DESCRIPTION
output_dir

Directory to download datasets to.

TYPE: Path

datasets

List of datasets to download. If None, downloads all supported datasets.

TYPE: list[DatasetType] | None DEFAULT: None

RETURNS DESCRIPTION
dict[DatasetType, Path | Exception]

Mapping of dataset names to either the download path (success) or the exception that occurred (failure).

Examples:

>>> from pathlib import Path
>>> results = download_all(Path("data/raw"))
>>> for dataset, result in results.items():
...     if isinstance(result, Path):
...         print(f"{dataset}: success -> {result}")
...     else:
...         print(f"{dataset}: failed -> {result}")

Source code in src/glazing/downloader.py
def download_all(
    output_dir: Path,
    datasets: list[DatasetType] | None = None,
) -> dict[DatasetType, Path | Exception]:
    """Download all available datasets.

    Parameters
    ----------
    output_dir : Path
        Directory to download datasets to.
    datasets : list[DatasetType] | None, default=None
        List of datasets to download. If None, downloads all supported datasets.

    Returns
    -------
    dict[DatasetType, Path | Exception]
        Mapping of dataset names to either the download path (success)
        or the exception that occurred (failure).

    Examples
    --------
    >>> from pathlib import Path
    >>> results = download_all(Path("data/raw"))
    >>> for dataset, result in results.items():
    ...     if isinstance(result, Path):
    ...         print(f"{dataset}: success -> {result}")
    ...     else:
    ...         print(f"{dataset}: failed -> {result}")
    """
    if datasets is None:
        datasets = list(_DOWNLOADERS.keys())

    results: dict[DatasetType, Path | Exception] = {}

    for dataset in datasets:
        try:
            path = download_dataset(dataset, output_dir)
            results[dataset] = path
            print(f"✓ {dataset}: {path}")
        except (DownloadError, ExtractionError, NotImplementedError) as e:
            results[dataset] = e
            print(f"✗ {dataset}: {e}")

    return results

download_dataset(dataset: DatasetType | str, output_dir: Path) -> Path

Download a specific dataset.

PARAMETER DESCRIPTION
dataset

Name of the dataset to download (case-insensitive).

TYPE: DatasetType | str

output_dir

Directory to download the dataset to.

TYPE: Path

RETURNS DESCRIPTION
Path

Path to the downloaded dataset directory.

RAISES DESCRIPTION
ValueError

If dataset is not supported.

DownloadError

If download fails.

ExtractionError

If extraction fails.

NotImplementedError

If the dataset's downloader does not support automatic download.

Examples:

>>> from pathlib import Path
>>> path = download_dataset("verbnet", Path("data/raw"))
>>> print(f"Downloaded to: {path}")

Source code in src/glazing/downloader.py
def download_dataset(dataset: DatasetType | str, output_dir: Path) -> Path:
    """Download a specific dataset.

    Parameters
    ----------
    dataset : DatasetType | str
        Name of the dataset to download (case-insensitive).
    output_dir : Path
        Directory to download the dataset to.

    Returns
    -------
    Path
        Path to the downloaded dataset directory.

    Raises
    ------
    ValueError
        If dataset is not supported.
    DownloadError
        If download fails.
    ExtractionError
        If extraction fails.
    NotImplementedError
        If the dataset's downloader does not support automatic download.

    Examples
    --------
    >>> from pathlib import Path
    >>> path = download_dataset("verbnet", Path("data/raw"))
    >>> print(f"Downloaded to: {path}")
    """
    downloader = get_downloader(dataset)
    return downloader.download(output_dir)

get_available_datasets() -> list[DatasetType]

Get list of available datasets for download.

RETURNS DESCRIPTION
list[DatasetType]

List of supported dataset names.

Examples:

>>> datasets = get_available_datasets()
>>> print(datasets)
['verbnet', 'propbank', 'wordnet', 'framenet']

Source code in src/glazing/downloader.py
def get_available_datasets() -> list[DatasetType]:
    """Get list of available datasets for download.

    Returns
    -------
    list[DatasetType]
        List of supported dataset names.

    Examples
    --------
    >>> datasets = get_available_datasets()
    >>> print(datasets)
    ['verbnet', 'propbank', 'wordnet', 'framenet']
    """
    return list(_DOWNLOADERS.keys())

get_dataset_info(dataset: DatasetType | str) -> dict[str, str]

Get information about a dataset.

PARAMETER DESCRIPTION
dataset

Name of the dataset (case-insensitive).

TYPE: DatasetType | str

RETURNS DESCRIPTION
dict[str, str]

Dictionary with the keys "name", "version", and "class" (the downloader class name).

RAISES DESCRIPTION
ValueError

If dataset is not supported.

Examples:

>>> info = get_dataset_info("verbnet")
>>> print(info["version"])
3.4

Source code in src/glazing/downloader.py
def get_dataset_info(dataset: DatasetType | str) -> dict[str, str]:
    """Get information about a dataset.

    Parameters
    ----------
    dataset : DatasetType | str
        Name of the dataset (case-insensitive).

    Returns
    -------
    dict[str, str]
        Dictionary with the keys "name", "version", and "class".

    Raises
    ------
    ValueError
        If dataset is not supported.

    Examples
    --------
    >>> info = get_dataset_info("verbnet")
    >>> print(info["version"])
    3.4
    """
    downloader = get_downloader(dataset)
    return {
        "name": downloader.dataset_name,
        "version": downloader.version,
        "class": downloader.__class__.__name__,
    }

get_downloader(dataset: DatasetType | str) -> BaseDownloader

Get downloader instance for a dataset.

PARAMETER DESCRIPTION
dataset

Name of the dataset to get downloader for (case-insensitive).

TYPE: DatasetType | str

RETURNS DESCRIPTION
BaseDownloader

Downloader instance for the specified dataset.

RAISES DESCRIPTION
ValueError

If dataset is not supported.

Examples:

>>> downloader = get_downloader("verbnet")
>>> print(downloader.version)
3.4

Source code in src/glazing/downloader.py
def get_downloader(dataset: DatasetType | str) -> BaseDownloader:
    """Get downloader instance for a dataset.

    Parameters
    ----------
    dataset : DatasetType | str
        Name of the dataset to get downloader for (case-insensitive).

    Returns
    -------
    BaseDownloader
        Downloader instance for the specified dataset.

    Raises
    ------
    ValueError
        If dataset is not supported.

    Examples
    --------
    >>> downloader = get_downloader("verbnet")
    >>> print(downloader.version)
    3.4
    """
    # Normalize to lowercase for case-insensitive lookup
    dataset_lower = dataset.lower()

    if dataset_lower not in _DOWNLOADERS:
        supported = ", ".join(_DOWNLOADERS.keys())
        msg = f"Unsupported dataset: {dataset}. Supported: {supported}"
        raise ValueError(msg)

    # Cast to DatasetType for type checking
    dataset_typed = cast(DatasetType, dataset_lower)
    downloader_class = _DOWNLOADERS[dataset_typed]
    return downloader_class()
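
The case-insensitive lookup relies on the registry keys being lowercase. The sketch below shows the same idea with stand-in classes. `REGISTRY`, `lookup`, and the stub classes are hypothetical, since the real `_DOWNLOADERS` mapping is not shown on this page.

```python
class VerbNetStub:
    """Stand-in for a downloader class (not the real VerbNetDownloader)."""

    version = "3.4"


class WordNetStub:
    """Stand-in for a downloader class (not the real WordNetDownloader)."""

    version = "3.1"


# Hypothetical registry keyed by lowercase dataset name.
REGISTRY: dict[str, type] = {"verbnet": VerbNetStub, "wordnet": WordNetStub}


def lookup(dataset: str):
    """Return a new instance for dataset, matching names case-insensitively."""
    key = dataset.lower()
    if key not in REGISTRY:
        supported = ", ".join(REGISTRY)
        raise ValueError(f"Unsupported dataset: {dataset}. Supported: {supported}")
    return REGISTRY[key]()
```

Each call returns a fresh instance, so callers never share downloader state.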