Skip to content

glazing.utils.xml_parser

XML parsing utilities.

xml_parser

High-performance XML parsing utilities using lxml.

This module provides fast, memory-efficient XML parsing utilities built on lxml, which uses C libraries (libxml2/libxslt) for 20x performance over pure Python parsers.

FUNCTION DESCRIPTION
iterparse_elements

Memory-efficient streaming parser for large XML files.

parse_with_schema

Parse XML with DTD or XSD validation.

extract_text_with_markup

Extract text preserving embedded markup tags.

compile_xpath

Pre-compile XPath expressions for repeated use.

parse_attributes

Parse and convert XML attributes to Python types.

clear_element

Clear element to free memory during parsing.

fragment_to_annotations

Convert XML fragments to annotation objects.

CLASS DESCRIPTION
MarkupExtractor

Extract and preserve embedded markup from mixed content.

StreamingParser

Event-driven streaming parser for large files.

Notes

Uses lxml.etree for maximum performance with large linguistic datasets. All parsers use iterparse for constant memory usage regardless of file size.

Classes

MarkupExtractor(preserve_tags: set[str], nested: bool = False)

Extract and preserve embedded markup from mixed content.

Optimized for FrameNet's complex annotation structure with multiple levels of embedded markup.

PARAMETER DESCRIPTION
preserve_tags

Tags to preserve as annotations.

TYPE: set[str]

nested

Support nested markup tags.

TYPE: bool DEFAULT: False

ATTRIBUTE DESCRIPTION
preserve_tags

Tags being preserved.

TYPE: set[str]

nested

Whether nested tags are supported.

TYPE: bool

METHOD DESCRIPTION
extract

Extract text and annotations from element.

extract_recursive

Recursively extract nested annotations.

Initialize markup extractor.

PARAMETER DESCRIPTION
preserve_tags

Tags to preserve as annotations.

TYPE: set[str]

nested

Support nested markup tags.

TYPE: bool DEFAULT: False

Source code in src/glazing/utils/xml_parser.py
def __init__(self, preserve_tags: set[str], nested: bool = False) -> None:
    """Initialize markup extractor.

    Parameters
    ----------
    preserve_tags : set[str]
        Tags to preserve as annotations.
    nested : bool
        Support nested markup tags.
    """
    self.preserve_tags = preserve_tags
    self.nested = nested
Functions
extract(element: ElementType) -> tuple[str, list[dict[str, str | int]]]

Extract text and annotations from element.

PARAMETER DESCRIPTION
element

Element to extract from.

TYPE: ElementType

RETURNS DESCRIPTION
tuple[str, list[dict[str, str | int]]]

Plain text and annotation list.

Source code in src/glazing/utils/xml_parser.py
def extract(self, element: ElementType) -> tuple[str, list[dict[str, str | int]]]:
    """Extract text and annotations from element.

    Parameters
    ----------
    element : ElementType
        Element to extract from.

    Returns
    -------
    tuple[str, list[dict[str, str | int]]]
        Plain text and annotation list.
    """
    if self.nested:
        return self._extract_recursive(element, depth=0)
    return extract_text_with_markup(element, self.preserve_tags)

StreamingParser(filepath: Path | str, target_tags: set[str] | None = None, max_depth: int = 10)

Event-driven streaming parser for large files.

Processes XML files of any size with constant memory usage.

PARAMETER DESCRIPTION
filepath

Path to XML file.

TYPE: Path | str

target_tags

Tags to process. If None, process all.

TYPE: set[str] | None DEFAULT: None

max_depth

Maximum parsing depth.

TYPE: int DEFAULT: 10

ATTRIBUTE DESCRIPTION
filepath

Path to XML file.

TYPE: Path

target_tags

Tags being processed.

TYPE: set[str] | None

max_depth

Maximum depth to parse.

TYPE: int

METHOD DESCRIPTION
parse

Parse file with custom handler.

iter_elements

Iterate over elements with tag.

count_elements

Count elements without loading.

Initialize streaming parser.

PARAMETER DESCRIPTION
filepath

Path to XML file.

TYPE: Path | str

target_tags

Tags to process.

TYPE: set[str] | None DEFAULT: None

max_depth

Maximum parsing depth.

TYPE: int DEFAULT: 10

Source code in src/glazing/utils/xml_parser.py
def __init__(
    self,
    filepath: Path | str,
    target_tags: set[str] | None = None,
    max_depth: int = 10,
) -> None:
    """Initialize streaming parser.

    Parameters
    ----------
    filepath : Path | str
        Path to XML file.
    target_tags : set[str] | None
        Tags to process.
    max_depth : int
        Maximum parsing depth.
    """
    self.filepath = Path(filepath)
    self.target_tags = target_tags
    self.max_depth = max_depth
Functions
count_elements(tag: str) -> int

Count elements without loading.

PARAMETER DESCRIPTION
tag

Tag to count.

TYPE: str

RETURNS DESCRIPTION
int

Number of elements with tag.

Source code in src/glazing/utils/xml_parser.py
def count_elements(self, tag: str) -> int:
    """Count elements without loading.

    Parameters
    ----------
    tag : str
        Tag to count.

    Returns
    -------
    int
        Number of elements with tag.
    """
    count = 0
    for _ in self.iter_elements(tag):
        count += 1
    return count
iter_elements(tag: str) -> Iterator[ElementType]

Iterate over elements with tag.

PARAMETER DESCRIPTION
tag

Tag to filter by.

TYPE: str

YIELDS DESCRIPTION
ElementType

Elements with specified tag.

Source code in src/glazing/utils/xml_parser.py
def iter_elements(self, tag: str) -> Iterator[ElementType]:
    """Iterate over elements with tag.

    Parameters
    ----------
    tag : str
        Tag to filter by.

    Yields
    ------
    ElementType
        Elements with specified tag.
    """
    for _event, elem in iterparse_elements(self.filepath, tag=tag):
        yield elem
        clear_element(elem)
parse(handler: Callable[[ElementType], None]) -> None

Parse file with custom handler.

PARAMETER DESCRIPTION
handler

Function called for each target element.

TYPE: callable

Source code in src/glazing/utils/xml_parser.py
def parse(self, handler: Callable[[ElementType], None]) -> None:
    """Parse file with custom handler.

    Parameters
    ----------
    handler : callable
        Function called for each target element.
    """
    for event, elem in iterparse_elements(self.filepath, events=("start", "end")):
        if event == "end" and (self.target_tags is None or elem.tag in self.target_tags):
            handler(elem)
            clear_element(elem)

Functions

clear_element(element: ElementType, keep_tail: bool = False) -> None

Clear element to free memory during parsing.

Removes element content while preserving structure for continued iteration. Critical for processing large files with constant memory usage.

PARAMETER DESCRIPTION
element

Element to clear.

TYPE: ElementType

keep_tail

Preserve tail text (text after element).

TYPE: bool DEFAULT: False

Examples:

>>> for event, elem in iterparse_elements("huge.xml"):
...     process(elem)
...     clear_element(elem)  # Free memory immediately
Source code in src/glazing/utils/xml_parser.py
def clear_element(element: ElementType, keep_tail: bool = False) -> None:
    """Clear element to free memory during parsing.

    Removes element content while preserving structure for continued iteration.
    Critical for processing large files with constant memory usage.

    Parameters
    ----------
    element : ElementType
        Element to clear.
    keep_tail : bool
        Preserve tail text (text after element).

    Examples
    --------
    >>> for event, elem in iterparse_elements("huge.xml"):
    ...     process(elem)
    ...     clear_element(elem)  # Free memory immediately
    """
    # Remove reference to parent
    parent = element.getparent()
    if parent is not None:
        parent.remove(element)

    # Clear content
    element.clear()

    # Preserve tail if needed
    if keep_tail and element.tail:
        element.tail = element.tail

compile_xpath(expression: str, namespaces: dict[str, str] | None = None) -> XPathType

Pre-compile XPath expressions for repeated use.

Compiled XPath expressions are 3-5x faster for repeated queries.

PARAMETER DESCRIPTION
expression

XPath expression to compile.

TYPE: str

namespaces

Namespace prefixes to URIs.

TYPE: dict[str, str] | None DEFAULT: None

RETURNS DESCRIPTION
XPathType

Compiled XPath expression.

Examples:

>>> xpath = compile_xpath("//frame[@id=$frame_id]/FE")
>>> fes = xpath(root, frame_id="123")
Source code in src/glazing/utils/xml_parser.py
def compile_xpath(expression: str, namespaces: dict[str, str] | None = None) -> XPathType:
    """Pre-compile XPath expressions for repeated use.

    Compiled XPath expressions are 3-5x faster for repeated queries.

    Parameters
    ----------
    expression : str
        XPath expression to compile.
    namespaces : dict[str, str] | None
        Namespace prefixes to URIs.

    Returns
    -------
    XPathType
        Compiled XPath expression.

    Examples
    --------
    >>> xpath = compile_xpath("//frame[@id=$frame_id]/FE")
    >>> fes = xpath(root, frame_id="123")
    """
    return etree.XPath(expression, namespaces=namespaces)

extract_text_with_markup(element: ElementType, preserve_tags: set[str] | None = None) -> tuple[str, list[dict[str, str | int]]]

Extract text preserving embedded markup tags.

Handles mixed content like FrameNet's embedded FE references: The person abandoned the car.

PARAMETER DESCRIPTION
element

Element containing mixed content.

TYPE: ElementType

preserve_tags

Tags to preserve as annotations. If None, preserve all.

TYPE: set[str] | None DEFAULT: None

RETURNS DESCRIPTION
tuple[str, list[dict[str, str | int]]]

Plain text and list of annotation dictionaries with positions.

Examples:

>>> text, annos = extract_text_with_markup(elem, {"fex", "fen", "t", "ex"})
>>> print(text)
'The person abandoned the car.'
>>> print(annos[0])
{'tag': 'fex', 'name': 'Agent', 'start': 4, 'end': 10, 'text': 'person'}
Source code in src/glazing/utils/xml_parser.py
def extract_text_with_markup(
    element: ElementType,
    preserve_tags: set[str] | None = None,
) -> tuple[str, list[dict[str, str | int]]]:
    """Extract text preserving embedded markup tags.

    Handles mixed content like FrameNet's embedded FE references:
    <text>The <fex name="Agent">person</fex> abandoned the <fex name="Theme">car</fex>.</text>

    Parameters
    ----------
    element : ElementType
        Element containing mixed content.
    preserve_tags : set[str] | None
        Tags to preserve as annotations. If None, preserve all.

    Returns
    -------
    tuple[str, list[dict[str, str | int]]]
        Plain text and list of annotation dictionaries with positions.

    Examples
    --------
    >>> text, annos = extract_text_with_markup(elem, {"fex", "fen", "t", "ex"})
    >>> print(text)
    'The person abandoned the car.'
    >>> print(annos[0])
    {'tag': 'fex', 'name': 'Agent', 'start': 4, 'end': 10, 'text': 'person'}
    """
    plain_text = []
    annotations = []
    position = 0

    # Get initial text before any child elements
    if element.text:
        plain_text.append(element.text)
        position += len(element.text)

    for child in element:
        if preserve_tags is None or child.tag in preserve_tags:
            # Record annotation
            start = position
            child_text = child.text or ""
            plain_text.append(child_text)
            end = position + len(child_text)

            annotation: dict[str, str | int] = {
                "tag": child.tag,
                "start": start,
                "end": end,
                "text": child_text,
            }

            # Add attributes
            for key, value in child.attrib.items():
                annotation[str(key)] = str(value)

            annotations.append(annotation)
            position = end

        # Get tail text after child element
        if child.tail:
            plain_text.append(child.tail)
            position += len(child.tail)

    return "".join(plain_text), annotations

fragment_to_annotations(text: str, tag_pattern: str = '<(\\w+)([^>]*)>(.*?)</\\1>') -> tuple[str, list[dict[str, str | int]]]

Convert XML fragments to annotation objects.

Alternative to full XML parsing for simple embedded markup.

PARAMETER DESCRIPTION
text

Text with XML fragments.

TYPE: str

tag_pattern

Regex pattern for matching tags.

TYPE: str DEFAULT: '<(\\w+)([^>]*)>(.*?)</\\1>'

RETURNS DESCRIPTION
tuple[str, list[dict[str, str | int]]]

Plain text and annotations.

Source code in src/glazing/utils/xml_parser.py
def fragment_to_annotations(
    text: str,
    tag_pattern: str = r"<(\w+)([^>]*)>(.*?)</\1>",
) -> tuple[str, list[dict[str, str | int]]]:
    """Convert XML fragments to annotation objects.

    Alternative to full XML parsing for simple embedded markup.

    Parameters
    ----------
    text : str
        Text with XML fragments.
    tag_pattern : str
        Regex pattern for matching tags.

    Returns
    -------
    tuple[str, list[dict[str, str | int]]]
        Plain text and annotations.
    """
    annotations = []
    plain_parts = []
    last_end = 0

    for match in re.finditer(tag_pattern, text):
        # Add text before match
        plain_parts.append(text[last_end : match.start()])

        # Extract match info
        tag = match.group(1)
        attrs_str = match.group(2)
        content = match.group(3)

        # Parse attributes
        attrs = {}
        for attr_match in re.finditer(r'(\w+)="([^"]*)"', attrs_str):
            attrs[attr_match.group(1)] = attr_match.group(2)

        # Create annotation
        start = len("".join(plain_parts))
        plain_parts.append(content)
        end = len("".join(plain_parts))

        annotation = {
            "tag": tag,
            "start": start,
            "end": end,
            "text": content,
            **attrs,
        }
        annotations.append(annotation)

        last_end = match.end()

    # Add remaining text
    plain_parts.append(text[last_end:])

    return "".join(plain_parts), annotations

iterparse_elements(filepath: Path | str, tag: str | None = None, events: tuple[str, ...] = ('end',), encoding: str = 'utf-8', remove_blank_text: bool = True, huge_tree: bool = False) -> Generator[tuple[str, ElementType], None, None]

Memory-efficient streaming parser for large XML files.

Uses lxml's iterparse to process elements as they're parsed, maintaining constant memory usage regardless of file size.

PARAMETER DESCRIPTION
filepath

Path to XML file to parse.

TYPE: Path | str

tag

If specified, only yield elements with this tag.

TYPE: str | None DEFAULT: None

events

Events to listen for ('start', 'end', 'start-ns', 'end-ns').

TYPE: tuple[str, ...] DEFAULT: ('end',)

encoding

File encoding.

TYPE: str DEFAULT: 'utf-8'

remove_blank_text

Remove whitespace-only text nodes.

TYPE: bool DEFAULT: True

huge_tree

Enable parsing of very large documents (>500MB).

TYPE: bool DEFAULT: False

YIELDS DESCRIPTION
tuple[str, ElementType]

Event type and parsed element.

Examples:

>>> for event, elem in iterparse_elements("frames.xml", tag="frame"):
...     process_frame(elem)
...     elem.clear()  # Free memory
Source code in src/glazing/utils/xml_parser.py
def iterparse_elements(  # noqa: PLR0913
    filepath: Path | str,
    tag: str | None = None,
    events: tuple[str, ...] = ("end",),
    encoding: str = "utf-8",
    remove_blank_text: bool = True,
    huge_tree: bool = False,
) -> Generator[tuple[str, ElementType], None, None]:
    """Memory-efficient streaming parser for large XML files.

    Uses lxml's iterparse to process elements as they're parsed,
    maintaining constant memory usage regardless of file size.

    Parameters
    ----------
    filepath : Path | str
        Path to XML file to parse.
    tag : str | None
        If specified, only yield elements with this tag.
    events : tuple[str, ...]
        Events to listen for ('start', 'end', 'start-ns', 'end-ns').
    encoding : str
        File encoding.
    remove_blank_text : bool
        Remove whitespace-only text nodes.
    huge_tree : bool
        Enable parsing of very large documents (>500MB).

    Yields
    ------
    tuple[str, ElementType]
        Event type and parsed element.

    Examples
    --------
    >>> for event, elem in iterparse_elements("frames.xml", tag="frame"):
    ...     process_frame(elem)
    ...     elem.clear()  # Free memory
    """
    # iterparse doesn't directly accept parser, but we can set parser options
    # through the global parser settings if needed
    if huge_tree:
        # For huge trees, we need to use a different approach
        # Create parser for huge trees (not directly used with iterparse)
        _ = etree.XMLParser(
            encoding=encoding,
            remove_blank_text=remove_blank_text,
            huge_tree=huge_tree,
            recover=False,
        )
        with Path(filepath).open("rb") as f:
            context = etree.iterparse(f, events=events, tag=tag)
            for event, elem in context:
                yield event, elem
    else:
        # Standard iterparse
        context = etree.iterparse(str(filepath), events=events, tag=tag, encoding=encoding)
        for event, elem in context:
            yield event, elem

parse_attributes(element: ElementType, type_map: dict[str, type] | None = None, use_objectify: bool = False) -> dict[str, str | int | float | bool]

Parse and convert XML attributes to Python types.

PARAMETER DESCRIPTION
element

Element with attributes to parse.

TYPE: ElementType

type_map

Mapping of attribute names to Python types.

TYPE: dict[str, type] | None DEFAULT: None

use_objectify

Use lxml.objectify for automatic type detection.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
dict[str, str | int | float | bool]

Parsed attributes with converted types.

Examples:

>>> attrs = parse_attributes(elem, {"id": int, "confidence": float})
>>> print(attrs["id"])  # Returns int, not str
123
Source code in src/glazing/utils/xml_parser.py
def parse_attributes(
    element: ElementType,
    type_map: dict[str, type] | None = None,
    use_objectify: bool = False,
) -> dict[str, str | int | float | bool]:
    """Parse and convert XML attributes to Python types.

    Parameters
    ----------
    element : ElementType
        Element with attributes to parse.
    type_map : dict[str, type] | None
        Mapping of attribute names to Python types.
    use_objectify : bool
        Use lxml.objectify for automatic type detection.

    Returns
    -------
    dict[str, str | int | float | bool]
        Parsed attributes with converted types.

    Examples
    --------
    >>> attrs = parse_attributes(elem, {"id": int, "confidence": float})
    >>> print(attrs["id"])  # Returns int, not str
    123
    """
    if use_objectify:
        # Let objectify handle type conversion
        objectify.deannotate(element, cleanup_namespaces=True)
        result: dict[str, str | int | float | bool] = {}
        for k, v in element.attrib.items():
            result[str(k)] = cast(str | int | float | bool, objectify.fromstring(str(v)))
        return result

    attrs: dict[str, str | int | float | bool] = {}
    type_map = type_map or {}

    for key_raw, value in element.attrib.items():
        key = str(key_raw)  # Ensure key is string
        if key in type_map:
            try:
                if type_map[key] is bool:
                    attrs[key] = value.lower() in ("true", "1", "yes")
                else:
                    attrs[key] = type_map[key](value)
            except (ValueError, TypeError) as e:
                error_msg = (
                    f"Failed to convert attribute '{key}' with value '{value!s}' "
                    f"to type {type_map[key].__name__}: {e}"
                )
                raise ValueError(error_msg) from e
        else:
            attrs[key] = str(value)

    return attrs

parse_with_schema(filepath: Path | str, schema_path: Path | str | None = None, schema_type: str = 'xsd') -> ElementType

Parse XML with DTD or XSD validation.

PARAMETER DESCRIPTION
filepath

Path to XML file.

TYPE: Path | str

schema_path

Path to schema file. If None, use DTD from XML.

TYPE: Path | str | None DEFAULT: None

schema_type

Schema type ('xsd', 'dtd', 'relaxng').

TYPE: str DEFAULT: 'xsd'

RETURNS DESCRIPTION
ElementType

Parsed and validated root element.

RAISES DESCRIPTION
XMLSyntaxError

If XML is invalid.

DocumentInvalid

If document doesn't match schema.

Source code in src/glazing/utils/xml_parser.py
def parse_with_schema(
    filepath: Path | str,
    schema_path: Path | str | None = None,
    schema_type: str = "xsd",
) -> ElementType:
    """Parse XML with DTD or XSD validation.

    Parameters
    ----------
    filepath : Path | str
        Path to XML file.
    schema_path : Path | str | None
        Path to schema file. If None, use DTD from XML.
    schema_type : str
        Schema type ('xsd', 'dtd', 'relaxng').

    Returns
    -------
    ElementType
        Parsed and validated root element.

    Raises
    ------
    etree.XMLSyntaxError
        If XML is invalid.
    etree.DocumentInvalid
        If document doesn't match schema.
    """
    if schema_path and schema_type == "xsd":
        with Path(schema_path).open("rb") as f:
            schema_doc = etree.parse(f)
            schema = etree.XMLSchema(schema_doc)
        parser = etree.XMLParser(schema=schema)
    elif schema_type == "dtd":
        parser = etree.XMLParser(dtd_validation=True, load_dtd=True)
    else:
        parser = etree.XMLParser()

    with Path(filepath).open("rb") as f:
        tree = etree.parse(f, parser)

    return tree.getroot()