glazing.utils.xml_parser¶

XML parsing utilities.

`xml_parser` ¶

High-performance XML parsing utilities using lxml.

This module provides fast, memory-efficient XML parsing utilities built on lxml, which wraps the C libraries libxml2 and libxslt and is typically much faster than pure-Python parsers (on the order of 20x, depending on the workload).

FUNCTION	DESCRIPTION
`iterparse_elements`	Memory-efficient streaming parser for large XML files.
`parse_with_schema`	Parse XML with DTD or XSD validation.
`extract_text_with_markup`	Extract text preserving embedded markup tags.
`compile_xpath`	Pre-compile XPath expressions for repeated use.
`parse_attributes`	Parse and convert XML attributes to Python types.
`clear_element`	Clear element to free memory during parsing.
`fragment_to_annotations`	Convert XML fragments to annotation objects.

CLASS	DESCRIPTION
`MarkupExtractor`	Extract and preserve embedded markup from mixed content.
`StreamingParser`	Event-driven streaming parser for large files.

Notes

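Uses lxml.etree for performance with large linguistic datasets. The streaming helpers (`iterparse_elements`, `StreamingParser`) are built on iterparse, so memory use stays roughly constant regardless of file size as long as processed elements are cleared. `parse_with_schema` is the exception: it loads the whole document into memory.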

Classes¶

`MarkupExtractor(preserve_tags: set[str], nested: bool = False)` ¶

Extract and preserve embedded markup from mixed content.

Optimized for FrameNet's complex annotation structure with multiple levels of embedded markup.

PARAMETER	DESCRIPTION
`preserve_tags`	Tags to preserve as annotations. TYPE: `set[str]`
`nested`	Support nested markup tags. TYPE: `bool` DEFAULT: `False`

ATTRIBUTE	DESCRIPTION
`preserve_tags`	Tags being preserved. TYPE: `set[str]`
`nested`	Whether nested tags are supported. TYPE: `bool`

METHOD	DESCRIPTION
`extract`	Extract text and annotations from element.
`extract_recursive`	Recursively extract nested annotations.

Initialize markup extractor.

PARAMETER	DESCRIPTION
`preserve_tags`	Tags to preserve as annotations. TYPE: `set[str]`
`nested`	Support nested markup tags. TYPE: `bool` DEFAULT: `False`

Source code in src/glazing/utils/xml_parser.py

def __init__(self, preserve_tags: set[str], nested: bool = False) -> None:
    """Initialize markup extractor.

    Parameters
    ----------
    preserve_tags : set[str]
        Tags to preserve as annotations.
    nested : bool
        Support nested markup tags.
    """
    self.preserve_tags = preserve_tags
    self.nested = nested

Methods:¶

`extract(element: ElementType) -> tuple[str, list[dict[str, str | int]]]` ¶

Extract text and annotations from element.

PARAMETER	DESCRIPTION
`element`	Element to extract from. TYPE: `ElementType`

RETURNS	DESCRIPTION
`tuple[str, list[dict[str, str \| int]]]`	Plain text and annotation list.

Source code in src/glazing/utils/xml_parser.py

def extract(self, element: ElementType) -> tuple[str, list[dict[str, str | int]]]:
    """Extract text and annotations from element.

    Parameters
    ----------
    element : ElementType
        Element to extract from.

    Returns
    -------
    tuple[str, list[dict[str, str | int]]]
        Plain text and annotation list.
    """
    if self.nested:
        return self._extract_recursive(element, depth=0)
    return extract_text_with_markup(element, self.preserve_tags)

`StreamingParser(filepath: Path | str, target_tags: set[str] | None = None, max_depth: int = 10)` ¶

Event-driven streaming parser for large files.

Processes XML files of any size with constant memory usage.

PARAMETER	DESCRIPTION
`filepath`	Path to XML file. TYPE: `Path \| str`
`target_tags`	Tags to process. If None, process all. TYPE: `set[str] \| None` DEFAULT: `None`
`max_depth`	Maximum parsing depth. TYPE: `int` DEFAULT: `10`

ATTRIBUTE	DESCRIPTION
`filepath`	Path to XML file. TYPE: `Path`
`target_tags`	Tags being processed. TYPE: `set[str] \| None`
`max_depth`	Maximum depth to parse. TYPE: `int`

METHOD	DESCRIPTION
`parse`	Parse file with custom handler.
`iter_elements`	Iterate over elements with tag.
`count_elements`	Count elements without loading.

Initialize streaming parser.

PARAMETER	DESCRIPTION
`filepath`	Path to XML file. TYPE: `Path \| str`
`target_tags`	Tags to process. TYPE: `set[str] \| None` DEFAULT: `None`
`max_depth`	Maximum parsing depth. TYPE: `int` DEFAULT: `10`

Source code in src/glazing/utils/xml_parser.py

def __init__(
    self,
    filepath: Path | str,
    target_tags: set[str] | None = None,
    max_depth: int = 10,
) -> None:
    """Initialize streaming parser.

    Parameters
    ----------
    filepath : Path | str
        Path to XML file.
    target_tags : set[str] | None
        Tags to process.
    max_depth : int
        Maximum parsing depth.
    """
    self.filepath = Path(filepath)
    self.target_tags = target_tags
    self.max_depth = max_depth

Methods:¶

`count_elements(tag: str) -> int` ¶

Count elements without loading.

PARAMETER	DESCRIPTION
`tag`	Tag to count. TYPE: `str`

RETURNS	DESCRIPTION
`int`	Number of elements with tag.

Source code in src/glazing/utils/xml_parser.py

def count_elements(self, tag: str) -> int:
    """Count elements without loading.

    Parameters
    ----------
    tag : str
        Tag to count.

    Returns
    -------
    int
        Number of elements with tag.
    """
    count = 0
    for _ in self.iter_elements(tag):
        count += 1
    return count

`iter_elements(tag: str) -> Iterator[ElementType]` ¶

Iterate over elements with tag.

PARAMETER	DESCRIPTION
`tag`	Tag to filter by. TYPE: `str`

YIELDS	DESCRIPTION
`ElementType`	Elements with specified tag.

Source code in src/glazing/utils/xml_parser.py

def iter_elements(self, tag: str) -> Iterator[ElementType]:
    """Iterate over elements with tag.

    Parameters
    ----------
    tag : str
        Tag to filter by.

    Yields
    ------
    ElementType
        Elements with specified tag.
    """
    for _event, elem in iterparse_elements(self.filepath, tag=tag):
        yield elem
        clear_element(elem)
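
The yield-then-clear pattern above has one consequence worth seeing in isolation: an element is only valid until the consumer asks for the next one. The following is a minimal sketch of that behaviour, assuming the standard library's `xml.etree.ElementTree` as a stand-in for lxml (it has no `tag=` filter, so the tag is checked by hand). The helper name `iter_tag` and the sample XML are hypothetical.

```python
import io
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch

XML = b"<frames><frame id='1'/><frame id='2'/><note/><frame id='3'/></frames>"


def iter_tag(source, tag):
    """Yield each completed element with the given tag, then clear it."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            # Runs only when the consumer requests the next element.
            elem.clear()


# Read what you need inside the loop, while the element is still populated.
ids = [elem.get("id") for elem in iter_tag(io.BytesIO(XML), "frame")]

# Counting never holds more than one matching element's data at a time.
count = sum(1 for _ in iter_tag(io.BytesIO(XML), "frame"))

# Collecting the elements themselves keeps only emptied shells.
kept = list(iter_tag(io.BytesIO(XML), "frame"))
print(ids, count, kept[0].get("id"))
```

Extracting the needed values inside the loop, rather than storing the elements, is what keeps this pattern safe.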

`parse(handler: Callable[[ElementType], None]) -> None` ¶

Parse file with custom handler.

PARAMETER	DESCRIPTION
`handler`	Function called for each target element. TYPE: `Callable[[ElementType], None]`

Source code in src/glazing/utils/xml_parser.py

def parse(self, handler: Callable[[ElementType], None]) -> None:
    """Parse file with custom handler.

    Parameters
    ----------
    handler : callable
        Function called for each target element.
    """
    for event, elem in iterparse_elements(self.filepath, events=("start", "end")):
        if event == "end" and (self.target_tags is None or elem.tag in self.target_tags):
            handler(elem)
            clear_element(elem)
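
As a rough illustration of the handler-callback style used by `parse`, here is a minimal sketch. It assumes the standard library's `xml.etree.ElementTree` in place of lxml, and the function `parse_stream` and the sample XML are hypothetical names introduced only for this example.

```python
import io
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch

XML = b"<frames><frame id='1'><FE name='Agent'/></frame><frame id='2'/></frames>"


def parse_stream(source, handler, target_tags=None):
    """Call handler on each completed element whose tag is in target_tags."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if target_tags is None or elem.tag in target_tags:
            handler(elem)
            # Drop children, attributes and text once the handler is done.
            elem.clear()


seen = []
parse_stream(io.BytesIO(XML), lambda e: seen.append((e.tag, e.get("id"))), {"frame"})
print(seen)
```

Only "end" events are needed here, because an element's children and text are guaranteed to be complete only once its closing tag has been read.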

Functions:¶

`clear_element(element: ElementType, keep_tail: bool = False) -> None` ¶

Clear element to free memory during parsing.

Removes element content while preserving structure for continued iteration. Critical for processing large files with constant memory usage.

PARAMETER	DESCRIPTION
`element`	Element to clear. TYPE: `ElementType`
`keep_tail`	Preserve tail text (text after element). TYPE: `bool` DEFAULT: `False`

Examples:

>>> for event, elem in iterparse_elements("huge.xml"):
...     process(elem)
...     clear_element(elem)  # Free memory immediately

Source code in src/glazing/utils/xml_parser.py

def clear_element(element: ElementType, keep_tail: bool = False) -> None:
    """Clear element to free memory during parsing.

    Removes element content while preserving structure for continued iteration.
    Critical for processing large files with constant memory usage.

    Parameters
    ----------
    element : ElementType
        Element to clear.
    keep_tail : bool
        Preserve tail text (text after element).

    Examples
    --------
    >>> for event, elem in iterparse_elements("huge.xml"):
    ...     process(elem)
    ...     clear_element(elem)  # Free memory immediately
    """
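    # Save the tail first, because element.clear() resets it
    tail = element.tail if keep_tail else None

    # Detach from the parent so the parsed tree does not keep growing
    parent = element.getparent()
    if parent is not None:
        parent.remove(element)

    # Clear content (children, attributes, text and tail)
    element.clear()

    # Restore the tail if requested
    if tail:
        element.tail = tail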
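
A detail that matters for `keep_tail`: in both lxml and the standard library, `Element.clear()` resets the tail text along with the rest of the element, so a tail that should survive has to be captured before clearing. The sketch below shows this with the standard library's `xml.etree.ElementTree`, used here only as an assumption-free stand-in. The sample XML is made up for the example.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<s><fex>person</fex> abandoned</s>")
child = root[0]

tail = child.tail          # text that follows </fex>
child.clear()              # resets children, attributes, text and tail
after_clear = child.tail   # the tail is gone at this point

child.tail = tail          # put the saved tail back
print(repr(tail), repr(after_clear), repr(child.tail))
```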

`compile_xpath(expression: str, namespaces: dict[str, str] | None = None) -> XPathType` ¶

Pre-compile XPath expressions for repeated use.

Compiled XPath expressions are 3-5x faster for repeated queries.

PARAMETER	DESCRIPTION
`expression`	XPath expression to compile. TYPE: `str`
`namespaces`	Namespace prefixes to URIs. TYPE: `dict[str, str] \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`XPathType`	Compiled XPath expression.

Examples:

>>> xpath = compile_xpath("//frame[@id=$frame_id]/FE")
>>> fes = xpath(root, frame_id="123")

Source code in src/glazing/utils/xml_parser.py

def compile_xpath(expression: str, namespaces: dict[str, str] | None = None) -> XPathType:
    """Pre-compile XPath expressions for repeated use.

    Compiled XPath expressions are 3-5x faster for repeated queries.

    Parameters
    ----------
    expression : str
        XPath expression to compile.
    namespaces : dict[str, str] | None
        Namespace prefixes to URIs.

    Returns
    -------
    XPathType
        Compiled XPath expression.

    Examples
    --------
    >>> xpath = compile_xpath("//frame[@id=$frame_id]/FE")
    >>> fes = xpath(root, frame_id="123")
    """
    return etree.XPath(expression, namespaces=namespaces)

`extract_text_with_markup(element: ElementType, preserve_tags: set[str] | None = None) -> tuple[str, list[dict[str, str | int]]]` ¶

Extract text preserving embedded markup tags.

Handles mixed content like FrameNet's embedded FE references: `<text>The <fex name="Agent">person</fex> abandoned the <fex name="Theme">car</fex>.</text>`

PARAMETER	DESCRIPTION
`element`	Element containing mixed content. TYPE: `ElementType`
`preserve_tags`	Tags to preserve as annotations. If None, preserve all. TYPE: `set[str] \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`tuple[str, list[dict[str, str \| int]]]`	Plain text and list of annotation dictionaries with positions.

Examples:

>>> text, annos = extract_text_with_markup(elem, {"fex", "fen", "t", "ex"})
>>> print(text)
The person abandoned the car.
>>> print(annos[0])
{'tag': 'fex', 'start': 4, 'end': 10, 'text': 'person', 'name': 'Agent'}

Source code in src/glazing/utils/xml_parser.py

def extract_text_with_markup(
    element: ElementType,
    preserve_tags: set[str] | None = None,
) -> tuple[str, list[dict[str, str | int]]]:
    """Extract text preserving embedded markup tags.

    Handles mixed content like FrameNet's embedded FE references:
    <text>The <fex name="Agent">person</fex> abandoned the <fex name="Theme">car</fex>.</text>

    Parameters
    ----------
    element : ElementType
        Element containing mixed content.
    preserve_tags : set[str] | None
        Tags to preserve as annotations. If None, preserve all.

    Returns
    -------
    tuple[str, list[dict[str, str | int]]]
        Plain text and list of annotation dictionaries with positions.

    Examples
    --------
    >>> text, annos = extract_text_with_markup(elem, {"fex", "fen", "t", "ex"})
    >>> print(text)
    The person abandoned the car.
    >>> print(annos[0])
    {'tag': 'fex', 'start': 4, 'end': 10, 'text': 'person', 'name': 'Agent'}
    """
    plain_text = []
    annotations = []
    position = 0

    # Get initial text before any child elements
    if element.text:
        plain_text.append(element.text)
        position += len(element.text)

    for child in element:
        if preserve_tags is None or child.tag in preserve_tags:
            # Record annotation
            start = position
            child_text = child.text or ""
            plain_text.append(child_text)
            end = position + len(child_text)

            annotation: dict[str, str | int] = {
                "tag": child.tag,
                "start": start,
                "end": end,
                "text": child_text,
            }

            # Add attributes
            for key, value in child.attrib.items():
                annotation[str(key)] = str(value)

            annotations.append(annotation)
            position = end

        # Get tail text after child element
        if child.tail:
            plain_text.append(child.tail)
            position += len(child.tail)

    return "".join(plain_text), annotations
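
The offsets in the annotations follow from how mixed content is stored: text before the first child lives in the parent's `text`, and text after each child lives in that child's `tail`. The following sketch walks the FrameNet-style sentence from the docstring and accumulates positions the same way. It assumes the standard library's `xml.etree.ElementTree`, which shares this `text`/`tail` model with lxml.

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring(
    '<text>The <fex name="Agent">person</fex> abandoned the '
    '<fex name="Theme">car</fex>.</text>'
)

# Start after the leading text ("The ").
position = len(elem.text or "")
spans = []
for child in elem:
    inner = child.text or ""
    # Each span covers the child's own text in the flattened string.
    spans.append((child.get("name"), position, position + len(inner)))
    # Advance past the child's text and whatever follows it.
    position += len(inner) + len(child.tail or "")

print(spans, position)
```

The final `position` equals the length of the flattened sentence, which is a quick sanity check that no text was skipped.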

`fragment_to_annotations(text: str, tag_pattern: str = '<(\\w+)([^>]*)>(.*?)</\\1>') -> tuple[str, list[dict[str, str | int]]]` ¶

Convert XML fragments to annotation objects.

Alternative to full XML parsing for simple embedded markup.

PARAMETER	DESCRIPTION
`text`	Text with XML fragments. TYPE: `str`
`tag_pattern`	Regex pattern for matching tags. TYPE: `str` DEFAULT: `'<(\\w+)([^>]*)>(.*?)</\\1>'`

RETURNS	DESCRIPTION
`tuple[str, list[dict[str, str \| int]]]`	Plain text and annotations.

Source code in src/glazing/utils/xml_parser.py

def fragment_to_annotations(
    text: str,
    tag_pattern: str = r"<(\w+)([^>]*)>(.*?)</\1>",
) -> tuple[str, list[dict[str, str | int]]]:
    """Convert XML fragments to annotation objects.

    Alternative to full XML parsing for simple embedded markup.

    Parameters
    ----------
    text : str
        Text with XML fragments.
    tag_pattern : str
        Regex pattern for matching tags.

    Returns
    -------
    tuple[str, list[dict[str, str | int]]]
        Plain text and annotations.
    """
    annotations = []
    plain_parts = []
    last_end = 0

    for match in re.finditer(tag_pattern, text):
        # Add text before match
        plain_parts.append(text[last_end : match.start()])

        # Extract match info
        tag = match.group(1)
        attrs_str = match.group(2)
        content = match.group(3)

        # Parse attributes
        attrs = {}
        for attr_match in re.finditer(r'(\w+)="([^"]*)"', attrs_str):
            attrs[attr_match.group(1)] = attr_match.group(2)

        # Create annotation
        start = len("".join(plain_parts))
        plain_parts.append(content)
        end = len("".join(plain_parts))

        annotation = {
            "tag": tag,
            "start": start,
            "end": end,
            "text": content,
            **attrs,
        }
        annotations.append(annotation)

        last_end = match.end()

    # Add remaining text
    plain_parts.append(text[last_end:])

    return "".join(plain_parts), annotations
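
To make the regex approach concrete, here is a self-contained sketch that follows the same steps using only `re`. The function name `annotate_fragment` is hypothetical and stands in for `fragment_to_annotations`. It tracks a running offset instead of re-joining the parts on every match, which is a small variation on the listing rather than the library's own code.

```python
import re

TAG_PATTERN = r"<(\w+)([^>]*)>(.*?)</\1>"  # same default pattern as the listing


def annotate_fragment(text, tag_pattern=TAG_PATTERN):
    """Hypothetical stand-in: strip simple tags and record their spans."""
    annotations = []
    plain_parts = []
    last_end = 0
    offset = 0  # length of the plain text built so far
    for match in re.finditer(tag_pattern, text):
        before = text[last_end:match.start()]
        plain_parts.append(before)
        offset += len(before)

        tag, attrs_str, content = match.groups()
        attrs = dict(re.findall(r'(\w+)="([^"]*)"', attrs_str))
        annotations.append(
            {"tag": tag, "start": offset, "end": offset + len(content),
             "text": content, **attrs}
        )
        plain_parts.append(content)
        offset += len(content)
        last_end = match.end()

    plain_parts.append(text[last_end:])
    return "".join(plain_parts), annotations


plain, annos = annotate_fragment(
    'The <fex name="Agent">person</fex> abandoned the <fex name="Theme">car</fex>.'
)
print(plain)
print(annos[0])
```

Because the pattern is non-greedy and relies on a backreference to the opening tag name, it handles flat markup like this but not a tag nested inside another tag of the same name.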

`iterparse_elements(filepath: Path | str, tag: str | None = None, events: tuple[str, ...] = ('end',), encoding: str = 'utf-8', remove_blank_text: bool = True, huge_tree: bool = False) -> Generator[tuple[str, ElementType], None, None]` ¶

Memory-efficient streaming parser for large XML files.

Uses lxml's iterparse to process elements as they're parsed, maintaining constant memory usage regardless of file size.

PARAMETER	DESCRIPTION
`filepath`	Path to XML file to parse. TYPE: `Path \| str`
`tag`	If specified, only yield elements with this tag. TYPE: `str \| None` DEFAULT: `None`
`events`	Events to listen for ('start', 'end', 'start-ns', 'end-ns'). TYPE: `tuple[str, ...]` DEFAULT: `('end',)`
`encoding`	File encoding. TYPE: `str` DEFAULT: `'utf-8'`
`remove_blank_text`	Remove whitespace-only text nodes. TYPE: `bool` DEFAULT: `True`
`huge_tree`	Enable parsing of very large documents (>500MB) by lifting libxml2's default security limits on tree depth and text size. TYPE: `bool` DEFAULT: `False`

YIELDS	DESCRIPTION
`tuple[str, ElementType]`	Event type and parsed element.

Examples:

>>> for event, elem in iterparse_elements("frames.xml", tag="frame"):
...     process_frame(elem)
...     elem.clear()  # Free memory

Source code in src/glazing/utils/xml_parser.py

def iterparse_elements(  # noqa: PLR0913
    filepath: Path | str,
    tag: str | None = None,
    events: tuple[str, ...] = ("end",),
    encoding: str = "utf-8",
    remove_blank_text: bool = True,
    huge_tree: bool = False,
) -> Generator[tuple[str, ElementType], None, None]:
    """Memory-efficient streaming parser for large XML files.

    Uses lxml's iterparse to process elements as they're parsed,
    maintaining constant memory usage regardless of file size.

    Parameters
    ----------
    filepath : Path | str
        Path to XML file to parse.
    tag : str | None
        If specified, only yield elements with this tag.
    events : tuple[str, ...]
        Events to listen for ('start', 'end', 'start-ns', 'end-ns').
    encoding : str
        File encoding.
    remove_blank_text : bool
        Remove whitespace-only text nodes.
    huge_tree : bool
        Enable parsing of very large documents (>500MB).

    Yields
    ------
    tuple[str, ElementType]
        Event type and parsed element.

    Examples
    --------
    >>> for event, elem in iterparse_elements("frames.xml", tag="frame"):
    ...     process_frame(elem)
    ...     elem.clear()  # Free memory
    """
    # lxml's iterparse accepts parser options directly as keyword arguments,
    # so no separate XMLParser is needed.
    context = etree.iterparse(
        str(filepath),
        events=events,
        tag=tag,
        encoding=encoding,
        remove_blank_text=remove_blank_text,
        huge_tree=huge_tree,
    )
    yield from context

`parse_attributes(element: ElementType, type_map: dict[str, type] | None = None, use_objectify: bool = False) -> dict[str, str | int | float | bool]` ¶

Parse and convert XML attributes to Python types.

PARAMETER	DESCRIPTION
`element`	Element with attributes to parse. TYPE: `ElementType`
`type_map`	Mapping of attribute names to Python types. TYPE: `dict[str, type] \| None` DEFAULT: `None`
`use_objectify`	Use lxml.objectify for automatic type detection. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`dict[str, str \| int \| float \| bool]`	Parsed attributes with converted types.

Examples:

>>> attrs = parse_attributes(elem, {"id": int, "confidence": float})
>>> print(attrs["id"])  # Returns int, not str
123

Source code in src/glazing/utils/xml_parser.py

def parse_attributes(
    element: ElementType,
    type_map: dict[str, type] | None = None,
    use_objectify: bool = False,
) -> dict[str, str | int | float | bool]:
    """Parse and convert XML attributes to Python types.

    Parameters
    ----------
    element : ElementType
        Element with attributes to parse.
    type_map : dict[str, type] | None
        Mapping of attribute names to Python types.
    use_objectify : bool
        Use lxml.objectify for automatic type detection.

    Returns
    -------
    dict[str, str | int | float | bool]
        Parsed attributes with converted types.

    Examples
    --------
    >>> attrs = parse_attributes(elem, {"id": int, "confidence": float})
    >>> print(attrs["id"])  # Returns int, not str
    123
    """
    if use_objectify:
        # Let objectify handle type conversion
        objectify.deannotate(element, cleanup_namespaces=True)
        result: dict[str, str | int | float | bool] = {}
        for k, v in element.attrib.items():
            result[str(k)] = cast(str | int | float | bool, objectify.fromstring(str(v)))
        return result

    attrs: dict[str, str | int | float | bool] = {}
    type_map = type_map or {}

    for key_raw, value in element.attrib.items():
        key = str(key_raw)  # Ensure key is string
        if key in type_map:
            try:
                if type_map[key] is bool:
                    attrs[key] = value.lower() in ("true", "1", "yes")
                else:
                    attrs[key] = type_map[key](value)
            except (ValueError, TypeError) as e:
                error_msg = (
                    f"Failed to convert attribute '{key}' with value '{value!s}' "
                    f"to type {type_map[key].__name__}: {e}"
                )
                raise ValueError(error_msg) from e
        else:
            attrs[key] = str(value)

    return attrs
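
The `type_map` branch can be sketched without lxml by treating the attributes as a plain dictionary of strings. The function name `convert_attributes` and the sample values below are hypothetical. The sketch is only meant to show why `bool` needs special handling and how unmapped attributes stay strings.

```python
def convert_attributes(attrib, type_map=None):
    """Hypothetical stand-in for parse_attributes, working on a plain dict."""
    type_map = type_map or {}
    converted = {}
    for key, value in attrib.items():
        target = type_map.get(key, str)  # unmapped attributes stay strings
        if target is bool:
            # bool("false") is True, so truthy spellings are matched explicitly.
            converted[key] = value.lower() in ("true", "1", "yes")
            continue
        try:
            converted[key] = target(value)
        except (ValueError, TypeError) as exc:
            raise ValueError(
                f"Failed to convert attribute {key!r} with value {value!r} "
                f"to type {target.__name__}: {exc}"
            ) from exc
    return converted


raw = {"id": "123", "confidence": "0.9", "core": "false", "name": "Agent"}
attrs = convert_attributes(raw, {"id": int, "confidence": float, "core": bool})
print(attrs)
```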

`parse_with_schema(filepath: Path | str, schema_path: Path | str | None = None, schema_type: str = 'xsd') -> ElementType` ¶

Parse XML with DTD or XSD validation.

PARAMETER	DESCRIPTION
`filepath`	Path to XML file. TYPE: `Path \| str`
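`schema_path`	Path to the XSD schema file; used only when `schema_type` is 'xsd'. If None, no XSD validation is performed. TYPE: `Path \| str \| None` DEFAULT: `None`
`schema_type`	Schema type. 'xsd' validates against `schema_path`; 'dtd' validates against the DTD declared in the XML document. Any other value (including 'relaxng') parses without validation in the implementation shown below. TYPE: `str` DEFAULT: `'xsd'`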

RETURNS	DESCRIPTION
`ElementType`	Parsed and validated root element.

RAISES	DESCRIPTION
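`XMLSyntaxError`	If XML is malformed. With a validating parser, as used here, lxml also reports validation failures as `XMLSyntaxError`.
`DocumentInvalid`	If document doesn't match schema. lxml raises this from explicit `assertValid()` checks, which the implementation shown below does not call.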

Source code in src/glazing/utils/xml_parser.py

def parse_with_schema(
    filepath: Path | str,
    schema_path: Path | str | None = None,
    schema_type: str = "xsd",
) -> ElementType:
    """Parse XML with DTD or XSD validation.

    Parameters
    ----------
    filepath : Path | str
        Path to XML file.
    schema_path : Path | str | None
        Path to schema file. If None, use DTD from XML.
    schema_type : str
        Schema type ('xsd', 'dtd', 'relaxng').

    Returns
    -------
    ElementType
        Parsed and validated root element.

    Raises
    ------
    etree.XMLSyntaxError
        If XML is invalid.
    etree.DocumentInvalid
        If document doesn't match schema.
    """
    if schema_path and schema_type == "xsd":
        with Path(schema_path).open("rb") as f:
            schema_doc = etree.parse(f)
            schema = etree.XMLSchema(schema_doc)
        parser = etree.XMLParser(schema=schema)
    elif schema_type == "dtd":
        parser = etree.XMLParser(dtd_validation=True, load_dtd=True)
    else:
        parser = etree.XMLParser()

    with Path(filepath).open("rb") as f:
        tree = etree.parse(f, parser)

    return tree.getroot()
