glazing.utils.xml_parser
XML parsing utilities.
High-performance XML parsing utilities using lxml.
This module provides fast, memory-efficient XML parsing utilities built on lxml, which wraps the C libraries libxml2 and libxslt and is reported to be about 20x faster than pure-Python parsers.
| FUNCTION | DESCRIPTION |
|---|---|
| iterparse_elements | Memory-efficient streaming parser for large XML files. |
| parse_with_schema | Parse XML with DTD or XSD validation. |
| extract_text_with_markup | Extract text preserving embedded markup tags. |
| compile_xpath | Pre-compile XPath expressions for repeated use. |
| parse_attributes | Parse and convert XML attributes to Python types. |
| clear_element | Clear element to free memory during parsing. |
| fragment_to_annotations | Convert XML fragments to annotation objects. |

| CLASS | DESCRIPTION |
|---|---|
| MarkupExtractor | Extract and preserve embedded markup from mixed content. |
| StreamingParser | Event-driven streaming parser for large files. |

Notes
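Uses lxml.etree for maximum performance with large linguistic datasets. The streaming parsers (iterparse_elements and StreamingParser) are built on iterparse, so memory use stays roughly constant regardless of file size as long as processed elements are cleared.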
Classes
MarkupExtractor(preserve_tags: set[str], nested: bool = False)
Extract and preserve embedded markup from mixed content.
Optimized for FrameNet's complex annotation structure with multiple levels of embedded markup.

| PARAMETER | DESCRIPTION |
|---|---|
| preserve_tags | Tags to preserve as annotations. TYPE: set[str] |
| nested | Support nested markup tags. TYPE: bool DEFAULT: False |

| ATTRIBUTE | DESCRIPTION |
|---|---|
| preserve_tags | Tags being preserved. |
| nested | Whether nested tags are supported. |

| METHOD | DESCRIPTION |
|---|---|
| extract | Extract text and annotations from element. |
| extract_recursive | Recursively extract nested annotations. |

Source code in src/glazing/utils/xml_parser.py
Functions
extract(element: ElementType) -> tuple[str, list[dict[str, str | int]]]
Extract text and annotations from element.

| PARAMETER | DESCRIPTION |
|---|---|
| element | Element to extract from. TYPE: ElementType |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[str, list[dict[str, str \| int]]] | Plain text and annotation list. |

StreamingParser(filepath: Path | str, target_tags: set[str] | None = None, max_depth: int = 10)
Event-driven streaming parser for large files.
Processes XML files of any size with constant memory usage.

| PARAMETER | DESCRIPTION |
|---|---|
| filepath | Path to XML file. TYPE: Path \| str |
| target_tags | Tags to process. If None, process all. TYPE: set[str] \| None DEFAULT: None |
| max_depth | Maximum parsing depth. TYPE: int DEFAULT: 10 |

| ATTRIBUTE | DESCRIPTION |
|---|---|
| filepath | Path to XML file. |
| target_tags | Tags being processed. |
| max_depth | Maximum depth to parse. |

| METHOD | DESCRIPTION |
|---|---|
| parse | Parse file with custom handler. |
| iter_elements | Iterate over elements with tag. |
| count_elements | Count elements without loading. |

Functions
count_elements(tag: str) -> int
Count elements without loading.

| PARAMETER | DESCRIPTION |
|---|---|
| tag | Tag to count. TYPE: str |

| RETURNS | DESCRIPTION |
|---|---|
| int | Number of elements with tag. |

iter_elements(tag: str) -> Iterator[ElementType]
Iterate over elements with tag.

| PARAMETER | DESCRIPTION |
|---|---|
| tag | Tag to filter by. TYPE: str |

| YIELDS | DESCRIPTION |
|---|---|
| ElementType | Elements with specified tag. |

parse(handler: Callable[[ElementType], None]) -> None
Parse file with custom handler.

| PARAMETER | DESCRIPTION |
|---|---|
| handler | Function called for each target element. TYPE: Callable[[ElementType], None] |

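The handler-driven pattern can be illustrated with a minimal sketch. The function `parse_stream` below is hypothetical and is not the `StreamingParser.parse` implementation; in particular, it assumes that `max_depth` means "ignore target elements nested deeper than this", with the root at depth 1. It falls back to the standard library parser when lxml is not installed, since both expose the same `iterparse` events.

```python
import io

try:
    from lxml import etree  # library this module is built on
except ImportError:
    import xml.etree.ElementTree as etree  # API-compatible fallback


def parse_stream(source, handler, target_tags=None, max_depth=10):
    """Call `handler` for each completed target element within `max_depth`."""
    depth = 0
    for event, elem in etree.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1  # entering an element; the root has depth 1
            continue
        # "end" event: the element and its whole subtree have been parsed.
        wanted = target_tags is None or elem.tag in target_tags
        if wanted and depth <= max_depth:
            handler(elem)
            elem.clear()  # free the handled subtree
        depth -= 1


# Hand-written sample data, not taken from a real dataset.
data = b"<frames><frame ID='1'><frame ID='2'/></frame><lu ID='3'/></frames>"
seen = []
parse_stream(io.BytesIO(data), lambda el: seen.append(el.get("ID")),
             target_tags={"frame"}, max_depth=2)
print(seen)
```

Only target elements are cleared, and only after the handler returns, so a handler always sees its element's complete subtree.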
Functions
clear_element(element: ElementType, keep_tail: bool = False) -> None
Clear element to free memory during parsing.
Removes element content while preserving structure for continued iteration. Critical for processing large files with constant memory usage.

| PARAMETER | DESCRIPTION |
|---|---|
| element | Element to clear. TYPE: ElementType |
| keep_tail | Preserve tail text (text after element). TYPE: bool DEFAULT: False |

Examples:
>>> for event, elem in iterparse_elements("huge.xml"):
...     process(elem)
...     clear_element(elem)  # Free memory immediately
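Why `keep_tail` matters is easiest to see on mixed content, where the text after an element belongs to the parent's running text. The sketch below is an illustration only: `clear_keeping_tail` is a hypothetical helper, not the body of `clear_element`, and it relies solely on the `text`/`tail`/`clear()` behavior shared by lxml and the standard library.

```python
try:
    from lxml import etree  # library this module is built on
except ImportError:
    import xml.etree.ElementTree as etree  # API-compatible fallback


def clear_keeping_tail(element, keep_tail: bool = False) -> None:
    """Empty `element`, optionally restoring the text that follows it."""
    tail = element.tail
    element.clear()  # drops children, attributes, text and tail
    if keep_tail:
        element.tail = tail  # put the trailing text back


root = etree.fromstring("<s>The <t>car</t> stopped <t>here</t> today.</s>")
first, second = root[0], root[1]

clear_keeping_tail(first, keep_tail=True)   # " stopped " survives
clear_keeping_tail(second)                  # " today." is discarded
```

Recent lxml releases also accept `element.clear(keep_tail=True)` directly; the manual save-and-restore above is used so the sketch runs on either parser.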
compile_xpath(expression: str, namespaces: dict[str, str] | None = None) -> XPathType
Pre-compile XPath expressions for repeated use.
Compiling an expression once avoids re-parsing it on every call, which is reported to make repeated queries 3-5x faster.

| PARAMETER | DESCRIPTION |
|---|---|
| expression | XPath expression to compile. TYPE: str |
| namespaces | Mapping of namespace prefixes to URIs. TYPE: dict[str, str] \| None DEFAULT: None |

| RETURNS | DESCRIPTION |
|---|---|
| XPathType | Compiled XPath expression. |

Examples:
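A minimal sketch of the underlying technique, written directly against lxml's `etree.XPath` class. It assumes `compile_xpath` wraps something similar, which the page does not show; the namespace URI and the sample document are placeholders invented for illustration.

```python
from lxml import etree

# Compile once; the resulting object is callable on any element or tree.
find_core_fes = etree.XPath(
    "//fn:FE[@coreType='Core']/@name",
    namespaces={"fn": "http://example.org/fn"},  # placeholder namespace URI
)

# Hand-written sample document using the placeholder namespace.
doc = etree.fromstring(
    b'<frame xmlns="http://example.org/fn">'
    b'<FE name="Agent" coreType="Core"/>'
    b'<FE name="Place" coreType="Peripheral"/>'
    b'<FE name="Theme" coreType="Core"/>'
    b"</frame>"
)

# Attribute results come back as string subclasses; normalise to plain str.
core_names = [str(name) for name in find_core_fes(doc)]
print(core_names)
```

The compiled object can be reused across many elements or documents, which is where the saving over repeated `element.xpath(...)` calls comes from.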
extract_text_with_markup(element: ElementType, preserve_tags: set[str] | None = None) -> tuple[str, list[dict[str, str | int]]]
Extract text preserving embedded markup tags.
Handles mixed content such as FrameNet's embedded frame element (FE) references, where annotation tags appear inline within running text.

| PARAMETER | DESCRIPTION |
|---|---|
| element | Element containing mixed content. TYPE: ElementType |
| preserve_tags | Tags to preserve as annotations. If None, preserve all. TYPE: set[str] \| None DEFAULT: None |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[str, list[dict[str, str \| int]]] | Plain text and list of annotation dictionaries with positions. |

Examples:
>>> text, annos = extract_text_with_markup(elem, {"fex", "fen", "t", "ex"})
>>> print(text)
The person abandoned the car.
>>> print(annos[0])
{'tag': 'fex', 'name': 'Agent', 'start': 4, 'end': 10, 'text': 'person'}
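How character offsets like `start` and `end` can be derived from mixed content is sketched below. `extract_mixed` is a hypothetical stand-in, not the library's implementation: it walks each element's `text` and each child's `tail` in document order while tracking the length of the plain text emitted so far. The sample XML is hand-written, and the assumption that `name` comes from a `name` attribute is inferred from the example output above.

```python
try:
    from lxml import etree  # library this module is built on
except ImportError:
    import xml.etree.ElementTree as etree  # API-compatible fallback


def extract_mixed(element, preserve_tags=None):
    """Return (plain_text, annotations) for an element with mixed content."""
    parts = []
    annotations = []
    offset = 0  # length of the plain text emitted so far

    def emit(text):
        nonlocal offset
        if text:
            parts.append(text)
            offset += len(text)

    def walk(node):
        emit(node.text)
        for child in node:
            keep = preserve_tags is None or child.tag in preserve_tags
            if keep:
                # Register before recursing so annotations stay in document order.
                anno = {"tag": child.tag, "name": child.get("name", ""), "start": offset}
                annotations.append(anno)
            walk(child)  # nested tags are handled by recursion
            if keep:
                anno["end"] = offset
                anno["text"] = "".join(parts)[anno["start"]:offset]
            emit(child.tail)  # text after the child belongs to the parent

    walk(element)
    return "".join(parts), annotations


elem = etree.fromstring(
    '<definition>The <fex name="Agent">person</fex> abandoned '
    'the <fex name="Theme">car</fex>.</definition>'
)
text, annos = extract_mixed(elem, {"fex"})
```

Offsets use Python slice semantics, so `text[a["start"]:a["end"]]` recovers each annotated span.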
fragment_to_annotations(text: str, tag_pattern: str = '<(\\w+)([^>]*)>(.*?)</\\1>') -> tuple[str, list[dict[str, str | int]]]
Convert XML fragments to annotation objects.
Alternative to full XML parsing for simple embedded markup.

| PARAMETER | DESCRIPTION |
|---|---|
| text | Text with XML fragments. TYPE: str |
| tag_pattern | Regex pattern for matching tags. TYPE: str DEFAULT: '<(\\w+)([^>]*)>(.*?)</\\1>' |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[str, list[dict[str, str \| int]]] | Plain text and annotations. |

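The regex-based approach can be sketched with the standard library alone. `fragments_to_spans` is a hypothetical helper that reuses the default `tag_pattern` from the signature above; the keys of the annotation dictionaries are an assumption, and the sketch ignores tag attributes and does not handle nested tags.

```python
import re

TAG_PATTERN = r"<(\w+)([^>]*)>(.*?)</\1>"  # same regex as the documented default


def fragments_to_spans(text: str, tag_pattern: str = TAG_PATTERN):
    """Strip simple inline tags and record where their content lands."""
    plain_parts = []
    annotations = []
    cursor = 0  # position in the tagged input
    offset = 0  # position in the plain output
    for match in re.finditer(tag_pattern, text):
        # Copy the untagged text between the previous match and this one.
        before = text[cursor:match.start()]
        plain_parts.append(before)
        offset += len(before)

        tag, content = match.group(1), match.group(3)
        annotations.append(
            {"tag": tag, "start": offset, "end": offset + len(content), "text": content}
        )
        plain_parts.append(content)
        offset += len(content)
        cursor = match.end()
    plain_parts.append(text[cursor:])  # trailing untagged text
    return "".join(plain_parts), annotations


plain, spans = fragments_to_spans("The <fex name='Agent'>person</fex> left the <t>car</t>.")
```

Because the pattern is non-greedy and relies on a backreference to the tag name, it suits flat inline markup; content that itself contains tags is better handled by a real parser.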
iterparse_elements(filepath: Path | str, tag: str | None = None, events: tuple[str, ...] = ('end',), encoding: str = 'utf-8', remove_blank_text: bool = True, huge_tree: bool = False) -> Generator[tuple[str, ElementType], None, None]
Memory-efficient streaming parser for large XML files.
Uses lxml's iterparse to process elements as they're parsed, maintaining constant memory usage regardless of file size.

| PARAMETER | DESCRIPTION |
|---|---|
| filepath | Path to XML file to parse. TYPE: Path \| str |
| tag | If specified, only yield elements with this tag. TYPE: str \| None DEFAULT: None |
| events | Events to listen for ('start', 'end', 'start-ns', 'end-ns'). TYPE: tuple[str, ...] DEFAULT: ('end',) |
| encoding | File encoding. TYPE: str DEFAULT: 'utf-8' |
| remove_blank_text | Remove whitespace-only text nodes. TYPE: bool DEFAULT: True |
| huge_tree | Lift libxml2's built-in safety limits so that very deep trees and very long text nodes can be parsed; intended for very large documents (>500MB). TYPE: bool DEFAULT: False |

| YIELDS | DESCRIPTION |
|---|---|
| tuple[str, ElementType] | Event type and parsed element. |

Examples:
>>> for event, elem in iterparse_elements("frames.xml", tag="frame"):
...     process_frame(elem)
...     elem.clear()  # Free memory
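The generator shape of this function can be approximated in a few lines. `stream_elements` below is a hypothetical sketch, not the library code: it filters by tag in Python so that it also runs on the standard library parser, whereas lxml's own `iterparse` accepts a `tag` argument directly. It assumes only 'start' and 'end' events, because namespace events yield tuples rather than elements.

```python
import io

try:
    from lxml import etree  # library this module is built on
except ImportError:
    import xml.etree.ElementTree as etree  # API-compatible fallback


def stream_elements(source, tag=None, events=("end",)):
    """Yield (event, element) pairs, optionally restricted to one tag."""
    for event, elem in etree.iterparse(source, events=events):
        if tag is None or elem.tag == tag:
            yield event, elem


# Hand-written sample data; a real call would pass a file path instead.
data = b"<frames><frame name='Abandonment'/><frame name='Motion'/></frames>"
names = []
for _event, frame in stream_elements(io.BytesIO(data), tag="frame"):
    names.append(frame.get("name"))
    frame.clear()  # release the element once it has been processed
```

Clearing is left to the caller, as in the example above, because only the caller knows when an element is no longer needed.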
parse_attributes(element: ElementType, type_map: dict[str, type] | None = None, use_objectify: bool = False) -> dict[str, str | int | float | bool]
Parse and convert XML attributes to Python types.

| PARAMETER | DESCRIPTION |
|---|---|
| element | Element with attributes to parse. TYPE: ElementType |
| type_map | Mapping of attribute names to Python types. TYPE: dict[str, type] \| None DEFAULT: None |
| use_objectify | Use lxml.objectify for automatic type detection. TYPE: bool DEFAULT: False |

| RETURNS | DESCRIPTION |
|---|---|
| dict[str, str \| int \| float \| bool] | Parsed attributes with converted types. |

Examples:
>>> attrs = parse_attributes(elem, {"id": int, "confidence": float})
>>> print(attrs["id"]) # Returns int, not str
123
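The `type_map` conversion can be sketched independently of any XML library, since an element's `attrib` behaves like a dictionary of strings. `convert_attributes` is a hypothetical helper; the way it treats booleans is an assumption made for this sketch, not documented behavior of `parse_attributes`.

```python
def convert_attributes(attrib, type_map=None):
    """Convert string attribute values using a name -> type mapping."""
    type_map = type_map or {}
    converted = {}
    for name, raw in attrib.items():
        target = type_map.get(name, str)  # unmapped attributes stay strings
        if target is bool:
            # bool("false") is True, so booleans need explicit parsing.
            converted[name] = raw.strip().lower() in {"true", "1", "yes"}
        else:
            converted[name] = target(raw)
    return converted


# A plain dict stands in for element.attrib here.
attrs = convert_attributes(
    {"id": "123", "confidence": "0.87", "core": "false", "name": "Agent"},
    {"id": int, "confidence": float, "core": bool},
)
```

A malformed value such as `id="abc"` raises `ValueError` in this sketch; whether the real function raises or falls back to the string is not stated on this page.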
parse_with_schema(filepath: Path | str, schema_path: Path | str | None = None, schema_type: str = 'xsd') -> ElementType
Parse XML with DTD or XSD validation.

| PARAMETER | DESCRIPTION |
|---|---|
| filepath | Path to XML file. TYPE: Path \| str |
| schema_path | Path to schema file. If None, use DTD from XML. TYPE: Path \| str \| None DEFAULT: None |
| schema_type | Schema type ('xsd', 'dtd', 'relaxng'). TYPE: str DEFAULT: 'xsd' |

| RETURNS | DESCRIPTION |
|---|---|
| ElementType | Parsed and validated root element. |

| RAISES | DESCRIPTION |
|---|---|
| XMLSyntaxError | If XML is invalid. |
| DocumentInvalid | If document doesn't match schema. |
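
The two exceptions correspond to two separate steps in lxml: parsing and validation. The sketch below shows the XSD case using lxml's `etree.XMLSchema` directly, with in-memory bytes instead of file paths. `parse_validated` is a hypothetical helper and the schema is invented for illustration; how `parse_with_schema` is actually implemented is not shown on this page.

```python
from lxml import etree

# Illustrative schema: a <frame> element with a required integer ID attribute.
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="frame">
    <xs:complexType>
      <xs:attribute name="ID" type="xs:integer" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = etree.XMLSchema(etree.fromstring(XSD))


def parse_validated(xml_bytes: bytes):
    """Parse `xml_bytes` and raise if it is malformed or schema-invalid."""
    root = etree.fromstring(xml_bytes)  # raises etree.XMLSyntaxError if malformed
    schema.assertValid(root)            # raises etree.DocumentInvalid on mismatch
    return root


root = parse_validated(b'<frame ID="7"/>')

try:
    parse_validated(b'<frame ID="seven"/>')
    rejected = False
except etree.DocumentInvalid:
    rejected = True  # well-formed XML, but ID is not an integer
```

Keeping the two failure modes distinct lets callers tell broken files apart from files that are well-formed but do not match the expected structure.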