Dave's Notes

Absence of evidence is not evidence of absence

Looking for Validation

December 13, 2017

Validating XML against the cited schema makes sure that the XML is at least structured correctly, and helps ensure that consumers of the XML have some guarantees about what to expect.

Validation can be done programmatically with a couple of commonly available tools, typically coming down to a choice of libxml2 for C and Python or Xerces for Java. The examples here use the Python lxml library, which uses libxml2 under the hood.

In a nutshell:

import os

# Local XML catalog that maps schema URIs to files on disk
CATALOG = os.path.expanduser("~/xmlcatalogs/catalog.xml")

# Note: do this before importing lxml
os.environ["XML_CATALOG_FILES"] = "file://" + CATALOG

from lxml import etree

# Load the schema to validate against
xsd = "http://www.ngdc.noaa.gov/metadata/published/xsd/schema.xsd"
s = etree.XMLSchema(file=xsd)

# A valid document passes silently
doc = etree.parse("samples/iso_01.xml")
s.assertValid(doc)

# An invalid document raises DocumentInvalid
invalid_doc = etree.parse("samples/iso_02_cn-invalid.xml")
s.assertValid(invalid_doc)
---------------------------------------------------------------------------
DocumentInvalid                           Traceback (most recent call last)
<ipython-input-9-aaa63f2f66c4> in <module>()
----> 1 s.assertValid(invalid_doc)

src/lxml/etree.pyx in lxml.etree._Validator.assertValid (src/lxml/etree.c:194448)()

DocumentInvalid: Element '{http://www.isotc211.org/2005/gmd}CI_Address': This element is not expected. Expected is one of ( {http://www.isotc211.org/2005/gmd}characterSet, {http://www.isotc211.org/2005/gmd}parentIdentifier, {http://www.isotc211.org/2005/gmd}hierarchyLevel, {http://www.isotc211.org/2005/gmd}hierarchyLevelName, {http://www.isotc211.org/2005/gmd}contact )., line 7
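
When an exception is not wanted, lxml validators also offer validate(), which returns a boolean, and error_log, which holds the individual messages. The following is a minimal sketch of that pattern. The tiny inline schema, the sample documents, and the validation_errors() helper are made up for illustration and have nothing to do with the ISO schemas used above.

```python
from lxml import etree

# Hypothetical toy schema: a <note> element containing a single <to> element
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="note">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="to" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = etree.XMLSchema(etree.fromstring(XSD))


def validation_errors(schema, doc):
    """Return a list of (line, message) tuples; empty if doc is valid."""
    if schema.validate(doc):
        return []
    return [(entry.line, entry.message) for entry in schema.error_log]


good = etree.fromstring(b"<note><to>Dave</to></note>")
bad = etree.fromstring(b"<note><from>Dave</from></note>")

for label, doc in (("good", good), ("bad", bad)):
    errors = validation_errors(schema, doc)
    print(label, "valid" if not errors else errors)
```

Collecting the messages this way makes it easier to report every problem in a document rather than stopping at the raised exception.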

The XML_CATALOG_FILES environment variable can also be set outside the app doing the validation.
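
One way to do that from Python is to pass the variable in the environment of a child process, so the validating script does not have to touch os.environ itself. This is only a sketch: the catalog_env() helper is hypothetical, and the child command here just prints the variable where a real setup would run the validation script.

```python
import os
import subprocess
import sys


def catalog_env(catalog_path, base_env=None):
    """Return a copy of the environment with XML_CATALOG_FILES set."""
    env = dict(os.environ if base_env is None else base_env)
    full_path = os.path.abspath(os.path.expanduser(catalog_path))
    env["XML_CATALOG_FILES"] = "file://" + full_path
    return env


env = catalog_env("~/xmlcatalogs/catalog.xml")

# Stand-in for the real validator: the child only echoes the variable it sees
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['XML_CATALOG_FILES'])"],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```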

Note that etree.XMLSchema() does not honor lxml's default of no network access: it will retrieve any referenced schemas from the network if their URIs are not resolved by the catalog.
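
For reference, a catalog that keeps a schema URI off the network can be quite small. The sketch below writes a minimal OASIS XML catalog with one uri entry per schema, using only the standard library. The write_catalog() helper and the local file path are hypothetical, and a real catalog for the ISO schemas would need an entry (or a rewriteURI rule) for every schema that gets imported or included.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Namespace defined by the OASIS XML Catalogs specification
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def write_catalog(catalog_path, uri_map):
    """Write a catalog mapping each remote URI to a local URI."""
    ET.register_namespace("", CATALOG_NS)
    root = ET.Element("{%s}catalog" % CATALOG_NS)
    for remote, local in uri_map.items():
        ET.SubElement(root, "{%s}uri" % CATALOG_NS, name=remote, uri=local)
    ET.ElementTree(root).write(
        catalog_path, encoding="utf-8", xml_declaration=True
    )


# Hypothetical locations, for illustration only
workdir = tempfile.mkdtemp()
catalog_path = os.path.join(workdir, "catalog.xml")
remote_xsd = "http://www.ngdc.noaa.gov/metadata/published/xsd/schema.xsd"
local_xsd = "file://" + os.path.join(workdir, "schema.xsd")

write_catalog(catalog_path, {remote_xsd: local_xsd})

with open(catalog_path, encoding="utf-8") as handle:
    print(handle.read())
```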

Looking for Validation - December 13, 2017 - Dave Vieglais