Wednesday, May 22, 2013

Hans Moleman issues

XML validation, and especially GML schema validation, can be hard. The mysterious Xerces 'honor all schema locations' flag springs to mind (this is a mystery yet to be fully understood). Often, slow schema validation processes (which seem to fetch schemas from the web) can be traced to Hans Moleman. No, sorry, wrong link, to Hans Moleman.

So what's happening? And what does Hans Moleman have to do with it?

As the GML experts among you may know, GML application schemas depend on the GML schema, which in turn consists of many schemas (the number varies between versions). These depend on other schemas, for example the W3C XLink schema, which in turn imports xml.xsd, the schema for the xml namespace itself (http://www.w3.org/XML/1998/namespace).

So even when validating a feature collection against a local version of a GML application schema, the schema parser might still get to a point where it needs to fetch dependent schemas from the internet. And since the xml.xsd is the last one in the chain, it's also the one that gets requested the most.

According to W3C people, they had ~130 million accesses to this file per day and therefore decided to completely block, e.g., the default Java HTTP User-Agent and others. Apparently they later had a change of heart and no longer block it, but the xml.xsd URL now responds with a delay of several seconds (see http://www.w3.org/2001/xml.xsd).

So when validating multiple documents, which all need the xml.xsd, with all schemas loaded freshly every time, you'll get a delay of several seconds where your computer seems to do nothing at all.

We thought about the problem of remote schemas quite a while ago and use a custom Xerces entity resolver to load OGC and W3C schemas from a local artifact that ships with deegree. There are other solutions as well: our JAXB schema generation, for example, uses standard XML catalog files to avoid fetching schemas from the web.
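To make the idea concrete, here is a minimal sketch of such a resolver using only the JDK's JAXP/DOM Load and Save interfaces. It is not deegree's actual resolver (that one hooks into Xerces directly); the class name `LocalSchemaResolver`, the in-memory map and the stub schema text are assumptions for illustration. In practice the local copies would more likely be read from the classpath or a jar. It needs Java 9 or newer because of `Map.of`.

```java
import java.util.Map;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSResourceResolver;

/** Hypothetical resolver: serves known schema URLs from local copies instead of the web. */
final class LocalSchemaResolver implements LSResourceResolver {

    // Assumption: remote schema URL -> schema text held locally.
    private final Map<String, String> localCopies;
    private final DOMImplementationLS ls;

    LocalSchemaResolver(Map<String, String> localCopies) {
        this.localCopies = localCopies;
        try {
            // Standard way to obtain a factory for LSInput objects from the JDK.
            this.ls = (DOMImplementationLS) DOMImplementationRegistry.newInstance()
                    .getDOMImplementation("LS");
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No DOM Load/Save implementation available", e);
        }
    }

    @Override
    public LSInput resolveResource(String type, String namespaceURI, String publicId,
                                   String systemId, String baseURI) {
        String content = (systemId == null) ? null : localCopies.get(systemId);
        if (content == null) {
            // Unknown schema: returning null lets the parser use its default lookup,
            // which may still go to the web.
            return null;
        }
        LSInput input = ls.createLSInput();
        input.setSystemId(systemId);
        input.setPublicId(publicId);
        input.setBaseURI(baseURI);
        input.setStringData(content);
        return input;
    }

    public static void main(String[] args) {
        String xmlXsd = "http://www.w3.org/2001/xml.xsd";
        // Stub content only; a real setup would hold the actual xml.xsd text.
        String stub = "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\"/>";
        LocalSchemaResolver resolver = new LocalSchemaResolver(Map.of(xmlXsd, stub));

        LSInput local = resolver.resolveResource(null, null, null, xmlXsd, null);
        System.out.println("xml.xsd served locally: " + (local != null));
    }
}
```

A resolver like this could be handed to a `javax.xml.validation.SchemaFactory` via `setResourceResolver` before compiling the application schema, so that imports of the mapped URLs never leave the machine.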

Unfortunately, the CITE WFS 1.0.0 tests (and others) do not avoid fetching schemas from the web (although newer versions tend to load required schemas from the classpath as well).

With some reverse engineering using an Eclipse plugin (see the other post from today), I was able to fix this (they were already using a custom entity resolver, but it loaded everything from the web every time). Now a complete deegree build, including integration test runs, needs only 13 minutes on a fast machine!

For those interested, have a look at our deegree-compliance-tests module.