Tuesday, September 27, 2011

Xerces ClassCastExceptions in multiple deployments

Today I'd like to talk about the strange XML ClassCastExceptions one sometimes gets when deploying the same web application twice in the same Tomcat instance.

The problem occurs when you deploy, for example, a newer Xerces library in your webapp's WEB-INF/lib directory. It has to do with the mechanism by which the XML factories are dynamically instantiated. I've always wondered why this happens, because the Xerces classes have obviously been loaded properly by the classloader specific to the webapp.

After not having to deal with the problem in deegree 2 (Xerces was not strictly needed there, and the problem could be solved by removing it from WEB-INF/lib), I ran into it again with deegree 3, where Xerces is a central and necessary library for parsing XML Schemas.

I've found that there is a JVM debug setting, -Djaxp.debug=true, which was essential for finding the problem. It showed that the ClassCastException did not actually occur when instantiating the DocumentBuilderFactory for the second time, but when XSLT was used. So what happened?
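One common way to set the flag for Tomcat is via CATALINA_OPTS in bin/setenv.sh (Tomcat's standard hook for such options; adjust for your setup). With it set, JAXP logs which factory implementation it chooses and via which mechanism (system property, SPI lookup, or the platform default):

```shell
# bin/setenv.sh -- make JAXP trace its factory resolution to stdout
export CATALINA_OPTS="$CATALINA_OPTS -Djaxp.debug=true"
```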

Java's built-in XSLT engine is Xerces' counterpart, Xalan. Like the built-in Xerces, it is an older version, apparently preconfigured to use the built-in Xerces. I think what probably happens is that the class in use is cached somewhere within a parent class loader (one of Tomcat's global class loaders), but was loaded with the webapp class loader of the first deployed webapp. Once the second webapp tries to use XSLT, it fails, because a different (and inaccessible) class loader was used to instantiate the parser in question. But that's pure speculation...

... which led to a hunch on my side. I thought that if each webapp had its own Xalan as well, it might solve the problem, because the dynamic loading mechanism for XML factories prefers implementations that can be found via SPI. So I deployed a recent Xalan (plus the Xalan serializer jar) as well, and the problem was gone.

To conclude, it seems good practice to deploy recent versions of all XML factories you need (directly or indirectly) in each webapp. An indirect need for XSLT can arise as quickly as converting an OMElement to an XMLStream, so it's a good guess that you need it.

PS: Another solution would be to deploy all the XML factory libs centrally in Tomcat, but that has obvious disadvantages (having to fiddle with the Tomcat installation, not being able to use different versions across webapps).

Thursday, September 22, 2011

Maven, Java and Yet Another Memory Problem

Everyone working with Java has probably experienced an OutOfMemoryError at one time or another. Since the default limits of the JVM are often ridiculously low (it used to be 64MB for a standard JVM), increasing them through startup options usually solves the problem.

Experienced users know that there are different kinds of OutOfMemoryErrors, the most common being 'Java heap space'. People playing with Tomcat and redeploying webapps a couple of times run into the 'PermGen space' variant pretty fast.

But recently we encountered yet another memory problem, one that even resulted in a VM crash with errors in libjvm.so. This was reproducible on our build server, where we also release new versions using the maven-release-plugin.

It seems that during the Maven run forked within Maven (when running mvn release:prepare), the reserved code cache memory area fills up. The documentation on -XX:ReservedCodeCacheSize is a bit unclear about how it works:
Reserved code cache size (in bytes) - maximum code cache size. [Solaris 64-bit, amd64, and -server x86: 48m; in 1.5.0_06 and earlier, Solaris 64-bit and amd64: 1024m.]
But it just so happens that increasing it to half a gigabyte fixed the problem for good.

I'm all for being able to configure the JVM in every way imaginable, but being forced to micromanage it just can't be the right way. To run the release plugin we now have to tune three different memory settings. Can't we have a single maximum memory setting that limits heap, stack, permgen and code cache size together?
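As a sketch, the three settings can be combined in MAVEN_OPTS like this. The values are illustrative, not necessarily the ones we used; note that MaxPermSize applies to the pre-Java-8 JVMs this post is about:

```shell
# heap, permgen and code cache all have to be raised separately
export MAVEN_OPTS="-Xmx1024m -XX:MaxPermSize=256m -XX:ReservedCodeCacheSize=512m"
```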

Thursday, September 8, 2011

Raster Pyramids in deegree

Today I would like to talk about raster pyramids in deegree. Gone are the times when you had to use our custom tools to create pyramids and manually configure the different resolution levels.

In order to make the handling of raster pyramids easy, we now support standard GeoTIFF files with overviews. Let me show you how to prepare your data and use it as a coverage data source in deegree.

To prepare your data, you can use standard tools like GDAL. There are a couple of requirements that the processed data needs to meet:

  • it must be a GeoTIFF, containing the extent and coordinate system of the data
  • the overview levels must be powers of 2
Assume you have a couple of raster files lying around in a directory, e.g. PNGs with .wld world files. First, build a GDAL virtual raster:

gdalbuildvrt virtual.vrt *.png

Then, build the base GeoTIFF using gdalwarp:

gdalwarp -t_srs EPSG:26912 -co BIGTIFF=YES -co TILED=YES virtual.vrt merged.tif

The BIGTIFF option enables you to create files bigger than 4GB. There are a whole lot of other options (you can also compress the TIFF if you want), see the GDAL documentation for details.

Finally, calculate the overviews using gdaladdo. Take care to only use powers of 2 for the overview levels, like in this example:

gdaladdo -r average merged.tif 2 4 8 16 32
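To check that the overviews actually ended up in the file, gdalinfo can be used (it ships with GDAL, like the tools above); each band should then report an 'Overviews:' line with the reduced sizes:

```shell
# list the overview levels attached by gdaladdo
gdalinfo merged.tif | grep -i overviews
```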

That's it, now you have a GeoTIFF ready to run. To configure it in deegree, simply add a new coverage data source using the web console and configure the location of your file.

All done! Now you can use it like any other coverage data source to configure your layers (or a WCS). The files can be very big, I've already tested it with ~150GB files (my lack of disk space prevented testing bigger files).

The implementation was done using the standard TIFF driver from imageio-ext, which has support for BigTIFF without the need for JNI libraries (imageio-ext also has support for GDAL using JAI, but I didn't use that one). Many thanks to the developers for making this easy!

Edit: Removed references to images. I was wondering where all my screenshots went. Seems it had something to do with joining Google+. But I'm not taking the effort to recreate the screenshots here (you should read the deegree handbook anyway, and use the new tile stores instead of this one).

Friday, August 26, 2011

Cascaded Layers and Themes

Today I want to talk about an experimental way to configure cascaded layers.

I already talked about a new layers and themes concept. In order to test how this can work, I've now implemented an experimental way to cascade a WMS. This is a different approach from the one I talked about before; it focuses on cascading a complete service rather than a single layer.

So we have a couple of different resources here. The first and most basic resource is a remote OWS (OGC Web Service). There are now two different remote OWS resource types, with the old one still being used by the traditional cascading layers concept. The new one is currently a lot simpler to configure:

<?xml version="1.0"?>
<RemoteWMS xmlns="http://www.deegree.org/remoteows/wms"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.deegree.org/remoteows/wms remotewms.xsd"
  location="http://deegree3-demo.deegree.org/deegree-utah-demo/services?request=GetCapabilities&amp;service=WMS&amp;version=1.1.1" />

Please note that the namespace differs from the traditional remote WMS store. The only thing to configure here is the location of the WMS capabilities document.

The next resource we have is a LayerStore. A remote WMS layer store can be used to provide a 'copy' of all the layers a remote WMS has. The configuration simply consists of telling the store which remote WMS to use:

<?xml version="1.0"?>
xsi:schemaLocation="http://www.deegree.org/layers/remotewms remotewms.xsd"
Next is the configuration of a Theme, which is also pretty straightforward. It too needs a remote WMS, and it needs layer stores:

<?xml version="1.0"?>
xsi:schemaLocation="http://www.deegree.org/themes/remotewms remotewms.xsd"
The interesting thing is what happens here. As you might remember, a theme combines sub-themes and layers into what is traditionally known as a layer tree. You can now use theme references to configure your WMS layer structure (as we will see below).

The remote WMS theme creates a theme structure which is identical to the layer structure of the remote WMS. It then tries to find matching layers from the available layer stores, and inserts them at the appropriate places. If the layer store coincides with the remote WMS layer store (as in our case) you'll get the same structure as the cascaded WMS.
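The matching step can be sketched roughly like this. All class and method names here are made up for illustration; this is not deegree's actual API, just the idea of mirroring the remote layer tree and attaching local layers by name:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the theme-building step described above.
class ThemeBuilder {

    // Mirror the remote WMS layer tree as a theme tree, attaching a local
    // layer wherever a layer store offers one under the same name.
    static Theme buildTheme(RemoteLayerNode remote, Map<String, Layer> layerStores) {
        Theme theme = new Theme(remote.name);
        Layer match = layerStores.get(remote.name);
        if (match != null) {
            theme.layers.add(match); // found a local 'copy' of this layer
        }
        for (RemoteLayerNode child : remote.children) {
            theme.subThemes.add(buildTheme(child, layerStores)); // recurse
        }
        return theme;
    }

    static class RemoteLayerNode {
        final String name;
        final List<RemoteLayerNode> children = new ArrayList<RemoteLayerNode>();
        RemoteLayerNode(String name) { this.name = name; }
    }

    static class Layer {
        final String name;
        Layer(String name) { this.name = name; }
    }

    static class Theme {
        final String name;
        final List<Layer> layers = new ArrayList<Layer>();
        final List<Theme> subThemes = new ArrayList<Theme>();
        Theme(String name) { this.name = name; }
    }
}
```

If the layer store was itself built from the same remote WMS, every node finds a match and the resulting theme tree is identical to the cascaded service's layer tree.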

The WMS configuration is then as simple as this:

xsi:schemaLocation="http://www.deegree.org/services/wms http://schemas.deegree.org/services/wms/3.1.0/wms_configuration.xsd">

And that's it. True, you had to edit four files, but all are very simple to understand. Cascading a WMS with 524 layers used to be harder...

If you do a quick GetCapabilities, you'll see that bounding boxes, metadata etc. are just copied from the remote service. In the future it will of course be possible to manually create a custom theme, which can then be used to only select a couple of layers from the remote WMS, mix them with other layers and add/edit metadata.

I think this proves that the layers/themes concept can work. To further test it, a couple of prototypic other layer stores will be implemented soon.

Friday, July 29, 2011

deegree Themes

After talking about layers, I recently read the OGC WMTS specification. It introduces a new concept called themes, and that led me to understand the problem I'm trying to solve here more thoroughly.

Layers in WMTS are no longer hierarchical. All that remains of the old WMS layer trees is a flat list of layers. To bring order into a potentially huge linear list of layers, the theme concept was introduced.

Themes in WMTS can be hierarchical, and each theme can reference any number of layers. Layers can even be referenced multiple times, and you can have multiple top-level themes.

That actually makes a beautiful distinction between structure and data. Speaking in deegree workspace language, it suddenly makes a lot of sense to have not only a bunch of layers configured in one place, but also have a bunch of themes, referencing layers, in another place. A WMS would then only reference themes.

Other ideas include simple layer collections (which aggregate layers, similar to the current logical layers nested within other layers).

I still have to figure out a lot of the details, but I can feel that this is going to make things simpler. When configuring a layer you only think about what data it's going to use, and when configuring a theme you only think about the bigger picture (where do my layers belong).

I'm always open to suggestions, so if you have an idea about how to make things good in the end, please speak up.

Monday, July 18, 2011

Cascading WMS with deegree

In this post I want to explain how cascading a WMS works in deegree. Cascading can be useful to 'repair' a broken WMS, to restructure layer hierarchies or to hide multiple WMS behind one endpoint.

So how does cascading work in deegree? First of all, there's an abstract concept of a remote OWS data source. Currently there is only a WMS implementation, but more will follow. That means a remote WMS can be a resource in the workspace. That resource can then be used in your WMS configuration as a data source for a layer.

One important consequence is that one remote WMS data source means exactly one data source. Although a WMS typically has multiple layers, and the config allows you to select multiple layers for cascading, it's still one resource. Let's have a look at an example:

<RemoteWMSStore xmlns="http://www.deegree.org/datasource/remoteows/wms"
xsi:schemaLocation="http://www.deegree.org/datasource/remoteows/wms remotewms.xsd"

It's always required to specify a capabilities document. This can also be a local file instead of a request. Then you can specify one or more requested layers. That's it; everything else (available CRSs, formats etc.) will be determined from the capabilities.

Now to actually make it available as WMS layer in deegree just add a layer with the datasource ID:

  <wms:Title>Utah ZipCodes</wms:Title>
And that's that, you're done.

But imagine the remote WMS has a slow png implementation, and you want to always request the map as jpeg. Also, transformation is broken and you want to let deegree always reproject the raster image. That's easy:

 xsi:schemaLocation="http://www.deegree.org/datasource/remoteows/wms remotewms.xsd"
    location="http://deegree3-testing.deegree.org/deegree-utah-demo/services?request=capabilities&amp;service=WMS&amp;version=1.1.1" />
    <ImageFormat transparent="false">image/jpeg</ImageFormat>
<DefaultCRS useAlways="true">EPSG:4326</DefaultCRS>

Please note that deegree will automatically reproject anyway if the requested CRS is not available in the remote WMS. Setting useAlways to true, as in the example, will force reprojection even for CRSs that the remote WMS claims to support.

The configuration you've seen so far is the kind where you tell deegree what you want to achieve on a high level. That's pretty nice, but sometimes one wants control on a lower level. Imagine you want to send a vendor-specific parameter with all requests.

Let's say you really like red backgrounds:

    <ImageFormat transparent="false">image/jpeg</ImageFormat>
    <Parameter use="allowOverride" scope="GetMap" name="bgcolor">0xff0000</Parameter>
In this case you'll have a red background by default, and the user can override it using BGCOLOR in a GetMap request. If the use attribute is set to fixed, the parameter's value will always be used. The scope attribute can be set to GetMap, GetFeatureInfo or All.
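The semantics of the use attribute can be sketched like this; the class and method names are invented for the example and are not deegree's actual internals:

```java
// Illustrative sketch of use="fixed" vs use="allowOverride".
class VendorParameter {
    final String name;
    final String defaultValue;
    final boolean fixed; // true for use="fixed", false for use="allowOverride"

    VendorParameter(String name, String defaultValue, boolean fixed) {
        this.name = name;
        this.defaultValue = defaultValue;
        this.fixed = fixed;
    }

    // Decide which value goes into the cascaded request: a fixed parameter
    // ignores the client's value, an overridable one prefers it when present.
    String resolve(String clientValue) {
        if (fixed || clientValue == null) {
            return defaultValue;
        }
        return clientValue;
    }
}
```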

The DefaultRequestOptions block can also occur within a RequestedLayer section (where it is named RequestOptions). In that case, the store will not request the layers in a single request, but will do multiple requests and combine them into a single map. That enables you to combine layers with different options, for example when two layers should be requested in different CRSs.

Last but not least, some WMS need authentication. Currently, only HTTP Basic authentication is supported.

One more note on why cascading might be useful: some WMS implementations don't support proper GML output for feature info. deegree will read the broken format and serve proper GML. There are currently workarounds for ArcIMS and MyWMS, but others should be easy to add. Just drop me a note.

I hope that provides some insight on how to use deegree for cascading WMS. For a couple of examples you can check out the deegree-wms-remoteows-test module.

Edit: sorry for the crappy XML formatting, I'm still trying to figure out how to use blogger...

Thursday, July 14, 2011

deegree Layers

So Bolsena is over, and it's time to get back to other issues. That is a little unfortunate, as the atmosphere was really nice and productive, but hey, the Code Sprint will return in 2012.

So what's next? In this post I want to talk about something that has been on my mind for quite some time: the layer configuration in deegree. This might not be so interesting for most people, but writing it down definitely helps to sort my own mind.

People familiar with deegree 3.1's workspace concept might know about our resource oriented approach. In essence, everything you can configure is a resource, and corresponds to a single configuration file. So a JDBC connection is a resource, a shape file configuration is a resource, and a SQL feature store configured with all INSPIRE Annex I themes is a resource. A service configuration for a WMS or WFS is also a resource.

For the WFS this is all very well: the configuration mainly concerns service-specific issues, such as whether transactions are enabled, which output formats are available and which protocol versions are switched on.

For the WMS this also makes sense: you configure the layer structure and tell the WMS which data source and which style(s) to use. A disadvantage is that configuration files grow really big for many layers. Another is that often a single feature store is used for all layers, with a different feature type for each layer. Switching the store then requires changing the ID in all layers.

From a REST-like web service perspective, it is very easy to add another feature store: a single file needs to be PUT. Adding another layer requires fetching the current config, modifying it and POSTing it again, which is obviously not very comfortable.

And last but not least, having this integrated in the WMS config means that only the WMS can access the layers. From a service point of view this might not be too bad (after all, only the WMS currently uses layers), but one can easily imagine other use cases where layers as a resource make sense. Think of offline-updating a cache, or writing a Java based client.

So I think that's reason enough to think about a different approach. A hierarchical layer structure is easy enough to express in a file system. A directory is an unrequestable group layer without content of its own; a .xml file is a layer with content. A directory called layer with a file layer.xml right next to it is a group layer that possibly also has its own content.
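For illustration, a hypothetical layer directory could look like this (all names made up):

```
layers/
  roads.xml          a plain layer with content
  background/        a directory: an unrequestable group layer
    rivers.xml
    lakes.xml
  topo/              a directory plus ...
    contours.xml
  topo.xml           ... a file of the same name: a group layer with its own content
```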

A single layer resource might not just produce a single layer, but also a single root layer with many children. The default WMS config might use all available layers, if it's not configured to use a specific root layer. That would enable people to produce layer files which properly correspond to a single feature store, not a single feature type. The trees could even be generated from the feature type hierarchy, which might be a useful starting point for many applications.

Well, it's still a lot of work, and would require quite a bit of refactoring on the WMS side. But I believe it would make the workspace just that bit more consistent. Implemented via SPI (like the rest of the resources), extending the deegree WMS with custom layers would become a walk in the park.

Wednesday, July 6, 2011

Setting up eclipse using maven

A popular topic for developers is always what development platform to use and how to set it up. Once a project becomes bigger and more people collaborate the project management often comes up with best practices, such as how code should be formatted and so on. In this post I'll try to describe how to set up a deegree development environment in eclipse quick and easy.

Here's a quick list on things you need to do:
  • download and install a recent eclipse
  • in eclipse, go to Window/Preferences/Java/Build Path/Classpath variables, and add M2_REPO, have it point to your local maven repository (usually $HOME/.m2/repository/). You can do a mvn -Declipse.workspace=/home/user/workspace eclipse:configure-workspace instead of configuring it manually if you want.
  • check out deegree trunk from https://svn.wald.intevation.org/svn/deegree/deegree3/trunk/. You can check it out into your eclipse workspace, we can import the projects properly later on.
  • run the maven eclipse plugin plus deegree maven plugin in the trunk folder using mvn -Declipse.workspace=/home/user/workspace -DdownloadSources=true -DdownloadJavadocs=true -Dwtpversion=2.0 eclipse:clean eclipse:eclipse deegree:create-links -Declipse.formatter=deegree (all in one line please)
  • use File -> Import -> General -> Existing Projects into Workspace in eclipse to import all projects at once. Choose the directory with your deegree checkout as a starting point to scan for projects.
Now you should be good to go to hack deegree.

I like it when things just work, but in case they don't I usually want to know what exactly happens on my computer. So let me explain what happens when you run the maven plugins.

The Maven Eclipse plugin generates the .project, .settings and .classpath files/folders for you. It adds the correct source/resource folders in eclipse, output folder, project/library dependencies and even generated source folders (such as the ones generated by the jaxb plugin).

That's already neat. But we currently have 86 maven projects in a hierarchical project tree, and eclipse wants to have the projects flat in its workspace. That's where the deegree maven plugin can help. On Linux it automatically symlinks the projects to the eclipse workspace folder, although this is not strictly necessary any more (eclipse projects need not be directly in the workspace).

Now getting back to management telling you how to format your code. It's not an option to manually set the formatter for 86 projects. Other projects may require a different formatter, so just setting the workspace default is also not an option. Version 1.4 of the deegree maven plugin lets you specify a formatter using -Declipse.formatter=xxx (as shown above). Currently deegree and eclipse120 are supported, with deegree being our custom code style and eclipse120 being the standard eclipse formatter modified to allow 120 characters per line (instead of the default 80).

So that's it. Please tell me if something does not work out as expected.

By the way, the deegree maven plugin can do more stuff. It has helpers for web service integration testing, deegree workspace management and other utilities. There is a wiki page describing some of it, but it's currently outdated; only some of the functionality is documented.

Edit: Added a few missing steps to get everything to compile properly (Lombok and M2_REPO). Thanks Martin!

Edit: Updated the post with a couple of new things we learned. Also Lombok is not needed any more.

Thursday, June 23, 2011

Bolsena #3: INSPIRE on the GeoCouch

In case anyone ever wondered what good it can do to put a couple of hackers in an Italian monastery for a week, here's an example. Talking to Volker Mische (the author of GeoCouch), we wondered whether the BLOB-mode approach from traditional databases could be applied to the couch as well. Rather than just wondering, we thought we might as well try it out.

After an evening spent hacking we've finally proved that it can work, and created an initial version of a deegree-featurestore-geocouch. It creates spatial indexes automatically upon startup, and has the ability to insert features via WFS-T or the loader. Querying by ID or BBOX also works already.

We've tested it with INSPIRE Annex I data themes (Addresses and CadastralParcels). That's what the code sprint is all about, collaborating, coming up with new ideas and combining things in a new way.

Mind you, the store is not ready for production or anything, but it's a start, and only needs a little more care to be a viable alternative to traditional databases.

Bolsena #2: PostgreSQL news

Following up on the last post, I've implemented a small patch against PostgreSQL-JDBC and sent it to the PostgreSQL JDBC mailing list. Oliver Jowett then kindly wrote a small benchmark to test my changes, and it does indeed seem to be quite a bit faster than the original version.

That said, I've also asked about using a binary protocol to connect to the database, and there has indeed been work on that. So we can hope to get a faster version sometime in the future, which does not require special handling of bytea fields any more.

Monday, June 20, 2011

Bolsena #1: Blogging deegree

There has been a considerable lack of blogging about deegree. Googling for it yields the results for 'degree blog' (note the missing e), and forcing the issue reveals only a blob that seems only suitable for adults. Being in Bolsena under the hot Italian sun participating at the Bolsena Code Sprint 2011 seems like a good spot to change that.

So what's happening in the deegree world? Just now we're profiling our INSPIRE services. It turns out that the PostgreSQL JDBC driver has some strange handling of bytea fields. We're using these in our so-called BLOB storage (storing GML directly in the database with a couple of indexes). In theory, fetching the BLOBs (usually only a couple of KB each) should be fast enough, and the actual GML parsing/exporting/rendering etc. should be what eventually slows things down.

The nice thing about actually profiling things is that you know where to replace the 'should' with 'does not'. For PostgreSQL 9 there seem to be two 'encodings' for fetching the actual bytes of a bytea field. Both encode each byte as a string, one using octal numbers, one using hex numbers (the hex format is a new 'feature' in PostgreSQL 9). Decoding involves a method call for each and every byte, where the string is decoded into the actual byte value.

Our test case was ~23000 features sized 1-2KB each. This results in something like 30 to 40 million calls to the method that decodes the bytes, and consumes a lot of time.
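To make the hot spot concrete, here is a sketch of the kind of per-byte decoding described above. The real PostgreSQL-JDBC code differs in detail; this just illustrates how the hex wire format (a string like "\x1a2b...") is turned into bytes with one conversion call per byte:

```java
// Sketch of hex bytea decoding with one method call per byte.
class ByteaHexDecoder {

    static byte[] decode(String hex) {
        // strip the "\x" prefix of PostgreSQL 9's hex output format
        String digits = hex.startsWith("\\x") ? hex.substring(2) : hex;
        byte[] out = new byte[digits.length() / 2];
        for (int i = 0; i < out.length; i++) {
            // one call per byte -- with tens of millions of bytes,
            // this is the kind of place the profiler points at
            out[i] = decodeByte(digits, i * 2);
        }
        return out;
    }

    private static byte decodeByte(String digits, int pos) {
        int hi = Character.digit(digits.charAt(pos), 16);
        int lo = Character.digit(digits.charAt(pos + 1), 16);
        return (byte) ((hi << 4) | lo);
    }
}
```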

So what are the options? There are other ways to store large objects in PostgreSQL, so maybe using one of those would be better. But since PostgreSQL is open source, one might also take a closer look at the driver.

Another option would be to try and enjoy the beautiful view of the Lago di Bolsena more often, and not dig into other people's code...

For those who want to know more about deegree, have a look at our wiki. More posts focused on deegree will follow.

Stay tuned for other Bolsena Code Sprint stories!