I guess I’m on an XML rampage lately. I often receive XML documents that are not well formed. They started out well formed, and it’s a long story how they end up mangled, but basically they contain extraneous whitespace. Something like this:
<root some attr="value"> <childnode attr ="123" /> </root>
To help recover documents like this I’ve built a web application based on Apache XMLBeans which parses and then displays a formatted version. I also added features to attempt to remove the extraneous whitespace, wrap long lines, and compress multiple occurrences of whitespace in attribute values and text nodes.
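The parse-then-format step is easy to picture in a few lines. This is only a rough Python sketch using the standard library's minidom, not the XMLBeans code behind the application; the `pretty` helper and the sample document are made up for illustration, and it assumes the input is already well formed.

```python
from xml.dom import minidom

def pretty(xml_text, indent='  '):
    """Parse a well-formed document and return an indented rendering."""
    # parseString raises an exception if the text is not well formed.
    return minidom.parseString(xml_text).toprettyxml(indent=indent)

# Hypothetical input: the first example after its whitespace has been repaired.
print(pretty('<root someattr="value"><childnode attr="123"/></root>'))
```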
The first of these features I called Smart Scan. Looking back, this was kinda arrogant, as the code isn’t that smart at all. In fact, it is surprisingly difficult to programmatically examine an XML document that is not well formed and deduce how it should be corrected. Basically I attempt to remove whitespace that would cause the parse of the document to fail. This whitespace can occur in two places in an XML element: in the element name or in an attribute. For example, consider:
<myele ment attr="1234"/>
As is, this document will not parse. We could correct it in one of two ways:
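If you want to see the failure for yourself, a quick check like the one below will do. It is a small Python sketch with the standard library parser (my application uses XMLBeans, so treat this as an illustration only); `is_well_formed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed('<myele ment attr="1234"/>'))   # False: "ment" has no value
print(is_well_formed('<myelement attr="1234"/>'))    # True
print(is_well_formed('<myele mentattr="1234"/>'))    # True
```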
- <myelement attr="1234"/>
- <myele mentattr="1234"/>
Both of these result in a well formed XML document, but the first one is probably what was intended. As humans we deduce this because the words and abbreviations used make more sense. It’s pretty hard to write code that is this smart, so I chose to format this as shown in the second option above. The reasoning behind this is that an element is likely to have many more characters dedicated to attributes than to the element name. Therefore, it is more likely that extraneous whitespace will occur in an attribute.
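To make the heuristic concrete, here is a minimal Python sketch of the "glue the stray word onto the next attribute name" rule. It is not the actual Smart Scan code. The `smart_scan` name and the regular expressions are my own assumptions for illustration, and it takes shortcuts: it assumes no ">" appears inside attribute values, it does not skip comments or CDATA sections, and a stray word with no attribute after it is simply left alone.

```python
import re
import xml.etree.ElementTree as ET

# A start tag or empty-element tag: name, everything up to ">", optional "/".
TAG = re.compile(r'<([^\s<>/!?]+)([^<>]*?)(/?)>')
# Either name="value" (whitespace around "=" allowed) or a bare word.
TOKEN = re.compile(r'''([^\s=]+)(?:\s*=\s*("[^"]*"|'[^']*'))?''')

def smart_scan(text):
    """Join stray words inside a tag onto the attribute name that follows."""
    def fix_tag(match):
        name, rest, slash = match.groups()
        attrs = []
        pending = ''  # stray words waiting for an attribute name
        for word, value in TOKEN.findall(rest):
            if value:
                attrs.append(pending + word + '=' + value)
                pending = ''
            else:
                pending += word
        if pending:
            attrs.append(pending)  # nothing to join to; left for a human
        return '<' + ' '.join([name] + attrs) + slash + '>'
    return TAG.sub(fix_tag, text)

fixed = smart_scan('<root some attr="value"> <childnode attr ="123" /> </root>')
print(fixed)
ET.fromstring(fixed)  # raises ParseError if the repair did not work
```

Because whole quoted values are consumed as a single token, words inside an attribute value are never mistaken for stray words.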
The second feature I added is named Wrap Lines and basically attempts to wrap long element lines in a pleasing manner without impacting the validity of the document.
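One way to wrap without hurting validity is to break only at the whitespace between attributes, never inside a name or a quoted value. The Python sketch below shows that idea for a single tag. It is a guess at a reasonable approach rather than the application's real algorithm, and `wrap_tag`, its `width` parameter, and the sample element are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

NAME = re.compile(r'<([^\s/>]+)')
ATTR = re.compile(r'''[^\s=]+\s*=\s*(?:"[^"]*"|'[^']*')''')

def wrap_tag(tag, width=60):
    """Wrap one well-formed start tag, breaking only between attributes."""
    name_match = NAME.match(tag)
    name = name_match.group(1)
    attrs = ATTR.findall(tag, name_match.end())
    closing = '/>' if tag.rstrip().endswith('/>') else '>'
    limit = width - len(closing)       # leave room for the closing ">" or "/>"
    indent = ' ' * (len(name) + 2)     # align under the first attribute
    first = '<' + name
    lines = [first]
    for attr in attrs:
        # The first attribute always stays on the line with the element name.
        if lines[-1] == first or len(lines[-1]) + 1 + len(attr) <= limit:
            lines[-1] += ' ' + attr
        else:
            lines.append(indent + attr)
    return '\n'.join(lines) + closing

tag = '<person first="Ada" last="Lovelace" born="1815" field="mathematics"/>'
wrapped = wrap_tag(tag, width=40)
print(wrapped)
# The wrapped tag still parses to the same attributes.
assert ET.fromstring(wrapped).attrib == ET.fromstring(tag).attrib
```

A single attribute longer than the width still ends up on a line of its own, since there is no safe place to break inside it.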
Finally, there is a feature called Compress Whitespace that will replace each run of whitespace in an attribute value or text node with a single space. This can be useful if, for example, I have a document that gets mangled as follows:
<myelement attr="This is the attribute value"/>
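The compression itself is simple once the document parses. Here is a short Python sketch of the idea using the standard library; it is not the application's code, and the `compress_whitespace` and `squeeze` names, along with the input with its runs of spaces, are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

def squeeze(value):
    """Replace each run of whitespace with a single space."""
    return re.sub(r'\s+', ' ', value)

def compress_whitespace(xml_text):
    """Squeeze whitespace in every attribute value and text node."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        for key, value in list(elem.attrib.items()):
            elem.attrib[key] = squeeze(value)
        if elem.text:
            elem.text = squeeze(elem.text)
        if elem.tail:  # text that follows the element's end tag
            elem.tail = squeeze(elem.tail)
    return ET.tostring(root, encoding='unicode')

# Hypothetical mangled input with runs of spaces inside the attribute value.
print(compress_whitespace('<myelement attr="This   is  the    attribute     value"/>'))
```

Note that this also squeezes indentation between elements down to a single space, which matches the description above but may not be what you want for pretty-printed input.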
All three of these features may be enabled or disabled from the page that submits the XML document to be formatted, so if they are not working for you, just turn them off. If they behave inappropriately with one of your documents, I would appreciate it if you would share the document with me by sending it to bwit AT pobox.com.
Here’s an example of using the application with our first example XML document:
That’s about it. If you read all this drivel, you certainly deserve some free code, and I hope it serves your needs. If it doesn’t, leave a comment and take me to task for it, but remember, this is a mediocre application developed by an average guy. So be kind.
The code is here. By the way, if you’ve just jumped to the bottom without reading any of the valuable information above then this link just won’t work!
Nah, just kidding.