Thursday, November 27, 2008

Comparing XML Documents Semantically

I have been interested in comparing the contents of XML documents for some time now.

What I wanted to do was perform a 'diff' on 2 XML documents and determine whether or not they are the same document, regardless of whitespace or child element ordering.

In Java, I came across XMLUnit. This piece of software is excellent for determining whether or not 2 separate XML documents are equal.

In Perl, I came across XML-SemanticDiff. I thought it was great until I re-ordered the elements in one of my documents. Then this module wasn't so great anymore.

Since I really needed a Perl equivalent of XMLUnit, I decided to create my own module and call it XML-SemanticCompare. This new module really does perform a semantic diff on XML documents:
  • Child element re-ordering doesn't result in false negatives.
  • Whitespace is trimmed from text and attribute values by default before they are compared [can be turned off].
  • Attributes can be ignored [this option is off by default].
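
To make the three points above concrete, here is a minimal sketch of how an order-insensitive comparison can work in general. It is not the module's actual implementation: the element hashes, the `signature` and `semantically_equal` subs, and the `trim` / `ignore_attrs` option names are all hypothetical and exist only for this illustration. The idea is to build a signature for each element, sort the child signatures, and compare the resulting strings. The `|`, `,` and `;` separators are naive and would need escaping for real data.

```perl
use strict;
use warnings;

# Hypothetical element shape used only for this sketch:
# { name => 'item', attrs => { id => '2' }, text => 'pear', children => [ ... ] }

# Return a copy of the value with leading/trailing whitespace removed.
sub trim {
    my ($s) = @_;
    $s = '' unless defined $s;
    $s =~ s/^\s+//;
    $s =~ s/\s+$//;
    return $s;
}

# Build a signature for an element that does not depend on child order.
sub signature {
    my ($el, %opt) = @_;
    my $trim = exists $opt{trim} ? $opt{trim} : 1;    # trimming is on by default
    my $fix  = sub { my $v = defined $_[0] ? $_[0] : ''; $trim ? trim($v) : $v };

    my @attrs;
    unless ($opt{ignore_attrs}) {                      # attributes are compared by default
        my $attr = $el->{attrs} || {};
        push @attrs, join('=', $_, $fix->($attr->{$_})) for sort keys %$attr;
    }

    # Recurse into the children, then sort so that their order is irrelevant.
    my @kids;
    push @kids, signature($_, %opt) for @{ $el->{children} || [] };
    @kids = sort @kids;

    return join '|', $el->{name}, join(',', @attrs), $fix->($el->{text}),
        '(' . join(';', @kids) . ')';
}

# Two trees are treated as equal when their signatures are identical.
sub semantically_equal {
    my ($left, $right, %opt) = @_;
    return signature($left, %opt) eq signature($right, %opt);
}

my $control = {
    name     => 'order',
    children => [
        { name => 'item', text => 'apple' },
        { name => 'item', attrs => { id => '2' }, text => 'pear' },
    ],
};

# Same content, but children are re-ordered and the text has extra whitespace.
my $test = {
    name     => 'order',
    children => [
        { name => 'item', attrs => { id => '2' }, text => "  pear\n" },
        { name => 'item', text => 'apple' },
    ],
};

print semantically_equal($control, $test) ? "same\n" : "different\n";
```

Passing `trim => 0` in this sketch makes the two trees differ because of the padded text, and `ignore_attrs => 1` drops attributes from the signature entirely.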

Using the module is extremely straightforward:
use XML::SemanticCompare;
my $x = XML::SemanticCompare->new;

# compare 2 different files
my $isSame = $x->compare($control_xml, $test_xml);
# are they the same
print "XML matches!\n" if $isSame;
print "XML files are semantically different!\n" unless $isSame;

# get the diffs
my $diffs_arrayref = $x->diff($control_xml, $test_xml);

# test xpath statement against XML
my $success = $x->test_xpath($xpath, $test_xml);
print "xpath success!\n" if $success;

The only downside to this piece of software is that it isn't very efficient (although it isn't terribly inefficient either).

If you find yourself trying to compare the DOM trees of XML documents for equality and you are using Perl, please check out XML-SemanticCompare. If you can make the code more robust and efficient, please do!


Vibha S P said...

Hello ekawas, nice post :)
I am also using the same module.
But I am unable to interpret the output of diff.
Please throw some light on that.

Ed said...

Diff basically gives you XPath-like statements that show what is different between the 2 files.

gdan2000 said...

Could you please post an example of how to properly display the output of the diff method?

Ed said...

The diff method just returns a reference to a list of strings that are XPath-like and illustrate what the differences are.

For instance,

my $diffs_arrayref = $x->diff( $control_xml, $test_xml );
print "Diff: $_\n" foreach (@$diffs_arrayref);

Check out the CPAN documentation for the module.