Thursday, November 27, 2008

Comparing XML Documents Semantically

I have been interested in comparing the contents of XML documents for some time now.

What I wanted was to run a 'diff' on two XML documents and determine whether or not they are the same document (regardless of whitespace or child element ordering).

In Java, I came across XmlUnit. This piece of software is excellent for determining whether or not two separate XML documents are equal.

In Perl, I came across XML-SemanticDiff. I thought it was great until I re-ordered the elements in one of my documents. Then this module wasn't so great anymore.

Since I really needed a piece of software equivalent to XmlUnit in Perl, I decided to create my own module and to call it XML-SemanticCompare. This new module really does perform a semantic diff on XML documents:
  • Child element re-ordering doesn't result in false negatives.
  • Whitespace is trimmed from text by default when comparing text and attribute values [can be turned off].
  • Attributes can be ignored [turned off by default].
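
To illustrate the idea behind those three points, here is a minimal sketch. It is not the module's actual implementation: the nested-hash tree layout, the helper names canonical and same_tree, and the option names trim and ignore_attributes are all assumptions made up for this example. Each element is reduced to a canonical string, and the children's strings are sorted so that sibling order stops mattering.

```perl
use strict;
use warnings;

# Build a canonical string for one element. The tree layout used here
# (name, attrs, text, children) is a hypothetical stand-in for a parsed DOM.
sub canonical {
    my ($node, %opt) = @_;

    my $text = defined $node->{text} ? $node->{text} : '';
    $text =~ s/^\s+|\s+$//g if $opt{trim};

    my $attrs = '';
    unless ($opt{ignore_attributes}) {
        my $attr = $node->{attrs} || {};
        my @pairs;
        for my $key (sort keys %$attr) {
            my $value = $attr->{$key};
            $value =~ s/^\s+|\s+$//g if $opt{trim};
            push @pairs, "$key=$value";
        }
        $attrs = join ',', @pairs;
    }

    # Sorting the children's canonical forms makes sibling order irrelevant.
    my @kids = map { canonical($_, %opt) } @{ $node->{children} || [] };
    @kids = sort @kids;

    return sprintf '<%s|%s|%s>%s', $node->{name}, $attrs, $text, join('', @kids);
}

# Two trees are "the same" when their canonical strings match.
sub same_tree {
    my ($left, $right, %opt) = @_;
    return canonical($left, %opt) eq canonical($right, %opt);
}

my $control = { name => 'person', children => [
    { name => 'first', text => ' Ada ' },
    { name => 'last',  text => 'Lovelace' },
] };
my $test = { name => 'person', children => [
    { name => 'last',  text => 'Lovelace' },
    { name => 'first', text => 'Ada' },
] };

my $same = same_tree($control, $test, trim => 1);
print $same ? "same\n" : "different\n";   # prints "same"
```

With trim switched off, the two trees above compare as different, because ' Ada ' and 'Ada' are no longer treated as equal.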

Using the module is extremely straightforward:
use XML::SemanticCompare;
my $x = XML::SemanticCompare->new;

# compare 2 different files
my $isSame = $x->compare($control_xml, $test_xml);
# are they the same
print "XML matches!\n" if $isSame;
print "XML files are semantically different!\n" unless $isSame;

# get the diffs
my $diffs_arrayref = $x->diff($control_xml, $test_xml);

# test xpath statement against XML
my $success = $x->test_xpath($xpath, $test_xml);
print "xpath success!\n" if $success;

The only downside to this piece of software is that it isn't very efficient, although it isn't terribly inefficient either.

If you find yourself trying to compare the DOM trees of XML documents for equality and you are using Perl, please check out XML-SemanticCompare. If you can make the code more robust and efficient, please do!

Tuesday, November 25, 2008

A Perl script to rename your files

For those of you looking for a cool way to rename your files, here is a little script that could work wonders for you.

Basically, this script takes a Perl substitution expression and applies it to each filename you give it.
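
As a rough sketch of what "applying a substitution expression" means: the expression arrives as a string, the filename is placed in $_, and the string is run with eval so that s/// acts on $_. The helper name apply_expression is made up for this example and does not appear in the script itself.

```perl
use strict;
use warnings;

# apply_expression is a hypothetical helper used only for illustration.
sub apply_expression {
    my ($expression, $filename) = @_;
    local $_ = $filename;   # s/// with no explicit target operates on $_
    eval $expression;       # run the user's expression against $_
    die $@ if $@;           # report syntax errors in the expression
    return $_;
}

print apply_expression('s/_/ /g', 'some_file.txt'), "\n";   # prints "some file.txt"
```

Because eval runs whatever code it is handed, only pass in expressions you wrote yourself.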

To get started, copy the script below into a file called ''. Then read the usage instructions by running the script with the '-h' option. Of course, you will need Perl installed on your machine.

So why write this script? Because I got sick and tired of renaming files by hand. Some people have files called 'some_file.txt'. I prefer that file to be called 'SomeFile.txt'.

To accomplish this, all I have to do is run the script like so:

   perl -u "s/_/ /g" some_file.txt

followed by
   perl "s/ //g" "Some File.txt"

Note: The -u option capitalizes the first letter of each space-separated word and lowercases the rest of that word.

Okay, maybe it's quicker to do this manually for a single file, but if you are on Windows and have a whole folder full of files like that, then the process is:
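
Here is a small sketch of what -u does to a name, using the same ucfirst lc combination the script uses. The helper name capitalize_words is only for this example.

```perl
use strict;
use warnings;

# capitalize_words is a hypothetical helper mirroring the -u behaviour.
sub capitalize_words {
    my ($name) = @_;
    # Split on single spaces, fix the case of each word, and re-join.
    return join ' ', map { ucfirst lc $_ } split / /, $name;
}

print capitalize_words('some file.txt'), "\n";   # prints "Some File.txt"
```

Note that lc runs first, so a name like 'MY FILE.TXT' ends up as 'My File.txt'.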

   for %v in (*.txt) do perl -u "s/_/ /g" "%v" 
   for %v in (*.txt) do perl "s/ //g" "%v"

That is all there is to it! All of the files will be renamed with 'camel' text names.
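
If you would rather do both steps at once, a single substitution can do it. This is only a sketch of an alternative expression, not something the script needs: it upper-cases any lowercase letter found at the start of the name or right after an underscore, and drops the underscore along the way. The helper name camel_name is made up for this example.

```perl
use strict;
use warnings;

# camel_name is a hypothetical helper showing a one-pass expression.
sub camel_name {
    my ($name) = @_;
    # (?:^|_) matches the start of the name or an underscore (which is dropped);
    # \U upper-cases the captured letter.
    $name =~ s/(?:^|_)([a-z])/\U$1/g;
    return $name;
}

print camel_name('some_file.txt'), "\n";   # prints "SomeFile.txt"
```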

Please make sure that you have your files backed up before you attempt a batch rename!

Script start [hint: don't copy this line ;-)]
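
Besides keeping a backup, it can help to preview what a batch rename would do. The following is a minimal sketch under my own assumptions (the helper name preview_renames and the 'ok'/'CONFLICT' labels are made up here). It never renames anything. It only reports each old and new name, and flags a new name that already exists on disk or that two files would both end up with.

```perl
use strict;
use warnings;

# preview_renames is a hypothetical helper; it only reports, it never renames.
sub preview_renames {
    my ($expression, @files) = @_;
    my (%seen, @plan);
    for my $old (@files) {
        local $_ = $old;
        eval $expression;          # apply the substitution to the name in $_
        die $@ if $@;
        my $new = $_;
        next if $new eq $old;      # nothing to do for this file
        # A conflict is a target name already planned or already on disk.
        my $status = ($seen{$new}++ || -e $new) ? 'CONFLICT' : 'ok';
        push @plan, [$old, $new, $status];
    }
    return @plan;
}

# Print one line per planned rename for the files given on the command line.
printf "%-8s %s -> %s\n", $_->[2], $_->[0], $_->[1]
    for preview_renames('s/_/ /g', @ARGV);
```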
#!/usr/bin/perl -w

use Getopt::Std;
use vars qw/ $opt_h $opt_u /;

# usage
sub usage {
    print STDOUT <<'END_OF_USAGE';

  Usage: rename [-hu] sub_regex [files]

                sub_regex is any expression that you would like to
                apply to filename. You can use capturing or just 

                -h .... shows this message ;-)
                -u .... makes first letter of each word uppercase


         perl "s/_/ /g" rename_me.txt
                   This renames rename_me.txt to "rename me.txt"

         perl -u "s/_/ /g" rename_me.txt
                   This renames rename_me.txt to "Rename Me.txt"

                3. MS Windows example
         for %v in (*.mp3) do perl -u "s/_/ /g" "%v"
                   This renames all mp3 files in the current directory such that
                   every word begins with a capital letter and all underscores
                   are replaced with spaces.


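END_OF_USAGE
}

# parse the -h and -u switches
getopts('hu') or (usage() and exit(1));

# show the usage message and stop if -h was given
if ($opt_h) {
    usage();
    exit(0);
}

# get the substitution regex
$op = shift or (usage() and exit(1));

# go through file names; read them from STDIN if none were given
chomp(@ARGV = <STDIN>) unless @ARGV;
for (@ARGV) {
    $was = $_;
    # apply the user's expression to the name held in $_
    eval $op;
    die $@ if $@;
    # cap first letter of each word
    my $newname = "";
    my @components = split(/ /, $_);
    foreach (@components) {
        my $x = $_;
        $x = ucfirst lc $x if $opt_u;
        if ($newname eq "") {
            $newname = "$x";
        } else {
            $newname .= " $x";
        }
    }
    rename($was, $newname) unless $was eq $newname;
}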

Script End [hint: don't copy this line ;-)]