2006-01-23

SemanticXmlDiff failed on Unicode

I am using the CPAN module SemanticXmlDiff to compare two sets of XML files. It complains about not being able to handle a "wide character". After looking at the source code, it turns out the problem is the MD5 function used inside the module. The MD5 hash is used to speed up the comparison of text nodes in the XML files. However, MD5 operates on bytes, not characters, and the Perl MD5 function refuses any string that contains a character with a code point above 255, so it fails on wide characters.

The solution? According to the Digest::MD5 documentation, the text should be encoded to bytes before it is passed to the MD5 function. To apply this, I changed the following line from
$doc->{"$test_context"}->{TextChecksum} = md5_base64("$text");

to
$doc->{"$test_context"}->{TextChecksum} = md5_base64(encode_utf8("$text"));

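Now it works! Note that encode_utf8 comes from the Encode module, so the file also needs "use Encode qw(encode_utf8);" next to its other use statements if Encode is not already loaded there.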
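
For anyone who wants to reproduce this outside the module, here is a minimal standalone sketch. It is not taken from the SemanticXmlDiff source. The sample string and the variable names are my own assumptions, chosen only to trigger the error. It calls md5_base64 once on the raw character string (which should die with a "Wide character" error) and once on the UTF-8 encoded bytes.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_base64);
use Encode qw(encode_utf8);

# Hypothetical sample text: contains U+4E2D, a code point above 255.
my $text = "caf\x{e9} \x{4e2d}";

# Hashing the character string directly: Digest::MD5 croaks,
# so wrap the call in eval to capture the error instead of dying.
my $direct = eval { md5_base64($text) };
my $error  = $@;

# Encoding to UTF-8 bytes first gives MD5 plain bytes to work on.
my $checksum = md5_base64(encode_utf8($text));

print "direct call failed: $error" unless defined $direct;
print "checksum of encoded text: $checksum\n";
```

One thing to keep in mind: the checksum is now computed over the UTF-8 bytes, so it is only comparable with other checksums computed the same way. That is fine here, because both documents being compared go through the same line.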