Input Sanitization: Invalid XML Data, Validation

The article "INPUT SANITIZATION : INVALID XML DATA, VALIDATION" by Gaurav Thakur discusses how to handle invalid XML characters in input data, particularly when converting JSON to XML.

Key points:

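  • Problem: Data that is valid in JSON can be invalid in XML. For example, a JSON string may carry a control character such as U+0001 as an escape sequence, but XML 1.0 does not allow that character at all, so the input has to be validated or sanitized before conversion.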

  • Solution (Initial): Define a regex character class that matches any character outside the ranges XML 1.0 allows, and precompile it for reuse:

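    // Negated character class: matches any character NOT allowed by XML 1.0.
    private static final String xml10pattern = "[^"
            + "\u0009\r\n"                 // tab, carriage return, line feed
            + "\u0020-\uD7FF"              // space up to the start of the surrogate block
            + "\uE000-\uFFFD"              // rest of the BMP, excluding U+FFFE and U+FFFF
            + "\ud800\udc00-\udbff\udfff"  // U+10000 to U+10FFFF, written as surrogate pairs
            + "]";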
    
  • Validation Method: A Java method hasInValidXmlCharacterData is provided to check if an input string contains invalid XML characters using the regex. It returns the invalid character found or null if the input is valid.

  • Curl Request Issue: The author found that validation failed for data sent via curl requests, even though the same check worked when called from a main program or from JUnit tests.

  • Resolution: The author resolved the issue by unescaping the input payload before validation, using the StringEscapeUtils.unescapeJava method from Apache Commons Lang. The updated validation method includes this unescaping step.
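
The validation and unescaping steps above can be sketched as follows. This is a minimal illustration, not the article's code: the class name XmlInputValidator and the method names firstInvalidXmlChar, unescapeUnicode, and firstInvalidXmlCharAfterUnescape are hypothetical, and the small unescapeUnicode helper is only a JDK-only stand-in for StringEscapeUtils.unescapeJava (it decodes backslash-u escapes and nothing else). In a real project the library call would normally be used instead; note that StringEscapeUtils in Commons Lang has been deprecated since version 3.6 in favor of the class of the same name in Apache Commons Text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; a sketch of the approach described above.
final class XmlInputValidator {

    // Precompiled once: matches any single character outside the XML 1.0 allowed ranges.
    private static final Pattern INVALID_XML10 = Pattern.compile("[^"
            + "\u0009\r\n"
            + "\u0020-\uD7FF"
            + "\uE000-\uFFFD"
            + "\ud800\udc00-\udbff\udfff"
            + "]");

    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern UNICODE_ESCAPE =
            Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Returns the first invalid character as a String, or null if the input is valid.
    public static String firstInvalidXmlChar(String input) {
        Matcher m = INVALID_XML10.matcher(input);
        return m.find() ? m.group() : null;
    }

    // Simplified stand-in (assumption) for StringEscapeUtils.unescapeJava:
    // decodes only backslash-u escapes and leaves everything else untouched.
    public static String unescapeUnicode(String input) {
        Matcher m = UNICODE_ESCAPE.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    // Unescape first, then validate, mirroring the resolution described above.
    public static String firstInvalidXmlCharAfterUnescape(String payload) {
        return firstInvalidXmlChar(unescapeUnicode(payload));
    }

    public static void main(String[] args) {
        // The payload holds a literal backslash followed by u0001, as raw request text might.
        String payload = "{\"name\":\"a\\u0001b\"}";
        // The raw text contains only ordinary characters, so the check passes.
        System.out.println(firstInvalidXmlChar(payload) == null);              // prints true
        // After unescaping, the control character U+0001 is present and is reported.
        System.out.println(firstInvalidXmlCharAfterUnescape(payload) != null); // prints true
    }
}
```

The example payload shows one way the two call paths can disagree: a string literal in a main program or JUnit test is turned into real characters by the Java compiler, whereas text arriving in a request body can still hold the escape sequence as plain characters, which the regex accepts until it is unescaped.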
