Implementing IXmlWriter Part 6: Escaping Attribute Content

This is part 6 of my Implementing IXmlWriter post series.

Last time’s IXmlWriter has a serious bug: it doesn’t escape attribute values, so it can produce malformed XML.

Consider the following test case:

StringXmlWriter xmlWriter;

xmlWriter.WriteStartElement("root");
  xmlWriter.WriteStartElement("element");
    xmlWriter.WriteAttributeString("att", "\"");
  xmlWriter.WriteEndElement();
xmlWriter.WriteEndElement();

std::string strXML = xmlWriter.GetXmlString();

The previous version of IXmlWriter will generate the XML string <root><element att="""/></root>, which is not well-formed and will be rejected by an XML parser. The rules for XML attribute escaping are given by Section 2.3 of the XML 1.0 spec, specifically the AttValue literal:

[10] AttValue ::= '"' ([^<&"] | Reference)* '"'
                  | "'" ([^<&'] | Reference)* "'"

This Backus-Naur-form-like construct says that attribute values can be enclosed in either single or double quotes, and that the characters <, &, and the respective quotation character cannot appear literally between these quotes. We can, however, insert escaped versions of & and the quotation character; a literal < is ruled out by the well-formedness constraint No < in Attribute Values (thanks, dbt). As we always enclose attribute values in double quotes, we only need to worry about escaping the " character and not the ' character. Let’s construct a test case:

StringXmlWriter xmlWriter;

xmlWriter.WriteStartElement("root");
  xmlWriter.WriteStartElement("element");
    xmlWriter.WriteAttributeString("att", "\"&");
  xmlWriter.WriteEndElement();
xmlWriter.WriteEndElement();

std::string strXML = xmlWriter.GetXmlString();
// strXML should be <root><element att="&quot;&amp;"/></root>
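
To make the quoting rule concrete, here is a minimal sketch of escaping for either delimiter. EscapeAttributeValue is a hypothetical helper, not part of the series' StringXmlWriter, and it assumes the delimiter is always either " or '. It does not handle a literal <, which this post treats separately.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the series): escapes an attribute value for
// the given delimiter, which is assumed to be either '"' or '\''.
std::string EscapeAttributeValue(const std::string& value, char quote)
{
    std::string escaped;
    for (std::string::const_iterator it = value.begin(); it != value.end(); ++it) {
        if (*it == '&') {
            escaped += "&amp;";
        } else if (*it == quote && quote == '"') {
            escaped += "&quot;";
        } else if (*it == quote && quote == '\'') {
            escaped += "&apos;";
        } else {
            // Everything else is copied as-is, including the quote character
            // that is NOT being used as the delimiter.
            escaped += *it;
        }
    }
    return escaped;
}

// Exercises both delimiters with the same input: "&'
void TestEscapeAttributeValue()
{
    // Double-quoted value: " is escaped, ' passes through.
    assert(EscapeAttributeValue("\"&'", '"') == "&quot;&amp;'");
    // Single-quoted value: ' is escaped, " passes through.
    assert(EscapeAttributeValue("\"&'", '\'') == "\"&amp;&apos;");
}
```

Since the writer in this series always emits double quotes, only the first branch matters for it, which is why the ' character needs no entry in its translation table.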

Note that we are now required to perform escaping (albeit with different characters) in two separate functions: WriteString() and WriteAttributeString(). This is a prime candidate for refactoring: we can separate the escaping code into its own function, and we can make such large changes with confidence because we have a test suite to verify that the changed code is correct. Here’s the new code:

#include <map>
#include <stack>
#include <string>

typedef std::map<char, std::string> translations_t;

std::string TranslateString
    (
    const std::string& value,
    const translations_t& translations
    )
{
    std::string str;
    for (std::string::const_iterator stringIter = value.begin();
         stringIter != value.end();
         ++stringIter) {
        translations_t::const_iterator mapIter = translations.find(*stringIter);
        if (mapIter != translations.end()) {
            str += mapIter->second;
        } else {
            str += *stringIter;
        }
    }

    return str;
}

class StringXmlWriter
{
private:
    std::stack<std::string> m_openedElements;
    std::string m_xmlStr;
    bool m_unclosedStartElement;
    // Translations used in character data
    translations_t m_charDataTranslations;
    // Translations used in attribute values
    translations_t m_attributeTranslations;

public:
    StringXmlWriter() : m_unclosedStartElement(false)
    {
        m_charDataTranslations['&'] = "&amp;";
        m_charDataTranslations['<'] = "&lt;";
        m_charDataTranslations['>'] = "&gt;";

        m_attributeTranslations['&'] = "&amp;";
        m_attributeTranslations['"'] = "&quot;";
    }

    void WriteStartElement(const std::string& localName)
    {
        if (m_unclosedStartElement) {
            m_xmlStr += '>';
            m_unclosedStartElement = false;
        }

        m_openedElements.push(localName);
        m_xmlStr += '<';
        m_xmlStr += localName;
        m_unclosedStartElement = true;
    }

    void WriteEndElement()
    {
        if (m_unclosedStartElement) {
            m_xmlStr += "/>";
            m_unclosedStartElement = false;
        } else {
            std::string lastOpenedElement = m_openedElements.top();
            m_xmlStr += "</";
            m_xmlStr += lastOpenedElement;
            m_xmlStr += '>';
        }
        m_openedElements.pop();
    }

    void WriteString(const std::string& value)
    {
        if (m_unclosedStartElement) {
            m_xmlStr += '>';
            m_unclosedStartElement = false;
        }

        m_xmlStr += TranslateString(value, m_charDataTranslations);
    }

    void WriteElementString(const std::string& localName,
                            const std::string& value)
    {
        WriteStartElement(localName);
        WriteString(value);
        WriteEndElement();
    }

    void WriteAttributeString(const std::string& localName,
                              const std::string& value)
    {
        m_xmlStr += ' ';
        m_xmlStr += localName;
        m_xmlStr += "=\"";
        m_xmlStr += TranslateString(value, m_attributeTranslations);
        m_xmlStr += '"';
    }

    std::string GetXmlString() const
    {
        return m_xmlStr;
    }
};
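
As a quick illustration of how the two tables differ, the sketch below runs the same input through both. TranslateString() is repeated from the code above so that the snippet compiles on its own; MakeCharDataTranslations() and MakeAttributeTranslations() are hypothetical helpers that only mirror what the constructor sets up.

```cpp
#include <cassert>
#include <map>
#include <string>

typedef std::map<char, std::string> translations_t;

// Same algorithm as TranslateString() above, repeated for self-containment.
std::string TranslateString(const std::string& value,
                            const translations_t& translations)
{
    std::string str;
    for (std::string::const_iterator it = value.begin(); it != value.end(); ++it) {
        translations_t::const_iterator mapIter = translations.find(*it);
        if (mapIter != translations.end()) {
            str += mapIter->second;
        } else {
            str += *it;
        }
    }
    return str;
}

// Hypothetical helper: the character data table from the constructor.
translations_t MakeCharDataTranslations()
{
    translations_t t;
    t['&'] = "&amp;";
    t['<'] = "&lt;";
    t['>'] = "&gt;";
    return t;
}

// Hypothetical helper: the attribute value table from the constructor.
translations_t MakeAttributeTranslations()
{
    translations_t t;
    t['&'] = "&amp;";
    t['"'] = "&quot;";
    return t;
}

void TestTranslateString()
{
    const std::string input = "1 > 0 & \"x\"";
    // In character data, > and & are escaped; " is left alone.
    assert(TranslateString(input, MakeCharDataTranslations())
           == "1 &gt; 0 &amp; \"x\"");
    // In an attribute value, & and " are escaped; > is left alone.
    assert(TranslateString(input, MakeAttributeTranslations())
           == "1 > 0 &amp; &quot;x&quot;");
}
```

Keeping the tables as data means a new context only needs a new map, not a new loop.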

Because a literal < character cannot appear in an attribute value, we should explicitly forbid this value in WriteAttributeString(). I will address this when I get to error handling in a future post. In the meantime, keep this constraint in mind when you design your XML schemas!
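
Until that post exists, one possible shape for such a check is sketched below. CheckAttributeValue and IsRejected are hypothetical names, and throwing std::invalid_argument is only an assumption of this sketch; the error handling eventually chosen for the series may look quite different.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical check (not from the series): throws if the attribute value
// contains a literal '<'.
void CheckAttributeValue(const std::string& value)
{
    if (value.find('<') != std::string::npos) {
        throw std::invalid_argument("attribute value must not contain '<'");
    }
}

// Returns true if CheckAttributeValue() rejects the given value.
bool IsRejected(const std::string& value)
{
    try {
        CheckAttributeValue(value);
    } catch (const std::invalid_argument&) {
        return true;
    }
    return false;
}

void TestCheckAttributeValue()
{
    assert(IsRejected("a<b"));     // contains a literal <
    assert(!IsRejected("a&b"));    // & is fine here; it is escaped later
    assert(!IsRejected("\"x\""));  // so is "
}
```

A writer could call such a check at the top of WriteAttributeString(), before anything is appended to the output string.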

Implementing IXmlWriter Part 5: Supporting WriteAttributeString()

This is part 5 of my Implementing IXmlWriter post series.

Today I will add support for writing attributes to yesterday’s version of IXmlWriter.

To learn more about attributes, see the W3C XML 1.0 Recommendation. Writing attributes will be supported with the function WriteAttributeString().

Here’s today’s test case:

StringXmlWriter xmlWriter;

xmlWriter.WriteStartElement("root");
  xmlWriter.WriteStartElement("element");
    xmlWriter.WriteAttributeString("att", "value");
  xmlWriter.WriteEndElement();
xmlWriter.WriteEndElement();

std::string strXML = xmlWriter.GetXmlString();
// strXML should be <root><element att="value"/></root>

Because the changes in Implementing IXmlWriter Part 4 keep start elements unclosed until a function that requires them to be closed is called (e.g. WriteString() or WriteEndElement()), adding support for writing attributes is very simple. Here’s the version I came up with to pass all test cases:

#include <stack>
#include <string>

class StringXmlWriter
{
private:
    std::stack<std::string> m_openedElements;
    std::string m_xmlStr;
    bool m_unclosedStartElement;

public:
    StringXmlWriter() : m_unclosedStartElement(false) {}

    void WriteStartElement(const std::string& localName)
    {
        if (m_unclosedStartElement) {
            m_xmlStr += '>';
            m_unclosedStartElement = false;
        }

        m_openedElements.push(localName);
        m_xmlStr += '<';
        m_xmlStr += localName;
        m_unclosedStartElement = true;
    }

    void WriteEndElement()
    {
        if (m_unclosedStartElement) {
            m_xmlStr += "/>";
            m_unclosedStartElement = false;
        } else {
            std::string lastOpenedElement = m_openedElements.top();
            m_xmlStr += "</";
            m_xmlStr += lastOpenedElement;
            m_xmlStr += '>';
        }
        m_openedElements.pop();
    }

    void WriteString(const std::string& value)
    {
        if (m_unclosedStartElement) {
            m_xmlStr += '>';
            m_unclosedStartElement = false;
        }

        typedef std::string::const_iterator iter_t;
        for (iter_t iter = value.begin(); iter != value.end(); ++iter) {
            if (*iter == '&') {
                m_xmlStr += "&amp;";
            } else if (*iter == '<') {
                m_xmlStr += "&lt;";
            } else if (*iter == '>') {
                m_xmlStr += "&gt;";
            } else {
                m_xmlStr += *iter;
            }
        }
    }

    void WriteElementString(const std::string& localName,
                            const std::string& value)
    {
        WriteStartElement(localName);
        WriteString(value);
        WriteEndElement();
    }

    void WriteAttributeString(const std::string& localName,
                              const std::string& value)
    {
        m_xmlStr += ' ';
        m_xmlStr += localName;
        m_xmlStr += "=\"";
        m_xmlStr += value;
        m_xmlStr += '"';
    }

    std::string GetXmlString() const
    {
        return m_xmlStr;
    }
};
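
Note that attributes are only meaningful while the start tag is still open. The rough sketch below makes that ordering dependency visible with a trimmed-down stand-in called MiniXmlWriter (a hypothetical name; it does no escaping). Its WriteAttributeString() returns false instead of writing when no start tag is open, which is an assumption of this sketch and not something the version above does.

```cpp
#include <cassert>
#include <stack>
#include <string>

// Hypothetical, trimmed-down stand-in for StringXmlWriter (no escaping).
class MiniXmlWriter
{
private:
    std::stack<std::string> m_openedElements;
    std::string m_xmlStr;
    bool m_unclosedStartElement;

    // Writes the pending '>' if a start tag is still open.
    void CloseStartElement()
    {
        if (m_unclosedStartElement) {
            m_xmlStr += '>';
            m_unclosedStartElement = false;
        }
    }

public:
    MiniXmlWriter() : m_unclosedStartElement(false) {}

    void WriteStartElement(const std::string& localName)
    {
        CloseStartElement();
        m_openedElements.push(localName);
        m_xmlStr += '<';
        m_xmlStr += localName;
        m_unclosedStartElement = true;
    }

    // Returns false (and writes nothing) when no start tag is open.
    bool WriteAttributeString(const std::string& localName,
                              const std::string& value)
    {
        if (!m_unclosedStartElement) {
            return false;
        }
        m_xmlStr += ' ';
        m_xmlStr += localName;
        m_xmlStr += "=\"";
        m_xmlStr += value;
        m_xmlStr += '"';
        return true;
    }

    void WriteString(const std::string& value)
    {
        CloseStartElement();
        m_xmlStr += value;
    }

    void WriteEndElement()
    {
        if (m_unclosedStartElement) {
            m_xmlStr += "/>";
            m_unclosedStartElement = false;
        } else {
            m_xmlStr += "</";
            m_xmlStr += m_openedElements.top();
            m_xmlStr += '>';
        }
        m_openedElements.pop();
    }

    std::string GetXmlString() const { return m_xmlStr; }
};

void TestAttributeOrdering()
{
    MiniXmlWriter w;
    w.WriteStartElement("element");
    const bool first = w.WriteAttributeString("a", "1");   // start tag open
    const bool second = w.WriteAttributeString("b", "2");  // still open
    w.WriteString("text");                                 // closes the start tag
    const bool late = w.WriteAttributeString("c", "3");    // too late
    w.WriteEndElement();

    assert(first && second);
    assert(!late);
    assert(w.GetXmlString() == "<element a=\"1\" b=\"2\">text</element>");
}
```

Without the guard, the late call would append c="3" after the text, inside the element content, which is the kind of misuse that error handling would eventually need to catch.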

Implementing IXmlWriter Part 4: Collapsing Empty Elements

This is part 4 of my Implementing IXmlWriter post series.

One of the enhancements that XML introduced over SGML was a shorthand for an element with no content: a trailing slash at the end of the start tag. For example, <br/> is equivalent to <br></br>. Let’s add this functionality to the previous version of IXmlWriter.

Here’s the test case:

StringXmlWriter xmlWriter;

xmlWriter.WriteStartElement("root");
  xmlWriter.WriteStartElement("emptyElement");
  xmlWriter.WriteEndElement();
xmlWriter.WriteEndElement();

std::string strXML = xmlWriter.GetXmlString();
// strXML should be <root><emptyElement/></root>

How does this affect our previous implementation?

  1. We clearly cannot write the > character to close the start element in WriteStartElement().
  2. WriteEndElement() needs to be able to detect if a still-opened start element was written and if so, to write the /> sequence. Otherwise, it needs to write the full end element string.
  3. Because the > character will not be written in WriteStartElement(), we also need to worry about closing the start element in virtually every function.

The simplest way to implement this feature that I can think of is to keep an extra bool which remembers whether an unclosed start element has been written, and to handle this case in all relevant functions. Because we have all the previous test cases, I simply made the relevant changes to WriteStartElement() and then kept running the test cases, adding code as necessary to fix failures. I ended up with the following implementation:

#include <stack>
#include <string>

class StringXmlWriter
{
private:
    std::stack<std::string> m_openedElements;
    std::string m_xmlStr;
    bool m_unclosedStartElement;

public:
    StringXmlWriter() : m_unclosedStartElement(false) {}

    void WriteStartElement(const std::string& localName)
    {
        if (m_unclosedStartElement) {
            m_xmlStr += '>';
            m_unclosedStartElement = false;
        }

        m_openedElements.push(localName);
        m_xmlStr += '<';
        m_xmlStr += localName;
        m_unclosedStartElement = true;
    }

    void WriteEndElement()
    {
        if (m_unclosedStartElement) {
            m_xmlStr += "/>";
            m_unclosedStartElement = false;
        } else {
            std::string lastOpenedElement = m_openedElements.top();
            m_xmlStr += "</";
            m_xmlStr += lastOpenedElement;
            m_xmlStr += '>';
        }
        m_openedElements.pop();
    }

    void WriteString(const std::string& value)
    {
        if (m_unclosedStartElement) {
            m_xmlStr += '>';
            m_unclosedStartElement = false;
        }

        typedef std::string::const_iterator iter_t;
        for (iter_t iter = value.begin(); iter != value.end(); ++iter) {
            if (*iter == '&') {
                m_xmlStr += "&amp;";
            } else if (*iter == '<') {
                m_xmlStr += "&lt;";
            } else if (*iter == '>') {
                m_xmlStr += "&gt;";
            } else {
                m_xmlStr += *iter;
            }
        }
    }

    void WriteElementString(const std::string& localName,
                            const std::string& value)
    {
        WriteStartElement(localName);
        WriteString(value);
        WriteEndElement();
    }

    std::string GetXmlString() const
    {
        return m_xmlStr;
    }
};

Remember, although the bool-based approach may be neither the most elegant nor the eventual long-term solution, the idea with test-driven development is to write the simplest code possible to pass the existing test cases. If future test cases show the need, I will freely change this implementation detail, as the growing test suite allows me to make changes with great confidence.

Implementing IXmlWriter Part 3: Supporting WriteElementString()

This is part 3 of my Implementing IXmlWriter post series.

Today’s addition to the previous iteration of IXmlWriter is quite trivial: supporting the WriteElementString() method.

Here’s the test case:

StringXmlWriter xmlWriter;

xmlWriter.WriteStartElement("root");
  xmlWriter.WriteElementString("element", "value");
xmlWriter.WriteEndElement();

std::string strXML = xmlWriter.GetXmlString();
// strXML should be <root><element>value</element></root>

Implementation is extremely simple because WriteElementString() is nothing but a convenience method which calls WriteStartElement(), WriteString(), and WriteEndElement(). Therefore, here’s the new StringXmlWriter:

#include <stack>
#include <string>

class StringXmlWriter
{
private:
    std::stack<std::string> m_openedElements;
    std::string m_xmlStr;

public:
    void WriteStartElement(const std::string& localName)
    {
        m_openedElements.push(localName);
        m_xmlStr += '<';
        m_xmlStr += localName;
        m_xmlStr += '>';
    }

    void WriteEndElement()
    {
        std::string lastOpenedElement = m_openedElements.top();
        m_xmlStr += "</";
        m_xmlStr += lastOpenedElement;
        m_xmlStr += '>';
        m_openedElements.pop();
    }

    void WriteString(const std::string& value)
    {
        typedef std::string::const_iterator iter_t;
        for (iter_t iter = value.begin(); iter != value.end(); ++iter) {
            if (*iter == '&') {
                m_xmlStr += "&amp;";
            } else if (*iter == '<') {
                m_xmlStr += "&lt;";
            } else if (*iter == '>') {
                m_xmlStr += "&gt;";
            } else {
                m_xmlStr += *iter;
            }
        }
    }

    void WriteElementString(const std::string& localName,
                            const std::string& value)
    {
        WriteStartElement(localName);
        WriteString(value);
        WriteEndElement();
    }

    std::string GetXmlString() const
    {
        return m_xmlStr;
    }
};

Implementing IXmlWriter Part 2: Escaping Element Content

This is part 2 of my Implementing IXmlWriter post series.

In the previous post of this series, we ended up with a simple class which could write XML elements and element content to a std::string. However, this code has a common, serious problem that was mentioned in my post Don’t Form XML Using String Concatenation: it doesn’t properly escape XML special characters such as & and <. This means that if you call WriteString() with one of these characters, the generated XML will not be well-formed and an XML parser will reject it.

The rules for XML element value escaping are given by Section 2.4 of the W3C XML 1.0 Recommendation—specifically, by the following passage:

The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings “&amp;” and “&lt;” respectively. The right angle bracket (>) MAY be represented using the string “&gt;”, and MUST, for compatibility, be escaped using either “&gt;” or a character reference when it appears in the string “]]>” in content, when that string is not marking the end of a CDATA section.

For simplicity, I will choose to always escape > with &gt;. As we are using test-driven development, we must first write a test case:

StringXmlWriter xmlWriter;

xmlWriter.WriteStartElement("root");
  xmlWriter.WriteStartElement("element");
    xmlWriter.WriteString("&<>");
  xmlWriter.WriteEndElement();
xmlWriter.WriteEndElement();

std::string strXML = xmlWriter.GetXmlString();
// strXML should be <root><element>&amp;&lt;&gt;</element></root>
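
For comparison, the more permissive reading of the passage quoted above (escape > only where it would complete the sequence ]]>) can be sketched as follows. EscapeCharDataMinimal is a hypothetical name; the series itself always escapes >. The sketch uses indexing, not iterators, because it has to look back two characters, and it only sees one string at a time, so a ]]> split across two separate calls would go unnoticed.

```cpp
#include <cassert>
#include <string>

// Hypothetical alternative (not the series' approach): always escapes & and <,
// but escapes > only when it directly follows "]]".
std::string EscapeCharDataMinimal(const std::string& value)
{
    std::string escaped;
    for (std::string::size_type i = 0; i < value.size(); ++i) {
        const char c = value[i];
        if (c == '&') {
            escaped += "&amp;";
        } else if (c == '<') {
            escaped += "&lt;";
        } else if (c == '>' && i >= 2 &&
                   value[i - 1] == ']' && value[i - 2] == ']') {
            escaped += "&gt;";
        } else {
            escaped += c;
        }
    }
    return escaped;
}

void TestEscapeCharDataMinimal()
{
    // A lone > is left as-is.
    assert(EscapeCharDataMinimal("a > b") == "a > b");
    // The > that completes ]]> is escaped.
    assert(EscapeCharDataMinimal("x]]>y") == "x]]&gt;y");
    // & and < are always escaped.
    assert(EscapeCharDataMinimal("&<>") == "&amp;&lt;>");
}
```

Always escaping > costs a few extra characters of output but avoids both the lookback and the split-call caveat, which is one reason the simpler rule is attractive.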

Note how the previous version of StringXmlWriter fails this test case because it generates the invalid XML string <root><element>&<></element></root>. The changes to StringXmlWriter are fairly straightforward (note how I am following the advice from my post Prefer Iteration To Indexing):

#include <stack>
#include <string>

class StringXmlWriter
{
private:
    std::stack<std::string> m_openedElements;
    std::string m_xmlStr;

public:
    void WriteStartElement(const std::string& localName)
    {
        m_openedElements.push(localName);
        m_xmlStr += '<';
        m_xmlStr += localName;
        m_xmlStr += '>';
    }

    void WriteEndElement()
    {
        std::string lastOpenedElement = m_openedElements.top();
        m_xmlStr += "</";
        m_xmlStr += lastOpenedElement;
        m_xmlStr += '>';
        m_openedElements.pop();
    }

    void WriteString(const std::string& value)
    {
        // Escape the markup-significant characters in element content.
        typedef std::string::const_iterator iter_t;
        for (iter_t iter = value.begin(); iter != value.end(); ++iter) {
            if (*iter == '&') {
                m_xmlStr += "&amp;";
            } else if (*iter == '<') {
                m_xmlStr += "&lt;";
            } else if (*iter == '>') {
                m_xmlStr += "&gt;";
            } else {
                m_xmlStr += *iter;
            }
        }
    }

    std::string GetXmlString() const
    {
        return m_xmlStr;
    }
};