Fixing minidom.toprettyxml’s Silly Whitespace
Python’s xml.dom.minidom.toprettyxml has a feature/flaw that renders it useless for many common applications.
Someone was kind enough to post a hack which works around the problem. That hack had a small bug, which I’ve fixed; you’ll find the revised code below.
UPDATED Forget the hack; I’ve found another, better solution. See below. And please leave a comment if you find these workarounds helpful, or if you come across a better solution.
The Problem
First, a short summary of the problem. (Other descriptions can be found here and here.) Feel free to jump ahead to all the workarounds, or straight to my solution of choice.
toprettyxml adds extra white space when printing the contents of text nodes. This may not sound like a serious drawback, but it is. Consider a simple xml snippet:
<Author>Ron Rothman</Author>
This Python script:
# python 2.4
import xml.dom.minidom as dom
myText = '''<Author>Ron Rothman</Author>'''
print dom.parseString(myText).toprettyxml()
generates this output:
<?xml version="1.0" ?>
<Author>
	Ron Rothman
</Author>
Note the extra line breaks: the text “Ron Rothman” is printed on its own line, and indented. That may not matter much to a human reading the output, but it sure as hell matters to an XML parser. (Recall: whitespace inside an element's text is significant in XML.)
To put it another way: the DOM object that represents the output (with line breaks) is NOT identical to the DOM object that represented the input.
Semantically, the author in the original XML is "Ron Rothman", but the author in the “pretty” XML is [approximately] " Ron Rothman ".
This is devastating news to anyone who hopes to re-parse the “pretty” XML in some other context. It means that you can’t use minidom.toprettyxml() to produce XML that will be parsed downstream.
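The difference is easy to check by re-parsing the “pretty” output shown above and looking at the text node. What follows is a minimal sketch, assuming the indent is a tab (toprettyxml's default); the variable and function names are made up for the illustration, and the pretty output is pasted in as a literal so the result does not depend on which minidom version you run.

```python
import xml.dom.minidom as dom

original = '<Author>Ron Rothman</Author>'
# The "pretty" output from above, pasted in as a literal
# (assumption: the indent is a single tab).
pretty = '<?xml version="1.0" ?>\n<Author>\n\tRon Rothman\n</Author>\n'

def author_text(xml_string):
    # Return the data of the text node inside the root element.
    return dom.parseString(xml_string).documentElement.firstChild.data

before = author_text(original)
after = author_text(pretty)

print(repr(before))
print(repr(after))
print(before == after)
```

The two values only match again if the consumer strips the text, which a downstream parser has no reason to do.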
Workarounds
UPDATED If you’re in a rush, skip ahead to the best solution, #4.
- normalize()
- calling toprettyxml with “creative” (non-default) parameters
1. Don’t use minidom
There are plenty of other XML packages to choose from.
But: minidom is appealing because it’s lightweight, and is included with the Python distribution. Seems a shame to toss it for just one flaw.
2. Use minidom, but don’t use toprettyxml()
Use minidom.toxml(), which doesn’t suffer from the same problem (because it doesn’t insert any whitespace).
But: Your machine-readable XML will make heads spin, should someone be foolish enough to try to read it.
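A quick way to convince yourself that toxml() leaves text nodes alone is to round-trip a document through it. This is a minimal sketch; the Book and Year elements are made up for the example.

```python
import xml.dom.minidom as dom

source = '<Book><Author>Ron Rothman</Author><Year>2008</Year></Book>'
doc = dom.parseString(source)

# Serialize only the root element, so the XML declaration is left out.
flat = doc.documentElement.toxml()

# Re-parse the serialized form and pull the author text back out.
reparsed = dom.parseString(flat)
author = reparsed.getElementsByTagName('Author')[0].firstChild.data

print(flat)
print(repr(author))
```

Everything comes out on a single line, which is exactly the readability problem, but the text survives untouched.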
3. Hack toprettyxml to do The Right Thing
Replace Element.writexml, the method that toprettyxml relies on, by using the code below.
But: It smells. Like a hack. Fragile; likely to break with future releases of minidom.
On the other hand: It’s not that bad. And hey, it does the trick. (But YMMV.)
import xml.dom.minidom

def fixed_writexml(self, writer, indent="", addindent="", newl=""):
    # indent = current indentation
    # addindent = indentation to add to higher levels
    # newl = newline string
    writer.write(indent + "<" + self.tagName)
    attrs = self._get_attributes()
    a_names = sorted(attrs.keys())
    for a_name in a_names:
        writer.write(" %s=\"" % a_name)
        xml.dom.minidom._write_data(writer, attrs[a_name].value)
        writer.write("\"")
    if self.childNodes:
        # special case: a lone text child is written on the same line
        # as its tags, so no whitespace gets added to the text
        if len(self.childNodes) == 1 \
                and self.childNodes[0].nodeType == xml.dom.minidom.Node.TEXT_NODE:
            writer.write(">")
            self.childNodes[0].writexml(writer, "", "", "")
            writer.write("</%s>%s" % (self.tagName, newl))
            return
        writer.write(">%s" % (newl))
        for node in self.childNodes:
            node.writexml(writer, indent + addindent, addindent, newl)
        writer.write("%s</%s>%s" % (indent, self.tagName, newl))
    else:
        writer.write("/>%s" % (newl))

# replace minidom's function with ours
xml.dom.minidom.Element.writexml = fixed_writexml
I just copied the original Element.writexml code from /usr/lib/python2.4/xml/dom/minidom.py and added the special case for an element whose only child is a text node. It ain’t pretty, but it seems to work. (Suggestions for improvements (I’m a Python n00b) are welcome.)
[Credit to Oluseyi at gamedev.net for the original hack; I just fixed it so that it worked with character entities.]
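If patching minidom feels too fragile, the same special case can also be written as a standalone function. This is a rough sketch of that idea, not the hack above. Here pretty_element is a made-up name, and attribute values are escaped with xml.sax.saxutils.quoteattr rather than minidom's private _write_data. Non-element nodes are handed to minidom's own toxml(). Like the hack, it still adds whitespace around the children of mixed-content elements.

```python
import xml.dom.minidom as minidom
from xml.sax.saxutils import quoteattr

def pretty_element(node, indent="", addindent="  ", newl="\n"):
    """Serialize a node, keeping text-only elements on one line."""
    if node.nodeType != minidom.Node.ELEMENT_NODE:
        # Text, comments, etc.: defer to minidom's own serializer.
        return indent + node.toxml() + newl
    # Attributes in sorted order, each value quoted and escaped.
    attrs = "".join(
        " %s=%s" % (name, quoteattr(value))
        for name, value in sorted(node.attributes.items())
    )
    open_tag = "%s<%s%s" % (indent, node.tagName, attrs)
    children = node.childNodes
    if not children:
        return open_tag + "/>" + newl
    if len(children) == 1 and children[0].nodeType == minidom.Node.TEXT_NODE:
        # The special case: no whitespace is added around a lone text child.
        return "%s>%s</%s>%s" % (open_tag, children[0].toxml(), node.tagName, newl)
    body = "".join(
        pretty_element(child, indent + addindent, addindent, newl)
        for child in children
    )
    return "%s>%s%s%s</%s>%s" % (open_tag, newl, body, indent, node.tagName, newl)

doc = minidom.parseString('<Book id="1"><Author>Ron Rothman</Author><Empty/></Book>')
output = pretty_element(doc.documentElement)
print(output)
```

Because nothing in minidom is replaced, this leaves toprettyxml itself alone; the cost is that you call the function yourself instead of the method.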
UPDATE! 4. Use xml.dom.ext.PrettyPrint
Who knew? All along, an alternative to toprettyxml was available to me. Works like a charm. Robust. 100% Kosher Python. Definitely the method I’ll be using.
But: Need to have PyXML installed. In my case, it was already installed, so this is my method of choice. (It’s worth pointing out that if you already have PyXML installed, you might want to consider using it exclusively, in lieu of minidom.)
We just write a simple wrapper, and we’re done:
from xml.dom.ext import PrettyPrint
from StringIO import StringIO

def toprettyxml_fixed(node, encoding='utf-8'):
    tmpStream = StringIO()
    PrettyPrint(node, stream=tmpStream, encoding=encoding)
    return tmpStream.getvalue()
Conclusion
One lesson from all this: TMTOWTDI applies to more than just Perl.
Please–let me know what you think.
Yikes, I just came across a better solution on comp.lang.python: use xml.dom.ext.PrettyPrint.
I’ve updated the main post to reflect this option. (See #4.)
I have found errors in the fixed_writexml function:
1. The number of %s placeholders in the format strings is wrong.
2. The closing tags are wrong: the function writes just the label, not a closing tag.
The corrected version is:
def fixed_writexml(self, writer, indent=”", addindent=”", newl=”"):
# indent = current indentation
# addindent = indentation to add to higher levels
# newl = newline string
writer.write(indent+”")
self.childNodes[0].writexml(writer, “”, “”, “” )
writer.write(”%s” % (self.tagName, newl))
return
writer.write(”>%s” % (newl))
for node in self.childNodes:
node.writexml(writer,indent+addindent,addindent,newl)
writer.write(”%s%s” % (indent,self.tagName,newl))
else:
writer.write(”/>%s”%(newl))
# replace minidom’s function with ours
xml.dom.minidom.Element.writexml = fixed_writexml
gmodella, thanks for your feedback. You’re right, the code as published wasn’t working, because Wordpress converted my “<” and “>” to HTML tag delimiters, instead of converting them to “&lt;” and “&gt;”. (Notice that it did the same thing in your comment; your ‘writer.write(”%s%s”…’ should contain a “</%s>”, but Wordpress stripped it in your comment as well.)
I’ve fixed it manually, and now Wordpress is rendering it correctly. Thank you for alerting me!
Hi, if you add a conditional inside the for loop, the extra newline characters are not added on subsequent calls to the function:
writer.write(">%s" % (newl))
for node in self.childNodes:
+    if node.nodeType != xml.dom.minidom.Node.TEXT_NODE:  # TEXT_NODE == 3
+        node.writexml(writer,indent+addindent,addindent,newl)
Hi, I have fixed another problem [in minidom]:
When you use writexml to write a nicely formatted, readable file, then read that file back in and call writexml on it again, you get a lot of ugly newline characters.
Adding a conditional inside the for loop removes the blank lines that contain only indentation and newline characters:
for node in self.childNodes:
+    if node.nodeType != xml.dom.minidom.Node.TEXT_NODE:  # TEXT_NODE == 3
+        node.writexml(writer,indent+addindent,addindent,newl)
-    node.writexml(writer,indent+addindent,addindent,newl)
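One catch with skipping every text node in that loop: in an element that mixes text and child elements, the real text is dropped along with the blank lines. Another way at the same problem is to remove only the whitespace-only text nodes from the tree before pretty-printing it again. This is a minimal sketch; strip_blank_text is a hypothetical helper, not part of minidom, and it assumes whitespace-only text is never meaningful in your documents.

```python
import xml.dom.minidom as minidom

def strip_blank_text(node):
    """Recursively remove text nodes that contain only whitespace."""
    # Iterate over a copy, because children are removed during the loop.
    for child in list(node.childNodes):
        if child.nodeType == minidom.Node.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
            child.unlink()
        else:
            strip_blank_text(child)

# A document that was already pretty-printed once and read back in.
doc = minidom.parseString('<a>\n  <b>x</b>\n  <c/>\n</a>')
strip_blank_text(doc)
compact = doc.documentElement.toxml()
print(compact)
```

After this pass the tree no longer carries the indentation from the previous round, so pretty-printing it again does not pile up blank lines.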