Beautiful Soup tiny bug

Beautiful Soup is great for parsing random bits of crummy HTML. However, I think I’ve found a small bug, and I’m putting it up here in case anyone else comes across the same thing. If the HTML declares a charset of “windows-1252” (all lowercase) in its meta tag, the meta tag isn’t rewritten to utf-8, even though the content itself is converted. If you change the case of the declared encoding, or pass the same encoding explicitly with fromEncoding, it’s fine. I’ve put a short transcript below to show the problem.

To fix the bug, apply the following patch to BeautifulSoup.py (currently version 3.0.5):

@@ -1505,25 +1505,26 @@
         if httpEquiv and contentType: # It's an interesting meta tag.
             match = self.CHARSET_RE.search(contentType)
             if match:
+                newCharset = match.group(3)
                 if getattr(self, 'declaredHTMLEncoding') or \
-                       (self.originalEncoding == self.fromEncoding):
+                       self.originalEncoding == self.fromEncoding or \
+                       self.originalEncoding.lower() == newCharset.lower():
                     # This is our second pass through the document, or
                     # else an encoding was specified explicitly and it
-                    # worked. Rewrite the meta tag.
+                    # worked, or we're already the encoding the meta tag
+                    # specifies. Rewrite the meta tag.
                     newAttr = self.CHARSET_RE.sub\
                               (lambda(match):match.group(1) +
                                "%SOUP-ENCODING%", value)
                     attrs[contentTypeIndex] = (attrs[contentTypeIndex][0],
                                                newAttr)
                     tagNeedsEncodingSubstitution = True
-                else:
+                elif newCharset:
                     # This is our first pass through the document.
                     # Go through it again with the new information.
-                    newCharset = match.group(3)
-                    if newCharset and newCharset != self.originalEncoding:
-                        self.declaredHTMLEncoding = newCharset
-                        self._feed(self.declaredHTMLEncoding)
-                        raise StopParsing
+                    self.declaredHTMLEncoding = newCharset
+                    self._feed(self.declaredHTMLEncoding)
+                    raise StopParsing
         tag = self.unknown_starttag("meta", attrs)
         if tag and tagNeedsEncodingSubstitution:
             tag.containsSubstitutions = True
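
To make the change in the patch easier to follow, here is a rough standalone model of the decision it touches, written in present-day Python rather than as part of BeautifulSoup.py. The function names and the three return labels are made up for illustration, and the sketch assumes the detected encoding is the string "windows-1252", which is what the transcript suggests but which I have not confirmed inside the library.

```python
def old_action(original, from_encoding, declared, new_charset):
    """Model of the unpatched branch (hypothetical helper, not library code)."""
    if declared or original == from_encoding:
        return "rewrite"   # meta tag gets the %SOUP-ENCODING% placeholder
    if new_charset and new_charset != original:
        return "reparse"   # feed the document again with the declared charset
    return "nothing"       # falls through: meta tag is left as it was


def new_action(original, from_encoding, declared, new_charset):
    """Model of the patched branch (hypothetical helper, not library code)."""
    if (declared
            or original == from_encoding
            or original.lower() == new_charset.lower()):
        return "rewrite"
    if new_charset:
        return "reparse"
    return "nothing"


# Assumed detected encoding for the documents in the transcript.
detected = "windows-1252"

for label, from_enc, charset in [
    ("mixed-case meta", None, "Windows-1252"),
    ("lowercase meta", None, "windows-1252"),
    ("lowercase meta + fromEncoding", "windows-1252", "windows-1252"),
]:
    print(label, "| old:", old_action(detected, from_enc, None, charset),
          "| new:", new_action(detected, from_enc, None, charset))
```

Under that assumption, only the lowercase case without fromEncoding ends up as "nothing" in the old model, which matches the one transcript entry where the meta tag keeps its original charset.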

Transcript showing problem

$ python
Python 2.4.3 (#1, May 18 2006, 07:40:45)
[GCC 3.3.3 (cygwin special)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from BeautifulSoup import BeautifulSoup
>>> doc = """<html>
... <meta http-equiv="Content-type" content="text/html; charset=Windows-1252">
... Sacr\xe9 bleu!
... </html>"""
>>> print BeautifulSoup(doc).prettify()
<html>
 <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
 Sacré bleu!
</html>
>>> doc = """<html>
... <meta http-equiv="Content-type" content="text/html; charset=windows-1252">
... Sacr\xe9 bleu!
... </html>"""
>>> print BeautifulSoup(doc).prettify()
<html>
 <meta http-equiv="Content-type" content="text/html; charset=windows-1252" />
 Sacré bleu!
</html>
>>> doc = """<html>
... <meta http-equiv="Content-type" content="text/html; charset=windows-1252">
... Sacr\xe9 bleu!
... </html>"""
>>> print BeautifulSoup(doc, fromEncoding='windows-1252').prettify()
<html>
 <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
 Sacré bleu!
</html>
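
Why the stale meta tag matters: the output bytes are UTF-8, so anything that trusts the leftover charset=windows-1252 declaration will decode them wrongly. The snippet below is a small sketch of that effect using only the standard codecs, in present-day Python rather than the 2.4 session above; the variable names are mine and it does not call Beautiful Soup at all.

```python
# The original document bytes: 0xE9 is "é" in windows-1252.
raw = b"Sacr\xe9 bleu!"

# What the parser effectively does to the content: decode, then emit UTF-8.
text = raw.decode("windows-1252")
utf8_bytes = text.encode("utf-8")

# What a consumer sees if it believes the unrewritten meta tag
# and decodes the UTF-8 output as windows-1252.
misread = utf8_bytes.decode("windows-1252")

print(repr(utf8_bytes))  # b'Sacr\xc3\xa9 bleu!'
print(misread)           # SacrÃ© bleu!
```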