Beautiful Soup tiny bug
Beautiful Soup is great for parsing random bits of crummy HTML. However, I think I’ve found a small bug, and I’m putting it up here in case anyone else comes across the same thing. If the HTML declares a charset of “windows-1252” (all lowercase) in its meta tag, the charset in that tag isn’t rewritten to utf-8 in the output, even though the content itself is converted to UTF-8. If you change the case of the encoding name, or pass the same encoding explicitly via fromEncoding, it’s fine. As far as I can tell, the meta charset then matches the detected encoding exactly, so the parser neither rewrites the tag nor makes a second pass through the document. I’ve put a short transcript below to show the problem. To fix the bug, apply the following patch to BeautifulSoup.py (currently version 3.0.5):
@@ -1505,25 +1505,26 @@
         if httpEquiv and contentType: # It's an interesting meta tag.
             match = self.CHARSET_RE.search(contentType)
             if match:
+                newCharset = match.group(3)
                 if getattr(self, 'declaredHTMLEncoding') or \
-                       (self.originalEncoding == self.fromEncoding):
+                       self.originalEncoding == self.fromEncoding or \
+                       self.originalEncoding.lower() == newCharset.lower():
                     # This is our second pass through the document, or
                     # else an encoding was specified explicitly and it
-                    # worked. Rewrite the meta tag.
+                    # worked, or we're already the encoding the meta tag
+                    # specifies. Rewrite the meta tag.
                     newAttr = self.CHARSET_RE.sub\
                               (lambda(match):match.group(1) +
                                "%SOUP-ENCODING%", value)
                     attrs[contentTypeIndex] = (attrs[contentTypeIndex][0],
                                                newAttr)
                     tagNeedsEncodingSubstitution = True
-                else:
+                elif newCharset:
                     # This is our first pass through the document.
                     # Go through it again with the new information.
-                    newCharset = match.group(3)
-                    if newCharset and newCharset != self.originalEncoding:
-                        self.declaredHTMLEncoding = newCharset
-                        self._feed(self.declaredHTMLEncoding)
-                        raise StopParsing
+                    self.declaredHTMLEncoding = newCharset
+                    self._feed(self.declaredHTMLEncoding)
+                    raise StopParsing
         tag = self.unknown_starttag("meta", attrs)
         if tag and tagNeedsEncodingSubstitution:
             tag.containsSubstitutions = True
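To make the change easier to follow, here is a rough standalone sketch of the decision the patch alters. It is my own paraphrase of the diff, not code from BeautifulSoup.py: the function names and the "rewrite" / "reparse" / "leave" labels are made up for illustration, and I'm assuming the detected encoding comes back as the lowercase string 'windows-1252', which is what the transcript suggests but which I haven't checked in the source.

```python
def meta_action_old(original, from_encoding, declared, meta_charset):
    """What the unpatched code appears to do with a charset meta tag.

    Hypothetical helper: the arguments stand in for originalEncoding,
    fromEncoding, declaredHTMLEncoding and the charset found in the tag.
    """
    if declared or original == from_encoding:
        return "rewrite"   # meta tag gets the %SOUP-ENCODING% placeholder
    if meta_charset and meta_charset != original:
        return "reparse"   # feed the document again with the new charset
    return "leave"         # falls through: the tag is emitted unchanged


def meta_action_patched(original, from_encoding, declared, meta_charset):
    """The same decision with the patch applied (assumes string inputs)."""
    if (declared or original == from_encoding
            or original.lower() == meta_charset.lower()):
        return "rewrite"
    if meta_charset:
        return "reparse"
    return "leave"


# Assumed detected encoding, taken from the transcript rather than the source.
detected = "windows-1252"

for charset in ("Windows-1252", "windows-1252"):
    print("%s: old=%s, patched=%s" % (
        charset,
        meta_action_old(detected, None, None, charset),
        meta_action_patched(detected, None, None, charset),
    ))
```

Under that assumption, the mixed-case spelling triggers a second pass in the old code (after which the tag is rewritten), while the lowercase spelling matches the detected encoding and falls through both branches. The patched version treats a case-insensitive match as a reason to rewrite the tag straight away.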
Transcript showing problem
$ python
Python 2.4.3 (#1, May 18 2006, 07:40:45)
[GCC 3.3.3 (cygwin special)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from BeautifulSoup import BeautifulSoup
>>> doc = """<html>
... <meta http-equiv="Content-type" content="text/html; charset=Windows-1252">
... Sacr\xe9 bleu!
... </html>"""
>>> print BeautifulSoup(doc).prettify()
<html>
 <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
 Sacré bleu!
</html>
>>> doc = """<html>
... <meta http-equiv="Content-type" content="text/html; charset=windows-1252">
... Sacr\xe9 bleu!
... </html>"""
>>> print BeautifulSoup(doc).prettify()
<html>
 <meta http-equiv="Content-type" content="text/html; charset=windows-1252" />
 Sacré bleu!
</html>
>>> doc = """<html>
... <meta http-equiv="Content-type" content="text/html; charset=windows-1252">
... Sacr\xe9 bleu!
... </html>"""
>>> print BeautifulSoup(doc, fromEncoding='windows-1252').prettify()
<html>
 <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
 Sacré bleu!
</html>
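Why the stale meta tag matters: the prettified output is UTF-8, so anything downstream that trusts the tag and decodes the bytes as windows-1252 will get mojibake. The snippet below is only an illustration of that mismatch using Python's built-in codecs; it doesn't involve Beautiful Soup at all, and the variable names are mine.

```python
# -*- coding: utf-8 -*-
# The text as Beautiful Soup holds it internally (Unicode).
text = u"Sacr\xe9 bleu!"

# prettify() emits UTF-8 by default, so this is what ends up in the output.
utf8_bytes = text.encode("utf-8")

# A consumer that believes the unchanged charset=windows-1252 meta tag
# would decode those UTF-8 bytes with the wrong codec.
misread = utf8_bytes.decode("windows-1252")

print(misread)  # the e-acute comes out as two characters: A-tilde, copyright sign
```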