Python Beautiful Soup gb2312 Windows-1252 Mojibake Issue

Back then, Python encoding issues made my life miserable, and I had long planned to write an article summarizing what I had learned. For now, I am organizing the two articles that helped me most at the time, along with the key conclusions, so that the next time I run into mojibake involving Beautiful Soup, gb2312, gbk, gb18030, or Windows-1252, I can locate the issue quickly.

Symptoms

When scraping Chinese web pages or RSS feeds, the page may declare itself as gb2312 or gbk, but the parsed Chinese text turns into mojibake. Sometimes Beautiful Soup, feedparser, or other libraries identify the encoding as windows-1252, making the Chinese content completely unreadable.

The common cause is not that the parser cannot handle gb2312, but that the page's actual bytes are not strictly valid gb2312. Many Chinese web pages mix gb2312 and gbk, or even declare gb2312 while the body already contains characters outside the gb2312 range. During strict detection, as soon as bytes are found that do not conform to gb2312, the parser may give up and fall back to windows-1252, which produces the mojibake.
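A minimal demonstration in Python 3: the character 镕 (as in 朱镕基) exists in gbk and gb18030 but not in strict gb2312, so a page declared as gb2312 that contains it will fail strict decoding:

# '镕' is in gbk/gb18030 but outside the gb2312 character set
raw = "朱镕基".encode("gbk")

print(raw.decode("gb18030"))  # works: gb18030 is a superset of gbk
try:
    raw.decode("gb2312")
except UnicodeDecodeError as e:
    print("strict gb2312 rejects it:", e)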

Key Conclusion

Note: the gb2312 advertised by many web pages is not necessarily strict gb2312. When you encounter this kind of page, try decoding it as gb18030 first.

gb18030 is backward-compatible with gbk and gb2312, and it covers a much wider range. On Windows, content labeled gb2312 has typically been handled as code page 936 (effectively gbk), which helps in some scenarios but also makes declared encodings and real content easy to mix up.

In other words, even if the page says:

<meta charset="gb2312">

its actual content may well need to be handled as gb18030.
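A quick sanity check of the superset relationship (a small sketch; any gb2312-safe string works here):

s = "中文"
# gb18030 decodes gb2312- and gbk-encoded bytes unchanged
assert s.encode("gb2312").decode("gb18030") == s
assert s.encode("gbk").decode("gb18030") == s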

Handling It in Beautiful Soup

If you are using the Python 2-era style, you can specify the encoding when passing the page to Beautiful Soup:

import urllib2
from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3 import style
page = urllib2.build_opener().open(req).read()  # req: a urllib2.Request for the page
soup = BeautifulSoup(page, fromEncoding="GB18030")

The point is: do not fully trust the gb2312 declared in the HTML. Instead, pass the raw bytes to the parser using the broader GB18030 encoding.

For newer versions of Beautiful Soup (bs4), the parameter is from_encoding:

from bs4 import BeautifulSoup

html = response.content  # raw bytes; avoid response.text here, which decodes too early
soup = BeautifulSoup(html, "html.parser", from_encoding="gb18030")
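Beautiful Soup records the encoding it ended up using in the original_encoding attribute, which is handy for confirming that from_encoding took effect:

print(soup.original_encoding)  # expect 'gb18030' here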

If you have already decoded the bytes manually, you can also convert them to a Unicode string first and then pass that to Beautiful Soup:

text = response.content.decode("gb18030", errors="replace")  # substitute U+FFFD for bad bytes
soup = BeautifulSoup(text, "html.parser")

errors="replace" replaces bytes that cannot be decoded, which is useful when scraping and trying to preserve as much body text as possible. During debugging, you can also omit it at first and let the exception surface, so you can confirm exactly which byte sequence is causing the problem.

Why It Gets Detected as Windows-1252

A typical example is Baidu News RSS: the page declares its encoding as gb2312, but the actual content may include characters from the gbk range. When feedparser checks strictly according to gb2312, it finds characters outside the gb2312 range, so it gives up on gb2312 and falls back to windows-1252. The Chinese bytes are then interpreted as a Western encoding and ultimately display as mojibake.
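You can reproduce the visual effect directly: take gbk bytes and decode them as windows-1252 (a toy example, not feedparser's actual code path):

raw = "中文新闻".encode("gbk")
print(raw.decode("windows-1252"))  # prints 'ÖÐÎÄÐÂÎÅ': the classic mojibake look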

So the issue is not necessarily that “the library is too dumb,” but that “the page’s declared encoding and its real bytes are inconsistent.”

Troubleshooting Steps

When you encounter mojibake on a Chinese web page, check in this order:

  1. Keep the raw bytes; do not .decode() too early.
  2. Check the Content-Type in the HTTP headers and the meta charset in the HTML (see the sketch after this list).
  3. If the declared encoding is gb2312 or gbk but parsing fails, try gb18030.
  4. Print repr() or write the content to a file for inspection; do not rely only on what the terminal displays.
  5. Confirm whether the terminal, editor, database connection, and output file itself are using UTF-8.
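For steps 1 and 2, a minimal sketch using the requests library (url is a placeholder for the problem page):

import re
import requests

resp = requests.get(url)
print(resp.headers.get("Content-Type"))  # HTTP-level declaration
m = re.search(rb'charset=["\']?([\w-]+)', resp.content[:2048], re.I)
print(m.group(1) if m else None)         # HTML-level declaration
print(resp.apparent_encoding)            # statistical guess made by requests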

Example for step 3, trying candidate encodings from strict to broad:

raw = response.content

# Try strict encodings first, then broader supersets, so a gb2312 failure
# followed by a gb18030 success is visible in the output.
for encoding in ["utf-8", "gb2312", "gbk", "gb18030"]:
    try:
        text = raw.decode(encoding)
        print("decoded as", encoding)
        break
    except UnicodeDecodeError:
        pass

If gb2312 fails while gb18030 succeeds, that basically indicates that the page’s declaration is not accurate enough.
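If you want a statistical second opinion, the chardet library (if installed) can guess the encoding from the raw bytes; treat its answer as a hint, not ground truth:

import chardet

raw = response.content
print(chardet.detect(raw))  # e.g. {'encoding': 'GB2312', 'confidence': 0.99, ...}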

References

  • Article 1: Beautiful Soup gb2312 mojibake issue
  • Article 2: [[Notes] In Python, you may already have the correct Unicode or encoded characters, but they look garbled or print as mojibake](https://www.crifan.com/python_already_got_correct_encoding_string_but_seems_print_messy_code/)

Summary

When handling mojibake on Chinese web pages, the most important thing is to distinguish between the “declared encoding” and the “real encoding.” Many old web pages say gb2312, but in practice they are closer to gbk or gb18030. For parsers such as Beautiful Soup and feedparser, directly specifying gb18030 is often more reliable than trusting the page declaration.
