Visualization with Chinese in XGBoost and RandomForest

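When feature names or class labels contain Chinese characters, tree visualizations produced by XGBoost, scikit-learn, and Graphviz may render them as boxes or missing glyphs. The usual fix is to set a font that supports Chinese before Graphviz renders the image.

The examples below use FangSong. You can replace it with any Chinese-capable font installed on your system, such as SimHei, Microsoft YaHei, Noto Sans CJK SC, or Source Han Sans SC.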

XGBoost

xgb.to_graphviz() returns a Graphviz object. Before rendering it, add a default node font to its Graphviz (DOT) source.

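import graphviz
import xgboost as xgb


def set_graph_font(graph, font_name="FangSong"):
    """Return a copy of the graph whose nodes default to font_name."""
    node_defaults = f'node [fontname="{font_name}" shape=plaintext]'
    # Insert the node defaults right after the graph's opening brace.
    source = graph.source.replace("{", "{\n    " + node_defaults, 1)
    # Build a new Source instead of assigning to graph.source, which may be
    # read-only depending on the graphviz package version.
    return graphviz.Source(source, format="png")


# `model` is a trained XGBoost model; num_trees=9 selects the tree at index 9.
graph = xgb.to_graphviz(model, num_trees=9)
graph = set_graph_font(graph)
graph  # as the last expression in a notebook cell, this displays the tree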
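
As an alternative to editing the DOT source, the font can be requested when the graph is built. The sketch below assumes an XGBoost version whose to_graphviz() accepts condition_node_params and leaf_node_params. The helper names font_node_params and to_graphviz_with_font are made up for this example, and the box shape is only an illustrative choice.

```python
def font_node_params(font_name="FangSong"):
    """Build Graphviz node attributes that request a Chinese-capable font.

    Hypothetical helper: returns one dict for split nodes and one for leaves.
    """
    condition_params = {"shape": "box", "fontname": font_name}
    leaf_params = {"shape": "box", "fontname": font_name}
    return condition_params, leaf_params


def to_graphviz_with_font(model, tree_index=0, font_name="FangSong"):
    """Build the graph for one tree of a trained XGBoost model."""
    # Imported here so the helper above can be used without XGBoost installed.
    import xgboost as xgb

    condition_params, leaf_params = font_node_params(font_name)
    return xgb.to_graphviz(
        model,
        num_trees=tree_index,
        condition_node_params=condition_params,  # assumed to be supported
        leaf_node_params=leaf_params,  # assumed to be supported
    )
```

Calling to_graphviz_with_font(model, tree_index=9) should give a graph comparable to the one above, without any string manipulation of the DOT source.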

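If the font still does not render, check which fonts are available to Graphviz on the local machine. On Linux, fc-list is usually the quickest way to verify that a font is installed: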

fc-list | grep -iE "fang|noto|source han"
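
The same check can be run from a notebook. This is a minimal sketch that assumes fc-list prints one font per line in the usual "path: family:style=..." form; match_fonts and installed_chinese_fonts are hypothetical helper names.

```python
import shutil
import subprocess


def match_fonts(fc_list_output, keywords):
    """Return fc-list lines that mention any keyword, ignoring case."""
    lowered = [keyword.lower() for keyword in keywords]
    return [
        line
        for line in fc_list_output.splitlines()
        if any(keyword in line.lower() for keyword in lowered)
    ]


def installed_chinese_fonts(keywords=("fang", "noto", "source han")):
    """Run fc-list and keep the lines that match the keywords."""
    if shutil.which("fc-list") is None:
        # fontconfig is not installed, so nothing can be listed.
        return []
    result = subprocess.run(
        ["fc-list"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",  # tolerate font names in unexpected encodings
        check=True,
    )
    return match_fonts(result.stdout, keywords)
```

An empty list from installed_chinese_fonts() suggests that none of the candidate fonts is visible to fontconfig, or that fc-list itself is not available.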

RandomForest

For scikit-learn's RandomForestClassifier or RandomForestRegressor, export each tree to a DOT file, replace the default font with one that supports Chinese, and then render the DOT file with Graphviz.

import re
import subprocess
from pathlib import Path

from sklearn.tree import export_graphviz


def plot_forest(model, column_names, output_dir="forest", font_name="FangSong",
                class_names=None):
    """Render the first 10 trees of a fitted forest as PNG files."""
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    for i, estimator in enumerate(model.estimators_[:10]):
        dot_path = output_path / f"tree-{i}.dot"
        cn_dot_path = output_path / f"tree-cn-{i}.dot"
        png_path = output_path / f"tree-cn-{i}.png"

        # Export the tree with scikit-learn's default font.
        export_graphviz(
            estimator,
            out_file=str(dot_path),
            feature_names=column_names,
            class_names=class_names,  # e.g. ["Class Name 1", "Class Name 2"]
            rounded=True,
            proportion=False,
            precision=2,
            filled=True,
        )

        # Swap the default font (helvetica) for the Chinese-capable one,
        # quoting it so that font names with spaces stay valid DOT.
        source = dot_path.read_text(encoding="utf-8")
        source = re.sub(r'"?helvetica"?', f'"{font_name}"', source, flags=re.IGNORECASE)
        cn_dot_path.write_text(source, encoding="utf-8")

        # Convert to PNG using Graphviz; check=True raises if dot fails.
        subprocess.run(
            ["dot", "-Tpng", str(cn_dot_path), "-o", str(png_path), "-Gdpi=600"],
            check=True,
        )
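
Recent scikit-learn releases appear to let export_graphviz take a fontname argument directly, which can make the text replacement unnecessary. The following is a sketch under that assumption: it inspects the function signature at run time and falls back to a replacement otherwise. The helper names font_kwargs and export_tree_dot are hypothetical.

```python
import inspect


def font_kwargs(export_func, font_name="FangSong"):
    """Return {'fontname': font_name} if export_func accepts that argument."""
    parameters = inspect.signature(export_func).parameters
    return {"fontname": font_name} if "fontname" in parameters else {}


def export_tree_dot(estimator, feature_names, font_name="FangSong"):
    """Return DOT source for one fitted tree, labelled with font_name."""
    # Imported here so font_kwargs can be tried without scikit-learn.
    from sklearn.tree import export_graphviz

    extra = font_kwargs(export_graphviz, font_name)
    dot_source = export_graphviz(
        estimator,
        out_file=None,  # return the DOT text instead of writing a file
        feature_names=feature_names,
        rounded=True,
        filled=True,
        **extra,
    )
    if not extra:
        # Older scikit-learn: the default font is assumed to appear as
        # fontname=helvetica, so swap it by hand.
        dot_source = dot_source.replace(
            "fontname=helvetica", f'fontname="{font_name}"'
        )
    return dot_source
```

The returned string can be written to a .dot file and rendered with the same dot command as above.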

In a Jupyter notebook, a rendered tree can be displayed like this:

from IPython.display import Image

Image(filename="forest/tree-cn-0.png")

Before running the conversion step, make sure Graphviz is installed and available on the command line:

dot -V

If dot is missing, install Graphviz with your operating system's package manager, then rerun the notebook or script.
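
A script can run the same check before it tries to render anything. This is a small sketch; graphviz_version is a hypothetical helper name.

```python
import shutil
import subprocess


def graphviz_version(executable="dot"):
    """Return the banner printed by `dot -V`, or None if dot is not on PATH."""
    if shutil.which(executable) is None:
        return None
    result = subprocess.run([executable, "-V"], capture_output=True, text=True)
    # dot usually prints its version banner to stderr rather than stdout.
    return (result.stderr or result.stdout).strip()
```

A return value of None means the conversion step would fail, so the script can stop early with a clear message instead of a subprocess error.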
