Automating Browser Setup for Web Scraping and Testing: A Complete Solution

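Introduction

Web automation tools are essential for developers, testers, and digital marketers who work across several online platforms at once. Whether you are running automated tests, scraping data, or managing content on multiple sites, you need a reliable browser environment. Setting one up is often tedious, not least because the browser and its WebDriver must stay on compatible versions.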

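The Challenge

Many developers run into the same problems when preparing a browser automation environment:

  • Version mismatches between the browser and the driver
  • Configuration errors when setting display parameters
  • Manual configuration repeated on every machine
  • Session management across multiple websites

These problems become more pronounced with platforms such as YouTube, Xiaohongshu, and Douyin, especially in headless environments or when an X server display is involved.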

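A Complete Solution

To address these problems, I wrote a script that automates the web automation setup for Chrome or Chromium. In a single run it handles browser installation, driver download, environment configuration, and alias setup.

Key Features

  • Browser choice: install Chrome or Chromium.
  • Automatic version matching: detect the browser version and download the corresponding driver.
  • Configurable display: set the X display, for example :0 or :1.
  • Preconfigured aliases: create ready-to-use commands for common platforms.
  • Verification: confirm that the browser, the driver, and the aliases are all available.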

How to Use the Script

The script is designed to be simple but flexible:

# Use the defaults: Chrome and display :0
./install_chrome_driver_for_selenium.sh

# Install Chromium instead of Chrome
./install_chrome_driver_for_selenium.sh chromium

# Use a different display
./install_chrome_driver_for_selenium.sh --display=:1

# Combine options
./install_chrome_driver_for_selenium.sh chromium --display=:2

Before running it on a fresh Linux machine, make sure the common dependencies are available:

sudo apt update
sudo apt install -y wget gpg unzip

The script assumes a Debian- or Ubuntu-style system with apt. It writes the driver to /usr/local/bin, so the installation steps use sudo.
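A quick preflight check can report missing tools before the installer starts. The following is a minimal sketch, not part of the original script; missing_commands is a hypothetical helper name.

```shell
# Print the name of every listed command that is not found on PATH.
# missing_commands is a hypothetical helper, not part of the installer.
missing_commands() {
    for cmd in "$@"; do
        if ! command -v "$cmd" > /dev/null 2>&1; then
            echo "$cmd"
        fi
    done
}

# Usage (prints nothing when everything is installed):
#   missing_commands wget gpg unzip sudo
```

If the function prints any names, install those packages with apt before running the installer.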

The Script

#!/bin/bash
# install_chrome_driver_for_selenium.sh
# Automatically installs Chrome/Chromium, the matching driver, and sets up aliases
# Usage: ./install_chrome_driver_for_selenium.sh [chrome|chromium] [--display=:X]

set -e  # Exit on error

# Function to display usage
usage() {
    echo "Usage: $0 [chrome|chromium] [--display=:X]"
    echo "  chrome|chromium - Browser to install (default: chrome)"
    echo "  --display=:X    - X display to use (default: :0)"
    echo "Examples:"
    echo "  $0                      # Install Chrome with display :0"
    echo "  $0 chromium             # Install Chromium with display :0"
    echo "  $0 chrome --display=:1  # Install Chrome with display :1"
}

# Function to install Chrome
install_chrome() {
    echo "=== Installing Google Chrome ==="

    # Check if Chrome is already installed
    if command -v google-chrome &> /dev/null; then
        echo "Google Chrome is already installed."
    else
        echo "Adding Google Chrome repository..."
        wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | sudo gpg --dearmor -o /usr/share/keyrings/google-chrome-keyring.gpg
        echo "deb [arch=amd64 signed-by=/usr/share/keyrings/google-chrome-keyring.gpg] http://dl.google.com/linux/chrome/deb/ stable main" | sudo tee /etc/apt/sources.list.d/google-chrome.list > /dev/null

        echo "Updating package lists..."
        sudo apt update

        echo "Installing Google Chrome..."
        sudo apt install -y google-chrome-stable
    fi
}

# Function to install Chromium
install_chromium() {
    echo "=== Installing Chromium ==="

    # Check if Chromium is already installed
    if command -v chromium-browser &> /dev/null; then
        echo "Chromium is already installed."
    else
        echo "Installing Chromium..."
        sudo apt update
        sudo apt install -y chromium-browser
    fi
}

# Function to get Chrome version
get_chrome_version() {
    echo "=== Detecting Chrome version ==="
    CHROME_VERSION=$(google-chrome --version | cut -d ' ' -f 3)
    echo "Detected Google Chrome version: $CHROME_VERSION"
}

# Function to get Chromium version
get_chromium_version() {
    echo "=== Detecting Chromium version ==="
    CHROMIUM_VERSION=$(chromium-browser --version | cut -d ' ' -f 2)
    echo "Detected Chromium version: $CHROMIUM_VERSION"
}

# Function to download Chrome Driver
download_chrome_driver() {
    echo "=== Downloading Chrome Driver for version $CHROME_VERSION ==="

    # Download the appropriate driver
    wget -q -O chromedriver.zip "https://storage.googleapis.com/chrome-for-testing-public/$CHROME_VERSION/linux64/chromedriver-linux64.zip"

    # Unzip the driver
    unzip -q -o chromedriver.zip

    # Move to /usr/local/bin
    sudo mv chromedriver-linux64/chromedriver /usr/local/bin/chromedriver
    sudo chmod +x /usr/local/bin/chromedriver

    # Clean up
    rm -rf chromedriver.zip chromedriver-linux64

    echo "Chrome Driver installed to /usr/local/bin/chromedriver"
}

# Function to download Chromium Driver (uses chrome driver underneath)
download_chromium_driver() {
    echo "=== Downloading Chromium Driver for version $CHROMIUM_VERSION ==="

    # Download the appropriate driver
    # Note: For newer versions, Chromium uses the same driver as Chrome
    wget -q -O chromedriver.zip "https://storage.googleapis.com/chrome-for-testing-public/$CHROMIUM_VERSION/linux64/chromedriver-linux64.zip"

    # Unzip the driver
    unzip -q -o chromedriver.zip

    # Move to /usr/local/bin
    sudo mv chromedriver-linux64/chromedriver /usr/local/bin/chromedriver
    sudo chmod +x /usr/local/bin/chromedriver

    # Create or refresh the chromium-driver symlink (-f also replaces a dangling link,
    # which the previous -f file test would miss and then fail on under set -e)
    sudo ln -sf /usr/local/bin/chromedriver /usr/local/bin/chromium-driver

    # Clean up
    rm -rf chromedriver.zip chromedriver-linux64

    echo "Chromium Driver installed to /usr/local/bin/chromedriver with symlink at /usr/local/bin/chromium-driver"
}

# Function to set up the aliases
setup_chrome_aliases() {
    echo "=== Setting up Chrome aliases with display $DISPLAY_NUM ==="

    # Create the scripts directory if it doesn't exist
    mkdir -p ~/scripts

    # Create the aliases file with the specified display
    cat > ~/scripts/chrome_aliases.sh << EOF
#!/bin/bash
# Chrome aliases for various platforms
# Generated by install_chrome_driver_for_selenium.sh

# Create logs directory if it doesn't exist
mkdir -p "$HOME/chrome_dev_session_logs"

# Chrome aliases with DISPLAY=$DISPLAY_NUM
alias start_chrome_xhs='DISPLAY=$DISPLAY_NUM google-chrome --hide-crash-restore-bubble --remote-debugging-port=5003 --user-data-dir="$HOME/chrome_dev_session_5003" https://creator.xiaohongshu.com/creator/post > "$HOME/chrome_dev_session_logs/chrome_xhs.log" 2>&1'
alias start_chrome_douyin='DISPLAY=$DISPLAY_NUM google-chrome --hide-crash-restore-bubble --remote-debugging-port=5004 --user-data-dir="$HOME/chrome_dev_session_5004" https://creator.douyin.com/creator-micro/content/upload > "$HOME/chrome_dev_session_logs/chrome_douyin.log" 2>&1'
alias start_chrome_bilibili='DISPLAY=$DISPLAY_NUM google-chrome --hide-crash-restore-bubble --remote-debugging-port=5005 --user-data-dir="$HOME/chrome_dev_session_5005" https://member.bilibili.com/platform/upload/video/frame > "$HOME/chrome_dev_session_logs/chrome_bilibili.log" 2>&1'
alias start_chrome_shipinhao='DISPLAY=$DISPLAY_NUM google-chrome --hide-crash-restore-bubble --remote-debugging-port=5006 --user-data-dir="$HOME/chrome_dev_session_5006" https://channels.weixin.qq.com/post/create > "$HOME/chrome_dev_session_logs/chrome_shipinhao.log" 2>&1'
alias start_chrome_youtube='DISPLAY=$DISPLAY_NUM google-chrome --hide-crash-restore-bubble --remote-debugging-port=9222 --user-data-dir="$HOME/chrome_dev_session_9222" https://youtube.com/upload > "$HOME/chrome_dev_session_logs/chrome_youtube.log" 2>&1'
alias start_chrome_without_y2b='start_chrome_xhs & start_chrome_douyin & start_chrome_bilibili'
alias start_chrome_all='start_chrome_xhs & start_chrome_douyin & start_chrome_bilibili & start_chrome_shipinhao & start_chrome_youtube'
EOF

    # Make the file executable
    chmod +x ~/scripts/chrome_aliases.sh

    # Check if already sourced in .bashrc
    if ! grep -q "source ~/scripts/chrome_aliases.sh" ~/.bashrc; then
        echo "# Source Chrome aliases" >> ~/.bashrc
        echo "source ~/scripts/chrome_aliases.sh" >> ~/.bashrc
        echo "Added source command to ~/.bashrc"
    else
        echo "Chrome aliases already sourced in ~/.bashrc"
    fi

    # Source the file in the current session
    source ~/scripts/chrome_aliases.sh

    echo "Chrome aliases have been set up and are ready to use"
}

# Function to set up the aliases for Chromium
setup_chromium_aliases() {
    echo "=== Setting up Chromium aliases with display $DISPLAY_NUM ==="

    # Create the scripts directory if it doesn't exist
    mkdir -p ~/scripts

    # Create the aliases file with the specified display
    cat > ~/scripts/chromium_aliases.sh << EOF
#!/bin/bash
# Chromium aliases for various platforms
# Generated by install_chrome_driver_for_selenium.sh

# Create logs directory if it doesn't exist
mkdir -p "$HOME/chromium_dev_session_logs"

# Chromium aliases with DISPLAY=$DISPLAY_NUM
alias start_chromium_xhs='DISPLAY=$DISPLAY_NUM chromium-browser --hide-crash-restore-bubble --remote-debugging-port=5003 --user-data-dir="$HOME/chromium_dev_session_5003" https://creator.xiaohongshu.com/creator/post > "$HOME/chromium_dev_session_logs/chromium_xhs.log" 2>&1'
alias start_chromium_douyin='DISPLAY=$DISPLAY_NUM chromium-browser --hide-crash-restore-bubble --remote-debugging-port=5004 --user-data-dir="$HOME/chromium_dev_session_5004" https://creator.douyin.com/creator-micro/content/upload > "$HOME/chromium_dev_session_logs/chromium_douyin.log" 2>&1'
alias start_chromium_bilibili='DISPLAY=$DISPLAY_NUM chromium-browser --hide-crash-restore-bubble --remote-debugging-port=5005 --user-data-dir="$HOME/chromium_dev_session_5005" https://member.bilibili.com/platform/upload/video/frame > "$HOME/chromium_dev_session_logs/chromium_bilibili.log" 2>&1'
alias start_chromium_shipinhao='DISPLAY=$DISPLAY_NUM chromium-browser --hide-crash-restore-bubble --remote-debugging-port=5006 --user-data-dir="$HOME/chromium_dev_session_5006" https://channels.weixin.qq.com/post/create > "$HOME/chromium_dev_session_logs/chromium_shipinhao.log" 2>&1'
alias start_chromium_youtube='DISPLAY=$DISPLAY_NUM chromium-browser --hide-crash-restore-bubble --remote-debugging-port=9222 --user-data-dir="$HOME/chromium_dev_session_9222" https://youtube.com/upload > "$HOME/chromium_dev_session_logs/chromium_youtube.log" 2>&1'
alias start_chromium_without_y2b='start_chromium_xhs & start_chromium_douyin & start_chromium_bilibili'
alias start_chromium_all='start_chromium_xhs & start_chromium_douyin & start_chromium_bilibili & start_chromium_shipinhao & start_chromium_youtube'
EOF

    # Make the file executable
    chmod +x ~/scripts/chromium_aliases.sh

    # Check if already sourced in .bashrc
    if ! grep -q "source ~/scripts/chromium_aliases.sh" ~/.bashrc; then
        echo "# Source Chromium aliases" >> ~/.bashrc
        echo "source ~/scripts/chromium_aliases.sh" >> ~/.bashrc
        echo "Added source command to ~/.bashrc"
    else
        echo "Chromium aliases already sourced in ~/.bashrc"
    fi

    # Source the file in the current session
    source ~/scripts/chromium_aliases.sh

    echo "Chromium aliases have been set up and are ready to use"
}

# Function to verify installation
verify_installation() {
    echo "=== Verifying installation ==="

    # Check if Chrome/Chromium is installed
    if [ "$BROWSER_TYPE" = "chrome" ]; then
        if ! command -v google-chrome &> /dev/null; then
            echo "ERROR: Google Chrome is not installed properly"
            exit 1
        fi
        echo "✓ Google Chrome is installed"
    else
        if ! command -v chromium-browser &> /dev/null; then
            echo "ERROR: Chromium is not installed properly"
            exit 1
        fi
        echo "✓ Chromium is installed"
    fi

    # Check if the driver is installed
    if ! command -v chromedriver &> /dev/null; then
        echo "ERROR: Chrome Driver is not installed properly"
        exit 1
    fi
    echo "✓ Chrome Driver is installed"

    # Verify driver version matches browser version
    if [ "$BROWSER_TYPE" = "chrome" ]; then
        CHROME_CURRENT_VERSION=$(google-chrome --version | cut -d ' ' -f 3)
        DRIVER_VERSION=$(chromedriver --version | cut -d ' ' -f 2)

        if [[ "$DRIVER_VERSION" == "$CHROME_CURRENT_VERSION"* ]]; then
            echo "✓ Chrome Driver version $DRIVER_VERSION matches Chrome version $CHROME_CURRENT_VERSION"
        else
            echo "WARNING: Chrome Driver version ($DRIVER_VERSION) might not match Chrome version ($CHROME_CURRENT_VERSION)"
        fi
    else
        CHROMIUM_CURRENT_VERSION=$(chromium-browser --version | cut -d ' ' -f 2)
        DRIVER_VERSION=$(chromedriver --version | cut -d ' ' -f 2)

        if [[ "$DRIVER_VERSION" == "$CHROMIUM_CURRENT_VERSION"* ]]; then
            echo "✓ Chrome Driver version $DRIVER_VERSION matches Chromium version $CHROMIUM_CURRENT_VERSION"
        else
            echo "WARNING: Chrome Driver version ($DRIVER_VERSION) might not match Chromium version ($CHROMIUM_CURRENT_VERSION)"
        fi
    fi

    # Check if aliases are set up
    if [ "$BROWSER_TYPE" = "chrome" ]; then
        if ! type start_chrome_youtube &> /dev/null; then
            echo "WARNING: Chrome aliases are not loaded in the current session"
            echo "Please run: source ~/.bashrc"
        else
            echo "✓ Chrome aliases are set up correctly"
        fi
    else
        if ! type start_chromium_youtube &> /dev/null; then
            echo "WARNING: Chromium aliases are not loaded in the current session"
            echo "Please run: source ~/.bashrc"
        else
            echo "✓ Chromium aliases are set up correctly"
        fi
    fi

    echo ""
    echo "Installation completed successfully!"
    if [ "$BROWSER_TYPE" = "chrome" ]; then
        echo "You can now use the following commands:"
        echo "  start_chrome_xhs       - Start Chrome for Xiaohongshu"
        echo "  start_chrome_douyin    - Start Chrome for Douyin"
        echo "  start_chrome_bilibili  - Start Chrome for Bilibili"
        echo "  start_chrome_shipinhao - Start Chrome for Shipinhao"
        echo "  start_chrome_youtube   - Start Chrome for YouTube"
        echo "  start_chrome_without_y2b - Start Chrome for all except YouTube"
        echo "  start_chrome_all       - Start Chrome for all platforms"
    else
        echo "You can now use the following commands:"
        echo "  start_chromium_xhs       - Start Chromium for Xiaohongshu"
        echo "  start_chromium_douyin    - Start Chromium for Douyin"
        echo "  start_chromium_bilibili  - Start Chromium for Bilibili"
        echo "  start_chromium_shipinhao - Start Chromium for Shipinhao"
        echo "  start_chromium_youtube   - Start Chromium for YouTube"
        echo "  start_chromium_without_y2b - Start Chromium for all except YouTube"
        echo "  start_chromium_all       - Start Chromium for all platforms"
    fi

    echo ""
    echo "Using display: $DISPLAY_NUM"
    echo "Note: If the aliases aren't available, run 'source ~/.bashrc' to load them"
}

# Default values
BROWSER_TYPE="chrome"
DISPLAY_NUM=":0"

# Parse command line arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        chrome|chromium)
            BROWSER_TYPE="$1"
            shift
            ;;
        --display=*)
            DISPLAY_NUM="${1#*=}"
            shift
            ;;
        --help|-h)
            usage
            exit 0
            ;;
        *)
            echo "Unknown option: $1"
            usage
            exit 1
            ;;
    esac
done

# Main execution
echo "Installing $BROWSER_TYPE and its WebDriver..."
echo "Using display: $DISPLAY_NUM"

# Install the selected browser
if [ "$BROWSER_TYPE" = "chrome" ]; then
    install_chrome
    get_chrome_version
    download_chrome_driver
    setup_chrome_aliases
else
    install_chromium
    get_chromium_version
    download_chromium_driver
    setup_chromium_aliases
fi

# Verify installation
verify_installation

echo "Done!"

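Understanding the Components

Browser and Driver Installation

The script first determines which browser to install: Chrome or Chromium. It then installs that browser if it is not already present. Once it has detected the version number, it downloads the matching WebDriver from Google's Chrome for Testing storage.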
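The detection and download steps can be sketched as two small functions. This is an illustration under stated assumptions rather than the script's exact code: the helper names are hypothetical, the version numbers in the usage comment are only examples, and the version is extracted with a pattern match instead of a fixed field position so that it works for both "Google Chrome ..." and "Chromium ..." output.

```shell
# Extract the dotted four-part version from "--version" output.
# browser_version_from_output is a hypothetical helper name.
browser_version_from_output() {
    printf '%s\n' "$1" | grep -oE '[0-9]+(\.[0-9]+){3}' | head -n 1
}

# Build the linux64 driver URL in the same layout the script uses.
# cft_driver_url is a hypothetical helper name.
cft_driver_url() {
    echo "https://storage.googleapis.com/chrome-for-testing-public/$1/linux64/chromedriver-linux64.zip"
}

# Usage (example version only):
#   v=$(browser_version_from_output "$(google-chrome --version)")
#   wget -q -O chromedriver.zip "$(cft_driver_url "$v")"
```

Matching on the version pattern avoids depending on how many words precede the number, which differs between Chrome and Chromium builds.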

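In practice, distributions package browsers differently. If the Chrome for Testing storage has no exact match for the detected version, first check the installed browser version locally, then download the closest compatible stable driver from the official Chrome for Testing availability page.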
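One possible way to automate that fallback is sketched below. It assumes that the Chrome for Testing project publishes a plain-text LATEST_RELEASE_<major> file at the URL shown; treat that endpoint, and the helper names, as assumptions to verify rather than as part of the original script.

```shell
# Major version (milestone) of a dotted version string.
major_version() {
    echo "${1%%.*}"
}

# URL of the text file assumed to hold the newest driver version for a milestone.
latest_release_url() {
    echo "https://googlechromelabs.github.io/chrome-for-testing/LATEST_RELEASE_$(major_version "$1")"
}

# Print the exact version if its archive exists, otherwise the milestone's latest.
# resolve_driver_version is a hypothetical helper; it needs network access.
resolve_driver_version() {
    exact_url="https://storage.googleapis.com/chrome-for-testing-public/$1/linux64/chromedriver-linux64.zip"
    if wget -q --spider "$exact_url"; then
        echo "$1"
    else
        wget -q -O - "$(latest_release_url "$1")"
    fi
}
```

A driver from the same major version is usually what matters for compatibility, so the fallback stays within the browser's milestone instead of jumping to the newest release overall.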

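Display Configuration

The X display setting matters for browser automation, especially in Docker containers, virtual desktops, and remote servers. The --display option lets you choose the display that the browser aliases use.

For example, if your automation desktop runs on :1, start the setup like this:

./install_chrome_driver_for_selenium.sh chrome --display=:1

You can confirm whether a display is active with commands like these: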

echo "$DISPLAY"
xdpyinfo -display :1 >/dev/null && echo "display is available"
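The same check can be wrapped in functions that validate the value before passing it to --display. This is a sketch with hypothetical helper names; it only accepts local displays of the form :N or :N.M, and the reachability check assumes xdpyinfo is installed.

```shell
# Succeed when the argument looks like a local X display such as ":0" or ":1.0".
is_local_display() {
    printf '%s\n' "$1" | grep -Eq '^:[0-9]+(\.[0-9]+)?$'
}

# Succeed when an X server answers on the display.
# Returns 2 when xdpyinfo is not installed, so the caller can tell the cases apart.
display_is_reachable() {
    command -v xdpyinfo > /dev/null 2>&1 || return 2
    xdpyinfo -display "$1" > /dev/null 2>&1
}

# Usage:
#   is_local_display ":1" && display_is_reachable ":1" && echo "display :1 is ready"
```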

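Aliases for Efficiency

One of the most useful features is the set of ready-to-use aliases for multiple platforms:

  • start_chrome_youtube: opens Chrome on the YouTube upload page.
  • start_chrome_douyin: opens Chrome on the Douyin creator dashboard.
  • start_chrome_bilibili: opens Chrome on the Bilibili upload page.
  • start_chrome_xhs: opens Chrome on the Xiaohongshu creator page.
  • start_chrome_shipinhao: opens Chrome on the WeChat Channels (Shipinhao) post creation page.

Each alias uses its own remote debugging port and its own user data directory. This keeps sessions isolated while still allowing Selenium, Playwright, or a client that speaks the Chrome DevTools Protocol directly to attach to the running browser. When you install Chromium, the script creates the same set under start_chromium_* names.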
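To see which of these sessions are currently running, you can probe each debugging port over HTTP. The sketch below relies on the /json/version endpoint that Chrome serves when started with --remote-debugging-port. It assumes curl is installed, which the original script does not require, and the helper names are hypothetical.

```shell
# URL of the DevTools version endpoint for a local debugging port.
devtools_version_url() {
    echo "http://127.0.0.1:$1/json/version"
}

# Print each port from the argument list that answers on its DevTools endpoint.
# Assumes curl is available; list_live_sessions is a hypothetical helper.
list_live_sessions() {
    for port in "$@"; do
        if curl -s --max-time 2 "$(devtools_version_url "$port")" > /dev/null; then
            echo "$port"
        fi
    done
}

# Usage, with the ports the aliases use:
#   list_live_sessions 5003 5004 5005 5006 9222
```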

If the aliases are not available immediately after installation, reload your shell configuration:

source ~/.bashrc

Then verify that an alias exists:

type start_chrome_youtube

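Real-World Use Cases

This script is especially useful for:

  • Content creators who manage uploads across multiple platforms.
  • Digital marketers who automate content distribution workflows.
  • QA engineers who need a consistent browser testing environment.
  • Data scientists who need a browser environment for web scraping.
  • DevOps teams who want to standardize browser automation across development machines and servers.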

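Operational Notes

A few practical checks make this setup more reliable:

  • Before debugging driver problems, run google-chrome --version or chromium-browser --version.
  • After installation, run chromedriver --version and compare its major version with the browser's.
  • Give each platform its own --user-data-dir so that cookies and sessions do not collide.
  • Avoid running unrelated automation tasks on the same remote debugging port.
  • Treat logged-in browser profiles as sensitive data, because they may contain valid sessions.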
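The major-version comparison from the list above can be scripted. This is a minimal sketch with a hypothetical helper name; the version strings in the usage comment are only examples.

```shell
# Succeed when two dotted version strings share the same major version.
# same_major_version is a hypothetical helper name.
same_major_version() {
    [ "${1%%.*}" = "${2%%.*}" ]
}

# Usage (example versions only):
#   same_major_version "120.0.6099.109" "120.0.6099.71" || echo "driver and browser differ"
```

Comparing only the part before the first dot tolerates differences in the patch number between the browser and the driver.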

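Conclusion

Browser automation should not start with hours of setup and configuration. With this script you can have a working browser automation environment in minutes, complete with a matching driver, a configurable display, and convenient aliases for common workflows.

Whether you are a developer, tester, content creator, or data professional, this approach removes repetitive setup work so you can focus on the automation itself. As your workflows grow, you can extend the script with more aliases, additional browser profiles, or project-specific launch flags.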
