A Way to Help ChatGPT Understand Project Code More Easily

As AI and machine learning develop rapidly, it is becoming increasingly valuable to help ChatGPT understand the code of a specific project or domain. One direct approach is to organize a code repository into text material that is easier for the model to retrieve and understand, then upload it as a custom knowledge base or project material.

This article provides a simple Python script that merges the Python files in a directory tree into one text file per subdirectory. This preserves the source path of each file while reducing the overhead of organizing many scattered files.

Background

Imagine you have a codebase full of Python scripts, each containing valuable code snippets and project knowledge. You want to use this codebase to improve ChatGPT's understanding of the project structure, business logic, or implementation patterns in a specific domain.

The challenge is that directly uploading a large number of scattered files is often inconvenient, and it also makes it harder for the model to quickly connect code with its paths. Therefore, we need to convert the codebase into a clearer text format that is easier to ingest.

Solution

The script below traverses a specified directory, merges the .py files in each subdirectory into a single .txt file, and writes the original file path into the merged content. The output files are named according to their relative paths in the source directory, making them easier to locate and reference later.

This method is suitable for:

  • Preparing ChatGPT project knowledge base material;
  • Providing context for code review or refactoring tasks;
  • Quickly generating searchable and archivable code snapshots;
  • Organizing code material without changing the original project structure.

How It Works

  1. Traverse directories: The script uses os.walk to navigate the specified source directory and identify Python files with the .py extension.
  2. Merge files: It merges all Python files found in the same subdirectory into one text file, writing the original file path before each code section.
  3. Organize output: Each merged file is named after its subdirectory's path relative to the source directory, with path separators replaced by underscores (for example, pkg/utils becomes pkg_utils.txt), keeping the output readable.
  4. Stay flexible: You can specify any source directory and target directory, so it can be used across different projects.
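
To make the naming step concrete, here is a minimal sketch of how a subdirectory could be mapped to its output file name. The helper name output_name_for and the example directory names are hypothetical and used only for illustration; the logic mirrors the relpath-and-replace approach described above.

```python
import os


def output_name_for(subdir, source_directory):
    """Return the merged-file name for one subdirectory (hypothetical helper)."""
    # Path of the subdirectory relative to the root being scanned.
    relative_path = os.path.relpath(subdir, start=source_directory)
    # Flatten the path so every output file can live in one directory.
    return relative_path.replace(os.sep, "_") + ".txt"


# A nested directory maps to a flat, underscore-joined file name.
nested = os.path.join("diffraction", "optics", "lenses")
print(output_name_for(nested, "diffraction"))  # optics_lenses.txt
```

Note that os.path.relpath returns "." for the source directory itself, so Python files sitting directly in the root may deserve a special case.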

Code

import os


def merge_py_files_by_directory(source_directory, target_directory):
    for subdir, dirs, files in os.walk(source_directory):
        py_files = [f for f in files if f.endswith('.py')]
        if py_files:
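            relative_path = os.path.relpath(subdir, start=source_directory)
            # relpath returns '.' for the source directory itself, which would
            # yield a file named '..txt'; use a readable placeholder instead.
            if relative_path == os.curdir:
                relative_path = 'root'
            new_filename = relative_path.replace(os.sep, '_') + '.txt'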
            target_file_path = os.path.join(target_directory, new_filename)

            os.makedirs(target_directory, exist_ok=True)

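            # Read and write as UTF-8 so non-ASCII text survives on every platform.
            with open(target_file_path, "w", encoding="utf-8") as outfile:
                for file in py_files:
                    file_path = os.path.join(subdir, file)
                    # Header block recording where the code came from.
                    outfile.write(f"{'=' * 20}\n")
                    outfile.write(f"File: {file_path}\n")
                    outfile.write(f"{'=' * 20}\n\n")
                    with open(file_path, "r", encoding="utf-8") as infile:
                        outfile.write(infile.read())
                        outfile.write("\n\n")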


# Example usage
source_directory = 'diffraction'
target_directory = 'merged_py_files'
merge_py_files_by_directory(source_directory, target_directory)

Usage Suggestions

Before running the script, confirm that source_directory points to the project directory you want to organize and that target_directory points to the directory where the output text files should be saved. Then run it, for example (assuming the script is saved as merge_python_files.py):

python merge_python_files.py
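
If you would rather pass the two directories on the command line than edit them in the script, one option is the standard argparse module. The following is only a sketch: the function name parse_args and the default output directory are assumptions made for illustration, not part of the original script.

```python
import argparse


def parse_args(argv=None):
    """Parse the source and target directories (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Merge .py files into one text file per subdirectory."
    )
    parser.add_argument("source_directory", help="project directory to read")
    parser.add_argument(
        "target_directory",
        nargs="?",  # optional: fall back to the default when omitted
        default="merged_py_files",
        help="directory for the merged .txt files",
    )
    return parser.parse_args(argv)


# Passing an explicit list stands in for real command-line arguments.
args = parse_args(["diffraction"])
print(args.source_directory, args.target_directory)  # diffraction merged_py_files
```

The parsed values can then be handed to merge_py_files_by_directory in place of the hard-coded names.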

After the files are generated, you can spot-check a few output files to confirm:

  • Whether file paths are preserved correctly;
  • Whether the code content is complete;
  • Whether the output directory contains the expected subdirectory content;
  • Whether you need to exclude virtual environments, cache directories, or third-party dependency code.
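
The first two checks can be partly automated. Below is a rough sketch, assuming each merged file records a source on its own line as "File: <path>" and that you pass the same source_directory string used for the merge. The function name check_merged_output is hypothetical.

```python
import os


def check_merged_output(source_directory, target_directory):
    """Return source .py paths that have no "File: " header in the output."""
    # Collect every .py path exactly as the merge script would build it.
    expected = set()
    for subdir, dirs, files in os.walk(source_directory):
        for name in files:
            if name.endswith(".py"):
                expected.add(os.path.join(subdir, name))

    # Collect every path recorded in a header line of the merged files.
    found = set()
    prefix = "File: "
    for name in os.listdir(target_directory):
        if not name.endswith(".txt"):
            continue
        merged_path = os.path.join(target_directory, name)
        with open(merged_path, encoding="utf-8") as merged:
            for line in merged:
                if line.startswith(prefix):
                    found.add(line[len(prefix):].rstrip("\n"))

    return sorted(expected - found)
```

An empty list means every Python file has a header somewhere in the output. It does not prove the code bodies are complete, so a manual spot-check is still worthwhile.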

If the project is relatively large, it is recommended to additionally exclude directories such as .venv, venv, __pycache__, and site-packages in the script, so unrelated dependencies are not packaged into the knowledge base.
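
One way to add such an exclusion is to prune the directory list that os.walk yields, which stops it from descending into those directories at all. The sketch below is illustrative only: the names EXCLUDED_DIRS and walk_python_files are assumptions, and the set of directories should be adapted to your project.

```python
import os

# Directory names to skip; adjust this set for your own project (an assumption).
EXCLUDED_DIRS = {".venv", "venv", "__pycache__", "site-packages"}


def walk_python_files(source_directory, excluded=EXCLUDED_DIRS):
    """Yield (subdir, py_files) pairs, skipping excluded directories."""
    for subdir, dirs, files in os.walk(source_directory):
        # Editing dirs in place tells os.walk not to descend into them
        # (this works because os.walk is top-down by default).
        dirs[:] = [d for d in dirs if d not in excluded]
        py_files = sorted(f for f in files if f.endswith(".py"))
        if py_files:
            yield subdir, py_files
```

The merge function could iterate over walk_python_files(source_directory) instead of calling os.walk directly, leaving the rest of its logic unchanged.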

Conclusion

This method provides a simple path for converting a Python codebase into knowledge base material suitable for ChatGPT to read. It does not modify the original code; it only generates structured text copies, making it useful for project understanding, code Q&A, refactoring preparation, and technical documentation organization.
