Merging Python Files for a Customized ChatGPT Knowledge Base: A Step-by-Step Guide

Customizing models like ChatGPT with domain-specific material is becoming increasingly useful. One approach is to prepare a knowledge base that helps the model answer questions about a specific domain, project, or codebase.

To do that, we need a practical way to convert a repository of Python code into a readable text format. The following guide shows how to merge Python files from a directory into text files that can be inspected, archived, or uploaded as part of a custom knowledge base.

The Context

Imagine you have a repository filled with Python scripts, each containing valuable code, comments, and project-specific context. You want to preserve that structure while making the content easier to read outside the original repository.

The challenge is to transform the codebase into plain text without losing the source file paths. Keeping the original paths matters because it lets you trace each snippet back to its location in the project.

The Solution

The script below walks through a source directory, finds Python files, and merges the files from each directory into a corresponding .txt file. Each section in the output begins with the original file path, followed by the file contents.

This gives you a simple export format that is easy to review and easy to split by directory if the project is large.
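To make the export format concrete, here is a minimal sketch of how one section of a merged file is laid out. The helper name format_section and the sample path diffraction/utils.py are hypothetical and used only for illustration; the separator and the "File:" line follow the layout the script writes.

```python
def format_section(file_path: str, content: str) -> str:
    """Return one section of a merged file: a path header, then the file's text."""
    separator = "=" * 20
    return f"{separator}\nFile: {file_path}\n{separator}\n\n{content}\n\n"


# Hypothetical sample file, used only to show the layout.
section = format_section("diffraction/utils.py", "def add(a, b):\n    return a + b\n")
print(section)
```

Because every section starts with the same separator and a "File:" line, the merged text can be searched or split back into per-file pieces later.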

How It Works

  1. Walking through directories: The script uses os.walk to navigate through the specified source directory and identify Python files with .py extensions.
  2. Merging files: It combines all Python files found in a directory into a single text file and writes each original file's path above its contents.
  3. Organizing output: Each merged file is named after its directory's path relative to the source directory, with path separators replaced by underscores, which keeps the output structured and understandable.
  4. Staying flexible: You can specify any source and target directory, making the script reusable across different projects.
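The naming rule from step 3 can be sketched on its own for nested directories. The helper name output_name and the sample directory names are assumptions made for this illustration.

```python
import os


def output_name(subdir: str, source_directory: str) -> str:
    """Map a directory inside source_directory to a flat .txt file name."""
    relative_path = os.path.relpath(subdir, start=source_directory)
    return relative_path.replace(os.sep, "_") + ".txt"


# Hypothetical nested directory: diffraction/optics/lenses
nested = os.path.join("diffraction", "optics", "lenses")
print(output_name(nested, "diffraction"))

# Two different directories can flatten to the same name.
print(output_name(os.path.join("src", "a_b"), "src"))
print(output_name(os.path.join("src", "a", "b"), "src"))
```

Flattening paths with underscores keeps all output in one folder, at the cost that a directory named a_b and a nested directory a/b would share an output file. If a project has such names, a different separator may be a safer choice.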

The Code

import os


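def merge_py_files_by_directory(source_directory, target_directory):
    """Merge the .py files of each directory into one .txt file in target_directory."""
    # Create the output directory once, before walking the source tree.
    os.makedirs(target_directory, exist_ok=True)

    for subdir, dirs, files in os.walk(source_directory):
        # Sort so the merged output has a stable, repeatable order.
        py_files = sorted(f for f in files if f.endswith('.py'))
        if not py_files:
            continue

        relative_path = os.path.relpath(subdir, start=source_directory)
        if relative_path == os.curdir:
            # The top-level directory is '.', which would otherwise yield '..txt'.
            relative_path = 'root'
        new_filename = relative_path.replace(os.sep, '_') + '.txt'
        target_file_path = os.path.join(target_directory, new_filename)

        with open(target_file_path, "w", encoding="utf-8") as outfile:
            for file in py_files:
                file_path = os.path.join(subdir, file)
                outfile.write(f"{'=' * 20}\n")
                outfile.write(f"File: {file_path}\n")
                outfile.write(f"{'=' * 20}\n\n")
                # errors="replace" keeps the export going if a file is not valid UTF-8.
                with open(file_path, "r", encoding="utf-8", errors="replace") as infile:
                    outfile.write(infile.read())
                    outfile.write("\n\n")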


# Example usage
source_directory = 'diffraction'
target_directory = 'merged_py_files'
merge_py_files_by_directory(source_directory, target_directory)

A Few Practical Checks

Before using the output as a knowledge source, it is worth doing a few quick checks:

  • Confirm that source_directory points to the project folder you actually want to export.
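  • Confirm that target_directory is outside the source tree. The script only reads .py files, so generated .txt files are not merged again, but keeping them separate avoids cluttering the repository or committing them by accident.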
  • Open one or two generated files and check that file paths and code blocks are readable.
  • If the repository contains secrets, credentials, API keys, or private configuration, remove them before uploading the text anywhere.
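As a rough aid for the last check, the sketch below flags lines in exported text that look like hard-coded secrets. The patterns and the helper name find_suspicious_lines are assumptions chosen for illustration; they are simple heuristics that will miss many cases and do not replace a manual review or a dedicated secret scanner.

```python
import re

# Hypothetical, deliberately simple patterns; real secret scanners use many more rules.
SUSPICIOUS_PATTERNS = [
    re.compile(r"(?i)(api[_-]?key|secret|password|token)\s*[:=]\s*['\"][^'\"]+['\"]"),
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
]


def find_suspicious_lines(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that match any suspicious pattern."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SUSPICIOUS_PATTERNS):
            hits.append((line_number, line.strip()))
    return hits


# Made-up sample text standing in for the contents of a merged file.
sample = 'import os\nAPI_KEY = "abc123"\nprint("hello")\n'
for line_number, line in find_suspicious_lines(sample):
    print(f"line {line_number}: {line}")
```

Running this over each generated .txt file before uploading gives a short list of lines to inspect by hand.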

For larger projects, you may also want to skip virtual environments, caches, build outputs, or dependency directories. The script can be extended with an ignore list for folders such as .venv, __pycache__, build, or dist.
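One way to add such an ignore list is to prune the directory names that os.walk is about to visit. This is a minimal sketch, not the only approach; the names IGNORED_DIRS and walk_python_files are assumptions, and the folder names are the ones listed above.

```python
import os

# Assumed ignore list, taken from the folder names mentioned above; adjust per project.
IGNORED_DIRS = {".venv", "__pycache__", "build", "dist"}


def walk_python_files(source_directory, ignored_dirs=IGNORED_DIRS):
    """Yield (subdir, py_files) pairs, skipping ignored directories entirely."""
    for subdir, dirs, files in os.walk(source_directory):
        # Editing dirs in place stops os.walk from descending into those folders.
        dirs[:] = sorted(d for d in dirs if d not in ignored_dirs)
        py_files = sorted(f for f in files if f.endswith(".py"))
        if py_files:
            yield subdir, py_files
```

The merge loop could then iterate over walk_python_files(source_directory) instead of calling os.walk directly. Pruning dirs in place works because os.walk visits directories top-down by default.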

Conclusion

This approach offers a streamlined method for converting a Python code repository into a plain text format suitable for review or use in a custom knowledge base. By preserving file paths and grouping files by directory, the exported text remains easier to navigate than one large undifferentiated dump.

For small and medium-sized projects, this simple script is often enough. For larger or sensitive repositories, add filtering, encoding handling, and secret checks before using the generated files.
