Extending models like ChatGPT with your own material is becoming increasingly important. One useful approach is to prepare a customized knowledge base that helps the model answer questions about a specific domain, project, or codebase.
To do that, we need a practical way to convert a repository of Python code into a readable text format. The following guide shows how to merge Python files from a directory into text files that can be inspected, archived, or uploaded as part of a custom knowledge base.
The Context
Imagine you have a repository filled with Python scripts, each containing valuable code, comments, and project-specific context. You want to preserve that structure while making the content easier to read outside the original repository.
The challenge is to transform the codebase into plain text without losing the source file paths. Keeping the original paths matters because it lets you trace each snippet back to its location in the project.
The Solution
The script below walks through a source directory, finds Python files, and merges the files from each directory into a corresponding .txt file. Each section in the output begins with the original file path, followed by the file contents.
This gives you a simple export format that is easy to review and easy to split by directory if the project is large.
How It Works
- Walking through directories: The script uses os.walk to navigate through the specified source directory and identify Python files with .py extensions.
- Merging files: It combines all Python files found in each subdirectory into a single text file. The path of each original file is included for reference.
- Organizing output: The merged files are named based on their relative paths in the source directory, which keeps the output structured and understandable.
- Staying flexible: You can specify any source and target directory, making the script reusable across different projects.
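The naming rule described above can be sketched in isolation. The helper name output_name_for and the example paths are hypothetical, chosen only for illustration. One detail worth knowing: for the top-level source directory, os.path.relpath returns ".", which would produce a file called "..txt", so this sketch assumes you would rather fall back to the directory's own name.

```python
import os


def output_name_for(subdir, source_directory):
    """Return the .txt filename for one directory (hypothetical helper)."""
    relative_path = os.path.relpath(subdir, start=source_directory)
    if relative_path == os.curdir:
        # Top-level directory: use its own name instead of ".".
        relative_path = os.path.basename(os.path.abspath(source_directory))
    # Flatten nested paths so every output file sits directly in the target folder.
    return relative_path.replace(os.sep, "_") + ".txt"


# Illustrative paths only; nothing needs to exist on disk for this to run.
source = os.path.join("project", "src")
nested = os.path.join(source, "optics", "lenses")
print(output_name_for(nested, source))
print(output_name_for(source, source))
```

Flattening with underscores keeps the output folder simple, at the cost that two differently structured paths (for example a_b and a/b) could map to the same name.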
The Code
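```python
import os


def merge_py_files_by_directory(source_directory, target_directory):
    """Merge the .py files of each directory under source_directory into one .txt file."""
    for subdir, dirs, files in os.walk(source_directory):
        # Sort so the order of files in the output is deterministic.
        py_files = sorted(f for f in files if f.endswith('.py'))
        if not py_files:
            continue
        relative_path = os.path.relpath(subdir, start=source_directory)
        if relative_path == os.curdir:
            # The top-level directory maps to ".", which would yield "..txt".
            relative_path = os.path.basename(os.path.abspath(source_directory))
        new_filename = relative_path.replace(os.sep, '_') + '.txt'
        target_file_path = os.path.join(target_directory, new_filename)
        os.makedirs(target_directory, exist_ok=True)
        with open(target_file_path, "w", encoding="utf-8") as outfile:
            for file in py_files:
                file_path = os.path.join(subdir, file)
                # Header with the original path, so each snippet can be traced back.
                outfile.write(f"{'=' * 20}\n")
                outfile.write(f"File: {file_path}\n")
                outfile.write(f"{'=' * 20}\n\n")
                with open(file_path, "r", encoding="utf-8") as infile:
                    outfile.write(infile.read())
                outfile.write("\n\n")


# Example usage
source_directory = 'diffraction'
target_directory = 'merged_py_files'
merge_py_files_by_directory(source_directory, target_directory)
```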
A Few Practical Checks
Before using the output as a knowledge source, it is worth doing a few quick checks:
- Confirm that source_directory points to the project folder you actually want to export.
- Confirm that target_directory is outside the source tree, so generated .txt files are not accidentally reprocessed later.
- Open one or two generated files and check that file paths and code blocks are readable.
- If the repository contains secrets, credentials, API keys, or private configuration, remove them before uploading the text anywhere.
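The second check in the list can be automated. The following is a minimal sketch, assuming a hypothetical helper named is_inside. It resolves both paths and compares them with os.path.commonpath, which avoids the false positives a plain string-prefix test would give (for example, proj_out is not inside proj).

```python
import os


def is_inside(path, parent):
    """Return True if path is parent itself or lies somewhere below it."""
    path = os.path.realpath(path)
    parent = os.path.realpath(parent)
    try:
        return os.path.commonpath([path, parent]) == parent
    except ValueError:
        # Raised on Windows when the two paths are on different drives.
        return False


# Illustrative names matching the example usage in this guide.
if is_inside("merged_py_files", "diffraction"):
    print("Warning: the target directory is inside the source tree.")
```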
For larger projects, you may also want to skip virtual environments, caches, build outputs, or dependency directories. The script can be extended with an ignore list for folders such as .venv, __pycache__, build, or dist.
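One possible way to add such an ignore list is to prune the dirs list that os.walk yields. When os.walk runs top-down (the default), modifying that list in place stops it from descending into the removed folders. The names IGNORED_DIRS and iter_python_files are hypothetical, and the folder set is just the one mentioned above.

```python
import os

# Folder names to skip; adjust to your project (assumed defaults).
IGNORED_DIRS = {".venv", "__pycache__", "build", "dist"}


def iter_python_files(source_directory, ignored_dirs=IGNORED_DIRS):
    """Yield paths of .py files, skipping any directory named in ignored_dirs."""
    for subdir, dirs, files in os.walk(source_directory):
        # In-place assignment is required: os.walk reads this same list
        # to decide which subdirectories to visit next.
        dirs[:] = sorted(d for d in dirs if d not in ignored_dirs)
        for name in sorted(files):
            if name.endswith(".py"):
                yield os.path.join(subdir, name)
```

The same dirs[:] line could be placed at the top of the loop in the merge script, so that ignored folders never produce an output file.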
Conclusion
This approach offers a streamlined method for converting a Python code repository into a plain text format suitable for review or use in a custom knowledge base. By preserving file paths and grouping files by directory, the exported text remains easier to navigate than one large undifferentiated dump.
For small and medium-sized projects, this simple script is often enough. For larger or sensitive repositories, add filtering, encoding handling, and secret checks before using the generated files.
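As a rough illustration of the last two points, the sketch below reads files as UTF-8 while replacing undecodable bytes, and flags lines that look like hard-coded credentials. Both helper names and the regular expression are assumptions made for this example. The pattern is deliberately naive and will miss many real secrets, so treat it as a first pass and not as a replacement for a dedicated secret scanner.

```python
import re

# Naive pattern: a suspicious-looking name assigned a quoted string literal.
SECRET_PATTERN = re.compile(
    r"(api[_-]?key|secret|password|token)\w*\s*[:=]\s*['\"][^'\"]+['\"]",
    re.IGNORECASE,
)


def read_source(path):
    """Read a file as UTF-8, replacing bytes that cannot be decoded."""
    with open(path, "r", encoding="utf-8", errors="replace") as infile:
        return infile.read()


def find_suspicious_lines(text):
    """Return (line_number, line) pairs that match SECRET_PATTERN."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if SECRET_PATTERN.search(line)
    ]
```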
