As an alternative to (npm -g)'ing, here are some potentially useful coreutils one-liners I've been using for a similar purpose:
- Dump all .py files into out.txt (for copy/paste into a LLM)
> find . -name "*.py" -exec cat {} + > out.txt
- Sort all .py files by number of lines
> find . -name '*.py' -exec wc -l {} + | sort -n
I was going to comment this. “What’s wrong with `cat`”, whose job is literally to concatenate files? Or even [uncompressed] `tar` archives, which are basically just a list of files with some headers?
Never underestimate the node community's willingness to ignore the existing tech stack and reinvent 50 year old tools. It's peak NIH.
Love this. I created (half-jokingly, but only half) the concept of a monofile (inspired by our monorepo) in our team. I have not managed to convince my colleagues to switch yet, but maybe this package can help. Unironically, I find that in larger python projects, combining various related sub 100 loc files into one big sub 1000 loc file can do magic to circular import errors and remove 100s of lines of import statements.
To help with circular import, we switched a few years ago to lazily importing submodules on demand, and never switched back.
Just add to your __init__.py files:
    import importlib
    def __getattr__(submodule_name):
        return importlib.import_module('.' + submodule_name, __package__)
And then just import the root module and use it without ever needing to import individual submodules:
    import foo
    def bar():
        return foo.subfoo.bar()
Doesn't that mean your editor support is crap though?
Not at all. Sublime is perfectly fine with it.
I suspect that from the usage in the code, it knows that there is a module foo and a submodule subfoo with a function bar() in it, and it can look directly in the file for the definition of bar().
It would be another story if we used this opportunity to mangle the submodule names, for example, but that's the kind of hidden control flow that nobody wants in their codebase.
Also, it is not some dark art of importing or something: it is pretty standard at this point, since it's one of the sanest ways of breaking circular dependencies between your modules, and the feature of overloading a module's __getattr__ was introduced specifically for this use case. (I couldn't find the specific PEP that introduced it, sorry)
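For reference, module-level __getattr__ comes from PEP 562 (Python 3.7). Below is a minimal, self-contained sketch of the lazy-submodule pattern described above. It builds a throwaway package in a temp directory; foo, subfoo and bar are just the placeholder names used in this thread, and translating ModuleNotFoundError into AttributeError is an added assumption (so that hasattr() keeps behaving), not something the parent comment specified.

```python
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical layout, created on the fly for the demo:
#   foo/__init__.py   lazy loader
#   foo/subfoo.py     defines bar()
root = Path(tempfile.mkdtemp())
pkg = root / "foo"
pkg.mkdir()

(pkg / "__init__.py").write_text(textwrap.dedent('''
    import importlib

    def __getattr__(name):
        # Only called when normal attribute lookup on the package fails.
        try:
            return importlib.import_module("." + name, __name__)
        except ModuleNotFoundError as exc:
            raise AttributeError(name) from exc
'''))
(pkg / "subfoo.py").write_text("def bar():\n    return 'bar from subfoo'\n")

sys.path.insert(0, str(root))
import foo  # importing the package does not import subfoo yet

loaded_before = "foo.subfoo" in sys.modules
result = foo.subfoo.bar()  # first access triggers the import
loaded_after = "foo.subfoo" in sys.modules
```

After the first access the import machinery sets subfoo as a real attribute on the package, so __getattr__ is not called again for it.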
It does, which is why this is more easily done by importing exact bits or using a single file
I usually do this with docker/podman compose files for dev environments.
I see people creating all kinds of mounts and volumes but I just embed files inline under the configs top level key. I even embed shell scripts that way to do one shot/initialization tasks.
The goal is to just have one compose.yml file that the developer can spin up for a local development reproduction of what they need. It's quite nice.
I like splitting a large compose file into smaller units and including them in the main compose file.
I once had a 4k line JavaScript file (a vuex module), which I navigated using / in vim, which came with another 20k lines of tests (also in the single file). I would say 5k lines is the real ceiling.
I've been dreaming of a tool which resembles this, at least in spirit.
I want to figure out how to structure a codebase such that a failing test can spit out a CID for that failure such that it can be remotely recreated (you'd have to be running ipfs so that the remote party can pull the content from you, or maybe you push it to some kind of hub before you share it).
It would be the files relevant to that failure--both code files and data files, stdin, env vars... a reproducible build of a test result.
It would be handy for reporting bugs or getting LLM help. The remote party could respond with a similar "try this" hash which the tooling would then understand how to apply (fetching the necessary bits from their machine, or the hub). Sort of like how Unison resolves functions by cryptographic hash, except this is a link to a function call, so it's got inputs and outputs too.
Of course that's a long way from vomiting everything into a text file, I need to establish functional dependency at as small a granularity as possible, but this feels like the first step on a path that eventually gets us there.
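As a rough illustration of the "hash of a test invocation" idea, here is a minimal sketch that derives one deterministic digest from the files, env vars and stdin of a failing test. It is an assumption-heavy stand-in: failure_digest is a made-up helper, and a plain SHA-256 hex digest is used instead of a real IPFS CID.

```python
import hashlib
import json


def failure_digest(files, env, stdin):
    """Return a SHA-256 hex digest identifying one test invocation.

    files: dict mapping relative path -> bytes content
    env:   dict of the environment variables the test depends on
    stdin: bytes fed to the test
    """
    manifest = {
        # Hash each file separately so the manifest stays small.
        "files": {path: hashlib.sha256(data).hexdigest()
                  for path, data in files.items()},
        "env": env,
        "stdin": hashlib.sha256(stdin).hexdigest(),
    }
    # sort_keys and fixed separators make the serialization canonical,
    # so equal inputs always produce the same digest.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Anything that changes the inputs (a file's bytes, an env var, stdin) changes the digest, which is the property a shareable "try this" link would need.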
Hmm, you could probably make a proof of concept in a weekend, specifically in the TypeScript/JavaScript ecosystem, as it's already heavily reliant on bundlers.
The process could be:
1. Define a new/temporary bundler entry point
2. Copy the failing code into the file
3. Bundle without minification
It'd probably be best to reduce scope by limiting it to a specific testing framework and building it as an extension, e.g. for Jest.
You're talking sense, but I'm kinda wanting to do it at the subprocess level so that caller and callee need not use the same language (I was talking in terms of tests but tests are just a special kind of function).
Whether to use nodejs or python or rust (and which version thereof) will be as much a part of the bundled function as its code. I figure I'll wrap nix so it can replicate the environments, then I'll just have to do the runtime stuff.
With fd: https://github.com/sharkdp/fd
E.g. to combine all .js files into combined.js:
It'd be nice if something similar were available to traverse, say, directories of writings in Markdown, Word, LibreOffice, etc., and output a single text file so I have all my writings in one place. Plus allow plug-ins to extract from more exotic file types not originally included.
seems fairly trivial to chain together something with find, pandoc (https://pandoc.org/MANUAL.html) and cat.
Isn't this a tar file?
That's what I was thinking too. It looks like someone just reinvented tar, and given how it's a JavaScript thing I'm wondering if it's a zoomer who didn't know tar existed and the HN crowd would set them straight. But then I come into the comments here and people are posting about how absolutely brilliant it is, so surely I'm missing something… right?
> someone just reinvented tar
Or "shell archives", .shar files - https://en.wikipedia.org/wiki/Shar - they used to be kicked around in comp.sources.
I can imagine the token counts to be off the charts. How would an LLM handle this input? LLM output quality already drops quite hard at about 3000 tokens, let alone 128k.
Depends on the LLM, perhaps, and/or the problem being solved. I get very good output from 10K–25K token submissions to Anthropic's Claude API.
Similar https://github.com/simonw/files-to-prompt
Similar project: https://github.com/yamadashy/repopack
Repopack with Claude projects has been a game changer for me on repository-wide refactors.
Seems like repopack only packs the repo. How do you apply the refactors back to the project? Is it something that Claude projects does automatically somehow?
for me too
I have a bash script which is very similar to this, except instead of dumping it all into one file, it opens all the matched files as tabs in Zed. Since Zed's AI features let you dump all, or a subset, of open tabs into context, this works great. It gives me a chance to curate the context a little more. And what I'm working on is probably already in an open tab anyway.
This made me laugh. Thanks!
Can you go 1 more step? Is there a way to not just dump someone's project into a plain text file, but somehow intelligently craft it into a ready to go prompt? I could use that!
Here's my user test: https://www.youtube.com/watch?v=sTPTJ4ladiI
> Can you go 1 more step? Is there a way to not just dump someone's project into a plain text file, but somehow intelligently craft it into a ready to go prompt? I could use that!
https://aider.chat/
It does this, and smartly, using tree-sitter, for quite a few tree-sitter supported languages.
Looks very interesting! Thanks for the link!
Cool! I'd like to see an indication of the total number of tokens in the output, so I know right away which LLM I can use this prompt with or, if it's too large, I can rerun the script excluding more files to reduce the token count.
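A very rough way to get that number without a real tokenizer, sketched under a loud assumption: roughly 4 characters per token, which is only a common rule of thumb for English text and will be off for code and for other tokenizers. estimate_tokens and fits are hypothetical helper names.

```python
import math

# Assumed rule of thumb, NOT a real tokenizer; tune per model.
CHARS_PER_TOKEN = 4


def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Crude token estimate: character count divided by an assumed ratio."""
    return math.ceil(len(text) / chars_per_token)


def fits(text, context_window):
    """True if the estimated token count is within the given window size."""
    return estimate_tokens(text) <= context_window
```

For an exact count you'd run the target model's own tokenizer over the output instead.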
One feature you could add is allowing the user to map changes in the concatenated file back to the original files. For example, if an LLM edits the concatenated file, I would want it to return the corresponding filenames and line numbers of the original files.
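One way this mapping could work, as a minimal sketch: record the first combined line number of each file while concatenating, then binary-search that index to translate a combined line back to (filename, line). The concatenate and locate helpers are hypothetical and not part of the tool being discussed; the sketch also assumes the edit did not add or remove lines.

```python
import bisect


def concatenate(files):
    """files: list of (filename, text). Returns (combined_text, index).

    index is a list of (first_combined_line, filename), 1-based, in order.
    """
    lines, index = [], []
    for name, text in files:
        index.append((len(lines) + 1, name))
        lines.extend(text.splitlines())
    return "\n".join(lines), index


def locate(index, combined_line):
    """Map a 1-based line of the combined text to (filename, line_in_file)."""
    if combined_line < 1:
        raise ValueError("line numbers are 1-based")
    starts = [start for start, _ in index]
    # Last file whose first line is <= combined_line.
    pos = bisect.bisect_right(starts, combined_line) - 1
    start, name = index[pos]
    return name, combined_line - start + 1
```

Usage would be something like locate(index, 3) on the index returned by concatenate; handling edits that shift line counts would need a diff against the original combined text first.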
Really nice! I made a small cli tool that has an extra step of basically printing out a tree, so you can ask the ai what files you want to output:
https://github.com/markwylde/ai-toolkit
Why do we need modules at all? [1]
[1] https://erlang.org/pipermail/erlang-questions/2011-May/05876...
We use a C compiler for embedded systems that doesn't support link time optimizations (unless you pay for the pro version, that is). I have been thinking about some tool like this that merges all C source files for compilation.
That's called a "unity" build, isn't it? I was under the impression that it was a relatively well-known technique, such that there are existing tools to merge a set of source files into a single .c file.
Unless I'm misunderstanding you, you could easily do this by #including all your a.c, b.c etc. into one file input.c and feeding that to the compiler.
We did this for a home-grown SoC with a gcc port for which there was no linker.
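Generating that input.c can be automated. A minimal sketch, assuming all sources sit in one flat directory; unity_source is a hypothetical helper and the default name input.c just follows the comment above.

```python
from pathlib import Path


def unity_source(src_dir, output_name="input.c"):
    """Return the text of a unity file that #includes every .c file in src_dir."""
    sources = sorted(
        p.name for p in Path(src_dir).glob("*.c")
        if p.name != output_name  # don't make the unity file include itself
    )
    return "".join(f'#include "{name}"\n' for name in sources)
```

Write the returned text to input.c and compile only that file. The usual caveat applies: everything ends up in one translation unit, so static functions or macros with the same name in different .c files will now collide.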
This is really helpful. I immediately thought it’d be useful for sending off to ChatGPT and then saw that’s what it’s actually for. Thank you!
Surely with storage being pretty slow and everything it would be better to compress it into an archive with really basic compression?
Shouldn't this work?
find /path/to/directory -type f -exec cat {} + > output.txt
vim-ai basically supports this use case out of the box. All you need is an index file listing all the files you want included, starting with
>>> include
This is probably very useful for use with LLMs.
Love the name :D.
> A vomitorium is a passage situated below or behind a tier of seats in an amphitheatre or a stadium through which large crowds can exit rapidly at the end of an event.
> A commonly held but erroneous notion is that Ancient Romans designated spaces called vomitoria for the purpose of literal vomiting, as part of a binge-and-purge cycle
https://en.wikipedia.org/wiki/Vomitorium
Related-ish: https://en.wikipedia.org/wiki/Nosebleed_section
In which the thing being spewed is people
the .sick file extension is a nice touch ^^
…although historically inaccurate.
Be careful with the name, McDonald’s might sue you for copyright infringement.
find ... | xargs head -n -0
The name links up nicely with AI enshittification. Although if you wanted to be pedantic, for that metaphor to work you'd really want to call it "gorge" or something more related to ingestion rather than vomiting. (I'm aware that a vomitorium was the exit from a Roman stadium, so it's not really about throwing up either).