When Small Parquet Files Become a Big Problem (and How I Ended Up Writing a Compactor in PyArrow)
It all began with a fairly normal data pipeline. Events were coming in through Kafka, landing in AWS S3 as Parquet files after going through some lightweight microbatch processing. It looked clean at first glance. Efficient. Predictable. But one day I opened one of the hourly folders and saw the mess I had accidentally created - hundreds of files inside, many of them barely a few kilobytes in size.
It Didn't Seem Urgent - Until It Was
Parquet is supposed to be efficient. It's columnar, compressed and designed for analytics workflows. But what no one warns you about is the moment your pipeline turns into a file factory. All of these tiny files, written frequently and automatically, start to accumulate - not just in number, but in overhead, and that overhead isn’t just technical. In S3 storage classes like Standard-IA, each object has a minimum billable size of 128 KB, which means that even if your file is just a few kilobytes, you’re still charged as if it were 128 KB. When you’re generating hundreds or thousands of microfiles per hour, those costs begin to compound. The total storage might seem small, but the object count, listing time and the way tools like Athena handle fragmented input quickly turn into real bottlenecks. The pipeline wasn’t broken, but it was clearly becoming bloated and inefficient - and I knew I had to do something about it.
I Didn't Want to Overcomplicate It
Before writing any code I did what any decent engineer does - I looked around to see if someone had already solved this better than I could. Spark was the first thing that came to mind - it handles Parquet well, it’s fast and it’s distributed, but it felt like a very heavyweight solution to what was essentially a file housekeeping problem. AWS Glue looked tempting too: it's serverless and works well with Parquet, but it takes a while to spin up and the costs can add up quickly if you're running frequent jobs or dealing with lots of small files. Dask crossed my radar, but I've learned to be cautious with it when dealing with S3. DuckDB was the most fun to consider. I genuinely enjoy working with it locally, but since I was running this pipeline inside a Kubernetes pod with limited memory, I wasn’t convinced it could handle a few hundred files at once without eventually tripping over itself.
PyArrow Was Already There - So I Used It
In the end I kept circling back to PyArrow. I was already using it for smaller pieces of the pipeline and it just fit. No orchestration, no YAML, no spark-submit, no platform overhead. I wasn’t trying to rebuild the world. I wanted to reduce the clutter.
To test things out I grabbed the January 2025 Yellow Taxi trip dataset from the NYC open data portal. I took the full Parquet file and split it into smaller pieces - 696 files, to be exact, each with 5000 rows. That’s roughly the shape and size you might get from a streaming pipeline that flushes frequently or microbatches on small intervals. It felt like a realistic simulation of a production mess.
Here’s the code I used to break it up:
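It looked roughly like this - the file and folder names below are placeholders, so point them at wherever you downloaded the data:

import os
import pyarrow.parquet as pq

SOURCE_FILE = "yellow_tripdata_2025-01.parquet"  # the full monthly file from the NYC portal
OUTPUT_DIR = "small_files"                        # where the small pieces end up
ROWS_PER_FILE = 5000

os.makedirs(OUTPUT_DIR, exist_ok=True)

# Read the full file once, then slice it into 5000-row chunks,
# writing each chunk out as its own tiny Parquet file.
table = pq.read_table(SOURCE_FILE)

for i, start in enumerate(range(0, table.num_rows, ROWS_PER_FILE)):
    chunk = table.slice(start, ROWS_PER_FILE)
    pq.write_table(chunk, os.path.join(OUTPUT_DIR, f"part_{i:05d}.parquet"))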
A quick aside: pyarrow.dataset lets you treat a collection of files like one logical table, which makes it really easy to load, transform, or compact data across many files without having to read everything into memory at once. If you haven’t used it before, it’s worth checking out. Here’s the documentation.
Once I had the folder full of small files, I wrote a short script using pyarrow.dataset to read all of them, merge them into a single dataset, and write them back out compacted. You can play with parameters like min_rows_per_group, max_rows_per_group, and max_rows_per_file to tune the output size - and it’s amazing how much control you get from just a few lines of code. Here's what the compaction part looked like:
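In essence it boiled down to this - the folder names and row-group limits below are example values to tune for your own data, not anything definitive:

import pyarrow.dataset as ds

INPUT_DIR = "small_files"   # folder full of tiny Parquet files
OUTPUT_DIR = "compacted"    # where the merged output goes

# Treat every small file in the folder as one logical dataset.
dataset = ds.dataset(INPUT_DIR, format="parquet")

# Write it back out as a handful of larger files with sensible row groups.
ds.write_dataset(
    dataset,
    OUTPUT_DIR,
    format="parquet",
    min_rows_per_group=100_000,
    max_rows_per_group=250_000,
    max_rows_per_file=1_000_000,
    existing_data_behavior="overwrite_or_ignore",
)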
And just like that, I went from almost 700 microfiles to a handful of compacted ones - each large enough to be cost-efficient in S3 and fast to scan in downstream analytics.
It worked perfectly during testing, when I ran it on a small folder with just a handful of files. But real data is messier: some hours are busier than others, and depending on traffic a single hourly folder might end up with thousands of files. That’s where things started to fall apart. PyArrow’s dataset API tries to read them all at once, and while it gives you a lot of control, it also assumes you know what kind of load you're about to put on your system. I didn’t think much of it until I ran the compactor in a Kubernetes pod and hit an out-of-memory error. Turns out that even small Parquet files can add up fast, especially once you account for all the metadata and row groups that get loaded into memory behind the scenes.
Batching Was the Fix
So I rewrote the whole thing with batching in mind. Instead of loading every file at once I grouped them into chunks of 100, read each batch, compacted it into a new file, cleaned up memory and moved on. It was simple, but effective. Here’s what the final version looked like:
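Roughly like this - again, the paths, batch size and row-group size are illustrative values rather than the exact production settings:

import gc

import pyarrow.dataset as ds
import pyarrow.parquet as pq
from pyarrow import fs

INPUT_DIR = "small_files"    # folder of tiny Parquet files
OUTPUT_DIR = "compacted"     # destination for the merged files
BATCH_SIZE = 100             # how many small files go into each output file

filesystem = fs.LocalFileSystem()  # swap for fs.S3FileSystem(...) to run against S3
filesystem.create_dir(OUTPUT_DIR, recursive=True)

# List every small Parquet file in the input folder.
infos = filesystem.get_file_info(fs.FileSelector(INPUT_DIR))
paths = sorted(
    info.path
    for info in infos
    if info.type == fs.FileType.File and info.path.endswith(".parquet")
)

for batch_num, start in enumerate(range(0, len(paths), BATCH_SIZE)):
    batch_paths = paths[start:start + BATCH_SIZE]

    # Read just this batch of files into memory.
    table = ds.dataset(batch_paths, format="parquet", filesystem=filesystem).to_table()

    # Write one compacted, GZIP-compressed file per batch.
    out_path = f"{OUTPUT_DIR}/compacted_{batch_num:04d}.parquet"
    pq.write_table(
        table,
        out_path,
        filesystem=filesystem,
        compression="gzip",
        row_group_size=100_000,
    )

    # Drop the batch before moving on so memory stays flat.
    del table
    gc.collect()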
This version is just as fast for small volumes but doesn’t choke when you throw thousands of files at it. You can tune the batch size depending on your available memory, and the output files will still be cleanly structured and compressed.
Want to Run This on S3?
If you're testing locally, all you need to switch to S3 is one line. Replace fs.LocalFileSystem() with fs.S3FileSystem(region="us-east-1"), and it will just work. If you’ve configured AWS credentials in your environment, PyArrow will find them. If not, you can pass them directly like this:
s3 = fs.S3FileSystem(
access_key="YOUR_ACCESS_KEY",
secret_key="YOUR_SECRET_KEY",
region="us-east-1"
)
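One detail worth remembering: when you pass an explicit S3FileSystem, paths are written as bucket/prefix without the s3:// scheme, so the input and output folders end up looking something like this (the bucket and prefix here are placeholders):

INPUT_DIR = "my-bucket/events/2025/01/15/10"             # hourly folder of small files
OUTPUT_DIR = "my-bucket/events-compacted/2025/01/15/10"  # destination for merged files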
And That Was It
With those changes in place, the compactor stopped being a quick experiment and started acting like a real tool. It read hundreds of small Parquet files, merged them in manageable batches and wrote them out with clean GZIP compression and reasonable row group sizes. The output was tidy, fast to scan and cheap to store. And memory? Solid. I monitored usage during the first few runs, but after that it just ran quietly in the background, doing its job without drawing attention, which is honestly the best thing you can say about any data pipeline component. I’ve had it running in production for a while now and I haven’t needed to touch it since.
If You’re Drowning in Tiny Files Too...
You probably don’t need Spark. You probably don’t need Glue. What you need is a little script that knows how to behave, keeps its memory to itself and quietly cleans up after your streaming pipeline. PyArrow might not be flashy, but for jobs like this, it’s exactly what you want - just enough control to get it done without dragging in a whole data platform.
📂 Code is available in the repo
📚 Learn more about pyarrow.dataset