20 Dec 2023

How to split a big CSV file using the Linux or macOS Terminal

So, you’ve got your hands on a massive CSV file, and the task at hand is to break it down into smaller, more manageable chunks. Whether it’s for avoiding limitations in CSV editors or for ease of handling, the Linux or macOS Terminal has got you covered with a straightforward utility: split. For a detailed reference, consult the split man page.

Basic Usage

The split command can break up files by number of lines, by number of bytes, or into a specified number of parts. This flexibility makes it suitable for both text-based and binary files. For our CSV file, which is essentially a structured text file, we’ll use the lines option, -l, which sets the maximum number of lines per output file.

split -l 100000 filename.csv

This basic command generates segmented files labeled xaa, xab, xac, and so forth. To further organize and enhance the filenames, let’s explore additional features.
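As a quick sanity check, the sketch below builds a small throwaway CSV, splits it, and compares line counts before and after. The file name sample.csv, the 10 data rows, and the chunk size of 4 are assumptions chosen only to keep the example small.

```shell
# Work in a scratch directory so the generated x* files stay out of the way.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a hypothetical sample: 1 header line + 10 data rows = 11 lines.
{ echo "id,value"; for n in $(seq 1 10); do echo "$n,row$n"; done; } > sample.csv

# Split into chunks of at most 4 lines each: xaa, xab, xac.
split -l 4 sample.csv

# The chunks together should hold exactly as many lines as the original.
# $(( ... )) strips the padding that some wc implementations add.
original=$(( $(wc -l < sample.csv) ))
chunks=$(( $(cat x?? | wc -l) ))
echo "original: $original lines, chunks: $chunks lines"
```

If the two numbers differ, something went wrong during the split and the chunks should not be used.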

Adding Structure with Prefixes and Numeric Suffixes

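The split command allows customization through two useful parameters: -d for numeric suffixes, and an optional prefix for the generated files, given as the last argument after the input filename. Note that -d is a GNU coreutils option; if the split shipped with your macOS version rejects it, GNU split is available through Homebrew’s coreutils package as gsplit. Here’s an example: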

split -l 100000 -d filename.csv file_

This command results in files named file_00, file_01, file_02, and so on (numeric suffixes start at 00). To refine the naming convention, we can append “.csv” to all generated files:

for i in file_*; do mv -- "$i" "$i.csv"; done
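With GNU split (the default on most Linux distributions), the rename step may be avoidable: the --additional-suffix option appends a fixed string to every generated name. The sketch below assumes GNU coreutils; the split bundled with macOS may not support this option, so treat it as an alternative rather than a replacement. The sample file and the chunk size of 4 are made up for illustration.

```shell
# Scratch directory with a hypothetical 11-line sample file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
{ echo "id,value"; for n in $(seq 1 10); do echo "$n,row$n"; done; } > sample.csv

# GNU split only: numeric suffixes plus a ".csv" extension in a single step.
split -l 4 -d --additional-suffix=.csv sample.csv file_

# List the generated parts: file_00.csv, file_01.csv, file_02.csv.
ls file_*.csv
```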

Ensuring Consistent Headers

Dealing with CSV files means contending with header lines. After a line-based split, only the first file (file_00.csv) contains the header from the original file; every other file starts directly with data rows. To give each file the same header, run the following loop.

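for i in file_*.csv; do
    # file_00.csv already starts with the header
    [ "$i" = file_00.csv ] && continue
    # write header + original content to a temporary file, then swap it in
    { head -n 1 file_00.csv; cat "$i"; } > "$i.tmp" && mv "$i.tmp" "$i"
done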

This loop prepends the first line (header) of file_00.csv to every other split file, so each segment starts with the same header.
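If a separate header pass feels clumsy, splitting and header insertion can also be done in one pass with awk instead of split. This is a different technique from the one described above, shown only as a minimal sketch: the part_ prefix, the rows variable, and the sample data are all hypothetical. Here rows counts data rows per part, with the header added on top. Like split -l, it assumes every CSV record occupies exactly one line (no quoted fields containing line breaks).

```shell
# Scratch directory with a hypothetical 11-line sample file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
{ echo "id,value"; for n in $(seq 1 10); do echo "$n,row$n"; done; } > sample.csv

# rows = data rows per part (header not counted); part_ is a made-up prefix.
awk -v rows=4 '
    NR == 1 { header = $0; next }      # remember the header, do not emit it yet
    (NR - 2) % rows == 0 {             # first data row of a new part
        if (out != "") close(out)      # close the previous part before opening another
        out = sprintf("part_%02d.csv", part++)
        print header > out             # every part starts with the header
    }
    { print > out }                    # write the current data row to the open part
' sample.csv

ls part_*.csv
```

With the sample above this yields part_00.csv and part_01.csv with four data rows each and part_02.csv with the remaining two, each beginning with the header line.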

Putting It All Together

To summarize the process:

Step 1: Split the CSV file into parts of 100,000 lines, using numeric suffixes and the prefix “file_” for the generated files.

split -l 100000 -d filename.csv file_

Step 2: Add “.csv” to all generated files.

for i in file_*; do mv -- "$i" "$i.csv"; done

Step 3: Copy the header from the first generated file to the beginning of the other files.

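for i in file_*.csv; do
    # file_00.csv already starts with the header
    [ "$i" = file_00.csv ] && continue
    # write header + original content to a temporary file, then swap it in
    { head -n 1 file_00.csv; cat "$i"; } > "$i.tmp" && mv "$i.tmp" "$i"
done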
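To check the result, it can help to run the whole procedure on a small sample first. The sketch below is one way to do that, under stated assumptions: it creates a made-up 11-line filename.csv in a scratch directory, uses a chunk size of 4 instead of 100000, and relies on a split that supports -d. It then verifies that every part starts with the header and that no data rows were lost.

```shell
# Scratch directory with a hypothetical sample: 1 header + 10 data rows.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
{ echo "id,value"; for n in $(seq 1 10); do echo "$n,row$n"; done; } > filename.csv

# The three steps, run with a chunk size of 4.
split -l 4 -d filename.csv file_
for i in file_*; do mv -- "$i" "$i.csv"; done
for i in file_*.csv; do
    [ "$i" = file_00.csv ] && continue
    { head -n 1 file_00.csv; cat "$i"; } > "$i.tmp" && mv "$i.tmp" "$i"
done

# Check 1: every part begins with the original header.
header=$(head -n 1 filename.csv)
bad=0
for i in file_*.csv; do
    [ "$(head -n 1 "$i")" = "$header" ] || { echo "missing header: $i"; bad=1; }
done

# Check 2: data rows (all lines minus one header per part) match the original.
parts=$(ls file_*.csv | wc -l)
total=$(cat file_*.csv | wc -l)
echo "data rows in parts: $((total - parts)), in original: $(( $(wc -l < filename.csv) - 1 ))"
```

Both numbers printed at the end should be equal, and no "missing header" line should appear.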

By following these steps, you can efficiently manage and manipulate large CSV files in a Unix-like environment using the command line.

Luca Palonca