How to split a big CSV file using Linux or macOS Terminal?
So, you’ve got your hands on a massive CSV file, and the task at hand is to break it down into smaller, more manageable chunks. Whether it’s for avoiding limitations in CSV editors or for ease of handling, the Linux or macOS Terminal has got you covered with a straightforward utility: split. For a detailed reference, consult the split man page.
Basic Usage
The split command can break up files by lines, by bytes, or into a specified number of parts, which makes it suitable for both text-based and binary files. For our CSV file, which is essentially a structured text file, we’ll use the line-count option, -l.
split -l 100000 filename.csv
This basic command writes chunks of up to 100,000 lines each to files named xaa, xab, xac, and so forth. To make the filenames more useful, let’s explore additional features.
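To get a feel for the line and byte modes before running them on a large file, here is a small sketch on throwaway data. The scratch directory, the sample file, and the lines_ and bytes_ prefixes are assumptions made up for this illustration.

```shell
# Scratch directory and sample data are hypothetical, created only for this demo.
demo_dir=$(mktemp -d)

# One header line plus 10 data rows = 11 lines in total.
{ echo "id,value"; seq 1 10 | sed 's/$/,x/'; } > "$demo_dir/sample.csv"

# Split by lines: at most 4 lines per chunk (lines_aa, lines_ab, lines_ac).
split -l 4 "$demo_dir/sample.csv" "$demo_dir/lines_"

# Split by bytes: at most 20 bytes per chunk. This can cut a row in half,
# so it is rarely a good fit for CSV data.
split -b 20 "$demo_dir/sample.csv" "$demo_dir/bytes_"

# Show how many lines ended up in each line-based chunk.
wc -l "$demo_dir"/lines_*
```

Concatenating the chunks in name order (cat lines_*) reproduces the original file, which is a handy way to confirm that nothing was lost.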
Adding Structure with Prefixes and Numeric Suffixes
The split command allows customization through two useful parameters: -d for numeric suffixes, and a prefix for the generated files, given as the last argument. Here’s an example:
split -l 100000 -d filename.csv file_
This command results in files named file_00, file_01, file_02, and so on (numeric suffixes start at 00). To refine the naming convention, we can append “.csv” to all generated files:
for i in file_*; do mv "$i" "$i.csv"; done
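Two-digit suffixes allow only 100 distinct names (00 to 99), so a very large file may need longer suffixes, which the -a option sets. The sketch below also shows --additional-suffix, which would make the rename loop unnecessary; it is a GNU coreutils option, so treat its availability as an assumption and expect that the split shipped with macOS may not accept it. The sample data and the part_ and chunk_ prefixes are hypothetical.

```shell
# Throwaway data for the demo: 25 numbered lines in a scratch directory.
demo_dir=$(mktemp -d)
seq 1 25 > "$demo_dir/numbers.txt"

# -a 3 asks for three-character suffixes: part_000, part_001, part_002.
split -l 10 -d -a 3 "$demo_dir/numbers.txt" "$demo_dir/part_"

# GNU coreutils only (assumption: GNU split is installed):
# append ".csv" at split time instead of renaming afterwards.
split -l 10 -d --additional-suffix=.csv "$demo_dir/numbers.txt" "$demo_dir/chunk_"

ls "$demo_dir"
```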
Ensuring Consistent Headers
Dealing with CSV files means contending with header lines. Because split cuts purely by line count, only the first chunk (file_00.csv) keeps the header from the original file. To add it to the other chunks, run the following loop.
for i in file_*.csv; do
  [ "$i" = "file_00.csv" ] && continue
  { head -n 1 file_00.csv; cat "$i"; } > "$i.tmp" && mv "$i.tmp" "$i"
done
This script prepends the first line (the header) of file_00.csv to every other split file, so each segment starts with the same header.
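An alternative that avoids rewriting every chunk is to split the file and add the header in a single pass with awk. The following is a minimal sketch of that idea, not the method described above; the function name split_with_header, its arguments, and the sample data are assumptions for illustration. Like split -l, it counts physical lines, so it assumes no quoted field contains an embedded newline.

```shell
# split_with_header FILE ROWS PREFIX  (hypothetical helper, not part of split)
# Writes PREFIX00.csv, PREFIX01.csv, ... each starting with the header line
# of FILE followed by at most ROWS data rows.
split_with_header() {
    awk -v rows="$2" -v prefix="$3" '
        NR == 1 { header = $0; next }      # remember the header, do not count it
        (NR - 2) % rows == 0 {             # time to start a new chunk
            if (out != "") close(out)      # keep only one output file open
            out = sprintf("%s%02d.csv", prefix, part++)
            print header > out
        }
        { print > out }                    # write the data row to the current chunk
    ' "$1"
}

# Demo on throwaway data: a header plus 7 rows, 3 rows per chunk.
demo_dir=$(mktemp -d)
printf 'id,name\n1,a\n2,b\n3,c\n4,d\n5,e\n6,f\n7,g\n' > "$demo_dir/sample.csv"
split_with_header "$demo_dir/sample.csv" 3 "$demo_dir/file_"
head -n 1 "$demo_dir"/file_*.csv
```

With this approach every chunk holds the same number of data rows (except possibly the last), whereas splitting first and adding headers later leaves the first chunk one data row short of the others.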
Putting It All Together
To summarize the process:
Step 1: Split the CSV file into parts of 100,000 lines and prefix the generated files with “file_”.
split -l 100000 -d filename.csv file_
Step 2: Add “.csv” to all generated files.
for i in file_*; do mv "$i" "$i.csv"; done
Step 3: Copy the header from the first generated file to the beginning of the other files.
for i in file_*.csv; do
  [ "$i" = "file_00.csv" ] && continue
  { head -n 1 file_00.csv; cat "$i"; } > "$i.tmp" && mv "$i.tmp" "$i"
done
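Once the three steps have run, a quick sanity check is to confirm that the chunks together contain as many data rows as the original. The sketch below assumes every file starts with exactly one header line and ends with a newline; the helper name data_rows and the sample files are hypothetical.

```shell
# data_rows FILE...  (hypothetical helper)
# Prints the total number of lines across the files, minus one header per file.
data_rows() {
    total=0
    for f in "$@"; do
        lines=$(wc -l < "$f")
        total=$((total + lines - 1))
    done
    echo "$total"
}

# Demo on throwaway data that mimics the result of steps 1 to 3.
demo_dir=$(mktemp -d)
printf 'id,name\n1,a\n2,b\n3,c\n' > "$demo_dir/filename.csv"
printf 'id,name\n1,a\n2,b\n'      > "$demo_dir/file_00.csv"
printf 'id,name\n3,c\n'           > "$demo_dir/file_01.csv"

original=$(data_rows "$demo_dir/filename.csv")
chunks=$(data_rows "$demo_dir"/file_*.csv)

if [ "$original" -eq "$chunks" ]; then
    echo "OK: $chunks data rows in both"
else
    echo "Mismatch: original has $original rows, chunks have $chunks"
fi
```

On real data, the same two calls would be pointed at filename.csv and file_*.csv in the working directory.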
By following these steps, you can efficiently manage and manipulate large CSV files in a Unix-like environment using the command line.