Jacob Mulquin

Merging and Splitting Documents on Linux

Split, merge, slice, join, these knives can do it all!

✍️ Jacob Mulquin
📅 11/04/2022

As part of my job, I deal with documents sent by people using all sorts of devices and with varying levels of technical skill. This leads to some interesting situations: documents arrive as photos of printouts, multiple scanned pages, screenshots, or a document that has been printed and then scanned to email. You get the idea... There's a lot of variation.

At work we use Windows, but I find the tools on Windows lacking for splitting and merging. You usually have to load bulky GUIs that lock features behind paywalls, no thanks! (I'm aware that the following tools are open source and available on Windows, but let's not let facts get in the way of some Linux evangelism.)

Thanks to the wonders of open source and many dedicated developers, there are numerous tools at your disposal if you want to merge or split documents. Thank you, community!

Merging documents

The first option is the convert program, part of ImageMagick. I use this one frequently when I have numerous JPG images that need to be turned into a single PDF. You can turn the -quality value down significantly if someone has sent through absurd 9MB images of each page.

convert -quality 100 -rotate 0 *.jpg output.pdf
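Two things can bite with a command like this: the shell expands *.jpg in lexical order, so page10.jpg lands before page2.jpg, and a -quality of 100 does nothing to shrink oversized images. Here is a minimal sketch that handles both, assuming GNU sort (for -V) and file names without newlines. The function name jpgs_to_pdf is made up for illustration, not part of ImageMagick.

```shell
# Hypothetical helper: merge JPGs into one PDF in natural (version) order.
# Usage: jpgs_to_pdf QUALITY OUTPUT.pdf IMAGE...
jpgs_to_pdf() {
    local quality="$1" output="$2"
    shift 2
    local -a sorted
    # sort -V orders page2.jpg before page10.jpg (GNU coreutils).
    mapfile -t sorted < <(printf '%s\n' "$@" | sort -V)
    convert -quality "$quality" "${sorted[@]}" "$output"
}
```

Something like jpgs_to_pdf 60 output.pdf *.jpg would then write the pages in numeric order at a lower JPEG quality. The right quality value depends on the scans, so treat 60 as a starting point.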

I find that using convert when the input file is a PDF can sometimes lead to bad image quality. In cases where convert doesn't do the job, the gs (Ghostscript) program saves the day:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/default -dNOPAUSE -dQUIET -dBATCH -dDetectDuplicateImages -dCompressFonts=true -r150 -sOutputFile=gs.pdf *.pdf
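Much of the size versus quality trade-off in that command comes from -dPDFSETTINGS. Besides /default, Ghostscript's pdfwrite device accepts the presets /screen, /ebook, /printer and /prepress, with /screen giving the smallest files. The sketch below wraps the merge so the preset can be picked per run. The name gs_merge is hypothetical, and the flag list is trimmed to the essentials.

```shell
# Hypothetical wrapper around the Ghostscript merge command.
# Usage: gs_merge PRESET OUTPUT.pdf INPUT.pdf...
gs_merge() {
    local preset="$1" output="$2"
    shift 2
    # Only allow the preset names that pdfwrite documents.
    case "$preset" in
        screen|ebook|printer|prepress|default) ;;
        *) echo "unknown preset: $preset" >&2; return 1 ;;
    esac
    gs -sDEVICE=pdfwrite -dPDFSETTINGS="/$preset" \
       -dNOPAUSE -dQUIET -dBATCH \
       -sOutputFile="$output" "$@"
}
```

For example, gs_merge ebook merged.pdf *.pdf should produce a noticeably smaller file than /default when the inputs are full of high-resolution scans.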

The third option is Poppler's pdfunite. I haven't used it much, but I have used Poppler's excellent pdftotext program in the past.

pdfunite *.pdf output.pdf
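One catch with the *.pdf glob: if output.pdf is already sitting in the directory from an earlier run, the shell includes it in the expansion, so it gets passed as an input as well as the destination. Here is a sketch that filters the output file out of the input list and sorts the rest naturally, assuming GNU sort and file names without newlines. Both function names are hypothetical.

```shell
# Hypothetical helper: print every *.pdf in the current directory except the
# output file, one per line, in natural order (requires GNU sort for -V).
pdf_inputs() {
    local output="$1" f
    for f in *.pdf; do
        [ -e "$f" ] || continue           # glob matched nothing
        [ "$f" = "$output" ] && continue  # skip a stale output file
        printf '%s\n' "$f"
    done | sort -V
}

# Merge everything except the output file into the output file.
merge_all() {
    local output="${1:-output.pdf}"
    local -a inputs
    mapfile -t inputs < <(pdf_inputs "$output")
    pdfunite "${inputs[@]}" "$output"
}
```

Running merge_all output.pdf twice in a row then gives the same result both times, rather than folding the first output into the second.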

Splitting a document

Using ImageMagick (pages are numbered from 0, so [0-1] selects pages 1 and 2)

convert 'input.pdf[0-1]' output.pdf
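ImageMagick counts pages from zero, whereas Ghostscript and Poppler count from one, so converting by hand is easy to get wrong. The square brackets are also glob characters to the shell, so the argument is safer quoted. The sketch below does the off-by-one shift for you. It also sets -density, since ImageMagick rasterises PDF pages and the default resolution tends to look rough. The name im_extract and the 150 DPI value are assumptions for illustration.

```shell
# Hypothetical helper: extract 1-based pages FIRST..LAST with ImageMagick.
# Usage: im_extract INPUT.pdf FIRST LAST OUTPUT.pdf
im_extract() {
    local input="$1" first="$2" last="$3" output="$4"
    # ImageMagick indexes pages from 0, so shift both ends down by one.
    local range="$((first - 1))-$((last - 1))"
    # -density goes before the input so it applies when the PDF is rasterised.
    convert -density 150 "${input}[${range}]" "$output"
}
```

With this, im_extract input.pdf 1 2 output.pdf asks for the same two pages as the Ghostscript -dFirstPage=1 -dLastPage=2 form.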

Using Ghostscript

gs -dNOPAUSE -dQUIET -dBATCH -sOutputFile="output.pdf" -dFirstPage=1 -dLastPage=2 -sDEVICE=pdfwrite "input.pdf"

Using Poppler's pdfseparate

pdfseparate input.pdf output-%d.pdf
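Note that pdfseparate writes one file per page (the %d is replaced by the page number), so getting pages 1-2 as a single document takes a second step. A rough sketch that pairs it with pdfunite, using pdfseparate's -f and -l options to limit the page range. The function name poppler_extract is hypothetical.

```shell
# Hypothetical helper: extract 1-based pages FIRST..LAST into one PDF
# using Poppler's pdfseparate and pdfunite.
# Usage: poppler_extract INPUT.pdf FIRST LAST OUTPUT.pdf
poppler_extract() {
    local input="$1" first="$2" last="$3" output="$4"
    local tmpdir page
    local -a pages=()
    tmpdir="$(mktemp -d)" || return 1
    # -f/-l limit the split to the requested page range.
    pdfseparate -f "$first" -l "$last" "$input" "$tmpdir/page-%d.pdf" || return 1
    for ((page = first; page <= last; page++)); do
        pages+=("$tmpdir/page-$page.pdf")
    done
    if ((${#pages[@]} == 1)); then
        cp "${pages[0]}" "$output"        # a single page needs no merging
    else
        pdfunite "${pages[@]}" "$output"
    fi
    local status=$?
    rm -rf "$tmpdir"                      # clean up the per-page files
    return "$status"
}
```

For example, poppler_extract input.pdf 5 7 output.pdf would leave pages 5 to 7 in output.pdf without scattering single-page files around the working directory.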

pdfsplit

I threw together a script to help me split PDFs using a pattern, very originally named pdfsplit.sh.

You pass the script a pattern, much like you can with GUI programs. For example, say I want pages 1-2 as a document, then 3 standalone, 4 standalone, and finally 5-7 as a document. I would pass it "1-2, 3, 4, 5-7".

If you know of a program that can do this please let me know.

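#!/bin/bash

# ./pdfsplit.sh input.pdf pattern [output-prefix]
# Written by Jacob Mulquin (https://mulquin.com), 2022

if [[ $# -lt 2 ]]; then
    echo "Usage: $0 input.pdf pattern [output-prefix]" >&2
    exit 1
fi

INPUT_FILE="$1"
# Strip all whitespace so "1-2, 3" and "1-2,3" are equivalent
PATTERN="${2//[[:space:]]/}"
# Default the output prefix to the input file name
OUTPUT_PREFIX="${3:-$INPUT_FILE}"

# Split the pattern on commas into an array of pieces
IFS=',' read -ra PIECES <<< "$PATTERN"

for i in "${PIECES[@]}"
do
    # A single page is a range that starts and ends on the same page
    FIRST_PAGE="$i"
    LAST_PAGE="$i"
    if [[ "$i" == *"-"* ]]; then
        FIRST_PAGE="${i%%-*}"
        LAST_PAGE="${i##*-}"
    fi

    # Each piece is written to <prefix>-<first page>.pdf
    gs -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$OUTPUT_PREFIX-$FIRST_PAGE.pdf" -dFirstPage="$FIRST_PAGE" -dLastPage="$LAST_PAGE" -sDEVICE=pdfwrite "$INPUT_FILE"
done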
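
The script trusts the pattern as given, so a typo such as 3-1 or 2-x only surfaces once Ghostscript is handed the bad page numbers. Here is a sketch of a validation step that could run before the loop. parse_pattern is a hypothetical helper, not part of the script above, and it prints one "FIRST LAST" pair per line.

```shell
# Hypothetical helper: validate a pattern like "1-2, 3, 5-7" and print one
# "FIRST LAST" pair per line. Returns non-zero on the first invalid piece.
parse_pattern() {
    local pattern="${1//[[:space:]]/}" piece first last
    local -a pieces
    IFS=',' read -ra pieces <<< "$pattern"
    for piece in "${pieces[@]}"; do
        # Accept "N" or "N-M" where both are positive integers.
        if [[ "$piece" =~ ^([1-9][0-9]*)(-([1-9][0-9]*))?$ ]]; then
            first="${BASH_REMATCH[1]}"
            last="${BASH_REMATCH[3]:-$first}"
        else
            echo "invalid piece: '$piece'" >&2
            return 1
        fi
        if ((first > last)); then
            echo "reversed range: '$piece'" >&2
            return 1
        fi
        printf '%s %s\n' "$first" "$last"
    done
}
```

The output could be checked once up front, and the Ghostscript loop run only if parse_pattern succeeds. Checking that the last page does not exceed the document's page count is left out here.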