Merging and Splitting PDFs
In this lesson, you’ll learn how to merge and split PDFs. It’s super useful to know how to manipulate PDFs in those ways. You can also check out the argparse module.
00:00 Welcome back to the Real Python course on how to work with PDFs in Python. This is part 4, where you will learn how to merge and split PDFs.
00:09 There are many times when you will want to merge two or more PDFs into a single file. For example, you might have a cover page that has to be used on many different report types.
00:19 Python can help you do this. For this example, open up a PDF and print a single page out as a separate PDF. Then do it again but for a different page. This will give you a couple of files to work with.
00:32 What I’ve done is take the Jupyter_Notebook.pdf from the previous video and split out its first two pages.
00:41 Now let’s take a look at the code that will help you merge PDFs. Much like the page rotation example, we again have to import a PdfFileReader as well as a PdfFileWriter.
00:53 Now, you can use the merge_pdfs() function when you have a list of PDFs to merge together. You will also need to know where to save the result, so along with the list of input paths, it also takes an output path, just there.
01:09 You then loop over the inputs and create a pdf_reader object per input, just there. Next, you iterate over each of the pages in the PDF file and use .addPage(), just there, to add each page to the writer object.
01:27 Once every page of every PDF in the list has been added, the result is written out at the end, just there.
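The on-screen code isn’t reproduced in this transcript, so here is a minimal sketch of what a merge_pdfs() function along the lines just described could look like. It assumes a PyPDF2 release older than 3.0, where the PdfFileReader, PdfFileWriter, and .addPage() names from the lesson are available. The parameter names paths and output, and the getNumPages() and getPage() calls, are assumptions rather than something taken from the video.

```python
def merge_pdfs(paths, output):
    """Append every page of each PDF in `paths` and save the result to `output`."""
    # Legacy PyPDF2 (< 3.0) names, as mentioned in the lesson. The import sits
    # inside the function so PyPDF2 is only needed when the function is called.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    pdf_writer = PdfFileWriter()

    for path in paths:
        # One reader per input file.
        pdf_reader = PdfFileReader(path)
        # Copy the pages across one at a time, in their original order.
        for page_number in range(pdf_reader.getNumPages()):
            pdf_writer.addPage(pdf_reader.getPage(page_number))

    # Write everything that was collected into a single output file.
    with open(output, "wb") as out:
        pdf_writer.write(out)
```

With the file names from the demo, a call could look like merge_pdfs(["document1.pdf", "document2.pdf"], output="merged.pdf"). The order of the list decides the order of the pages in the result.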
01:37 Something that should be pointed out is that this script could be enhanced with an option for specifying a page range, in case you don’t want to merge an entire PDF.
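One possible way to approach the page-range idea is a small helper that turns a human-friendly range into the zero-based page indexes a merge loop would iterate over. The helper name pages_in_range() and its 1-based, inclusive first and last parameters are hypothetical and not part of the lesson’s script.

```python
def pages_in_range(num_pages, first=1, last=None):
    """Return zero-based page indexes for the 1-based, inclusive range first..last.

    Hypothetical helper: `last=None` means "through the final page".
    """
    if last is None:
        last = num_pages
    if not 1 <= first <= last <= num_pages:
        raise ValueError(
            f"invalid page range {first}-{last} for a {num_pages}-page PDF"
        )
    # Page 1 is index 0, so shift the start down by one; range() excludes
    # its stop value, which makes `last` inclusive.
    return list(range(first - 1, last))
```

In a merge loop, the indexes returned here would replace a plain range over all the pages of the reader, so that only the selected pages are passed to .addPage().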
01:47 And if you’re up for a real challenge, you could create a command line interface for this script by using Python’s argparse module. Let’s take a quick look at this script in action.
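As a starting point for the argparse challenge mentioned here, the following sketch only builds the command line parser. The argument names (inputs, -o/--output) and the merged.pdf default are assumptions chosen for illustration; wiring the parsed values into merge_pdfs() is left as part of the exercise.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical merge command line interface."""
    parser = argparse.ArgumentParser(
        description="Merge several PDFs into a single file."
    )
    # One or more input files; they would be merged in the order given.
    parser.add_argument("inputs", nargs="+", help="PDF files to merge, in order")
    # Where to save the merged result.
    parser.add_argument(
        "-o", "--output", default="merged.pdf", help="path of the merged PDF"
    )
    return parser
```

Calling build_parser().parse_args() inside the script reads the real command line, and the resulting args.inputs and args.output could then be handed to merge_pdfs().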
01:58 In my working folder, I’ve got a couple of pages named document1.pdf and document2.pdf. So if we run that, it’s run successfully, which is a good start.
02:11 Then I open the merged.pdf, and you can see page 2 of the Jupyter_Notebook.pdf, followed by page 1. Now, I did that intentionally so that you could see it had worked. It’s called merged up here, which you can see is what I’ve named it just there. Now to take a look at the opposite of merging: splitting.
02:33 Splitting is particularly useful for documents that have a lot of scanned-in content, but there are a lot of reasons for wanting to split a PDF. The example we’re going to look at is how you could use the PyPDF2 module to split a PDF into multiple files. You start by once again creating a reader object and looping over the PDF pages, just there. For each page in the PDF, you create a new pdf_writer instance and add a single page to it, right there.
03:06 You then write that page out to a uniquely named file. When the script is finished running, you should have each page of the original PDF split out into a separate PDF with a unique name.
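The splitting steps just described could be sketched roughly as follows. As with merging, this assumes a PyPDF2 release older than 3.0. The function name split_pdf(), the prefix parameter, the 1-based numbering in the output names, and the returned list of names are all assumptions, since the naming scheme used on screen isn’t stated in the transcript.

```python
def split_pdf(path, prefix):
    """Write each page of the PDF at `path` to its own file.

    Output files are named `<prefix><page number>.pdf`, counting from 1
    (an assumed naming scheme). Returns the list of file names written.
    """
    # Legacy PyPDF2 (< 3.0) names; imported here so PyPDF2 is only needed
    # when the function is actually called.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    pdf_reader = PdfFileReader(path)
    written = []

    for page_number in range(pdf_reader.getNumPages()):
        # A fresh writer per page, so every output holds exactly one page.
        pdf_writer = PdfFileWriter()
        pdf_writer.addPage(pdf_reader.getPage(page_number))

        output = f"{prefix}{page_number + 1}.pdf"
        with open(output, "wb") as out:
            pdf_writer.write(out)
        written.append(output)

    return written
```

Because the page number is part of each file name, it stays easy to tell which output file holds which page of the original.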
03:18 I just need to update that title because I’ve worked on it since then.
03:28 And there we have the fact that it’s run correctly so far. And if I go to my working folder,
03:38 you can see here that each page has opened on its own.
03:44 There’s page 1 on its own,
03:53 page 3, et cetera. And as you can see in the tabs here, each file has its own unique name, as mentioned, so you can discern which document is which page. Now, hopefully you’ll join me in the next part so that you can find out how to add a watermark to a PDF as well as how to encrypt a PDF.
donrogstad on Feb. 26, 2020
I just noticed that comments are not separated by lessons so the above comment about the path variable refers to the rotate_pages.py script in lesson 2.
muzixaba on Feb. 26, 2020
Please give link to jupyter_notebook.pdf
Andrew Stephen RP Team on Feb. 27, 2020
Hi @muzixaba, In order to create your own jupyter_notebook.pdf, you can go to the Real Python tutorial for Jupyter Notebook - An Introduction and print it to a .pdf from your browser.
donrogstad on Feb. 25, 2020
I notice that you pass the file name to the rotate_pages routine, but you open the pdf_reader with the “path” variable instead of the passed filename. I guess the routine still works since path is defined below and acts as a global variable, but you may want to change the code to use the passed variable.