Convert PDF to Text via Linux Command Line

Converting PDFs to Editable Text in Linux
There are numerous scenarios where converting a PDF file into an editable text format becomes necessary. Perhaps you need to modify an existing document, and only a PDF version is available to you.
Converting PDF files within a Windows environment is generally straightforward. However, if you are operating on a Linux system, the process differs slightly.
Fortunately, Linux provides a robust solution. We will demonstrate how to convert PDF files to editable text utilizing a command-line tool known as pdftotext, which is included in the “poppler-utils” package.
Checking for Installation
It’s possible that this tool is already installed on your system. To verify, open a terminal window by pressing “Ctrl + Alt + T”.
Then, enter the following command and press “Enter”:
dpkg -s poppler-utils
Important Note: When entering commands, do not include the quotation marks unless explicitly instructed.
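The dpkg query applies to Debian-based systems such as Ubuntu. As a rough alternative that does not depend on the package manager, you can ask the shell whether the pdftotext command is on your PATH. This is a minimal sketch; the function name have_tool is made up for illustration.

```shell
# have_tool is a hypothetical helper: succeeds if the named command is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool pdftotext; then
    echo "pdftotext is installed"
else
    echo "pdftotext not found; install the poppler-utils package"
fi
```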
Installing pdftotext
If the previous command reports that the package is not installed, you can install it on Debian- and Ubuntu-based systems with the following command in the terminal:
sudo apt-get install poppler-utils
You will be prompted to enter your password; do so and press “Enter”.
Understanding poppler-utils
The poppler-utils package is a collection of command-line tools for working with PDFs: converting them to other formats (pdftotext, pdftohtml, pdftoppm), splitting and merging them (pdfseparate, pdfunite), and extracting data from them (pdfinfo, pdfimages).
Basic Conversion Command
The fundamental command for converting a PDF to an editable text file is as follows. Open a Terminal window (“Ctrl + Alt + T”), type the command, and press “Enter”.
pdftotext /home/lori/Documents/Sample.pdf /home/lori/Documents/Sample.txt
Remember to adjust the file paths to accurately reflect the location and name of your PDF file, as well as the desired location and name for the resulting text file.
The converted text file will then be created and can be opened like any other text file in Linux.
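If you have a whole folder of PDFs, the same basic command can be run in a loop. The following is only a sketch: convert_all is a hypothetical helper name, and it assumes every file to convert ends in a lowercase ".pdf" extension.

```shell
# convert_all is a hypothetical helper: converts every .pdf in a directory
# to a .txt file with the same base name, next to the original.
convert_all() {
    dir=$1
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue          # skip if the glob matched nothing
        pdftotext "$pdf" "${pdf%.pdf}.txt"
    done
}

# Usage (the path is an example):
# convert_all /home/lori/Documents
```

Quoting "$pdf" keeps the loop working for file names that contain spaces.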
Addressing Line Breaks
The converted text may contain unwanted line breaks. This is because pdftotext ends a line wherever a line ends on the PDF page, so a paragraph that wraps across several lines in the PDF comes out as several short lines rather than one.
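One possible way to clean this up afterwards is to join the lines of each paragraph with awk. This is a sketch, not a feature of pdftotext: join_paragraphs is a hypothetical name, and it assumes that paragraphs in the converted text are separated by blank lines, which will not be true of every PDF.

```shell
# join_paragraphs is a hypothetical filter: reads text on stdin and joins the
# lines of each blank-line-separated paragraph into a single line.
join_paragraphs() {
    # RS = "" puts awk in paragraph mode: each record is one block of text
    # delimited by one or more blank lines.
    awk 'BEGIN { RS = "" } { gsub(/\n/, " "); print; print "" }'
}

# Usage (paths are examples):
# join_paragraphs < /home/lori/Documents/Sample.txt > /home/lori/Documents/Sample-joined.txt
```

Words that were hyphenated across a line break in the PDF will still need to be fixed by hand, because the filter simply replaces each line break with a space.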
Preserving Document Layout
To maintain the original document’s layout – including headers, footers, and page formatting – utilize the “-layout” flag.
pdftotext -layout /home/lori/Documents/Sample.pdf /home/lori/Documents/Sample.txt
Converting Specific Page Ranges
If you only need to convert a specific section of the PDF, use the “-f” (first page) and “-l” (last page) flags.
pdftotext -f 5 -l 9 /home/lori/Documents/Sample.pdf /home/lori/Documents/Sample.txt
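Setting “-f” and “-l” to the same number extracts a single page, which makes it possible to write every page to its own text file. The sketch below uses pdfinfo, another tool from poppler-utils, to read the page count; split_pages and the "-N.txt" naming scheme are our own choices for illustration.

```shell
# split_pages is a hypothetical helper: writes each page of a PDF to its own
# text file (Sample-1.txt, Sample-2.txt, ...) next to the original.
split_pages() {
    pdf=$1
    # pdfinfo prints a line such as "Pages:          12"
    pages=$(pdfinfo "$pdf" | awk '/^Pages:/ { print $2 }')
    [ -n "$pages" ] || return 1            # give up if no page count was found
    p=1
    while [ "$p" -le "$pages" ]; do
        pdftotext -f "$p" -l "$p" "$pdf" "${pdf%.pdf}-$p.txt"
        p=$((p + 1))
    done
}

# Usage (the path is an example):
# split_pages /home/lori/Documents/Sample.pdf
```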
Handling Password-Protected PDFs
For PDFs secured with an owner password, employ the “-opw” flag. (Note the lowercase 'o').
pdftotext -opw 'password' /home/lori/Documents/Sample.pdf /home/lori/Documents/Sample.txt
Replace “password” with the correct password, ensuring it is enclosed in single quotes.
If the PDF is protected with a user password, use the “-upw” flag instead of “-opw”. The rest of the command remains the same.
Specifying End-of-Line Characters
You can control the type of end-of-line character used in the converted text, which is useful for compatibility across different operating systems. Use the “-eol” flag followed by “unix”, “dos”, or “mac”.
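As an illustration, the flag can be wrapped in a small function that rejects any value other than the three listed. The name to_text_eol is hypothetical; only the “-eol” flag itself comes from pdftotext.

```shell
# to_text_eol is a hypothetical wrapper: converts a PDF using the given
# end-of-line convention, rejecting anything other than unix, dos, or mac.
to_text_eol() {
    eol=$1 pdf=$2 txt=$3
    case $eol in
        unix|dos|mac) pdftotext -eol "$eol" "$pdf" "$txt" ;;
        *) echo "unknown end-of-line type: $eol" >&2; return 1 ;;
    esac
}

# Usage: produce Windows-style line endings (paths are examples)
# to_text_eol dos /home/lori/Documents/Sample.pdf /home/lori/Documents/Sample.txt
```

Without the wrapper, the equivalent command is simply pdftotext -eol dos followed by the input and output paths.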
Important Note: If no output filename is provided, pdftotext names the text file after the PDF, replacing the “.pdf” extension with “.txt” (so Sample.pdf becomes Sample.txt). Using “-” as the output filename sends the text to standard output, which normally means the Terminal window, instead of saving it to a file.
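Because “-” writes the text to standard output, the result can be piped straight into other commands. Here is a minimal sketch that counts the lines containing a search term without creating a text file; count_matches is a hypothetical helper name.

```shell
# count_matches is a hypothetical helper: counts the lines of a PDF's text
# that contain a pattern, without creating an intermediate text file.
count_matches() {
    pattern=$1 pdf=$2
    pdftotext "$pdf" - | grep -c -- "$pattern"
}

# Usage (the path and search term are examples):
# count_matches invoice /home/lori/Documents/Sample.pdf
```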
To close the Terminal window, click the close (“X”) button in its title bar, or type “exit” and press “Enter”.
For a comprehensive overview of the pdftotext command and its options, type “man pdftotext” in the terminal.