Convert PDF to Text via Linux Command Line

September 11, 2015
Converting PDFs to Editable Text in Linux

There are numerous scenarios where converting a PDF file into an editable text format becomes necessary. Perhaps you need to modify an existing document, and only a PDF version is available to you.

Converting PDF files within a Windows environment is generally straightforward. However, if you are operating on a Linux system, the process differs slightly.

Fortunately, Linux provides a robust solution. We will demonstrate how to convert PDF files to editable text utilizing a command-line tool known as pdftotext, which is included in the “poppler-utils” package.

Checking for Installation

It’s possible that this tool is already installed on your system. To verify, open a terminal window by pressing “Ctrl + Alt + T”.

Then, enter the following command and press “Enter”:

dpkg -s poppler-utils

Important Note: When entering commands, do not include the quotation marks unless explicitly instructed.
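The dpkg check applies to Debian- and Ubuntu-based systems. As a rough alternative that does not depend on the package manager, you can ask the shell whether a pdftotext binary is on your PATH. The function name check_pdftotext below is a made-up helper for illustration, not part of poppler-utils.

```shell
# Hypothetical helper: report whether the pdftotext binary is reachable on PATH.
check_pdftotext() {
    if command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext found at: $(command -v pdftotext)"
    else
        echo "pdftotext not found; install the poppler-utils package"
        return 1
    fi
}
```

After defining the function, run check_pdftotext to see which of the two messages applies to your system.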

Installing pdftotext

If the previous command reports that poppler-utils is not installed, you can install it on Debian- and Ubuntu-based systems with the following command in the terminal:

sudo apt-get install poppler-utils

You will be prompted to enter your password; do so and press “Enter”.

Understanding poppler-utils

The poppler-utils package contains a collection of tools designed for converting PDFs to various formats, manipulating PDF files, and extracting data from them.

Basic Conversion Command

The fundamental command for converting a PDF to an editable text file is as follows. Open a Terminal window (“Ctrl + Alt + T”), type the command, and press “Enter”.

pdftotext /home/lori/Documents/Sample.pdf /home/lori/Documents/Sample.txt

Remember to adjust the file paths to accurately reflect the location and name of your PDF file, as well as the desired location and name for the resulting text file.

The converted text file will then be created and can be opened like any other text file in Linux.

Addressing Line Breaks

The initial conversion may introduce unwanted line breaks. This is because pdftotext inserts a line break after each line of text present in the original PDF.
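One way to undo these breaks after conversion is to rejoin the lines of each paragraph. The sketch below assumes that paragraphs in the converted file are separated by blank lines, which is common in pdftotext output but not guaranteed. The name unwrap_paragraphs is a hypothetical helper, not a pdftotext feature.

```shell
# Hypothetical helper: join the lines of each blank-line-separated paragraph.
# Reads text on standard input and writes the result to standard output.
unwrap_paragraphs() {
    awk 'BEGIN { RS = ""; ORS = "\n\n" }   # paragraph mode: records end at blank lines
         {
             gsub(/\n/, " ")               # turn in-paragraph line breaks into spaces
             sub(/ +$/, "")                # drop any trailing space
             print
         }'
}
```

For example, unwrap_paragraphs < Sample.txt > Sample-unwrapped.txt writes a copy in which each paragraph sits on a single line. Text without blank lines between paragraphs would be merged into one long line, so check the result before relying on it.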

Preserving Document Layout

To maintain the original document’s layout – including headers, footers, and page formatting – utilize the “-layout” flag.

pdftotext -layout /home/lori/Documents/Sample.pdf /home/lori/Documents/Sample.txt

Converting Specific Page Ranges

If you only need to convert a specific section of the PDF, use the “-f” (first page) and “-l” (last page) flags.

pdftotext -f 5 -l 9 /home/lori/Documents/Sample.pdf /home/lori/Documents/Sample.txt
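To choose a sensible value for “-l”, it helps to know how many pages the PDF has. The sketch below reads the page count from pdfinfo, another tool in the poppler-utils package, assuming its output contains a line beginning with “Pages:”. The name pdf_page_count is a hypothetical helper.

```shell
# Hypothetical helper: print the number of pages in a PDF, as reported by pdfinfo.
pdf_page_count() {
    # pdfinfo prints one "Key: value" line per property; keep the Pages value.
    pdfinfo "$1" | awk '/^Pages:/ { print $2 }'
}
```

It could then be combined with the page flags, for example pdftotext -f 5 -l "$(pdf_page_count Sample.pdf)" Sample.pdf Sample.txt to convert from page 5 through the end of the document.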

Handling Password-Protected PDFs

For PDFs secured with an owner password, employ the “-opw” flag. (Note the lowercase 'o').

pdftotext -opw 'password' /home/lori/Documents/Sample.pdf /home/lori/Documents/Sample.txt

Replace “password” with the correct password, ensuring it is enclosed in single quotes.

If the PDF is protected with a user password, use the “-upw” flag instead of “-opw”. The rest of the command remains the same.
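Typing a password directly into the command leaves it in your shell history. As a sketch of one way around that, the hypothetical helper below (pdftotext_upw is an invented name) reads the user password from standard input and passes it to the “-upw” flag. The password is still visible to other processes while pdftotext runs, so treat this as a convenience rather than a security measure.

```shell
# Hypothetical helper: read a user password from stdin, then convert the PDF.
# Usage: pdftotext_upw input.pdf output.txt
pdftotext_upw() {
    pdf_path=$1
    txt_path=$2
    if [ -t 0 ]; then
        printf 'User password: ' >&2
        stty -echo                 # hide typing when reading from a terminal
    fi
    IFS= read -r pdf_password
    if [ -t 0 ]; then
        stty echo                  # restore normal terminal echo
        printf '\n' >&2
    fi
    pdftotext -upw "$pdf_password" "$pdf_path" "$txt_path"
}
```

Replacing “-upw” with “-opw” in the last line of the function gives the owner-password variant.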

Specifying End-of-Line Characters

You can control the type of end-of-line character used in the converted text, which is useful for compatibility across different operating systems. Use the “-eol” flag followed by “unix”, “dos”, or “mac”.
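As an illustration, the sketch below wraps the “-eol” flag in a small function that rejects anything other than the three accepted values. The name pdftotext_eol is a hypothetical helper, and the file names in the usage note are placeholders.

```shell
# Hypothetical helper: convert a PDF using a chosen end-of-line style.
# Usage: pdftotext_eol unix|dos|mac input.pdf output.txt
pdftotext_eol() {
    eol_style=$1
    pdf_path=$2
    txt_path=$3
    case "$eol_style" in
        unix|dos|mac)
            pdftotext -eol "$eol_style" "$pdf_path" "$txt_path"
            ;;
        *)
            echo "end-of-line style must be unix, dos, or mac" >&2
            return 2
            ;;
    esac
}
```

For example, pdftotext_eol dos Sample.pdf Sample.txt produces a text file with DOS-style line endings, which is the form Windows editors expect.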

Important Note: If no output filename is provided, pdftotext names the text file after the PDF, replacing the “.pdf” extension with “.txt”. Using “-” as the output filename sends the text to the Terminal window (standard output) instead of saving it to a file.
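Because “-” sends the text to standard output, the result can be piped straight into other commands. The sketch below searches a PDF for a pattern without creating a text file; pdf_grep is a hypothetical helper name.

```shell
# Hypothetical helper: search the text of a PDF without writing a .txt file.
# Usage: pdf_grep PATTERN input.pdf
pdf_grep() {
    pattern=$1
    pdf_path=$2
    # "-" makes pdftotext write to standard output; grep -n adds line numbers.
    pdftotext "$pdf_path" - | grep -n -- "$pattern"
}
```

The line numbers printed by grep refer to lines of the extracted text, not to pages of the PDF.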

To close the Terminal window, simply click the “X” button in the upper-left corner.

For a comprehensive overview of the pdftotext command and its options, type “man pdftotext” in the terminal.
