How to extract text from a PDF file?


Last updated: July 17, 2022

Do you have a PDF document that you want to extract all the text from? What about scanned images whose text you want to convert and edit? These are some of the most common questions I hear at my workplace about working with these files.

We have already looked at a Chrome extension that lets you copy text from an image. Today I will go through several ways to extract text or images from a PDF file. The results will vary depending on the type and quality of the text or images in the PDF, and also on the tool you use, so it is best to try as many of the options below as you can.

Extract text or image from PDF

The easiest and fastest way to get started is to try an online PDF text extractor. These services are usually free and can give you exactly what you need without installing anything on your computer. Here are the ones that I have used with very good results:

ExtractPDF

ExtractPDF is a free tool for recovering images, text and fonts from a PDF file. It is quick and easy to use: just upload your document or enter the URL of the PDF file you want to use and start the extraction. The only limitation is a maximum PDF file size of 10 MB. That is a bit small, so if you have a bigger file, try compressing the PDF or use one of the other methods mentioned in this article. ExtractPDF is completely free and can be used without registration.
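If you have many files to process, a quick local check can tell you which ones fit under the 10 MB limit before you upload anything. The following is only a minimal sketch: the function name `fits_size_limit` is made up for illustration, and it assumes the limit means 10 * 1024 * 1024 bytes, which the site may count differently.

```python
import os

# Assumption: "10 MB" is taken as 10 * 1024 * 1024 bytes.
MAX_BYTES = 10 * 1024 * 1024


def fits_size_limit(path, limit=MAX_BYTES):
    """Return True if the file at `path` is no larger than `limit` bytes."""
    return os.path.getsize(path) <= limit
```

Calling `fits_size_limit("report.pdf")` on a hypothetical file returns `False` when the file is too large, in which case compressing it first or using another tool is the way to go.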


Overall, the online ExtractPDF tool works great, but I ran into an issue with one PDF file that gave me funny results. The text was extracted very well, but for some reason there was a line break after each word! That is not a big problem for a small PDF file, but it definitely is for files with a lot of text. If this happens to you, try the following tool.
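Alternatively, if the extracted text is otherwise fine and only the extra line breaks are the problem, they can be cleaned up afterwards with a few lines of Python. This is a rough sketch, not something ExtractPDF provides: the helper name `rejoin_words` is hypothetical, and it assumes that blank lines in the output mark paragraph breaks, which may not hold for every file.

```python
def rejoin_words(text):
    """Collapse one-word-per-line text back into paragraphs.

    Assumption: blank lines separate paragraphs; every other line break
    is an unwanted break between words and is replaced by a space.
    """
    paragraphs = []
    current = []
    for line in text.splitlines():
        word = line.strip()
        if word:
            current.append(word)
        elif current:
            # A blank line closes the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

You would paste or read the extracted text into a string, pass it to `rejoin_words`, and save the result. Real line breaks that are not marked by a blank line are lost with this approach.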

Online OCR

Online OCR usually works for documents that could not be converted correctly with ExtractPDF, so it is a good idea to try both services to see which one gives you the best output. Online OCR also has some nicer features that come in handy if you have a large PDF file and only need to convert a little bit of text on a few pages, not the entire document.

The first thing you should do is create a free account. It is a bit annoying, but without an account the service will only partially convert your PDF rather than the entire document. With the free account, you are also no longer limited to a 5 MB document: you can work with files of up to 100 MB each.


To use Online OCR, go to www.onlineocr.net, choose a language, then choose the output format you want for the converted file. You have two options and you can choose more than one if you want. Under the multipage document option, you can enter page numbers to convert only the pages you want. Then select the file and, finally, click Convert!

Online OCR did a great job converting my PDF files: in my test it was able to maintain the actual layout of the text. I took a Word document that used various dashes, different font sizes, and so on, and converted it to a PDF file. Then I used Online OCR to convert it back to Word format, and the result was about 95% the same as the original. That is quite impressive to me.
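If you want to put a number on this kind of round-trip test yourself, one option is to compare the plain text of the original with the plain text of the converted file. The sketch below is not how the figure above was obtained. It uses `difflib.SequenceMatcher` from Python's standard library, and the function name `similarity` is made up for illustration. It compares words only, so layout and formatting differences are ignored.

```python
from difflib import SequenceMatcher


def similarity(original, converted):
    """Return a word-level similarity ratio between 0.0 and 1.0."""
    # Splitting on whitespace ignores differences in spacing and line breaks.
    a = original.split()
    b = converted.split()
    # autojunk=False keeps very frequent words from being skipped in long texts.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()
```

A ratio of 1.0 means both texts contain the same words in the same order, and lower values mean more words were changed, dropped or added during conversion.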

Free Online OCR 

Speaking of images, text and OCR, let me mention another great site that works really well for images. Free Online OCR was very good and very accurate when extracting text from my test images. I then used my iPhone to take photos of various pages of books, brochures, etc., and was amazed at how well the tool converted the text.

To use it, first choose your file, then click the Upload button. On the next screen there is a group of options and an image preview. You can crop to the part you want to extract. Then click the OCR button and your converted text will appear under the image preview. It doesn't have any limits either, which is really nice.

In addition to online services, there are two freeware PDF converters I want to mention in case you need software that runs locally on your computer. Online services always require an Internet connection, which is not available to everyone at all times. However, I noticed that the quality of conversions from the freeware programs was noticeably lower than from the websites.

A-PDF Text Extractor

A-PDF Text Extractor is freeware that does a pretty good job of extracting text from PDF files. Once you have downloaded and installed it, click the Open button to choose your PDF file. Then click Extract text to start the process.


You will be asked to choose a location for the output file, and then the extraction begins. You can also click the Option button, which lets you choose only certain pages to extract and the type of extraction. The second option is interesting because it extracts the text in different layouts; it is worth trying all three to see which gives you the best result.

PDF2Text driver

PDF2Text driver does a decent job of extracting text. Unfortunately, it has no options: you just add files or folders, convert, and hope for the best. It worked fine on some PDF files, but for the majority of them there were several issues.


Click Add Files, and then click Convert. When the conversion is complete, click Browse to open the file. Your results will vary with this program, so don't expect too much.

It should also be noted that if you are in a corporate environment, or can get your hands on a copy of Adobe Acrobat at work, you can achieve much better results.

Acrobat is obviously not free, but it can convert PDFs to Word, Excel and HTML. It also did the best job of preserving the original document structure and converting complicated text.