I've seen a bunch of posts about reading PDF files, but most of them seem to be asking how to convert PDF files to text. That is not exactly what I am looking for.
I have PDF files that have been scanned and now have the text embedded in the file. What I am looking to do is read the text that is stored in the file, not convert it. I have tried the StreamReader on one of these PDF files, but I just get a lot of junk.
Is there a C# way to look at the text in this type of PDF file?
Your wording is confusing. You don't want to convert PDF files to text, but you want to read the text in the file; that sounds like exactly the same thing to me.
My second question: has the PDF been run through optical character recognition to turn the scanned text into selectable text? In other words, can you select the text, or is it just a picture of the scanned words? If it's just a picture, this gets complicated, because you'll have to do optical character recognition yourself. If not, it should be easy, but I don't understand what the problem is.
The text in the file has already been scanned and OCR'd - so the file now does have selectable text. I would like to read that selectable text.
Everything I've seen on Google and CodeProject has been about how to render or OCR the text, not how to extract the text already in the PDF.
Do you mean read the contents of PDF files without opening them in a reader that can read PDF files?
After a little searching, I think his problem is this: there are tools that read text from a PDF, but for some reason they won't get text that was created by OCR. I can't find any documents with OCR'd text, so I can't really help.
Using the code found here: http://www.codeproject.com/KB/string/pdf2text.aspx, I was able to extract the text from a PDF file, but only if the PDF file was created from a Word document or another source, not from OCR'd text. If it was OCR'd text, I got nothing.
However, using the code here: http://www.codeproject.com/KB/cs/PDFToText.aspx, I was able to extract the text from an OCR'd file (I used the Fujitsu ScanSnap), but the text came out as one long string. There were no spaces between the words.
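One possible reason for the missing spaces, though I have not confirmed it for this file, is that the OCR layer places each word at its own position on the page and never writes an actual space character. If that is the case, the spaces have to be rebuilt from the horizontal gaps between pieces of text. Here is a rough sketch of the idea. `TextChunk`, `SpaceRebuilder`, and the `gapThreshold` value are hypothetical names made up for illustration; they are not part of either CodeProject article, and you would still need a parser that reports the position and width of each piece of text.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical container for one positioned piece of text on a line.
// A real PDF parser would have to supply these values.
public sealed class TextChunk
{
    public TextChunk(string text, double x, double width)
    {
        Text = text;
        X = x;
        Width = width;
    }

    public string Text { get; }
    public double X { get; }      // left edge, in page units
    public double Width { get; }  // rendered width, in page units
}

public static class SpaceRebuilder
{
    // Joins the chunks of a single line from left to right and inserts a
    // space whenever the gap between neighbours is larger than gapThreshold
    // (an assumed tuning value, not something read from the PDF).
    public static string JoinLine(IEnumerable<TextChunk> chunks, double gapThreshold)
    {
        var sorted = new List<TextChunk>(chunks);
        sorted.Sort((a, b) => a.X.CompareTo(b.X));

        var result = new StringBuilder();
        double previousRight = 0;
        bool first = true;

        foreach (var chunk in sorted)
        {
            // Gap between the right edge of the previous chunk and this one.
            if (!first && chunk.X - previousRight > gapThreshold)
            {
                result.Append(' ');
            }

            result.Append(chunk.Text);
            previousRight = chunk.X + chunk.Width;
            first = false;
        }

        return result.ToString();
    }
}
```

For example, passing a "Hello" chunk at x = 10 with width 30 and a "world" chunk at x = 45 with width 30, with a threshold of 2, leaves a gap of 5 between them, so a space is inserted. A sensible threshold probably depends on the font size, so a fixed number is only a starting point.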