how to get the text from any type file in c program

Status
Not open for further replies.

linuxuserchina

Programmer
Jun 26, 2006
2
CA
I am programming in C. I want to extract the text from
any type of file (such as .doc, .pdf, .ppt, or .ps). Is there any way to do this in a C program?

Thank you very much.

 
Yes, as long as you know the exact format of the file you want to read from.
 
Hello, cpjust!

How can I find out the exact format of a given file type? For example, how do I find out the format of a .pdf file? Can you explain it in detail? Thank you very much.
 
Usually, if the file type is standardized, it will have an RFC or a similar published specification that fully describes how it works in extremely boring detail. For example:

Another thing you can do is see if anyone has already written a custom API that makes it easy to read that file type. Sometimes you can find one for free, and other times you might have to pay.

It would be impossible to create a program that can read any file type; you can only read file types that you know about ahead of time.
 
There is a UNIX / Linux utility called 'file', which attempts to identify a file's type by inspecting its contents.
E.g.
Code:
$ file m*
malloc.sh:                Bourne shell script text executable
malloc.txt:               ASCII text
mallocFree.pl:            perl script text executable
mallocFree.c:             ASCII C program text
malloc:                   directory
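
If you want to do a similar check from inside your own C program, here is a minimal sketch of the idea: compare the first few bytes of the file against known signatures ("magic numbers"). This is not how 'file' itself is implemented, which uses a much larger signature database. The function names guess_format and guess_file_format are made up for this example, and the table only covers the formats mentioned in this thread: "%PDF-" for PDF, "%!PS" for PostScript, and the 8-byte OLE2 header used by legacy .doc and .ppt files.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Guess a file format from its leading bytes.
   Returns a static description, or "unknown" when no signature in this
   deliberately tiny, illustrative table matches. */
const char *guess_format(const unsigned char *buf, size_t len)
{
    /* Header of an OLE2 compound file (legacy .doc, .ppt, .xls). */
    static const unsigned char ole_magic[] =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    if (len >= 5 && memcmp(buf, "%PDF-", 5) == 0)
        return "PDF document";
    if (len >= 4 && memcmp(buf, "%!PS", 4) == 0)
        return "PostScript document";
    if (len >= sizeof ole_magic &&
        memcmp(buf, ole_magic, sizeof ole_magic) == 0)
        return "OLE2 compound file";
    return "unknown";
}

/* Read the first bytes of the named file and classify them.
   Returns "unreadable" if the file cannot be opened. */
const char *guess_file_format(const char *path)
{
    unsigned char buf[16];
    size_t n;
    FILE *fp = fopen(path, "rb");   /* binary mode: no newline translation */

    if (fp == NULL)
        return "unreadable";
    n = fread(buf, 1, sizeof buf, fp);
    fclose(fp);
    return guess_format(buf, n);
}
```

Calling guess_file_format(argv[1]) from main and printing the result gives a very rough imitation of 'file'. Knowing the type is only the first step; you still need a parser for that format to get the text out.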

There is also a utility called 'strings', which looks for runs of printable characters inside any file. With its default settings it won't find text stored in a multi-byte Unicode encoding such as UTF-16, and it can't recover compressed or encrypted text at all.

 
Hello,

While I was thinking about my reply, I saw that Salem had already pointed out the Unix command 'strings'.

You might try to write something similar to that command yourself. The idea is just this:
Read the file byte by byte, or character by character, and whenever you find a run of consecutive printable characters (at least 5, or at least 10, or whatever threshold you choose), print it. And if you are going to program that yourself, you should be able to make it work for Unicode or national characters as well. (But not for compressed or encrypted text.)
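
The idea above could be sketched roughly like this. It is only an illustration under some assumptions: the function name extract_strings and the 64-character cap on the threshold are made up for the example, and it only treats single-byte characters that isprint() accepts (plus tab) as printable, so Unicode and national characters would need extra decoding logic that is not shown here.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Copy every run of at least min_len printable characters from 'in'
   to 'out', one run per line. Returns the number of runs written.
   Sketch only: min_len must be between 1 and 64. */
size_t extract_strings(FILE *in, FILE *out, size_t min_len)
{
    char pending[64];   /* holds the start of a run until it is long enough */
    size_t run = 0;     /* printable chars seen in the current run (capped at min_len) */
    size_t found = 0;   /* number of runs written so far */
    int c;

    if (min_len == 0 || min_len > sizeof pending)
        return 0;       /* threshold not supported by this sketch */

    while ((c = getc(in)) != EOF) {
        if (isprint(c) || c == '\t') {
            if (run < min_len) {
                pending[run++] = (char)c;
                if (run == min_len) {
                    /* The run just reached the threshold: flush what we held back. */
                    fwrite(pending, 1, min_len, out);
                    found++;
                }
            } else {
                putc(c, out);   /* already past the threshold: pass straight through */
            }
        } else {
            if (run >= min_len)
                putc('\n', out);   /* terminate the run we were printing */
            run = 0;
        }
    }
    if (run >= min_len)
        putc('\n', out);           /* file ended in the middle of a run */
    return found;
}
```

To use it, open the input with fopen(path, "rb") and call extract_strings(fp, stdout, 5). On a .doc or .pdf this will print a lot of noise along with any text that happens to be stored uncompressed, which is why knowing the real format is still the better route.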

hope this helps
 