Extracting text from Word Documents via PHP and COM

Status
Not open for further replies.

Bersani

Programmer
Nov 10, 2011
48
US
I want to extract text from Word Documents via PHP and COM and am not able to open Word in this way. I believe that com is enabled in php.ini but am not sure. Here is the code that I am using to open a new Word application:

<?php
$word = new COM("word.application") or die ("Could not initialise MS Word object.");
$word->Documents->Open(realpath("Sample.doc"));

// Extract content.
$content = (string) $word->ActiveDocument->Content;

echo $content;

$word->ActiveDocument->Close(false);

$word->Quit();
$word = null;
unset($word);
?>
 
Assuming you are using docx format why do you need to use Com?
 
You are implying that I do not need to use com if using .docx? Why doesn't it work with com?
 
you need to provide more information before we can tell you why com does not work.

but it is not needed for docx, which is just a zip archive of XML files that php can read directly.

1. are you using windows.
2. have you completely installed a valid licensed version of ms word. does it start from the command line?
3. is com enabled in php? telling us 'it might be' does not help. go check.
4. is the script running?
5. is the script failing? if so, what is the error message (you must have error reporting and display turned on in php.ini).
6. is anything being logged to an error log at php or system level?
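
For point 3, a quick way to check from PHP itself is sketched below. The function name com_status() is made up for this example; the calls inside it are standard PHP functions. It only reports what the running PHP sees and changes nothing.

```php
<?php
// Hypothetical helper: report whether this PHP build can use COM at all.
function com_status(): array
{
    return [
        // COM is only available on Windows builds of PHP.
        'windows'          => strncasecmp(PHP_OS, 'WIN', 3) === 0,
        // True when the com_dotnet extension is loaded.
        'extension_loaded' => extension_loaded('com_dotnet'),
        // True when the COM class can actually be instantiated by name.
        'com_class'        => class_exists('COM'),
        // Which php.ini is in use (false if none was loaded).
        'ini_file'         => php_ini_loaded_file(),
    ];
}

foreach (com_status() as $name => $value) {
    echo $name, ': ', var_export($value, true), "\n";
}
```

If extension_loaded comes back false, check that you are editing the php.ini shown in ini_file, since the web server may load a different file from the one you expect.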
 
1. are you using windows? Yes, Windows 7.
2. have you completely installed a valid licensed version of ms word. does it start from the command line? Yes, MS Word runs fine on its own.
3. is com enabled in php? I do not know how to check. I went to php.ini but I do not know what to turn on.
4. is the script running? I get a blank screen.
5. is the script failing? if so, what is the error message (you must have error reporting and display turned on in php.ini).
6. is anything being logged to an error log at php or system level? No.



 
[COM]
; path to a file containing GUIDs, IIDs or filenames of files with TypeLibs
; ;com.typelib_file =

; allow Distributed-COM calls
; com.allow_dcom = true

; autoregister constants of a components typlib on com_load()
; ;com.autoregister_typelib = true

; register constants casesensitive
; ;com.autoregister_casesensitive = false

; show warnings on duplicate constant registrations
; ;com.autoregister_verbose = true

; The default character set code-page to use when passing strings to and from COM objects.
; Default: system ANSI code page
;com.code_page=
 
a blank screen suggests that the script is failing. My suspicion is that COM is not properly loaded and that you have not turned on error display.
open php.ini and ensure these values are set:
Code:
error_reporting  =  E_ALL
display_errors = On
display_startup_errors = On
log_errors = On
log_errors_max_len = 2048
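
If you cannot edit php.ini, roughly the same settings can be applied at the top of a script. This is only a sketch: runtime settings like these cannot show a parse error in the same file, because the file never gets as far as running them, so php.ini remains the better place.

```php
<?php
// Runtime equivalents of the php.ini values above.
error_reporting(E_ALL);                  // report every kind of error
ini_set('display_errors', '1');          // print errors to the browser
ini_set('display_startup_errors', '1');  // include errors raised at startup
ini_set('log_errors', '1');              // also write errors to the error log
```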

although the manual says otherwise, from php v 5.3.15 the COM/DOTNET extension is no longer compiled statically into the php binary, so you must load it at runtime. If you are using this version of php (or later), ensure also that you have a section of your php.ini that looks like this

Code:
[PHP_COM_DOTNET]
extension=php_com_dotnet.dll

restart your web server.

next make sure that the user under whose permissions the webserver (and thus php) are being run has permissions to access the sample document. This is a windows issue and not php. If you suspect that this is a problem it may be better to put the file in a root directory with no permissions lock on it.

change your script as follows (essentially this adds permission checks and a progress message at each step, so you can see how far it gets).

Code:
<?php
echo '<pre>';
echo "starting\n";
set_time_limit (30); //to allow time for Word to load.
$word = new COM("word.application") or die ("Could not initialise MS Word object.");
echo "COM instantiated\n";
$word->Application->Visible = False; 
echo "set visibility to false\n";

$doc = 'sample.doc';
$document = realpath($doc);
if (is_readable($document):
 echo "Document exists and is readable \n";
else:
 if(!is_file($document)):
   echo "Document does not exist\n";die();
 else:
   echo "Document is not readable\n"; die();
 endif;
endif;
$word->Documents->Open( $document ); 
echo "Document opened\n";
// Extract content. 
$content = $word->ActiveDocument->Content; 
echo "test\n----------\n";
print_r($content);
echo "test\n----------\n";
echo "Extracting string value of content\n";
$content = (string) $content;
echo "test\n----------\n";
echo $content;
echo "test\n----------\n";

echo $content; 

$word->ActiveDocument->Close(false); 
echo "Closed Document\n";
$word->Quit(); 
echo "Quit Word \n";
$word = null; 
unset($word); 
?>

I say again that this is a suboptimal route for extracting text from docx files. if that is your aim then post back with some business context around your aims so that we can assist further.
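
As a rough illustration of the non-COM route for a genuine .docx file: the main text lives in word/document.xml inside the zip archive. The sketch below assumes the zip extension (ZipArchive) is enabled, and the function names docx_xml_to_text() and docx_to_text() are invented for this example. It ignores headers, footers, footnotes and tables layout, so treat it as a starting point only.

```php
<?php
// Hypothetical helper: reduce WordprocessingML markup to plain text.
function docx_xml_to_text(string $xml): string
{
    // End of a paragraph becomes a newline.
    $xml = str_replace('</w:p>', "\n", $xml);
    // Tab elements become real tab characters.
    $xml = preg_replace('/<w:tab\s*\/>/', "\t", $xml);
    // Drop all remaining markup, then decode XML entities such as &amp;.
    $text = strip_tags($xml);
    return rtrim(html_entity_decode($text, ENT_QUOTES | ENT_XML1, 'UTF-8'));
}

// Hypothetical helper: returns the text, or null if the file is not a usable docx.
function docx_to_text(string $path): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null; // not a zip archive, or not readable
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    return $xml === false ? null : docx_xml_to_text($xml);
}
```

Calling docx_to_text('sample.docx') returns null when the file is not a zip archive at all, which is a cheap way to spot files that are not really docx.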
 
I made the changes to php.ini and copied your code.
I got this in the browser: "Parse error: syntax error, unexpected ':' in C:\xampp\htdocs\word7.php on line 12"
I tried to change the : to ; and got the same error but with ";".

I am trying to extract MS Word files (.doc or .docx) that are stored in a binary field of a table. I had a Visual FoxPro database that was able to open the MS Word files, which were embedded in the table. I converted the table to MySQL, and when I click on the field all I get is a .bin file that opens in MS Word with a few strange characters and no text.
 
a close bracket was missing
Code:
if (is_readable($document)):

i see. i strongly suspect that the VFP table used OLE to link and embed the document, much like the access equivalent. you will not be able to open this programmatically as a word doc. I struggled for a very very long time trying to do the same but with jpegs. the OLE field effectively wraps the binary in its own header data. you might be able to strip this with a hex editor on a file by file basis.
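
One way to test the wrapper theory without a hex editor is to look for the known file signatures inside the blob: a .doc is an OLE2 compound file that starts with the bytes D0 CF 11 E0 A1 B1 1A E1, and a .docx is a zip that starts with "PK" 03 04. The sketch below uses a made-up function name, find_embedded_doc(), and only reports where such a signature first appears. An offset greater than 0 suggests something has been wrapped around the document. It is an assumption that cutting the blob at that offset gives a usable file; the wrapper may also add trailing data or store the document in another form.

```php
<?php
// Hypothetical helper: locate the first Word-like file signature in a blob.
function find_embedded_doc(string $blob): ?array
{
    $signatures = [
        'doc'  => "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", // OLE2 compound file
        'docx' => "PK\x03\x04",                       // zip local file header
    ];
    $best = null;
    foreach ($signatures as $type => $magic) {
        $pos = strpos($blob, $magic);
        // Keep whichever signature occurs earliest in the blob.
        if ($pos !== false && ($best === null || $pos < $best['offset'])) {
            $best = ['type' => $type, 'offset' => $pos];
        }
    }
    return $best; // null when neither signature is present
}
```

For a blob read from the database into $blob, substr($blob, $result['offset']) would be the candidate file to save and try opening in Word.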

the solution for jpegs was to write a dll that created a small form that iteratively showed each of the records and saved the contents of the OLE Field as a normal image (in fact it was worse than that - it screenshotted a full screen rendering).

I believe that it is possible to make a better fist of it in VFP using the reportwriter class. A quick google has also brought this up, which looks promising.
 
I made the change: if (is_readable($document)): and now I get a blank screen.
 
I made some changes (set visibility to true) and now get:

starting
COM instantiated
set visibility to true
Document exists and is readable
Document opened

Fatal error: Uncaught exception 'com_exception' with message '<b>Source:</b> Microsoft Word<br/><b>Description:</b> This command is not available because no document is open.' in C:\xampp\htdocs\open12.php:25
Stack trace:
#0 C:\xampp\htdocs\open12.php(25): unknown()
#1 {main}
thrown in C:\xampp\htdocs\open12.php on line 25

 
changing visibility should have made no difference at all to the php script. so something else was probably changed too. a blank screen (PROVIDING YOU HAVE ERROR DISPLAY AND REPORTING turned ON) is an unlikely scenario. after 30 seconds you would at least have got a time out error. the delay is most likely Word hanging when you tell it to open a document that is not a word document but in fact an embedded word document.

anyway ...

the fatal error indicates that the document did not open but failed. which backs up my previous post. you cannot achieve your aims using this method. check out the links in my post of 11 Sep 08h39.
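
for files that really are word documents, the uncaught exception can at least be turned into a readable message. the sketch below is untested against your setup and assumes windows, an installed copy of word and the com_dotnet extension. the function name extract_word_text() is made up. it keeps the document object returned by Open() rather than relying on ActiveDocument, and always quits word afterwards.

```php
<?php
// Hypothetical helper: read the text of a Word document through COM.
function extract_word_text(string $path): string
{
    if (!class_exists('COM')) {
        throw new RuntimeException('COM extension is not loaded');
    }
    $word = new COM('word.application');
    try {
        $word->Visible = false;
        // Use the returned Document instead of ActiveDocument.
        $doc  = $word->Documents->Open($path);
        $text = (string) $doc->Content->Text;
        $doc->Close(false); // close without saving
        return $text;
    } catch (com_exception $e) {
        // Re-throw with the path so the failing file is obvious.
        throw new RuntimeException(
            'Word could not open ' . $path . ': ' . $e->getMessage(), 0, $e
        );
    } finally {
        $word->Quit(); // never leave a hidden Word process running
    }
}

try {
    echo extract_word_text('C:\\path\\to\\sample.doc'); // placeholder path
} catch (Exception $e) {
    echo 'Failed: ', $e->getMessage(), "\n";
}
```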
 