External programs without opening command line window (MS Windows)

Status
Not open for further replies.

Trevoke

Programmer
Jun 6, 2002
1,142
US
I need to parse word documents. Best solution I've found to do so is using 'antiword'. If there is a Ruby solution, I'll definitely go for it, because it'll solve this problem too...

Code:
report = `c:/antiword/antiword.exe -m cp850.txt "#{@filename}"`.split("\n")

It's nice -- but if I run this as a scheduled task with rubyw.exe, then I get a whole lot of command-line windows that blink in and out as antiword gets run.
If I run it with ruby.exe, then I just get one command-line window.

Is there a way to have Ruby either hide the command-line window or have it run minimized?

.. This may be a Windows question, and if so I apologize, but I was wondering if Ruby could really do it all :)

Tao Te Ching Discussions : Chapter 9 (includes links to previous chapters)
What is the nature of conflict?
 
Hi

Trevoke said:
Is there a way to have Ruby either hide the command-line window or have it run minimized?
As far as I know, this has nothing to do with Ruby.

Try using start, something like this, to reduce the ugliness:
Code:
report = `start /min /wait c:/antiword/antiword.exe -m cp850.txt "#{@filename}"`.split("\n")


Feherke.
 
I didn't think it had to do with Ruby either, but.. He who asks a question is a fool 5 minutes, he who doesn't is a fool his entire life ;-)

Thanks.
I don't have time to play with it right now (sigh), but the line you suggested breaks my parse. I probably have to be careful what I get back out of it.
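
A likely reason the start variant breaks the parse: start runs antiword in a console of its own, so the backticks have nothing to capture. Below is a minimal sketch of another route, assuming Windows Script Host is available: run the command through WScript.Shell with window style 0 (hidden) and redirect its output to a temporary file. The helper names hidden_command and run_hidden are made up for illustration, and this has not been tried against antiword itself.

```ruby
require 'tmpdir'

# Build a cmd.exe command line that runs exe with args and redirects
# stdout to out_path. Every token is quoted so paths with spaces survive.
# (hidden_command is a hypothetical helper name, not part of any library.)
def hidden_command(exe, args, out_path)
  to_win = lambda { |path| path.tr('/', '\\') }
  quoted = ([to_win.call(exe)] + args).map { |token| "\"#{token}\"" }.join(' ')
  out = to_win.call(out_path)
  # cmd /c strips the outermost pair of quotes and keeps the inner ones.
  'cmd /c "' + quoted + ' > "' + out + '""'
end

# Run the command with no visible window and return what it printed.
# Windows only: win32ole is loaded lazily so this file still loads elsewhere.
def run_hidden(exe, args)
  require 'win32ole'
  out_path = File.join(Dir.tmpdir, "hidden_#{Process.pid}_#{rand(1_000_000)}.txt")
  shell = WIN32OLE.new('WScript.Shell')
  # Run(command, window_style, wait): 0 hides the window, true waits for exit.
  shell.Run(hidden_command(exe, args, out_path), 0, true)
  File.read(out_path)
ensure
  File.delete(out_path) if out_path && File.exist?(out_path)
end
```

In the Report class this would replace the backtick line with something like report = run_hidden('c:/antiword/antiword.exe', ['-m', 'cp850.txt', @filename]).split("\n"). The price is one temporary file per document, so it is worth timing before relying on it.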

 
You can also parse Word documents with the help of win32ole.
For example, if I have a Word document named my_word.doc that contains one heading and one sentence of body text
Code:
My heading
This is the basic text.
then this ruby script
rubyword.rb
Code:
require 'win32ole'

word = WIN32OLE.new('Word.Application')
#word.Visible = true
word.Visible = false
document = word.documents.open('c:\Users\Roman\Work\my_word.doc')

# processing sentences
puts "Sentences:\n"
doc_sentences = []
document.Sentences.each { |s|
  doc_sentences << s.text
}

puts doc_sentences.inspect

# processing words
puts "Words:\n"
doc_words = []
document.Words.each { |w|
  doc_words << w.text
}

puts doc_words.inspect
parses sentences and/or words with the following output:
Code:
c:\Users\Roman\Work>ruby rubyword.rb
Sentences:
["My heading\r", "This is the basic text.\r\r"]
Words:
["My ", "heading", "\r", "This ", "is ", "the ", "basic ", "text", ".", "\r", "\r"]
 
mikrom - how fast can this parse? Does it properly support threading?
With antiword, I can parse ~1,000 word documents (50-100k each) in ~30 seconds.

 
Hi trevoke,

I don't know how fast it is - never used this in production, I tried only the toy example I posted.
 
Hi,
I discovered a problem in the script I posted before: when I executed the script 10 times, 10 instances of WINWORD.EXE were created.
That's because I called WIN32OLE.new('Word.Application') every time.

Here is the improved version:
rubyword.rb
Code:
# Running the script:
#   ruby rubyword.rb > result.txt
#
require 'win32ole'

file_name = 'my_word.doc'

# split working directory path into a list
file_path_list = Dir.pwd.split(File::SEPARATOR)
# add file name into the list
file_path_list << file_name
# compose filename from list
file_path = File.join(file_path_list)
puts "Processing file: '#{file_path}'"
puts

begin
  # try to connect to an existing instance of the Word application object
  word = WIN32OLE.connect('Word.Application')
rescue WIN32OLERuntimeError
  # if this exception occurs:
  #   "OLE server `Word.Application' not running (WIN32OLERuntimeError)"
  # then create a new instance of the Word application object
  word = WIN32OLE.new('Word.Application')
  # so, after running this script more than once, only one instance
  # of WINWORD.EXE is running
end

# instance of Word will be not visible
word.Visible = false
# open Word document
document = word.documents.open(file_path)

# processing sentences
puts "Sentences found:\n"
nr = 0
document.Sentences.each { |s|
  str = s.text.strip
  nr += 1
  puts "#{nr}. '#{str}'"
}
puts

# processing words
puts "Words found:\n"
nr = 0
document.Words.each { |w|
  str = w.text.strip
  if str != ''
    nr += 1
    puts "#{nr}. '#{str}'"
  end
}

# close Word document
document.close
Output:
Code:
c:\Users\Roman\Work>ruby rubyword.rb
Processing file: 'c:/Users/Roman/Work/my_word.doc'

Sentences found:
1. 'My heading'
2. 'This is the basic text.'

Words found:
1. 'My'
2. 'heading'
3. 'This'
4. 'is'
5. 'the'
6. 'basic'
7. 'text'
8. '.'
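
Another way to keep WINWORD.EXE processes from piling up is to quit Word when the script is done, instead of reusing whatever instance happens to be running. A minimal sketch follows; with_word is a made-up helper name, and the factory argument is there only so the cleanup logic can be exercised without Word installed.

```ruby
# Yield a Word application object and always quit it afterwards, so no
# WINWORD.EXE process is left behind even if the block raises.
# (with_word is a hypothetical helper; factory is injectable for testing.)
def with_word(factory = nil)
  factory ||= lambda do
    require 'win32ole'                 # Windows only, loaded on demand
    WIN32OLE.new('Word.Application')
  end
  word = factory.call
  word.Visible = false
  yield word
ensure
  word.Quit if word
end

# Intended use on a machine with Word installed:
#
#   with_word do |word|
#     document = word.Documents.Open('c:\Users\Roman\Work\my_word.doc')
#     puts document.Sentences.Count
#     document.Close
#   end
```

Note that Quit closes the instance the script started, so this fits WIN32OLE.new better than WIN32OLE.connect, where the instance may belong to a user who still has documents open.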
 
Trevoke said:
how fast can this parse? Does it properly support threading?
With antiword, I can parse ~1,000 word documents (50-100k each) in ~30 seconds.
I don't have 100 or 1,000 Word documents to answer your question, but you can modify the script I posted to parse all the Word documents you have and measure the time.
You can build a list of all documents in a given directory with Dir.glob(), for example:
Code:
c:\Users\Roman\Work>irb
irb(main):001:0> Dir.glob("C:/Users/Roman/Work/*.doc")
=> ["C:/Users/Roman/Work/my_word.doc", "C:/Users/Roman/Work/my_word_02.doc", "C:/Users/Roman/Work/Uctovna uzavierka.doc"]
and then process the list in a loop.
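
The loop and the timing could be sketched like this. time_each is a hypothetical helper name, and the block is where the per-document parsing (win32ole or antiword) would go; File.size is only a stand-in.

```ruby
# Call the block once per path and return [number_of_files, elapsed_seconds].
# (time_each is a made-up name for illustration.)
def time_each(paths)
  count = 0
  started = Time.now
  paths.each do |path|
    yield path
    count += 1
  end
  [count, Time.now - started]
end

# Example: time a trivial operation over every .doc file in the current directory.
count, seconds = time_each(Dir.glob('*.doc')) { |path| File.size(path) }
puts "Total files processed: #{count}."
puts "#{seconds} seconds."
```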

Why shouldn't it support threading? I think it does.
 
I had issues with parsing the word documents using Win32OLE.

In addition, I am pasting here my script (including my custom class) and your script, as well as how long it took to run each once (I would have done more testing, but the preliminary results are telling).

How I used your script:
Code:
t0 = Time.now
require 'win32ole'
total_files = 0

Folders = ['full_path_of_folder1',
  'full_path_of_folder2'
]

begin
  word = WIN32OLE.connect('Word.Application')
rescue WIN32OLERuntimeError
  word = WIN32OLE.new('Word.Application')

end

word.Visible = false

Folders.each do |folder|
  Dir.chdir folder
  Dir.glob('*').each do |file|
    next if File.directory? file or file[0..0] == "~"
    total_files += 1
    path = File.join(folder, file)
    document = word.documents.open(path)
    puts path
    document.close
  end
end

puts "Total files processed: #{total_files}."
puts "#{Time.now - t0} seconds."



How I used my script and class:
Code:
#TODO: Create 'installer' for this.

require 'rubygems'
require 'dbi'
require 'active_record'


class Report
  attr_reader :last, :first, :ssn, :date, :proc, :refphys,
              :filename, :modality, :age, :absolute_age

  def initialize filename
    @last, @first, @ssn, @date, @proc, @refphys = "", "", "", "", "", []
    @filename = filename
    @modality = ""
    @absolute_age = 0 # Age in seconds, for ease of comparison.
    @old_threshold = 18 # Number of hours, chosen by John.
    @old = false  # If the report is older than old_threshold, make it true.
    @age = get_age
    report = `c:/antiword/antiword.exe -m cp850.txt "#{@filename}"`.split("\n")
   
    report.delete_if { |a| a == ""}
    report.each do |line|
      line = line.upcase.split(":")
      case line[0]
      when "PATIENT NAME"
        @last, @first = line[1][0..-4].strip.split(",")
        @ssn = (line[2].strip rescue "000-00-0000")
      when "DATE OF EXAM"
        @date = line[1].strip
        fix_date
      when "RE"
        @proc = (line[1].strip rescue "procedure unavailable")
      end
      if line[0][0..3] == "DEAR"
        line = line[0][4..-1]
        line = line[0..-2] if line[-1..-1] == ","
        line.split(",").each do |doc|
          if doc.include? " AND "
            @refphys.concat doc.strip.split(" AND ")
          else
            @refphys.concat [doc.strip]
          end
        end
      end
    end
    determine_modality
  end

  def old?
    @old
  end

  private

  def get_age
    @absolute_age = Time.now - File.ctime(@filename)
    age = (@absolute_age / 60).round
    # The first calculation returns seconds. Divide by 60 to get minutes.
    return "#{age}min" if age < 60
    age = (age / 60).round  # Divide by 60 to get hours.
    @old = true if age > @old_threshold
    return "#{age}h" if age < 24
    return "#{(age / 24).round}d #{age % 24}h" # Divide by 24 to get days
  end

  def fix_date
    a = @date.split('/')
    @date = "#{a[2]}-#{a[0]}-#{a[1]}".to_date
  end

  def same_modality? list
    check = list[0][0]
    same = true
    list.each do |i|
      same = false if check != i[0]
    end
    return same
  end

  def determine_modality
    if @proc.include? "X RAY" or @proc.include? "X-RAY"
      @modality = "CR"
      return
    end
    sqldate = @date.to_s
    last = @last.gsub("'", "%").gsub(" ", "%").gsub("-","%")
    first = @first.split(' ')[0].gsub("'", "%").gsub("-", "%")
    query = %Q{
SELECT EXP.modality_code FROM examination EX
INNER JOIN patient P on P.pat_ckey = EX.pat_ckey
INNER JOIN exam_procedure EXP on EXP.procedure_ckey = EX.procedure_ckey
WHERE P.pat_name like '#{last}%#{first}%' AND
CONVERT(SMALLDATETIME, CONVERT(VARCHAR(10), EX.study_dttm, 101) ) = '#{sqldate}'
    }
    begin
      dbh = DBI.connect( 'DBI:ODBC:PACS', 'sa', '')
      data = dbh.select_all query
    rescue DBI::DatabaseError => e
      puts "An error occurred"
      puts "Error code: #{e.err}"
      puts "Error message: #{e.errstr}"
    ensure
      # disconnect from server
      dbh.disconnect if dbh
    end
    if data.nil? or data[0].nil? or data[0].empty?
      @modality = "NA"
    elsif data.size > 1
      if same_modality? data
        @modality = data[0][0]
      else
        @modality = "NA"
      end
    else
      @modality = data[0][0]
    end
  end

end

T0 = Time.now
require 'rubygems'
require 'active_record'
require 'report'

reports = []

#file_name = 'my_word.doc'

Folders = ['full_path_of_folder1',
  'full_path_of_folder2'
]

Folders.each do |folder|
  Dir.chdir folder
  Dir.glob('*').each do |file|
    next if File.directory? file or file[0..0] == "~"
    reports << (Report.new file)
    puts File.join(folder, file)
  end
end

puts "Total files processed: #{reports.size}."
puts "#{Time.now - T0} seconds."

Using Word/Win32OLE gave me this:
Total files processed: 1003.
406.344 seconds.

Mine, which does a fair bit more work than just opening the Word document, gave me this:
Total files processed: 1002.
67.727 seconds.
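
Put per file, the two runs work out as follows. This is plain arithmetic on the totals reported above, nothing measured anew.

```ruby
# Totals reported in the two runs above.
runs = {
  'Word/Win32OLE' => { :files => 1003, :seconds => 406.344 },
  'antiword'      => { :files => 1002, :seconds => 67.727 }
}

# Average time per file for each approach.
per_file = {}
runs.each do |name, run|
  per_file[name] = run[:seconds] / run[:files]
  puts "#{name}: #{(per_file[name] * 1000).round(1)} ms per file"
end

ratio = per_file['Word/Win32OLE'] / per_file['antiword']
puts "Win32OLE takes about #{ratio.round(1)} times as long per file."
```

That is roughly 405 ms against 68 ms per document, about six times slower, even though the antiword run also does the parsing and the database lookup.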

I am beginning to understand why this company (no names!) is so inefficient - Word sucks golf balls through garden hoses and they've been using Win32OLE (with C++ or C#, granted, but still) and opening Word to parse every single document...

 
Your measurements show that processing Word documents through Win32OLE is much slower than with antiword.
Good to know; I may have to do a similar task in the future.
Thank you for the valuable analysis.
 
You're welcome :)
After that.. Is it worth it to have a command-line window open when the script is run? Well, for this particular purpose, I think so, yes. It's worth it to save that kind of time :)

 