Cross-language challenge (of POSSIBLE interest)

WMK (Programmer)
A complete "idiot" (I won't tell you what others call him) has posted a challenge in the PL/I newsgroup - based on his post in the Fortran newsgroup - in response to a C++ challenge <G>.

I responded in the PL/I newsgroup that I wouldn't bother with this until it was asked by a valid COBOL programmer in the COBOL newsgroup, but after looking at it, I thought it *MIGHT* be fun for some of us to actually come up with a COBOL solution. (Depending upon table size restrictions, I think that an UNSTRING and SEARCH solution should be medium easy to do.)
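To give an idea of what I mean, something like the following is the general shape of the UNSTRING-and-SEARCH idea. This is just an untested sketch: the names, the 13000-entry table, and the hard-coded test line are all made up, and a real program would of course read and clean bible.txt first.

Code:
IDENTIFICATION DIVISION.
 PROGRAM-ID. UNSTRSRCH.
* Untested sketch only - names, sizes, and the hard-coded test line
* are made up; a real program would read and clean bible.txt.
 DATA DIVISION.
 WORKING-STORAGE SECTION.
 01  IN-LINE            PIC X(80)  VALUE
     "in the beginning god created the heaven and the earth".
 01  WORDS-IN-LINE      PIC S9(4)  COMP.
 01  W-SUB              PIC S9(4)  COMP.
 01  LINE-WORDS.
     05  LINE-WORD      OCCURS 12 TIMES  PIC X(20).
 01  UNIQUE-WORD-TABLE.
     05  UNIQUE-COUNT   PIC S9(5)  COMP  VALUE 0.
     05  UNIQUE-ENTRY   OCCURS 0 TO 13000 TIMES
                        DEPENDING ON UNIQUE-COUNT
                        INDEXED BY U-IDX.
         10  UNIQUE-WORD    PIC X(20).
         10  WORD-TALLY     PIC S9(7)  COMP.
 PROCEDURE DIVISION.
 0000-MAIN SECTION.
     MOVE 0 TO WORDS-IN-LINE.
     UNSTRING IN-LINE DELIMITED BY ALL SPACES
         INTO LINE-WORD (1)  LINE-WORD (2)  LINE-WORD (3)
              LINE-WORD (4)  LINE-WORD (5)  LINE-WORD (6)
              LINE-WORD (7)  LINE-WORD (8)  LINE-WORD (9)
              LINE-WORD (10) LINE-WORD (11) LINE-WORD (12)
         TALLYING IN WORDS-IN-LINE
     END-UNSTRING.
* Serial SEARCH of the unique-word table; AT END adds a new entry
     PERFORM VARYING W-SUB FROM 1 BY 1 UNTIL W-SUB > WORDS-IN-LINE
         IF UNIQUE-COUNT = 0
             ADD 1 TO UNIQUE-COUNT
             MOVE LINE-WORD (W-SUB) TO UNIQUE-WORD (1)
             MOVE 1 TO WORD-TALLY (1)
         ELSE
             SET U-IDX TO 1
             SEARCH UNIQUE-ENTRY
                 AT END
                     ADD 1 TO UNIQUE-COUNT
                     MOVE LINE-WORD (W-SUB)
                       TO UNIQUE-WORD (UNIQUE-COUNT)
                     MOVE 1 TO WORD-TALLY (UNIQUE-COUNT)
                 WHEN UNIQUE-WORD (U-IDX) = LINE-WORD (W-SUB)
                     ADD 1 TO WORD-TALLY (U-IDX)
             END-SEARCH
         END-IF
     END-PERFORM.
     DISPLAY "WORDS: " WORDS-IN-LINE "  UNIQUE: " UNIQUE-COUNT.
     GOBACK.

The real thing would loop over every record and do the punctuation/case cleanup first, of course.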

Anyone interested in trying? (Besides the requested timing, also include the compiler and OS used in your solution.)

P.S. <joke> Extra credit for using ANSI/ISO ('85, '89, or '02) conforming syntax - except for processing of Line Sequential input/output.

> ------------- below posted in comp.lang.fortran ----------
> Want to give this a shot?
> Get the text file at:
> Using a text editor (notepad) remove text before
> Book 01 Genesis and text after the last word in Revelations (amen),
> producing a bible.txt file containing
> 4,947,047 chars.
>
> The challenge is to process the bible.txt file into a words array and count unique
> word occurrences in a count array.
> (this challenge was initiated by LR, whose C++ source/results will be posted
> later, along with my own source/results).
>
> In this processing, convert all punctuation and numbers to blanks, and
> uppercase to lower.
> One punctuation exception: ' (within a word) is deleted, leaving wife's
> as wifes.
>
> When bible.txt is "thusly" processed, I expect you should get the following outputs:
> total words = 789781
> unique words = 12691
> xx.xx Sec ?.?? Ghz PC
>
> Pls post your time and PC speed..
>
> ! A few template statements in this word-processing benchmark challenge to
> get started are:
> ! -----------------------------
> program count_word_occurances ! in bible.txt
> implicit none
> integer,parameter :: maxw = 65536 ! or lower if possible
> character(24) :: words(maxw)
> integer :: i, n, counts(maxw)=0, t1(8), t2(8)
>
> call date_and_time(values=t1) ! get benchmark start time
>
> ! open file='bible.txt' ...
>
> ! process word occurances into the 2 arrays words,counts
> ! until EOF as per your word-processing algorithm
>
> call date_and_time(values=t2) ! get benchmark stop time
>
> n = 0 ! count unique words found
> do i = 1,maxw
> if (counts(i) /= 0) n = n+1
> end do
>
> write (*,*) 'total words =',sum(counts)
> write (*,*) 'unique words =',n
>
> write (*,'(f0.2,a)') (t2(5)*3600.+t2(6)*60.+t2(7) +t2(8)/1000.) &
> -(t1(5)*3600.+t1(6)*60.+t1(7) +t1(8)/1000.), ' Sec <- 2.8 Ghz PC'
> end program
>
>


Bill Klein
 
Actually, this is an interesting problem. The speed would definitely depend on which path you took to the solution. I can think of two very good possibilities for a COBOL solution to this problem, and perhaps a third.

Whether people want to attempt it is another issue. I might if I find some time. Actually a problem like this is one of COBOL's strong suits.
 
For those interested (and who don't also watch comp.lang.cobol), my sample solution is available at:


1) This is probably NOT the most efficient solution.

2) With the exception of the word "LINE" in "LINE SEQUENTIAL", this is a fully ANSI/ISO '85/'89 Standard conforming program (so it should be portable to most compilers).

3) There are "bunches" of debugging lines - which may show others how I "came" to this solution.

Bill Klein
 
I just downloaded your source and attempted to run it. How long is it supposed to run? I let it run for about 2-3 minutes before I finally terminated it. The ISAM file turned out to be 15.8 MB when I was done. I also got a compiler error on line 24. I'm on Fujitsu COBOL 3.0 if that helps.
 
Did you edit the "ASSIGN" clause for the input file?

Line 24 is the FD for the ISAM file, so I don't know what problem it would have (unless there was a problem with the ASSIGN clause for the ISAM file). What message did you get?

I am using Fujitsu V7 - but I don't *think* I am doing anything that shouldn't work with the old (free V3) compiler.

The ISAM file is about 32 MB (I don't worry about size with the 50 GB hard drive on my PC). If that is a problem for you, you can make each record smaller (change the two key field sizes). The largest word I found in the input file is 18 characters (I used 20 as a round number, again not caring about file size).
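In case it helps, this is roughly the shape of an indexed-file SELECT/FD and the kind of read-or-add logic such a program uses. The file name, key size, and record layout here are just an illustration, not copied from the posted source:

Code:
IDENTIFICATION DIVISION.
 PROGRAM-ID. ISAMDEMO.
* Illustration only - names, key size, and layout are invented,
* not the ones in the posted program.
 ENVIRONMENT DIVISION.
 INPUT-OUTPUT SECTION.
 FILE-CONTROL.
     SELECT OPTIONAL WORD-FILE ASSIGN TO "WORDISAM.DAT"
         ORGANIZATION IS INDEXED
         ACCESS MODE IS RANDOM
         RECORD KEY IS WORD-KEY
         FILE STATUS IS WORD-STATUS.
 DATA DIVISION.
 FILE SECTION.
 FD  WORD-FILE.
 01  WORD-REC.
     05  WORD-KEY          PIC X(20).
     05  WORD-TALLY        PIC 9(7).
 WORKING-STORAGE SECTION.
 01  WORD-STATUS           PIC XX.
 PROCEDURE DIVISION.
 0000-MAIN SECTION.
* OPEN I-O on an OPTIONAL file creates it if it does not yet exist
     OPEN I-O WORD-FILE.
* Count one occurrence of a word: READ, then WRITE or REWRITE
     MOVE "beginning" TO WORD-KEY.
     READ WORD-FILE
         INVALID KEY
             MOVE "beginning" TO WORD-KEY
             MOVE 1 TO WORD-TALLY
             WRITE WORD-REC
         NOT INVALID KEY
             ADD 1 TO WORD-TALLY
             REWRITE WORD-REC
     END-READ.
     CLOSE WORD-FILE.
     GOBACK.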

When I run it on my PC (not in debug mode, but as an "executable"), it takes about 40 seconds.

I am running on Windows XP - on a Pentium 4 1.90 Ghz processor.

If you still can't get my program to compile and run in "reasonable" time, please let me know.

P.S. In response to your AWK posting in the comp.lang.xxx newsgroups, it would seem to me that
AWK
PERL
and/or
REXX

might well be the "right tool for the job" for this particular assignment. (I know just enough awk to get a bit confused by your code.)

Bill Klein
 
You wrote:
"Did you edit the "ASSIGN" clause for the input file?"

Yes. I just pointed it to the same directory as the executable.

"Line 24 is the FD for the ISAM file, so I don't know what problem it would have (unless there was a problem with the ASSIGN clause for the ISAM file). What message did you get?"

24: JMN9261I-W (STD)LINE SEQUENTIAL FILE IS BEYOND THE COMMON RANGE.

I mention that because I'm definitely puzzled about how to resolve it. It's hard to tell whether that is the cause of the program running the way it does.

"When I run it on my PC (not in debug mode, but as an executable), it takes about 40 seconds."

I am not running on anything near that fast, but I let it sit for about 2-3 minutes. I can always let it sit longer, especially since you indicated that the ISAM file should be larger than the one I ended up with.

"P.S. In response to your AWK posting in the comp.lang.xxx newsgroups, it would seem to me that ..."

I think you got me confused with someone else. I've only posted to comp.lang.cobol in the past (and here). You may be confusing me with someone here that simply calls themselves "Glenn".
 
The message about being "BEYOND THE COMMON RANGE" is the ANSI/ISO "flagging" message and doesn't mean it is an error. If you want to get rid of it, just remove the @OPTIONS line at the beginning of the source code.

(I used this to verify that I was creating "conforming" source code.)

Sorry about confusing you with another Glenn. (That Glenn posted an "awk" solution to the problem)

Bill Klein
 
Thanks. Your version ran in about 5 minutes here and gave correct results. If I get the chance, I'll attempt at least one (or both) of the other solutions.
 
With Microfocus COBOL Version 3.2.46 (DOS) running on an E-Machines @ 400 MHz, I got 2:42:45 using an internal table and 4:45 using a file sort.
 
For those interested in a version of my program that:

A) runs MUCH faster (under 10 seconds on my machine)

B) Uses a COBOL internal sort, but

C) creates no (large) temporary files,

D) Is still ANSI '85/'89 conforming (other than LINE sequential for input)

See:


Bill Klein
 
Tried your 2nd program. Actually, it gave me quite a surprise compared to my own implementation, which I just completed. My guess is that Fujitsu COBOL has a problem with INSPECT in some way.

On a Celery 433 with Fujitsu 3.0
Your first program: 4:55:45
Your second program: 1:03:44
My program using the sort method: 1:54:40

I used a slightly different method than you to pick the words apart (INSPECT/MOVE rather than UNSTRING into a table). Of course I might find something that'll make it comparable. I might try solution #3 I had in mind and see how it goes.
 
In your SORT method, did you actually create tables of all the words and/or the unique words?

If you look at my code, I just "counted" them (before RELEASE - and after RETURN - checking for duplicates)
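In outline, the sort approach looks something like the following. This is only a rough sketch with invented names, the line-cleanup/UNSTRING step is reduced to a stub, and the totals are counted in a slightly different place than in my posted program:

Code:
IDENTIFICATION DIVISION.
 PROGRAM-ID. SORTCNT.
* Rough sketch only - names are invented and the word-splitting
* step is a stub; this is not the posted program.
 ENVIRONMENT DIVISION.
 INPUT-OUTPUT SECTION.
 FILE-CONTROL.
     SELECT BIBLE-FILE ASSIGN TO "bible.txt"
         ORGANIZATION IS LINE SEQUENTIAL.
     SELECT SORT-WORK  ASSIGN TO "SORTWORK".
 DATA DIVISION.
 FILE SECTION.
 FD  BIBLE-FILE.
 01  BIBLE-REC             PIC X(80).
 SD  SORT-WORK.
 01  SORT-REC.
     05  SORT-WORD         PIC X(20).
 WORKING-STORAGE SECTION.
 01  EOF-FLAG              PIC X      VALUE "N".
 01  SORT-EOF              PIC X      VALUE "N".
 01  PREV-WORD             PIC X(20)  VALUE HIGH-VALUES.
 01  TOTAL-WORDS           PIC 9(7)   VALUE 0.
 01  UNIQUE-WORDS          PIC 9(5)   VALUE 0.
 PROCEDURE DIVISION.
 0000-MAIN SECTION.
     SORT SORT-WORK ON ASCENDING KEY SORT-WORD
         INPUT PROCEDURE IS 1000-RELEASE-WORDS
         OUTPUT PROCEDURE IS 2000-COUNT-WORDS.
     DISPLAY "TOTAL WORDS:  " TOTAL-WORDS.
     DISPLAY "UNIQUE WORDS: " UNIQUE-WORDS.
     GOBACK.

 1000-RELEASE-WORDS SECTION.
     OPEN INPUT BIBLE-FILE.
     PERFORM UNTIL EOF-FLAG = "Y"
         READ BIBLE-FILE
             AT END MOVE "Y" TO EOF-FLAG
             NOT AT END
* (clean the line and UNSTRING it into words here, one RELEASE
*  per word; this stub just releases the first 20 characters)
                 MOVE BIBLE-REC (1:20) TO SORT-WORD
                 RELEASE SORT-REC
         END-READ
     END-PERFORM.
     CLOSE BIBLE-FILE.

 2000-COUNT-WORDS SECTION.
* Count every returned word; count a unique word on each
* control break in the sorted sequence
     PERFORM UNTIL SORT-EOF = "Y"
         RETURN SORT-WORK
             AT END MOVE "Y" TO SORT-EOF
             NOT AT END
                 ADD 1 TO TOTAL-WORDS
                 IF SORT-WORD NOT = PREV-WORD
                     ADD 1 TO UNIQUE-WORDS
                     MOVE SORT-WORD TO PREV-WORD
                 END-IF
         END-RETURN
     END-PERFORM.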

I have another version (not posted) that creates HUGE tables. It slows things down a bit (more than I would have thought)

FYI-
I do believe that there have been some improvements in the internal SORT algorithms between Fujitsu V3.0 and V7. I do NOT think that I have Fujitsu's "BSORT" available, but it might be there (under the covers).

Bill Klein
 
No I didn't create any kind of tables. I picked apart the strings, sent the individual words to SORT and then just did a control-break loop to count the unique words and the total number of words.

I'm thinking it has to do with INSPECT, or with how I'm approaching it, since we did differ on the methods used. (I used a loop with "INSPECT ... FOR CHARACTERS BEFORE INITIAL SPACE" to find each of the words.)
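In rough terms the loop looked like this. It's a simplified sketch rather than the exact code from my program, and it assumes the line is already cleaned up and words are 20 characters or less:

Code:
IDENTIFICATION DIVISION.
 PROGRAM-ID. INSPECTWD.
* Simplified sketch of the INSPECT/MOVE word-picking loop;
* names are illustrative, not the exact code from my program.
 DATA DIVISION.
 WORKING-STORAGE SECTION.
 01  WORK-LINE   PIC X(80)  VALUE
     "in the beginning god created the heaven and the earth".
 01  SCAN-POS    PIC S9(4) COMP.
 01  WORD-LEN    PIC S9(4) COMP.
 01  ONE-WORD    PIC X(20).
 PROCEDURE DIVISION.
 0000-MAIN SECTION.
     MOVE 1 TO SCAN-POS.
     PERFORM UNTIL SCAN-POS > 80
         IF WORK-LINE (SCAN-POS:1) = SPACE
* Skip blanks between words
             ADD 1 TO SCAN-POS
         ELSE
             MOVE 0 TO WORD-LEN
* Length of this word = characters before the next space
             INSPECT WORK-LINE (SCAN-POS:)
                 TALLYING WORD-LEN FOR CHARACTERS BEFORE INITIAL SPACE
             MOVE WORK-LINE (SCAN-POS:WORD-LEN) TO ONE-WORD
             DISPLAY ONE-WORD
* (in the real program the word is counted or handed to the sort here)
             ADD WORD-LEN TO SCAN-POS
         END-IF
     END-PERFORM.
     GOBACK.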

The third method is going to involve tables in memory, but probably a little different than just running SEARCH.
 
I tried using the value of the words to address a 20k-entry table. It ran twice as fast as using SEARCH ALL. It still took over an hour. Using a SORT File is under 5 minutes.
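Roughly the kind of thing I mean by addressing the table with a value computed from the word is shown below. It is an illustration of the idea only; the hash-and-probe details and all of the names are made up, not my actual code:

Code:
IDENTIFICATION DIVISION.
 PROGRAM-ID. HASHDEMO.
* Illustrative sketch of addressing a word table directly by a
* value computed from the word (simple hash with linear probing).
* The slot calculation and names are invented for this example.
 DATA DIVISION.
 WORKING-STORAGE SECTION.
 01  THE-WORD        PIC X(20)  VALUE "beginning".
 01  W-CHAR          PIC S9(4)  COMP.
 01  HASH-VAL        PIC S9(9)  COMP.
 01  WORD-TABLE.
     05  TAB-ENTRY OCCURS 20011 TIMES.
         10  TAB-WORD    PIC X(20).
         10  TAB-COUNT   PIC S9(7) COMP.
 PROCEDURE DIVISION.
 0000-MAIN SECTION.
     INITIALIZE WORD-TABLE.
     PERFORM 1000-COUNT-WORD.
     PERFORM 1000-COUNT-WORD.
     DISPLAY THE-WORD " SEEN " TAB-COUNT (HASH-VAL) " TIMES".
     GOBACK.

 1000-COUNT-WORD SECTION.
* Fold the characters of the word into a slot number 1-20011
     MOVE 0 TO HASH-VAL.
     PERFORM VARYING W-CHAR FROM 1 BY 1 UNTIL W-CHAR > 20
         COMPUTE HASH-VAL = FUNCTION MOD
             (HASH-VAL * 31 + FUNCTION ORD (THE-WORD (W-CHAR:1)),
              20011)
     END-PERFORM.
     ADD 1 TO HASH-VAL.
* Linear probing: step forward until the word or a free slot is found
     PERFORM UNTIL TAB-WORD (HASH-VAL) = THE-WORD
                OR TAB-WORD (HASH-VAL) = SPACES
         ADD 1 TO HASH-VAL
         IF HASH-VAL > 20011
             MOVE 1 TO HASH-VAL
         END-IF
     END-PERFORM.
     MOVE THE-WORD TO TAB-WORD (HASH-VAL).
     ADD 1 TO TAB-COUNT (HASH-VAL).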

I think memory management of the large table is the cause of the slowdown. If I had a 32-bit compiler, I suspect that the internal table would be faster than the sort.
 
I got my 3rd idea coded up.

433 Celery 22.31 seconds, correct results.
 
433 Celery, Fujitsu 3.0 - I'm down to 13.75 seconds now on my third program idea (4th one total in this thread). I will post after I'm done trying to get the time down.

WMK: The big time difference between our SORT programs is because UNSTRING is much more efficient than picking the strings apart individually. When I changed my method in that program to use UNSTRING, my program's time became comparable to yours. In fact, changing that code was the major contributor to getting the time of this program down from 22.31 seconds to 13.75 seconds.

webrabbit: I use a memory storage table in my program to hold the unique values I encounter. So I'm not sure memory management is the issue in your table implementation.
 
OK, I looked at this and can't think of any way to pare the time down further.

I'm curious about how this would work on a faster machine. It'd have to be almost instantaneous, I think.

Anyway here it is. If anyone's willing to host it, please let me know - that'd probably be easier on those who are interested.

As a note, I have a Win32 FPC 1.010 implementation as well that runs in the high 13's...

Code:
IDENTIFICATION DIVISION.
 PROGRAM-ID. WORDCNT3B.
 AUTHOR. GLENN9999 AT TEKTIPS.COM
* Any use or distribution of this code must include the statement
* "copyright 2005 by Glenn9999 at Tektips.com".
 ENVIRONMENT DIVISION.
 INPUT-OUTPUT SECTION.
* 3A, COUNT WORDS IN KJV12.TXT BUT WITH INDEXES INSTEAD OF SUBSCRIPTS.
 FILE-CONTROL.
     SELECT KJV12 ASSIGN TO "KJV12.TXT"
     ORGANIZATION IS LINE SEQUENTIAL.
 DATA DIVISION.
 FILE SECTION.
 FD  KJV12
     Record Varying in Size from 1 to 80
      	Depending on WS-Char-Cnt
     BLOCK CONTAINS 0 RECORDS.
 01  KJV12-REC.
     05  Char Occurs 1 to 80 times
     	Depending on WS-Char-Cnt.
         10  Each-Char   	Pic X.

 WORKING-STORAGE SECTION.
 01  KJV-INPUT-PROCESS.
     04  KJV-DATA             PIC X(80).
     04  KJV-OUT1             PIC X(20).
     04  WS-CHAR-CNT          PIC 9(4) COMP-5.
     04  KJV-COUNT            PIC S9(4) COMP-5.
     04  WORD-TABL.
        08  EACH-WORD OCCURS 20 TIMES INDEXED BY IND PIC X(20).
     04  WORDS-IN-LINE        PIC S9(4) COMP-5.

 01  KJV-OUTPUT-PROCESS.
     04  KJV-WORD-COUNT       PIC S9(8) COMP-5 VALUE 0.
     04  KJV-UNIQ-COUNT       PIC S9(8) COMP-5 VALUE 0.
     04  EOF-DATA             PIC X VALUE "N".
     04  SEARCH-INDICATOR     PIC X VALUE 'N'.
         88  SEARCH-DONE      VALUE IS "Y".
         88  SEARCH-NOT-DONE  VALUE IS "N".

 01  TIMING-PROC.
     04  Start-Time       Pic 99/99/99/99.
     04  End-Time         Pic 99/99/99/99.

 01  TABLE-STORE-DATA.
     04  KJV-TABLE-STORE OCCURS 13000 TIMES
                         INDEXED BY KJV-CURR-PTR KJV-UNIQ-PTR.
         08  KJV-OBJECT       PIC X(20).
         08  KJV-LEFT-VAL     POINTER VALUE NULL.
         08  KJV-LEFT-PTR REDEFINES KJV-LEFT-VAL INDEX.
         08  KJV-RIGHT-VAL    POINTER VALUE NULL.
         08  KJV-RIGHT-PTR REDEFINES KJV-RIGHT-VAL INDEX.

 PROCEDURE DIVISION.
 0000-MAIN SECTION.
     Move Function Current-Date (9:8) 	To Start-Time.
     PERFORM 1000-INPUT.
     DISPLAY " TOTAL WORDS IN FILE: " KJV-WORD-COUNT.
     DISPLAY "UNIQUE WORDS IN FILE: " KJV-UNIQ-COUNT.
     Move Function Current-Date (9:8) 	To END-Time.

     DISPLAY "END: " END-TIME.
     DISPLAY "START: " START-TIME.
     GOBACK.

 1000-INPUT SECTION.
     OPEN INPUT KJV12.
     READ KJV12 INTO KJV-DATA
       AT END MOVE "Y" TO EOF-DATA
     END-READ.
     PERFORM UNTIL EOF-DATA = "Y"
       IF KJV-DATA NOT = SPACES
          PERFORM 1100-PARSE-STRING
       END-IF
       READ KJV12 INTO KJV-DATA
          AT END MOVE "Y" TO EOF-DATA
       END-READ
     END-PERFORM.
     CLOSE KJV12.

 1100-PARSE-STRING SECTION.
* UPPERCASE DATA AND REMOVE NUMERICS AND PUNCTUATION
     MOVE FUNCTION UPPER-CASE (KJV-DATA) TO KJV-DATA
     INSPECT KJV-DATA REPLACING ALL "'S" BY "S "
     INSPECT KJV-DATA
             CONVERTING "."",;?:)(!-'0123456789"
                     TO "                     " .

* REMOVE LEADING SPACES
     MOVE 0 TO KJV-COUNT.
     INSPECT KJV-DATA TALLYING KJV-COUNT FOR LEADING SPACES.
     IF KJV-COUNT > 0
       MOVE KJV-DATA (KJV-COUNT + 1:) TO KJV-DATA
     END-IF.

* PICK APART STRING INTO WORDS
     MOVE 0 TO WORDS-IN-LINE.
     UNSTRING KJV-DATA
       DELIMITED BY ALL SPACES
            INTO Each-Word (1)  Each-Word (2)  Each-Word (3)
                 Each-Word (4)  Each-Word (5)  Each-Word (6)
                 Each-Word (7)  Each-Word (8)  Each-Word (9)
                 Each-Word (10) Each-Word (11) Each-Word (12)
                 Each-Word (13) Each-Word (14) Each-Word (15)
                 Each-Word (16) Each-Word (17) Each-Word (18)
                 Each-Word (19) Each-Word (20)
          TALLYING WORDS-IN-LINE
      END-UNSTRING.

     PERFORM VARYING IND FROM 1 BY 1
        UNTIL IND > WORDS-IN-LINE
          MOVE EACH-WORD (IND) TO KJV-OUT1
          PERFORM 2000-ENTRY-TABLE
     END-PERFORM.

 2000-ENTRY-TABLE SECTION.
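* The word is looked up in (or added to) the table above, which is
* used as an unbalanced binary search tree: start at entry 1 (the
* root) and follow each entry's left/right index until the word is
* found or an empty branch is reached.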
     ADD 1 to KJV-WORD-COUNT.
     IF KJV-UNIQ-COUNT = 0
        ADD 1 TO KJV-UNIQ-COUNT
        SET KJV-UNIQ-PTR TO 1
        MOVE KJV-OUT1 TO KJV-OBJECT (KJV-UNIQ-PTR)
     ELSE
        SET KJV-CURR-PTR TO 1
        SET SEARCH-NOT-DONE TO TRUE
        PERFORM 2100-WALK-TABLE UNTIL SEARCH-DONE
     END-IF.

 2100-WALK-TABLE SECTION.
     IF KJV-OBJECT (KJV-CURR-PTR) = KJV-OUT1
        SET SEARCH-DONE TO TRUE
     ELSE
        IF KJV-OBJECT (KJV-CURR-PTR) > KJV-OUT1
           PERFORM 2200-NAV-LEFT
        ELSE
           IF KJV-OBJECT (KJV-CURR-PTR) < KJV-OUT1
              PERFORM 2300-NAV-RIGHT
           END-IF
        END-IF
     END-IF.

 2200-NAV-LEFT SECTION.
     IF KJV-LEFT-VAL (KJV-CURR-PTR) = NULL
        ADD 1 TO KJV-UNIQ-COUNT
        SET KJV-UNIQ-PTR UP BY 1
        MOVE KJV-OUT1 TO KJV-OBJECT (KJV-UNIQ-PTR)
        SET KJV-LEFT-PTR (KJV-CURR-PTR) TO KJV-UNIQ-PTR
        SET SEARCH-DONE TO TRUE
     ELSE
        SET KJV-CURR-PTR TO KJV-LEFT-PTR (KJV-CURR-PTR)
     END-IF.

 2300-NAV-RIGHT SECTION.
     IF KJV-RIGHT-VAL (KJV-CURR-PTR) = NULL
        ADD 1 TO KJV-UNIQ-COUNT
        SET KJV-UNIQ-PTR UP BY 1
        MOVE KJV-OUT1 TO KJV-OBJECT (KJV-UNIQ-PTR)
        SET KJV-RIGHT-PTR (KJV-CURR-PTR) TO KJV-UNIQ-PTR
        SET SEARCH-DONE TO TRUE
     ELSE
        SET KJV-CURR-PTR TO KJV-RIGHT-PTR (KJV-CURR-PTR)
     END-IF.
 
I think it is memory management. As the table is 400kb in size, a HUGE memory model is invoked. I get a message from the linker that a temporary file has been generated. I have not gotten that message for any other program. Since the compiler I have (Microfocus 2.3.46) is only a 16-bit compiler, I must run the program under XM. With my e-machines motherboard and the Celeron processor, I think this is very inefficient.
 
"I think it is memory management. As the table is 400kb in size, a HUGE memory model is invoked."

For a 16-bit environment, you might be right. It can only handle 64K at a time in the data stack, so it sounds like it is paging the data in and out of a temp file (a bad solution, really, since the rest of the 640K heap is there).

Since we have similar machines, it might be neat if I put my Pascal implementation of the COBOL I posted above through a 16-bit Pascal compiler (with changes, of course, to put my big table on the heap) and see if there's a time difference. I say that because when I did a little testing I realized that SEARCH or SEARCH ALL wouldn't be good, but the in-core table would still be good.

There has to be a difference, though. My sort implementation on Fujitsu runs at around 1 minute, so if yours takes 5, there might be something to the 16-bit compiler theory.

I'll have to try it with my Pascal source.
 