Max file size of 2Gb for gfortran under win XP

Status
Not open for further replies.

GerritGroot

Technical User
Nov 3, 2006
291
ES
Hi,

Under Windows XP I made a little program to change words in huge text files (in fact three files of 1Gb, 1.8Gb and 7Gb) as normal text editors refuse to open these huge text files.

To do so, without an editor, I
- 1 - Open the input textfile
- 2 - Read one line only (so the program uses little memory, I don't read the whole file, but only one line)
- 3 - See if the word that has to be changed occurs in that line and if so change it within that string
- 4 - Write the (possibly changed) text line (string) to the output file and flush

Everything goes ok for the 1Gb and the 1.8Gb input files, however for the 7Gb text file it doesn't work anymore. As soon as EXACTLY 2Gb of the output text file have been written to disk, the program just gets stuck forever.

Possibly this has something to do with OS dependent maximum file sizes, but the input file comes from windows as well and contains 7Gb of text.

How can I solve this?

Thanks,

Gerrit
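
To check whether the 2Gb stop comes from the compiler's runtime rather than from the replacement logic, a stripped-down write test may help. The following is only a sketch: the name WriteBigFile, the unit number 51 and the 99-character line are assumptions made for illustration, and INQUIRE(SIZE=) is a Fortran 2003 feature that older compilers may not accept.

```fortran
SUBROUTINE WriteBigFile(FileName,nlines,nbytes)
! Hypothetical test helper (not part of the original program):
! writes nlines lines of 99 characters each and returns the file size in bytes
IMPLICIT NONE
INTEGER, PARAMETER :: i8=SELECTED_INT_KIND(18)   ! 64-bit integers, so counts above 2**31-1 fit
CHARACTER(LEN=*), INTENT(IN) :: FileName
INTEGER(KIND=i8), INTENT(IN) :: nlines
INTEGER(KIND=i8), INTENT(OUT) :: nbytes
INTEGER(KIND=i8) :: i
OPEN(UNIT=51,FILE=FileName,STATUS='REPLACE',ACTION='WRITE')
DO i=1,nlines,1
   WRITE(51,'(A)')REPEAT('x',99)
END DO
CLOSE(51,STATUS='KEEP')
! Ask the runtime how large the file on disk is
INQUIRE(FILE=FileName,SIZE=nbytes)
END SUBROUTINE WriteBigFile
```

Calling it with nlines=22000000 asks for at least 2.2e9 bytes, which is past the 2Gb mark. If the program hangs, or nbytes comes back smaller than expected, the limit is reproduced without any string handling involved.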
 
Hhhmmm...Fortran would not have been my first choice for this task.

There are command line tools to address this on the fly...in Linux (sed for one). You can download 'sed' and other Linux utilities onto Windows; they are readily available.

 
This looks as if your prog keeps all the lines in memory.

So posting your code would be helpful.

Norbert


The optimist believes we live in the best of all possible worlds - the pessimist fears this might be true.
 
I flush everything I write, but I'll post it anyway, maybe you see more wrong stuff in it.
 
@salgerman: I didn't know about "sed". I'll have a look
@gummibaer: You can find my code hereunder.

Code:
PROGRAM MultipleFileReplacer
IMPLICIT NONE
INTEGER, PARAMETER :: mxll=1000
INTEGER :: amountoffiles,ifi,i1,i2,amountofwords,iw,sumclosed
INTEGER, DIMENSION(:), ALLOCATABLE :: iur,iuw,isclosed
CHARACTER(LEN=1000), DIMENSION(:), ALLOCATABLE :: InFile,OutFile,OldWord,NewWord
CHARACTER(LEN=1000) :: Line


OPEN(UNIT=11,FILE='./tobechanged.txt',STATUS='OLD')
READ(11,*)amountoffiles
ALLOCATE(InFile(amountoffiles),OutFile(amountoffiles))
DO ifi=1,amountoffiles,1
   READ(11,FMT='(A)')Line
   i1=INDEX(Line,'"')
   i2=INDEX(Line(i1+1:),'"')+i1
   InFile(ifi)=Line(i1+1:i2-1)
   i1=INDEX(Line(i2+1:),'"')+i2
   i2=INDEX(Line(i1+1:),'"')+i1
   OutFile(ifi)=Line(i1+1:i2-1)
END DO
READ(11,*)amountofwords
ALLOCATE(OldWord(amountofwords),NewWord(amountofwords))
DO iw=1,amountofwords,1
   READ(11,'(A)')Line
   i1=INDEX(Line,'"')
   i2=INDEX(Line(i1+1:),'"')+i1
   OldWord(iw)=Line(i1+1:i2-1)
   i1=INDEX(Line(i2+1:),'"')+i2
   i2=INDEX(Line(i1+1:),'"')+i1
   NewWord(iw)=Line(i1+1:i2-1)
END DO
CLOSE(11,STATUS='KEEP')


! Define unit numbers and open the files
ALLOCATE(iur(amountoffiles),iuw(amountoffiles),isclosed(amountoffiles))
DO ifi=1,amountoffiles,1
   iur(ifi)=10+ifi
   iuw(ifi)=10+amountoffiles+ifi
   OPEN(UNIT=iur(ifi),FILE=TRIM(InFile(ifi)),STATUS='OLD')
   OPEN(UNIT=iuw(ifi),FILE=TRIM(OutFile(ifi)),STATUS='REPLACE')
END DO


! Read and write (change the words)
isclosed(:)=0
sumclosed=0
DO WHILE(sumclosed/=amountoffiles)
   DO ifi=1,amountoffiles,1
      IF(isclosed(ifi)==0)THEN
         READ(iur(ifi),FMT='(A)',ERR=100,END=100)Line
         DO iw=1,amountofwords,1
            i1=0
            i1=INDEX(Line,TRIM(OldWord(iw)))
            IF(i1/=0)CALL ReplaceText(mxll,i1,OldWord(iw),NewWord(iw),Line)
         END DO
         WRITE(iuw(ifi),'(A)')TRIM(Line)
         CALL FLUSH(iuw(ifi))
      END IF
      GOTO 200
      100 CONTINUE
          CALL FLUSH(iuw(ifi))           ! flush while the unit is still open
          CLOSE(iur(ifi),STATUS='KEEP')
          CLOSE(iuw(ifi),STATUS='KEEP')
          isclosed(ifi)=1
          WRITE(*,'(A)')'File "'//TRIM(InFile(ifi))//'" replaced by "'//TRIM(OutFile(ifi))//'"'
      200 CONTINUE
   END DO
   sumclosed=SUM(isclosed(:))
END DO


DEALLOCATE(InFile,OutFile,OldWord,NewWord,iur,iuw,isclosed)


END PROGRAM MultipleFileReplacer

It uses the following subroutine:
Code:
SUBROUTINE ReplaceText(mxll,i1,OldWord,NewWord,LineInOut)
IMPLICIT NONE
INTEGER, INTENT(IN) :: mxll,i1
CHARACTER(LEN=*), INTENT(IN) :: OldWord,NewWord
CHARACTER(LEN=*), INTENT(INOUT) :: LineInOut
CHARACTER(LEN=mxll) :: LineIn,LineOut
INTEGER :: lold,lnew,ldiff,lr
LineOut=''
LineIn=LineInOut
lold=LEN_TRIM(OldWord)
lnew=LEN_TRIM(NewWord)
ldiff=lnew-lold
! The first part is unchanged
LineOut(1:i1-1)=LineIn(1:i1-1)
! Next, change the word
LineOut(i1:i1+lnew-1)=TRIM(NewWord)
! ...and write the rest unchanged again
lr=LEN_TRIM(LineIn(i1+lold:))
LineOut(i1+lnew:MIN(mxll,i1+lnew+lr))=LineIn(i1+lold:)   ! clamp so a longer replacement cannot run past mxll
LineInOut=LineOut
END SUBROUTINE ReplaceText

And a text file as input that tells the program what should be changed and in which files:
"tobechanged.txt"
Code:
3
"test0.txt" "test1.txt"
"best0.txt" "best1.txt"
"west0.txt" "west1.txt"
4
"How are you?" "Comment ça va?"
"Cheers" "Slainte"
"What the hell?" "Would you be so kind to excuse me please"
"I'm a stunning programmer" "I'm a lousy programmer"

Hope this helps. As you see I flush after every write (but windows only seems to obey that partially).
BTW, the GOTO construct is quite ugly, but as the READ needs a statement label for END= or ERR=, I saw no other way. I wanted to avoid the non-standard WHILE(.NOT.EOF(unitnr)) and to keep it possible to have several files open at the same time, which seemed faster to me (no idea whether it really is, I didn't test that). The 2Gb limit has been tested with one file only!
BTW2: It's true that you can't change the " character with this program, but I didn't need that (on the contrary, I needed to be able to change spaces).
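
On the GOTO remark above: READ also accepts IOSTAT=, which is standard Fortran and needs no statement labels. A minimal sketch, assuming a single input/output pair; the subroutine name CopyLines and the unit numbers 21 and 22 are made up for illustration, and the word replacement itself is left as a comment.

```fortran
SUBROUTINE CopyLines(InName,OutName)
! Hypothetical helper (not part of the original program):
! copies InName to OutName line by line, without END=/ERR= labels.
! Lines longer than 1000 characters are truncated, as in the original code.
IMPLICIT NONE
CHARACTER(LEN=*), INTENT(IN) :: InName,OutName
CHARACTER(LEN=1000) :: Line
INTEGER :: ios
OPEN(UNIT=21,FILE=InName,STATUS='OLD',ACTION='READ')
OPEN(UNIT=22,FILE=OutName,STATUS='REPLACE',ACTION='WRITE')
DO
   READ(21,FMT='(A)',IOSTAT=ios)Line
   IF(ios/=0)EXIT   ! negative: end of file, positive: read error
   ! ...the calls to ReplaceText would go here...
   WRITE(22,'(A)')TRIM(Line)
END DO
CLOSE(21,STATUS='KEEP')
CLOSE(22,STATUS='KEEP')
END SUBROUTINE CopyLines
```

With several files, the same loop could simply be run once per file instead of interleaving the reads, at the cost of having only one file pair open at a time.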
 
IMO, it could be an OS dependent problem. You have a very big text file and 4 arrays, with 1000 chars in each element.
What is the approximate size of these arrays when you take the 7GB input file?

As suggested above, try to use any standard utility like sed. The usage of sed is simple, for example
Code:
$ cat input_file.txt
word1 other words word2 and something else
word2 other words word1 and something else
xyz word1 word2 xxxx and something else
abc    word2 other words word1 and something else

$ sed -e 's/word1/NEW_WORD1/g; s/word2/NEW_WORD2/g' input_file.txt
NEW_WORD1 other words NEW_WORD2 and something else
NEW_WORD2 other words NEW_WORD1 and something else
xyz NEW_WORD1 NEW_WORD2 xxxx and something else
abc    NEW_WORD2 other words NEW_WORD1 and something else

$ sed -e 's/word1/NEW_WORD1/g; s/word2/NEW_WORD2/g' input_file.txt > output_file.txt

$ cat output_file.txt
NEW_WORD1 other words NEW_WORD2 and something else
NEW_WORD2 other words NEW_WORD1 and something else
xyz NEW_WORD1 NEW_WORD2 xxxx and something else
abc    NEW_WORD2 other words NEW_WORD1 and something else
 
At first sight, your code looks okay to me. Could not test it though for I do not have such a big file.

But in the past I had some trouble with input buffering. It looked like the prog used a new location in memory for every read statement.
This is just an assumption; I am not sure that you have a similar problem. So I would just mess around a bit and maybe we hit on a workaround.

If you are running XP, then the standard size of memory available to a program is 2GB (provided you have enough RAM installed). Still provided you have enough RAM (4GB), you might extend this limit to 3GB. This will not solve your problem, but if you could do this and find your prog quits after 3GB, then we know for sure it is something about memory management. If so, we may be able to find a solution to this.

Google for 'xp 3gb ram' for instructions on how to set this up.

Sorry for being somewhat vague.

Norbert

 
Back again,

Yep, the 1000 characters in the 4 strings is highly exaggerated, I think about 100 will do, but who cares? That shouldn't influence the ability to write more than 2Gb, should it? If it did, I would never end up with EXACTLY 2Gb, but with some strange number instead.
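
One possible reading of a stop at exactly 2Gb, offered as an assumption and not as a diagnosis: 2Gb is 2**31 bytes, and the largest value a signed 32-bit integer can hold is 2**31-1 = 2147483647. A runtime that keeps the file position in such an integer cannot count past that point. The small function below is a hypothetical illustration (the name FitsInInt32 is made up) that checks whether a byte count still fits.

```fortran
FUNCTION FitsInInt32(nbytes) RESULT(fits)
! Hypothetical helper: .TRUE. if nbytes can be stored in a signed 32-bit integer
IMPLICIT NONE
INTEGER, PARAMETER :: i4=SELECTED_INT_KIND(9)    ! 32-bit integer on common compilers
INTEGER, PARAMETER :: i8=SELECTED_INT_KIND(18)   ! 64-bit integer
INTEGER(KIND=i8), INTENT(IN) :: nbytes
LOGICAL :: fits
! HUGE gives the largest value of the kind, converted to 64 bits for the comparison
fits=(nbytes<=INT(HUGE(1_i4),KIND=i8))
END FUNCTION FitsInInt32
```

Under this assumption the 1Gb and 1.8Gb files stay below the limit while the 7Gb file does not, which would match the behaviour described in this thread.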

It's true that my laptop has only 2Gb of RAM, which is rather suspicious, so I tried to run the executable on another computer with 16Gb RAM, but again the written file gets stuck at 2Gb, so it seems that it doesn't have to do anything with the amount of memory.

I'm running sed, which may solve the "change text" problem but later on I'll have to do more with that text file, like extracting parts, reformatting parts, adding extra variables after some little computations etc. etc., so much more than only changing some words.

So even if sed solves my problem for now (it's actually running and I'll let you know the outcome), I'll need a code similar to the one above that is able to read and write text files bigger than 2Gb, so I can adapt it.

It's quite disappointing that, even when you only read one line and write only one line and flush it, the OS finally does what it wants, isn't it?

Any solution or tip is appreciated!
 
It looks like this search-and-replace is starting to get complex.

sed will do it, but you need to get good at it; probably awk would also work...but talk about awk-ward programs.

Say, among the Linux/Unix utilities, there should be a file splitter one...maybe you should split the file into manageable chunks that you can edit without any other expertise than knowing how to use your favorite text editor.

Among other things, have you googled large-file text editors? There are quite a few conversations out there...but there seems to be an architecture limitation. I read that the 32-bit version of VEDIT can only handle files up to 2GB, but VEDIT Pro64 can handle any size. Some people claim UltraEdit handles large files, but it is not clear if it can handle 5GB, for example. There is also a custom compiled 64-bit version of SciTe that somebody uploaded. I also ran into Cream, whose file-size handling seems to be limited only by hard-disk space! Check the third bullet in its feature list...maybe you should give this one a try.

Germán
 
I used JuJuEdit, but it won't open files that are larger than 2Gb. Anyway, I don't think an editor is the tool to do computations and add values etc., but I'll have a look at the ones you mention, they may be better than JuJu, thanks for pointing those out.

On the other hand, shouldn't fortran just do what we tell it to do? Write and FLUSH...

I'm also asking myself the question why sed is able to do this in windows, after all sed is written in some language as well.

Splitting the file in chunks is an option (without mentioning the method, which should be something that's not fortran), but requires recoding on the computational part... ...not really an attractive option.
 
Gerrit,

you are flushing the output variable. But the problem seems to be on the input side of the business - at least if it is similar to what I had in the past.

The point is, you order fortran to do things, consequently fortran orders your OS to do things, and the OS tries to be as efficient as possible in doing them. Most of the time the OS is quite good at it, but under certain conditions it reaches its limits.

Without deeper understanding I would try to handle the way your variable line is read and saved. I know your code is proper fortran and should work but these things might be worth a try. These are options you might try:

- Change input-format from (a) to (a1000) if line is character*1000
- define line as character(len = 1000) :: line (1) and always read line(1)
- make line allocatable and allocate it to line (1)
- make line a target and associate it with a pointer and specify the pointer in the read statement
- specify action = 'read' when you open your inputfile
- set the size of the iobuffer while opening your inputfile, in my compiler this would be blocksize = 1024 in the open-statement, but this is not standard fortran. So on your compiler this may be a different keyword.

All of these options may just give the same fortran-performance but might translate differently to the OS.

If all of this does not work, maybe you may have a look at your compiler's and Linker's options for memory management and optimizations.

Norbert


 
If sed does the task right, then it's probably not an OS limitation. Maybe your fortran compiler has a bug. Have you tried it with another compiler?

Another problem could be your program's processing logic:
The success or failure of your program fully depends on your input file. What if your input file is corrupted after 2 GB?

Your program reads first a number of lines with
Code:
READ(11,*)amountoffiles
- say, the number is N, so the program iterates over next N lines and then tries to read other number of lines - say M - with
Code:
READ(11,*)amountofwords

What if N or M is wrong, so that it doesn't correspond to the real number of lines?
How could you validate whether your input file is correct or not, when there isn't an editor that opens such huge text files?

In contrast to this, sed doesn't have program logic that depends on the input file structure: it only replaces a string or pattern with another string.

As suggested above, for a more complicated task it is probably better to use another very nice tool: awk.




 
@gummibärchen

You're right, I focused too much on the output file, you mentioned the input file as the cause indeed. So let's focus on the READ from the input file.

About your tips:
Your first 3 tips and the 5th tip didn't make any difference unfortunately. About the 4th one, the pointer one, I have no idea how to do that, I never programmed something with pointers, so I will need time to learn that :-( On the other hand, I doubt whether that will help. Your 6th tip may be a solution, I'll see if I can find out how to do that, I use gfortran (didn't I mention that?).

@mikrom
Code:
READ(11,*)amountoffiles
and
Code:
READ(11,*)amountofwords
are not being read in the huge file, the above is being read in a small text file, where I define the input and output filenames of the huge files and after that the words I want to change in the huge files. This small definition input file (in sed these filenames and words to change are being piped) is posted as well and is called: "tobechanged.txt" which is not the huge file to be changed, but only defines what the huge files are and what should be changed in them.

The huge files themselves are being read without knowing how long they are, using END= and ERR= in the READ (at line 52)
 
Pointers would work like this:

Code:
...
character (len = 1000), target :: line
character (len = 1000), pointer :: ptoline

ptoline => line

... and then reference ptoline instead of line. This of course looks funny here, for pointers were not really made for this sort of thing - but it might translate differently to the OS.
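
A complete version of the pointer idea could look like the sketch below. It is only an illustration under assumptions: the subroutine name ReadFirstLine and the unit number 41 are invented here, and it reads just one line to show where the pointer goes in the READ statement.

```fortran
SUBROUTINE ReadFirstLine(FileName,FirstLine)
! Hypothetical helper: reads the first line of FileName through a pointer
IMPLICIT NONE
CHARACTER(LEN=*), INTENT(IN) :: FileName
CHARACTER(LEN=1000), INTENT(OUT) :: FirstLine
CHARACTER(LEN=1000), TARGET :: line
CHARACTER(LEN=1000), POINTER :: ptoline
INTEGER :: ios
line=''
ptoline => line                          ! ptoline is now an alias for line
OPEN(UNIT=41,FILE=FileName,STATUS='OLD',ACTION='READ')
READ(41,FMT='(A)',IOSTAT=ios)ptoline     ! the data ends up in line
CLOSE(41,STATUS='KEEP')
FirstLine=line
END SUBROUTINE ReadFirstLine
```

In the original program the same change would amount to declaring Line with the TARGET attribute and putting the pointer in the READ at the place where Line is read now.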

If none of these provides a solution to your problem, then I am out of my depth. Maybe you can try the various compiler and / or linker options that modify optimization, and there may be some options that deal with memory management. I am not using gfortran, so I do not know how these things look in your environment, but there might be some of these too.

Just a last thought: My compiler here has two different modes of compilation: one is called 'debug' mode, which is used while you develop your code. Under 'nodebug' the executable is considerably smaller, and maybe there are other modifications to the executable too. Maybe your compiler offers the same options, which might be worth a try.

Good Luck.

Norbert


 
Just some quick feedback in the little time I have.
I tried another compiler, unfortunately with the same result (I also tried to change the windows pagefile settings, but that's superfluous to mention here, that didn't work).

I will let you know about the rest
 
I have to try the other compiler again (which is the old 90's version of Microsoft's Developer Studio). I didn't notice that it was giving other names to my exe. It's running now and it seems to be running more than 10000 times slower than the exe that was generated by gfortran (not joking), so this will take overnight at least.

Meanwhile I found this about gfortran:
2.8 Influencing runtime behavior

These options affect the runtime behavior of programs compiled with GNU Fortran.

-fmax-subrecord-length

Specify the maximum length for a subrecord. The maximum permitted value for length is 2147483639, which is also the default. Only really useful for use by the gfortran testsuite.
Say again? 2147483639? That's 2Gb. Does that have anything to do with my problem?
 
Might very well be.

.... by the gfortran testsuite

Sounds somewhat similar to the 'debug' compiler option that I tried to explain in my last post.
Search for any information on how to run your program outside of the testsuite.

Norbert


 
Hi,

IT'S SOLVED


I didn't try the extremely slow exe generated by the other (Microsoft's) compiler yet, BUT I tried a newer version of gfortran, a very recent one from March 2013, and now... it works!!

So it was the compiler indeed!

I took the original code, as posted here and the original compiling and building options and the new gfortran version did the trick.
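
Since the compiler version turned out to matter, it may be worth recording it. On the command line, gfortran --version prints it. From inside a program, the Fortran 2008 intrinsic module function COMPILER_VERSION can be used, as in the sketch below; the subroutine name GetCompilerInfo is made up, and an old compiler that lacks Fortran 2008 support will reject this code.

```fortran
SUBROUTINE GetCompilerInfo(Info)
! Hypothetical helper: returns the name and version of the compiler that built this code
USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY: COMPILER_VERSION
IMPLICIT NONE
CHARACTER(LEN=*), INTENT(OUT) :: Info
Info=COMPILER_VERSION()   ! truncated or blank-padded to the length of Info
END SUBROUTINE GetCompilerInfo
```

Writing TRIM(Info) to the screen at program start makes it easy to tell afterwards which build produced a given output file.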
 