
Maximum # of characters per character string

Status
Not open for further replies.

JackTheC

Programmer
Feb 25, 2002
325
NL
In the help file of VFP6 and VFP9 it states:
Maximum # of characters per character string or memory variable is 16,777,184.
That's 16 MB right?

I have a file xxx.yyy sized 75 MB
When I perform this command: content=filetostr('xxx.yyy')
then len(content) shows 78,643,200
and strtofile(content,'xx2.yyy') makes an exact copy of the original (checked with FC)

How is that possible with a maximum of 16,777,184 ?



 
Yes, according to Rick Strahl:

... the real sticking point with large strings. You can create them, but once they get bigger than 16megs you can no longer assign them to a new variable.

He shows how you can create strings larger than 16MB, and he specifically mentions FILETOSTR() for that. But he also says that, for example, REPLICATE() will produce an error if you try to use it to create a large string (and my own testing confirms that).
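A minimal sketch of that asymmetry (the file name is a placeholder; the failing call is wrapped in TRY/CATCH so you can see the error message):

```foxpro
* FILETOSTR() happily returns a string far beyond the documented limit
lcBig = FILETOSTR("bigfile.dat")    && assume a file of, say, 75 MB
? LEN(m.lcBig)

* REPLICATE() refuses to build a string that large in one call
TRY
	lcFail = REPLICATE("X", 20000000)   && > 16,777,184 characters
CATCH TO loErr
	? loErr.Message
ENDTRY
```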

See the article that MPlaza referenced for a list of some more things you can and cannot do with large strings.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
We got to this topic via foxypreviewer lately. Indeed, working on strings also gets slower the larger they are, so it's better to do string processing portion by portion, e.g. line by line or in blocks of several KB.
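A sketch of such block-wise processing with the low-level file functions, assuming an arbitrary file name and a 64 KB block size:

```foxpro
* Process a large file in 64 KB blocks instead of one huge string
lnHandle = FOPEN("bigfile.dat")     && placeholder file name
IF m.lnHandle > 0
	DO WHILE NOT FEOF(m.lnHandle)
		lcBlock = FREAD(m.lnHandle, 65536)
		* ... process lcBlock here ...
	ENDDO
	= FCLOSE(m.lnHandle)
ENDIF
```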

Bye, Olaf.

 
I was looking for a method to compare two files in binary, as accurately and as fast as possible.
I found SYS(2007) extremely fast and accurate enough.

content=filetostr('xxx.yyy')
checksum=sys(2007,content)

I believe the maximum string length in VFP6 is 256 MB and in VFP9 at least 900 MB.
Most files are smaller.

Maybe someone knows an even better method?
 
The simplest comparison would really be reading in both files and not computing anything, simply comparing content1 == content2.

If you want to compare client-side and server-side files and avoid network load, you could indeed compare checksums; there are better candidates than CRC32, though, e.g. SHA1. To actually avoid the network load you'd need two installations of your tool, one client side and one server side. Otherwise, no matter whether your exe runs on the server or the client, you read one of the files through the network. And if you compare files to decide whether to update one folder from the other, you'd load all files through the network anyway, in one direction or the other. That way you haven't optimized the network load at all: if you always copy all files in one direction regardless, you could simply copy every file, changed or not, and the whole comparison would be useless.

If the main goal really is mirroring a folder, like a local installation of your software against an update folder located on some server share, take a look at robocopy's /MIR option. It decides about the need to copy a file from file metadata alone: length, creation date, and last-change date. That's sufficient to know that files differ in the usual case, and it only loads a small fraction of metadata about each file. It's also sufficient because just try to make a content change to a file without changing its last-modified date: you'd need to hack the filesystem, e.g. with foxtouch of foxtools.fll, and even that would be detectable, as foxtouch only has 2-second precision, and in at least 50% of cases you get a 1-second difference. So the main case of detecting files needing a refresh is covered very well without reading content at all.
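For reference, a mirroring call could look like this (both paths are placeholders; /MIR is documented as equivalent to /E plus /PURGE):

```
robocopy \\server\share\AppUpdate C:\MyApp /MIR
```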

By the way: robocopy /MIR also deletes folders and files from the target folder if they don't exist in the source folder, so it can really tidy up. That also means that if you want/need some files kept individually, like foxuser.dbf, you have to put them outside of the mirrored folder. But now that's going quite off topic from your original question.

Furthermore:
I don't know the reason you want to compare files; there are of course several use cases. E.g. you might verify an FTP upload by downloading it and checking that the download is 100% the original file; then content1 == content2 is your best option. A customer downloading your files has no reference copy, though, which is why MD5 or SHA1 checksums are published, to enable verifying at least against typical transfer errors. Anyway, don't confuse this with security: if some hacker replaces a download on a site, it's quite easy to also change the checksum accordingly. In that case file signatures, which can be verified against your public key, are a better way. Though a key could also be replaced by a hacker, it's a permanent key: it can be stored more securely and verified regularly, and you might also offer it in several places/profiles, hardening it, as it's hard to hack all sites at once.

Bye, Olaf.
 
Cool topic. I had no idea...

Best Regards,
Scott
ATS, CDCE, CTIA, CTDC

"Everything should be made as simple as possible, and no simpler."[hammer]
 
Thanks Olaf,

In fact I want to delete duplicate files, e.g. thousands of pictures on the hard disk. I know there are many programs available, but I want to write my own so I can tune it the way I want. First I collect all the files in a table, with a recursive ADIR(*.ext). Then I compare either name+size, name+size+modification datetime, or size+modification datetime, and find the potential doubles. But before deleting, it is an option to compare the binary content to be on the safe side. Even a manual selection can be done for files that are 99.99999999% doubles.
So I want a hash or a checksum as a field in the table. I can't store all the content; a hash is enough. Or SYS(2007), that's fine too. FCIV.exe to compute hashes is too slow for a big table; SYS(2007) is very fast. I ignore the error for files > 256 MB; the chance that they are not real doubles is extremely small, if two such big files even exist with the same name, size, and datetime.
Edit: Movie files can be that big. In that case I'm going to use fciv.exe for hashes. There are probably only a few of these files, so execution time is then less important.
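A possible way to call fciv.exe from VFP for those few oversized files; the file name and the redirect to a temp file are assumptions, and fciv's output also contains banner lines before the hash:

```foxpro
* Hash one oversized file with fciv.exe, reading the result back from a temp file
lcFile = "bigmovie.avi"    && placeholder name
RUN fciv.exe -sha1 "&lcFile" > hash.txt
lcResult = FILETOSTR("hash.txt")
? m.lcResult    && banner lines, then the hex hash followed by the file name
```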

During this programming I noticed that there was no error for files > 16 MB. FILETOSTR(bigfile) worked perfectly, and SYS(2007) too.
 
JackTheC said:
I was looking for a method to compare 2 files binairy as good and as fast as possible.

If you have enough memory, then load both files and compare them with "==", as Olaf Doschke already said. It's almost instantaneous.
I tried with a 125MB file (LibreOfficePortable_4.4.4_MultilingualStandard.paf.exe)
Something like this :
Code:
xx1 = filetostr("LibreOfficePortable_4.4.4_MultilingualStandard.paf.exe")
xx2 = m.xx1
?m.xx1 == m.xx2
xx2 = chrtran(m.xx1,"W","V")
?m.xx1 == m.xx2

Here is another demo, using strings, each one with total length of 160M
Code:
LOCAL lcS1,lcS2,lni,lcS3
CLEAR all
CLEAR

lcS1 = ''
FOR lni = 1 TO 10
	lcS1 = m.lcS1 + REPLICATE(CHR(64 + m.lni) , 16777184)
NEXT
lcS2 = m.lcS1
?LEN(m.lcS1) , LEN(m.lcS2) , m.lcS1 == m.lcS2
lcS2 = CHRTRAN(m.lcS2,"A","U")
?LEN(m.lcS1) , LEN(m.lcS2) , m.lcS1 == m.lcS2

This way I found that CHRTRAN() is extremely fast, if the length of the string isn't changed.
STRTRAN() is relatively fast if the length of the string isn't changed, but its speed depends on how many replacements are made.
And using CHRTRAN() / STRTRAN() I can manipulate strings larger than 16M and assign them to a new variable.

These considerations contradict my previous assumption:
The 16M limit can be broken in two ways
1) The assignment command where x is on both sides of "=", x is the leftmost term of the right side, and the rest of the string expression is within the 16,777,184 limit. I mean:
x = x + expression && where len(expression) <= 16777184 (if x already exceeds 16M)
2) FileToStr() also can produce memory variables longer than 16M


It seems that CHRTRAN() and STRTRAN() can be used to generate strings larger than 16M and assign them to a new variable.
Code:
LOCAL lcS1,lcS2,lni,lcS3
CLEAR all
CLEAR

lcS1 = ''
FOR lni = 1 TO 10
	lcS1 = m.lcS1 + REPLICATE(CHR(64 + m.lni) , 16777184)
NEXT
* CHRTRAN() manipulates a string longer than 16M, and the result is stored into a new variable
lcS2 = CHRTRAN(m.lcS1,"A","U") && lcS1 has 160M
?LEN(m.lcS1) , LEN(m.lcS2) , m.lcS1 == m.lcS2

lcS1 = "AB"
* STRTRAN() generates a string longer than 16M, and the result is stored into a new variable
lcS2 = STRTRAN(m.lcS1, "B", REPLICATE("U",16777184))  && lcS2 has 160M + 1
?LEN(m.lcS1) , LEN(m.lcS2)
* STRTRAN() manipulates a string longer than 16M, and the result is stored into a new variable
lcS3 = STRTRAN(m.lcS2, "A", REPLICATE("V",16777184))  && lcS2 has 160M + 1, lcS3 has 320M
?LEN(m.lcS1) , LEN(m.lcS2), LEN(m.lcS3)


Respectfully,
Vilhelm-Ion Praisach
Resita, Romania
 
I have fun with this, please excuse me.
STUFF() can't create / manipulate strings larger than 16M, but LEFT(), RIGHT() or SUBSTR() have no problems
Code:
LOCAL lcS1,lcS2,lni,lcS3,lcS4,lcS5,lcS6,lcS7,loErr as Exception
CLEAR all
CLEAR

lcS1 = "AB"
lcS2 = STRTRAN(m.lcS1, "B", REPLICATE("U",16777184))
?LEN(m.lcS1) , LEN(m.lcS2)
lcS3 = STRTRAN(m.lcS2, "A", REPLICATE("V",16777184))
?LEN(m.lcS1) , LEN(m.lcS2), LEN(m.lcS3)

* STUFF() can't create
TRY 
	lcS4 = STUFF(m.lcS1 , 2 , 1 , REPLICATE("U",16777184))
	?LEN(m.lcS4)
CATCH TO m.loErr
	?m.loErr.Message
ENDTRY

* STUFF() can't manipulate
TRY 
	lcS4 = STUFF(m.lcS2 , 2 , 1 , "M")
	?LEN(m.lcS4)
CATCH TO m.loErr
	?m.loErr.Message
ENDTRY


lcS5 = LEFT(m.lcS3,16777185)
?LEN(m.lcS5)

lcS6 = RIGHT(m.lcS3,16777185)
?LEN(m.lcS6)

lcS7 = SUBSTR(m.lcS3,2,16777185)
?LEN(m.lcS7)


Respectfully,
Vilhelm-Ion Praisach
Resita, Romania
 
The only reasons to use checksums or hashes would be finding triples or more duplicate files, or repeating the search, so the computation you store can be reused.
But checksum or hash calculation reads the whole file anyway, so why not make the full and simple comparison? Even with potential triples you can compare each second, third, and so forth with the first file read in.

Especially with large files, the most common error is that one of them is incomplete, so the length comparison is the most important first step. You don't need to read in two files having different byte lengths, but you might make the comparison up to the length of the smaller file and delete it if the longer file begins with it, which means the longer one is the more complete version.
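A sketch of that order of checks (the file names are placeholders; ADIR() column 2 holds the size in bytes):

```foxpro
* Compare the cheap size metadata first; read content only when sizes match
lcFile1 = "a.jpg"
lcFile2 = "b.jpg"
= ADIR(laF1, m.lcFile1)
= ADIR(laF2, m.lcFile2)
IF laF1[1,2] # laF2[1,2]
	? "Different sizes - the files cannot be identical"
ELSE
	? IIF(FILETOSTR(m.lcFile1) == FILETOSTR(m.lcFile2), "Identical", "Differ")
ENDIF
```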

Bye, Olaf.
 
A possible way to speed this up would be to do it in two phases:

First, look for duplicates based exclusively on file size and creation date. Ignore all the others. Then do your checksum on just those that remain.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Mike, that is exactly what I'm doing, of course.
Search all files of a certain extension regardless of drive and folder (options: minimum size / in-/exclude certain, symlinked, or hidden folders) -> filter doubles/triples/quadruples (taking account of e.g. name/size/date/time) and sort on date/time/size/name -> group them by probable double -> per group a binary check (optional) -> manually / semi-automatically select for deletion -> delete to the recycle bin. I don't want to lose files from somebody else's drives, so I try to be careful.

ADIR() is extremely fast. I also tested *.*, reading all the files, but some folders have more files than an array can hold. With *.extension there is no problem. (And I exclude Windows folders; I don't want to mess with those.)

Mind this: if you download a file twice (or three times and so on), the copies have different names and different dates/times. Nevertheless they are doubles: size and binary content are the same. Size alone is not enough to decide.

 
If you use VFP9, there is no limit of 65,000 array elements anymore. What's faster in my experience is SYS(2000,"*.*") followed by SYS(2000,"*.*",1) in a loop, if you only need file names. ADIR() has the advantage of adding a lot of other file metadata.
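A minimal loop over SYS(2000), as described (the file skeleton is a placeholder):

```foxpro
* Enumerate matching file names without filling an array
lcFile = SYS(2000, "*.jpg")          && first match
DO WHILE NOT EMPTY(m.lcFile)
	? m.lcFile                       && name only; no size or date info
	lcFile = SYS(2000, "*.jpg", 1)   && next match
ENDDO
```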

Bye, Olaf.
 