Import or Append: what else could work for a big comma-delimited file to be transferred into a *.dbf?


german12

When I have a text file like this:

aa105, bbb, ccc, ddd......
aa106a, bb, cc176, ddfa34....

plus a very large number of further lines,

it is of course possible to transfer it into a *.dbf file by using
APPEND FROM FileName | ? [FIELDS FieldList] [FOR lExpression] [[TYPE] [DELIMITED [WITH Delimiter | WITH BLANK | WITH TAB | WITH CHARACTER Delimiter] | DIF | FW2 | MOD | PDOX | RPD | SDF | SYLK | WK1 | WK3 | WKS | WR1 | WRK | CSV | XLS | XL5 [SHEET cSheetName] | XL8 [SHEET cSheetName]]] [AS nCodePage]

but that only works when the amount of data does not exceed the 2 GB limit for a *.dbf in VFP.

Is there any possibility to pick only the 2nd "column" in the sample above, when only that is needed?
(It is the "column" with bbb, bb above.)
So instead of dividing a text file "horizontally" into a lot of parts, could it be done "vertically", by filtering out only every 2nd string (or 3rd, 4th)? That would make it possible to split the text into several *.dbf's, so only a few *.dbf's would be needed.

Again - to take that sample above:
Can I create the following dbfs somehow?:

DBF#1: aa105
       aa106a

DBF#2: bbb
       bb

DBF#3: ccc
       cc176

DBF#4: ddd
       ddfa34

Could low-level file functions perhaps help here?
Unfortunately I have no experience with those commands.

Thanks in advance

Klaus


Peace worldwide - it starts here...
 
Hi Klaus,

I have done a similar task many many times. There are many ways of doing this. In general, I usually do it this way (you might have to look up the syntax of these functions):

1. Create each dbf.

2. Convert the CSV file to an array using the FILETOSTR() and ALINES() functions. Each line of the file will be one element of the array.

3. Use a FOR...NEXT loop to process each line.

4. Use the GETWORDCOUNT() and GETWORDNUM() functions to parse each line into individual memory variables and INSERT each mvar into its corresponding table. You might even do this in a single command.

Hopefully, this can get you started, at least until you have specific questions.
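
For illustration, here is a minimal sketch of those four steps, assuming the four-column sample from the question, a hypothetical file name bigfile.csv and a file small enough for FILETOSTR():

Code:
* Sketch only: four single-column target tables, as in Klaus' example
CREATE TABLE dbf1 (col1 C(20))
CREATE TABLE dbf2 (col1 C(20))
CREATE TABLE dbf3 (col1 C(20))
CREATE TABLE dbf4 (col1 C(20))

LOCAL laLines[1], lnLines, lnI, lnCol, lcLine, lcTarget
lnLines = ALINES(laLines, FILETOSTR("bigfile.csv"))   && one array element per line

FOR lnI = 1 TO lnLines
	lcLine = laLines[lnI]
	FOR lnCol = 1 TO MIN(GETWORDCOUNT(lcLine, ","), 4)
		lcTarget = "dbf" + TRANSFORM(lnCol)
		INSERT INTO (lcTarget) (col1) VALUES (ALLTRIM(GETWORDNUM(lcLine, lnCol, ",")))
	ENDFOR
ENDFOR

Note that FILETOSTR() pulls the whole file into memory at once, which is fine for moderate sizes but runs into the 2 GB issue discussed below.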

Steve


 
Klaus,

Considering that if the source data itself is bigger than 2 GB there is no way VFP can properly process the file natively, you may contemplate using the CSVProcessor class for this.

Code:
* 
* previously: put the CSVProcessor class in scope, something like
*  DO LOCFILE("csv.prg")
*

* prepare the CSV sample data

LOCAL SampleData AS String

TEXT TO m.SampleData NOSHOW
aa105, bbb, ccc, ddd
aa106a, bb, cc176, ddfa34
ENDTEXT

* prepare the CSVProcessor object

LOCAL CSVP AS CSVProcessor

m.CSVP = CREATEOBJECT("CSVProcessor")

* the CSV source data has no header row (so it seems)
m.CSVP.HeaderRow = .F.
* and we want to trim the source data
m.CSVP.InTrimmer = 3

* we need just to import the 2nd and 4th columns, the Bs and the Ds
CREATE CURSOR TargetCursor (Bs Varchar(20), Ds Varchar(20))

m.CSVP.FieldMapping.Add("csv column not imported")
m.CSVP.FieldMapping.Add("Bs")
m.CSVP.FieldMapping.Add("csv column not imported")
m.CSVP.FieldMapping.Add("Ds")

* into this cursor

m.CSVP.Workarea = "TargetCursor"

* note: if the CSV data has a header row, the filtering can be made
*   by mapping columns of the target cursor to columns in the CSV file.

* load from the source string (but it could be from a file)

IF m.CSVP.ImportString(m.SampleData) == 0
	SELECT TargetCursor
	GO TOP
	BROWSE
ELSE
	ERROR "Failed to import data"
ENDIF

The result:

[Screenshot: BROWSE of TargetCursor showing the imported Bs and Ds columns]
 
Unfortunately, Steve's method won't work if the text file exceeds 2 GB. The 2 GB limit applies to text files just as it does to DBFs. If your file is bigger than that, you won't be able to use APPEND FROM, or FILETOSTR(), or any of the low-level file functions to read the file.

One solution would be to find a way of splitting the file into multiple smaller text files before importing it. I can't off-hand think of a way of doing that outside of VFP.

Another approach would be to use an external process, via Automation or a COM object, to split the file. One possibility would be to use the File System Object. Something like this:

Code:
loFS = CREATEOBJECT("Scripting.FileSystemObject")
loFile = loFS.OpenTextFile("c:\MyFile.txt", 1)
DO WHILE NOT loFile.AtEndOfStream
  lcLine = loFile.ReadLine()
  * code here to add the lines to an array or to
  * write them out to multiple smaller files
ENDDO
loFile.Close()

That should work regardless of the size of the text file. But be warned: it will be horribly slow.
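
As a rough sketch of the splitting idea (the file names and the 1,000,000-lines-per-chunk figure are only assumptions for illustration), the loop body could write the lines out to numbered chunk files with VFP's low-level file functions:

Code:
#DEFINE LINES_PER_CHUNK 1000000

LOCAL loFS, loFile, lnChunk, lnLines, lnOut, lcLine
loFS = CREATEOBJECT("Scripting.FileSystemObject")
loFile = loFS.OpenTextFile("c:\MyFile.txt", 1)

lnChunk = 1
lnLines = 0
lnOut = FCREATE("c:\MyFile_part" + TRANSFORM(lnChunk) + ".txt")

DO WHILE NOT loFile.AtEndOfStream
  lcLine = loFile.ReadLine()
  FPUTS(lnOut, lcLine)
  lnLines = lnLines + 1
  IF lnLines >= LINES_PER_CHUNK
    * start the next chunk file; no line is ever cut in half
    FCLOSE(lnOut)
    lnChunk = lnChunk + 1
    lnLines = 0
    lnOut = FCREATE("c:\MyFile_part" + TRANSFORM(lnChunk) + ".txt")
  ENDIF
ENDDO

FCLOSE(lnOut)
loFile.Close()

Each chunk can then be processed with a normal APPEND FROM, as long as a chunk stays safely below 2 GB.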

Mike



__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
We had a whole thread about UK address data, which also exceeded 2GB: thread184-1810484

I see you contributed there, too, and saw how Steve split the data.
And you also know my follow-up thread thread184-1810935

So you are still looking for a way to handle files larger than 2 GB. It's not only a question of how much of that data is imported into the target DBF; VFP simply chokes on reading more than 2 GB, which it still has to do even when you use the FIELDS clause to import into only a few fields.

A further problem: if you specify a field list, the append still feeds columns 1, 2, 3, ... of the CSV file into the specified fields in order; there is no way to skip columns in the CSV. VFP does not map CSV columns to DBF columns by name, so limiting the field list only determines which DBF fields get filled, while the CSV columns are still consumed from left to right.

Chriss
 
I'd look into other tools more specialized in CSV file processing. MSSQL's import is one way to have no problem with table and CSV file sizes; you can then get what you need from there. Steve mentioned using a split tool, and I used PowerShell. I think the best programming language for working with CSV files is Perl (it's right there in the expansion of the name: Practical Extraction and Report Language). When Perl was new, CSV was (and still is) a very common data format, so extracting data from it for simple reporting (lists) is what the language does best. Python may also work, with its csv module. Both Perl and Python will let you fetch only column 2, or any combination of columns, from a CSV simply by reading line by line, picking items from an array, and streaming them into a new file or to stdout.



Chriss
 
Mike,

I tried your code just reading the pp-complete.csv data (about 4 GB); it works and took almost 10 minutes. The PowerShell splitting I used took over 30 minutes. While all of that depends on drive speed, and splitting does not only read but also writes out files, I don't think you can call this slow.

I don't see how you could speed up the reading just because you only want column 2. Whatever reads through the CSV data has to parse for commas and line feeds, and therefore reads more than just the wanted column. The only way to skip bytes and jump straight to the start of column 2 of the next record would be to know how many bytes to skip without parsing for line feeds and commas. How would that be possible? Only with fixed-size records in the CSV file.

The only part you can speed up by wanting only column 2 in the DBF is, of course, the writing to the DBF. Since you only write a small percentage of the data, the whole process will likely take somewhere between just reading the data and splitting it all up.

I still think splitting files is the way to go if you need to process the same data multiple times.

Chriss
 
Hi,
I was going to suggest Notepad++ but found it also has a 2 GB limit. I googled "Notepad++ big file" and found some suggestions for breaking up bigger files. HTH
Steve
 
german12 said:
Is there any possibility to pick only the 2nd. "column" in the sample above, when only that is needed?
I'm sorry if I misunderstood this, but if you are only interested in the second column, wouldn't it be solved by creating a DBF with only two columns and then using APPEND FROM?
 
Dan, that idea still fails because you can't read files larger than 2 GB. Otherwise it's a workable idea for column 2 with only a little excess data, but if you needed the last column it would mean much more waste. You could keep that waste small by defining C(1) columns up to the one you want, or, more generally, for all the columns you don't want.

Still, this will only read files smaller than 2 GB.
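
For files below the 2 GB limit, a minimal sketch of that combination (the file name and field widths are assumptions; throw-away C(1) fields absorb the unwanted columns) could look like this:

Code:
* dummy C(1) fields absorb (and truncate) the columns we don't want
CREATE TABLE col2only (dummy1 C(1), wanted C(20), dummy3 C(1), dummy4 C(1))
APPEND FROM bigfile.csv TYPE DELIMITED

* keep just the wanted column
SELECT wanted FROM col2only INTO CURSOR justcol2
BROWSE

The waste is only one byte per unwanted column and row, but VFP still has to read the whole file, which is exactly the 2 GB problem again.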

Chriss
 
It might be worth looking for a stand-alone utility that can split large files into smaller chunks, and which can be driven by command-line parameters, and can therefore be executed from within VFP using RUN or ShellExecute(). The one that comes immediately to mind is Split, which is part of the GNU Core Utilities (and which is available for Windows), but there might be others.

But an issue to watch out for: Such utilities tend to split the file at some specified number of megabytes, or at a percentage of the whole file. What you need is a way to split after a given number of lines, otherwise you risk having a break in the middle of a line, which would screw up the records on either side of the split.

It would be worth searching for a suitable utility.
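
If GNU Coreutils' split is installed and on the PATH (a hypothetical setup; adjust the paths to your own), its -l switch splits after a given number of lines rather than at a byte count, which avoids exactly that mid-line break, and it can be driven from VFP like this:

Code:
* split c:\MyFile.csv into chunks of 1,000,000 lines each, written as
* c:\chunks\part_aa, part_ab, ... (split's default suffix naming);
* RUN waits until the external command has finished
RUN split -l 1000000 "c:\MyFile.csv" "c:\chunks\part_"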

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
One other possibility ....

Can you open the file in Excel? (Recent versions of Excel can handle 1,048,576 rows. I don't know what the limit is on overall file size.)

If so, you could use Excel Automation to open the file, delete the unwanted columns, and save the result back to a file, which you can then append in the usual way.

I don't know for sure if this will work, but it will cost very little effort to try.
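
A rough sketch of that Excel Automation route (the file names are placeholders, and it assumes the CSV fits within Excel's row limit):

Code:
LOCAL loExcel
loExcel = CREATEOBJECT("Excel.Application")
loExcel.DisplayAlerts = .F.
loExcel.Workbooks.Open("c:\MyFile.csv")

* delete the unwanted columns, keeping only the 2nd CSV column;
* delete from right to left so the column letters stay valid
loExcel.ActiveSheet.Columns("D").Delete()
loExcel.ActiveSheet.Columns("C").Delete()
loExcel.ActiveSheet.Columns("A").Delete()

loExcel.ActiveWorkbook.SaveAs("c:\Col2Only.csv", 6)   && 6 = xlCSV
loExcel.ActiveWorkbook.Close(.F.)
loExcel.Quit()

* then, back in VFP:
* CREATE TABLE col2 (col1 C(20))
* APPEND FROM c:\Col2Only.csv TYPE DELIMITED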

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Thank you to all of you for your many interesting considerations on a topic that keeps occupying me.
As Chriss correctly noted, my question is based on the threads
thread184-1810484: Use a separate file, file type or other method to access nearly 29 million addresses
and
thread184-1810935: Extracting addresses from UK.gov price paid data

The problem is that the amount of data that has to be processed today has become more and more enormous over time - and it's not just the 27 million postal addresses.

The founders of VFP probably could not have imagined that at the time, otherwise they would certainly have built in a method to read files in formats other than their own line by line.

In the above threads, the CSV file first had to be split into 262 segments, and it was clear to me that this mainly had to do with splitting the huge CSV file horizontally.

Now I wanted to know whether the CSV file could also be read vertically - and in order not to let the previous threads get too long, I asked in a separate thread with a small model as an example.
That would have the advantage of a lot less work:
if each VFP file were fed only with "columns" from the CSV file, then you would only have as many *.dbf's as there are fields required.
I thought that you would then work with far fewer files in VFP afterwards, each of which would not reach the 2 GB limit as quickly,
and that could make programming and searching much easier.

It would have been too good.....

Klaus

Peace worldwide - it starts here...
 
The question of how this could be done faster wouldn't let go of me, because 4 GB in 10 minutes is only about 7 MB per second, and that's really slow. I can simply copy the file from one directory to another (copy, not just move) in about a minute.

I don't have a solution yet, but you should in fact be able to read the file much faster.

The Windows API should have functions for low level file handling.

Chriss
 
Klaus said:
Now I wanted to know whether the csv file could also be read vertically

You got the answer: No. In a CSV file columns are separated by commas, of course, but as the values have varying width, you can't read one column without also parsing the rest.
Consider just the case of the second column: once it is found, the parser could skip ahead to the next line feed. But what does "skip to the line feed" mean? It means reading the following bytes from the file until 13,10 is found. So you can't really skip anything. A CSV file can only be processed in full.

As you saw, the file reading itself should be able to go faster. One problem besides the 2 GB limit is that VFP isn't asynchronous. You can't let the OS read ahead in the file while your VFP process is already working on blocks it has read; VFP's own FREAD() waits until the bytes are read in. APPEND has it better, since file reading and processing are combined in one C routine that could make use of parallel processing. But the 2 GB limit keeps this from working: it's not just a limit of DBFs and related files, it's a limit for all files.

If you simply FOPEN() a file larger than 2 GB you get "Invalid seek offset". One of the first things FOPEN() does is determine the file size by seeking to the end, and that is exactly where VFP's file-handling limit bites.

So ideally you need something else that can read the file and hand chunks of it to VFP. In the simplest case that would be an FREAD equivalent in the Windows API that reads, say, 8 KB per call. Even if that can't be asynchronous either (you can't work on the first 8 KB while the OS reads the next 8 KB), it should still be faster. But it will also have to run through every byte of the file: CSV has no structure that lets you seek straight to "column 2" the way you can in a DBF. To find column 2 the code has to orient itself by line feeds and commas, and that means reading all the bytes. When an editor presents the file to you, you can easily "see" column 2, but only because the editor has already read the whole file and turned the control codes 13,10 into visible line breaks.
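
To make that concrete, here is a rough sketch of such an FREAD equivalent via the Windows API (a sketch only: the file name is a placeholder, it assumes an ANSI path, and error handling is omitted). CreateFile/ReadFile have no 2 GB limit, so the loop keeps reading 8 KB blocks past the point where FOPEN()/FREAD() would give up:

Code:
* Rough sketch: read a >2GB file in 8 KB blocks via the Windows API
DECLARE INTEGER CreateFile IN kernel32 AS ApiCreateFile ;
	STRING lpFileName, INTEGER dwDesiredAccess, INTEGER dwShareMode, ;
	INTEGER lpSecurityAttributes, INTEGER dwCreationDisposition, ;
	INTEGER dwFlagsAndAttributes, INTEGER hTemplateFile
DECLARE INTEGER ReadFile IN kernel32 ;
	INTEGER hFile, STRING @lpBuffer, INTEGER nBytesToRead, ;
	INTEGER @lpBytesRead, INTEGER lpOverlapped
DECLARE INTEGER CloseHandle IN kernel32 INTEGER hObject

#DEFINE GENERIC_READ    0x80000000
#DEFINE FILE_SHARE_READ 1
#DEFINE OPEN_EXISTING   3
#DEFINE CHUNKSIZE       8192

LOCAL lnHandle, lcBuffer, lnRead
lnHandle = ApiCreateFile("c:\MyFile.csv", GENERIC_READ, FILE_SHARE_READ, 0, OPEN_EXISTING, 0, 0)
IF lnHandle <> -1
	DO WHILE .T.
		lcBuffer = REPLICATE(CHR(0), CHUNKSIZE)
		lnRead = 0
		IF ReadFile(lnHandle, @lcBuffer, CHUNKSIZE, @lnRead, 0) = 0 OR lnRead = 0
			EXIT
		ENDIF
		* LEFT(lcBuffer, lnRead) is the next block: collect it, find the
		* line feeds and commas, and pick out the wanted column here
	ENDDO
	CloseHandle(lnHandle)
ENDIF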

Chriss
 

Mike, I think I understood your suggestion to try an external process.
I don't mind if it's slow - if the csv doesn't have to be constantly updated, then you could certainly work with it.

Your code is good, because it immediately "feeds" a number of dbf files with the comma-separated columns from the CSV, and that's already what I wanted.
Of course there has to be a "brake" to prevent overflow in VFP, but first you have to see how far you get with your suggestion.
Perhaps in most cases, even with large CSV files, that won't even be necessary, because the individual VFP files fill up only slowly.

Here is my primitive code

*!* This program looks into a multitext.txt - file
*!* which has this contents:

*!* aaa,bbbb,ccccc
*!* dd, eee
*!* fff,ggg,hhh,iiiiiii
*!* jjj, kk, l, mmmm, nnnn, oooo



Code:
CLEAR
CLOSE DATA
loFS = CREATEOBJECT("Scripting.FileSystemObject")
loFile= loFS.OpenTextFile("multitext.txt", 1)
DO WHILE NOT loFile.AtEndOfStream

	lcLine = loFile.ReadLine()

*Count the words in the line, which are delimited with comma
	thiswords = GETWORDCOUNT(lcLine, ",")

*In this sample I created some dbf's from No.1 to No.5, assuming that
*I only need the csv-columns from 1 to 5
*All dbf's have only one field, which is named "col1", to keep it simple.

	*Only 5 target dbf's exist, so ignore any further columns in a line
	FOR i = 1 TO MIN(thiswords, 5)
		filenum = "dbf" + ALLTRIM(STR(i))
		W = GETWORDNUM(lcLine, i, ",")
		W = ALLTRIM(W)
		USE (filenum)
		APPEND BLANK
		REPLACE col1 WITH W
	ENDFOR

* code here to add the lines to an array or to
* write them out to multiple smaller files
ENDDO
loFile.CLOSE

* If the file is too big, I have to code
* a "break" - but that should be done later.


DO showme
*IN linesuch.prg
DO killme
*IN linesuch.prg

*In this case I used dbf's -
*it should of course be cursors for speed

PROCEDURE showme
	USE dbf1
	BROWSE
	USE dbf2
	BROWSE
	USE dbf3
	BROWSE
	USE dbf4
	BROWSE
	USE dbf5
	BROWSE
ENDPROC

* To try it again
PROCEDURE killme
	USE dbf1
	ZAP
	USE dbf2
	ZAP
	USE dbf3
	ZAP
	USE dbf4
	ZAP
	USE dbf5
	ZAP
ENDPROC

CLOSE DATA

Thanks for your help
Klaus


Peace worldwide - it starts here...
 
This is the result:
[Screenshot: the BROWSE windows of dbf1 to dbf5 with the distributed columns]


Peace worldwide - it starts here...
 
This just has the same problem VFP has with CSV. If a character column is delimited with quotes, a comma within them is not a column separator. This case might not be important to you.

I don't see a big benefit in single-column DBFs; when you'd like to do SQL and joins, a key is necessary. I did use two-column tables with key and char columns for the sake of normalizing addresses. And then the actual address record has to hold all the keys to the details.

And in the case of the addresses it would work okay, as you don't do much more than look up an address and put it together for printing an invoice/bill/delivery note and, last but not least, the address on an envelope or parcel label - thinking of an online-shop example. Similar for invoicing customers for services like freelance programming.


Edit: This actually has a name: column store. But to get either performance or storage benefits out of it you need special indexes on such columnar data. Say you sort a numeric column that has lots of similar values: the difference from one row to the next is often 0, so you can store all the numbers by storing just the first value and then the differences to the following rows. Those differences are often 0, which is easy to compress. But now getting at row N means reading N rows and adding up all the differences.

Just one example.
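
As a toy illustration of that delta idea (not code from the thread, just made-up numbers): a sorted numeric column stored as its first value plus row-to-row differences, and the cost of getting row N back:

Code:
LOCAL laVals[5], laDelta[5], lnI, lnRowN, lnSum
laVals[1] = 100
laVals[2] = 100
laVals[3] = 101
laVals[4] = 101
laVals[5] = 105

* store the first value and then only the differences (mostly 0, compresses well)
laDelta[1] = laVals[1]
FOR lnI = 2 TO 5
	laDelta[lnI] = laVals[lnI] - laVals[lnI - 1]
ENDFOR

* reading row N back means summing the first N deltas
lnRowN = 4
lnSum = 0
FOR lnI = 1 TO lnRowN
	lnSum = lnSum + laDelta[lnI]
ENDFOR
? lnSum   && 101, the value in row 4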

When querying only fields 2, 5 and 7 from a wide table you only need the files for those fields. You don't need to skip the bytes of the other columns, so disk access and the number of bytes read go down, but that requires a query engine specialized for this. There's a lot of work in getting this concept to perform better than the usual row-based data.

Chriss
 
Chris - thanks for your very prompt answer.

Chriss said:
This just has the same problem VFP has with CSV. If a character column is delimited with quotes, a comma within them is not a column separator. This case might not be important to you.

I agree - it depends on the structure of the csv - and what has to be done in VFP: That is really a problem.

Chriss said:
I don't see a big benefit of single column DBFs, when you'd like to do SQL and joins, there's a key necessary.

I see a benefit in speed.
You may remember that I tried the FINDSTR command, which could find a given expression in a file of 27 million addresses within 6 seconds.
I can imagine that a SEEK in a single-column DBF (e.g. on the name of a person) would be much quicker - and as the RECNO() in all the other DBFs is identical for the full details of the same person (name, location, street etc.), the complete lookup could be quicker, too.
But that still has to be tested.
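
A minimal sketch of that lookup idea (assuming the single-column tables dbf1 and dbf2 from the test program above, filled in parallel record order, and an index tag on dbf1.col1 that would still have to be created):

Code:
* one-time preparation: INDEX ON col1 TAG col1 in dbf1
LOCAL lnRec
USE dbf1 IN 0
USE dbf2 IN 0

SELECT dbf1
SET ORDER TO TAG col1
IF SEEK("fff")                 && fast lookup in the single-column table
	lnRec = RECNO()
	SELECT dbf2
	GO lnRec                   && the matching detail shares the same RECNO()
	? dbf1.col1, dbf2.col1
ENDIF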

I repeat that because it is what Steve wrote at the beginning of his thread:

Steve said:
The purpose of this project is for someone to be able to enter a postcode, search for it and the address will be auto filled elsewhere.

I see my thread caused a lot of discussion - that was and is very educational - I hope not just for me.
Regards
Klaus






Peace worldwide - it starts here...
 