Import or Append: what else could work to transfer a big comma-delimited file into a *.dbf?


german12

Programmer
Nov 12, 2001

When I have a textfile like this:

aa105, bbb, ccc, ddd......
aa106a, bb, cc176, ddfa34....

and a huge number of further lines,

it is of course possible to transfer that into a *.dbf file by using
APPEND FROM FileName | ? [FIELDS FieldList] [FOR lExpression]
   [[TYPE] [DELIMITED [WITH Delimiter | WITH BLANK | WITH TAB | WITH CHARACTER Delimiter]
   | DIF | FW2 | MOD | PDOX | RPD | SDF | SYLK | WK1 | WK3 | WKS | WR1 | WRK | CSV
   | XLS | XL5 [SHEET cSheetName] | XL8 [SHEET cSheetName]]] [AS nCodePage]
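For example, a minimal call might look like this (table and file names are just for illustration):

Code:
* target table open in the current work area; comma-delimited source file
USE mytable
APPEND FROM bigfile.csv TYPE DELIMITED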

but that only works when the amount of data does not exceed the 2 GB limit for a *.dbf in VFP.

Is there any possibility to pick only the 2nd "column" in the sample above, when only that is needed?
(It is the "column" with bbb, bb above.)
So instead of dividing a text file "horizontally" into a lot of parts, could it be done "vertically", by filtering only every 2nd string (or 3rd, 4th)? That would split the text into one *.dbf per column, so only a few *.dbf's would need to be created.

Again, to take the sample above:
Can I somehow create the following DBFs?

DBF#1: aa105
aa106a

DBF#2: bbb
bb

DBF#3: ccc
cc176

DBF#4: ddd
ddfa34

Could low-level file functions perhaps help here?
Unfortunately I have no experience with those commands.

Thanks in advance

Klaus


Peace worldwide - it starts here...
 
Okay,

Steve said:
The purpose of this project is for someone to be able to enter a postcode, search for it and the address will be auto filled elsewhere.

A postcode alone will not identify an address, so the single columns alone don't give you enough benefit. Also, you don't deduplicate the London addresses; London will be repeated many times. So, while the vertical split of the data into single columns may overcome the 2 GB problem, you'd have a lot of redundancy as the price for identical record numbers. I'd prefer a central table with all the foreign keys.

For example, when you look for London addresses in the DBC my earlier code generates, you get a city id of 597 and can find all London addresses by SELECT * FROM addresses WHERE cityid = 597. That's about 2 million records. And the DBC only stores LONDON once, in cities.dbf.

So in the normalized structure you only need one id column per data column, which adds 4 bytes per row, but saves storing LONDON about 2 million times. That's not just 12 million bytes (2 million times the 6 characters of LONDON); since the city column is char(30), it's 2 million * 30 bytes, circa 60 MB.

The goal doesn't have to be the smallest storage, that's clear, so your idea also works when the plain list of records is your final goal.

With the normalized form including keys for each value, I could go one step further and put all texts, no matter whether they are city names, postcodes or whatever, into one table and have a "full text search": when you LOCATE/SEEK one string, you can take its id and select all addresses that have this id in any column.

Having a key and not just the recno, you get VFP's Rushmore optimization back when you know two or more details of a record: in the first step you find the two or more keys, and in the next step you select the records that have the combination of these ids, for which VFP can make use of Rushmore.
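A rough sketch of that two-step lookup (table and field names are assumptions for illustration, along the lines of the cityid example above):

Code:
* step 1: look up the keys for the known details (assumes both lookups find a match)
SELECT cityid FROM cities WHERE cityname = "LONDON" INTO ARRAY laCity
SELECT postcodeid FROM postcodes WHERE postcode = "SW1A 1AA" INTO ARRAY laPost

* step 2: Rushmore can use the indexes on the two id columns
SELECT * FROM addresses ;
   WHERE cityid = laCity[1] AND postcodeid = laPost[1] ;
   INTO CURSOR crsResult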

Chriss
 
Klaus,

I think you are making this too complicated.

You said at the start of the thread that you only want the second column. In that case, you could do something like this:

Code:
* Assume you want to add the second column of the text file to
* field Col2 of table MyDBF.
USE MyDBF IN 0
SELECT MyDBF
loFS = CREATEOBJECT("Scripting.FileSystemObject")
loFile = loFS.OpenTextFile("multitext.txt", 1)
DO WHILE NOT loFile.AtEndOfStream
  lcLine = loFile.ReadLine()
  * skip lines that don't have at least two comma-separated values
  IF ALINES(laLines, lcLine, ",") >= 2
    INSERT INTO MyDBF (Col2) VALUES (laLines(2))
  ENDIF
ENDDO
loFile.Close()
USE IN SELECT("MyDBF")

Since you only want one column, there's probably no need to check the file size. If you are worried that the total size of just the second column could exceed 2 GB (which seems unlikely), you could keep track of the size of the DBF by multiplying the record count by the width of the field.
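A minimal sketch of that kind of check (the threshold is just a safety margin below 2 GB, not an exact limit):

Code:
* rough size check: records * record width plus the header
lnApproxSize = RECCOUNT("MyDBF") * RECSIZE("MyDBF") + HEADER("MyDBF")
IF lnApproxSize > 1.9 * 1024^3
   * getting close to the 2 GB limit: stop, or switch to a new DBF
ENDIF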

Mike


__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
To store all columns in column DBFs, I'd use ALINES too, but only when I want to insert all array elements, and that's what Klaus does.

While doing this I also realized that not all rows in the text file have data for each column, so the data would not align in the same rows. A normal CSV has the same number of columns in every row. I fixed this by using a CSV text with 3 columns in every row.

Code:
Cd c:\programming
TEXT TO multitext Noshow
aaa,bbbb,ccccc
dd, eee, fff
ggg, hhh, iii
jjj, kk, llll
mmmm, nnnn, oooo
ENDTEXT

Strtofile(multitext, "multitext.txt")

Clear
Close Data

* one single-column cursor per CSV column, related to column1 by record number
For i = 1 To 3
	Create Cursor ("column"+Alltrim(Str(i))) (Data c(10))
	If i>1
		Select column1
		Set Relation To Recno() Into ("column"+Alltrim(Str(i))) Additive
	Endif
Endfor

loFS = Createobject("Scripting.FileSystemObject")
loFile = loFS.OpenTextFile("multitext.txt", 1)
Do While Not loFile.AtEndOfStream
	lcRow = loFile.ReadLine()

	For i = 1 To Alines(laColumns, lcRow, ",")
		Insert Into ("column"+Alltrim(Str(i))) Values (laColumns[i])
	Endfor
Enddo
loFile.Close()

* the relations let a scan of column1 pull the matching rows of the other cursors
Select column1
Scan
	? column1.Data, column2.Data, column3.Data
Endscan

This also makes use of record-number-based relations. That way a loop through column1 also skips to the same row in the other two cursors. It sounds as if you could use this like a single table, but it fails at SET FILTER, which does not span all column cursors. And the relations are one-way only: locating in column1 aligns the row position in the other cursors, but when you locate or seek in column2 or column3 you'd need relations starting in those dbfs/cursors into all the other cursors. I fear this will not go well with many columns, not just because you need two relations per pair of columns, but because this might cause feedback which ends in stalling VFP.

You can do without the relations, though, as you can always take the recno of whatever cursor you last seeked or located in and go to the same recno in all the other column cursors; that might be the easiest way to handle this.
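A minimal sketch of that recno alignment, using the cursor names from the example above:

Code:
* after a LOCATE or SEEK in any one of the column cursors:
lnRecNo = Recno()              && recno of the hit in the current cursor
For i = 1 To 3
	Select ("column"+Alltrim(Str(i)))
	Go lnRecNo                 && position every column cursor on the same row
Endfor

(The loop leaves the last cursor selected, so re-select whichever cursor you want to work in afterwards.)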

You can forget SQL, though. You can't make SQL use a relation or join by record numbers. So you're bound to work without SQL.

Chriss
 
Klaus, going back to your initial concern, regardless of how you would address (a small pun intended) the issue:

Klaus said:
The problem is that the amount of data that has to be processed today has become more and more enormous over time - and it's not just the 27 million postal addresses.

That's what web services are for. Of course, they must be available for the specific type of information or process. For instance, for looking up UK addresses: in another thread I already mentioned the Here platform, with geocoding-related and location search services covering 70 countries worldwide.
 
Thank you Mike and Chriss.

Mike:
As I already mentioned, the example file was only intended as a small demo.
With the demo I first wanted to know whether a csv file can be read vertically.
With your suggestion this was possible for the model - and then I thought again of a huge csv file, such as the 27-million-row postcode file, whose number of columns is of course very small compared with its number of lines.

I therefore understand your comment and thank you for improving the program.
Mike said:
I think you are making this too complicated.
You said at the start of the thread that you only want the second column


Chriss:
Your comments about the many more processing possibilities in a file without redundancies also make sense to me.
Of course, this means a lot more work at first - but once everything has been cleaned up, it must be much faster when you compare, sum or select using SQL with keys and indexed data.

For me it was about whether you can get around the 2 GB limit in VFP when you have a very large text file and it comes to searching for or comparing a certain term.

Sometimes there are perhaps also time constraints that do not allow one to go too deep into such a file, to split it, etc.

I appreciate the countless improvements and references to advantages and disadvantages - I'm really grateful for that and always will be.

Regards
Klaus

Peace worldwide - it starts here...
 
It seems to me it should be fairly simple to convert (say) 50 single-column dbfs with (say) a million records each into a number of 50-column dbf's, limiting their size to < 2 GB of course.

That would simplify processing the dbfs, perhaps in a single loop.

All that's needed is to split the csv file using one of the suggestions above.

Steve
 
Klaus,

thanks. And your ideas are not wrong at all; as you can see, there even exist databases specialised in columnar storage of data. It just has pros and cons, and VFP isn't really made for it. A DBF with just one column is still row storage; you only have the recno as a "virtual key", but then you need to be aware of problems with recnos going out of sync if you delete in only one DBF and PACK. And you still don't gain anything special, really, as you would in a database specialised in columnar data.

The showcase mentioned in one article I (think I) linked to was saying that when you only need one column of data from a table, the separate storage of columns means you just read through this one file instead of needing to skip over all the other columns. But exactly this advantage is lost in VFP when you want two columns of the same rows: when you first select only one of them, you already lose the information about which record numbers they were. You might think in other ways, though, like SET FILTER, LOCATE, SEEK. You can take advantage of faster aggregation, i.e. when you want SUM(column3) you can get that from column3.dbf alone. But when it then comes to filtering rows by other columns, you'd start to reinvent the wheel, reinventing Rushmore by getting the recnos of the relevant data and then fetching those recnos from the column you eventually want to SUM partially (like the total of an order).

You could only have this advantage in full if the VFP SQL engine allowed a simple way of reading multiple DBFs as if they were one, on the premise of the same number of records and the same deletion state. That could have been a nice new feature, but that ship has sailed. I think the relation mechanism, reprogrammed a little, and a new column table type that extends one "normal" main DBF would work well for that: this new file type would be included in APPENDs, INSERTs, DELETEs etc. done on the main DBF, just like the main CDX belonging to a DBF is updated with any change. These new column table files would need no deletion flag and only a simple header with the column name and type.

Well, you can have some of that when you don't go to the extreme: you can extend a main record with not just one but a set of columns. For example, you could store persons with id, first and last name, and have an extension table of employees pointing to the person table for persons that are employed; this table could hold data like hire date, wage, etc. that are only relevant for employees.

All of which is just normalized data, in that case 1:1 related, or better put, 1:0-1 related. Many database developers actually use that concept but don't share the key between the person and employee records; instead employees have their own autoincrement key and, well, just a normal personid foreign key for the general person information. This only wastes 4 bytes per employee, as it has an unrelated primary key of its own plus a foreign key, but it is not the worst standard to use.
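A minimal sketch of that extension-table idea with a shared key (database, table and field names are just an example):

Code:
* person plus a 1:0-1 employee extension sharing the same key (sketch)
CREATE DATABASE personnel
CREATE TABLE person (personid I AUTOINC PRIMARY KEY, firstname C(30), lastname C(30))
CREATE TABLE employee (personid I PRIMARY KEY, hired D, wage Y)
* employee.personid holds the same value as person.personid,
* so only employed persons get an employee row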

The advantage of groups of columns is, of course, that it limits the number of joins needed to get whole records, and it becomes a quite normal relational DBMS data storage concept.

Chriss

PS: And the one counterargument against column storage is that you can't split it up any further. Well, when you combine it with horizontal splitting it becomes even harder to handle; in that case you'd better only split horizontally, and any project that needed more than 2 GB made use of exactly that. You can usually put the most often needed data in one DBF and archived/remaining data in another (or several more), or partition in any other way that makes it easy to know which DBFs have the data you need.
 
Chriss,
thanks for the extensive information about column stores and their pros and cons. I also read the interesting comment about it by John Schulz.

I have to admit that I had never heard of this before, but I now understand the advantages and disadvantages and find the whole topic exciting.

This week I won't be able to make any attempts (my golden wedding anniversary is on Thursday), but I'll stick with it for the time being. I'm now particularly interested in the speed of some actions, and also at what point things get tight for VFP when saving in columns, and when an unsplit csv file is really too big for this method of data transfer into VFP.

I'll get back to you here.
thanks again

Regards Klaus

Peace worldwide - it starts here...
 
Steve said:
It seems to me it should be fairly simple to convert (say) 50 single-column dbfs with (say) a million records each into a number of 50-column dbf's, limiting their size to < 2 GB of course.

Steve, maybe I misunderstand something.
Why/how do you want to convert a dbf with a million records into a lot of smaller dbf's to limit their size to < 2 GB?
When the million-record column dbf is already > 2 GB, you can't do that in VFP.
???

Klaus



Peace worldwide - it starts here...
 
Klaus,

I may have the wrong idea here, but what I meant was: after slicing the csv horizontally into segments under the 2 GB limit, you could then parse the data from them into a dbf having the same number of columns (fields). The maximum number of records allowed will depend on the dbf's record length, in order to keep the dbf size under 2 GB.
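As a rough rule of thumb (a sketch; the exact header size depends on the number of fields):

Code:
* approximate upper bound on records per dbf for a given record length
lnRecLen  = 200        && assumed record length in bytes
lnMaxRecs = INT((2 * 1024^3 - 1024) / lnRecLen)   && 2 GB minus a little for the header
? lnMaxRecs            && roughly 10.7 million records for 200-byte records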

I don't know how many dbfs would be necessary. Your program could handle them one at a time, perhaps in a loop.

Steve

p.s. Keep in mind Mike's caution about splitting the last row of a split csv file when the split size (in bytes) is specified.
 
Steve,

in the preprocessing scenario, slicing a large csv horizontally has the goal of being able to APPEND into a cursor that has the record structure. You're right to be cautious, as a DBF can grow larger than the split csv, because char fields are padded and APPEND skips memo fields.

But you can simply make smaller splits of just 1,000 or 10,000 lines each from the CSV, to have no problem with DBFs becoming too large. In my case I got 17 MB parts; they don't pose a problem.
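A minimal sketch of that kind of split with the FileSystemObject (file names and the 10,000-line part size are just placeholders):

Code:
* split a large csv into parts of 10000 lines each (rough sketch)
loFS  = Createobject("Scripting.FileSystemObject")
loIn  = loFS.OpenTextFile("big.csv", 1)
lnPart  = 1
lnLines = 0
loOut = loFS.CreateTextFile("part" + Transform(lnPart) + ".csv", .T.)
Do While Not loIn.AtEndOfStream
	loOut.WriteLine(loIn.ReadLine())
	lnLines = lnLines + 1
	If lnLines >= 10000 And Not loIn.AtEndOfStream
		loOut.Close()
		lnPart  = lnPart + 1
		lnLines = 0
		loOut = loFS.CreateTextFile("part" + Transform(lnPart) + ".csv", .T.)
	Endif
Enddo
loOut.Close()
loIn.Close()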

Klaus's idea mainly is to not need the split preprocessing, as Mike has shown a way to read large text/csv files line by line. I fear this line-by-line processing takes much longer than APPEND and does not pay off with a faster import, even though no time is spent on splitting the large csv file. I think Klaus's idea is not just about the import stage, but I might have got him wrong all along and he just wanted to eliminate the need to prepare an import dbf or cursor, and instead just have char column cursors that can cope with any value up to 255 bytes in length.

One advantage of line by line processing could be handling commas within character values better than APPEND.

Chriss
 
Chriss said:
Klaus's idea mainly is to not need the split preprocessing, as Mike has shown a way to read large text/csv files line by line

Steve:
What Chriss writes is exactly what I want.
It all depends on the circumstances:

a) How often does the large csv file have to be updated?
How long does the import take, and is that time reasonable?

b) For example, if a csv file is 4.7 GB, you have to generate 4700/17 = approx. 276 dbf's for VFP in order not to reach the limit of 2 GB.
But when the csv only has, say, 20 columns, the vertical approach results in only 20 one-column dbf's.
c) Which fields do I have to access for my VFP project?

d) How big is the difference in access speed when it's just one column to search?
In the link (sent by Chriss) the following case was described:
As an example of how significant the performance differences column stores can provide: I was working with a customer last year with a database containing twenty million records related to purchasing information. Using MySQL MyISAM, a typical query they would make against the purchase order table would take 30-40 minutes. Those same queries using MariaDB AX (a column store variant of MySQL) took between 0.1 and 8 seconds.

e) How big/long can the one-column dbf get before limits are also reached in VFP (depending on the column width and the number of rows in the CSV)?

f) Does the csv also have fields with quotation marks and commas within a column (a potential problem)?

All of this - and certainly more (see arguments in statements by Chriss and Mike) - must be checked.

But there are conceivable cases where this means VFP reaches its limits much later, where no csv splits are necessary, and where no memo fields can cause problems when converting.

I hope to be able to say more about it soon.

Best regards
Klaus
 
Hi Chriss:

The column store looks good and way beyond my knowledge. Good luck with it. As an aside,

Chriss said:
One advantage of line by line processing could be handling commas within character values better than APPEND.
Yes, and as I'm sure you know, it can get complicated. I have dealt with embedded commas, quotes, special characters and even some normal characters. For example, an apostrophe in a company name is not allowed in one of my apps (IRS). Another consideration is dollar signs, commas, quotes (and who knows what else) surrounding numbers that are going into a numeric (or integer) field, sometimes requiring handling of decimals as well. Of course, every case is different.
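For the numeric case, a rough cleanup along these lines often goes a long way (a sketch, not covering every format):

Code:
* strip currency signs, thousands separators and quotes before converting (sketch)
lcRaw    = ["$1,234.56"]
lnAmount = VAL(CHRTRAN(lcRaw, ['"$, ], ""))
? lnAmount    && 1234.56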

Steve
 
Okay, nice feedback Steve and Klaus.

I also already have a simple idea from looking at the resulting data. When ALINES() splits at commas, it leaves the quotes in character fields, so most of the time you then have the actual value delimited by quotes inside the dbf column field. That's simple to fix by removing them. But when the closing delimiter quote is missing, that indicates there was a comma in the string; you can put that comma back, increment the index counter for the ALINES array and append elements until you have one ending in a quote. Done.

Also, ALINES producing one or two more elements than usual indicates there is a problem. So you only need special parsing when the number of ALINES array elements is too high. In a correct CSV it can never be too low, only too high, so the expected count indicates there is no comma problem.

Another special case is a linefeed within a quote-delimited field; that's not a new record, so in that case you would need to continue reading into the field until one of the next ALINES array elements has the closing quote at its end.
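A rough sketch of that rejoining idea for the comma-in-quotes case, continuing from the line-by-line loop above (assuming double quotes as the delimiter; the linefeed case and doubled quotes are not handled here):

Code:
* rejoin ALINES() elements when a quoted value contained commas (sketch)
lnCols = Alines(laColumns, lcRow, ",")
Dimension laFixed(1)
lnOut = 0
lnIx  = 1
Do While lnIx <= lnCols
	lcVal = Alltrim(laColumns[lnIx])
	If Left(lcVal, 1) = ["] And Right(lcVal, 1) <> ["]
		* closing quote missing: glue the split pieces back together
		Do While lnIx < lnCols And Right(lcVal, 1) <> ["]
			lnIx  = lnIx + 1
			lcVal = lcVal + "," + laColumns[lnIx]
		Enddo
	Endif
	lnOut = lnOut + 1
	Dimension laFixed(lnOut)
	laFixed[lnOut] = Chrtran(lcVal, ["], "")   && strip the delimiter quotes
	lnIx = lnIx + 1
Enddo
* laFixed now holds one element per real CSV column of lcRow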

I think there will be a few more special cases; you already gave some examples. Those may be handled in postprocessing, when you at first only import everything as text. In the best case you could at least prepare one or more DBFs with the data types you know are correct. But even then, a "2021-12-31" string does not automatically convert to a VFP date field.

How should we name this? AppendX? Or Appendix? Joking aside, this could become an AppendCSV solution that works better than IMPORT or APPEND FROM for text files, even without much effort. There is no need to parse character by character just because ALINES makes no difference for commas in quotes; it already does good work that can be enhanced.

Chriss
 
Could I just check ...

Have we determined if Klaus's data actually contains embedded commas, double-quotes or other string delimiters? In one of the earlier bits of code posted here, Klaus used GETWORDNUM() with a comma as the third parameter, suggesting that there are no embedded commas or string-delimiters. If that's the case, some of the previous discussion will be irrelevant.

Apologies if this point has already been discussed. I haven't read all the most recent posts in as much detail as I should.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
I think Klaus's idea is much more general, and I am testing with pp-complete.csv, which has embedded commas and quotes as delimiters. And since Steve joined the thread, this makes sense. To me it all points towards being able to use VFP generally to process large csv files, at least to the stage where you have the data split either vertically into several column dbfs or horizontally into partition dbfs.

Klaus announced he's working on something. Let's see.

Chriss
 
Thanks for clarifying that, Chriss. It makes sense to discuss the general problem of parsing and importing large text files, as opposed to finding an immediate answer to Klaus's original question. It's an interesting topic.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Mike said:
It makes sense to discuss the general problem of parsing and importing large text files, as opposed to finding an immediate answer to Klaus's original question. It's an interesting topic.

I agree with you, Mike.

The immediate answer to my question, and for me a satisfactory one, was already given when you mentioned a very good solution here (combined with the warning that it is terribly slow).
See the 4th answer in this thread.
Mike said:
Another approach would be to use an external process, via automation or a COM object, to split the file. One possibility would be to use the File System Object. Something like this .....


Klaus


Peace worldwide - it starts here...
 
Mike Yearwood said:
One thing none of you mentioned is the Windows API Memory Mapped Files.
With that it should be possible to define sub 2Gb sections of the huge file and append from each section.

Hi Mike Yearwood,
That sounds interesting.
Could you give a code example to illustrate how to access memory-mapped files via the Windows API?
I would also be interested in the opinion of the other members in this VFP forum on this topic.

Thank you in advance.
Klaus



Peace worldwide - it starts here...
 