Use a separate file, file type or other method to access nearly 29 million addresses


Steve-vfp9user

Programmer
Feb 5, 2013
Hello all

I have the full and up-to-date set of UK addresses and postcodes obtained from the Government website. These were downloaded in CSV format, and I have managed to split them into smaller CSV files and strip out the columns that were not required.

I’m now left with nine columns, which I have replicated in a DBF. The fields are:

Code:
datetimeup (c,16)
postcode (c,12)
numbname (c,30)
flatnumb (c,30)
add01 (c,50)
add02 (c,50)
add03 (c,50)
add04 (c,50)
add05 (c,50)

I am aware that a DBF has a size limit of 2GB or 1 billion records.

I put together and ran a piece of code in an effort to append from those CSV files, but, as I am sure you have guessed by now, the DBF quickly fills up and produces the error message “postcodes.dbf is too large”.

Code:
USE postcodes EXCLUSIVE
ZAP
a = 0
DO WHILE .T.
  && pp-complete_0.csv to pp-complete_28.csv, comma-separated
  APPEND FROM ("pp-complete_" + LTRIM(STR(a)) + ".csv") DELIMITED
  a = a + 1
  IF a = 29
    EXIT
  ENDIF
ENDDO

The purpose of this project is for someone to be able to enter a postcode, search for it, and have the address auto-filled elsewhere.

I only need to know the following:

Is there a separate file, file type or other method I can use to access nearly 29 million addresses?

Any pointers or suggestions would be much appreciated.

Thank you

Steve Williams
VFP9, SP2, Windows 10
 
Could it be due to insufficient memory?
Klaus


Peace worldwide - it starts here...
 
Hello Klaus
No. The DBF stops when it reaches 2 GB.
It’s a known fact that DBFs have a size limit.

Thank you

Steve Williams
VFP9, SP2, Windows 10
 
Code:
USE postcodes EXCLUSIVE
ZAP
a = 0
DO WHILE .T.
  && pp-complete_0.csv to pp-complete_28.csv, comma-separated
  APPEND FROM ("pp-complete_" + LTRIM(STR(a)) + ".csv") DELIMITED
  a = a + 1
  IF a = 29
    EXIT
  ENDIF
ENDDO


To me it looks like you are trying to append from 29 *.csv files:
1. APPEND FROM ("pp-complete_" + LTRIM(STR(0)) + ".csv")
2. APPEND FROM ("pp-complete_" + LTRIM(STR(1)) + ".csv")
3. APPEND FROM ("pp-complete_" + LTRIM(STR(2)) + ".csv")

and so forth, up to "pp-complete_28.csv" (29 files in total).

What is the real name (not the calculated one) of your *.csv file?
Maybe I have misunderstood something.

Klaus



Peace worldwide - it starts here...
 
That’s correct. I split the original CSV file because it was too large (13 GB), and I removed the data I didn’t require.
It would make no difference if I merged all 29 CSV files back into one, because appending from them would still produce more data than a DBF can hold.
I hope that makes sense.

Thank you

Steve Williams
VFP9, SP2, Windows 10
 
Have you tried it with only one *.csv - and if so, does that also lead to “postcodes.dbf is too large”?
Roughly what is the size of one *.csv in MB? (As you have split the file and also deleted columns, it should be less than 500 MB, right?)


Peace worldwide - it starts here...
 
It does work with about five of the CSV files and then stops.

Thank you

Steve Williams
VFP9, SP2, Windows 10
 
Perhaps something was damaged in one of the remaining files (No. 6 onwards) when you deleted columns from it.
I would test that by transferring *.csv No. 6, then *.csv No. 7, and so on.
It would also be useful to know how big each *.dbf would turn out to be (in records AND in bytes).

Instead of gathering everything into one *.dbf, I would also try building 29 *.dbf files and later try to append them one after another.

This could be done by changing your program a little bit, as in the sketch below.
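For example (only a sketch; it assumes the split files really are named pp-complete_0.csv to pp-complete_28.csv and that postcodes.dbf exists to serve as a structure template):

Code:
* Rough sketch: import each split CSV into its own DBF instead of one big table
LOCAL a, cDbf
USE postcodes IN 0                      && existing table, used only as a structure template
FOR a = 0 TO 28
    cDbf = "postcodes_" + LTRIM(STR(a))
    SELECT postcodes
    COPY STRUCTURE TO (cDbf)
    USE (cDbf) IN 0 EXCLUSIVE ALIAS chunk
    SELECT chunk
    APPEND FROM ("pp-complete_" + LTRIM(STR(a)) + ".csv") DELIMITED
    ? cDbf, RECCOUNT()                  && shows how many records each chunk holds
    USE IN chunk
ENDFOR
USE IN postcodes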

It might also make things clearer if you tried importing one of the files into Excel with its text-import function, just to see what that produces there.

If everything is OK, then maybe it would be possible to search across several FoxPro files, depending on how many of them can be held in main memory at the same time.

There are many experts here who can say more about this.


Klaus


Peace worldwide - it starts here...
 
Divide 2 GB by 29 million and that's roughly 74 bytes per address. Of course you fail with about 340 bytes per record. I assume add04 (c,50) is the city. Then there's a lot to reduce. You don't want to store "London" several million times, do you?

Store the cities, streets, etc. in separate tables and only store IDs in the address table: CityID, StreetID. That's data normalization, in short.
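A minimal sketch of what that could look like (table names, field widths and the example postcode are only illustrative, not a final design):

Code:
* Lookup tables for the values that repeat millions of times
CREATE TABLE cities  (CityID I AUTOINC, City C(40))
CREATE TABLE streets (StreetID I AUTOINC, Street C(60))

* The address table then stores only the short parts plus integer IDs
CREATE TABLE address (AddressID I AUTOINC, Postcode C(12), NumbName C(30), ;
    FlatNumb C(30), StreetID I, CityID I)

* Rebuilding the address lines is just a join
SELECT a.NumbName, s.Street, c.City, a.Postcode ;
    FROM address a ;
    JOIN streets s ON s.StreetID = a.StreetID ;
    JOIN cities  c ON c.CityID = a.CityID ;
    WHERE a.Postcode = "SW1A 1AA" ;
    INTO CURSOR crsHit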

Chriss
 
You could also try storing each of the address lines as a memo field instead of a C(50).

A C(50) containing an address line is going to be almost all padded spaces. A memo field isn't.
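For example (only a sketch, reusing Steve's field names; setting the block size first keeps the memo file compact):

Code:
SET BLOCKSIZE TO 0        && memo file then grows in single bytes rather than 64-byte blocks
CREATE TABLE postcodes_memo (datetimeup C(16), postcode C(12), ;
    numbname M, flatnumb M, add01 M, add02 M, add03 M, add04 M, add05 M)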

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Let's look at it from the angle of address parts and how they are put together into address lines. Any address (with a ton of exceptions) then looks like this:

Code:
<<Addressee>>
<<Company>>  (Remark: optional for mail to an employee, for example)
<<House number or name>> <<Street>>
<<City>>
<<Postcode>>
UK (Remark: for international mail sent to the UK)

And then there are PO Boxes and other address types and details differing from that scheme. But let's stick with that for the moment.
Then the actual address data could simply be a tuple of (ID, AddresseeID, CompanyID, HouseNumber, StreetID, CityID). And that's not overdoing normalization when you're talking about 29 million UK addresses.

A query producing the address lines would then start from an address record and join the addressee name, company, street, city, and so on.

It seems you only have one CSV with all the address details, as it would result from exactly such a multi-join query. Well, whatever isn't available separately in the CSV would need to be extracted from the address lines, but I wouldn't store them as-is from the CSV into a DBF, except as one intermediate step of the import. Neither char nor memo justifies storing "London" for every London address. That's one core idea of database design.

Assuming add04 normally holds the city name, you get a list of all cities with SELECT DISTINCT add04 AS City FROM import.dbf after importing the (partial) CSV file. You'll have a hard time getting a clean list of cities if even this governmental data doesn't provide the address parts as separate, single fields and only has the address as address lines, perhaps not always in the same format. But at least it seems you have some details separated already, like the postcode field.
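A sketch of that step (assuming the raw rows sit in an intermediate table import.dbf with the nine fields, and the cities lookup table from the sketch in my earlier post):

Code:
* Build the city lookup from the raw import
SELECT DISTINCT ALLTRIM(add04) AS City ;
    FROM import ;
    WHERE NOT EMPTY(add04) ;
    INTO CURSOR crsCities

* Fill the lookup table; CityID is assigned by the autoinc field
SELECT crsCities
SCAN
    INSERT INTO cities (City) VALUES (crsCities.City)
ENDSCAN

* When filling the normalized address table, look up the ID instead of storing the name
SELECT import.*, cities.CityID ;
    FROM import ;
    JOIN cities ON ALLTRIM(cities.City) == ALLTRIM(import.add04) ;
    INTO CURSOR crsAddr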

Chriss
 
I appreciate the responses guys. I’ll work through them and post back.

Thank you

Steve Williams
VFP9, SP2, Windows 10
 
Hello,

maybe it helps to put the lines for add1-add4, ALLTRIM()ed, into one memo with 4 or fewer lines:

IF !EMPTY(ALLTRIM(m.add1))
   cMemo = cMemo + ALLTRIM(m.add1) + CHR(13) + CHR(10)
ENDIF
* ... and the same for add2 to add4

Or, instead of line breaks, separate them like #add1#, <add2>, @add3@, (add4) - or better <a1>add1</a1>, ... - if you need to search for content in a specific address line.

regards
tom
 
I agree with
Chris Miller said:
Store the cities, streets, etc. in separate tables and only store IDs in the address table: CityID, StreetID. That's data normalization, in short.
For once I don't agree with Mike (about memos); that would only make the problem worse. IMHO there are only 3 solutions:
1. Go with Chris's suggestion and split the info into several DBFs linked together.
2. Use VFP Advanced, which doesn't have the 2 GB limit (but is of course maybe not so legal).
3. Use a real SQL database (a rough sketch of that option follows below).
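For option 3, a minimal sketch of how the lookup could then work from VFP via SQL pass-through (the connection string, table and column names are made up for the example):

Code:
* Rough sketch: keep the 29 million rows in a server database and query it from VFP
LOCAL nHandle, cPostcode
cPostcode = "SW1A 1AA"       && example value, normally taken from the form's textbox
nHandle = SQLSTRINGCONNECT("Driver={SQL Server};Server=MYSERVER;Database=Addresses;Trusted_Connection=Yes")
IF nHandle > 0
    IF SQLEXEC(nHandle, "SELECT * FROM address WHERE postcode = ?m.cPostcode", "crsAddress") > 0
        SELECT crsAddress
        BROWSE               && or fill the form controls from crsAddress
    ENDIF
    SQLDISCONNECT(nHandle)
ENDIF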
 
Since it's fixed data that's not going to be edited, write some code to go through and figure out the maximum size of the actual data for each of the 5 address lines. Then create the table with those fields only as wide as you need them.

If it were me, I'd have a loop that collected that information for each of the 29 files and threw it into a cursor. Then just run a query against that 29-row cursor to find the max for each column.
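For example (only a sketch; it reuses Steve's nine-field structure and assumes the split files are named pp-complete_0.csv to pp-complete_28.csv):

Code:
* Sketch: measure the real maximum width of each address line across the 29 CSV files
CREATE CURSOR crsImport (datetimeup C(16), postcode C(12), numbname C(30), flatnumb C(30), ;
    add01 C(50), add02 C(50), add03 C(50), add04 C(50), add05 C(50))
CREATE CURSOR crsWidths (filenum I, w01 I, w02 I, w03 I, w04 I, w05 I)
LOCAL a
FOR a = 0 TO 28
    SELECT crsImport
    ZAP
    APPEND FROM ("pp-complete_" + LTRIM(STR(a)) + ".csv") DELIMITED
    SELECT MAX(LEN(ALLTRIM(add01))) AS m1, MAX(LEN(ALLTRIM(add02))) AS m2, ;
           MAX(LEN(ALLTRIM(add03))) AS m3, MAX(LEN(ALLTRIM(add04))) AS m4, ;
           MAX(LEN(ALLTRIM(add05))) AS m5 ;
        FROM crsImport INTO CURSOR crsOne
    INSERT INTO crsWidths VALUES (a, crsOne.m1, crsOne.m2, crsOne.m3, crsOne.m4, crsOne.m5)
ENDFOR

* One query over the 29-row cursor gives the width to use for each column
SELECT MAX(w01) AS add01, MAX(w02) AS add02, MAX(w03) AS add03, ;
       MAX(w04) AS add04, MAX(w05) AS add05 ;
    FROM crsWidths INTO CURSOR crsFinal
BROWSE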

Tamar
 
Well, Dan,

With memos you can store each line at its minimum size. There is still a fixed overhead (the technically necessary 8 bytes for DBF and FPT per memo value), which you keep low by combining all address lines into one memo. The necessary line feed can be just CHR(10), and you still store far less than all the spaces wasted in the separate char fields, even when those are optimized. When the longest names are far longer than the average length (and that's very common), even a length-optimized char field costs a lot. And with static data you don't get memo bloat, as you only store each memo value once anyway, so you can set the blocksize to 0.

Besides that, to stay with my example of London: even though it's just a 6-letter city name, an ID for it costs 4 bytes per London address when you store a CityID, while the name itself takes 6 bytes in a char or memo field. In an optimized city field you'd likely still need something like char(20), wasting 14 bytes per London address.

Chris.
 
Dan said:
For once I don't agree with Mike (about Memos), that would only make the problem worse.

Dan, my memo suggestion might not be the best solution on offer (see note below), but I'm curious to know why it would make the problem worse. Is it to do with the 8-byte overhead for each memo field? If so, surely that would only be an issue if the average length of an address line was less than eight bytes, which I wouldn't have thought was the case.

Note: I quite like Tamar's suggestion of figuring out the actual maximum line length. It is likely to be considerably less than 50 chars.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Thank you for all these responses guys. I have found that some of the records are corrupt, in that in some of the CSV files postcodes are either missing or contain incorrect information.

I also noticed that in some of the columns the data contains an address such as "The Building, Somewhere". Because there is a comma between Building and Somewhere, Somewhere gets shifted into the next field when appended into the DBF, putting a lot of the records out of alignment. I am aware that this is because the delimiter is a comma.

I will endeavour to have a look over the next few days on stripping back and checking these CSV files, after which I'll post back with an update.

Thank you

Steve Williams
VFP9, SP2, Windows 10
 
Oh my, and that's about it regarding the IT capabilities of governments.

Just one question: are names like that in quotes? Because VFP has a weakness there. You can output CSV data quoted (so commas within a quote-delimited string don't count as field separators), but VFP can't reliably read such data back in, especially when the quoted string spans multiple lines, which is then also allowed.

Instead of importing straight into a VFP DBF, Excel could do a better import job, and then you can address this with OLE automation. (APPEND FROM ... XLS is unlikely to work, even when you save in the legacy binary format, but with OLE automation you can also get a region of data over to VFP quickly.)
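A rough sketch of that route (paths and file names are just placeholders; it lets Excel parse one quoted CSV and save it back as tab-delimited text, which VFP can then append without the comma problem):

Code:
* Let Excel do the CSV parsing, then save the sheet as tab-delimited text
LOCAL oExcel
oExcel = CREATEOBJECT("Excel.Application")
oExcel.DisplayAlerts = .F.
oExcel.Workbooks.Open("c:\data\pp-complete_0.csv")
oExcel.ActiveWorkbook.SaveAs("c:\data\pp-complete_0.txt", -4158)  && -4158 = xlCurrentPlatformText (tab-delimited)
oExcel.ActiveWorkbook.Close(.F.)
oExcel.Quit()
RELEASE oExcel

* Tabs are now the separators, so embedded commas no longer split fields
USE postcodes EXCLUSIVE
APPEND FROM "c:\data\pp-complete_0.txt" DELIMITED WITH TAB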

Chriss
 
Steve,

You can open a csv file in Excel as Chris suggested.

To overcome the embedded commas, Save as Type: Text (Tab-delimited) (*.txt).

You can then parse the text file using tab delimiters instead of commas. Of course embedded tabs could be a problem, but I found that to be a highly improbable occurrence.

HTH

Steve


 