
Use a separate file, file type or other method to access nearly 29 million addresses

Status
Not open for further replies.

Steve-vfp9user

Programmer
Feb 5, 2013
Hello all

I have the full and up-to-date set of UK addresses and postcodes obtained from the Government website. These were downloaded in CSV format, and I have managed to separate them into smaller CSV files and strip the columns that were not required.

I’m now left with nine columns which I have replicated in a DBF. The field names are called:

Code:
datetimeup (c,16)
postcode (c,12)
numbname (c,30)
flatnumb (c,30)
add01 (c,50)
add02 (c,50)
add03 (c,50)
add04 (c,50)
add05 (c,50)

I am aware that a DBF has a size limit of 2 GB or 1 billion records, whichever comes first.

I put together and ran a piece of code in an effort to append from those CSV files, but as I am sure you have guessed by now, the DBF quickly fills up and raises the error message, “postcodes.dbf is too large”.

Code:
USE postcodes EXCLUSIVE
ZAP
* Append each of the 29 split CSV files in turn
FOR a = 0 TO 28
  APPEND FROM "pp-complete_" + LTRIM(STR(a)) + ".csv" DELIMITED
ENDFOR
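To stay under the limit rather than run into the error, one option (just a sketch, untested; it assumes part tables postcodes_1, postcodes_2 and so on were created beforehand with the same structure, and the 1.8 GB threshold is an arbitrary safety margin) is to estimate the table's file size before each append and roll over to the next part DBF:

Code:
nPart = 1
USE postcodes_1 EXCLUSIVE
FOR a = 0 TO 28
  * Approximate current file size from header size, record count and record width
  IF HEADER() + RECCOUNT() * RECSIZE() > 1800000000
    USE  && close the full part
    nPart = nPart + 1
    USE ("postcodes_" + LTRIM(STR(nPart))) EXCLUSIVE
  ENDIF
  APPEND FROM "pp-complete_" + LTRIM(STR(a)) + ".csv" DELIMITED
ENDFOR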

The purpose of this project is for someone to be able to enter a postcode, search for it and the address will be auto filled elsewhere.
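For the lookup side, an indexed SEEK is fast even at that row count; a minimal sketch (the tag name and the entered-postcode variable are made up here):

Code:
* One time: index the postcode column
USE postcodes EXCLUSIVE
INDEX ON UPPER(postcode) TAG postcode
USE

* At runtime: locate the record for an entered postcode
USE postcodes SHARED ORDER TAG postcode
IF SEEK(UPPER(ALLTRIM(m.cEntered)))
  * fill the address controls from add01..add05 of the found record
  ? add01, add02
ENDIF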

I only need to know the following:

Is there a separate file, file type or other method I can use to access nearly 29 million addresses?

Any pointers or suggestions would be much appreciated.

Thank you

Steve Williams
VFP9, SP2, Windows 10
 
Guys
I’m not sure what questions you wanted me to answer, but please be assured I’m not intentionally ignoring you; it’s just that I’ve not had time to dip into this thread.
Give me a couple of days and I’ll respond accordingly.

Thank you

Steve Williams
VFP9, SP2, Windows 10
 
Chris:

Steve, could you point out exactly which data you used, or is the source lost in your (perhaps deleted) browser history?

Web page for the CSV file here: [URL unfurl="true"]https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads[/url]
Direct download here: [URL unfurl="true"]http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-complete.csv[/url]

Were add01 to add05 proposed column names, or did you name them this way in your import DBF, or create them while splitting the data?

Yes. I originally imported one of the split CSV files via VFP9 using Tools > Wizards > Import. The columns came out as column1, column2 etc so I renamed them accordingly. Prior to this, I deleted some of the columns in the CSV file as they were not needed.

If there is anything else I can answer, then please let me know (there may be a delay replying).

Thank you

Steve Williams
VFP9, SP2, Windows 10
 
Well, okay, that makes it hard to know what ended up as add01–add05 in your DBF. If you did multiple things with third-party tools (removing and combining columns, etc.) and don't have all the steps reproducible, we (I) can only start with the price paid data and see what I can make of that.

I'll try to normalize this the way I think will work. I just wonder how you got to 13 GB when the download is "just" 4.3 GB. Your "aggregation" of the data must have introduced a lot of redundancy, or were you just using long char fields, like C(254)?

Anyway, it'll take me a while. Sunday is already planned for other things, and during the week I can't make this a full-time job, so maybe next weekend. I think that's okay, since you've stopped working on it and so don't need it urgently, perhaps not at all.



Chriss
 
Chriss

Thank you for your post. You mentioned:

Anyway, it'll take me a while. Sunday is already planned for other things, and during the week I can't make this a full-time job, so maybe next weekend. I think that's okay, since you've stopped working on it and so don't need it urgently, perhaps not at all.

Perhaps it's best for me to respond with "not at all".

Please don't waste your time on my behalf.

Best wishes

Thank you

Steve Williams
VFP9, SP2, Windows 10
 
It's okay. I just want to show how normalization helps to reduce the storage space. The calculation is quite simple, so I'll spare myself the effort of doing the full conversion and just show how much you can save.

As posted somewhere above, 30 million addresses = records means about 64 bytes per record within the 2 GB limit. Even with a short but frequently occurring city name like LONDON, you need 6 bytes in a blocksize-0 memo field, 22 bytes in an optimized city char field (as I found out, the longest city name is 22 characters long), but only 4 bytes as a foreign key. I already stated that simple reason.
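To put a number on that saving, a quick Command Window check (using the 30 million rows assumed above):

Code:
* 22-byte char city column vs. 4-byte integer foreign key, per record
? 30000000 * (22 - 4)           && 540,000,000 bytes
? 30000000 * (22 - 4) / 1024^2  && roughly 515 MB saved on the city column alone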

I showed earlier how the city table is built:
Code:
SELECT DISTINCT CAST(0 AS int) AS id, city FROM import INTO CURSOR cities READWRITE
UPDATE cities SET id = RECNO()
SELECT cities
COPY TO cities.dbf

When the integer IDs are given, you can determine the ID of a city by
Code:
SELECT id as cityid From cities where alltrim(city) == alltrim(import.city)
That's how easily foreign keys are determined, which is maybe the step you thought was too complicated.
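Applied to the whole import in one go, the same lookup becomes a join; a sketch, assuming the import cursor still has its city column (the column list is abbreviated here):

Code:
* Replace the city name with its integer key in a normalized result set
SELECT imp.postcode, imp.add01, cit.id AS cityid ;
  FROM import imp ;
  JOIN cities cit ON ALLTRIM(cit.city) == ALLTRIM(imp.city) ;
  INTO CURSOR normalized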

I'll nevertheless play with this and a few more ideas. But I'll start my own thread to show my results.

Chriss
 
Hello Chris,
I'm curious what your own announced thread will look like - I'm looking forward to it.
I have one more question about this note that you wrote above in this thread:

Chris Miller said:
SELECT distinct CAST(0 as int) as Id, City FROM myAlias INTO cursor cities readwrite
UPDATE cities SET id = RECNO()

What is "SELECT distinct CAST"? Is CAST a special field which I could not discover in your structure for your import cursor/myAlias, or is it a command which I did not find in the VFP help?
Sorry if that is a stupid question.
But I have no peace of mind if I don't understand something.
Please help...

Klaus


Peace worldwide - it starts here...
 
DISTINCT just selects each city name once; it has no direct reference to CAST. DISTINCT applies to the whole record, not just to the first field or expression following it.

CAST() in itself is a VFP function for converting a type; CAST(0 as int) ensures the field type is Integer and not N(1) or some such nonsense VFP could otherwise create to store the 0.
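You can see the difference from the Command Window; a small sketch, assuming the cities cursor from above is still open:

Code:
SELECT 0 AS id FROM cities INTO CURSOR c1               && id comes out as N(1)
SELECT CAST(0 AS int) AS id FROM cities INTO CURSOR c2  && id comes out as Integer
=AFIELDS(la1, "c1")
=AFIELDS(la2, "c2")
? la1[1, 2], la2[1, 2]  && field types: "N" vs. "I"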


Chriss
 
Klaus, I can understand your confusion. CAST() is a peculiar function. Instead of having a list of parameters between the parentheses (zero, one or more parameters), it has this [tt]expression AS datatype[/tt] syntax, which is completely non-standard.

I wonder why the VFP team didn't give us something like T-SQL's CONVERT() function, which does pretty well the same thing as CAST().

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Mike,

I would also like to have an extended TRANSFORM() that could convert between any two types (as long as it makes sense), like CONVERT() does. But CONVERT() is Microsoft's proprietary SQL dialect, while T-SQL also supports the standard CAST(). CONVERT() does a lot of conversions better, which is quite logical, as MS knows the binary structures of its database types inside and out.

For ANSI SQL, the CAST() syntax is standard. VFP's SQL also supports BETWEEN A AND B, though VFP has its own BETWEEN() function, and you cannot use BETWEEN A AND B outside of a query.

The more peculiar thing is that you can use CAST(value AS type) outside of an SQL query, which I think no other programming language has. Not that it is of much use. But within a query it helps to define result field types strictly.
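For example, this works in plain code, with no SELECT anywhere in sight:

Code:
* CAST() in an ordinary expression, outside any query
? CAST(DATE() AS DateTime)  && today's date converted to a DateTime value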

Chriss
 