
Use a separate file, file type or other method to access nearly 29 million addresses


Steve-vfp9user
Programmer
Hello all

I have the full and up to date set of UK addresses and postcodes obtained from the Government website. These were downloaded in CSV format and I have managed to separate them into smaller CSV files and stripped the columns that were not required.

I’m now left with nine columns, which I have replicated in a DBF. The field names are:

Code:
datetimeup (c,16)
postcode (c,12)
numbname (c,30)
flatnumb (c,30)
add01 (c,50)
add02 (c,50)
add03 (c,50)
add04 (c,50)
add05 (c,50)
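
For reference, a minimal sketch of how that structure could be created as a free table (field names and widths exactly as listed above; the actual DBF may have been built differently, e.g. in the Table Designer):

Code:
CREATE TABLE postcodes FREE ( ;
	datetimeup C(16), ;
	postcode C(12), ;
	numbname C(30), ;
	flatnumb C(30), ;
	add01 C(50), ;
	add02 C(50), ;
	add03 C(50), ;
	add04 C(50), ;
	add05 C(50))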

I am aware that a DBF has a size limit of 2GB or 1 billion records.

I put together and ran a piece of code in an effort to append from those CSV files but as I am sure you have guessed by now, the DBF quickly fills up and creates the error message, “postcodes.dbf is too large”.

Code:
USE postcodes EXCLUSIVE
ZAP
a=0
DO WHILE .T.
  APPEND FROM "pp-complete_"+LTRIM(STR(a))+".csv" ;
    DELIMITED WITH , WITH CHARACTER ,
  a=a+1
  IF a=29
    EXIT
  ENDIF
ENDDO
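
For illustration only (this is not the code that produced the error), a minimal sketch of how such a loop could watch the growing DBF and stop before the 2 GB limit is hit, assuming the same file naming:

Code:
USE postcodes EXCLUSIVE
ZAP
LOCAL lnFile
LOCAL ARRAY laSize[1]
FOR lnFile = 0 TO 28
	* check the current size of postcodes.dbf before each append
	ADIR(laSize, DBF())
	IF laSize[1, 2] > 1800 * 1024 * 1024   && leave headroom below the 2 GB limit
		EXIT
	ENDIF
	APPEND FROM ("pp-complete_" + LTRIM(STR(m.lnFile)) + ".csv") TYPE DELIMITED
ENDFOR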

The purpose of this project is for someone to be able to enter a postcode, search for it, and have the address auto-filled elsewhere.

I only need to know the following:

Is there a separate file, file type or other method I can use to access nearly 29 million addresses?

Any pointers or suggestions would be much appreciated.

Thank you

Steve Williams
VFP9, SP2, Windows 10
 
Yes, a good summary, Brigmar. I would only disagree with one small part of your post. You said:

UK postal addresses are some of the most confusing on the planet

From the outside, they might look confusing at first glance. But they follow a logical structure and are really not that difficult to understand.

You might find my article on validating UK postcodes useful.
Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Steve
In one of the previous posts, atlopes asked this:
atlopes said:
If it's not sensitive data, can you post one of your CSV files here, please?
I realize that you can't, because it's sensitive data that doesn't belong to us either.
My request would be this:

Can you perhaps provide, say, 10 data records in CSV format from your file, both typical ones (nothing missing) and untypical ones (quotation marks, too many commas)?

A bit of anonymisation (a different name, a similar format for the postcode, a different house number) would be enough to make it usable as test material; you could quickly change a few records manually, so there is no risk regarding data protection or ownership of the data.

I would very much like to test how a group of many large text files performs with the FINDSTR command.

I would of course also publish the result here.


As you can see, a lot of people are interested in this subject.

It touches on the core question of when it is worthwhile to convert large amounts of data into a VFP format first and then continue working with it, and when a simple comparison between the entered data and a large cloud of data is enough.

It also touches on the question of how to avoid running out of main memory during such operations,
and how to avoid too much preliminary work if the available data is poorly structured.

Thanks
Klaus


Peace worldwide - it starts here...
 
Hi Klaus

If anyone would like to download the full file, here is the link:


There is also a monthly update file on that same page. As for sensitive detail, that isn't an issue: the data is freely available, but it comes with a caveat shown on the Government website:

If you use or publish our Price Paid Data, you must add the following attribution statement:

Contains HM Land Registry data © Crown copyright and database right 2021. This data is licensed under the Open Government Licence v3.0.

Good luck to all those interested, and much appreciated to those who posted.

Thank you

Steve Williams
VFP9, SP2, Windows 10
 
Steve,

Using the CSVProcessor class at
Code:
CREATE CURSOR curPricePaid (Identifier Char(38), Price Integer, RefDate Date, PostCode Varchar(8) NULL, ;
	Attrib1 Char(1), Attrib2 Char(1), Attrib3 Char(1), ;
	PAON Varchar(100) NULL, SAON Varchar(100) NULL, ;
	Street Varchar(100) NULL, Locality Varchar(100) NULL, City Varchar(100) NULL, ;
	District Varchar(100), County Varchar(100), ;
	Attrib4 Char(1), Attrib5 Char(1))

LOCAL CSVP AS CSVProcessor

m.CSVP = CREATEOBJECT("CSVProcessor")

m.CSVP.HeaderRow = .F.
m.CSVP.DatePattern = "%4Y-%2M-%2D 00:00"
m.CSVP.Workarea = "curPricePaid"

m.CSVP.Import(GETFILE("csv"))

SELECT curPricePaid
GO TOP
BROWSE

importing the complete 2020 data (at 112MB), results in

[screenshot: BROWSE window showing the imported curPricePaid cursor]


I don't know what the Attrib* columns mean, but hopefully, you'll have some documentation for them.

Putting the large data into some persistent database storage will be another issue. But importing it in the already available chunks seems feasible and straightforward.
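
For what it's worth, a minimal sketch (hypothetical table name) of persisting one imported chunk from the cursor into its own DBF, so that no single table approaches the 2 GB limit:

Code:
SELECT curPricePaid
COPY TO pp_2020                      && writes the cursor's rows out as pp_2020.dbf
USE pp_2020 EXCLUSIVE
INDEX ON PostCode TAG PostCode       && make postcode lookups within the chunk fast
USE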

Edited info: beware that this takes some time to import; the 2020 cursor ends up with 780,488 rows.
 
Atlopes, your screen shot is showing land registry data, not the postcode address file which Steve is interested in. Did you post it as an example of how to use the CSVProcessor class? If so, that's fine. It serves that purpose well.

atlopes said:
I don't know what the Attrib* columns mean

In case anyone is interested, the land registry file shows details of all properties sold in England (and Wales?). Each row represents one such transaction. The attribute codes describe certain aspects of the property. So in the first row in your example, S means semi-detached, N means it is a new building, and F means freehold tenure.

Of course, this is all pretty irrelevant to the subject under discussion, but I thought I would post it anyway, in case anyone is curious.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Mike,
Mike Lewis said:
your screen shot is showing land registry data, not the postcode address file which Steve is interested in.
I believe that the postcode addresses that Steve is looking for are included in the Price Paid file.

Thank you for the hints regarding the Attrib* columns. With those, it's possible to build a more meaningful cursor:

Code:
CREATE CURSOR curPricePaid (Identifier Char(38), Price Integer, TransferDate Date, PostCode Varchar(8) NULL, ;
	PropertyType Char(1), OldNew Char(1), Duration Char(1), ;
	PAON Varchar(100) NULL, SAON Varchar(100) NULL, ;
	Street Varchar(100) NULL, Locality Varchar(100) NULL, City Varchar(100) NULL, ;
	District Varchar(100), County Varchar(100), ;
	Category Char(1), RecordStatus Char(1))

LOCAL CSVP AS CSVProcessor

m.CSVP = CREATEOBJECT("CSVProcessor")

m.CSVP.HeaderRow = .F.
m.CSVP.DatePattern = "%4Y-%2M-%2D 00:00"
m.CSVP.Workarea = "curPricePaid"

m.CSVP.Import(GETFILE("csv"))

SELECT curPricePaid
GO TOP
BROWSE
 
There are many downloads; it would be good if you pointed out which one you used, Steve.

The 'complete Price Paid Transaction Data as a CSV file' is 3.7 GB, which still doesn't account for the 13 GB you mentioned. Perhaps that's the unzipped file size?

Since Mike thankfully explained what this data means, are you sure it's what you were looking for? I can imagine most houses/buildings have been sold at least once, so you get most addresses, but this dataset isn't a full address registry; it records which real estate properties were sold.



Chriss
 
I'm pretty sure what Steve is using is the PAF, which is the official Postcode Address File. I've had some experience myself with this file, and it seems to fit the data that Steve has described.

The Land Registry file is related to the PAF in that it contains a postcode field, which can be considered a foreign key into the PAF. But the addresses in the Land Registry file are not necessarily the same as those in the related PAF record. The LR record includes fields such as Locality, City, District and County, none of which are relevant to the PAF.

This seems reasonable when you consider the different functions of these files. The data in the PAF is mainly intended to help deliver mail. The LR file is designed to identify the locations of properties which have been bought and sold. Also the PAF covers Great Britain, Northern Ireland, the Channel Islands and certain overseas territories, whereas the LR file only covers England and Wales.

Mike



__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
As I announced above, I wanted to test how the FINDSTR command can be used for very large text files.

I've done that now.

Since I don't know the original file with the postcodes, I used the "land registry" file for testing.
I renamed it to p1.csv and then duplicated it 38 times with a small VFP program.
Code:
*Makemore.prg
*This program duplicates a CSV file 38 times
CLEAR

FOR i = 1 TO 38
	filnam = ALLTRIM(STR(i)) + ".csv"
	COPY FILE p1.CSV TO &filnam
ENDFOR
DIR *.*

That left me with 38 *.csv files,
corresponding to about 29.6 million rows or records.

Each *.csv file is about 132 MB, so the following batch had to search through 38 * 132 MB, approximately 5 GB.

The batch is very simple:

Code:
rem This batch lookup.bat uses the command FINDSTR and looks into 38 *.csv-files (one after another)
rem look for this string in File 38, Line 24 
rem which is this one: {B5xxxFFD-E965-7D01-E053-6C04A8C036B6},"116500","2020-09-18 00:00","DE1 3QD","F","N","L","5","","CITY WALK","","DERBY","CITY OF DERBY","CITY OF DERBY","A","A"

rem the total volume = 38 * 780488 rows = 29,658,544 records


[b]echo "Started: " %time% >>resultcsv.txt
findstr "B5xxxFFD-E965-7D01-E053-6C04A8C036B6" *.csv >>resultcsv.txt
echo "Ended: " %time% "End-Time">>resultcsv.txt[/b]

I manually changed just one record in one of the files (38.csv), creating a unique string that needed to be searched for.
It is the string beginning with B5xxxFFD in the batch above.

I have redirected the result to a text file - and in my opinion that FINDSTR command is very fast.
Here is the result:

Code:
"Started: " 11:45:42,49 
38.csv:"{B5xxxFFD-E965-7D01-E053-6C04A8C036B6},""116500"",""2020-09-18 00:00"",""DE1 3QD"",""F"",""N"",""L"",""5"","""",""CITY WALK"","""",""DERBY"",""CITY OF DERBY"",""CITY OF DERBY"",""A"",""A"""
"Ended: " 11:45:48,82 "End-Time"

As you can see, with a single batch command you can find a given string in about 6 seconds, within a volume of roughly 29 million records.

Of course, it depends on what is needed from the large amount of *.csv material, how often it is needed, and how well structured it is.
If you only compare strings, you can leave those big text files in their own format and evaluate them this way;
the result is simple, and the small result file can then be used by VFP for further processing.

I have not yet tested whether the text files have to be divided into smaller portions - but there will definitely be comments on that here.
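
To connect this back to VFP: a minimal sketch (assumed file and variable names) of driving the same FINDSTR search from VFP and reading the hits back for further processing:

Code:
LOCAL lcSearch, lcResult, lcCmd, lnHits, loShell
LOCAL ARRAY laHits[1]
lcSearch = "B5xxxFFD-E965-7D01-E053-6C04A8C036B6"
lcResult = ADDBS(SYS(2023)) + "resultcsv.txt"      && write the hits to the temp folder

* /C: searches for the literal string; the output is redirected to the result file
lcCmd = 'cmd /c findstr /C:"' + lcSearch + '" *.csv > "' + lcResult + '"'
loShell = CREATEOBJECT("WScript.Shell")
loShell.Run(lcCmd, 0, .T.)                         && hidden window, wait for completion

IF FILE(lcResult)
	lnHits = ALINES(laHits, FILETOSTR(lcResult))   && one matching CSV row per element
	? lnHits, "matching row(s) found"
ENDIF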


I am curious whether VFP, searching in its own divided and indexed *.dbf files, can exceed the search speed of my batch.


Klaus


 
Mike, quoting Steve's original code
Code:
 APPEND FROM "pp-complete_"+LTRIM(STR(a))+".csv" DELIMITED WITH , WITH CHARACTER ,
directs to the Price Paid files ("pp-complete.csv" is the name of the Single File mentioned in the link Steve provided).

Also in the original post:
Steve Williams said:
The purpose of this project is for someone to be able to enter a postcode, search for it and the address will be auto filled elsewhere.

Steve, you may consider using the Here platform live data for that. One of its services is address suggestion from structured or unstructured queries. For instance, the last address in the 2020 list is this line
Code:
{9DBAD221-97BA-6EB3-E053-6B04A8C0F257},72500,2020-01-14 00:00,"S21 4GS","F","N","L","5","","SPRINGFIELD CLOSE","ECKINGTON","SHEFFIELD","NORTH EAST DERBYSHIRE","DERBYSHIRE","A","A"

Using the overHere library as an interface to the Here platform resources, you could fetch something like this

[screenshot: address suggestions returned via the overHere library for that query]


(the coverage of the Here platform is worldwide, so we can see suggestions from other parts of the globe, but you may restrict a query to a single country, for instance).

If you're interested, the overHere library is at
 
Chris said:
Mike, no idea what you talk about

Yes, a lot of people tell me that. Sometimes I've no idea myself.

My point was that the record structure Steve quoted does not look like part of the LR (PPD) file. It is closer to that of the PAF (although not identical), though of course his application might well involve linking the two files in some way.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
I now gave it a shot and downloaded just pp-monthly-update-new-version.csv

The file is 16 MB; read into a cursor, it becomes 190 MB with this structure:
Code:
CREATE CURSOR Import(;
guid V(38),;
Price I,;
TransferDate D,;
Postcode V(254) null,;
PropertyType V(1),;
OldNew V(1),;
Duration V(1),;
PAON V(254) null,;
SAON V(254) null,;
Street V(254) null,;
Locality V(254) null,;
City V(254) null,;
District V(254) null,;
County V(254) null,;
Category V(1),;
RecordStatus V(1))

atlopes used V(100); I think if you now use Tamar's approach this can be shortened a bit more. Indeed, MAX(LEN(ALLTRIM(City))) is just 22.

Anyway, applying the normalization I recommended, you can get this from the City field:
Code:
SELECT distinct CAST(0 as int) as Id, City FROM myAlias INTO cursor cities readwrite
UPDATE cities SET id = RECNO()

That's 1,123 distinct records from the original 95,312 rows. Replacing the City V(254) column with an integer field saves 95,312 * 250 bytes, about 22.7 MB; the new column takes just 95,312 * 4 bytes. The cities table itself takes about 285 KB, but as a separate DBF that is of no concern until you had so many cities that it would itself exceed 2 GB - very unlikely, even in the future, as there is a finite geographical limit...

With Tamar's method I could reduce the City field to 22 characters and save 95,312 * 232 bytes without an extra table. The difference, 95,312 * 18 bytes, is about 1.6 MB, so the 285 KB for an extra cities table is well spent; on the other hand, the shortened field keeps all the text data in the one DBF, which might be worth the price.

I'm not comparing this to the solution of a single memo for all address parts, but that would still have all city names in full and as often as they occur.

The potential saving is smaller for street names but high again for postcodes. I'm not comparing normalization to the memo solution, as that would take much more work to get to, but you can see how using multiple tables instead of a single addresses DBF helps reduce the sizes.
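
As an illustration, a minimal sketch (hypothetical cursor names, columns abbreviated) of linking each address row to the cities lookup by integer id instead of repeating the city name; the full text can always be rebuilt by joining back on CityId:

Code:
* build the cities lookup as above, then join the address rows to it;
* ImportNorm carries a 4-byte CityId instead of the wide City column
SELECT Imp.guid, Imp.Postcode, Imp.Street, cit.Id AS CityId ;
	FROM Import Imp ;
	LEFT JOIN cities cit ON cit.City == Imp.City ;
	INTO CURSOR ImportNorm READWRITE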

And by the way, I let Excel import the data, including determining the data types from the data itself. The full range of data can then be turned into a VFP array with:
Code:
#Define xlLastCell 11

Create Cursor Import(....see atlopes cursor definition)...)

WITH oExcel.ActiveWorkBook.ActiveSheet
  loLastCell = .Cells.SpecialCells( xlLastCell )

  laData = .Range( .Cells(2,1), m.loLastCell ).Value
ENDWITH

Insert into import From Array laData

This can also be used to split one large Excel sheet from a 13 GB CSV into arrays of, say, 100 lines each and insert them into an import cursor, forwarding the data portion by portion to the normalized table schema.
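
A minimal sketch of that portioning, reusing the names from the snippet above (oExcel is assumed to be an Excel.Application automation object with the sheet already open):

Code:
#Define xlLastCell 11

LOCAL loSheet, loLastCell, lnLastRow, lnLastCol, lnStart, lnEnd, laData
loSheet = oExcel.ActiveWorkBook.ActiveSheet
loLastCell = loSheet.Cells.SpecialCells(xlLastCell)
lnLastRow = loLastCell.Row
lnLastCol = loLastCell.Column

* read 100 rows at a time into an array and append each portion to the Import cursor
FOR lnStart = 2 TO lnLastRow STEP 100
	lnEnd = MIN(lnStart + 99, lnLastRow)
	laData = loSheet.Range(loSheet.Cells(lnStart, 1), loSheet.Cells(lnEnd, lnLastCol)).Value
	INSERT INTO Import FROM ARRAY laData
ENDFOR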



Chriss
 
Thanks Mike, that explains it.

Steve, could you point out exactly which data you used, or is the source lost in your (perhaps deleted) browser history?
Were add01 to add05 proposed column names, or did you name them that way in your import DBF, or create them while splitting the data?

Chriss
 
Chris,
you wrote:

Klaus' idea isn't bad, because the search for parts of interest is done by the shell command findstr, which doesn't have that 2GB limit.
The output it creates usually will be much shorter (if you don't search for something as short as a single letter) and can be read from VFP.
What I think is not good with this solution is the performance

Klaus said:
Is that still bad performance for you, when (as you can see above) a certain string was found in about 6 seconds within a volume of 29 million records?

Klaus

Peace worldwide - it starts here...
 
If you want to do something like IntelliSense for addresses, 6 seconds is too long; but compared to looking through 29 million index cards, it's fast.
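
For comparison, a minimal sketch (hypothetical table and tag names) of what a lookup looks like once the data sits in an indexed DBF; a single SEEK on a CDX tag is effectively instant, which is what an autocomplete-style UI needs:

Code:
USE postcodes IN 0 SHARED
SELECT postcodes
SET ORDER TO TAG postcode          && assumes a CDX tag on the postcode field
IF SEEK("DE1 3QD")
	* matching record found; read the address fields here
ENDIF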

Chriss
 
Mike:
We talked about text editors and their capacity a long time ago, and again today.
Meanwhile I found this huge one:



Unfortunately it costs money, but I once downloaded a trial version (30 days) for free - and even 10 GB files were no problem for this editor, even when searching.

Just for your information.
Time goes on and on .....

Regards
Klaus

Peace worldwide - it starts here...
 
Chris

Thanks for your answer (the definition of fast or not), which has convinced me.

I am sorry that Steve hasn't answered your last questions yet.

I would have loved to know what the search speed is with the multiple VFP files, after the data normalization you had already started.

Klaus


Peace worldwide - it starts here...
 
Klaus, thanks for your empathy,

I could just take the PPD data and extract something from it.
But even if this were Steve's data source, or he simply uploaded his data somewhere, I don't really have the time to work on it.

Google shows that autocompletion - in their case rather suggestions - is possible with their search engine.

Royal Mail has this demo of address completion:

Chriss
 
And regarding editors and files larger than 2 GB or 4 GB:

While notepad.exe chokes on large files, there's an out-of-the-box alternative in Windows: WordPad. It shows the top of a txt or doc file while the rest is still loading. I admit I haven't tried it with a 4 GB file, but since we now have that much RAM, any 64-bit editor could be used to search in a large file.

Chriss
 