
Extracting addresses from UK.gov price paid data 2


Chris Miller

Programmer
Oct 28, 2020
4,950
DE
Hello Steve, Klaus and anyone interested.

I had a stab at reading in the addresses mentioned in thread184-1810484.

I took a lot of the code I already had in there from myself and others, but first I had to find a way to split the large ~4.5 GB pp-complete.csv. PowerShell proves able to do this in a one-liner; credit goes to Stephan Solomonidis' blog article.
In short, I used this at the PowerShell prompt:
Code:
$i=0; Get-Content C:\data\pp-complete.csv -ReadCount 100000  | ForEach-Object{$i++; $_ | Out-File C:\data\pp-partial_$i.csv utf8 }

In the first run I didn't have the utf8 parameter, and PowerShell then by default outputs a Unicode encoding needing 2-4 bytes per character (I think UCS-2), which ended in about 12 GB of partial CSV files. In theory you could output codepage 1252 to directly get the VFP default codepage, but on my machine PowerShell claimed not to know that codepage.
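For anyone without PowerShell at hand, the same chunking can be sketched in Python. This is a hedged sketch, not the blog article's code; the function name and file pattern are my own, and the explicit encoding avoids the UTF-16 blow-up described above:

```python
# Split a large CSV into fixed-size chunks, mirroring the PowerShell
# one-liner. Writing with an explicit encoding avoids ballooning the
# output to 2-4 bytes per character.
def split_csv(path, out_pattern, lines_per_file=100_000, encoding="utf-8"):
    """Stream `path` and write chunks of `lines_per_file` lines each.

    `out_pattern` must contain a %d placeholder for the part number.
    Returns the number of part files written.
    """
    part = 0
    out = None
    with open(path, "r", encoding=encoding, newline="") as src:
        for i, line in enumerate(src):
            if i % lines_per_file == 0:       # start a new part file
                if out:
                    out.close()
                part += 1
                out = open(out_pattern % part, "w",
                           encoding=encoding, newline="")
            out.write(line)
    if out:
        out.close()
    return part
```

Streaming line by line keeps memory flat regardless of the input size, which matters for a 4.5 GB file.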

By the way, this splits pp-complete.csv into 262 files of 99,999 records each, except the last file, which is a bit shorter. After processing all these partial CSV files I get a DBC summing all files up to ~2 GB. But I'm jumping ahead. First of all, I used Excel to read in the CSV and formatted the columns so the import into VFP is as easy as I showed in the other thread: by using a Range object you get an array VFP can use in an SQL INSERT ... FROM ARRAY.

I tried to automate the OpenText method and failed; I didn't want to invest much into trial & error and simply created an Excel template file with this function:
Code:
Function openppdcsv(ByVal File As String) As Workbook
    Workbooks.OpenText Filename:=File, Origin:=xlWindows, StartRow:=1, DataType:=xlDelimited, TextQualifier:=xlTextQualifierDoubleQuote, Comma:=True
    Set openppdcsv = ActiveWorkbook
End Function

This can be used from VFP to import the data with this routine:
Code:
Close Tables All
Close Databases All
Cd Getenv("TEMP")
Erase *.Dbc
Erase *.dct
Erase *.dcx

Erase *.Dbf
Erase *.fpt
Erase *.Cdx

Create Database UKAddressDB

Create Table Cities     (Id Int Autoinc Primary Key, City Char(30))
Index On City Tag City

Create Table Streets    (Id Int Autoinc Primary Key, Street Char(50))
Index On Street Tag Street

Create Table Postcodes  (Id Int Autoinc Primary Key, Postcode Char(8))
Index On Postcode Tag Postcode

Create Table AONs       (Id Int Autoinc Primary Key, AON  Char(25))
Index On AON Tag AON

Create Table Addresses (Id Int Autoinc Primary Key,;
   StreetID Int References Streets Tag Id,;
   CityID Int References Cities Tag Id,;
   PostcodeID Int References Postcodes Tag Id)

Create Table AddressAONs (Id Int Autoinc Primary Key,;
   AddressID Int References Addresses Tag Id,;
   AONID Int References AONs Tag Id,;
   PrimaryAON Logical)

#Define xlDelimited 1
#Define xlWindows 2
#Define xlLastCell 11

loExcel = Createobject("Excel.Application")
loExcel.DisplayAlerts  = .F.
loExcel.Workbooks.Open("C:\data\pp-part_n.xlsm")
loExcel.Visible = .T.

Create Cursor Import (Identifier Char(38), Price Integer, TransferDate Date, Postcode Varchar(8) Null, ;
   PropertyType Char(1), OldNew Char(1), Duration Char(1), ;
   PAON Varchar(100) Null, SAON Varchar(100) Null, ;
   Street Varchar(100) Null, Locality Varchar(100) Null, City Varchar(100) Null, ;
   District Varchar(100), County Varchar(100), ;
   Category Char(1), RecordStatus Char(1))

? DATETIME(), "Start"

For lnI = 1 To 262
   lcFile = "C:\Programming\pp-addressdata\pp-partial_"+Transform(lnI)+".csv"
   loWorkbook = loExcel.Run("openppdcsv", lcFile)

   With loWorkbook.ActiveSheet
      .Columns(1).NumberFormat = '@' && Identifier (Guid) -> Text
      .Columns(2).NumberFormat = '0.00' && Price -> Number
      .Columns(3).NumberFormat = 'yyyy-mm-dd;@' && TransferDate -> Date
      .Columns(4).NumberFormat = '@' && Postcode -> Text
      .Columns(5).NumberFormat = '@' && PropertyType
      .Columns(6).NumberFormat = '@' && OldNew
      .Columns(7).NumberFormat = '@' && Duration
      .Columns(8).NumberFormat = '@' && PAON
      .Columns(9).NumberFormat = '@' && SAON
      .Columns(10).NumberFormat = '@' && Street
      .Columns(11).NumberFormat = '@' && Locality
      .Columns(12).NumberFormat = '@' && City
      .Columns(13).NumberFormat = '@' && District
      .Columns(14).NumberFormat = '@' && County
      .Columns(15).NumberFormat = '@' && Category
      .Columns(16).NumberFormat = '@' && RecordStatus

      loLastCell = .Cells.SpecialCells( xlLastCell )

      laData = .Range( .Cells(2,1), m.loLastCell ).Value
   Endwith

   For lnJ = 1 To Alen(laData)
      If Vartype(laData[lnJ])="C"
         laData[lnJ] = Strconv(laData[lnJ],11) && convert UTF-8 to Ansi
      Endif
   Endfor

   Insert Into Import From Array laData

   loWorkbook.Close(.F.)

   ExtractAddressData()
   Set Safety Off
   Zap In Import
   Set Safety On
Endfor lnI


Function ExtractAddressData()
   ExtractCities()
   ExtractStreets()
   ExtractPostCodes()
   ExtractAONs()
   ExtractAddresses()
Endfunc

Procedure ExtractCities()
   Select Distinct City From Import Into Cursor importCities
   Scan
      If Not Indexseek(City,.F.,"Cities","City")
         Insert Into Cities (City) Values (importCities.City)
      Endif
   Endscan
   Use In importCities

Procedure ExtractStreets()
   Select Distinct Street From Import Into Cursor importStreets
   Scan
      If Not Indexseek(Street,.F.,"Streets","Street")
         Insert Into Streets (Street) Values (importStreets.Street)
      Endif
   Endscan
   Use In importStreets

Procedure ExtractPostCodes()
   Select Distinct Postcode From Import Into Cursor importPostCodes
   Scan
      If Not Indexseek(Postcode,.F.,"Postcodes","Postcode")
         Insert Into Postcodes (Postcode) Values (importPostCodes.Postcode)
      Endif
   Endscan
   Use In importPostCodes

Procedure ExtractAONs()
   Select Distinct PAON From Import Into Cursor importPAONs
   Scan
      If Not Indexseek(PAON ,.F.,"AONs","AON")
         Insert Into AONs (AON) Values (importPAONs.PAON)
      Endif
   Endscan
   Use In importPAONs

   Select Distinct SAON From Import Into Cursor importSAONs
   Scan
      If Not Indexseek(SAON ,.F.,"AONs","AON")
         Insert Into AONs (AON) Values (importSAONs.SAON)
      Endif
   Endscan
   Use In importSAONs

Procedure ExtractAddresses()
   Select Import
   Scan
      Indexseek(City,.T.,"Cities","City")
      Indexseek(Street,.T.,"Streets","Street")
      Indexseek(Postcode,.T.,"Postcodes","Postcode")
      If !Empty(PAON)
         Indexseek(PAON ,.T.,"AONS","AON")
         AddressPAONID = AONs.Id
      Else
         AddressPAONID = 0
      Endif

      If !Empty(SAON)
         Indexseek(SAON ,.T.,"AONS","AON")
         AddressSAONID = AONs.Id
      Else
         AddressSAONID = 0
      Endif

      Insert Into Addresses (CityID, StreetID, PostcodeID) Values (Cities.Id, Streets.Id, Postcodes.Id)
      If AddressPAONID >0
         Insert Into AddressAONs (AddressID, AONID, PrimaryAON) Values (Addresses.Id, AddressPAONID, .T.)
      Endif

      If AddressSAONID >0
         Insert Into AddressAONs (AddressID, AONID, PrimaryAON) Values (Addresses.Id, AddressSAONID, .F.)
      Endif
   Endscan

I used atlopes' import cursor definition, and while I am also willing to try his CSVProcessor class, for now I put up with the slow performance of Excel for the ease of getting a range as an array.
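One detail from the loop above: Strconv(laData[lnJ], 11) converts the UTF-8 text Excel delivers into the ANSI codepage. A rough Python equivalent of that conversion, assuming cp1252 as the VFP default codepage mentioned earlier (the function name is my own):

```python
# Sketch of the UTF-8 -> ANSI step done by Strconv(..., 11) in the VFP
# loop. cp1252 is assumed as the target codepage; characters with no
# cp1252 mapping are replaced instead of raising an error.
def utf8_to_ansi(data: bytes) -> bytes:
    return data.decode("utf-8").encode("cp1252", errors="replace")
```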

It took a night to import it and the result is a neat DBC:
UKAddressDB_mp9jsw.png


There are still some things to improve, for example the PAON/SAON data: wherever you have just a normal (numeric, integer) house number, Excel exports it with 10 decimal places, so the column formatting can be improved.
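If you post-process the values instead of fixing the column format, the cleanup could look like this sketch (the helper name is hypothetical, not part of the import code above):

```python
# Hypothetical cleanup for AON values that Excel hands back as floats:
# integral house numbers lose their ten decimal places, everything else
# is passed through as text.
def format_aon(value):
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)
```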

The major lesson from this is that data normalization isn't just important for a generally accepted database design: it takes out most redundancy and reduces the size even in comparison with the original CSV file.
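The redundancy effect can be illustrated with a toy sketch (made-up data, not the import code above): repeated city and street strings collapse into small integer foreign keys, which is exactly why the normalized DBC ends up smaller than the flat CSV.

```python
# Toy model of the normalization done by the Extract* procedures:
# each distinct city/street gets one row in a lookup table, and the
# fact rows carry only the small integer keys.
def normalize(rows):
    """rows: list of (city, street) tuples -> lookup dicts + fact rows."""
    cities, streets, addresses = {}, {}, []
    for city, street in rows:
        city_id = cities.setdefault(city, len(cities) + 1)
        street_id = streets.setdefault(street, len(streets) + 1)
        addresses.append((city_id, street_id))
    return cities, streets, addresses
```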

I didn't import counties, districts, localities, and anything else I forgot to mention, but you can easily extend the code to add more info. You could also foresee tables like CityPostcodes to store all valid combinations of city and postcode, ditto with streets, etc., so this data can also be used to confirm correct combinations. I think the 26 million addresses might not cover all possible postcodes, so a pair you don't find in the price paid data might still exist, but you can confirm a lot of correct pairings of address details yourself, if that's the goal.
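The proposed CityPostcodes lookup could be sketched like this (a hypothetical helper; remember that the absence of a pair only means it wasn't seen in the price paid data, not that it's invalid):

```python
# Hypothetical CityPostcodes confirmation set: collect every
# (city, postcode) pair actually seen, then membership-test candidates.
def seen_pairs(records):
    """records: iterable of (city, postcode) tuples; empty postcodes skipped."""
    return {(city, postcode) for city, postcode in records if postcode}
```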

As the data contains prices paid, you could also start to build a UK map of rich and poor regions. Many things are possible. What I haven't yet tried is to use this data for some kind of address IntelliSense. I think I'll leave it at this; you can pick out what you like, and proposed extensions and changes are welcome.

Chriss
 
Hello Chris,
I am deeply impressed by all the work you have done.

There are a lot of things in it that are new to me, so I'll still ask questions.
Since I am going to a watercolor painting course in the next 14 days, I can only come back to it later and will try to understand your explanations bit by bit.
Your work shows, however, that the 2 GB limit in VFP can still be overcome even with huge files if the data is well filtered.
It's very reassuring.

Many Thanks
Klaus

P.S.
I hope you didn't have to make that face too often when you did that job:

istockphoto-1041545096-1024x1024_ck3uo4.jpg





Peace worldwide - it starts here...
 
It was straightforward with everything given in the other thread.

No, you can't process files larger than 2 GB in VFP; the first step is to cut the large file down into smaller ones. Just reread that part when you come back.

Good luck and have fun in your painting course.

Chriss

PS: From the ERASE commands in my code you can see I rebuilt the DBC and tables multiple times, mainly starting with just one of the partial CSV files and then mending problems like addressing the Workbook where I should have addressed a Sheet of the Workbook, and similar "off by one" errors. The gold nuggets I already knew were that Excel is much better at reading CSV properly, taking into account character field delimiters and therefore not splitting columns at commas within quotes. Not to forget that Range.Value results in an array, and you can insert an array into a table or cursor.

It's short code, but slow. Still, an overnight import is acceptable, even though you could cut it down to minutes using a CSV reader.

Whether that data is what Steve needs is another question; you mainly have addresses, not the people living there or owning the houses. And it's likely incomplete if you aim to know all UK addresses. You could add a few more: when you have a big house number you can likely add the same address with all lower house numbers, but it's not certain all house numbers exist, I think.
 
Thanks for pointing that out. But the bottleneck is Excel, not determining new records.


Chriss
 
Your insert is an idea. But why close Import? It's needed again to import the next part of the data. Did you even read the thread and code?

To be more clear: I didn't accuse you of introducing Excel, but if you wanted to improve the process, you should have addressed that bottleneck and not something that would only gain 1% of your optimization. If the insert is optimized, it would speed up the import far more than your earlier optimization effort.

Chriss
 
All in all, Mike, it's okay; I asked for it with

myself said:
proposed extensions and changes are welcome
Your advice goes into the category of changes.

But the major goal wasn't to have the fastest import; it was to import all this data at all. Steve already said his interest in this data has died away, but I still think it's a good lesson about normalizing data and how that's not just a hurdle professionals invented to put off hobbyists. One major result of normalization is avoiding redundancy, and that also results in smaller file sizes.

I could now try to prove my point that this can't be stored more compactly by the other proposed solutions. But I'll take your advice and change the index scanning to a query that finds only distinct new records for cities, streets and such. Another solution might work by creating 262 DBFs at first and then merging the partial results into larger and larger groups. Because one thing is true: the later CSV portions will have fewer and fewer new cities and streets.

Chriss
 
I now tried atlopes' CSVProcessor, and it turned out to be slower: ~3 minutes for a CSV part in comparison to 20 seconds with Excel.

I have not yet used all the features it supports, so there might be room for progress, especially by skipping the columns I don't extract into the tables.
I also stumbled upon a .NULL. value for a street with atlopes' processor, which I think was imported as an empty street name in Excel.

Chriss
 
Mike,

I don't know who sets the standards of what's rude or not. I think it's generally accepted that when answering a thread you only skimmed through, you'd better say so, because then it's forgivable if an answer is not on point. It's a well-known practice to introduce such an answer with "Pardon me for jumping in" or similar.

If you had even just read my post here, you would also have read what I said about possible improvements and what kinds of enhancements and changes I would be more open to than the merely superior attitude you display here.

That said, I don't mean to escalate this: you have your opinion, I have mine, they don't seem to match, and that's the end of a less than fruitful discussion. Sad, but true.

Chriss
 
Chris Miller said:
I now tried atlopes' CSVProcessor, and it turned out to be slower: ~3 minutes for a CSV part in comparison to 20 seconds with Excel.

I wouldn't expect it to be significantly otherwise, but this may trigger an interesting optimization exercise.

Chris Miller said:
I also stumbled upon a .NULL. value for a street with atlopes' processor, which I think was imported as an empty street name in Excel.

This depends on the value of the NullValue property: set it to an empty string to read empty CSV cells as empty rather than .NULL.
 
atlopes,

thanks for the feedback. I have looked into your code by single stepping through an import process. This is how I used your CSVProcessor:

Code:
CD JUSTPATH(SYS(16))
Do csv.prg
LOCAL CSVP AS CSVProcessor

m.CSVP = CREATEOBJECT("CSVProcessor")

? DATETIME(), "Start"
m.CSVP.HeaderRow = .F.
m.CSVP.DatePattern = "%4Y-%2M-%2D 00:00"
m.CSVP.Workarea = "Import"

For lnI = 1 To 262
   lcFile = "C:\Programming\pp-addressdata\pp-partial_"+Transform(lnI)+".csv"
   
* breakpoint 
   m.CSVP.Import(lcFile)
   Flush
   DoEvents

   ExtractAddressData()
   Set Safety Off
   Zap In Import
   Set Safety On
   
   ? DATETIME(), "Part "+transform(lnI) 
Endfor lnI

I saw you take the types from the m.CSVP.Workarea fields, which is a good way to handle a CSV line. I currently have no great idea how to improve this, as it's well written.

As I pointed out, my goal was mainly to show Steve how a normal, straightforward import into a normalized database structure makes it possible to store all these addresses. Even though he lost interest in the data overall, knowing this should help motivate learning more about standards.

Your code is also a nice example of how to deal with CSV and data types. Your contribution is well worth looking into.

Chriss
 
One more thing to add:

myself said:
Excel exports it with 10 decimal places, so the formatting of columns can be improved.

If you make the worksheet visible, you see the PAON and SAON columns (with house numbers or other Primary or Secondary Addressable Object Names) are formatted as Text with NumberFormat="@".

By the way, how this works is documented in Excel's number format code reference; that this property is also responsible for text formats and spacing can be seen when you expand the "Text and Spacing" section of the "Format code guidelines".

That's a bit of bad news for anyone thinking of using the Excel Range.Value array as an export helper: without exact control over the types read from the Excel range, you still have to deal with conversion issues.

Chriss
 