
Use a separate file, file type or other method to access nearly 29 million addresses

Status
Not open for further replies.

Steve-vfp9user (Programmer)
Hello all

I have the full and up to date set of UK addresses and postcodes obtained from the Government website. These were downloaded in CSV format and I have managed to separate them into smaller CSV files and stripped the columns that were not required.

I’m now left with nine columns, which I have replicated in a DBF. The fields are:

Code:
datetimeup (c,16)
postcode (c,12)
numbname (c,30)
flatnumb (c,30)
add01 (c,50)
add02 (c,50)
add03 (c,50)
add04 (c,50)
add05 (c,50)

I am aware that a DBF has a size limit of 2GB or 1 billion records.
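For scale, some rough arithmetic based on the field widths above (data only, no indexes): 16 + 12 + 30 + 30 + 5 × 50 = 338 bytes per record, plus one byte for the deletion flag, so about 339 bytes per record. At that width the 2GB limit is reached after roughly 6.3 million records, and 29 million records would need close to 10 GB.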

I put together and ran a piece of code in an effort to append from those CSV files but, as I am sure you have guessed by now, the DBF quickly fills up and produces the error message, “postcodes.dbf is too large”.

Code:
USE postcodes EXCLUSIVE
ZAP
a = 0
DO WHILE .T.
  * the split files are named pp-complete_0.csv ... pp-complete_28.csv
  * TYPE DELIMITED assumes comma-separated values with double quotes around text fields
  APPEND FROM "pp-complete_" + LTRIM(STR(a)) + ".csv" TYPE DELIMITED
  a = a + 1
  IF a = 29
    EXIT
  ENDIF
ENDDO

The purpose of this project is for someone to be able to enter a postcode, search for it, and have the address auto-filled elsewhere.

I only need to know the following:

Is there a separate file, file type or other method I can use to access nearly 29 million addresses?

Any pointers or suggestions would be much appreciated.

Thank you

Steve Williams
VFP9, SP2, Windows 10
 
Mike Lewis said:
if the average length of an address line was less than eight bytes

The 8-byte overhead of a memo isn't a problem, because it is incurred once for the whole address, not once per line - so it doesn't depend on whether each line is longer than 8 bytes.

Still, normalizing the addresses across multiple tables gives the smallest address table itself. You surely need more disk space, but I don't think that's the problem.

Chriss
 
Steve,

If it's not sensitive data, can you post one of your CSV files here, please?
 
If it's not sensitive data, can you post one of your CSV files here, please?

Probably not sensitive, but almost certainly proprietary PAF data and therefore subject to strict licensing conditions.

At least, that's my understanding. Steve will know more about that than me.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
The 8 byte overhead of a memo isn't a problem because it's for the whole address, not because each line is longer than 8 bytes.

That's interesting, Chris. Are you saying that, if a record contains multiple memo fields, the total overhead for the record is still only eight bytes? Or are you thinking that there would only be one memo field for each address?

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Why use more than one memo field? You can still separate the address lines within the memo, by CRLF or by LF only - CHR(10).

To be precise, that makes it 8+4 = 12 bytes "wasted" (the 8-byte memo overhead plus the 4-byte memo pointer in the DBF). But even in a char-field solution with optimized field widths, if the maximum length is an outlier and the average length is only 50-60% of that, you still waste 40-50% of the bytes. What you save is also very clear: 29 million times the sum of the char field shortenings.

If you use 5 memo fields, that only pays off when you save more than 8 chars in comparison with Tamar's method.
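As a minimal sketch of what the single-memo layout could look like (table name, field list and sample values here are just placeholders, not Steve's real data):

Code:
* one memo per record, address lines joined by LF
CREATE TABLE postcodes_m (datetimeup C(16), postcode C(12), address M)

INSERT INTO postcodes_m (datetimeup, postcode, address) ;
    VALUES ("20210621100000", "XX1 1XX", ;
            "1" + CHR(10) + "Example Street" + CHR(10) + "Exampletown")

* splitting the memo back into lines: ALINES() breaks on CR and/or LF by default
LOCAL laLines[1], lnLines, lnI
lnLines = ALINES(laLines, postcodes_m.address)
FOR lnI = 1 TO lnLines
    ? laLines[lnI]
ENDFOR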

Chriss
 
That's very clear, Chris. I had originally assumed a separate memo field for each line of each address, because I sort of thought it would make processing the address lines easier. But I can see now that that is not really an issue.

So, on that basis, moving the address lines to a memo field would be a very big saving. But that's not to say that Steve should rule out any of the other suggestions.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Also consider, when using memo fields the final size can be up to 4 GB (2 GB dbf + 2 GB fpt), because the file size limits are separate for the dbf and fpt files.

Manni

 
There's a crux in that 4GB argument. To make use of it you'd need a balance of data between DBF and memo. The address lines will make up most of the data, so the 2 GB FPT limit hits you earlier than the 2 GB DBF limit, and it also stops you from adding further data; further records would then only work with NULL memos.
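Rough arithmetic to illustrate the point (the average length is an assumption, not measured from Steve's data): at, say, 60 bytes of address text per record plus the 8-byte memo header, 29,000,000 × 68 ≈ 1.9 GB of FPT, already close to the limit - and memo block rounding (SET BLOCKSIZE, default 64 bytes) pushes the real figure higher still.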

Chriss
 
I have followed the discussion so far with great interest.

The main problem is VFP's 2 GB capacity.
If I understand correctly, you want to have additional information about a postcode in a large *.csv file (comma-separated text file).

This problem gets worse the larger the volume of data FoxPro has to process.
(The request could just as well have been for 50 million or more addresses.)

So I thought about a different approach:

Instead of recreating (with a lot of effort) a VFP file that holds a reduced set of the information needed, why can we not leave the big text file as it is (or split it into portions) and first isolate the search term, together with its additional information, into a separate text file?

Only then do you access this separate file with VFP and compare its information with the user's input.


Here is a model with 5 files:
• 1 batch file
• 3 text files (so as not to overload the main memory, divide the large text/csv file into 3 or more text files)
• 1 result file

[Screenshot: the four files in the folder]


The batch file (such.bat) looks like this:

[Screenshot: contents of the batch file such.bat]


The content of the text files (No1.txt to No3.txt, each with different contents) looks e.g. like this:

[Screenshot: contents of No1.txt]


The output file (Result.txt) then receives this content, after all 3 text files have been searched, which FINDSTR (a search program built into Windows) does very quickly:

[Screenshot: contents of Result.txt]


The FINDSTR command is explained here in more detail, for example.

As my model shows, many text files can be searched one after the other (important in order not to overload the main memory) - and if you are even more familiar with batch files than I am, you may be able to work through these files with a loop (No1.txt to Nonn.txt) in the batch file.
With a text file of 1 GB (approx. 800,000 lines) I could not detect any delay in the search during my test.

I would be very interested in your opinion.

Maybe I am making another mistake in this.

Best regards Klaus



Peace worldwide - it starts here...
 
Klaus,

Thank you for your contribution, which looks very interesting, and could be worth following up.

However, you said: "why can we not leave the big text file as it is". Are you aware that the 2 GB limit also applies to text files, not just to DBFs or FPTs? That's true whether you want to import or export the file, append or copy it, access it with low-level file functions, or read it into a memory variable.

It's possible to use the File System Object to process text files larger than 2 GB, but that would be horribly slow, given the overhead involved.

This doesn't invalidate your suggestion, as you also mentioned the possibility of splitting a large text file into smaller portions. In fact, I think Steve said he had already done that. But it is something to keep in mind.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Mike,

Klaus' idea isn't bad: the search for the parts of interest is done by the shell command FINDSTR, which doesn't have that 2 GB limit. The output it creates will usually be much shorter (as long as you don't search for something as short as a single letter) and can be read from VFP. What I don't like about this solution is the performance.
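To sketch how that could look from the VFP side (purely illustrative: paths, file names and the search term are made up, and WScript.Shell is used here simply so VFP waits until FINDSTR has finished):

Code:
* let FINDSTR do the raw scan, then read only the hits into VFP
LOCAL loShell, laHits[1], lnHits, lnI
loShell = CREATEOBJECT("WScript.Shell")
loShell.Run('cmd /c findstr /C:"XX1 1XX" C:\pc\No*.txt > C:\pc\Result.txt', 0, .T.)

* Result.txt should be small enough to load in one go
lnHits = ALINES(laHits, FILETOSTR("C:\pc\Result.txt"))
FOR lnI = 1 TO lnHits
    * each hit is still a raw CSV line; parse it further as needed
    ? laHits[lnI]
ENDFOR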

I think if you explicitly tell Windows to index the CSV files, you could use Windows Search from VFP to find the right file. Keeping just one file would then not be very productive; this would need splitting the data into very many small files, of which a search can point to one or a few. That helps reduce the amount of data read per search.

There are also drivers for addressing CSV (txt) files through ODBC, but that will be slow as you have no indexes. So I'd modify Klaus's idea: first split the data into, say, MB-sized portions and let Windows index them, then use Windows Search to find the relevant file(s) and read those into DBF format for further processing.
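For the "Windows Search from VFP" part, a rough sketch via OLE DB (this assumes the Search.CollatorDSO provider that the Windows index exposes; the folder and search term are only examples):

Code:
* ask the Windows index which of the split files contain the term
LOCAL loConn, loRS
loConn = CREATEOBJECT("ADODB.Connection")
loConn.Open("Provider=Search.CollatorDSO;Extended Properties='Application=Windows';")
loRS = CREATEOBJECT("ADODB.Recordset")
loRS.Open([SELECT System.ItemPathDisplay FROM SYSTEMINDEX ] + ;
          [WHERE SCOPE='file:C:/pc' AND CONTAINS('"XX1 1XX"')], loConn)
DO WHILE NOT loRS.EOF
    ? loRS.Fields("System.ItemPathDisplay").Value    && file(s) to pull into a DBF
    loRS.MoveNext()
ENDDO
loRS.Close()
loConn.Close()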

But overall I still think you could make much more use of importing into multiple DBFs rather than one.

Chriss
 
german12 (Programmer)
21 Jun 21 18:46
Mike and Chris,
Thank you both for your quick comment.

Mike, when I wrote "why can we not leave the big text file as it is", I meant that we shouldn't change the file type at first (i.e. convert it to a VFP file) - I shouldn't have mentioned the word "big" here, I agree with you. (Differences between languages are sometimes a problem for me.)
I also noticed it during testing when I wanted to load data into a program (Notepad++) - at some point I got the message "file too big".
Hence also the splitting of such files.
I also noticed that FINDSTR glided across the files very quickly.

Chris: your reference to indexing the text files with Windows is a good hint.
I've never needed this before - but can you tell me how to do it?
I would like to try it and index a number of very large text files myself, because I am curious what the performance is then in connection with FINDSTR.

Thanks,
Klaus

Peace worldwide - it starts here...
 
Appreciating all these replies. Still looking at the original file from the government website and will post back soon.

Thank you

Steve Williams
VFP9, SP2, Windows 10
 
I also noticed it during testing when I wanted to load data into a program (Notepad++) - at some point I got the message "file too big".

That's right. Notepad++ has a limitation of 2 GB.

The version of Notepad that comes with Windows has, I believe, a lower limit (512 MB?). If you think that's bad, the limit was 64 KB in Windows 95 (if I remember rightly).

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Thanks Chris

Meanwhile I found this article, written in German (easier for me), concerning indexing with Windows.
Translation by Google (pictures visible in that link):

Windows 10: adjust, deactivate or activate indexing for quick searches - Here's how
Robert Schanze


Windows 10, like its predecessors, creates an index of certain files to enable a quick search for them. The process is also called indexing.
We'll show you how you can adjust the indexing options, add more folders for quick searches, deactivate or activate them completely.

By default, Windows 10 indexes the Users folder, offline files, and the Start menu. Program and system files are not indexed because they are searched for less often. We'll show you how to add additional folders and locations for indexing, deactivate or activate them.

Windows 10: Adjust indexing for quick searches

To adjust the indexing of files in Windows 10, you do the following:

Open the indexing options by pressing the key combination [Windows] + [R], typing in

control.exe srchadmin.dll

and confirming with the Enter key.
In the new window above you can see how many elements have been indexed.

Windows 10: Here you add or remove new folders for file indexing.

To determine which file types should be indexed, click on the "Advanced" button and then on the "File types" tab. Place a check mark in front of the file types that are to be indexed. For each type, you can choose whether only the properties or also the file contents should be indexed. Confirm with OK.
To add new folders and locations to the index, click on the "Change" button. The "Indexed Locations" window opens. Click on the "Show all locations" button below. At the top you can expand the entry for your hard drive and add more locations by checking the box in front of the respective folder. In the lower part of the window, a summary of all the places that Windows 10 indexes is displayed. Confirm with "OK". Indexing may take a while, depending on the folders you selected; it continues to run in the background.
To stop the indexing, click on the "Stop" button.
Under the "Advanced" button you will find the "Create new" button on the "Index settings" tab. This rebuilds the index in Windows 10 if you have problems with it.
When you have adjusted the indexing options, click on the "Close" button.

Those who do not need indexing of files can deactivate them if necessary.
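If you want to jump to that dialog straight from VFP, a one-line sketch using the applet name quoted above:

Code:
* open the Windows indexing options dialog (does not wait for it to close)
RUN /N control.exe srchadmin.dll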

Klaus

Peace worldwide - it starts here...
 
Mike - I think you are right.
On Windows 95, 98 and Me, Notepad could not open files larger than 64 KB.

Wow! What a huge volume! That is 0.064 MB.

If the current Notepad can manage up to 512 MB, then it is
better by "only" a factor of 512 / 0.064 = 8000!


Not everything was better before...
Klaus

Peace worldwide - it starts here...
 
Just an update on my quest:

After looking at all your suggestions and browsing through the 4GB+ file of postcodes, which I managed to split with a software application called "CSV Huge Splitter" downloaded from SourceForge.net, it appears that there are multiple inconsistencies in the files, in that some postcodes are missing and some addresses contain inverted commas or commas. Even after trying to remove them and testing different ways to get just a small batch of postcodes to work with, I am unable to do this and, for what it's worth, it no longer seems worth the effort.

Whilst not wishing to be defeatist, I have abandoned this little project as I've already spent too much time on it, but needless to say, I'm grateful for the contributions on this thread.

Best wishes and stay safe guys.

Thank you

Steve Williams
VFP9, SP2, Windows 10
 
OP's original premise is off. There is not a 1:1 correlation between a UK postcode and an address.
A UK postcode will generally break down to the "postman's walk", similar to a US ZIP+4.
With the postcode AND a street number, you'll most likely get a hit.

I've done something similar with USPS data. This allows call centres to just ask for a ZIP and street number, and then they will confirm the full address back to the caller.

With that in mind, you'll want to preprocess that data until it is crying and you are too.

UK postal addresses are some of the most confusing on the planet.
All of the items that other contributors have brought up will help:
• Town names into a separate table, with an FK to that ID rather than the town name. The format you show would seem to indicate that the LAST used 'add' field holds the town name.
• Process each column (across ALL files) to determine the lengths that are actually required.
• Perhaps create a pipe-delimited format such as numbname|flatnumb|add01|add02|add03|add04|add05, which you unpack when displaying (see the sketch after this list).
• Determine whether the overhead of using an FPT (memo field) for the delimited address information is better. There is an overhead for loading the FPT and for the memo pointer size, but each address is only as long as it needs to be, with no waste. Since this data is read-only, you won't encounter memo bloat.
• You may want to normalize/compress some other address data ("ROAD" --> "RD", "LANE" --> "LN"), but you have to be careful you're not compressing the wrong part of an address. USPS data is very structured in that regard, but addr01-addr05 tells me that the Royal Mail isn't so structured. How do you uncompress "ST"? Is it Saint or Street? I would leave this for a much later analysis.
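A minimal sketch of the pipe pack/unpack idea (the values are made up; if some components are empty you should also check how your split function treats consecutive delimiters):

Code:
* one packed address: numbname|flatnumb|add01|add02|add03|add04|add05
lcPacked = "12|Flat 3|Example Street|District|Sometown|County|UK"

* unpack for display by splitting on the pipe character
LOCAL laPart[1]
ALINES(laPart, lcPacked, 0, "|")
? laPart[3]    && -> "Example Street"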

Without a method to compress/uncompress the address data, you're going to have to break the dataset into pages, and load those pages in and out ('paging').

In fact, the 29 files they provided are a good starting point for paging the data.
You can keep the data in 29 separate DBFs, indexed on the postcode.
For speed, you might even want to consider breaking the data down into smaller files. You'll be opening and closing tables often, so you want a smaller overhead when loading. You'll want to find the sweet spot between NEEDING to 'page' and the IMPACT of 'paging': bigger pages mean paging less often but slower load/unload times; smaller pages mean paging more often but faster load/unload times. Smaller tends to give a more consistent response.

Something like:
POSTCODE CHAR(12) PK
TOWNID INT (fk to your TOWNS table)
ADDRESS CHAR(??) or MEMO in a packed/delimited format like above
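As a sketch, one concrete shape such a page table could take (memo variant; names are only examples):

Code:
* one 'page' of the address data, e.g. POSTCODE01.DBF
CREATE TABLE POSTCODE01 (postcode C(12), townid I, address M)
INDEX ON postcode TAG postcode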

You'll want to check that no postcode straddles two files. If any do, move records between files until each postcode is found in only ONE file.

So now you have
POSTCODE01.DBF --> POSTCODE29.DBF (or whatever)

Then create your main index table (INDEX01.DBF)
POSTCODE (PK, indexed)
ADDRESSTABLE (INT)

Load the distinct postcodes from each POSTCODExx.DBF file into your index:
INSERT INTO INDEX01 (postcode, addresstable) SELECT DISTINCT postcode, 1 FROM POSTCODE01
...
INSERT INTO INDEX01 (postcode, addresstable) SELECT DISTINCT postcode, 29 FROM POSTCODE29

When searching, find the postcode in the INDEX01 table, which will tell you which DBF to open to find the addresses for that postcode
If you are using this interactively, you may want to 'remember' which file you have open, so that if the index points you to the same table again, you can skip the unload/load cycle.

So, a search runs like this (see the sketch after these points):
* If no hit in INDEX01 : Invalid Postcode
* If hit, then Load the POSTCODExx table the index tells you to, and SELECT records with matching postcodes into a cursor, where you can do further filtering by address components (house number, etc).
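A rough VFP sketch of that flow, under the assumptions above (INDEX01 is open with an index tag on postcode; the table naming scheme and the sample postcode are placeholders):

Code:
LOCAL lcPostcode, lcTable
lcPostcode = PADR("XX1 1XX", 12)

IF !SEEK(lcPostcode, "INDEX01", "postcode")
    * no hit in INDEX01: invalid postcode
    RETURN
ENDIF
lcTable = "POSTCODE" + PADL(LTRIM(STR(INDEX01.addresstable)), 2, "0")

* remember which page table is already open, to skip the unload/load cycle
IF !USED(lcTable)
    USE (lcTable) IN 0 SHARED
ENDIF
SELECT * FROM (lcTable) WHERE postcode = m.lcPostcode ;
    INTO CURSOR curHits
* curHits can now be filtered further by house number etc.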

Now, you could also add some analytics (saving of search terms) to see which postcodes are queried most often. That might allow you to move the most queried postcodes into one file to minimize the paging.
 
Nice summary, brigmar.

Steve, now I would still be interested to know what you downloaded, and from where.

The problem of commas within values is solvable, for example with Excel's better CSV interpretation.
I already knew that a postcode isn't the only determining factor; it's really just enough to know which post office the mail will be routed to. Therefore you'll have postcodes that span different cities or districts, and streets so long that part of them is nearer to one post office than another, so a street can have two postcodes, etc.

I have looked around a bit and it's hard to find a source for the data. I found a statement that data from the 2017 census should have been worked into public open data, but that's still in progress...

Before I'd go for brigmar's advice: what actually is the intention behind getting this data? Do you want to verify addresses? Or what else do you want to do with it? There are APIs for addresses that you could use more easily than trying to maintain your own address data. This isn't just a one-off task: you'll need updates, and the only way I see to do them is to again find full sources and process them, with all the difficulties and the partly manual corrections that entails.



Chriss
 