
Unique ID from String


newtofoxpro

Programmer
Sep 16, 2007
I want to generate a unique ID from a string. For example: SYS(2007, PADR("HI", 70))
But SYS(2007) generates the same ID for two different strings.

Thanks & Best Regards.




 
But SYS(2007) generates the same ID for two different strings.

That's very unlikely. SYS(2007) generates a checksum. It's unlikely that two random strings would have the same checksum, although it's theoretically possible. Are you saying that you are frequently seeing the same checksum from different strings? Or that you want to guarantee never to get the same checksum?

You can substantially reduce the chances of getting a duplicate checksum by passing 1 as the third parameter to the function. That will give you a 32-bit value rather than a 16-bit value. But the only way to guarantee a unique value would be to generate the checksum, then scan the existing values in the table to see if the new value already exists, and then modify it if it does.
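
For example, something along these lines (just a sketch - the table and field names are only illustrative):

* Assumes a Customer table with NameId C(10), indexed with: INDEX ON NameId TAG NameId
LOCAL lcName, lcId
lcName = "HI"
lcId = SYS(2007, PADR(lcName, 70), 0, 1)       && 32-bit checksum as a string
DO WHILE INDEXSEEK(lcId, .F., "Customer", "NameId")
   lcId = SYS(2007, lcId + "X", 0, 1)          && collision: derive a new candidate value
ENDDO
INSERT INTO Customer (NameId, Name) VALUES (lcId, lcName)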

If I have misunderstood the question, perhaps you could clarify the problem.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
What do you really want? A checksum, hash, or signature? Or a globally unique identifier (GUID)?

Checksums, hashes, and signatures are all in the same soup: they use a hash or message-digest algorithm to derive a value from a string (or a binary file, or any bytes at all), which will a) vary with even a simple change in the original bytes, and b) not be invertible - you can't get the original string or data back from a checksum or hash.

SYS(2007) uses CRC16 if you use it this way. This has one major downside: it computes only 2^16 = 65536 different checksum values, so indeed two strings might have the same checksum. You can at least expand the number of different checksums from 2^16 to 2^32. That is not twice as many, but 2^16 times as many - 65536 times as many checksum values - so you get far fewer equal checksums from different strings. Make use of the nFlags parameter: SYS(2007, stringdata, 0, 1) - this may depend on your VFP version.

Otherwise use other hash algorithms, e.g. via Craig Boyd's VFPEncryption.FLL, or use the Windows Cryptographic API.
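
For example, MD5 via the Crypto API only needs a few DECLARE - DLL statements. This is just a rough sketch of the classic recipe (no error handling; the constants are the standard CryptoAPI values):

DECLARE INTEGER CryptAcquireContext IN advapi32 ;
   INTEGER @hProv, INTEGER pszContainer, INTEGER pszProvider, ;
   INTEGER dwProvType, INTEGER dwFlags
DECLARE INTEGER CryptCreateHash IN advapi32 ;
   INTEGER hProv, INTEGER Algid, INTEGER hKey, INTEGER dwFlags, INTEGER @hHash
DECLARE INTEGER CryptHashData IN advapi32 ;
   INTEGER hHash, STRING pbData, INTEGER dwDataLen, INTEGER dwFlags
DECLARE INTEGER CryptGetHashParam IN advapi32 ;
   INTEGER hHash, INTEGER dwParam, STRING @pbData, INTEGER @pdwDataLen, INTEGER dwFlags
DECLARE INTEGER CryptDestroyHash IN advapi32 INTEGER hHash
DECLARE INTEGER CryptReleaseContext IN advapi32 INTEGER hProv, INTEGER dwFlags

LOCAL hProv, hHash, lcData, lcHash, lnLen
STORE 0 TO hProv, hHash
lcData = PADR("HI", 70)
CryptAcquireContext(@hProv, 0, 0, 1, 0xF0000000)   && PROV_RSA_FULL, CRYPT_VERIFYCONTEXT
CryptCreateHash(hProv, 0x8003, 0, 0, @hHash)       && CALG_MD5
CryptHashData(hHash, lcData, LEN(lcData), 0)
lcHash = SPACE(16)
lnLen = 16
CryptGetHashParam(hHash, 2, @lcHash, @lnLen, 0)    && HP_HASHVAL: the 16-byte digest
? "MD5:", STRCONV(lcHash, 15)                      && display the digest as hex
CryptDestroyHash(hHash)
CryptReleaseContext(hProv, 0)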
Bye, Olaf.
 

Example....

create cursor Customer (Name c(80))
index on Name tag Name

With thousands of customers, the cdx file grows. Smaller index files perform better. So..

create cursor Customer (NameId c(5), Name c(80))
repl NameId with sys(2007, Name) all
index on NameId tag NameId

This performs fast, but I fear duplicate IDs.

****************************************************


create cursor Customer (NameId c(10), Name c(80))
repl NameId with sys(2007, Name, 0, 1) all
index on NameId tag NameId

I tried the above solution and found no duplicates. But I don't know how sys(2007,"",0,1) works inside VFP.

So are you sure this would generate unique IDs?

Thank you for the fast help.
 
And by the way, if I vary your string parameter, I get varying checksums with both CRC16 and CRC32.

? SYS(2007,padr("HI",67))
60534
? SYS(2007,padr("HI",68))
28352
? SYS(2007,padr("HI",69))
26890
? SYS(2007,padr("HI",70))
54253

But if you store all these differently padded strings into a C(70) field, you get 54253 for all of them, as a C(70) field pads any value written into it to 70 characters with spaces. That may be your situation - don't forget the effect of storing data in constant-width char fields when you do SYS(2007, charfield).
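
To see that effect in isolation (a small illustration of the point above, using the values from this thread):

CREATE CURSOR crctest (cName C(70))
INSERT INTO crctest VALUES (PADR("HI", 67))
INSERT INTO crctest VALUES (PADR("HI", 68))
* The C(70) field pads both values to 70 characters,
* so both rows now yield the same checksum (54253):
SCAN
   ? SYS(2007, crctest.cName)
ENDSCAN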

You can see there are more bits in the CRC32 checksum just from its value range:
? SYS(2007,padr("HI",67),0,1)
2469707907
? SYS(2007,padr("HI",68),0,1)
2639145247
? SYS(2007,padr("HI",69),0,1)
1694076057
? SYS(2007,padr("HI",70),0,1)
1624972585

Most likely, if you got two equal checksums, the second parameter values also matched. But of course 2^32 only spans the value range of 4 bytes (an integer): about 4 billion different checksums. A text message of 1000 characters can vary far more - 2^8000 different original 1000-byte messages are possible, of course. That also makes clear you can't get a distinct hash value for every possible original string, not even with MD5 or SHA-512; they go up in value range, but never reach the number of possible original byte combinations. For that, a hash would need at least as many bytes as the original message. Still, the chance of two different strings getting the same checksum/hash drops enormously even with just a 16-byte hash like MD5, let alone a 512-bit = 64-byte hash as SHA-512 uses.
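
A rough birthday-paradox estimate shows how fast collisions become likely: with 100,000 random 32-bit checksums, at least one duplicate is already more likely than not:

* Birthday-bound approximation: P(collision) = 1 - EXP(-n*(n-1)/(2*N))
n = 100000                            && number of strings
N = 2^32                              && number of possible CRC32 values
? 1 - EXP(-n * (n - 1) / (2 * N))     && roughly 0.69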

Bye, Olaf.
 
No, Sys(2007) doesn't generate IDs, it generates checksums. For IDs, use GUIDs.

Bye, Olaf.

 
If you use a checksum, you'll always face the situation that two identical original names get the identical ID. You don't generate IDs from the data; you generate them independently of the data. Also, you generate an ID once, when inserting a record, and it should never change later - which it would if you recomputed checksums. Checksums are for checking that a message was received unchanged, not for IDs.

Bye, Olaf.
 
Why don't you simply assign a unique integer to each customer? You can use an Integer (Autoinc) datatype, which involves almost no work at all on your part. Every customer will then have a unique ID, which you can use for searching, creating joins, or whatever. That's not only an easy solution, but it's the standard approach in this situation.
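
For example (a minimal sketch; Integer (AutoInc) needs VFP 8 or later):

CREATE TABLE Customer (CustId I AUTOINC, Name C(80))
INSERT INTO Customer (Name) VALUES ("Smith")
INSERT INTO Customer (Name) VALUES ("Jones")
SELECT Customer
LIST   && CustId is 1, 2, ... with no code on your part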

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
As Mike says, you can use an integer (Autoinc) field for IDs, if you don't like GUIDs having 38 characters (including their curly braces - or 36 without).

On the other hand, autoinc fields are a bit problematic due to their read-only nature, e.g. in conjunction with CSV inserts.

But there is a solution to that, too. See my new FAQ faq184-7743

Bye, Olaf.

 
But there is a solution to that, too. See my new FAQ

Very neat, Olaf. To summarise, you are proposing a single table whose only job is to maintain an autoinc integer ID for use by any other table. You have a stored procedure which returns the next available ID. The ID field in each of the separate tables has a call to that SP as its Default constraint. And, you don't actually store the IDs in the special table, but simply rely on the autoinc count to know which is the next one to be assigned.

I've never seen that approach before, but I must say it is ingenious. I'll have to consider using it in future projects.

Mike


__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
You fully understood, yes, that's the intention.

Not storing the values is just a neat addition. More important is that the stored proc never switches work areas and only temporarily opens S_Auto.dbf - which does make it slower for bulk inserts.
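
In outline, the stored proc works roughly like this (a sketch with illustrative names, not the actual FAQ code; it relies on the autoinc counter advancing even for reverted rows):

* S_Auto.dbf contains a single Integer (AutoInc) field iID and no permanent rows.
FUNCTION NextID
   LOCAL lnID
   USE S_Auto IN 0 AGAIN SHARED ALIAS S_Auto_Temp
   CURSORSETPROP("Buffering", 3, "S_Auto_Temp")    && row buffering, so it can be reverted
   APPEND BLANK IN S_Auto_Temp
   lnID = S_Auto_Temp.iID                          && the freshly assigned autoinc value
   TABLEREVERT(.T., "S_Auto_Temp")                 && discard the row; the counter stays advanced
   USE IN S_Auto_Temp
   RETURN lnID
ENDFUNC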

You can do similar things in SQL Server and also prevent the storage of records with INSTEAD OF INSERT triggers, but I haven't attempted that so far, as it's also quite OK to use newid() as the default value of uniqueidentifier fields there.

If you like, you can make an article out of this again, for Hex Central.

Bye, Olaf.

 
...and a last word on the performance of indexes:

If, as you say, you'd do this:
>create cursor Customer (Name c(80))
>index on Name tag Name
>With thousands of customers, the cdx file grows. Smaller index files perform better. So..

For a SEEK of a certain customer by name, the normal index will be totally sufficient - don't fear a too-large CDX. If you really think a checksum index performs better: you have to create the checksum of the name you want to find (padded to field width) and then seek/locate/Rushmore-optimize on the checksum index. It will not perform significantly faster. I'd stand a test of this, I think, but you may prove me wrong - simply create sample data and see for yourself. You also burden yourself with an index you can't use, e.g., to sort by the name column, so an index on Sys(2007, Name, 0, 1) is less valuable in that respect alone than a simple index on the name column.
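
The lookup then has to go like this (a sketch using the cursor from the earlier posts):

* Seeking via the checksum index: the search value must first be padded
* to the field width, or the checksums won't match.
lcSearch = PADR("SMITH", 80)
SEEK SYS(2007, lcSearch, 0, 1) ORDER TAG NameId IN Customer
? FOUND("Customer")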

If you really think field sizes influence index searches that much, optimize the column widths against real name data. 80 characters is very long for person names, except perhaps for Arabic names.

For an insight into what SYS(2007, string, 0, 1) does: it computes CRC32 of the string, and CRC32 is well documented - as you'll see, it's not good for your purpose. I suggest you rather test with MD5 and compute that as a 16-byte binary Q(16) value. No need to have such a field; an index expression resulting in a Q(16) vartype will create a corresponding index structure.
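
For example (assuming a hypothetical MD5() function returning the raw 16-byte digest, e.g. a wrapper around VFPEncryption.FLL or the Crypto API code earlier in this thread):

* CAST() gives the index expression a fixed-width Q(16) varbinary key:
SELECT Customer
INDEX ON CAST(MD5(Name) AS Q(16)) TAG NameHash
* A lookup must compute the same expression from the padded search value:
SEEK CAST(MD5(PADR("SMITH", 80)) AS Q(16)) ORDER TAG NameHash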

Last but not least, it's a viable idea for shortening index leaf nodes. If you want unique results, though, you'd rather index on a compression of the name than on a checksum or hash - but compression has the disadvantage of being variable length and won't work well for that purpose.

Bye, Olaf.
 
I used Mockaroo to generate 100,000 random names (as a tab-delimited txt file) and appended them into a table with Firstname C(40), Lastname C(40) fields. I indexed this data with the following index expressions and got these CDX file sizes:

1. Lastname+Firstname ... 816 KB
2. Sys(2007,Lastname+Firstname,0,1) ... 697 KB
3. Val(Sys(2007,Lastname+Firstname,0,1)) ... 574 KB

I didn't test the resulting indexes for seek performance, but you can see why VFP CDX indexes are called compressed indexes: a CRC32 value has 4 bytes, yet an index on 100,000 of these values, instead of the same number of 80-byte-long Lastname+Firstname expressions, does not shrink the index to 5% - not even to 50%. Or, inversely, the large index on Lastname+Firstname only needs about 2x as much space, not 20x.

That is so because the index root nodes start with the first bytes of the expression, and even the leaf nodes of an index tree don't contain the full index expression, but the record number of where to find the data in the DBF. And this leads me to the final thought on why the normal indexes are better: you can SEEK partial expressions, e.g. seek only the lastname in Lastname+Firstname, or seek names beginning with "A"-"Z" or "Mc" or whatever you like, whereas with a checksum you need the full name before you can even compute the CRC32 to seek it - you can't seek partial names.
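
For example (assuming the name table from above is selected, with two illustrative tags: FullName on Lastname+Firstname and NameCrc on Sys(2007,Lastname+Firstname,0,1)):

* Works with the plain string index - partial keys match from the left:
SEEK "Mc" ORDER TAG FullName   && first record whose Lastname starts with "Mc"
* With the checksum index there is no equivalent; you need the complete
* padded name before you can even compute the value to seek:
SEEK SYS(2007, PADR("McDonald", 40) + PADR("John", 40), 0, 1) ORDER TAG NameCrc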

So this idea doesn't work out well here, even before starting any performance tests.

Bye, Olaf.
 
And just by the way: the DBF file is about 8 MB and the original mock-data download 1.3 MB, so indeed the file that bloats the most is the DBF, because most names don't need C(40) - Mockaroo only gave me names with at most 10 to 11 characters. Considering that, the savings of hash indexes versus normal ones are even lower: an index on Left(Lastname,11)+Left(Firstname,11) only needs 648 KB, while the hash index stays as large as before, since hash values are still 4 bytes each.

Bye, Olaf.
 
Thanks

My fault, I did not show the full story. While developing, Customer is not the only data. In addition there is a Product dbf and their huge transactions. I already have data with thousands of customers & products and millions/billions of transactions. Example: if I want to browse transactions for a customer, the transactions dbf should be indexed - but not only by customer name, i.e. index key =
1. TransactionType+CustomerName+dtos(TransactionDate) - with this index key I can find transactions by type & customer & selected period (see the sketch below).
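
For example (the tag name and field widths here are made up):

SELECT Transactions
SET ORDER TO TAG TrnKey   && TransactionType+CustomerName+dtos(TransactionDate)
* Partial seek: all transactions of one type for one customer, any date:
SEEK "SALE" + PADR("ACME LTD", 40)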

I am using it this way now. The reports have many filters on date, Cust_address, etc.

I don't know how VFP handles indexes internally. But I think smaller files perform better than bigger ones.

Anyway, with your detailed advice, I accept that sys(2007) is not a good idea for NameId.

Thank you very much
 
Hi,
The idea of having one autoinc table per project has been around since 1995. It was suggested by Whil Hentzen in his book "Programming Visual FoxPro 3".
hth
MK
 
MK,

The technical field type Integer (AutoInc) was introduced with VFP 8; what Whil Hentzen talked about was part of the VFP samples, in the newid() stored proc of newid.dbc.
But that has a record per database table, each with an individual counter. If you just want that, an autoinc field per table does it more easily and more reliably.

So the idea of a central autoinc for all tables is still something that hadn't been proposed; it just compares to the general idea of having unique keys, as you can also generate uniqueidentifiers that won't even repeat globally.

Anyway, admitted, the idea is not that revolutionary, but it has a bigger benefit than the simple autoinc: in a database with integers used as primary keys, any table will have a record with ID=1, so the value 1 can be used in any foreign key field. That's not a problem in itself, as the foreign key field and its constraint also specify explicitly which parent table an ID references. But you can easily copy some primary key value into the wrong foreign key field without any warning from referential integrity. If your IDs are unique not only per table but across the whole database, you will detect such flaws. This error happened to me quite easily by appending data, which skips memo fields and thus can insert data "shifted". Manually copying keys is another scenario, and there are many more.

Bye, Olaf.
 
newtofoxpro,

>My fault, I did not show the full story.

Yes, with your example of the index usage the general idea became clearer. You should still avoid the term ID just because you want the expressions to be unique. Uniqueness is of course a major attribute of record IDs = identifiers, but you actually want to index a checksum of a name instead of the name, for a shorter index - and the name column might contain duplicate names. Even if not, it's not that you compute the expression to use it as a primary key.

As shown, VFP has ways to shorten the index enough to skip this step, and CRC16 is surely not sufficient; CRC32 will most probably also not be unique for millions of original values. I'd really skip this idea, even if your original data, concatenated, is longer than C(80). VFP does a good job as long as the index expression doesn't exceed the 240-byte limit; with any collation sequence this means only 120 chars. Also see the "Create Small Indexes Using BINTOC()" sample in the Help: it's not a shortener for string data, but it's better than concatenating several integer ID values with STR(id1)+STR(id2)+STR(id3).
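
For example, for three integer keys:

* STR() needs 10 characters per integer, BINTOC() only 4 bytes by default:
INDEX ON BINTOC(id1) + BINTOC(id2) + BINTOC(id3) TAG CompKey   && 12-byte key
* versus the 30-byte key of:
* INDEX ON STR(id1) + STR(id2) + STR(id3) TAG CompKey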

Indexing a checksum, you a) don't gain much space or performance and b) risk indexing different values identically. What you can always do is double-check that you found the right data: first look up a hash value, then verify that the detail data matches what you used to compute the hash. In the end it really might be simpler to seek a key value - a real key, not just a hash. Not a key you generate from the data, but a unique key you generate via autoinc or GUID and attach to the group of records, or in a central n:m table.

Bye, Olaf.
 