Mark Duplicate Records Based on Multiple Fields with Fields Too Large for Indexing Together

JasonCannon · Nov 6, 2019

I need to mark records in one table that are considered duplicates of another table, but the five fields that determine the duplicate are too large to put into one index. I get "Invalid Key Length" when I try.

Code:

INDEX ON ALLTRIM(ALLTRIM(UPPER(lstnm))+ALLTRIM(UPPER(lstgn))+ALLTRIM(UPPER(dirtx))+ALLTRIM(UPPER(locnm))+ALLTRIM(UPPER(state))) TAG listing

I need to mark the field "dupe" in table one with "Y" for the duplicates based on those fields.

Usually I would create the index and then run through a DO WHILE loop, but this is a first for me.

Suggestions?

"..if you give a man a fish he is hungry again in an hour. If you teach him to catch a fish you do him a good turn."
-- Anne Isabella Thackeray Ritchie.

Mike Gagnon · Nov 6, 2019

This error means:
For a standalone index (.idx) file, the length of an index key must be between 1 and 100 characters. For a compound index (.cdx) file, the length of an index key must be between 1 and 240 characters.

But it is not the length of the string but rather the field length in the expression.

If you want to get the best response to a question, please check out FAQ184-2483 first.

JasonCannon · Nov 6, 2019

How can I make that index a compound index?

"..if you give a man a fish he is hungry again in an hour. If you teach him to catch a fish you do him a good turn."
-- Anne Isabella Thackeray Ritchie.

Mike Lewis · Nov 6, 2019

Are you saying that the total length of those five fields is more than 240 (which is the maximum length of an index tag)? That seems an awful lot to use in a de-dupe test. But presumably that's what you want.

I'd also suggest that using variable-length fields (which is what you get when you use ALLTRIM()) is a bad idea.

And what collating sequence are you using? If you are not using MACHINE, that each character in the index takes two bytes.

Finally, is the second table (the one that you are indexing) particularly large? If it is not, you could always do without the indexes. The process would take longer (depending on the size of the table), but it should work the same.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads

Mike Lewis · Nov 6, 2019

How can I make that index a compound index?

It is a compound index by default.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads

JasonCannon · Nov 6, 2019

Mike Lewis said:
Are you saying that the total length of those five fields is more than 240 (which is the maximum length of an index tag)? That seems an awful lot to use in a de-dupe test. But presumably that's what you want.

- Since the index is compound by default, then yes. When I run that INDEX ON command, I get that "Invalid Key Length" error. And yes, this is a lot of fields to use but it is what I need.

Mike Lewis said:
I'd also suggest that using variable-length fields (which is what you get when you use ALLTRIM()) is a bad idea.

- Okay, what would be a better idea?

Mike Lewis said:
And what collating sequence are you using? If you are not using MACHINE, that each character in the index takes two bytes.

- MACHINE

Mike Lewis said:
Finally, is the second table (the one that you are indexing) particularly large? If it is not, you could always do without the indexes. The process would take longer (depending on the size of the table), but it should work the same.

- I will test this. Always thought you needed the index on the second table. Thanks.
- Tested and if I do not have an index on the second table, I get the error "Table has no index order set" when I SET RELATION.

"..if you give a man a fish he is hungry again in an hour. If you teach him to catch a fish you do him a good turn."
-- Anne Isabella Thackeray Ritchie.

Mike Lewis · Nov 6, 2019

I get the error "Table has no index order set" when I SET RELATION

That's right. But I'n not sure why are are using SET RELATION.

A good way of doing a check for duplicates would be something like this:

Code:

SELECT * FROM Table1 WHERE SomeField IN (SELECT SomeField FROM Table2)

where SomeField is the field you want to test for duplication. In your case, you have five such fields, so the code is a bit more complicated, but the principle would be the same.

A point about the above code is that it would work the same regardless of whether SomeField is indexed. It would just take longer (possibly a lot longer) if it isn't indexed.

In your case, you might be able to get away with something like this:

Code:

SELECT * FROM Table1 WHERE ;
  SomeField1 IN (SELECT SomeField1 FROM Table2) AND;
  SomeField2 IN (SELECT SomeField2 FROM Table2) AND;
  ....
  SomeField5 IN (SELECT SomeField5 FROM Table2)

The point here is that you can have a separate index on each of the five fields, so you won't come up against that same error.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads

Olaf Doschke · Nov 6, 2019

INDEX ON will not scan through all data and see whether the trimmed and concatenated values are exceeding the maximum length, the length estimation works based on ignoring the ALLTRTIMs.

Proof:

Code:

Create Cursor crsWideFields (cfield1 C(200), cfield2 C(200))
Insert into crsWideFields Values ('short','values')

Index on Alltrim(cfield1)+Alltrim(cfield2) tag test

Reason: VFP has to expect worst-case scenarios because an index expression will also be used on future data, not only existing.

The antidote: Use LEFT() on fields and limit the total width. Look for yourself what part of the fields really matter to detect double values. As long as you don't have records that only differ in the 100s position the detection of double values will not need the full content of field values. You can still also detect false alarms in your final scan loop.

Code:

Index on Left(cfield1,120)+Left(cfield2,120) tag test2

This will work. As Mike already said your INDEX ON Syntax already creates a tag of a CDV, your limitation is 240 Bytes. That's 240 characters in MACHINE COLLATION and only 120 characters in other collations, so setting machine collation may also help.

But indeed you can almost forget using ALLTRIM() in an index expression. It will not save space in the CDX at all, as the tag will foresee the maximum key length also for records that don't have it now. Simple reason (again): That can change with an update and you don't want to need to reorganize a CDX file just because you have no place for the change key value when the values in the fields get longer.

All in all. about the same reasoning VFP records in DBF are fixed size and fields really capable to have varying lengths when they are stored in the FPT file of a table and the DBF only has a 32bit reference to the FPT file.

Another very simple strategy is indexing on checksum values, CRC32 would suffice for many things, even though its realm of 32bit numbers isn't collision-free, you still have the final comparison of the expression, so you identify false alarms. VFP has SYS functions for that, see SYS(2007) and Sys(2017). Please read the help topic of it fully, you can explicitly choose whether CRC16 or CRC32 is computed and fields to exclude, so you can limit it to fields you wanted to index.

Bye, Olaf.

Olaf Doschke Software Engineering

https://www.doschke.name

TamarGranor · Nov 6, 2019

You don't need an index. You can do this with SQL.

Check out the "Dealing with duplicates" section in this paper:

http://tomorrowssolutionsllc.com/ConferenceSessions/Solving Common Problems with VFPs SQL.pdf

Tamar

GriffMG · Nov 7, 2019

I think the key length is a problem because of the use of alltrim()

When VFP sees the index statement it tries to assemble the index based on an empty record and it can't because the key length is actually zero in that instance.

Regards

Griff
Keep [Smile]ing

There are 10 kinds of people in the world, those who understand binary and those who don't.

I'm trying to cut down on the use of shrieks (exclamation marks), I'm told they are !good for you.

Olaf Doschke · Nov 7, 2019

No, Griff,

I see where your idea comes from:

Error 112 description said:
If you created an index key that [highlight #FCE94F]exceeds[/highlight] the following limits, create a shorter index key.

The length of an index key for an index or compiled index (.idx) must be between 1 and 100 characters. The length for a compound index (.cdx) must between 1 and 240 characters.

For certain collating sequences, or when using double-byte character sets (DBCS) characters, each character in the eExpression parameter in the INDEX command uses two characters in the index key. In these cases, the length of eExpression is limited to 120 characters.

[highlight #FCE94F]If you created an index key with length of 0, make sure the first record of the fields you base the index on contains some data.

When Visual FoxPro creates an index, it evaluates fields in the first record in the table.[/highlight]

But the last part isn't the case anymore or this would also error:

Code:

Create Cursor crsWideFields (cfield1 C(100), cfield2 C(100))
Insert into crsWideFields Values ('','')

Index on Alltrim(cfield1)+Alltrim(cfield2) tag test

A zero-length key value isn't a problem here. You only get the error once you exceed a max length value of 240 bytes. So that part of the description is correct. invalid key length only is about exceeding the max length. And as I said earlier VFP will compute a worst-case length and will not base the length it computes for index nodes by concrete values of data but by assuming a full-length value could be stored in fields involved in the index expression. So it doesn't help if no record will produce long key values.

My guess is, the help wasn't updated when this was changed, but it's easy to see such indexes are possible. The limitation is about the possibility of exceeding the maximum key length. And it's not about any current data exceeding the 240 bytes limit, it's about any possible values, and VFP here simply assumes that in the worst-case scenario ALLTRIM changes nothing, the C(N) field has length N, so concatenating with ALLTRIM() doesn't buy you the ability to have more fields, in simple string concatenation of fields VFP will assume the index length will be the sum of field lengths and so only things like using LEFT() help, as I showed earlier.

And yes, Tamar is right, you can solve that with SQL, SQL joins are not limited by expression result lengths simply because you can join by complex conditions ANDing simpler conditions, so you don't need to come up with a single index expression. When you want to join by full name, other than SET RELATION does it with a single index only, you can, of course, write ON t1.firstname=t2.firstname AND t1.lastname=?t2.lastname and those fields could even be memos.

If you want relations for other reasons, you have to think of other ways of relating data, you may rethink how your data is structured if you need such complex index expressions to relate and sort correctly at the same time or simply start using the usual way of data processing: SQL.

Bye, Olaf.

Olaf Doschke Software Engineering

https://www.doschke.name

GriffMG · Nov 7, 2019

I stand corrected, I was under the impression that the indexes were always based on the size of an empty record.

Regards

Griff
Keep [Smile]ing

There are 10 kinds of people in the world, those who understand binary and those who don't.

I'm trying to cut down on the use of shrieks (exclamation marks), I'm told they are !good for you.

Olaf Doschke · Nov 7, 2019

Yes, Griff,

the way the help text describes how a key with length 0 causes that error and how to fix that by having non-empty fields in the first record suggests that's the case.
But it has to be different for many versions already. I don't remember that I needed data in tables before indexing, creating primary and fo4reign keys was something I always did on empty DBFs since VFP6.

OK those indexes on single fields don't use an expression that has the chance to be empty.

But I also learned this the exact same way when I created indexes using trimming of fields. There's another reason this isn't really helpful. For example: ALLTRIM('1')+ALLTRIM('23') == ALTRIM('12')+ALLTRIM('3'), so a relation of two tables with records '1','23' and '12','3' will relate those records, while join condition on two key fields will not.

The usual rules you apply to your index expressions ensure they create the same length keys for any data in a record, which makes ALLTRIM() something you typically avoid. It's not forbidden, but it leads to effects you don't expect in advance when you don't think about it in detail.

There's something else happening when you are at EOF, you have something like an empty record at hand:

Code:

Create Cursor crsWideFields (cfield1 C(100) default 'test1^', cfield2 C(100) default 'test2')
? Eof()
lcField1 = cfield1
? Len(cfield1)
? Len(lcfield1)
? lcfield1==Space(100)

It may not astonish people anymore, but you're addressing a non-record here, you still can "read" values where there aren't any. I explicitly set default values, yet lcField1 still is Space(100).

I find cases for key length 0 to be a problem, try this

Code:

INDEX ON SPACE(0) TAG zero1
INDEX ON '' TAG zero1

So that end of key length still plays a role, but VFP examines the possible length of key values and what it uses for the tree structure of a tag by other means than just evaluating your index expression on record 1 or on the empty "ghost" record you have after EOF. The values of its fields are typical for the data type. A string of spaces for a char field, 0 for integer fields, and so on. You can find out yourself. But if the index on ALLTRIM(cField1)+ALLTRIM(cField2) would compute the key length on space(100) values you would, of course, also end up on the zero-length key value.

The only chance to cope with this is not in ALLTRIM(), but in really shortening your DBF fields, find out MAX(LEN(ALLTRIM(field))) and set the char field length to that. Maybe a few chars longer to have the capacity for longer values in the future. But only that way you have a chance your expression is evaluated as shorter key length.

Bye, Olaf.

Olaf Doschke Software Engineering

https://www.doschke.name

Mike Lewis · Nov 8, 2019

Olaf said:
And yes, Tamar is right, you can solve that with SQL

Jason,

Did you try the SQL I suggested in my post above (dated 6 Nov 19 19:25)?

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads

mjcmkrsr · Nov 9, 2019

Hi Jason,

Please have a look at the below code for finding duplicates (two fields are compared) - adapt it to your needs

Code:

*!*	Check for duplicates - Found records are hilighted, bold and italic

PUBLIC go_Form

go_Form = CreateObject ("frmForm")
go_Form.Visible = .T.
go_Form.Show

READ Events
CLOSE ALL
CLEAR ALL

***********************************************************************

DEFINE CLASS frmForm As Form
  Width = 450
  MinWidth = 450
  Height = 360
  MinHeight = 360
  AutoCenter = .T.
  Caption = "Identical Code2Check = Duplicate"

	ADD OBJECT grdNames as Grid WITH ;
		RecordSource = "csrAll", ColumnCount = 5, Visible = .t., Top = 12, Left = 12, Width = 426, Height = 294, Anchor = 15

	ADD OBJECT cmdCancel As CommandButton WITH;
		Width = 60, Height = 30, Left = 12, Top = 318, Caption = "Release", Anchor = 6

	PROCEDURE Load()
	
		CREATE CURSOR csrdemo (cDrawNo C(5), cBallNo C(3), cPartNo C(10))
		INSERT INTO csrdemo VALUES ('8342C','130','61225-350')
		INSERT INTO csrdemo VALUES ('8342D','130','61225-350')
		INSERT INTO csrdemo VALUES ('8342C','131','834X-131')
		INSERT INTO csrdemo VALUES ('8342D','131','834X-131')
		INSERT INTO csrdemo VALUES ('8342C','132','834X-132')
		INSERT INTO csrdemo VALUES ('8342D','132','834X-132')
		INSERT INTO csrdemo VALUES ('8342C','133','61225-362')
		INSERT INTO csrdemo VALUES ('8342D','133','61225-362')
		INSERT INTO csrdemo VALUES ('8342C','137','834X-137')
		INSERT INTO csrdemo VALUES ('8342D','137','834X-137')
		INSERT INTO csrdemo VALUES ('8342C','141','834X-141')
		INSERT INTO csrdemo VALUES ('8342D','141','834X-141')
		INSERT INTO csrdemo VALUES ('8342D','141','834X-141')
		INSERT INTO csrdemo VALUES ('8342C','150','67242-107')
		INSERT INTO csrdemo VALUES ('8342D','150','67242-107')
		INSERT INTO csrdemo VALUES ('8342C','152','8332-124')
		INSERT INTO csrdemo VALUES ('8342D','152','8332-124')
		INSERT INTO csrdemo VALUES ('8342C','153','834X-153')
		INSERT INTO csrdemo VALUES ('8342D','153','834X-153')
		INSERT INTO csrdemo VALUES ('8342D','153','834X-153')
		INSERT INTO csrdemo VALUES ('8342D','153','834X-153')
		INSERT INTO csrdemo VALUES ('8342C','154','834X-154')
		INSERT INTO csrdemo VALUES ('8342D','154','834X-154')
		
*!*	BEGIN of sql to check for duplicates		

		SELECT csrDemo.*, csrDemo.cDrawNo + csrDemo.cBallNo as cCode2Check, csrDupes.lDupes ;
			FROM csrDemo ;
				JOIN (SELECT cDrawNo, cBallNo, .T. as lDupes FROM csrDemo GROUP BY 1, 2 HAVING COUNT(cDrawNo + cBallNo) >= 2) csrDupes ;
					ON csrDemo.cDrawNo + csrDemo.cBallNo = csrDupes.cDrawNo + csrDupes.cBallNo ;
			UNION ALL ;
				SELECT csrDemo.*, csrDemo.cDrawNo + csrDemo.cBallNo as cCode2Check, csrNoDupes.lDupes ;
					FROM csrDemo ;
					JOIN (SELECT cDrawNo, cBallNo, .F. as lDupes FROM csrDemo GROUP BY 1, 2 HAVING COUNT(cDrawNo + cBallNo) < 2) csrNoDupes ;
						ON csrDemo.cDrawNo + csrDemo.cBallNo = csrNoDupes.cDrawNo + csrNoDupes.cBallNo ;
			ORDER BY 2, 1 ;
			INTO CURSOR csrAll
			
*!*	END of sql to check for duplicates

	ENDPROC
	
	PROCEDURE grdNames.Init()
		WITH ThisForm.grdNames
			.Column1.Header1.Caption = "DrawNo"
			.Column2.Header1.Caption = "BallNo"
			.Column3.Header1.Caption = "PartNo"
			.Column4.Header1.Caption = "Code2Check"
			.Column5.Header1.Caption = "Duples?"
			.SetAll("DynamicFontItalic", "csrAll.lDupes", "Column")	  
			.SetAll("DynamicFontBold", "csrAll.lDupes", "Column")	  
			.SetAll("DynamicBackColor", "ICASE(csrAll.lDupes, RGB(225,0,0), ;
											MOD(ASC(RIGHT(csrAll.cDrawNo, 1)), 2) = 0, RGB(200,200,200), ;
											RGB(254,254,254))", "Column")	  	
		ENDWITH 
	ENDPROC 
	
	PROCEDURE cmdCancel.Click()
		CLEAR Events
		ThisForm.Release

	ENDPROC
  
	PROCEDURE Destroy()
		CLEAR Events
		ThisForm.Release

	ENDPROC
ENDDEFINE

***********************************************************************

hth

MarK

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Mark Duplicate Records Based on Multiple Fields with Fields Too Large for Indexing Together

JasonCannon

IS-IT--Management

Mike Gagnon

Programmer

JasonCannon

IS-IT--Management

Mike Lewis

Programmer

Mike Lewis

Programmer

JasonCannon

IS-IT--Management

Mike Lewis

Programmer

Olaf Doschke

Programmer

TamarGranor

Programmer

GriffMG

Programmer

Olaf Doschke

Programmer

GriffMG

Programmer

Olaf Doschke

Programmer

Mike Lewis

Programmer

mjcmkrsr

Technical User

Similar threads

Part and Inventory Search

Sponsor