Count Matches in Array (with duplicates)

alphacooler · Feb 5, 2007

I have two arrays like the following:

$array1 = array("red", "red", "blue", "green", "blue");
$array2 = array("blue", "blue", "red", "red");

I want to count all of the matches between the two arrays. However, counting the result of array_intersect does not work because each array has duplicates, which I want kept and counted as matches as many times as they show up.

I am analyzing two users' tags to compare their similarity.

I'm scratching my head on this one. I'm sure somebody smarter than I has an answer.
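
For reference, one possible reading of "count matches with duplicates" is a multiset intersection: each tag counts as many times as it appears in both arrays, so the match count per tag is the smaller of its two frequencies. The sketch below assumes that reading; the function name count_multiset_matches is hypothetical and does not come from the thread.

```php
<?php
// Hypothetical helper (not from the thread): counts matches between two
// arrays as a multiset intersection, so a tag that appears n times in one
// array and m times in the other contributes min(n, m) matches.
function count_multiset_matches(array $a, array $b): int
{
    $countsA = array_count_values($a);
    $countsB = array_count_values($b);

    $matches = 0;
    foreach ($countsA as $value => $n) {
        if (isset($countsB[$value])) {
            $matches += min($n, $countsB[$value]);
        }
    }
    return $matches;
}

$array1 = ["red", "red", "blue", "green", "blue"];
$array2 = ["blue", "blue", "red", "red"];

echo count_multiset_matches($array1, $array2), "\n"; // prints 4
```

With the arrays from the question this gives 4: two reds and two blues pair up, and green has no partner.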

vacunita · Feb 5, 2007

Just to make sure I understand you: you'd want your output to be something like this?

red:2
red:2
blue:2
green:0
blue:2

----------------------------------
Ignorance is not necessarily Bliss, case in point:
Unknown has caused an Unknown Error on Unknown and must be shutdown to prevent damage to Unknown.

alphacooler · Feb 5, 2007

I didn't say what the output array should look like because I am not sure.

The problem with your suggested output array is that if I try to sum it and then normalize by the total possible matches, I end up with less than a 50% match between the two arrays. But array2 has 100% of its tags included in array1; therefore (in the sense of a comparison between two users), the match should be 100%. I am not sure if I am explaining that sufficiently. I think it makes more sense if you imagine trying to use these arrays to compare color compatibility between two users.
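
To make the arithmetic concrete, here is a small sketch of that calculation. It assumes "total possible matches" means count($array1) * count($array2), which is how a later post in this thread defines the divisor; the variable names are illustrative only.

```php
<?php
$array1 = ["red", "red", "blue", "green", "blue"];
$array2 = ["blue", "blue", "red", "red"];

// For each element of $array1, look up how often it occurs in $array2
// (this reproduces the red:2, red:2, blue:2, green:0, blue:2 listing).
$counts2 = array_count_values($array2);
$sum = 0;
foreach ($array1 as $tag) {
    $sum += $counts2[$tag] ?? 0;
}

// Assumption: normalise by the number of possible pairings.
$possible = count($array1) * count($array2);
$percent = $sum * 100 / $possible;

echo "$sum / $possible = $percent%\n"; // prints 8 / 20 = 40%
```

The sum is 8 out of 20 possible pairings, i.e. 40%, even though every tag in array2 also occurs in array1.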

jpadie · Feb 5, 2007

If I think about it the way that you suggest, I'd come up with the same type of output that vacunita did!

Off the top of my head, is this the kind of comparison you are looking for?

Code:

<?php
// For each element of $needle, report how often it occurs in $haystack.
function array_match($needle, $haystack) {
	$cntArray = array_count_values($haystack);
	$matches = array(); // initialise so an empty $needle still returns an array
	foreach ($needle as $pin) {
		if (isset($cntArray[$pin])) {
			$matches[] = array($pin, $cntArray[$pin]);
		} else {
			$matches[] = array($pin, "no match");
		}
	}
	return $matches;
}

$array1 = array("red", "red", "blue", "green", "blue");
$array2 = array("blue", "blue", "red", "red", "yellow");

echo "<pre>";
print_r(array_match($array2, $array1));
?>

this would output

Code:

Array
(
    [0] => Array
        (
            [0] => blue
            [1] => 2
        )

    [1] => Array
        (
            [0] => blue
            [1] => 2
        )

    [2] => Array
        (
            [0] => red
            [1] => 2
        )

    [3] => Array
        (
            [0] => red
            [1] => 2
        )

    [4] => Array
        (
            [0] => yellow
            [1] => no match
        )

)

vacunita · Feb 5, 2007

So what you're looking for would be to see whether array2 exists completely in array1.

Code:

$array1 = array("red", "red", "blue", "green", "blue");
$array2 = array("blue", "blue", "red", "red");
$mycounter = 0;

// Count each element of $array2 that occurs at least once in $array1.
for ($i = 0; $i < count($array2); $i++) {
	if (in_array($array2[$i], $array1)) {
		$mycounter++;
	}
}
echo $mycounter . " Percent Match= " . ($mycounter / count($array2) * 100) . "%";

This will give me a result of 100%. However, if I were to add "yellow" to array2, I would get an 80% match.


alphacooler · Feb 5, 2007

I originally had the same function and resulting array, but I ran into a problem assigning meaning to the result. If I sum up the matches and then divide by the total possible matches to normalize (count(array1)*count(array2)), I have a hard time with interpretation.

For instance if one array has 1000 tags and another has 10, then your divisor is 10,000. Given the tag set I am working with I always get matches under 1%. This skew in the data makes interpretation difficult.

What I like about this method is that it accounts for multiple tag matches (versus a normal array_intersect approach) which is important in detecting patterns.

Maybe I am just plain missing something and you need to spell it out for me.

alphacooler · Feb 5, 2007

My last post was addressed to jpadie.

vacunita, your last post would neglect multiple tag matches, which are going to be key in detecting patterns of what people like. But I think the biggest problem with that approach is that for any given subject there is a finite set of possible tags, and over time a user's tag set will approach the limit (a full 100% match).

I think this is why it is key to allow for weights afforded by counting multiple tags.

vacunita · Feb 5, 2007

My approach will not ignore multiple matches. It will check every single key in array2 for its existence in array1. It really doesn't matter if it exists more than once in array1, because as long as it exists at least once in array1 you get a match.

And I'm not dividing by the number of keys in array1, but by the number of keys in array2.

Say your array1 looks like this:
red,blue,green,red,red,blue,yellow.

and array2 is:
red,yellow.
As far as you are concerned, it's a 100% match because both keys exist in array1.

This is all based on your post that states that as long as every key in array2 exists in array1, it's a 100% match.
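
The presence check described in this post can be sketched as a small function. The name presence_match_percent and the handling of an empty array are assumptions made for illustration; the logic is the in_array test divided by the size of array2.

```php
<?php
// Hypothetical helper wrapping the presence check described above: the
// percentage of $needles elements that occur at least once in $haystack.
function presence_match_percent(array $needles, array $haystack): float
{
    if (count($needles) === 0) {
        return 0.0; // assumption: an empty set matches nothing
    }
    $found = 0;
    foreach ($needles as $tag) {
        if (in_array($tag, $haystack, true)) {
            $found++;
        }
    }
    return $found / count($needles) * 100;
}

$array1 = ["red", "blue", "green", "red", "red", "blue", "yellow"];
$array2 = ["red", "yellow"];

echo presence_match_percent($array2, $array1), "\n";           // prints 100
echo presence_match_percent(["red", "purple"], $array1), "\n"; // prints 50
```

Because only presence is tested, the three reds in array1 weigh the same as a single red would.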


alphacooler · Feb 5, 2007

Yes, I see exactly what you mean vacunita, but my statement in post #3 is incorrect because of what I mentioned in my last post. Over time all users will asymptotically approach 100% matches due to the nature of "tagging" for categorization.

I apologize for not telling you to disregard that post.

Perhaps a better way to approach the problem would be to ask, "What would you do to compare compatibility between two users given their respective tag sets?" To give you a bit of background, these tag sets have on average about 150 unique tags, with a lot of duplication.

jpadie · Feb 5, 2007

Does the position of a tag in the array have any significance?

alphacooler · Feb 5, 2007

None. The only differentiating factor is the frequency of a tag within a user's set.

vacunita · Feb 5, 2007

So you're saying that the closer two users are in terms of the types of tags and the frequency of those tags, the better the match, correct?

User1 with [red,green,red,red,blue,red]
is a better match to:
User2 with [red,red,red,blue,red]
than to:
User3 with [red,red,green,green]

because User2 has the same number of reds and blues that User1 has, whereas User3 only has 2 reds, correct?
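
One way to put a number on this kind of frequency-based comparison is cosine similarity between the tag-count vectors. This technique is not proposed anywhere in the thread; it is offered here only as a sketch, and the function name tag_cosine_similarity is hypothetical.

```php
<?php
// Hypothetical helper (technique not taken from the thread): cosine
// similarity between the tag-frequency vectors of two tag lists.
// The result lies between 0 (no shared tags) and 1 (same proportions).
function tag_cosine_similarity(array $tagsA, array $tagsB): float
{
    $a = array_count_values($tagsA);
    $b = array_count_values($tagsB);

    // Dot product over the tags that appear in both lists.
    $dot = 0;
    foreach ($a as $tag => $n) {
        $dot += $n * ($b[$tag] ?? 0);
    }

    // Euclidean length of each frequency vector.
    $square = function ($n) { return $n * $n; };
    $normA = sqrt(array_sum(array_map($square, $a)));
    $normB = sqrt(array_sum(array_map($square, $b)));

    if ($normA == 0 || $normB == 0) {
        return 0.0; // assumption: an empty tag list is similar to nothing
    }
    return $dot / ($normA * $normB);
}

$user1 = ["red", "green", "red", "red", "blue", "red"];
$user2 = ["red", "red", "red", "blue", "red"];
$user3 = ["red", "red", "green", "green"];

echo round(tag_cosine_similarity($user1, $user2), 3), "\n"; // prints 0.972
echo round(tag_cosine_similarity($user1, $user3), 3), "\n"; // prints 0.833
```

Under this measure User2 scores higher against User1 than User3 does, which agrees with the ranking suggested in the post. The position of a tag in the array plays no role, and the result does not depend on how many tags each user has in total, only on the proportions.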


jpadie · Feb 5, 2007

I think this is rapidly branching into a study in statistical analysis, which is fascinating, albeit slightly off topic. So that we can bring it back on topic and contribute sensibly, can you, alphacooler, give us some background to this problem, with the data points that you are capturing and the reason for the analysis between the two sets of tags?

Without the position of the tags having a meaning, I can't currently see how a meaningful comparison between two structurally variant sets of information can be done. I'm sure it will all become clear with explanation, however!

vacunita · Feb 6, 2007

I agree with jpadie; maybe a bit more background can help us understand this better.


alphacooler · Feb 6, 2007

I see your point. Tags are used to describe a wide array of user submitted articles. Over time a user's tagset will start to fill with tags as they view more articles. Duplication of certain tags (after normalizing for tags that are more likely than others) means that a user is more interested in articles pertaining to those particular tags.

So my reasoning was that I could compare matches between raw tagsets to get a rough approximation of similarity in article tastes between two users.

Perhaps the fundamental reasoning is flawed?

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!
