Hi, I am working on a program to compare entries in a large data-set. This is for a national statistics department of a Scandinavian country.
I'll describe the problem, then talk about two approaches I tried.
Consider 2 entrants, A and B.
A is observed at location 1 3 5 8
and
B is observed at location 2 5 8 10
This means, A and B are comparable at two locations, namely 5 and 8.
I have some simple function which just takes the scores they get at two locations and gives me the appropriate comparison. Func(A5,A8,B5,B8) for example.
If it had been:
A is observed at location 1 3 5
and
B is observed at location 2 5 10
We get Func(A5,B5)
and if there is no overlap, trivially, zero.
Note that A and B are sorted, and, the locations are also sorted. So I have a long array with 3 columns, Column 1 is the names of the entrants, column 2 is the locations, and column 3 is the scores of the entrants at each location.
Now, the problem is to easily obtain the comparisons Func(.) for some purpose. One way is to precompute everything, and store it. This doesn't make sense, as I have 10million entrants. Computing all the comparisons would generate a matrix that contains a maximum of 10million^2 terms. This is just way too large.
I tried to recompute it each time instead, but ran into the problem of an intersection problem. For each A,B which I want to compare, I need to find the locations where they intersect and recover those locations. What would be the way forward?
I tried to sort them, then it is easy to determine if any A and B are not comparable. This is true when the lowest location of A is higher than the highest location of B, or vice versa, but when this is not true, I can't think of another way besides to compare each entry. Just start at the lowest numbered locations on both, and look for coincidences as we move up. This is just too slow.
The locations are invariant to renumbers. I was wondering if renaming the locations so that the overlap is minimal would help, but it is not entirely clear if it would.
Also, I was wondering if a hashtable would help, but am not sure exactly how it would help me overcome any problems.
Thanks so much. It is a long description.
I'll describe the problem, then talk about two approaches I tried.
Consider 2 entrants, A and B.
A is observed at location 1 3 5 8
and
B is observed at location 2 5 8 10
This means, A and B are comparable at two locations, namely 5 and 8.
I have some simple function which just takes the scores they get at two locations and gives me the appropriate comparison. Func(A5,A8,B5,B8) for example.
If it had been:
A is observed at location 1 3 5
and
B is observed at location 2 5 10
We get Func(A5,B5)
and if there is no overlap, trivially, zero.
Note that A and B are sorted, and, the locations are also sorted. So I have a long array with 3 columns, Column 1 is the names of the entrants, column 2 is the locations, and column 3 is the scores of the entrants at each location.
Now, the problem is to easily obtain the comparisons Func(.) for some purpose. One way is to precompute everything, and store it. This doesn't make sense, as I have 10million entrants. Computing all the comparisons would generate a matrix that contains a maximum of 10million^2 terms. This is just way too large.
I tried to recompute it each time instead, but ran into the problem of an intersection problem. For each A,B which I want to compare, I need to find the locations where they intersect and recover those locations. What would be the way forward?
I tried to sort them, then it is easy to determine if any A and B are not comparable. This is true when the lowest location of A is higher than the highest location of B, or vice versa, but when this is not true, I can't think of another way besides to compare each entry. Just start at the lowest numbered locations on both, and look for coincidences as we move up. This is just too slow.
The locations are invariant to renumbers. I was wondering if renaming the locations so that the overlap is minimal would help, but it is not entirely clear if it would.
Also, I was wondering if a hashtable would help, but am not sure exactly how it would help me overcome any problems.
Thanks so much. It is a long description.