large array

tzuohann · Apr 18, 2013

Hi, I am working on a program to compare entries in a large data-set. This is for a national statistics department of a Scandinavian country.
I'll describe the problem, then talk about two approaches I tried.

Consider 2 entrants, A and B.

A is observed at location 1 3 5 8
and
B is observed at location 2 5 8 10

This means, A and B are comparable at two locations, namely 5 and 8.
I have some simple function which just takes the scores they get at two locations and gives me the appropriate comparison. Func(A5,A8,B5,B8) for example.

If it had been:
A is observed at location 1 3 5
and
B is observed at location 2 5 10
We get Func(A5,B5)

and if there is no overlap, trivially, zero.

Note that A and B are sorted, and, the locations are also sorted. So I have a long array with 3 columns, Column 1 is the names of the entrants, column 2 is the locations, and column 3 is the scores of the entrants at each location.

Now, the problem is to easily obtain the comparisons Func(.) for some purpose. One way is to precompute everything, and store it. This doesn't make sense, as I have 10million entrants. Computing all the comparisons would generate a matrix that contains a maximum of 10million^2 terms. This is just way too large.

I tried to recompute it each time instead, but ran into the problem of an intersection problem. For each A,B which I want to compare, I need to find the locations where they intersect and recover those locations. What would be the way forward?

I tried to sort them, then it is easy to determine if any A and B are not comparable. This is true when the lowest location of A is higher than the highest location of B, or vice versa, but when this is not true, I can't think of another way besides to compare each entry. Just start at the lowest numbered locations on both, and look for coincidences as we move up. This is just too slow.

The locations are invariant to renumbers. I was wondering if renaming the locations so that the overlap is minimal would help, but it is not entirely clear if it would.
Also, I was wondering if a hashtable would help, but am not sure exactly how it would help me overcome any problems.

Thanks so much. It is a long description.

salgerman · Apr 18, 2013

Is this in anyway similar to the "comparing arrays" thread, two threads down? Also, see my posting using "forall"...should be faster and lend itself to parallel processing...as much explanation as you have given, I am not 100% certain I understand what you want to do or what you mean with Func(A5,A8,B5,B8) or Func(A5,B5)...or maybe this is part of the problem...that you are going about it in some odd way (no offense) ?

tzuohann · Apr 18, 2013

Thanks for trying. Its different. Let me try again to explain. I really do not think I can improve it further, but I would like to try before throwing hardware at it.

Def some function Fn(INPUTS) as a function. It is completely not important. For now, suppose I give it some numbers and it gives me an output.
Suppose there are 4 entrants. A, B, C, D. And they get scores at different locations. We want to compare the scores with some statistical method. That was what Fn was for.

I arrange their results in 3 columns.
Col 1: The entrant {A,B,C}
Col 2: Location {An integer to represent the location}
Col 3: Their score at each location. {a1 would be A's score at location 1}

For example:

A 1 a1
A 2 a2
A 6 a6
B 1 b1
B 2 b2
C 1 c1
D 5 d5

One way to summarize all that information is in a table. T(x,y) is what we get when we compare x to y.
So we get the table entries:

T(A,B) = Fn(a1,a2,b1,b2) because A and B have scores from locations 1 and 2.
T(A,C) = Fn(a1,c1) (A and C have scores from location 1)
T(A,D) = 0 (A and D do not have any locations that intersect)
T(B,C) = Fn(b1,c1) (B and C have scores from location 1)
T(B,D) = 0 (B and D do not have any locations that intersect)
T(C,D) = 0 (C and D do not have any locations that intersect)

T(x,x) = 0 because we do not compare x to itself. i.e. we don't compare A to A.
T(x,y) = inv(T(y,x)) so the information is redundant.

Basically, for 4 entrants, A,B,C,D I can summarize all the information in 4*3/2 = 6 numbers. The only problem is, instead of 4, I have 4 million. I can't save a 8 trillion array anywhere without special data. Also, there are roughly 200000 unique locations. Each entrant can be seen at anywhere from 1 to 20 locations.

The reason why I wanted to summarize the data was the entries of the table is used many times over and over again for some optimization routine. Since Fn(.) is a very simple function, I thought I would just recompute Fn(.) as and when I needed it. So, I sorted the information (as you can see above) and then indexed the start location of every entrant. Information for A starts at row 1 and ends at row 3
Information for B starts at row 4 and ends at row 5

A,B,C,D are just integers, and so are the locations. So it is very easy to index them.

So, the problem boils down to, given two entrants to compare, how to
a) Tell if there is an intersection in locations i.e. given A, D, there is no intersection so Fn evaluates to zero.
b) If there is an intersection, what are the locations that intersect. i.e., given B,C, there is an intersection, and I should evaluate F(b1,c1)

So far, what I do is feed the locations of the two entrants I want to compare, LocX and LocY are arrays into this algorithm: Please ignore syntax.

If (min(LocX) > max(LocY) .OR. max(LocX) < min(LocY) then
!FOR SURE THERE IS NO INTERSECTION BECAUSE LOCX and LOCY ARE SORTED
else
!THERE MIGHT BE AN INTERSECTION
Initialize:
i = 1
j = 1
while (i and j have not hit the maximum number of locations for either)
if LocX(i) .lt. LocY(j)
i = i + 1
elseif LocX(i) .gt. LocY(j)
j = j + 1
else
Compute Fn each time an intersection is found.
i = i + 1
j = j + 1
end if

That is the basic idea:

I also quit the loop if LocX(i) > max(LocY) or LocY(j) > max(LocX) to speed things up a tiny bit more, but there are no significant gains.

Is there a smarter way?
I was thinking of renumbering the locations so that there are minimal intersections, but it is not clear to me how it would help.
How about come clever way of hashing?

Its a bit complicated, and if you actually read up to here, thank you so much. I owe you a homebrew. I would mail it out.

gummibaer · Apr 19, 2013

Just trying to understand the problem:

Do your entrants score on any location only once or is there a chance that they get two or more scores for one location like:

A 1 a1
A 2 a2
A 2 a3
A 5 a4

Norbert

The optimist believes we live in the best of all possible worlds - the pessimist fears this might be true.

mikrom · Apr 19, 2013

And what is number of locations you have?

gummibaer · Apr 19, 2013

Other questions while thinking about your problem:

Do you only use intersections of two entrants not three or four etc.?
What if two entrants intersect at more than two locations? Are all these intersections used in one Fn only?
Do you need the information which entrant did the score (your definition of Fn would indicate it only needs the scores, not the identity of the entrants)?
Are your data sorted?

Norbert

The optimist believes we live in the best of all possible worlds - the pessimist fears this might be true.

tzuohann · Apr 19, 2013

Do your entrants score on any location only once or is there a chance that they get two or more scores for one location like:
One score per location, if any.

And what is number of locations you have?
The grand total will be roughly 200,000.
However, each entrant will have a maximum of 15 (the average will be 3-5).
The locations and entrants are just IDs, so the numbers don't mean anything. I was wondering if there was some clever renumbering scheme.

Do you only use intersections of two entrants not three or four etc.?
Yes. I will only compare 2 entrants. So, the problem of finding intersections of 3 or more entrants does not arise.

What if two entrants intersect at more than two locations? Are all these intersections used in one Fn only?
Yes. If they intersect at N locations, then N pairs of scores go into Fn. Fn is just a very simple function (e.g., mean score of 1 minus mean score of another)

Do you need the information which entrant did the score (your definition of Fn would indicate it only needs the scores, not the identity of the entrants)?
This is also correct. The IDs do not contain any information. The only important thing is:
a) Are two entrants comparable. (Do any locations intersect)
b) If yes, what is the comparison.

Are your data sorted?
Yes. The entrants and the locations can be renumbered. So assume that the data is presented sorted by the entrants, then by locations.
This was the basis behind the algorithm I used to figure out the intersections asap.

I am actually using the information in a specific way. Let me explain more. This is to find the best ranking of entrants. Suppose entrants are ranked B,C,A:
The quality of this ranking is the sum of:

T(B,C) (B is ranked above C)
T(B,A) (B is ranked above A)
T(C,A) (C is ranked above A)

Suppose I now propose a different ranking:
A,B,C

The quality of this ranking would be:
T(A,B) (A is ranked above B)
T(A,C) (A is ranked above C)
T(B,C) (B is ranked above C)

Now of course, the solution would entail enumerating all ranking possibilities and computing the quality. This is impossible. Its actually proven.

So the approximation algorithm (which has theoretical backing) is to have a good initial guess of the ranking, e.g:

B,C,A

then, consider changing the ranking of any single entrant: B,C,A --> C,B,A this way, computing the change in quality is easy.

So, the algorithm simply proposes considering all single entrant reordering until there is no improvement in the score. There are some heuristics which we can relegate to details, but that is the core of it.

Thanks again for the effort.
Tzuo

tzuohann · Apr 19, 2013

Sorry this sentence might be confusing:

So, the algorithm simply proposes considering all single entrant reordering until there is no improvement in the score.

Replace it with

So, the algorithm simply proposes considering all single entrant reordering until there is no improvement in the quality of the ranking.

mikrom · Apr 19, 2013

I thing, to do the work you will need a suitable datastructure: associative array (aka hash, dictionary, map)
For example if you have this data:

Code:

A 1 a1
A 2 a2
A 6 a6
B 1 b1
B 2 b2
C 1 c1
D 5 d5

you could organize it within a hash of hashes based on entrants and locations:

Code:

{A => {1 => a1, 2 => a2, 6 => a6}, B => {1 => b1, 2 => b2}, C => {1 => c1}, D => {5 => d5}}

or based on locations and entrants

Code:

{1 => {A => a1, B => b1, C => c1}, 2 => {A => a2, B => b2}, 5 => {D => d5}, 6 => {A => a6}}

Unfortunatelly, Fortran doesn't provide this datastructure.

tzuohann · Apr 19, 2013

Hi Mikrom,

The data is all numeric. I was just using A for the names of entrants, numbers for the locations and b1 for the scores to illustrate.

All the names are integer(4), all the locations are integer(4) and the scores are real(8).

Further more, the location and names are just IDs, which means I can change Name = 1 to Name = 1000 and not affect the problem so long there is not another already called 1000.
Likewise for the locations.

salgerman · Apr 19, 2013

Well, I had the time to calmly read and I think I finally understood the goal, here.

It seems to me, you already have a pretty good algorithm.

Trying to think of something totally different, here, but I am not being successful...I would relate what I have thought of...maybe it will trigger something else in your head.

I guess the logic is not the problem, the size is.

If the problem was a lot smaller, say, 50 possible location IDs, one could simply create a logical array of length 50 for each entrant, where each index corresponds to each location and set to .true. if the entrant has visited such location. Then, AND-ing two entrants at a time readily yields intersection, where such operation could be done quickly and for the entire array in one shot; the .true.-valued indexes can then be used to retrieve scores from companion score arrays...so, you would also need 50-long score arrays for each entrant.

This true/false thing lend itself to using bits, instead of byte-long logical entries. So, the problem above could be made to use less storage if instead of 50-long logical arrays, one would use a single 50-bit-long integer to store the same information.

...the problem is that you have 200,000 locations! We could try to use 200,000 bits, like this
INTEGER*4 A(6250)
where each entry records visitation to 32 locations at a time.

I am not sure if all this would help and whether is worth the effort; it seems too wasteful...I don't know...and you would need to do bit-wise stuff.

On a totally different thing and along the lines of quickly discriminating non-overlapping entrants...have you thought of grouping entrants or grouping the locations they visit? I mean, 200,000 locations and each entrant can have up to 15? It seems to me that if you sort the locations based on geographic location (zip codes), you will find that most entrants only visit locations within the same or neighboring zip codes...earlier you were talking about using different numbering system to make things easier...maybe something with the zip code make ease things...just an idea...maybe you can add a layer...before you get to evaluate specific location, you can first find out if the entrant has even visited such zip code...I guess this is similar to your short-circuit where you inspect to see if the least location of one entrant is already greater than the max location of the other entrant...same thing, I guess.

...bla, bla, bla...

Germán

tzuohann · Apr 19, 2013

German,

I actually had something quite similar to what you proposed with the .AND. but found that .ANDing. super long arrays wasn't very helpful.

However, what you said regarding the geographic location was brilliant. Wonder why I hadn't thought of it. I can separate things by other properties (like industry). I think in this way, a lot of locations will be quickly excluded! Come to think of it, that is brilliant. Right now I am working on simulations where this location thing doesn't matter, but for the real thing, it certainly would. Thanks so much for the idea!

Tzuo

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

large array

tzuohann

Programmer

salgerman

Programmer

tzuohann

Programmer

gummibaer

Programmer

mikrom

Programmer

gummibaer

Programmer

tzuohann

Programmer

tzuohann

Programmer

mikrom

Programmer

tzuohann

Programmer

salgerman

Programmer

tzuohann

Programmer

Similar threads

Part and Inventory Search

Sponsor