algorithm needed


satellite03
Dec 26, 2003

hi, say i have a sequence like below

10 12 10 10 12 14 125 0 -1 -2 14 15 12 11 3 11 ........ (in a file)

i want to store all the elements in an array...but i want to store
each element only once (no repetition) and also how many times
each element is repeating.

that is, my array will have all distinct elements.

for example, i want the output for my above sequence to look like this

array[0] = 10  repetition = 3
array[1] = 12  repetition = 3
array[2] = 14  repetition = 2
array[3] = 125 repetition = 1

..............like this

how can i do it?

i wrote some code....but it's not fulfilling my desire
Code:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	FILE *ptr1, *ptr2;
	int repetation[50], number[50];
	int i = 0;
	int x, y;
	long int k = 0;

	ptr1 = fopen("d:\\datatest.txt", "r");
	if (ptr1 == NULL) printf("can not open");
	ptr2 = fopen("d:\\datatest.txt", "r");
	if (ptr2 == NULL) printf("can not open");

	do
	{
		fscanf(ptr1, "%d", &x);
		int count = 0;
		/* scan the whole file again, counting how often x occurs */
		do
		{
			fscanf(ptr2, "%d", &y);
			if (x == y)
				count++;
		} while (!feof(ptr2));
		repetation[k] = count;
		k = k + 1;
		number[i] = x;
		i = i + 1;
		rewind(ptr2);
	} while (!feof(ptr1));

	for (int m = 0; m < k; m++)
		printf("%d   %d\n", number[m], repetation[m]);

	fclose(ptr1);
	fclose(ptr2);
	return 0;
}


getting output :
________________

10   3
 12   3
 10   3  
 10   3  
 12   3
 14   2
 125   1
 0   1
 -1   1
 -2   1
 14   2
 15   1
 12   3
 11   2
 3   1
 11   2

"d:\lcc2\lcc\adf.exe"
Return code -1
Execution time 0.032 seconds
Press any key to continue...

although i am getting output which shows which element is repeating how many times..but they are shown multiple times. here 10   3 has come three times but i want it only once.

my expected output:
------------------
 10   3
 12   3
 14   2
 125   1
 0   1
 -1   1     
 -2   1   
 14   2
 15   1
 11   2
 3   1
 .....

i am not able to get a good algo...how can i do it? i have a large amount of data..
 
some hints:
- you need to open the file only once
- while reading the numbers in a loop, keep a list of the numbers you have read already
- use a nested loop to check the current number against each number in the list
- if the number is already in the list, increase the corresponding count by 1

so you don't need 2 file pointers (or 2 fopen calls). you still need 2 loops, the outer one for the input (as you have it now) and the inner one for your already-read list; see the sketch below.
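
A minimal sketch of that idea (the d:\\datatest.txt path is taken from your post; the 5000-element limit on distinct values is only an assumed bound for illustration):
Code:
#include <stdio.h>

int main(void)
{
	FILE *fp = fopen("d:\\datatest.txt", "r");   /* same path as in your code */
	int number[5000], count[5000];               /* assumed bound on distinct values */
	int n = 0, x, j;

	if (fp == NULL) { printf("can not open\n"); return 1; }

	while (fscanf(fp, "%d", &x) == 1)            /* outer loop: one pass over the file */
	{
		for (j = 0; j < n; j++)              /* inner loop: search the already-read list */
			if (number[j] == x)
				break;
		if (j < n)
			count[j]++;                  /* seen before: just bump its counter */
		else if (n < 5000)
		{
			number[n] = x;               /* new value: append it to the list */
			count[n] = 1;
			n++;
		}
	}
	fclose(fp);

	for (j = 0; j < n; j++)
		printf("%d   %d\n", number[j], count[j]);
	return 0;
}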

hope you can proceed from here.
--
ranga
 
What is the expected range of numbers in the file?

--
 
Two suggestions:

1. While loading, store numbers in sorted array (or list) and use binary-search to find duplicates. It will reduce time significantly.
or
2. If the range is not too high, use a separate bit-string structure for fast lookup
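
For illustration, a rough sketch of suggestion 1 with parallel sorted arrays (MAXDISTINCT, find_pos and add_value are hypothetical names, not part of the suggestion):
Code:
#include <string.h>

#define MAXDISTINCT 5000   /* assumed bound on distinct values */

int number[MAXDISTINCT];
int count[MAXDISTINCT];
int n = 0;

/* binary search: return the position of x, or the insertion point if absent */
static int find_pos(int x)
{
	int lo = 0, hi = n;
	while (lo < hi)
	{
		int mid = (lo + hi) / 2;
		if (number[mid] < x)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

void add_value(int x)
{
	int pos = find_pos(x);
	if (pos < n && number[pos] == x)
	{
		count[pos]++;    /* duplicate: just bump the counter */
	}
	else if (n < MAXDISTINCT)
	{
		/* shift the tail up one slot and insert x in sorted order */
		memmove(&number[pos + 1], &number[pos], (n - pos) * sizeof number[0]);
		memmove(&count[pos + 1], &count[pos], (n - pos) * sizeof count[0]);
		number[pos] = x;
		count[pos] = 1;
		n++;
	}
}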
 
salem
-------
file contains around 272145 integers.....

bNasty
-------
i have already used qsort() from stdlib.h and, simply using a while loop, calculated the frequency for small data..but for large data i think it is not a good algo.

>>>>"While loading, store numbers in sorted array (or list) and use binary-search to find duplicates. It will reduce time significantly."

....
are you suggesting to use a linked list and binary search...ok let me think about it.
>>If the range is not too high ....i have 272145 integers. isn't it big?

in fact sorting will help me to get the result...but it is unnecessary...i simply need the frequency of each value.

so, what do you people suggest? which one would be the best algo for my data set for calculating frequencies?

thanks all.


 
> file contains around 272145 integers.....
I meant the range
Like -50 to 5000
or -1000000 to 1000000

If the range is less than, say, 10 million, then a simple large array initialised to zero will suffice

Code:
// initialise
#define MIN -500000
#define MAX  500000
int counters[MAX-MIN+1] = { 0 };


// count 'value' as read from the file
counters[value-MIN]++;


// print it
for ( j = 0 ; j < MAX-MIN+1 ; j++ ) {
  if ( counters[j] != 0 ) {
    printf( "%d appears %d times\n", j+MIN, counters[j] );
  }
}
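
Wired into a complete program (the file path is taken from the original post, and the error handling is just a guess at what you would want), the three fragments above might combine like this:
Code:
#include <stdio.h>

#define MIN -500000
#define MAX  500000

static int counters[MAX - MIN + 1];   /* static, so it is zero-initialised */

int main(void)
{
	FILE *fp = fopen("d:\\datatest.txt", "r");
	int value, j;

	if (fp == NULL) { printf("can not open\n"); return 1; }

	while (fscanf(fp, "%d", &value) == 1)
		if (value >= MIN && value <= MAX)
			counters[value - MIN]++;   /* the value itself picks the slot */
	fclose(fp);

	for (j = 0; j < MAX - MIN + 1; j++)
		if (counters[j] != 0)
			printf("%d appears %d times\n", j + MIN, counters[j]);

	return 0;
}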

--
 

hi, i am not clear about your algo....but i guess it can not remove the redundancy (as in my previous output)....it can not give me the desired output...you are not comparing numbers at all!

i am not clear about

// count 'value' as read from the file
counters[value-MIN]++;

what is going on here?

anyway, can you show it using only the data
10 12 10 10 12 14 125 0 -1 -2 14 15 12 11 3 11

about the range you are asking about...the numbers are between 0 and 250 ...all integers.


 
Example - MIN is 0, MAX is 10
So
Code:
int counters[MAX-MIN+1] = { 0 };
is effectively
Code:
int counters[11] = { 0 };

You read in the following numbers (into value)
Code:
1 3 4 6 4

Which does the following (subtracting MIN omitted since it is 0 in this example)
Code:
counters[1]++; // Was 0, Now 1
counters[3]++; // Was 0, Now 1
counters[4]++; // Was 0, Now 1
counters[6]++; // Was 0, Now 1
counters[4]++; // Was 1, Now 2

So at the end, the counters array contains
Code:
index = 0  1  2  3  4  5  6  7  8  9 10
value = 0, 1, 0, 1, 2, 0, 1, 0, 0, 0, 0

When you apply the final loop, you should see
Code:
1 appears 1 times
3 appears 1 times
4 appears 2 times
6 appears 1 times


> you are not comparing number at all!
Exactly, because it wastes too much time.
It uses the value read from the file as an index into an array - so it's simply count[value]++

Your code searches an array for some array[pos] == value, so that it can do count[pos]++

Which is why I asked you what the range was. If the range is such that you can fit the whole array in memory, you may as well use it.



> the numbers are in between 0-250
So the -1 and -2 are just there for confusion?

--
 
yes....your algorithm is very bright and strong...i have understood it .

>So the -1 and -2 are just there for confusion?

oohh....those are not actual data ...i just gave a demo.

all my data lies in the range 0-250, all of them are integers, and the number of data points is 272145.

i think it would be ok...if i use your algo on my actual data.

thanks
 
If you ever need to do this with something where you can't fit the entire range in memory, then sorting is worth considering, because if you sort it you do not need to compare each number with all the others (no nested loop necessary). All you need to do is add up the lengths of each run of the same number.
Sorting is (usually!) less than N^2, but the nested loop is N^2. Therefore for a bigger range than can be handled by Salem's approach, your qsort (or other appropriate sort) option is good.
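
A hedged sketch of that sort-then-count-runs idea (cmp_int and print_frequencies are made-up names, and the array is assumed to already hold all the numbers):
Code:
#include <stdio.h>
#include <stdlib.h>

/* comparison function for qsort on ints */
static int cmp_int(const void *pa, const void *pb)
{
	int a = *(const int *)pa, b = *(const int *)pb;
	return (a > b) - (a < b);
}

/* print each distinct value and how often it occurs; a[] is modified (sorted) */
void print_frequencies(int *a, size_t n)
{
	qsort(a, n, sizeof a[0], cmp_int);
	for (size_t i = 0; i < n; )
	{
		size_t j = i;
		while (j < n && a[j] == a[i])   /* walk to the end of the current run */
			j++;
		printf("%d   %zu\n", a[i], j - i);
		i = j;                          /* jump to the start of the next run */
	}
}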
 
I think the best choice depends on your resources.
If your comp has enough memory (e.g. 50-100 MB) and the numbers are 'randomized', try a simple binary tree (number as key, counter as data). Build it from the file (one pass), then traverse and print. You may also build a 2nd tree (counter/number pair as the key) then traverse it (to print in counter incr/decr order). No assumptions about ranges etc. It's better to use some kind of balanced tree (e.g. AVL or 2-3 trees), but its implementation is harder...
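
A minimal sketch of such an unbalanced counting tree (the struct layout and the recursive insert are just one possible way to write it; malloc failure is not checked):
Code:
#include <stdio.h>
#include <stdlib.h>

struct node {
	int value;                  /* the number (key) */
	long count;                 /* how many times it has been seen */
	struct node *left, *right;
};

/* insert value into the tree, or bump its counter if it is already present */
static struct node *insert(struct node *t, int value)
{
	if (t == NULL)
	{
		t = malloc(sizeof *t);
		t->value = value;
		t->count = 1;
		t->left = t->right = NULL;
	}
	else if (value < t->value)
		t->left = insert(t->left, value);
	else if (value > t->value)
		t->right = insert(t->right, value);
	else
		t->count++;
	return t;
}

/* in-order traversal: prints the values in ascending order with their counts */
static void print_tree(const struct node *t)
{
	if (t == NULL) return;
	print_tree(t->left);
	printf("%d   %ld\n", t->value, t->count);
	print_tree(t->right);
}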

The 2nd alg is: build a temp text file for the system sort command. Read all the numbers and write them to the temp file as text lines (one number per line) with fixed width. Now sort the temp file (e.g. system("sort ...")) then read the sorted file and print the numbers with counters (one pass). No assumptions about machine resources, ranges etc. It's a rather fast alg too (however strange it is ;)...
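
A rough sketch of this second algorithm (the temp file names numbers.txt/sorted.txt and the Unix-style sort -n command line are assumptions; the input path is taken from the original post):
Code:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	FILE *in = fopen("d:\\datatest.txt", "r");
	FILE *tmp = fopen("numbers.txt", "w");       /* hypothetical temp file */
	int x;

	if (in == NULL || tmp == NULL) { printf("can not open\n"); return 1; }

	/* pass 1: copy the numbers out, one fixed-width number per line */
	while (fscanf(in, "%d", &x) == 1)
		fprintf(tmp, "%10d\n", x);
	fclose(in);
	fclose(tmp);

	/* let the system sort command do the heavy lifting */
	system("sort -n numbers.txt -o sorted.txt");

	/* pass 2: read the sorted file and count runs of equal numbers */
	FILE *sorted = fopen("sorted.txt", "r");
	if (sorted == NULL) { printf("can not open\n"); return 1; }

	int prev = 0, count = 0, have_prev = 0;
	while (fscanf(sorted, "%d", &x) == 1)
	{
		if (have_prev && x == prev)
			count++;
		else
		{
			if (have_prev)
				printf("%d   %d\n", prev, count);
			prev = x;
			count = 1;
			have_prev = 1;
		}
	}
	if (have_prev)
		printf("%d   %d\n", prev, count);
	fclose(sorted);
	return 0;
}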
 
Am I right in thinking the tree method is N log N, like sorting should be in a good world? (i.e. N things to look at, and log N depth of tree to insert a new leaf)
If so, I'd think the speed difference between them would come down to how much work you do per inner loop. How does inserting a new tree leaf compare to the inner loop of quicksort?
ArkM is right to mention randomized data. Both quicksort and the unbalanced tree carry the big proviso of a really awful worst case on already-sorted data.
 