algorithm needed


satellite03
Dec 26, 2003

hi, say i have a sequence like below

10 12 10 10 12 14 125 0 -1 -2 14 15 12 11 3 11 ........ (in a file)

i want to store all the elements in an array...but i want to store
each element only once (no repetition) and also how many times
each element is repeating.

that is, my array will have all distinct elements.

for example, i want the output for my above sequence to look like this

array[0] = 10  repetition = 3
array[1] = 12  repetition = 3
array[2] = 14  repetition = 2
array[3] = 125 repetition = 1

..............like this

how can i do it?

i wrote some code....but it's not fulfilling my desire
Code:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	FILE *ptr1, *ptr2;
	int repetation[50], number[50];
	int i = 0;
	int x, y;
	long int k = 0;

	ptr1 = fopen("d:\\datatest.txt", "r");
	if (ptr1 == NULL) printf("can not open");
	ptr2 = fopen("d:\\datatest.txt", "r");
	if (ptr2 == NULL) printf("can not open");

	do
	{
		fscanf(ptr1, "%d", &x);
		int count = 0;
		/* scan the whole file again, counting how often x occurs */
		do
		{
			fscanf(ptr2, "%d", &y);
			if (x == y)
				count++;
		} while (!feof(ptr2));
		repetation[k] = count;
		k = k + 1;
		number[i] = x;
		i = i + 1;
		rewind(ptr2);
	} while (!feof(ptr1));

	for (int m = 0; m < k; m++)
		printf("%d   %d\n", number[m], repetation[m]);

	fclose(ptr1);
	fclose(ptr2);
	return 0;
}


getting output :
________________

10   3
 12   3
 10   3  
 10   3  
 12   3
 14   2
 125   1
 0   1
 -1   1
 -2   1
 14   2
 15   1
 12   3
 11   2
 3   1
 11   2

"d:\lcc2\lcc\adf.exe"
Return code -1
Execution time 0.032 seconds
Press any key to continue...

although i am getting output which shows which element is repeating how many times..but they are shown multiple times. here 10   3 has come three times but i want it only once.

my expected output:
------------------
 10   3
 12   3
 14   2
 125   1
 0   1
 -1   1     
 -2   1   
 14   2
 15   1
 11   2
 3   1
 .....

i am not able to get a good algo...how can i do it? i have a large amount of data..
 
some hints:
- you need to open the file only once
- while reading the numbers in a loop, keep a list of the numbers you have read already
- use a nested loop to check the current number against each number in the list
- if the number is already in the list, increase the corresponding count by 1

so you don't need 2 file pointers (or 2 fopen calls). you still need 2 loops, the outer one for the input (as you have it now) and the inner one for your already-read list; see the sketch below.
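
A minimal sketch of that idea (the d:\\datatest.txt path is taken from your post; the 5000-element limit on distinct values is only an assumed bound for illustration):
Code:
#include <stdio.h>

int main(void)
{
	FILE *fp = fopen("d:\\datatest.txt", "r");   /* same path as in your code */
	int number[5000], count[5000];               /* assumed bound on distinct values */
	int n = 0, x, j;

	if (fp == NULL) { printf("can not open\n"); return 1; }

	while (fscanf(fp, "%d", &x) == 1)            /* outer loop: one pass over the file */
	{
		for (j = 0; j < n; j++)              /* inner loop: search the already-read list */
			if (number[j] == x)
				break;
		if (j < n)
			count[j]++;                  /* seen before: just bump its counter */
		else if (n < 5000)
		{
			number[n] = x;               /* new value: append it to the list */
			count[n] = 1;
			n++;
		}
	}
	fclose(fp);

	for (j = 0; j < n; j++)
		printf("%d   %d\n", number[j], count[j]);
	return 0;
}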

hope you can proceed from here.
--
ranga
 
What is the expected range of numbers in the file?

--
 
Two suggestions:

1. While loading, store numbers in sorted array (or list) and use binary-search to find duplicates. It will reduce time significantly.
or
2. If the range is not too high, use a separate bit-string structure for fast lookup
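
For illustration, a rough sketch of suggestion 1 with parallel sorted arrays (MAXDISTINCT, find_pos and add_value are hypothetical names, not part of the suggestion):
Code:
#include <string.h>

#define MAXDISTINCT 5000   /* assumed bound on distinct values */

int number[MAXDISTINCT];
int count[MAXDISTINCT];
int n = 0;

/* binary search: return the position of x, or the insertion point if absent */
static int find_pos(int x)
{
	int lo = 0, hi = n;
	while (lo < hi)
	{
		int mid = (lo + hi) / 2;
		if (number[mid] < x)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

void add_value(int x)
{
	int pos = find_pos(x);
	if (pos < n && number[pos] == x)
	{
		count[pos]++;    /* duplicate: just bump the counter */
	}
	else if (n < MAXDISTINCT)
	{
		/* shift the tail up one slot and insert x in sorted order */
		memmove(&number[pos + 1], &number[pos], (n - pos) * sizeof number[0]);
		memmove(&count[pos + 1], &count[pos], (n - pos) * sizeof count[0]);
		number[pos] = x;
		count[pos] = 1;
		n++;
	}
}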
 
salem
-------
file contains around 272145 integers.....

bNasty
-------
i have already used qsort() from stdlib.h and, simply using a while loop, calculated the frequency for small data..but for large data i think it is not a good algo.

>>>>"While loading, store numbers in sorted array (or list) and use binary-search to find duplicates. It will reduce time significantly."

....
are you suggesting to use a linked list and binary search...ok let me think about it.
>>If the range is not too high ....i have 272145 integers. isn't it big?

in fact sorting will help me to get the result...but it is unnecessary...i simply need the frequency of each value.

so, what do you people suggest? which one would be the best algo for my data set for calculating frequencies?

thanks all.


 
> file contains around 272145 integers.....
I meant the range
Like -50 to 5000
or -1000000 to 1000000

If the range is less than, say, 10 million, then a simple large array initialised to zero will suffice

Code:
// initialise
#define MIN -500000
#define MAX  500000
int counters[MAX-MIN+1] = { 0 };


// count 'value' as read from the file
counters[value-MIN]++;


// print it
for ( j = 0 ; j < MAX-MIN+1 ; j++ ) {
  if ( counters[j] != 0 ) {
    printf( "%d appears %d times\n", j+MIN, counters[j] );
  }
}
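
Wired into a complete program (the file path is taken from the original post, and the error handling is just a guess at what you would want), the three fragments above might combine like this:
Code:
#include <stdio.h>

#define MIN -500000
#define MAX  500000

static int counters[MAX - MIN + 1];   /* static, so it is zero-initialised */

int main(void)
{
	FILE *fp = fopen("d:\\datatest.txt", "r");
	int value, j;

	if (fp == NULL) { printf("can not open\n"); return 1; }

	while (fscanf(fp, "%d", &value) == 1)
		if (value >= MIN && value <= MAX)
			counters[value - MIN]++;   /* the value itself picks the slot */
	fclose(fp);

	for (j = 0; j < MAX - MIN + 1; j++)
		if (counters[j] != 0)
			printf("%d appears %d times\n", j + MIN, counters[j]);

	return 0;
}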

--
 

hi, i am not clear about your algo....but i guess it can not remove the redundancy (as in my previous output)....it can not give me the desired output...you are not comparing numbers at all!

i am not clear about

// count 'value' as read from the file
counters[value-MIN]++;

what is going on here?

anyway, can you show it using only the data
10 12 10 10 12 14 125 0 -1 -2 14 15 12 11 3 11

about the range you are asking about...the numbers are between 0 and 250 ...all integers.


 
Example - MIN is 0, MAX is 10
So
Code:
int counters[MAX-MIN+1] = { 0 };
is effectively
Code:
int counters[11] = { 0 };

You read in the following numbers (into value)
Code:
1 3 4 6 4

Which does the following (subtracting MIN omitted since it is 0 in this example)
Code:
counters[1]++; // Was 0, Now 1
counters[3]++; // Was 0, Now 1
counters[4]++; // Was 0, Now 1
counters[6]++; // Was 0, Now 1
counters[4]++; // Was 1, Now 2

So at the end, the counters array contains
Code:
index = 0  1  2  3  4  5  6  7  8  9 10
value = 0, 1, 0, 1, 2, 0, 1, 0, 0, 0, 0

When you apply the final loop, you should see
Code:
1 appears 1 times
3 appears 1 times
4 appears 2 times
6 appears 1 times


> you are not comparing number at all!
Exactly, because it wastes too much time.
It uses the value read from the file as an index into an array - so it's simply count[value]++

Your code searches an array for some array[pos] == value, so that it can do count[pos]++

Which is why I asked you what the range was. If the range is such that you can fit the whole array in memory, you may as well use it.



> the numbers are in between 0-250
So the -1 and -2 are just there for confusion?

--
 
yes....your algorithm is very bright and strong...i have understood it .

>So the -1 and -2 are just there for confusion?

oohh....those are not actual data ...i just gave a demo.

all my data lies in the range 0-250, all of them are integers, and the number of data points is 272145.

i think it would be ok...if i use your algo on my actual data.

thanks
 
If you ever need to do this with something where you can't fit the entire range in memory, then sorting is worth considering, because if you sort it you do not need to compare each number with all the others (no nested loop necessary). All you need to do is add up the lengths of each run of the same number.
Sorting is (usually!) less than N^2, but the nested loop is N^2. Therefore for a bigger range than can be handled by Salem's approach, your qsort (or other appropriate sort) option is good.
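
A hedged sketch of that sort-then-count-runs idea (cmp_int and print_frequencies are made-up names, and the array is assumed to already hold all the numbers):
Code:
#include <stdio.h>
#include <stdlib.h>

/* comparison function for qsort on ints */
static int cmp_int(const void *pa, const void *pb)
{
	int a = *(const int *)pa, b = *(const int *)pb;
	return (a > b) - (a < b);
}

/* print each distinct value and how often it occurs; a[] is modified (sorted) */
void print_frequencies(int *a, size_t n)
{
	qsort(a, n, sizeof a[0], cmp_int);
	for (size_t i = 0; i < n; )
	{
		size_t j = i;
		while (j < n && a[j] == a[i])   /* walk to the end of the current run */
			j++;
		printf("%d   %zu\n", a[i], j - i);
		i = j;                          /* jump to the start of the next run */
	}
}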
 
I think the best choice depends on your resources.
If your comp has enough memory (e.g. 50-100 MB) and the numbers are 'randomized', try a simple binary tree (number as key, counter as data). Build it from the file (one pass), then traverse and print. You may also build a 2nd tree (counter/number pair as the key) then traverse it (to print in counter incr/decr order). No assumptions about ranges etc. It's better to use some kind of balanced tree (e.g. AVL or 2-3 trees), but its implementation is harder...
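
A minimal sketch of such an unbalanced counting tree (the struct layout and the recursive insert are just one possible way to write it; malloc failure is not checked):
Code:
#include <stdio.h>
#include <stdlib.h>

struct node {
	int value;                  /* the number (key) */
	long count;                 /* how many times it has been seen */
	struct node *left, *right;
};

/* insert value into the tree, or bump its counter if it is already present */
static struct node *insert(struct node *t, int value)
{
	if (t == NULL)
	{
		t = malloc(sizeof *t);
		t->value = value;
		t->count = 1;
		t->left = t->right = NULL;
	}
	else if (value < t->value)
		t->left = insert(t->left, value);
	else if (value > t->value)
		t->right = insert(t->right, value);
	else
		t->count++;
	return t;
}

/* in-order traversal: prints the values in ascending order with their counts */
static void print_tree(const struct node *t)
{
	if (t == NULL) return;
	print_tree(t->left);
	printf("%d   %ld\n", t->value, t->count);
	print_tree(t->right);
}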

The 2nd alg is: build a temp text file for the system sort command. Read all the numbers and write them to the temp file as text lines (one number per line) with fixed width. Now sort the temp file (e.g. system("sort ...")) then read the sorted file and print the numbers with counters (one pass). No assumptions about machine resources, ranges etc. It's a rather fast alg too (however strange it is ;)...
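
A rough sketch of this second algorithm (the temp file names numbers.txt/sorted.txt and the Unix-style sort -n command line are assumptions; the input path is taken from the original post):
Code:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	FILE *in = fopen("d:\\datatest.txt", "r");
	FILE *tmp = fopen("numbers.txt", "w");       /* hypothetical temp file */
	int x;

	if (in == NULL || tmp == NULL) { printf("can not open\n"); return 1; }

	/* pass 1: copy the numbers out, one fixed-width number per line */
	while (fscanf(in, "%d", &x) == 1)
		fprintf(tmp, "%10d\n", x);
	fclose(in);
	fclose(tmp);

	/* let the system sort command do the heavy lifting */
	system("sort -n numbers.txt -o sorted.txt");

	/* pass 2: read the sorted file and count runs of equal numbers */
	FILE *sorted = fopen("sorted.txt", "r");
	if (sorted == NULL) { printf("can not open\n"); return 1; }

	int prev = 0, count = 0, have_prev = 0;
	while (fscanf(sorted, "%d", &x) == 1)
	{
		if (have_prev && x == prev)
			count++;
		else
		{
			if (have_prev)
				printf("%d   %d\n", prev, count);
			prev = x;
			count = 1;
			have_prev = 1;
		}
	}
	if (have_prev)
		printf("%d   %d\n", prev, count);
	fclose(sorted);
	return 0;
}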
 
Am I right in thinking the tree method is N log N, like sorting should be in a good world? (i.e. N things to look at, and log N depth of tree to insert a new leaf)
If so, I'd think the speed difference between them would come down to how much work you do per inner loop. How does inserting a new tree leaf compare to the inner loop of quicksort?
ArkM is right to mention randomized data. Both quicksort and the unbalanced tree carry the big proviso of a really awful worst case on already-sorted data.
 