
Getting lines

Status
Not open for further replies.
May 3, 2002
I always use Korn shell in my job. However, I have a file of 200,000+ lines, and needless to say I need speed, because processing it in ksh takes too long.

What is needed is finding a line that matches a string and then print out that line plus the next 16 lines after the match. There could be any number of string matches in the file.

Thanks for any input in solving this using C.
 
*********************************************************************************************************************************
Here is a little program that I have written. I hope it will help you solve your problem!
Code:
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

#define  _MAXSTRING_    100

struct FileSearch
{
	bool Search( FILE *f, char *string );
    int FindLen( FILE *g, int currentpos );
	void PrintLines( FILE *fp );
};

FileSearch acc;


int main()
{
	FILE *file;
	const char *filename = "your filename";
	char string[] = "the string to search for";
	
	file = fopen( filename, "r" );

	if(!file)
	{
		fprintf( stderr,"There was an error while opening \"%s\".\n",filename);
		exit(1);
	}

	if( acc.Search( file, string ))
	{
		acc.PrintLines( file );

		while( acc.Search( file, string ))
		{
			acc.PrintLines( file );
		}
	}

	else
	{
		printf("Can't find the string \"%s\"\n",string);
	}
}


/* search for a specific string in a file */
bool FileSearch::Search( FILE *f, char *string )
{
	char *sentence;
    bool found = false;

	sentence   = (char*)malloc( sizeof(char) * _MAXSTRING_ );

	if( !sentence )
	{
	    fputs("Memory allocation failed.\n", stderr );
		exit(1);
	}
	
	while( fgets( sentence, _MAXSTRING_, f ) != NULL )
	{
		if( strstr( sentence, string ) != NULL )
		{
			found = true;
			break;
		}
	}

	free(sentence);
	return found;
}

/* find the length of the current line (rescans the file from the start) */
int FileSearch::FindLen( FILE *g, int currentpos )
{
	int c;   /* int rather than char, so that EOF can be told apart from data */
	int len = 0;

	rewind(g);

	while(( c = getc(g)) != EOF && ftell(g) != currentpos )
	{
		len++;
		if( c == '\n' )
			len = 0;
	}

	return len;
}

void FileSearch::PrintLines( FILE *fp )
{
	int c;   /* int rather than char, so that EOF can be told apart from data */
	int line = 0;
	int len = FindLen( fp, ftell(fp)); 
	/* place the file pointer at the beginning of the current line;
	   the + 2 appears to assume a two-byte CR LF line ending */
	fseek( fp, -(len + 2), SEEK_CUR );

	/* 17 lines: the matching line plus the next 16 */
	while(( c = getc(fp)) != EOF && line != 17 ) 
	{
	    if( c == '\n' )
			line++;

		putchar(c);
	}

    printf("\n\n /***********************************************/\n");
}
 
Leibnitz,
I was looking at your program and ended up with a few questions:

1) If the string is present in line 4 and line 10, then only lines 4 to 20 will be printed instead of the two sets, lines 4-20 and 10-26. This could be a non-issue, depending on the layout of the file.

2) We can use fgets() instead of getc() in PrintLines for better performance.

3) The code rewinds and rescans the file every time a string is found, which will also hurt performance. Or maybe the rewind function differs between C++ and C, as I still have not been able to understand how FindLen works.

cheers
amit
crazy_indian@lycos.com

To bug is human, to debug divine
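
Regarding remark 3 above, one way to avoid the rewind and rescan is to note the file offset with ftell() just before each fgets() call, so the start of the matching line is already known when the match is found. The following is only a sketch of that idea, not the original poster's code: the helper name seek_to_next_match and the MAX_LINE limit are assumptions made for illustration, and it assumes every line fits in MAX_LINE - 1 characters (a longer line would be read in pieces and the offset would point at the piece, not the line).

```c
#include <stdio.h>
#include <string.h>

#define MAX_LINE 1024   /* assumed upper bound on line length */

/* Hypothetical helper: scan forward from the current position of `in`
 * for the next line containing `needle`. On success the stream is left
 * positioned at the START of the matching line and 1 is returned.
 * At end of file (or on error) 0 is returned. */
int seek_to_next_match(FILE *in, const char *needle)
{
    char line[MAX_LINE];
    long line_start;

    for (;;) {
        line_start = ftell(in);          /* offset before reading the line */
        if (line_start < 0 || fgets(line, sizeof line, in) == NULL)
            return 0;
        if (strstr(line, needle) != NULL) {
            /* Jump straight back to the line start; no rewind or rescan. */
            return fseek(in, line_start, SEEK_SET) == 0;
        }
    }
}
```

After a successful call, the caller can read the matching line and the following lines with fgets() directly from the current position.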
 
I wrote a small C program. It feels like I am back in school doing assignments :)

------------------------------------------------
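#include <limits.h>   /* LINE_MAX (POSIX) */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* strstr */

#ifndef LINE_MAX
#define LINE_MAX 2048   /* fallback where <limits.h> does not define it */
#endif

#define LINES_TO_WRITE 17   /* the matching line plus the 16 that follow */

const char *str = "crazy_for_U";

int main(void) {
    FILE *fp, *fpw;
    int i;
    long saved_offset;
    char buffer[LINE_MAX];

    fp = fopen("abc", "r");
    fpw = fopen("abc_output", "w");
    if (fp == NULL || fpw == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }
    while (fgets(buffer, LINE_MAX, fp)) {
        if (strstr(buffer, str) != NULL) {
            /* remember where the next line starts, so overlapping matches are still found */
            saved_offset = ftell(fp);
            fputs(buffer, fpw);
            for (i = 0; i < LINES_TO_WRITE - 1; i++) {
                if (!fgets(buffer, LINE_MAX, fp)) {
                    break;
                }
                fputs(buffer, fpw);
            }
            fputs("**** End of Record ****\n", fpw);
            fseek(fp, saved_offset, SEEK_SET);
        }
    }
    fclose(fp);
    fclose(fpw);
    return 0;
}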
--------------------------------------------------------
I have hardcoded a few things, such as the filenames and the string to search for.

cheers
amit
crazy_indian@lycos.com

To bug is human, to debug divine
 
If you have ksh you should have awk too.

I think awk is simpler than C for this kind of task. It should be almost as fast too. Could you give it a try and post the elapsed times for C and for awk? I would be interested in the results.

Code:
awk '/the string to match/ { count = 17; } { if (count > 0) { count--; print }}'
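
For comparison, the same countdown idea can be written in C as a single forward pass with no seeking. This is a rough sketch rather than a tuned implementation: the function name print_matches and the MAX_LINE and LINES_AFTER constants are assumptions made for illustration, and lines are assumed to fit in MAX_LINE - 1 characters. Like the awk one-liner, a new match inside an open window restarts the countdown, so overlapping windows are printed once as a merged block.

```c
#include <stdio.h>
#include <string.h>

#define MAX_LINE    1024   /* assumed upper bound on line length */
#define LINES_AFTER 16     /* lines to print after each matching line */

/* Hypothetical helper: copy every line containing `needle`, plus the
 * LINES_AFTER lines that follow it, from `in` to `out`.
 * Returns the number of lines written. */
long print_matches(FILE *in, FILE *out, const char *needle)
{
    char line[MAX_LINE];
    int remaining = 0;     /* lines still to print for the current window */
    long written = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        if (strstr(line, needle) != NULL)
            remaining = LINES_AFTER + 1;   /* the match itself plus 16 more */
        if (remaining > 0) {
            fputs(line, out);
            remaining--;
            written++;
        }
    }
    return written;
}
```

A call such as print_matches(stdin, stdout, "the string to match") would behave like the awk command reading standard input.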
 
****************************************************************************************************************************************************
OK ami123, you are probably right about the first and third remarks.
But is it really true that we get better performance by using "fgets" instead of "getc"?
Anyway, I have tried to make some corrections to my original code, but I haven't found a way to modify the "FindLen" function to make it more efficient.
I have seen your code and I can say it is good; the reason mine is a lot bigger is for "extensibility" and "flexibility" as well.

Code:
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

#define  _MAXSTRING_    100

struct FileSearch
{
	int Search( FILE *f, char *string );
    int FindLen( FILE *g, int currentpos );
	void PrintLines( FILE *fp, int pos );
};

FileSearch acc;


int main()
{
	FILE *file;
	const char *filename  = "your filename";
	char string[]   = "the string to search for";
	int pos;
	
	file = fopen( filename, "r" );

	if(!file)
	{
		fprintf( stderr,"There was an error while opening \"%s\".\n",filename);
		exit(1);
	}
	
	if(( pos = acc.Search( file, string )) != -1 )
	{
		acc.PrintLines( file, pos );

		while(( pos = acc.Search( file, string )) != -1 )
		{
			acc.PrintLines( file, pos );
		}
	}

	else
	{
		printf("Can't find the string \"%s\"\n",string);
	}

	fclose(file);
}

// search for a specific string in a file
int FileSearch::Search( FILE *f, char *string )
{
	char *sentence;
	bool found = false;
	char *pdest;
	int pos;
	
	sentence   = (char*)malloc( sizeof(char) * _MAXSTRING_ );

	if( !sentence )
	{
	    fputs("Memory allocation failed.\n", stderr );
		exit(1);
	}
	
	while( fgets( sentence, _MAXSTRING_, f ) != NULL )
	{
		if((pdest = strstr( sentence, string )) != NULL )
		{
			found = true;
			break;
		}
	}

	if(found)
	{
	   pos = pdest - sentence;
	   free(sentence);   /* allocated with malloc, so release with free, not delete */
	   return pos;
	}

	free(sentence);
	return -1;
}


// find the length of the current line (rescans the file from the start)
int FileSearch::FindLen( FILE *g, int currentpos )
{
	int c;   // int rather than char, so that EOF can be told apart from data
	int len = 0;

	rewind(g);

	while(( c = getc(g)) != EOF && ftell(g) != currentpos )
	{
		len++;
		if( c == '\n' )
			len = 0;
	}

	return len;
}

void FileSearch::PrintLines( FILE *fp, int pos )
{
	int c;   // int rather than char, so that EOF can be told apart from data
	int line = 0;
	int count = 0;
	int len = FindLen( fp, ftell(fp)); 
	// place the file pointer at the beginning of the current line;
	// the + 2 appears to assume a two-byte CR LF line ending
	fseek( fp, -(len + 2), SEEK_CUR );
    
	// 17 lines: the matching line plus the next 16
	while(( c = getc(fp)) != EOF && line != 17 ) 
	{
	    if( c == '\n' )
			line++;

		count++; /* count the characters */
        
		putchar(c);
	}

	fseek( fp, len - ( count + pos ), SEEK_CUR );

    printf("\n\n /***********************************************/\n");
}
 
Leibnitz,

At first glance, the implementation of fgets() and a loop calling getc() execute much the same CPU instructions, so there seems to be no apparent gain or loss in performance.

But there is another factor to consider: the time required to call a function. If getc() is implemented as a function (the C standard also allows it to be a macro), then to read a file of 100 lines of 60 characters each, the code executes about 6,000 calls to getc(), whereas with fgets() it makes just 100 calls.
Because of this overhead, fgets() becomes faster.

And yes, your code is definitely much more modular, structured and reusable, whereas mine is written in the old programming style.

cheers
amit
crazy_indian@lycos.com

To bug is human, to debug divine
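
The call-count argument above can be checked by running the same job both ways. The sketch below counts lines once with getc() and once with fgets(); the function names and the CHUNK buffer size are assumptions made for illustration. Whether the fgets() version is actually faster depends on the C library (for example, on whether getc() is a macro), so it is worth timing both on the real 200,000-line file, for instance with clock() from <time.h> around each call.

```c
#include <stdio.h>
#include <string.h>

#define CHUNK 4096   /* assumed buffer size for the fgets() version */

/* One getc() call per character. */
long count_lines_getc(FILE *in)
{
    int c;               /* int, not char, so EOF is distinguishable */
    long lines = 0;

    while ((c = getc(in)) != EOF)
        if (c == '\n')
            lines++;
    return lines;
}

/* One fgets() call per line (or per CHUNK - 1 characters of a long line). */
long count_lines_fgets(FILE *in)
{
    char buf[CHUNK];
    long lines = 0;

    while (fgets(buf, sizeof buf, in) != NULL)
        if (strchr(buf, '\n') != NULL)   /* this piece ends a line */
            lines++;
    return lines;
}
```

Both functions return the number of newline characters read from the current position to end of file, so their results should agree on the same input.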
 