Cleaning a text file...


BuilderSpec

Programmer
Dec 24, 2003
383
GB
Hi All

I have some text files with 'crap' characters in the middle of some fields. These can be NUL bytes, stray line feeds, etc.

I want to process each file and replace the 'crap' characters with an arbitrary character (e.g. '9').

I have tried reading the file with fgetc / fputc but cannot determine the real end of file, because one of the 'crap' characters that appears is itself an end-of-file character. Notepad seems to open the file fine and finds the true end of file. My program stops when it hits the first end-of-file 'crap' character.

So I changed to using fstat to determine the file size in bytes and then processing that many characters, but it never gives a true figure: it seems to work for some files but not for larger ones.

Has anyone ever done this? Any ideas on how to proceed?


I will post the code I have tried so far...



AnsiString Output = ChangeFileExt(argv[1], ".txc");
FILE *fp = fopen(argv[1], "r");
FILE *out = fopen(Output.c_str(), "w");

char c = 0;
struct stat statbuf;
fstat(fileno(fp), &statbuf);
long ct = 0;
while (ct < statbuf.st_size)
{
    c = fgetc(fp);
    if (c != -1)
    {
        if ((isalpha(c) && isupper(c)) || isdigit(c) || (c == '\r' || c == '\n' || c == ' '))
            fputc(c, out);
        else
            fputc('9', out);
    }
    else if (ct < (statbuf.st_size - 1))
    {
        c = fgetc(fp);
        if (c != -1)
            fputc('9', out);
        else
            ungetc(c, fp);
    }
    ct++;
}
fclose(out);
fclose(fp);

return 0;


Any pointers would be greatly appreciated.



Hope this helps!

Regards

BuilderSpec
 
Personally, I would use streams and one of getline variations. Then you could parse the whole line and try changing the invalid characters.


James P. Cottingham
-----------------------------------------
I'm number 1,229!
 
2ffat
Thank you.

Can you elaborate a bit more? Which getline variation do you mean?

Hope this helps!

Regards

BuilderSpec
 
Now that I think about it, you can just pull in every character through the stream and check each one. You may have to set up the stream to not ignore whitespace, etc. but it should work.
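
A minimal sketch of that character-by-character idea, in standard C++ rather than anything Builder-specific. The names is_wanted, clean_stream, clean_string and clean_file are made up for illustration, and the "keep" rule is only assumed from the test in the first post (uppercase letters, digits, CR, LF, space). Reading through std::istreambuf_iterator delivers raw bytes, so whitespace is never skipped; with operator>> you would need std::noskipws instead. The file is opened with std::ios::binary so that no byte is treated specially.

```cpp
#include <cctype>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <sstream>
#include <string>

// Assumed "keep" rule, mirroring the test in the first post.
inline bool is_wanted(unsigned char ch) {
    return std::isupper(ch) || std::isdigit(ch) ||
           ch == '\r' || ch == '\n' || ch == ' ';
}

// Copies `in` to `out` one byte at a time, replacing unwanted bytes with
// `fill`. Returns the number of bytes that were replaced.
inline std::size_t clean_stream(std::istream& in, std::ostream& out, char fill = '9') {
    std::size_t replaced = 0;
    std::istreambuf_iterator<char> it(in), end;   // raw bytes, no whitespace skipping
    for (; it != end; ++it) {
        unsigned char ch = static_cast<unsigned char>(*it);
        if (is_wanted(ch)) {
            out.put(static_cast<char>(ch));
        } else {
            out.put(fill);
            ++replaced;
        }
    }
    return replaced;
}

// Convenience wrapper for in-memory data (handy for trying the rule out).
inline std::string clean_string(const std::string& raw) {
    std::istringstream in(raw);
    std::ostringstream out;
    clean_stream(in, out);
    return out.str();
}

// File-to-file wrapper. Binary mode on both sides keeps every byte as-is.
inline bool clean_file(const std::string& src, const std::string& dst) {
    std::ifstream in(src.c_str(), std::ios::binary);
    std::ofstream out(dst.c_str(), std::ios::binary);
    if (!in || !out) return false;
    clean_stream(in, out);
    return static_cast<bool>(out);
}
```

Calling clean_file(argv[1], "cleaned.txc") from main would be one way to use it; the output name here is just a placeholder.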


James P. Cottingham
-----------------------------------------
I'm number 1,229!
 
2ffat

Been there, done that. The problem I have is that I then come across end-of-file markers in the middle of the file, which is why I went to fstat, so I could read a fixed number of characters regardless of what they are. But fstat doesn't always seem to return the correct byte size of the file.
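
For reference, two things commonly produce exactly these symptoms, though whether they apply to these particular files is an assumption. First, a file opened with "r" on Windows is read in text mode, where the runtime typically treats Ctrl-Z (0x1A) as end of file and collapses CR/LF pairs into a single character, so the number of characters read no longer matches the byte size that fstat reports. Second, storing the result of fgetc in a char makes a legitimate 0xFF byte compare equal to -1. The sketch below (clean_with_fgetc is a hypothetical name) keeps the fgetc / fputc approach but opens both files in binary mode and holds the result in an int.

```cpp
#include <cctype>
#include <cstdio>

// Copies src to dst byte by byte, replacing anything that is not an uppercase
// letter, digit, CR, LF or space with `fill`.
// Returns the number of bytes copied, or -1 if a file could not be opened.
long clean_with_fgetc(const char* src, const char* dst, char fill = '9') {
    std::FILE* in = std::fopen(src, "rb");    // binary: 0x1A is just another byte
    if (!in) return -1;
    std::FILE* out = std::fopen(dst, "wb");   // binary: no LF -> CR/LF expansion
    if (!out) {
        std::fclose(in);
        return -1;
    }

    long count = 0;
    int c;                                    // int, so EOF differs from byte 0xFF
    while ((c = std::fgetc(in)) != EOF) {
        bool keep = std::isupper(c) || std::isdigit(c) ||
                    c == '\r' || c == '\n' || c == ' ';
        std::fputc(keep ? c : fill, out);
        ++count;
    }

    std::fclose(out);
    std::fclose(in);
    return count;
}
```

With binary mode the loop ends only at the physical end of the file, so no size from fstat is needed at all.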



Hope this helps!

Regards

BuilderSpec
 
Have you tried opening the file in binary mode, then reading a block into memory and processing it?
If you have and are still having problems with the EOF character, I would try to force the file pointer past the EOF character with fsetpos, looping based upon the file size instead of testing for EOF.
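
A rough sketch of the binary-mode, read-into-memory approach, using only the C standard library. The helper names read_all, scrub and write_all are invented here, and the "keep" rule is assumed from the first post. The size is taken with fseek / ftell on the binary stream, which works on the usual desktop platforms although the C standard does not strictly guarantee it. Because the loop runs over the buffer length, no EOF test and no fsetpos call is involved.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdio>
#include <vector>

// Reads an entire file in binary mode into `data`.
bool read_all(const char* path, std::vector<unsigned char>& data) {
    std::FILE* fp = std::fopen(path, "rb");
    if (!fp) return false;
    long size = -1;
    if (std::fseek(fp, 0, SEEK_END) == 0) size = std::ftell(fp);
    if (size < 0 || std::fseek(fp, 0, SEEK_SET) != 0) {
        std::fclose(fp);
        return false;
    }
    data.resize(static_cast<std::size_t>(size));
    std::size_t got = data.empty() ? 0 : std::fread(&data[0], 1, data.size(), fp);
    std::fclose(fp);
    return got == data.size();
}

// Replaces every byte that is not an uppercase letter, digit, CR, LF or space.
// The loop is bounded by the buffer size, never by an end-of-file test.
std::size_t scrub(std::vector<unsigned char>& data, unsigned char fill = '9') {
    std::size_t replaced = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        unsigned char ch = data[i];
        bool keep = std::isupper(ch) || std::isdigit(ch) ||
                    ch == '\r' || ch == '\n' || ch == ' ';
        if (!keep) {
            data[i] = fill;
            ++replaced;
        }
    }
    return replaced;
}

// Writes the buffer back out in binary mode.
bool write_all(const char* path, const std::vector<unsigned char>& data) {
    std::FILE* fp = std::fopen(path, "wb");
    if (!fp) return false;
    std::size_t put = data.empty() ? 0 : std::fwrite(&data[0], 1, data.size(), fp);
    return std::fclose(fp) == 0 && put == data.size();
}
```

Reading everything at once is fine for files that fit comfortably in memory; for very large files the same scrub loop could be applied to fixed-size chunks instead.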

The simplest solution is the best!
 
Hi

Thanks, all, for your input. I ran with some of the suggestions and ended up with the following, which works a treat!

(Obviously this has my particular needs in it as well.)


int main(int argc, char* argv[])
{
    if (argc < 2)
    {
        printf("Usage: %s <file>\n", argv[0]);
        return 0;
    }

    // The input is renamed to .txo, then rewritten (cleaned) under its original name.
    AnsiString Output = ChangeFileExt(argv[1], ".txo");

    if (!FileExists(argv[1]))
    {
        printf("Cannot open %s\n", argv[1]);
        getchar();
        return 0;
    }
    if (FileExists(Output))
        DeleteFile(Output);
    if (!RenameFile(argv[1], Output))
    {
        printf("Failed to rename file from %s to %s\n",
               argv[1], Output.c_str());
        return 0;
    }

    TFileStream *fs;
    FILE *out = NULL;

    try
    {
        fs = new TFileStream(Output, fmOpenRead);
    }
    catch (Exception &E)
    {
        printf("%s\n", (E.Message).c_str());
        getchar();
        return 0;
    }
    // Binary mode, so the CR/LF already in each record is written unchanged.
    out = fopen(argv[1], "wb");
    if (out == NULL)
    {
        printf("Cannot create %s\n", argv[1]);
        delete fs;
        return 0;
    }

    // Work out the record length from the first CR/LF pair in the first 2000 bytes.
    char MaxBuf[2000];

    memset(MaxBuf, 0, 2000);
    fs->Read(MaxBuf, 2000);

    int RecLen = 0;
    for (RecLen = 0; RecLen < 2000 - 1; RecLen++)
    {
        if (!memcmp(MaxBuf + RecLen, "\r\n", 2))
            break;
    }

    fs->Seek(0, soFromBeginning);

    #define BUFF_SZ 20000
    char buffer[BUFF_SZ];
    int bc = RecLen + 2;

    // Read one fixed-length record (data plus CR/LF) at a time.
    while (bc == RecLen + 2)
    {
        memset(buffer, 0, BUFF_SZ);
        bc = fs->Read(buffer, RecLen + 2);
        if (bc <= 0)
            break;      // nothing left to read, so nothing to write

        // Only records with '4' at offset 23 have the field at 45..64 cleaned.
        if (buffer[23] == '4')
        {
            for (int i = 45; i < 65; i++)
            {
                if (!isalnum((unsigned char)buffer[i]) && buffer[i] != ' ')
                    buffer[i] = '9';
            }
        }

        // Write only the bytes that were actually read.
        if (fwrite(buffer, 1, bc, out) != (size_t)bc)
        {
            printf("Write to %s failed\n", argv[1]);
            getchar();
            break;
        }
    }
    fclose(out);
    delete fs;
    return 0;
}
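
For readers without the VCL, the same fixed-length-record idea can be sketched in standard C++ on an in-memory buffer. This is an illustration under assumptions, not the code above: record_length and scrub_records are hypothetical names, and the flag position, flag value and field range are passed in as parameters (the values 23, '4' and 45 to 65 in the post are specific to that poster's data). Like the original, it assumes the first CR/LF pair really does mark the end of the first record.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Length of the first record including its CR/LF terminator, or 0 if the
// data contains no CR/LF pair.
std::size_t record_length(const std::string& data) {
    std::size_t pos = data.find("\r\n");
    return pos == std::string::npos ? 0 : pos + 2;
}

// For every complete record whose byte at `flag_pos` equals `flag`, replaces
// each byte in [first, last) that is neither alphanumeric nor a space with
// `fill`. Offsets are relative to the start of the record.
// Returns the number of bytes replaced.
std::size_t scrub_records(std::string& data, std::size_t flag_pos, char flag,
                          std::size_t first, std::size_t last, char fill = '9') {
    const std::size_t rec = record_length(data);
    if (rec == 0 || flag_pos >= rec || last > rec || first > last) return 0;

    std::size_t replaced = 0;
    for (std::size_t start = 0; start + rec <= data.size(); start += rec) {
        if (data[start + flag_pos] != flag) continue;   // record type not targeted
        for (std::size_t i = start + first; i < start + last; ++i) {
            unsigned char ch = static_cast<unsigned char>(data[i]);
            if (!std::isalnum(ch) && ch != ' ') {
                data[i] = fill;
                ++replaced;
            }
        }
    }
    return replaced;
}
```

With the layout from the post, the call would look like scrub_records(data, 23, '4', 45, 65) after loading the file into data in binary mode.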


Hope this helps!

Regards

BuilderSpec
 