Parsing not functional for no particular reason

clagagag · Apr 18, 2012

I'm trying to parse some data for my thesis, but my script isn't working. I'm not very fluent in Perl, so any help would be appreciated!

Code:

#!/usr/bin/perl -w
#use strict;
#use encoding 'cp437';

# $inputfilename=@ARGV;
open(INPUTFILE, "<data"); # Open the data file for reading
#$outputFilename = $inputfilename . '_processed.txt';
#open(OUTPUTFILE, $outputFilename);
open(DATAFILE, ">tempDataFile.txt"); # Open this for writing (it'll be a temporary copy of the data)

#
#
#
#
#
while ($line=<INPUTFILE>) # for each line in the inputfile
    {
        $line=~s/\r/\n/g; # replace the weird mac newline tag with a unix style \n
        print DATAFILE "$line\n"; # print the line to the temporary data file
    }
    
close(INPUTFILE); # close this
close(DATAFILE); # close this

open(DATAFILE, "<tempDataFile.txt"); # open this for reading
open(OUTPUTFILE, ">data_processed.txt"); # open this for writing

while ($line=<DATAFILE>) # for each line of the temp data file
    {
 #       my ($line) = $_; 
    chomp $line; # cut off the newline char
    $firstchar=substr($line, 0, 0);  # consider the first character of the line
    if ($firstchar =~ /\d/) # if the first character of the line is a digit...
        {
        $line =~ s/,/\t/g ; # convert all commas in it to tabs
        print OUTPUTFILE "$line\n"; # print it to the output file
       }
    }


close(DATAFILE); # close this   
close(OUTPUTFILE); # close this

Before I wrote this, I wrote some pseudocode to help me through the process.

Code:

Take text file as input (data.txt)
Open a blank text for output (data_formatted.txt)
If data_formatted.txt already exists, prompt to overwrite (y/n)
If 'n', prompt for new name.
Check if new name already exists, and if so prompt for overwrite, etc (recurse as necessary)

(replace mac newline characters with unix ones, if needed to do the next part)

If the current line starts with "SubjectID:"
Store the value of the next 'word' (e.g. 7) as the variable $subjectID.

For each line that begins with a number (0-9):
Read it in
Replace each comma with a tab
Append as the next line of data_formatted.txt:
$subjectID \t [current line of data] \n 

Close data_formatted.txt

But this seems to not work.

Here's a snippet of the data file:

Code:

Trial	Condition	trial start	ResponseLabel	Time	keys	sequence	mouse_down	UBRelativeTS	UBAbsoluteTS	UBSystemTS	UBDrift	UBButtons	UBVoice	UBOptic	UBIOPorts	UBQueueLength	TrialTime	TrialLabel	
4		269474	skunk,Animals,2.03342375548695,sk,1,5,1,1,1,15,1,0,0,Fam	N/A	[N/A]		0	[N/A]	[N/A]	[N/A]	46	00000000	0	X	0000000000000000	0	1691	RESPONSE	
5		274746	chin,BodyParts,2.66838591669,J,1,3,17,1,2,16,2,0,0,Fam	N/A	[N/A]		0	[N/A]	[N/A]	[N/A]	47	00000000	0	X	0000000000000000	0	1088	RESPONSE	
6		278615	bus,Transit,3.18949031369937,b,1,3,22,1,2.99,11,13,0,0,Fam	N/A	[N/A]		0	[N/A]	[N/A]	[N/A]	47	00000000	0	X	0000000000000000	0	638	RESPONSE	
7		282031	cactus,Plants,1.93449845124357,k,2,6,0,1,3,2,4,0,0,Fam	N/A	[N/A]		0	[N/A]	[N/A]	[N/A]	48	00000000	0	X	0000000000000000	0	752	RESPONSE	
8		285563	knife,Utensils,3.09898963940118,n,1,3,9,1,4,4,6,0,0,Fam	N/A	[N/A]		0	[N/A]	[N/A]	[N/A]	48	00000000	0	X	0000000000000000	0	572	RESPONSE

Thank you so much for your help.

rharsh · Apr 18, 2012

From your pseudo code:

(replace mac newline characters with unix ones, if needed to do the next part)

On what system is data.txt being created (Mac?), and on what system is the Perl script being run (Linux?)

clagagag · Apr 18, 2012

I believe it's as you said. But I also tried running the script on the Mac and, vice versa, creating the data file on Linux.

clagagag · Apr 18, 2012

Actually, my mistake: all of this was done on a Mac.

rharsh · Apr 18, 2012

So I'm curious: why did you comment out 'use strict'? It probably would have given you some helpful errors. Until you are fluent, I would suggest using warnings (-w on your shebang line, or 'use warnings;' on a Windows box) and strict (the 'use strict;' line that was commented out) in all your scripts.

The following code is based on your pseudocode; see how this works for you:

Perl:

#!/usr/bin/perl -w
use strict;

### GLOBALS
my ($in_file, $out_file);	#Input and Output file names
my $subjectID = 'NoID';	# Default until a SubjectID is seen in input

### GET FILE NAMES
$in_file = $ARGV[0];
if (! defined $in_file) {
	die "Must specify an input file as ARGV 0.\n";
} elsif (! -r $in_file) {
	die "$in_file not readable.\n";
}

if ($in_file =~ m/(.+)(\..+)\s*$/) {
	$out_file = $1 . '_formatted' . $2;
} else {
	$out_file = $in_file . '_formatted';
}

# VERIFY OUTPUT FILE NAME
{
	my $verif_out_file = $out_file;
	while (-e $verif_out_file) {
		print "$verif_out_file exists, overwrite [y/n]: ";
		$_ = <STDIN>;
		if (m/^\s*[Yy]\s*$/) {
			# OVERWRITE EXISTING FILE
			last;
		} elsif (m/^\s*[Nn]\s*$/) {
			# CHOOSE NEW FILE
			print "Enter a new filename to write the results to: ";
			chomp($_ = <STDIN>);
			$verif_out_file = $_;
		} else {
			print "unrecognized input\n";
		}
	}
	$out_file = $verif_out_file;
}
		
### OPEN FILES
open IN, "< $in_file" or die "Cannot open $in_file for read.\n$!\n";
open OUT, "> $out_file" or die "Cannot open $out_file for write\n$!\n";

# PARSE INPUT AND PRINT TO OUTPUT FILES
while (<IN>) {
	chomp;
	if (m/^\s*SubjectID:\s*(\S+)/) {
		$subjectID = $1;
	} elsif (m/^\s*\d+/) {
		s/,/\t/g;
		print OUT "$subjectID\t[$_]\n";
	}
}
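
A side note on the digit test in the original script: `substr($line, 0, 0)` asks for zero characters, so `$firstchar` is always the empty string and the `/\d/` check can never match. The sketch below compares that call with a length of 1 and with an anchored regex like the one used above. The helper name `starts_with_digit` and the sample line are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample line in the shape of the posted data (illustrative only)
my $line = "4\t\t269474\tskunk,Animals,2.03";

my $zero_len = substr($line, 0, 0);   # length 0: always the empty string
my $one_char = substr($line, 0, 1);   # length 1: the first character

# Hypothetical helper: true if the line starts with a digit,
# allowing leading whitespace like the regex in the script above
sub starts_with_digit {
    my ($text) = @_;
    return $text =~ m/^\s*\d/ ? 1 : 0;
}

print "zero-length substr: '$zero_len'\n";
print "one-char substr:    '$one_char'\n";
print "starts with digit:  ", starts_with_digit($line), "\n";
```

The regex form is probably the more forgiving choice here, since it also tolerates leading whitespace before the trial number.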

clagagag · Apr 18, 2012

Thanks for your help. I tried it, but the output file was empty. I'm not sure why, so I'll attach the data file.

Really appreciate the help.

rharsh · Apr 19, 2012

So if the files are getting created and the script is exiting normally, whatever problem exists is likely in the last section of code. Are you getting any errors? Have you tried any troubleshooting? What did you come up with?

rharsh · Apr 19, 2012

You're likely having problems with the line endings. Near the end of the script, make yours look like the lines in red below:

Code:

### OPEN FILES
open IN, "< $in_file" or die "Cannot open $in_file for read.\n$!\n";
open OUT, "> $out_file" or die "Cannot open $out_file for write\n$!\n";

[red]# Slurp the whole file, then split on Windows, Unix, or Mac line endings
my @input = split /[\x0a\x0d]+/, do { local $/; <IN> };

# PARSE INPUT AND PRINT TO OUTPUT FILES
foreach (@input) {[/red]
    if (m/^\s*SubjectID:\s*(\S+)/) {
        $subjectID = $1;
    } elsif (m/^\s*\d+/) {
        s/,/\t/g;
        print OUT "$subjectID\t[$_]\n";
    }
}
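
To illustrate why the split is needed: with the default input record separator `$/` set to "\n", a file that uses only carriage returns (classic Mac line endings) comes back from a line-by-line read as one long record, so the `^\s*\d+` test only ever sees the start of the file. This is a minimal sketch, assuming made-up sample data held in a string rather than the real data file; the variable names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample with CR-only line endings (not the real data file)
my $raw = "SubjectID: 7\rTrial\tCondition\r4\t\tskunk,Animals\r5\t\tchin,BodyParts\r";

# Read the string through an in-memory filehandle, line by line
open my $fh, '<', \$raw or die "Cannot open in-memory file: $!\n";
my @naive = <$fh>;    # $/ is "\n", so everything arrives as one record
close $fh;

# Split on any run of CR and/or LF characters instead
my @records = split /[\x0a\x0d]+/, $raw;

my $subject_id = 'NoID';
my @output;
foreach my $record (@records) {
    if ($record =~ m/^\s*SubjectID:\s*(\S+)/) {
        $subject_id = $1;                      # remember the current subject
    } elsif ($record =~ m/^\s*\d+/) {
        (my $tabbed = $record) =~ s/,/\t/g;    # commas to tabs on a copy
        push @output, "$subject_id\t$tabbed";
    }
}

print "naive read:  ", scalar(@naive), " record(s)\n";
print "after split: ", scalar(@records), " record(s)\n";
print "$_\n" for @output;
```

Because the pattern matches runs of `\x0a` and `\x0d`, the same split should handle Unix, Windows, and classic Mac files, at the cost of dropping blank lines.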

clagagag · Apr 20, 2012

That worked, thanks! I appreciate the help a lot!
