Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations gkittelson on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Counting number of commas in CSV header 1

Status
Not open for further replies.

NickFort

Technical User
Jun 10, 2010
113
Hi all,

I'd like to be able to read a header line of a CSV as a character type, and then count the number of commas there (i.e. to determine the number of columns), without having to assign a fixed length to the character.

The way I'd like to do it is to read the first line character by character, each time search for a match between the currently read character and the delimiter, to count the number of delimiters, and stop at the end of the first line. Unfortunately, Fortran reads vertically with each successive read call, not horizontally.

One option is to throw an "overkill" character variable at the read, of length greater than the maximum number of characters in the header; the problem with this is that it's not maintainable. I can't guarantee what the longest line could possibly be, and 99% of the time, it'll be a waste of memory.

To illustrate what I mean, say I want to read a CSV containing:

Code:
HEAD1,HEAD2,HEAD3,HEAD4,HEAD5,HEAD6
3,TEXT,2.3452,6,1.2246,7.64E+12
6,TEXT2,32.235247,12.2,3467.12,1.21E-05

One option is to do the following (which doesn't do the counting of commas yet):

Code:
program test_read_line

implicit none

character(len=1000) :: current
character :: delimiter
integer :: read_status

open (unit=11, file="mydata.csv", action="read", status="old")

read (11,'(A1000)',iostat=read_status) current
print *, "read_status: ", read_status
print *, "current: ", trim(adjustl(current))

close (11)

end program test_read_line

So here, I assign a length to "current" which is greater than the number of characters in the the header line. I could search through the "current" string to count the number of delimiters, but that's not optimal, in my opinion, because I've assigned way more memory to it than I need with the file as it currently is, but what if columns are added, and the header line exceeds 1000 characters?

Is there a more efficient way of doing something like this?

--------------------------------------
Background: Chemical engineer, familiar mostly with MATLAB, but now branching out into real programming.
 
And to count the number of columns, I could do this:

Code:
program test_read_line

implicit none

integer, parameter :: max_header=1000
character(len=max_header) :: header_string
character :: delimiter
integer :: read_status, string_length, posit, delimiter_count

delimiter_count = 0

delimiter = ","

open (unit=11, file="mydata.csv", action="read", status="old")

read (11,'(A1000)',iostat=read_status) header_string

do
    header_string = adjustl(header_string)
    posit = index(header_string,delimiter)
    if (posit==0) exit
    delimiter_count = delimiter_count + 1
    header_string = header_string(posit+1:)
end do

print *, "Number of headers: ", delimiter_count+1

close (11)

end program test_read_line

--------------------------------------
Background: Chemical engineer, familiar mostly with MATLAB, but now branching out into real programming.
 
Is there a more efficient way of doing something like this?
IMHO there isn't more efficient way, I would do that at similar way as you, except I would create a function for this purose (iterative or recursive) :
Code:
[COLOR=#a020f0]program[/color] csv_columns
[COLOR=#2e8b57][b]implicit[/b][/color] [COLOR=#2e8b57][b]none[/b][/color]
[COLOR=#2e8b57][b]character[/b][/color]([COLOR=#ff00ff]80[/color]) :: line
line [COLOR=#804040][b]=[/b][/color] [COLOR=#ff00ff]'HEAD1,HEAD2,HEAD3,HEAD4,HEAD5,HEAD6'[/color]

[COLOR=#804040][b]write[/b][/color]([COLOR=#804040][b]*[/b][/color],[COLOR=#804040][b]*[/b][/color]) [COLOR=#ff00ff]'line = '[/color], line
[COLOR=#804040][b]write[/b][/color]([COLOR=#804040][b]*[/b][/color],[COLOR=#804040][b]*[/b][/color]) [COLOR=#ff00ff]'Number of columns = '[/color], csv_columns_count(line, [COLOR=#ff00ff]','[/color])
[COLOR=#804040][b]write[/b][/color]([COLOR=#804040][b]*[/b][/color],[COLOR=#804040][b]*[/b][/color]) [COLOR=#ff00ff]'Number of columns = '[/color], r_csv_columns_count(line, [COLOR=#ff00ff]','[/color])

[COLOR=#0000ff]! functions used[/color]
[COLOR=#a020f0]contains[/color]
[COLOR=#2e8b57][b]integer[/b][/color] [COLOR=#a020f0]function[/color] csv_columns_count(line, delimiter)
  [COLOR=#0000ff]! number of colums in CSV line - iterastive  [/color]
  [COLOR=#2e8b57][b]implicit[/b][/color] [COLOR=#2e8b57][b]none[/b][/color]
  [COLOR=#2e8b57][b]character[/b][/color]([COLOR=#804040][b]*[/b][/color]), [COLOR=#2e8b57][b]intent[/b][/color]([COLOR=#2e8b57][b]in[/b][/color]) :: line
  [COLOR=#2e8b57][b]character[/b][/color], [COLOR=#2e8b57][b]intent[/b][/color]([COLOR=#2e8b57][b]in[/b][/color]) :: delimiter
  [COLOR=#2e8b57][b]integer[/b][/color] :: delimiter_idx
  [COLOR=#2e8b57][b]character[/b][/color]([COLOR=#008080]len[/color](line)) :: str

  str [COLOR=#804040][b]=[/b][/color] [COLOR=#008080]adjustl[/color](line)
  csv_columns_count [COLOR=#804040][b]=[/b][/color] [COLOR=#ff00ff]0[/color]
  delimiter_idx [COLOR=#804040][b]=[/b][/color] [COLOR=#ff00ff]1[/color]
  [COLOR=#804040][b]do[/b][/color] [COLOR=#804040][b]while[/b][/color] (delimiter_idx [COLOR=#804040][b]>[/b][/color] [COLOR=#ff00ff]0[/color])
    delimiter_idx [COLOR=#804040][b]=[/b][/color] [COLOR=#008080]index[/color](str, delimiter)
    str [COLOR=#804040][b]=[/b][/color] str(delimiter_idx[COLOR=#804040][b]+[/b][/color][COLOR=#ff00ff]1[/color]:)
    csv_columns_count [COLOR=#804040][b]=[/b][/color] csv_columns_count [COLOR=#804040][b]+[/b][/color] [COLOR=#ff00ff]1[/color]
  [COLOR=#804040][b]end do[/b][/color]
[COLOR=#a020f0]end function[/color] csv_columns_count

[COLOR=#a020f0]recursive[/color] [COLOR=#a020f0]function[/color] r_csv_columns_count(line, delimiter) [COLOR=#a020f0]result[/color](res)
  [COLOR=#0000ff]! number of colums in CSV line - recursive  [/color]
  [COLOR=#2e8b57][b]implicit[/b][/color] [COLOR=#2e8b57][b]none[/b][/color]
  [COLOR=#2e8b57][b]character[/b][/color]([COLOR=#804040][b]*[/b][/color]), [COLOR=#2e8b57][b]intent[/b][/color]([COLOR=#2e8b57][b]in[/b][/color]) :: line
  [COLOR=#2e8b57][b]character[/b][/color], [COLOR=#2e8b57][b]intent[/b][/color]([COLOR=#2e8b57][b]in[/b][/color]) :: delimiter
  [COLOR=#2e8b57][b]integer[/b][/color] :: delimiter_idx, res

  delimiter_idx [COLOR=#804040][b]=[/b][/color] [COLOR=#008080]index[/color](line, delimiter)
  [COLOR=#804040][b]if[/b][/color] (delimiter_idx [COLOR=#804040][b].eq.[/b][/color] [COLOR=#ff00ff]0[/color]) [COLOR=#804040][b]then[/b][/color]
    res [COLOR=#804040][b]=[/b][/color] [COLOR=#ff00ff]1[/color]
  [COLOR=#804040][b]else[/b][/color]
    res [COLOR=#804040][b]=[/b][/color] [COLOR=#ff00ff]1[/color] [COLOR=#804040][b]+[/b][/color] r_csv_columns_count(line(delimiter_idx[COLOR=#804040][b]+[/b][/color][COLOR=#ff00ff]1[/color]:), delimiter)
  [COLOR=#804040][b]end if[/b][/color]
[COLOR=#a020f0]end function[/color] r_csv_columns_count

[COLOR=#a020f0]end program[/color] csv_columns
Code:
$ g95 col_count.f95 -o col_count

$ col_count
 line = HEAD1,HEAD2,HEAD3,HEAD4,HEAD5,HEAD6                                             
 Number of columns =  6
 Number of columns =  6
...but what if columns are added, and the header line exceeds 1000 characters?
Maybe I don't understand exactly what you mean, but Fortran is statically typed language, so you need to know the maximal possible length of the line processed. If you mean that the line could be more than 1000 characters long, then declare it longer.
 
Thanks for the reply, mikrom.

Yeah, I can't see another way of doing it.

I was just wondering if there was a way to read the first character, check if it is a delimiter, move to the next character horizontally, check if it is a delimiter, and so on, until the end of the line is reached. That way, you have a working variable which is a character of length 1, and is also completely flexible in that it doesn't require you to know the longest possible read line beforehand.

As I say, the issue is that Fortran reads columns by entire horizontal line at a time, rather than being able to do it by single character in a line, and move horizontally through the line through successive reads.

It's a bit like the allocatable arrays vs static arrays. If allocatable arrays didn't exist, you would have to anticipate what the largest array you'd ever need for that problem is. Therefore, most of the time you're wasting memory when the array you need is smaller than the "maximum-size" static array you've declared, but if you anticipate the incorrect maximum and your program needs a bigger array, there's nothing you can do. Similarly, I find it wasteful to use a string that's probably longer than I need, but could also not be long enough if I don't estimate correctly.

Sorry about this little rant. What's annoying me is the principle of writing inefficient, wasteful code. In practice, I'm sure I could throw a string of length 10000 at it -- which translates to 10 kB of RAM -- and never have to worry about it. For modern PCs, that's a laughable amount of RAM, I know, but as I say, the principle is what's getting to me...

/rant

--------------------------------------
Background: Chemical engineer, familiar mostly with MATLAB, but now branching out into real programming.
 
I was just wondering if there was a way to read the first character, check if it is a delimiter, move to the next character horizontally, check if it is a delimiter, and so on, until the end of the line is reached.
Yes there is, but this brute force way is IMHO not better than using the intrinsic function index() because:
1. Processing a line (i.e. a string) character by character needs everytime len_trim(adjustl(line)) of iterations. For your example it's 35 iterations versus 6 iterations using index().
2. I think that the intrinsic function index() is optimalized and it would be better to use it, than program the thing self.

Here is the example for processing the string character by character:
Code:
[COLOR=#a020f0]program[/color] csv_columns
[COLOR=#2e8b57][b]implicit[/b][/color] [COLOR=#2e8b57][b]none[/b][/color]
[COLOR=#2e8b57][b]character[/b][/color]([COLOR=#ff00ff]80[/color]) :: line
line [COLOR=#804040][b]=[/b][/color] [COLOR=#ff00ff]'HEAD1,HEAD2,HEAD3,HEAD4,HEAD5,HEAD6'[/color]
[COLOR=#804040][b]write[/b][/color]([COLOR=#804040][b]*[/b][/color],[COLOR=#804040][b]*[/b][/color]) [COLOR=#ff00ff]'line = '[/color], line
[COLOR=#804040][b]write[/b][/color]([COLOR=#804040][b]*[/b][/color],[COLOR=#804040][b]*[/b][/color]) [COLOR=#ff00ff]'Number of columns = '[/color], csv_columns_count2(line, [COLOR=#ff00ff]','[/color])

[COLOR=#0000ff]! functions used[/color]
[COLOR=#a020f0]contains[/color]

[COLOR=#2e8b57][b]integer[/b][/color] [COLOR=#a020f0]function[/color] csv_columns_count2(line, delimiter)
  [COLOR=#0000ff]! number of colums in CSV line - iterative  [/color]
  [COLOR=#2e8b57][b]implicit[/b][/color] [COLOR=#2e8b57][b]none[/b][/color]
  [COLOR=#2e8b57][b]character[/b][/color]([COLOR=#804040][b]*[/b][/color]), [COLOR=#2e8b57][b]intent[/b][/color]([COLOR=#2e8b57][b]in[/b][/color]) :: line
  [COLOR=#2e8b57][b]character[/b][/color], [COLOR=#2e8b57][b]intent[/b][/color]([COLOR=#2e8b57][b]in[/b][/color]) :: delimiter
  [COLOR=#2e8b57][b]character[/b][/color]([COLOR=#008080]len_trim[/color](line)) :: str
  [COLOR=#2e8b57][b]integer[/b][/color] :: j

  str [COLOR=#804040][b]=[/b][/color] [COLOR=#008080]adjustl[/color](line)
  csv_columns_count2 [COLOR=#804040][b]=[/b][/color] [COLOR=#ff00ff]1[/color]
  [COLOR=#804040][b]do[/b][/color] j [COLOR=#804040][b]=[/b][/color] [COLOR=#ff00ff]1[/color], [COLOR=#008080]len_trim[/color](str)
    [COLOR=#804040][b]if[/b][/color] (str(j:j) [COLOR=#804040][b].eq.[/b][/color] delimiter) [COLOR=#804040][b]then[/b][/color]
       csv_columns_count2 [COLOR=#804040][b]=[/b][/color] csv_columns_count2 [COLOR=#804040][b]+[/b][/color] [COLOR=#ff00ff]1[/color]
    [COLOR=#804040][b]end if[/b][/color]
  [COLOR=#804040][b]end do[/b][/color]
[COLOR=#a020f0]end function[/color] csv_columns_count2

[COLOR=#a020f0]end program[/color] csv_columns
Code:
$ g95 col_count_02.f95 -o col_count_02

$ col_count_02
 line = HEAD1,HEAD2,HEAD3,HEAD4,HEAD5,HEAD6                                             
 Number of columns =  6
 
I think you're missing my point a bit, mikrom.

Using "index" is the better way to do it, once you've read the whole line into a variable. The problem is that you've declared "character(80) :: line", and if you read into that, you'd better hope that 80 characters is enough to hold that string.

What I was proposing is read one character at a time, so declare "character :: single_byte", then read the first character into "single_byte", check if it's a delimiter, read the next character into "single_byte", check if that one is a delimiter, and so on, until the end of the line. It can't be done that way, but I wish it could be, so that there would be no hoping that the declared variable is big enough to hold the full line.

The process would be something like:

1) Declare "character :: single_byte"
2) Read the first character from the file into "single_byte", which makes "single_byte" hold the character "H".
3) Check if what "single_byte" contains is a delimiter. This is not a delimiter, so just move on.
4) Read the next character from the file into "single_byte", which then holds "E", now.
5) Check if "single_byte" is a delimiter.
6) Repeat 4 and 5, which successively become, "A", "D", "1", and then ",", at which point the delimiter count is increased by 1, and the process continues, finally stopping at the end of the line.

Does that make sense now?

--------------------------------------
Background: Chemical engineer, familiar mostly with MATLAB, but now branching out into real programming.
 
I'm personally very happy that Fortran supports implicitely reading file line by line. It's optimal for data processing. With reading file character by character you need to bother with end of lines (if it's CR or LF or CR+LF) and other problems... IMHO it's waste of time to do the things so, until isn't necessary.
However, the low level file processing is possible in Fortran too, but you need to process the file as unformatted.
For your purpose - processing only the first line - you can do something like this:

read_unformatted.f95
Code:
[COLOR=#a020f0]program[/color] read_unformatted
  [COLOR=#2e8b57][b]implicit[/b][/color] [COLOR=#2e8b57][b]none[/b][/color]
  [COLOR=#2e8b57][b]integer[/b][/color] :: stat, j, nr_columns
  [COLOR=#2e8b57][b]character[/b][/color] :: single_byte, CR, LF, column_delimiter

  LF [COLOR=#804040][b]=[/b][/color] [COLOR=#008080]char[/color]([COLOR=#ff00ff]10[/color]) [COLOR=#0000ff]! Line Feed[/color]
  CR [COLOR=#804040][b]=[/b][/color] [COLOR=#008080]char[/color]([COLOR=#ff00ff]13[/color]) [COLOR=#0000ff]! Carriage Return[/color]
  column_delimiter [COLOR=#804040][b]=[/b][/color] [COLOR=#ff00ff]','[/color]

  [COLOR=#0000ff]! open file[/color]
  [COLOR=#804040][b]open[/b][/color] ([COLOR=#ff00ff]1[/color], [COLOR=#804040][b]file[/b][/color][COLOR=#804040][b]=[/b][/color][COLOR=#ff00ff]'csv_f.csv'[/color], [COLOR=#804040][b]form[/b][/color][COLOR=#804040][b]=[/b][/color][COLOR=#ff00ff]'unformatted'[/color], [highlight #ffff00][COLOR=#0000ff]&[/color][/highlight]
       [COLOR=#804040][b]access[/b][/color][COLOR=#804040][b]=[/b][/color][COLOR=#ff00ff]'direct'[/color], [COLOR=#804040][b]status[/b][/color][COLOR=#804040][b]=[/b][/color][COLOR=#ff00ff]'old'[/color], [COLOR=#804040][b]recl[/b][/color] [COLOR=#804040][b]=[/b][/color] [COLOR=#ff00ff]1[/color], [COLOR=#804040][b]iostat[/b][/color][COLOR=#804040][b]=[/b][/color]stat)
  [COLOR=#804040][b]if[/b][/color] (stat [COLOR=#804040][b].ne.[/b][/color] [COLOR=#ff00ff]0[/color]) [COLOR=#804040][b]then[/b][/color]
    [COLOR=#804040][b]write[/b][/color]([COLOR=#804040][b]*[/b][/color],[COLOR=#804040][b]*[/b][/color]) [COLOR=#ff00ff]'File cannot be opened !'[/color]
    [COLOR=#804040][b]go to[/b][/color] [COLOR=#ff00ff]99[/color]
  [COLOR=#804040][b]end if[/b][/color]

  [COLOR=#0000ff]! process header line of the file[/color]
  j [COLOR=#804040][b]=[/b][/color] [COLOR=#ff00ff]0[/color]
  nr_columns [COLOR=#804040][b]=[/b][/color] [COLOR=#ff00ff]1[/color]
  [COLOR=#804040][b]do[/b][/color] [COLOR=#804040][b]while[/b][/color] ((single_byte [COLOR=#804040][b].ne.[/b][/color] CR) [COLOR=#804040][b].and.[/b][/color] (single_byte [COLOR=#804040][b].ne.[/b][/color] LF))
    j [COLOR=#804040][b]=[/b][/color] j [COLOR=#804040][b]+[/b][/color] [COLOR=#ff00ff]1[/color]
    [COLOR=#804040][b]read[/b][/color]([COLOR=#ff00ff]1[/color], [COLOR=#804040][b]rec[/b][/color] [COLOR=#804040][b]=[/b][/color] j) single_byte
    [COLOR=#804040][b]if[/b][/color] (single_byte [COLOR=#804040][b].eq.[/b][/color] column_delimiter) [COLOR=#804040][b]then[/b][/color]
      nr_columns [COLOR=#804040][b]=[/b][/color] nr_columns [COLOR=#804040][b]+[/b][/color] [COLOR=#ff00ff]1[/color]
    [COLOR=#804040][b]end if[/b][/color]
    [COLOR=#0000ff]!write (*,'(I3,5x,I3,5x,A)') j, ichar(single_byte), single_byte[/color]
  [COLOR=#804040][b]end do[/b][/color]

  [COLOR=#804040][b]write[/b][/color] ([COLOR=#804040][b]*[/b][/color],[COLOR=#804040][b]*[/b][/color]) [COLOR=#ff00ff]'Header line is '[/color], (j[COLOR=#804040][b]-[/b][/color][COLOR=#ff00ff]1[/color]) , [COLOR=#ff00ff]' character long and contains '[/color], [highlight #ffff00][COLOR=#0000ff]&[/color][/highlight]
              nr_columns, [COLOR=#ff00ff]' columns '[/color]
  [COLOR=#0000ff]! close file[/color]
  [COLOR=#6a5acd]99[/color] [COLOR=#804040][b]continue[/b][/color]
  [COLOR=#804040][b]close[/b][/color]([COLOR=#ff00ff]1[/color])
[COLOR=#a020f0]end program[/color]
Then with this simple file csv_f.csv
Code:
HEAD1,HEAD2,HEAD3,HEAD4,HEAD5,HEAD6
data1,data2,data3,data4,data5,data6
only the 1.line will be processed - with this result
Code:
$ g95 read_unformatted.f95 -o read_unformatted

$ read_unformatted
 Header line is  35  character long and contains  6  columns
You can uncomment the write statement in the code above to see how it works.

 
Hey, that's quite cool! :) Thanks, mikrom; I honestly thought that that couldn't be done. I'm very glad to know that both options are available in Fortran.

--------------------------------------
Background: Chemical engineer, familiar mostly with MATLAB, but now branching out into real programming.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top