Help with shuffling data

LovecraftHP · May 19, 2005

Hi guys!

I'm working on a little program to randomly shuffle my data (for use in tenfold cross-validation experiments). My data file has 4332 records, so I thought I'd first fill an array with numbers from 1 to 4332 (each number being in there once) and then later use these numbers as indices to shuffle my data. The program already does everything I want it to apart from the all-important fact that it doesn't give me all numbers between 1 and 4332. Instead it repeats certain numbers even though I put in a test to prevent these repetitions. I have no idea what I'm doing wrong here so any help would be greatly appreciated! Here's my code so far:

Code:

{
while (x<(NR+1)) {
  number=1+int(rand()*4332)
  if (number in sample) {
    continue
    }
  else {
    inst[x]=$0
    sample[x]=number
    x++
    }
  }
}
END {
for (z=1;z<=NR;z++) {
  n=sample[z]
  print inst[n] > "random.txt"
  }
}

Thanks for looking!

PHV · May 19, 2005

Something like this?

Code:

{
  if (NR > 4332) exit
  do number = 1 + int(rand() * 4332); while (number in sample)
  sample[number] = $0
}
END {
  for (z = 1; z <= NR; ++z) if (z in sample) print sample[z] > "random.txt"
}
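
The difference from the first version is that the drawn number is now used as the array subscript. awk's "in" operator tests subscripts only, never stored values, so a test like (number in sample) cannot catch repeats when the numbers are stored as values of sample[x]. A minimal sketch of that behaviour; the function name key_vs_value and the value 4000 are made up for illustration:

```shell
# Hypothetical helper: shows that "in" looks at subscripts only.
key_vs_value() {
  awk 'BEGIN {
    sample[1] = 4000    # 1 is the subscript, 4000 is the stored value
    print ((4000 in sample) ? "4000: found" : "4000: not found")
    print ((1 in sample) ? "1: found" : "1: not found")
  }'
}

key_vs_value
```

The stored value 4000 is reported as not found, while the subscript 1 is found.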

Hope This Helps, PH.

vgersh99 · May 19, 2005

You need to 'seed' your 'rand' with srand().

man nawk said:
srand([expr])
Set the seed value for rand to expr, or use the time of day if expr is omitted. The previous seed value will be returned.
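
A small sketch of what the seed does, assuming a fixed seed of 42 chosen purely for illustration: seeding with the same value reproduces the same sequence, whereas srand() with no argument seeds from the time of day as the man page describes. The function name two_randoms is hypothetical.

```shell
# Hypothetical helper: print two random numbers for a given seed.
two_randoms() {
  awk -v seed="$1" 'BEGIN {
    srand(seed)          # fixed seed, so the sequence is repeatable
    print rand(), rand()
  }'
}

first=$(two_randoms 42)
second=$(two_randoms 42)
[ "$first" = "$second" ] && echo "same seed, same sequence"
```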

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+

LovecraftHP · May 19, 2005

Thank you very much, everybody.

PHV, I think that is exactly what I needed!

Always learning a little bit more by looking at somebody else's take on a problem. Thanks!

futurelet · May 19, 2005

PHV said:
do number=1+int(rand()*4332); while(number in sample)

This loop could have to iterate thousands of times before it finds a "good" random number.
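
As a rough estimate of that cost (a back-of-the-envelope sketch, not a measurement): when only one of the 4332 numbers is still unused, each draw hits it with probability 1/4332, so the last record alone needs about 4332 attempts on average. Summing n/k over all records gives the expected total number of rand() calls. The function name expected_draws is hypothetical.

```shell
# Hypothetical helper: expected number of draws for the rejection loop.
expected_draws() {
  awk -v n="$1" 'BEGIN {
    # With k numbers still unused, a draw succeeds with probability k/n,
    # so it takes n/k draws on average to find one of them.
    for (k = 1; k <= n; k++) total += n / k
    printf "%d\n", total
  }'
}

expected_draws 4332
```

For 4332 records this comes to somewhat under 39,000 draws, compared with exactly 4332 when no draw is ever rejected.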

In the code below, every random number produced is "good"; none are discarded.

Code:

BEGIN { srand() }

{ sample[NR] = $0 }

END {
  while ( length(sample) )
  {
    random = int( rand() * length(sample) ) + 1
    print sample[random]
    sample[random] = sample[length(sample)]
    delete sample[length(sample)]
  }
}

PHV · May 19, 2005

Which flavor of awk admits the use of array 'sample' in a scalar context ( length(sample) )?

futurelet · May 19, 2005

PHV said:
Which flavor of awk admits the use of array 'sample' in a scalar context ( length(sample) )?

Which awk? The awk. That is, The One True AWK; the awk by Brian Kernighan. He is one of the creators of awk and co-author of "The C Programming Language".
Ever heard of him?

futurelet · May 19, 2005

Download awk95.exe from Brian Kernighan's site:

http://cm.bell-labs.com/cm/cs/awkbook/index.html

Unix users can download the source.

I would guess that the current version of gawk has a length that works this way.

vgersh99 · May 19, 2005

Solaris' nawk - doesn't support it [no wonder]
Even Solaris' POSIX awk [/usr/xpg4/bin/awk] - doesn't support it.
My old gawk [GNU Awk 3.1.0] doesn't have it either

It would be nice to have it available....
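
Where length(array) is not available, one portable workaround is to count the elements with a for-in loop, or simply to track the count in a variable such as NR. A minimal sketch; the names count_lines and alen are made up for illustration and alen is not a built-in:

```shell
# Hypothetical wrapper: store every line, then report how many were stored.
count_lines() {
  awk '
    # Count array elements without length(array).
    # k and n are locals, declared as extra parameters by awk convention.
    function alen(arr,   k, n) {
      n = 0
      for (k in arr) n++
      return n
    }
    { sample[NR] = $0 }
    END { print alen(sample) }
  ' "$@"
}

printf 'a\nb\nc\n' | count_lines
```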

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+

futurelet · May 19, 2005

Can't you download the source and compile it? Isn't that how Unix users install software?

vgersh99 · May 19, 2005

.... just statin' the facts - nothing more!

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+

LovecraftHP · May 19, 2005

One last question, guys and gals. I've almost got this program finished. As mentioned above, it's supposed to first randomize the data and then produce 10 different files, each containing a different 10% of the randomized data. Up to this point the program works. But for each of the 10 files it also has to produce the accompanying other 90% of the randomized data. Since the original data has 4332 records, this should result in 10 files of 433 records and 10 accompanying files of 3899 records each. The 433-record files I get, but instead of 10 3899-record files I get 10 files with either 3898 or 3899 records, which of course can't be right. This is the code:

Code:

{
  while (number in sample) {
    number=1+int(rand()*4332)
    }
  sample[number]=$0
}

END {
  i=0
  while (i<10) {
    m=1
    for (x=1;x<=NR;x++) {
      backup[x]=sample[x]
      }
    while (z<(433*(i+1))) {
      if (z in sample) {
        print sample[z] > "test-" i ".txt"
        delete backup[z]
        }
      z++
      }
    while (m<(NR+1)) {
      if (backup[m]!="") {
        print backup[m] > "train-" i ".txt"
        }
      m++
      }
    delete backup
    i++
    }
}

Any help would be greatly appreciated!
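
One thing worth checking in the code above (offered as a possible explanation, not a confirmed diagnosis): number is uninitialized when the first record arrives, so the while test is false and that record is stored under the empty-string subscript rather than under a number from 1 to 4332. That would leave one numeric slot empty, which is consistent with the train files holding 3898 or 3899 records depending on where the gap falls. A small sketch that reproduces only the storing step; the function name first_slot and the 5-line input are illustrative:

```shell
# Hypothetical helper: shows where the first record ends up.
first_slot() {
  awk '
    {
      # Same storing logic as above; number starts out uninitialized.
      while (number in sample) number = 1 + int(rand() * 4332)
      sample[number] = $0
    }
    END {
      if ("" in sample) print "empty subscript holds: " sample[""]
    }
  ' "$@"
}

printf 'r1\nr2\nr3\nr4\nr5\n' | first_slot
```

Initializing number before the loop, or drawing it with a do-while as in the earlier reply, would avoid the empty subscript.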

futurelet · May 19, 2005

O.K.

My question is this: since the source is available as a compressed tar archive, can't most Unix users compile it and use it?

vgersh99 · May 19, 2005

futurelet said:
O.K.

My question is this: since the source is available as a compressed tar archive, can't most Unix users compile it and use it?

My answer: yes, they can!

My statement: there was no 'lure' in my previous posting - I was simply stating the facts and nothing more.

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+

LovecraftHP · May 20, 2005

Anyone? Please?

futurelet · May 20, 2005

Run this on your original data file.

Code:

BEGIN { srand() }

# Store all lines.
{ sample[NR] = $0 }

END {
  len = NR
  # Make scrambled copy of array.
  while ( len )
  {
    random = int( rand() * len ) + 1
    scrambled[++i] = sample[random]
    sample[random] = sample[len]
    len--
  }
  numgroups = 10
  groupsize = int( NR / numgroups )
  for (group=1; group<=numgroups; group++)
  {
    first = (group-1) * groupsize + 1
    last = first + groupsize - 1
    if ( (NR - last) < groupsize )
      last = NR
    for (i=1; i<=NR; i++)
    { if ( i >= first && i <= last )
        print scrambled[i] > ("test-" group ".txt")
      else
        print scrambled[i] > ("train-" group ".txt")
    }
    close( "test-" group ".txt")
    close( "train-" group ".txt")
  }
}
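
To see how the group boundaries fall for 4332 records, the same first/last arithmetic can be run on its own. A sketch assuming 4332 records and 10 groups, with the function name fold_bounds and the variable name total made up for illustration:

```shell
# Hypothetical helper: print group number, first index, last index, size.
fold_bounds() {
  awk -v total="$1" -v numgroups="$2" 'BEGIN {
    groupsize = int(total / numgroups)
    for (group = 1; group <= numgroups; group++) {
      first = (group - 1) * groupsize + 1
      last = first + groupsize - 1
      # Fold any leftover records into the final group.
      if ((total - last) < groupsize) last = total
      print group, first, last, last - first + 1
    }
  }'
}

fold_bounds 4332 10
```

With these inputs the first nine test files get 433 records each and the last one gets 435, since 4332 is not a multiple of 10; the matching train files hold the remaining 3899 or 3897 records.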

LovecraftHP · May 27, 2005

Thank you, futurelet. I do think that is exactly what I need. Thanks!

marsd · May 28, 2005

Futurelet:

Your postings in this thread are irregular and undereducated IMO. I can circulate anything in tarred and compressed form and not expect it to compile on a *nix.

FYI: gawk in my experience has supplanted and improved on most of the 'traditional' offerings. The implementation of more complicated data structures to re-represent simple linked lists and arrays, more sophisticated algorithms, etc., make many of the suggestions you make nonsensical.

The 'length function' you are suggesting is not going to work when applied to anything progressive or specific.

M

LovecraftHP · Jul 12, 2005

One last (and probably all too obvious) question. In:

Code:

  while ( len )
  {
    random = int( rand() * len ) + 1
    scrambled[++i] = sample[random]
    sample[random] = sample[len]
    len--
  }

why do you need the following line?

Code:

    sample[random] = sample[len]

If anyone could please explain, I might understand the logic behind it and will be able to use it myself next time. Thanks!

futurelet · Jul 12, 2005

Every time a value from the "sample" array is used, it is discarded and the perceived length of the array is decreased by one. These two lines accomplish that:

Code:

    sample[random] = sample[len]
    len--

We replace the value sample[random], which has just been used, with the value at the very end of the array, sample[len]. We use the value at the very end in order to save it from the oblivion that threatens it because the "len--" command will make that array position inaccessible in the future.
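
A single step of that loop can be traced on a four-element array. This is only an illustrative sketch: the pick is fixed at position 2 instead of using rand(), and the function name swap_step is made up.

```shell
# Hypothetical helper: one pick-and-shrink step with a fixed pick.
swap_step() {
  awk 'BEGIN {
    len = split("a b c d", sample, " ")
    random = 2                      # fixed pick instead of rand()
    picked = sample[random]         # the value that gets used
    sample[random] = sample[len]    # rescue the last element
    len--                           # position 4 is now out of reach
    printf "picked=%s remaining=", picked
    for (i = 1; i <= len; i++)
      printf "%s%s", sample[i], (i < len ? " " : "\n")
  }'
}

swap_step
```

This prints picked=b remaining=a d c: the used value b is gone, and d has moved from the discarded last position into the freed slot.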
