Help with shuffling data 3

Status
Not open for further replies.

LovecraftHP

Programmer
Dec 3, 2002
15
CN
Hi guys!

I'm working on a little program to randomly shuffle my data (for use in tenfold cross-validation experiments). My data file has 4332 records, so I thought I'd first fill an array with numbers from 1 to 4332 (each number being in there once) and then later use these numbers as indices to shuffle my data. The program already does everything I want it to apart from the all-important fact that it doesn't give me all numbers between 1 and 4332. Instead it repeats certain numbers even though I put in a test to prevent these repetitions. I have no idea what I'm doing wrong here so any help would be greatly appreciated! Here's my code so far:

Code:
{
while (x<(NR+1)) {
  number=1+int(rand()*4332)
  if (number in sample) {
    continue
    }
  else {
    inst[x]=$0
    sample[x]=number
    x++
    }
  }
}
END {
for (z=1;z<=NR;z++) {
  n=sample[z]
  print inst[n] > "random.txt"
  }
}

Thanks for looking!
 
Something like this?
Code:
{
if(NR>4332) exit
do number=1+int(rand()*4332); while(number in sample)
sample[number]=$0
}
END {
for(z=1;z<=NR;++z) if(z in sample) print sample[z] > "random.txt"
}

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886
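For anyone wondering why keying the array by the random number matters: awk's "in" operator tests subscripts, not stored values. A minimal sketch (the array names byindex and bynumber are made up for illustration):

```shell
# "in" looks at subscripts, never at stored values.
result=$(awk 'BEGIN {
  byindex[1] = 42       # 42 is only a value here, stored under subscript 1
  bynumber[42] = "x"    # 42 is the subscript here
  a = (42 in byindex)  ? "found" : "missing"
  b = (42 in bynumber) ? "found" : "missing"
  print a, b
}')
echo "$result"   # missing found
```

So a duplicate test of the form (number in sample) can only work when sample is indexed by the number itself.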
 
You need to seed rand() by calling srand() first. Without it, rand() typically produces the same sequence on every run.

man nawk said:
srand([expr])
Set the seed value for rand to expr or use the time of day if expr is omitted. The previous seed value will be returned.

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
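A quick way to see what the seed does, as a rough sketch (the seed value 42 is arbitrary): reseeding with the same value replays the same sequence, which is why an unseeded or identically seeded program keeps producing the same "random" order.

```shell
# Reseeding with the same value makes rand() replay the same sequence.
same=$(awk 'BEGIN {
  srand(42); a = rand()
  srand(42); b = rand()
  verdict = (a == b) ? "same" : "different"
  print verdict
}')
echo "$same"   # same
```

Calling srand() with no argument in a BEGIN block seeds from the time of day instead, so separate runs normally differ.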
 
Thank you very much, everybody.

PHV, I think that is exactly what I needed! :) Always learning a little bit more by looking at somebody else's take on a problem. Thanks!
 
PHV said:
[tt]do number=1+int(rand()*4332); while(number in sample)[/tt]
This loop may have to iterate thousands of times before it finds a "good" random number, because most draws collide with numbers already used once the array is nearly full.

In the code below, every random number produced is "good"; none are discarded.
Code:
BEGIN { srand() }

{ sample[NR] = $0 }

END {
  while ( length(sample) )
  {
    random = int( rand() * length(sample) ) + 1
    print sample[random]
    sample[random] = sample[length(sample)]
    delete sample[length(sample)]
  }
}
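To get a feel for the difference, here is a rough sketch that counts how many rand() calls the retry loop needs to collect all 4332 distinct numbers. The variable names (seen, total) are my own, and the count varies from run to run; it is typically in the tens of thousands, while the swap-based version above needs exactly one call per record.

```shell
# Count rand() calls used by "draw again until unused" for n = 4332.
draws=$(awk 'BEGIN {
  srand()
  n = 4332
  for (k = 1; k <= n; k++) {
    # keep drawing until we hit a number that has not been used yet
    do { number = 1 + int(rand() * n); total++ } while (number in seen)
    seen[number] = 1
  }
  print total
}')
echo "$draws"
```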
 
Which flavor of awk admits the use of array 'sample' in a scalar context ( length(sample) )?
 
PHV said:
Which flavor of awk admits the use of array 'sample' in a scalar context ( length(sample) )?
Which awk? The awk. That is, The One True AWK; the awk by Brian Kernighan. He is one of the creators of awk and of C.
Ever heard of him?
 
Solaris' nawk - doesn't support it [no wonder]
Even Solaris' POSIX awk [/usr/xpg4/bin/awk] - doesn't support it.
My old gawk [GNU Awk 3.1.0] doesn't have it either

It would be nice to have it available....

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
 
Can't you download the source and compile it? Isn't that how Unix users install software?
 
.... just statin' the facts - nothing more!

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
 
One last question, guys and gals. I've almost got this program finished. As mentioned above, it's supposed to first randomize the data and then produce 10 different files, each containing a different 10% of the randomized data. Up to this point the program works. But for each of those 10 files it also has to produce a file with the other 90% of the randomized data. Since the original data has 4332 records, this should result in 10 files of 433 records and 10 accompanying files of 3899 records each. I get the 433-record files, but instead of 10 files of 3899 records I get 10 files with either 3898 or 3899 records, which of course can't be right. This is the code:

Code:
{
  while (number in sample) {
    number=1+int(rand()*4332)
    }
  sample[number]=$0
}

END {
  i=0
  while (i<10) {
    m=1
    for (x=1;x<=NR;x++) {
      backup[x]=sample[x]
      }
    while (z<(433*(i+1))) {
      if (z in sample) {
        print sample[z] > "test-" i ".txt"
        delete backup[z]
        }
      z++
      }
    while (m<(NR+1)) {
      if (backup[m]!="") {
        print backup[m] > "train-" i ".txt"
        }
      m++
      }
    delete backup
    i++
    }
}

Any help would be greatly appreciated!
 
O.K.

My question is this: since the source is available as a compressed tar archive, can't most Unix users compile it and use it?
 
futurelet said:
O.K.

My question is this: since the source is available as a compressed tar archive, can't most Unix users compile it and use it?

My answer: yes, they can!

My statement: there was no 'lure' in my previous posting - I was simply stating the facts and nothing more.

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
 
Run this on your original data file.
Code:
BEGIN { srand() }

# Store all lines.
{ sample[NR] = $0 }

END {
  len = NR
  # Make scrambled copy of array.
  while ( len )
  {
    random = int( rand() * len ) + 1
    scrambled[++i] = sample[random]
    sample[random] = sample[len]
    len--
  }
  numgroups = 10
  groupsize = int( NR / numgroups )
  for (group=1; group<=numgroups; group++)
  {
    first = (group-1) * groupsize + 1
    last = first + groupsize - 1
    if ( (NR - last) < groupsize )
      last = NR
    for (i=1; i<=NR; i++)
    { if ( i >= first && i <= last )
        print scrambled[i] > ("test-" group ".txt")
      else
        print scrambled[i] > ("train-" group ".txt")
    }
    close( "test-" group ".txt")
    close( "train-" group ".txt")
  }
}
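Since 4332 is not divisible by 10, the group sizes are worth checking. The sketch below repeats only the boundary arithmetic from the code above (with total standing in for NR) and prints group, first, last, test size and train size. With this logic the two leftover records are folded into the last group.

```shell
# Boundary arithmetic only: group, first, last, test size, train size.
folds=$(awk 'BEGIN {
  total = 4332; numgroups = 10          # total stands in for NR
  groupsize = int(total / numgroups)    # 433
  for (group = 1; group <= numgroups; group++) {
    first = (group - 1) * groupsize + 1
    last = first + groupsize - 1
    if ((total - last) < groupsize) last = total   # last group absorbs the remainder
    size = last - first + 1
    print group, first, last, size, total - size
  }
}')
echo "$folds"
# first line: 1 1 433 433 3899
# last line:  10 3898 4332 435 3897
```

So groups 1 to 9 give 433 test and 3899 train records, and group 10 gives 435 and 3897.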
 
Thank you, futurelet. I do think that is exactly what I need. Thanks!
 
Futurelet:

Your postings in this thread are irregular and undereducated IMO. I can circulate anything in tarred and compressed form and not expect it to compile on a *nix.

FYI: gawk in my experience has supplanted and improved on most of the 'traditional' offerings. The implementation of more complicated data structures to re-represent simple linked lists and arrays, more sophisticated algorithms, etc. make many of the suggestions you make nonsensical.

The 'length function' you are suggesting is not going to work when applied to anything progressive or specific.

M
 
One last (and probably all too obvious) question. In:

Code:
  while ( len )
  {
    random = int( rand() * len ) + 1
    scrambled[++i] = sample[random]
    sample[random] = sample[len]
    len--
  }

why do you need the following line?

Code:
    sample[random] = sample[len]

If anyone could please explain, I might understand the logic behind it and will be able to use it myself next time. Thanks!
 
Every time a value from the "sample" array is used, it is discarded and the perceived length of the array is decreased by one. These two lines accomplish that:
Code:
    sample[random] = sample[len]
    len--
We replace the value sample[random], which has just been used, with the value at the very end of the array, sample[len]. We use the value at the very end in order to save it from the oblivion that threatens it, because the "len--" statement will make that array position inaccessible in the future.
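A small trace may make this easier to follow. This is only an illustrative sketch: the four-element array and the fixed picks 2, 1, 2, 1 are made up and stand in for what int( rand() * len ) + 1 would return, so the result is reproducible.

```shell
# Trace of "use one, overwrite it with the last, shrink" on A B C D.
order=$(awk 'BEGIN {
  len = split("A B C D", sample, " ")
  split("2 1 2 1", pick, " ")   # hypothetical picks in place of int(rand()*len)+1
  while (len) {
    random = pick[++step]
    out = out (out == "" ? "" : " ") sample[random]   # use the chosen value
    sample[random] = sample[len]                      # move the last value into the hole
    len--                                             # shrink the live part of the array
  }
  print out
}')
echo "$order"   # B A D C
```

After the first pick the live part of the array is A D C, then C D, then C. Every value comes out exactly once, and without the overwrite the value at the end would be lost each time len shrinks.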
 