true randomization

Status
Not open for further replies.

Graziella

Programmer
Jul 8, 2004
38
AT

Hi,

I need to split a file of 20290 lines into two files at random, 1000 times; each random split has to produce two subfiles, one containing 80% and the other 20% of the original 20290 lines.

I have to make sure that each of the 1000 80-20 splits of the corpus is different from the others.
Now, is this line at the beginning of my perl script enough to make sure that I have 1000 different 80-20 random splits of the corpus?

How do I seed the Perl function "rand" so that it produces a different sequence each time I run the program?

Grazia
 
This may be a platform specific issue.
When I call rand in Perl on Linux, I get different results every time.
It may be that Windows resets the seed every time; I don't know.
What platform are you running on?

Trojan.

 
Depends on what you really mean by "random". "srand" is the right tool, but what you pass it as an argument determines rand's behavior, so if the seed isn't somewhat random, your results won't be either.

For example, doing something like

srand(time() ^($$ + ($$ <<15))) ;

may answer your needs perfectly. It's not really random, but it's probably good enough.

For many situations, "good enough" is all we need. See for more on that.
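
To make the role of the seed concrete, here is a minimal sketch. The seed value 42 and the variable names are arbitrary choices for illustration. It shows that reseeding with the same value replays the same sequence, which is why the argument to srand matters. As far as I know, Perl 5.004 and later also call srand implicitly on the first use of rand if the script never calls it, which would explain getting different results on every run without any seeding.

```perl
use strict;
use warnings;

# Same seed, same sequence: rand is deterministic once seeded.
srand(42);                                   # 42 is an arbitrary example seed
my @first  = map { int(rand(1000)) } 1 .. 5;
srand(42);
my @second = map { int(rand(1000)) } 1 .. 5;
print "same seed repeats the sequence\n" if "@first" eq "@second";

# The time/PID mix from the post above gives a seed that varies between runs.
srand(time() ^ ($$ + ($$ << 15)));
print int(rand(1000)), "\n";
```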

Tony Lawrence
Linux/Unix/Mac OS X Resources
 
Tony,

What does your srand(time() ^($$ + ($$ <<15))) ;
mean? Why $$, and why $$<<15?

Grazia
 
That's out of the Camel book - $$ is the PID of your process - basically it's mixing the PID with the time to produce something reasonably random - but if you ran this as part of a boot-up script and your machine rebooted at the same time every day, it wouldn't necessarily be random. Likewise, if you were depending on this to protect assets, and someone knew your PID and the time that this ran, they could crack your protection. Even being close - knowing within a few seconds what the time was - could help because it cuts down on the number of values they'd have to try. That's why the article I referenced says you wouldn't use this for anything critical, but it might be plenty good enough for what you want to do.
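
The pieces of that expression can be pulled apart as in the sketch below. The helper name mix_seed is hypothetical and only wraps the expression so that each step can be labeled. It assumes integers wide enough that the shifted PID does not overflow, as on a typical 64-bit perl.

```perl
use strict;
use warnings;

# mix_seed is a hypothetical helper wrapping the expression from the post.
sub mix_seed {
    my ($time, $pid) = @_;
    my $shifted = $pid << 15;           # shift left 15 bits: $pid * 32768
    return $time ^ ($pid + $shifted);   # XOR the clock with the spread-out PID
}

my $seed = mix_seed(time(), $$);        # $$ is the PID of this process
srand($seed);
printf "pid=%d shifted=%d seed=%d\n", $$, $$ << 15, $seed;
```

The shift spreads the PID into higher bits, so two processes started in the same second still end up with seeds that differ in more than the lowest few bits.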



Tony Lawrence
Linux/Unix/Mac OS X Resources
 
Oh no, it is not for security programming.

I am just running tests on a classifier and a bunch of features I implemented.
My data set is kind of limited in size. Therefore, I am running a bunch of tests "blindly", i.e. without engineering the features on the test set (which I have not seen yet).
Of course, running tests with one random split does not seem very sound, because my one random split might be a lucky one or a very unlucky one.
This is why I am going to generate 1000 80-20 splits.
No security stuff.
I guess seeding srand as you suggested would work for my case, then.
Grazia
 
The problem is, no matter what you use for a seed there's still the possibility it could generate the same 'random' number. To guarantee that you don't get the same number, you'll have to keep track of the numbers you have used. Something like this might work for you:

Code:
my (%used_nums, @rand_nums, $num);
for (my $i = 0; $i < 10; $i++) {        # collect 10 numbers
    $num = int(rand(100)) + 1;          # random integer from 1 to 100
    if ($used_nums{$num}) {
        redo;                           # already used: retry without incrementing $i
    } else {
        $used_nums{$num}++;
        push @rand_nums, $num;
    }
}

After running that code, you'd end up with 10 unique random numbers from 1 to 100 in @rand_nums.
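
An alternative to tracking used numbers by hand is to shuffle the whole range and take the first ten. The sketch below uses shuffle from List::Util, which ships with Perl; choosing it here is my suggestion rather than something from the post above.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Shuffle 1..100 and keep the first 10: unique by construction, no retries.
my @rand_nums = (shuffle(1 .. 100))[0 .. 9];
print "@rand_nums\n";
```

This avoids the retry loop, which gets slower as the count requested approaches the size of the range.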
 
Thank you very much for the tutorial. I needed it !

Thank you,

Grazia
 
As you want 1000 different random number sets, you don't really need a random seed. You just need 1000 different seeds. So a simple for loop would do the trick, and your tests would be repeatable, too.
Code:
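my @data20290 = (<>);
for (my $seed = 1; $seed < 1001; $seed++) {
   srand($seed);   # fixed seed per split, so every split is repeatable
   # output file names below are placeholders; adjust as needed
   open(my $out20, '>', "split$seed.20") or die "split$seed.20: $!";
   open(my $out80, '>', "split$seed.80") or die "split$seed.80: $!";
   foreach (@data20290) {
      # 1-in-5 chance per line, so the split is only approximately 80/20
      if (int(rand(5)) == 4) {
         print $out20 $_;
      }
      else {
         print $out80 $_;
      }
   }
   close($out20);
   close($out80);
}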
(untested)
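
The loop above gives each line a one-in-five chance, so the 20% file comes out only roughly at 4058 lines. If the split has to be exactly 80-20 and every split has to be verifiably different, one possible approach is sketched below. The sub name pick_test_indices and the %seen check are my own illustration, and the partial Fisher-Yates shuffle is a technique I have swapped in, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: choose exactly $k of the indices 0 .. $n-1 for the
# 20% file, using a partial Fisher-Yates shuffle seeded with $seed.
sub pick_test_indices {
    my ($n, $k, $seed) = @_;
    srand($seed);
    my @idx = (0 .. $n - 1);
    for my $i (0 .. $k - 1) {
        my $j = $i + int(rand($n - $i));   # random position in $i .. $n-1
        @idx[$i, $j] = @idx[$j, $i];       # swap it into slot $i
    }
    return sort { $a <=> $b } @idx[0 .. $k - 1];
}

my $n      = 20290;
my $k      = int($n / 5);   # 4058 lines, exactly 20%
my $splits = 20;            # kept small for the sketch; 1000 in the real run

my %seen;                   # remembers every 20% selection produced so far
for my $seed (1 .. $splits) {
    my @test = pick_test_indices($n, $k, $seed);
    my $key  = join ',', @test;
    die "seed $seed repeats an earlier split\n" if $seen{$key}++;
    # lines whose index is in @test go to the 20% file, the rest to the 80% file
}
print "all $splits splits are distinct\n";
```

With 4058 lines chosen out of 20290, two seeds producing the same selection is extremely unlikely, but the %seen check turns that expectation into a guarantee for the splits actually generated.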
 