scramble the order of records in a file

Status
Not open for further replies.

mpramods

Technical User
Jun 3, 2003
50
US
This is a continuation of my previous post, 'create file with possible combinations of numbers from 3 files'.

I have a file, file1, with 10,000,000 numbers. Sample data (18 records) from file1 is as follows:
100333505
100333606
100333707
100444505
100444606
100444707
100555505
100555606
100555707
200333505
200333606
200333707
200444505
200444606
200444707
200555505
200555606
200555707

I would like to scramble the order of the records in file1 and write them to a new file. The scrambled_file would look something like this:
200444505
100333505
100333606
200555505
100333707
200555707
100444505
200333606
100444606
100555606
100555707
200333505
200333707
200444606
100444707
200444707
200555606
100555505

Please note that the records remain the same in scrambled_file, but their order is changed (the order is random). How can I do this?

Thanks,
Pramod
 
Just curious, why do you want to do that?


--------------------------------------------------------------------------
I never set a goal because you never know what's going to happen tomorrow.
 
What you require is far from trivial and, I would guess, near impossible in shell scripting. My first thought was to use a language such as perl: read in the records, shuffle them and print them out again. However, 10,000,000 records is an enormous amount and the memory overhead would be high - at nine digits plus a newline per record, the data file is 100,000,000 bytes before any overhead from reading it into memory.

The next problem you hit is how you define random. I work for a company that runs a national lottery with a large number of entries. We need to provide, on a monthly basis, something along the lines of what you require and it takes a significant amount of computing to do it - and the random number generator is a separate, distinct piece of hardware custom designed and audited.

Ceci n'est pas une signature
Columb Healy
 
Hi

So far [tt]sort[/tt] has coped with every huge amount of data I have given it. I assume it can deal with those 10 000 000 lines too.
Code:
sort -R /input/file > /scrambled/file
Tested with GNU coreutils [tt]sort[/tt], and with less data, of course.
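One property of -R is worth knowing before relying on it: it sorts by a random hash of the keys, so identical lines hash to the same value and come out next to each other. For a file of unique numbers like file1 that makes no difference, but with duplicates in the input the result is not a true shuffle. A minimal sketch of the effect, assuming a GNU coreutils sort that supports -R (the file names under $tmp are made up for the demo):

```shell
tmp=$(mktemp -d)

# Six lines, three distinct values
printf '%s\n' a b a c b a > "$tmp/in.txt"

sort -R "$tmp/in.txt" > "$tmp/out.txt"

# uniq only collapses ADJACENT duplicates, so the number of lines it
# prints equals the number of runs of identical lines in the output.
runs=$(uniq "$tmp/out.txt" | wc -l | tr -d ' ')
echo "runs of identical lines: $runs"
```

If the duplicates were scattered, the count would usually be higher than the number of distinct values; with sort -R it should equal it. Where GNU shuf is installed, shuf /input/file shuffles the lines themselves, duplicates included.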

Feherke.
 
feherke

My RedHat ES4 sort doesn't support the -R flag, and I thought it would be GNU coreutils. I've checked the GNU documentation and they don't mention it. Where did your version come from?

Ceci n'est pas une signature
Columb Healy
 
Hi

Code:
[blue]master #[/blue] seq 100000000 110000000 > hugefile.txt

[blue]master #[/blue] wc -l hugefile.txt
10000001 hugefile.txt

[blue]master #[/blue] ls -lh hugefile.txt
-rw-r--r-- 1 master None 96M May 18 14:01 hugefile.txt

[blue]master #[/blue] time sort -R hugefile.txt > scrambledfile.txt

real    4m11.061s
user    3m56.765s
sys     0m7.093s

[blue]master #[/blue] head -$RANDOM scrambledfile.txt | tail -10
108197787
101050327
108051105
106260414
104965658
108832288
104934445
105478192
104684364
102577666

[blue]master #[/blue] feherke@f:~>sort --version | head -2
sort (GNU coreutils) 6.7
Copyright (C) 2006 Free Software Foundation, Inc.

[blue]master #[/blue] sort --help | grep -- -R
  -R, --random-sort           sort by random hash of keys

[blue]master #[/blue] uname -a
CYGWIN_NT-5.1 master 1.5.24(0.156/4/2) 2007-01-31 10:57 i686 Cygwin

Feherke.
 
Hi

But I have to admit that [tt]sort[/tt] did not perform exactly as I expected. It used approximately 335 MB of memory. I still hope it can make do with less when that much is not available.

Feherke.
 
Hi

Columb, indeed, that document does not mention the -R parameter, but if you download the current [link ftp://ftp.gnu.org/gnu/coreutils/coreutils-6.9.tar.gz]coreutils[/link] package you can see it in the coreutils-6.9.tar/coreutils-6.9/man/sort.1 file.

Yet another example that my Cygwin is ages newer than all the regular Linux distributions I used to work on...

Feherke.
 
I tried out the following, but it certainly fails on both points made by Columb (memory and use of pseudo-rand).

Code:
perl -e '@a = <>; print splice @a, rand @a, 1 while @a' /input/file > /scrambled/file

I'd run this by the C or possibly the perl people. You'll probably need some hefty I/O and a crack rand lib.
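A side note on the one-liner above: each splice has to close the gap it leaves in the array, so the running time grows roughly with the square of the record count. A Fisher-Yates shuffle does the same job with one swap per record. Here is a minimal awk sketch of it; it is still entirely in memory and still uses awk's pseudo-random rand(), so it does not answer either of Columb's objections. The 20-line input file is made up for the demo.

```shell
tmp=$(mktemp -d)

# Demo input: the numbers 1..20, one per line
awk 'BEGIN { for (i = 1; i <= 20; i++) print i }' > "$tmp/input.txt"

awk 'BEGIN { srand() }                 # seed from the clock
     { line[NR] = $0 }                 # slurp every record into memory
     END {
       # Fisher-Yates: walk from the end, swapping each slot with a
       # randomly chosen slot at or before it.
       for (i = NR; i > 1; i--) {
         j = int(rand() * i) + 1       # 1 <= j <= i
         t = line[i]; line[i] = line[j]; line[j] = t
       }
       for (i = 1; i <= NR; i++) print line[i]
     }' "$tmp/input.txt" > "$tmp/scrambled.txt"
```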

Cheers,
ND [smile]
 
Hi

If it fails because of a lack of memory, then I do not think just choosing another programming language will solve it.

Better to change the approach:
[ol]
[li]split the huge file into several smaller chunks[/li]
[li]sort each chunk separately[/li]
[li]merge the resulting chunks by stepping through them line by line, comparing only one line from each chunk at a time[/li]
[/ol]
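The three steps above can be sketched with standard tools, assuming each record is first given a random numeric key so that "sorting" means "ordering by that key". The chunk size, the file names under $tmp and the 1000-line demo input are all made up for illustration; a real run would use much larger chunks.

```shell
LC_ALL=C; export LC_ALL               # keep "." as the decimal point everywhere
tmp=$(mktemp -d)

# Demo input: the numbers 1..1000, one per line
awk 'BEGIN { for (i = 1; i <= 1000; i++) print i }' > "$tmp/input.txt"

# Step 1: prefix a random key to every record, then split into chunks
awk 'BEGIN { srand() } { printf "%.9f %s\n", rand(), $0 }' "$tmp/input.txt" |
  split -l 250 - "$tmp/chunk."

# Step 2: sort each chunk on its own (only one chunk in memory at a time)
for f in "$tmp"/chunk.*; do
  sort -n "$f" > "$f.sorted"
done

# Step 3: merge the sorted chunks (sort -m compares only the current
# line of each file), then strip the key again
sort -m -n "$tmp"/chunk.*.sorted | cut -d' ' -f2- > "$tmp/scrambled.txt"
```

As far as I know, GNU sort already works this way internally on large inputs, spilling sorted runs to temporary files and merging them, so doing it by hand mainly matters when the available sort cannot cope.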

Feherke.
 
And what about this (legacy unix ;-)):
awk 'BEGIN{srand()}{print rand(),$1}' /path/to/input | sort -n | awk '{print $2}' > output
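The one-liner above prints only $1 and recovers only $2, which is fine for single-column numbers like Pramod's. For records that contain blanks, a variant of the same decorate, sort, undecorate idea can carry the whole line behind a tab. A minimal sketch, with a made-up four-line input file:

```shell
LC_ALL=C; export LC_ALL               # keep "." as the decimal point everywhere
tmp=$(mktemp -d)

printf '%s\n' 'alpha one' 'beta two' 'gamma three' 'delta four' > "$tmp/input.txt"

# decorate: random key, a tab, then the untouched record
# sort:     numeric sort on the leading key
# undecorate: cut splits on tab by default; -f2- keeps everything after the key
awk 'BEGIN { srand() } { printf "%.9f\t%s\n", rand(), $0 }' "$tmp/input.txt" |
  sort -n |
  cut -f2- > "$tmp/scrambled.txt"
```

Note that srand() with no argument seeds from the time of day, so two runs started within the same second may well produce the same order.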

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
 
PHV beat me to it...

But combining the two tasks (assuming files named f1, f2 and f3):

Code:
awk 'BEGIN{
srand()                                  # seed the generator from the clock
i=j=k=0
# read each file into its own array
while ((getline f1 <"f1") > 0) f1a[i++]=f1
while ((getline f2 <"f2") > 0) f2a[j++]=f2
while ((getline f3 <"f3") > 0) f3a[k++]=f3
for (a in f1a)
 for (b in f2a)
  for (c in f3a)
   # zero-padded key: no leading blanks for cut to trip over
   printf "%07d %s-%s-%s\n", rand()*10000000, f1a[a], f2a[b], f3a[c]
}'|sort -n|cut -d" " -f2


HTH,

p5wizard
 
spookie,
To answer your question, we need to change some sensitive data in a file. So I am creating a scrambled lookup file, the data from which will be used to replace the real data.

feherke,
I do not have GNU coreutils on the system I am working on, so 'sort -R' does not work for me. I was going to try your other approach of splitting into a number of files. I may have to use this approach in the future if we run into issues.

PHV and p5wizard,
Your solutions worked perfectly. Thanks for the help.

Cheers,
Pramod
 
Here's a quick-and-dirty I wrote in Korn shell a while back. It bogs down a bit if the file gets really massive.
Code:
typeset -RZ6 KEY

INFILE=original.dat
OUTFILE=unsorted.dat

while IFS= read -r RECORD       # keep leading blanks and backslashes intact
do
        KEY=${RANDOM}
        print -r -- "${KEY}:${RECORD}"
done < "${INFILE}" | sort -n | sed 's/^[0-9]*://1' > "${OUTFILE}"
It also assumes there's no ":" in the data.
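One thing to watch with this approach on a file the size of Pramod's: ${RANDOM} in ksh (and bash) only ranges from 0 to 32767, so many records end up sharing each key. When keys tie, sort typically falls back to comparing the rest of the line, so records with the same key come out in sorted rather than random order. A back-of-the-envelope check in plain shell arithmetic:

```shell
records=10000000                # size of the original poster's file
keys=32768                      # RANDOM yields 0..32767
per_key=$((records / keys))     # integer division
echo "about $per_key records per key"
```

That works out to about 305 records per key, which may or may not be random enough depending on what the scrambled file is for.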
 
Hi

Hmm... I wondered if I could tease Sam a bit by suggesting [tt]cut[/tt] as a speed optimization. My theory was that [tt]cut[/tt] should beat [tt]sed[/tt] because it does not use a slow regular expression.
Code:
[blue]master #[/blue] time sed 's/^[0-9]*://1' seq2.txt > /dev/null

real    0m8.629s
user    0m5.093s
sys     0m0.937s

[blue]master #[/blue] time cut -d: -f2 seq2.txt > /dev/null

real    0m9.318s
user    0m6.905s
sys     0m0.421s

[blue]master #[/blue] time awk -F: '{print$2}' seq2.txt > /dev/null

real    0m6.179s
user    0m4.515s
sys     0m0.624s
Sorry for posting these outputs, but I find the result interesting.

Feherke.
 
No problem. Like I said, it was a quick-and-dirty. I'm kind of surprised that the [tt]cut[/tt] wasn't the fastest.

Since the random key is always 6 characters, can you try it with the same data file with this?
Code:
time sed 's/^......://1' seq2.txt > /dev/null
Also how about...
Code:
time cut -c1-7 seq2.txt > /dev/null
That simplifies or eliminates the pattern match.

Actually, the search and replace in the sed does allow for colons (":") in the data. The pattern is anchored to the start of the line and the "1" at the end limits it to the first match, so it will only chop off the prepended key up to the first colon it sees.
 
Hi

Hmm... A minor problem. I did not use fixed-width prefixes... You know, my login shell is [tt]bash[/tt] and that [tt]typeset[/tt] is [tt]ksh[/tt] specific... Anyway, solved now. Another minor problem, see in red.
Code:
[blue]master #[/blue] time sed 's/^......://1' seq2b.txt > /dev/null

real    0m4.641s
user    0m4.061s
sys     0m0.531s

[blue]master #[/blue] time cut -c[red]8-[/red] seq2b.txt > /dev/null

real    0m5.899s
user    0m5.156s
sys     0m0.796s
Anyway, your approach of cutting characters instead of fields seems to be useful. [medal]
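Speed aside, the key-stripping commands timed above are not interchangeable once the data itself contains a colon. A small sketch comparing them on one made-up record with a 6-digit key:

```shell
line='012345:a:b:c'             # 6-digit key, then data that contains colons

with_sed=$(printf '%s\n' "$line" | sed 's/^[0-9]*://')        # strip key + first colon
with_cut_f2=$(printf '%s\n' "$line" | cut -d: -f2)            # second field ONLY
with_cut_rest=$(printf '%s\n' "$line" | cut -d: -f2-)         # second field onwards
with_cut_chars=$(printf '%s\n' "$line" | cut -c8-)            # skip 7 characters

printf '%s\n' "$with_sed" "$with_cut_f2" "$with_cut_rest" "$with_cut_chars"
```

Only cut -d: -f2 loses data here (it returns just "a"); the other three return "a:b:c". The character-based cut additionally relies on the key being exactly 6 characters wide.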

Feherke.
 