
Thoughts on splitting a huge file to smaller ones.


theotyflos (Programmer)
Hi group,

I would like to share my thoughts with you and ask for your suggestions and/or thoughts on the following situation:

The language is RM/COBOL v7.10 on SCO UNIX 5.0.5.

I have a file that looks like this:
Code:
           Select Optional         Signals-Fil
              Assign To Random     Signals-Path
              Organization Indexed Access    Dynamic
              Record Key           Sig-N-A
              Alternate Record Key Sig-C-D-T Duplicates
              Alternate Record Key Sig-D-T   Duplicates
              File Status Flst.
:
:
       FD  Signals-Fil.
       1   Signals-Rec.
           2   Sig-N-A.
               3   Sig-Number      Pic 9(9).
               3   Sig-AA          Pic 9(2).
           2   Sig-C-D-T.
               3   Sig-Customer    Pic x(4).
               3   Sig-D-T.
                   4   Sig-Date.
                       5  Sig-YY   Pic 9(4).
                       5  Sig-MM   Pic 9(2).
                       5  Sig-DD   Pic 9(2).
                   4   Sig-Time.
                       5  Sig-Hour Pic 9(2).
                       5  Sig-Min  Pic 9(2).
                       5  Sig-Sec  Pic 9(2).
           2   Sig-Some-Data       Pic ...
:
:
       1   Signals-Path        Pic x(10) Value "signals".
Facts:
a) The file is automatically and continuously updated (without any user intervention), receiving about 60 new records per minute.
b) Newly written records always carry the current date and time in the corresponding fields, which never change afterwards.
c) Users can also update the file (rewrite only, no write/delete).
d) All three keys are necessary for retrieving information.
e) Records older than 3 months can be transferred to a "history" file, as a last resort.
f) A program runs every minute (through cron) and checks for records of (some) specific customers, updating some other files if necessary. This program must do its job in less than a minute.

Problems:
a) Up to now the file has grown to almost 3.5GB and is about to reach the language/OS limits.
b) Since the file is continuously updated, it can't be backed up correctly with tar or cpio.

Thoughts:
a) Upgrade to a newer language/OS version with very-large-file support. (I'm not sure the customer wants to pay for the upgrade, and it doesn't solve problem b.)
b) Transfer records older than 3 months to a "history" file. Users won't have access to "current" and "history" data simultaneously; they will have to "switch" files somehow. (Solves problem a, doesn't solve problem b.)
c) Split the file into smaller ones, one file per month (a rough sketch follows after this list):
Signals-Path must change at runtime to "sig200512", "sig200601", "sig200602" and so on, for example.
Files would have to be opened and closed according to the user's request:
Case 1: The user wants to see all signals from "2005/11/01" to "2006/01/31". I'll have to start from file "sig200511", close it when it's exhausted, open "sig200512", and so on.
Case 2: The user wants to see the signals of customer "1234" from "2005/11/01" to "2006/01/31". Same as Case 1.
Case 3: The user wants to see all signals of customer "1234". I don't know which file to start from, because I don't know the date of the customer's first signal. If I start from the very first file, I may end up opening/closing/searching a lot of files for nothing. I could keep the dates of the customer's first and last signals in the customer master file, but then how should I deal with out-of-sync dates between the signals files and the customer master file?
If I split the file, would the program from fact (f) still be able to finish in time?
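To illustrate thought (c), here is a minimal sketch of the runtime path change. WS-Wanted-YYYY and WS-Wanted-MM are hypothetical working-storage items holding the month being processed:
Code:
      * Data Division: the fixed path becomes a group item.
       1   Signals-Path.
           2   Filler          Pic x(3) Value "sig".
           2   Sig-Path-YYYY   Pic 9(4) Value Zero.
           2   Sig-Path-MM     Pic 9(2) Value Zero.
           2   Filler          Pic x(1) Value Space.
      * Procedure Division: set the suffix before each Open.
           Move WS-Wanted-YYYY To Sig-Path-YYYY
           Move WS-Wanted-MM   To Sig-Path-MM
           Open Input Signals-Fil
      * A "file not found" status in Flst would just mean there
      * are no signals for that month - skip to the next file.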

Questions:
a) Have any of you ever faced a similar problem?
b) If yes, how did you confront it?
c) If you were facing this issue, which solution would you prefer, and why?
d) What do you think about thought (c)?

Any reply is appreciated.

Theophilos.

-----------

There are only 10 kinds of people: Those who understand binary and those who don't.
 
Splitting the file is a must, in my opinion.

How to split the file depends on some details that were not clarified (in my opinion).

1- Split by customer and date (e.g. file per year)
I would normally do this with a filename such as
zzCCCCYYwww
where zz is a file identifier, CCCC is the customer, and YY is the year (adjust sizes accordingly).

2- Split just by date.


In either case the master customer file should have its start date so you know where to start.
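For example (illustrative field names, added to your customer master record and maintained by whatever inserts the signals):
Code:
      * First and last signal dates for this customer, so any
      * inquiry knows which period file to open first and last.
           2   Cust-First-Signal   Pic 9(8).
           2   Cust-Last-Signal    Pic 9(8).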

Regarding your main processing: the "auto update" process (A) should write to a temporary file.

Another process (B) should run, read that file, insert into the final file, and then delete from the tempfile created by (A).

This way you could back up your files normally, and if you set the process to run every minute you would only lose a maximum of two minutes' worth of data if things go very badly.
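Process (B) could be as simple as this sketch. Temp-Fil and Temp-Eof are illustrative names for the sequential file written by (A) and an end-of-file flag; error handling is up to you:
Code:
           Open Input Temp-Fil
           Open I-O   Signals-Fil
           Perform Until Temp-Eof = "Y"
               Read Temp-Fil Into Signals-Rec
                   At End
                       Move "Y" To Temp-Eof
                   Not At End
                       Write Signals-Rec
                           Invalid Key Display "Duplicate key " Sig-N-A
                       End-Write
               End-Read
           End-Perform
           Close Temp-Fil Signals-Fil
      * Only after a clean Close should the temp file be emptied,
      * e.g. by reopening it for Output.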


Regarding processing the files with the date/customer in the file name, this is easy enough to do if you create specific I/O programs to deal with the files.

Regards

Frederico Fonseca
SysSoft Integrated Ltd
 
I have several similar programs in RM/COBOL, and all work the same way.
1. The file name is constructed so the year and month are a suffix of the file name, e.g. "history" + yymm. The programs which update the file check the system date before each write, so that I can close the current file and open a new one if the month changes.
2. Any inquiry to the files uses a Unix 'ls' command to prepare a list of all the files available (ls history* > hfiles). This list is then used by the program to open the possible file names one at a time, depending on how far back I want to start searching (sketched below).
3. RM/COBOL can open and close hundreds of files like this in a very short time frame. Since they are indexed, I can see very quickly whether there is any appropriate data in each file.
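A hedged sketch of step 2. HFiles-Fil would be a line sequential file over "hfiles"; History-Path, HFiles-Rec, HFiles-Eof and Search-One-File are illustrative names; RM/COBOL's SYSTEM subprogram runs a shell command (check its availability on your runtime):
Code:
       1   WS-Cmd       Pic x(30) Value "ls history???? > hfiles".
      * ...
           Call "SYSTEM" Using WS-Cmd
           Open Input HFiles-Fil
           Perform Until HFiles-Eof = "Y"
               Read HFiles-Fil
                   At End
                       Move "Y" To HFiles-Eof
                   Not At End
      * Each record of hfiles is one available file name.
                       Move HFiles-Rec To History-Path
                       Perform Search-One-File
               End-Read
           End-Perform
           Close HFiles-Fil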
 
Theophilos,

While your SCO operating system has limits on the size of a file, the limit for RM/COBOL indexed files is approximately a petabyte. The version of RM/COBOL you are using, while fairly old by modern software standards, has this capability; see Chapter 10 of the RM/COBOL User's Guide for information on configuring large files.

I didn't want anyone to have the impression you were encountering a limit in RM/COBOL.

Now, as a means of assuring that you are getting everything you can out of your 3.5GB, you should run the rmmapinx program and share the output with me (you have my email address). It may be that other configuration options could help with these unusual requirements.

You can reduce the number of digits used to store the date by using a "days since the beginning of an epoch" approach. You may find a date conversion program here. It only requires six digits to represent all the days in 250+ years. Likewise, it only takes five digits to represent the number of seconds in a day.

In addition to reducing the number of digits used in your data, you can further reduce the storage requirement with a denser USAGE. I would suggest COMP-6, which is packed decimal with no sign.
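Purely for illustration (the field names and the epoch conversion are not from your code), the date/time pair could shrink from 14 bytes of display digits to 6 bytes:
Code:
           2   Sig-D-T.
      * Days since a fixed epoch (conversion routine not shown).
               3   Sig-Days    Pic 9(6) Usage Comp-6.
      * Seconds since midnight (0 through 86399).
               3   Sig-Secs    Pic 9(5) Usage Comp-6.
      * COMP-6 packs two digits per byte: 3 bytes each, versus
      * 8 + 6 bytes for the original Sig-Date and Sig-Time.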

I have not addressed any of the operational issues, such as having time to back up the file, etc. I would be happy to offer some opinions there as well, but I need to understand how many programs would be affected by processing changes, and whether there are any other time constraints beyond the cron job.

Best regards,
Tom Morrison
Manager, Technical Support and Services
Liant Software Corporation
 
Just for the record.

I did not suggest any of the data compression methods given by Tom, as that would only solve the issue for a while; e.g. the file would reach the limit in six months instead of three (or whatever timeframe is involved).

Now, if the timeframe is over one year, and if compression is not being used (it does not seem to be, and I am not referring to file-system compression, which may or may not be on!), then maybe just archiving year-old data will suffice.
As I said, there are a few unknowns here.

Regards

Frederico Fonseca
SysSoft Integrated Ltd
 
Let's address them in order:
a) Up to now, the file has grown to a size of almost 3.5GB and is about to reach the language/OS limits.

As mentioned by the others, this is not a limit of the language or the OS, but it does need to be a concern for performance reasons.

b) Since the file is continuously updated, it can't be backed-up correctly with tar or cpio.

Here's what I see as the bigger issue, which hasn't exactly been addressed. This is definitely a design problem with the system.

In any busy system with many transactions, you want to write to a transaction file, which holds the day's changes, and then process those against your main file during a down time. This allows you to have full access to the main data file at any time other than when a daily merge/change process runs. Then this can give the user an "undo" capability too, if they so desire, over the span of that day.
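As a rough illustration only (names and sizes are made up, except that Sig-N-A is 11 digits in the original layout), the transaction record could carry the before-image of each rewrite; that before-image is what makes the "undo" possible:
Code:
       FD  Trans-Fil.
       1   Trans-Rec.
           2   Trans-Stamp.
               3   Trans-Date      Pic 9(8).
               3   Trans-Time      Pic 9(6).
      * Primary key of the rewritten signal (matches Sig-N-A).
           2   Trans-Key           Pic 9(11).
      * Before-image of the old Signals-Rec, so the change can be
      * reversed until the merge runs (size elided, as in the
      * original layout):
           2   Trans-Before        Pic ...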

I would even consider breaking the full file down into monthly or yearly files, simply for the performance issue mentioned above. A 3.5GB file is fine in and of itself, but I'm sure the overhead of handling so much data all the time takes its toll.
 
I guess in COBOL you would just create a new file, read some of the older records, and write them to the new file. Then you can read the new file and, for each record found there, delete the original from the old file.

If you had the date as part of the key you could use that as a key or an alternate key.

If you do not like my post feel free to point out your opinion or my errors.
 
ceh4702 said:
If you had the date as part of the key you could use that as a key or an alternate key.

Please review the original post, where you will discover that the date is lodged in both alternate keys.

Tom Morrison
 
ceh4702

Please do review the original post, as well as the others, and then give a constructive idea of how to deal with the problem, instead of just restating something that the original post (and others) have already mentioned.

Regards

Frederico Fonseca
SysSoft Integrated Ltd
 
If the application needs such huge capacity, and needs to keep the old periods' records online, then in my opinion you should reconsider: the best cost/performance may be to upgrade your system so that it can handle such a file, or to reduce the period you keep online.

Another suggestion: if your software was not built from the start to handle split files, retrofitting that will cost much more at the end of the day.

Baruch
 
Sounds like you want a magic pill. One option could be some type of database package you could install that lets you break the file-size limit of your operating system.

I work for a school and we only keep the last 7 years of data, so the transcript file holds what we want to keep on each person after that.

Some databases use a scheme with "file suites", where every year a new file is started. It looks like eventually you will have to use a history file of some kind.

Do the messages repeat? If so, you could use a message code. But even if you shave some size off the fields, you will still run out of room eventually; it is a fact that you will need some way to split up the file or to use a history file. We use file suites ourselves: every year we create a new suite, and to access the data we open and close files. Each quarter could be a new file suite. You have to acknowledge that there are file size limitations.

If the same messages are used over and over, you could assign codes for them or use a specialized shorthand, but you can only shrink the file so much.



If you do not like my post feel free to point out your opinion or my errors.
 
First of all, my deepest apologies to all of you for being so late to reply. I'm very sorry for that. The main reason is that I first wanted to study your answers very carefully.


Frederico
1- Split by customer and date (e.g. file per year)
I can't put the customer's code in the filename, because then I couldn't make reports based on the date only (regardless of customer).
In either case the master customer file should have its start date so you know where to start.
That was also my original thought. The only problem with it is how to deal with out-of-sync dates (see Case 3 under "Thoughts" in my original post).
Another process (B) should run, read that file, insert into the final file, and then delete from the tempfile created by (A).
I totally agree with you.


mrregan
Thank you for your answer. However, my problem is not how to accomplish the splitting of the file, but whether such an action should be done at all and whether it's worth it. I am an experienced COBOL programmer; I know how to do things.


Tom
the limit for RM/COBOL indexed files is approximately a petabyte.
I made a terrible mistake: the version of RM/COBOL is 6.09, NOT 7.10. So the limit on the file size (as far as I know) is much less than a petabyte. About your proposal for compressing: I don't think it would make much of a difference, since most of the data in the record (besides the date and the number) are alphanumeric. Besides that, changing the file description requires a) converting the original file, which means "disconnecting" it from a running application that can't be stopped, and b) changing/recompiling the many programs that access this file (almost 650 of them). I'm sorry that I can't reveal more about the nature of the application, since the company I'm developing it for holds the copyrights and does not allow me to say anything more in public. Maybe I can tell you more through a direct email.



Glenn9999
Thank you also for your thoughts.
This allows you to have full access to the main data file at any time other than when a daily merge/change process runs.
I'm afraid that this is not an option. "Old" data as well as "today's" are critical and must be available at any time to the various updating programs of the application, as well as to the end users for reports.



baruch
Thank you also. Maybe you're right: an upgrade to both the OS and the language should simplify things a lot. I just wonder whether it would turn out to be only a temporary solution; I have no way to tell how much data to expect in the future.



To the group
Thank you all once again and please forgive me for being late.

After a lot of thought I have decided that I'll try to make the split. It will take a lot of time. Periodically I will post my progress and my experiences here, hoping that this will help others who may face a similar situation.

Theophilos.

-----------

There are only 10 kinds of people: Those who understand binary and those who don't.
 
theotyflos said:
I'm afraid that this is not an option. "Old" data as well as "today's" are critical and must be available at any time to the various updating programs of the application, as well as to the end users for reports.

Why is that? Please enlighten me. Tell me why splitting the file wouldn't work. In what I described, any and all data could still be read; updates would only be applied at the end of the day, so the main files would be available for reporting and the like. Most reports cover specific time periods, so splitting the file up by time period is not out of the question. In fact, this is how most systems operate; I'm seriously wondering what makes yours different.

I'm saying what a lot of people here are saying: you need to seriously rethink the design end of it.
 
I saw some articles on a file system called XFS that some people use with Linux.

However, the larger the file, the longer it will take to find what you want. Imagine running a query on a 20GB file!

If you do not like my post feel free to point out your opinion or my errors.
 