converting DOS data using original TSR (newbie)

Status
Not open for further replies.

Maggy160

Programmer
Jun 14, 2007
2
NL
I have a 42 MB database that I used in the "good old" DOS days. The way to call it from any program was using a TSR that I still have as well.
The data is a strange combination of ASCII and hex.
For example for "to do; verb; irregular", "to do" is in the database, followed by a hex string. The strings "verb" and "irregular" are in a 300 lines data block in the program. My problem is that I've not yet been able to figure out a one-to-one relationship between the hex strings in the data and the strings in the program. If there were one, it would be very simple to use any hex editor to replace the hex strings with the actual words. Hex strings can be from one to five bytes long as far as I found so far and show no obvious relationship with the DS:xxxx and/or string length.
But the program of course is flawlessly able to read these hex strings and translate them to human readable ASCII.

So now I got the idea that the same program with some tiny adaptation should be able to step through the entire database and convert it completely to a simple comma delimited file.

I am more or less able to read the ASM I created using IDA. I can see how it loads the program in memory, how it "terminates" but stays resident, how it responds to its hotkey, how it switches between Dictionary and Encyclopedia mode and so on. What I have not yet been able to find is how it walks through the database, or how it translates the hex strings.

Any suggestions are VERY welcome.
 
> The strings "verb" and "irregular" are in a 300 lines data block in the program.
It seems fairly obvious that the hex bytes are going to be encoding access to that table in some way.

Some thoughts.

Is the table at the start nul terminated, as in [tt]"verb\0irregular"[/tt], or is it just one long string like [tt]"verbirregular"[/tt]?
The former might mean that only word number is encoded in say 9 bits. The latter might mean you have word number in 9 bits + word length in 4 bits.
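If the table does turn out to be nul terminated, numbering the entries is a few lines of script. This is only a rough sketch: the function name is made up, the sample bytes are just the two words quoted above, and decoding as code page 437 is an assumption about a DOS-era program, not something known about this one.

```python
def split_nul_table(block: bytes) -> list[str]:
    """Split a block of NUL-terminated strings into a list of entries."""
    # Drop one trailing NUL so the last entry is not followed by an empty string.
    if block.endswith(b"\x00"):
        block = block[:-1]
    # cp437 is an assumption (the usual DOS code page); adjust if accents look wrong.
    return [entry.decode("cp437") for entry in block.split(b"\x00")]


# Sample data only; the real block would be cut out of the program file.
table = split_nul_table(b"verb\x00irregular\x00")
for number, word in enumerate(table):
    print(number, word)
```

With the word numbers in hand, it becomes possible to compare them against the bits of the hex strings.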

> For example for "to do; verb; irregular", "to do" is in the database, followed by a hex string.
Does that mean the "; " is generated automatically as well?

If you have two definitions which expand to the same text, are the hex bytes the same or different? This tells you whether they're relative to the current position or not.

Is the length of the hex byte sequence related to the number of words which get printed?

Is the length of the hex byte sequence related to the position of the definition in the table? For example, more frequent definitions might be at the start of the table and have shorter sequences.

If you change the 300 entry table at the start, say change "verb" into "werb", does everything work as expected except for the obvious change of spelling? If something else changes, then it would seem the table itself is part of the decode logic.

If the hex bytes represent a fairly simple encoding, then a methodical set of tests changing one bit at a time should eventually reveal how the encoding works from observing the effects it produces (say the next word gets printed, or an extra letter gets printed).
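Generating the one-bit-at-a-time variants by hand gets tedious, so something like the following could list them for you. It is a sketch only: the function name is invented and the two sample bytes are placeholders, not values from the real database.

```python
def single_bit_variants(code: bytes):
    """Yield (byte_index, bit_index, variant) for every one-bit change to code."""
    for byte_index in range(len(code)):
        for bit_index in range(8):
            variant = bytearray(code)
            # Flip exactly one bit and leave everything else untouched.
            variant[byte_index] ^= 1 << bit_index
            yield byte_index, bit_index, bytes(variant)


# Placeholder bytes; substitute a hex string copied from the data file.
for byte_index, bit_index, variant in single_bit_variants(b"\x12\x34"):
    print(byte_index, bit_index, variant.hex())
```

Each variant would be patched into a copy of the data file in turn, noting what the program then displays for that keyword.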

Are you able to run this code under the control of a debugger? If you can, then putting a breakpoint on the print character interrupt and waiting for a ";" to be printed should start to tell you where it came from.

> I found so far and show no obvious relationship with the DS:xxxx and/or string length.
They're more likely to be relative to the start of the table. Look through the asm for where "offset table" gets loaded into a register, followed by various additions before using that result to index into memory.
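To test the "relative to the start of the table" idea outside the program, you could list the byte offset of every entry and compare those numbers with the hex bytes. A minimal sketch, assuming a nul-terminated table; both function names and the sample bytes are made up for illustration.

```python
def entry_offsets(table: bytes) -> list[int]:
    """Return the offset, from the start of the table, of each entry."""
    offsets = []
    position = 0
    while position < len(table):
        offsets.append(position)
        end = table.find(b"\x00", position)
        if end == -1:  # last entry has no terminator
            break
        position = end + 1
    return offsets


def string_at(table: bytes, offset: int) -> bytes:
    """Return the bytes from offset up to (not including) the next NUL."""
    end = table.find(b"\x00", offset)
    return table[offset:] if end == -1 else table[offset:end]


# Sample data only; the real table would be cut out of the program file.
sample = b"verb\x00irregular\x00"
for offset in entry_offsets(sample):
    print(hex(offset), string_at(sample, offset))
```

If any of the hex bytes in the database match these offsets, possibly after adding a constant, that points at the indexing code to look for in the asm.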

--
If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
 
Thanks Salem,

Although it'll take at least a full day to test all your suggestions, it sure looks like you know better what you're talking about than I do.

Actually this database has been tormenting me ever since I started using Windows XP. In Windows ME I was still able to get limited use of it in a DOS box; in XP it crashes the DOS box. I "solved" this by installing a Windows ME virtual machine... how virtual can you work...

Another problem is that the database is very outdated, and the access program only allows looking up data, not editing or saving... how mediaeval can I work...

Imagine, this database once cost over 2400 guilders, that's about 1100 euros, about 1500 US$. And it was my first CD-ROM.

The table is 0x00 terminated: after each entry is a 0x00.
There are 0x00's in the database main file as well, actually in more places than I can explain.
There are two more files. The first, of 7KB, holds a seemingly random pick of keywords from the main database file, delimited by AAAAAAA and 6 bytes starting and ending with 0x00. I guess this one is meant to let the program jump faster through the (for a DOS program very large) database. The second small file, 6KB, is entirely hex, with most bytes by far 0x00. I'm afraid the program first finds a keyword in the database, then uses the hex of that keyword to look up something in that small hex file, and uses that result to look up the data in the 300 words list. Perhaps, maybe?

The ; in my example is not an actual delimiter that the program uses to show the result. Actually the program creates a nearly full screen pop-up with a rather complex layout depending on the selected options. For example, you can select the word "done" in a text editor, press the hotkey and get the full conjugation of "to do" and even its etymology.

In some cases it seems that two bytes of hex after a keyword do have a unique meaning, for example 0x0B06 = male noun, 0x0C06 = female noun, 0x0D06 = noun, both male and female.
But I've seen both 0x1404 and 0xFF04 behind plurals, while 0x1404 also occurs behind abbreviations. So far I've only been able to find such relatively easy relationships in 2 byte hex.
I've not yet been able to find any relationship between the number of bytes and the amount of data displayed for that keyword.
The length of the hex string shows no relationship with its place in the 300 words data.
The hex string itself shows no obvious relation with where a word is in the data list. For example, "conjugate with" is the 95th entry, accessed by 0x6504, while "lattitude" is the 180th entry, accessed by 0x5304.
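A quick way to rule out the simplest mappings is to check whether one byte of the code and the entry number always differ by the same amount. The sketch below uses only the two pairs above; the function name is made up, and it assumes the bytes appear in the file in the order I wrote them.

```python
# (hex string as written above, entry number in the 300 words list)
observations = [
    (bytes.fromhex("6504"), 95),   # "conjugate with"
    (bytes.fromhex("5304"), 180),  # "lattitude"
]


def constant_offset(pairs, byte_index):
    """Return k if entry == code[byte_index] + k for every pair, else None."""
    differences = {entry - code[byte_index] for code, entry in pairs}
    return differences.pop() if len(differences) == 1 else None


for byte_index in (0, 1):
    print(byte_index, constant_offset(observations, byte_index))
```

For these two pairs the result is None for both bytes, so neither byte is simply the entry number plus a constant. More pairs can be added to the list as they are found.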

I am able to run much simpler code under control of a debugger but have no idea how to do that with a TSR. I tried several times to cut out the TSR part and/or the hotkey part to run it as an ordinary program that I should also be able to step through in a debugger, but so far that has only led to code that no longer works.

Bed time, I'm going to experiment with editing the data block in the program and/or changing the hex in the data file tomorrow.
Feel free to give more suggestions ;-)

cheers
Maggy
 