Interpolating missing values in an Access table


adfo (Programmer) | Oct 23, 2009
Dear all,

I have an Access table containing a "Column" field, a "Row" field, and some values under a "Value" field (example follows below). I need to interpolate/extrapolate missing values within each COLUMN if enough values exist to allow interpolation. Next, interpolate within each ROW for any missing values that haven't been filled by the previous process. Example below:

Group Column Row Value
1 1 A -
1 1 B 8
1 1 C -
1 1 D 9
1 1 E -
1 1 F -
1 1 G 12
1 2 A -
1 2 B 6
1 2 C -
1 2 D -
1 2 E -
1 2 F 9
1 3 G -
1 3 A 7
1 3 B -
1 3 C 8
1 3 D -
1 3 E -
1 3 F 12
1 3 G -
2 1 A -
2 1 B -

STEP 1:

- For each COLUMN within the group, if there are at least 3 values, interpolate/extrapolate all missing values linearly as necessary (Column 1 within Group 1 above meets the criterion of 3 values)

- Column 2, with only 2 values, doesn't meet the criterion, so skip it

- Column 3 meets the criterion (at least 3 values), so interpolate/extrapolate, and so on

STEP 2:

Now, for each ROW across all columns within the group (row A, B, etc.), interpolate/extrapolate all missing values. This should fill all values skipped in STEP 1 above. After finishing, move to the next group, and so on.
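
Both steps come down to the same two-point straight-line rule. As a minimal sketch (plain VBA in a standard module, not tied to the table layout; the function name is mine, not from any existing code), it might look like:

Code:
Public Function LinInterp(x As Double, x1 As Double, y1 As Double, _
                          x2 As Double, y2 As Double) As Double
    ' Value at position x on the straight line through (x1, y1) and (x2, y2).
    ' Works for interpolation (x between x1 and x2) and, because the line is
    ' simply extended, for extrapolation as well.
    LinInterp = y1 + (y2 - y1) * (x - x1) / (x2 - x1)
End Function

For example, with rows numbered A=1 through G=7, Column 1 above has B=8 and D=9, so row C would be LinInterp(3, 2, 8, 4, 9) = 8.5 and row A would extrapolate to LinInterp(1, 2, 8, 4, 9) = 7.5.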

Thanks
 

See if this does what you need. I still have questions:
1) When you are interpolating the columns within a row, which rows are included? I included all rows that had at least 2 columns with data points. That does not look correct.
2) Do you extrapolate the leading and trailing edges of a row like you do with a column? I did not.

Your data sets will have to be very clean. I found duplicate records in the set you gave me. This will cause the program to crash. I could have put error checking in for this, but I think it is better to see the problems.

I left a query that shows how you can do this with SQL. It is not worth it: the SQL is very complicated and it takes a long time to run.

Buttons 1, 2, and 3 show the pieces, and then the "1,2,3" button does all the execution at once.
 
If you want to send the complete data set, execute a make-table query and select only the needed fields. Import that table into a clean db. The entire dataset will likely be under a meg; I am at 1.2 MB, but I basically have 2 copies of the table, which is about 7k records.
 
Thanks a lot MajP!

This was a great help. Just a couple of observations:

1- STEP 2 seems to work fine. I just noticed a small issue in the extrapolation in the column series (IDs 3463, 3464, 3465, and 3466). The increment is expected to be 0.35 but is calculated as 0.7 instead (doubled). Only ID 3466 (the last element) is incremented correctly (0.35). The same applies to IDs 3478 through 3482: the increment is correct for the last element only and doubled for the others. Is it because decRow increments by two? If so, why not for the last element?

2- STEP 3: I think the current interpolation for a row is not considering the offset.

To answer your questions above:

(1) That's correct; I think at least two values should exist in step 3. In real data there will not be fewer.

(2) Yes, extrapolating the leading and trailing edges of a row will ultimately be needed, similar to columns (within the same 'Group').

I will do more tests on real data on Monday/Tuesday and will send a more complete dataset. I appreciate your help. Thanks a lot!
 
I goofed on the row, but I have some questions.

First, the discussion and the data do not seem to match. I think I would understand if you said:

if value "7.5" at row A, column "1" has offset=150, and value "7" for row A, column "3" has offset=300, then row A , column "2" , which has an offset of 200 should be 1/3 of the incremental value between column 1 & 3 "7.33" (200 is located at 1/3 the distance between 150 & 300)

Is that correct? You said the following:
(i.e. if value "8" at row A, column "1" has offset=150, and value "10" for row A, column "3" has offset=300, then row A , column "2" , which has an offset of 200 should be 1/3 of the incremental value between column 1 & 3 "8.66" (200 is located at 1/3 the distance between 150 & 300).
Code:
Group   Column(Y)   Row(X)   Value      Offset(Y)
1       1           A        (7.5)      150
1       1           B        8          150
1       1           C        "8.5"      150
1       1           D        9          150
1       1           E        "10"       150
1       1           F        "11"       150
1       1           G        12         150

1       2           A        [7.33]     200
1       2           B        6          200
1       2           C        [8.33]     200
1       2           D        [9.11]     200
1       2           E        [10.22]    200
1       2           F        9          200
1       2           G        [12.55]    200

1       3           A        7          300
1       3           B        "7.5"      300
1       3           C        8          300
1       3           D        "9.33"     300
1       3           E        "10.66"    300
1       3           F        12         300
1       3           G        (13.66)    300
So if I sort on row A

Column Row Value offset(Y)

1 A (7.5) 150
2 A X2 200
3 A 7 300


In the above example you found X2 as follows:

(y3 - y1)/(x3 - x1) = (y2 - y1)/(x2 - x1), where y is the offset and x is the value
(300 - 150)/(7 - 7.5) = (200 - 150)/(X2 - 7.5)
solve for X2 = 7.33

This was a really simple example, but as far as I can see it only works for the above case: a missing value between two known values, with columns spaced proportionally. In the real data you get things like:

7 A (7.5) 150
8 A X2 200
10 A 7 300

where there is a missing column. What would be a general formula to calculate this? Or would you treat it like the above and ignore the fact that column 9 is missing?

And then you usually have multiple values to calculate.

7 A (7.5) 150
8 A X2 200
10 A X3 300
11 A X4 300
12 A 7 300

So can you explain how to do the above examples and what values you would expect?

Also, how would you extrapolate?
7 A X1 150
8 A X2 200
10 A X3 250
11 A 7.5 300
12 A 7 350

Regarding this:
I just noticed a small issue in the extrapolation in the column series (IDs 3463, 3464, 3465, and 3466)
We seem to have a disconnect. I have these points within Loc 7, Group 30, Column 11. There is only one data point, and thus no interpolation or extrapolation within the column. Also notice that the far-right column shows the type of calculation. Are those the correct IDs?
 
I just noticed a small issue in the extrapolation in the column series (IDs 3463, 3464, 3465, and 3466). The increment is expected to be 0.35 but is calculated as 0.7 instead (doubled). Only ID 3466 (the last element) is incremented correctly (0.35). The same applies to IDs 3478 through 3482: the increment is correct for the last element only and doubled for the others. Is it because decRow increments by two? If so, why not for the last element?

In the fixTrailingEdge code, change the following from beginRow to endRow:

rsInterp!interpValue = Round(endValue + snglSlope * (currentRow - endRow), 3)

I now get these results
ID_FK Location intGroup decColumn strRow decRow decValue interpValue offset typeValue
3476 5.08 + 91.44 Measured Depth
3477 5.48 + 91.44 Measured Depth
3478 5.88 + 91.44 -Extrapolated Back End-
3479 6.28 + 91.44 -Extrapolated Back End-
3480 6.68 + 91.44 -Extrapolated Back End-
3481 7.08 + 91.44 -Extrapolated Back End-
3482 7.48 + 91.44 -Extrapolated Back End-

I had beginRow and endRow turned around.
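
For anyone following along, here is a hedged sketch of what the corrected trailing-edge pass presumably looks like; only the assignment line above is taken from the actual demo, and the surrounding DAO loop and field names are assumptions for illustration:

Code:
Dim rsInterp As DAO.Recordset
Dim currentRow As Single, endRow As Single
Dim endValue As Single, snglSlope As Single
' ... rsInterp opened on the column's rows; endRow/endValue describe the
' last measured pile and snglSlope the slope fitted from the last two
' measured piles in the column ...

Do While Not rsInterp.EOF
    currentRow = rsInterp!decRow
    If currentRow > endRow Then
        rsInterp.Edit
        ' extend the line forward from the END of the measured data
        rsInterp!interpValue = Round(endValue + snglSlope * (currentRow - endRow), 3)
        rsInterp!typeValue = "Extrapolated Back End"   ' assumed label
        rsInterp.Update
    End If
    rsInterp.MoveNext
Loop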
 
I apologize, MajP.
There was a typo in some IDs I reported, and this is probably what caused the confusion.

I meant IDs 3643, 3644, 3645, and 3646; it seems I switched the 4 and the 6. I'm glad I reported another case, which you responded to with the corrected code above. I think this corrects all my STEP 2 issues, including the piles I mistyped. I apologize for the confusion.

Regarding your questions on interpolating along the row: I think this should be based purely on the offsets of the known values, regardless of any missing columns. Using your example:

7 A (7.5) 150
8 A X2 200
10 A 7 300
--> X2= 7.33

and in your next example (note that I corrected the offsets for columns 11 & 12 in your example):
7 A (7.5) 150
8 A X2 200
10 A X3 300
11 A X4 400
12 A 7 500

The general formula is based on calculating the slope from the known values, then solving for the unknown:
slope m = (y5 - y1)/(x5 - x1), where y is the value and x is the offset
In the example above, m = (7 - 7.5)/(500 - 150) = -0.0014
Then m = (y4 - y1)/(x4 - x1) = (X4 - 7.5)/(400 - 150) = -0.0014, and
X4 = (-0.0014 * 250) + 7.5 = 7.15, and so on. I'm not sure if the code is already doing this; I'll check on Monday.
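
In VBA terms, a minimal sketch of that offset-based rule (the function name is mine, not from the demo db) would be:

Code:
Public Function InterpByOffset(offs As Double, _
                               off1 As Double, val1 As Double, _
                               off2 As Double, val2 As Double) As Double
    ' Straight line through the two known piles (off1, val1) and (off2, val2),
    ' evaluated at offset offs, regardless of how many columns are missing.
    Dim m As Double
    m = (val2 - val1) / (off2 - off1)
    InterpByOffset = val1 + m * (offs - off1)
End Function

For instance, InterpByOffset(400, 150, 7.5, 500, 7) returns about 7.14; the 7.15 above comes from rounding the slope to -0.0014 before multiplying.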

Regarding your extrapolation question:
7 A X1 150
8 A X2 200
10 A X3 250
11 A 7.5 300
12 A 7 350
Again, regardless of any missing columns, I would look at the offsets. The extrapolated values should be:
X3 = 8, X2 = 8.5, and X1 = 9 (since all are equidistant)
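
Reusing the InterpByOffset sketch above, extrapolation is the same line extended past the known points: InterpByOffset(250, 300, 7.5, 350, 7) = 8, InterpByOffset(200, 300, 7.5, 350, 7) = 8.5, and InterpByOffset(150, 300, 7.5, 350, 7) = 9, matching the values listed.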

I appreciate your help. Will let you know how it is coming along. Thanks
 
In my example:
7 A (7.5) 150
8 A X2 200
10 A X3 300
11 A X4 400
12 A 7 500

you said to ignore the missing columns. That is easy to do, but it does not seem like correct logic. I think you would want to "weight" both the column placement and the offset. What about this approach?

1) Calculate X2 using the standard formula:
(X2-7.5)/(200-150) = (7-7.5)/(500-150)
I think 7.42 is the answer

2) Now use column 8 to solve for X3:
(X3-7.42)/(300-200) = (7-7.5)/(500-150)

 
Disregard my last; that was part of something else. What I was thinking is that you could solve two equations with two unknowns, since both the column position and the offset define the location.

Col(Y) Val(X) Offset(Z)
7 (7.5) 150
8 X2 200
10 X3 300
11 X4 400
12 7 500

So you could solve using columns 7 and 12:
X = Ay + Bz

7.5 = 7A + 150B
7 = 12A + 500B

Solving for A and B:
A = 1.588
B = -0.0241
Now, to solve for any X:
X = 1.588(Col) - 0.0241(Offset)
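
As a rough sketch of that two-equation/two-unknown step (Cramer's rule on the system above; the procedure name is mine):

Code:
Public Sub SolveColOffsetFit(col1 As Double, off1 As Double, val1 As Double, _
                             col2 As Double, off2 As Double, val2 As Double, _
                             ByRef A As Double, ByRef B As Double)
    ' Fit Value = A * Col + B * Offset through two known piles.
    Dim det As Double
    det = col1 * off2 - col2 * off1
    If det = 0 Then Err.Raise vbObjectError + 513, , "Degenerate system"
    A = (val1 * off2 - val2 * off1) / det
    B = (col1 * val2 - col2 * val1) / det
End Sub

Calling SolveColOffsetFit 7, 150, 7.5, 12, 500, 7, A, B gives A = 1.588 and B = -0.0241, matching the numbers above.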

The concept is correct, but the answers are not what you would want. The solutions are points on a plane that may not fall within your X begin and X end values. So without further information, I will implement the pure linear solution using offset values regardless of the column number.
 

Thanks MajP for the great help!

I may have to wait a couple of days or so until I get more 'depth' values filled into the database to do thorough testing. I did a quick test on a subset of the actual database, which has many depth values recorded to date. Just a few uncalculated values are still there, mostly for ROWS that have two characters (such as 'AB', 'JK', etc.) (?). I am happy with what we have achieved so far. Thanks very much, MajP


 
OK, a couple of things.
1) You need to add some error checking to all the procedures, especially the getSlope and getIncrement functions. You can easily get division-by-zero errors.

2) For my error checking, I would write a single procedure to call that pops up a message box saying an error has been encountered, and then does a Debug.Print of:
location
group
column
row
interpvalue
beginvalue
endvalue

Using a global variable, only pop the message up once; you would not want 1,000 confirmation messages (a sketch of such a routine follows this list).

3) If I were going to do this again, I would add more fields to my table to store intermediate calculations, so for a record you could see:

beginValue, beginRow, endValue, endRow, increment, slope

By saving the intermediate steps, you could actually simplify the code, but more importantly see the problem areas.
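
A minimal sketch of such a reporting routine (Access VBA; blnErrorShown would be a module-level flag so the message box appears only once per run; all names are illustrative, not from the demo db):

Code:
Private blnErrorShown As Boolean       ' module-level, reset before each run

Public Sub ReportInterpError(strLocation As String, intGroup As Integer, _
                             decColumn As Double, strRow As String, _
                             varInterp As Variant, varBegin As Variant, varEnd As Variant)
    If Not blnErrorShown Then
        MsgBox "An error was encountered during interpolation. " & _
               "Details are in the Immediate window.", vbExclamation
        blnErrorShown = True           ' confirm once, not a thousand times
    End If
    Debug.Print "Location=" & strLocation & " Group=" & intGroup & _
                " Column=" & decColumn & " Row=" & strRow
    Debug.Print "  interpValue=" & Nz(varInterp, "Null") & _
                " beginValue=" & Nz(varBegin, "Null") & " endValue=" & Nz(varEnd, "Null")
End Sub

A matching guard in getSlope/getIncrement would test the denominator for zero before dividing and call this routine instead of letting the run-time error surface.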

 
Thanks MajP!

I've been trying the code, and it was working fine mathematically. However, due to the nature and low number of recorded depths in the database to interpolate from, the "row extrapolation" doesn't seem to provide reasonable results. For example, if a "group" has 50 columns and only the 9th, 15th, and 28th columns were valid for column interpolation, then when it comes to row interpolation, the 8 trailing-edge and 22 leading-edge values may produce unrealistic results because of the extrapolation. My approach is to eliminate "ROW EXTRAPOLATION" if possible. I uploaded a description of my concept, which may require a small adjustment to the code to exclude the row extrapolation. I'd appreciate it if you can help with this. Please check:

Thanks!!
 
I was aware of this problem in the logic, and your proposed changes will be easy to implement.
1) I will fix it and send for demo.

Now I will ponder the information and see if I can come up with a better mathematical solution for doing this. FYI, I have an undergraduate degree in Civil Engineering and a Masters in mathematical modeling, so this is kind of familiar territory.

So in that regard, is there something you are specifically trying to maximize or minimize in the depth settings? I showed that you can get the best-fit line for a series of coordinates.
1) Could a best-fit line through all the rows in the group be a better solution? Is it more important to be most linear across the whole group, or most linear between known points?
 
Thanks a lot MajP,

It seems that you have the right match of skills; lucky you!

I think the best is the pure linear approach (straight-line interpolation between two known points, considering the distance between them in the case of rows). I think this is the simplest approach. Your question brought up my method (2), which I didn't put in my notes. I added it for you to comment on:


You may agree that method (1) will be simpler to implement (?)

I would go with linear only for now, and cannot ask for more than that :) ?!
 
I have a few questions, and I think I would approach this differently.

First to have a common reference make two queries on your raw data. Choose location 2, group 6, and sort the first by column then row. Now make the other query and sort by row then column.

1) Is the location in the column more predictive of depth than the location in the row? I am not sure how you choose your sample depths, but my guess is the answer is no. In other words, assume the following grid, with x representing unknown depths and numbers representing known depths:

1 2 3 4 5
a x 1 x x 3
b x x x x x
c 2 x 3 4 x
d x x x x x
e x 7 x x x

Look at pile 2c. The way we do this now, it interpolates the column first. If rows and columns are about equally spaced, then using a2 and e7 in a column interpolation seems much less predictive than using c2 and c3 in a row interpolation.

2)Now assume the next grid
1 2 3 4 5
a x 1 x x 3
b x 3 x x x
c 2 x 3 4 x
d x 5 x x x
e x 7 x x x

I believe that the offset gives the true distance between columns. So assume the following
strRow Column offset
C 1 + 12.57
C 2 + 14.01
C 3 + 15.41

So column 2 is 1.44 units (feet?) from column 1, and column 3 is 1.4 units from column 2. However, what I do not know is how far Row B is from row C and row D from row C.

So, looking at the grid above and pile 2c, I would choose a column interpolation if the distance between rows b and d were less than the distance between columns 1 and 3. If not, the row interpolation would be more predictive.

3) To see what I am talking about look at those two queries at pile row A column 10.

In Column order:
Col strRow decValue
10 A
10 B
10 C
10 D 5.19

So I have to backward-extrapolate from row D (an unknown distance). I would think that is a pretty poor predictor.

In row order:
Row Col Value offset
A 9 4.2 +12.57
A 10 +14.01
A 11 4.2 +15.41

But here I have 2 points about 1.4 units away, and I would think 4.2 is a pretty solid indicator.

4) If this logic makes sense, here is what I think is a better approach (a rough sketch follows the list).
a) We would need actual equivalent row offsets, or something that compares row distances to column distances.
Row D has a value of 7 and row E has a value of 9; that is a distance of 2 units. Does that distance have any relation to an offset distance of 2 units?
b) I would first find the points that are immediately surrounded by values (e.g. 2c) in the row above and the row below, and predict those. Then go to the rows and find the points that have values immediately next to them in the adjacent columns.
c) Now I would iterate back and forth, doing this until I cannot find any points with values immediately surrounding the unknown pile.
d) Next I would look for points that have values 2 rows away, and then piles with values 2 columns away, and do the same until I can find none of those.
e) Keep going until all points have been predicted based on the best possible predictor row or column.
f) At some point all values are interpolated except for the end columns or end rows. I have several ideas for this, but I want to get your thoughts on this first.
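
A hedged sketch of that pass-by-pass fill, working on an in-memory grid rather than the real table and using a plain midpoint average in place of the offset-weighted interpolation (all names and the grid layout are illustrative only):

Code:
Public Sub FillByNearestNeighbors(grid() As Variant, nCols As Long, nRows As Long)
    ' grid(col, row) holds known depths; Empty marks an unknown pile.
    Dim dist As Long, c As Long, r As Long
    Dim filledSomething As Boolean

    For dist = 1 To 3                              ' widen the search one step at a time, per (c)-(e)
        Do
            filledSomething = False
            For c = 1 To nCols
                For r = 1 To nRows
                    If IsEmpty(grid(c, r)) Then
                        ' pile bracketed within its column (rows dist above and below)
                        If r - dist >= 1 And r + dist <= nRows Then
                            If Not IsEmpty(grid(c, r - dist)) And Not IsEmpty(grid(c, r + dist)) Then
                                grid(c, r) = (grid(c, r - dist) + grid(c, r + dist)) / 2
                                filledSomething = True
                            End If
                        End If
                        ' otherwise, pile bracketed within its row (columns dist left and right)
                        If IsEmpty(grid(c, r)) And c - dist >= 1 And c + dist <= nCols Then
                            If Not IsEmpty(grid(c - dist, r)) And Not IsEmpty(grid(c + dist, r)) Then
                                grid(c, r) = (grid(c - dist, r) + grid(c + dist, r)) / 2
                                filledSomething = True
                            End If
                        End If
                    End If
                Next r
            Next c
        Loop While filledSomething                 ' iterate back and forth at this distance
    Next dist
End Sub

In the real code the simple averages would be replaced by the offset-weighted estimate, and item (f), the outer edges, still needs its own rule.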
 
Thanks MajP!

I'm not sure if the actual data will have such problems. I will look into this and get back to you. I will upload the actual data, maybe tomorrow. The preference is to interpolate through "valid columns" first, then by rows.

I proposed a "simplification" to the approach which I communicated before.

STEPS:

1. Interpolate/extrapolate the valid columns (as before).

2. COPY the "first valid" column and "last valid" column values in a location into the edge columns (i.e. into the first column of the first group and the last column of the last group, respectively, in the LOCATION; no extrapolation anymore, just a simple copy).

3. The "row interpolation" logic then spans all the columns within the LOCATION.

I uploaded an explanation to:

I apologize if the description is long; the idea I have is really simpler than you may think. I don't think iteration is needed. The approach may not be 100% mathematically sound, but if a FEW weird values show up, I think it's OK to resolve them manually. I appreciate your help, MajP! Please let me know what you think. Thanks
 
MajP,

Further to my previous post above, the offset is in meters.

I uploaded the database:

I manually identified the edge columns for each location and filled them with data (see "qrySelectBentsToFill" for the ones that were filled). This was to eliminate the row extrapolation, as explained before. The following problems are noted in the final results:

1- It still goes into the "row extrapolation" code (it should not).

2- Rows with combined characters (such as AB, BC, etc.) are not handled.

I appreciate your help

Thanks
 

MajP!

Just to sum up my last 2 posts:

I constructed the first column and last column within each location manually, in order to skip the row extrapolation. But it seems that it still extrapolates; I'd appreciate it if you can help figure out the reason and advise. Thanks

Two files are uploaded per my last 2 posts.

Thanks
 
I just need you to hear me out. Just because something is easy to calculate manually, coding it may be difficult. At the same time, something that is difficult to calculate manually may be easy to code and compute.

If you were doing this manually, then the method that we are using of:
1. Interpolate within columns
2. Extrapolate leading column edges
3. Extrapolate trailing column edges
4. Interpolate within rows
5. Extrapolate leading row edges
6. Extrapolate trailing row edges
may be the easiest method. Coding the 6 cases is not trivial, requiring a lot of "if then" checks.

I also do not think the approach that we are using is logically correct, if the following is true.

The closest point/s to a pile is the greatest predictor of depth. In other words looking at this grid
1 2 3 4 5
a x 1 x x 3
b x x x x x
c 2 x 3 4 x
d x x x x x
e x 7 x x x

Point 2c is best predicted by 1c and 3c, not by 2a and 2e. This assumes I can determine the distance between rows (I am not sure if this can be done because, unlike columns, we do not have an offset distance; we only have a relative number).

To take this further, look at the following grid (which is modified):

1 2 3 4 5
a x 1 x x 3
b 2 x 4 x x
c x x x x 4
d 2 x 3 x x
e x x x x x

2c is best predicted not by a row or column value, but by 1b, 1d, 3b, and 3d.

QUESTION 1: Is the above premise correct? It might not be depending on your business rules for measuring and estimating depths.

QUESTION 2: Is there a way to know the distance between rows? The offset gives me the column distances, but is there a way to measure the distance between rows? For example, row 4 and row 1: are they (3 * a constant distance) apart? Are they 3 meters apart?

If we can answer the above questions, a mathematically sound and relatively easy-to-code solution can be built using the Inverse Distance Weighting method. See the following:

And again, the formula may look complicated, but the solution is far easier than what we have done so far. If the answer to question 1 is yes, then I just need to know if there is some way to determine the distances between rows.

If it is close (all whole rows are about n meters apart), then we can still use that.
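
A minimal sketch of IDW for a single unknown pile, assuming the row spacing can be expressed in the same units (meters) as the column offsets; knownX/knownY/knownV hold the coordinates and depths of the measured piles, and p = 2 is the usual weighting power (all names are mine, not from the demo db):

Code:
Public Function IDWEstimate(px As Double, py As Double, _
                            knownX() As Double, knownY() As Double, _
                            knownV() As Double, Optional p As Double = 2) As Double
    Dim i As Long, d As Double, w As Double
    Dim sumW As Double, sumWV As Double

    For i = LBound(knownV) To UBound(knownV)
        d = Sqr((knownX(i) - px) ^ 2 + (knownY(i) - py) ^ 2)
        If d = 0 Then                     ' estimating at a measured pile
            IDWEstimate = knownV(i)
            Exit Function
        End If
        w = 1 / d ^ p                     ' nearer piles get heavier weights
        sumW = sumW + w
        sumWV = sumWV + w * knownV(i)
    Next i
    IDWEstimate = sumWV / sumW            ' weighted average of all known depths
End Function

In practice you would probably limit the sum to the nearest few piles, but the weighting already makes distant points nearly irrelevant.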
 
One more thing to verify: there are no differences between groups? The grouping is arbitrary; in other words, Grp 4, Col 11 is next to Grp 5, Col 1. Estimating a point in Grp 5, Col 1 using the nearest point(s) in Grp 4, Col 11 is no different from using point(s) in Grp 5, Col 2.

This would not be true if, for example, in group 5 you are using a different method of setting piles, or it is a different structure type requiring greater depth, or the code says that everything in group 5 needs an additional 2 feet of depth, or you are using a different pile type, etc.
 