Interpolating missing values in Access table


adfo (Programmer)
Oct 23, 2009
Dear all,

I have an Access table containing a "Column" field, a "Row" field, and some values under a "Value" field (an example follows below). I need to interpolate/extrapolate missing values within each COLUMN if a minimum number of values exists to allow interpolation. Next, interpolate within each ROW for any missing values that haven't been filled by the previous process. Example below:

Group  Column  Row  Value
1      1       A    -
1      1       B    8
1      1       C    -
1      1       D    9
1      1       E    -
1      1       F    -
1      1       G    12
1      2       A    -
1      2       B    6
1      2       C    -
1      2       D    -
1      2       E    -
1      2       F    9
1      2       G    -
1      3       A    7
1      3       B    -
1      3       C    8
1      3       D    -
1      3       E    -
1      3       F    12
1      3       G    -
2      1       A    -
2      1       B    -

STEP 1:

- For each COLUMN within the group, if there are at least 3 values, linearly interpolate/extrapolate all missing values as necessary (Column 1 within Group 1 above meets the criteria of 3 values)

- Column 2 has only 2 values and doesn't meet the criteria, so skip it

- Column 3 meets the criteria (at least 3 values), so interpolate/extrapolate, and so on

STEP 2:

Now, for each ROW across all columns within the group (row A, B, etc.), interpolate/extrapolate all missing values. This should fill all the values skipped in Step 1 above. After finishing, move to the next group, and so on.
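
For anyone following along, a minimal VBA sketch of the core calculation might look like this (the function name and arguments are placeholders, not code from any existing module). A driver routine would loop through each Column in a Group, count the measured rows, and only call it when at least three values exist; Step 2 repeats the same pass row-wise:

[code]
' Hypothetical helper: estimate the value at position x by linear
' interpolation/extrapolation between the two nearest measured piles
' (x1, y1) and (x2, y2) in the same column. Assumes x1 <> x2.
Public Function LinEstimate(x As Double, _
                            x1 As Double, y1 As Double, _
                            x2 As Double, y2 As Double) As Double
    LinEstimate = y1 + (y2 - y1) * (x - x1) / (x2 - x1)
End Function
[/code]

When x lies between x1 and x2 this interpolates; outside that range it extrapolates the same straight line.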

Thanks
 

Thanks MajP,

I think I got it to work with reasonable outcomes.
I noticed another limitation when extrapolating down a column: I get negative values, which get even worse if I follow this with row interpolation. I locked the minimum value to 0.5 during column extrapolation (step 2) so it doesn't go below that.

I also did the following changes:

1- I created the first column and last column in each location as explained before (manually, by copying from the closest values)

2- I made a change to "qrySortRows" so it no longer selects or groups by intGroup.


I think the final results seem to be more controllable. Thank you very much for your help; let me know if you want me to upload the final version. Thanks a lot.



 


Sorry MajP!

Didn't see your two posts before I sent mine. Will get back to you.
 
Your last upload is very sparsely populated. Is that what the real data will look like? Any idea of the real percentage of measured data that you will have? I see 5.2%. It seems pretty tough to get any reasonable answer from that. However, if you are going to have sparsely populated data, then a method such as Inverse Distance Weighting will certainly give you a more reasonable answer, because it uses all points to help in the predictions.
 
Yes, that is pretty much it. The recorded values are very few, and we need the best estimate for the missing 95% of the data. This mandated new rules to avoid the side effects in a reasonable way, for instance the negative values that I get through column extrapolation (the piles should always have positive values, not be buried). Although the calculations are mathematically sound, I have to keep a limit on what the minimum value can be and keep the simple linear approach; let's say 0.2 is what is agreed on.

The same goes for row extrapolation; I think switching to pure interpolation is more realistic, to avoid (as much as I can) the same extrapolation problems. The steps to handle these scenarios, in light of the actual data, are summarized below:

1- Column interp/extrap; extrapolated values should be capped when they decrease (0.2 for example; see the sketch after this list)

2- Create boundary columns for the locations (copying is OK considering the small distances between columns; I did this manually at this point), then do row interpolation on each location
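
A rough VBA sketch of the cap in step 1 (the function is hypothetical; 0.2 is just the agreed floor, nothing built into Access):

[code]
' Hypothetical cap: keep an extrapolated pile value from dropping
' below the agreed minimum (0.2 by default).
Public Function CapFloor(est As Double, Optional minVal As Double = 0.2) As Double
    If est < minVal Then
        CapFloor = minVal
    Else
        CapFloor = est
    End If
End Function
[/code]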

There are still some glitches when the whole data set is sorted back together. For example, if you look at a column whose values were totally missing before, you may find fluctuating values (up and down) along the column, which doesn't make sense given the trend of typical column values (ascending or descending).

I think there should be a second iteration to smooth these values out. I'd appreciate it if you have an approach in light of what I have. Thanks


 
Sorry, but if you want further help I really need you to stop postulating these possible ideas and first answer the numerous questions that I have asked. You are proposing an overly complicated solution:
1. Check left edge column for a given location and see if it is valid
2. If not valid, find lowest valid column
3. Copy lowest valid column
4. Insert lowest valid column into left edge
5. Check right edge column for a given location and see if it is valid
6. If not valid, find highest valid column
7. Copy highest valid column
8. Insert highest valid column into right edge
9. Interpolate within columns
10. Extrapolate leading column edges
11. If value decreases below .2, set remaining values equal to .2
12. Extrapolate trailing column edges
13. If value decreases below .2, set remaining values equal to .2
14. Interpolate within rows
15. Extrapolate leading row edges
16. If value decreases below .2, set remaining values equal to .2
17. Extrapolate trailing row edges
18. If value decreases below .2, set remaining values equal to .2

The above is not only complicated, it is not sound and produces unrealistic results. So I have no approach to offer for it.

I laid out some pretty detailed discussions and questions, but you have ignored them. I believe I have shown clearly that this approach is not sound and will give unrealistic results, and putting a bandaid on it is not the solution. I believe I can provide a quality solution, but I need to understand some fundamental questions first.
 
I apologize; I didn't ignore your questions at all. It was a combination of us both posting at the same time and of me not being able to respond promptly until I got some information first. Below are the answers:

QUESTION 1:
Ans.: I agree with this concept. It sounds like it would provide more accurate results, considering the relatively small distances between all the piles and the illustration you provided. In reality, we may not have the luxury of having all the nearby values present as shown in your illustration. Although this concept is not my first preference, I don't mind applying it if you conclude that it is still the best after you finish reading my answers below.

QUESTION 2:
Ans.: The "measured" distance between 2 consecutive rows ranges between 1.0 and 1.2 m (1.1 m on average). If we have a missing pile in a particular column (for example E then G, not F), that means there's no pile at that particular location and the distance between E and G is 2.2 m.

VERIFY REQUEST:
Regarding grouping, yes, the groups are arbitrary, with no major difference within the location. You may recall the figure I provided before showing how the groups are back to back within the same location. And yes, Grp 4, Col 11 is next to Grp 5, Col 1. Estimating a point in Grp 5 Col 1 using the nearest point(s) in Grp 4 Col 11 is no different than using point(s) in Grp 5 Col 2.

YOUR SUMMARY STEPS (18 steps):
A small correction to your steps summary: the following steps should not be there:
11: The decreasing values are a one-sided problem, since the soil profile slopes in one direction (the column direction)
15 to 18: There should be no row extrapolation if the edge columns were filled with data from the closest column (from steps 1 to 7)

Question/idea for discussion: do you think there is room to build on what we have achieved so far? For example, is it possible to do a second iteration on all columns and rows after they have been filled with data, and perform a "relaxation" between all points, or get rid of the sharper changes in both directions? I'm not sure whether this is feasible, but I thought I'd throw the idea out in case you think there is a mathematical way.

I hope the above answered your questions. I'm pretty sure you have a good idea of what is best to do under these circumstances.

I uploaded the latest run results, in case you need to look at them:
The only difference is that it assigns 0.3 if the extrapolation drops below zero.
 
OK, when I look at just the measured values I see that in the columns there is always a decreasing slope, and within the rows the values undulate.

1) Is that just a coincidence of the underlying soil profile, or are you specifically trying to force the columns to slope? This is really important, because if we want a continuous slope, that is a different solution than if the profile just happens to be sloping and we are estimating the location of the next point. If you are trying to force a slope, then the column values are more important in predicting a point than the adjacent row values.
   1  2  3  4  5
a  x  7  x  x  3
b  x  x  x  x  x
c  2  x  3  4  x
d  x  x  x  x  x
e  x  6  x  x  x

So in the above grid, is 2c better predicted by 1c and 3c with a value of 2.5, or better predicted by 2a and 2e with a value around 6.5?

2) If you are trying to force a slope in the column direction, which is better?
a) Trying to find a single slope that best matches all known points (best fit); a rough sketch of this appears after the two diagrams below. The dots represent my predicted line and the x's represent known points.

. x
.
.
.
x .
.
. x
b) Or making a series of slopes between known points, like you are doing now. The possibility exists of the slope going up and down or getting flat.
x
.
.
.
x
.
.
x
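
For what it's worth, option (a) could be sketched in VBA as an ordinary least-squares fit per column (array names and layout are assumptions, not code from any upload):

[code]
' Hypothetical least-squares fit for one column: returns slope m and
' intercept b for the known (position, value) pairs, so a missing
' pile at position x would be estimated as m * x + b.
' Assumes at least two distinct positions.
Public Sub BestFitLine(pos() As Double, vals() As Double, _
                       ByRef m As Double, ByRef b As Double)
    Dim i As Long, n As Long
    Dim sx As Double, sy As Double, sxx As Double, sxy As Double
    n = UBound(pos) - LBound(pos) + 1
    For i = LBound(pos) To UBound(pos)
        sx = sx + pos(i)
        sy = sy + vals(i)
        sxx = sxx + pos(i) * pos(i)
        sxy = sxy + pos(i) * vals(i)
    Next i
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - m * sx) / n
End Sub
[/code]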

3) I do not see any natural slope between columns. But if there is a desire to fit the best slope in both the x and the y directions, that can also be done somewhat easily. Imagine you have a 3D projection showing x, y, and all the z values (depths). Then you could try to fit a plane that best fits all of those values.

I believe that the nearest point/s are the best predictor of an unknown point, but possibly there is a column dependence that you need to describe.

Can you describe how you determine which 5% of the piles to measure? In a perfect world I would think that you would evenly space out the 5% to measure, and that could then be used to estimate the required depth for all the other piles. That does not seem to be the case, because the measurements are often clustered.

Can you also describe what that measurement is? The key here is whether the randomness of the soil profile dictates the depth or you are trying to force a depth. Is it where you are hitting bedrock?

 
Thanks MajP,

"I see that in the columns there is always a decreasing slope"

Ans: Yes, that should make sense due to the typical nature of sloped ground as well as the row numbering system. I strongly suggest that you look at the file I uploaded. That will be the best way to visualize the condition, and I believe it contains other information that may clarify any questions you have.


Regarding your questions:

Q1:
Ans.: I believe you will agree, in light of the typical cross section included in the file, that 2c is better predicted by 2a and 2e, with a value around 6.5.

Q2:
Ans: For now, it was agreed that your option (b) is what is needed at this point to give a rough estimate.

Q3:
Ans: I agree this is the best. It is similar to the 3D rendering concept for ground topography. It would be good to have, but I think it may be complicated(?). If it is, then I would stick with the 2D approach in both directions that we have taken.

Description of how the measured piles were selected: also in the uploaded file.

Thanks
 
Here is the updated solution.


It allows you to pick several possible means to get an answer.
1. Column interpolation
2. Column extrapolation
3. Row interpolation
4. Row extrapolation
5. Column linear best fit
6. Inverse Distance Weighting

My recommendation is either a pure Inverse Distance Weighting, which uses the distance to all known points to estimate an unknown point, or possibly to first interpolate within the columns (since there is a strong correlation in the columns) and then do an IDW on the remaining points. The IDW method I used is the simplest version. If I were doing this, I would look at some improvements to this function. Currently all points influence the result, with the influence decreasing at a linear rate. You can improve the function by limiting which points influence the estimate and how fast the influence decreases.
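
For reference, the simplest form of IDW can be sketched as follows. This is a generic VBA version (the names, array layout, and distance power are assumptions, not necessarily what is in the uploaded solution):

[code]
' Hypothetical sketch of simple Inverse Distance Weighting: estimate the
' value at (x, y) from all measured points, each weighted by
' 1 / distance ^ power so nearer points dominate. The power parameter
' controls how fast influence falls off; restricting the search to
' nearby points is another improvement mentioned above.
Public Function IDWEstimate(x As Double, y As Double, _
                            xs() As Double, ys() As Double, zs() As Double, _
                            Optional power As Double = 1) As Double
    Dim i As Long
    Dim d As Double, w As Double, sumW As Double, sumWZ As Double
    For i = LBound(xs) To UBound(xs)
        d = Sqr((x - xs(i)) ^ 2 + (y - ys(i)) ^ 2)
        If d = 0 Then
            IDWEstimate = zs(i)   ' exact hit on a measured pile
            Exit Function
        End If
        w = 1 / d ^ power
        sumW = sumW + w
        sumWZ = sumWZ + w * zs(i)
    Next i
    IDWEstimate = sumWZ / sumW
End Function
[/code]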

I had to massage the data because you have what appear to be mistakes or inconsistencies:
1) Duplicated entries
2) Repeating offsets within locations for different columns
3) Same offset for different columns
 
Thanks MajP for the great help!
I will experiment with these various methods.
If I knew your whereabouts, I would send you a rose!

Thanks a lot!
 