Parsing Excel formula 2

N1GHTEYES · Jan 4, 2011

I need to be able to parse the formulae in some excel cells. Specifically, I want to be able to determine which elements of the formulae consist of operators, inherent functions, UDFs, range addresses, named ranges and, crucially, which are undefined names.

Obviously, I could write this from scratch by checking for keywords and so forth, but that would be quite a complex task. However, it occurs to me that excel itself must do this internally - i.e. it must know what a formula "means" before it can execute it. Therefore it must have some form of internal representation of the formula in which each entity is categorised into one of the above groups (operator, address etc).

Does anybody know if this internal formula breakdown is accessible in any way? If so, how? Is there any way, in VBA, in which I can see the structure into which Excel internally parses a formula in a given cell?

Thanks, in advance,

Tony

N1GHTEYES · Jan 4, 2011

If the above can't be done by directly accessing Excel's internal parsed structure, can anyone suggest where I might be able to find some useful VBA to do the job?

Actually, I don't need a full formula parser. All I am really looking for is a means to identify named variables (ranges) called in a formula. The tricky bit is that it has to work even if the named range has not yet been defined.

Let me explain. If I have a cell with the formula:

=fred*sin(bill)/sqrt(henry)

then I want to be able to apply a function (in code) to that formula such that the output of the function is the array (fred, bill, henry) - whether fred, bill or henry are defined as range names or not. In other words, the function should look for anything in the formula which is not an operator, or a cell address, or a worksheet function, or a UDF.

Can anybody think of an easy way of doing this?

I had the vague thought that it might be possible to try to do it by using the evaluate method and somehow catching any errors generated, but I'm not sure where to start.

Any suggestions would be welcome.

Thanks,

Tony

SkipVought · Jan 4, 2011

...and the business case for this is...

???

Skip,
_{Just traded in my old subtlety...

for a NUANCE!}

xlbo · Jan 4, 2011

There was a similar (ish) post on this a while back about searching for constants in formulae - that wasn't so difficult as the constants were all numbers and therefore relatively easily identifiable.

Your main issue is that to know if the string "abc" is not a function, you need to be able to iterate through ALL the functions available to the workbook - this may mean non native functions created as udfs as well as functions available as part of an add-in. In short, without hard coding a list of functions somewhere I think you'll be hard pressed to do this...

Rgds, Geoff

We could learn a lot from crayons. Some are sharp, some are pretty and some are dull. Some have weird names and all are different colours but they all live in the same box.

Please read FAQ222-2244 before you ask a question

N1GHTEYES · Jan 5, 2011

Geoff,

Thanks for the reply. I realise it could take a bunch of coding to properly parse a formula, which is why I was asking in the original post if any of the clever chaps here knew of any way to access Excel's internal representation of the formula structure. However, the problem is not quite as hard as having to do a full parse of all possible excel formulae - see my reply to Skip to see why.

Skip.

I am considering creating a simple model development tool in Excel. It is partly a "business case" and partly an intellectual exercise.

I recently put together a feasibility / tradeoff model for a particular optical sensor. It started off as a simple list of inputs and calculated values, but the logic was quite complex and a little hard to understand - especially when discussing the results with colleagues. So I tried assigning the values to text boxes which could be connected via connector arrows and moved to make the logical flow and internal dependencies more obvious.

This actually proved very useful, but it was a tedious p.i.t.a. to do the assigning of the data to the textboxes. It also occurred to me that if the process was automated it could do more than simply display the logic flow and would also be a very quick way of building a model in the first place.

Therefore, I have been thinking about putting together such a tool, mostly in my own time, partly because I think it would actually be quite useful at work, but also as an interesting exercise.

The idea is based around a set of interfaces as follows.

1 Variable Parameter Input
This would enable the user to specify a set of parameter values associated with each variable, e.g.

Name: Integrated_Photons
Description: Number of photons captured per frame by each pixel
Units: Photons
Default Value: 1000000
Use Default: no
Formula: Source_photance*pixel_footprint*sensor_subtense*staretime

The purpose of the default values would be to allow the user to run the model while it was being developed (e.g. before the derivation of a given parameter had been defined) or to determine desired values to achieve specific downstream results.

2 Variables Listing

This would be a list of all the variables currently in the model – including not just the ones for which the definition had already been given as above, but also those whose existence is implied by their inclusion in the formulae. For example, the above variable implies the existence of the additional variables “Source_photance”, pixel_footprint”, sensor_subtense”, and staretime”. Clicking on a defined variable would activate the input form for that variable, allowing it to be edited. Clicking on a currently undefined variable (which would be displayed in a different colour) would activate a new input form to prompt the user to define it.

3 Tabular Data

This would show all of the parameters of all the currently recognised variables. Essentially this is the actual Excel model.

4 Graphical Display

This would show a text box per variable, laid out in a logical flow given by the dependence hierarchy, with connecting arrows showing dependence. The user would be able to arrange these by hand or to select “auto-placement”. Types of variable would be colour coded – e.g. final value, calculated intermediate value, user-input, physical constant. Any clash of units would also be highlighted here (and probably in the Tabular Data section too). For example, if the variable “velocity” is given by the formula “=distance/time” and “distance” is in units of m and “time” is in units of s, then the units of “velocity” should be m/s.

I realise that I might be better off using Labview or Simulink for something like this, but I was developing the original model at home over Christmas, where I don’t have access to either of those, and I do frequently work at home, so being able to do this kind of thing in Excel is quite handy. Also, both of the above are pretty much terra incognita as far as I’m concerned. In addition, I thought it might be a fun exercise (which is why I’ll mainly be doing this in my own time – except for stuff like this message!).

I hope the above answers your question.

Anyway, I’m sorry if that was a bit long winded, but from the above (very superficial) description, it can be seen why I want to be able to parse the formulae:

a) To determine the existence of implied variables
b) To determine variable dependence
c) To check the consistency of units

However, given the way in which this will be used, I can’t see any reason why I would need to explicitly specify any actual cell addresses – all variables would be referenced by name alone. Consequently, I at least don’t need the parser to handle addresses, which might make the job a little easier. In fact, as I write this, I think I might have the answer. All functions must contain brackets, so any text which falls between an operator and an open bracket “(“ must be a function and can thus be ignored. Hmm, I’ll have to think about that.

Tony

macropod · Jan 5, 2011

The tricky bit is that it has to work even if the named range has not yet been defined.

And how do you suppose Excel could do this? If there's an unresolvable string in a formula, IMHO Excel will have no way of differentiating between a typo, a call to a non-existent UDF or an un-named range.

To get you started, here's a sub I posted some time back to parse a cell's formula. It doesn't do what you're asking (it converts all the formula's expressions to their calculated values), but at least it shows how you might go about it:

Code:

Sub GetRefs()
Dim MyRange As Range, strFormula As String, strVal As String
With ActiveCell
  strFormula = .Formula
  For Each MyRange In .Precedents.Cells
    With MyRange
      strVal = """" & Range(.Address).Value & """"
      strFormula = Replace(strFormula, .Address, strVal)
      strFormula = Replace(strFormula, .Address(RowAbsolute:=False), strVal)
      strFormula = Replace(strFormula, .Address(ColumnAbsolute:=False), strVal)
      strFormula = Replace(strFormula, .Address(RowAbsolute:=False, ColumnAbsolute:=False), strVal)
    End With
  Next
  MsgBox .Formula & vbCr & strFormula
End With
End Sub

Cheers
Paul Edstein
[MS MVP - Word]

N1GHTEYES · Jan 6, 2011

Thanks for the reply Paul.

In answer to your question, I think Excel must internally do what I need. If I evaluate, step by step, a formula containing an undefined name, excel happily completes those steps which are resolvable, and the unknown name is replaced with NAME?. This is essentially what I am after. Obviously you are right - excel cannot know whether the item is an undefined name or a typo, but that's fine. It would be presented to the user as a new, undefined name. If he saw it and thought "that's a typo" he'd just go in and fix it.

As far as undefined UDFs are concerned, I think that is resolved by my comment at the end of my last post. All functions (UDFs or otherwise) must contain the structure:
<operator> <function name> <(> <*> <)>. If there are no brackets, there are no functions. If the brackets are not paired, Excel will immediately throw this up as an error, before my code could get to the parsing stage. Anything which lies directly between an operator and a "(" must be a function name - whether it is currently defined or not.

Regarding the code, thank you for passing it on. I tried using .Precedents before I first posted, but whenever I used it on a cell with an undefined name, I just got an error when I tried to access it. Having run your code, it works fine, so I guess I must have cocked up somewhere, but since I was just experimenting, I did not bother saving the workbook, so I can't say what I got wrong.

As I mentioned at the end of my last post, I can't think of a reason why I would need to have any actual addresses in the formulae I will be parsing (for this specific application), but if I'm wrong and I do have some, then I think I can use your method to remove them before I start looking for names. Having removed addresses, it should be fairly straightforward to identify names by: spliting the formula at operator boundaries, removing functions (as defined above), then eliminating any remaining sections which are purely numeric. Thereafter, any sections which are left should be names. I can then compare those with the recognised names in the workbook .names collection to identify those which are as yet undefined.

This at least accomplishes my goals a and b. To do c I would need to do a little bit more work, but it should not be too hard having got that far.

Thanks again for the help.

Tony

macropod · Jan 6, 2011

Hi Tony,

Re the UDF, it need not be preceded by an operator and might only have an empty pair of brackets following. For example:
=SUBSTITUTE(UDF(),"$","")
where the UDF is:

Code:

Function UDF()
UDF = ActiveCell.Offset(-1, -1).Address
End Function

Cheers
Paul Edstein
[MS MVP - Word]

N1GHTEYES · Jan 6, 2011

Hi Paul,

Hmm, yes and no. I'll rephrase what I said. The first function in a formula must lie between <operator> and <(>. If such an identified function plus its opening bracket is then regarded as a form of operator, the statement is true of all the functions in a formula.

So, in your example, =SUBSTITUTE(UDF(),"$","")

If one regards "=" as an operator, then the structure <=> <substitute> <(> indicates that Substitute is a function. Then, if "substitute(" is regarded as an operator, UDF can also be seen to lie between an operator and an open bracket, and hence can itself be identified as a function.

I could probably have phrased that better, but is that clear?

Obviously, if one wants to know what the function is operating on, one also needs to correctly identify the closing brackets, but the above should at least be sufficient to find the parts of the formula which are function names.

Tony

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Parsing Excel formula 2

N1GHTEYES

Technical User

N1GHTEYES

Technical User

SkipVought

Programmer

xlbo

MIS

N1GHTEYES

Technical User

macropod

Technical User

N1GHTEYES

Technical User

macropod

Technical User

N1GHTEYES

Technical User

Similar threads

Part and Inventory Search

Sponsor