Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Chris Miller on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Extract strings made of capital letters and underscores

Status
Not open for further replies.

JohnJDB

Technical User
Jan 11, 2011
4
BE
Hi!

I would need your help to extract specific strings from a word document.

Actually, one of my colleagues typed in math formulas (thousands of them) in MS Word and I need to be able to handle them in Excel.

They look like this: MO_SAL_EXP=1_62112_10_10+1_62120_10_10+1_62130_10_10

What we call the "complex criterias" (here MO_SAL_EXP) are always made of capital letters and underscore, but length is variable as well as the number of underscores.

What we call the "basic criterias" are always structured this way "N_NNNNN_NN_NN" (N being a number). The humber of basic criterias in a formula may vary and all types of calculations are possible: +, -, *, / with or without ().

Other big issue for me: there is text embedded between some formulas (for example, "MO_SAL_EXP=1_62112_10_10+1_62120_10_10+1_62130_10_10 should be positive while MB_PROD_NI_VEG should be negative")

The result I want to obtain is a list with all formulas and/or criterias. If the list is just text, it's OK. So, basically, I want to get rid of the polluting text between the formulas contained in my word documents.

I know it is tricky and uncommon, but if you have any idea, I would very much appreciate your help.

Thanks!


 


hi,
So, basically, I want to get rid of the polluting text between the formulas contained in my word documents.
Check out the Split function.

If your 'formula' is in the activecell then
Code:
Sub test()
    Dim a, i As Integer
    
    a = Split(ActiveCell, " ")
    
    For i = LBound(a) To UBound(a)
        If a(i) = UCase(a(i)) Then _
            Debug.Print a(i)
    Next
End Sub
in the debug window...
[tt]
MO_SAL_EXP=1_62112_10_10+1_62120_10_10+1_62130_10_10
MB_PROD_NI_VEG
[/tt]


Skip,

[glasses]Just traded in my old subtlety...
for a NUANCE![tongue]
 
You might want to look at the Mid, Split, Instr, Len and possibly Replace functions in the VBA help.

Alternatively, you might be able to do what you want directly in the spreadsheet without resorting to VBA. In that case, you might want to look up what you can do with: Find, Search, Len, Mid, Replace, Substitute, and any of the other functions available in the "Text" category.

Tony
 
Thanks for your quick reaction!

I thought of using the SPLIT function. The only problem is that all the "formulas" aren't typed correctly. Some of them include spaces between the "criterias".

So, the only common thing to those "formulas" is that they begin with capital letters (at least 2) and end with a basic criteria "N_NNNNN_NN_NN".

Is there a way to find 2 capital letters in a row ? and is it possible the find a string formatted like this N_NNNNN_NN_NN" ?

Thank you
 


So, basically, I want to get rid of the polluting text between the formulas contained in my word documents.
The only problem is that all the "formulas" aren't typed correctly. Some of them include spaces between the "criterias".

So is that a problem?

Where are your 'formulas' now? I assume that each formula is programatically identifiable.

Skip,

[glasses]Just traded in my old subtlety...
for a NUANCE![tongue]
 
You can use the Code function (in the spreadsheet) or Asc function (in VBA) to determine the ascii code of an individual character.

Numbers all lie between 48 to 57.
Upper case letters are between 65 to 90
Lower case letters are between 97 to 122

To see them all, create a list from 1 to 256 in excel and find the Char() of the value.

Tony
 
Sorry, I meant a list from 1 to 255. 256 gives you a VALUE error.
 


How would this formula, sans text, be finally rendered?
[tt]
MO_SAL_EXP=1_62112_10_10+1_62120_10_10+1_62130_10_10 should be positive while MB_PROD_NI_VEG should be negative

[/tt]

Skip,

[glasses]Just traded in my old subtlety...
for a NUANCE![tongue]
 
What I want to obtain is a list like the following:

MO_SAL_EXP=1_62112_10_10+1_62120_10_10+1_62130_10_10 MB_PROD_NI_VEG

One "formula" or one "criteria" per line...
 


If each formula is programatically identifiable, then my solution can be used to generate that.

Skip,

[glasses]Just traded in my old subtlety...
for a NUANCE![tongue]
 


If you think not, post some difficult examples.

Skip,

[glasses]Just traded in my old subtlety...
for a NUANCE![tongue]
 


Here's 2 examples with text and spaces
[tt]
MO_SAL_EXP=1_62112_10_10+1_62120_10_10+1_62130_10_10 should be positive while MB_PROD_NI_VEG should be negative
MO_SAL_EXP = 1_62112_10_10 + 1_62120_10_10 + 1_62130_10_10 should be positive while MB_PROD_NI_VEG should be negative
[/tt]
Here are the respective results...
[tt]
MO_SAL_EXP=1_62112_10_10+1_62120_10_10+1_62130_10_10 MB_PROD_NI_VEG
MO_SAL_EXP = 1_62112_10_10 + 1_62120_10_10 + 1_62130_10_10 MB_PROD_NI_VEG
[/tt]
using this function in the adjacent cell in Excel...
Code:
Function StripFormula(r As Range)
    Dim a, i As Integer
    
    a = Split(r.Value, " ")
    
    For i = LBound(a) To UBound(a)
        If a(i) = UCase(a(i)) Then _
            StripFormula = StripFormula & a(i) & " "
    Next
    StripFormula = Left(StripFormula, Len(StripFormula) - 1)
End Function



Skip,

[glasses]Just traded in my old subtlety...
for a NUANCE![tongue]
 
Thanks very much Skip, your solution works great!

Have a nice day !
 
Hi John,

In Word, you could use a wildcard Find/Replace, where:
Find = *([^32^65-^90^95]{8,12}=[^32^40-^43^45^47-^57^95]{10,100})*
Replace = \1^p

Cheers
Paul Edstein
[MS MVP - Word]
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top