Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Microsoft Excel - Filter Cells by Format 5

Status
Not open for further replies.

kjv1611

New member
Jul 9, 2003
10,758
US
I've come accross a particular need to be able to filter a column in Excel by the format. The specific problem consists of a field that is populated with various types of personal information from various consumers. It for a report showing what employees searched for with a particular application.

The current report only gives us what that person searched for. For instance, Agent Smith might have searched for "123 Main St", whereas Agent Jones might have searched for "123-45-6789", and Agent Miller searched for "555-555-1212" just as some examples as to the actual data in that column.

What I would like to do (and I can import it into Access if that would make it simpler) is to split all of this out into separate Excel Spreadsheets (or tables/queries in Access) based on what type of information is in that field.

I was not sure of a way to do this in Excel with an autofilter (it is possible that I just overlooked it, or am just not aware of the particular method, as of yet). So, I thought I could come up with a VBA method to loop through the cells in the one column, and based on the findings for each column, copy that record to the appropriate spreadsheet.

So, what I would want to eventually have (in whatever is the best format - Excel, Access, SQL, whatever) is probably up to 5 or 6 different files, and each would be for a different search criteria. One would be address, one = SSN, and one = telephone (there are a few others, but I think I can just take care of those rather easily, based on a different field).

I have done some VBA in Excel (haven't done much in the past year or so), but I can't think of what would be the best way to check for specific formatting in the cells. I am sure there is a way, and there may be an altogether better method than I am currently dealing with.

Here is what I've gotten started with so far:
Code:
Private Sub FilterSSN()
    Dim wb As Workbook
    Dim ws As Worksheet
    Dim x As Long [green]'Long Integer Variable used for counting rows[/green]
    Dim bolLoop As Boolean [green]'True if to continue Looping because there is still a record, or False if no record to check.[/green]
    
    Set wb = ActiveWorkbook
    Set ws = wb.Worksheets(ActiveWorksheet)
    x = 2
    bolLoop = True
    Do While bolLoop = True
        If ws.Cells(x, 1) = vbNullString Then bolLoop = True
        Else
            Select Case ws.Cells(x, 3).Formula
             ~just psuedo-code here...
                Case Format1
                    Copy Row to Sheet1

                Case Format2
                    Copy Row to Sheet2

                Case Format3
                    Copy Row to Sheet3

                Case Format4
                    Copy Row to Sheet4

                Case Else
                    Copy Row to SheetUnknowns 'For manual review
            End Select
        End If
    Loop
End Sub

Of course, the error handling and every detail has not been worked out yet, but I'm trying to get the idea down pat before I worry about the extras.

Any suggestions would be greatly appreciated. Thanks in advance.
 
Just a note here: on the SSN mixed with other stuff. It appears now that some of them are extended zip codes, and some of them are actual SSNs, but since it's all numbers, there's no way to differentiate very well. So, I'm just going to flag those (very few compared to the whole) as "UNKNOWN" for now, so they can be researched manually.

--

"If to err is human, then I must be some kind of human!" -Me
 
In reply to an inadvertent misposting in a different thread... first the mis-posted post. Ok, now I need to untie my tongue. [wink]

SkipVought said:
kjv,

Not to throw a fly in the ointment, or any thing like that, but...

there's this issue that I often run into, of NUMBERS versus IDENTIFIERS.

Numbers are often used as identifiers (hence we get masks in Access for phone, ssn, zips, etc)

And in the days of expensive core memory, "we" used to take all kinds of abbreviations to SAVE memory and storage.

But, hey, this is the twenty first century, the age where we argue not the question of how many angels can dance on the head of a pin, but how many gigs can fit thereon.

Gigs not gigs!

So back to NUMBERS versus IDENTIFIERS. Not so much in the Access world, but in the Shreadsheet world, people can enter

999-99-9999
or
999999999

in a column FORMATTED for a SS-NUMBER. So you get this MIXTURE of NUMBERS and STRINGS -- very confusing when it comes to sorting and querying and parsing etc. Then there's the issue of leading zeros, a whole nuther aspect.

So as a matter of practice, shouldn't IDENTIFIERS be strings to begin with? Hope that this is not rain on your parade, but it's sunny in Fort Worth.

Skip,
[GLASSES]
[TONGUE]

My reply/note:

As far as dealing with SSNs, the field in our SQL Server is set to Varchar, and I set the Excel sheet column to text (from general) prior to taking out the dashes, to as to avoid any problems with that sort of thing. The reason I looked at the general/text format was because of the leading zeros being whacked off when Excel auto formatted the cells as numeric format - which would be correct probably 90% of the time, I guess.

--

"If to err is human, then I must be some kind of human!" -Me
 
Using Regexp you can extract the text that matches the pattern, not just test for its existence.
 
ESquared,

You mean that using Regexp, I can actually pull the text out of the cell, and put it in a different cell, or are you saying something totally different?

Could you elaborate a little?

Thanks.

--

"If to err is human, then I must be some kind of human!" -Me
 
Put a string in cell 1,1 such as "asdlfkja sdljaw3 lkj 12309587213 098 123-45-6789 asdflkj asdflkj 456-78-9012 asldkjasd"

In your module add a reference to Microsoft VBScript Regular Expressions 5.5.

Then run this code:

Code:
Sub RegExpExample()
   Dim oRgx As VBScript_RegExp_55.RegExp
   Dim oMatches As VBScript_RegExp_55.MatchCollection
   Dim oMatch As VBScript_RegExp_55.match
   Dim iCol As Long
   Set oRgx = New VBScript_RegExp_55.RegExp
   With oRgx
      .Pattern = "\b\d{3}-\d{2}-\d{4}\b"
      .Global = True
      Set oMatches = .Execute(ActiveSheet.Cells(1, 1))
      iCol = 2
      For Each oMatch In oMatches
         Cells(1, iCol).Value = oMatch.Value
         iCol = iCol + 1
      Next
   End With
End Sub
That should do all the explaining you need...
 
Oh, you can also check for submatches. Submatches are any parenthetical groups in the found string that don't have the operator that says not to make them a submatch.

"\b\(d{3})-(?:\d{2})-(\d{4})\b"

This means that if a match is found (the match object's value being the whole SSN) then there will be two submatches. The first submatch is the first three digits, as they are in parentheses. The second set of parenthesis is not a submatch as it has the ?: operator which says not to make it a backreference. The third set of parentheses is the second submatch, the last four digits.

If you are replacing, you can use special tokens to refer to these submatches. So if for some reason you needed to reverse the groups in an SSN, you could so something like...

.pattern = "\b\(d{3})-:)\d{2})-(\d{4})\b"
.replace sourcestring, "$3-$2-$1" ' refers to the three groups in reverse order
(I closed Excel already so you can look up the exact syntax :) )


See for more. I recommend you read the entire tutorial if you're going to keep using Regular Expressions. This particulary site is good and has a good explanatory style, but there are others which come from a different angle. Sometimes I have trouble finding what I need on this site, but when I know exactly what I want or I read top to bottom, it's always there.
 
Wow! That is a TON of good information. I'm almost drooling at the possibilities!

Off the top of my head, I'm thinking of a way to possibly divide up the data in to different columns in order to make the data more query-able! So, for instance, I suppose I could use this, and have the following columns:

Original Data SSN Name Address City State PHONE

Well, at least maybe. The address/city/state stuff would probably be difficult to distinguish.

Either way, it'd be better than what's there to start with.



--

"If to err is human, then I must be some kind of human!" -Me
 
I think that deserves an extra pointy thing, but I can't give more than one in the same thread. Oh well.

--

"If to err is human, then I must be some kind of human!" -Me
 
I meant "more than one in the same thread" ... and to the same person.

--

"If to err is human, then I must be some kind of human!" -Me
 



There! I did one of them pointy thangs for the regulr guy!

Skip,

[glasses] [red][/red]
[tongue]
 
[smile]

--

"If to err is human, then I must be some kind of human!" -Me
 
That tutorial at looks pretty good. I'm going to look at it some more, and may consider buying it in book form, so that I can read it over some time away from the computer - maybe. It's got so much there on the topic.

--

"If to err is human, then I must be some kind of human!" -Me
 
Well, since the address is so hard to define, one thing you could do is extract all the pieces that you CAN find, and replace them out of the main string. Once all the phone numbers, SSNs, and other easily identifiable things have been removed, what's left just might be address information.

State is only a list of 50, call it 100 with abbreviations, so you could possibly extract that.

There are also a limited number of "Street" expressions, which could help you start identifying where in the string the address might be found: Street St Way Wy Road Rd Circle Cir Boulevard Blvd Expressway Expwy Avenue Ave Loop Lp Place Pl Turnpike Highway Hwy Canyon Cyn

With clever use of sorting, I've found that I can do something halfway between fully-manual and fully-automatic. A sort of recursive process where I see a feature I can pull out, and then use other columns in the spreadsheet to extract that, then sort. Sometimes the sorted data has enough uniformity that I can use entire huge sections as-is. After enough iterations of this, the last 30 or so lines which defy mass manipulation I end up doing manually, as that's more efficient than trying to build an automatic process to handle each remaining variation.

I'd have to see your data but to elaborate, there are times when you can start performing transformations on known patterns to make the data more uniform without actually modifying its information. For a trivial example, replacing all double spaces with single spaces works in some data tasks for me. Although in your mixed-up data, double spaces could signify a break between two types of information.

You might want to look into something like the free Tiger maps data from the government (if I'm remembering this correctly). There might even be an API where you could submit addresses and the software would try to scrub them to see if it can put them in a canonical form. That at least might help you identify rows which need human intervention.
 
All of that information is very interesting indeed. Hopefully, I'll get the time to look into all of that. At the least, I might look into the Tiger maps and/or available APIs for this purpose. Thanks!

--

"If to err is human, then I must be some kind of human!" -Me
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top