Traverse all records - return REGEX/VALUES pattern and count

tcerv79 · Jun 25, 2015

In my case I am trying to see if there has been anyone who has devised a way to obtain an output that would traverse through all records for a given column and find all the patterns (whether it returns REGEX or the VALUES), then count how many times that exists. I know a lot of 1 counts would come back, so I more than likely would include a HAVING > 5 for instance.

Table =

dbo.PEOPLE

Columns =

LogID [int] IDENTITY(1,1) NOT NULL

,SourceData [varchar] (50) NOT NULL --which could include values that have numeric, alpha, and characters

,Account [nvarchar] (50) NULL

EXAMPLE DATA SET of table output (pipe delimited)
1|123X-456|56789
2|456/123|A345
3|ABC 345|X567
4|456 HELP-123|YCHB
5|456/987|2345
6|CDE 345|XX987
7|345X-456|YY6767

So what I would like my output to look like is two columns - PATTERN|COUNT
%X-456|2 --pulling in LogID rows 1 and 7, finding the like pattern and omitting everything else replacing with wildcard
456/%|2 --pulling in LogID rows 2 and 5
% 345|2 --pulling in LogID rows 3 and 6

Now, if someone has something that returns REGEX pattern vs. actual pattern values, then I would work with that.

Essentially I figured I would need to traverse through each character of a field, after reading the LEN of the field, and starting with position/char 1 reading each character until the end and writing each character to a TEMP with a cursor/CTE. Then coming back through and seeing how many times I can match that pattern where it is, for instance > 5 as stated at first. I know the execution plan will be robust and consume a lot of my SQL server resources, but it has to be easier than hiring people to find patterns.

Thanks in advance.

johnherman · Jun 25, 2015

I wonder if this is better suited to a unix awk/grep type analysis. But then again, you're in the SQL Server forum, which means Microsoft and which precludes a unix/linux solution. In case you hadn't noticed, Microsoft completely shuns Unix and Linux because they are competitors.

I see what you're thinking, but pattern matching groups of bytes might be more efficient at the OS level. You can always write your results back to the database.

==================================
adaptive uber info galaxies (bigger, better, faster, and more adept than agile big data clouds)

fredericofonseca · Jun 25, 2015

not absolutely clear on how you determine what a common pattern is so can you please clarify.
is it when either start or end of string match?
if not would the following 2 lines issue a match on "456"

2|2456/123|A345
4|12456 HELP-123|YCHB

and what about
1|456/123|A345
2|456 HELP-123|YCHB
3|45612/123|A345
4|45612 HELP-123|YCHB

should the 4 above be considered a single match on 456 or should they be 2 matches (id1 and 2)- one for 456 and the other for 45612 (id 3 and 4)

Regards

Frederico Fonseca
SysSoft Integrated Ltd

http://www.syssoft-int.com

FAQ219-2884
FAQ181-2886

tcerv79 · Jun 25, 2015

Agree that SQL is not a great vehicle for this, but it is where all the data resides and is currently my only avenue to go down.

In regards to the other question from Frederico - in the data set you present - 456 would have a count of 4, THEN 4561 would have a count of 2, 45612 would have a count of 2, etc -- which is why for the ones HAVING a count of < 5 I would probably ignore.

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Traverse all records - return REGEX/VALUES pattern and count

tcerv79

Programmer

johnherman

MIS

fredericofonseca

IS-IT--Management

tcerv79

Programmer

Similar threads

Part and Inventory Search

Sponsor