Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

ETL Process for Clickstream Analysis

Status
Not open for further replies.

Nuffsaid

Technical User
Aug 3, 2001
374
CA
Hi All,

My company is very interested in uncovering consumer web behavior that is
buried within our Apache web logs. Currently we use a product called "Sawmill"
which is resource intensive and produces limited web analytics.

We would prefer to extract this information and populate fact and dimension
tables in our data warehouse as outlined by Ralph Kimball in his book
"The Data Warehouse Toolkit" Second Edition, Chapter 14, Electronic Commerce.

Although the book outlines the final schema(s), it doesn't mention anything about
the ETL process. During our research I came across this article where Kimball
mentions a sophisticated backend ETL process. See;


So, I guess what I'm after here is, what are other people doing? How do you extract, transform and load.
What tools do you make use of? Are they any good? Would you recommend them to others?

I look forward to your feedback.


Nuffsaid.
 
Do you want to implement the ETL as a streaming ETL proces (where every modification in the source system is processed on its own) or do you want to use near-real-time ETL, with batchprocessing every few (say 15) minutes?

This will depend on what you want to do with your datawarehouse. If you want to analyse your customer on-the-fly (e.q. scoring a risk-profile), you will need streaming ETL, if you want to do analysis in the back-office, then batchwise ETL will suffice.

The tools I use (for batchwise ETL) is PL/SQL on a Oracle 10g database.
The first thing I do is to limit the number of records I need to process as soon as possible (get rid of all useless data). After that I do the transformations (conformation, cleaning, etc).
In a clickstream datawarehouse, I would probably use the webserver logfiles as a datasource. You can load these using a batchloader, and strip the lines you don't need during the load.
 
Hey Hans63,

Thanks for the reply.

We'll more than likely do batch processing. Ultimately we would want to tie a session to a purchase transaction. At present this isn't possible with our current OLTP system. All we know is that a consumer made a purchase. We don't know where he came from or how he got here. Our developers are working on this now. The metrics we're after are how many are just browsing and move on, how many search our site and actually buy something and, if they buy, are they likely to buy something else? Typical consumer behaviour.

It's like we're stuck in a "gap". The web logs can track the users movement but don't record transactions. The transactional system records purchases but has no info on how the consumer got here. Trying to tie this altogether should prove interesting.

Nuffsaid.
 
Hi Nuffsaid

So all you can do at this moment is analyse the visitor behavior on the website. I know that the SAS institute had software for analysing this behavior already in 2000. So perhaps you can search their site or forums for clues on how they tackle this.

Hans
 
Well, the powers that be have decided to go for an out of the box analytic solution, "plug and pray". (not a type o)

So much for "One Source Of The Truth".

Frustrating......[banghead]





Nuffsaid.
 
Yeah, this can be very frustrating (just like politics), but we have to learn to cope with this frustration.

Good luck.
 
The key to implementing a successful box solution is to do your homework on what the offered solution does and doesn't do, along with the ease of customizing the solution. As the Senior developer and Architect for a canned BI solution for a specific market I now see the issue from both sides.

The right out of the box solution will/should have a number of knowledge experts, these people should know the sector more than those who want the analysis. I know we have developed items that our customers didn't think possible, or ever thought about being able to do with their data. A well built application should have the ability to answer the items people needs answering along with defining new ways to look at the data.

The other aspect is the client. Since decisions are typically made by those with C and V in their titles, and not those who actually use and support the system there typically seems to be some resistance to the solution. Reasons being are resistance to change and job security. One purpose to a BI system is to quickly give answers to the data being requested, sometimes the numbers in a report are compiled from numerous other reports, all of which takes time. The person sees a sytem brought in that can do in a few minutes what used to take them hours and now they feel threatened and their goal now is to prove the inaccuracy of the new system. On the IT side there can be the sense of "Me and my team can build a better solution because we know our business better than anyone else". The answer to this is possibly but highly unlikely. As I stated earlier a good out of the box solution will be supported by people who are experts in the needed areas, BI, Data warehousing, Report development, and industry knowledge. While they may not the specifics of your company they do know and understand the industry, which should result in a solution that allows you to view our data the same as others in your industry do/would/should.

You will have a much more successful implementation if you work towards making this solution a success and make sure that you are infact getting the best possible solutions available.

Shoot Me! Shoot Me NOW!!!
- Daffy Duck
 
In case your "out of the box" solution doesn't do everything you need, you might want to consider writing your ETL in Perl. Perl has excellent support for pattern matching. Our clickstream data mart uses Perl to extract data out of the weblogs and load in to an Oracle 10g database.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top