Q&A with Erik Veerman
Data Warehousing Friends and Enemies
The biggest enemy of the data warehousing project is thinking you know what someone is asking. Wide tables aren’t so friendly either.
But you can count on proper planning, a thorough up-front design, and an effective dimensional model, says SQL Server MVP Erik Veerman in this preview of his Nov. 6 PASS Summit Unite 2009 post-con, “Jump-Start to Data Warehouse Data Modeling and Architecture.”
Q: Do you remember your first data warehousing project? What did you learn from it?
A: The best way to learn is trial by fire! I remember that my first data warehouse project involved the complexities of allocating measures in special cases and the assignment of sales amounts to different levels of a dimension hierarchy. I learned that you have to be thorough up front in your design and that it is much easier to get it right the first time than trying to change something after the fact.
Q: If you could share one piece of advice with someone new to data warehousing, what would it be?
A: Easy—avoid wide tables (lots of columns) that mix a lot of text and numeric data type columns together. I see this so often at my clients, where someone has a “data warehouse,” but all it really is is a bunch of inefficiently designed flattened tables. The right answer is a dimensional model, including dimension and fact tables. It’s tremendously more efficient for both storage volume and query performance.
Q: Are DBAs the ones typically responsible for the DW projects? Is it a natural extension in which they can take a lot of what they know about the relational world and the tools they use every day and make the transition pretty easily?
A: Many DBAs are also data architects. DBAs who understand data normalization and entity design have a strong foundation for data warehouse design. But what they should be careful not to do is apply standard transactional normalization techniques to a data warehouse design. Instead, they need to learn dimensional normalization techniques, which are very different. On the administration side, DBAs can apply their wealth of knowledge in monitoring and tuning a data warehouse; they just need to realize that DW queries perform more data scans than a normal transactional query.
Q: What’s your definition of a successful data warehouse?
A: Data warehousing is about results—ROI. The warehouse has to allow business users to get answers both for the questions they are asking now and the questions they will ask after they get their first round of answers. A DW designer has to have the skills of a business analyst and be able to predict some of the questions that business users will ask in the future. A key output is how the data is presented to business users. It must be in the terminology they use, and it must be clearly presented based on the kind of users you are serving, from execs to the operational folks.
Q: Your seminar stresses planning, proper design, and laying a firm foundation for your data warehouse. Is lack of time to do these up-front steps one of the biggest enemies of an effective data warehouse—the philosophy that we can fix those things later, but let’s just get something done now?
A: I am actually a fan of getting output to the business as quick as possible, but not thrown together. What does that mean? The sooner you can show business users some information (even if it is loosely organized), the better because the biggest enemy of the data warehouse project is thinking you know what someone is asking. You need to verify and do that often. The risk in this is falling into the messy-code and inconsistency trap. You can offset this risk by using iterations that are clearly defined with goals and that target focused aspects of the solution. But you also need to be diligent about your code. Every proof of concept turns into production code. Therefore, put code best practices together up front.
Q: Organizations use data warehouses to offload historical data from production systems for reporting purposes and to implement business intelligence, which can give businesses the information they need to reduce costs, increase revenue, and in general work smarter. Do data warehouses pay for themselves in most cases?
A: A good data warehouse design will pay for itself usually in the first or second year, if not sooner. But the solution has to be coordinated with the rest of the corporate data nervous system. Most companies have problems with data dependencies because they have dozens of systems that rely on dozens of systems, and it is really a mess. Data changes in source systems break the chain and cost a lot of money to maintain, fix, and extend—and that takes away from ROI. Your data warehouse should strive to be as close as it can to your data sources to help alleviate the mess. I like working on data warehouses that are at the business core and either drive revenue up or drive costs down. “Wow” is the response I like to hear.
Q: Do you have any tips for planning your DW infrastructure and architecture? Do Microsoft’s Fast Tracks architectures provide benefit here for most organizations?
A: I have a lot of DW infrastructure and architecture recommendations…but you’ll have to attend my PASS post-conference seminar to get them all. :) Microsoft’s Fast Track reference configurations are a great start because they save you a lot of planning and testing time and take the guess work out of the hardware—especially how to best configure the drives and SAN.
Q: Do you have any recommendations for organizations that are considering implementing a DW but are paralyzed by the number of options and can’t get beyond the research phase? It seems like companies have a lot of choices from the infrastructure/architecture elements, to design considerations, to the different types of columns, attributes, hierarchies, hierarchy types, and change tracking they can use.
A: Start with a deep breath! Then tell the business analysts that they can’t have everything at once. Start small, and then extend—and use shorter project iterations to manage the project. Get a subscription to Intelligent Enterprise to start learning the lingo, and read Ralph Kimball’s Data Warehouse Lifecycle Toolkit. Honestly, if a company is still paralyzed, it really pays to bring someone in to help even for a short period of time to get the solution going. A good project team requires thinkers and doers. If you just have thinkers, nothing will get done. If you just have doers, then the solution will likely miss the mark.
Q: Do you have any interesting facts about the growth in the volume of data that companies are collecting these days and the importance of scalability in DW?
A: In my own experience, a multi-terabyte DW solution is commonplace these days. And it’s normal to see fact tables with over a billion rows. Scalability starts with the right design—the tables, the partitioning, the indexing, the I/O, the queries, and so on. The answer to the scalability question is, “It’s the design, dummy.” BTW, I’m not calling you dummy—just stating the obvious but often overlooked answer.
Q: What would you like to see the attendees of your seminar doing back at their workplaces the week after PASS Summit?
A: The sooner they can put into practice what they learn, the better. I would like to see them design—or redesign!—their dimension models and see the results by running queries and trying to answer the business questions rather than designing the solution in a box.
Hear more from Erik and other SQL Server and BI experts at PASS Summit Unite 2009 in Seattle, WA. See the lineup of pre/post-conference seminars, review the full conference agenda, and register today.
Erik Veerman is a SQL Server MVP and a Mentor with Solid Quality Mentors, focusing on training, mentoring, and architecting solutions on the SQL Server BI platform. He led the ETL architecture and design for the first production implementation of SQL Server Integration Services and helped drive the ETL standards and best practices for SSIS on Microsoft's SQL Server 2005 reference initiative, Project REAL. Erik is the lead author for the Business Intelligence Training Kits (MS Press 70-445, 70-448) and a co-author of the WROX Integration Services titles (Professional SSIS & Expert SSIS). An expert in OLAP design, ETL processing, and dimensional modeling, Erik frequently speaks at industry events, including PASS Summit and his local Atlanta PASS chapter.