What Drives DBA Decision Making

A database administrator is responsible for managing database growth, data availability and security. Security* aside, you are concerned with how large a database will grow, how fast it will grow and how the consumption of its data will be affected. At the highest level this is what drives your decision-making. This, along with the company’s dependency on the database, is what you use to estimate risk and to design strategies that avoid unplanned outages and data loss and that minimize downtime.

As a DBA you are going to encounter two general types of databases: the ones you personally build and the ones you inherit. In both cases your job is to design strategies that prevent anything from negatively affecting data availability. When you get to build your own database you have an opportunity to proactively address the stability of the system; however, you will also likely have to work within budget constraints that force you to make design tradeoffs. When you inherit someone else’s database you need the investigative skills and the understanding of the architecture to find vulnerabilities in what is likely an unfamiliar environment.

We can look at system vulnerabilities in two ways: points of failure and potential bottlenecks. All bottlenecks have the potential to become points of failure, but not all points of failure are bottlenecks. A drive failure in a RAID 0 array will take a database down, but obviously not because RAID 0 arrays are bottlenecks. A poorly written SELECT statement issued against a large, high-volume table can create a bottleneck that becomes a point of failure: a blocking chain slows the database to a crawl and likely forces a restart. Both cases result in system outages and possibly data loss.
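As a rough illustration of the second case, consider a query whose predicate wraps the filtered column in a function. The table and column names below are hypothetical; the point is that a non-SARGable predicate forces a scan of a large, busy table, which holds locks far longer than a targeted seek would.

-- Hypothetical table and columns, for illustration only.
-- The function on order_date prevents an index seek, so the whole table is scanned
-- and locks are held long enough to block concurrent writers.
SELECT order_id, order_total
FROM dbo.orders
WHERE YEAR(order_date) = 2010;

-- A SARGable rewrite of the same filter lets the optimizer seek on an index
-- over order_date and touch far fewer rows.
SELECT order_id, order_total
FROM dbo.orders
WHERE order_date >= '20100101' AND order_date < '20110101';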

While points of failure are usually addressed with redundant components, designing strategies to prevent bottlenecks takes a little more planning and starts with understanding how the system is used. Different database systems are susceptible to different bottlenecks based on their usage and design. For example, a data warehouse is susceptible to certain bottlenecks that are not an issue in an OLTP environment, and vice versa. A data warehouse with denormalized data is optimized for query performance, while a highly normalized OLTP database is optimized for transactional speed and data integrity. A typical bottleneck in the data warehouse is high processor utilization from complex queries. A typical bottleneck in an OLTP database is high disk utilization from non-sequential reads and writes driven by the high volume of transactions.

Generally speaking, bottlenecks fall into two different but related categories: hardware bottlenecks and bottlenecks internal to the database. Hardware bottlenecks tend to occur in four main areas: the disk subsystem, the processor, memory, and the disk I/O bus. Bottlenecks internal to the database tend to come from poorly written queries and stored procedures, read/write conflicts (locking and deadlocks) and improper use of indexes on large tables. Every internal bottleneck is made worse by a hardware bottleneck.
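If the database happens to be SQL Server, one quick way to see which of these areas is under the most pressure is to look at the server’s accumulated wait statistics. This is only a minimal sketch; interpreting individual wait types still takes context about the workload.

-- Top waits since the last service restart. High PAGEIOLATCH_* waits point at the
-- disk subsystem, while high signal wait time relative to total wait time points at CPU pressure.
SELECT TOP 10
   wait_type,
   waiting_tasks_count,
   wait_time_ms,
   signal_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;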

There are several strategies that can be used to avoid bottlenecks and to minimize downtime and data loss. In the next post I’ll discuss how some of these strategies can be applied.

*I’ll talk about security in a separate post. For this one I wanted to zero in on non-security-related points of failure and bottlenecks.

Does this really measure efficient workforces?

Today I ran across an infographic from GOOD, reposted by the Business Pundit, that attempts to show the efficiency of a country’s workforce using the number of hours worked compared to the country’s Gross Domestic Product (GDP) per capita.

Clicking the image below will take you to the full infographic.

The diameter of each gear in the infographic represents the amount of value added to a given country’s GDP by the average individual citizen on an hourly basis. In other words, how much you contribute to your nation’s economic output for each hour worked. The average American worker put in 1,792 hours of work in 2008, and their per capita GDP was $47,186. When you divide the per capita GDP by the hours worked, Americans contributed $26.33 to the nation’s GDP on an hourly basis. Compare that to an hourly contribution of $54.48 by the average worker from Luxembourg, who put in 1,555 hours against a per capita GDP of $84,713. The average citizen of Luxembourg contributes slightly more than twice what the average American contributes to their respective GDP. The Business Pundit and GOOD say this is a measure of workforce efficiency, and I don’t quite see it that way.

As soon as I saw how this infographic was described I knew something wasn’t right, and it’s taken me most of the day to figure out why. Couple my suspicion with GOOD giving it the additional title of “The Value of an Hour of Work”, and then commenting on the stark contrast between the hourly effort given by the average US and Luxembourg citizens, and I was confused. On the surface it felt like they were somehow trying to compare Starbucks to the Boston Consulting Group using only average employee salary.

If I were instead to give this infographic a label it would read: “The Contribution of an Hour’s Work to a Nation’s GDP”. To me it better represents the efficiency of a nation’s economy, not its workforce. You could have the most efficient workforce on the planet and still have a low per capita GDP contribution if your country imports more than it exports and has a large population. If a country’s citizens are spending money on goods and services from another country, they’re not contributing to their nation’s GDP, which lowers the per capita contribution. The cost of imports, which subtracts from a country’s GDP, is independent of workforce efficiency other than possibly helping create the market where the workforce attempts to operate. This is why I think the infographic is misleading.

I’m not an economist, and I might be treading into unfamiliar waters, but it looks as if the average Luxembourg citizen has such a high contribution to their nation’s GDP because Luxembourg is the world’s second-largest investment fund center after the United States, with a population of only about 600,000. More money is simply coming into that country than going out, and with such a small denominator of citizens it’s easy to see why the per capita GDP contribution is so great. However, this is no reflection of workforce efficiency. It instead reflects the high efficiency of the Luxembourg economy.

What does this all really mean anyway? Luxembourg has an economy that’s more than twice as efficient as the US on a per capita basis, but when you simply look at total GDP another picture emerges. For 2008 the United States’ gross domestic product was $14.5 trillion for a nation of over 300 million people. The next highest GDP was Japan at $4.9 trillion for a population of 127 million. Luxembourg comes in with a total GDP of $53.7 billion for a population of 600,000, more than 270 times smaller than the US GDP. The infographic is mislabeled and misleading. It doesn’t have anything to do with workforce efficiency; it represents the economic efficiency of a nation as a whole, expressed on a per capita basis. Without a broader perspective it’s kind of like using only miles per gallon to describe an automobile.

Using SQL to Parse Text with Two Delimiters

In many databases that drive application behavior there exists a table made up of name/value pairs that’s used to store configuration data. This is maintenance data that doesn’t typically change, but it changes often enough that it shouldn’t be stored as a data structure in compiled code. Over time a table like this begins to take on a life of its own.

Maybe it starts out, for example, containing name/value pairs for specific web service URLs, but as time progresses we start to see some stranger things in the value column.  Perhaps after a year or two there are comma separated lists of phone numbers for a single key, and eventually, as things deviate in the rush to deliver new features, you begin to see multi-delimited text strings mapped to only one key.  For example, in a rush to add home page A/B testing a developer has mapped “A=40,B=30,C=20,D=10”  to a single key in the configuration table instead of building a more stable data model to support and track A/B test changes for reporting purposes.
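To make that concrete, here is a minimal sketch of what such a table and that A/B test row might look like. The table and column names are assumptions for illustration only; they aren’t from any particular system.

-- Hypothetical configuration table of name/value pairs.
CREATE TABLE dbo.app_config (
   config_key   varchar(50) NOT NULL PRIMARY KEY,
   config_value varchar(50) NOT NULL
);

-- A sensible early entry, followed by the multi-delimited A/B test settings
-- that were crammed into a single key later on.
INSERT dbo.app_config (config_key, config_value)
VALUES ('payment_service_url', 'https://example.com/payments'),
       ('homepage_ab_test',    'A=40,B=30,C=20,D=10');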

Interestingly enough, and to the delight of the business, A/B testing has become wildly successful. They have since added more tests to the config table, and now they want an application that will let them manage the test settings themselves without having to find someone to write SQL for database changes, and they want it now.

The problem is that the A/B settings are stored in a configuration table and need to be keyed a specific way or the application will throw an exception when it attempts to read them, and you don’t have confidence that the business owner is impervious to the fat-finger. In working out the details of the application you don’t have time to design a new data model for A/B test settings, perhaps using XML, but you want to code something that can be reused by other applications so you don’t have to repeat yourself and can somewhat limit your technical debt. You decide to code a SQL function that will parse multi-delimited text and present it in a name/value pair table format.

I know this is a hypothetical, and somewhat tongue-in-cheek, example of why one might create something that can parse multi-delimited text, but in an unorganized development environment executing off an unorganized list of business needs from multiple owners who don’t talk with one another, this kind of stuff can happen. Perhaps a more practical example would be parsing query string parameters so they can be stored in the database in a format that can be queried. Regardless of the example, below I walk you through how to parse multi-delimited text using SQL functions, and give you a link to download the code.

Before we dive into the code, you need to know the basic framework.  This solution involves three functions:

  • parse_text_before_delimited_value – This parses all the text before a specified delimiter.  For example, if you had a string that looked like this: A=40,B=30,C=20,D=10 and the specified delimiter was a comma, this function would return: A=40.
  • parse_text_after_delimited_value – This parses all the text after a specified delimiter.  For example, if you had a string that looked like this: A=40,B=30,C=20,D=10 and the specified delimiter was a comma, this function would return: B=30,C=20,D=10.
  • parse_multidelimited_values_to_table – This function uses both of the functions above, along with a common table expression (CTE) and recursion, to return a two-column table of the names and values of our hypothetical test settings.

Function 1: parse_text_before_delimited_value

CREATE FUNCTION parse_text_before_delimited_value (	
   @text_to_parse varchar(50),	
   @delimiter char(1)
)
RETURNS varchar(50)
AS
BEGIN	
   DECLARE @local_parsedText varchar(50)	
   DECLARE @local_delimiterPosition INT		
   SET @local_delimiterPosition = PATINDEX('%' + @delimiter + '%', @text_to_parse)

   IF @local_delimiterPosition != 0		
      SELECT @local_parsedText = SUBSTRING(@text_to_parse , 1, @local_delimiterPosition-1)	
   ELSE		
      SET @local_parsedText = ''	
   RETURN @local_parsedText
END
GO

This function uses PATINDEX to find the first occurrence of the specified delimiter (@delimiter), and SUBSTRING to return the text in front of the delimiter. If the delimiter isn’t found, an empty string is returned.
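A quick sanity check, assuming the function was created in the dbo schema:

SELECT dbo.parse_text_before_delimited_value('A=40,B=30,C=20,D=10', ',');
-- returns: A=40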

Function 2: parse_text_after_delimited_value

CREATE FUNCTION parse_text_after_delimited_value (	
   @text_to_parse varchar(50),	
   @delimiter char(1)
)
RETURNS varchar(50)
AS
BEGIN	
   DECLARE @local_parsedText varchar(50)	
   DECLARE @local_delimiterPosition INT		
   SET @local_delimiterPosition = PATINDEX('%' + @delimiter + '%', @text_to_parse)		

   IF @local_delimiterPosition != 0		
      SELECT @local_parsedText = SUBSTRING(@text_to_parse , @local_delimiterPosition+1, LEN(@text_to_parse)-@local_delimiterPosition)	
   ELSE		
      SET @local_parsedText = ''			
   RETURN @local_parsedText
END
GO

Like the first function, this uses PATINDEX to find the first occurrence of the specified delimiter (@delimiter) and SUBSTRING to do the work, but the starting position is @local_delimiterPosition + 1, so everything after the delimiter is returned.
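And the matching sanity check, again assuming the dbo schema:

SELECT dbo.parse_text_after_delimited_value('A=40,B=30,C=20,D=10', ',');
-- returns: B=30,C=20,D=10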

Function 3: parse_multidelimited_values_to_table

This is where the magic happens, so to speak. The pattern used here is fairly straightforward. Given a string to parse that looks like this: A=40,B=30,C=20,D=10, the function sets up a CTE with five columns to store a row number, the text to parse, the remaining text to parse, the parsed key and the parsed value. Using the string above, the function first creates this in the CTE:

Row   text                  remaining text    parsed key   parsed value
1     A=40,B=30,C=20,D=10   B=30,C=20,D=10    A            40

With this first row in place, a second SELECT statement acts on the data in the table to continue parsing the name/value pairs, and the UNION ALL makes the expression recursive, filling out the remaining rows.

CREATE FUNCTION dbo.parse_multidelimited_values_to_table(	
   @multidelimited_text varchar(50),	
   @delimiter1 CHAR(1),	
   @delimiter2 CHAR(1)
)
RETURNS @outputTable TABLE (    
   parsed_key varchar(2) NOT NULL,    
   parsed_value int NOT NULL
)
AS
BEGIN	
   DECLARE @local_row_num INT;	
   SET @local_row_num = 0;

    WITH parse_multidelimited_table (row_num, multidelimited_text, remaining_text, parsed_key, parsed_value)	
   AS (		
      SELECT -- populates first row			
          @local_row_num + 1,			
          @multidelimited_text,			
          dbo.parse_text_after_delimited_value(@multidelimited_text, @delimiter2),			
          dbo.parse_text_before_delimited_value(dbo.parse_text_before_delimited_value(@multidelimited_text, @delimiter2), @delimiter1),			
          dbo.parse_text_after_delimited_value(dbo.parse_text_before_delimited_value(@multidelimited_text, @delimiter2), @delimiter1)		
      UNION ALL		
      SELECT -- recursive member: parses the previous row's remaining_text to populate the next row
         a.row_num + 1,			
         a.multidelimited_text,			
         dbo.parse_text_after_delimited_value(a.remaining_text, @delimiter2),			
        CASE WHEN LEN(dbo.parse_text_after_delimited_value(a.remaining_text, @delimiter2)) > 0 THEN				
             dbo.parse_text_before_delimited_value(dbo.parse_text_before_delimited_value(a.remaining_text, @delimiter2), @delimiter1)			
        ELSE
             dbo.parse_text_before_delimited_value(a.remaining_text, @delimiter1)			
        END,			
        CASE WHEN LEN(dbo.parse_text_after_delimited_value(a.remaining_text, @delimiter2)) > 0 THEN				
              dbo.parse_text_after_delimited_value(dbo.parse_text_before_delimited_value(a.remaining_text, @delimiter2), @delimiter1)			
         ELSE				
              dbo.parse_text_after_delimited_value(a.remaining_text, @delimiter1)			
         END		
      FROM			
          parse_multidelimited_table a		
      WHERE					
          LEN(a.remaining_text) > 0		
)	   
   INSERT @outputTable	   
   SELECT parsed_key, parsed_value FROM parse_multidelimited_table 	  
    ORDER BY row_num;	   

   RETURN
END
GO

Basically, the first SELECT statement seeds the CTE named parse_multidelimited_table with the data for the following SELECT statement to act on in a recursive way. The main difference is that in the first SELECT statement the parameter @multidelimited_text is what’s parsed, while in the second SELECT the column named remaining_text, from the CTE parse_multidelimited_table (aliased ‘a’), is what’s parsed. When the two SELECT statements are combined with a UNION ALL you have a recursive expression, which terminates when the string length of the remaining_text column reaches zero.

The full CTE looks something like this:

Row   text                  remaining text    parsed key   parsed value
1     A=40,B=30,C=20,D=10   B=30,C=20,D=10    A            40
2     A=40,B=30,C=20,D=10   C=20,D=10         B            30
3     A=40,B=30,C=20,D=10   D=10              C            20
4     A=40,B=30,C=20,D=10   (empty)           D            10

And the real two-column output that is presented to the calling app looks like this:

parsed_key   parsed_value
A            40
B            30
C            20
D            10
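For reference, that output comes from a call like the one below, assuming the functions were created in the dbo schema:

SELECT parsed_key, parsed_value
FROM dbo.parse_multidelimited_values_to_table('A=40,B=30,C=20,D=10', '=', ',');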

That’s pretty much it. I know this isn’t the best experience, but you can download the complete SQL script, including some test code, from MediaFire by following this link:
http://www.mediafire.com/file/mcsb96bwfa5ky7w/multi_delimeted_parsingv2.sql

– t

The Cruel Irony of ASAP

The minute you are asked to complete something ASAP is the exact moment when quality is no longer a concern. You begin making sacrifices to complete your work. In terms of software development, you can’t hold a developer accountable for high quality and craftsmanship when they are forced to complete something as soon as possible.

Allowing the developer to work within the time estimates they give is part of the business’s commitment to quality. It’s as important as driving out the business requirements and understanding ROI. If speed to market is a crucial part of the project, iterative development should be adopted. The business and development teams should work together to flesh out a prioritized list of discrete minimal marketable features that can be developed quickly, with high quality, and delivered to production independently.

Another way of looking at the problems caused by an ASAP mentality is by the risk it creates. Quality and ASAP are mutually exclusive terms. If you’re trying to complete something as fast as possible, you greatly increase the risk of downstream surprises. Nothing derails progress as fast as production bugs. When this happens, not only are you forced to fix what’s broken, but that effort can also bring new development to a halt as the fix works its way through the development cycle back into production. If it takes a day to fix a bug, that’s 8 to 10 hours of lost productivity as the problem is investigated, corrected, tested and released. If the resources working to fix the bug earn $100,000 annually, it’s costing you a conservative $400-$500 in lost productivity. That’s money not being spent to move the business forward. Poor quality is measured in dollars by the amount of time you’re not working on projects that move the business forward.

To be clear, I’m not making a case against speed to market. I’m making a case against sloppiness. Speed to market is a difference maker, but you cannot make a difference if what goes to production has bugs. Trust is built during the effort to optimize the development cycle. It doesn’t happen overnight, but eventually the business should derive satisfaction from knowing that if a project or task has an estimated cycle time of 2 weeks, you’re going to see results in 2 weeks, and the development team should derive satisfaction knowing that the business is taking their word on estimates.

For the record, I’ve personally watched the business demand projects be done ASAP, and every single time there was a problem with the production release. In each case the irony was that the business expected fast results but didn’t get them, and the finger was always pointed at IT for the lack of quality.