Now a SQL Master

After a long and painful process, I have finally achieved this.  Was it worth it? Would I do it again? Absolutely.  But please don’t call me Master Bob; Microsoft Bob is fine…

I am doing quite a bit more work with my simulation database this weekend, finally creating some reports that show some startling correlations in equity trading strategies, made possible by the throughput SQL delivers in these simulations.  We’re talking about 10 million rows queried, correlated, and ranked through 15,000 queries in a 4-hour period, covering over 1,000 permutations (including continuous adaptation) that average 2,000 rows per simulation, all on modest hardware.  The techniques rely heavily on optimal indexing strategies and large queries that transform thousands of rows at a time using CTE-style queries.  This approach has proven to perform far better than the iterative, row-by-row looping typical of a step-by-step programmatic solution, as opposed to declarative, set-based operations.  Stay tuned…
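To make the set-based idea concrete, here is a minimal sketch (with hypothetical table and column names, not my actual schema) of the kind of CTE-plus-window-function query that transforms and ranks thousands of rows in a single statement instead of looping row by row:

-- Rank every symbol's daily return within its trading day in one pass
;WITH DailyReturns AS
(
    SELECT  TradingSymbol,
            MarketDate,
            (PriceAtClose - PriceAtOpen) / NULLIF(PriceAtOpen, 0) AS DayReturn
    FROM    dbo.EquityHistory
    WHERE   MarketDate >= '2009-01-01'
)
SELECT  TradingSymbol,
        MarketDate,
        DayReturn,
        RANK() OVER (PARTITION BY MarketDate ORDER BY DayReturn DESC) AS DayRank
FROM    DailyReturns;

One statement like this can feed a ranking or correlation step for every symbol and date at once, which is where the throughput comes from.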

Here’s my SQL Master intro http://blogs.technet.com/themasterblog/


Trying to catch up

I apologize, as I notice there are some inquiries related to posted code.  I should probably set expectations about this, as I am not the typical Microsoft employee with an MSDN blog.  Most of the other folks with MSDN blogs are on the product teams, and blogging is part of their job.  With me, it is an extra activity; my real job is being committed 24×7 to a high-profile customer, which takes priority over all my other professional activities.  I provide this blog as a way to share with the community, but there is no commitment implied to fix any issues that people may find in the code or to respond quickly to questions.

Depending on the customer situation, I may have time to post frequently, as I did in July/August when I posted about a dozen articles; at other times, like this last month, customer issues take priority and I may not post at all.

I apologize for not being able to provide quicker turn-around on inquiries and will try to catch up with the questions/requests over the next week or two.

Thanks for your patience.


Nested Identity Keys in SQL?

This weekend I played with partitioning.  I have a table with about 10 million rows spanning the last 6 years of stock history for AMEX, NYSE, NASDAQ, and the indexes.  I partitioned each year into a separate filegroup, which should make it much easier to purge or archive the old years and should speed up queries on date ranges that involve multiple tables by storage-aligning the indexes.  I aligned a related table used for storing calculations and metrics, along with an indexed view.  Although a few queries actually ran slower, the most critical ones ran faster with the partitioning, since I typically retrieve data grouped together in a fairly tight range.

It seems that in order to really get storage alignment to work well with partitioned tables, you may want to move away from simple surrogate primary keys to compound primary keys, where one segment is the partitioning key and the other segment is the unique key.  One of the requirements for partitioning a table is that the partitioning column be part of the clustered key, so unless you are partitioning on what is already the primary clustered key, you’re looking at either no longer using the primary key as the clustering key or creating a compound key.  And until recently, the advice has typically been to use simple surrogate primary clustered keys for best performance, which means you would not typically want to partition on an existing surrogate key when moving to partitioning: by definition a surrogate key has no business significance, and is thus typically not a relevant partitioning parameter.

Therefore partitioning introduces some new dynamics in the way one designs primary keys.
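To make that concrete, here is a minimal sketch (hypothetical table, index, and filegroup names; a single filegroup for brevity rather than one per year) of partitioning by year with a compound clustered primary key whose leading segment is the partitioning column:

CREATE PARTITION FUNCTION pfMarketYear (smallint)
    AS RANGE RIGHT FOR VALUES (2004, 2005, 2006, 2007, 2008, 2009);

CREATE PARTITION SCHEME psMarketYear
    AS PARTITION pfMarketYear ALL TO ([PRIMARY]);

CREATE TABLE dbo.EquityHistoryExample
(
    MarketYear       smallint    NOT NULL,   -- partitioning key
    EquityHistoryId  int         NOT NULL,   -- only needs to be unique within a year
    TradingSymbol    varchar(10) NOT NULL,
    MarketDate       date        NOT NULL,
    CONSTRAINT PK_EquityHistoryExample
        PRIMARY KEY CLUSTERED (MarketYear, EquityHistoryId)
) ON psMarketYear (MarketYear);

Secondary indexes created ON psMarketYear (MarketYear) are then storage-aligned with the table.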

What would be helpful would be the ability to create an identity column grouped on another column (a nested identity). Say I have a partitioning key of year; then I could have my order ID reset for each year. The advantage is that the overall index space can be reduced, since the numbers only need to be unique within the range for each partitioning key value, allowing a smaller data type to be used for the nested identity key.

For example, once I have more than about 2.1 billion rows (2^31, the non-negative range of an int; roughly 4.2 billion if the negative range were used as well, which is unlikely), I would have to move to a bigint surrogate primary key.  If I instead create 20 yearly partitions, I can go to roughly 40 billion rows using a simple int surrogate plus a tinyint partitioning key, assuming non-negative identity values.  This means that with the partitioning scheme I use only 5 bytes per unique index value rather than 8, and because a tinyint allows up to 256 partition key values, I can accommodate up to 255 times more data than with the simple 4-byte key alone.

The other scenario is where you commonly need to keep items in a static order within a parent, e.g. a bill-of-materials assembly where the line number represents the sequence in which the assembly is to be constructed.  Being able to automatically generate sequence numbers guaranteed to be unique within the context of a parent assembly number would be useful.

This could be done with a trigger, of course, but that’s not very efficient and presents more challenges in bulk-loading scenarios.

What I’m thinking of is some syntax along the lines of

CREATE TABLE Table-name
(
    Grouping-Key datatype,
    Grouped-Identity-Key datatype IDENTITY [(start, increment)] GROUP BY Grouping-Key
)

 

And then you just partition on the Grouping-key for the primary key on a partitioning scheme as in:

ALTER TABLE Table-name
ADD CONSTRAINT [Primary-key-Constraint-Name] PRIMARY KEY CLUSTERED
(
    [Grouping-Key],
    [Grouped-Identity-Key]
)
ON [Partition-Scheme](Grouping-Key)

And storage-aligning the other indexes can be done by just referencing the grouping-key in the same way.

This would cause the Grouped-Identity-Key to reset back to the initial value each time the Grouping-Key value changes.  The table could then easily be partitioned on the grouping key, which can be something very compact used just for the partitioning, such as the year from an order date.

Thoughts on this? Is there a way to accomplish this that I am missing?
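In the meantime, the closest approximation I know of is to compute the per-group sequence at insert time with ROW_NUMBER().  This is only a sketch with hypothetical table and column names, and unlike a true nested identity it is not enforced by the engine for later ad-hoc inserts:

-- Continue each group's sequence from its current maximum during a bulk load
INSERT INTO dbo.OrderDetail (OrderYear, LineNumber, TradingSymbol)
SELECT  s.OrderYear,
        ISNULL(m.MaxLine, 0)
            + ROW_NUMBER() OVER (PARTITION BY s.OrderYear ORDER BY s.TradingSymbol),
        s.TradingSymbol
FROM    Load.StagedOrders AS s
LEFT JOIN
        (SELECT OrderYear, MAX(LineNumber) AS MaxLine
         FROM   dbo.OrderDetail
         GROUP BY OrderYear) AS m
        ON m.OrderYear = s.OrderYear;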

 


A Wrapper for Running SQL Server 2008 Reporting Services Reports Anonymously

With SQL Server 2008, Microsoft re-architected SQL Server Reporting Services (SSRS) to no longer have an IIS dependency.  It now relies directly on HTTP.SYS and handles web protocols independently.   This is a good thing, in a lot of respects (see http://msdn.microsoft.com/en-us/library/bb630410.aspx).  But, there is also a down-side for those of us who are used to utilizing the IIS framework for hosting SQL Server reports.  For one thing, allowing anonymous access to reports just by configuring an IIS virtual directory mapping is gone.  Also, client certificates are no longer supported, which is a big deal if you’re in an environment where these are used (like I am).

There have already been a few articles written about anonymous access for SSRS 2008, and a couple of techniques to allow Report Manager-type access (see http://blogs.msdn.com/jameswu/archive/2008/07/15/anonymous-access-in-sql-rs-2008.aspx and http://blogs.msdn.com/brianhartman/archive/2008/11/21/custom-credentials-in-the-report-viewer.aspx).  Rather than rehashing them, my purpose in this post is to provide a generic solution for accomplishing anonymous access to your reports, without compromising your server security, and something you can tailor to your environment.  Although my goal here is simply to provide anonymous access, you can use the same technique for other purposes, such as allowing smart-card logon to the hosting web site, since essentially the SSRS authentication limitations are wrapped inside an ASP.NET application that is hosted in IIS.

The basic concept is to take the SSRS report manager and wrap up the core functionality for executing reports inside of an ASP.NET web page that dynamically discovers and presents the reports and report folders in the same manner that the Report Manager does and then allow them to be executed.  It does not provide all the functionality of Report Manager, such as the report management aspects, but you could tailor this to add that.  Obviously for an anonymous-access scenario, most of those management functions don’t make sense, since they are associated with the identity of the user.

I’ve posted the solution at http://www.bobthesis.com/reportsportal/ if you want to see this in action.  The menu user interface is not pretty, but I’m not a web designer; you can easily improve the cosmetics by modifying the master page/styles.  The demo uses the AdventureWorks samples, but as you’ll see as we go through the code, there is nothing specific to AdventureWorks other than an entry in the config file that indicates the name of the AdventureWorks reports folder as the root folder.

So, let’s step through the solution.  There are 7 main components to the solution:

  1. Reporting Services Web Service: The SQL Server Reporting Web Services (ReportingService2005.asmx – that’s right, 2005, it has not been changed for 2008) provides the interface for enumerating the reports and report folders.  It provides a host of other capabilities for report execution, rendering and management, but that is the only capability we need for our solution.
  2. The ReportViewer Control: This is a control that we integrate into our ASP.NET solution.  It provides the user experience for executing the reports including prompting for the parameters and providing the various options.
  3. A custom class for authenticating to the Reporting Web Service.  This is what allows us to wrap the authentication in the ASP.NET page for the Report Viewer control to use to connect to the Reporting Service web service.
  4. An ASP.NET tree-view control for displaying a menu hierarchy for running the reports corresponding to the reporting folder structure defined in SSRS for the particular reporting application.
  5. A master page that includes the tree view control
  6. A default.aspx page that includes code to execute the Reporting Services web service and enumerate the results.
  7. A config file along with entries to define the authentication parameters.

Below is the overall architecture showing the main components.  The dashed curved lines represent data flows, solid lines represent composition, dashed lines represent inheritance, and curved non-dashed lines represent process flow.

[Figure: overall architecture diagram]

Now, let’s walk through the development of the components.  I’m assuming you’re familiar with Visual Studio .NET, so I will not delve into the details of the IDE.

First, we create an ASP.NET application project in Visual Studio .NET and call it ReportsPortal (you could also use a web site project type if you prefer).  At this point, we have our web.config and default.aspx.  Next, add a master page to the project.  Our project structure should now look like this:

[Figure: initial project structure]

Next, add a reference to the ReportingService2005.asmx.  For now, we will use the local machine to create the web service reference.  However, we can modify this later to point to a different server, which is what I do for my demo solution.  My reporting web service is on a different server than my web server.

[Figure: adding the web service reference]

Now, our solution should look as follows:

[Figure: solution with the web service reference added]

Now, let’s set up our master page with placeholders for our menu and our report viewer.  For my demo, I put the navigation interface in the left sidebar and the report viewer on the right.  I use a table control to divide the placeholders and insert the tree navigation control on the left, as shown below:

[Figure: master page layout with the tree view and content placeholder]

Below is the source code for the Master Page after setting a few properties to make the tree view not completely vanilla:


<%@ Master Language="C#" AutoEventWireup="true" 
CodeBehind="ReportsPortal.master.cs" 
Inherits="ReportsPortal.ReportsPortal" %>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" >
<head runat="server">
    <title></title>
    <style type="text/css">
        .style1
        {
            font-family: "Comic Sans MS";
            height: 351px;
        }
        .style2
        {
            font-family: "Lucida Console";
            font-size: large;
            text-align: center;
        }
    </style>
</head>
<body bgcolor="#f9f9f9">
    <form id="form1" runat="server">
    <div class="style2" 
        style="background-color: #6699FF; font-family: 'Microsoft Sans Serif'; 
        font-size: large; color: #800000;">
        Anonymous SQL Server 2008 Reporting Services Wrapper</div>
    <table class="style1">
        <tr valign="top">
            <td bgcolor="#E1FFF7">
                <asp:TreeView ID="TreeView1" runat="server" NodeIndent="10" ShowLines="true"
                    ImageSet="BulletedList4">
                    <HoverNodeStyle BackColor="#339966" />
                    <SelectedNodeStyle BackColor="#666699" />
                </asp:TreeView>
            </td>
            <td>
                <asp:ContentPlaceHolder ID="ContentPlaceHolder1" runat="server">
                </asp:ContentPlaceHolder>
            </td>
        </tr>
    </table>
    </form>
</body>
</html>


Before we can start writing the code that uses the web service to return our report items, and before we can start using the ReportViewer control, let’s put the “plumbing” in place: add entries to web.config, add the ReportViewer control to the form, ensure we have the necessary Reporting Services references, and hook up the authentication interface.

Here are the web.config entries we create, both to define the starting location for the report menu and to hold the authentication settings for the proxy reporting user.  We need to add the applicationSettings section to the config file, which I stick at the bottom.  Of course, you should use an account with minimal rights as your proxy reporting user, and you may want to use the .NET DPAPI encryption capabilities to encrypt your web.config file (see http://www.dotnetcurry.com/ShowArticle.aspx?ID=185&AspxAutoDetectCookieSupport=1 for more info on encrypting .NET config files), although since these are anonymous reports anyway, it isn’t critical as long as the proxy reporting user has very limited rights.  The report user account should be a normal user account with permissions in the SQL database to the stored procedures, views, and tables required to run the various reports, as well as Browser permissions on the report server.  If you don’t have an instance name, remove the “_myinstance” part of the report service URL.  Notice we have to repeat the server URL in 2 different formats: one for the web service reference, including ReportService2005.asmx, and one containing just the root location.  The second is used by the authentication interface and will not work with the full web service URL.

<applicationSettings>
  <ReportsPortal.Properties.Settings>
         <setting name="ReportsPortal_WsReportService_ReportingService2005"
              serializeAs="String">
  <value>http://myServer/ReportServer_myinstance/ReportService2005.asmx</value>
          </setting>
          <setting name="MyReportServerUrl" serializeAs="String">
            <value>http://myServer/ReportServer_myinstance</value>
          </setting>
            <setting name="ReportsRootPath" serializeAs="String">
                <value>/</value>
            </setting>
            <setting name="TopLevelMenuText" serializeAs="String">
                <value>AdventureWorks Reports</value>
            </setting>
            <setting name="MyReportViewerPassword" serializeAs="String">
                <value>myPassword</value>
            </setting>
            <setting name="myReportViewerUser" serializeAs="String">
                <value>myreportUser</value>
            </setting>
            <setting name="MyReportViewerDomain" serializeAs="String">
                <value>mydomain</value>
            </setting>
      </ReportsPortal.Properties.Settings>
  </applicationSettings>


However, in addition to the standard application settings above, we need to create a special configuration setting for the report server connection class in the appSettings section (see xxx).

This goes in the part of the config file higher up, before system.web:

<appSettings>
    <add key="ReportViewerServerConnection" 
         value="MyReportServerConnection, ReportsPortal"/>
</appSettings>


In this case, “MyReportServerConnection” refers to the name of the class you create to implement the IReportServerConnection2 interface, and “ReportsPortal” refers to the name of the assembly for your web application.

I know this is a bit confusing, so I’ve included all of the code in a zip file that you can download at the end of the article.

Note that you can also use the Settings designer to create these entries and then use Properties.Settings.Default to access the values, rather than using the ConfigurationManager approach.

Next, let’s add the ReportViewer control to the default.aspx page.  Set the default.aspx page to use the ReportsPortal master page, so we have one content placeholder to work with, and drag the Microsoft ReportViewer control from the Reporting section of the toolbox onto the form.  Dragging the control onto the form updates our project to include a reference to the Microsoft.ReportViewer.WebForms assembly and updates our web.config with the assembly information.

So now our default.aspx should look like the image below.  We don’t need to set any properties, as this all happens dynamically in the code; however, you will probably want to increase the width and set the zoom mode to improve the user experience.

[Figure: default.aspx with the ReportViewer control]
Below is the source of the default.aspx including some property settings for the report viewer rendering:

<%@ Page Language="C#" AutoEventWireup="true" 
CodeBehind="Default.aspx.cs" Inherits="ReportsPortal._Default" 
MasterPageFile="~/ReportsPortal.Master" %>

<%@ Register assembly="Microsoft.ReportViewer.WebForms, 
Version=9.0.0.0, Culture=neutral, 
PublicKeyToken=b03f5f7f11d50a3a" 
namespace="Microsoft.Reporting.WebForms" 
tagprefix="rsweb" %>

<asp:Content ID="Content1" runat="server" 
    contentplaceholderid="ContentPlaceHolder1">
    <asp:Label ID="ReportLabel" runat="server"></asp:Label>
    <rsweb:ReportViewer ID="ReportViewer1" runat="server" 
    SizeToReportContent="True" Width="800px" 
    ZoomMode="PageWidth" Height="600px">
</rsweb:ReportViewer>
</asp:Content>


Now, let’s take a look at the authentication interface portion.  I use the IReportServerConnection2 interface because it supports stateless report processing and seems the most robust; there are other interfaces that can be used (see http://blogs.msdn.com/brianhartman/archive/2008/11/21/custom-credentials-in-the-report-viewer.aspx).  To get it to compile, you will need to add a reference to the Microsoft.ReportViewer.WebForms assembly.  Here is the code for the MyReportServerConnection.cs class (you can use different names, but remember everything has to stay in sync with the config setting that references the class).

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using Microsoft.Reporting.WebForms;
using System.Net;
using System.Security.Principal;
using System.Configuration;

namespace ReportsPortal
{
    [Serializable]
    public class MyReportServerConnection : IReportServerConnection2
    {
        public Uri ReportServerUrl
        {
            get
            {
                string url = Properties.Settings.Default.MyReportServerUrl;
                if (string.IsNullOrEmpty(url))
                    throw new Exception("Missing url from the Web.config file");
                return new Uri(url);
            }
        }
        public int Timeout
        {
            // set timeout to 60 seconds
            get { return 60000; }
        }

        public IEnumerable<Cookie> Cookies
        {
            // No custom cookies
            get { return null; }
        }
        public IEnumerable<string> Headers
        {
            // No custom headers
            get { return null; }
        }

        public MyReportServerConnection()
        {
        }


        public WindowsIdentity ImpersonationUser
        {
            get { return null; }
        }

        public ICredentials NetworkCredentials
        {
            get
            {
                // return null will force the use of impersonation, 
                // otherwise, remove the return null and 
                // implement the other app settings to specify the credential details
                // return null;
                string userName = Properties.Settings.Default.myReportViewerUser;
                if (string.IsNullOrEmpty(userName))
                    throw new Exception("Missing user name from Web.config file");
                string password = Properties.Settings.Default.MyReportViewerPassword;
                if (string.IsNullOrEmpty(password))
                    throw new Exception("Missing password from Web.config file");
                string domain = Properties.Settings.Default.MyReportViewerDomain;
                if (string.IsNullOrEmpty(domain))
                    throw new Exception("Missing domain from Web.config file");
                return new NetworkCredential(userName, password, domain);
            }
        }

        public bool GetFormsCredentials(
            out Cookie authCookie,
            out string userName,
            out string password,
            out string authority)
        {
            authCookie = null;
            userName = null;
            password = null;
            authority = null;
            return false;
        }
    }
}


The project should now look something like below:

[Figure: project structure with the MyReportServerConnection class added]

Now that the plumbing is done, we can focus on the two remaining pieces of code.  First, let’s write the code that enumerates the Reporting Services folders and returns the lists into the TreeView control.  This is the heart of the application, as it essentially wraps the Report Manager user interface for navigating reports and folders.  We put the enumeration code in the master page code-behind (ReportsPortal.Master.cs in our case).  The Page_Load clears out the menu and repopulates it from the report server by calling BuildNavigation, as long as this is not just a postback; the IsPostBack check helps performance by only doing this when the page is first loaded.  The tree view control will contain all of the folders, with the report items arranged hierarchically.  Each report item maps to a navigation URL that links back to default.aspx, passing in the full report item path.  We can use the same credentials from the config file for invoking the report service as we do for running the reports.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using ReportsPortal.WsReportService;
using System.Security;
using System.Net;


namespace ReportsPortal
{
    public partial class ReportsPortal : System.Web.UI.MasterPage
    {
        protected void Page_Load(object sender, EventArgs e)
        {
            if (!Page.IsPostBack)
            {
                TreeView1.Nodes.Clear();
                TreeView1.Nodes.Add(new TreeNode
                    (Properties.Settings.Default.TopLevelMenuText,
                    Properties.Settings.Default.ReportsRootPath));
                BuildNavigation(TreeView1.Nodes[0]);
            }
        }

        private void BuildNavigation(TreeNode parentNode)
        {

            // Enumerate all of the folders and add the reports
            WsReportService.ReportingService2005 ws =
                new WsReportService.ReportingService2005();
            // Need to pass in credentials for the reporting service.
            NetworkCredential cred = new NetworkCredential(
                Properties.Settings.Default.myReportViewerUser,
                Properties.Settings.Default.MyReportViewerPassword,
                Properties.Settings.Default.MyReportViewerDomain);
            ws.Credentials = cred;

            CatalogItem[] items = ws.ListChildren(parentNode.Value, false);
            foreach (CatalogItem item in items)
            {
                if (item.Name != "Data Sources")
                {
                    switch (item.Type)
                    {
                        case ItemTypeEnum.Report:
                            parentNode.ChildNodes.Add(new TreeNode(
                                item.Name, item.Path, null,
                                "Default.Aspx?Report=" + item.Path, "_top"));
                            break;
                        case ItemTypeEnum.Folder:
                            TreeNode newParentNode = new TreeNode(item.Name, item.Path);
                            parentNode.ChildNodes.Add(newParentNode);
                            BuildNavigation(newParentNode);
                            break;
                        default:
                            break;
                    }
                }
            }
            return;
        }
    }
}


Next, we bring it all together in default.aspx.cs.  The Page_Load evaluates the query string that gets set by the master page navigation when the user selects a report link, instantiates the MyReportServerConnection class, and hands its credentials to the ReportViewer along with the path of the server report that needs to be run.

 

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using Microsoft.Reporting.WebForms;
using System.Net;


namespace ReportsPortal
{
    public partial class _Default : System.Web.UI.Page
    {

        protected void Page_Load(object sender, EventArgs e)
        {

            if ((!IsPostBack) && Request.QueryString.Count > 0)
            {
                string reportPath = Request.QueryString[0];
                this.ReportLabel.Text = Request.QueryString[0];
                this.ReportViewer1.ProcessingMode = ProcessingMode.Remote;
                MyReportServerConnection rsc = new MyReportServerConnection();
                this.ReportViewer1.ServerReport.ReportServerCredentials = rsc;
                this.ReportViewer1.ServerReport.ReportPath = reportPath;
                this.ReportViewer1.ServerReport.ReportServerUrl =
                    new Uri((Properties.Settings.Default.MyReportServerUrl));
                this.ReportViewer1.ServerReport.Refresh();
            }
        }
    }
}


Before testing this out, we need to validate that permissions are correct for our demonstration report user.  The account will need local logon rights, for example.  If you’re going across firewall boundaries, you’ll probably need to either set up the reports to not require credentials and use an execution account, or use SSL with a certificate so you can use basic authentication on the web service without sending credentials in the clear.

You can set up an execution account using the Reporting Services Configuration Manager, as shown below.  If you do this, you might not even have to bother with the special connection interface, but I haven’t tested that yet.

[Figure: Reporting Services Configuration Manager, execution account page]

So, here is the end result.  Yes, the treeview menu interface isn’t pretty, but it works, and it automatically enumerates down to the lowest folders.

[Figure: the finished reports portal showing the report tree and a rendered report]

Try it for yourself at http://www.bobthesis.com/reportsportal/



Cautions with Indexed Views

If you’ve been following my blog, you may remember an article I did extolling the virtues of indexed views (http://blogs.msdn.com/microsoftbob/archive/2009/05/13/sql-data-warehousing-tip-2-indexed-views.aspx).  I stand behind that and have found the feature extremely useful.  In fact, in my last post, the 11 seconds it was taking to process a few thousand transactions against a 10-million-row history table, in a simulation totaling 3 million transactions, went down to just 2 seconds using an indexed view.

However, before you go out and make every view into an indexed view, it is important to understand what an indexed view is and what the consequences are.  I was reminded of this very sharply when I ran the following “routine” update command on the table containing all of the trading symbols used for my simulation processing:

/*------------------------
update dbo.EquitySymbol
set Shortable = 1
from Load.ShortableList l
where l.Symbol = dbo.EquitySymbol.TradingSymbol
------------------------*/

(6334 row(s) affected)

Now, normally this had been taking less than 2 seconds.  But, this was before I added the indexed view that happened to include the “Shortable” column.  So, how long did it take with the indexed view in place?

[Figure: query results showing an execution time of 18 minutes 51 seconds]

Almost 19 minutes.  Yes, 18 minutes and 51 seconds.  That is actually about twice as long as it takes to create the indexed view along with all of its alternate indexes!  Pardon the drama, but my point is that indexed views are dramatic by nature.  On one hand they let you analyze millions of rows and help enable something as complex as generating 7 years of trading simulations, consisting of over 3 million orders against 10 million rows, in a couple of hours, while on the other hand they can cause a small update to take over a thousand times longer than usual…

So, how do we explain this?  First, we need to understand what happens when you create an indexed view.  The first step is to enable schemabinding on the view.  This ensures that SQL Server tags a dependency between the table and the view, so it can detect and prevent changes that would break the view; that isn’t where the performance hit comes from (see http://msdn.microsoft.com/en-us/library/ms187956.aspx for an explanation of the process for creating an indexed view, including the schemabinding step).  The key threshold (pun intended) is crossed when you create the unique clustered index for the view, which is required before you can add secondary indexes to it.  That “materializes” the view: it is no longer just a query over your tables, but becomes, in effect, a table itself with real data.  And of course, that data must be maintained to match the underlying tables.
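For reference, here is a minimal sketch of those two steps, using a hypothetical view loosely modeled on my symbol/history join (the view, index, and EquityHistoryId column names are made up); the CREATE UNIQUE CLUSTERED INDEX statement is the one that materializes the data:

CREATE VIEW dbo.view_EquityShortable
WITH SCHEMABINDING                       -- step 1: schemabind the view
AS
SELECT  eh.EquityHistoryId,              -- hypothetical surrogate key of the big table
        eh.TradingSymbol,
        eh.MarketDate,
        es.Shortable
FROM    dbo.EquityHistory AS eh
JOIN    dbo.EquitySymbol  AS es
        ON es.TradingSymbol = eh.TradingSymbol;
GO
CREATE UNIQUE CLUSTERED INDEX IX_view_EquityShortable    -- step 2: materialize it
    ON dbo.view_EquityShortable (EquityHistoryId);
GO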

So, what was the situation in my case?  Well, the “shortable” flag was part of the indexed view that joined the equity symbol table (which has about 9,000 rows) to the Equity History table, which happens to have about 10 million rows.  My “little” update of the 6,000 or so Equity Symbol rows cascaded into about 7 million updates necessary to maintain all of the rows in the indexed view.  Here you can see the major time-consuming aspects of the query plan with the indexed view in place:

[Figure: query plan showing the indexed view maintenance cost dominated by sorts]


Notice the large amount of time spent on the sorts; these were apparently needed to prepare for the secondary index updates that included the Shortable column as an included column.  Also notice that the update of the clustered index on the table that actually contains the column ([EquitySymbol].[PK_EquitySymbol]) is estimated at virtually 0 percent of the total query time.

So, what does this mean: avoid indexed views?  Not at all!  It just means, as the title suggests, use caution.  So what could I have done differently to avoid this issue?  To start with, my update command is flawed.  I told SQL Server to update every row and didn’t even check whether the rows needed to be updated.  SQL Server will happily do what you tell it to; it doesn’t have a built-in mechanism that says, “Hey, you don’t need to update that column, it already contains the value you specified.”  It doesn’t do that for a very good reason.  For example, suppose I had a trigger that checked whether a particular column was updated.  Since trigger processing with “IF UPDATE(column)” doesn’t check whether the value actually changed, just whether the column was assigned, I can force a trigger to execute simply by updating the column to its current value.  So, what I should have done was:

update dbo.EquitySymbol 
set Shortable = 1 
from Load.ShortableList l
where l.Symbol = dbo.EquitySymbol.TradingSymbol
AND dbo.EquitySymbol.Shortable <> 1

Since only a handful of rows had changed since the last time this update command was run, doing it this way should have taken only a few seconds, although it still would have required a couple thousand updates to cascade to the related rows in the view.  Another option, if I really did need to update a few thousand rows, would have been to simply drop the indexed view, perform the update, and then recreate the indexed view.  There may be slightly more exotic approaches, such as disabling the indexes on the indexed view, but effectively the approach is the same.
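For illustration, here is a sketch of the disable/rebuild variant, using the hypothetical view from the sketch above (any secondary indexes on the view would need to be rebuilt as well):

-- De-materialize the view so the bulk update doesn't have to maintain it
ALTER INDEX IX_view_EquityShortable ON dbo.view_EquityShortable DISABLE;

UPDATE dbo.EquitySymbol
SET    Shortable = 1
FROM   Load.ShortableList l
WHERE  l.Symbol = dbo.EquitySymbol.TradingSymbol
  AND  dbo.EquitySymbol.Shortable <> 1;

-- Re-materialize the view (this is where the cost comes back, but only once)
ALTER INDEX IX_view_EquityShortable ON dbo.view_EquityShortable REBUILD;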

So, here are a few suggestions when dealing with indexed views:

  1. Do not include more columns than you really need, particularly if the columns may be subject to change.
  2. Consider the cardinality of the values in the indexed view.  If your indexed view ends up repeating a few values thousands of times, consider the source of the data.  If that data is often changed, then expect those updates to take many times longer than without the indexed view.  Be very careful when joining a table with relatively few rows (i.e. 6,000) that is frequently updated to a much larger table (i.e. 10 million), even if the larger table is infrequently updated.
  3. Consider carefully the use of low-cardinality items in the indexes either as segments of the indexes or as included columns, as this will add to the update time for the view.
  4. Don’t create an indexed view unless there is at least one critical query that can benefit significantly and measure that benefit against the cost of other queries that may need to update the table. 
  5. When considering the costs/benefit of an indexed view, don’t forget to factor in that SQL will use the indexed view for queries that do not directly reference the indexed view.  Absent a query hint to indicate otherwise, SQL Server will leverage the indexed view in query plans that access any tables related to the indexed view.  So, when you create an indexed view intended to help one query, you may find it helping other queries as well.

So, the bottom line is that like just about everything else in SQL Server 2008, indexed views have their place, but use them wisely…



Some Uses for Query Hints

Some of you are all set to flame me just because of my title. I know because a couple of years back, I would have probably been one of the first to say “NO!”, when anybody started talking about query hints.  But, like everything else in this world, they have their place.  First of all, I encourage you to try everything else before resorting to query hints and to avoid hints that specify index names and such directly.  The hints I have found to be most useful are fairly generic and do not impose maintenance burdens on the database structure.

The rest of you may be wondering exactly what I’m talking about, so let me explain.  When SQL Server evaluates a query, it takes it apart and looks at how it can optimize the retrieval and which indexes can be used.  It considers all sorts of things, including the selectivity of the indexes, the number of physical and logical I/Os required, the parameter values, and the search arguments (SARGs), and makes a best-guess stab.  For any query, you can choose Display Estimated Execution Plan (or Ctrl-L) and get a graphical picture of just how SQL Server will execute it; you can also save the plan in XML format.  It will even detect “missing indexes”, indexes that would help performance, which I have found incredibly useful and powerful (although I don’t usually follow the index suggestions verbatim; more about that in another post).  A query hint is basically telling the query optimizer that you have inside information that is better than what SQL Server knows.  For more background on query plans, query optimization, and so on, I suggest the book “SQL Server 2008 Internals” by Kalen Delaney (http://search.barnesandnoble.com/booksearch/results.asp?ATH=Kalen+Delaney).  The SQL team has also created a special blog focused on query optimization at http://blogs.msdn.com/queryoptteam/.  Understanding query plans and every individual hint would be long blog posts in and of themselves, so I’m not going to attempt to cover all of that in this article.

My recent experience with query hints came in 2 different areas, and I actually have stats to back up one of my observations.  The first involved the FAST N hint, which tells SQL Server to optimize for returning the first N matches.  This can be really useful when using TOP N where N is a parameter value and you happen to know it will usually be a small number of rows.  SQL Server doesn’t have the luxury of that knowledge, so an OPTION (FAST N) hint causes the optimizer to favor seeks instead of scans, which (and I am oversimplifying) means it goes right to the pages containing the first N matches rather than scanning an entire range or table.  This can also be useful for partitioning/ranking queries where, again, you are looking for just the first few rows and don’t want to rank the entire set.
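Here’s a minimal sketch against a hypothetical query (note that FAST takes a literal, not a variable, so pick a value that reflects the typical small result set):

-- We usually only want the first few candidates, so favor a plan
-- that returns the first 10 rows quickly
SELECT TOP (10)
        TradingSymbol,
        MarketDate,
        DailyHigh
FROM    dbo.EquityHistory
WHERE   MarketDate = '2009-08-12'
ORDER BY DailyHigh DESC
OPTION (FAST 10);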

The other experience involved my simulation application.  I recently ran the simulations against about 6 years of stock history (almost 10 million rows) in order to simulate different entry/exit trading strategies.  This generated nearly 2 million orders and over 1 million end-of-day portfolio rows.  What was interesting was that at the beginning of the simulation, before there were any rows in my portfolio Orders table, the buy/sell procedures ran very fast.  They didn’t start slowing down until past 1 million rows, and even then they weren’t too bad considering the volume of data, as you can see from the table below.  By that time, the indexes were undoubtedly not pretty, as some of them were being split a lot due to the various combinations of fields that had to be indexed in order to close out positions, evaluate trading effectiveness, etc.

If my routine simply generated a couple million rows each night, I could just add a post-processing step to rebuild the fragmented indexes and leave well enough alone.  (For a great tool to automate this, see http://blog.sqlauthority.com/2008/03/04/sql-server-2005-a-simple-way-to-defragment-all-indexes-in-a-database-that-is-fragmented-above-a-declared-threshold/; I copied and pasted it into a SQL job step that runs nightly, I no longer worry about index fragmentation, and it runs a lot faster than the built-in SQL index maintenance tools because it only does what it needs to.)  In any case, it turns out that what I really want to do is update the data daily with a few new simulations and append to the existing ones.  For the append, the query plan was still good, since I still need to create a few thousand orders and update a few thousand (fill the buy orders, close the sell orders).  But just adding a new simulation that goes back 7 years was taking inordinately long given the amount of data.  You can see from the table below that even when only a few records, or in some cases none, were updated, the buy and sell procedures took almost as long as when 2,000 to 3,000 rows were being updated.

                  Main simulation        1-Portfolio simulation
MarketDate     ElapsedTime    Rows      ElapsedTime    Rows
8/5/2009            11        2466           11          70
8/6/2009            11        2895           11          36
8/7/2009            11        2681           11         206
8/10/2009           11        1603           11          22
8/11/2009           11        2820           12          91
8/12/2009           13        1925           11          29
(ElapsedTime in seconds)

So, the question is: how can I keep the batch-oriented performance optimized to process thousands of buy/sell orders at once, while also getting good turn-around when there are only a handful of orders?  Enter the query hint OPTION (RECOMPILE).  Most of the time, using this hint doesn’t make sense, because it defeats one of the great features of SQL Server: auto-parameterized, reusable execution plans.  Auto-parameterization is another complex topic covered in the links above, but it basically allows SQL Server to look at similar queries, particularly within a stored procedure, and come up with a reusable plan based on the parameters of the execution.  This works great if you’re calling a stored procedure repeatedly at short intervals, since it saves the recompilation step.  However, my process is mainly batch: I call my stored procedures once per market date for all of the simulations.  With the exception of one row-oriented procedure, I basically simulate an entire day’s worth of stock market activity in 7 stored procedures, so in my scenario I don’t care about the time saved by skipping the recompile.  With OPTION (RECOMPILE), SQL Server will look at the statement each time, consider all of the parameters and their values, and then make the right decisions.

Let’s take a look at the stored procedure.  You can see where dynamic evaluation of the parameter values would help: if, for example, I am calling for just a single portfolio, that calls for an entirely different plan than when there are hundreds of portfolios.  One approach involves scanning through the orders by key to look for qualifying orders and then matching back to the portfolios, while the other involves looking up a single portfolio and evaluating only the orders for that portfolio.

CREATE PROCEDURE [Olap].[ProcessSellOrders]
    @MarketDate DATE, @RowCount BIGINT OUT
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;
    SET XACT_ABORT ON;
    -- Assume bollinger history technique for now
    -- Similar logic as Process Buys, except reverse the trading methods with same
    -- logic for daily high/daily low, but update Close date, limit instead of Open date, 
    -- and check stop limits 
    
    -- We don't know if the low or high hit first, so to be on the safe side, we assume
    -- that the worse happened and so check the stop limit first 
    UPDATE [Olap].[PortfolioOrder]
        SET FilledPriceClose = dbo.udf_CalculateFillPrice
            (PortfolioOrder.StoplimitPrice
            ,PortfolioOrder.LimitPriceClose
            ,ed.PriceAtOpen
            ,ed.DailyLow
            ,ed.DailyHigh
            ,PortfolioOrder.TradingMethod),
        CloseDate = @MarketDate
    FROM    olap.Portfolio pf, 
            dbo.view_EquityDetails_Brief ed
    WHERE    (pf.StartDate <= @MarketDate and (pf.EndDate >= @MarketDate or pf.EndDate IS NULL))
        AND (pf.LastUpdateDate < @MarketDate or pf.LastUpdateDate is null)
        -- Last Update Date not set until EOD
        AND PortfolioOrder.PortfolioId = pf.PortfolioId
        AND ed.MarketDate = @MarketDate
        AND ed.TradingSymbol = PortfolioOrder.TradingSymbol    
        AND PortfolioOrder.OpenDate <= @MarketDate 
        -- See if valid for day trade - An equity is valid for day trade if it was bought using the
        -- opening price.
        -- Don't get confused by new orders for next market date
        AND PortfolioOrder.CreateDate < @MarketDate 
        -- Redundant since earlier than OpenDate, but include so that optimal index can be used
        AND PortfolioOrder.OrderStatus = 2
        AND (    (ed.PriceAtOpen >= PortfolioOrder.StopLimitPrice AND PortfolioOrder.TradingMethod = 'S')  -- Stop price hit at open
              OR (ed.PriceAtOpen <= PortfolioOrder.LimitPriceClose AND PortfolioOrder.TradingMethod = 'S') -- Limit price hit at open
              OR (ed.DailyHigh >= PortfolioOrder.StopLimitPrice AND PortfolioOrder.TradingMethod = 'S')    -- Stop price hit during the day
              OR (ed.DailyLow <= PortfolioOrder.LimitPriceClose AND PortfolioOrder.TradingMethod = 'S')    -- Limit price hit during the day
              OR (ed.PriceAtOpen <= PortfolioOrder.StopLimitPrice AND PortfolioOrder.TradingMethod = 'L')  -- Stop price hit at open
              OR (ed.PriceAtOpen >= PortfolioOrder.LimitPriceClose AND PortfolioOrder.TradingMethod = 'L') -- Limit price hit at open
              OR (ed.DailyLow <= PortfolioOrder.StopLimitPrice AND PortfolioOrder.TradingMethod = 'L')     -- Stop price hit during the day
              OR (ed.DailyHigh >= PortfolioOrder.LimitPriceClose AND PortfolioOrder.TradingMethod = 'L')   -- Limit price hit during the day
            )
        AND (    (PortfolioOrder.IsValidForDayTrade = 1 AND PortfolioOrder.OpenDate = @MarketDate)
              OR (PortfolioOrder.OpenDate < @MarketDate)
            )
    SET @RowCount = @@ROWCOUNT
END


[Figures: query plans for the single-portfolio case versus the full simulation]

SQL 2008 provides much more flexibility in how query hints can be supplied.  For example, in the above stored procedure we can simply put OPTION (RECOMPILE) at the end of the query, rather than using the legacy WITH RECOMPILE option for the whole stored procedure.  If there were multiple queries, we could put different hints, or no hints, on each one, so there is a great improvement in granularity for this kind of tweaking in SQL 2008.
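For example (a simplified fragment standing in for the full UPDATE above, with the parameter replaced by a local variable so it runs on its own), the hint simply goes at the end of the individual statement inside the procedure:

DECLARE @MarketDate date = '2009-08-12';   -- stands in for the procedure parameter

UPDATE [Olap].[PortfolioOrder]
   SET CloseDate = @MarketDate
 WHERE OrderStatus = 2
   AND OpenDate <= @MarketDate
OPTION (RECOMPILE);   -- recompile just this statement, not the whole procedure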

Here’s the difference in performance after changing to the RECOMPILE hint.  Times are rounded to the nearest second, so 0 means less than 0.5 seconds.  In the figures above you can see the difference in the query plan when I am only calling for a single portfolio versus the case where all of the portfolios are being processed.  OPTION (RECOMPILE) is particularly useful when there is a lot of variation in the parameter values that ultimately determine the search arguments and predicates, in scenarios where one index may make much more sense than another depending on the search criteria.

                  Main simulation        1-Portfolio simulation     After OPTION (RECOMPILE)
MarketDate     ElapsedTime    Rows      ElapsedTime    Rows        ElapsedTime    Rows
8/5/2009            11        2466           11          70             1           70
8/6/2009            11        2895           11          36             0           36
8/7/2009            11        2681           11         206             0          206
8/10/2009           11        1603           11          22             0           22
8/11/2009           11        2820           12          91             0           91
8/12/2009           13        1925           11          29             0           29
(ElapsedTime in seconds)

P.S.: Since I wrote this, I put my indexed view back in, and now the sell orders stored procedure, even for the main simulation, runs in just a couple of seconds, even with the indexes fragmented.

For more about query hints, see http://technet.microsoft.com/en-us/library/ms181714(SQL.90).aspx


Getting the full exception from .NET

While we’re on the topic of .NET coding, here’s another technique.  If you thought my last post was short, here’s an even shorter one…

Ever wanted to make sure you returned an error message with all of the inner exceptions?  I know I’ve wasted a lot of time trying to debug code that only returned the top-level exception, especially when the error occurred in production due to an environmental issue that couldn’t be reproduced in dev, and the production environment didn’t allow Visual Studio debugging.

The below function is a recursive function that calls itself until it has traversed all of the inner exceptions.  It will make sure you always get all of the exceptions delimited by a “/” down to the lowest level.  The deepest level message will be the last one in the string.

        public static string buildExceptionMessage(String message, Exception ex)
        {
            message += ex.Message;
            if (ex.InnerException != null)
            {
                // Recurse, appending the inner exception's message after a "/" delimiter
                return buildExceptionMessage(message + "/", ex.InnerException);
            }
            return message;
        }


Posted in Uncategorized | Leave a comment

Fun with Enums and a Generic File Date/Time Stamper

Time for a break from SQL stuff for a little .NET.  Have you ever wanted to use the integer value of an enum instead of the enum value itself?  For example, here is an enum I created to map date/time formats.

public enum FileDatePrecision
{
    Day = 0, Hour = 2, Minute = 4, Second = 6,  
    Secondsf1 = 7, Secondsf2 = 8, Secondsf3 = 9, 
    Secondsf4 = 10, Secondsf5 = 11, Secondsf6 = 12, 
    Secondsf7 = 13
}

Never mind the weird names for the fractional-second parts; I was going to use “Millisecond”, “Microsecond”, etc., but couldn’t figure out the names for all of the fractional precisions.  The “N” in each fN is the fractional precision – i.e., f1 means 1/10 of a second, f2 means 1/100, and so on.

So, why did I create the enum with non-contiguous integer values?  To provide a simple way to generate the format mask.  Here’s the code that returns the date/time stamp, with a couple of overloads to provide defaults.  It’s pretty simple: the enum value is used as the substring length, so the format mask includes exactly the characters needed for the desired output precision.

public static string BuildFileDateTimeStamp(DateTime dt, FileDatePrecision precision)
{
    // The enum's integer value is the number of characters to keep from the time mask
    int precisionSequence = (int)precision;
    string formattedTime = "";
    if (precisionSequence > 0)
    {
        // Format the time based on the level of precision specified in the enum
        formattedTime = "_" + dt.ToString("HHmmssfffffff").Substring(0, precisionSequence);
    }
    // The date portion is always included
    string formattedDate = dt.ToString("yyyyMMdd");
    return formattedDate + formattedTime;
}
public static string BuildFileDateTimeStamp(DateTime dt)
{
    return BuildFileDateTimeStamp(dt, FileDatePrecision.Minute);
}

public static string BuildFileDateTimeStamp()
{
    return BuildFileDateTimeStamp(DateTime.Now);
}

So, to create a date/time-stamped log filename such as StockDownloader_<yyyyMMdd_HHmm>.log (the parameterless overload defaults to minute precision) for use in a trace writer, I can just do:

TextWriterTraceListener myWriter = new TextWriterTraceListener(
    System.IO.File.CreateText("StockDownloader_"
        + Common.Utility.BuildFileDateTimeStamp() + ".log"));
Trace.Listeners.Add(myWriter);

Posted in Uncategorized | Leave a comment

Using NTILE with Cross Reference Values to Create Dimensions

I confess a lack of depth in SQL Server Analysis Services (SSAS).  I’ve read quite a bit about its capabilities, including data mining, and played with a couple of the wizards, but just haven’t had time to immerse myself in end-to-end scenarios.  So I’m writing this post with a big disclaimer: you may be able to do some of what follows more easily with Analysis Services – I don’t really know yet.  You’ll have to forgive my ignorance here.

Although SQL Analysis Services provides a lot of capability, it does seem that you get better results if you “prep” the data beforehand in SQL, particularly the sources for the dimensions, so this post may be useful even if you lean heavily on SSAS.  In my scenario, it is useful to reduce the range of values associated with the variables, because of the sheer size of the data sets, the multiplicative effect of combining variable permutations, and for simplicity of analysis.  Basically, this means summarizing ranges of values into groups and then analyzing correlations among the grouped values – hence the “NTILE” in the title.  Think of it as assigning grades rather than point scores, so you can identify the “A”, “B”, etc. students without getting immersed in point distinctions.  Once the students are grouped by grade level, correlative analysis becomes much easier – e.g., how much time do “A” students spend on homework versus “C” students, rather than comparing students who scored 93/100 against those who scored 74/100.

First, a little bit about the application.  My requirements include the ability to correlate various factors about an equity – its earnings per share (EPS), price/earnings-to-growth (PEG), sector, period of performance, etc. – to see whether any of these affect the profitability of a stock purchased at a particular time, and when it should be sold.

Hopefully, that provides enough background so some of this will make some sense.

So for my scenario, some what-if questions: Do certain sectors or industries perform better during certain periods in relation to other stocks?  How does the EPS affect trading profitability?  What about market capitalization?  Does the combination of these have an impact?  For example, is it more profitable to invest in stocks with small capitalization in some industries during certain times of the year?

Essentially, it would be nice to take all of these factors, generate permutations, and then analyze simulations of stock trades over a period of time to see whether any combinations are effective at predicting how well a strategy will perform.  Now, if we simply took all of the combinations, the multiplicative effect of the permutations would be unmanageable – larger than the “fact” data itself – since the precision of many of these values is enough to generate hundreds if not thousands of distinct values.  “Permutating” all of this together produces a lot of interesting but meaningless data.

The NTILE function assigns each row in a set to a group based on an interval.  One of the variables I want to NTILE is earnings per share (EPS).  Another is market capitalization.  Another could be sector.  And finally, the period.  By NTILE, I mean grouping into categories, so, for example, every stock falls into exactly one of 5 tiles (ratings) for EPS, one for market cap, and so on.  You can see how reducing these into groupings greatly simplifies the analysis and reduces the number of permutations while still keeping enough granularity to differentiate the groups.  We just have to figure out a way to map the original “fact” data values to their groupings (grades).  Once we have the mapping, we can join the tiled dimensional data back to the fact data.  Yes, I know this is sounding more like an SSAS scenario, but remember I already gave my disclaimer…  I’d welcome responses from the BI experts on how to approach this scenario using SSAS.

So, here are some steps to accomplishing this:

1) Use computed fields to round the fact data into more discrete values, particularly if the original values are imprecise, such as floating point.  For example, if we only need 5 tiles, we probably don’t need to map down to .01 precision; rounding to .1 keeps the tiling accurate enough.  This reduces the number of distinct values in the fact data that need to be tiled and shrinks the mapping table.  Below is an example of an equity information table with the computed rounded fields.

CREATE TABLE [dbo].[EquityInfo](
    [TradingSymbol] [varchar](25) NOT NULL,
    [Description] [varchar](50) NULL,
    [IndustryId] [smallint] NULL,
    [SectorId] [smallint] NULL,
    [Shares] [bigint] NULL,
    [PERatio] [float] NULL,
    [EPS] [float] NULL,
    [DivYield] [float] NULL,
    [DivPerShare] [float] NULL,
    [PEG] [float] NULL,
    [PriceToSalesRatio] [float] NULL,
    [PriceToBookRatio] [float] NULL,
    [ExchangeName] [varchar](10) NULL,
    [EnabledFlag] [bit] NULL,
    [DateUpdated] [date] NULL,
    [DateCreated] [date] NULL,
    [SampleName] [varchar](50) NULL,
    [TotalBookValue] AS (case when [PriceToBookRatio]>(0) AND [Shares]>(0)
        then [Shares]/[PriceToBookRatio] end),
    [TotalSalesValue] AS (case when [PriceToSalesRatio]>(0) AND [Shares]>(0)
        then [Shares]/[PriceToSalesRatio] end),
    [PEG_Rounded] AS (CONVERT([smallmoney],ROUND([PEG],1))) PERSISTED,
    [EPS_Rounded] AS (CONVERT([smallmoney],ROUND([EPS],1))) PERSISTED,
    [PE_Rounded]  AS (CONVERT([smallmoney],ROUND([PERatio],1))) PERSISTED,
    [PTB_Rounded] AS (CONVERT([smallmoney],ROUND([PriceToBookRatio],1))) PERSISTED,
    [PTS_Rounded] AS (CONVERT([smallmoney],ROUND([PriceToSalesRatio],1))) PERSISTED,
    [Shortable] [bit] NULL,
    [HistoryReloadRequired] [bit] NULL,
    [HistoryReloadedDateTime] [datetime] NULL,
    CONSTRAINT [PK_EquityInfo] PRIMARY KEY CLUSTERED ([TradingSymbol] ASC))

2) Create a view that tiles the values for each of the dimensions of interest.  For example, for EPS, we would have the following view:

CREATE VIEW [Olap].[Vdim_EPS_Tiled]
AS
SELECT     NTILE(5) OVER (Order BY EPS_Rounded) as EPS_Tile, EPS_Rounded
FROM         dbo.EquityInfo
WHERE EPS_Rounded Is NOT NULL

3) Generate a table from the view to cross-reference the tiles back to the source fact data.  The table can be generated with a SELECT INTO and then maintained via a MERGE after we add a primary key.  We need SELECT DISTINCT because the view will contain duplicate EPS_Rounded values.

select distinct * into olap.tdim_EPS_Tiled from Olap.vdim_EPS_Tiled;
GO

ALTER TABLE Olap.tdim_EPS_Tiled ADD CONSTRAINT
    PK_tdim_EPS_Tiled PRIMARY KEY CLUSTERED 
    (EPS_Rounded) 
GO
CREATE PROCEDURE olap.Update_tdim_EPS_Tiled
AS BEGIN
  MERGE INTO olap.tdim_EPS_Tiled AS T
  USING olap.vdim_eps_Tiled AS S
    ON S.EPS_Rounded = T.EPS_Rounded
  WHEN MATCHED 
    THEN UPDATE
        SET EPS_Tile = S.EPS_Tile
  WHEN NOT MATCHED BY TARGET
    THEN INSERT
        (EPS_Tile, EPS_Rounded)
        VALUES
        (S.EPS_Tile, S.EPS_Rounded)
  WHEN NOT MATCHED BY SOURCE
    THEN DELETE; -- Might not want to do this if there is history involved
END
 

Now, whenever I want to analyze the performance of an equity in a simulation, I can join to the tdim_EPS_Tiled table on EPS_Rounded to derive the tile.  I can then group by tile, average the profitability from the simulations, and evaluate the degree of correlation.  This should also work well with a cube in SSAS for correlation and predictive data mining – I’ll try that out as soon as I get through the SSAS tutorial.
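
Here is a rough sketch of that join.  Simulator.SimulationResult and its Profit and TradingSymbol columns are hypothetical stand-ins for my actual simulation output; EquityInfo and tdim_EPS_Tiled are the objects shown above.

SELECT   t.EPS_Tile,
         AVG(sr.Profit) AS AvgProfit,
         COUNT(*)       AS Trades
FROM     Simulator.SimulationResult AS sr            -- hypothetical table of simulated trades
JOIN     dbo.EquityInfo             AS e ON e.TradingSymbol = sr.TradingSymbol
JOIN     Olap.tdim_EPS_Tiled        AS t ON t.EPS_Rounded   = e.EPS_Rounded
GROUP BY t.EPS_Tile
ORDER BY t.EPS_Tile;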

The below diagram illustrates the process flow including the automation aspect.

[Process flow diagram]

In summary:

First, we reduce the number of values to map: instead of mapping the entire universe of values, we map a rounded representation stored as a precise data type (not floating point).  This makes the cross-reference tables smaller.  It introduces some complexity when mapping back from the fact data, but that is easily remedied by creating computed fields on the fact table that map back to the source.

Next, we create “NTILE” views of the grouped data.  We can then join back from the source data to find the grade associated with a particular equity.  This also enables iterative simulation – taking the results of one pass of simulations as input into a higher layer of simulations.  For example, if my simulations show a positive correlation for a particular EPS grade, I can create another simulation filtered on just that EPS grade to drill down and explore the correlation further.  A good article about iterative and non-iterative simulation algorithms is at http://www.stat.columbia.edu/~gelman/research/published/itsim_interface.pdf

You might be wondering why I didn’t just use indexed views – I tried, and couldn’t get it to work; there are limitations on what you can index when it comes to aggregation, and apparently window functions like NTILE are among them.

That’s all the time I have for this today.  You might be thinking this is neat, but what about all my “what-if” questions?  How do we query the dimension data and link back to the facts to find the correlations?  I have made some interesting correlation discoveries, but I still have a lot of work to do here.  I suspect this is where I really need to start using SSAS, because the queries to actually find the correlations become very complex – time to get out the manual, go through the tutorials, and start playing with the data mining wizards…

Posted in Uncategorized | Leave a comment

Using Persisted Computed Columns in SQL Server Indexes

As part of my research work, I generate thousands of simulations on an almost daily basis for various scenarios.  One of my scenarios involves determining strategies for selecting the most profitable short and long stock entry/exit positions.  Calculating this involves data mining stock history information, analyzing moving averages of the equities, and correlating parameters associated with the equities to find the optimal entry and exit position for a given stock.  One way to calculate this is to use brute-force computational power to simulate the trades over historic time periods and measure the results (there is probably a more elegant way with advanced calculus using derivatives, etc.).  In order to play out the simulation, I need to track the position of the equity associated with each order – whether the order is pending, the position has been opened, the position has been closed, or the order was cancelled.  For example, I need to measure the dollar value of positions as well as cash in the portfolio to tell how much “buying power” remains to purchase additional equities at a given point in the simulation.

I track dates associated with all of these events, so a “normalized” way to do this is simply to look at the dates, as shown in these views:

CREATE VIEW [Simulator].[view_ValidOrders]
AS
SELECT PortfolioOrderId, PortfolioId, TradingSymbol, CreateDate, OpenDate, UpdateDate, CloseDate, TradingMethod, Shares, LimitPriceOpen, FilledPriceOpen,
       LimitPriceClose, FilledPriceClose, StopLimitPrice
FROM   Simulator.PortfolioOrder
WHERE  (CancelDate IS NULL);
GO

CREATE VIEW [Simulator].[view_OpenOrders]
AS
SELECT PortfolioOrderId, PortfolioId, TradingSymbol, CreateDate, OpenDate, UpdateDate, CloseDate, TradingMethod, Shares, LimitPriceOpen, FilledPriceOpen,
       LimitPriceClose, FilledPriceClose, StopLimitPrice
FROM   Simulator.view_ValidOrders
WHERE  (CloseDate IS NULL AND OpenDate IS NOT NULL);
GO

I can then index on the cancel date, open date, and close date in combination with other fields to try to optimize the queries.  But why do that when a single status is enough, since the states are mutually exclusive?  Based on that, we can create a computed column called OrderStatus (we’ll come back to the other computed column, Profit, in a moment) that holds just the status:

CREATE TABLE [Simulator].[PortfolioOrder](
    [PortfolioOrderId] [int] IDENTITY(1,1) NOT NULL,
    [PortfolioId] [int] NOT NULL,
    [TradingSymbol] [varchar](50) NOT NULL,
    [OpenDate] [date] NULL,
    [CloseDate] [date] NULL,
    [CancelDate] [date] NULL,
    [TradingMethod] [char](1) NOT NULL,
    [Shares] [smallmoney] NOT NULL,
    [FilledPriceOpen] [smallmoney] NULL,
    [FilledPriceClose] [smallmoney] NULL,
    ...
    [Profit] AS ([Shares] *
        case when [TradingMethod]='L' then [FilledPriceClose]-[FilledPriceOpen]
             else [FilledPriceOpen]-[FilledPriceClose] end) PERSISTED,
    [OrderStatus] AS (CONVERT([tinyint],
        case when [CancelDate] IS NOT NULL then (0)
             when [CloseDate] IS NULL AND [OpenDate] IS NULL then (1)
             when [CloseDate] IS NULL AND [OpenDate] IS NOT NULL then (2)
             else (3) end,(0))) PERSISTED,
...

So, how does the computed column help us with performance?  First, I need to make sure the computed column is persisted, and the caveat is that only deterministic expressions can be persisted.  Not only that, the computed column must be derived directly from columns in the same table, rather than from functions that look at other tables.  For example, I have a current-value column on my portfolio that is computed from the value of all of the underlying positions, but that column is not persist-able since it depends on a function that sums results from the order detail data (I could use an indexed view to get around that, but there are quite a few rules and restrictions around indexed views – see my earlier post on indexed views).
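
As a quick illustration of that rule (DaysHeld and DaysOpen are hypothetical columns added only for this example, not part of the actual schema):

-- Persists fine: deterministic and derived only from columns in the same row
ALTER TABLE [Simulator].[PortfolioOrder]
    ADD [DaysHeld] AS (DATEDIFF(day, [OpenDate], [CloseDate])) PERSISTED;

-- Cannot be persisted: GETDATE() is non-deterministic, so SQL Server rejects PERSISTED here
-- ALTER TABLE [Simulator].[PortfolioOrder]
--     ADD [DaysOpen] AS (DATEDIFF(day, [OpenDate], GETDATE())) PERSISTED;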

The second thing to do is to make sure the storage for the computed column is optimal.  I wouldn’t normally persist and index a floating-point value, for example, since floating-point values are approximate.  I also want the computed column to use as little storage as possible so the index entry stays small, potentially reducing the number of levels required for a large number of rows.  That is why I use the CONVERT function for the order status: by default SQL would make the expression an int (4 bytes), but all I really need is a single byte – a TINYINT.

Note that I also use the included-columns technique to include only what is needed.  I include the persisted computed Profit column rather than all of the source fields needed to calculate the profit, so that a covering index works for my queries and eliminates the lookup back to the base table (see the link at the end of the article for more on covering indexes).  This keeps the included data small, which also contributes to a smaller index entry.

Using the Profit column as an included field along with other relevant fields, and keying on the order status, gives us the following index, which is just right for the types of queries we need to run in a stock trading simulation.  It lets us quickly calculate the profit (or loss) – and hence the current buying power – for the portfolio, as well as find orders in a given status, such as newly submitted orders with a limit price that still need to be opened into a position, or open positions that need to be closed to lock in a profit.

Index for finding profit of closed orders for portfolio using computed fields:

CREATE NONCLUSTERED INDEX [IX_PortfolioOrder_Status] ON [Simulator].[PortfolioOrder]
(
    [PortfolioId] ASC, -- 4 bytes
    [OrderStatus] ASC  -- 1 byte
)
INCLUDE ([Profit]) -- 4 bytes

Total: 9 bytes
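
For example, here is a rough sketch of the kind of query this index covers entirely from its leaf level, so no lookup back to the base table is needed (3 = closed, per the OrderStatus computed column above; the portfolio id value is just illustrative):

DECLARE @PortfolioId int = 1;  -- example portfolio id

SELECT SUM([Profit]) AS RealizedProfit
FROM   [Simulator].[PortfolioOrder]
WHERE  [PortfolioId] = @PortfolioId
  AND  [OrderStatus] = 3;      -- closed orders only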

Contrast that with the index structure below, which needs additional included columns (TradingMethod, Shares, and the filled open/close prices instead of just Profit) plus a date column in the key to filter out orders that don’t qualify – or, alternatively, no date columns at all, in which case everything retrieved by PortfolioId has to be filtered after the fact.

Index for finding profit of closed orders for portfolio without benefit of computed fields:

CREATE NONCLUSTERED INDEX [IX_PortfolioOrder_Status] ON [Simulator].[PortfolioOrder]
(
    [PortfolioId] ASC, -- 4 bytes
    [CloseDate] ASC    -- 4 bytes; probably not worth having as a key column – could move it
                       -- into the include list and filter instead
)
INCLUDE ([FilledPriceOpen], [FilledPriceClose], [TradingMethod], [Shares]) -- 13 bytes

Total: 21 bytes

So, how effective is this technique?  I’m sorry I don’t have hard before-and-after numbers, but I can honestly tell you that my simulations for this application are running about twice as fast now.  I believe the main reason is that the number of index levels was reduced: the row count in PortfolioOrder had crossed a threshold, and with the wider index entries an extra level was needed.  Up to a few hundred thousand rows the performance was similar, but beyond that the larger index required an additional level, and each additional level costs another I/O per row lookup.

This brings up another point about indexes and performance.  If you see an overnight drop in performance on some queries as more data is added, it’s probably because one of your indexes just had to add another level to its b-tree.  You’ll want to understand how SQL Server stores indexes to get better insight into this, and into other ways to mitigate the issue if shrinking the size of the supporting columns isn’t feasible.  Take a look especially at filtered indexes (which we could also have used for this scenario), where the index is defined over a horizontal partition of the rows – i.e., the index only contains rows meeting a specific filter.
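
For instance, a rough sketch of a filtered index for this schema (the index name and column choice are illustrative, not the exact index from this post) that only contains rows for orders that were never cancelled:

CREATE NONCLUSTERED INDEX [IX_PortfolioOrder_NotCancelled]
    ON [Simulator].[PortfolioOrder] ([PortfolioId], [CloseDate])
    WHERE [CancelDate] IS NULL;  -- only non-cancelled orders are stored in this index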

This points out the need to carefully consider the lengths of the columns you are indexing, as well as the sizes of the fields in your include list (at some point, if you’re including every field, you may as well skip the include list and do a lookup).  First, do you really need to index the whole column?  If not, you might want to use a computed field to index just the substring you actually need, as sketched below.  Second, are you using the right data type?  If you’re shamelessly using GUIDs for primary keys, shame on you – you’re spending four times the storage of an integer identity, and because the inserts land all over the tree, you’re causing page splits and fragmentation as well.
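
Here is a sketch of the substring idea against the EquityInfo table shown earlier (DescriptionPrefix and the index name are hypothetical, added only for illustration):

-- Index only the leading characters of a wider column via a persisted computed column
ALTER TABLE [dbo].[EquityInfo]
    ADD [DescriptionPrefix] AS (LEFT([Description], 10)) PERSISTED;

CREATE NONCLUSTERED INDEX [IX_EquityInfo_DescriptionPrefix]
    ON [dbo].[EquityInfo] ([DescriptionPrefix]);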

I’d like to go deeper into how to calculate the number of bytes per row in an index, but I don’t have time, and there are other resources that explain it much better than I can.  For more information, see the following links:

Posted in Uncategorized | Leave a comment