Some major enhancements and a convergence on features from Azure to on-premise. The ‘R’ capabilities and seamless integration with big data are game-changers.
Some major enhancements and a convergence on features from Azure to on-premise. The ‘R’ capabilities and seamless integration with big data are game-changers.
My hope is that by now all companies are using multifactor (MFA for short) authentication to protect all critical assets. MFA authentication simply means that a user must validate their identity through at least two independent mechanisms. Two-factor authentication or 2FA for short just means two independent mechanisms are utilized. However many companies that are implementing 2FA are not implementing true 2FA because the mechanisms are not totally independent. Unless there is no way that one of the mechanisms in a 2FA scheme can be leveraged to gain access to the other mechanism, the mechanisms cannot be considered independent.
Two common 2FA schemes that are subject to compromise because of lack of independence are the use of a dial-back number that goes to a smart phone and hosting a 2FA on a corporate laptop. This post focuses on the risk with a smart phone that is on the receiving end of a dial-back number.
Here is a typically sequence of events for 2FA involving a dial-back number:
1) Logon to a web site or remote desktop or some other resource
2) User is prompted for credentials
3) User enters credentials
4) Upon successful entry of the credentials, the system prompts for an access code that is communicated on the dial-back number.
5) User then receives a phone or in the case of a cell, perhaps a text message or there may be an app configured to acknowledge the call.
6) At this point, the authentication generally occurs in either of the below fashions:
a. User enters the code communicated via text message or voice call to the primary authentication interface. This is commonly used for web site scenarios. For example, my GoDaddy account is setup in that manner.
b. User answers the phone or runs an app which causes a notification to be sent automatically back to the application. This scenario is utilized commonly for remote desktop scenarios such as the solution provided by Duo security.
So, what can go wrong with the dial-back scenario when it involves a smart phone – A lot:
Suppose the phone doesn’t have an auto-lock device on the phone or has a long interval for the lock to kick in (i.e. 1 minute), and also hosts Email. Consider what happens if a person steals the cell phone.
1) The user of the phone may not immediately realize the cell phone is gone or may be slow to report it.
2) The thief is able to access the phone before the auto-lock kicks in.
3) Even if the auto-lock does kick in:
4) The notification default on the iPhone for example seems to show notifications even in lock mode. So, in that case, the thief can actually see the notification code. Worse, yet, the thief can do a “forget password” if the site does not utilize a security question and then receive an Email with a new password or a link to reset the password which also displays on the screen.
The bottom line for a smart phone scenario for 2FA with a smart phone is that the user needs to configure the phone to be secure, particularly if the user’s Email is on the phone. In reality, a smart cell phone is not a true secondary authentication method for 2FA if it is has Email on it unless it is very tightly secured through the use of biometric method such as a thumb print along with a very short lockout period. A land line or a dumb cell phone provides a much higher level of security in that there is no Email or web access that the thief can leverage to determine the password for the logon authentication. Organizations that are relying on a dial-back number may wish to rethink allowing a cell phone to be used for 2FA or at least utilize group policy to lock down the phone and never send passwords over Email for a forgotten password. Another approach is to utilize a completely separate device such as a token-oriented or smart-card security device that does not have other application capabilities.
In my next post, I will describe the risks with 2FA associated with a corporate laptop, where the token is sent to some application installed for a VPN connection. This is potentially almost as bad as relying on a cell phone.
Note – If you do not have multifactor authentication implemented or have the type of risk discussed in this article, please contact us at email@example.com and we can help you get robust multifactor authentication implemented quickly and inexpensively. Controlling any access with just a username and password or any other single mechanism is a huge risk.
Note: My first version of this had an error in the query which has been corrected and the difference between up and down days is not significant. However, the larger the VIX change, the more likely to be a VIX up day and S&P down day the next day.
Today was an interesting day in the stock market. The S&P 500 closed positive while the most correlated indicators aside from other equity indexes closed strongly negative. This includes Treasuries (up) and high yield/corporate bond (down). These are all the opposite results of the norm, at least for the last several years. Along with that volatility was down significantly, which does correlate with upwards moves. Since this seems to make the move in the S&P suspect in terms of at least the immediate term, I ran a query in my equity history database. I used the cube with the grouping function to get a rollup of each grouping and the entire selection.
Below is the query:
Although I have history of the S&P back to 1950, some of the other instruments only go back to 1990, so the analysis is limited to then. 195 instances were found that met the condition over a 25 year period, so this only has only happened on average about 13 times per year.
Results for Days where VIX is down while High Yield is down, Treasuries are up, but S&P 500 is up:
Out of 195 instances, VIX was up the next day slightly less often (93 versus 102 instances) test. Along with that, the average change on the days that the VIX went up was negatively much stronger than on the VIX down days. The up days occur most often during tumultuous markets such as 2000 and 2008. Below are the detailed results.
Note that if I decrease the change percent to –7% which represents the percent change in the VIX today, the likelihood of a down day is higher (15 to 12), but with such a small number of instances, it is hard to make a case for trading based on that.
More interestingly would be to look at technical indicators such as MACD, RSI, etc. associated with the instruments and examine how these interact over longer periods in this type of configuration, but that takes more complex queries.
The usual disclaimer applies: I.e. I am not a professional investment advisor, short-selling and options involves significant risks including complete and total financial ruin, talk to your personal financial counselor, etc, etc, etc.
Somebody from a private forum I belong to known as “T-Theory” posted a graph that shows the rationale for being long the S&P when the 10 week exponential moving average is trending above the 50 week exponential moving average and being short in the opposite case. T-Theory is an approach invented by the late Terry Landry to viewing market behaviors in terms of cycles – that is markets tend to spend half of their time rising at a faster pace than the other half of the time and that these periods tend to occur symmetrically. For a more detailed explanation of T-Theory, see http://cdn3.traderslaboratory.com/forums/attachments/34/31965d1350013176-beyond-taylor-a1997introttheory_.pdf
The below graph shows the exponential moving averages (EMA) for 10 weeks using the red line and for 50 weeks using the green line. A sell signal is generated when the red line crosses the green line while a buy signal occurs when the green line crosses over the red line. A sell signal was recently generated for the S&P 500 and also exists for the other major indexes including the Dow. A sell signal has also been in place for many foreign indexes for many months including the Chinese market.
Clearly, this was a good strategy since 2000. I decided to quantify the benefits since then as well as over the longer-term using my equities database. I recently added .NET SQLCLR (A mechanism for Microsoft SQL Server that allows one to write .NET code and integrate into database functions) functionality to my SQL Server database that makes it relatively easy to calculate different technical indicators on the fly. I have been able to build a library of technical functions that are available from within the database. It is easy to find samples of C# code for calculating technical indicators such as the EMA that can then just be plugged into the CLR. I will write a more detailed post that includes samples of the CLR code and implementation process if that is of interest.
Below is the query to calculate the average return per week with being on the “right” side of the trade from the picture. I used a temporary table because executing the CLR function inline and joining is slower than storing the results into a temp table and then joining for some reason.
CrossOver AvgWeeklyChg (S&P 500 index since 2000)
The results show a significant benefit since 2000 avoiding a weekly loss of almost -0.2% during downtrends and achieving over 0.2% gain on the uptrend.
But, what about the longer-term? For this, I regenerated the temporary table using the Dow from 1928. This also shows a definitive benefit to being on the right side of the trade. In fact the performance on the upper cross-over is 8 times better per week and the performance on the down cross over is negative by -0.026 %. While -0.026 seems small, that is significant considering the Dow itself went from 240 to over 17,500 (it has lost over 500 points already since it’s EMA crossover trend change). And this is only the weekly loss, not the cumulative loss from when the trend changes. I will quantify the total amount of changes on average between crossover changes in a later post, but just looking at the crashes of 1929, 2000, 2008, etc. this has often historically amounted to over 50%
CrossOver AvgWeeklyChg (Dow since 1928)
Is past history guaranteed to repeat? No, but when something has worked for almost 90 years it gives cause to pause.
Even though history does not always repeat in the same fashion, it does rhyme – that is cycles tend to repeat. The conditions associated with 2001 and 2008 are different from today but unfortunately bear enough similarities (i.e. high valuations like 2000-2001, high debt and slow growth like 2007-2008) to make a case that a negative cycle in equities is unfolding again. I’ve already documented this through some other postings related to not only the long-term technical indicators but fundamental indicators such as the ratio of GDP to market cap, Schiller’s 10 year Cyclically Adjusted Price Earnings (CAPE), wealth growth versus growth of stock prices, etc.
Psychologically, it is difficult to change one’s mindset often until too late. I”m afraid many will jump into equities more and more aggressively on the dips rather than take risks off the table at high points. The type of trend that has started could easily lead to further drops of over 50% from current values based on a reversion to mean. At some point, capitulation occurs where people sell out of the market at the point where it actually has bottomed and don’t resume buying until the indexes have once again risen too much.
A common answer to this corundum it that it is impossible to time the market and so one may as well just buy and hold as it always comes back. But, how long can that take? The stock market peak of 1929 took nearly 30 years to recover and Japan is still only about 50% back from the peak of 1989. Will a large decline in equities in the US take that long to recover from? That seems unlikely based on history, but the possibility that recovery to the recent highs may take a decade or more has been proposed by some market strategists who have predicted prior bear markets such as John Hussman (http://www.hussmanfunds.com/wmc/wmc150921.htm).
Execution of the EMA crossover strategy does mean that one would not sell at the best possible point or buy at the lowest point and would miss out on a perfect strategy, but the results are far better than simply buying and holding. It seems to me wiser to follow a model that has historically proven most likely to succeed over the long-term when making investment decisions than to just hold and hope for the best rather than plow into a market that from a long-term technical perspective is now likely in decline.
Here’s factoid that most people, even many at the IT management level don’t realize:
A 128 GB Thumb drive which can be had for under $40.00 can store enough information to accomplish identity theft for the population of the entire world (7 billion people). 128 GB is approximately 137 billion bytes which is 19 bytes per person. Name and address data can normally be compressed by a factor of 3 and birth dates and social security numbers or other national identities only use 6 bytes in packed format. So, figuring about 40 bytes for name and address compressed to 13 bytes, the total amount compresses to 19 bytes per person.
Another factoid: It takes less than 90 seconds to download 10 million records containing a person’s name, address, spouse, birth date, and social security number. This is based on a cable modem connection of 50 Mb/S which equates to about 6 MB/s or 360 MB/minute. A person’s complete identity record normally is less than 50 bytes. The entire data set for 10 million people therefore is only about 500 MB (50 * 10,000,0000). That is in uncompressed format. With compression, it takes less than 30 seconds.
Unfortunately, my experience has been that the government and most companies are not making the effort of protecting data much of a priority. They throw money at it for sophisticated products that do not actually address the problem. They pay for expensive audits from companies that do not actually have the technical expertise to spot them. The regulatory audits seldom find actual problems as they are focused on outdated security mechanisms that do not have applicability to the most common scenarios whereby data is taken from inside the network to the outside rather than somebody breaking through a firewall from the outside.
In some cases, the approaches taken actually make it easier to steal data. For example, data thieves love encryption, which only protects against actual physical data being stolen and does nothing to protect data once it arrives to the user decrypted. Encryption allows thieves to encrypt the data they steal so that systems have no way of knowing what is leaving the network. Even worse the companies rarely even audit access to sensitive data and have no idea that there is a breach until all of the data is exposed. Many IT departments fail to implement simple controls for locking down files stored on the network, ensuring point-to-point security for service accounts, etc. We are not just talking about PII (Personally Identifiable Information) or Private Health Information (PHI) which is bad enough, but now large chunks of intellectual property (IP) are being stolen. The number of instances of data theft will only continue to multiply as organizations do not try to solve the root problems.
Our company is focused on data security from the inside-out. We will come onsite for a day for free and call out vulnerabilities. We can provide a rapid assessment over a couple of weeks that generates a score sheet which identifies the specific vulnerabilities and remedies. We have a complete tool set to automate identification and resolution of security issues at the database, applications, and file system levels where data theft originates. We are especially focused on healthcare with experience with HIPAA regulations. Most healthcare providers are not actually meeting HIPAA requirements. One of the requirements is that a record of access for all individuals who have looked at a person’s healthcare data can be produced upon demand.
We are experts at analyzing for vulnerabilities at the database and file system level where data theft originates. By the time the data goes out the firewall, it is already encrypted and non-detectable by firewalls. The only way to stop data theft is to implement safeguards at the data and application level. This requires a unique combination of data security, database, and application development skills. We are experts at working with huge amounts of data – one of our products we developed for financial risk management has a database of over 3 billion records which supports near instantaneous queries of complex information requests.
I am one of less than 150 certified Microsoft SQL Server masters in the world and one of less than probably 15 or 20 that also holds a top-tier ISC2 CISSP certification. My recent PhD is in the area of automated learning whereby problems can be modeled, simulated, and used to learn heuristics for solving the problems. I have over 30 years of experience in application development. My experience includes 10 years working with classified system. My network of resources includes the top persons and companies in the world with expertise in machine learning related to data security as well as all aspects of data security including at the network level.
Do you want to do something proactive to stop data theft and have truly end-to-end security implemented to prevent inside-out theft or wait until after a breach occurs? Do you have a way to detect that a breach has even happened if the person uses trusted credentials to carry it out? Most data theft is carried out by an unauthorized person using authorized credentials and misusing them. Do your systems really detect this situation? This can only be done by implementing controls at the database and application levels.
Contact us at firstname.lastname@example.org if you really want to practice due diligence to prevent and stop data theft. Give us an opportunity to help you before it is too late. We host a large secure co-located environment that can provide a sandbox area where we can stage your entire IT structure as virtual machines. Through the use of over 20 fusion-IO high speed SSD drives, we can provision virtual machines in seconds. We have an automated data obfuscation tool that includes verification that will allow you to create a realistic testing environment without risk of theft of meaningful data. Using our sandbox also helps evaluate your level of data preparedness and disaster recovery ability.
Why wait until after a breach is out to take action? Does your company really want the liability of not only having it’s data stolen, but also now meeting regulatory requirements such as those mandated by HIPAA.
In my pursuit of understanding asset markets, I’ve maintained a SQL Server database with a lot of information about the stock market and other indexes. Using some data services along with SSIS, this database has been kept current to the point that it now has over 3 billion records in total including 2.3 billion records in an intraday table. Most of my queries and interest concerns cumulative equity and index end-of-day history which is only 75 million row and the options data since 2003 which is now up to 175 million rows.
To be able to query this level of data, I utilize Fusion-io PCIE SSD storage for the SQL Server database. Using self-joins can produce some very interesting analysis. For example, the below query outlines the performance of a few global indexes where there have been large bounces close to market tops and the ensuing performance afterwards. This query complete in just a few seconds. There are couple of tricks that make this run faster – one is the storing of a relative day number to avoid performance issues with working around weekends for querying prior dated history. The day number is sequential across holidays and weekends so a direct link can be done without a range test. The other trick is that the table is partitioned based on the date which allows a good deal of parallelism.
Here is the query
select h.TradingSymbol, h.MarketDate, h.ChgPct + hprev.ChgPct as TwoDayChg, hprev.PriceAtClose as Price,
(h20.PriceAtClose – h.PriceAtClose) / h.PriceAtClose as Chg1Month,
(h60.PriceAtClose – h.PriceAtClose) / h.PriceAtClose as Chg3Month,
(h180.PriceAtClose – h.PriceAtClose) / h.PriceAtClose as Chg9Month,
(h360.PriceAtClose – h.PriceAtClose) / h.PriceAtClose as Chg18Month,
(h540.PriceAtClose – h.PriceAtClose) / h.PriceAtClose as Chng27Month,
((select MAX(hpast.PriceAtClose) from dbo.EquityHistory hpast
where hpast.TradingSymbol = h.TradingSymbol
and hpast.MarketDate < hprev.MarketDate) – hprev.PriceAtClose) / hprev.PriceAtClose
from dbo.EquityHistory h
inner join dbo.EquityHistory hprev
on hprev.DayNumber = h.DayNumber – 1
and hprev.TradingSymbol = h.TradingSymbol
inner join dbo.EquityHistory h20
on h20.TradingSymbol = hprev.TradingSymbol
and h20.DayNumber = hprev.DayNumber + 20
inner join dbo.EquityHistory h60
on h60.TradingSymbol = hprev.TradingSymbol
and h60.DayNumber = hprev.DayNumber + 60
inner join dbo.EquityHistory h180
on h180.TradingSymbol = hprev.TradingSymbol
and h180.DayNumber = hprev.DayNumber + 180
inner join dbo.EquityHistory h360
on h360.TradingSymbol = hprev.TradingSymbol
and h360.DayNumber = hprev.DayNumber + 360
inner join dbo.EquityHistory h540
on h540.TradingSymbol = hprev.TradingSymbol
and h540.DayNumber = hprev.DayNumber + 540
where h.TradingSymbol in (‘^dji’,’^dax’,’^ixic’,’^n225′,’^ftse’,’^djt’,’^gspc’,’^rut’,’^ssec’)
and h.ChgPct + hprev.ChgPct > 6.0
and (select MAX(hpast.PriceAtClose) from dbo.EquityHistory hpast
where hpast.TradingSymbol = h.TradingSymbol
and hpast.MarketDate < hprev.MarketDate) between 1.0 * h.PriceAtClose and 1.15 * h.PriceAtClose
order by h.MarketDate
And here are the results formatted in Excel:
This is not a good omen for the stock market for the next couple of years based on history.
There is probably not enough data here to draw conclusions, but since the query is against indexes rather than individual stocks, it does seem pretty convincing. It is pretty certain that a query against the entire equity history for this would yield similar bottom line averages, but that would take several minutes to complete.
Here is the query again as it shows with the SQL markup:
We’ve all seen the news about the latest data security breaches. While bureaucrats blame these on sophisticated hacks from China, the reality is that these are mainly due to negligence and are so simple that a child with basic computer knowledge could pull off many of these. http://www.darkreading.com/attacks-breaches/the-eight-most-common-causes-of-data-breaches/d/d-id/1139795?
The problem is that technology has focused on encryption and firewalls while neglecting security at the basic data and application tiers and has very little concept of proactive monitoring of actual user behaviors. Most data theft occurs due to compromised employees or stolen credentials wherein the perpetrator appears to the system as a trusted user and is not monitored. Our company holds credentials that include a PhD for automated learning, the highest-level ISC security certification, CISSP, SQL Server master, and certified .NET application developers. We are uniquely qualified to resolve the use cases that lead to security breaches at the application and data levels. We have produced a video that outlines how millions of PII data records can be stolen without trace in less than 5 minutes that will work at most companies using an ordinary user account. We are focused on resolution of the actual uses cases that lead to data theft rather than on elaborate technologies that are difficult to configure and mostly ineffective. Contact us and we can perform an audit as well as provide remediation including deployment of automated scripts and tools.
I’ve spent quite a bit of time the last few days trying to get Power BI preview to work for accessing a tabular model so am sharing my experience. Based on that experience, I am making some assertions that are not even in the Microsoft documentation, albeit most of this is found on various other blogs of users who have tried to get this working.
Below is a screen snapshot of a demo that my adult son, Blake built in his spare time. My thanks to Blake for his work to do the tabular model and report.
My conclusion is that the Power BI configuration is extremely complex, difficult to troubleshoot and requires Active Directory infrastructure configuration. However, once it is working it is seamless to develop reports from a single Power BI site that access tabular models exposed by other organizations without the need to store the data locally.
Here are the steps that I had to do – your mileage may vary and I’m happy if you didn’t have to go through all of this:
1) Power BI can only access a tabular model on a machine joined to a domain.
2) The service account registered with Power BI must be defined in Active Directory with a full UPN exactly matching the Power BI logon. For example, if the logon account is email@example.com, then there must be an account in Active Directory registered with firstname.lastname@example.org This is because Power BI sends all requests for tabular data using the “effectiveusername” technique (see profiler trace at end of this post). This requires the following steps:
a. Use Active Directory Domains and Trusts tool and select properties at the root level to get the UPN Suffixes dialogue as shown below and add the UPN suffix that matches to the Power BI domain name (i.e. company.onmicrosoft.com)
Note that multiple UPNs may be added, but only one can be registered to a specific A/D account.
b. Use Active Directory Users and Computers to identify or create users is mapped to the UPN suffix. Use caution when creating Power BI accounts that are the same as those of the company domain accounts unless the Power BI domain is the same as the standard UPN suffix. Best practice is to register for Power BI using the company’s Email domain rather than relying on the company-name.onmicrosoft.com. Below is an example of using a different UPN than the corporate domain:
The only article I’ve found on this is https://www.artisconsulting.com/blogs/greggalloway/Lists/Posts/Post.aspx?ID=27
Power BI uses a semantic model connection technique with effectiveusername – see https://msdn.microsoft.com/en-us/library/gg471575.aspx for an explanation.
c. This needs to be done for all the Power BI user accounts that consume the model not just the one under which the Analysis Services Connector is configured.
3) Depending on the firewall restrictions, port 8051 (HTTP transport) or 8050 (HTTPS) may need be opened. The documentation states that this is only needed for OData, but I have found it needed at least in my environment even for tabular only. (http://blogs.technet.com/b/powerbisupport/archive/2014/12/19/failed-to-register-the-gateway-key.aspx) For Azure use the endpoint functionality of either 8051 for HTTP and 8050 for HTTPS so that the Analysis Services Connector will work. If using 8051 all data will be transmitted in plain text. However, the credentials themselves appear to be tokenized between the connector and the Power BI site such that username/password theft does not appear to be a risk. I’ve not yet done extensive network sniffing to evaluate the risks with not using SSL.
Only after the above steps are done, will a connector be able to be defined that will actually allow the tabular data to be piped back to Power BI.
4) Configure the Analysis Services Connector using instructions from http://support.powerbi.com/knowledgebase/articles/471577-configure-a-power-bi-analysis-services-connector. When configuring the connector, you are asked to login to Power BI again. Use the Power BI account configured in step 2 that has the required UPN suffix.
5) At this point, Power BI should be able to be connected directly to the tabular model and Excel reports that use the tabular model can be uploaded and should function.
Note the profiler trace of SSAS tabular connection via the gateway – the gateway was actually configured using email@example.com, yet the connection is made on behalf of the specific Power BI user. This means that all users must be granted read access using their full UPN mapping to the Power BI account.
…<PropertyList xmlns=”urn:schemas-microsoft-com:xml-analysis” xmlns:soap=”http://schemas.xmlsoap.org/soap/envelope/”><Catalog>DemoSales_Tabular</Catalog><LocaleIdentifier>1033</LocaleIdentifier>
Americans are increasingly becoming aware of the vulnerability of their private data with breaches now up in the millions that have not only disclosed credit card information, but even more potentially dangerous information such as social security numbers linked to birthdates, names, and addresses. Your name and address along with a birth date and social security number is a virtual ticket to complete identity theft creating risks of massive financial destruction. There are multiple pieces of data that are private to an individual, but the social security number (SSN) is the one that is the most critical and is the focus of this post. The same principles apply to securing SSN as any other private data.
How did we get to this point? Is this just the inevitable consequence of Internet access, distributed systems, and evolving technology? Or is much of this due to irresponsibility of those that hold the most confidential aspects of your data? In this post, I make the argument that it is due to carelessness of organizations that hold such data and that there are ways to ensure safeguarding of data protecting not only against external threats, but against the more powerful threats from within an organization.
This post first articulates the drivers behind the situation which can be divided into two main areas: wrong assumptions and lack of diligence. Next, we identify the architectural principles of the software and infrastructure that mitigate the risks for security breach and contrast that with what most organizations believe is adequate. Finally, we provide practical strategies to remediate the situation.
The most common wrong assumption is that security threats are primarily outside of an organization rather than from within. Studies have shown this is not the case. Even in the case of external attack, there is often an insider involved such as distraught employee who gives away or even sells his username and password that will provide access to confidential data. Organizations spend large amounts of data on firewalls, external network monitoring, but how much do they invest to protect the confidentiality of the data within their own network?
Many organizations run on the assumption that their employee usernames and passwords could never be compromised and do not consider the impact of an external user using a username/password to gain access into the network. Many organizations allow the simple use of a username and password to access a VPN without use of any other physical validation such as a smart card or token device. But, even those that have restrictions to minimize the risk of an external user utilizing corporate credentials to enter the network are often careless about private data including the SSN and birthdate and have overly lose restrictions on access with often no auditing. A disgruntled user who turns dark has the potential to create just as much damage as any hacker from North Korea.
I find it interesting that virtually every break-in is blamed on sophisticated attacks from foreign entities or terrorists. However, the reality is that most of the break-ins that occur require very little sophistication and are probably mostly home-grown even if foreign entities are involved.
I have had the opportunity to work on database, architecture and application development projects over the years. Many of those customers stored SSN unencrypted and lacked effective safeguards to audit access to that data. If I was a criminal, I could have retired a long time ago. 10 million SSNs along with birthdates, and complete private data including with identifying information can be dumped from a database into a .CSV file in a manner of seconds. Such a file can be zipped up and encrypted, renamed to a .txt and small enough to be emailed as an attachment using a web mail client with virtually no detection. That is how easy it is to completely confiscate social security numbers within many organizations. At $25.00 per complete private identity record (street value) for 10 million records, that works out to a $250 million payoff.
It amazes me that our government which supposedly is so concerned about protecting the welfare of it’s citizens has created virtually no regulatory standards for storage and retrieval of private data. Yes, there are requirements for secure external interfaces such as SSL, but that is only one piece of a secure infrastructure. I speak from experience in working with state and federal entities and have found that the vast majority of the databases that have SSNs do not even store these encrypted. Worse, the SSN is stored right alongside all of the information needed to identify the person linked to the number, including complete name, address, and many times even their birthdates. The complacency in regard to social security numbers is startling given that one cannot change this and once compromised means a person is at risk of identity theft for the rest of their life. At least when a password is compromised, it can be changed. There is no excuse for any organization to store social security numbers unencrypted in a database. This is a worse violation of security than storing a password in plain text.
The fact that the government has not imposed regulations that require companies to store private data encrypted is no excuse for organizations. The necessary technology exists to not only encrypt private data, but to ensure that it remains encrypted through the lifecycle of it’s access and use. Even in organizations that have made attempts to secure this data, huge gaps are usually left. Consider the following ineffective strategies that often employed:
1) Data at Rest Security: This is the concept that data as it exists on any file system should be encrypted so that even if a database is stolen, it cannot be moved to another system so that the data can be used. Microsoft SQL Server Transparent Data Encryption (TDE) is an example of this approach. While this is useful to protect against physical data theft, it is useless for protection from internal threats unless a framework of auditing and security has been placed around access to the system containing the data. Secure data at rest systems still reveal private data to those that are authorized through the interfaces that access the data.
2) Reversible encryption: The problem with this is in the name – it is reversible. SQL Server again provides the capability to encrypt particular data columns such as a SSN. I have seen this utilized, but the fact that the encryption is reversible implies that it will be decrypted at some point downstream. In systems that use this, often the private key information is included in the views or stored procedures so that the data is only secure in the case of direct table access. Even in the case where the decryption is not accomplished until further down in the application layer, it is still decrypted and available to anybody who has the key. Worse, the security of the data is totally dependent on some supposedly private key that has to be defined somewhere in code. While reversible encryption provides some protection, it is only as good as the secret of the encryption key and anybody with access to that information has the keys to go in and exploit the data.
3) Partial obfuscation. Even the last 4 digits commonly used to validate along with other PI data should not need to be stored in plain text or with reversible encryption. A partial number such as the last 4 is still a ticket to identity theft when combined with other data. Any portion of the SSN that is involved with identification must be non-reversibly encrypted to minimize identity theft risk.
Earlier, I suggested that the only real solution to protecting private data such as social security numbers is to treat them with the same respect as a password. This implies that they utilize non-reversible encryption. With non-reversible encryption, there is no way to decrypt the value. The only way to determine a value that is based on non-reversible encryption is to actually know the value. When one knows the value, the value can be encrypted and then it can be matched against the stored encrypted key. But, is this practical? There are many arguments used for why social security numbers cannot be encrypted. Below is a list of them and why none of them are valid excuses:
1) The SSN is used to link data together. This is easily resolved through the use of surrogate keys.
2) Users need to search on a SSN or partial SSN to verify the identity – i.e. A credit agency will need the user’s security number as one piece of data to verify the person is authorized for the information requested. This is not a problem for a non-reversible encryption scheme. The full SSN as well as last 4 digits is stored in the database using a non-reversible encryption scheme. The user can enters the last 4 digits in a web form on the client over a secure connection. The web client then invokes the same encryption scheme to pass this along with other identifying information to find a match. Note that with non-reversible encryption whether it is a MD5 hash or some other technique, the encryption scheme does not even need to be kept private. Since there is no way to reverse the encrypted values, knowing the scheme does not circumvent the security.
The bottom line is that even organizations who need to utilize SSNs for validation purpose have no excuse to store the data in plain text or even with a reversible encryption scheme. At no point, should such data be unencrypted. Instead such data should be permanently one-way encrypted (non-reversible). Validation using the SSN should always be done with partial strings – the last 4, rather than the whole number. The full SSN does not need to be used since the last four digits can be encrypted separately. In this approach, presentation of the last part of the SSN is immediately encrypted to lookup a matching record in the database using just the last 4 digits encrypted along with some other personal identification, rather than the plain text value.
What is an organization to do that finds itself in the position of having SSNs in plain text or with a reversible encryption scheme, especially when it is linked to other private data and there are applications dependent on these structures? The first step is to start to remediate the situation immediately. This needs to be a top priority for any business, even given the government’s failures to make it a priority. Consider the risks and costs and litigation not to mention the tarnished reputation when a breach occurs. And it will occur or worse it maybe already has occurred and the business isn’t even aware of it. Few organizations actually audit external activities at the level of detail needed to even know if private data has been breached. They only find out usually after the data is already been utilized in a fraudulent fashion.
It may take months or years to remediate all the systems within an organization in a way that still allows the organization to do the business. In keeping with the principle of due diligence, what can an organization do until they reach the ideal state in securing data? The key is to go after the low hanging fruit to put safeguards in place even while mitigating the core issue. One way to do this is through implementation of a gatekeeper mentality. If you have to send an important package, you request tracking from the mail service. In the same way any confidential data must be tracked from end to end. That is the first step toward protection. It is not adequate, but it is a start.
Below are 5 steps to implementing a gatekeeper approach:
1) Ensure that all access to private data is captured. This includes the source from which the data was retrieved, the account used to access the data and everything that happens to the data afterwards. This means that all aspects of data movement must be tracked including not only the database access, but any user manipulation of the data after it has been retrieved. This means that both database auditing and network auditing must be implemented.
2) Define a repository for storing the complete lifecycle of data access corresponding to the actions captured from #1. For every piece of sensitive data, all access to the information must be accomplished. There are ways to implement this framework in SQL Server trough the use of custom profiler traces as well as extended events and logon triggers. I cannot speak to how to do this at the network level, but I know that it is possible.
3) Define a dictionary of rules that identify the type of accesses allowed to the data. As data movements are captured, exceptions will be encountered. Safeguards can be put in place to block movements that are not defined in the dictionary. A major part of this development is training such that the system learns from previously sanctioned activities so that it can recognize unauthorized accesses.
4) Stop movements that are not allowed by the dictionary. This may result in some pain as individuals with legitimate requests are blocked.
5) Put a learning framework in place with continuous feedback so that the systems can learn from mistakes and improve.
Databases: Any organization that is storing sensitive data non-encrypted or even with reversible encryption has been following a flawed data architecture. All databases and applications should be reviewed for compliance with principles of non-reversible encryption as well as complete tracking of all such data. Systems that are in violation need to be prioritized in terms of business impact due to downtime with changing them over to a secure approach.
Interfaces: In addition to databases, interfaces also need to be analyzed and modified to preserve security of SSN and other private data in flight. The service accounts associated with all processes must be linked to the interfaces using such techniques as IPSEC such that compromise of the password for the service account does not create a risk. It is not enough to simply have secure interfaces, the data within the secure interface must be obfuscated so that if the interface is compromised, there is little to gain from the data being transmitted.
User interfaces: User applications that simply retrieve data including unencrypted SSN must be modified to work with non-reversible encrypted SSN and components need to be added to allow validating a SSN by re-encrypting the portion of it needed to check for a match rather than by matching in the database.
This post just scratches the surface of the data security issue. I hope to write more in the upcoming months. It takes a focused team of database experts, network experts, application developers, architects, and user-interface experts working together to remediate this issue. The time is now to insist that organizations become accountable for protecting their customer data and this includes government agencies. Partial obfuscation with just the last 4 digits unencrypted is almost as bad as having the whole string unencrypted since this provides the means to impersonation to acquire the full SSN.
As a citizen, I encourage you to lobby your congressman for regulations that require protection of private data. Developers, I encourage you to not even consider transmitting any piece of private PI without using non-reversible encryption schemes. CEOs, I encourage you to place private data security at the top of the risk and treat this with the urgency of the Y2K crisis. DBAs, I encourage you to educate your organization and insist on not storing private data without non-reversible encryption. Architects, do not even consider a design that does not include complete auditing of all sensitive data flows or allows for transmission of SSN in plain text, even within the context of a SSL interface.
More testing today with larger database, multiplied the number of rows by 10x up to 1 billion rows in the test dataset. The latest results indicate I/O about 3 times slower on an Azure D13 server using similar CPU/Ram specs from a dedicated VM on a Gen-1 raid Fusion-io duo drive. Also, the time to insert the test record set was about 2.5 times slower. This aligns with the performance metrics outlined by http://azure.microsoft.com/blog/2014/10/06/d-series-performance-expectations/. The D series scales upward in terms of SSD capability, so the only way to achieve the I/O of the VM would be to upgrade to a D14 configuration with 800 GB SSD. Along with that comes more cores and memory (16 cores and 112 GB RAM) but that should not significantly affect the I/O comparison.
Here are the bottom line backup logs:
From the dedicated VM (48GB RAM, 6 cores @ 2.66, Dell-branded Fusion-io 640 GB MLC card:
Processed 1931504 pages for database ‘TestDb’, file ‘TestDb’ on file 1.
100 percent processed.
Processed 9 pages for database ‘TestDb’, file ‘TestDb_log’ on file 1.
BACKUP DATABASE successfully processed 1931513 pages in 26.102 seconds (578.114 MB/sec).
From the Azure D13 server (56 GB RAM, 8 Cores @ 2.20 GHz, 400 GB SSD)
Processed 1931648 pages for database ‘TestDb’, file ‘TestDb’ on file 1.
100 percent processed.
Processed 9 pages for database ‘TestDb’, file ‘TestDb_log’ on file 1.
BACKUP DATABASE successfully processed 1931657 pages in 81.958 seconds (184.131 MB/sec).
I’m going to come up with some more tests to perform such as how long it takes to provision a VM from scratch, how long to provision from an image, boot up times, and other metrics that are significant for somebody who wants to be able to quickly provision VMs. I’ll compare these with the times from the colocation environment. This will be posted in a new article.
Here is the configuration I am running on Azure, running the server hasn’t been too expensive yet, used about 5$.00 in last 24 hours, but only put a load on for a few minutes.
These are the main findings to this point:
1) For all but a basic/small SQL database in the Azure environment, customers will probably want to utilize the ‘D’ type configuration with the SSD option as this improves performance by a factor of at least 5.
2) Even with the SSD option, the I/O is still 2 – 3 times slower than simply storing a VM or at least the drive with the SQL data on a gen-1 fusion-IO card.
As these offerings mature to allow SSD storage to be enabled for fault-tolerance with larger sizes and higher performance, the standard D offerings will become more appealing for larger customers. It would be helpful if more robust I/O capability could be had without the expense for additional cores as databases tend to get bottlenecked by I/O well below they hit the CPU bottlenecks.
Azure is a great environment to create virtual machines quickly and test out things. But the ‘A’ series is hard to justify as a database solution even for a virtual machine provisioned with 8 cores and 56 GB RAM with a cost of over $1,000 to run over a month. I also have a virtual machine configured on one of my servers at a colocation running SQL Server utilizing PCIE SSD with Fusion-IO storage. The ‘ VM is configured with a similar number of cores and 64 GB of RAM. I created two identical databases with a two tables of 10 million rows. After building out the data, I tested the SQL backup using the data compression option on both.
Here are the results for the ‘A’ type Azure VM side by side to the results of my VM on PCIE SSD. My performance beat the Azure Enterprise SQL VM with comparable cores and memory by 16x (685 MB/s vs. 39 MB/s and 4 seconds versus 64 seconds). SQL is all about the I/O.