Clearing a SQL Server Database
UPDATE: Be sure to check out the follow-up article, Clearing a SQL Server Database, Take 2.
I posted recently in Project Greenfield: Testing the TDD Waters about a small conversion project I’ve been using as a test bed for some of the techniques and technologies I’ll be using in Project Greenfield. I know SQL and am very comfortable in DB2 on the IBM System i, but this is the most extensive work I have done with SQL Server in recent memory. I have really appreciated the ease of database management provided by the tools integrated into Visual Studio. I especially appreciate it since I cannot seem to get SQL Server Management Studio installed on my development machine, but I won’t go into that right now.
The Database Schema
As background for the rest of this article, here is the schema we’ll be discussing. The diagram itself is too hairy to post, but the simplified version below should suffice. In this case, “->” means “Parent Of” and indicates a 1 to many relationship.
RealEstateMaster
-> CardDetail
-> CardImprovementDetail
-> TransferHistory
-> LandDetail
This collection of tables and relationships is exposed by the Entity Framework as a RealEstateMaster Entity. In the database, these tables also hold numeric Codes for various elements, each with a corresponding table, a “look up” table for normalization. There are well over a dozen of these, so I’ll not list them all, but they all function like so:
CodeTable
-> CardDetail
-> RealEstateMaster
-> LandDetail
From an Entity standpoint, these are not child and parent relationships, but from a database standpoint they do enforce the same type of Foreign Key constraints. In other words, each code in the CardDetail table must exist in it’s corresponding CodeTable.
Starting Fresh
I have several scenarios where the conversion process requires a “fresh start”, in other words a clean database with no data in the tables. This means that on demand I need to be able to wipe out all the data from all the tables in the database. This seemingly simple task turned out to take a lot more effort to figure out than I originally anticipated.
Using Entity Framework
At first, I assumed (wrongly) that since I was using Entity Framework for all of my database access that there would be a way to do this built in to the Entity Context. I made the rookie mistake of equating my EF classes to direct access to the database and all it’s functionality. I also made the rookie mistake of equating the EF classes to database tables: this one to one mapping is in no way necessary, so in hindsight I understand why there is no “TableName.Clear()” kind of an option.
I believe this problem can be solved using the EF classes but it would be very cumbersome. As I see it, it would require you to loop through the entire collection of RealEstateMaster entities and delete each one. That delete operation should loop through it’s children and delete those records as well. Afterwards, you could then do the same to each code table, which at that point should have no constraining records.
NOTE: The statements above are theoretical: I did not try this because it seemed like way too much work and not really a proper application of EF. I chose EF because it provides an easier way to work with SQL Server, but when EF actually gets in the way, it tells me I should find a different solution.
Back to SQL Server
Having explored my EF options, I decided the best thing to do was create a Stored Procedure in SQL Server to perform this task. Having never written a Stored Procedure in SQL Server, I wasn’t sure exactly what I was getting into, so I reached out for help to a known SQL Server guru: Andy Leonard. One of the great things about being involved with the community is knowing people who know things!
Andy graciously tolerated my newbie questions and with his guidance via an email exchange he led me to the solution I finally implemented. With his permission, I’m going to share a little of our exchange. I’m going to leave it unedited, mostly because I love Andy’s way of putting things, but also so you can get the same undiluted experience I did.
ME: having explained the situation and schema above …
So I guess I have several questions:
1) How do the relationships affect the deletes? Does the order matter?
2) Is there a way to do a "cascading" delete that will loop through the relationships and delete the related table rows automatically?
3) Am I making this harder than it needs to be? Is there a better way?
ANDY:
1. Usually 1-many indicates parent-child. The parent is usually on the "one" side of this relationship; child is usually on the "many" side. Order matters. You want to remove the child(ren) first. If you want to be extra cool about it, remove the child(ren) and then the parent in a transaction. That way, if something "bad" happens (like crossing the streams </GhostBusters>) during the parent delete, the child delete can rollback. Transactions are built for this. You are set up here for something "bad" to happen – you have multiple children for a given parent. If you miss one and get all the rest – and there’s data in that parent-child relationship you miss – your best outcome is a failure with rollback. Everything will return to its pre-transaction state. Without a transaction, you risk introducing data pollution (half the data for a given entity is missing).
2. There is a way to set up cascading referential integrity. It’s rare in practice and has to be in place before you begin your delete statements.
3. This is rocket surgery. You are not adding complexity, the complexity was here when you arrived.
My solution would be something like:
begin tran
delete sh
from SalesHistory sh
inner join MasterRecord mr on mr.ID = sh.MasterRecordID
where mr.ID in (…<list of MasterRecord table IDs>…)
<cut more examples of the same approach>
— commit tran
— rollback tran
Notice I worked from the bottom of the list to the top – that’s intentional. Most people think of entity construction "top down." Deletes need to work in the opposite order.
If everything appears to work, execute the commit statement. If not, you can execute the rollback and put everything back just like it was before you started. As a precaution, always execute the commit or rollback at least twice – you want to make sure you close all the transactions you opened. And it’s easy to start a new one accidentally – and it becomes a nested transaction when you do (if you close the SSMS window and leave an open transaction, the tables involved are locked. That’s "bad"…). You want to highlight the commit or rollback in SSMS and keep clicking Execute until you get an error indicating there are no open transactions to end. It’s a best practice.
ME:
My first quest
ion would be why this:delete sh
from SalesHistory sh
inner join MasterRecord mr on mr.ID = sh.MasterRecordID
where mr.ID in (…<list of MasterRecord table IDs>…)
instead of this:
delete sh
from SalesHistory sh
Here is why I ask:
1) The purpose here is really just to clear out all the tables, completely disregarding the current data. A total purge, if you will.
2) Using the first statement leaves open the possibility of orphaned data – or does it? If the relationships are defined, what happens when there are SalesHistory rows with no associated MasterRecord row?
3) It seems like additional complexity: won’t the joins be a performance hog?
ANDY:
If you’re just after clearing all the tables, a simple DELETE statement – starting with the children – will work. There is a popular myth that JOINs slow down performance. It’s akin to saying a farm of web servers slow down performance because there’s all that time lost deciding which server to send the request and then managing the distributed session management.
The truth is Joins can improve performance as much as hurt it. They’re a tool. Proper indexing and server management are the keys to performance.
That said, you can use Truncate Table to clear them. That does a couple things:
1. Wipes out the data.
2. Is not logged (so it flies).
3. Resets identity columns to the initial seed value (usually 1).
4. Requires the ddl_admin role for permission.
That’s a nice middle ground between dropping/recreating and deleting.
Order Matters
Andy’s first response talked about the best practice for doing an operation of this nature, which I rejected only because I just wanted a total purge of the data: if I was doing something more production oriented I would have taken the approach Andy suggested.
So the idea of just issuing a bunch of DELETE commands over all the tables does what I need. The first lesson here, though, is that Order Matters. Because of the relationships I created between the tables I could not simply issue a DELETE against the parent tables until there were no longer any children constraining them.
Recall the relationships listed above:
RealEstateMaster
-> CardDetail
-> CardImprovementDetail
-> TransferHistory
-> LandDetail
I had to start at the deepest point in the heirarchy and work my way up, so the final order looks like this:
- DELETE CardImprovementDetail
- DELETE CardDetail
- DELETE TransferHistory
- DELETE LandDetail
- DELETE RealEstateMaster
- DELETE all the Code tables mentioned above
Using TRUNCATE instead of DELETE
From Andy’s last email, I decided that TRUNCATE might be a better option. I had never heard of TRUNCATE before, but using it is very simple: ex. “TRUNCATE CardDetail”.
Unfortunately, when I changed all my DELETEs to TRUNCATEs, I discovered a little foible. Apparently, TRUNCATE will not work when a Foreign Key Relationship is defined, even when there is no data affected. So in other words, I can issue TRUNCATE CardImprovementDetail, because it does not define a FOREIGN KEY relationship to any other table. I can NOT, however, issue TRUNCATE CardDetail, because it defines a FOREIGN KEY relationship to CardImprovementDetail (as well as all of it’s corresponding Code tables). This held trus even after CardImprovementDetail had been truncated itself and held no data.
The Final Solution
So the final solution was to use TRUNCATE when possible, and DELETE when necessary. I wrapped all of these up in a Stored Procedure and now when I need to clear the entire database I can simply execute.
Remembering that Order Matters, the final procedure execution looks like this:
ALTER PROCEDURE dbo.ClearDatabase
AS
BEGIN
TRUNCATE TABLE CardImprovementDetail;
DELETE FROM CardDetail;
TRUNCATE TABLE Land;
TRUNCATE TABLE TransferHistory;
DELETE FROM RealEstateMaster;
DELETE FROM Carport;
DELETE FROM Condition;
DELETE FROM Easement;
DELETE FROM ExteriorWall;
DELETE FROM Floor;
DELETE FROM Foundation;
DELETE FROM Garage;
DELETE FROM Heat;
DELETE FROM InteriorWall;
DELETE FROM MagisterialDistrict;
DELETE FROM Occupancy;
DELETE FROM RightOfWay;
DELETE FROM RoofMaterial;
DELETE FROM RoofType;
DELETE FROM Sewer;
DELETE FROM SiteCharacteristic;
DELETE FROM SiteTerrain;
DELETE FROM Water;
END
RETURN
The Nuclear Option
Before I close, I did want to mention the Nuclear Option: I could just drop the database and recreate it at runtime. I considered it briefly because while the Entity Context does not have a Clear Table kind of option, it does have CreateDatabase(). It also has CreateDatabaseScript(), which you can use to extract the schema into an executable SQL script. It seems to me that you could just nuke the database (probably with a Stored Procedure) and use some combination of these to recreate it.
I only considered it for a moment, because it seems heavy handed. On top of that, if something were to go wrong it could leave the application in an unusable state. It also assumes that the SQL Script generated by EF will match the standards required by the application or the client. I’m not saying the generated schema would not function, but there could be outside factors. I suppose you cold store the schema in a file locally and use it to recreate the database, outside of EF, but it just feels ill advised.
Back to Entity Framework
At the end of all this, what I was left with was a Stored Procedure defined in my SQL Server that will do the task required. Unfortunately, if I leave it as is it means I will need to use a tool like SQL Server Management Studio to execute the procedure manually. Since my users will occasionally need to do this themselves, I don’t think that is a viable option.
Instead, I need to be able to run the Stored Procedure programmatically, but doing so means using traditional ADO.NET methods. I would then turn around and create an Entity Context, and it feels silly to do both in the same program. To solve that problem, I added the Stored Procedure to my Entity Context. And that’s where we’ll pick up next time: using Stored Procedures in Entity Framework.