asp tutorials, asp.net tutorials, sample code, and Microsoft news from 15Seconds
Data Access  |   Troubleshooting  |   Security  |   Performance  |   ADSI  |   Upload  |   Email  |   Control Building  |   Component Building  |   Forms  |   XML  |   Web Services  |   ASP.NET  |   .NET Features  |   .NET 2.0  |   App Development  |   App Architecture  |   IIS  |   Wireless
 
Pioneering Active Server
 Power Search








Active News
15 Seconds Weekly Newsletter
• Complete Coverage
• Site Updates
• Upcoming Features

More Free Newsletters
Reference
News
Articles
Archive
Writers
Code Samples
Components
Tools
FAQ
Feedback
Books
Links
DL Archives
Community
Messageboard
List Servers
Mailing List
WebHosts
Consultants
Tech Jobs
15 Seconds
Home
Site Map
Press
Legal
Privacy Policy
internet.commerce














internet.com
IT
Developer
Internet News
Small Business
Personal Technology
International

Search internet.com
Advertise
Corporate Info
Newsletters
Tech Jobs
E-mail Offers

HardwareCentral
Compare products, prices, and stores at Hardware Central!

Advanced SQL Techniques - Part 1: Analyzing Duplicate Records
By David Penton
Rating: 4.3 out of 5
Rate this article


  • email this article to a colleague
  • suggest an article

    Introduction

    This article offers an analysis of Structured Query Language (SQL) and presents techniques that can be used in building SQL statements. People tend to think of SQL as a programming language, but in reality, SQL is a Set language. One could say SQL is Set Theory or Relational Algebra; these terms are interchangeable. Some people think of this in terms of Venn diagrams or DeMorgan's Law. Although those can be quite complicated, the basic forms are not difficult to understand. The cases presented here are intended for certain circumstances, but they can be applied to many situations. Each article in this four-part series will address one case.

    Deleting Duplicate Records

    The following is an analysis of records in a table, some of which may be duplicated. Many different things can be accomplished with this method, including looking for duplicates, looking for records that are duplicated n times, deleting duplicates, etc.

    People develop many ways to delete duplicate records from a table. You can make a new table and select the unique records into that table. You can concatenate fields together, get a distinct representation of those records, and place those in another table. But these methods are normally used when people think they do not have any other way to accomplish this.

    Let's suppose we have a table structure with a multiple field Primary Key. The key is from an old ordering system we imported into a table that includes an IDENTITY field:

    
    CREATE TABLE SHIPMENT_ORDERS (
        [id] [int] IDENTITY (1, 1) NOT NULL
        , [shipment] [varchar] (20) NOT NULL
        , [purchase_order] [varchar] (20) NOT NULL
        , [item_sku] [char] (20) NOT NULL
        , [units_ordered] [int] NOT NULL
        , [units_shipped] [int] NOT NULL
        , [shipment_date] [datetime] NOT NULL
        , [order_date] [datetime] NOT NULL
        , [last_name] [varchar] (30) NOT NULL
        , [first_name] [varchar] (30) NOT NULL
        , [full_address] [varchar] (30) NOT NULL
        , [email] [varchar] (30) NOT NULL
    )
    
    
    For sake of argument, there is no Primary Key on this table. Let's assume that some duplicate data was introduced into it. For our discussion, we will assume a target Primary Key to be [shipment], [purchase_order], [item_sku]. We want to get some statistical information from the data first so our end goal will be to remove the duplicate records. I will now provide the base query for all of our operations:
    
    /* 1.  Query heading */
    SELECT
      *  /* Remember to never use SELECT * and write out the field names */
    FROM
      /* 2.  Table with duplicates */
      SHIPMENT_ORDERS
    WHERE
      EXISTS  (
        SELECT
          NULL
        FROM
          /* 3.  Table with duplicates, with an alias */
          SHIPMENT_ORDERS b
        WHERE
          /* 4.  Join each field with *itself*.  These are fields that could be Primary Keys */
          b.[shipment] = SHIPMENT_ORDERS.[shipment]
          AND b.[purchase_order] = SHIPMENT_ORDERS.[purchase_order]
          AND b.[item_sku] = SHIPMENT_ORDERS.[item_sku]
        GROUP BY
                /* 5.  I must GROUP BY these fields because of the HAVING
             clause and because these are the possible PK */
          b.[shipment], b.[purchase_order], b.[item_sku]
        HAVING
          /* 6.  This is the determining factor.  We can control our
             output from here.  In this case, we want to pick records
             where the ID is less than the MAX ID */
          SHIPMENT_ORDERS.[id] < MAX(b.[id])
      )
    
    
    1. The kind of query we want will be either SELECT * or DELETE FROM. We will view our results first, then change this to delete the offending records if necessary.
    2. This is the table we are performing the above action on.
    3. This is the same table, duplicated. We are self-referencing our target table.
    4. We must compare each field to its copy. This is 1,000 times better than concatenation, since we do not need to worry about the data lining up. This can be used when you have two (or more) tables you need to compare that have similar data. The way most people do this is concatenate all target fields together. That is the most intensive thing you can do when comparing a group of fields in a table.
    5. We need to GROUP BY the same fields because we want to ensure that our data is unique in our analysis.
    6. This ultimately determines what our condition for the resulting data will be.

    After running this query, we would see all of the duplicated records. For instance, if we had the following records in the table:

    
    id   shipment   purchase_order   item_sku
    3    435436     A123765          453987001201
    46   435436     A123765          453987001201
    354  435436     A123765          453987001201
    23   981123     C543219          843209132209
    613  981123     C543219          843209132209
    
    the result after running the above query would be:
    
    id   shipment   purchase_order   item_sku
    3    435436     A123765          453987001201
    46   435436     A123765          453987001201
    23   981123     C543219          843209132209
    
    This would be a report of the duplicated records that are present in the data. Now, if we look at section 6 in the query:
    
          SHIPMENT_ORDERS.[id] < MAX(b.[id])
    
    
    we can see that an alternate query could use MIN() instead of MAX(). Note that this would not affect the outcome, except it would pick the minimum of the field instead of the maximum. In the case of our table, the only field that is not duplicated is [id]. If the table we were dealing with had a field such as a datetime field, it could be used as a pseudo-IDENTITY field.

    Suppose we want to get a report of all records that have a single, duplicated record. This is easily obtainable by changing one part of this query (6):

    
          COUNT(b.[id]) = 2
    
    
    Based on the previous data, our output would be:
    
    id   shipment   purchase_order   item_sku
    23   981123     C543219          843209132209
    613  981123     C543219          843209132209
    
    
    Notice that we get both records.

    Remember the first query we ran? It showed all of the duplicated records in the table. If we changed the first two words (SELECT *) in the SQL sentence to DELETE, we accomplish our goal of removing all duplicates from the table.

    
    id   shipment   purchase_order   item_sku
    354  435436     A123765          453987001201
    613  981123     C543219          843209132209
    
    

    Conclusion

    For a recap, we did not:

    1. create a new table
    2. concatenate fields
    3. use the DISTINCT keyword
    4. alter the table

    and we did:

    1. use the original table to compare
    2. compare field to field without altering them
    3. rely on the existence of data, not generate new data
    4. use the IDENTITY field to ensure unique records

    This query template can be used for many different types of reports and operations on data. At first look, that query may seem rather complicated, but when it is dissected, the sections are not complex at all. The next article will discuss the uses/differences of UNION and UNION ALL on single-table and multiple-table instances.

    Remember, Set Theory is cool!

    About the Author

    David L. Penton, MCP, lives in Dallas and has a bachelor's degree in Mathematics from Northwestern State University. For the last two years, he has been the technical lead for the International Systems Group at JCPenney, Inc., in Dallas. Before that, he was a senior business and analysis analyst for Lockheed Martin Missile and Fire Control, also in Dallas. During that time he has worked on various technology, including but not limited to ASP, SQL, VB, C++, ColdFusion, and VB .NET/C# with the related web technologies (JavaScript, HTML, DHTML, XHTML, CSS1/2). Database platforms he has dealt with (development and admin) include but not limited to SQL Server 6.5/7/8, Oracle 7/8, DB2, MS Access (from version 2).

    David also participates in discussion lists hosted by 15Seconds.com, aspfriends.com, Swynk, www.dfwcfug.org, and lsoft.com.

    He can be reached by email at lemans70@home.com and dpenton@jcpenney.com.

  • Rate This Article
    Not HelpfulMost Helpful
    1 2 3 4 5
    Supporting Products/Tools
    Proposion N2N
    Proposion N2N connects Microsoft .NET applications to Lotus Notes and Lotus Domino databases. This ADO.NET managed data provider allows you to perform blindingly fast queries and updates of Notes data from ASP.NET pages, .NET web services, Windows, or Mobile applications. An innovative SQL-like query language leverages the unique features of Notes and makes collaborative software accessible to relational database programmers.
    [Top]
    Other Articles
    Sep 15, 2005 - Building an Image Keyword System
    Unlike text-based file formats image files aren't made up of words, which makes searching for an image file by keyword difficult. Instead of being able to simply open the file to see what it contains, we're stuck looking at the text around it and other metadata to determine the image's meaning. In this article, Ziran Sun shows you how to build a simple database-based image keyword system that allows you to associate keywords with images and use these keywords to make finding images easier.
    [Read This Article]  [Top]
    Apr 7, 2005 - A Step-by-Step Guide To Using MySQL with ASP.NET - Part 2
    In the second part of of his article on using MySQL with ASP.NET, Ziran Sun covers how to add a new MySQL user to the database server, assign the user the appropriate permissions, connect to the database, and build a simple ASP.NET page to perform a query.
    [Read This Article]  [Top]
    Feb 10, 2005 - A Step-by-Step Guide To Using MySQL with ASP.NET - Part 1
    Back in the days of classic ASP, if you were building a database-driven web site, your choice was either to invest a lot of money to get a copy of Microsoft SQL Server (or some other enterprise-ready database) or invest a lot of time finding a way to deal with the performance and scalability limitations of Microsoft Access. Luckily these days there's another viable alternative: MySQL.
    [Read This Article]  [Top]
    Jan 27, 2005 - Moving a Database from SQL Server 7.0 to SQL Server 2000
    Moving or copying a SQL Server database from one machine to another requires a lot of preparation in order to ensure a smooth transfer. In this article, Dina Fleet Berry examines the different methods and highlights the different issues associated with each of them.
    [Read This Article]  [Top]
    Jan 6, 2005 - Debugging a SQL Stored Procedure from inside SQL Server 2000 Query Analyzer
    There are many times when using SQL Server 2000 Query Analyzer to debug SQL statements is a better choice than debugging in Visual Studio .NET. In this article, Dina Fleet Berry explains why and walks you through the debugging process step-by step.
    [Read This Article]  [Top]
    Nov 24, 2004 - Persisting .NET Objects to SQL Server Using SQLXML and Serialization
    As a follow up to his article on retrieving objects from SQL Server using SQLXML and serialization, Gianluca Nuzzo discusses saving objects back to SQL Server using a schema definition file and updategrams.
    [Read This Article]  [Top]
    Sep 14, 2004 - Transaction Processing in ADO.NET 2.0
    One area that stands out when comparing ADO.NET 1.x to ADO.NET 2.0 is transaction processing. Bill Ryan shows just how easy transaction processing has become with the TransactionScope object in ADO.NET 2.0.
    [Read This Article]  [Top]
    Sep 8, 2004 - Custom Object Data Binding with .NET
    Developers often use brute force coding to marshal data between the GUI and application objects. In this article, Luther Stanton explains how to use .NET's out-of-the box data-binding functionality to make this job much easier.
    [Read This Article]  [Top]
    Sep 2, 2004 - Queue MSMQ Messages from SQL Server
    Learn how to create a console application to queue a message in Microsoft Message Queuing (MSMQ) and then use an extended stored procedure to call the console application from a SQL Server trigger.
    [Read This Article]  [Top]
    Aug 30, 2004 - Tuning Up ADO.NET Connection Pooling in ASP.NET Applications
    Connection pooling increases the performance of Web applications by reusing active database connections instead of creating a new connection with every request. This article shows how to monitor the connection pool, diagnose a potential problem, and apply the appropriate fix.
    [Read This Article]  [Top]
    Mailing List
    Want to receive email when the next article is published? Just Click Here to sign up.

    Support the Active Server Industry



    JupiterOnlineMedia

    internet.comearthweb.comDevx.commediabistro.comGraphics.com

    Search:

    Jupitermedia Corporation has two divisions: Jupiterimages and JupiterOnlineMedia

    Jupitermedia Corporate Info


    Legal Notices, Licensing, Reprints, & Permissions, Privacy Policy.

    Advertise | Newsletters | Tech Jobs | Shopping | E-mail Offers

    Solutions
    Whitepapers and eBooks
    Microsoft Article: Will Hyper-V Make VMware This Decade's Netscape?
    Microsoft Article: 7.0, Microsoft's Lucky Version?
    Microsoft Article: Hyper-V--The Killer Feature in Windows Server 2008
    Avaya Article: How to Feed Data into the Avaya Event Processor
    Microsoft Article: Install What You Need with Windows Server 2008
    HP eBook: Putting the Green into IT
    Whitepaper: HP Integrated Citrix XenServer for HP ProLiant Servers
    Intel Go Parallel Portal: Interview with C++ Guru Herb Sutter, Part 1
    Intel Go Parallel Portal: Interview with C++ Guru Herb Sutter, Part 2--The Future of Concurrency
    Avaya Article: Setting Up a SIP A/S Development Environment
    IBM Article: How Cool Is Your Data Center?
    Microsoft Article: Managing Virtual Machines with Microsoft System Center
    HP eBook: Storage Networking , Part 1
    Microsoft Article: Solving Data Center Complexity with Microsoft System Center Configuration Manager 2007
    MORE WHITEPAPERS, EBOOKS, AND ARTICLES
    Webcasts
    Intel Video: Are Multi-core Processors Here to Stay?
    On-Demand Webcast: Five Virtualization Trends to Watch
    HP Video: Page Cost Calculator
    Intel Video: APIs for Parallel Programming
    HP Webcast: Storage Is Changing Fast - Be Ready or Be Left Behind
    Microsoft Silverlight Video: Creating Fading Controls with Expression Design and Expression Blend 2
    MORE WEBCASTS, PODCASTS, AND VIDEOS
    Downloads and eKits
    Sun Download: Solaris 8 Migration Assistant
    Sybase Download: SQL Anywhere Developer Edition
    Red Gate Download: SQL Backup Pro and free DBA Best Practices eBook
    Red Gate Download: SQL Compare Pro 6
    Iron Speed Designer Application Generator
    MORE DOWNLOADS, EKITS, AND FREE TRIALS
    Tutorials and Demos
    How-to-Article: Preparing for Hyper-Threading Technology and Dual Core Technology
    eTouch PDF: Conquering the Tyranny of E-Mail and Word Processors
    IBM Article: Collaborating in the High-Performance Workplace
    HP Demo: StorageWorks EVA4400
    Intel Featured Algorhythm: Intel Threading Building Blocks--The Pipeline Class
    Microsoft How-to Article: Get Going with Silverlight and Windows Live
    MORE TUTORIALS, DEMOS AND STEP-BY-STEP GUIDES