This is the first post of many to come that talk about new enhancements in the Sql Server 2005 T-Sql language.
INTERSECT
Let say you have a products table. Iâm going to use the
products table from the AdventureWorks database. Now letâs say
you imported a list of products into the database into a table called
productduplicates. Just for a reason, letâs say you manufacturer
sent you a full list of products, and you need to see which products
you already have in your database.
Normally, if you didnât have matching primary keys on both tables,
like a productId, you would do an inner join on some columns that
uniquely identify the rows, as such:
INNER JOINselect a.[name], a.productnumber from production.product a inner join production.productduplicates b on a.[name] = b.[name] and a.productnumber = b.productnumber
Execution plan for inner joinHash
Match(Inner Join, HASH:([b].[Name], [b].[ProductNumber])=([a].[Name],
[a].[ProductNumber]),
RESIDUAL:([AdventureWorks].[Production].[Product].[Name] as
[a].[Name]=[AdventureWorks].[Production].[productduplicates].[Name] as
[b].[Name] AND [AdventureWorks].[Production].[Product].[ProductNumber]
as
[a].[ProductNumber]=[AdventureWorks].[Production].[productduplicates].[ProductNumber]
as [b].[ProductNumber])) |--Table
Scan(OBJECT:([AdventureWorks].[Production].[productduplicates] AS
[b]))
|--Clustered Index
Scan(OBJECT:([AdventureWorks].[Production].[Product].[PK_Product_ProductID]
AS [a]))
Now you can replace this with a simple INTERSECT statement like the following:
INTERSECTselect [name], productnumber from production.product intersect select [name], productnumber from production.productduplicates
Execution plan for INTERSECTHash Match(Right
Semi Join,
HASH:([AdventureWorks].[Production].[productduplicates].[Name],
[AdventureWorks].[Production].[productduplicates].[ProductNumber])=([AdventureWorks].[Production].[Product].[Name],
[AdventureWorks].[Production].[Product].[ProductNumber]),
RESIDUAL:([AdventureWorks].[Production].[Product].[Name]=[AdventureWorks].[Production].[productduplicates].[Name]
AND
[AdventureWorks].[Production].[Product].[ProductNumber]=[AdventureWorks].[Production].[productduplicates].[ProductNumber]))
|--Table Scan(OBJECT:([AdventureWorks].[Production].[productduplicates])) |--Clustered Index Scan(OBJECT:([AdventureWorks].[Production].[Product].[PK_Product_ProductID]))
Just like if you were using a UNION for two select statements, the
same number of columns must exist in both select statements, and they
must be of the same datatype. In both above examples, you get a
list of products that exist in both tables, so you know which products
from the manufacturer you already have listed.
I havenât seen any performance difference in the two, but Iâve only
played with this on small tables. The execution plan does show
that while you are doing a hash match inner join on the first query,
the second query performs a hash match right semi join. Both hash
matches perform 50% of the work in the query. The other 50% is
split evenly between the table scan and index scan in both
queries. So what does the fact that a right semi match is
performed when using INTERSECT? I tells me its doing the same
thing as if we were to use the EXISTS, like so:
EXISTSselect a.[name], a.productnumber from production.product a where exists(select b.[name], b.productnumber from production.productduplicates b where a.[name] = b.[name] and a.productnumber = b.productnumber)
EXCEPT
Same scenario, if you wanted to know which products you have that do
not exist in the duplicates table, you would normally do a NOT EXISTS
like so:
NOT EXISTSselect a.[name], a.productnumber from production.product a where not exists(select b.[name], b.productnumber from production.productduplicates b where a.[name] = b.[name] and a.productnumber = b.productnumber)
Execution plan for NOT EXISTSHash
Match(Right Anti Semi Join, HASH:([b].[Name],
[b].[ProductNumber])=([a].[Name], [a].[ProductNumber]),
RESIDUAL:([AdventureWorks].[Production].[Product].[Name] as
[a].[Name]=[AdventureWorks].[Production].[productduplicates].[Name] as
[b].[Name] AND [AdventureWorks].[Production].[Product].[ProductNumber]
as
[a].[ProductNumber]=[AdventureWorks].[Production].[productduplicates].[ProductNumber]
as [b].[ProductNumber])) |--Table Scan(OBJECT:([AdventureWorks].[Production].[productduplicates] AS [b])) |--Clustered Index Scan(OBJECT:([AdventureWorks].[Production].[Product].[PK_Product_ProductID] AS [a]))
And you get back the list of products that you have that arenât in
the duplicates list. You can flip this around to figure out which
products are in the duplicates that you donât have in your products.
Letâs simply the code a little bit by using the EXCEPT operator. This looks identical to the INTERSECTS query:
EXCEPTselect [name], productnumber from production.product except select [name], productnumber from production.productduplicates
Execution plan for exceptHash Match(Right
Anti Semi Join,
HASH:([AdventureWorks].[Production].[productduplicates].[Name],
[AdventureWorks].[Production].[productduplicates].[ProductNumber])=([AdventureWorks].[Production].[Product].[Name],
[AdventureWorks].[Production].[Product].[ProductNumber]),
RESIDUAL:([AdventureWorks].[Production].[Product].[Name]=[AdventureWorks].[Production].[productduplicates].[Name]
AND
[AdventureWorks].[Production].[Product].[ProductNumber]=[AdventureWorks].[Production].[productduplicates].[ProductNumber]))
|--Table Scan(OBJECT:([AdventureWorks].[Production].[productduplicates])) |--Clustered Index Scan(OBJECT:([AdventureWorks].[Production].[Product].[PK_Product_ProductID]))
Again, canât find a performance difference. In all queries
shows, the reads and duration all average out to be virtually
identical. Youâd have to test this on some large tables (high
number of rows) to figure out if there is a performance difference at
all.
Looking at the execution plans for everything, it doesnât seem that
there is a any gain from using INTERSECT and EXCEPT over JOINS or
EXISTS/NOT EXISTS except for one big thing: Much simpler code!