Friday, March 30, 2012

How to get Top 3 records in the group level?

Hi all,

I have a query which is grouped by a field called R_ID, which gives me a list of records for each R_ID. So here is the thing, I want to get only top 3 records for each R_ID. I tried using 'select top 3....', but the result is not what I wanted. It doesn't consider the grouping. Is there any way to achieve this task?

Please let me know if some body has any idea.

Thx.

The answer depends on whether or not you have SQL Server 2005 or SQL Server 2000. If you are using SQL Server 2005 you can use the ROW_NUMBER function. What is also needed is to know what field you are ordering your groups by (to get the top 3).


Dave

|||Doing this in a report wouldn′t be the best way as you would get all the data from the server and then only would display a part of it. better filter the data on the server and send only back the appropiate results, then you would have no problem on the client / reporting service. Anway, if you want to do this, you might have a look at the RowNumber() property in Reporting Services. You could set the Visibility of the row to Visible=True if the RowNumber("GroupName") equals or is less than 3 =IIF(Rownumber("GroupName") <= 3;True;False)

Doing this on the server depends on which server version you are using. SQL Server 2005 probably would enable you to use ROW_NUMBER() on the server side. SQL Server 2k does not implement this new function, so you would have to use something else instead.

HTH, Jens K. Suessmeyer.

http://www.sqlserver2005.de
|||

How about this:

-- sample table / data
CREATE TABLE t1 (R_ID nvarchar(3), text_data NVARCHAR(50));

INSERT INTO t1 VALUES ('A', 'Text for A1');
INSERT INTO t1 VALUES ('A', 'Text for A2');
INSERT INTO t1 VALUES ('A', 'Text for A3');
INSERT INTO t1 VALUES ('A', 'Text for A4');
INSERT INTO t1 VALUES ('B', 'Text for B1');
INSERT INTO t1 VALUES ('B', 'Text for B2');
INSERT INTO t1 VALUES ('B', 'Text for B3');
INSERT INTO t1 VALUES ('B', 'Text for B4');
INSERT INTO t1 VALUES ('C', 'Text for C1');
INSERT INTO t1 VALUES ('C', 'Text for C2');
INSERT INTO t1 VALUES ('C', 'Text for C3');
INSERT INTO t1 VALUES ('C', 'Text for C4');

-- Retrieve TOP(3) Entries for Each R_ID
SELECT R_ID, text_data
FROM t1 AS a
WHERE text_data IN
(SELECT TOP(3) text_data
FROM t1 AS b
WHERE a.R_ID = b.R_ID
ORDER BY text_data)

|||

Amde:

I set up a mock-up to compare Lucky's query with a query based on ROW_NUMBER. I created the T1 table with 32767 distinct R_ID entries and 30 distinct TEXT_DATA entries for each R_ID -- nearly 1,000,000 rows. When I ran Lucky's query with the table without any indexes it took about 102 seconds with 6.6 million logical IOs. I then added an index based on R_ID only and repeated the run. Again, it took about 102 seconds with 6.6 million logical IOs. I then looked at the SHOWPLAN and verified that the plan was still using a TABLE scan and was not using the index. This is expected because use of the index would require a prohibitive amount of bookmark lookups. I lastly dropped that index and added a cover index based on both R_ID and TEXT_DATA and repeated the run. This time it took 17 seconds with 3.1 million logical IOs.

Next, I repeated this using a query based on ROW_NUMBER. I first ran the query without any indexes. It took 7263 ms with 7709 logical IOs. I added an index based on R_ID only and it again took about 7600 ms with 7709 logical IOs. I then dropped that index and added an index based on both R_ID and TEXT_DATA. This query then took 2406 ms with 7706 logical IOs. This is the script I used for testing:

set nocount on
use tempdb
drop table t1
go

-- sample table / data
CREATE TABLE t1 (R_ID nvarchar(5), text_data NVARCHAR(50));
--exec sp_help t1

insert into t1
select convert (nvarchar (5), i.iter),
'Text for ' + convert (nvarchar(5), i.iter)
+ '-' + convert (nchar(2), j.iter)
from mugambo.dbo.small_iterator (nolock) i
inner join mugambo.dbo.small_iterator (nolock) j
on j.iter <= 30
--select * from t1

--drop index t1.t1_ndx
--create index t1_ndx on t1(r_id)
create index t1_ndx on t1(r_id, text_data)
update statistics t1
go

declare @.begDt datetime set @.begDt = getdate()

/*
--insert into @.receiver
-- Retrieve TOP(3) Entries for Each R_ID
SELECT R_ID, text_data
FROM t1 AS a
WHERE text_data IN
(SELECT TOP(3) text_data
FROM t1 AS b
WHERE a.R_ID = b.R_ID
ORDER BY text_data)
*/

set statistics io on

select r_id,
text_data
from ( select r_id,
row_number () over
(partition by R_ID order by text_data)
as seq,
text_data
from t1
) x
where seq <= 3

set statistics io off

print ' '
select datediff (ms, @.begDt, getdate()) as [Elapsed Time]


-- Output:

-- r_id text_data
-- -- --
-- 1 Text for 1-1
-- 1 Text for 1-10
-- ...
-- 9999 Text for 9999-11

-- Table 't1'. Scan count 1, logical reads 7706, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
-- Elapsed Time
--
-- 2220

|||

Hi,

keeping same table as Lucky_p has dipicted , just try this

select *from

(SELECT i.r_id, i.text_data,

DENSE_RANK() OVER (PARTITION BY i.r_id order by text_data) as RANK

FROM t1 i) A where rank<=3

Madhu

|||

That's all nice and i agree my solution is not fast when the table is large.

Do you have a fast solution for SQL 2000 too? All your solutions are for SQL 2005.

|||

Lucky:

I cannot come up with a method for SQL Server 2000 that is any faster than your version. Hopefully, one of the senior members can shed some light; I struck out.


Dave

|||

I found an alternative that might be of some use:

declare @.top3 table
( r_id nvarchar (5) not null,
seq integer not null,
text_data nvarchar (50) not null,

primary key (r_id, seq)
)

insert into @.top3
select r_id,
1,
min(text_data)
from t1
group by r_id

insert into @.top3
select r_id,
2,
( select min (b.text_data) from t1 b
where a.r_id = b.r_id
and b.text_data > a.text_data
)
from @.top3 a

insert into @.top3
select r_id,
3,
( select min (b.text_data) from t1 b
where a.r_id = b.r_id
and b.text_data > a.text_data
)
from @.top3 a
where seq = 2


set statistics io off

print ' '
select datediff (ms, @.begDt, getdate()) as [Elapsed Time]


-- Output:

-- Table '#33ADF722'. Scan count 0, logical reads 66286, physical reads 0, read-ahead reads 0.
-- Table 't1'. Scan count 1, logical reads 7640, physical reads 0, read-ahead reads 0.
-- Table '#33ADF722'. Scan count 1, logical reads 97120, physical reads 0, read-ahead reads 0.
-- Table 'Worktable'. Scan count 1, logical reads 33468, physical reads 0, read-ahead reads 0.
-- Table 't1'. Scan count 32767, logical reads 98558, physical reads 0, read-ahead reads 0.
-- Table '#33ADF722'. Scan count 1, logical reads 100528, physical reads 0, read-ahead reads 0.
-- Table 'Worktable'. Scan count 1, logical reads 33468, physical reads 0, read-ahead reads 0.
-- Table 't1'. Scan count 32767, logical reads 98552, physical reads 0, read-ahead reads 0.
-- Elapsed Time
--
-- 2766

|||

I have an additional solution for SQL 2000 which uses a Partition calculation. But I'm not sure if its faster than the other 2000 compatible approach.

-- Create a Temporary Table using an additional column
-- Row-Number Partitioned by R_ID
SELECT R_ID , text_data,
(SELECT COUNT(*) FROM T1 AS b
WHERE b.R_ID = a.R_ID
AND (b.text_data < a.text_data
OR (b.text_data = a.text_data))) AS rownum
INTO #T1
FROM T1 AS a
ORDER BY R_ID, text_data;

-- Get the results from Temporary Table
SELECT R_ID, text_data FROM #T1
WHERE rownum <= 3

|||

Lucky P wrote:

I have an additional solution for SQL 2000 which uses a Partition calculation. But I'm not sure if its faster than the other 2000 compatible approach.

-- Create a Temporary Table using an additional column
-- Row-Number Partitioned by R_ID
SELECT R_ID , text_data,
(SELECT COUNT(*) FROM T1 AS b
WHERE b.R_ID = a.R_ID
AND (b.text_data < a.text_data
OR (b.text_data = a.text_data))) AS rownum
INTO #T1
FROM T1 AS a
ORDER BY R_ID, text_data;

-- Get the results from Temporary Table
SELECT R_ID, text_data FROM #T1
WHERE rownum <= 3

again with little modification u can fine tune this script.. eliminate Temp table

Select *from

(SELECT top 100 R_ID , text_data,

(SELECT COUNT(*) FROM T1 AS b

WHERE b.R_ID = a.R_ID

AND (b.text_data < a.text_data

OR (b.text_data = a.text_data))) AS rownum

FROM T1 AS a

ORDER BY R_ID, text_data) A

where rownum<4

Madhu

|||

I tested these other queries too with my "million-row" mockup and here is what I think:

If running SQL Server 2005 use either Madhu's or my first query|||

Your getting a partial result because of these line:

(SELECT top 100 R_ID , text_data,

If You change it to

(SELECT top 100 PERCENT R_ID , text_data,

You should get all the rows....

|||

Thanks for catching that lucky; I tested this one too and your original query is still much faster.


Dave

|||

ms sql server

data:


idno Email Category
1 a@.a.com c
2 b@.b.com c
3 b@.b.com c
4 b@.b.com c

only unique email address record are requierd.
1 a@.a.com c
2 b@.b.com c

what could be the query for such requirements.

No comments:

Post a Comment