SQL expert?

jiminator

[H]F Junkie
Joined
Feb 2, 2007
Messages
11,605
Need some help building a query. I have a table from another DB with 4.5 million records and I want to copy a representational sample of the data out of it,
I think it requires a double select but I can't quite make it work

ie:

i1 i2 i3 misc
100 100 19 xxx x x x
100 100 20 x x xxx x x x
100 101 21 x z z z z
100 101 22 x z z z z
100 102 23 z z z z z
101 102 34 z z z zz
101 102 35 z z z zz
101 104 30 z z z z z

something like select distinct i1 i2 -> copy entire first record into new table

final
100 100 19 xxx x x x
100 101 21 x z z z z
100 102 23 z z z z z
101 104 30 z z z z z

SQL is not my thing, yet, but working on it :)
 
Why do you want to select distinct records? Do you want to sample data or get a complete set of unique records based on a certain set of keys?

You can:

SELECT * FROM YourTable ORDER BY NEWID();

which will assign a new random GUID to all 4.5M records, and sort by this new random key. You then have a random set of (n) records. This is expensive - assigning 4.5M GUIDs and performing the sort are not free. For 4.5M records and a one-time randomization, this will be fine - but if this query will be run more frequently, you might need something more scalable.

A far less expensive way to sample data is TABLESAMPLE. The IO, CPU, TempDB resource savings can be huge - but there are implications - this will return a random count/percent of records by page - i.e., entire pages of data are returned. This means that your "random" sample will have groups of data (8K) with sequential clustered keys. (which is likely your PK, but not necessarily)


Or perhaps you want a comprehensive list of counts of different combinations of keys? Look into group by.
 
Are you looking for a random sample, or every N number of rows? If it's random, then calebb has some (expensive) options that could be done from a restored DB backup so that you're not affecting Production's performance. If you're looking for grabbing a coarse sampling of the rows in the existing or alternate order, then you can use a filter on your query based on a conditional evaluation of the ROWID.
 
Thanks, I was not really looking for a random sampling but rather a representative test case in this instance for i1 and i2....
 
Well... I'm just going to say it because it is going to be said. No one knows what you want because you aren't being clear. What the heck is a "representative test case"? What should the result of your query be? What are your tables? What are your column names? What should the results be with respect to these tables and columns?
 
Back
Top