Row Sampling and Percentage Sampling in SSIS 2008R2 Example

row sampling

Row Sampling and Percentage Sampling both are similar task and perform the same function with little difference. The Row Sampling transformation is used to obtain a randomly selected subset of an input dataset. You can specify the exact size of the output sample, and specify a seed for the random number generator. There are many applications for random sampling. For example, a company that wanted to randomly select 50 employees to receive prizes in a lottery could use the Row Sampling transformation on the employee database to generate the exact number of winners.

While, The Percentage Sampling transformation creates a sample data set by selecting a percentage of the transformation input rows. The sample data set is a random selection of rows from the transformation input, to make the resultant sample representative of the input.

Both will take one input and provide two output. In easy term you can say. If 5000 records in my source then In Row Sampling, if I take 25 as value then output is 25 records and In Percentage Sampling, If I take 25 as value then 25% of 5000 i.e 1250 records will be the output.

Let start with the source first, here I am taking and OLE DB source in a Dataflow task.

Now configure the source, I am taking a table  DimCustomer from AdventureWorksDW the database provided by Microsoft.

Now press “OK”. Source configured now . Now take a Row Sampling Task .

Let configure Row Sampling Task. In that you will find.

Number of rows
Specify the number of rows from the input to use as a sample. The value of this property can be specified by using a property expression.
Sample output name
Provide a unique name for the output that will include the sampled rows. The name provided will be displayed within SSIS Designer.
Unselected output name
Provide a unique name for the output that will contain the rows excluded from the sampling. The name provided will be displayed within SSIS Designer.
Use the following random seed
Specify the sampling seed for the random number generator that the transformation uses to create a sample. This is only recommended for development and testing. The transformation uses the Microsoft Windows tick count as a seed if a random seed is not specified.

Here I have selected 25 row as a sampled data and Named a sample output name as Sampled Data Input and Unselected output name as Unsampled Data Input. And I don’t check the random seed option because I do not want it. Now I select two output destination one is Flat file Destination for Sampled Data input and another Flat File destination for Unsampled Data Input.

When you drop the pipeline from Row Sampling to any destination it will open a pop up and it will ask that which input you want to select for flat file destination.

After selecting the input it will show like this.

Now configure both the Flat file destination one for Sample Data and Another for Unsampled data.

Now it all set.

Now Run the task , after running the task you can see that only 25 row will go to the sampled destination and all the rest of the rows will go to the unsampled.

That is it for the Row Sampling . As I explained before that Percentage sampling is similar like row sampling it differs only at how much column it take for sampling will depend on the % not on the defined number of row.

Let see what result comes when we replace the row sampling task with percentage sampling task. Select Percentage Sampling instead of Row Sampling

And Configure it , I select 25% of the whole data .

Now run the task and see the difference.

Here you can see that transformation is sending 4469 rows in sampled data destination which is 25% of 18484 I mean whole data and Row sampling was sending only 25 rows to  the  sampled data.

Give a try guys !!
Thanks Swathi Srivatsava for the post.

Roopesh Babu V