Big Data Sampling
Experimental Simulation
The Theoretical Framework
The Velocity Problem
In Big Data, we deal with the 3Vs: Volume, Velocity, and Variety. Traditional sampling assumes we know the population size N in advance, so each record's inclusion probability can be computed up front (P = 1/N for a single draw).
In a data stream, however, N is unknown and effectively unbounded, so Simple Random Sampling cannot be applied directly. We need an algorithm that, at every point in the stream, gives each item seen so far the same probability of being in the sample.
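To make the contrast concrete, here is a minimal sketch of classic simple random sampling (the population and sizes are illustrative only): it needs the entire population, and therefore N, in memory before it can draw a single element, which is exactly what a stream does not allow.

import random

population = list(range(1_000_000))     # the whole population must be materialized: O(N) memory
sample = random.sample(population, 10)  # random.sample needs the full sequence (and its length N)
print(sample)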
The Reservoir Solution
We maintain a 'Reservoir' of size k. The first k items of the stream fill the reservoir directly; from then on, the n-th item replaces a randomly chosen slot with probability k/n. This keeps every item seen so far in the sample with the same probability k/n, no matter how long the stream runs.
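The uniformity claim follows from a short induction (the standard argument for this scheme, written in LaTeX for readability): assume that after n >= k items each item is in the reservoir with probability k/n. Item n+1 is admitted with probability k/(n+1), and an item already in the reservoir survives with probability

\Pr[\text{still in reservoir}] = \frac{k}{n}\left(1 - \frac{k}{n+1}\cdot\frac{1}{k}\right) = \frac{k}{n}\cdot\frac{n}{n+1} = \frac{k}{n+1},

so after n+1 items every item seen so far is in the sample with probability k/(n+1).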
Algorithm Implementation
import random

class ReservoirSampler:
    def __init__(self, k):
        self.k = k              # reservoir capacity
        self.reservoir = []     # the current sample
        self.n = 0              # number of items seen so far

    def add(self, item):
        self.n += 1
        if len(self.reservoir) < self.k:
            # Phase 1: the first k items fill the reservoir directly.
            self.reservoir.append(item)
        else:
            # Phase 2: item n replaces a random slot with probability k/n.
            j = random.randint(0, self.n - 1)
            if j < self.k:
                self.reservoir[j] = item

This class implements Algorithm R. It processes data one item at a time, making it ideal for streaming pipelines.
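A minimal usage sketch (the stream below is a synthetic generator standing in for an unbounded source such as a message queue or log tail):

def stream():
    # Illustrative stand-in for an unbounded data source.
    for i in range(1_000_000):
        yield i

sampler = ReservoirSampler(k=100)
for item in stream():
    sampler.add(item)

print(len(sampler.reservoir))   # 100
print(sampler.reservoir[:5])    # a peek at the sample; every item had the same chance of being kept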
Comparative Methodologies
Batch Processing (Dask)
When the data is massive but static (stored on disk), we don't need a reservoir. We can use lazy evaluation with Dask to draw a sample without loading the full dataset into memory.
import dask.dataframe as dd

df = dd.read_csv('big_data.csv')   # lazy: builds a task graph, no data is read yet
sample = df.sample(frac=0.1)       # still lazy: defines a 10% random sample
print(sample.compute())            # only now are the partitions read and sampled
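If reproducibility matters, Dask's sample accepts a random_state, and any downstream aggregation stays lazy until compute() is called. A small sketch (the column name 'value' is assumed here for illustration):

sample = df.sample(frac=0.1, random_state=42)   # reproducible 10% sample
mean_estimate = sample['value'].mean()          # still lazy
print(mean_estimate.compute())                  # triggers the actual read and computation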
Stratified Sampling
Pure randomness can still yield a biased sample: small categories may be under-represented or missed entirely. If we need to ensure minority groups are represented, we split the data into strata first and sample within each one.
stratified = df.groupby('cat').apply(
    lambda x: x.sample(frac=0.1),   # take 10% within each stratum
    meta=df                         # metadata hint so Dask knows the output schema
)
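To confirm that each stratum kept its proportion, we can compare category counts before and after sampling (a quick check; 'cat' is the stratification column used above):

original_counts = df['cat'].value_counts().compute()
sample_counts = stratified['cat'].value_counts().compute()
print((sample_counts / original_counts).round(2))   # each ratio should be close to 0.1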