Big Data Sampling
Experimental Simulation
The Theoretical Framework
The Velocity Problem
In Big Data, we deal with the 3Vs: Volume, Velocity, and Variety. Traditional sampling assumes we know the population size N in advance, so each record's inclusion probability can be computed up front (P = 1/N for a single draw).
In a data stream, however, N is unknown and effectively unbounded, so Simple Random Sampling cannot be applied directly. We need an algorithm that, at every point in the stream, gives each item seen so far the same probability of being in the sample.
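To make the contrast concrete, here is a minimal sketch of classic simple random sampling (the population and sizes are illustrative only): it needs the entire population, and therefore N, in memory before it can draw a single element, which is exactly what a stream does not allow.

import random

population = list(range(1_000_000))     # the whole population must be materialized: O(N) memory
sample = random.sample(population, 10)  # random.sample needs the full sequence (and its length N)
print(sample)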
The Reservoir Solution
We maintain a 'Reservoir' of size k. The first k items of the stream fill the reservoir directly; from then on, the n-th item replaces a randomly chosen slot with probability k/n. This keeps every item seen so far in the sample with the same probability k/n, no matter how long the stream runs.
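The uniformity claim follows from a short induction (the standard argument for this scheme, written in LaTeX for readability): assume that after n >= k items each item is in the reservoir with probability k/n. Item n+1 is admitted with probability k/(n+1), and an item already in the reservoir survives with probability

\Pr[\text{still in reservoir}] = \frac{k}{n}\left(1 - \frac{k}{n+1}\cdot\frac{1}{k}\right) = \frac{k}{n}\cdot\frac{n}{n+1} = \frac{k}{n+1},

so after n+1 items every item seen so far is in the sample with probability k/(n+1).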
Algorithm Implementation
import random

class ReservoirSampler:
    def __init__(self, k):
        self.k = k              # reservoir capacity
        self.reservoir = []     # the current sample
        self.n = 0              # number of items seen so far

    def add(self, item):
        self.n += 1
        if len(self.reservoir) < self.k:
            # Phase 1: the first k items fill the reservoir directly.
            self.reservoir.append(item)
        else:
            # Phase 2: item n replaces a random slot with probability k/n.
            j = random.randint(0, self.n - 1)
            if j < self.k:
                self.reservoir[j] = item

This class implements Algorithm R. It processes data one item at a time, making it ideal for streaming pipelines.
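A minimal usage sketch (the stream below is a synthetic generator standing in for an unbounded source such as a message queue or log tail):

def stream():
    # Illustrative stand-in for an unbounded data source.
    for i in range(1_000_000):
        yield i

sampler = ReservoirSampler(k=100)
for item in stream():
    sampler.add(item)

print(len(sampler.reservoir))   # 100
print(sampler.reservoir[:5])    # a peek at the sample; every item had the same chance of being kept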
Comparative Methodologies
Batch Processing (Dask)
When the data is massive but static (stored on disk), we don't need a reservoir. We can use lazy evaluation with Dask to draw a sample without loading the full dataset into memory.
import dask.dataframe as dd

df = dd.read_csv('big_data.csv')   # lazy: builds a task graph, no data is read yet
sample = df.sample(frac=0.1)       # still lazy: defines a 10% random sample
print(sample.compute())            # only now are the partitions read and sampled
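If reproducibility matters, Dask's sample accepts a random_state, and any downstream aggregation stays lazy until compute() is called. A small sketch (the column name 'value' is assumed here for illustration):

sample = df.sample(frac=0.1, random_state=42)   # reproducible 10% sample
mean_estimate = sample['value'].mean()          # still lazy
print(mean_estimate.compute())                  # triggers the actual read and computation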
Stratified Sampling
Pure randomness can still yield a biased sample: small categories may be under-represented or missed entirely. If we need to ensure minority groups are represented, we split the data into strata first and sample within each one.
stratified = df.groupby('cat').apply(
    lambda x: x.sample(frac=0.1),   # take 10% within each stratum
    meta=df                         # metadata hint so Dask knows the output schema
)
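To confirm that each stratum kept its proportion, we can compare category counts before and after sampling (a quick check; 'cat' is the stratification column used above):

original_counts = df['cat'].value_counts().compute()
sample_counts = stratified['cat'].value_counts().compute()
print((sample_counts / original_counts).round(2))   # each ratio should be close to 0.1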