User Guide

This guide covers the basics of using scikit-clarans for your clustering tasks.

What is CLARANS?

CLARANS stands for Clustering Large Applications based on RANdomized Search.

Think of it as a middle ground between:

  • PAM (Partitioning Around Medoids): High quality, but slow on large data.

  • CLARA (Clustering Large Applications): Faster on large data, but works on fixed samples, potentially missing better clusterings.

CLARANS explores the graph of possible solutions randomly. Rather than checking every neighbor of a node (a set of medoids), it checks only a random subset. This keeps it scalable while helping it avoid poor local minima.
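To make the search concrete, here is a minimal pure-Python sketch of the randomized loop described above. It is not scikit-clarans's implementation: the names total_cost and clarans_sketch are made up for illustration, Euclidean distance is assumed, and every neighbor is scored by recomputing the full cost, which is simple but slow. The numlocal and maxneighbor arguments are meant to play the same roles as the parameters of the same name.

```python
import math
import random


def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def clarans_sketch(points, k, numlocal=2, maxneighbor=20, seed=0):
    """Illustrative CLARANS-style search; returns (medoid indices, cost)."""
    rng = random.Random(seed)
    n = len(points)
    best_medoids, best_cost = None, math.inf

    for _ in range(numlocal):
        # Start each local search from a random node (a set of k medoids).
        current = rng.sample(range(n), k)
        current_cost = total_cost(points, current)

        failures = 0
        while failures < maxneighbor:
            # A random neighbor differs from the current node in one medoid.
            slot = rng.randrange(k)
            candidate = rng.choice([i for i in range(n) if i not in current])
            neighbor = current.copy()
            neighbor[slot] = candidate
            neighbor_cost = total_cost(points, neighbor)

            if neighbor_cost < current_cost:
                # Move to the better neighbor and reset the failure counter.
                current, current_cost = neighbor, neighbor_cost
                failures = 0
            else:
                failures += 1

        # Keep the best local optimum found across all restarts.
        if current_cost < best_cost:
            best_medoids, best_cost = sorted(current), current_cost

    return best_medoids, best_cost


# Two well-separated groups of three points each.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.0), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
medoids, cost = clarans_sketch(points, k=2, numlocal=3, maxneighbor=30)
print(medoids, round(cost, 3))
```

A local search stops once maxneighbor random neighbors in a row fail to improve the cost, so larger values trade run time for a more thorough check of each node.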

Quick Start

Here is a complete example to get you clustering in seconds.

from clarans import CLARANS
from sklearn.datasets import make_blobs

# 1. Prepare your data
# We'll generate 500 samples with 4 distinct centers
X, _ = make_blobs(n_samples=500, centers=4, n_features=2, random_state=42)

# 2. Initialize CLARANS
# We want to find 4 clusters.
model = CLARANS(
    n_clusters=4,
    numlocal=3,
    init='k-medoids++',
    random_state=42
)

# 3. Fit the model
model.fit(X)

# 4. Analyze results
print(f"Medoid Indices: {model.medoid_indices_}")
print(f"Labels: {model.labels_[:10]}...")

Configuration

The CLARANS class offers several parameters to tune performance vs. quality:

Parameter     Description
n_clusters    The number of clusters (medoids) to find.
numlocal      Number of local optima to search for. Higher usually means better quality but slower execution.
maxneighbor   Maximum number of neighbors to check per node. Defaults to a percentage of the dataset size if not set.
init          Initialization strategy (e.g., 'k-medoids++', 'build', 'random').
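For a sense of scale when choosing maxneighbor by hand, the sketch below computes the rule of thumb usually attributed to the original CLARANS paper (Ng & Han): the larger of 250 and 1.25% of k(n - k), where k(n - k) is the number of neighbors of a node. The helper name suggested_maxneighbor is hypothetical, and this guide does not state that scikit-clarans uses this exact formula for its default, so treat it as a starting point only.

```python
def suggested_maxneighbor(n_samples, n_clusters, fraction=0.0125, floor=250):
    """Rule-of-thumb value: max(floor, fraction * k * (n - k))."""
    # Each node has k * (n - k) neighbors: swap any medoid with any non-medoid.
    n_neighbors = n_clusters * (n_samples - n_clusters)
    return max(floor, int(fraction * n_neighbors))


print(suggested_maxneighbor(500, 4))       # small dataset: the floor applies
print(suggested_maxneighbor(100_000, 10))  # large dataset: the fraction applies
```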

Tips for Best Results

  • Initialization matters: Using init='k-medoids++' or 'build' often converges faster to better solutions than pure random.

  • Tuning parameters: If your results vary too much between runs, try increasing numlocal to explore more local minima.
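One way to judge whether results "vary too much" is to compare the clustering cost across runs with different seeds. The helper below is a minimal sketch with a hypothetical name (medoid_cost) that assumes Euclidean distance; the medoid sets in the example are invented stand-ins for the medoid_indices_ you would collect from separate fits.

```python
import math


def medoid_cost(X, medoid_indices):
    """Total distance from each sample to its nearest medoid (lower is better)."""
    medoids = [X[i] for i in medoid_indices]
    return sum(min(math.dist(x, m) for m in medoids) for x in X)


X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

# Hypothetical medoid sets, standing in for `medoid_indices_` from three runs.
runs = {"seed 0": [0, 3], "seed 1": [1, 4], "seed 2": [0, 1]}
costs = {name: medoid_cost(X, idx) for name, idx in runs.items()}

for name, cost in costs.items():
    print(f"{name}: cost = {cost:.3f}")
print(f"spread (max - min) = {max(costs.values()) - min(costs.values()):.3f}")
```

A large spread between runs suggests some of them stopped in poor local minima, which is the situation where raising numlocal is most likely to help.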

FastCLARANS

FastCLARANS is a faster variant based on Schubert & Rousseeuw (2021). It provides significant speedups by using the FastPAM1 optimization strategy.

from clarans import FastCLARANS

model = FastCLARANS(n_clusters=4, numlocal=3, random_state=42)
model.fit(X)

Key improvements over CLARANS:

  • Smarter sampling: Instead of sampling random (medoid, non-medoid) pairs, FastCLARANS samples only non-medoid candidates and evaluates swaps with all k medoids at once.

  • O(k) speedup: Each candidate evaluation explores k edges of the search graph in the time CLARANS explores one.

  • Memory efficient: Computes distances on-the-fly (O(n) memory) rather than precomputing a full distance matrix (O(n²)).

  • Better quality: By exploring more of the search space per iteration, FastCLARANS often finds better solutions.
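The "all k medoids at once" idea can be sketched as follows. This is an illustrative pure-Python version of a FastPAM1-style swap evaluation, not the code used by FastCLARANS: the function names nearest_two and swap_deltas are made up, Euclidean distance is assumed, at least two medoids are required, and the nearest/second-nearest distances are recomputed on each call rather than cached.

```python
import math


def nearest_two(points, medoids):
    """Per point: (slot of nearest medoid, distance to it, distance to second nearest)."""
    info = []
    for p in points:
        dists = sorted((math.dist(p, points[m]), slot) for slot, m in enumerate(medoids))
        info.append((dists[0][1], dists[0][0], dists[1][0]))
    return info


def swap_deltas(points, medoids, candidate):
    """Cost change of swapping `candidate` in for each of the k medoids, in one pass."""
    info = nearest_two(points, medoids)
    k = len(medoids)
    # The candidate becomes a medoid in every swap, so its own distance drops to zero.
    deltas = [-info[candidate][1]] * k
    for o, p in enumerate(points):
        if o == candidate:
            continue
        d_oc = math.dist(p, points[candidate])
        slot, d_near, d_second = info[o]
        # If the nearest medoid is removed, fall back to the candidate or the second nearest.
        deltas[slot] += min(d_oc, d_second) - d_near
        # If any other medoid is removed, the point moves only when the candidate is closer.
        if d_oc < d_near:
            for i in range(k):
                if i != slot:
                    deltas[i] += d_oc - d_near
    return deltas


points = [(0, 0), (1, 0), (0, 2), (6, 5), (7, 5), (6, 8), (3, 3)]
medoids = [0, 3]
for candidate in (1, 4, 6):
    print(candidate, [round(d, 3) for d in swap_deltas(points, medoids, candidate)])
```

Each entry of the returned list is the cost change for replacing the medoid in that slot with the candidate, so a negative entry marks an improving swap. One pass over the data yields all k values, which is where the O(k) saving over evaluating single (medoid, non-medoid) pairs comes from.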

When to use FastCLARANS vs CLARANS:

  • Use FastCLARANS when you have low-dimensional data with cheap distance metrics (e.g., Euclidean).

  • Use CLARANS when distance computation is very expensive or when you need maximum memory efficiency.

For more hands-on recipes and runnable examples (including a Jupyter notebook with interactive demos), see Examples.