User Guide

This guide covers the basics of using scikit-clarans for your clustering tasks.

What is CLARANS?

CLARANS stands for Clustering Large Applications based on RANdomized Search.

Think of it as a middle-ground between:

PAM (Partitioning Around Medoids): High quality, but slow on large data.
CLARA (Clustering Large Applications): Faster on large data, but works on fixed samples, potentially missing better clusterings.

CLARANS explores the graph of possible solutions randomly. It doesn’t check every neighbor of a node (a set of medoids), but only a random subset. This makes it scalable while better avoiding local minima approaches.

Quick Start

Here is a complete example to get you clustering in seconds.

from clarans import CLARANS
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# 1. Prepare your data
# We'll generate 500 samples with 4 distinct centers
X, _ = make_blobs(n_samples=500, centers=4, n_features=2, random_state=42)

# 2. Initialize CLARANS
# We want to find 4 clusters.
model = CLARANS(
    n_clusters=4,
    numlocal=3,
    init='k-medoids++',
    random_state=42
)

# 3. Fit the model
model.fit(X)

# 4. Analyze results
print(f"Medoid Indices: {model.medoid_indices_}")
print(f"Labels: {model.labels_[:10]}...")

Configuration

The CLARANS class offers several parameters to tune performance vs. quality:

Parameter	Description
`n_clusters`	The number of clusters (medoids) to find.
`numlocal`	Number of local optima to search for. Higher usually means better quality but slower execution.
`maxneighbor`	Max neighbors to check per node. Defaults to a percentage of dataset size if not set.
`init`	Introduction strategy (e.g., `'k-medoids++'`, `'build'`, `'random'`).

Tips for Best Results

Initialization matters: Using init='k-medoids++' or 'build' often converges faster to better solutions than pure random.
Tuning parameters: If your results vary too much between runs, try increasing numlocal to explore more local minima.

FastCLARANS

FastCLARANS is a faster variant based on Schubert & Rousseeuw (2021). It provides significant speedups by using the FastPAM1 optimization strategy.

from clarans import FastCLARANS

model = FastCLARANS(n_clusters=4, numlocal=3, random_state=42)
model.fit(X)

Key improvements over CLARANS:

Smarter sampling: Instead of sampling random (medoid, non-medoid) pairs, FastCLARANS samples only non-medoid candidates and evaluates swaps with all k medoids at once.
O(k) speedup: Each candidate evaluation explores k edges of the search graph in the time CLARANS explores one.
Memory efficient: Computes distances on-the-fly (O(n) memory) rather than precomputing a full distance matrix (O(n²)).
Better quality: By exploring more of the search space per iteration, FastCLARANS often finds better solutions.

When to use FastCLARANS vs CLARANS:

Use FastCLARANS when you have low-dimensional data with cheap distance metrics (e.g., Euclidean).
Use CLARANS when distance computation is very expensive or when you need maximum memory efficiency.

For more hands-on recipes and runnable examples (including a Jupyter notebook with interactive demos), see Examples.