User Guide
This guide covers the basics of using scikit-clarans for your clustering tasks.
What is CLARANS?
CLARANS stands for Clustering Large Applications based on RANdomized Search.
Think of it as a middle-ground between:
PAM (Partitioning Around Medoids): High quality, but slow on large data.
CLARA (Clustering Large Applications): Faster on large data, but works on fixed samples, potentially missing better clusterings.
CLARANS explores the graph of possible solutions randomly. It doesn’t check every neighbor of a node (a set of medoids), but only a random subset. This makes it scalable while better avoiding local minima approaches.
Quick Start
Here is a complete example to get you clustering in seconds.
from clarans import CLARANS
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# 1. Prepare your data
# We'll generate 500 samples with 4 distinct centers
X, _ = make_blobs(n_samples=500, centers=4, n_features=2, random_state=42)
# 2. Initialize CLARANS
# We want to find 4 clusters.
model = CLARANS(
n_clusters=4,
numlocal=3,
init='k-medoids++',
random_state=42
)
# 3. Fit the model
model.fit(X)
# 4. Analyze results
print(f"Medoid Indices: {model.medoid_indices_}")
print(f"Labels: {model.labels_[:10]}...")
Configuration
The CLARANS class offers several parameters to tune performance vs. quality:
Parameter |
Description |
|---|---|
|
The number of clusters (medoids) to find. |
|
Number of local optima to search for. Higher usually means better quality but slower execution. |
|
Max neighbors to check per node. Defaults to a percentage of dataset size if not set. |
|
Introduction strategy (e.g., |
Tips for Best Results
Initialization matters: Using
init='k-medoids++'or'build'often converges faster to better solutions than pure random.Tuning parameters: If your results vary too much between runs, try increasing
numlocalto explore more local minima.
FastCLARANS
FastCLARANS is a faster variant based on Schubert & Rousseeuw (2021). It provides significant speedups by using the FastPAM1 optimization strategy.
from clarans import FastCLARANS
model = FastCLARANS(n_clusters=4, numlocal=3, random_state=42)
model.fit(X)
Key improvements over CLARANS:
Smarter sampling: Instead of sampling random (medoid, non-medoid) pairs, FastCLARANS samples only non-medoid candidates and evaluates swaps with all k medoids at once.
O(k) speedup: Each candidate evaluation explores k edges of the search graph in the time CLARANS explores one.
Memory efficient: Computes distances on-the-fly (O(n) memory) rather than precomputing a full distance matrix (O(n²)).
Better quality: By exploring more of the search space per iteration, FastCLARANS often finds better solutions.
When to use FastCLARANS vs CLARANS:
Use FastCLARANS when you have low-dimensional data with cheap distance metrics (e.g., Euclidean).
Use CLARANS when distance computation is very expensive or when you need maximum memory efficiency.
For more hands-on recipes and runnable examples (including a Jupyter notebook with interactive demos), see Examples.