User Guide
==========

This guide covers the basics of using **scikit-clarans** for your clustering tasks.

What is CLARANS?
----------------

**CLARANS** stands for *Clustering Large Applications based on RANdomized Search*. Think of it as a middle ground between:

* **PAM (Partitioning Around Medoids):** High quality, but slow on large data.
* **CLARA (Clustering Large Applications):** Faster on large data, but works on fixed samples, potentially missing better clusterings.

CLARANS explores the graph of possible solutions randomly. Each node of the graph is a set of medoids, and two nodes are neighbors when they differ in exactly one medoid. CLARANS doesn't check *every* neighbor of a node, but only a random subset. This makes it scalable while reducing the risk of getting stuck in a poor local minimum.

Quick Start
-----------

Here is a complete example to get you clustering in seconds.

.. code-block:: python

   from clarans import CLARANS
   from sklearn.datasets import make_blobs

   # 1. Prepare your data
   # We'll generate 500 samples with 4 distinct centers
   X, _ = make_blobs(n_samples=500, centers=4, n_features=2, random_state=42)

   # 2. Initialize CLARANS
   # We want to find 4 clusters.
   model = CLARANS(
       n_clusters=4,
       numlocal=3,
       init='k-medoids++',
       random_state=42
   )

   # 3. Fit the model
   model.fit(X)

   # 4. Analyze results
   print(f"Medoid Indices: {model.medoid_indices_}")
   print(f"Labels: {model.labels_[:10]}...")

Configuration
-------------

The ``CLARANS`` class offers several parameters to tune performance vs. quality:

.. list-table::
   :widths: 25 75
   :header-rows: 1

   * - Parameter
     - Description
   * - ``n_clusters``
     - The number of clusters (medoids) to find.
   * - ``numlocal``
     - Number of local optima to search for. Higher usually means better quality but slower execution.
   * - ``maxneighbor``
     - Maximum number of neighbors to check per node. Defaults to a percentage of dataset size if not set.
   * - ``init``
     - Initialization strategy (e.g., ``'k-medoids++'``, ``'build'``, ``'random'``).

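To make the roles of ``numlocal`` and ``maxneighbor`` more concrete, the randomized search described in this guide can be sketched in plain NumPy. This is a minimal illustration, not the code used by scikit-clarans: the function name ``clarans_sketch`` is hypothetical, it assumes Euclidean distance and purely random initialization, and it precomputes the full distance matrix, which is only reasonable for small datasets.

.. code-block:: python

   import numpy as np

   def clarans_sketch(X, n_clusters, numlocal=2, maxneighbor=50, seed=0):
       """Illustrative CLARANS loop; NOT the scikit-clarans implementation."""
       rng = np.random.default_rng(seed)
       n = len(X)
       # Pairwise Euclidean distances (assumption: small n, Euclidean metric).
       D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

       def cost(medoids):
           # Sum of distances from each point to its closest medoid.
           return D[:, medoids].min(axis=1).sum()

       best_medoids, best_cost = None, np.inf
       for _ in range(numlocal):
           # Start each local search from a random node (set of medoids).
           current = rng.choice(n, size=n_clusters, replace=False)
           current_cost = cost(current)
           failures = 0
           while failures < maxneighbor:
               # A neighbor differs from the current node in exactly one medoid.
               swap_out = rng.integers(n_clusters)
               swap_in = rng.choice(np.setdiff1d(np.arange(n), current))
               neighbor = current.copy()
               neighbor[swap_out] = swap_in
               neighbor_cost = cost(neighbor)
               if neighbor_cost < current_cost:
                   # Move to the better neighbor and reset the counter.
                   current, current_cost = neighbor, neighbor_cost
                   failures = 0
               else:
                   failures += 1
           # Keep the best local optimum found across all restarts.
           if current_cost < best_cost:
               best_medoids, best_cost = current, current_cost

       labels = D[:, best_medoids].argmin(axis=1)
       return best_medoids, labels, best_cost

   # Small demo on three synthetic blobs.
   demo_rng = np.random.default_rng(0)
   X_demo = np.vstack(
       [demo_rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0, 5, 10)]
   )
   medoids, labels, total_cost = clarans_sketch(X_demo, n_clusters=3)

In this sketch a local search stops after ``maxneighbor`` consecutive neighbors fail to improve the cost, and ``numlocal`` controls how many such searches are run from fresh random starts.
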
Tips for Best Results
---------------------

* **Initialization matters:** Using ``init='k-medoids++'`` or ``init='build'`` often converges faster to better solutions than purely random initialization.
* **Tuning parameters:** If your results vary too much between runs, try increasing ``numlocal`` to explore more local minima.

FastCLARANS
-----------

**FastCLARANS** is a faster variant based on Schubert & Rousseeuw (2021). It provides significant speedups by using the FastPAM1 optimization strategy.

.. code-block:: python

   from clarans import FastCLARANS

   # X is the data array from the Quick Start example.
   model = FastCLARANS(n_clusters=4, numlocal=3, random_state=42)
   model.fit(X)

**Key improvements over CLARANS:**

* **Smarter sampling:** Instead of sampling random (medoid, non-medoid) pairs, FastCLARANS samples only non-medoid candidates and evaluates swaps with all k medoids at once.
* **O(k) speedup:** Each candidate evaluation explores k edges of the search graph in the time CLARANS explores one.
* **Memory efficient:** Computes distances on-the-fly (O(n) memory) rather than precomputing a full distance matrix (O(n²)).
* **Better quality:** By exploring more of the search space per iteration, FastCLARANS often finds better solutions.

**When to use FastCLARANS vs CLARANS:**

* Use **FastCLARANS** when you have low-dimensional data with cheap distance metrics (e.g., Euclidean).
* Use **CLARANS** when distance computation is very expensive or when you need maximum memory efficiency.

For more hands-on recipes and runnable examples (including a Jupyter notebook with interactive demos), see :doc:`examples`.

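As a supplement to the FastCLARANS section, the idea of evaluating swaps with all k medoids at once can be sketched in NumPy. This is a simplified illustration in the spirit of FastPAM1, not the scikit-clarans implementation: the function name ``swap_deltas`` is hypothetical, Euclidean distance is assumed, at least two medoids are required, and the nearest and second-nearest distances are recomputed here even though a real implementation would typically cache them.

.. code-block:: python

   import numpy as np

   def swap_deltas(X, medoids, candidate):
       """Cost change of swapping each medoid with `candidate` (illustrative).

       Returns an array of length k; entry i is the change in total cost
       if medoid i were replaced by the candidate point.
       """
       k = len(medoids)  # assumes k >= 2
       rows = np.arange(len(X))
       # Distances from every point to each current medoid: shape (n, k).
       d_med = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=-1)
       order = np.argsort(d_med, axis=1)
       nearest = order[:, 0]
       d_near = d_med[rows, nearest]
       d_second = d_med[rows, order[:, 1]]
       # Distances from every point to the candidate: the only new O(n) work.
       d_cand = np.linalg.norm(X - X[candidate], axis=1)

       # If a point's nearest medoid is kept, it can only gain from the candidate.
       shared = np.minimum(d_cand - d_near, 0.0)
       deltas = np.full(k, shared.sum())
       # If its nearest medoid is removed, it falls back to the closer of the
       # candidate and its second-nearest medoid; correct that medoid's entry.
       correction = np.minimum(d_cand, d_second) - d_near - shared
       deltas += np.bincount(nearest, weights=correction, minlength=k)
       return deltas

   # Small demo: pick the best medoid to swap with one candidate point.
   demo_rng = np.random.default_rng(1)
   X_demo = demo_rng.normal(size=(40, 3))
   demo_medoids = np.array([0, 7, 19])
   deltas = swap_deltas(X_demo, demo_medoids, candidate=25)
   best_swap = int(np.argmin(deltas))

A negative entry in ``deltas`` means the corresponding swap lowers the total cost, so a single pass over the data scores k neighbors of the current node instead of one.
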