
Cluster Analysis Tutorial

In this tutorial, we’ll use SAMPLE as our session token. This token is unique to the training sandbox and will not work in other environments. See the BigID API Tutorial for information on authenticating with BigID.

To view the complete code for all steps, see the section labeled Code Samples.

For more information on the API capabilities used in this tutorial, check out the Data Catalog API Docs.

All API requests require authentication using a valid API key. Refer to the BigID Documentation to obtain your token. Then define the Authorization header using the format `Authorization: Bearer YOUR_API_KEY`. This header must be included in every request to ensure proper authentication and access to BigID’s API endpoints.
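In Python, building that header is a one-liner (SAMPLE is the sandbox token used throughout this tutorial):

```python
# Session token; replace SAMPLE with your real token outside the sandbox.
AUTH_TOKEN = "SAMPLE"

# This headers dict is passed with every request in the examples below.
headers = {
    "Authorization": f"Bearer {AUTH_TOKEN}",
    "Content-Type": "application/json",
}
print(headers["Authorization"])
```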

The easiest way to obtain the Cluster ID for the cluster you are interested in is through the BigID UI. On the Cluster Analysis page, you can find detailed information on all existing clusters for your organization’s data.

However, if you prefer to handle everything programmatically, you can use the GET /api/v1/clusters endpoint to retrieve the list of clusters via API. You can filter the results to find exactly what you’re looking for.

For example, the following request retrieves all clusters where the size equals 20:
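A sketch of building that request with only the standard library is shown below. Note that the exact filter grammar (`size=20` here) is an assumption; consult the Data Catalog API docs for the filter syntax your BigID version supports.

```python
from urllib.parse import urlencode

base_url = "https://developer.bigid.com/api/v1"
headers = {"Authorization": "Bearer SAMPLE"}

# Hypothetical filter expression; verify the grammar against the API docs.
params = {"filter": "size=20"}
clusters_url = f"{base_url}/clusters?{urlencode(params)}"
print(clusters_url)

# To actually send it (requires the requests package):
# response = requests.get(clusters_url, headers=headers)
```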

Your response will contain a variety of information for each cluster. Once you have located the cluster you are interested in, you will need the ID in order to proceed.
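As a sketch, pulling the ID out of the parsed response might look like the following. The field names (`clusters`, `id`) are assumptions for illustration; check the actual shape of the response returned by GET /api/v1/clusters.

```python
# Illustrative response shape; real field names may differ.
clusters_response = {
    "clusters": [
        {"id": "abc123", "size": 20},
    ]
}

# Grab the ID of the first matching cluster.
cluster_id = clusters_response["clusters"][0]["id"]
print(cluster_id)
```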

Once you have the id of the cluster you are interested in, you can review its columns using the GET /api/v1/data-catalog/column/cluster/similar endpoint. This endpoint returns a detailed list of every column that has been grouped into the specified cluster by BigID’s similarity engine. These columns may come from different tables, sources, or systems, but they share common structural, content-based, or metadata characteristics that indicate a high degree of similarity.

You can use this response to:

  • Audit similar columns across your data environment
  • Apply or verify tags (like sensitivity or classification labels)
  • Identify redundant or duplicate data
  • Export the columns for further analysis or reporting
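For example, the duplicate-data use case above can be sketched by grouping the returned columns by name and flagging names that appear in more than one source. The field names follow the response shape shown later in this tutorial; the sample data is made up.

```python
from collections import defaultdict

# Illustrative subset of a cluster's columns.
columns = [
    {"columnName": "ssn", "source": "hr-db"},
    {"columnName": "ssn", "source": "payroll-db"},
    {"columnName": "email", "source": "crm"},
]

# Group sources by column name.
by_name = defaultdict(list)
for col in columns:
    by_name[col["columnName"]].append(col["source"])

# Names present in more than one source are duplicate candidates.
duplicates = {name: srcs for name, srcs in by_name.items() if len(srcs) > 1}
print(duplicates)
```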

Depending on your needs, you can customize your request using optional query parameters to paginate results, apply filters, or limit the data returned. The specific query options for this endpoint are outlined below:

| Parameter   | Type    | Description                                             | Default      |
|-------------|---------|---------------------------------------------------------|--------------|
| clusterId   | string  | Required. The ID of the cluster to retrieve columns for. | N/A         |
| limit       | number  | Maximum number of results to return.                    | 10000        |
| skip        | number  | Number of results to skip (used for pagination).        | 0            |
| filter      | string  | Filter results based on column metadata.                | "" (empty)   |
| searchText  | string  | Search across column names and attributes.              | "" (empty)   |
| sort        | string  | Field to sort the results by.                           | _id          |
| offsetKey   | string  | Used for advanced pagination.                           | ""           |
| ignoreLimit | boolean | If true, ignores the limit parameter.                   | false        |
| sample      | number  | Number of columns to randomly sample from the cluster.  | null         |

All of the above options can be used to control the API response; however, they are all optional except for clusterId. In the example request, only the clusterId is provided. To test the endpoint, replace the placeholder clusterId with a real ID.
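The limit and skip parameters from the table above combine into a simple pagination loop. This is a sketch of the pattern only: fetch_page is a hypothetical stand-in for the real API call, demonstrated here with an in-memory list instead of HTTP.

```python
def paginate(fetch_page, limit=100):
    """Collect all results by advancing skip until a short page is returned.

    fetch_page is a stand-in for a call like
    GET .../column/cluster/similar?clusterId=...&limit=...&skip=...
    """
    results, skip = [], 0
    while True:
        page = fetch_page(limit=limit, skip=skip)
        results.extend(page)
        if len(page) < limit:  # short page means we've reached the end
            break
        skip += limit
    return results

# Demo with an in-memory stand-in for the endpoint:
data = list(range(250))
fake_fetch = lambda limit, skip: data[skip:skip + limit]
all_items = paginate(fake_fetch, limit=100)
print(len(all_items))  # 250
```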

A successful request will return a 200 OK response with a JSON payload containing metadata for all columns in the specified cluster. For example:

{
  "results": [
    {
      "fullyQualifiedName": "x.y.z",
      "columnName": "abc",
      "businessAttribute": "friendly name",
      "tableName": "my-table",
      "source": "my-sql-tables",
      "isPrimary": true,
      "dataType": "integer",
      "attributes": [
        {
          "attribute_original_name": "SSN",
          "attribute_name": "SSN",
          "rank": "High",
          "calc_confidence_level": 0.695781717492013,
          "attribute_type": "IDSoR Attribute",
          "attribute_id": "74b78d7040e97f1ab9ba3b69c8e372e3"
        }
      ],
      "tags": [
        {
          "tagId": "ba7a5426-64d6-47eb-995b-8fd8e1224de0",
          "valueId": "ba7a5426-64d6-47eb-995b-8fd8e1224de1",
          "isMutuallyExclusive": "true",
          "properties": {
            "hidden": false,
            "applicationType": "sensitivityClassification"
          },
          "tagName": "sensitivity",
          "tagValue": "high"
        }
      ]
    }
  ],
  "totalCount": 5
}
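Once you have this payload parsed, you can work with it directly. For instance, using the response shape shown above, the following sketch collects the columns that carry a high-sensitivity tag:

```python
# Trimmed-down version of the response shown above.
payload = {
    "results": [
        {
            "columnName": "abc",
            "tableName": "my-table",
            "tags": [{"tagName": "sensitivity", "tagValue": "high"}],
        }
    ],
    "totalCount": 5,
}

# Qualified names of columns tagged sensitivity=high.
high_sensitivity = [
    f"{col['tableName']}.{col['columnName']}"
    for col in payload["results"]
    if any(t["tagName"] == "sensitivity" and t["tagValue"] == "high"
           for t in col.get("tags", []))
]
print(high_sensitivity)  # ['my-table.abc']
```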

If your request fails, here’s what the server might tell you, and how to fix it:

| Status Code | Example Response | What It Means | How to Fix It |
|-------------|------------------|---------------|---------------|
| 200 | Successful response with column data | Everything’s looking good! | Keep cruising. |
| 400 | `{ "error": "Cluster ID is invalid" }` | Bad or malformed cluster ID provided | Double-check the cluster ID you’re using. |
| 404 | `{ "error": "Cluster 1234 was not found" }` | Cluster ID doesn’t exist | Make sure the ID is valid and fetched from the clusters endpoint. |
| 401 | Unauthorized | API key missing or invalid | Verify your API key and Authorization header. |
| 500 | `{ "status": "error", "message": "Server error", "errors": [{}] }` | BigID server hit a snag (internal error) | Wait a moment and retry. If it persists, reach out to support. |
Code Samples

Python:

# Cluster Analysis API Tutorial

import requests
import json

# --- 1. Setup and Authentication ---

# Base URL of the BigID API (training sandbox)
base_url = "https://developer.bigid.com/api/v1"

# Session token (replace SAMPLE with your actual session token)
AUTH_TOKEN = "SAMPLE"

headers = {
    "Authorization": f"Bearer {AUTH_TOKEN}",
    "Content-Type": "application/json"
}

try:
    # 2. Provide the Cluster ID once obtained
    target_cluster_id = "cluster_id_here"

    # 3. Get all columns for the provided cluster
    columns_url = f"{base_url}/data-catalog/column/cluster/similar"

    # The clusterId is a required parameter for this endpoint.
    # You can also add other optional parameters like 'limit'.
    params = {
        "clusterId": target_cluster_id,
        "limit": 20  # Limit the results for this example
    }

    response_cols = requests.get(columns_url, headers=headers, params=params)
    response_cols.raise_for_status()
    columns_data = response_cols.json()

    print("Successfully retrieved columns for the cluster.")
    print(f"Total columns in cluster: {columns_data.get('totalCount')}")
    print("Sample of retrieved columns:")
    print(json.dumps(columns_data.get('results', []), indent=2))
except requests.exceptions.HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}")
    if http_err.response is not None:
        print(f"Response content: {http_err.response.text}")
except Exception as err:
    print(f"Another error occurred: {err}")
JavaScript:

// Cluster Analysis API Tutorial

// --- 1. Setup and Authentication ---

// Base URL of the BigID API (training sandbox)
const baseUrl = "https://developer.bigid.com/api/v1";

// Session token (replace SAMPLE with your actual session token)
const AUTH_TOKEN = "SAMPLE";

const headers = {
    "Authorization": `Bearer ${AUTH_TOKEN}`,
    "Content-Type": "application/json"
};

async function runDataCatalogWorkflow() {
    try {
        // 2. Provide the Cluster ID from the BigID UI
        const targetClusterId = "cluster_id_here";

        // 3. Get all columns for the provided cluster
        const params = new URLSearchParams({
            clusterId: targetClusterId,
            limit: 20 // Limit the results for this example
        });
        const columnsUrl = `${baseUrl}/data-catalog/column/cluster/similar?${params}`;

        const colsResponse = await fetch(columnsUrl, { headers });
        if (!colsResponse.ok) {
            throw new Error(`HTTP error fetching columns! Status: ${colsResponse.status}`);
        }
        const columnsData = await colsResponse.json();

        console.log("Successfully retrieved columns for the cluster.");
        console.log(`Total columns in cluster: ${columnsData.totalCount}`);
        console.log("Sample of retrieved columns:");
        console.log(JSON.stringify(columnsData.results || [], null, 2));
    } catch (error) {
        console.error("An error occurred during the workflow:", error.message);
    }
}

// Run the entire workflow
runDataCatalogWorkflow();

Congratulations! In this tutorial, you have learned how to authenticate with BigID, get a cluster ID, and retrieve all columns in that cluster using the API. Now you can easily review and manage similar columns across your data environment.