Cluster Analysis API Tutorial

From BigID Developer Portal

In this article, you'll learn:

  • How to authenticate API requests using an API key.
  • How to obtain the cluster ID for the cluster of interest.
  • How to retrieve all columns in a cluster using its cluster ID.
  • How to parse and interpret the response data for those columns.
  • How to troubleshoot common issues like failed requests or invalid IDs.


Scenario

Your organization uses BigID’s clustering capabilities to automatically group similar columns across your environment based on structure, content, and metadata. Now, you want to review all the columns that belong to a specific cluster. Instead of manually searching through tables one by one, you can streamline this process using BigID’s Cluster Analysis API endpoints to retrieve all related columns at once for inspection, tagging, or export.

In this tutorial, we'll use SAMPLE as our session token. This is unique to the training sandbox and will not work in other environments. See BigID API/Tutorial for information on authenticating with BigID.

To view the complete code for all steps, see the section labelled Code Samples.

For more information on the API capabilities used in this tutorial, check out the Data Catalog API Docs.

1. Authenticate Using Your API Key

All API requests require authentication using a valid API key. Refer to BigID Documentation to obtain your token. Then, define the Authorization header using the format `Authorization: Bearer YOUR_API_KEY`. This header must be included in every request to ensure proper authentication and access to BigID’s API endpoints. Throughout the tutorial, we will be using SAMPLE as our token.
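As a minimal Python sketch, the header can be built once and reused for every request. The helper name `build_headers` is illustrative and not part of any BigID SDK; SAMPLE is the sandbox token used throughout this tutorial.

```python
def build_headers(token: str) -> dict:
    """Return the headers to send with every BigID API request."""
    if not token or not token.strip():
        raise ValueError("An API token is required")
    return {
        "Authorization": f"Bearer {token.strip()}",
        "Content-Type": "application/json",
    }


# SAMPLE only works in the training sandbox; substitute your own token elsewhere.
headers = build_headers("SAMPLE")
```

Failing early on an empty token avoids sending a request that can only come back as 401.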

2. Obtain Cluster ID

The easiest way to obtain the Cluster ID for the cluster you are interested in is through the BigID UI. On the Cluster Analysis page, you can find detailed information on all existing clusters for your organization’s data.

However, if you prefer to handle everything programmatically, you can use the GET /api/v1/clusters endpoint to retrieve the list of clusters via API. You can filter the results to find exactly what you’re looking for.

For example, the following request retrieves all clusters where the size equals 20:

Your response will contain a variety of information for each cluster. Once you have located the cluster you are interested in, you will need the ID in order to proceed.
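If you prefer to pick the cluster out of the parsed response in code, a small client-side filter is enough. The sketch below is an assumption-heavy illustration: the field names `id` and `size`, the helper `find_clusters_by_size`, and the example records are all hypothetical and are not taken from the documented response schema, so adjust them to match what your response actually contains.

```python
def find_clusters_by_size(clusters, size, size_key="size", id_key="id"):
    """Return the IDs of clusters whose size equals `size`.

    `size_key` and `id_key` are assumed field names; change them to
    match the fields in your actual response.
    """
    return [
        c[id_key]
        for c in clusters
        if c.get(size_key) == size and id_key in c
    ]


# Hypothetical, hand-written records standing in for a parsed response.
example_clusters = [
    {"id": "cluster-a", "size": 20},
    {"id": "cluster-b", "size": 7},
    {"id": "cluster-c", "size": 20},
]

matching_ids = find_clusters_by_size(example_clusters, 20)
print(matching_ids)
```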

3. Get All Columns

Once you have the ID of the cluster you are interested in, you can review its columns using the GET /api/v1/data-catalog/column/cluster/similar endpoint. This endpoint returns a detailed list of every column that BigID’s similarity engine has grouped into the specified cluster. These columns may come from different tables, sources, or systems, but they share structural, content-based, or metadata characteristics that indicate a high degree of similarity.

You can use this response to:

  • Audit similar columns across your data environment
  • Apply or verify tags (like sensitivity or classification labels)
  • Identify redundant or duplicate data
  • Export the columns for further analysis or reporting
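For the export case, here is a minimal sketch that turns column records into CSV text using only the Python standard library. The field names come from the sample response shown later in this tutorial; the helper `columns_to_csv` and the choice of exported fields are illustrative.

```python
import csv
import io


def columns_to_csv(results):
    """Serialize column records to CSV text, one row per column."""
    fields = ["fullyQualifiedName", "columnName", "tableName", "source", "dataType"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for column in results:
        # Missing fields become empty cells instead of raising KeyError.
        writer.writerow({name: column.get(name, "") for name in fields})
    return buffer.getvalue()


example = [
    {
        "fullyQualifiedName": "x.y.z",
        "columnName": "abc",
        "tableName": "my-table",
        "source": "my-sql-tables",
        "dataType": "integer",
    }
]
csv_text = columns_to_csv(example)
print(csv_text)
```

Nested fields such as attributes and tags are left out here because they do not map onto a single CSV cell without a flattening rule of your own.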

Query Parameters

Depending on your needs, you can customize your request using optional query parameters to paginate results, apply filters, or limit the data returned. The specific query options for this endpoint are outlined below:

Parameter | Type | Description | Default
clusterId | string | Required. The ID of the cluster to retrieve columns for. | N/A
limit | number | Maximum number of results to return. | 10000
skip | number | Number of results to skip (used for pagination). | 0
filter | string | Filter results based on column metadata. | "" (empty)
searchText | string | Search across column names and attributes. | "" (empty)
sort | string | Field to sort the results by. | _id
offsetKey | string | Used for advanced pagination. | ""
ignoreLimit | boolean | If true, ignores the limit parameter. | false
sample | number | Number of columns to randomly sample from the cluster. | null

All of these parameters control the API response, but only clusterId is required. In the request below, only the clusterId is provided. To test the endpoint, replace the placeholder clusterId with a real ID.
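The limit and skip parameters can be combined to page through a large cluster. The following is a minimal sketch, assuming that skip is a plain offset and that a page shorter than limit marks the end of the results; this tutorial does not confirm either behavior. The helpers `build_columns_url` and `iter_columns`, and the `fetch_json` callable, are illustrative names.

```python
from urllib.parse import urlencode

# Training sandbox base URL used in this tutorial.
BASE_URL = "https://developer.bigid.com/api/v1"


def build_columns_url(cluster_id, limit=100, skip=0):
    """Build the request URL for one page of columns in a cluster."""
    query = urlencode({"clusterId": cluster_id, "limit": limit, "skip": skip})
    return f"{BASE_URL}/data-catalog/column/cluster/similar?{query}"


def iter_columns(fetch_json, cluster_id, page_size=100):
    """Yield every column in the cluster, requesting one page at a time.

    `fetch_json` is any callable that takes a URL and returns the parsed
    JSON body, for example a thin wrapper around an HTTP client.
    """
    skip = 0
    while True:
        page = fetch_json(build_columns_url(cluster_id, page_size, skip))
        results = page.get("results", [])
        yield from results
        if len(results) < page_size:
            break  # a short or empty page means nothing is left
        skip += page_size
```

Passing the HTTP call in as `fetch_json` keeps the paging logic independent of the client library, so it can be exercised with a stub before pointing it at a live environment.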

A successful request will return a 200 OK response with a JSON payload containing metadata for all columns in the specified cluster. For example:

 
{
  "results": [
    {
      "fullyQualifiedName": "x.y.z",
      "columnName": "abc",
      "businessAttribute": "friendly name",
      "tableName": "my-table",
      "source": "my-sql-tables",
      "isPrimary": true,
      "dataType": "integer",
      "attributes": [
        {
          "attribute_original_name": "SSN",
          "attribute_name": "SSN",
          "rank": "High",
          "calc_confidence_level": 0.695781717492013,
          "attribute_type": "IDSoR Attribute",
          "attribute_id": "74b78d7040e97f1ab9ba3b69c8e372e3"
        }
      ],
      "tags": [
        {
          "tagId": "ba7a5426-64d6-47eb-995b-8fd8e1224de0",
          "valueId": "ba7a5426-64d6-47eb-995b-8fd8e1224de1",
          "isMutuallyExclusive": "true",
          "properties": {
            "hidden": false,
            "applicationType": "sensitivityClassification"
          },
          "tagName": "sensitivity",
          "tagValue": "high"
        }
      ]
    }
  ],
  "totalCount": 5
}
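To make the payload easier to review, each column record can be reduced to the fields that matter for auditing. The sketch below uses a trimmed copy of the sample response above; `summarize_column` is an illustrative helper, and rounding the confidence to two decimals is just a display choice.

```python
def summarize_column(column):
    """Reduce one column record to the fields most useful for review."""
    return {
        "column": column.get("fullyQualifiedName"),
        "attributes": [
            (a.get("attribute_name"), round(a.get("calc_confidence_level", 0.0), 2))
            for a in column.get("attributes", [])
        ],
        "tags": {t.get("tagName"): t.get("tagValue") for t in column.get("tags", [])},
    }


# Trimmed copy of the sample response shown above.
payload = {
    "results": [
        {
            "fullyQualifiedName": "x.y.z",
            "attributes": [
                {"attribute_name": "SSN", "calc_confidence_level": 0.695781717492013}
            ],
            "tags": [{"tagName": "sensitivity", "tagValue": "high"}],
        }
    ],
    "totalCount": 5,
}

summaries = [summarize_column(c) for c in payload.get("results", [])]
print(summaries)
```

Note that in the sample, totalCount (5) is larger than the number of records in results, so it is safer to count the records you actually received than to rely on totalCount alone.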

4. Troubleshooting

If your request fails, here’s what the server might tell you, and how to fix it:

Status Code | Example Response | What It Means | How to Fix It
200 | Successful response with column data | Everything’s looking good! | Keep cruising.
400 | { "error": "Scan ID is invalid" } | Bad or malformed ID provided | Double-check the cluster ID you’re using.
404 | { "error": "Scan 1234 was not found" } | ID doesn’t exist | Make sure the cluster ID is valid and was fetched from the clusters endpoint.
401 | Unauthorized | API key missing or invalid | Verify your API key and authorization header.
500 | { "status": "error", "message": "Server error", "errors": [{}] } | BigID server hit a snag (internal error) | Wait a moment and retry. If it persists, reach out to support.
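The "wait a moment and retry" advice for 500 responses can be automated with a small retry loop. This is a sketch under stated assumptions: `call_with_retry`, `explain`, and the `send` callable are hypothetical helpers, the backoff schedule is an arbitrary choice, and the hint texts paraphrase the table above.

```python
import time


def call_with_retry(send, retries=3, delay=1.0, sleep=time.sleep):
    """Call `send()` until it returns a non-500 status or retries run out.

    `send` is any zero-argument callable returning (status_code, body).
    """
    for attempt in range(retries + 1):
        status, body = send()
        if status != 500 or attempt == retries:
            return status, body
        sleep(delay * (2 ** attempt))  # exponential backoff: delay, 2x, 4x, ...


HINTS = {
    400: "Check that the ID in the request is well formed.",
    401: "Verify your API key and the Authorization header.",
    404: "Confirm that the ID exists.",
    500: "Retry later; contact support if the error persists.",
}


def explain(status):
    """Map a status code to a short troubleshooting hint."""
    if status == 200:
        return "OK"
    return HINTS.get(status, f"Unexpected status {status}")
```

Only 500 is retried here, because 400, 401, and 404 point to a problem with the request itself that a retry will not fix.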

Code Samples

# Cluster Analysis API Tutorial
import requests
import json

# --- 1. Setup and Authentication ---
# Base URL of the BigID API (training sandbox)
base_url = "https://developer.bigid.com/api/v1"
# Session token (replace SAMPLE with actual session token)
AUTH_TOKEN = "SAMPLE" 

headers = {
    "Authorization": f"Bearer {AUTH_TOKEN}",
    "Content-Type": "application/json"
}

try:
    # 2. Provide the Cluster ID once obtained
    target_cluster_id = "cluster_id_here"

    # 3. Get All Columns for the provided cluster
    columns_url = f"{base_url}/data-catalog/column/cluster/similar"
    
    # The clusterId is a required parameter for this endpoint.
    # You can also add other optional parameters like 'limit'.
    params = {
        'clusterId': target_cluster_id,
        'limit': 20 # Limit the results for this example
    }
        
    response_cols = requests.get(columns_url, headers=headers, params=params)
    response_cols.raise_for_status()
    
    columns_data = response_cols.json()
       
    print("Successfully retrieved columns for the cluster.")
    print(f"Total columns in cluster: {columns_data.get('totalCount')}")
    print("Sample of retrieved columns:")
    print(json.dumps(columns_data.get('results', []), indent=2))

except requests.exceptions.HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}")
    # Compare against None: a Response with a 4xx/5xx status is falsy.
    if http_err.response is not None:
        print(f"Response content: {http_err.response.text}")
except Exception as err:
    print(f"Another error occurred: {err}")

// Cluster Analysis API Tutorial

// --- 1. Setup and Authentication ---
// Base URL of the BigID API (training sandbox)
const base_url = "https://developer.bigid.com/api/v1";
// Session token (replace SAMPLE with actual session token)
const AUTH_TOKEN = "SAMPLE";

const headers = {
    "Authorization": `Bearer ${AUTH_TOKEN}`,
    "Content-Type": "application/json"
};

async function runDataCatalogWorkflow() {
    try {
        // 2. Provide the Cluster ID from the BigID UI
        const targetClusterId = "cluster_id_here";

        // 3. Get All Columns for the provided cluster
        const params = new URLSearchParams({
            clusterId: targetClusterId,
            limit: 20 // Limit the results for this example
        });
        
        const columnsUrl = `${base_url}/data-catalog/column/cluster/similar?${params}`;

        const colsResponse = await fetch(columnsUrl, { headers });
        if (!colsResponse.ok) {
            throw new Error(`HTTP error fetching columns! Status: ${colsResponse.status}`);
        }
        
        const columnsData = await colsResponse.json();

        console.log("Successfully retrieved columns for the cluster.");
        console.log(`Total columns in cluster: ${columnsData.totalCount}`);
        console.log("Sample of retrieved columns:");
        console.log(JSON.stringify(columnsData.results || [], null, 2));
        
    } catch (error) {
        console.error("An error occurred during the workflow:", error.message);
    }
}

// Run the entire workflow
runDataCatalogWorkflow();

Summary

Congratulations! In this tutorial, you have learned how to authenticate with BigID, get a cluster ID, and retrieve all columns in that cluster using the API. Now you can easily review and manage similar columns across your data environment.