
Cluster Analysis Tutorial

In this tutorial, we’ll use SAMPLE as our session token. This token is unique to the training sandbox and will not work in other environments. See the BigID API Tutorial for information on authenticating with BigID.

To view the complete code for all steps, see the section labeled Code Samples.

For more information on the API capabilities used in this tutorial, check out the Data Catalog API Docs.

All API requests require authentication using a valid API key. Refer to the BigID Documentation to obtain your token. Then define the Authorization header using the format `Authorization: Bearer YOUR_API_KEY`. This header must be included in every request to ensure proper authentication and access to BigID’s API endpoints.
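In Python, building that header is a one-liner (SAMPLE is the sandbox token used throughout this tutorial):

```python
# Session token; replace SAMPLE with your real token outside the sandbox.
AUTH_TOKEN = "SAMPLE"

# This headers dict is passed with every request in the examples below.
headers = {
    "Authorization": f"Bearer {AUTH_TOKEN}",
    "Content-Type": "application/json",
}
print(headers["Authorization"])
```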

The easiest way to obtain the Cluster ID for the cluster you are interested in is through the BigID UI. On the Cluster Analysis page, you can find detailed information on all existing clusters for your organization’s data.

However, if you prefer to handle everything programmatically, you can use the GET /api/v1/clusters endpoint to retrieve the list of clusters via API. You can filter the results to find exactly what you’re looking for.

For example, the following request retrieves all clusters where the size equals 20:
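A sketch of building that request with only the standard library is shown below. Note that the exact filter grammar (`size=20` here) is an assumption; consult the Data Catalog API docs for the filter syntax your BigID version supports.

```python
from urllib.parse import urlencode

base_url = "https://developer.bigid.com/api/v1"
headers = {"Authorization": "Bearer SAMPLE"}

# Hypothetical filter expression; verify the grammar against the API docs.
params = {"filter": "size=20"}
clusters_url = f"{base_url}/clusters?{urlencode(params)}"
print(clusters_url)

# To actually send it (requires the requests package):
# response = requests.get(clusters_url, headers=headers)
```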

Your response will contain a variety of information for each cluster. Once you have located the cluster you are interested in, you will need the ID in order to proceed.
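As a sketch, pulling the ID out of the parsed response might look like the following. The field names (`clusters`, `id`) are assumptions for illustration; check the actual shape of the response returned by GET /api/v1/clusters.

```python
# Illustrative response shape; real field names may differ.
clusters_response = {
    "clusters": [
        {"id": "abc123", "size": 20},
    ]
}

# Grab the ID of the first matching cluster.
cluster_id = clusters_response["clusters"][0]["id"]
print(cluster_id)
```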

Once you have the id of the cluster you are interested in, you can review its columns using the GET /api/v1/data-catalog/column/cluster/similar endpoint. This endpoint returns a detailed list of every column that has been grouped into the specified cluster by BigID’s similarity engine. These columns may come from different tables, sources, or systems, but they share common structural, content-based, or metadata characteristics that indicate a high degree of similarity.

You can use this response to:

  • Audit similar columns across your data environment
  • Apply or verify tags (like sensitivity or classification labels)
  • Identify redundant or duplicate data
  • Export the columns for further analysis or reporting
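For example, the duplicate-data use case above can be sketched by grouping the returned columns by name and flagging names that appear in more than one source. The field names follow the response shape shown later in this tutorial; the sample data is made up.

```python
from collections import defaultdict

# Illustrative subset of a cluster's columns.
columns = [
    {"columnName": "ssn", "source": "hr-db"},
    {"columnName": "ssn", "source": "payroll-db"},
    {"columnName": "email", "source": "crm"},
]

# Group sources by column name.
by_name = defaultdict(list)
for col in columns:
    by_name[col["columnName"]].append(col["source"])

# Names present in more than one source are duplicate candidates.
duplicates = {name: srcs for name, srcs in by_name.items() if len(srcs) > 1}
print(duplicates)
```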

Depending on your needs, you can customize your request using optional query parameters to paginate results, apply filters, or limit the data returned. The specific query options for this endpoint are outlined below:

| Parameter   | Type    | Description                                             | Default      |
|-------------|---------|---------------------------------------------------------|--------------|
| clusterId   | string  | Required. The ID of the cluster to retrieve columns for. | N/A         |
| limit       | number  | Maximum number of results to return.                    | 10000        |
| skip        | number  | Number of results to skip (used for pagination).        | 0            |
| filter      | string  | Filter results based on column metadata.                | "" (empty)   |
| searchText  | string  | Search across column names and attributes.              | "" (empty)   |
| sort        | string  | Field to sort the results by.                           | _id          |
| offsetKey   | string  | Used for advanced pagination.                           | ""           |
| ignoreLimit | boolean | If true, ignores the limit parameter.                   | false        |
| sample      | number  | Number of columns to randomly sample from the cluster.  | null         |

All of the above options can be used to control the API response; however, they are all optional except for clusterId. In the example request, only the clusterId is provided. To test the endpoint, replace the placeholder clusterId with a real ID.
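The limit and skip parameters from the table above combine into a simple pagination loop. This is a sketch of the pattern only: fetch_page is a hypothetical stand-in for the real API call, demonstrated here with an in-memory list instead of HTTP.

```python
def paginate(fetch_page, limit=100):
    """Collect all results by advancing skip until a short page is returned.

    fetch_page is a stand-in for a call like
    GET .../column/cluster/similar?clusterId=...&limit=...&skip=...
    """
    results, skip = [], 0
    while True:
        page = fetch_page(limit=limit, skip=skip)
        results.extend(page)
        if len(page) < limit:  # short page means we've reached the end
            break
        skip += limit
    return results

# Demo with an in-memory stand-in for the endpoint:
data = list(range(250))
fake_fetch = lambda limit, skip: data[skip:skip + limit]
all_items = paginate(fake_fetch, limit=100)
print(len(all_items))  # 250
```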

A successful request will return a 200 OK response with a JSON payload containing metadata for all columns in the specified cluster. For example:

{
  "results": [
    {
      "fullyQualifiedName": "x.y.z",
      "columnName": "abc",
      "businessAttribute": "friendly name",
      "tableName": "my-table",
      "source": "my-sql-tables",
      "isPrimary": true,
      "dataType": "integer",
      "attributes": [
        {
          "attribute_original_name": "SSN",
          "attribute_name": "SSN",
          "rank": "High",
          "calc_confidence_level": 0.695781717492013,
          "attribute_type": "IDSoR Attribute",
          "attribute_id": "74b78d7040e97f1ab9ba3b69c8e372e3"
        }
      ],
      "tags": [
        {
          "tagId": "ba7a5426-64d6-47eb-995b-8fd8e1224de0",
          "valueId": "ba7a5426-64d6-47eb-995b-8fd8e1224de1",
          "isMutuallyExclusive": "true",
          "properties": {
            "hidden": false,
            "applicationType": "sensitivityClassification"
          },
          "tagName": "sensitivity",
          "tagValue": "high"
        }
      ]
    }
  ],
  "totalCount": 5
}
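Once you have this payload parsed, you can work with it directly. For instance, using the response shape shown above, the following sketch collects the columns that carry a high-sensitivity tag:

```python
# Trimmed-down version of the response shown above.
payload = {
    "results": [
        {
            "columnName": "abc",
            "tableName": "my-table",
            "tags": [{"tagName": "sensitivity", "tagValue": "high"}],
        }
    ],
    "totalCount": 5,
}

# Qualified names of columns tagged sensitivity=high.
high_sensitivity = [
    f"{col['tableName']}.{col['columnName']}"
    for col in payload["results"]
    if any(t["tagName"] == "sensitivity" and t["tagValue"] == "high"
           for t in col.get("tags", []))
]
print(high_sensitivity)  # ['my-table.abc']
```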

If your request fails, here’s what the server might tell you, and how to fix it:

| Status Code | Example Response | What It Means | How to Fix It |
|-------------|------------------|---------------|---------------|
| 200 | Successful response with column data | Everything’s looking good! | Keep cruising. |
| 400 | `{ "error": "Cluster ID is invalid" }` | Bad or malformed cluster ID provided | Double-check the cluster ID you’re using. |
| 404 | `{ "error": "Cluster 1234 was not found" }` | Cluster ID doesn’t exist | Make sure the ID is valid and fetched from the clusters endpoint. |
| 401 | Unauthorized | API key missing or invalid | Verify your API key and Authorization header. |
| 500 | `{ "status": "error", "message": "Server error", "errors": [{}] }` | BigID server hit a snag (internal error) | Wait a moment and retry. If it persists, reach out to support. |
Code Samples

Python:

# Cluster Analysis API Tutorial

import requests
import json

# --- 1. Setup and Authentication ---

# Base URL of the BigID API (training sandbox)
base_url = "https://developer.bigid.com/api/v1"

# Session token (replace SAMPLE with your actual session token)
AUTH_TOKEN = "SAMPLE"

headers = {
    "Authorization": f"Bearer {AUTH_TOKEN}",
    "Content-Type": "application/json"
}

try:
    # 2. Provide the Cluster ID once obtained
    target_cluster_id = "cluster_id_here"

    # 3. Get all columns for the provided cluster
    columns_url = f"{base_url}/data-catalog/column/cluster/similar"

    # The clusterId is a required parameter for this endpoint.
    # You can also add other optional parameters like 'limit'.
    params = {
        "clusterId": target_cluster_id,
        "limit": 20  # Limit the results for this example
    }

    response_cols = requests.get(columns_url, headers=headers, params=params)
    response_cols.raise_for_status()
    columns_data = response_cols.json()

    print("Successfully retrieved columns for the cluster.")
    print(f"Total columns in cluster: {columns_data.get('totalCount')}")
    print("Sample of retrieved columns:")
    print(json.dumps(columns_data.get('results', []), indent=2))
except requests.exceptions.HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}")
    if http_err.response is not None:
        print(f"Response content: {http_err.response.text}")
except Exception as err:
    print(f"Another error occurred: {err}")
JavaScript:

// Cluster Analysis API Tutorial

// --- 1. Setup and Authentication ---

// Base URL of the BigID API (training sandbox)
const baseUrl = "https://developer.bigid.com/api/v1";

// Session token (replace SAMPLE with your actual session token)
const AUTH_TOKEN = "SAMPLE";

const headers = {
    "Authorization": `Bearer ${AUTH_TOKEN}`,
    "Content-Type": "application/json"
};

async function runDataCatalogWorkflow() {
    try {
        // 2. Provide the Cluster ID from the BigID UI
        const targetClusterId = "cluster_id_here";

        // 3. Get all columns for the provided cluster
        const params = new URLSearchParams({
            clusterId: targetClusterId,
            limit: 20 // Limit the results for this example
        });
        const columnsUrl = `${baseUrl}/data-catalog/column/cluster/similar?${params}`;

        const colsResponse = await fetch(columnsUrl, { headers });
        if (!colsResponse.ok) {
            throw new Error(`HTTP error fetching columns! Status: ${colsResponse.status}`);
        }
        const columnsData = await colsResponse.json();

        console.log("Successfully retrieved columns for the cluster.");
        console.log(`Total columns in cluster: ${columnsData.totalCount}`);
        console.log("Sample of retrieved columns:");
        console.log(JSON.stringify(columnsData.results || [], null, 2));
    } catch (error) {
        console.error("An error occurred during the workflow:", error.message);
    }
}

// Run the entire workflow
runDataCatalogWorkflow();

Congratulations! In this tutorial, you have learned how to authenticate with BigID, get a cluster ID, and retrieve all columns in that cluster using the API. Now you can easily review and manage similar columns across your data environment.