Cluster Analysis API Tutorial
This tutorial covers:
- How to authenticate API requests using an API key.
- How to obtain the cluster ID(s) for clusters of interest.
- How to retrieve all columns grouped into a cluster using its cluster ID.
- How to parse and interpret the response data.
- How to troubleshoot common issues like failed requests or missing data.
In this tutorial, we'll use SAMPLE as our session token. This is unique to the training sandbox and will not work in other environments. See BigID API/Tutorial for information on authenticating with BigID.
To view the complete code for all steps, see the section labelled Code Samples.
For more information on the API capabilities used in this tutorial, check out the Data Catalog API Docs.
1. Authenticate Using Your API Key
All API requests require authentication using a valid API key. Refer to BigID Documentation to obtain your token. Then, define the Authorization header using the format `Authorization: Bearer YOUR_API_KEY`. This header must be included in every request to ensure proper authentication and access to BigID’s API endpoints. Throughout the tutorial, we will be using SAMPLE as our token.
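As a minimal sketch, you can define this header once and reuse it on every request. The only BigID-specific piece here is the header format itself; `SAMPLE` is the sandbox token used throughout this tutorial, so substitute your real API key in other environments:

```python
# Build the Authorization header once and reuse it for every request.
API_KEY = "SAMPLE"  # sandbox token; replace outside the training sandbox

def auth_headers(api_key: str) -> dict:
    """Return the Authorization header BigID expects on every API call."""
    return {"Authorization": f"Bearer {api_key}"}

headers = auth_headers(API_KEY)
print(headers)  # {'Authorization': 'Bearer SAMPLE'}
```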
2. Obtain Cluster ID
The easiest way to obtain the Cluster ID for the cluster you are interested in is through the BigID UI. On the Cluster Analysis page, you can find detailed information on all existing clusters for your organization’s data.
However, if you prefer to handle everything programmatically, you can use the GET /api/v1/clusters endpoint to retrieve the list of clusters via API. You can filter the results to find exactly what you’re looking for.
For example, the following request retrieves all clusters where the size equals 20:
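A sketch of this request using Python's standard library. The host name is a placeholder and the `size=20` filter syntax is an assumption; check your deployment's filter format:

```python
import urllib.parse

BASE_URL = "https://your-bigid-host/api/v1"  # hypothetical host
HEADERS = {"Authorization": "Bearer SAMPLE"}

def clusters_url(filter_expr: str) -> str:
    """Build the GET /api/v1/clusters URL with a filter query parameter."""
    return f"{BASE_URL}/clusters?" + urllib.parse.urlencode({"filter": filter_expr})

print(clusters_url("size=20"))
# https://your-bigid-host/api/v1/clusters?filter=size%3D20
# Send it with urllib.request.Request(url, headers=HEADERS) and urlopen,
# or any HTTP client of your choice.
```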
Your response will contain a variety of information for each cluster. Once you have located the one you are interested in, you will need the ID in order to proceed.
[
  {
    "id": "abc123def456",
    "name": "Sensitive HR Data",
    "size": 20,
    "labels": ["PII", "Confidential"],
    "severity": "High",
    "status": "Active",
    "createdAt": "2025-07-15T14:22:30.000Z"
  },
  {
    "id": "ghi789jkl012",
    "name": "Marketing Metrics",
    "size": 20,
    "labels": ["Analytics"],
    "severity": "Low",
    "status": "Archived",
    "createdAt": "2025-06-28T09:08:12.000Z"
  }
]
3. Get All Columns
Once you have the id of the cluster you are interested in, you can review its columns using the GET /api/v1/data-catalog/column/cluster/similar endpoint. This endpoint returns a detailed list of every column that has been grouped into the specified cluster by BigID’s similarity engine. These columns may come from different tables, sources, or systems, but they share common structural, content-based, or metadata characteristics that indicate a high degree of similarity.
You can use this response to:
- Audit similar columns across your data environment
- Apply or verify tags (like sensitivity or classification labels)
- Identify redundant or duplicate data
- Export the columns for further analysis or reporting
Query Parameters
Depending on your needs, you can customize your request using optional query parameters to paginate results, apply filters, or limit the data returned. The specific query options for this endpoint are outlined below:
Parameter | Type | Description | Default |
---|---|---|---|
clusterId | string | Required. The ID of the cluster to retrieve columns for. | N/A |
limit | number | Maximum number of results to return. | 10000 |
skip | number | Number of results to skip (used for pagination). | 0 |
filter | string | Filter results based on column metadata. | "" (empty) |
searchText | string | Search across column names and attributes. | "" (empty) |
sort | string | Field to sort the results by. | _id |
offsetKey | string | Used for advanced pagination. | "" |
ignoreLimit | boolean | If true, ignores the limit parameter. | false |
sample | number | Number of columns to randomly sample from the cluster. | null |
All of the above parameters can be used to control the API response, but only the clusterId is required. In the request below, only the clusterId is provided. To test the endpoint, replace the placeholder clusterId with a real id from step 2.
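A sketch of this request, again using the standard library. The host and the example clusterId are placeholders; the optional query parameters from the table above can be passed as keyword arguments:

```python
import urllib.parse

BASE_URL = "https://your-bigid-host/api/v1"  # hypothetical host
HEADERS = {"Authorization": "Bearer SAMPLE"}

def similar_columns_url(cluster_id: str, **optional) -> str:
    """Build the columns-by-cluster URL; extra query parameters
    (limit, skip, filter, ...) can be passed as keyword arguments."""
    params = {"clusterId": cluster_id, **optional}
    return (f"{BASE_URL}/data-catalog/column/cluster/similar?"
            + urllib.parse.urlencode(params))

print(similar_columns_url("abc123def456"))
# .../data-catalog/column/cluster/similar?clusterId=abc123def456
print(similar_columns_url("abc123def456", limit=100, skip=0))
```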
A successful request will return a 200 OK response with a JSON payload containing metadata for all columns in the specified cluster. For example:
{
"results": [
{
"fullyQualifiedName": "x.y.z",
"columnName": "abc",
"businessAttribute": "friendly name",
"tableName": "my-table",
"source": "my-sql-tables",
"isPrimary": true,
"dataType": "integer",
"attributes": [
{
"attribute_original_name": "SSN",
"attribute_name": "SSN",
"rank": "High",
"calc_confidence_level": 0.695781717492013,
"attribute_type": "IDSoR Attribute",
"attribute_id": "74b78d7040e97f1ab9ba3b69c8e372e3"
}
],
"tags": [
{
"tagId": "ba7a5426-64d6-47eb-995b-8fd8e1224de0",
"valueId": "ba7a5426-64d6-47eb-995b-8fd8e1224de1",
"isMutuallyExclusive": "true",
"properties": {
"hidden": false,
"applicationType": "sensitivityClassification"
},
"tagName": "sensitivity",
"tagValue": "high"
}
]
}
],
"totalCount": 5
}
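For example, you could walk that payload to list each column along with its tag values. This is a small parsing sketch over a trimmed-down version of the sample response above:

```python
def summarize_columns(payload: dict) -> list:
    """Return (column name, [tag values]) pairs from a
    /data-catalog/column/cluster/similar response payload."""
    summary = []
    for col in payload.get("results", []):
        tags = [f"{t['tagName']}={t['tagValue']}" for t in col.get("tags", [])]
        summary.append((col["columnName"], tags))
    return summary

# Trimmed-down version of the sample response above.
sample = {
    "results": [
        {"columnName": "abc",
         "tags": [{"tagName": "sensitivity", "tagValue": "high"}]},
    ],
    "totalCount": 5,
}
print(summarize_columns(sample))  # [('abc', ['sensitivity=high'])]
```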
4. Troubleshooting
If your request fails, here’s what the server might tell you, and how to fix it:
Status Code | Example Response | What It Means | How to Fix It |
---|---|---|---|
200 | Successful response with cluster data | Everything’s looking good! | Keep cruising. |
400 | { "error": "Cluster ID is invalid" } | Bad or malformed cluster ID provided | Double-check the cluster ID you’re using. |
404 | { "error": "Cluster abc123def456 was not found" } | Cluster ID doesn’t exist | Make sure the ID is valid and fetched from the GET /api/v1/clusters endpoint. |
401 | Unauthorized | API key missing or invalid | Verify your API key and Authorization header. |
500 | { "status": "error", "message": "Server error", "errors": [{}] } | BigID server hit a snag (internal error) | Wait a moment and retry. If it persists, reach out to support. |
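One pragmatic way to apply the table above in code: treat 4xx responses as caller mistakes (fix the ID, key, or header) and only retry 5xx responses. This is a sketch; the backoff policy is a suggestion, not a BigID requirement:

```python
import time
import urllib.error
import urllib.request

def is_retryable(status: int) -> bool:
    """Per the table above, only 5xx errors are worth retrying;
    4xx errors need a corrected ID, key, or header instead."""
    return status >= 500

def get_with_retry(url: str, headers: dict, attempts: int = 3) -> bytes:
    """GET with exponential backoff on server errors."""
    for attempt in range(attempts):
        try:
            req = urllib.request.Request(url, headers=headers)
            with urllib.request.urlopen(req) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if not is_retryable(err.code) or attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # back off before retrying
```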
Code Samples
# Cluster Analysis API Tutorial
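The steps above can be combined into one script. This is a sketch using only Python's standard library; the host and the size filter syntax are assumptions, and the printed fields are taken from the sample responses above:

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://your-bigid-host/api/v1"  # hypothetical host
HEADERS = {"Authorization": "Bearer SAMPLE"}  # step 1: auth header

def build_url(path: str, params: dict = None) -> str:
    """Build a full API URL with optional query parameters."""
    url = f"{BASE_URL}/{path}"
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return url

def api_get(path: str, params: dict = None):
    """Authenticated GET returning the decoded JSON body."""
    req = urllib.request.Request(build_url(path, params), headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def first_cluster_id(size: int) -> str:
    """Step 2: fetch clusters filtered by size and take the first id."""
    clusters = api_get("clusters", {"filter": f"size={size}"})
    return clusters[0]["id"]

def cluster_columns(cluster_id: str) -> list:
    """Step 3: fetch all columns grouped into the cluster."""
    return api_get("data-catalog/column/cluster/similar",
                   {"clusterId": cluster_id})["results"]

# Usage (requires a reachable BigID instance):
# cid = first_cluster_id(size=20)
# for col in cluster_columns(cid):
#     print(col["columnName"], col["fullyQualifiedName"])
```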
Summary
Congratulations! In this tutorial, you have learned how to authenticate with BigID, get a cluster ID, and retrieve all columns in that cluster using the API. Now you can easily review and manage similar columns across your data environment.