Find Duplicate Data Tutorial
Both structured and unstructured data sources can contain multiple objects. In a production BigID system the catalog can hold millions of objects, so we need to filter the results.
Using Duplicate Filters
Let’s use the has_duplicates filter to request objects that contain duplicate files.
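One practical detail: filter expressions contain characters like `=` and `"` that need URL-encoding when the URL is built by hand. A minimal sketch using only the Python standard library (the endpoint matches the code samples below; whether you encode by hand or via a library, the server sees the same filter):

```python
from urllib.parse import urlencode

# Let urlencode percent-encode the filter expression rather than
# embedding the quotes and equals sign in the URL by hand.
query = urlencode({"filter": 'has_duplicates="true"'})
url = f"https://developer.bigid.com/api/v1/data-catalog?{query}"
print(url)
# https://developer.bigid.com/api/v1/data-catalog?filter=has_duplicates%3D%22true%22
```

Libraries like requests (`params=`) and fetch (`URLSearchParams`) do this encoding for you.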
This second API call returns around 47 results, as opposed to the roughly 100 returned by the previous request. But which objects are duplicates of each other? Each duplicate object has a duplicate_id that represents a hash of the file. We can filter objects by this ID to find all of its duplicates. Replace DUPLICATEID in the URL of the request below with the duplicate_id of the first object above to find its duplicates.
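The substitution step can be sketched without calling the API. In this minimal example, the `duplicate_filter_url` helper and the sample response values are illustrative only, not part of the BigID API; the response shape mirrors the code samples below:

```python
# Build the catalog filter URL for a given duplicate_id.
# Helper name and sample values are illustrative only.
def duplicate_filter_url(base_url, duplicate_id):
    return f'{base_url}/data-catalog?filter=duplicate_id="{duplicate_id}"'

# Hypothetical response from the has_duplicates query above
sample_response = {"results": [{"name": "report.pdf", "duplicate_id": "3f2a9c"}]}

dup_id = sample_response["results"][0]["duplicate_id"]
url = duplicate_filter_url("https://developer.bigid.com/api/v1", dup_id)
print(url)
# https://developer.bigid.com/api/v1/data-catalog?filter=duplicate_id="3f2a9c"
```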
Now that you have a list of the files that are duplicated, you can delete the unneeded copies to save on data storage costs.
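The cleanup step can be sketched as a simple grouping pass: keep one object per duplicate_id and flag the rest for review. A minimal sketch with hypothetical file names (in practice the objects come from the API responses above):

```python
from collections import defaultdict

# Hypothetical catalog objects; real ones come from the duplicate queries above.
objects = [
    {"name": "report.pdf",        "duplicate_id": "a1"},
    {"name": "report-copy.pdf",   "duplicate_id": "a1"},
    {"name": "budget.xlsx",       "duplicate_id": "b2"},
    {"name": "budget-final.xlsx", "duplicate_id": "b2"},
    {"name": "budget-old.xlsx",   "duplicate_id": "b2"},
]

# Group by duplicate_id, keep the first copy in each group,
# and collect the remaining copies as deletion candidates.
groups = defaultdict(list)
for obj in objects:
    groups[obj["duplicate_id"]].append(obj["name"])

candidates = {dup_id: names[1:] for dup_id, names in groups.items()}
print(candidates)
```

Which copy to keep is a policy decision (newest, canonical location, and so on); keeping the first is just for illustration.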
Code Samples
Python
```python
# Duplicate Data Tutorial
import requests
import json

base_url = "https://developer.bigid.com/api/v1"
headers = {
    "Authorization": "Bearer SAMPLE",
    "Content-Type": "application/json"
}

# 1. Get all catalog objects
response = requests.get(
    f"{base_url}/data-catalog",
    headers=headers
)
data = response.json()
print("All Objects:", json.dumps(data, indent=2))

# 2. Get catalog objects that have duplicates
response = requests.get(
    f"{base_url}/data-catalog?filter=has_duplicates=\"true\"",
    headers=headers
)
data = response.json()
print("Duplicate Objects:", json.dumps(data, indent=2))

# Get the duplicate_id of the first object (for example)
results = data.get("results", [])
first_object = results[0]
duplicate_id = first_object.get("duplicate_id")

# 3. Get all objects that share the same duplicate_id
response = requests.get(
    f"{base_url}/data-catalog?filter=duplicate_id=\"{duplicate_id}\"",
    headers=headers
)
data = response.json()
print("Objects with same duplicate_id:", json.dumps(data, indent=2))
```

JavaScript
```javascript
// Duplicate Data Tutorial
const baseUrl = "https://developer.bigid.com/api/v1";
const headers = {
  "Authorization": "Bearer SAMPLE",
  "Content-Type": "application/json"
};

// 1. Get all catalog objects
async function getAllCatalogObjects() {
  console.log("Fetching all catalog objects...");
  const res = await fetch(`${baseUrl}/data-catalog`, { headers });
  const data = await res.json();
  console.log("All Objects:", JSON.stringify(data, null, 2));
  return data;
}

// 2. Get catalog objects that have duplicates
async function getObjectsWithDuplicates() {
  console.log("Fetching objects with duplicates...");
  const res = await fetch(`${baseUrl}/data-catalog?filter=has_duplicates="true"`, { headers });
  const data = await res.json();
  console.log("Duplicate Objects:", JSON.stringify(data, null, 2));
  return data;
}

// 3. Get all objects that share the same duplicate_id
async function getObjectsByDuplicateId(duplicateId) {
  // Use the duplicate_id of the desired object obtained in step 2
  console.log(`Fetching objects for duplicate_id: ${duplicateId}`);
  const res = await fetch(`${baseUrl}/data-catalog?filter=duplicate_id="${duplicateId}"`, { headers });
  const data = await res.json();
  console.log("Objects with same duplicate_id:", JSON.stringify(data, null, 2));
  return data;
}
```