BigID API/Duplicate Data Tutorial: Difference between revisions

Latest revision as of 18:50, 4 March 2022

In this article, you'll learn:

What the BigID data catalog can be used for
How to use the has_duplicates filter
How to use the duplicate_id filter

Scenario: You're seeing increasingly high storage costs in your cloud data sources. Looking at the names of these data sources, they don't seem to be storing anything that's particularly large, but you suspect that they're storing similar data, which is increasing your storage costs. Use the BigID catalog to get a list of duplicated data.

The BigID Catalog

The BigID catalog provides a view into all of your data. It allows you to see the types of data being held in each column, the access rights for that data, and how it relates to other information across your systems. Below is a demo of the catalog through the BigID UI:

Since we care about the data within the API as opposed to the UI, let's see what the API response that powers this page looks like.

In the results section of the response you'll see a listing much like that in the UI. Each of these items is an object. An object is:

A database table in a structured data source
A file in an unstructured data source

This means that both structured and unstructured data sources can have multiple objects within them. In a production BigID system there can be millions of objects, so we need to filter.

If you ever want to request information about objects with specific properties, think: What filter should I use?

Using Duplicate Filters

Let's use the has_duplicates filter to request objects that contain duplicate files.
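As a rough sketch of how the same request could be made outside the interactive explorer, the Python below builds the data-catalog URL with the has_duplicates filter and sends it with an Authorization header. The base URL and the token value are placeholders assumed for illustration, not values from this tutorial; only the data-catalog path, the filter query parameter, and the has_duplicates="true" expression come from the request shown on this page.

```python
import json
import urllib.parse
import urllib.request

# Assumption: placeholder base URL; substitute the API root of your own BigID system.
BASE_URL = "https://bigid.example.com/api/v1"


def catalog_url(filter_expr, base_url=BASE_URL):
    """Build a data-catalog URL with the filter expression percent-encoded."""
    query = urllib.parse.urlencode(
        {"filter": filter_expr}, quote_via=urllib.parse.quote
    )
    return f"{base_url}/data-catalog?{query}"


def fetch_catalog(filter_expr, token, base_url=BASE_URL):
    """GET the filtered catalog and return the parsed JSON body."""
    request = urllib.request.Request(
        catalog_url(filter_expr, base_url),
        headers={"Authorization": token},  # token obtained from your own session
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Building the URL needs no network access, so it is safe to inspect first.
url = catalog_url('has_duplicates="true"')
print(url)
```

Calling fetch_catalog('has_duplicates="true"', token) against a real system should then return the filtered listing, assuming the placeholder base URL has been replaced with a valid one.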

This second API call returns only around 47 results, compared with around 100 from the previous request. But what are the duplicates? Each duplicate object has a duplicate_id that represents a hash of the file. We can filter objects by this ID to find all of the copies. Replace DUPLICATEID in the URL of the request below with the duplicate_id of the first object above to find its duplicates.
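To illustrate the idea of grouping by duplicate_id, here is a minimal sketch that builds the duplicate_id filter expression and groups a list of catalog objects by their shared ID. The sample objects and the "name" field are made up for the example; only the duplicate_id field and the duplicate_id="..." filter form come from this tutorial, so check the field names against a real response.

```python
from collections import defaultdict


def duplicate_filter(duplicate_id):
    """Return the filter expression that selects one duplicate_id."""
    return f'duplicate_id="{duplicate_id}"'


def group_by_duplicate_id(objects):
    """Map each duplicate_id to the names of the objects that share it."""
    groups = defaultdict(list)
    for obj in objects:
        dup_id = obj.get("duplicate_id")
        if dup_id:  # skip objects that carry no duplicate_id
            groups[dup_id].append(obj["name"])  # "name" is a hypothetical field
    return dict(groups)


# Made-up objects standing in for the results of a has_duplicates request.
sample_results = [
    {"name": "bucket-a/report.csv", "duplicate_id": "hash-1"},
    {"name": "bucket-b/report-copy.csv", "duplicate_id": "hash-1"},
    {"name": "bucket-a/notes.txt", "duplicate_id": "hash-2"},
    {"name": "bucket-c/notes-old.txt", "duplicate_id": "hash-2"},
]

groups = group_by_duplicate_id(sample_results)
first_id = sample_results[0]["duplicate_id"]
print(duplicate_filter(first_id))
print(groups[first_id])
```

The expression returned by duplicate_filter is what takes the place of the filter value in the request URL, with the real duplicate_id substituted for DUPLICATEID.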

Now that you have a list of the duplicated files, you can delete the copies you don't need to save on data storage costs.

@@ Line 2: / Line 2: @@
 In this article, you'll learn:
 * What the BigID data catalog can be used for
-* Retrieving object data from the catalog via API
+* How to use the has_duplicates filter
-* Retrieving column data from the catalog via API
+* How to use the duplicate_id filter
 {{Box/end}}
@@ Line 9: / Line 9: @@
 {{Scenario|You're seeing increasingly high storage costs in your cloud data sources. Looking at the names of these data sources, they don't seem to be storing anything that's particularly large, but you suspect that they're storing similar data which is increasing your storage costs. '''Use the BigID catalog to get a list of duplicated data'''}}
-== The BigID Catalog ==
+{{:InformationTemplates:BigID_Catalog}}
-<html><img src="https://resources.cdn.mybigid.com/images-animated/catalog-01.gif" /></html>
+This means that both structured and unstructured data sources can have multiple objects within them. In a production BigID system there can be millions of objects so we need to filter. {{Note|If you ever want to request information about objects with specific properties think: What filter should I use?}}
+== Using Duplicate Filters ==
+Let's use the has_duplicates filter to request objects that contain duplicate files.
 <html>
-<iframe style="border:0px; width:100%; height:400px; border-radius:10px;" src="https://apibrowser.mybigid.com/?url=sessions&method=POST&body=%7B%22username%22%3A%22bigid%22%2C%22password%22%3A%22learner%22%7D&selectedSetting=body"></iframe>
+<iframe style="border:0px; width:100%; height:400px; border-radius:10px;" src="https://apiexplorer.bigid.tools/?url=data-catalog%3Ffilter%3Dhas_duplicates%3D%22true%22&method=GET&headers=%5B%7B%22name%22%3A%22Authorization%22%2C%22value%22%3A%22SAMPLE%22%7D%5D"></iframe>
 </html>
-In the response, there's a bunch of information about the logged in user. For our purposes, we just care about line 4, the auth_token. This token is what we'll use the authenticate with the other BigID APIs. We've placed a sample below with the auth token highlighted. '''Copy the auth token from the request you placed above. We'll need it in just a second.'''
+This second API call only returns around 47 results as opposed to the around 100 results returned from the previous request. But what are the duplicates? Each duplicate object has a duplicate_id that represents a hash of the file. We can filter objects by this ID to find all the duplicates. '''Replace DUPLICATEID in the URL of the request below with the duplicate_id of the first object above to find its duplicates.'''
-<syntaxhighlight lang="JSON" line highlight="4">
-{
-    "success": true,
-    "message": "Enjoy your token!",
-    "auth_token": "eyJhbGciOiJ<don't copy me! I'm just an example!>...",
-    "username": "bigid",
-    "firstName": "BigID Admin",
-    "permissions": [
-        "admin",
-        "permission.tasks.edit",
-        "permission.tasks.read_task_list",
-    ...
-</syntaxhighlight>
-== Calling an API ==
-Now that you have a session token we can directly call BigID APIs. Documentation for these APIs is available at https://www.docs.bigid.com/bigid/reference/api-getting-started . Since we're just trying to perform a simple task, we don't need the docs here, just to know that GET /ds-connections is the endpoint to retrieve a list of data source connections.
-Add a new header named "Authorization" and paste the session token you got in the previous request to authenticate yourself.
 <html>
-<iframe style="border:0px; width:100%; height:400px; border-radius:10px;" src="https://apibrowser.mybigid.com/?url=ds-connections&method=GET&selectedSetting=headers"></iframe>
+<iframe style="border:0px; width:100%; height:400px; border-radius:10px;" src="https://apiexplorer.bigid.tools/?url=data-catalog%3Ffilter%3Dduplicate_id%3D%22DUPLICATEID%22&method=GET&headers=%5B%7B%22name%22%3A%22Authorization%22%2C%22value%22%3A%22SAMPLE%22%7D%5D"></iframe>
 </html>
-In that API call, we can see a list of data sources and all the information for each data source.
+Now you have a list of the files that are duplicated, you can delete some of your unneeded copies to save on data storage costs.
-<syntaxhighlight lang="JSON" lines>
+[[Category:Tutorial]][[Category:API]]
-{
-    "status": "success",
-    "statusCode": 200,
-    "data": {
-        "ds_connections": [
-            "<data source info here>"
-         ]
-    }
-}
-</syntaxhighlight>