BigID API/Duplicate Data Tutorial
Scenario: You're seeing increasingly high storage costs in your cloud data sources. Judging by their names, these data sources don't seem to store anything particularly large, but you suspect that they hold similar data, which is driving up your storage costs. Use the BigID catalog to get a list of duplicated data.
Note: If you ever want to request information about objects with specific properties, think: What filter should I use?
Revision as of 16:09, 8 November 2021
In this tutorial, you'll learn:
- What the BigID data catalog can be used for
- How to use the has_duplicates filter
- How to use the duplicate_id filter
Both structured and unstructured data sources can contain multiple objects. A production BigID system can hold millions of objects, so requests to the catalog need a filter to narrow the results.
Using Duplicate Filters
Let's use the has_duplicates filter to request only the objects that have duplicates.
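As a rough sketch of what such a request could look like from a script, the Python below builds a catalog URL with a has_duplicates filter. The host name, the /data-catalog path, the filter syntax, and the use of a raw token in the Authorization header are all assumptions made for illustration and are not taken from this page, so check them against your own BigID API reference before relying on them.

```python
import json
import urllib.parse
import urllib.request

# Assumed values, not taken from the tutorial text: adjust to your deployment.
BASE_URL = "https://bigid.example.com/api/v1"  # hypothetical host
CATALOG_PATH = "/data-catalog"                 # assumed catalog endpoint path

# The exact filter syntax is an assumption; confirm it in your API reference.
HAS_DUPLICATES_FILTER = 'has_duplicates="true"'


def catalog_url(filter_expr: str) -> str:
    """Build a catalog URL carrying a single URL-encoded `filter` parameter."""
    query = urllib.parse.urlencode({"filter": filter_expr})
    return f"{BASE_URL}{CATALOG_PATH}?{query}"


def get_json(url: str, token: str) -> dict:
    """GET `url`, sending the token in the Authorization header (assumed scheme)."""
    request = urllib.request.Request(url, headers={"Authorization": token})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URL; call get_json(url, token) against a real system.
    print(catalog_url(HAS_DUPLICATES_FILTER))
```

Keeping the URL construction separate from the network call makes it easy to check the encoded filter before sending anything.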
This second API call returns only around 47 results, compared with around 100 from the previous request. But which objects are duplicates of each other? Each duplicate object has a duplicate_id that represents a hash of the file, so filtering objects by this ID finds every copy of that file. Replace DUPLICATEID in the URL of the request below with the duplicate_id of the first object above to find its duplicates.
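The substitution step can also be scripted. The sketch below takes the duplicate_id of the first object in a response and builds the follow-up URL. The response shape (a top-level "results" list), the objectName field, the host, the endpoint path, and the filter syntax are assumptions, and the sample data is made up purely for illustration.

```python
import urllib.parse

# Assumed values, not taken from the tutorial text: adjust to your deployment.
BASE_URL = "https://bigid.example.com/api/v1"  # hypothetical host
CATALOG_PATH = "/data-catalog"                 # assumed catalog endpoint path


def first_duplicate_id(response: dict) -> str:
    """Return the duplicate_id of the first object that carries one.

    Assumes the catalog objects sit in a top-level "results" list.
    """
    for obj in response.get("results", []):
        if obj.get("duplicate_id"):
            return obj["duplicate_id"]
    raise ValueError("no object with a duplicate_id in this response")


def duplicates_url(duplicate_id: str) -> str:
    """Build a catalog URL that filters on one duplicate_id (assumed syntax)."""
    filter_expr = f'duplicate_id="{duplicate_id}"'
    query = urllib.parse.urlencode({"filter": filter_expr})
    return f"{BASE_URL}{CATALOG_PATH}?{query}"


if __name__ == "__main__":
    # Made-up sample standing in for the has_duplicates response.
    sample = {
        "results": [
            {"objectName": "report.csv", "duplicate_id": "abc123"},
            {"objectName": "report_copy.csv", "duplicate_id": "abc123"},
        ]
    }
    print(duplicates_url(first_duplicate_id(sample)))
```

Because the duplicate_id is a hash of the file, every object returned by this second request should be a copy of the same content.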
Now that you have a list of the duplicated files, you can delete the copies you don't need to save on data storage costs.