Writing a REST Connector

Connectors Overview

BigID Connectors act as translators between your custom data sources and the standardized format BigID expects. While BigID supports native Java-based internal connectors, External REST Connectors allow you to write integrations in any programming language. As long as your service can respond to HTTP requests in the format BigID expects, the BigID scanner will be able to map, catalog, and classify your custom data.

If you need a refresher on the types of connectors and how BigID scanning works, please read What is a BigID Connector? before proceeding.

Writing a Structured Connector

BigID’s structured format describes data as objects, each with a fixed set of fields, and records that hold a value for each of those fields. Your data source may store its data differently, so you will need to reorganize it into this structure for BigID to scan the data. Remember that connectors are translators. Improper translation from your data source’s format to BigID’s is the most common source of problems when developing and testing custom connectors. See the example below of a JSON document for a single user converted into this format:

{
    "user":"[email protected]",
    "phone": "(321) 555-5555",
    "information": [
        {"key": "name", "value": "user"},
        {"key": "favcolor", "value": "green"}
    ]
}

translates to the following object description and record:

{
  "objectName": "User",
  "fields": [
    {
      "fieldName": "Name",
      "fieldType": "string"
    },
    {
      "fieldName": "FavColor",
      "fieldType": "string"
    },
    {
      "fieldName": "Phone",
      "fieldType": "string"
    }
  ]
}

{
  "id": "[email protected]",
  "fields": [
    {
      "fieldName": "phone",
      "fieldValue": "(321) 555-5555"
    },
    {
      "fieldName": "name",
      "fieldValue": "user"
    },
    {
      "fieldName": "favcolor",
      "fieldValue": "green"
    }
  ]
}
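
The translation above can be sketched in Python. This is a minimal sketch under stated assumptions, not BigID's implementation: the function names `to_record` and `describe_fields` are hypothetical, every value is assumed to be a string, and `user@example.com` is a placeholder address.

```python
def to_record(source: dict) -> dict:
    """Flatten one source user document into the id/fields record shape."""
    fields = [{"fieldName": "phone", "fieldValue": source["phone"]}]
    # The source nests extra attributes as key/value pairs; lift each one
    # into its own top-level field.
    for item in source.get("information", []):
        fields.append({"fieldName": item["key"], "fieldValue": item["value"]})
    return {"id": source["user"], "fields": fields}


def describe_fields(record: dict) -> list:
    """Derive field descriptions from a record (assumes string values only)."""
    return [
        {"fieldName": field["fieldName"], "fieldType": "string"}
        for field in record["fields"]
    ]


source_user = {
    "user": "user@example.com",  # placeholder value
    "phone": "(321) 555-5555",
    "information": [
        {"key": "name", "value": "user"},
        {"key": "favcolor", "value": "green"},
    ],
}
record = to_record(source_user)
```

Keeping the translation in one small function makes it easier to test on its own, which helps because translation mistakes are the most common problem during connector development.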

Connector Endpoints

BigID will request and view your connector’s hierarchy through six endpoints:

List what fields an object has (GET /objects/<objectName>/describe)
List what objects exist (GET /objects)
List what records are inside an object (GET /objects/<objectName>/records)
Count the number of records inside an object (GET /objects/<objectName>/count)
Return the fields and values for a given record ID (GET /objects/<objectName>/<recordID>)
Search for records in an object (POST /objects/<objectName>/sar)

You are not required to implement every endpoint outlined here. If you want a connector that only handles DSARs, you only need to implement the following endpoints:

List what objects exist (GET /objects)
List what fields an object has (GET /objects/<objectName>/describe)
Search for records in an object (POST /objects/<objectName>/sar)

List what fields an object has

Since all records within an object must contain the same fields, BigID uses an object’s fields to tell users what type of information is inside your data source. BigID expects the following response from your connector when it asks what fields an object has:

GET /objects/User/describe HTTP/1.1

{
  "status": "success",
  "objectName": "User",
  "fields": [
    {
      "fieldName": "Name",
      "fieldType": "string"
    },
    {
      "fieldName": "FavColor",
      "fieldType": "string"
    },
    {
      "fieldName": "Phone",
      "fieldType": "string"
    }
  ]
}
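
One way to build this response is a plain function that turns a schema definition into the body shown above. This is only a sketch: `SCHEMA` and `describe_object` are hypothetical names, the schema is held in memory, and the error body for an unknown object is an assumption because only the success case is documented here.

```python
# Hypothetical in-memory schema; a real connector would derive this
# from its data source.
SCHEMA = {
    "User": [("Name", "string"), ("FavColor", "string"), ("Phone", "string")],
}


def describe_object(object_name: str) -> dict:
    """Build the body for GET /objects/<objectName>/describe."""
    fields = SCHEMA.get(object_name)
    if fields is None:
        # Assumed error shape; the source only shows a success response.
        return {"status": "error", "message": f"unknown object: {object_name}"}
    return {
        "status": "success",
        "objectName": object_name,
        "fields": [{"fieldName": name, "fieldType": kind} for name, kind in fields],
    }
```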

Exercise: List what fields an object has

Scenario

You want to create a connector that lets your org scan for sensitive data within an ecommerce system, ShopDB. Create an endpoint for your connector that allows BigID to list an object’s fields.

List what objects exist

Now that we have an object class created, we need to let BigID know about all of the objects our connector has access to. Our object listing endpoint needs to display the fields within each object, which we implemented in the previous exercise. As we implement these endpoints, you will see that many of them rely on one another.

GET /objects/ HTTP/1.1

{
  "status": "success",
  "objects": [
    {
      "objectName": "User",
      "fields": [
        {
          "fieldName": "Name",
          "fieldType": "string"
        },
        {
          "fieldName": "FavColor",
          "fieldType": "string"
        },
        {
          "fieldName": "Phone",
          "fieldType": "string"
        }
      ]
    }
  ]
}
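
The sketch below shows how this response could be served with Python's standard `http.server` module. The names `SCHEMA`, `list_objects`, and `ConnectorHandler` are hypothetical, and the in-memory schema stands in for a real data source.

```python
import json
from http.server import BaseHTTPRequestHandler

# Hypothetical in-memory schema standing in for a real data source.
SCHEMA = {
    "User": [("Name", "string"), ("FavColor", "string"), ("Phone", "string")],
}


def list_objects() -> dict:
    """Build the body for GET /objects: every object with its fields."""
    objects = [
        {
            "objectName": name,
            "fields": [{"fieldName": n, "fieldType": t} for n, t in fields],
        }
        for name, fields in SCHEMA.items()
    ]
    return {"status": "success", "objects": objects}


class ConnectorHandler(BaseHTTPRequestHandler):
    """Minimal handler that only knows the object listing route."""

    def do_GET(self):
        # Accept /objects with or without a trailing slash.
        if self.path.rstrip("/") == "/objects":
            body = json.dumps(list_objects()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)
```

For local testing, `HTTPServer(("127.0.0.1", 8080), ConnectorHandler).serve_forever()` (with `HTTPServer` imported from `http.server`) would serve the route; the port is an arbitrary choice.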

Exercise: List what objects exist

Scenario

For your ShopDB connector to work, you need to tell BigID what objects are inside the database. The code for this endpoint is similar to your previous endpoint. Create an endpoint for your connector that allows BigID to list out all objects.

List what records are inside an object

Now that BigID knows what objects we have and what type of information it can find inside those objects, we need to return the actual data BigID will scan.

GET /objects/User/records?Count=&Offset= HTTP/1.1

{
  "status": "success",
  "records": [
    {
      "id": 1,
      "data": [
        {
          "fieldName": "Name",
          "fieldValue": "Michael",
          "fieldType": "string"
        },
        {
          "fieldName": "Phone",
          "fieldValue": "3215555555",
          "fieldType": "string"
        }
      ]
    }
  ],
  "offset": 1
}
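
A records response like the one above could be assembled as follows. This is a sketch under assumptions: `ROWS`, `row_to_record`, and `list_records` are hypothetical names, all values are reported as strings, an empty `Count` is treated as "no limit", and the returned `offset` is taken to be the position after the last record returned.

```python
# Hypothetical rows as they might come back from the data source.
ROWS = [
    {"id": 1, "Name": "Michael", "Phone": "3215555555"},
]


def row_to_record(row: dict) -> dict:
    """Convert one source row into the id/data record shape."""
    data = [
        {"fieldName": key, "fieldValue": value, "fieldType": "string"}
        for key, value in row.items()
        if key != "id"
    ]
    return {"id": row["id"], "data": data}


def list_records(rows: list, count=None, offset: int = 0) -> dict:
    """Build the body for GET /objects/<objectName>/records."""
    # count=None models an empty Count parameter (assumed to mean no limit).
    page = rows[offset:] if count is None else rows[offset:offset + count]
    return {
        "status": "success",
        "records": [row_to_record(row) for row in page],
        "offset": offset + len(page),
    }
```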

Exercise: List what records are inside an object

Scenario

BigID has the structure of your connector, but scans are still failing with 0 records found. Create an endpoint for your connector that allows BigID to list out records.

Count the number of records inside an object

In order for BigID to properly paginate through your data sources, we need to know how many total records to expect.

GET /objects/User/count HTTP/1.1

{
  "status": "success",
  "count": 1
}

Record Pagination

In a previous exercise, you returned all records inside of an object in a single call. It’s extremely common for providers not to allow all records to be returned in a single call, due to either performance or cost reasons. This is where the offset and count parameters come in.

Data providers typically meter their APIs in one of the following ways:

Limit the number of requests per second/hour/day
Limit the number of records returned in a single request
Limit the amount of data in GB processed in a given time period

To handle these use cases, BigID has two properties on the records endpoint: Count and Offset.

Count dictates the batch size. This is how many records your connector should return in a single call. This allows BigID users to throttle your connector based on how many records it returns per page.

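Offset allows BigID to paginate requests. It is the number of records to skip before the first record of the batch, so BigID can ask for one page after another.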

A sample scan using the records endpoint looks like the following:

sequenceDiagram
    participant BigID
    participant Connector

    BigID->>Connector: Call /count
    Connector-->>BigID: Total records = 40

    BigID->>Connector: Call /records with offset = 0 and count = 20
    Connector-->>BigID: 20 Records retrieved, offset = 20

    BigID->>Connector: Call /records with offset = 20 and count = 20
    Connector-->>BigID: 20 Records retrieved, offset = 40

    BigID->>BigID: Stop as offset >= total (40 >= 40)
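
To see why the returned offset matters, the exchange can be simulated with both sides in one script. This is a sketch, not BigID's scanner: `count_records`, `get_records`, and `scan` are hypothetical names, the 40 record IDs are made up, and each request is assumed to pass back the offset returned by the previous response.

```python
# Hypothetical data set of 40 record ids.
RECORD_IDS = list(range(1, 41))


def count_records() -> dict:
    """Connector side: body for the count endpoint."""
    return {"status": "success", "count": len(RECORD_IDS)}


def get_records(offset: int, count: int) -> dict:
    """Connector side: return one page and the offset of the next page."""
    page = RECORD_IDS[offset:offset + count]
    return {
        "status": "success",
        "records": [{"id": record_id, "data": []} for record_id in page],
        "offset": offset + len(page),
    }


def scan(batch_size: int) -> list:
    """Scanner side: request pages until the offset reaches the total."""
    total = count_records()["count"]
    offset, seen = 0, []
    while offset < total:
        response = get_records(offset, batch_size)
        if not response["records"]:
            break  # guard against looping forever on an empty page
        seen.extend(record["id"] for record in response["records"])
        offset = response["offset"]
    return seen
```

If the connector never advances the offset it returns, the loop on the scanner side has no way to finish, which is the symptom described in the exercise below.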

Exercise: Pagination

Scenario

Your connector is constantly scanning and scans are not completing. Implement pagination so BigID knows when you are done scanning.

Return the fields and values for a given record ID

There are cases when we only want to look into an individual record. This is especially true when we already know the record’s unique ID, such as during a DSAR or a record investigation.

GET /objects/User/records/1 HTTP/1.1

{
  "status": "success",
  "records": [
    {
      "id": 1,
      "data": [
        {
          "fieldName": "Name",
          "fieldValue": "Michael",
          "fieldType": "string"
        },
        {
          "fieldName": "Phone",
          "fieldValue": "3215555555",
          "fieldType": "string"
        }
      ]
    }
  ],
  "offset": 1
}
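
A single-record lookup could be sketched like this. The names `ROWS` and `get_record` are hypothetical, IDs are assumed to be integers, and returning an empty `records` list for an unknown ID is an assumption since the source does not show that case.

```python
# Hypothetical rows keyed by record id.
ROWS = {
    1: {"Name": "Michael", "Phone": "3215555555"},
}


def get_record(record_id: str) -> dict:
    """Build the body for a single-record lookup."""
    # The id arrives as a URL path segment, so it is a string;
    # this sketch assumes the data source uses integer ids.
    try:
        key = int(record_id)
    except ValueError:
        key = None
    row = ROWS.get(key)
    records = []
    if row is not None:
        data = [
            {"fieldName": name, "fieldValue": value, "fieldType": "string"}
            for name, value in row.items()
        ]
        records.append({"id": key, "data": data})
    # Assumption: an unknown id yields an empty list rather than an error.
    return {"status": "success", "records": records, "offset": len(records)}
```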

Exercise: Return the fields and values for a given record ID

Scenario

Your team has been trying to investigate sensitive data in your data source using the “Investigate” button in the data inventory, but they aren’t getting results. Create an endpoint for your connector that allows BigID to look up an individual record.

Search for records in an object

The final service our connector needs to offer is the ability to search for records. This is what allows our connector to do DSARs. Depending on your data source, you may need to implement searching on your own.

POST /objects/User/records/sar HTTP/1.1
Content-Type: application/json


[{"fieldName": "Name", "fieldValue": "Michael", "isFullMatch": "false"}]

{
  "status": "success",
  "records": [
    {
      "id": 1,
      "data": [
        {
          "fieldName": "Name",
          "fieldValue": "Michael",
          "fieldType": "string"
        },
        {
          "fieldName": "Phone",
          "fieldValue": "3215555555",
          "fieldType": "string"
        }
      ]
    }
  ],
  "offset": 1
}
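
When the data source has no search of its own, the connector can filter rows in memory. The sketch below makes several assumptions that the source does not confirm: `isFullMatch` arrives as the string "true" or "false" (as in the sample request), a partial match means a case-sensitive substring match, and a row must satisfy every criterion. The names `ROWS`, `matches`, and `search_records` are hypothetical.

```python
# Hypothetical rows from the data source.
ROWS = [
    {"id": 1, "Name": "Michael", "Phone": "3215555555"},
    {"id": 2, "Name": "Michelle", "Phone": "3215550000"},
]


def matches(row: dict, criterion: dict) -> bool:
    """Check one search criterion against one row."""
    name = criterion["fieldName"]
    if name not in row:
        return False
    actual, wanted = str(row[name]), str(criterion["fieldValue"])
    # The sample request sends isFullMatch as a string, not a JSON boolean.
    full = str(criterion.get("isFullMatch", "false")).lower() == "true"
    return actual == wanted if full else wanted in actual


def search_records(rows: list, criteria: list) -> dict:
    """Build the body for the sar endpoint from a list of criteria."""
    # Assumption: all criteria must match; an empty query matches nothing.
    hits = [
        row for row in rows
        if criteria and all(matches(row, criterion) for criterion in criteria)
    ]
    records = [
        {
            "id": row["id"],
            "data": [
                {"fieldName": key, "fieldValue": value, "fieldType": "string"}
                for key, value in row.items()
                if key != "id"
            ],
        }
        for row in hits
    ]
    return {"status": "success", "records": records, "offset": len(records)}
```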

Exercise: Search for records in an object

Scenario

Your team is using BigID to comply with GDPR DSAR requests. ShopDB doesn’t have any search functionality so you’ll have to implement it yourself. Implement the SAR endpoint so BigID can properly query your data source.

Authentication

The connector you’ve written so far allows anyone on the internet to see the information inside your data source, which is no good. This connector’s database also has no username and password, which is extremely rare. Typically, the data sources your connectors access will also require authentication of some type.

BigID has the ability to store and manage these credentials for you inside of the BigID system or use credentials from external password vaults. When a scan is started, BigID will send the credentials to your connector.

Credentials sent to your connector will always take the form of a Basic authentication header. This header is only Base64-encoded, not encrypted. For this reason, you must never install a BigID connector without HTTPS.

While Basic auth credentials always take the form of a username and password, not all systems use that mechanism for login. Below are some examples of how you can format a username and password into the credentials your application is expecting:

OAuth Client Credentials - Username: https://tenanturl.com/,CLIENT_ID Password: CLIENT_SECRET
Multi Tenant SaaS application with API key - Username: tenantID Password: API_KEY
SaaS application with just an API key - Username: <blank> Password: API_KEY
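
Decoding the Basic header and unpacking the conventions listed above can be sketched with the standard library. The function names are hypothetical, and splitting the OAuth username on its last comma is an assumption based on the `tenant URL,CLIENT_ID` example.

```python
import base64
import hmac


def parse_basic_auth(header):
    """Return (username, password) from an Authorization header, or None."""
    if not header:
        return None
    scheme, _, encoded = header.partition(" ")
    if scheme.lower() != "basic":
        return None
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:  # covers invalid Base64 and invalid UTF-8
        return None
    username, separator, password = decoded.partition(":")
    if not separator:
        return None
    return username, password


def is_authorized(header, expected_user: str, expected_password: str) -> bool:
    """Compare the supplied credentials in constant time."""
    parsed = parse_basic_auth(header)
    if parsed is None:
        return False
    user, password = parsed
    user_ok = hmac.compare_digest(user.encode("utf-8"), expected_user.encode("utf-8"))
    pass_ok = hmac.compare_digest(password.encode("utf-8"), expected_password.encode("utf-8"))
    return user_ok and pass_ok


def split_oauth_username(username: str):
    """Split 'tenant URL,CLIENT_ID' on the last comma (assumed convention)."""
    tenant_url, _, client_id = username.rpartition(",")
    return tenant_url, client_id
```

A handler would typically answer a failed check with HTTP 401 before touching the data source.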

Exercise: Authentication

Scenario

Your organization is writing a connector and wants to be sure the data is only viewable to BigID. You’ve set up the appropriate firewalls, but also want your connector to be protected by a username and password. Protect your records endpoint with username and password authentication.

Writing an Unstructured Connector

Connector Endpoints

BigID will request and view your connector’s hierarchy through five endpoints:

List what Containers exist (GET /objects)
List what Objects exist (GET /objects/<containerName>?Offset=)
Return the Metadata for a given object (GET /objects/<containerName>/<objectName>/describe)
Return the InputStream for a given object (GET /objects/<containerName>/<objectName>/content-stream)
Search for specific data inside objects (POST /objects/<containerName>/<objectName>/sar?Offset=)

Just like for structured connectors, you are not required to implement every endpoint outlined here. If you want a connector that only handles DSARs, you only need to implement the following endpoints:

List what Containers exist (GET /objects)
List what Objects exist (GET /objects/<containerName>?Offset=)
Search for specific data inside objects (POST /objects/<containerName>/<objectName>/sar?Offset=)

List what containers exist

The first step in BigID’s scan of unstructured data sources will always be to list out the containers that exist within a data source. This allows BigID to know where to start the discovery process. Compared to structured connectors, this endpoint is extremely simple. That’s because containers or folders don’t really specify anything about the information that can be inside them, just that they are a storage location.

GET /objects HTTP/1.1

{
  "status": "success",
  "containers": [
    {
      "containerName": "Folder"
    },
    {
      "containerName": "Folder2"
    }
  ]
}
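
For a file server, one option is to treat each top-level directory as a container. That mapping is an assumption for this sketch, and `list_containers` is a hypothetical name.

```python
from pathlib import Path


def list_containers(root) -> dict:
    """Build the body for GET /objects from the directories under root."""
    # Assumption: each top-level directory is one container;
    # loose files directly under root are ignored.
    names = sorted(entry.name for entry in Path(root).iterdir() if entry.is_dir())
    return {
        "status": "success",
        "containers": [{"containerName": name} for name in names],
    }
```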

Exercise: List what containers exist

Scenario

Your team has a set of CSV files on a file server that they are looking to scan. Implement the containers endpoint to let BigID know what containers are present on the file server.

List what objects exist inside a container

Now that BigID knows what containers our connector has access to, we can give it insights into the objects inside those containers. This endpoint will also provide BigID with the metadata about the objects so we can do things like HyperScan and enforce permissions policies.

You’ll notice that certain fields are set directly on the object, whereas others are set inside of custom fields. The fields set directly on the object are predetermined by the BigID connector specification and cannot change. The custom fields list allows you to add your own!

GET /objects/Folder?Offset= HTTP/1.1

{
  "status": "success",
  "objects": [
    {
      "containerName": "Folder",
      "objectName": "1.pdf",
      "dateCreated": "26/02/2020 19:00:34",
      "lastModified": "26/02/2020 19:00:34",
      "owner": 501,
      "sizeInBytes": 177924,
      "schemaFields": "NULL",
      "sarCapable": false,
      "customFields": [
        {
          "fieldName": "file_type",
          "fieldValue": "pdf"
        }
      ]
    }
  ],
  "offset": 1
}
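
On a file server, most of this metadata can come from the file system. The sketch below is built on assumptions: timestamps are rendered in UTC using the day/month/year format of the sample, `st_ctime` stands in for the creation date (on Unix it is actually the last metadata change), `owner` is the numeric user ID, and `schemaFields` and `sarCapable` are copied from the sample as fixed values. The function names are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import Path

DATE_FORMAT = "%d/%m/%Y %H:%M:%S"  # matches "26/02/2020 19:00:34"


def _stamp(epoch_seconds: float) -> str:
    """Format a file timestamp (UTC is an assumption)."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime(DATE_FORMAT)


def object_metadata(path: Path) -> dict:
    """Describe one file using the fixed fields plus one custom field."""
    info = path.stat()
    return {
        "containerName": path.parent.name,
        "objectName": path.name,
        "dateCreated": _stamp(info.st_ctime),  # approximation on Unix
        "lastModified": _stamp(info.st_mtime),
        "owner": info.st_uid,
        "sizeInBytes": info.st_size,
        "schemaFields": "NULL",
        "sarCapable": False,
        "customFields": [
            {"fieldName": "file_type", "fieldValue": path.suffix.lstrip(".")}
        ],
    }


def list_objects(container: Path, offset: int = 0) -> dict:
    """Build the body for GET /objects/<containerName>?Offset=."""
    files = sorted(entry for entry in container.iterdir() if entry.is_file())
    page = files[offset:]
    return {
        "status": "success",
        "objects": [object_metadata(entry) for entry in page],
        "offset": offset + len(page),
    }
```

Because the single-object describe response uses the same fields, `object_metadata` could be reused there with a `status` key added.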

Exercise: List what objects exist inside a container

Scenario

List the objects inside of the container and their metadata so BigID can scan them and populate the catalog. Implement the objects endpoint to let BigID know what objects are present on the file server.

Return the metadata for a given object

Just like we listed out the metadata for all objects, we need to be able to list out the metadata for a single object so BigID can inspect single objects without needing to go through potentially thousands of listings in our container.

GET /objects/Folder/File/describe HTTP/1.1

{
  "status": "success",
  "containerName": "Folder",
  "objectName": "1.pdf",
  "dateCreated": "26/02/2020 19:00:34",
  "lastModified": "26/02/2020 19:00:34",
  "owner": 501,
  "sizeInBytes": 177924,
  "schemaFields": "NULL",
  "sarCapable": false,
  "customFields": [
    {
      "fieldName": "file_type",
      "fieldValue": "pdf"
    }
  ]
}

Exercise: Return the metadata for a given object

Scenario

We need our users to be able to see the metadata of a single file. The format is exactly the same as the container listing endpoint, so feel free to reuse your code. Implement the object endpoint to let BigID know a single object’s metadata.

Return the inputStream for a given object

In order for BigID to classify the contents of a file, it needs access to them. The inputStream endpoint is unique in that it just returns the raw data from the file. There’s no formatting required here; just return the raw data.

GET /objects/Folder/File/content-stream HTTP/1.1

utf-8 data here
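
Returning raw data can be sketched as a generator that reads the file in chunks, so large files are not loaded into memory at once. The names `resolve_object` and `content_stream` are hypothetical, and the path check is an added safeguard (not something the source requires) because container and object names arrive in the URL.

```python
from pathlib import Path
from typing import Iterator


def resolve_object(root: Path, container: str, object_name: str) -> Path:
    """Map URL segments to a file, refusing paths that leave the root."""
    root = root.resolve()
    candidate = (root / container / object_name).resolve()
    if root not in candidate.parents:
        raise ValueError("path escapes the connector root")
    return candidate


def content_stream(path: Path, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield the file's raw bytes in chunks, with no extra formatting."""
    with path.open("rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk
```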

Exercise: Return the inputStream for a given object

Search for specific data inside objects

POST /objects/Folder/File/sar?Offset= HTTP/1.1
Content-Type: application/json


[{"fieldName": "Name", "fieldValue": "Michael", "isFullMatch": "false"}]

{
  "status": "success",
  "records": [
    {
      "id": "container/object",
      "data": [
        {
          "searchedFieldName": "fieldName",
          "searchedFieldValue": "fieldValue",
          "fullObjectName": "container/object",
          "offset": "12"
        }
      ]
    }
  ]
}
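
The source gives no prose for this endpoint, so the following is only a guess at one workable approach: scan the object's text for each searched value and report every hit. It assumes `offset` is the character position of the match, sent as a string as in the sample, and that matching is a case-sensitive substring search. `search_text` is a hypothetical name.

```python
def search_text(container: str, object_name: str, text: str, criteria: list) -> dict:
    """Build a sar response listing where each searched value occurs."""
    full_name = f"{container}/{object_name}"
    hits = []
    for criterion in criteria:
        needle = str(criterion["fieldValue"])
        if not needle:
            continue  # an empty value would match at every position
        start = text.find(needle)
        while start != -1:
            hits.append({
                "searchedFieldName": criterion["fieldName"],
                "searchedFieldValue": needle,
                "fullObjectName": full_name,
                "offset": str(start),  # assumed: character position of the hit
            })
            start = text.find(needle, start + 1)
    records = [{"id": full_name, "data": hits}] if hits else []
    return {"status": "success", "records": records}
```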

Exercise: Search for specific data inside objects

Distributing a BigID Connector

       graph TD
           Cloud[BigID Cloud Environment]
           OnPrem[BigID On-Premise Environment]
           Connector[Connector]
           Scanner[Scanner]
           Target[Target Data Source]

           subgraph BigID Environment
               Cloud --- |Outbound Access| Scanner
               OnPrem --- |Outbound Access| Scanner
           end

           Scanner -->|Inbound Access| Connector
           Connector -->|Outbound Access| Target

BigID connectors are distributed as Docker images. This allows them to be run within the BigID app server Kubernetes cluster or on the Kubernetes clusters and Docker Compose servers running an organization’s scanners. As long as your scanner can access your connector and your connector can access your data source, there are no other deployment requirements.

Pre-Flight Checklist

Before creating a connector for a given data source you’ll need to collect the following information:

What is the type of data returned from the API?
Where is the API documentation? Do you have access to it?
What authentication modes does this API support?
Does your account have any kind of usage limits? (X requests per day, per hour, per minute?)
Is there a programmatic representation of all API endpoints? (OpenAPI, Swagger)
Is this a structured data source or an unstructured data source?
What are the credentials for a test environment of this data source?

After you have obtained the above information, determine what data models from the data source you wish to include in your connector. There may be hundreds of data models, but only a few may contain PI. For each of these data models, you need to know the following information:

What are the fields for this data model? What are the types of those fields? (ex: name=string, age=number)
What endpoint gives the records of this data model? (ex: /animals/)
What endpoint allows you to lookup a single record of this data model given an id? (ex: /animals/1)
What endpoint allows you to search for records in this data model matching criteria? (ex: animals/search?type=cat)

You now have all the information required to build a connector.
