Connectors/Learn: Difference between revisions

From BigID Developer Portal
 
Latest revision as of 03:10, 21 January 2025

In this article, you'll learn:

  • What connectors add to your BigID system
  • What types of connectors exist
  • How to write a REST-based BigID connector

What is a BigID Connector

BigID Connectors allow your BigID system to provide insights about new types of data. Whether that's a known data type like CSV from a new type of data source or something completely new to the BigID ecosystem, a connector will allow you to bring BigID's data discovery capabilities to that system.

Why do we need connectors?

Every data source has its own way of communicating with third parties. Some data sources return information nicely organized, others return it as a jumbled mess. In order for BigID to give you the insights you expect, data needs to be fed to BigID in a consistent way. Connectors work as translators between the multitude of formats that data sources have adopted and the standard format BigID expects. Note that even if the data format is the same (REST JSON, REST XML, GraphQL, etc.), small differences make it difficult to reuse connectors. Think of a connector as a way to interface with a single system.

How are connectors implemented?

Connectors can either be implemented as a REST API or as a Java JAR file. REST connectors are bound by the limitations of HTTP connections including timeouts, size limitations and more. Java connectors are well suited for complex use cases especially those involving data sources that stream data.

BigID Scanning Process

While BigID has different scanning methods (snapshots, metadata scans, Hyperscan), they all depend on scanners. Scanners allow BigID to contact data sources and create the search maps that are used to power the BigID system. Depending on your deployment model you may have scanners located in the BigID cloud, on-premise, or in your organization's cloud provider accounts. Scanners take the form of a Docker container and require only outbound network access.

In a scan, the scanner will do the following:

  • If correlation is enabled, load all correlation records in order to find them within data sources.
  • Scan table and file metadata to determine access permissions and ownership
  • Classify data streams

After a user starts a scan, the scanner will use the data in the scan request to determine what type of connection to make. In the case of REST API scans, the scanner will reach out to your connector. This means your REST connector must allow inbound network access from your scanner, and your data source must allow inbound access from the connector. For Java connectors, the scanner will directly communicate with the data source.

Connector Types

There are two different types of connectors supported within BigID. Which type of connector you want to use to connect to your data source will have broad implications on setup, network security settings, and connector installation.

Internal (Java-based Connectors)

Most of the connectors you are familiar with are Java-based connectors.

These connectors are written in the Java programming language and distributed as JAR files. To install a new Java-based connector, an administrator must manually load the connector JAR file into the scanner using the command line. Thankfully, the 50+ internal connectors written by BigID are bundled in the scanner by default. The scanner directly uses these connectors’ code to connect to your data sources.

These connectors allow large amounts of customization in the scanning process and the connection to your data source. Due to the customization options, they are more complicated to create and are not the recommended connector development method for BigID customers.

External (Generic REST API Connectors)

External connectors allow you to create a connector in your favorite programming language. The scanner will communicate with your connector over HTTPS, so as long as your programming language of choice can respond to web requests, it can be used to create an external connector.

External connectors can be hosted on any server that has a network connection to both your scanner and your data source.

There are two different types of external connectors that you can create: unstructured and structured.

Unstructured External Connector

Unstructured connectors allow BigID to scan files from a given data source. An example of an unstructured data source is Google Drive.

Structured External Connector

Structured connectors allow BigID to scan databases. An example of a structured connector would be our MySQL connector.

The Simplest BigID Connector

Below is sample code for the simplest REST connector you can make.

const express = require('express')
const app = express()
const port = 3000;

// This is our fake data for the connector
const FAKE_DATA = {
    Customers: [
        { Id: 1, Name: "Michael", Address: "100 Osceola Parkway, Kissimmee FL" },
        { Id: 2, Name: "Bob", Address: "1 Sand Lake Rd, Orlando FL" },
        { Id: 3, Name: "Stewart", Address: "1 Sand Lake Rde, Orlando FL" }
    ],
    Orders: [
        { Id: 1, Item: "banana", customerId: 2, Price: 1 }
    ]
};

/**
 * Describes all objects inside a data source
 *
 * @async
 * @param {{ domain: string; user: string; pass: string; header: string; }} login login information from BigID
 * @returns {Promise<{ status: string; objects: object[]; }>} every object with its field definitions
 */
async function describeObjects(login) {

    const objects = [];

    // Create a new object for each table
    for (let table in FAKE_DATA) {
        let thisObject = { objectName: table, fields: [] };

        // Get field definitions for the object using the first item in each object
        for (let field in FAKE_DATA[table][0]) {
            thisObject.fields.push({ fieldName: field, fieldType: typeof FAKE_DATA[table][0][field] });
        }
        objects.push(thisObject);
    }

    return {
        status: 'success',
        objects: objects
    };
}


/**
 * Describes a specific object in the data source.
 *
 * @async
 * @param {{ domain: string; user: string; pass: string; header: string; }} login login information from BigID
 * @param {string} object name of object
 */
async function describeObject(login, object) {

    const objectFields = [];

    // Get field definitions for the object using the first item in the object
    for (let field in FAKE_DATA[object][0]) {
        objectFields.push({ fieldName: field, fieldType: typeof FAKE_DATA[object][0][field] });
    }

    return {
        status: "success",
        objectName: object,
        fields: objectFields
    }
}

/**
 * Get records from data source for a specific object
 *
 * @async
 * @param {{ domain: string; user: string; pass: string; header: string; }} login login information from BigID
 * @param {string} object name of object
 * @param {number} offset index of the first record to return
 * @param {number} count maximum number of records to return
 * @returns {Promise<{ status: string; records: object[]; offset: number; }>}
 */
async function getRecords(login, object, offset, count) {
    let records = [];

    // The offset defines where we should start and the count defines the page size.
    // Query string values arrive as strings, so coerce them to numbers first.
    const start = Number(offset) || 0;
    const size = Number(count) || 25;
    const recordWindow = (FAKE_DATA[object] || []).slice(start, start + size);

    for (let record of recordWindow) {

        let thisResult = {
            id: record.Id, // All records *must* have a unique ID for BigID to identify them
            data: []
        };
        for (let field in record) {
            thisResult.data.push({ fieldName: field, fieldType: typeof record[field], fieldValue: record[field] });
        }
        records.push(thisResult);
    }

    return {
        status: "success",
        records: records,
        // Report where the next page starts, not just the size of this page
        offset: start + records.length
    }

}

/**
 * Get a count of the number of records of a given object type
 *
 * @async
 * @param {{ domain: string; user: string; pass: string; header: string; }} login login information from BigID
 * @param {string} object object name
 * @returns {unknown}
 */
async function countRecords(login, object) {
    return {
        status: "success",
        count: FAKE_DATA[object].length
    };
}

/**
 * Search through objects
 *
 * @param {{ domain: string; user: string; pass: string; header: string; }} login
 * @param {*} object
 * @param {*} search
 */
function search(login, object, search) {
    //TODO
}

/**
 * Returns BASIC formatted auth data from the authorization header.
 * Sends a 401 response and returns null when no credentials are supplied.
 *
 * @param {*} req
 * @param {*} res
 * @returns {{ user: string; pass: string; header: string; } | null}
 */
function getAuthData(req, res) {

    if (req.headers.authorization === undefined) {
        console.log("No Login information supplied");
        res.status(401).json({
            status: 'error'
        })
        return null;
    }

    const b64auth = (req.headers.authorization || '').split(' ')[1] || ''
    const decoded = Buffer.from(b64auth, 'base64').toString();

    // Split on the first colon only, because passwords may contain colons
    const separator = decoded.indexOf(':');
    const login = separator === -1 ? decoded : decoded.slice(0, separator);
    const password = separator === -1 ? '' : decoded.slice(separator + 1);

    return {
        user: login,
        pass: password,
        header: 'Basic ' + Buffer.from(login + ':' + password).toString('base64')
    }
}

app.get('/objects/', async (req, res) => {
    console.log(`LISTOBJS()`);

    const login = getAuthData(req, res);
    if (login === null) { return; }

    const obj = await describeObjects(login);
    if (obj.status === "success") {
        return res.json(obj)
    }

    return res.status(401).json(obj);
});

app.get('/objects/:object/describe', async (req, res) => {
    console.log(`DESCRIBEOBJS(object=${req.params.object})`);

    const login = getAuthData(req, res);
    if (login === null) { return; }

    let result = await describeObject(login, req.params.object);

    if (result.status === "success") {
        return res.json(result);
    }
    return res.status(401).json(result);

});

app.get('/objects/:object/records', async (req, res) => {
    const login = getAuthData(req, res);
    if (login === null) { return; }

    // Query string values are strings, so parse them before use
    let offset = parseInt(req.query.Offset, 10) || 0;
    let count = parseInt(req.query.Count, 10) || 25;
    if (count > 200) { count = 200 }

    // Log only after offset and count have been declared
    console.log(`GETRECORDS(offset=${offset},count=${count},object=${req.params.object})`);

    return res.json(await getRecords(login, req.params.object, offset, count));
});

app.get('/objects/:object/count', async (req, res) => {
    console.log(`COUNTRECORDS(object=${req.params.object})`);

    const login = getAuthData(req, res);
    if (login === null) { return; }

    return res.json(await countRecords(login, req.params.object));
});

app.listen(port, () => {
    console.log(`Sample Connector listening on port ${port}`)
});

Knowledge Check

1 What is the primary purpose of BigID connectors? (Select one)

To translate various data formats into a standard format expected by BigID.
To store large volumes of data.
To manage user access permissions.
To generate reports automatically.

2 Which statement about REST connectors is true? (Select one)

REST connectors are bound by HTTP limitations such as timeouts and size constraints.
REST connectors allow unlimited data transfer.
REST connectors do not require network access.
REST connectors are better suited for complex use cases involving data streams.

3 What are the two types of external connectors in BigID? (Select two)

Structured
Unstructured
Metadata
Hyperscan

4 How are Java-based connectors installed in BigID? (Select one)

By manually loading the JAR file into the scanner using the command line.
By uploading them through the BigID web interface.
By configuring them in the network settings.
By downloading them directly from the BigID cloud.


Writing a Structured Connector

Structured Connector (Databases)

  • Object: Something in the world that you are storing multiple of. Examples: Customers, Products, Orders. All Objects must have the same Fields. An Object contains Records.
  • Record: An individual item you are storing. Examples: Your customer Michael, an apple you sell, an order Michael made for the product. All Records must have a unique ID. A Record contains Fields.
  • Fields: Pieces of information about a record. Examples: Michael's address is 100 Church St, an apple costs $1, Michael's order was placed on January 1st. Every field has a field name, a field value and a field type. Field types can be any string, but typically represent database types like string or number.

Unstructured Connector (Files/Streams)

  • Container: A grouping of files or streams. Think of a file folder. Examples: Invoices, 2024, LogStreamForApp. A Container contains Objects.
  • Object: An individual file or stream of information. Examples: InvoiceJan1.pdf, TaxNoticeFor2024.pdf, Kinesis Stream named LogStreamForApp. All Objects must have a unique ObjectName. An Object contains Metadata and an InputStream.
  • Metadata: Pieces of information about an Object. Examples: Last Modified Date, Owner, Name, File Permissions.
  • InputStream: The data inside the file in UTF8 format.

Structured connectors are used to connect to systems that organize data in a predefined format. Examples of structured data sources are databases, spreadsheets, and data warehouses.

As we mentioned before, BigID connectors serve as translators between an external system's format and BigID's data format. The diagram above shows the structure of that internal data format. Structured data sources consist of Objects. You can think of objects as things in the real world like products, or invoices, or even your customers. If you're familiar with relational databases, an object most often takes the form of a table in your database.

Inside of objects we have records. Records are an individual instance of an object. Because fields are defined at the object level, all records of the same type will have the same fields. There is only one required field for a Record and that is a unique ID. BigID will use this ID when attempting to fetch an individual record so it's important that it's unique.

Inside of records, we have fields and values. Fields and values are the actual data BigID will scan for your object. Fields will also have a datatype. The datatype of a field is shown unedited inside the BigID Catalog. This means it can be any string you want to present to users, whether that be something easy to understand like "text" or something more specific like varchar(255). Keep your audience in mind when designating the data types for a given field. Many of the users of your BigID system may be concerned by highly technical data types.
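
As a rough sketch of that advice, a connector could map its data source's native column types to friendlier labels before reporting them as a fieldType. The mapping table and the toFriendlyType function below are hypothetical names made up for this illustration; BigID accepts any string as a field type.

```javascript
// Hypothetical mapping from native database types to user-friendly labels.
const FRIENDLY_TYPES = [
    { pattern: /^(varchar|char|text)/i, label: "text" },
    { pattern: /^(int|bigint|smallint|decimal|numeric|float|double)/i, label: "number" },
    { pattern: /^(date|time|timestamp)/i, label: "date" },
    { pattern: /^bool/i, label: "true/false" }
];

// Return a friendly label for a native type, or the native type itself
// when no friendlier label is known.
function toFriendlyType(nativeType) {
    for (const { pattern, label } of FRIENDLY_TYPES) {
        if (pattern.test(nativeType)) {
            return label;
        }
    }
    return nativeType;
}

console.log(toFriendlyType("varchar(255)"));
console.log(toFriendlyType("geometry"));
```

Falling back to the native type keeps unusual columns visible in the Catalog instead of hiding them behind a generic label.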


While your data source may store its data differently, you will need to reorganize it into this structure for BigID to scan the data. Remember that connectors are translators. Improper translation from your data source's format to BigID's is the largest issue when testing and developing custom connectors. See the example below of a JSON file for a single user converted into this format:

{
    "user":"[email protected]",
    "phone": "(321) 555-5555",
    "information": [
        {"key": "name", "value": "user"},
        {"key": "favcolor", "value": "green"}
    ]
}

translates to

{
  "objectName": "User",
  "fields": [
    {
      "fieldName": "Name",
      "fieldType": "string"
    },
    {
      "fieldName": "FavColor",
      "fieldType": "string"
    },
    {
      "fieldName": "Phone",
      "fieldType": "string"
    }
  ]
}
{
  "id": "[email protected]",
  "fields": [
    {
      "fieldName": "phone",
      "fieldValue": "(321) 555-5555"
    },
    {
      "fieldName": "name",
      "fieldValue": "user"
    },
    {
      "fieldName": "favcolor",
      "fieldValue": "green"
    }
  ]
}
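
The translation above can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions: the function names (flattenFields, toRecord, toObjectDescription) are hypothetical, the email address is a placeholder, and the field types are inferred with typeof rather than taken from a real schema.

```javascript
// Sample input in the data source's own format.
// The email address is a placeholder chosen for this sketch.
const sourceUser = {
    user: "user@example.com",
    phone: "(321) 555-5555",
    information: [
        { key: "name", value: "user" },
        { key: "favcolor", value: "green" }
    ]
};

// Flatten the top-level phone value and the nested key/value pairs
// into one list of { fieldName, fieldValue } entries.
function flattenFields(source) {
    const fields = [{ fieldName: "phone", fieldValue: source.phone }];
    for (const item of source.information) {
        fields.push({ fieldName: item.key, fieldValue: item.value });
    }
    return fields;
}

// Build the record: the unique "user" value becomes the record ID.
function toRecord(source) {
    return { id: source.user, fields: flattenFields(source) };
}

// Build the object description: one entry per field, typed with typeof.
function toObjectDescription(objectName, source) {
    return {
        objectName: objectName,
        fields: flattenFields(source).map((f) => ({
            fieldName: f.fieldName,
            fieldType: typeof f.fieldValue
        }))
    };
}

console.log(JSON.stringify(toRecord(sourceUser), null, 2));
console.log(JSON.stringify(toObjectDescription("User", sourceUser), null, 2));
```

Sharing one flattening step between the description and the record keeps the two views of the same object from drifting apart.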

If you're using code generation software, or just want it for your own purposes, the swaggerfile describing a structured connector is available here.

Connector Endpoints

BigID will request and view your connector’s hierarchy through six endpoints:

  • List what fields an object has (GET /objects/<objectName>/describe)
  • List what objects exist (GET /objects)
  • List what records are inside an object (GET /objects/<objectName>/records)
  • Count the number of records inside an object (GET /objects/<objectName>/count)
  • Return the fields and values for a given record ID (GET /objects/<objectName>/<recordID>)
  • Search for records in an object (POST /objects/<objectName>/sar)

You are not required to implement every endpoint outlined here. If you want a connector that only does DSAR requests, you only need to implement the following endpoints:

  • List what objects exist (GET /objects)
  • List what fields an object has (GET /objects/<objectName>/describe)
  • Search for records in an object (POST /objects/<objectName>/sar)

Any additional endpoints added to your connector are of no interest to BigID. The BigID system is not concerned with the implementation details of your connector as long as HTTP requests to these endpoints succeed and return responses in the correct format.

List what fields an object has

Since all records within an object must contain the same fields, BigID uses an object's fields to tell users what type of information is inside your data source. BigID expects the following response from your connector to see what fields an object has:

GET /objects/User/describe HTTP/1.1
{
  "status": "success",
  "objectName": "User",
  "fields": [
    {
      "fieldName": "Name",
      "fieldType": "string"
    },
    {
      "fieldName": "FavColor",
      "fieldType": "string"
    },
    {
      "fieldName": "Phone",
      "fieldType": "string"
    }
  ]
}


Exercise: List what fields an object has

Scenario: You want to create a connector for your org to be able to scan for sensitive data within an ecommerce system, ShopDB. Create an endpoint for your connector that allows BigID to list an object's fields.

List what objects exist

Now that we have an object class created, we need to let BigID know about all of the objects our connector has access to. Our object listing endpoint needs to display the fields within an object that we implemented in the previous exercise. As we implement these endpoints, you will see that many of them rely on one another.

GET /objects/ HTTP/1.1
{
  "status": "success",
  "objects": [
    {
      "objectName": "User",
      "fields": [
        {
          "fieldName": "Name",
          "fieldType": "string"
        },
        {
          "fieldName": "FavColor",
          "fieldType": "string"
        },
        {
          "fieldName": "Phone",
          "fieldType": "string"
        }
      ]
    }
  ]
}

Exercise: List what objects exist

Scenario: For your ShopDB connector to work, you need to tell BigID what objects are inside the database. The code for this endpoint is similar to your previous endpoint. Create an endpoint for your connector that allows BigID to list out all objects.

List what records are inside an object

Now that BigID knows what objects we have and what type of information it can find inside those objects, we need to return the actual data BigID will scan.

GET /objects/User/records?Count=&Offset= HTTP/1.1
{
  "status": "success",
  "records": [
    {
      "id": 1,
      "data": [
        {
          "fieldName": "Name",
          "fieldValue": "Michael",
          "fieldType": "string"
        },
        {
          "fieldName": "Phone",
          "fieldValue": "3215555555",
          "fieldType": "string"
        }
      ]
    }
  ],
  "offset": 1
}

Exercise: List what records are inside an object

Scenario: BigID has the structure of your connector, but scans are still failing with 0 records found. Create an endpoint for your connector that allows BigID to list out records.

Count the number of records inside an object

In order for BigID to properly paginate through your data sources, we need to know how many total records to expect.

GET /objects/User/count HTTP/1.1
{
  "status": "success",
  "count": 1
}

Record Pagination

In a previous exercise, you returned all records inside of an object in a single call. It's extremely common for providers not to allow all records to be returned in a single call, either due to performance or cost reasons. This is where the offset and count parameters come in.

Data providers typically meter their APIs in one of the following ways:

  • Limit the number of requests per second/hour/day
  • Limit the number of records returned in a single request
  • Limit the amount of data in GB processed in a given time period

To handle these use cases, BigID has two properties on the records endpoint: Count and Offset.

Count dictates the batch size. This is how many records your connector should return in a single call. This allows BigID users to throttle your connector based on how many records it returns per page.

Offset allows BigID to paginate requests.

A sample scan using the records endpoint looks like the following:

  • BigID -> Connector: Call /count
  • Connector -> BigID: Total records = 40
  • BigID -> Connector: Call /records with offset = 0 and count = 20
  • Connector -> BigID: 20 Records retrieved, offset = 20
  • BigID -> Connector: Call /records with offset = 21 and count = 20
  • Connector -> BigID: 20 Records retrieved, offset = 40
  • BigID -> BigID: Stop as offset >= total (40 >= 40)

Warning: Not implementing offset correctly will result in BigID continually scanning the data source and the scan never completing. Your connector must stop sending records when the offset supplied is larger than the total number of records.
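
A minimal sketch of that stop condition follows. It assumes Offset is a zero-based index of the first record to return and that the offset in the response marks where the next page starts. ALL_RECORDS, getPage and simulateScan are hypothetical names, and the loop only imitates a caller for illustration; it is not BigID's actual scanner logic.

```javascript
// Hypothetical in-memory data source with 40 records.
const ALL_RECORDS = Array.from({ length: 40 }, (_, i) => ({ Id: i + 1 }));

// Return one page of records starting at a zero-based offset.
function getPage(offset, count) {
    // slice() returns an empty array once offset reaches the end,
    // which is what lets the caller stop.
    const page = ALL_RECORDS.slice(offset, offset + count);
    return {
        status: "success",
        records: page.map((r) => ({ id: r.Id, data: [] })),
        offset: offset + page.length
    };
}

// Simulate the calling side: keep requesting pages until the returned
// offset reaches the total, or a page comes back empty.
function simulateScan(count) {
    const total = ALL_RECORDS.length;
    let offset = 0;
    let calls = 0;
    let seen = 0;
    while (offset < total) {
        const page = getPage(offset, count);
        calls += 1;
        if (page.records.length === 0) { break; }
        seen += page.records.length;
        offset = page.offset;
    }
    return { calls, seen };
}

console.log(simulateScan(20));
```

Because getPage never returns records past the end of the data, a caller that keeps asking simply receives empty pages instead of looping over the same records forever.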

Exercise: Pagination

Scenario: Your connector is constantly scanning and scans are not completing. Implement pagination so BigID knows when you are done scanning.

Return the fields and values for a given record ID

There are cases when we only want to look into an individual record. This is especially true in cases where we already know the unique ID like a DSAR or a record investigation.

GET /objects/User/records/1 HTTP/1.1
{
  "status": "success",
  "records": [
    {
      "id": 1,
      "data": [
        {
          "fieldName": "Name",
          "fieldValue": "Michael",
          "fieldType": "string"
        },
        {
          "fieldName": "Phone",
          "fieldValue": "3215555555",
          "fieldType": "string"
        }
      ]
    }
  ],
  "offset": 1
}
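
One way to produce a response like the one above is sketched below. The USERS table, the second row in it, and the function names are hypothetical, and the empty "not found" response is an assumption of this sketch rather than documented BigID behavior.

```javascript
// Hypothetical in-memory table standing in for the data source.
const USERS = [
    { Id: 1, Name: "Michael", Phone: "3215555555" },
    { Id: 2, Name: "Bob", Phone: "4075555555" }
];

// Convert a raw row into the record shape used by the records endpoint.
function toRecord(row) {
    const data = Object.keys(row).map((field) => ({
        fieldName: field,
        fieldValue: row[field],
        fieldType: typeof row[field]
    }));
    return { id: row.Id, data: data };
}

// Look up one record by its unique ID.
// URL path segments are strings, so compare as strings.
function getRecordById(rows, recordId) {
    const row = rows.find((r) => String(r.Id) === String(recordId));
    if (row === undefined) {
        // The shape of a "not found" response is an assumption of this sketch
        return { status: "success", records: [], offset: 0 };
    }
    return { status: "success", records: [toRecord(row)], offset: 1 };
}

console.log(JSON.stringify(getRecordById(USERS, "1")));
```

Comparing IDs as strings avoids a silent miss when the data source stores numeric IDs but the ID arrives as part of the URL.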

Exercise: Return the fields and values for a given record ID

Scenario: Your team has been trying to investigate sensitive data in your data source using the "Investigate" button in the data inventory, but they aren't getting results. Create an endpoint for your connector that allows BigID to look up an individual record.

Search for records in an object

The final service our connector needs to offer is the ability to search for records. This is what allows our connector to do DSARs. Depending on your data source, you may need to implement searching on your own.

POST /objects/User/records/sar HTTP/1.1
Content-Type: application/json


[{"fieldName": "Name", "fieldValue": "Michael", "isFullMatch": "false"}]
{
  "status": "success",
  "records": [
    {
      "id": 1,
      "data": [
        {
          "fieldName": "Name",
          "fieldValue": "Michael",
          "fieldType": "string"
        },
        {
          "fieldName": "Phone",
          "fieldValue": "3215555555",
          "fieldType": "string"
        }
      ]
    }
  ],
  "offset": 1
}
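
If the data source has no search of its own, the matching can be done in the connector. The sketch below makes several assumptions that are not confirmed by the text above: multiple criteria are combined with AND, matching is case-insensitive, and a partial match means "contains". USERS, matches and searchRecords are hypothetical names.

```javascript
// Hypothetical in-memory table standing in for the data source.
const USERS = [
    { Id: 1, Name: "Michael", Phone: "3215555555" },
    { Id: 2, Name: "Bob", Phone: "4075555555" }
];

// The sample request sends isFullMatch as the string "false",
// so accept both booleans and strings.
function isFullMatch(criterion) {
    return criterion.isFullMatch === true || criterion.isFullMatch === "true";
}

// Check one row against one search criterion.
function matches(row, criterion) {
    if (!(criterion.fieldName in row)) { return false; }
    const actual = String(row[criterion.fieldName]).toLowerCase();
    const wanted = String(criterion.fieldValue).toLowerCase();
    return isFullMatch(criterion) ? actual === wanted : actual.includes(wanted);
}

// Return every row that satisfies all criteria (AND semantics assumed).
function searchRecords(rows, criteria) {
    const hits = rows.filter((row) => criteria.every((c) => matches(row, c)));
    return {
        status: "success",
        records: hits.map((row) => ({
            id: row.Id,
            data: Object.keys(row).map((field) => ({
                fieldName: field,
                fieldValue: row[field],
                fieldType: typeof row[field]
            }))
        })),
        offset: hits.length
    };
}

const body = [{ fieldName: "Name", fieldValue: "Mich", isFullMatch: "false" }];
console.log(searchRecords(USERS, body).records.length);
```

A linear filter like this is fine for small data sets; a real data source with its own query language would usually be asked to do the filtering instead.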

Exercise: Search for records in an object

Scenario: Your team is using BigID to comply with GDPR DSAR requests. ShopDB doesn't have any search functionality so you'll have to implement it yourself. Implement the SAR endpoint so BigID can properly query your data source.

Authentication

The connector you've written allows anyone on the internet to see the information inside of your data source, which is no good. This connector's database also has no username and password, which is extremely rare. Typically, the data sources your connectors access will also require authentication of some type.

BigID has the ability to store and manage these credentials for you inside of the BigID system or use credentials from external password vaults. When a scan is started, BigID will send the credentials to your connector.

Credentials sent to your connector will always take the form of a Basic authentication header. This header is only Base64-encoded, not encrypted. For this reason, you must never install a BigID connector without HTTPS.

Warning: Never install a BigID connector without HTTPS. Credentials are sent to it unencrypted and will be exposed if you do not secure your connector.
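
To make the risk concrete, the sketch below decodes a Basic authentication header back into its username and password using nothing but Node's built-in Buffer. The parseBasicAuth name and the sample credentials are made up for this illustration.

```javascript
// Decode an "Authorization: Basic <base64>" header into username and password.
// Returns null when the header is missing or is not Basic auth.
function parseBasicAuth(headerValue) {
    if (typeof headerValue !== "string") { return null; }
    const [scheme, encoded] = headerValue.split(" ");
    if (!scheme || scheme.toLowerCase() !== "basic" || !encoded) { return null; }

    const decoded = Buffer.from(encoded, "base64").toString("utf8");
    // Split on the first colon only, because passwords may contain colons
    const separator = decoded.indexOf(":");
    if (separator === -1) { return null; }

    return {
        user: decoded.slice(0, separator),
        pass: decoded.slice(separator + 1)
    };
}

// Anyone who sees the header can reverse it, which is why HTTPS is required.
const header = "Basic " + Buffer.from("scanner:s3cr3t:value").toString("base64");
console.log(parseBasicAuth(header));
```

No key or secret is needed to decode the header, so the only protection for these credentials in transit is the TLS connection itself.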

While Basic auth credentials always take the form of a username and password, not all systems use that mechanism for login. Below are some examples of how you can format a username and password into the credentials your application is expecting:

  • OAuth Client Credentials - Username: https://tenanturl.com/,CLIENT_ID Password: CLIENT_SECRET
  • Multi Tenant SaaS application with API key - Username: tenantID Password: API_KEY
  • SaaS application with just an API key - Username: <blank> Password: API_KEY
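
The conventions in the list above could be unpacked inside a connector roughly as follows. The "kind" labels, the interpretCredentials function and the returned property names are all hypothetical; how a connector actually knows which convention applies is left open here.

```javascript
// Interpret the Basic auth username/password according to the conventions above.
// The "kind" labels and the returned property names are hypothetical.
function interpretCredentials(kind, user, pass) {
    switch (kind) {
        case "oauth-client-credentials": {
            // Username is "<tenant URL>,<client ID>"; split on the last comma
            const comma = user.lastIndexOf(",");
            if (comma === -1) {
                throw new Error("Expected username in the form <tenantUrl>,<clientId>");
            }
            return {
                tenantUrl: user.slice(0, comma),
                clientId: user.slice(comma + 1),
                clientSecret: pass
            };
        }
        case "tenant-api-key":
            return { tenantId: user, apiKey: pass };
        case "api-key":
            // Username is left blank and ignored
            return { apiKey: pass };
        default:
            throw new Error(`Unknown credential kind: ${kind}`);
    }
}

console.log(interpretCredentials(
    "oauth-client-credentials",
    "https://tenanturl.com/,CLIENT_ID",
    "CLIENT_SECRET"
));
```

Failing loudly on a malformed username makes a misconfigured data source show up as a clear error instead of a confusing authentication failure further downstream.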

Exercise: Authentication

Scenario: Your organization is writing a connector and wants to be sure the data is only viewable to BigID. You've set up the appropriate firewalls, but also want your connector to be protected by a username and password. Protect your records endpoint with username and password auth.

Warning: In the real world, never hardcode credentials into a connector. Your connector should use the native auth of your data source. This will allow you to reuse the connector for multiple instances of that data source.

Writing an Unstructured Connector

Unstructured connectors allow you to connect to data sources that store data as unorganized blobs of data. Most commonly this means file stores, but data warehouses and streaming applications also fall under this type of data sources. Because unstructured data sources don't have discrete labels for each piece of data inside them, we use a different format inside of BigID for this type of data.

Unstructured sources have Containers that represent a logical grouping of data. Typically a Container is a file folder, but it can be any label applied to a group of data.

Inside of containers you have Objects. Objects represent a single piece of data stored inside a container. The most common example of an object is a file. All objects inside a container need to have a unique name so BigID can request information about a specific object.

Objects contain two separate pieces of data. One is Metadata. Metadata is information about a given file or stream. Most commonly this means information like who the owner of the file is and when it was last modified. Some types of data, like photos, may have additional metadata, such as what type of camera took the picture. BigID uses this data to analyze permissions and scan faster.

Objects also contain an inputStream. This is the raw file contents that BigID will scan.
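The hierarchy described above can be pictured as a small in-memory model. This is only a sketch for orientation. The class and attribute names (`Container`, `UnstructuredObject`, `add`) are hypothetical and are not part of the BigID specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UnstructuredObject:
    object_name: str          # must be unique inside its container
    metadata: Dict[str, str]  # e.g. owner, last modified date
    content: str              # the inputStream, held here as decoded text


@dataclass
class Container:
    container_name: str
    objects: List[UnstructuredObject] = field(default_factory=list)

    def add(self, obj: UnstructuredObject) -> None:
        """Add an object, enforcing the unique-name rule for this container."""
        if any(existing.object_name == obj.object_name for existing in self.objects):
            raise ValueError(f"duplicate object name: {obj.object_name}")
        self.objects.append(obj)
```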

If you're using code generation software, or just want it for your own purposes, the Swagger file describing an unstructured connector is available here.

Connector Endpoints

BigID will request and view your connector’s hierarchy through the following endpoints:

  • List what Containers exist (/objects)
  • List what Objects exist (/objects/<containerName>?Offset=)
  • Return the Metadata for a given object (GET /objects/<containerName>/<objectName>/describe)
  • Return the InputStream for a given object (GET /objects/<containerName>/<objectName>/content-stream)
  • Search for specific data inside objects (POST /objects/<containerName>/<objectName>/sar?Offset=)

Just like for structured connectors, you are not required to implement every endpoint outlined here. If you want a connector that only does DSAR requests you only need to implement the following endpoints:

  • List what Containers exist (/objects)
  • List what Objects exist (/objects/<containerName>?Offset=)
  • Search for specific data inside objects (POST /objects/<containerName>/<objectName>/sar?Offset=)

Any additional endpoints added to your connector are of no interest to BigID. The BigID system is not concerned with the implementation details of your connector as long as HTTP requests to these endpoints succeed and return responses in the correct format.
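One way to picture the endpoint list is as a small routing table that maps a method and path to a handler name. The sketch below is framework-independent and makes a few assumptions: the handler names are hypothetical, the two listing endpoints are treated as GET requests (as in the sample requests in this section), the path is matched without its query string, and container and object names are assumed not to contain a slash.

```python
import re

# (HTTP method, path pattern, hypothetical handler name)
ROUTES = [
    ("GET", re.compile(r"/objects"), "list_containers"),
    ("GET", re.compile(r"/objects/(?P<container>[^/]+)"), "list_objects"),
    ("GET", re.compile(r"/objects/(?P<container>[^/]+)/(?P<object>[^/]+)/describe"),
     "describe_object"),
    ("GET", re.compile(r"/objects/(?P<container>[^/]+)/(?P<object>[^/]+)/content-stream"),
     "content_stream"),
    ("POST", re.compile(r"/objects/(?P<container>[^/]+)/(?P<object>[^/]+)/sar"),
     "search_object"),
]


def match_route(method, path):
    """Return (handler_name, path_params) for a request, or None if unknown."""
    for route_method, pattern, handler_name in ROUTES:
        if method != route_method:
            continue
        match = pattern.fullmatch(path)
        if match:
            return handler_name, match.groupdict()
    return None
```

A DSAR-only connector would keep just the `list_containers`, `list_objects`, and `search_object` rows.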

List what containers exist

The first step in BigID's scan of unstructured data sources will always be to list out the containers that exist within a data source. This allows BigID to know where to start the discovery process. Compared to structured connectors, this endpoint is extremely simple. That's because containers or folders don't really specify anything about the information that can be inside them, just that they are a storage location.

GET /objects HTTP/1.1
{
  "status": "success",
  "containers": [
    {
      "containerName": "Folder"
    },
    {
      "containerName": "Folder2"
    }
  ]
}
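For a file server, a minimal sketch of this response could treat each subdirectory of a root folder as a container. The function name `list_containers` and the `root_dir` parameter are hypothetical, and mapping directories to containers is an assumption about how you choose to model your source.

```python
import os


def list_containers(root_dir):
    """Build the /objects response, treating each subdirectory as a container."""
    names = sorted(entry.name for entry in os.scandir(root_dir) if entry.is_dir())
    return {
        "status": "success",
        "containers": [{"containerName": name} for name in names],
    }
```

Files sitting directly in `root_dir` are ignored here because they do not belong to any container.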

Exercise: List what containers exist

Scenario: Your team has a set of CSV files they are looking to scan on a file server. Implement the containers endpoint to let BigID know what containers are present on the file server.

List what objects exist inside a container

Now that BigID knows what containers our connector has access to, we can give it insights into the objects inside those containers. This endpoint will also provide BigID with the metadata about the objects so we can do things like HyperScan and enforce permissions policies.

You'll notice that certain fields are set directly on the object, whereas others are set inside of customFields. The fields set directly on the object are predetermined by the BigID connector specification and cannot change. The customFields list allows you to add your own.

sarCapable refers to whether an unstructured data source could be processed as structured data. This means BigID will ignore the actual file contents and only use structured fields defined here. It should not be set to true for any file you want to perform SARs on, especially if you want BigID to classify the file content.

GET /objects/Folder?Offset= HTTP/1.1
{
  "status": "success",
  "objects": [
    {
      "containerName": "Folder",
      "objectName": "1.pdf",
      "dateCreated": "26/02/2020 19:00:34",
      "lastModified": "26/02/2020 19:00:34",
      "owner": 501,
      "sizeInBytes": 177924,
      "schemaFields": "NULL",
      "sarCapable": false,
      "customFields": [
        {
          "fieldName": "file_type",
          "fieldValue": "pdf"
        }
      ]
    }
  ],
  "offset": 1
}
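The Offset query parameter and the offset value in the response suggest a simple paging loop. The sketch below assumes that an empty or missing Offset means "start at 0" and that the returned offset is the position of the next unread object, which matches the sample (one object returned, offset 1) but is an interpretation rather than something this page states. The names `parse_offset`, `paginate_objects`, and `page_size` are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def parse_offset(request_target):
    """Read the Offset query parameter, treating empty or missing as 0."""
    query = parse_qs(urlsplit(request_target).query, keep_blank_values=True)
    raw = query.get("Offset", [""])[0]
    return int(raw) if raw else 0


def paginate_objects(all_objects, offset=0, page_size=100):
    """Return one page of object metadata plus the offset of the next page."""
    page = all_objects[offset:offset + page_size]
    return {
        "status": "success",
        "objects": page,
        "offset": offset + len(page),  # assumed meaning: next unread position
    }
```

Here `all_objects` is a list of metadata dictionaries shaped like the entries in the sample response.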

Exercise: List what objects exist inside a container

Scenario: List the objects inside of the container and their metadata so BigID can populate the catalog. Implement the objects endpoint to let BigID know what objects are present on the file server.

Return the metadata for a given object

Just like we listed out the metadata for all objects, we need to be able to list out the metadata for a single object so BigID can inspect single objects without needing to go through potentially thousands of listings in our container.

GET /objects/Folder/1.pdf/describe HTTP/1.1
{
  "status": "success",
  "containerName": "Folder",
  "objectName": "1.pdf",
  "dateCreated": "26/02/2020 19:00:34",
  "lastModified": "26/02/2020 19:00:34",
  "owner": 501,
  "sizeInBytes": 177924,
  "schemaFields": "NULL",
  "sarCapable": false,
  "customFields": [
    {
      "fieldName": "file_type",
      "fieldValue": "pdf"
    }
  ]
}
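For files on disk, most of these metadata fields can be filled from a single stat call. The sketch below is one possible mapping under stated assumptions: the function name and `root_dir` parameter are hypothetical, the date format is inferred from the sample values, the file extension is used as the file_type custom field, and schemaFields and sarCapable are copied from the sample rather than computed.

```python
import os
import time

DATE_FORMAT = "%d/%m/%Y %H:%M:%S"  # inferred from "26/02/2020 19:00:34"


def _format_time(epoch_seconds):
    return time.strftime(DATE_FORMAT, time.localtime(epoch_seconds))


def describe_object(root_dir, container_name, object_name):
    """Build the describe response for one file without listing its folder."""
    path = os.path.join(root_dir, container_name, object_name)
    info = os.stat(path)
    extension = os.path.splitext(object_name)[1].lstrip(".")
    return {
        "status": "success",
        "containerName": container_name,
        "objectName": object_name,
        # st_ctime is creation time on Windows, metadata-change time on Unix.
        "dateCreated": _format_time(info.st_ctime),
        "lastModified": _format_time(info.st_mtime),
        "owner": info.st_uid,
        "sizeInBytes": info.st_size,
        "schemaFields": "NULL",
        "sarCapable": False,
        "customFields": [{"fieldName": "file_type", "fieldValue": extension}],
    }
```

A production connector should also validate `container_name` and `object_name` so a request cannot escape `root_dir`.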

Exercise: Return the metadata for a given object

Scenario: We need our users to be able to see the metadata of a single file. The format is exactly the same as the container listing endpoint, so feel free to reuse your code. Implement the object endpoint to let BigID know a single object's metadata.

Return the inputStream for a given object

In order for BigID to do classification on the contents of the file it needs access to them. The inputStream endpoint is unique in that it just returns the raw data from the file. There's no formatting required here, just return the raw data.

What about binary data? BigID doesn't natively have the ability to process binary data from REST connectors. You'll need to use a library in your connector to convert the document to UTF-8 text.

GET /objects/Folder/File/content-stream HTTP/1.1

utf-8 data here
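For plain text files, returning the content can be as small as reading the bytes and making sure they leave the connector as UTF-8. The sketch below assumes the file is text in a known encoding. The function name and the `source_encoding` parameter are hypothetical, and binary formats such as PDF would first need a text-extraction library that is not shown here.

```python
def read_content_stream(path, source_encoding="utf-8"):
    """Return the file's text as UTF-8 bytes, ready to send as the response body."""
    with open(path, "rb") as handle:
        raw = handle.read()
    # Undecodable bytes are replaced rather than raising, so a scan can continue.
    text = raw.decode(source_encoding, errors="replace")
    return text.encode("utf-8")
```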

Exercise: Return the inputStream for a given object

Search for specific data inside objects

POST /objects/Folder/File/sar?Offset= HTTP/1.1
Content-Type: application/json


[{"fieldName": "Name", "fieldValue": "Michael", "isFullMatch": "false"}]
{
  "status": "success",
  "records": [
    {
      "id": "container/object",
      "data": [
        {
          "searchedFieldName": "fieldName",
          "searchedFieldValue": "fieldValue",
          "fullObjectName": "container/object",
          "offset": "12"
        }
      ]
    }
  ]
}
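The request and response above can be approximated with a plain text search over one object's content. This sketch relies on several assumptions that this page does not confirm: that offset is the character position of the match within the content, that isFullMatch "true" means a whole-word match while "false" means a substring match, and that an object with no matches produces an empty records list. The function name `search_object` is hypothetical.

```python
import re


def search_object(container_name, object_name, text, search_terms):
    """Search one object's text for each requested value and report matches."""
    full_name = f"{container_name}/{object_name}"
    hits = []
    for term in search_terms:
        value = term["fieldValue"]
        if not value:
            continue  # an empty value would match at every position
        pattern = re.escape(value)
        # Assumption: a full match means the value appears as a whole word.
        if str(term.get("isFullMatch", "false")).lower() == "true":
            pattern = r"\b" + pattern + r"\b"
        for match in re.finditer(pattern, text):
            hits.append({
                "searchedFieldName": term["fieldName"],
                "searchedFieldValue": value,
                "fullObjectName": full_name,
                "offset": str(match.start()),  # a string, as in the sample
            })
    records = [{"id": full_name, "data": hits}] if hits else []
    return {"status": "success", "records": records}
```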

Exercise: Search for specific data inside objects

Certification

To become a BigID certified developer, you need to create a connector showing your knowledge. There are no specific programming language requirements for this connector, just like within BigID. We will not judge your code on syntax or style. We’ve designed this project to be flexible enough to focus on your specific domain while also being rigorous enough to verify your knowledge to BigID customers around the world.

Grading

The project consists of requirements and optional exercises. Your application MUST implement all requirements. Failing to implement any one requirement will result in a failing score. After implementing all requirements, you may choose a path for your connector. All parts of a path must be completed.

Requirements

  • Your connector must be packaged as a .zip file.
  • The zip file must contain your connector's code, a Docker image packaged as a tar.gz, and instructions on running your connector.
  • Your connector must respond on port 3000 over HTTP.
  • Your application may not rely on any external servers beyond those specified in this document.
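To make the port requirement concrete, here is a minimal sketch of an HTTP server that listens on port 3000 using only the Python standard library. It is a starting skeleton under stated assumptions, not a complete connector: the `ConnectorHandler` class and `build_server` function are hypothetical names, and the /objects response is an empty placeholder in the unstructured format.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class ConnectorHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        path = self.path.split("?", 1)[0]  # ignore the query string
        if path == "/objects":
            # Placeholder body; a real connector lists its containers here.
            self._send_json(200, {"status": "success", "containers": []})
        else:
            self.send_error(404)

    def _send_json(self, status_code, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status_code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def build_server(port=3000):
    """Create a server bound to all interfaces on the given port."""
    return HTTPServer(("0.0.0.0", port), ConnectorHandler)
```

Calling `build_server().serve_forever()` starts the listener. Inside a Docker image, binding to 0.0.0.0 rather than localhost is what lets the published port reach the process.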

Unstructured REST Connector Path

This connector must read a directory containing a txt file. It must implement the following endpoints as described in the REST connector documentation:

  • List what Containers exist (/objects)
  • List what Objects exist (/objects/<containerName>?Offset=)
  • Return the Metadata for a given object (GET /objects/<containerName>/<objectName>/describe)
  • Return the InputStream for a given object (GET /objects/<containerName>/<objectName>/content-stream)

Structured Connector Path

This connector must read a directory containing a csv file. It must implement the following endpoints as described in the REST connector documentation:

  • List what fields an object has (GET /objects/<objectName>/describe)
  • List what objects exist (/objects)
  • List what records are inside an object (GET /objects/<objectName>/records)
  • Count the number of records inside an object (GET /objects/<objectName>/count)
  • Return the fields and values for a given record ID (GET /objects/<objectName>/<recordID>)

SAR Connector Path

This connector must read a directory containing a JSON file. It must implement the following endpoints as described in the REST connector documentation:

  • List what fields an object has (GET /objects/<objectName>/describe)
  • List what objects exist (/objects)
  • Search for records in an object (POST /objects/<objectName>/sar)