Data Catalog Ingestion Instructions

Overview

This article outlines in detail how to configure ingestions from any of our provided data sources. These steps should be completed by an IT admin, or an equivalent role that has permissions to connect to your backend persistent storage systems.

Four steps are needed to create a new ingestion for any data source:

  1. Select a data source. Currently, we support BigQuery, DynamoDB, Glue, Hive (Azure HDInsight), Kafka (Confluent Cloud), MySQL, Postgres, and Snowflake.
  2. Choose a name for the ingestion. This name must be unique and below 50 characters.
  3. Configure the data source. You must perform these actions in your data source to allow TerraTrue to connect to it. The actions vary by data source and usually consist of the following:
    • Create a new user in your data source
    • Grant necessary permissions to the new user so that TerraTrue can read the metadata of the tables in your data source
    • For MySQL and Postgres, configure your data source’s firewall rule to allow inbound connections from the TerraTrue IP addresses provisioned for your organization
  4. Configure the ingestion. You must perform this step in the data catalog ingestion creation flow in TerraTrue. This varies by data source and usually consists of the following:
    • Entering credentials:
      • Enter the credentials of the user you created in your data source in the previous step
      • For MySQL and Postgres, upload the applicable SSL certificates
    • Using table filters: optionally, you can specify a list of “allow” and “deny” patterns to restrict the ingestion to only the tables that you care about. A table filter pattern is a case-insensitive regular expression, and its format varies by data source.

In the remainder of this document, we detail the specifics for each of the data sources we currently support.

BigQuery

The BigQuery ingestion extracts the schemas of the BigQuery tables in a single GCP project.

Configure Data Source

A. Prerequisites

    1. You must have the resourcemanager.projects.setIamPolicy permission in the GCP project or organization that you wish to ingest BigQuery from.
    2. The following GCP roles include this permission, so make sure your account has one of them:
      • Owner (roles/owner)
      • Security Admin (roles/iam.securityAdmin)
      • Folder Admin (roles/resourcemanager.folderAdmin)
      • Organization Administrator (roles/resourcemanager.organizationAdmin)
      • Project IAM Admin (roles/resourcemanager.projectIamAdmin)

B. Provision a Service Account

In order for us to connect to BigQuery, TerraTrue needs to provision a service account for your organization. Go to the Data Catalog Settings page.

  1. If this is the first time you are setting up a BigQuery ingestion, turn on the “Provision Service Account” toggle in the “Configure Ingestion” step and wait until provisioning finishes. The email of the provisioned service account will be displayed.
  2. If you have already configured another BigQuery ingestion in TerraTrue before, then the provisioned service account from earlier will be displayed.

C. Grant the Service Account Permissions

Run the following command in the terminal to grant the provisioned service account the BigQuery Metadata Viewer role. We need this role to read the schemas of the BigQuery tables. It can take up to seven minutes for the permission change to take effect.

gcloud projects add-iam-policy-binding "<GCP_PROJECT_ID>" \
  --member="serviceAccount:<PROVISIONED_SERVICE_ACCOUNT>" \
  --role="roles/bigquery.metadataViewer" \
  --condition=None
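
If you want to confirm that the binding is in place, you can optionally list the project’s IAM policy filtered to this role. This is a standard gcloud pattern, not a required step:

gcloud projects get-iam-policy "<GCP_PROJECT_ID>" \
  --flatten="bindings[].members" \
  --filter="bindings.role:roles/bigquery.metadataViewer" \
  --format="value(bindings.members)"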

Configure Ingestion

    1. Project ID: this is the ID of the GCP project that you want to ingest BigQuery metadata from. The ingestion only extracts metadata from datasets in this project. You need to create an ingestion for each GCP project that you want to ingest from.
    2. Table filter pattern: 
      • Has the format of project_id.dataset_id.table_id
      • For example, project_x\.dataset_y\..* matches all tables in dataset_y in project_x

Remove Access

Please refer to the “Configure Data Source” subsection for the prerequisites.

Find the Provisioned Service Account

For any of the BigQuery ingestions that you have created, click on the vertical ellipsis button and then the “Edit Ingestion” button. The service account is in the text box labeled “Service Account Email”.

Revoke the Service Account’s Permissions

Run the following command in the terminal to revoke access granted to the provisioned service account for a GCP project. If you have granted access to more than one GCP project, you need to do this for each project:

gcloud projects remove-iam-policy-binding "<GCP_PROJECT_ID>" \
  --member="serviceAccount:<PROVISIONED_SERVICE_ACCOUNT>" \
  --role="roles/bigquery.metadataViewer"

Delete the BigQuery Ingestions

Go to the ingestions page, click on the vertical ellipsis button next to an ingestion that you want to delete and then click “Delete Ingestion”.

DynamoDB

Configure Data Source

Prerequisite

  1. You will need permissions to attach IAM policies and to create users and access keys in your AWS account. Having the IAMFullAccess or AdministratorAccess permission policy attached to your user will suffice.

Create User and Grant Access

  1. Create an IAM user named terratrue_scanner if it doesn’t already exist and attach the AmazonDynamoDBReadOnlyAccess policy directly to this user.
  2. Click on the Security Credentials tab for the terratrue_scanner user and create an access key if it has not already been created.
  3. When the access key has been created, you can click “Show” to reveal the Secret Access Key. You will need the Access Key ID and Secret Access Key during the ingestion configuration process.
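
If you prefer the AWS CLI to the console, the same user, policy attachment, and access key can be created with commands along the following lines:

# Create the user and attach the read-only DynamoDB policy
aws iam create-user --user-name terratrue_scanner
aws iam attach-user-policy --user-name terratrue_scanner \
  --policy-arn arn:aws:iam::aws:policy/AmazonDynamoDBReadOnlyAccess
# Create an access key; note the AccessKeyId and SecretAccessKey in the output
aws iam create-access-key --user-name terratrue_scanner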

Configure Ingestion

  1. Create a secret for the Secret Access Key under the Secrets tab. You will refer to the secret’s name later.
  2. Access Key ID: the access key ID for the user created earlier.
  3. Secret Key: the secret access key for the user created earlier.
  4. Table filter pattern:
    • You can use a regex to filter the tables you would like to ingest (or not to ingest). For example, user.* will ingest all tables starting with user.

Remove Access

Delete User

You will need permissions to delete a user. Having the IAMFullAccess or AdministratorAccess permission policy attached to your user will suffice.

Go to the IAM Users page in the AWS console. Select the user created for the DynamoDB ingestion and delete it.

Alternatively, the AWS CLI can be used:

aws iam delete-user --user-name terratrue_scanner
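
Note that the CLI will refuse to delete a user that still has access keys or attached policies, so you may need to remove those first; a sketch:

# List and delete the user’s access keys
aws iam list-access-keys --user-name terratrue_scanner
aws iam delete-access-key --user-name terratrue_scanner --access-key-id <ACCESS_KEY_ID>
# Detach the policy, then delete the user
aws iam detach-user-policy --user-name terratrue_scanner \
  --policy-arn arn:aws:iam::aws:policy/AmazonDynamoDBReadOnlyAccess
aws iam delete-user --user-name terratrue_scanner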

Delete Secrets

Go to the Secrets tab of the ingestions page and delete the Secret Access Key secret you have created for the user.

Delete the Ingestions

Go to the ingestions page, click on the vertical ellipsis button next to an ingestion that you want to delete and then click “Delete Ingestion”.

Glue

Configure Data Source

Prerequisite

  1. You will need permissions to create IAM policies, users, and access keys in your AWS account. Having the IAMFullAccess or AdministratorAccess permission policy attached to your user will suffice.

Create User and Grant Access

  1. Create an IAM policy with the glue:GetDatabases and glue:GetTables permissions (a CLI sketch follows this list).
  2. Create an IAM user with the newly created policy attached.
  3. Click on the Security Credentials tab for the newly created user and create an access key.
  4. When the access key has been created, you can click “Show” to reveal the Secret Access Key. You will need the Access Key ID and Secret Access Key during the ingestion configuration process.
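
As a sketch, the policy and user from steps 1 and 2 can also be created with the AWS CLI; the policy and user names below are only examples:

# Create a policy that allows reading Glue database and table metadata
aws iam create-policy --policy-name terratrue-glue-scanner \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": ["glue:GetDatabases", "glue:GetTables"],
      "Resource": "*"
    }]
  }'
# Create the user and attach the policy (replace the account ID placeholder)
aws iam create-user --user-name terratrue_glue_scanner
aws iam attach-user-policy --user-name terratrue_glue_scanner \
  --policy-arn arn:aws:iam::<AWS_ACCOUNT_ID>:policy/terratrue-glue-scanner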

Configure Ingestion

  1. Create a secret for the Access Key ID and Secret Access Key under the Secrets tab. You will refer to the secrets’ names later.
  2. AWS Region: the AWS region that the Glue databases are in.
  3. AWS Access Key ID: The access key ID for the user created earlier.
  4. AWS Secret Access Key: The secret access key for the user created earlier.
  5. Instance ID: a globally unique identifier for the Glue instance. 
  6. Table filter pattern:
    • Has the format of database.table
    • For example, db_x\..* matches all tables in db_x

Remove Access

Delete User

You will need permissions to delete a user. Having the IAMFullAccess or AdministratorAccess permission policy attached to your user will suffice.

Go to the IAM Users page in the AWS console. Select the user created for the Glue ingestion and delete it.

Alternatively, the AWS CLI can be used:

aws iam delete-user --user-name <user_name>

Delete Secrets

Go to the Secrets tab of the ingestions page, delete the AWS Access Key ID and AWS Secret Access Key secrets you have created for the user.

Delete the Glue Ingestions

Go to the Ingestions page, click on the vertical ellipsis button next to an ingestion that you want to delete and then click “Delete Ingestion”.

Hive (Azure HDInsight)

Configure Data Source

A. Prerequisites

  1. To create a user and grant permissions, you must be an administrator of the HDInsight cluster

B. Create User and Grant Access

  1. In the cluster, use the Ambari management page to add a new user terratrue_user through the portal. The username and password will be used later when configuring the ingestion
  2. The user needs Cluster User access
  3. If you have configured the Hive cluster with ESP (Enterprise Security Package) and want granular control over which tables are ingested, you can create policies in the Apache Ranger Admin UI (see the Apache Ranger documentation) and attach them to the created user

Configure Ingestion

  1. Create a secret for the Azure HDInsight user password under the Secrets tab. You will refer to the secret’s name later.
  2. Hosting Environment: only HDInsight is supported right now.
  3. Instance ID: a globally unique identifier for the Hive instance.
  4. HostPort: this is the host and port for your cluster. We’ll use port 443 to connect to the Azure HDInsight cluster using TLS. You can find the host in your Azure HDInsight cluster portal or in your cluster URL; the format should be <cluster_name>.azurehdinsight.net:443 (see the optional connection test after this list).
  5. Username: this is the username you created from Ambari management. The default is terratrue_user.
  6. Password: this is the password secret reference.
  7. Table filter pattern:
    • Has the format of database.table
    • For example, default\..* matches all tables in the database default
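
Before saving the ingestion, you can optionally sanity-check the host, port, and credentials by connecting with Beeline over HTTPS, assuming Beeline is installed locally. The connection string follows the standard HDInsight JDBC format:

beeline -u 'jdbc:hive2://<cluster_name>.azurehdinsight.net:443/default;ssl=true;transportMode=http;httpPath=/hive2' \
  -n terratrue_user -p '<your-password>' -e 'show databases;'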

Remove Access

Delete User

Go to the cluster; from the user menu in the upper right corner, go to Manage Ambari -> Users and delete the user you created for the ingestion.

Delete Secrets

Go to the Secrets tab of the ingestions page and delete the password secret you created for the Hive ingestion.

Delete the Ingestions

Go to the ingestions page, click on the vertical ellipsis button next to an ingestion that you want to delete and then click “Delete Ingestion”.

Kafka (Confluent Cloud)

Configure Data Source

A. Prerequisites

  1. For creating cluster API keys and secrets, you need to have the CloudClusterAdmin role in the cluster.
  2. For creating schema registry API keys and secrets, you need to have the DeveloperManage or ResourceOwner role.

B. Create User and Grant Access

  1. In Confluent Cloud, go to the cluster you would like to ingest.
  2. Create an API key from Cluster Overview -> API Keys; the scope should be Granular access.
  3. Create a new service account and configure ACLs for this account. We’ll need the following ACL associated with the key:
    • Topic Name = *
    • Permission = ALLOW
    • Operation = DESCRIBE
    • Pattern Type = LITERAL

You can configure the ACL for the service account by topic name in the Confluent Cloud UI. Alternatively, you can use the following commands in the Confluent CLI to get the service account ID and grant access to all topics:

confluent api-key list
confluent kafka acl create --allow --service-account <service_account_id> --operations describe --topic "*"
  4. An API credential file should be downloaded to your local machine; you’ll need it later when configuring the ingestion (an optional way to verify these credentials is shown after this list).
  5. You can retrieve the bootstrap server endpoint from Cluster Overview -> Cluster Settings.
  6. For the schema registry, go to the Confluent Cloud homepage -> Environments; on the right side of the page you should see the schema registry endpoint under Stream Governance API. You’ll also need to create an API key for accessing the schema registry: click Credentials -> View & manage to add an API key. This doesn’t need any further configuration; once the key is created, the schema registry credential file should be downloaded to your local machine.
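
To verify that the cluster API key can describe topics before configuring the ingestion, one option is to list the cluster metadata with kcat (formerly kafkacat), if you have it installed. The key and secret below come from the credential file you downloaded:

kcat -L -b <bootstrap_host:port> \
  -X security.protocol=SASL_SSL \
  -X sasl.mechanisms=PLAIN \
  -X sasl.username=<api_key> \
  -X sasl.password=<api_secret>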

Configure Ingestion

  1. Create a secret for the Kafka API secret under the Secrets tab. You will refer to the secret’s name later.
  2. If you have created schema registry credentials, also create secrets for the schema registry API key and API secret under the Secrets tab. You will refer to the secrets’ names later.
  3. Instance ID: a globally unique identifier for the Kafka instance.
  4. Bootstrap server host/port: the bootstrap server host and port.
  5. SASL mechanism: use PLAIN for Confluent Cloud.
  6. SASL username: this is the API key of the service account you created.
  7. SASL password: this is the API secret reference.
  8. Schema registry URL: this is the endpoint of the schema registry.
  9. Schema registry API key: this is the API key secret reference of the schema registry.
  10. Schema registry API secret: this is the API secret reference of the schema registry.
  11. Topic filter pattern:
    • You can use a regex to filter the topics you would like to ingest (or not to ingest). For example, user.* will ingest all topics starting with user.

Remove Access

Delete User

Go to the Cluster -> Cluster Overview -> API Keys, select the API key you created for Kafka ingestion, and Delete API key.

If you have a schema registry configured, go to the Confluent Cloud homepage -> Environments; on the right side of the page you’ll see the schema registry information. Click Credentials -> View & manage, select the API key you created for the Kafka ingestion, and delete the key.

Delete Secrets

Go to the Secrets tab of the ingestions page and delete the API secrets you created for the Kafka ingestion.

Delete the Ingestions

Go to the ingestions page, click on the vertical ellipsis button next to an ingestion that you want to delete and then click “Delete Ingestion”.

MySQL

Configure Data Source

A. Create User and Grant Access

    1. You must log in to your MySQL instance as an admin
    2. Create a new user
CREATE USER 'terratrue_user'@'%' IDENTIFIED BY '<your-password>';
    3. Grant the SELECT privilege to the new user, for either:
      • All databases. This will allow ingestion from future databases.
GRANT SELECT on *.* TO 'terratrue_user'@'%';
      • One or more databases. Run the statement below for each database.
GRANT SELECT on <DATABASE_NAME>.* TO 'terratrue_user'@'%';
    4. Even though the ingestion doesn’t read the data in the tables, it requires the SELECT privilege because MySQL doesn’t have a separate privilege for reading the structure of the tables. (A quick way to verify the grants is shown after this list.)
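
As a quick check, you can connect as the new user with the mysql client and list its grants; host and port are whatever your instance uses:

mysql -h <HOST> -P <PORT> -u terratrue_user -p \
  -e "SHOW GRANTS FOR CURRENT_USER();"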

B. Update Firewall Rules

In order for TerraTrue to connect to your instance, you need to make sure your instance is publicly accessible.

If you have set up a firewall in front of the instance, you also need to update your firewall rules to allow inbound traffic from TerraTrue. As part of Data Catalog onboarding, TerraTrue provisions a list of IP addresses for your organization. Please get the list of IP addresses from the Data Catalog Settings page and then update your firewall rules.

C. SSL Connection

AWS RDS

If your MySQL instance is in AWS RDS, then we will use the AWS global certificate bundle to establish an SSL connection to your instance. You don’t need to provide any certificate to us unless your MySQL instance uses your own certificate.

GCP CloudSQL

If your MySQL instance is in GCP CloudSQL, you need to provision a client certificate and provide us with the server CA certificate, client certificate, and client key during the ingestion configuration. Please refer to the following GCP documentation:

https://cloud.google.com/sql/docs/postgres/configure-ssl-instance#new-client

Self Hosted

If you self host your MySQL instance, then you need to provide us with the server CA certificate during ingestion configuration so that we can establish an SSL connection to your instance.
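
If you are unsure which CA certificate your self-hosted instance presents, one way to inspect the server’s certificate chain is with openssl (the mysql STARTTLS mode requires OpenSSL 1.1.1 or newer). This is a diagnostic sketch, not a required step:

openssl s_client -connect <HOST>:3306 -starttls mysql -showcerts </dev/null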

Configure Ingestion

  1. Create a secret for the password under the Secrets tab. You will refer to the secret’s name later.
  2. Hosting environment: where your instance is hosted, can be AWS RDS, GCP CloudSQL, or self hosted.
  3. Host and port: the IP or URL and port of your instance
  4. Username/password: the default username is terratrue_user
  5. SSL certificates:
    • If your hosting environment is AWS RDS and your instance uses the AWS managed certificate, then you don’t need to upload any certificates. Otherwise, please upload the server CA certificate.
    • If your hosting environment is GCP CloudSQL, then you need to provision a client certificate and upload the server CA certificate, client certificate, and client key.
    • If you self host your instance, then you need to upload the server CA certificate.
  6. Instance ID: a globally unique identifier for the MySQL instance. 
    • If your hosting environment is AWS RDS, we recommend using the ARN of your instance. Alternatively, you can use the format of <AWS_ACCOUNT_ID>-<REGION>-<RDS_INSTANCE_ID>
    • If your hosting environment is GCP CloudSQL, we recommend using the connection name of your instance. Alternatively, you can use the format of <GCP_PROJECT_ID>-<CLOUDSQL_INSTANCE_ID>
  7. Table filter pattern:
    • Has the format of database.table
    • For example, db_x\..* matches all tables in db_x
    • We automatically exclude these databases created by MySQL: information_schema, mysql, performance_schema, sys

Remove Access

Remove User

You must log in to your MySQL instance as an admin. Run the following command:

DROP USER IF EXISTS 'terratrue_user'@'%';

Update Firewall Rules

Remove the IP addresses from your MySQL instance’s inbound firewall rules.

Delete Secrets

Go to the Secrets tab of the ingestions page and delete the secret you have created for the user terratrue_user.

Delete the MySQL Ingestions

Go to the ingestions page, click on the vertical ellipsis button next to an ingestion that you want to delete and then click “Delete Ingestion”. This also deletes any certificates you have uploaded.

Revoke Client Certificates

If you have configured the ingestion with a client certificate and key, then you can also revoke the client certificate in your hosting environment (for example, in the GCP CloudSQL console).

Postgres

Configure Data Source

A. Create User and Grant Access

  1. You must log in to your Postgres instance as a superuser
  2. Create a new user
CREATE USER terratrue_user WITH PASSWORD '<your-password>';
  3. Make sure the new user can connect to the databases that you want to ingest
SELECT has_database_privilege('terratrue_user', '<DATABASE_NAME>', 'connect');
  4. If the user cannot connect to a database, then grant the connect privilege to it.
GRANT CONNECT ON DATABASE <DATABASE_NAME> TO terratrue_user;
  5. If you cannot grant the connect privilege for some databases, you need to exclude them during the ingestion configuration as explained below. (An optional connectivity check is shown after this list.)
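
As an optional check, you can confirm that the new user can reach a database over SSL with psql before configuring the ingestion:

psql "host=<HOST> port=<PORT> dbname=<DATABASE_NAME> user=terratrue_user sslmode=require" \
  -c "SELECT current_database();"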

B. Update Firewall Rules

In order for TerraTrue to connect to your instance, you need to make sure your instance is publicly accessible.

If you have set up a firewall in front of the instance, you also need to update your firewall rules to allow inbound traffic from TerraTrue. As part of Data Catalog onboarding, TerraTrue provisions a list of IP addresses for your organization. Please get the list of IP addresses from the Data Catalog Settings page and then update your firewall rules.

C. SSL Connection

AWS RDS

If your Postgres instance is in AWS RDS, then we will use the AWS global certificate bundle to establish an SSL connection to your instance. You don’t need to provide any certificate to us unless your Postgres instance uses your own certificate.

GCP CloudSQL

If your Postgres instance is in GCP CloudSQL, you need to provision a client certificate and provide us with the server CA certificate, client certificate, and client key during the ingestion configuration. Please refer to the following GCP documentation:

https://cloud.google.com/sql/docs/postgres/configure-ssl-instance#new-client

Self Hosted

If you self host your Postgres instance, then you need to provide us with the server CA certificate during ingestion configuration so that we can establish an SSL connection to your instance.

Configure Ingestion

  1. Create a secret for the password under the Secrets tab. You will refer to the secret’s name later.
  2. Hosting environment: where your instance is hosted, can be AWS RDS, GCP CloudSQL, or self hosted.
  3. Host and port: the IP or URL and port of your instance
  4. Username/password: the default username is terratrue_user
  5. SSL certificates:
    • If your hosting environment is AWS RDS and your instance uses the AWS managed certificate, then you don’t need to upload any certificates. Otherwise, please upload the server CA certificate.
    • If your hosting environment is GCP CloudSQL, then you need to provision a client certificate and upload the server CA certificate, client certificate, and client key.
    • If you self host your instance, then you need to upload the server CA certificate.
  6. Instance ID: a globally unique identifier for the Postgres instance. 
    • If your hosting environment is AWS RDS, we recommend using the ARN of your instance. Alternatively, you can use the format of <AWS_ACCOUNT_ID>-<REGION>-<RDS_INSTANCE_ID>
    • If your hosting environment is GCP CloudSQL, we recommend using the connection name of your instance. Alternatively, you can use the format of <GCP_PROJECT_ID>-<CLOUDSQL_INSTANCE_ID>
  7. Database filters: you can specify a list of allow and deny regular expression patterns to restrict the ingestion to databases that you care about. 
    • For example, the pattern database_x$ matches database_x
    • We automatically exclude the rdsadmin database for AWS RDS and cloudsqladmin database for GCP CloudSQL
  8. Table filter pattern:
    • Has the format of database.schema.table
    • For example, db_x\.schema_y\..* matches all tables in schema_y in db_x

Remove Access

Remove User

You must log in to your Postgres instance as a superuser. Run the following command:

DROP USER IF EXISTS terratrue_user;

Update Firewall Rules

Remove the IP addresses from your Postgres instance’s inbound firewall rules.

Delete Secrets

Go to the Secrets tab of the ingestions page and delete the secret you have created for the user terratrue_user.

Delete the Postgres Ingestions

Go to the ingestions page, click on the vertical ellipsis button next to an ingestion that you want to delete and then click “Delete Ingestion”. This also deletes any certificates you have uploaded.

Revoke Client Certificates

If you have configured the ingestion with a client certificate and key, then you can also revoke the client certificate in your hosting environment (for example, in the GCP CloudSQL console).

Snowflake

The Snowflake ingestion extracts the schemas of the tables in the databases in your Snowflake warehouse. In order for the ingestion to work, you need to create a Snowflake user and grant this user necessary permissions to read the metadata.

Configure Data Source

A. Prerequisites

  1. You must have the ACCOUNTADMIN and ORGADMIN roles.

B. Run the following script from your Snowflake account 

  1. Make sure you replace the <your-warehouse>, <your-database>, and <your-password> placeholders with the actual values in your environment.
  2. In step 3, make sure you run the commands for each database that you want to ingest.
// Step 1: create a role
create or replace role terratrue_role;

// Step 2: Grant access to a warehouse to run queries to view metadata
grant operate, usage on warehouse "<your-warehouse>" to role terratrue_role;

// Step 3: For each of the databases that you want to ingest from,
// 1. Grant access to view the database and its child schemas
// 2. Grant references privileges to your tables

grant usage on DATABASE "<your-database>" to role terratrue_role;
grant usage on all schemas in database "<your-database>" to role terratrue_role;
grant usage on future schemas in database "<your-database>" to role terratrue_role;

grant references on all tables in database "<your-database>" to role terratrue_role;
grant references on future tables in database "<your-database>" to role terratrue_role;
grant references on all external tables in database "<your-database>" to role terratrue_role;
grant references on future external tables in database "<your-database>" to role terratrue_role;

// Step 4: Create a new terratrue_user
create user terratrue_user display_name = 'Terratrue' password='<your-password>' default_role = terratrue_role default_warehouse = '<your-warehouse>';

// Step 5: Grant the terratrue_role to the new terratrue_user.
grant role terratrue_role to user terratrue_user;
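
After running the script, you can optionally verify the new user, role, and grants with SnowSQL (Snowflake’s CLI client), if you have it installed; replace the placeholders as in the script above (you will be prompted for the password):

snowsql -a <ORG_NAME>-<ACCOUNT> -u terratrue_user -r terratrue_role \
  -w '<your-warehouse>' -q 'show schemas in database "<your-database>";'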

The details of each granted privilege can be viewed in the Snowflake documentation. This is a summary of why each privilege is required:

  • operate and usage on warehouse are required to execute queries
  • usage on database and schema are required because without them the tables inside are not accessible. If an admin grants the required privileges on a table but misses the grants on the schema or database in which the table/view exists, then we will not be able to get the metadata for that table/view.
  • references on tables is required for us to read the metadata of the tables

Configure Ingestion

  1. Create a secret for the password under the Secrets tab. You will need to refer to the secret’s name later.
  2. Account ID: your Snowflake account ID, which can be retrieved as follows:
    • USE ROLE ORGADMIN; SHOW ORGANIZATION ACCOUNTS;
    • In the account_url column, the subdomain name of snowflakecomputing.com in the format of <ORG_NAME>-<ACCOUNT> is the account ID that you should enter for the ingestion
  3. Warehouse: the name of the Snowflake warehouse to use
  4. Username/password: the default username is terratrue_user
  5. Role: the default role is terratrue_role
  6. Instance ID: a globally unique identifier for the Snowflake instance. 
    • We recommend using the format <READABLE_NAME>-<ACCOUNT_ID>-<SUFFIX>, where <READABLE_NAME> is an optional readable name of the instance and <SUFFIX> is the name of the database included in this ingestion. You can skip <SUFFIX> if you have granted access to several databases.
    • You will be able to filter the datasets in the data catalog using this instance ID.
  7. Table filter pattern: 
      • Has the format of database.schema.table
      • For example, db_x\.schema_y\..* matches all tables in schema_y in db_x

Remove Access

You must have the ACCOUNTADMIN and ORGADMIN roles.

Run the following script from your Snowflake account

The script below deletes the terratrue_role and terratrue_user from your Snowflake account.

DROP USER IF EXISTS terratrue_user;
DROP ROLE IF EXISTS terratrue_role;

Delete Secrets

Go to the Secrets tab of the ingestions page and delete the secret you have created for the user terratrue_user.

Delete the Snowflake Ingestions

Go to the Ingestions page, click on the vertical ellipsis button next to an ingestion that you want to delete and then click “Delete Ingestion”.

 
