Overview
This article outlines in detail how to configure ingestions from any of our provided data sources. These steps should be completed by an IT admin, or an equivalent role that has permissions to connect to your backend persistent storage systems.
Four steps are needed to create a new ingestion for any data source:
- Select a data source. Currently, we support BigQuery, Glue, MySQL, Postgres, and Snowflake.
- Choose a name for the ingestion. This name must be unique and fewer than 50 characters.
- Configure the data source. You must perform these actions in your data source to allow TerraTrue to connect to it. The actions vary by data source and usually consist of the following:
- Create a new user in your data source
- Grant necessary permissions to the new user so that TerraTrue can read the metadata of the tables in your data source
- For MySQL and Postgres, configure your data source’s firewall rule to allow inbound connections from the TerraTrue IP addresses provisioned for your organization
- Configure the ingestion. You must perform this step in the data catalog ingestion creation flow in TerraTrue. This varies by data source and usually consists of the following:
- Entering credentials:
- Enter the user you created in your data source in the previous step
- For MySQL and Postgres, upload the applicable SSL certificates
- Using table filters: optionally, you can specify a list of “allow” and “deny” patterns to restrict the ingestion to only the tables that you care about. A table filter pattern is a case-insensitive regular expression, and its format varies by data source.
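Because table filter patterns are case-insensitive regular expressions, you can dry-run a candidate pattern against a list of table names before saving the ingestion. A rough sketch using grep (the table names and patterns below are made up for illustration):

```shell
# Approximate the ingestion's case-insensitive regex filters with grep -E.
# The allow pattern keeps matching tables; the deny pattern then removes
# unwanted matches from the allowed set.
printf '%s\n' "Sales.Public.Orders" "sales.public.orders_tmp" "hr.public.salaries" \
  | grep -Ei 'sales\.public\..*' \
  | grep -Eiv '_tmp$'
```

Only Sales.Public.Orders survives: the allow pattern drops the hr table, and the deny pattern drops the _tmp table.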
In the remainder of this document, we detail the specifics for each of the data sources we currently support.
BigQuery
The BigQuery ingestion extracts the schemas of the tables in BigQuery in one GCP project.
Configure Data Source
A. Prerequisites
- You must have the resourcemanager.projects.setIamPolicy permission in the GCP project or organization that you wish to ingest BigQuery from.
- The following GCP roles include this permission, so make sure your account has one of them:
- Owner (roles/owner)
- Security Admin (roles/iam.securityAdmin)
- Folder Admin (roles/resourcemanager.folderAdmin)
- Organization Administrator (roles/resourcemanager.organizationAdmin)
- Project IAM Admin (roles/resourcemanager.projectIamAdmin)
B. Provision a Service Account
In order for us to connect to BigQuery, TerraTrue needs to provision a service account for your organization. Go to the Data Catalog Settings page.
- If this is the first time you set up a BigQuery ingestion, toggle “Provision Service Account” in the “Configure Ingestion” step to on and wait until it finishes. The email of the provisioned service account should be displayed.
- If you have already configured another BigQuery ingestion in TerraTrue before, then the provisioned service account from earlier will be displayed.
C. Grant the Service Account Permissions
Run the following command in a terminal to grant the provisioned service account the BigQuery Metadata Viewer role. TerraTrue needs this role to read the schemas of your BigQuery tables. It can take up to seven minutes for the permission change to take effect.
gcloud projects add-iam-policy-binding "<GCP_PROJECT_ID>" \
--member="serviceAccount:<PROVISIONED_SERVICE_ACCOUNT>" \
--role="roles/bigquery.metadataViewer" \
--condition=None
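After a few minutes, you can optionally confirm that the binding took effect (this assumes you are authenticated in gcloud with a role that can view the project’s IAM policy):

```shell
gcloud projects get-iam-policy "<GCP_PROJECT_ID>" \
    --flatten="bindings[].members" \
    --filter="bindings.role:roles/bigquery.metadataViewer" \
    --format="value(bindings.members)"
```

The provisioned service account’s email should appear in the output once the change has propagated.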
Configure Ingestion
- Project ID: the ID of the GCP project that you want to ingest BigQuery from. The ingestion only extracts metadata from datasets in this project. You need to create an ingestion for each GCP project that you want to ingest from.
- Table filter pattern:
- Has the format project_id.dataset_id.table_id
- For example, project_x\.dataset_y\..* matches all tables in dataset_y in project_x
Remove Access
Please refer to the “Configure Data Source” subsection for the prerequisites.
Find the Provisioned Service Account
For any of the BigQuery ingestions that you have created, click the vertical ellipsis button and then the “Edit Ingestion” button. The service account is in the text box labeled “Service Account Email”.
Revoke the Service Account’s Permissions
Run the following command in the terminal to revoke access granted to the provisioned service account for a GCP project. If you have granted access to more than one GCP project, you need to do this for each project:
gcloud projects remove-iam-policy-binding "<GCP_PROJECT_ID>" \
--member="serviceAccount:<PROVISIONED_SERVICE_ACCOUNT>" \
--role="roles/bigquery.metadataViewer"
Delete the BigQuery Ingestions
Go to the ingestions page, click on the vertical ellipsis button next to an ingestion that you want to delete and then click “Delete Ingestion”.
Snowflake
The Snowflake ingestion extracts the schemas of the tables in the databases in your Snowflake warehouse. In order for the ingestion to work, you need to create a Snowflake user and grant this user necessary permissions to read the metadata.
Configure Data Source
A. Prerequisites
- You must have the ACCOUNTADMIN and ORGADMIN roles.
B. Run the following script from your Snowflake account
- Make sure you replace the <your-warehouse>, <your-database>, and <your-password> placeholders with the actual values in your environment.
- In step 3, make sure you run the commands for each database that you want to ingest.
// Step 1: create a role
create or replace role terratrue_role;
// Step 2: Grant access to a warehouse to run queries to view metadata
grant operate, usage on warehouse "<your-warehouse>" to role terratrue_role;
// Step 3: For each of the databases that you want to ingest from,
// 1. Grant access to view the table and its child schemas
// 2. Grant references privileges to your tables
grant usage on DATABASE "<your-database>" to role terratrue_role;
grant usage on all schemas in database "<your-database>" to role terratrue_role;
grant usage on future schemas in database "<your-database>" to role terratrue_role;
grant references on all tables in database "<your-database>" to role terratrue_role;
grant references on future tables in database "<your-database>" to role terratrue_role;
grant references on all external tables in database "<your-database>" to role terratrue_role;
grant references on future external tables in database "<your-database>" to role terratrue_role;
// Step 4: Create a new terratrue_user
create user terratrue_user display_name = 'Terratrue' password='<your-password>' default_role = terratrue_role default_warehouse = '<your-warehouse>';
// Step 5: Grant the terratrue_role to the new terratrue_user.
grant role terratrue_role to user terratrue_user;
The details of each granted privilege can be viewed in these Snowflake docs. This is a summary of why each privilege is required:
- operate and usage on the warehouse are required to execute queries
- usage on the database and its schemas is required because without it the tables inside them are not accessible. If an admin grants the required privileges on a table but misses the grants on the schema or database in which the table/view exists, we will not be able to get metadata for that table/view.
- references on tables is required for us to read the metadata of the tables
Configure Ingestion
- Create a secret for the password under the Secrets tab. You will need to refer to the secret’s name later.
- Account ID: your Snowflake account ID, which can be retrieved as follows:
- Run the following in Snowflake:
USE ROLE ORGADMIN; SHOW ORGANIZATION ACCOUNTS;
- In the account_url column, the subdomain of snowflakecomputing.com, in the format <ORG_NAME>-<ACCOUNT>, is the account ID that you should enter for the ingestion
- Warehouse: the name of the Snowflake warehouse to use
- Username/password: the default user name is terratrue_user
- Role: the default role is terratrue_role
- Instance ID: a globally unique identifier for the Snowflake instance.
- We recommend the format <READABLE_NAME>-<ACCOUNT_ID>-<SUFFIX>, where <READABLE_NAME> is an optional readable name for the instance and <SUFFIX> is the name of the database included in this ingestion. You can skip <SUFFIX> if you have granted access to several databases.
- You will be able to filter the datasets in the data catalog using this instance ID.
- Table filter pattern:
- Has the format database.schema.table
- For example, db_x\.schema_y\..* matches all tables in schema_y in db_x
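If you have the account_url value handy, a small shell one-liner can extract the <ORG_NAME>-<ACCOUNT> subdomain for you. A sketch, using a made-up account_url:

```shell
# Extract the <ORG_NAME>-<ACCOUNT> account ID from a Snowflake account_url
# (the URL below is a made-up example).
account_url="https://acme-prod1.snowflakecomputing.com"
printf '%s\n' "$account_url" \
  | sed -E 's#^https?://([^.]+)\.snowflakecomputing\.com/?$#\1#'
```

This prints acme-prod1, which is the value to enter as the account ID.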
Remove Access
You must have the ACCOUNTADMIN and ORGADMIN roles.
Run the following script from your Snowflake account
The script below deletes the terratrue_role and terratrue_user from your Snowflake account.
DROP USER IF EXISTS terratrue_user;
DROP ROLE IF EXISTS terratrue_role;
Delete Secrets
Go to the Secrets tab of the ingestions page and delete the secret you created for the user terratrue_user.
Delete the Snowflake Ingestions
Go to the Ingestions page, click on the vertical ellipsis button next to an ingestion that you want to delete and then click “Delete Ingestion”.
Postgres
Configure Data Source
A. Create User and Grant Access
- You must log in to your Postgres instance as a superuser
- Create a new user
CREATE USER terratrue_user WITH PASSWORD '<your-password>';
- Make sure the new user can connect to the databases that you want to ingest
SELECT has_database_privilege('terratrue_user', '<DATABASE_NAME>', 'connect');
- If the user cannot connect to a database, then grant the connect privilege to it.
GRANT CONNECT ON DATABASE <DATABASE_NAME> TO terratrue_user;
- If you cannot grant the connect privilege to any databases, you need to exclude them during the ingestion configuration as explained below.
B. Update Firewall Rules
In order for TerraTrue to connect to your instance, you need to make sure your instance is publicly accessible.
If you have set up a firewall in front of the instance, you also need to update your firewall rules to allow inbound traffic from TerraTrue. As part of Data Catalog onboarding, TerraTrue provisions a list of IP addresses for your organization. Please get the list of IP addresses from the Data Catalog Settings page and then update your firewall rules.
C. SSL Connection
AWS RDS
If your Postgres instance is in AWS RDS, then we will use the AWS global certificate bundle to establish SSL connection to your instance. You don’t need to provide any certificate to us unless your Postgres instance uses your own certificate.
GCP CloudSQL
If your Postgres instance is in GCP CloudSQL, you need to provision a client certificate and provide us the server CA certificate, client certificate, and client key during the ingestion configuration. Please refer to the following GCP documentation.
https://cloud.google.com/sql/docs/postgres/configure-ssl-instance#new-client
Self Hosted
If you self host your Postgres instance, then you need to provide us with the server CA certificate during ingestion configuration so that we can establish SSL connection to your instance.
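Before uploading a server CA certificate, it can help to sanity-check its subject and expiry with openssl. The snippet below generates a throwaway self-signed certificate purely so the commands are runnable; in practice, point openssl x509 at your real server CA file:

```shell
# Generate a throwaway self-signed cert for demonstration only.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=example-postgres" \
  -keyout /tmp/tt-demo.key -out /tmp/tt-server-ca.pem

# Inspect the certificate's subject and expiry before uploading it.
openssl x509 -in /tmp/tt-server-ca.pem -noout -subject -enddate
```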
Configure Ingestion
- Create a secret for the password under the Secrets tab. You will refer to the secret’s name later.
- Hosting environment: where your instance is hosted; can be AWS RDS, GCP CloudSQL, or self-hosted.
- Host and port: the IP or URL and port of your instance
- Username/password: the default user name is terratrue_user
- SSL certificates:
- If your hosting environment is AWS RDS and your instance uses the AWS managed certificate, then you don’t need to upload any certificates. Otherwise, please upload the server CA certificate.
- If your hosting environment is GCP CloudSQL, then you need to provision a client certificate and upload the server CA certificate, client certificate, and client key.
- If you self host your instance, then you need to upload the server CA certificate.
- Instance ID: a globally unique identifier for the Postgres instance.
- If your hosting environment is AWS RDS, we recommend using the ARN of your instance. Alternatively, you can use the format <AWS_ACCOUNT_ID>-<REGION>-<RDS_INSTANCE_ID>
- If your hosting environment is GCP CloudSQL, we recommend using the connection name of your instance. Alternatively, you can use the format <GCP_PROJECT_ID>-<CLOUDSQL_INSTANCE_ID>
- Database filters: you can specify a list of allow and deny regular expression patterns to restrict the ingestion to the databases that you care about.
- For example, the pattern database_x$ matches database_x
- We automatically exclude the rdsadmin database for AWS RDS and the cloudsqladmin database for GCP CloudSQL
- Table filter pattern:
- Has the format database.schema.table
- For example, db_x\.schema_y\..* matches all tables in schema_y in db_x
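Note that the $ anchor in the database filter example matters: without it, the pattern would also match similarly prefixed database names. A quick sketch with made-up database names:

```shell
# The anchored pattern matches only names ending in database_x,
# so database_x_archive is not picked up (names are illustrative).
printf '%s\n' "database_x" "database_x_archive" "rdsadmin" \
  | grep -Ei 'database_x$'
```

Only database_x is printed.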
Remove Access
Remove User
You must log in to your Postgres instance as a superuser. Run the following command:
DROP USER IF EXISTS terratrue_user;
Update Firewall Rules
Remove the IP addresses from your Postgres instance’s inbound firewall rules.
Delete Secrets
Go to the Secrets tab of the ingestions page and delete the secret you created for the user terratrue_user.
Delete the Postgres Ingestions
Go to the ingestions page, click on the vertical ellipsis button next to an ingestion that you want to delete, and then click “Delete Ingestion”. This also deletes any certificates you have uploaded.
Revoke Client Certificates
If you configured the ingestion with a client certificate and key (as with GCP CloudSQL), you can also revoke that client certificate in your hosting environment.
MySQL
Configure Data Source
A. Create User and Grant Access
- You must log in to your MySQL instance as an admin
- Create a new user
CREATE USER 'terratrue_user'@'%' IDENTIFIED BY '<your-password>';
- Grant the SELECT privilege to the new user, on either:
- All databases. This will allow ingestion from future databases.
GRANT SELECT on *.* TO 'terratrue_user'@'%';
- One or more databases. Run the following for each database.
GRANT SELECT on <DATABASE_NAME>.* TO 'terratrue_user'@'%';
- Even though the ingestion doesn’t read the data in the tables, it requires the SELECT privilege because MySQL doesn’t have a separate privilege for reading the structure of the tables.
B. Update Firewall Rules
In order for TerraTrue to connect to your instance, you need to make sure your instance is publicly accessible.
If you have set up a firewall in front of the instance, you also need to update your firewall rules to allow inbound traffic from TerraTrue. As part of Data Catalog onboarding, TerraTrue provisions a list of IP addresses for your organization. Please get the list of IP addresses from the Data Catalog Settings page and then update your firewall rules.
C. SSL Connection
AWS RDS
If your MySQL instance is in AWS RDS, then we will use the AWS global certificate bundle to establish SSL connection to your instance. You don’t need to provide any certificate to us unless your MySQL instance uses your own certificate.
GCP CloudSQL
If your MySQL instance is in GCP CloudSQL, you need to provision a client certificate and provide us the server CA certificate, client certificate, and client key during the ingestion configuration. Please refer to the following GCP documentation.
https://cloud.google.com/sql/docs/mysql/configure-ssl-instance#new-client
Self Hosted
If you self host your MySQL instance, then you need to provide us with the server CA certificate during ingestion configuration so that we can establish SSL connection to your instance.
Configure Ingestion
- Create a secret for the password under the Secrets tab. You will refer to the secret’s name later.
- Hosting environment: where your instance is hosted; can be AWS RDS, GCP CloudSQL, or self-hosted.
- Host and port: the IP or URL and port of your instance
- Username/password: the default user name is terratrue_user
- SSL certificates:
- If your hosting environment is AWS RDS and your instance uses the AWS managed certificate, then you don’t need to upload any certificates. Otherwise, please upload the server CA certificate.
- If your hosting environment is GCP CloudSQL, then you need to provision a client certificate and upload the server CA certificate, client certificate, and client key.
- If you self host your instance, then you need to upload the server CA certificate.
- Instance ID: a globally unique identifier for the MySQL instance.
- If your hosting environment is AWS RDS, we recommend using the ARN of your instance. Alternatively, you can use the format <AWS_ACCOUNT_ID>-<REGION>-<RDS_INSTANCE_ID>
- If your hosting environment is GCP CloudSQL, we recommend using the connection name of your instance. Alternatively, you can use the format <GCP_PROJECT_ID>-<CLOUDSQL_INSTANCE_ID>
- Table filter pattern:
- Has the format database.table
- For example, db_x\..* matches all tables in db_x
- We automatically exclude these databases created by MySQL: information_schema, mysql, performance_schema, sys
Remove Access
Remove User
You must log in to your MySQL instance as an admin. Run the following command:
DROP USER IF EXISTS terratrue_user;
Update Firewall Rules
Remove the IP addresses from your MySQL instance’s inbound firewall rules.
Delete Secrets
Go to the Secrets tab of the ingestions page and delete the secret you created for the user terratrue_user.
Delete the MySQL Ingestions
Go to the ingestions page, click on the vertical ellipsis button next to an ingestion that you want to delete, and then click “Delete Ingestion”. This also deletes any certificates you have uploaded.
Revoke Client Certificates
If you configured the ingestion with a client certificate and key (as with GCP CloudSQL), you can also revoke that client certificate in your hosting environment.
Glue
Configure Data Source
Prerequisite
- You will need permissions to create IAM policies, users, and access keys in your AWS account. Having the IAMFullAccess or AdministratorAccess permission policy attached to your user will suffice.
Create User and Grant Access
- Create an IAM policy with the glue:GetDatabases and glue:GetTables permissions.
- Create an IAM user with the newly created policy attached.
- Click on the Security Credentials tab for the newly created user and create an access key.
- When the access key has been created, you can click “Show” to reveal the Secret Access Key. You will need the Access Key ID and Secret Access Key during the ingestion configuration process.
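A minimal IAM policy document for the policy created above might look like the following. This is a sketch: the file path is illustrative, and you may want to narrow Resource beyond "*" per your own policy standards:

```shell
# Write a minimal policy granting only the two Glue read permissions
# the ingestion needs (path and scope are illustrative).
cat > /tmp/terratrue-glue-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["glue:GetDatabases", "glue:GetTables"],
      "Resource": "*"
    }
  ]
}
EOF

# Validate the JSON before creating the policy from it.
python3 -m json.tool /tmp/terratrue-glue-policy.json
```

You could then create the policy with aws iam create-policy --policy-name <POLICY_NAME> --policy-document file:///tmp/terratrue-glue-policy.json (the policy name is a placeholder) and attach it to the new user.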
Configure Ingestion
- Create a secret for the Access Key ID and Secret Access Key under the Secrets tab. You will refer to the secret’s names later.
- AWS Region: the AWS region that the Glue databases are in.
- AWS Access Key ID: The access key ID for the user created earlier.
- AWS Secret Access Key: The secret access key for the user created earlier.
- Instance ID: a globally unique identifier for the Glue instance.
- Table filter pattern:
- Has the format database.table
- For example, db_x\..* matches all tables in db_x
Remove Access
Delete User
You will need permissions to delete a user. Having the IAMFullAccess or AdministratorAccess permission policy attached to your user will suffice.
Go to the IAM Users page in the AWS console. Select the user created for the Glue ingestion and delete it.
Alternatively, the AWS CLI can be used:
aws iam delete-user --user-name <user_name>
Delete Secrets
Go to the Secrets tab of the ingestions page and delete the AWS Access Key ID and AWS Secret Access Key secrets you created for the user.
Delete the Glue Ingestions
Go to the Ingestions page, click on the vertical ellipsis button next to an ingestion that you want to delete, and then click “Delete Ingestion”.