TerraTrue Data Catalog

Background

TerraTrue’s Data Catalog allows you to connect directly with your data sources, scan your data tables, and populate a list of data types in use. We’ll then classify those data types and match them to your TerraTrue taxonomy, giving you a clear view into what data your organization is currently using.

Today, we offer the ability to connect with the following ingestion sources:

Amazon DocumentDB
Amplitude
Athena
BigQuery
Databricks Unity Catalog
Datahub
DynamoDB
Elasticsearch
Glue
Hive (Azure HDInsight)
Kafka (Confluent Cloud)
MLflow
MongoDB (ATLAS)
MySQL
Oracle
Postgres
Redshift
SageMaker (Amazon)
Snowflake
SQL Server

Please reach out to your Customer Success Manager if you’d like to request additional sources or a demo of what exists today.

To visit Data Catalog, click the data warehouse icon on the left-hand side of TerraTrue. You’ll then see up to four options, depending on your permissions*:

Explore
Datasets
Ingestion
Settings

If you’re just getting started with Data Catalog and haven’t scanned anything yet, you can learn how to ingest data by reading our Ingestion Instructions.

*Note - the Data Catalog supports three specific user roles: Data Catalog Admin, Data Catalog Editor, and Data Catalog Viewer. Each role supports a different level of interaction with the Data Catalog. The table below summarizes what each of these roles, along with the TerraTrue Admin and Observer roles, can access:

User Role	Catalog visible in main nav	Explore	Search Datasets	Dataset Schema	Ingestion	Settings
TerraTrue Admin	✅	✅	✅	✅	✅	✅
Data Catalog Admin	✅	✅	✅	✅	✅	✅
Data Catalog Editor	✅	✅	✅	✅	❌	❌
Data Catalog Viewer	✅	✅	✅	✅ **	❌	❌
Observer	❌	❌	❌	❌	❌	❌

** - cannot edit descriptions or data type classifications.

Read more about user permissions here.
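
For readers who prefer to reason about the table programmatically, here is a minimal sketch of the same permission matrix as a lookup. The role names come from the table above; the area keys (nav, explore, and so on) and the helper functions are hypothetical names chosen for illustration, and none of this is TerraTrue's actual implementation or API.

```python
# Illustrative only: mirrors the permissions table above.
# Area keys and function names are assumptions, not a TerraTrue API.
AREAS = ("nav", "explore", "search_datasets", "dataset_schema", "ingestion", "settings")

ROLE_ACCESS = {
    "TerraTrue Admin": set(AREAS),
    "Data Catalog Admin": set(AREAS),
    "Data Catalog Editor": {"nav", "explore", "search_datasets", "dataset_schema"},
    "Data Catalog Viewer": {"nav", "explore", "search_datasets", "dataset_schema"},
    "Observer": set(),
}

# Per the ** footnote: Viewers can open a dataset schema but cannot edit
# descriptions or data type classifications.
READ_ONLY_SCHEMA_ROLES = {"Data Catalog Viewer"}


def can_access(role: str, area: str) -> bool:
    """Return True if the role has a checkmark for the given area."""
    return area in ROLE_ACCESS.get(role, set())


def can_edit_schema(role: str) -> bool:
    """Return True if the role can see the schema and is not read-only."""
    return can_access(role, "dataset_schema") and role not in READ_ONLY_SCHEMA_ROLES


if __name__ == "__main__":
    for role in ROLE_ACCESS:
        print(role, can_access(role, "ingestion"), can_edit_schema(role))
```

Unknown roles fall through to an empty set, so the sketch denies access by default.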

Explore

Explore gives you a bird’s-eye view of the data in your catalog. From here, you can:

Search for a data type, data set, column, or keyword to drill in and find anything specific in your catalog
View your Data Sources to see where you’ve ingested from, and click any of them to view all of its datasets
Click ‘Data Types’ to see all the data types cataloged across your data sources. You can toggle between ‘All’, which shows every data type in your existing Taxonomy, and ‘Matched’, which shows just the data types detected in your datasets. The number in parentheses indicates how many datasets the type was detected in.
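
To make the ‘All’ versus ‘Matched’ toggle and the number in parentheses concrete, here is a small sketch of how such counts could be derived. The function names, the sample taxonomy, and the sample datasets are all hypothetical, and how the product treats unmatched types in the ‘All’ view is an assumption here (they are shown with a count of zero).

```python
from collections import Counter
from typing import Dict, List, Tuple


def matched_type_counts(datasets: Dict[str, List[str]]) -> Counter:
    """Count, for each data type, how many datasets it was detected in."""
    counts: Counter = Counter()
    for detected_types in datasets.values():
        # A type found in several columns of one dataset still counts once.
        counts.update(set(detected_types))
    return counts


def list_data_types(
    taxonomy: List[str], datasets: Dict[str, List[str]], view: str = "All"
) -> List[Tuple[str, int]]:
    """Return (data type, dataset count) pairs for the chosen view."""
    counts = matched_type_counts(datasets)
    if view == "Matched":
        return [(name, counts[name]) for name in taxonomy if counts[name] > 0]
    # Assumption: 'All' lists every taxonomy entry, including unmatched ones.
    return [(name, counts[name]) for name in taxonomy]


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    taxonomy = ["Email Address", "Phone Number", "IP Address"]
    datasets = {
        "users": ["Email Address", "Phone Number", "Email Address"],
        "orders": ["Email Address"],
    }
    for name, count in list_data_types(taxonomy, datasets, view="Matched"):
        print(f"{name} ({count})")
```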

Datasets

Here, you can explore exactly what’s been cataloged in your datasets. Use the filter to drill down by data source, or by data type, to find a dataset you’re looking for.

You can also click into a dataset to see its schema and take a few actions from this screen:

Add a description to the dataset
See the exact column names and string types that have been detected
See the auto-classified data types TerraTrue detected. If a data type is incorrectly classified, simply click on it and either type the name of the correct data type or scroll through the window to select the right one.

Ingestion

On the ingestion page, there are four options:

Sources - view your active ingestion sources, including the source name, ingestion name, number of times it’s been executed, last executed date, and its ingestion status. Click the three-dot menu icon on any source to edit or delete it.
Agents - install software agents in your AWS cloud environment to manage connections to your AWS RDS databases (MySQL, Oracle DB, Postgres, and SQL Server) and AWS DocumentDB. You can use these Agents when setting up ingestions from the five data sources mentioned. Agents are described in more detail in the next section.
Secrets - manage your existing secrets or click ‘Create New’ to add a new one. The use of Secrets is explained in more detail in our Ingestion Instructions.

Finally, you can connect a new source. For detailed instructions on how to ingest from a new source, please visit our Ingestion Instructions.

Once configured, ingestions will run daily between 9 AM and 5 PM EST to ensure you have the most up-to-date information in your catalog.

Agents

Many companies with dedicated network and security teams have policies that disallow direct connections from an externally hosted Data Catalog to their data sources. They strongly prefer that a catalog connect to their data sources only through client software that they can security-review, configure, deploy via their IaC tools (like Terraform), and continuously monitor.
Some companies prefer a client installation that they manage entirely and that connects to their network-partitioned data sources through VPC peering.
Yet other companies have network isolation policies where they do not allocate external IPs for some of their data sources, making connections to these sources from outside the VPC impossible.
Others have a requirement that access credentials can never leave their cloud environment, and can be rotated at will on their end.
Amazon DocumentDB is a popular NoSQL datastore choice on AWS that can only be accessed from either the same VPC it is deployed in, or through a peered VPC.

To address each of these needs, TerraTrue developed a client agent for Data Catalog. You have the option to install these agents in your cloud environment and mediate access to supported data sources.

The agent runs in its own Kubernetes cluster and can be configured to read specific access credentials from your AWS Secrets Manager (so the credentials never leave your environment). The agent can be installed in the same VPC as the data source(s) of interest or in a peered VPC. The agent can scan and ingest data from one or more datastores without sensitive data ever leaving your cloud environment. Only metadata is sent to a backend TerraTrue Data Catalog service dedicated to your organization. You can even choose to lock down inbound connections to the agent altogether (using the self-managed option described below).
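
The two ideas in the paragraph above, credentials read locally from AWS Secrets Manager and only metadata leaving your environment, can be sketched roughly as follows. This is not the agent's actual code. It assumes a client that behaves like boto3's Secrets Manager client (whose get_secret_value call returns a SecretString field), a secret stored as JSON with username and password keys, and a hypothetical payload shape; a fake client stands in for AWS so the sketch runs anywhere.

```python
import json
from typing import Any, Dict


def load_credentials(secrets_client: Any, secret_id: str) -> Dict[str, str]:
    """Read database credentials from AWS Secrets Manager.

    `secrets_client` is expected to behave like boto3.client("secretsmanager").
    Assumption: the secret is a JSON string with username/password keys.
    """
    response = secrets_client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])


def build_metadata_payload(source_name: str, tables: Dict[str, Dict[str, str]]) -> Dict[str, Any]:
    """Keep only dataset names, column names, and column types.

    No row values and no credentials are included (hypothetical payload shape).
    """
    return {
        "source": source_name,
        "datasets": [
            {
                "name": table,
                "columns": [{"name": col, "type": col_type} for col, col_type in cols.items()],
            }
            for table, cols in tables.items()
        ],
    }


class FakeSecretsClient:
    """Stand-in for the real AWS client so this sketch runs without AWS access."""

    def get_secret_value(self, SecretId: str) -> Dict[str, str]:
        return {"SecretString": json.dumps({"username": "scanner", "password": "example-only"})}


if __name__ == "__main__":
    creds = load_credentials(FakeSecretsClient(), "example/catalog-agent")  # hypothetical secret name
    # The credentials would be used locally to connect; only the payload is sent out.
    payload = build_metadata_payload("orders-db", {"users": {"email": "varchar", "id": "int"}})
    print(json.dumps(payload, indent=2))
```

Passing the client in as a parameter keeps the credential lookup testable and makes it explicit that the secret is fetched inside your own environment.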

We support two flavors for these Data Catalog agents:

A TerraTrue-managed agent, if you have minimal operational expertise with Kubernetes and prefer that we manage the cluster installation.
A self-managed agent, where you fully manage a dedicated agent cluster.

Detailed installation instructions for both types of agents are available here.

The Data Catalog dashboard views and the accompanying rules-engine-based triggers for launches are fully agnostic to how the metadata is ingested. Agents therefore come into play only when you set up your ingestions.

Settings

You can manage two settings on this page:

The email for your Google Cloud Service Account, which is required to set up ingestion from BigQuery.
The Inbound IP Addresses you'll use to complete ingestions.

More information about using either of these settings can be found in our Ingestion Instructions.

Reporting

In Privacy Central, you can display information on data types collected from launches and from Data Catalog side by side. First, navigate to Privacy Central and select Data Types. From there, you can view the number of data sources and instances for each data type on the right, and in aggregate in the bottom left.

To dive deeper into a specific data type, select it from the table. You can then view it in Data Catalog and, if that data type triggered a launch, view that launch in the launchpad.