- Overview
- Prerequisites
- Deployment Steps
- Deployment Validation
- Running the Guidance
- Next Steps
- Cleanup
The Guidance for Data Management on AWS is an opinionated Data Management implementation on AWS.
You are responsible for the cost of the AWS services used while running this Guidance. As of May 2024, the cost for running this Guidance with the default settings in the US West (Oregon) AWS Region is approximately $530 per month, using the following assumptions:
- 10 DataZone users, with metadata storage and requests under the amount included in the per-user cost (DataZone pricing)
- 50 Data Management create asset API requests per month
- Low (< 1k) lineage API/PUT requests per month
- Does not include estimates for existing S3, Glue, and other resources in the customer's Data Management spoke account
A detailed cost breakdown can be found in this shared AWS Pricing Calculator estimate. The detailed estimate does not yet include DataZone pricing, so adding $9/month per user for the assumed 10 users adds $90 per month, which yields the $530/month estimate referenced above.
These deployment instructions are intended for use on macOS. Deployment using a different operating system may require additional steps.
- For this guidance, you will need to have or set up three accounts in the same AWS Organization.
- A management account where IAM Identity Center is enabled.
- A hub account where a DataZone domain will be created. This must be part of the same OU as the spoke account.
- A spoke account where data assets will reside. This must be part of the same OU as the hub account. You may set up multiple spoke accounts.
- All resources in these accounts are assumed to be in the same region unless specified otherwise.
- Create a DataZone Domain
- Enable IAM Identity Center for DataZone
- Request association of the spoke account in the hub account for the DataZone domain and accept the request in the spoke account
- Enable the Data Lake and Data Warehouse blueprints when accepting the request.
- Create roles
- Create a role in the organization’s account where the IAM Identity Center instance resides and save the role ARN for the deployment steps (a CLI sketch follows the policies below).
  - It should have the following trust policy:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "AWS": [ "arn:aws:iam::<HUB_ACCOUNT_ID>:root" ] }, "Action": "sts:AssumeRole", "Condition": {} } ] }
  - It should have the following permissions:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "identitystore:IsMemberInGroups", "identitystore:GetUserId" ], "Resource": "*" } ] }
- In your spoke account, create an IAM role to be used when creating assets in DM. You will pass the role’s Amazon Resource Name (ARN) to DM when you create assets. DM will pass this role to Glue and Glue DataBrew as needed.
  - The role name must be prefixed with `dm-`. This enables the role to be passed by DM.
  - The trust policy is as follows:
{ "Version": "2012-10-17", "Statement": [ { "Sid": "dataBrew", "Effect": "Allow", "Principal": { "Service": "databrew.amazonaws.com" }, "Action": "sts:AssumeRole" }, { "Sid": "glue", "Effect": "Allow", "Principal": { "Service": "glue.amazonaws.com" }, "Action": "sts:AssumeRole" } ] }
  - Add the following policies to the role:
- Generate and upload a certificate
openssl genrsa 2048 > my-aws-private.key
openssl req -new -x509 -nodes -sha1 -days 3650 -extensions v3_ca -key my-aws-private.key > my-aws-public.crt
- Leave all prompts blank, except `Common Name (e.g. server FQDN or YOUR name) []:`, which can be set to `dm.amazonaws.com`.
- Import certificate into the hub account:
aws acm import-certificate --certificate fileb://my-aws-public.crt --private-key fileb://my-aws-private.key --region <REGION> --profile <AWS_PROFILE>
- Note the ARN that is returned. You will need to provide this as `loadBalancerCertificateArn` when deploying the hub infrastructure.
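If you want to capture the returned ARN programmatically, a sketch of the same import command with an output query (region/profile placeholders as above):

```bash
# Import the certificate and capture the returned ARN into a shell variable.
CERT_ARN=$(aws acm import-certificate \
  --certificate fileb://my-aws-public.crt \
  --private-key fileb://my-aws-private.key \
  --region <REGION> --profile <AWS_PROFILE> \
  --query CertificateArn --output text)
echo "$CERT_ARN"
```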
The hub and spoke accounts must be bootstrapped for the AWS Cloud Development Kit.
cdk bootstrap <HUB_ACCOUNT_ID>/<REGION> --profile <HUB_PROFILE>
cdk bootstrap <SPOKE_ACCOUNT_ID>/<REGION> --profile <SPOKE_PROFILE>
- Create your first DataZone project.
- Create DataZone environments.
  - Create a Data Warehouse environment in the newly created project
    - Create a Redshift Serverless data warehouse in the spoke account with the following settings:
      - Customize admin user credentials
      - Manage admin credentials in AWS Secrets Manager
      - Select private subnets
      - Enhanced VPC routing on
    - Go to Secrets Manager and open the secret that was created for your new data warehouse.
    - Add the following tags (a CLI sketch of this step and the VPC endpoint step follows this list):
      - Key: `AmazonDataZoneDomain`, Value: `<DATAZONE_DOMAIN_ID>`
      - Key: `AmazonDataZoneProject`, Value: `<DATAZONE_PROJECT_ID>`
    - Create VPC endpoints in the VPC and private subnets where Redshift is deployed:
      - S3 gateway
      - Glue
      - DataBrew
    - Create a DataZone data warehouse environment for the newly created Redshift resources.
  - Create a Data Lake environment in the newly created project
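The secret tagging and VPC endpoint steps above can also be done from the CLI. A hedged sketch, assuming spoke account credentials; the secret ARN, VPC, subnet, route table, and security group values are placeholders for your environment:

```bash
# Tag the Redshift admin secret with the DataZone domain and project IDs.
aws secretsmanager tag-resource \
  --secret-id <REDSHIFT_ADMIN_SECRET_ARN> \
  --tags Key=AmazonDataZoneDomain,Value=<DATAZONE_DOMAIN_ID> \
         Key=AmazonDataZoneProject,Value=<DATAZONE_PROJECT_ID> \
  --region <REGION> --profile <SPOKE_PROFILE>

# S3 gateway endpoint in the VPC where Redshift is deployed.
aws ec2 create-vpc-endpoint \
  --vpc-id <VPC_ID> --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.<REGION>.s3 \
  --route-table-ids <PRIVATE_ROUTE_TABLE_ID> \
  --region <REGION> --profile <SPOKE_PROFILE>

# Interface endpoints for Glue and DataBrew in the private subnets.
for svc in glue databrew; do
  aws ec2 create-vpc-endpoint \
    --vpc-id <VPC_ID> --vpc-endpoint-type Interface \
    --service-name com.amazonaws.<REGION>.$svc \
    --subnet-ids <PRIVATE_SUBNET_IDS> \
    --security-group-ids <SECURITY_GROUP_ID> \
    --region <REGION> --profile <SPOKE_PROFILE>
done
```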
- Go to AWS Lake Formation in the console of the spoke account.
- Make sure you are a Lake Formation admin, along with the CDK execution role.
  - If this is your first time opening Lake Formation, there will be a prompt to “Add myself”; select that box. Also select the box that says “Add other AWS users or roles”, select the CDK CloudFormation execution role (it will begin with `cdk-<ID>cfn-exec-role`), and click “Get Started”.
  - If you have previously used Lake Formation, add them under the Administrative roles and tasks section in the Administration section.
We will need to deploy the hub stack in three separate steps.
Note: this step only needs to be performed once for the initial deployment
git clone git@github.com:aws-solutions-library-samples/guidance-for-data-management-on-aws.git
cd guidance-for-data-management-on-aws
- Install dependencies
rush update --bypass-policy
- Build
rush build
cd infrastructure/hub
- Export credentials for the hub account and the AWS region to the environment
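For example, a sketch using environment variables (placeholder values; use credentials for the hub account):

```bash
# Placeholder credentials for the hub account; omit AWS_SESSION_TOKEN for long-lived keys.
export AWS_ACCESS_KEY_ID=<HUB_ACCESS_KEY_ID>
export AWS_SECRET_ACCESS_KEY=<HUB_SECRET_ACCESS_KEY>
export AWS_SESSION_TOKEN=<HUB_SESSION_TOKEN>
export AWS_REGION=<REGION>
```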
- Deploy
npm run cdk -- deploy --require-approval never --concurrency=10 -c identityStoreId=<IAM_IDENTITY_CENTER_IDENTITY_STORE_ID> -c identityStoreRoleArn=<IAM_IDENTITY_CENTER_ROLE_ARN> -c identityStoreRegion=<IDENTITY_STORE_REGION> -c ssoRegion=<SSO_REGION> -c orgId=<AWS_ORGANIZATIONS_ORG_ID> -c orgRootId=<AWS_ORGANIZATIONS_ROOT_ID> -c orgOuId=<AWS_ORGANIZATIONS_OU_ID> -c ssoInstanceArn=<AWS_IDENTITY_CENTER_INSTANCE_ARN> -c adminEmail=<ADMIN_EMAIL> --all
Note: this step only needs to be performed once for the initial deployment
1. Open the IAM Identity Center console and then, from the navigation pane, choose Applications.
2. Choose Add application, I have an application I want to set up, and Add custom SAML 2.0 application, and then choose Next.
3. On the Configure application page, enter a Display name and a Description.
4. Copy the URL of the IAM Identity Center SAML metadata file. You use these resources in later steps to create an IdP in a user pool.
5. Under Application metadata, choose Manually type your metadata values. Then provide the following values.
6. Important: Make sure to replace the domain, region, and userPoolId values with information you gather after the CDK deployment.
   - Application Assertion Consumer Service (ACS) URL: `<userPoolDomain>/saml2/idpresponse`
   - Application SAML audience: `urn:amazon:cognito:sp:<userPoolId>`
7. Choose Submit. Then, go to the Details page for the application that you added.
8. Select the Actions dropdown list and choose Edit attribute mappings. Then, provide the following attributes.
   - User attribute in the application: `Subject`
     - Note: Subject is prefilled.
     - Maps to this string value or user attribute in IAM Identity Center: `${user:subject}`
     - Format: `Persistent`
   - User attribute in the application: `email`
     - Maps to this string value or user attribute in IAM Identity Center: `${user:email}`
     - Format: `Basic`
- Redeploy
npm run cdk -- deploy --require-approval never --concurrency=10 -c identityStoreId=<IAM_IDENTITY_CENTER_IDENTITY_STORE_ID> -c identityStoreRoleArn=<IAM_IDENTITY_CENTER_ROLE_ARN> -c identityStoreRegion=<IDENTITY_STORE_REGION> -c ssoRegion=<SSO_REGION> -c orgId=<AWS_ORGANIZATIONS_ORG_ID> -c orgRootId=<AWS_ORGANIZATIONS_ROOT_ID> -c orgOuId=<AWS_ORGANIZATIONS_OU_ID> -c ssoInstanceArn=<AWS_IDENTITY_CENTER_INSTANCE_ARN> -c samlMetaDataUrl=<SAML_METADATA_URL> -c callbackUrls=http://localhost:3000 -c adminEmail=<ADMIN_EMAIL> -c loadBalancerCertificateArn=<LOAD_BALANCER_CERTIFICATE_ARN> --all
cd ../spoke
- Export credentials for the spoke account and the AWS region to the environment (as shown for the hub account above)
npm run cdk -- deploy -c hubAccountId=<HUB_ACCOUNT_ID> -c orgId=<AWS_ORGANIZATIONS_ORG_ID> -c orgRootId=<AWS_ORGANIZATIONS_ROOT_ID> -c orgOuId=<AWS_ORGANIZATIONS_OU_ID> -c deleteBucket=true --require-approval never --concurrency=10 --all
- Open the AWS Lake Formation console in the spoke account.
- Go to the `dm-spoke-<SPOKE_ACCOUNT_ID>-<REGION>` database.
  - Click Edit, uncheck “Use only IAM access control for new tables in this database”, and click Save.
  - Click Actions/permissions/view.
  - Select IAMAllowedPrincipals and click Revoke.
  - Click Actions/permissions/Grant.
- For each of the following, you will need to click through and grant the permissions (a CLI sketch of an equivalent grant follows this list):
  - DataZone database access
    - Principals
      - IAM users and roles: `AmazonDataZoneGlueAccess-<REGION>-<DOMAIN_ID>`
    - LF-Tags or catalog resources
      - Select Named Data Catalog resources
      - Databases: `dm-spoke-<SPOKE_ACCOUNT_ID>-<REGION>`
    - Database permissions
      - Database permissions: `Describe`
      - Grantable permissions: `Describe`
  - DataZone table access
    - Principals
      - IAM users and roles: `AmazonDataZoneGlueAccess-<REGION>-<DOMAIN_ID>`
    - LF-Tags or catalog resources
      - Select Named Data Catalog resources
      - Databases: `dm-spoke-<SPOKE_ACCOUNT_ID>-<REGION>`
      - Tables: All tables
    - Table permissions
      - Table permissions: `Describe`, `Select`
      - Grantable permissions: `Describe`, `Select`
  - Service role database access
    - Principals
      - IAM users and roles: `<SERVICE_ROLE>`
    - LF-Tags or catalog resources
      - Select Named Data Catalog resources
      - Databases: `dm-spoke-<SPOKE_ACCOUNT_ID>-<REGION>`
    - Database permissions
      - Database permissions: `Create table`, `Describe`
      - Grantable permissions: None
  - Service role table access
    - Principals
      - IAM users and roles: `<SERVICE_ROLE>`
    - LF-Tags or catalog resources
      - Select Named Data Catalog resources
      - Databases: `dm-spoke-<SPOKE_ACCOUNT_ID>-<REGION>`
      - Tables: All tables
    - Table permissions
      - Table permissions: `Alter`, `Describe`, `Insert`, `Select`
      - Grantable permissions: None
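These grants can also be made from the CLI. A hedged sketch of the first grant (DataZone database access); the other grants follow the same shape with the principals, resources, and permissions listed above:

```bash
# Equivalent of the "DataZone database access" grant (sketch; adjust account, role, and database names).
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::<SPOKE_ACCOUNT_ID>:role/AmazonDataZoneGlueAccess-<REGION>-<DOMAIN_ID> \
  --resource '{"Database": {"Name": "dm-spoke-<SPOKE_ACCOUNT_ID>-<REGION>"}}' \
  --permissions DESCRIBE \
  --permissions-with-grant-option DESCRIBE \
  --region <REGION> --profile <SPOKE_PROFILE>
```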
- Add the S3 bucket as a data location
  - Go to Data lake locations
  - Click Register location
  - Enter the `dm-spoke-<SPOKE_ACCOUNT_ID>-<REGION>` location
  - Switch Permission mode to Lake Formation
  - Click Register Location
- Grant the service role access to the data location
  - Go to Data locations
  - Click Grant
  - Select the service role created earlier
  - Enter the `dm-spoke-<SPOKE_ACCOUNT_ID>-<REGION>` location for Storage locations
  - Click Grant
- Sign in to DataZone with the user that you will be making API calls with.
  - Create a user in IAM Identity Center in the Management account.
    - Go to IAM Identity Center in the console in the Management account.
    - Click Users.
    - Click Add User and follow the prompts.
    - You should get an email to set up your new user. Follow the prompts in the email to complete setup.
  - Assign the user you created to the custom application.
    - Go to IAM Identity Center in the console in the Management account.
    - Click Applications.
    - Click Customer Managed.
    - Click on the custom application that was created earlier.
    - Click Assign users and groups and assign your user.
  - Sign into DataZone with your SSO user.
    - Go to IAM Identity Center in the Management account.
    - Under Settings summary, open the AWS access portal URL.
    - Sign in with your new user.
    - Click on the DataZone application to open the UI. Click Sign in with SSO. This step is needed to get DataZone to recognize your user ID.
- Check that the following CloudFormation stacks have been successfully created in the spoke account:
  - `dm-spoke-shared`
  - `dm-spoke-dataAsset`
- Check that the following CloudFormation stacks have been successfully created in the hub account:
  - `dm-hub-shared`
  - `dm-hub-CognitoCustomStack`
  - `dm-hub-SsoCustomStack`
  - `dm-hub-datalineage`
  - `dm-hub-dataAsset`
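To check from the CLI instead, a hedged sketch listing the successfully created `dm-` stacks in each account:

```bash
# Run once with spoke account credentials and once with hub account credentials.
aws cloudformation list-stacks \
  --stack-status-filter CREATE_COMPLETE UPDATE_COMPLETE \
  --query "StackSummaries[?starts_with(StackName, 'dm-')].StackName" \
  --region <REGION> --profile <AWS_PROFILE>
```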
The following outlines the steps to create an asset using the Data Asset API. Below are examples using a simple sample dataset.
- Generate a token
  - Go to Amazon Cognito in the hub account.
  - Select the `dm` user pool.
  - Click App Integration.
  - Scroll to the bottom of the page and select `dm-sso-client` from the App client list.
  - Scroll to the Hosted UI section and click View Hosted UI.
  - Log in with your configured IAM Identity Center user.
  - You should be redirected to localhost in your browser.
  - Copy the URL from your browser into a text editor. It should take the form of `localhost:3000/#access_token=<ACCESS_TOKEN>&id_token=<ID_TOKEN>&token_type=Bearer&expires_in=3600`.
  - Copy the ID token, which should be valid for 1 hour. You can click through the hosted UI again to generate a new token.
- Open an API client of your choice
- Go to AWS Systems Manager Parameter Store and open the `/dm/dataAsset/apiUrl` parameter. This is the API URL.
- Configure the client as follows (a curl sketch of the full request follows the body below):
  - Method: `POST`
  - URL: `<API_URL>dataAssetTasks`
  - Auth: Bearer token generated above
  - Headers:
    - `accept-version`: `1.0.0`
    - `accept`: `application/json`
  - Body:
{
  "catalog": {
    "domainId": "<DATAZONE_DOMAIN_ID>",
    "domainName": "<DATAZONE_DOMAIN_NAME>",
    "environmentId": "<DATAZONE_ENVIRONMENT_ID>",
    "projectId": "<DATAZONE_PROJECT_ID>",
    "region": "<REGION>",
    "assetName": "my-asset",
    "accountId": "<SPOKE_ACCOUNT_ID>",
    "autoPublish": true,
    "revision": 1
  },
  "workflow": <SEE_BELOW>
}
- Run the request
- Check that the Step Functions have completed successfully.
  - Go to the AWS Console in the hub account and look at the `dm-data-asset` State Machine in AWS Step Functions. You should see a successful execution.
  - Go to the AWS Console in the spoke account and look at the `dm-spoke-data-asset` State Machine in AWS Step Functions. You should see an execution running.
  - Once the executions complete, you should be able to find the new assets in the DataZone data catalog.
A simple sample dataset file can be found in docs/sample_data/sample_products.csv. Below are a few sample rows:
| sku | units | weight | cost |
|---|---|---|---|
| Alpha | 104 | 8 | 846.00 |
| Bravo | 102 | 5 | 961.00 |
| Charlie | 155 | 4 | 472.00 |
These rows represent a table of product names, the number of units in inventory, their weight, and their cost. Below, we will add this data as assets in Data Management using the Data Asset API. There is an example of creating a Glue table asset backed by S3 and an example of creating a Redshift table asset. Both of these assets will be managed assets in DataZone, meaning other DataZone users can subscribe to and consume them when published.
- Load the CSV file into the `dm-spoke-<SPOKE_ACCOUNT_ID>-<REGION>` S3 bucket (see the `aws s3 cp` sketch after this example).
- Make an API request, replacing the request body `workflow` with:
{
  "name": "sample-products-workflow",
  "roleArn": "<SERVICE_ROLE_ARN>",
  "dataset": {
    "name": "sample-products-dataset",
    "format": "csv",
    "connection": {
      "dataLake": {
        "s3": {
          "path": "s3://<S3 PATH>/",
          "region": "<REGION>"
        }
      }
    }
  },
  "transforms": {
    "recipe": {
      "steps": [
        {
          "Action": {
            "Operation": "LOWER_CASE",
            "Parameters": {
              "sourceColumn": "sku"
            }
          }
        }
      ]
    }
  },
  "dataQuality": {
    "ruleset": "Rules = [ (ColumnValues \"units\" >= 0) ]"
  }
}
Note: the S3 path passed in the `connection.dataLake.s3.path` parameter is the S3 path to the asset. The resulting workflow will create a Glue crawler at the path (prefix) level, which will crawl all files under that prefix. This allows for creating a data asset that is composed of multiple CSV files or has Hive partitioning (e.g. `year=2024/month=01/...`) under the prefix.
This workflow includes a transform and a user-defined data quality check. The transform takes the form of a Glue DataBrew recipe; in this case, it converts the product names in the `sku` column to lowercase. The data quality check will define and run a Glue Data Quality check and include the results in the metadata of the asset created in DataZone; in this case, it ensures the `units` column contains non-negative values.
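A hedged sketch of loading the sample file into the spoke bucket; the `sample_products/` prefix is an assumption and should match the prefix you pass in the workflow's `path`:

```bash
# Upload the sample CSV under a prefix; point connection.dataLake.s3.path at this prefix.
aws s3 cp docs/sample_data/sample_products.csv \
  s3://dm-spoke-<SPOKE_ACCOUNT_ID>-<REGION>/sample_products/sample_products.csv \
  --profile <SPOKE_PROFILE>
```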
- Load a table into your Redshift data warehouse.
- Make an API request with the following request body `workflow`:
{
  "name": "sample-products-redshift-workflow",
  "roleArn": "<SERVICE_ROLE_ARN>",
  "dataset": {
    "name": "sample-products-redshift-dataset",
    "format": "csv",
    "connection": {
      "redshift": {
        "secretArn": "REDSHIFT_ADMIN_SECRET_ARN",
        "jdbcConnectionUrl": "REDSHIFT_CONNECTION_URL",
        "subnetId": "REDSHIFT_SUBNET_ID",
        "securityGroupIdList": ["REDSHIFT_SG_ID"],
        "availabilityZone": "REDSHIFT_AZ",
        "path": "<DB_NAME>/<SCHEMA_NAME>/sample_product_table",
        "databaseTableName": "<SCHEMA_NAME>.sample_product_table",
        "workgroupName": "REDSHIFT_WORKGROUP_NAME"
      }
    }
  }
}
This workflow does not include a transform or user-defined data quality check as in the Glue Table example above. These can be added to the request if desired.
After the data asset workflow completes, a new data asset will be published to the fabric.
The Data Management catalog can be searched within Amazon DataZone. See the documentation for more information.
You can view the lineage information from the Marquez portal or the Marquez API. The locations of these endpoints can be found in the SSM parameters `/dm/dataLineage/openLineageApiUrl` and `/dm/dataLineage/openLineageWebUrl`.
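A hedged sketch of retrieving these endpoints with the AWS CLI (hub account credentials assumed, since the lineage stack is deployed to the hub):

```bash
# Fetch the Marquez (OpenLineage) API and web UI endpoints from Parameter Store.
aws ssm get-parameter --name /dm/dataLineage/openLineageApiUrl \
  --query Parameter.Value --output text --region <REGION>
aws ssm get-parameter --name /dm/dataLineage/openLineageWebUrl \
  --query Parameter.Value --output text --region <REGION>
```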
- Customers can bring their own tools for profiling, data quality, and so on to Data Management.
- Customers can attach to the DM message bus (EventBridge) to trigger any other processes that need to be added.
- Customers can author their own data products. See the Guidance for Sustainability Data Management on AWS for an example data product implementation. If you would like to develop your own data product, see the guide on Authoring Your Own Data Product.
- Go to the CloudFormation console in the spoke account and delete all stacks prefixed with `dm-spoke`.
- Go to the CloudFormation console in the hub account and delete all stacks prefixed with `dm-hub`.
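If you prefer the CLI, a hedged sketch using the stack names listed under Deployment Validation (the deletion order is an assumption: stacks that depend on the shared stacks are deleted first):

```bash
# Spoke account
for stack in dm-spoke-dataAsset dm-spoke-shared; do
  aws cloudformation delete-stack --stack-name "$stack" --profile <SPOKE_PROFILE> --region <REGION>
  aws cloudformation wait stack-delete-complete --stack-name "$stack" --profile <SPOKE_PROFILE> --region <REGION>
done

# Hub account
for stack in dm-hub-dataAsset dm-hub-datalineage dm-hub-SsoCustomStack dm-hub-CognitoCustomStack dm-hub-shared; do
  aws cloudformation delete-stack --stack-name "$stack" --profile <HUB_PROFILE> --region <REGION>
  aws cloudformation wait stack-delete-complete --stack-name "$stack" --profile <HUB_PROFILE> --region <REGION>
done
```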
Customers are responsible for making their own independent assessment of the information in this Guidance. This Guidance: (a) is for informational purposes only, (b) represents AWS current product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided “as is” without warranties, representations, or conditions of any kind, whether express or implied. AWS responsibilities and liabilities to its customers are controlled by AWS agreements, and this Guidance is not part of, nor does it modify, any agreement between AWS and its customers.