- Overview
- Prerequisites
- Deployment Steps
- Deployment Validation
- Running the Guidance
- Next Steps
- Cleanup
The Guidance for Data Management on AWS is an opinionated Data Management implementation on AWS.
You are responsible for the cost of the AWS services used while running this Guidance. As of May 2024, the cost for running this Guidance with the default settings in the US West (Oregon) AWS Region is approximately $530 per month, using the following assumptions:
- 10 DataZone users, with metadata storage and requests under the amount included in the per-user cost (DataZone pricing)
- 50 Data Management create asset API requests per month
- Low (< 1k) lineage API/PUT requests per month
- Does not include estimates for existing S3, Glue, and other resources in the customer's Data Management spoke account
A detailed cost breakdown can be found in this shared AWS Pricing Calculator estimate. The detailed estimate does not yet include DataZone pricing, so adding $9/month per user for the assumed 10 users adds $90 per month, which yields the $530/month estimate referenced above.
These deployment instructions are intended for use on macOS. Deployment using a different operating system may require additional steps.
- For this guidance, you will need to have or set up three accounts in the same AWS Organization.
- A management account where IAM Identity Center is enabled.
- A hub account where a DataZone domain will be created. This must be part of the same OU as the spoke account.
- A spoke account where data assets will reside. This must be part of the same OU as the hub account. You may set up multiple spoke accounts.
- All resources in these accounts are assumed to be in the same region unless specified otherwise.
- Create a DataZone Domain
- Enable IAM Identity Center for DataZone
- Request association of the spoke account in the hub account for the DataZone domain and accept the request in the spoke account
- Enable the Data Lake and Data Warehouse blueprints when accepting the request.
- Create roles
- Create a role in the organization’s account where the IAM Identity Center instance resides and save the role ARN for the deployment steps (a CLI sketch follows the policies below).
  - It should have the following trust policy:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "AWS": [ "arn:aws:iam::<HUB_ACCOUNT_ID>:root" ] }, "Action": "sts:AssumeRole", "Condition": {} } ] }
  - It should have the following permissions:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "identitystore:IsMemberInGroups", "identitystore:GetUserId" ], "Resource": "*" } ] }
- In your spoke account, create an IAM role to be used when creating assets in DM. You will pass the role’s Amazon Resource Name (ARN) to DM when you create assets. DM will pass this role to Glue and Glue DataBrew as needed.
  - The role name must be prefixed with `dm-`. This enables the role to be passed by DM.
  - The trust policy is as follows:
{ "Version": "2012-10-17", "Statement": [ { "Sid": "dataBrew", "Effect": "Allow", "Principal": { "Service": "databrew.amazonaws.com" }, "Action": "sts:AssumeRole" }, { "Sid": "glue", "Effect": "Allow", "Principal": { "Service": "glue.amazonaws.com" }, "Action": "sts:AssumeRole" } ] }
  - Add the following policies to the role:
- Generate and upload a certificate
openssl genrsa 2048 > my-aws-private.key
openssl req -new -x509 -nodes -sha1 -days 3650 -extensions v3_ca -key my-aws-private.key > my-aws-public.crt
- Leave all prompts blank, except `Common Name (e.g. server FQDN or YOUR name) []:`, which can be set to `dm.amazonaws.com`.
- Import certificate into the hub account:
aws acm import-certificate --certificate fileb://my-aws-public.crt --private-key fileb://my-aws-private.key --region <REGION> --profile <AWS_PROFILE>
- Note the ARN that is returned. You will need to provide this as `loadBalancerCertificateArn` when deploying the hub infrastructure.
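If you want to capture the returned ARN programmatically, a sketch of the same import command with an output query (region/profile placeholders as above):

```bash
# Import the certificate and capture the returned ARN into a shell variable.
CERT_ARN=$(aws acm import-certificate \
  --certificate fileb://my-aws-public.crt \
  --private-key fileb://my-aws-private.key \
  --region <REGION> --profile <AWS_PROFILE> \
  --query CertificateArn --output text)
echo "$CERT_ARN"
```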
The hub and spoke accounts must be bootstrapped for the AWS Cloud Development Kit.
cdk bootstrap <HUB_ACCOUNT_ID>/<REGION> --profile <HUB_PROFILE>
cdk bootstrap <SPOKE_ACCOUNT_ID>/<REGION> --profile <SPOKE_PROFILE>
- Create your first DataZone project.
- Create DataZone environments.
  - Create a Data Warehouse environment in the newly created project
    - Create a Redshift Serverless data warehouse in the spoke account with the following settings:
      - Customize admin user credentials
      - Manage admin credentials in AWS Secrets Manager
      - Select private subnets
      - Enhanced VPC routing on
    - Go to Secrets Manager and open the secret that was created for your new data warehouse.
    - Add the following tags (a CLI sketch of this step and the VPC endpoint step follows this list):
      - Key: `AmazonDataZoneDomain`, Value: `<DATAZONE_DOMAIN_ID>`
      - Key: `AmazonDataZoneProject`, Value: `<DATAZONE_PROJECT_ID>`
    - Create VPC endpoints in the VPC and private subnets where Redshift is deployed:
      - S3 gateway
      - Glue
      - DataBrew
    - Create a DataZone data warehouse environment for the newly created Redshift resources.
  - Create a Data Lake environment in the newly created project
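The secret tagging and VPC endpoint steps above can also be done from the CLI. A hedged sketch, assuming spoke account credentials; the secret ARN, VPC, subnet, route table, and security group values are placeholders for your environment:

```bash
# Tag the Redshift admin secret with the DataZone domain and project IDs.
aws secretsmanager tag-resource \
  --secret-id <REDSHIFT_ADMIN_SECRET_ARN> \
  --tags Key=AmazonDataZoneDomain,Value=<DATAZONE_DOMAIN_ID> \
         Key=AmazonDataZoneProject,Value=<DATAZONE_PROJECT_ID> \
  --region <REGION> --profile <SPOKE_PROFILE>

# S3 gateway endpoint in the VPC where Redshift is deployed.
aws ec2 create-vpc-endpoint \
  --vpc-id <VPC_ID> --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.<REGION>.s3 \
  --route-table-ids <PRIVATE_ROUTE_TABLE_ID> \
  --region <REGION> --profile <SPOKE_PROFILE>

# Interface endpoints for Glue and DataBrew in the private subnets.
for svc in glue databrew; do
  aws ec2 create-vpc-endpoint \
    --vpc-id <VPC_ID> --vpc-endpoint-type Interface \
    --service-name com.amazonaws.<REGION>.$svc \
    --subnet-ids <PRIVATE_SUBNET_IDS> \
    --security-group-ids <SECURITY_GROUP_ID> \
    --region <REGION> --profile <SPOKE_PROFILE>
done
```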
- Go to AWS Lake Formation in the console of the spoke account.
- Make sure you are a Lake Formation admin, along with the CDK execution role.
  - If this is your first time opening Lake Formation, there will be a prompt to “Add myself”; select that box. Also select the box that says “Add other AWS users or roles”, select the CDK CloudFormation execution role (it will begin with `cdk-<ID>cfn-exec-role`), and click “Get Started”.
  - If you have previously used Lake Formation, add them under the Administrative roles and tasks section in the Administration section.
We will need to deploy the hub stack in three separate steps.
Note: this step only needs to be performed once for the initial deployment
git clone git@github.com:aws-solutions-library-samples/guidance-for-data-management-on-aws.git
cd guidance-for-data-management-on-aws
- Install dependencies
rush update --bypass-policy
- Build
rush build
cd infrastructure/hub
- Export credentials for the hub account and the AWS region to the environment
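For example, a sketch using environment variables (placeholder values; use credentials for the hub account):

```bash
# Placeholder credentials for the hub account; omit AWS_SESSION_TOKEN for long-lived keys.
export AWS_ACCESS_KEY_ID=<HUB_ACCESS_KEY_ID>
export AWS_SECRET_ACCESS_KEY=<HUB_SECRET_ACCESS_KEY>
export AWS_SESSION_TOKEN=<HUB_SESSION_TOKEN>
export AWS_REGION=<REGION>
```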
- Deploy
npm run cdk -- deploy --require-approval never --concurrency=10 -c identityStoreId=<IAM_IDENTITY_CENTER_IDENTITY_STORE_ID> -c identityStoreRoleArn=<IAM_IDENTITY_CENTER_ROLE_ARN> -c identityStoreRegion=<IDENTITY_STORE_REGION> -c ssoRegion=<SSO_REGION> -c orgId=<AWS_ORGANIZATIONS_ORG_ID> -c orgRootId=<AWS_ORGANIZATIONS_ROOT_ID> -c orgOuId=<AWS_ORGANIZATIONS_OU_ID> -c ssoInstanceArn=<AWS_IDENTITY_CENTER_INSTANCE_ARN> -c adminEmail=<ADMIN_EMAIL> --all
Note: this step only needs to be performed once for the initial deployment
1. Open the IAM Identity Center console and then, from the navigation pane, choose Applications.
2. Choose Add application, I have an application I want to set up, and Add custom SAML 2.0 application, and then choose Next.
3. On the Configure application page, enter a Display name and a Description.
4. Copy the URL of the IAM Identity Center SAML metadata file. You use these resources in later steps to create an IdP in a user pool.
5. Under Application metadata, choose Manually type your metadata values. Then provide the following values.
6. Important: Make sure to replace the domain, region, and userPoolId values with information you gather after the CDK deployment.
   - Application Assertion Consumer Service (ACS) URL: `<userPoolDomain>/saml2/idpresponse`
   - Application SAML audience: `urn:amazon:cognito:sp:<userPoolId>`
7. Choose Submit. Then, go to the Details page for the application that you added.
8. Select the Actions dropdown list and choose Edit attribute mappings. Then, provide the following attributes.
   - User attribute in the application: `Subject`
     - Note: Subject is prefilled.
     - Maps to this string value or user attribute in IAM Identity Center: `${user:subject}`
     - Format: `Persistent`
   - User attribute in the application: `email`
     - Maps to this string value or user attribute in IAM Identity Center: `${user:email}`
     - Format: `Basic`
- Redeploy
npm run cdk -- deploy --require-approval never --concurrency=10 -c identityStoreId=<IAM_IDENTITY_CENTER_IDENTITY_STORE_ID> -c identityStoreRoleArn=<IAM_IDENTITY_CENTER_ROLE_ARN> -c identityStoreRegion=<IDENTITY_STORE_REGION> -c ssoRegion=<SSO_REGION> -c orgId=<AWS_ORGANIZATIONS_ORG_ID> -c orgRootId=<AWS_ORGANIZATIONS_ROOT_ID> -c orgOuId=<AWS_ORGANIZATIONS_OU_ID> -c ssoInstanceArn=<AWS_IDENTITY_CENTER_INSTANCE_ARN> -c samlMetaDataUrl=<SAML_METADATA_URL> -c callbackUrls=http://localhost:3000 -c adminEmail=<ADMIN_EMAIL> -c loadBalancerCertificateArn=<LOAD_BALANCER_CERTIFICATE_ARN> --all
cd ../spoke
- Export credentials for the spoke account and the AWS region to the environment (as shown for the hub account above)
npm run cdk -- deploy -c hubAccountId=<HUB_ACCOUNT_ID> -c orgId=<AWS_ORGANIZATIONS_ORG_ID> -c orgRootId=<AWS_ORGANIZATIONS_ROOT_ID> -c orgOuId=<AWS_ORGANIZATIONS_OU_ID> -c deleteBucket=true --require-approval never --concurrency=10 --all
- Open the AWS Lake Formation console in the spoke account.
- Go to the `dm-spoke-<SPOKE_ACCOUNT_ID>-<REGION>` database.
  - Click Edit, uncheck “Use only IAM access control for new tables in this database”, and click Save.
  - Click Actions/permissions/view.
  - Select IAMAllowedPrincipals and click Revoke.
  - Click Actions/permissions/Grant.
- For each of the following, you will need to click through and grant the permissions (a CLI sketch of an equivalent grant follows this list):
  - DataZone database access
    - Principals
      - IAM users and roles: `AmazonDataZoneGlueAccess-<REGION>-<DOMAIN_ID>`
    - LF-Tags or catalog resources
      - Select Named Data Catalog resources
      - Databases: `dm-spoke-<SPOKE_ACCOUNT_ID>-<REGION>`
    - Database permissions
      - Database permissions: `Describe`
      - Grantable permissions: `Describe`
  - DataZone table access
    - Principals
      - IAM users and roles: `AmazonDataZoneGlueAccess-<REGION>-<DOMAIN_ID>`
    - LF-Tags or catalog resources
      - Select Named Data Catalog resources
      - Databases: `dm-spoke-<SPOKE_ACCOUNT_ID>-<REGION>`
      - Tables: All tables
    - Table permissions
      - Table permissions: `Describe`, `Select`
      - Grantable permissions: `Describe`, `Select`
  - Service role database access
    - Principals
      - IAM users and roles: `<SERVICE_ROLE>`
    - LF-Tags or catalog resources
      - Select Named Data Catalog resources
      - Databases: `dm-spoke-<SPOKE_ACCOUNT_ID>-<REGION>`
    - Database permissions
      - Database permissions: `Create table`, `Describe`
      - Grantable permissions: None
  - Service role table access
    - Principals
      - IAM users and roles: `<SERVICE_ROLE>`
    - LF-Tags or catalog resources
      - Select Named Data Catalog resources
      - Databases: `dm-spoke-<SPOKE_ACCOUNT_ID>-<REGION>`
      - Tables: All tables
    - Table permissions
      - Table permissions: `Alter`, `Describe`, `Insert`, `Select`
      - Grantable permissions: None
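These grants can also be made from the CLI. A hedged sketch of the first grant (DataZone database access); the other grants follow the same shape with the principals, resources, and permissions listed above:

```bash
# Equivalent of the "DataZone database access" grant (sketch; adjust account, role, and database names).
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::<SPOKE_ACCOUNT_ID>:role/AmazonDataZoneGlueAccess-<REGION>-<DOMAIN_ID> \
  --resource '{"Database": {"Name": "dm-spoke-<SPOKE_ACCOUNT_ID>-<REGION>"}}' \
  --permissions DESCRIBE \
  --permissions-with-grant-option DESCRIBE \
  --region <REGION> --profile <SPOKE_PROFILE>
```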
- Add the S3 bucket as a data location
  - Go to Data lake locations
  - Click Register location
  - Enter the `dm-spoke-<SPOKE_ACCOUNT_ID>-<REGION>` location
  - Switch Permission mode to Lake Formation
  - Click Register Location
- Grant the service role access to the data location
  - Go to Data locations
  - Click Grant
  - Select the service role created earlier
  - Enter the `dm-spoke-<SPOKE_ACCOUNT_ID>-<REGION>` location for Storage locations
  - Click Grant
- Sign in to DataZone with the user that you will be making API calls with.
  - Create a user in IAM Identity Center in the Management account.
    - Go to IAM Identity Center in the console in the Management account.
    - Click Users.
    - Click Add User and follow the prompts.
    - You should get an email to set up your new user. Follow the prompts in the email to complete setup.
  - Assign the user you created to the custom application.
    - Go to IAM Identity Center in the console in the Management account.
    - Click Applications.
    - Click Customer Managed.
    - Click on the custom application that was created earlier.
    - Click Assign users and groups and assign your user.
  - Sign into DataZone with your SSO user.
    - Go to IAM Identity Center in the Management account.
    - Under Settings summary, open the AWS access portal URL.
    - Sign in with your new user.
    - Click on the DataZone application to open the UI. Click Sign in with SSO. This step is needed to get DataZone to recognize your user ID.
- Check that the following CloudFormation stacks have been successfully created in the spoke account:
  - `dm-spoke-shared`
  - `dm-spoke-dataAsset`
- Check that the following CloudFormation stacks have been successfully created in the hub account:
  - `dm-hub-shared`
  - `dm-hub-CognitoCustomStack`
  - `dm-hub-SsoCustomStack`
  - `dm-hub-datalineage`
  - `dm-hub-dataAsset`
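To check from the CLI instead, a hedged sketch listing the successfully created `dm-` stacks in each account:

```bash
# Run once with spoke account credentials and once with hub account credentials.
aws cloudformation list-stacks \
  --stack-status-filter CREATE_COMPLETE UPDATE_COMPLETE \
  --query "StackSummaries[?starts_with(StackName, 'dm-')].StackName" \
  --region <REGION> --profile <AWS_PROFILE>
```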
The following outlines the steps to create an asset using the Data Asset API. Below are examples using a simple sample dataset.
- Generate a token
  - Go to Amazon Cognito in the hub account.
  - Select the `dm` user pool.
  - Click App Integration.
  - Scroll to the bottom of the page and select `dm-sso-client` from the App client list.
  - Scroll to the Hosted UI section and click View Hosted UI.
  - Log in with your configured IAM Identity Center user.
  - You should be redirected to localhost in your browser.
  - Copy the URL from your browser into a text editor. It should take the form of `localhost:3000/#access_token=<ACCESS_TOKEN>&id_token=<ID_TOKEN>&token_type=Bearer&expires_in=3600`.
  - Copy the ID token, which should be valid for 1 hour. You can click through the hosted UI again to generate a new token.
- Open an API client of your choice
- Go to AWS Systems Manager Parameter Store and open the `/dm/dataAsset/apiUrl` parameter. This is the API URL.
- Configure the client as follows (a curl sketch of the full request follows the body below):
  - Method: `POST`
  - URL: `<API_URL>dataAssetTasks`
  - Auth: Bearer token generated above
  - Headers:
    - `accept-version`: `1.0.0`
    - `accept`: `application/json`
  - Body:
{
  "catalog": {
    "domainId": "<DATAZONE_DOMAIN_ID>",
    "domainName": "<DATAZONE_DOMAIN_NAME>",
    "environmentId": "<DATAZONE_ENVIRONMENT_ID>",
    "projectId": "<DATAZONE_PROJECT_ID>",
    "region": "<REGION>",
    "assetName": "my-asset",
    "accountId": "<SPOKE_ACCOUNT_ID>",
    "autoPublish": true,
    "revision": 1
  },
  "workflow": <SEE_BELOW>
}
- Run the request
- Check that the Step Functions have completed successfully.
  - Go to the AWS Console in the hub account and look at the `dm-data-asset` State Machine in AWS Step Functions. You should see a successful execution.
  - Go to the AWS Console in the spoke account and look at the `dm-spoke-data-asset` State Machine in AWS Step Functions. You should see an execution running.
  - Once the executions complete, you should be able to find the new assets in the DataZone data catalog.
A simple sample dataset file can be found in docs/sample_data/sample_products.csv. Below are a few sample rows:
| sku | units | weight | cost |
|---|---|---|---|
| Alpha | 104 | 8 | 846.00 |
| Bravo | 102 | 5 | 961.00 |
| Charlie | 155 | 4 | 472.00 |
These rows represent a table of product names, the number of units in inventory, their weight, and their cost. Below, we will add this data as assets in Data Management using the Data Asset API. There is an example of creating a Glue table asset backed by S3 and an example of creating a Redshift table asset. Both of these assets will be managed assets in DataZone, meaning other DataZone users can subscribe to and consume them when published.
- Load the CSV file into the `dm-spoke-<SPOKE_ACCOUNT_ID>-<REGION>` S3 bucket (see the `aws s3 cp` sketch after this example).
- Make an API request, replacing the request body `workflow` with:
{
  "name": "sample-products-workflow",
  "roleArn": "<SERVICE_ROLE_ARN>",
  "dataset": {
    "name": "sample-products-dataset",
    "format": "csv",
    "connection": {
      "dataLake": {
        "s3": {
          "path": "s3://<S3 PATH>/",
          "region": "<REGION>"
        }
      }
    }
  },
  "transforms": {
    "recipe": {
      "steps": [
        {
          "Action": {
            "Operation": "LOWER_CASE",
            "Parameters": {
              "sourceColumn": "sku"
            }
          }
        }
      ]
    }
  },
  "dataQuality": {
    "ruleset": "Rules = [ (ColumnValues \"units\" >= 0) ]"
  }
}
Note: the S3 path passed in the `connection.dataLake.s3.path` parameter is the S3 path to the asset. The resulting workflow will create a Glue crawler at the path (prefix) level, which will crawl all files under that prefix. This allows for creating a data asset that is composed of multiple CSV files or has Hive partitioning (e.g. `year=2024/month=01/...`) under the prefix.
This workflow includes a transform and a user-defined data quality check. The transform takes the form of a Glue DataBrew recipe; in this case, it converts the product names in the `sku` column to lowercase. The data quality check will define and run a Glue Data Quality check and include the results in the metadata of the asset created in DataZone; in this case, it ensures the `units` column contains non-negative values.
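A hedged sketch of loading the sample file into the spoke bucket; the `sample_products/` prefix is an assumption and should match the prefix you pass in the workflow's `path`:

```bash
# Upload the sample CSV under a prefix; point connection.dataLake.s3.path at this prefix.
aws s3 cp docs/sample_data/sample_products.csv \
  s3://dm-spoke-<SPOKE_ACCOUNT_ID>-<REGION>/sample_products/sample_products.csv \
  --profile <SPOKE_PROFILE>
```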
- Load a table into your Redshift data warehouse.
- Make an API request with the following request body `workflow`:
{
  "name": "sample-products-redshift-workflow",
  "roleArn": "<SERVICE_ROLE_ARN>",
  "dataset": {
    "name": "sample-products-redshift-dataset",
    "format": "csv",
    "connection": {
      "redshift": {
        "secretArn": "REDSHIFT_ADMIN_SECRET_ARN",
        "jdbcConnectionUrl": "REDSHIFT_CONNECTION_URL",
        "subnetId": "REDSHIFT_SUBNET_ID",
        "securityGroupIdList": ["REDSHIFT_SG_ID"],
        "availabilityZone": "REDSHIFT_AZ",
        "path": "<DB_NAME>/<SCHEMA_NAME>/sample_product_table",
        "databaseTableName": "<SCHEMA_NAME>.sample_product_table",
        "workgroupName": "REDSHIFT_WORKGROUP_NAME"
      }
    }
  }
}
This workflow does not include a transform or user-defined data quality check as in the Glue Table example above. These can be added to the request if desired.
After the data asset workflow completes, a new data asset will be published to the fabric.
The Data Management catalog can be searched within Amazon DataZone. See the documentation for more information.
You can view the lineage information from the Marquez portal or the Marquez API. The locations of these endpoints can be found in the SSM parameters `/dm/dataLineage/openLineageApiUrl` and `/dm/dataLineage/openLineageWebUrl`.
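A hedged sketch of retrieving these endpoints with the AWS CLI (hub account credentials assumed, since the lineage stack is deployed to the hub):

```bash
# Fetch the Marquez (OpenLineage) API and web UI endpoints from Parameter Store.
aws ssm get-parameter --name /dm/dataLineage/openLineageApiUrl \
  --query Parameter.Value --output text --region <REGION>
aws ssm get-parameter --name /dm/dataLineage/openLineageWebUrl \
  --query Parameter.Value --output text --region <REGION>
```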
- Customers can bring their own tools for profiling, data quality, and so on to Data Management.
- Customers can attach to the DM message bus (EventBridge) to trigger any other processes that need to be added.
- Customers can author their own data products. See the Guidance for Sustainability Data Management on AWS for an example data product implementation. If you would like to develop your own data product, see the guide on Authoring Your Own Data Product.
- Go to the CloudFormation console in the spoke account and delete all stacks prefixed with `dm-spoke`.
- Go to the CloudFormation console in the hub account and delete all stacks prefixed with `dm-hub`.
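If you prefer the CLI, a hedged sketch using the stack names listed under Deployment Validation (the deletion order is an assumption: stacks that depend on the shared stacks are deleted first):

```bash
# Spoke account
for stack in dm-spoke-dataAsset dm-spoke-shared; do
  aws cloudformation delete-stack --stack-name "$stack" --profile <SPOKE_PROFILE> --region <REGION>
  aws cloudformation wait stack-delete-complete --stack-name "$stack" --profile <SPOKE_PROFILE> --region <REGION>
done

# Hub account
for stack in dm-hub-dataAsset dm-hub-datalineage dm-hub-SsoCustomStack dm-hub-CognitoCustomStack dm-hub-shared; do
  aws cloudformation delete-stack --stack-name "$stack" --profile <HUB_PROFILE> --region <REGION>
  aws cloudformation wait stack-delete-complete --stack-name "$stack" --profile <HUB_PROFILE> --region <REGION>
done
```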
Customers are responsible for making their own independent assessment of the information in this Guidance. This Guidance: (a) is for informational purposes only, (b) represents AWS current product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided “as is” without warranties, representations, or conditions of any kind, whether express or implied. AWS responsibilities and liabilities to its customers are controlled by AWS agreements, and this Guidance is not part of, nor does it modify, any agreement between AWS and its customers.