AWS Certified Data Engineer - Associate DEA-C01 Amazon Practice Test Questions and Exam Dumps

Question No 1:

A data engineer is working on setting up an AWS Glue ETL job to read data stored in an Amazon S3 bucket. The data engineer has already configured the AWS Glue job with the appropriate IAM role and AWS Glue connection. The job runs within a Virtual Private Cloud (VPC) to access data in the private subnet.

However, when the job is executed, it fails with an error related to the Amazon S3 VPC gateway endpoint. The error indicates that the AWS Glue job cannot reach the S3 bucket, even though the IAM permissions and connection configuration appear to be correct. The data engineer needs to ensure that the AWS Glue job can connect to the S3 bucket through the VPC gateway endpoint without traversing the public internet.

What should the data engineer do to troubleshoot and resolve the issue to enable successful S3 access from the AWS Glue job?

A. Update the AWS Glue security group to allow inbound traffic from the Amazon S3 VPC gateway endpoint.
B. Configure an S3 bucket policy to explicitly grant the AWS Glue job permissions to access the S3 bucket.
C. Review the AWS Glue job code to ensure that the AWS Glue connection details include a fully qualified domain name.
D. Verify that the VPC's route table includes proper routes to the Amazon S3 VPC gateway endpoint.

Correct Answer:
D. Verify that the VPC's route table includes proper routes to the Amazon S3 VPC gateway endpoint.

Explanation:

When an AWS Glue job runs inside a VPC, it must have network connectivity to the required AWS services, such as Amazon S3. To allow access to S3 without using the public internet, a VPC endpoint is typically used. Specifically, a gateway endpoint for Amazon S3 can be created and attached to the VPC.

However, just creating the endpoint is not enough. The VPC's route tables must also be updated to direct traffic destined for Amazon S3 through the VPC gateway endpoint. If the route table is misconfigured or does not include the correct route for S3 traffic (prefix list used by S3), the Glue job will fail to connect, even if permissions and security groups are correctly set.

Option A is incorrect because VPC gateway endpoints do not require inbound rules in security groups. These endpoints are used only for traffic originating from inside the VPC to AWS services like S3.

Option B is incorrect because while bucket policies are important for authorization, this issue is a network-level problem, not a permission-related issue.

Option C is a red herring. The Glue job connection details may include a host, but connectivity to S3 doesn’t require specifying a fully qualified domain name if AWS endpoints and networking are correctly configured.

Thus, Option D is correct: the route table associated with the subnet where the Glue job runs must contain a route that directs traffic bound for S3 (the managed prefix list for com.amazonaws.<region>.s3) to the gateway endpoint. Ensuring that this route is in place resolves the connectivity error and allows the Glue job to access S3 securely and privately.
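
As a quick way to verify this, the route tables can be inspected programmatically. Below is a minimal boto3 sketch (the region and VPC ID are hypothetical placeholders) that prints any route in the VPC that already sends the S3 prefix list to a gateway endpoint; if nothing is printed, the required route is missing:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # hypothetical region
    vpc_id = "vpc-0123456789abcdef0"                    # hypothetical VPC ID

    route_tables = ec2.describe_route_tables(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
    )["RouteTables"]

    for table in route_tables:
        for route in table["Routes"]:
            # Routes created by a gateway endpoint target the S3 managed prefix
            # list (pl-...) and a GatewayId of the form vpce-...
            if "DestinationPrefixListId" in route and route.get("GatewayId", "").startswith("vpce-"):
                print(table["RouteTableId"], route["DestinationPrefixListId"], route["GatewayId"])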

This approach provides a secure, scalable, and cost-effective way to allow private access to S3 from Glue jobs running within a VPC.

Question No 2:

A multinational retail company stores customer data in a centralized Amazon S3 bucket, which serves as a customer data hub. This hub is used by employees from multiple countries to perform analytics and reporting tasks.

To comply with data governance and privacy policies, the company mandates that data analysts can only access customer data relevant to their own country. For example, a data analyst in Germany should only be able to access customer data from Germany. The governance team wants to implement this restriction with the least operational effort, while ensuring the solution is scalable, secure, and aligned with AWS best practices.

Which solution should the company implement to enforce per-country access to customer data stored in S3, with the least operational overhead?

A. Create a separate table for each country's customer data. Provide access to each analyst based on the country that the analyst serves.
B. Register the S3 bucket as a data lake location in AWS Lake Formation. Use Lake Formation's row-level security features to enforce country-specific access.
C. Move the data to AWS Regions that are close to the countries where the customers are. Provide access to each analyst based on the country that the analyst serves.
D. Load the data into Amazon Redshift. Create a view for each country and assign IAM roles to analysts based on their country-specific access needs.

Correct Answer:
B. Register the S3 bucket as a data lake location in AWS Lake Formation. Use Lake Formation's row-level security features to enforce country-specific access.

Explanation:

AWS Lake Formation is a fully managed service designed to build secure data lakes in the AWS Cloud. It simplifies data access management, ingestion, and fine-grained security over data stored in Amazon S3.

In this scenario, where a single data hub contains customer data for multiple countries, the goal is to enforce row-level access controls so that analysts only see the data relevant to their region. Manually creating multiple S3 buckets, separate tables, or separate data pipelines would involve high operational overhead and be error-prone.

By registering the Amazon S3 bucket as a Lake Formation data lake location, the company can centrally manage permissions using Lake Formation's fine-grained access control, including row-level security. This means a single table can store all customer records, but analysts will only be able to query records where the customer country matches the analyst's assigned region — all enforced automatically by Lake Formation’s built-in policies.
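
As an illustration only, a row-level restriction of this kind can be expressed as a Lake Formation data cells filter. The sketch below uses boto3 with hypothetical account, database, table, and role names; it creates a filter that exposes only German rows and grants it to an analyst role:

    import boto3

    lf = boto3.client("lakeformation", region_name="eu-central-1")  # hypothetical region

    # Define a row-level filter on the shared customer table (names are hypothetical).
    lf.create_data_cells_filter(
        TableData={
            "TableCatalogId": "111122223333",
            "DatabaseName": "customer_hub",
            "TableName": "customers",
            "Name": "germany_only",
            "RowFilter": {"FilterExpression": "country = 'DE'"},
            "ColumnWildcard": {},  # all columns; rows restricted by the expression
        }
    )

    # Grant the filtered view of the table to the German analysts' role.
    lf.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystsGermany"},
        Resource={
            "DataCellsFilter": {
                "TableCatalogId": "111122223333",
                "DatabaseName": "customer_hub",
                "TableName": "customers",
                "Name": "germany_only",
            }
        },
        Permissions=["SELECT"],
    )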

This approach is highly scalable and low-maintenance, as it does not require maintaining separate datasets or views per country. Changes in access rules (e.g., when an analyst changes regions) can be managed simply by updating their data access policy in Lake Formation.

Option A would lead to a proliferation of tables and more complex access control.
Option C involves unnecessary data movement and does not solve the problem of fine-grained access.
Option D requires setting up Redshift and maintaining multiple views and IAM roles, increasing complexity and cost.

Therefore, Option B — using AWS Lake Formation with row-level security — is the most efficient and operationally simple solution for enforcing country-based data access policies at scale.

Question No 3:

A media company is working to enhance its recommendation engine, which suggests content to users based on their behavior and preferences. To improve the effectiveness of this recommendation engine, the company plans to incorporate valuable insights from various third-party datasets. These datasets are intended to enrich the existing analytics platform and enable more personalized recommendations.

The primary goal is to minimize the operational overhead involved in accessing, integrating, and maintaining these third-party datasets. The company seeks a scalable and low-maintenance solution that fits into its existing AWS-based infrastructure, ensuring the integration process is efficient and requires the least manual intervention.

Which of the following solutions best meets the company's requirements with the least operational overhead?

A. Use API calls to access and integrate third-party datasets from AWS Data Exchange.
B. Use API calls to access and integrate third-party datasets from AWS DataSync.
C. Use Amazon Kinesis Data Streams to access and integrate third-party datasets from AWS CodeCommit repositories.
D. Use Amazon Kinesis Data Streams to access and integrate third-party datasets from Amazon Elastic Container Registry (Amazon ECR).

Correct Answer:
A. Use API calls to access and integrate third-party datasets from AWS Data Exchange.

Explanation:

AWS Data Exchange is a fully managed service that allows customers to find, subscribe to, and use third-party data in the AWS ecosystem. It is specifically designed to simplify access to external datasets—like financial, healthcare, or media data—from trusted providers.

In this scenario, the media company wants to enrich its recommendation system using third-party data while minimizing the operational effort. AWS Data Exchange provides a low-code and low-maintenance approach to incorporate these datasets. Once subscribed, the data can be delivered directly into Amazon S3, making it easy to consume within the company’s existing analytics pipelines—such as AWS Glue, Athena, or Redshift.
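
To illustrate how little code this involves, the sketch below uses the AWS Data Exchange API via boto3 to export a revision of an entitled (subscribed) data set into an S3 bucket; the data set ID, revision ID, bucket name, and region are hypothetical:

    import boto3

    dx = boto3.client("dataexchange", region_name="us-east-1")  # hypothetical region

    # Look up data sets the account is entitled to through its subscriptions.
    entitled = dx.list_data_sets(Origin="ENTITLED")["DataSets"]

    # Export one revision of a subscribed data set into the analytics landing bucket.
    job = dx.create_job(
        Type="EXPORT_REVISIONS_TO_S3",
        Details={
            "ExportRevisionsToS3": {
                "DataSetId": "example-data-set-id",        # hypothetical ID
                "RevisionDestinations": [
                    {"Bucket": "analytics-third-party-landing", "RevisionId": "example-revision-id"}
                ],
            }
        },
    )
    dx.start_job(JobId=job["Id"])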

Compared to other options:

  • Option B (AWS DataSync) is primarily used for moving large volumes of data between on-premises storage and AWS, not for integrating third-party datasets. It’s not intended for dynamic or marketplace-sourced data.

  • Option C (Kinesis + CodeCommit) doesn’t make sense as CodeCommit is a source code repository, not a place to host or stream third-party datasets.

  • Option D (Kinesis + ECR) is also inappropriate. ECR hosts Docker container images, not structured or semi-structured data. Kinesis is better suited for real-time data streaming, not for pulling static third-party datasets from repositories like ECR.

Thus, Option A is the most appropriate and efficient method to integrate third-party datasets, providing automated updates and streamlined ingestion directly into the company’s data lake or warehouse with minimal manual configuration and no custom infrastructure.

Question No 4: 

A financial institution is embarking on a data mesh architecture initiative to decentralize data ownership while maintaining centralized data governance and fine-grained access control. The company also needs a robust data catalog and a scalable system for running ETL (Extract, Transform, Load) operations.

The team has chosen AWS Glue as the central service for managing ETL pipelines and the data catalog. They want to ensure that the final design supports distributed data domains, self-service analytics, and secure access to data assets.

Which combination of AWS services should the company use in conjunction with AWS Glue to build a fully functional data mesh architecture? (Choose two.)

A. Use Amazon Aurora for data storage. Use an Amazon Redshift provisioned cluster for data analysis.
B. Use Amazon S3 for data storage. Use Amazon Athena for data analysis.
C. Use AWS Glue DataBrew for centralized data governance and access control.
D. Use Amazon RDS for data storage. Use Amazon EMR for data analysis.
E. Use AWS Lake Formation for centralized data governance and access control.

Correct Answers:

B. Use Amazon S3 for data storage. Use Amazon Athena for data analysis.
E. Use AWS Lake Formation for centralized data governance and access control.

Explanation:

A data mesh architecture promotes decentralized ownership of data while enabling enterprise-wide discoverability, governance, and interoperability. AWS provides multiple services that, when combined, support data mesh implementations efficiently.

Amazon S3 is ideal for data storage in a data mesh because it allows data to be stored in open formats (like Parquet or CSV) and easily shared across domains. S3 integrates with other AWS services and supports access controls, versioning, and scalability.

Amazon Athena complements this by enabling serverless, on-demand querying of data directly from S3. It integrates seamlessly with AWS Glue Data Catalog, providing the necessary metadata and schema management across decentralized datasets.
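
For example, a domain team could query its S3-backed tables through the Glue Data Catalog with a few Athena API calls. This is a minimal boto3 sketch; the database, table, and result bucket names are hypothetical:

    import time
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")  # hypothetical region

    query = athena.start_query_execution(
        QueryString="SELECT country, COUNT(*) AS orders FROM sales GROUP BY country",
        QueryExecutionContext={"Database": "retail_domain_db"},  # Glue Data Catalog database
        ResultConfiguration={"OutputLocation": "s3://retail-domain-query-results/"},
    )
    query_id = query["QueryExecutionId"]

    # Wait for the query to finish before fetching results.
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]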

AWS Lake Formation is critical for a centralized governance layer in a data mesh. It allows fine-grained, column-level and row-level access control across S3-based data lakes. Lake Formation works with Glue, Athena, and Redshift Spectrum, ensuring that policies are enforced regardless of how or where the data is queried.

Together:

  • AWS Glue manages metadata and ETL jobs.

  • Amazon S3 stores decentralized domain data.

  • Athena enables analytics.

  • Lake Formation enforces access policies across all these components.

Why other options are incorrect:

  • A and D (Aurora, RDS, EMR): These are centralized and more suited to specific use cases. They don’t offer the same flexibility or decentralization as S3.

  • C (AWS Glue DataBrew): While useful for data preparation, it’s not designed for centralized governance or large-scale access control.

Thus, the best combination for a data mesh is:
✔ B. Amazon S3 + Athena
✔ E. Lake Formation for governance

Question No 5:

A data engineer at a company has developed several custom Python scripts that are reused across multiple AWS Lambda functions. These scripts perform standardized data formatting tasks essential to various serverless workflows.

Currently, whenever the engineer updates one of these scripts, they must manually update each individual Lambda function, which is time-consuming and error-prone. The engineer wants to find a more efficient and scalable way to manage and reuse these Python scripts across all Lambda functions, with minimal manual effort required when changes are needed.

What is the most efficient way to allow all Lambda functions to share and consistently use the latest version of the Python scripts?

A. Store a pointer to the custom Python scripts in the execution context object in a shared Amazon S3 bucket.
B. Package the custom Python scripts into Lambda layers. Apply the Lambda layers to the Lambda functions.
C. Store a pointer to the custom Python scripts in environment variables in a shared Amazon S3 bucket.
D. Assign the same alias to each Lambda function. Call each Lambda function by specifying the function's alias.

Correct Answer:

B. Package the custom Python scripts into Lambda layers. Apply the Lambda layers to the Lambda functions.

Explanation:

AWS Lambda Layers are designed to enable code reuse across multiple Lambda functions. A Lambda layer is a ZIP archive that contains libraries, dependencies, or custom code (like Python scripts) that can be shared and attached to one or more Lambda functions.

This is the most efficient and scalable approach to solving the engineer’s problem. By packaging the shared Python scripts into a Lambda layer, the engineer only needs to update the layer when the scripts change. Then, the updated layer version can be attached to all relevant Lambda functions — either manually, automatically using CI/CD, or via infrastructure-as-code tools like AWS CloudFormation or Terraform.
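
As a rough sketch of that update flow (function names, runtime, and the ZIP file are hypothetical), publishing a new layer version and pointing each function at it takes two boto3 calls per function:

    import boto3

    lam = boto3.client("lambda", region_name="us-east-1")  # hypothetical region

    # The ZIP must place shared modules under python/ so Lambda adds them to sys.path.
    with open("formatting_utils_layer.zip", "rb") as f:
        layer = lam.publish_layer_version(
            LayerName="shared-formatting-utils",
            Content={"ZipFile": f.read()},
            CompatibleRuntimes=["python3.12"],
        )

    # Attach the new layer version to every function that uses the shared scripts.
    # Note: Layers replaces the function's full layer list, so include any others it needs.
    for function_name in ["format-orders", "format-invoices", "format-payments"]:
        lam.update_function_configuration(
            FunctionName=function_name,
            Layers=[layer["LayerVersionArn"]],
        )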

Benefits of using Lambda layers include:

  • Reusability: Share common code across many functions.

  • Maintainability: Update the code in one place (the layer) rather than editing each function.

  • Separation of concerns: Keep business logic in the function and shared utilities in the layer.

Why other options are incorrect:

  • Option A and C (S3 pointers in execution context or environment variables): These options require each Lambda function to download and execute external code during runtime. This is not secure, increases cold start time, and complicates deployment and testing.

  • Option D (function aliases): Lambda aliases are used to version and point to specific versions of Lambda functions, not to share code across functions. This option doesn't address the core requirement of code reuse.

In conclusion, Lambda layers provide a well-supported, scalable, and AWS-native solution to share and maintain common Python scripts across multiple Lambda functions efficiently.

Question No 6: 

A company has implemented an ETL (Extract, Transform, Load) data pipeline using AWS Glue. As part of the process, the data engineer must crawl data from a Microsoft SQL Server database, transform the extracted data, and load it into an Amazon S3 bucket. Additionally, the engineer needs to orchestrate the end-to-end pipeline — from crawling the data source to storing the output in S3.

The company prefers a cost-effective and AWS-native solution that integrates seamlessly with AWS Glue and offers pipeline orchestration capabilities.

Which AWS service or feature should the data engineer use to orchestrate this ETL pipeline most cost-effectively?

A. AWS Step Functions
B. AWS Glue workflows
C. AWS Glue Studio
D. Amazon Managed Workflows for Apache Airflow (Amazon MWAA)

Correct Answer:

B. AWS Glue workflows

Explanation:

AWS Glue workflows are designed specifically to orchestrate complex ETL jobs built with AWS Glue. A workflow can coordinate a series of AWS Glue jobs, crawlers, and triggers in a visual DAG (Directed Acyclic Graph). This makes it ideal for managing multi-step pipelines like the one described in this scenario — crawling data from SQL Server, transforming it, and loading it into Amazon S3.
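
The orchestration itself can be defined with a handful of Glue API calls. The sketch below (crawler, job, and workflow names are hypothetical) creates a workflow in which the transform job runs only after the SQL Server crawler succeeds:

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")  # hypothetical region

    glue.create_workflow(Name="sqlserver-to-s3")

    # First step: start the crawler on demand (a SCHEDULED trigger works the same way).
    glue.create_trigger(
        Name="start-crawl",
        WorkflowName="sqlserver-to-s3",
        Type="ON_DEMAND",
        Actions=[{"CrawlerName": "sqlserver-crawler"}],
    )

    # Second step: run the transform job only after the crawler succeeds.
    glue.create_trigger(
        Name="run-transform",
        WorkflowName="sqlserver-to-s3",
        Type="CONDITIONAL",
        StartOnCreation=True,
        Predicate={
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "CrawlerName": "sqlserver-crawler",
                    "CrawlState": "SUCCEEDED",
                }
            ]
        },
        Actions=[{"JobName": "transform-to-parquet"}],
    )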

The key benefits of AWS Glue workflows:

  • Tightly integrated with AWS Glue jobs and crawlers.

  • Cost-effective, since you pay only for the underlying Glue jobs and crawlers, not for the orchestration itself.

  • Built-in support for triggers (on success, on failure, or scheduled).

  • Visual workflow editor for easier monitoring and debugging.

Why other options are not optimal:

  • A. AWS Step Functions: While powerful for orchestration, it adds extra cost and complexity compared to Glue workflows, especially when you're only orchestrating AWS Glue components.

  • C. AWS Glue Studio: This is a visual interface for developing Glue jobs, not for orchestrating multiple jobs or crawlers. It helps design transformations but doesn’t manage end-to-end workflows.

  • D. Amazon MWAA (Apache Airflow): A robust orchestration tool, but it requires more setup, management, and cost. It is ideal for complex, multi-service orchestration, but overkill for purely Glue-based pipelines.

Therefore, the most cost-effective and purpose-built choice is:
B. AWS Glue workflows

Question No 7: 

A financial services company stores large volumes of financial data in Amazon Redshift, which powers their internal analytics and reporting. The company now wants to build a web-based trading application that will need to query Amazon Redshift in real time to display live financial data to users.

The data engineer needs a solution that allows the web application to interact with Redshift efficiently and securely, with the least amount of operational overhead. The company prefers serverless or managed solutions that reduce the need to manage connections, drivers, or infrastructure.

Which solution should the data engineer implement to allow real-time querying of Redshift data with minimal operational overhead?

A. Establish WebSocket connections to Amazon Redshift.
B. Use the Amazon Redshift Data API.
C. Set up Java Database Connectivity (JDBC) connections to Amazon Redshift.
D. Store frequently accessed data in Amazon S3. Use Amazon S3 Select to run the queries.

Correct Answer:

B. Use the Amazon Redshift Data API

Explanation:

The Amazon Redshift Data API provides a simple, secure, and scalable way to interact with Redshift from web or serverless applications without managing persistent database connections. It enables real-time SQL queries by making HTTP-based API calls, which are ideal for modern web applications that need low-latency access to Redshift data without the complexity of managing connection pools, drivers, or authentication.
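
A minimal sketch of such a call from application code is shown below (cluster, database, user, and table names are hypothetical); note that the Data API is asynchronous, so the application polls for completion rather than holding a connection open:

    import time
    import boto3

    rsd = boto3.client("redshift-data", region_name="us-east-1")  # hypothetical region

    resp = rsd.execute_statement(
        ClusterIdentifier="trading-cluster",     # hypothetical provisioned cluster
        Database="markets",
        DbUser="app_readonly",                   # temporary credentials issued via IAM
        Sql="SELECT symbol, last_price FROM quotes WHERE symbol = :symbol",
        Parameters=[{"name": "symbol", "value": "AMZN"}],
    )
    statement_id = resp["Id"]

    # The call returns immediately; poll until the statement finishes.
    while rsd.describe_statement(Id=statement_id)["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
        time.sleep(0.5)

    records = rsd.get_statement_result(Id=statement_id)["Records"]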

Key benefits of the Redshift Data API:

  • No persistent connections: Perfect for serverless and web apps.

  • IAM authentication: Eliminates hardcoded credentials.

  • Stateless API: Works well with microservices and event-driven architectures.

  • AWS SDK integration: Easily callable from Lambda, API Gateway, or directly from web apps using SDKs.

Why the other options are less ideal:

  • A. WebSocket connections: Redshift does not support WebSocket connections. This option is technically invalid.

  • C. JDBC connections: While JDBC is a valid way to connect to Redshift, it requires managing connection pools, driver compatibility, and networking (VPC access, security groups, etc.). This introduces significant operational overhead, especially for web-scale applications.

  • D. S3 Select: This is used to query objects stored in S3, not data in Redshift. While useful for optimizing S3 access, it does not apply to querying Redshift tables.

Therefore, for a web application needing real-time access to Redshift with minimal management, the Amazon Redshift Data API is the most efficient and scalable solution.

Question No 8: 

A company uses Amazon Athena to perform ad hoc queries on data stored in Amazon S3. Multiple users, teams, and applications across the same AWS account use Athena for different purposes. To meet security and compliance requirements, the company needs to isolate query processes and restrict access to query history based on each use case.

The solution must allow fine-grained access control to ensure that one group cannot see or interfere with another group's queries or results.

What is the best way to separate Athena usage and control access to query history within the same AWS account?

A. Create an S3 bucket for each use case. Use S3 bucket policies to control access.
B. Create an Athena workgroup for each use case. Apply tags to the workgroups and enforce IAM policies using those tags.
C. Create a separate IAM role for each use case and associate the roles with Athena.
D. Use AWS Glue Data Catalog resource policies to control table-level access for each user.

Correct Answer:
B. Create an Athena workgroup for each use case. Apply tags to the workgroups and enforce IAM policies using those tags.

Explanation:

Amazon Athena Workgroups provide logical separation between different use cases, teams, or applications. Each workgroup can have:

  • Its own query result location

  • Query history access control

  • Usage limits

  • Specific IAM permissions

By assigning tags to workgroups (e.g., {"UseCase": "TeamA"}), you can write IAM policies that grant or restrict permissions based on these tags. For example, you can ensure that users in Team A can only access Workgroup A and not see queries or results from Team B.
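
A rough sketch of that setup is shown below (the workgroup name, result bucket, tag values, and exact action list are illustrative): it creates a tagged workgroup and an IAM policy whose condition key restricts Athena actions to workgroups carrying that tag:

    import json
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")  # hypothetical region
    iam = boto3.client("iam")

    # An isolated workgroup with its own result location, tagged for Team A.
    athena.create_work_group(
        Name="team-a",
        Configuration={"ResultConfiguration": {"OutputLocation": "s3://team-a-athena-results/"}},
        Tags=[{"Key": "UseCase", "Value": "TeamA"}],
    )

    # Policy that allows query actions only in workgroups tagged UseCase=TeamA.
    policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "athena:StartQueryExecution",
                    "athena:GetQueryExecution",
                    "athena:GetQueryResults",
                    "athena:ListQueryExecutions",
                ],
                "Resource": "*",
                "Condition": {"StringEquals": {"aws:ResourceTag/UseCase": "TeamA"}},
            }
        ],
    }
    iam.create_policy(
        PolicyName="AthenaTeamAWorkgroupAccess",
        PolicyDocument=json.dumps(policy_document),
    )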

This approach ensures clean separation of:

  • Query logs

  • Saved queries

  • Output locations

  • Permissions

Why the other options are incorrect:

  • A (S3 buckets): Controls access to data but not Athena’s query history or results interface.

  • C (IAM roles): While useful, roles alone don’t isolate query logs or workspaces within Athena.

  • D (Glue Data Catalog resource policies): These control access to metadata and tables, not query history or execution context.

So, Athena Workgroups are purpose-built for this scenario, making Option B the best solution.

Question No 9: 

A data engineer wants to schedule a daily workflow that runs multiple AWS Glue jobs. The jobs do not need to start or finish at a specific time. The company wants the most cost-effective solution for running these jobs.

Which Glue job setting should the engineer use to optimize cost?

A. Choose the FLEX execution class in the Glue job properties.
B. Use the Spot instance type in Glue job properties.
C. Choose the STANDARD execution class in the Glue job properties.
D. Choose the latest GlueVersion in the job properties.

Correct Answer:
A. Choose the FLEX execution class in the Glue job properties.

Explanation:

AWS Glue provides two execution classes:

  • STANDARD: Prioritized job execution for time-sensitive tasks.

  • FLEX: Lower-cost option, suitable for jobs that can tolerate variable or delayed start times.

Since the Glue jobs in this scenario do not require precise start or finish times, the FLEX execution class reduces cost: Flex DPU-hours are billed at a lower rate (roughly a third less) than STANDARD. The trade-off is a potential delay before the job starts, but the job's processing itself is unchanged.
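
In the job definition this is a single property. The following boto3 sketch (job name, role ARN, script location, and sizing are hypothetical) creates a Glue job with the FLEX execution class:

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")  # hypothetical region

    glue.create_job(
        Name="nightly-transform",
        Role="arn:aws:iam::111122223333:role/GlueJobRole",   # hypothetical role
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://etl-scripts-bucket/nightly_transform.py",
            "PythonVersion": "3",
        },
        GlueVersion="4.0",
        WorkerType="G.1X",
        NumberOfWorkers=10,
        ExecutionClass="FLEX",   # lower-cost execution; the job may start later than requested
    )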

Why the other options are incorrect:

  • B (Spot instances): Glue doesn’t use EC2 Spot Instances directly. This option is not valid in Glue job settings.

  • C (STANDARD execution class): More expensive and unnecessary if timing isn't critical.

  • D (GlueVersion): Determines available features or Python version, not job cost directly.

Thus, FLEX execution class is the most cost-effective choice for non-urgent workflows.

Question No 10: 

A data engineer needs to build an AWS Lambda function that automatically converts .csv files into Apache Parquet format. The function should run only when a user uploads a .csv file into a specific Amazon S3 bucket. The company prefers the solution with the least operational overhead.

What is the simplest and most efficient way to trigger the Lambda function only when a .csv file is uploaded?

A. Create an S3 event notification on s3:ObjectCreated:* with a filter for .csv files, targeting the Lambda function.
B. Create an S3 event on s3:ObjectTagging:* for objects tagged as .csv.
C. Create an S3 event on s3:* with a filter for .csv files, targeting the Lambda function.
D. Use s3:ObjectCreated:* and send to an SNS topic, then subscribe the Lambda function to the topic.

Correct Answer:
A. Create an S3 event notification on s3:ObjectCreated:* with a filter for .csv files, targeting the Lambda function.

Explanation:

Amazon S3 supports event notifications that can trigger a Lambda function when specific events occur — such as when an object is created. You can configure an event with:

  • Event type: e.g., s3:ObjectCreated:*

  • Suffix filter: to limit it to .csv files

  • Destination: the Lambda function’s ARN

This approach is simple, serverless, and low-maintenance, meeting all the requirements with minimal overhead.
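
As a sketch (bucket name, function ARN, and configuration ID are hypothetical), the notification can be set with one API call; the Lambda function must also have a resource-based policy allowing S3 to invoke it:

    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_notification_configuration(
        Bucket="csv-uploads-bucket",  # hypothetical bucket
        NotificationConfiguration={
            "LambdaFunctionConfigurations": [
                {
                    "Id": "csv-to-parquet",
                    "LambdaFunctionArn": "arn:aws:lambda:us-east-1:111122223333:function:csv-to-parquet",
                    "Events": ["s3:ObjectCreated:*"],
                    # Suffix filter so only .csv uploads invoke the function.
                    "Filter": {"Key": {"FilterRules": [{"Name": "suffix", "Value": ".csv"}]}},
                }
            ]
        },
    )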

Why other options are not ideal:

  • B: s3:ObjectTagging:* triggers only when tags are applied, not when files are uploaded.

  • C: s3:* is overly broad and includes many events, increasing noise and potential cost.

  • D: Using SNS adds an extra layer of complexity and cost. Direct invocation of Lambda is simpler.

Thus, Option A provides a clean, efficient trigger that runs the Lambda function only when a .csv file is uploaded, without unnecessary components.


