Google Professional Data Engineer Practice Test Questions and Exam Dumps

Question No 1: 

Your company has built a TensorFlow neural network model with a large number of neurons and layers. While the model performs well with the training data, it does not perform as expected when tested on new, unseen data. What approach should you take to address this issue?

A. Threading
B. Serialization
C. Dropout Methods
D. Dimensionality Reduction

Correct Answer:
C. Dropout Methods

Explanation:

In machine learning, particularly in deep learning models such as neural networks, the issue described in the question is known as overfitting. Overfitting occurs when a model learns the training data too well, including its noise and outliers, to the point where it loses the ability to generalize to new, unseen data. In this case, the model is performing well on the training data but fails to deliver good performance when tested on new data.

To address this problem, dropout methods can be applied to improve the model's ability to generalize. Dropout is a regularization technique used in neural networks to prevent overfitting. It works by randomly "dropping out" (or deactivating) a certain percentage of neurons during each training iteration. This forces the model to learn redundant representations and prevents the model from becoming overly reliant on specific neurons, which can help it generalize better to new data.

Why Other Options Are Incorrect:

  • A. Threading:
    Threading is a technique used to improve the concurrency of computations. While threading can speed up certain processes by allowing parallel execution of tasks, it does not directly address overfitting or help the model generalize better. Therefore, it would not solve the issue of poor performance on unseen data.

  • B. Serialization:
    Serialization refers to the process of converting an object into a format (like JSON, binary, etc.) that can be easily saved or transmitted and later reconstructed. Serialization is important for saving trained models or moving data around, but it is not related to improving model performance or addressing issues like overfitting. Hence, it wouldn't solve the poor generalization problem in this context.

  • D. Dimensionality Reduction:
    Dimensionality reduction techniques, such as Principal Component Analysis (PCA) or t-SNE, are used to reduce the number of input features (dimensions) in a dataset. While this can sometimes improve the performance of machine learning models by removing noise or irrelevant features, it does not directly address the overfitting problem in deep learning models. In fact, for neural networks with many layers and neurons, dimensionality reduction is usually not as effective as regularization techniques like dropout.

Conclusion:

In this scenario, the most effective method for improving the model's performance on unseen data is applying dropout. This will help reduce overfitting by forcing the model to rely on multiple, different pathways to make predictions, thus improving its generalization capability.
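For illustration, here is a minimal sketch of where dropout layers are typically placed, using the Keras API that ships with TensorFlow. The layer widths, the 0.5 dropout rate, and the 20-feature input are arbitrary assumptions, not values taken from the question:

import tensorflow as tf

# Minimal sketch: a dense network with dropout between layers.
# Layer widths, the 0.5 dropout rate, and the 20-feature input are
# illustrative assumptions.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),   # randomly deactivates 50% of units each training step
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1),       # single regression-style output
])
model.compile(optimizer="adam", loss="mse")

Keras applies dropout only during training; at inference time the dropout layers are automatically disabled, so predictions use the full network.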

Question No 2:

You are building a model to provide clothing recommendations. You know that a user's fashion preferences change over time, and you have set up a data pipeline to continuously stream new data as it becomes available. How should you use this data to train your model?

A. Continuously retrain the model on just the new data.
B. Continuously retrain the model on a combination of existing data and the new data.
C. Train on the existing data while using the new data as your test set.
D. Train on the new data while using the existing data as your test set.

Correct Answer:
B. Continuously retrain the model on a combination of existing data and the new data.

Explanation:

In recommendation systems, particularly for clothing recommendations, user preferences evolve over time. Therefore, the model needs to continuously adapt to new data to remain relevant. A key part of this adaptation process involves retraining the model regularly with both old (historical) data and new (streaming) data.

Why Option B is Correct:

  • Retraining with both existing and new data is important to ensure that the model captures long-term patterns as well as adjusts to the most recent changes in user preferences. If only the new data is used for retraining (as in Option A), the model might forget valuable insights from previous data, which could significantly reduce its overall accuracy and quality. On the other hand, using a combination of both old and new data provides a balanced approach, allowing the model to retain valuable historical patterns while adapting to recent trends and shifts in user behavior.

  • Continuous retraining ensures that the model remains up-to-date as users' preferences evolve, which is especially important for personalized recommendations in domains such as fashion, where trends and individual tastes change rapidly.

Why Other Options Are Incorrect:

  • A. Continuously retrain the model on just the new data:
    While retraining on new data ensures the model captures the latest trends, it can lead to catastrophic forgetting of previous patterns. This is a problem in recommendation systems because user preferences are not static. Limiting the model to only new data would make it lose its historical context, potentially causing it to offer irrelevant recommendations.

  • C. Train on the existing data while using the new data as your test set:
    Using new data as the test set while training on old data is not a good practice. The test set should represent unseen data that the model has never encountered, but in this case, the new data is actually streaming data that the model should use to adjust and adapt. Using this data as a test set would prevent the model from incorporating the latest trends in the recommendations.

  • D. Train on the new data while using the existing data as your test set:
    Training solely on new data while keeping the old data as the test set would again ignore the long-term patterns in user preferences. Moreover, the model would perform well on historical data (which is already known to the system) but might fail to adapt to the dynamic nature of fashion trends. This approach is not a suitable strategy for a recommendation system that needs to capture both historical and current patterns.

Conclusion:

The best approach to handle the continuous stream of new data in the clothing recommendation system is to retrain the model on a combination of existing and new data. This ensures that the model stays relevant by adapting to user preferences over time while maintaining its understanding of long-term trends.
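As a rough sketch of this retraining pattern, assuming tabular interaction data and a generic scikit-learn model; the file paths, column names, and model choice are placeholder assumptions:

import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Sketch of periodic retraining on historical plus newly streamed data.
# File paths and the "preference_score" target column are placeholders,
# and the feature columns are assumed to be numeric.
historical = pd.read_csv("historical_interactions.csv")
new_batch = pd.read_csv("latest_stream_window.csv")

# Combining old and new observations lets the model keep long-term patterns
# while adapting to recent shifts in user preferences.
training_data = pd.concat([historical, new_batch], ignore_index=True)

X = training_data.drop(columns=["preference_score"])
y = training_data["preference_score"]

model = GradientBoostingRegressor()
model.fit(X, y)  # rerun on a schedule (for example, daily) as new data arrives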

Question No 3:

You designed a database for patient records as part of a pilot project to cover a few hundred patients in three clinics. Your design used a single database table to represent all patients and their visits, relying on self-joins to generate reports. Initially, the server resource utilization was at 50%. However, as the project scaled and the database now needs to store 100 times more patient records, the reports either take too long to generate or encounter errors due to insufficient compute resources. 

How should you adjust the database design to handle the increased scale?

A. Add capacity (memory and disk space) to the database server by the order of 200.
B. Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges.
C. Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-joins.
D. Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs, and use unions for consolidated reports.

Correct Answer:
C. Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-joins.

Explanation:

As the database grows significantly in size (storing 100 times more records), performance issues can arise, particularly with queries and reports that require heavy resource consumption. These issues often occur due to inefficient database design, such as using self-joins on large tables, which can lead to long query times and resource exhaustion.

Why Option C is Correct:

  • Normalization is the process of organizing a database to minimize redundancy and dependency. In this case, normalizing the database by separating the patient table and the visits table into distinct entities can drastically improve performance. When the data is stored in one large table (e.g., combining all patient records and visits into one table), querying and generating reports becomes inefficient because each query must scan and join all records. By separating patient information from visit records, the database becomes more manageable, and queries will be faster because they operate on smaller, more focused tables.

  • Normalization reduces the need for self-joins, which are typically resource-intensive operations. A normalized schema allows for more efficient querying, as related data can be accessed by simple foreign key references instead of scanning large amounts of irrelevant data. This restructuring also makes it easier to manage and scale the database.

Why Other Options Are Incorrect:

  • A. Add capacity (memory and disk space) to the database server by the order of 200:
    While increasing server capacity may temporarily solve resource constraints, it is not a sustainable solution. The real issue lies in inefficient database design, and simply adding more resources does not address the underlying cause of the performance problems.

  • B. Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges:
    Sharding the data based on date ranges could help with query performance in some cases, but it introduces complexities in maintaining the database, such as managing multiple shards and ensuring that data is distributed efficiently. Additionally, it limits the flexibility of queries, as it only allows for reporting on prespecified date ranges.

  • D. Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs, and use unions for consolidated reports:
    While partitioning by clinic could improve query performance when accessing data for specific clinics, it introduces complexity in generating reports across multiple clinics. The use of unions across partitioned tables can be cumbersome and may still result in long query times, especially as the number of clinics or records grows.

Conclusion:

The most effective solution is to normalize the database into separate tables for patients and visits, which will reduce the reliance on self-joins and improve query performance as the dataset grows. This approach provides a more scalable, maintainable, and efficient structure for handling large datasets.
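A rough sketch of what the normalized schema might look like, shown here with SQLite from Python's standard library purely for illustration; the table and column names are assumptions:

import sqlite3

# Illustrative normalized schema: patients and visits in separate tables,
# linked by a foreign key instead of self-joins on one master table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients (
    patient_id    INTEGER PRIMARY KEY,
    full_name     TEXT NOT NULL,
    date_of_birth TEXT,
    clinic_id     INTEGER
);

CREATE TABLE visits (
    visit_id    INTEGER PRIMARY KEY,
    patient_id  INTEGER NOT NULL REFERENCES patients(patient_id),
    visit_date  TEXT NOT NULL,
    diagnosis   TEXT
);
""")

# Reports now join two focused tables instead of self-joining one large table.
report_sql = """
SELECT p.full_name, COUNT(v.visit_id) AS visit_count
FROM patients AS p
JOIN visits   AS v ON v.patient_id = p.patient_id
GROUP BY p.full_name;
"""
for row in conn.execute(report_sql):
    print(row)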

Question No 4:

You’ve created a crucial report in Google Data Studio 360 for your large team, using Google BigQuery as the data source. However, you’ve noticed that visualizations are not showing data that is less than 1 hour old. What is the most appropriate action to resolve this issue?

A. Disable caching by editing the report settings.
B. Disable caching in BigQuery by editing table details.
C. Refresh your browser tab showing the visualizations.
D. Clear your browser history for the past hour and then reload the tab showing the visualizations.

Correct Answer: A. Disable caching by editing the report settings.

Explanation:

Google Data Studio 360 allows users to create interactive and dynamic reports using data from various sources, including Google BigQuery. However, an issue can arise where real-time data (e.g., data less than an hour old) does not show up in visualizations. This issue is commonly related to data caching, which can cause the visualizations to display outdated data.

Let's examine the possible solutions:

A. Disable caching by editing the report settings.

Caching is a feature that helps improve the performance of reports by storing results for a period of time, reducing the need for repeated data requests. However, in certain cases (like needing real-time or near-real-time data), cached results might prevent fresh data from appearing. Disabling caching directly within Google Data Studio ensures that the most up-to-date data is queried from the source (Google BigQuery) each time the report is viewed or refreshed. By editing the report settings in Google Data Studio, you can turn off caching for your report, ensuring that data displayed in the visualizations is as recent as possible. This is the best practice to resolve issues like not seeing data that is less than one hour old.

B. Disable caching in BigQuery by editing table details.

While caching in BigQuery can affect query performance, BigQuery caching typically applies to the query results and not to how data is displayed in Google Data Studio reports. Disabling caching at the BigQuery table level does not solve the issue in Data Studio. The issue is related to caching at the report level in Data Studio, not the table or query level in BigQuery.

C. Refresh your browser tab showing the visualizations.

Refreshing the browser tab may seem like an obvious solution to get the most up-to-date data, but it does not address the core issue of data caching within Google Data Studio. Even after refreshing, cached data may still be used if caching is not disabled in the report settings.

D. Clear your browser history for the past hour and then reload the tab showing the visualizations.

Clearing the browser history is generally unnecessary to fix caching issues in Google Data Studio. While browser history can store data, the visualization cache is stored within the report and data source settings in Google Data Studio, not in your browser history. Therefore, clearing browser history is not an effective solution to this problem.

Conclusion:

The most effective solution to display real-time data in Google Data Studio is to disable caching at the report level. This can be done by editing the report settings to ensure that fresh data from Google BigQuery is always displayed. Therefore, option A is the correct answer.

Question No 5:

An external customer provides you with a daily dump of data from their database in the form of comma-separated values (CSV) files. These files are stored in Google Cloud Storage (GCS). You want to analyze this data in Google BigQuery, but there may be rows that are formatted incorrectly or corrupted. 

How should you design the pipeline to handle this situation?

A. Use federated data sources, and check data in the SQL query.
B. Enable BigQuery monitoring in Google Stackdriver and create an alert.
C. Import the data into BigQuery using the gcloud CLI and set max_bad_records to 0.
D. Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.

Correct Answer: D. Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.

Explanation:

When working with large volumes of external data, there are several challenges, especially if the data could be incorrectly formatted or corrupted. It’s essential to handle this bad data appropriately during the ingestion process, ensuring that the valid data is imported successfully into Google BigQuery while errors are captured for later analysis.

A. Use federated data sources, and check data in the SQL query.

Federated data sources allow you to query data directly from Google Cloud Storage without importing it into BigQuery. While this method may seem like a quick solution, it does not provide an optimal way to handle data quality issues such as corrupted rows. Checking data in SQL queries might help detect certain issues, but this approach is not ideal for bulk processing or for dealing with rows that are not properly formatted. It lacks an efficient mechanism to capture and isolate erroneous data.

B. Enable BigQuery monitoring in Google Stackdriver and create an alert.

While Google Stackdriver (now Google Cloud Operations Suite) is useful for monitoring system performance, it does not provide a direct mechanism for handling or correcting data quality issues during the import process. Alerts can notify you if certain thresholds are breached, but they do not automatically address problems such as incorrectly formatted or corrupted rows. This solution does not directly help with managing data quality during the import.

C. Import the data into BigQuery using the gcloud CLI and set max_bad_records to 0.

Using the gcloud CLI to import data into BigQuery with max_bad_records set to 0 would ensure that no bad records are accepted. However, if even a single corrupted record is encountered, the entire load job fails and none of the data, valid or otherwise, is imported. While this enforces strict data integrity, it provides no graceful way to handle errors or capture the problematic rows for analysis.

D. Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.

The most robust solution is to use Google Cloud Dataflow for batch processing. Dataflow allows you to process data in a highly customizable way, including the ability to detect and handle bad or corrupted rows. By setting up a dead-letter table, you can store erroneous data in a separate table for later analysis, ensuring that valid data is imported into BigQuery while invalid data can be reviewed or corrected. This approach offers the most flexibility and control over the data ingestion process, providing a reliable method of managing bad data while ensuring that the valid data is loaded efficiently.
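A minimal sketch of this pattern with the Apache Beam Python SDK (which Cloud Dataflow runs); the bucket path, table names, column layout, and schemas are placeholder assumptions:

import csv
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class ParseCsvRow(beam.DoFn):
    """Parse one CSV line; send malformed rows to a 'dead_letter' output."""
    def process(self, line):
        try:
            fields = next(csv.reader([line]))
            if len(fields) != 3:                      # assumed column count
                raise ValueError("unexpected column count")
            yield {"id": fields[0], "name": fields[1], "value": float(fields[2])}
        except Exception as err:
            yield beam.pvalue.TaggedOutput(
                "dead_letter", {"raw_line": line, "error": str(err)})

with beam.Pipeline(options=PipelineOptions()) as p:
    results = (
        p
        | "ReadCsv" >> beam.io.ReadFromText("gs://example-bucket/daily_dump/*.csv")
        | "Parse" >> beam.ParDo(ParseCsvRow()).with_outputs("dead_letter", main="valid")
    )
    # Valid rows go to the main table; bad rows go to a dead-letter table.
    results.valid | "WriteValid" >> beam.io.WriteToBigQuery(
        "example-project:analytics.customer_records",
        schema="id:STRING,name:STRING,value:FLOAT",
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    results.dead_letter | "WriteDeadLetter" >> beam.io.WriteToBigQuery(
        "example-project:analytics.customer_records_dead_letter",
        schema="raw_line:STRING,error:STRING",
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)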

Conclusion:

The best approach for handling incorrectly formatted or corrupted data during ingestion into Google BigQuery is to use Google Cloud Dataflow to batch-process the data, capture errors, and route problematic data to a dead-letter table. This solution ensures that the process remains robust and scalable, providing a clean and effective way to manage bad data without disrupting the overall ingestion pipeline. Therefore, option D is the correct answer.

Question No 6:

Your weather application queries a database every 15 minutes to retrieve the current temperature. The application frontend is hosted on Google App Engine, serving millions of users. 

What is the best way to design the frontend to ensure it can handle a database failure without affecting the user experience?

A. Issue a command to restart the database servers.
B. Retry the query with exponential backoff, up to a cap of 15 minutes.
C. Retry the query every second until it comes back online to minimize the staleness of data.
D. Reduce the query frequency to once every hour until the database comes back online.

Correct Answer:
B. Retry the query with exponential backoff, up to a cap of 15 minutes.

Explanation:

In a large-scale, user-facing application like a weather app, ensuring high availability and resilience during a database failure is critical. In this scenario, the application queries the database regularly to fetch the current temperature every 15 minutes. If the database encounters downtime, the frontend must be designed to handle the failure gracefully without impacting the user experience.

Why Option B is Correct:

The best approach in this case is to retry the query with exponential backoff. Exponential backoff is a standard technique used to reduce the load on a service that is experiencing problems. When a failure occurs, the system retries the operation, but with progressively longer intervals between each retry. This helps to avoid overwhelming the system with repeated requests and allows for a more efficient use of resources. In this specific case, the backoff interval is capped at 15 minutes, so the wait between retries never grows beyond the application's normal 15-minute polling cycle.

The exponential backoff strategy is particularly well-suited to situations where the failure might be temporary, such as network issues or a transient database failure. By implementing this approach, the frontend avoids sending a high number of queries during a short period, reducing the risk of server overload and worsening the failure. The 15-minute cap keeps the interval between retries no longer than the app's regular polling period, so users receive fresh temperature data soon after the database recovers.
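A minimal sketch of exponential backoff with a 15-minute cap; the fetch_current_temperature callable and the starting delay are assumptions for illustration:

import random
import time

MAX_BACKOFF_SECONDS = 15 * 60  # cap the wait at 15 minutes

def query_with_backoff(fetch_current_temperature):
    """Retry a flaky database call, doubling the wait (plus jitter) each time."""
    delay = 1  # start with a 1-second wait; illustrative assumption
    while True:
        try:
            return fetch_current_temperature()
        except Exception:
            # Wait, then double the delay up to the 15-minute cap.
            time.sleep(delay + random.uniform(0, 1))  # jitter avoids synchronized retries
            delay = min(delay * 2, MAX_BACKOFF_SECONDS)

In practice the frontend would keep serving the most recently cached temperature while a loop like this retries in the background, so users are not directly exposed to the failure.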

Why Other Options Are Incorrect:

  • A. Issue a command to restart the database servers:
    Restarting the database might not solve the issue, as it doesn’t address the root cause of the failure (e.g., network issues, misconfiguration, or resource exhaustion). It is also not a suitable solution for a frontend design, as it requires administrative intervention and can cause downtime, affecting the entire application.

  • C. Retry the query every second until it comes back online to minimize staleness of data:
    Retrying every second is inefficient and can put unnecessary load on both the frontend and backend systems, especially if the database is down for an extended period. This could lead to performance degradation, and it does not scale well for high-traffic applications serving millions of users.

  • D. Reduce the query frequency to once every hour until the database comes back online:
    While reducing the query frequency could lessen the load on the database, it’s not a good solution because it doesn’t address how to handle short-term failures or sudden spikes in traffic. The system would still experience delays in fetching the most up-to-date information for the users. Additionally, weather data typically changes frequently, so users would be receiving outdated information, which negatively impacts the app's usability.

Conclusion:

The best strategy to handle a database failure in a high-traffic weather app is to retry the query with exponential backoff. This approach ensures a balanced response, avoiding system overload while still attempting to recover from temporary failures. It ensures efficient resource usage and keeps the user experience intact by minimizing disruption during a database failure.

Question No 7:

You are tasked with creating a model to predict housing prices. Due to budget constraints, the model must be run on a single, resource-constrained virtual machine. Which machine learning algorithm would be the best choice in this scenario?

A. Linear regression
B. Logistic classification
C. Recurrent neural network
D. Feedforward neural network

Correct Answer:
A. Linear regression

Explanation:

When selecting a machine learning algorithm for a task like predicting housing prices, the resource constraints of the environment must be considered. In this case, the model is to be deployed on a single resource-constrained virtual machine with limited computational capacity. Therefore, it’s essential to choose an algorithm that balances predictive power with computational efficiency.

Why Option A is Correct:

Linear regression is a simple, computationally efficient algorithm that can be very effective for predicting continuous values, such as housing prices. It assumes a linear relationship between the input features (such as square footage, number of rooms, location, etc.) and the target variable (the housing price). Linear regression is particularly well-suited for small-scale or resource-constrained environments because it has relatively low computational complexity compared to more sophisticated models like neural networks.
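A minimal scikit-learn sketch of this approach; the CSV path and the feature and target column names are placeholder assumptions:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Minimal sketch: fitting a linear model to housing data.
# The CSV path and column names are illustrative assumptions.
data = pd.read_csv("housing.csv")
X = data[["square_feet", "num_rooms", "age_years"]]
y = data["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)          # training is fast and needs little memory
print(model.score(X_test, y_test))   # R^2 on held-out data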

Some key reasons why linear regression is a good choice:

  • Low resource requirements: Linear regression doesn’t require significant memory or computational resources to run, making it ideal for environments with limited resources, such as a single virtual machine.

  • Simplicity: The model is easy to train, tune, and interpret. The results of linear regression are typically easier to explain to stakeholders, as it directly shows the influence of each feature on the target variable.

  • Fast training and inference: Unlike more complex algorithms, such as deep neural networks, linear regression can be trained and evaluated very quickly, even with larger datasets.

Why Other Options Are Incorrect:

  • B. Logistic classification:
    Logistic regression is used for binary classification tasks, not for regression. Predicting housing prices is a regression problem, as the target variable is continuous (the price of a house), not categorical. Therefore, logistic regression would not be appropriate.

  • C. Recurrent neural network (RNN):
    RNNs are designed for sequential data, like time series or text. They are much more computationally expensive than linear regression due to their complex architecture, and would not be a suitable choice for a relatively simple task like predicting housing prices, especially on a resource-constrained system.

  • D. Feedforward neural network:
    While feedforward neural networks are more flexible and can capture complex patterns in data, they require significant computational resources to train and tune. Running a neural network on a resource-constrained virtual machine would be inefficient and impractical in this case. Additionally, without sufficient data and compute power, the neural network may not provide a significant performance boost over linear regression.

Conclusion:

For predicting housing prices on a resource-constrained virtual machine, linear regression is the best choice due to its simplicity, computational efficiency, and effectiveness in modeling linear relationships. It provides a good balance between model accuracy and the ability to run efficiently in environments with limited resources.

Question No 8:

You are building a new real-time data warehouse for your company and will use Google BigQuery streaming inserts to handle incoming data. Although data is being streamed continuously, there is no guarantee that it will only be sent once. However, each row of data contains a unique ID and an event timestamp. You want to ensure that duplicates are excluded when you interactively query the data. 

Which type of query should you use to handle this situation?

A. Include ORDER BY on the timestamp column and use LIMIT to 1.
B. Use GROUP BY on the unique ID and timestamp columns and apply SUM on the values.
C. Use the LAG window function with PARTITION BY unique ID along with WHERE LAG IS NOT NULL.
D. Use the ROW_NUMBER window function with PARTITION BY unique ID along with WHERE row = 1.

Correct Answer: D. Use the ROW_NUMBER window function with PARTITION BY unique ID along with WHERE row = 1.

Explanation:

When building a real-time data warehouse using Google BigQuery streaming inserts, one of the challenges you may face is duplicate data due to the possibility of data being sent multiple times. Even though each row of data includes a unique ID and an event timestamp, handling duplicates efficiently is crucial to ensure that the data remains accurate and queries provide meaningful results.

A. Include ORDER BY on the timestamp column and use LIMIT to 1.

Using ORDER BY on the timestamp column sorts the data by time, but sorting alone does not remove duplicates. More importantly, LIMIT 1 returns only a single row from the entire sorted result, not one row per unique ID, so this approach discards almost all of the data rather than deduplicating it. It is not a workable way to handle duplicates in a real-time data stream.

B. Use GROUP BY on the unique ID and timestamp columns and apply SUM on the values.

While GROUP BY can group rows by unique ID and timestamp, applying an aggregation function like SUM is only appropriate if you actually want to aggregate numeric values. Since the problem is about removing duplicates, this query does not address the main issue: SUM would double-count values from rows that were streamed more than once rather than discard them, and it cannot be applied to non-numeric columns at all.

C. Use the LAG window function with PARTITION BY unique ID along with WHERE LAG IS NOT NULL.

The LAG window function gives access to a previous row's value within a partition and is typically used to compare the current row with the one before it. It is not the right tool here: LAG evaluates to NULL for the first row in each unique-ID partition, so filtering with WHERE LAG IS NOT NULL would discard the first occurrence of each ID and keep the later duplicates, which is the opposite of deduplication.

D. Use the ROW_NUMBER window function with PARTITION BY unique ID along with WHERE row = 1.

The ROW_NUMBER window function is specifically designed for this type of scenario. By using PARTITION BY on the unique ID, you can assign a unique row number to each row within the same unique ID group. This allows you to identify duplicates and select only the first row (or the most recent, depending on how you order them). The WHERE row = 1 condition ensures that only one row per unique ID is kept, effectively removing duplicates. This approach works efficiently for real-time streaming data, where duplicates may appear due to data being re-sent.
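A sketch of what such a deduplicating query might look like, run here through the google-cloud-bigquery Python client; the project, dataset, table, and column names are assumptions:

from google.cloud import bigquery

client = bigquery.Client()

# Keep only the most recent row per unique ID; table and column names are
# placeholders for illustration.
dedup_sql = """
SELECT * EXCEPT(row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY unique_id
      ORDER BY event_timestamp DESC
    ) AS row_num
  FROM `example-project.warehouse.events`
)
WHERE row_num = 1
"""

for row in client.query(dedup_sql).result():  # runs the interactive query
    print(row)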

Conclusion:

To ensure that duplicates are excluded from your real-time data warehouse, the best approach is to use the ROW_NUMBER window function in combination with the PARTITION BY clause on the unique ID. This guarantees that only one row per unique ID is included in your query results, thus effectively handling the issue of duplicate data in Google BigQuery.

Question No 9: 

Your company is using wildcard tables to query data across multiple tables with similar names in Google BigQuery. However, the SQL statement you’ve written is failing with an error. Which of the following table names would make the SQL statement work correctly?

A. 'bigquery-public-data.noaa_gsod.gsod'
B. bigquery-public-data.noaa_gsod.gsod*
C. 'bigquery-public-data.noaa_gsod.gsod'*
D. 'bigquery-public-data.noaa_gsod.gsod*'

Correct Answer: B. bigquery-public-data.noaa_gsod.gsod*

Explanation:

Wildcard tables in Google BigQuery allow you to query multiple tables that share a similar naming pattern in a dataset. This is particularly useful when the dataset consists of multiple tables with names that follow a common convention, like tables with dates appended to their names. To ensure that wildcard tables are queried correctly, it's important to follow the correct syntax.

Let's look at the options in detail:

A. 'bigquery-public-data.noaa_gsod.gsod'

In this option, the table name is enclosed in single quotes. In SQL, single quotes denote string literals, so BigQuery would interpret this as a string rather than a table reference and the query would fail. Table references that contain special characters (such as the hyphen in bigquery-public-data) are wrapped in backticks in standard SQL, not single quotes, and this option is also missing the wildcard character needed to match multiple tables.

B. bigquery-public-data.noaa_gsod.gsod*

This is the correct way to reference wildcard tables in BigQuery. The * symbol is used to match tables that follow a naming pattern. This query will correctly match all tables starting with gsod in the noaa_gsod dataset. The wildcard allows you to query across multiple tables at once, and BigQuery will automatically handle it as long as the pattern is valid.
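For reference, a wildcard query typically looks like the sketch below, shown via the Python client and assuming BigQuery standard SQL, where the whole wildcard reference is wrapped in backticks (identifier quoting) rather than the single-quote string syntax used in the incorrect options; the selected columns and year range are arbitrary examples:

from google.cloud import bigquery

client = bigquery.Client()

# Query every table matching the gsod* pattern; _TABLE_SUFFIX limits which
# yearly tables are scanned. The year range is an arbitrary example.
wildcard_sql = """
SELECT stn, MAX(temp) AS max_temp
FROM `bigquery-public-data.noaa_gsod.gsod*`
WHERE _TABLE_SUFFIX BETWEEN '1940' AND '1944'
GROUP BY stn
LIMIT 10
"""

for row in client.query(wildcard_sql).result():
    print(row.stn, row.max_temp)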

C. 'bigquery-public-data.noaa_gsod.gsod'*

This option includes single quotes around the table name, followed by the wildcard character * outside the quotes. This is incorrect because the table name should not be enclosed in quotes when using wildcards. The query will result in an error due to improper syntax.

D. 'bigquery-public-data.noaa_gsod.gsod*'

Similar to option C, this option uses single quotes around the entire table name, which is incorrect. Wildcard tables do not require quotes, and the inclusion of single quotes around the wildcard is unnecessary and will lead to an error.

Conclusion:

The correct way to reference wildcard tables in Google BigQuery is by specifying the table name pattern without quotes, followed by the wildcard character *. Therefore, the correct answer is B: bigquery-public-data.noaa_gsod.gsod*.

Question No 10:

Your company operates in a highly regulated industry and needs to ensure that individual users have access only to the minimum amount of information required for their roles. You want to enforce this access control requirement in Google BigQuery. 

Which three approaches can you take to ensure this requirement is met? (Choose three.)

A. Disable writes to certain tables.
B. Restrict access to tables by role.
C. Ensure that the data is encrypted at all times.
D. Restrict BigQuery API access to approved users.
E. Segregate data across multiple tables or databases.
F. Use Google Stackdriver Audit Logging to monitor policy violations.

Correct Answers: B, D, E.

Explanation:

In a highly regulated industry, it's critical to control data access to ensure users only have the minimum necessary permissions. Google BigQuery provides several features and approaches to manage access control and ensure compliance. Here’s how each option plays a role:

A. Disable writes to certain tables.

While disabling writes is useful for ensuring data integrity (preventing unauthorized changes), it doesn't directly address the need to restrict user access. The requirement is focused on ensuring users have appropriate data read permissions, not just limiting write permissions.

B. Restrict access to tables by role.

Role-based access control (RBAC) is a best practice for controlling access to data in Google BigQuery. By defining roles with specific permissions, you can ensure that users can only access the data they are authorized to view. For example, you can create roles like Viewer, Editor, and Owner, each with different access levels to tables or datasets. This approach ensures that users only have access to the minimum data necessary for their roles.
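As a rough illustration of granting a read-only role on a single dataset through the google-cloud-bigquery Python client; the project, dataset, and email address are placeholder assumptions:

from google.cloud import bigquery

client = bigquery.Client()

# Grant one analyst read-only access to one dataset; the dataset ID and
# email address are placeholders for illustration.
dataset = client.get_dataset("example-project.clinical_reporting")

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",                 # read-only: can query, cannot modify
        entity_type="userByEmail",
        entity_id="analyst@example.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])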

C. Ensure that the data is encrypted at all times.

While data encryption is essential for data security, it doesn’t directly address user access control. Google BigQuery automatically encrypts data at rest and in transit, so encryption will not specifically enforce the requirement of minimum access for individual users.

D. Restrict BigQuery API access to approved users.

To ensure that only authorized individuals can interact with BigQuery through the API, you can restrict access by using IAM roles and permissions. This ensures that only approved users can make API requests to query or modify BigQuery resources. This approach limits access at the API level and enforces the principle of least privilege, ensuring that only those with the appropriate roles can access data.

E. Segregate data across multiple tables or databases.

By segregating sensitive or regulated data across different tables or even datasets, you can control access at a more granular level. This segregation allows for more specific access control policies, ensuring that users only have access to the data they need based on their role.

F. Use Google Stackdriver Audit Logging to monitor policy violations.

While Stackdriver Audit Logging can help monitor access patterns and detect potential violations, it is primarily a monitoring and auditing tool. It doesn’t actively enforce access controls but rather helps track and report any policy violations after the fact.

Conclusion:

To ensure that individual users have access only to the minimum necessary information in Google BigQuery, the best approaches are restricting access by role, restricting API access, and segregating data across tables or databases. These three approaches will enforce role-based access and least privilege, key principles in regulatory compliance. Therefore, the correct answers are B, D, and E.

