Certified Machine Learning Professional Databricks Practice Test Questions and Exam Dumps


Question No 1:

In the context of machine learning and data science, models are trained on historical data with the assumption that the relationship between input features (independent variables) and the target variable (dependent variable) remains stable over time. However, this assumption may no longer hold true in real-world applications where data evolves.

Which of the following best characterizes the phenomenon known as "concept drift"?

A. A shift occurs in the distribution of one or more input variables over time.
B. A shift occurs in the distribution of the target variable alone.
C. A change occurs in the underlying relationship between input variables and the target variable over time.
D. A change occurs in the distribution of the model’s predicted outputs, regardless of the data itself.
E. None of the above accurately describe concept drift.

Correct Answer:

C. A change occurs in the underlying relationship between input variables and the target variable over time.

Explanation:

Concept drift refers to the change in the statistical relationship between input features (X) and the target variable (Y) over time. In supervised machine learning, models are trained under the assumption that the joint probability distribution P(X, Y) remains stable. However, in dynamic environments such as financial markets, online behavior, or sensor-based monitoring systems, this assumption often breaks down.

For instance, consider a spam email classifier. Initially, certain words may strongly correlate with spam messages. Over time, spammers change tactics—introducing new phrases or avoiding flagged terms—which shifts the relationship between email content (input) and whether it is spam (target). This shift is not just a change in input or target distributions alone, but a change in how inputs determine the output—hence, concept drift.

There are different types of drift:

  • Sudden drift, where the change happens abruptly.

  • Gradual drift, where the relationship evolves over time.

  • Incremental drift, which occurs in small steps.

  • Recurring drift, where patterns disappear and later re-emerge.

Concept drift is crucial to detect and address because it can significantly degrade model performance if left unchecked. A model trained on outdated relationships may make increasingly poor predictions. Techniques such as retraining models periodically, using online learning, or incorporating drift detection algorithms (e.g., DDM, ADWIN) help mitigate its impact.

In summary, concept drift is specifically about the evolving relationship between inputs and outputs, not just shifts in data distributions. Recognizing and adapting to it is vital for maintaining robust machine learning systems in real-world, changing environments.
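The retraining-trigger idea above can be sketched in plain Python. This is a toy illustration only (hypothetical window size and margin, not the statistically grounded bounds that real detectors such as DDM or ADWIN use): it flags drift when a model's recent error rate rises well above its historical baseline.

```python
from collections import deque

class RollingErrorDriftDetector:
    """Toy drift detector: flags drift when the recent error rate
    exceeds the overall baseline error rate by a fixed margin.
    Illustrative only; DDM/ADWIN use principled statistical bounds."""

    def __init__(self, window=50, margin=0.15):
        self.window = deque(maxlen=window)  # recent 0/1 errors
        self.total_errors = 0
        self.total_n = 0
        self.margin = margin

    def update(self, y_true, y_pred):
        error = int(y_true != y_pred)
        self.total_errors += error
        self.total_n += 1
        self.window.append(error)
        return self.drift_detected()

    def drift_detected(self):
        if len(self.window) < self.window.maxlen:
            return False  # not enough recent observations yet
        recent_rate = sum(self.window) / len(self.window)
        baseline_rate = self.total_errors / self.total_n
        return recent_rate > baseline_rate + self.margin

# Simulated stream: the model is ~90% accurate for 250 steps,
# then concept drift degrades it to ~50% accuracy.
detector = RollingErrorDriftDetector(window=50, margin=0.15)
flagged = False
for i in range(500):
    y_true = 1
    y_pred = 1 if (i < 250 and i % 10 != 0) else (1 if i % 2 else 0)
    if detector.update(y_true, y_pred):
        flagged = True
```

A production system would couple such a flag to an alert or an automated retraining job rather than a boolean.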

Question No 2:

A machine learning engineer is analyzing categorical input variables in a production machine learning application. They suspect that missing values are becoming more frequent in recent data for a particular value in one of the categorical input variables. 

What tool can the machine learning engineer use to verify this hypothesis?

Answer Choices:

A. Kolmogorov-Smirnov (KS) test
B. One-way Chi-squared Test
C. Two-way Chi-squared Test
D. Jensen-Shannon distance
E. None of these

Correct Answer: B. One-way Chi-squared Test

Explanation:

When dealing with categorical data, particularly in a machine learning context, it’s crucial to assess how variables behave over time, especially when there’s a suspicion that certain patterns (such as an increase in missing values) might be emerging. The machine learning engineer’s goal here is to investigate whether there’s a statistically significant shift in the distribution of missing values for a particular category over time.

To assess this hypothesis, the One-way Chi-squared Test is the most appropriate tool. This test is designed to determine if the distribution of a categorical variable differs from an expected distribution. In this scenario, the engineer can use the test to evaluate if the proportion of missing values for the categorical variable is consistent across different time periods. The null hypothesis would state that the distribution of missing values is the same in both older and more recent data, while the alternative hypothesis would suggest a significant difference (indicating that the missing values are indeed becoming more prevalent in recent data).

Other options listed are less relevant:

  • Kolmogorov-Smirnov (KS) Test: This is typically used to compare two continuous distributions or to test if a sample follows a specified distribution, not for categorical data or missing value patterns.

  • Two-way Chi-squared Test: This test is used for testing relationships between two categorical variables, not for assessing missing data in one variable over time.

  • Jensen-Shannon Distance: This is a measure of similarity between probability distributions, and while useful for comparing distributions, it's not suitable for detecting shifts in missing values in a categorical variable over time.

Thus, the One-way Chi-squared Test allows the engineer to formally test if missing values are more prevalent in newer data, making it the most suitable tool for the task.

Question No 3:

Which of the following is a simple and cost-effective approach for monitoring drift in numeric features of a machine learning model?

A. Jensen-Shannon test
B. Summary statistics trends
C. Chi-squared test
D. None of these methods are suitable for monitoring feature drift
E. Kolmogorov-Smirnov (KS) test

Correct Answer:

B. Summary statistics trends

Explanation:

Feature drift occurs when the statistical properties of the input data change over time, which can significantly impact the performance of a machine learning model. (This is distinct from concept drift, which concerns the relationship between inputs and the target.) Monitoring such drift is crucial to ensure that a model continues to produce accurate predictions over time. For monitoring drift in numeric features, it’s essential to choose a method that is not only effective but also simple and low-cost to implement.

Let’s examine each option:

  • A. Jensen-Shannon test: The Jensen-Shannon distance (loosely called a "test" here) measures the similarity between two probability distributions. While useful for comparing distributions, it operates on discrete distributions, so numeric features must first be binned into histograms, and it is more complex and computationally expensive than tracking summary statistics. That makes it a heavier choice than necessary for simple, low-cost drift monitoring.

  • B. Summary statistics trends: This approach involves tracking basic statistics such as mean, median, variance, standard deviation, and quartiles of a feature over time. By plotting and comparing the trends of these statistics, you can easily identify shifts or anomalies in the data. This is a simple, intuitive, and cost-effective method for detecting numeric feature drift. Summary statistics do not require complex computations and are easy to interpret, making this method an excellent choice for low-cost drift monitoring.

  • C. Chi-squared test: The Chi-squared test is typically used for categorical data to assess whether observed frequencies match expected frequencies. This test is not suitable for monitoring drift in numeric features as it is not designed to handle continuous data.

  • D. None of these methods are suitable for monitoring feature drift: This option is incorrect because, as we discussed, summary statistics trends (option B) are an appropriate method for monitoring numeric feature drift.

  • E. Kolmogorov-Smirnov (KS) test: The Kolmogorov-Smirnov test is used to compare two sample distributions and test if they come from the same distribution. While it is effective for comparing distributions, it can be more computationally intensive than using summary statistics trends, which makes it a less optimal choice for low-cost and simple monitoring.

In conclusion, tracking summary statistics trends is the most effective, simple, and low-cost method for monitoring numeric feature drift. By regularly calculating and visualizing trends in statistics like mean, variance, and standard deviation, data scientists can detect when there is a significant change in the feature distribution, which may necessitate model retraining or adjustment.
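A short pandas sketch of this approach (synthetic data, hypothetical 3-sigma flagging rule): compute daily summary statistics for a feature and flag days whose mean deviates sharply from a baseline window.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic feature: 100 observations/day for 14 days;
# the mean shifts from 10 to 14 in the second week.
week1 = rng.normal(loc=10.0, scale=2.0, size=700)
week2 = rng.normal(loc=14.0, scale=2.0, size=700)
df = pd.DataFrame({
    "day": np.repeat(pd.date_range("2024-01-01", periods=14, freq="D"), 100),
    "feature": np.concatenate([week1, week2]),
})

# Track summary statistics per day.
daily = df.groupby("day")["feature"].agg(["mean", "std", "median"])

# Baseline from the first week; flag days whose mean deviates
# by more than 3 standard deviations of the baseline daily means.
baseline_mean = daily["mean"].iloc[:7].mean()
baseline_std = daily["mean"].iloc[:7].std()
drift_days = daily.index[(daily["mean"] - baseline_mean).abs() > 3 * baseline_std]
```

In practice these daily statistics would be plotted on a dashboard; the numeric flag is just the automated counterpart of eyeballing the trend line.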

Question No 4:

A data scientist has developed a predictive model to forecast ice cream sales based on two key input variables: the expected temperature and the expected number of hours of sunshine in a given day. The model was trained using historical data in which the temperature stayed within a specific range, and the relationship between these features and ice cream sales was established. However, the expected temperature for upcoming days is now forecasted to drop below the range of temperatures on which the model was originally trained. Given this scenario,

Which of the following types of data drift is most likely present?

A. Label Drift
B. None of These
C. Concept Drift
D. Prediction Drift
E. Feature Drift

Answer: E. Feature Drift

Explanation:

Data drift is a phenomenon where the statistical properties of data change over time, which can lead to the deterioration of the model's predictive performance. It is important to identify the type of drift occurring in order to address it effectively. In the given scenario, the data scientist's model was trained using a particular range of temperatures, and now the forecasted temperatures are dropping below that range. This could indicate a change in the distribution or behavior of the features used in the model, specifically the temperature.

Feature Drift refers to the situation when the input features of a model change over time. In this case, the expected temperature dropping below the historical range on which the model was trained is a classic example of feature drift. The model may have been designed to perform well within a specific range of temperature values. When the input feature (temperature) shifts outside the range that the model was trained on, the model may no longer be as accurate, as it has never encountered such temperatures before.

Let’s break down the other options for clarity:

  • Label Drift refers to changes in the distribution of the output variable (or label) over time. For instance, if ice cream sales are starting to behave differently due to external factors like changing consumer behavior or economic conditions, this could be label drift. However, in the provided scenario, the model is facing an issue with the features (input variables), not the target variable (ice cream sales). Therefore, label drift is not the correct answer.

  • Concept Drift involves a situation where the relationship between the features and the target variable changes over time. For example, if the relationship between temperature and ice cream sales becomes weaker or stronger, this would be concept drift. However, in this case, there is no indication that the relationship between temperature and ice cream sales is changing, only that the model is being faced with new temperature values outside of its training range. Hence, concept drift is not the best match for the given scenario.

  • Prediction Drift refers to a shift in the distribution of the model’s predicted outputs over time. While a change in inputs can cause predictions to shift as well, the scenario describes the inputs themselves moving outside the training range, so prediction drift does not best characterize what is happening here.

  • None of These is not a valid option because the scenario clearly involves a shift in the features (temperature), which is a recognized form of drift.

In conclusion, the most accurate description of the drift happening in this scenario is Feature Drift, as the expected temperature is now falling outside the range seen during model training, which could potentially impact the model’s performance. This is a typical issue in machine learning, where features evolve or behave differently than they did during the training phase, requiring adjustments to the model.
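A simple guardrail for this kind of feature drift is to check how much of the incoming data falls outside the range observed during training. The sketch below uses hypothetical temperature values:

```python
import numpy as np

def out_of_range_fraction(train_values, new_values):
    """Fraction of incoming feature values that fall outside the
    min/max range observed in the training data."""
    lo, hi = np.min(train_values), np.max(train_values)
    new_values = np.asarray(new_values)
    outside = (new_values < lo) | (new_values > hi)
    return outside.mean()

# Hypothetical: training temperatures ranged from 15 to 35 degrees C.
train_temps = np.array([15.0, 22.0, 28.0, 31.0, 35.0])

# Forecasted temperatures now dip below the training range.
forecast = np.array([8.0, 9.5, 12.0, 16.0, 18.0])
frac = out_of_range_fraction(train_temps, forecast)  # 3 of 5 are below 15
```

A high out-of-range fraction signals that the model is being asked to extrapolate, which typically warrants retraining on data that covers the new range.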

Question No 5:

A data scientist wants to remove the star_rating column from a Delta table located at a specific path. To do this, they need to load the data and then drop the star_rating column. 

Which of the following code blocks would accomplish this task?

A. spark.read.format("delta").load(path).drop("star_rating")

B. spark.read.format("delta").table(path).drop("star_rating")

C. Delta tables cannot be modified.

D. spark.read.table(path).drop("star_rating")

E. spark.sql("SELECT * EXCEPT star_rating FROM path")

Answer:

The correct answer is A. spark.read.format("delta").load(path).drop("star_rating").

Explanation:

When working with Delta tables, data scientists may need to load the data, perform transformations or modifications, and then save the modified data back. In this scenario, the goal is to remove a specific column (star_rating) from a Delta table.

Let’s break down each option:

A. spark.read.format("delta").load(path).drop("star_rating")
This is the correct approach. The method spark.read.format("delta").load(path) loads the Delta table data from the specified path. The .drop("star_rating") part removes the star_rating column from the loaded DataFrame. This effectively drops the column from the data in memory. However, note that this operation does not modify the underlying Delta table directly; it just creates a DataFrame without the star_rating column. To persist this change, you would need to write the DataFrame back to the Delta table.

B. spark.read.format("delta").table(path).drop("star_rating")
This option is incorrect because .table(path) is used to load a Delta table by its table name, not by the file path. Since the question specifies the table is at a particular path, .load(path) should be used instead. Therefore, this code will not work as intended.

C. Delta tables cannot be modified.
This statement is incorrect. Delta tables can be modified by loading the data, performing transformations, and then writing the data back. Delta Lake provides robust support for handling data updates, deletions, and schema changes.

D. spark.read.table(path).drop("star_rating")
This option is incorrect because .table(path) is used to refer to a Delta table by its name, not by the file path. Like option B, it’s not suitable for loading data from a specific location.

E. spark.sql("SELECT * EXCEPT star_rating FROM path")
This query will not run as written. Databricks SQL does support an EXCEPT clause in SELECT, but the syntax requires parentheses around the column list, and a file path cannot be referenced directly as if it were a table name (a path would need the delta.`path` syntax). Because the statement fails as given, this option is incorrect.

In summary, option A is the correct way to load a Delta table from a specified path and drop the star_rating column. However, keep in mind that after this operation, the data must be written back to the Delta table if the modification is to persist.

Question No 6:

Which of the following operations in the Feature Store Client (fs) can be used to return a Spark DataFrame associated with a Feature Store table?

A. fs.create_table
B. fs.write_table
C. fs.get_table
D. There is no way to accomplish this task with fs
E. fs.read_table

Answer:

The correct answer is E. fs.read_table.

Explanation:

In the context of a Feature Store Client (fs), a common requirement when working with feature engineering and machine learning workflows is to retrieve data from a Feature Store table as a Spark DataFrame. A Feature Store is typically used to store, manage, and retrieve features that can be reused across different machine learning models. When accessing these features, it’s essential to have the ability to query or read them in a format that can be processed efficiently for training or inference purposes. The operation that specifically allows this in the fs client is the read_table operation.

Here’s a breakdown of each option:

  • A. fs.create_table: This operation is used to create a new Feature Store table. It is part of the setup process for storing features but does not return any data, let alone in the form of a Spark DataFrame. Hence, it is not the correct option for this task.

  • B. fs.write_table: The write_table operation is used for writing data (usually in the form of a DataFrame) to an existing Feature Store table. While this operation is important for populating the Feature Store, it does not retrieve data from a table. Therefore, it is not applicable here.

  • C. fs.get_table: The get_table operation typically retrieves metadata or schema information about a Feature Store table but does not fetch the actual data in the form of a Spark DataFrame. Thus, it is not the correct method for fetching data.

  • D. There is no way to accomplish this task with fs: This statement is incorrect, as fs.read_table can indeed be used to accomplish this task.

  • E. fs.read_table: This operation is the correct choice. It is used to read the data associated with a Feature Store table and returns it as a Spark DataFrame. This operation is crucial when you need to load feature data into a Spark-based environment for further processing or model training.

Thus, to retrieve data from a Feature Store table as a Spark DataFrame, the appropriate operation is fs.read_table. This allows the data to be loaded into a distributed processing framework (like Apache Spark), making it suitable for large-scale machine learning tasks.

Question No 7:

A machine learning engineer is in the process of implementing a concept drift monitoring solution. They are planning to use the following steps, with Step #3 yet to be determined:

  1. Deploy a model to production and compute predicted values

  2. Obtain the observed (actual) label values

  4. Run a statistical test to determine if there are changes over time

Which of the following should be completed as Step #3?

A. Obtain the observed values (actual) feature values
B. Measure the latency of the prediction time
C. Retrain the model
D. None of these should be completed as Step #3
E. Compute the evaluation metric using the observed and predicted values

Answer: E. Compute the evaluation metric using the observed and predicted values

Explanation:

Concept drift refers to the phenomenon where the underlying data distribution changes over time, causing machine learning models to lose their predictive accuracy. Detecting and addressing concept drift is a critical task in production environments, especially for models that continuously receive new data.

In this scenario, the engineer's objective is to monitor for concept drift. The steps outlined in the question form part of the process for detecting when the model is no longer performing as expected. Let's break down each step in the process:

  1. Deploy a model to production and compute predicted values – In this first step, the model is deployed and begins making predictions based on incoming data.

  2. Obtain the observed (actual) label values – This step involves collecting the ground truth or observed outcomes that the model's predictions will be compared to.

Now, we need to determine what happens in Step #3. After obtaining the predicted and observed values, the next logical step is to compute evaluation metrics such as accuracy, precision, recall, or F1-score, comparing the model's predictions to the true labels. These metrics are essential for understanding the model's performance.

In Step #4, the engineer plans to run a statistical test. This test is typically used to check for significant differences between the predicted and actual values over time, helping to detect any drift. For instance, if there’s a noticeable drop in the evaluation metric or a significant change in the statistical test results, the model may have been affected by concept drift.

Thus, Step #3 is the calculation of the evaluation metric, which is a critical part of monitoring and detecting concept drift.

Question No 8:

What is a key advantage of using Jensen-Shannon (JS) distance over the Kolmogorov-Smirnov (KS) test when detecting numeric feature drift in machine learning models?

A. All of these reasons
B. JS is not normalized or smoothed
C. None of these reasons
D. JS is more robust when working with large datasets
E. JS does not require any manual threshold or cutoff determinations



Answer:
The correct answer is E. JS does not require any manual threshold or cutoff determinations.

Explanation:

Feature drift refers to the change in the statistical properties of a model’s input features over time, which can impact the model’s performance. Detecting such drift is essential for maintaining the model’s accuracy. There are various statistical methods to detect feature drift, with two commonly used techniques being Jensen-Shannon (JS) distance and Kolmogorov-Smirnov (KS) test. However, each method has its advantages and drawbacks.

  1. Jensen-Shannon (JS) Distance: JS distance measures the similarity between two probability distributions. One of its advantages is that it does not require any manual threshold or cutoff determination. This is a significant benefit because setting thresholds for drift detection can be subjective and dependent on the dataset's context. In JS distance, the values naturally lie within a bounded range (0 to 1), making it easier to interpret and less prone to subjective tuning. Additionally, JS is symmetric and has smoothing properties that make it more stable, especially in cases with small sample sizes or noisy data.

  2. Kolmogorov-Smirnov (KS) Test: The KS test compares the empirical distributions of two datasets to determine if they differ significantly. Although the KS test is widely used, it has limitations, especially in terms of threshold selection. It is highly sensitive to large datasets, and determining a suitable threshold for drift detection often requires domain expertise or trial-and-error approaches. This is where JS distance has an edge because it does not rely on these manual decisions.

  3. Robustness with Large Datasets: While JS distance can be more robust in some cases, particularly when dealing with large datasets, it is not specifically designed for such datasets, and both JS and KS can handle large data in different ways. Thus, this point is not as strong an argument in favor of JS over KS.

In conclusion, JS distance is often preferred over KS tests for numeric feature drift detection because it offers a more automated, smooth, and interpretable way to measure drift without requiring manual threshold determination, as seen in option E.
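The bounded-range property is easy to see with `scipy.spatial.distance.jensenshannon` (the histograms below are hypothetical bin proportions for a numeric feature in two time windows):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Binned distributions of a numeric feature in two time windows.
reference = np.array([0.10, 0.30, 0.40, 0.15, 0.05])
current   = np.array([0.05, 0.15, 0.30, 0.30, 0.20])

# With base=2 the JS distance is bounded in [0, 1]:
# 0 = identical distributions, 1 = completely disjoint.
js = jensenshannon(reference, current, base=2)

identical = jensenshannon(reference, reference, base=2)  # 0.0
disjoint  = jensenshannon([1, 0], [0, 1], base=2)        # 1.0
```

Because the value always lands on the same 0-to-1 scale regardless of sample size, a single interpretable threshold can be reused across features, unlike a KS p-value whose sensitivity grows with the amount of data.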

