Databricks Certified Machine Learning Associate Practice Test Questions and Exam Dumps


Question No 1:

A machine learning engineer has used the Feature Store client (fs) to create a new Feature Table named new_table. During the creation process, they included a metadata description to provide key contextual information about the Feature Table. The engineer now wishes to programmatically retrieve this metadata description at a later point in time.

Which of the following code snippets correctly retrieves the description of the Feature Table?

Options:

A. There is no way to return the metadata description programmatically.
B. fs.create_training_set("new_table")
C. fs.get_table("new_table").description
D. fs.get_table("new_table").load_df()
E. fs.get_table("new_table")

Correct Answer: C. fs.get_table("new_table").description

Explanation:

Feature Store is an essential component in many machine learning platforms, enabling engineers to store, reuse, and share engineered features across different models. Metadata such as descriptions, tags, and ownership information helps provide critical context about what a Feature Table contains and how it should be used.

When a Feature Table is created using the Feature Store client (fs), engineers can attach metadata, such as a human-readable description, to document its purpose and usage. Retrieving this description programmatically is important for automation, data discovery, and governance.

The correct way to retrieve a Feature Table's metadata description is by first accessing the Feature Table object using fs.get_table("new_table"). This method returns a table object that encapsulates all metadata and schema information. Once you have the object, you can access its description attribute using dot notation. Thus, fs.get_table("new_table").description retrieves the textual description that was originally set during table creation.
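To make this concrete, here is a minimal sketch of the full round trip, assuming a recent Databricks Feature Store client where create_table is available, and assuming a pre-existing Spark DataFrame named features_df with a "customer_id" key column (both names are illustrative):

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Create the Feature Table with a human-readable description
# (features_df is an assumed, pre-existing Spark DataFrame).
fs.create_table(
    name="new_table",
    primary_keys=["customer_id"],
    df=features_df,
    description="Aggregated customer features for downstream models",
)

# Later, retrieve the metadata description programmatically.
description = fs.get_table("new_table").description
print(description)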

Let’s review the incorrect options:

  • A is incorrect because metadata can be retrieved programmatically.

  • B is used to create a training set, not to fetch metadata.

  • D loads the data (as a DataFrame), but does not access metadata.

  • E returns the table object, but does not access the description unless followed by .description.

Therefore, Option C is the correct and most direct way to retrieve the Feature Table's metadata description programmatically.

Question No 2:

A data scientist is working with Apache Spark using PySpark and has a DataFrame named spark_df. The DataFrame contains several columns, including one named "price". The goal is to create a new DataFrame that includes only the rows where the value in the price column is greater than zero.

Which of the following code snippets will correctly filter the DataFrame to meet this requirement in PySpark?

Options:

A. spark_df[spark_df["price"] > 0]
B. spark_df.filter(col("price") > 0)
C. SELECT * FROM spark_df WHERE price > 0
D. spark_df.loc[spark_df["price"] > 0, :]
E. spark_df.loc[:, spark_df["price"] > 0]

Correct Answer: B. spark_df.filter(col("price") > 0)

Explanation:

In PySpark, the appropriate way to filter rows in a DataFrame based on a condition is by using the .filter() or .where() method. Both methods accept a column expression that evaluates to a boolean.

The correct expression in PySpark to filter rows where the price column is greater than 0 is:

from pyspark.sql.functions import col

filtered_df = spark_df.filter(col("price") > 0)

Option B uses the filter() method with the col() function from pyspark.sql.functions, which is the standard and recommended way to reference column names programmatically.

Why the Other Options Are Incorrect:

  • A. spark_df[spark_df["price"] > 0] resembles pandas syntax. PySpark does not support direct indexing using boolean conditions in square brackets.

  • C. SELECT * FROM spark_df WHERE price > 0 is an SQL query. While valid in Spark SQL, it cannot be used directly on a DataFrame object without first registering the DataFrame as a temporary view (a short sketch of this appears at the end of this explanation).

  • D & E. Both use .loc[], which is strictly a pandas method and not available on PySpark DataFrames.

In summary, PySpark requires the use of .filter() or .where() with column expressions for conditional row selection. Option B demonstrates this correctly and is the only valid PySpark solution among the given choices.
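As a side note on option C, the SQL form does become usable once the DataFrame is registered as a temporary view. A minimal sketch, assuming an active SparkSession named spark:

# Register the DataFrame as a temporary view so Spark SQL can query it
spark_df.createOrReplaceTempView("spark_df")

# The query from option C is now valid and produces the same result as .filter()
filtered_sql_df = spark.sql("SELECT * FROM spark_df WHERE price > 0")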

Question No 3:

A health organization is working on developing a machine learning classification model to detect whether patients are currently infected with a particular disease. The primary goal of the organization is to identify as many actual positive infection cases as possible, even if that means some negative cases might occasionally be misclassified. In this medical context, failing to identify a true positive (i.e., a patient who is infected but not flagged by the model) could lead to serious health consequences, such as delayed treatment and further transmission of the infection.

Given this objective, which classification metric should the organization prioritize for evaluating the performance of their model?

A. Root Mean Squared Error (RMSE)
B. Precision
C. Area under the residual operating curve
D. Accuracy
E. Recall

Correct Answer: E. Recall

Explanation:

In classification problems, the choice of evaluation metric should align with the specific goal of the model. In this scenario, the health organization is most concerned with maximizing the detection of actual infection cases—in other words, correctly identifying patients who truly have the disease.

Recall (also known as Sensitivity or True Positive Rate) measures the proportion of actual positive cases that were correctly identified by the model. It is defined as:

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

A high recall means the model is good at catching most of the actual positive cases, which is critical in medical diagnostics where missing a true case (false negative) can be dangerous.
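As a quick illustration, recall can be computed directly from labels and predictions, for example with scikit-learn (a small sketch using made-up values, where 1 = infected and 0 = not infected):

from sklearn.metrics import recall_score

y_true = [1, 1, 1, 0, 0, 1]   # toy ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1]   # toy model predictions

# 3 of the 4 actual positives were caught: recall = 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))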

Now, let’s understand why the other options are less suitable:

  • A. RMSE: Used for regression problems, not classification. Irrelevant here.

  • B. Precision: Measures how many of the predicted positive cases are truly positive. While useful, it does not prioritize identifying all positives.

  • C. Area under the residual operating curve: This is likely a misstatement of Receiver Operating Characteristic (ROC) curve; the AUC-ROC is helpful but less directly focused on minimizing false negatives than recall.

  • D. Accuracy: Measures overall correctness but can be misleading in imbalanced datasets (e.g., if positives are rare).

Thus, Recall is the most appropriate metric when the priority is to ensure that as many true infection cases as possible are identified.

Question No 4:

In the context of data preprocessing, particularly when dealing with missing values in a dataset, under which of the following conditions is it generally more appropriate to impute missing numerical feature values using the median instead of the mean?

A. When the features represent categorical data types
B. When the features are boolean (True/False) in nature
C. When the numerical features include a significant number of extreme outliers
D. When the numerical features are normally distributed and contain no outliers
E. When there are no missing values in the features

Correct Answer: C. When the numerical features include a significant number of extreme outliers

Explanation:

In data preprocessing, handling missing values is a critical step to ensure model accuracy and reliability. Two of the most common imputation strategies for numerical data are using the mean (average) or the median (middle value) of a feature's distribution. The decision between these two depends significantly on the nature and distribution of the data.

When a numerical feature contains extreme outliers, the mean can be heavily skewed. Outliers have a disproportionate influence on the mean, which can lead to misleading imputations. For instance, in a dataset of incomes, a few extremely high salaries can significantly raise the mean, even if the majority of values are much lower. In such scenarios, imputing missing values with the mean can distort the feature’s true central tendency.

On the other hand, the median is a robust statistic that is less sensitive to outliers. It represents the midpoint of the data, ensuring that exactly half of the non-missing values are above it and half are below. This makes the median a more reliable estimator of central tendency in skewed or outlier-heavy distributions.
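A small numeric sketch (with made-up income values) shows how a single outlier drags the mean away from the bulk of the data while the median stays put, and how a pandas median imputation would look:

import pandas as pd

# Toy income column with one extreme outlier and one missing value
incomes = pd.Series([30_000, 35_000, 40_000, 45_000, 1_000_000, None])

print(incomes.mean())    # 230,000 — pulled upward by the outlier
print(incomes.median())  # 40,000 — unaffected by the outlier

# Impute the missing value with the (robust) median
imputed = incomes.fillna(incomes.median())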

Therefore, when numerical features contain a lot of extreme outliers, using the median to impute missing values provides a more accurate and stable representation of the data, which ultimately leads to better model performance. In contrast, if the data is normally distributed without significant outliers, the mean might suffice or even perform slightly better due to its statistical properties in Gaussian distributions.

Thus, understanding the distribution of your data is crucial before selecting an imputation strategy.

Question No 5:

A data scientist is working with a dataset that contains several missing values across different feature variables. To handle this issue, the scientist chooses to impute the missing values by replacing them with the median value of each respective feature. However, a colleague raises a concern, suggesting that this imputation strategy may discard valuable information that could potentially improve model performance.

To address this concern, which of the following actions would help the data scientist retain and incorporate as much information as possible about the original missing values into the feature set?

A. Replace the missing values with the mean value instead of the median
B. Avoid imputation altogether and rely on the algorithm to manage missing data
C. Eliminate all features that originally had missing values
D. Add a binary indicator variable for each feature with missing data, flagging whether each value was imputed
E. Add a constant feature for each column with missing data, containing the percentage of missing values in that feature

Correct Answer: D. Add a binary indicator variable for each feature with missing data, flagging whether each value was imputed.

Explanation:

When handling missing data, it's important not only to fill in the gaps but also to retain as much information as possible about the missingness itself, as it can be informative. The most effective approach among the options listed is to create a binary indicator variable (Option D) that flags whether a value was originally missing before imputation.

This strategy is known as missingness indicator imputation. It allows the model to learn patterns from the fact that a value was missing, which can sometimes be a predictive signal itself. For example, in medical data, missing values might correlate with specific diagnoses or treatments, and ignoring this could reduce model accuracy.
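A minimal pandas sketch of this pattern, assuming a DataFrame with a numeric column named "age" that contains missing values (the names are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34, np.nan, 52, np.nan, 41]})

# 1. Flag which rows were originally missing (the "missingness indicator")
df["age_was_missing"] = df["age"].isna().astype(int)

# 2. Impute the missing values with the column median
df["age"] = df["age"].fillna(df["age"].median())

For what it's worth, scikit-learn's SimpleImputer offers an add_indicator=True option that performs both steps in a single transformer.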

Replacing missing values with the mean (A) instead of the median doesn’t offer more information—it merely changes the central tendency used for imputation. Not imputing values (B) and relying on the algorithm can be problematic, as many machine learning models (like linear regression or SVMs) cannot handle missing data directly. Removing features (C) reduces dimensionality but may discard valuable predictors. Option E, creating a constant column with the percentage of missing values, doesn't add row-level information and is unlikely to help the model.

In summary, adding a binary indicator variable per feature with missing values is the most informative and model-friendly approach, preserving both the imputed value and the knowledge of its original absence.

Question No 6:

A data scientist is working with a PySpark DataFrame named spark_df and wants to examine summary statistics for all numerical columns in the dataset. The goal is to retrieve key statistics such as count, mean, standard deviation, minimum, maximum, and the interquartile range (IQR) for each numerical feature.

Which of the following lines of code will best help the data scientist accomplish this?

A. spark_df.summary()
B. spark_df.stats()
C. spark_df.describe().head()
D. spark_df.printSchema()
E. spark_df.toPandas()

Correct Answer: A. spark_df.summary()

Explanation:

In PySpark, when a data scientist wants to analyze summary statistics of a DataFrame, the summary() method is the most comprehensive built-in option. The method spark_df.summary() returns a DataFrame that includes the count, mean, standard deviation, minimum, maximum, as well as 25%, 50% (median), and 75% percentiles for all numeric columns—essentially covering the interquartile range (IQR).
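For example (a small sketch, assuming spark_df is an existing DataFrame with numeric columns):

# Full set of statistics, including the 25%/50%/75% percentiles needed for the IQR
spark_df.summary().show()

# summary() also accepts specific statistics if only the quartiles are of interest
spark_df.summary("25%", "50%", "75%").show()

# describe() returns only count, mean, stddev, min, and max — no percentiles
spark_df.describe().show()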

Breakdown of Each Option:

  • A. spark_df.summary()
    Correct. This method provides a wide range of statistics including IQR (via percentiles), which is not available in describe(). It returns a DataFrame with metrics such as count, mean, stddev, min, 25%, 50%, 75%, and max.

  • B. spark_df.stats()
    Incorrect. There is no method named stats() on PySpark DataFrames; calling it would raise an AttributeError.

  • C. spark_df.describe().head()
    Partially correct. describe() provides count, mean, stddev, min, and max, but does not include IQR or percentiles. Additionally, .head() fetches only the first row, which limits usefulness.

  • D. spark_df.printSchema()
    Incorrect. This only prints the schema (column names and types) and provides no statistics.

  • E. spark_df.toPandas()
    Incorrect. This converts the Spark DataFrame to a Pandas DataFrame, which may be impractical for large datasets and does not directly compute statistics.

For detailed summary statistics, including IQR, the summary() method is the best built-in tool in PySpark.

Question No 7:

An organization is in the process of building a centralized feature repository that will be used across multiple machine learning projects. As part of the repository design, the team has decided to apply one-hot encoding to all categorical variables ahead of time, before model training. However, a data scientist raises a concern, recommending not to perform one-hot encoding within the repository and instead apply such transformations later in the model development phase.

Which of the following best explains the reasoning behind the data scientist’s recommendation?

A. One-hot encoding is not supported by most machine learning libraries.
B. One-hot encoding is dependent on the target variable’s values, which vary across applications.
C. One-hot encoding is computationally intensive and should only be used on small training samples.
D. One-hot encoding is not a widely accepted method for encoding categorical data.
E. One-hot encoding can be problematic for some machine learning algorithms and reduces flexibility across use cases.

Correct Answer: E. One-hot encoding can be problematic for some machine learning algorithms and reduces flexibility across use cases.

Explanation:

When building a centralized feature repository, it is essential to maintain flexibility and reusability across different machine learning models, tasks, and domains. One-hot encoding, while a common and useful technique for converting categorical variables into numerical format, is not always optimal or universally applicable.

Some algorithms, such as tree-based models (e.g., decision trees, random forests, gradient boosting machines), often perform better when categorical variables are left in their original form or encoded using target encoding, ordinal encoding, or embedding methods. These approaches preserve relationships between categories or reduce dimensionality, which one-hot encoding can obscure or exacerbate.

Additionally, applying one-hot encoding in the repository creates rigid feature structures. Each time a new category is encountered or if the dataset distribution changes across use cases, the encoding would need to be updated, risking inconsistency, data leakage, or high sparsity in features. This undermines the repository's goal of generality.

Moreover, the number of features increases significantly with one-hot encoding, especially for high-cardinality categorical variables. This adds computational overhead and may reduce model performance, particularly for algorithms sensitive to feature dimensionality.

By postponing one-hot encoding to the model-specific preprocessing stage, data scientists preserve the flexibility to choose the encoding method best suited for their specific model and target variable. This aligns with best practices in machine learning pipeline design, where data transformations are tailored to the modeling context.
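As an illustration of this separation, each project can bundle the encoding it needs into its own model pipeline rather than baking it into the shared repository. A minimal scikit-learn sketch, where the column names and estimator are purely hypothetical:

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# The feature repository serves raw categorical columns; the encoding is chosen per model.
preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["city", "device_type"]),
    ],
    remainder="passthrough",
)

model = Pipeline(steps=[("preprocess", preprocess), ("clf", LogisticRegression())])
# model.fit(X_train, y_train)  # fitting happens in the project-specific training code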

Thus, the data scientist's suggestion is justified by the fact that one-hot encoding limits algorithm compatibility and reduces the adaptability of the feature repository, making Option E the correct choice.

Question No 8:

A data scientist is working with a housing dataset to predict home prices using linear regression. Two separate models have been developed:

  • Model A uses price (in dollars) as the target (label) variable.

  • Model B uses log(price) (the natural logarithm of the price) as the target variable.

To evaluate performance, the data scientist compares the predicted values from both models against the actual price values using Root Mean Squared Error (RMSE). After doing this, the data scientist notices that Model B, which was trained on log(price), produces a much higher RMSE than Model A.

Given this observation, which of the following explanations is invalid—i.e., cannot account for the discrepancy in RMSE values between the two models?

A. Model B is actually more accurate than Model A.
B. The data scientist forgot to exponentiate Model B's predictions before comparing them to actual prices.
C. The data scientist incorrectly took the log of Model A's predictions before computing RMSE.
D. Model A is genuinely more accurate than Model B.
E. RMSE is not a valid metric for evaluating regression models.

Correct Answer: E. RMSE is not a valid metric for evaluating regression models.

Explanation:

Root Mean Squared Error (RMSE) is a widely accepted and valid metric for evaluating the accuracy of regression models. It measures the average magnitude of the errors between predicted and actual values, giving higher weight to larger errors. Therefore, option E is an invalid explanation—RMSE is valid for regression tasks, provided it is used correctly.

Now, let's consider the situation described:

  • Model A directly predicts price, so computing RMSE between predictions and actual prices is appropriate.

  • Model B predicts log(price), which must be exponentiated (using exp(prediction)) to return to the price scale before RMSE can be computed against actual prices.

If the data scientist forgot to exponentiate Model B's predictions, the values would remain in the logarithmic scale, while the actual prices are in the raw dollar scale. Comparing predictions and actuals on mismatched scales would artificially inflate RMSE, making it seem like Model B performs worse than it actually does. This is the likely explanation, making option B valid.
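Concretely, the fix for the scenario in option B is a single transformation before scoring. A brief numpy sketch, with illustrative values only:

import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

actual_prices = np.array([95_000, 180_000, 360_000])
log_price_preds = np.array([11.5, 12.1, 12.8])   # Model B's predictions in log space

price_preds = np.exp(log_price_preds)            # back to the dollar scale
print(rmse(actual_prices, price_preds))          # comparable to Model A's RMSE
print(rmse(actual_prices, log_price_preds))      # wrong: mixes scales and inflates RMSE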

If the data scientist mistakenly took the log of Model A’s predictions before computing RMSE, that would also distort the evaluation, as the prediction and actual values would again be on different scales. This makes option C also valid.

It’s entirely possible that Model A is simply more accurate than Model B (option D), or vice versa (option A), though RMSE wouldn’t show that clearly unless both models are evaluated on the same scale.

Hence, the only truly invalid explanation is that RMSE is not suitable at all—it is suitable if applied properly.

Question No 9:

A data scientist is working on a regression task and applies 3-fold cross-validation to evaluate and fine-tune the model's hyperparameters. In this process, the dataset is split into three parts, and the model is trained on two folds while validated on the third, rotating across all folds. After performing the 3-fold cross-validation, the root-mean-squared error (RMSE) values obtained from the three validation sets are as follows:

  • Fold 1 RMSE: 10.0

  • Fold 2 RMSE: 12.0

  • Fold 3 RMSE: 17.0

Based on these results, what is the overall cross-validation RMSE that should be used to evaluate the model’s performance?

Choose the correct answer:

A. 13.0
B. 17.0
C. 12.0
D. 39.0
E. 10.0

Correct Answer: A. 13.0

Explanation:

Cross-validation is a widely used model evaluation method, especially in scenarios involving limited data. In k-fold cross-validation, the dataset is divided into k equally sized folds. The model is trained on k–1 of these folds and validated on the remaining one. This process is repeated k times, with each fold used exactly once for validation. The average performance metric across all folds is then used as the final evaluation score.

In this problem, 3-fold cross-validation was applied, and the RMSE scores for each fold were: 10.0, 12.0, and 17.0. RMSE is a common metric for regression problems, measuring the average magnitude of the prediction error.

To compute the overall cross-validation RMSE, we take the mean of the RMSE values from all folds:

$$\text{Overall RMSE} = \frac{10.0 + 12.0 + 17.0}{3} = \frac{39.0}{3} = 13.0$$

This average gives a more stable estimate of the model’s generalization performance, accounting for variations in the training/validation splits.
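In practice this averaging is exactly what is done with the per-fold scores, whether by hand or via a library helper. A brief sketch:

import numpy as np

fold_rmses = [10.0, 12.0, 17.0]
print(np.mean(fold_rmses))  # 13.0 — the cross-validation RMSE

# With scikit-learn, the same per-fold RMSE values can be obtained and averaged via
# cross_val_score(estimator, X, y, cv=3, scoring="neg_root_mean_squared_error"),
# where X, y, and the estimator are placeholders for the project's own data and model.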

Common Misconceptions:

  • Choosing 17.0 assumes the worst-case error is representative, which is incorrect in cross-validation.

  • Summing RMSEs to get 39.0 without averaging (option D) is a mathematical error.

  • Picking 10.0 or 12.0 focuses only on specific folds, ignoring the complete picture.

Thus, option A (13.0) correctly reflects the average model performance across all folds.
