
Certified Generative AI Engineer Associate Databricks Practice Test Questions and Exam Dumps
A Generative AI Engineer has developed a RAG (Retrieval-Augmented Generation) application to answer questions from users about a series of fantasy novels. The novel texts are chunked and embedded into a vector store, along with metadata like page number, chapter number, and book title. These chunks are retrieved based on the user’s query and used to generate responses via a language model (LLM). The engineer has used intuition to select the chunking strategy and configuration but now wishes to systematically optimize these parameters. What two strategies should the Generative AI Engineer implement to optimize the chunking strategy and parameters? (Choose two.)
A. Experiment with different embedding models and compare their performance in generating answers.
B. Introduce a query classifier that predicts the relevant book for the user's query, and use this prediction to filter the retrieval process.
C. Select an appropriate evaluation metric, such as recall or NDCG, and test different chunking strategies, such as splitting by paragraphs or chapters. Optimize the chunking strategy based on the best-performing metric.
D. Pass known questions with their correct answers to an LLM and ask the LLM to suggest the ideal token count. Use summary statistics like the mean or median token counts to select the chunk size.
E. Develop an LLM-based metric to evaluate how well each chunk answers user queries. Use this metric to optimize chunking parameters.
Correct Answers: B and C
When optimizing a chunking strategy for a RAG (Retrieval-Augmented Generation) system, it’s essential to take a methodical approach to improve the accuracy and efficiency of the retrieval and response generation. The two strategies outlined below are key to achieving this optimization:
Option B: By creating a query classifier that predicts the most relevant book based on the user's question, the system can narrow down the search space. This prevents irrelevant books from being considered, thus improving retrieval performance. For example, if a user asks about a specific character or event from the third book in a series, the classifier can directly target the third book, improving the relevance of the retrieved chunks and reducing the time spent searching. This method is particularly useful for large datasets like novels that span multiple books or volumes, where not all chunks are relevant to every query.
Option C: Evaluating the performance of various chunking strategies using metrics like recall (how many relevant documents are retrieved) or NDCG (Normalized Discounted Cumulative Gain, which evaluates ranking quality) allows the engineer to quantify and compare how different chunking strategies impact retrieval performance. By experimenting with strategies like splitting by paragraph, chapter, or other logical divisions in the text, the engineer can fine-tune the system for optimal retrieval and answer generation. The metric-driven approach ensures that the best chunking strategy is chosen based on measured performance rather than intuition.
Option A focuses on embedding models; swapping the embedding model may improve retrieval quality overall, but it does not directly address how the documents are chunked.
Option D is a less systematic approach to chunking size, as it relies on the LLM to estimate token counts rather than testing based on measurable outcomes.
Option E suggests using an LLM to judge chunk relevance, but it may not be as reliable or objective as using established evaluation metrics (like recall or NDCG).
By following Options B and C, the engineer can improve both the relevance and efficiency of the retrieval process, leading to better response generation.
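To make the metric-driven approach in Option C concrete, the sketch below computes recall@k for a retriever built from a given chunking strategy against a small hand-labeled evaluation set. The query/chunk IDs and the stand-in retriever are purely illustrative; in practice the retriever would be a vector-store search over chunks produced by each candidate strategy (paragraphs, chapters, fixed token windows, and so on).

```python
from typing import Callable, Dict, List

def recall_at_k(
    eval_set: Dict[str, List[str]],             # query -> IDs of chunks labeled as relevant
    retrieve: Callable[[str, int], List[str]],  # returns IDs of the top-k retrieved chunks
    k: int = 5,
) -> float:
    """Average fraction of the labeled relevant chunks that appear in the top-k results."""
    scores = []
    for query, relevant_ids in eval_set.items():
        retrieved = set(retrieve(query, k))
        hits = sum(1 for chunk_id in relevant_ids if chunk_id in retrieved)
        scores.append(hits / len(relevant_ids))
    return sum(scores) / len(scores)

# Tiny hand-labeled evaluation set (illustrative only).
labeled_queries = {
    "Who forges the silver blade?": ["book1-ch03-p2"],
    "Where is the northern citadel?": ["book2-ch07-p1", "book2-ch07-p4"],
}

# Stand-in for a real retriever backed by a vector store built from paragraph-level chunks.
def paragraph_retriever(query: str, k: int) -> List[str]:
    return ["book1-ch03-p2", "book2-ch07-p4", "book3-ch01-p9"][:k]

print("recall@5, paragraph chunking:", recall_at_k(labeled_queries, paragraph_retriever, k=5))
```

Repeating the same measurement for each candidate strategy (and for NDCG, which additionally rewards ranking relevant chunks higher) gives an objective basis for the final chunking choice.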
A Generative AI Engineer is building a Retrieval-Augmented Generation (RAG) application to answer user questions about the technical regulations of a sport the user is learning. What are the correct steps to build and deploy this RAG application?
A. Ingest documents from a source → Index the documents and save to Vector Search → User submits queries against an LLM → LLM retrieves relevant documents → Evaluate model → LLM generates a response → Deploy it using Model Serving.
B. Ingest documents from a source → Index the documents and save to Vector Search → User submits queries against an LLM → LLM retrieves relevant documents → LLM generates a response → Evaluate model → Deploy it using Model Serving.
C. Ingest documents from a source → Index the documents and save to Vector Search → Evaluate model → Deploy it using Model Serving.
D. User submits queries against an LLM → Ingest documents from a source → Index the documents and save to Vector Search → LLM retrieves relevant documents → LLM generates a response → Evaluate model → Deploy it using Model Serving.
Correct Answer: B
In a Retrieval-Augmented Generation (RAG) application, the key steps involve integrating retrieval and generation models to produce accurate responses to user queries. Below is a breakdown of the correct steps and why they are necessary:
The first step is to ingest relevant documents that provide the necessary context for the queries users might ask. These documents could be regulations, rulebooks, or guides related to the sport being learned.
Once the documents are ingested, they need to be indexed in a way that allows efficient retrieval. This is typically done by transforming the documents into vector representations (using techniques like embeddings), enabling quick searches for relevant information. These vectors are stored in a Vector Search system (such as Elasticsearch or Faiss) for fast retrieval during query processing.
When a user submits a question, the system passes the user input to the LLM (Large Language Model) application.
The LLM retrieves relevant documents from the vector store by comparing the query’s vector to those in the database. This step ensures that the LLM accesses the most pertinent information to answer the user’s question.
The retrieved documents are then used by the LLM to generate a coherent and accurate response. The combination of retrieval and generation helps improve the quality and relevance of the response compared to a purely generative approach.
Once the system is generating answers, it should be evaluated for accuracy and relevance. This may involve manual review or automated metrics like precision, recall, or F1 score.
Finally, after evaluation, the model is deployed for production use, typically using a model-serving platform (such as TensorFlow Serving, FastAPI, or custom solutions) to make the application accessible to users.
Why Option B is Correct:
Option B provides a logical and complete sequence for deploying a RAG application. Document ingestion, indexing, query processing, model evaluation, and deployment are all covered in the correct order. This step-by-step process ensures the system is built methodically and effectively.
Other options, such as A, C, and D, either miss steps or present them in an incorrect order, making them incomplete or inaccurate.
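For a concrete picture of the Option B sequence, here is a minimal sketch of the same steps using sentence-transformers embeddings and a FAISS index (FAISS being one of the vector stores mentioned above). The documents, model name, and the commented-out LLM call are placeholders, not part of the question.

```python
import faiss
from sentence_transformers import SentenceTransformer

# 1. Ingest documents from a source (here, a toy set of rulebook passages).
documents = [
    "Rule 4.2: A player may not touch the ball with both hands outside the marked zone.",
    "Rule 7.1: Substitutions are limited to three per match.",
]

# 2. Index the documents and save them to a vector store.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents).astype("float32")
index = faiss.IndexFlatL2(doc_vectors.shape[1])
index.add(doc_vectors)

# 3. A user submits a query; 4. the most relevant documents are retrieved.
query = "How many substitutions are allowed per match?"
query_vector = embedder.encode([query]).astype("float32")
_, hits = index.search(query_vector, 1)
context = "\n".join(documents[i] for i in hits[0])

# 5. The LLM generates a response grounded in the retrieved context.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# response = llm.invoke(prompt)   # any chat/completions client would be called here

# 6. Evaluate the model offline (e.g. retrieval recall, answer accuracy), then
# 7. deploy the chain behind a serving endpoint such as Databricks Model Serving.
```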
A Generative AI Engineer has deployed a Large Language Model (LLM) application for customer service at a digital marketing company, assisting with answering customer inquiries. Which metric should they monitor to evaluate the performance of their LLM application in production?
A. Number of customer inquiries processed per unit of time
B. Energy usage per query
C. Final perplexity scores from the model’s training phase
D. HuggingFace Leaderboard values for the base LLM
Correct Answer: A
When deploying a Large Language Model (LLM) in a customer service application, the primary goal is to ensure that the system is efficient and responsive, while maintaining high-quality answers. Therefore, it is essential to focus on metrics that directly relate to production performance and user experience.
Option A (number of customer inquiries processed per unit of time): This metric is crucial for evaluating the operational performance of the LLM in a production environment. It measures the throughput of the system, or how many customer inquiries the system can handle within a specific time frame. Monitoring this metric ensures that the system can scale and meet the demand for customer service without delays. High throughput typically indicates that the application is functioning effectively and efficiently, while also providing insights into potential bottlenecks in the processing pipeline.
Option B (energy usage per query): Although energy usage is an important consideration for sustainability, it is not a primary metric for evaluating the performance of an LLM in a customer service setting; the direct focus in this context should be on response speed and quality rather than energy consumption.
Option C (final perplexity scores from the model’s training phase): Perplexity is a metric used during the training phase to measure how well a model predicts the next word in a sequence. However, it is not directly relevant for monitoring LLM performance in a production environment. Once the model is trained, the focus should shift to real-time performance metrics such as response quality, latency, and throughput.
Option D (HuggingFace Leaderboard values for the base LLM): Leaderboards are useful for evaluating models during research and development stages, but they do not provide actionable insights for production-level applications. Performance in real-world use cases depends on how the model interacts with actual customer data, not its placement on a leaderboard.
In conclusion, Option A, which tracks the number of inquiries processed per unit of time, is the most practical metric to monitor the operational efficiency of an LLM deployed in a customer service context.
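As a sketch of how this metric can be tracked, the snippet below computes inquiries handled per minute from request timestamps; the DataFrame schema and values are made up for illustration, and in production the same aggregation would run over the endpoint's request logs.

```python
import pandas as pd

requests = pd.DataFrame({
    "request_id": [1, 2, 3, 4],
    "received_at": pd.to_datetime([
        "2024-05-01 09:00:05", "2024-05-01 09:00:40",
        "2024-05-01 09:01:10", "2024-05-01 09:01:55",
    ]),
})

# Count inquiries per one-minute window (throughput).
per_minute = (
    requests.set_index("received_at")
    .resample("1min")["request_id"]
    .count()
    .rename("inquiries_per_minute")
)
print(per_minute)
```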
A Generative AI Engineer is tasked with developing a system that suggests the best matched team member for newly scoped projects. The team member should be selected based on project date availability and the alignment of their employee profile with the project scope. Both the employee profiles and project scopes are unstructured text. How should the Generative AI Engineer architect the system to achieve this?
A. Create a tool to find available team members based on project dates. Embed all project scopes into a vector store, and perform retrieval using team member profiles to find the best match.
B. Create a tool to find team member availability based on project dates, and another tool that uses an LLM to extract keywords from project scopes. Iterate through available team member profiles and perform keyword matching to identify the best fit.
C. Create a tool to find available team members based on project dates. Create a second tool that calculates a similarity score between team member profiles and project scopes. Rank team members by the best similarity score to select the optimal match.
D. Create a tool to find available team members based on project dates. Embed team member profiles into a vector store and use project scope and filtering to perform retrieval to find the best-matched available team members.
Correct Answer: D
In this scenario, the goal is to develop a system that suggests the most suitable employee for a project based on both the team member’s availability and how well their profile matches the project’s scope. Since both the employee profiles and project scopes are unstructured text, the solution should incorporate techniques for managing and comparing these text-based inputs effectively.
Option A: Embedding only the project scopes and then retrieving with team member profiles as the queries inverts the natural direction of the search: each newly scoped project should be the query run against the pool of candidate profiles. It also offers no straightforward way to restrict retrieval to team members who are actually available for the project dates, so availability and profile alignment are never combined in a single step.
Option B: This approach uses keyword matching after extracting keywords from the project scope, which could work in some contexts. However, keyword-based matching is too simplistic to capture semantic similarity; it can miss subtle matches or context in either the project scope or the team member profiles.
Option C: Computing a similarity score between profiles and scopes is a step in the right direction, but iterating through every profile to score it does not scale; the system still needs an efficient retrieval mechanism, such as a vector index, for matching team member profiles against projects.
Option D: This option provides the most comprehensive and efficient solution. By embedding the team member profiles into a vector store, the system can use the project scope as the query and leverage vector-based retrieval to quickly find the best matches, while filtering the candidates down to those who are available for the project dates.
In conclusion, Option D provides a robust architecture that utilizes vector embeddings and retrieval techniques to match available team members with the best-fitting projects based on both profile relevance and availability.
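A rough sketch of the Option D architecture on Databricks Vector Search is shown below, assuming the team member profiles have already been synced into an index that computes embeddings from a text column (so query_text can be used directly). The endpoint, index, and column names, as well as the availability tool, are hypothetical.

```python
from databricks.vector_search.client import VectorSearchClient

def find_available_members(project_start: str, project_end: str) -> list[str]:
    # Tool 1 (assumed): in practice this would query a staffing/calendar table;
    # hard-coded here for illustration.
    return ["E102", "E217", "E355"]

def match_team_members(project_scope: str, project_start: str, project_end: str, k: int = 5):
    available_ids = find_available_members(project_start, project_end)

    index = VectorSearchClient().get_index(
        endpoint_name="staffing_endpoint",                    # hypothetical
        index_name="hr.matching.team_member_profiles_index",  # hypothetical
    )
    # Tool 2: retrieve the profiles most similar to the project scope,
    # restricted to team members who are actually available.
    return index.similarity_search(
        query_text=project_scope,
        columns=["employee_id", "profile_summary"],
        filters={"employee_id": available_ids},
        num_results=k,
    )
```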
A Generative AI Engineer is developing a real-time, LLM-powered sports commentary platform. The goal of the platform is to deliver up-to-the-minute analyses and summaries of live games to users, replacing traditional static news articles that may be outdated. To achieve this, the platform must be able to access and utilize live game data (e.g., scores, player stats) in real time to inform the LLM-generated commentary.
Which tool would enable the platform to retrieve and serve real-time data to the language model for analysis and summary generation?
A. DatabricksIQ
B. Foundation Model APIs
C. Feature Serving
D. AutoML
Correct Answer: C
In real-time AI applications, particularly those involving continuous updates and insights—such as live sports commentary—real-time data access is crucial. The LLM needs to generate relevant summaries and analyses based on the most recent game events (e.g., score changes, key plays, player performance). To achieve this, the system must supply current data to the model at inference time.
Feature Serving is a mechanism in machine learning infrastructure that allows you to serve fresh, low-latency features to models during inference. It is commonly used in production ML systems to provide real-time context (e.g., the current game score, player stats, time remaining) so the model can make accurate predictions or generate relevant content.
In this case, Feature Serving would be used to serve live game data to the LLM just before or during the inference step. This ensures the generated commentary or analysis is grounded in the most up-to-date game state, making the insights timely and useful to users.
A. DatabricksIQ: The AI engine that powers assistant and search experiences across the Databricks platform; it is not a mechanism for serving real-time features to a model at inference time.
B. Foundation Model APIs: These provide access to large models but don’t inherently handle real-time data integration.
D. AutoML: AutoML helps automate model training and tuning but is not designed for real-time data delivery or inference integration.
To build a responsive, LLM-based live sports commentary platform, Feature Serving is the most suitable tool for ensuring the LLM receives current, real-time game data, enabling accurate and timely analyses for end users.
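The snippet below sketches how live features could be pulled from a (hypothetical) Databricks Feature Serving endpoint at inference time and folded into the commentary prompt; the endpoint name, lookup key, and response handling are placeholders and would need to match the actual endpoint configuration.

```python
import os
import requests

WORKSPACE_URL = os.environ["DATABRICKS_HOST"]   # e.g. https://<workspace>.cloud.databricks.com
TOKEN = os.environ["DATABRICKS_TOKEN"]

def fetch_live_game_features(game_id: str) -> dict:
    """Look up the freshest features (score, clock, player stats) for one game."""
    response = requests.post(
        f"{WORKSPACE_URL}/serving-endpoints/live-game-features/invocations",  # hypothetical endpoint
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"dataframe_records": [{"game_id": game_id}]},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()   # exact shape depends on the endpoint's output schema

features = fetch_live_game_features("2024-06-12-BOS-DAL")
# Whatever fields come back (current score, time remaining, key plays, ...) are formatted
# into the prompt that is then sent to the LLM serving endpoint for commentary generation.
print(features)
```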
A Generative AI Engineer is operating a Retrieval-Augmented Generation (RAG) application that utilizes a provisioned throughput model serving endpoint. To monitor and audit the requests made to the model and the responses generated, the current setup includes a custom microservice positioned between the user interface and the serving endpoint, which logs these interactions to a remote server.
To simplify the architecture and avoid maintaining this additional microservice, which native Databricks feature can automatically log incoming requests and outgoing responses at the serving endpoint for observability and troubleshooting?
A. Vector Search
B. Lakeview
C. DBSQL
D. Inference Tables
Correct Answer: D
In Databricks, Inference Tables provide a native, fully managed logging mechanism designed specifically for model serving endpoints. When a model is deployed using the Databricks Model Serving feature (especially with provisioned throughput), Inference Tables can automatically capture:
Input requests: The exact prompt, parameters, or query sent to the model.
Model responses: The output generated by the model.
Metadata: Timestamps, latency, status codes, and endpoint performance.
This capability is critical in production AI applications like RAG systems, where monitoring the input-output behavior of the model is important for debugging, auditing, performance analysis, and improvement of prompts or chunking strategies.
Using Inference Tables removes the need for building and managing an external logging microservice. The data is logged directly to a Delta table, making it easy to query, analyze, and visualize using familiar Databricks tools like notebooks or SQL.
A. Vector Search: Used for indexing and retrieving embeddings, not request/response logging.
B. Lakeview: Databricks' dashboarding tool; it does not handle model logging or request tracing.
C. DBSQL (Databricks SQL): It enables querying structured data in the lakehouse, but it's not designed for logging model interactions.
To monitor and log interactions with a model serving endpoint in a RAG application, Inference Tables are the ideal Databricks-native solution—automatically capturing both incoming requests and outgoing responses without additional infrastructure.
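Once payload logging is enabled on the endpoint, the captured requests and responses can be inspected directly from a notebook. A minimal sketch, assuming the inference table was configured as catalog.schema.rag_endpoint_payloads (the table name, and some column names, depend on the endpoint configuration and workspace version):

```python
# `spark` and `display` are available by default in a Databricks notebook.
recent = spark.sql("""
    SELECT *
    FROM catalog.schema.rag_endpoint_payloads   -- hypothetical inference table name
    ORDER BY timestamp_ms DESC                  -- column name per the classic payload schema
    LIMIT 20
""")
display(recent)
```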
A Generative AI Engineer is working on enhancing the quality of a Retrieval-Augmented Generation (RAG) system. Recently, the system has been generating inflammatory or offensive responses, which are negatively affecting user experience and trust. To mitigate this issue and improve response safety and reliability, which of the following actions would be the most effective?
A. Increase the frequency of upstream data updates
B. Inform users about how the RAG system behaves
C. Restrict access to the data sources to a limited number of users
D. Properly curate upstream data, including manual review, before feeding it into the RAG system
Correct Answer: D
In a RAG (Retrieval-Augmented Generation) system, responses generated by a large language model (LLM) are heavily influenced by the retrieved context—usually drawn from an indexed vector database that contains documents or chunks of text. If the source data is offensive, biased, or inflammatory, the LLM may reproduce or amplify this behavior in its responses.
To effectively mitigate offensive outputs, the best solution is to curate the upstream data—the content that is ingested, embedded, and stored in the vector store. Manual review of documents prior to ingestion ensures that inappropriate or harmful content is either corrected or excluded altogether. This greatly reduces the chance that such content will be retrieved and used to influence the LLM’s generation process.
A. Increase the frequency of upstream data updates: This ensures freshness of data but does not address data quality or harmful content.
B. Inform users of expected RAG behavior: Transparency is good, but it doesn't solve the problem of generating offensive outputs—it only warns users after the fact.
C. Restrict access to data sources: While limiting access may reduce exposure, it does not correct the underlying issue of bad content influencing model responses.
Manual curation of source data is the most proactive and reliable method to ensure that a RAG system avoids producing harmful or inflammatory text. This safeguards the system’s integrity and maintains user trust.
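One illustrative way to enforce this gate in a pipeline is to join candidate documents against a manual-review table and only pass approved content on to chunking and embedding. The table names, columns, and the downstream indexing step below are hypothetical.

```python
from pyspark.sql import functions as F

# `spark` is available by default in a Databricks notebook.
raw_docs = spark.table("kb.raw_documents")   # hypothetical: candidate source documents
reviews = spark.table("kb.manual_review")    # hypothetical: doc_id, review_status, reviewer, notes

approved_docs = (
    raw_docs.join(reviews, on="doc_id", how="inner")
            .filter(F.col("review_status") == "approved")
            .select(raw_docs["*"])
)

# Only reviewed and approved documents are chunked, embedded, and written to the vector store.
# embed_and_index(approved_docs)   # placeholder for the downstream ingestion step
```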
A Generative AI Engineer is building a Retrieval-Augmented Generation (RAG) application powered by a large language model (LLM). The application’s retriever has pre-processed its documents into chunks of 512 tokens maximum. The engineer prioritizes low cost and low latency over the highest possible output quality, and must now choose an appropriate model with a suitable context length and model size.
Given these requirements, which configuration is most suitable?
A. Context length: 514 tokens; Model size: 0.44GB; Embedding dimension: 768
B. Context length: 2048 tokens; Model size: 11GB; Embedding dimension: 2560
C. Context length: 32,768 tokens; Model size: 14GB; Embedding dimension: 4096
D. Context length: 512 tokens; Model size: 0.13GB; Embedding dimension: 384
Correct Answer: D
When building a cost- and latency-sensitive RAG application, the choice of model must be tightly aligned with the document chunk size, application requirements, and resource constraints. Since the documents are chunked into 512 tokens, there's no need for a model with a much longer context window.
Context length defines how many tokens the model can process at once. A 512-token context length exactly matches the document chunk size, ensuring that the model can process each chunk in one pass without truncation or waste.
Model size (0.13GB) is very small, which translates into lower memory usage, faster inference, and significantly lower costs, particularly important in real-time or high-throughput environments.
Embedding dimension (384) is on the lower end but acceptable, especially since quality is not the priority.
By contrast:
Option A (514 tokens) slightly exceeds the chunk size and offers higher embedding dimensionality but at a higher cost (larger model).
Options B and C have longer context lengths and larger models, leading to higher latency and cost—which contradicts the application’s requirements.
Option D offers the best fit for a latency- and cost-sensitive application with 512-token chunks. It efficiently balances performance and resource usage, meeting the engineer’s priorities.
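A quick way to validate the pairing of chunk size and context length is to tokenize the pre-processed chunks with the candidate model's tokenizer and confirm nothing exceeds the window. The model name below is only an example of a small encoder with a 512-token limit, and the chunks list is a placeholder for the retriever's real output.

```python
from transformers import AutoTokenizer

CONTEXT_LENGTH = 512
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

chunks = ["...first chunk of at most 512 tokens...", "...second chunk..."]  # placeholders
token_counts = [len(tokenizer.encode(chunk)) for chunk in chunks]

assert max(token_counts) <= CONTEXT_LENGTH, "a chunk exceeds the model's context window"
print(f"max tokens per chunk: {max(token_counts)} (limit {CONTEXT_LENGTH})")
```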
A small startup operating in the cancer research domain wants to build a cost-effective Retrieval-Augmented Generation (RAG) application using Foundation Model APIs. The goal is to ensure the application delivers good-quality answers while minimizing costs, as the startup has limited resources but needs to support research and customer inquiries effectively.
Which strategy is the most appropriate for this startup to achieve its goals?
A. Limit the number of relevant documents available for the RAG application to retrieve from
B. Choose a smaller, domain-specific large language model (LLM)
C. Restrict the number of user queries per day to manage costs
D. Use the largest available general-purpose LLM to guarantee performance
Correct Answer: B
For a small, cost-conscious startup in a highly specialized field like cancer research, optimizing the quality-to-cost ratio is essential when building a RAG (Retrieval-Augmented Generation) system.
Choosing a smaller, domain-specific LLM is the most effective strategy. These models are typically fine-tuned on domain-relevant data, such as biomedical texts or cancer-related literature, which significantly boosts performance for specialized queries even though they may have fewer parameters than large general-purpose models. The reduced size also translates into lower inference costs and faster response times—critical for startups operating on a tight budget.
Option A, limiting the number of retrievable documents, would hinder the RAG system’s ability to find accurate answers, lowering overall quality.
Option C, restricting customer queries, can reduce costs but negatively impacts user experience and accessibility, which is counterproductive for customer-focused solutions.
Option D, using the largest LLM possible, offers general high performance but is cost-prohibitive and often unnecessary when the use case is narrowly focused.
By using a smaller, domain-specific model, the startup can ensure that responses are both accurate and cost-effective, aligning with their technical needs and financial constraints—while maintaining a high-quality customer experience.
A Generative AI Engineer is developing a chatbot application to assist the internal HelpDesk Call Center team in efficiently identifying ticket root causes and resolutions. As part of the project’s planning phase, the engineer is evaluating various internal data sources to feed into the GenAI-powered system. The engineer has narrowed the options to the following datasets:
call_rep_history (Delta Table): Contains representative_id and call_id, used to track metrics like call_duration and call_start_time for measuring agent performance.
transcript_volume (Unity Catalog Volume): Contains audio files (.wav) and text transcripts (.txt) of recorded calls.
call_cust_history (Delta Table): Tracks how frequently customers use the HelpDesk, for billing purposes.
call_detail (Delta Table): Contains a snapshot of all call activity including root_cause and resolution fields; updated hourly. These fields may be blank for ongoing calls.
maintenance_schedule (Delta Table): Lists both historical and future application maintenance/outage windows.
Which TWO data sources are the most suitable for providing helpful context to identify root causes and resolutions of tickets?
(Choose two)
A. call_cust_history
B. maintenance_schedule
C. call_rep_history
D. call_detail
E. transcript_volume
Correct Answers: D and E
To enhance a chatbot’s capability in identifying ticket root causes and resolutions, the selected data sources must provide context-rich, technical, and conversational insights.
call_detail (Delta Table):
This table includes structured data like root_cause and resolution, which directly map to the information the chatbot is designed to extract and use. Even though the fields might be blank for ongoing calls, historical entries offer crucial training and retrieval data. It provides a clear and authoritative reference for common problems and how they were solved—perfect for grounding the chatbot's responses.
transcript_volume (Unity Catalog Volume):
These transcripts are rich in unstructured, real-world conversation data between support agents and users. They offer insights into language patterns, question formats, and resolutions, which are ideal for Retrieval-Augmented Generation (RAG) systems. Embedding these transcripts into a vector store allows the chatbot to retrieve relevant conversational data and mimic human-like support.
A. call_cust_history: Focused on usage and billing, not helpful for resolution or diagnostics.
B. maintenance_schedule: While useful for contextual outages, it does not aid in resolution details unless integrated with ticket timelines.
C. call_rep_history: Focuses on agent performance (duration, timing), not issue resolution.
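As an illustration of how the two selected sources could be prepared, the sketch below keeps only completed call_detail rows (root_cause and resolution populated) and pairs them with the text transcripts stored in the Unity Catalog volume. The catalog/schema names, volume path, and file-naming convention are hypothetical.

```python
from pyspark.sql import functions as F

# Structured outcomes: only rows where the root cause and resolution are already filled in.
resolved_calls = (
    spark.table("helpdesk.ops.call_detail")   # hypothetical fully qualified table name
         .filter(F.col("root_cause").isNotNull() & F.col("resolution").isNotNull())
         .select("call_id", "root_cause", "resolution")
)

# Unstructured transcripts: one .txt file per call, read whole-file from the UC volume.
transcripts = (
    spark.read.text("/Volumes/helpdesk/ops/transcript_volume/*.txt", wholetext=True)
         .withColumn("call_id", F.regexp_extract(F.input_file_name(), r"([^/]+)\.txt$", 1))
         .withColumnRenamed("value", "transcript_text")
)

# The joined result is the corpus that gets chunked, embedded, and indexed for retrieval.
rag_corpus = transcripts.join(resolved_calls, on="call_id", how="inner")
```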