- Home
- Training Courses
- Certifications
- AWS Certified Machine Learning - Specialty (MLS-C01)

PDFs and exam guides are not so efficient, right? Prepare for your Amazon examination with our training course. The AWS Certified Machine Learning - Specialty course contains a complete batch of videos that will provide you with profound and thorough knowledge related to the Amazon certification exam. Pass the Amazon AWS Certified Machine Learning - Specialty test with flying colors.

Rating

4.6

Students

134

Duration

09:08:00 h

$16.49

$14.99

Curriculum for AWS Certified Machine Learning - Specialty Certification Video Course

Introduction

1 Lectures

Time 00:06:00

Data Engineering

22 Lectures

Time 01:24:00

Exploratory Data Analysis

21 Lectures

Time 02:27:00

Modeling

45 Lectures

Time 03:46:00

ML Implementation and Operations

11 Lectures

Time 01:04:00

Wrapping Up

6 Lectures

Time 00:21:00

Name of Video | Time |
---|---|
1. Course Introduction: What to Expect | 6:00 |

Name of Video | Time |
---|---|
1. Section Intro: Data Engineering | 1:00 |
2. Amazon S3 - Overview | 5:00 |
3. Amazon S3 - Storage Tiers & Lifecycle Rules | 4:00 |
4. Amazon S3 Security | 8:00 |
5. Kinesis Data Streams & Kinesis Data Firehose | 9:00 |
6. Lab 1.1 - Kinesis Data Firehose | 6:00 |
7. Kinesis Data Analytics | 4:00 |
8. Lab 1.2 - Kinesis Data Analytics | 7:00 |
9. Kinesis Video Streams | 3:00 |
10. Kinesis ML Summary | 1:00 |
11. Glue Data Catalog & Crawlers | 3:00 |
12. Lab 1.3 - Glue Data Catalog | 4:00 |
13. Glue ETL | 2:00 |
14. Lab 1.4 - Glue ETL | 6:00 |
15. Lab 1.5 - Athena | 1:00 |
16. Lab 1 - Cleanup | 2:00 |
17. AWS Data Stores in Machine Learning | 3:00 |
18. AWS Data Pipelines | 3:00 |
19. AWS Batch | 2:00 |
20. AWS DMS - Database Migration Services | 2:00 |
21. AWS Step Functions | 3:00 |
22. Full Data Engineering Pipelines | 5:00 |

Name of Video | Time |
---|---|
1. Section Intro: Data Analysis | 1:00 |
2. Python in Data Science and Machine Learning | 12:00 |
3. Example: Preparing Data for Machine Learning in a Jupyter Notebook | 10:00 |
4. Types of Data | 5:00 |
5. Data Distributions | 6:00 |
6. Time Series: Trends and Seasonality | 4:00 |
7. Introduction to Amazon Athena | 5:00 |
8. Overview of Amazon Quicksight | 6:00 |
9. Types of Visualizations, and When to Use Them | 5:00 |
10. Elastic MapReduce (EMR) and Hadoop Overview | 7:00 |
11. Apache Spark on EMR | 10:00 |
12. EMR Notebooks, Security, and Instance Types | 4:00 |
13. Feature Engineering and the Curse of Dimensionality | 7:00 |
14. Imputing Missing Data | 8:00 |
15. Dealing with Unbalanced Data | 6:00 |
16. Handling Outliers | 9:00 |
17. Binning, Transforming, Encoding, Scaling, and Shuffling | 8:00 |
18. Amazon SageMaker Ground Truth and Label Generation | 4:00 |
19. Lab: Preparing Data for TF-IDF with Spark and EMR, Part 1 | 6:00 |
20. Lab: Preparing Data for TF-IDF with Spark and EMR, Part 2 | 10:00 |
21. Lab: Preparing Data for TF-IDF with Spark and EMR, Part 3 | 14:00 |

Name of Video | Time |
---|---|
1. Section Intro: Modeling | 2:00 |
2. Introduction to Deep Learning | 9:00 |
3. Convolutional Neural Networks | 12:00 |
4. Recurrent Neural Networks | 11:00 |
5. Deep Learning on EC2 and EMR | 2:00 |
6. Tuning Neural Networks | 5:00 |
7. Regularization Techniques for Neural Networks (Dropout, Early Stopping) | 7:00 |
8. Grief with Gradients: The Vanishing Gradient Problem | 4:00 |
9. L1 and L2 Regularization | 3:00 |
10. The Confusion Matrix | 6:00 |
11. Precision, Recall, F1, AUC, and more | 7:00 |
12. Ensemble Methods: Bagging and Boosting | 4:00 |
13. Introducing Amazon SageMaker | 8:00 |
14. Linear Learner in SageMaker | 5:00 |
15. XGBoost in SageMaker | 3:00 |
16. Seq2Seq in SageMaker | 5:00 |
17. DeepAR in SageMaker | 4:00 |
18. BlazingText in SageMaker | 5:00 |
19. Object2Vec in SageMaker | 5:00 |
20. Object Detection in SageMaker | 4:00 |
21. Image Classification in SageMaker | 4:00 |
22. Semantic Segmentation in SageMaker | 4:00 |
23. Random Cut Forest in SageMaker | 3:00 |
24. Neural Topic Model in SageMaker | 3:00 |
25. Latent Dirichlet Allocation (LDA) in SageMaker | 3:00 |
26. K-Nearest-Neighbors (KNN) in SageMaker | 3:00 |
27. K-Means Clustering in SageMaker | 5:00 |
28. Principal Component Analysis (PCA) in SageMaker | 3:00 |
29. Factorization Machines in SageMaker | 4:00 |
30. IP Insights in SageMaker | 3:00 |
31. Reinforcement Learning in SageMaker | 12:00 |
32. Automatic Model Tuning | 6:00 |
33. Apache Spark with SageMaker | 3:00 |
34. Amazon Comprehend | 6:00 |
35. Amazon Translate | 2:00 |
36. Amazon Transcribe | 4:00 |
37. Amazon Polly | 6:00 |
38. Amazon Rekognition | 7:00 |
39. Amazon Forecast | 2:00 |
40. Amazon Lex | 3:00 |
41. The Best of the Rest: Other High-Level AWS Machine Learning Services | 3:00 |
42. Putting them All Together | 2:00 |
43. Lab: Tuning a Convolutional Neural Network on EC2, Part 1 | 9:00 |
44. Lab: Tuning a Convolutional Neural Network on EC2, Part 2 | 9:00 |
45. Lab: Tuning a Convolutional Neural Network on EC2, Part 3 | 6:00 |

Name of Video | Time |
---|---|
1. Section Intro: Machine Learning Implementation and Operations | 1:00 |
2. SageMaker's Inner Details and Production Variants | 11:00 |
3. SageMaker On the Edge: SageMaker Neo and IoT Greengrass | 4:00 |
4. SageMaker Security: Encryption at Rest and In Transit | 5:00 |
5. SageMaker Security: VPCs, IAM, Logging, and Monitoring | 4:00 |
6. SageMaker Resource Management: Instance Types and Spot Training | 4:00 |
7. SageMaker Resource Management: Elastic Inference, Automatic Scaling, AZs | 5:00 |
8. SageMaker Inference Pipelines | 2:00 |
9. Lab: Tuning, Deploying, and Predicting with Tensorflow on SageMaker - Part 1 | 5:00 |
10. Lab: Tuning, Deploying, and Predicting with Tensorflow on SageMaker - Part 2 | 11:00 |
11. Lab: Tuning, Deploying, and Predicting with Tensorflow on SageMaker - Part 3 | 12:00 |

Name of Video | Time |
---|---|
1. Section Intro: Wrapping Up | 1:00 |
2. More Preparation Resources | 6:00 |
3. Test-Taking Strategies, and What to Expect | 10:00 |
4. You Made It! | 1:00 |
5. Save 50% on your AWS Exam Cost! | 2:00 |
6. Get an Extra 30 Minutes on your AWS Exam - Non-Native English Speakers only | 1:00 |

100% Latest & Updated Amazon AWS Certified Machine Learning - Specialty Practice Test Questions, Exam Dumps & Verified Answers!

30 Days Free Updates, Instant Download!

AWS Certified Machine Learning - Specialty Premium Bundle

- Premium File: 345 Questions & Answers. Last update: Aug 10, 2024
- Training Course: 106 Video Lectures
- Study Guide: 275 Pages

- Latest Questions
- 100% Accurate Answers
- Fast Exam Updates

$69.97

$49.99

Free AWS Certified Machine Learning - Specialty Exam Questions & AWS Certified Machine Learning - Specialty Dumps

File Name | Size | Votes |
---|---|---|
amazon.test-king.aws certified machine learning - specialty.v2024-06-14.by.george.111q.vce | 1.05 MB | 1 |
amazon.real-exams.aws certified machine learning - specialty.v2021-12-17.by.lucas.108q.vce | 1.45 MB | 1 |
amazon.pass4sure.aws certified machine learning - specialty.v2021-07-27.by.benjamin.78q.vce | 1.37 MB | 1 |
amazon.actualtests.aws certified machine learning - specialty.v2021-04-30.by.giovanni.72q.vce | 902.88 KB | 2 |

Amazon AWS Certified Machine Learning - Specialty Training Course

Want verified and proven knowledge for the AWS Certified Machine Learning - Specialty (MLS-C01) exam? Believe it's easy when you have ExamSnap's AWS Certified Machine Learning - Specialty (MLS-C01) certification video training course by your side, which, along with our Amazon AWS Certified Machine Learning - Specialty exam dumps and practice test questions, provides a complete solution to pass your exam.

Now, if you're new to all of this, you generally do this sort of exploratory data analysis within something called a Jupyter notebook. A Jupyter notebook actually runs within your web browser, and it communicates with a server run by your Python environment. In this example, we are using Anaconda, which is a very popular choice in the world of data science and machine learning. A Jupyter notebook allows you to intersperse code with your own notes and annotations so others can understand what you're trying to do when they look at it, and it might be a good reminder to yourself as well. These code blocks really run: you just hit Shift+Enter within one of these blocks of code to actually execute it.

So you can set up an entire pipeline or sequence of steps within a notebook and rerun them as needed as you iterate on it. So you could actually run an entire pipeline of analyzing, preparing, and cleaning your data, training a machine learning model on it, and then deploying that model to actually make predictions, and we'll see that throughout the course. To make it a little bit more real, we will look at an actual Jupyter notebook that will create a machine learning model to predict whether mammogram results are something to worry about or not. So here we have a real-world example of using a Jupyter notebook, Python, the Pandas library, NumPy, and scikit-learn to prepare and understand data and create a machine learning model from it that actually works. What we're going to use is the mammographic mass dataset from the UCI repository, which is a great place to find public domain datasets to play with.

And we're going to try to create a neural network that will predict whether a given mass in a mammogram is benign or malignant, based on the properties of that mass. So we have a very useful example here. Now again, we're not going to get into the details of the actual Python code, because for the purpose of the exam, all that matters is what's happening at a high level. You will not be required to read or understand Python code for the exam, so we're not going to go there. So at a high level, what are we doing here? Well, we'll start by importing the data that we're given. And often the data that we're given in raw form is not terribly useful as is, so first we need to see what we're up against. We're going to import the Pandas library, call read_csv on the raw data file that we have here, which is in comma-separated format, and call head just to see what we have and what we're working with. When we do that, we can see a few things. First of all, there are no meaningful column names in this data file.

So we have no names for the columns that are part of the data itself. If we want to actually understand what's going on here as we're evaluating and playing with our data, we're going to have to introduce those column names by hand. Another weird thing is that there are a lot of question marks here instead of numbers. It turns out that this represents missing data. Now, a question mark is not what Python expects a missing data point to look like, so we're going to have to deal with that as well, and also figure out how we're going to handle those missing data points. Are we going to just throw them away, or try to replace them with something else? What are we going to do? So let's start with the first thing. We'll start by rereading that CSV file with Pandas, and this time we're going to say that a question mark represents a missing value, an NA value. And we'll pass in a list of explicit column names as well, which came with the documentation that accompanied this dataset. We'll call head again and see how things look now. Okay, so this is a little bit easier to deal with. We actually have useful column names that we've added to our data frame, and instead of question marks, we have NaN values, which is Python's numeric notation for missing data.
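
A minimal sketch of that re-read step, using a small hypothetical inline sample in place of the actual UCI file (the column names below follow the dataset's documentation):

```python
import io
import pandas as pd

# Hypothetical inline sample standing in for the mammographic masses file:
# no header row, and '?' marks missing values.
raw = "5,67,3,5,3,1\n4,43,1,1,?,1\n5,58,4,5,?,1\n"
cols = ["BI-RADS", "age", "shape", "margin", "density", "severity"]

# Tell Pandas that '?' means NA and supply explicit column names.
df = pd.read_csv(io.StringIO(raw), na_values=["?"], names=cols)
print(df.head())
```

After this, the question marks show up as NaN, and every column has a readable name.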

Okay, so we're getting there. Now that we actually have our data in a format where we can understand it and work with it, we need to understand what sort of cleaning and preprocessing we might need to do to that data. Our models are only as good as the data going into them. So let's start by just calling describe on the data frame from Pandas to get some high-level information about what's in there. By looking at the counts, we can see that they're not all the same, so there's a lot of missing data here. For example, the density column seems to be missing a lot of data compared to, say, the severity column. So what do we do about all that missing data? Well, we'll get into that later in this section. There are a lot of ways to impute missing data and replace it with meaningful substitutes. But the simplest thing to do is to just drop the rows that contain missing data and not deal with the problem at all. If you have enough data that you can get away with that, and you're not going to introduce any bias by doing it, then that might be a reasonable thing to do, at least during the early stages of iterating on your algorithms. So we need to make sure we're not going to introduce any sort of bias by dropping those missing rows. Let's visualise what those rows with missing data look like.

That's what this line is going to do here; it's just extracting all of the rows that have a null value in one of the columns. And just by eyeballing it (there are more sophisticated ways of actually doing this), we don't really see any obvious patterns in ages or things like that. It seems like this missing data is pretty evenly distributed across different types of people. So given that, we can feel pretty okay with just dropping that data. Now, mind you, dropping data is never the most optimal thing to do. There is usually a better way to deal with it through imputation methods, but again, we'll get there later. For now, we're just going to call dropna on that data frame to say that any rows containing missing data should be dropped. So let's go ahead and do that and describe it again. We can see now that we have 830 rows in every single column there, so there is no missing data anymore. And if we wanted to, we could actually compare the mean and standard deviation of these columns before and after to see what sort of impact that really had. Okay, so now before we actually take this data and pass it into scikit-learn to do some modelling on it, we need to convert it back into a NumPy array.
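
A toy version of the describe-then-dropna step, again with a hypothetical inline sample rather than the real 830-row dataset:

```python
import io
import pandas as pd

# Hypothetical sample with '?' as the missing-data marker.
raw = "5,67,3,5,3,1\n4,43,1,1,?,1\n5,58,4,5,3,1\n"
cols = ["BI-RADS", "age", "shape", "margin", "density", "severity"]
df = pd.read_csv(io.StringIO(raw), na_values=["?"], names=cols)

# describe() shows per-column counts; columns with missing data come up short.
print(df.describe())

# Simplest approach: drop any row containing a missing value.
clean = df.dropna()
print(len(df), "rows before,", len(clean), "after")
```

After dropna, every column reports the same count, which is the quick sanity check the transcript describes.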

And that's what this values method here does. We're just taking our Pandas data frame and converting it back into a NumPy array. That's it. So let's take a look at that. We can see what the resulting features array looks like after that conversion. The next thing we need to deal with is normalising that data. You can see there's a big range of values here. The age, for example, is going to be a much larger number than, say, the shape, the margin, or the density of a mass, right? So if I were to use this data as is, age would have a much bigger weight on the results than everything else, and that's not a reasonable thing to do. That doesn't make sense. Also, ages are centred around, I don't know, 30 or 40 years old, right? So we need to account for that offset as well. We need to normalise the data, which is what it comes down to: make sure everything is centred around the mean for each column and scaled down to the same range, so that every feature has the same weight. All right, so to do that, we can use scikit-learn's preprocessing module. It has a handy thing called a StandardScaler that does just that. We'll just call fit_transform on the entire NumPy array using a StandardScaler and look at the resulting array, and you can see that now things are within plus or minus one, more or less. It has this sort of normal distribution; it's not really constrained to that, but it's all centred around zero and more or less in the same range, which is what's important.
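
The centring and scaling described here can be sketched in plain NumPy; this is the same arithmetic StandardScaler's fit_transform performs, shown manually on a made-up two-column feature matrix:

```python
import numpy as np

# Toy feature matrix: the first column (age) spans a much larger
# range than the second (say, a shape score from 1 to 5).
features = np.array([[67.0, 3.0],
                     [43.0, 1.0],
                     [58.0, 5.0],
                     [28.0, 2.0]])

# Centre each column on its mean, then divide by its standard deviation.
scaled = (features - features.mean(axis=0)) / features.std(axis=0)
print(scaled.mean(axis=0))  # ~0 for every column
print(scaled.std(axis=0))   # 1 for every column
```

After this, both columns sit on the same scale, so neither dominates the model simply because of its units.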

So we've used Pandas to understand our data and deal with missing values in that data. And we could have used it to sample that data as well if we wanted to, if we had too much data to process at once while we're experimenting. We then exported that to a NumPy array and used scikit-learn's preprocessing module to scale it down into a consistent range. Now that our data has been prepared and cleaned, we can actually feed it into a machine learning model. I'm not going to get into the details of this, but basically we're using TensorFlow's Keras API to create a neural network that will learn from this data and predict whether a mass it hasn't seen before might be benign or malignant. Let's go ahead and run that. Don't worry; we'll get into how that works later in the modelling section of the course. It's a lot of fun. And now that we have our model defined, we can actually wrap it in a scikit-learn estimator.

So just like we have that classifier in the slides from scikit-learn, Keras has a way to make a TensorFlow neural network look like a scikit-learn model. We're using Keras's KerasClassifier wrapper to do that. Basically, it's going to create a neural network with various parameters. We're going to call that our estimator, and then we can call scikit-learn's cross_val_score function to actually train and evaluate it. What cross_val_score does is randomly separate your data set into training and test data sets multiple times. So multiple times, it will take the training data set, train the model (the neural network), and then take the test data that we held aside and evaluate how well the resulting model can predict the labels on data it's never seen before. In this case, we will do that ten times and take the average of the results from each different split of training and testing data. So we can go ahead and kick that off. Basically, we're going to split up our data into training and testing sets ten times, train ten different neural networks, and evaluate the results of all ten networks. Obviously, this will take a little bit of time, so we'll wait for that to finish.
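
A sketch of the cross-validation step; here a plain scikit-learn LogisticRegression on synthetic data stands in for the wrapped Keras network, since the mechanics of cross_val_score are identical either way:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (the real notebook uses the mammographic features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.25 * rng.normal(size=200) > 0).astype(int)

# cross_val_score splits the data into 10 train/test folds, trains a fresh
# model on each training fold, and scores it on the held-out fold.
scores = cross_val_score(LogisticRegression(), X, y, cv=10)
print(len(scores), "folds, mean accuracy:", scores.mean())
```

Averaging the ten fold scores gives a more trustworthy accuracy estimate than a single train/test split would.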

And we finished. We had about 80% accuracy all in all, which for this particular data set isn't great, but it's not bad either. We'll talk more about what that accuracy really means and how to interpret it later in the course. But again, the high-level thing we want you to understand is how Jupyter notebooks work and how they're used. You can see that we have these blocks of Python code that we can run one step at a time, actually communicating with a real Python environment running on the back end. We can intersperse little notes and annotations, as reminders of what's going on and why we're doing it, for ourselves or for other people. That's what a notebook is all about. We've looked at using the Pandas library to explore our data, clean it up a little bit, and deal with missing values, then export it to an actual model using, in this example, TensorFlow and Keras, with scikit-learn's StandardScaler to normalise that data before it gets fed in for training. So we've used a Jupyter notebook. We've used Pandas to visualise and clean up our data. We've used scikit-learn to scale and preprocess that data, and then used an actual deep neural network to create a model and evaluate its results, all in one little notebook. So, at a high level, that's what's going on here. It's a very common pattern in the world of machine learning, and hopefully that's helped make it all real.

Let's dive into data distributions. A distribution characterizes the likelihood of your data falling into a certain range, basically. It's very important when you're doing exploratory data analysis, and it's also something that you're very likely to see on the exam. Let's start with the normal distribution. A normal distribution is a simple example of a probability density function. So here's that normal distribution curve that we've seen before. It's easy conceptually to think of this as the probability of a given value occurring, but that's a little bit misleading when you're talking about continuous data, because there's an infinite number of possible data points in a continuous data distribution. A value could be 0.001, or 0.0001, or 0.00001. So the actual probability of one very specific value occurring is very, very small, even infinitesimally small. The probability density function really speaks to the probability of a given range of values occurring. So that's the way you have to think about these things.

As an example, in this normal distribution, between the mean and one standard deviation above the mean there's a 34.1% chance of a value falling within that range. And you can tighten that up or spread it out as much as you want and figure out the actual values. But that's the way to think about a probability density function: for a given range of values, it gives you a way of finding the probability of that range occurring. So you can see here that you're pretty likely to land close to the mean: if you add up 34.1% and 34.1%, you get about 68%, the probability of landing within one standard deviation of the mean. But between two and three standard deviations out, we're down to just a little bit over 4% combined between the positive and negative sides, and beyond three standard deviations, we're at much less than 1%. So it's just a way to visualise and talk about the probability of a given data point happening. This particular distribution, the normal distribution, shows up a lot. It's basically a bell curve centred around zero, which represents the mean of your data set, and you can carve it up into the standard deviations that lie along these points.
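
You can verify these standard-deviation figures empirically by sampling. A quick simulation sketch:

```python
import numpy as np

# Draw many samples from a standard normal distribution and measure how
# often they land within 1, 2, and 3 standard deviations of the mean.
rng = np.random.default_rng(42)
samples = rng.normal(loc=0.0, scale=1.0, size=200_000)

for k in (1, 2, 3):
    frac = np.mean(np.abs(samples) < k)
    print(f"within {k} sd: {frac:.3f}")
```

The fractions come out near 68.3%, 95.4%, and 99.7%, matching the 34.1% + 34.1% figure quoted above for one standard deviation.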

So again, the probability density function gives you the probability of a data point falling within some given range of values, and the normal distribution is just one example of a probability density function. We'll get to some more in a moment. Now, when you're dealing with discrete data, that nuance about having an infinite number of possible values goes away, and things get a little bit different: now we're talking about a probability mass function. If you're dealing with discrete data, we talk about the probability mass function. For example, we can plot a normal probability density function of continuous data as this black curve. But if we were to quantize that into a discrete dataset, like we would do with a histogram, we could count how many times each value occurs. We can say, for instance, that the number three has a little bit more than a 30% chance of occurring. So a probability mass function is the way we visualise the probability of discrete data occurring, and it looks a lot like a histogram because it basically is a histogram. It's a terminology difference: the probability density function is a solid curve that describes the probability of a range of values occurring with continuous data, while the probability mass function gives the probability of specific discrete values occurring in a data set.
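
A small sketch of building an empirical probability mass function by quantizing continuous samples, as described above (the mean of 3 is a made-up choice to echo the "number three" example):

```python
import numpy as np

# Quantize continuous normal samples to integers, as a histogram would,
# and compute the empirical probability mass function.
rng = np.random.default_rng(7)
data = np.round(rng.normal(loc=3.0, scale=1.5, size=50_000)).astype(int)

values, counts = np.unique(data, return_counts=True)
pmf = counts / counts.sum()          # fraction of samples at each value
most_likely = values[np.argmax(pmf)]
print("most likely value:", most_likely, "with probability", pmf.max())
```

The probabilities across all discrete values sum to 1, which is exactly what distinguishes a PMF from the density curve it was quantized from.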

An example of a discrete probability distribution is the Poisson distribution. This is a specific discrete distribution based on what's called a Poisson experiment. Poisson experiments are defined as some series of events that end in success or failure, where the average number of successes over time or distance is known. For example, you might know that a real estate company on average sells a certain number of homes every day. The Poisson distribution would tell you the likelihood of selling a given number of homes on the next day. In this graph, lambda embodies the expected number of homes sold. If lambda is one, we're looking at the yellow line. We can see there's about a 37% chance of selling zero homes, and about a 37% chance of selling one home, on any given day if we know the overall average is one. But if on average we sell ten homes per day, we're looking at the blue line. Now this starts to look more like a normal distribution, as we have more possible values less than ten to work with. The important point is that Poisson distributions deal with discrete data. We can't sell 2.54 houses in a given day, so we just don't have enough points to work with to have a smooth normal distribution across every value.
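
The Poisson probabilities quoted here (about 37% for lambda = 1) follow directly from the formula P(X = k) = λ^k · e^(−λ) / k!. A minimal sketch:

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with average rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# With an average of 1 home sold per day (lambda = 1), the chance of
# selling exactly 0, or exactly 1, on a given day is about 37% each.
print(poisson_pmf(0, 1.0))  # ~0.368
print(poisson_pmf(1, 1.0))  # ~0.368
```

With lambda = 10, the same function traces out the near-bell-shaped blue line the transcript mentions.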

As we get closer to lambda values of 0, the distribution starts to look more exponential. So again, when you hear about Poisson distributions, remember we must be talking about discrete data. Other examples might be how many pieces of mail you receive on a given day or how many calls a call centre receives. These are all discrete events. You don't have half a call or half a letter; only whole integer values make sense in these problems. Another discrete probability distribution is the binomial distribution. This just describes the number of successes in a series of experiments, each with a yes-or-no outcome. So, for example, repeatedly flipping a coin and counting heads or tails could be described by a binomial distribution. It's just another discrete probability distribution where each trial has a binary result: zero or one, positive or negative, heads or tails. There's also something called a Bernoulli distribution that you might want to know about. It's a special case of the binomial distribution that has just a single trial, n = 1. So you could think of a binomial distribution as the sum of Bernoulli distributions: the Bernoulli distribution has a single trial, and a binomial distribution consists of multiple trials. That's all there is to it. So to recap some basic data distributions: we talked about normal distributions, Poisson distributions, binomial distributions, and Bernoulli distributions. And remember, only the normal distribution was for continuous data; the rest were in the context of discrete data.
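
The binomial/Bernoulli relationship can be sketched with the binomial formula; setting n = 1 recovers the Bernoulli case:

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent yes/no trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# n = 1 is the Bernoulli special case: a single coin flip.
print(binomial_pmf(1, 1, 0.5))   # 0.5

# Probability of exactly 5 heads in 10 fair coin flips.
print(binomial_pmf(5, 10, 0.5))  # 252/1024, about 0.246
```

Summing n independent Bernoulli trials is what produces the binomial distribution, which is the "sum of Bernoulli distributions" framing used above.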

Let's talk about some real basics of time series analysis. Time series are what they sound like: a series of data points over time. They tend to be discrete samples taken at discrete points in time over a period of time. So, for example, we can have a trend in a time series. This is an actual graph of global average sea levels over time, between 1872 and 2008. And yeah, it's not pretty, guys, but that's a different topic for another time. You can see here that there is definitely an overall trend going upward over time: sea level has increased. There are fluctuations from year to year, but the larger trend over a longer period of time is pretty clear, and it's up and to the right. So that's what a trend is all about. If you step back and look at your entire time series, does it seem to be trending in one direction or the other? That's a trend. Time series can also exhibit seasonality. For example, if we look at the incidence of pneumonia and influenza, we can see that it definitely has seasonal components to it. It tends to peak during certain months, and during the summer it's not so bad. So we have seasonality here, indicated by these black lines that represent the normal fluctuations we expect to see from month to month. Seasonality can be superimposed on trends, basically, to compose a time series. There might be a trend as well, but in this case it looks pretty flat. The main signal we see here is seasonality, where most of the fluctuations in this time series can be described by what time of year it is, what month it is, and what week it is. You can have both, of course.

For example, you can take the raw data of a time series, as seen at the top here, and extract the seasonal component, so you can numerically figure out what the seasonal piece of it is. And if you subtract that seasonality from the raw data, you're left with the trend, basically. The data we're looking at is actually Wikipedia edits. The raw data is at the top; we extract the seasonal component, and if we subtract out that seasonality, we get the overall trend with those month-to-month variations taken out. So there's an example of seasonality and trends working together. This is a data set that has both seasonality and a trend. It turns out that, for whatever reason, people edit Wikipedia pages more frequently in certain months than others. But there's also a larger trend here: in 2006, things started to explode, they kind of peaked around 2007, and it's been falling off and levelling off in recent years. That's how we would describe the trend; the seasonality would be those month-to-month variations. There's also noise as a component of a time series.

So not only do we have seasonality and trends, but there's also going to be some random noise that just can't be accounted for otherwise. Some variation is just random in nature; there's nothing you can do about it. There are a couple of ways to model this. One is an additive model. If your seasonal variation is constant, meaning the variation you see from one season to another is the same no matter what overall trend you're seeing, then you might use an additive model, and you could describe the entire time series as the sum of the seasonality plus the longer-term trend plus the noise. That would be your overall time series. As we saw before, you could take such a time series, subtract out the seasonality, and be left with trend plus noise. So it all works out mathematically. But sometimes the seasonality will scale along with the trend: as the scale of your data increases or decreases, the amplitude of that seasonality also increases or decreases. In that case, you would want to use a multiplicative model, where you say the time series is the product of seasonality, trend, and noise. It just depends on the nature of the data; there's no hard and fast rule there. One model or the other might better describe your data. And you know, there's a whole world of time series analysis out there, but for the purpose of the exam, exploratory data analysis, that's the bit you need to know.
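
A synthetic sketch of the additive model described above, with made-up trend, seasonality, and noise components (the multiplicative line is purely illustrative, with the seasonal swing scaling with the trend level):

```python
import numpy as np

t = np.arange(48)                          # four years of monthly samples
trend = 0.5 * t                            # long-term upward trend
season = 3.0 * np.sin(2 * np.pi * t / 12)  # repeating yearly pattern
rng = np.random.default_rng(1)
noise = rng.normal(scale=0.3, size=t.size) # unexplained random variation

# Additive model: the seasonal swing stays the same size over time.
additive = trend + season + noise

# Multiplicative flavour: the seasonal swing scales with the trend level.
multiplicative = trend * (1 + 0.1 * np.sin(2 * np.pi * t / 12)) + noise

# Subtracting the known seasonality from the additive series
# leaves trend plus noise, as described above.
remainder = additive - season
```

That last subtraction is exactly the decomposition step from the Wikipedia-edits example: seasonality out, trend plus noise left behind.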

So we've talked about some of the general ways of analysing and preparing your data that you might use both outside of AWS and within it. But let's talk about some of the specific AWS services you might use in the process of exploring your data, too. Obviously, that plays a large role in the exam. We go into a lot more depth on this when preparing for the AWS Certified Data Analytics exam, formerly the Big Data exam, so there is some overlap between those two certification exams; we just need a somewhat higher-level understanding in the context of the machine learning exam. Let's start with Amazon Athena, which offers a serverless way of doing interactive queries against your S3 data lake, whatever it might be.

So there's no need to load data into a database with Athena; it just stays in S3 in its raw form, so you can have a data lake of CSV files or what have you. It also supports JSON, ORC, Parquet, and Avro formats. Athena just looks at that data and allows you to execute SQL queries on it. Under the hood, it's powered by an open source engine called Presto. The main things to remember with Athena are that (a) it lets you do SQL queries on unstructured or semi-structured data in an S3 data lake, and (b) it is serverless.

So there are no servers for you to manage. You just use Athena, give it a query, tell it where to look for the data, and it just happens; AWS figures out where the actual servers run to make it happen for you. You don't manage that. A few examples of usage: maybe you want to do some ad hoc queries of your weblog data. You could have raw web logs sitting in S3 somewhere and use Athena to query that stuff and see what sorts of trends exist in it. You might want to query some data that's being staged in S3 before you load it into Redshift or some other data warehouse. Or you might want to analyse logs from CloudTrail, CloudFront, your VPC, or your Elastic Load Balancer that are sitting in S3, using Athena to just issue SQL queries on those.
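As a sketch of that first use case, here is how an ad hoc weblog query might be built and handed to Athena. The table, database, column, and bucket names are all hypothetical, and the boto3 call is shown in comments rather than executed:

```python
# Hypothetical ad hoc query over raw web logs in S3; the table and
# column names ("weblogs.access_logs", "request_path") are made up.
def build_top_pages_query(table: str, limit: int = 10) -> str:
    """Return a simple aggregation query Athena could run over a web-log table."""
    return (
        f"SELECT request_path, COUNT(*) AS hits "
        f"FROM {table} "
        f"GROUP BY request_path "
        f"ORDER BY hits DESC "
        f"LIMIT {limit}"
    )

query = build_top_pages_query("weblogs.access_logs")

# With boto3 (not run here), you would hand the query string to Athena like so:
# import boto3
# athena = boto3.client("athena")
# athena.start_query_execution(
#     QueryString=query,
#     QueryExecutionContext={"Database": "weblogs"},
#     ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
# )
```

Notice that nothing here provisions infrastructure: the query and an S3 output location are the only inputs, which is the "serverless" point made above.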

You can also integrate Athena with Jupyter, Zeppelin, and RStudio notebooks, which is nice. You can just run queries from within your notebook, and that's useful for analysing and understanding your data right there. It also integrates with QuickSight, an AWS data visualisation tool that we'll talk about shortly, and it can integrate with pretty much any other visualisation tool via the ODBC and JDBC protocols. It's important to understand the relationship between Athena and AWS Glue. If you're using Glue to impart some structure to your S3 data lake, extracting what the columns might be in the data that's sitting there, Athena can take advantage of that. Your Glue Data Catalog can serve as a metadata repository across various services, including Athena. It can crawl your data in S3 and extract a schema from it, and Athena can use that schema to issue SQL queries and come up with names for your columns from a SQL standpoint. Glue's fully managed ETL capabilities can also be used to transform data or convert it into columnar or other formats to optimise cost and improve the performance of your Athena queries. A typical pipeline might look like this: you have data sitting in S3.

You have a Glue crawler that extracts the actual meaning of that data, its structure. Athena then sits on top of that to issue queries, and you can feed those Athena queries into QuickSight to visualise them. The cost model with Athena is pay-as-you-go. These details aren't going to be terribly important on this exam, but it's good to know it's only $5 per terabyte scanned; it charges you for the amount of data scanned by a query, basically. And you can keep that cost down by compressing your data, by the way. So if you want to keep Athena cheap, you want to compress the data going into it. Also, converting your data to columnar formats saves a lot of money because it allows Athena to selectively read only the columns it needs to process that query. So one thing that is important to know is that using columnar formats such as ORC and Parquet in conjunction with Athena is a really good idea.
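To make the pricing concrete, here's a back-of-the-envelope calculator using the $5-per-terabyte figure above. The compression and columnar-read ratios are illustrative assumptions, not guarantees:

```python
# Rough illustration of Athena's pay-per-scan pricing ($5 per TB scanned,
# per the lecture). The 4:1 compression and 1-in-10 column-read ratios
# below are made-up assumptions for the sake of the example.
PRICE_PER_TB = 5.00

def athena_query_cost(bytes_scanned: int) -> float:
    """Cost in dollars for a single query, given bytes scanned."""
    return (bytes_scanned / 10**12) * PRICE_PER_TB

raw_scan = 2 * 10**12                    # a query scans 2 TB of raw CSV
compressed_scan = raw_scan // 4          # assume ~4:1 gzip compression
columnar_scan = compressed_scan // 10    # assume Parquet reads ~1/10 of the columns

print(athena_query_cost(raw_scan))        # 10.0
print(athena_query_cost(compressed_scan)) # 2.5
print(athena_query_cost(columnar_scan))   # 0.25
```

The same query dropping from $10 to 25 cents is why the lecture calls compressed, columnar formats like ORC and Parquet a really good idea with Athena.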

Now, obviously, Glue and S3 will have their own charges as well, in addition to Athena. From a security standpoint, Athena uses all the typical access control stuff: IAM, ACLs, and S3 bucket policies all come into play. There are IAM policies for Athena full access or QuickSight Athena access to make things easy. You can encrypt the results from your Athena query at rest if you want to, within an S3 staging directory, and you have all the various means of encrypting data in S3 at your disposal, including SSE-S3, SSE-KMS, or CSE-KMS. It's also possible to do cross-account access in Athena using S3 bucket policies. And for in-transit data, TLS is used at all times to encrypt data moving between S3 and Athena. Some things you would not want to use Athena for would include highly formatted reports or visualisations; that's what QuickSight is about, which we'll talk about next. And it's also not for doing ETL; that's what Glue ETL is for. Athena is just for doing ad hoc queries using SQL against your S3 data lake. That's it.
