ASQ Six Sigma Green Belt – Objective – Ethereum Part 2

  1. Ethereum Browsers

Now we will talk about correlation and linear regression. In this topic we will cover four broad areas. First, calculating the correlation coefficient: we will understand what the correlation coefficient is and how to calculate it. Second, correlation versus causation. Third, the linear regression equation and how to find it. And at the end, we will use that linear regression equation for estimation and prediction. Let's start with calculating the correlation coefficient.

Even before we calculate the correlation coefficient, let's understand what correlation is. In the study of Six Sigma you will see the equation y = f(x). This simple equation tells you that the output is a function of the inputs, so if you want to change the output, you need to change the inputs. Take a simple example: joining two pieces with glue in between and wanting good strength from that joint. The strength of the joint is the output of this process.

What are the inputs here? The quantity of glue we apply, the surface finish of the two pieces, and the pressure we apply to make the joint. All of those are inputs. If we change those inputs, the output will change. If we add more glue, perhaps the strength will increase.

If we make the surface rough, that will generally lead to more strength. And if we apply more pressure to the two pieces, that will probably lead to a stronger joint. So in summary, the output is a function of the inputs. In y = f(x), the x's are the inputs. Inputs are also called independent variables, because you can change them independently. In the example of joining two pieces, the quantity of glue, the pressure and the holding time are all independent variables.

These are also called controllable variables, because they are inputs you can control. On the other hand, y is the output, which is also called the dependent variable, because its value depends on something else. That something else is the inputs we are changing. In correlation, we want to study the relationship between an input and an output. Many of you might be from different fields: engineering, medical science, banking, or something else. To keep things simple, let's take the example of study time versus marks.

We will take a hypothetical example of how many hours a student studied and how many marks that student obtained, and study the relationship between the two. As you will have understood by now, the hours of study are the x, the input, and the marks obtained are the y, the output.

Here I have a table which lists how many hours each student studied and what the test score was. A student who studied for 20 hours scored 40 marks. Another student who studied for 24 hours got 55 marks. And at the bottom of the list, there is a student who studied for 23 hours and got 37 marks. What we want to do in correlation is understand the relationship between these two variables, hours studied and test score. The simplest thing to do is to make a scatter plot. That is the first thing you do when you study the relationship between two variables, and that is what we have done here. On the x axis I have hours studied, and on the y axis the test score. By general convention, you put the input on the x axis and the output on the y axis.

So that's what I have done here. Look at point number one, which is 20 hours studied leading to 40 marks. If I go to 20 on the hours axis and 40 on the marks axis, this is that point. Now look at another point at the other extreme, at 62 hours.

For 62 hours studied, this person got 83 marks. If we look at the dots on this plot, you can see that there is some relationship. If I had to draw a line that generally follows these points, it might look something like this. It shows that as the number of hours studied increases, the test score generally increases too. But there will always be variation, because hours studied is not the only thing that matters. How much knowledge the student had earlier also matters.

What sort of books the student is studying from also matters. So there are a number of factors. If hours studied were the only thing making a difference, we would have seen all these points falling on a straight line. Because of the other factors, you see this variation around the straight line we drew. So making a scatter plot is the first thing we do. Here is a summary of the three things we generally do when we want to find the relationship between two variables. The first is to plot a scatter plot, which we have done here.

The second is to find the correlation coefficient, which we will do in the next lecture. In this particular case the correlation coefficient was 0.879. What does that mean? We will look at that. There is also a p-value, which we will not talk much about. The third step is to find a regression equation. The regression equation tells you what the relationship between the input and the output actually is. Here we come out with a regression equation for the test score: test score is roughly 16 plus roughly 1 times hours studied. We will derive this equation later.
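As a rough check on the "16 plus 1 times hours studied" equation, here is a minimal sketch that applies the standard least-squares formulas for slope and intercept to the column sums quoted later in this lecture (n = 10, sum of x = 371, sum of y = 520, sum of xy = 21764, sum of x squared = 16297). The variable names are my own, and the lecture itself has not yet shown how the equation is derived, so treat this as an illustration rather than the instructor's method.

```python
# Column sums quoted in the lecture for the ten (hours studied, test score) pairs.
n = 10
sum_x = 371      # sum of hours studied
sum_y = 520      # sum of test scores
sum_xy = 21764   # sum of hours * score
sum_x2 = 16297   # sum of hours squared

# Standard least-squares line y = intercept + slope * x.
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
intercept = sum_y / n - slope * (sum_x / n)

print(f"test score = {intercept:.2f} + {slope:.3f} * hours studied")
```

The intercept comes out a little under 16 and the slope a little under 1, which matches the rounded equation quoted in the lecture.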

  1. Ethereum Development

To find the correlation coefficient, here is the formula we use: r = (n Σxy − Σx Σy) / sqrt([n Σx² − (Σx)²] [n Σy² − (Σy)²]). How this formula is derived is beyond the scope of this course. For now, let's accept that this is the formula we use to find the correlation coefficient. Now, how do we find the values that go into it? Here x is the hours studied and y is the test score.

For 20 hours studied, we have 40 as the score. For 24 hours studied, the student got 55 marks, and so on. These are x and y. The formula has a number of terms: Σxy, Σx, Σy and so on. Let's understand them. When you add all the x's, 20 plus 24 plus 46 plus 62 and so on, that comes out to 371. This is your Σx (sigma x). Remember that there are two sigmas in statistics. One is the capital sigma, Σ, which means a sum, and the other is the lowercase sigma, σ, which means standard deviation. So when I say sigma, it might mean either of these two things depending on the context.

So Σx, the sum of all the x's, is 371. And what is Σy? It is the sum of all the values in column y. Next, we multiply x and y to get the xy column. 20 multiplied by 40 gives 800, 24 multiplied by 55 gives 1320, and so on.

If I add all of these, that gives me Σxy. Then there are a few more terms. For Σx², I take the square of each x. The first x was 20 and its square is 400, the second x was 24 and its square is 576, and so on. The next column holds the squares of y. The sum of all the x squares is Σx², and the sum of all the y squares is Σy².
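To make the column-building step concrete, here is a small sketch using only the four (hours, score) pairs that are read out in the lecture. The full ten-row table is not reproduced in this transcript, so these lists are a partial illustration and their sums will not match the lecture's totals.

```python
# The four (hours studied, test score) rows quoted in the lecture.
# This is NOT the full ten-row table, only the rows read out aloud.
rows = [(20, 40), (24, 55), (62, 83), (23, 37)]

xy = [x * y for x, y in rows]      # x multiplied by y, row by row
x_sq = [x * x for x, _ in rows]    # x squared
y_sq = [y * y for _, y in rows]    # y squared

print("xy  :", xy)
print("x^2 :", x_sq)
print("y^2 :", y_sq)
```

Summing each of these columns over all ten rows is what gives the Σxy, Σx² and Σy² terms of the formula.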

These are the values I have calculated in the table. Now let's put them into the formula. First is n, the number of items. Counting the rows, there are ten items. So the numerator is 10 multiplied by Σxy, which is 21764, minus Σx times Σy, where Σx is 371 and Σy is 520. That is divided by the square root of two brackets. In the first bracket, n is 10 and it multiplies Σx².

Remember that Σx² here is the sum of the x squares, which is 16297 from the table. From that we subtract (Σx)², which is Σx squared as a whole: Σx is 371, so this is 371 squared. Close the bracket. In the second bracket we again put 10 as n, multiplied by Σy².

Σy² is 30160, minus (Σy)², which is 520 squared. That's it. So r = (10 × 21764 − 371 × 520) / sqrt([10 × 16297 − 371²] [10 × 30160 − 520²]). Let's quickly solve this with a calculator. For the top, 10 multiplied by 21764, minus 371 multiplied by 520, gives 24720. For the bottom, the first bracket is 10 multiplied by 16297 minus 371 squared, which gives 25329. The second bracket is 10 multiplied by 30160 minus 520 squared, which gives 31200.

For the bottom, 31200 multiplied by 25329 is a big number, and its square root is 28111.65. Dividing the top, 24720, by that gives 0.879. This is the value of r, which is called the Pearson correlation coefficient. What does 0.879 mean? We will talk about that.
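The calculator steps above can be reproduced in a few lines. This sketch plugs the lecture's column sums into the same formula; only the variable names are my own.

```python
import math

# Column sums from the lecture's ten-row table.
n = 10
sum_x, sum_y = 371, 520
sum_xy = 21764
sum_x2, sum_y2 = 16297, 30160

numerator = n * sum_xy - sum_x * sum_y   # 24720
x_term = n * sum_x2 - sum_x ** 2         # 25329
y_term = n * sum_y2 - sum_y ** 2         # 31200

r = numerator / math.sqrt(x_term * y_term)
print(round(r, 3))  # 0.879
```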

And this value is positive. After calculating this, let's look at one more formula which is also used sometimes: r = Σ(x − x-bar)(y − y-bar) / sqrt(Σ(x − x-bar)² × Σ(y − y-bar)²). We used the first formula to find r as 0.879. You can try this second formula as well. The difference is that it needs a different set of calculations, based on x minus x-bar. Here x-bar is 371 divided by 10, which comes out to 37.1.

So 37.1 is x-bar, and similarly y-bar is 52. You don't need the xy, x squared and y squared columns for this formula. Instead you make some other columns. Take each x and subtract 37.1 from it. The first one is 20 minus 37.1, which is −17.1, and so on. You also need the squares of these, so the first one is −17.1 squared. Similarly, you need y minus y-bar and its square. So you calculate these four columns.

Lastly, you need one more column, which is x minus x-bar multiplied by y minus y-bar. The sum of this column is the top of the formula. At the bottom, under the square root, you multiply the sum of the (x − x-bar)² column by the sum of the (y − y-bar)² column. This is how you could use the second formula. I prefer to use the first one, but if you want, you can try the second formula as well.
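The two formulas are algebraically equivalent, so they should give the same r on any data set. Here is a minimal sketch that checks this. Since the lecture's full table is not reproduced in the transcript, the five data points below are hypothetical, and the function names are my own.

```python
import math

def pearson_from_sums(xs, ys):
    """First formula: uses n, sum x, sum y, sum xy, sum x^2, sum y^2."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2 = sum(x * x for x in xs)
    sy2 = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

def pearson_from_deviations(xs, ys):
    """Second formula: uses deviations from x-bar and y-bar."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    dx = [x - x_bar for x in xs]
    dy = [y - y_bar for y in ys]
    top = sum(a * b for a, b in zip(dx, dy))
    return top / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Hypothetical data, NOT the lecture's table.
hours = [10, 20, 30, 40, 50]
scores = [35, 40, 62, 58, 80]

r1 = pearson_from_sums(hours, scores)
r2 = pearson_from_deviations(hours, scores)
print(r1, r2)
```

Which one to use is mostly a matter of convenience: the sums form suits a hand calculator, while the deviations form makes it easier to see what is being measured.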

  1. Ethereum Security

The Pearson correlation coefficient measures the strength of the linear relationship between y and x, that is, how strongly the two are connected. If an increase in one goes with an increase in the other, or an increase in one goes with a decrease in the other, then we say there is a strong relationship. If an increase or decrease in one doesn't affect the other, then we say there is no correlation, or very little correlation, between the two variables. The correlation coefficient r has a value between minus one and plus one.

Minus one shows a perfect negative relationship, plus one shows a perfect positive relationship, and zero shows that there is no linear relationship. Let me draw a scale from minus one to plus one, with zero in the middle. The value of r will be somewhere on this scale. In our case, r came out to be 0.87 on the plus side, so it sits here, fairly close to plus one. Being positive, it shows a positive relationship, and being near one, it shows a strong relationship. There is no hard and fast rule, but generally a value greater than 0.6 or 0.7 means there is a strong relationship.

So on the plus side, 0.6 and above shows a strong positive relationship. On the other hand, −0.6 and on towards minus one also shows a strong relationship, but a negative one. We have seen an example of a positive relationship here: if a student studies for more hours, the marks tend to increase. Hours increase, marks increase. For an example of a negative relationship, take the number of hours a student watches TV and compare that with the marks obtained. This might give us a negative relationship.

The more hours a student watches TV, the less chance there is of getting a higher score. That could come out as a negative relationship. With this basic understanding, let's look at some examples of values of r. Take the case where r is plus one. Let's say this is my x and this is my y. r will be one when all the points fall on a straight line, so that with an increase in x there is a proportional increase in y.

If I draw a line joining all these points, the line will be straight. In this case r is equal to plus one. An example of r equal to minus one looks like this: with an increase in x, there is a proportional decrease in y, and everything is on a straight line. Here the value of r will be minus one. Let's take another example, somewhere around 0.6 or 0.7. Here is my x and here is my y, and the points are not perfectly on a straight line, but they show a trend that y increases as x increases. If I had to draw a straight line through them, it would be something like this. And how do we draw this straight line?

We will talk about that in regression. There is a way to find the straight line which best fits these points. For now, look at this case where the points are not all aligned, but there is a general trend that y increases as x increases. Here the value of r might come out to be 0.6 or 0.7 on the plus side. Similarly, a negative example looks like this. Here is x, here is y. As x increases, y decreases, but not perfectly on a straight line. If I draw a rough straight line which more or less fits these points, the value of r will be around −0.6 or −0.7. Now let's take another example, where the value of r is around zero.

Around zero means there is no linear relationship between x and y. That scatter plot will show points spread all around, so there is no good way to say whether y is increasing or decreasing as x increases. Here the value of r will be close to zero. Let's quickly take one more example, where r is on the lower side, say 0.1, 0.2 or 0.3. That means there is some sort of trend, but the points are spread widely.

That example will look something like this. The points generally show a trend that y increases as x increases, but they are spread too far from the line which fits them. This might lead to an r of roughly 0.3 or 0.4. So by visually looking at a scatter plot, you can make a very basic judgment about what the value of r could be. But the best way to find the value of r is to calculate it with the formula we discussed earlier.
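The two extreme cases described above are easy to verify numerically. In this sketch the x values and the two straight lines are made up for illustration, and `pearson_r` is my own helper implementing the sums formula from earlier in the lecture.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient using the sums formula."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2 = sum(x * x for x in xs)
    sy2 = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

xs = [1, 2, 3, 4, 5]
rising = [2 * x + 1 for x in xs]     # every point on an upward straight line
falling = [10 - 3 * x for x in xs]   # every point on a downward straight line

r_rising = pearson_r(xs, rising)
r_falling = pearson_r(xs, falling)
print(r_rising, r_falling)
```

Points exactly on a rising line give r of plus one and points exactly on a falling line give r of minus one, regardless of how steep the line is.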

So far in correlation we have been calculating the value of r, the Pearson correlation coefficient. When we say r, we mean a sample correlation. In the example we took, we had ten students with their hours studied and marks obtained, and these were a sample from the population. The correlation coefficient is represented in two ways, r and ρ (rho). r is for a sample, and that is what we calculated: we took a sample of ten students and found the sample correlation coefficient, which came out to be 0.87.

Let's say we take another ten students and do the same thing for them. Will the correlation coefficient be the same? Probably not. For the next sample of ten students, the correlation coefficient might come out to be, say, 0.74. For another ten students it might come out to be 0.80, and for yet another set it might be 0.91. It varies because we are picking samples. But there is also a correlation coefficient for the whole population.

If we do this exercise for the whole school, the whole city or the whole country, whatever the population is, the correlation coefficient we calculate is denoted by ρ (rho). So when you see these two symbols, you should understand whether we are talking about the sample or about the population.
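The idea that r changes from sample to sample while rho stays fixed can be illustrated with a small simulation. Everything in this sketch is hypothetical: the population of 5000 students, the rule used to generate their scores, the noise level and the random seed are all assumptions chosen only to show the effect.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient using the sums formula."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2 = sum(x * x for x in xs)
    sy2 = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: score depends on hours plus random noise.
population = []
for _ in range(5000):
    hours = random.uniform(10, 70)
    score = 16 + hours + random.gauss(0, 10)
    population.append((hours, score))

# rho: correlation over the whole population.
rho = pearson_r([h for h, _ in population], [s for _, s in population])

# r: correlation in each of four samples of ten students.
sample_rs = []
for _ in range(4):
    sample = random.sample(population, 10)
    sample_rs.append(pearson_r([h for h, _ in sample], [s for _, s in sample]))

print("rho:", round(rho, 3))
print("sample r values:", [round(r, 3) for r in sample_rs])
```

Each sample of ten gives a somewhat different r, scattered around the single population value.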

  1. Etherscan

Now, if you take the square of r, that r squared is called the coefficient of determination. What does the coefficient of determination tell you? It tells you the proportion of the variation in the dependent variable that is predicted from the independent variable. Let's understand this in simple terms using our example. In our case r was 0.87 or 0.88, and r squared comes out to be 0.77: if you multiply 0.88 by 0.88, you get about 0.77. This tells you that 77% of the variation in the marks obtained can be explained by the number of hours studied. In other words, hours studied accounts for 77% of why marks change from one student to another.

So this is a very strong factor. What we are studying here is the relationship between one x and one y. But in reality, marks obtained will not depend on hours studied alone. In the equation for marks obtained, we might have to include how many books the student has studied, what knowledge the student had earlier, and many other factors. If you have, say, four or five factors which are indicators of marks obtained, then r squared can help you find which of them is the main factor. Some factors might contribute 20% of the variation in y, and some might contribute 60%.

So you need to look at which factors are the most important. This is what you would do when you have multiple x's, or multiple inputs, and that is something we are not covering in this course. The coefficient of determination tells you what proportion of the variation is due to this particular input. In our case, we concluded that 0.77, or 77%, of the variation in the marks obtained is due to the number of hours a student studies. Now, r was something between minus one and plus one, as we have seen. When we square it, r squared will be anything between zero and plus one, because the square of minus one is also plus one.
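To see why r squared is read as "proportion of variation explained", here is a minimal sketch. For a least-squares line with one input, r squared equals one minus the ratio of the leftover (residual) variation to the total variation in y. The five data points are hypothetical, not the lecture's table, and the variable names are my own.

```python
import math

# r squared for the lecture's value of r.
print(round(0.879 ** 2, 2))  # 0.77

# Hypothetical data, NOT the lecture's table.
hours = [10, 20, 30, 40, 50]
scores = [35, 40, 62, 58, 80]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n

sxx = sum((x - x_bar) ** 2 for x in hours)
syy = sum((y - y_bar) ** 2 for y in scores)   # total variation in y
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))

r = sxy / math.sqrt(sxx * syy)

# Least-squares line, then the variation it leaves unexplained.
slope = sxy / sxx
intercept = y_bar - slope * x_bar
residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(hours, scores))

explained_share = 1 - residual / syy
print(round(r ** 2, 4), round(explained_share, 4))
```

The two printed values on the last line agree, which is the sense in which r squared measures the share of variation in y accounted for by x.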

  1. Ethernodes

Before you attempt to calculate a correlation, remember one statement: correlation doesn't imply causation. Let's understand this with some simple examples. When the Greek economy was failing, the bond interest rate kept on increasing. If you have the data on the bond interest rate for those years, whichever years they were, and you plot it, you will see that there was an increase. So this is the Greek bond interest rate. Similarly, if you have data on Facebook users during that same phase, the number of Facebook users also kept on increasing. Let's draw this in blue.

Now if you take these two things and look for the relationship between the Greek economy and Facebook users, you will find a very high value of r. That would suggest a very strong relationship between the failure of the Greek economy and the increase in Facebook users. But in reality, there is absolutely no causal connection between these two things.

So before you pick up your calculator and find the value of r, ask yourself whether it makes any sense. Are these two variables connected to each other in any way or not? That's something you need to understand first. The second thing is: which is x and which is y? Take the example of rising temperature and rising ice cream sales. The temperature is the input and the ice cream sales are the output, and you will see that there is a good relationship between the two.

As temperature increases, ice cream sales also increase, but that does not imply the other way around, that an increase in ice cream sales will increase the temperature. Temperature doesn't rise because ice cream sales have gone up. Rather, it's the other way around: when temperature rises, ice cream sales increase. The third important thing is that sometimes there is a third variable sitting between the two things you are looking at. Let's go back to temperature and ice cream sales.

Temperature goes up, ice cream sales go up, and as temperature goes up, the heat stroke cases coming to the hospital also increase. Now, if you have data on ice cream sales versus heat strokes, you cannot conclude that heat strokes increase because ice cream sales increase. These two things are not directly connected. They move together because of a third thing, which is the temperature. As temperature goes up, heat strokes go up, and as temperature goes up, ice cream sales go up. But you really cannot connect ice cream sales and heat strokes directly.

So this is another thing you need to understand. In summary, whenever you have data for which you want to find the correlation coefficient or a regression equation, make sure that the two variables are practically connected with each other. The relationship should make practical sense.
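The third-variable effect can be demonstrated with a small simulation. In this sketch, both ice cream sales and heat stroke cases are generated from temperature plus independent noise, and neither is computed from the other. All numbers here (temperature range, coefficients, noise levels, seed) are made-up assumptions for illustration only.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient using the sums formula."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2 = sum(x * x for x in xs)
    sy2 = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical daily data: temperature drives both outputs independently.
temperature = [random.uniform(20, 40) for _ in range(200)]
ice_cream_sales = [10 * t + random.gauss(0, 20) for t in temperature]
heat_strokes = [0.5 * t + random.gauss(0, 1) for t in temperature]

r_ice_vs_strokes = pearson_r(ice_cream_sales, heat_strokes)
print(round(r_ice_vs_strokes, 3))
```

The correlation between ice cream sales and heat strokes comes out strongly positive even though, by construction, the only link between them is temperature.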
