Statistics for Data Science

Why data science?

“Data science is the new gold mine.” The assertion carries real weight in today's business world, where most corporate decisions are data-driven. By commonly cited estimates, about 2.5 quintillion bytes of data are produced worldwide every day, and the daily figure is predicted to reach about 463 exabytes by 2025 (1 exabyte = 10^18 bytes, so 463 exabytes = 463 × 10^18 bytes). Building a career in data science is therefore a worthwhile choice, and a good start requires an understanding of “Statistics for Data Science”.

What is Statistics?

Statistics is a branch of applied mathematics concerned with collecting, describing, and analysing data and drawing conclusions from it. In practice this means analysing primary data, building a statistical model, and using that model to predict outcomes.

Need for Statistics 

Statistics is used everywhere from day-to-day life to the corporate world: the stock market, weather forecasting, shopping, and so on. Before learning statistics for data science, one should know these important terms:

  • Population – The entire group about which data is to be collected.
  • Sample – A subset of the population that is actually observed.
  • Variable – A characteristic that can differ from one member of the population to another.

Statistics is divided into two major categories:

  1. Descriptive statistics – Descriptive statistics organise data and summarise its main attributes, either numerically or graphically. For example, to study the weight of every individual sitting in a room, one would record each person's weight and then find the maximum, average, and minimum weight in the room.

  2. Inferential statistics – Inferential statistics work from a small, manageable sample drawn from a large data set and apply probability theory to draw conclusions about the whole. Continuing the example above, instead of weighing everyone in the room, one would weigh a sample (a small group) of the population and use it to estimate the weights of the whole room.
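The difference between the two categories can be sketched in a few lines of Python. This is only an illustration: the weights are made-up values, and the sample size of 4 and the fixed random seed are arbitrary assumptions.

```python
import random
import statistics

# Hypothetical weights (kg) of everyone in the room: the whole population.
weights = [52.0, 61.5, 70.2, 58.3, 80.1, 66.4, 74.9, 59.8, 68.0, 63.7]

# Descriptive statistics: summarise the data that was actually recorded.
summary = {
    "min": min(weights),
    "max": max(weights),
    "mean": statistics.mean(weights),
}

# Inferential statistics: weigh only a random sample and use it to
# estimate the mean weight of the whole room.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = rng.sample(weights, k=4)
estimated_mean = statistics.mean(sample)

print(summary)
print(sample, estimated_mean)
```

The estimate from the sample will usually differ a little from the true mean; quantifying that uncertainty is what inferential methods are for.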

Distribution

A distribution is the collection of values (scores) that a variable takes, together with how often each value occurs. The values are arranged in order so that they can be presented graphically, for example as a histogram.

Univariate statistics

Univariate statistics refers to statistical analysis that involves a single dependent variable; it may include one or more independent variables. Under ideal conditions, inferential methods let the analyst deduce the relationship between the dependent and independent variables and generalise the results from the smaller sample to the larger population.

Bivariate statistics

Bivariate statistics deals with the relationship between two variables. It helps describe the strength of that relationship, for example through a correlation coefficient.
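One common way to measure the strength of a relationship between two variables is the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below computes it by hand; the function name pearson_r and the height and weight values are illustrative assumptions, not data from any real study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical heights (cm) and weights (kg) for five people.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 75, 88]

r = pearson_r(heights, weights)
print(round(r, 3))
```

A value close to 1 means the two variables rise together almost linearly; a value near 0 means there is little linear relationship.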

Multivariate statistics

Multivariate statistics is used to examine the joint behaviour of more than one random variable. Analysing the variables together, rather than running a separate test for each one, helps keep the overall Type 1 error rate under control.

Type 1 error – Also known as a false positive, it occurs when an analyst rejects a null hypothesis that is actually true.
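To see why running many separate tests is risky, consider the chance of getting at least one false positive across several tests. The sketch below assumes the tests are independent and each uses the same significance level alpha; the function name familywise_error_rate is chosen here for illustration.

```python
def familywise_error_rate(alpha, num_tests):
    """Chance of at least one false positive across independent tests."""
    # Each test avoids a false positive with probability (1 - alpha).
    return 1 - (1 - alpha) ** num_tests

alpha = 0.05  # conventional significance level, used here as an example
for m in (1, 5, 10, 20):
    print(m, round(familywise_error_rate(alpha, m), 3))
```

With a single test the chance of a false positive is just alpha, but under these assumptions it grows quickly as more tests are added.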

Probability

Probability measures how likely an event or circumstance is to occur. When all outcomes are equally likely, it is the ratio of the number of favourable outcomes to the total number of possible outcomes. Probability calculations rest on two principles: the Addition principle and the Multiplication principle.

The Addition principle applies to mutually exclusive events: P(A or B) = P(A) + P(B). The Multiplication principle applies to independent events: P(A and B) = P(A) P(B).

So what are mutually exclusive events and independent events?

Mutually exclusive events – Two events A and B are mutually exclusive if they cannot occur together: when A occurs, B does not, and vice versa.

Independent events – Two events A and B are independent if the outcome of A does not affect the outcome of B, and vice versa.
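Both principles can be checked by counting outcomes. The sketch below uses two fair six-sided dice as an assumed example and exact fractions, so the equalities hold without rounding; the helper prob is a name introduced only for this illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: every ordered outcome of rolling two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Favourable outcomes divided by total outcomes."""
    favourable = [o for o in outcomes if event(o)]
    return Fraction(len(favourable), len(outcomes))

def first_is_1(o):
    return o[0] == 1

def first_is_2(o):
    return o[0] == 2

def second_is_6(o):
    return o[1] == 6

# Mutually exclusive: the first die cannot show 1 and 2 at the same time,
# so the Addition principle applies.
p_either = prob(lambda o: first_is_1(o) or first_is_2(o))
assert p_either == prob(first_is_1) + prob(first_is_2)

# Independent: the first die does not influence the second,
# so the Multiplication principle applies.
p_both = prob(lambda o: first_is_1(o) and second_is_6(o))
assert p_both == prob(first_is_1) * prob(second_is_6)

print(p_either, p_both)
```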

Bayes’ Theorem

Bayes’ Theorem is named after the English statistician Thomas Bayes. It gives the probability of an event based on prior knowledge of conditions that may be related to the event.

     P(A|B)= (P(A) P(B|A))/P(B)

  • P(A|B) – the probability of event A occurring, given event B has occurred
  • P(B|A) – the probability of event B occurring, given event A has occurred
  • P(A) – the probability of event A
  • P(B) – the probability of event B
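As a worked example of the formula, consider a medical test. All numbers below are hypothetical and chosen only for illustration: A is "the person has the disease" and B is "the test is positive". P(B) is not given directly, so the sketch derives it by adding up the two ways a positive test can happen.

```python
def bayes_posterior(prior_a, likelihood_b_given_a, prob_b):
    """P(A|B) = P(A) * P(B|A) / P(B)."""
    return prior_a * likelihood_b_given_a / prob_b

# Hypothetical numbers, not taken from any real test.
p_disease = 0.01            # P(A): 1% of people have the disease
p_pos_given_disease = 0.95  # P(B|A): test detects 95% of true cases
p_pos_given_healthy = 0.05  # test wrongly flags 5% of healthy people

# P(B): a positive result comes either from a sick person or a healthy one.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

posterior = bayes_posterior(p_disease, p_pos_given_disease, p_pos)
print(round(posterior, 3))
```

Under these assumed numbers the probability of having the disease given a positive test is only about 16%, because healthy people greatly outnumber sick ones.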

Binomial Distribution

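The binomial distribution gives the probability of obtaining a given number of successes when an event with two possible outcomes (success or failure) is repeated a fixed number of times, independently and with the same probability of success each time.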

     b(x; n, P) = nCx * P^x * (1 – P)^(n – x)

Where:

b = binomial probability

x = number of “successes” in the n trials (a success being one of the two outcomes, such as pass rather than fail, or heads rather than tails)

P = probability of success on an individual trial

n = number of trials (nCx is the number of ways of choosing x successes out of n trials)
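The binomial formula translates directly into code. The sketch below uses math.comb from the Python standard library for nCx; the function name binomial_pmf and the coin-flip numbers are assumptions made for illustration.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Example: probability of exactly 3 heads in 10 flips of a fair coin.
n_flips = 10
p_heads = 0.5
p_three_heads = binomial_pmf(3, n_flips, p_heads)
print(p_three_heads)

# The probabilities over all possible values of x add up to 1.
total = sum(binomial_pmf(x, n_flips, p_heads) for x in range(n_flips + 1))
print(total)
```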

Poisson distribution

A Poisson random variable counts the number of successes (events) in a time interval and satisfies these conditions:

  • The numbers of successes in two disjoint time intervals are independent.
  • The probability of a success during a small time interval is proportional to the length of that interval.
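The conditions above lead to the standard Poisson formula, which is not written out in this article: the probability of exactly k successes is lam^k * e^(-lam) / k!, where lam is the average number of successes per interval. A minimal sketch follows, with the function name poisson_pmf and the rate of 3 chosen only as an example.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count is lam."""
    return lam ** k * exp(-lam) / factorial(k)

# Hypothetical rate: on average 3 events per interval.
rate = 3.0
for k in range(6):
    print(k, round(poisson_pmf(k, rate), 4))
```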

Normal distribution function

The normal distribution is also called the Gaussian distribution. It is the most commonly used distribution for independent, randomly generated variables, and it has a symmetric, bell-shaped curve described by its mean and standard deviation.
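Python's standard library includes statistics.NormalDist, which can be used to explore the normal distribution. The sketch below uses the standard normal (mean 0, standard deviation 1) as an assumed example and computes how much probability lies within one and two standard deviations of the mean.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard = NormalDist(mu=0.0, sigma=1.0)

# cdf(x) is the probability of a value less than or equal to x, so the
# difference of two cdf values is the probability of falling between them.
within_one_sd = standard.cdf(1.0) - standard.cdf(-1.0)
within_two_sd = standard.cdf(2.0) - standard.cdf(-2.0)

print(round(within_one_sd, 4), round(within_two_sd, 4))
```

Roughly 68% of values fall within one standard deviation of the mean and roughly 95% within two, which is why the bell curve is so concentrated around its centre.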

This brings us to the end of the blog on statistics for data science. We hope that you were able to gain some insight into the world of statistics for data science. Happy Learning!