Introduction to Probability and Statistics

Probability and statistics are two closely related disciplines that play a foundational role in understanding and analyzing data, uncertainty, and variability. While probability provides the theoretical framework for dealing with uncertainty, statistics applies this framework to make sense of data and draw meaningful conclusions.

What is Probability?

Probability is the branch of mathematics concerned with quantifying uncertainty. It provides a numerical measure (ranging from $0$ to $1$) of how likely an event is to occur.

  • A probability of $0$ means that the event is impossible.
  • A probability of $1$ means that the event is certain.

Probability theory is built on fundamental concepts such as random experiments, events, and probability distributions. It answers questions like, “What is the likelihood of rolling a 6 on a die?” or “How often will a coin flip land heads?”

These principles serve as the backbone for understanding randomness in the natural and social worlds.
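
The die question above can be explored by simulation. The following is a minimal sketch, assuming a fair six-sided die and using Python's standard `random` module; the function name `estimate_probability_of_six` and the seed are illustrative choices, not part of the original text. It estimates the probability of rolling a 6 as a relative frequency and compares it with the theoretical value $1/6$.

```python
import random

def estimate_probability_of_six(num_rolls, seed=0):
    """Estimate P(rolling a 6) as the relative frequency over simulated rolls."""
    rng = random.Random(seed)  # seeded so that the run is reproducible
    sixes = sum(1 for _ in range(num_rolls) if rng.randint(1, 6) == 6)
    return sixes / num_rolls

theoretical = 1 / 6  # assumes a fair die
for n in (100, 10_000, 100_000):
    estimate = estimate_probability_of_six(n)
    print(f"{n:>7} rolls: estimate = {estimate:.4f}, theoretical = {theoretical:.4f}")
```

With more rolls, the estimate tends to settle closer to the theoretical value, although any single run can still deviate.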

Definition:

Randomness

Randomness refers to the lack of pattern or predictability in a sequence of events. In probability theory, randomness describes an outcome or process that cannot be determined in advance but follows a specific probabilistic distribution over the long run.

Example 1

Examples of random events:

  • The number of tails that appear in 10 coin flips
  • The number announced in the next lottery draw
  • The number of students who arrive late to class
  • The number of defective items in a production batch
  • The number of emails received in a day
  • The number of accidents at a busy intersection
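
The first item in this list can be simulated to show how individual outcomes are unpredictable while their long-run frequencies follow a distribution. This is a sketch under stated assumptions: the coin is fair, the flips are independent, and the comparison column uses the binomial formula, which the text has not introduced. The names `count_tails`, `NUM_FLIPS`, and `NUM_EXPERIMENTS` are illustrative.

```python
import math
import random
from collections import Counter

NUM_FLIPS = 10            # flips per experiment
NUM_EXPERIMENTS = 20_000  # how many times the experiment is repeated

rng = random.Random(1)  # seeded for reproducibility

def count_tails(num_flips):
    """Flip a fair coin num_flips times and return the number of tails."""
    return sum(rng.random() < 0.5 for _ in range(num_flips))

counts = Counter(count_tails(NUM_FLIPS) for _ in range(NUM_EXPERIMENTS))

for k in range(NUM_FLIPS + 1):
    observed = counts[k] / NUM_EXPERIMENTS
    # Binomial probability of exactly k tails in NUM_FLIPS fair flips (assumed model)
    expected = math.comb(NUM_FLIPS, k) * 0.5 ** NUM_FLIPS
    print(f"{k:>2} tails: observed {observed:.3f}, binomial {expected:.3f}")
```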

Random Variables

    Many experiments can be modelled by probability distributions, and the choice of which distribution to use depends on the type of variable that we are interested in.

    Definition:

    Variable

    A variable is a characteristic or attribute that can assume different values. Variables can be classified as either quantitative or qualitative.

    Qualitative variables are described by names or labels. Quantitative variables take numerical values that can be counted or measured.

    Example 2

    In 2021, someone took a bite of Kellogg's Strawberry Pop-Tarts and cried foul, not because they weren't tasty, but because they didn't have enough actual strawberries. The lawsuit claimed the filling was more apple and pear than berry, accusing Kellogg's of fruity deception. Kellogg's defense? “Nobody buys Pop-Tarts expecting a farmer's market in the filling.” The judge sided with Kellogg's and the case was thrown out.

    100 people were surveyed about their Pop-Tart preferences. Classify each of the following variables as either qualitative or quantitative:

    • The flavor of the Pop-Tarts
    • The number of Pop-Tarts consumed in a week
    • Their reason for buying Pop-Tarts (convenience, nostalgia, etc.)
    • The temperature at which the Pop-Tarts are toasted

    Solution

    • Flavor of the Pop-Tarts - Qualitative (since it uses names like Strawberry, Blueberry, etc.)
    • Number of Pop-Tarts consumed - Quantitative (since it involves numerical values like 1, 2, 3, ...)
    • Reason for purchase - Qualitative (since it uses categories or labels)
    • Toasting temperature - Quantitative (since it involves numerical values like 350°F, 375°F, ...)
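
The classification in this solution can be mirrored in code. The sketch below is only a heuristic: it assumes that a variable stored as a number is quantitative and anything else is qualitative. The survey response, its field names, and the `classify` helper are invented for illustration.

```python
from numbers import Number

# Hypothetical response from one surveyed person (values invented for illustration)
response = {
    "flavor": "Strawberry",
    "pop_tarts_per_week": 4,
    "reason_for_buying": "nostalgia",
    "toasting_temperature_f": 350.0,
}

def classify(value):
    """Label a value as quantitative if it is a number, otherwise qualitative."""
    # bool is a subclass of int in Python, so it is excluded explicitly
    if isinstance(value, Number) and not isinstance(value, bool):
        return "quantitative"
    return "qualitative"

labels = {name: classify(value) for name, value in response.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

The type check is not a substitute for judgment. Numeric codes such as postal codes or jersey numbers are stored as numbers but act as labels, so they are qualitative.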


    Quantitative or numerical variables can be further divided into two groups: discrete and continuous.

    Definition:

    Discrete Random Variable

    A discrete random variable is a variable that can take on a countable number of distinct values.

    Definition:

    Continuous Random Variable

    A continuous random variable is a variable that can take on any value within a given interval, so its possible values are uncountably infinite.

    Example 3

    In 2021, Molson Coors got served, not with a drink but with a class action lawsuit accusing them of suggesting that their pineapple-and-mango-flavored Vizzy Hard Seltzers were sources of Vitamin C “nutritionally-equivalent to actual pineapples and mangos”. The plaintiffs argued that fortifying alcoholic drinks with vitamins could mislead consumers into perceiving them as healthy options. Molson Coors paid out $\$9.5$ million to settle the case.

    After the lawsuit, the marketing team decided to analyze customer behavior and product data to better understand their audience. They collected the following variables:

    • The volume of liquid (in millilitres) contained in each can
    • Number of Vizzy Hard Seltzers purchased in the last month
    • The rating (out of 5 stars) customers gave to the pineapple-and-mango Vizzy
    • The volume of alcohol in the beverage


    Classify each variable as either discrete or continuous.

    Solution

    • Volume in each can - Continuous (since volume can take on any value within a range, e.g. 355 ml, 350.84 ml, ...)
    • Number of Seltzers purchased - Discrete (since it involves whole, countable numbers)
    • Star rating - Discrete (since ratings come from a countable set of values such as 1, 2, 3, 4, or 5; this remains true even if half-star ratings like 4.5 are allowed)
    • Alcohol content - Continuous (since alcohol content can vary with arbitrary precision, such as 5.0%, 5.02%, or 4.98%)
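
One practical way to see the difference is to simulate both kinds of variable and count how many distinct values appear. This is a rough sketch with invented numbers: the purchase range of 0 to 12 cans and the normally distributed fill volume around 355 ml are assumptions made purely for illustration.

```python
import random

rng = random.Random(2)  # seeded for reproducibility
N = 1_000               # number of simulated observations

# Discrete: cans purchased last month, a whole-number count (range 0 to 12 is assumed)
purchases = [rng.randint(0, 12) for _ in range(N)]

# Continuous: fill volume in millilitres, assumed to vary normally around 355 ml
volumes = [rng.gauss(355.0, 1.5) for _ in range(N)]

print("distinct purchase counts:", len(set(purchases)))
print("distinct fill volumes:   ", len(set(volumes)))
print("all purchase counts are integers:", all(isinstance(p, int) for p in purchases))
```

The discrete variable repeats the same few values, while the continuous one almost never produces the same value twice. In practice, measured continuous values are rounded by the instrument, but the underlying quantity is still treated as continuous.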


    What is Statistics?

    Statistics is the science of collecting, analyzing, interpreting, and presenting data. It involves the study of data variability, uncertainty, and the relationships between variables. Statistics can be broadly divided into two categories:

    • Descriptive statistics concerns the organization, presentation, and numerical summary of data. Its goal is to structure and summarize data in a compact form.
    • Inferential statistics, on the other hand, concerns the methods and procedures used to draw conclusions from data. The goal here is to make a statement about a population based on information collected from a sample.

    Definition:

    Population

    A population is the set of all objects that are of interest to the statistician.

    Definition:

    Sample

    A sample is a subset of the population that is selected for study.

    Example 4

    In 2021, some chip lovers felt duped by Tostitos Hint of Lime chips and sued Frito-Lay, claiming the only thing lime-related about the chips was the picture of a lime on the bag. The chips, it turned out, got their citrusy flavor from mysterious “natural flavors” instead of actual lime, because why use a lime when you can just hint at it? Frito-Lay defended itself by saying, “We never promised a squeeze, just a sprinkle of imagination.” The case is still pending, leaving us to ponder the fine line between a snack and a citrus-flavored lie.

    A market research firm surveyed 1000 chip enthusiasts to understand their preferences. Classify each of the following as either a population or a sample:

    • The 1000 chip enthusiasts surveyed
    • All chip enthusiasts worldwide

    Solution

    • The 1000 chip enthusiasts surveyed - Sample (since it is a subset of the larger group of all chip enthusiasts worldwide)
    • All chip enthusiasts worldwide - Population (since it is the entire group of interest)
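
The relationship between a population and a sample can be sketched in code. Everything here is hypothetical: the population of 100,000 enthusiasts and their weekly chip consumption are simulated, since a real worldwide population could not be listed. The sketch draws a simple random sample of 1000 and compares its mean with the population mean.

```python
import random
import statistics

rng = random.Random(3)  # seeded for reproducibility

# Hypothetical population: bags of chips eaten per week by 100,000 enthusiasts
population = [rng.randint(0, 10) for _ in range(100_000)]

# A simple random sample of 1000 members, drawn without replacement
sample = rng.sample(population, k=1000)

print(f"population mean: {statistics.mean(population):.3f}")
print(f"sample mean:     {statistics.mean(sample):.3f}")
```

The sample mean is usually close to the population mean but rarely identical to it. Quantifying that gap is the job of inferential statistics.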


    How are Probability and Statistics Connected?

    Probability and statistics are interconnected in many ways. Probability provides the theoretical underpinnings for statistical inference. In other words, probability allows statisticians to model uncertainty and variability in data, while statistics applies these models to real-world data to draw conclusions. For example:

    • In hypothesis testing, we use probability to determine how likely an observed result is, assuming a specific hypothesis is true.
    • In estimation, we use probability to quantify the uncertainty in our estimates and construct confidence intervals.
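
As a small illustration of the hypothesis-testing idea, suppose a coin lands heads 60 times in 100 flips (hypothetical data). Assuming the coin is fair, the sketch below computes the exact binomial probability of seeing at least 60 heads. The helper name `binomial_tail` is illustrative, and the one-sided calculation is just one of several reasonable ways to set up such a test.

```python
import math

def binomial_tail(n, k, p=0.5):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical data: 60 heads observed in 100 flips
p_value = binomial_tail(n=100, k=60)
print(f"P(at least 60 heads in 100 fair flips) = {p_value:.4f}")
```

A small probability suggests that the observed result would be unusual if the coin were fair, which counts as evidence against that hypothesis.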

    Connecting Probability and Statistics to Inferential Techniques

    Inferential statistics relies heavily on probability to draw conclusions about a population based on a sample. Here's how the two concepts connect to inferential techniques:

    • Confidence Intervals: Probability helps quantify the uncertainty in an estimate, allowing statisticians to create intervals that likely contain a population parameter.
    • Hypothesis Testing: Probability helps determine the likelihood of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true.
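
The confidence-interval idea can be sketched for a survey proportion. This example assumes a normal approximation (often called the Wald interval), which is one common method rather than the only one, and the survey counts are hypothetical. The function name `proportion_confidence_interval` is illustrative.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n  # sample proportion
    # Critical value of the standard normal, about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Hypothetical survey: 620 of 1000 respondents prefer a given flavor
low, high = proportion_confidence_interval(620, 1000)
print(f"95% CI for the population proportion: ({low:.3f}, {high:.3f})")
```

The approximation works best for large samples with proportions that are not close to 0 or 1.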

    Through inferential techniques, probability and statistics allow us to make informed decisions in the presence of uncertainty, whether it's predicting election outcomes, understanding disease spread, or optimizing business strategies. Together, these disciplines are essential tools in a data-driven world, empowering us to extract meaning and insight from randomness and variability.