Descriptive Statistics

What are descriptive statistics?
Descriptive statistics are simple numerical and graphical techniques that summarize the main features of a dataset. They reduce large lists of observations into a few informative numbers and charts that describe what the data show, without drawing conclusions that go beyond those observations.

Key definitions (jargon defined on first use)
– Variable: any measured characteristic (for example, price, age, or test score).
– Population: the complete set of items or people you could measure.
– Sample: a subset taken from the population.
– Mean (average): the sum of all values divided by the number of values.
– Median: the middle value when the data are ordered.
– Mode: the most frequent value.
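– Variance: the average squared deviation from the mean; measures spread. For a population the sum of squared deviations is divided by n; for a sample it is usually divided by n − 1.
– Standard deviation: the square root of the variance; expressed in the same units as the data.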
– Range: difference between maximum and minimum.
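– Skewness: a measure of the asymmetry of the distribution (a longer right tail gives positive skew, a longer left tail gives negative skew).
– Kurtosis: a measure of “tailedness” (how much of the distribution's weight lies in its tails, and so how prone it is to extreme values).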
– Outlier: an observation that lies far from most other values.
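
As a minimal sketch of the definitions above, the following uses Python's standard `statistics` module on a small, made-up list of scores. The data and variable names are illustrative assumptions, not values from any real dataset.

```python
import statistics

# Hypothetical sample of test scores, chosen only for illustration.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(scores)            # sum of values / number of values
median = statistics.median(scores)        # middle of the ordered values
mode = statistics.mode(scores)            # most frequent value
value_range = max(scores) - min(scores)   # maximum minus minimum

# Population formulas divide by n; sample formulas divide by n - 1.
pop_variance = statistics.pvariance(scores)
pop_std = statistics.pstdev(scores)
sample_variance = statistics.variance(scores)

print(mean, median, mode, value_range)
print(pop_variance, pop_std, sample_variance)
```

Which variance to report depends on whether the list is treated as the whole population or as a sample drawn from a larger one.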

What descriptive statistics do (summary)
– Describe the center or typical value of a dataset (central tendency).
– Describe the amount of spread or dispersion (variability).
– Show shape characteristics of the distribution (symmetry, tails).
– Provide compact summaries for reporting, dashboards, or initial data checks.
They do not, by themselves, prove relationships between variables or make predictions about new or unseen data. Those tasks belong to inferential statistics.

Types — brief overview
1. Measures of central tendency: mean, median, mode.
2. Measures of variability (spread): range, interquartile range (IQR), variance, standard deviation, mean absolute deviation.
3. Distribution descriptors: skewness and kurtosis; frequency counts.
4. Visual summaries: histograms, scatter plots, line charts, and stem-and-leaf displays.
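
The spread and shape measures listed above can be sketched in a few lines. This example assumes the moment-based definitions of skewness and excess kurtosis (kurtosis minus 3, so a normal distribution scores about 0) and the default "exclusive" quartile method of `statistics.quantiles`. Other tools use slightly different conventions, so treat the function name `shape_summary` and its outputs as illustrative.

```python
import statistics

def shape_summary(values):
    """Return IQR, mean absolute deviation, skewness, and excess kurtosis."""
    n = len(values)
    mean = statistics.fmean(values)

    # Central moments: average of (x - mean) raised to the 2nd, 3rd, 4th power.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("all values are identical; shape is undefined")

    # Quartiles split the ordered data into four parts.
    q1, _, q3 = statistics.quantiles(values, n=4)

    return {
        "iqr": q3 - q1,
        "mean_abs_dev": sum(abs(x - mean) for x in values) / n,
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3,
    }

symmetric = shape_summary([1, 2, 3, 4, 5])
right_skewed = shape_summary([1, 1, 2, 2, 3, 10])
print(symmetric)
print(right_skewed)
```

A symmetric list gives skewness of zero, while a list with one large value on the right gives positive skewness.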

Univariate versus bivariate (and multivariate)
– Univariate: summarizes one variable at a time (e.g., the average age of the people in a room).
– Bivariate: looks at two variables and their relationship (e.g., age vs. test score).
– Multivariate: analyzes more than two variables jointly; used to explore interactions and patterns across several measurements.
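
For the bivariate case, one common descriptive summary is the Pearson correlation coefficient, which is not named in the text above and is used here only as an example. The sketch below computes it from scratch; the function name `pearson_r` and the age and score values are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of products of paired deviations, and each variable's own spread.
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return co_dev / (dev_x * dev_y)

# Made-up paired observations: age and test score for five people.
ages = [21, 25, 30, 35, 40]
test_scores = [62, 68, 71, 80, 84]
print(round(pearson_r(ages, test_scores), 3))
```

A value near +1 or −1 describes a strong linear pattern in the observed pairs. It remains a description of this sample and does not, by itself, establish a relationship in the wider population.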

Why visuals matter
Graphs help reveal patterns that single numbers can hide. For example, two datasets can have the same mean but very different spreads and shapes. Common plots:
– Histogram or density plot for a single-variable distribution.
– Box plot to show median, quartiles, and outliers.
– Scatter plot for relationships between two variables.
– Line chart for series over time.
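
To illustrate the point that equal means can hide different spreads, here is a small sketch with two made-up datasets and a text-only histogram. The helper `text_histogram` is a hypothetical stand-in for a real plotting library and assumes integer values.

```python
import statistics
from collections import Counter

# Two hypothetical datasets with the same mean but very different spreads.
tight = [49, 50, 50, 50, 51]
wide = [10, 30, 50, 70, 90]

def text_histogram(values, bin_width):
    """Draw one line per non-empty bin, with one '#' per observation."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    for start in sorted(counts):
        bar = "#" * counts[start]
        lines.append(f"{start:>4} to {start + bin_width - 1:<4} {bar}")
    return "\n".join(lines)

for name, data in (("tight", tight), ("wide", wide)):
    print(name, "mean:", statistics.mean(data),
          "std dev:", round(statistics.pstdev(data), 2))
    print(text_histogram(data, bin_width=20))
```

Both lists report the same mean, but the histogram of the first has a single bar while the second is spread across five bins.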

Handling outliers
Outliers can distort the mean and variance. Use robust alternatives when appropriate:
– Median instead of mean for center.
– Interquartile range (IQR) instead of total range.
– Report with and without influential points and explain any exclusions.
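
The sketch below shows how a single extreme value moves the mean far more than the median, and reports the summary with and without it. It flags outliers using the 1.5 × IQR rule, a common convention that is assumed here rather than taken from the text above. The function name `iqr_outliers` and the price data are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up prices with one extreme value.
prices = [10, 12, 11, 13, 12, 11, 95]

flagged = iqr_outliers(prices)
kept = [p for p in prices if p not in flagged]

# Report center with and without the flagged points.
print("flagged:", flagged)
print("with:    mean", round(statistics.mean(prices), 2),
      "median", statistics.median(prices))
print("without: mean", statistics.mean(kept),
      "median", statistics.median(kept))
```

Flagging a point is not the same as deleting it: the value may be a data-entry error or a genuine observation, so any exclusion should be stated alongside the results.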

Descriptive vs. inferential statistics (short)
Descriptive statistics summarize the observed data. Inferential statistics use sample data to draw conclusions, estimate parameters, test hypotheses, or predict beyond the sample. Preparing financial statements is mainly descriptive; deciding future actions from those numbers typically requires inferential methods.

Checklist — steps to describe a dataset
1. Clarify whether your data are a population or a sample.
2. Inspect raw values for entry errors and extreme values.
3. Compute central tendency (mean, median, mode).
4. Compute spread (range, IQR, variance, standard deviation).
5. Check distribution shape (skewness, kurtosis) and plot histograms/box plots.
6. Visualize the data. Plot histograms to see the distribution, box