A Crash Course using Python
If you use statistics in your day-to-day job, it’s likely that at some point you’ll run across a distribution comparison problem. Comparing distributions to determine whether they’re distinct can lead to many valuable insights; in particular, it can show whether different attributes associated with a data set lead to different (statistically significant) outcomes.
To better illustrate this problem, let’s do an example. We’ll pull data from the ‘Adult’ dataset, available via the UCI Machine Learning Repository. This dataset contains a sample from the 1994 United States census, including information on individuals’ salary (>$50K, <=$50K), age, education, marital status, race, and sex, among other factors.
import pandas as pd

# Read the raw data; the file has no header row
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
                 header=None)
# Declare the column names of the data set
df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
              'marital-status', 'occupation', 'relationship', 'race', 'sex',
              'capital-gain', 'capital-loss', 'hours-per-week',
              'native-country', 'salary']
Now that we have some data, let’s visualize it. First, we look at the age distribution across the US population, using the matplotlib hist() function:
import matplotlib.pyplot as plt

def generate_distribution_histogram(dataframe, column_name, title,
                                    x_axis_label, y_axis_label,
                                    label_name, number_bins=15):
    """
    This function generates a histogram.
    Args:
        dataframe: pandas DataFrame containing the data.
        column_name: String. Name of the column whose distribution we
            want to visualize.
        title: String. Title of the histogram.
        x_axis_label: String. X-axis label.
        y_axis_label: String. Y-axis label.
        label_name: String. Legend label for the plotted data.
        number_bins: Int. Number of histogram bins.
    Outputs:
        Histogram containing distribution for specific column column_name.
    """
    plt.hist(dataframe[column_name], bins=number_bins, label=label_name)
    plt.title(title)
    plt.xlabel(x_axis_label)
    plt.ylabel(y_axis_label)
    plt.legend(loc='upper right')

#### MAIN FUNCTION ####
generate_distribution_histogram(df, 'age',
                                title='Age Distribution: US Population',
                                x_axis_label='Age (years)',
                                y_axis_label='Frequency',
                                label_name='Age')
Based on the histogram above, the majority of the sample is concentrated between 30 and 40 years of age, peaking at around 35 years of age and declining after.
But how does the age distribution change when we subset the data by income level? Let’s visualize the distribution again, this time splitting the data into >$50K and <=$50K categories:
# Subset the data into salary categories
# (the salary values in the raw file carry a leading space)
df_less_than_50k = df[df['salary'] == ' <=50K']
df_greater_than_50k = df[df['salary'] == ' >50K']

# Plot the histogram for the distribution for data <=$50K
generate_distribution_histogram(df_less_than_50k, 'age',
                                title='Age Distribution: US Population',
                                x_axis_label='Age (years)',
                                y_axis_label='Frequency',
                                label_name='<=$50K')

# Plot the histogram for the distribution for data >$50K
generate_distribution_histogram(df_greater_than_50k, 'age',
                                title='Age Distribution: US Population',
                                x_axis_label='Age (years)',
                                y_axis_label='Frequency',
                                label_name='>$50K')
As you can see in the visual above, the distributions change when we subset the data by salary level. For the population making $50K a year or less, the distribution peaks around 25 years of age. For the population making more than $50K a year, the peak occurs around 45 years of age. This intuitively makes sense, as people early in their careers tend to make less money than those who are further along and more established.
Now that we’ve graphed the different age distributions based on salary, is there a way to test statistically whether the two differ? Yes: the Mann-Whitney U Test.
So, what does the Mann-Whitney U Test do exactly?
The Mann-Whitney U Test is a null hypothesis test, used to detect differences between two independent data sets. It is a non-parametric test, meaning it does not assume that the data follow a specific distribution. Because of this, the Mann-Whitney U Test can be applied to data from any distribution, whether it is Gaussian or not.
Specifically, the null hypothesis of the Mann-Whitney U Test states that the distributions of two data sets are identical. If the null hypothesis is correct, there is a 50 percent chance that an arbitrarily selected value in one distribution is greater than another arbitrarily selected value in the second distribution (2).
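To make the 50 percent claim concrete, here is a small simulation sketch. It uses made-up exponential samples (not the census data), and the helper name probability_first_greater is purely illustrative. It estimates the chance that a value from one sample exceeds a value from the other, counting ties as half.

```python
import numpy as np

def probability_first_greater(sample_a, sample_b):
    """Estimate P(A > B) over all pairs, counting ties as one half."""
    a = np.asarray(sample_a, dtype=float)[:, None]
    b = np.asarray(sample_b, dtype=float)[None, :]
    return float(np.mean(a > b) + 0.5 * np.mean(a == b))

# Synthetic, non-Gaussian data: both samples come from the same distribution
rng = np.random.default_rng(0)
same_a = rng.exponential(scale=2.0, size=2000)
same_b = rng.exponential(scale=2.0, size=2000)

# Shift the second sample upward so the two distributions differ
shifted_b = same_b + 1.0

prob_identical = probability_first_greater(same_a, same_b)
prob_shifted = probability_first_greater(same_a, shifted_b)

print(f"Identical distributions: {prob_identical:.3f}")
print(f"Shifted distributions:   {prob_shifted:.3f}")
```

When both samples share a distribution, the estimate should land close to 0.5; once one sample is shifted, it moves away from 0.5, which is the kind of departure the test is designed to pick up.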
The test statistic associated with the Mann-Whitney U Test is defined as U, where U is the smaller of the two values U1 and U2, defined via the following set of equations (3):
$$U_1 = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1$$
$$U_2 = n_1 n_2 + \frac{n_2 (n_2 + 1)}{2} - R_2$$
where R1 refers to the sum of the ranks for the first group, and R2 refers to the sum of the ranks for the second group. n1 and n2 refer to the sample sizes of the first and second group, respectively.
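As a rough sketch of how U can be computed by hand, the snippet below ranks two tiny made-up groups together and applies the common rank-sum convention U1 = n1·n2 + n1(n1 + 1)/2 − R1 (and likewise for U2). That convention is an assumption here, and the function name u_from_rank_sums and the toy values are only for illustration.

```python
from scipy.stats import rankdata

def u_from_rank_sums(group_1, group_2):
    """Compute U1, U2 and U = min(U1, U2) from pooled ranks."""
    n1, n2 = len(group_1), len(group_2)
    # Rank all observations together; tied values share their average rank
    ranks = rankdata(list(group_1) + list(group_2))
    r1 = ranks[:n1].sum()  # rank sum for the first group
    r2 = ranks[n1:].sum()  # rank sum for the second group
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return u1, u2, min(u1, u2)

# Toy data, invented for this example
group_1 = [7, 5, 6, 4, 12]
group_2 = [3, 6, 4, 2, 1]

u1, u2, u = u_from_rank_sums(group_1, group_2)
print(u1, u2, u)  # 3.0 22.0 3.0
```

Note that U1 and U2 always add up to n1·n2 (here 25), so knowing one of them is enough to recover the other.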
For step-by-step instructions on how to calculate U, check out the following link, which covers non-parametric testing and is available via Boston University’s School of Public Health: http://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_nonparametric/BS704_Nonparametric4.html
Applying the Mann-Whitney U Test to the Data
Applying the Mann-Whitney U Test on the distributions is simple, using the mannwhitneyu() function in the scipy.stats package. We apply the code, comparing the two distributions, as follows:
from scipy import stats

def mann_whitney_u_test(distribution_1, distribution_2):
    """
    Perform the Mann-Whitney U Test, comparing two different distributions.
    Args:
        distribution_1: List.
        distribution_2: List.
    Outputs:
        u_statistic: Float. U statistic for the test.
        p_value: Float. P-value for the test.
    """
    # Request a two-sided test explicitly, since the default
    # alternative differs between SciPy versions
    u_statistic, p_value = stats.mannwhitneyu(distribution_1, distribution_2,
                                              alternative='two-sided')
    return u_statistic, p_value

#### MAIN FUNCTION ####
# Perform the Mann-Whitney U Test on the two distributions
mann_whitney_u_test(list(df_less_than_50k['age']),
                    list(df_greater_than_50k['age']))
We receive the following as test outputs:
The first output, the U statistic, is the test statistic U defined in the previous section. For small samples, U is interpreted using a two-tailed test table containing critical values of U. To reject the null hypothesis at α=0.05, the U obtained from the test must be no greater than the critical value of U found in the test table. For larger samples, such as the ones in this example, the p-value is typically computed from a normal approximation of U instead.
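Critical-value tables only cover small samples, so for larger ones U is usually converted to a z-score. The sketch below shows the textbook normal approximation under simplifying assumptions: it ignores the tie and continuity corrections, so it will not match SciPy's output exactly. The function name normal_approx_p_value and the input numbers are hypothetical.

```python
import math
from scipy.stats import norm

def normal_approx_p_value(u, n1, n2):
    """Two-sided p-value for U via a plain normal approximation.

    Simplified: no tie correction and no continuity correction.
    """
    mean_u = n1 * n2 / 2                             # expected U under the null
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # standard deviation of U under the null
    z = (u - mean_u) / sd_u
    p = 2 * norm.sf(abs(z))                          # two-sided tail probability
    return z, p

# Hypothetical inputs: U = 300 with 30 observations per group
z, p = normal_approx_p_value(u=300, n1=30, n2=30)
print(f"z = {z:.3f}, p = {p:.4f}")
```

A U far from n1·n2/2 in either direction gives a large |z| and a small p-value; a U sitting exactly at n1·n2/2 gives z = 0 and p = 1.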
U can range from 0 to n1 × n2, so it tends to be large when both sample sizes are large. This explains why the U statistic for this example, 61203011.5, is such a massive number.
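To see how sample size drives the magnitude of U, the following sketch runs the test on small and large synthetic samples drawn from the same pair of normal distributions. The data and the helper scaled_u are invented for illustration. Dividing U by n1 × n2 puts the statistic on a scale that does not depend on sample size.

```python
import numpy as np
from scipy import stats

def scaled_u(sample_a, sample_b):
    """Return the smaller U and that U divided by its maximum, n1 * n2."""
    u_first, _ = stats.mannwhitneyu(sample_a, sample_b, alternative='two-sided')
    max_u = len(sample_a) * len(sample_b)
    # The two U values sum to n1 * n2, so the other one follows directly
    u_min = min(u_first, max_u - u_first)
    return u_min, u_min / max_u

rng = np.random.default_rng(42)
# Same underlying distributions, very different sample sizes
small_a, small_b = rng.normal(0, 1, 20), rng.normal(0.5, 1, 25)
large_a, large_b = rng.normal(0, 1, 2000), rng.normal(0.5, 1, 2500)

small_u, small_ratio = scaled_u(small_a, small_b)
large_u, large_ratio = scaled_u(large_a, large_b)

print(f"Small samples: U = {small_u:.1f}, U / (n1 * n2) = {small_ratio:.3f}")
print(f"Large samples: U = {large_u:.1f}, U / (n1 * n2) = {large_ratio:.3f}")
```

The raw U for the large samples is orders of magnitude bigger, while the scaled values stay between 0 and 0.5 in both cases.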
The second Python output is the p-value associated with the test. The lower the p-value, the stronger the evidence against the null hypothesis. As a general rule of thumb, when the p-value is below 0.05, the null hypothesis can be rejected. The p-value for this specific example is so low that it registers as 0, so we can reject the null hypothesis. This outcome indicates, with statistical significance, that the age distribution for people making more than $50K/year differs from the age distribution for people making $50K/year or less.
This concludes my tutorial on the Mann-Whitney U Test. The full Python code is available in the following GitHub repo: