Project 1

Exploratory Data Analysis (EDA)


Problem Statement

Your hometown mayor just created a new data analysis team to give policy advice, and the administration recruited you via LinkedIn to join it. Unfortunately, due to budget constraints, for now the “team” is just you…

The mayor wants to start a new initiative to move the needle on one of two separate issues: high school education outcomes, or drug abuse in the community.

Also unfortunately, that is the entirety of what you’ve been told. And the mayor just went on a lobbyist-funded fact-finding trip in the Bahamas. In the meantime, you got your hands on two national datasets: one on SAT scores by state, and one on drug use by age. Start exploring these to look for useful patterns and possible hypotheses!


Exploratory Data Analysis

This project is focused on exploratory data analysis, aka "EDA". EDA is an essential part of the data science analysis pipeline. Using packages such as numpy, pandas, matplotlib and seaborn, data scientists are equipped with an array of tools to visualise data. EDA and visualisation serve as a starting point for analysing any dataset (refer to the flow chart below).

[Figure: Data science analysis pipeline flow chart]
Source: Transformation - Building a Data Science Capability
https://www.linkedin.com/pulse/transforamtion-building-data-science-capability-simon-jones

1. Load the sat_scores.csv dataset and describe it

1.1 Describing dataset using pandas .describe() and .info()
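
The import and data-loading cell is not shown in this extract. A minimal version (the file path is an assumption) would be:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

datasat = pd.read_csv('sat_scores.csv')  # assumed path to the SAT dataset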

datasat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 4 columns):
State     52 non-null object
Rate      52 non-null int64
Verbal    52 non-null int64
Math      52 non-null int64
dtypes: int64(3), object(1)
memory usage: 1.7+ KB

datasat.head()

  State  Rate  Verbal  Math
0    CT    82     509   510
1    NJ    81     499   513
2    MA    79     511   515
3    NY    77     495   505
4    NH    72     520   516
datasat.describe()

             Rate      Verbal        Math
count   52.000000   52.000000   52.000000
mean    37.153846  532.019231  531.500000
std     27.301788   33.236225   36.014975
min      4.000000  482.000000  439.000000
25%      9.000000  501.000000  504.000000
50%    33.500000  526.500000  521.000000
75%    63.500000  562.000000  555.750000
max    82.000000  593.000000  603.000000

1.2 Summary

The SAT scores dataset contains 4 columns:

  1. State: Refers to states in the U.S.
  2. Rate: A percentage, possibly the SAT passing rate for each state
  3. Verbal: Verbal scores, ranging from 482 to 593
  4. Math: Math scores, ranging from 439 to 603
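
With 52 rows, the State column likely covers the 50 states plus Washington DC and possibly a national aggregate row; a quick check (a suggestion, not in the original analysis):

print(datasat['State'].nunique())   # number of distinct 'State' values
print(datasat['State'].unique())    # inspect the labels themselves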

2. Plot the data using seaborn

2.1 Using seaborn’s distplot, plot the distributions for each of Rate, Math, and Verbal

# Plot rate distplot

ratep = sns.distplot(datasat['Rate'], kde=False, bins=10)
ratep.set(xlabel='SAT passing rate')
ratep.set_title('Frequency distribution of SAT passing rate')

[Figure: Frequency distribution of SAT passing rate]

# Plot math distplot

mathp = sns.distplot(datasat['Math'], kde=False,bins=20, color='blue')
mathp.set(xlabel='SAT Math Scores')
mathp.set_title('Frequency distribution of SAT Math Scores')

[Figure: Frequency distribution of SAT Math scores]

# Plot verbal distplot

verbp = sns.distplot(datasat['Verbal'], kde=False,bins=20, color='orange')
verbp.set(xlabel='SAT Verbal Scores')
verbp.set_title('Frequency distribution of SAT Verbal Scores')

[Figure: Frequency distribution of SAT Verbal scores]
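
Note: sns.distplot has been deprecated since seaborn 0.11. On newer versions, an equivalent histogram-only plot (shown here for the Math column as an example) uses sns.histplot:

# histplot replaces the kde=False form of distplot on seaborn 0.11+
mathp = sns.histplot(datasat['Math'], bins=20, color='blue')
mathp.set(xlabel='SAT Math Scores', title='Frequency distribution of SAT Math Scores')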

2.2 Using seaborn’s pairplot, show the joint distributions for each of Rate, Math, and Verbal

sns.pairplot(datasat, vars=['Rate', 'Math', 'Verbal'], hue='State')

[Figure: Pair plot of Rate, Math and Verbal, with hue by State]

Interpretation of pair plot

1. Rate seems to be negatively correlated with both Math and Verbal scores.
From plots 2 and 3 in the first row, as scores increase, Rate decreases.

2. Math and Verbal appear to be positively correlated.
From the plots at (row 2, column 3) and (row 3, column 2), as either variable increases, the other tends to increase as well.

3. Adding a hue dimension (assigned to State) may reveal interesting information about how states cluster according to their SAT scores.

3. Plot the data using built-in pandas functions.

3.1 Plot a stacked histogram with Verbal and Math using pandas

datasat[['Verbal','Math']].plot.hist(stacked=True, alpha=0.5,figsize=(12,12), bins=20)

[Figure: Stacked histogram of Verbal and Math scores]

3.2 Plot Verbal, Math and Rate (normalised) on the same chart using boxplots

Standardising each variable (subtracting its mean and dividing by its standard deviation, z = (x - mean) / std) puts all variables on a common scale, so that differences in scale do not obscure the comparison.

# Standardise each feature: subtract its mean and divide by its standard deviation
datanorm = datasat.copy()
print(datanorm.mean())

Rate       37.000000
Verbal    532.529412
Math      531.843137
dtype: float64

(Note: these means differ slightly from those in describe() above, which suggests a row, likely a national aggregate, was dropped from the 52-entry dataset between these steps; computing the means rather than hard-coding them keeps the standardisation correct either way.)

datanorm['Raten'] = (datanorm['Rate'] - datanorm['Rate'].mean()) / np.std(datanorm['Rate'])
datanorm['Verbaln'] = (datanorm['Verbal'] - datanorm['Verbal'].mean()) / np.std(datanorm['Verbal'])
datanorm['Mathn'] = (datanorm['Math'] - datanorm['Math'].mean()) / np.std(datanorm['Math'])

datanormp = datanorm.plot(kind='box', y=['Verbaln', 'Mathn', 'Raten'], figsize=(10,10), title='SAT score distribution')

[Figure: Box plots of standardised Verbal, Math and Rate]

4. Examine summary statistics

Checking the summary statistics: correlation matrices give a one-stop view of how variables are linearly related to each other.

4.1 Create the correlation matrix of your variables (excluding State).
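
The ver_diff column that appears in the matrix below is not created anywhere in this extract; based on its description in section 6 (and consistent with the covariance matrix in 4.2), it was presumably added as the gap between Verbal and Math scores, e.g.:

datasat['ver_diff'] = datasat['Verbal'] - datasat['Math']  # assumed definition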

datasat.corr()

              Rate    Verbal      Math  ver_diff
Rate      1.000000 -0.888121 -0.773419 -0.098671
Verbal   -0.888121  1.000000  0.899909  0.044527
Math     -0.773419  0.899909  1.000000 -0.395574
ver_diff -0.098671  0.044527 -0.395574  1.000000

Rate has a fairly strong negative correlation with both the Verbal and Math variables.
Verbal and Math are strongly positively correlated (r ≈ 0.90).

4.2 Assign and print the covariance matrix for the dataset

  1. Covariance measures how 2 variables vary together, in the variables' original units; correlation is a normalised, unitless version of the same quantity.
  2. Correlation is obtained by dividing the covariance by the product of the standard deviations of the 2 variables.
  3. Correlation tells us, when one variable increases, whether the other tends to increase or decrease, and how strongly (on a scale from -1 to 1).
datasat.cov()

             Rate       Verbal         Math    ver_diff
Rate       759.04  -816.280000  -773.220000  -43.060000
Verbal    -816.28  1112.934118  1089.404706   23.529412
Math      -773.22  1089.404706  1316.774902 -227.370196
ver_diff   -43.06    23.529412  -227.370196  250.899608
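
As a sanity check on point 2 above, a correlation can be recovered from the covariance by dividing by the product of the two standard deviations (a quick verification, not part of the original notebook):

# corr(Rate, Verbal) = cov(Rate, Verbal) / (std(Rate) * std(Verbal))
cov_rv = datasat['Rate'].cov(datasat['Verbal'])
corr_rv = cov_rv / (datasat['Rate'].std() * datasat['Verbal'].std())
print(corr_rv)  # approximately -0.888, matching datasat.corr()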

5. Performing EDA on “drug use by age” data.

5.1 EDA involves cleaning the data: checking for null values, data types, etc.

df1 = pd.read_csv(drugcsv)  # drugcsv holds the path to the drug-use-by-age CSV file

df1.head(1)
df1.isnull().sum()
age                        0
n                          0
alcohol-use                0
alcohol-frequency          0
marijuana-use              0
marijuana-frequency        0
cocaine-use                0
cocaine-frequency          0
crack-use                  0
crack-frequency            0
heroin-use                 0
heroin-frequency           0
hallucinogen-use           0
hallucinogen-frequency     0
inhalant-use               0
inhalant-frequency         0
pain-reliever-use          0
pain-reliever-frequency    0
oxycontin-use              0
oxycontin-frequency        0
tranquilizer-use           0
tranquilizer-frequency     0
stimulant-use              0
stimulant-frequency        0
meth-use                   0
meth-frequency             0
sedative-use               0
sedative-frequency         0
dtype: int64
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17 entries, 0 to 16
Data columns (total 28 columns):
age                        17 non-null object
n                          17 non-null int64
alcohol-use                17 non-null float64
alcohol-frequency          17 non-null float64
marijuana-use              17 non-null float64
marijuana-frequency        17 non-null float64
cocaine-use                17 non-null float64
cocaine-frequency          17 non-null object
crack-use                  17 non-null float64
crack-frequency            17 non-null object
heroin-use                 17 non-null float64
heroin-frequency           17 non-null object
hallucinogen-use           17 non-null float64
hallucinogen-frequency     17 non-null float64
inhalant-use               17 non-null float64
inhalant-frequency         17 non-null object
pain-reliever-use          17 non-null float64
pain-reliever-frequency    17 non-null float64
oxycontin-use              17 non-null float64
oxycontin-frequency        17 non-null object
tranquilizer-use           17 non-null float64
tranquilizer-frequency     17 non-null float64
stimulant-use              17 non-null float64
stimulant-frequency        17 non-null float64
meth-use                   17 non-null float64
meth-frequency             17 non-null object
sedative-use               17 non-null float64
sedative-frequency         17 non-null float64
dtypes: float64(20), int64(1), object(7)
memory usage: 3.8+ KB

5.2 EDA also involves looking at an overview of the dataset (descriptive statistics like mean and standard deviation)

pd.options.display.max_columns = 999  # show all 28 columns when displaying the frame
df1.describe(include='all')

df2 = df1.iloc[:,[2,4,6,8,10,12,14,16,18,20,22,24,26]] # Split df1 into use columns

df3 = df1.iloc[:,[3,5,7,9,11,13,15,17,19,21,23,25,27]] # Split df1 into freq columns
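
One caveat before computing correlations: the info() output in 5.1 shows that several *-frequency columns are stored as object dtype (they contain non-numeric placeholders), so df3.corr() will silently ignore them. One possible fix (a sketch, not in the original notebook) is to coerce those columns to numeric first:

df3 = df3.copy()  # avoid SettingWithCopyWarning from the iloc slice
# Coerce object-typed frequency columns to numeric; unparseable entries become NaN
obj_cols = df3.select_dtypes(include='object').columns
df3[obj_cols] = df3[obj_cols].apply(pd.to_numeric, errors='coerce')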


# Heatmap showing correlations between drug-use columns
plt.figure(figsize=(10,10))
sns.heatmap(df2.corr(), cmap="PiYG", center=0)

[Figure: Heatmap of correlations between drug-use columns]

# Heatmap showing correlations between drug-frequency columns
plt.figure(figsize=(10,10))
sns.heatmap(df3.corr(), cmap="PiYG", center=0)

[Figure: Heatmap of correlations between drug-frequency columns]

# Plotting drug use across age
f, ax = plt.subplots(figsize=(15, 15))

x = df1['age']

for col in df2.columns:
    y = df2[col]
    plt.plot(x, y)

f.suptitle('Drug use across age',fontsize = 20)
f.legend()
plt.xlabel('age')
plt.ylabel('usage')

[Figure: Drug use across age]

Observations:

The predominant drugs that are used across all age groups are the following:

  1. Alcohol
  2. Marijuana
  3. Cocaine

# Plotting drug use frequency across age
f, ax = plt.subplots(figsize=(15, 15))

x = df1['age']

for col in df3.columns:
    y = df3[col]
    plt.plot(x, y)

f.suptitle('Drug frequency across age',fontsize = 20)
f.legend()
plt.xlabel('age')
plt.ylabel('frequency')

[Figure: Drug use frequency across age]

Observations:

Frequency of drug use does not follow a clear overall pattern, but interestingly there are spikes at certain ages where the frequency of use of a particular drug is high.

5.3 Create a testable hypothesis about this data

Simple test statistics such as the t-test can formalise observations about the data. In this case, a two-sample t-test found that mean overall drug use for adolescents (ages 12-21) is not statistically different from that of non-adolescents (p ≈ 0.36).
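
The mean_use and minor_label columns used below are not constructed anywhere in this extract; a plausible construction (an assumption about the original notebook) is:

df2 = df2.copy()
df2['mean_use'] = df2.mean(axis=1)  # row-wise mean across all *-use columns
df2['minor_label'] = [1 if i < 10 else 0 for i in range(len(df2))]  # first 10 age rows are 12-21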

# Perform t-test
# Null hypothesis: the mean overall drug use for adolescents equals the mean
# overall drug use for non-adolescents

list1 = df2['mean_use'].tolist()
list2 = list1[-7:]     # last 7 rows: age groups 22-23 and above (non-adolescents)
list3 = list1[0:10]    # first 10 rows: ages 12-21 (adolescents)

stats.ttest_ind(list3, list2)

Ttest_indResult(statistic=-0.94213218959786749, pvalue=0.36105217344934259)

# Visualise the two groups using a box plot
plt.subplots(figsize=(8,8))
ax = sns.boxplot(x='minor_label', y='mean_use', data=df2)
ax.set(xticklabels=['Non-minor (>21 y.o.)', 'Minor (<=21 y.o.)'], title='Box Plot of overall drug use')

[Figure: Box plot of overall drug use, minors vs non-minors]

6. Introduction to dealing with outliers

Rate is unlikely to have outliers, since it is a percentage and its values fall between 0 and 100.

Definition of outlier:

An outlier is defined here as a value that lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile.

# Box plot shows that only ver_diff has outliers
sns.boxplot(data=datasat)

[Figure: Box plots of Rate, Verbal, Math and ver_diff]

# Function returning the [Q1 - 1.5*IQR, Q3 + 1.5*IQR] range for a column;
# values falling outside this range are considered outliers
def iqr_range(col):
    q1 = datasat[col].quantile(0.25)
    q3 = datasat[col].quantile(0.75)
    iqr = q3 - q1
    return [q1 - 1.5*iqr, q3 + 1.5*iqr]

print("outlier range for 'Rate' is :", iqr_range('Rate'))
print("outlier range for 'Verbal' is :", iqr_range('Verbal'))
print("outlier range for 'Math' is :", iqr_range('Math'))
print("outlier range for 'ver_diff' is :", iqr_range('ver_diff'))
outlier range for 'Rate' is : [-73.5, 146.5]
outlier range for 'Verbal' is : [409.5, 653.5]
outlier range for 'Math' is : [421.25, 639.25]
outlier range for 'ver_diff' is : [-21.75, 20.25]

Observations:

Since all values in the 3 main columns (Rate, Verbal and Math) fall within their non-outlier ranges, there are no outliers in these columns.

The ver_diff column measures the difference between Verbal and Math scores, so outliers arise when the gap between the two scores is large. It would not be useful to remove outliers from this column, since a large difference may itself be a feature we are interested in.