30 Nov 2017

Exploratory Data Analysis

This project is focused on exploratory data analysis, aka “EDA”, of an SAT scores dataset and a drug use dataset from the United States. EDA is an essential part of the data science analysis pipeline.

Exploratory Data Analysis (EDA)


Your hometown mayor just created a new data analysis team to give policy advice, and the administration recruited you via LinkedIn to join it. Unfortunately, due to budget constraints, for now the “team” is just you…

The mayor wants to start a new initiative to move the needle on one of two separate issues: high school education outcomes, or drug abuse in the community.

Also unfortunately, that is the entirety of what you’ve been told. And the mayor just went on a lobbyist-funded fact-finding trip in the Bahamas. In the meantime, you got your hands on two national datasets: one on SAT scores by state, and one on drug use by age. Start exploring these to look for useful patterns and possible hypotheses!


This project is focused on exploratory data analysis, aka “EDA”. EDA is an essential part of the data science analysis pipeline. Failure to perform EDA before modeling is almost guaranteed to lead to bad models and faulty conclusions. What you do in this project is good practice for all projects going forward!

This jupyter notebook lab includes a variety of plotting problems.

As a data scientist, one will use visualization and plotting every single day.

Package imports

import numpy as np
import scipy.stats as stats
import csv
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# this line tells jupyter notebook to display the plots inline in the notebook rather than in a separate window.
%matplotlib inline

# this line makes plots prettier on mac retina screens. If you don't have one it shouldn't do anything.
%config InlineBackend.figure_format = 'retina'

1. Load the sat_scores.csv dataset and describe it


Replace the placeholder path to the sat_scores.csv dataset below with the specific path to the file on your machine.

1.1 Load the file with the csv module and put it in a Python dictionary


The dictionary format for the data will be the column names as keys, and the data under each column as the values.

Toy example:

data = {
    'column1':[0,1,2,3],
    'column2':['a','b','c','d']
    }
#Create dictionary

with open('sat_scores.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    data = list(reader)

d = {}

# data[0] holds the column names; pair each header with its column index
for n, key in enumerate(data[0]):
    d[key] = [row[n] for row in data[1:]]
d
{'Math': ['510', '513', '515', '505', '516', '499', '499', '506', '500', '501', ...],
 'Rate': ['82', '81', '79', '77', '72', '71', '71', '69', '69', '68', ...],
 'State': ['CT', 'NJ', 'MA', 'NY', 'NH', 'RI', 'PA', 'VT', 'ME', 'VA', ...],
 'Verbal': ['509', '499', '511', '495', '520', '501', '500', '511', '506', '510', ...]}

1.2 Make a pandas DataFrame object with the SAT dictionary, and another with the pandas .read_csv() function


Compare the DataFrames using the .dtypes attribute in the DataFrame objects. What is the difference between loading from file and inputting this dictionary (if any)?

sat = pd.read_csv('sat_scores.csv')
sat.dtypes
State     object
Rate       int64
Verbal     int64
Math       int64
dtype: object
sat2 = pd.DataFrame(d)
sat2.dtypes
Math      object
Rate      object
State     object
Verbal    object
dtype: object

Answer:

The difference between creating a DataFrame from a dictionary (in Q1.1) and using the pandas read_csv function lies in the resulting column types. The elements within the csv file keep their types when using read_csv, so the numeric columns come in as int64. However, creating a DataFrame from the dictionary of strings leaves every column as object. This would pose issues during the data cleaning and munging process.

TIP:

If you don’t convert the string column values to float in your dictionary, the columns in the DataFrame are of type object (which are string values, essentially).
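A minimal sketch of that conversion, reusing the d dictionary from Q1.1 (State stays as strings; the numeric columns become floats):

# Convert the numeric columns from strings to floats before building the DataFrame
d_numeric = {key: (vals if key == 'State' else [float(v) for v in vals])
             for key, vals in d.items()}
pd.DataFrame(d_numeric).dtypes  # Rate, Verbal and Math are now float64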

1.3 Look at the first ten rows of the DataFrame: What does our data describe?


From now on, use the DataFrame loaded from the file using the .read_csv() function.

Use the .head(num) built-in DataFrame function, where num is the number of rows to print out.

You are not given a “codebook” with this data, so some (very minor) inference must be made.

sat.head(10)
State Rate Verbal Math
0 CT 82 509 510
1 NJ 81 499 513
2 MA 79 511 515
3 NY 77 495 505
4 NH 72 520 516
5 RI 71 501 499
6 PA 71 500 499
7 VT 69 511 506
8 ME 69 506 500
9 VA 68 510 501
# One of the rows ('All') aggregates the scores of all the states.
# Check the unique State values to confirm which rows are actual states
sat['State'].unique()
array(['CT', 'NJ', 'MA', 'NY', 'NH', 'RI', 'PA', 'VT', 'ME', 'VA', 'DE',
       'MD', 'NC', 'GA', 'IN', 'SC', 'DC', 'OR', 'FL', 'WA', 'TX', 'HI',
       'AK', 'CA', 'AZ', 'NV', 'CO', 'OH', 'MT', 'WV', 'ID', 'TN', 'NM',
       'IL', 'KY', 'WY', 'MI', 'MN', 'KS', 'AL', 'NE', 'OK', 'MO', 'LA',
       'WI', 'AR', 'UT', 'IA', 'SD', 'ND', 'MS', 'All'], dtype=object)
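A quick count of the unique values confirms the 52 entries: the 50 states, plus DC and the aggregate 'All' row.

sat['State'].nunique()
52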

2. Create a “data dictionary” based on the data


A data dictionary is an object that describes your data. This should contain the name of each variable (column), the type of the variable, your description of what the variable is, and the shape (rows and columns) of the entire dataset.
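As a sketch, the data dictionary itself can be written as a plain Python dict; the descriptions are inferences, since no codebook is provided:

data_dictionary = {
    'shape': (52, 4),
    'State': {'type': 'object', 'description': 'US state abbreviation, plus an aggregate "All" row'},
    'Rate': {'type': 'int64', 'description': 'Rate of SAT test takers in the state, in percent'},
    'Verbal': {'type': 'int64', 'description': 'Mean SAT Verbal score for the state'},
    'Math': {'type': 'int64', 'description': 'Mean SAT Math score for the state'}
}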

#Checking the type of data for each column
sat.dtypes
State     object
Rate       int64
Verbal     int64
Math       int64
dtype: object
#Checking the data size
sat.shape
(52, 4)
#Checking if there are null values
sat.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 4 columns):
State     52 non-null object
Rate      52 non-null int64
Verbal    52 non-null int64
Math      52 non-null int64
dtypes: int64(3), object(1)
memory usage: 1.7+ KB
#Familiarise with the column names
sat.columns
Index(['State', 'Rate', 'Verbal', 'Math'], dtype='object')
sat.tail()
State Rate Verbal Math
47 IA 5 593 603
48 SD 4 577 582
49 ND 4 592 599
50 MS 4 566 551
51 All 45 506 514


Update of the SAT Dataset:

  • No null values within the DataFrame
  • The last row (‘All’) attempts to summarise the readings of each feature.
  • Assess whether the ‘All’ row is a correct summary, then remove the last row so that the analysis is not affected by this aggregate ‘summary’ of the table
sats = sat.drop(51)
sats.describe()
Rate Verbal Math
count 51.000000 51.000000 51.000000
mean 37.000000 532.529412 531.843137
std 27.550681 33.360667 36.287393
min 4.000000 482.000000 439.000000
25% 9.000000 501.000000 503.000000
50% 33.000000 527.000000 525.000000
75% 64.000000 562.000000 557.500000
max 82.000000 593.000000 603.000000


From the describe() output of the remaining 51 rows, the means of Rate, Verbal, and Math are 37.0, 532.5, and 531.8. These numbers do not match the values in the ‘All’ row (45, 506, 514), so the ‘All’ row is not a simple average of the states, and we proceed with the sats DataFrame that has the last row dropped.

3. Plot the data using seaborn


3.1 Using seaborn’s distplot, plot the distributions for each of Rate, Math, and Verbal

Set the keyword argument kde=False. This way you can actually see the counts within bins. You can adjust the number of bins to your liking.

rate = sats['Rate']
sns.distplot(rate, bins= 10, kde=False)
plt.show()

math= sats['Math']
sns.distplot(math, bins= 50, kde=False)
plt.show()

verbal = sats['Verbal']
sns.distplot(verbal, bins= 50, kde=False)
plt.show()

[Histogram of Rate]

[Histogram of Math]

[Histogram of Verbal]

3.2 Using seaborn’s pairplot, show the joint distributions for each of Rate, Math, and Verbal


Explain what the visualization tells you about your data.

sns.pairplot(sats, vars=['Rate','Math','Verbal'] , hue='State')
plt.show()


From the pairplot, we observe that states which fare very well on the Math and Verbal tests tend to have a low SAT participation Rate. We can infer that in the high-scoring states, students take the SAT only when they are confident of scoring well.

4. Plot the data using built-in pandas functions.


Pandas is very powerful and contains a variety of nice, built-in plotting functions for your data.

4.1 Plot a stacked histogram with Verbal and Math using pandas

verbal_math = sats[['Verbal','Math']]
verbal_math.plot.hist(stacked=True, bins=20)
<matplotlib.axes._subplots.AxesSubplot at 0x1a185fa828>

[Stacked histogram of Verbal and Math]

4.2 Plot Verbal and Math on the same chart using boxplots

What are the benefits of using a boxplot as compared to a scatterplot or a histogram?

The boxplot compactly shows the median, the upper and lower quartiles, the whisker limits, and any outliers for each variable. A scatterplot or histogram, by contrast, only gives a generic picture of the distribution of the data.

color = dict(boxes='DarkGreen', whiskers='DarkOrange',medians='DarkBlue', caps='Gray')
verbal_math.plot.box(color=color, sym='r+')
<matplotlib.axes._subplots.AxesSubplot at 0x1a172db860>

[Boxplot of Verbal and Math]

4.3 Plot Verbal, Math, and Rate appropriately on the same boxplot chart

Think about how you might change the variables so that they would make sense on the same chart. Explain your rationale for the choices on the chart. You should strive to make the chart as intuitive as possible.

verbal_math_rate = sats[['Verbal','Math','Rate']]
color = dict(boxes='DarkGreen', whiskers='DarkOrange',medians='DarkBlue', caps='Gray')
verbal_math_rate.plot.box(color=color, sym='r+')
<matplotlib.axes._subplots.AxesSubplot at 0x1a18e7f978>

[Boxplot of Verbal, Math, and Rate]


What’s wrong with plotting a box-plot of Rate on the same chart as Math and Verbal?

The initial boxplot is a misrepresentation: Rate is not on the same scale as Verbal and Math. Therefore, the variables must be standardised, z = (x − mean) / std, before they share one chart.

#Rate descriptive data
rate_value = sats.Rate.values
rate_mean = np.mean(rate_value)
rate_std = np.std(rate_value)
print(rate_mean, rate_std)
37.0 27.27923867605359
#Standardisation of rate
rate_stand = (rate_value - rate_mean) / rate_std
print(np.mean(rate_stand), np.std(rate_stand))
# not exactly a mean of 0 but excruciatingly close
-8.707631565687502e-18 1.0
#Plot histogram of the standardised rate to observe any changes
fig = plt.figure(figsize=(8,4))
ax = fig.gca()
ax = sns.distplot(rate_stand, bins=30, kde=False)
plt.show()

[Histogram of standardised Rate]

Notice that nothing changes about the distribution except for the location and the scale

#Maths descriptive data
math_value = sats.Math.values
math_mean = np.mean(math_value)
math_std = np.std(math_value)
print(math_mean, math_std)
531.843137254902 35.92987317311408
#Standardisation of math
math_stand = (math_value - math_mean) / math_std
print(np.mean(math_stand), np.std(math_stand))
# not exactly a mean of 0 but excruciatingly close
-8.272249987403127e-16 1.0
#Plot histogram of the standardised math scores to observe any changes
fig = plt.figure(figsize=(8,4))
ax = fig.gca()
ax = sns.distplot(math_stand, bins=30, kde=False)
plt.show()

[Histogram of standardised Math]

#Verbal descriptive data
verbal_value = sats.Verbal.values
verbal_mean = np.mean(verbal_value)
verbal_std = np.std(verbal_value)
print(verbal_mean, verbal_std)
532.5294117647059 33.03198268415228
#Standardisation of Verbal
verbal_stand = (verbal_value - verbal_mean) / verbal_std
print(np.mean(verbal_stand), np.std(verbal_stand))
# not exactly a mean of 0 but excruciatingly close
8.098097356089377e-16 0.9999999999999998
#Plot histogram of the standardised verbal scores to observe any changes
fig = plt.figure(figsize=(8,4))
ax = fig.gca()
ax = sns.distplot(verbal_stand, bins=30, kde=False)
plt.show()

[Histogram of standardised Verbal]

sats.head()
State Rate Verbal Math
0 CT 82 509 510
1 NJ 81 499 513
2 MA 79 511 515
3 NY 77 495 505
4 NH 72 520 516
sats_without_state = sats.drop(['State'],axis=1)
sats_without_state.head()
Rate Verbal Math
0 82 509 510
1 81 499 513
2 79 511 515
3 77 495 505
4 72 520 516
sats_without_state_stand = (sats_without_state - sats_without_state.mean()) / sats_without_state.std()
sats_without_state_stand.plot.box(color=color, sym='r+')
<matplotlib.axes._subplots.AxesSubplot at 0x1a18e7bf28>

[Boxplot of standardised Rate, Verbal, and Math]
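One subtlety worth noting: pandas’ .std() uses the sample standard deviation (ddof=1), while np.std() defaults to the population version (ddof=0), which is why the manual z-scores above use a slightly different denominator than the pandas one-liner. A quick check:

print(sats['Rate'].std(), np.std(sats['Rate'].values))  # 27.550681 (ddof=1) vs 27.279239 (ddof=0)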

5. Create and examine subsets of the data


For these questions masking will be used in pandas. Masking uses conditional statements to select portions of your DataFrame (through boolean operations under the hood.)

Remember the distinction between DataFrame indexing functions in pandas:

.iloc[row, col] : row and column are specified by index, which are integers
.loc[row, col]  : row and column are specified by string "labels" (boolean arrays are allowed; useful for rows)
.ix[row, col]   : row and column indexers can be a mix of labels and integer indices (.ix is deprecated in later pandas versions)

For detailed reference and tutorial make sure to read over the pandas documentation:

http://pandas.pydata.org/pandas-docs/stable/indexing.html
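As a small sketch before the exercises, a boolean mask on the sats DataFrame looks like this:

# The condition produces a boolean Series; .loc keeps only the True rows
mask = sats['Rate'] > 50
sats.loc[mask, ['State', 'Rate']].head(3)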

5.1 Find the list of states that have Verbal scores greater than the average of Verbal scores across states


How many states are above the mean? What does this tell you about the distribution of Verbal scores?

verbal_mean = sats['Verbal'].mean()
print(verbal_mean)
sats.loc[sats['Verbal'] > verbal_mean]
532.5294117647059
State Rate Verbal Math
26 CO 31 539 542
27 OH 26 534 439
28 MT 23 539 539
30 ID 17 543 542
31 TN 13 562 553
32 NM 13 551 542
33 IL 12 576 589
34 KY 12 550 550
35 WY 11 547 545
36 MI 11 561 572
37 MN 9 580 589
38 KS 9 577 580
39 AL 9 559 554
40 NE 8 562 568
41 OK 8 567 561
42 MO 8 577 577
43 LA 7 564 562
44 WI 6 584 596
45 AR 6 562 550
46 UT 5 575 570
47 IA 5 593 603
48 SD 4 577 582
49 ND 4 592 599
50 MS 4 566 551
sats.loc[sats['Verbal'] > verbal_mean].count()
State     24
Rate      24
Verbal    24
Math      24
dtype: int64

There are a total of 24 states above the mean score of the Verbal test. Since that is close to half of the 51 rows, the distribution of Verbal scores is roughly symmetric.

5.2 Find the list of states that have Verbal scores greater than the median of Verbal scores across states


How does this compare to the list of states greater than the mean of Verbal scores? Why?

verbal_median = sats['Verbal'].median() 
print(verbal_median)
sats[sats['Verbal'] > verbal_median]
527.0
State Rate Verbal Math
26 CO 31 539 542
27 OH 26 534 439
28 MT 23 539 539
30 ID 17 543 542
31 TN 13 562 553
32 NM 13 551 542
33 IL 12 576 589
34 KY 12 550 550
35 WY 11 547 545
36 MI 11 561 572
37 MN 9 580 589
38 KS 9 577 580
39 AL 9 559 554
40 NE 8 562 568
41 OK 8 567 561
42 MO 8 577 577
43 LA 7 564 562
44 WI 6 584 596
45 AR 6 562 550
46 UT 5 575 570
47 IA 5 593 603
48 SD 4 577 582
49 ND 4 592 599
50 MS 4 566 551
sats[sats['Verbal'] > verbal_median].count()
State     24
Rate      24
Verbal    24
Math      24
dtype: int64


There are a total of 24 states above the median score of the Verbal test, and the list is identical to the list of states above the mean. This is because the mean (532.5) and the median (527) are close together, which again points to a fairly symmetric distribution of scores.

5.3 Create a column that is the difference between the Verbal and Math scores


Specifically, this should be Verbal - Math.

sats['diff_verbal_math'] = sats['Verbal'] - sats['Math']
sats.head(5)
State Rate Verbal Math diff_verbal_math
0 CT 82 509 510 -1
1 NJ 81 499 513 -14
2 MA 79 511 515 -4
3 NY 77 495 505 -10
4 NH 72 520 516 4

5.4 Create two new DataFrames showing states with the greatest difference between scores


  1. Your first DataFrame should be the 10 states with the greatest gap between Verbal and Math scores where Verbal is greater than Math. It should be sorted appropriately to show the ranking of states.
  2. Your second DataFrame will be the inverse: states with the greatest gap between Verbal and Math such that Math is greater than Verbal. Again, this should be sorted appropriately to show rank.
  3. Print the header of both variables, only showing the top 3 states in each.

Question 5.4.1 Find the states with the highest difference between Verbal and Math scores


state_diff_score = sats[['State','diff_verbal_math']]
# Sort descending so the states where Verbal most exceeds Math rank first
top_10_verbal = state_diff_score.sort_values('diff_verbal_math', ascending=False).head(10)
print('The state is ' + top_10_verbal.State.head(3))

Question 5.4.2 Find the states with the lowest difference between Verbal and Math scores


# Sort ascending so the states where Math most exceeds Verbal rank first
top_10_math = state_diff_score.sort_values('diff_verbal_math').head(10)
print('The state is ' + top_10_math.State.head(3))
21    The state is HI
23    The state is CA
1     The state is NJ
Name: State, dtype: object

6. Examine summary statistics


Checking the summary statistics for data is an essential step in the EDA process!

6.1 Create the correlation matrix of your variables (excluding State).


What does the correlation matrix tell you?

# Create dataframe to analyse the different factors on a heat map
features = sats[['Math','Verbal', 'Rate']]
# create correlation data
correlation = features.corr()

# plot heatmap
ax = sns.heatmap(correlation, annot=True)

# turn the axis label
for item in ax.get_yticklabels():
    item.set_rotation(0)

for item in ax.get_xticklabels():
    item.set_rotation(90)

[Correlation heatmap of Math, Verbal, and Rate]


Based on the absolute value of the Pearson correlation coefficients, we observe high correlation within two pairs: Verbal and Math scores (0.90), and Verbal scores and Rate (−0.89). This implies that candidates with high Verbal scores are likely to attain high Math scores as well, while states with high Verbal scores tend to have low SAT participation rates.

6.2 Use pandas’ .describe() built-in function on your DataFrame


Write up what each of the rows returned by the function indicate.

math.describe()
count     51.000000
mean     531.843137
std       36.287393
min      439.000000
25%      503.000000
50%      525.000000
75%      557.500000
max      603.000000
Name: Math, dtype: float64


Based on the data of the 51 states (including DC), we observe that the mean Math score is 531.8, with a minimum score of 439 and a maximum score of 603. States in the upper quartile score more than 557.5.

verbal.describe()
count     51.000000
mean     532.529412
std       33.360667
min      482.000000
25%      501.000000
50%      527.000000
75%      562.000000
max      593.000000
Name: Verbal, dtype: float64


Based on the data of the 51 states, we observe that the mean Verbal score is 532.5, with a minimum score of 482 and a maximum score of 593. States in the upper quartile score more than 562.

rate.describe()
count    51.000000
mean     37.000000
std      27.550681
min       4.000000
25%       9.000000
50%      33.000000
75%      64.000000
max      82.000000
Name: Rate, dtype: float64


Based on the data of the 51 states, we observe that the mean Rate is 37, with a minimum rate of 4 and a maximum rate of 82. States in the upper quartile have a Rate of more than 64.

6.3 Assign and print the covariance matrix for the dataset


  1. Describe how the covariance matrix is different from the correlation matrix.
  2. What is the process to convert the covariance into the correlation?
  3. Why is the correlation matrix preferred to the covariance matrix for examining relationships in your data?

6.3.1 & 6.3.2 Understanding the concept of Covariance and Pearson Correlation


The covariance matrix indicates “relatedness” between variables. For a pair of variables X and Y, it is the sum of deviations from the mean of X times deviations from the mean of Y, adjusted by the sample size N:

cov(X, Y) = Σ (x_i − x̄)(y_i − ȳ) / (N − 1)

From this formula, it is challenging to deduce what the value means, as X and Y might have different units. Therefore, we rely on the correlation matrix, which normalises the covariance by dividing it by the standard deviations of both variables:

corr(X, Y) = cov(X, Y) / (std(X) · std(Y))

The correlation matrix allows us to easily interpret the “relatedness” on a scale of −1 to 1, as it takes the spread of the data into account.

6.3.3 Covariance of Verbal, Math, and Rate


The correlation matrix is much more effective for assessing the relationship between two different variables, because the deviations of both variables are divided out and every entry lives on the same −1 to 1 scale.

covariance = features.cov()
print(covariance)
               Math       Verbal    Rate
Math    1316.774902  1089.404706 -773.22
Verbal  1089.404706  1112.934118 -816.28
Rate    -773.220000  -816.280000  759.04
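As a sketch of the conversion process from 6.3.2, we can recover the correlation matrix from the covariance matrix by dividing each entry by the product of the corresponding standard deviations:

# The diagonal of the covariance matrix holds the variances
std = np.sqrt(np.diag(covariance))
corr_from_cov = covariance / np.outer(std, std)
print(corr_from_cov)  # matches features.corr()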

7. Performing EDA on “drug use by age” data.


You will now switch datasets to one with many more variables. This section of the project is more open-ended.

We’ll work with the “drug-use-by-age.csv” data, sourced from and described here: https://github.com/fivethirtyeight/data/tree/master/drug-use-by-age.

7.1 Load the data using pandas.


Does this data require cleaning? Are variables missing? How will this affect your approach to EDA on the data?

#Load the data using pandas
drug = pd.read_csv('drug-use-by-age.csv', na_values='-')
drug.head()

Data quality check:

  • Check for null values
  • Check the column labels
  • Check the dtypes of the columns
# Check for null values in each column
drug.isnull().sum()
age                        0
n                          0
alcohol-use                0
alcohol-frequency          0
marijuana-use              0
marijuana-frequency        0
cocaine-use                0
cocaine-frequency          1
crack-use                  0
crack-frequency            3
heroin-use                 0
heroin-frequency           1
hallucinogen-use           0
hallucinogen-frequency     0
inhalant-use               0
inhalant-frequency         1
pain-releiver-use          0
pain-releiver-frequency    0
oxycontin-use              0
oxycontin-frequency        1
tranquilizer-use           0
tranquilizer-frequency     0
stimulant-use              0
stimulant-frequency        0
meth-use                   0
meth-frequency             2
sedative-use               0
sedative-frequency         0
dtype: int64


We can observe that there are null values in the dataset under cocaine-frequency, crack-frequency, heroin-frequency, inhalant-frequency, oxycontin-frequency, and meth-frequency; these were ‘-’ entries in the csv, which we read in as NaN.

drug.fillna(0, inplace=True)
drug.isnull().sum()
age                        0
n                          0
alcohol-use                0
alcohol-frequency          0
marijuana-use              0
marijuana-frequency        0
cocaine-use                0
cocaine-frequency          0
crack-use                  0
crack-frequency            0
heroin-use                 0
heroin-frequency           0
hallucinogen-use           0
hallucinogen-frequency     0
inhalant-use               0
inhalant-frequency         0
pain-releiver-use          0
pain-releiver-frequency    0
oxycontin-use              0
oxycontin-frequency        0
tranquilizer-use           0
tranquilizer-frequency     0
stimulant-use              0
stimulant-frequency        0
meth-use                   0
meth-frequency             0
sedative-use               0
sedative-frequency         0
dtype: int64
drug.columns
Index(['age', 'n', 'alcohol-use', 'alcohol-frequency', 'marijuana-use',
       'marijuana-frequency', 'cocaine-use', 'cocaine-frequency', 'crack-use',
       'crack-frequency', 'heroin-use', 'heroin-frequency', 'hallucinogen-use',
       'hallucinogen-frequency', 'inhalant-use', 'inhalant-frequency',
       'pain-releiver-use', 'pain-releiver-frequency', 'oxycontin-use',
       'oxycontin-frequency', 'tranquilizer-use', 'tranquilizer-frequency',
       'stimulant-use', 'stimulant-frequency', 'meth-use', 'meth-frequency',
       'sedative-use', 'sedative-frequency'],
      dtype='object')


We replace the ‘-’ (hyphen) in the column names of the drug DataFrame: hyphens prevent attribute-style access (e.g. drug.alcohol_use) and can be misread as the subtraction operator in expression strings. Each hyphen is therefore replaced with a ‘_’. From the column names above, we can also observe that ‘pain-releiver’ is spelled wrongly (it should be ‘pain-reliever’).
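For reference, the hyphens could also be swapped programmatically in one line; that is not used here because it would keep the ‘pain-releiver’ misspelling, which the explicit list below renames at the same time.

# One-line alternative (not run here): swap every hyphen for an underscore
# drug.columns = drug.columns.str.replace('-', '_')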

# List Replacement Method
new_names = ['age', 'n','alcohol_use', 'alcohol_frequency', 'marijuana_use',
        'marijuana_frequency', 'cocaine_use', 'cocaine_frequency',
        'crack_use', 'crack_frequency', 'heroin_use', 'heroin_frequency',
        'hallucinogen_use', 'hallucinogen_frequency', 'inhalant_use',
        'inhalant_frequency', 'painrelieve_use', 'painrelieve_frequency',
        'oxycontin_use', 'oxycontin_frequency', 'tranquilizer_use',
        'tranquilizer_frequency', 'stimulant_use', 'stimulant_frequency',
        'meth_use', 'meth_frequency', 'sedative_use', 'sedative_frequency']
drug.columns = new_names
drug.head()


We add a new column near the front of the table to capture each age group’s share of the sample. This will help us understand the age distribution across the various types of drug use.

new_col = (drug['n']/sum(drug['n'])*100)
idx = 2
drug.insert(loc=idx, column='age_percentage', value=new_col)
drug.head(2)


We also make a new dataset without the age column: the age groups are unevenly sized bins stored as strings, so dropping the column makes the data easier to process with numeric techniques.

druggie = drug.drop(['age'], axis=1)
druggie.head(2)

7.2 Do a high-level, initial overview of the data


Get a feel for what this dataset is all about.

Use whichever techniques you’d like, including those from the SAT dataset EDA. The final response to this question should be a written description of what you infer about the dataset.

Some things to consider doing:

  • Look for relationships between variables and subsets of those variables’ values
  • Derive new features from the ones available to help your analysis
  • Visualize everything!


To understand the relationships between variables, we will visualise the data in the form of a heat map and boxplots. The boxplots are used to identify outliers, whereas the heat map will point us to the more closely related variables.

Before doing so, we must separate the data points into two groups: 1) Drug use and 2) Drug Frequency

# Drop n (a count, not a percentage) along with age and age_percentage, so
# that the remaining columns alternate use, frequency, use, frequency, ...
drug_only = drug.drop(['age', 'n', 'age_percentage'], axis=1)
use_drug = list(drug_only.columns[::2])
use = drug_only[use_drug]
freq_drug = list(drug_only.columns[1::2])
freq = drug_only[freq_drug]
#Configure the shape of the boxplot
fig = plt.figure(figsize=(12,8))
ax = fig.gca()

# Plot the boxplot
ax = sns.boxplot(data=use, orient='h', ax=ax)

# Add titles and axes labels to the boxplot
ax.set_title('Different drug use\n')
plt.xlabel(' % drug use\n')
plt.ylabel('Drug type\n')

# Display the boxplot
plt.show()


The drug use boxplot tells us that there are no outliers. Alcohol and marijuana are much more widely used than the other drugs.

#Configure the shape of the boxplot
fig = plt.figure(figsize=(12,8))
ax = fig.gca()

# Plot the boxplot
ax = sns.boxplot(data=freq, orient='h', ax=ax)

# Add titles and axes labels to the boxplot
ax.set_title('Different drug frequency\n')
plt.xlabel(' Drug frequency in 12 months\n')
plt.ylabel('Drug type\n')

# Display the boxplot
plt.show()


The drug frequency boxplot tells us that there are outliers in the hallucinogen, pain reliever, tranquilizer, stimulant, and sedative categories. Alcohol, heroin, and marijuana are also used much more frequently than the other drugs.

# create correlation data
plt.figure(figsize=(20,16))
drug_correlation = drug.corr()

# plot heatmap
ax = sns.heatmap(drug_correlation, linewidth=0.5, cmap="RdBu_r", annot=True)

# turn the axis label
for item in ax.get_yticklabels():
    item.set_rotation(0)

for item in ax.get_xticklabels():
    item.set_rotation(90)


Based on the EDA correlation heatmap, we see strong positive correlations (above 0.95) between the usage rates of certain drugs:

  1. Stimulant and marijuana use
  2. Oxycontin and marijuana use
  3. Hallucinogen and marijuana use
  4. Pain-reliever and marijuana use
  5. Oxycontin and pain-reliever use
  6. Stimulant and pain-reliever use
  7. Tranquilizer and pain-reliever use
  8. Stimulant and oxycontin use
  9. Tranquilizer and oxycontin use

7.3 Create a testable hypothesis about this data


Requirements for the question:

  1. Write a specific question you would like to answer with the data (that can be accomplished with EDA).
  2. Write a description of the “deliverables”: what will you report after testing/examining your hypothesis?
  3. Use EDA techniques of your choice, numeric and/or visual, to look into your question.
  4. Write up your report on what you have found regarding the hypothesis about the data you came up with.

Your hypothesis could be on:

  • Difference of group means
  • Correlations between variables
  • Anything else you think is interesting, testable, and meaningful!

Important notes:

You should be only doing EDA relevant to your question here. It is easy to go down rabbit holes trying to look at every facet of your data, and so we want you to get in the practice of specifying a hypothesis you are interested in first and scoping your work to specifically answer that question.

Some of you may want to jump ahead to “modeling” data to answer your question. This is a topic addressed in the next project and you should not do this for this project. We specifically want you to not do modeling to emphasize the importance of performing EDA before you jump to statistical analysis.

Question:


Is there a difference between the mean percentage of oxycontin use and pain-reliever use?

Deliverables:


  1. Mean percentage of oxycontin use
  2. Mean percentage of pain-reliever use
  3. t-statistic
  4. p-value

The “null hypothesis”


The null hypothesis is a fundamental concept of Frequentist statistical tests. We typically denote the null hypothesis with H0.

If users take oxycontin and pain relievers concurrently, their mean use percentages should not differ.

H0: The mean difference between pain-reliever and oxycontin use is zero.

The “alternative hypothesis”


The alternative hypothesis is the outcome of the experiment that we hope to show. In our example the alternative hypothesis is that there is in fact a mean difference in oxycontin and pain-reliever use.

H1: The parameter of interest, our mean difference between oxycontin and pain-reliever use, is different from zero.

NOTE: The null hypothesis and alternative hypothesis are concerned with the true values, or in other words the parameter of the overall population. Through the process of experimentation / hypothesis testing and statistical analysis of the results we will make an inference about this population parameter.

mean2 = drug['oxycontin_use'].mean()
distri_2 = drug['oxycontin_use'] - mean2
mean2
0.9352941176470588
mean1 = drug['painrelieve_use'].mean()
distri_1 = drug['painrelieve_use'] - mean1
mean1
6.270588235294118
mean_difference = mean1 - mean2
mean_difference
5.3352941176470585

Evaluating our experiment with a t-test and p-value


Say in our experiment we measure the following results:

  • The subjects in the oxycontin group have mean percentage of 0.935
  • The subjects in the pain-reliever group have mean percentage of 6.271

The difference between the oxycontin and pain-reliever groups is 5.335 percentage points. We can perform what is known as a t-test to evaluate this.

First we will calculate a t-statistic. The t-statistic is a measure of the degree to which our groups differ standardized by the variance of our measurements.

Secondly we will calculate a p-value. The p-value is a metric that indicates a probability that our measured difference was due to random chance in the sampling of subjects.
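For two independent samples with unequal variances (Welch’s t-test, which is what scipy’s ttest_ind computes when equal_var=False, as used below), the t-statistic is

t = (mean1 − mean2) / sqrt(s1²/n1 + s2²/n2)

where mean, s², and n are each group’s sample mean, sample variance, and sample size.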

The likelihood of the data given the null hypothesis


For our experiment we will set up a null hypothesis and an alternative hypothesis:

H0: The difference between pain-reliever and oxycontin use is 0.

H1: The difference between pain-reliever and oxycontin use is not 0.

Likewise, our measured difference is 5.335.

Recall that as Frequentists we want to know:

What is the probability that we observed this data GIVEN a specified mean difference in oxycontin and pain-reliever use?

We obviously don’t know the true mean difference in use. The whole point of the analysis is to evaluate it. Instead, we will assume that the true mean difference is zero: the null hypothesis H0 is assumed to be true:

# alpha = 0.05 corresponds to a 95% confidence level
alpha = 0.05

# Compare the raw use columns. Note: testing the demeaned series (distri_1,
# distri_2) would be meaningless, since both have mean 0 by construction.
# scipy's ttest_ind returns a two-sided p-value, so compare it to alpha directly.
results = stats.ttest_ind(drug['painrelieve_use'], drug['oxycontin_use'], equal_var=False)

if results.pvalue < alpha:
    print(f'As p-value={results.pvalue} is less than {alpha}, we reject the null hypothesis and conclude that the mean difference between oxycontin and pain-reliever use is not 0')
else:
    print(f'As p-value={results.pvalue} is not less than {alpha}, we do not reject the null hypothesis')

8. Introduction to dealing with outliers


Outliers are an interesting problem in statistics, in that there is not an agreed upon best way to define them. Subjectivity in selecting and analyzing data is a problem that will recur throughout the course.

  1. Pull out the rate variable from the sat dataset.
  2. Are there outliers in the dataset? Define, in words, how you numerically define outliers.
  3. Print out the outliers in the dataset.
  4. Remove the outliers from the dataset.
  5. Compare the mean, median, and standard deviation of the “cleaned” data without outliers to the original. What is different about them and why?

8.1.1 and 8.1.2 Identify an outlier


An outlier is a data point which is numerically distant from the majority of the data points. A common numeric definition flags points that lie more than two standard deviations from the mean, or more than 1.5 times the IQR beyond the quartiles.

sats['Rate']
0     82
1     81
2     79
3     77
4     72
5     71
6     71
7     69
8     69
9     68
10    67
11    65
12    65
13    63
14    60
15    57
16    56
17    55
18    54
19    53
20    53
21    52
22    51
23    51
24    34
25    33
26    31
27    26
28    23
29    18
30    17
31    13
32    13
33    12
34    12
35    11
36    11
37     9
38     9
39     9
40     8
41     8
42     8
43     7
44     6
45     6
46     5
47     5
48     4
49     4
50     4
Name: Rate, dtype: int64
sats['Rate'].describe()
count    51.000000
mean     37.000000
std      27.550681
min       4.000000
25%       9.000000
50%      33.000000
75%      64.000000
max      82.000000
Name: Rate, dtype: float64


Since the maximum value (82) and the minimum value (4) are both within two standard deviations (2 × 27.55 ≈ 55.1) of the mean (37), we conclude that there are no outliers under this definition. It is not necessary to remove any data points.
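The same conclusion holds under the 1.5 × IQR rule, sketched below:

# Flag points more than 1.5 * IQR outside the quartiles
q1, q3 = sats['Rate'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = sats[(sats['Rate'] < q1 - 1.5 * iqr) | (sats['Rate'] > q3 + 1.5 * iqr)]
print(outliers)  # empty DataFrame: no Rate outliers under this rule either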

9. Percentile scoring and spearman rank correlation


9.1 Calculate the spearman correlation of sat Verbal and Math


  1. How does the spearman correlation compare to the pearson correlation?
  2. Describe clearly in words the process of calculating the spearman rank correlation.
    • Hint: the word “rank” is in the name of the process for a reason!
stats.pearsonr(sat['Verbal'], sat['Math'])
(0.899870852544429, 1.192002673306768e-19)
verbal_math.corr()
Verbal Math
Verbal 1.000000 0.899909
Math 0.899909 1.000000
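Note that both cells above compute the Pearson correlation. The Spearman version can be obtained from scipy; a minimal sketch:

# Spearman rank correlation of Verbal and Math
stats.spearmanr(sat['Verbal'], sat['Math'])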


Both the Pearson and Spearman correlation coefficients range between −1 and 1. The Pearson coefficient measures how closely the two variables are linearly related using their raw values, while the Spearman coefficient is based on the rank order of the values rather than their measured quantities. To calculate the Spearman rank correlation, convert each variable’s values into ranks (1 for the smallest, and so on), then compute the Pearson correlation of the two rank series.

9.2 Percentile scoring


Look up percentile scoring of data. In other words, the conversion of numeric data to their equivalent percentile scores.

http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.percentile.html

http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.percentileofscore.html

  1. Convert Rate to percentiles in the sat scores as a new column.
  2. Show the percentile of California in Rate.
  3. How is percentile related to the spearman rank correlation?
sat['percentile'] = sat['Rate'].map(lambda x: stats.percentileofscore(sat['Rate'],x))
sat.head()
State Rate Verbal Math percentile
0 CT 82 509 510 100.000000
1 NJ 81 499 513 98.076923
2 MA 79 511 515 96.153846
3 NY 77 495 505 94.230769
4 NH 72 520 516 92.307692
sat[sat['State']== 'CA']
State Rate Verbal Math percentile
23 CA 51 498 517 56.730769

9.3 Percentiles and outliers


  1. Why might percentile scoring be useful for dealing with outliers?
  2. Plot the distribution of a variable of your choice from the drug use dataset.
  3. Plot the same variable but percentile scored.
  4. Describe the effect, visually, of converting raw scores to percentiles.

9.3.1 Why might percentile scoring be useful for dealing with outliers?


Percentile scoring compresses outliers: however extreme a raw value is, its percentile score is capped at 100, so an outlier can no longer dominate the scale, and we can still see where it sits relative to the rest of the data.

9.3.2 Plot the distribution of a variable of your choice from the drug use dataset


ax = drug['stimulant_frequency'].plot.hist(bins=50)
ax.set_xlabel('Frequency of Stimulant use')
Text(0.5,0,'Frequency of Stimulant use')

[Histogram of stimulant use frequency]

9.3.3 Plot the same variable but percentile scored


# Percentile-score every value against the whole column
stim_per = [stats.percentileofscore(drug['stimulant_frequency'], n)
            for n in drug['stimulant_frequency']]
ax = sns.distplot(stim_per, kde=False, bins=50)
ax.set_xlabel('Percentile of Stimulant use frequency')
ax.set_ylabel('Frequency')
Text(0,0.5,'Frequency')

[Histogram of percentile-scored stimulant use frequency]

9.3.4 Describe the effect, visually, of converting raw scores to percentiles


From the histogram of stimulant use frequency, we can identify an outlier, and the distribution does not follow a normal distribution, as the graph is not symmetrical. After percentile scoring, the outlier no longer stretches the axis; the mean percentile score of stimulant use frequency lies between 60 and 80.