The Olympic medal data for the following exercises comes from The Guardian. It comprises records of all events held at the Olympic games between 1896 and 2012.

In [11]:
import pandas as pd

medals = pd.read_csv('Summer Olympic medallists 1896 to 2008 - ALL MEDALISTS.csv',skiprows=4)
print medals.shape
medals.head()
(29216, 10)
Out[11]:
City Edition Sport Discipline Athlete NOC Gender Event Event_gender Medal
0 Athens 1896 Aquatics Swimming HAJOS, Alfred HUN Men 100m freestyle M Gold
1 Athens 1896 Aquatics Swimming HERSCHMANN, Otto AUT Men 100m freestyle M Silver
2 Athens 1896 Aquatics Swimming DRIVAS, Dimitrios GRE Men 100m freestyle for sailors M Bronze
3 Athens 1896 Aquatics Swimming MALOKINIS, Ioannis GRE Men 100m freestyle for sailors M Gold
4 Athens 1896 Aquatics Swimming CHASAPIS, Spiridon GRE Men 100m freestyle for sailors M Silver
In [12]:
medals.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29216 entries, 0 to 29215
Data columns (total 10 columns):
City            29216 non-null object
Edition         29216 non-null int64
Sport           29216 non-null object
Discipline      29216 non-null object
Athlete         29216 non-null object
NOC             29216 non-null object
Gender          29216 non-null object
Event           29216 non-null object
Event_gender    29216 non-null object
Medal           29216 non-null object
dtypes: int64(1), object(9)
memory usage: 2.2+ MB

Using .value_counts() for ranking

  • pandas Series method .value_counts()
    • determine the top 15 countries ranked by total number of medals.
      • .value_counts() sorts by values by default.
        • The result is returned as a Series of counts indexed by unique entries from the original Series with values (counts) ranked in descending order.
In [14]:
# Select the 'NOC' column of medals: country_names
country_names = medals['NOC']
print type(country_names)
country_names.head()
<class 'pandas.core.series.Series'>
Out[14]:
0    HUN
1    AUT
2    GRE
3    GRE
4    GRE
Name: NOC, dtype: object
In [18]:
# Count the number of medals won by each country: medal_counts
medal_counts = country_names.value_counts()
print type(medal_counts)
medal_counts.head()
<class 'pandas.core.series.Series'>
Out[18]:
USA    4335
URS    2049
GBR    1594
FRA    1314
ITA    1228
Name: NOC, dtype: int64

Using .pivot_table() to count medals by type

  • Rather than ranking countries by total medals won and showing that list, you may want to see a bit more detail.
    • You can use a pivot table to compute how many separate bronze, silver and gold medals each country won.
      • That pivot table can then be used to repeat the previous computation to rank by total medals won.
In [20]:
medals.head(3)
Out[20]:
City Edition Sport Discipline Athlete NOC Gender Event Event_gender Medal
0 Athens 1896 Aquatics Swimming HAJOS, Alfred HUN Men 100m freestyle M Gold
1 Athens 1896 Aquatics Swimming HERSCHMANN, Otto AUT Men 100m freestyle M Silver
2 Athens 1896 Aquatics Swimming DRIVAS, Dimitrios GRE Men 100m freestyle for sailors M Bronze
In [23]:
# Construct the pivot table: counted
counted = medals.pivot_table(index='NOC', values='Athlete', columns='Medal', aggfunc='count')
counted.head(7)
Out[23]:
Medal Bronze Gold Silver
NOC
AFG 1.0 NaN NaN
AHO NaN NaN 1.0
ALG 8.0 4.0 2.0
ANZ 5.0 20.0 4.0
ARG 88.0 68.0 83.0
ARM 7.0 1.0 1.0
AUS 413.0 293.0 369.0
In [25]:
# Create the new column: counted['totals']
counted['totals'] = counted.sum(axis='columns')
counted.head(7)
Out[25]:
Medal Bronze Gold Silver totals
NOC
AFG 1.0 NaN NaN 2.0
AHO NaN NaN 1.0 2.0
ALG 8.0 4.0 2.0 28.0
ANZ 5.0 20.0 4.0 58.0
ARG 88.0 68.0 83.0 478.0
ARM 7.0 1.0 1.0 18.0
AUS 413.0 293.0 369.0 2150.0
In [27]:
# Sort counted by the 'totals' column
counted = counted.sort_values('totals', ascending=False)
counted.head(7)
Out[27]:
Medal Bronze Gold Silver totals
NOC
USA 1052.0 2088.0 1195.0 8670.0
URS 584.0 838.0 627.0 4098.0
GBR 505.0 498.0 591.0 3188.0
FRA 475.0 378.0 461.0 2628.0
ITA 374.0 460.0 394.0 2456.0
GER 454.0 407.0 350.0 2422.0
AUS 413.0 293.0 369.0 2150.0

Applying .drop_duplicates()

  • What could be the difference between the 'Event_gender' and 'Gender' columns?
  • you should be able to evaluate your guess by looking at the unique values of the pairs (Event_gender, Gender)
  • The duplicates can be dropped using the .drop_duplicates() method, leaving behind the unique observations.
In [31]:
# Select columns: ev_gen
ev_gen = medals[['Event_gender', 'Gender']]

print ev_gen.shape
ev_gen.head()
(29216, 2)
Out[31]:
Event_gender Gender
0 M Men
1 M Men
2 M Men
3 M Men
4 M Men
In [33]:
# Drop duplicate pairs: ev_gen_uniques
ev_gen_uniques = ev_gen.drop_duplicates()
ev_gen_uniques
Out[33]:
Event_gender Gender
0 M Men
348 X Men
416 W Women
639 X Women
23675 W Men

Finding possible errors with .groupby()

  • you will now use .groupby() to continue your exploration. Your job is to group by 'Event_gender' and 'Gender' and count the rows.
  • You will see that there is only one suspicious row: This is likely a data error.
In [35]:
# Group medals by the two columns: medals_by_gender
medals_by_gender = medals.groupby(['Event_gender', 'Gender'])

# Create a DataFrame with a group count: medal_count_by_gender
medal_count_by_gender = medals_by_gender.count()

medal_count_by_gender
Out[35]:
City Edition Sport Discipline Athlete NOC Event Medal
Event_gender Gender
M Men 20067 20067 20067 20067 20067 20067 20067 20067
W Men 1 1 1 1 1 1 1 1
Women 7277 7277 7277 7277 7277 7277 7277 7277
X Men 1653 1653 1653 1653 1653 1653 1653 1653
Women 218 218 218 218 218 218 218 218

Locating suspicious data

In [37]:
# Create the Boolean Series: sus
sus = (medals.Event_gender == 'W') & (medals.Gender == 'Men')
sus.head()
Out[37]:
0    False
1    False
2    False
3    False
4    False
dtype: bool
In [38]:
# Create a DataFrame with the suspicious row: suspect
suspect = medals[sus] 
suspect
Out[38]:
City Edition Sport Discipline Athlete NOC Gender Event Event_gender Medal
23675 Sydney 2000 Athletics Athletics CHEPCHUMBA, Joyce KEN Men marathon W Bronze

Constructing alternative country rankings

Counting distinct events

In [43]:
print len(medals['Sport'].unique())
medals['Sport'].unique()
42
Out[43]:
array(['Aquatics', 'Athletics', 'Cycling', 'Fencing', 'Gymnastics',
       'Shooting', 'Tennis', 'Weightlifting', 'Wrestling', 'Archery',
       'Basque Pelota', 'Cricket', 'Croquet', 'Equestrian', 'Football',
       'Golf', 'Polo', 'Rowing', 'Rugby', 'Sailing', 'Tug of War',
       'Boxing', 'Lacrosse', 'Roque', 'Hockey', 'Jeu de paume', 'Rackets',
       'Skating', 'Water Motorsports', 'Modern Pentathlon', 'Ice Hockey',
       'Basketball', 'Canoe / Kayak', 'Handball', 'Judo', 'Volleyball',
       'Table Tennis', 'Badminton', 'Baseball', 'Softball', 'Taekwondo',
       'Triathlon'], dtype=object)
In [71]:
# Group medals by 'NOC': country_grouped
country_grouped = medals.groupby('NOC')

# Compute the number of distinct sports in which each country won medals: Nsports
Nsports = country_grouped['Sport'].nunique()
Nsports.head()
Out[71]:
NOC
AFG     1
AHO     1
ALG     3
ANZ     5
ARG    17
Name: Sport, dtype: int64
In [75]:
country_grouped['Sport'].count().head()
Out[75]:
NOC
AFG      1
AHO      1
ALG     14
ANZ     29
ARG    239
Name: Sport, dtype: int64
In [76]:
# Sort the values of Nsports in descending order
Nsports = Nsports.sort_values(ascending=False)
Nsports.head()
Out[76]:
NOC
USA    34
GBR    31
FRA    28
GER    26
CHN    24
Name: Sport, dtype: int64
In [79]:
country_grouped['Sport'].count().sort_values(ascending = False).head()
Out[79]:
NOC
USA    4335
URS    2049
GBR    1594
FRA    1314
ITA    1228
Name: Sport, dtype: int64
In [ ]:
 

Ranking of distinct events

  • Top five countries that have won medals in the most sports
  • Compare medal counts of USA and USSR from 1952 to 1988

Two new DataFrame methods

  • idxmax(): Row or column label where maximum value is located
  • idxmin(): Row or column label where minimum value is located
In [60]:
count = medals.groupby('NOC')['Medal'].count()
count.head()
Out[60]:
NOC
AFG      1
AHO      1
ALG     14
ANZ     29
ARG    239
Name: Medal, dtype: int64
In [61]:
count.index
Out[61]:
Index([u'AFG', u'AHO', u'ALG', u'ANZ', u'ARG', u'ARM', u'AUS', u'AUT', u'AZE',
       u'BAH',
       ...
       u'URS', u'URU', u'USA', u'UZB', u'VEN', u'VIE', u'YUG', u'ZAM', u'ZIM',
       u'ZZX'],
      dtype='object', name=u'NOC', length=138)
In [62]:
count.idxmax()
Out[62]:
'USA'
In [63]:
count.idxmin()
Out[63]:
'AFG'

Using idxmax() along columns

In [80]:
import numpy as np
In [112]:
count2 = medals.groupby('NOC')['Medal','Event'].count()
count2['Event']=count2['Event'].apply(lambda x: x+np.random.randint(-10,5))
count2.head()
Out[112]:
Medal Event
NOC
AFG 1 -1
AHO 1 -8
ALG 14 10
ANZ 29 32
ARG 239 229
In [113]:
count2.head()
Out[113]:
Medal Event
NOC
AFG 1 -1
AHO 1 -8
ALG 14 10
ANZ 29 32
ARG 239 229
In [114]:
count2.idxmax(1).value_counts()
Out[114]:
Medal    105
Event     33
dtype: int64
In [117]:
count2.idxmax(axis='columns').head()
Out[117]:
NOC
AFG    Medal
AHO    Medal
ALG    Medal
ANZ    Event
ARG    Medal
dtype: object

filtering

Counting USA vs. USSR Cold War Olympic Sports

  • The Olympic competitions between 1952 and 1988 took place during the height of the Cold War between the United States of America (USA) & the Union of Soviet Socialist Republics (USSR).
    • Your goal in this exercise is to aggregate the number of distinct sports in which the USA and the USSR won medals during the Cold War years.
In [132]:
# Extract all rows for which the 'Edition' is between 1952 & 1988: during_cold_war
during_cold_war = (medals.Edition>=1952) & (medals.Edition<=1988)
print during_cold_war.value_counts()
during_cold_war.head()
False    17637
True     11579
Name: Edition, dtype: int64
Out[132]:
0    False
1    False
2    False
3    False
4    False
Name: Edition, dtype: bool
In [133]:
# Extract rows for which 'NOC' is either 'USA' or 'URS': is_usa_urs
is_usa_urs = medals.NOC.isin(['USA', 'URS'])
print is_usa_urs.value_counts()
is_usa_urs.head()
False    22832
True      6384
Name: NOC, dtype: int64
Out[133]:
0    False
1    False
2    False
3    False
4    False
Name: NOC, dtype: bool
In [134]:
# Use during_cold_war and is_usa_urs to create the DataFrame: cold_war_medals
cold_war_medals = medals.loc[during_cold_war & is_usa_urs]
print cold_war_medals.shape
cold_war_medals.head()
(3544, 10)
Out[134]:
City Edition Sport Discipline Athlete NOC Gender Event Event_gender Medal
8019 Helsinki 1952 Aquatics Diving LEE, Samuel USA Men 10m platform M Gold
8021 Helsinki 1952 Aquatics Diving STOVER-IRWIN, Juno Roslays USA Women 10m platform W Bronze
8022 Helsinki 1952 Aquatics Diving MCCORMICK, Patricia USA Women 10m platform W Gold
8023 Helsinki 1952 Aquatics Diving MYERS-POPE, Paula Jean USA Women 10m platform W Silver
8024 Helsinki 1952 Aquatics Diving CLOTWORTHY, Robert Lynn USA Men 3m springboard M Bronze

Create a Series Nsports from country_grouped using indexing & chained methods:

  • Extract the column 'Sport'; ### Use .nunique() to get the number of unique elements in each group;
  • Apply .sort_values(ascending=False) to rearrange the Series.
In [135]:
# Group cold_war_medals by 'NOC'
country_grouped = cold_war_medals.groupby('NOC')

# Create Nsports
Nsports = country_grouped['Sport'].nunique().sort_values(ascending=False)

print Nsports.shape
Nsports
(2L,)
Out[135]:
NOC
URS    21
USA    20
Name: Sport, dtype: int64
In [136]:
country_grouped['Sport'].nunique().head()
Out[136]:
NOC
URS    21
USA    20
Name: Sport, dtype: int64
In [139]:
cold_war_medals.groupby('NOC').count()
Out[139]:
City Edition Sport Discipline Athlete Gender Event Event_gender Medal
NOC
URS 2049 2049 2049 2049 2049 2049 2049 2049 2049
USA 1495 1495 1495 1495 1495 1495 1495 1495 1495

Counting USA vs. USSR Cold War Olympic Medals

  • see which country, the USA or the USSR, won the most medals consistently over the Cold War period.
    • ou'll need a pivot table with years ('Edition') on the index and countries ('NOC') on the columns. The entries will be the total number of medals each country won that year. If the country won no medals in a given edition, expect a NaN in that entry of the pivot table.
In [143]:
# Create the pivot table: medals_won_by_country
medals_won_by_country = medals.pivot_table(index='Edition', columns='NOC', values='Athlete', aggfunc='count')
medals_won_by_country.head()
Out[143]:
NOC AFG AHO ALG ANZ ARG ARM AUS AUT AZE BAH ... URS URU USA UZB VEN VIE YUG ZAM ZIM ZZX
Edition
1896 NaN NaN NaN NaN NaN NaN 2.0 5.0 NaN NaN ... NaN NaN 20.0 NaN NaN NaN NaN NaN NaN 6.0
1900 NaN NaN NaN NaN NaN NaN 5.0 6.0 NaN NaN ... NaN NaN 55.0 NaN NaN NaN NaN NaN NaN 34.0
1904 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN ... NaN NaN 394.0 NaN NaN NaN NaN NaN NaN 8.0
1908 NaN NaN NaN 19.0 NaN NaN NaN 1.0 NaN NaN ... NaN NaN 63.0 NaN NaN NaN NaN NaN NaN NaN
1912 NaN NaN NaN 10.0 NaN NaN NaN 14.0 NaN NaN ... NaN NaN 101.0 NaN NaN NaN NaN NaN NaN NaN

5 rows × 138 columns

  • ### You'll need to slice the Cold War period and subset the 'USA' and 'URS' columns.
In [145]:
# Slice medals_won_by_country: cold_war_usa_usr_medals
cold_war_usa_usr_medals = medals_won_by_country.loc[1952:1988, ['USA','URS']]
cold_war_usa_usr_medals
Out[145]:
NOC USA URS
Edition
1952 130.0 117.0
1956 118.0 169.0
1960 112.0 169.0
1964 150.0 174.0
1968 149.0 188.0
1972 155.0 211.0
1976 155.0 285.0
1980 NaN 442.0
1984 333.0 NaN
1988 193.0 294.0
  • ### You'll need to make a Series from this slice of the pivot table that tells which country won the most medals in that edition using .idxmax(axis='columns').
  • If .max() returns the maximum value of Series or 1D array, .idxmax() returns the index of the maximizing element.
    • The argument axis=columns or axis=1 is required because, by default, this aggregation would be done along columns for a DataFrame.
      • The final Series contains either 'USA' or 'URS' according to which country won the most medals in each Olympic edition.
        • You can use .value_counts() to count the number of occurrences of each.
In [146]:
# Create most_medals
most_medals = cold_war_usa_usr_medals.idxmax(axis='columns')
most_medals
Out[146]:
Edition
1952    USA
1956    URS
1960    URS
1964    URS
1968    URS
1972    URS
1976    URS
1980    URS
1984    USA
1988    URS
dtype: object

Visualizing

  • USA Medal Counts by Edition: Line Plot
In [158]:
import matplotlib.pyplot as plt
%matplotlib inline

# Create the DataFrame: usa
usa = medals[medals.NOC == 'USA']

# Group usa by ['Edition', 'Medal'] and aggregate over 'Athlete'
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].count()
usa_medals_by_year.head(10)
Out[158]:
Edition  Medal 
1896     Bronze      2
         Gold       11
         Silver      7
1900     Bronze     14
         Gold       27
         Silver     14
1904     Bronze    111
         Gold      146
         Silver    137
1908     Bronze     15
Name: Athlete, dtype: int64

unstack data before plotting

In [159]:
# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')
usa_medals_by_year.head(10)
Out[159]:
Medal Bronze Gold Silver
Edition
1896 2 11 7
1900 14 27 14
1904 111 146 137
1908 15 34 14
1912 31 45 25
1920 38 110 45
1924 51 110 37
1928 17 43 24
1932 59 78 44
1936 15 51 26
In [160]:
# Plot the DataFrame usa_medals_by_year
usa_medals_by_year.plot()
plt.show()

Visualizing

  • USA Medal Counts by Edition: Area Plot
In [162]:
# Create the DataFrame: usa
usa = medals[medals.NOC == 'USA']

# Group usa by 'Edition', 'Medal', and 'Athlete'
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].count()

# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')

# Create an area plot of usa_medals_by_year
usa_medals_by_year.plot.area(alpha=.5)
plt.show()

Visualizing

  • USA Medal Counts by Edition: Area Plot with Ordered Medals
    • You may have noticed that the medals are ordered according to a lexicographic (dictionary) ordering: Bronze < Gold < Silver.
      • However, you would prefer an ordering consistent with the Olympic rules: Bronze < Silver < Gold.
        • You can achieve this using Categorical types.
In [171]:
# Redefine 'Medal' as an ordered categorical
medals.Medal = pd.Categorical(values=medals.Medal, categories=['Bronze', 'Silver', 'Gold'], ordered=True)
medals.Medal.head()
Out[171]:
0      Gold
1    Silver
2    Bronze
3      Gold
4    Silver
Name: Medal, dtype: category
Categories (3, object): [Bronze < Silver < Gold]
In [173]:
# Create the DataFrame: usa
usa = medals[medals.NOC == 'USA']

# Group usa by 'Edition', 'Medal', and 'Athlete'
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].count()

# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')

# Create an area plot of usa_medals_by_year
usa_medals_by_year.plot.area(alpha=.6)
plt.show()
In [ ]: