Manipulating DataFrames¶

Let's quickly review our terminology:

index : the labels along the rows or columns of your Series or DataFrame; the index along either dimension may be hierarchical
- row index : the labels that identify each row, typically used as the primary index
- column index : the labels that identify each column

axis : the numeric designation for the row or column indices; 0 is the row axis and 1 is the column axis. When dealing with multi-indices, the tiers of the hierarchy within an axis are referred to as levels and are accessed similarly
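As a quick illustration of the axis convention, here is a minimal sketch on a toy DataFrame (the data and the names toy, col_totals and so on are made up for this example and are not part of the Baseball data set).

```python
import pandas as pd

# toy data, invented purely for illustration
toy = pd.DataFrame({"G": [10, 20, 30], "AB": [1, 2, 3]}, index=["a", "b", "c"])

# axis=0 works down the rows, giving one result per column
col_totals = toy.sum(axis=0)

# axis=1 works across the columns, giving one result per row
row_totals = toy.sum(axis=1)

# drop() follows the same convention:
# axis=0 removes a row label, axis=1 removes a column label
without_row = toy.drop("a", axis=0)
without_col = toy.drop("AB", axis=1)
```

The same axis numbers show up later in this section whenever we drop rows or columns or concatenate DataFrames.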

More Selecting¶

In the example for this section, we're going to go back to our Baseball data set and load the batting statistics into a DataFrame.

In [1]:

import pandas as pd

# load the batting statistics for all players and seasons
df = pd.read_csv("./datasets/Batting.csv")

The convenient `[]` operator (again)¶

As before, basic slice selections can be made with the [] operator, using a syntax similar to that of list slicing; for example, obtaining the first 5 rows of our data, or the last 15.
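As a reminder of what row slicing looks like, here is a small sketch on a made-up 20-row DataFrame (toy is a stand-in; the same slices apply to df).

```python
import pandas as pd

# toy stand-in for the batting data: 20 rows with a default integer index
toy = pd.DataFrame({"G": range(20), "RBI": [float(i) for i in range(20)]})

first_five = toy[:5]      # rows at positions 0 through 4
last_fifteen = toy[-15:]  # the final 15 rows
```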

We mostly worked on row slicing with the [] selector, but if we pass a column label or list of the columns we'd like, say the RBI and G (games played) data, we get mostly what we'd expect:

In [4]:

df["RBI"][:5]

Out[4]:

0     0.0
1    13.0
2    19.0
3    27.0
4    16.0
Name: RBI, dtype: float64

In [5]:

df[["RBI", "G"]][:10]

Out[5]:

	RBI	G
0	0.0	1
1	13.0	25
2	19.0	29
3	27.0	27
4	16.0	25
5	5.0	12
6	2.0	1
7	34.0	31
8	1.0	1
9	11.0	18

Boolean selecting¶

We have yet to make more complex selections beyond index values. Now we're ready to introduce selecting by boolean value. With this kind of selection, we first ask Pandas for the Series or DataFrame of boolean values that represents what we want, then we let loc reduce the original data to the rows where that value is True. Let's see this in action.

Say we want to find all rows in our DataFrame where yearID is 2015. That is

df.yearID == 2015

Let's first see what this does.

In [7]:

df.yearID == 2015

Out[7]:

0         False
1         False
2         False
3         False
4         False
5         False
6         False
7         False
8         False
9         False
10        False
11        False
12        False
13        False
14        False
15        False
16        False
17        False
18        False
19        False
20        False
21        False
22        False
23        False
24        False
25        False
26        False
27        False
28        False
29        False
          ...  
102786    False
102787    False
102788    False
102789    False
102790    False
102791    False
102792    False
102793    False
102794    False
102795    False
102796    False
102797    False
102798    False
102799    False
102800    False
102801    False
102802    False
102803    False
102804    False
102805    False
102806    False
102807    False
102808    False
102809    False
102810    False
102811    False
102812    False
102813    False
102814    False
102815    False
Name: yearID, Length: 102816, dtype: bool

In [8]:

df.loc[df.yearID == 2015][:10] # note we're restricting the return to just the first 10 values

Out[8]:

	playerID	yearID	stint	teamID	lgID	G	AB	R	H	2B	...	RBI	SB	CS	BB	SO	IBB	HBP	SH	SF	GIDP
99847	aardsda01	2015	1	ATL	NL	33	1	0	0	0	...	0.0	0.0	0.0	0	1.0	0.0	0.0	0.0	0.0	0.0
99848	abadfe01	2015	1	OAK	AL	62	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
99849	abreujo02	2015	1	CHA	AL	154	613	88	178	34	...	101.0	0.0	0.0	39	140.0	11.0	15.0	0.0	1.0	16.0
99850	achteaj01	2015	1	MIN	AL	11	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
99851	ackledu01	2015	1	SEA	AL	85	186	22	40	8	...	19.0	2.0	2.0	14	38.0	0.0	1.0	3.0	3.0	3.0
99852	ackledu01	2015	2	NYA	AL	23	52	6	15	3	...	11.0	0.0	0.0	4	7.0	0.0	0.0	0.0	1.0	0.0
99853	adamecr01	2015	1	COL	NL	26	53	4	13	1	...	3.0	0.0	1.0	3	11.0	1.0	1.0	1.0	0.0	0.0
99854	adamsau01	2015	1	CLE	AL	28	1	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	1.0
99855	adamsma01	2015	1	SLN	NL	60	175	14	42	9	...	24.0	1.0	0.0	10	41.0	1.0	0.0	0.0	1.0	1.0
99856	adcocna01	2015	1	CIN	NL	13	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0

10 rows × 22 columns

Now what if we wanted to restrict this further by team? Say we wanted to see only the Minnesota Twins player data for 2015. That is

df.yearID == 2015
AND
df.teamID == "MIN"

We simply put each of these in parentheses and combine them with the & operator.

In [9]:

df.loc[(df.yearID == 2015) & (df.teamID == "MIN")].head(10)

Out[9]:

	playerID	yearID	stint	teamID	lgID	G	AB	R	H	2B	...	RBI	SB	CS	BB	SO	IBB	HBP	SH	SF	GIDP
99850	achteaj01	2015	1	MIN	AL	11	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
99891	arciaos01	2015	1	MIN	AL	19	58	6	16	0	...	8.0	0.0	0.0	4	15.0	4.0	2.0	0.0	1.0	2.0
99954	bernido01	2015	1	MIN	AL	4	5	1	1	1	...	2.0	0.0	0.0	1	3.0	0.0	0.0	0.0	0.0	0.0
99988	boyerbl01	2015	1	MIN	AL	68	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
100030	buxtoby01	2015	1	MIN	AL	46	129	16	27	7	...	6.0	2.0	2.0	6	44.0	0.0	1.0	2.0	0.0	1.0
100139	cottsne01	2015	2	MIN	AL	17	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
100215	doziebr01	2015	1	MIN	AL	157	628	101	148	39	...	77.0	12.0	4.0	61	148.0	2.0	7.0	0.0	8.0	10.0
100221	duensbr01	2015	1	MIN	AL	55	1	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
100222	duffety01	2015	1	MIN	AL	10	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
100249	escobed01	2015	1	MIN	AL	127	409	48	107	31	...	58.0	2.0	3.0	28	86.0	1.0	2.0	2.0	5.0	7.0

10 rows × 22 columns
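Beyond &, boolean masks can also be combined with | (or) and negated with ~ (not). The sketch below uses a made-up five-row DataFrame to show these, along with isin() for testing membership in a list of values. The parentheses matter because Python's & and | bind more tightly than ==, so without them the expression would be grouped incorrectly.

```python
import pandas as pd

# toy data, invented for illustration
toy = pd.DataFrame({
    "playerID": ["p1", "p2", "p3", "p4", "p5"],
    "yearID": [2014, 2015, 2015, 2016, 2014],
    "teamID": ["MIN", "MIN", "WAS", "MIN", "WAS"],
})

# | is "or": rows from 2015, or rows for MIN (or both)
either = toy.loc[(toy.yearID == 2015) | (toy.teamID == "MIN")]

# ~ negates a mask: every row that is not MIN
not_min = toy.loc[~(toy.teamID == "MIN")]

# isin() tests membership in a list of values
in_years = toy.loc[toy.yearID.isin([2015, 2016])]
```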

Now what if we wanted to restrict the result to a subset of columns? This is easy with loc[]: we use our boolean expression as above for the row selection and then the list of columns for our column selection (in this case a much smaller subset of data).

In [10]:

df.loc[(df.yearID == 2015) & (df.teamID == "MIN"),\
       ['playerID', 'G', 'AB', 'H', 'HR', 'RBI']]

Out[10]:

	playerID	G	AB	H	HR	RBI
99850	achteaj01	11	0	0	0	0.0
99891	arciaos01	19	58	16	2	8.0
99954	bernido01	4	5	1	0	2.0
99988	boyerbl01	68	0	0	0	0.0
100030	buxtoby01	46	129	27	2	6.0
100139	cottsne01	17	0	0	0	0.0
100215	doziebr01	157	628	148	28	77.0
100221	duensbr01	55	1	0	0	0.0
100222	duffety01	10	0	0	0	0.0
100249	escobed01	127	409	107	12	58.0
100270	fienca01	62	0	0	0	0.0
100302	fryerer01	15	22	5	0	2.0
100333	gibsoky01	32	5	1	0	0.0
100373	grahajr01	39	0	0	0	0.0
100455	herrmch01	45	103	15	2	10.0
100459	hicksaa01	97	352	90	11	33.0
100486	hugheph01	27	3	0	0	0.0
100488	hunteto01	139	521	125	22	81.0
100521	jepseke01	29	0	0	0	0.0
100564	keplema01	3	7	1	0	0.0
100696	mauerjo01	158	592	157	10	66.0
100701	maytr01	48	3	0	0	0.0
100729	meyeral01	2	0	0	0	0.0
100737	milonto01	24	2	0	0	0.0
100807	nolasri01	9	3	0	0	0.0
100816	nunezed02	72	188	53	4	20.0
100837	orourry01	28	0	0	0	0.0
100872	pelfrmi01	30	3	2	0	0.0
100895	perkigl01	60	0	0	0	0.0
100915	plouftr01	152	573	140	22	86.0
100917	polanjo01	4	10	3	0	1.0
100925	pressry01	27	0	0	0	0.0
100994	robinsh01	83	180	45	0	16.0
101023	rosared01	122	453	121	13	50.0
101067	sanomi01	80	279	75	18	52.0
101069	santada01	91	261	56	0	21.0
101072	santaer01	17	0	0	0	0.0
101079	schafjo02	27	69	15	0	5.0
101144	staufti01	13	0	0	0	0.0
101164	suzukku01	131	433	104	5	50.0
101189	thielca01	6	0	0	0	0.0
101193	thompaa01	41	0	0	0	0.0
101203	tonkimi01	26	0	0	0	0.0
101240	vargake01	58	175	42	5	17.0

Sorting¶

Sorting is facilitated by the sort_values() method. By default, sorting is done in ascending order; specify the parameter ascending=False to get descending order.

In [11]:

df_min_2015 = df.loc[(df.yearID == 2015) & (df.teamID == "MIN"),\
                     ['playerID', 'G', 'AB', 'H', 'HR', 'RBI']]\
            .sort_values('G', ascending=False)
df_min_2015.head(20)

Out[11]:

	playerID	G	AB	H	HR	RBI
100696	mauerjo01	158	592	157	10	66.0
100215	doziebr01	157	628	148	28	77.0
100915	plouftr01	152	573	140	22	86.0
100488	hunteto01	139	521	125	22	81.0
101164	suzukku01	131	433	104	5	50.0
100249	escobed01	127	409	107	12	58.0
101023	rosared01	122	453	121	13	50.0
100459	hicksaa01	97	352	90	11	33.0
101069	santada01	91	261	56	0	21.0
100994	robinsh01	83	180	45	0	16.0
101067	sanomi01	80	279	75	18	52.0
100816	nunezed02	72	188	53	4	20.0
99988	boyerbl01	68	0	0	0	0.0
100270	fienca01	62	0	0	0	0.0
100895	perkigl01	60	0	0	0	0.0
101240	vargake01	58	175	42	5	17.0
100221	duensbr01	55	1	0	0	0.0
100701	maytr01	48	3	0	0	0.0
100030	buxtoby01	46	129	27	2	6.0
100455	herrmch01	45	103	15	2	10.0

We may also do a multi-sort by passing in the list of columns we want sorted. This will sort in the order of the columns provided. For example,

In [12]:

df.loc[(df.yearID == 2015) & (df.teamID == "MIN"),\
        ['playerID', 'G', 'AB', 'H', 'HR', 'RBI']]\
        .sort_values(['G', 'HR'], ascending=False).tail()

Out[12]:

	playerID	G	AB	H	HR	RBI
101189	thielca01	6	0	0	0	0.0
99954	bernido01	4	5	1	0	2.0
100917	polanjo01	4	10	3	0	1.0
100564	keplema01	3	7	1	0	0.0
100729	meyeral01	2	0	0	0	0.0
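The ascending parameter also accepts a list with one flag per sort column, which allows mixing directions. A minimal sketch on made-up data, sorting by games descending and breaking ties by home runs ascending:

```python
import pandas as pd

# toy data, invented for illustration
toy = pd.DataFrame({
    "playerID": ["p1", "p2", "p3", "p4"],
    "G": [4, 4, 3, 6],
    "HR": [0, 2, 1, 0],
})

# G from high to low; within equal G, HR from low to high
mixed = toy.sort_values(["G", "HR"], ascending=[False, True])
```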

DataFrame manipulation¶

Adding and dropping columns¶

In [13]:

df_min_2015.loc[:,'HtoAB'] = 0
df_min_2015.head()

Out[13]:

	playerID	G	AB	H	HR	RBI	HtoAB
100696	mauerjo01	158	592	157	10	66.0	0
100215	doziebr01	157	628	148	28	77.0	0
100915	plouftr01	152	573	140	22	86.0	0
100488	hunteto01	139	521	125	22	81.0	0
101164	suzukku01	131	433	104	5	50.0	0

In [14]:

df_min_2015 = df_min_2015.drop('HtoAB', axis=1)
df_min_2015.head()

Out[14]:

	playerID	G	AB	H	HR	RBI
100696	mauerjo01	158	592	157	10	66.0
100215	doziebr01	157	628	148	28	77.0
100915	plouftr01	152	573	140	22	86.0
100488	hunteto01	139	521	125	22	81.0
101164	suzukku01	131	433	104	5	50.0

In [15]:

df_min_2015.H.head(10)

Out[15]:

100696    157
100215    148
100915    140
100488    125
101164    104
100249    107
101023    121
100459     90
101069     56
100994     45
Name: H, dtype: int64

In [16]:

df_min_2015.loc[:,'HtoAB'] = 0
df_min_2015.loc[:,'HtoAB'] = [v.H/v.AB 
                              if v.AB > 0 else 0 
                              for r, v in df_min_2015.iterrows()]

In [17]:

df_min_2015.head(10)

Out[17]:

	playerID	G	AB	H	HR	RBI	HtoAB
100696	mauerjo01	158	592	157	10	66.0	0.265203
100215	doziebr01	157	628	148	28	77.0	0.235669
100915	plouftr01	152	573	140	22	86.0	0.244328
100488	hunteto01	139	521	125	22	81.0	0.239923
101164	suzukku01	131	433	104	5	50.0	0.240185
100249	escobed01	127	409	107	12	58.0	0.261614
101023	rosared01	122	453	121	13	50.0	0.267108
100459	hicksaa01	97	352	90	11	33.0	0.255682
101069	santada01	91	261	56	0	21.0	0.214559
100994	robinsh01	83	180	45	0	16.0	0.250000
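The hits-to-at-bats ratio can also be computed without looping over rows. The following is a sketch of a vectorized alternative on a small made-up DataFrame (toy is a stand-in for df_min_2015): the division is done column-wise, and where() keeps the result only where AB is positive, substituting 0 elsewhere.

```python
import pandas as pd

# toy stand-in for df_min_2015; the second row has no at-bats
toy = pd.DataFrame({
    "playerID": ["p1", "p2", "p3"],
    "AB": [592, 0, 180],
    "H": [157, 0, 45],
})

# column-wise division; rows with AB == 0 would otherwise yield NaN,
# so where() replaces them with 0.0
toy["HtoAB"] = (toy["H"] / toy["AB"]).where(toy["AB"] > 0, 0.0)
```

On large DataFrames this style is usually much faster than iterrows(), since the arithmetic runs on whole columns at once.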

In [18]:

df_min_2015[df_min_2015.G>80].sort_values('HtoAB', ascending=False)

Out[18]:

	playerID	G	AB	H	HR	RBI	HtoAB
101023	rosared01	122	453	121	13	50.0	0.267108
100696	mauerjo01	158	592	157	10	66.0	0.265203
100249	escobed01	127	409	107	12	58.0	0.261614
100459	hicksaa01	97	352	90	11	33.0	0.255682
100994	robinsh01	83	180	45	0	16.0	0.250000
100915	plouftr01	152	573	140	22	86.0	0.244328
101164	suzukku01	131	433	104	5	50.0	0.240185
100488	hunteto01	139	521	125	22	81.0	0.239923
100215	doziebr01	157	628	148	28	77.0	0.235669
101069	santada01	91	261	56	0	21.0	0.214559

In [19]:

df_min_2015 = df_min_2015.reindex(columns=['playerID', 'HtoAB',  'AB', 'H', 'HR', 'RBI', 'G'])
df_min_2015.head()

Out[19]:

	playerID	HtoAB	AB	H	HR	RBI	G
100696	mauerjo01	0.265203	592	157	10	66.0	158
100215	doziebr01	0.235669	628	148	28	77.0	157
100915	plouftr01	0.244328	573	140	22	86.0	152
100488	hunteto01	0.239923	521	125	22	81.0	139
101164	suzukku01	0.240185	433	104	5	50.0	131

Finally, we can return our DataFrame to its original columns (and order) by reindexing again. Notice also that we can effectively perform a drop() by doing this, though the syntax with reindex() is more verbose.

In [20]:

df_min_2015 = df_min_2015.reindex(columns=['playerID', 'G',  'AB', 'H', 'HR', 'RBI'])
df_min_2015.head()

Out[20]:

	playerID	G	AB	H	HR	RBI
100696	mauerjo01	158	592	157	10	66.0
100215	doziebr01	157	628	148	28	77.0
100915	plouftr01	152	573	140	22	86.0
100488	hunteto01	139	521	125	22	81.0
101164	suzukku01	131	433	104	5	50.0
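To make the comparison between reindex() and drop() concrete, here is a small sketch on made-up data. It also shows one difference worth keeping in mind: reindex() silently creates an all-NaN column for any label that does not exist (SO here is absent from the toy data on purpose).

```python
import pandas as pd

# toy data, invented for illustration
toy = pd.DataFrame({
    "playerID": ["p1", "p2"],
    "G": [158, 157],
    "HtoAB": [0.27, 0.24],
})

# two ways to end up without the HtoAB column
via_reindex = toy.reindex(columns=["playerID", "G"])
via_drop = toy.drop(columns=["HtoAB"])

# asking reindex() for a label that is not present adds a NaN-filled column
with_missing = toy.reindex(columns=["playerID", "G", "SO"])
```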

Adding and dropping rows¶

Adding rows can be achieved using loc[] and setting the new index to a dictionary of values using the column labels as keys.

In [21]:

df_min_2015.loc[200000] = \
    {   'playerID': 'keith01',
        'RBI': '0',
        'G': '0',
        'H': '0',
        'HR': '0',
        'AB': '0' }
    
df_min_2015.tail()

Out[21]:

	playerID	G	AB	H	HR	RBI
100917	polanjo01	4	10	3	0	1
99954	bernido01	4	5	1	0	2
100564	keplema01	3	7	1	0	0
100729	meyeral01	2	0	0	0	0
200000	keith01	0	0	0	0	0

The same works with lists and tuples, with the values given in column order.

In [22]:

df_min_2015.loc[200000] = ('keith01', 1, 1, 1, 1, 1)
df_min_2015.loc[200001] = ['keith02', 1, 1, 1, 1, 1]

df_min_2015.tail()

Out[22]:

	playerID	G	AB	H	HR	RBI
99954	bernido01	4	5	1	0	2
100564	keplema01	3	7	1	0	0
100729	meyeral01	2	0	0	0	0
200000	keith01	1	1	1	1	1
200001	keith02	1	1	1	1	1

Note that we can drop a number of rows at a time by passing a list of the indices we'd like dropped.

In [23]:

df_min_2015 = df_min_2015.drop([200000, 200001], axis=0)
df_min_2015.tail()

Out[23]:

	playerID	G	AB	H	HR	RBI
101189	thielca01	6	0	0	0	0
100917	polanjo01	4	10	3	0	1
99954	bernido01	4	5	1	0	2
100564	keplema01	3	7	1	0	0
100729	meyeral01	2	0	0	0	0

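Similar results can be achieved using append(), which accepts a Series, a DataFrame, or a list of these. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the cells below require an older release; on current versions use pd.concat(), covered later in this section.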

In [24]:

df_min_2015.append(
    pd.Series( 
     {'playerID': 'keith01', 
             'G': 0, 
             'AB': 0, 
             'H':0, 
             'HR': 0, 
             'RBI': 0}, name='200000')).tail()

Out[24]:

	playerID	G	AB	H	HR	RBI
100917	polanjo01	4	10	3	0	1
99954	bernido01	4	5	1	0	2
100564	keplema01	3	7	1	0	0
100729	meyeral01	2	0	0	0	0
200000	keith01	0	0	0	0	0

In [25]:

df_min_2015[:5].append(df_min_2015[-5:])

Out[25]:

	playerID	G	AB	H	HR	RBI
100696	mauerjo01	158	592	157	10	66
100215	doziebr01	157	628	148	28	77
100915	plouftr01	152	573	140	22	86
100488	hunteto01	139	521	125	22	81
101164	suzukku01	131	433	104	5	50
101189	thielca01	6	0	0	0	0
100917	polanjo01	4	10	3	0	1
99954	bernido01	4	5	1	0	2
100564	keplema01	3	7	1	0	0
100729	meyeral01	2	0	0	0	0

In [26]:

df_min_2015[:5].append([df_min_2015[10:12], df_min_2015[-5:]])

Out[26]:

	playerID	G	AB	H	HR	RBI
100696	mauerjo01	158	592	157	10	66
100215	doziebr01	157	628	148	28	77
100915	plouftr01	152	573	140	22	86
100488	hunteto01	139	521	125	22	81
101164	suzukku01	131	433	104	5	50
101067	sanomi01	80	279	75	18	52
100816	nunezed02	72	188	53	4	20
101189	thielca01	6	0	0	0	0
100917	polanjo01	4	10	3	0	1
99954	bernido01	4	5	1	0	2
100564	keplema01	3	7	1	0	0
100729	meyeral01	2	0	0	0	0

The same result can be achieved with pd.concat(), where the default axis is 0.

In [27]:

pd.concat([df_min_2015[:5], 
          df_min_2015[-5:]], axis=0)

Out[27]:

	playerID	G	AB	H	HR	RBI
100696	mauerjo01	158	592	157	10	66
100215	doziebr01	157	628	148	28	77
100915	plouftr01	152	573	140	22	86
100488	hunteto01	139	521	125	22	81
101164	suzukku01	131	433	104	5	50
101189	thielca01	6	0	0	0	0
100917	polanjo01	4	10	3	0	1
99954	bernido01	4	5	1	0	2
100564	keplema01	3	7	1	0	0
100729	meyeral01	2	0	0	0	0
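For readers on pandas 2.0 or later, where DataFrame.append() is no longer available, the single-row and multi-piece appends shown earlier can be sketched with pd.concat() as well. The data below is a made-up stand-in for df_min_2015, and wrapping the new row in a one-row DataFrame is just one reasonable way to do it.

```python
import pandas as pd

# toy stand-in for df_min_2015
toy = pd.DataFrame(
    {"playerID": ["p1", "p2", "p3"], "G": [158, 157, 2]},
    index=[100696, 100215, 100729],
)

# a single new row: build a one-row DataFrame whose index is the new label
new_row = pd.DataFrame({"playerID": ["keith01"], "G": [0]}, index=[200000])
with_row = pd.concat([toy, new_row])

# several pieces at once: pass them all in one list
pieces = pd.concat([toy[:1], toy[1:2], toy[-1:]])
```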

But we can use concat() to make a column-wise concatenation using axis=1 (columns).

In [28]:

pd.concat([df_min_2015[:5], 
          df_min_2015[-5:]], axis=1)

Out[28]:

	playerID	G	AB	H	HR	RBI	playerID	G	AB	H	HR	RBI
99954	NaN	NaN	NaN	NaN	NaN	NaN	bernido01	4	5	1	0	2
100215	doziebr01	157	628	148	28	77	NaN	NaN	NaN	NaN	NaN	NaN
100488	hunteto01	139	521	125	22	81	NaN	NaN	NaN	NaN	NaN	NaN
100564	NaN	NaN	NaN	NaN	NaN	NaN	keplema01	3	7	1	0	0
100696	mauerjo01	158	592	157	10	66	NaN	NaN	NaN	NaN	NaN	NaN
100729	NaN	NaN	NaN	NaN	NaN	NaN	meyeral01	2	0	0	0	0
100915	plouftr01	152	573	140	22	86	NaN	NaN	NaN	NaN	NaN	NaN
100917	NaN	NaN	NaN	NaN	NaN	NaN	polanjo01	4	10	3	0	1
101164	suzukku01	131	433	104	5	50	NaN	NaN	NaN	NaN	NaN	NaN
101189	NaN	NaN	NaN	NaN	NaN	NaN	thielca01	6	0	0	0	0

We can see that the indices are being considered in the concatenation and row indices are being joined. This behavior can be controlled via the join parameter, which we'll leave for the reader to explore.
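As a starting point for that exploration, here is a minimal sketch of the join parameter on two made-up DataFrames whose row indices only partly overlap. The default, join="outer", keeps every row label and fills the gaps with NaN, while join="inner" keeps only the labels present in both.

```python
import pandas as pd

# toy data with partially overlapping row labels
left = pd.DataFrame({"G": [158, 157, 152]}, index=[1, 2, 3])
right = pd.DataFrame({"HR": [28, 22, 5]}, index=[2, 3, 4])

outer = pd.concat([left, right], axis=1)                # default join="outer"
inner = pd.concat([left, right], axis=1, join="inner")  # shared labels only
```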

One last thing we might want to do in an operation like this is to reset the index. Because we are concatenating along axis=1, passing ignore_index=True discards the column labels and replaces them with the integers 0 through n-1, so we can set them later to something more appropriate after the concatenation.

In [29]:

pd.concat([df_min_2015[:5], 
          df_min_2015[-5:]], axis=1, ignore_index=True)

Out[29]:

	0	1	2	3	4	5	6	7	8	9	10	11
99954	NaN	NaN	NaN	NaN	NaN	NaN	bernido01	4	5	1	0	2
100215	doziebr01	157	628	148	28	77	NaN	NaN	NaN	NaN	NaN	NaN
100488	hunteto01	139	521	125	22	81	NaN	NaN	NaN	NaN	NaN	NaN
100564	NaN	NaN	NaN	NaN	NaN	NaN	keplema01	3	7	1	0	0
100696	mauerjo01	158	592	157	10	66	NaN	NaN	NaN	NaN	NaN	NaN
100729	NaN	NaN	NaN	NaN	NaN	NaN	meyeral01	2	0	0	0	0
100915	plouftr01	152	573	140	22	86	NaN	NaN	NaN	NaN	NaN	NaN
100917	NaN	NaN	NaN	NaN	NaN	NaN	polanjo01	4	10	3	0	1
101164	suzukku01	131	433	104	5	50	NaN	NaN	NaN	NaN	NaN	NaN
101189	NaN	NaN	NaN	NaN	NaN	NaN	thielca01	6	0	0	0	0
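The same parameter applies to row-wise concatenation, where it renumbers the row index instead. The sketch below, on made-up data, shows this alongside reset_index(drop=True), which gives the same renumbering after the fact.

```python
import pandas as pd

# toy stand-in for df_min_2015
toy = pd.DataFrame(
    {"playerID": ["p1", "p2", "p3", "p4"], "G": [158, 157, 3, 2]},
    index=[100696, 100215, 100564, 100729],
)

# along axis=0 (the default), ignore_index=True renumbers the rows 0..n-1
stacked = pd.concat([toy[:2], toy[-2:]], ignore_index=True)

# reset_index(drop=True) discards the old row labels in the same way
renumbered = pd.concat([toy[:2], toy[-2:]]).reset_index(drop=True)
```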

Advanced indexing¶

Pandas provides the ability to build more complex indices allowing for highly flexible and natural data access.

We will cover the basics through the MultiIndex object and will leave the remaining exploration to the reader.

Let's get the players on the Washington Nationals who played 100 or more games in 2015 and 2016.

In [30]:

df_was = df[(df.yearID > 2014) & (df.teamID=='WAS') & (df.G > 99)]

In [31]:

df_was.head()

Out[31]:

	playerID	yearID	stint	teamID	lgID	G	AB	R	H	2B	...	RBI	SB	CS	BB	SO	IBB	HBP	SH	SF	GIDP
100193	desmoia01	2015	1	WAS	NL	156	583	69	136	27	...	62.0	13.0	5.0	45	187.0	0.0	3.0	6.0	4.0	9.0
100250	escobyu01	2015	1	WAS	NL	139	535	75	168	25	...	56.0	2.0	2.0	45	70.0	0.0	8.0	1.0	2.0	24.0
100251	espinda01	2015	1	WAS	NL	118	367	59	88	21	...	37.0	5.0	2.0	33	106.0	5.0	6.0	3.0	3.0	6.0
100422	harpebr03	2015	1	WAS	NL	153	521	118	172	38	...	99.0	6.0	4.0	124	131.0	15.0	5.0	0.0	4.0	15.0
100950	ramoswi01	2015	1	WAS	NL	128	475	41	109	16	...	68.0	0.0	0.0	21	101.0	2.0	0.0	0.0	8.0	16.0

5 rows × 22 columns

One obvious problem: if we were to access the data here by player and year, we would have to build a much more involved query, and even more so if we needed to ignore some of the data.

We are going to create a hierarchical index, or MultiIndex, to solve this problem. We'll take the liberty of dropping the columns we don't need (teamID, lgID, stint) and reorganize the index hierarchically.

We will build the MultiIndex from tuples of the data we need, indexing first by player, then by year. To do this we'll just grab all the player IDs and zip them with the years. This will look something like this:

In [32]:

tuple(
zip(
    df_was[['playerID','yearID']].sort_values(by='playerID')['playerID'],
    df_was[['playerID','yearID']].sort_values(by='playerID')['yearID']
)
)

Out[32]:

(('desmoia01', 2015),
 ('escobyu01', 2015),
 ('espinda01', 2015),
 ('espinda01', 2016),
 ('harpebr03', 2015),
 ('harpebr03', 2016),
 ('murphda08', 2016),
 ('ramoswi01', 2015),
 ('ramoswi01', 2016),
 ('rendoan01', 2016),
 ('reverbe01', 2016),
 ('robincl01', 2015),
 ('robincl01', 2016),
 ('taylomi02', 2015),
 ('werthja01', 2016),
 ('zimmery01', 2016))

In [33]:

# create an index to be used over the data we're interested in
idx = \
    pd.MultiIndex.from_tuples(
        tuple(
            zip(
                df_was[['playerID','yearID']].sort_values(by='playerID')['playerID'],
                df_was[['playerID','yearID']].sort_values(by='playerID')['yearID']))
    )
idx

Out[33]:

MultiIndex(levels=[['desmoia01', 'escobyu01', 'espinda01', 'harpebr03', 'murphda08', 'ramoswi01', 'rendoan01', 'reverbe01', 'robincl01', 'taylomi02', 'werthja01', 'zimmery01'], [2015, 2016]],
           labels=[[0, 1, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8, 8, 9, 10, 11], [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1]])

Notice now that we have two levels in our row axis (axis 0), and we will now use that index to build the hierarchically indexed DataFrame.

In [34]:

# sorting the indices is critical for lining up the data in the tuples
df_was = df_was.sort_values(by=['playerID']).\
            set_index(idx).\
            drop(['playerID', 'yearID', 'teamID', 'lgID', 'stint'], axis=1)
df_was

Out[34]:

		G	AB	R	H	2B	3B	HR	RBI	SB	CS	BB	SO	IBB	HBP	SH	SF	GIDP
desmoia01	2015	156	583	69	136	27	2	19	62.0	13.0	5.0	45	187.0	0.0	3.0	6.0	4.0	9.0
escobyu01	2015	139	535	75	168	25	1	9	56.0	2.0	2.0	45	70.0	0.0	8.0	1.0	2.0	24.0
espinda01	2015	118	367	59	88	21	1	13	37.0	5.0	2.0	33	106.0	5.0	6.0	3.0	3.0	6.0
espinda01	2016	157	516	66	108	15	0	24	72.0	9.0	2.0	54	174.0	12.0	20.0	7.0	4.0	4.0
harpebr03	2015	153	521	118	172	38	1	42	99.0	6.0	4.0	124	131.0	15.0	5.0	0.0	4.0	15.0
harpebr03	2016	147	506	84	123	24	2	24	86.0	21.0	10.0	108	117.0	20.0	3.0	0.0	10.0	11.0
murphda08	2016	142	531	88	184	47	5	25	104.0	5.0	3.0	35	57.0	10.0	8.0	0.0	8.0	4.0
ramoswi01	2015	128	475	41	109	16	0	15	68.0	0.0	0.0	21	101.0	2.0	0.0	0.0	8.0	16.0
ramoswi01	2016	131	482	58	148	25	0	22	80.0	0.0	0.0	35	79.0	2.0	2.0	0.0	4.0	17.0
rendoan01	2016	156	567	91	153	38	2	20	85.0	12.0	6.0	65	117.0	2.0	7.0	0.0	8.0	5.0
reverbe01	2016	103	350	44	76	9	7	2	24.0	14.0	5.0	18	34.0	0.0	3.0	2.0	2.0	12.0
robincl01	2015	126	309	44	84	15	1	10	34.0	0.0	0.0	37	52.0	4.0	5.0	0.0	1.0	6.0
robincl01	2016	104	196	16	46	4	0	5	26.0	0.0	0.0	20	38.0	0.0	2.0	1.0	5.0	4.0
taylomi02	2015	138	472	49	108	15	2	14	63.0	16.0	3.0	35	158.0	9.0	1.0	1.0	2.0	5.0
werthja01	2016	143	525	84	128	28	0	21	69.0	5.0	1.0	71	139.0	0.0	4.0	0.0	6.0	17.0
zimmery01	2016	115	427	60	93	18	1	15	46.0	4.0	1.0	29	104.0	1.0	5.0	0.0	6.0	12.0
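A shorter route to a similar result, offered here as an alternative sketch rather than the approach used above, is to pass a list of column names to set_index(). Pandas then builds the MultiIndex from those columns directly, so there is no need to line up tuples by hand. The toy data below borrows a few G values from the table above; everything else is made up.

```python
import pandas as pd

# toy stand-in for df_was, deliberately unsorted
toy = pd.DataFrame({
    "playerID": ["robincl01", "harpebr03", "robincl01", "harpebr03"],
    "yearID": [2016, 2015, 2015, 2016],
    "teamID": ["WAS", "WAS", "WAS", "WAS"],
    "G": [104, 153, 126, 147],
})

indexed = (
    toy.drop(columns=["teamID"])            # drop what we don't need
       .set_index(["playerID", "yearID"])   # the two columns become index levels
       .sort_index()                        # sort by player, then year
)
```

Because the index is built from the columns of each row, the values stay attached to the right labels regardless of the original row order.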

In [35]:

df_was.loc[('robincl01', ),['G', 'AB', 'H', 'SO']]

Out[35]:

	G	AB	H	SO
2015	126	309	84	52.0
2016	104	196	46	38.0

In [36]:

df_was.loc[('robincl01', 2016),['G', 'AB', 'H', 'SO']]

Out[36]:

G     104.0
AB    196.0
H      46.0
SO     38.0
Name: (robincl01, 2016), dtype: float64
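Selecting on the second level alone, for instance every player's 2016 season, can be sketched with xs() or with slice(None) as a placeholder for the first level. The DataFrame below is a made-up miniature of df_was, and the level names playerID and yearID are an assumption of this example (the index built above is unnamed).

```python
import pandas as pd

# toy miniature of the hierarchically indexed DataFrame
idx = pd.MultiIndex.from_tuples(
    [("harpebr03", 2015), ("harpebr03", 2016),
     ("robincl01", 2015), ("robincl01", 2016)],
    names=["playerID", "yearID"],
)
toy = pd.DataFrame({"G": [153, 147, 126, 104]}, index=idx)

# xs() takes a cross-section at one level and drops that level from the result
season_2016 = toy.xs(2016, level="yearID")

# slice(None) means "everything" at the first level; both levels are kept
season_2016_full = toy.loc[(slice(None), 2016), :]
```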

In [41]:

df_mi.head()

Out[41]:

				playerID	yearID	stint	teamID	lgID	G	AB	R	H	2B	...	RBI	SB	CS	BB	SO	IBB	HBP	SH	SF	GIDP
2007	AL	BAL	baezda01	bardebr01	2007	1	ARI	NL	8	12	0	1	0	...	0.0	0.0	0.0	0	3.0	0.0	0.0	0.0	0.0	0.0
			bakopa01	bonifem01	2007	1	ARI	NL	11	23	2	5	1	...	2.0	0.0	1.0	4	3.0	0.0	0.0	0.0	0.0	0.0
			bedarer01	byrneer01	2007	1	ARI	NL	160	626	103	179	30	...	83.0	50.0	7.0	57	98.0	5.0	10.0	1.0	4.0	12.0
			bellro01	callaal01	2007	1	ARI	NL	56	144	10	31	8	...	7.0	1.0	1.0	9	14.0	0.0	1.0	1.0	1.0	8.0
			birkiku01	choatra01	2007	1	ARI	NL	2	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0