How to select a single column from the index of a multi-index dataframe?

I am having trouble getting a column from the indexes of a multi-index pandas dataframe.

Normally, getting a column from a regular dataframe without a multi-index is easy: I simply do df['col_name'] or df.col_name and I get the column. However, this does not work with a multi-index dataframe when the column I want is part of the multi-index. (I can still get a regular column.) For example, if I do df['col_name'] for a col_name that is in the multi-index, I get an error: KeyError: 'col_name'

The only solution that I can think of without going into crazy gymnastics (df.index.get_level_values('col_name')) is the following:

df.reset_index()['col_name'] or df.reset_index().col_name

I feel I must be misunderstanding or missing something, because selecting a column in the multi-index should be easy and straightforward, since it is very common to want that. What is a simple way to select a single column from the multi-index?

Example:

import pandas as pd

file_name = "https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv"
df = pd.read_csv(file_name)
df = df.set_index(['sepal_length','sepal_width'])
print(df.head())

                          petal_length  petal_width species
sepal_length sepal_width                                   
5.1          3.5                   1.4          0.2  setosa
4.9          3.0                   1.4          0.2  setosa
4.7          3.2                   1.3          0.2  setosa
4.6          3.1                   1.5          0.2  setosa
5.0          3.6                   1.4          0.2  setosa


df['sepal_length']        # KeyError: 'sepal_length'
df.sepal_length           # AttributeError: 'DataFrame' object has no attribute 'sepal_length'
df.loc['sepal_length']    # KeyError: 'sepal_length'
df.loc['sepal_length',:]  # KeyError: 'sepal_length'
df.index.sepal_length     # AttributeError: 'MultiIndex' object has no attribute 'sepal_length'
df.index['sepal_length']  # IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

  • What is your expected output? df.loc[:, 'sepal_length'] should return a column.


To get a single column (level) from a multi-index, use the index.get_level_values() method; see the pandas documentation for Index.get_level_values.

import pandas as pd

file_name = ("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv")

df = pd.read_csv(file_name)
df = df.set_index(["sepal_length", "sepal_width"])
df.index.get_level_values("sepal_length")
# or
df.index.get_level_values(0)

Result:

 Float64Index([5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,
              ...
              6.7, 6.9, 5.8, 6.8, 6.7, 6.7, 6.3, 6.5, 6.2, 5.9],
              dtype="float64", name="sepal_length", length=150)
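Note that the result above is an Index rather than a Series. If something that behaves like an ordinary column is needed, one possible approach is to wrap the level values in a Series that shares the DataFrame's index. The sketch below uses a small made-up sample instead of the full iris file, so the values are illustrative only.

```python
import pandas as pd

# Small made-up sample in the same shape as the iris data (illustrative values).
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7],
        "sepal_width": [3.5, 3.0, 3.2],
        "species": ["setosa", "setosa", "setosa"],
    }
).set_index(["sepal_length", "sepal_width"])

# get_level_values returns an Index, not a Series.
level = df.index.get_level_values("sepal_length")

# Wrap the raw values in a Series that reuses the DataFrame's MultiIndex,
# so it lines up row by row with the other columns.
sepal_length = pd.Series(level.to_numpy(), index=df.index, name="sepal_length")
print(sepal_length)
```

Because the Series reuses df.index, it can be combined with the other columns of df without any realignment surprises.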

What you are calling a “column” in the DataFrame MultiIndex is referred to as a level. The way to get a level from a MultiIndex is to use pd.Index.get_level_values() or df.reset_index()["column"], as you said.
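As a variation on the reset_index() route, reset_index also accepts a level argument, which moves only the named level back into the columns and leaves the rest of the index in place. The following is a minimal sketch on a small made-up sample (not the real iris data).

```python
import pandas as pd

# Small made-up sample in the same shape as the iris data (illustrative values).
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7],
        "sepal_width": [3.5, 3.0, 3.2],
        "species": ["setosa", "setosa", "setosa"],
    }
).set_index(["sepal_length", "sepal_width"])

# Move only the sepal_length level out of the index.
# sepal_width stays behind as a regular (single-level) index.
partial = df.reset_index(level="sepal_length")

# The level is now an ordinary column and can be selected as usual.
col = partial["sepal_length"]
print(partial)
```

Whether this is preferable to get_level_values() depends on whether the remaining index is still wanted afterwards.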

You wrote: "I feel I must be misunderstanding or missing something because selecting a column in the multiindex should be easy and straightforward since it is very common to want that."

I do not think it is very common to want to select one of the levels of the MultiIndex. If you find yourself wanting to do this often, then I’d suggest you just have this level as a proper column in the DataFrame itself. For example, you could keep the MultiIndex and also have the level as a column using

df.assign(sepal_length=lambda X: X.index.get_level_values("sepal_length"))

In general, the purpose of the DataFrame index is to uniquely identify each row in the DataFrame. A MultiIndex can help if it takes more than one piece of information to uniquely identify an observation. I often work with hourly time series data where each observation/row is uniquely defined by date and hour, so a MultiIndex with those two pieces of information works well. In the example you give, it’s possible for two irises to have the same sepal width and sepal length, so the combination of these two as the index is not the best choice. A better choice would be something like a flower ID number.
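The uniqueness point can be checked directly with Index.is_unique and Index.duplicated(). The sketch below assumes a hypothetical flower_id column, which does not exist in the original iris file, and uses made-up measurements in which two flowers share the same sepal length and width.

```python
import pandas as pd

# Made-up sample; flower_id is a hypothetical identifier added for illustration.
df = pd.DataFrame(
    {
        "flower_id": [0, 1, 2, 3],
        "sepal_length": [5.1, 4.9, 4.9, 5.0],
        "sepal_width": [3.5, 3.0, 3.0, 3.6],
        "species": ["setosa", "setosa", "setosa", "setosa"],
    }
)

by_measure = df.set_index(["sepal_length", "sepal_width"])
by_id = df.set_index("flower_id")

# Two rows share (4.9, 3.0), so the measurement-based MultiIndex is not unique.
measure_unique = by_measure.index.is_unique
id_unique = by_id.index.is_unique

# Count rows whose index entry repeats an earlier one.
n_dupes = by_measure.index.duplicated().sum()
print(measure_unique, id_unique, n_dupes)
```

pandas allows a non-unique index, but label-based lookups on it can return several rows where one might be expected.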
