$$ \newcommand{\floor}[1]{\left\lfloor{#1}\right\rfloor} \newcommand{\ceil}[1]{\left\lceil{#1}\right\rceil} \renewcommand{\mod}{\,\mathrm{mod}\,} \renewcommand{\div}{\,\mathrm{div}\,} \newcommand{\metar}{\,\mathrm{m}} \newcommand{\cm}{\,\mathrm{cm}} \newcommand{\dm}{\,\mathrm{dm}} \newcommand{\litar}{\,\mathrm{l}} \newcommand{\km}{\,\mathrm{km}} \newcommand{\s}{\,\mathrm{s}} \newcommand{\h}{\,\mathrm{h}} \newcommand{\minut}{\,\mathrm{min}} \newcommand{\kmh}{\,\mathrm{\frac{km}{h}}} \newcommand{\ms}{\,\mathrm{\frac{m}{s}}} \newcommand{\mss}{\,\mathrm{\frac{m}{s^2}}} \newcommand{\mmin}{\,\mathrm{\frac{m}{min}}} \newcommand{\smin}{\,\mathrm{\frac{s}{min}}} $$

Prijavi problem


Obeleži sve kategorije koje odgovaraju problemu

Još detalja - opišite nam problem


Uspešno ste prijavili problem!
Status problema i sve dodatne informacije možete pratiti klikom na link.
Nažalost nismo trenutno u mogućnosti da obradimo vaš zahtev.
Molimo vas da pokušate kasnije.

7. Indexing and transposing a table

In this lecture we demonstrate:

  1. how indexing a table provides flexible access to the elements of the table;
  2. how to compute accross rows and columns of the table; and
  3. how to transpose a table.

7.1. Indexing

We have seen that working with columns of a DataFrame is very easy because colums have names. It would be as easy to work with rows of a DataFrame if we had a way to name the rows somehow. The process that does precisely that is called the indexing of a table.

To index a table we first have to identify a column (the indexing column or the index) such that each row in uniquely determined by the value in the indexing column. For example, in the following table

Name Sex Age (yrs) Weight (kg) Height (cm)
Anne f 13 46 160
Ben m 14 52 165
Colin m 13 47 157
Diana f 15 54 165
Ethan m 15 56 163
Fred m 13 45 159
Gloria f 14 49 161
Hellen f 15 52 164
Ian m 15 57 167
Jane f 13 45 158
Kate f 14 51 162

"Name" is a good candidate for the indexing column because in this table vevery student has a unique name (note that in real life this is not necessarily the case). "Height" is not a good choice because there are two students whose height is 165; and the same goes for other columns.

The function set_index sets the index column of the table:

In [1]:
import pandas as pd
students = [["Anne",    "f", 13, 46, 160],
            ["Ben",     "m", 14, 52, 165],
            ["Colin",   "m", 13, 47, 157],
            ["Diana",   "f", 15, 54, 165],
            ["Ethan",   "m", 15, 56, 163],
            ["Fred",    "m", 13, 45, 159],
            ["Gloria",  "f", 14, 49, 161],
            ["Hellen",  "f", 15, 52, 164],
            ["Ian",     "m", 15, 57, 167],
            ["Jane",    "f", 13, 45, 158],
            ["Kate",    "f", 14, 51, 162]]
students_df = pd.DataFrame(students)
students_df.columns=["Name", "Sex", "Age", "Weight", "Height"]

students_ix=students_df.set_index("Name")

The new table (students_ix) differs from the old one (students_df) only in the fact that the rows of the table are now indexed by the names of the students. Here is the unidexed version of the table:

In [2]:
students_df
Out[2]:
Name Sex Age Weight Height
0 Anne f 13 46 160
1 Ben m 14 52 165
2 Colin m 13 47 157
3 Diana f 15 54 165
4 Ethan m 15 56 163
5 Fred m 13 45 159
6 Gloria f 14 49 161
7 Hellen f 15 52 164
8 Ian m 15 57 167
9 Jane f 13 45 158
10 Kate f 14 51 162

and here is the indexed version of the same table:

In [3]:
students_ix
Out[3]:
Sex Age Weight Height
Name
Anne f 13 46 160
Ben m 14 52 165
Colin m 13 47 157
Diana f 15 54 165
Ethan m 15 56 163
Fred m 13 45 159
Gloria f 14 49 161
Hellen f 15 52 164
Ian m 15 57 167
Jane f 13 45 158
Kate f 14 51 162

The column "Name" is still present, but now it has a special statis. If we try to access it as we did before we get an error (the error report is quite long; don't bother reading it carefully, just scroll down):

In [4]:
students_ix["Name"]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3077             try:
-> 3078                 return self._engine.get_loc(key)
   3079             except KeyError:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Name'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-4-475864f81b1d> in <module>
----> 1 students_ix["Name"]

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2686             return self._getitem_multilevel(key)
   2687         else:
-> 2688             return self._getitem_column(key)
   2689 
   2690     def _getitem_column(self, key):

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
   2693         # get column
   2694         if self.columns.is_unique:
-> 2695             return self._get_item_cache(key)
   2696 
   2697         # duplicate columns & possible reduce dimensionality

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
   2487         res = cache.get(item)
   2488         if res is None:
-> 2489             values = self._data.get(item)
   2490             res = self._box_item_values(item, values)
   2491             cache[item] = res

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
   4113 
   4114             if not isna(item):
-> 4115                 loc = self.items.get_loc(item)
   4116             else:
   4117                 indexer = np.arange(len(self.items))[isna(self.items)]

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3078                 return self._engine.get_loc(key)
   3079             except KeyError:
-> 3080                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   3081 
   3082         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Name'

However, it is there as an index column:

In [5]:
students_ix.index
Out[5]:
Index(['Anne', 'Ben', 'Colin', 'Diana', 'Ethan', 'Fred', 'Gloria', 'Hellen',
       'Ian', 'Jane', 'Kate'],
      dtype='object', name='Name')

Visualizing, say, the height of the students in the group now works like this:

In [6]:
import matplotlib.pyplot as plt
plt.figure(figsize=(10,5))
plt.bar(students_ix.index, students_ix["Height"])
plt.title("The height of students")
plt.show()
plt.close()
<Figure size 1000x500 with 1 Axes>

7.2. Accessing rows and individual cells of an indexed table

DataFrame is optimized to provide efficient access to the columns of the table. However, in an indexed DataFrame it is also easy to access rows and cells of the table using the function loc (short for "location").

We can display a single row of the table like this:

In [7]:
students_ix.loc["Ethan"]
Out[7]:
Sex         m
Age        15
Weight     56
Height    163
Name: Ethan, dtype: object

or a range of rows like this:

In [8]:
students_ix.loc["Ethan":"Ian"]
Out[8]:
Sex Age Weight Height
Name
Ethan m 15 56 163
Fred m 13 45 159
Gloria f 14 49 161
Hellen f 15 52 164
Ian m 15 57 167

We can also focus on a particular feature:

In [9]:
students_ix.loc["Ethan", "Height"]
Out[9]:
163

or display a set of features for a set of rows:

In [10]:
students_ix.loc["Ethan":"Ian", "Weight":"Height"]
Out[10]:
Weight Height
Name
Ethan 56 163
Fred 45 159
Gloria 49 161
Hellen 52 164
Ian 57 167

7.3. Computing accross rows and columns of an indexed table

In the table below we have collected the marks of students we have already met in some subjects (Computers, English, Maths, Physics, Chemistry and Arts):

In [11]:
marks = [["Anne",    5, 3, 5, 2, 4, 5],
         ["Ben",     5, 5, 5, 5, 5, 5],
         ["Colin",   4, 5, 3, 4, 5, 4],
         ["Diana",   5, 5, 5, 5, 5, 5],
         ["Ethan",   3, 4, 2, 3, 3, 4],
         ["Fred",    4, 5, 3, 4, 5, 4],
         ["Gloria",  3, 3, 3, 4, 2, 3],
         ["Hellen",  5, 5, 4, 5, 4, 5],
         ["Ian",     4, 5, 4, 4, 3, 5],
         ["Jane",    2, 2, 2, 2, 2, 5],
         ["Kate",    3, 4, 5, 4, 5, 5]]

Let's turn this into an indexed DataFrame

In [12]:
marks_df = pd.DataFrame(marks)
marks_df.columns=["Name", "Computers", "English", "Maths", "Physics", "Chemistry", "Arts"]
marks_ix = marks_df.set_index("Name")
marks_ix
Out[12]:
Computers English Maths Physics Chemistry Arts
Name
Anne 5 3 5 2 4 5
Ben 5 5 5 5 5 5
Colin 4 5 3 4 5 4
Diana 5 5 5 5 5 5
Ethan 3 4 2 3 3 4
Fred 4 5 3 4 5 4
Gloria 3 3 3 4 2 3
Hellen 5 5 4 5 4 5
Ian 4 5 4 4 3 5
Jane 2 2 2 2 2 5
Kate 3 4 5 4 5 5

Computing the average mark per subject is easy: we just apply mean to each column of the table:

In [13]:
for subj in marks_ix.columns:
    print(subj, "->", round(marks_ix[subj].mean(), 2))
Computers -> 3.91
English -> 4.18
Maths -> 3.73
Physics -> 3.82
Chemistry -> 3.91
Arts -> 4.55

To compute the average mark per student we shall apply mean to the rows of the table, which we access using loc. As a warm-up let us compute the average mark for Kate:

In [14]:
print("Kate's marks:")
print(marks_ix.loc["Kate"])
print("The average mark:", round(marks_ix.loc["Kate"].mean(), 2))
Kate's marks:
Computers    3
English      4
Maths        5
Physics      4
Chemistry    5
Arts         5
Name: Kate, dtype: int64
The average mark: 4.33

The names of all the students are located in the index column, so the average mark of each student in the table can be computed like this:

In [15]:
for student in marks_ix.index:
    print(student, "->", round(marks_ix.loc[student].mean(), 2))
Anne -> 4.0
Ben -> 5.0
Colin -> 4.17
Diana -> 5.0
Ethan -> 3.17
Fred -> 4.17
Gloria -> 3.0
Hellen -> 4.67
Ian -> 4.17
Jane -> 2.5
Kate -> 4.33

7.4. Transposing a table

Transposing a table is an operation that swaps the rows and columns of the table so that the first row swaps with first column, the second row swaps with the secong column and so on. When transposing an indexed DataFrame the names of the columns become the index row of the new table, while the index row gives names of the columns in the new table.

Transpose

Recall that DataFrames are optimized for efficient access to columns of the table. Therefore, it is convenient to transpose a table which has a few very long rows. Of coruse, we don't have to transpose a table to be able to work with it efficiently (since loc gives access to rows of the table), so transposing a table is a matter of taste or convenience.

To transpose a table just apply T to get the new, transposed table. For example recall that:

In [16]:
marks_ix
Out[16]:
Computers English Maths Physics Chemistry Arts
Name
Anne 5 3 5 2 4 5
Ben 5 5 5 5 5 5
Colin 4 5 3 4 5 4
Diana 5 5 5 5 5 5
Ethan 3 4 2 3 3 4
Fred 4 5 3 4 5 4
Gloria 3 3 3 4 2 3
Hellen 5 5 4 5 4 5
Ian 4 5 4 4 3 5
Jane 2 2 2 2 2 5
Kate 3 4 5 4 5 5

After transposing:

In [17]:
marks_tr = marks_ix.T

the new table looks line this:

In [18]:
marks_tr
Out[18]:
Name Anne Ben Colin Diana Ethan Fred Gloria Hellen Ian Jane Kate
Computers 5 5 4 5 3 4 3 5 4 2 3
English 3 5 5 5 4 5 3 5 5 2 4
Maths 5 5 3 5 2 3 3 4 4 2 5
Physics 2 5 4 5 3 4 4 5 4 2 4
Chemistry 4 5 5 5 3 5 2 4 3 2 5
Arts 5 5 4 5 4 4 3 5 5 5 5

Let's check what happened to index and columns. In the original table we have:

In [19]:
marks_ix.index
Out[19]:
Index(['Anne', 'Ben', 'Colin', 'Diana', 'Ethan', 'Fred', 'Gloria', 'Hellen',
       'Ian', 'Jane', 'Kate'],
      dtype='object', name='Name')
In [20]:
marks_ix.columns
Out[20]:
Index(['Computers', 'English', 'Maths', 'Physics', 'Chemistry', 'Arts'], dtype='object')

while in the transposed table we have:

In [21]:
marks_tr.index
Out[21]:
Index(['Computers', 'English', 'Maths', 'Physics', 'Chemistry', 'Arts'], dtype='object')
In [22]:
marks_tr.columns
Out[22]:
Index(['Anne', 'Ben', 'Colin', 'Diana', 'Ethan', 'Fred', 'Gloria', 'Hellen',
       'Ian', 'Jane', 'Kate'],
      dtype='object', name='Name')

As we have already seen, the average mark per subject can be computed easily:

In [23]:
for subj in marks_ix.columns:
    print(subj, "->", round(marks_ix[subj].mean(), 2))
Computers -> 3.91
English -> 4.18
Maths -> 3.73
Physics -> 3.82
Chemistry -> 3.91
Arts -> 4.55

To compute the average marks for each student, we can use loc to access the rows of the original table, but we can alsoapply the same logic as above, but to the transposed table:

In [24]:
for student in marks_tr.columns:
    print(student, "->", round(marks_tr[student].mean(), 2))
Anne -> 4.0
Ben -> 5.0
Colin -> 4.17
Diana -> 5.0
Ethan -> 3.17
Fred -> 4.17
Gloria -> 3.0
Hellen -> 4.67
Ian -> 4.17
Jane -> 2.5
Kate -> 4.33

7.5. Exercises

Exercise 1. Look carefully at the code below and then answer the questions that follow:

In [25]:
import pandas as pd
students = [["Anne",    "f", 13, 46, 160],
            ["Ben",     "m", 14, 52, 165],
            ["Colin",   "m", 13, 47, 157],
            ["Diana",   "f", 15, 54, 165],
            ["Ethan",   "m", 15, 56, 163],
            ["Fred",    "m", 13, 45, 159],
            ["Gloria",  "f", 14, 49, 161],
            ["Hellen",  "f", 15, 52, 164],
            ["Ian",     "m", 15, 57, 167],
            ["Jane",    "f", 13, 45, 158],
            ["Kate",    "f", 14, 51, 162]]
students_df = pd.DataFrame(students)
students_df.columns=["Name", "Sex", "Age", "Weight", "Height"]
students_ix=students_df.set_index("Name")

temp_anomalies = pd.read_csv("data/TempAnomalies.csv", header=None)
temp_anomalies_tr = temp_anomalies.T
temp_anomalies_tr.columns = ["Year", "Anomaly"]
  1. What is the difference between students_df and students_ix?
  2. What does students_ix.index mean?
  3. What is the value of students_ix.loc["Fred"]?
  4. What is the value of students_ix.loc["Fred", "Height"]?
  5. What is the value of students_df.loc["Fred", "Height"]?
  6. What do you think, why did we apply T to temp_anomalies?
  7. How many columns does temp_anomalies_tr have?

Exercise 2. This is an overview of spendings of a family over a year (in the local currency):

Item Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Rent 8,251 8,436 8,524 8,388 8,241 8,196 8,004 7,996 7,991 8,015 8,353 8,456
Electricity 4,321 4,530 4,115 3,990 3,985 3,726 3,351 3,289 3,295 3,485 3,826 3,834
Phone (landline) 1,425 1,538 1,623 1,489 1,521 1,485 1,491 1,399 1,467 1,531 1,410 1,385
Phone (cell) 2,181 2,235 2,073 1,951 1,989 1,945 3,017 2,638 2,171 1,831 1,926 1,833
TV and Internet 2,399 2,399 2,399 2,399 2,399 2,399 2,399 2,399 2,399 2,399 2,399 2,399
Transport 1,830 1,830 1,830 1,830 1,950 1,950 1,450 1,450 1,950 1,950 2,050 2,050
Food 23,250 23,780 24,019 24,117 24,389 24,571 24,736 24,951 25,111 25,389 25,531 25,923
Rest 4,500 3,700 5,100 3,500 2,750 4,250 7,320 8,250 3,270 4,290 3,200 8,390

This table represented as a list looks like this:

In [26]:
spendings = [
  ["Rent", 8251, 8436, 8524, 8388, 8241, 8196, 8004, 7996, 7991, 8015, 8353, 8456],
  ["Electricity", 4321, 4530, 4115, 3990, 3985, 3726, 3351, 3289, 3295, 3485, 3826, 3834],
  ["Landline", 1425, 1538, 1623, 1489, 1521, 1485, 1491, 1399, 1467, 1531, 1410, 1385],
  ["Cell", 2181, 2235, 2073, 1951, 1989, 1945, 3017, 2638, 2171, 1831, 1926, 1833],
  ["TV and Internet", 2399, 2399, 2399, 2399, 2399, 2399, 2399, 2399, 2399, 2399, 2399, 2399 ],
  ["Transport", 1830, 1830, 1830, 1830, 1950, 1950, 1450, 1450, 1950, 1950, 2050, 2050],
  ["Food", 23250, 23780, 24019, 24117, 24389, 24571, 24736, 24951, 25111, 25389, 25531, 25923],
  ["Rest", 4500, 3700, 5100, 3500, 2750, 4250, 7320, 8250, 3270, 4290, 3200, 8390]
]

(a) Convert this list into a DataFrame and index it.

(b) Calculate the average spending of this family per item (Rent, Electricity etc).

Exercise 3. Five groups of students took part in a student poll about their favourite movie genres. Each student was allowed to vote for exactly one genre. The results of the poll are summarized below:

Genre Group 1 Group 2 Group 3 Group 4 Group 5
Comedy 4 3 5 2 3
Horror 1 0 2 1 6
SF 10 7 9 8 9
Adventure 4 3 4 2 2
History 1 0 2 0 0
Romance 11 10 7 9 8

(a) COnvert this into a DataFrame indexed by genre.

(b) Compute the number of votes per genre.

(c) For each group compute the total number of students that took part in polling.

(d) What is the total number of students that took part in polling?

Exercise 4. Nutritive data for certain products in given in the table below:

Product (100g) Nutritive value (kcal) Carbohydrates (g) Proteins (g) Fats (g)
Rye bread 250 48.2 8.4 1.0
White bread 280 57.5 6.8 0.5
Cheese spread 127 4.0 3.1 10.5
Margarin 532 4.6 3.2 1.5
Yoghurt 48 4.7 4.0 3.3
Milk (2.8%) 57 4.7 3.3 2.8
Salami 523 1.0 17.0 47.0
Ham 268 0.0 25.5 18.4
Chicken breast 110 0.0 23.1 1.2

In the cell below we have converted this table into a DataFrame indexed by the product name:

In [27]:
food = pd.DataFrame([
    ["R-bread", 250, 48.2, 8.4, 1.0],
    ["W-bread", 280, 57.5, 6.8, 0.5],
    ["Spread", 127, 4.0, 3.1, 10.5],
    ["Margarin", 532, 4.6, 3.2, 1.5],
    ["Yoghurt", 48, 4.7, 4.0, 3.3],
    ["Milk", 57, 4.7, 3.3, 2.8],
    ["Salami", 523, 1.0, 17.0, 47.0],
    ["Ham", 268, 0.0, 25.5, 18.4],
    ["ChBreast", 110, 0.0, 23.1, 1.2]])
food.columns=["Product", "NutrVal", "Carbs", "Proteins", "Fats"]
food_ix = food.set_index("Product")

(a) For his breakfast Mike had two pieces of white bread and had a cup of milk. Each piece of bread had some cheese spread and a slice of ham. What is the nutritive value of Mike's breakfast (in kcal) if we assume that each piece of bread weighs 100g, that 10h of spread was used per piece of bread, that one slice of ham weighs 20g and that a cup of milk contains 200 d of milk?

(b) How much fat was there in Mike's breakfast?

(c) Visualize the amount of carohydrates in these products.

Exercise 5. The temperature anomaly is a number that tells us how much the average temperature in a particular year deviates from the optimal value. The file TempAnomalies.csv located in the folder data contains the temperature anomalies (in degrees Celsius) for the peroid of 40 years (1977-2017). The file has two rows like this:

1977,1978,1979,1980,1981,...
0.22,0.14,0.15,0.3,0.37,...

(a) Load the table into a DataFrame (Note: the table has no header so you need the header=None option in your read_scv.)

In [ ]:
 

(b) Transpose the table and call the two columns "Year" and "Anomaly".

In [ ]:
 

(c) Index the table.

In [ ]:
 

(d) Visualize the temperature anomalies by a line chart.

In [ ]:
 
© 2019 Petlja.org Creative Commons License