5. Frequency analysis and pie charts
In this lecture you will learn:
- how to perform a frequency analysis of a sequence of data (which is just a fancy name for a simple thing); and
- how to display data on a pie chart.
5.1. Frequency analysis
Frequency analysis of a sequence of values consists of counting how many times each value occurs in the sequence. As simple as that. The numbers that we get are called frequencies. Simple as it is, this analysis can provide significant insight into the problem we are interested in. It is even used in cryptanalysis, but that is far beyond the scope of this handbook.
For example, suppose there are 30 students in a class and their Maths marks are:
marks = ["C", "B", "A", "B", "A", "C", "B", "A", "D", "B", "A", "B", "A", "B", "D", "C", "F", "B", "A", "B", "C", "D", "C", "B", "A", "B", "A", "A", "B", "C"]
The frequency analysis of this sequence reduces to counting how many A's there are in the sequence, then how many B's, and so on. Instead of doing it by hand, we'll let Python do it for us using the built-in list method count:
markA = marks.count("A")
markB = marks.count("B")
markC = marks.count("C")
markD = marks.count("D")
markF = marks.count("F")
print("The distribution of marks:")
print("A ->", markA)
print("B ->", markB)
print("C ->", markC)
print("D ->", markD)
print("F ->", markF)
The distribution of marks:
A -> 9
B -> 11
C -> 6
D -> 3
F -> 1
The numbers we have thus obtained are absolute frequencies. Often we are interested in relative frequencies, which are absolute frequencies expressed as a percentage of the total number of values.
N = len(marks)
percentageA = 100.0 * markA / N
percentageB = 100.0 * markB / N
percentageC = 100.0 * markC / N
percentageD = 100.0 * markD / N
percentageF = 100.0 * markF / N
print("The relative distribution of marks:")
print("A -> ", round(percentageA, 2), "%", sep="")
print("B -> ", round(percentageB, 2), "%", sep="")
print("C -> ", round(percentageC, 2), "%", sep="")
print("D -> ", round(percentageD, 2), "%", sep="")
print("F -> ", round(percentageF, 2), "%", sep="")
The relative distribution of marks:
A -> 30.0%
B -> 36.67%
C -> 20.0%
D -> 10.0%
F -> 3.33%
Therefore, frequency analysis can provide absolute frequencies (how many times a value occurs in the sequence), but also relative frequencies, which are absolute frequencies expressed as percentages.
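The separate count calls used above can also be written more compactly. Here is a minimal sketch, assuming we are free to use Counter from the collections module of the Python standard library (it is not used elsewhere in this lecture). The names absolute and relative are chosen just for illustration.

```python
from collections import Counter

marks = ["C", "B", "A", "B", "A", "C", "B", "A", "D", "B",
         "A", "B", "A", "B", "D", "C", "F", "B", "A", "B",
         "C", "D", "C", "B", "A", "B", "A", "A", "B", "C"]

# Counter tallies every distinct value in a single pass over the list
absolute = Counter(marks)
N = len(marks)

# Relative frequency: the absolute frequency as a percentage of N
relative = {}
for mark, count in absolute.items():
    relative[mark] = round(100.0 * count / N, 2)

# Print both kinds of frequencies, one mark per line
for mark in sorted(absolute):
    print(mark, "->", absolute[mark], "->", str(relative[mark]) + "%")
```

The advantage of this approach is that we do not need to know in advance which values occur in the sequence.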
5.2. Pie charts
In situations where we are interested not so much in the absolute frequencies as in the relative frequencies (that is, we do not care much how many occurrences there are, but what percentage of the whole they make up), it is convenient to visualize data as a pie chart. A pie chart is a circle divided into sectors (like a pie or a pizza). The circle represents the whole (100%) and each sector represents the percentage of the value assigned to the sector.
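To see how a share of the whole turns into a sector, recall that a full circle has 360 degrees, so a value that occurs count times out of total gets a sector whose central angle is 360 * count / total degrees. Here is a small sketch of this arithmetic for the Maths marks counted in the previous section (the variable names are illustrative only):

```python
# Absolute frequencies of the marks from the previous section
counts = {"A": 9, "B": 11, "C": 6, "D": 3, "F": 1}
total = sum(counts.values())

# Each sector gets the same share of 360 degrees
# as its value has of the whole
angles = {}
for mark, count in counts.items():
    angles[mark] = 360 * count / total
    print(mark, "->", angles[mark], "degrees")
```

The angles add up to 360 degrees, just as the percentages add up to 100%.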
For example, here is the distribution of the marks from the example above:
|Mark|Frequency|
|A|9|
|B|11|
|C|6|
|D|3|
|F|1|
Let us visualize this using a pie chart. Let us first load the library:
import matplotlib.pyplot as plt
Then we represent the table as two lists:
freqs = [9, 11, 6, 3, 1]
marks = ["A", "B", "C", "D", "F"]
The pie function produces a pie chart. Its first argument is a list of numbers (frequencies, relative or absolute), while the labels option provides the labels of the sectors:
plt.figure(figsize=(6,6))
plt.pie(freqs, labels=marks)
plt.title("Marks")
plt.show()
plt.close()
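The chart above shows the sectors but not the numbers behind them. As a sketch, the pie function also accepts an autopct option: a format string that matplotlib uses to print each sector's percentage inside the sector. The format "%1.1f%%" (one decimal place followed by a percent sign) and the name pct_format are just one possible choice.

```python
import matplotlib.pyplot as plt

freqs = [9, 11, 6, 3, 1]
marks = ["A", "B", "C", "D", "F"]

# Format of the percentage printed in each sector, e.g. "30.0%"
pct_format = "%1.1f%%"

plt.figure(figsize=(6,6))
# pie computes the percentages itself from the numbers it is given
plt.pie(freqs, labels=marks, autopct=pct_format)
plt.title("Marks")
plt.show()
plt.close()
```

Since pie works out the percentages on its own, absolute frequencies can be passed directly.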
If we wish to stress the number of A's in this class, we can use the explode option. It expects a sequence of decimal numbers, one per sector and usually between 0 and 1, which tells the pie function how far to slide each sector away from the center (0 = no sliding; the larger the number, the farther the sector slides from the center).
freqs = [9, 11, 6, 3, 1]
marks = ["A", "B", "C", "D", "F"]
slide = [0.1, 0, 0, 0, 0]
plt.figure(figsize=(6,6))
plt.pie(freqs, labels=marks, explode=slide)
plt.title("Marks")
plt.show()
plt.close()
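The slide list can also be built by the program instead of being typed in by hand. The sketch below is an illustration rather than part of the original example: it slides out whichever sector is the largest, and the offset of 0.1 is an arbitrary choice.

```python
import matplotlib.pyplot as plt

freqs = [9, 11, 6, 3, 1]
marks = ["A", "B", "C", "D", "F"]

# Offset 0.1 for the largest frequency, 0 (no sliding) for all others
largest = max(freqs)
slide = [0.1 if f == largest else 0 for f in freqs]
print(slide)

plt.figure(figsize=(6,6))
plt.pie(freqs, labels=marks, explode=slide)
plt.title("Marks")
plt.show()
plt.close()
```

Here the B sector is slid out, because B is the most frequent mark.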
As another example, let us take a look at the structure of our atmosphere. Our atmosphere is a mixture of many gases, but the dominant ones are:
|Gas|Percentage|
|Nitrogen|78.08%|
|Oxygen|20.94%|
|Argon|0.93%|
|Carbon dioxide|0.05%|
Here comes the pie chart:
perc = [78.08, 20.94, 0.93, 0.05]
gas = ["Nitrogen", "Oxygen", "Argon", "Carbon dioxide"]
plt.figure(figsize=(7,7))
plt.pie(perc, labels=gas)
plt.title("The Composition of the Earth's Atmosphere")
plt.show()
plt.close()
We immediately see a problem: the labels for the two tiny sectors overlap. To fix that, we will slide them out using the explode option:
perc = [78.08, 20.94, 0.93, 0.05]
gas = ["Nitrogen", "Oxygen", "Argon", "Carbon dioxide"]
slide = [0, 0, 0.75, 0.75]
plt.figure(figsize=(7,7))
plt.pie(perc, labels=gas, explode=slide)
plt.title("The Composition of the Earth's Atmosphere")
plt.show()
plt.close()
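Sliding the sectors out is one way to deal with overlapping labels. Another option, sketched here as an alternative and not as the approach used in this lecture, is to move the names into a legend next to the chart. The sketch assumes a matplotlib version in which pie accepts labeldistance=None (the labels are then not drawn on the chart but remain available to the legend function). The name legend_labels, the figure size and the legend placement are illustrative choices.

```python
import matplotlib.pyplot as plt

perc = [78.08, 20.94, 0.93, 0.05]
gas = ["Nitrogen", "Oxygen", "Argon", "Carbon dioxide"]

# Combine each name with its percentage, e.g. "Argon (0.93%)"
legend_labels = [name + " (" + str(p) + "%)" for name, p in zip(gas, perc)]

plt.figure(figsize=(9,7))
# labeldistance=None keeps the labels off the chart itself
plt.pie(perc, labels=legend_labels, labeldistance=None)
# Place the legend to the right of the pie
plt.legend(loc="center left", bbox_to_anchor=(1.0, 0.5))
plt.title("The Composition of the Earth's Atmosphere")
plt.show()
plt.close()
```

With a legend, even very small sectors can be identified by their color, and the percentages can still be read off.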
Exercise 1. Visualize the population of the continents by a pie chart:
Exercise 2. Between 2006 and 2008 the market of optical storage media was the stage of the high definition optical disc format war. As HD became more and more popular, the need for a larger capacity optical disc emerged, and the final battle for this market niche was fought between the Blu-ray Disc (BR) and HD-DVD. For more than two years the two formats had almost equal market share, but then at the beginning of 2008 something strange happened: within a week the situation changed drastically. The following table contains the market share of the two formats on January 5th, 2008 and seven days later:
|Date|Blu-ray|HD-DVD|
|January 5th, 2008|51.17%|48.83%|
|January 12th, 2008|92.53%|7.47%|
So, after two years of bitter fighting Blu-ray won.
(a) Make two independent pie charts to visualize the market share of the two formats on January 5th, 2008 and on January 12th, 2008.
(b*) Search the Internet and try to find what happened in the week January 5-12th, 2008. (Hint: Has to do with Sony, who was a major proponent of Blu-ray Disc.)
Exercise 3. This is how you make the perfect lemonade: put a cup of warm water and a cup of sugar in a saucepan and stir until sugar dissolves completely; then add one cup of lemon juice and three cups of cold water.
Compute the percentage of water, sugar and lemon juice in the perfect lemonade and make a pie chart in which the sugar is slightly separated from the other ingredients.
Exercise 4. Watching it rain in Macondo, Isabel decided to start a meteorological diary. If it rained on Monday, Isabel would write 1 into the diary; if it rained on Tuesday, she would write 2 into the diary; then 3 for rainy Wednesdays and so on until 7 for rainy Sundays. In the end she got the following list of numbers for a year:
MeteoDiary = [1,2,4,7,2,4,7,6,7,5,6,7,3,5,7,1,3,6,2,3,4,2,3,1,4,7,7, 6,5,6,4,5,6,2,3,4,5,1,3,4,2,5,7,2,3,5,3,5,7,6,7,2,3,7, 1,2,3,4,5,6,7,2,7,3,4,1,5,6,1,2,4,5,6,7,1,3,4,1,2,3,4, 2,5,7,6,4,5,6,1,3,7,5,7,1,2,3,7,7,3,4,7,1,2,4,7,4,7,2, 3,4,4,6,8,1,7,7,7,3,4,5,6,7,1,2,4,7,1,2,3,1,7,2,7]
(a) How many rainy days did Isabel record?
(b) How many Mondays, Tuesdays and so on were there that year? Illustrate this by a pie chart.
(c) Which day of the week was the rainiest?
Exercise 5. The cell below contains the first few decimals of $\pi$:
(a) How many decimals of $\pi$ are there in the string?
(b) Make the frequency analysis of this string.
(c) Illustrate the outcome of the frequency analysis by a pie chart.