Creating data-driven graphics with ggplot2 is one of the most powerful ways to present information in data analysis. Graphic design is an art form that uses creativity to convey messages or ideas, and using statistical concepts to present complex information can be both elegant and striking.
Data visualization using ggplot comes down to knowing your statistics and how to use them effectively in creating pictures that tell a story!
This article will go into detail about how to work with categorical (or discrete) variables, quantitative (measured) variables, and time series data, all within the context of the famous ggplot2 package. We will also look at some easy ways to make your plots more engaging, as well as how to add color schemes and fonts to match your message!
Once you are familiar with these tools, you will be able to create your own creative designs quickly! So let’s get started by looking at how to graph mean vs median for numerical values.
Mean vs Median For Numerical Values
Goal: Graphically compare the means and medians of several numbers
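The mean (or average) is the sum of all the individual numbers divided by how many numbers there are. The median is the middle number once the values are sorted, which makes it less sensitive to a few extreme values.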
In this lesson we will do just that! Let’s dive in now and learn something new.
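The comparison can be sketched in a few lines of R. This is a minimal example under stated assumptions: the small vector named skewed is made up for illustration, and the mpg dataset (which ships with ggplot2) with its hwy and class columns is just a convenient choice, not data from this article.

```r
library(ggplot2)

# A made-up vector with one extreme value: the mean is pulled upward,
# while the median stays at the middle value.
skewed <- c(1, 2, 3, 4, 100)
mean(skewed)    # 22
median(skewed)  # 3

# Mean and median highway mileage per vehicle class (illustrative dataset).
mean_by_class   <- aggregate(hwy ~ class, data = mpg, FUN = mean)
median_by_class <- aggregate(hwy ~ class, data = mpg, FUN = median)

summary_df <- rbind(
  data.frame(class = mean_by_class$class, statistic = "mean",
             value = mean_by_class$hwy),
  data.frame(class = median_by_class$class, statistic = "median",
             value = median_by_class$hwy)
)

# Side-by-side bars make the gap between mean and median visible per group.
p_mean_median <- ggplot(summary_df, aes(x = class, y = value, fill = statistic)) +
  geom_col(position = "dodge") +
  labs(x = "Vehicle class", y = "Highway miles per gallon", fill = "Statistic")

print(p_mean_median)
```

Where the two bars in a group differ noticeably, the values in that group are probably skewed or contain outliers.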
Create a ggplot2 grid
Another way to use ggplot2 is to lay a plot out as a grid of panels. This is called faceting, and it is done with either the facet_wrap() or the facet_grid() function. Both start from an ordinary plot created with ggplot(), which is an empty plot space until you add layers to it.
Grids of panels are very common when making charts that have more than two variables. For example, if your chart had monthly data, you could draw one panel per month, each with its own bars. ggplot2 also has date scales such as scale_x_date() that place breaks and labels along a date axis automatically, which helps make sure nothing gets left out!
ggplot2 also lets you change the coordinate system. The default is Cartesian (x-and-y) coordinates, in which the two axes are at right angles and each variable is positioned independently of the other. Alternatives include coord_flip(), which swaps the two axes, and coord_polar(), which maps positions onto a circle.
By combining the x-axis, the y-axis, and a faceting variable, you can show three variables in one figure. This becomes much easier to manage than a pile of separate plots as the number of variables grows.
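A grid of panels can be sketched with facet_grid(), the ggplot2 function that draws one panel for each combination of two variables. The mpg dataset and the columns used here (displ, hwy, drv, cyl) are assumptions chosen for illustration.

```r
library(ggplot2)

# Engine size against highway mileage, split into a grid:
# one row per drive type (drv) and one column per cylinder count (cyl).
p_grid <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(rows = vars(drv), cols = vars(cyl)) +
  labs(x = "Engine displacement (litres)", y = "Highway miles per gallon")

print(p_grid)
```

Every panel shares the same x and y scales by default, so the panels can be compared directly. When there is only one splitting variable, facet_wrap() is usually the simpler choice.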
Create a histogram
Creating a histogram is one of the easiest ways to make an interesting graph in ggplot2. You probably already know how to make bar charts, but there are other types of charts that can be made with this tool as well!
One such chart type is the histogram. It shows the distribution of a numeric variable by splitting its range into intervals, called bins, and counting how many values fall into each one.
The horizontal axis represents the value of the variable being analyzed, divided into bins, while the vertical axis represents the count. The height of each bar indicates how many times a value in that bin was found.
For example, a histogram of monthly income for people between the ages of 20 and 30 might show that most of them earn less than $1,000 per month, with more than half earning less than $500.
A pattern like this helps show why it is important to focus on career fields that pay well so you can afford to start a family later. Or maybe more importantly, it shows why some people do not invest in their careers: they cannot afford to spend the time doing it.
ggplot2 has several built-in functions that make good use of histograms. For instance, geom_histogram() counts how many values fall in each bin for you (it uses stat_bin() by default), and you can control the bins with the bins or binwidth argument. You can then compare two groups by mapping a grouping variable to fill or by drawing each group in its own facet.
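Here is a minimal histogram sketch. It assumes the faithful dataset that ships with R (waiting times between eruptions of the Old Faithful geyser); the bin width of 5 minutes is an arbitrary choice for illustration.

```r
library(ggplot2)

# geom_histogram() bins the numeric variable on the x-axis and counts
# the observations in each bin; the counts become the bar heights.
p_hist <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 5, boundary = 40, colour = "white") +
  labs(x = "Waiting time between eruptions (minutes)", y = "Count")

print(p_hist)

# The binned counts behind the plot can be inspected directly.
bin_counts <- layer_data(p_hist)$count
```

Changing binwidth changes the shape of the histogram, so it is worth trying a few values before reading too much into any one picture.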
Create a scatterplot
In addition to creating box plots, bar charts, and other graphical representations of data, ggplot2 also offers the user the ability to create what are called “scatterplots.” A scatterplot shows how two variables in a data set relate to each other by drawing one point per observation.
The two variables are usually numerical, although a categorical variable can be shown as well, for example through the color of the points. When drawing a scatterplot, you pick one variable to use as the x-axis (or horizontal) value, and another variable to represent the y-value.
The two variables must be measured on the same observations, so that each point pairs an x-value with its y-value. For example, if your data records a start date and an end date for each event, you could plot the start date as the x-value and the length of the interval between the two dates as the y-value.
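A scatterplot can be sketched as follows. The mtcars dataset that ships with R and the chosen columns (wt, mpg, cyl) are assumptions for illustration only.

```r
library(ggplot2)

# One point per car: weight on the x-axis, fuel economy on the y-axis.
# A third, categorical variable (number of cylinders) is shown as colour.
p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")

print(p_scatter)
```

Wrapping cyl in factor() tells ggplot2 to treat it as a set of categories rather than as a continuous number, which gives each group its own distinct colour.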
Identify significant trends
In addition to creating graphs, you can also identify significant trends in your data. This is done by either finding linear patterns or using non-linear methods such as curved trend lines.
One of the most common ways to do this is with regression. Regression models how a dependent variable changes as one or more independent variables change. For example, if income was found to go up while food intake decreased, a regression would describe that association, but it would not prove that eating less makes money grow faster!
With regression, you pick both your independent and dependent variables, choose your format (linear, polynomial, etc.), test whether there is an association, and determine what type of curve fits your data best.
There are many types of regression, but some of the more popular ones are simple linear regression, polynomial regression, and local smoothing methods such as LOESS.
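The steps above can be sketched with a simple linear regression. This example assumes the mtcars dataset that ships with R, with mpg as the dependent variable and wt as the independent variable; both choices are illustrative.

```r
library(ggplot2)

# Fit a straight line and test whether there is an association.
fit <- lm(mpg ~ wt, data = mtcars)
slope   <- coef(fit)[["wt"]]
p_value <- summary(fit)$coefficients["wt", "Pr(>|t|)"]

# Draw the points with the fitted line and its confidence band.
p_trend <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

print(p_trend)
```

To try a curved fit instead, the formula can be changed to y ~ poly(x, 2), or method can be set to "loess". A small p-value indicates an association between the two variables, not that one causes the other.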
Analyze the data using charts
Before building a plot in ggplot2, it can help to explore a dataset in interactive, easy-to-use graphics software such as Google Charts or Tableau.
You can use these tools to analyze your own data quickly and easily! Many of these applications also offer plugins that allow you to add additional features to create more elaborate graphs.
For example, if you wanted to compare multiple groups, you could upload all of your data into Tableau and make comparisons there. Both tools have free options, so you do not need to be wealthy to experiment.
Google Charts is free to use, and Tableau offers a free edition called Tableau Public alongside its paid products, so most people can start out without paying anything!
These types of apps make it simple to put together pretty much any graph style you want.
Use a histogram to identify the number of people that like sports, how many people like both sports and nature, and how many people like neither
Histograms are a way to look at lots of data points. In this case, we will use an interactive one so you can add or subtract values and see what effect it has on the shape of the graph.
The first thing we need to do is pick our variables. For these examples, we will choose whether someone likes soccer (yes) or not (no), whether they love trees (yes) or not (no) and whether they enjoy watching games (yes) or not (no).
We then make another variable for those who enjoy neither soccer nor nature. A respondent belongs in this group only when they answered no to both the soccer question and the trees question, so the new variable is built by combining those two answers for each person.
Now let’s create our histogram! Tap on the ‘Create Chart’ option to bring up the chart builder.
Here, we can select which metric we want to represent as a bar. We can also change the bin size and color scheme. Since we already have some numbers, we will leave those as-is.
Next, tap on the drop down menu next to Metric and choose Sum.
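The same counts can be produced in ggplot2. This sketch uses a small made-up survey of ten respondents; the data frame, its column names, and the group labels are all hypothetical. Because the answers are categories rather than numbers, it uses geom_bar(), which counts the rows in each category.

```r
library(ggplot2)

# Hypothetical survey answers (made up for illustration).
survey <- data.frame(
  likes_sports = c("yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "yes"),
  likes_nature = c("yes", "no", "yes", "no", "yes", "no", "no", "yes", "yes", "no")
)

# Combine the two answers into a single group per respondent.
survey$group <- with(survey,
  ifelse(likes_sports == "yes" & likes_nature == "yes", "both",
  ifelse(likes_sports == "yes", "sports only",
  ifelse(likes_nature == "yes", "nature only", "neither"))))

group_counts <- table(survey$group)
sports_total <- sum(survey$likes_sports == "yes")

# One bar per group; bar height is the number of respondents.
p_survey <- ggplot(survey, aes(x = group)) +
  geom_bar() +
  labs(x = "Interest group", y = "Number of respondents")

print(p_survey)
```

With this made-up data, six people like sports in total, three like both sports and nature, and two like neither.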
Identify the most popular sports
Team colors are an important part of what makes up a sport’s identity. What color uniforms do you see people supporting the most?
One rough way to gauge which teams are most popular is to count how often their colors show up in what people are wearing. The more often a color appears, the more support that team is likely to have.
By analyzing the data for the top 10 team colors in America, we get some interesting results!
We found that red was the number one favorite team color and that blue was the second most popular. This may seem surprising, until you consider that there are very few professional basketball games with just black jerseys.
That means when fans choose between a red or a white jersey, they’re actually choosing between two completely different logos! (Something not too common with football.)
Another surprising finding is that no state had all green as its team colors. That may be because it is difficult to come up with appropriate shades of green and none exist outside of spring and summer.
Create a pie chart to show how respondents divide their interests
Pie charts are one of the most common types of graphs used in data visualization. They go back at least as far as 1801, when William Playfair published what is generally regarded as the first pie chart.
Today, there are many ways to make use of pie charts. You can create them manually in software such as Excel or PowerPoint, but there are also lots of online tools that will do it for you.
At the Data Visualization Festival this past October, we gave an introduction to pie charts with ggplot2, a powerful open source plotting package for statistical graphics.
In this article, you’ll learn some easy ways to make your own pie charts using this tool. Once you’re familiar with the basics, you’ll be able to add decorations, colors, and fun text to express all kinds of ideas about the people you’re studying.
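A basic pie chart can be sketched as follows. ggplot2 has no dedicated pie geom, so a common approach is to draw a single stacked bar and bend it into a circle with coord_polar(). The respondent counts below are made up for illustration.

```r
library(ggplot2)

# Hypothetical counts of respondents per interest group (made up).
interest_counts <- data.frame(
  interest    = c("sports only", "nature only", "both", "neither"),
  respondents = c(3, 2, 3, 2)
)
interest_counts$share <- interest_counts$respondents / sum(interest_counts$respondents)

# A single stacked bar, wrapped around the y-axis to form a pie.
p_pie <- ggplot(interest_counts, aes(x = "", y = respondents, fill = interest)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  theme_void() +
  labs(fill = "Interest")

print(p_pie)
```

theme_void() removes the axes and grid lines, which are not meaningful on a pie chart. The share column holds each group's fraction of the total, which is handy if you want to add percentage labels.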