Visualization: The start and the end
Visualization is often the first step of a data analysis, and an extremely crucial one. It can be the exploratory data analysis at the beginning of predictive modeling, or the end product of a monthly report.
There are several data visualization packages in Python and R. In R, we have the excellent ggplot2, which is based on the grammar of graphics. In Python, we have the amazing seaborn and matplotlib packages. All these packages approach the problem of plotting a little differently, but they all aim at plotting the same things. Once you know the graph you want to plot, it is easier to master these packages and switch between them.
Of course, there are fancier plotting solutions available, such as d3 or plotly.
Another way to look at visualization:
If we compare data visualization to a Taylor expansion, one-variable visualizations are like the first-order terms, two-variable visualizations are like the second-order terms, and visualizations of three or more variables are like the higher-order terms.
Visualizing one variable
For a discrete (categorical) variable, use a barplot to count the number of instances in each category.
For a continuous variable, use a histogram to show how the values are distributed.
Visualizing two variables
Two discrete data columns
Use a barplot of counts for one column and label the other column with colors. (I will talk about facets in visualizing three or more variables.)
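A rough sketch of this in seaborn, using countplot with the hue argument to color by the second column. The DataFrame is a made-up sample, and the crosstab is only there to show the numbers behind the bars:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample with two discrete columns.
tips = pd.DataFrame({
    "day": ["Sat", "Sat", "Sat", "Sun", "Sun", "Sun", "Sun", "Sat"],
    "sex": ["Male", "Female", "Male", "Female", "Male", "Male", "Male", "Male"],
})

# The counts the plot will show: rows are days, columns are sexes.
counts = pd.crosstab(tips["day"], tips["sex"])

fig, ax = plt.subplots()
# One group of bars per day, one colored bar per sex within each group.
sns.countplot(data=tips, x="day", hue="sex", ax=ax)
```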
One discrete and one continuous data column
Use a boxplot for this.
Seaborn also has swarmplot, stripplot, and violinplot for this type of job.
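Here is a small sketch comparing a boxplot and a violinplot of the same columns in seaborn. The data values are invented, and the groupby line is only there to show the medians that the boxes mark:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample: one discrete column and one continuous column.
tips = pd.DataFrame({
    "day": ["Sat", "Sat", "Sat", "Sun", "Sun", "Sun", "Sun"],
    "total_bill": [10.0, 20.0, 30.0, 15.0, 25.0, 35.0, 45.0],
})

# The per-category medians that the boxplot draws as the middle line.
medians = tips.groupby("day")["total_bill"].median()

fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(8, 3), sharey=True)

# Boxplot: quartiles and whiskers of total_bill for each day.
sns.boxplot(data=tips, x="day", y="total_bill", ax=ax_box)

# Violinplot: a smoothed density of total_bill for each day.
sns.violinplot(data=tips, x="day", y="total_bill", ax=ax_violin)
```

Swapping in sns.stripplot or sns.swarmplot with the same x and y arguments draws the individual points instead of a summary.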
Two continuous data columns
Use a scatter plot for this.
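A minimal sketch with seaborn's scatterplot (available since seaborn 0.9), again on an invented sample:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample with two continuous columns.
tips = pd.DataFrame({
    "total_bill": [10.5, 22.0, 18.3, 30.1, 25.4, 12.7],
    "tip": [1.5, 3.0, 2.5, 5.0, 4.0, 2.0],
})

fig, ax = plt.subplots()
# One point per row: total_bill on the x axis, tip on the y axis.
sns.scatterplot(data=tips, x="total_bill", y="tip", ax=ax)
```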
Visualizing three or more variables
Things get complicated here. There are generally two ways to accomplish this. The first method is to label things with different colors, sizes, or shapes in one graph. The second method is to plot several graphs, each one visualizing some of the variables while keeping one variable constant.
Two discrete and one continuous data columns
Visualize one discrete and one continuous variable with a boxplot, then use color or facets to show the other discrete variable.
Two continuous and one discrete data columns
Visualize the two continuous variables with a scatterplot, then use color or facets to show the discrete variable.
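Both options can be sketched in seaborn as follows. This assumes seaborn 0.9 or later for catplot, and the sample values are made up:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample: two discrete columns and one continuous column.
tips = pd.DataFrame({
    "day": ["Sat", "Sat", "Sat", "Sat", "Sun", "Sun", "Sun", "Sun"],
    "sex": ["Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"],
    "total_bill": [10.0, 14.0, 30.0, 18.0, 15.0, 22.0, 45.0, 28.0],
})

# Option 1: color. One boxplot, with a separately colored box per sex.
fig, ax = plt.subplots()
sns.boxplot(data=tips, x="day", y="total_bill", hue="sex", ax=ax)

# Option 2: facets. One boxplot panel per sex, side by side.
g = sns.catplot(data=tips, x="day", y="total_bill", col="sex", kind="box")
```

Color keeps everything in one graph, which makes direct comparison easy; facets give each level its own panel, which stays readable when there are many levels.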
    library(ggplot2)
    data(tips, package = "reshape2")  # the tips dataset ships with reshape2
    sp <- ggplot(data = tips, aes(x = total_bill, y = tip)) + geom_point()
    # Divide by levels of "sex", in the vertical direction
    sp + facet_grid(sex ~ .)
In seaborn, you could choose factorplot (renamed catplot in seaborn 0.9) or FacetGrid.
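    import matplotlib.pyplot as plt
    import seaborn as sns
    tips = sns.load_dataset("tips")  # seaborn's example dataset (downloaded on first use)
    g = sns.FacetGrid(data=tips, row="sex")  # one row of plots per level of "sex"
    g.map(sns.regplot, "total_bill", "tip")  # scatter plus fitted regression line in each facet
    plt.show()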
Three continuous data columns
This needs a 3D scatterplot. Neither ggplot2 nor seaborn implements one, so it needs an extra package; in Python, the mplot3d toolkit that ships with matplotlib can draw it. See this documentation for Python.
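As a minimal sketch in Python, the mplot3d toolkit bundled with matplotlib can draw a 3D scatterplot. This assumes matplotlib 3.2 or later, where projection="3d" works without an explicit Axes3D import, and uses random numbers as stand-in data:

```python
import matplotlib.pyplot as plt
import numpy as np

# Stand-in data: three continuous variables with 50 random points each.
rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 50))

fig = plt.figure()
# Asking for the "3d" projection returns a 3D axes from mplot3d.
ax = fig.add_subplot(projection="3d")

# One point per observation, positioned by all three variables.
ax.scatter(x, y, z)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
```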
This presentation is a good example of how to visualize more than two variables in R using ggplot2.
For advanced features like FacetGrid and factorplot in seaborn, see this blog for more examples.
This is a rather short summary and comparison of seaborn and ggplot2, and a discussion of how I view the data visualization process. I will add more examples in R/Python in the future.