Set values in DataFrame with Boolean index in pandas

If you try to set values in a pandas DataFrame using a Boolean vector, you will likely run into the warning described below.

Have you ever come across this warning when setting some values in a pandas DataFrame?


/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:1:

SettingWithCopyWarning: A value is trying to be set on a copy of
a slice from a DataFrame

See the caveats in the documentation:

http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

"""Entry point for launching an IPython kernel.

Did the warning come from code like the snippets below, which set values with an index and a column name?


df.some_col_name[index] = "A value to set"

df.loc[index, "some_col_name"] = "A value to set"

The way to avoid the warning is to use .where. Note that .where returns a new Series rather than modifying the column in place, so assign the result back:


df["some_col_name"] = df.some_col_name.where(~index, other="A value to set")

If you use the .where syntax and get an error like


bad operand type for unary ~: 'float'

It is because the index is not Boolean; you need to convert the pandas Series to Boolean values using the code below.


index = index.astype("bool")

df["some_col_name"] = df.some_col_name.where(~index, other="A value to set")
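To make the fix concrete, here is a minimal end-to-end sketch with made-up data (the column name and values are just for illustration):

```python
import pandas as pd

# a toy DataFrame and a 0/1 float mask, as in the error above
df = pd.DataFrame({"some_col_name": ["x", "y", "z"]})
index = pd.Series([1.0, 0.0, 1.0])

# convert the float mask to a real Boolean mask
index = index.astype("bool")

# .where keeps values where the condition is True, so negate the mask;
# .where does not modify the column in place, so assign the result back
df["some_col_name"] = df["some_col_name"].where(~index, other="A value to set")

print(df["some_col_name"].tolist())  # ['A value to set', 'y', 'A value to set']
```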

This is really annoying and counter-intuitive if you are coming from R or Matlab (I suppose?).

Gonna add more pandas fixes to the blog as I learn along the way.

If you want to know more about SettingWithCopyWarning in pandas, check out this detailed blog by Dataquest: https://www.dataquest.io/blog/settingwithcopywarning/


Quick summary: seaborn vs ggplot2

Visualization: The start and the end

Visualization is the initial and a crucial part of any data analysis. It could be the exploratory data analysis at the beginning of predictive modeling or the end product of a monthly report.

There are several data visualization packages in Python and R. In R, we have the excellent ggplot2, which is based on the grammar of graphics. In Python, we have the amazing seaborn and matplotlib packages. These packages approach plotting a little differently, but they all aim at plotting the same things. Once you know the graph you want, it becomes easier to master each package and to switch between them.

Of course, there are fancier plotting solutions available, like d3 or plotly.

Another way to look at visualization:

If we compare data visualization to a Taylor expansion, one-variable visualizations are like the first-order terms, two-variable visualizations are like the second-order terms, and three-or-more-variable visualizations are like the higher-order terms.

Visualizing one variable

discrete data

We should use a barplot to count the number of instances in each category.

ggplot2: geom_bar

seaborn: sns.countplot

continuous data

We should use a histogram to accomplish this.

ggplot2: geom_histogram

seaborn: sns.distplot

Visualizing two variables

Two discrete data columns

Use a bar plot of counts, but label the second discrete column with colors (I will talk about facets under visualizing three or more variables).

ggplot2: geom_bar(aes(fill="name_of_another_data_column"))

seaborn: sns.countplot(hue='name_of_another_data_column')

One discrete, one continuous data columns

Use boxplot for this.

ggplot2: geom_boxplot

seaborn: sns.boxplot

There are also swarmplot, stripplot, and violinplot for this type of job.

Two continuous data columns

We use a scatter plot for this.

ggplot2: geom_point

seaborn: sns.regplot, sns.jointplot(kind='scatter')

Visualizing three or more variables

Things get complicated here. There are generally two approaches. The first is to label things with different colors, sizes, or shapes in one graph. The second is to plot several graphs, each visualizing some of the variables while holding one variable constant.

Two discrete and one continuous data columns

Visualize the discrete and continuous pair with a boxplot, then use color or a facet to encode the second discrete variable.

Two continuous and one discrete data columns

Visualize the two continuous variables with a scatterplot, then use color or a facet to encode the discrete variable.

In ggplot2:


sp <- ggplot(data=tips, aes(x=total_bill, y=tip)) + geom_point()

# Divide by levels of "sex", in the vertical direction
sp + facet_grid(sex ~ .)

In seaborn you could choose factorplot or FacetGrid.


import matplotlib.pyplot as plt
import seaborn as sns

# load the same tips dataset used in the ggplot2 example above
tips = sns.load_dataset('tips')
g = sns.FacetGrid(data=tips, row='sex')
g.map(sns.regplot, 'total_bill', 'tip')


Three continuous data columns

This needs a 3D scatterplot, which is not part of ggplot2 or core seaborn. In Python you can use matplotlib's mplot3d toolkit or a separate package like plotly. See this documentation for python.
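For Python, one option that ships with matplotlib itself is the mplot3d toolkit; here is a minimal sketch with random placeholder data:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.random(30), rng.random(30)
z = x + y  # a third continuous variable

fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # needs matplotlib >= 3.2 for this call
ax.scatter(x, y, z)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
```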

This presentation is a good example of how to do more than 2 variables in R using ggplot2.

For advanced features like FacetGrid and factorplot in seaborn, see this blog for more examples.

This is a rather short summary and comparison of seaborn and ggplot2, and a discussion of how I view the data visualization process. I will add more examples in R/Python in the future.

Docker for data analysis: a hands-on tutorial (jupyter and rstudio-server)

Dependency, dependency, and dependency! Dependency is an evil being that used to haunt software development, and now it has come to data analysis. Out of the 14 million possibilities, there is one way to defeat it. The answer is Docker!

Based on whether you use rstudio-server or the jupyter notebook, the following blog splits into two parts: one about how to run a Docker container for jupyter, and one about how to run a container for rstudio-server, which is somewhat similar to jupyter.

Before we proceed, it is recommended that you create an account on DockerHub, a hosting service for Docker images. See the link here.

Jupyter notebook

After registering an account on DockerHub, you can search for possible Docker images there. After a little searching and trial, I found this image. It has R, Python3, Julia, and most of the common data science packages pre-installed. See more details on its DockerHub page.

How do you use it? Simply type the command below into your terminal:


docker run -it --rm -p 8888:8888 jupyter/datascience-notebook

After downloading a roughly 6 GB Docker image, Docker will start the jupyter notebook server and print its startup output.


In the startup output, you will see a line like "to login with a token: …". Simply open that http address and you will see a jupyter notebook.


Then, you have the jupyter notebook with support for R, Python3, and Julia. Let us see if the most common packages are installed.


Let us test if most R packages are installed.


It is worth noting that the data.table package is not installed.

Let me try to install my automl package, as well as the dplyr and caret packages, from Github.


It seems there are some issues with installing from Github, but all the CRAN installs work perfectly.

Multiple terminals

In order to download some files or make some changes to the Docker environment, you can open another terminal with the command below, where <container_id> comes from docker ps:


docker exec -it <container_id> bash

After you are done, you can use ctrl+p followed by ctrl+q to detach from the container without interrupting it.

Rstudio-server

For those of you who do not know rstudio-server: it is like the jupyter notebook for R.

The image I found on the DockerHub is this.

To use it, you should first download it.


docker pull dceoy/rstudio-server

Then, you run the image with the command below:


docker container run --rm -p 8787:8787 -v ${PWD}:/home/rstudio -w /home/rstudio dceoy/rstudio-server

The rstudio-server will not open a browser by itself; you need to type the address below into your browser.

http://127.0.0.1:8787/

Then, you will need to enter the username and password, both of which are rstudio.

Let us see if the common packages are installed.


Then, let us try to install my automl package from Github.


It works perfectly. (I still have some dependency issues with some of the packages to resolve in my package, but the conflicts are printed here.)



Web scraping GPU data on Newegg for Black Friday sale

What is web scraping? It is a way to download information from a website programmatically. Instead of clicking through the website yourself, you can use a programming language like Python or R to scrape the information you want.

=========================================================================

Disclaimer: web scraping is not without legal consequences! You do this at your own risk!

If you are concerned about the legal aspects, you could look at this blog (https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/) or this discussion (https://news.ycombinator.com/item?id=4896590). Always consult a lawyer if you are worried about it.

Before you do any scraping, remember to check the website's robots.txt file; it is usually at http://www.your-website.com/robots.txt. For the Newegg website I am going to scrape today, the file is at https://www.newegg.com/robots.txt.
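If you want to check the robots.txt rules programmatically, Python's standard library has urllib.robotparser. The sketch below parses an example rule set directly; in practice you would point set_url at the site's real robots.txt and call read():

```python
from urllib.robotparser import RobotFileParser

# an example robots.txt, supplied line by line
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/products"))   # True
print(rp.can_fetch("*", "https://example.com/private/x"))  # False
```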

There is a great tutorial from Data Science Dojo if you are new to web scraping; my project is sort of built on top of it. (https://datasciencedojo.com/web-scraping-30-minutes/)

I am writing this here for educational purposes and nothing more.

===========================================================================

Ok, enough with the disclaimer. Yesterday was Black Friday and I was hoping to find a deal on a graphics card for my future deep learning projects. I already have a GTX 1060, but it is not enough for serious deep learning work. So I want a price summary for the GTX 1060, GTX 1070, GTX 1080, and their Ti versions.

Here is how I did it.

If you really want to use the code, I have put it on my Github (https://raw.githubusercontent.com/edwardcooper/mlmodel_select/master/gtx_scrape.py).

You could download this single file by

wget https://raw.githubusercontent.com/edwardcooper/mlmodel_select/master/gtx_scrape.py

What the script does is scrape the Newegg website for the brand (ASUS or EVGA), GPU type (GTX 1080 or GTX 1070 Ti), memory size, memory type (GDDR5 or GDDR5X), price, shipping price (free shipping or a shipping fee), and the URL for each product. It then loads everything into a pandas DataFrame and writes it to a CSV file named gtx1080_newegg.txt. The file name is a bit misleading, but you can change it to whatever you like.

Next, we want to know the best price for the GTX 1060, GTX 1070, GTX 1070 Ti, GTX 1080, and GTX 1080 Ti. This can easily be done with pandas.
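A sketch of that summary step; the column names ("gpu_type", "price") are my guesses at what the script writes out, so adjust them to match the actual CSV:

```python
import pandas as pd

# stand-in for pd.read_csv("gtx1080_newegg.txt")
df = pd.DataFrame({
    "gpu_type": ["GTX 1060", "GTX 1060", "GTX 1070", "GTX 1080"],
    "price": [249.99, 229.99, 399.99, 549.99],
})

# lowest listed price per GPU type
best = df.groupby("gpu_type")["price"].min()
print(best)
```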

Another useful tool for web scraping is this website, which pretty-prints HTML files for structure analysis. (https://www.cleancss.com/html-beautify/)

In the future, I hope to write it into a Flask web app.

That is all today.

Let me know what you think.


Fix error: module 'html5lib.treebuilders' has no attribute '_base'

I encountered this error when I tried to import bs4 in python3. The problem is not in bs4 but in its dependency, html5lib.

python2 (didn’t test myself)

There are some fixes recommended for python2:

pip install --upgrade beautifulsoup4
pip install --upgrade html5lib

The above solution comes from SO (https://stackoverflow.com/questions/38447738/beautifulsoup-html5lib-module-object-has-no-attribute-base).

python3 (tested myself)

Solution 1:
The above solution did not work for me at all since I am working with python3, and changing pip to pip3 does not solve the problem.

I did find that downgrading the html5lib library to a previous version solved the problem:

sudo pip3 install html5lib==0.9999999

This is suggested on GitHub (https://github.com/coursera-dl/coursera-dl/issues/554) and Launchpad (https://bugs.launchpad.net/beautifulsoup/+bug/1603299).

Solution 2:
If you are not happy with a temporary patch that downgrades to an older version, the obvious solution is to uninstall the bs4 and html5lib libraries and install them again, since this bug was fixed around Thanksgiving 2017:

sudo apt remove python3-bs4 python3-html5lib
sudo apt install python3-bs4 python3-html5lib
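A quick sanity check that the reinstall worked; importing bs4 is exactly the step that used to fail. (I use the always-available html.parser backend here, but you can swap in "html5lib" to exercise that library.)

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.p.text)  # hello
```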

See you all next time. Let me know in the comment if the above solution solved your problem.