When manipulating data in R or python, one often use python's pandas
library, or R's core data.frame
methods. This post recalls how to perform basic operations in each language.
The R data.frame
Shape
Suppose you have a dataframe named df
. In order to get its shape you can simply use dim(df)
which returns a two-element array. If
df_dim <- dim(df)
then the first element df_dim[1]
is the number of rows, and the second df_dim[2]
is the number of columns.
To obtain these two quantities separately you can use nrow(df)
and ncol(df)
. The length(df)
method call returns the number of columns ncol(df)
.
Column selection
Suppose our dataframe is the well-known 2013 NYC flights dataset.
Before selecting columns of the dataframe df
, we first want to list all existing columns and their order:
colnames(df)
The corresponding output is
[1] "year" "month" "day" "dep_time" "sched_dep_time" "dep_delay"
[7] "arr_time" "sched_arr_time" "arr_delay" "carrier" "flight" "tailnum"
[13] "origin" "dest" "air_time" "distance" "hour" "minute"
[19] "time_hour"
In order to get column data using column names, one should use the select
method. We can select the "month"
column using the syntax
select(df, month)
To obtain the year, month, and day together we can use select
and add those 3 columns to the arguments, as follows:
select(df, year, month, day)
We can also view a column slice, by using the syntax start_name:stop_name
. Both start_name
and stop_name
should exist in the dataframe and the result of the corresponding select
call will be a dataframe with the columns between the two. As an example, the slice dep_time:arr_delay
will select columns
dep_time, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr_delay
Indexing
To retrieve a subset of rows and columns, one can use the bracket syntax: df[i,j]
retrieves the value stored in $i$-th row and $j$-th column. One can also retrieve a entire row (all attributes of row $i$) or an entire column (all values for attribute number $j$). The syntax df[i,]
returns the $i$-th row, and df[,j]
returns the $j$-th column.
Remark: Using the bracket notation with only one index (for example df[i]
) will return a column (e.g. the $i$-th column).
Slicing
When you have two indexes $i < j$ and wish to obtain all rows with index between $i$ and $j$, use the slice syntax $i\colon j$. The syntax df[i:j,]
will return the corresponding subset of the dataframe's rows. You can slice columns as well, so that df[i:j,k:p]
is a $(j – i + 1)\times (p-k+1)$ rectangular dataframe for which
- The columns have an index between $k$ and $p$
- The rows have an index between $i$ and $j$.
Question: What happens when one writes df[i:j]
? Does this return a subset of the rows ? A subset of the columns ?
Filtering
To be added.
The pandas dataframe
In python, the equivalent to R's data.frame
is pandas' DataFrame
object. Whereas the dataframe is a core R data structure, one typically needs to install pandas
using
$ pip3 install pandas
and to import the library in your python script as
import pandas as pd
Shape
To obtain the shape of dataframe df
, simply use df.shape
. This returns a 2-tuple (nrows, ncols)
. As arrays and tuples in python use 0-based indexing, you may retrieve the number of rows using df.shape[0]
and number of columns using df.shape[1]
. If you want to know the number of cells in the dataframe, use df.size
, which returns the product nrows*ncols
. To get the number of columns, one can use the len(...)
python built-in on the columns
attributes: the attribute is a “list” of all column names (to be more specific, it's an index).
As an example, the randomly generated dataframe
data = np.random.randn(100,23)
colnames = list("ABCDEFGHIJKLMNOPQRSTUVW")
df = pd.DataFrame(data, columns=colnames )
has 23 columns. Evaluating len(df.columns)
gives 23
as expected.
Column selection
To select a single column with name "A"
, one can use either the bracket syntax df["A"]
or the attribute syntax df.A
. The latter feels better to me but doesn't work whenever the column name contains dots, whitespaces or any “non-pythonic” character.
To select multiple columns, you should input a list of column name as such: df[["A", "C", "F"]]
.
Indexing
Selecting a specific row or a subset of rows can be done via the iloc
function. The syntax df.iloc[i]
will select the $i$-th row. You cannot select a column based on its index in pandas.
Slicing
To select the slice $i\colon j$, you simply use the syntax df.iloc[i:j]
. The usual python indexing and slicing, so that you can slice the first $i$ elements by using df.iloc[:i]
, the last $j$ elements using df.iloc[-j:]
and create a slice with arbitrary step length using df.iloc[i:j:step]
.
Filtering
Filtering rows based on attributes simple in pandas. To obtain the rows for which condition
holds, you simply write df[condition]
. This condition should be a valid statement about the row attributes.
In our previous example, we can filter based on the values column "C"
using df[df.C > 3]
.