Small Deviations

Broadcasting my objective subjectivity

This post is about using scipy.optimize.curve_fit to solve a non-linear least-squares problem. The documentation for the function can be found in the SciPy reference.

Exact model

Suppose one has a function of arbitrary complexity whose expression is known in advance. How easy is it to recover the correct parameters from samples of that function?

After playing with different kinds of four-parameter functions, I stumbled upon the following family:

$$ f_\theta (x) = \dfrac{6\cdot\theta_0^2 + 11\cdot\theta_0\theta_1 \cdot x}{5 + 5(x - \theta_2 )^{2}(x - \theta_2 )^{2 \theta_3}} \sin(x - \theta_1)$$

Three functions from this family, after $y$-axis normalization, are plotted below on a uniform grid:

[Figure: normalized univariate functions]

The first function's parameters are estimated with curve_fit on $n=100$ points randomly sampled from the original data with numpy seed 12321. The parameter initialisation is $[0.5, 0.5, 0.5, 0.5]$, which results in the following fit:

| Parameter | Estimate | Standard error |
|---|---|---|
| $\theta_0$ | -1.9167 | 0.0199 |
| $\theta_1$ | -0.0981 | 0.0105 |
| $\theta_2$ | -2.6755 | 0.0087 |
| $\theta_3$ | 1.0832 | 0.0596 |

All results are truncated to the fourth decimal place. Let's visualize the result:

[Figure: first fit result]
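For reference, the fitting procedure described above can be sketched as follows. This is only a minimal sketch: the post does not state the grid or the true parameters, so `x_grid` and `theta_true` below are hypothetical choices, and the printed numbers will not match the table above. The sketch also assumes that the power term is evaluated on the squared (non-negative) quantity, so that a negative base is never raised to a fractional power.

```python
import numpy as np
from scipy.optimize import curve_fit


def f(x, t0, t1, t2, t3):
    """Model family from the post, with theta = (t0, t1, t2, t3)."""
    numerator = 6 * t0**2 + 11 * t0 * t1 * x
    # Assumption: (x - t2)^2 (x - t2)^(2 t3) is computed as ((x - t2)^2)^(1 + t3),
    # which avoids fractional powers of negative numbers.
    denominator = 5 + 5 * ((x - t2) ** 2) ** (1 + t3)
    return numerator / denominator * np.sin(x - t1)


# Hypothetical setup: grid and true parameters are not given in the post.
theta_true = (0.7, 0.3, 0.4, 0.6)
x_grid = np.linspace(-5, 5, 1000)
y_grid = f(x_grid, *theta_true)

# Randomly sample n = 100 points from the gridded data, using the post's seed.
np.random.seed(12321)
idx = np.random.choice(x_grid.size, size=100, replace=False)
x_train, y_train = x_grid[idx], y_grid[idx]

# Fit, starting from the initialisation used in the post.
p0 = [0.5, 0.5, 0.5, 0.5]
popt, pcov = curve_fit(f, x_train, y_train, p0=p0, maxfev=10000)
perr = np.sqrt(np.diag(pcov))  # one standard error per parameter

for name, est, err in zip(["theta_0", "theta_1", "theta_2", "theta_3"], popt, perr):
    print(f"{name}: {est:.4f} +/- {err:.4f}")
```

The standard errors are taken as the square roots of the diagonal of the covariance matrix returned by curve_fit. Changing the argument of `np.random.seed` or the sample size is enough to explore how sensitive the fit is to the sampled points.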

The fit looks perfect; the estimated parameters, however, are not. This is not an issue for our purposes. The fit quality, however, depends heavily on where the sampled points lie. By changing the seed and fitting a new curve, one obtains the less pleasant fit illustrated below:

[Figure: bad fit]

This is only a sample-size issue. By setting $n = 200$ and running the process again, we obtain the following fit:

[Figure: first function, higher sample size]

Our second function is much more robust to the training set variations:

[Figure: second function, different seeds]

The third function's fits for different seeds can be seen below:

[Figure: third function fits]

When manipulating data in R or Python, one often uses Python's pandas library or R's core data.frame methods. This post recalls how to perform basic operations in each language.

The R data.frame

Shape

Suppose you have a dataframe named df. To get its shape, simply use dim(df), which returns a two-element integer vector. If

df_dim <- dim(df)

then the first element df_dim[1] is the number of rows, and the second df_dim[2] is the number of columns.

To obtain these two quantities separately, you can use nrow(df) and ncol(df). The call length(df) also returns the number of columns, i.e. the same value as ncol(df).

Column selection

Suppose our dataframe is the well-known 2013 NYC flights dataset. Before selecting columns of the dataframe df, we first want to list all existing columns and their order:

colnames(df)

The corresponding output is

 [1] "year" "month" "day" "dep_time" "sched_dep_time" "dep_delay"     
 [7] "arr_time" "sched_arr_time" "arr_delay" "carrier" "flight" "tailnum"       
[13] "origin" "dest" "air_time" "distance" "hour" "minute"        
[19] "time_hour"

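To get column data using column names, one can use the select function. Note that select is not part of base R: it comes from the dplyr package, which must be loaded first with library(dplyr). We can select the "month" column using the syntax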

select(df, month)

To obtain the year, month, and day together we can use select and add those 3 columns to the arguments, as follows:

select(df, year, month, day)

We can also view a column slice by using the syntax start_name:stop_name. Both start_name and stop_name should exist in the dataframe, and the result of the corresponding select call is a dataframe with all columns from start_name to stop_name, both included. As an example, the slice dep_time:arr_delay will select columns

dep_time, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr_delay

Indexing

To retrieve a subset of rows and columns, one can use the bracket syntax: df[i,j] retrieves the value stored in the $i$-th row and $j$-th column. One can also retrieve an entire row (all attributes of row $i$) or an entire column (all values for attribute number $j$). The syntax df[i,] returns the $i$-th row, and df[,j] returns the $j$-th column.

Remark: Using the bracket notation with only one index (for example df[i]) will return a column (i.e. the $i$-th column, as a one-column dataframe).

Slicing

When you have two indexes $i < j$ and wish to obtain all rows with index between $i$ and $j$, use the slice syntax $i\colon j$. The syntax df[i:j,] will return the corresponding subset of the dataframe's rows. You can slice columns as well, so that df[i:j,k:p] is a $(j - i + 1)\times (p - k + 1)$ rectangular dataframe for which

  • The columns have an index between $k$ and $p$
  • The rows have an index between $i$ and $j$.

Question: What happens when one writes df[i:j]? Does this return a subset of the rows? A subset of the columns?

Filtering

To be added.

The pandas dataframe

In python, the equivalent to R's data.frame is pandas' DataFrame object. Whereas the dataframe is a core R data structure, one typically needs to install pandas using

$ pip3 install pandas

and to import the library in your python script as

import pandas as pd

Shape

To obtain the shape of a dataframe df, simply use df.shape. This returns a 2-tuple (nrows, ncols). As arrays and tuples in Python use 0-based indexing, you can retrieve the number of rows using df.shape[0] and the number of columns using df.shape[1]. If you want to know the number of cells in the dataframe, use df.size, which returns the product nrows*ncols. To get the number of columns, one can also use the len(...) Python built-in on the columns attribute: this attribute is a “list” of all column names (to be more specific, it's an Index object).

As an example, the randomly generated dataframe

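import numpy as np
import pandas as pd

data = np.random.randn(100, 23)  # 100 rows, 23 columns of standard normal draws
colnames = list("ABCDEFGHIJKLMNOPQRSTUVW")  # 23 single-letter column names
df = pd.DataFrame(data, columns=colnames)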

has 23 columns. Evaluating len(df.columns) gives 23 as expected.
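The shape-related attributes above can be checked together in a short, self-contained sketch. The dataframe is the same randomly generated one; the variable names `n_rows` and `n_cols` are only illustrative.

```python
import numpy as np
import pandas as pd

data = np.random.randn(100, 23)
df = pd.DataFrame(data, columns=list("ABCDEFGHIJKLMNOPQRSTUVW"))

n_rows, n_cols = df.shape   # unpack the (nrows, ncols) tuple
print(df.shape)             # (100, 23)
print(df.size)              # 2300, i.e. 100 * 23 cells
print(len(df.columns))      # 23 columns
print(len(df))              # 100 rows
```

Note that len(df) counts rows in pandas, whereas length(df) counts columns in R.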

Column selection

To select a single column named "A", one can use either the bracket syntax df["A"] or the attribute syntax df.A. The latter feels better to me, but it doesn't work when the column name contains dots, whitespace or any other “non-pythonic” character, or when it clashes with an existing DataFrame attribute or method name. To select multiple columns, pass a list of column names, as in df[["A", "C", "F"]].
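As a small illustration of these selection syntaxes, here is a sketch on a hypothetical four-column dataframe; the column "dep time" is made up to show a name that the attribute syntax cannot reach.

```python
import pandas as pd

# Hypothetical toy dataframe; "dep time" contains a space on purpose.
df = pd.DataFrame({
    "A": [1, 2, 3],
    "C": [4, 5, 6],
    "F": [7, 8, 9],
    "dep time": [10, 11, 12],
})

col_bracket = df["A"]          # a single column, returned as a Series
col_attr = df.A                # the same Series through attribute access
subset = df[["A", "C", "F"]]   # a list of names returns a DataFrame
spaced = df["dep time"]        # only the bracket syntax works for this name
```

Selecting with a list always returns a DataFrame, even when the list contains a single name, whereas a bare name returns a Series.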

Indexing

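Selecting a specific row or a subset of rows by position can be done via the iloc indexer. The syntax df.iloc[i] selects the row at position $i$ (counting from 0). Columns can be selected by position as well: df.iloc[:, j] returns the column at position $j$, and df.iloc[i, j] returns a single value.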
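A short sketch of positional selection with iloc, on a hypothetical 4-by-3 dataframe filled with the integers 0 to 11 so that the results are easy to read. The two-argument form iloc[i, j] is standard pandas and takes a row position followed by a column position.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe: row k holds the values 3k, 3k+1, 3k+2.
df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["A", "B", "C"])

row = df.iloc[1]       # second row, as a Series indexed by column name
col = df.iloc[:, 2]    # third column, selected by position
cell = df.iloc[1, 2]   # value in the second row and third column

print(row.tolist())    # [3, 4, 5]
print(col.tolist())    # [2, 5, 8, 11]
print(cell)            # 5
```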

Slicing

To select the slice $i\colon j$, simply use the syntax df.iloc[i:j]. The usual Python indexing and slicing rules apply: the start position is included and the stop position is excluded (unlike in R, where i:j includes both ends). You can thus select the first $i$ rows using df.iloc[:i], the last $j$ rows using df.iloc[-j:], and create a slice with arbitrary step length using df.iloc[i:j:step].
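The slicing variants can be illustrated on a hypothetical 10-row dataframe whose default index (0 to 9) makes it easy to see which rows are returned.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe with 10 rows and the default integer index.
df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=["A", "B"])

middle = df.iloc[2:5]         # rows 2, 3, 4: the stop position is excluded
first_three = df.iloc[:3]     # rows 0, 1, 2
last_two = df.iloc[-2:]       # rows 8, 9
every_other = df.iloc[1:8:2]  # rows 1, 3, 5, 7

print(list(middle.index))       # [2, 3, 4]
print(list(every_other.index))  # [1, 3, 5, 7]
```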

Filtering

Filtering rows based on attributes is simple in pandas. To obtain the rows for which a condition holds, simply write df[condition], where condition is a boolean Series with one True/False value per row, typically built by comparing a column to a value. In our previous example, we can filter on the values of column "C" using df[df.C > 3].
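Here is a minimal sketch of boolean filtering on a hypothetical four-row dataframe, chosen so that some values of "C" actually exceed 3. The last line shows one way to combine two conditions; each one needs its own parentheses because & binds more tightly than the comparisons.

```python
import pandas as pd

# Hypothetical toy dataframe.
df = pd.DataFrame({"A": [1, 2, 3, 4], "C": [0.5, 3.2, 4.1, 2.9]})

mask = df.C > 3        # boolean Series, one True/False value per row
filtered = df[mask]    # keeps the rows where the mask is True

# Both conditions must hold for a row to be kept.
combined = df[(df.C > 3) & (df.A < 3)]

print(mask.tolist())         # [False, True, True, False]
print(list(filtered.index))  # [1, 2]
print(list(combined.index))  # [1]
```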
