GitHub is a web-based hosting service for software development projects that use the Git revision control system. It is mostly used for computer code.

GitHub offers both paid plans and free accounts. GitHub’s stated mission is to help developers share code, solve problems, and build software together.

GitHub provides access control and several collaboration features, such as bug tracking, feature requests and task management. It also acts as an OAuth 2.0 provider, which lets users log into other websites using their GitHub credentials.

GitHub’s free plan allows unlimited public and private repositories with unlimited collaborators. CI/CD usage through GitHub Actions is free for public repositories and subject to a monthly allowance for private ones; the exact limits change over time.

## OLS GitHub

Linear regression is a standard tool for analyzing the relationship between two or more variables.

In this lecture, we’ll use the Python package statsmodels to estimate, interpret, and visualize linear regression models.

Along the way, we’ll discuss a variety of topics, including

- simple and multivariate linear regression
- visualization
- endogeneity and omitted variable bias
- two-stage least squares

As an example, we will replicate results from Acemoglu, Johnson and Robinson’s seminal paper [AJR01].

You can download a copy here.

In the paper, the authors emphasize the importance of institutions in economic development.

The main contribution is the use of settler mortality rates as a source of exogenous variation in institutional differences.

Such variation is needed to determine whether it is institutions that give rise to greater economic growth, rather than the other way around.

Let’s start with some imports:

    %matplotlib inline
    import matplotlib.pyplot as plt
    plt.rcParams["figure.figsize"] = (11, 5)  # set default figure size
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.iolib.summary2 import summary_col
    from linearmodels.iv import IV2SLS

### Prerequisites

This lecture assumes you are familiar with basic econometrics.

For an introductory text covering these topics, see, for example, [Woo15].

### Simple Linear Regression

[AJR01] wish to determine whether or not differences in institutions can help to explain observed economic outcomes.

How do we measure institutional differences and economic outcomes?

In this paper,

- economic outcomes are proxied by log GDP per capita in 1995, adjusted for exchange rates.
- institutional differences are proxied by an index of protection against expropriation on average over 1985-95, constructed by the Political Risk Services Group.

These variables and other data used in the paper are available for download on Daron Acemoglu’s webpage.

We will use pandas’ .read_stata() function to read in data contained in the .dta files to dataframes

    df1 = pd.read_stata('https://github.com/QuantEcon/lecture-python/blob/master/source/_static/lecture_specific/ols/maketable1.dta?raw=true')
    df1.head()

Let’s use a scatterplot to see whether any obvious relationship exists between GDP per capita and the protection against expropriation index

    plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before matplotlib 3.6
    df1.plot(x='avexpr', y='logpgp95', kind='scatter')
    plt.show()

The plot shows a fairly strong positive relationship between protection against expropriation and log GDP per capita.

Specifically, if higher protection against expropriation is a measure of institutional quality, then better institutions appear to be positively correlated with better economic outcomes (higher GDP per capita).

Given the plot, choosing a linear model to describe this relationship seems like a reasonable assumption.

We can write our model as

$$
\text{logpgp95}_i = \beta_0 + \beta_1 \text{avexpr}_i + u_i
$$

where:

- $\beta_0$ is the intercept of the linear trend line on the y-axis
- $\beta_1$ is the slope of the linear trend line, representing the marginal effect of protection against risk on log GDP per capita
- $u_i$ is a random error term (deviations of observations from the linear trend due to factors not included in the model)

Visually, this linear model involves choosing a straight line that best fits the data, as in the following plot (Figure 2 in [AJR01])

    # Dropping NA's is required to use numpy's polyfit
    df1_subset = df1.dropna(subset=['logpgp95', 'avexpr'])

    # Use only 'base sample' for plotting purposes
    df1_subset = df1_subset[df1_subset['baseco'] == 1]

    X = df1_subset['avexpr']
    y = df1_subset['logpgp95']
    labels = df1_subset['shortnam']

    # Replace markers with country labels
    fig, ax = plt.subplots()
    ax.scatter(X, y, marker='')

    for i, label in enumerate(labels):
        ax.annotate(label, (X.iloc[i], y.iloc[i]))

    # Fit a linear trend line
    ax.plot(np.unique(X),
            np.poly1d(np.polyfit(X, y, 1))(np.unique(X)),
            color='black')

    ax.set_xlim([3.3, 10.5])
    ax.set_ylim([4, 10.5])
    ax.set_xlabel('Average Expropriation Risk 1985-95')
    ax.set_ylabel('Log GDP per capita, PPP, 1995')
    ax.set_title('Figure 2: OLS relationship between expropriation '
                 'risk and income')
    plt.show()

The most common technique to estimate the parameters (β’s) of the linear model is Ordinary Least Squares (OLS).

As the name implies, an OLS model is solved by finding the parameters that minimize the sum of squared residuals, i.e.

$$
\min_{\hat{\beta}} \sum_{i=1}^{N} \hat{u}_i^2
$$

where $\hat{u}_i$ is the difference between the observation and the predicted value of the dependent variable.

To estimate the constant term $\beta_0$, we need to add a column of 1’s to our dataset (consider the equation if $\beta_0$ was replaced with $\beta_0 x_i$ and $x_i = 1$)
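As a rough illustration of the minimization and of the column of 1’s, here is a minimal numpy sketch. It uses synthetic data (an assumption for illustration, not the [AJR01] dataset), and the helper name `ssr` is hypothetical. It checks that moving the least-squares estimate in either direction increases the sum of squared residuals.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the AJR dataset)
rng = np.random.default_rng(0)
x = rng.uniform(3, 10, size=200)
y = 4.6 + 0.5 * x + rng.normal(scale=0.7, size=200)

# Column of ones, so that beta_0 multiplies x_i = 1 for every observation
X = np.column_stack([np.ones_like(x), x])


def ssr(beta, X, y):
    """Sum of squared residuals for a candidate parameter vector."""
    resid = y - X @ beta
    return resid @ resid


# Least-squares solution: the beta that minimizes the sum of squared residuals
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Perturbing either coefficient gives a larger sum of squared residuals
ssr_min = ssr(beta_hat, X, y)
ssr_shift_intercept = ssr(beta_hat + np.array([0.1, 0.0]), X, y)
ssr_shift_slope = ssr(beta_hat + np.array([0.0, 0.1]), X, y)

print(beta_hat)
print(ssr_min, ssr_shift_intercept, ssr_shift_slope)
```

The first entry of `beta_hat` is the intercept because the column of ones comes first in `X`.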

    df1['const'] = 1

Now we can construct our model in statsmodels using the OLS function.

We will use pandas dataframes with statsmodels; however, standard arrays can also be used as arguments

    reg1 = sm.OLS(endog=df1['logpgp95'], exog=df1[['const', 'avexpr']],
                  missing='drop')
    type(reg1)

So far we have simply constructed our model.

We need to use .fit() to obtain parameter estimates $\hat{\beta}_0$ and $\hat{\beta}_1$

    results = reg1.fit()
    type(results)

We now have the fitted regression model stored in results.

To view the OLS regression results, we can call the .summary() method.

Note that an observation was mistakenly dropped from the results in the original paper (see the note located in maketable2.do from Acemoglu’s webpage), and thus the coefficients differ slightly.

    print(results.summary())

From our results, we see that

- The intercept $\hat{\beta}_0 = 4.63$.
- The slope $\hat{\beta}_1 = 0.53$.
- The positive $\hat{\beta}_1$ parameter estimate implies that institutional quality has a positive effect on economic outcomes, as we saw in the figure.
- The p-value of 0.000 for $\hat{\beta}_1$ implies that the effect of institutions on GDP is statistically significant (using p < 0.05 as a rejection rule).
- The R-squared value of 0.611 indicates that around 61% of variation in log GDP per capita is explained by protection against expropriation.
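To make the R-squared interpretation concrete, the following sketch computes it by hand as one minus the ratio of residual to total variation. The data are synthetic (an assumption for illustration), so the number it prints is unrelated to the 0.611 reported above.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the AJR dataset)
rng = np.random.default_rng(1)
x = rng.uniform(3, 10, size=150)
y = 4.6 + 0.5 * x + rng.normal(scale=0.7, size=150)

# OLS with a constant term
X = np.column_stack([np.ones_like(x), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

fitted = X @ beta_hat
ss_resid = np.sum((y - fitted) ** 2)      # variation left unexplained
ss_total = np.sum((y - y.mean()) ** 2)    # total variation around the mean
r_squared = 1 - ss_resid / ss_total

print(f"R-squared = {r_squared:.3f}")
```

With one regressor and a constant, this value equals the squared correlation between `x` and `y`.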

Using our parameter estimates, we can now write our estimated relationship as

$$
\widehat{\text{logpgp95}}_i = 4.63 + 0.53 \, \text{avexpr}_i
$$

This equation describes the line that best fits our data, as shown in Figure 2.

We can use this equation to predict the level of log GDP per capita for a value of the index of expropriation protection.

For example, for a country with an index value of 7.07 (the average for the dataset), we find that their predicted level of log GDP per capita in 1995 is 8.38.

    mean_expr = np.mean(df1_subset['avexpr'])
    mean_expr

    predicted_logpgp95 = 4.63 + 0.53 * 7.07
    predicted_logpgp95

An easier (and more accurate) way to obtain this result is to use .predict() and set $\text{const} = 1$ and $\text{avexpr}_i = \text{mean\_expr}$

    results.predict(exog=[1, mean_expr])

We can obtain an array of predicted $\text{logpgp95}_i$ for every value of $\text{avexpr}_i$ in our dataset by calling .predict() on our results.

Plotting the predicted values against $\text{avexpr}_i$ shows that the predicted values lie along the line that we fitted above.

The observed values of $\text{logpgp95}_i$ are also plotted for comparison purposes

    # Drop missing observations from whole sample
    df1_plot = df1.dropna(subset=['logpgp95', 'avexpr'])

    # Plot predicted values
    fig, ax = plt.subplots()
    ax.scatter(df1_plot['avexpr'], results.predict(), alpha=0.5,
               label='predicted')

    # Plot observed values
    ax.scatter(df1_plot['avexpr'], df1_plot['logpgp95'], alpha=0.5,
               label='observed')

    ax.legend()
    ax.set_title('OLS predicted values')
    ax.set_xlabel('avexpr')
    ax.set_ylabel('logpgp95')
    plt.show()

### Extending the Linear Regression Model

So far we have only accounted for institutions affecting economic performance – almost certainly there are numerous other factors affecting GDP that are not included in our model.

Leaving out variables that affect $\text{logpgp95}_i$ and are correlated with the included regressors will result in omitted variable bias, yielding biased and inconsistent parameter estimates.

We can extend our bivariate regression model to a multivariate regression model by adding in other factors that may affect $\text{logpgp95}_i$.
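Omitted variable bias is easy to see in a small simulation. The sketch below uses made-up coefficients and synthetic data (all assumptions for illustration): the outcome depends on `x` and on a second factor `z` that is correlated with `x`, and the regression is run with and without `z`.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

z = rng.normal(size=n)                 # factor that will be omitted
x = 0.6 * z + rng.normal(size=n)       # regressor correlated with z
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)   # true slope on x is 2

# Short regression: constant and x only (z omitted)
X_short = np.column_stack([np.ones(n), x])
beta_short = np.linalg.solve(X_short.T @ X_short, X_short.T @ y)

# Long regression: constant, x and z
X_long = np.column_stack([np.ones(n), x, z])
beta_long = np.linalg.solve(X_long.T @ X_long, X_long.T @ y)

print("slope on x, z omitted: ", beta_short[1])
print("slope on x, z included:", beta_long[1])
```

In this setup the short regression overstates the slope, because `x` picks up part of the effect of `z`. Once `z` is included, the slope estimate is close to the true value of 2.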

[AJR01] consider other factors such as:

- the effect of climate on economic outcomes; latitude is used to proxy this
- differences that affect both economic performance and institutions, e.g. cultural, historical, etc.; controlled for with the use of continent dummies

Let’s estimate some of the extended models considered in the paper (Table 2) using data from maketable2.dta

    df2 = pd.read_stata('https://github.com/QuantEcon/lecture-python/blob/master/source/_static/lecture_specific/ols/maketable2.dta?raw=true')

    # Add constant term to dataset
    df2['const'] = 1

    # Create lists of variables to be used in each regression
    X1 = ['const', 'avexpr']
    X2 = ['const', 'avexpr', 'lat_abst']
    X3 = ['const', 'avexpr', 'lat_abst', 'asia', 'africa', 'other']

    # Estimate an OLS regression for each set of variables
    reg1 = sm.OLS(df2['logpgp95'], df2[X1], missing='drop').fit()
    reg2 = sm.OLS(df2['logpgp95'], df2[X2], missing='drop').fit()
    reg3 = sm.OLS(df2['logpgp95'], df2[X3], missing='drop').fit()

Now that we have fitted our model, we will use summary_col to display the results in a single table (model numbers correspond to those in the paper)

    info_dict = {'R-squared': lambda x: f"{x.rsquared:.2f}",
                 'No. observations': lambda x: f"{int(x.nobs):d}"}

    results_table = summary_col(results=[reg1, reg2, reg3],
                                float_format='%0.2f',
                                stars=True,
                                model_names=['Model 1',
                                             'Model 3',
                                             'Model 4'],
                                info_dict=info_dict,
                                regressor_order=['const',
                                                 'avexpr',
                                                 'lat_abst',
                                                 'asia',
                                                 'africa'])

    results_table.add_title('Table 2 - OLS Regressions')
    print(results_table)

### Endogeneity

As [AJR01] discuss, the OLS models likely suffer from endogeneity issues, resulting in biased and inconsistent model estimates.

Namely, there is likely a two-way relationship between institutions and economic outcomes:

- richer countries may be able to afford or prefer better institutions
- variables that affect income may also be correlated with institutional differences
- the construction of the index may be biased; analysts may be biased towards seeing countries with higher income having better institutions

To deal with endogeneity, we can use two-stage least squares (2SLS) regression, which is an extension of OLS regression.

This method requires replacing the endogenous variable $\text{avexpr}_i$ with a variable that is:

- correlated with $\text{avexpr}_i$
- not correlated with the error term (i.e. it should not directly affect the dependent variable, otherwise it would be correlated with $u_i$ due to omitted variable bias)

The new set of regressors is called an instrument, which aims to remove endogeneity in our proxy of institutional differences.
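The two conditions can be illustrated with simulated data, where the error term is known (in real data it is unobserved, so the second condition cannot be checked this directly). Everything in this sketch is an assumption for illustration: a synthetic instrument `z`, an endogenous regressor `x`, and a true slope of 1.5.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

z = rng.normal(size=n)                       # candidate instrument
u = rng.normal(size=n)                       # structural error term
x = 1.0 * z + 0.8 * u + rng.normal(size=n)   # endogenous: depends on u
y = 0.5 + 1.5 * x + u                        # true slope is 1.5

corr_zx = np.corrcoef(z, x)[0, 1]   # condition 1: clearly nonzero
corr_zu = np.corrcoef(z, u)[0, 1]   # condition 2: close to zero

# OLS slope is biased because x is correlated with u
slope_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Simple IV estimator with one instrument: cov(z, y) / cov(z, x)
slope_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(corr_zx, corr_zu)
print(slope_ols, slope_iv)
```

The OLS slope drifts away from 1.5 because `x` and `u` move together, while the IV ratio uses only the variation in `x` that comes from `z`.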

The main contribution of [AJR01] is the use of settler mortality rates to instrument for institutional differences.

They hypothesize that higher mortality rates of colonizers led to the establishment of institutions that were more extractive in nature (less protection against expropriation), and these institutions still persist today.

Using a scatterplot (Figure 3 in [AJR01]), we can see protection against expropriation is negatively correlated with settler mortality rates, coinciding with the authors’ hypothesis and satisfying the first condition of a valid instrument.

    # Dropping NA's is required to use numpy's polyfit
    df1_subset2 = df1.dropna(subset=['logem4', 'avexpr'])

    X = df1_subset2['logem4']
    y = df1_subset2['avexpr']
    labels = df1_subset2['shortnam']

    # Replace markers with country labels
    fig, ax = plt.subplots()
    ax.scatter(X, y, marker='')

    for i, label in enumerate(labels):
        ax.annotate(label, (X.iloc[i], y.iloc[i]))

    # Fit a linear trend line
    ax.plot(np.unique(X),
            np.poly1d(np.polyfit(X, y, 1))(np.unique(X)),
            color='black')

    ax.set_xlim([1.8, 8.4])
    ax.set_ylim([3.3, 10.4])
    ax.set_xlabel('Log of Settler Mortality')
    ax.set_ylabel('Average Expropriation Risk 1985-95')
    ax.set_title('Figure 3: First-stage relationship between settler mortality '
                 'and expropriation risk')
    plt.show()

The second condition may not be satisfied if settler mortality rates in the 17th to 19th centuries have a direct effect on current GDP (in addition to their indirect effect through institutions).

For example, settler mortality rates may be related to the current disease environment in a country, which could affect current economic performance.

[AJR01] argue this is unlikely because:

- The majority of settler deaths were due to malaria and yellow fever and had a limited effect on local people.
- The disease burden on local people in Africa or India, for example, did not appear to be higher than average, supported by relatively high population densities in these areas before colonization.

As we appear to have a valid instrument, we can use 2SLS regression to obtain consistent parameter estimates.

#### First stage

The first stage involves regressing the endogenous variable ($\text{avexpr}_i$) on the instrument.

The instrument is the set of all exogenous variables in our model (and not just the variable we have replaced).

Using model 1 as an example, our instrument is simply a constant and settler mortality rates $\text{logem4}_i$.

Therefore, we will estimate the first-stage regression as

$$
\text{avexpr}_i = \delta_0 + \delta_1 \text{logem4}_i + v_i
$$

The data we need to estimate this equation is located in maketable4.dta (only complete data, indicated by baseco = 1, is used for estimation)

    # Import and select the data
    df4 = pd.read_stata('https://github.com/QuantEcon/lecture-python/blob/master/source/_static/lecture_specific/ols/maketable4.dta?raw=true')
    df4 = df4[df4['baseco'] == 1].copy()  # copy so new columns are not added to a view

    # Add a constant variable
    df4['const'] = 1

    # Fit the first stage regression and print summary
    results_fs = sm.OLS(df4['avexpr'],
                        df4[['const', 'logem4']],
                        missing='drop').fit()
    print(results_fs.summary())

#### Second stage

We need to retrieve the predicted values of $\text{avexpr}_i$ using .predict().

We then replace the endogenous variable $\text{avexpr}_i$ with the predicted values $\widehat{\text{avexpr}}_i$ in the original linear model.

Our second stage regression is thus

$$
\text{logpgp95}_i = \beta_0 + \beta_1 \widehat{\text{avexpr}}_i + u_i
$$

    df4['predicted_avexpr'] = results_fs.predict()

    results_ss = sm.OLS(df4['logpgp95'],
                        df4[['const', 'predicted_avexpr']]).fit()
    print(results_ss.summary())

The second-stage regression results give us a consistent estimate of the effect of institutions on economic outcomes.

The result suggests a stronger positive relationship than what the OLS results indicated.

Note that while our parameter estimates are correct, our standard errors are not and for this reason, computing 2SLS ‘manually’ (in stages with OLS) is not recommended.
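The point about standard errors can be sketched with numpy on synthetic data (an assumption for illustration; the variable names and the degrees-of-freedom choice `n - 2` are also assumptions, not what any particular package does). The manual second stage computes residuals from the predicted regressor, whereas the 2SLS standard errors should use residuals computed from the original regressor.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000

z = rng.normal(size=n)                       # instrument
e = rng.normal(size=n)                       # structural error
x = 0.8 * z + 0.5 * e + rng.normal(size=n)   # endogenous regressor
y = 1.0 + 2.0 * x + e                        # true slope is 2

X = np.column_stack([np.ones(n), x])         # constant and endogenous regressor
Z = np.column_stack([np.ones(n), z])         # constant and instrument

# First stage: project X onto the instruments
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)

# Second stage: regress y on the projected regressors
beta_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

# Plain OLS for comparison (biased here)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Same coefficients, two different residual definitions
resid_naive = y - X_hat @ beta_2sls          # what a manual second stage uses
resid_correct = y - X @ beta_2sls            # what 2SLS inference should use

XhXh_inv_diag = np.diag(np.linalg.inv(X_hat.T @ X_hat))
se_naive = np.sqrt(resid_naive @ resid_naive / (n - 2) * XhXh_inv_diag)
se_correct = np.sqrt(resid_correct @ resid_correct / (n - 2) * XhXh_inv_diag)

print("OLS slope: ", beta_ols[1])
print("2SLS slope:", beta_2sls[1])
print("naive SE:  ", se_naive[1])
print("correct SE:", se_correct[1])
```

The slope estimate is identical under both approaches; only the residuals, and therefore the standard errors, differ.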

We can correctly estimate a 2SLS regression in one step using the linearmodels package, an extension of statsmodels

Note that when using IV2SLS, the exogenous and instrument variables are split up in the function arguments (whereas before the instrument included exogenous variables)

    iv = IV2SLS(dependent=df4['logpgp95'],
                exog=df4['const'],
                endog=df4['avexpr'],
                instruments=df4['logem4']).fit(cov_type='unadjusted')

    print(iv.summary)

Given that we now have consistent estimates, we can infer from the model we have estimated that institutional differences (stemming from institutions set up during colonization) can help to explain differences in income levels across countries today.

[AJR01] use a marginal effect of 0.94 to calculate that the difference in the index between Chile and Nigeria (i.e. institutional quality) implies up to a 7-fold difference in income, emphasizing the significance of institutions in economic development.

### Summary

We have demonstrated basic OLS and 2SLS regression in statsmodels and linearmodels.

If you are familiar with R, you may want to use the formula interface to statsmodels, or consider using rpy2 to call R from within Python.

### Exercises

#### Exercise 1

In the lecture, we think the original model suffers from endogeneity bias due to the likely effect income has on institutional development.

Although endogeneity is often best identified by thinking about the data and model, we can formally test for endogeneity using the Hausman test.

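We want to test for correlation between the endogenous variable, $\text{avexpr}_i$, and the errors, $u_i$

$$
\begin{aligned}
H_0 &: \text{Cov}(\text{avexpr}_i, u_i) = 0 \quad (\text{no endogeneity}) \\
H_1 &: \text{Cov}(\text{avexpr}_i, u_i) \neq 0 \quad (\text{endogeneity})
\end{aligned}
$$

This test is run in two stages.

First, we regress $\text{avexpr}_i$ on the instrument, $\text{logem4}_i$

$$
\text{avexpr}_i = \pi_0 + \pi_1 \text{logem4}_i + \upsilon_i
$$

Second, we retrieve the residuals $\hat{\upsilon}_i$ and include them in the original equation

$$
\text{logpgp95}_i = \beta_0 + \beta_1 \text{avexpr}_i + \alpha \hat{\upsilon}_i + u_i
$$

If $\alpha$ is statistically significant (with a p-value < 0.05), then we reject the null hypothesis and conclude that $\text{avexpr}_i$ is endogenous.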

Using the above information, estimate a Hausman test and interpret your results.

#### Exercise 2

The OLS parameter $\beta$ can also be estimated using matrix algebra and numpy (you may need to review the numpy lecture to complete this exercise).

The linear equation we want to estimate is (written in matrix form)

$$
y = X\beta + u
$$

To solve for the unknown parameter $\beta$, we want to minimize the sum of squared residuals

$$
\min_{\hat{\beta}} \hat{u}'\hat{u}
$$

Rearranging the first equation and substituting into the second equation, we can write

$$
\min_{\hat{\beta}} \ (y - X\hat{\beta})'(y - X\hat{\beta})
$$

Solving this optimization problem gives the solution for the $\hat{\beta}$ coefficients

$$
\hat{\beta} = (X'X)^{-1}X'y
$$

Using the above information, compute $\hat{\beta}$ from model 1 using numpy; your results should be the same as those in the statsmodels output from earlier in the lecture.

### Solutions

#### Exercise 1

    # Load in data
    df4 = pd.read_stata('https://github.com/QuantEcon/lecture-python/blob/master/source/_static/lecture_specific/ols/maketable4.dta?raw=true')

    # Add a constant term
    df4['const'] = 1

    # Estimate the first stage regression
    reg1 = sm.OLS(endog=df4['avexpr'],
                  exog=df4[['const', 'logem4']],
                  missing='drop').fit()

    # Retrieve the residuals
    df4['resid'] = reg1.resid

    # Estimate the second stage regression, including the first-stage residuals
    reg2 = sm.OLS(endog=df4['logpgp95'],
                  exog=df4[['const', 'avexpr', 'resid']],
                  missing='drop').fit()

    print(reg2.summary())

The output shows that the coefficient on the residuals is statistically significant, indicating $\text{avexpr}_i$ is endogenous.

#### Exercise 2

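    # Load in data
    df1 = pd.read_stata('https://github.com/QuantEcon/lecture-python/blob/master/source/_static/lecture_specific/ols/maketable1.dta?raw=true')
    df1 = df1.dropna(subset=['logpgp95', 'avexpr'])

    # Add a constant term
    df1['const'] = 1

    # Define the X and y variables
    y = np.asarray(df1['logpgp95'])
    X = np.asarray(df1[['const', 'avexpr']])

    # Compute beta_hat by solving the normal equations (X'X) beta_hat = X'y
    # (this step is reconstructed from the formula in the exercise and the note below)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Print the intercept and slope
    print(f'beta_0 = {beta_hat[0]:.2}')
    print(f'beta_1 = {beta_hat[1]:.2}')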

It is also possible to use np.linalg.inv(X.T @ X) @ X.T @ y to solve for $\beta$; however, np.linalg.solve() is preferred as it involves fewer computations.
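A quick check on synthetic data (an assumption for illustration) suggests the two approaches agree up to floating-point error:

```python
import numpy as np

# Synthetic design matrix and outcome (assumptions for illustration)
rng = np.random.default_rng(5)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([4.6, 0.5]) + rng.normal(scale=0.7, size=100)

# Explicit inverse: forms (X'X)^{-1} and then multiplies
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Linear solve: solves (X'X) beta = X'y without forming the inverse
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_inv)
print(beta_solve)
```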

## How to get old code from GitHub

### Exploring History

### Overview

**Teaching:** 25 min

**Exercises:** 0 min

**Questions**

- How can I access old versions of files?
- How do I review my changes?

**Objectives**

- Use the GitHub website to look back in time.
- Compare various versions of tracked files.
- Restore old versions of files.

The main advantage of Git is that it allows us to look back in time, revert to previous versions, and see what has changed. There are a couple of different things we might want to do with this ability.

### Reverting To An Old Version of the Repository

If you want to roll back all the changes you made in the most recent commit, and just revert to the previous state of the repository, you can do this in GitHub Desktop.

Start by navigating to the “History” tab. Right-click on the previous commit, and you’ll see the option to revert this commit.

If you click on `Revert This Commit`, two things will happen.

The first is that the files in your repository will revert to their previous state.

The second thing that happens when you successfully revert a commit is that you’ll see a new commit appear in the History tab, while no changes have appeared in the Changes tab. What Git has done behind the scenes is to calculate all the changes it needs to make to the current files in order to get the old files back, implement those changes, and then commit them. What this means is that the state you reverted *from* still exists in the repository’s history, and you can revert the revert to get them back.
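The idea that a revert is itself a new commit can be sketched with a toy model in Python. This is only a conceptual illustration under simplifying assumptions, not how Git is implemented: the class `ToyRepo`, its methods, and the file contents are all hypothetical, and each change is stored as a mapping from file name to an (old, new) pair.

```python
class ToyRepo:
    """Toy history model: each commit records {filename: (old, new)}."""

    def __init__(self):
        self.files = {}      # current contents of each file
        self.history = []    # list of (message, change), oldest first

    def commit(self, message, change):
        """Apply a change to the files and record it in the history."""
        for name, (_old, new) in change.items():
            if new is None:
                self.files.pop(name, None)   # file removed
            else:
                self.files[name] = new
        self.history.append((message, change))

    def revert_last(self):
        """Undo the latest commit by committing its inverse change."""
        message, change = self.history[-1]
        inverse = {name: (new, old) for name, (old, new) in change.items()}
        self.commit(f'Revert "{message}"', inverse)


repo = ToyRepo()
repo.commit("Start notes", {"mars.txt": (None, "Cold and dry")})
repo.commit("Add concerns", {"mars.txt": ("Cold and dry", "Cold and dry, two moons")})

repo.revert_last()                       # adds a third commit
after_revert = repo.files["mars.txt"]

repo.revert_last()                       # reverting the revert adds a fourth
after_second_revert = repo.files["mars.txt"]

print(len(repo.history), after_revert, "|", after_second_revert)
```

Nothing is ever removed from `history` in this model, which mirrors the point above: the state you reverted from is still recorded and can be brought back.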

### Reverting When You Have Changes

If you had unsaved changes in those files **you will lose them**. If you had saved, but *uncommitted* changes, GitHub Desktop will ask you to commit those changes first.

### Reverting Multiple Commits

Reverting does not take you *back* to a specific committed state. Rather, it *undoes the changes* of a specific commit action. What this means is that in order to revert back multiple commits, you **must** revert them one at a time, starting with the most recent. Reverting a commit in the middle of the history might lead to incoherence in your repository and is a Very Bad Idea.

### Exploring Previous States

If you want to look at the files from previous commits without actually reverting the changes, that is best done on the GitHub website. Right-click on the commit you’re interested in, and choose `View on GitHub`.

This will take you to that commit’s “diff” page on GitHub.

The “diff” page shows the difference between this commit and the previous one. This can be informative, but may not be what you’re looking for. To view the state of the repository and all its files at that point in time, click on the “Browse Files” button in the top right corner. This will show you a list of all the files and directories. You can then click on the files you’re interested in, look at their contents, and even download them.

To download `mars.txt`, click on it, then click on the Raw button and `File >> Save` the resulting plain text page.

Save it outside your repository to make sure you aren’t accidentally overwriting your existing files. Then you can make your decisions about how to manually revert your changes.
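For public repositories, the Raw button leads to a URL on raw.githubusercontent.com that contains the commit identifier, so the same download can be scripted. The sketch below is a hedged example: the function names are hypothetical, and the owner, repository and commit hash in the usage line are placeholders, not real values.

```python
def raw_file_url(owner, repo, ref, path):
    """Build the raw-content URL for `path` as of `ref` (a commit hash, branch or tag)."""
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{path}"


def download_old_version(owner, repo, ref, path, dest):
    """Download the file at `ref` to `dest` (needs network access; public repos only)."""
    from urllib.request import urlretrieve
    urlretrieve(raw_file_url(owner, repo, ref, path), dest)
    return dest


# Placeholder values for illustration only
url = raw_file_url("example-user", "planets",
                   "0123456789abcdef0123456789abcdef01234567", "mars.txt")
print(url)
```

As with the manual download, choose a `dest` outside the repository so the working copy is not overwritten.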

### Key Points

- The GitHub website will show a list of changes, and show the differences between commits.
- We can download or copy from old versions of files.
