Course Context
Recent years have seen unprecedented growth in the types and amount of information available online. Through the internet, analysts and practitioners can now access a growing volume of highly detailed, high-frequency information. In some cases, however, the required information is not available as a ready-made download. The ability to build tools that retrieve and parse information stored on the internet is therefore a valuable skill in many areas of data science. This course is a primer on data scraping using Stata and Python. Stata can now interact with Python: the two programs are well integrated, and you can run Python code from within Stata. This integration opens up a wide range of possibilities.
Course Overview
In this course, participants will learn how information published on websites can be organized and written in a Stata-readable format. The teaching approach is based on learning by doing: every line of code presented in the course will be discussed and can be run by all attendees. The discussion will draw on real-world examples and will be oriented toward practical applications rather than theory. After the course, participants are expected to have an improved understanding of the integration between Stata and Python, and to be able to read in detail, and potentially write for themselves, a web scraping script in Python within Stata. Participants will thus be able to handle research tasks including, but not limited to:
Creating a dataset containing the data retrieved online.
Who is the course for
This course is designed for analysts, researchers, and data professionals interested in enhancing their skills in data scraping and analysis using Python and Stata. It is suitable for those who want to leverage the integration between Python and Stata for retrieving, parsing, and organizing information from the web for practical applications in data science and research.
Morning Session | Afternoon Session | Q&A with Instructor |
---|---|---|
10am-12pm (London time) | 2pm-4pm (London time) | 4pm-4:30pm (London time) |
Data Scraping: definition, rationale, usefulness
Data scraping projects
Data scraping vs. API
Pros and Cons of data scraping
Python within Stata: Python basics
Python Installation
Variables, Lists, and loops
Defining functions using Python
Writing and reading csv files
Stata implementation
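As a flavour of this session, here is a minimal sketch of the Python basics listed above: a variable, a list, a loop, a function, and writing and reading a csv file. The function names, the file name temperatures.csv, and the sample values are made up for illustration. Within Stata (version 16 or later), code like this can be placed between `python:` and `end` in a .do file.

```python
import csv
import os
import tempfile


def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def write_rows(path, header, rows):
    """Write a header line followed by the data rows to a csv file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def read_rows(path):
    """Read a csv file back as a list of rows (every cell is a string)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))


# A variable holding a list of illustrative readings in Celsius
readings_c = [0, 12.5, 30]

# A loop that builds one output row per reading
rows = []
for c in readings_c:
    rows.append([c, celsius_to_fahrenheit(c)])

# Write the rows to a csv file in the temporary folder, then read them back
csv_path = os.path.join(tempfile.gettempdir(), "temperatures.csv")
write_rows(csv_path, ["celsius", "fahrenheit"], rows)
table = read_rows(csv_path)
print(table)
```

A csv file written this way can then be loaded into Stata with `import delimited`.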
The requests library in Python
Installing requests in Python
Opening a webpage with Python using requests
Understanding the basic structure of an HTML webpage
Stata implementation
The beautiful soup library within Python
Installing beautiful soup in Python
Parsing a webpage using requests and beautiful soup
Reaching specific information within a webpage
Stata implementation
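To give an idea of what parsing looks like, the sketch below uses Beautiful Soup to pull the rows of a table out of an HTML string. The HTML snippet, the `prices` table id, and the function name `extract_rows` are invented for illustration. On a real page the HTML would typically come from `requests.get(url).text`, and the tags to target would depend on that page's structure.

```python
from bs4 import BeautifulSoup

# Illustrative HTML standing in for a downloaded webpage
SAMPLE_HTML = """
<html><body>
  <h1>Fruit prices</h1>
  <table id="prices">
    <tr><th>Item</th><th>Price</th></tr>
    <tr><td>Apple</td><td> 1.20 </td></tr>
    <tr><td>Pear</td><td> 0.95 </td></tr>
  </table>
</body></html>
"""


def extract_rows(html):
    """Return the text of every cell in the 'prices' table, row by row."""
    soup = BeautifulSoup(html, "html.parser")
    # Reach the specific table by its id attribute
    table = soup.find("table", id="prices")
    rows = []
    for tr in table.find_all("tr"):
        # Collect header and data cells, trimming surrounding whitespace
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        rows.append(cells)
    return rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```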
Writing your first data scraping project
Retrieving specific information from a website
Cleaning the retrieved information
Writing the cleaned data on a csv file
Stata implementation
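The cleaning and writing steps might look like the following sketch. It assumes the scraper has already produced a list of raw text rows; the sample rows, the `clean_price` and `clean_rows` helpers, and the prices.csv file name are hypothetical.

```python
import csv
import os
import re
import tempfile

# Illustrative raw rows, as they might come out of a scraper
RAW_ROWS = [
    ["  Apple ", "$1.20"],
    ["Pear\n", " $0.95 "],
    ["Mango", "n/a"],
]


def clean_price(text):
    """Return the first number found in text as a float, or None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


def clean_rows(raw_rows):
    """Trim item names, convert prices to numbers, and drop rows without a price."""
    cleaned = []
    for name, price in raw_rows:
        value = clean_price(price)
        if value is None:
            continue  # no usable price in this row
        cleaned.append([name.strip(), value])
    return cleaned


def write_csv(path, header, rows):
    """Write a header line followed by the cleaned rows to a csv file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


cleaned = clean_rows(RAW_ROWS)
csv_path = os.path.join(tempfile.gettempdir(), "prices.csv")
write_csv(csv_path, ["item", "price"], cleaned)
print(cleaned)
```

Dropping rows without a usable price is only one possible choice; keeping them with a missing value would work equally well if the analysis needs every item.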
Letting your first data scraping project run across different webpages
Creating loops to scrape data across different pages
Avoiding overloading a website with too many requests
Stata implementation
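A loop over several pages with a pause between requests can be sketched as follows. The URL pattern, the page range, and the delay are placeholders, and `fetch` is a hypothetical parameter standing in for whatever function downloads a page (for example, a thin wrapper around `requests.get`). It is passed in so the loop can be tried without touching a real website.

```python
import time


def scrape_pages(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests to limit server load."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request except the first one
            sleep(delay_seconds)
        pages[url] = fetch(url)
    return pages


def fake_fetch(url):
    """Stand-in for a real download function; returns made-up HTML."""
    return f"<html><body>Content of {url}</body></html>"


# Build the list of page URLs to visit (placeholder address)
urls = [f"https://example.com/listing?page={n}" for n in range(1, 4)]

# No delay here because nothing is actually downloaded
pages = scrape_pages(urls, fake_fetch, delay_seconds=0)
print(len(pages), "pages fetched")
```

With a real download function, a delay of a second or more between requests is a common starting point, though the appropriate value depends on the website.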
At the end of the course, there will be a dedicated informal session for Q&A relevant to the content of the course.
This course requires Python. Before the course starts, please check that Python is already installed on your computer, and install it if it is not.
Installing Python is generally easy, and nowadays many Linux and UNIX distributions include a recent Python. Even some Windows computers (notably those from HP) now come with Python already installed.
Python is cross-platform and works on Windows, macOS, and Linux.
If you do need to install Python, you can do so here.
To work with Python, you would normally need a text editor or IDE. This course does not require a specific one: because we focus on the integration of Stata and Python, the Python scripts we discuss will be embedded in Stata .do files and executed directly from within Stata.
The number of attendees is restricted. Please register early to guarantee your place.