Python Scraping 101

Part I – Setting up the script

Start a new Python script in either your preferred text editor or a Python IDE.

To work with the web, our Python script needs to send requests and receive data. One way to do this is with the “requests” library, a third-party package (installed with pip) rather than a built-in module. Bring it in using the following:

The other library we will be using is lxml, which parses and searches XML and HTML.

You can either import the whole package with “import lxml” or import only the parts we are interested in, which are the html and etree modules. For later, we will also need defaultdict from collections. So the code should look like this:
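
Taken together, the imports described so far might look like the sketch below. It assumes that the “request” module is the third-party requests library and that both requests and lxml are already installed (for example with pip).

```python
# Third-party HTTP library (assumed to be what the text calls "request").
import requests

# html builds a searchable tree from a page; etree is lxml's general tree API.
from lxml import html, etree

# defaultdict is part of the standard library and is imported for later use.
from collections import defaultdict
```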

Now, let’s use requests to get a page and store it in a variable:

We could make the call without “page =”, but we want to use the page we get back, so we assign it to a variable; otherwise the result would be thrown away.
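
A minimal sketch of this step, assuming the requests library and a search URL of the form shown here (the exact address used in the workshop is an assumption). The request is wrapped in a hypothetical helper, fetch_page, so nothing is downloaded until you call it.

```python
import requests

# Assumed search URL; replace it with the address you are working with.
URL = "https://twitter.com/search?q=ucla"


def fetch_page(url):
    """Download url and return the Response object from requests."""
    return requests.get(url, timeout=10)


# Typical use (requires a network connection):
# page = fetch_page(URL)
# print(page)        # the Response object itself
# print(page.text)   # the HTML of the page as text
```

Assigning the return value to page is what lets later steps read page.text; calling fetch_page(URL) on its own would discard the response.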

Let’s use the module we imported and print the result:

What does your script return?

Looking at our URL, we can see that q=ucla carries our search term. We are going to store the term in a variable called searchTerm and add it to the URL, like so:
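
One way to build the URL from a variable is sketched below. The base address is an assumption, and quote_plus is an addition that is not part of the original steps; it escapes spaces and other special characters in case the search term contains them.

```python
from urllib.parse import quote_plus

searchTerm = "ucla"

# Assumed base address; everything after "q=" is the search term.
BASE_URL = "https://twitter.com/search?q="

# quote_plus turns a term such as "ucla bruins" into "ucla+bruins".
url = BASE_URL + quote_plus(searchTerm)

print(url)
```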

Your code should look similar to this:

Part II – Let’s Treat the Nodes Right

Next we want to use the html module to turn the page into a tree, because that allows us to search its nodes:

html.fromstring allows us to search the page as a tree, rather than simply text, but how do we search?
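
To see what html.fromstring returns without touching the network, the sketch below parses a small invented page. The sample_page string stands in for page.text and is purely illustrative.

```python
from lxml import html

# Stand-in for page.text; a real page would come from the earlier request.
sample_page = """
<html>
  <body>
    <p>Hello <a href="/ucla">UCLA</a></p>
  </body>
</html>
"""

# Parse the text into a tree of elements.
tree = html.fromstring(sample_page)

# The tree is made of nodes we can inspect, not just a long string.
print(tree.tag)

first_link = tree.find(".//a")
print(first_link.get("href"))
```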

We have to use the tree’s “xpath” method.

Let’s try an example by grabbing all the links from Twitter:

This loops over every node returned by tree.xpath, which finds all the ‘a’ anchors that have an ‘href’ attribute, and prints each one. With xpath, you can traverse down a tree and grab the children with each ‘/’ character.
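
Here is a small sketch of that idea on invented markup rather than a live page. The expression //a/@href is one common way to select the href attribute of every anchor; the exact expression used in the workshop is not shown in the text, so treat this one as an assumption. The results are also stored in a variable, links.

```python
from lxml import html

# Invented sample; the third anchor has no href and is skipped.
sample_page = """
<div>
  <a href="/ucla">UCLA</a>
  <a href="/bruins">Bruins</a>
  <a>no link here</a>
</div>
"""

tree = html.fromstring(sample_page)

# "//a" finds anchors anywhere in the tree; "/@href" steps down to the attribute.
links = tree.xpath("//a/@href")

for node in links:
    print(node)
```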

How would you store the result into a variable?

Let’s look for the time of each tweet and put it into a variable:
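
The markup Twitter served is not reproduced in this text, so the sketch below invents a small sample in which each timestamp sits in the title attribute of a link with the class tweet-timestamp. Treat the class name, the attribute, and the variable name times as assumptions, and adjust the XPath to whatever the real page contains.

```python
from lxml import html

# Hypothetical markup; real pages will differ.
sample_page = """
<div>
  <a class="tweet-timestamp" title="1:05 PM - 3 May 2016">3 May</a>
  <a class="tweet-timestamp" title="9:30 AM - 4 May 2016">4 May</a>
</div>
"""

tree = html.fromstring(sample_page)

# Select the title attribute of every link with the assumed class.
times = tree.xpath('//a[@class="tweet-timestamp"]/@title')
```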

Feel free to print it out if you’d like.

Let’s get some other nodes, like the tweet content and usernames:

Notice the brackets around the two expressions; they put the results into a list. Next, look at node.text_content(): this method returns ONLY the text of the nodes we found, without the HTML tags.
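
The bracket pattern and text_content() can be tried on a small invented sample, as sketched below. The class names username and tweet-text are assumptions chosen for illustration, not the real structure of a Twitter page.

```python
from lxml import html

# Invented sample markup with nested tags inside each node.
sample_page = """
<div>
  <div class="tweet">
    <span class="username">@<b>bruin1</b></span>
    <p class="tweet-text">Go <strong>Bruins</strong>!</p>
  </div>
  <div class="tweet">
    <span class="username">@<b>bruin2</b></span>
    <p class="tweet-text">Finals week</p>
  </div>
</div>
"""

tree = html.fromstring(sample_page)

# The brackets build a list; text_content() drops the tags and keeps the text.
tweets = [node.text_content() for node in tree.xpath('//p[@class="tweet-text"]')]
usernames = [node.text_content() for node in tree.xpath('//span[@class="username"]')]

print(tweets)
print(usernames)
```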

Your code should look like this:

Part III – Time to loop it around

Halfway there! Now we will systematically clean the data by using a for-loop.

A loop in Python has the following components:

for something in someArray:
    # do something

For example:

In order to loop, something needs to be looped over; in this case it is an array of numbers.
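
A minimal loop over a list of numbers might look like this sketch. The names numbers and doubled are made up for the example.

```python
# The thing being looped over: a list of numbers.
numbers = [1, 2, 3]

doubled = []

# Each pass through the loop handles one item from the list.
for number in numbers:
    doubled.append(number * 2)

print(doubled)
```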

Let’s add this code to our Twitter scraper:

In line 2, we create an empty list to store our tweet contents. In line 5, we begin our for-loop and encode the contents as ‘utf-8’ with contents.encode('utf-8') so that special characters such as emoji do not cause errors later. Note that UTF-8 encoding keeps these characters; it does not remove them. Then we add the encoded content to the theTweets list using theTweets.append(single) . We can print the cleaned tweets if we want using print theTweets (written as print(theTweets) in Python 3).
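
Because encode('utf-8') keeps every character, a different technique is needed if the goal really is to drop emoji. The sketch below swaps in encode('ascii', 'ignore'), which discards anything outside plain ASCII; this is an alternative to the step described above, not the workshop's own code, and it is written for Python 3. The sample tweets are invented.

```python
# Invented sample tweets; the escapes are a bear emoji and a coffee cup.
tweets = ["Go Bruins! \U0001F43B", "Finals week \u2615"]

theTweets = []

for contents in tweets:
    # Drop every non-ASCII character, then turn the bytes back into text.
    single = contents.encode("ascii", "ignore").decode("ascii").strip()
    theTweets.append(single)

print(theTweets)
```

Note that this also removes accented letters and other non-English text, so it is only appropriate when plain ASCII output is acceptable.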

Go ahead and add similar arrays and for-loops for the usernames.

Your code should look like the following:

Part IV – Put it into a CSV

Now that we have everything cleaned up, we are going to write it to a csv file:

Line 1 opens the target file, which is going to be our very own “searchTerm”.csv. Line 2 wraps that file with csv.writer and assigns the result to “writer”. Line 3 finally writes our first CSV row, containing our headers, using writerow() .
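
Those three lines could be sketched as follows in Python 3. The header names username and tweet are placeholders, since the original headers are not shown; the file is closed right away here only so the example is complete on its own.

```python
import csv

searchTerm = "ucla"

# Line 1: open the target file, named after the search term.
# newline="" lets the csv module control line endings itself.
out_file = open(searchTerm + ".csv", "w", newline="", encoding="utf-8")

# Line 2: wrap the file in a csv writer.
writer = csv.writer(out_file)

# Line 3: write the header row (placeholder column names).
writer.writerow(["username", "tweet"])

out_file.close()
```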

We can check our data to make sure the number of tweets and usernames are the same with the following:

Now we will prepare our Twitter data by zipping it up, which pairs each item in one list with the matching item in the other. This step is needed whenever you have more than one column of data.

The order of the lists in the zip should match the order of the headings we wrote in the header row.
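
The length check and the zip step can be sketched together on two short invented lists. The column order (usernames first, then tweets) is an assumption and should follow whatever order your header row uses.

```python
# Invented sample data standing in for the scraped lists.
usernames = ["@bruin1", "@bruin2"]
theTweets = ["Go Bruins!", "Finals week"]

# Both lists should be the same length, or rows will be silently dropped.
same_length = len(usernames) == len(theTweets)
print(len(usernames), len(theTweets))

# zip pairs the first username with the first tweet, and so on.
rows = list(zip(usernames, theTweets))
print(rows)
```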

Finally, we will do our last for-loop and close the csv file:
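
A complete version of the writing step might look like the sketch below, written for Python 3 with invented data and placeholder header names. It uses a with block, which closes the file automatically at the end; calling close() by hand, as the text describes, works just as well.

```python
import csv

searchTerm = "ucla"
usernames = ["@bruin1", "@bruin2"]
theTweets = ["Go Bruins!", "Finals week"]

with open(searchTerm + ".csv", "w", newline="", encoding="utf-8") as out_file:
    writer = csv.writer(out_file)
    writer.writerow(["username", "tweet"])  # placeholder headers

    # One CSV row per (username, tweet) pair.
    for row in zip(usernames, theTweets):
        writer.writerow(row)
```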

Congratulations, you have scraped a Twitter page!

Extra step: Let’s add some user input

But wait, instead of searching for “ucla” all the time, we can have the user input a search term by doing the following:
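
One way to ask for the search term is sketched below for Python 3, where the built-in is input (Python 2 uses raw_input instead). The helper names ask_search_term and build_url, the prompt text, and the base address are all assumptions made for the example.

```python
from urllib.parse import quote_plus


def ask_search_term(read=input):
    """Prompt for a search term; `read` can be swapped out for testing."""
    return read("Enter a search term: ").strip()


def build_url(searchTerm):
    """Attach the (escaped) search term to the assumed base address."""
    return "https://twitter.com/search?q=" + quote_plus(searchTerm)


# Typical use:
# searchTerm = ask_search_term()
# url = build_url(searchTerm)
```

Because searchTerm is also used for the CSV file name, each search ends up in its own file.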

Your final code should look like the following:
