Python and the Web

Part I – Setting up the script

Start a new python script in either your preferred text editor or Python IDE.

To work with the web, our python script needs to send out data, one way to do this is using the built-in “request” module. Bring it in using the following:

The only extra library we will be using is lxml, which is a module used to parse and search through xml.

You can either import the whole module with “import lxml” or you can take the classes we are interested in which is the html and etree. For later, we will also need defaultdict from collections too. So the code should look like this:

Now, let’s use the request by setting a variable and getting a page:

We could avoid page = but we want to use the page we get, so we have to set it to a variable; otherwise it would just be sitting around doing nothing.

Let’s use the module we imported and print the result:

What does your script return?

In looking at our url http://www.google.com/search?tbm=isch&q=ucla , we can see that q=ucla is our search term, but we are going to make it a variable called search term and add it to the url, like so:

Your code should look similar to this:

Part II – Let’s us Nodes what is going on

Next we want to use the tree and html modules, because that allows us to search nodes:

html.fromstring allows us to search the page as a tree, rather than simply text, but how do we search?

We have to use the “xpath” module from the tree.

Let’s use an example by grabbing all the links from the page:

What this does is for all the nodes in tree.xpath , it looks for all the ‘a’ anchors that have an ‘href’ attribute and prints the node. With xpath, you can traverse down a tree and grab the children with each ‘/’ character.

How would you store the result into a variable?

We are only going to grab the images, so let’s use the img tag.

This should work, but it will return the actual images.. That’s no good! We actually want the image sources, so we just need to add another node  to get that attribute called @src

Feel free to print it out if you’d like.

Your code should look like this:

Step 3 – Let’s close the loop

Remember that a loop in Python has the following components:

for something in someArray:
#do something<br>

For example:

In order to do a loop, something needs be looped over, in this case it is some array of numbers.

Let’s add this code to our Google scraper:

In Line 2, we are creating an empty to store our image source contents, and calling it theData . In line 5, we begin our for-loop. Then we add our content to theData array using  theData.append(source) . We can print each clean source if we wanted using print source.

Step 4 Put it into a CSV

Now that we have everything cleaned up, we are going to write it to a csv file:

Line 1 gives us the target file, which is going to be our very own “searchTerm”.csv. Line 2 targets our file with csv.writer() , and assigns it to “writer”.

We can check our images with the following:

Finally, we will do our last for-loop and close the csv file:

Make sure that “row” is within brackets or the csv file will be weird!

Congratulations you have scraped a Google page!!

Extra step: Let’s add some user input

But wait, instead of searching for “ucla” all the time, we can have the user input a search term by doing the following:

Your final code should look like the following:

Back to the workshop