DOM Parsing With Python
This guide assumes you have basic knowledge of Python and have done at least some work with HTML, XHTML, and/or XML.
Background
DOM stands for Document Object Model. It is a convention used in HTML, XHTML, and XML for representing and interacting with the objects in a document. As the name suggests, a document such as an HTML page is made up of many elements, each related to other elements. For example, you may have a <span> element in your <body> element. The <span> element’s parent is the <body>. The <span> may have child elements and/or sibling elements, much like a family tree. The elements in an HTML document may have identifiers, specified by attributes like id="something", class="something", and/or name="something". You can use these identifiers to keep track of and find a specific element or list of elements. Once you have found the element(s) you are looking for, you can change them dynamically or read the information you need.
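To make the family relationships concrete, here is a minimal sketch using Python’s standard library module xml.dom.minidom. The sample markup and the describe helper are illustrative assumptions, not part of any particular site or library; the markup is written without whitespace between tags so that the siblings are elements rather than text nodes.

```python
from xml.dom import minidom

# Hypothetical sample markup: a <body> with two <span> children.
DOC = '<html><body><span id="first">One</span><span class="second">Two</span></body></html>'

dom = minidom.parseString(DOC)
body = dom.getElementsByTagName("body")[0]
first, second = body.getElementsByTagName("span")


def describe(element):
    """Summarise where an element sits in the document tree."""
    sibling = element.nextSibling
    return {
        "tag": element.tagName,
        "parent": element.parentNode.tagName,
        "next_sibling": sibling.tagName if sibling is not None else None,
        "text": element.firstChild.data,
    }


print(describe(first))
print(describe(second))
```

Both spans report <body> as their parent; the first span’s next sibling is the second span, and the second span has no next sibling.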
Let’s Try Some Beautiful Soup
As I found the need to parse HTML documents a little while ago, I went in search of a module to accommodate my needs. I could have made my own class to handle it (as DOM parsing really isn’t that hard), but I don’t have nearly the time I would need to take on such a project. Instead I found a module called ‘BeautifulSoup’. As I looked into this module, it seemed to be well-written and have full functionality. Through experience I found that this module is quite easy to use.
Onto The Code
OK, let’s start out with a simple HTML document:
<html>
<head>
<title>Test</title>
</head>
<body>
<span id="someid">This is some text.</span>
<span class="someclass">This is some other text.</span>
</body>
</html>
In this document we have two child elements under the <body> element. Now, let’s say we want to access the element with the id ‘someid’. First we need to parse the document like so (we will assume the variable ‘doc’ contains the HTML):
import BeautifulSoup
dom = BeautifulSoup.BeautifulSoup(doc)
Now we need to get the element. The Python object ‘dom’ contains the parsed document. There are some provided methods for searching the document tree. Here are some examples:
# Find the first element with the id 'someid' (all have the same result)
elm1 = dom.find(None, {"id":"someid"})
elm1 = dom.find(None, id="someid")
elm1 = dom.find("span", {"id":"someid"})  # Only searches 'span' tags
elm1 = dom.find("span", id="someid")      # Same as above
# Find all elements with the id 'someid'
elms1 = dom.findAll(None, id="someid")
# Find the first element with the class 'someclass'
elm2 = dom.find(None, {"class":"someclass"})
# Find all elements with the class 'someclass'
elms2 = dom.findAll(None, {"class":"someclass"})
# You cannot specify 'class' as a keyword argument, since it is reserved in Python.
# That is why the find methods allow a dictionary that specifies what to look for.
# Also, you may specify any of a 'class', 'id', and/or 'name' to look for.
elm1.nextSibling      # A reference to the next sibling node
elm2.previousSibling  # A reference to the previous sibling node
# If no whitespace text sits between the two spans, each of the above two lines
# refers to the other element; otherwise the sibling is the whitespace text node.
# Now, as it is a document _tree_ (each element references others), you can daisy-chain.
# These will just lead back to the same element that elm1 referenced to begin with:
elm1.nextSibling.parent.find(None, id="someid")
elm1.parent.first()
# Now, of course you can do more than just walk the tree.
# Print all text contained in the element and all child elements:
print(elm1.text)
# Print all raw HTML contained in the element:
print(elm1.renderContents())
You can do more, such as looking through an element’s attributes manually, but this gives a basic idea of how to use the module for your needs. To parse XML, instead of HTML or XHTML, you will want to parse the document with `BeautifulSoup.BeautifulStoneSoup("…")`. I hope you found this helpful. Good luck in your own DOM parsing.
BeautifulSoup documentation – http://www.crummy.com/software/BeautifulSoup/documentation.html
What is DHTML?
DHTML may sound like a language but it’s not. DHTML is a term for making web pages dynamic and interactive, by combining the power of HTML, JavaScript, DOM and CSS.
Differences between HTML and XHTML
XHTML stands for EXtensible HyperText Markup Language, which is a combination of HTML and XML. In early 2000 XHTML 1.0 became a W3C Recommendation. If you are still using regular HTML, it’s time to change. XHTML isn’t just the future, it’s the now, and HTML is phasing out. But don’t worry too much, the change won’t be painful.
XHTML is a lot like HTML, but more normalized. It forces you to abide by its rules, which is a great thing when you are programming! It clearly defines what is OK and what is not. It makes HTML look rather chaotic.
As a result it works in all internet browsers, whereas bad or poorly written HTML may not, especially on mobile phones or other small devices.
So let’s get to it and demonstrate some of the differences between HTML and XHTML.
- XML requires things to be marked up correctly, in well-formed documents.
- XML describes data, while HTML displays data.
- XHTML elements must be properly nested.
- XHTML elements must be closed even if they are empty.
- XHTML element and attribute names must be written in lowercase.
- XHTML documents can only have one root element.
- Attribute values must be in quotes.
- Attribute minimization is forbidden, in other words attributes must define a value even if it’s an empty string.
- The id attribute is now used in place of the name attribute.
- In XHTML there are some required elements.
The following table contains examples of what works in XHTML and what does not.
| Does not work | Works |
|---|---|
| `<b><u> some bold and underlined text </b></u>` | `<b><u>some bold and underlined text</u></b>` |
| `<h1>Header Text <p>Paragraph text` | `<h1>Header Text</h1> <p>Paragraph text</p>` |
| `<br>` | `<br />` |
| `<P>Some text here</P>` | `<p>Some text here</p>` |
| `<head> <title></title> </head> <body></body>` | `<html> <head> <title></title> </head> <body></body> </html>` |
| `<a HREF="..."></a>` | `<a href="..."></a>` |
| `<table width=80%>` | `<table width="80%">` |
| `<option selected>` | `<option selected="selected">` |
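Several of these rules are simply XML well-formedness rules, so you can check them with any XML parser. The following is a rough sketch using Python’s standard xml.etree.ElementTree module; the is_well_formed helper and the sample fragments are assumptions made for illustration. It only tests well-formedness, so it will not catch XHTML-specific rules such as lowercase names (`<P>…</P>` is still well-formed XML).

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Each "bad" fragment is followed by its corrected form.
samples = [
    "<b><u>text</b></u>",                     # improperly nested
    "<b><u>text</u></b>",
    "<p>line one<br>line two</p>",            # empty element left unclosed
    "<p>line one<br />line two</p>",
    "<table width=80%></table>",              # unquoted attribute value
    '<table width="80%"></table>',
    "<option selected></option>",             # minimized attribute
    '<option selected="selected"></option>',
]

for sample in samples:
    print(is_well_formed(sample), sample)
```

The fragments alternate between rejected and accepted, matching the two columns of the table.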
The XHTML DTD (Document Type Definitions)
All XHTML documents have three main parts:
- DOCTYPE declaration
- <head>
- <body>
The DOCTYPE must be defined before anything else in the document.
Everything but the DOCTYPE declaration will look like HTML; XHTML just holds you to a few rules. That’s the beauty of XHTML!
There are three types of DTDs:
- STRICT – It forces you to use clean markup when writing your web pages. You use CSS to define your presentation.
- TRANSITIONAL – This allows you to mix regular HTML features with XHTML. This one seems to be the most commonly used on newer sites, probably because it’s flexible and compatible.
- FRAMESET – This allows you to use HTML frames.
These are some examples of implementing each DTD.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
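If you ever need to find out which of the three DTDs a document declares, the DOCTYPE can be read programmatically. This is a small sketch with Python’s standard xml.dom.minidom; the dtd_flavour helper and the sample page are hypothetical, and the check is a plain substring match on the public identifier rather than any official API.

```python
from xml.dom import minidom

# Hypothetical XHTML page declaring the Strict DTD.
XHTML = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Test</title></head><body></body></html>"
)


def dtd_flavour(markup):
    """Return 'Strict', 'Transitional', or 'Frameset', or None if unknown."""
    doctype = minidom.parseString(markup).doctype
    if doctype is None or doctype.publicId is None:
        return None
    for flavour in ("Strict", "Transitional", "Frameset"):
        if flavour in doctype.publicId:
            return flavour
    return None


print(dtd_flavour(XHTML))
```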
Calling JavaScript when a webpage loads
There are a couple of opportunities for calling a JavaScript function or running JavaScript code when a web page loads in a person’s web browser.
One method is to run a script using the body tag’s onload event. Here are a couple of examples:
<!-- Call a function --> <body onload="sayHello();">
<!-- Run multiple commands --> <body onload="var hello_world='Hello World!'; alert(hello_world);">
Or you can take advantage of the window.onload event. Here are a couple of examples:
<script type="text/javascript">
// Call a function with the window onload event.
// Assign the function itself (no parentheses). Writing sayHello() here would
// run it immediately and assign its return value instead.
window.onload = sayHello;
function sayHello(){
    alert("Hello World!");
}
</script>
<script type="text/javascript">
window.onload = function(){
    alert("Hello World!");
};
</script>
Allowing users to log in using HTML and PHP
A form is an area of a web page where your website’s visitors can enter information and submit it. They might fill out a contact form, or use a small form to log in to your system. They come in handy and in many cases are a necessity. In this tutorial we will kill two birds with one stone and show you how to make a login form (I promised a login tutorial in a previous tutorial).
First, the basics. One input type we won’t cover here is file, but you can learn about it here. There are several different types of fields or tags that are commonly used in a form:
- <input>
- <textarea>
- <select>
An input tag can have different types, such as:
- text
- password
- hidden
- file
- submit
- checkbox
- radio
Every form opens and closes with a form tag.
<form>
    <!-- the form elements go here -->
</form>
We won’t go into detail on each type of form element. Rather we will focus on three types of inputs that we will use in our login example: text, password, and submit.
Start by creating a new file called login.php. Copy and paste the following code inside.
<form action="login.php" method="post">
    <label>Username</label>
    <input type="text" name="username" value="" />
    <br />
    <label>Password</label>
    <input type="password" name="password" value="" />
    <br />
    <input type="submit" name="submit" value="Login" />
</form>
This is our form. It contains a text field, a password field, and a submit button that, when clicked, sends the data off to be validated. The form tag has two attributes: action and method. Our action is the page we want to send the form data to, and the method is how the information is sent. Sending via the post method places the data in the body of the request rather than in the URL, whereas sending via get passes the data as a query string, visible in the URL. In this example we use post because we don’t want to show the user’s credentials in the URL. Note that post only keeps the data out of the URL; it does not encrypt it.
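To see the difference between the two methods without opening a browser, here is a small sketch using Python’s standard urllib. It only builds the two requests and never sends them; the example.com address is a placeholder assumption.

```python
from urllib.parse import urlencode
from urllib.request import Request

fields = {"username": "apple", "password": "dumpling"}
encoded = urlencode(fields)

# GET: the form data becomes part of the URL itself.
get_request = Request("http://example.com/login.php?" + encoded)

# POST: the URL stays clean and the data travels in the request body.
post_request = Request("http://example.com/login.php", data=encoded.encode("ascii"))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

The same encoded string appears in both cases; only its location changes, which is why get leaves the credentials visible in the address bar and in browser history.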
The next part is to capture the data that has been posted, and then we’ll validate it. Copy and paste the following PHP code above your HTML form. Your file should now look like this:
<?php
// Check for user login
if (isset($_POST['submit'])) {
    // User is attempting login
    // Read the submitted values, defaulting to empty strings if a field is missing
    $username = isset($_POST['username']) ? $_POST['username'] : "";
    $password = isset($_POST['password']) ? $_POST['password'] : "";
    // Verify the credentials are correct
    if ($username === "apple" && $password === "dumpling") {
        // The username and password provided are correct!
        echo ("You have logged in successfully!");
        exit;
    } else {
        echo ("Whoops! You entered the wrong username or password.");
    }
}
?>
<form action="login.php" method="post">
    <label>Username</label>
    <input type="text" name="username" value="" />
    <br />
    <label>Password</label>
    <input type="password" name="password" value="" />
    <br />
    <input type="submit" name="submit" value="Login" />
</form>
Run the page and try it out! Entering anything but a combination of “apple” and “dumpling” will result in failure. But entering that combination correctly will give you a thumbs up.
At this point you might want to store the user’s id in a session, and as the user moves from page to page you can check for the id. If it exists then you know they have logged in and can access secure parts of your site. You can learn about sessions here.
AJAX and PHP
AJAX in a word… SWEET! I love it. AJAX stands for Asynchronous JavaScript and XML. Basically it allows your web page to load content AFTER the web page has already loaded. This comes in handy when you want to perform actions on a web page or change content but don’t want to reload the entire page to do it.
Let’s take an example. You have a dating website with loads of users. Users can view other users’ profile pages. On the profile page it shows a user’s information, including whether that user is online right now or not. Wouldn’t it be cool if you could keep the page open for a while and the user you are looking at logs on and at that moment it notifies you, all without you having to reload the page? You can accomplish this with the help of some AJAX.
Let’s start by looking at an HTML code snippet you might include in the User Profile page.
<div id="online_status"></div>
This is the container for the string telling us the user’s online status. But wait… the div doesn’t have anything in it… yet. This is where our AJAX comes in. Add the following below your div.
<script type="text/javascript">
function getOnlineStatus(user_id){
    // Make the AJAX call, passing the user ID along in the query string
    ajax("get-online-status.php?user_id=" + encodeURIComponent(user_id), handleOnlineStatusRequest);
    // Javascript code continues, even while the AJAX call is being made.
    // Something could execute here ...
    // and here ...
    // ... all while we wait for a response
}
function handleOnlineStatusRequest(response_text){
    // This function is called by our 'ajax' function when the AJAX request has completed.
    // 'response_text' contains any text that was sent back to us from the 'get-online-status.php' page.
    // Fill in the 'online_status' div with the response text from our AJAX call
    document.getElementById('online_status').innerHTML = response_text;
    // Set the next timeout. We will check every 10 seconds for a change in the user's status
    setTimeout(function(){ getOnlineStatus(1); }, 10000);
}
// Initialize the AJAX object
function initXmlHttp(){
    // Standard Initialization
    var tmpxmlHttp;
    try {
        // Firefox, Opera 8.0+, Safari
        tmpxmlHttp = new XMLHttpRequest();
    } catch (e) {
        // Internet Explorer
        try {
            tmpxmlHttp = new ActiveXObject("Msxml2.XMLHTTP");
        } catch (e) {
            try {
                tmpxmlHttp = new ActiveXObject("Microsoft.XMLHTTP");
            } catch (e) {
                alert("Your browser does not support AJAX!");
                return false;
            }
        }
    }
    return tmpxmlHttp;
}
// Our AJAX function
function ajax(url, responseHandler){
    // Create our AJAXy object
    var xmlHttp = initXmlHttp();
    if (!xmlHttp) {
        // The browser could not create a request object, so there is nothing to send
        return;
    }
    // Handle changes to the object's state
    xmlHttp.onreadystatechange = function(){
        // Is our request complete?
        if (xmlHttp.readyState == 4) {
            // Request is complete (identified by a value of 4) so pass the response text into our handler function
            responseHandler(xmlHttp.responseText);
        }
    };
    // Open a connection
    xmlHttp.open("GET", url, true);
    // Make the request
    xmlHttp.send(null);
}
// Make a request for the user's status
getOnlineStatus(1);
</script>
Let’s briefly go over the functions in the order they are called:
- getOnlineStatus – We pass in an ID of 1 to get the status for user 1.
- ajax – This function takes a url to our backend file, written in PHP, and the function that will be called when the request has been completed.
- initXmlHttp – This simply creates an object for us that will perform the AJAX call.
- handleOnlineStatusRequest – This function is passed the response text from our back end PHP file and places the response text into our online status div.
Now the final piece is the backend script. In this example we will be using PHP. Create a file called get-online-status.php in the same directory where your HTML file is stored. Copy and paste the following code inside it.
<?php
// Get the user ID that was handed to us (0 if none was sent)
$user_id = isset($_GET['user_id']) ? (int) $_GET['user_id'] : 0;
// We'll say the user is online
$online_status = 1;
$online_status_text = $online_status == 1 ? "Online!" : "Offline";
// Output the result text. Anything we output here will be sent back to the requesting page
echo ($online_status_text);
?>
Now our example is complete. This PHP file simply outputs the response text we want; in this case it outputs the string “Online!”. Our JavaScript then takes that string and places it inside our div, letting the user viewing the profile know that the user they are viewing is signed in. The JavaScript will check the status every 10 seconds.
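The backend logic is not tied to PHP. As a rough sketch of the same idea in Python, the function below pulls user_id out of a request URL’s query string and returns the text to send back. The ONLINE_USER_IDS set is a hypothetical stand-in for whatever lookup a real site would do, and online_status_text is an illustrative name, not part of any framework.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical stand-in for a database lookup of who is signed in.
ONLINE_USER_IDS = {1}


def online_status_text(request_url):
    """Return the text the status endpoint would send back for this URL."""
    query = parse_qs(urlparse(request_url).query)
    raw_id = query.get("user_id", [""])[0]
    try:
        user_id = int(raw_id)
    except ValueError:
        # A missing or non-numeric ID is treated as an offline user.
        return "Offline"
    return "Online!" if user_id in ONLINE_USER_IDS else "Offline"


print(online_status_text("get-online-status.php?user_id=1"))
print(online_status_text("get-online-status.php?user_id=2"))
```

Treating a missing or malformed ID as offline keeps the endpoint from failing when the page is requested without a query string.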
Pretty cool huh! For fun, open the web page, and while it is open set the online status to 0 in your PHP file and save. Check the webpage you have open and within 10 seconds it should change the status to “Offline”.
How to upload a file using PHP
One of the cool things about server side scripting is the ability to upload files from a user’s computer to your web server. In this tutorial I will demonstrate PHP’s ability to capture uploaded files and save them on the web server. Our sample takes a user’s photo and saves it to a folder on the server where other users can view it on the web.
You can download the code here.
Create a file with a name of your choice and put the following code into it.
<form action="upload.php" method="post" enctype="multipart/form-data">
    <div>Upload your photo:</div>
    <!-- PHP only honors this field when it is named MAX_FILE_SIZE and placed before the file input -->
    <input type="hidden" name="MAX_FILE_SIZE" value="8000000" />
    <input type="file" name="photo_file" />
    <input type="submit" name="submit" value="Upload" />
    <input type="hidden" name="user_id" value="1" />
</form>
This is the web page where the user can select the file they want to upload.
The first line has two parts I want to point out: the action attribute and the enctype attribute. The action’s value is the name of the PHP file that will handle the upload. The enctype’s value, multipart/form-data, tells the browser to send the form as separate parts, so that the file’s contents can travel alongside the ordinary inputs instead of everything being sent as the usual URL-encoded text.
The only required types of inputs for uploading a file are file and submit, but for this example I’ve added a hidden user_id that will be used as well.
Now, on to the juicy stuff… handling the upload on the server.
Create a file called upload.php. This file has the same name that we gave our form’s action attribute. When you are coding your own application you can use whatever file name you want.
Put the following code into your upload.php file:
<?php
// Make sure the uploaded file is an image of type JPG, GIF, or PNG and that it does not exceed 8000000 bytes (approx. 8 MB) in size.
$allowed_types = array("image/gif", "image/jpeg", "image/pjpeg", "image/png");
if (
    !empty($_FILES["photo_file"]["tmp_name"])
    && $_FILES["photo_file"]["size"] < 8000000
    && in_array($_FILES["photo_file"]["type"], $allowed_types)
) {
    // The file has already been saved to a temporary location on the server. Get the temp file's path.
    $file_path = $_FILES['photo_file']['tmp_name'];
    // Grab the User ID we sent from our form. Cast it to an integer so it is safe to use in a path.
    $user_id = (int) $_POST['user_id'];
    // Create the directory where our user photos will be stored, if it doesn't already exist.
    // We will give our directory name the ID of the user that uploaded the photo.
    $save_dir = "img/users/$user_id";
    if (!file_exists($save_dir)) {
        // Our directory does not already exist, so create it.
        mkdir($save_dir, 0755, true);
    }
    // Attempt to move the temp file to our user's folder.
    $save_file_path = $save_dir . "/" . basename($_FILES['photo_file']['name']);
    if (move_uploaded_file($file_path, $save_file_path)) {
        // Moving the file was a success!
        echo("<div>Your photo has been uploaded successfully!</div><div><img src=\"" . htmlspecialchars($save_file_path) . "\" /></div>");
    } else {
        // Moving the file failed. Prompt the user to try again.
        echo("There was an error uploading your file. Please try again.");
    }
}
?>
What does all this do? In a nutshell:
- Verifies that the uploaded image is of JPEG, GIF, or PNG type.
- Saves the file to a folder named after the User ID on the server.
- If successful the picture is displayed on the page for the user to see!
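The checks in that list can be expressed compactly outside PHP as well. Below is a minimal sketch in Python of the same decision: accept or reject an upload and work out where it would be saved. The plan_upload function and its parameters are hypothetical names chosen for illustration, and the sketch decides on a path only; it does not touch the file system.

```python
import os

ALLOWED_TYPES = {"image/gif", "image/jpeg", "image/pjpeg", "image/png"}
MAX_BYTES = 8000000  # same limit as the form, roughly 8 MB


def plan_upload(user_id, filename, content_type, size):
    """Return the save path for an acceptable upload, or None to reject it."""
    if content_type not in ALLOWED_TYPES or not 0 < size < MAX_BYTES:
        return None
    try:
        uid = int(user_id)
    except (TypeError, ValueError):
        return None
    # Keep only the final path component so "../" tricks cannot escape the folder.
    safe_name = os.path.basename(filename.replace("\\", "/"))
    if safe_name in ("", ".", ".."):
        return None
    return "img/users/{}/{}".format(uid, safe_name)


print(plan_upload(1, "me.png", "image/png", 1024))
print(plan_upload(1, "notes.txt", "text/plain", 1024))
```

Keep in mind that the content type is reported by the browser, so a production system would usually inspect the file contents too rather than trust that value alone.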
Fun stuff huh! In a later tutorial I will show you how to resize your uploaded images, creating thumbnails, medium sized, and large images.