DOM Parsing With Python
This guide assumes you have basic knowledge of Python and have done at least some work with HTML, XHTML, and/or XML.
Background
DOM stands for Document Object Model. It is a convention used in HTML, XHTML, and XML for representing and interacting with objects. As the name suggests, documents such as HTML pages consist of many elements that have relationships to other elements. For example, you may have a <span> element in your <body> element. The <span> element’s parent is the <body>. The <span> may have child elements and/or sibling elements. It works much like a family tree. The elements in an HTML document may have identifiers, specified by attributes like id="something", class="something", and/or name="something". You can use these identifiers to keep track of and find a specific element or list of elements. Once you have found the element(s) you are looking for, you can change them dynamically or read the information you need.
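The parent, child, and sibling relationships described above can be seen with nothing but Python’s standard library. The following is a minimal sketch using xml.dom.minidom; the sample markup and variable names are made up for illustration, and the markup is kept well-formed (with no whitespace between tags) so that the sibling links point directly at elements.

```python
from xml.dom import minidom

# Hypothetical sample markup, kept well-formed so an XML parser accepts it.
SAMPLE = '<body><span id="first">one</span><span class="note">two</span></body>'

dom = minidom.parseString(SAMPLE)
body = dom.documentElement                  # the root <body> element
spans = body.getElementsByTagName("span")   # both <span> children
first, second = spans[0], spans[1]

parent_name = first.parentNode.tagName                    # parent of the first <span>
sibling_class = first.nextSibling.getAttribute("class")   # class of its next sibling
child_count = len(body.childNodes)                        # number of children of <body>

print(parent_name, sibling_class, child_count)
```

Here the two <span> elements are siblings of each other, and both report <body> as their parent.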
Let’s Try Some Beautiful Soup
As I found the need to parse HTML documents a little while ago, I went in search of a module to accommodate my needs. I could have made my own class to handle it (as DOM parsing really isn’t that hard), but I don’t have nearly the time I would need to take on such a project. Instead I found a module called ‘BeautifulSoup’. As I looked into this module, it seemed to be well-written and have full functionality. Through experience I found that this module is quite easy to use.
Onto The Code
OK, let’s start out with a simple HTML document:
<html>
<head>
<title>Test</title>
</head>
<body>
<span id="someid">This is some text.</span>
<span class="someclass">This is some other text.</span>
</body>
</html>
In this document we have two child elements under the <body> element. Now, let’s say we want to access the element with the id ‘someid’. First we need to parse the document like so (we will assume the variable ‘doc’ contains the HTML):
import BeautifulSoup
dom = BeautifulSoup.BeautifulSoup(doc)
Now we need to get the element. The Python object ‘dom’ contains the parsed document. The module provides several methods for searching the document tree. Here are some examples:
# Find the first element with the id 'someid' (all have the same result)
elm1 = dom.find(None, {"id":"someid"})
elm1 = dom.find(None, id="someid")
elm1 = dom.find("span", {"id":"someid"}) # Only searches 'span' tags
elm1 = dom.find("span", id="someid") # Same as above

# Find all elements with the id 'someid'
elms1 = dom.findAll(None, id="someid")

# Find the first element with the class 'someclass'
elm2 = dom.find(None, {"class":"someclass"})

# Find all elements with the class 'someclass'
elms2 = dom.findAll(None, {"class":"someclass"})

# You cannot specify 'class' as a keyword argument, since it is reserved in python.
# That is why the find methods allow a dictionary that specifies what to look for.
# Also, you may specify any of a 'class', 'id', and/or 'name' to look for.

elm1.nextSibling # A reference to the next sibling element
elm2.previousSibling # A reference to the previous sibling element
# The above two lines are references to each other.

# Now, as it is a document _tree_ (each element references others), you can daisy-chain
# These will just lead back to the same element that elm1 referenced to begin with:
elm1.nextSibling.parent.find(None, id="someid")
elm1.parent.first()

# Now, of course you can do more than just walk the tree.
# Print all text contained in the element and all child elements:
print elm1.text
# Print all raw HTML contained in the element:
print elm1.renderContents()
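To make the reason for the dictionary argument more concrete, here is a rough standard-library sketch of an attribute search. It is not BeautifulSoup’s implementation; the AttrFinder class and the find_all helper are hypothetical names invented for this illustration, built on html.parser from Python 3. It accepts filters both as a dictionary and as keyword arguments, which shows why ‘class’ has to go through the dictionary.

```python
from html.parser import HTMLParser


class AttrFinder(HTMLParser):
    """Collect (tag, attributes) pairs for start tags that match a filter."""

    def __init__(self, tag=None, wanted=None):
        super().__init__()
        self.tag = tag              # tag name to match, or None for any tag
        self.wanted = wanted or {}  # attribute name -> required value
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.tag is not None and tag != self.tag:
            return
        if all(attrs.get(key) == value for key, value in self.wanted.items()):
            self.matches.append((tag, attrs))


def find_all(doc, tag=None, attrs=None, **kwargs):
    """Hypothetical helper: merge dictionary and keyword filters, then search."""
    finder = AttrFinder(tag, dict(attrs or {}, **kwargs))
    finder.feed(doc)
    finder.close()
    return finder.matches


doc = ('<body><span id="someid">This is some text.</span>'
       '<span class="someclass">This is some other text.</span></body>')

by_id = find_all(doc, "span", id="someid")
# find_all(doc, None, class="someclass") would be a SyntaxError,
# because 'class' is a reserved word, so the dictionary form is used:
by_class = find_all(doc, None, {"class": "someclass"})

print(by_id)
print(by_class)
```

This sketch only matches exact attribute values and returns plain tuples rather than tree nodes, so it cannot walk to parents or siblings the way the parsed ‘dom’ object can.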
Of course you can do more, such as manually looking through the attributes of an element, but this gives a basic idea of how to use the module for your needs. To parse XML, instead of HTML or XHTML, you will want to parse the document with `BeautifulSoup.BeautifulStoneSoup("…")`. I hope you found this helpful. Good luck in your own DOM parsing.
BeautifulSoup documentation – http://www.crummy.com/software/BeautifulSoup/documentation.html
What is DHTML?
DHTML may sound like a language but it’s not. DHTML is a term for making web pages dynamic and interactive, by combining the power of HTML, JavaScript, DOM and CSS.
Differences between HTML and XHTML
XHTML stands for EXtensible HyperText Markup Language, which is a combination of HTML and XML. In early 2000 XHTML 1.0 became a W3C Recommendation. If you are still using regular HTML, it’s time to change. XHTML isn’t just the future, it’s the now, and HTML is phasing out. But don’t worry too much, the change won’t be painful.
XHTML is a lot like HTML, but more normalized. It forces you to abide by its rules, which is a great thing when you are programming! It clearly defines what is OK and what is not. It makes HTML look rather chaotic.
As a result, it tends to behave consistently across web browsers, whereas poorly written HTML may not, especially on mobile phones or other small devices.
So let’s get to it and demonstrate some of the differences between HTML and XHTML.
- XML requires things to be marked up correctly, in well-formed documents.
- XML describes data, while HTML displays data.
- XHTML elements must be properly nested.
- XHTML elements must be closed even if they are empty.
- XHTML element and attribute names must be written in lowercase.
- XHTML documents can only have one root element.
- Attribute values must be in quotes.
- Attribute minimization is forbidden, in other words attributes must define a value even if it’s an empty string.
- The id attribute is now used in place of the name attribute to identify elements such as <a>, <form>, <img>, and <map>.
- In XHTML there are some required elements.
The following table contains examples of what works in XHTML and what does not.
| Does not work | Works |
|---|---|
| `<b><u> some bold and underlined text </b></u>` | `<b><u>some bold and underlined text</u></b>` |
| `<h1>Header Text <p>Paragraph text` | `<h1>Header Text</h1> <p>Paragraph text</p>` |
| `<br>` | `<br />` |
| `<P>Some text here</P>` | `<p>Some text here</p>` |
| `<head> <title></title> </head> <body></body>` | `<html> <head> <title></title> </head> <body></body> </html>` |
| `<a HREF="..."></a>` | `<a href="..."></a>` |
| `<table width=80%>` | `<table width="80%">` |
| `<option selected>` | `<option selected="selected">` |
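Several of the rules in the table are really XML well-formedness rules, so an XML parser can check them. The sketch below uses Python’s xml.etree.ElementTree; the is_well_formed helper is a name chosen for this example, and the snippets are adapted from the table, wrapped in a parent element where needed so that each has a single root.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# (does not work, works) pairs adapted from the table
pairs = [
    ("<p><b><u>text</b></u></p>", "<p><b><u>text</u></b></p>"),
    ("<p>line<br></p>", "<p>line<br /></p>"),
    ("<table width=80%></table>", '<table width="80%"></table>'),
    ("<select><option selected></option></select>",
     '<select><option selected="selected"></option></select>'),
]
results = [(is_well_formed(bad), is_well_formed(good)) for bad, good in pairs]

# Uppercase tag names are still well-formed XML; the lowercase rule comes
# from XHTML itself, so this kind of check cannot catch it.
uppercase_ok = is_well_formed("<P>Some text here</P>")

print(results, uppercase_ok)
```

A check like this only covers nesting, closing, quoting, and minimization. Rules such as lowercase names and required elements need validation against the XHTML DTD.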
The XHTML DTD (Document Type Definition)
All XHTML documents have three main parts:
- DOCTYPE declaration
- <head>
- <body>
The DOCTYPE must be defined before anything else in the document.
Everything but the DOCTYPE declaration will look like HTML; XHTML just holds you to a few rules. That’s the beauty of XHTML!
There are three types of DTDs:
- STRICT – It forces you to use clean markup when writing your web pages. You use CSS to define your presentation.
- TRANSITIONAL – This allows you to mix regular HTML features with XHTML. This one seems to be the most commonly used on newer sites, probably because it’s flexible and compatible.
- FRAMESET – This allows you to use HTML frames.
These are some examples of implementing each DTD.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
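As a rough illustration, a script can tell which of the three DTDs a page declares by looking at the public identifier in its DOCTYPE. The sketch below uses a regular expression from Python’s re module; the xhtml_flavor function and the sample page are hypothetical, and the pattern assumes the declaration is written as in the examples above. It tolerates an optional XML declaration in front of the DOCTYPE but otherwise expects the DOCTYPE to come first.

```python
import re

XHTML_DOCTYPE = re.compile(
    r'\s*(?:<\?xml[^>]*\?>\s*)?'          # optional XML declaration
    r'<!DOCTYPE\s+html\s+PUBLIC\s+'
    r'"-//W3C//DTD XHTML 1\.0 (Strict|Transitional|Frameset)//EN"'
)


def xhtml_flavor(document):
    """Return 'Strict', 'Transitional', 'Frameset', or None if not declared first."""
    match = XHTML_DOCTYPE.match(document)
    return match.group(1) if match else None


# Hypothetical sample page for illustration
strict_page = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html><head><title></title></head><body></body></html>'
)

print(xhtml_flavor(strict_page))
print(xhtml_flavor('<html><head><title></title></head><body></body></html>'))
```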
Calling JavaScript when a webpage loads
There are a couple of ways to call a JavaScript function or run JavaScript code when a web page loads in a person’s web browser.
One method is to run a script using the body tag’s onload event. Here are a couple of examples:
<!-- Call a function -->
<body onload="sayHello();">
<!-- Run multiple commands -->
<body onload="var hello_world='Hello World!'; alert(hello_world);">
Or you can take advantage of the window.onload event. Here are a couple of examples:
<script type="text/javascript">
// Call a function with the window onload event.
// Assign the function itself (no parentheses) so it runs when the page loads.
window.onload = sayHello;
function sayHello(){
  alert("Hello World!");
}
</script>
<script type="text/javascript">
window.onload = function(){
  alert("Hello World!");
};
</script>
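The window.onload property expects a function, not the result of calling one. The same distinction exists in Python, so here is a small sketch of it; say_hello and the calls list are stand-ins invented for this example and have nothing to do with a real browser.

```python
calls = []


def say_hello():
    """Stand-in for the JavaScript sayHello(); records that it ran."""
    calls.append("Hello World!")


# Referencing the function stores it as a handler; nothing runs yet.
handler = say_hello
ran_on_assignment = len(calls)

# Later the "event" fires and the stored handler is invoked.
handler()
ran_after_event = len(calls)

# Calling the function inside the assignment runs it immediately and
# stores its return value (None here) instead of the function.
not_a_handler = say_hello()

print(ran_on_assignment, ran_after_event, not_a_handler)
```

In JavaScript terms, writing the function name with parentheses in the assignment runs it before the page has finished loading and leaves window.onload without a handler.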
Allowing users to log in using HTML and PHP
A form is an area of a web page where your website’s visitors can enter information and submit it. They might fill out a contact form, or use a small form to log in to your system. Forms come in handy and in many cases are a necessity. In this tutorial we will kill two birds with one stone and show you how to make a login form (I promised a login tutorial in a previous tutorial).
First, the basics. One type of field, which we won’t cover in this tutorial, is the input tag with a type of file, but you can learn about it here. Several different tags are commonly used in a form:
- <input>
- <textarea>
- <select>
An input tag can have different types, such as:
- text
- password
- hidden
- file
- submit
- checkbox
- radio
Every form opens and closes with a form tag.
<form>
    <!-- the form elements go here -->
</form>
We won’t go into detail on each type of form element. Rather we will focus on three types of inputs that we will use in our login example: text, password, and submit.
Start by creating a new file called login.php. Copy and paste the following code inside.
<form action="login.php" method="post">
    <label>Username</label>
    <input type="text" name="username" value="" />
    <br />
    <label>Password</label>
    <input type="password" name="password" value="" />
    <br />
    <input type="submit" name="submit" value="Login" />
</form>
This is our form. It contains a text field, a password field, and a submit button that, when clicked, sends the data off to be validated. The form tag has two attributes: action and method. The action is the page we want to send the form data to, and the method is how the information is sent. Sending via the post method “hides” the information by placing it in the body of the request, whereas sending via get passes the data as a query string, visible in the URL. In this example we use post because we don’t want to show the user’s credentials in the URL.
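The difference between the two methods can be sketched in Python with urllib, which builds the requests without sending anything. The host example.com is a placeholder, and the field values are the sample credentials used in this tutorial.

```python
from urllib.parse import urlencode
from urllib.request import Request

# The three named fields the login form submits
fields = {"username": "apple", "password": "dumpling", "submit": "Login"}
encoded = urlencode(fields)

# method="get": the form data is appended to the URL as a query string.
get_request = Request("http://example.com/login.php?" + encoded)

# method="post": the same data travels in the request body instead.
post_request = Request("http://example.com/login.php",
                       data=encoded.encode("ascii"))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

Note that post only keeps the data out of the URL. The body is still sent as plain text unless the page is served over HTTPS.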
The next part is to capture the data that has been posted, and then we’ll validate it. Copy and paste the following PHP code above your HTML form. Your file should now look like this:
<?php
// Check for user login
if(isset($_POST['submit'])){
    // User is attempting login
    // Verify the credentials are correct
    if($_POST['username'] == "apple" && $_POST['password'] == "dumpling"){
        // The username and password provided are correct!
        echo ("You have logged in successfully!");
        exit;
    } else {
        echo ("Woops! You entered the wrong username and password.");
    }
}
?>
<form action="login.php" method="post">
    <label>Username</label>
    <input type="text" name="username" value="" />
    <br />
    <label>Password</label>
    <input type="password" name="password" value="" />
    <br />
    <input type="submit" name="submit" value="Login" />
</form>
Run the page and try it out! Entering anything but a combination of “apple” and “dumpling” will result in failure. But entering that combination correctly will give you a thumbs up.
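For readers coming from the Python side of this guide, the same decision logic can be sketched as a plain function. This is only an illustration of the flow in the PHP above, not a replacement for it; check_login and the form dictionary are hypothetical names standing in for PHP’s $_POST handling.

```python
EXPECTED_USERNAME = "apple"
EXPECTED_PASSWORD = "dumpling"


def check_login(form):
    """Return a message for the submitted form, or None if nothing was posted."""
    if "submit" not in form:
        return None  # the form has not been submitted yet
    if (form.get("username") == EXPECTED_USERNAME
            and form.get("password") == EXPECTED_PASSWORD):
        return "You have logged in successfully!"
    return "Woops! You entered the wrong username and password."


print(check_login({"username": "apple", "password": "dumpling", "submit": "Login"}))
print(check_login({"username": "apple", "password": "pie", "submit": "Login"}))
```

Hard-coded credentials like these are fine for trying out the form, but a real site would look the user up in a data store and compare against a hashed password.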
At this point you might want to store the user’s id in a session, and as the user moves from page to page you can check for the id. If it exists then you know they have logged in and can access secure parts of your site. You can learn about sessions here.
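The idea behind a session can be sketched in a few lines of Python. This is a conceptual, in-memory stand-in and not how PHP sessions are implemented; the sessions dictionary and both function names are assumptions made for this example. After a successful login the server hands the browser a random token, and on later pages it looks the token up to find the user’s id.

```python
import secrets

# In-memory stand-in for a session store: token -> user id
sessions = {}


def start_session(user_id):
    """Create a random session token for a logged-in user and remember it."""
    token = secrets.token_hex(16)
    sessions[token] = user_id
    return token


def current_user(token):
    """Return the user id for a token, or None if the visitor is not logged in."""
    return sessions.get(token)


token = start_session(42)           # 42 is a made-up user id
print(current_user(token))
print(current_user("not-a-real-token"))
```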