XHTML

DOM Parsing With Python

This guide assumes you have basic knowledge of Python and have done at least some work with HTML, XHTML, and/or XML.

Background

DOM stands for Document Object Model. It is a convention used in HTML, XHTML, and XML for representing a document as a tree of objects and interacting with those objects. As the name suggests, a document such as an HTML page has many elements, each related to other elements. For example, you may have a <span> element in your <body> element. The <span> element's parent is the <body>. The <span> may have child elements and/or sibling elements. It works much like a family tree. The elements in an HTML document may have identifiers, specified by attributes like id='something', class='something', and/or name='something'. You can use these identifiers to keep track of and find a specific element or list of elements. Once you have found the element(s) you are looking for, you can change them dynamically or read the information you need.
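The family relationships described above can be sketched with Python's standard library. This is only an illustration: the markup and the ids 'first' and 'second' are made up for the example, and it uses xml.dom.minidom rather than the module discussed below.

```python
from xml.dom import minidom

# Hypothetical sample markup, written as well-formed XML so minidom can parse it.
# There is no whitespace between the tags, so the only children are the two spans.
MARKUP = '<body><span id="first">one</span><span id="second">two</span></body>'

doc = minidom.parseString(MARKUP)
body = doc.documentElement
first, second = body.getElementsByTagName("span")

print(first.parentNode.tagName)                      # the parent of the first <span>
print(first.nextSibling.getAttribute("id"))          # the sibling that follows it
print(second.previousSibling is first)               # siblings point back at each other
print([child.tagName for child in body.childNodes])  # the children of <body>
```

Every node keeps references to its parent, children, and neighbours, which is what makes walking the tree possible.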

Let's Try Some Beautiful Soup

As I found the need to parse HTML documents a little while ago, I went in search of a module to accommodate my needs. I could have made my own class to handle it (as DOM parsing really isn’t that hard), but I don’t have nearly the time I would need to take on such a project. Instead I found a module called ‘BeautifulSoup’. As I looked into this module, it seemed to be well-written and have full functionality. Through experience I found that this module is quite easy to use.

Onto The Code

OK, let's start out with a simple HTML document:

<html>
<head>
    <title>Test</title>
</head>
<body>
    <span id="someid">This is some text.</span>
    <span class="someclass">This is some other text.</span>
</body>
</html>

In this document we have two child elements under the <body> element. Now, let's say we want to access the element with the id 'someid'. First we need to parse the document like so (we will assume the variable 'doc' contains the HTML):

import BeautifulSoup
dom = BeautifulSoup.BeautifulSoup(doc)

Now we need to get the element. The Python object 'dom' contains the parsed document, and it provides methods for searching the document tree. Here are some examples:

# Find the first element with the id 'someid' (all have the same result)
elm1 = dom.find(None, {"id":"someid"})
elm1 = dom.find(None, id="someid")
elm1 = dom.find("span", {"id":"someid"}) # Only searches 'span' tags
elm1 = dom.find("span", id="someid") # Same as above
 
# Find all elements with the id 'someid'
elms1 = dom.findAll(None, id="someid")
 
# Find the first element with the class 'someclass'
elm2 = dom.find(None, {"class":"someclass"})
 
# Find all elements with the class 'someclass'
elms2 = dom.findAll(None, {"class":"someclass"})
 
# You cannot specify 'class' as a keyword argument, since it is a reserved word in Python.
# That is why the find methods also accept a dictionary that specifies what to look for.
# The dictionary may name any attribute, such as 'class', 'id', and/or 'name'.
 
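elm1.nextSibling # The next sibling node (here the whitespace text between the tags)
elm2.previousSibling # The previous sibling node (the same whitespace text)
 
# To skip text nodes and get the neighbouring tag, use
# elm1.findNextSibling("span") or elm2.findPreviousSibling("span").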
 
# Now, as it is a document _tree_ (each element references others), you can daisy-chain lookups.
# These both lead back to the same element that elm1 referenced to begin with:
elm1.nextSibling.parent.find(None, id="someid")
elm1.parent.first()
 
# Now, of course you can do more than just walk the tree.
 
# Print all text contained in the element and all child elements:
print(elm1.text)
 
# Print all raw HTML contained in the element:
print(elm1.renderContents())

Of course, you can do more, such as looking through an element's attributes by hand, but this gives a basic idea of how to use the module for your needs. To parse XML instead of HTML or XHTML, parse the document with `BeautifulSoup.BeautifulStoneSoup("…")`. I hope you found this helpful. Good luck in your own DOM parsing.
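To give a feel for what an attribute search and a look through an element's attributes involve, here is a minimal sketch that uses only Python's standard library. It is not how BeautifulSoup is implemented: find_all is a hypothetical helper written for this example, and xml.dom.minidom only accepts well-formed markup.

```python
from xml.dom import minidom

# The sample page from above, with the first <span> closed so it is well-formed.
DOC = """<html>
<head>
    <title>Test</title>
</head>
<body>
    <span id="someid">This is some text.</span>
    <span class="someclass">This is some other text.</span>
</body>
</html>"""


def find_all(node, wanted):
    """Return every element under `node` whose attributes include all of `wanted`."""
    matches = []
    for child in node.childNodes:
        if child.nodeType != child.ELEMENT_NODE:
            continue  # skip text nodes such as the whitespace between tags
        if all(child.getAttribute(key) == value for key, value in wanted.items()):
            matches.append(child)
        matches.extend(find_all(child, wanted))  # descend into the children
    return matches


dom = minidom.parseString(DOC)
for elm in find_all(dom, {"class": "someclass"}):
    # elm.attributes maps attribute names to their values
    print(elm.tagName, dict(elm.attributes.items()))
```

Passing the criteria as a dictionary sidesteps the reserved-word problem with 'class', the same reason the find methods above accept one.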

BeautifulSoup documentation – http://www.crummy.com/software/BeautifulSoup/documentation.html


What is DHTML?

DHTML may sound like a language but it’s not. DHTML is a term for making web pages dynamic and interactive, by combining the power of HTML, JavaScript, DOM and CSS.


Differences between HTML and XHTML

XHTML stands for EXtensible HyperText Markup Language, which is a combination of HTML and XML. In early 2000 XHTML 1.0 became a W3C Recommendation. If you are still using regular HTML, it's time to change. XHTML isn't just the future, it's the now, and HTML is phasing out. But don't worry too much, the change won't be painful.

XHTML is a lot like HTML, but more normalized. It forces you to abide by its rules, which is a great thing when you are programming! It clearly defines what is OK and what is not. It makes HTML look rather chaotic.

As a result, well-formed XHTML is handled consistently across web browsers, whereas poorly written HTML may not be, especially on mobile phones or other small devices.

So let’s get to it and demonstrate some of the differences between HTML and XHTML.

  • XML requires things to be marked up correctly, in well-formed documents.
  • XML describes data, while HTML displays data.
  • XHTML elements must be properly nested.
  • XHTML elements must be closed even if they are empty.
  • XHTML element and attribute names must be written in lowercase.
  • XHTML documents can only have one root element.
  • Attribute values must be in quotes.
  • Attribute minimization is forbidden, in other words attributes must define a value even if it’s an empty string.
  • The id attribute is now used in place of the name attribute to identify elements such as <a>, <form>, <img>, and <map>.
  • XHTML has some required elements, such as <html>, <head>, <title>, and <body>.

The following table contains examples of what works in XHTML and what does not.

Does not work                               | Works
<b><u>some bold and underlined text</b></u> | <b><u>some bold and underlined text</u></b>
<h1>Header Text <p>Paragraph text           | <h1>Header Text</h1> <p>Paragraph text</p>
<br>                                        | <br />
<P>Some text here</P>                       | <p>Some text here</p>
<head><title></title></head><body></body>   | <html><head><title></title></head><body></body></html>
<a HREF="..."></a>                          | <a href="..."></a>
<table width=80%>                           | <table width="80%">
<option selected>                           | <option selected="selected">
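Several of these rules are plain XML well-formedness rules, so an XML parser can check them. The sketch below uses Python's xml.dom.minidom for that. The helper is_well_formed and the SAMPLES list are made up for this example, and the fragments are adapted from the table (wrapped or closed where needed so that each one tests a single rule).

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def is_well_formed(fragment):
    """Return True if `fragment` parses as XML, False otherwise."""
    try:
        minidom.parseString(fragment)
        return True
    except ExpatError:
        return False


# (markup, rule it exercises), adapted from the table above
SAMPLES = [
    ("<b><u>text</b></u>", "proper nesting"),
    ("<b><u>text</u></b>", "proper nesting"),
    ("<p>line<br></p>", "empty elements closed"),
    ("<p>line<br /></p>", "empty elements closed"),
    ("<table width=80%></table>", "quoted attribute values"),
    ('<table width="80%"></table>', "quoted attribute values"),
    ("<option selected></option>", "no attribute minimization"),
    ('<option selected="selected"></option>', "no attribute minimization"),
    ("<head></head><body></body>", "single root element"),
]

for markup, rule in SAMPLES:
    print(rule, "->", markup, "->", is_well_formed(markup))
```

A check like this only covers well-formedness. Markup such as <P>Some text here</P> still parses, because lowercase names are a rule of XHTML rather than of XML.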


The XHTML DTD (Document Type Definition)

All XHTML documents have three main parts:

  • DOCTYPE declaration
  • <head>
  • <body>

The DOCTYPE must be defined before anything else in the document.

Everything but the DOCTYPE declaration will look like HTML. XHTML just holds you to a few rules. That's the beauty of XHTML!

There are three types of DTDs:

  • STRICT – It forces you to use clean markup when writing your web pages. You use CSS to define your presentation.
  • TRANSITIONAL – This allows you to mix regular HTML features with XHTML. This one seems to be the most commonly used on newer sites, probably because it’s flexible and compatible.
  • FRAMESET – This allows you to use HTML frames.

These are some examples of implementing each DTD.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
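If you ever need to find out which of the three DTDs a page declares, the DOCTYPE can be read back with Python's standard library. This is a rough sketch: declared_dtd and PAGE are names invented for the example, and the page is a bare placeholder document.

```python
from xml.dom import minidom

# Public identifiers copied from the three DOCTYPE declarations above.
PUBLIC_IDS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN": "strict",
    "-//W3C//DTD XHTML 1.0 Transitional//EN": "transitional",
    "-//W3C//DTD XHTML 1.0 Frameset//EN": "frameset",
}


def declared_dtd(page):
    """Return 'strict', 'transitional', or 'frameset', or None if no known DOCTYPE is declared."""
    doctype = minidom.parseString(page).doctype
    if doctype is None:
        return None  # the page has no DOCTYPE declaration at all
    return PUBLIC_IDS.get(doctype.publicId)


# Placeholder page: the DOCTYPE comes first, then <head> and <body> inside <html>.
PAGE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
    "<html><head><title>Test</title></head><body></body></html>"
)

print(declared_dtd(PAGE))
```

The parser only reads the identifiers in the declaration. It does not validate the page against the DTD.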


Calling JavaScript when a webpage loads

There are a couple of ways to call a JavaScript function or run JavaScript code when a web page loads in a person's web browser.

One method is to run a script using the body tag's onload event. Here are a couple of examples:

<!-- Call a function -->
<body onload="sayHello();">
<!-- Run multiple commands -->
<body onload="var hello_world='Hello World!'; alert(hello_world);">

Or you can take advantage of the window.onload event. Here are a couple of examples:

<script type="text/javascript">
 
	// Assign the function itself to window.onload; writing sayHello() here
	// would call it immediately and assign its return value (undefined) instead
	window.onload = sayHello;
 
	function sayHello(){
 
		alert("Hello World!");
 
	}
 
</script>
<script type="text/javascript">
 
	window.onload = function(){
 
		alert("Hello World!");
 
	};
 
</script>