Semantic Web: through the back door with HTML and CSS

I have spent a lot of time over the last 10 years working on technology for harvesting semantic information from a variety of existing sources. I was an early enthusiast for Semantic Web standards like RDF and, later, OWL. The problem is that too few web site creators invest the effort to add Semantic Web metadata to their sites.

I believe that as web site creators start using CSS and drop HTML formatting tags like <font ...> (HTML should be used for structure, not formatting!), writing software that understands the structure of web pages will get simpler. Furthermore, meaningful id and class attribute values in <div> tags will act as a crude but probably effective source of metadata. For example, a <div> tag whose id or class attribute value contains the string "menu" is likely to hold navigational information, which can be ignored or treated as valuable depending on the requirements of your knowledge gathering application.
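To make the "menu" heuristic concrete, here is a minimal sketch in Python that uses only the standard library's html.parser module. It keeps the text of a page but skips anything inside a <div> whose id or class attribute value contains a navigation hint. The names NAV_HINTS, DivTextExtractor, and extract_content are my own for illustration, and the hint words other than "menu" are assumptions. Real pages are messier (unclosed tags, navigation in elements other than <div>), so treat this as a starting point rather than a robust scraper.

```python
from html.parser import HTMLParser

# Hint words are an assumption for illustration; only "menu" comes from the text above.
NAV_HINTS = ("menu", "nav", "sidebar")


class DivTextExtractor(HTMLParser):
    """Collect page text, skipping <div> blocks that look navigational."""

    def __init__(self):
        super().__init__()
        # One flag per currently open <div>: True if its id/class looks navigational.
        self.div_stack = []
        self.kept_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            attr_map = dict(attrs)
            # Attribute values can be None (e.g. <div id>), so fall back to "".
            label = (attr_map.get("id") or "") + " " + (attr_map.get("class") or "")
            label = label.lower()
            self.div_stack.append(any(hint in label for hint in NAV_HINTS))

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when no enclosing <div> was flagged as navigational.
        if text and not any(self.div_stack):
            self.kept_text.append(text)


def extract_content(html):
    """Return the text fragments found outside navigational <div> blocks."""
    parser = DivTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.kept_text


if __name__ == "__main__":
    page = (
        '<div id="main-menu"><a href="/">Home</a></div>'
        '<div class="content"><p>RDF and OWL</p></div>'
    )
    print(extract_content(page))
```

An application that wants the navigation links instead (for example, to discover a site's structure) could invert the test in handle_data and keep only the flagged blocks.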

Just as extracting semantic information from natural language text is very difficult, analyzing page structure and HTML/CSS markup to improve web scraping software is difficult. That said, HTML + CSS is likely to be much simpler to process in software than plain HTML with formatting tags. BTW, I am in the process of converting all of my own sites to use only CSS for formatting. I have been writing HTML with embedded formatting since my first web site in 1993, so it is time for an update in methodology.
