Wednesday, March 11, 2009

Web Workflows - An Automated Approach to Web Browsing

The web needs an easy-to-read language for defining web tasks.

I propose XML-based Web Workflows:
<workflow>
<open>http://www.google.com</open>
<input name="q">Straw Berry</input>
<click name="btnG"></click>
</workflow>

Execution of this web workflow would result in a document with the search results attached.

I have a working open source implementation of this in PHP via Chisimba.
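
To make the idea concrete, here is a minimal interpreter sketch in PHP using the SimpleXML and cURL extensions. This is not the Chisimba module itself; the function name is made up, and the form handling is simplified (it posts the collected fields back to the opened URL instead of parsing the page for the form's real action and method).

<?php
// Minimal, illustrative web workflow interpreter (not the Chisimba module).
// Requires the SimpleXML and cURL extensions; run_workflow() is a made-up name.
function run_workflow($xml)
{
    $workflow = new SimpleXMLElement($xml);
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

    $url    = '';
    $fields = array();   // values collected from <input> steps
    $body   = '';

    foreach ($workflow->children() as $step) {
        switch ($step->getName()) {
            case 'open':    // fetch the page at the given URL
                $url = (string) $step;
                curl_setopt($ch, CURLOPT_URL, $url);
                $body = curl_exec($ch);
                break;
            case 'input':   // remember a form field value
                $fields[(string) $step['name']] = (string) $step;
                break;
            case 'click':   // submit the collected fields
                // Simplification: post back to the opened URL instead of
                // parsing the page for the form's real action and method.
                curl_setopt($ch, CURLOPT_URL, $url);
                curl_setopt($ch, CURLOPT_POST, true);
                curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
                $body = curl_exec($ch);
                break;
        }
    }
    curl_close($ch);
    return $body;   // the last document retrieved
}
?>

Feeding the Google workflow above into run_workflow() would return the HTML of the results page, which is the "document with the search results attached" described earlier.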

Description:
This module interprets XML-based Web Workflows to programmatically browse
the web and return a particular URI endpoint. The language includes syntax like

<workflow>
<open>http://www.google.com</open>
<input name="q">Straw Berry</input>
<click name="btnG"></click>
</workflow>


The above represents an easy-to-understand, high-level, procedural language for
automated web document retrieval.

This module will also allow you to specify login credentials to automate logging into sites in order to access protected resources. It will be used by the librarysearch module to assist with document retrieval across clusters of hosts.
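
For example (only <open>, <input> and <click> come from the actual syntax; the URLs and field names below are made up for illustration), a login followed by retrieval of a protected document might look like:

<workflow>
<open>http://journals.example.org/login</open>
<input name="username">libraryuser</input>
<input name="password">secret</input>
<click name="login"></click>
<open>http://journals.example.org/protected/article.pdf</open>
</workflow>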

Benefits:
This implementation is completely portable, and because it accepts the web workflow as input to produce the document, it could run as a standalone app like curl, or perhaps even as a curl extension.
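
As a sketch of the standalone idea (a hypothetical wrapper around the run_workflow() sketch above, not an existing tool):

<?php
// Hypothetical CLI usage: php workflow.php search-workflow.xml > results.html
// Assumes the run_workflow() sketch shown earlier has been included.
echo run_workflow(file_get_contents($argv[1]));
?>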

This would contribute hugely to federated search efforts where an API isn't available for a certain host.

History:
Screen scraping tasks have mostly been executed as an unordered and messy combination of curl, lynx -dump, and wget requests. This process needs to be formalized by the web community, for example through the W3C.

References:
The Java guys already have this available to them (see here), but it's too proprietary in the sense that you can only get it up and running by creating the engine from within the Java code.

Concerns:
Web stats watchers and web ad agencies aren't going to like the adoption of a formal method for robots to surf their sites and effectively carry content to users. Perhaps this will create an opportunity for other revenue models to surface.

Comments anyone?

2 comments:

  1. Well one thing that publishers are constantly worried about is rights management. They want people who develop connections to respect their access speed and not overload their connections or their databases. Federated search providers don't really rely on curl or wget requests to access content anyway. You will find that you will be pushing rope to get publishers to adopt this standard approach - they just won't do it.

    Brian Despain
    http://www.deepwebtechnologies.com

  2. Hi Brian. My apologies for the extremely late reply to this.

    To add, my target was more towards developers and software testers alike; http://seleniumhq.org/ now exists.

    I agree that anything that automates should do so respectfully, for example by adhering to robots.txt files.

    I also see a definite blocker for anyone who streams revenue off clicks if the web is automated for anything other than testing and development purposes. Still, I am in favor of automated tasks over ad clicking any time.
