Thursday, November 4, 2010

Stream processing on WWW data

This post is about my final year project at the University of Moratuwa. We named the project Glutter; the name 'Glutter' is a combination of the words 'Clutter' and 'Gutter'. As the name implies, Glutter operates as a gutter connected to a clutter of information. In other words, it lets users gather information from various sources and then set up rules for how that content should be filtered and modified to fit their requirements.


This project is somewhat similar to Yahoo Pipes, which also works by letting users gather information from different sources and then set up rules on how that content should be modified (filtering, renaming, truncating, translating, etc.). The main limitation of Yahoo Pipes, however, is that it is not aware of the temporal aspects and causality of events on the web, which drastically limits its usefulness. Glutter was built as a solution to that. The approach in Glutter is to apply Complex Event Processing to web events, enabling temporal querying and an awareness of causality in its operators. It also supports more input and output means such as Twitter, email, feeds, web services, CSV data, XMPP chats, etc., making it more connected to the real-time web.
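To make the idea of temporal awareness concrete, here is a minimal sketch (not Glutter's actual code) of the kind of time-based rule a CEP-style operator can express and a plain feed aggregator cannot: fire an alert when several matching web events arrive within a sliding time window. All class and method names below are my own illustration.

import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative temporal operator: report a "burst" when at least `threshold`
// events fall inside a sliding window of `windowMillis`. (Names are
// hypothetical; this is not Glutter's API.)
public class SlidingWindowBurstDetector {

    private final long windowMillis;
    private final int threshold;
    private final Deque<Long> timestamps = new ArrayDeque<Long>();

    public SlidingWindowBurstDetector(long windowMillis, int threshold) {
        this.windowMillis = windowMillis;
        this.threshold = threshold;
    }

    /** Feed one web event (a feed item, tweet, email, ...) identified by its arrival time. */
    public boolean onEvent(long timestampMillis) {
        timestamps.addLast(timestampMillis);
        // Evict events that have fallen out of the window.
        while (!timestamps.isEmpty()
                && timestamps.peekFirst() < timestampMillis - windowMillis) {
            timestamps.removeFirst();
        }
        return timestamps.size() >= threshold;
    }

    public static void main(String[] args) {
        // Alert if 3 or more matching items arrive within 10 minutes.
        SlidingWindowBurstDetector burst = new SlidingWindowBurstDetector(10 * 60 * 1000L, 3);
        long now = System.currentTimeMillis();
        System.out.println(burst.onEvent(now));            // false
        System.out.println(burst.onEvent(now + 60000));    // false
        System.out.println(burst.onEvent(now + 120000));   // true - burst detected
    }
}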


The following video shows the concept of the project.




When a typical Internet user steps onto the web, information keeps flowing at him in real time and it is difficult for him to keep track of it. The web is also a huge mess of data and, to make things worse, a dynamically changing one. Glutter can therefore act as an intermediary between the clutter of information and the web user, so that by using Glutter he gets only the information he is interested in, in real time.


The main objective of Glutter is to empower users by letting them decide what, when, and how to view and get notified of information on the web, without developers designing and deciding it for them (and without writing a single line of code). The user is given a sophisticated workbench for creating workflows according to his preferences, in order to decide what to view and when the information should be delivered.


For creating workflows we provide a user interface (the Workbench), which allows users to create workflows and run them against web data sources. The Workbench contains a toolbox of drag-and-drop components that can be used to construct workflows graphically. The main components can be categorized as:

  • Connectors - Connectors establish connections to various data sources. Currently Glutter has five connectors, supporting RSS/Atom feeds, email (POP/IMAP), Twitter, CSV, and pull-based querying of web services.
  • Operators - Operators allow the user to perform many different operations on the stream.
  • Sinks - Sinks are the end components of a workflow; users can receive the workflow's output in many different forms, such as an email, a tweet, or an XMPP message, or view it in the viewer section.
All components are listed below.
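As a rough illustration of how these three categories could fit together in code, the following Java sketch models a workflow item as a simple map of fields and gives one possible interface for each component type. The names and signatures here are my own assumption, not Glutter's actual API.

import java.util.List;
import java.util.Map;

// Hypothetical component interfaces; an "item" flowing through a workflow
// (feed entry, tweet, email, CSV row) is modelled as a map of fields.
public class WorkflowComponents {

    /** Connector: pulls newly arrived items from an external data source. */
    public interface Connector {
        List<Map<String, String>> poll();
    }

    /** Operator: transforms a batch of items (filter, translate, union, semantic tagging, ...). */
    public interface Operator {
        List<Map<String, String>> apply(List<Map<String, String>> items);
    }

    /** Sink: delivers the final items (email, tweet, XMPP message, or a viewer). */
    public interface Sink {
        void emit(List<Map<String, String>> items);
    }
}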



Below I have shown a sample workflow.


In this scenario, the user is interested in particular news, but the news items are in different languages, and he also wants to know which country each item relates to. He can use Glutter for this task, as shown in the figure above. The user receives feeds in Dutch and Spanish, plus another one in English. These feeds are fetched using three feed connectors, and the translate operator is then used to translate the non-English feeds into English. Once all the feeds are in English, the union operator merges them into a single feed. Since the user wants to view the news by geographic location, the semantic operator is used: it analyses the text, semantically extracts any geo-location mentioned in it, and adds those geo-location details to every news item. The results are then sent to the data sink, so they can be displayed in a viewer that supports a map view.
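As a rough sketch, this workflow could be wired together roughly as follows, reusing the hypothetical Connector/Operator/Sink interfaces from the earlier sketch; the concrete connectors and operators passed in are stand-ins, not Glutter's real components.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical wiring of the geo-tagged news workflow described above.
public class GeoNewsWorkflow {

    public static void run(WorkflowComponents.Connector dutchFeed,
                           WorkflowComponents.Connector spanishFeed,
                           WorkflowComponents.Connector englishFeed,
                           WorkflowComponents.Operator translateToEnglish,
                           WorkflowComponents.Operator extractGeoLocation,
                           WorkflowComponents.Sink mapViewerSink) {
        // Three feed connectors; the Dutch and Spanish feeds go through the translate operator.
        List<Map<String, String>> dutch = translateToEnglish.apply(dutchFeed.poll());
        List<Map<String, String>> spanish = translateToEnglish.apply(spanishFeed.poll());
        List<Map<String, String>> english = englishFeed.poll(); // already in English

        // Union operator: merge the three streams into a single feed.
        List<Map<String, String>> merged = new ArrayList<Map<String, String>>();
        merged.addAll(dutch);
        merged.addAll(spanish);
        merged.addAll(english);

        // Semantic operator adds geo-location fields; the sink feeds the map viewer.
        mapViewerSink.emit(extractGeoLocation.apply(merged));
    }
}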

The map viewer is shown below.


This is only one example given to show the power of Glutter. Other use cases are listed below:


  • Another possible use case is redirecting data channels: for example, forwarding interesting tweets to email or chat, or feeding your blog to Twitter (you could add filtering operations in between to control what gets tweeted from your blog).
  • An email auto-replier.
  • Sending a notification email when the price of an eBay item drops below a given level.
  • Chat notifications on stock market data (e.g. when the price rises and then falls), using the pattern recognition operator; a sketch of such a pattern follows this list.
  • When an interesting feed item arrives, running a Google search (using the web service operator) on the item's content to get additional links for that news.
  • There are many more; it depends on the creativity of the user :) (since we provide a lot of connectors, operators, and sinks).
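For instance, the stock market use case above hinges on a "price rises and then falls" pattern. The following sketch shows that detection logic on its own, with illustrative names; in a real workflow the prices would come from a connector and the alert would go to a chat/XMPP sink.

// Illustrative "rise then fall" price pattern detector (names are hypothetical).
public class RiseThenFallDetector {

    private Double previousPrice = null;
    private boolean sawIncrease = false;

    /** Returns true when a price increase is followed by a decrease. */
    public boolean onPrice(double price) {
        boolean matched = false;
        if (previousPrice != null) {
            if (price > previousPrice) {
                sawIncrease = true;            // remember the rising leg
            } else if (price < previousPrice && sawIncrease) {
                matched = true;                // rise followed by fall: fire the notification
                sawIncrease = false;
            }
        }
        previousPrice = price;
        return matched;
    }

    public static void main(String[] args) {
        RiseThenFallDetector detector = new RiseThenFallDetector();
        double[] prices = {10.0, 10.5, 11.0, 10.8};
        for (double price : prices) {
            if (detector.onPrice(price)) {
                System.out.println("Pattern matched at " + price + " -> send chat notification");
            }
        }
    }
}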
Some of the resulting viewers are shown below.

Line Chart

Area Chart

List View






