Google Summer of Code 2009
Table of Contents
- Google Summer of Code 2009
- Overview
- Student: Miklos Erdelyi
- Project: Advanced Data Management Capabilities Using MapReduce
- Mentor : Andre Merzky
- Student: Saurabh Sehgal
- Project: Managing Applications across Clouds
- Mentor : Shantenu Jha
- Student: Roman Khachatryan
- Project: Adaptive Cost Estimation Techniques for OGSA-DAI DQP
- Mentor : Bartosz Dobrzelecki
- Student: Raviteja Dodda
- Project: Generating Customized User Interfaces to the Cloud focusing on the requirement of the end-user
- Mentor : Jano van Hemert
- Student: Mathias Brito
- Project: Improvements to the JDBC driver for OGSA-DAI
- Mentor : Michal Piotrowski
- Google Mentoring Summit
Overview
This is the second time that OMII-UK has been lucky enough to participate in the Google Summer of Code (GSoC). This year five students from various places round the world started, and they all successfully completed the programme. A summary of the outcomes follows, composed mainly from the words provided by the mentors and students themselves.
Code samples are available from here.
Student: Miklos Erdelyi
Project: Advanced Data Management Capabilities Using MapReduce
Mentor : Andre Merzky
Miklos Erdelyi, currently doing a PhD at the University of Pannonia in Hungary, continued the SAGA based MapReduce framework work that Michael Miceli, one of last year's GSoC students, had started. The goals for this year were to:
- clean up the code base,
- provide extensive documentation,
- generalize the data input/output mechanisms, and
- potentially add data/compute co-scheduling for the worker distribution.
The first three items were all completed. Although Miklos started a couple of weeks late into the GSoC Programme, due to exams, he managed to catch up with his time plan pretty well.
Miklos rewrote parts of the MapReduce implementation done by the Miceli brothers (Michael's brother had been employed outwith GSoC by LSU to work on this code) last summer to improve its efficiency and usability for a wider range of use cases. Specifically, he added a serialization layer along with input/output formats to support arbitrary types of objects and arbitrary forms of input/output in the framework. He also optimized the data processing to make it more efficient in terms of disk and network usage.
The data/compute co-scheduling was the most ambitious part of the project and is not yet complete. Miklos, however, continues to work with the OMII-UK SAGA team on that topic, even though the official GSoC Programme has finished. The SAGA team is making good progress on bringing together the efforts of Miklos and Saurabh, another GSoC student from this year, for a CCGrid publication later this year.
Andre Merzky from the Vrije Universiteit in the Netherlands, who acted as Miklos' mentor together with Shantenu Jha from Louisiana State University, said: "we have been very lucky to find two students (the other student being Saurabh Sehgal, another GSoC student from this year, who also worked on an OMII-UK SAGA based project - more on this below) which are so enthusiastic and engaged, and which are able to benefit from each others work". For his part, Miklos found it really important that his "mentors encouraged me to think beyond coding: we discussed research implications of the project too besides brain storming about future directions of the Sector/Sphere and MapReduce OMII-SAGA GSoC projects". In addition, as an essential experience of the GSoC programme he "got to know a lot more about high-performance computing" and now also knows a lot more about Hofstadter's Law.
Student: Saurabh Sehgal
Project: Managing Applications across Clouds
Mentor : Shantenu Jha
The main purpose of the project undertaken by Saurabh Sehgal, a Computer Engineering undergraduate at the University of Toronto in Canada, was to implement an adaptor that would connect the SAGA API to a Sector/Sphere compute and data cloud. The project was broken down into three main parts:
- authentication
- file/directory operation submission
- job submission
The authentication module allows a SAGA application to authenticate with a Sector/Sphere server using either dynamic attributes, which can be created on the fly through the SAGA API, or static sources such as the adaptor ini file or the Sector/Sphere authentication file itself. The adaptor parses the authentication parameters using one of these methods, or a combination of them, with the highest priority given to dynamic attributes.
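The priority rule described above can be sketched as a layered lookup. This is only an illustration, not the adaptor's code: the function name `resolve_auth`, the parameter keys and the example values are all hypothetical, and the relative order of the two static sources is an assumption, since the text only states that dynamic attributes win.

```python
from typing import Dict, Mapping, Optional


def resolve_auth(
    dynamic: Optional[Mapping[str, str]] = None,
    adaptor_ini: Optional[Mapping[str, str]] = None,
    sector_auth_file: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Merge authentication parameters from up to three sources.

    Sources are applied from lowest to highest priority, so a value set
    by a later source overwrites the same key from an earlier one.
    Assumption: the adaptor ini file outranks the Sector/Sphere
    authentication file; the original text does not say which wins.
    """
    merged: Dict[str, str] = {}
    for source in (sector_auth_file, adaptor_ini, dynamic):
        if source:
            merged.update(source)
    return merged


# Hypothetical keys and values, for illustration only.
params = resolve_auth(
    dynamic={"user": "alice"},
    adaptor_ini={"user": "default", "host": "sector.example.org"},
    sector_auth_file={"port": "6000"},
)
print(params)
```

Here `user` comes from the dynamic attributes, while `host` and `port` fall through to the static sources because no higher-priority source sets them.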
The second part of the project was the file submission API. This time the adaptor creates an instance of a Sector file that all the SAGA file package APIs can map to. All the operations in the file package are supported, except for links and permissions because these are not supported by Sector.
The last part of the project addressed job submission in Sphere, which differs from the common use in SAGA because it only supports functions encoded in dynamic libraries (DLLs) and not general executables. The job submission API allows the SAGA application to run "kernel" functions from a supplied DLL, and utilize the "Stream Processing" paradigm available through the Sphere compute cloud. This includes:
- Supporting an "argument" scheme with the DLL, as outlined below:
<input files><output_files><function name><rows><param address><param size>
- Uploading of input files to the Sector system from the local filesystem, if needed.
- Supporting the passing of parameters to the DLL through supplying memory addresses as strings.
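As a rough illustration of the last point, the sketch below passes a parameter block by writing its memory address and size into the argument list as strings, then reads it back on the receiving side. It uses Python's `ctypes` purely for demonstration and is not the adaptor's implementation. The file names, the function name `my_kernel` and the rows value are hypothetical, and the technique only works when both sides share one address space.

```python
import ctypes
from typing import Tuple


def encode_param(buffer: ctypes.Array) -> Tuple[str, str]:
    """Return the address and size of a ctypes buffer as decimal strings."""
    return str(ctypes.addressof(buffer)), str(ctypes.sizeof(buffer))


def decode_param(address: str, size: str) -> bytes:
    """Read `size` bytes starting at `address`, both given as strings."""
    return ctypes.string_at(int(address), int(size))


# The buffer must stay alive for as long as the address is in use.
payload = ctypes.create_string_buffer(b"threshold=42")  # adds a trailing NUL
address, size = encode_param(payload)

# Argument list in the order given by the scheme above (values hypothetical).
args = ["in.dat", "out.dat", "my_kernel", "100", address, size]

recovered = decode_param(args[4], args[5])
print(recovered)
```

Encoding the address as text keeps the argument list a plain list of strings, at the cost of type safety: the receiver has to trust that the address and size are valid.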
Architecturally, the adaptor is thus designed to link to a DLL that contains a wrapper around all the services provided by the Sector/Sphere system. This shared DLL is what translates the SAGA specific operations into Sector/Sphere operations. It also provides other services, such as creating services (file and job services) and error handling APIs that translate error codes into Sector/Sphere error messages.
Overall, Saurabh found his GSoC 2009 experience to be "very rewarding" and it introduced him "to cutting-edge, interesting and relevant research addressing complex problems in grid and cloud computing". He found that his mentors encouraged him to think beyond the coding aspect of the project by keeping in mind its research implications and possible extensions.
Saurabh hopes to continue contributing to the SAGA project even after Google SoC is over. Saurabh states: "I understand the aim of Google Summer of Code is to showcase the importance of open software and inspire contributions from the student community. It has definitely managed to do so in my case, and I hope to be an active member of the SAGA, OMII and Sector/Sphere community in the future".
Shantenu Jha, the main mentor for Saurabh, thought that: "Saurabh was very good -- diligent and effective". There were some external barriers to Saurabh's progress. For example, the Open Cloud Testbed used did not become functional until well into August, which is close to the end of GSoC. All the code that Saurabh wrote is in the SAGA SVN. Andre Merzky did a code review (as did Hartmut Kaiser, another developer on the SAGA project) and the code quality is good.
Shantenu commented that "on the whole the GSoC project helped clean up the MapReduce code base". Shantenu even took the opportunity, on a trip to the vicinity, to go to Toronto and spend a day with Saurabh. All in all, progress was good, both intellectually and conceptually. Shantenu and Saurabh met over Skype at least once a week, and continue to do so well after GSoC is formally over. Saurabh (and Miklos) continue to work with the SAGA group, and once the performance tests on the different machines are complete, we hope to have a CCGrid-level publication.
Student: Roman Khachatryan
Project: Adaptive Cost Estimation Techniques for OGSA-DAI DQP
Mentor : Bartosz Dobrzelecki
The aim of this project was to provide the OGSA-DAI Distributed Query Optimiser with a cardinality estimation module. Cardinality estimation is an essential part of every query optimiser, and usually relies on statistics about the data being queried. However, in a distributed environment, query execution feedback should be used instead, because gathering statistics beforehand is problematic.
The project proved to be a successful collaboration with Roman managing to implement most of the functionality described in the original work plan. The execution monitoring activities and related dynamic resource properties have been implemented. Query plan optimizers injecting monitoring probes into query plans are working. Code that gathers post execution data and maintains dynamic value distribution histograms was also contributed. Apart from contributing code, Roman also provided valuable feedback on OGSA-DAI's core API design and his remarks prompted a couple of extensions and refactorings. Overall it was a valuable experience that resulted in mutual benefits.
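To illustrate the general idea of feedback-driven cardinality estimation, here is a minimal sketch of an equi-width histogram that is filled from values observed during execution and then used to estimate how many rows satisfy a range predicate. It does not reproduce Roman's module or any OGSA-DAI API: the class name, method names, bucket layout and the fallback guess used before any feedback arrives are all assumptions made for the example.

```python
from typing import List


class FeedbackHistogram:
    """Equi-width histogram learned from values seen during query execution."""

    def __init__(self, low: float, high: float, buckets: int) -> None:
        self.low = low
        self.high = high
        self.width = (high - low) / buckets
        self.counts: List[int] = [0] * buckets

    def _bucket(self, value: float) -> int:
        # Clamp out-of-range values into the first or last bucket.
        index = int((value - self.low) / self.width)
        return min(max(index, 0), len(self.counts) - 1)

    def observe(self, value: float) -> None:
        """Record one value reported by a monitoring probe."""
        self.counts[self._bucket(value)] += 1

    def estimate_less_than(self, bound: float, table_rows: int) -> float:
        """Estimate how many of `table_rows` rows satisfy `column < bound`."""
        total = sum(self.counts)
        if total == 0:
            return table_rows / 2  # assumption: arbitrary guess without feedback
        if bound <= self.low:
            return 0.0
        if bound >= self.high:
            return float(table_rows)
        index = self._bucket(bound)
        covered = float(sum(self.counts[:index]))
        # Assume values are spread uniformly inside the partial bucket.
        bucket_start = self.low + index * self.width
        covered += self.counts[index] * (bound - bucket_start) / self.width
        return table_rows * covered / total


hist = FeedbackHistogram(low=0, high=100, buckets=10)
for value in range(100):  # feedback from a previous execution
    hist.observe(value)

estimate = hist.estimate_less_than(25, table_rows=1000)
print(estimate)
```

Each new execution can call `observe` again, so the estimates track the data actually seen rather than statistics gathered in advance.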
In undertaking this project Roman: "learned many new things from the IT field as well as from mathematics sometimes not directly related to my work... It was easy and interesting to me to make new acquaintances and to communicate with the team in spite of my imperfect English:)". Roman hopes to find a job as closely related to distributed data access environments as possible and to continue to work on OGSA-DAI-DQP.
Student: Raviteja Dodda
Project: Generating Customized User Interfaces to the Cloud focusing on the requirement of the end-user
Mentor : Jano van Hemert
Raviteja Dodda, who has just completed the third year of his Bachelor's Degree in Computer Science and Engineering at IIT Kharagpur in India, had as his project goal the generation of customized user interfaces to the cloud. He implemented and generated two portlets using Rapid. The first portlet, named "Cloud Monitor", provides a user with an interface to start, stop and monitor instances of their job in the Cloud. The user can provide the authentication parameters to start the required number of job instances in a Cloud such as that provided by Amazon EC2 or Eucalyptus. It was implemented with Amazon EC2 as the main provider of a computing cloud. The other portlet, named "Rapid Cloud", provides a user interface to submit jobs, such as an Image Filter job or a Cloud Monitoring job, to the Cloud. The jobs are then executed on instances in the cloud.
The Cloud Monitor code was written in Java and implemented using the Query interface for Amazon EC2, as its response time was better than that of the SOAP interface for Amazon EC2. Initially the Eucalyptus cloud solution was installed on instances of nesc-red (a network of machines available at NeSC), but due to some configuration changes made by the support team, it was not possible to continue to work on it. The Amazon EC2 public cloud was considered next as the computing infrastructure, and the portlet developed was tested on it. There were initially problems with transferring the job's files to the cloud, but these were resolved after about 2-3 days of effort.
The main problem that Raviteja faced during his Google Summer of Code project was a firewall at his local University, which meant that he could not access the cloud instances outside his University through ssh as port 22 was being blocked. To resolve this he had to work at an internet cafe outside his University and use ssh tunneling. Nevertheless, despite these tribulations, he was able to complete his project on time. This, in part, was due to the immense support Raviteja had from his mentors, Jano van Hemert and Jos Koetsier. They provided him with all the compute resources in time, and gave feedback and suggestions about his work. Raviteja would also like to thank OMII-UK and the GSoC administrator Mario for selecting him and for their support.
From the mentoring side, the Google Summer of Code project gave the Rapid development team an opportunity to explore how their technology, Rapid, together with its underlying methods, needs to be adapted to allow its use in cloud computing. As expected, the principles that have been used in a high-performance and grid computing context are not that easily adapted to suit cloud computing. The underlying methods need to be changed to allow submitting, monitoring and handling compute jobs on virtual machines. This has been successfully added to our technology, but needs more careful testing to validate that it is sufficiently robust for a public release.
Student: Mathias Brito
Project: Improvements to the JDBC driver for OGSA-DAI
Mentor : Michal Piotrowski
Mathias, currently undertaking a PhD in Computer Engineering at the University of Sao Paulo, thinks that working on an Open Source project is great for two main reasons:
- the community involved in the development and
- the results you get from delivering a project that is open and free.
For this specific project, Mathias maintains, the community was really the major motivating factor for him. Collaboration, he says, really does exist in the OGSA-DAI project with user questions being answered quickly with a very high level of detail.
Mathias states that he has "learned a lot with the program specially about coding". The challenges he encountered went beyond adding code: one also has to restructure, test and build it. He now has a little more knowledge about these challenges, what to expect and how to go about solving them.
There was a bit of conflict towards the middle of the programme between PhD related work and GSoC, which left Mathias with a little less time to spend on the driver work in the second half of the programme. Nevertheless Mathias managed to finish all the nice features in the project and about 90% of the goals were met. The next stage will be to stabilize the code and to start talking with the community again about the features that should be implemented next. Some ideas have already come to mind, like testing the usage of the driver with OGSA-DQP. Mathias would like to thank all of those who helped directly, or indirectly, in the development of the new improvements in the OGSA-DAI JDBC Driver.
Michal Piotrowski, who mentored Mathias' GSoC project, gathered feedback and user requirements from current OGSA-DAI JDBC driver users, mainly the OGSA-DAI team at EPCC and Comarch, a Polish software company that incorporated the component into its data mining software.
Michal worked closely with Mathias, revising his workplan and clarifying the prioritisation of tasks, the approach, and the effort required to successfully finish the project. Mathias has worked with the OGSA-DAI team in the past, so he already had some solid knowledge about the technology. Michal also encouraged Mathias to talk to other members of the team via the ogsa-dai users mailing list and an internal (epcc) IRC channel.
Mathias was asked to supply weekly reports during his GSoC project. Michal saw his main role as helping Mathias whenever he had any technical questions regarding OGSA-DAI or programming in general (with great help from other OGSA-DAI team members). Michal found Mathias to be a very competent and resourceful person and it was very nice to work with him.
The OGSA-DAI JDBC driver is a very useful component for OGSA-DAI. It simplifies the integration of Java-based legacy systems that use JDBC with OGSA-DAI technology and could attract more users to try and use OGSA-DAI. There are no current plans to integrate the JDBC driver with the OGSA-DAI project, but the OGSA-DAI JDBC driver will be released as a separate component.
Almost all project goals were achieved. The main achievement was a significant increase in the performance of the driver when working on large data sets, due to the implementation of a streamed data transfer between OGSA-DAI and the JDBC driver. The second important improvement was adding support for a new OGSA-DAI resource: resource groups.
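The benefit of streamed transfer on large data sets can be illustrated with a small sketch in which rows are forwarded in fixed-size chunks rather than collected into one large result first. This is written in Python for brevity and is not the driver's code: `produce_rows`, `stream_in_chunks`, the row layout and the chunk size are hypothetical stand-ins.

```python
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int, str]


def produce_rows(count: int) -> Iterator[Row]:
    """Stand-in for a server-side result set; yields rows lazily."""
    for i in range(count):
        yield (i, f"value-{i}")


def stream_in_chunks(rows: Iterable[Row], chunk_size: int) -> Iterator[List[Row]]:
    """Forward rows in chunks so at most `chunk_size` rows are buffered."""
    chunk: List[Row] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly shorter chunk
        yield chunk


total = 0
largest_chunk = 0
for chunk in stream_in_chunks(produce_rows(10_000), chunk_size=500):
    total += len(chunk)
    largest_chunk = max(largest_chunk, len(chunk))

print(total, largest_chunk)
```

Because the consumer only ever holds one chunk, memory use stays bounded by the chunk size regardless of how many rows the query returns, and processing can begin before the full result has arrived.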
This GSoC project was successful and should lead to a new release of the JDBC driver on SourceForge and IT-tude.
Google Mentoring Summit
The summit took place on Saturday and Sunday, 24th-25th October, at Google's Mountain View HQ. Representatives from each successfully participating organisation were invited to discuss and improve the GSoC project, collaborate and code. OMII-UK was represented by Michal Piotrowski, from EPCC, among another 300 participants from all over the world representing 150 organisations developing free and open source software.
The Summit was organised in a truly open fashion. Google provided access to most of its Googleplex offices, which were turned into discussion session venues. Discussion topics were proposed by individual mentors and ranked by all participants. They ranged from general ones on how to improve GSoC talks, design, organisation and marketing strategies for open software organisations to low-level discussions such as "Parallel MultiCore and GPU Algorithms" or "Security Issues in Open Source Projects".
The event was an excellent forum to present OMII-UK's developments to the wider open source community. It was also a great place to discuss, comment and gather feedback from other enthusiastic participants. Google again showed that it supports and values the effort of the free software organisations and proved to be a fantastic organiser.





© The University of Southampton on behalf of OMII-UK. All Rights Reserved.