Some contemporary scientific problems require a multidisciplinary or trans-disciplinary approach to achieve satisfactory answers. Traditional intra-disciplinary research has evolved in quasi-isolation, providing scientists with theoretical canons, methodologies, and frameworks unique to a specific field of study. The increasing importance of issues concerning the linked nature of the physical world, coupled with a growing understanding of system complexity, often leaves traditional scientific disciplines struggling. Even within the same discipline, different research groups have difficulty linking their studies together. Unifying software tools, algorithms, and computational strategies across multiple disciplines is unimaginable, yet a common, machine-interpretable data description is quite feasible. So, if we can envision a software application built as a network of independent "black box" processes that exchange only data, then designing a software application for an inter-disciplinary study suddenly becomes a reality.
The principal goal of CLARA is to provide a computing environment for efficient Big Data processing. By data processing efficiency we mean two basic concepts:
- Data processing performance optimization
- Data processing application agility
Data Processing Performance
Today, big data is generated by scientific sources (e.g., high energy and nuclear physics experiments, earth science observatories, cosmology, bioinformatics) as well as commercial ones (e.g., social networks, digital imaging, search engines).
Data processing communities face the challenge of filtering, aggregating, correlating, and analyzing large volumes of data, and of reporting the results. Besides data volume, throughput, and raw data production speed, the variety and heterogeneity of raw data are further challenges of big data. Although batch processing is the most widely accepted and used solution to the problem, there is an increasing demand for non-batch approaches capable of substantially faster, near real-time data processing.
The CLARA design focuses on two new traits: real-time data processing and stream processing. The data-driven and data-centric architecture of CLARA yields solutions capable of processing large volumes of data interactively and substantially faster than batch systems. The performance increase is largely due to stream processing, which addresses the requirement that input data be processed without being physically stored. CLARA event processing assumes that data is presented as a set of identifiable events; if so, CLARA services can be used to develop high-performance complex event processing engines capable of detecting patterns of activity in continuously streaming data.
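As a minimal sketch of this streaming model, the Java fragment below detects a simple pattern (a run of consecutive readings above a threshold) on events as they arrive, without ever storing the stream. The event structure and the pattern are illustrative assumptions, not part of CLARA itself:

    import java.util.function.Consumer;

    // Illustrative sketch: a single identifiable event and a detector that
    // recognizes a pattern (runLength consecutive readings above a
    // threshold) on the fly, keeping only a counter rather than the stream.
    public class ThresholdPatternDetector {

        public record Event(long id, double value) { }

        private final double threshold;
        private final int runLength;
        private final Consumer<Event> onPattern;
        private int consecutive = 0;

        public ThresholdPatternDetector(double threshold, int runLength,
                                        Consumer<Event> onPattern) {
            this.threshold = threshold;
            this.runLength = runLength;
            this.onPattern = onPattern;
        }

        // Processes one event as it arrives; nothing is retained afterwards.
        public void accept(Event e) {
            consecutive = (e.value() > threshold) ? consecutive + 1 : 0;
            if (consecutive == runLength) {
                onPattern.accept(e);       // pattern detected on live data
                consecutive = 0;
            }
        }

        public static void main(String[] args) {
            ThresholdPatternDetector detector = new ThresholdPatternDetector(
                    5.0, 3,
                    e -> System.out.println("pattern ends at event " + e.id()));
            double[] readings = {1, 6, 7, 8, 2, 9};   // synthetic stream
            long id = 0;
            for (double v : readings) {
                detector.accept(new Event(id++, v));
            }
        }
    }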
Data Processing Application Design
To achieve quality data processing, intellectual input from diverse groups within large collaborations must be brought together. Data processing in a collaborative environment has historically relied on a computing model based on self-contained, monolithic software applications running in batch mode. If not organized properly, this model can be inefficient in terms of deployment, maintenance, exception handling, update propagation, scalability, and fault tolerance.
The rate at which computing hardware technologies advance creates additional challenges for legacy software applications, which must adapt to satisfy ever-growing data processing requirements. Adapting legacy software to new hardware technologies is quite painful, resulting in code fragmentation and ad hoc extensions. This has led to computing systems so complex and intertwined that the programs have become difficult to maintain and extend.
Software technologies are evolving as well: many new software architectures and improved high-level programming languages are available. Software application development is the process of writing, testing, debugging, and maintaining the source code that implements particular algorithms. The source code is usually written in a high-level (HL) programming language (such as Java, C++, or Python). Designing a software application requires expertise in many different subjects, including knowledge of the application domain, specialized algorithms, formal logic, and, of course, the syntax and semantics of the chosen HL programming language. This describes the widely adopted traditional approach to designing and developing software applications.
So, what if a user is an expert in a specific application domain yet has limited knowledge of and experience in software programming? Obviously such a user cannot actively contribute to developing domain-specific software applications in the traditional way. However, the same user can very effectively design an application from available software building blocks: a process that requires neither writing nor compiling source code.
Data processing applications have very long lifetimes (data is reprocessed over and over again), so the ability to upgrade technologies is essential. Such applications must be organized in a way that easily permits discarding aged software components and including new ones without redesigning the entire software package at each change. Adding new modules and removing unsatisfactory ones is a natural part of a software application's evolution over time. Experience shows that software evolution and diversification are important and result in more efficient and robust data processing applications. Software applications designed within the CLARA framework differ from this traditional approach in two major ways.
First, a software application is composed of interlocking software bricks called services. Services are linked to each other only by data links and hide the technology (e.g., the HL programming language) and the algorithmic solutions used to process the data. A CLARA service has inputs, is capable of processing data, and produces output data. Once a given service receives valid data, it executes its engine, produces output data, and passes that data to the next service in the data-flow path.
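A minimal sketch of this black-box contract is shown below. The ServiceEngine interface and the two bricks are illustrative assumptions, not the actual CLARA API; the point is that bricks are interchangeable because they agree only on the data they exchange:

    // Illustrative data contract of a brick: bytes in, bytes out. Everything
    // behind execute() -- language, libraries, algorithm -- is hidden from
    // the rest of the application.
    public interface ServiceEngine {
        byte[] execute(byte[] input);
    }

    // Two interchangeable bricks: either can occupy the same point of a
    // data-flow path, because they agree only on the data contract.
    class NoiseFilterService implements ServiceEngine {
        @Override
        public byte[] execute(byte[] input) {
            // ... domain-specific filtering algorithm goes here
            return input;
        }
    }

    class TrackFinderService implements ServiceEngine {
        @Override
        public byte[] execute(byte[] input) {
            // ... a completely different algorithm behind the same contract
            return input;
        }
    }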
The second and critical difference is that CLARA applications execute according to the rules of data-flow rather than the more traditional programming approach, in which a sequential series of instructions (lines of code) is written to implement the required algorithm. In this respect, CLARA application design promotes data and data-flow as the central concepts behind any data processing algorithm. CLARA service execution is data-driven and data-dependent: the flow of data between the services of an application determines their execution order.
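Continuing with the hypothetical ServiceEngine contract from the previous sketch, the fragment below illustrates data-driven execution: a service fires only when a datum reaches it, so the data path, not the source code, fixes the execution order.

    import java.util.List;

    // Illustrative sketch: a chain of bricks triggered by data arrival.
    public class DataFlowPath {

        private final List<ServiceEngine> path;

        public DataFlowPath(List<ServiceEngine> path) {
            this.path = path;
        }

        // One datum entering the head of the path triggers each service in
        // turn; the output of one brick becomes the input of the next.
        public void feed(byte[] datum) {
            byte[] data = datum;
            for (ServiceEngine service : path) {
                data = service.execute(data);
            }
        }

        public static void main(String[] args) {
            DataFlowPath app = new DataFlowPath(
                    List.of(new NoiseFilterService(), new TrackFinderService()));
            app.feed(new byte[]{1, 2, 3});   // data arrival drives execution
        }
    }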
These differences may seem minor at first, but their impact is revolutionary, because they make the data paths between application building blocks (bricks) the application designer's main focus. As a result, a CLARA application is more robust and agile, since its building blocks can be improved and replaced individually. It is more fault-tolerant, since a faulty block can be replaced without bringing down the entire application. It is also elastic, since at run time new data processing services can be added to enhance functionality, or services can be added and removed to fit the available hardware computing resources.
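The sketch below illustrates this elasticity claim under the same simplified, hypothetical contract: bricks can be deployed to or retired from a live data path without stopping the flow. CLARA itself distributes services across processes and nodes; a single-process list merely demonstrates the principle.

    import java.util.concurrent.CopyOnWriteArrayList;

    // Illustrative sketch of run-time elasticity: the path can be edited
    // while data is flowing through it.
    public class ElasticPath {

        private final CopyOnWriteArrayList<ServiceEngine> path =
                new CopyOnWriteArrayList<>();

        public void deploy(ServiceEngine service)  { path.add(service); }
        public void retire(ServiceEngine service)  { path.remove(service); }

        public void feed(byte[] datum) {
            byte[] data = datum;
            // Iteration works on a snapshot of the list, so deploy/retire
            // calls from other threads cannot disturb data already in flight.
            for (ServiceEngine service : path) {
                data = service.execute(data);
            }
        }
    }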