Monday, May 5, 2008

Open Source ETL (Extraction, Transform, Load) Written in Java

Ever think that XSLT is too inflexible, too heavywieght or just plain too complex to use? If your answer is yes, then here's a review of open source java projects that help you do ETL (i.e. Extract, Transform and Load) in a easy and scalable manner.
  • Octopus - Octopus is a simple Java-based Extraction, Transform, and Loading (ETL) tool. It may connect to any JDBC data sources and perform transformations defined in an XML file. A loadjob-generator is provided to generate Octopus loadjob skeletons from an existing database. Many different types of databases can be mixed (MSSQL, Oracle, DB2, QED, JDBC-ODBC with Excel and Access, MySQL, CSV-files, XML-files,...) Three special JDBC drivers come with Octopus to support JDBC access to CSV-files (CSV-JDBC), MS-SQL (FreeTDS) and XML. Octopus supports Ant and JUnit to create a database / tables and extract/load data during a build or test process.
  • Xineo - Xineo XIL (XML Import Langage) defines an XML language for transforming various record-based data sources into XML documents, and provides a fully functional XIL processing implementation. This implementation has built-in support for relational (via JDBC) and structured text (like CSV) sources, and is extensible thanks to its public API, allowing dynamic integration of new data source implementations. It also provides an abstraction over output format, and the Xineo implementation can generate output documents into stream or as DOM document. Xineo's implementation built-in data sources include : Relational data via JDBC and Structured text via regular expressions.
  • CloverETL - CloverETL Features include internally represents all characters as 16bit, converts from most common character sets (ASCII, UTF-8, ISO-8859-1,ISO-8859-2, etc), works with delimited or fix-length data records, data records (fields) are internally handled as a variable-length data structures, fields can have default values, handles NULL values, cooperates with any database with JDBC driver, transforming of the data is performed by independent components, each running as an independent thread, framework implements so called pipeline-parallelism, metadata describing structure of data files (records) can be read from XML and transformation graphs can be read from XML
  • BabelDoc - BabelDoc is a Java framework for processing documents in linear stages, it tracks documents and can reintroduce documents back into into the pipelines, it is monitorable and configurable through a number of interfaces, it can be run standalone, in server processes or in application servers, it can be reconfigured dynamically by text files and database tables.
  • Joost - Java implementation of the Streaming Transformations for XML (STX) language. Streaming Transformations for XML (STX) is a one-pass transformation language for XML documents. STX is intended as a high-speed, low memory consumption alternative to XSLT. Since it does not require the construction of an in-memory tree, it is suitable for use in resource constrained scenarios.
  • CB2XML - CB2XML (CopyBook to XML) is a COBOL CopyBook to XML converter written in Java and based on the SableCC parser generator. This project includes utilities to convert an XML instance file into its COBOL copybook equivalent string buffer and vice versa. You can find additional information about supporting Jurasic systems here.
  • mec-eagle - JAVA XML XSL B2B integration software:SWING based GUI,an EDI to XML, XML to XML and XML to EDI converter,client-server architecture.All EDI standards are supported:EDIFACT,ANSI X.12,SAP IDOC,XCBL,RosettaNet,Biztalk.Included comm:SMTP,FTP,HTTP(S),PGP/MIME
  • Transmorpher - Transmorpher is an environment for processing generic transformations on XML documents. It aims at complementing XSLT in order to:
    • describe easily simple transformations (removing elements, replacing tag and attribute names, concatenating documents...);
    • allowing regular expression transformations on the content;
    • composing transformations by linking their (multiple) output to input;
    • iterating transformations, sometimes until saturation (closure operation);
    • integrating external transformations.
  • XPipe - XPipe is an approach to manageable, scaleable, robust XML processing based on the assembly line principle, common in many areas of manufacturing. XPipe as being an attempt to take what was great about the original Unix pipe idea and apply it for structured information streams based on XML.
  • DataSift - DataSift is a powerful java data validation and transformation framework, aimed at enterprise software development, which provides developers with an extensible architecture they can fully adapt. Almost every feature in it can be configured and extended in some way.
  • Xephyrus Flume - Flume is a component pipeline engine. It allows you to chain together multiple workers into a pipeline mechanism. The intention of Flume is that each of the workers would provide access to a different type of technology. For example, a pipeline could consist of a Jython script worker followed by a BeanShell script worker followed by an XSLT worker.
  • Smallx - Smallx supports streaming of XML infosets to allow processing of very large documents (500MB-1GB). Processing is specified in an XML syntax that describes an XML pipeline--which is a sequence of components that consume and produce infosets. This allows chaining of XML component standards like XSLT. Also, there is a full component API that allows developers to easily write their own components.
  • Nux - Nux is a toolkit making efficient and powerful XML processing easy. It is geared towards embedded use in high-throughput XML messaging middleware such as large-scale Peer-to-Peer infrastructures, message queues, publish-subscribe and matchmaking systems for Blogs/newsfeeds, text chat, data acquisition and distribution systems, application level routers, firewalls, classifiers, etc. Nux reliably processes whatever data fits into main memory (even, say, 250 MB messages), but it is not an XML database system, and does not attempt to be one. Nux integrates best-of-breed components, containing extensions of the XOM, Saxon and Lucene open-source libraries.
  • KETL - KETL is an extract, transform, and load (ETL) tool designed by Kinetic Networks. KETL includes job scheduling and alerting capabilities. The KETL Server is a Java-based data integration platform consisting of a multi-threaded server that manages various job executors. Jobs are defined using an XML definition language.
  • Kettle - K.E.T.T.L.E (Kettle ETTL Environment) is a meta-data driven ETTL tool. (ETTL: Extraction, Transformation, Transportation & Loading). No code has to be written to perform complex data transformations. Environment means that it is possible to create plugins to do custom transformations or access propriatary data sources. Kettle supports most databases on the market and has native support for slowly chaning dimensions on most platforms. The complete Kettle source code is over 160,000 lines of java code.
  • Netflux - Metadata based tool to allow for easier manipulations. Spring based configuration, BSF based scripting support, pluggable JDBC based data sources and sinks. A server and a GUI are planned.
  • OpenDigger - OpenDigger is a java based compiler for the xETL language. xETL is a language specifically projected to read, manipulate and write data in any format and database. With OpenDigger/XETL you can build Extraction-Transformation-Loading (ETL) programs virtually from and to any database platform.
  • ServingXML - ServingXML is a markup language for expressing XML pipelines, and an extensible Java framework for defining the elements of the language. It defines a vocabulary for expressing flat-XML, XML-flat, flat-flat, and XML-XML transformations in pipelines. ServingXML supports reading content as XML files, flat files, SQL queries or dynamically generated SAX events, transforming it with XSLT stylesheets and custom SAX filters, and writing it as XML, HTML, PDF or mail attachments. ServingXML is suited for converting flat file or database records to XML, with its support for namespaces, variant record types, multi-valued fields, segments and repeating groups, hierarchical grouping of records, and record-by-record validation with XML Schema.
  • Talend - Talend Open Studio is full-featured Data Integration OpenSource solution (ETL). Its graphical user interface, based on Eclipse Rich Client Platform (RCP) includes numerous components for business process modelling, as well as technical implementations of extracting, transformation and mapping of data flows. Data related script and underlying programs are generated in Perl and Java code.
  • Scriptella - Scriptella is an ETL and script execution tool. Its primary focus is simplicity. It doesn't require the user to learn another complex XML-based language to use it, but allows the use of SQL or another scripting language suitable for the data source to perform required transformations.
  • ETL Integrator - ETL (a highly unimaginative name) consists of 3 components. An ETL service engine that is a JBI compliant service engine implementation which can be deployed in a JBI container. An ETL Editor that is a design time netbeans module which allow users to design ETL process in a graphical way. An ETL Project that is a design time netbeans module which allows users to package ETL related artifacts in a jar file which could be deployed onto the ETL service engine.
  • Jitterbit - Jitterbit can act as a powerful ETL tool. Operations are defined, configured, and monitored with a GUI. The GUI can create document definitions, from simple flat file structures to complex hierarchic files structures. Jitterbit includes drag-and-drop mapping tool to transform data between your various system interfaces. Furthermore, one can set schedules, create success and failure events and track the results for your integration operations. Jitterbit supports Web Services, XML Files, HTTP/S, FTP, ODBC, Flat and Hierarchic file structures and file shares.
  • Apatar - Apatar integrates databases, files and applications. Apatar includes a visual job designer for defining mapping, joins, filtering, data validation and schedules. Connectors include MySQL, PostgreSQL, Oracle, MS SQL, Sybase, FTP, HTTP,, SugarCRM, Compiere ERP, Goldmine CRM, XML, flat files, Webdav, Buzzsaw, LDAP, Amazon and Flickr. No coding is required to accomplish even a complex integration. All metadata is stored in XML.
  • Spring Batch - Spring Batch is a lightweight, comprehensive batch framework designed to enable the development of robust batch applications. Spring Batch provides reusable functions that are essential in processing large volumes of records, including logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management. It also provides more advance technical services and features that will enable extremely high-volume and high performance batch jobs though optimization and partitioning techniques.

  • JasperETL - JasperETL was developed through a technology partnership with Talend. JasperETL includes Eclipse based user interfaces for process design, transformation mapping, debugging, process viewing. The project includes over 30 connectors like flat files, xml, databases, email, ftp and more. It includes wizards to help configure the processing of complex file formats including positional, delimited, CSV, RegExp, XML, and LDIF formatted data.
  • Pentaho Data Integration - Pentaho Data Integration provides a declarative approach to ETL where you specify what to do rather than how to do it. It includes a transformation library with over 70 mapping objects. In includes data warehousing capability for slowly changing and junk Dimensions. Includes support for multiple data sources including over 25 open source and proprietary database platforms, flat files, Excel documents, and more. The architecture is extensible with a plug-in mehcanism.

1 comment:

NHM said...

tell me something. If you could reply to ncmgm @ netcabo . pt (no spaces in between) I'd be thankful.

Well ETL tools are meant to access and transform data from varied data sources. But they are too cahotic to have a clear idea how each tool will operate internatlly.

I'd like to understand if in general the heterogenous data is handled in some common logical format. Either you look as data as being in relational format... or XML format.

Or if you acces data in a RDBMS you look at data as set of relations and tuples, and when you access a source in .xml you look at them as trees?

Do they use both data representation paradigms? or do they actually try to normalize the interface to the user by working in a common data representation model?

IF SO, then which is used more often?
The relational data format, or the semi-structured?

Thanks in advance.

My best regards, Nuno.