Data Mining and Business Intelligence Lab Manual
ISBN 9788119221509

Highlights

Notes

  

Chapter 4Implementation of ETL(Extract Transform and Load) Processing using Pentaho ToolPractical No 4AIM: Implementation of ETL(Extract Transform and Load) Processing using Pentaho tool.

Description:

A) INTRODUCTION TO PENTAHO TOOL.

B) CREATING NEW TRANFORMATION.

I) CONFIGURATION OF SQL INPUT & OUTPUT.

II) CONFIGURATION OF MS EXCEL INPUT & OUTPUT.

III) TRANSFORM.

a) SORTING ROWS

b) ADDING SEQUENCE

IV) FLOW.

a) FILTER ROWS

INTRODUCTION TO PENTAHO TOOL 

The Data Integration perspective of Spoon allows you to create two basic file types: transformations and jobs.

    1. Transformations are used to describe the data flows for ETL such as reading from a source, transforming data and loading it into a target location.

    2. Jobs are used to coordinate ETL activities such as defining the flow and dependencies for what order transformations should be run, or prepare for execution by checking conditions such as, “Is my source file available?” or “Does a table exist in my database?”

Launching the PDI graphical designer: Spoon:

    1. Start spoon: If your system is Windows, type the following command:

    Spoon.bat

    2. As soon as Spoon starts, a dialog window appears asking for the repository connection data. Click the No Repository button. The main window appears. You will see a small window with the tip of the day. After reading it, close that window.

    3. A welcome! window appears with some useful links for you to see.

CREATING NEW TRANSFORMATION 

Transformation is a network of logical tasks called Steps. Transformations are essentially data flows. In the example below, the database developer has created a transformation that reads a flat file, filters it, sorts it, and loads it to a relational database table.

CONFIGURATION OF SQL INPUT & OUTPUT

Step 1: Open pentaho and click on File->New->Transformation. The following screen will appear.

Step 2: Click on Input(left side bar under Design)-> table Input(drag this table input into transformation area).

Step 3: Double click on Table Input.This screen will appear.

Step 4: Click on New tab

Step 5: Fill the details as per your oracle setting. In my case Database Name is ORCL.Give the Connection name as you wish. I have given as New Con. Click on Test to check the connection.Then Click on OK.

Step 6: Click on get SQL select statement.Then click on New Con(Connection name which we created). Then click on Table. It will display the list of tables created in oracle in current Database. Select any table here I’am selecting customer table.

Step 7: Click on OK and then click on preview. After clicking Preview below screen will appear

Step 8: Click on Close then on OK.

Step 9: Click on Output-> table Output

Step 10: Drag the table output in transformation area and drop it.

Step 11: Put the cursor on table Input in tool tip we get an option to connect to output table.You will get below screen.

Step 12: Drag the output connector and drop on table output. Below screen will appear.

Step 13: Double click on table output. In target table give the name of target table I have given new tab as target table. Then click on database fields tab.check the truncate table option and also check on specify database fields. Then click on get fields tab. You will get below screen.

Step 14:- select the fields from table field and stream field which is to be included in target table from our source table i.e customer. I have selected only two fields cust_name and cust_street.Then click on SQL button.

As soon as we click on sql button, new table is created with fields Cust_name and Cust_street. You can check the table in SQL. In this case name of the table is target_table which is created.

CONFIGURATION OF MS EXCEL INPUT & OUTPUT

Step 1: Open up the Input folder and drag and drop the Microsoft excel input Step on to the transformation window.

Step 2: Double-click on the Microsoft Excel Input Step to view its properties Click on the Browse button next to Filename field and navigate to the folder with the Excel files. Select the Excel file as shown below and then click the Open button.:

Step 3: Click on ADD button to add excel file

Step 4: Navigate to Sheet tab to select the sheet which you want to select. And select Sheetname(s) Button. Following form will appear then select sheet and select (>) button to add sheet. Or (<) this button to remove sheet.

Step 5: Navigate to Fields tab to assert the header name. Click on “Get Fields from header row” button to get column list.

Step 6: Click on “Preview Row” button to select top 1000 row.

Close the preview and make any additional changes required to data types. Once done, click the OK button to close up this Excel File input Step.

Step 7: Open up the Output folder and drag and drop the Microsoft excel output on to the transformation window.

Step 9: Double-click on the Microsoft Excel output Step to view its properties Click on the Browse button next to Filename field and navigate to the folder to store the output.

Step 10: Make connection to input and output i.e. Create a hop between the Microsoft Excel Input and Microsoft Excel Output Steps. Go to field tab of Excel output Step and click on “Get fields” button to retrieve column list of table data.

Step 11: Click OK to close the property window. And Run the Transformation. And you will see a new MS Excel file is created with the name myexceloutput.

TRANSFORM

Sorting Rows

The Pentaho Kettle (PDI) Sort Rows Step will sort your data based on field names you specify

Step 1: Open the Transform folder from the design tab and drag the “sort rows“ to canvas.

Step 2: Create the hop between the Table input and Sort rows to retrieve the information for sorting. And create second hop between “Sort rows” to “ Table output2 “ to get output of transformation in database table.

Step 3: Configure the table input to get data from table in which sort row is to be applied as show in above input Steps.

Step 4: Double click on the Sort Rows to open property window. Click on the “Get Fields” button to retrieve the table column.

Step 5: In the “fields” all column data is listed. Now selecting “Y” to sort corresponding row in ascending order or “N” to sort in Descending order.

Step 6: Click “OK” to close the window.

Step 7: Open the “Table Output2” property window by double clicking on it. Provide the connection information as shown in above output Steps.

Step 8: Select the “Target Table” or create a new target table by giving its name to store the sorted table data. Now Click on SQL which will create new output table.

Step 9: Click “OK” to close the property window

Step 10: To run the transformation, click on “Run” button as shown below.

Step 11: In the execution result window, Switch to “Preview data” tab to see the result of Sorted table.

Adding Sequence

Step 1: Click on transform->Add sequence. Drag and drop on transformation area. Connect it with sort rows. This screen will appear.

Step 2: Double click on add sequence. In Name of the value field give name of the sequence. In this case I have given MySeq.

Start value is 1

Incremented by is 1

Step 3: Click on Preview data Tab. It will add a sequence which is incremented by 1. In this example MySeq is added at last and it is incremented by 1.

We can upload this sorted data with added sequence in new table. And check the same in SQL. In this example output table Test is created with added sequence.

Step 4: Then click on OK. Then click on Run after that click on Launch.

FLOW

Filter Rows

The Filter Rows Step allows you to filter rows based on conditions and comparisons. Once this Step is connected to a previous Step (one or more and receiving input), you can click on the “<field>“, “=“ and “<value>“ areas to construct a condition.

Step 1: Open the Flow folder and select the “Filter row” Step from the design tab.

Step 2: Create the hop between the “Table input” and “Filter row”. And create second hop between “Filter Row” to Table Output 1 and third hop between filter row and “ Table output2 “ to get output of transformation in database table.

Step 3: Configure the table input to get data from table as show in above input Steps.

Step 4: Double click on the “Filter Rows” to open property window. Select the “Table Output” Step to send data if true or false as shown below:

Step 5: Now for condition, Select the given boxes to enter condition we want.

Step 6: Click “OK” to close this window

Step 7: Open the Table Output 1 property window by double clicking on it. Provide the connection information as shown in above output Steps and give the target table as sal1 and click on OK.

Step 8: Open the Table Output 2 property window by double clicking on it. Provide the connection information as shown in above output Steps and give the target table as sal2 and click on OK.

Step 9: Now run the transformation by clicking on run button.

Step 9: In the execution result window, Switch to “Preview data” tab to see the result of filtered table

Table sal1 Result:

Table Sal2 Result: