Talend Open Studio Cookbook
上QQ阅读APP看书,第一时间看更新

Joining data using tMap

So far, we have seen how tMap can be used to transform and filter input data. But this is only a part of the tMap functionality. The tMap component is also the main component used to perform join logic between multiple input sources. This recipe demonstrates the basics of using tMap to join multiple data sources.

Getting ready

Open the job jo_cook_ch04_0070_tMapJoin.

How to do it...

  1. Right-click tFileInputDelimited. Go to Row | Main and connect it to tMap_1. Change the name of the flow to order.
  2. Open tMap, and you should see two input tables: customer and order.
  3. Select the customerId field from the customer table and drag it to the customerId Expr. key in the order table.
  4. You will see a purple key icon and a flow showing the linked fields.
  5. Type "Card" into the Expr. key field for orderType.
  6. Drag all the order fields apart from customerId to the output. Your tMap should now look like the following screenshot:
    How to do it...
  7. Close tMap and run the job.
  8. You will see that there is a single row for each customer, and many of the fields are null.
  9. Re-open the tMap, and click tMap settingsHow to do it... for the input flow order.
  10. Change Match Model to All matches and Join Model to Inner Join.
  11. Close tMap and run the job.You will see that only the rows that have an orderType of card have been output, but there are now multiple records per customer.
  12. Add a new output to tMap and rename it to notMatched.
  13. Drag all the customer fields into the new output.
  14. Click tMap settings, and set Catch lookup inner join reject to true.
  15. Close tMap and add another tLogRow. Select tLogRow mode of Table (print values in cells of a table).
  16. Join the notMatched flow from tMap to the new tLogRow and run the job.
  17. You should now see two tables: one containing all Card transactions for customers and another showing all customers who have no Card transactions.

How it works…

tMap allows for different join types to be defined using expressions as keys. In this example, we used a variable from the main flow plus a constant ("Card") as our join keys.

The first execution of the job performed a left outer join, so all input records are output and non-matched fields are set to null (or default value if they are Java primitives). In addition, the first execution also specified to use only a unique match, thus printing out only one row per customer.

The second execution, however, specified that we wanted to do an inner join with all matches, so the output contained all orders where the customer paid with a credit card.

In the second execution, we also defined a second output that caught all the rows from the main flow that did not have any matches to the lookup.

There's more…

This recipe illustrates the main features of joining using tMap, but only joins one table to another. It is also possible to join the same table to many others of a variety of different keys from many lookups in a single tMap.

The next two recipes will show some examples of this.

The eagle-eyed among you may have noticed that the lookups are processed slightly earlier than the main flow. Due to the small volumes of data in this recipe, it isn't apparent, but if you replace the file for tFileInputDelimited_2 with chapter04_jo_0080_orderData_large.csv, then this will become very apparent (unless you have a very fast hard disk!).

What you will see is that tMap loads the lookup data into memory tables at the start of the job before it begins processing the main data flow.

For batch data integration jobs this is an efficient method, since it reduces the lookup time per transaction on the main flow, however, in the recipe Using reload at each row to process real-time/near real-time data, we will see how this method is not appropriate for small volume, real-time or near real-time data.

Also, be aware that in order to process large lookups, you will need to ensure that you have enough memory available and allocated to hold all the lookup data. If not, then the process will return out of memory errors. The recipe Stopping memory errors in Chapter 12, Common Mistakes and Other Useful Hints and Tips, describes the techniques that can help mitigate against out of memory errors in Talend.

See Also

  • Hierarchical join using tMap in this chapter.
  • Using reload at each row to process real-time/near real-time data in this chapter.
  • Stopping memory errors in Talend in Chapter 12, Common Mistakes and Other Useful Hints and Tips.