
Joining data using tMap
So far, we have seen how tMap
can be used to transform and filter input data. But this is only a part of the tMap
functionality. The tMap
component is also the main component used to perform join logic between multiple input sources. This recipe demonstrates the basics of using tMap
to join multiple data sources.
Getting ready
Open the job jo_cook_ch04_0070_tMapJoin
.
How to do it...
- Right-click
tFileInputDelimited
. Go to Row | Main and connect it totMap_1
. Change the name of the flow toorder
. - Open
tMap,
and you should see two input tables: customer and order. - Select the
customerId
field from thecustomer
table and drag it to thecustomerId
Expr. key
in theorder
table. - You will see a purple key icon and a flow showing the linked fields.
- Type
"Card"
into theExpr.
key
field fororderType
. - Drag all the order fields apart from
customerId
to the output. YourtMap
should now look like the following screenshot: - Close
tMap
and run the job. - You will see that there is a single row for each
customer,
and many of the fields are null. - Re-open the
tMap
, and click tMap settingsfor the input flow order.
- Change Match Model to All matches and Join Model to Inner Join.
- Close
tMap
and run the job.You will see that only the rows that have anorderType
of card have been output, but there are now multiple records per customer. - Add a new output to
tMap
and rename it tonotMatched
. - Drag all the customer fields into the new output.
- Click
tMap
settings, and set Catch lookup inner join reject to true. - Close
tMap
and add anothertLogRow
. SelecttLogRow
mode of Table (print values in cells of a table). - Join the
notMatched
flow fromtMap
to the newtLogRow
and run the job. - You should now see two tables: one containing all Card transactions for customers and another showing all customers who have no Card transactions.
How it works…
tMap
allows for different join types to be defined using expressions as keys. In this example, we used a variable from the main flow plus a constant ("Card"
) as our join keys.
The first execution of the job performed a left outer join, so all input records are output and non-matched fields are set to null (or default value if they are Java primitives). In addition, the first execution also specified to use only a unique match, thus printing out only one row per customer.
The second execution, however, specified that we wanted to do an inner join with all matches, so the output contained all orders where the customer paid with a credit card.
In the second execution, we also defined a second output that caught all the rows from the main flow that did not have any matches to the lookup.
There's more…
This recipe illustrates the main features of joining using tMap
, but only joins one table to another. It is also possible to join the same table to many others of a variety of different keys from many lookups in a single tMap
.
The next two recipes will show some examples of this.
The eagle-eyed among you may have noticed that the lookups are processed slightly earlier than the main flow. Due to the small volumes of data in this recipe, it isn't apparent, but if you replace the file for tFileInputDelimited_2
with chapter04_jo_0080_orderData_large.csv
, then this will become very apparent (unless you have a very fast hard disk!).
What you will see is that tMap
loads the lookup data into memory tables at the start of the job before it begins processing the main data flow.
For batch data integration jobs this is an efficient method, since it reduces the lookup time per transaction on the main flow, however, in the recipe Using reload at each row to process real-time/near real-time data, we will see how this method is not appropriate for small volume, real-time or near real-time data.
Also, be aware that in order to process large lookups, you will need to ensure that you have enough memory available and allocated to hold all the lookup data. If not, then the process will return out
of
memory
errors. The recipe Stopping memory errors in Chapter 12, Common Mistakes and Other Useful Hints and Tips, describes the techniques that can help mitigate against out
of
memory
errors in Talend.
See Also
- Hierarchical join using tMap in this chapter.
- Using reload at each row to process real-time/near real-time data in this chapter.
- Stopping memory errors in Talend in Chapter 12, Common Mistakes and Other Useful Hints and Tips.