We have a requirement where we read data from three different files and join these files on different columns within the same job.
Each file is around 25-30 GB, while our system has only 16 GB of RAM. We do the joins with tMap, and Talend keeps all the reference data in physical memory. In my case I cannot provide that much memory, so the job fails with an out-of-memory error. If I use the join with the temp-disk option in tMap, the job is dead slow.
Please help me with these questions.
- How does Talend process data larger than the RAM size?
- Is pipeline parallelism in place with Talend? Am I missing anything in the code to accomplish that?
- tUniqRow & join operations are done in physical memory, causing the job to run dead slow. A disk option is available to handle this functionality, but it is too slow.
- How can performance be improved without pushing the data to the DB (ELT)? Can Talend handle huge data in the millions of rows with a smaller amount of RAM?
Thanks
Talend can process large amounts of data quickly and efficiently; it all depends on your knowledge of the Talend platform.
Please consider the comments below as answers to your questions.
Q1. How does Talend process data larger than the RAM size?
A. You cannot use your entire RAM for Talend Studio; only a fraction of it, roughly half, can be used.
For example, with 8 GB of memory available on a 64-bit system, the optimal settings in the Studio's .ini file can be:
-vmargs
-Xms1024m
-Xmx4096m
-XX:MaxPermSize=512m
-Dfile.encoding=UTF-8
Now, in your case, you either have to increase your RAM substantially (on the order of 100 GB for data of this size),
or simply spill the data to the hard disk. For this you have to choose a temp data directory for the buffer components, e.g. tMap, tBufferInput, tAggregateRow, etc.
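The disk option trades memory for I/O: conceptually, a disk-backed join sorts both inputs on the join key (via sorted runs in the temp directory) and then streams a merge join, so only one key group per side is resident in memory at a time. A minimal Python sketch of that idea follows; it is not Talend's actual implementation, and the function name and sample data are made up for illustration:

```python
# Sketch of a sort-merge join -- the idea behind tMap's "store temp data
# on disk" option. Inputs must already be sorted on the join key (e.g. by
# an external sort); memory use is then bounded by one key group per side,
# regardless of file size.
from itertools import groupby

def merge_join(left, right, key):
    """Inner join of two iterables pre-sorted on the join key."""
    liter, riter = groupby(left, key), groupby(right, key)
    lk, lgrp = next(liter, (None, None))
    rk, rgrp = next(riter, (None, None))
    while lgrp is not None and rgrp is not None:
        if lk == rk:
            # Materialize just this key group and emit the cross product
            lrows, rrows = list(lgrp), list(rgrp)
            for a in lrows:
                for b in rrows:
                    yield a, b
            lk, lgrp = next(liter, (None, None))
            rk, rgrp = next(riter, (None, None))
        elif lk < rk:
            lk, lgrp = next(liter, (None, None))  # unmatched left key
        else:
            rk, rgrp = next(riter, (None, None))  # unmatched right key
```

With 25-30 GB inputs the sort itself also has to run externally (sorted runs written to the temp directory, then merged), which is why the disk option is memory-bounded but slower.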
Q2. Is pipeline parallelism in place with Talend? Am I missing anything in the code to accomplish that?
A. In Talend Studio, parallelization of data flows means partitioning the input data flow of a subjob into parallel processes and executing them simultaneously, so as to gain better performance.
But this feature is available only if you have subscribed to one of the Talend Platform solutions.
When you have to develop a Job that processes very large data with Talend Studio, you can enable or disable parallelization with a single click, and the Studio then automates its implementation across the given Job.
The implementation of parallel execution requires four key steps, as explained below:
Partitioning (): In this step, the Studio splits the input records into a given number of threads.
Collecting (): In this step, the Studio collects the split threads and sends them to a given component for processing.
Departitioning (): In this step, the Studio groups the outputs of the parallel executions of the split threads.
Recollecting (): In this step, the Studio captures the grouped execution results and outputs them to a given component.
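The four steps above can be sketched in plain Python as a toy illustration using threads; this is not the code the Studio generates, and `process` and `n_threads` are made-up names:

```python
# Toy illustration of partition / collect / departition / recollect.
from concurrent.futures import ThreadPoolExecutor

def parallelize(records, process, n_threads=4):
    # Partitioning: split the input records into n_threads slices
    parts = [records[i::n_threads] for i in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # Collecting: send each slice to a worker for processing
        futures = [pool.submit(lambda p: [process(r) for r in p], part)
                   for part in parts]
        # Departitioning: group the outputs of the parallel executions
        grouped = [f.result() for f in futures]
    # Recollecting: flatten the grouped results for the downstream component
    return [row for grp in grouped for row in grp]
```

Note that, as in Talend, rows may come back in a different order than the input, so any order-sensitive step must follow the recollect stage.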
Q3. tUniqRow & join operations are done in physical memory, causing the job to run dead slow. A disk option is available to handle this functionality, but it is too slow.
Q4. How can performance be improved without pushing the data to the DB (ELT)? Can Talend handle huge data in the millions of rows with a smaller amount of RAM?
A 3 & 4. Here I would suggest inserting the data directly into the database using the tOutputBulkExec component, and then applying these operations at the DB level using the ELT components.
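For illustration, the bulk-load-then-ELT pattern looks like the sketch below, with sqlite3 standing in for the real target database; the table and column names are hypothetical, and in an actual Talend job the load step would be tOutputBulkExec while an ELT component would generate similar SQL:

```python
# Hedged sketch of the ELT pattern: bulk-load raw rows first, then let
# the database engine perform the join instead of the Talend JVM.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, cust_id INTEGER)")
conn.execute("CREATE TABLE customers (cust_id INTEGER, name TEXT)")

# Bulk insert (the role tOutputBulkExec plays via the DB's native loader)
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 10), (2, 20), (3, 10)])
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(10, "Ada"), (20, "Bob")])

# The join now runs inside the database (ELT), not in Talend's memory
rows = conn.execute("""
    SELECT o.order_id, c.name
    FROM orders o JOIN customers c ON o.cust_id = c.cust_id
    ORDER BY o.order_id
""").fetchall()
```

The database can use its own indexes, statistics, and disk-based join algorithms, which is why this usually outperforms an in-memory tMap join once the data no longer fits in RAM.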