This article looks at how write acknowledgments for data writes are handled in Hadoop 2.0; it should be a useful reference if you are working through the same question.

Problem Description

I have a small query regarding Hadoop data writes.

From the Apache documentation:

For the common case, when the replication factor is three, HDFS's placement policy is to put one replica on one node in the local rack, another on a node in a different (remote) rack, and the last on a different node in that same remote rack. This policy cuts inter-rack write traffic, which generally improves write performance. The chance of rack failure is far less than that of node failure.

In the image below, when is the write acknowledgment treated as successful?

1) When the data has been written to the first datanode?

2) When the data has been written to the first datanode + 2 other datanodes?

I am asking because I have heard two conflicting statements in YouTube videos. One video said the write is successful once the data has been written to one datanode, while the other said the acknowledgment is sent only after the data has been written to all three datanodes.

Recommended Answer

Step 1: The client creates the file by calling the create() method on DistributedFileSystem.

Step 2: DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it.

The namenode performs various checks to make sure the file doesn't already exist and that the client has the right permissions to create it. If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and an IOException is thrown back to the client. DistributedFileSystem then returns an FSDataOutputStream for the client to start writing data to.
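
As a rough client-side illustration of steps 1 and 2, here is a minimal sketch using the standard FileSystem API (the path is made up for the example): create() is what triggers the namenode RPC, and any failed check surfaces as an IOException before a single block has been allocated.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);      // a DistributedFileSystem when fs.defaultFS is hdfs://

        Path file = new Path("/tmp/example.txt");  // hypothetical path, just for the sketch
        // Steps 1-2: create() makes the RPC to the namenode; it throws an IOException
        // (file already exists, missing permissions, ...) before any data is written.
        try (FSDataOutputStream out = fs.create(file, false /* do not overwrite */)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }   // close() corresponds to steps 6-7 below
    }
}
```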

Step 3: As the client writes data, DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline, and here we'll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline.
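
The replication level of three described above is just the usual default. As a hedged sketch (reusing the imports from the earlier example; the path and values are illustrative), the replication factor can be set on the Configuration or passed per file to create(), and it determines how many datanodes end up in the write pipeline:

```java
Configuration conf = new Configuration();
conf.setInt("dfs.replication", 3);          // client-side default replication factor
// The packets placed on the data queue are sized by the client property
// dfs.client-write-packet-size (64 KB by default in Hadoop 2).

FileSystem fs = FileSystem.get(conf);
short replication = 3;                      // number of datanodes in the write pipeline
long blockSize = 128L * 1024 * 1024;        // 128 MB, the Hadoop 2 default block size
FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"),
        true /* overwrite */, 4096 /* buffer size */, replication, blockSize);
// ... write and close as in the first sketch
```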

Step 4: Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline.

Step 5: DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline.
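
This ack queue is also what the durability calls on FSDataOutputStream wait on. A hedged sketch, reusing the handles from the first example: hflush() returns only after the outstanding packets have been acknowledged by every datanode in the pipeline, and hsync() additionally asks the datanodes to persist the data to disk.

```java
try (FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"), true)) {
    out.write("first record\n".getBytes(StandardCharsets.UTF_8));
    // Blocks until every datanode in the pipeline has acknowledged the outstanding
    // packets; the data is then visible to new readers, though not necessarily on disk.
    out.hflush();
    // Stronger guarantee: also asks each datanode to sync the data to its disks.
    out.hsync();
}
```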

Step 6: When the client has finished writing data, it calls close() on the stream.

Step 7: This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete. The namenode already knows which blocks the file is made up of, so it only has to wait for the blocks to be minimally replicated before returning successfully.
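
Putting the steps together (the recap below is a hedged sketch, not quoted from the answer): the acknowledgment for each packet travels back up the pipeline only after every datanode has received it, which is what option 2 in the question describes at the packet level, while the file-level success reported by close() only requires each block to be minimally replicated on the namenode's side.

```java
// Condensed recap of steps 1-7, reusing the imports and FileSystem handle from the
// first sketch (the path is hypothetical):
byte[] payload = "example payload".getBytes(StandardCharsets.UTF_8);
try (FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"), true)) {
    out.write(payload);    // split into packets on the data queue (step 3)
    // each packet flows DN1 -> DN2 -> DN3 and leaves the ack queue only once all
    // three datanodes have acknowledged it (steps 3-5)
}
// close() (step 6) flushes the remaining packets, waits for their acks, then tells the
// namenode the file is complete (step 7). The namenode only waits for the blocks to be
// minimally replicated -- dfs.namenode.replication.min, default 1 -- before it reports
// success.
```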

This concludes the discussion of data-write acknowledgment in Hadoop 2.0; we hope the recommended answer is helpful.
