Gluserfs详解

排版看着不舒服的，可以查看[我的简书]（https://www.jianshu.com/p/0340e429431b）

doc home：https://docs.gluster.org/en/latest/Quick-Start-Guide/Architecture/

⚠️本文主要对官网进行了翻译，更方便记录查看，解释有误的地方请大家指出，架构整理和源码详解会在之后相继发布文章。

FUSE

GlusterFS是一个userspace filesystem,这是GlusterFS开发人员最初做出的决定，因为将模块引入linux内核是一个非常漫长和困难的过程。

作为userspace filesystem,为了与kernel vfs交互，Glusterfs使用了FUSE。FUSE是一个kernel module，支持内核VFS和非特权用户应用程序之间的交互，并且有一个可以从用户空间访问的API，使用这个API，几乎可以使用您喜欢的任何语言来编写任何类型的文件系统，因为在FUSE和其他语言之间有许多绑定

Gluserfs 架构详解【译】官网-LMLPHP

这显示了一个文件系统“hello world”，它被编译来创建一个二进制“hello”。它是在文件系统挂载点/tmp/fuse上执行的。然后用户在挂载点/tmp/fuse上发出ls -l命令。这个命令通过glibc到达VFS，因为mount /tmp/fuse对应于基于fuse的文件系统，所以VFS将它传递给fuse模块。FUSE内核模块在用户空间(libfuse)中通过glibc和FUSE库之后，与实际的文件系统二进制文件“hello”进行联系。结果由“hello”通过相同的路径返回，并到达ls -l命令。

FUSE内核模块与FUSE库(libfuse)之间的通信是通过一个特殊的file descriptor进行的，该file descriptor是通过打开/dev/fuse. .来获取的可以多次打开这个文件，并将获得的file descriptor传递给挂载的syscall，以便将描述符与挂载的文件系统相匹配。

More about userspace filesystems

FUSE reference

Translators

Translating “translators”:

A translator converts requests from users into requests for storage.(将用户请求转换为存储请求)

*One to one, one to many, one to zero (e.g. caching)

Gluserfs 架构详解【译】官网-LMLPHP

A translator can modify requests on the way through :

translator可以通过一下方式修改请求

convert one request type to another ( during the request transfer amongst the translators) modify paths, flags, even data (e.g. encryption)

将一种请求类型转换为另一种请求类型(在转换器之间的请求传输期间)，修改paths、flags甚至data(例如，encryption)
Translators can intercept or block the requests. (e.g. access control)

Translators可以拦截或阻止请求(如访问控制)
Or spawn new requests (e.g. pre-fetch)

或产生新的请求(如预取)

How Do Translators Work?

Shared Objects 共享对象
Dynamically loaded according to 'volfile'
根据“volfile”动态加载

dlopen/dlsync setup pointers to parents / children call init (constructor) call IO functions through fops.

dlopen/dlsync设置指针的父/子调用init(构造函数)通过fops调用IO函数
Conventions for validating/ passing options, etc.
约定验证/传递选项等
The configuration of translators (since GlusterFS 3.1) is managed through the gluster command line interface (cli), so you don't need to know in what order to graph the translators together.
Translators 配置(自从GlusterFS 3.1)是通过gluster命令行接口(cli)进行管理的，所以不需要知道以什么顺序将这些翻译器组合在一起。

Types of Translators

List of known translators with their current status.

Storage	Lowest level translator, stores and accesses data from local file system.	Lowest level转换器，存储和访问本地文件系统中的数据
Debug	Provide interface and statistics for errors and debugging	提供错误和调试的接口和统计信息。
Cluster	Handle distribution and replication of data as it relates to writing to and reading from bricks & nodes.	处理数据的分布和复制，因为它涉及到对块和节点的写入和读取
Encryption	Extension translators for on-the-fly encryption/decryption of stored data.	动态加密/解密存储数据的加密扩展
Protocol	Extension translators for client/server communication protocols.	translators 客户端/服务器通信协议的扩展程序
Performance	Tuning translators to adjust for workload and I/O profiles.	调整translator以适应工作负载和I/O配置文件
Bindings	Add extensibility, e.g. The Python interface written by Jeff Darcy to extend API interaction with GlusterFS.	添加可扩展性，例如Jeff Darcy编写的Python接口，以扩展与GlusterFS的API交互
System	System access translators, e.g. Interfacing with file system access control.	文件系统访问接口
Scheduler	I/O schedulers that determine how to distribute new write operations across clustered systems.	I/O调度程序，确定如何跨集群系统分发新的写操作
Features	Add additional features such as Quotas, Filters, Locks, etc.	添加额外的特性，如配额、过滤器、锁等

The default / general hierarchy of translators in vol files :

Gluserfs 架构详解【译】官网-LMLPHP

All the translators hooked together to perform a function is called a graph. The left-set of translators comprises of Client-stack.The right-set of translators comprises of Server-stack.

所有translator hook在一起执行一个function称作一个graph

The glusterfs translators can be sub-divided into many categories, but two important categories are - Cluster and Performance translators :

gluster translator可以分为很多类别，两个重要的类别是Cluster and Performance（性能） translators

One of the most important and the first translator the data/request has to go through is fuse translator which falls under the category of Mount Translators.

data/request必须通过的一个translator是fuse translator，它属于Mount translator的范畴。

Cluster Translators:

DHT(Distributed Hash Table)(分布式Hash Table)
AFR(Automatic File Replication)(自动文本复制)

Performance Translators:

io-cache
io-threads
md-cache
O-B (open behind)
QR (quick read)
r-a (read-ahead)
w-b (write-behind)

例如：gluster volume info查看到

[root@node4 /]# gluster volume info

Volume Name: heketidbstorage

Type: Replicate

Volume ID: d141e423-cc06-4fa5-a7ef-5edbc1b405ce

Status: Started

Snapshot Count: 0

Number of Bricks: 1 x 3 = 3

Transport-type: tcp

Bricks:

Brick1: 10.8.4.184:/var/lib/heketi/mounts/vg_cbdb3ac545853d57be6c39db60b7f647/brick_661b5cdd36ee50aaeeeb3490737f2851/brick

Brick2: 10.8.4.182:/var/lib/heketi/mounts/vg_ddfdf2348bf510c356d5234e0ed0a0ec/brick_5ca55df9bbb54e531d6fc4205b297503/brick

Brick3: 10.8.4.183:/var/lib/heketi/mounts/vg_2ceb6870ad884b6507a767308980cf8a/brick_eaa40efdc196a5a18158c79e0b1b0459/brick

Options Reconfigured:

user.heketi.id: 4550c84383c151de59bd6679e73b9117

user.heketi.dbstoragelevel: 1

performance.readdir-ahead: off

performance.io-cache: off

performance.read-ahead: off

performance.strict-o-direct: on

performance.quick-read: off

performance.open-behind: off

performance.write-behind: off

performance.stat-prefetch: off

transport.address-family: inet

nfs.disable: on

performance.client-io-threads: off

Other Feature Translators include:

changelog (日志更新)
locks - GlusterFS has locks translator which provides the following internal locking operations called inodelk, entrylk, which are used by afr to achieve synchronization of operations on files or directories that conflict with each other.
locks - GlusterFS,提供了以下称为inodelk、entrylk的内部锁定操作，afr使用这些操作来实现对相互冲突的文件或目录的操作的synchronization
marker (标记)
quota (配额)

Debug Translators

trace - To trace the error logs generated during the communication amongst the translators.
跟踪translator程序之间通信过程中产生的错误日志
io-stats (io状态)

DHT(Distributed Hash Table) Translator

What is DHT?

DHT是GlusterFS跨多个服务器聚合容量和性能的真正核心。它的职责是将每个文件精确地放在其中的一个subvolumes(子卷)上——这与replication(将副本放在所有子卷上)和striping(将分片放在所有子卷上)不同。它是一个路由函数，而不是分割或复制。

How DHT works?

DHT中使用的基本方法是consistent hashing。每一个subvolume(brick)在32-bit hash space分配一个范围，覆盖entire range（整个范围），没有holes or overlaps（漏洞或重叠）。然后，通过hash每个文件的名称，在相同的空间中为每个文件分配一个值。只有一个brick具有指定的范围，包括文件的hash value，因此文件“应该”位于该brick上。但是，在许多情况下，情况并非如此，例如brick集(因此，范围的范围分配)发生了变化，或者brick几乎满了。DHT中的许多复杂性都涉及到这些特殊情况，我们稍后将对此进行讨论

当您打开一个file时，distribute translator将会提供一条信息来查找您的文件，即file-name文件名称。为了确定文件的位置，translator通过哈希算法运行文件名，以便将文件名转换为数字。

A few Observations of DHT hash-values assignment:

hash范围到brick的分配是由存储在目录中的扩展属性决定的，因此分布是特定于目录的。Consistent hashing is usually thought of as hashing around a circle, but in GlusterFS it’s more linear 但在GlusterFS中它更线性，没有必要在0处“缠绕”，因为在0处总是有一个断点(在一个brick的范围和另brick的范围之间)。如果丢了一个brick，hash space就会有一个洞，更糟糕的是，如果在brick离线时重新分配hash范围，那么一些新的范围可能与brick上存储的(现在已经过时的)范围重叠，从而造成文件应该放在哪里的混乱。

AFR(Automatic File Replication) Translator

GlusterFS中的AFR使用扩展属性来跟踪文件操作。它负责跨块复制数据

保持复制一致性(即两个brick上的数据应该是相同的，即使在多个应用程序/挂载点并行地在相同的文件/目录上发生操作的情况下，只要复制集中的所有brick都已建立)。

提供一种在发生故障时恢复数据的方法，只要至少有一个brick具有正确的数据。

为read/stat/readdir等提供新的数据

Geo-Replication

Geo-replication提供跨地理位置的异步数据复制，这在Glusterfs 3.2中介绍过。它主要跨WAN工作，用于复制整个卷，而不像AFR，它是集群内复制。这主要用于备份整个数据以进行灾难恢复。

Geo-replication使用主从模型，在主从(GlusterFS卷)和从(可以是本地目录或GlusterFS卷)之间进行复制。从(本地目录或卷使用SSH隧道访问)。

Geo-replication通过局域网(LANs)、广域网(WANs)和Internet提供增量复制服务。

Geo-replication over LAN

You can configure Geo-replication to mirror data over a Local Area Network.

Gluserfs 架构详解【译】官网-LMLPHP

Geo-replication over WAN

You can configure Geo-replication to replicate data over a Wide Area Network.

Gluserfs 架构详解【译】官网-LMLPHP

Geo-replication over Internet

You can configure Geo-replication to mirror data over the Internet.

Gluserfs 架构详解【译】官网-LMLPHP

Multi-site cascading Geo-replication

You can configure Geo-replication to mirror data in a cascading fashion across multiple sites.

Gluserfs 架构详解【译】官网-LMLPHP

There are mainly two aspects while asynchronously replicating data:

1.Change detection - These include file-operation necessary details. There are two methods to sync the detected changes: (两种方法来同步检测到的变化:)

变更日志-变更日志是一个记录发生的fops的必要细节的翻译程序。更改可以用二进制格式或ASCII编写。有三个类别，每个类别由特定的changelog格式表示。所有这三种类型的类别都记录在一个单独的变更日志文件中。

为了记录所执行的操作和实体的类型，使用了类型标识符。通常情况下,实体的操作将被执行路径名,但我们选择使用从而内部文件标识符(GFID)而不是(从而支持基于GFID端和路径名字段可能并不总是有效的)。因此，三种操作的记录格式可以概括为:

GFID类似于inode。数据和元fops记录执行操作的实体的GFID，从而记录inode上的数据/元数据更改。输入fops记录至少有6条或7条记录(取决于操作的类型)，这足以确定实体所经历的操作类型。通常，这个记录包括实体的GFID、文件操作的类型(它是一个整数[Glusterfs中使用的枚举值])、父GFID和basename(类似于父inode和basename)。

Changelog文件在特定的时间间隔后滚动。然后，我们对文件执行处理操作，例如将其转换为可理解/人类可读的格式，保留更改日志的私有副本等。然后，库使用这些日志并为应用程序请求提供服务。

在T2时，创建了一个新文件File2。这将触发从File2到根i的xtime标记(其中xtime是当前时间戳)。e, File2的xtime, Dir3, Dir1，最后Dir0都会被更新。

Geo-replication守护进程根据xtime(主)> xtime(从)的条件抓取文件系统。因此，在我们的示例中，它只会爬行目录结构的左侧部分，因为目录结构的右侧部分仍然具有相同的时间戳。虽然抓取算法比较快，但仍然需要抓取目录结构的一部分。

2.Replication - We use rsync for data replication. Rsync is an external utility which will calculate the diff of the two files and sends this difference from source to sync.
Replication——我们使用rsync进行数据复制。Rsync是一个外部实用程序，它将计算两个文件的差异，并将这种差异从源文件发送到sync。

一旦GlusterFS安装到服务器节点上，就会创建一个gluster管理守护进程(glusterd)二进制文件。这个守护进程应该在集群中所有参与节点中运行。启动glusterd之后，可以创建由所有storage server nodes(TSP甚至可以包含单个节点)组成的可信服务器池(TSP)。这个TSP中的任意数量的brick可以被连接在一起形成一个volume。

创建卷之后，glusterfsd进程将在每个participating brick(参与的brick)中运行。与此同时，将在/var/lib/glusterd/vol/中生成称为vol的配置文件,将有与volume中的每个brick对应的配置文件。这将包含关于特定砖块的所有细节。还将创建客户机进程所需的配置文件。现在我们的文件系统可以使用了。我们可以很容易地将这个卷挂载在客户端机器上，如下所示，并像使用本地存储一样使用它:

在客户机中挂载卷时，客户机glusterfs进程与服务器的glusterd进程通信。服务器glusterd进程发送一个配置文件(vol文件)，其中包含the list of client translators，另一个配置文件包含卷中每个brick的信息，在此帮助下，客户端glusterfs进程现在可以直接与每个brick的glusterfsd进程通信。安装现在已经完成，卷已经为客户端服务做好了准备。

当客户机在挂载的文件系统中发出系统调用(文件操作或Fop)时，VFS(确定文件系统的类型为glusterfs)将把请求发送到FUSE内核模块。FUSE内核模块将依次通过/dev/fuse将其发送到客户机节点的userspace中的GlusterFS(这在FUSE部分已经描述过了)。客户机上的GlusterFS进程由一组 client translators的translator程序组成，这些translatot在存储服务器glusterd进程发送的配置文件(vol文件)中定义。这些转换器中的第一个是FUSE Translator，它由FUSE库(libfuse)组成。每个translator都具有与glusterfs支持的每个文件操作或fop对应的function。该请求将在每个translator中执行相应的功能。主要client translators包括:

在storage server node中包含需要的brick，请求再次通过一系列称为server translators的translator，主要有:

请求将最终到达VFS，然后与底层本机文件系统通信。响应将重新跟踪相同的路径。