Original link: https://nsq.io/overview/internals.html

NSQ is composed of 3 daemons:

  • nsqd is the daemon that receives, queues, and delivers messages to clients.

  • nsqlookupd is the daemon that manages topology information and provides an eventually consistent discovery service.

  • nsqadmin is a web UI to introspect the cluster in realtime (and perform various administrative tasks).

Data flow in NSQ is modeled as a tree of streams and consumers.

A topic is a distinct stream of data.

A channel is a logical grouping of consumers subscribed to a given topic.

[figure: topics/channels]

A single nsqd can have many topics and each topic can have many channels.

A channel receives a copy of all the messages for the topic, enabling multicast style delivery while each message on a channel is distributed amongst its subscribers, enabling load-balancing.

These primitives form a powerful framework for expressing a variety of simple and complex topologies.

For more information about the design of NSQ see the design doc.

Topics and Channels

Topics and channels, the core primitives of NSQ, best exemplify how the design of the system translates seamlessly to the features of Go.

Go’s channels (henceforth referred to as “go-chan” for disambiguation) are a natural way to express queues, thus an NSQ topic/channel, at its core, is just a buffered go-chan of Message pointers. The size of the buffer is equal to the --mem-queue-size configuration parameter.

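As a sketch, that core structure might look like the following (the Message fields and the hard-coded queue size are illustrative stand-ins, not nsqd's actual definitions):

```go
package main

import "fmt"

// Message is an illustrative stand-in for nsqd's message struct.
type Message struct {
	ID   [16]byte
	Body []byte
}

// newMemoryQueue models a topic/channel's in-memory queue: a buffered
// go-chan of Message pointers whose capacity is --mem-queue-size.
func newMemoryQueue(memQueueSize int) chan *Message {
	return make(chan *Message, memQueueSize)
}

func main() {
	memoryMsgChan := newMemoryQueue(3)

	// Sends succeed without a waiting receiver until the buffer fills.
	for i := 0; i < 3; i++ {
		memoryMsgChan <- &Message{Body: []byte("payload")}
	}
	fmt.Println(len(memoryMsgChan), cap(memoryMsgChan)) // 3 3
}
```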

After reading data off the wire, the act of publishing a message to a topic involves:

  1. instantiation of a Message struct (and allocation of the message body []byte)

  2. read-lock to get the Topic

  3. read-lock to check for the ability to publish

  4. send on a buffered go-chan
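These four steps can be sketched roughly as follows (the types, field names, and the simplified publish-ability check are assumptions for illustration; nsqd's real check involves the topic's exit flag):

```go
package main

import (
	"fmt"
	"sync"
)

// Message and Topic are simplified stand-ins for nsqd's types.
type Message struct {
	Body []byte
}

type Topic struct {
	memoryMsgChan chan *Message
}

type NSQD struct {
	sync.RWMutex
	topicMap map[string]*Topic
}

// GetTopic takes a read-lock to look up the topic (step 2).
func (n *NSQD) GetTopic(name string) *Topic {
	n.RLock()
	defer n.RUnlock()
	return n.topicMap[name]
}

func (n *NSQD) Publish(topicName string, body []byte) bool {
	msg := &Message{Body: body} // step 1: instantiate Message, allocate body
	t := n.GetTopic(topicName)  // step 2: read-lock to get the Topic
	if t == nil {               // step 3: check the ability to publish (simplified)
		return false
	}
	t.memoryMsgChan <- msg // step 4: send on a buffered go-chan
	return true
}

func main() {
	n := &NSQD{topicMap: map[string]*Topic{
		"events": {memoryMsgChan: make(chan *Message, 10)},
	}}
	fmt.Println(n.Publish("events", []byte("hello"))) // true
}
```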

To get messages from a topic to its channels the topic cannot rely on typical go-chan receive semantics, because multiple goroutines receiving on a go-chan would distribute the messages while the desired end result is to copy each message to every channel (goroutine).

Instead, each topic maintains 3 primary goroutines. The first one, called router, is responsible for reading newly published messages off the incoming go-chan and storing them in a queue (memory or disk).

The second one, called messagePump, is responsible for copying and pushing messages to channels as described above.

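A minimal sketch of the copy-to-every-channel behavior (names are illustrative; nsqd's actual messagePump also handles exit signaling and channel updates):

```go
package main

import "fmt"

type Message struct{ Body []byte }

type Channel struct {
	name            string
	incomingMsgChan chan *Message
}

// messagePump reads each published message and delivers a copy to every
// channel, so each channel owns its own instance (the first channel can
// reuse the original).
func messagePump(memoryMsgChan chan *Message, channels []*Channel) {
	for msg := range memoryMsgChan {
		for i, c := range channels {
			chanMsg := msg
			if i > 0 {
				copied := *msg
				chanMsg = &copied
			}
			c.incomingMsgChan <- chanMsg
		}
	}
}

func main() {
	in := make(chan *Message, 1)
	chans := []*Channel{
		{name: "a", incomingMsgChan: make(chan *Message, 1)},
		{name: "b", incomingMsgChan: make(chan *Message, 1)},
	}
	in <- &Message{Body: []byte("x")}
	close(in)
	messagePump(in, chans)
	fmt.Println(len(chans[0].incomingMsgChan), len(chans[1].incomingMsgChan)) // 1 1
}
```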

The third is responsible for DiskQueue IO and will be discussed later.

Channels are a little more complicated but share the underlying goal of exposing a single input and single output go-chan (to abstract away the fact that, internally, messages might be in memory or on disk):

[figure: queue goroutine]

Additionally, each channel maintains 2 time-ordered priority queues responsible for deferred and in-flight message timeouts (and 2 accompanying goroutines for monitoring them).


Parallelization is improved by managing a per-channel data structure, rather than relying on the Go runtime’s global timer scheduler.

Note: Internally, the Go runtime uses a single priority queue and goroutine to manage timers. This supports (but is not limited to) the entirety of the time package.

It normally obviates the need for a user-land time-ordered priority queue but it’s important to keep in mind that it’s a single data structure with a single lock, potentially impacting GOMAXPROCS > 1 performance. See runtime/time.go.

Backend / DiskQueue

One of NSQ’s design goals is to bound the number of messages kept in memory.

It does this by transparently writing message overflow to disk via DiskQueue (which owns the third primary goroutine for a topic or channel).

Since the memory queue is just a go-chan, it’s trivial to route messages to memory first, if possible, then fallback to disk:

for msg := range c.incomingMsgChan {
    select {
    case c.memoryMsgChan <- msg:
    default:
        err := WriteMessageToBackend(&msgBuf, msg, c.backend)
        if err != nil {
            // ... handle errors ...
        }
    }
}

Taking advantage of Go’s select statement allows this functionality to be expressed in just a few lines of code: the default case above only executes if memoryMsgChan is full.

NSQ also has the concept of ephemeral topics/channels.

They discard message overflow (rather than write to disk) and disappear when they no longer have clients subscribed. This is a perfect use case for Go’s interfaces.

Topics and channels have a struct member declared as a Backend interface rather than a concrete type.

Normal topics and channels use a DiskQueue while ephemeral ones stub in a DummyBackendQueue, which implements a no-op Backend.

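A sketch of that interface-based design (the Backend method set here is abbreviated from nsqd's actual interface):

```go
package main

import "fmt"

// Backend abstracts over where overflow messages go.
type Backend interface {
	Put([]byte) error
	ReadChan() chan []byte
	Close() error
	Depth() int64
}

// dummyBackendQueue is the no-op implementation ephemeral
// topics/channels stub in: overflow is simply discarded.
type dummyBackendQueue struct {
	readChan chan []byte
}

func (d *dummyBackendQueue) Put([]byte) error      { return nil } // drop the message
func (d *dummyBackendQueue) ReadChan() chan []byte { return d.readChan }
func (d *dummyBackendQueue) Close() error          { return nil }
func (d *dummyBackendQueue) Depth() int64          { return 0 }

func main() {
	var b Backend = &dummyBackendQueue{readChan: make(chan []byte)}
	b.Put([]byte("overflow")) // silently discarded
	fmt.Println(b.Depth())    // 0: nothing was persisted
}
```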

Reducing GC Pressure

In any garbage collected environment you’re subject to the tension between throughput (doing useful work), latency (responsiveness), and resident set size (footprint).

As of Go 1.2, the GC is mark-and-sweep (parallel), non-generational, non-compacting, stop-the-world and mostly precise. It’s mostly precise because the remainder of the work wasn’t completed in time (it’s slated for Go 1.3).

The Go GC will certainly continue to improve, but the universal truth is: the less garbage you create the less time you’ll collect.

First, it’s important to understand how the GC is behaving under real workloads.

To this end, nsqd publishes GC stats in statsd format (alongside other internal metrics).

nsqadmin displays graphs of these metrics, giving you insight into the GC’s impact in both frequency and duration:

[figure: single node view]

In order to actually reduce garbage you need to know where it’s being generated.

Once again the Go toolchain provides the answers:

  1. Use the testing package and go test -benchmem to benchmark hot code paths. It profiles the number of allocations per iteration (and benchmark runs can be compared with benchcmp).

  2. Build using go build -gcflags -m, which outputs the result of escape analysis.

With that in mind, the following optimizations proved useful for nsqd:

  1. Avoid []byte to string conversions.

  2. Re-use buffers or objects (and someday possibly sync.Pool aka issue 4720).

  3. Pre-allocate slices (specify capacity in make) and always know the number and size of items over the wire.

  4. Apply sane limits to various configurable dials (such as message size).

  5. Avoid boxing (use of interface{}) or unnecessary wrapper types (like a struct for a “multiple value” go-chan).

  6. Avoid the use of defer in hot code paths (it allocates).
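As a concrete example of item 3, when the protocol states the item count and size up front, one allocation with the right capacity replaces repeated slice growth (readN is a hypothetical helper, not nsqd code):

```go
package main

import "fmt"

// readN allocates the outer slice once with a known capacity, so the
// appends below never reallocate its backing array.
func readN(count, size int) [][]byte {
	items := make([][]byte, 0, count) // capacity known ahead of time
	for i := 0; i < count; i++ {
		items = append(items, make([]byte, size))
	}
	return items
}

func main() {
	items := readN(4, 16)
	fmt.Println(len(items), cap(items)) // 4 4
}
```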

TCP Protocol

The NSQ TCP protocol is a shining example of a section where these GC optimization concepts are utilized to great effect.

The protocol is structured with length prefixed frames, making it straightforward and performant to encode and decode:

[x][x][x][x][x][x][x][x][x][x][x][x]...
|  (int32) ||  (int32) || (binary)
|  4-byte  ||  4-byte  || N-byte
------------------------------------...
    size      frame ID     data

Since the exact type and size of a frame’s components are known ahead of time, we can avoid the encoding/binary package’s convenience Read() and Write() wrappers (and their extraneous interface lookups and conversions) and instead call the appropriate binary.BigEndian methods directly.

To reduce socket IO syscalls, client net.Conn connections are wrapped with bufio.Reader and bufio.Writer.

The Reader exposes ReadSlice(), which reuses its internal buffer.

This nearly eliminates allocations while reading off the socket, greatly reducing GC pressure.


This is possible because the data associated with most commands does not escape (in the edge cases where this is not true, the data is explicitly copied).

At an even lower level, a MessageID is declared as [16]byte to be able to use it as a map key (slices cannot be used as map keys).

However, since data read from the socket is stored as []byte, rather than produce garbage by allocating string keys, and to avoid a copy from the slice to the backing array of the MessageID, the unsafe package is used to cast the slice directly to a MessageID:

id := *(*nsq.MessageID)(unsafe.Pointer(&msgID))

Note: This is a hack. It wouldn’t be necessary if this was optimized by the compiler and Issue 3512 is open to potentially resolve this.

It’s also worth reading through issue 5376, which talks about the possibility of a “const like” byte type that could be used interchangeably where string is accepted, without allocating and copying.

Similarly, the Go standard library only provides numeric conversion methods on a string. In order to avoid string allocations, nsqd uses a custom base 10 conversion method that operates directly on a []byte.

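A sketch of such a conversion (simplified; nsqd's real helper also guards against overflow):

```go
package main

import "fmt"

// parseBase10 parses an unsigned decimal number directly from a []byte,
// avoiding the string allocation strconv.Atoi would require.
func parseBase10(b []byte) (int64, bool) {
	if len(b) == 0 {
		return 0, false
	}
	var n int64
	for _, c := range b {
		if c < '0' || c > '9' {
			return 0, false
		}
		n = n*10 + int64(c-'0')
	}
	return n, true
}

func main() {
	n, ok := parseBase10([]byte("12345"))
	fmt.Println(n, ok) // 12345 true
}
```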

These may seem like micro-optimizations but the TCP protocol contains some of the hottest code paths.

In aggregate, at the rate of tens of thousands of messages per second, they have a significant impact on the number of allocations and overhead:

benchmark                    old ns/op    new ns/op    delta
BenchmarkProtocolV2Data           3575         1963  -45.09%

benchmark                    old ns/op    new ns/op    delta
BenchmarkProtocolV2Sub256        57964        14568  -74.87%
BenchmarkProtocolV2Sub512        58212        16193  -72.18%
BenchmarkProtocolV2Sub1k         58549        19490  -66.71%
BenchmarkProtocolV2Sub2k         63430        27840  -56.11%

benchmark                   old allocs   new allocs    delta
BenchmarkProtocolV2Sub256           56           39  -30.36%
BenchmarkProtocolV2Sub512           56           39  -30.36%
BenchmarkProtocolV2Sub1k            56           39  -30.36%
BenchmarkProtocolV2Sub2k            58           42  -27.59%

HTTP

NSQ’s HTTP API is built on top of Go’s net/http package.

Because it’s just HTTP, it can be leveraged in almost any modern programming environment without special client libraries.

Its simplicity belies its power, as one of the most interesting aspects of Go’s HTTP tool-chest is the wide range of debugging capabilities it supports.

The net/http/pprof package integrates directly with the native HTTP server, exposing endpoints to retrieve CPU, heap, goroutine, and OS thread profiles. These can be targeted directly from the go tool:

$ go tool pprof http://127.0.0.1:4151/debug/pprof/profile

This is tremendously valuable for debugging and profiling a running process!

In addition, a /stats endpoint returns a slew of metrics in either JSON or pretty-printed text, making it easy for an administrator to introspect from the command line in realtime:

$ watch -n 0.5 'curl -s http://127.0.0.1:4151/stats | grep -v connected'

This produces continuous output like:

[page_views     ] depth: 0     be-depth: 0     msgs: 105525994 e2e%: 6.6s, 6.2s, 6.2s
    [page_view_counter        ] depth: 0     be-depth: 0     inflt: 432  def: 0    re-q: 34684 timeout: 34038 msgs: 105525994 e2e%: 5.1s, 5.1s, 4.6s
    [realtime_score           ] depth: 1828  be-depth: 0     inflt: 1368 def: 0    re-q: 25188 timeout: 11336 msgs: 105525994 e2e%: 9.0s, 9.0s, 7.8s
    [variants_writer          ] depth: 0     be-depth: 0     inflt: 592  def: 0    re-q: 37068 timeout: 37068 msgs: 105525994 e2e%: 8.2s, 8.2s, 8.2s

[poll_requests  ] depth: 0     be-depth: 0     msgs: 11485060 e2e%: 167.5ms, 167.5ms, 138.1ms
    [social_data_collector    ] depth: 0     be-depth: 0     inflt: 2    def: 3    re-q: 7568  timeout: 402   msgs: 11485060 e2e%: 186.6ms, 186.6ms, 138.1ms

[social_data    ] depth: 0     be-depth: 0     msgs: 60145188 e2e%: 199.0s, 199.0s, 199.0s
    [events_writer            ] depth: 0     be-depth: 0     inflt: 226  def: 0    re-q: 32584 timeout: 30542 msgs: 60145188 e2e%: 6.7s, 6.7s, 6.7s
    [social_delta_counter     ] depth: 17328 be-depth: 7327  inflt: 179  def: 1    re-q: 155843 timeout: 11514 msgs: 60145188 e2e%: 234.1s, 234.1s, 231.8s

[time_on_site_ticks] depth: 0     be-depth: 0     msgs: 35717814 e2e%: 0.0ns, 0.0ns, 0.0ns
    [tail821042#ephemeral     ] depth: 0     be-depth: 0     inflt: 0    def: 0    re-q: 0     timeout: 0     msgs: 33909699 e2e%: 0.0ns, 0.0ns, 0.0ns

Finally, each new Go release typically brings measurable performance gains.

It’s always nice when recompiling against the latest version of Go provides a free boost!

Dependencies

Coming from other ecosystems, Go’s philosophy (or lack thereof) on managing dependencies takes a little time to get used to.

NSQ evolved from being a single giant repo, with relative imports and little to no separation between internal packages, to fully embracing the recommended best practices with respect to structure and dependency management.

There are two main schools of thought:

  1. Vendoring: copy dependencies at the correct revision into your application’s repo and modify your import paths to reference the local copy.

  2. Virtual Env: list the revisions of dependencies you require and at build time, produce a pristine GOPATH environment containing those pinned dependencies.

Note: This really only applies to binary packages as it doesn’t make sense for an importable package to make intermediate decisions as to which version of a dependency to use.

NSQ uses gpm to provide support for (2) above.

It works by recording your dependencies in a Godeps file, which we later use to construct a GOPATH environment.

Testing

Go provides solid built-in support for writing tests and benchmarks and, because Go makes it so easy to model concurrent operations, it’s trivial to stand up a full-fledged instance of nsqd inside your test environment.

However, there was one aspect of the initial implementation that became problematic for testing: global state. The most obvious offender was the use of a global variable that held the reference to the instance of nsqd at runtime, i.e. var nsqd *NSQd.

Certain tests would inadvertently mask this global variable in their local scope by using short-form variable assignment, i.e. nsqd := NewNSQd(...).

This meant that the global reference did not point to the instance that was currently running, breaking tests.

To resolve this, a Context struct is passed around that contains configuration metadata and a reference to the parent nsqd.

All references to global state were replaced with this local Context, allowing children (topics, channels, protocol handlers, etc.) to safely access this data and making it more reliable to test.


Robustness

A system that isn’t robust in the face of changing network conditions or unexpected events is a system that will not perform well in a distributed production environment.

NSQ is designed and implemented in a way that allows the system to tolerate failure and behave in a consistent, predictable, and unsurprising way.

The overarching philosophy is to fail fast, treat errors as fatal, and provide a means to debug any issues that do occur.

But, in order to react you need to be able to detect exceptional conditions…

Heartbeats and Timeouts

The NSQ TCP protocol is push oriented. After connection, handshake, and subscription the consumer is placed in a RDY state of 0.

When the consumer is ready to receive messages it updates that RDY state to the number of messages it is willing to accept.

NSQ client libraries continually manage this behind the scenes, resulting in a flow-controlled stream of messages.

Periodically, nsqd will send a heartbeat over the connection.

The client can configure the interval between heartbeats but nsqd expects a response before it sends the next one.

The combination of application level heartbeats and RDY state avoids head-of-line blocking, which can otherwise render heartbeats useless (i.e. if a consumer is behind in processing message flow the OS’s receive buffer will fill up, blocking heartbeats).

To guarantee progress, all network IO is bound with deadlines relative to the configured heartbeat interval. This means that you can literally unplug the network connection between nsqd and a consumer and it will detect and properly handle the error.


When a fatal error is detected the client connection is forcibly closed. In-flight messages are timed out and re-queued for delivery to another consumer. Finally, the error is logged and various internal metrics are incremented.

Managing Goroutines

It’s surprisingly easy to start goroutines. Unfortunately, it isn’t quite as easy to orchestrate their cleanup. Avoiding deadlocks is also challenging.

Most often this boils down to an ordering problem, where a goroutine receiving on a go-chan exits before the upstream goroutines sending on it.

Why care at all though? It’s simple, an orphaned goroutine is a memory leak.

Memory leaks in long running daemons are bad, especially when the expectation is that your process will be stable when all else fails.

To further complicate things, a typical nsqd process has many goroutines involved in message delivery. Internally, message “ownership” changes often.

To be able to shutdown cleanly, it’s incredibly important to account for all intraprocess messages.

Although there aren’t any magic bullets, the following techniques make it a little easier to manage…

WaitGroups

The sync package provides sync.WaitGroup, which can be used to perform accounting of how many goroutines are live (and provide a means to wait on their exit).

To reduce the typical boilerplate, nsqd uses this wrapper:

type WaitGroupWrapper struct {
    sync.WaitGroup
}

func (w *WaitGroupWrapper) Wrap(cb func()) {
    w.Add(1)
    go func() {
        cb()
        w.Done()
    }()
}

// can be used as follows:
wg := WaitGroupWrapper{}
wg.Wrap(func() { n.idPump() })
...
wg.Wait()

Exit Signaling

The easiest way to trigger an event in multiple child goroutines is to provide a single go-chan that you close when ready.

All pending receives on that go-chan will activate, rather than having to send a separate signal to each goroutine.

func work() {
    exitChan := make(chan int)
    go task1(exitChan)
    go task2(exitChan)
    time.Sleep(5 * time.Second)
    close(exitChan)
}

func task1(exitChan chan int) {
    <-exitChan
    log.Printf("task1 exiting")
}

func task2(exitChan chan int) {
    <-exitChan
    log.Printf("task2 exiting")
}

Synchronizing Exit

It was quite difficult to implement a reliable, deadlock free, exit path that accounted for all in-flight messages. A few tips:

  1. Ideally the goroutine responsible for sending on a go-chan should also be responsible for closing it.

  2. If messages cannot be lost, ensure that pertinent go-chans are emptied (especially unbuffered ones!) to guarantee senders can make progress.

  3. Alternatively, if a message is no longer relevant, sends on a single go-chan should be converted to a select with the addition of an exit signal (as discussed above) to guarantee progress.

  4. The general order should be:

     1. Stop accepting new connections (close listeners)

     2. Signal exit to child goroutines (see above)

     3. Wait on WaitGroup for goroutine exit (see above)

     4. Recover buffered data

     5. Flush anything left to disk
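Tip 3 above, converting a bare send into a select with an exit signal, can be sketched as:

```go
package main

import "fmt"

type Message struct{ Body []byte }

// trySend never deadlocks against a receiver that has already gone
// away: the closed exitChan case fires instead.
func trySend(ch chan *Message, msg *Message, exitChan chan int) bool {
	select {
	case ch <- msg:
		return true
	case <-exitChan:
		return false // shutting down; drop or requeue the message
	}
}

func main() {
	ch := make(chan *Message) // unbuffered: a bare send would block forever
	exitChan := make(chan int)
	close(exitChan) // simulate shutdown with no receiver present
	fmt.Println(trySend(ch, &Message{}, exitChan)) // false
}
```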

Logging

Finally, the most important tool at your disposal is to log the entrance and exit of your goroutines!

It makes it infinitely easier to identify the culprit in the case of deadlocks or leaks.

nsqd log lines include information to correlate goroutines with their siblings (and parent), such as the client’s remote address or the topic/channel name.

The logs are verbose, but not verbose to the point where the log is overwhelming.

There’s a fine line, but nsqdleans towards the side of having more information in the logs when a fault occurs rather than trying to reduce chattiness at the expense of usefulness.

这是有界限的,但是当错误发生时,nsqd倾向于在日志中包含更多信息,而不是试图以牺牲可用性为代价减少记录

Last modified: February 22, 2021 at 11:04 PM