Original article: https://nsq.io/deployment/topology_patterns.html

This document describes some NSQ patterns that solve a variety of common problems.

DISCLAIMER: there are some obvious technology suggestions but this document generally ignores the deeply personal details of choosing proper tools, getting software installed on production machines, managing what service is running where, service configuration, and managing running processes (daemontools, supervisord, init.d, etc.).

Metrics Collection

Regardless of the type of web service you’re building, in most cases you’re going to want to collect some form of metrics in order to understand your infrastructure, your users, or your business.

For a web service, most often these metrics are produced by events that happen via HTTP requests, like an API. The naive approach would be to structure this synchronously, writing to your metrics system directly in the API request handler.

naive approach

  • What happens when your metrics system goes down?

  • Do your API requests hang and/or fail?

  • How will you handle the scaling challenge of increasing API request volume or breadth of metrics collection?

One way to resolve all of these issues is to somehow perform the work of writing into your metrics system asynchronously - that is, place the data in some sort of local queue and write into your downstream system via some other process (consuming that queue). This separation of concerns allows the system to be more robust and fault tolerant. At bitly, we use NSQ to achieve this.

Brief tangent: NSQ has the concept of topics and channels. Basically, think of a topic as a unique stream of messages (like our stream of API events above). Think of a channel as a copy of that stream of messages for a given set of consumers. Topics and channels are both independent queues, too. These properties enable NSQ to support both multicast (a topic copying each message to N channels) and distributed (a channel equally dividing its messages among N consumers) message delivery.

For a more thorough treatment of these concepts, read through the design doc and slides from our Golang NYC talk, specifically slides 19 through 33 describe topics and channels in detail.

architecture with NSQ

Integrating NSQ is straightforward; let's take the simple case:

  1. Run an instance of nsqd on the same host that runs your API application.

  2. Update your API application to write to the local nsqd instance to queue events, instead of directly into the metrics system. To be able to easily introspect and manipulate the stream, we generally format this type of data in line-oriented JSON. Writing into nsqd can be as simple as performing an HTTP POST request to the /put endpoint.

  3. Create a consumer in your preferred language using one of our client libraries. This "worker" will subscribe to the stream of data and process the events, writing into your metrics system. It can also run locally on the host running both your API application and nsqd.

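Writing into nsqd over HTTP, as described above, can be sketched in a few lines. This is a minimal illustration, assuming nsqd's HTTP interface on its default port 4151 and a topic named api_requests (both are example values):

```python
import json
import urllib.request

def encode_event(event):
    """Serialize an API event as line-oriented JSON (one object per line)."""
    return json.dumps(event, separators=(",", ":")) + "\n"

def queue_event(event, nsqd_http="http://127.0.0.1:4151", topic="api_requests"):
    """Queue an event by POSTing it to the local nsqd's /put endpoint."""
    data = encode_event(event).encode("utf-8")
    req = urllib.request.Request(f"{nsqd_http}/put?topic={topic}", data=data)
    with urllib.request.urlopen(req) as resp:
        return resp.status == 200
```

Because the write goes to localhost, the API handler's latency stays low even when the downstream metrics system is slow or down.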
Here’s an example worker written with our official Python client library:

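The original page embeds the worker's source at this point; as a stand-in, here is a minimal sketch built around pynsq's Reader. The topic, channel, and handler behavior are illustrative — a real worker would write each event into your metrics system in the `write` callable:

```python
import json

def process_message(body, write=lambda event: True):
    """Handle one raw message body; returning True finishes (FIN) the
    message, returning False re-queues (REQ) it so it will be retried."""
    try:
        event = json.loads(body)
    except ValueError:
        return True  # drop malformed events instead of re-queueing forever
    # `write` stands in for the call into your metrics system
    return bool(write(event))

if __name__ == "__main__":
    import nsq  # pynsq, installed separately

    nsq.Reader(
        message_handler=lambda msg: process_message(msg.body),
        nsqd_tcp_addresses=["127.0.0.1:4150"],
        topic="api_requests",
        channel="metrics",
    )
    nsq.run()
```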
In addition to de-coupling, by using one of our official client libraries, consumers will degrade gracefully when message processing fails. Our libraries have two key features that help with this:

  1. Retries - when your message handler indicates failure, that information is sent to nsqd in the form of a REQ (re-queue) command. Also, nsqd will automatically time out (and re-queue) a message if it hasn't been responded to in a configurable time window. These two properties are critical to providing a delivery guarantee.

  2. Exponential Backoff - when message processing fails the reader library will delay the receipt of additional messages for a duration that scales exponentially based on the # of consecutive failures. The opposite sequence happens when a reader is in a backoff state and begins to process successfully, until the delay returns to 0.

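The shape of that backoff behavior can be modeled in a few lines. This is a sketch of the idea only — the base interval and cap below are made-up parameters, not pynsq's actual values:

```python
def backoff_delay(failures, base=1.0, max_delay=120.0):
    """Delay (seconds) before resuming receipt; doubles per consecutive failure."""
    if failures == 0:
        return 0.0
    return min(base * 2 ** (failures - 1), max_delay)

class BackoffState:
    """Tracks consecutive failures; each success walks the delay back toward 0."""
    def __init__(self):
        self.failures = 0

    def on_failure(self):
        self.failures += 1

    def on_success(self):
        self.failures = max(0, self.failures - 1)

    @property
    def delay(self):
        return backoff_delay(self.failures)
```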
In concert, these two features allow the system to respond gracefully to downstream failure, automagically.

Persistence

Ok, great, now you have the ability to withstand a situation where your metrics system is unavailable with no data loss and no degraded API service to other endpoints. You also have the ability to scale the processing of this stream horizontally by adding more worker instances to consume from the same channel.

But it's hard to think, ahead of time, of all the types of metrics you might want to collect for a given API event.

Wouldn’t it be nice to have an archived log of this data stream for any future operation to leverage? Logs tend to be relatively easy to redundantly backup, making it a “plan z” of sorts in the event of catastrophic downstream data loss. But, would you want this same consumer to also have the responsibility of archiving the message data? Probably not, because of that whole “separation of concerns” thing.

Archiving an NSQ topic is such a common pattern that we built a utility, nsq_to_file, packaged with NSQ, that does exactly what you need.

Remember, in NSQ, each channel of a topic is independent and receives a copy of all the messages. You can use this to your advantage when archiving the stream by doing so over a new channel, archive. Practically, this means that if your metrics system is having issues and the metrics channel gets backed up, it won't affect the separate archive channel you'll be using to persist messages to disk.

So, add an instance of nsq_to_file to the same host and use a command line like the following:

/usr/local/bin/nsq_to_file --nsqd-tcp-address=127.0.0.1:4150 --topic=api_requests --channel=archive

archiving the stream

Distributed Systems

You’ll notice that the system has not yet evolved beyond a single production host, which is a glaring single point of failure.

Unfortunately, building a distributed system is hard. Fortunately, NSQ can help. The following changes demonstrate how NSQ alleviates some of the pain points of building distributed systems as well as how its design helps achieve high availability and fault tolerance.

Let’s assume for a second that this event stream is really important. You want to be able to tolerate host failures and continue to ensure that messages are at least archived, so you add another host.

adding a second host

Assuming you have some sort of load balancer in front of these two hosts you can now tolerate any single host failure.

Now, let’s say the process of persisting, compressing, and transferring these logs is affecting performance. How about splitting that responsibility off to a tier of hosts that have higher IO capacity?

separate archive hosts

This topology and configuration can easily scale to double-digit hosts, but you’re still managing configuration of these services manually, which does not scale. Specifically, in each consumer, this setup is hard-coding the address of where nsqd instances live, which is a pain. What you really want is for the configuration to evolve and be accessed at runtime based on the state of the NSQ cluster. This is exactly what we built nsqlookupd to address.

nsqlookupd is a daemon that records and disseminates the state of an NSQ cluster at runtime. nsqd instances maintain persistent TCP connections to nsqlookupd and push state changes across the wire. Specifically, an nsqd registers itself as a producer for a given topic as well as all channels it knows about. This allows consumers to query an nsqlookupd to determine who the producers are for a topic of interest, rather than hard-coding that configuration. Over time, they will learn about the existence of new producers and be able to route around failures.

The only changes you need to make are to point your existing nsqd and consumer instances at nsqlookupd (everyone explicitly knows where nsqlookupd instances are but consumers don't explicitly know where producers are, and vice versa). The topology now looks like this:

adding nsqlookupd

At first glance this may look more complicated. It's deceptive though, as the effect this has on a growing infrastructure is hard to communicate visually. You've effectively decoupled producers from consumers because nsqlookupd is now acting as a directory service in between. Adding additional downstream services that depend on a given stream is trivial, just specify the topic you're interested in (producers will be discovered by querying nsqlookupd).

But what about availability and consistency of the lookup data? We generally recommend basing your decision on how many to run in congruence with your desired availability requirements. nsqlookupd is not resource intensive and can be easily homed with other services. Also, nsqlookupd instances do not need to coordinate or otherwise be consistent with each other. Consumers will generally only require one nsqlookupd to be available with the information they need (and they will union the responses from all of the nsqlookupd instances they know about). Operationally, this makes it easy to migrate to a new set of nsqlookupd.
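That union step can be sketched like so. The parsing assumes the JSON shape of nsqlookupd's /lookup?topic=... response (a producers list with broadcast_address and tcp_port fields, optionally wrapped in a data envelope on older versions); treat the field names as illustrative if your version differs:

```python
import json
import urllib.request

def parse_producers(body):
    """Extract 'host:port' addresses from one nsqlookupd /lookup response."""
    doc = json.loads(body)
    data = doc.get("data", doc)  # older versions wrap the payload in "data"
    return {
        "{0}:{1}".format(p["broadcast_address"], p["tcp_port"])
        for p in data.get("producers", [])
    }

def discover(topic, lookupd_http_addresses):
    """Union producer addresses across every reachable nsqlookupd."""
    producers = set()
    for addr in lookupd_http_addresses:
        try:
            url = "{0}/lookup?topic={1}".format(addr, topic)
            with urllib.request.urlopen(url) as resp:
                producers |= parse_producers(resp.read())
        except OSError:
            continue  # only one nsqlookupd needs to answer
    return producers
```

Because the results are unioned and unreachable instances are skipped, any single available nsqlookupd is enough for a consumer to find its producers.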

Last modified: February 24, 2021, 10:43 PM