AWS ECS-based ephemeral consoles for production issue troubleshooting
The journey of how we built an ephemeral consoles system to enable timely and secure access to production environments, without compromising on accountability.
Introduction
At Doist, we build products that help millions of people every week stay organized and collaborate with each other. Much of what these products do is synchronize data across devices in a fault-tolerant way, embracing an offline-first principle for frictionless collaboration.
In recent months, our systems have undergone a significant evolution to serve more users and improve their experience when working with ever more data.
To achieve this, and to help troubleshoot issues in production, we developed a system that gives a select group of engineers secure and timely access to production environments, without compromising on accountability.
In this blog post, we’ll walk you through the journey of how we got there and how it works.
How were we doing it before
In the early days of the company, we ran code on dedicated long-lived EC2 instances. Some developers got used to this and came to rely on the option to “debug in production” by connecting over SSH to such an instance and running a Python/iPython console to inspect system state: the combination of production code behavior and the environment it runs in, including database, cache, and some other runtime dependencies’ state.
When we migrated services to AWS ECS, developers lost this option: there is no big, long-lived production server anymore; system components run inside ephemeral containers, which are destroyed once the process running inside them exits.
We explored AWS ECS Exec to address the “debug in production” use case. However, this approach required installing the AWS CLI and managing user permissions via AWS IAM. To retain the convenience of simply opening an SSH session and running an iPython console for system introspection, we aimed to replicate this experience using modern components that align with our new container-based setup instead of working against it.
The new way
Our end goal was a system with a simple web interface, where developers could ask for an automatically provisioned, dedicated, short-lived server that had the same code and environment as the most recently deployed version of our service. Such a server should be available over SSH and have an audit trail of the actions executed over each SSH session.
Providing SSH access
We assessed several options to provide SSH access to our containers and ended up with two final contenders: AWS Systems Manager Session Manager and Tailscale SSH.
Both let us control who can access the consoles through policies (AWS IAM and Tailscale ACLs, respectively), and both could record sessions for auditing purposes.
From the get-go, we set session recording as a hard requirement: all actions taken within the environment are recorded, providing an audit trail and ensuring accountability.
We ultimately decided to go with Tailscale SSH, as we already relied on Tailscale to limit access to some internal resources, and Tailscale SSH didn’t require extra setup on the client (compared to using SSH sessions over AWS Systems Manager Session Manager). Tailscale SSH also added a session recording feature recently, which was enough to convince us to follow that route after a POC.
The ephemeral console container
Now that we had picked Tailscale as our SSH connectivity provider, we needed to figure out what to run inside our container and how: it should resemble production and be accessible via Tailscale SSH, yet not automatically execute production code on startup.
The first experiment was to run tailscaled, Tailscale’s daemon process, as the container entrypoint: the tailscaled binary ran as the PID 1 process, configured networking, and spawned shells as needed for connected SSH clients. This prototype worked well, but to give ourselves more freedom, we decided to use a custom entrypoint process (sketched further below) that would:
- take care of additional setup;
- run tailscaled with appropriate flags;
- work as a real init process, reaping abandoned processes to avoid accumulation of “zombie” entries in the process table (just in case).
We creatively named this binary “bootstrap”.
The process tree on the newly provisioned system looks like this:
bootstrap (root, pid=1)
∟ tailscaled (root)
When SSH connects, the process tree becomes like this:
bootstrap (root, pid=1)
∟ tailscaled (root)
∟ bash (user)
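For illustration, here is a minimal sketch of such an entrypoint in Go. It is simplified and uses illustrative paths; the real bootstrap does more (flags, motd, timers), but the essential shape is a PID 1 process that starts tailscaled and reaps orphaned children:

package main

import (
	"log"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

func main() {
	// Start tailscaled as a child process (its flags are covered later in the post).
	cmd := exec.Command("/var/runtime/tailscaled")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		log.Fatalf("starting tailscaled: %v", err)
	}

	// As PID 1 we inherit every orphaned process, so reap children on SIGCHLD
	// to avoid accumulating "zombie" entries in the process table.
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGCHLD)
	for range sigCh {
		for {
			var status syscall.WaitStatus
			pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, nil)
			if pid <= 0 || err != nil {
				break
			}
			if pid == cmd.Process.Pid {
				// Our main child exited; shut the whole container down.
				os.Exit(status.ExitStatus())
			}
		}
	}
}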
Our goal was to have ephemeral servers use the same container image as production.
We knew that we needed to bring a few additional components in there, but we didn’t want to directly include the bootstrap and tailscaled binaries in the production image.
The first idea was to build a new container image based on the production image:
FROM production
RUN # install tailscaled and bootstrap binaries
but we abandoned this idea early: we make multiple releases each day and wanted to avoid the extra work in the CI pipeline.
Luckily, there was another way to get extra binaries inside a container based on the production image, without modifying the image itself.
We run our code on AWS Fargate, the “serverless”, fully managed way to run AWS Elastic Container Service workloads. The smallest unit of compute on AWS Fargate is a task: an ephemeral virtual machine provisioned with a container image.
An AWS Fargate task may run more than one container within the same VM: so-called “sidecars”, often used for auxiliary purposes such as shipping telemetry or logs to an external system. We decided to leverage this mechanism to provide the additional binaries for our setup. Containers running within the same AWS Fargate task can have parts of their filesystems made available to each other via cross-container mount points. This allowed us to make binaries from our “bootstrap” container appear inside a container running an unmodified production image.
In AWS CloudFormation template terms, the relationship between our containers within a single Fargate task takes the following shape:
ContainerDefinitions:
- Name: bootstrap
Image: !Sub "${Repository.RepositoryUri}:latest"
Essential: false
ReadonlyRootFilesystem: true
Command: ["/bin/sh", "-c", "true"]
MountPoints:
- SourceVolume: runtime
ContainerPath: /var/runtime
- Name: backend
Image: !Sub "${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/${OtherRepo}:latest"
Essential: true
EntryPoint: ["/var/runtime/bootstrap"]
User: "0:0"
VolumesFrom:
- SourceContainer: bootstrap
ReadOnly: true
From our “bootstrap” container, we share the /var/runtime directory, which holds only three binaries: tailscaled (the Tailscale server), tailscale (the Tailscale client, required to configure the server), and bootstrap, which does additional configuration and provides some convenience functions (more on that below).
Notice how the bootstrap container is marked as “non-essential” in the Fargate task definition: even though its command exits right away, the whole task continues to run, with the side effect that its /var/runtime directory stays mounted inside the other container.
The bootstrap process
The entire Dockerfile defining our bootstrap image looks like this:
FROM public.ecr.aws/docker/library/alpine:latest as tailscale
WORKDIR /app
ENV TSFILE=tailscale_1.48.2_amd64.tgz
RUN wget https://pkgs.tailscale.com/stable/${TSFILE} && \
tar xzf ${TSFILE} --strip-components=1
FROM public.ecr.aws/docker/library/golang:alpine as builder
WORKDIR /app
ENV GOFLAGS="-ldflags=-w -trimpath" CGO_ENABLED=0
COPY go.mod go.sum ./
RUN go mod download
COPY main.go .
RUN go build -o bootstrap .
FROM public.ecr.aws/docker/library/alpine:latest
COPY --from=tailscale /app/tailscaled /var/runtime/tailscaled
COPY --from=tailscale /app/tailscale /var/runtime/tailscale
COPY --from=builder /app/bootstrap /var/runtime/bootstrap
VOLUME ["/var/runtime"]
The bootstrap process, running as the container entrypoint, takes care of the following (a condensed sketch follows the list):
- Create a subdirectory used by tailscaled for its state management.
- Launch tailscaled in userspace networking mode as a child process.
- Configure tailscaled (the server) by calling tailscale (the client) with appropriate flags; this tells the server to join our Tailscale network (tailnet) with a specific authentication key, sets an explicit hostname, and enables Tailscale SSH.
- Do some additional convenience configuration, like writing a custom /etc/motd file, and a few other things.
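Condensed into a sketch, the first three steps could look roughly like this; the state and socket paths, environment variable names, and error handling are illustrative rather than our exact implementation, and in practice the tailscale up call has to wait for tailscaled’s socket to become available:

package main

import (
	"log"
	"os"
	"os/exec"
)

// startTailscale prepares tailscaled state, launches it in userspace
// networking mode, and then configures it via the tailscale client.
func startTailscale(hostname, authKey string) (*exec.Cmd, error) {
	// 1. Subdirectory used by tailscaled for its state management.
	if err := os.MkdirAll("/var/lib/tailscale", 0o700); err != nil {
		return nil, err
	}

	// 2. Launch tailscaled in userspace networking mode as a child process.
	tailscaled := exec.Command("/var/runtime/tailscaled",
		"--state=/var/lib/tailscale/tailscaled.state",
		"--socket=/var/run/tailscale/tailscaled.sock",
		"--tun=userspace-networking",
	)
	tailscaled.Stdout = os.Stdout
	tailscaled.Stderr = os.Stderr
	if err := tailscaled.Start(); err != nil {
		return nil, err
	}

	// 3. Configure the server: join the tailnet with a pre-authorized key,
	// set an explicit hostname, and enable Tailscale SSH.
	up := exec.Command("/var/runtime/tailscale",
		"--socket=/var/run/tailscale/tailscaled.sock",
		"up",
		"--authkey="+authKey,
		"--hostname="+hostname,
		"--ssh",
	)
	up.Stdout = os.Stdout
	up.Stderr = os.Stderr
	if err := up.Run(); err != nil {
		return nil, err
	}
	return tailscaled, nil
}

func main() {
	// CONSOLE_HOSTNAME and TS_AUTHKEY are placeholder names for values the
	// dispatcher passes into the task (see the next section).
	if _, err := startTailscale(os.Getenv("CONSOLE_HOSTNAME"), os.Getenv("TS_AUTHKEY")); err != nil {
		log.Fatal(err)
	}
	select {} // keep PID 1 alive; the real bootstrap also reaps children and runs timers
}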
Ephemeral Console Dispatcher
The Ephemeral Dispatcher is the controller system that spawns ephemeral consoles as ECS Fargate tasks.
It consists of an almost stateless web server hosting a single web form: developers select a few options and click the submit button, which provisions a new ephemeral instance.
The server logic is quite straightforward and is mostly glue code bringing together external components:
- It works as a limited-scope OAuth client to the Tailscale API, which allows it to request new pre-authorized Tailscale auth keys.
- It uses the AWS SDK to call ECS RunTask, passing additional data such as the Tailscale auth key and the termination delay in the “Overrides” parameter (see the sketch after this list).
- It works as a tailnet-only service, serving the web form only via Tailscale.
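For illustration, the RunTask call could look roughly like this with the AWS SDK for Go v2; the cluster, task definition family, subnet, and environment variable names below are placeholders rather than our actual configuration:

package main

import (
	"context"
	"log"
	"os"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ecs"
	"github.com/aws/aws-sdk-go-v2/service/ecs/types"
)

// launchConsole starts one ephemeral console task, passing the pre-authorized
// Tailscale key and the requested hostname/lifetime via container overrides.
func launchConsole(ctx context.Context, client *ecs.Client, authKey, hostname, ttl string) error {
	_, err := client.RunTask(ctx, &ecs.RunTaskInput{
		Cluster:        aws.String("ephemeral-consoles"), // placeholder cluster name
		TaskDefinition: aws.String("ephemeral-console"),  // placeholder family (latest revision)
		LaunchType:     types.LaunchTypeFargate,
		NetworkConfiguration: &types.NetworkConfiguration{
			AwsvpcConfiguration: &types.AwsVpcConfiguration{
				Subnets: []string{"subnet-0123456789abcdef0"}, // placeholder subnet
			},
		},
		Overrides: &types.TaskOverride{
			ContainerOverrides: []types.ContainerOverride{{
				Name: aws.String("backend"),
				Environment: []types.KeyValuePair{
					{Name: aws.String("TS_AUTHKEY"), Value: aws.String(authKey)},
					{Name: aws.String("CONSOLE_HOSTNAME"), Value: aws.String(hostname)},
					{Name: aws.String("CONSOLE_TTL"), Value: aws.String(ttl)},
				},
			}},
		},
	})
	return err
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	// The auth key would come from the Tailscale API; here it is read from the
	// environment purely for the sake of the example.
	if err := launchConsole(ctx, ecs.NewFromConfig(cfg), os.Getenv("TS_AUTHKEY"), "console-demo", "8h"); err != nil {
		log.Fatal(err)
	}
}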
Console’s lifetime
We highlight the ephemeral nature of the console servers for users of this service, as we don’t want them to treat these servers as ordinary long-lived VMs that persist across reboots, like some were used to doing with EC2 instances. So right from the start, we made it mandatory in the provisioning form to select how long the developer plans to use the ephemeral server, e.g. “I want this instance for 8 hours”. The expectation is that such an instance is automatically destroyed after it runs for this long, so we don’t need to worry about “garbage collection” or track how long instances have gone unused.
Initially, we implemented the task retirement feature with a sidecar container running a single sleep command: if such a container is marked as “essential” in the ECS task definition and the sleep command exits after blocking for the specified amount of time, the whole task is terminated.
Soon enough, we got a feature request to somehow extend the lifetime of such an ephemeral server: for example, you start debugging something on an instance provisioned for 1 hour, but realize it will take longer and want it to run for a few more hours before it self-destructs.
We then moved this “sleep” functionality directly into the bootstrap process, which now knows how long it is expected to run.
This unlocked a few other convenience features:
- Now that the bootstrap process knows when it’s expected to terminate, it writes this deadline into the /etc/motd file, so the developer sees it in the banner shown after they log in over SSH.
- The bootstrap process keeps a timer and sends two early warnings, 30 minutes and 5 minutes before termination, to each logged-in user by writing directly into /dev/pts/* device files, mimicking what the Unix wall command does (see the sketch after this list).
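A minimal sketch of that wall-style broadcast; the message wording here is illustrative:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// warnLoggedInUsers writes a message to every pseudo-terminal, similar to
// what the Unix wall command does, so interactive SSH sessions see it.
func warnLoggedInUsers(deadline time.Time) {
	msg := fmt.Sprintf(
		"\r\n*** This ephemeral console shuts down at %s. ***\r\n"+
			"*** Run `expire-in <duration>` to extend it. ***\r\n",
		deadline.Format(time.RFC3339))

	ptys, _ := filepath.Glob("/dev/pts/[0-9]*")
	for _, pty := range ptys {
		// Best effort: skip terminals we cannot open or write to.
		f, err := os.OpenFile(pty, os.O_WRONLY, 0)
		if err != nil {
			continue
		}
		_, _ = f.WriteString(msg)
		_ = f.Close()
	}
}

func main() {
	// Example: warn about a deadline 30 minutes from now.
	warnLoggedInUsers(time.Now().Add(30 * time.Minute))
}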
We also taught the bootstrap process to adjust this deadline on demand. To do so, the bootstrap process acts as a server listening on a Unix socket. On launch, it also creates a symlink to itself under a different name (“expire-in”) in a directory on $PATH. When the bootstrap binary is called by this alternative name, it takes a different code path and works as a client communicating with the main bootstrap process over the Unix socket. In this mode, it implements a small command-line tool used to adjust the instance expiration deadline; it can be called like expire-in 5h to set the deadline to 5 hours from now.
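Here is a simplified sketch of that dual-mode behavior; the socket path, symlink location, and wire format are illustrative, and the real bootstrap does much more than log the new deadline:

package main

import (
	"fmt"
	"io"
	"log"
	"net"
	"os"
	"path/filepath"
	"time"
)

const socketPath = "/run/bootstrap.sock" // illustrative path

func main() {
	// When invoked via the "expire-in" symlink, act as a client of the
	// long-running bootstrap process instead of starting a new one.
	if filepath.Base(os.Args[0]) == "expire-in" {
		runExpireInClient()
		return
	}

	// ... normal bootstrap startup (tailscaled, motd, timers) would go here ...

	// Make `expire-in` callable from a shell, then serve deadline adjustments.
	if err := os.Symlink("/var/runtime/bootstrap", "/usr/local/bin/expire-in"); err != nil && !os.IsExist(err) {
		log.Fatal(err)
	}
	ln, err := net.Listen("unix", socketPath)
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func(c net.Conn) {
			defer c.Close()
			raw, err := io.ReadAll(c)
			if err != nil {
				return
			}
			d, err := time.ParseDuration(string(raw))
			if err != nil {
				fmt.Fprintf(c, "bad duration: %v\n", err)
				return
			}
			// In the real process this resets the shutdown timer.
			log.Printf("deadline extended to %s", time.Now().Add(d))
		}(conn)
	}
}

// runExpireInClient sends the requested duration (e.g. "5h") to the main
// bootstrap process over its Unix socket.
func runExpireInClient() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: expire-in <duration>")
		os.Exit(1)
	}
	conn, err := net.Dial("unix", socketPath)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	if _, err := conn.Write([]byte(os.Args[1])); err != nil {
		log.Fatal(err)
	}
}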
Filesystem persistence
We soon figured out that even though the ephemeral nature fits most tasks, for some of them it’s convenient to have a place to persist files and reuse them across runs, such as ad-hoc scripts.
To make the “working from SSH console” experience more fluid, we rejected the option to store such data in S3 and instead configured a network filesystem share.
Conveniently, AWS provides EFS, a managed NFS service, which we now use to mount a shared filesystem on each instance under a predefined path.
We put a colorful notice in our motd banner, telling developers where they can store files that they want to survive each instance’s lifetime.
Conclusion
By building an internal system that leverages modern components like AWS ECS and Tailscale, we’ve been able to recreate the convenience of traditional debugging methods while addressing the challenges of production environment troubleshooting. This system has been running smoothly for several months and has received positive feedback from the developers who have used it.
If you enjoy building systems like this or solving problems with low-maintenance, elegant solutions that help fellow developers be more productive and efficient, join us.