In this work, we significantly further the understanding of real-world cache workloads by
collecting production traces from 153 in-memory cache clusters at Twitter, sifting through
over 80 TB of data, and sometimes interpreting the workloads in the context of the business
logic behind them.
[Y]ou should assume your data is corrupt from when a write is issued until after a flush or
force unit access write completes. However, most programs use system calls to write data. This
article looks at the guarantees provided by the Linux file APIs. It seems like this should be
simple: a program calls write() and after it completes, the data is durable. However,
write() only copies data from the application into the kernel’s cache in memory. To force
the data to be durable you need to use some additional mechanism.
Traefik, the "Cloud Native Edge Router," is yet another reverse proxy and load balancer.
Omitting all the Cloud Native buzzwords, what really makes Traefik different from Nginx,
HAProxy, and the like is the automatic and dynamic configurability it provides out of the box.
The most prominent part of that is probably its ability to do automatic service discovery.
Which is better, Rust or Go—and does that question even make sense? Which language should you
choose for your next project in 2025, and why? How does Rust compare with Go in areas like
performance, simplicity, safety, features, scale, and concurrency?
Twine is our homegrown cluster management system, which has been running in production for the
past decade. A cluster management system allocates workloads to machines and manages the life
cycle of machines, containers, and workloads. Kubernetes is a prominent example of an open
source cluster management system. Twine has helped convert our infrastructure from a
collection of siloed pools of customized machines dedicated to individual workloads to a
large-scale ubiquitous shared infrastructure in which any machine can run any workload.
The purpose of this document is to describe the path data takes from the application down to
the storage, concentrating on places where data is buffered, and to then provide best
practices for ensuring data is committed to stable storage so it is not lost along the way in
the case of an adverse event. The main focus is on the C programming language, though the
system calls mentioned should translate fairly easily to most other languages.
As one of the underlying engines, Uber Money powers some of the most important aspects of
people's engagement with the Uber experience. A system like this should not only be robust, but
should also be highly available with zero tolerance for downtime, following our success mantra:
"To collect and disburse on-time, accurately and in-compliance".
As we expand to multiple lines of business and strategize what comes next, the engineers
in Uber Money also thrive on building the next generation of the Payments Platform that extends
Uber's growth. In this blog, we introduce you to this platform and provide insights into our
learnings. This includes migrating hundreds of millions of customers between two asynchronous
systems while maintaining data consistency with a goal of zero impact on our users.
Within AWS, a common pattern is to split the system into services that are responsible for
executing customer requests (the data plane), and services that are responsible for managing
and vending customer configuration (the control plane). In this article, I discuss a number
of different ways the data plane and the control plane interact with each other to avoid
system overload. In many of these architectures the larger data plane fleet calls the smaller
control plane fleet, but I also want to share the success we’ve had at Amazon when we put the
smaller fleet in control.
As I’ve lamented previously, the documentation for xperf (Windows Performance Toolkit) is a
bit light. The names of the columns in the summary tables can be exquisitely subtle, and I
have never found any documentation for them. But, I’ve talked to the xperf authors, and I’ve
used xperf a lot, and I’ve done some experiments, and here I share some more results, this
time for the Disk Usage summary table.
In just 20 years, software engineering has shifted from architecting monoliths with a single
database and centralized state to microservices where everything is distributed across
multiple containers, servers, data centers, and even continents. Distributing things solves
scaling concerns, but introduces a whole new world of problems, many of which were previously
solved by monoliths.
FioSynth is a benchmark tool used to automate the execution of storage workload suites and to
parse results. It contains a base set of block level storage workloads, synthesized from
production I/O traces, that simulate a diverse range of Facebook production services. It is
useful for predicting how a storage device will perform in realistic production environments
and for assisting with performance tuning.
Project Teleport removes the cost of download and decompression by SMB-mounting pre-expanded
layers from the Azure Container Registry to Teleport-enabled Azure container hosts.
In July 2020 I went on a color-scheme vision quest. This led to some research on various color
spaces and their utility, some investigation into the styling guidelines outlined by the
base16 project, and the color utilities that ship within the GNU Emacs text editor. This
article will be a whirlwind tour of things you can do to individual colors and, at the end,
how I put these building blocks together.
In this article I will demonstrate that while hardware has changed dramatically over the past
decade, software APIs have not, or at least not enough. Riddled with memory copies, memory
allocations, overly optimistic read-ahead caching, and all sorts of expensive operations,
legacy APIs prevent us from making the most of our modern devices.
[eBPF and io_uring] may look evolutionary, but they are revolutionary in the sense that they
will — we bet — completely change the way applications work with and think about the Linux
Kernel.
I thought it would be helpful to write a guide to dev tools outside of Google for the
ex-Googler, written with an eye toward pragmatism and practicality. No doubt many ex-Googlers
wish they could simply clone the Google internal environment to their new company, but you
can’t boil the ocean. Here is my take on where you should start and a general path I think
ex-Googlers can take to find the tools that will make them - and their new teams - as
productive as possible.
There have been recent attempts to enrich large-scale data stores, such as HBase and BigTable,
with transactional support. Not surprisingly, inspired by traditional database management
systems, serializability is usually compromised for the benefit of efficiency. For example,
Google Percolator implements lock-based snapshot isolation on top of BigTable. We show in
this paper that this compromise is not necessary in lock-free implementations of transactional
support. We introduce write-snapshot isolation, a novel isolation level that has a
performance comparable with that of snapshot isolation, and yet provides serializability.
This thesis presents the first implementation-independent specifications of existing ANSI
isolation levels and a number of levels that are widely used in commercial systems, e.g.,
Cursor Stability, Snapshot Isolation. It also specifies a variety of guarantees for
predicate-based operations in an implementation-independent manner. Two new levels are defined
that provide useful consistency guarantees to application writers; one is the weakest level
that ensures consistent reads, while the other captures some useful consistency properties
provided by pessimistic implementations.
This post is about gaining intuition for Write Skew, and, by extension, Snapshot Isolation.
Snapshot Isolation is billed as a transaction isolation level that offers a good mix between
performance and correctness, but the precise meaning of “correctness” here is often vague. In
this post I want to break down and capture exactly when the thing called “write skew” can
happen.
Here are some observations on how parsers can be constructed in a way that makes it easier to
recover from parse errors, produce multiple diagnostics in one pass, and provide partial
results for further analysis even in the face of errors, providing a better experience for
user-driven command line tools and interactive environments.
Build systems are awesome, terrifying – and unloved. They are used by every developer around
the world, but are rarely the object of study. In this paper, we offer a systematic, and
executable, framework for developing and comparing build systems, viewing them as related
points in a landscape rather than as isolated phenomena. By teasing apart existing build
systems, we can recombine their components, allowing us to prototype new build systems with
desired properties.
In this paper we introduce a new set of codes for erasure coding called Local Reconstruction
Codes (LRC). LRC reduces the number of erasure coding fragments that need to be read when
reconstructing data fragments that are offline, while still keeping the storage overhead
low.
Software dependencies carry with them serious risks that are too often overlooked. The shift
to easy, fine-grained software reuse has happened so quickly that we do not yet understand the
best practices for choosing and using dependencies effectively, or even for deciding when they
are appropriate and when not. My purpose in writing this article is to raise awareness of the
risks and encourage more investigation of solutions.
On Wednesday, web infrastructure provider Cloudflare announced a new feature called “AI
Labyrinth” that aims to combat unauthorized AI data scraping by serving fake AI-generated
content to bots. The tool will attempt to thwart AI companies that crawl websites without
permission to collect training data for large language models that power AI assistants like
ChatGPT.
All of a sudden, without any apparent cause, Google Docs was flooded with errors. How it
took me 2 days and a coworker to solve the hardest bug I ever debugged.
Edera Protect is a suite of offerings bridging the gap between modern cloud native computing
and virtualization-based security techniques. To power this platform, we’ve built our own
container runtime designed to operate as a microservice, allowing it to run containers in a
fully programmatic way—similar to how the Kubernetes Container Runtime Interface (CRI) enables
container management through microservices.
I would like to announce a new high-performance PNG codec, which is much faster than other
available codecs written in C, C++, and other programming languages.
Today, on Pi Day (S3’s 19th birthday), I’m sharing a post from Andy Warfield, VP and
Distinguished Engineer of S3. Andy takes us through S3’s evolution from simple object
store to sophisticated data platform, illustrating how customer feedback has shaped every
aspect of the service. It’s a fascinating look at how we maintain simplicity even as
systems scale to handle hundreds of trillions of objects.
Earlier this month, we announced the general availability of custom instructions in
Visual Studio Code. Custom instructions are how you give Copilot specific context about
your team’s workflow, your particular style preferences, libraries the model may not
know about, etc.
In this post we’ll dive into what custom instructions are, how you can use them today to
drastically improve your results with GitHub Copilot, and a brand-new preview feature
called “prompt files” that you can try today.
In a post on X on Wednesday, OpenAI CEO Sam Altman said that OpenAI will add support for
Anthropic’s Model Context Protocol, or MCP, across its products, including the desktop app
for ChatGPT. MCP is an open source standard that helps AI models produce better, more relevant
responses to certain queries.
Microsoft’s six security agents will be available in preview next month, and are designed to
do things like triage and process phishing and data loss alerts, prioritize critical
incidents, and monitor for vulnerabilities.
In a new paper published Thursday titled “Auditing language models for hidden objectives,”
Anthropic researchers described how custom AI models trained to deliberately conceal certain
“motivations” from evaluators could still inadvertently reveal secrets, due to their ability
to adopt different contextual roles they call “personas.” The researchers were initially
astonished by how effectively some of their interpretability methods seemed to uncover these
hidden training objectives, although the methods are still under research.
After significant research and testing on dozens of actual SNES units, the TASBot team now
thinks that a cheap ceramic resonator used in the system’s Audio Processing Unit (APU) is to
blame for much of this inconsistency. While Nintendo’s own documentation says the APU should
run at a consistent rate of 24.576 MHz (and the associated Digital Signal Processor sample
rate at a flat 32,000 Hz), in practice, that rate can vary just a bit based on heat, system
age, and minor physical variations that develop in different console units over time.
Time for me to write this blog post and prepare everyone for the implementation blitz that
needs to happen to make defer a success for the C programming language.
Solution files have been a part of the .NET and Visual Studio experience for many years now,
and they’ve had the same custom format the whole time. Recently, the Visual Studio solution
team has begun previewing a new, XML-based solution file format called SLNX. Starting in .NET
SDK 9.0.200, the dotnet CLI supports building and interacting with these files in the same
way as it does with existing solution files.
HybridCache is a new .NET 9 library available via the Microsoft.Extensions.Caching.Hybrid
package and is now generally available! HybridCache, named for its ability to leverage both
in-memory and distributed caches like Redis, ensures that data storage and retrieval is
optimized for performance and security, regardless of the scale or complexity of your
application.
Like sorting algorithms, hash table data structures continue to see improvements. In
2017, Sam Benzaquen, Alkis Evlogimenos, Matt Kulukundis, and Roman Perepelitsa at Google
presented a new C++ hash table design, dubbed “Swiss Tables”. In 2018, their
implementation was open sourced in the Abseil C++ library.
Go 1.24 includes a completely new implementation of the built-in map type, based on the Swiss Table design.
We are investigating a critical security incident involving the popular tj-actions/changed-files
GitHub Action. We want to alert you immediately so that you can take prompt action. This post
will be updated as new information becomes available.
A header-only C++ library that offers exceptionless error handling and type-safe enums, bringing
Rust-inspired error propagation with the ? operator and the match operator to C++.