
Assorted links for Thursday, March 13:
- Instrumenting Apache Spark Structured Streaming jobs using OpenTelemetry
Monitoring Apache Spark Structured Streaming workloads is challenging because the data is processed continuously as it arrives. Because stream processing is always on, problems are harder to troubleshoot during development and in production without real-time metrics, alerting and dashboards. Traces complement metrics, and since Spark doesn’t include them by default, we integrate them using OpenTelemetry.
- Protecting user data through source code analysis at scale
Meta’s Anti Scraping team focuses on preventing unauthorized scraping as part of our ongoing work to combat data misuse. In order to protect Meta’s changing codebase from scraping attacks, we have introduced static analysis tools into our workflow. These tools allow us to detect potential scraping vectors at scale across our Facebook, Instagram, and even parts of our Reality Labs codebases.
- We’ve figured out the basics of a shape-shifting, T-1000-style material
Campàs and his team drew inspiration from processes called fluidization and convergent extension—mechanisms that cells in embryos use to coordinate their behavior when forming tissues and organs in a developing organism. The team built a robotic collective where each robotic unit behaved like an embryonic cell. As a collective, the robots behaved like a material that could change shape and switch between solid and liquid states, just like the T-1000.
- Cross-Modal Retrieval: Why It Matters for Multimodal AI
With its ability to simultaneously process different data types (think text, image, audio, video and more), the continuing development of multimodal AI is the next step toward enhancing a wide range of tools, including those for generative AI and autonomous agentic AI.
- The Deployment Bottleneck No One Talks About
Most applications rely on cloud SDKs to connect to services like message brokers, queues, databases, APIs and more.
Rather than working directly with cloud SDKs, a better approach is to introduce a standardized layer between applications and cloud services. This allows developers to interact with essential resources without being tightly coupled to a specific provider’s SDKs. A framework like Dapr helps achieve this by providing a uniform API for interacting with cloud resources.
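
To make the idea of a standardized layer concrete, here is a minimal Python sketch, not taken from the linked article. The `MessageBus` interface, the `InMemoryBus` test double, the `place_order` function, and the component name `pubsub` are all hypothetical names chosen for illustration. `DaprHttpBus` assumes a Dapr sidecar listening on its default HTTP port (3500) and uses Dapr's HTTP publish endpoint, `/v1.0/publish/<pubsubname>/<topic>`.

```python
import json
from typing import Protocol
from urllib import request


class MessageBus(Protocol):
    """Hypothetical provider-neutral interface the application codes against."""

    def publish(self, topic: str, payload: dict) -> None: ...


class InMemoryBus:
    """Test double: records messages instead of calling a broker."""

    def __init__(self) -> None:
        self.sent: list[tuple[str, dict]] = []

    def publish(self, topic: str, payload: dict) -> None:
        self.sent.append((topic, payload))


class DaprHttpBus:
    """Publishes through a Dapr sidecar's HTTP pub/sub endpoint.

    Assumes a sidecar on localhost:3500 and a pub/sub component named
    "pubsub"; both are placeholders for this sketch.
    """

    def __init__(self, pubsub_name: str = "pubsub", port: int = 3500) -> None:
        self.base_url = f"http://localhost:{port}/v1.0/publish/{pubsub_name}"

    def build_request(self, topic: str, payload: dict) -> request.Request:
        # Build the POST request without sending it, so it can be inspected.
        return request.Request(
            f"{self.base_url}/{topic}",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )

    def publish(self, topic: str, payload: dict) -> None:
        # Requires a running sidecar; not exercised in the usage below.
        with request.urlopen(self.build_request(topic, payload)) as resp:
            resp.read()


def place_order(bus: MessageBus, order_id: str) -> None:
    """Application logic: knows only the interface, not the provider."""
    bus.publish("orders", {"order_id": order_id})


# Usage with the in-memory backend; no broker or sidecar is needed.
bus = InMemoryBus()
place_order(bus, "A-100")
```

Because `place_order` depends only on `MessageBus`, swapping `InMemoryBus` for `DaprHttpBus` (or for a wrapper around a specific provider's SDK) would not require touching the application code, which is the decoupling the excerpt describes.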