When I started exploring modern data engineering tools, I kept running into outdated examples: deprecated Flink configs, old Iceberg syntax, broken Compose files. Versions change quickly, but tutorials don't. So I built something I wish I'd had sooner: a collection of Claude Code Skills for Data Engineers, using Anthropic's new Skills feature.
Includes:
• Apache Iceberg - snapshots, time travel, partition evolution
• Apache Paimon - streaming lakehouse with LSM compaction
• Lance - columnar format for ML & vector search
• Apache Fluss - sub-second streaming storage
• Docker Compose V2 - clean syntax, healthchecks, resource limits
Some example prompts:
@iceberg/iceberg.md help me design a partition strategy
@paimon/paimon.md create a CDC pipeline from MySQL
Available here: https://lnkd.in/eX7mb4vy
Gordon Murray’s Post
Most RAG systems fail at this simple question: "What's the most common GitHub issue AND what are people saying about it?"
Vanilla RAG follows a simple pattern: query -> retrieve -> generate. It's effective for straightforward question-answering, but struggles when tasks get complex.
Let's say you ask: "What's the most common GitHub issue from last month, and what are people saying about it in our internal chat?" Traditional RAG would try to match your entire query to one knowledge source. It might find something relevant, but probably not exactly what you need.
Agentic RAG works differently:
1. 𝗣𝗹𝗮𝗻𝗻𝗶𝗻𝗴: The agent breaks down your query into subtasks (select a tool to query the GitHub issues from last month, build a query to fetch the most common one, search internal chat for mentions)
2. 𝗧𝗼𝗼𝗹 𝗨𝘀𝗲: It routes the first part to your GitHub database, gets results, then routes the second part to your chat system using context from the first search
3. 𝗥𝗲𝗳𝗹𝗲𝗰𝘁𝗶𝗼𝗻: The agent validates the retrieved information and can re-query if something doesn't look right
This is really promising for complex queries that need multiple data sources or multi-step reasoning.
𝗧𝗵𝗲 𝘁𝗿𝗮𝗱𝗲𝗼𝗳𝗳𝘀: Agentic RAG typically requires multiple LLM calls instead of one. This means added latency and cost. It is also much more complex to develop, deploy, and maintain.
Here's my recommendation: for many use cases, a simple RAG pipeline is sufficient. But if you are dealing with complex queries, response quality is very important, and your users can afford to wait a few extra seconds, an agentic RAG workflow is probably better suited for your use case.
The architecture can be simple (a single router agent) or complex (multiple specialized agents coordinating). You can have one agent that retrieves from your internal docs, another that searches the web, and a coordinator that decides which to use.
For more information, my colleagues wrote a very nice blog post about the different agentic workflows: https://lnkd.in/eS2mFxUF
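The plan → tool use → reflection loop described above can be sketched in a few lines. This is a minimal illustration, not a real implementation: the two tools are stand-in stubs (a real system would call GitHub's API, a chat-search index, and an LLM for planning and validation), and all names and data are made up.

```python
# Minimal agentic-RAG sketch: plan -> tool use -> reflection.
# Both tools are hypothetical stubs returning canned data.

def search_github_issues(query: str) -> dict:
    """Stub standing in for a GitHub issues query."""
    return {"title": "Login timeout on SSO", "count": 42}

def search_internal_chat(query: str) -> list:
    """Stub standing in for an internal chat search."""
    return ["Seen message mentioning: " + query]

def agentic_rag(question: str) -> dict:
    # 1. Planning: decompose the question into ordered subtasks.
    plan = ["find_top_issue", "search_chat_for_issue"]
    context = {}
    # 2. Tool use: route each subtask, feeding earlier results forward.
    for step in plan:
        if step == "find_top_issue":
            context["issue"] = search_github_issues("most common issue last month")
        elif step == "search_chat_for_issue":
            # The second search reuses context from the first one.
            context["chatter"] = search_internal_chat(context["issue"]["title"])
    # 3. Reflection: validate the result and re-query if it looks empty/wrong.
    if not context.get("chatter"):
        context["chatter"] = search_internal_chat(context["issue"]["title"])
    return context

result = agentic_rag("What's the most common GitHub issue and what are people saying?")
```

Swapping the hard-coded `plan` list for an LLM call is what turns this sketch into the router-agent architecture mentioned above.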
Implementing a One-Dimensional Array Using Only Stacks in C++
Today, as an exercise for my DSA knowledge, I implemented a one-dimensional array using only stacks. The idea comes from a classic theoretical question: how can you simulate array indexing using nothing but stack operations (push, pop, top)?
By using two stacks — one for storage and one as an auxiliary structure — I recreated array-like random access behavior behind a clean C++ interface. It was a fun way to revisit fundamentals and rethink how basic structures can be built from simpler ones.
You can check out the full code and explanation here:
🔗 GitHub: https://lnkd.in/dDfGEbtw
Always enjoyable building things from first principles!
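The two-stack trick above can be reconstructed in a few lines (my own sketch in Python, not the author's C++ code): to read index i, pop elements off the storage stack onto an auxiliary stack until the target is on top, read it, then pop everything back.

```python
# Two-stack random access: main holds the data, aux parks elements
# while we dig down to the requested index.

class StackArray:
    def __init__(self):
        self.main = []   # storage stack (top = last pushed element)
        self.aux = []    # auxiliary stack used during traversal

    def push_back(self, value):
        self.main.append(value)

    def get(self, index):
        # Pop down until the target element is on top of main.
        steps = len(self.main) - 1 - index
        for _ in range(steps):
            self.aux.append(self.main.pop())
        value = self.main[-1]   # peek the "top" -- this is index i
        # Restore the original order from the auxiliary stack.
        while self.aux:
            self.main.append(self.aux.pop())
        return value

arr = StackArray()
for v in [10, 20, 30, 40]:
    arr.push_back(v)
```

Each access costs O(n) pops and pushes, which is exactly the trade-off the exercise is meant to expose: constant-time indexing is a property of arrays that stacks have to pay for.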
Optimizing code is fun until you realize the real bottleneck wasn’t your logic at all. It was a missing database index. You spent hours refactoring loops and polishing functions… Meanwhile, a single CREATE INDEX could’ve cut the query time from 2 seconds to 20 milliseconds. The lesson? Measure first. Tweak second. Profile your queries. Check EXPLAIN. Look at slow query logs. Sometimes performance isn’t about writing smarter code, it’s about knowing where to fix the problem.
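The "check EXPLAIN" advice above takes one minute to act on. Here's a small, hedged illustration using SQLite (the same idea applies to Postgres/MySQL EXPLAIN): the query plan flips from a full table scan to an index search after one CREATE INDEX. Table and column names are invented for the demo.

```python
# EXPLAIN QUERY PLAN before and after adding an index in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.executemany("INSERT INTO orders (customer_id) VALUES (?)",
                 [(i % 100,) for i in range(1000)])

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail).
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM orders WHERE customer_id = 7"
before = plan(query)   # full scan of the orders table
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(query)    # now searches via idx_orders_customer
```

Printing `before` and `after` makes the lesson concrete: same query, same code, completely different access path.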
**Congratulations! AI can now generate the 10,000 SQL files your database immediately throws in the garbage. This is called “progress.”** https://lnkd.in/eUyFp5dG
We spent 50 years teaching databases to generate code from metadata. Then GitHub convinced everyone that handcrafting 847 identical SQL files is “engineering.” Then we trained AI on this stupid pattern. Now AI generates MORE handcrafted SQL.
**Timeline of collective amnesia:**
**1976:** Databases invent information_schema. Problem solved forever.
**2006:** “Infrastructure as Code” works for stateless servers. Someone applies it to databases that ALREADY GENERATE THEIR OWN CODE. Nobody stops them.
**2016:** dbt launches. 1,000 handcrafted SQL files! Your database extracts metadata and THROWS YOUR CODE AWAY. You celebrate this as “modern data stack.”
**2020:** AI trained on GitHub. Sees handcrafted SQL, learns to generate MORE handcrafted SQL. Doesn’t see the metadata systems running every database for 40 years.
**2025:** You’re in a PR review changing 50 files. Your database regenerates all DDL from metadata in 0.3 seconds. The irony is lost on you.
**What actually happens when you run CREATE TABLE:**
1. Parse your artisanal SQL
2. Extract metadata
3. Store metadata
4. **THROW YOUR CODE IN THE TRASH**
5. Generate new DDL from metadata when asked
Your code was a disposable interface. The database kept metadata, discarded your code like a used napkin. But you’re storing that napkin in GitHub and training AI to generate more napkins.
**The numbers:**
- Your way: 3 weeks to handcraft 50 SQL files
- Database way since 1976: 30 seconds to generate from metadata
Every production database—Oracle, Postgres, MySQL, Snowflake—generates code from metadata. Has for decades. Powers millions of apps.
**AI never learned this. Because it’s not in GitHub.**
GitHub showed AI the workaround, not the solution running the entire digital economy since 1976.
Click below to watch an industry forget what it knew, then train AI on the amnesia. #DataEngineering #GitHubBrokeOurBrains #MetadataIsSource
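The "database keeps the metadata, not your file" claim is easy to verify. A hedged sketch using SQLite (Postgres and MySQL expose the same idea through information_schema): the catalog can hand the DDL back on demand, no .sql file required.

```python
# The database regenerates DDL from its own catalog. In SQLite the
# canonical CREATE statement lives in the sqlite_master table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    email TEXT NOT NULL
)""")

# Ask the catalog for the DDL -- metadata is the source of truth.
ddl = conn.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = 'users'"
).fetchone()[0]
```

Whatever .sql file originally carried that CREATE statement is now redundant: the catalog row is what the database actually consults.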
Day 53 of 90DaysOfDevOps
Yesterday I learned how to write efficient Dockerfiles. Today I explored how to take it a step further — by running multiple containers together using Docker Compose.
What is Docker Compose?
Docker Compose is a tool that lets you define and manage multi-container applications with a single YAML file (docker-compose.yml). Instead of running multiple docker run commands, you can bring up an entire environment — app, database, cache, and more — with just one command:
docker-compose up
Example Setup
Here’s a simple web app + database setup 👇
version: '3'
services:
  web:
    image: myapp:latest
    ports:
      - "8000:8000"
    depends_on:
      - db
  db:
    image: postgres:13
    environment:
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: mydb
Now, both your web app and database spin up together — connected automatically via an internal Docker network.
Why It Matters
✅ Simplifies multi-container setups
✅ Reproducible environments for development/testing
✅ Easy scaling and teardown
✅ Foundation for Kubernetes manifests later
If Docker is how you package your app, Docker Compose is how you orchestrate it — bringing multiple moving parts together into one simple, versioned file. #90DaysOfDevOps
💾 𝐃𝐚𝐭𝐚𝐛𝐚𝐬𝐞 𝐓𝐫𝐚𝐧𝐬𝐚𝐜𝐭𝐢𝐨𝐧𝐬 & 𝐂𝐨𝐧𝐬𝐢𝐬𝐭𝐞𝐧𝐜𝐲: 𝐚 𝐭𝐢𝐧𝐲 𝐩𝐫𝐨𝐣𝐞𝐜𝐭 𝐭𝐡𝐚𝐭 𝐭𝐚𝐮𝐠𝐡𝐭 𝐦𝐞 𝐛𝐢𝐠 𝐥𝐞𝐬𝐬𝐨𝐧𝐬.
𝐄𝐯𝐞𝐫 𝐰𝐨𝐧𝐝𝐞𝐫𝐞𝐝 𝐰𝐡𝐚𝐭 𝐫𝐞𝐚𝐥𝐥𝐲 𝐡𝐚𝐩𝐩𝐞𝐧𝐬 𝐰𝐡𝐞𝐧 𝐭𝐰𝐨 𝐮𝐬𝐞𝐫𝐬 𝐭𝐫𝐲 𝐭𝐨 𝐮𝐩𝐝𝐚𝐭𝐞 𝐭𝐡𝐞 𝐬𝐚𝐦𝐞 𝐛𝐚𝐧𝐤 𝐚𝐜𝐜𝐨𝐮𝐧𝐭 𝐚𝐭 𝐨𝐧𝐜𝐞? 𝐎𝐫 𝐡𝐨𝐰 𝐝𝐚𝐭𝐚𝐛𝐚𝐬𝐞𝐬 𝐦𝐚𝐠𝐢𝐜𝐚𝐥𝐥𝐲 𝐬𝐭𝐚𝐲 𝐜𝐨𝐧𝐬𝐢𝐬𝐭𝐞𝐧𝐭 𝐞𝐯𝐞𝐧 𝐰𝐡𝐞𝐧 𝐬𝐲𝐬𝐭𝐞𝐦𝐬 𝐜𝐫𝐚𝐬𝐡 𝐦𝐢𝐝-𝐮𝐩𝐝𝐚𝐭𝐞?
I built a small hands-on demo in TypeScript + Prisma + PostgreSQL to explore the ACID principles that keep our data reliable:
- Atomicity: everything or nothing
- Consistency: always a valid state
- Isolation: no interference between transactions
- Durability: nothing lost after commit
In this project, I simulated:
- 𝐀 𝐋𝐨𝐬𝐭 𝐔𝐩𝐝𝐚𝐭𝐞 𝐏𝐫𝐨𝐛𝐥𝐞𝐦: when concurrent transactions overwrite each other.
- 𝐀 𝐃𝐢𝐫𝐭𝐲 𝐑𝐞𝐚𝐝 𝐒𝐜𝐞𝐧𝐚𝐫𝐢𝐨: when one transaction reads another’s uncommitted data.
Watching the logs print “BEGIN → UPDATE → COMMIT” felt like seeing the database think.
Check out the repo here: https://lnkd.in/d6QBuctK
Sometimes the smallest projects explain the biggest software engineering principles.
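Atomicity, the "everything or nothing" property above, fits in a few lines. This is a hedged sketch using SQLite rather than the author's Prisma/PostgreSQL setup, with an invented bank-transfer table: a crash between the debit and the credit rolls both back.

```python
# Atomicity demo: a transfer either fully commits or fully rolls back.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(amount, crash=True):
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? "
                     "WHERE name = 'alice'", (amount,))
        if crash:
            raise RuntimeError("simulated crash between debit and credit")
        conn.execute("UPDATE accounts SET balance = balance + ? "
                     "WHERE name = 'bob'", (amount,))
        conn.commit()
    except RuntimeError:
        conn.rollback()   # atomicity: the debit is undone too

transfer(50)  # crashes mid-way; balances are untouched
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the simulated crash, `balances` is still `{"alice": 100, "bob": 0}`: without the rollback, alice would have lost 50 that bob never received.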
Google doesn't just test if you know SQL: they test whether you can deliver the right answer when the question gets messy.
I was working through a window function challenge with my teacher Luciano Vasconcelos Filho at Jornada de Dados this week, one that showed up in a Google interview.
On the surface: compare RANK(), DENSE_RANK(), and ROW_NUMBER() on an email dataset. But the real test? Knowing which function better answers the business question you're trying to solve, not just which one runs without errors.
And here's something that kept me up: each function embodies a different philosophy of competition.
→ ROW_NUMBER() is authoritarian: every row gets a unique position, and when there are ties, SQL breaks them based on internal (obscure...) criteria.
→ RANK() is democratic but chaotic: tied values share a rank, then we skip positions (1, 2, 2, 4...)
→ DENSE_RANK() is the diplomat: tied values share ranks, but no position gets left behind (1, 2, 2, 3...)
When two users have the same engagement score, how do you rank them? Do you assign them the same position and skip the next one? Do you keep sequential order but sacrifice uniqueness? Do you force differentiation where none exists?
There's no "right" answer, only tradeoffs. Just like in product strategy, organizational design, or choosing which Netflix show deserves your Friday night.
The best engineers don't just write queries. They interrogate the assumptions baked into every function.
And if you are curious about the solution, come and see it at my GitHub: https://lnkd.in/deC5NjXD
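The three philosophies are easiest to see side by side on one tied dataset. A minimal sketch (hypothetical engagement scores, not the interview's email data) using SQLite's window functions, available since SQLite 3.25:

```python
# ROW_NUMBER vs RANK vs DENSE_RANK on tied scores (b and c tie at 80).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE engagement (username TEXT, score INTEGER)")
conn.executemany("INSERT INTO engagement VALUES (?, ?)",
                 [("a", 90), ("b", 80), ("c", 80), ("d", 70)])

rows = conn.execute("""
    SELECT username,
           ROW_NUMBER() OVER (ORDER BY score DESC) AS rn,
           RANK()       OVER (ORDER BY score DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC) AS drnk
    FROM engagement
""").fetchall()
```

For the tied pair, ROW_NUMBER forces positions 2 and 3 (in an order the engine chooses), RANK gives both 2 and skips to 4, and DENSE_RANK gives both 2 and continues at 3 — the 1, 2, 2, 4 vs 1, 2, 2, 3 split described above.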
🧠 I always wondered how databases actually work behind the scenes.
During my early learning days, I used to happily connect to MongoDB or MySQL — never really thinking about what magic was happening underneath. How does data get stored? How does it know what to fetch? How does it handle updates? 🤔
This time, I decided to stop wondering and start building. 💡
So instead of using a database, I built my own mini database system from scratch! Here’s what I cooked up 👇
⚙️ Tech Used:
- Node.js
- TCP Sockets (net module)
- File System (fs module)
🧩 What It Does:
- Stores collections (like MongoDB collections) as JSON files
- Supports CRUD operations — insertOne, findOne, updateOne, deleteOne
- Automatically saves data into a db/ folder
- Communicates entirely through raw TCP socket connections (no external DB!)
🧠 What I Learned:
- How TCP communication actually works between a client and a server
- The logic behind how databases perform read/write/update operations
- JSON-based data persistence and its limitations
- Why abstractions like MongoDB or SQL exist — and how much they simplify our lives 😅
🔥 For fun, I even extended it to run over HTTP — so now I can send requests from Postman or a browser just like a real API!
It’s not production-ready (yet 😆), but it gave me a real hands-on understanding of how databases are built and operate at their core.
Check it out here 👇
🔗 GitHub: https://lnkd.in/gyPdwR5b
#LearningByBuilding #NodeJS #Database #SoftwareEngineering #LearningJourney #FullStackDeveloper #MERN #OpenSource
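The storage core of such a mini database is surprisingly small. A hedged sketch in Python (not the author's Node.js code, and skipping the TCP layer): each collection is one JSON file, and CRUD means load, mutate, rewrite. The insert_one/find_one names mirror the MongoDB-style API described above.

```python
# One JSON file per collection; CRUD = read file, change list, write file.
import json
import os
import tempfile

DB_DIR = tempfile.mkdtemp()   # stand-in for the db/ folder

def _path(collection):
    return os.path.join(DB_DIR, collection + ".json")

def _load(collection):
    if not os.path.exists(_path(collection)):
        return []
    with open(_path(collection)) as f:
        return json.load(f)

def _save(collection, docs):
    with open(_path(collection), "w") as f:
        json.dump(docs, f)

def insert_one(collection, doc):
    docs = _load(collection)
    docs.append(doc)
    _save(collection, docs)

def find_one(collection, query):
    # Match a document where every query key/value agrees.
    for doc in _load(collection):
        if all(doc.get(k) == v for k, v in query.items()):
            return doc
    return None

insert_one("users", {"name": "dee", "role": "builder"})
```

Rewriting the whole file on every insert is exactly the limitation the post mentions, and why real engines use append-only logs, pages, and indexes instead.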
Open Source Sunday! Today: Popeye, a #Kubernetes cluster resource sanitizer.
From the creator of k9s (Fernand Galiana), Popeye is a small but powerful utility that scans live Kubernetes clusters and reports potential issues with your resources and configurations. As clusters grow, it becomes harder to keep track of all manifests and policies. Popeye helps you by analyzing what is really deployed in your cluster, not what’s sitting on disk.
🔹 It checks for misconfigurations and stale resources
🔹 It shows if your pods are over- or under-allocated (when a metrics-server is running)
🔹 It helps you follow Kubernetes best practices
🔹 It never changes anything – Popeye is read-only
The tool scans your cluster for things like missing probes, wrong ports, unused ConfigMaps or Secrets, naked Pods, and RBAC issues.
You can run Popeye from your terminal and choose different output formats, like:
🔹 Standard (colorized output)
🔹 YAML, JSON, or HTML
🔹 JUnit for CI/CD
🔹 Prometheus metrics for monitoring
🔹 Score mode (0–100 cluster health)
💡 Tip: Combine Popeye with a Prometheus Pushgateway to collect cluster health scores over time. Together with a Grafana dashboard, you can enrich the developer experience and highlight configuration issues more clearly.
If you operate Kubernetes clusters, Popeye is a great helper to detect problems early and learn more about good resource hygiene.
👉 Check it out on GitHub: https://lnkd.in/d7n-_A_K
Do you have a use case for Popeye? Let me know in the comments below!
In computer systems, simple concepts are often the most useful. Example: idempotency.
Idempotency’s Latin roots:
idem -> "the same"
potent -> "having power" or "being able"
Put together, idempotent literally means "having the power to remain the same."
When you write a program that has "side effects" (e.g., writing to a database, making API calls), making it idempotent means: "Run it twice, and the world should look the same as if it ran once."
One way to achieve this is with idempotency keys: unique tokens representing one logical action. If the same request appears twice, the backend recognizes it and replays the stored result. If the program crashes mid-way, it can resume safely using that key.
In practice, Postgres can handle most of this for you. Use a unique key as the arbiter of "has this run before?". Conflicts become plain SQL constraint errors, and concurrent workers can be arbitrated with row locks.
Now, the example in this post is lots of manual work that modern durable-execution libraries handle for you. #postgres
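The idempotency-key pattern above fits in one function. A hedged sketch using SQLite in place of Postgres (same idea: a unique key column is the arbiter), with an invented charge_card side effect:

```python
# Idempotency keys: a replayed request returns the stored result
# instead of re-running the side effect.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE idempotency (key TEXT PRIMARY KEY, result TEXT)")

calls = {"count": 0}   # tracks how often the real side effect runs

def charge_card(amount):
    calls["count"] += 1           # the side effect we must not repeat
    return "charged " + str(amount)

def handle(idempotency_key, amount):
    row = conn.execute("SELECT result FROM idempotency WHERE key = ?",
                       (idempotency_key,)).fetchone()
    if row:
        return row[0]             # seen before: replay stored result
    result = charge_card(amount)  # first time: do the work...
    conn.execute("INSERT INTO idempotency VALUES (?, ?)",
                 (idempotency_key, result))
    conn.commit()                 # ...and remember it under the key
    return result

first = handle("req-123", 50)
second = handle("req-123", 50)   # same key: no second charge
```

Run it twice with the same key and the world looks the same as if it ran once: the card is charged exactly one time, and the caller still gets a result both times. In Postgres, `INSERT ... ON CONFLICT` on the key column would additionally arbitrate concurrent workers.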
Just added an Apache Flink skill, covering CDC, checkpoints, and operator tuning.