data subTLDR: Week 17, 2025
r/MachineLearning • r/dataengineering • r/SQL
Unraveling the SQL-JSON Integration Controversy, Balancing Act of CTEs in SQL, Debunking Millionaire Myths in Data Work, The Shocking Simplicity of Bash Databases
Week 17, 2025
Posted in r/dataengineering by u/TheBigRoomXXL • 4/24/2025
706
WTF that guy just wrote a database in 2 lines of bash
Meme
The discussion centers on the simplicity and versatility of various database formats, highlighting that even a CSV can function as a database. Participants acknowledge the widespread, though begrudging, use of Excel as a critical business tool, despite its drawbacks. They note the common scenario of inheriting poorly documented, critical Excel databases from predecessors. A few reference Designing Data-Intensive Applications by Martin Kleppmann, acknowledging its status as a valuable resource in the field. The sentiment is mixed, with shared amusement and frustration at the diverse ways databases can be implemented.
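The two-line bash database the post marvels at is just an append-only log: one function appends a `key,value` record, the other scans for the last matching key. A minimal Python sketch of the same idea (file name is arbitrary, chosen for illustration):

```python
import os
import tempfile

# Append-only "database" in the spirit of the two-line bash version:
# a write appends "key,value"; a read scans for the last matching key.
DB_PATH = os.path.join(tempfile.gettempdir(), "toy_kv.db")

def db_set(key, value):
    with open(DB_PATH, "a") as f:
        f.write(f"{key},{value}\n")

def db_get(key):
    result = None
    with open(DB_PATH) as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                result = v  # last write wins
    return result

open(DB_PATH, "w").close()   # start from an empty log
db_set("answer", "41")
db_set("answer", "42")       # an "update" is just a newer append
latest = db_get("answer")
print(latest)  # 42
```

Writes are fast (pure appends); reads are O(file size), which is exactly the trade-off real log-structured storage engines fix with indexes and compaction.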
Posted in r/dataengineering by u/cmarteepants • 4/22/2025
448
Apache Airflow 3.0 is here – and it’s a big one!
Open Source
The release of Apache Airflow 3.0 has drawn mixed responses. The upgrade marks a significant shift in orchestration, introducing Service-Oriented Architecture, Asset-Based Scheduling, Event-Driven Workflows, DAG Versioning, and a modern React UI. While some users welcome the modernization and the move to an asset-lineage approach, others feel it imitates features from competitor Dagster without the improvements that made them work. Some suggest considering Dagster instead, citing its data-aware scheduling, native Python use, great UI, good metadata management, and easy backfills. Others are keen to explore the new UI and event-driven workflows in Airflow 3.0.
Posted in r/MachineLearning by u/choHZ • 4/25/2025
167
[R][P] We compress any BF16 model to ~70% size during inference, while keeping the output LOSSLESS so that you can fit in more context or run larger models.
Research
The community is discussing a new method for lossless compression of BF16 models to ~70% of their size during inference. This method, called DFloat11 (DF11), uses Huffman coding to compress model weights, reducing memory footprint during GPU inference. It allows users to run larger models or the same models with larger batch sizes and longer sequences, improving efficiency. However, DF11 tends to be ~40% slower when both versions of a model can fit in a single GPU. While there are concerns about the lossless claim and comparisons to other quantization formats, most users seem to appreciate the innovation and its potential benefits. The sentiment is generally positive with a focus on exploring DF11's practical applications and limitations.
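The intuition behind DF11-style compression is that the 8 exponent bits of BF16 weights carry far less than 8 bits of entropy in trained models, so entropy-coding just the exponents shrinks the format with no loss of information. A rough, self-contained sketch of that accounting on synthetic Gaussian "weights" (this is not the paper's implementation, which packs codes for efficient GPU decode; the distribution and sizes here are illustrative):

```python
import heapq
import random
import struct
from collections import Counter

def float_to_bf16_bits(x):
    # Truncate an IEEE-754 float32 to bfloat16 by keeping the top 16 bits.
    (u32,) = struct.unpack("<I", struct.pack("<f", x))
    return u32 >> 16

def huffman_code_lengths(freqs):
    # Standard Huffman construction; returns {symbol: code length in bits}.
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**a, **b}.items()}
        heapq.heappush(heap, (fa + fb, tie, merged))
        tie += 1
    return heap[0][2]

# Synthetic "weights": small zero-mean Gaussian, like a typical NN init.
random.seed(0)
weights = [random.gauss(0, 0.02) for _ in range(20000)]
# BF16 layout: bit 15 = sign, bits 14..7 = exponent, bits 6..0 = mantissa.
exps = Counter((float_to_bf16_bits(w) >> 7) & 0xFF for w in weights)

lengths = huffman_code_lengths(exps)
total = sum(exps.values())
avg_exp_bits = sum(exps[s] * lengths[s] for s in exps) / total
# DF11-style accounting: sign (1) + mantissa (7) stay raw; exponent is coded.
ratio = (1 + 7 + avg_exp_bits) / 16
print(f"avg exponent bits: {avg_exp_bits:.2f}, size: {ratio:.0%} of BF16")
```

Because weights cluster in a narrow magnitude range, only a handful of exponent values occur often, and the average code length lands well under 8 bits, which is where the ~70% figure comes from.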
Posted in r/SQL by u/roblu001 • 4/21/2025
164
Discovered SQL + JSON… Mind blown!
MySQL
The integration of SQL and JSON has sparked a lively debate among developers. Some are excited by the flexibility and dynamism that JSON functions bring to SQL; others caution against them, citing past regrets and painful reporting queries. The consensus leans towards caution, suggesting that storing JSON in relational databases is generally bad practice and better suited to NoSQL databases like MongoDB. Some, however, defend JSON in SQL, pointing to key-value-style storage options and native JSON support in modern RDBMSs. Even they warn that while JSON can be beneficial, it sacrifices some relational features and is best limited to metadata or seldom-used data. The sentiment is mixed, reflecting both enthusiasm and caution.
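As a taste of what the thread is debating, most modern RDBMSs ship JSON functions. SQLite's `json_extract` (available in most builds via the JSON1 extension, and in Python's bundled SQLite on recent versions) can pull a key out of a TEXT metadata column without unpacking the blob in application code:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, metadata TEXT)")
con.executemany(
    "INSERT INTO users (name, metadata) VALUES (?, ?)",
    [("ada", '{"theme": "dark", "beta": true}'),
     ("grace", '{"theme": "light"}')],
)
# json_extract reads a path out of the JSON blob inside the query itself.
rows = con.execute(
    "SELECT name, json_extract(metadata, '$.theme') FROM users ORDER BY name"
).fetchall()
print(rows)  # [('ada', 'dark'), ('grace', 'light')]
```

This is the pattern the cautious commenters endorse: relational columns for the data you query and join on, JSON only for loosely structured extras like user preferences.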
Posted in r/MachineLearning by u/skeltzyboiii • 4/22/2025
113
[R] One Embedding to Rule Them All
Research
Pinterest's new unified query embedding, OmniSearchSage, challenges the limitations of traditional architectures by blending GenAI-generated captions, user behavior, and board signals to understand items at scale. The system significantly improves search and ads relevance while keeping latency in check, all without compromising engineering pragmatism, offering a new perspective on how retrieval systems should be built. However, some users expressed frustration with Pinterest's search results leading to context-less posts. Others compared OmniSearchSage to Meta's ImageBind, suggesting the possibility of multi-modal search. There was also some skepticism about the paper's technical novelty, along with recommendations of other resources on recommender systems. Overall sentiment was mixed.
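Stripped of Pinterest's specifics, the core idea of a unified embedding is that queries, pins, ads, and boards all map into one vector space, so a single nearest-neighbor lookup can rank every entity type at once. A toy illustration with hand-made vectors (the names and numbers are invented for the example, not taken from the paper):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy "unified" index: pins, ads and boards live in the SAME space,
# so one query embedding ranks all item types together.
index = {
    "pin:sourdough-recipe": [0.9, 0.1, 0.0],
    "ad:stand-mixer":       [0.7, 0.2, 0.1],
    "board:bread-baking":   [0.8, 0.0, 0.2],
    "pin:car-detailing":    [0.0, 0.9, 0.3],
}
query_embedding = [1.0, 0.1, 0.1]  # would come from a trained query encoder
ranked = sorted(index, key=lambda k: cosine(query_embedding, index[k]),
                reverse=True)
top = ranked[0]
print(top)  # pin:sourdough-recipe
```

The engineering value is that one index and one encoder serve search, ads, and related-content surfaces, which is what makes the latency story plausible.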
Posted in r/MachineLearning by u/OogaBoogha • 4/23/2025
97
[D] Spotify 100,000 Podcasts Dataset availability
Discussion
The Spotify 100,000 Podcasts Dataset, offering 60,000 hours of English audio, was removed by Spotify, sparking a discussion on its availability. Despite its removal, the dataset was initially released under a Creative Commons (CC BY 4.0) license, suggesting it should be shareable if someone had downloaded it before its removal. No users reported having the dataset, but suggestions included reaching out to authors of papers who have used it, scraping RSS feeds or Spotify, or considering other similar datasets such as those available on Kaggle and GitHub. The sentiment leans towards proactive solutions for accessing or replacing the dataset.
Posted in r/SQL by u/jimothyjpickens • 4/24/2025
83
Is it bad that I’m using CTE’s a lot?
MySQL
The use of Common Table Expressions (CTEs) in SQL is generally seen as good practice: they make complex queries easier to read, modify, and maintain. Performance still deserves attention, though, since overusing CTEs can introduce inefficiencies such as convoluted joins or repeated scans of large tables (some engines re-evaluate a CTE each time it is referenced rather than materializing it once). CTEs shine at breaking a problem into manageable steps, but not every operation needs one; it pays to try alternative formulations and keep the best-performing version. And while CTEs help structure and communicate a solution, some commenters argue that reaching for them constantly can signal discomfort with larger, monolithic queries.
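For readers new to the construct, a CTE simply names an intermediate result so the final SELECT reads top-to-bottom instead of inside-out. A runnable example via Python's SQLite (table and data are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "ada", 120.0), (2, "ada", 80.0), (3, "grace", 300.0)],
)
# The CTE "totals" names the aggregation step; the outer query then
# filters it, instead of nesting the GROUP BY inside a subquery.
rows = con.execute("""
    WITH totals AS (
        SELECT customer, SUM(amount) AS total
        FROM orders
        GROUP BY customer
    )
    SELECT customer, total
    FROM totals
    WHERE total > 100
    ORDER BY customer
""").fetchall()
print(rows)  # [('ada', 200.0), ('grace', 300.0)]
```

Each added CTE is another named step, which is exactly why they aid readability, and also why a long chain of them can hide repeated work from the optimizer on some engines.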