
data subtldr: Week 9, 2025

r/MachineLearning · r/dataengineering · r/sql

Unpacking the Instagram Content Chaos; Debating the Relevance of Kimball Dimensional Modeling; SQL Lifesavers: CTEs, Window Functions, and the Power of Set Theory; FFT's Swift Return as a Self-Attention Alternative

Week 9, 2025
Posted in r/dataengineering by u/saaggy_peneer on 3/2/2025
472

DeepSeek releases distributed DuckDB

Blog
The Reddit thread discusses DeepSeek's release of a distributed DuckDB stack. Users express awe at 3FS sustaining a read throughput of 6.6 TiB/s on a 180-node cluster. Whether Smallpond, the DuckDB-based processing framework in the release, is the right fit depends on factors like data scale, infrastructure capability, and analytical complexity. Some users are skeptical that it is actually easier to use and deploy than existing solutions like Trino and Spark, and question the apparent dependence on Nvidia hardware. Others anticipate the next version or want to test it despite potential obstacles. Overall, the sentiment is cautiously optimistic but questioning. (A minimal single-node sketch of the pattern follows below.)
17 comments
View on Reddit →
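To give a feel for what Smallpond distributes: the building block is a DuckDB query over a Parquet partition, and Smallpond fans many of these out across a 3FS-backed cluster. Here is a minimal single-node sketch with plain duckdb; the table, column, and file names are invented for illustration, and Smallpond's own API (partitioning, multi-node scheduling) is not shown.

```python
import duckdb

con = duckdb.connect()

# Stand-in for one partition of a much larger dataset (names invented).
con.sql("""
    CREATE TABLE prices AS
    SELECT * FROM (VALUES
        ('AAPL', 187.1), ('AAPL', 189.4), ('MSFT', 402.6)
    ) AS t(ticker, price)
""")
con.sql("COPY prices TO 'part-0.parquet' (FORMAT PARQUET)")

# The single-node building block: a partial aggregation over a Parquet
# partition. Smallpond roughly runs one such aggregation per partition,
# one DuckDB instance each, and merges the partial results.
print(con.sql("""
    SELECT ticker, min(price) AS low, max(price) AS high
    FROM read_parquet('part-*.parquet')
    GROUP BY ticker
""").df())
```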
Posted in r/MachineLearning by u/hcarlens on 2/25/2025
354

[R] Analysis of 400+ ML competitions in 2024

Research
The Reddit thread discusses an analysis of over 400 machine learning competitions in 2024. Key takeaways: Kaggle remains the largest platform; competitions with $1m+ prize pools are increasing; Python is the preferred language among winners; convolutional neural nets remain popular in vision competitions; PyTorch is used roughly nine times as often as TensorFlow; AutoML is gaining utility but isn't yet at grandmaster level; quantisation is key in language/text/sequence competitions; gradient-boosted decision trees win many tabular and time-series competitions (see the baseline sketch below); Polars sees increased use for dataframes; Nvidia GPUs are the primary hardware for training models; and there's an emerging pattern of using generative models to create additional training data. Comments reflect appreciation for the analysis and a discussion of Jax vs PyTorch.
22 comments
View on Reddit →
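To make the "gradient-boosted trees still win tabular" takeaway concrete, here is a minimal scikit-learn baseline of the kind the report describes; the dataset is synthetic and purely for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a tabular competition dataset.
X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Histogram-based GBDT: the class of model the analysis says keeps
# winning tabular and time-series competitions.
model = HistGradientBoostingClassifier(max_iter=200, learning_rate=0.1)
model.fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.3f}")
```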
Posted in r/MachineLearning by u/jacobfa on 2/26/2025
348

[R] The FFT Strikes Back: An Efficient Alternative to Self-Attention

Research
The Reddit thread discusses a paper proposing the Fast Fourier Transform (FFT) as an efficient alternative to self-attention. The author, u/jacobfa, has refined an earlier FFT-based mixing approach, making it more scalable and effective, reportedly outperforming traditional self-attention on many benchmarks. The sentiment in the comments is generally positive, with users appreciating the mathematical soundness of the method and its potential applications in signal processing. However, some point out the need for careful implementation to avoid wrap-around artifacts, and hence the need for padding. Others note that while the FFT provides global token interactions, that does not make it equivalent to attention. The author maintains that the FFT has advantages over convolutions, particularly for long-range dependencies. (A toy sketch of FFT token mixing follows below.)
65 comments
View on Reddit →
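To make the wrap-around discussion concrete, here is a toy FNet-style FFT token-mixing layer in PyTorch; this sketches the general technique, not the OP's exact architecture. Because the discrete Fourier transform is circular, mixing an unpadded sequence lets its end "leak" into its start; zero-padding the sequence axis, as the commenters suggest, avoids that.

```python
import torch

def fft_token_mix(x: torch.Tensor, pad: bool = True) -> torch.Tensor:
    """FNet-style mixing: FFT over tokens and features, keep the real part.

    x: (batch, seq_len, dim). With pad=True the sequence axis is
    zero-padded to suppress circular wrap-around artifacts.
    """
    seq_len = x.shape[1]
    if pad:
        # pad tuple covers (last dim left/right, seq dim left/right)
        x = torch.nn.functional.pad(x, (0, 0, 0, seq_len))
    mixed = torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real
    return mixed[:, :seq_len, :]  # drop the padded tail

x = torch.randn(2, 16, 64)
print(fft_token_mix(x).shape)  # torch.Size([2, 16, 64])
```

Global interaction comes for free here (every output token depends on every input token), but there is no content-dependent weighting, which is the commenters' point that FFT mixing is not attention.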
Posted in r/dataengineering by u/Starktony11 on 2/26/2025
269

Wtf is happening in instagram feed? Any meta employees or engineers want to explain the plausible cause? And why it could happen?

Discussion
The Reddit thread discusses a sudden influx of violent and shocking content in Instagram feeds. Users speculate the cause could be an accidental push to production at Meta, or the algorithm amplifying extreme content because it generates more engagement. Some suggest reduced content moderation or a strike among the moderation team. A few users report no change, with their feeds remaining the same. There's also speculation that this could be a deliberate new approach to user engagement, or simple human error at Meta. The overall sentiment is confusion and concern about the change in content.
113 comments
View on Reddit →
Posted in r/dataengineering by u/mrbartuss on 2/24/2025
238

Best Data Engineering 'Influencers'

Discussion
The Reddit thread discusses favorite data engineering influencers. Luigi Mangione received the highest praise, followed by Joseph Machado, Benjamin Rogojan, Alexey Grigorev, and Data with Zach, though Zach drew some criticism for bragging too much about his success. Some users also recommended the Advancing Analytics channel and its presenter Simon Whiteley. Scott Taylor was appreciated for his practical approach, and Charity Majors was noted for her work on database reliability engineering. On the other hand, some users expressed skepticism about influencers in general, suggesting they may be biased or overly commercial, and recommended independent research instead.
100 comments
View on Reddit →
Posted in r/dataengineering by u/Exact_Line on 2/28/2025
237

Is Kimball Dimensional Modeling Dead or Alive?

Discussion
The Reddit thread titled "Is Kimball Dimensional Modeling Dead or Alive?" generated a robust discussion, with a consensus that Kimball dimensional modeling is still relevant and widely used in 2025. Several users commended its structure and efficiency, as well as its strong performance with columnar storage. However, some apply the principles more loosely than in the past, making pragmatic decisions based on specific use cases, and others suggested combining Kimball's methodology with approaches such as One Big Table where they fit better. Concerns were raised about the trend of dumping data without structure, and about analysts taking on data engineering roles without adequate technical knowledge. (A minimal star-schema example follows below.)
131 comments
View on Reddit →
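For readers new to the debate, here is a minimal star schema run through duckdb in Python: a narrow fact table keyed to a descriptive dimension, the shape Kimball advocates and columnar engines reward. Table and column names are invented for illustration.

```python
import duckdb

con = duckdb.connect()

# Dimension: descriptive attributes. Fact: narrow, numeric, keyed.
con.sql("CREATE TABLE dim_customer (customer_key INTEGER, region TEXT)")
con.sql("CREATE TABLE fact_sales (customer_key INTEGER, amount DECIMAL(10, 2))")
con.sql("INSERT INTO dim_customer VALUES (1, 'EMEA'), (2, 'APAC')")
con.sql("INSERT INTO fact_sales VALUES (1, 120.00), (1, 80.00), (2, 45.50)")

# The typical dimensional query: aggregate the fact table, slice by a
# dimension attribute.
print(con.sql("""
    SELECT c.region, sum(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_customer c USING (customer_key)
    GROUP BY c.region
""").df())
```

The One Big Table alternative mentioned in the thread would pre-join region onto every sales row, trading storage and update flexibility for simpler queries.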
Posted in r/MachineLearning by u/danielhanchen on 2/26/2025
185

[P] Train your own Reasoning model - GRPO works on just 5GB VRAM

Project
The Reddit thread discusses an update to GRPO (Group Relative Policy Optimization) training, which now runs on just 5GB of VRAM. Users asked whether it extends to 70B models and about the implications of training other models to mimic a reasoning model's responses. The author, u/danielhanchen, clarified that the memory savings don't degrade accuracy and that a 70B model would need around 65GB of VRAM. He highlighted the importance of the reward function and noted that GRPO isn't just for code or math; it can improve tasks like email automation, database retrieval, law, and medicine. The thread generally reflects a positive sentiment toward the GRPO updates. (A toy sketch of GRPO's group-relative advantage follows below.)
25 comments
View on Reddit →
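The "reward function matters" point is easier to see in code. GRPO samples a group of completions for the same prompt, scores each with the reward function, and standardizes rewards within the group, so no learned value model is needed. A toy sketch; the reward function here is invented purely for illustration.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor,
                              eps: float = 1e-6) -> torch.Tensor:
    """Core GRPO idea: a completion's advantage is its reward
    standardized against the other completions sampled for the
    same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def toy_reward(completion: str) -> float:
    # Hypothetical reward: favor answers that show their reasoning.
    return 1.0 if "because" in completion else 0.0

completions = [
    "42",
    "42, because 6 * 7 = 42",
    "I think it's 41",
    "42, because the product of 6 and 7 is 42",
]
rewards = torch.tensor([toy_reward(c) for c in completions])
print(group_relative_advantages(rewards))  # reasoning-bearing answers win
```

Since the advantages are entirely determined by this scoring step, a sloppy reward function is optimized just as faithfully as a good one, which is why the author stresses its design.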
Posted in r/MachineLearning by u/jsonathan on 3/2/2025
143

[P] I made weightgain – an easy way to train an adapter for any embedding model in under a minute

Project
The Reddit thread discusses 'weightgain', a user-created tool that trains an adapter for any embedding model in under a minute. Positive feedback centers on the library's ability to fine-tune models that sit behind an API, potentially improving retrieval accuracy and performance. The tool is appreciated by users such as 'Quarkle'. Questions are raised about training data preferences, optimal use cases, and real-world performance, and some users express confusion over what exactly the tool optimizes and how it is structured. The name 'weightgain' also draws amused approval. Overall, the thread suggests a useful and well-received tool that sparked curiosity and technical discussion. (A generic sketch of the adapter idea follows below.)
21 comments
View on Reddit →
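The underlying idea, training a small adapter on top of frozen (possibly API-served) embeddings, can be sketched in a few lines of PyTorch. This is a generic illustration, not weightgain's actual API; the in-batch contrastive setup and the random stand-in embeddings are assumptions.

```python
import torch
import torch.nn.functional as F

dim = 256  # embedding size of the (frozen) upstream model

# Pretend these came from an embedding API: query/document pairs that
# should match (random stand-ins for real cached embeddings).
queries = torch.randn(32, dim)
documents = queries + 0.1 * torch.randn(32, dim)

adapter = torch.nn.Linear(dim, dim)  # the only trainable piece
opt = torch.optim.Adam(adapter.parameters(), lr=1e-3)

for step in range(100):
    q = F.normalize(adapter(queries), dim=-1)
    d = F.normalize(adapter(documents), dim=-1)
    logits = q @ d.T / 0.07           # cosine similarity, temperature-scaled
    labels = torch.arange(len(q))     # i-th query matches i-th document
    loss = F.cross_entropy(logits, labels)  # in-batch contrastive loss
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.4f}")
```

Because only the adapter's weights are trained on cached embeddings, a run like this finishes in seconds even on a laptop, which is consistent with the "under a minute" claim.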