
data subtldr week 5 year 2025

r/MachineLearning, r/dataengineering, r/sql

Unsung Heroes of Data: The 'Data Plumbers', Navigating Excel's Pitfalls, Lakehouse Medallion Architecture Critiques, Learning SQL from ChatGPT, Deciphering Past SQL Codes, Open-source Strategies of DeepSeek, Schmidhuber's Claims on ML Originality, Knowledge Distillation Controversy

Week 5, 2025
Posted in r/dataengineering by u/aacreans, 1/30/2025
1880

real

Meme
The thread compares data engineering with data science. Top comments express frustration that data engineers feel underappreciated relative to data scientists; commenters describe themselves as 'data plumbers' who do crucial work in the background. The comments also raise the difficulties of working with tools like Databricks and with unstructured data. Some users note that, despite the lack of recognition, data engineering offers solid job security. One comment mentions cloud bills inflated by poorly managed data structures. The thread ends with a user glad to have chosen data engineering, likening it to a reliable trade like plumbing.
66 comments
Posted in r/MachineLearning by u/we_are_mammals, 1/27/2025
934

[D] Why did DeepSeek open-source their work?

Discussion
The thread discusses why DeepSeek open-sourced their work even though their training is reportedly 45x more efficient, an advantage that could have let them dominate the market. Several top comments see a strategic move and cite other successful open-source ventures. One user brings up 'commoditizing the complement', where a product is made free to boost profits elsewhere. Another lists the benefits of open-sourcing, such as wider usage, community fine-tuning, and hype, which can lead to paid features and services. Some comments add that open source is not necessarily about profit and can be a political statement or a way to encourage innovation and collaboration. The sentiment is mostly positive, praising both the open-source model and DeepSeek's decision.
340 comments
Posted in r/MachineLearning by u/SirSourPuss, 1/31/2025
798

[D] DeepSeek? Schmidhuber did it first.

Discussion
The thread titled '[D] DeepSeek? Schmidhuber did it first.' generated a lively discussion of Schmidhuber's claims of originality in machine learning. The top comments are mixed. Some users are skeptical of Schmidhuber, accusing him of seeking attention and of drawing dubious connections between new concepts and his past work. Others acknowledge that he may have genuine attribution concerns but criticize his confrontational approach, which they believe distracts from meaningful discourse and undermines his credibility. A few comments also point out that OpenAI uses backpropagation, a technique they credit to Seppo Linnainmaa in 1970.
125 comments
Posted in r/MachineLearning by u/SimpleObvious4048, 2/2/2025
611

[D] Which software tools do researchers use to make neural net architectures like this?

Discussion
The thread (ID '1ig6k3l') asks which software tools researchers use to draw neural net architecture diagrams. The top-rated comments suggest a range of tools. Google Slides, mentioned by user 'msbosssauce', received the highest score. 'Agrareldan' and 'coriola' suggest Inkscape and draw.io respectively. 'Agrareldan' further explains his process: mapping things out in draw.io, creating elements in Inkscape, and then combining them. 'log_2' mentions PyTorch as a commonly used tool, with JAX and TensorFlow as alternatives. Others suggest PowerPoint, Microsoft Paint, and Keynote. 'BoogiieWoogiie' recommends the TikZ LaTeX package.
95 comments
Posted in r/MachineLearning by u/The-Silvervein, 1/30/2025
424

[d] Why is "knowledge distillation" now suddenly being labelled as theft?

Discussion
The thread titled '[d] Why is "knowledge distillation" now suddenly being labelled as theft?' debates whether knowledge distillation can fairly be called a form of theft. The top comments are skeptical of that framing, suggesting the issue is a matter of OpenAI's Terms of Service (ToS) rather than an actual copyright violation. Some users point out perceived hypocrisy, accusing OpenAI of having violated other companies' ToS. Cynicism about OpenAI's motives is prevalent, with some comments suggesting the company is trying to save face after a perceived failure. One notable comment argues that the matter will be resolved by legal interpretation, not in the court of public opinion.
125 comments
Posted in r/dataengineering by u/Pitah7, 1/29/2025
248

I swear I tested it bro

Meme
The thread titled 'I swear I tested it bro' sparked a discussion among data engineers about the difficulty of keeping data clean and accurate. Users shared their experiences with Excel and CSV files, highlighting the many ways data can be corrupted. One user creatively rephrased Tolstoy's Anna Karenina principle to describe data anomalies. There is a debate about the effectiveness of unit testing, with some arguing that it is not always sufficient to catch data errors. Others emphasize the need for good alerting and detailed logging for efficient debugging. The sentiment throughout is a mix of frustration and humor, reflecting the complexity of dealing with real-world data.
21 comments
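To make the logging point concrete, here is a minimal Python sketch of row-level validation for a CSV feed. It is an illustration only and is not taken from the thread: the `SCHEMA` columns (`order_id`, `amount`), the `validate_rows` helper, and the `ingest` logger name are all hypothetical. Instead of failing the whole load, it rejects bad rows individually and logs the line number and raw content of each one.

```python
import csv
import io
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("ingest")  # hypothetical logger name

# Hypothetical schema: column name -> parser that raises ValueError on bad input.
SCHEMA = {"order_id": int, "amount": float}


def validate_rows(text):
    """Split CSV text into (good_rows, bad_rows), logging every rejected row."""
    good, bad = [], []
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        line_no = reader.line_num  # physical line just read (header is line 1)
        try:
            # A missing column raises KeyError; a short row yields None,
            # so .strip() raises AttributeError; a bad value raises ValueError.
            parsed = {col: parse(row[col].strip()) for col, parse in SCHEMA.items()}
        except (KeyError, AttributeError, ValueError) as exc:
            log.warning("line %d rejected: %r (%s)", line_no, row, exc)
            bad.append((line_no, row))
            continue
        good.append(parsed)
    log.info("accepted=%d rejected=%d", len(good), len(bad))
    return good, bad


sample = "order_id,amount\n1,9.99\n2,abc\n3\n4,12.50\n"
good, bad = validate_rows(sample)
```

In this sample, line 3 has a non-numeric amount and line 4 is missing a field, so both are logged and set aside while the other two rows load. A real pipeline would likely route the rejected rows to a quarantine table and alert on the rejection rate, which is an assumption about design rather than something the thread prescribes.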
Posted in r/dataengineering by u/james2441139, 1/31/2025
207

How efficient is this architecture?

Discussion
The thread titled 'How efficient is this architecture?' discusses a lakehouse medallion architecture for an Azure environment. The discussion centers on the choice between Azure Databricks, Synapse, and Fabric, with several users recommending Azure Databricks as the more efficient and cost-effective option. The author notes, however, that they are contractually tied to Synapse and Fabric for the next 2-3 years. Commenters also criticize the data architecture itself, including the lack of a stated data ingestion frequency and the need for data quality checks in the staging area. A few users complimented the presentation of the diagram. Overall, the sentiment is mixed, with constructive criticism and suggestions for improvement.
65 comments