
data subtldr week 12 year 2024

r/MachineLearning, r/dataengineering, r/sql

Decoding F1's Excel Tracking System, Dagster's Launch Party Fever, The Confusion of Data Warehouse Terminology, Contemplating a Career Pivot to SQL, AI's Role in Academic Research Quality, The Burn-Out of Machine Learning Interviews

Week 12, 2024
Posted in r/MachineLearning by u/vvkuka, 3/18/2024
742

[D] When your use of AI for summary didn't come out right. A published Elsevier research paper

Discussion
The Reddit thread "[D] When your use of AI for summary didn't come out right. A published Elsevier research paper" was dominated by criticism of the current state of academic research. Users lamented the growing competition for research output, which they said pushes authors toward volume over quality. Accusations of plagiarism, lack of personal effort, and failure of the peer review process were common. The medical domain was singled out for applying open-source models with minimal understanding of machine learning pitfalls. Some commenters suggested that a significant percentage of academic papers lack value. The overall sentiment was negative.
89 comments
Posted in r/MachineLearning by u/Tiny-Masterpiece-412, 3/23/2024
466

[D] Feeling burnt out after doing machine learning interviews

Discussion
The Reddit thread '[D] Feeling burnt out after doing machine learning interviews' discusses the challenges faced by a user who has been unsuccessful after 30 interviews for machine learning roles. The user describes the wide range of topics covered in the interviews as overwhelming. Commenters empathize with the user's sentiment, noting the unpredictability of interviews, particularly in startups. Some highlight the negative impact of the AI hype curve on hiring, suggesting hiring managers fear false positives. Others criticize the emphasis on memorization during interviews. A few suggest focusing on specialties within ML, rather than trying to cover all areas. There's also an emphasis on the importance of behavioral interviews. Overall, the thread reflects a feeling of frustration at the complexity and inconsistency of ML job interviews.
92 comments
Posted in r/MachineLearning by u/danielhanchen, 3/19/2024
462

[P] How I found 8 bugs in Google's Gemma 6T token model

Project
The Reddit thread discusses a user's discovery of 8 bugs in Google's Gemma 6T token model and the fixes that followed. The user shared the findings through their open-source package, Unsloth, which they report makes Gemma finetuning 2.5x faster while using 70% less VRAM. They found the issues by comparing different implementations; the bugs included a typo in the model's technical report and a problem with RoPE embeddings. The thread received positive engagement, with users congratulating the author and asking about the bug discovery process. Some joked that the author should be hired by Google or should expect a buyout from NVIDIA. The author clarified that they prefer to focus on building a startup with their brother.
55 comments
Posted in r/dataengineering by u/Tape56, 3/19/2024
232

F1 team Williams used Excel as their database to track the car components (hundreds of thousands of different components)

Meme
The Reddit thread discusses the use of Excel by F1 team Williams to track hundreds of thousands of car components. Users expressed surprise, with some offering to help improve their system. A user named BigDataBoy suggested that this might have changed under James Vowles, a data-oriented person who led data integration for Mercedes F1. Another user, GraspingGolgoth, proposed using a proper database as an alternative to Excel. OMG_I_LOVE_CHIPOTLE noted that Excel usage is common in finance, despite its limitations. There was also a discussion on the benefits of relational databases like Postgres and MySQL for maintaining integrity and consistency across an organization. Overall, the sentiment was a mix of shock and suggestions for improvement.
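As a rough illustration of the integrity point raised in the thread, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for Postgres or MySQL. The table names, column names, and part numbers are hypothetical and are not taken from the thread or from Williams' actual system; the sketch only shows the kind of checks a relational database can enforce and a spreadsheet does not.

```python
import sqlite3


def build_parts_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical parts schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE component (
            part_no TEXT PRIMARY KEY,                  -- no duplicate part numbers
            name    TEXT NOT NULL,
            stock   INTEGER NOT NULL CHECK (stock >= 0)  -- no negative stock
        );
        CREATE TABLE fitted_part (
            car     TEXT NOT NULL,
            part_no TEXT NOT NULL REFERENCES component(part_no)  -- must exist
        );
        """
    )
    return conn


def try_insert(conn: sqlite3.Connection, sql: str, params: tuple) -> bool:
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


conn = build_parts_db()
add_component = "INSERT INTO component VALUES (?, ?, ?)"
fit_part = "INSERT INTO fitted_part VALUES (?, ?)"

accepted = try_insert(conn, add_component, ("FW-001", "front wing", 3))
duplicate = try_insert(conn, add_component, ("FW-001", "front wing copy", 1))
negative = try_insert(conn, add_component, ("RW-002", "rear wing", -1))
orphan = try_insert(conn, fit_part, ("car-1", "FW-999"))
valid_fit = try_insert(conn, fit_part, ("car-1", "FW-001"))
```

In this sketch the duplicate part number, the negative stock count, and the reference to a part that does not exist are all rejected by the database itself, while the two valid rows are stored.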
53 comments
Posted in r/dataengineering by u/floydophone, 3/21/2024
223

We (Dagster) are throwing a party

Discussion
Pete, the CEO of Dagster Labs, announced a launch party for their new product, Dagster+. The event, supported by the /r/dataengineering subreddit, is set to happen in SF and NYC. Reception among users was positive, with many expressing wishes for the event to be hosted in their locations, including London, Chicago, Milan, and Buenos Aires. Some users also suggested virtual participation or sending party packages to those unable to attend. There was also interest in the new Dagster+ features that would allow sharing of Directed Acyclic Graphs (DAGs) and job monitoring capabilities. Overall, the announcement sparked enthusiastic and humorous engagement among users.
35 comments
Posted in r/dataengineering by u/leogodin217, 3/20/2024
165

Can We Stop Using Marketing Terms to Define Data Warehouses?

Discussion
The Reddit thread "Can We Stop Using Marketing Terms to Define Data Warehouses?" sparked a lively discussion about how marketing has distorted technical terms in data engineering. Users agreed that the term "data warehouse" has been misused and misunderstood, often tied to specific technologies rather than recognized as a process. They noted that a data warehouse can become just a disordered collection of company data if it is not handled correctly. Some users said marketing jargon made it hard for them to understand basic definitions in the field. The overall sentiment was a call for precise terminology to avoid confusion in the industry.
52 comments