data subtldr week 8 year 2024
r/MachineLearning · r/dataengineering · r/sql
Principal's Guide to Data Engineering, Dissecting DBT Core's Rust Alternative, Defining SQL Basics, Google's New Open LLM Gemma, Transparency in Machine Learning Research
Week 8, 2024
Posted in r/dataengineering by u/_areebpasha • 2/19/2024
615
How true is this!
Meme
The Reddit thread discusses the application and limitations of AI in data engineering, with sentiment leaning toward skepticism about AI's current capabilities. Users point out that clean, well-organized data is a prerequisite for successful AI adoption, and they highlight the need for context and for guarding against AI hallucinations. Some express frustration with AI being treated as a one-size-fits-all solution, stressing the need for specific questions and feedback mechanisms. Others warn about the dangers of offloading work to AI, with one user recounting time-consuming debugging caused by bad AI-generated code. Lastly, commenters note the importance of AI literacy among leadership.
Posted in r/dataengineering by u/ithinkiboughtadingo • 2/19/2024
310
New DE advice from a Principal
Career
The Reddit thread 'New DE advice from a Principal', posted by u/ithinkiboughtadingo, offers advice to aspiring data engineers. The author emphasizes understanding software engineering fundamentals, learning battle-tested modeling and architecture patterns, defining a clear notion of project completion, cultivating empathy, understanding data stewardship, and avoiding over-reliance on a single solution. In the comments, users recommend books such as 'Designing Data-Intensive Applications' and suggest developing a broad knowledge base; a peer review process for design is also proposed. Reception is positive overall, with users appreciating the post's comprehensive nature and relevance. The thread underscores the complexity and evolving nature of the data engineering field.
Posted in r/MachineLearning by u/edienemis • 2/21/2024
290
[News] Google release new and open llm model: gemma model
News
The thread discusses Gemma, Google's newly released open LLM, which is reported to improve on earlier open models. Users discuss its benchmarks, noting that Google's website compares it against Mistral 7B. There is a hint of nostalgia for GPT-J, alongside humorous reflection on its capabilities. Some users respond positively to the news, while others discuss the alignment and fine-tuning of the base models. Questions are raised about the model's context length, and a few comments point out translation quirks on Google's website. The overall sentiment is positive and interested.
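For readers who want to try the release themselves, here is a minimal sketch of loading the weights through Hugging Face Transformers; the `google/gemma-7b` checkpoint name, the prompt, and the generation settings are illustrative assumptions rather than details from the thread.

```python
# Minimal sketch (assumes the Hugging Face `google/gemma-7b` checkpoint,
# gated-access approval, and the `transformers` + `accelerate` packages).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-7b"  # a 2B size and instruction-tuned "-it" variants were also released

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Illustrative prompt; base (non "-it") checkpoints are plain completion models.
inputs = tokenizer("The key idea behind data modeling is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```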
Posted in r/MachineLearning by u/hazard02 • 2/22/2024
263
[D] Why do researchers so rarely release training code?
Discussion
The Reddit thread '[D] Why do researchers so rarely release training code?' sparks a lively discussion about the challenges of sharing training code in machine learning research. Several users point out that the proprietary nature of some code, along with the risk of others using the work for follow-up papers, often discourages researchers from sharing. The substantial time and cost of ensuring the code can reproduce the original model are another deterrent. Some users express frustration with this practice, emphasizing the need for reproducible research and suggesting that academic conferences should mandate code sharing. Dissatisfaction with the field's scientific rigor is also prevalent, with one user warning that it could contribute to another AI winter.
Posted in r/MachineLearning by u/Signal-Aardvark-4179 • 2/22/2024
257
[D] MetaGPT grossly misreported baseline numbers and got an ICLR Oral!
Discussion
The Reddit thread discusses allegations that MetaGPT grossly misrepresented baseline numbers in a paper accepted as an oral at ICLR, a leading AI conference. The poster claims that the real GPT-4 and GPT-3.5-Turbo numbers are significantly higher than those reported by MetaGPT. Commenters express dissatisfaction that conference review apparently overlooked the issue, lament how promising papers get neglected on platforms like arXiv, and raise concerns about a perceived bias toward authors with PhDs or affiliations with prominent institutions. One commenter urges skepticism toward the EvalPlus leaderboard and calls for a deeper investigation. There is also a hint of disbelief over ICLR's rejection of the 'Mamba' paper, which some considered more innovative.
Posted in r/dataengineering by u/Background_Call6280 • 2/21/2024
160
Open source DBT core alternative written in Rust (30x faster)
Open Source
The Reddit thread discusses Quary, a new open-source alternative to dbt Core written in Rust that claims to be 30x faster. The author, u/Background_Call6280, explains that Quary offers advantages such as portability, ease of building additional tooling, automated inference and documentation, efficient testing, better handling of sensitive data, and an improved developer experience. Top comments question the need for the project and its claimed speed advantage, since dbt Core is also open source and the data crunching is done by the database, not the framework's language. Other comments express skepticism about switching to a new framework, and one commenter notes that Quary's revenue model becomes clear after checking its website. Overall, the sentiment is curious but cautious.
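To illustrate the commenters' point that the framework's language matters less than the SQL the warehouse executes, here is a minimal, hypothetical sketch of the template-rendering step that dbt-style tools perform; the model SQL, the `ref()` mapping, and the `analytics` schema are assumptions for illustration, not Quary's or dbt's actual internals.

```python
# Hypothetical sketch of what a dbt-style tool mostly does: render a SQL
# template (resolving ref() to real tables) and hand the compiled SQL to the
# database, which does the actual data crunching. Whether the renderer is
# written in Python or Rust mainly affects parse/compile time, not query time.
from jinja2 import Template

# Illustrative model file contents using a dbt-style ref() macro.
MODEL_SQL = """
select customer_id, sum(amount) as total_spend
from {{ ref('orders') }}
group by customer_id
"""

def ref(model_name: str) -> str:
    # Hypothetical resolution of a model name to a fully qualified relation.
    return f"analytics.{model_name}"

compiled_sql = Template(MODEL_SQL).render(ref=ref)
print(compiled_sql)
# The compiled SQL would then be sent to the warehouse for execution.
```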