
data subtldr week 11 year 2024

r/MachineLearning, r/dataengineering, r/sql

AI Skepticism in Data Engineering, Best SQL Practice Platforms, Inspiration from Well-Written ML Codebases, OpenAI's Controversial Data Use

Week 11, 2024
Posted in r/dataengineering by u/marclamberti, 3/12/2024
801

It’s happening guys

Discussion
The Reddit thread titled "It's happening guys" in the Data Engineering subreddit takes a humorous and skeptical tone towards the idea of AI taking over tasks in data projects. The top comments discuss an AI software engineer named Devin, with users doubting an AI's ability to understand client needs, handle unexpected problems, and fit into the complex processes of engineering projects. Users are also skeptical of the claimed cost savings, with some suggesting that AI could actually prolong projects and increase costs. Several comments highlight the value of human interaction and problem-solving in real work situations. Overall, the sentiment is cautious and humorously critical of integrating AI into data engineering.
201 comments
Posted in r/dataengineering by u/InitiativeOk6728, 3/11/2024
562

ELI5: what is "Self-service Analytics" (comic)

Blog
The Reddit thread on the ELI5 comic about self-service analytics primarily reflects frustration and humor. Users liken the concept to a pizza shop where customers make their own pizzas, leading to chaos and dissatisfaction. The key issues highlighted are customers refusing to follow proper procedure, creating ineffective solutions, and insisting on incorrect data. The result is unhealthy competition among departments, each claiming its data is the absolute truth. The thread also mocks the misuse of AI and the habit of prioritizing loud, problematic users over quiet, efficient ones. Overall, the sentiment leans towards skepticism about how well self-service analytics works in practice.
106 comments
Posted in r/MachineLearning by u/unemployed_MLE, 3/15/2024
364

[D] What are some well-written ML codebases to refer to get inspiration on good ML software design?

Discussion
The Reddit thread discusses exemplary Machine Learning (ML) codebases for software design inspiration. The top recommendations include Beyond Jupyter, a self-study resource on software design for ML applications, and nanoGPT, a minimal GPT implementation built on the Transformer architecture from the "Attention Is All You Need" paper. However, some users critiqued codebases like scikit-learn and Hugging Face for their abstract interfaces, and advocated for using pure PyTorch and avoiding excessive code reuse. EasyFSL and the lucidrains repositories were mentioned favorably for Computer Vision. The thread also criticized config-based abstractions, stating that they make code hard to extend or refactor. Overall, the sentiment favored simplicity, ease of understanding, and ease of refactoring as key factors in good ML software design.
65 comments
Posted in r/MachineLearning by u/MetaGPT, 3/13/2024
305

[R] Data Interpreter: An LLM Agent for Data Science

Research
The Reddit thread discusses Data Interpreter, a Large Language Model (LLM)-based agent for data science introduced by the poster, MetaGPT. While the post reported the tool's superior performance on various tasks, some users expressed concern over the abstract's vagueness and over the automation of steps such as correlation analysis, imputation, and feature selection. Users emphasized the dangers of automating these processes without proper data analysis and domain-specific knowledge. The author responded, acknowledging the challenges and describing efforts to improve the tool's abilities. Some users were suspicious of the post's high upvote count relative to its number of comments, insinuating possible manipulation. Others reported that comments were not being displayed. Overall, the thread had a mixed sentiment of curiosity, skepticism, and technical concern.
13 comments
Posted in r/MachineLearning by u/pg860, 3/14/2024
287

[N] Ooops... OpenAI CTO Mira Murati on which data was used to train Sora

News
The Reddit thread discusses OpenAI CTO Mira Murati's ambiguous statement about the data used to train Sora, with many users expecting lawsuits to follow. The sentiment is largely critical of Murati's vagueness and of the possibility that personal data was used. Some users call the statement a red flag, while others believe that OpenAI will prevail in any lawsuits because it is leveraging publicly available data. There is also debate about whether Murati genuinely doesn't know the specifics of the training data or is intentionally withholding information. Some comments question the competence of top-level executives, suggesting that they might not be fully aware of the technical details.
270 comments
Posted in r/dataengineering by u/bjogc42069, 3/14/2024
193

What is the hardest you have ever seen someone work manually?

Discussion
The Reddit thread titled "What is the hardest you have ever seen someone work manually?" discusses inefficient manual processes in large companies, with a focus on data entry and management. Commenters express surprise that such practices persist in major organizations. Many share personal experiences of manually updating Excel files, scanning receipts, and even creating graphs in PowerPoint. Some users credit these tedious tasks with sparking their interest in programming, automation, and data analytics. The thread also suggests that these manual processes persist not because of technical limitations, but because of pressure to be data-driven and a lack of knowledge among non-technical employees. The general sentiment is a mix of disbelief, humor, and frustration at such outdated practices.
69 comments