
data subtldr week 14 year 2024

r/MachineLearning · r/dataengineering · r/sql

Querying 15TB Datasets Without Breaking the Bank, Decoding ETL and Data Pipelines, SQL Join Operations Beyond Venn Diagrams, LLMs: A Boon or Bane for AI Research?

Week 14, 2024
Posted in r/MachineLearning by u/NightestOfTheOwls on 4/4/2024
775

[D] LLMs are harming AI research

Discussion
The Reddit thread '[D] LLMs are harming AI research' sparked a heated discussion. Some users disagreed with the original poster (OP), arguing that LLMs are not plateauing and that criticism tends to surface once a technology becomes influential. They noted that work on continual learning and other model families continues despite the attention on LLMs. Some users pointed out that criticism of LLMs often stems from researchers seeing their own areas overshadowed. Others argued that the OP's expectations might be unrealistic and suggested waiting for future iterations before passing judgment. A few users questioned the OP's credibility, suggesting the OP may lack hands-on experience with machine learning. Overall, the sentiment was mixed but leaned towards defending the progress and potential of LLMs in AI research.
266 comments
View on Reddit →
Posted in r/MachineLearning by u/Stevens97 on 4/2/2024
421

[D] LLMs causing more harm than good for the field?

Discussion
The Reddit thread '[D] LLMs causing more harm than good for the field?' discusses the impact of large language models (LLMs) like GPT on the field of AI and machine learning. The original post expresses concern that the hype around LLMs is overshadowing the broader field, leading to superficial knowledge and unrealistic expectations. Some top comments draw parallels between the LLM hype and the blockchain craze, noting a surge in 'ChatGPT experts' who oversell simple API solutions. Others agree that LLMs are transforming business models in fields like healthcare, but highlight the risk of LLM-generated spam degrading online resources. Overall, the sentiment is mixed: some express concerns about overhype and misuse, while others believe LLMs offer significant value and urge professionals to adapt rather than worry.
167 comments
View on Reddit →
Posted in r/MachineLearning by u/AlphaSquared_io on 4/1/2024
281

[D] Can't escape OpenAI in my workplace, anyone else?

Discussion
The Reddit thread '[D] Can't escape OpenAI in my workplace, anyone else?' discusses the prevalence of OpenAI in workplaces despite perceived drawbacks. Commenters express frustration with OpenAI's dominance and the feeling of being forced to use it. Many feel that leadership is caught up in the hype around generative AI and dismisses better alternatives. Some suggest better communication strategies for proposing alternative solutions, while others admit that, despite their issues with OpenAI, it performs well and is hard to argue against when it is the most effective tool available. A few commenters also discuss other APIs, such as Claude and Gemini, as potential alternatives.
125 comments
View on Reddit →
Posted in r/dataengineering by u/de4all on 4/4/2024
193

Impact of DQ on AI

Meme
The Reddit thread 'Impact of DQ on AI' discussed the influence of data quality (DQ) on artificial intelligence. There was a shared concern about poor data quality produced by AI itself, with users calling it a 'shit data ouroboros'. Some suggested using AI in turn to improve data quality. A few commenters were confused by the term 'DQ', which was clarified as 'data quality'. The thread also featured quips about AI agents arguing with each other and humorous remarks on the complexity of coordinating multiple AI agents. Overall, the thread mixed humor and confusion with serious discussion of data quality.
9 comments
View on Reddit →
Posted in r/dataengineering by u/sarkaysm on 4/3/2024
144

Better way to query a large (15TB) dataset that does not cost $40,000

Help
The Reddit post discusses efficient and cost-effective strategies for querying a large 15TB dataset stored on AWS S3. Redditors suggest Athena for its low cost per TB scanned, with the caveat that the data must be well partitioned. Redshift Serverless is also suggested for its SQL data warehouse capabilities, which are useful for data scientists and analysts. Redditor 'xilong89' shares that a similarly sized query takes approximately 12 minutes to run on Redshift Serverless. Some users question the $40k figure, suggesting that Spark could handle larger joins for far less; 'johne898' recommends spinning up an EMR cluster and using Spark (Scala) to filter the data, asserting that the task shouldn't cost anywhere near $40k (a hedged sketch of the partition-pruning idea follows this entry). The overall sentiment is constructive and solution-oriented.
142 comments
View on Reddit →
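For readers who want to see what the EMR/Spark suggestion looks like in practice, here is a minimal, hypothetical PySpark sketch of the partition-pruning idea behind both the Athena and Spark advice: filter on a partition column and project only the needed columns so the engine reads a fraction of the 15TB. The bucket, paths, column names, and predicates are illustrative assumptions, not details from the thread.

# Hedged sketch, assuming Parquet data on S3 partitioned by event_date
# (e.g. s3://my-data-lake/events/event_date=2024-03-01/...). All names are
# hypothetical placeholders; the s3:// scheme assumes an EMR-style runtime.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("filter-large-dataset").getOrCreate()

# Spark discovers the event_date partitions from the directory layout.
events = spark.read.parquet("s3://my-data-lake/events/")

# A filter on the partition column is pushed down, so only the matching
# partitions are listed and read instead of the full dataset.
filtered = (
    events
    .filter(F.col("event_date").between("2024-03-01", "2024-03-31"))
    .filter(F.col("event_type") == "purchase")   # hypothetical predicate
    .select("user_id", "event_ts", "amount")     # read only needed columns
)

# Persist the much smaller result for cheap downstream querying.
filtered.write.mode("overwrite").parquet("s3://my-data-lake/curated/march_purchases/")

The same pruning is what keeps Athena costs down: when a table is partitioned on the filtered column, the engine skips objects whose partition values fall outside the predicate rather than scanning all 15TB.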
Posted in r/dataengineering by u/FuckingLovePlants on 4/7/2024
100

what do people mean when they say stuff like "ETL" and "building pipelines"?

Help
The Reddit thread is about understanding the terms 'ETL' and 'building pipelines' in data engineering. The top comments explain that ETL (Extract, Transform, Load) is the process of moving data from one or more sources into a final destination. While the original poster is already performing basic ETL tasks with SQL, the term usually refers to more complex, automated systems for large-scale data integration. 'Pipeline' describes the end-to-end workflow of an ETL process, like an assembly line for data (a toy example follows this entry). The complexity of ETL and pipeline work is highlighted, especially scaling processes across multiple data sources and ensuring data quality.
34 comments
View on Reddit →
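To make the jargon concrete, here is a deliberately tiny, hypothetical ETL script in plain Python; the file name, columns, and table are invented for illustration. Real pipelines wrap the same extract/transform/load shape in scheduling, retries, logging, and many more sources and destinations.

# Hedged sketch of a toy ETL job; every name here is a made-up example.
import csv
import sqlite3

def extract(path):
    # Extract: pull raw rows from a source system (here, a CSV export).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: clean, filter, and reshape the raw rows.
    records = []
    for row in rows:
        if not row.get("email"):              # drop incomplete records
            continue
        records.append((
            row["email"].strip().lower(),     # normalize text
            float(row["amount"]),             # cast strings to numbers
        ))
    return records

def load(records, db_path):
    # Load: write the cleaned rows into the destination (here, SQLite).
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS orders (email TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", records)

if __name__ == "__main__":
    # The "pipeline" is just these steps chained in order; orchestrators such
    # as Airflow add scheduling, dependency tracking, and monitoring on top.
    load(transform(extract("orders.csv")), "warehouse.db")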