data subtldr week 29 year 2023

r/MachineLearning, r/dataengineering, r/SQL

Quality of Data Scientists' Work Under Fire, The Humor and Frustration in Data Engineering, Imperfect Data's Light-hearted Challenges, Effective SQL Practice for Job Searches, Incrementally Building SQL Queries: Yay or Nay?, Meta's Llama 2 Free for Commercial Usage, HuggingFace's Potential $200M Funding Boost

July 23, 2023 • Week 29, 2023

Posted in r/dataengineering by u/tarzanboy76 • 7/17/2023

536

Data Scientists -- Ok, now I get it.

Discussion

The Reddit thread discusses the quality and credibility of data scientists' work, prompted by a user's complaint about poorly written SQL code that a data scientist handed over for production. Key points include an assertion that 25% of data scientists could be scam artists and that similar issues exist in academia, with examples of fraudulent data causing significant setbacks in research fields such as Alzheimer's. Another point raised is the potential security risk of using ChatGPT to write code. Commenters also discussed the 'DRY' (Don't Repeat Yourself) coding principle, suggesting it need not always be followed. The sentiment in the thread was generally critical of certain practices in data science.

219 comments

Posted in r/dataengineering by u/Top-Substance2185 • 7/20/2023

400

Barbenheimer, Data Engineering edition

Meme

The Reddit thread titled 'Barbenheimer, Data Engineering edition', posted by 'Top-Substance2185', is a humorous take on the challenges of data engineering. The top comments express a relatable journey from excitement to weary familiarity with fixing production pipeline issues, as noted by 'End__User'. 'Wistephens' humorously compares the progression in data engineering to becoming a 'destroyer of worlds'. 'Invisibl3I' poses a hypothetical question about the potential catastrophe of dropping a table, underlining the critical role of data engineers. 'Swapripper' raises a common career question about moving from data analyst to data engineer. The overall sentiment is a mix of humor, frustration, and acknowledgment of the gravity of the data engineer's role.

18 comments

Posted in r/MachineLearning by u/timedacorn369 • 7/18/2023

387

[N] Llama 2 is here

News

The Reddit thread discusses the release of 'Llama 2', a family of large language models from Meta, which is now free for commercial usage. Users applaud Meta's strategy of encouraging open development on the model and contrast it with OpenAI's paid ChatGPT offering. One user pointed out the unusual behavior of the 34B model, noting that it is less safe and that its performance does not scale as expected. Although there is criticism of Meta's data handling, the overall sentiment towards the release of Llama 2 is positive, with some users even expressing newfound admiration for the firm. The change to free commercial usage is seen as a significant and welcome development.

95 comments

Posted in r/dataengineering by u/Sailja_Jain • 7/19/2023

219

Fact

Meme

The Reddit thread titled 'Fact' in the data engineering subreddit, posted by Sailja_Jain, humorously discusses the quality of data. The top-rated comment by 'WhyDoIHaveAnAccount9' states that data is never perfect, and if it was, it would render them jobless. 'JollyJustice' humorously refers to their data as 'dirty' and jokes about a data scientist who calls himself a 'janitor' due to his constant data cleaning tasks. 'Somenewname4me' and 'prof_herp_derp' simply agree with the sentiment that data is never perfect. Overall, the thread reflects a light-hearted sentiment about the challenges of dealing with imperfect data.

10 comments

Posted in r/MachineLearning by u/HugoDzz • 7/22/2023

188

[P] A Chrome extension to save paper details

Project

The Reddit thread discusses a Chrome extension created by user HugoDzz, built with Hugging Face Transformers.js and SvelteKit, which saves pieces of text from PDFs for easy retrieval later. The extension has been positively received, with users suggesting improvements such as using DOI information for paper classification and incorporating Named Entity Recognition (NER) to build a graph structure. Another user highlighted the potential usefulness of a feature for quickly searching terms or subject matter across all files and notes in the operating system. Questions were also raised about how the project handles Manifest v3, reflecting a common concern about that platform's lack of documentation and clarity.

31 comments

Posted in r/MachineLearning by u/hardmaru • 7/21/2023

163

[N] HuggingFace reported to be reviewing term sheets for a funding round that could raise at least $200M at a valuation of $4B.

News

The Reddit thread discusses AI startup HuggingFace's potential funding round, which could raise at least $200M and value the company at $4B. Users speculate about the company's monetization strategy and endgame, with one user describing HuggingFace as a cloud compute provider offering fully managed hosting for AI models. The need for funds to cover bandwidth costs, especially with the release of new large language models (LLMs), is also mentioned. Some users point out that HuggingFace provides crucial features for teams that are not freely available. The thread also highlights the value of the startup, citing its growing user base, database of models, and talented team as key assets.

32 comments

Posted in r/dataengineering by u/Agitated_Ad_1108 • 7/23/2023

125

I have never had my code reviewed thoroughly

Career

The Reddit thread is from a senior engineer expressing concern about their code not being thoroughly reviewed by colleagues or managers due to their data analyst/DBA background. Many comments echo this sentiment, highlighting a lack of code reviewing culture in the data engineering field. They describe situations where pull requests are merged quickly, often without proper review, leading to feelings of unprofessionalism and missed opportunities for growth. Suggestions include submitting PRs to an open source project and not worrying about the issue, focusing instead on learning from more experienced data engineers. A prevalent sentiment is the desire for a more cooperative and thorough approach to code reviews.

64 comments

Posted in r/SQL by u/scehood • 7/17/2023

How can I practice SQL for job searches? Especially with excel

Discussion

The Reddit thread discusses various ways to practice SQL for job searches. Users suggest watching the SQL tutorial by freeCodeCamp on YouTube, practicing on SQLBolt, HackerRank, or LeetCode, and learning advanced topics such as window functions and flow of execution. Users also recommend using real-life datasets for exploratory data analysis. For dealing with large CSV files, suggestions include creating a normalized database structure, importing the data, and creating indexes for performance. One commenter notes that SQL queries can't run directly against Excel data, but that Excel can load data from a SQL Server using a query. Resources such as the Stack Overflow databases and the book SQL Practice Problems are also suggested for further learning. Overall, the sentiment is supportive and informative.
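As a rough illustration of the CSV workflow described above (normalize, import, index), here is a minimal sketch using Python's built-in `sqlite3` module. The table names, column names, and sample rows are hypothetical and not taken from the thread, and the window-function query assumes the bundled SQLite is version 3.25 or newer.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a large CSV export.
CSV_TEXT = """order_id,customer,amount
1,alice,30.0
2,bob,12.5
3,alice,20.0
4,bob,40.0
"""


def load_orders(conn, csv_file):
    """Import CSV rows into a small normalized schema and index the join key."""
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE orders (
            order_id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id),
            amount REAL
        );
    """)
    for row in csv.DictReader(csv_file):
        # Each customer name is stored once; orders point to it by id.
        conn.execute(
            "INSERT OR IGNORE INTO customers (name) VALUES (?)",
            (row["customer"],),
        )
        conn.execute(
            "INSERT INTO orders (order_id, customer_id, amount) "
            "SELECT ?, id, ? FROM customers WHERE name = ?",
            (int(row["order_id"]), float(row["amount"]), row["customer"]),
        )
    # Index the foreign key so joins and per-customer lookups stay fast.
    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
    conn.commit()


def running_totals(conn):
    """Window function: running total of amount per customer, by order_id."""
    return conn.execute("""
        SELECT c.name, o.order_id,
               SUM(o.amount) OVER (
                   PARTITION BY c.name ORDER BY o.order_id
               ) AS running_total
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id
        ORDER BY c.name, o.order_id
    """).fetchall()


conn = sqlite3.connect(":memory:")
load_orders(conn, io.StringIO(CSV_TEXT))
rows = running_totals(conn)
```

For a real file, `io.StringIO(CSV_TEXT)` could be swapped for an opened CSV file, and a file path could replace `":memory:"` to keep the practice database on disk.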

16 comments

Posted in r/SQL by u/nikjojo • 7/20/2023

Is it bad practice to

Discussion

The Reddit thread on SQL usage practices reveals a consensus among users that the method of incrementally building and testing queries is effective and common. Users suggest it helps in troubleshooting and understanding the development process. However, it's recommended to test these queries on noncritical systems like Development environments or backups, not on the production system. Performance issues can arise with larger datasets but limiting query results can help manage this. The original poster voices concerns about the impact of this approach on company databases and its acceptance during technical interviews. The founder of DataLemur confirmed that the method is correct. Overall, the sentiment is positive towards the method.
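The incremental approach summarized above might look like the following sketch, which builds a query in three small steps and caps exploratory results with `LIMIT`. The in-memory SQLite database stands in for a development copy, and the `sales` table and its rows are hypothetical, not from the thread.

```python
import sqlite3

# In-memory database standing in for a development copy (never production).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO sales (region, amount) VALUES
        ('north', 10.0), ('south', 25.0), ('north', 15.0),
        ('south', 5.0), ('east', 40.0);
""")

# Step 1: inspect the raw rows, capped with LIMIT to keep the check cheap.
step1 = conn.execute(
    "SELECT id, region, amount FROM sales ORDER BY id LIMIT 3"
).fetchall()

# Step 2: add a filter and confirm it behaves as expected.
step2 = conn.execute(
    "SELECT id, region, amount FROM sales "
    "WHERE amount >= ? ORDER BY id LIMIT 3",
    (15.0,),
).fetchall()

# Step 3: add the aggregation only once the filter has been verified.
step3 = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "WHERE amount >= ? GROUP BY region ORDER BY total DESC",
    (15.0,),
).fetchall()
```

Checking each step's output before adding the next clause makes it easier to see which change introduced a problem.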

25 comments

Posted in r/SQL by u/cptstoneee • 7/22/2023

How can I improve my understanding of our business?

Discussion

The Reddit thread is about improving business understanding in a global pharmaceutical manufacturing company. Key suggestions from the top comments include: listen, ask questions, and confirm understanding (VanTechno); understanding the business you're supporting is as vital as coding (pineapple_catapult); considering your role within the company and understanding the structure of the databases you work with are important (BrupieD); setting regular discussions with decision makers to learn more about the company (dataguy24); and knowing your role, identifying pain points or risks, and quantifying them (ItalicIntegral). The overall sentiment is constructive, emphasizing the importance of communication, understanding one's role, and business knowledge.

7 comments
