data subtldr week 31 year 2023
r/MachineLearning • r/dataengineering • r/SQL
Polars Receives $4M Seed Funding Amidst Query Engine Wars, PySpark Big Data Course Hits YouTube, Nvidia GPU Shortage Sparks Silicon Valley Gossip, SQL Formatting Sparks Heated Debate
Week 31, 2023
Posted in r/MachineLearning by u/ejmejm1 • 7/31/2023
397
[D] Where did all the ML research go?
Discussion
The Reddit thread '[D] Where did all the ML research go?' discusses a perceived narrowing of focus on the Machine Learning subreddit, where research topics have become less varied and LLMs dominate. Commenters worry that the subreddit's size and popularity have eroded diversity: niche topics are often ignored while trending ones gain attention. Some suggest finding smaller, specialized communities or following researchers they like on platforms such as Twitter. Others criticize the subreddit's moderation and call for tighter controls on who can post in order to improve content quality. The sentiment is a mix of nostalgia for past variety and frustration at the current state of the subreddit.
Posted in r/dataengineering by u/growth_man • 8/1/2023
294
Fancy dashboards with volatile data pipelines!
Meme
The Reddit thread 'Fancy dashboards with volatile data pipelines!', posted by 'growth_man', drew an engaged response. Many users described their struggles with data infrastructure and their desire for better automation. 'Muppet-Mindset' shared an article about data modeling and expressed frustration that their organization disregards the issues it outlines. The same user also highlighted the need for visual representations to help businesses understand new data interfaces. Another user asked whether DLT supports CDC/incremental loading and how difficult it would be for a Python beginner. The thread mixed humor, serious discussion, and shared struggles, reflecting the complexities of data engineering.
Posted in r/MachineLearning by u/norcalnatv • 8/5/2023
177
[D] Nvidia GPU shortage is ‘top gossip’ of Silicon Valley
Discussion
The Reddit thread discusses the ongoing Nvidia GPU shortage. The community mainly agrees that Nvidia's dominance comes from its software infrastructure, specifically the widespread use of CUDA. Other tech companies such as Google, Apple, Cerebras, AMD, and Intel could potentially capitalize on the shortage, but the lack of software support for non-Nvidia hardware is a significant obstacle. Countering suggestions that Nvidia is deliberately limiting production to keep prices high, users argue that the shortage stems from capacity issues at TSMC and is affecting multiple industries. They point to Nvidia's increased revenue forecast as a sign that it continues to ship every chip it can. Overall, commenters want more software support for non-Nvidia hardware and a quick resolution to the shortage.
Posted in r/dataengineering by u/onurbaltaci • 8/4/2023
158
I recorded a PySpark Big Data Course (Python API of Apache Spark) and uploaded it on YouTube
Blog
The Reddit thread is about a PySpark Big Data Course that the author, 'onurbaltaci', uploaded to YouTube. It was well received, with a score of 158 and a 0.97 upvote ratio. Top comments praised the course's quality, including its good audio and clear code, and appreciated that the content is organized with chapter markers. Commenters also suggested improvements, such as omitting the Python installation section, given the advanced nature of the course, and listing all the tools used. The author responded positively and agreed to incorporate the suggestions in future courses. Overall, the sentiment in the thread was highly appreciative and constructive.
Posted in r/dataengineering by u/mailed • 8/3/2023
156
Polars gets seed round of $4 million to build a compute platform
Blog
The Reddit thread discusses the $4 million seed round that Polars raised to build a compute platform. Users anticipate a 'query engine war' between Polars and DuckDB, both of which are now backed by start-ups and expanding beyond single-computer use to the cloud. While DuckDB is lauded for data discovery and long-term storage, Polars is favored when integrated with Python code. Users question the market potential of these tools as paid products, pointing out that scale and cloud hosting are usually what people expect to pay for. Despite the excitement around Polars, some users expect Python bindings to remain prevalent in their work.
Posted in r/MachineLearning by u/zy415 • 8/1/2023
114
[D] NeurIPS 2023 Paper Reviews
Discussion
The Reddit thread discusses the NeurIPS 2023 paper reviews. Users share their experiences and advice on handling reviews. One user suggests that even if initial scores are low, revisions and rebuttals could turn things around, as demonstrated by another user whose paper was accepted after improving from a score of 2 to 5. The importance of addressing reviewer concerns to avoid downgrades is emphasized. However, some express reluctance to engage in the rebuttal process. Overall, the thread highlights the challenges and opportunities in the review process, urging authors to remain resilient and value their work regardless of the score.
Posted in r/SQL by u/wittedPundit • 8/4/2023
76
Tamper Proofing in the Digital Age: A Look at Proof of SQL
Discussion
The Reddit thread on 'Tamper Proofing in the Digital Age: A Look at Proof of SQL' drew a mixed response. Some users were skeptical and asked for more evidence to back the tamper-proofing claim. Others appreciated the approach, which takes inspiration from zero-knowledge protocols and adds its own spin, resulting in what is described as 'Proof of SQL'. Another user lauded 'Space and Time' for proactively pursuing external auditing and testing, seeing it as a sign of their commitment to data protection.
Posted in r/SQL by u/tcfan35842 • 8/3/2023
63
Formatting really matters!
Discussion
The Reddit thread titled 'Formatting really matters!' sparked a discussion on code readability and coding style in SQL. The original poster, a junior data analyst, expressed frustration at the inconsistent formatting of the scripts their team manages in dbt. Top comments offered diverse viewpoints: some consultants admitted they encounter various styles but don't interfere if the code works, while others recommended tools like poorsql.com for formatting. Some suggested enabling linting in the dbt pipeline for consistency. Another commenter stressed the need to handle company code carefully and avoid sharing it on unapproved platforms. Overall, the thread acknowledged individual coding styles but emphasized the importance of consistency and readability.
Posted in r/SQL by u/coffee-toast_199 • 8/3/2023
29
How to get into data analytics?
Discussion
The Reddit thread discusses ways to enter the field of data analytics with no prior experience. Key suggestions include starting as a junior data analyst and creating a portfolio showcasing existing projects, as recommended by top commenter AaronScwartz12345. Other users suggest learning Python and SQL, understanding the basics of databases, and using online resources for self-learning. Some recommend gaining industry knowledge through entry-level business analytics positions. Commenters also emphasized that passion, willingness to learn, and the ability to tell a story using data are as important as technical skills. The overall sentiment is positive and encouraging, highlighting the accessible nature of data analytics as a career.