data subtldr week 27 year 2023

r/MachineLearning, r/dataengineering, r/SQL

Cloud Migration: A Complex Benefit, Becoming Databricks Certified in Python, Ultimate Guide to dbt, Categorizing SQL Proficiency, Self-Studying SQL, Improved nanoT5 v2 Model, LongNet's Billion Tokens Scale, Faster Reinforcement Learning Optimization

July 9, 2023 • Week 27, 2023

Posted in r/dataengineering by u/tarzanboy76 • 7/6/2023

193

Is cloud a big scam?

Discussion

The Reddit thread discusses the pros and cons of migrating to cloud platforms, using Azure as an example. The author expresses concerns about the cost, security, and scalability in the cloud. Commenters agreed that while the cloud is not a scam, it is complex and requires significant changes in work processes and cost/value evaluation. They argued that cloud platforms are beneficial when non-value operations are outsourced and workloads are optimized. The thread emphasized the need for a proper skill set to manage cloud infrastructure and the importance of reevaluating the value of existing analytics pipelines. Overall, the sentiment was mixed, acknowledging the challenges but also the potential benefits of cloud platforms.

125 comments

Posted in r/MachineLearning by u/korec1234 • 7/5/2023

147

[P] nanoT5 v2 - In ~16 hours on a single GPU, we reach similar performance to the model trained on 150x more data!

Project

The thread covers nanoT5 v2, an improved version of nanoT5, a project for pre-training T5 models in PyTorch. In about 16 hours on a single GPU it reaches performance similar to the original model, which was trained on 150x more data. The main upgrades are BF16 precision and a simplified T5 model implementation based on Hugging Face's design, making it 2x faster. Other things the author tried, including the Lion and Sophia optimizers, ALiBi positional embeddings, and FP16 mixed-precision training, did not yield the expected benefits. Commenters ask why the advanced optimizers did not help and suggest pushing the model to Hugging Face for broader experimentation. The author promises further insights in an upcoming paper. The overall sentiment is positive and intrigued.

34 comments

Posted in r/MachineLearning by u/MysteryInc152 • 7/6/2023

141

[R] LongNet: Scaling Transformers to 1,000,000,000 Tokens

Research

The thread discusses LongNet, a model from Microsoft Research (Asia) that can scale sequence length to over 1 billion tokens. Commenters praise its linear computational complexity, its ability to model long sequences, and its use of dilated attention. Some users, however, ask about its limitations and performance. It is noted that while LongNet performs strongly on long-sequence modeling and general language tasks, it is not state-of-the-art. One user also points out that solutions to the quadratic scaling issue have been proposed before, citing the papers "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" and "Linformer".
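
To make the scaling point concrete, here is a minimal sketch that counts query-key pairs for dense attention versus a single dilated pattern. It is an illustration only, not LongNet's implementation: the function names and the `segment=256`, `dilation=4` values are assumptions chosen for the example, and the paper's dilated attention reportedly mixes several segment lengths and dilation rates rather than one fixed pair.

```python
def dense_pairs(n: int) -> int:
    """Query-key pairs scored by full self-attention over n tokens."""
    return n * n


def dilated_pairs(n: int, segment: int, dilation: int) -> int:
    """Pairs scored when the sequence is split into fixed-size segments and
    only every `dilation`-th token inside each segment is kept for attention.
    Hypothetical helper for illustration."""
    total = 0
    for start in range(0, n, segment):
        seg_len = min(segment, n - start)          # last segment may be short
        kept = len(range(0, seg_len, dilation))    # tokens kept after sparsifying
        total += kept * kept                       # dense attention among kept tokens
    return total


# Doubling n quadruples the dense count but only doubles the dilated count.
for n in (1_024, 2_048, 4_096):
    print(n, dense_pairs(n), dilated_pairs(n, segment=256, dilation=4))
```

With the segment length and dilation held fixed, the number of segments grows with `n` while the work per segment stays constant, which is where the linear growth comes from.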

28 comments

Posted in r/dataengineering by u/Background_Debate_94 • 7/5/2023

135

Just got certified! - Databricks certified associate developer for apache spark 3.0 in Python

Career

The Reddit thread is about a user, 'Background_Debate_94', sharing their experience of becoming a Databricks certified associate developer for Apache Spark 3.0 in Python. They recommend the certification for its value and permanence, and share their resources and strategies. Other users in the thread, such as '1PLSXD' and 'No_Conversation_2474', affirm the usefulness of the mentioned resources and the ease of the certification process, respectively. The original poster also provides tips on getting discounts for the certification and suggests it's possible to prepare for the exam in 2 days if one is already familiar with the architecture. However, 'IshiharaSatomiLover' highlights the cost as a potential barrier.

47 comments

Posted in r/MachineLearning by u/nicku_a • 7/7/2023

108

[P] 10x faster reinforcement learning hyperparameter optimization than SOTA - now with distributed training!

Project

The thread discusses a significant update to a reinforcement learning framework by user 'nicku_a', which claims 10x faster training and hyperparameter optimization. The update adds distributed training, a new Sampler class, and the TD3 algorithm. Reactions are mixed and somewhat humorous: some users express excitement about the upcoming product, while others jokingly speculate about AI-generated comments and token costs. Despite the jests, the overall sentiment seems positive, and users are looking forward to seeing how the product evolves.

17 comments

Posted in r/dataengineering by u/tbrownlow • 7/5/2023

I attempted to create the Ultimate Guide to dbt

Blog

The Reddit thread discusses a user's attempt to create an Ultimate Guide to dbt. The guide was well-received, with a high upvote ratio of 0.98. However, some users provided constructive criticism. One user regarded it more as a helpful cheat sheet than an ultimate guide and highlighted slow website performance. Another user had issues with navigation and viewing parts of the guide, to which the author suggested trying the mobile link. Other comments included appreciation for the guide and an automated post sharing a list of community-submitted learning resources. The overall sentiment was positive, acknowledging the usefulness of the guide but suggesting room for improvement in accessibility.

8 comments

Posted in r/SQL by u/st418s21 • 7/7/2023

Is there anyone else who is also self-studying?

Discussion

In the Reddit thread titled Is there anyone else who is also self-studying?, the author, st418s21, discusses their self-study journey in SQL and seeks a study partner. The post has been well-received with a score of 63 and an upvote ratio of 0.92. Several users have shown interest in joining the author. Notably, the author has set up a Discord group for this purpose, albeit acknowledging that many members aren't very active. They encourage new contacts to reach out via Discord. Additionally, resources like SQLZoo, AlexTheAnalyst on YouTube, and Leetcode are recommended for learning SQL. The thread has a positive sentiment overall, indicating a supportive community for self-learners in SQL.

137 comments

Posted in r/SQL by u/StevenG1819 • 7/8/2023

How would you categorize SQL proficiency? (Beginner, Intermediate, Advanced)

Discussion

The Reddit thread discusses the categorization of SQL proficiency levels. The most appreciated comment provides a detailed tier list of skills for SQL Analysts and Admins, ranging from basic commands like SELECT and JOIN to advanced concepts such as execution plans and disaster recovery. Some users expressed that this list was humbling and made them rethink their own proficiency levels. The thread's sentiment is a mix of surprise and appreciation for the learning journey in SQL. The thread's author also encourages continuing to learn and experiment with new concepts, emphasizing that expertise comes from understanding all the ways a concept doesn't work.
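
For readers placing themselves on such a list, the sketch below shows what the two ends mentioned above might look like in practice: a basic SELECT with a JOIN, and a first look at an execution plan. It uses Python's built-in `sqlite3` module with made-up `customers` and `orders` tables, which are assumptions for illustration and do not come from the thread. Plan output differs by database engine and version, so it is fetched here without being interpreted.

```python
import sqlite3

# In-memory database with two small hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 5.0), (3, 2, 12.5);
""")

# Beginner end of the list: SELECT with a JOIN and a simple aggregate.
query = """
SELECT c.name, SUM(o.total) AS spent
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.name
ORDER BY c.name
"""
rows = conn.execute(query).fetchall()
print(rows)

# Advanced end of the list: ask the engine how it plans to run the query.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
for step in plan:
    print(step)
```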

38 comments

Posted in r/SQL by u/DataNerd760 • 7/7/2023

SQL Practice Platform

Discussion

The Reddit thread discusses a new SQL Practice Platform created by user 'DataNerd760', intended as a resource for SQL learners to practice their skills. The platform was generally well-received, with users expressing appreciation for its creation. One user, 'letswai', asked if the platform also offered courses. Some users shared feedback, including 'malist42' who found the 'Mass Shooting Dataset' on the site off-putting, to which 'DataNerd760' agreed and promised to add more diverse datasets. The overall sentiment was positive, with users appreciating the initiative and providing constructive feedback to improve the platform.

22 comments
