data subTLDR week 16 year 2025

r/MachineLearning, r/dataengineering, r/SQL

Oracle Database Update Scare: Lessons Learnt, Tracking Access to Prevent Data Loss, soarSQL's Success in Querying Large CSV Files, The Data Quality Battle in Organizations, Cutting Through Data Engineering Jargon

April 20, 2025 • Week 16, 2025

Posted in r/dataengineering by u/IdlePerfectionist • 4/20/2025

2649

You can become a millionaire working in Data

Meme

The thread reflects a mixed sentiment about the potential of becoming a millionaire through data work. A widely supported opinion emphasizes the impact of a privileged background in wealth accumulation. There is also a humorous undertone concerning peripheral costs, such as mechanical keyboards and programming socks. In a more serious contribution, a comment highlights that anyone earning $70,000-$80,000 annually could become a millionaire by maximizing their 401(k) contributions, suggesting that being a millionaire isn't as significant as it used to be. Some users express that given the high salary of data engineers, they could easily become millionaires within a decade.

81 comments
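
The 401(k) claim in the thread above is easy to check with a few lines of arithmetic. This is only a rough sketch: the $23,500 yearly contribution and the 7% nominal annual return are assumptions chosen for illustration (they are not figures from the thread), and employer match, inflation, taxes, and fees are all ignored.

```python
def years_to_target(annual_contribution: float, annual_return: float, target: float) -> int:
    """Count full years until the balance reaches the target.

    Each year the existing balance grows first, then one contribution
    is deposited at year end.
    """
    balance = 0.0
    years = 0
    while balance < target:
        balance = balance * (1 + annual_return) + annual_contribution
        years += 1
    return years


# Assumed inputs (hypothetical): $23,500 per year at a 7% nominal return.
print(years_to_target(23_500, 0.07, 1_000_000))
```

Changing the assumed return rate moves the answer a lot, which is one reason the commenters disagree on how meaningful the milestone is.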

Posted in r/SQL by u/danmc853 • 4/18/2025

850

Whoops

Oracle

The discussion revolves around a mishap involving an Oracle database update, which was luckily rolled back without any damage. Participants express relief and share similar experiences, emphasizing the importance of testing updates before running them. Several commenters give detailed advice on using Oracle's ROLLBACK, Flashback, and LogMiner features to reverse unwanted changes, with some calling them a 'game-changer' for database management. Despite the initial panic, the sentiment is mostly positive, with users appreciating the learning opportunity. The thread also briefly touches on data integrity issues caused by vendors changing their XML messages.

70 comments
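
The rollback habit the commenters describe can be sketched in a few lines. This is an illustration only: it uses Python's built-in `sqlite3` module as a stand-in for Oracle, the `employees` table and its rows are made up, and Oracle-specific features such as Flashback and LogMiner are not shown.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Oracle (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees (id, name, salary) VALUES (?, ?, ?)",
    [(1, "Ada", 100), (2, "Grace", 120), (3, "Edgar", 90)],
)
conn.commit()

# The "whoops": an UPDATE with no WHERE clause touches every row.
cursor = conn.execute("UPDATE employees SET salary = 0")
rows_touched = cursor.rowcount

# Sanity check before committing: only one row was supposed to change.
if rows_touched != 1:
    conn.rollback()  # discard the uncommitted change
else:
    conn.commit()

salaries = [row[0] for row in conn.execute("SELECT salary FROM employees ORDER BY id")]
print(rows_touched, salaries)
```

The point of the pattern is that the row count is checked while the change is still uncommitted, so a mistake costs a rollback rather than a restore.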

Posted in r/dataengineering by u/growth_man • 4/14/2025

682

Data Quality Struggles!

Meme

The discussion focuses on the challenges of maintaining data quality within organizations. There's a shared understanding that this area is often overlooked, which quietly holds back growth. Participants express frustration that improvements in data quality and site reliability go unacknowledged precisely because they prevent incidents from happening, which creates the perception that the job is easy. Some also feel overwhelmed by the burden of managing raw data dumped into systems for front-end development. The trade-off between time, budget, and diminishing returns in data quality efforts is also acknowledged.

14 comments

Posted in r/dataengineering by u/luminoumen • 4/16/2025

489

Data Engineering: Now with 30% More Bullshit

Blog

Data engineering professionals express frustration regarding the overuse of jargon and marketing speak in their field. Many believe that executives are easily swayed by complex terminology and end up investing heavily in unnecessary solutions. There's a shared sentiment that much of the work can be executed by competent data engineers using simpler, more cost-effective methods. Some even suggest that many data engineering concepts are merely recycled ideas repackaged with new terms. Overall, the sentiment is positive towards clear, practical, and honest insights in the field, with a shared desire to cut through the 'bullshit'.

31 comments

Posted in r/MachineLearning by u/celerimo • 4/17/2025

372

[N] We just made scikit-learn, UMAP, and HDBSCAN run on GPUs with zero code changes! 🚀

News

NVIDIA's cuML team has launched a beta version of their accelerator mode, enabling GPU acceleration for scikit-learn, UMAP, and HDBSCAN with no code changes. Although the release is still being refined, it shows significant speed improvements across a range of machine learning algorithms. However, performance can vary depending on the size and characteristics of the dataset. The new tool has been welcomed by the community, but some users advocate for CUDA to become an open standard. There's also a desire for the tool to support a wider range of algorithms, such as clustering with a pre-calculated distance matrix.

23 comments

Posted in r/dataengineering by u/ratczar • 4/18/2025

299

Some of you aren't writing tests. Start writing tests.

Blog

The discussion emphasizes the importance of testing in data engineering, stressing the need to write tests once code becomes part of a complex system or other people start using it. Tests make the purpose and functionality of code explicit, helping other developers understand and modify it. Several types of tests are highlighted, including unit, integration, end-to-end, and data validation tests. However, some contributors argue that DDL/DML already provides data validation and that constraints can prevent many issues. The role of testing frameworks in language selection is also debated, with the suggestion that understanding the trade-offs of technology choices is crucial. The overall sentiment is mixed, with some resistance to the testing emphasis.

51 comments
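
Two of the test types named in the summary above, a unit test and a data validation test, might look roughly like this. Everything here is hypothetical: `normalize_email`, `validate_rows`, and the sample rows are invented for illustration and do not come from the blog post or the thread.

```python
def normalize_email(raw: str) -> str:
    """Hypothetical transformation under test: trim whitespace and lowercase."""
    return raw.strip().lower()


def validate_rows(rows: list[dict]) -> list[str]:
    """Data validation check: return one message per rule violation."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        if row.get("id") is None:
            problems.append(f"row {i}: missing id")
        elif row["id"] in seen_ids:
            problems.append(f"row {i}: duplicate id {row['id']}")
        else:
            seen_ids.add(row["id"])
        if "@" not in (row.get("email") or ""):
            problems.append(f"row {i}: invalid email")
    return problems


def test_normalize_email():
    # Unit test: one function, one documented behavior.
    assert normalize_email("  Ada@Example.COM ") == "ada@example.com"


def test_validate_rows_flags_bad_data():
    # Data validation test: known-bad input must be reported, not loaded silently.
    rows = [
        {"id": 1, "email": "ada@example.com"},
        {"id": 1, "email": "grace@example.com"},  # duplicate id
        {"id": None, "email": "no-at-sign"},      # missing id, bad email
    ]
    assert validate_rows(rows) == [
        "row 1: duplicate id 1",
        "row 2: missing id",
        "row 2: invalid email",
    ]


if __name__ == "__main__":
    test_normalize_email()
    test_validate_rows_flags_bad_data()
    print("all tests passed")
```

The duplicate-id and missing-id rules are the kind of check that the constraint advocates in the thread would push into the database itself as PRIMARY KEY and NOT NULL constraints.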

Posted in r/SQL by u/Adela_freedom • 4/18/2025

247

That moment when someone asks, 'Who accessed prod?' 😲 It should not be a mystery.

Discussion

The discussion centers on the need for proper access controls and tracking mechanisms in database administration. Participants highlight the value of keeping user query logs to identify who is responsible for unwanted changes, advice that often goes unheeded until significant data is lost. Many express frustration over the common misuse of shared accounts with elevated permissions and the resulting lack of accountability. Tools like Oracle Unified Audit Trail and temporal/system-versioned tables are suggested for tracking changes, although they require careful setup and can't always provide quick fixes. The overall sentiment leans towards an urgent need for better data governance practices.

21 comments
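
The change-tracking idea behind temporal or system-versioned tables can be approximated with a history table and a trigger. The sketch below is a simplification under stated assumptions: it uses SQLite through Python's `sqlite3` module, the `accounts` and `accounts_history` schemas are made up, and because SQLite has no user accounts it records what changed and when, not who made the change. It is not how Oracle Unified Audit Trail works.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);

    -- History table: one row per overwritten version (hypothetical schema).
    CREATE TABLE accounts_history (
        id INTEGER,
        old_balance INTEGER,
        new_balance INTEGER,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );

    -- Record every balance change automatically.
    CREATE TRIGGER accounts_audit AFTER UPDATE OF balance ON accounts
    BEGIN
        INSERT INTO accounts_history (id, old_balance, new_balance)
        VALUES (OLD.id, OLD.balance, NEW.balance);
    END;

    INSERT INTO accounts (id, balance) VALUES (1, 500);
    UPDATE accounts SET balance = 450 WHERE id = 1;
    UPDATE accounts SET balance = 0 WHERE id = 1;
    """
)

history = conn.execute(
    "SELECT id, old_balance, new_balance FROM accounts_history ORDER BY rowid"
).fetchall()
print(history)
```

In a database with real user accounts, the same trigger would typically also store the session user, which only helps if people are not sharing one elevated login.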

Posted in r/MachineLearning by u/sh_tomer • 4/18/2025

227

arXiv moving from Cornell servers to Google Cloud

News

arXiv's move from Cornell servers to Google Cloud is seen as a standard transition to cloud services, not a change in ownership, mirroring a trend seen across many companies in the last decade. Benefits such as cloud backup are viewed positively. Some speculate that the move could be motivated by Google's desire for easy access to arXiv's data for AI training, or by funding cuts to US universities. However, there are concerns that rewriting the codebase and moving to the cloud at the same time could lead to overcoupling to Google Cloud Platform and to project failure. Overall, the sentiment is mixed.

19 comments

Posted in r/SQL by u/infirexs • 4/20/2025

191

I have developed a full website for practice SQL for everyone

MySQL

The creator of sqlsnake.com, a new website for practicing SQL, received constructive feedback from the community. Despite initial reluctance, the creator agreed to add mobile support due to user demand. Users also advised against making assumptions about visitor behavior and suggested improvements for user retention. Criticisms included the use of bad practices in the site's AI assistant and tutorial content, lack of direct navigation to desired topics, and overly simple queries. Some users questioned its advantage over existing platforms like SQLBolt. The overall sentiment was mixed, with appreciation for the initiative but clear expectations for improvement.

53 comments

Posted in r/MachineLearning by u/juliensalinas • 4/16/2025

138

[D] Google just released a new generation of TPUs. Who actually uses TPUs in production?

Discussion

Google's new generation of Tensor Processing Units (TPUs) has sparked a debate about their practicality in production settings. While some have struggled with the limited documentation and support, others have found value in using TPUs for tasks with large batch sizes. Significant users include Google's own teams, such as DeepMind, Google Search, and YouTube, along with various research-focused startups. Major tech companies like Apple also use Google services running on TPUs for their machine learning models. Concerns were raised about dependency on Google Cloud Platform, since TPUs cannot be purchased outright, which could lead to vendor lock-in. However, commenters hope that the TPU ecosystem has matured with this new generation. The sentiment is mixed, with notable demand but also notable challenges.

52 comments

Posted in r/SQL by u/rahulsingh_ca • 4/14/2025

Query big ass CSVs with SQL

Discussion

The free SQL editor soarSQL, designed to query CSV files of any size quickly and efficiently, has received positive feedback. Users praised its speed, simplicity, and utility in professional settings. Comparisons with other tools like Notepad++ and Python showed an appreciation for soarSQL's streamlined setup and execution. Some users sought clarity on its functionality, such as its handling of large files and its support for formats beyond CSV. Overall, the tool's ability to simplify data analysis and SQL practice was well received.

29 comments
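
For readers wondering what "querying a CSV with SQL" looks like in the Python workflow some commenters compared it to, here is a minimal sketch using only the standard library. It says nothing about how soarSQL itself works internally; the `sales` table and the inline sample data are made up, and a real file would be opened from disk instead of an in-memory string.

```python
import csv
import io
import sqlite3

# Tiny inline CSV standing in for a large file on disk (made-up sample data).
csv_text = """region,amount
north,10
south,25
north,5
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")

# executemany consumes the reader row by row; each row is a dict keyed by header.
reader = csv.DictReader(io.StringIO(csv_text))
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (:region, :amount)",
    reader,
)

totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

The setup overhead visible here (declaring a table, loading rows, then querying) is the step that commenters liked having a dedicated tool handle for them.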
