data subTLDR week 12 year 2025

r/MachineLearning, r/dataengineering, r/SQL

Unlocking the Secret to DE Jobs: Likability & Interpersonal Skills, Mastering SQL, Python, Spark for Data Engineering, Corporate Inefficiency: Overabundance of Managers, AI's Irreversible Memory, and a 47% Leap in Code Completion with Qwen 2.5 Coder.

March 23, 2025 • Week 12, 2025

Posted in r/dataengineering by u/pawtherhood89 • 3/18/2025

571

Why you aren't getting a DE job

Career

The main insight from the discussion is the importance of likability and interpersonal skills in securing a Data Engineering (DE) job. Most candidates who pass HR screening are deemed qualified, making personality fit a key hiring factor. Examples shared indicate that building relationships, even outside formal work settings, can safeguard against layoffs. However, concerns were raised about being stuck due to outdated tech stacks, which was countered by a hiring manager emphasizing problem-solving and transferable skills over matching current tech stacks. The difficulty of landing the first DE job and the contrasting experiences in small and large organizations were also discussed. Overall, the sentiment was mixed but largely constructive.

100 comments

Posted in r/dataengineering by u/ChoicePound5745 • 3/17/2025

524

Which one to choose?

Career

The majority of Reddit users recommend mastering SQL and Python, and learning Spark/PySpark for a solid foundation in modern data engineering. Docker is also suggested because it is cloud-agnostic. However, some users express frustration at the multitude of tools and trends in the field, likening it to a popularity contest. They advise choosing tools based on the specific goal, budget, and use case rather than on what is currently in vogue. Container tooling such as Docker and Kubernetes is also highlighted for managing multiple services. Despite some tongue-in-cheek suggestions to use simple tools like Excel, the overall sentiment leans toward mastering versatile, industry-standard technologies.

139 comments

Posted in r/dataengineering by u/HMZ_PBI • 3/21/2025

392

Corps are crazy!

Discussion

There's strong criticism of perceived corporate inefficiency, particularly an overabundance of managers whose roles seem superfluous. Many feel that managers slow work down through excessive meetings while contributing little practical work. However, commenters note that when redundancies happen, roles focused on practical work, like engineering, are typically preserved over managerial roles. Some highlight the value of a good manager who can shield engineers from corporate distractions. The sentiment is largely negative, indicative of widespread frustration with perceived corporate bureaucracy and inefficiency. There's a call for more practical roles, like Data Engineers, and fewer managerial positions.

63 comments

Posted in r/MachineLearning by u/No_Release_3665 • 3/22/2025

194

[Research]Can AI remember irreversibly, like a brain does? I built a model that tries — and it works surprisingly well.

Research

The discussion revolves around whether an AI can remember irreversibly, akin to the human brain. Many participants are impressed with the model that was built, indicating broad support for the idea. Some expressed concerns about potential misuse and ethical implications, making the sentiment slightly mixed. Other prevalent views underscore the importance of continuous learning and refinement in AI technology. However, there's a consensus that the technology is promising and could revolutionize fields like neurology and AI research. Overall, the sentiment is positive: commenters are excited about the technology's potential but cautious about its ethical use.

63 comments

Posted in r/MachineLearning by u/CountlessFlies • 3/17/2025

173

[P] I fine-tuned Qwen 2.5 Coder on a single repo and got a 47% improvement in code completion accuracy

Project

Fine-tuning Qwen 2.5 Coder on a single repo resulted in a 47% improvement in code completion accuracy. This strategy mirrors that of ninetyfive.gg. The process used to determine prefix/middle/suffix splits for training is basic but has proven effective, and it leaves room for improvement. While there's a potential for overfitting, this could be mitigated by training a different LoRA for each codebase. However, an overfit fine-tuned model could pose issues as codebases evolve or when implementing novel functionality. The training log is not publicly accessible because public sharing is a premium WandB feature. Overall, the sentiment is positive.
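To make the prefix/middle/suffix idea concrete, here is a minimal sketch of building one fill-in-the-middle training example from a source string. The function names, the random cut-point strategy, and the `<PRE>`/`<SUF>`/`<MID>` sentinel strings are assumptions for illustration only; this summary does not describe the post's actual splitting procedure or the special tokens the model expects.

```python
import random
from typing import NamedTuple


class FimExample(NamedTuple):
    prefix: str
    middle: str
    suffix: str


def split_for_fim(source: str, rng: random.Random) -> FimExample:
    """Cut `source` at two random character offsets (hypothetical strategy)."""
    if not source:
        raise ValueError("source must be non-empty")
    # Two distinct, sorted cut points; the text between them is the part
    # the model is asked to fill in, so the middle is never empty.
    start, end = sorted(rng.sample(range(len(source) + 1), 2))
    return FimExample(source[:start], source[start:end], source[end:])


def render_psm(ex: FimExample, pre: str = "<PRE>", suf: str = "<SUF>",
               mid: str = "<MID>") -> str:
    """Render in prefix-suffix-middle order with placeholder sentinels."""
    return f"{pre}{ex.prefix}{suf}{ex.suffix}{mid}{ex.middle}"
```

A real pipeline would likely cut at token or syntax boundaries rather than arbitrary characters, which may be the kind of improvement the discussion has in mind.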

36 comments

Posted in r/dataengineering by u/DuckDatum • 3/23/2025

159

Where is the Data Engineering industry headed?

Discussion

The future of Data Engineering is seen to be increasingly intertwined with Software Engineering, with a shift towards declarative processes and further incorporation of dev, staging, and prod branches. However, the industry's direction is debated. Some believe we are returning to traditional SQL databases and single machine processing due to advancements in CPU technology, while others foresee continued specialization and the emergence of dominant services. The offshoring of data engineering work is another key trend, although its effectiveness is contested. Concerns about current practices include unpredictable costs of cloud data warehouse solutions, which are causing some to revert to technologies with more manageable cost structures.

66 comments

Posted in r/SQL by u/Captain_Strudels • 3/19/2025

134

I've worked with SQL for years and have no clue what GO does

SQL Server

In a discussion about what 'GO' does in SQL Server scripts, the consensus is that 'GO' acts as a virtual 'end of file': it signals the end of a batch of Transact-SQL statements and marks the boundary between sections of a script, which helps with both readability and execution. If an error occurs in a later batch, earlier batches have still completed successfully. It is also useful in testing, since it allows a batch to be run multiple times. Batches also affect variable scope: variables declared in one batch are not in scope in other batches, although global variables remain available.
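The batch-boundary behavior can be illustrated with a small sketch. 'GO' is not itself a Transact-SQL statement; client tools such as sqlcmd and SSMS recognize it and send each batch to the server separately. The hypothetical `split_batches` helper below mimics that splitting in a simplified way, assuming 'GO' sits alone on its own line (optionally followed by a repeat count) and ignoring edge cases such as 'GO' inside comments or multi-line string literals.

```python
import re
from typing import List, Tuple

# A line containing only GO, optionally followed by a repeat count (e.g. "GO 3").
_GO_LINE = re.compile(r"^\s*GO(?:\s+(\d+))?\s*$", re.IGNORECASE)


def split_batches(script: str) -> List[Tuple[str, int]]:
    """Split a script into (batch_text, repeat_count) pairs (simplified sketch)."""
    batches: List[Tuple[str, int]] = []
    current: List[str] = []
    for line in script.splitlines():
        match = _GO_LINE.match(line)
        if match:
            text = "\n".join(current).strip()
            if text:
                count = int(match.group(1)) if match.group(1) else 1
                batches.append((text, count))
            current = []  # the next batch starts fresh after each GO
        else:
            current.append(line)
    tail = "\n".join(current).strip()
    if tail:
        batches.append((tail, 1))  # trailing statements with no final GO
    return batches
```

Because each batch is submitted on its own, a variable declared in the first batch would not be visible to the statements in the second.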

38 comments

Posted in r/MachineLearning by u/faintlystranger • 3/23/2025

[D] "Topological" Deep Learning - Promising or Hype?

Discussion

The emerging field of Topological Deep Learning (TDL) is sparking debate. While some users see TDL's potential for incorporating higher-order structural relationships in representations or architectures, others question its practicality due to the computational expense of modeling higher-order interactions. A few highlight its potential relevance in niche fields like biochemistry and material sciences. An author of a TDL position paper admits that current topological neural networks have limitations but insists on ongoing research to overcome these. Overall, the sentiment is mixed, with users recognizing TDL's theoretical appeal but expressing reservations about its current applicability and effectiveness.

22 comments

Posted in r/MachineLearning by u/skeltzyboiii • 3/18/2025

[R] Jagged Flash Attention Optimization

Research

Jagged Flash Attention Optimization could have significant practical implications: the reported experiments show a 10% improvement in queries per second (QPS) and an 18% reduction in memory usage. However, commenters noted that a reported speedup of up to 9x does not necessarily translate into 9x faster end-to-end inference in every application. The efficiency of local versus cloud-based inference for larger models was also discussed, highlighting how even small latency improvements can be substantial for real-time applications. Many are eagerly awaiting the implementation, and there's curiosity about the specific model this optimization will be deployed in.
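The gap between a component speedup and an end-to-end speedup can be worked through with Amdahl's law. The sketch below assumes the speedup applies only to one portion of total inference time; the function name and the 30% fraction in the usage note are hypothetical and are not figures from the paper or the thread.

```python
def overall_speedup(accelerated_fraction: float, component_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only part of the work gets faster.

    accelerated_fraction: share of total runtime spent in the accelerated part (0..1).
    component_speedup: how many times faster that part becomes (> 0).
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if component_speedup <= 0:
        raise ValueError("component_speedup must be positive")
    remaining = 1.0 - accelerated_fraction            # unchanged part of the runtime
    accelerated = accelerated_fraction / component_speedup
    return 1.0 / (remaining + accelerated)
```

For instance, if the accelerated part were 30% of inference time, `overall_speedup(0.3, 9.0)` gives roughly 1.36x overall, far short of 9x.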

15 comments

Posted in r/SQL by u/Mafioso14c • 3/18/2025

Interview struggle

Discussion

In a discussion about data integrity and validation during the interview process, users highlighted the importance of using appropriate data types, avoiding nulls where possible, setting indexes and unique constraints, and establishing foreign keys. Utilizing database tools to ensure data quality was emphasized. Dynamic SQL query construction using user-selected filter values was discussed, with parameterized queries being recommended to handle user input. Data profiling was suggested as a vital part of data validation, with checks for valid dates, numeric values, outliers, and unique primary keys. The role of stakeholder expectations and domain expertise in making technical decisions was underscored, with an emphasis on understanding data requirements and definitions.
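The recommendation to use parameterized queries for user-selected filters can be sketched as follows. This is a minimal illustration using Python's built-in `sqlite3` module; the `orders` table, its columns, and the `ALLOWED_COLUMNS` allowlist are hypothetical. Column names are checked against the allowlist because identifiers cannot be bound as parameters, while the filter values are always passed as bound parameters rather than concatenated into the SQL text.

```python
import sqlite3
from typing import Any, Dict, List, Tuple

# Hypothetical allowlist of columns a user may filter on.
ALLOWED_COLUMNS = {"status", "region"}


def build_query(filters: Dict[str, Any]) -> Tuple[str, List[Any]]:
    """Build a SELECT with one `col = ?` clause per user-selected filter."""
    clauses: List[str] = []
    params: List[Any] = []
    for column, value in filters.items():
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unsupported filter column: {column}")
        clauses.append(f"{column} = ?")  # placeholder only, never the raw value
        params.append(value)
    sql = "SELECT id, status, region FROM orders"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY id", params


def demo() -> List[Tuple[int, str, str]]:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, status TEXT NOT NULL, region TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO orders (status, region) VALUES (?, ?)",
        [("open", "EU"), ("closed", "EU"), ("open", "US")],
    )
    sql, params = build_query({"status": "open", "region": "EU"})
    rows = conn.execute(sql, params).fetchall()
    conn.close()
    return rows
```

The `NOT NULL` and `PRIMARY KEY` constraints in the demo table echo the thread's other advice about letting the database enforce data quality.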

6 comments

Posted in r/SQL by u/brandi_Iove • 3/23/2025

A cool feature i just came across

SQL Server

The discovery of a live index feature in SQL Server and MSSMS, showing the count of rows being processed during execution, sparked a discussion on efficiency in database operations. Several contributors emphasized that set-based operations are more efficient than row-by-row updates, which are necessary for the live index feature to function. Questions were raised about the feasibility of executing row-by-row updates on millions of rows. Suggestions included using partition swapping for greater efficiency and adjusting practices to batch set-wise operations. The sentiment was mixed, with appreciation for the feature tempered by considerations of operational efficiency.
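The contrast between row-by-row and set-based updates can be shown with a small sketch. It uses Python's built-in `sqlite3` module rather than SQL Server, and the `items` table is hypothetical; the point is only that one set-based statement produces the same result as a loop issuing one statement per row.

```python
import sqlite3


def make_db(n: int) -> sqlite3.Connection:
    """Create an in-memory table with n rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, price REAL NOT NULL)")
    conn.executemany(
        "INSERT INTO items (id, price) VALUES (?, ?)",
        [(i, float(i)) for i in range(1, n + 1)],
    )
    return conn


def update_row_by_row(conn: sqlite3.Connection) -> int:
    """Issue one UPDATE per row; returns the number of statements executed."""
    ids = [row[0] for row in conn.execute("SELECT id FROM items").fetchall()]
    for item_id in ids:
        conn.execute("UPDATE items SET price = price * 2 WHERE id = ?", (item_id,))
    return len(ids)


def update_set_based(conn: sqlite3.Connection) -> int:
    """Issue a single UPDATE for all rows; returns the number of rows changed."""
    cursor = conn.execute("UPDATE items SET price = price * 2")
    return cursor.rowcount
```

On millions of rows the looped version issues millions of statements where the set-based version issues one, which is the concern raised in the thread.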

16 comments

Posted in r/SQL by u/_mr_villain_ • 3/18/2025

What is wrong here.

MySQL

The discussion centered on a MySQL query error caused by using a function name as a column alias. The poster believed that adding 'DESC' had fixed the query, but the problem was actually resolved by changing the alias. Commenters noted that in standard SQL 'ORDER BY' defaults to 'ASC', that 'PARTITION BY' is optional, and that the MySQL version can affect whether certain errors are thrown. Several users shared solutions and workarounds, including different queries for MySQL 8.0+ and for 5.7 or older. The sentiment was constructive and solution-focused.

37 comments
