data subtldr week 9 year 2024
r/MachineLearning • r/dataengineering • r/sql
Confronting Unrealistic Expectations for Junior Data Engineers, Handling Python's 1.7 TB CSV Challenge, Navigating Tough SQL Questions, and Job Struggles for PhD Graduates in Tech
Week 9, 2024
Posted in r/MachineLearning by u/Holiday_Safe_5620 • 2/26/2024
609
[D] Is the tech industry still not recovered or I am that bad?
Discussion
The Reddit thread titled "[D] Is the tech industry still not recovered or I am that bad?" discusses the challenges a highly qualified PhD graduate faces in landing a Research Scientist job. The top comments are divided: some attribute the struggle to current market conditions, while others suggest the author's skills are simply less marketable. Some users advise the author to lean on personal relationships and networks rather than cold applying, and to make sure their CV is not overly academic. Others argue that demand for pure research roles has declined as companies focus more on application-based roles. Several commenters also encourage the author to persist despite the rejections.
Posted in r/MachineLearning by u/Civil_Collection7267 • 2/28/2024
430
[R] The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Research
The Reddit thread discusses a research paper introducing BitNet b1.58, a 1-bit Large Language Model (LLM) variant that promises high performance and cost-effectiveness. The top comments express mixed sentiments. Some users criticize the hype in the paper's title and the lack of citations to prior work on binarized and ternary neural networks. Others appreciate the approach's potential to reduce the energy and computational cost of current LLMs, with one user hoping it is the right path. A few comments explain the 1.58-bit figure as a consequence of the ternary {-1, 0, 1} weight values used. One user also highlights that BitNet b1.58 is not just a quantization approach but an entirely new model.
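The 1.58-bit figure the commenters mention can be checked with a short calculation: a weight that takes one of three equally likely values carries log2(3) bits of information. The sketch below is only an illustration of that arithmetic; the function name `bits_per_symbol` is hypothetical and does not come from the paper or the thread.

```python
import math


def bits_per_symbol(num_states: int) -> float:
    """Bits of information in one symbol drawn from num_states equally likely states."""
    return math.log2(num_states)


# Ternary weights take one of three values: {-1, 0, 1}.
ternary_bits = bits_per_symbol(3)
print(f"{ternary_bits:.2f}")  # prints 1.58
```

By the same calculation a strictly binary weight carries exactly 1 bit, which is why the ternary variant is described as 1.58 bits rather than 1 bit.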
Posted in r/dataengineering by u/Foot_Straight • 2/27/2024
402
Expectation from junior engineer
Discussion
The Reddit thread discusses the expectations for a junior data engineer, with many users expressing dissatisfaction with unrealistic job requirements. The top comments criticize unclear terminology such as advanced SQL and mid-level data structures, suggesting these terms are too vague or demanding for a junior role. One user suggests the job post was likely made by an uninformed recruiter. Other users propose more realistic skills for a junior data engineer, including a basic understanding of SQL joins and data modeling, familiarity with data lakes vs warehouses, basic ETL, and understanding of when to use NoSQL vs SQL. The overall sentiment is that expectations for junior roles are often inflated, creating unnecessary barriers to entry.
Posted in r/MachineLearning by u/we_are_mammals • 2/26/2024
289
The industry is not going "recover" for newly minted research scientists [D]
Discussion
The Reddit thread discusses the predicted stagnation of job recovery for new research scientists in the tech industry. A significant concern expressed in the comments is the increasing number of Ph.D. graduates competing for limited, though increasing, research positions. This has resulted in heightened competition and higher standards for job interviews. Some users relate this situation to 'elite overproduction,' leading to societal instability and disappointment. Others suggest the growth of applicants for these roles may be due to a shift from other disciplines rather than an increase in the number of individuals. The overall sentiment leans towards a grim but realistic outlook on the future of job prospects for new research scientists.
Posted in r/dataengineering by u/pmme_ur_titsandclits • 2/29/2024
147
I bombed the interview and feel like the dumbest person in the world
Help
In this Reddit thread, a user describes performing poorly in a data engineer trainee interview, which left them feeling incompetent, and asks for advice on improving their interview skills. The overall sentiment in the comments was supportive and empathetic, reflecting a shared experience of bombing interviews. Key advice included practicing coding problems and attending interviews frequently to gain comfort and experience. Some commenters shared their own stories of freezing or forgetting syntax during interviews, emphasizing that failure is part of the process. The discussion also offered technical explanations of the interview questions, encouraging the user to learn from them and prepare better for future opportunities.
Posted in r/dataengineering by u/The-Salamander-Fan • 2/26/2024
109
Read/Filter a 1.7 TB CSV File in Python
Help
In a Reddit thread on filtering a 1.7 TB CSV file in Python, users offered several suggestions. The most popular was to turn the problem into a public challenge to engage other users. Concrete advice included using tools such as DuckDB, Dask, PySpark, PyArrow, and Polars, which commenters credited with block sizing and robust documentation. Users also suggested using SQL, setting the blocksize to a larger value such as 500 MB, and splitting the file for easier handling. Others recommended O(1) operations, such as checking for set membership, in place of more resource-intensive O(n) operations. The thread had a positive tone, with users showing interest and excitement about the challenge.
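Two of the suggestions above, reading the file in a streaming fashion and using set membership for the filter, can be sketched with the Python standard library alone. This is a minimal illustration under assumptions, not the approach any commenter posted: the function name `filter_rows`, the column names, and the sample data are all hypothetical, and the tools named in the thread (DuckDB, Dask, Polars, and so on) have their own APIs that are not shown here.

```python
import csv
import io
from typing import Iterable, Iterator, TextIO


def filter_rows(source: TextIO, column: str, wanted: Iterable[str]) -> Iterator[dict]:
    """Yield rows whose `column` value is in `wanted`, reading one row at a time."""
    # A set gives O(1) average-case membership checks; a list would be O(n) per row.
    wanted_set = set(wanted)
    # DictReader streams the input, so memory use does not grow with file size.
    for row in csv.DictReader(source):
        if row[column] in wanted_set:
            yield row


# Hypothetical in-memory sample standing in for the real file.
sample = io.StringIO("id,city\n1,Oslo\n2,Lima\n3,Oslo\n")
matches = list(filter_rows(sample, "city", ["Oslo"]))
print([row["id"] for row in matches])  # prints ['1', '3']
```

For a file on disk, the same function could be given a handle from `open(path, newline="")`, with the matching rows written out as they are produced instead of being collected in a list.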