Co-authors: Venkata Krishnan Sowrirajan and Min Shen We are excited to announce that push-based shuffle (codenamed Project Magnet) is now available in Apac...
Oct 20, 2021
Co-authors: Andy Li and Hongbin Wu Indexing plays a key role in modern search engines for fast and accurate information retrieval, and the ability to swi...
Sep 27, 2021
Co-authors: Preetam Nandy, Yunsong Meng, Cyrus DiCiccio, Heloise Logan, Amir Sepehri, Divya Venugopalan, Kinjal Basu, and Noureddine...
Feb 8, 2021
Co-authors: Min Shen, Chandni Singh, Ye Zhou, and Sunitha Beeram At LinkedIn, we rely heavily on offline data analytics for...
Oct 21, 2020
Co-authors: Sriram Vasudevan, Cyrus DiCiccio, and Kinjal Basu At LinkedIn, our imperative is to create economic opportunity for every...
Co-authors: Jun Shi, Mingzhou Zhou Introduction In the machine learning community, Apache Spark is widely used for data processing due to its efficiency in...
May 4, 2020
Co-authors: Walaa Eldin Moustafa, Wenye Zhang, Adwait Tumbde, Ratandeep Ratti Introduction Over the years, the popularity of Apache...
Mar 25, 2020
One of the most common ways to store results from a Spark job is by writing the results to a Hive table stored on HDFS. While in theory…
Mar 3, 2020
Jeremy Smith, Jonathan Indig, Faisal Siddiqi
There is often a hidden performance cost tied to the complexity of data pipelines — Overhead. In this post we will examine the concept of…
Sep 24, 2019
Co-authors: Xuhong Zhang, Chenya Zhang, and Yiming Ma Today, we are announcing a new open source project called Avro2TF. This project provides a scalable S...
Apr 4, 2019
How we scaled Spark Streaming with a novel balanced Kafka reader for ingesting massive amounts of logging events from Kafka in near…
Nov 20, 2018
Next Tuesday marks the start of the Spark Summit Conference in San Francisco. This year, LinkedIn engineers and data scientists are...
This post has been updated to note the release of Pepperdata's Application Profiler, a commercial project based on Dr. Elephant. Last April, we announced t...
Mar 6, 2017
Machine learning is a key component of LinkedIn’s relevance-driven products. We use machine learning to train the ranking algorithms for our feed, advertis...
We are proud to announce today that we are open sourcing Dr. Elephant, a powerful tool that helps users of Hadoop and Spark understand, analyze, and improv...
Apr 8, 2016