Discover Best Tech Engineering Blogs

Project Magnet, providing push-based shuffle, now available in Apache Spark 3.2

Co-authors: Venkata Krishnan Sowrirajan and Min Shen

We are excited to announce that push-based shuffle (codenamed Pr...

spark

open source

October 20, 2021

Distributed tier merge: How LinkedIn tackles stragglers in search index build

Co-authors: Andy Li and Hongbin Wu

Indexing plays the key role in modern search engines for fast and accurate informa...

spark

data

distributed systems

September 27, 2021

Using the LinkedIn Fairness Toolkit in large-scale AI...

Co-authors: Preetam Nandy, Yunsong Meng, Cyrus DiCiccio, Heloise Logan, Amir Sepehri, Divya Venugopalan, Kinjal Basu,...

February 8, 2021

Magnet: A scalable and performant shuffle architecture for ...

Co-authors: Min Shen, Chandni Singh, Ye Zhou, and Sunitha Beeram

At LinkedIn, we rely heavily on offline data analyti...

October 21, 2020

Addressing bias in large-scale AI applications: The...

Co-authors: Sriram Vasudevan, Cyrus DiCiccio, and Kinjal Basu

At LinkedIn, our imperative is to create economic oppor...

August 25, 2020

Spark-TFRecord: Toward full support of TFRecord in Spark

Co-authors: Jun Shi, Mingzhou Zhou

In the machine learning community, Apache Spark is widely used for da...

May 4, 2020

Advanced schema management for Spark applications at scale

Co-authors: Walaa Eldin Moustafa, Wenye Zhang, Adwait Tumbde, Ratandeep Ratti

Over the years, the popula...

March 25, 2020

On Spark, Hive, and Small Files: An In-Depth Look at Spark Partitioning Strategies

One of the most common ways to store results from a Spark job is by writing the results to a Hive table stored on HDF...

March 3, 2020

Open-sourcing Polynote: an IDE-inspired polyglot notebook

Jeremy Smith, Jonathan Indig, Faisal Siddiqi

October 23, 2019

Scaling a Mature Data Pipeline — Managing Overhead

There is often a hidden performance cost tied to the complexity of data pipelines — Overhead. In this post we will ex...

September 24, 2019

Avro2TF: An open source feature transformation engine for TensorFlow

Co-authors: Xuhong Zhang, Chenya Zhang, and Yiming Ma

Today, we are announcing a new open source project called Avro2...

April 4, 2019

Scaling Spark Streaming for Logging Event Ingestion

How we scaled Spark Streaming with a novel balanced Kafka reader for ingesting massive amounts of logging events from ...

November 20, 2018

Spark Summit 2017: Research, Open Source, and Community

Next Tuesday marks the start of the Spark Summit Conference in San Francisco. This year, LinkedIn engineers and data ...

June 2, 2017

A Checkup with Dr. Elephant: One Year Later

This post has been updated to note the release of Pepperdata's Application Profiler, a commercial project based on Dr...

spark

hadoop

open source

March 6, 2017

Open Sourcing Photon ML

Machine learning is a key component of LinkedIn’s relevance-driven products. We use machine learning to train the ran...

June 7, 2016

Open Sourcing Dr. Elephant

We are proud to announce today that we are open sourcing Dr. Elephant, a powerful tool that helps users of Hadoop and...

spark

hadoop

open source

April 8, 2016

Blog posts about Spark