Rabbit Slide Show

Apache Arrow - A cross-language development platform for in-memory data

2019-04-23

Description

Apache Arrow is the future for data processing systems. This talk describes how to reduce data sharing overhead in data processing systems such as Spark and PySpark, and how to accelerate computation on large data with Apache Arrow.

Text

Page: 1

Apache Arrow
A cross-language development platform
for in-memory data
Kouhei Sutou
ClearCode Inc.
SciPy Japan Conference 2019
2019-04-23
Apache Arrow - A cross-language development platform for in-memory data
Powered by Rabbit 3.0.0

Page: 2

Me
Ruby committer since 2004

Page: 3

Why am I talking at SciPy?
To introduce Apache Arrow

Page: 4

Apache Arrow
A cross-language development platform for in-memory data
[cited from https://arrow.apache.org/]

Page: 5

Cross-language
✓ C, C++, C#, Go, Java,
✓ JavaScript, MATLAB, Python,
✓ R, Ruby and Rust

Page: 6

Development platform
Apache Arrow ...
✓ specifies standards and
✓ provides implementations
so that many people can cooperate

Page: 7

For in-memory data
For now, Apache Arrow focuses on:
✓ sharing columnar/tensor data
✓ analyzing columnar data
✓ RPC for columnar data

Page: 8

Apache Arrow and Python
✓ As a pickle replacement
  ✓ PySpark already does this
✓ As a dataframe library
  ✓ pandas and Vaex use Apache Arrow a little

Page: 9

Apache Arrow and me
✓ A release manager
  ✓ For 0.11.0 and 0.13.0 (the latest release)
✓ An active developer

Page: 10

Feature (1)
Efficient serialization

Page: 11

Why is it efficient?
✓ It doesn't parse data
✓ It uses data directly

Page: 12

Data format: Number
Contiguous data (same layout as a C array)
32-bit integers: [1, 2, 3]
0x01 0x00 0x00 0x00 0x02 0x00 0x00 0x00 0x03 ...
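A minimal sketch of this layout, using only the Python standard library rather than Arrow itself (the helper name `value_at` is hypothetical): three 32-bit little-endian integers are packed into one contiguous buffer, and any element can then be read at a fixed byte offset without a parsing step.

```python
import struct

# Pack [1, 2, 3] as three little-endian 32-bit integers in one contiguous buffer.
buf = struct.pack("<3i", 1, 2, 3)

def value_at(buffer, index):
    """Read the index-th 32-bit integer directly from its fixed byte offset."""
    return struct.unpack_from("<i", buffer, 4 * index)[0]

print(buf.hex(" "))      # byte layout, 4 bytes per value
print(value_at(buf, 2))  # third element, read from byte offset 8
```

Because every value has the same width, locating the i-th element is simple arithmetic (4 * i) rather than a scan over text.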

Page: 13

Comparison with JSON
"[1, 2, 3]"
↓
"1" → 1 (String → Number)
"2" → 2 (String → Number)
"3" → 3 (String → Number)

Page: 14

Benefits of using data directly
✓ Zero copy cost
  ✓ Copying is costly for large data
✓ (Nearly) zero parse cost
  ✓ Only the metadata needs to be parsed
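The difference between sharing and copying can be sketched with Python's built-in `memoryview` (this is an illustration of the zero-copy idea in general, not of Arrow's own API): a view refers to the original buffer, while a slice converted to `bytes` duplicates it.

```python
# A mutable buffer standing in for a large block of data.
data = bytearray(range(10))

view = memoryview(data)[2:6]  # zero copy: refers to the same memory as data
copy = bytes(data[2:6])       # copy: allocates and fills a new object

# A change to the underlying buffer is visible through the view,
# but not through the copy, which shows that only the view shares memory.
data[2] = 99
print(view[0], copy[0])
```

For a buffer of a few bytes the copy is negligible; for gigabytes of data, the copy costs both time and memory, which is the cost a zero-copy design avoids.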

Page: 15

Performance
Fast Python Serialization with Ray and Apache Arrow
Apache License 2.0: (c) 2016-2019 The Apache Software Foundation

Page: 16

Zero copy and large data
✓ pandas can't process large data
  ✓ Because it needs to allocate memory for the data
✓ Apache Arrow supports memory mapping
  ✓ Data in a file can be used directly without copying
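As a rough illustration of memory mapping, here is a sketch with the standard `mmap` module (not Arrow's memory-mapping API; the file name `values.bin` is made up for the example). The file is mapped into the address space and one value is read at its byte offset, without loading the whole file into a Python object first.

```python
import mmap
import os
import struct
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "values.bin")  # hypothetical file name

    # Write 1,000 little-endian 32-bit integers (0..999) to the file.
    with open(path, "wb") as f:
        f.write(struct.pack("<1000i", *range(1000)))

    # Map the file read-only and read element 500 directly from the mapping.
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        value = struct.unpack_from("<i", mapped, 500 * 4)[0]

print(value)
```

The operating system pages in only the parts of the file that are actually touched, so the same pattern works for files larger than available memory.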

Page: 17

Efficient string representation
✓ pandas: array of strings
  ✓ Uses discontiguous memory
✓ Apache Arrow: data bytes and an array of lengths
  ✓ Uses contiguous memory: fast

Page: 18

Data format: String
Data bytes + length array
UTF-8 strings: ["Hello", "", "!"]
Data bytes: "Hello!"
Length array: [0, 5, 5, 6]
i-th length: lengths[i+1] - lengths[i]
i-th data: data[lengths[i]:lengths[i+1]]
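The two formulas above can be sketched in plain Python (an illustration of the layout only, not Arrow's implementation; the helper names are hypothetical). The `lengths` list follows the slide's naming; the Arrow format specification calls this buffer the offsets.

```python
strings = ["Hello", "", "!"]

# Concatenate all UTF-8 bytes into one contiguous buffer.
data = b"".join(s.encode("utf-8") for s in strings)

# Cumulative byte positions: one more entry than there are strings.
lengths = [0]
for s in strings:
    lengths.append(lengths[-1] + len(s.encode("utf-8")))

def length_of(i):
    """Byte length of the i-th string."""
    return lengths[i + 1] - lengths[i]

def string_at(i):
    """Decode the i-th string from its slice of the data bytes."""
    return data[lengths[i]:lengths[i + 1]].decode("utf-8")

print(data, lengths)
print([string_at(i) for i in range(len(strings))])
```

The empty string costs no data bytes at all: it is represented by two equal consecutive entries in `lengths`.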

Page: 19

Feature(?) (2)
Specified data format

Page: 20

Why does Arrow specify the format?
For efficient data exchange

Page: 21

Efficient data exchange
✓ Use a common format widely
  ✓ No format conversion means lower resource usage
✓ Use a format with low serialization/deserialization cost
  ✓ Fast

Page: 22

Who uses the Arrow format?
✓ RAPIDS: For NVIDIA GPU
✓ Fletcher, InAccel: For FPGA
✓ Spark: For interprocess data exchange

Page: 23

CPU and GPU
✓ Can't share data in memory
  ✓ Need to copy between CPU and GPU
✓ Efficient data exchange improves performance

Page: 24

CPU and FPGA
✓ Can't share data in memory
  ✓ Need to copy between CPU and FPGA
✓ Efficient data exchange improves performance

Page: 25

Spark
✓ Processes large data
  ✓ Needs to pass data to worker processes
✓ Efficient data exchange improves performance

Page: 26

PySpark
✓ Workers are written in Python
  ✓ Uses pickle to exchange data
✓ Spark supports Arrow for data exchange
  ✓ Disabled by default

Page: 27

PySpark with Arrow
In [2]: %time pdf = df.toPandas()
CPU times: user 17.4 s, sys: 792 ms, total: 18.1 s
Wall time: 20.7 s
In [3]: spark.conf.set("spark.sql.execution.arrow.enabled", "true")
In [4]: %time pdf = df.toPandas()
CPU times: user 40 ms, sys: 32 ms, total: 72 ms
Wall time: 737 ms
Speeding up PySpark with Apache Arrow

Page: 28

Feature (3)
Optimized data processing modules

Page: 29

Optimized data processing
✓ Apache Arrow targets large data
  ✓ Performance is important
✓ How can we get high performance?

Page: 30

High performance (1)
Data locality

Page: 31

Data locality
✓ Minimize cache misses
  ✓ Storage is very slow
  ✓ Memory is slow
  ✓ CPU cache is fast

Page: 32

High performance (2)
SIMD
Single Instruction, Multiple Data
A way to process multiple data items at once

Page: 33

SIMD
✓ Data must be contiguous and aligned
  ✓ The Arrow format is SIMD-ready
✓ No conditional branches
  ✓ Use a separate bitmap for null instead of a "missing" sentinel value
Is it time to stop using sentinel values for null / NA values?

Page: 34

No conditional branches
[1, null, 3] + [null, 2, 5]
Left operand:  null bitmap 1 0 1, data 1 X 3
Right operand: null bitmap 0 1 1, data X 2 5
Null bitmaps are combined with bitwise &: 0 0 1
Data is added by SIMD, including null elements: X X 8
Result: [null, null, 8]
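The same computation can be sketched in plain Python (an illustration of the idea, not Arrow's kernels; the function names are hypothetical). Validity bitmaps are modeled as integers with element i stored in bit i, and null slots hold a placeholder value of 0, which is an assumption of this sketch. The addition runs over every slot with no per-element null check, and nullness is resolved by a single bitwise AND.

```python
def add_with_nulls(a_valid, a_data, b_valid, b_data):
    """Add two arrays whose nulls are tracked in separate validity bitmaps."""
    valid = a_valid & b_valid                       # one AND for all elements
    data = [x + y for x, y in zip(a_data, b_data)]  # no null check per element
    return valid, data

def to_list(valid, data):
    """Materialize the result for display (only this step inspects the bits)."""
    return [v if (valid >> i) & 1 else None for i, v in enumerate(data)]

# [1, null, 3]: elements 0 and 2 are valid -> bits 0b101
# [null, 2, 5]: elements 1 and 2 are valid -> bits 0b110
valid, data = add_with_nulls(0b101, [1, 0, 3], 0b110, [0, 2, 5])
print(to_list(valid, data))
```

The sums computed in null slots are simply ignored, which is what lets the inner loop stay branch-free and suitable for SIMD.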

Page: 35

FYI: null
✓ In Arrow, all data types support null
✓ In NumPy, only some types support null
Missing value constraints - Python and Apache Arrow

Page: 36

High performance (3)
Threads

Page: 37

Threads
✓ Use multiple cores in a single process
✓ Minimize resource conflicts
  ✓ Locking to avoid conflicts reduces performance
  ✓ Approaches
    ✓ Read only or copy (shared nothing)

Page: 38

Apache Arrow and threads
✓ Data is read only
  ✓ Threads can share data without lock overhead
✓ Avoid both locking and copying
  ✓ Both reduce performance
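The sharing pattern can be sketched with the standard library (not Arrow's threading code; the names and chunk size are made up for the example). Worker threads read disjoint ranges of one read-only buffer, so no locks are needed and no data is copied.

```python
from array import array
from concurrent.futures import ThreadPoolExecutor

# One shared buffer of 32-bit integers, exposed through a read-only view.
values = array("i", range(100_000))
shared = memoryview(values).toreadonly()

def partial_sum(bounds):
    """Sum one range of the shared buffer; slicing a memoryview does not copy."""
    start, stop = bounds
    return sum(shared[start:stop])

chunk = 25_000  # hypothetical chunk size
ranges = [(i, min(i + chunk, len(shared))) for i in range(0, len(shared), chunk)]

with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, ranges))

print(total)
```

In CPython the global interpreter lock limits the speedup of a pure-Python loop like this one, so the sketch shows the lock-free sharing pattern rather than a real performance gain.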

Page: 39

High performance (4)
Compute kernels

Page: 40

Compute kernels
✓ SIMD-ready primitive operations
  ✓ Projection, filter, aggregation, ...
  ✓ compare, take, mean, ...
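To make the three named primitives concrete, here is a plain-Python sketch of what compare, take and mean compute (these are not Arrow's kernel signatures and they are not SIMD-accelerated; the function names and sample data are assumptions for illustration).

```python
from array import array

def compare_greater(values, threshold):
    """compare: produce one boolean per element."""
    return [v > threshold for v in values]

def take(values, indices):
    """take: select the elements at the given positions."""
    return array(values.typecode, (values[i] for i in indices))

def mean(values):
    """mean: aggregate a column into a single number."""
    return sum(values) / len(values)

prices = array("d", [10.0, 20.0, 30.0, 40.0])
mask = compare_greater(prices, 15.0)
selected = take(prices, [i for i, keep in enumerate(mask) if keep])
print(mean(selected))
```

Each primitive works on a whole column at a time, which is what makes it possible to implement them as tight loops over contiguous memory.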

Page: 41

High performance (5)
Subgraph compiler

Page: 42

Subgraph compiler: Gandiva
✓ Compiles operator graphs at run-time
  ✓ Operator graph: multiple operations combined
  ✓ table.a + table.b < table.c && ...
✓ Usable as a query engine backend

Page: 43

High performance (6)
Query engine

Page: 44

Query engine
✓ For a single node
✓ Dataflow-style operator execution
  ✓ scan → project → filter → aggregate → ...
Apache Arrow Query Engine for C++
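The dataflow style can be sketched with Python generators (a toy model, not the proposed C++ query engine; it streams rows one at a time, whereas a columnar engine would stream batches of columns, and all names and data here are made up). Each operator consumes the output of the previous one.

```python
def scan(rows):
    """scan: produce rows from a source."""
    for row in rows:
        yield row

def project(rows, columns):
    """project: keep only the requested columns."""
    for row in rows:
        yield {c: row[c] for c in columns}

def filter_rows(rows, predicate):
    """filter: pass through rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row

def aggregate_sum(rows, column):
    """aggregate: reduce the stream to a single value."""
    return sum(row[column] for row in rows)

table = [
    {"id": 1, "price": 10, "note": "a"},
    {"id": 2, "price": 25, "note": "b"},
    {"id": 3, "price": 40, "note": "c"},
]

pipeline = filter_rows(project(scan(table), ["id", "price"]),
                       lambda row: row["price"] > 20)
print(aggregate_sum(pipeline, "price"))
```

Nothing is computed until the aggregate pulls data through the pipeline, so intermediate results are never materialized in full.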

Page: 45

Query engine from Python
✓ With pandas
  ✓ Large data → execute → to_pandas()
✓ With Dask
  ✓ Dask may be able to use this as a backend

Page: 46

High performance (7)
Datasets

Page: 47

Datasets
✓ Scan data from storage/databases
  ✓ File systems: local, HDFS, ...
  ✓ Formats: CSV, Parquet, ...
  ✓ Databases: MySQL, PostgreSQL, ...
Apache Arrow C++ Datasets

Page: 48

Fast datasets
✓ Predicate pushdown
  ✓ Scan only the needed data
✓ Parallel scan
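Predicate pushdown can be sketched as follows (a toy model, not the Arrow Datasets API; the chunk structure with `min`/`max` statistics is an assumption, loosely modeled on how columnar file formats keep per-chunk statistics). The condition is checked against each chunk's statistics first, so chunks that cannot match are skipped without reading their values.

```python
# Each chunk carries min/max statistics alongside its values (hypothetical layout).
chunks = [
    {"min": 0, "max": 9, "values": list(range(0, 10))},
    {"min": 10, "max": 19, "values": list(range(10, 20))},
    {"min": 20, "max": 29, "values": list(range(20, 30))},
]

def scan_greater_equal(chunks, threshold):
    """Return matching values and the number of chunks actually read."""
    result = []
    chunks_read = 0
    for chunk in chunks:
        if chunk["max"] < threshold:
            continue  # statistics prove no value can match: skip the chunk
        chunks_read += 1
        result.extend(v for v in chunk["values"] if v >= threshold)
    return result, chunks_read

rows, chunks_read = scan_greater_equal(chunks, 25)
print(rows, chunks_read)
```

Because each chunk is checked independently, the chunks that survive could also be scanned in parallel.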

Page: 49

Feature (4)
RPC

Page: 50

RPC: Arrow Flight
✓ Fast RPC framework for Arrow
✓ Based on gRPC with some low-level extensions
Apache Arrow 0.11.0 Release

Page: 51

Wrap up
✓ Arrow is useful for the SciPy community
  ✓ Not only in Python but also in other languages
✓ Join Apache Arrow development!
  ✓ Ask me how to get started

Page: 52

Next steps
✓ Mailing list: dev@arrow.apache.org
✓ Chat in Japanese:
  ✓ https://gitter.im/apache-arrow-ja/community
✓ Apache Arrow Tokyo Meetup 2019, this summer?
  ✓ See also: Apache Arrow Tokyo Meetup 2018
