Page: 1
Apache Arrow
A cross-language development platform
for in-memory data
Kouhei Sutou
ClearCode Inc.
SciPy Japan Conference 2019
2019-04-23
Apache Arrow - A cross-language development platform for in-memory data
Powered by Rabbit 3.0.0
Page: 2
Me
Ruby committer since 2004
Page: 3
Why am I talking at SciPy?
To introduce Apache Arrow
Page: 4
Apache Arrow
A cross-language development platform for in-memory data
[cited from https://arrow.apache.org/]
Page: 5
Cross-language
✓ C, C++, C#, Go, Java,
✓ JavaScript, MATLAB, Python,
✓ R, Ruby and Rust
Page: 6
Development platform
Apache Arrow ...
✓ specifies standards and
✓ provides implementations
so that many people can cooperate
Page: 7
For in-memory data
For now, Apache Arrow focuses on:
✓ sharing columnar/tensor data
✓ analyzing columnar data
✓ RPC for columnar data
Page: 8
Apache Arrow and Python
✓ As a pickle replacement
  ✓ PySpark already does this
✓ As a dataframe library
  ✓ pandas and Vaex use Apache Arrow a bit
Page: 9
Apache Arrow and me
✓ A release manager
  ✓ 0.11.0 and 0.13.0 (the latest release)
✓ An active developer
Page: 10
Feature (1)
Efficient serialization
Page: 11
Why efficient?
✓ It doesn't parse data
✓ It uses data directly
Page: 12
Data format: Number
Contiguous data (same as a C array)
32-bit integers: [1, 2, 3]
0x01 0x00 0x00 0x00 0x02 0x00 0x00 0x00 0x03 ...
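The byte layout above can be reproduced with Python's standard library. This is a minimal sketch, assuming little-endian 32-bit integers as the slide's bytes suggest; `struct` is used only for illustration and is not Arrow's API.

```python
import struct

# Pack [1, 2, 3] as three little-endian signed 32-bit integers.
# This is the same contiguous layout as a C int32_t array on x86.
values = [1, 2, 3]
raw = struct.pack("<3i", *values)

# Show the bytes in the same notation as the slide.
print(" ".join(f"0x{byte:02x}" for byte in raw))

# Reading the values back is a fixed-width reinterpretation, not text parsing.
decoded = list(struct.unpack("<3i", raw))
```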
Page: 13
Compared with JSON
"[1, 2, 3]"
↓
"1" → 1 (String → Number)
"2" → 2 (String → Number)
"3" → 3 (String → Number)
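The difference between parsing JSON text and using binary data directly can be sketched with the standard library. This is only an illustration: `array` and `memoryview` stand in for an Arrow buffer here, and the `"i"` type code is the platform's native C int (32 bits on common platforms), which is an assumption of this sketch.

```python
import array
import json

# JSON: every number arrives as text and must be converted one by one.
parsed = json.loads("[1, 2, 3]")

# Binary layout: the bytes already are the numbers.
buffer = array.array("i", [1, 2, 3]).tobytes()
with memoryview(buffer) as raw, raw.cast("i") as ints:
    # cast() reinterprets the bytes in place; tolist() copies only so that
    # the result is easy to compare and print.
    direct = ints.tolist()
```

Both paths produce the same values, but only the JSON path has to inspect every character of the input.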
Page: 14
Merits of using data directly
✓ Zero copy cost
  ✓ Copying is costly for large data
✓ (Nearly) zero parse cost
  ✓ Only metadata needs to be parsed
Page: 15
Performance
[Figure from "Fast Python Serialization with Ray and Apache Arrow"]
Apache License 2.0: (c) 2016-2019 The Apache Software Foundation
Page: 16
Zero copy and large data
✓ pandas can't process large data
  ✓ Because it needs to allocate memory for the data
✓ Apache Arrow supports memory mapping
  ✓ Can use data in a file directly, without copying
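The memory-mapping idea can be sketched with Python's standard `mmap` module. This is not Arrow's own memory-mapping API; it is a minimal illustration that assumes a file of native C ints, and the sample file and the function name `sum_mapped_ints` are hypothetical.

```python
import array
import mmap
import os
import tempfile


def sum_mapped_ints(path):
    """Sum native C ints stored in a file without reading it into a new buffer."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # cast() reinterprets the mapped bytes in place; no copy is made.
            with memoryview(mapped) as raw, raw.cast("i") as ints:
                return sum(ints)


# Write a small sample file of native ints (hypothetical data).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(array.array("i", range(1000)).tobytes())
    sample_path = tmp.name

total = sum_mapped_ints(sample_path)
os.remove(sample_path)
```

The operating system pages the file in on demand, so the process does not need to allocate a buffer as large as the file.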
Page: 17
Efficient string representation
✓ pandas: Array of strings
  ✓ Uses discontiguous memory
✓ Apache Arrow: Data and array of lengths
  ✓ Uses contiguous memory: Fast
Page: 18
Data format: String
Data bytes + length array
UTF-8 strings: ["Hello", "", "!"]
Data bytes: "Hello!"
Length array: [0, 5, 5, 6] (cumulative byte offsets)
i-th length: lengths[i+1] - lengths[i]
i-th data: data[lengths[i]:lengths[i+1]]
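The two formulas above can be tried out in plain Python. This is a minimal sketch of the layout, not Arrow's implementation; the helper names `encode_strings` and `get_string` are hypothetical, and `lengths` keeps the slide's name even though each entry is a cumulative byte offset.

```python
def encode_strings(strings):
    """Return (data bytes, lengths) in the layout described on the slide."""
    data = bytearray()
    lengths = [0]  # each entry is a cumulative byte offset into data
    for s in strings:
        data += s.encode("utf-8")
        lengths.append(len(data))
    return bytes(data), lengths


def get_string(data, lengths, i):
    """i-th data: data[lengths[i]:lengths[i+1]]"""
    return data[lengths[i]:lengths[i + 1]].decode("utf-8")


data, lengths = encode_strings(["Hello", "", "!"])
second_length = lengths[2] - lengths[1]  # i-th length for i = 1
```

Because all string bytes sit in one contiguous buffer, looking up any element takes two integer reads and one slice.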
Page: 19
Feature(?) (2)
Specify data format
Page: 20
Why does Arrow specify a data format?
For efficient data exchange
Page: 21
Efficient data exchange
✓ Use a common format widely
  ✓ No format conversion means less resource usage
✓ Use a format with low (de)serialization cost
  ✓ Fast
Page: 22
Who uses the Arrow format?
✓ RAPIDS: For NVIDIA GPU
✓ Fletcher, InAccel: For FPGA
✓ Spark: For interprocess data exchange
Page: 23
CPU and GPU
✓ Can't share data in memory
✓ Need to copy between CPU and GPU
✓ Efficient data exchange improves performance
Page: 24
CPU and FPGA
✓ Can't share data in memory
✓ Need to copy between CPU and FPGA
✓ Efficient data exchange improves performance
Page: 25
Spark
✓ Processes large data
✓ Needs to pass data to worker processes
✓ Efficient data exchange improves performance
Page: 26
PySpark
✓ Workers run Python
  ✓ Use pickle to exchange data
✓ Spark supports Arrow for data exchange
  ✓ Disabled by default
Page: 27
PySpark with Arrow
In [2]: %time pdf = df.toPandas()
CPU times: user 17.4 s, sys: 792 ms, total: 18.1 s
Wall time: 20.7 s
In [3]: spark.conf.set("spark.sql.execution.arrow.enabled", "true")
In [4]: %time pdf = df.toPandas()
CPU times: user 40 ms, sys: 32 ms, total: 72 ms
Wall time: 737 ms
Speeding up PySpark with Apache Arrow
Page: 28
Feature (3)
Optimized data processing modules
Page: 29
Optimized data processing
✓ Apache Arrow targets large data
✓ Performance is important
✓ How to get high performance...?
Page: 30
High performance (1)
Data locality
Page: 31
Data locality
✓ Minimize cache misses
✓ Storage is very slow
✓ Memory is slow
✓ CPU cache is fast
Page: 32
High performance (2)
SIMD
Single Instruction, Multiple Data
A way to process multiple data elements at once
Page: 33
SIMD
✓ Data must be contiguous and aligned
  ✓ The Arrow format is SIMD-ready
✓ No conditional branches
  ✓ Use a separate bitmap instead of a "missing" sentinel value for null
Is it time to stop using sentinel values for null / NA values?
Page: 34
No conditional branches
[1, null, 3] + [null, 2, 5]
null:  1 0 1  &  0 1 1  →  0 0 1   (bitwise &)
data:  1 X 3  +  X 2 5  →  X X 8   (+ by SIMD, including null elements)
Result: [null, null, 8]
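The same computation can be sketched in plain Python. This is an illustration of the idea only: lists stand in for Arrow's packed validity bitmap and data buffer, the list comprehensions stand in for SIMD instructions, and the helper names are hypothetical.

```python
def add_nullable(left_data, left_valid, right_data, right_valid):
    """Add two nullable columns without branching on null."""
    # Validity: a slot is valid only if it is valid on both sides (bitwise &).
    valid = [a & b for a, b in zip(left_valid, right_valid)]
    # Data: add every slot unconditionally, null slots included.
    data = [a + b for a, b in zip(left_data, right_data)]
    return data, valid


def to_python(data, valid):
    """Apply the validity bits only when converting back for display."""
    return [d if v else None for d, v in zip(data, valid)]


# Null slots hold an arbitrary placeholder (0 here, "X" on the slide).
data, valid = add_nullable([1, 0, 3], [1, 0, 1], [0, 2, 5], [0, 1, 1])
result = to_python(data, valid)
```

The addition never checks for null, so the loop body is identical for every element, which is what lets real implementations vectorize it.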
Page: 35
FYI: null
✓ All data types support null in Arrow
✓ Only some types support null in NumPy
欠損値の制約 - PythonとApache Arrow ("Constraints on missing values: Python and Apache Arrow")
Page: 36
High performance (3)
Thread
Page: 37
Thread
✓ Use multiple cores in a single process
✓ Minimize resource conflicts
  ✓ Locking to avoid conflicts reduces performance
✓ Approaches
  ✓ Read only or copy (shared nothing)
Page: 38
Apache Arrow and threads
✓ Data is read only
  ✓ Share data between threads without lock overhead
✓ Avoid both locking and copying
  ✓ Both reduce performance
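The read-only sharing pattern can be sketched with the standard library. This only illustrates the pattern: a read-only `memoryview` stands in for an Arrow buffer, the sample data is hypothetical, and in CPython the global interpreter lock limits the actual parallel speedup of pure-Python code like this.

```python
from concurrent.futures import ThreadPoolExecutor

# bytes is immutable, so a memoryview over it is read-only.
column = memoryview(bytes(range(256)) * 4)


def partial_sum(bounds):
    """Sum one slice of the shared column; no lock is needed for reads."""
    start, stop = bounds
    # Slicing a memoryview creates another view, not a copy.
    return sum(column[start:stop])


chunk = len(column) // 4
ranges = [(i * chunk, (i + 1) * chunk) for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, ranges))
```

Since no thread can modify the buffer, every thread can read its own slice without locking and without taking a private copy.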
Page: 39
High performance (4)
Compute kernels
Page: 40
Compute kernels
✓ SIMD-ready primitive operations
  ✓ Projection, Filter, Aggregation, ...
  ✓ compare, take (row selection), mean, ...
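What the named kernels do can be sketched in plain Python. These are illustrative stand-ins, not Arrow's kernel API: the function names, signatures, and sample data are assumptions, and real kernels work on Arrow arrays with optimized native code.

```python
def less_than(values, threshold):
    """compare kernel: one boolean per input element."""
    return [v < threshold for v in values]


def take(values, indices):
    """take kernel: select rows by position."""
    return [values[i] for i in indices]


def mean(values):
    """mean kernel: reduce a column to one number."""
    return sum(values) / len(values)


prices = [120, 80, 200, 50]  # hypothetical column
mask = less_than(prices, 100)
cheap_rows = [i for i, keep in enumerate(mask) if keep]
cheap_mean = mean(take(prices, cheap_rows))
```

Larger operations such as filters and aggregations are built by chaining small primitives like these.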
Page: 41
High performance (5)
Subgraph compiler
Page: 42
Subgraph compiler: Gandiva
✓ Compile operator graphs at run-time
  ✓ Operator graph: multiple operations combined
  ✓ table.a + table.b < table.c && ...
✓ Usable as a query engine backend
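The idea of compiling an operator graph once at run-time can be sketched in Python. This is an analogy only and not Gandiva's API: it generates Python source instead of native code, the tuple-based graph representation and the helper names are hypothetical, and column names are not validated.

```python
def to_source(node):
    """Turn a nested (operator, left, right) tuple into Python source text."""
    if isinstance(node, str):
        return node  # a column name
    op, left, right = node
    if op not in {"+", "-", "*", "<", ">"}:
        raise ValueError(f"unsupported operator: {op}")
    return f"({to_source(left)} {op} {to_source(right)})"


def compile_graph(node, columns):
    """Compile the whole graph once into a single function of the column values."""
    source = f"lambda {', '.join(columns)}: {to_source(node)}"
    return eval(compile(source, "<operator graph>", "eval"))


graph = ("<", ("+", "a", "b"), "c")  # table.a + table.b < table.c
predicate = compile_graph(graph, ["a", "b", "c"])

table = {"a": [1, 5, 2], "b": [2, 5, 2], "c": [4, 9, 4]}  # hypothetical data
mask = [predicate(a, b, c) for a, b, c in zip(table["a"], table["b"], table["c"])]
```

The graph is walked only once, at compile time; each row is then evaluated by the single generated function rather than by interpreting the graph again.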
Page: 43
High performance (6)
Query engine
Page: 44
Query engine
✓ For single node
✓ Dataflow-style operator execution
  ✓ scan → project → filter → aggregate → ...
Apache Arrow Query Engine for C++
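The scan → project → filter → aggregate flow can be sketched with Python generators. This is an illustration of the dataflow style only, not the planned C++ query engine: it processes dictionaries row by row rather than Arrow columns, and all names and sample data are hypothetical.

```python
def scan(rows):
    """Source operator: emit rows one at a time."""
    yield from rows


def project(rows, columns):
    """Keep only the requested columns."""
    for row in rows:
        yield {name: row[name] for name in columns}


def filter_rows(rows, predicate):
    """Pass through only the rows that satisfy the predicate."""
    return (row for row in rows if predicate(row))


def aggregate_sum(rows, column):
    """Sink operator: pull rows through the pipeline and reduce them."""
    return sum(row[column] for row in rows)


source = [
    {"city": "Tokyo", "sales": 10, "note": "a"},
    {"city": "Osaka", "sales": 7, "note": "b"},
    {"city": "Tokyo", "sales": 5, "note": "c"},
]
pipeline = scan(source)
pipeline = project(pipeline, ["city", "sales"])
pipeline = filter_rows(pipeline, lambda row: row["city"] == "Tokyo")
tokyo_sales = aggregate_sum(pipeline, "sales")
```

Each operator consumes the output of the previous one as it is produced, so no intermediate result is materialized in full.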
Page: 45
Query engine from Python
✓ With pandas
  ✓ Large data → execute → to_pandas()
✓ With Dask
  ✓ Dask may be able to use this as a backend
Page: 46
High performance (7)
Datasets
Page: 47
Datasets
✓ Scan data from storage/database
✓ File systems: local, HDFS, ...
✓ Formats: CSV, Parquet, ...
✓ Databases: MySQL, PostgreSQL, ...
Apache Arrow C++ Datasets
Page: 48
Fast datasets
✓ Predicate pushdown
  ✓ Scan only needed data
✓ Parallel scan
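Predicate pushdown can be sketched with the standard `csv` module. This is a minimal illustration, not the Arrow Datasets API: the function `scan_csv` and the sample data are hypothetical, and with CSV every row still has to be read, whereas formats with statistics such as Parquet can let a scanner skip whole chunks.

```python
import csv
import io

# Hypothetical sample data.
CSV_TEXT = "id,city,sales\n1,Tokyo,10\n2,Osaka,7\n3,Tokyo,5\n"


def scan_csv(text, columns, predicate):
    """Yield only the requested columns of rows that satisfy the predicate."""
    for row in csv.DictReader(io.StringIO(text)):
        # Predicate pushdown: filter inside the scan, before handing rows on.
        if predicate(row):
            # Projection: hand on only the columns the caller needs.
            yield {name: row[name] for name in columns}


rows = list(scan_csv(CSV_TEXT, ["sales"], lambda row: row["city"] == "Tokyo"))
```

The caller never sees rows or columns it did not ask for, so later stages have less data to move and process.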
Page: 49
Feature (4)
RPC
Page: 50
RPC: Arrow Flight
✓ Fast RPC framework for Arrow
✓ Based on gRPC with low-level extensions
Apache Arrow 0.11.0 Release
Page: 51
Wrap up
✓ Arrow is useful for the SciPy community
  ✓ not only in Python but also in other languages
✓ Join Apache Arrow development!
  ✓ Ask me how to start
Page: 52
Next step
✓ Mailing list: dev@arrow.apache.org
✓ Chat in Japanese:
✓ https://gitter.im/apache-arrow-ja/community
✓ Apache Arrow Tokyo Meetup 2019
this summer?
✓ See also: Apache Arrow Tokyo Meetup 2018