Rabbit Slide Show

Pure Ruby Apache Arrow reader/writer

2026-04-24

Description

Apache Arrow is the de facto standard data format in modern data processing systems. We can use the official Red Arrow gem to process Apache Arrow data. It's suitable for fast processing of large data, but it's overkill when you only need low-cost data exchange. Red Arrow is larger and a bit harder to install than pure Ruby gems because it is implemented as bindings.

I'm implementing the official pure Ruby Apache Arrow reader/writer for exactly those low-cost data exchange needs. I expect that more Ruby libraries and applications will add support for Apache Arrow inputs/outputs by using it, and that Ruby will be used more for data processing as a result.

This talk describes how to implement a fast pure Ruby binary data reader/writer and the future of data processing in Ruby.

This is a 2025 Ruby Association Grant project: <URL:https://www.ruby.or.jp/en/news/20251030>

Text

Page: 1

Pure Ruby
Apache Arrow reader/writer
Sutou Kouhei
ClearCode Inc.
RubyKaigi 2026
2026-04-24
Pure Ruby Apache Arrow reader/writer
Powered by Rabbit 4.0.2

Page: 2

Sutou Kouhei
A president and Ruby committer
20th anniversary this year! There will be a celebration event in Tokyo on July 29!
Details: https://www.clear-code.com/anniversaries/20th.html

Page: 3

Apache Arrow
✓ Fast columnar format for large in-memory data
✓ Fast data processing library for large in-memory data
✓ ...

Page: 4

Pure Ruby
Apache Arrow reader/writer
✓ Fast columnar format for large in-memory data
✓ Fast data processing library for large in-memory data
✓ ...

Page: 5

Wait! Don't we already have one...?
✓ Red Arrow (the official bindings of the official C++ implementation)
  ✓ Fast reader/writer, data processor, ...
✓ Ruby Polars (the bindings of a Rust library that supports Apache Arrow)
  ✓ Fast reader/writer, data processor, ...
✓ ...

Page: 6

Motivation
Why did I build this?
Spread Apache Arrow further in the Ruby ecosystem!

Page: 7

Why a pure Ruby implementation?
✓ Bindings are faster but:
  ✓ they tend to be more difficult to install/maintain
  ✓ they tend to have a larger install size
✓ A pure Ruby implementation can solve these cons

Page: 8

Red Arrow Format
✓ Pure Ruby Apache Arrow reader/writer
✓ + Pure Ruby FlatBuffers reader/writer
  (there wasn't one for FlatBuffers either, so I wrote it)
  ✓ The Apache Arrow format uses the FlatBuffers format internally
✓ A 2025 Ruby Association Grant project

Page: 9

How much slower?
✓ Lower is faster
✓ The pure Ruby implementation is faster!!!
✓ 🙊 The bindings were a debug build

Page: 10

Release build
✓ Lower is faster
✓ The bindings are 5.5x/1.5x faster!!!
✓ But... it's fast enough, isn't it!?

Page: 11

Why is the pure Ruby implementation fast?
✓ 😉 I'm awesome!
✓ Apache Arrow is awesome!

Page: 12

Why is Apache Arrow awesome?
Zero copy!

Page: 13

What factors slow down read/write?
✓ Data copy
  ✓ Memory allocation and copying are relatively heavy operations
✓ Format conversion
  ✓ It consumes CPU

Page: 14

Zero copy
✓ No memory allocation/copy
✓ No format conversion
✓ Just refer to existing data

Page: 15

Zero copy in Apache Arrow format
(Diagram)
File format: ARROW1 | Streaming format | FOOTER message | ARROW1
Streaming format: SCHEMA message | RECORD BATCH message | ...
Message: Metadata (FlatBuffers) | Body
Zero copy
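
The message layout in the diagram can be explored with a minimal sketch. It assumes the encapsulated message framing described in the Apache Arrow IPC specification: a 4-byte continuation marker (0xFFFFFFFF), a little-endian int32 metadata size, the metadata, then the body. `split_message` is a hypothetical helper for illustration, not the Red Arrow Format API, and the metadata bytes are a placeholder rather than real FlatBuffers data.

```ruby
# Hypothetical helper: split one encapsulated message into zero-copy slices.
# A real reader obtains the body size from the FlatBuffers metadata; here it
# is passed in because the metadata below is only a placeholder.
def split_message(buffer, body_size, continuation: 0xFFFFFFFF)
  marker = buffer.get_value(:u32, 0)          # continuation marker
  raise ArgumentError, "not a message" unless marker == continuation
  metadata_size = buffer.get_value(:s32, 4)   # little-endian int32
  metadata = buffer.slice(8, metadata_size)   # view, shares memory
  body = buffer.slice(8 + metadata_size, body_size)
  [metadata, body]
end

fake_metadata = "METADATA"        # 8 bytes standing in for FlatBuffers data
body_bytes = [1, 2].pack("l<2")   # two int32 values, 8 bytes
data = [0xFFFFFFFF, fake_metadata.bytesize].pack("L<l<") +
       fake_metadata + body_bytes

metadata, body = split_message(IO::Buffer.for(data), body_bytes.bytesize)
first_value = body.get_value(:s32, 0)
second_value = body.get_value(:s32, 4)
```

Both slices only point into the original buffer; bytes are read out only when a concrete value such as `first_value` is requested.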

Page: 16

String zero copy in Ruby
require "objspace"
data = "large data" * 1000
# Reuse leading string
ObjectSpace.memsize_of(data[0..-2]) # 10040 (copied)
# Reuse trailing string
ObjectSpace.memsize_of(data[1..-1]) # 40 (Zero copy!)
# Reuse middle string
ObjectSpace.memsize_of(data[1..-2]) # 10039 (copied)

Page: 17

Integer zero copy in Ruby
✓ We can't zero copy integers in Ruby
  ✓ The internal format is different
  ✓ Apache Arrow: C compatible
  ✓ Ruby: (c_format << 1) + 0x01
    (shift the C representation left by 1 bit and set the freed bit to 1)
  ✓ 00000001 (1 in C) -> 00000011 (1 in Ruby)
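
The tagging rule above can be reproduced with plain arithmetic. This is only an illustrative sketch: `tag_as_ruby_integer` is a hypothetical helper that mirrors the `(c_format << 1) + 0x01` formula from the slide, and it does not inspect Ruby's real internals.

```ruby
# Hypothetical helper mirroring the slide's formula for small integers.
def tag_as_ruby_integer(c_value)
  (c_value << 1) | 0x01
end

arrow_bytes = [1].pack("l<")          # C-compatible little-endian int32
c_value = arrow_bytes.unpack1("l<")   # decoding creates a new Ruby Integer
tagged = tag_as_ruby_integer(c_value)

c_bits = format("%08b", c_value)      # bit pattern as stored in Apache Arrow
ruby_bits = format("%08b", tagged)    # bit pattern after tagging
```

Because the two bit patterns differ, every integer read from an Apache Arrow buffer has to be converted, which is why the slide says integers can't be zero copied.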

Page: 18

Zero copy parsing: Leading data
# length(int32_t) + string + ...
data = [5].pack("l") + "abcde" + "..."
# Parse
length = data.unpack1("l") # Not zero copy
# Not zero copy! (Middle string can't be shared)
string = data[4, length]

Page: 19

Zero copy parsing: Trailing data
# ... + length(int32_t) + string
data = "..." + [5].pack("l") + "abcde"
# Parse
# data[3..-1] is zero copy but
# temporary string is allocated
length = data[3..-1].unpack1("l")
# No temporary string
length = data.unpack1("l", offset: 3)
# Zero copy! (Trailing string can be shared)
string = data[7, length]

Page: 20

IO::Buffer
✓ Since Ruby 3.1 (still experimental)
✓ Efficient zero-copy buffer
✓ Use cases:
  ✓ Fiber::Scheduler
  ✓ Parsing binary protocols!!!

Page: 21

Buffer zero copy in Ruby
data = IO::Buffer.for("large data" * 1000)
# Reuse leading buffer
data.slice(0, data.size - 1) # Zero copy!
# Reuse trailing buffer
data.slice(1, data.size - 1) # Zero copy!
# Reuse middle buffer
data.slice(1, data.size - 2) # Zero copy!
data.get_string(1, 2) # Not zero copy...
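
One way to check that `slice` really shares memory while `get_string` copies is to write through a slice and see who observes the change. This is a small sketch using a writable `IO::Buffer.new` buffer instead of the read-only `IO::Buffer.for` buffer from the slide; the variable names are only illustrative.

```ruby
buffer = IO::Buffer.new(8)         # writable, zero-filled 8-byte buffer
tail = buffer.slice(4, 4)          # zero copy: a view of bytes 4..7
copied = buffer.get_string(4, 4)   # copy: an independent String

tail.set_value(:s32, 0, 42)        # write through the slice

seen_by_parent = buffer.get_value(:s32, 4) # the parent sees the write
copy_unchanged = copied.unpack1("l<")      # the copy still holds the old bytes
```

The parent buffer observes the value written through `tail`, while the string returned by `get_string` keeps the bytes it copied earlier.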

Page: 22

Zero copy parsing with IO::Buffer
# "..." + length(int32_t) + string + ...
data = "..." + [5].pack("l<") + "abcde" + "..."
buffer = IO::Buffer.for(data)
# Parse
length = buffer.get_value(:s32, 3) # Not zero copy (:s32 is little-endian)
# Zero copy! But it's a buffer, not a string...
string_buffer = buffer.slice(7, length)
# Not zero copy
string = buffer.get_string(7, length)

Page: 23

Zero copy parsing data in file
File.open("data.arrow", "rb") do |file|
  readonly = IO::Buffer::READONLY
  buffer = IO::Buffer.map(file, nil, 0, readonly)
  # file.read copies data from the file, but a mapped buffer...
  length = buffer.get_value(:s32, 3) # doesn't copy
  string_buffer =
    buffer.slice(7, length) # doesn't copy
end

Page: 24

Zero copy parsing result
✓ Lower is faster
✓ Faster than the debug build
✓ Slower than the release build

Page: 25

Zero copy writing
✓ 😞 Can't do it
  ✓ Need to concatenate all buffers
✓ So how was a fast writer implemented...?

Page: 26

Writing result
✓ Lower is faster
✓ Faster than the debug build
✓ Slower than the release build

Page: 27

How to implement?
✓ Nothing special but fast...
  ✓ output.append_as_bytes(string)
  ✓ output.append_as_bytes([integer].pack("L<"))
✓ Want to eliminate temporary objects
  ✓ e.g. [integer].pack("L<")
  ✓ Thinking about a good API but no idea yet...
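
The writing approach above can be sketched as a length-prefixed writer. `append_bytes` and `write_length_prefixed` are hypothetical helpers, not the Red Arrow Format API. `String#append_as_bytes` needs Ruby 3.4 or later, so the sketch falls back to `<<` on older versions.

```ruby
# Hypothetical helper: append raw bytes without encoding checks.
# Uses String#append_as_bytes (Ruby 3.4+) when available.
def append_bytes(output, bytes)
  if output.respond_to?(:append_as_bytes)
    output.append_as_bytes(bytes)
  else
    output << bytes.b
  end
end

# Hypothetical helper: write a little-endian uint32 length, then the bytes.
def write_length_prefixed(output, string)
  # The Array and the packed String here are the temporary objects
  # that the slide would like to eliminate.
  append_bytes(output, [string.bytesize].pack("L<"))
  append_bytes(output, string)
end

output = String.new(encoding: Encoding::BINARY)
write_length_prefixed(output, "abcde")
write_length_prefixed(output, "xyz")

length = output.unpack1("L<")
first = output.byteslice(4, length)
```

Every call concatenates into a single output string, which matches the point that all buffers have to be joined when writing.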

Page: 28

Motivation
Why did I build this?
Spread Apache Arrow further in the Ruby ecosystem!

Page: 29

What do we do for it?
✓ Just add support for Apache Arrow input/output
  ✓ e.g. a Web app that can output JSON or Apache Arrow
  ✓ e.g. a Web app that can import CSV or Apache Arrow
✓ Existing data processing tools will then integrate with your apps
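
As a rough illustration of a Web app that can output JSON or Apache Arrow, the sketch below picks the response format from an `Accept` header. `render_records` is a hypothetical helper, and `arrow_writer` is an injected placeholder for a real Apache Arrow writer; the stub used here does not produce Apache Arrow data. The media type `application/vnd.apache.arrow.stream` is the registered one for the streaming format.

```ruby
require "json"

# Hypothetical helper: choose the output format from the Accept header.
# arrow_writer is any callable that turns records into Apache Arrow bytes.
def render_records(records, accept, arrow_writer:)
  arrow_type = "application/vnd.apache.arrow.stream"
  if accept.to_s.include?(arrow_type)
    [arrow_type, arrow_writer.call(records)]
  else
    ["application/json", JSON.generate(records)]
  end
end

records = [{"id" => 1, "name" => "alice"}]
# Stub only: a real app would call an Apache Arrow writer here.
stub_writer = ->(rows) { "<#{rows.size} rows as Arrow bytes>" }

json_type, json_body =
  render_records(records, "application/json", arrow_writer: stub_writer)
arrow_type, arrow_body =
  render_records(records, "application/vnd.apache.arrow.stream",
                 arrow_writer: stub_writer)
```

Keeping the writer behind a callable means the existing JSON path stays untouched while Apache Arrow output is added next to it.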

Page: 30

Future work
If you're interested in this, I can support you!
✓ Add fast bit operations to Ruby
  ✓ Apache Arrow uses bitmaps
  ✓ e.g. popcount: Feature #20163
✓ Add a fast binary write API to Ruby
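
To show why a built-in popcount would help, here is a minimal pure Ruby sketch that counts valid (non-null) values in a validity bitmap. `count_valid` is a hypothetical helper. It assumes the Apache Arrow bitmap layout of one bit per value, least-significant bit first, where 1 means valid.

```ruby
# Hypothetical helper: count set bits for the first n_values values.
def count_valid(bitmap, n_values)
  full_bytes, rest_bits = n_values.divmod(8)
  count = 0
  full_bytes.times do |i|
    # Without a built-in popcount, each byte goes through a temporary String.
    count += bitmap.getbyte(i).to_s(2).count("1")
  end
  if rest_bits > 0
    # Mask off the bits beyond n_values in the last partial byte.
    last = bitmap.getbyte(full_bytes) & ((1 << rest_bits) - 1)
    count += last.to_s(2).count("1")
  end
  count
end

bitmap = [0b11111111, 0b00001101].pack("C*")
valid_in_8 = count_valid(bitmap, 8)
valid_in_10 = count_valid(bitmap, 10)
valid_in_12 = count_valid(bitmap, 12)
```

The per-byte `to_s(2).count("1")` allocates a string each time, which is the kind of overhead a fast bit operation in Ruby itself could remove.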

Page: 31

Future work
If you're interested in this, I can support you!
✓ Add SIMD JIT to Ruby
  ✓ Apache Arrow is SIMD friendly (aligned and contiguous)
✓ MemoryView can be used for this
  ✓ MemoryView is a typed array

Page: 32

SIMD JIT
sum = 0
# The JIT would vectorize this loop automatically
int32_array_memory_view.each do |value|
  sum += value
end
sum

Page: 33

Wrap up
✓ The pure Ruby reader/writer is fast enough
✓ Let's make your apps Apache Arrow ready
✓ Let's improve it together!
  Join Red Data Tools! https://red-data-tools.github.io/
✓ I'll be at the bookstore during today's lunch and afternoon breaks
  ✓ Come there if you want to talk to me!
