Rabbit Slide Show

Why Apache Arrow is important for Ruby

2022-06-23

Description

It is well known in the data processing world that Apache Arrow is important. This talk explains why Apache Arrow is especially important for the Ruby community and how it can start a positive spiral. It also introduces the Apache Arrow features that the Ruby community works on.

Text

Page: 1

Why Apache Arrow is important
for Ruby
Sutou Kouhei
ClearCode Inc.
The Data Thread
2022-06-23
Why Apache Arrow is important for Ruby
Powered by Rabbit 3.0.2

Page: 2

Me
✓ Name: Sutou Kouhei
(Family Given)
✓ ID: kou (call me kou)
(ktou or kous when I can't use kou)
✓ Ruby committer since 2004
✓ This year's Apache Arrow PMC chair
My profile picture is my "Shocker combatman" figure on my Happy Hacking Keyboard

Page: 3

Why I work on Apache Arrow
For Ruby!
(I love Ruby!)

Page: 4

Ruby
✓ Widely used for Web applications
(I rarely write Web apps)
✓ Ruby on Rails is a useful Web app framework
✓ e.g.: GitHub, GitLab, Shopify, Discourse, ...
✓ Not widely used for data processing
✓ Even though Ruby is a general-purpose programming language...

Page: 5

Ruby and data processing
Negative spiral:
Few developers → Few useful tools → Few users → Small community → (back to Few developers)

Page: 6

How to break
the negative spiral?
Few developers → Few useful tools → Few users → Small community → (back to Few developers)
✓ Few users: Expand useful tools?
✓ Small community: Increase # of users?
✓ Few developers: Expand community?
✓ Few useful tools: Increase # of developers?

Page: 7

Expand useful tools
with few developers
Negative spiral:
Few developers → Few useful tools → Few users → Small community
Positive spiral:
More useful tools → More users → Larger community → More developers

Page: 8

But how?
Apache Arrow

Page: 9

Apache Arrow
✓ Cross-language dev platform
✓ Ruby community doesn't need to dev everything
✓ We can share common implementations
✓ Apache Arrow and Ruby
✓ I donated the Ruby bindings for the C++ implementation in 2017
✓ Ruby bindings: Red Arrow
✓ Many features are already bound:
Parquet, Dataset, Gandiva, Flight, ...

Page: 10

Red Data Tools
I started a new project in 2017:
Red Data Tools is a project that
provides data processing tools for
Ruby.
[cited from `https://red-data-tools.github.io/']

Page: 11

Red Data Tools: Policy 1
Collaborate across the Ruby community
We collaborate with the Ruby
community and other communities. For
example, we use Apache Arrow, shared
with many languages, and join in
development of Apache Arrow to share
benefits.
[cited from `https://red-data-tools.github.io/']

Page: 12

What fields I work on
✓ Not only Ruby related features
✓ To be a good Apache Arrow community member
✓ Community support
✓ Answer questions from users
✓ Review pull requests

Page: 13

What features I work on
✓ Ruby related
✓ C++ impl., C GLib bindings, Linux packages,
Homebrew, MSYS2, Release, CI, ...
✓ Not Ruby related
✓ wheel, jar, MATLAB bindings, Julia impl., ...

Page: 14

What fields
Red Data Tools members work on
✓ C GLib bindings
✓ Red Arrow
✓ Tensor
✓ Big endian
✓ C++ compute functions

Page: 15

What skills I have
not used for Apache Arrow yet
Develop MySQL/PostgreSQL plugin
✓ I'm a developer of Mroonga/PGroonga
✓ Mroonga: A MySQL plugin for full text search
(múlúnɡά)
✓ PGroonga: A PostgreSQL plugin for full text search
(píːzí:lúnɡά)
✓ Use case: Impl. Flight SQL adapter?
and more...

Page: 16

Apache Arrow and Ruby community
✓ Ruby community uses Arrow's work
✓ Ruby community joins in Arrow dev

Page: 17

What feature is useful for Ruby?
Fast data
interchange

Page: 18

Fast data interchange
✓ It's still difficult to use Ruby for full data processing
✓ Because Apache Arrow doesn't solve everything
✓ Increase usage of Ruby step by step
✓ Because Ruby can integrate with other languages through Apache Arrow's fast data interchange feature

Page: 19

Integration examples
✓ DuckDB: Arrow-ready in-process SQL OLAP DBMS
✓ https://github.com/red-data-tools/red-arrow-duckdb
✓ DataFusion: Arrow-native SQL query engine
✓ WIP: Export C API #1113
https://github.com/apache/arrow-datafusion/issues/1113

Page: 20

What feature is useful for Ruby?
Web app related
features
Because many Ruby users develop Web apps with Ruby on Rails

Page: 21

What features are useful
for Web apps
✓ Visualization related features
✓ For dashboards
✓ Fast data interchange with RDBMS
✓ Web apps may have batch jobs to process large data in RDBMS
✓ See also: mrkn's talk at RubyKaigi 2019
(mrkn is an Apache Arrow committer from Red Data Tools)
https://speakerdeck.com/mrkn/reducing-activerecord-memory-consumption-using-apache-arrow

Page: 22

Fast data interchange with RDBMS
✓ Apache Arrow Flight SQL
✓ Apache Arrow Database Connectivity: ADBC
https://docs.google.com/document/d/1t7NrC76SyxL_OffATmjzZs2xcj1owdUsIF2WKL_Zw1U/

Page: 23

Fast data interchange with RDBMS
Apache Arrow Flight SQL:
RDBMS → (Apache Arrow Flight) → Library → (No conversion) → Web app
Apache Arrow Database Connectivity:
RDBMS → (Own protocol) → Library → (Own format→Apache Arrow) → Web app

Page: 24

Apache Arrow data⇄Ruby objects
✓ Red Arrow has a fast converter
✓ Implemented in C++
✓ Faster than RDBMS's own format data⇄Ruby objects
✓ Both Flight SQL and ADBC will improve performance

Page: 25

Wrap up
✓ Ruby community joins in Arrow dev
✓ To use Ruby for data processing
✓ Ruby community is interested in:
✓ Integration with other data processing systems
✓ RDBMS related improvements

Page: 26

Topics I didn't talk about today
✓ GObject Introspection (GI)
✓ Ruby bindings are generated at run-time, not compile-time
✓ How does GI work for it?
✓ Linux packaging
✓ How to build deb/rpm for Debian/Ubuntu/CentOS/AlmaLinux/Amazon Linux on x86_64 and arm64?

Page: 27

Acknowledgment
✓ Voltron Data
✓ Most of my Apache Arrow related work has been done with financial support from Voltron Data since 2022-04
✓ Yukiko Yoshimoto at ClearCode
✓ Added English subtitles to this video
