Rabbit Slide Show

Ruby + ADBC - A single API between Ruby and DBs

2023-05-13

Description

ADBC is Apache Arrow Database Connectivity. It provides an API that can connect to different databases by wrapping database-specific APIs. This is not a new approach: existing APIs such as Active Record, Sequel and ODBC do the same. What sets ADBC apart is its focus on large data and performance. ADBC is an important piece for using Ruby for data processing. With ADBC, we can extract large data from many databases (not only RDBMSs but also data warehouses and so on) and load large data into them. To use Ruby for data processing, we need data. ADBC helps us get it.

Text

Page: 1

Ruby + ADBC
A single API between Ruby and DBs
Sutou Kouhei
ClearCode Inc.
RubyKaigi 2023
2023-05-13
Ruby + ADBC - A single API between Ruby and DBs
Powered by Rabbit 3.0.2

Page: 2

Sutou Kouhei
A president/Ruby committer
The president of ClearCode Inc.

Page: 3

Sutou Kouhei
The 3rd Apache Arrow PMC chair
✓ PMC: Project Management Committee
✓ #2 by commit count

Page: 4

Sutou Kouhei
The pioneer in Ruby and ADBC
✓ A Ruby committer
✓ Maintains some standard libraries/default gems
✓ The author of Red ADBC
✓ The official ADBC library for Ruby
✓ ADBC is developed by the Apache Arrow project

Page: 5

Sutou Kouhei
The founder of Red Data Tools
✓ Provides data processing tools for Ruby
https://red-data-tools.github.io/
https://red-data-tools.github.io/ja/
✓ Policies
✓ 5. Ignore criticism from outsiders
Ignore "I use XXX for it instead of Ruby because..."
✓ 6. Fun!

Page: 6

Topic
Let's use Ruby to extract and load large data!

Page: 7

Embulk?
✓ Bulk data loader implemented in Java
✓ JRuby supported!

Page: 8

"Embulk v0.11 is coming soon"
https://www.embulk.org/articles/2023/04/13/embulk-v0.11-is-coming-soon.html
we plan to gradually shrink our support on (J)Ruby

Page: 9

Another approach: ADBC
✓ Arrow Database Connectivity
✓ A single API for accessing many DBs
✓ Like Active Record/Sequel in Ruby

Page: 10

ADBC: Features
✓ Cross-language
✓ Active Record needs adapters implemented in Ruby
✓ ADBC can use adapters implemented in other languages
✓ Optimized for large columnar data

Page: 11

Large column-oriented data
✓ Large: >= 1M records with 1 column
✓ Column-oriented
[Figure: the same table (columns a, b, c; rows 1, 2, 3) shown in a column-oriented layout and in a row-oriented layout]
                         Column-oriented   Row-oriented
Value management unit    Column            Row
Fast access unit         Column            Row
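The difference between the two layouts can be sketched with plain Ruby arrays. This is only a toy illustration: the variable names and values are made up, and real column-oriented formats such as Apache Arrow use contiguous typed buffers rather than Ruby arrays.

```ruby
# Toy illustration only: plain Ruby arrays stand in for real storage.
# The same 3 records with columns a, b and c, stored two ways.

# Row-oriented: one array per record; a record's values sit together.
rows = [
  [1, 10, 100],
  [2, 20, 200],
  [3, 30, 300],
]

# Column-oriented: one array per column; a column's values sit together.
columns = {
  a: [1, 2, 3],
  b: [10, 20, 30],
  c: [100, 200, 300],
}

# Summing column "b" from rows has to visit every record and pick one value.
sum_from_rows = rows.sum { |row| row[1] }

# Summing column "b" from columns scans a single array.
sum_from_columns = columns[:b].sum

# Reading the second record is the opposite: one array in the row layout,
# but one lookup per column in the column layout.
record_from_rows = rows[1]
record_from_columns = columns.values.map { |values| values[1] }

puts sum_from_rows
puts sum_from_columns
p record_from_rows
p record_from_columns
```

Both layouts hold the same data. What changes is which access pattern is cheap: whole columns for the column-oriented layout, whole records for the row-oriented one.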

Page: 12

Optimized for large columnar data
✓ Apache Arrow data format: Minimize data interchange cost!
✓ Partitioned result sets: Fast data extract
✓ Bulk insert: Fast data load
See also: "Why is the Apache Arrow format fast?" (Sutou Kouhei, ClearCode Inc., db tech showcase ONLINE 2020, 2020-12-08)
https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-online-2020/

Page: 13

How fast is ADBC?
✓ 1 integer column
✓ SELECT * FROM x
✓ Lower is faster
✓ About 2x faster with 10M records

Page: 14

Architecture
✓ Single API
✓ Driver per protocol
✓ API returns Arrow data
[Figure: ADBC flow diagram. Labels: Query, ADBC API, Flight SQL Driver, libpq Driver, Flight SQL, Postgres Protocol, Arrow Data, DATABASE, POSTGRES]
https://arrow.apache.org/img/ADBCFlow2.svg Apache-2.0 © 2016-2023 The Apache Software Foundation
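The three points above (single API, driver per protocol, same result format) can be sketched as a toy dispatcher. This is not the ADBC API: ToyDrivers, its lambdas and the returned hashes are all hypothetical stand-ins, and a plain Ruby hash of arrays plays the role of Arrow data.

```ruby
# Toy model, not the ADBC API: every name below is hypothetical.
# Each "driver" speaks its own protocol but returns the same
# column-oriented shape, so the caller's code doesn't change.
module ToyDrivers
  # Pretend this driver talks the PostgreSQL wire protocol and converts
  # row-oriented results to columns before returning them.
  Libpq = lambda do |_query|
    rows = [[1, "a"], [2, "b"]]
    { id: rows.map { |row| row[0] }, name: rows.map { |row| row[1] } }
  end

  # Pretend this driver talks Flight SQL, which already carries columns.
  FlightSQL = lambda do |_query|
    { id: [1, 2], name: ["a", "b"] }
  end

  REGISTRY = { "libpq" => Libpq, "flight_sql" => FlightSQL }.freeze

  # The single entry point: pick a driver by name, then run the query.
  def self.query(driver:, sql:)
    REGISTRY.fetch(driver).call(sql)
  end
end

via_libpq = ToyDrivers.query(driver: "libpq", sql: "SELECT * FROM data")
via_flight = ToyDrivers.query(driver: "flight_sql", sql: "SELECT * FROM data")
p via_libpq
p via_flight
```

The caller only changes the driver name. The query call and the shape of the result stay the same, which is the property the slide attributes to ADBC.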

Page: 15

API
✓ C API
✓ Bindings: GLib, Python, R, Ruby
✓ Go API
✓ Java API
✓ Rust API (WIP)
See also: https://arrow.apache.org/adbc/0.3.0/format/specification.html

Page: 16

C API
✓ AdbcDatabase: It holds state shared by multiple connections
✓ AdbcConnection: It's a single, logical connection to a database
✓ AdbcStatement: It holds state related to query execution
See also: https://arrow.apache.org/adbc/0.3.0/cpp/api/adbc.html

Page: 17

Ruby API: Extract
require "adbc"

# Driver name and connection URI for a local PostgreSQL
options = {
  driver: "adbc_driver_postgresql",
  uri: "postgresql://127.0.0.1:5432/db",
}
ADBC::Database.open(**options) do |database|
  database.connect do |connection|
    connection.open_statement do |statement|
      query = "SELECT * FROM data"
      # Keep only the first returned value: the result table
      table, = statement.query(query)
      p table
    end
  end
end

Page: 18

Ruby API: Load
require "adbc"

# Driver name and connection URI for a local PostgreSQL
options = {
  driver: "adbc_driver_postgresql",
  uri: "postgresql://127.0.0.1:5432/db",
}
ADBC::Database.open(**options) do |database|
  database.connect do |connection|
    connection.open_statement do |statement|
      # Read the input data from an Arrow file
      input = Arrow::Table.load("in.arrow")
      # Bulk insert the data into the table named "table"
      statement.ingest("table", input)
      # ...
    end
  end
end

Page: 19

Ruby API - Active Record
WIP
https://github.com/red-data-tools/activerecord-adbc-adapter
Join us! We need to improve drivers too.

Page: 20

Available drivers
DB          Status
DuckDB      Beta
Flight SQL  Beta
PostgreSQL  Experimental
SQLite      Beta

Page: 21

How to implement a driver
✓ Choose C, C++ or Go
✓ See the following implementations:
✓ C: https://github.com/apache/arrow-adbc/tree/main/c/driver/sqlite
✓ C++: https://github.com/apache/arrow-adbc/tree/main/c/driver/postgresql
✓ Go (Go API): https://github.com/apache/arrow-adbc/tree/main/go/adbc/driver/flightsql
✓ Go (C API): https://github.com/apache/arrow-adbc/blob/main/go/adbc/pkg/flightsql/driver.go

Page: 22

Current ADBC
✓ 1 integer column
✓ SELECT * FROM x
✓ Lower is faster
✓ libpq driver is slow for now...

Page: 23

Flight SQL?
SQL on Apache Arrow Flight

Page: 24

Apache Arrow Flight?
✓ Arrow format based fast RPC framework
✓ Minimum data interchange cost!
✓ Parallel transfers
✓ Stream processing
See also: "Apache Arrow Flight - A fast data transfer framework for big data" (Sutou Kouhei, ClearCode Inc., db tech showcase 2021, 2021-11-17)
https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-2021/

Page: 25

Simple usage
https://arrow.apache.org/img/20191014_flight_simple.png
Apache License 2.0 - © 2016-2021 The Apache Software Foundation

Page: 26

GetFlightInfo
✓ Client→Server
✓ Server returns how to get data
✓ FlightInfo: How to get data
✓ Metadata: Schema, # of records, ...
✓ 1+ Endpoints: Data may be distributed!

Page: 27

DoGet
✓ Client→Server
✓ Server returns data
✓ Data: Record batch stream
✓ Called FlightData in the protocol
✓ Record batch: 0+ records

Page: 28

Apache Arrow Flight SQL
Client → Server: GetFlightInfo(CommandStatementQuery: SQL)
Server → Client: FlightInfo{..., Ticket, ...}
Client → Server: DoGet(Ticket)
Server → Client: query results as Apache Arrow data
https://arrow.apache.org/blog/2022/02/16/introducing-arrow-flight-sql/
Apache License 2.0 - © 2016-2023 The Apache Software Foundation
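The GetFlightInfo then DoGet sequence can be sketched as a toy in-memory model. This is not the Arrow Flight API: ToyFlightServer, the FlightInfo and Endpoint structs, and the ticket strings are hypothetical, and arrays of integers stand in for record batches of a single "id" column.

```ruby
# Toy in-memory model of the call sequence; this is not the Arrow Flight
# API. ToyFlightServer, FlightInfo, Endpoint and the ticket strings are
# all hypothetical names for illustration.
FlightInfo = Struct.new(:schema, :n_records, :endpoints)
Endpoint = Struct.new(:ticket)

class ToyFlightServer
  def initialize
    # Results are kept as record batches (arrays of records), keyed by ticket.
    @results = {
      "ticket-0" => [[1, 2], [3]],
      "ticket-1" => [[], [4, 5]], # a batch may hold zero records
    }
  end

  # GetFlightInfo: return how to get the data, not the data itself.
  def get_flight_info(_sql)
    n_records = @results.values.flatten.size
    endpoints = @results.keys.map { |ticket| Endpoint.new(ticket) }
    FlightInfo.new([:id], n_records, endpoints)
  end

  # DoGet: return a stream (Enumerator) of record batches for one ticket.
  def do_get(ticket)
    @results.fetch(ticket).each
  end
end

server = ToyFlightServer.new
info = server.get_flight_info("SELECT * FROM data")

# Each endpoint could be fetched in parallel; here they are read in turn.
ids = info.endpoints.flat_map do |endpoint|
  server.do_get(endpoint.ticket).flat_map { |batch| batch }
end

puts info.n_records
p ids
```

Splitting the work into "ask where the data is" and "fetch each piece" is what lets a client read endpoints in parallel when the data is distributed.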

Page: 29

Current ADBC
✓ 1 integer column
✓ SELECT * FROM x
✓ Lower is faster
✓ libpq driver is slow for now...

Page: 30

But can PostgreSQL talk Flight SQL?
Flight SQL adapter
https://github.com/apache/arrow-flight-sql-postgresql
I'm the author

Page: 31

Architecture
[Figure: sequence diagram]
Participants: Client, PG(master), PG(Flight SQL main), PG(Flight SQL server), PG(Flight SQL executor)
Steps, in order:
1. Spawn
2. Spawn
3. Listen gRPC socket (multi-threading)
4. Connect with Flight SQL protocol
5. Allocate an executor for this session
6. Spawn
7. Send a query
8. Pass the given query via shared memory
9. Run the given query with SPI
10. Convert a result to Apache Arrow data
11. Pass the result via shared memory
12. Return the result with Flight SQL protocol

Page: 32

Wrap up
✓ We can use Ruby to extract and load large data with ADBC! (in a few years...)
✓ PostgreSQL will be Flight SQL ready soon!
✓ We can use ADBC via Active Record soon

Page: 33

Join us!
✓ Red Data Tools: A project that provides data processing tools for Ruby
https://red-data-tools.github.io/
https://red-data-tools.github.io/ja/
✓ You can implement something with us!
https://gitter.im/red-data-tools/en
https://gitter.im/red-data-tools/ja

Page: 34

Sponsor us?
✓ Give your employees XX% of their work time to work on Red Data Tools
✓ Employ a full-time Red Data Tools developer
✓ Pay Red Data Tools continuously
Someone in Red Data Tools uses the money to secure time to work on it
✓ Or contact me!
