unihan-tabular
unihan-tabular is a tool that builds UNIHAN into tabular-friendly formats such as Python, JSON, CSV and YAML. Part of the cihai project.
UNIHAN's data is dispersed across multiple files in the format of:
U+3400 kCantonese jau1
U+3400 kDefinition (same as U+4E18 丘) hillock or mound
U+3400 kMandarin qiū
U+3401 kCantonese tim2
U+3401 kDefinition to lick; to taste, a mat, bamboo bark
U+3401 kHanyuPinyin 10019.020:tiàn
U+3401 kMandarin tiàn
Running $ unihan-tabular will download Unihan.zip and build all files into a single tabular-friendly format.
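The reshaping described above (one line per code point and field, collapsed into one row per character) can be sketched in a few lines of Python. This is an illustrative sketch, not the tool's actual implementation: the names `pivot` and `ucn_to_char` are hypothetical, and it assumes the tab-delimited layout of the upstream Unihan text files.

```python
# Hypothetical sketch of the long-to-wide reshaping; not unihan-tabular's own code.
FIELDS = ["kCantonese", "kDefinition", "kHanyuPinyin", "kMandarin"]

# Sample lines, tab-delimited as in the upstream Unihan files (assumption).
SAMPLE = """\
U+3400\tkCantonese\tjau1
U+3400\tkDefinition\t(same as U+4E18 丘) hillock or mound
U+3400\tkMandarin\tqiū
U+3401\tkCantonese\ttim2
U+3401\tkDefinition\tto lick; to taste, a mat, bamboo bark
U+3401\tkHanyuPinyin\t10019.020:tiàn
U+3401\tkMandarin\ttiàn
"""


def ucn_to_char(ucn):
    """Convert a code point string such as 'U+3400' to its character."""
    return chr(int(ucn[2:], 16))


def pivot(lines, fields):
    """Collapse (ucn, field, value) lines into one dict per character."""
    rows = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        ucn, field, value = line.split("\t", 2)
        if field not in fields:
            continue
        if ucn not in rows:
            # Start every row with all requested fields set to None.
            row = {"char": ucn_to_char(ucn), "ucn": ucn}
            row.update({name: None for name in fields})
            rows[ucn] = row
        rows[ucn][field] = value
    return list(rows.values())


rows = pivot(SAMPLE.splitlines(), FIELDS)
```

Fields that a character lacks stay `None`, which corresponds to the empty CSV cell and the `null` value in the JSON and YAML exports.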
CSV (default), $ unihan-tabular:
char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin
㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū
㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn
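To read this CSV back, Python's standard `csv` module is enough. The sketch below parses the sample above from an in-memory string; the helper name `read_rows` is hypothetical. For a real export you would open the file with `encoding="utf-8"` and `newline=""` instead.

```python
import csv
import io

# The CSV sample from above, held in memory for illustration.
CSV_TEXT = """\
char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin
㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū
㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


csv_rows = read_rows(CSV_TEXT)
```

The quoted kDefinition value keeps its embedded commas, and a missing field (kHanyuPinyin for U+3400) comes back as an empty string.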
JSON, $ unihan-tabular -F json:
[
  {
    "char": "㐀",
    "ucn": "U+3400",
    "kCantonese": "jau1",
    "kDefinition": "(same as U+4E18 丘) hillock or mound",
    "kHanyuPinyin": null,
    "kMandarin": "qiū"
  },
  {
    "char": "㐁",
    "ucn": "U+3401",
    "kCantonese": "tim2",
    "kDefinition": "to lick; to taste, a mat, bamboo bark",
    "kHanyuPinyin": "10019.020:tiàn",
    "kMandarin": "tiàn"
  }
]
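A JSON export of this shape loads directly with the standard `json` module. The sketch below indexes the records by character for quick lookups; the helper name `index_by_char` is hypothetical, and the sample is abbreviated to two fields per record.

```python
import json

# Abbreviated version of the JSON sample above.
JSON_TEXT = """
[
  {"char": "㐀", "ucn": "U+3400", "kHanyuPinyin": null, "kMandarin": "qiū"},
  {"char": "㐁", "ucn": "U+3401", "kHanyuPinyin": "10019.020:tiàn", "kMandarin": "tiàn"}
]
"""


def index_by_char(records):
    """Build a dict mapping each character to its record."""
    return {record["char"]: record for record in records}


by_char = index_by_char(json.loads(JSON_TEXT))
```

JSON `null` values arrive as Python `None`, so a check such as `by_char["㐀"]["kHanyuPinyin"] is None` tells you the field is absent for that character.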
YAML, $ unihan-tabular -F yaml:
- char: 㐀
  kCantonese: jau1
  kDefinition: (same as U+4E18 丘) hillock or mound
  kHanyuPinyin: null
  kMandarin: qiū
  ucn: U+3400
- char: 㐁
  kCantonese: tim2
  kDefinition: to lick; to taste, a mat, bamboo bark
  kHanyuPinyin: 10019.020:tiàn
  kMandarin: tiàn
  ucn: U+3401
Features
- automatically downloads UNIHAN from the internet
- exports to JSON, CSV and YAML (requires pyyaml) via -F
- configurable to export specific fields via -f
- accounts for encoding conflicts due to the Unicode-heavy content
- designed as a technical proof for future CJK (Chinese, Japanese, Korean) datasets
- core component and dependency of cihai, a CJK library
- data package support
- supports Python 2.7, >= 3.5 and PyPy
If you encounter a problem or have a question, please create an issue.
Usage
unihan-tabular supports command line arguments. See unihan-tabular CLI arguments for information on how you can specify custom columns, files, download URLs and output destinations.
To download and build your own UNIHAN export, first install the package:
$ pip install unihan-tabular
To output CSV, the default format:
$ unihan-tabular
To output JSON:
$ unihan-tabular -F json
To output YAML:
$ pip install pyyaml
$ unihan-tabular -F yaml
To output only the kDefinition field in a CSV:
$ unihan-tabular -f kDefinition
To output multiple fields, separate with spaces:
$ unihan-tabular -f kCantonese kDefinition
To output to a custom file:
$ unihan-tabular --destination ./exported.csv
To output to a custom file (templated file extension):
$ unihan-tabular --destination ./exported.{ext}
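The `{ext}` placeholder looks like a Python format field. As an assumption about how such a template could be expanded (not a description of the tool's internals), the following sketch fills it in with the chosen output format; `resolve_destination` is a hypothetical name.

```python
def resolve_destination(template, fmt):
    """Fill the {ext} placeholder in a destination template with the format name."""
    return template.format(ext=fmt)


json_path = resolve_destination("./exported.{ext}", "json")
csv_path = resolve_destination("./exported.{ext}", "csv")
```

A template without the placeholder passes through unchanged, so a fixed path such as `./exported.csv` would work with the same function.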
See unihan-tabular CLI arguments for advanced usage examples.
Structure
# output w/ JSON
{XDG data dir}/unihan_tabular/unihan.json
# output w/ CSV
{XDG data dir}/unihan_tabular/unihan.csv
# output w/ YAML (requires pyyaml)
{XDG data dir}/unihan_tabular/unihan.yaml
# script to download + build a SDF csv of unihan.
unihan_tabular/process.py
# unit tests to verify behavior / consistency of builder
tests/*
# Python 2/3 compatibility module
unihan_tabular/_compat.py
# utility / helper functions
unihan_tabular/util.py
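The `{XDG data dir}` placeholder in the output paths above refers to the user data directory. As a rough sketch, assuming a Linux system that follows the XDG Base Directory convention (`$XDG_DATA_HOME`, falling back to `~/.local/share`), the default export location could be computed as below. The function names are hypothetical, and the tool itself may resolve the directory differently, especially on other platforms.

```python
import os


def xdg_data_home(environ=None):
    """Return the XDG data directory, falling back to ~/.local/share."""
    environ = os.environ if environ is None else environ
    value = environ.get("XDG_DATA_HOME", "")
    # Unset, empty, or relative values are ignored in favor of the default.
    if value and os.path.isabs(value):
        return value
    return os.path.join(os.path.expanduser("~"), ".local", "share")


def default_export_path(fmt, environ=None):
    """Build the expected export path for a format such as 'csv' or 'json'."""
    return os.path.join(xdg_data_home(environ), "unihan_tabular", "unihan." + fmt)
```

For example, `default_export_path("csv")` would point at `unihan_tabular/unihan.csv` inside the data directory of the current user.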