# Zero-EPWING
*Note: this project is no longer maintained. Please see [this
post](https://foosoft.net/posts/sunsetting-the-yomichan-project/) for more information.*

Zero-EPWING is a tool built to export easy-to-process, JSON-formatted UTF-8 data from dictionaries in
[EPWING](https://ja.wikipedia.org/wiki/EPWING) format. This is a terrible format for many reasons, some of which are
outlined below:
* It is based on a closed and undocumented standard.
* It is not well supported, as it isn't used anywhere outside of Japan.
* The only library for parsing this format, `libeb`, is abandoned.
* Data is stored in an inconsistent manner, with lots of duplication.
* Text data is represented using the annoying EUC-JP encoding.
* Characters which cannot be encoded are represented by image bitmaps.

Most applications that parse EPWING data traditionally use `libeb` to perform dictionary searches in place; dealing with
quirks in the format and `libeb` output is just part of the process. Zero-EPWING takes a different approach: extract
all the data and output it in a sane intermediate format, like JSON. As everyone knows how to parse JSON, it is trivial to
take this intermediate data and store it in a reasonable, industry-standard representation.

![](img/zero-wing.png)
## Building
Prepare your development environment by making sure the following tools are set up:

* [Autotools](https://www.gnu.org/software/automake/manual/html_node/Autotools-Introduction.html)
* [CMake](https://cmake.org/)
* [GCC](https://gcc.gnu.org/)
* [Make](https://www.gnu.org/software/make/)
* [MinGW](http://www.mingw.org/) (Windows only)

Once your system is configured, follow the steps below to create builds:
1. Clone the repository by executing
   ```
   git clone --recurse-submodules https://git.foosoft.net/alex/zero-epwing
   ```
2. Configure and build the project. From the project root directory, execute
   ```
   cmake . -Bbuild && cmake --build build --
   ```
3. Find the executable in the `build` directory.
## Usage
Zero-EPWING takes a single parameter, the directory of the EPWING dictionary to dump. It also supports the following
optional flags:

* `--entries` (`-e`): output dictionary entry data (most common option).
* `--fonts` (`-f`): output font bitmap data (useful for OCR).
* `--markup` (`-m`): mark up the output with as much metadata as possible.
* `--positions` (`-s`): output *page* and *offset* data for each entry.
* `--pretty` (`-p`): output pretty-printed JSON (useful for debugging).

Upon loading and processing the requested EPWING data, Zero-EPWING will output a UTF-8 encoded JSON document to `stdout`.
Diagnostic information about errors will be printed to `stderr`. Serious errors will result in the application
returning a non-zero exit code. A sample of the JSON dictionary entry data output is pretty-printed below for reference.
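For scripted use, the invocation described above could be wrapped roughly as follows. This is a minimal Python sketch, not part of the project: the binary path `build/zero-epwing`, the helper names `build_command` and `dump_dictionary`, and the choice to place flags before the dictionary directory are all assumptions. Only the long-form flags listed above are used.

```python
import json
import subprocess

# Assumption: location of the executable produced by the build steps.
ZERO_EPWING = "build/zero-epwing"


def build_command(dictionary_dir, entries=True, markup=False, positions=False, pretty=False):
    """Assemble an argument list from the documented long-form flags."""
    command = [ZERO_EPWING]
    if entries:
        command.append("--entries")
    if markup:
        command.append("--markup")
    if positions:
        command.append("--positions")
    if pretty:
        command.append("--pretty")
    # The dictionary directory is the single required parameter.
    command.append(dictionary_dir)
    return command


def dump_dictionary(dictionary_dir, **flags):
    """Run the tool and return the parsed JSON it writes to stdout."""
    result = subprocess.run(build_command(dictionary_dir, **flags), capture_output=True)
    if result.returncode != 0:
        # Diagnostics go to stderr; a non-zero exit code signals a serious error.
        raise RuntimeError(result.stderr.decode("utf-8", errors="replace"))
    return json.loads(result.stdout.decode("utf-8"))
```

Calling `dump_dictionary("path/to/dictionary")` would then return the decoded document as ordinary Python dictionaries and lists, or raise with the contents of `stderr` if the tool reports a serious error.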
```json
{
    "charCode": "jisx0208",
    "discCode": "epwing",
    "subbooks": [
        {
            "title": "大辞泉",
            "copyright": "CD-ROM版大辞泉 1997年4月10日 第1版発行\n\n監 修 松村 明\n発行者 鈴木俊彦\n発行所...",
            "entries": [
                {
                    "heading": "亜",
                    "text": "亜\n[音]ア\n[訓]つ‐ぐ\n[部首]二\n[総画数]7\n[コード]区点..."
                },
                {
                    "heading": "あ",
                    "text": "あ\n{{w_50275}}\n{{w_50035}}五十音図ア行の第一音。五母音の一。後舌の開母音..."
                }
            ]
        }
    ]
}
```
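Output shaped like the sample above can be walked with any JSON library. The sketch below uses Python and a cut-down copy of the sample; the helper name `iter_entries` is hypothetical, and the use of `.get` reflects an assumption that some keys may be absent depending on which flags were passed.

```python
import json

# A cut-down document with the same shape as the sample above.
SAMPLE = """
{
  "subbooks": [
    {
      "title": "大辞泉",
      "entries": [
        {"heading": "亜", "text": "亜\\n[音]ア"},
        {"heading": "あ", "text": "あ\\n{{w_50275}}"}
      ]
    }
  ]
}
"""


def iter_entries(document):
    """Yield (subbook title, heading, text) for every entry in the document."""
    for subbook in document.get("subbooks", []):
        for entry in subbook.get("entries", []):
            yield subbook.get("title"), entry.get("heading"), entry.get("text")


rows = list(iter_entries(json.loads(SAMPLE)))
for title, heading, text in rows:
    # Print the first line of each entry's text next to its heading.
    print(title, heading, text.splitlines()[0])
```

From here the rows can be written to whatever storage suits the application, such as a database table keyed on the heading.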
You may have noticed the unusual-looking double curly bracket markers such as `{{w_50035}}`. Remember what I mentioned
about certain characters being represented by image bitmaps? There are two graphical font sets in each dictionary: narrow
and wide. Both of these font sets are available in four sizes: 24, 30, 36, and 48 pixels. Whenever a character cannot be
encoded as text, a glyph is used in its place. These font indices cannot be converted directly to characters, differ
from one dictionary to another, and must be manually mapped to Unicode character tables. Zero-EPWING has no facility to
map these font glyphs to Unicode by itself, and instead places inline markers in the form of `{{w_xxxx}}` and
`{{n_xxxx}}` in the output, specifying the referenced indices of the wide or narrow fonts respectively.

The bitmaps for these font glyphs can be dumped by executing this application with the `--fonts` command line argument.
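Once a mapping table has been built by hand for a given dictionary, substituting the markers is a simple text replacement. The following Python sketch illustrates one way to do it; the helper names and the table contents are hypothetical, and the `〓` value is only a placeholder, not the real character behind that glyph. Indices are kept as strings so that no assumption is made about how they are numbered.

```python
import re

# Matches markers such as {{w_50035}} (wide font) or {{n_1234}} (narrow font).
MARKER = re.compile(r"\{\{([wn]_[0-9A-Za-z]+)\}\}")


def replace_glyphs(text, table):
    """Swap known markers for their mapped characters; leave unknown ones intact."""
    return MARKER.sub(lambda match: table.get(match.group(1), match.group(0)), text)


def unmapped_markers(text, table):
    """List the marker names in the text that the table does not cover yet."""
    return sorted(set(MARKER.findall(text)) - set(table))


# Placeholder mapping for illustration only; real tables are dictionary-specific.
table = {"w_50035": "〓"}
text = "あ\n{{w_50275}}\n{{w_50035}}五十音図"

converted = replace_glyphs(text, table)
missing = unmapped_markers(text, table)
print(converted)
print(missing)
```

Leaving unknown markers untouched keeps the conversion lossless, so the table can be extended incrementally while `unmapped_markers` reports what still needs to be identified from the dumped bitmaps.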