Tokyo Stock Exchange, Inc. (TSE) is launching Proof of Concept (PoC) testing for the limited public distribution of corpus data created from timely disclosure documents and other materials. A corpus is a collection of digitized natural-language sentences used for research in natural language processing and, in recent years, especially for machine translation. The PoC testing will provide samples of both a parallel corpus and a monolingual corpus created from these documents, and TSE will use feedback from participants to assess the usability and possible applications of the data. Based on the results of the PoC, TSE will also consider developing a service to distribute the data.
Name | Data Outline |
Timely Disclosure Document Monolingual Corpus (Japanese or English) | Separate Japanese and English corpora, each constructed by mechanically extracting text from disclosure documents and other materials (PDF format) from a specific period |
Timely Disclosure Document Parallel Corpus (Japanese and English) | Japanese-English parallel corpus constructed from the above monolingual corpora of timely disclosure documents |
- The data provided is from 2019 disclosure documents.
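
For readers unfamiliar with the two corpus types, the difference can be sketched in a few lines of Python. The sentences, the tab-separated layout, and the `read_parallel` helper below are illustrative assumptions only; this notice does not specify the format of the data actually distributed.

```python
# Illustrative only: the sentences and the tab-separated layout are assumptions,
# not the actual TSE data or file format.

# A monolingual corpus is a collection of sentences in a single language.
monolingual_en = [
    "The company revised its earnings forecast.",
    "The dividend per share is unchanged.",
]

# A parallel corpus pairs each sentence with its translation.
# Here each line holds a Japanese sentence, a tab, and the English sentence.
parallel_tsv = (
    "当社は業績予想を修正しました。\tThe company revised its earnings forecast.\n"
    "1株当たり配当金に変更はありません。\tThe dividend per share is unchanged.\n"
)


def read_parallel(text):
    """Split tab-separated text into (Japanese, English) sentence pairs."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        ja, en = line.split("\t", 1)
        pairs.append((ja, en))
    return pairs


pairs = read_parallel(parallel_tsv)
for ja, en in pairs:
    print(ja, "=>", en)
```

Aligned pairs of this kind are what machine translation research typically trains and evaluates on, whereas a monolingual corpus supplies text in one language only.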
For those interested in applying to participate in the PoC
Participants in the PoC testing must be trading participants of TSE or Osaka Exchange, Inc., clearing participants of Japan Securities Clearing Corporation, or any other corporation deemed appropriate by TSE.
Prospective participants are required to apply for both the PoC Program for Utilizing Securities Data and this PoC testing program separately.
For information on how to apply, please contact the department below.
Contact
Tokyo Stock Exchange, Inc., Information Services Department, Service Development Group
E-mail: inf_dev@jpx.co.jp