About
This resource contains the source code for the paper CON-S2V: A Generic Framework for Incorporating Extra-Sentential Context into Sen2Vec, which learns latent representations for sentences.
Source code
Requirements
Python Environment setup and Update
- Copy the sen2vec_environment.yml file into the anaconda/envs folder.
- Change into the anaconda/envs folder.
Run the following command:
conda env create -f sen2vec_environment.yml
The sen2vec environment is now installed. Activate it using the following command:
source activate sen2vec
If you have added more packages to the environment, you can update the .yml file using the following command:
conda env export > sen2vec_environment.yml
ROUGE Environment setup
Please go to the ROUGE directory and run the following command to check whether the provided perl script will work or not:
./ROUGE-1.5.5.pl
If it shows the options for running the script, the setup is fine. However, if it reports that XML::DOM is not installed, run the following command to install it:
cpan XML::DOM
Here, CPAN stands for Comprehensive Perl Archive Network.
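The XML::DOM check can also be done without running the ROUGE script. The following is a minimal Python sketch, assuming perl is on the PATH; it relies on the standard perl -M<module> -e 1 idiom, which exits with a non-zero status when the module cannot be loaded. The function name perl_module_available is made up for this example and is not part of the project.

```python
import shutil
import subprocess


def perl_module_available(module: str = "XML::DOM") -> bool:
    """Return True if `perl -M<module> -e 1` exits successfully."""
    perl = shutil.which("perl")
    if perl is None:
        # Perl itself is missing, so the module cannot be loaded either.
        return False
    result = subprocess.run(
        [perl, f"-M{module}", "-e", "1"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


print("XML::DOM available:", perl_module_available())
```

If this prints False, installing the module with cpan XML::DOM as described above should resolve it.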
Database Creation and update
If you have already installed postgresql, you can create a database for the newsgroup [news] dataset with the following command:
psql -c "create database news"
After creating the database, use pg_restore to create the schema, which is agnostic to the dataset:
pg_restore --jobs=3 --exit-on-error --no-owner --dbname=news sql_dump.dump
or
pg_restore --jobs=3 -n public --exit-on-error --no-owner --dbname=news sql_dump.dump
We assume that you are using either postgres as the username or another username that already has all the required privileges. To change the password for the postgres user, use the following commands:
psql -h localhost -d news -U postgres -w
\password
If you have made any changes to the database, you can update the dump file (schema only) using the following command. You may need to set up peer authentication first.
sudo -u postgres pg_dump -s --no-owner -Fc news > sql_dump.dump
To dump the data of a particular table from the database:
sudo -u postgres pg_dump --data-only -t summary news --no-owner -Fc > news_summary.dump
Setting Environment Variables
Set the dataset folder path and the connection string in the environment.sh file, then run the following command:
source environment.sh #Unix, os-x
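Because the project depends on these variables being set, it can help to fail early when one is missing. The following is a minimal Python sketch of such a check. The variable names DATASET_FOLDER and DB_CONNECTION_STRING are hypothetical placeholders, as is the helper read_settings; replace the names with whatever environment.sh actually exports.

```python
import os


def read_settings(names, environ=None):
    """Return {name: value} for each required variable, or raise if any is unset."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Unset environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}


# Hypothetical names; replace them with the variables exported by environment.sh.
REQUIRED = ["DATASET_FOLDER", "DB_CONNECTION_STRING"]

# Demonstration with an explicit mapping instead of the real process environment.
example = {
    "DATASET_FOLDER": "/data/news",
    "DB_CONNECTION_STRING": "dbname=news user=postgres",
}
print(read_settings(REQUIRED, example))
```

Calling read_settings(REQUIRED) without the second argument checks the real process environment, which is only populated if environment.sh was sourced in the same shell session.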
Creating Executable for Word2Vec (Mikolov’s Implementation)
Please go to the word2vec code directory inside the project and run the following commands to build the executable:
make clean
make
Installation of Theano for Skip-Thought
pip install theano
sudo apt install nvidia-cuda-toolkit
Installation of Keras (Sequential API)
pip install keras
To change the backend to theano, please change the default configuration in ~/.keras/keras.json:
{
"image_dim_ordering": "tf",
"epsilon": 1e-07,
"floatx": "float32",
"backend": "theano"
}
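The same change can be scripted instead of edited by hand. The following is a minimal Python sketch, assuming the file is plain JSON as shown above. The helper name set_keras_backend is made up for illustration and is not part of Keras; it only rewrites the "backend" key and keeps every other key as it was.

```python
import json
from pathlib import Path


def set_keras_backend(config_path, backend="theano"):
    """Set the "backend" key in a Keras JSON config, keeping other keys unchanged."""
    path = Path(config_path).expanduser()
    # Start from the existing configuration if the file is already there.
    config = json.loads(path.read_text()) if path.exists() else {}
    config["backend"] = backend
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=4) + "\n")
    return config
```

For example, set_keras_backend("~/.keras/keras.json") would switch the default configuration to theano. Back up the file first if it contains settings you care about.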
Downloading the C-PHRASE vectors:
Please download the C-PHRASE vectors and join the split archive using the following commands. Unzip cphrase.txt.zip before running sed, which prepends the header line required by the word2vec format:
cat cphrase.txt.zip_* > cphrase.txt.zip
sed -i '1 i\174814 300' cphrase.txt # converting into word2vec format
Downloading GloVe Pretrained Vectors for SDAE:
Please download the GloVe vectors and then prepend a header line (vocabulary size and dimension) to the file using the following command:
sed -i '1 i\400000 300' glove.6B.300d.txt
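Both sed commands above hardcode the vocabulary size and dimension. As an alternative, the following minimal Python sketch computes the header from the file itself. It assumes one vector per line with whitespace-separated fields and no whitespace inside tokens; the function name prepend_word2vec_header is made up for this example.

```python
import os
import tempfile


def prepend_word2vec_header(path):
    """Prepend a '<rows> <dims>' line to a headerless text vector file."""
    rows, dims = 0, 0
    # First pass: count the vectors and read the dimension from the first one.
    with open(path, encoding="utf-8") as src:
        for line in src:
            fields = line.split()
            if not fields:
                continue  # blank lines are not counted as vectors
            if rows == 0:
                dims = len(fields) - 1  # the first field is the token
            rows += 1
    # Second pass: write the header plus the original content to a temp file,
    # then swap it in so a failure cannot leave a half-written vector file.
    directory = os.path.dirname(os.path.abspath(path))
    with tempfile.NamedTemporaryFile(
        "w", encoding="utf-8", dir=directory, delete=False
    ) as dst:
        dst.write(f"{rows} {dims}\n")
        with open(path, encoding="utf-8") as src:
            for line in src:
                dst.write(line)
        tmp_name = dst.name
    os.replace(tmp_name, path)
    return rows, dims
```

Run it only once per file: the function does not detect an existing header, so a second call would add another line. Use either this or the sed command, not both.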
Running the Project
Run sen2vec with -h argument to see all possible options:
python sen2vec -h
usage: sen2vec [-h] -dataset DATASET -ld LD
Sen2Vec
optional arguments:
-h, --help show this help message and exit
-dataset DATASET, --dataset DATASET
Please enter dataset to work on [reuter, news]
-ld LD, --ld LD Load into Database [0, 1]
-pd PD, --pd PD Prepare Data [0, 1]
-rbase RBASE, --rbase RBASE Run the Baselines [0, 1]
-gs GS, --gs GS Generate Summary [0, 1]
For example, you can run the full pipeline on the news dataset using the following command:
python sen2vec -dataset news -ld 1 -pd 1 -rbase 1 -gs 1
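To make the flag semantics concrete, here is a minimal argparse sketch reconstructed only from the help text above. It is not the project's actual parser: treating -dataset and -ld as required follows the usage line, while the integer type, the 0/1 restriction, and the default of 0 for the remaining flags are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the options listed in the help output."""
    parser = argparse.ArgumentParser(prog="sen2vec", description="Sen2Vec")
    parser.add_argument(
        "-dataset", "--dataset", required=True,
        help="Please enter dataset to work on [reuter, news]",
    )
    parser.add_argument(
        "-ld", "--ld", type=int, choices=[0, 1], required=True,
        help="Load into Database [0, 1]",
    )
    # The flags below are optional in the usage line; default=0 is an assumption.
    for flag, text in [
        ("pd", "Prepare Data [0, 1]"),
        ("rbase", "Run the Baselines [0, 1]"),
        ("gs", "Generate Summary [0, 1]"),
    ]:
        parser.add_argument(
            f"-{flag}", f"--{flag}", type=int, choices=[0, 1], default=0, help=text
        )
    return parser


# Parse the example command line from above without touching sys.argv.
args = build_parser().parse_args(
    "-dataset news -ld 1 -pd 1 -rbase 1 -gs 1".split()
)
print(args.dataset, args.ld, args.pd, args.rbase, args.gs)
```

Under these assumptions each stage flag acts as an on/off switch, so the example command enables every stage for the news dataset.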
Publications
Tanay Saha, Shafiq Joty, Naeemul Hassan, and Mohammad Hasan. Regularized and Retrofitted models for Learning Sentence Representation with Context. In Proceedings of the 26th ACM International Conference on Information and Knowledge Management (CIKM’17), pages xx-xx, 2017.
@InProceedings{saha-joty-hassan-hasan-cikm-17,
author = {Tanay Saha and Shafiq Joty and Naeemul Hassan and Mohammad Hasan},
title = {Regularized and Retrofitted models for Learning Sentence Representation with Context},
booktitle = {Proceedings of the 26th ACM International Conference on Information and Knowledge Management},
month = {November},
year = {2017},
series = {CIKM'17},
address = {Singapore},
publisher = {ACM},
pages = {xx--xx},
url = {papers/saha-joty-hassan-hasan-cikm-17.pdf},
}
Tanay Kumar Saha, Shafiq Joty, and Mohammad Al Hasan. CON-S2V: A Generic Framework for Incorporating Extra-Sentential Context into Sen2Vec. In Proceedings of the European Conference on Machine Learning (ECML-PKDD-2017), Macedonia, Skopje, 2017.
@InProceedings{saha-joty-hasan-ecml-17,
author = {Tanay Saha and Shafiq Joty and Mohammad Hasan},
title = {CON-S2V: A Generic Framework for Incorporating Extra-Sentential Context into Sen2Vec},
booktitle = {Proceedings of the European Conference on Machine Learning \&
Principles and Practice of Knowledge Discovery in Databases},
month = {September},
year = {2017},
series = {ECML-PKDD'17},
address = {Macedonia, Skopje},
publisher = {Springer},
pages = {xx--xx},
url = {papers/saha-joty-hasan-ecml-17.pdf}