Text processing and tutorial video for uploading text dataset


#1

Is there a tutorial video on how to process text before uploading it, and how to put it in a CSV file? I only found one tutorial video, but it covers images, not text.


#2

We are working on a video example of how to use DLS with a text dataset.

@vishal, please share the video link here once it is ready.


#4

Here is the video example for uploading the Text Dataset on DLS:

Please check the reference script below:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import csv

# Read one review per line; the with-block closes the file afterwards
with open("reviews.txt", "r") as text_file:
    lines = text_file.readlines()

maxlen = 100  # Pad or truncate every review to 100 tokens
training_samples = 200  # Intended number of training samples (not used in this script)
validation_samples = 10000  # Intended number of validation samples (not used in this script)
max_words = 10000  # Only consider the 10,000 most frequent words in the dataset

# Build the word index and convert each review to a list of integer ids
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)

# Pad or truncate so that every sequence has exactly maxlen entries
sameLengthSequences = pad_sequences(sequences, maxlen=maxlen)

# Join each row into a single semicolon-separated string
sequencesToStrings = [';'.join(str(col) for col in row) for row in sameLengthSequences]

csvfile = "processed.csv"

# Write one sequence per row
with open(csvfile, "w", newline="") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in sequencesToStrings:
        writer.writerow([val])
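
To see what the script writes, here is a minimal sketch of the row format using only the standard library. The `padded` values are made up for illustration and stand in for the `pad_sequences` output; an in-memory buffer stands in for processed.csv.

```python
import csv
import io

# Hypothetical padded sequences (stand-in for the pad_sequences output).
padded = [[0, 0, 12, 7], [3, 45, 9, 1]]

buffer = io.StringIO()  # in-memory stand-in for processed.csv
writer = csv.writer(buffer, lineterminator="\n")
for row in padded:
    # Each CSV row holds one column: the ids joined with ';'
    writer.writerow([";".join(str(col) for col in row)])

print(buffer.getvalue())
# 0;0;12;7
# 3;45;9;1

# Read the rows back and split on ';' to recover the integer ids.
buffer.seek(0)
restored = [[int(tok) for tok in line[0].split(";")] for line in csv.reader(buffer)]
assert restored == padded
```

Each review ends up as one CSV cell, so the semicolon is only a separator inside that cell and does not create extra columns.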

#5

Do we need to train the neural network every time we get new data (new sentences) in the dataset?
I guess Keras tokenizes the text differently each time, so the same words end up with different tokens.
Is there a way to train the neural network only once on some training data, and then, when new data arrives, only tokenize that new data and feed it to the network to get results?


#6

As long as you don’t change your model configuration, you can continue training on the new data by using the saved weights in the training tab.


#7

I have some labeled sentences. Suppose I processed them this way and trained the NN with them.
If I later get unlabeled sentences and need predictions for them, do I need to tokenize all the data (old + new) and train the NN again on the labeled data, or can I tokenize only the unlabeled sentences, feed them into the NN, and expect it to work fine?


#8

No, you don’t have to train your NN again in this case.
To get predictions from the trained model, you can either run inference by uploading the unlabeled data in the inference tab and selecting the trained run, or deploy the trained model and predict the output for a single tokenized sentence.
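
One caveat when tokenizing new sentences: the integer ids depend on the text the tokenizer was fitted on, so new data should be encoded with the word index built from the training data rather than a freshly fitted one. The sketch below illustrates the idea in plain Python. `fit_word_index` and `encode` are hypothetical stand-ins, not the Keras API, and the left-padding and dropping of unknown words are assumptions chosen to resemble the script above.

```python
import json
import os
import re
import tempfile

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def fit_word_index(texts, max_words):
    """Assign ids 1..max_words to the most frequent words (0 is kept for padding)."""
    counts = {}
    for text in texts:
        for word in tokenize(text):
            counts[word] = counts.get(word, 0) + 1
    # Most frequent first; ties are broken alphabetically to stay deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ranked[:max_words])}

def encode(text, word_index, maxlen):
    """Map words to ids, drop unknown words, then left-pad or truncate to maxlen."""
    ids = [word_index[w] for w in tokenize(text) if w in word_index]
    ids = ids[-maxlen:]  # keep the last maxlen tokens
    return [0] * (maxlen - len(ids)) + ids

# Fit once on the training sentences and save the word index.
word_index = fit_word_index(["the movie was great", "the plot was dull"], max_words=10000)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "word_index.json")
    with open(path, "w") as f:
        json.dump(word_index, f)
    # Later: load the saved index and encode a new sentence without refitting.
    with open(path) as f:
        loaded = json.load(f)

row = encode("the movie was dull", loaded, maxlen=6)
print(";".join(str(i) for i in row))  # 0;0;1;5;2;3
```

Because the saved index is reused, "movie" gets the same id at prediction time as it did during training, which is what the trained network expects.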

Regards
Rajat