Classification for Text Message Without Image

Dear all,

I have a CSV file containing text expressions and 5 class types, and I want to predict which class each text belongs to, as shown in the attached image:

The required prediction is a probability between 0 and 1 for each class.
I need to know:

1- For the comment_text column, what input type should I use in this case (Text is not available as a type)?
2- The CSV file is less than 1 GB, but when I upload it as a zip file I get many errors during the upload and I don’t know why. Could you give me careful guidance on how to upload the CSV as input data?
3- For the output features (the classes), should I select Number as the type?
4- Should I use the ID as an input? If yes, what is the input type? Or can I ignore this field?

  1. You need to do some preprocessing before you can upload your dataset.
  • Convert the text in each row to a sequence of integers. Here is a link to help you with that:
    Tokenizer in Keras

  • Pad these integer sequences so that every row has the same length.

  • Convert each sequence into a string, i.e. [1,2,3,0,9] should become 1;2;3;0;9

  • You can now upload your dataset. Choose the type Array for this column.

  2. I think your CSV file may contain characters that cause the upload to fail. Once you have done the preprocessing above, it should work.

  3. You can encode all the classes into one column like 0;0;0;0;0;0 and choose the Array type in DLS.

  4. You can ignore the ID field.
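
The preprocessing steps above can be sketched in plain Python without any library. This is only an illustration under stated assumptions: build_vocab, encode and encode_labels are hypothetical helper names, words are split on whitespace only, and padding is added at the end of each row (the Keras pad_sequences function pads at the start by default). It is a simplified stand-in for the Keras Tokenizer, not a replacement for it.

```python
from collections import Counter


def build_vocab(texts, max_words=10000):
    """Map each word to an integer, most frequent word first.

    Index 0 is reserved for padding, so word indices start at 1.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = counts.most_common(max_words - 1)
    return {word: i + 1 for i, (word, _) in enumerate(most_common)}


def encode(text, vocab, maxlen):
    """Turn one text into a fixed-length, semicolon-separated string."""
    ids = [vocab[w] for w in text.lower().split() if w in vocab]
    ids = ids[:maxlen]                     # truncate long rows
    ids = ids + [0] * (maxlen - len(ids))  # pad short rows with zeros
    return ";".join(str(i) for i in ids)


def encode_labels(flags):
    """Join the per-class 0/1 flags of one row into a single Array-style cell."""
    return ";".join(str(int(f)) for f in flags)


# Made-up sample data, only for illustration
texts = ["good movie", "bad movie really bad"]
vocab = build_vocab(texts)
rows = [encode(t, vocab, maxlen=5) for t in texts]
labels = encode_labels([0, 1, 0, 0, 1])  # one flag per class, 5 classes assumed
```

Each entry of rows is then one cell of the input column, and labels is one cell of the output column; both would be uploaded with the Array type.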

For tokenization, I’m not familiar with Python, but I can do it in R. Do you have any resource showing the same code in R (Keras)?

Can I send you the file by email so you can find out exactly what causes the error?

Yes, you can send us the train.csv.

Please send me your email address so I can share it with you.

Do you have a reference for R?

I’m not familiar with Python.

Here is a video example of uploading a text dataset to DLS:

Please check the reference script below:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import csv

# Read the raw text, one review per line
with open("reviews.txt", "r") as text_file:
    lines = text_file.readlines()

maxlen = 100  # Every review is padded or truncated to 100 tokens
training_samples = 200  # We will be training on 200 samples (not used in this script)
validation_samples = 10000  # We will be validating on 10000 samples (not used in this script)
max_words = 10000  # We will only consider the top 10,000 words in the dataset

# Build the vocabulary and convert each review to a sequence of integers
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)

# By default pad_sequences pads and truncates at the start of a sequence,
# so reviews longer than maxlen keep their last maxlen tokens
sameLengthSequences = pad_sequences(sequences, maxlen=maxlen)

# Join each row into a semicolon-separated string, e.g. 1;2;3;0;9
sequencesToStrings = []
for row in sameLengthSequences:
    sequencesToStrings.append(';'.join(str(col) for col in row))

csvfile = "processed.csv"

# Write one encoded review per line
with open(csvfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in sequencesToStrings:
        writer.writerow([val])
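
Before uploading, it may help to check that every row of the processed file has the same number of integer values, since uneven rows are a likely source of upload errors. The following is a minimal sketch of such a check; check_rows is a hypothetical helper name, and the sample text stands in for the contents of processed.csv.

```python
import csv
import io


def check_rows(csv_text, expected_len):
    """Return a list of (row_number, problem) pairs for malformed rows."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    for n, row in enumerate(reader, start=1):
        if len(row) != 1:
            problems.append((n, "expected exactly one column"))
            continue
        parts = row[0].split(";")
        if len(parts) != expected_len:
            problems.append((n, f"has {len(parts)} values"))
        elif not all(p.isdigit() for p in parts):
            problems.append((n, "non-integer value"))
    return problems


# Made-up sample: rows 1 and 2 are valid, rows 3 and 4 are not
sample = "0;0;3;1\n0;2;1;4\n0;5;x;1\n7;8\n"
problems = check_rows(sample, expected_len=4)
print(problems)
```

For a real file, the text could be read with open("processed.csv").read() and expected_len set to the maxlen used in the script above.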