
I am trying to run kNN on a dataset but keep getting an NA error. I have searched Stack Overflow for a solution and could not find anything useful.

This is the dataset I am working with : https://www.kaggle.com/tsiaras/uk-road-safety-accidents-and-vehicles

I have converted every factor and integer variable among my predictors and my target to numeric so that kNN can compute Euclidean distances. I have removed all the NAs, but kNN keeps throwing the following error message:

NAs introduced by coercion
NAs introduced by coercion
Error in knn(train[2:nrow(train), c(11, 22, 23, 25, 27, 28)], test[(2:nrow(test)), : NA/NaN/Inf in foreign function call (arg 6)

This is one example of how I am converting all the predictors and running kNN :

as.numeric(levels(test$Road_Type))[levels(test$Road_Type)]
as.numeric(levels(train$Road_Type))[levels(train$Road_Type)]

train <- na.exclude(train)
test <- na.exclude(test) 

cl=as.numeric(train[2:nrow(train),5])
cl <- na.exclude(cl)
knn0 <- knn(train[2:nrow(train),c(11,22,23,25,27,28)], test[(2:nrow(test)),c(11,22,23,25,27,28)], cl)

I apply the same as.numeric conversion to columns 11, 22, 23, 25, 27 and 28 and to the target. I start at row 2 so that the labels are not included. I have also tried running the following checks before passing the arguments to the kNN function:

sum(is.na(train[2:nrow(train),c(11,22,23,25,27,28)]))
sum(is.na(test[2:nrow(test),c(11,22,23,25,27,28)]))
sum(is.na(cl))

All three of these return 0, so there are no NA values in the data before I pass it to the kNN function.
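A possible reason these checks pass while knn still fails, sketched on made-up data: the error names an argument of a foreign function call, which suggests the inputs are coerced to a numeric matrix on the way in. A data frame with even one character column becomes an all-character matrix, which contains no NA until it is coerced to double. The data frame `toy` and its values are hypothetical, and the exact coercion steps inside knn are an assumption here.

```r
# Toy data (hypothetical values, not taken from the Kaggle file)
toy <- data.frame(
  Number_of_Vehicles = c(1L, 2L, 1L),
  Road_Type = c("Single carriageway", "Dual carriageway", "Single carriageway"),
  stringsAsFactors = FALSE
)

# The is.na() check passes: the strings are present, just not numeric
n_missing <- sum(is.na(toy))

# A data frame with any character column becomes a character matrix
m <- as.matrix(toy)
matrix_type <- typeof(m)

# Coercing that matrix to double is where the NAs (and the warning) appear
coerced <- suppressWarnings(as.double(m))
n_na_after <- sum(is.na(coerced))

# Column classes show which columns still need converting
column_classes <- sapply(toy, class)
```

Checking `sapply(train, class)` on the real data, rather than `is.na()`, would show whether any predictor column is still a character or factor.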

EDIT

Fixed the issue by converting to numeric like this :

train$Road_Type <- as.numeric(as.integer(factor(train$Road_Type)))

Thanks to everyone who helped!

Nahian Afsari

2 Answers


Always inspect your data first. It helps you, and it helps others answer the question.

If we check your data it looks like this:

str(df[, c(11, 22, 23, 25, 27, 28)])
'data.frame':   2047256 obs. of  6 variables:
 $ Junction_Control                 : chr  "Data missing or out of range" "Auto traffic signal" "Data missing or out of range" "Data missing or out of range" ...
 $ Number_of_Vehicles               : int  1 1 2 1 1 2 2 1 2 2 ...
 $ Pedestrian_Crossing.Human_Control: int  0 0 0 0 0 0 0 0 0 0 ...
 $ Police_Force                     : chr  "Metropolitan Police" "Metropolitan Police" "Metropolitan Police" "Metropolitan Police" ...
 $ Road_Type                        : chr  "Single carriageway" "Dual carriageway" "Single carriageway" "Single carriageway" ...
 $ Special_Conditions_at_Site       : chr  "None" "None" "None" "None" ...

What happens if we transform a character to numeric:

df$Police_Force <- as.numeric(df$Police_Force)

df$Police_Force
[1] NA NA NA NA NA NA NA ....
Warning message:
  NAs introduced by coercion

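This does not work: as.numeric() can only parse strings that look like numbers, so text labels become NA. However, if we first convert the column to a factor and then to numeric, we get the factor's integer codes and the problem is solved.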

df$Police_Force <- as.numeric(as.factor(df$Police_Force))

df$Police_Force
[1] 30 30 30 30 30 30 30 ...
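One caveat with this approach, sketched on made-up values: as.factor() builds its codes from the sorted labels present in the vector, so converting train and test separately can give the same label different codes. Fixing one shared set of levels avoids that. The vectors `train_rt` and `test_rt` are hypothetical stand-ins for the Road_Type columns.

```r
# Hypothetical label vectors standing in for train$Road_Type and test$Road_Type
train_rt <- c("Single carriageway", "Dual carriageway", "Roundabout")
test_rt  <- c("Single carriageway", "Roundabout")

# Converted separately, "Single carriageway" is coded 3 in train but 2 in test
sep_train <- as.numeric(as.factor(train_rt))
sep_test  <- as.numeric(as.factor(test_rt))

# Converted with one shared set of levels, the codes agree across both sets
lv <- sort(unique(c(train_rt, test_rt)))
shared_train <- as.numeric(factor(train_rt, levels = lv))
shared_test  <- as.numeric(factor(test_rt,  levels = lv))
```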

Your approach does not work because the variables are not factors but characters.

levels(df$Road_Type)
NULL

as.numeric(levels(df$Road_Type))[levels(df$Road_Type)]
numeric(0)
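For comparison, the usual form of this idiom is `as.numeric(levels(f))[f]`, indexed by the factor itself rather than by its levels, and it only applies when the levels are number-like strings. The sketch below uses hypothetical factors `f_num` and `f_lab` to show the difference.

```r
# Hypothetical factor whose levels are number-like strings
f_num <- factor(c("10", "30", "10", "20"))

# Indexing by the factor itself recovers the original numbers, one per element
vals <- as.numeric(levels(f_num))[f_num]

# Indexing by levels(f_num) looks the level strings up as element names;
# the vector is unnamed, so this gives one NA per level, not one value per row
by_levels <- as.numeric(levels(f_num))[levels(f_num)]

# Hypothetical factor with text labels: the levels are not numbers,
# so only the integer codes are available
f_lab <- factor(c("Single carriageway", "Dual carriageway", "Single carriageway"))
codes <- as.integer(f_lab)
```

Note that the `by_levels` form returns a vector as long as the number of levels, which is generally shorter than the column and so cannot be assigned back to it.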

As you have not shown how your data looks after being imported into R, I might be wrong. I used the read.csv function.

Dimitri Graf
  • I actually did convert to factors before doing as.numeric(). I also used read.csv to import the data. As another person suggested, I actually did try assigning the as.numeric() function to the respective column in the dataframe but I get something like this : Error in `$<-.data.frame`(`*tmp*`, Road_Type, value = c(NA_real_, : replacement has 16 rows, data has 56420 – Nahian Afsari Apr 06 '19 at 01:25
  • Can you share your full code? This would help to find the error. Otherwise, there will be a lot of guessing involved. – Dimitri Graf Apr 06 '19 at 21:30

Are you sure you have converted your data to numeric? as.numeric() does not modify its argument in place; you have to assign its result, as you did with cl.
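A minimal sketch of assigning the converted values back, for several columns at once. The data frame `df`, its values, and the column list `cat_cols` are made up for illustration, and the columns are assumed to be character.

```r
# Hypothetical data frame with two text columns and one integer column
df <- data.frame(
  Road_Type = c("Single carriageway", "Dual carriageway", "Single carriageway"),
  Police_Force = c("Metropolitan Police", "Metropolitan Police", "Cumbria"),
  Number_of_Vehicles = c(1L, 2L, 1L),
  stringsAsFactors = FALSE
)

# Calling the conversion without assigning it leaves df unchanged
as.numeric(as.factor(df$Road_Type))
unchanged <- class(df$Road_Type)

# Assigning the result back replaces each listed column with its integer codes
cat_cols <- c("Road_Type", "Police_Force")
df[cat_cols] <- lapply(df[cat_cols], function(x) as.numeric(as.factor(x)))

# Every column is now numeric
all_numeric <- all(sapply(df, is.numeric))
```

Because each converted vector is computed from the full column, it has one value per row and can be assigned back without a length mismatch.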

  • I actually did try assigning the as.numeric() function to the respective column in the dataframe but I get something like this : Error in $<-.data.frame(*tmp*, Road_Type, value = c(NA_real_, : replacement has 16 rows, data has 56420 – Nahian Afsari Apr 06 '19 at 01:26