Find rows that have closest columns' values to a specific row in a data.frame

Question

Imagine we have one row in the data below as our reference (row # 116).

How can I find any other rows in this data whose column values are the same as, or closest to, the column values of this reference row? (If a column's value is numeric, let's say up to +/- 3 is an acceptable match.)

For example, if the column value for variable prof in the reference row is beginner, we want to find another row whose value for prof is also beginner.

Or if the column value for variable study_length in the reference row is 5, we want to find another row whose value for study_length is also 5 +/- 3 and so on.
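As a minimal sketch of the two matching rules just described, assuming made-up values rather than anything taken from the actual data: a categorical value must be equal to the reference, and a numeric value must lie within the tolerance of the reference.

```r
# Hypothetical toy values, not taken from wcf.csv
prof         <- c("beginner", "advanced", "beginner")
study_length <- c(5, 7, 9)

# Categorical rule: exact match with the reference value "beginner"
prof == "beginner"            # TRUE FALSE TRUE

# Numeric rule: within +/- 3 of the reference value 5
abs(study_length - 5) <= 3    # TRUE TRUE FALSE
```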

Is it possible to set up a function to do this in R?

data <- read.csv("https://raw.githubusercontent.com/hkil/m/master/wcf.csv")[-c(2:6,12,17)]

reference <- data[116,]

############################# YOUR POSSIBLE ANSWER:

foo <- function(x = data, reference_row = 116, tolerance_for_numerics = 3) {

# your solution


}

# Example of use:

foo()

In addition to the data.table approach you can also install the `fuzzyjoin` package. And do this search: https://stackoverflow.com/search?q=%5Br%5D+closest+column — IRTFM, Aug 13 '22 at 05:37

Rui Barradas · Accepted Answer · 2022-08-13T06:32:58.183

Here is a solution.

foo <- function(x = data, reference_row = 116, tolerance_for_numerics = 3) {
  # which columns are numeric
  i <- sapply(x, is.numeric)
  reference <- x[reference_row, ]
  # numeric columns must lie within the reference value +/- the tolerance
  num <- mapply(\(y, ref, tol) {
    y >= ref - tol & y <= ref + tol
  }, x[i], reference[i], MoreArgs = list(tol = tolerance_for_numerics))
  # other columns must match exactly (?)
  other <- mapply(\(y, ref) {
    y == ref
  }, x[!i], reference[!i])
  # keep the rows where every column matches the reference
  which(rowSums(cbind(other, num)) == ncol(x))
}

data <- read.csv("https://raw.githubusercontent.com/hkil/m/master/wcf.csv")[-c(2:6,12,17)]

# Example of use:
foo()
#> [1] 112 114 116

Created on 2022-08-13 by the reprex package (v2.0.1)
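For readers who want to check the idea without downloading the CSV, here is a small self-contained variant in base R. It is only a sketch under stated assumptions: the data frame `toy`, its column names, and the helper name `match_rows` are all hypothetical, and it combines the per-column tests with `Reduce()` rather than the `rowSums()` comparison used above, treating `NA` comparisons as non-matches.

```r
# Hypothetical toy data; column names are illustrative only
toy <- data.frame(
  prof         = c("beginner", "beginner", "advanced", "beginner"),
  study_length = c(5, 7, 6, 12),
  age          = c(20, 22, 21, 20),
  stringsAsFactors = FALSE
)

# match_rows() is a hypothetical helper, not part of any package
match_rows <- function(x, reference_row, tol = 3) {
  is_num <- vapply(x, is.numeric, logical(1))
  ref <- x[reference_row, , drop = FALSE]
  # one logical vector per column: TRUE where the row agrees with the reference
  hits <- Map(function(col, r, num) {
    if (num) abs(col - r) <= tol else col == r
  }, x, ref, is_num)
  # keep rows where every column agrees; NA comparisons count as non-matches
  keep <- Reduce(`&`, lapply(hits, function(h) !is.na(h) & h))
  which(keep)
}

match_rows(toy, reference_row = 1)
#> [1] 1 2
```

Row 3 is dropped because `prof` differs, and row 4 because `study_length` is 7 away from the reference value. The reference row always matches itself, so drop it from the result if only the other rows are wanted.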

edited Aug 13 '22 at 06:32

answered Aug 13 '22 at 05:38

Rui Barradas


I can reopen if you so desire. – IRTFM Aug 13 '22 at 05:46
@AnilGoyal : I think it's basically a duplicate of many questions some with accepted answers and I know the fuzzyjoin package has functions that do all of the requested operations. So I don't think the questioner did sufficient searching. – IRTFM Aug 13 '22 at 05:58
@RuiBarradas, `foo()` does not work when the data has 1+ numeric columns! See `foo(x=starwars[, 2:6], reference_row = 5, tolerance_for_numerics = 3)` – AnilGoyal Aug 13 '22 at 06:01
@IRTFM, Ok thanks, I will have a look there again. At first I found some differences in requirements mentioned therefore I voted to re-open. Thanks again – AnilGoyal Aug 13 '22 at 06:02
@AnilGoyal If the OP edits the question to indicate effort at using the existing questions with answers and showing the code that failed, I will vote to reopen as well. – IRTFM Aug 13 '22 at 06:07
@AnilGoyal Bug corrected, `reference <- x[reference_row, ]` had `data` instead of `x`. – Rui Barradas Aug 13 '22 at 06:33
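The kind of slip discussed in these comments, where a function body reads a global object instead of its own argument, can be reproduced in a few lines. This is an illustrative sketch only; `bad`, `good`, and `other` are hypothetical names.

```r
data  <- data.frame(v = 1:3)    # global object, also used as the default
other <- data.frame(v = 1:10)   # a different data frame passed explicitly

# bad() ignores its argument and silently reads the global `data`
bad  <- function(x = data) nrow(data)
# good() uses the argument it was given
good <- function(x = data) nrow(x)

bad(other)    # 3, the global was used
good(other)   # 10, the argument was used
```

Both functions agree when called with no arguments, which is why this kind of bug only shows up once a different data frame is passed in.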
