How to remove lines after reading pdf files with readtext function in R

Asked Oct 05 '22 at 19:48

Active Oct 09 '22 at 10:37

I am reading multiple PDF files with `readtext()`, and I want to remove the first 10 or 20 lines of each PDF before creating the corpus and the tokens with quanteda.

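    library(readtext)
    library(quanteda)

    # Read every PDF in the working directory; docvars are taken from the
    # file names, split on "_" (readtext's separator argument is `dvsep`)
    testfiles <- readtext("*.pdf",
                          docvarsfrom = "filenames",
                          docvarnames = c("doc_type", "year"),
                          dvsep = "_")

    corp <- corpus(testfiles)

    # Tokenize with the removal options, then lowercase, drop stopwords, stem
    # (|> is the native pipe, available from R 4.1)
    toks <- tokens(corp, remove_punct = TRUE, remove_separators = TRUE,
                   remove_url = TRUE, remove_symbols = TRUE,
                   remove_numbers = TRUE, verbose = TRUE) |>
      tokens_tolower() |>
      tokens_remove(stopwords("en")) |>
      tokens_wordstem(language = quanteda_options("language_stemmer"))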
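
One possible approach, sketched here with base R only: since `readtext()` returns a data frame with a `text` column, the leading lines could be stripped from that column before `corpus()` is called. The helper name `drop_first_lines` and the toy input are hypothetical, and the sketch assumes that the extracted PDF text separates lines with `"\n"`, which may not match the visual layout of every PDF.

```r
# Hypothetical helper (not part of readtext or quanteda): drop the first
# `n` newline-separated lines from each element of a character vector.
drop_first_lines <- function(text, n = 10) {
  vapply(text, function(doc) {
    lines <- strsplit(doc, "\n", fixed = TRUE)[[1]]
    # Keep only lines whose position is greater than n; a document with
    # n or fewer lines becomes an empty string
    paste(lines[seq_along(lines) > n], collapse = "\n")
  }, character(1), USE.NAMES = FALSE)
}

# Toy stand-in for the `text` column that readtext() returns
sample_text <- c("header 1\nheader 2\nbody a\nbody b", "only line")
cleaned <- drop_first_lines(sample_text, n = 2)
```

With the objects from the question, this might be applied as `testfiles$text <- drop_first_lines(testfiles$text, n = 10)` just before `corp <- corpus(testfiles)`. Whether 10 or 20 is the right cut-off would need checking against the actual extracted text of each file.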

edited Oct 09 '22 at 10:37

Isaiah

asked Oct 05 '22 at 19:48

user20169910

[read_pdf: Read a Portable Document Format into R](https://rdrr.io/cran/textreadr/man/read_pdf.html) looks like it supports line skipping. – Isaiah Oct 06 '22 at 04:00

0 Answers