How on planet Earth can I change this PDF to EPUB? I tried everything I could think of in Calibre, but the problem is that the PDF has two columns of text per page, plus footnotes on each page. When it converts to EPUB, it just prints each line of each text column as a line of text, which makes it totally lose its meaning. Footnotes are also just added as regular text, as part of a supremely incoherent story with aggressive punctuation.

Has anybody been able to solve this before?

  • starkillerfish [she/her]@hexbear.net
    English
    arrow-up
    15
    ·
    5 months ago

    pdf is a printing format. epub is a type of html essentially. you essentially want to turn a book into a webpage. it is practically impossible unless you do it manually or the pdf is basically blank and single column without footnotes.

    TLDR: pdf and epub are very different formats. you cannot easily convert pdf to epub (but epub to pdf is much easier).

    • fort_burp@feddit.nlOP
      English
      arrow-up
      7
      ·
      5 months ago

      it is practically impossible unless you do it manually or the pdf is basically blank and single column without footnotes.

      Yea, seems like it :/ thanks

  • Edie [it/its, she/her]@hexbear.net
    English
    arrow-up
    13
    ·
    edit-2
    5 months ago

    PDFs are just styled, positioned text. The footnotes are usually plain text with no link to their reference, no different from the rest of the text. In EPUBs, by contrast, they are usually connected through anchors (bonus if they have epub:type), and the footnote text usually sits apart from the rest of the chapter text. AFAIK there is no good way of automatically converting from PDF to EPUB. So to answer the question in the title: manually.
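
To make the anchor idea concrete, here is a minimal sketch of what a linked footnote could look like in EPUB 3 markup. The helper name `footnote_markup` and the id scheme are made up for illustration; `epub:type="noteref"` and `epub:type="footnote"` are the EPUB 3 semantics referred to above. A real file would also need the `epub` namespace declared on the `<html>` element.

```python
from html import escape


def footnote_markup(note_id: str, number: int, note_text: str) -> tuple[str, str]:
    """Return (in-text reference, footnote block) as XHTML fragments.

    Hypothetical helper: the id scheme ("fn1", "fn1-ref") is an assumption.
    """
    # The reference goes inline in the chapter text and points at the note.
    ref = (
        f'<a epub:type="noteref" href="#{note_id}" id="{note_id}-ref">'
        f"<sup>{number}</sup></a>"
    )
    # The note itself lives apart from the running text, e.g. at chapter end.
    note = (
        f'<aside epub:type="footnote" id="{note_id}">'
        f"<p>{number}. {escape(note_text)}</p></aside>"
    )
    return ref, note


ref, note = footnote_markup("fn1", 1, "See chapter 2 & the appendix.")
print(ref)
print(note)
```

None of this link structure exists in a PDF, which is why a converter can only guess at it.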


    This user is suspected of being a cat. Please report any suspicious behavior.

    • fort_burp@feddit.nlOP
      English
      arrow-up
      5
      ·
      5 months ago

      Bah, thanks. It’s so annoying bc highlighting the page and doing copy paste also mixes the text of the two columns.

  • dead [he/him]@hexbear.net
    English
    arrow-up
    6
    ·
    5 months ago

    Epub is a zip file with html files inside of it. You can rename epub to zip and extract it with any archive tool.
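
As a quick illustration of the "zip file with html inside" point, this sketch lists the members of an EPUB with Python's standard `zipfile` module. The demo archive is a tiny in-memory stand-in (not a valid EPUB) so the snippet runs without a real book; with a real file you would pass its path instead.

```python
import io
import zipfile


def list_epub_members(epub_file) -> list[str]:
    """Return the names of all files stored inside an EPUB (a zip archive)."""
    with zipfile.ZipFile(epub_file) as archive:
        return archive.namelist()


def make_demo_epub() -> io.BytesIO:
    """Build a tiny stand-in archive in memory (illustrative, not a valid EPUB)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr("META-INF/container.xml", "<container/>")
        archive.writestr("OEBPS/chapter1.xhtml", "<html><body><p>Hello</p></body></html>")
    buffer.seek(0)
    return buffer


members = list_epub_members(make_demo_epub())
print(members)
```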

    PDF is a document format.

    Book PDFs can contain text or sometimes pictures of text if it is a scanned book. Images of text can be converted into text using OCR software.

    If you have like some basic programming knowledge, you could write a script to convert your specific book to the epub style you want.
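
For a two-column book, the core of such a script might look like the sketch below. It assumes you have already pulled each word's position out of the PDF with some library (that step is not shown), and that the page splits cleanly down the middle. The `Word` type, `reading_order` function, and tolerance value are all hypothetical. Footnotes and full-width headings would need extra handling.

```python
from typing import NamedTuple


class Word(NamedTuple):
    x: float   # left edge of the word on the page
    y: float   # distance from the top of the page
    text: str


def reading_order(words: list[Word], page_width: float,
                  line_tolerance: float = 2.0) -> list[str]:
    """Return text lines with the whole left column before the right column."""
    middle = page_width / 2
    left = [w for w in words if w.x < middle]
    right = [w for w in words if w.x >= middle]

    lines: list[str] = []
    for column in (left, right):
        current: list[Word] = []
        # Top-to-bottom first; words on the same line share (roughly) one y.
        for word in sorted(column, key=lambda w: (w.y, w.x)):
            if current and abs(word.y - current[0].y) > line_tolerance:
                lines.append(" ".join(w.text for w in sorted(current, key=lambda w: w.x)))
                current = []
            current.append(word)
        if current:
            lines.append(" ".join(w.text for w in sorted(current, key=lambda w: w.x)))
    return lines


# Made-up coordinates for a 600-unit-wide page with two lines per column.
demo = [
    Word(50, 100, "Left"), Word(90, 100, "one"),
    Word(320, 100, "Right"), Word(370, 100, "one"),
    Word(50, 115, "left"), Word(90, 115, "two"),
    Word(320, 115, "right"), Word(370, 115, "two"),
]
ordered = reading_order(demo, page_width=600)
print(ordered)
```

A naive converter reads straight across the page, which is what produces the interleaved lines described in the original post.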

    You could see if the book is already available in epub form on LibGen.

    https://en.wikipedia.org/wiki/Library_Genesis

  • oscardejarjayes [comrade/them]@hexbear.net
    English
    arrow-up
    6
    ·
    edit-2
    5 months ago

    ePUB is basically zipped HTML, so while it’s easy to convert from, it’s hard to convert to. You might just want to try to find your book in an alternative format from somewhere like Anna’s Archive. I think azw3 and mobi files can be converted to ePUB more easily.

    Really the only good way is to manually recreate the book, there’s no good automatic pdf to epub converter. You might be able to hire a guy on fiverr or such to do it for you, that’s the closest I can think of to automatic.

  • bobs_guns@lemmygrad.ml
    English
    arrow-up
    4
    ·
    5 months ago

    Use koreader in two column mode if you can. It’s kinda funky but will let you read the text at a more appropriate size if that’s your issue

  • Edamamebean [she/her]@hexbear.net
    English
    arrow-up
    3
    ·
    edit-2
    5 months ago

    Instead of doing any converting you could probably find the epub on Anna’s Archive. I’ve never had any problems finding books on there, even pretty obscure stuff. They also seem to have everything in both epub and pdf. Good luck friend!

    https://annas-archive.org/

    • fort_burp@feddit.nlOP
      English
      arrow-up
      1
      ·
      4 months ago

      Good advice, thanks! Actually I got the PDF from Anna, there was no epub available :/

    • fort_burp@feddit.nlOP
      English
      arrow-up
      2
      ·
      5 months ago

      Thanks I will give it a try when I get back. Did the text come out ok for you, like were all the words in the same order?

        • Edie [it/its, she/her]@hexbear.net
          English
          arrow-up
          6
          ·
          edit-2
          5 months ago

          I tried https://redstarpublishers.org/adoratsky.pdf in the one you shared. It’s good compared to all the PDF conversions I’ve seen. And if I had to read it without making any changes to it, it would certainly do. But it could use some manual intervention. There are random line breaks, blockquotes are not blockquotes, and footnotes are just… in the text. That’s at least what I see at a glance.

          Edit: Wait, hang on, cloudconvert is just using Calibre! It’s the exact same output. Every CSS class is calibre[number]. And stuff like the OPF contains metadata with calibre: <dc:contributor opf:role="bkp">calibre (8.4.0) [https://calibre-ebook.com/]</dc:contributor>
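
If you want to run the same check on another converter's output, a rough sketch is to parse the OPF file from inside the EPUB and look for `dc:contributor` entries with the `bkp` (book producer) role. The function name `book_producers` and the trimmed sample OPF are illustrative; only the `dc:contributor` line comes from the output quoted above.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OPF_NS = "http://www.idpf.org/2007/opf"


def book_producers(opf_xml: str) -> list[str]:
    """Return the text of every dc:contributor marked as book producer (bkp)."""
    root = ET.fromstring(opf_xml)
    producers = []
    for element in root.iter(f"{{{DC_NS}}}contributor"):
        if element.get(f"{{{OPF_NS}}}role") == "bkp":
            producers.append(element.text or "")
    return producers


# Trimmed stand-in for a real content.opf; a real one has far more metadata.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:title>Demo</dc:title>
    <dc:contributor opf:role="bkp">calibre (8.4.0) [https://calibre-ebook.com/]</dc:contributor>
  </metadata>
</package>"""

producers = book_producers(SAMPLE_OPF)
print(producers)
```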


  • stupid_asshole69 [none/use name]@hexbear.net
    English
    arrow-up
    0
    ·
    5 months ago

    PDFs can be set up in a lot of different ways.

    One way is where text is encoded into the document like if text were aligned and sized just right for one of those typewriters with the white out ribbon. Text encoded into the pdf in this way can be selected, edited and copied just like any other kind of document.

    Another way is where text is embedded into the document, like a picture of a newspaper article pasted onto a piece of paper. Text in the pdf like this can’t be manipulated or selected and is the kind you’re having problems with.

    The way to get around that kind of text is optical character recognition. OCR software analyzes images of text and figures out which characters they correspond to. Just chase down some free OCR package and input your PDF.

    • fort_burp@feddit.nlOP
      English
      arrow-up
      1
      ·
      4 months ago

      Cool, thank you very much. I got k2pdf (courtesy of another dope-ass bear) to get the two columns + footnotes in the original pdf into a pdf that is just one column with footnotes clearly distinguishable. Now I need just what you’re saying because the result of the k2pdf conversion is an image that I can’t select text from (but the words are all in the right order, which is good).

      Tesseract seems like a popular choice, I’ll give that a try.

      • Edie [it/its, she/her]@hexbear.net
        English
        arrow-up
        2
        ·
        edit-2
        4 months ago

        Tesseract doesn’t support PDF input, so you’ll need some other program like ocrmypdf (which I have used; it uses Tesseract), or you can extract each page to its own image (which I have also done, but I forget how right now).
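
A minimal sketch of scripting the ocrmypdf route, assuming the `ocrmypdf` command-line tool is installed and on your PATH. The file names are placeholders, and the wrapper functions are hypothetical; the command itself is just the basic `ocrmypdf [options] input output` form with the `--language` option.

```python
import shutil
import subprocess


def ocrmypdf_command(source_pdf: str, output_pdf: str, language: str = "eng") -> list[str]:
    """Build the ocrmypdf call that adds a text layer to an image-only PDF."""
    return ["ocrmypdf", "--language", language, source_pdf, output_pdf]


def add_text_layer(source_pdf: str, output_pdf: str, language: str = "eng") -> None:
    """Run ocrmypdf; raises if the tool is missing or exits with an error."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    subprocess.run(ocrmypdf_command(source_pdf, output_pdf, language), check=True)


# Placeholder file names; this only prints the command, it does not run it.
command = ocrmypdf_command("book_one_column.pdf", "book_ocr.pdf")
print(" ".join(command))
```

Calling `add_text_layer("book_one_column.pdf", "book_ocr.pdf")` would then produce a PDF whose text can be selected and copied.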


        • fort_burp@feddit.nlOP
          English
          arrow-up
          2
          ·
          4 months ago

          Thanks again! You’re the best :)

          This looks like exactly what I need. After getting the formatting right with k2pdf I can then use ocrmypdf to get it back to text form and then just ctrl + a copy to writer and export as epub, since the pdf size is like 15x the epub size.