Wednesday, September 6, 2023

Use DocSearch Plus to search the content of PDF files

"PDF, or Portable Document Format, is a widely-used file format for sharing and viewing documents. It preserves the layout and formatting of a document across different devices and platforms, making it an ideal choice for business reports, ebooks, and more. PDFs are easy to create, share, and print."

Because PDF files are so widely used, the demand for searching PDF document content has increased. DocSearch Plus excels at searching within PDF files. However, there are still some important considerations to keep in mind:


  1. Occasionally, a PDF file cannot be indexed because it is damaged, even though a PDF reader can still open and display it.

    This is because most PDF readers have a built-in self-repair function. DocSearch does not, so you must repair the file with PDF repair software before indexing it.

  2. PDF is the most complicated file format I have ever seen, so extracting its text is quite laborious. As a result, searching the content of PDF files consumes a lot of memory, and the Android system sometimes displays an "out of memory" error message, especially when the PDF file is large.

    In this case, the best solution is to use the Windows version of DocSearch, or to convert the PDF files to plain text files with a PDF conversion tool before searching their content.


Monday, September 4, 2023

DocSearch Plus - Search File Content and Filename

DocSearch Plus - Search File Content for Windows / Android



Search single keyword
Standard search
1. Search for take: the result documents contain words that begin with take (take[xxxx…]), e.g. take, takea, takeb, takec, takeanything.
2. Search for info: the result documents contain info, inform, information, informed, etc.
3. Search for "info" (with double quotes): the result documents contain only the exact word info.
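
The matching rule above can be sketched in a few lines of Python. This is only an illustration, not the DocSearch Plus implementation: the `standard_search` function and the simple word tokenizer are assumptions made for the example.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (a simplified tokenizer)."""
    return re.findall(r"\w+", text.lower())


def standard_search(query, text):
    """Return the words in `text` matched by a single-keyword query.

    A bare keyword matches every word that starts with it; a keyword in
    double quotes matches only the exact word. (Hypothetical helper.)
    """
    words = tokenize(text)
    if len(query) >= 2 and query.startswith('"') and query.endswith('"'):
        exact = query[1:-1].lower()
        return [w for w in words if w == exact]
    prefix = query.lower()
    return [w for w in words if w.startswith(prefix)]


sample = "info inform information informed misinformation"
print(standard_search("info", sample))    # prefix match
print(standard_search('"info"', sample))  # exact match only
```

Note that "misinformation" is not matched by the bare keyword, because the keyword must appear at the beginning of the word.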


Stemming search
Search for take: the result documents contain all the related word forms, e.g. take, takes, took, taken, taking. (Note: this only works for English.)





Logical search

Logical search refers to the process of querying a document index based on logical conditions such as AND, OR, and NOT operators to retrieve relevant documents. It allows users to construct complex queries to find documents that match specific criteria, enhancing search precision and flexibility.

For example:
eye AND ear 
documents contain both eye and ear
eye OR ear
documents contain either eye, or ear, or both
eye NOT ear
documents contain eye, but not ear
(eye OR ear) AND nose
documents contain nose, and either eye or ear, or both
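
As a rough sketch of how such queries can be answered from an index, the Python below maps each word to the set of documents containing it and combines those sets. The sample documents and the `build_index` helper are made up for illustration; this is not how DocSearch Plus is necessarily implemented.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


# Hypothetical sample documents
docs = {
    1: "the eye and the ear",
    2: "an eye for an eye",
    3: "ear, nose and throat",
    4: "eye and nose",
}
index = build_index(docs)
eye, ear, nose = index["eye"], index["ear"], index["nose"]

both = eye & ear              # eye AND ear
either = eye | ear            # eye OR ear
only_eye = eye - ear          # eye NOT ear
grouped = (eye | ear) & nose  # (eye OR ear) AND nose

print(both, either, only_eye, grouped)
```

Each logical operator becomes one set operation, which is why an index makes these queries cheap: no document text has to be read again at query time.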




Phrase search

Phrase search is a search technique that retrieves documents containing an exact sequence of words or terms. It ensures that the terms appear together and in the specified order within the document. This precision helps users find highly relevant results by capturing specific phrases, enhancing search accuracy.




Proximity search

Proximity search is a technique to retrieve documents where specified terms appear close to each other within a defined distance. It helps find relevant content based on the proximity of keywords, enabling more precise search results by ensuring that terms are nearby, improving contextual accuracy.

For example, to search for documents containing "angry" and "brother" within 20 words of each other, type in: "angry brother"~20 




Regular Expression search
(Currently available only in the Android version; a Windows version will be developed based on user demand.)

Regular Expression search is a powerful text search technique. It allows users to find text patterns using complex search patterns defined by regular expressions, enabling flexible and precise matching within documents.

Please note that regular expression search on indexed data has some restrictions for performance reasons. These are detailed in the software.


"Grep" search
"grep" is a text search tool closely associated with Linux. It allows you to search for specific text patterns or regular expressions within files. 

Advantages:
- Flexibility: "grep" is highly flexible and can handle complex text patterns using regular expressions.

Disadvantages:
- Performance: It can be slow when searching through large files.

- Not index-based: "grep" searches do not use an index, so the entire file may need to be scanned, which is slow for large amounts of data. (In this case, index-based search tools are significantly faster than "grep".)



Why does "DocSearch+", an index-based search tool, still use "grep" as one of its search methods?

Reason 1: Flexibility in Substring Search

"Grep" provides an indispensable feature that complements our index-based search app. It lets users search for substrings within text, a task that is often difficult for index-based systems. For example, when searching for 'bcd' within 'abcde', 'grep' is the only method that can accomplish this effectively.

Reason 2: Support for Regular Expressions

Another reason for integrating 'grep' is its robust support for regular expressions.

Regular expressions enable users to efficiently locate intricate and specific text patterns, whereas the regular expression search of DocSearch+ on indexed data has the limitations mentioned above.



To perform a “grep search”, press the “grep” icon, which adds “/” to the keyword (for example, /abcd/).
The “index search method” can only match from the beginning of a word.
The “grep search” can find keywords no matter where they appear in the document: at the beginning, the end, or the middle of a word.
However, "grep" does not create an index, so it has to go through the entire document on every search. It is therefore inefficient for searching large amounts of data.
The table below shows a comparison between the two search methods.






Full-text search, the fastest and most accurate search for the content of Windows files

DocSearch+ is a full-text search tool for searching filenames and file contents on Android devices and Windows desktop systems. It is simple and easy to use, and it provides relevant information in the search results.

It is particularly useful for searching for keywords in file contents and file names.

When you first use this tool, you will be prompted to create indexes for your device. These indexes enable DocSearch+ to quickly search file contents and filenames based on keywords.

To conduct a full text search, enter one or more keywords in the text field at the top left and click the search icon on the right side of the field. The search results will be displayed in the result pane.




Features:

- Supports full-text searching of both filenames and file contents on Android and Windows.

- Allows immediate viewing of file contents within the app, eliminating the need for external tools.

- After completing a search, you can view, open, copy, move, delete, sort, filter, and share all the resulting files. You can also access the files using a file explorer. (Not all features are available in the Windows version.)

- Easily and quickly scroll to the matched words in full-text mode.

- In brief-text mode, you can simultaneously view all brief texts containing the keywords.

- Supports various file formats, including:

    Plain text - File extensions are txt, text, java, php, etc. (the file extensions are defined in the app settings)

    Microsoft Office - File extensions are docx, xlsx, pptx (the Windows version also supports "doc", the old Office Word format)

    Adobe Portable Document Format (File extension is pdf)

    Electronic Publication, ebook (File extension is epub)

    LibreOffice Writer, OpenOffice Writer (File extension is odt)

    HTML (File extensions are html, htm)

- Supports logical search, phrase search, proximity search, regexp search (Android version only), and "grep" search.

- Manages multi-page/multi-item searches.

- You can search for special characters, for example, "#abc", "2366–1245", "tom@mail.com".

- Supports almost all languages, including but not limited to English, Chinese, Japanese, Korean, Russian, German, French, Vietnamese, Tamil, Czech, Tibetan, etc.



Additionally, there are premium features available:

- Sort and filter search results. (Free/Premium features in Windows version; Standard/Premium features in Android version)

- Unlimited access to view all file content within the search results. (Premium features in Windows version; Premium features in Android version)

- Search for keywords within the results. (Free/Premium features in Windows version; Premium features in Android version)

The free edition of the Windows desktop version includes all the features of the premium edition, except that viewing file content is limited.



Query example

Boolean Search
eye AND ear
documents contain both eye and ear
eye OR ear
documents contain either eye, or ear, or both
eye NOT ear
documents contain eye, but not ear
(eye OR ear) AND nose
documents contain nose, and either eye or ear, or both
eye ear
by default equivalent to the query [eye OR ear]; you can switch the default to AND under [menu->Preferences->search->AND/OR operator]
Note:
AND = & ; OR = | ; NOT = ~
"eye AND ear" = "eye & ear"
"eye OR ear" = "eye | ear"
"eye NOT ear" = "eye ~ ear"
Phrase Search
"make up"
the words make and up, in that particular order

e.g.
  • make up my mind .....(match)
  • make it up to you........(no match)
  • ....upmake it ..... .........(no match)
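
The three cases above can be reproduced with a short Python sketch. It assumes a simple word tokenizer and a hypothetical `phrase_match` helper; the real app may tokenize text differently.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (a simplified tokenizer)."""
    return re.findall(r"\w+", text.lower())


def phrase_match(phrase, text):
    """Return True if the words of `phrase` occur consecutively, in order."""
    wanted = tokenize(phrase)
    words = tokenize(text)
    if not wanted:
        return False
    n = len(wanted)
    # Slide a window of n words over the text and compare it to the phrase.
    return any(words[i:i + n] == wanted for i in range(len(words) - n + 1))


print(phrase_match("make up", "make up my mind"))      # adjacent, in order
print(phrase_match("make up", "make it up to you"))    # a word in between
print(phrase_match("make up", "....upmake it ....."))  # wrong order, one token
```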
Proximity Search
"make up"~N
You can find words that are within a specific distance of each other. To do that, put a tilde ('~') at the end of a phrase, followed by a distance value. For example, to search for documents containing make and up within 5 words of each other, type in: "make up"~5

Another example: search for "make up"~3

  • make up my mind. ...(match)
  • Can you make it up the wall? ....(match)
  • if you want to make a phone call, please hang up and try again ...(no match)
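
A minimal Python sketch of the idea follows. It assumes that distance means the difference between word positions and that word order is ignored; both are assumptions for illustration, and the exact rule used by the app may differ. The `proximity_match` name is hypothetical.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (a simplified tokenizer)."""
    return re.findall(r"\w+", text.lower())


def proximity_match(first, second, max_distance, text):
    """Return True if `first` and `second` occur within `max_distance` words.

    Distance is the absolute difference between word positions
    (an assumption made for this sketch).
    """
    words = tokenize(text)
    first_pos = [i for i, w in enumerate(words) if w == first]
    second_pos = [i for i, w in enumerate(words) if w == second]
    return any(abs(a - b) <= max_distance for a in first_pos for b in second_pos)


long_sentence = "if you want to make a phone call, please hang up and try again"
print(proximity_match("make", "up", 3, "make up my mind."))
print(proximity_match("make", "up", 3, "Can you make it up the wall?"))
print(proximity_match("make", "up", 3, long_sentence))
```

In the last sentence, make and up are six word positions apart, so a distance of 3 does not match.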
Grep Search (1)
/abcd/

Use the grep search method to search for "abcd".

The index search can only match from the beginning of a word in the indexed data.
The “grep search”, however, can find keywords no matter where they appear in the document: at the beginning, the end, or the middle of a word.

For example:
When using "index search":
search for “one” in "onetwothree" => success
search for “two” in "onetwothree" => fail
search for “three” in "onetwothree" => fail

When using "grep search":
search for “one” in "onetwothree" => success
search for “two” in "onetwothree" => success
search for “three” in "onetwothree" => success

However, "grep" does not use an index, so it has to go through the entire document on every search. It is therefore inefficient for searching large amounts of data.
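
The trade-off can be illustrated with a small Python sketch. It assumes the index is a sorted list of words searched by binary search, which is one common way to support prefix lookups; the actual index structure of DocSearch+ is not described here. The sample documents and both function names are hypothetical.

```python
import bisect
import re

# Hypothetical sample documents
documents = ["onetwothree", "two cats", "threefold path"]

# Index search: build a sorted vocabulary once, then binary-search it.
vocabulary = sorted(
    {word for doc in documents for word in re.findall(r"\w+", doc.lower())}
)


def index_search(keyword):
    """Return indexed words that start with `keyword` (prefix lookup)."""
    start = bisect.bisect_left(vocabulary, keyword)
    matches = []
    # Words sharing a prefix are adjacent in sorted order, so stop early.
    for word in vocabulary[start:]:
        if not word.startswith(keyword):
            break
        matches.append(word)
    return matches


def grep_search(keyword):
    """Return documents containing `keyword` anywhere (scans every document)."""
    return [doc for doc in documents if keyword in doc.lower()]


print(index_search("two"))  # only words that begin with "two"
print(grep_search("two"))   # also finds "two" inside "onetwothree"
```

The prefix lookup touches only a few entries of the sorted vocabulary, while the scan reads every document on every query. That is why the scan finds more but gets slower as the data grows.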

Grep Search (2)
/123.45/
/123\.45/
"Grep Search" supports regular expressions. Some characters have special meanings in regular expressions, such as the dot (.), asterisk (*), and plus (+).

For example, in regular expressions the dot is a special character that matches any single character.
Therefore, to search for "123.45" literally, you have to escape the dot (.) with a backslash (\) and type "123\.45" in the search field.

The results are as follows:
Type "123.45" and you may get the results "123.45", "123a45", "123b45", "123145", "123x45", ...
Type "123\.45" and you get exactly the result you want: "123.45"
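
The same behavior can be checked with Python's standard `re` module. This only demonstrates the general regular expression rule; the regular expression engine used by the app may differ in other details.

```python
import re

candidates = ["123.45", "123a45", "123b45", "123145", "123x45"]

# Unescaped dot: matches any single character between 123 and 45.
unescaped = [s for s in candidates if re.search("123.45", s)]

# Escaped dot: matches only a literal "." between 123 and 45.
escaped = [s for s in candidates if re.search(r"123\.45", s)]

# re.escape builds the escaped pattern from a literal string.
literal_pattern = re.escape("123.45")

print(unescaped)
print(escaped)
print(literal_pattern)
```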












Sunday, August 14, 2016

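MultCloud can back up files to multiple cloud servers

A while ago, to back up some important files, I went looking for free cloud servers and stumbled upon this nice backup site. Strictly speaking, it is not a place that stores your documents for you, but it can help you copy a file to multiple cloud services.

It supports quite a few cloud services: google cloud, dropbox, mediafire, Baidu, microsoft one drive, mega, etc.

The basic usage is to add several cloud services to MultCloud and then upload a file to one of them. For example, first upload the file to google, then use MultCloud's transfer service to copy it to Baidu, dropbox, mega, etc. This way the file is stored in several places at once, so if it is lost from one of them, deleted by mistake, or a cloud provider shuts down, multiple backups are still available.

Features: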
MultCloud – Transfer Files across Cloud Drives
Directly transfer files from one cloud drive to the other.
File transfer in the background, you can close Browser.
Completely FREE and unlimited data traffic for you.

URL:


Wednesday, July 27, 2016

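Workplace etiquette when riding in the same car as your superior


If your superior is driving, sit in the front passenger seat to show respect.

If the driver is not a professional chauffeur but a colleague or friend, you should likewise sit in the front passenger seat to show respect.

If your superior is driving and there are several passengers, it depends on the situation. If someone of the same rank as your superior is in the car, such as a spouse, the subordinate should give up the front passenger seat to that person and sit in the back.

If the car holds only the superior and subordinates, the only female subordinate sits in the front passenger seat.

When a professional chauffeur is driving and the only passengers are you and your superior, the subordinate still sits in the front passenger seat and the boss sits in the right rear seat. If someone of the same rank as the boss is also present, the subordinate still sits in the front passenger seat, and the boss and the person of the same rank sit in the back.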


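Using TagSoup to extract text from HTML

When extracting text from epub files on android, I originally used tika's EpubParser.
EpubParser in turn relies on the following classes in the android (java) sdk:
javax.xml.parsers
org.apache.harmony

But I found some drawbacks:
1. When the xhtml inside the epub contains errors, it stops working, so part of the html content cannot be extracted.
2. org.apache.harmony uses some native code (non-java code), which makes debugging and modifying the source code quite difficult.

Later I found another library called TagSoup.
It is free software under the Apache License, Version 2.0.
(TagSoup is free and Open Source software. As of version 1.2, it is licensed under the Apache License, Version 2.0)

It extracts as much text as possible, even when the html contains errors.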
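
The difference between a strict parser and a tolerant one can be sketched in Python. This does not use TagSoup, which is a Java library. As a stand-in, it contrasts the strict `xml.etree.ElementTree` parser with the tolerant `html.parser` module from the Python standard library. The sample markup and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text chunks and ignore whether tags are balanced."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def extract_text_lenient(markup):
    """Extract as much text as possible, even from malformed markup."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join(collector.chunks)


def extract_text_strict(markup):
    """Extract text with a strict XML parser; raises ParseError on bad markup."""
    root = ET.fromstring(markup)
    return " ".join(t.strip() for t in root.itertext() if t.strip())


# Malformed sample: unclosed <p> and <b> tags
BROKEN = "<html><body><p>first paragraph<p>second <b>bold</p></body></html>"

print(extract_text_lenient(BROKEN))
try:
    extract_text_strict(BROKEN)
except ET.ParseError as error:
    print("strict parser stopped:", error)
```

The strict parser gives up at the first mismatched tag, which mirrors drawback 1 above, while the tolerant parser still returns all the text.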


URL
http://home.ccil.org/~cowan/tagsoup/


For an example, see the following link:

Using TagSoup to extract text from HTML 


